Yet Another Tensor Toolbox for Discontinuous Galerkin Methods and Other Applications

Published: 16 October 2020

Abstract

The numerical solution of partial differential equations is at the heart of many grand challenges in supercomputing. Solvers based on high-order discontinuous Galerkin (DG) discretisation have been shown to scale on large supercomputers with excellent performance and efficiency if the implementation exploits all levels of parallelism and is tailored to the specific architecture. However, new supercomputers emerge every year, and the list of hardware-specific considerations grows along with the list of desired features in a DG code. Thus, we believe that a sustainable DG code needs an abstraction layer to implement the numerical scheme in a suitable language. We explore the possibility of abstracting the numerical scheme as small tensor operations, describing them in a domain-specific language (DSL) resembling the Einstein notation, and mapping them to small General Matrix-Matrix Multiplication (GEMM) routines. The compiler for our DSL implements classic optimisations that are used for large tensor contractions, and we present novel optimisation techniques such as equivalent sparsity patterns and optimal index permutations for temporary tensors. Our application examples, which include the earthquake simulation software SeisSol, show that the generated kernels achieve over 50% of peak performance on a recent 48-core Skylake system, while the DSL considerably simplifies the implementation.
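
To make the idea of mapping a tensor operation to a GEMM concrete, the following minimal sketch evaluates the contraction C[i][j][k] = sum over l of A[i][l] * B[l][j][k] in two ways: with explicit loops, and by fusing the indices (j, k) into a single column index so that one matrix product suffices. It is an illustration only and not the DSL or the code generator described in the abstract. The helper names `gemm`, `contract_reference`, and `contract_via_gemm`, the example data, and the row-major flattening are all assumptions made for this example.

```python
from itertools import product


def gemm(A, B):
    """Plain dense matrix product on lists of rows (stand-in for a GEMM call)."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[l] * B[l][j] for l in range(inner)) for j in range(cols)]
            for row in A]


def contract_reference(A, B):
    """C[i][j][k] = sum_l A[i][l] * B[l][j][k], written as explicit loops."""
    I, L = len(A), len(B)
    J, K = len(B[0]), len(B[0][0])
    C = [[[0] * K for _ in range(J)] for _ in range(I)]
    for i, j, k, l in product(range(I), range(J), range(K), range(L)):
        C[i][j][k] += A[i][l] * B[l][j][k]
    return C


def contract_via_gemm(A, B):
    """Same contraction, with (j, k) fused into one index and a single gemm."""
    J, K = len(B[0]), len(B[0][0])
    # View B as an L x (J*K) matrix; column index is j * K + k.
    B_flat = [[B[l][j][k] for j in range(J) for k in range(K)]
              for l in range(len(B))]
    C_flat = gemm(A, B_flat)
    # Split each row of length J*K back into J rows of length K.
    return [[row[j * K:(j + 1) * K] for j in range(J)] for row in C_flat]


# Hypothetical example data: A has shape (3, 2), B has shape (2, 2, 2).
A = [[1, 2], [3, 4], [5, 6]]
B = [[[1, 0], [2, 1]], [[0, 3], [1, 1]]]

C_ref = contract_reference(A, B)
C_gemm = contract_via_gemm(A, B)
print(C_ref == C_gemm)
```

In this sketch the fused index costs a copy of B; when the fused indices are already adjacent in memory, the same data can be passed to a matrix product without copying.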

Published in

ACM Transactions on Mathematical Software, Volume 46, Issue 4
December 2020
272 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/3430683

Copyright © 2020 Owner/Author

This work is licensed under a Creative Commons Attribution-NoDerivatives International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 16 October 2020
• Accepted: 1 June 2020
• Revised: 1 February 2020
• Received: 1 March 2019

Qualifiers

• research-article
• Research
• Refereed
