Yet Another Tensor Toolbox for Discontinuous Galerkin Methods and Other Applications

Authors:
Carsten Uphoff, Technical University of Munich, Garching, Germany
Michael Bader, Technical University of Munich, Garching, Germany

ACM Transactions on Mathematical Software, Volume 46, Issue 4, Article No. 34, pp. 1–40. https://doi.org/10.1145/3406835

Published: 16 October 2020

Abstract

The numerical solution of partial differential equations is at the heart of many grand challenges in supercomputing. Solvers based on high-order discontinuous Galerkin (DG) discretisation have been shown to scale on large supercomputers with excellent performance and efficiency if the implementation exploits all levels of parallelism and is tailored to the specific architecture. However, new supercomputers emerge every year, and the list of hardware-specific considerations grows along with the list of desired features in a DG code. Thus, we believe that a sustainable DG code needs an abstraction layer to implement the numerical scheme in a suitable language. We explore the possibility of abstracting the numerical scheme as small tensor operations, describing them in a domain-specific language (DSL) resembling the Einstein notation, and mapping them to small General Matrix-Matrix Multiplication (GEMM) routines. The compiler for our DSL implements classic optimisations that are used for large tensor contractions, and we present novel optimisation techniques such as equivalent sparsity patterns and optimal index permutations for temporary tensors. Our application examples, which include the earthquake simulation software SeisSol, show that the generated kernels achieve over 50% of peak performance on a recent 48-core Skylake system, while the DSL considerably simplifies the implementation.
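
To illustrate the idea of mapping a small tensor operation written in Einstein notation to matrix-matrix multiplications, the following minimal sketch uses NumPy. It is not the DSL or the code generator described in the article: the contraction C[i,j,k] = sum_l B[i,l,k] A[l,j], the tensor shapes, and the function names `contract_einsum` and `contract_loop_over_gemm` are assumptions chosen only for illustration. The sketch evaluates the contraction once with `numpy.einsum` and once as a loop over small matrix products, one per value of the index k.

```python
import numpy as np


def contract_einsum(B, A):
    """Reference evaluation of C[i,j,k] = sum_l B[i,l,k] * A[l,j]."""
    return np.einsum("ilk,lj->ijk", B, A)


def contract_loop_over_gemm(B, A):
    """Same contraction, expressed as one small matrix product per index k.

    Hypothetical helper for illustration only; a real code generator would
    emit calls to tuned small-GEMM kernels instead of NumPy's matmul.
    """
    n_i, n_l, n_k = B.shape
    n_l_a, n_j = A.shape
    if n_l != n_l_a:
        raise ValueError("contracted index l has mismatching extents")
    C = np.empty((n_i, n_j, n_k), dtype=np.result_type(B, A))
    for k in range(n_k):
        # For fixed k, B[:, :, k] is an (i x l) matrix and A is (l x j),
        # so each slice of C is an ordinary matrix-matrix product.
        C[:, :, k] = B[:, :, k] @ A
    return C


# Small example tensors (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 5, 3))
A = rng.standard_normal((5, 6))

C_ref = contract_einsum(B, A)
C_gemm = contract_loop_over_gemm(B, A)
max_abs_diff = float(np.max(np.abs(C_ref - C_gemm)))
```

Both functions should agree up to floating-point rounding. Which index is looped over and which indices form the matrix dimensions is a design choice that depends on the memory layout; the sketch simply picks one valid option.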

Published in
ACM Transactions on Mathematical Software Volume 46, Issue 4
December 2020
272 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/3430683
Editors:
Zhaojun Bai, University of California at Davis, USA
Wolfgang Bangerth, Colorado State University, USA
Copyright © 2020 Owner/Author
This work is licensed under a Creative Commons Attribution-NoDerivatives International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 October 2020
- Accepted: 1 June 2020
- Revised: 1 February 2020
- Received: 1 March 2019
Author Tags
ADER-DG
Tensor operations
finite element method
high-performance computing
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 12
- Total Downloads: 767
- Downloads (last 12 months): 177
- Downloads (last 6 weeks): 26