Abstract
The numerical solution of partial differential equations is at the heart of many grand challenges in supercomputing. Solvers based on high-order discontinuous Galerkin (DG) discretisation have been shown to scale on large supercomputers with excellent performance and efficiency, provided the implementation exploits all levels of parallelism and is tailored to the specific architecture. However, new supercomputers emerge every year, and the list of hardware-specific considerations grows along with the list of desired features in a DG code. We therefore believe that a sustainable DG code needs an abstraction layer in which the numerical scheme is implemented in a suitable language. We explore the possibility of abstracting the numerical scheme as small tensor operations, describing them in a domain-specific language (DSL) resembling Einstein notation, and mapping them to small General Matrix-Matrix Multiplication (GEMM) routines. The compiler for our DSL implements classic optimisations that are used for large tensor contractions, and we present novel optimisation techniques such as equivalent sparsity patterns and optimal index permutations for temporary tensors. Our application examples, which include the earthquake simulation software SeisSol, show that the generated kernels achieve over 50% of the peak performance of a recent 48-core Skylake system, while the DSL considerably simplifies the implementation.
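To make the idea of mapping a small tensor operation to GEMM calls concrete, the following is a minimal sketch in plain Python. It is not the DSL or the generated code described in the abstract. The contraction C_ijk = A_il B_ljk (summation over the repeated index l, as in Einstein notation), the function names, and the nested-list storage are all assumptions chosen for illustration. The sketch shows that the contraction can be evaluated as one small matrix product per fixed index k, and checks the result against a direct loop nest.

```python
def gemm(A, B):
    """Small dense matrix product C = A @ B on nested lists."""
    m, n, p = len(A), len(B), len(B[0])
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(p)]
            for i in range(m)]


def contract_reference(A, B):
    """C[i][j][k] = sum over l of A[i][l] * B[l][j][k], as a plain loop nest."""
    I, L = len(A), len(B)
    J, K = len(B[0]), len(B[0][0])
    C = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                for l in range(L):
                    C[i][j][k] += A[i][l] * B[l][j][k]
    return C


def contract_loop_over_gemm(A, B):
    """Same contraction, evaluated as one small GEMM per fixed index k."""
    I, L = len(A), len(B)
    J, K = len(B[0]), len(B[0][0])
    C = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for k in range(K):
        # Slice B at fixed k: an L x J matrix.
        B_k = [[B[l][j][k] for j in range(J)] for l in range(L)]
        C_k = gemm(A, B_k)  # I x J result for this k
        for i in range(I):
            for j in range(J):
                C[i][j][k] = C_k[i][j]
    return C


# Hypothetical example data: A is 3 x 2, B is 2 x 2 x 2.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
B = [[[1.0, 0.0], [2.0, 1.0]], [[0.0, 3.0], [1.0, 1.0]]]
assert contract_loop_over_gemm(A, B) == contract_reference(A, B)
```

In a performance-oriented setting the slice copy would normally be avoided by choosing a memory layout in which each slice is already a strided matrix, and the inner product would be handed to a tuned small-GEMM routine. The sketch only illustrates the algebraic equivalence.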
Index Terms
- Yet Another Tensor Toolbox for Discontinuous Galerkin Methods and Other Applications