ABSTRACT
Over the past five years, graphics processing units (GPUs) have had a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations in nuclear and particle physics. While GPUs have been applied with great success to the post-Monte Carlo "analysis" phase, which accounts for a substantial fraction of the workload in a typical LQCD calculation, the initial Monte Carlo "gauge field generation" phase requires capability-level supercomputing, corresponding to O(100) GPUs or more. Strong scaling to this level has not previously been achieved. In this contribution, we demonstrate that a multi-dimensional parallelization strategy combined with a domain-decomposed preconditioner allows us to scale into this regime. We present results for two popular discretizations of the Dirac operator, Wilson-clover and improved staggered, employing up to 256 GPUs on the Edge cluster at Lawrence Livermore National Laboratory.
Index Terms
- Scaling lattice QCD beyond 100 GPUs