research-article

Scaling lattice QCD beyond 100 GPUs

Authors:
R. Babich (Boston University, Boston, MA)
M. A. Clark (Harvard-Smithsonian Center for Astrophysics, Cambridge, MA)
B. Joó (Thomas Jefferson National Accelerator Facility, Newport News, VA)
G. Shi (University of Illinois, Urbana, IL)
R. C. Brower (Boston University, Boston, MA)
S. Gottlieb (Indiana University, Bloomington, IN)

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
Article No.: 70
Pages 1–11
https://doi.org/10.1145/2063384.2063478

Published: 12 November 2011

ABSTRACT

Over the past five years, graphics processing units (GPUs) have had a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations in nuclear and particle physics. While GPUs have been applied with great success to the post-Monte Carlo "analysis" phase, which accounts for a substantial fraction of the workload in a typical LQCD calculation, the initial Monte Carlo "gauge field generation" phase requires capability-level supercomputing, corresponding to O(100) GPUs or more. Such strong scaling has not been previously achieved. In this contribution, we demonstrate that using a multi-dimensional parallelization strategy and a domain-decomposed preconditioner allows us to scale into this regime. We present results for two popular discretizations of the Dirac operator, Wilson-clover and improved staggered, employing up to 256 GPUs on the Edge cluster at Lawrence Livermore National Laboratory.
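One way to see why a multi-dimensional parallelization matters for strong scaling is to count how many lattice sites each GPU must exchange with its neighbors relative to how many it owns. The sketch below is illustrative only and is not taken from the paper: the 32^3 x 256 lattice, the process grids, the function names, and the `depth` parameter (halo thickness in sites) are all assumptions chosen for the example.

```python
from math import prod


def local_dims(global_dims, grid):
    """Per-GPU sub-lattice extents for a given process grid (hypothetical helper)."""
    for n, p in zip(global_dims, grid):
        if n % p != 0:
            raise ValueError(f"extent {n} is not divisible by {p} GPUs")
    return tuple(n // p for n, p in zip(global_dims, grid))


def halo_sites(global_dims, grid, depth=1):
    """Sites one GPU exchanges per operator application.

    Every partitioned dimension contributes two faces (forward and
    backward), each `depth` sites thick. Unpartitioned dimensions
    need no communication.
    """
    local = local_dims(global_dims, grid)
    volume = prod(local)
    total = 0
    for extent, parts in zip(local, grid):
        if parts > 1:
            total += 2 * depth * (volume // extent)
    return total


def surface_to_volume(global_dims, grid, depth=1):
    """Ratio of exchanged sites to locally owned sites."""
    local = local_dims(global_dims, grid)
    return halo_sites(global_dims, grid, depth) / prod(local)


if __name__ == "__main__":
    lattice = (32, 32, 32, 256)  # assumed lattice size, for illustration only
    # Three ways to spread the same lattice over 256 GPUs.
    for grid in [(1, 1, 1, 256), (1, 1, 4, 64), (2, 2, 2, 32)]:
        ratio = surface_to_volume(lattice, grid)
        print(grid, local_dims(lattice, grid), ratio)
```

With these assumed numbers, slicing only the time dimension leaves each GPU with a sub-lattice one site thick and a surface-to-volume ratio of 2.0, while the (2, 2, 2, 32) grid brings the ratio down to 0.625 for the same local volume. This simple count ignores overlap of communication with computation and any bandwidth differences between links, so it indicates the trend rather than predicting performance.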

Published in
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384
Conference Chair: Scott Lathrop (University of Chicago)
Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
GPU
Krylov solvers
domain decomposition
lattice QCD
Qualifiers
- research-article
Acceptance Rates
SC '11 Paper Acceptance Rate: 74 of 352 submissions, 21%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%