
Parallel Computing

Volume 100, December 2020, 102703

A parallel strategy for density functional theory computations on accelerated nodes

https://doi.org/10.1016/j.parco.2020.102703

Highlights

  • First-Principles MD simulations need efficient DFT solvers for fast time-to-solution.

  • We use the Löwdin orthonormalization as a proxy-app on GPU-accelerated nodes.

  • We propose a parallel strategy to reduce time-to-solution in the Rayleigh–Ritz approach.

  • Iterative solvers using matrix–matrix multiplications replace dense eigensolvers.

Abstract

Using the Löwdin orthonormalization of tall-skinny matrices as a proxy-app for wavefunction-based Density Functional Theory solvers, we investigate a distributed memory parallel strategy focusing on Graphics Processing Unit (GPU)-accelerated nodes as available on some of the top-ranked supercomputers at the present time. We present numerical results in the strong scaling limit regime, as it is particularly relevant for First-Principles Molecular Dynamics. We also examine how matrix product-based iterative solvers provide a competitive alternative to dense eigensolvers on GPUs, allowing us to push the strong scaling limit of these computations to a larger number of distributed tasks. Our strategy, which relies on replicated Gram matrices and efficient collective communications using the NCCL library, leads to a time-to-solution under 0.5 s for the Löwdin orthonormalization of a tall-skinny matrix with 3000 columns on Summit at the Oak Ridge Leadership Computing Facility (OLCF). Given the similarity in computational operations between one iteration of a DFT solver and this proxy-app, this shows the possibility of accurately solving the DFT equations in well under a minute for 3000 electronic wave functions, and thus of performing First-Principles Molecular Dynamics of physical systems much larger than those traditionally solved on CPU systems.

Introduction

Accelerated nodes with high performance Graphics Processing Units (GPUs) bring both an opportunity and a challenge for First-Principles Molecular Dynamics (FPMD). FPMD is a technique used to study condensed matter and materials at the atomistic scale by following the trajectory of each individual atom over time (see for instance [1]). Its appeal comes from explicitly including the electrons in the model and describing them by quantum electronic wave functions, thus, in principle at least, providing the correct interaction between atoms and allowing chemical reactions to happen. In the Born–Oppenheimer approximation often used in practice, atoms are represented by classical particles surrounded by quantum electrons. Density Functional Theory (DFT) [2], [3] is often the methodology chosen to model the electrons at the quantum level since it allows for calculations of hundreds of electrons with reasonable computer resources.

Modeling electrons at the quantum level can still be computationally expensive since it involves solving a large-scale eigenvalue problem with a nonlinear operator, the Kohn–Sham (KS) equations [4]. The eigenfunctions describe the quantum electronic wave functions, and even though there are techniques to reduce the number of electrons actually computed, their number can be large. In addition, for FPMD applications, these equations need to be solved numerous times. Every time the atoms move in a molecular dynamics time-step, the Hamiltonian operator in the KS equations changes, and the wave functions satisfying the KS equations need to be recomputed. In practice, tens of thousands of Molecular Dynamics (MD) steps need to be executed to obtain a thorough sampling of the phase-space [5]. That translates into strong constraints on the wall clock time per time step if we want to be able to complete these long simulations in a reasonable turnaround time. One goal of this paper is to evaluate how fast these time-steps can be computed on a computer architecture made of GPU-accelerated nodes, where most (>95%) of the floating-point operations are performed by the GPUs. A general question we try to answer is how much faster these calculations can be carried out on these modern architectures, or to what extent these architectures can reduce the time-to-solution for large problems.

In this paper, we consider general numerical discretizations of the KS equations that would be considered accurate and systematically improvable, and that are often used for FPMD. That includes Finite Difference, Finite Element and Plane-Wave basis sets, which all include a parameter that allows tuning the discretization error, either a measure of the mesh spacing or a Plane-Wave cutoff. There are other approaches to discretize the KS equations, typically using basis sets associated with the atoms in the simulation, such as atom-centered Gaussian functions or numeric orbitals [6]. These require a different type of solver that we will not discuss here.

For general wavefunction-based DFT solvers, one can represent the solution of the KS equations by a tall-skinny matrix Ψ, where each column is made of the coefficients representing a discretized quantum wave function. The number of degrees-of-freedom per function, M, is typically orders of magnitude larger than the number of wave functions N. Given an initial trial solution Ψ̃, a typical iterative solver involves two main steps: (1) solve an N×N matrix eigenvalue problem in the projected subspace spanned by the N column vectors in Ψ̃ to determine the matrix by which to multiply Ψ̃ to obtain the Ritz pairs (Rayleigh–Ritz procedure); (2) use the residual of the KS equations for the Ritz pairs to update Ψ̃ using a gradient-based algorithm. Due to the nonlinearity of the Hamiltonian operator, this procedure is often performed for a frozen Hamiltonian and embedded inside an outer loop where the Hamiltonian iteratively changes.
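
To illustrate step (1) only, a minimal NumPy/SciPy sketch is given below; the dense symmetric matrix H stands in for a discretized KS Hamiltonian, and the names (rayleigh_ritz, Psi) are illustrative rather than taken from any particular DFT code.

    # Minimal sketch of the Rayleigh-Ritz step (1); H is a dense stand-in
    # for the (frozen) Kohn-Sham Hamiltonian, not the paper's operator.
    import numpy as np
    from scipy.linalg import eigh

    def rayleigh_ritz(H, Psi):
        """Project H onto span(Psi), solve the small N x N generalized
        eigenproblem, and rotate Psi onto the Ritz vectors."""
        HPsi = H @ Psi            # apply the Hamiltonian to the block
        Hp = Psi.T @ HPsi         # N x N projected Hamiltonian
        S = Psi.T @ Psi           # N x N Gram (overlap) matrix
        theta, C = eigh(Hp, S)    # generalized eigenproblem in the subspace
        return theta, Psi @ C     # Ritz values and Ritz vectors

    rng = np.random.default_rng(0)
    M, N = 4000, 32               # tall-skinny: M >> N (illustrative sizes)
    H = rng.standard_normal((M, M)); H = 0.5 * (H + H.T)
    Psi = rng.standard_normal((M, N))
    theta, ritz_vectors = rayleigh_ritz(H, Psi)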

In this paper, we focus on block algorithms where all the columns of Ψ̃ are updated simultaneously, such as block Davidson [7], Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) [8], the Accelerated Block Preconditioned Gradient method [9] or the Chebyshev filtered subspace iteration [10]. Such algorithms are more cache-friendly since they rely heavily on matrix–matrix multiplications, instead of matrix–vector multiplications. The block nature of the solution of the KS equations combined with block solvers permits calculations that make use of a large fraction of peak flops on modern architectures for very large-scale problems (50,000 electrons or more) [11]. FPMD applications are typically smaller, with at most hundreds of electrons. In this case, it becomes more difficult to efficiently use a large number of compute nodes. While various matrix operations run efficiently on a shared memory node, other operations such as the Rayleigh–Ritz procedure inherently couple all the distributed tasks together and become the bottlenecks that limit strong scaling and fast time-to-solution in a memory distributed environment.
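
To make the reliance on matrix–matrix products concrete, a generic, unpreconditioned block gradient update for step (2) could be sketched as follows; this is a simplified illustration, not the block Davidson, LOBPCG or Chebyshev-filtered update of any specific code, and it assumes the columns of Psi are already orthonormal.

    # Generic block steepest-descent step for the trace minimization;
    # all heavy operations are matrix-matrix products (GEMM).
    import numpy as np

    def block_gradient_step(H, Psi, tau=0.1):
        """One unpreconditioned block gradient update; assumes the
        columns of Psi are orthonormal on entry."""
        HPsi = H @ Psi                       # tall-skinny GEMM
        Hp = Psi.T @ HPsi                    # small GEMM: projected Hamiltonian
        R = HPsi - Psi @ Hp                  # block residual, again a GEMM
        Q, _ = np.linalg.qr(Psi - tau * R)   # re-orthonormalize (QR used here
        return Q                             # only as a simple placeholder)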

In the last two decades, ScaLAPACK [12] has been a popular software library to solve this distributed Rayleigh–Ritz problem [13], [14], [15]. ScaLAPACK has also been used in various contexts in electronic structure calculations, for instance for discretization based on numerical atomic orbitals [6], or for Tight-Binding solvers [16] where matrices are typically larger and the dense distributed matrix eigensolver dominates the whole computation. While the ongoing SLATE project [17] is a rewrite of ScaLAPACK with improved software design and support for hardware accelerators, the problem we are facing with FPMD and discussing in this paper is more fundamental. More specifically, it relates to the growing gap in performance between communications and computation.

In this paper, we use a proxy-app that we developed to perform numerical experiments and evaluate parallel implementation strategies on compute nodes with high performance GPU accelerators.

Section snippets

Orthonormalization as a proxy-app for DFT solvers

DFT codes can be complex to analyze since they involve multiple nested iterations to solve the nonlinear KS equations. In addition, codes tend to implement their own variations of KS solvers, which often include several tuning parameters and numerical algorithmic variations that are difficult to fully document in a publication. Obviously, the discretization method also affects algorithmic choices and preconditioners. Therefore we chose to focus on a simpler algorithm that is representative of the
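
For reference, a single-node sketch of the Löwdin orthonormalization of a tall-skinny matrix A, i.e. A S^{-1/2} with S = A^T A, is shown below. Here S^{-1/2} is formed with a dense symmetric eigensolver, which is precisely the kind of operation that matrix-product-based iterations can replace on GPUs; the sizes are illustrative only.

    # Single-node Lowdin orthonormalization: A_orth = A S^{-1/2}, S = A^T A.
    import numpy as np

    def lowdin(A):
        S = A.T @ A                          # N x N Gram matrix
        w, V = np.linalg.eigh(S)             # S = V diag(w) V^T, w > 0 for full-rank A
        S_inv_sqrt = (V / np.sqrt(w)) @ V.T  # dense eigensolver route to S^{-1/2}
        return A @ S_inv_sqrt

    rng = np.random.default_rng(0)
    A = rng.standard_normal((20000, 256))    # illustrative sizes only
    Q = lowdin(A)
    print(np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1])))  # close to machine precision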

Numerical results

The numerical experiments in this section aim to assess the performance of our parallel implementation of the Löwdin orthonormalization of a tall-skinny matrix on the supercomputer Summit [32] at the Oak Ridge Leadership Computing Facility (OLCF). Each compute node of Summit is made of two IBM POWER9 CPUs and six NVIDIA Volta V100 GPUs, all connected together with NVIDIA’s high-speed NVLink. Data can be transferred from GPU-to-GPU and CPU-to-GPU at a peak rate of 50 GB/s in each direction. In
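
The communication pattern behind a replicated-Gram-matrix strategy can be sketched as follows. This illustration distributes the rows of the tall-skinny matrix across tasks and uses mpi4py's Allreduce on CPU arrays purely as a stand-in for the GPU-resident NCCL collectives of the actual implementation; the sizes are arbitrary and do not correspond to the Summit runs reported here.

    # Replicated-Gram-matrix pattern, sketched with MPI instead of NCCL.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nranks = comm.Get_rank(), comm.Get_size()

    M_global, N = 24000, 300                      # illustrative sizes only
    M_local = M_global // nranks                  # rows owned by this task

    rng = np.random.default_rng(rank)
    A_local = rng.standard_normal((M_local, N))   # local slab of the tall-skinny matrix

    # 1. Local contribution to the Gram matrix, then one allreduce so that
    #    every task holds the full (replicated) N x N matrix S = A^T A.
    S_local = A_local.T @ A_local
    S = np.empty_like(S_local)
    comm.Allreduce(S_local, S, op=MPI.SUM)

    # 2. Each task computes S^{-1/2} redundantly on the small replicated matrix
    #    (here via a dense eigensolver; matrix-product iterations could be used instead).
    w, V = np.linalg.eigh(S)
    S_inv_sqrt = (V / np.sqrt(w)) @ V.T

    # 3. Purely local update of the slab: no further communication is needed.
    A_local = A_local @ S_inv_sqrt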

Related work

Unlike the Schulz-type iteration, which is only locally convergent, direct methods such as the Cholesky factorization have the advantage that they can perform the orthonormalization of A without any specific assumption on A, as long as this matrix is guaranteed to be of full column rank. While the Cholesky decomposition is not straightforward to implement on GPUs and will not achieve the performance of a simple matrix–matrix multiplication for matrix sizes of the order of N=3,000 [36], it still
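
For reference, the Cholesky-based alternative amounts to factoring the Gram matrix S = A^T A = L L^T and applying a triangular solve; a minimal CPU sketch (not the GPU implementation benchmarked here) is:

    # Cholesky-based orthonormalization: S = A^T A = L L^T, A_orth = A L^{-T}.
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def cholesky_orthonormalize(A):
        """Orthonormalize the columns of any full-column-rank A."""
        S = A.T @ A
        L = cholesky(S, lower=True)
        # Solve L Y = A^T; then Y^T = A L^{-T} has orthonormal columns.
        return solve_triangular(L, A.T, lower=True).T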

Concluding remarks

Taking advantage of powerful GPUs such as the NVIDIA V100 on Summit (the number one supercomputer at the moment) is not a straightforward task, in particular for speeding up First-Principles Molecular Dynamics simulations. While peak performance in terms of floating-point operations has been increasing at a very fast pace in the last few decades, the speedup in communication performance has not been keeping up at the same rate. This is a well-known issue that affects many application areas. This

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.parco.2020.102703.

Acknowledgments

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL), managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

References (42)

  • M.P. Allen, et al., Computer Simulation of Liquids, Oxford Science Publications (1987)

  • M. Crouzeix, et al., The Davidson method, SIAM J. Sci. Comput. (1994)

  • A.V. Knyazev, Toward the optimal preconditioned eigensolver: Locally optimal block preconditioned conjugate gradient method, SIAM J. Sci. Comput. (2001)

  • S. Das, et al., Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing: 46 PFLOPS simulation of a metallic dislocation system

  • L.S. Blackford, et al.

  • J.-L. Fattebert, et al., Towards grid-based O(N) density-functional theory methods: Optimized nonorthogonal orbitals and multigrid acceleration, Phys. Rev. B (2000)

  • F. Gygi, R.K. Yates, J. Lorenz, E.W. Draeger, F. Franchetti, C.W. Ueberhuber, B.R. de Supinski, S. Kral, J.A. Gunnels,...

  • A. Ruiz-Serrano, et al., A variational method for density functional theory calculations on metallic systems with thousands of atoms, J. Chem. Phys. (2013)

  • M. Gates, et al., Performance Tuning SLATE, Technical Report 14, ICL-UT-20-01 (2020)

  • I. Karlin, A. Bhatele, J. Keasler, B.L. Chamberlain, J. Cohen, Z. Devito, R. Haque, D. Laney, E. Luke, F. Wang, D....

  • Exascale proxy applications (2020)

    This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
