Abstract
The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions, each of which contracts a four-dimensional tensor with a two-dimensional transformation matrix. The intermediate and final tensors in this sequence have differing degrees of permutation symmetry, which makes the intermediate tensors much larger than the final tensor and limits the number of electronic states in the modeled systems.
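The sequence of four contractions described above can be sketched with NumPy. This is an illustrative sketch, not the NWChem implementation: the names `A` (input tensor), `C` (transformation matrix), and `B` (result), the index order, and the use of a single matrix `C` for all four indices are assumptions made for the example, and permutation symmetry is ignored.

```python
import numpy as np

def four_index_transform(A, C):
    """Four contractions, one index at a time (assumed convention):
    B[a,b,c,d] = sum_{p,q,r,s} C[p,a] C[q,b] C[r,c] C[s,d] A[p,q,r,s].
    """
    O1 = np.einsum("pqrs,pa->aqrs", A, C)    # transform 1st index
    O2 = np.einsum("aqrs,qb->abrs", O1, C)   # transform 2nd index
    O3 = np.einsum("abrs,rc->abcs", O2, C)   # transform 3rd index
    return np.einsum("abcs,sd->abcd", O3, C) # transform 4th index

def four_index_transform_direct(A, C):
    """Reference: a single eight-index summation, far more expensive."""
    return np.einsum("pqrs,pa,qb,rc,sd->abcd", A, C, C, C, C)

# Small random example: 4 original basis functions, 3 transformed ones.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4, 4, 4))
C = rng.standard_normal((4, 3))
B = four_index_transform(A, C)
```

With N basis functions, each contraction step costs on the order of N^5 operations, whereas the direct eight-index summation costs on the order of N^8, which is why the stepwise form is preferred. The price is that the full intermediates `O1`, `O2`, and `O3` must be held somewhere.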
Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and for data/computation distribution across a parallel system, makes it challenging to develop an optimized parallel implementation of the four-index integral transform. We develop a novel approach to address this problem, using lower bounds modeling of data movement complexity. We relate the aggregate physical memory available in a parallel computer system to the fusion configurations that are ineffective at that memory size, which enables pruning those configurations, identifying effective choices, and characterizing optimality criteria. This work has resulted in a significantly improved implementation of the four-index transform that delivers higher performance and can model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
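To illustrate how fusion with tiling shrinks an intermediate, the sketch below fuses the first two contractions over tiles of the last index, so that only one tile of the first intermediate is live at a time. This is a hypothetical single-process example: the choice of which loops to fuse, the function name `transform_fused`, and the `tile` parameter are assumptions for illustration and are not the fusion configuration selected in the paper.

```python
import numpy as np

def transform_fused(A, C, tile=2):
    """Four-index transform with contractions 1 and 2 fused over tiles of s.

    Returns the result B and the peak element count of the first
    intermediate, to show the space saved by fusion.
    """
    n, m = C.shape
    O2 = np.empty((m, m, n, n))
    peak_O1 = 0
    for s0 in range(0, n, tile):
        s1 = min(s0 + tile, n)
        # Produce only the slab of the first intermediate for this tile...
        O1_tile = np.einsum("pqrs,pa->aqrs", A[:, :, :, s0:s1], C)
        peak_O1 = max(peak_O1, O1_tile.size)
        # ...and consume it immediately in the second contraction.
        O2[:, :, :, s0:s1] = np.einsum("aqrs,qb->abrs", O1_tile, C)
    O3 = np.einsum("abrs,rc->abcs", O2, C)
    B = np.einsum("abcs,sd->abcd", O3, C)
    return B, peak_O1

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4, 4, 4))
C = rng.standard_normal((4, 3))
B, peak_O1 = transform_fused(A, C, tile=2)
```

Without fusion the first intermediate holds m·n³ elements; here it holds at most m·n²·tile elements at any moment. A smaller tile lowers the footprint further at the cost of more, smaller contractions, which is the kind of trade-off that makes the configuration space large.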
Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming