Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis

Authors:
Samyam Rajbhandari, The Ohio State University, Columbus, OH, USA
Fabrice Rastello, INRIA, Rocquencourt, France
Karol Kowalski, PNNL, Richland, WA, USA
Sriram Krishnamoorthy, PNNL, Richland, WA, USA
P. Sadayappan, The Ohio State University, Columbus, OH, USA

ACM SIGPLAN Notices, Volume 52, Issue 8, August 2017, pp. 327–340. https://doi.org/10.1145/3155284.3018771

Published: 26 January 2017

Abstract

The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Differing degrees of permutation symmetry in the intermediate and final tensors in the sequence of contractions cause intermediate tensors to be much larger than the final tensor and limit the number of electronic states in the modeled systems.
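
To make the "sequence of four tensor contractions" concrete, here is a minimal sketch in plain Python. It is not the NWChem code or the implementation described in the paper. It assumes a single square transformation matrix `C` applied to all four indices, ignores permutation symmetry entirely, and uses hypothetical function names. It compares a direct evaluation of B[a,b,c,d] = sum over i,j,k,l of C[a][i] C[b][j] C[c][k] C[d][l] A[i,j,k,l], which costs O(n^8) operations, with four successive contractions that each cost O(n^5).

```python
import itertools
import random


def transform_direct(A, C, n):
    """Reference O(n^8) evaluation of the full four-index transform."""
    idx = range(n)
    B = {}
    for a, b, c, d in itertools.product(idx, repeat=4):
        B[a, b, c, d] = sum(
            C[a][i] * C[b][j] * C[c][k] * C[d][l] * A[i, j, k, l]
            for i, j, k, l in itertools.product(idx, repeat=4)
        )
    return B


def contract_first_index(T, C, n):
    """One O(n^5) contraction: out[j,k,l,a] = sum_i C[a][i] * T[i,j,k,l].

    The transformed index is rotated to the last position, so applying this
    step four times transforms every index and restores the original order.
    """
    idx = range(n)
    out = {}
    for j, k, l, a in itertools.product(idx, repeat=4):
        out[j, k, l, a] = sum(C[a][i] * T[i, j, k, l] for i in idx)
    return out


def transform_staged(A, C, n):
    """Four successive tensor-matrix contractions, O(n^5) each."""
    T = A
    for _ in range(4):
        T = contract_first_index(T, C, n)
    return T


# Illustrative inputs (assumption: small random integers, so results compare exactly).
random.seed(0)
n = 3  # tiny size so the O(n^8) reference finishes quickly
idx = range(n)
A = {key: random.randint(-3, 3) for key in itertools.product(idx, repeat=4)}
C = [[random.randint(-3, 3) for _ in idx] for _ in idx]

B_direct = transform_direct(A, C, n)
B_staged = transform_staged(A, C, n)
print(B_direct == B_staged)
```

In this sketch every intermediate tensor holds all n^4 elements. In the setting the abstract describes, the tensors have differing degrees of permutation symmetry, which is what makes the intermediates much larger than the final result.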

Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and data/computation distribution across a parallel system, make it challenging to develop an optimized parallel implementation for the four-index integral transform. We develop a novel approach to address this problem, using lower bounds modeling of data movement complexity. We establish relationships between available aggregate physical memory in a parallel computer system and ineffective fusion configurations, enabling their pruning and consequent identification of effective choices and a characterization of optimality criteria. This work has resulted in the development of a significantly improved implementation of the four-index transform that enables higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
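
The space saving from loop fusion can be illustrated with a small sketch in plain Python. This is an illustrative fusion of only the first two contractions over their shared `(k, l)` loops, with hypothetical function names; it is not one of the fusion configurations analyzed or selected in the paper, and it omits tiling, symmetry, and parallel distribution. The unfused version stores the full n^4 intermediate, while the fused version keeps only an n x n slice of it live at any time.

```python
import itertools
import random


def half_transform_unfused(A, C, n):
    """Contract indices i and j in two separate passes.

    The whole intermediate T1 (n^4 elements) is stored between the passes.
    """
    idx = range(n)
    T1 = {
        (a, j, k, l): sum(C[a][i] * A[i, j, k, l] for i in idx)
        for a, j, k, l in itertools.product(idx, repeat=4)
    }
    T2 = {
        (a, b, k, l): sum(C[b][j] * T1[a, j, k, l] for j in idx)
        for a, b, k, l in itertools.product(idx, repeat=4)
    }
    return T2, len(T1)  # result and number of intermediate elements held


def half_transform_fused(A, C, n):
    """Fuse both contractions inside the shared (k, l) loops.

    Only one n x n slice of the intermediate exists at a time.
    """
    idx = range(n)
    T2 = {}
    peak_live = 0
    for k, l in itertools.product(idx, repeat=2):
        # Slice of T1 for this (k, l): t1[a][j] = sum_i C[a][i] * A[i,j,k,l]
        t1 = [[sum(C[a][i] * A[i, j, k, l] for i in idx) for j in idx] for a in idx]
        peak_live = max(peak_live, n * n)
        for a, b in itertools.product(idx, repeat=2):
            T2[a, b, k, l] = sum(C[b][j] * t1[a][j] for j in idx)
    return T2, peak_live


# Illustrative inputs (assumption: small random integers, so results compare exactly).
random.seed(1)
n = 3
idx = range(n)
A = {key: random.randint(-3, 3) for key in itertools.product(idx, repeat=4)}
C = [[random.randint(-3, 3) for _ in idx] for _ in idx]

T2_unfused, live_unfused = half_transform_unfused(A, C, n)
T2_fused, live_fused = half_transform_fused(A, C, n)
print(T2_unfused == T2_fused, live_unfused, live_fused)
```

Both versions perform the same arithmetic; only the lifetime of the intermediate changes. Which loops to fuse, how to tile them, and how to distribute the data are the choices the abstract identifies as forming a large search space.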


Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data


Published in

ACM SIGPLAN Notices, Volume 52, Issue 8 (PPoPP '17)
August 2017, 442 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3155284
Editor: Matthew Fluet

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2017, 476 pages
ISBN: 9781450344937
DOI: 10.1145/3018743
General Chair: Vivek Sarkar, Rice University, USA
Program Chair: Lawrence Rauchwerger, Texas A&M University, USA

Copyright © 2017 ACM
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
4-index
communication optimization
distributed algorithm
four-index
fusion
lower bounds
optimal schedule
optimizing 4-index transform
parallel algorithm
processor mapping
scheduling
tensor contraction
tensors
Qualifiers
- research-article