ABSTRACT
Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such programs may involve all (or large numbers) of the participating processes, the actual communication operations are usually sparse in nature. As a result, communication phases are typically expressed explicitly using point-to-point communication operations or collective operations. We define the dynamic sparse data-exchange (DSDE) problem and derive bounds in the well-known LogGP model. While current approaches work well with static applications, they run into limitations as modern applications grow in scale, and as the problems being solved become increasingly irregular and dynamic.
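To make the DSDE problem concrete: each process holds messages for a small, dynamically determined set of peers, but no process knows in advance who will send to it. A common baseline protocol first exchanges message counts with a dense alltoall and then posts exactly the expected point-to-point transfers; the count exchange touches O(P) data per process regardless of sparsity. The following single-process simulation is a minimal sketch of that baseline (the function name and data layout are hypothetical, chosen for illustration, not the paper's implementation):

```python
def pex_simulate(dests):
    """Simulate a dense count-exchange protocol for the DSDE problem.

    dests[i] maps destination rank j -> payload that process i must send.
    Step 1 models an MPI_Alltoall over a length-P count vector: every
    process contributes a full row of the P x P count matrix, so each
    one handles O(P) entries even when the real communication is sparse.
    """
    P = len(dests)
    # Step 1: dense count matrix; counts[i][j] = 1 iff i sends to j.
    counts = [[1 if j in dests[i] else 0 for j in range(P)] for i in range(P)]
    # After the alltoall, process j knows its senders: column j of counts.
    senders = [[i for i in range(P) if counts[i][j]] for j in range(P)]
    # Step 2: post exactly the expected point-to-point receives.
    recvd = [dict() for _ in range(P)]
    for j in range(P):
        for i in senders[j]:
            recvd[j][i] = dests[i][j]
    return recvd
```

The O(P) term in step 1 is precisely what motivates protocols whose cost depends only on the number of actual communication partners.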
To enable the compact and efficient expression of the communication phase, we develop suitable sparse communication protocols for irregular applications at large scale. We discuss different irregular applications and show the sparsity of their communication for real-world input data. We analyze the time and memory complexity of commonly used protocols for the DSDE problem and develop NBX, a novel fast algorithm with constant memory overhead, for solving it. NBX improves the runtime of a sparse data exchange among 8,192 processors on BlueGene/P by a factor of 5.6. In an application study, we show improvements of up to a factor of 28.9 for a parallel breadth-first search on 8,192 BlueGene/P processors.
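The key idea behind NBX, as described in the paper, is to combine synchronous-mode nonblocking sends (so a sender learns when each of its messages has been matched) with a nonblocking barrier (MPI_Ibarrier) that a process enters only after all of its own sends have completed; probing for unexpected messages continues until the barrier completes, at which point no message can still be in flight. The single-process simulation below is a sketch of that control flow under stated assumptions: mailboxes stand in for the network, ack sets model MPI_Issend completion, and a shared flag vector models the nonblocking barrier; all names are hypothetical.

```python
import collections

def nbx_simulate(dests):
    """Simulate the NBX-style dynamic sparse data exchange for P processes.

    dests[i] is a dict {j: payload} of messages process i must deliver.
    Returns recvd, where recvd[j] is a dict {i: payload} of messages
    process j received. Per-process state is constant in P: each process
    tracks only its own sends, its own barrier flag, and its inbox.
    """
    P = len(dests)
    mailbox = [collections.deque() for _ in range(P)]  # in-flight messages
    acks = [set() for _ in range(P)]      # sends of i matched by a receive
    in_barrier = [False] * P              # process has entered the Ibarrier
    done = [False] * P                    # barrier completed for process i
    recvd = [dict() for _ in range(P)]

    # Phase 1: every process posts all of its (synchronous-mode) sends.
    for i in range(P):
        for j, payload in dests[i].items():
            mailbox[j].append((i, payload))

    # Phase 2: round-robin "polling" until the barrier completes everywhere.
    while not all(done):
        for i in range(P):
            if done[i]:
                continue
            # Probe: drain any pending message, acknowledging its sender
            # (this models the sender's Issend completing on a match).
            while mailbox[i]:
                src, payload = mailbox[i].popleft()
                recvd[i][src] = payload
                acks[src].add(i)
            # All own sends matched -> enter the nonblocking barrier.
            if not in_barrier[i] and len(acks[i]) == len(dests[i]):
                in_barrier[i] = True
            # The barrier completes once every process has entered it;
            # after that, no message can still be undelivered.
            if in_barrier[i] and all(in_barrier):
                done[i] = True
    return recvd
```

Unlike the dense count exchange, no process ever builds or scans a P-sized structure of its own here; the global `in_barrier` check is the simulation's stand-in for the tree-based barrier completion the real algorithm would use.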