ABSTRACT
Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such programs may involve all (or large numbers) of the participating processes, the actual communication operations are usually sparse in nature. As a result, communication phases are typically expressed explicitly using point-to-point communication operations or collective operations. We define the dynamic sparse data-exchange (DSDE) problem and derive bounds in the well-known LogGP model. While current approaches work well with static applications, they run into limitations as modern applications grow in scale, and as the problems being solved become increasingly irregular and dynamic.
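To make the DSDE problem concrete: each process holds messages for a small, dynamically determined set of peers, but no process knows in advance who will send to it. A common baseline protocol first exchanges message counts with a dense alltoall and then posts exactly the expected point-to-point transfers; the count exchange touches O(P) data per process regardless of sparsity. The following single-process simulation is a minimal sketch of that baseline (the function name and data layout are hypothetical, chosen for illustration, not the paper's implementation):

```python
def pex_simulate(dests):
    """Simulate a dense count-exchange protocol for the DSDE problem.

    dests[i] maps destination rank j -> payload that process i must send.
    Step 1 models an MPI_Alltoall over a length-P count vector: every
    process contributes a full row of the P x P count matrix, so each
    one handles O(P) entries even when the real communication is sparse.
    """
    P = len(dests)
    # Step 1: dense count matrix; counts[i][j] = 1 iff i sends to j.
    counts = [[1 if j in dests[i] else 0 for j in range(P)] for i in range(P)]
    # After the alltoall, process j knows its senders: column j of counts.
    senders = [[i for i in range(P) if counts[i][j]] for j in range(P)]
    # Step 2: post exactly the expected point-to-point receives.
    recvd = [dict() for _ in range(P)]
    for j in range(P):
        for i in senders[j]:
            recvd[j][i] = dests[i][j]
    return recvd
```

The O(P) term in step 1 is precisely what motivates protocols whose cost depends only on the number of actual communication partners.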
To enable the compact and efficient expression of the communication phase, we develop suitable sparse communication protocols for irregular applications at large scale. We discuss different irregular applications and show the sparsity of their communication for real-world input data. We analyze the time and memory complexity of commonly used protocols for the DSDE problem and develop NBX, a novel fast algorithm with constant memory overhead, for solving it. NBX improves the runtime of a sparse data exchange among 8,192 processors on BlueGene/P by a factor of 5.6. In an application study, we show improvements of up to a factor of 28.9 for a parallel breadth-first search on 8,192 BlueGene/P processors.
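The key idea behind NBX, as described in the paper, is to combine synchronous-mode nonblocking sends (so a sender learns when each of its messages has been matched) with a nonblocking barrier (MPI_Ibarrier) that a process enters only after all of its own sends have completed; probing for unexpected messages continues until the barrier completes, at which point no message can still be in flight. The single-process simulation below is a sketch of that control flow under stated assumptions: mailboxes stand in for the network, ack sets model MPI_Issend completion, and a shared flag vector models the nonblocking barrier; all names are hypothetical.

```python
import collections

def nbx_simulate(dests):
    """Simulate the NBX-style dynamic sparse data exchange for P processes.

    dests[i] is a dict {j: payload} of messages process i must deliver.
    Returns recvd, where recvd[j] is a dict {i: payload} of messages
    process j received. Per-process state is constant in P: each process
    tracks only its own sends, its own barrier flag, and its inbox.
    """
    P = len(dests)
    mailbox = [collections.deque() for _ in range(P)]  # in-flight messages
    acks = [set() for _ in range(P)]      # sends of i matched by a receive
    in_barrier = [False] * P              # process has entered the Ibarrier
    done = [False] * P                    # barrier completed for process i
    recvd = [dict() for _ in range(P)]

    # Phase 1: every process posts all of its (synchronous-mode) sends.
    for i in range(P):
        for j, payload in dests[i].items():
            mailbox[j].append((i, payload))

    # Phase 2: round-robin "polling" until the barrier completes everywhere.
    while not all(done):
        for i in range(P):
            if done[i]:
                continue
            # Probe: drain any pending message, acknowledging its sender
            # (this models the sender's Issend completing on a match).
            while mailbox[i]:
                src, payload = mailbox[i].popleft()
                recvd[i][src] = payload
                acks[src].add(i)
            # All own sends matched -> enter the nonblocking barrier.
            if not in_barrier[i] and len(acks[i]) == len(dests[i]):
                in_barrier[i] = True
            # The barrier completes once every process has entered it;
            # after that, no message can still be undelivered.
            if in_barrier[i] and all(in_barrier):
                done[i] = True
    return recvd
```

Unlike the dense count exchange, no process ever builds or scans a P-sized structure of its own here; the global `in_barrier` check is the simulation's stand-in for the tree-based barrier completion the real algorithm would use.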