ABSTRACT
The significant increase in complexity of Exascale platforms due to energy-constrained, billion-way parallelism, with major changes to processor and memory architecture, requires new energy-efficient and resilient programming techniques that are portable across multiple future generations of machines. We believe that guaranteeing adequate scalability, programmability, performance portability, resilience, and energy efficiency requires a fundamentally new approach, combined with a transition path for existing scientific applications, to fully reap the rewards of today's and tomorrow's systems. We present HPX, a parallel runtime system that extends the C++11/14 standard to facilitate distributed operations, enable fine-grained constraint-based parallelism, and support runtime-adaptive resource management. This approach provides a widely accepted API enabling programmability, composability, and performance portability of user applications. By employing a global address space, we seamlessly augment the standard to apply to the distributed case. We present HPX's architecture, design decisions, and results selected from a diverse set of application runs showing superior performance, scalability, and efficiency over conventional practice.