DOI: 10.1145/2462902.2462916 (HPDC conference proceedings)
research-article

Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi

Published: 25 October 2018

ABSTRACT

Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, complicating the understanding of performance properties. We developed an intuitive performance model for cache-coherent architectures and demonstrate its use with the currently most scalable cache-coherent many-core architecture, Intel Xeon Phi. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms that were developed with the model beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries by up to a factor of 4.3. The model can be simplified to satisfy the tradeoff between complexity of algorithm design and accuracy. We expect that our model can serve as a vehicle for advanced algorithm design.
