ABSTRACT
Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, which makes performance properties hard to understand. We develop an intuitive performance model for cache-coherent architectures and demonstrate its use with the currently most scalable cache-coherent many-core architecture, Intel Xeon Phi. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms developed with the model outperform the highly tuned vendor-specific Intel OpenMP and MPI libraries by up to a factor of 4.3. The model can be simplified to balance the complexity of algorithm design against accuracy. We expect that our model can serve as a vehicle for advanced algorithm design.
Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing