Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi

Authors:
Sabela Ramos, University of A Coruña, A Coruña, Spain
Torsten Hoefler, ETH Zurich, Zurich, Switzerland

HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing, June 2013, Pages 97–108, https://doi.org/10.1145/2462902.2462916

Published: 25 October 2018

ABSTRACT

Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers between cores, which makes performance properties hard to understand. We developed an intuitive performance model for cache-coherent architectures and demonstrate its use with Intel Xeon Phi, currently the most scalable cache-coherent many-core architecture. Using our model, we develop several optimal and optimized algorithms for complex parallel data exchanges. All algorithms that were developed with the model beat the performance of the highly-tuned vendor-specific Intel OpenMP and MPI libraries by up to a factor of 4.3. The model can be simplified to trade accuracy against the complexity of algorithm design. We expect that our model can serve as a vehicle for advanced algorithm design.

Index Terms

1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Published in
HPDC '13: Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
June 2013
276 pages
ISBN: 9781450319102
DOI: 10.1145/2493123
General Chairs:
Manish Parashar, Rutgers University, USA
Jon Weissman, University of Minnesota, USA
Program Chairs:
Dick Epema, Delft University of Technology and Eindhoven University of Technology, The Netherlands
Renato Figueiredo, University of Florida, USA and Vrije Universiteit, The Netherlands
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Intel Xeon Phi
cache coherency
communication modeling
shared memory systems
Qualifiers
- research-article
Acceptance Rates
HPDC '13 Paper Acceptance Rate: 20 of 131 submissions, 15%
Overall Acceptance Rate: 166 of 966 submissions, 17%