research-article

How to simulate 1000 cores

Authors: Matteo Monchiero, Jung Ho Ahn, Ayose Falcón, Daniel Ortega, Paolo Faraboschi (all Hewlett-Packard Laboratories)

ACM SIGARCH Computer Architecture News, Volume 37, Issue 2, May 2009, pp. 10–19. https://doi.org/10.1145/1577129.1577133

Published: 23 July 2009

Abstract

This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into core-level parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. Then, the simulator dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated scalability up to 1024 cores with limited simulation speed degradation vs. the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale up the number of shared-memory cores beyond the thousand-core limit.
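The two steps named in the abstract (separating per-thread instruction streams, then mapping each stream to a target core while honoring synchronization) can be illustrated with a toy sketch. This is not the authors' simulator: the trace format `(thread_id, kind, cost_cycles)`, the function names `split_streams` and `simulate`, the one-thread-per-core mapping, and the barrier-only synchronization model are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical trace record: (thread_id, kind, cost_cycles).
# kind is "op" for an ordinary instruction or "barrier" for a sync point.

def split_streams(trace):
    """Separate an interleaved trace into one instruction stream per thread."""
    streams = defaultdict(list)
    for tid, kind, cost in trace:
        streams[tid].append((kind, cost))
    return dict(streams)

def simulate(trace, num_cores):
    """Map each thread to its own simulated core and return per-core clocks.

    Assumption: barriers are the only synchronization, and every thread
    that reaches a barrier waits for the slowest thread at that barrier.
    """
    streams = split_streams(trace)
    tids = sorted(streams)
    if len(tids) > num_cores:
        raise ValueError("this sketch needs one core per thread")
    core_of = {tid: core for core, tid in enumerate(tids)}
    clock = [0] * num_cores
    pos = {tid: 0 for tid in tids}

    while True:
        at_barrier = []
        for tid in tids:
            stream, i = streams[tid], pos[tid]
            # Run this thread on its core until a barrier or end of stream.
            while i < len(stream) and stream[i][0] != "barrier":
                clock[core_of[tid]] += stream[i][1]
                i += 1
            if i < len(stream):
                at_barrier.append(tid)
                i += 1  # consume the barrier record
            pos[tid] = i
        if not at_barrier:
            break
        # Release the barrier at the latest arrival time.
        release = max(clock[core_of[t]] for t in at_barrier)
        for t in at_barrier:
            clock[core_of[t]] = release
    return clock

trace = [
    (0, "op", 3), (1, "op", 5),
    (0, "barrier", 0), (1, "barrier", 0),
    (0, "op", 2), (1, "op", 1),
]
print(simulate(trace, num_cores=2))  # prints [7, 6]
```

In this example thread 0 reaches the barrier at cycle 3 but is held until thread 1 arrives at cycle 5, which is the kind of synchronization effect the mapping step has to preserve when threads become cores.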

Published in

ACM SIGARCH Computer Architecture News, Volume 37, Issue 2, May 2009, 69 pages
ISSN: 0163-5964
DOI: 10.1145/1577129

Copyright © 2009 Authors

Publisher: Association for Computing Machinery, New York, NY, United States