Abstract
This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into core-level parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. Then, the simulator dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated scalability up to 1024 cores with limited simulation speed degradation compared with the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale up the number of shared-memory cores beyond the thousand-core limit.
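The two steps described in the abstract (separating per-thread instruction streams, then mapping each stream to a target core) can be illustrated with a minimal sketch. This is not the authors' implementation: the trace format of `(thread_id, instruction)` pairs, the function names `demux_by_thread` and `map_threads_to_cores`, and the round-robin assignment are all assumptions made for illustration, and thread synchronization, which the paper accounts for, is omitted here.

```python
from collections import defaultdict


def demux_by_thread(trace):
    """Split an interleaved trace of (thread_id, instruction) pairs
    into one instruction stream per software thread (hypothetical format)."""
    streams = defaultdict(list)
    for thread_id, instruction in trace:
        streams[thread_id].append(instruction)
    return dict(streams)


def map_threads_to_cores(streams, num_cores):
    """Assign each thread's stream to a simulated core.

    Threads are assigned in order of first appearance; if there are more
    threads than cores, assignment wraps around (an assumed policy).
    """
    core_streams = defaultdict(list)
    for index, instructions in enumerate(streams.values()):
        core_streams[index % num_cores].extend(instructions)
    return dict(core_streams)


# A tiny interleaved trace, as a single-core full-system simulator might emit it.
trace = [
    (0, "load r1"),
    (1, "load r2"),
    (0, "add r1"),
    (2, "store r3"),
    (1, "add r2"),
]

streams = demux_by_thread(trace)
cores = map_threads_to_cores(streams, num_cores=4)
```

With three threads and four target cores, each thread gets its own core; with fewer cores than threads, the sketch simply concatenates streams on a shared core, which a real simulator would need to schedule more carefully.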
Index Terms
- How to simulate 1000 cores