research-article

How to simulate 1000 cores

Published: 23 July 2009

Abstract

This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to take the thread-level parallelism of the software system and translate it into core-level parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. The simulator then dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated scalability up to 1024 cores, with limited degradation in simulation speed relative to the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale the number of shared-memory cores beyond the thousand-core limit.
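The two steps described in the abstract (separating instruction streams by software thread, then mapping each flow to a simulated core while honoring synchronization) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the authors' simulator: the function names `split_by_thread` and `simulate`, the operation labels `alu`, `mem`, and `barrier`, and the per-operation cycle costs are all hypothetical, and barriers are assumed to involve every thread that has not yet finished.

```python
from collections import defaultdict


def split_by_thread(interleaved):
    """Separate an interleaved (thread_id, op) stream into per-thread flows."""
    flows = defaultdict(list)
    for tid, op in interleaved:
        flows[tid].append(op)
    return dict(flows)


def simulate(flows, cost=None):
    """Map each thread flow to its own simulated core and return per-core cycles.

    Hypothetical model: every op has a fixed cycle cost, and a 'barrier' op
    makes every still-running thread wait for the slowest one.
    """
    cost = cost or {"alu": 1, "mem": 3}  # assumed costs, for illustration only
    core_of = {tid: core for core, tid in enumerate(sorted(flows))}
    clock = [0] * len(flows)             # one local clock per simulated core
    pc = {tid: 0 for tid in flows}       # next op index for each thread

    while any(pc[t] < len(flows[t]) for t in flows):
        # Run each core until it reaches its next barrier or finishes.
        for tid, ops in flows.items():
            while pc[tid] < len(ops) and ops[pc[tid]] != "barrier":
                clock[core_of[tid]] += cost[ops[pc[tid]]]
                pc[tid] += 1
        # Threads that are not finished are now stopped at a barrier.
        waiting = [t for t in flows if pc[t] < len(flows[t])]
        if waiting:
            release = max(clock[core_of[t]] for t in waiting)
            for t in waiting:
                clock[core_of[t]] = release  # wait for the slowest core
                pc[t] += 1                   # step past the barrier
    return clock


# Two software threads interleaved in a single full-system instruction stream.
stream = [(0, "alu"), (1, "mem"), (0, "alu"), (1, "barrier"),
          (0, "barrier"), (0, "mem"), (1, "alu")]
flows = split_by_thread(stream)
cycles = simulate(flows)
print(cycles)  # per-core cycle counts, indexed by core
```

In this toy model each core keeps its own clock, and the clocks are reconciled only at synchronization points, which is the sense in which thread synchronization constrains how the separate flows are advanced.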

Published in

ACM SIGARCH Computer Architecture News, Volume 37, Issue 2
May 2009
69 pages
ISSN: 0163-5964
DOI: 10.1145/1577129

Copyright © 2009 Authors

Publisher

Association for Computing Machinery
New York, NY, United States
