Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation

Authors:
Trevor E. Carlson, Ghent University, Belgium and Intel ExaScience Lab, Leuven, Belgium
Wim Heirman, Ghent University, Belgium and Intel ExaScience Lab, Leuven, Belgium
Lieven Eeckhout, Ghent University, Belgium

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011, Article No.: 52, Pages 1–12
https://doi.org/10.1145/2063384.2063454

Published: 12 November 2011

ABSTRACT

Two major trends in high-performance computing, namely, larger numbers of cores and the growing size of on-chip cache memory, are creating significant challenges for evaluating the design space of future processor architectures. Fast and scalable simulations are therefore needed to allow for sufficient exploration of large multi-core systems within a limited simulation time budget. By bringing together accurate high-abstraction analytical models with fast parallel simulation, architects can trade off accuracy against simulation speed to allow for longer application runs, covering a larger portion of the hardware design space. Interval simulation provides this balance between detailed cycle-accurate simulation and one-IPC simulation, allowing long-running simulations to be modeled much faster than with detailed cycle-accurate simulation, while still providing the detail necessary to observe core-uncore interactions across the entire system. Validations against real hardware show average absolute errors within 25% for a variety of multi-threaded workloads, more than twice as accurate on average as one-IPC simulation. Further, we demonstrate scalable simulation speed of up to 2.0 MIPS when simulating a 16-core system on an 8-core SMP machine.
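
To make the contrast between the two abstraction levels concrete, here is a toy sketch in Python. It is not Sniper's implementation: the names `MissEvent`, `one_ipc_cycles`, `interval_cycles`, and `average_absolute_error`, the dispatch width of 4, and all penalty values are assumptions chosen for illustration. The interval-style estimate assumes instructions flow at the full dispatch width between miss events and that each miss event adds a fixed penalty. It deliberately ignores overlap between miss events, which a real interval model accounts for.

```python
from dataclasses import dataclass


@dataclass
class MissEvent:
    """Hypothetical miss event, e.g. a branch misprediction or cache miss."""
    kind: str
    penalty_cycles: int


def one_ipc_cycles(num_instructions, miss_events):
    # One-IPC style estimate: one cycle per instruction, plus every penalty.
    return num_instructions + sum(e.penalty_cycles for e in miss_events)


def interval_cycles(num_instructions, miss_events, dispatch_width=4):
    # Interval-style estimate: between miss events, instructions dispatch at
    # the full dispatch width (assumed value); each miss adds its penalty.
    base = (num_instructions + dispatch_width - 1) // dispatch_width
    return base + sum(e.penalty_cycles for e in miss_events)


def average_absolute_error(simulated, measured):
    # Mean of |simulated - measured| / measured over all workloads.
    errors = [abs(s - m) / m for s, m in zip(simulated, measured)]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    events = [MissEvent("branch_mispredict", 15), MissEvent("llc_miss", 200)]
    print(one_ipc_cycles(1000, events))   # 1000 + 215
    print(interval_cycles(1000, events))  # 250 + 215
```

The `average_absolute_error` helper shows one plausible reading of the accuracy metric quoted in the abstract: the relative error of each simulated run against its hardware measurement, averaged over workloads.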

Index Terms

Computing methodologies → Modeling and simulation → Model development and analysis → Modeling methodologies

Published in
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384
Conference Chair: Scott Lathrop, University of Chicago
Program Chairs: Jim Costa, Sandia National Laboratories; William Kramer, National Center for Supercomputing Applications
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
interval model
interval simulation
multi-core processor
performance modeling
Qualifiers
- research-article
Acceptance Rates
SC '11 Paper Acceptance Rate: 74 of 352 submissions, 21%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
Article Metrics
- Total Citations: 609
- Total Downloads: 1,785
- Downloads (Last 12 months): 202
- Downloads (Last 6 weeks): 26