TQSIM: A fast cycle-approximate processor simulator based on QEMU

https://doi.org/10.1016/j.sysarc.2016.04.012

Abstract

Timing simulation of a processor is a key enabling technique for exploring the design space of a system architecture and for developing software when the hardware is not yet available. We propose a fast cycle-approximate simulation technique for modern superscalar out-of-order processors. The proposed technique is designed in two parts: the front-end provides correct functional execution of the guest application, and the back-end provides a timing model. For the back-end, we developed a novel processor timing model that combines a simple formula-based analytical model with a scheduling analysis of sampled traces to boost simulation speed with minimal accuracy loss. Together with a cache simulator, a branch predictor, and a trace analyzer, the proposed technique is implemented on top of the popular and portable QEMU emulator and is named TQSIM (Timed QEMU-based SIMulator). At the cost of an average timing error of around 8 percent, TQSIM runs one to two orders of magnitude faster than a reference cycle-accurate simulator when the target architecture is an ARM Cortex A15 processor. TQSIM is an open-source project currently available online.

Introduction

Simulation is one of the most critical techniques for computer architecture design and application development without real hardware [1]. Cycle-accurate simulators have been widely used to validate a target architecture at the cycle level. A cycle-accurate simulator focuses on accuracy in order to predict the performance of the target architecture precisely, updating the values of all state elements of the machine at every clock cycle. Unfortunately, such high accuracy comes at the price of high development cost and long simulation time. Single-core cycle-accurate simulators typically run at 0.01 to 0.3 million instructions per second (MIPS), which means that it would take several days of simulation to cover a couple of minutes of simulated time. Moreover, the simulation speed degrades further as the architectural complexity and the number of processors in a system increase.
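The cost quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes a target that retires about 10^9 instructions per second (for example, a 1 GHz core at an IPC of 1); this rate and the function name `wall_clock_days` are illustrative assumptions, not figures from this paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wall_clock_days(simulated_seconds, target_ips, simulator_mips):
    """Host time, in days, needed to simulate `simulated_seconds` of target time.

    target_ips     -- instructions the target retires per second (assumed value)
    simulator_mips -- simulator throughput in million instructions per second
    """
    instructions = simulated_seconds * target_ips
    host_seconds = instructions / (simulator_mips * 1e6)
    return host_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Two minutes of simulated time at the slow and fast ends of the quoted range.
    for mips in (0.01, 0.3):
        days = wall_clock_days(120, 1e9, mips)
        print(f"{mips} MIPS: {days:.1f} days")
```

Under these assumptions, the fast end of the range (0.3 MIPS) needs between four and five days for two minutes of simulated time, and the slow end takes proportionally longer.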

Simulation speed is sometimes prioritized over simulation accuracy. One case is microprocessor or system-level design space exploration. To compare the performance of several candidate architectures, fast simulation is necessary to simulate all possible candidates. Another case is software development in the hardware/software codesign methodology. Here, an application program has to be executed repeatedly for software development and debugging, so simulation speed is a crucial factor in the productivity of software design. However, a certain level of accuracy must still be guaranteed even in those cases. For example, accurate timing information about the events of each processor is essential to model communication and synchronization in a multi-processor system.

Although recent CPU designs rely on multicore/multiprocessor designs to expose task-level parallelism, reusing existing processor simulators with minor modification can save considerable development time and effort. Hence, many existing parallel simulation frameworks, such as MCEmu [2], Graphite [3], and HSIM [4], integrate existing single-processor simulators or instruction set simulators (ISSs) to perform multicore/multiprocessor simulation. Thus, improved single-core simulation performance is key to making the simulation of larger multicore systems viable. Balancing simulation speed against cycle accuracy becomes more challenging as the number of processors increases.

To serve those cases, in this paper we propose a fast cycle-approximate simulation technique for modern superscalar out-of-order processors that boosts simulation speed with low accuracy loss. Like other modern simulators [5], [6], [7], the proposed simulator is designed in two parts: a front-end provides functionally correct execution of the guest application, and a back-end provides timing models and data recording. For the front-end, we select one of the most popular and portable open-source emulators, QEMU. For the back-end, we combine analytical and sampled simulation techniques in a novel way. The baseline technique is an analytical simulation that models the processor timing with a simple formula. On top of that, a scheduling analysis of sampled traces improves timing accuracy by providing more accurate parameters for the analytical formula. Because only sampled traces are analyzed, the overhead of the scheduling analysis stays low, and we achieve faster simulation at minimal loss of accuracy. Since the proposed technique is implemented on top of QEMU, it is named TQSIM (Timed QEMU-based SIMulator). Experiments show that TQSIM is one to two orders of magnitude faster than a cycle-accurate simulator while maintaining high timing accuracy (average error of approximately 8% with MiBench programs). The simulator is only eight times slower than the baseline functional simulator, QEMU.
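The combination of an analytical formula with sampled scheduling analysis can be illustrated with a minimal sketch. It assumes the general shape of interval-style models: a base term given by the instruction count divided by a steady-state IPC, plus a fixed penalty per miss event. The event classes, penalty values, and all names below are hypothetical and are not taken from the TQSIM implementation.

```python
from dataclasses import dataclass


@dataclass
class MissEvents:
    """Event counts that a functional run with cache/branch models might report."""
    branch_mispredictions: int
    icache_misses: int
    long_dcache_misses: int


@dataclass
class Penalties:
    """Cycles lost per event (hypothetical values supplied by the caller)."""
    branch_misprediction: float
    icache_miss: float
    long_dcache_miss: float


def steady_state_ipc(samples):
    """Estimate steady-state IPC from sampled traces.

    `samples` is a list of (instructions, cycles) pairs, where the cycle
    count of each sample is assumed to come from a scheduling analysis.
    """
    total_instructions = sum(n for n, _ in samples)
    total_cycles = sum(c for _, c in samples)
    return total_instructions / total_cycles


def estimate_cycles(instructions, ipc, events, penalties):
    """Analytical estimate: base cycles plus accumulated miss penalties."""
    base = instructions / ipc
    stall = (events.branch_mispredictions * penalties.branch_misprediction
             + events.icache_misses * penalties.icache_miss
             + events.long_dcache_misses * penalties.long_dcache_miss)
    return base + stall


if __name__ == "__main__":
    ipc = steady_state_ipc([(1000, 500), (1000, 300)])
    cycles = estimate_cycles(1_000_000, ipc,
                             MissEvents(100, 50, 20),
                             Penalties(15, 10, 200))
    print(ipc, cycles)
```

In this sketch, only the sampled windows pay the cost of scheduling analysis. Every other instruction contributes to the estimate through simple counters, which is the source of the speedup.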

The following section provides a brief overview of processor timing models and existing QEMU-based simulators. The proposed technique is described in detail in Section 3 and evaluated in Section 4. Finally, we summarize and conclude in Section 5.

Section snippets

Processor timing models

Over the years, several timing simulation techniques have been proposed that guarantee an adequate level of accuracy while offering better performance than a cycle-accurate simulator. Extending the terminology in [1], we summarize the representative timing simulation techniques as follows:

  • k-CPI simulation assumes that it takes k cycles to execute one instruction. The number of cycles is given by one cycle for all instructions (1-CPI model) or given according to the datasheet
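A k-CPI model of this kind can be sketched in a few lines. The instruction classes and the per-class cycle table below are hypothetical placeholders for values that would come from a datasheet; they do not describe any particular processor.

```python
from collections import Counter


def k_cpi_cycles(instruction_classes, cpi_table=None, default_cpi=1):
    """Estimate cycles for a trace of instruction class names.

    With no table, every instruction costs `default_cpi` cycles (1-CPI model).
    With a table, each class costs its listed cycles, and classes missing
    from the table fall back to `default_cpi`.
    """
    counts = Counter(instruction_classes)
    if cpi_table is None:
        return default_cpi * sum(counts.values())
    return sum(cpi_table.get(cls, default_cpi) * n for cls, n in counts.items())


if __name__ == "__main__":
    trace = ["alu", "alu", "load", "mul", "branch"]
    # Hypothetical per-class costs standing in for datasheet values.
    table = {"alu": 1, "load": 2, "mul": 3, "branch": 1}
    print(k_cpi_cycles(trace))         # 1-CPI model
    print(k_cpi_cycles(trace, table))  # datasheet-style table
```

The model is cheap because it needs only instruction counts, but it ignores overlap between instructions and all miss events.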

Overview

In this paper, we propose a combined analytical/sampled timing simulation technique, which is independent of the specific functional simulator. As the processor timing model, we adopt an analytical modeling technique, the interval simulation technique [12]. Essential parameters in the formula are estimated through sampled simulation with a trace analyzer, while the other parameters are obtained from architecture specification and functional simulation. As shown in Fig. 3 with rectangle boxes,

Experimental results

Simulation target system: To evaluate the proposed technique, we first configure the simulation target system to match the Cortex A15 processor as closely as possible. Table 1 shows the system characteristics with key parameter values. By modifying some parameters, we evaluate the adaptability of the proposed simulation technique, comparing its accuracy against the reference simulator.

Reference simulator / simulation error: For the reference simulator to compare the accuracy and speed

Conclusion

In this paper, we propose a fast timed-simulation technique supporting modern superscalar out-of-order processors. The simulator is built on QEMU, an open-source machine emulator. For timing estimation, we use a novel combined analytical/sampled method that computes the simulation cycles analytically from the steady-state IPC, which is obtained by a scheduling analysis of sampled traces. QEMU is extended with a cache simulator, a branch predictor, and the trace analyzer with

Acknowledgment

This work was supported by Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.

Shin-haeng Kang received his B.S. degree in computer science and engineering from Seoul National University in 2010. He is currently enrolled in the integrated M.S.-Ph.D. program at Seoul National University. His research interests include design and simulation methodologies for parallel embedded systems.

References (41)

  • J.J. Yi et al. The future of simulation: a field of dreams. Computer (2006)
  • C.-H. Tu et al. MCEmu: a framework for software development and performance analysis of multicore systems. ACM Trans. Des. Autom. Electron. Syst. (TODAES) (2012)
  • J.E. Miller et al. Graphite: a distributed parallel simulator for multicores. Proceedings of the 16th IEEE International Symposium on High Performance Computer Architecture (HPCA) (2010)
  • D. Yun et al. Simulation environment configuration for parallel simulation of multicore embedded systems. Proceedings of the 48th Design Automation Conference (DAC) (2011)
  • T.E. Carlson et al. Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (2011)
  • J. Wang et al. Manifold: a parallel simulation framework for multicore systems. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2014)
  • D. Sanchez et al. ZSim: fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Computer Architecture News (2013)
  • OVP homepage. URL...
  • N. Romdan. Arm fastmodels–virtual platforms for embedded software development. Inf. Quart. Mag. (2008)
  • T.S. Karkhanis et al. A first-order superscalar processor model. ACM SIGARCH Computer Architecture News (2004)
  • S. Eyerman et al. Efficient design space exploration of high performance embedded out-of-order processors. Proceedings of the Conference on Design, Automation and Test in Europe (2006)
  • D. Genbrugge et al. Interval simulation: raising the level of abstraction in architectural simulation. Proceedings of the 16th IEEE International Symposium on High Performance Computer Architecture (HPCA) (2010)
  • E. İpek et al. Efficiently exploring architectural design spaces via predictive modeling. (2006)
  • T.M. Conte et al. Reducing state loss for effective trace sampling of superscalar processors. Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD) (1996)
  • R.E. Wunderlich et al. SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling. Proceedings of the 30th Annual International Symposium on Computer Architecture (2003)
  • T. Sherwood et al. Automatically characterizing large scale program behavior. ACM SIGARCH Computer Architecture News (2002)
  • S. Nussbaum et al. Modeling superscalar processors via statistical simulation. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (2001)
  • L. Eeckhout et al. Statistical simulation: adding efficiency to the computer designer’s toolbox. IEEE Micro (2003)
  • D. Chiou et al. FPGA-accelerated simulation technologies (FAST): fast, full-system, cycle-accurate simulators. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (2007)
  • J. Wawrzynek et al. RAMP: research accelerator for multiple processors. IEEE Micro (2007)

Donghoon Yoo is a technical manager of GPU software at Samsung Electronics. His research interests include processor architecture, device drivers, compilers, simulators, and graphics and compute APIs. He has a PhD degree in computer engineering from Gwangju Institute of Science and Technology.

Soonhoi Ha is a full professor in the School of Computer Science and Engineering at Seoul National University. From 1993 to 1994, he worked for Hyundai Electronics Industries Corporation. He received his Bachelors (1985) and Masters (1987) degrees in Electronics Engineering from Seoul National University, and his PhD (1992) degree in Electrical Engineering and Computer Science from University of California, Berkeley. He has worked on the Ptolemy project and the PeaCE (development of a HW/SW codesign environment) project. Currently he is leading the HOPES (development of an embedded S/W design environment for MPSoC) project. His research interests include hardware-software codesign, design methodology for embedded systems and embedded S/W. He is a senior member of the IEEE Computer Society.
