TQSIM: A fast cycle-approximate processor simulator based on QEMU
Introduction
Simulation is one of the most critical techniques for supporting computer architecture design and application development without real hardware [1]. Cycle-accurate simulators have been widely used to validate a target architecture at the cycle level. A cycle-accurate simulator aims to predict the performance of the target architecture precisely by updating the values of all state elements of the machine at every clock cycle. Unfortunately, such high accuracy comes at the price of high development cost and long simulation time: single-core cycle-accurate simulators typically run at 0.01 to 0.3 million instructions per second (MIPS), which means that simulating a couple of minutes of target execution would take several days. Moreover, simulation speed degrades further as the architectural complexity and the number of processors in the system increase.
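As a rough illustration of this arithmetic, the sketch below converts a simulator throughput in MIPS into wall-clock time. The target speed (a 1 GHz core retiring one instruction per cycle) is an assumption made only for this example; only the 0.01 to 0.3 MIPS range comes from the text.

```python
def simulation_days(simulated_seconds, target_ips, simulator_mips):
    """Wall-clock days needed to simulate `simulated_seconds` of target time."""
    instructions = simulated_seconds * target_ips          # work to be simulated
    wall_seconds = instructions / (simulator_mips * 1e6)   # host time required
    return wall_seconds / 86_400                           # seconds per day

# Assumed target (hypothetical): 1 GHz core, one instruction per cycle.
TARGET_IPS = 1e9

for mips in (0.01, 0.3):
    days = simulation_days(120, TARGET_IPS, mips)
    print(f"{mips} MIPS -> {days:.1f} days for 2 simulated minutes")
```

Under these assumptions, the faster end of the range already needs several days of host time, and the slower end needs months.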
Simulation speed is sometimes prioritized over simulation accuracy. One case is microprocessor or system-level design space exploration: to compare the performance of several candidate architectures, fast simulation is necessary to cover all possible candidates. Another case is software development in a hardware/software codesign methodology, where an application program has to be executed repeatedly for development and debugging, so simulation speed is a crucial factor in software design productivity. However, a certain level of accuracy must still be guaranteed even in those cases. For example, accurate timing information about the events of each processor is essential to model communication and synchronization in a multi-processor system.
Although recent CPU designs rely on multicore/multiprocessor designs to expose task-level parallelism, reusing existing processor simulators with minor modifications can save considerable development time and effort. Hence, many existing parallel simulation frameworks, such as MCEmu [2], Graphite [3], and HSIM [4], integrate existing single-processor simulators or instruction set simulators (ISSs) to perform multicore/multiprocessor simulation. Improved single-core simulation performance is therefore key to making the simulation of larger multicore systems viable. Balancing simulation speed against cycle accuracy becomes more challenging as the number of processors increases.
To serve those cases, in this paper we propose a fast cycle-approximate simulation technique that supports modern superscalar out-of-order processors and increases simulation speed with low accuracy loss. Like other modern simulators [5], [6], [7], the proposed simulator is designed in two parts: a front-end provides functionally correct execution of the guest application, and a back-end provides the timing models and records data. For the front-end, we select one of the most popular and portable open-source emulators, QEMU. For the back-end, we combine analytical and sampled simulation techniques in a novel way. The baseline technique is an analytical simulation that models the processor timing with a simple formula. On top of that, scheduling analysis of sampled traces improves timing accuracy by providing more accurate parameters for the analytical formula. Since the sampling-based approach reduces the overhead of the scheduling analysis, we achieve faster simulation speed at minimal loss of accuracy. Since the proposed technique is implemented on top of QEMU, it is named TQSIM (Timed QEMU-based Simulator). Experiments show that TQSIM is one or two orders of magnitude faster than a cycle-accurate simulator while maintaining high timing accuracy (average error of approximately 8% with MiBench programs). The simulator is only eight times slower than the baseline functional simulator, QEMU.
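The idea of estimating a timing parameter from sampled traces can be sketched with a toy example. The code below is not the paper's trace analyzer: the trace format `(dest, srcs, latency)`, the idealized scheduler (perfect register renaming, no reorder-buffer limit), and all function names are assumptions made for illustration. It schedules only every `period`-th window of a trace and derives an IPC estimate from those windows.

```python
from collections import defaultdict

def window_cycles(window, issue_width):
    """Cycles needed to execute one trace window on an idealized OoO core.

    Each entry is (dest_register, source_registers, latency_in_cycles).
    """
    ready = {}                   # register -> cycle at which its value is ready
    issued = defaultdict(int)    # cycle -> number of instructions issued
    last_done = 0
    for dest, srcs, latency in window:
        # Earliest cycle at which all source operands are available.
        cycle = max((ready.get(r, 0) for r in srcs), default=0)
        # Respect the issue width: move to the next cycle with a free slot.
        while issued[cycle] >= issue_width:
            cycle += 1
        issued[cycle] += 1
        done = cycle + latency
        if dest is not None:
            ready[dest] = done
        last_done = max(last_done, done)
    return last_done

def sampled_ipc(trace, window_size, period, issue_width):
    """Estimate IPC by scheduling one window out of every `period` windows."""
    windows = [trace[i:i + window_size] for i in range(0, len(trace), window_size)]
    sampled = [w for w in windows[::period] if len(w) == window_size]
    if not sampled:
        raise ValueError("trace is shorter than one window")
    instructions = sum(len(w) for w in sampled)
    cycles = sum(window_cycles(w, issue_width) for w in sampled)
    return instructions / cycles

# Hypothetical trace: independent instructions alternating with a dependent chain.
independent = [(f"r{i}", (), 1) for i in range(4)]
chain = [("r1", (), 1), ("r2", ("r1",), 1), ("r3", ("r2",), 1), ("r4", ("r3",), 1)]
trace = independent + chain + independent + chain

print(sampled_ipc(trace, window_size=4, period=1, issue_width=2))  # every window
print(sampled_ipc(trace, window_size=4, period=2, issue_width=2))  # every 2nd window
```

In this toy trace the sampled estimate differs from the full-trace value because the sampled windows happen to contain only independent instructions, which illustrates the speed/accuracy trade-off that the choice of sampling policy controls.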
The following section provides a brief overview of various processor timing models and existing QEMU-based simulators. The proposed technique is described in detail in Section 3 and evaluated in Section 4. Finally, we summarize and conclude in Section 5.
Processor timing models
Over the years, several timing simulation techniques have been proposed to guarantee an adequate level of accuracy while offering better performance than a cycle-accurate simulator. Extending the terminology in [1], we summarize the representative timing simulation techniques as follows:
- k-CPI simulation assumes that it takes k cycles to execute one instruction. The number of cycles is given by one cycle for all instructions (1-CPI model) or given according to the datasheet
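A k-CPI model can be sketched in a few lines. The opcode names and latency values below are hypothetical stand-ins for datasheet figures, and the function name is an assumption for this example.

```python
def k_cpi_cycles(opcodes, latency_table=None, k=1):
    """Estimate cycles for an instruction stream under a k-CPI model.

    Without a table, every instruction costs k cycles (k=1 gives the 1-CPI
    model). With a table, each opcode costs its listed latency, and opcodes
    missing from the table fall back to k.
    """
    if latency_table is None:
        return k * len(opcodes)
    return sum(latency_table.get(op, k) for op in opcodes)

trace = ["add", "ldr", "mul", "add", "str"]
# Hypothetical per-opcode latencies standing in for datasheet values.
datasheet = {"add": 1, "mul": 3, "ldr": 2, "str": 2}

print(k_cpi_cycles(trace))             # 1-CPI model
print(k_cpi_cycles(trace, datasheet))  # datasheet-based latencies
```

Because the estimate depends only on the instruction count or opcode mix, it is very cheap to compute but ignores instruction overlap, cache misses, and branch mispredictions.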
Overview
In this paper, we propose a combined analytical/sampled timing simulation technique, which is independent of the specific functional simulator. As the processor timing model, we adopt an analytical modeling technique, the interval simulation technique [12]. Essential parameters in the formula are estimated through sampled simulation with a trace analyzer, while the other parameters are obtained from architecture specification and functional simulation. As shown in Fig. 3 with rectangle boxes,
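A simplified first-order form of an interval-style analytical model is sketched below: a base term from the steady-state IPC plus a penalty term per miss event. This is an illustrative approximation and not necessarily the exact formula used in the paper; the event names, counts, and penalty values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MissEvents:
    count: int              # number of events observed during functional simulation
    penalty_cycles: float   # assumed stall cycles charged per event

def interval_cycles(num_instructions, steady_state_ipc, miss_events):
    """First-order cycle estimate: steady-state execution plus miss penalties."""
    base = num_instructions / steady_state_ipc
    stalls = sum(e.count * e.penalty_cycles for e in miss_events.values())
    return base + stalls

# Hypothetical event counts and penalties for a run of one million instructions.
events = {
    "branch_mispredict": MissEvents(count=1_000, penalty_cycles=15),
    "l1i_miss": MissEvents(count=200, penalty_cycles=10),
    "l2_miss": MissEvents(count=50, penalty_cycles=100),
}

cycles = interval_cycles(1_000_000, steady_state_ipc=2.0, miss_events=events)
print(f"estimated cycles: {cycles:.0f}")
```

In such a model, the event counts would come from functional simulation with cache and branch-predictor models, the penalties from the architecture specification, and the steady-state IPC from the sampled analysis.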
Experimental results
Simulation target system: To evaluate the proposed technique, we first configure the simulation target system to match the Cortex A15 processor as closely as possible. Table 1 shows the system characteristics with key parameter values. By modifying some parameters, we evaluate the adaptability of the proposed simulation technique in terms of accuracy, compared with the reference simulator.
Reference simulator / simulation error: For the reference simulator to compare the accuracy and speed
Conclusion
In this paper, we propose a fast timed-simulation technique supporting modern superscalar out-of-order processors. The simulator is built on QEMU, an open-source machine emulator. For timing estimation, we use a novel combined analytical/sampled method that computes the simulated cycles analytically from the steady-state IPC, which is obtained by scheduling analysis of sampled traces. QEMU is extended with a cache simulator, a branch predictor, and the trace analyzer with
Acknowledgment
This work was supported by Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.
Shin-haeng Kang received his B.S. degree in computer science and engineering from Seoul National University in 2010. Currently, he is in the Ph.D course of the M.S.-Ph.D integrated program at Seoul National University. His research interests include design and simulation methodologies for parallel embedded systems.
References (41)
- et al. The future of simulation: a field of dreams. Computer (2006)
- et al. Mcemu: a framework for software development and performance analysis of multicore systems. ACM Trans. Des. Autom. Electron. Syst. (TODAES) (2012)
- et al. Graphite: a distributed parallel simulator for multicores. Proceedings of the 16th IEEE International Symposium on High Performance Computer Architecture (HPCA) (2010)
- et al. Simulation environment configuration for parallel simulation of multicore embedded systems. Proceedings of the 48th Design Automation Conference (DAC) (2011)
- et al. Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (2011)
- et al. Manifold: a parallel simulation framework for multicore systems. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (2014)
- et al. Zsim: fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Computer Architecture News (2013)
- Ovp homepage. URL...
- Arm fastmodels–virtual platforms for embedded software development. Inf. Quart. Mag. (2008)
- et al. A first-order superscalar processor model. ACM SIGARCH Computer Architecture News (2004)
- Efficient design space exploration of high performance embedded out-of-order processors. Proceedings of the Conference on Design, Automation and Test in Europe
- Interval simulation: raising the level of abstraction in architectural simulation. Proceedings of the 16th IEEE International Symposium on High Performance Computer Architecture (HPCA)
- Efficiently exploring architectural design spaces via predictive modeling
- Reducing state loss for effective trace sampling of superscalar processors. Proceedings of the IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD)
- Smarts: accelerating microarchitecture simulation via rigorous statistical sampling. Proceedings of the 30th Annual International Symposium on Computer Architecture
- Automatically characterizing large scale program behavior. ACM SIGARCH Comput. Archit. News
- Modeling superscalar processors via statistical simulation. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques
- Statistical simulation: adding efficiency to the computer designer’s toolbox. IEEE Micro
- Fpga-accelerated simulation technologies (fast): fast, full-system, cycle-accurate simulators. Proceedings of the 40th Annual IEEE/ACM international Symposium on Microarchitecture
- Ramp: Research accelerator for multiple processors. IEEE Micro
Donghoon Yoo is a technical manager of GPU software at Samsung Electronics. His research interests include processor architecture, device drivers, compilers, simulators, and graphics and compute APIs. He has a PhD in computer engineering from Gwangju Institute of Science and Technology.
Soonhoi Ha is a full professor in the School of Computer Science and Engineering at Seoul National University. From 1993 to 1994, he worked for Hyundai Electronics Industries Corporation. He received his Bachelor's (1985) and Master's (1987) degrees in Electronics Engineering from Seoul National University, and his PhD (1992) in Electrical Engineering and Computer Science from the University of California, Berkeley. He has worked on the Ptolemy project and the PeaCE (development of a HW/SW codesign environment) project. Currently he is leading the HOPES (development of an embedded S/W design environment for MPSoC) project. His research interests include hardware-software codesign, design methodology for embedded systems, and embedded S/W. He is a senior member of the IEEE Computer Society.