ABSTRACT
Current computer architectures (ARM, MIPS, PowerPC, SPARC, x86) have each evolved from a 32-bit architecture to a 64-bit one. Computer architects often consider whether hardware support for a subset of the instruction set could be eliminated so as to reduce hardware complexity, which could improve performance, reduce power usage and accelerate processor development. This paper considers the scenario where we want to eliminate 32-bit hardware support from the ARMv8 architecture.
Dynamic binary translation can be used for this purpose and generally comes in one of two forms: application-level translators, which translate a single user-mode process on top of a native operating system, and system-level translators, which translate an entire operating system and all its processes.
Application-level translators can have good performance but are not totally transparent; system-level translators may be 100% compatible but their performance suffers. HyperMAMBO-X64 uses a new approach that gets the best of both worlds: it runs the translator as an application under the hypervisor while still reacting to the behavior of guest operating systems. It is completely transparent to the virtualized system while delivering performance close to that of hardware execution.
A key factor in the low overhead of HyperMAMBO-X64 is its deep integration with the virtualization and memory management features of ARMv8. These are exploited to support caching of translations across multiple address spaces while ensuring that translated code remains consistent with the source instructions it is based on. We show how these attributes are achieved without sacrificing either performance or accuracy.
HyperMAMBO-X64: Using Virtualization to Support High-Performance Transparent Binary Translation. VEE '17.