ABSTRACT
While profile guided optimizations (PGO) and link time optimizations (LTO) have been widely adopted, post link optimizations (PLO) have languished until recently, when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To reconcile post link optimizations with a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and by leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google, with tens of millions of cores executing Propeller-optimized code at any time. An evaluation of internal warehouse-scale applications shows that Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7%, while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30% to 70% on large benchmarks.
Index Terms
- Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications