DOI: 10.1145/3575693.3575727 (ACM, ASPLOS conference proceedings)

Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications

Published: 30 January 2023

ABSTRACT

While profile guided optimizations (PGO) and link time optimizations (LTO) have been widely adopted, post link optimizations (PLO) have languished until recently, when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To enable post link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and by leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google, with tens of millions of cores executing Propeller-optimized code at any time. An evaluation of internal warehouse-scale applications shows that Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7%, while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30% to 70% on large benchmarks.
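To make the idea of basic block sections concrete, the following is a minimal toy sketch, not Propeller's actual layout algorithm. It assumes each basic block lives in its own reorderable section and carries an execution count from a profile, then places hot blocks first and cold blocks last. The names `BlockSection`, `layout_order`, and the `function.bbN` labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSection:
    """Hypothetical stand-in for one basic block placed in its own section."""
    function: str
    block_id: int
    exec_count: int  # assumed to come from a sampled profile


def layout_order(sections, hot_threshold=1):
    """Return section labels with hot blocks first, cold blocks last.

    Toy heuristic: hot blocks are sorted by descending execution count;
    cold blocks keep their original relative order. A real optimizer
    would use a far more careful layout algorithm.
    """
    hot = [s for s in sections if s.exec_count >= hot_threshold]
    cold = [s for s in sections if s.exec_count < hot_threshold]
    hot.sort(key=lambda s: -s.exec_count)  # stable sort keeps ties in input order
    return [f"{s.function}.bb{s.block_id}" for s in hot + cold]


sections = [
    BlockSection("foo", 0, 100),
    BlockSection("foo", 1, 0),
    BlockSection("foo", 2, 90),
    BlockSection("bar", 0, 95),
    BlockSection("bar", 1, 0),
]
order = layout_order(sections)
print(order)
```

Because every block is an independent section in this sketch, the ordering can interleave blocks from different functions and push never-executed blocks to the end. An ordering of this kind is something a linker could apply during a relink, with no need to disassemble the binary.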
