research-article

Open Access

Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications

Authors:
Han Shen (Google, USA)
Krzysztof Pszeniczny (Google, Switzerland)
Rahman Lavaee (Google, USA)
Snehasish Kumar (Google, USA)
Sriraman Tallam (Google, USA)
Xinliang David Li (Google, USA)

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 2023, Pages 617–631. https://doi.org/10.1145/3575693.3575727

Published: 30 January 2023

ABSTRACT

While profile-guided optimizations (PGO) and link-time optimizations (LTO) have been widely adopted, post-link optimizations (PLO) languished until recently, when researchers demonstrated that late injection of profiles can yield significant performance improvements. However, the disassembly-driven, monolithic design of post-link optimizers faces scaling challenges with large binaries and is at odds with distributed build systems. To enable post-link optimizations within a distributed build environment, we propose Propeller, a relinking optimizer for warehouse-scale workloads. To enable flexible code layout optimizations, we introduce basic block sections, a novel linker abstraction. Propeller uses basic block sections to enable a new approach to PLO without disassembly. Propeller achieves scalability by relinking the binary using precise profiles instead of rewriting the binary. The overhead of relinking is lowered by caching and by leveraging distributed compiler actions during code generation. Propeller has been deployed to production at Google, with tens of millions of cores executing Propeller-optimized code at any time. An evaluation of internal warehouse-scale applications shows that Propeller improves performance by 1.1% to 8% beyond PGO and ThinLTO. Compiler tools such as Clang improve by 7%, while MySQL improves by 1%. Compared to the state-of-the-art binary optimizer, Propeller achieves comparable performance while lowering memory overheads by 30% to 70% on large benchmarks.
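To make the idea of profile-driven code layout concrete, the following is a minimal sketch of a greedy chain-merging heuristic over basic blocks. It is only an illustration under stated assumptions: the abstract does not specify Propeller's layout algorithm, and the function name `layout_blocks`, the block names, and the edge counts are all hypothetical. The sketch assumes that each basic block can be placed independently, which is the kind of freedom basic block sections are meant to give the linker.

```python
from typing import Dict, List, Tuple


def layout_blocks(blocks: List[str],
                  edge_counts: Dict[Tuple[str, str], int]) -> List[str]:
    """Toy greedy layout (not Propeller's algorithm); blocks[0] is the entry."""
    # Every block starts in its own chain; merged blocks share one list object.
    chain_of = {b: [b] for b in blocks}
    # Visit the hottest edges first; ties are broken by name for determinism.
    for (src, dst), _count in sorted(edge_counts.items(),
                                     key=lambda kv: (-kv[1], kv[0])):
        a, b = chain_of[src], chain_of[dst]
        # Merge only when src ends its chain and dst starts a different chain,
        # so the hot edge becomes a fall-through. The entry block stays first.
        if a is b or a[-1] != src or b[0] != dst or dst == blocks[0]:
            continue
        a.extend(b)
        for blk in b:
            chain_of[blk] = a
    # Emit the entry chain first, then the other chains in original order.
    ordered: List[str] = []
    seen = set()
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            ordered.extend(chain)
    return ordered


# Hypothetical profile: the error path is rarely taken.
blocks = ["entry", "cold_err", "hot_loop", "exit"]
edges = {
    ("entry", "hot_loop"): 1000,
    ("entry", "cold_err"): 1,
    ("hot_loop", "hot_loop"): 900,
    ("hot_loop", "exit"): 990,
    ("cold_err", "exit"): 1,
}
print(layout_blocks(blocks, edges))
```

With this made-up profile, the hot path `entry`, `hot_loop`, `exit` ends up contiguous and `cold_err` moves to the end. In a relinking design, an ordering like this would be handed to the linker as a section order rather than applied by rewriting the binary.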

Index Terms

Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications
1. Information systems
  1. Data management systems
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Retargetable compilers

Index terms have been assigned to the content through auto-classification.

Published in
ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
January 2023
947 pages
ISBN:9781450399166
DOI:10.1145/3575693
General Chair: Tor M. Aamodt (University of British Columbia, Canada)
Program Chairs: Natalie Enright Jerger (University of Toronto, Canada), Michael Swift (University of Wisconsin-Madison, USA)
Copyright © 2023 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Binary Optimization
Datacenters
Distributed Build System
Post-Link Optimization
Profile Guided Optimization
Warehouse-Scale Applications
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 535 of 2,713 submissions (20%)