DOI: 10.1145/3593856.3595911
research-article
Open Access

NextGen-Malloc: Giving Memory Allocator Its Own Room in the House

Published: 22 June 2023

ABSTRACT

Memory allocation and management have a significant impact on performance and energy of modern applications. We observe that performance can vary by as much as 72% in some applications based on which memory allocator is used. Many current allocators are multi-threaded to support concurrent allocation requests from different threads. However, such multi-threading comes at the cost of maintaining complex metadata that is tightly coupled and intertwined with user data. When memory management functions and other user programs run on the same core, the metadata used by management functions may pollute the processor caches and other resources.

In this paper, we make a case for offloading memory allocation (and other similar management functions) from main processing cores to other processing units to boost performance, reduce energy consumption, and customize services to specific applications or application domains. To offload these multi-threaded fine-granularity functions, we propose to decouple the metadata of these functions from the rest of application data to reduce the overhead of inter-thread metadata synchronization. We draw attention to the following key questions to realize this opportunity: (a) What are the tradeoffs and challenges in offloading memory allocation to a dedicated core? (b) Should we use general-purpose cores or special-purpose cores for executing critical system management functions? (c) Can this methodology apply to heterogeneous systems (e.g., with GPUs, accelerators) and other service functions as well?
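The offloading idea can be illustrated with a minimal sketch. This is not the paper's design: `AllocatorService`, its fixed-size blocks, and the queue-based request path are all assumptions made for illustration. A single dedicated thread stands in for the dedicated core, and it is the only code that touches the allocator metadata (the free list), so application threads need no locks on that metadata and never load it themselves.

```python
import queue
import threading


class AllocatorService:
    """Toy fixed-size-block allocator (hypothetical, for illustration only).

    The free list is the allocator's metadata. It is kept apart from user
    data and is read and written only by the dedicated service thread.
    """

    def __init__(self, arena_size, block_size=64):
        self.block_size = block_size
        # Metadata: offsets of free blocks inside a notional arena.
        self._free = list(range(0, arena_size, block_size))
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def _serve(self):
        # Runs on the dedicated thread; sole owner of self._free.
        while True:
            op, arg, reply = self._requests.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "malloc":
                reply.put(self._free.pop() if self._free else None)
            elif op == "free":
                self._free.append(arg)
                reply.put(None)

    def _call(self, op, arg=None):
        # Application threads only exchange messages with the service.
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, arg, reply))
        return reply.get()

    def malloc(self):
        """Return the offset of a free block, or None if the arena is full."""
        return self._call("malloc")

    def free(self, offset):
        """Hand a block back to the service."""
        self._call("free", offset)

    def stop(self):
        """Shut the service thread down and wait for it to exit."""
        self._call("stop")
        self._worker.join()
```

Each call pays for a round trip through the request queue, which gives a rough feel for the communication cost that question (a) in the abstract asks about when weighing a dedicated core against cache pollution on the application cores.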


Published in

HOTOS '23: Proceedings of the 19th Workshop on Hot Topics in Operating Systems
June 2023
247 pages
ISBN: 9798400701955
DOI: 10.1145/3593856

Copyright © 2023 Owner/Author(s)

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States
