NextGen-Malloc: Giving Memory Allocator Its Own Room in the House

Authors:
Ruihao Li
Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, United States
https://orcid.org/0000-0002-7092-2401

Qinzhe Wu
Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
https://orcid.org/0000-0002-7988-1431

Krishna Kavi
Computer Science and Engineering, University of North Texas, Denton, TX, United States
https://orcid.org/0000-0003-1581-8166

Gayatri Mehta
Computer Science and Engineering, University of North Texas, Denton, TX, USA
https://orcid.org/0000-0001-7754-1874

Neeraja J. Yadwadkar
Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, United States
VMware Research, Palo Alto, CA, USA
https://orcid.org/0009-0007-7556-3069

Lizy K. John
Electrical and Computer Engineering, The University of Texas at Austin, Austin, United States
https://orcid.org/0000-0002-8747-5214

HOTOS '23: Proceedings of the 19th Workshop on Hot Topics in Operating Systems, June 2023, Pages 135–142
https://doi.org/10.1145/3593856.3595911

Published: 22 June 2023

ABSTRACT

Memory allocation and management have a significant impact on the performance and energy consumption of modern applications. We observe that performance can vary by as much as 72% in some applications depending on which memory allocator is used. Many current allocators are multi-threaded to support concurrent allocation requests from different threads. However, such multi-threading comes at the cost of maintaining complex metadata that is tightly coupled and intertwined with user data. When memory management functions and user programs run on the same core, the metadata used by the management functions may pollute the processor caches and other resources.

In this paper, we make a case for offloading memory allocation (and other similar management functions) from main processing cores to other processing units to boost performance, reduce energy consumption, and customize services to specific applications or application domains. To offload these multi-threaded fine-granularity functions, we propose to decouple the metadata of these functions from the rest of application data to reduce the overhead of inter-thread metadata synchronization. We draw attention to the following key questions to realize this opportunity: (a) What are the tradeoffs and challenges in offloading memory allocation to a dedicated core? (b) Should we use general-purpose cores or special-purpose cores for executing critical system management functions? (c) Can this methodology apply to heterogeneous systems (e.g., with GPUs, accelerators) and other service functions as well?
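
The offloading idea in the abstract can be illustrated with a minimal sketch. This is not the authors' design: it assumes a hypothetical `AllocatorService` class whose worker thread stands in for the dedicated core, a first-fit free list without coalescing, and a Python `bytearray` as the arena. It shows only the structure the abstract argues for: allocator metadata is kept apart from user data and is touched by a single thread, so callers need no metadata synchronization beyond the request queue. Python threads are not pinned to cores, so nothing here measures the performance effects discussed in the paper.

```python
import queue
import threading


class AllocatorService:
    """Hypothetical allocator served by one dedicated thread (illustrative only)."""

    def __init__(self, arena_size):
        self.arena = bytearray(arena_size)   # user data only
        self._free = [(0, arena_size)]       # metadata: (offset, length) holes
        self._live = {}                      # metadata: offset -> block length
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Only this thread reads or writes the metadata structures.
        while True:
            op, arg, reply = self._requests.get()
            if op == "stop":
                reply.put(None)
                return
            if op == "malloc":
                reply.put(self._do_malloc(arg))
            else:  # "free"
                self._do_free(arg)
                reply.put(None)

    def _do_malloc(self, size):
        # First-fit search over the free list; returns None when nothing fits.
        for i, (off, length) in enumerate(self._free):
            if length >= size:
                if length == size:
                    del self._free[i]
                else:
                    self._free[i] = (off + size, length - size)
                self._live[off] = size
                return off
        return None

    def _do_free(self, off):
        # Unknown offsets are ignored; adjacent holes are not coalesced.
        size = self._live.pop(off, None)
        if size is not None:
            self._free.append((off, size))

    def _call(self, op, arg=None):
        # Each request carries its own reply queue, so callers block only
        # on their own result.
        reply = queue.Queue(maxsize=1)
        self._requests.put((op, arg, reply))
        return reply.get()

    def malloc(self, size):
        return self._call("malloc", size)

    def free(self, off):
        self._call("free", off)

    def stop(self):
        self._call("stop")
        self._thread.join()
```

A caller would use `off = svc.malloc(16)` and then write into `svc.arena[off:off + 16]`; the returned offset is the only piece of allocator state that crosses back to the application thread.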

Index Terms

1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management

Published in
HOTOS '23: Proceedings of the 19th Workshop on Hot Topics in Operating Systems
June 2023
247 pages
ISBN: 9798400701955
DOI: 10.1145/3593856
General Chair: Malte Schwarzkopf
Program Chairs: Andrew Baumann, Natacha Crooks
Copyright © 2023 Owner/Author(s)
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: memory management
Qualifiers: research-article