Scale-out NUMA

Authors:
Stanko Novakovic, EPFL, Lausanne, Switzerland
Alexandros Daglis, EPFL, Lausanne, Switzerland
Edouard Bugnion, EPFL, Lausanne, Switzerland
Babak Falsafi, EPFL, Lausanne, Switzerland
Boris Grot, University of Edinburgh, Edinburgh, United Kingdom

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1, March 2014, pp. 3–18. https://doi.org/10.1145/2654822.2541965

Published: 24 February 2014


Abstract

Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA) -- an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller -- a new architecturally-exposed hardware block integrated into the node's local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.
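The headline figures in the abstract can be turned into rough latency numbers with a little arithmetic. The sketch below is only a back-of-the-envelope reading, not a result from the paper: `LOCAL_DRAM_NS` is an assumed, illustrative local DRAM latency that the abstract does not state, and the function names are hypothetical. Only the 4x bound, the 10-1000x range, and the 10M operations per second per core come from the abstract.

```python
# Back-of-the-envelope arithmetic on the abstract's headline figures.
# LOCAL_DRAM_NS is an ASSUMPTION for illustration; the abstract gives no value.

LOCAL_DRAM_NS = 60.0            # assumed local DRAM access latency (hypothetical)
SONUMA_FACTOR_BOUND = 4         # abstract: remote reads within 4x of local DRAM
COMMODITY_FACTORS = (10, 1000)  # abstract: commodity networking is 10-1000x slower
OPS_PER_SEC_PER_CORE = 10e6     # abstract: up to 10M remote ops per second per core


def remote_read_bound_ns(local_ns: float, factor: float = SONUMA_FACTOR_BOUND) -> float:
    """Upper bound on remote read latency implied by the 'within 4x' claim."""
    return local_ns * factor


def commodity_range_ns(local_ns: float) -> tuple[float, float]:
    """Latency range implied by the 10-1000x figure for commodity networking."""
    low, high = COMMODITY_FACTORS
    return (local_ns * low, local_ns * high)


def issue_interval_ns(ops_per_sec: float = OPS_PER_SEC_PER_CORE) -> float:
    """Average time between remote operations issued by one core at peak rate."""
    return 1e9 / ops_per_sec


if __name__ == "__main__":
    print("soNUMA remote read bound (ns):", remote_read_bound_ns(LOCAL_DRAM_NS))
    print("commodity networking range (ns):", commodity_range_ns(LOCAL_DRAM_NS))
    print("per-core issue interval at peak (ns):", issue_interval_ns())
```

Whatever local latency is assumed, the peak issue rate works out to one remote operation roughly every 100 ns per core, and the 4x bound leaves soNUMA below even the low end of the 10-1000x range quoted for commodity networking.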


Index Terms

Scale-out NUMA
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Client-server architectures
2. Information systems
  1. Data management systems
    1. Middleware for databases
      1. Application servers
      2. Database web servers



Published in
ACM SIGARCH Computer Architecture News, Volume 42, Issue 1 (ASPLOS '14)
March 2014, 729 pages
ISSN: 0163-5964
DOI: 10.1145/2654822
Editor: Doug DeGroot

ASPLOS '14: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems
February 2014, 780 pages
ISBN: 9781450323055
DOI: 10.1145/2541940
General Chairs: Rajeev Balasubramonian (University of Utah), Al Davis (University of Utah)
Program Chair: Sarita Adve (University of Illinois at Urbana-Champ)
Copyright © 2014 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
numa
rdma
system-on-chips
Qualifiers
- research-article