Abstract
Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA) -- an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller -- a new architecturally-exposed hardware block integrated into the node's local coherence hierarchy. Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.
Scale-out NUMA
ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems