research-article
Open Access

Scale-out NUMA

Published: 24 February 2014

Abstract

Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match for commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA), an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller, a new architecturally exposed hardware block integrated into the node's local coherence hierarchy. Our results, based on cycle-accurate full-system simulation, show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.
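
To put the headline figures in perspective, the following is a minimal back-of-the-envelope sketch, not taken from the paper. It converts the stated 10M operations per second per core into a mean issue interval, turns the "within 4x of local DRAM" claim into an absolute latency bound, and applies Little's law to estimate how many remote reads a core would need in flight to sustain that rate. The 60 ns local DRAM latency and all function names are assumptions chosen for illustration only.

```python
# Illustrative arithmetic only. LOCAL_DRAM_NS is an assumed value,
# not a number reported in the abstract.

def issue_interval_ns(ops_per_second: float) -> float:
    """Mean time between operations issued by one core, in nanoseconds."""
    return 1e9 / ops_per_second


def remote_latency_bound_ns(local_dram_ns: float, factor: float = 4.0) -> float:
    """Upper bound on remote read latency given a 'within factor x' claim."""
    return local_dram_ns * factor


def ops_in_flight(ops_per_second: float, latency_ns: float) -> float:
    """Little's law: mean concurrency = throughput * latency."""
    return ops_per_second * latency_ns * 1e-9


LOCAL_DRAM_NS = 60.0   # assumption for illustration
OPS_PER_SECOND = 10e6  # "up to 10M remote memory operations per second per core"

if __name__ == "__main__":
    interval = issue_interval_ns(OPS_PER_SECOND)
    bound = remote_latency_bound_ns(LOCAL_DRAM_NS)
    concurrency = ops_in_flight(OPS_PER_SECOND, bound)
    print(f"issue interval: {interval:.0f} ns per operation")
    print(f"remote read latency bound: {bound:.0f} ns")
    print(f"reads in flight at that bound: {concurrency:.1f}")
```

Under these assumptions, a core issues one remote operation roughly every 100 ns, so reaching the peak rate at a latency of a few hundred nanoseconds requires more than one read outstanding at a time. Substituting a different local DRAM latency changes the bound and the concurrency estimate proportionally.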




        • Published in

          ACM SIGARCH Computer Architecture News, Volume 42, Issue 1
          ASPLOS '14
          March 2014
          729 pages
          ISSN: 0163-5964
          DOI: 10.1145/2654822

          ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
          February 2014
          780 pages
          ISBN: 9781450323055
          DOI: 10.1145/2541940

          Copyright © 2014 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Qualifiers

          • research-article
