Abstract
Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may support only 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance.
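The 50% figure can be read as an oversubscription ratio: the host-facing capacity of a switch divided by its uplink capacity. A ratio of 1 means hosts can always use their full link rate; a ratio of 2 means that, in the worst case where every host sends through the uplinks at once, each host gets only half. The minimal sketch below illustrates that arithmetic; the function names and the example port counts (40 hosts at 1 Gbps, two 10 Gbps uplinks) are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def oversubscription_ratio(host_ports: int, host_gbps: int,
                           uplink_ports: int, uplink_gbps: int) -> Fraction:
    """Host-facing bandwidth divided by uplink bandwidth at one switch.

    Hypothetical helper: 1 means no oversubscription, 2 means 2:1.
    """
    downstream = host_ports * host_gbps
    upstream = uplink_ports * uplink_gbps
    return Fraction(downstream, upstream)


def usable_fraction(ratio: Fraction) -> Fraction:
    """Worst-case share of edge bandwidth available when all hosts send
    through the uplinks simultaneously (capped at 1 if undersubscribed)."""
    return min(Fraction(1), 1 / ratio)


# Hypothetical edge switch: 40 x 1 Gbps host ports, 2 x 10 Gbps uplinks.
ratio = oversubscription_ratio(40, 1, 2, 10)
print(f"{ratio}:1 oversubscription, {usable_fraction(ratio)} of edge bandwidth usable")
```

A 2:1 ratio at a single tier is enough to cap hosts at 50% of their edge bandwidth under all-to-all load, which is why the abstract treats it as a significant limitation.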
In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than is available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
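One way commodity switches can be interconnected to reach tens of thousands of hosts without oversubscription is a three-tier k-ary fat-tree, a folded Clos network built entirely from identical k-port switches: k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches. The sketch below assumes that construction; the `FatTree` class and its members are hypothetical names introduced for illustration. It counts hosts and switches and checks that each tier carries as many links as there are hosts, which is what removes oversubscription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FatTree:
    """Element counts for an assumed three-tier k-ary fat-tree
    built from identical k-port switches (hypothetical helper)."""
    k: int

    def __post_init__(self) -> None:
        # Each switch splits its ports evenly between down and up links,
        # so k must be even.
        if self.k < 2 or self.k % 2:
            raise ValueError("k must be an even integer >= 2")

    @property
    def half(self) -> int:
        return self.k // 2

    @property
    def edge_switches(self) -> int:
        return self.k * self.half          # k/2 edge switches in each of k pods

    @property
    def aggregation_switches(self) -> int:
        return self.k * self.half          # k/2 aggregation switches per pod

    @property
    def core_switches(self) -> int:
        return self.half ** 2              # (k/2)^2 core switches

    @property
    def hosts(self) -> int:
        return self.edge_switches * self.half  # k/2 hosts per edge switch

    @property
    def total_switches(self) -> int:
        return self.edge_switches + self.aggregation_switches + self.core_switches

    def links_per_tier(self) -> tuple[int, int, int]:
        """Links between (hosts, edge), (edge, aggregation), (aggregation, core).

        Equal counts at every tier mean a 1:1 (non-oversubscribed) network.
        """
        host_edge = self.hosts
        edge_agg = self.edge_switches * self.half  # k/2 uplinks per edge switch
        agg_core = self.aggregation_switches * self.half  # k/2 uplinks per aggregation switch
        return host_edge, edge_agg, agg_core


tree = FatTree(48)
print(tree.hosts, tree.total_switches, tree.links_per_tier())
```

Under these assumptions, 48-port switches yield k^3/4 = 27,648 hosts from 5k^2/4 = 2,880 switches, with the same number of links at every tier, which is consistent with the abstract's claim of full aggregate bandwidth for tens of thousands of elements.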
A scalable, commodity data center network architecture. SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 conference on Data communication.