Understanding network failures in data centers: measurement, analysis, and implications

Authors:
Phillipa Gill, University of Toronto, Toronto, Canada
Navendu Jain, Microsoft Research, Redmond, WA, USA
Nachiappan Nagappan, Microsoft Research, Redmond, WA, USA

SIGCOMM '11: Proceedings of the ACM SIGCOMM 2011 conference, August 2011, Pages 350–361. https://doi.org/10.1145/2018436.2018477

Published: 15 August 2011

ABSTRACT

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices and links are most unreliable, what causes failures, how failures impact network traffic, and how effective network redundancy is. We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as top-of-rack switches (ToRs) and aggregation switches (AggS) are highly reliable, (3) load balancers dominate in terms of failure occurrences, with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

Index Terms

1. Networks
  1. Network services
    1. Network management

Published in
SIGCOMM '11: Proceedings of the ACM SIGCOMM 2011 conference
August 2011
502 pages
ISBN: 9781450307970
DOI: 10.1145/2018436
General Chairs: Srinivasan Keshav (University of Waterloo, Canada), Jörg Liebeherr (University of Toronto, Canada)
Program Chairs: John Byers (Boston University, USA), Jeffrey Mogul (HP Labs, USA)
ACM SIGCOMM Computer Communication Review Volume 41, Issue 4
SIGCOMM '11
August 2011
480 pages
ISSN: 0146-4833
DOI: 10.1145/2043164
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data centers
network reliability
Qualifiers
- research-article
Acceptance Rates
SIGCOMM '11 Paper Acceptance Rate: 32 of 223 submissions, 14%
Overall Acceptance Rate: 554 of 3,547 submissions, 16%
Article Metrics
- Total Citations: 656
- Total Downloads: 4,838
- Downloads (Last 12 months): 651
- Downloads (Last 6 weeks): 91