ABSTRACT
Avoiding packet loss is crucial for ISPs. Unfortunately, malfunctioning hardware at ISPs can cause long-lasting packet drops, also known as gray failures, which are undetectable by existing monitoring tools.
In this paper, we describe the design and implementation of FANcY, an ISP-targeted system that detects and localizes gray failures quickly and accurately. FANcY complements previous monitoring approaches, which are mainly tailored for low-delay networks such as data center networks and do not work at ISP scale. We experimentally confirm FANcY's capability to accurately detect gray failures in seconds, as long as only tiny fractions of traffic experience losses. We also implement FANcY in an Intel Tofino switch, demonstrating how it enables fine-grained fast rerouting.