DOI: 10.1145/3544216.3544242

FAst in-network GraY failure detection for ISPs

Published: 22 August 2022

ABSTRACT

Avoiding packet loss is crucial for ISPs. Unfortunately, malfunctioning hardware at ISPs can cause long-lasting packet drops, also known as gray failures, which are undetectable by existing monitoring tools.

In this paper, we describe the design and implementation of FANcY, an ISP-targeted system that detects and localizes gray failures quickly and accurately. FANcY complements previous monitoring approaches, which are mainly tailored for low-delay networks such as data center networks and do not work at ISP scale. We experimentally confirm FANcY's capability to accurately detect gray failures in seconds, as long as only tiny fractions of traffic experience losses. We also implement FANcY in an Intel Tofino switch, demonstrating how it enables fine-grained fast rerouting.

          • Published in

            SIGCOMM '22: Proceedings of the ACM SIGCOMM 2022 Conference
            August 2022
            858 pages
            ISBN:9781450394208
            DOI:10.1145/3544216

            Copyright © 2022 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Qualifiers

            • research-article

            Acceptance Rates

            Overall Acceptance Rate: 554 of 3,547 submissions, 16%
