research-article

REINFORCE: achieving efficient failure resiliency for network function virtualization based services

Authors:
Sameer G Kulkarni (University of Göttingen, Germany)
Guyue Liu (George Washington University)
K. K. Ramakrishnan (University of California, Riverside)
Mayutan Arumaithurai (University of Göttingen, Germany)
Timothy Wood (George Washington University)
Xiaoming Fu (University of Göttingen, Germany)

CoNEXT '18: Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, December 2018, Pages 41–53. https://doi.org/10.1145/3281411.3281441

Published: 04 December 2018

ABSTRACT

Ensuring high availability (HA) for software-based networks is a critical design feature that will help the adoption of software-based network functions (NFs) in production networks. It is important for NFs to avoid outages and maintain mission-critical operations. However, HA support for NFs on the critical data path can result in unacceptable performance degradation. We present REINFORCE, an integrated framework to support efficient resiliency for NFs and NF service chains. REINFORCE includes timely failure detection and consistent failover mechanisms. REINFORCE replicates state to standby NFs (local and remote) while enforcing correctness. It minimizes the number of state transfers by exploiting the concept of external synchrony, and leverages opportunistic batching and multi-buffering to optimize performance. Experimental results show that, even at line-rate packet processing (10 Gbps), REINFORCE achieves chain-level failover across servers in a LAN (or within the same node) within 10 ms (100 μs), incurring less than 10% (1%) performance overhead, and adds average latency of only ~400 μs (5 μs), with a worst-case latency of less than 1 ms (10 μs).
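
The abstract names external synchrony and batching without detailing the mechanism. As a rough illustration of the general idea only (not REINFORCE's implementation), the toy sketch below holds output packets until the state changes that produced them have been copied to a standby, so one state transfer covers a whole batch. The class name, the per-flow counter state, the fixed batch size, and the in-process dictionary standing in for a standby NF are all assumptions made for this example.

```python
from collections import deque


class ExternallySynchronousNF:
    """Toy NF (hypothetical): counts packets per flow and releases output
    only after the corresponding state has been copied to a standby."""

    def __init__(self, batch_size=4):
        self.state = {}          # primary's per-flow packet counters
        self.standby_state = {}  # stand-in for the standby NF's copy
        self.dirty = {}          # state updates not yet replicated
        self.held = deque()      # output withheld from the outside world
        self.batch_size = batch_size
        self.transfers = 0       # number of state transfers performed

    def process(self, flow, packet):
        # Update local state and remember that it still needs replication.
        self.state[flow] = self.state.get(flow, 0) + 1
        self.dirty[flow] = self.state[flow]
        # Hold the output instead of emitting it immediately.
        self.held.append((flow, packet))
        if len(self.held) >= self.batch_size:
            return self.commit()
        return []

    def commit(self):
        # One transfer replicates every update made by the held batch.
        if self.dirty:
            self.standby_state.update(self.dirty)
            self.dirty.clear()
            self.transfers += 1
        # Only now is the held output made externally visible.
        released = list(self.held)
        self.held.clear()
        return released


nf = ExternallySynchronousNF(batch_size=3)
first = nf.process("flow-a", "p1")
second = nf.process("flow-a", "p2")
third = nf.process("flow-b", "p3")
print(first, second, third, nf.transfers)
```

In this sketch three packets trigger a single state transfer rather than three, and nothing leaves the NF before the standby holds the matching state; a real system would replace the dictionary copy with a network transfer and an acknowledgement.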

Supplemental Material

p41-kulkarni.mp4 (mp4, 288.8 MB)
p41-kulkarni-1.zip (zip, 5.4 MB)
p41-kulkarni.zip (zip, 232.2 KB)

Index Terms

REINFORCE: achieving efficient failure resiliency for network function virtualization based services
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Availability
2. Networks


Published in
CoNEXT '18: Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies
December 2018
408 pages
ISBN:9781450360807
DOI:10.1145/3281411
General Chairs: Xenofontas Dimitropoulos (University of Crete and FORTH, Greece), Alberto Dainotti (CAIDA, University of California, San Diego)
Program Chairs: Laurent Vanbever (ETH Zurich, Switzerland), Theophilus Benson (Brown University)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Available
Author Tags
availability
fault-tolerance
network functions (NF)
resiliency
service function chains (SFC)
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 198 of 789 submissions, 25%
Article Metrics
- Total Citations: 26
- Total Downloads: 1,255
- Downloads (Last 12 months): 178
- Downloads (Last 6 weeks): 28