Elsevier

Computer Communications

Volume 33, Issue 4, 1 March 2010, Pages 485-499

Analytical characterization of failure recovery in REAP

https://doi.org/10.1016/j.comcom.2009.10.014

Abstract

This paper analytically characterizes the performance of the REAchability Protocol (REAP), a network-layer end-to-end recovery protocol for IPv6. REAP was developed by the IETF SHIM6 Working Group as part of its multihoming solution. The behavior of REAP is governed by a small number of parameters: three timers, a simple characterization of the application traffic, and the communication delay. The key figure of merit of REAP performance is the time to recover from a path failure as seen by the upper layers, a figure that cannot be trivially obtained despite the apparent simplicity of this reachability protocol. In this paper we provide upper bounds for the recovery time of REAP for different deployment scenarios, and apply these analytical results to two case studies, TCP and VoIP traffic.

Introduction

The SHIM6 (Site Multihoming by IPv6 Intermediation) Working Group of the IETF has developed a framework that enables scalable fault tolerance protection for ongoing communications in IPv6 multihomed environments. Since the large address space of IPv6 allows end hosts to configure as many addresses as there are available providers, this framework aims to let a single communication use these different addresses, and therefore different paths. The address agility function is performed by a shim sublayer, named SHIM6, defined inside the IPv6 layer. This SHIM6 sublayer manages the mapping between the addresses exposed to the upper layers, which remain constant during the communication lifetime, and the addresses included in the packets sent on the wire, which may vary and thereby enforce the use of different paths. The SHIM6 protocol [1] creates and manages these mappings between the SHIM6 sublayers of the two nodes involved in the communication.

A fault tolerance solution requires a mechanism to detect failures along the communication path, and a mechanism to discover a valid path after a failure. In particular, the mechanism should provide transport-layer survivability, that is, it should be fully transparent to transport-layer sessions [2]. The SHIM6 Working Group defines such a component, named REAP (REAchability Protocol, [3]), which detects failures in either of the two unidirectional paths in use for a communication, and explores different unidirectional paths to find a valid one after an outage. Note that a bidirectional path is modeled by REAP as two unidirectional paths.

The REAP instance of an endpoint detects a failure by monitoring the packets received for a given communication. When a communication involves a bidirectional exchange of data at a sufficient rate, the availability of the path is determined without exchanging REAP-specific packets. If one of the endpoints is not sending data regularly, or if the rate at which data is being sent is too low, its REAP entity generates Keepalive messages that prevent the timer used to detect failures from expiring at the other end. When neither party sends upper-layer data for some time, REAP stops generating Keepalive messages and failure monitoring is suspended.
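The detection rule described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the REAP specification: the class name `ReapEndpoint`, its methods and the `send_timeout` parameter are hypothetical, time is passed in explicitly instead of using real timers, and Keepalive generation and the Retransmission Timer are left out. It assumes that the Send Timer starts when data is sent while the timer is idle, and that any packet received from the peer (data or Keepalive) stops it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReapEndpoint:
    """Simplified failure detector for one communication (hypothetical names)."""

    send_timeout: float                          # assumed Send Timer duration, in seconds
    send_timer_started: Optional[float] = None   # None means the timer is not running

    def on_data_sent(self, now: float) -> None:
        # Start the Send Timer only if it is not already running.
        if self.send_timer_started is None:
            self.send_timer_started = now

    def on_packet_received(self, now: float) -> None:
        # Any data or Keepalive packet from the peer shows the path works.
        self.send_timer_started = None

    def failure_detected(self, now: float) -> bool:
        # A failure is declared when the Send Timer has run for send_timeout
        # without any packet arriving from the peer.
        if self.send_timer_started is None:
            return False
        return now - self.send_timer_started >= self.send_timeout


endpoint = ReapEndpoint(send_timeout=10.0)
endpoint.on_data_sent(0.0)
endpoint.on_packet_received(1.0)   # reply arrives, timer stops
endpoint.on_data_sent(2.0)         # timer restarts at t = 2
endpoint.on_data_sent(5.0)         # timer already running, start time unchanged
```

In this sketch the detection instant is tied to the moment the Send Timer was started, not to the moment the failure occurred, which is why the timer values alone do not determine the detection time.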

When a failure is detected, REAP triggers the path exploration function. The currently used unidirectional paths are initially tested by sending REAP Probe messages. If this validation fails, Probe messages with different combinations of source/destination addresses are sent until a new pair of working addresses is found. Note that SHIM6 and REAP support the use of paths defined by different source and destination address pairs in each direction.
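The exploration order described above can be sketched as follows. This is an illustrative Python outline only: `explore_paths`, the `probe` callback and the example addresses (taken from the 2001:db8::/32 documentation prefix) are assumptions introduced here, and the sketch ignores retransmission timing and the fact that each direction may end up using a different address pair.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

AddressPair = Tuple[str, str]  # (source address, destination address)


def explore_paths(
    current: AddressPair,
    local_addrs: List[str],
    peer_addrs: List[str],
    probe: Callable[[AddressPair], bool],
) -> Optional[AddressPair]:
    """Return the first address pair whose probe succeeds, or None.

    The pair currently in use is tested first; the remaining
    source/destination combinations are tried afterwards.
    """
    others = [pair for pair in product(local_addrs, peer_addrs) if pair != current]
    for pair in [current] + others:
        if probe(pair):  # hypothetical: sends a Probe and reports whether it was answered
            return pair
    return None


# Example with a fake probe: only one combination is reachable.
local = ["2001:db8:a::1", "2001:db8:b::1"]
peer = ["2001:db8:c::2", "2001:db8:d::2"]
reachable = {("2001:db8:b::1", "2001:db8:d::2")}
tried: List[AddressPair] = []


def fake_probe(pair: AddressPair) -> bool:
    tried.append(pair)
    return pair in reachable


new_pair = explore_paths((local[0], peer[0]), local, peer, fake_probe)
```

Here the working pair happens to be the last combination tried, so every candidate is probed once before exploration ends.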

Ideally, a failure detection mechanism should require as few resources and as little bandwidth as possible. The amount of state required for REAP operation is just three timers per communication and per endpoint. Additionally, it is quite efficient in terms of the number of protocol-specific messages exchanged, since Keepalive messages are only sent for unidirectional or low-rate communications.

REAP is a good solution to provide a failure detection and path exploration mechanism to other protocols requiring such functionality, because it has minimal requirements and it is independent from the SHIM6 protocol. Examples of protocols that could benefit from this functionality are Host Identity Protocol (HIP) [4] [5], Mobile IPv6 with registration of multiple CoAs (Care-of Address) [6], Mobike (IKEv2 Mobility and Multihoming Protocol) [7] or combined SHIM6/Mobile IPv6 operation [8].

Although two simulation and experimental studies have been previously published, focusing either on the path exploration process [9] or on the impact of the transport protocol on the recovery time [10], no analytical characterization of the time required to recover from a failure has been provided so far. Note that this value is a key figure of merit for determining the impact perceived by upper layers. Excessively long recovery times can result in the communication being discarded by the upper layers. Even if the communication continues, its quality can be degraded if the recovery process takes longer than the application can tolerate. Proper characterization of REAP performance would make it possible to configure the REAP timers on a per-communication basis in order to comply with specific upper-layer constraints.

However, despite the functional simplicity of the REAP mechanism, the characterization of the recovery time is far from trivial. It might initially be thought that the time REAP requires to detect a failure depends only on the values of the three defined timers, the Send Timer, the Keepalive Timer and the Retransmission Timer, but this would overlook the relation between the time at which these timers are started and the time at which the failure occurs. This relation largely depends on the specifics of the communication. Without a proper computation of this time it is not possible to provide an upper bound that applications can use to cope with failures in the communication.

In this paper we characterize analytically the upper bound of the time required by REAP to detect a failure and recover from it in different scenarios. These results are applied to different traffic patterns and to the specific case of TCP as the transport protocol.

The remainder of the paper is organized as follows: Section 2 provides an overview of the REAP protocol. In Section 3 we present the reference model used for the analysis, and we detail the definition of the Recovery Time. In Section 4 we start a top-down characterization by analyzing the contributions to the Recovery Time of the failure detection process and the exploration process. For this analysis, different types of communication (bidirectional and unidirectional) and failure (Two-Way and One-Way) are considered. Section 5 is devoted to characterizing the τ parameter, which depends on the time elapsed between the transmit time of the first lost packet in any node and the starting time of the Send Timer. τ is the main parameter to be considered when estimating the duration of the detection process. To obtain the least upper bound of τ, the maximum value among several cases has to be considered. The methodology followed to obtain the upper bound of τ is verified by simulation results. Then, in Section 6, we provide more compact expressions that combine the results obtained in Sections 4 and 5. These expressions eliminate the dependency on the failure type or location, and are the results to be used for configuring REAP to comply with the specific requirements of an application. An applicability example of the results is presented in Section 7. To give further details of the applicability of the analytical results, in Section 8 we consider the variable rate traffic case and applications that use TCP as the transport layer. Finally, in Section 9 we present the conclusions and future work.

Section snippets

Failure detection and path exploration in REAP

In this section we describe in detail the two components of REAP [3], the failure detection and the path exploration mechanisms. The failure detection mechanism of REAP is used to monitor the status of the pair of unidirectional paths active in a communication. Note that although SHIM6 is able to manage alternative paths for a communication, REAP only tests the pair of paths in use at a given time. To validate the current two unidirectional paths of a communication, REAP relies on two timers in

Model for performance evaluation in REAP

In this section we present the reference model to be used in the performance analysis for REAP. We first discuss the parameters involved in the failure detection and recovery procedures. Then, we define the figure of merit through which we evaluate the performance of REAP, the Recovery Time.

Characterization of the recovery time

In order to characterize the behavior of the Recovery Time (Trecovery hereafter), we have to consider all possible communication scenarios that may result from the type of communication. First, we must separate the analysis of bidirectional and unidirectional traffic. For bidirectional traffic we assume that the packet rate is high enough to preclude any exchange of Keepalive messages. Additionally, we have to consider that each peer runs its own Send Timer, resulting in different event

Characterization of τ

Most of the components of the expressions presented in Section 4 are simple to characterize. However, this is not the case for the τ parameter. In the following sections we provide a set of equations to characterize τ for each of the cases presented in Section 4. The results are upper bounds of τ that are always a supremum (or least upper bound, i.e. the smallest real number that is greater than or equal to every possible τ). We use the term maximal for the case in which τ equals the upper bound

Upper bound for the recovery time regardless of the location and type of the failure

In this section we simplify the analytical results obtained in Section 5 to obtain an upper bound for Trecovery independently of the failure point and type of failure for both the Bidirectional and Unidirectional traffic. This is the result expected to be useful for the configuration of REAP in a real deployment.

First, an upper bound for τmax regardless of the point of failure is obtained for the scenarios that depended on the α and α parameters, namely the Bidirectional traffic, Two-Way

A case study of the applicability of the results

Eqs. (59), (60), (61) and (62) provide the appropriate values for the timers of REAP to comply with a target Recovery Time given the characteristics of an application and a scenario. We show how the previous results can be applied with a case study: suppose a bidirectional VoIP (Voice over IP) application for which we require a Trecovery value of 2 s, considering that this time is short enough not to make the user think that the call has been disconnected, using a codec which generates a packet

Generalization for variable rate traffic and TCP

The previous results have been derived on the assumption of constant rate traffic. This section is devoted to analyzing the impact of variable rate traffic and the specific case of TCP, which for our purposes can be considered a specific case of a constrained variable traffic pattern. The main results of this paper are Eqs. (60) and (62), which present the upper bound for the Recovery Time for Bidirectional and Unidirectional traffic respectively. If we focus on the Unidirectional traffic

Conclusion

In this paper we have presented an exhaustive analytical study of the time required by REAP to recover from a path failure. We have focused on characterizing the time elapsed between the loss of the first data packet in any node and the time at which a peer willing to send a packet can do so again, i.e. the Trecovery figure of merit. The analysis has considered all the possible situations that may occur for each communication type (bidirectional or unidirectional traffic exchange), different types of failure

Acknowledgments

The authors would like to thank Jose Felix Kukielka for his helpful contributions to this paper. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 214994 (CARMEN project). It was also partly funded by the Ministry of Science and Innovation of Spain through Project CONPARTE (TEC2004-05622-C04-03) and project T2C2 (TIN2008-06739-C04-01).

References (16)

  • E. Nordmark, M. Bagnulo, Shim6: Level 3 Multihoming Shim Protocol for IPv6, IETF RFC 5533, 2009...
  • J. Abley, B. Black, V. Gill, Goals for IPv6 Site-Multihoming Architectures, IETF RFC 3582, 2003...
  • J. Arkko, I. van Beijnum, Failure Detection and Locator Pair Exploration Protocol for IPv6 Multihoming, IETF RFC 5534,...
  • R. Moskowitz, P. Nikander, Host Identity Protocol (HIP) Architecture, IETF RFC 4423, 2006...
  • T.R. Henderson, Host mobility for IP networks: a comparison, IEEE Network, 2003...
  • R. Wakikawa, T. Ernst, K. Nagami, Multiple Care-of Addresses Registration, IETF draft;...
  • T. Kivinen, H. Tschofenig, Design of the IKEv2 Mobility and Multihoming Protocol, IETF RFC 4621, 2006...
  • M. Bagnulo et al., IPv6 multihoming support in the mobile Internet, IEEE Wireless Communications Magazine, 2007
