
Secure and Reliable Network Updates

Published: 09 November 2022


Abstract

Software-defined wide area networking (SD-WAN) enables dynamic network policy control over a large distributed network via network updates. To be practical, network updates must be consistent (i.e., free of transient errors caused by updates to multiple switches), secure (i.e., only be executed when sent from valid controllers), and reliable (i.e., function despite the presence of faulty or malicious members in the control plane), while imposing only minimal overhead on controllers and switches.

We present SERENE: a protocol for secure and reliable network updates for SD-WAN environments. In short: Consistency is provided through the combination of an update scheduler and a distributed transactional protocol. Security is preserved by authenticating network events and updates, the latter with an adaptive threshold cryptographic scheme. Reliability is provided by replicating the control plane and making it resilient to a dynamic adversary by using a distributed ledger as a controller failure detector. We ensure practicality by providing a mechanism for scalability through the definition of independent network domains and exploiting the parallelism of network updates both within and across domains. We formally define SERENE’s protocol and prove its safety with regard to event-linearizability. Extensive experiments show that SERENE imposes minimal switch burden and scales to large networks running multiple network applications all requiring concurrent network updates, imposing at worst a 16% overhead on short-lived flow completion times and negligible overhead on anticipated normal workloads.


1 INTRODUCTION

The advent of software-defined wide area networking (SD-WAN) has brought the concurrent network update problem [1] to the forefront. In short, SD-WANs are wide area networks (WANs) covering multiple sites of an organization managed using software-defined networking (SDN) concepts [2]—chiefly the separation of (1) the data plane, in which packets are forwarded towards their destinations by switches based on forwarding rules installed at those switches, from (2) the control plane, which is responsible for setting up said forwarding rules across switches from a conceptually centralized perspective. The challenge is thus to construct a control plane for SD-WAN capable of covering several large geographically separated networks. Building a single consolidated control plane across WANs agnostic of the different underlying domains (e.g., constituting autonomous systems or based on some locality in the physical topology) can optimize the processing of consistent updates [3, 4, 5, 6]. Yet, it is likely to be ineffective and scale poorly in practice due to the high communication cost of synchronization, besides requiring strong trust between the domains. Inversely, managing domains independently, each with a separate control plane, can help perform updates in parallel (e.g., when updates only affect single domains), and can ensure that failures (e.g., misconfigurations, crashes, malicious tampering) in one domain do not affect others. However, this does not provide support for updates affecting multiple domains in a consistent manner.

Requirements. A viable SD-WAN control plane should reconcile all the following requirements:

Consistency:

First and foremost, updates can occur concurrently, yet—whether affecting individual domains (intra-domain routes) or multiple domains (inter-domain routes)—these should meet the sequential specification of the shared network application. That is, they should not create inconsistencies leading to network loops, link congestion, or packet drops.

Security:

Messages—whether sent by the data plane due to some networking event or sent by the control plane to update a switch in response to some event or change in network policy—should only be considered from valid sources and when not tampered with by a third party.

Reliability:

The control plane should be able to perform updates in the face of high rates of failures, including controller crashes and compromised controllers; in particular, failures should be detected and should not spread from one domain to another.

Practicality:

Last but not least, a solution should be practical. In particular, performance should support real-life deployments that scale to as many switches as possible across multiple domains, while imposing minimal overhead on switches and (thus) sustaining high update rates. In that light, a solution should support the replacement of failed controllers to ensure 24\( \times \)7 deployment.

State-of-the-art. Several approaches have tackled the problem of making the control plane tolerate failures, yet these approaches either solely handle crash failures [7, 8, 9], or handle potentially malicious behaviors [10, 11] without control plane authentication for the data plane, thus not fully shielding the data plane against masquerading malicious controllers. In addition, most of these approaches consider only single-domain setups.

Protocols for Byzantine fault tolerance (BFT) [12], a failure model subsuming crash failures, provide safety and liveness guarantees [13, 14] up to a given threshold of faulty or malicious participants, most often growing linearly with regard to the number of participants. Most work here similarly considers single-domain setups, putting little emphasis on handling failures quickly yet permanently so as to retain trustworthiness and support cooperation across domains throughout successive failures. Yet while application-specific solutions exist for performance-aware routing [15] or optimal scheduling for network updates [16], we are not aware of any practical system providing a generic protocol to securely enforce arbitrary application network updates across a faulty and asynchronous distributed network environment. Crucially, from the point of view of practical adoption, existing work introducing distributed resiliency techniques to address the network update problem treats both switches and controllers as equal participants in the protocol despite important differences, thus inducing prohibitive overhead on the switching fabric [11, 17].

Contributions. We present SERENE, a comprehensive protocol for secure and reliable network updates in SD-WAN environments. SERENE ensures network update consistency amidst a dynamic control plane prone to malicious or faulty members, all the while exploiting parallelism in network updates for practicality with minimal switch instrumentation. SERENE ensures consistency via an update scheduler to enforce resilient ordering of dependent network updates. Security and reliability are ensured via a Byzantine fault-tolerant consensus protocol with an adaptable threshold-based authentication of updates leveraging distributed key generation (DKG) [18]. SERENE is able to detect a wide range of controller failures (e.g., benign crashes, muteness failures [19], creation of malicious updates) thanks to a distributed ledger enabling network provenance [20]. To deal with the detected failures, SERENE supports dynamic membership within the control plane, allowing controllers to join a live control plane to replace and offset faulty controllers. Our mechanism for control plane membership changes allows for a varying membership size for the control plane while allowing a live adaptation of the threshold used in update authentication. In addition, we propose an alteration to SERENE that slightly sacrifices network update setup time to reduce switches’ computation load.

The evaluation shows that our SERENE implementation, built on top of the Ryu controller framework [21] and compatible with any controller application, performs with nominal overhead in data center-sized topologies and improves performance when expanded to large network setups, e.g., multiple data centers. Furthermore, our SERENE implementation is extensible to allow the use of any update scheduler (e.g., Contra [15], Dionysus [16]) whose update policies can be specified in Ryu.

In summary, this article makes the following contributions. We present

(1)

an intuitive view of SERENE’s protocol for secure and reliable network updates across multiple domains, while preserving consistency and practicality, that supports dynamic membership in each domain’s control plane including detection and removal of faulty or malicious controllers through the use of a per-domain distributed ledger;

(2)

an algorithmic formalization of SERENE’s protocol, proofs that these achieve consistent networks in the sense of event-linearizability [22], and a security analysis of the protocol;

(3)

SERENE’s implementation on top of the Ryu runtime, using open-source components such as the BFT-SMaRt [14] and Pairing Based Cryptography [23] libraries;

(4)

an integration of SERENE into the OpenFlow discovery protocol (OFDP) [24] for secure data plane (topology) discovery, and an evaluation of this integration over the Abilene network [25];

(5)

an evaluation of SERENE in single and multiple domains, demonstrating its practicality.

SERENE supersedes our consistent secure practical controller (Cicero) work [26], which had several limitations compared to SERENE. In short, SERENE integrates a distributed ledger to better handle compromised controllers, and the present report further includes formalization and proofs of correctness (event-linearizability) along with a security analysis, and provides secure topology discovery through an integration with OpenFlow discovery protocol (OFDP). All technical additions are empirically evaluated.

Roadmap. Section 2 presents motivating examples for secure and consistent network updates and discusses the need for a comprehensive solution. Section 3 presents the main components of SERENE. Section 4 presents the SERENE protocol that puts the components together. Section 5 presents the formal properties of SERENE including pseudocode for algorithms, proofs of correctness, and a security analysis. Section 6 describes our SERENE implementation. Section 7 presents a secure topology discovery protocol using OFDP. Section 8 presents the performance evaluation of SERENE in a multi-data center deployment. Section 9 presents conclusions.


2 BACKGROUND

From a high level, network traffic is shaped by policies set by network administrators. These policies are driven by an unbounded number of factors (e.g., demand for network resources, application bandwidth requirements, firewall rules, and other network tenant requirements), so it is impossible to anticipate every motivation behind them. For a network switch in the data plane, policies are represented by forwarding rules that describe the store-and-forward behavior of network packets. An individual switch has no understanding of a policy or how it affects the entire network. In an SD-WAN environment, a control plane of one or more controllers enforces policies set by the network administrator by translating policies into flow table entries installed on switches. As network traffic arrives or as network policies change, switch flow tables must be modified through network updates. Furthermore, the topology of the network may be dynamic as physical cabling is changed and/or failures happen in switch or fabric hardware. These topology changes may also result in network updates.

2.1 Definitions

A network policy consists of a high-level description of intent for network traffic. In other words, it consists of desired packet handling behavior (e.g., shortest path routing, firewall rules). A network flow is an active transfer of packets in the data plane identified by its source, target, and bandwidth requirements. A route indicates the specific path that a network flow takes within the network; multiple possible routes may exist for a network flow. Forwarding rules instruct a data plane switch on how to forward received packets in a flow. The data plane state consists of all forwarding rules currently in use by all data plane switches. The control plane is thus responsible for maintaining forwarding rules in the data plane state for all routes such that they comply with network policies at all times, even during a change to the data plane state.

2.2 Challenges

In this section we outline several motivating examples that show not only the need for consistent network updates performed in a secure and reliable manner, but also the need for practical policy specification and for scalability to deployments in large networks.

Consistency. Asynchrony in network updates can cause transient side effects that significantly degrade overall network availability and/or violate established network policies. Since data plane switches do not coordinate among themselves to ensure update consistency, updates sent to switches in parallel may be applied in any order. While the OpenFlow message layer, arguably the most widely used southbound application programming interface (API) for network updates, has proposed bundled updates [27] to provide transaction-style updates to switches, it only supports these updates for a single switch. It does not address inconsistencies that can occur due to updates that span multiple switches. Additionally, OpenFlow scheduled bundles require synchronized clocks among switches to enforce the time at which bundles are applied, but even the slightest clock skew may provoke transient network behavior.

Table 1 summarizes several circumstances as well as potential problems that can arise if update consistency is not provided. For each example, certain preconditions may also be needed by the controller for ensuring update consistency. For instance, even a simple network policy change may have unintended consequences when network updates are not consistent (cf. Figure 1). The process of changing data plane state must also be free of transient effects caused by updates to multiple data plane switches: loop and black hole freedom ensures no network loops or unintended drops of network packets (cf. Figure 2), and congestion freedom ensures no over-provisioning of bandwidth to network links (cf. Figure 3).

Fig. 1. (a) Depiction of the flows \( f_1 \) in green and \( f_2 \) in yellow. Unused network links are dashed. (b) The network is intended to be modified by an update, which respects the firewall rule in which no traffic should flow from \( s_1 \) to \( s_3 \) . The modification is made to send \( f_1 \) and \( f_2 \) both through \( s_2 \) to \( s_5 \) . Updates are required at \( s_1 \) and \( s_2 \) to modify the flows, (c) but \( s_1 \) applies the update before \( s_2 \) which breaks the firewall rule.

Fig. 2. (a) Depiction of the flows \( f_1 \) in green and \( f_2 \) in yellow. (b) The link \( s_4 \) – \( s_5 \) fails and the network is planned to be modified by an update to bypass this failure, but (c) \( s_3 \) applies the update before \( s_2 \) which creates an unintended network loop.

Fig. 3. (a) Depiction of the flows \( f_1 \) in green and \( f_2 \) in yellow, (b) that are planned to be modified by an update alleviating \( s_3 \) , (c) but the update is applied by \( s_1 \) before it is applied by \( s_2 \) which causes an unintended over-provisioning of the \( s_4 \) – \( s_5 \) link.

Example | Network change | Desired behavior | Potential problems | Update consistency preconditions
Figure 1 | Firewall rule changes | Policy enforcement | Compromise or loss of data | Aware of existing firewall rules
Figure 2 | Network hardware maintenance | Loop and black hole freedom | Packet loss | Aware of existing flows
Figure 3 | Bandwidth load balancing | Loop, black hole and congestion freedom | Over-provisioning of link resources | Aware of existing bandwidth usage

Table 1. Examples of Network Changes with their Desired Behaviors, Potential Problems, and Consistency Preconditions

Security. When considering a control plane prone to faulty controllers, enforcing a consistent ordering of network updates is not sufficient: those updates must only be applied when received from authenticated controllers. Additionally, since a malicious controller masquerading as a switch can report incorrect links and switch states to the control plane [28], messages sent by switches must also be authenticated.

OpenFlow enables endpoint authentication through transport layer security (TLS) for both controllers and switches. However, it has no mechanism to support a dynamic control plane, such as group authentication, e.g., to verify that an update has been emitted by any member of the control plane, or distributed key generation to adapt the group key as the membership of the control plane changes.

Reliability. An authenticated controller that is faulty or compromised is still able to affect the data plane state. Beyond security, a system for network updates must therefore remain correct in the midst of failures and be able to detect when failures happen. A comprehensive solution for secure and reliable network updates must be able to tolerate arbitrary and dynamic controller faults.

A faulty or malicious controller may corrupt or cause loss of network data, violate firewall rules, or even leak network data to a malicious party. While solutions for reliable controllers have been proposed, they either focus on resiliency (e.g., intrusion detection, intrusion prevention) for a singleton controller [29, 30] or provide resiliency only in the presence of crash failures [7, 8, 9, 31]. Single controller solutions, proven to be single points of failure [32, 33, 34, 35, 36], must be avoided.

Many of the existing limitations when considering a faulty control plane arise from shortcomings in the southbound API itself. For example, OpenFlow has a mechanism for the control plane to inject arbitrary packets into the data plane (\( \mathsf {PACKET\_OUT} \) [37]). Using this, a malicious controller can perform a denial of service attack against the data plane or corrupt existing flows [38].

Practicality. The usefulness of a system is often evaluated on factors such as ease of use, performance, and efficiency. Network policy specifications must not only be straightforward but also flexible enough to allow arbitrary network policies. Several solutions for policy specification have been proposed [39, 40, 41], but these are either control plane implementation specific, or do not ensure update consistency or security. A practical system must allow a network administrator the flexibility to use any solution desired while ensuring consistency, security, and reliability.

Furthermore, a system for managing changes to the data plane state must scale to a wide network infrastructure consisting of multiple data centers with potentially thousands of switches [42, 43]. Existing work [16] shows that applying updates on commodity switches can require seconds to complete. For data center workloads where flows start and complete in under a second [44], applying updates quickly is vital to guarantee adequate network response time when changing data plane state. However, responsiveness becomes even harder to ensure if updates are to be applied in a consistent manner. In a naïve approach to enforcing consistency, updates would be applied sequentially (e.g., by updating \( s_2 \), \( s_1 \), \( s_3 \), \( s_4 \) in that order in Figure 1), increasing response time. Yet, updates that do not depend on any others (i.e., causally concurrent updates) may be applied in parallel (e.g., updates to \( s_3 \) and \( s_4 \) in Figure 1). Identifying causally concurrent updates to apply in parallel and improve response times is a challenge.

Finally, the data plane’s runtime load for updates must be low to ensure that as many resources as possible are devoted to the network’s core purpose: the transmission of network data.

2.3 Related Work

While the following solutions present methods for solving significant problems that arise in SD-WAN deployments, none provides the desired guarantees of consistent network updates in the midst of controller faults while remaining practical. Table 2 highlights the shortcomings that make these solutions impractical in a realistic deployment.

Table 2. Comparison of Network Management Solutions Considering Different Features Related to Consistency [Cons], Security [Sec], Reliability [Rel], and Practicality [Prac]

Consistency. Several works have been published in the realm of consistent network updates. McClurg et al. [53] proposed network event structures (NES) to model constraints on network updates. Jin et al. [16] propose Dionysus, a method for consistent updates using dependence graphs with a performance optimization through dynamic scheduling. Nguyen et al. [54] propose ez-Segway, a method providing consistent network updates through decentralization, pushing certain functionalities away from the centralized controller and into the switches themselves. Header space analysis [56] and Minesweeper [57] both provide a mechanism for ensuring consistency of network updates through formal analysis; however, they do not provide a means to ensure that those updates are applied securely. Černý et al. [55] show that in some situations it may not be possible to ensure consistent network updates. As such, it may be desirable to wait until the packets for a particular flow are “drained” from the network prior to applying switch updates. They define this behavior as packet-waits and provide an algorithm with at-worst polynomial runtime, called optimal order updates, for detecting such situations.

Security. While adding TLS for OpenFlow [58] may seem trivial, it requires overcoming additional complexities inherent in the protocol. For example, TLS uses certificates to authenticate participants and encryption to ensure data confidentiality, but does not protect against a malicious controller. Such a controller with a valid certificate has the ability to maliciously install a faulty data plane state, e.g., crafting the undesired situations mentioned in Figure 1(a)–(c). Besides, as the distributed control plane’s membership changes, individual controller and switch certificates must be redistributed to all participants. Solutions to address a malicious controller exist [29, 30], but focus on protection in a single controller environment and do not address a replicated control plane.

Li et al. [10] proposed a method of devising a Byzantine fault tolerance (BFT) control plane by assigning switches to multiple controllers that participate in BFT agreement. However, this work focuses significantly on the problem of “controller assignment in fault-tolerant SDN (CAFTS)” with little discussion on how BFT is used to ensure protection from faults. MORPH [11] expands the CAFTS solution with a dynamic reassigner which allows for changes to the switch/controller assignment. Neither method fully protects against malicious updates sent to the data plane; assuming that controllers participate in a BFT protocol for state machine replication is not enough to ensure the security of such updates. Without control plane authentication, a malicious controller can make arbitrary updates to a data plane switch. Note also that despite partitioning switches among controllers, MORPH, just like other related approaches, does not support multiple update domains. DiffProv [59] and NetSight [60] both provide a mechanism for network anomaly detection, but do not prevent inconsistencies.

Reliability. The area of fault-tolerant network updates has been explored in many facets. ONOS [7] and ONIX [8] provide a redundant control plane through a distributed data store, however, their primary focus is on tolerance of crash failures. Botelho et al. [52] also make use of a replicated data store, following a crash-recovery model, for maintaining a consistent network state among a replicated control plane built upon Floodlight [61]. Ravana [9], another protocol that only tolerates crashes, differs slightly in its use of a distributed event queue rather than a distributed data store. While Botelho et al. and Ravana ensure event ordering and prevent duplicate processing of events, they do not provide a mechanism for authenticating updates sent to the data plane. RoSCo [22] makes use of a BFT protocol to ensure event-linearizability, but does not support a dynamic control plane and requires extensive key management for controller authentication.

Zhou et al. [20] propose a protocol for secure network provenance to provide forensic capabilities for network policies in an environment consisting of malicious nodes. However, their protocol requires the instrumentation of switch nodes to participate in the protocol and does not provide fault detection. In addition, it requires a network operator to check the provenance graph for anomalies. DistBlockNet [62] presents a protocol for blockchain-based network policy management in an Internet of things application; however, it requires each switch to authenticate each update with a “verifying controller”. While updates are verified against a distributed blockchain (i.e., a distributed ledger), DistBlockNet does not prevent a malicious controller from modifying the blockchain itself.


3 SERENE OVERVIEW

In this section, we detail our system threat model and describe the mechanisms SERENE employs to ensure consistency, security, and reliability while being efficient enough for practical deployment in a production data center. Descriptions in this section make use of several symbols for conciseness; a summary of these symbols is provided in Table 3.

Symbol | Definition
\( pkt_{} \) | Network packet
\( s_{} \) | Switch process
\( spk_{} \) | Switch public key
\( ssk_{} \) | Switch secret key
\( tpk_{} \) | Threshold public key
\( e_{} \) | Network event
\( c_{} \) | Controller process
\( cpk_{} \) | Controller public key
\( csk_{} \) | Controller secret key
\( css_{} \) | Controller secret share
\( C_{} \) | Controller communication object
\( \mathcal {C}_{} \) | Control plane communication group
q | Minimum quorum size
\( C_{A} \) | Aggregator controller
\( \pi _{} \) | Network state
\( U_{} \) | Network update
\( u_{} \) | Switch update
\( r_{} \) | Flow table rule
\( D_{} \) | Switch update dependence set

Table 3. Basic SERENE Notation

3.1 System and Threat Model

System model. The data plane is considered to consist of a set of switches \( s_{i} \) connected by links encompassing multiple domains of operation. We consider the control plane to consist of a dynamic set of distributed controllers \( c_{j} \). The current state of the switches, or more specifically the data plane state (essentially a set of flow table rules for switches), is also referred to as the network state \( \pi _{} \) for brevity. A change in data plane state generally involves a network update \( U_{} \) consisting of a set of switch updates \( \lbrace u_{k}\rbrace \). A switch update \( u_{k} \) (which is uniquely identifiable) may have a set of attributes associated with it, abbreviated as a tuple of the form \( \langle s_{k}, r_{k}, D_{k} \rangle \). The first two indicate that switch update \( u_{k} \) consists of rule \( r_{k} \) to be applied to switch \( s_{k} \). Where needed/used, the dependence set \( D_{k} \) indicates a set of switch updates that must be applied before \( u_{k} \), and is thus essentially used to capture dependencies between switch updates as elaborated on shortly below. As for any (tuple of) attributes associated with an object, we assume that attributes of a given switch update can be accessed by dereferencing it—e.g., for a switch update \( u_{k} \) above, \( u_{k}.r_{} \) denotes its rule (i.e., \( r_{k} \)), or \( u_{k}.D_{} \) its dependence set (\( D_{k} \)).
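
For illustration only, the following Python sketch mirrors this notation; the class and field names are ours and not part of SERENE’s implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set

@dataclass(frozen=True)
class SwitchUpdate:
    """A switch update u = <s, r, D>: rule r to be applied to switch s once
    every update in the dependence set D has been acknowledged."""
    uid: str                            # unique identifier of the update
    switch: str                         # target switch s
    rule: str                           # flow table rule r (kept opaque here)
    deps: FrozenSet[str] = frozenset()  # dependence set D (uids of prerequisite updates)

# A network update U is a set of switch updates {u_k}.
NetworkUpdate = Set[SwitchUpdate]
```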

Switches and controllers communicate by sending and receiving messages on an asynchronous network in which links between switches, controllers, and/or switches and controllers may fail. Messages may take an arbitrary amount of time to reach switches and controllers.

Threat model. We consider a failure/threat model where a controller may fail or become malicious at any time. Such a controller may eavesdrop on the communication between switches in the data plane, between other controllers in the control plane, and/or between switches and controllers. We also consider that a faulty/malicious controller can modify the contents of any message sent between controllers and/or between controllers and switches. For example, such a controller may send any arbitrary update to a switch, send an arbitrary event to another controller, or prevent an event and/or update from being received by a controller and/or switch.

While a controller may fail or become malicious, we assume that switches always remain correct. Protection of the data plane is the topic of ongoing research [63] through analysis of flow behavior [64], authentication [65], and intrusion detection [66]. In addition, host endpoints can protect data packets through existing secure transport protocols such as TLS. Utilizing software-defined networking (SDN) as a means of protecting against malicious hosts (e.g., mounting distributed denial of service (DDoS) [67, 68, 69] or man-in-the-middle [70] attacks) is also the subject of ongoing research. We also assume that a faulty/malicious controller can only view but not modify the contents of data sent between switches. Furthermore, in relation to the cryptographic mechanisms employed by our solution, we assume that their implementation is sound, that private keys remain private, and that, except with negligible probability, an adversary cannot sign a message on behalf of a member whose private key it does not know.

3.2 Consistency

Consistent network updates are accomplished by pairing an update scheduler that establishes the order in which updates should be performed, and a blocking update application scheme that relies on switch acknowledgments.

Update scheduler. An update scheduler determines a schedule that enforces the sequential specification of registered network policies by denoting a set \( U_{} \) of switch updates including their respective dependencies \( D_{} \) as defined above. That is, for any given switch update \( u_{}=\langle s_{},r_{},D_{}\rangle \) part of a network update \( U_{} \), \( D_{} \) refers to the set of switch updates that must be applied before \( u_{} \) can be applied to \( s_{} \).

Figure 1 depicts an example that requires a set of updates for switches \( s_1 \), \( s_2 \), \( s_3 \), and \( s_4 \). To ensure update consistency, an update scheduler would require the update at \( s_2 \) to be applied first and, only then, the remaining updates can be performed in any order. Figure 4 depicts another example where a set of network updates require modifications to the switches highlighted with green dashes and red dots. While the updates within these two sets of switches may require ordering, modifications across sets involve a disjoint set of switches and can be performed in any order.

Fig. 4.

Fig. 4. The update scheduler determines that there are no dependencies between the updates for the green (dashed) set of switches and the updates for the red (dotted) set.

Update schedulers have been extensively discussed [16, 55, 71, 72]. We employ a simple update scheduler implemented using any of these approaches. We discuss in Section 3.5 how SERENE exploits it to perform updates to switches in parallel while still preserving consistency.

In addition, we assume controller applications are deterministic. As a result, when policies allow for multiple rules that may result in differing routes, all controllers must use the same heuristic for choosing rules to update. For example, if a policy requires that data plane traffic be routed using the shortest path yet there exist multiple shortest paths in the network, all controllers would deterministically choose the same route, resulting in the same set of required updates to switches. Existing solutions that focus on crash-only tolerance follow a similar assumption [7, 8, 9, 31].

Switch acknowledgments. While the update scheduler determines dependencies between updates, it does not handle execution. To ensure consistent execution, controllers expect to receive update acknowledgments from switches every time they apply an update. For every switch update \( u_{}=\langle s_{},r_{},D_{}\rangle \) with dependence set \( D_{} \) proposed by the update scheduler, a controller only sends the update \( u_{} \) to the data plane once it receives the acknowledgments for every update in \( D_{} \).
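
A minimal sketch of this acknowledgment-gated dispatch is shown below, reusing the SwitchUpdate fields from the sketch in Section 3.1; the send_to_switch callback and the per-update acknowledgment identifiers are illustrative assumptions, not SERENE’s actual API.

```python
def dispatch(network_update, send_to_switch):
    """Send each switch update only once every update in its dependence set
    has been acknowledged; returns the callback to invoke on each switch ack."""
    updates = {u.uid: u for u in network_update}
    pending = {u.uid: set(u.deps) for u in network_update}
    acked = set()

    def try_send():
        for uid, deps in list(pending.items()):
            if deps <= acked:                 # every dependency acknowledged
                send_to_switch(updates[uid])  # independent updates go out in parallel
                del pending[uid]

    def on_ack(uid):                          # called upon a verified switch acknowledgment
        acked.add(uid)
        try_send()

    try_send()                                # updates with an empty D are sent immediately
    return on_ack
```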

3.3 Security

At their core, secure network updates require switches to apply updates only from a trusted controller. SERENE fulfills this requirement by authenticating both events, which may induce updates, and updates themselves, such that only those emitted by control plane members are considered by switches. SERENE further eases the deployment of a dynamic control plane by ensuring that switches only need to store a single public key for the control plane, handed out when a switch is set up.

Event source PKI—event authentication. A change in data plane state is assumed to be invoked as the direct result of some event, whether it is the result of a switch detecting an unroutable packet (e.g., mismatch in flow table rules), a change in network policy, a failure of network hardware, or some other factor. Events received by the control plane require validation to ensure that they originated from a reliable source and that they have not been tampered with during transit. To this end, SERENE makes use of a public key infrastructure (PKI) system where each event source is assigned a public/private key pair. Event sources sign each event they generate with their private key; controllers verify the signature of each event they receive against their respective public key.

Controller threshold key—update authentication. Each controller signs the updates it emits so switches can verify the origin of the updates they receive. The strawman approach consists of controllers being assigned different pairs of public/private keys for signing updates. However, managing all the public keys on all the switches rapidly becomes cumbersome as controllers may be added to and/or removed from the control plane. Moreover, the limited physical resources of switches must be preserved (cf. Section 3.5).

To this end, we employ a system based on threshold cryptography [73, 74]. In a \( (t,n) \)-threshold signature scheme, a single public/private key pair is generated for the entire control plane. The public key is distributed to each switch and each controller obtains a share of the associated private key used for signing updates thanks to Shamir secret sharing [75]. To verify an update, the signature shares received from controllers are combined with an aggregation function to create a signature that is verified against the single public key. The aggregated signature can only be validated if correctly signed by at least t out of n controllers; thus, except with negligible probability, any \( t-1 \) controllers cannot on their own construct a signature that can be verified against the control plane public key. The choice of t impacts SERENE’s reliability as presented in Section 3.4.
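
As an illustration, the switch-side acceptance check could look as follows; aggregate and verify stand in for a real \( (t,n) \)-threshold signature library (e.g., a pairing-based scheme) and are assumptions rather than an actual API.

```python
def accept_update(update, shares, tpk, t, aggregate, verify):
    """shares: mapping from controller identifier to its signature share over `update`.
    aggregate() and verify() are placeholders for a real (t, n)-threshold scheme."""
    if len(shares) < t:
        return False                      # fewer than t shares: keep waiting
    signature = aggregate(shares)         # combine the shares into one group signature
    return verify(tpk, update, signature) # check against the single control plane public key
```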

Controller DKG—dynamic unique controller key. Using threshold cryptography and secret sharing for update verification establishes a method for secure updates in a dynamic distributed control plane. However, distribution of private key shares when controller group membership changes creates a significant complication: no single controller should ever have knowledge of a private key share other than its own. Verifiable secret sharing (VSS) [76] is a method in which a designated dealer distributes shares of a secret to all participating members. VSS differs from standard secret sharing in that participants can obtain a valid share even if the dealer is malicious. These shares can be used in a \( (t,n) \)-threshold signature scheme to create message signatures that are only validated if at least t members correctly sign the message with their secret share. Naïvely, one could employ such a system to distribute private key shares to controllers when the control plane membership changes. However, requiring the setup and maintenance of such a system is impractical as the VSS dealer is a single point of failure for confidentiality.

We instead employ a system based on distributed key generation (DKG) [77] that expands on the concept of VSS to an environment where there is no trusted dealer. In short, each controller acts as a sub-dealer, creating and distributing private key sub-shares to each other controller. The sub-shares are then aggregated to create the private key share for the controller. DKG uses homomorphic commitments to ensure that the corresponding public key for the group is known by all controllers, but, except with negligible probability, no single controller can create a signature that is successfully validated by the public key. Once generated, this public key must be shared with all switches, which is done when switches are set up. Future instances of DKG ensure that new shares can be generated for the control plane as group membership changes without changing the public key.
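
The following toy sketch illustrates only the sub-share aggregation idea behind DKG; it uses insecure integer arithmetic and omits the commitments and complaint handling of [77], whereas a real deployment operates over a pairing-friendly group.

```python
import random

P = 2**127 - 1   # toy prime modulus; insecure, for illustration only

def rand_poly(degree):
    # Random polynomial; its constant term is this sub-dealer's contribution to the group secret.
    return [random.randrange(P) for _ in range(degree + 1)]

def eval_poly(coeffs, x):
    return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P

def toy_dkg(n, t):
    # Controller i picks a polynomial f_i of degree t-1 and hands sub-share f_i(j)
    # to controller j; controller j's private key share is the sum of the sub-shares
    # it received. The group secret (the sum of all constant terms) is never held
    # by any single controller, yet any t shares determine it by interpolation.
    polys = [rand_poly(t - 1) for _ in range(n)]
    return {j: sum(eval_poly(f, j) for f in polys) % P for j in range(1, n + 1)}
```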

3.4 Reliability

While an update from a controller can be easily validated using signatures, trust in a single controller is not enough when considering malicious faults (e.g., a compromised controller can sign malicious messages with a valid signature). SERENE increases the reliability of the control plane by supporting a dynamic distributed control plane where all controllers monitor each other to detect and remove failed members. SERENE uses event agreement between controllers as well as update agreement verifiable by the data plane to ensure correct behavior of the control plane, assuming a quorum majority of correct controllers at all times. By detecting and removing failed controllers, SERENE remains reliable in the face of a dynamic adversary. SERENE can detect failures ranging from simple crashes thanks to heartbeats, to more complex and pernicious failures thanks to a distributed ledger.

Atomic broadcast—event agreement. Once an event is signed, the event source sends it to all known controllers in the control plane. A controller, upon receiving an event and verifying its signature, proposes agreement on the event with all other controllers through an established agreement protocol to ensure a total order of processed events. Upon deciding on the event ordering with other controllers in the control plane, each controller independently responds to the event with network update(s). A switch only applies an update once received from a quorum of trusted controllers. We use an atomic broadcast [78] (i.e., consensus) to ensure each controller has a consistent view of the data plane state. Controllers use a public key infrastructure (PKI) system to validate messages sent with the atomic broadcast. We employ a dynamic control plane membership protocol to ensure flexibility of the control plane. The current communication group of controllers is denoted \( \mathcal {C}_{}= \lbrace C_{1}, \ldots , C_{j}\rbrace \), a set of controller communication objects. Each \( C_{}= \langle c_{}, cpk_{}, id_{}\rangle \) contains the controller process, its public key for message validation, and the controller process identifier within the communication group.

Threshold signatures—update agreement. Controllers do not need to explicitly agree on an update using the atomic broadcast since they already agree on the events and their order. Rather, it is sufficient for switches to only apply updates with valid signatures (i.e., from controllers) that are emitted from a quorum of verified controllers. As explained in Section 3.3, SERENE uses a \( (t,n) \)-threshold signature scheme for controller authentication. We set t to the controller quorum size necessary to apply an update, i.e., \( t = 2 \times \lfloor \frac{n-1}{3}\rfloor + 1 , \) and represent this quorum size as q for brevity. Note that to tolerate a single failure, there must be at least 4 members in the control plane (i.e., \( n=3f+1 \) with \( f \ge 1 \)). Thus, we assume SERENE never runs on control planes with \( n \lt 4 \).
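
For concreteness, the quorum size can be computed as follows; this is a direct transcription of the formula above, with the names being ours.

```python
def quorum_size(n):
    """t = 2 * floor((n - 1) / 3) + 1 controllers must sign for an update to be applied."""
    if n < 4:
        raise ValueError("SERENE assumes n >= 4, i.e., n = 3f + 1 with f >= 1")
    return 2 * ((n - 1) // 3) + 1

# For example: quorum_size(4) == 3, quorum_size(7) == 5, quorum_size(10) == 7.
```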

Heartbeats—crash detection. SERENE uses a failure detector (FD) that relies on heartbeat messages to detect controller crashes (e.g., due to power loss). Heartbeats are periodically broadcast within the control plane; a controller is suspected of failure when other controllers do not receive its heartbeats for a given amount of time. Because of this upper bound in detection time, the FD provides strong completeness and weak accuracy for crashes (i.e., the detector outputs no false negatives but may output false positives [79]). Weak accuracy implies that a suspected controller may be prematurely removed from the control plane (e.g., if a controller is too slow), which only affects the system’s liveness. Since SERENE supports a dynamic control plane, prematurely removed controllers may be re-added later (cf. Section 4.3).

Distributed ledger—beyond crash detection. Faulty controllers may issue incorrect updates, or no update at all, in response to an event they received. Such incorrect behaviors go undetected by the heartbeat FD since it can only suspect slow or crashed controllers of failure. To complement the heartbeat FD, SERENE includes a distributed ledger per domain to detect a wider range of controller misbehavior which may affect the safety (e.g., inconsistent updates, invalid updates) or the liveness (e.g., muteness failures [19]) of the system. In essence, controllers hold each other accountable [80, 81] by storing in the ledger, for later audit, (1) the events received and (2) the events decided by the control plane, as well as (3) every update issued by a controller to the data plane and (4) the matching update acknowledgments by switches. Events are stored in the ledger twice—first when they are received by controllers, then upon decision by the atomic broadcast—to detect those that are rejected. Following the event decision, the corresponding update(s) are also recorded, using a scheme we describe further, alongside their acknowledgments from the data plane to detect irregularities such as updates signed by a minority of controllers (more examples in Section 4.4).

A strawman design would entirely rely on an external (permissioned) ledger [82, 83, 84] to record events and updates. However, these ledgers require a round of consensus for each recorded item to ensure controllers store the same view of the ledger. As we show further, the cost incurred by these extra rounds of consensus is unnecessary and we can design a more efficient, thus practical, solution. Instead of using an external ledger, we propose to tightly couple the workings of SERENE’s distributed ledger with SERENE’s core protocol for network updates as described in the following.

In SERENE, recording an event e in the ledger is performed locally by each controller once the atomic broadcast of e, used for consistency, completes. Hence, recording events comes at no additional communication cost. Since controllers can equivocate [81, 85], recording updates requires extra steps and must involve the data plane. Faulty controllers may selectively omit messages or lie to preserve an appearance of correct behavior by, for instance, issuing deceitful updates to the data plane yet advertising correct ones to the control plane. As such, updates must only be recorded if they have been sent to the data plane. To that end, SERENE leverages the assumed correctness of switches by making them echo the signed updates they receive back to the control plane. Upon reception of an echoed update, each controller directly records it in its local ledger, thus avoiding the cost of consensus of an external ledger. As long as the control plane contains at least one correct member that received an event, incorrect updates for that event are ensured to be recorded. Recorded updates can then be audited, either automatically (cf. Section 4.4) or manually by network administrators, and controllers emitting incorrect updates can be detected.
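
A sketch of the resulting per-controller ledger interface is given below; the hook names are hypothetical, and entries are simply appended locally for later audit (cf. Section 4.4).

```python
class LocalLedger:
    """Per-controller append-only log of events, updates, and acknowledgments."""
    def __init__(self):
        self.entries = []

    def record(self, kind, payload):
        self.entries.append((kind, payload))

    # Events are recorded twice: on reception and again once the atomic broadcast decides.
    def on_event_received(self, event):
        self.record("event_received", event)

    def on_event_decided(self, event):
        self.record("event_decided", event)

    # Updates are recorded only when echoed back (signed) by a switch, and acknowledgments
    # when received, so no extra round of consensus is needed.
    def on_update_echoed(self, signed_update):
        self.record("update_echoed", signed_update)

    def on_ack_received(self, ack):
        self.record("update_acked", ack)
```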

3.5 Practicality

Beyond consistency and security, for a solution to be feasible in a real data center deployment it must also be practical. SERENE provides an effective solution by exploiting intra- and inter-domain update parallelism and enabling efficient signature aggregation to alleviate the runtime load on switches.

Update parallelism—intra-domain parallelism. Using an update scheduler (cf. Section 3.2) allows SERENE to exploit parallelism in switch updates. Given a set of switch updates and their corresponding update dependencies determined by the update scheduler, two updates \( u_{i} \) and \( u_{j} \) can be applied in parallel if their dependencies \( D_{i} \) and \( D_{j} \) are disjoint, i.e., \( D_{i} \cap D_{j} = \emptyset \).
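
Reusing the SwitchUpdate sketch from Section 3.1, this condition amounts to a simple disjointness check.

```python
def parallelizable(u_i, u_j):
    # Updates u_i and u_j may be applied in parallel when D_i and D_j are disjoint.
    return set(u_i.deps).isdisjoint(u_j.deps)
```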

Update domains—inter-domain parallelism. SERENE employs an atomic broadcast (cf. Section 3.4) to ensure a consistent ordering of events processed by the control plane. The responsiveness of such agreement protocols unfortunately greatly deteriorates as the size of the control plane increases, hence creating a tradeoff between fault tolerance and performance. Additionally, in large networks such as a collection of data centers, this responsiveness is further impacted by having a geographically dispersed control plane. This distribution is initially set to minimize latency between local control and data planes but ultimately increases latency within the global control plane.

As such, SERENE allows the division of network resources into domains, each as its own separate instance of the protocol functioning on disjoint control and data planes, e.g., separate IP subnetworks. Domains may rely on separate update schedulers, agreement within communication groups, and control plane public keys. The goal of this division is to enable data plane events that involve updates to switches fully contained within the same domain to be processed independently of other such events in other domains, i.e., in parallel. Events that require updates spanning multiple domains must however be handled in a consistent manner by the control plane as a whole.

SERENE avoids the need for inter-domain agreement through assumptions on setup and global domain policies. First, we assume operators of different domains trust each other, e.g., domains are sub-domains of the same institution. Doing so prevents conflicting policies from being set across domains, and prevents unexpected events from being forwarded across domains. Domain isolation thus offers the security that a (potentially faulty) domain’s control plane cannot update another domain’s data plane, but it may affect flows with a remote origin crossing the data plane it is responsible for. Second, we assume the global domain policies are agreed upon before network deployment and set manually by system administrators. This provides the advantage that each domain’s control plane is able to determine which domains require updates based on a received event without collaboration with other domains. A controller receiving an event that involves updates to multiple domains merely forwards the event to the control plane of each affected domain. This does mean that any update to a global domain policy requires manual updates to all controllers in the affected domains.

For example, consider the flow outlined in Figure 5 where an event generated by switch \( s_1 \) in domain A needs a route to \( s_4 \) to be established. Using the global domain policies, the controller in A that receives the event determines that it requires updates to both domain A and B, and forwards the event to the control plane of domain B. Both domains process the event in parallel and update the switches within their domain accordingly, setting the flow table rules of switches to establish a flow from \( s_1 \) to \( s_4 \).

Fig. 5. Depiction of a two-domain network where an event generated by switch \( s_1 \) and sent to its local domain control plane. The control plane then uses global domain policies to determine that network updates involve domains A and B. A’s control plane forwards the event to B’s and both domains update their local switches to set flow tables rules.

This brings out some unique challenges, specifically in relation to membership changes in the control plane of each domain. As we will detail with the SERENE protocol in Section 4, events—be they those that originate from the data plane for link events or from the control plane for membership changes—are not processed sequentially by the control plane. If an event sent across a domain is received by a controller participating in a membership change, this event must be queued and processed after the completion of the membership change operation. In our implementation of SERENE described in Section 6, we use the BFT-SMaRt [14] library to ensure that events received while processing other operations are properly queued and not dropped.

While SERENE allows for division of the network into domains, for SERENE to inter-operate between domains, each domain must remain under the control of the network administrator. This removes any need for negotiation between network service providers or autonomous systems (ASs), and simplifies our requirements for cross-domain routing policies, as they can be set globally from the network administrator’s perspective. Negotiating policies across physical sites and multiple network administrators is assumed to be possible because the administrators of networks across multiple domains belong to the same company/organization. In other words, we assume that administration of, and communication between, domains is trusted. This avoids gaps in the network where data plane traffic may be handled by an untrusted third party and avoids the need for network tunneling between third-party providers. Rules for cross-domain policies can thus be set at a global level thanks to centralized control by the network administration team.

In addition, the setup of new domains requires planning by system administrators to establish the global domain policies appropriately. We assume that this is handled offline by system administrators and set in the control plane before domain deployment.

Update signature aggregation. To verify an update, the signature shares from each controller must be collected and aggregated prior to verification against the threshold public key. Placing this responsibility on switches can put an unnecessary load on their hardware. SERENE thus presents two approaches for signature aggregation:

(1)

switch aggregation in which each individual switch is responsible for collecting and aggregating update signatures, and

(2)

controller aggregation in which a single designated “aggregator” controller, \( C_{A} \), collects and aggregates signatures.

Each approach comes with its own tradeoffs. While switch aggregation requires additional resources and instrumentation on switches for storing and aggregating signatures, controller aggregation increases latency since switches must wait for the aggregator to collect and aggregate responses. Furthermore, controller aggregation must be able to handle the detection of a failed or malicious aggregator. Our evaluation in Section 8 further quantifies the tradeoffs of each approach.


4 SERENE PROTOCOL

In this section, we show how the components depicted in Section 3 form a protocol with

(1)

consistent, secure and reliable network updates,

(2)

signature aggregation,

(3)

dynamic membership, and

(4)

failure detection.

We further comment on the guarantees of the protocol in Section 5.

4.1 Core Update Protocol

The SERENE protocol is composed of two independent routines: switch runtime and controller runtime. The controller runtime can further be broken down into the handling of events within and across multiple domains.

Switch protocol. Figure 6 depicts the update processes for a switch when it receives either a packet from the data plane (Figure 6(a)) or an update from the control plane (Figure 6(b)).

Fig. 6. Flow charts describing the processes of a switch (a) handling incoming packets on the data plane and (b) handling updates received from the control plane.

Under normal operation, a switch uses the flow table rules enforcing network policies to store and forward packets in the data plane. Upon receiving a packet that does not match any rule, a switch creates, signs, and sends an event indicating the mismatch to all controllers of its domain.

Upon receiving a network update from the control plane, the switch immediately signs it and echoes it back to the control plane so controllers can record all updates in the distributed ledger. The switch then stores the received message, containing an update and a controller signature, until the switch receives a quorum majority of identical updates from control plane members. Once enough messages are received, using the threshold signature aggregation function, the switch aggregates the signatures for the update and verifies the resulting signature against the public key for the control plane. The update is then either applied or ignored, depending on the validity of the signature. Finally, the switch sends a signed acknowledgment to all members of the domain control plane to alert them of the network update application.
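
A hedged sketch of this switch-side logic is given below, reusing the SwitchUpdate fields from the Section 3.1 sketch; crypto and io bundle placeholder primitives (sign, aggregate, verify, echo, apply_rule, send_ack) and are assumptions rather than an actual switch API.

```python
class SwitchUpdateHandler:
    """Switch-side handling of signed updates received from the control plane."""
    def __init__(self, tpk, quorum, crypto, io):
        self.tpk, self.quorum = tpk, quorum
        self.crypto, self.io = crypto, io
        self.shares = {}                          # update uid -> {controller id: signature share}

    def on_update(self, update, controller_id, share):
        # Immediately echo the signed update so controllers can record it in the ledger.
        self.io.echo(self.crypto.sign(update))
        collected = self.shares.setdefault(update.uid, {})
        collected[controller_id] = share
        if len(collected) < self.quorum:
            return                                # wait for a quorum of control plane members
        signature = self.crypto.aggregate(collected)
        if self.crypto.verify(self.tpk, update, signature):
            self.io.apply_rule(update.rule)       # install the flow table rule
            self.io.send_ack(self.crypto.sign(("ack", update.uid)))
        del self.shares[update.uid]               # applied or ignored either way
```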

Controller protocol. Figure 7 depicts the process for a controller when it receives an event (Figure 7(a)) or when agreement is reached on the ordering of events (Figure 7(d)).

Fig. 7. Flow charts for controller’s processes (a) handling incoming events, (b) handling echoed updates sent from the data plane, (c) detecting ledger inconsistencies, (d) handling updates to be sent to the data plane, and (e) aggregating updates from other controllers.

Under normal operation, controllers for a domain of switches are idle, waiting to receive signed events. Upon receiving an event, the source of the event is verified and the event is either broadcast to all members of the domain’s control plane or ignored if the event was previously processed or the event source cannot be verified.

Upon delivery of a broadcast event, each member of the control plane records the event in its local ledger and independently determines the necessary network updates and dependency sets in response to the event using the established network policies and the update scheduler. Network updates are signed with the controller’s private key share. Network updates for disjoint dependency sets are processed in parallel; network updates with no dependencies are immediately sent to the corresponding switch(es). As verified acknowledgments for applied updates are received, these updates are removed from dependency sets and additional updates with empty dependency sets are sent, in parallel, to switch(es). Since switches are assumed to be non-faulty, these received acknowledgments ensure forward progress in event processing despite loops in the protocol flow. In parallel, every signed update echoed by a switch is recorded in the ledger as depicted in Figure 7(b).

Inter-domain updates. If, thanks to the global domain policies, a controller determines that an event affects multiple domains, it forwards the event to a controller in each affected domain. The receiving controllers broadcast the event to all other controllers of their respective domains as with any validated event. To select a valid recipient, each controller maintains a set of active controllers in each other domain. This list is updated every time a controller is added to or removed from any other domain’s control plane (cf. Section 4.3). Furthermore, to prevent never-ending dissemination of the event, a forwarded event is tagged as such to indicate it should not be further forwarded to other domains and should only be processed locally.
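
A minimal sketch of this forwarding logic, assuming a hypothetical affected_domains() policy lookup, a per-domain directory of active peer controllers, and a forwarded tag on events; none of these names are SERENE’s actual API.

```python
def on_validated_event(event, local_domain, peers, policies, broadcast, forward):
    """Handle a validated event: broadcast locally, then forward to affected domains."""
    broadcast(event)                                       # local atomic broadcast as usual
    if event.forwarded:                                    # forwarded events are never re-forwarded
        return
    for domain in policies.affected_domains(event):        # derived from global domain policies
        if domain != local_domain:
            forward(peers[domain], event.tag_forwarded())  # any active controller of that domain
```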

4.2 Controller Aggregation

The SERENE protocol outlined in Section 4.1 specifically focuses on switches aggregating signatures. Optionally, controller aggregation may be used, in which a controller is assigned to be the aggregator both for receiving events from switches and for collecting and aggregating signed updates.

Aggregation process. The process for controller aggregation is depicted in Figure 7(e). Controllers, instead of sending signed updates to switches, send them to the designated aggregator. The aggregator collects signed switch updates, aggregates the signatures once a quorum has been received, and sends the update along with the aggregated signature to the corresponding switch. A switch receiving aggregated signatures merely verifies the update’s signature against the public key of the control plane and either applies or ignores the update. At any time, a controller may become faulty, including the aggregator. As such, switches must broadcast signed events to all controllers even when controller aggregation is used.

Aggregator selection. All controllers for a domain maintain a representation of the control plane communication group containing each controller’s identifier, public key, and any information needed for communication (e.g., IP address, port). As new controllers are added (cf. Section 4.3), they are given the next highest unused identifier. Identifiers are never reused, even when controllers leave the group. At any given time, the aggregator can be determined as the controller with the lowest identifier. Since all controllers in the domain have the same view of the communication group, this provides stability in the selection. Once an aggregator is determined, the control plane members inform switches by sending a signed message.
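
Because identifiers are never reused and all replicas share the same view of the communication group, aggregator selection reduces to a deterministic minimum, as the following small Python sketch illustrates (the group layout shown is hypothetical).

```python
def current_aggregator(communication_group):
    """Return the deterministic aggregator: the controller with the lowest
    identifier in the current communication group. `communication_group`
    maps identifiers to controller records; since identifiers are never
    reused, all replicas with the same view select the same aggregator."""
    return min(communication_group)


group = {4: "ctrl-D", 2: "ctrl-B", 7: "ctrl-E"}
assert current_aggregator(group) == 2   # ctrl-B is the aggregator
del group[2]                            # ctrl-B leaves the control plane
assert current_aggregator(group) == 4   # selection falls to the next lowest id
```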

4.3 Control Plane Membership Changes

The process for a domain’s control plane membership change is depicted in Figure 8. Due to the potential change in quorum size, both add and remove operations require the distribution of new private key shares.


Fig. 8. Flow charts for controller membership change: (a) and (b) show the processes for the bootstrap controller and the joining controller respectively, and (c) shows the controller process when a membership change consensus is reached.

General process. The SERENE protocol ensures that no events are processed until the membership change has completed, which prevents control plane members from having to keep old and new shares concurrently. A phase value records the current iteration of membership change and is incremented with each controller addition or removal. To ensure consistency in the control plane state, controllers are added and removed sequentially. Events broadcast to all domain controllers are tagged with the current phase. Thanks to the atomic broadcast, controllers queue events received during a change in control plane membership and only broadcast and process them after the phase has changed, as sketched below.
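
The following Python sketch illustrates this phase gate: events arriving during a membership change are queued and only broadcast, tagged with the new phase, once the change completes. The PhaseGate class and the broadcast callback are illustrative names; in SERENE the queuing follows from the single outstanding instance of atomic broadcast.

```python
from collections import deque


class PhaseGate:
    """Queues events that arrive while a membership change is in progress and
    releases them, tagged with the new phase, once the change completes.
    `broadcast` stands in for the atomic-broadcast primitive."""

    def __init__(self, broadcast):
        self.phase = 0
        self.changing = False
        self.pending = deque()
        self.broadcast = broadcast

    def on_event(self, event):
        if self.changing:
            self.pending.append(event)       # hold until the new phase starts
        else:
            self.broadcast(dict(event, phase=self.phase))

    def begin_membership_change(self):
        self.changing = True

    def complete_membership_change(self):
        self.phase += 1                      # one increment per add/remove
        self.changing = False
        while self.pending:
            self.broadcast(dict(self.pending.popleft(), phase=self.phase))


sent = []
gate = PhaseGate(broadcast=sent.append)
gate.on_event({"id": "e1"})
gate.begin_membership_change()
gate.on_event({"id": "e2"})                  # queued during consensus/DKG
gate.complete_membership_change()
assert [e["phase"] for e in sent] == [0, 1]
```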

Controller addition. The procedure to add a controller to the control plane is as follows: (i) the public keys for event originators and existing control plane members are distributed to the new controller alongside its identifier; (ii) the new controller is added to the control plane communication group through a round of consensus proposed by the bootstrap controller; (iii) DKG is executed to distribute signature shares to the new controller group, reflecting the new quorum size while ensuring that the threshold public key remains the same; (iv) the data plane state, the local network policies of the control plane, and the global domain policies are sent to the new controller.

SERENE uses a trusted bootstrap controller to manage additions to the control plane. It is the only control plane member that can initiate consensus rounds to add new controllers.

The final step requires updating all other domains to indicate the new controller as a valid recipient of forwarded events. Here, the bootstrap controller generates and signs an event containing the new controller’s communication information and forwards this to a member of each other domain. Each receiving domain, in parallel, processes the event as any other network event (e.g., atomically broadcasts the event to all members of the local domain). However, instead of sending network updates, a controller handles this event by updating its view of the sender’s control plane.

Controller removal. The procedure to remove a controller \( c_{} \) from the control plane is as follows: (i) \( c_{} \) is removed from the control plane communication group; (ii) DKG is executed to distribute signature shares to the controller group reflecting the new quorum size and ensuring that the threshold public key remains the same; (iii) switches are (potentially) assigned a new aggregator.

Removing the controller from the communication group is performed via a round of consensus proposed by a member that detects that the controller should be removed.

The final step requires updating all other domains to indicate that the removed controller is no longer a valid recipient of forwarded events. As with controller addition, an event is sent to a controller of each other domain. The event is in turn processed in parallel by each domain's control plane, where each controller updates its view of the sender's control plane.

Overhead. The overhead of a membership change involves an instance of atomic broadcast, for the control plane to agree on the membership change event, and an instance of DKG to distribute new key shares to the new control plane communication group. Our implementation uses BFT-SMaRt [14] for group membership management, which is based on established literature [86, 87]. If a faulty/malicious bootstrap controller sends repeated membership change messages, additional policies such as blacklists can be used to prevent repeated leaving and rejoining.

4.4 Controller Failure Detection

A controller suspected of failure, either by the heartbeat FD or after auditing the distributed ledger, is removed from the control plane as described in Section 4.3. Failures can optionally be reported to network administrators to help find the root cause of the failure. Thanks to the ledger, reports can contain the type of failure detected and all relevant information (e.g., events, update signatures).

The heartbeat FD functions in a straightforward manner: controllers set timeouts for heartbeat messages, and a crash is suspected when the associated timeout expires. As for the distributed ledger, it is periodically audited by all controllers following the failure detection policies that express suspicious controller behaviors (cf. Figure 7(c)). Examples of such suspicious behaviors include:

(1) an incorrect event is received, i.e., the ledger contains a record for a received event but no matching record for the decided event once the atomic broadcast has completed;

(2) muteness failure [19]: a controller broadcasts heartbeats but does not send updates, i.e., the ledger contains no update signed by this controller;

(3) fewer than a quorum of controllers send an update, i.e., the ledger contains between 1 and \( t-1 \) signatures from the \( (t, n) \)-threshold key for an update;

(4) a controller does not sign some updates or signs them in the wrong order, i.e., the ledger is missing some update signatures or contains a signed update before its dependencies.

The audit proper is performed on a snapshot of the ledger rather than on the ledger itself to avoid considering recent events and updates that may still be under deployment as this could lead to false positives in the detection. Once the detection policies have been executed on a snapshot, a new snapshot is taken and all the audited content may be discarded to reduce the storage footprint.
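
As an illustration of how such an audit could look, the following Python sketch applies simplified versions of the detection policies above to a ledger snapshot. The snapshot layout (per-controller received events, decided events, and signed updates) is an assumption made for the example and differs from SERENE's actual ledger entries; the ordering check of policy (4) is omitted for brevity.

```python
def audit_snapshot(snapshot, controllers, threshold):
    """Flag controllers whose recorded behavior matches the suspicious
    patterns of Section 4.4. Returns a set of (controller, reason) pairs."""
    suspects = set()
    decided = snapshot["decided_events"]
    for c in controllers:
        received = snapshot["received_events"].get(c, set())
        signed = snapshot["signed_updates"].get(c, set())
        if received - decided:
            suspects.add((c, "incorrect event"))   # policy (1)
        if not signed:
            suspects.add((c, "muteness"))          # policy (2)
    for upd, signers in snapshot["update_signers"].items():
        if 0 < len(signers) < threshold:
            for c in controllers - signers:        # policies (3)/(4): missing shares
                suspects.add((c, f"missing signature on {upd}"))
    return suspects


# Hypothetical snapshot: c2 reported an event that was never decided,
# c3 signed nothing and is also missing its share on update u1.
snapshot = {
    "decided_events": {"e1"},
    "received_events": {"c1": {"e1"}, "c2": {"e1", "e9"}, "c3": {"e1"}},
    "signed_updates": {"c1": {"u1"}, "c2": {"u1"}, "c3": set()},
    "update_signers": {"u1": {"c1", "c2"}},
}
print(audit_snapshot(snapshot, {"c1", "c2", "c3"}, threshold=3))
```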


5 SERENE PROTOCOL FORMALIZATION

In this section, we present the pseudocode for SERENE and prove it provides event-linearizability [22]: the execution of SERENE is indistinguishable from the correct sequential execution of a single controller enforcing network updates. We further analyze the security of SERENE.

5.1 Algorithms

The pseudocode for SERENE’s switch runtime is shown in Algorithm 1. The algorithm describes the handling of received packets and the transmission of events to the control plane. It also describes the details of processing switch updates, quorum authentication, and finally the sending of acknowledgments. For the purposes of the distributed ledger, switch updates must be echoed back to the control plane which is indicated through the sending of echo messages in the algorithm.

The controller implementation consists of multiple algorithms. Algorithm 2 describes the controller runtime for receiving events, reaching agreement on events, and sending switch updates. The additional functionality needed for controller aggregation is presented in Algorithm 3. Control plane membership change is shown in Algorithm 4, which describes the agreement needed for adding and removing a member from the control plane communication group as well as the generation of new secret key shares using DKG. Algorithm 5 presents the controller functionality used for recording entries into the distributed ledger. Finally, at a periodic interval, controller processes use the entries recorded in the ledger to detect failures following established policies; this failure detection, implementing the policies of Section 4.4, is presented in Algorithm 6.

In addition to the notation summarized in Table 3, the algorithms make use of several additional symbols summarized in Table 4, while a summary of message types is presented in Table 5. Note that we use \( \oplus \) to denote the concatenation of an element or another sequence to a sequence. Analogously, we use \( \ominus \) to remove an element or a set of elements from a sequence, at any position. Similarly, we use \( \in \) to assert whether an element is contained in a sequence at any position.

Symbol | Definition
Switch-specific notation
\( R_{} \) | Map record of previously received updates
T | Map of recorded switch update signature shares
Controller-specific notation
\( H_{} \) | Map history of previously received events and their corresponding network updates
\( P_{} \) | Sequence of pending network updates
\( A_{} \) | Map record of aggregated updates
\( ph_{} \) | Current phase of controller membership
\( id_{} \) | Controller identifier
Distributed ledger notation
\( ACK_{} \) | Set of received acknowledgments
\( L_{} \) | Local version of the distributed ledger
\( LS_{} \) | Ledger snapshot used for the detection
\( t_L \) | Failure detection interval
\( N_{} \) | Map of network updates sent by each controller

Table 4. Algorithm and Proof Notation

Symbol | Definition
\( \mbox{$\mathsf {EV}$} \) | Event
\( \mbox{$\mathsf {UPD}$} \) | Switch update
\( \mbox{$\mathsf {ACK}$} \) | Acknowledgment
\( \mbox{$\mathsf {BOOT}$} \) | Bootstrap new control plane member
\( \mbox{$\mathsf {ADD}$} \) | Add control plane member
\( \mbox{$\mathsf {LEAVE}$} \) | Leave the control plane
\( \mbox{$\mathsf {REM}$} \) | Remove control plane member
\( \mbox{$\mathsf {SETCA}$} \) | Change aggregator controller
\( \mbox{$\mathsf {ECH}$} \) | Echo of received switch update
\( \mbox{$\mathsf {EVR}$} \) | Ledger entry of event received
\( \mbox{$\mathsf {EVD}$} \) | Ledger entry of event decided

Table 5. Algorithm Message Types

5.2 Interfaces

The algorithms use the following application interfaces where functions are prefixed with _:

Rule installation:

\( \mathsf {\_apply}(r_{}) \), applies rule \( r_{} \) to the switch runtime.

Signature creation:

\( \mathsf {\_sign}(msg_{}, sk_{}) \), a function to sign a message (\( msg_{} \)) with given key (\( sk_{} \)).

Signature verification:

\( \mathsf {\_verifySig}(msg_{}, sig_{}, pk_{}) \), a function to verify a signature (\( sig_{} \)) for the given message (\( msg_{} \)) using the public key (\( pk_{} \)).

Signature aggregation:

\( \mathsf {\_aggSig}(\lbrace sig_{1}, \ldots \rbrace) \), a function to aggregate the signature shares.

Event generation:

\( \mathsf {\_generateEventData}(pkt_{}) = e_{} \), creates the necessary event data to be sent to the controller given packet data \( pkt_{} \).

Controller application invocation:

\( \mathsf {\_handleEvent}(\pi _{}, e_{}) \), returns the network state \( \pi _{}^{\prime } \) to be applied in “response” to an event \( e_{} \) in state \( \pi _{} \).

Update scheduler:

\( \mathsf {\_scheduleUpdates}(\pi _{1}, \pi _{2}) \), returns \( U_{} \), a network update (i.e., a set of switch updates) to transition the data plane state from state \( \pi _{1} \) to \( \pi _{2} \).

Update domain:

\( \mathsf {\_updateDomain}(e_{}) \), a function that uses the update scheduler and the global domain policies to determine the update domain for an input event.

Reliable unicast:

\( \mathsf {\_send}(msg_{}) \), used to send message \( msg_{} \) to a single target. Once a message is received, the callback \( \mathsf {\_receive}(msg_{}) \) is invoked on the target.

Agreement:

\( \mathsf {\_propose}(\lbrace c_{1}, \ldots \rbrace , msg_{}) \), used by a set of controllers \( \lbrace c_{1}, \ldots \rbrace \), to initiate an instance of consensus, by proposing message \( msg_{} \). Once consensus has been reached, controllers receive the outcome through the callback \( \mathsf {\_decide}(\lbrace c_{1}, \ldots \rbrace , msg_{}) \).

Distributed key generation (DKG):

\( \mathsf {\_DKGStart}(\mathcal {C}_{}, ph_{}, sh_{}) \), performs DKG using the communication group \( \mathcal {C}_{} \) in phase \( ph_{} \). The input share is \( sh_{} \), which ensures that the threshold key remains the same. All participating controllers receive the outcome of DKG through the callback \( \mathsf {\_DKGComplete}(sh_{}) \), which receives as input a new share \( sh_{} \) for each participating node that collectively verifies against the threshold public key. If a member of the communication group does not have an existing share, it does not participate in the initial rounds of DKG; however, it will still receive a share through the callback \( \mathsf {\_DKGComplete}(sh_{}) \). DKG maintains a phase value to ensure that previous protocol messages are ignored once an instance of the protocol completes. Controllers keep track of the current phase and input it to each instance of the protocol initiated by \( \mathsf {\_DKGStart}(\mathcal {C}_{}, ph_{}, sh_{}) \).

Heartbeat failure detector:

\( \mathsf {\_detectHBFailure}(c_{}) \) invoked as a callback when controller \( c_{} \) is suspected of failure by the heartbeat FD executed on the detecting controller.

5.3 Computational Model and Consistency Definitions

Here we present the computational model of SERENE, prove its correctness, and discuss the robustness of the SERENE protocol. In addition to the notation presented in Table 3, the proofs and discussion make use of the symbols shown in Table 6. We consider a full communication model in which each controller process may send messages to, and receive messages from, any other controller process or any switch. Switches communicate with each other solely for sending data plane traffic.

Symbol | Definition
\( \mathcal {E} \) | Execution of a network update
\( \mathcal {H} \) | Execution history
\( \lt _{\mathcal {E}} \), \( \lt _{\mathcal {H}} \) | Total order in \( \mathcal {E} \) or in \( \mathcal {H} \)
\( \pi _{i} \prec _{\mathcal {E}} \pi _{j} \) | Precedence of network states in \( \mathcal {E} \)
Q | Sequential history

Table 6. Summary of Proof Notation

Recall the notion of a network state, which intuitively specifies the state of the flow tables in data plane switches for forwarding packets across the network. A network state \( \pi _{} \) specifies the state (of flow tables) of each switch in the data plane. An event is initiated by a switch or a controller and results in a network update \( U_{} \) that modifies the network state, i.e., the flow tables, of some subset of switches. A network update consists of a set of switch updates.

Recall that a switch update is the modification of the flow table for a switch with the given rule. A step of a network update is a switch update \( u_{} \) of \( U_{} \) or a primitive (e.g., message send/receive, atomic actions on process memory state) performed during \( U_{} \) along with its response. A configuration of a network update specifies the state of each switch and the state of each controller process. The initial configuration is the configuration in which all switches have their initial flow table entries and all controllers are in their initial states. An execution fragment is a (finite or infinite) sequence of steps where, starting from the initial configuration, each step is issued according to the network update, and each response of a primitive matches the state resulting from all preceding steps.

Two executions \( \mathcal {E}_i \) and \( \mathcal {E}_j \) are indistinguishable to a set of control processes and switches if each of them takes identical steps in \( \mathcal {E}_i \) and \( \mathcal {E}_j \). We use the notation \( \mathcal {E}\cdot {\tilde{\mathcal {E}}} \) to refer to an execution in which the execution fragment \( \tilde{\mathcal {E}} \) extends \( \mathcal {E} \). A state \( \pi _{i} \) precedes another state \( \pi _{j} \) in an execution \( \mathcal {E} \), denoted \( \pi _{i} \prec _{\mathcal {E}} \pi _{j} \), if the network update for \( \pi _{i} \) occurs before the network update for \( \pi _{j} \) in \( \mathcal {E} \). If neither of the two states \( \pi _{i} \) and \( \pi _{j} \) precedes the other, we say that \( \pi _{i} \) and \( \pi _{j} \) are concurrent. An execution without concurrent states is a sequential execution. A network state is complete in an execution \( \mathcal {E} \) if the invocation event is followed (possibly non-contiguously) in \( \mathcal {E} \) by a completed network update; otherwise, it is incomplete. An execution \( \mathcal {E} \) is complete if every state in \( \mathcal {E} \) is complete. A high-level history \( \mathcal {H}_\mathcal {E} \) of an execution \( \mathcal {E} \) is the subsequence of \( \mathcal {E} \) consisting of the network state event invocations and network updates.

Definition 5.1

(Event-linearizability of Network Updates).

An execution \( \mathcal {E} \) is event-linearizable [22] if there exists a sequential high-level history Q equivalent to some completion of \( \mathcal {H}_\mathcal {E} \) such that (1) \( \prec _{\mathcal {H}_\mathcal {E}}\subseteq \prec _Q \) (state precedence is respected) and (2) \( \mathcal {H}_\mathcal {E} \) respects the sequential specification of states in Q. A network update is event-linearizable if every execution \( \mathcal {E} \) of the network updates is event-linearizable.

5.4 Event-linearizability of the SERENE Protocol

Theorem 5.2.

Every execution of the SERENE protocol provides event-linearizable network updates.

Proof.

The proof proceeds by iteration on the epochs associated with changes in the controller membership. Specifically, each epoch in an execution \( \mathcal {E} \) is characterized by a static set \( \mathcal {C}_{}= \lbrace \langle c_{1}, \ldots \rangle , \ldots , \langle c_{i}, \ldots \rangle \rbrace \) of controllers. In the following, we present the event linearizability of the SERENE protocol without using any controller aggregation.

Event linearizability for an execution in the first epoch. The application of a network state \( \pi _{i} \) in an execution \( \mathcal {E} \) begins with an event invocation by a switch \( s_{i} \in \mathcal {S} \) (Line 8 of Algorithm 1) followed by a network update performed by the procedure \( \mathsf {handleRule} \) in Line 35 of Algorithm 1. All steps performed by the state machines described by the pseudocode within these lines denote the lifetime of \( \pi _{i} \). Specifically, the lifetime of \( \pi _{i} \) in an execution \( \mathcal {E} \) starts with the invocation of the procedure \( \mathsf {sendEvent} \) (Line 8 of Algorithm 1) which sends a signed event to a controller to initiate the network update protocol. The proof proceeds by assigning a serialization point for a state which identifies the step in the execution in which the state takes effect. First, we obtain a completion of \( \mathcal {E} \) by removing every incomplete state from \( \mathcal {E} \). Henceforth, we only consider complete executions.

Let \( \mathcal {H} \) denote the high-level history of \( \mathcal {E} \) constructed as follows: firstly, we derive linearization points of procedures performed in \( \mathcal {E} \). The linearization point of any procedure op is associated with a message step performed within the lifetime of op. A linearization \( \mathcal {H} \) of \( \mathcal {E} \) is obtained by taking the last event performed within op as its linearization point. We then derive \( \mathcal {H} \) as the subsequence of \( \mathcal {E} \) consisting of the network state event invocations and network updates. Let \( \lt _\mathcal {E} \) denote a total order on steps performed in \( \mathcal {E} \) and \( \lt _\mathcal {H} \) a total order on steps in the complete history \( \mathcal {H} \). We then define the serialization point of a state \( \pi _{i} \); this is associated with an execution step or the linearization point of an operation performed within the execution of \( \pi _{i} \). Specifically, a complete sequential history Q is obtained by associating serialization points to states in \( \mathcal {H} \) as follows: for every complete network update in \( \mathcal {E} \), the serialization point is assigned to the last event of the loop in Line 35 of Algorithm 1.

Claim 1.

For any two states \( \pi _{i} \) and \( \pi _{j} \) in \( \mathcal {E} \), if \( \pi _{i} \prec _{\mathcal {H}} \pi _{j} \), then \( \pi _{i} \lt _Q \pi _{j} \).

Proof.

The proof immediately follows from the fact that the serialization point for a state \( \pi _{i} \) (and respectively \( \pi _{j} \)) is assigned to a step within the lifetime of \( \pi _{i} \) (and respectively \( \pi _{j} \)).□

Let \( Q^k \) be the prefix of Q consisting of the first k complete operations. We associate each \( Q^k \) with a set \( \pi _{}^k \) of states that were successfully completed in \( Q^k \). We show by induction on k that the sequence of state transitions in \( Q^k \) is consistent with the sequential state specification. The base case \( k=1 \) is trivial: only one state is sequentially executed.

Claim 2.

\( Q^{k+1} \) is consistent with the sequential specification of network updates.

Proof.

Let \( [U_{1}, \ldots , U_{n}] \) be the sequence of network updates where for all \( i\in \lbrace 1, \ldots , n \rbrace \), \( U_{i} \) is the network update for \( \pi _{i} \). Recall that each network update consists of \( \lbrace u_{1}, \dots , u_{m}\rbrace \): a set of switch updates. Suppose by contradiction that \( Q^{k+1} \) does not respect the sequential specification. The only nontrivial case to consider is that there exist two concurrent updates \( \pi _{i} \) and \( \pi _{j} \) in \( \mathcal {E}^{k+1} \) such that \( Q^{k+1} \) is not consistent with the sequential specification.

Note that if \( \pi _{i} \) precedes \( \pi _{k} \) according to the sequential specification, there does not exist \( i\lt j\lt k \) such that \( \pi _{i} \lt _Q\pi _{j} \lt _Q\pi _{k} \). Suppose by contradiction that such a \( \pi _{j} \) exists. Recall that every controller agrees on the output of the sequence of events in Line 9 of Algorithm 2. Consequently, the only reason for such a \( \pi _{j} \) to exist is if the last switch update of \( U_{j} \) precedes the first switch update of \( U_{k} \). But this is not possible because, by the assignment of serialization points, the outcome of \( \mathsf {\_propose} \) enforces the execution of \( \pi _{k} \) immediately after \( \pi _{i} \) and any other \( \pi _{j} \) will have to wait for the acknowledgment from successful completion of switch updates in \( \pi _{k} \) before starting its own switch updates. We now show that the state of the data plane as constructed in \( Q^{k+1} \) is consistent with the sequential specification. Specifically, we show that given any two network updates \( U_{i} \lt _QU_{j} \), the individual switch updates within each are not interleaved. Since every switch update performed in \( U_{i} \) (and respectively \( U_{j} \)) is applied only if it has been received from a quorum of trusted controllers, we only consider the case where a switch update associated with \( U_{j} \) is executed prior to the last switch update performed in \( U_{i} \). However, as described in Line 41 of Algorithm 2, the switch updates for \( U_{j} \) are not sent until acknowledgments for all updates in \( U_{i} \) have been received.□

Claims 1 and 2 together establish that \( \mathcal {E} \) is event-linearizable.

Extending the proof to arbitrary executions. To complete the proof, we show that the execution \( \mathcal {E}\cdot \mathcal {\tilde{\mathcal {E}}} \) is event-linearizable, where \( \mathcal {C}_{} \) and \( \tilde{\mathcal {C}_{}} \) are not necessarily related by containment (here \( \mathcal {C}_{} \) and \( \tilde{\mathcal {C}_{}} \) are the sets of controllers in \( \mathcal {E} \) and \( \mathcal {\tilde{\mathcal {E}}} \), respectively). A phase value records the current iteration of membership change, uniquely defines the controller membership set, and is incremented with each controller addition or removal. Each phase change is initiated by a membership change proposal: \( \mathsf {addController} \) in Algorithm 4 and \( \mathsf {handleFailure} \) in Algorithm 5. Observe that both membership changes and event proposals are processed using the same agreement protocol. By the nature of this protocol, only a single instance of agreement can be performed at a time. As such, no events are processed until after the membership change has been completed, which prevents control plane members from having to keep old and new signature shares concurrently. Concurrent events received from the data plane are queued and not executed until after the instance of agreement has been completed, in which case the execution fragment extending the phase-1 execution extends a well-defined data plane state, as proved in Claim 2.□

5.5 Security Analysis of the SERENE Protocol

We argue why even the forward progress of SERENE is not affected by faulty/malicious controllers with respect to our threat model (Section 3.1).

We remark that Theorem 5.2 holds even if a faulty/malicious controller eavesdrops on the communication between switches in the data plane, between controllers in the control plane, and/or between switches and controllers. Note that eavesdropping allows a malicious controller to gain knowledge of network data, therefore giving an adversary the ability to record events and/or updates. However, in SERENE, events and updates are not assumed to be confidential. The risk stemming from this assumption is limited to an adversary modifying and/or replaying transmitted messages. Consequently, we consider the following threats and explain how SERENE mitigates them.

Adversarial events:

A faulty/malicious controller may modify or create a network event; however, a valid event is signed with the source's secret key. The public keys of valid event sources are distributed to all controllers. Therefore, except with negligible probability, valid events cannot be created by any process other than verified sources. Furthermore, event sources in our threat model remain correct and therefore never create and sign incorrect events.

Adversarial switch updates:

A faulty/malicious controller may send any arbitrary update to a switch; however, the update must be verified against the control plane threshold public key. Utilizing the guarantees of DKG, except with negligible probability, valid updates cannot be created by any process other than a quorum of controllers. Existing research has shown attacks that attempt to force malicious updates to be applied at the controller application level [88]. However, for those attacks to be effective against the guarantees of DKG used by SERENE, the attack must be performed by a quorum of controller processes. If a switch receives an update signed by fewer than a quorum of controllers, the verification of the update signature fails and the update is discarded.

Duplicated events:

A faulty/malicious controller may resend any previously sent event; however, all events are given a unique identifier and duplicate events are ignored by the control plane.

Duplicated switch updates:

A faulty/malicious controller may resend any previously sent update; however, all updates are given a unique identifier and duplicate updates, even those with valid signatures from an aggregator controller, are ignored by the data plane.

Adversarial/duplicated switch updates—cross-domain:

While switch updates are given a unique identifier, this identifier is only unique within the domain. A faulty/malicious controller may observe and replay an update to a switch from another domain. However, the control plane of each domain is given a unique threshold public key. Except with negligible probability, an update sent from another domain is therefore never validated or applied by a switch.


6 SERENE IMPLEMENTATION

This section outlines the implementation of SERENE. As Figure 9 shows, SERENE is implemented as a middleware layer between the controller application, which contains the network policies, and the data plane switches, which store and forward network traffic based on established flow table rules.


Fig. 9. Depiction of the SERENE runtime components on controllers and switches.

6.1 Control Plane Components

The controller platform is extended with a Java layer for SERENE, which processes the received events (e.g., signature verification, broadcast) and updates sent to the data plane (e.g., signing with secret share, ordering updates, and handling acknowledgments). Another process in the Java layer handles signature aggregation to be sent to the data plane when controller aggregation is used. A controller is made up of the following nine components:

Controller application:

Network policies are set based on the controller application. While SERENE is designed as a separate layer to support any controller application, our implementation uses the Ryu [21] runtime and establishes flow rules based on shortest path routing.

Global domain policies:

SERENE requires global domain policies for determining network updates for flows that cross domains. The implementation is specific to the controller application. Our implementation uses global policies based on the shortest path between domains.

Update scheduler:

To ensure update consistency, the SERENE runtime depends on the existence of an update scheduler used to determine dependencies between network updates. The update scheduler used for the evaluation assigns dependencies for network updates based on the reverse of a network flow's path. For example, consider a network flow that traverses three switches (\( s_1 \rightarrow s_2 \rightarrow s_3 \)). Establishing this flow requires updating all of these switches. The update scheduler assigns dependencies for these updates such that (1) all updates are applied to \( s_3 \) before any updates to \( s_2 \) can be applied, and (2) all updates are applied to \( s_2 \) before any updates to \( s_1 \) can be applied. This ensures downstream rules for the flow are set before any network data is allowed to traverse the network (a minimal sketch of this assignment is given after this component list).

Broadcast library:

SERENE utilizes atomic broadcast to distribute events among the members of the control plane communication group. The broadcast library strictly follows atomic broadcast’s specifications and guarantees [78], by using the BFT-SMaRt library [14].

Threshold signatures:

Data plane switches authenticate updates with threshold signatures that can only be verified when a quorum of signatures is formed. Our implementation makes use of BLS signatures [89] implemented in the Pairing Based Cryptography library [23].

Private key share distribution:

The distribution of private shares for controllers—so they can sign switch updates—is performed using the DKG library [18].

Southbound interface:

We extend the OpenFlow message protocol with new message types for signed messages, and add unique identifiers to messages to prevent duplicate processing of events and updates. We also utilize TLS with OpenFlow to ensure integrity and confidentiality of communication between the data plane and the control plane.

Signature aggregation:

SERENE supports switch and controller aggregation. For the latter, switches are assigned the aggregator with OpenFlow “master/slave role request” messages [90].

Failure detector:

We use periodic heartbeat messages to detect crash failures; they are sent using the broadcast library. The distributed ledger implements Algorithms 5 and 6.
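
Referring back to the update scheduler component above, the following Python sketch shows the reverse-path dependency assignment for the \( s_1 \rightarrow s_2 \rightarrow s_3 \) example. The function name and data shapes are illustrative only; the output pairs each switch update with the set of updates it depends on.

```python
def schedule_reverse_path(path, rules):
    """Assign dependencies so a flow's rules are installed from the last hop
    backwards: the update for switch s_i depends on the update for s_{i+1}.

    `path` is the ordered list of switches traversed by the flow and `rules`
    maps each switch to the rule to install. Returns (updates, deps), where
    deps maps an update identifier to the updates it depends on.
    """
    updates = {f"u-{sw}": (sw, rules[sw]) for sw in path}
    deps = {}
    for i, sw in enumerate(path):
        nxt = path[i + 1] if i + 1 < len(path) else None
        deps[f"u-{sw}"] = {f"u-{nxt}"} if nxt else set()
    return updates, deps


updates, deps = schedule_reverse_path(
    ["s1", "s2", "s3"],
    {"s1": "fwd->s2", "s2": "fwd->s3", "s3": "fwd->host"},
)
assert deps == {"u-s1": {"u-s2"}, "u-s2": {"u-s3"}, "u-s3": set()}
# s3 is installed first, then s2, then s1, so downstream rules always exist
# before traffic can follow the new path.
```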

6.2 Data Plane Components

The SERENE switch platform is an extension to Open vSwitch (OVS) that performs signature aggregation and verification of updates, both relying on the threshold public key component. The signature aggregation module stores signed updates in a hash map provided within the OVS implementation. The management of received rules and signatures consists of \( \approx \)600 lines of code (LOC). The threshold public key component consists of a \( \approx \)300 LOC extension to OVS that utilizes the pairing based cryptography (PBC) library [23] for the creation and verification of signatures. OVS uses a single function for handling events from the control plane; the SERENE extension injects code into this function to redirect received events to the signature verification module.
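
To clarify the switch-side logic, the following Python sketch mirrors what the OVS extension does in C: shares are stored per update identifier, aggregated once a quorum is reached, verified against the control plane's threshold public key, and only then applied, echoed, and acknowledged. The agg_sig, verify_sig, apply_rule, and send parameters stand in for the \( \mathsf {\_aggSig} \), \( \mathsf {\_verifySig} \), and \( \mathsf {\_apply} \) interfaces and the southbound channel; they are placeholders, not the real PBC-based routines.

```python
class SwitchVerifier:
    """Per-update handling on a switch: collect shares, aggregate at quorum,
    verify against the threshold public key, then apply, echo, and ack."""

    def __init__(self, threshold, threshold_pk,
                 agg_sig, verify_sig, apply_rule, send):
        self.threshold = threshold
        self.pk = threshold_pk
        self.agg_sig, self.verify_sig = agg_sig, verify_sig
        self.apply_rule, self.send = apply_rule, send
        self.shares = {}      # update id -> list of received signature shares
        self.applied = set()  # update ids already applied (duplicates ignored)

    def on_update(self, update_id, rule, share):
        if update_id in self.applied:
            return
        self.shares.setdefault(update_id, []).append(share)
        if len(self.shares[update_id]) < self.threshold:
            return
        signature = self.agg_sig(self.shares[update_id])
        if not self.verify_sig((update_id, rule), signature, self.pk):
            return                                  # not quorum-signed: discard
        self.apply_rule(rule)
        self.applied.add(update_id)
        self.send(("ECH", update_id, signature))    # echo for the ledger
        self.send(("ACK", update_id))               # acknowledgment for consistency


# Usage with dummy crypto stand-ins: the rule is applied after the second share.
log = []
sv = SwitchVerifier(
    threshold=2, threshold_pk="PK",
    agg_sig=lambda shares: "+".join(shares),
    verify_sig=lambda msg, sig, pk: sig.count("+") + 1 >= 2 and pk == "PK",
    apply_rule=lambda rule: log.append(("applied", rule)),
    send=log.append,
)
sv.on_update("u1", "fwd->s2", "share-c1")
sv.on_update("u1", "fwd->s2", "share-c2")
assert ("applied", "fwd->s2") in log
```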

Additionally, changes are made for switches to either send events only to the aggregator controller if there is one, or multicast events to all the members of the control plane. As a further consistency mechanism, acknowledgments are sent to the control plane once updates are applied.

As Figure 9 shows, the switch runtime is considerably simpler than the controller runtime. We specifically designed SERENE to minimize the resource impact (both memory footprint and executable size) on switches because of their limited capabilities. Our implementation, being an extension of OVS, may function on any switch able to run this software package.


7 SECURE TOPOLOGY DISCOVERY

In many cases, to make accurate network policy decisions, it is essential to have a correct method for discovering the data plane state. This allows a network controller to determine optimal provisioning of network resources to flows as well as to discover link and/or switch failures. However, there are a number of attack vectors in the OpenFlow discovery protocol (OFDP), as discussed by Azzouni et al. [28]. To prevent these attacks, we implemented a secure topology discovery layer with SERENE. While our computational model assumes switches themselves are not malicious, a faulty/malicious controller could masquerade as a switch and send erroneous information to the control plane. Without protection, such information may corrupt the control plane's view of the data plane state.

7.1 Discovery Process

Topology discovery is twofold: switch discovery, as part of OpenFlow connection setup, uses pairs of “feature request” and “feature response” messages while link discovery uses OFDP [24], based on the link layer discovery protocol (LLDP) [91].

The algorithm for OFDP secured with SERENE is described in Algorithm 7. Highlighted portions indicate where OFDP integrates with SERENE to utilize the security mechanisms of the protocol. The algorithm makes use of the following functions to build network messages.

Create link layer discovery protocol (LLDP) message:

\( \mathsf {\_createLLDPMsg}(port) \), a function to create an LLDP frame with the given source port as described by the protocol [91].

Create output action:

\( \mathsf {\_OUTPUT}(port) \), instruct a switch to take the action to send a packet through the specified port.

Retrieve OpenFlow message type:

\( \mathsf {\_type}(msg) \), a function to retrieve the OpenFlow message type from a given message.

Create packet out message:

\( \mathsf {\_createPktOut}(action, data) \), a function to create an OpenFlow PacketOut message with the given action and payload data.

Create flow modify message:

\( \mathsf {\_createFlowMod}(match, action) \), a function to create an OpenFlow FlowMod message with the given flow table match data and action.

Create feature request message:

\( \mathsf {\_createFeatureRequest}() \), a function to create an OpenFlow FeatureRequest message.

Formally, the discovered topology is maintained by the controller as a graph \( G_{}= \langle V_{}, E_{}\rangle \), where \( V_{} \) is the set of vertices (switches and hosts), and \( E_{} \) is the set of edges consisting of a set of 3-tuples \( (s, t, p) \), where s is the source, t is the target, and p is the port identifier on the source through which traffic must be sent in order to reach t. A source s or target t may be a switch (datapath) identifier or a port hardware address. If p is the special value \( \bot \), then the source and the target belong to the same device; this is the case, for instance, when the source is a switch identifier and the target is a port hardware address for a port in the same switch.
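
A minimal sketch of this graph representation is given below in Python; the class and method names are illustrative, and None stands in for the special value \( \bot \).

```python
class TopologyGraph:
    """Discovered topology G = (V, E): V holds switch (datapath) identifiers
    and port hardware addresses; E holds (source, target, port) 3-tuples,
    with port = None playing the role of the special value for entries whose
    source and target belong to the same device."""

    def __init__(self):
        self.vertices = set()
        self.edges = set()

    def add_switch(self, dpid, ports):
        """`ports` maps port identifiers to port hardware addresses,
        as reported in a FeatureResponse."""
        self.vertices.add(dpid)
        for port_id, hw_addr in ports.items():
            self.vertices.add(hw_addr)
            self.edges.add((dpid, hw_addr, None))   # same device: port is None

    def add_link(self, src, dst, src_port):
        """Record a link discovered via LLDP from src's port to dst."""
        self.edges.add((src, dst, src_port))


g = TopologyGraph()
g.add_switch("dpid-1", {1: "aa:aa"})
g.add_switch("dpid-2", {1: "bb:bb"})
g.add_link("aa:aa", "bb:bb", 1)
assert ("dpid-1", "aa:aa", None) in g.edges and ("aa:aa", "bb:bb", 1) in g.edges
```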

7.2 Switch and Link Discovery

As part of an OpenFlow connection setup, when a switch connects to a controller, the controller sends a \( \mathsf {FeatureRequest} \) message to the switch. The switch responds with a \( \mathsf {FeatureResponse} \) containing its switch (datapath) identifier, and a list of physical ports. Each physical port entry contains the port identifier and corresponding port hardware address. Switch discovery establishes entries in \( V_{} \). An entry is created for each switch identifier and each port hardware address. Once a switch is discovered, a controller sets a flow table entry instructing the switch to forward all received LLDP frames to the controller as \( \mathsf {PacketIn} \) events.

At regular intervals, for all discovered switches, the controller sends \( \mathsf {PacketOut} \) messages containing LLDP frames as payload to be sent out of each switch port. When the switch on the other end of the link receives the LLDP frame, using the forwarding rule set during switch discovery, it encapsulates the LLDP frame in a \( \mathsf {PacketIn} \) event and forwards it to the controller. The LLDP frame contains the port hardware address of the sending switch, while the \( \mathsf {PacketIn} \) event contains the port identifier and hardware address of the receiving switch. Using this information, the controller creates an entry in \( E_{} \) for the discovered link endpoints.
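
The following Python sketch illustrates one discovery round from the controller's perspective: building the LLDP probes to send out of every known port, and deriving an edge of \( E_{} \) from the echoed frame. The helper functions mirror \( \mathsf {\_createLLDPMsg} \), \( \mathsf {\_OUTPUT} \), and \( \mathsf {\_createPktOut} \) from Section 7.1 but use plain dictionaries rather than actual OpenFlow encodings; in SERENE these messages additionally pass through the signing and verification steps of Algorithm 7.

```python
def lldp_probe_messages(discovered_switches):
    """Build the PacketOut messages emitted in one discovery round: one LLDP
    frame per port of every discovered switch. `discovered_switches` maps a
    datapath id to {port identifier: port hardware address}."""
    def create_lldp_msg(port_hw):
        return {"src_port_hw": port_hw}

    def output(port_id):
        return {"action": "OUTPUT", "port": port_id}

    def create_pkt_out(action, data):
        return {"type": "PacketOut", "action": action, "data": data}

    return [
        (dpid, create_pkt_out(output(port_id), create_lldp_msg(port_hw)))
        for dpid, ports in discovered_switches.items()
        for port_id, port_hw in ports.items()
    ]


def edge_from_packet_in(lldp_payload, recv_port_id, recv_port_hw):
    """Derive the E entry for a discovered link. The echoed LLDP frame names
    the sender's port hardware address; the PacketIn names the receiver's
    port. The entry is oriented from the receiver: sending traffic out of
    recv_port_id reaches the sender's port."""
    return (recv_port_hw, lldp_payload["src_port_hw"], recv_port_id)


probes = lldp_probe_messages({"dpid-1": {1: "aa:aa"}, "dpid-2": {1: "bb:bb"}})
assert len(probes) == 2
# dpid-2 echoes the frame it received from dpid-1's port aa:aa on its port 1.
assert edge_from_packet_in({"src_port_hw": "aa:aa"}, 1, "bb:bb") == ("bb:bb", "aa:aa", 1)
```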


8 SERENE EVALUATION

We here show how the strong guarantees for consistent, secure, and reliable updates in SERENE can be achieved with little overhead in practical networked environments. We also show how aggregation and multi-domain parallelism reduce that cost. Lastly, we evaluate SERENE's secure OFDP.

8.1 Experimental Methodology

We evaluate SERENE against existing update frameworks in typical business-like environments. Specifically, we compare a centralized controller, a crash-tolerant update protocol in which communication within the control plane is performed using a crash-tolerant broadcast with no update authentication on switches, and the SERENE update protocol, on a single-domain setup with and without aggregation on controllers (cf. Section 8.2) and on a multi-domain setup (cf. Section 8.3).

Setting. We executed the implementation detailed in Section 6 on a network simulated atop compute nodes from the DeterLab test framework [92, 93] connected via a 1 Gb/s test network. Nodes ran Ubuntu 18.04.1 LTS with kernel 4.15.0-43 on two Intel® Xeon® E5-2420 processors at 2.2 GHz, with 24 GB of RAM and a SATA-attached 256 GB SSD. Controllers each had their own node; switches and hosts were node-sharing OpenVz [94] instances.

Topology. We simulated the Facebook data center topology [95] in which data centers are divided into server pods (as depicted in Figure 10) consisting of 40 racks of compute servers. Each rack contains a top-of-rack switch connecting all servers in the rack. Each top-of-rack switch is connected to four edge switches that provide high-speed bandwidth and redundancy between racks. Edge switches connect multiple pods to spine switches (not shown in Figure 10) linked to the upstream network. Rack hosts and the top-of-rack switches were simulated using OpenVz images on a single physical node. Edge and spine switches were each collectively simulated on their own physical node, i.e., one physical node per switch type.


Fig. 10. Depiction of a pod in a Facebook data center [95] spanning racks and two switch layers.

For larger evaluations on multiple data centers, we combined the upstream spine switches of the data center server pods through backbone switches using topologies documented by the Internet Topology Zoo [96], specifically Abilene and Deutsche Telekom. In our evaluation, we set the latency of network links between data centers to five times that of links within a data center.

Workloads. To evaluate flow completion rates, we ran Hadoop MapReduce and web server traffic workloads with parameters as described in [44] over the given topology and measured their flow completion times according to the shortest path routing policy used by the controller application. We used 5,000 flows per framework, arriving following a Poisson distribution, with the average packet sizes and total flow sizes for inter-rack, intra-data center, and inter-data center traffic defined for each workload. Table 7 summarizes the average sizes for packets and flows for each workload. Our tested workloads focus on flow creation from data plane requests. While SERENE supports a dynamic control plane, the load induced by requests for establishing new flows outweighs the overhead from churn in the control plane.

Workload | Flow locality | Avg packet size (B) | Avg flow size (kB) | Flow arrival rate (flow/s)
Hadoop | 87% intra-rack | 250 | 100 | 500
Hadoop | 13% inter-rack | | 0.5 |
Web server | 88% intra-rack | 175 | 1 | 500
Web server | 12% inter-rack | | |

Table 7. Parameters of the Hadoop MapReduce and Web Server Traffic Workloads [44]

To evaluate data plane state discovery, we used a topology discovery workload based on OFDP [24].

Creating routes. Unless explicitly stated otherwise, rules in flow tables are reused for multiple flows. Flow tables in switches initially contain no forwarding rules. As flows enter the network, events for unroutable packets are generated by switches and sent to the control plane. Controllers respond with network updates sent to switches to establish rules for the flows. As flows complete, these rules remain in switch flow tables and are reused by later flows matching them. As reported in [44], for Hadoop workloads 99.8% of the traffic originating from Hadoop nodes is destined for other Hadoop nodes in the cluster. Reusing rules therefore requires fewer overall events, as switches do not need to contact the control plane for each new flow.

8.2 Single-domain Evaluation

In the following, we used a single server pod topology with a control plane made up of four controllers, which tolerates one failure and results in a quorum size of three. This control plane size is similar to the ones evaluated in related work [9, 22, 52].

Flow completion time. Figures 11(a) and (b) show flow completion times for the Hadoop and web server workloads, respectively. Setting up a flow takes \( \approx \)2.9 ms on average for a centralized controller and \( \approx \)4.3 ms for a crash fault-tolerant replicated control plane. SERENE is slower due to the extra messaging and takes \( \approx \)8.3 ms without and \( \approx \)11.6 ms with controller aggregation for flow setup. However, flow rules are reused for future arriving flows since they are not removed from switches once established. Therefore, after initial flow setup, SERENE's overhead is negligible. Note that flows are only really transmitted once connections are set up at the application level. This is typical for TCP/IP, which is used here, as in most SDN scenarios. If applications started transmitting immediately, many packets would be dropped almost inevitably until paths are established, regardless of SERENE's overheads. However, we never observed any failure in connection establishment caused by the increased setup time, despite relying on default parameters only.


Fig. 11. SERENE performance on a single-domain network comparing a centralized solution to a control plane, made of 4 controller replicas, that uses either a crash-tolerant update protocol, SERENE without/with controller aggregation. (a) and (b) depict the cumulative distribution function (CDF) of Hadoop and web server flow completion times, respectively. (c) depicts the CDF of Hadoop flow completion times when routes are removed upon flow completion. (d) depicts the CPU utilization of OVS during a Hadoop workload without/with echoed updates in SERENE.

Unamortized flow creation. To further investigate the overhead of SERENE, we ran the Hadoop workload using a setup/teardown approach. In this approach, no flow rules for routes are initially set in the data plane. Each flow is managed by a pair of events to inform the control plane to set the route for the flow before it starts, and clear the flow rules for the route once the flow is completed, hence preventing overhead amortization. Each event results in appropriate network updates. The setup/teardown approach is applicable in hosted networks such as those utilizing subscription-based services.

The average flow completion times are depicted in Figure 11(c). For Hadoop flows, lasting \( \approx \)33.6 ms on average, SERENE has an overhead of 16% with switch aggregation and 29% with controller aggregation over the centralized approach. Setup times are constant regardless of overall flow duration. Since these setup times are the same for all flows, SERENE's overhead, noticeable with these short-lived flows, would be dwarfed by the total execution time of longer running flows.

Switch resource usage and verification rate. To reduce switches’ CPU utilization, update signatures can be aggregated on the control plane at the cost of increased latency (cf. Figure 11(c)). Figure 11(d) depicts OVS CPU utilization on switches for the Hadoop workload. While SERENE signature verification increases CPU utilization on switches, controller aggregation halves switch CPU usage. Having switches aggregating signatures themselves did not result in an increased latency in the processing of updates. Similarly, having switches echoing updates back to the control plane for the purpose of recording them in the ledger (cf. Section 3.4) only incurred a minimal CPU utilization overhead. To further test switch load we measured the rate at which switches can verify message signatures. In our environment, a single switch is able to process on average \( \approx \)1,163 message signatures per second. This value is well within acceptable limits considering our characteristic workloads have a flow arrival rate of 500 flows per second on average (cf. Table 7).

8.3 Multi-domain Evaluation

As discussed in Section 3.5, SERENE provides a means to logically divide the data plane into separate network domains each with its own separate control plane. Events generated within a domain requiring updates solely to the data plane contained in the domain, i.e., local events, can be processed independently of other domains’ local events. As we will show shortly, this separation can reduce the load on the control plane(s) and improve scalability. This separation is particularly useful in the face of large networks that share the same large control plane for simplicity. We first evaluate the cost of various control plane sizes to display the benefit for multiple domains.

Control plane size. While increasing the control plane membership size allows for more controllers to be faulty, providing additional robustness, it also results in additional messaging for broadcasting events as well as an increased latency, both of which increase the overhead of updates. To examine this overhead we performed a series of updates with control plane sizes varying up to 10 members.

The results in Figure 12(a) depict the average time to perform a switch update for an event depending on the size of the control plane. A control plane size of one represents an unprotected centralized control plane. As expected, increasing the control plane size with SERENE increases update time due to the extra messaging needed for broadcast and verification of aggregated signatures. The crash-tolerant update approach is less impacted by the size of the control plane since switches do not authenticate updates; the additional overhead is merely due to extra messaging.


Fig. 12. SERENE performance for multi-domain networks. (a) depicts the average time to apply switch rules in a domain for a varying sized control plane. (b) depicts the comparison of events processed by each controller in a pod configured as single vs. multi-domain. (c) depicts the CDF of Hadoop flow completion times for both single and multiple domains. The single domain is made of 12 controller replicas while the multi-domain consists of three domains each with four controller replicas (i.e., 12 controllers in total). (d) depicts the CDF of web server flow completion times for a larger multi-data centers topology.

With SERENE, the overhead for a single switch update can be significant for a large control plane, e.g., 2.5\( \times \) that of a centralized approach when using 10 controllers to support three failures. However, in a data center environment, such a large control plane might be excessive as failures are typically short-lived and failed controllers are quickly replaced with new correct ones. For instance, tolerating 2 concurrent failures is enough to achieve 99.999% of up-time [97]. Further, splitting the network into disjoint domains may help reduce the overhead inherent to a growing control plane.

Event locality. We next investigated how increasing the number of domains within a single pod affects event processing. Due to the locality of flows as reported by Facebook [44], only 5.8% of the Hadoop workload and 31.6% of the web server workload required processing by multiple domains.

Figure 12(b) shows the percentage of total events (for the whole data center) that must be processed by each control plane. For a single network domain, all events must naturally be processed by the single control plane. As the number of domains increases, the number of events processed by each domain’s control plane is greatly reduced, however with diminishing returns. While this evaluation shows the gains achievable using multiple domains for one pod, it is more practical to increase the size of the network by adding more pods. To that end, we next evaluated the impact of event locality by increasing the number of pods in the data center with one domain per pod.

Multi-domain flow completion time. We executed the Hadoop workload using two server pods, each set into its own domain, with a third domain (containing 4 redundant switches) used to interconnect them. Each domain's control plane consisted of four controller replicas, resulting in 12 replicas for the entire network. We compared this setup to the same network topology with a single domain and a control plane of 12 replicas.

Figure 12(c) shows flow completion time using SERENE in the single and multi-domain (MD) setup, with and without controller aggregation. Thanks to their locality, most events are processed in parallel when using multiple domains, thus greatly reducing flow completion time compared to a single domain. While flows crossing domains incur an additional overhead, an efficient domain architecture can reduce their number.

Multiple data centers. Our final multi-domain evaluation involved pods located in multiple data centers following Deutsche Telekom’s topology as documented by the Internet Topology Zoo [96]. Each data center consisted of four pods interconnected via the spine and edge switches as described in the Facebook data center topology [95]. Each pod was set as its own domain for SERENE, while a single controller was used for the entire network (all data centers) for the centralized approach. We evaluated the completion time of web server flows taking into account their locality as reported by Facebook [44]: 15.7% traverse pods within the same data center and 15.9% traverse data centers.

The results depicted in Figure 12(d) show that the centralized controller suffers from the increased latency for the establishment of flows across data centers. However, SERENE does not suffer from this increased latency thanks to domain parallelism and hence performs better than the centralized approach, unlike the single-domain setup, while being much more secure. These results exhibit the benefits of parallelism even under the web server workload (with 15.7% + 15.9% crossing flows) that has far fewer local events than the Hadoop one (3.3% + 2.5%).

8.4 Topology Discovery Evaluation

Here we evaluated the time to discover all switches and links using SERENE's secure OFDP as described in Section 7 for the Abilene topology depicted in Figure 13(a). This topology represents the backbone created by the Internet2 community in the U.S. [25]. The results are shown in Figure 13(b). SERENE exhibits an average discovery time of 1.45 seconds, and 1.48 seconds when controller aggregation is used, compared to a discovery time of 1.3 seconds for a centralized controller. This results in an overhead of 11.5%, and 13.8% with controller aggregation. The overhead directly affects the control plane's responsiveness to changes in topology (e.g., link and/or switch failures). Given that topology discovery is an ongoing process executed at a fixed interval, this overhead is tolerable.


Fig. 13. Network depiction and results for SERENE secure topology discovery. (a) depicts the connectivity of the Abilene network topology. (b) depicts the time for the control plane to discover the network topology using a centralized, crash tolerant, and SERENE based control plane.


9 CONCLUSIONS

We present SERENE, a practical construction for secure and reliable network updates that ensures consistency, thanks to an update scheduler that reduces ordering constraints by exploiting update parallelism through dependency analysis, and scalability to large networks through update domains. Threshold cryptography and distributed key generation allow the data plane to verify updates and the control plane membership to change flexibly, while minimizing switch instrumentation. SERENE's control plane is resilient to a dynamic adversary by employing a failure detector that combines heartbeats to detect controller crashes with a distributed ledger to detect (potentially transient and malicious) failures based on the outputs of controllers (e.g., muteness failures [19]). We provide an algorithmic formalization of SERENE and prove its safety with regards to event-linearizability. We further present how SERENE integrates with the OpenFlow discovery protocol to propose a novel secure data plane topology discovery protocol. We show that SERENE provides consistency, security, and reliability with minimal overhead on flow completion time through extensive analysis using a functional Facebook data center topology with characteristic workloads. Additional optimizations using controller aggregation reduce the load on data plane switches.

In future work, we plan to alleviate the assumption that switches remain correct and investigate protection mechanisms against policy-related faults from the data plane. We also plan to investigate dynamic policies across multiple domains as well as domains distributed across multiple autonomous systems (ASs).

REFERENCES

[1] Ratul Mahajan and Roger Wattenhofer. 2013. On consistent updates in software defined networks. In Proceedings of the 12th ACM Workshop on Hot Topics in Networks. 7 pages.
[2] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer. 2013. Achieving high utilization with software-driven WAN. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM. 15–26.
[3] Mark Reitblatt, Nate Foster, Jennifer Rexford, Cole Schlesinger, and David Walker. 2012. Abstractions for network update. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication. 323–334.
[4] Sebastian Brandt, Klaus-Tycho Foerster, and Roger Wattenhofer. 2017. Augmenting flows for the consistent migration of multi-commodity single-destination flows in SDNs. Pervasive and Mobile Computing 36 (2017), 134–150. Special Issue on Pervasive Social Computing.
[5] Long Luo, Hongfang Yu, Shouxi Luo, and Mingui Zhang. 2015. Fast lossless traffic migration for SDN updates. In Proceedings of the 2015 IEEE International Conference on Communications. 5803–5808.
[6] Klaus-Tycho Foerster and Roger Wattenhofer. 2016. The power of two in consistent network updates: Hard loop freedom, easy flow migration. In Proceedings of the 25th International Conference on Computer Communication and Networks. 1–9.
[7] Pankaj Berde, Matteo Gerola, Jonathan Hart, Yuta Higuchi, Masayoshi Kobayashi, Toshio Koide, Bob Lantz, Brian O'Connor, Pavlin Radoslavov, William Snow, and Guru Parulkar. 2014. ONOS: Towards an open, distributed SDN OS. In Proceedings of the 3rd Workshop on Hot Topics in Software Defined Networking. 1–6.
[8] Teemu Koponen, Martin Casado, Natasha Gude, Jeremy Stribling, Leon Poutievski, Min Zhu, Rajiv Ramanathan, Yuichiro Iwata, Hiroaki Inoue, Takayuki Hama, and Scott Shenker. 2010. Onix: A distributed control platform for large-scale production networks. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. 351–364.
[9] Naga Katta, Haoyu Zhang, Michael Freedman, and Jennifer Rexford. 2015. Ravana: Controller fault-tolerance in software-defined networking. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research. 12 pages.
[10] He Li, Peng Li, Song Guo, and Amiya Nayak. 2014. Byzantine-resilient secure software-defined networks with multiple controllers in cloud. IEEE Transactions on Cloud Computing 2, 4 (2014), 436–447.
[11] Ermin Sakic, Nemanja Deric, and Wolfgang Kellerer. 2018. MORPH: An adaptive framework for efficient and byzantine fault-tolerant SDN control plane. IEEE Journal on Selected Areas in Communications 36, 10 (2018), 2158–2174.
[12] Leslie Lamport, Robert Shostak, and Marshall Pease. 1982. The byzantine generals problem. ACM Transactions on Programming Languages and Systems 4, 3 (1982), 382–401.
[13] Miguel Castro and Barbara Liskov. 1999. Practical byzantine fault tolerance. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation. 173–186.
[14] Alysson Bessani, João Sousa, and Eduardo E. P. Alchieri. 2014. State machine replication for the masses with BFT-SMaRt. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 355–362.
[15] Kuo-Feng Hsu, Ryan Beckett, Ang Chen, Jennifer Rexford, and David Walker. 2020. Contra: A programmable system for performance-aware routing. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation. 701–721.
[16] Xin Jin, Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer Rexford, and Roger Wattenhofer. 2014. Dynamic scheduling of network updates. In Proceedings of the 2014 Conference of the ACM Special Interest Group on Data Communication. 539–550.
[17] Huynh Tu Dang, Daniele Sciascia, Marco Canini, Fernando Pedone, and Robert Soulé. 2015. NetPaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research. 7 pages.
  17. [17] Dang Huynh Tu, Sciascia Daniele, Canini Marco, Pedone Fernando, and Soulé Robert. 2015. NetPaxos: Consensus at network speed. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research.7 pages. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. [18] Kate Aniket. ([n. d.]). Distributed Key Generator. Retrieved 7 Dec., 2020 from https://crysp.uwaterloo.ca/software/DKG/.Google ScholarGoogle Scholar
  19. [19] Doudou Assia, Garbinato Benoît, Guerraoui Rachid, and Schiper André. 1999. Muteness failure detectors: Specification and implementation. In Proceedings of the 3rd European Dependable Computing Conference on Dependable Computing. 7187. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  20. [20] Zhou Wenchao, Fei Qiong, Narayan Arjun, Haeberlen Andreas, Loo Boon Thau, and Sherr Micah. 2011. Secure network provenance. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles. 295310. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. [21] ([n. d.]). Ryu SDN Framework. Retrieved 7 Dec., 2020 from http://osrg.github.io/ryu.Google ScholarGoogle Scholar
  22. [22] Lembke James, Ravi Srivatsan, Eugster Patrick, and Schmid Stefan. 2020. RoSCo: Robust updates for software-defined networks. IEEE Journal on Selected Areas in Communications 38, 7 (2020), 13521365. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  23. [23] Lynn Ben. ([n. d.]). The Pairing Based Cryptography Library. Retrieved 7 Dec., 2020 from https://crypto.stanford.edu/pbc/.Google ScholarGoogle Scholar
  24. [24] ([n. d.]). OpenFlow Discovery Protocol. Retrieved 7 Dec., 2020 from https://groups.geni.net/geni/wiki/OpenFlowDiscoveryProtocol.Google ScholarGoogle Scholar
  25. [25] Internet2 Community. Retrieved 20 Feb., 2021 https://internet2.edu.Google ScholarGoogle Scholar
  26. [26] Lembke James, Ravi Srivatsan, Roman Pierre-Louis, and Eugster Patrick. 2020. Consistent and secure network updates made practical. In Proceedings of the 21st International Middleware Conference. 149162. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. [27] Open Networking Foundation. 2015. OpenFlow Switch Specification. v1.5.1.Google ScholarGoogle Scholar
  28. [28] Azzouni Abdelhadi, Boutaba Raouf, Trang Nguyen Thi Mai, and Pujolle Guy. 2018. sOFTDP: Secure and efficient openflow topology discovery protocol. In Proceedings of the 2018 IEEE/IFIP Network Operations and Management Symposium. 17. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. [29] Chandrasekaran Balakrishnan and Benson Theophilus. 2014. Tolerating SDN application failures with LegoSDN. In Proceedings of the 13th ACM Workshop on Hot Topics in Networks. 17. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. [30] Shin Seungwon, Song Yongjoo, Lee Taekyung, Lee Sangho, Chung Jaewoong, Porras Phillip, Yegneswaran Vinod, Noh Jiseong, and Kang Brent Byunghoon. 2014. Rosemary: A robust, secure, and high-performance network operating system. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. 7889. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. [31] Yeganeh Soheil Hassas and Ganjali Yashar. 2016. Beehive: Simple distributed programming in software-defined networks. In Proceedings of the Symposium on SDN Research. 112. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. [32] Dargin Mark. ([n. d.]). Secure your SDN controller. Retrieved 1 Jan., 2021 from https://www.networkworld.com/article/3245173/secure-your-sdn-controller.html.Google ScholarGoogle Scholar
  33. [33] Hogg Scott. ([n. d.]). SDN Security Attack Vectors and SDN Hardening. Retrieved 1 Jan., 2021 from https://www.networkworld.com/article/2840273/sdn-security-attack-vectors-and-sdn-hardening.html.Google ScholarGoogle Scholar
  34. [34] Asturias Diego. ([n. d.]). 9 Types of Software Defined Network attacks and how to protect from them. Retrieved 1 Jan., 2021 from https://www.routerfreak.com/9-types-software-defined-network-attacks-protect/.Google ScholarGoogle Scholar
  35. [35] Brooks Michael and Yang Baijian. 2015. A man-in-the-middle attack against opendaylight SDN controller. In Proceedings of the 4th Annual ACM Conference on Research in Information Technology. 4549. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. [36] Dover Jeremy M.. 2013. A denial of service attack against the open floodlight SDN controller. Dover Networks LCC, Edgewater, MD (2013). Retrieved 1 Jan., 2021 http://dovernetworks.com/wp-content/uploads/2013/12/OpenFloodlight-12302013.pdf.Google ScholarGoogle Scholar
  37. [37] ([n. d.]). OpenFlow PacketOut. Retrieved 7 Dec., 2020 from http://flowgrammable.org/sdn/openflow/message-layer/packetout/.Google ScholarGoogle Scholar
  38. [38] Lee Seungsoo, Yoon Changhoon, and Shin Seungwon. 2016. The smaller, the shrewder: A simple malicious application can kill an entire SDN environment. In Proceedings of the 2016 ACM International Workshop on Security in Software Defined Networks & Network Function Virtualization. 2328. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. [39] ([n. d.]). Policy Framework for ONOS. Retrieved 7 May, 2020 from https://wiki.onosproject.org/display/ONOS/POLICY+FRAMEWORK+FOR+ONOS.Google ScholarGoogle Scholar
  40. [40] Bosshart Pat, Daly Dan, Gibb Glen, Izzard Martin, McKeown Nick, Rexford Jennifer, Schlesinger Cole, Talayco Dan, Vahdat Amin, Varghese George, and Walker David. 2014. P4: Programming protocol-independent packet processors. SIGCOMM Computer Communication Review 44, 3 (2014), 8795. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. [41] ([n. d.]). OpenDaylight Group Based Policy. Retrieved 1 Jan., 2021 from https://docs.opendaylight.org/en/stable-fluorine/user-guide/group-based-policy-user-guide.html.Google ScholarGoogle Scholar
  42. [42] Karakus Murat and Durresi Arjan. 2017. A survey: Control plane scalability issues and approaches in software-defined networking (SDN). Computer Networks 112 (2017), 279293. DOI: http://dx.doi.org/0.1016/j.comnet.2016.11.017Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. [43] Thai Peter and Oliveira Jaudelice C. de. 2013. Decoupling policy from routing with software defined interdomain management: Interdomain routing for SDN-based networks. In Proceedings of the 2013 22nd International Conference on Computer Communication and Networks. 16. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  44. [44] Roy Arjun, Zeng Hongyi, Bagga Jasmeet, Porter George, and Snoeren Alex C.. 2015. Inside the social network’s (datacenter) network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication. 123137. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. [45] Gude Natasha, Koponen Teemu, Pettit Justin, Pfaff Ben, Casado Martín, McKeown Nick, and Shenker Scott. 2008. NOX: Towards an operating system for networks. SIGCOMM Computer Communication Review 38, 3 (2008), 105110. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. [46] ([n. d.]). Cisco Open SDN Controller. Retrieved 7 May, 2020 from http://www.cisco.com/c/en/us/products/cloud-systems-management/opensdn-controller/index.html.Google ScholarGoogle Scholar
  47. [47] ([n. d.]). OpenDaylight. Retrieved 1 April, 2020 from https://www.opendaylight.org.Google ScholarGoogle Scholar
  48. [48] ([n. d.]). Central Office Re-architected as a Datacenter (CORD). Retrieved 1 April, 2020 from https://opencord.org/.Google ScholarGoogle Scholar
  49. [49] ([n. d.]). Packet-Optical. Retrieved 1 April, 2020 from https://wiki.onosproject.org/display/ONOS/Packet+Optical+Convergence.Google ScholarGoogle Scholar
  50. [50] ([n. d.]). Configuring TLS for inter-controller communication. Retrieved 1 April, 2020 from https://wiki.onosproject.org/display/ONOS/Configuring+TLS+for+inter-controller+communication.Google ScholarGoogle Scholar
  51. [51] ([n. d.]). Configuring OVS connection using SSL/TLS with self-signed certificates. Retrieved 1 April, 2020 from https://wiki.onosproject.org/pages/viewpage.action?pageId=6358090.Google ScholarGoogle Scholar
  52. [52] Botelho Fábio, Ribeiro Tulio A., Ferreira Paulo, Ramos Fernando M. V., and Bessani Alysson. 2016. Design and implementation of a consistent data store for a distributed SDN control plane. In Proceedings of the 2016 12th European Dependable Computing Conference. 169180. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  53. [53] McClurg Jedidiah, Hojjat Hossein, Foster Nate, and Černý Pavol. 2016. Event-driven network programming. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation. 369385. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. [54] Nguyen Thanh Dang, Chiesa Marco, and Canini Marco. 2017. Decentralized consistent updates in SDN. In Proceedings of the Symposium on SDN Research. 2133. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. [55] Černỳ Pavol, Foster Nate, Jagnik Nilesh, and McClurg Jedidiah. 2016. Optimal consistent network updates in polynomial time. In Proceedings of the International Symposium on Distributed Computing. 114128. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  56. [56] Kazemian Peyman, Varghese George, and McKeown Nick. 2012. Header space analysis: Static checking for networks. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation. 113126.Google ScholarGoogle Scholar
  57. [57] Beckett Ryan, Gupta Aarti, Mahajan Ratul, and Walker David. 2017. A general approach to network configuration verification. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 155168. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. [58] Agborubere Belema and Sanchez-Velazquez Erika. 2017. OpenFlow communications and TLS security in software-defined networks. In Proceedings of the 2017 IEEE International Conference on Internet of Things and IEEE Green Computing and Communications and IEEE Cyber, Physical and Social Computing and IEEE Smart Data. 560566. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  59. [59] Chen Ang, Wu Yang, Haeberlen Andreas, Zhou Wenchao, and Loo Boon Thau. 2016. The good, the bad, and the differences: Better network diagnostics with differential provenance. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. 115128. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. [60] Handigol Nikhil, Heller Brandon, Jeyakumar Vimalkumar, Mazières David, and McKeown Nick. 2014. I know what your packet did last hop: Using packet histories to troubleshoot networks. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation.Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. [61] Wallner Ryan and Cannistra Robert. 2013. An SDN approach: Quality of service using big switch’s floodlight open-source controller. Proceedings of the Asia-Pacific Advanced Network 35 (2013), 1419. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  62. [62] Sharma Pradip Kumar, Singh Saurabh, Jeong Young-Sik, and Park Jong Hyuk. 2017. DistBlockNet: A distributed blockchains-based secure SDN architecture for IoT networks. IEEE Communications Magazine 55, 9 (2017), 7885. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. [63] Arash Shaghaghi, Mohamed Ali Kaafar, Rajkumar Buyya, and Sanjay Jha. 2020. Software-Defined Network (SDN) Data Plane Security: Issues, Solutions and Future Directions. In Handbook of Computer Networks and Cyber Security. 341–387.Google ScholarGoogle Scholar
  64. [64] Shamseddine Maha, Itani Wassim, Kayssi Ayman, and Chehab Ali. 2017. Virtualized network views for localizing misbehaving sources in SDN data planes. In Proceedings of the 2017 IEEE International Conference on Communications. 17. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  65. [65] Skowyra Richard, Lapets Andrei, Bestavros Azer, and Kfoury Assaf. 2014. A verification platform for SDN-enabled applications. In Proceedings of the 2014 IEEE International Conference on Cloud Engineering. 337342. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. [66] Yuan Bin, Lin Chen, Zou Deqing, Yang Laurence Tianruo, and Jin Hai. 2021. Detecting malicious switches for a secure software-defined tactile internet. ACM Transactions on Internet Technology 21, 4 (2021), 123. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. [67] Anil Ashidha, Rufzal TA, and Vasudevan Vipindev Adat. 2022. DDoS detection in software-defined network using entropy method. In Proceedings of the 7th International Conference on Mathematics and Computing. 129139. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  68. [68] Bawany Narmeen Zakaria, Shamsi Jawwad A., and Salah Khaled. 2017. DDoS attack detection and mitigation using SDN: methods, practices, and solutions. Arabian Journal for Science and Engineering 42, 2 (2017), 425441. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  69. [69] Buragohain Chaitanya and Medhi Nabajyoti. 2016. FlowTrApp: An SDN based architecture for DDoS attack detection and mitigation in data centers. In Proceedings of the 2016 3rd International Conference on Signal Processing and Integrated Networks. 519524. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  70. [70] Sebbar Anass, Zkik Karim, Baddi Youssef, Boulmalf Mohammed, and Kettani Mohamed Dafir Ech-Cherif El. 2020. MitM detection and defense mechanism CBNA-RF based on machine learning for large-scale SDN context. Journal of Ambient Intelligence and Humanized Computing 11, 12 (2020), 58755894. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  71. [71] Pereíni Peter, Kuzniar Maciej, Canini Marco, and Kostić Dejan. 2014. ESPRES: Transparent SDN update scheduling. In Proceedings of the 3rd Workshop on Hot Topics in Software Defined Networking. 7378. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. [72] McClurg Jedidiah, Hojjat Hossein, Černý Pavol, and Foster Nate. 2015. Efficient synthesis of network updates. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation. 196207. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. [73] Desmedt Yvo G.. 1994. Threshold cryptography. European Transactions on Telecommunications 5, 4 (1994), 449458. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  74. [74] Gennaro Rosario, Jarecki Stanislaw, Krawczyk Hugo, and Rabin Tal. 1996. Robust threshold DSS signatures. In Proceedings of the Advances in Cryptology – EUROCRYPT. 354371. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  75. [75] Shamir Adi. 1979. How to share a secret. Communications of the ACM 22, 11 (1979), 612613. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. [76] Chor Benny, Goldwasser Shafi, Micali Silvio, and Awerbuch Baruch. 1985. Verifiable secret sharing and achieving simultaneity in the presence of faults. In Proceedings of the 26th Annual Symposium on Foundations of Computer Science. 383395. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. [77] Kate Aniket, Huang Yizhou, and Goldberg Ian. 2012. Distributed Key Generation in the Wild. Cryptology ePrint Archive, Paper 2012/377. (2012). Retrieved 7 Dec., 2020 from https://eprint.iacr.org/2012/377.Google ScholarGoogle Scholar
  78. [78] Hadzilacos Vassos and Toueg Sam. 1994. A Modular Approach to Fault-Tolerant Broadcasts and Related Problems. Technical Report. Cornell University.Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. [79] Chandra Tushar Deepak and Toueg Sam. 1996. Unreliable failure detectors for reliable distributed systems. Journal of the ACM 43, 2 (1996), 225267. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. [80] Haeberlen Andreas, Kouznetsov Petr, and Druschel Peter. 2007. PeerReview: Practical accountability for distributed systems. In Proceedings of the21st ACM SIGOPS Symposium on Operating Systems Principles. 175188. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. [81] Chun Byung-Gon, Maniatis Petros, Shenker Scott, and Kubiatowicz John. 2007. Attested append-only memory: Making adversaries stick to their word. In Proceedings of the21st ACM SIGOPS Symposium on Operating Systems Principles. 189204. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. [82] Androulaki Elli, Barger Artem, Bortnikov Vita, Cachin Christian, Christidis Konstantinos, Caro Angelo De, Enyeart David, Ferris Christopher, Laventman Gennady, Manevich Yacov, Muralidharan Srinivasan, Murthy Chet, Nguyen Binh, Sethi Manish, Singh Gari, Smith Keith, Sorniotti Alessandro, Stathakopoulou Chrysoula, Vukolić Marko, Cocco Sharon Weed, and Yellick Jason. 2018. Hyperledger fabric: A distributed operating system for permissioned blockchains. In Proceedings of the 13th EuroSys Conference. 30:1–30:15. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. [83] Kokoris-Kogias Eleftherios, Jovanovic Philipp, Gasser Linus, Gailly Nicolas, Syta Ewa, and Ford Bryan. 2018. OmniLedger: A secure, scale-out, decentralized ledger via sharding. In Proceedings of the 2018 IEEE Symposium on Security and Privacy. 1934. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  84. [84] Zamani Mahdi, Movahedi Mahnush, and Raykova Mariana. 2018. RapidChain: Scaling blockchain via full sharding. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 931948. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  85. [85] Clement Allen, Junqueira Flavio, Kate Aniket, and Rodrigues Rodrigo. 2012. On the (limited) power of non-equivocation. In Proceedings of the 2012 ACM Symposium on Principles of Distributed Computing. 301308. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  86. [86] Lamport Leslie, Malkhi Dahlia, and Zhou Lidong. 2010. Reconfiguring a state machine. ACM SIGACT News 41, 1 (2010), 6373. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. [87] Lamport Leslie. 1998. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (1998), 133169. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  88. [88] Cao Jiahao, Xie Renjie, Sun Kun, Li Qi, Gu Guofei, and Xu Mingwei. 2020. When match fields do not need to match: Buffered packets hijacking in SDN. In Proceedings of the 27th Annual Network and Distributed System Security Symposium. DOI:Google ScholarGoogle ScholarCross RefCross Ref
  89. [89] Boneh Dan, Lynn Ben, and Shacham Hovav. 2004. Short signatures from the weil pairing. Journal of Cryptology 17, 4 (2004), 297319. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
  90. [90] ([n. d.]). OpenFlow Role Request Messages. Retrieved 7 Dec., 2020 from https://ryu.readthedocs.io/en/latest/ofproto_v1_3_ref.html#role-request-message.Google ScholarGoogle Scholar
  91. [91] Standard for Local and Metropolitan Area Networks - Station and Media Access Control Connectivity Discovery, 802.1AB-REV Draft 6.0, IEEE, Jun. 24.Google ScholarGoogle Scholar
  92. [92] ([n. d.]). About DETERLab. Retrieved 1 April, 2020 from https://deter-project.org/about_deterlab.Google ScholarGoogle Scholar
  93. [93] ([n. d.]). DETERLab PC3000 Node Information. Retrieved 1 April, 2020 from https://www.isi.deterlab.net/shownodetype.php?node_type=pc3000.Google ScholarGoogle Scholar
  94. [94] ([n. d.]). OpenVz. Retrieved 1 April, 2020 from https://openvz.org/.Google ScholarGoogle Scholar
  95. [95] ([n. d.]). Introducing data center fabric, the next-generation Facebook data center network. Retrieved 7 May, 2020 from https://code.fb.com/production-engineering/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/.Google ScholarGoogle Scholar
  96. [96] ([n. d.]). The Internet Topology Zoo. Retrieved 7 May, 2020 from http://www.topology-zoo.org/.Google ScholarGoogle Scholar
  97. [97] Ros Francisco Javier and Ruiz Pedro Miguel. 2014. Five nines of southbound reliability in software-defined networks. In Proceedings of the 3rd Workshop on Hot Topics in Software Defined Networking. 3136. DOI:Google ScholarGoogle ScholarDigital LibraryDigital Library
