1. Introduction
Software-defined networking (SDN) separates the control plane from the data plane, i.e., it moves the control logic from the network devices to a central controller. The centralized controller manages the flow of data through a southbound application programming interface (SB-API). Such centralized management of the networking devices has the advantage that new applications, services, and networking functions can be deployed flexibly with minimal operating and capital expenses. Several survey studies on SDN operation, history, architecture, programmability, and research directions are available in [1,2,3,4,5,6].
Link failure recovery approaches leverage the unique SDN features of centralized control and a flexible, programmable data plane for real-time applications such as video conferencing [7] and voice over IP (VoIP), which can tolerate a recovery delay of up to 50 ms. Thus, the quality of service (QoS) can be maintained in case of a link failure to ensure uninterrupted communication. The reported means for link and device failures in a traditional data center network are 40.8 and 5.2 failures per day, respectively [8], which motivates methods that enable faster recovery of failed links. The study in [8] also reported that link failures occur more frequently than node failures. Therefore, fault-resilient approaches play an important role in traffic engineering for operator networks to ensure fast failure recovery and ultimately meet the requirements of end users.
The tight coupling of the control and data planes in legacy networks makes them sluggish and complex to manage. Although traditional networks have been adopted universally, their management and configuration are cumbersome [9] for the following reasons:
Vendors are hesitant to provide the source code of their protocols to the developer and user community, fearing that unverified changes to their devices could lead to network malfunctions [10].
A global view of the network is hard to obtain in the traditional network architecture; hence, only distributed routing protocols can be used, e.g., the routing information protocol (RIP) [11] and open shortest path first (OSPF) [12].
The co-existence of the data and control planes also leads to improper bandwidth utilization [13], as the bandwidth is shared by both planes. Packets are broadcast to the network, which leads to low link utilization. The situation worsens as soon as a link fails, because the system searches for alternate paths by broadcasting packets across the network, leading to congestion.
In case of a link failure, re-routing is performed to divert the packets from the failed link to an alternative path. However, the implementation of traditional routing protocols hinders network growth and causes delays owing to several problems, such as flooding of link-state information, long convergence time of path detection [14], deployment complexity of the network [15], and route flaps caused by prefix instability [16]. Additionally, the network may be destabilized by routing conflicts involving autonomous systems (ASes) [17]. Consequently, optimal decisions cannot be made because global network statistics are unavailable. These problems persist in the traditional internet architecture for two reasons: first, implementing changes in traditional routing protocols is difficult because the software is embedded in the firmware; and second, internet companies feel at risk and shy away from implementing new proposals, even ones that could increase network performance, because doing so would also increase network complexity and, consequently, the maintenance cost.
Fast failure recovery within a fixed time interval is vital for providing a service guarantee in next-generation technologies. In the literature, several architectures [18,19,20] have been proposed for enabling the fast recovery of networks. The architecture proposed in [18] consists of an automatic failure recovery or fault management framework. The research conducted in [19] leverages 5G, secure Internet-of-Things (IoT), and unmanned aerial vehicle (UAV) swarms to ensure service in mission-critical infrastructures. Likewise, a platform for the virtualization of services based on SDN and network function virtualization (NFV) was proposed in [20], which enables the development, implementation, and operation of media services over 5G networks.
Moreover, the nodes may operate in remote and harsh environments with a possibility of frequent failures. Therefore, continuous updates are essential to discover alternative paths for nodes that have experienced failures [21]. In addition, SDN handles link failures using one of two main approaches: proactive or reactive [22]. In the proactive approach, the alternate paths are preconfigured, and in the case of a link failure, the disrupted flows are forwarded onto the backup path. In contrast, in the reactive scheme, the controller is contacted to find an alternative path, and the flow rules for the new path are installed once the controller has computed it. The SDN controller, which has access to the global topological information, searches for the optimal alternative path for the failed link and pushes the corresponding flow rules. Hence, the data plane is not interrupted. Consequently, packets are not broadcast to the network, owing to the centralized control architecture, which improves network performance. However, both schemes have their pros and cons along with a trade-off between performance and efficiency.
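To make the two recovery styles concrete, the sketch below contrasts them using the RYU controller and OpenFlow 1.3: a fast-failover group entry preconfigures a backup port (proactive), while a port-status handler reacts to a failure and is the place where new rules would be computed and pushed (reactive). The datapath, port numbers, group ID, and priority are illustrative assumptions, not a prescribed configuration.

```python
# A minimal RYU/OpenFlow 1.3 sketch contrasting proactive and reactive recovery.
# Port numbers, group ID, and priority are placeholder assumptions.
from ryu.base import app_manager
from ryu.controller import ofp_event
from ryu.controller.handler import MAIN_DISPATCHER, set_ev_cls
from ryu.ofproto import ofproto_v1_3


class FailureRecoveryApp(app_manager.RyuApp):
    OFP_VERSIONS = [ofproto_v1_3.OFP_VERSION]

    def install_proactive_backup(self, dp, in_port, primary_port, backup_port):
        """Proactive: a fast-failover group switches to the backup port locally."""
        ofp, parser = dp.ofproto, dp.ofproto_parser
        buckets = [
            parser.OFPBucket(watch_port=primary_port,
                             actions=[parser.OFPActionOutput(primary_port)]),
            parser.OFPBucket(watch_port=backup_port,
                             actions=[parser.OFPActionOutput(backup_port)]),
        ]
        dp.send_msg(parser.OFPGroupMod(dp, ofp.OFPGC_ADD, ofp.OFPGT_FF, 1, buckets))
        # Flow entry that sends matching traffic to the fast-failover group.
        match = parser.OFPMatch(in_port=in_port)
        inst = [parser.OFPInstructionActions(ofp.OFPIT_APPLY_ACTIONS,
                                             [parser.OFPActionGroup(1)])]
        dp.send_msg(parser.OFPFlowMod(datapath=dp, priority=10,
                                      match=match, instructions=inst))

    @set_ev_cls(ofp_event.EventOFPPortStatus, MAIN_DISPATCHER)
    def port_status_handler(self, ev):
        """Reactive: the controller learns of the failure and pushes new rules."""
        msg, ofp = ev.msg, ev.msg.datapath.ofproto
        link_down = msg.desc.state & ofp.OFPPS_LINK_DOWN
        if msg.reason == ofp.OFPPR_MODIFY and link_down:
            self.logger.info("port %s down; recompute path and send OFPFlowMod here",
                             msg.desc.port_no)
```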
Link failure recovery in SDN was overviewed in [23,24]. In this survey, we investigate the link failure detection and recovery approaches in SDN. A demonstration of SDN-based failure recovery with proactive and reactive approaches is presented with pictorial diagrams. We compare the proactive and reactive schemes in terms of latency, scalability, routing updates, ternary content-addressable memory (TCAM) space, flow operations matching, configuration, robustness to backup path failures, routing information access, processing of switches, as well as the routing, controller, and switch overheads. The research issues in SDN-based link failure recovery schemes for large-scale, hybrid, inter-domain, in-band, and machine learning (ML) approaches are discussed and summarized. We simulate two application scenarios in a Mininet testbed for Naval tactical and data center networks (DCN) and evaluate the recovery time and throughput when using the proactive and reactive schemes.
The rest of the paper is organized as follows. In Section 2, an overview of the SDN architecture is presented and the importance of SDN for achieving recovery is explained. In Section 3, we discuss the various link failure detection techniques. In Section 4, the two most common methods for searching alternative paths, i.e., proactive and reactive, are described, in addition to failure recovery approaches in large-scale networks, inter-domain architectures, hybrid SDN, in-band environments, and ML-based techniques. In Section 5, we discuss SDN application scenarios, the experimental setup, and an experimental demonstration of the proactive and reactive approaches in Naval tactical and DCN operations. In Section 6, a summary is provided based on the findings of various research studies. Finally, in Section 7, we conclude the paper and highlight the main points of the survey.
2. SDN Architecture
2.1. An Overview of SDN
Figure 1 [25] shows the different layers of the SDN architecture and the way the devices in the data plane interact with those in the control plane through an SB-API. The SB-API provides an interface for interaction between the data and control planes. Several protocols are available for the interaction of the two planes, such as OpenFlow and Netconf [26]. The control plane is implemented through SDN controllers, e.g., POX [27], OpenDaylight (ODL) [28], Open Network Operating System (ONOS) [29], Floodlight [30], and RYU [31]. The northbound API (NB-API) is an interface between the control and management planes. The SDN controller acts as a bridge between the management and data planes, leveraging the representational state transfer (REST) API. Similarly, statistics about the data plane, such as the flows, are gathered through the REST API.
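As a concrete illustration of gathering data plane statistics over the REST API, the sketch below polls a flow statistics endpoint. It assumes a RYU controller running the ofctl_rest application on its default port; the address, datapath ID, and response field names are assumptions tied to that setup and would differ for ODL, ONOS, or other controllers.

```python
# A minimal sketch of polling flow statistics through a controller REST API.
# Assumes RYU with ofctl_rest at 127.0.0.1:8080; adapt the URL for other controllers.
import requests

CONTROLLER = "http://127.0.0.1:8080"   # assumed controller management address
DPID = 1                               # assumed datapath ID of the monitored switch

def get_flow_stats(dpid):
    """Return the list of flow entries currently installed on one switch."""
    resp = requests.get(f"{CONTROLLER}/stats/flow/{dpid}", timeout=2)
    resp.raise_for_status()
    return resp.json()[str(dpid)]      # response maps the dpid to its flow entries

if __name__ == "__main__":
    for flow in get_flow_stats(DPID):
        print(flow.get("match"), flow.get("packet_count"), flow.get("byte_count"))
```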
Programmability and orchestration in SDN are implemented through the management plane, which sits on top of the control plane. Different applications are executed to perform the versatile and dynamic operations needed for efficient link failure recovery. The switches in the data plane leverage the flexibility and programmability features of the management plane through the abstractions offered by the control plane. For example, network monitoring and failure detection can be performed thoroughly by developing and deploying sniffing and failure detection applications in the management plane.
2.2. A Global View of the Network
There is a large recovery delay in classical IP networks owing to the flooding of packets, the time required for failure detection, the computation of alternate paths, and the updating of routing tables. In SDN, however, the controller has a global view and control of the network. Therefore, it can make optimal decisions when searching for an efficient alternate path for the disrupted flows. In addition, the controller monitors end-to-end connectivity; therefore, when a link is broken, the controller can reconfigure the network to re-establish the end-to-end (E2E) connectivity of all paths.
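The sketch below illustrates how a controller with such a global view might recompute an alternate path once a link fails: remove the failed edge from its topology graph and run a shortest-path search again. The networkx library, the four-switch topology, and the link weights are illustrative assumptions rather than part of any specific controller.

```python
# A minimal sketch of alternate path recomputation from a global topology view.
# The topology, link weights, and use of networkx are illustrative assumptions.
import networkx as nx

# Global view held by the controller: four switches, primary path s1-s2-s4.
g = nx.Graph()
g.add_weighted_edges_from([
    ("s1", "s2", 1), ("s2", "s4", 1),   # primary path
    ("s1", "s3", 2), ("s3", "s4", 2),   # backup path
])

def recompute_path(graph, failed_link, src, dst):
    """Drop the failed link from the global view and return a new shortest path."""
    graph.remove_edge(*failed_link)
    return nx.shortest_path(graph, src, dst, weight="weight")

print(recompute_path(g, ("s2", "s4"), "s1", "s4"))   # -> ['s1', 's3', 's4']
```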
In contrast to traditional networks, where every node floods the network with packets to find an alternate path, SDN provides solutions with lower complexity and greater flexibility. The programmability and flexibility can be used to dynamically apply policies in the network through the control plane, according to the QoS requirements that change as soon as a link failure occurs. Consequently, the cost, time, and workforce are reduced.
2.3. The Low Complexity of Data Plane
SDN shifts the data plane complexity to the centralized controller; therefore, the nodes react efficiently by utilizing the unique features of the southbound interface (e.g., OpenFlow/Netconf) for automatic failure recovery and load balancing. Additionally, applications can be executed with more flexibility in the data plane, resulting in improved QoS and quality of experience (QoE) and the fulfillment of the requirements of carrier-grade networks (CGNs). Furthermore, OpenFlow provides fast detection mechanisms that reduce the link failure recovery delay and packet loss. Similarly, backup paths can be easily configured on the switches, and the match and forwarding features can help achieve high network throughput and QoS.
3. Link Failure Detection Mechanisms
The failure recovery process in SDN starts with the detection of failed links. If the detection is swift, the overall recovery delay will be small, which is why detection schemes are crucial to the overall process.
Table 1 gives an overview of the link failure detection mechanisms, the detection methodology for the failed links, and the related problems in the detection schemes. Modern networks leverage the centralized control of SDN and the flexibility of managing the data plane for link failure recovery. A global view of the network, facilitated by central control, provides several possibilities in the failure identification process [32,33,34]. The schemes proposed in [32,33] used the concept of monitoring cycles to reduce the link failure recovery time. The monitoring cycle paths detect and locate link failures in the SDN; however, minimizing the number of hops and monitoring cycles is vital. Therefore, in [32], a binary search technique was introduced to minimize this overhead. Similarly, in [33,34], the monitoring assignment was formulated as the postman problem with a heuristic method [35] for the assignment of the monitoring cycles. However, the detection process was still observed to be slow.
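To illustrate the binary search idea behind [32], the sketch below halves the probed segment of one monitoring cycle until a single failed link remains. The probe primitive and the example cycle are hypothetical; the sketch conveys the general localization strategy rather than the exact algorithm of [32].

```python
# A hypothetical sketch of localizing a failed link on a monitoring cycle by
# halving the probed segment, in the spirit of the binary search idea in [32].
# probe() is an assumed primitive that reports whether every link in a segment
# forwards traffic; it is not a real controller API.
def locate_failed_link(cycle_links, probe):
    """cycle_links: ordered list of links on one monitoring cycle.
    probe(segment) -> True if every link in the segment is healthy.
    Returns the single failed link, assuming exactly one failure."""
    lo, hi = 0, len(cycle_links)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(cycle_links[lo:mid]):   # first half healthy -> failure is in second half
            lo = mid
        else:
            hi = mid
    return cycle_links[lo]

# Example: the link ('s3', 's4') is down.
links = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s1")]
down = {("s3", "s4")}
print(locate_failed_link(links, lambda seg: not any(l in down for l in seg)))
```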
Quick detection of the failed links improves the performance of the link failure recovery techniques; therefore, the failed links must be detected efficiently before the recovery process. Several mechanisms are available for link failure detection; a few of them are mentioned in [36]. OpenFlow implementations use the same heartbeat-message mechanism for detection as Ethernet. A heartbeat message is exchanged between the nodes at regular time intervals, which determines the status of the network. Liveness is checked through the rate of exchange of hello packets between the nodes; if a node does not receive a hello packet within the regular time interval of 16 ± 8 ms, the controller is notified about the failed link. If a node does not receive a response within 50–150 ms, the link is considered disconnected [37]. Owing to this slow detection rate, Ethernet cannot meet the CGN delay demand (<50 ms).
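The sketch below illustrates this heartbeat-based liveness check with the timing figures quoted above. The transport of the hello packets is abstracted away, and the interval constants are illustrative defaults rather than values mandated by OpenFlow or Ethernet.

```python
# A minimal sketch of heartbeat-based liveness detection. The hello transport
# is abstracted away; the intervals are illustrative defaults, not a standard API.
import time

HELLO_INTERVAL = 0.016   # ~16 ms between hello packets
DEAD_INTERVAL  = 0.050   # link declared down after 50 ms without a hello

class LinkMonitor:
    def __init__(self):
        self.last_hello = {}          # link id -> timestamp of the last hello seen

    def on_hello(self, link_id):
        self.last_hello[link_id] = time.monotonic()

    def failed_links(self):
        """Return links whose last hello is older than the dead interval."""
        now = time.monotonic()
        return [lid for lid, t in self.last_hello.items()
                if now - t > DEAD_INTERVAL]

monitor = LinkMonitor()
monitor.on_hello("s1-s2")
time.sleep(0.06)                      # simulate a missed hello
print(monitor.failed_links())         # -> ['s1-s2'], to be reported to the controller
```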
The spanning tree protocol (STP) [38] and rapid spanning tree protocol (RSTP) have also been used at the data link layer for link failure detection. However, their detection period spans seconds and cannot guarantee the delay requirements of modern technologies. Similarly, the OpenFlow fast failover (FF) group [1] and Bidirectional Forwarding Detection (BFD) [39] are also routinely used in the SDN community [36,40,41] for link failure detection and recovery. Additionally, a failure detection mechanism known as failure detection service with low mistake rates (FDLM), which uses heartbeat messages to minimize detection errors, was proposed in [42].
The failure detection method for transport networks on the E2E path, described by Kemp et al. [43], used multiprotocol label switching (MPLS) BFD [44]. The scheme utilizes packet generators implemented in the switches, which send probe messages along with the data packets. A link failure is detected when there is a gap between consecutive probe messages. The proposed methodology achieved scalability because different network elements can utilize identical packet generators, which can be separated later through MPLS.
A failure detection approach using a counter mechanism based on the outgoing packets was proposed in [45]. The flow rules installed on the link were tagged and monitored, and the packets were then counted at the destination. The error rate, calculated from the difference between the sent and received packets, was compared with a threshold value. For a given link, if the error rate exceeded the threshold, the link was assumed to have failed.
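A minimal sketch of this counter-based check follows: compute the loss rate from the sent and received packet counts on the tagged flow and flag the link when the rate crosses a threshold. The threshold and counter values are illustrative assumptions, not figures taken from [45].

```python
# A minimal sketch of counter-based failure detection: compare sent and received
# packet counts on a monitored link and flag it when loss exceeds a threshold.
ERROR_THRESHOLD = 0.05   # assumed threshold: declare failure above 5% loss

def link_failed(sent: int, received: int, threshold: float = ERROR_THRESHOLD) -> bool:
    """Return True if the loss rate on the tagged flow exceeds the threshold."""
    if sent == 0:
        return False                      # nothing sent yet, nothing to judge
    error_rate = (sent - received) / sent
    return error_rate > threshold

print(link_failed(sent=1000, received=990))   # 1% loss  -> False
print(link_failed(sent=1000, received=700))   # 30% loss -> True
```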
The SDN elastic optical network (SDN-EON) described in [46] also uses a threshold mechanism, based on the bit error rate (BER). An OpenFlow agent deployed on the data plane nodes periodically monitors the BER. The BER is compared with a threshold to decide whether the link or node on the path has failed. In case of failure, an alarm message is sent to the controller.
The failure detection scheme at the switch level [47], known as switch failure detection (SFD), uses the failed link and the network topology as inputs. To identify a failed switch, the algorithm first finds the source and destination of the failed link. It then discovers all the hosts connected to the switch and checks whether the packet loss ratio is 100%. If so, the switch is assumed to have failed. Thus, the failed switch is identified, and the recovery process is initiated.
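The sketch below captures this switch-level check: for each endpoint of the failed link, if every attached host observes 100% packet loss, that switch is treated as failed. The topology, loss statistics, and helper names are hypothetical and only approximate the SFD logic of [47].

```python
# A hypothetical sketch of switch-level failure detection: a switch whose hosts
# all show 100% packet loss is treated as failed. Inputs are illustrative.
def find_failed_switch(failed_link, hosts_of, loss_ratio):
    """failed_link: (switch_a, switch_b);
    hosts_of: dict switch -> list of attached hosts;
    loss_ratio: dict host -> observed packet loss ratio (0.0-1.0)."""
    for switch in failed_link:
        hosts = hosts_of.get(switch, [])
        if hosts and all(loss_ratio.get(h, 0.0) >= 1.0 for h in hosts):
            return switch            # every host behind this switch is unreachable
    return None                      # only the link, not a whole switch, has failed

hosts_of = {"s2": ["h1", "h2"], "s3": ["h3"]}
loss = {"h1": 1.0, "h2": 1.0, "h3": 0.0}
print(find_failed_switch(("s2", "s3"), hosts_of, loss))   # -> 's2'
```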
In the current study, we elaborate on the use of these schemes in SDN. The utilization of each failure detection scheme depends on the particular demands of the network where it will be used. For example, the circle-path monitoring [32,33] and monitoring-cycle failure detection [34,35] methods are slow because of the many hops on the E2E path and the heuristic algorithm. Hence, these schemes are not favorable for networks with low latency requirements, such as CGNs. However, the OpenFlow [36,37], STP or RSTP [38], and FDLM-based [42] schemes that leverage periodic heartbeat messages and port status updates are relatively faster. The BFD [36,39,40,41] and MPLS BFD [43,44] approaches are comparatively fast in failure detection. Hence, modern technologies can leverage these schemes because of their low latency. A few schemes [45,46] are limited to particular networks in SDN, such as link failure recovery in smart grids, SDN-EON, and node failures only.
6. Summary and Challenges of the SDN-Based Failure Recovery Approaches
In the proactive method, the backup paths are calculated in advance. Therefore, when a link fails, the traffic is forwarded onto the backup path. The method has its pros: the controller does not need to recalculate the path, as the forwarding rules for the backup path already exist in the SDN switches. However, a disadvantage of this approach is that the TCAM space cost of the SDN switches increases. Besides this, the switches are limited to 8000 flow entries in their flow tables.
In a few cases, the backup path may fail earlier than the primary path. If this happens, performance is affected because incoming packets must still be matched against the redundant backup-path flow entries in the switches. In the reactive approach, the SDN controller installs the flow rules for the alternative path when a link failure event occurs. This methodology is economical in terms of TCAM space; however, the calculation of an alternative path at run time and the installation of rules for that path incur an additional delay.
To summarize, critics of the reactive approach argue that the delay incurred by the controller in finding an alternative path cannot meet the minimum delay requirements of CGNs. However, approaches that have used efficient routing algorithms and minimum flow operations have achieved the desired results. There is always room for future researchers to improve on previous works, because there are trade-offs among flow operations, large-scale SDN, the minimum-cost shortest path, algorithm complexity, delay, congestion, load balancing, etc. The inter-domain techniques have synchronization, E2E service provisioning, and interoperability problems that hamper failure recovery. Similarly, in the in-band schemes, differentiating between data and control traffic is a complex process. Therefore, efficient solutions with minimal complexity can be proposed, combining the innovative features of southbound interface protocols, such as OpenFlow/Netconf, to achieve efficient results. Finally, we discussed ML-based schemes. There is a high probability of ML-based schemes being used in the future because of the increase in internet nodes and users as well as the enormous usage of data. However, the lack of standard datasets for the SDN environment hinders the use of ML in SDN research. The development of high-accuracy ML applications for link failure detection and the creation of versatile datasets should be considered for using ML in SDN in the future.
7. Conclusions
The introduction of SDN for link failure recovery is a novel approach that leverages centralized control concepts. In this paper, we described the background and importance of SDN in link failure recovery by explaining the vulnerabilities of the traditional networking architecture. Then, the three SDN planes and their interaction mechanisms were described, along with the importance of SDN for link failure recovery. The failure recovery speed depends on the time taken for failure detection; therefore, we described the state-of-the-art approaches for link failure detection with their pros and cons. We then described the proactive and reactive approaches: first, the link failure detection and recovery process with proactive failure recovery in SDN and the previous schemes using proactive recovery, and then the reactive failure recovery mechanism in SDN and its related literature. We compared the effectiveness of proactive and reactive failure recovery approaches in SDN from the summaries of previous works. A comparison was performed between the proactive and reactive schemes in terms of latency, scalability, routing updates, TCAM space, flow operations matching, configuration, robustness to backup path failures, routing information access, processing of switches, and the overheads of routing, controller, and switches. The inter-domain and intra-domain architectures for link failure recovery were discussed. Finally, link failure recovery in hybrid SDN environments, large-scale networks, in-band SDN, and machine learning schemes was discussed. We simulated two application scenarios of Naval tactical networks and DCN using the ODL controller for the proactive and reactive approaches, showing the recovery time and throughput comparison. The experimental results of the two schemes show that flow insertion by the SDN controller, in the case of the reactive approach, causes an extra burden due to controller intervention. Finally, in the summary section, we listed the problems and future directions for SDN-based link failure recovery. In view of these research challenges, this study can help in selecting optimal resilient SDN solutions that can guarantee the smooth functioning of CGNs.