Implementing Dual Base Stations within an IoT Network for Sustaining the Fault Tolerance of an IoT Network through an Efficient Path Finding Algorithm

IoT networks that implement mission-critical applications need a layer that effects remote communication between the cluster heads and the microcontrollers. This remote communication is effected through base stations using cellular technologies. Using a single base station in this layer is risky, because the fault tolerance of the network drops to zero when that base station breaks down. The cluster heads are generally within the base station's spectrum, which makes seamless integration possible. Adding a second base station to cater for a breakdown of the first introduces considerable remoteness, as the cluster heads are not within the spectrum of the second base station. Furthermore, using the remote base station adds substantial latency, which affects the performance of the IoT network. In this paper, a relay-based network is presented with the intelligence to find the shortest communication path, reducing latency and sustaining the fault tolerance of the IoT network. The results demonstrate that the technique improved the fault tolerance of the IoT network by 14.63%.


Introduction
The fault tolerance of a network is generally defined as the reliability with which the IoT system remains available in working condition. The more reliable the IoT network, the more fault tolerant it is, and the more readily such a network is accepted. IoT networks that support mission-critical systems must be at least 99% fault tolerant [1].
IoT networks use networking topologies (butterfly, crossbar, hybrid) in different layers to connect small devices (sensors, actuators, controllers) and large devices such as high-end servers, base stations, gateways and splitters. Small devices often fail and induce different faults, making the entire IoT network less fault tolerant [2].
The devices within an IoT system are heterogeneous and are driven by different protocols, which require protocol conversion, speed matching and several sophisticated data-transmission algorithms to enhance performance [3][4][5][6][7].
The fault tolerance of an IoT network can be measured in terms of success rate, failure rate, false alarm rate and power depletion rate [8]. Different models can be used to compute these metrics, including FTA (Fault Tree Analysis) [9] for linear models, probability models [10], hybrid models [11] that combine linear and probability models, empirical models [12] and bipartite flow graph modeling [13].
Small devices tend to fail, inducing different faults that sometimes render the entire IoT system non-operational and cause unpredictable behaviour. The faults generally induced in an IoT system include cascading faults [14,15], pattern faults [16] and device-to-device communication faults [17]. Various device faults have been reviewed by Norris et al. [18,19].
Several methods have been implemented to enhance the fault tolerance of an IoT system in the presence of the faults mentioned above. These methods aim to enhance the fault tolerance of the system; for this purpose, a network that provides alternate communication paths is required. The shortest path must be selected for communication so that data reach the service server with the least latency.
Communication must be established between slow-speed and high-speed devices and networks, and this speed mismatch leads to heavy failures. Such heterogeneity causes significant failures and must therefore be managed. Communication between the cluster heads and the remotely situated microcontrollers is effected through base stations, with which both the cluster heads and the microcontrollers communicate over cellular links.
The risk of failure of the IoT network is very high because of possible base-station failure and the heterogeneity of the communication protocols. Remote base stations are also out of range of the devices that sense and transport data.
A specialized network topology is required to connect the devices to a redundant base station situated remotely from the cluster heads, since redundant base stations enhance the fault tolerance of the IoT network. The base stations communicate with the microcontrollers in peer-to-peer mode using parallel communication so that their speeds match. The network also needs redundancy to accommodate link failures, and communication must follow the shortest path with the least traffic to minimise latency.
The following research questions are answered in this paper:
1. How will the longevity of the IoT network be affected by the use of a single base station?
2. How can fault-free and fast-responding paths be determined in a given IoT network?
3. What parameters must be considered to decide the path that gives the fastest communication through the second base station?
4. How many base stations are required to guarantee a 100% fault-tolerant system?

The contributions of this paper are as follows:
1. A method to determine the shortest and fastest fault-free path for data transmission from the cluster heads to the microcontrollers via the redundant base station.
2. A method to facilitate communication between the base stations and the microcontrollers.
3. A method to convert a network diagram into a Fault Tree Analysis (FTA) diagram, considering the different networking topologies used to connect the device layer and the controller layer.

Outline of the paper
Section 2 presents the related work and identifies the research gap. Section 3 presents the overall method used to improve the fault tolerance of the IoT network through modifications to the controller layer. Section 4 presents an updated and improved IoT network up to the device level, together with its fault tree diagram and computations. The revised IoT network and the changes made to its controller layer are explained in Section 5. Section 6 explains the network topology and the devices that connect the cluster heads to the base station. A novel fault-free, shortest and fastest path-finding algorithm is discussed in Section 7. The results of the experiments on the revised network and a discussion of them are presented in Sections 8 and 9, respectively. Conclusions and future scope are presented in Section 10.

Related Work
D. Koziol et al. [36] proposed that achieving QoS between the base station and the remote user equipment (UE) requires resolving many trade-offs regarding signalling overhead, implementation complexity and overall delay. They did not recommend any alternate mechanisms to ensure fault tolerance is maintained in the event of failures while achieving the required level of QoS.
Skorin-Kapov [37] presented Machine-Type Communication (MTC) which considers simplification of the channels and interfaces such that the communication system is suitable for NB-IoT and LTE-M-based applications. They did not consider any specific networking topology or failure situations.
C. Min et al. [38] expressed that MTC created an interest in D2D applications, especially in the automotive sector. They did not consider any specific networking topology or failure situations.
According to Hunukumbure et al. [39], adding D2D communication to cellular infrastructure increases reliability, reduces power usage and prevents network congestion. They suggested a random-access process that uses broadcast messages to announce the D2D mode of communication, so that UEs can talk directly with one another using CSMA with collision avoidance. However, such an approach must account for IoT devices and the resource limits they impose, and the distance between the base stations and the equipment on either side of the stations still needs to be considered.
J. Lianghai et al. [40] proposed a D2D approach for improving battery life and the accessibility of Cellular IoT (CIoT) deployments. Using gathered environmental data (e.g., battery level and position), the network oversees and maintains the assignment of UEs to devices (remote devices, relays). The authors create new signalling behaviour for UE attachment, transmission mode (re)configuration and uplink data delivery. A D2D solution controlled by the core network could address RAN (Radio Access Network) failure issues more effectively.
Al-Salihi NK [41] proposed an Internet of Things (IoT)-based position-fixing system as an alternative to GPS, as such systems are useful for tracking the daily activities of children and the elderly and for vehicle tracking. They proposed a redundancy-based model for improving the fault tolerance of the IoT-based position-fixing system. However, they did not account for alternate communication paths in the event of a failure.
Bhupathi et al. [42] presented different metrics for computing the fault tolerance of the IoT network. They also presented [11] a crossbar network topology to connect the clusters to the cluster heads. Moreover, they implemented a method to predict the fault that a device will inject due to the power depletion rate. The devices are isolated before the fault can be injected into the system.
Many algorithms have been proposed to find the shortest path from a source node to a sink node.
Daniel Foead et al. [43] presented a comprehensive review of variants of A*-based search algorithms and observed that these algorithms fail as networks grow in size and change dynamically.
F. Xia et al. [44] reviewed variants of random walk algorithms for finding the shortest path in a network and suggested that a suitable variant must be selected for each specific application. The algorithm, as such, does not consider the existence of loops in the network.
Niranjane P. B. et al. [45] compared variants of Yen's algorithm for finding k simple shortest paths, which is based on the number of deviations that the network contains. However, k shortest paths are rarely needed once failed paths are pruned from the system. Kyle E. et al. [46] studied the choice of graph search algorithm for finding the shortest path in a directed relation graph with error propagation (DRGEP), comparing depth-first search, basic and R-value-based breadth-first search (RBFS) and Dijkstra's algorithm, and found that Dijkstra's algorithm combined with a coefficient scaling approach gave the most accurate results when applied to a bio application. Kalyan Mohanta B. P. et al. [47] presented a comprehensive review of the existing k-shortest path algorithms and showed the computational efficiency of each. Andrej Brodnik et al. [48] presented an all-pairs shortest path algorithm for directed acyclic graphs with arbitrary edge lengths. Muteb Alshmmari et al. [49] proposed a single-source shortest path algorithm for dynamic graphs with a high frequency of change.
The methods proposed in the literature do not address the failure of the main base station, the need for a second, redundant base station, or how that base station should be connected to the cluster heads given distance limitations and non-overlapping communication spectra. None have attempted to determine the fault-free shortest and fastest path for communication between the cluster heads and the second base station.

Methodology
A flow diagram depicting the execution of the proposed methods is shown in Figure 1. The blocks relating to metrics development, initial prototype development and its FTA, the implementation of a crossbar network in the device layer, the fault detection and isolation method implemented in the sensing and actuating devices to counter possible fault injection in the device layer (with its related FTA model), and the resulting improvement in the fault tolerance of the IoT network were explained by Bhupathi et al. [11,42]. Building on these device-layer implementations, a second base station has been added to the controller layer. It is connected to the cluster heads through an intelligent relay-based network with built-in redundancy, which addresses the remoteness of the base station from the cluster heads. A path-finding algorithm implemented in each relay finds the shortest path with the least traffic, reducing latency. An FTA diagram for the newly introduced network is developed and combined with the FTA diagram of the prototype model. Fault tolerance values are computed by generating a fault table. Finally, the stage-wise improvements in the fault tolerance of the IoT network resulting from the methods implemented in the device layer and the controller layer are compared.

The Updated IoT Network
An IoT network with changes at the device level is presented in Figure 2. The updated IoT system implements the following in the device layers to enhance the fault tolerance of the network [11,42].

1. Implementing a fault detection system in the device layer that detects possible power-related faults and isolates the faulty devices.
2. Establishing a crossbar network between the cluster heads and the device clusters to provide alternative, redundant communication paths.
3. Developing redundant networks, using different topologies, that connect the base stations and the cluster heads.
4. Connecting the base stations to the microcontrollers in peer-to-peer mode.
5. Computing fault tolerance using both linear and probability models.
6. Connecting several controllers to a single services server.
7. Connecting the services server to a gateway en route to the internet and the cloud.
The devices in a cluster are linearized and connected to a cluster head that performs no sensing or actuating function. A crossbar network connects the outputs of the linearized clusters to the cluster heads. The cluster heads are connected to a base station in parallel computing mode, and the base stations communicate with multiple microcontrollers in peer-to-peer computing mode.
The microcontrollers are connected to the services server in many-to-one mode. The services server receives requests from the devices or the users, executes the service-related code and transmits the results back to the controllers or the user. The revised IoT network with the changes made in the device layer is shown in Figure 2.
To develop a fault tree of the sample IoT network, the crossbar network is replaced by a single device whose success rate is computed using the probability model associated with the crossbar network [11]; this success rate is 0.842. The fault tree of the residual network and a table of the success rate computations are generated using the algorithms presented by Bhupathi et al. [11]. The fault tree generated for the updated network is shown in Figure 3.
In the sample network, the single base station is the real bottleneck and the most vulnerable point of failure of the entire IoT network. Any fault occurring at this point disconnects the entire network, and the fault tolerance of the network becomes zero. Hence the need for redundant base stations.
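The single-point-of-failure argument above can be checked numerically. The following is an illustrative cross-check under stated assumptions, not the FTA-based computation of [11]: it computes the exact probability that a cluster head can still reach a controller by enumerating every up/down combination of the nodes, assuming independent node failures. The node names, the graphs `SINGLE_BS` and `DUAL_BS`, the helper functions and all success rates are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set

Graph = Dict[str, List[str]]


def reachable(succ: Graph, alive: Set[str], source: str, sink: str) -> bool:
    """True if working nodes alone connect source to sink (iterative DFS)."""
    if source not in alive:
        return False
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == sink:
            return True
        for nxt in succ[node]:
            if nxt in alive and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def network_success(succ: Graph, rate: Dict[str, float], source: str, sink: str) -> float:
    """Exact success probability: sum over all 2^n up/down node states."""
    nodes = list(rate)
    total = 0.0
    for states in product((True, False), repeat=len(nodes)):
        prob, alive = 1.0, set()
        for node, up in zip(nodes, states):
            prob *= rate[node] if up else 1.0 - rate[node]
            if up:
                alive.add(node)
        if reachable(succ, alive, source, sink):
            total += prob
    return total


# HYPOTHETICAL success rates for one cluster-head-to-controller chain.
RATE = {"cluster_head": 0.97, "bs1": 0.98, "bs2": 0.95, "controller": 0.99}
# HYPOTHETICAL topologies: bs2 is unused in the single-base-station case.
SINGLE_BS: Graph = {"cluster_head": ["bs1"], "bs1": ["controller"], "bs2": [], "controller": []}
DUAL_BS: Graph = {"cluster_head": ["bs1", "bs2"], "bs1": ["controller"],
                  "bs2": ["controller"], "controller": []}

for label, graph in (("single", SINGLE_BS), ("dual", DUAL_BS)):
    print(label, round(network_success(graph, RATE, "cluster_head", "controller"), 4))
# single 0.9411
# dual 0.9593
```

Setting the rate of `bs1` to 0.0 drives the single-base-station result to exactly zero, while the dual topology still succeeds through `bs2`. Enumeration grows as 2^n in the number of nodes, so it only suits small sub-networks like this one, but it handles nodes shared by several paths exactly.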

Revised IoT Network
The sample IoT network has been modified, as shown in Figure 4. The following changes have been made in the controller layer.

1. A second base station, remotely situated for spectrum reasons, has been added. It is connected to the cluster heads through a separate network of relays/switches placed strategically to cover the distance. Two layers of relays are used, given a maximum connectivity distance of 100 km.
2. An algorithm is implemented that finds the shortest path from a source node to a sink node while keeping the traffic on that path to a minimum. The number of bytes to be transmitted over a path is used as the criterion for selecting the path.
3. Parallel communication is used between the first base station and the cluster heads, and between the base station and the microcontrollers.
4. A relay-driven network with built-in redundancy connects the second base station to the cluster heads, and parallel communication is used between the second base station and the controller.
This paper discusses ways of improving the fault tolerance of the communication between the cluster heads, the base stations and the microcontrollers. It focuses on topologies that connect the cluster heads to the base stations so that fault-free operation is achieved with the least possible response time, which depends on traffic and communication distance. Accordingly, the path with the minimum traffic and distance is chosen for communication.

The Network between the Cluster Heads and the 2nd Base Station
The second base station is situated far from the cluster heads to avoid spectrum collisions. A separate network with built-in redundancy is therefore required to tolerate failures affecting communication between the cluster heads and this base station. Figure 5 shows the network between the cluster heads and the second base station. Nodes 1, 2, 3 and 4 are cluster heads, nodes 5, 6 and 7 are first-layer relays, nodes 8 and 9 are the intelligent relays, and node 10 is the base station. Redundancy is maintained at every stage: between the cluster heads and the first-layer relays, between the first and second relay layers, and between the second-layer relays and the base station. It is achieved by providing two paths from each node to the nodes in the next layer. Each node is intelligent and can run a path-finding algorithm.
The FTA equivalent of such a network can be developed using the linear cluster concept. No probability model can be developed as the network follows no specific structure. An algebraic path-finding algorithm is implemented in each node to select a path based on the distance and traffic on a specific path.

Pathfinding through the Algebraic Method
The following algorithm finds all the paths between the source and sink nodes and then selects the path that requires transmitting the minimum data per km of distance. The steps involved are described below.
Step-1 Capture the network: The algebraic method primarily involves capturing the network as a precedence matrix, which represents the network's structure. For the network shown in Figure 5, the precedence relationships are shown in Table 1. For each node, the table gives the number of bytes to be transmitted at a point in time, the transmission speed used, the time taken to transmit the data, the preceding node and the distance of the preceding node from the current node. The details of the connected preceding nodes are captured for every node in the network. In Table 1, a node with no preceding nodes is a starting node, and a node with no succeeding nodes is the sink (terminal) node.
Step-2 Capture traffic at the nodes: Based on the number of bytes transmitted from the base nodes and the number of bytes received at the base station, estimate the number of bytes yet to be transmitted at each node in the network. Every node is a relay/intelligent communicating device, and all relays are assumed to communicate at the same speed of 11 Mbps. The size of the data pending transmission is recorded, and the data are distributed in equal proportions across the outgoing paths of each node. The distance between any pair of nodes is known and recorded.
Step-6 Path pruning: If any node in the network fails (say node 6), every path containing that node is ignored and marked as pruned. The pruned status of the paths after the failure of node 6 is shown in Table 2. A total of 8 paths remain available for communication.
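Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the precedence table, distances and pending byte counts are hypothetical values in the spirit of Table 1 (the actual Figure 5 links and Table 1 values are not reproduced in the text), and the selection rule takes the "minimum data per km" criterion literally as pending bytes on the path divided by path length. Pending bytes at a node are split equally over its outgoing links, as Step-2 describes, and every link runs at the assumed 11 Mbps. The function names are illustrative.

```python
from typing import Dict, List, Tuple

Precedence = Dict[int, List[Tuple[int, float]]]

# HYPOTHETICAL precedence table in the spirit of Table 1:
# node -> [(preceding node, distance in km from that predecessor)].
# Nodes 1-4 have no predecessors (cluster heads); node 10 is the sink.
PRECEDENCE: Precedence = {
    1: [], 2: [], 3: [], 4: [],
    5: [(1, 40.0), (2, 35.0)],
    6: [(1, 30.0), (2, 45.0), (3, 50.0), (4, 38.0)],
    7: [(3, 42.0), (4, 33.0)],
    8: [(5, 60.0), (6, 55.0), (7, 70.0)],
    9: [(5, 65.0), (6, 50.0), (7, 58.0)],
    10: [(8, 90.0), (9, 95.0)],
}
# HYPOTHETICAL bytes still pending transmission at each node (Step-2).
PENDING_BYTES: Dict[int, int] = {1: 4000, 2: 2500, 3: 3000, 4: 3500, 5: 1200,
                                 6: 5000, 7: 800, 8: 2000, 9: 1500, 10: 0}
LINK_SPEED_BPS = 11_000_000  # all relays assumed to run at 11 Mbps


def successors(prec: Precedence) -> Precedence:
    """Invert the precedence table: node -> [(next node, distance_km)]."""
    succ: Precedence = {node: [] for node in prec}
    for node, preds in prec.items():
        for pred, dist in preds:
            succ[pred].append((node, dist))
    return succ


def all_paths(prec: Precedence, source: int, sink: int) -> List[List[int]]:
    """Depth-first enumeration of every simple source-to-sink path."""
    succ = successors(prec)
    paths: List[List[int]] = []

    def dfs(node: int, path: List[int]) -> None:
        if node == sink:
            paths.append(path)
            return
        for nxt, _ in succ[node]:
            if nxt not in path:  # never revisit a node (loop guard)
                dfs(nxt, path + [nxt])

    dfs(source, [source])
    return paths


def path_metrics(prec: Precedence, pending: Dict[int, int],
                 path: List[int]) -> Tuple[float, float, float]:
    """Return (distance_km, traffic_bytes, transmit_time_s) for one path."""
    succ = successors(prec)
    dist = {(pred, node): d for node, preds in prec.items() for pred, d in preds}
    distance_km = sum(dist[(a, b)] for a, b in zip(path, path[1:]))
    # Step-2: each node's pending bytes are split equally over its outgoing links.
    traffic = sum(pending[n] / len(succ[n]) for n in path[:-1])
    return distance_km, traffic, traffic * 8 / LINK_SPEED_BPS


def best_path(prec: Precedence, pending: Dict[int, int], source: int, sink: int) -> List[int]:
    """Select the path with the fewest pending bytes per km (literal reading)."""
    def bytes_per_km(path: List[int]) -> float:
        distance_km, traffic, _ = path_metrics(prec, pending, path)
        return traffic / distance_km
    return min(all_paths(prec, source, sink), key=bytes_per_km)


print(best_path(PRECEDENCE, PENDING_BYTES, 1, 10))  # [1, 5, 9, 10]
```

With these assumed values, cluster head 1 has four candidate paths and routes via relays 5 and 9, which here also carries the least traffic. Exhaustive enumeration is affordable because the relay network has only ten nodes; all hypothetical link lengths stay within the 100 km connectivity limit.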
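Step-6 can be sketched in the same spirit. The successor lists below are an assumed stand-in for Figure 5, in which each cluster head and first-layer relay has two outgoing links as described for that network; they are not the paper's actual edge set. The names `SUCCESSORS`, `enumerate_paths` and `prune` are hypothetical. With this assumed topology, failing node 6 happens to leave 8 of the 16 source-to-sink paths, the same count reported for Table 2, although the real links may differ.

```python
from typing import Dict, List, Set, Tuple

# ASSUMED stand-in for Figure 5 (not the paper's exact links).
SUCCESSORS: Dict[int, List[int]] = {
    1: [5, 6], 2: [5, 6], 3: [6, 7], 4: [6, 7],  # cluster heads
    5: [8, 9], 6: [8, 9], 7: [8, 9],             # first-layer relays
    8: [10], 9: [10],                            # intelligent relays
    10: [],                                      # second base station (sink)
}
SOURCES = [1, 2, 3, 4]
SINK = 10


def enumerate_paths(succ: Dict[int, List[int]], sources: List[int], sink: int) -> List[List[int]]:
    """Collect every simple path from any source to the sink (iterative DFS)."""
    paths: List[List[int]] = []
    stack = [[s] for s in sources]
    while stack:
        path = stack.pop()
        if path[-1] == sink:
            paths.append(path)
            continue
        for nxt in succ[path[-1]]:
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def prune(paths: List[List[int]], failed: Set[int]) -> Tuple[List[List[int]], List[List[int]]]:
    """Split paths into (usable, pruned): a path is pruned if it visits a failed node."""
    usable = [p for p in paths if failed.isdisjoint(p)]
    pruned = [p for p in paths if not failed.isdisjoint(p)]
    return usable, pruned


paths = enumerate_paths(SUCCESSORS, SOURCES, SINK)
usable, pruned = prune(paths, failed={6})
print(len(paths), len(usable), len(pruned))  # 16 8 8
```

Pruning is only a filter over the already enumerated paths, so a node failure requires re-scoring the surviving paths rather than re-enumerating the network.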

Results
Fault Tree for the Revised IoT Network
The crossbar network between the controllers and the services servers is replaced by a single device whose success rate equals that of the crossbar network, as computed with its related probability model. The network added to connect the second base station is converted into an FT diagram using AND/OR conditions derived from the precedence relationships and the data flow designed to build redundancy into the network. The modified FT diagram of the revised IoT network is shown in Figure 6.
The FTA diagram is generated using the algorithm presented by Bhupathi et al. [11]. Similarly, the crossbar network between the devices and the cluster heads is replaced by a single device whose success rate equals that of the crossbar network, computed with its related probability model. These transformations convert the FTA diagram into a linear model.
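The AND/OR reduction described above can be illustrated with a small success-tree evaluator. This is a sketch under the assumption of independent failures, not the algorithm of Bhupathi et al. [11]: an AND gate requires every child to succeed (series, the product of success rates), and an OR gate requires at least one child to succeed (redundancy, one minus the product of failure probabilities). The crossbar is collapsed into one equivalent device using the 0.842 figure quoted earlier for the crossbar network; the base-station rates and the class names are hypothetical.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Union


@dataclass
class Device:
    """A leaf of the success tree: a device or an equivalent sub-network."""
    name: str
    success_rate: float


@dataclass
class Gate:
    """AND: every child must succeed (series). OR: at least one must (redundancy)."""
    kind: str
    children: List[Union["Device", "Gate"]] = field(default_factory=list)


def success(node: Union[Device, Gate]) -> float:
    """Recursively evaluate a success tree, assuming independent failures."""
    if isinstance(node, Device):
        return node.success_rate
    rates = [success(child) for child in node.children]
    if node.kind == "AND":
        return prod(rates)
    if node.kind == "OR":
        return 1.0 - prod(1.0 - r for r in rates)
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Crossbar collapsed to one equivalent device (0.842); base-station rates HYPOTHETICAL.
crossbar = Device("crossbar equivalent", 0.842)
base_stations = Gate("OR", [Device("base station 1", 0.98),
                            Device("base station 2 via relays", 0.95)])
network = Gate("AND", [crossbar, base_stations])
print(round(success(network), 4))  # 0.8412
```

The independence assumption is what allows each gate to be evaluated from its children alone; if one device feeds several branches, the tree should be restructured so that the shared device appears once above the gate.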

Success Rate Computations
Bhupathi et al. [11] presented an algorithm that generates the success table from an FTA diagram. The generated success table is presented in Table 4. The success rate of every device is computed from its precedence relationships with other devices. From the table, it can be observed that the success rate of the revised IoT network is 0.980.

Discussion
The algebraic algorithm built into every node of the additional network that connects the second base station to the cluster heads performed well compared with its nearest competitor, the single-source shortest path algorithm. A comparison of the algorithms is shown in Table 5.
Several path-finding algorithms proposed in the literature were surveyed and compared with the algebraic path-finding algorithm presented in this paper. The comparison considers the number of nodes, edges, shortest paths, path pairs and several other elements, and is carried out on the network shown in the revised IoT diagram. The table shows that the algebraic method requires fewer operations to select a path for data transmission, given the source node from which transmission is initiated.
The failure rate of a base station is small: its success rate is around 98.0%, so the probability that it fails is about 0.02. Because the entire IoT network malfunctions when its only base station fails, a second base station is necessary to approach a failure-free situation (a 100% success rate). Connecting it requires a different networking topology, because the distance between the cluster heads and the second base station becomes a major issue. Adding one redundant base station is sufficient, as it brings the success rate close to 100%. Adding more base stations would overload the cluster heads and degrade the response time through increased latency.

Table 5. Comparison of path-finding algorithms
Algorithm                      Reference   Complexity           Operations
Graph search algorithms        [46]        (ne)^2               128,000
Yen's k shortest paths         [45]        kn + m × log m       170
All-pairs shortest paths       [48]        m × n + m × log n    110
Random walk                    [44]        n × e                160
Single-source shortest path    [49]        f + f × log(f)       62
Algebraic

The success rate of the revised IoT network is 0.948, compared with 0.827 for the IoT network that incorporates only the device-layer changes. A comparative analysis of the improvements in fault tolerance achieved by the changes made to different layers of the IoT network is shown in Table 6. With these changes, the fault tolerance of the IoT network increased by 14.63%.
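As a worked check of the figures in this section, assume the two base stations fail independently and each has the 0.98 success rate quoted above (independence is an assumption the text does not state). The redundant pair and the relative improvement from 0.827 to 0.948 then work out as:

```latex
\begin{aligned}
P_{\text{fail}}(\text{BS}) &= 1 - 0.98 = 0.02\\
R_{\text{dual}} &= 1 - (0.02)^2 = 1 - 0.0004 = 0.9996\\
\text{improvement} &= \frac{0.948 - 0.827}{0.827} = \frac{0.121}{0.827} \approx 0.1463 = 14.63\%
\end{aligned}
```

Under this assumption the redundant pair reaches 99.96%, close to but not exactly 100%.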

Alternative Justification
The fault tolerance of both the sample network and the revised network is computed under different failure conditions, including communication failure between a cluster head and the base station. The computations are shown in Table 7. The table shows that the revised network retains its fault tolerance level even when some failures occur in the communication paths that connect the cluster heads to the base stations. The combined success rate of the IoT network, considering both the failures between the cluster heads and the base stations and those between the base stations and the controllers, improved from 0.45 to 0.64, a 42% improvement.
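The 42% figure is the relative change in the combined success rate from 0.45 to 0.64; in absolute terms the gain is 0.19:

```latex
\frac{0.64 - 0.45}{0.45} = \frac{0.19}{0.45} \approx 0.422 = 42.2\%
```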

Conclusions
1. The fault-tolerance capability of an IoT network is critical, especially when mission-critical systems are built using IoT technologies.
2. The fault-tolerance capability of an IoT network can be enhanced by making suitable changes in each layer of the IoT. This paper focuses on enhancing the fault-tolerance capability of the controller layer.
3. An IoT network based on a single base station is risky and unsuitable for implementing mission-critical systems.
4. The fault-tolerance capability of the IoT network improves when a redundant base station is added and connected via an intelligent relay-based network. The success rate of the revised IoT network increased from 0.827 to 0.948, a 14.63% improvement.
5. Considering both parts of the network, the combined success rate, which covers the path from the cluster heads to the base station and the path from the base station to the controller, improved from 0.45 to 0.64, a 42% improvement. The communication latency of such a network is minimised by selecting the path with the least traffic and distance.
6. The path-finding algorithm implemented in the intelligent relays requires fewer operations than the other algorithms surveyed from the literature.

Future Work
Further enhancement in the fault tolerance of the IoT network can be carried out at the controller level, services layer, and gateway layer by incorporating suitable changes considering the devices and the connectivity between devices in those layers.