In this section, we first outline some of the main challenges that exist for designing and developing fault tolerance approaches in IoT systems, and then we present the findings of our study on fault tolerance approaches to motivate the need for a multi-layer fault tolerance approach.
2.2. Related Work
Different authors have suggested quite diverse approaches and techniques; however, the heterogeneity, complexity, and large scale of IoT systems make discovering and treating faults major challenges to overcome. Fault tolerance in IoT systems has been a hot topic in recent years. We therefore carried out an extensive analysis of the existing literature to identify research that addresses the main challenges to achieving fault tolerance in IoT systems. We list and summarize these works in Table 1, which compares the challenges addressed by the related works with those addressed by the approach proposed in this work.
Li et al. [18] proposed a layer-based framework for fault detection, location, and recovery in heterogeneous IoT systems, aiming to unify fault measures and make the most of existing resources. For this purpose, they used fuzzy cognitive maps (FCMs), which allow adaptive monitoring of the observation points. Fault management was performed through a layered scheme in which the first layer comprises the detection and location steps. Several observation points defined in this layer monitored and analyzed the communication links, using FCM-advanced to predict the risk of a failure. When one of the observation points identified that a failure had occurred, the second layer activated failure recovery mechanisms across the network. As a result, failures were identified in a distributed way, but recovery took place in the specific network affected. The authors evaluated the framework using a scenario containing three types of networks: wireless sensor networks, wired networks, and Wi-Fi networks. The evaluation compared the proposed scheme with a traditional destination reporting algorithm with respect to fault location time, transmission time, and probability of false alarms. The results presented by the authors demonstrate the suitability of the proposed fault management scheme: it improves fault detection and reduces the probability of false alarms.
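To make the two-layer idea concrete, the following minimal sketch shows observation points scoring link risk and a recovery layer that reacts only within the network that raised the alarm. The risk formula, the 0.8 threshold, and all names are illustrative assumptions standing in for the FCM-based predictor, not the authors' implementation:

```python
# Sketch of a two-layer fault management scheme: layer 1 detects and
# locates risky links; layer 2 triggers recovery scoped to the affected
# network. All names and thresholds are illustrative assumptions.

RISK_THRESHOLD = 0.8

def predict_link_risk(metrics):
    """Toy risk score in [0, 1]; a stand-in for the FCM-based predictor."""
    return min(1.0, 0.7 * metrics["loss_rate"] + 0.3 * metrics["jitter_norm"])

def detection_layer(observation_points):
    """Layer 1: each observation point monitors its links and raises alarms."""
    alarms = []
    for point in observation_points:
        for link, metrics in point["links"].items():
            if predict_link_risk(metrics) >= RISK_THRESHOLD:
                alarms.append((point["network"], link))
    return alarms

def recovery_layer(alarms):
    """Layer 2: recovery actions are scoped to the specific network."""
    return {net: f"reroute around {link}" for net, link in alarms}

points = [
    {"network": "wsn", "links": {"n1-n2": {"loss_rate": 0.9, "jitter_norm": 0.8}}},
    {"network": "wifi", "links": {"ap1-sta3": {"loss_rate": 0.1, "jitter_norm": 0.2}}},
]
actions = recovery_layer(detection_layer(points))  # only the wsn link alarms
```

The point of the split is that detection runs everywhere in a distributed fashion, while recovery stays local to the failing network, mirroring the scheme described above.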
Fortino et al. [54] defined a framework to address the interoperability issue in the IoT domain, proposing a complete approach that aims to facilitate "voluntary interoperability" at any level of a system and across domains of any IoT application, ensuring simplified integration of heterogeneous IoT technologies. The INTER-IoT approach includes three solutions to provide voluntary interoperability: (1) INTER-LAYER, a layer-oriented approach providing interoperability while exploiting layer-specific functionalities; (2) INTER-FW, a global and open platform to manage interoperability between IoT platforms, acting on top of INTER-LAYER; and (3) INTER-METH, a methodology for the process of building interoperable IoT platforms.
The authors demonstrated how the framework works by describing three use cases: container transport in a smart port, decentralized monitoring of assisted living, and integration between health monitoring platforms. Despite describing these usage scenarios, the proposed framework lacks an evaluation and an implementation in simulated or real environments to demonstrate its functioning and reveal possible problems.
In Abreu et al. [55], the authors proposed an end-to-end resilient IoT architecture for smart cities. Their contribution was to define an architecture that considers the design, implementation, and protocols that support key dependability features in the different architecture layers. The proposed architecture is composed of three layers: IoT infrastructure, IoT middleware, and IoT services. One of its key features is the possibility of having more than one instance per layer, so the services of a layer can be instantiated several times. In addition, for greater ubiquity and flexibility of components, the middleware and service layers reside in the cloud. Furthermore, the architecture supports the virtualization and deployment of essential components to reduce the latency of critical applications. The main features included are: (1) the heterogeneity manager, which enables intercommunication between physical devices and the cloud by translating data from heterogeneous protocols in the lower layer into a common language; (2) the communication manager, which establishes how information is exchanged between services, applications, and smart objects, providing route control, communication infrastructure, entity mobility, and support for specific requirements of the IoT context; (3) the virtualized device manager, which provides device administration, enabling the identification, discovery, and location of services; (4) the resilience manager, which provides reliability for the IoT infrastructure, offering protection and recovery mechanisms that work together with path control, topology, and mobility control mechanisms in case of failures, and orchestrating heterogeneous dependability techniques that enable recovery of the IoT infrastructure; and (5) the IoT services layer, which manages the applications and services supported in smart cities and makes it possible to analyze the data collected by sensors using big data techniques.
The authors did not present a concrete implementation; they only described the functioning of the architecture through scenarios in which failures occur and the system recovers from them. However, the description is succinct, it is not possible to verify the functioning of the architecture, and there is no clear evidence that its use improves the dependability of IoT systems.
Woo et al. [25] researched, proposed, and built an IoT system for personal healthcare devices (PHDs) based on the oneM2M communication protocol. Using PHDs in oneM2M systems, however, requires conversion between protocols. Thus, to guarantee operation with the different IoT servers, they used the ISO/IEEE 11073 protocol, supported by most PHDs, thereby requiring a translation mechanism. The proposed system comprises an application dedicated node-application entity, which collects and transmits data using the ISO/IEEE 11073 protocol. The collected data are transmitted to a centralizing entity (middle node-common service entity (MN)) or to a PHD management server (infrastructure node-common service entity). The latter is responsible for centralizing the information and making it available using the oneM2M protocol. Translation also takes place in the communication between the PHD and the MN.
Concerned with faults that might occur in gateways, the authors proposed a fault-tolerant algorithm to increase the system’s reliability. The algorithm uses a hierarchical network that connects gateways in the same layer and in the layer immediately above, forming an interconnected chain. Each gateway stores a copy of the data of the preceding gateway in the chain, and the data from the last gateway in the chain are stored in a gateway of the next higher layer. This way, even if failures occur in two gateways simultaneously, the data can be recovered. The proposed system and algorithm were evaluated through experiments with multiple hypothetical scenarios, revealing that it was possible to recover from failures.
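A minimal sketch of such a backup chain follows, assuming a dictionary-based gateway state and illustrative names (the authors' actual data model is not specified): each gateway's replica is held by its successor, and the tail of the chain is backed up one layer up.

```python
# Sketch of a gateway backup chain: successor nodes hold their
# predecessor's replica; the chain tail is backed up by a higher-layer
# gateway. All names and the dict-based state are illustrative assumptions.

def build_backup_map(chain, upper_layer_gw):
    """Return {gateway: holder_of_its_replica} for a gateway chain."""
    backups = {}
    for prev, nxt in zip(chain, chain[1:]):
        backups[prev] = nxt              # successor stores predecessor's copy
    backups[chain[-1]] = upper_layer_gw  # tail is backed up one layer up
    return backups

def recover(failed, backups, live_state):
    """Fetch the failed gateway's replica from whichever node holds it."""
    holder = backups[failed]
    return live_state[holder]["replicas"][failed]

chain = ["gw1", "gw2", "gw3"]
backups = build_backup_map(chain, "upper_gw")
live_state = {
    "gw2": {"replicas": {"gw1": {"patients": 12}}},
    "gw3": {"replicas": {"gw2": {"patients": 7}}},
    "upper_gw": {"replicas": {"gw3": {"patients": 4}}},
}
restored = recover("gw2", backups, live_state)  # gw3 holds gw2's copy
```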
Despite showing that the system could recover from a failure, the authors did not demonstrate the benefit of the algorithm by measuring how long the system took to recover, information that is crucial for assessing system availability. Another aspect the authors mention is the increased complexity of implementing this solution in an environment composed of several layers and devices. In addition, the fault tolerance mechanism focuses only on the recovery process; it is not clear which steps or processes allow failures to be identified, or how the proposal improves this step.
Consequently, it is not clear whether the upper layers can make decisions about fault tolerance, since they only have data from the layers immediately below. Furthermore, the fault tolerance mechanism does not consider devices and sensors as part of the fault recovery process. The proposed solution, although promising, focuses only on the healthcare area and uses specific protocols, which hinders the inclusion of new technologies and protocols.
Belkacem et al. [56] proposed an approach to improve the dependability of IoT systems using fault tolerance and statistical data from several remotely distributed locations equipped with redundant communication technologies (such as RFID, NFC, and beacons) and sensors for monitoring contexts and environments. The approach aims to detect and correct failures, especially in locations where local fault tolerance (LFT) is unreliable or absent. A central server performs the reliability analysis and error correction.
To improve dependability, the authors explored collaboration between the various remote locations, profiling the behavior of each identifying node and sensor node. This profiling process makes it possible to select the most reliable nodes through a comparative study of data collected from different locations and nodes under similar conditions. Decision-making can occur in a distributed or centralized way. In the centralized approach, a central server (common to all remote locations) diagnoses failures based on the profiling and collected data; this ensures complete monitoring of the network status and thus a more accurate error diagnosis. To limit the dependence on the central server, fault detection is also performed in a distributed manner at each remote location. If the local server does not have an LFT or its LFT is inconclusive, decision-making is transferred to the central server, which detects and corrects the failure. Failure detection relies on statistical analysis, involving outlier detection and data correction based on the extreme studentized deviate (ESD) test. The proposal presented by the authors was not evaluated in a simulation or assessed in a real system; the authors only compared it with existing state-of-the-art proposals, considering aspects such as data analysis, data correction, and decision-making.
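The escalation logic can be sketched as follows, using the maximum studentized deviate as a simplified stand-in for the full generalized ESD test (which compares successive deviates against t-distribution critical values). The 2.5 cutoff, the minimum sample size, and all names are illustrative assumptions:

```python
# Sketch of local fault detection with escalation to a central server.
# A reading is flagged when its studentized deviate exceeds a cutoff;
# inconclusive local checks are escalated. Simplified stand-in for the
# generalized ESD test; cutoff and names are illustrative assumptions.
import statistics

def max_studentized_deviate(samples):
    """Index and deviate |x - mean| / stdev of the most extreme sample."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    deviates = [abs(x - mean) / sd for x in samples]
    idx = max(range(len(samples)), key=lambda i: deviates[i])
    return idx, deviates[idx]

def local_fault_check(samples, cutoff=2.5):
    """Return ('faulty', index), ('ok', None), or escalate if inconclusive."""
    if len(samples) < 10:
        return ("escalate_to_central", None)  # inconclusive LFT
    idx, dev = max_studentized_deviate(samples)
    return ("faulty", idx) if dev > cutoff else ("ok", None)

# One sensor reading is far outside the range of its peers.
readings = [21.0, 21.2, 20.9, 21.1, 21.0, 20.8, 21.3, 21.1, 20.9, 55.0]
verdict, where = local_fault_check(readings)  # flags the 55.0 reading
```

Note that for a sample of size n the studentized deviate is bounded by (n-1)/sqrt(n), so the cutoff must be chosen relative to the sample size; the real ESD test handles this via t-distribution critical values.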
Guimaraes et al. [33] proposed a framework (IoTUS) that allows information sharing between layers while preserving layering benefits such as modularity and portability. The framework defines a transversal, extensible service layer that allows the exchange of information about the system’s functioning (e.g., the numbers of transmissions, receptions, and collisions in the data-link layer) and its services (e.g., neighbor discovery and data aggregation). The framework can be used with existing communication protocols, as it works by intermediating communication and building new packets, including metadata, that are exchanged between layers. It has a set of modules that act in various functions, such as discovery, routing, packet assembly, and synchronization. The framework aims to reduce the system’s power consumption; however, the extra communication processing increases CPU consumption. In addition, all adaptation and interoperability must be defined at design and compile time to determine which protocols will be supported. The authors evaluated the framework without considering the latency introduced by the solution. The results showed that IoTUS achieved better energy-consumption performance than other state-of-the-art approaches.
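The intermediation idea can be sketched as follows: a transversal layer wraps outgoing packets with cross-layer metadata and publishes it on reception. The dictionary packet format and field names are assumptions, not the IoTUS packet layout:

```python
# Sketch of cross-layer information sharing via packet intermediation:
# the send path attaches metadata (e.g., data-link counters) and the
# receive path publishes it to a shared store without the layers talking
# to each other directly. Field names are illustrative assumptions.

def wrap_packet(payload, shared_info):
    """Send path: build a new packet carrying cross-layer metadata."""
    return {"payload": payload, "meta": dict(shared_info)}

def unwrap_packet(packet, layer_stats):
    """Receive path: publish the metadata to the shared service layer."""
    layer_stats.update(packet["meta"])
    return packet["payload"]

shared = {"tx_count": 42, "collisions": 3}   # data-link layer counters
pkt = wrap_packet(b"temperature=21.4", shared)
stats = {}
payload = unwrap_packet(pkt, stats)          # other layers read stats
```

The trade-off noted above shows up directly here: every packet pays the extra wrap/unwrap processing in exchange for the shared visibility.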
In Su et al. [57], a framework providing a decentralized fault tolerance mechanism was proposed. The mechanism aims to detect failures, recover from them, and dynamically reconfigure the system, providing failover for the system components (services) while meeting fault tolerance requirements; its decentralized design avoids a single point of failure and performance bottlenecks. It uses service replication to support fault tolerance: each component is replicated on other devices, creating a “redundancy level”. Each device monitors another, forming a daisy monitoring chain through heartbeat checks. Consequently, when a device does not send its heartbeat message, the device update process is activated, removing the faulty device from the chain and delegating to the “redundant” devices the components necessary for the system to return to normal operation.
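The heartbeat chain and failover step can be sketched as follows, under simplifying assumptions (illustrative names, a fixed timeout, and a precomputed service-to-replica map):

```python
# Sketch of decentralized failover: devices in a monitoring chain are
# declared failed when their heartbeat is stale; their services are then
# re-activated on the devices holding replicas. Timing values, the chain
# layout, and all names are illustrative assumptions.

def detect_failures(chain, last_heartbeat, now, timeout):
    """A device is failed if its last heartbeat is older than the timeout."""
    return {dev for dev in chain if now - last_heartbeat[dev] > timeout}

def failover(chain, services, replicas, failed):
    """Drop failed devices from the chain and promote their replicas."""
    new_chain = [d for d in chain if d not in failed]
    promotions = {}
    for dev in failed:
        for svc in services[dev]:
            promotions[svc] = replicas[svc]  # redundant copy takes over
    return new_chain, promotions

chain = ["d1", "d2", "d3", "d4"]
services = {"d1": ["temp"], "d2": ["hvac"], "d3": [], "d4": ["alarm"]}
replicas = {"temp": "d3", "hvac": "d4", "alarm": "d1"}
heartbeats = {"d1": 9.0, "d2": 5.0, "d3": 9.5, "d4": 9.2}

failed = detect_failures(chain, heartbeats, now=10.0, timeout=2.5)  # d2 stale
new_chain, promotions = failover(chain, services, replicas, failed)
```

This also illustrates the constraint the authors verify: detection plus recovery must complete within the heartbeat interval, otherwise a second round of false failure suspicions can begin before the chain stabilizes.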
The authors evaluated the performance of the proposed mechanism through an experiment considering ten devices mapped onto four nodes. The first metric evaluated was the message overhead during a device failure: about 550 bytes of data were needed to recover from a failed node. The second metric was the average recovery time for nodes, approximately 2500 ms. The third metric combined the detection time with the recovery time, which must not exceed the heartbeat interval; this was confirmed by the presented data. Despite demonstrating the feasibility of the mechanism, the authors did not raise or address concerns regarding interoperability and adaptation to new technologies.
Furthermore, in some cases, the recovery mechanism took more than 3500 ms to recover from a failure with only ten devices, raising doubts about whether it can scale efficiently to a real scenario with thousands or even millions of devices. The authors also did not quantify the complexity or overhead that the decentralized tolerance strategy introduces into the overall functioning of the system.