1. Introduction
The rapid proliferation of the Internet of Things (IoT) has led to the deployment of dense mesh networks composed of heterogeneous, resource-constrained nodes in diverse domains such as environmental monitoring, industrial automation, smart agriculture, and urban infrastructure [1,2]. These networks are often deployed in harsh or inaccessible locations, making physical maintenance both costly and impractical. In such scenarios, ensuring operational resilience and minimizing downtime is paramount.
IoT mesh networks, particularly those utilizing protocols like Zigbee, 6LoWPAN, or BLE Mesh, rely on decentralized node collaboration to route data to gateways or sink nodes. However, the limited energy reserves, susceptibility to environmental interference, and hardware degradation of individual nodes pose significant challenges to reliable long-term operation [3]. Node failures, whether due to battery depletion, communication loss, or component fault, can trigger cascading effects, resulting in disrupted data flows and partitioned networks [4]. Consequently, self-healing capabilities have become an essential design requirement in next-generation IoT systems.
Traditional self-healing mechanisms are often reactive and depend heavily on central coordinators or cloud-based intelligence [5]. Such approaches introduce communication overhead, latency, and single points of failure, which are undesirable in energy-constrained and mission-critical deployments. Moreover, many existing solutions lack predictive intelligence, failing to detect latent faults before total failure occurs [6]. Prior efforts such as LP-OPTIMA [7] have explored prescriptive maintenance and resource optimization for low-power embedded systems, and related work has investigated self-healing in semantically interoperable IoT edge devices [8]. While these contributions advance the state of the art, they do not fully address the joint challenge of on-device anomaly detection and distributed recovery in mesh topologies under tight energy constraints. There is therefore an urgent need for lightweight, autonomous, and intelligent fault management that can operate natively within the network edge.
To address these challenges, this paper proposes EdgeRescue—a novel self-healing framework designed for energy-constrained IoT mesh networks. EdgeRescue integrates compact edge AI modules directly into low-power nodes, enabling them to continuously monitor local health parameters, detect anomalies, and initiate cooperative recovery actions. Unlike centralized or cloud-reliant models, EdgeRescue operates entirely at the edge, offering real-time responsiveness and resilience without external dependencies. Our design ensures that nodes can adaptively reroute traffic, isolate failing peers, and preserve connectivity, all while minimizing energy consumption and memory overhead.
The contributions of this work are threefold:
We introduce a fully decentralized, AI-driven self-healing system for IoT meshes that operates on low-power microcontrollers.
We propose a novel on-device anomaly detection strategy using lightweight 1D convolutional neural networks (1D-CNNs) tailored for runtime energy efficiency.
We evaluate the framework through simulation-based experiments on IoT mesh environments and demonstrate improvements in packet delivery, recovery latency, and energy efficiency compared to existing methods.
EdgeRescue aims to bridge the gap between predictive fault detection and energy-efficient autonomous recovery, representing a step toward resilient edge intelligence in distributed IoT systems.
The remainder of this paper is organized as follows. Section 2 presents a comprehensive review of related work in the areas of self-healing IoT systems, edge-based anomaly detection, and energy-efficient recovery protocols. Section 3 describes the proposed EdgeRescue system architecture, detailing the core components deployed at each node and their interactions within the mesh network. Section 4 introduces the methodology behind our approach, including the anomaly detection model, distributed healing protocol, and energy-aware routing algorithm. In Section 5, we present experimental results based on simulations, with quantitative comparisons against state-of-the-art baselines and visual analysis of performance metrics. Section 6 discusses the implications of our findings, outlines limitations, and identifies potential directions for future research. Finally, Section 7 concludes the paper with a summary of contributions and the broader significance of EdgeRescue for resilient IoT infrastructure.
2. Related Works
The need for resilience in distributed IoT mesh networks has led to substantial research into fault detection, anomaly prediction, and autonomous recovery strategies. However, most existing approaches either lack scalability in constrained environments or rely heavily on centralized cloud components that increase latency and power usage.
Self-healing mechanisms in wireless sensor networks (WSNs) and IoT meshes have historically centered around routing-based recovery and fault-tolerant topologies. Protocols such as the Collection Tree Protocol (CTP) and RPL (Routing Protocol for Low-Power and Lossy Networks) [9,10] enable basic path reconstruction when nodes fail, but are reactive and often incur convergence delays that disrupt real-time data transmission. Moreover, they do not offer any prediction or pre-failure detection.
To address fault localization, techniques using redundant node deployment and heartbeat message exchanges have been proposed [11]. However, these methods significantly increase energy consumption and bandwidth usage. Some studies incorporate cluster-based organization [12] to localize decision-making but lack intelligent fault interpretation capabilities.
The rise of edge computing has enabled anomaly detection to be performed closer to the data source. Deep learning models such as autoencoders (AEs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs) have been deployed in edge-based systems for detecting network failures and malicious behaviors [5]. However, their deployment on resource-constrained devices remains limited due to memory and processing bottlenecks.
Lightweight alternatives have emerged in recent years. For instance, LightNet [13] proposed a decision-tree-based anomaly detector suitable for low-end microcontrollers. Similarly, STAD [14] introduced a statistical pattern-based model that performs zero-shot anomaly detection on embedded IoT devices. Nevertheless, these models offer limited adaptability and lack autonomous recovery mechanisms.
Designing AI solutions that are compatible with constrained energy budgets is a growing subfield of edge intelligence. Pruning and quantization techniques have been widely adopted to compress deep models for IoT nodes [15]. TinyML, the field of deploying machine learning models on microcontrollers, has advanced significantly with platforms like TensorFlow Lite Micro and CMSIS-NN [16].
Several works have explored reinforcement learning for adaptive resource allocation and node health management [17,18]. However, these models often require long convergence times and pre-training, making them impractical in zero-day or intermittent deployment settings. Moreover, their inference pipelines are often too heavy for on-chip execution on Class 1 and 2 constrained devices [19].
Distributed recovery mechanisms have been investigated in the context of Mobile Ad Hoc Networks (MANETs) and WSNs. Gossip-based protocols [20] and distributed consensus models like Raft or Paxos [21] offer scalable approaches for peer-to-peer failure agreement, but require frequent inter-node messaging and synchronization. For ultra-low-power networks, such message overhead is unsustainable.
More recently, self-organizing architectures such as SelfIoT [22] and HealNet [23] introduced early-stage AI modules capable of partial diagnosis and healing. However, these approaches either assume cloud interaction for decision support or operate only at the gateway level, leaving individual node-level healing unaddressed.
On the resource optimization side, LP-OPTIMA [7] targets prescriptive maintenance in low-power embedded systems by managing resource usage proactively. While it shares EdgeRescue’s concern for energy efficiency, it does not incorporate anomaly-driven fault detection or peer-based mesh recovery. Similarly, work on self-healing of semantically interoperable IoT edge devices [8] demonstrates autonomous recovery at the device level, but focuses on semantic service restoration rather than network-layer resilience in mesh topologies. More recent distributed anomaly detection frameworks such as UDAD [24] tackle node-level detection in edge networks but stop short of coupling detection with energy-aware routing reconfiguration.
Gaps and Motivation
Despite promising advances, existing self-healing solutions for IoT mesh networks fall short in three key aspects: (1) they are often reactive rather than proactive, (2) they assume cloud connectivity for learning or recovery orchestration, and (3) they lack modular, scalable design suited for mesh topologies.
EdgeRescue addresses these limitations by embedding autonomous anomaly detection, distributed peer-aware recovery, and energy-aware routing reconfiguration into a single modular framework. Unlike LP-OPTIMA-style optimization or gateway-level healing systems, EdgeRescue introduces a new design point where detection, recovery, and routing adaptation all happen locally at each node without any cloud dependency. It is designed to operate fully within the computational and power constraints of edge mesh nodes, representing a practical evolution toward deployable intelligent self-healing.
3. System Architecture
EdgeRescue is designed as a decentralized, modular architecture tailored for constrained IoT mesh networks. It enables autonomous fault detection and recovery at the edge, without dependence on cloud infrastructure or centralized decision-making entities. The architecture is composed of lightweight AI modules embedded within each node, a localized collaboration protocol, and an energy-aware routing subsystem that allows for dynamic reconfiguration of the mesh topology.
Figure 1 presents an overview of the EdgeRescue architecture, illustrating the interaction among its major components deployed across IoT nodes.
3.1. On-Node Intelligence Layer
Each participating node in the IoT mesh is equipped with a Local Intelligence Unit (LIU) composed of the following submodules:
Anomaly Detector (AD): A lightweight 1D Convolutional Neural Network (1D-CNN) model trained offline to detect deviations in node behavior. It processes a real-time input vector comprising sensor readings (e.g., voltage levels, RSSI, queue size, transmission interval).
Node Health Profiler (NHP): Continuously collects system metrics and creates temporal feature embeddings that feed into the anomaly detector. It includes threshold calibration mechanisms to reduce false positives.
Energy Monitor (EM): Tracks residual battery and duty cycle statistics. It enforces node-aware routing changes to prevent overburdening energy-depleted nodes during rerouting.
These submodules operate independently on edge hardware and require less than 50 KB of SRAM and 100 KB of flash, making them suitable for deployment on microcontrollers such as ARM Cortex-M4F or ESP32-class devices.
The autonomous operation of each node follows a fixed local execution cycle with no dependency on external coordination at any stage. Node $i$ samples its telemetry vector $\mathbf{x}_i$ every $T$ seconds, feeds it into the on-chip 1D-CNN model $f_\theta$, and computes an anomaly score $A_i$ entirely in local SRAM. If $A_i > \tau$, where $\tau$ is the detection threshold, the node independently decides to broadcast a Fault Notification Packet (FNP) to its 1-hop neighbors. No gateway instruction, cloud query, or inter-node vote triggers this decision; the node acts on its own inference output alone. Neighbors receiving the FNP also act independently: each one checks its own routing table, excludes the flagged node, and recomputes its forwarding paths using the local cost function $C_{ij}$. The FNP is a one-directional notification, not a command; the receiving node decides what to do with it based on its own routing state. This design means the full detection-to-recovery cycle, from telemetry sampling to rerouted packet delivery, runs entirely within the mesh, with the gateway playing no role in fault management.
Unlike LP-OPTIMA-style frameworks, which rely on periodic prescriptions sent to stakeholders and centralized control signals to act on faults, EdgeRescue places the full detection-to-recovery pipeline on each individual node. The interaction with neighboring nodes is purely peer-to-peer and is limited to short fault notification packets, keeping communication overhead minimal.
3.2. Distributed Healing Protocol
When a node detects anomalous behavior—such as signal degradation, excessive queuing, or voltage drop—the Anomaly Detector flags the node state as unstable. This triggers the Distributed Healing Protocol (DHP), which operates as follows:
(1) Fault Broadcast: The detecting node notifies its 1-hop neighbors of its compromised state through a lightweight broadcast containing its node ID and fault code. This packet is 6 bytes in size and is transmitted once per detected fault event, so the added channel load is negligible relative to normal data traffic.
(2) Neighborhood Assessment: Neighboring nodes evaluate their routing tables to check for paths transiting the failed node and recompute alternative paths using local heuristics (e.g., shortest signal path or energy-balanced hops). Each neighbor does this computation locally using only its own routing state; no coordination with other neighbors is needed.
(3) Path Reinforcement: A reinforcement timer validates the new paths over several data cycles. If packet loss persists above a defined threshold, the healing process is retriggered from Step 1.
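To make the fault broadcast in Step 1 more tangible, the sketch below packs and parses a 6-byte notification. The text only states that the packet carries a node ID and a fault code in 6 bytes; the field layout used here (2-byte node ID, 1-byte fault code, 1-byte sequence number, 2-byte checksum), the truncated CRC, and the names `pack_fnp` and `unpack_fnp` are assumptions made purely for illustration.

```python
import struct
import zlib

# Hypothetical layout (not specified in the paper), big-endian:
#   node_id: uint16 | fault_code: uint8 | seq: uint8 | checksum: uint16
_HEADER = struct.Struct(">HBB")
_PACKET = struct.Struct(">HBBH")


def _checksum16(payload: bytes) -> int:
    """Low 16 bits of CRC-32, standing in for a real link-layer integrity check."""
    return zlib.crc32(payload) & 0xFFFF


def pack_fnp(node_id: int, fault_code: int, seq: int) -> bytes:
    """Serialize a Fault Notification Packet into exactly 6 bytes."""
    header = _HEADER.pack(node_id, fault_code, seq)
    return header + struct.pack(">H", _checksum16(header))


def unpack_fnp(packet: bytes):
    """Parse a 6-byte FNP, raising ValueError if the checksum does not match."""
    node_id, fault_code, seq, crc = _PACKET.unpack(packet)
    if crc != _checksum16(packet[:_HEADER.size]):
        raise ValueError("corrupted FNP")
    return node_id, fault_code, seq


packet = pack_fnp(node_id=17, fault_code=2, seq=5)
```

A receiver would call `unpack_fnp(packet)` and then decide locally how to react, which matches the notification-not-command role the FNP plays in the protocol.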
The sequence from FNP broadcast to validated rerouted delivery is fully self-contained within the mesh. To make this concrete, Table 1 shows a representative event trace from a single node failure captured during simulation. The failing node degrades gradually over several seconds before its anomaly score crosses the detection threshold $\tau$, at which point the local loop triggers the FNP without any external prompt. The full recovery completes in under 4 s, with no gateway involvement at any step.
The DHP has a few known limitations worth acknowledging. If multiple nodes fail simultaneously, fault broadcasts can collide and neighboring nodes may receive conflicting routing updates, potentially causing transient loops before convergence. If a node fails silently without having time to broadcast its fault, nearby nodes rely solely on link-quality degradation signals to infer the failure, which can introduce a short detection delay. These are inherent trade-offs of fully distributed operation and are discussed further in Section 6.
3.3. Routing Reconfiguration Engine
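Routing in EdgeRescue is based on an adaptive Distance Vector model enhanced with real-time link quality and node health metrics. Each node maintains a cost function:
$$C_{ij} = \alpha \cdot \frac{1}{\mathrm{RSSI}_{ij}} + \beta \cdot \frac{1}{E_j} + \gamma \cdot A_j$$
where:
$\mathrm{RSSI}_{ij}$ is the signal strength from node $i$ to $j$;
$E_j$ is the remaining energy of node $j$;
$A_j$ is the anomaly score of node $j$;
$\alpha$, $\beta$, $\gamma$ are tunable weights.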
This cost function discourages routing through unhealthy or energy-depleted nodes.
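The behavior described above can be sketched in a few lines. The exact functional form is an assumption here: the sketch uses a weighted sum in which weaker links and lower residual energy raise the cost through inverse terms, and the anomaly score adds to it directly. The RSSI normalization range, the default weights, and the name `link_cost` are hypothetical choices, not values from the paper.

```python
def link_cost(rssi_dbm, energy_frac, anomaly,
              alpha=0.4, beta=0.3, gamma=0.3,
              rssi_floor=-100.0, rssi_ceil=-40.0, eps=1e-3):
    """Composite cost of forwarding through neighbor j (illustrative form only).

    rssi_dbm    : link signal strength in dBm (assumed range floor..ceil)
    energy_frac : residual energy of j as a fraction of capacity, 0..1
    anomaly     : anomaly score of j, 0..1
    """
    # Map RSSI to a link quality in (0, 1] so that the inverse term is
    # well defined; raw dBm values are negative.
    quality = (rssi_dbm - rssi_floor) / (rssi_ceil - rssi_floor)
    quality = min(1.0, max(eps, quality))
    energy = min(1.0, max(eps, energy_frac))
    return alpha / quality + beta / energy + gamma * anomaly


healthy = link_cost(rssi_dbm=-60.0, energy_frac=0.9, anomaly=0.05)
depleted = link_cost(rssi_dbm=-60.0, energy_frac=0.1, anomaly=0.05)
suspect = link_cost(rssi_dbm=-60.0, energy_frac=0.9, anomaly=0.95)
```

With these placeholder weights, both the depleted and the suspect neighbor cost more than the healthy one, so a minimum-cost route selection steers traffic away from them. Setting `beta=0` removes the energy term, which corresponds to a routing policy that ignores battery state.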
Figure 2 demonstrates this distributed self-healing loop.
3.4. Edge Scalability and Robustness
The modularity of EdgeRescue ensures seamless scalability. New nodes automatically integrate by broadcasting their health profile and participating in the DHP loop. The decentralized approach ensures that no single node becomes a point of failure. The architecture also supports heterogeneity, enabling integration with nodes that use different sensing capabilities or power sources.
The current design targets static mesh deployments, which represent the primary and most common deployment context for low-power IoT infrastructure. Extension to mobile scenarios and fog-level multi-tier architectures is outside the scope of this work and is left for future investigation.
EdgeRescue achieves a practical balance between on-node intelligence and resource constraints, offering a scalable and resilient platform for IoT mesh deployments in energy-sensitive environments.
4. Proposed Methodology
EdgeRescue integrates embedded anomaly detection, distributed fault localization, and energy-aware routing to enable fully decentralized self-healing in IoT mesh networks. This section formally presents the methodology governing its operation, including the mathematical modeling of fault detection, routing adaptation, and node collaboration.
4.1. Feature Vector Construction
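Each node $i$ maintains a real-time monitoring vector $\mathbf{x}_i$ composed of the following telemetry features:
$$\mathbf{x}_i = [V_i, \mathrm{RSSI}_i, \Delta t_i, Q_i] \quad (1)$$
where:
$V_i$: instantaneous voltage level.
$\mathrm{RSSI}_i$: received signal strength indicator.
$\Delta t_i$: packet inter-arrival time.
$Q_i$: transmission queue length.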
These signals are normalized and temporally windowed into short sequences that are input to the anomaly detection model.
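A minimal sketch of this preprocessing step follows. The paper does not give the normalization scheme or window length, so min-max scaling with clipping, the value ranges in `FEATURE_RANGES`, the default window of 16 samples, and the class name `TelemetryWindow` are all assumptions chosen for illustration.

```python
from collections import deque

# Assumed physical ranges for each feature (hypothetical, for scaling only).
FEATURE_RANGES = {
    "voltage": (2.0, 3.3),         # volts
    "rssi": (-100.0, -40.0),       # dBm
    "interarrival": (0.0, 10.0),   # seconds
    "queue": (0.0, 32.0),          # packets
}
FEATURE_ORDER = ("voltage", "rssi", "interarrival", "queue")


def normalize(sample: dict) -> list:
    """Min-max scale one telemetry sample to [0, 1], clipping out-of-range values."""
    row = []
    for name in FEATURE_ORDER:
        lo, hi = FEATURE_RANGES[name]
        row.append(min(1.0, max(0.0, (sample[name] - lo) / (hi - lo))))
    return row


class TelemetryWindow:
    """Fixed-length sliding window of normalized samples (oldest dropped first)."""

    def __init__(self, length: int = 16):
        self._buf = deque(maxlen=length)

    def push(self, sample: dict) -> None:
        self._buf.append(normalize(sample))

    def ready(self) -> bool:
        return len(self._buf) == self._buf.maxlen

    def as_sequence(self) -> list:
        """Return the window as a T x 4 list, ready for the detector."""
        return [list(row) for row in self._buf]
```

A bounded `deque` keeps memory use constant regardless of uptime, which matters on a node with tens of kilobytes of SRAM.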
4.2. Lightweight Anomaly Detection
A compact 1D-CNN model $f_\theta$ is trained offline to learn typical behavior sequences. At runtime, each node performs:
$$A_i = f_\theta(\mathbf{x}_i) \quad (2)$$
where $A_i$ is an anomaly score in $[0, 1]$. If $A_i > \tau$, the node is flagged as anomalous. The model is trained using binary cross-entropy loss on labeled sequences.
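To illustrate the inference path, the sketch below runs a toy 1D-CNN forward pass in plain Python: one convolution over the time axis with ReLU, global average pooling, and a sigmoid output, plus the binary cross-entropy used in training. The layer sizes, the hand-set weights, and the feature order (voltage, RSSI, inter-arrival time, queue length) are placeholders assumed for the example; they are not the trained model from the paper.

```python
import math


def conv1d_relu(seq, kernels, biases):
    """Valid 1D convolution over time followed by ReLU.

    seq:     T x C_in list of normalized feature rows
    kernels: C_out x K x C_in weights
    """
    k = len(kernels[0])
    out = []
    for t in range(len(seq) - k + 1):
        row = []
        for kern, bias in zip(kernels, biases):
            acc = bias
            for dt in range(k):
                acc += sum(w * x for w, x in zip(kern[dt], seq[t + dt]))
            row.append(max(0.0, acc))
        out.append(row)
    return out


def anomaly_score(seq, kernels, biases, dense_w, dense_b):
    """Conv -> global average pooling -> dense -> sigmoid, giving a score in (0, 1)."""
    feats = conv1d_relu(seq, kernels, biases)
    pooled = [sum(col) / len(feats) for col in zip(*feats)]
    logit = dense_b + sum(w * p for w, p in zip(dense_w, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


def bce(label, score, eps=1e-7):
    """Binary cross-entropy for one labeled sequence."""
    s = min(1.0 - eps, max(eps, score))
    return -(label * math.log(s) + (1 - label) * math.log(1.0 - s))


# Hand-set placeholder weights: channel 0 reacts to low voltage,
# channel 1 reacts to long queues.
KERNELS = [
    [[-1 / 3, 0.0, 0.0, 0.0]] * 3,
    [[0.0, 0.0, 0.0, 1 / 3]] * 3,
]
BIASES = [1.0, 0.0]
DENSE_W, DENSE_B = [4.0, 4.0], -2.0

healthy_seq = [[0.9, 0.8, 0.2, 0.1]] * 8
degraded_seq = [[0.2, 0.4, 0.6, 0.9]] * 8
score_healthy = anomaly_score(healthy_seq, KERNELS, BIASES, DENSE_W, DENSE_B)
score_degraded = anomaly_score(degraded_seq, KERNELS, BIASES, DENSE_W, DENSE_B)
```

With a threshold of 0.5, the healthy sequence stays below it and the degraded one exceeds it. A deployed model would replace the hand-set weights with trained, quantized parameters.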
4.3. Fault Signaling and Distributed Healing
Once a node is flagged as faulty, it broadcasts a Fault Notification Packet (FNP) to its 1-hop neighbors. The neighbors then execute the following healing protocol:
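(1) Update Routing Table: Nodes identify if their routing paths pass through the faulty node $n_f$.
(2) Compute New Path: For each destination $d$, the new path is computed using:
$$P_d^{*} = \arg\min_{P \in \mathcal{P}_d} \sum_{(i,j) \in P} C_{ij} \quad (3)$$
where $\mathcal{P}_d$ is the set of candidate paths to $d$ that avoid $n_f$ and $C_{ij}$ is the composite cost:
$$C_{ij} = \alpha \cdot \frac{1}{\mathrm{RSSI}_{ij}} + \beta \cdot \frac{1}{E_j} + \gamma \cdot A_j \quad (4)$$
with $E_j$ the residual energy and $A_j$ the anomaly score of node $j$.
(3) Path Validation: New paths are validated using heartbeat ACKs. If the loss rate exceeds a threshold, rerouting is retried.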
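The path recomputation in Step 2 amounts to a minimum-cost search that skips the flagged node. The sketch below uses Dijkstra's algorithm over a small in-memory graph as a stand-in; EdgeRescue itself is described as a distance-vector scheme in which each node only sees its own routing state, so this centralized search is a simplification for illustration. The function name `recompute_path` and the example link costs are hypothetical.

```python
import heapq


def recompute_path(graph, src, dst, faulty):
    """Return (total_cost, path) for the cheapest src->dst path avoiding `faulty`.

    graph: {node: {neighbor: link_cost}} with non-negative costs.
    Returns (inf, []) when no such path exists.
    """
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return cost, path[::-1]
        if cost > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, c in graph.get(u, {}).items():
            if v == faulty:
                continue  # exclude the node named in the FNP
            new_cost = cost + c
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                prev[v] = u
                heapq.heappush(heap, (new_cost, v))
    return float("inf"), []


# Example mesh: A reaches gateway D via B (cheap) or via C (more expensive).
mesh = {"A": {"B": 1.0, "C": 2.0}, "B": {"D": 1.0}, "C": {"D": 2.0}, "D": {}}
before = recompute_path(mesh, "A", "D", faulty=None)
after = recompute_path(mesh, "A", "D", faulty="B")
```

Before the fault the route is A, B, D; once B is flagged, traffic moves to A, C, D at a higher total cost. The ACK-based validation in Step 3 would then confirm or reject the new route.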
4.4. Energy-Aware Decision Scaling
To avoid overloading a single node during recovery, the routing cost incorporates real-time energy $E_j$, updated every $T$ seconds. Nodes use a weighted decision buffer to prevent flapping:
$$\bar{E}_j(t) = \lambda \, \bar{E}_j(t-1) + (1 - \lambda) \, E_j(t) \quad (5)$$
where $\lambda$ is a forgetting factor that balances new measurements with stability.
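One common way to realize such a buffer is an exponentially weighted moving average, sketched below. Whether the forgetting factor multiplies the previous estimate (as here) or the new measurement is an assumption, as are the default value 0.8 and the class name `SmoothedEnergy`.

```python
class SmoothedEnergy:
    """Exponentially smoothed residual-energy estimate for one neighbor.

    value <- lam * value + (1 - lam) * measurement
    A larger lam (assumed convention) favors stability over responsiveness.
    """

    def __init__(self, initial: float, lam: float = 0.8):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("lam must lie in [0, 1]")
        self.value = initial
        self.lam = lam

    def update(self, measurement: float) -> float:
        self.value = self.lam * self.value + (1.0 - self.lam) * measurement
        return self.value


est = SmoothedEnergy(initial=1.0, lam=0.8)
first = est.update(0.0)    # one abrupt low reading moves the estimate only partway
second = est.update(0.0)   # a persistent drop keeps pulling it down
```

Because a single outlier reading shifts the estimate by only a fraction of the gap, the routing cost changes gradually and routes are less likely to oscillate between two neighbors.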
4.5. Protocol Pseudocode
Algorithm 1: EdgeRescue Node Workflow
(1) Monitor telemetry and update $\mathbf{x}_i$
(2) If $A_i = f_\theta(\mathbf{x}_i) > \tau$, broadcast FNP
(3) Upon receiving FNP:
Update routing table by excluding node $n_f$;
Recompute paths using Equations (3) and (4);
Validate new path via ACK monitoring.
(4) Loop every $T$ seconds
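The workflow above can be expressed as two small event handlers, one for the periodic local check and one for an incoming notification. This is a sketch under stated assumptions: the callables `read_telemetry`, `score_fn`, `broadcast`, and `recompute` are hypothetical hooks standing in for the sensor driver, the on-device model, the radio, and the routing engine, and ACK-based validation is left out for brevity.

```python
def node_step(node_id, read_telemetry, score_fn, broadcast, tau=0.5):
    """One iteration of the local loop: sample, score, and notify if anomalous."""
    features = read_telemetry()
    score = score_fn(features)
    if score > tau:
        # One-directional notification; neighbors decide how to react.
        broadcast({"node": node_id, "fault": "anomaly"})
    return score


def on_fnp(routing_table, fnp, recompute):
    """Handle a received notification using only local routing state.

    routing_table: {destination: next_hop}
    recompute:     callable (destination, faulty_node) -> new next hop
    Returns the list of destinations whose next hop was replaced.
    """
    faulty = fnp["node"]
    affected = [dest for dest, hop in routing_table.items() if hop == faulty]
    for dest in affected:
        routing_table[dest] = recompute(dest, faulty)
    return affected


# Usage with stub hooks: node 7 detects a fault, a neighbor reroutes around it.
outbox = []
node_step(7, lambda: [0.2, 0.4, 0.6, 0.9], lambda x: 0.9, outbox.append)
table = {"gateway": 7, "n3": 4}
changed = on_fnp(table, outbox[0], lambda dest, faulty: 5)
```

On a real node, `node_step` would be driven by a periodic timer and `on_fnp` by the radio receive callback; neither handler waits on a gateway or any other coordinator.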
4.6. Model Deployment and Resource Profile
The CNN-based detector is quantized to int8 using post-training quantization. Model footprint: 92 KB of flash and 24 KB of SRAM during inference, with a single inference completing in 1.2 ms on an ARM Cortex-M4F (Table 4).
The method ensures compatibility with TinyML platforms like TensorFlow Lite Micro and CMSIS-NN and integrates easily into real-time operating systems like Contiki-NG and Zephyr RTOS.
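For readers unfamiliar with int8 quantization, the sketch below shows the generic affine mapping between float weights and 8-bit integers. It illustrates the arithmetic only and is not the TensorFlow Lite converter or any specific toolchain API; the function names and the per-tensor min/max calibration are assumptions.

```python
def quantize_int8(weights):
    """Affine per-tensor quantization of float weights to the int8 range.

    Returns (quantized values, scale, zero_point) such that
    w is approximately (q - zero_point) * scale.
    """
    lo = min(min(weights), 0.0)   # the representable range must include 0
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 255.0 or 1.0   # avoid a zero scale for all-zero tensors
    zero_point = int(round(-128 - lo / scale))
    quantized = [
        max(-128, min(127, int(round(w / scale)) + zero_point))
        for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map int8 values back to approximate float weights."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.5, 0.0, 0.25, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
```

Each weight shrinks from a 4-byte float to a single byte, at the price of a rounding error bounded by roughly one quantization step (`scale`).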
5. Experimental Results
To evaluate the effectiveness of EdgeRescue in realistic IoT environments, we conducted large-scale simulations using the Cooja simulator within the Contiki-NG operating system. The objective was to assess the framework’s capability to proactively detect faults, recover network functionality with minimal latency, and maintain communication efficiency in energy-constrained mesh networks.
5.1. Simulation Setup
Our simulation consisted of 100 randomly distributed IoT nodes within a m grid. Each node emulated real-world IoT constraints, including a battery capacity of 2000 mAh, a transmission range of 30 m, and a variable packet transmission interval ranging from 1 to 5 s. The network was configured using the RPL routing protocol, with DIO messages transmitted every 4 s. To examine the robustness of our approach, we introduced three types of faults: battery depletion at select nodes, congestion-induced packet queue overflow, and intermittent signal degradation due to simulated RSSI drops. Each experiment was repeated across 10 independent runs with different random node placements and fault injection seeds, and results are reported as mean values with standard deviation.
Fault injection was performed by scripting node shutdown events at predetermined simulation timestamps within Cooja. The anomaly detection threshold was set to 0.5 based on validation performance and was kept fixed across all runs.
To verify that EdgeRescue operates autonomously without any central coordination, we ran an additional gateway-isolation experiment using the same 100-node setup. In this variant, the gateway node was disconnected before any fault injection, so no sink-level routing existed at any point during the run. EdgeRescue nodes were left to detect, broadcast, and reroute entirely within the mesh. RPL was included as a reference baseline under the same condition. We also ran an ablation study across three stripped-down variants of EdgeRescue to measure the contribution of each individual component: (i) EdgeRescue without the Anomaly Detector (DHP triggered only by link-loss timeout), (ii) EdgeRescue without energy-aware routing (energy weight set to zero, $\beta = 0$, in the routing cost $C_{ij}$), and (iii) EdgeRescue without ACK-based path validation (paths accepted after first successful packet). These variants isolate the effect of each design choice on the final performance numbers.
We compared EdgeRescue against four baseline methods: the standard RPL protocol without self-healing capabilities [10]; LightNet, a decision tree-based anomaly detection model optimized for edge devices [13]; HealNet, an edge-enabled recovery system with centralized learning modules [23]; and UDAD [24], a recently proposed unsupervised distributed anomaly detection framework for IoT edge networks.
These four baselines were selected to cover the full design space that EdgeRescue sits within. RPL represents the standard routing-only baseline with no fault intelligence. LightNet covers lightweight on-device detection but has no recovery mechanism. HealNet adds recovery but centralizes the learning component off-node. UDAD brings distributed ML-based detection but stops short of coupling it with energy-aware rerouting. No existing system in the current literature, to our knowledge, jointly addresses on-device anomaly detection and energy-aware distributed mesh recovery as a single node-level framework, which is the exact gap EdgeRescue targets. The four baselines together cover each individual aspect of this problem and form the closest available set of analogs for comparison.
5.2. Evaluation Metrics
To evaluate system behavior, we employed six primary metrics. The first was Packet Delivery Ratio (PDR), defined as the percentage of successfully delivered packets over total transmissions. Second, we measured the Mean Recovery Time (MRT), calculated as the average time interval between fault detection and successful network stabilization. Third, we monitored the Average Node Energy Drain, capturing the average mAh consumed per node during operation. Fourth, we recorded the False Positive Rate (FPR), representing the proportion of incorrectly flagged non-faulty nodes as anomalies. Fifth, to better characterize detection quality, we report Precision, Recall, and F1-score of the anomaly detector against ground-truth fault labels. Sixth, we measured the average time from anomaly flag to first successful rerouted packet delivery, which we refer to as Healing Response Time (HRT), as a finer-grained complement to MRT.
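The detection and delivery metrics follow standard definitions, sketched below for clarity. The function names and the confusion-matrix counts in the example are illustrative and are not data from the experiments.

```python
def packet_delivery_ratio(delivered: int, sent: int) -> float:
    """PDR as a percentage of transmitted packets that reached the sink."""
    return 100.0 * delivered / sent


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, F1 and false positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "fpr": fp / (fp + tn),   # share of healthy nodes wrongly flagged
    }


# Illustrative counts only (not experimental data).
example = detection_metrics(tp=90, fp=10, tn=890, fn=10)
```

As a consistency check on the definitions, a precision of 96.1% and a recall of 93.8% give `f1_score(0.961, 0.938)` of about 0.949, which agrees with the F1-score reported for EdgeRescue in Section 5.3.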
5.3. Quantitative Comparison and Analysis
The quantitative results are presented in Table 2. EdgeRescue consistently outperformed all baseline methods across all metrics. Specifically, it achieved a Packet Delivery Ratio of 94.6% ± 1.2%, which is substantially higher than the 81.4% achieved by RPL, 86.3% by LightNet, and 89.7% by HealNet. This improvement highlights the effectiveness of proactive anomaly detection and immediate local rerouting, and is shown in Figure 3, which compares the PDR of all methods.
Table 3 reports the anomaly detection performance of EdgeRescue alongside the detection-capable baselines. EdgeRescue achieved a Precision of 96.1%, Recall of 93.8%, and F1-score of 94.9%, with a mean Healing Response Time of 1.6 s from anomaly flag to first successfully rerouted packet. LightNet, despite reasonable recall, showed higher false positives, while HealNet’s centralized design introduced additional latency before rerouting began.
Table 4 summarizes the per-method computational overhead. EdgeRescue’s on-device model occupies 92 KB of flash and requires 24 KB SRAM during inference, with a single inference completing in 1.2 ms on an ARM Cortex-M4F. This is competitive with LightNet’s simpler decision-tree model and significantly lighter than HealNet’s centralized module, which runs off-device.
Regarding recovery speed, EdgeRescue showed a Mean Recovery Time of just 3.9 s, markedly faster than LightNet’s 9.1 s and HealNet’s 6.2 s. This rapid fault response is shown in Figure 4, reflecting EdgeRescue’s efficient local decision-making mechanism.
The average energy drain per node was also minimized in EdgeRescue, with each node consuming only 63.5 mAh on average, compared to 78.2 mAh for RPL and 72.4 mAh for LightNet. The energy saving comes from localized anomaly processing and reduced message overhead, as illustrated in Figure 5.
In terms of False Positive Rate, EdgeRescue achieved only 2.3% false alarms, lower than both LightNet (6.7%) and HealNet (4.1%). Figure 6 confirms this, and the gap is explained by the temporal pattern-learning capability of the 1D-CNN: threshold-based detectors like LightNet react to instantaneous signal drops, which causes more false triggers in environments where RSSI fluctuates naturally.
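To put the reported figures on a common scale, the short calculation below converts them into relative reductions against individual baselines. It uses only the numbers quoted in this section; the helper name `relative_reduction` is ours.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


reductions = {
    "MRT vs LightNet": relative_reduction(9.1, 3.9),
    "MRT vs HealNet": relative_reduction(6.2, 3.9),
    "Energy vs RPL": relative_reduction(78.2, 63.5),
    "Energy vs LightNet": relative_reduction(72.4, 63.5),
    "FPR vs LightNet": relative_reduction(6.7, 2.3),
    "FPR vs HealNet": relative_reduction(4.1, 2.3),
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower")
```

The largest relative gains are against LightNet (about 57% lower recovery time and about 66% fewer false positives), while the margins over HealNet are smaller (about 37% and 44%, respectively).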
Table 5 presents the ablation study results. Removing the Anomaly Detector drops PDR by 6.8 percentage points and raises MRT to 8.1 s, close to LightNet’s performance, since the DHP then relies only on link-loss timeout to detect failures, which is slower than CNN-based early flagging. Disabling energy-aware routing (energy weight $\beta = 0$) keeps PDR relatively stable but raises energy drain by 9.4 mAh, as the rerouting ignores node energy state and repeatedly burdens the same high-connectivity nodes. Removing ACK-based path validation increases FPR slightly and raises MRT, because some newly selected paths carry residual congestion that is only caught during validation. The full EdgeRescue configuration consistently outperforms all three stripped variants, confirming that each component contributes independently to the final result.
Table 6 reports how the anomaly detection threshold $\tau$ affects the precision–recall trade-off. As $\tau$ decreases from 0.7 to 0.3, recall rises from 88.4% to 97.1% but FPR also rises sharply, from 0.9% to 7.6%. The value $\tau = 0.5$ sits at a practical operating point where recall stays above 93% while FPR remains below 2.5%. This threshold may need adjustment in deployments where either missed detections or false alarms carry higher costs: for instance, in safety-critical control applications, a lower $\tau$ is preferable, while in battery-constrained monitoring, a higher $\tau$ reduces unnecessary rerouting overhead.
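The mechanics of such a sweep can be sketched as follows. The scores and labels are a tiny synthetic set invented for the example, so the resulting rates illustrate the direction of the trade-off only and do not reproduce Table 6.

```python
def sweep_thresholds(scores, labels, thresholds):
    """Recall and false positive rate of `score > tau` for each tau."""
    positives = sum(labels)
    negatives = len(labels) - positives
    rows = []
    for tau in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > tau)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > tau)
        rows.append({"tau": tau, "recall": tp / positives, "fpr": fp / negatives})
    return rows


# Synthetic anomaly scores: first five from healthy nodes, last five from faulty ones.
scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.45, 0.55, 0.65, 0.80, 0.90]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
table = sweep_thresholds(scores, labels, thresholds=[0.3, 0.5, 0.7])
```

On this toy data, lowering the threshold from 0.7 to 0.3 raises recall from 0.4 to 1.0 while the false positive rate climbs from 0.0 to 0.6, the same qualitative pattern described above.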
Table 7 shows the results of the gateway-isolation experiment. With the gateway disconnected, RPL’s PDR drops to 43.2% and MRT becomes unmeasurable—the protocol stalls because it depends on the sink node for route computation under failure. EdgeRescue, by contrast, holds a PDR of 91.3% ± 1.8% and an MRT of 4.6 ± 0.9 s—a modest degradation from the connected case (94.6% and 3.9 s) that comes purely from the loss of one routing destination, not from any loss of healing capability. Each node detects faults, broadcasts FNPs, and reroutes paths exactly as it does in the connected case, with no change in local behavior. This confirms that the autonomous healing loop functions entirely within the mesh and does not rely on gateway presence at any stage.
Across all experiments—standard fault injection, gateway isolation, ablation, and threshold sweep—EdgeRescue’s results are consistent. The gains do not come from a single lucky design choice but from all three components working together: early CNN-based detection, energy-aware routing, and ACK-validated path confirmation. The simulation environment (Cooja/Contiki-NG) is a well-established platform in IoT research and the results reflect realistic node constraints. Physical hardware deployment on platforms such as ESP32 or TelosB would validate timing and energy numbers under real radio conditions, and this forms a natural extension of the present work.
6. Discussion
The experimental results presented in the previous section underscore the robustness and efficiency of the EdgeRescue framework in managing faults within resource-constrained IoT mesh environments. Notably, the significant improvement in Packet Delivery Ratio (PDR) and reduction in Mean Recovery Time (MRT) demonstrate that EdgeRescue enables proactive detection and swift mitigation of network anomalies. This is especially critical in real-world deployments such as environmental monitoring, industrial automation, and smart agriculture, where downtime or delayed recovery can result in data loss or operational failure.
Compared to baseline methods, EdgeRescue offers two key innovations that explain its superior performance. First, its use of lightweight 1D-CNN models on each node ensures real-time anomaly detection without requiring cloud or gateway intervention. This approach effectively decentralizes intelligence and eliminates the bottlenecks associated with centralized diagnostics. Second, the framework’s energy-aware routing reconfiguration ensures that rerouting decisions are made not only based on connectivity but also with consideration of the energy status and reliability of each node. This prevents cascading failures and extends the operational life of the network.
It is also worth noting under what conditions EdgeRescue’s advantage is most pronounced. The gains in PDR and MRT are largest when faults are isolated, that is, when one or a small number of nodes degrade gradually rather than failing all at once. In such cases, the 1D-CNN has sufficient time to build up an anomaly score before full failure, and the DHP can reroute cleanly. When multiple nodes fail simultaneously or when the network is already heavily loaded, the advantage over baselines narrows, as discussed in Section 3. EdgeRescue also outperforms statistical baselines more clearly in dynamic signal environments where RSSI fluctuates, settings where threshold-based detectors like LightNet tend to produce more false positives.
Overall, EdgeRescue’s modularity, low computational overhead, and distributed intelligence present a promising direction in how self-healing is implemented in low-power IoT mesh networks. The framework is well-suited for integration with open-source operating systems such as Contiki-NG, Zephyr, and RIOT, making it a reasonable candidate for deployment across a variety of hardware platforms.
Despite the encouraging results, several limitations of the current work should be acknowledged. First, the anomaly detection model is trained offline on a fixed dataset, which means it may not generalize well to fault patterns that differ significantly from those seen during training. Deploying the model on a new network type without retraining introduces a cold-start problem that is not yet addressed. Second, the detection threshold $\tau$ was fixed at 0.5 in the main experiments; in practice, different deployment environments may require different threshold values, and beyond the threshold sweep in Table 6 the sensitivity of results to this choice was not fully explored. Third, as noted in Section 3, silent node failures, where a node dies without broadcasting a fault, rely on indirect link-quality signals for detection, which can delay the healing response by a few seconds. Fourth, the current evaluation covers only static mesh topologies; mobile deployments and fog-level multi-tier architectures remain untested. Finally, all results are simulation-based, and while the Cooja/Contiki-NG environment is widely used in IoT research, physical hardware experiments on platforms such as ESP32 or TelosB are needed before strong deployment claims can be made.
7. Conclusions
In this paper, we introduced EdgeRescue—a novel, lightweight, AI-powered self-healing framework tailored for energy-constrained IoT mesh networks. Unlike traditional approaches that depend on centralized processing or heavy cloud analytics, EdgeRescue decentralizes anomaly detection and recovery by embedding intelligent modules directly within each node. The system achieves fault detection through compact CNN-based models and adapts routing paths in real time using a multi-criteria cost function incorporating energy, link quality, and anomaly scores.
Through simulation-based evaluation, we showed that EdgeRescue performs consistently well across packet delivery, recovery latency, energy efficiency, and detection precision compared to existing methods. By reducing Mean Recovery Time by up to 57% and False Positive Rate by more than 45% relative to prior baselines, the results support the value of localized, intelligent fault management in IoT mesh networks. These findings are encouraging, though hardware validation on physical platforms remains an important next step.
Future work will explore extending the architecture with unsupervised and online learning capabilities so the model can adapt to new fault patterns without retraining. We also plan to investigate healing stability under adversarial or non-cooperative node conditions, evaluate EdgeRescue on real hardware platforms such as TelosB, ESP32, and STM32-based motes, and extend the design to mobile mesh and fog-level multi-tier IoT architectures.
EdgeRescue represents a practical step toward self-healing in next-generation IoT systems, with potential applicability in critical infrastructure, industrial control, and remote monitoring environments.