4.1. From Robustness to Resilience to Antifragility
In IoT devices and CPS, three progressively advanced paradigms, robustness, resilience, and antifragility, guide the design and evaluation of trustworthy and dependable systems [29,30]. Understanding these concepts and their practical implications is essential for researchers and practitioners building next-generation IoT deployments.
Table 2 compares the three concepts, giving a definition, a typical metric, and a representative IoT example for each.
Robustness is the most fundamental property. As illustrated in Figure 2a, robustness refers to a system’s ability to endure disturbances or uncertainties without a significant decline in performance [31,32,33]. In simple terms, a robust IoT device or algorithm operates as intended as long as environmental changes, faults, or attacks fall within predetermined, manageable limits. For instance, a sensor fusion algorithm in a smart home may be designed to tolerate up to 10% packet loss before its estimation accuracy drops below acceptable levels. Robustness is typically static: if the perturbation remains within the designed safe region, the system output is largely unaffected [34].
Resilience extends this notion of trustworthiness: it concerns not only the ability to withstand disruptions but also the dynamic processes of recovery and adaptation [35,36,37]. When a disturbance, such as a cyberattack or sensor fault, causes a drop in performance, a resilient system can absorb the impact, adapt, and restore or even improve its performance [38]. Recovery may involve switching to backup modes, using redundant data sources, or activating adaptive controls. The resilience curve visualizes this process by plotting system performance over time. While a robust system’s curve remains flat during disruptions, as shown in Figure 2a, a resilient system’s curve may dip but gradually return to baseline, as shown in Figure 2b. The area between the baseline and the performance curve during the disruption and recovery period quantifies the loss of resilience; the speed and completeness of recovery distinguish a highly resilient system from one that is merely robust [39].
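The loss-of-resilience area described above is straightforward to compute from a logged performance trace; a minimal sketch in Python, using an illustrative trace rather than data from Figure 2:

```python
import numpy as np

def resilience_loss(perf, baseline=1.0, dt=1.0):
    """Area between the baseline and the performance curve over the
    disruption-and-recovery period (larger area = less resilient)."""
    return float(np.sum(np.clip(baseline - perf, 0.0, None)) * dt)

# Illustrative trace sampled once per second: flat baseline,
# a dip to 0.5 during disruption, then a gradual linear recovery.
perf = np.ones(60)
perf[10:20] = 0.5                        # absorb phase
perf[20:40] = np.linspace(0.5, 1.0, 20)  # recovery phase

loss = resilience_loss(perf)  # a robust (flat) curve would score 0.0
```

A perfectly robust system (flat curve at baseline) scores zero; slower or shallower recovery inflates the area.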
Antifragility is the most advanced and ambitious concept. Coined by Nassim Nicholas Taleb [40] and since adapted in the AI [41,42], IoT [43], and CPS [44] literature, it describes, as shown in Figure 2c, a system that not only survives stress and disruption but actually improves because of it [45]. For instance, an antifragile IoT anomaly detector could use real-world attack traffic as new training data, adjusting its detection thresholds and classifier boundaries to improve accuracy over time. Similarly, an antifragile CPS may leverage environmental disturbances, such as rare sensor failures, to identify new fault modes and strengthen its redundancy mechanisms, thereby expanding its operational range.
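As a toy illustration of this idea (hypothetical scores and thresholds, not a production detector), a score-based detector can fold confirmed attack traffic back into its own decision threshold:

```python
class AntifragileDetector:
    """Toy score-threshold detector that tightens its threshold
    using scores observed during confirmed attacks, so that
    previously missed (stealthy) attacks are caught afterwards."""
    def __init__(self, threshold=3.0):
        self.threshold = threshold
        self.attack_scores = []

    def flag(self, score):
        return score > self.threshold

    def learn_from_attack(self, scores):
        # Treat confirmed attack traffic as new training signal:
        # move the threshold below the weakest attack seen so far.
        self.attack_scores.extend(scores)
        self.threshold = min(self.threshold, 0.9 * min(self.attack_scores))

det = AntifragileDetector(threshold=3.0)
stealthy = [2.2, 2.5, 2.8]                       # below initial threshold
missed_before = [s for s in stealthy if not det.flag(s)]
det.learn_from_attack(stealthy)                  # the system improves from stress
caught_after = [s for s in stealthy if det.flag(s)]
```

The stressor (a stealthy attack) leaves the detector strictly better than before, which is the defining property of antifragility.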
In summary, while traditional IoT systems have focused on robustness (resisting known stressors), future systems must be designed for resilience (rapid recovery) and ultimately for antifragility, where each disruption becomes an opportunity to learn, adapt, and strengthen the system.
4.2. Stressors in IoT Systems
IoT systems operate at the intersection of the digital and physical worlds [46]. They sense, compute, and communicate under real-world constraints [47], meaning they face not only algorithmic attacks but also physical and environmental disruptions [48]. Sensors can drift or fail, wireless channels can become noisy or congested, batteries can drain unpredictably [49], and adversaries continuously adapt to system defenses. To reason about these diverse challenges, it is helpful to categorize stressors (factors that degrade reliability or performance) into three broad families: adversarial stressors, caused intentionally by malicious agents; environmental or operational stressors, arising naturally from the physical or logistical environment; and hybrid stressors, in which cyber and physical disruptions co-occur.
Figure 3 shows the various stressors affecting IoT systems, organized along three axes: Intent, Origin, and Severity. The x-axis distinguishes natural stressors, like sensor drift, from malicious ones, such as model poisoning. The y-axis indicates whether stressors originate in the physical environment or in the data and model layers, while the z-axis measures their severity. Blue data points represent environmental stressors, such as hardware degradation; red data points indicate adversarial stressors, such as poisoning; and green data points illustrate hybrid stressors, such as adversarial traffic and signal interference. This figure highlights the diverse threats based on their intent, origin, and impact on system performance. It serves as a framework for designing targeted resilience strategies, ensuring that defenses deployed at one layer (e.g., protocol-level rate limiting) complement those at another (e.g., adversarially robust learning).
Figure 4 shows a heatmap that categorizes IoT ecosystem stressors by their impact on system layers. The vertical axis shows stressors, such as packet loss, interference, and data poisoning, while the horizontal axis represents IoT stack layers, including hardware, protocols, and governance. Color intensity indicates impact level (0 = low, 3 = high), allowing for quick identification of critical vulnerabilities. For example, packet loss and energy scarcity have the greatest impact at the device and network layers, whereas data poisoning and inference time evasion are more prevalent at the learning and application layers. This visualization shows that resilience strategies should be tailored to the specific stress points within the stack, rather than using a one-size-fits-all defense approach.
Figure 5 shows the coupling strength between IoT layers, illustrating how disturbances in one layer can affect the entire system. Diagonal cells indicate dependencies within layers, while off-diagonal values highlight interactions between layers. The results indicate that while each layer retains primary sensitivity to its own dynamics, significant coupling exists between adjacent layers, such as between network and learning layers or between application and governance layers, where disruptions can cascade upward or downward. This coupling map shows that IoT resilience is fundamentally systemic: merely strengthening individual components is not enough unless it is aligned with a cross-layer design and adaptive feedback systems that work together to prevent cascading failures.
Figure 3.
Three-dimensional landscape of IoT stressors organized by intent, origin, and severity.
Figure 4.
Mapping of IoT stressors across the system stack.
Figure 5.
Cross-layer coupling strength among IoT resilience mechanisms.
Adversarial stressors target and exploit vulnerabilities in algorithms, communication protocols, or operational assumptions, compromising system and network integrity [50]. These stressors are intentional and adaptive, aiming to mislead, exhaust, or subvert IoT components, and they can be categorized into the following types.
Data poisoning: before or during training, an adversary corrupts a fraction p of the training set so that the learned model exhibits degraded or adversarially biased behavior at deployment [51,52]. Common tactics include label flipping, clean-label poisoning, backdoor or trigger attacks, and model or gradient poisoning. Label-flip poisoning changes ground-truth labels while leaving features intact [53]. Clean-label poisoning crafts seemingly legitimate examples that manipulate the decision boundary [54]. Backdoor or trigger attacks implant an uncommon pattern, leading the model to misclassify any input that includes it [55]. Model or gradient poisoning in federated environments can bias global aggregation (for instance, through Byzantine attacks). As shown in Figure 6, we simulate label-flip poisoning on ToN-IoT by flipping a fraction p of training labels uniformly at random while keeping the validation and test sets clean. Test accuracy and macro-F1 decrease smoothly as p increases. The small absolute drops indicate that random label flips primarily act as label noise, to which this tabular model is relatively robust, likely due to high class separability and redundancy in the features. This is a lower bound on risk: structured attacks, such as class-conditional or rare-class targeting, clean-label poisoning, and backdoor triggers, can cause larger errors at lower values of p, and the effects may amplify in federated training without robust aggregation. Attack goals range from availability (overall error inflation) to targeted or class-specific failures on chosen classes or trigger patterns. Mitigations include distributional validation and data filtering (outlier and influence diagnostics), robust training losses and regularization, differential clipping or noise in FL, robust aggregation (trimmed mean, coordinate-wise median, Krum), holdout audits with canaries, and cryptographic or signed provenance logs to trace data lineage [51,52,56].
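The label-flip setup can be reproduced in miniature; a hedged sketch on synthetic, well-separated two-class data (a stand-in for ToN-IoT features, not the actual experiment) with a nearest-centroid classifier:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, well-separated two-class data (stand-in for real traffic features).
Xtr = np.vstack([rng.normal(0, 1, (500, 4)), rng.normal(4, 1, (500, 4))])
ytr = np.array([0] * 500 + [1] * 500)
Xte = np.vstack([rng.normal(0, 1, (200, 4)), rng.normal(4, 1, (200, 4))])
yte = np.array([0] * 200 + [1] * 200)

def flip_labels(y, p, rng):
    """Flip a fraction p of training labels uniformly at random."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(p * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

def nearest_centroid_acc(X, y, Xte, yte):
    """Fit a nearest-centroid classifier and return test accuracy."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1)
            < np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float((pred == yte).mean())

accs = {p: nearest_centroid_acc(Xtr, flip_labels(ytr, p, rng), Xte, yte)
        for p in (0.0, 0.1, 0.3)}
# Accuracy degrades only mildly: uniform random flips act like label noise.
```

On highly separable features, uniform flips shift both centroids symmetrically, so the decision boundary barely moves; structured (class-conditional) flips would not enjoy this cancellation.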
Figure 6.
Simulated label flip poisoning on ToN-IoT by flipping a fraction of training labels.
Evasion at inference: once models are deployed, an adversary can add carefully tuned, norm-bounded noise to inputs so that the perturbations remain visually or statistically imperceptible but still induce misclassification [57]. Representative attacks include the fast gradient sign method (FGSM) [58], the basic iterative method (BIM) [59], projected gradient descent (PGD) [60], the momentum iterative method (MIM) [61], and DeepFool [62]. FGSM takes a single step that pushes each input feature in the direction that most increases the loss (the sign of the gradient), with step size ε, yielding a small ℓ∞-bounded change that can already flip the prediction. BIM repeats FGSM in many small steps (step size α), clipping after each step to keep the total perturbation within the ℓ∞ ball of radius ε; this iterative refinement typically finds stronger adversarial examples than a single step. PGD further strengthens BIM by starting from a random point inside the ε-ball and then iterating gradient steps with projection back to the ball; random restarts help avoid weak local optima, which is why PGD is a widely used (strong first-order) baseline [63]. MIM also iterates, but accumulates a momentum term (an exponential average of recent gradients) to stabilize the update direction; this often improves transferability, making the crafted examples more effective even on unseen models [64]. DeepFool approximates the classifier’s decision boundary and iteratively moves the input in the smallest direction that crosses that boundary, aiming for a near-minimal ℓ2 change; unlike the ℓ∞-bounded methods above, it does not fix a budget in advance but adapts the perturbation to reach misclassification with minimal effort [65]. To illustrate the effect of evasion-at-inference attacks, we experiment on the ToN-IoT dataset and control the maximum perturbation size by a budget ε, which bounds the per-feature deviation under the chosen norm (typically ℓ∞).
Figure 7 shows the test accuracy under ℓ∞ evasion as the perturbation budget ε increases. Relative to the clean baseline accuracy of 0.9996, FGSM declines gradually, reaching 0.299 at the largest budget tested, while PGD and BIM collapse rapidly (accuracy 0.006 and 0.170, respectively, at small budgets). MIM degrades slowest among the iterative methods (0.962, 0.923, and 0.656 at successively larger budgets). The shaded band denotes our imperceptible regime, in which iterative attacks already cause large drops (e.g., PGD 0.273 and BIM 0.288). Even at small budgets within this imperceptible range, iterative attacks such as PGD and BIM can substantially degrade accuracy, while MIM tends to degrade more slowly on our model; DeepFool, which does not use an explicit ε, instead adapts its steps to cross the nearest decision boundary. As a representative example of evasion at inference, a small modulation change in a wireless packet could bypass an intrusion detection model. To counter such attacks, it is valuable to integrate one or more of the following techniques: adversarial training, randomized smoothing, and cross-modal input consistency checks.
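To make the mechanics concrete, the FGSM step and its BIM-style iteration can be sketched against a toy logistic-regression victim (hypothetical weights and input, not the surveyed models):

```python
import numpy as np

# Toy logistic-regression victim: p(y=1|x) = sigmoid(w.x + b).
w = np.array([2.0, -1.0, 0.5])
b = -0.2

def predict(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(x, y, eps):
    """Single-step FGSM: move each feature by eps in the sign of the
    loss gradient; for cross-entropy the gradient w.r.t. x is (p - y) w."""
    grad = (predict(x) - y) * w
    return x + eps * np.sign(grad)

def bim(x, y, eps, alpha=0.05, steps=10):
    """BIM: iterate small FGSM steps of size alpha, clipping back into
    the l_inf ball of radius eps after each step (PGD additionally
    starts from a random point inside the ball)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign((predict(x_adv) - y) * w)
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

x, y = np.array([0.5, 0.1, 0.3]), 1        # clean input, true label 1
x_fgsm, x_bim = fgsm(x, y, eps=0.3), bim(x, y, eps=0.3)
# Both perturbations stay inside the eps-ball yet flip the prediction.
```

Both attacks stay within the declared ℓ∞ budget while pushing the victim's confidence across the decision threshold.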
Figure 7.
ToN-IoT accuracy vs. perturbation budget for inference time evasion attacks.
Model extraction and inversion: in cloud or edge APIs, an adversary can iteratively query a deployed (black-box) model to extract a high-fidelity surrogate (recovering decision boundaries or even approximating parameters), or invert the model to reconstruct features of sensitive training records [66]. Leakage surfaces include top-1 labels, confidence scores (soft probabilities), and auxiliary signals (temperature-scaling artifacts, calibration curves), which together facilitate knowledge distillation and amplify privacy risks, potentially exposing patient attributes, user habits, or network signatures [67,68]. As shown in Figure 8, to illustrate model extraction, we train a high-accuracy teacher and simulate an attacker who fits a student on teacher-labeled queries under three disclosure settings: label only (hard top-1), soft probs (full confidences), and noisy soft (soft probs with Laplace noise). Across increasing query budgets, student accuracy rises monotonically toward the teacher’s yet remains below the dashed teacher baseline. Soft-prob disclosure yields the strongest extraction at high budgets, injecting calibrated noise slightly suppresses student performance with minimal utility loss for benign users, and label-only disclosure sits in between. The x-axis is log-scaled to emphasize early query efficiency: most of the attacker’s gains occur early in the query budget, underscoring the value of early throttling and score suppression in production APIs. Practical mitigations include API governance (authentication, rate or volume limits, burst throttling, per-class quotas), output minimization (label only, confidence truncation or quantization, randomized response on scores), noise mechanisms (Laplace or Gaussian noise on logits or probabilities), and privacy-preserving training (DP-SGD or post-training DP), complemented by extraction watermarking, server-side audit tests (monitoring agreement patterns against natural data), and adaptive blocking when query statistics deviate from benign usage [67,68].
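The query-budget effect is easy to reproduce in a toy setting; a sketch under stated assumptions (a hypothetical linear "victim" exposed as label-only, and a nearest-centroid surrogate, not the paper's teacher-student pipeline):

```python
import numpy as np

rng = np.random.default_rng(7)

def teacher(X):
    """Black-box victim: a hidden linear decision rule exposed only
    through hard top-1 labels (the 'label only' disclosure setting)."""
    return (X @ np.array([1.5, -2.0]) > 0.3).astype(int)

def fit_student(Xq, yq):
    """Attacker's surrogate: nearest centroid fit on teacher-labeled queries."""
    c0, c1 = Xq[yq == 0].mean(axis=0), Xq[yq == 1].mean(axis=0)
    return lambda X: (np.linalg.norm(X - c1, axis=1)
                      < np.linalg.norm(X - c0, axis=1)).astype(int)

Xeval = rng.normal(size=(2000, 2))        # held-out probe set
agreement = {}
for budget in (30, 100, 1000):
    Xq = rng.normal(size=(budget, 2))     # attacker's random queries
    student = fit_student(Xq, teacher(Xq))
    agreement[budget] = float((student(Xeval) == teacher(Xeval)).mean())
# Agreement with the victim grows with the query budget.
```

Even with label-only disclosure, a few hundred queries already yield a close surrogate, which is why rate limits and early throttling matter.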
Figure 8.
Model extraction using a teacher and a simulated attacker under three disclosures.
Protocol spoofing: beyond software-level API abuse, adversaries can impersonate endpoints by manipulating the RF channel itself. Common attack vectors include satellite deception such as GPS spoofing, link-layer replay attacks such as Roll-Jam, and waveform injections that imitate a device’s modulation [69,70,71]. In a typical replay attack, illustrated in Figure 9, an attacker jams the channel while recording a fob’s rolling code, so the vehicle’s receiver never decodes it; the attacker later replays the captured code within the receiver’s acceptance window, causing the vehicle to respond with an unlock or acknowledgment (ACK). Protocol spoofing can be mitigated by combining cross-layer defenses, desynchronization-resilient rolling-code updates, PHY hardening, spectrum-level defenses, and RF fingerprinting. Cross-layer defenses enforce cryptographic freshness with nonce-based challenge-response and strict single-use counters (no grace window for critical actions). Desynchronization-resilient rolling-code updates use session-bound keys and monotone counters with limited resynchronization attempts. Physical-layer security is enhanced by time-of-flight and multi-antenna angle-of-arrival measurements that constrain plausible emitter geometries. RF fingerprinting exploits device-specific imperfections (carrier frequency offset, I/Q imbalance, transient shape) and channel-state or timing features to reject cloned waveforms [72,73]. Spectrum-level defenses combine frequency-hopping spread spectrum (FHSS) or direct-sequence spread spectrum (DSSS) and adaptive carrier sensing with anomaly analytics on energy, inter-frame timing, and Doppler patterns. Operational controls (lockout or backoff after failed frames, localized rate limits, and out-of-band second factors such as proximity ultra-wideband (UWB) ranging) further shrink the attack surface while preserving usability [74].
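The acceptance-window weakness that Roll-Jam exploits, and the single-use counter that limits it, can be sketched as follows (a toy model; real systems derive codes cryptographically rather than from a plain counter):

```python
class RollingCodeReceiver:
    """Toy receiver: accepts any counter value in (last, last + window],
    then advances past it, so each code is single-use once seen."""
    def __init__(self, window=16):
        self.last = 0
        self.window = window

    def accept(self, counter):
        if self.last < counter <= self.last + self.window:
            self.last = counter          # window advances past this code
            return True
        return False

rx = RollingCodeReceiver(window=16)
captured = 1           # attacker jams the fob's press and records code 1,
                       # so the receiver never consumes it and `last` stays 0
replay_ok = rx.accept(captured)      # later replay falls inside the window
replay_again = rx.accept(captured)   # strict single-use: second replay fails
```

Because the jammed code was never consumed, the first replay succeeds; a nonce-based challenge-response (cryptographic freshness, as above) removes this replay window entirely.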
Denial-of-service/distributed DoS (DoS/DDoS): in IoT environments, attackers exploit resource limitations to overwhelm bandwidth at gateways or drain device resources (CPU, memory, battery), causing service interruptions or cascading failures [75,76,77]. Botnet-driven swarms of compromised endpoints (e.g., cameras or smart plugs) can generate high-rate floods or carefully timed bursts that defeat naive token buckets, overwhelm queueing buffers, and trigger retransmission storms, further reducing goodput. Let L denote the offered load (benign and malicious) and C the gateway capacity. In the benign regime, goodput G increases roughly with L. Under attack, however, queue overflows and packet drops cause G to collapse sharply once L exceeds a critical threshold, well before reaching the nominal capacity C. This behavior is reflected in the goodput-versus-offered-load curve in Figure 10: rate limiting delays the collapse of goodput; puzzles add slight latency but reduce bot amplification; and edge filtering maximizes performance beyond C by eliminating unnecessary traffic before it occupies limited buffer space. In practice, robust deployments combine the countermeasures below with lightweight anomaly scoring, short control loops for threshold tuning, and fail-open exceptions for safety-critical flows to avoid overblocking. Practical countermeasures span admission control and in-network enforcement, trading reactivity against collateral damage. Per-packet or per-flow rate limiting caps burstiness and bounds worst-case load, raising the collapse threshold while remaining lightweight to implement at edge routers [78]. Client puzzles, stateless and adjustable in difficulty, shift computational cost to suspected sources, limiting the impact of bot swarms on CPU usage and reducing unnecessary gateway processing; puzzle difficulty can be tuned to the observed queue occupancy to maintain quality of service for compliant devices [79]. In-network filtering at access gateways (e.g., prefix- or behavior-based filters, Bloom-filter aggregates, or programmable data-plane rules) removes malicious traffic near its ingress and prevents backpressure into constrained subnets, preserving goodput even when the offered load exceeds C [80].
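The goodput collapse behind the curve in Figure 10 can be captured with a toy analytical model (illustrative constants, not fitted to any measurement):

```python
C = 100.0       # gateway capacity (packets/s), an illustrative value
L_STAR = 80.0   # critical threshold where collapse begins under attack

def goodput(L, attacked=False):
    """Toy model: benign goodput saturates at capacity C; under attack,
    queue overflow and retransmission storms make goodput decay once
    the offered load L exceeds the critical threshold L_STAR < C."""
    if not attacked or L <= L_STAR:
        return min(L, C)
    return L_STAR * L_STAR / L   # congestion-collapse decay past L_STAR

benign = [goodput(L) for L in (40, 80, 120, 200)]
under_attack = [goodput(L, attacked=True) for L in (40, 80, 120, 200)]
# Benign goodput plateaus at C; attacked goodput collapses beyond L_STAR.
```

Rate limiting effectively raises L_STAR, while edge filtering reduces the malicious share of L before it reaches the bottleneck queue.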
Environmental and operational stressors originate from physical reality rather than malice [81]. Yet their cumulative effect can be equally damaging, especially in large-scale or remote deployments. These stressors can be categorized as follows.
Figure 9.
An illustration of link-layer replay attack in the form of Roll–Jam on rolling codes. (1) The attacker jams the channel while recording the fob’s rolling code, so the vehicle drops it. (2) The attacker later replays the rolling code, which is accepted if it falls within the receiver window. (3) The vehicle acknowledges and executes the action (e.g., unlock).
Figure 10.
Goodput vs. offered load under IoT DoS/DDoS.
Packet loss and desynchronization: wireless IoT connections, especially within low-power wide-area networks (LPWANs), often encounter burst losses caused by interference, duty-cycle limitations, and synchronization drift [82,83]. In rolling-code systems, missing even one frame can lead to a permanent authentication failure [84,85]. To mitigate these problems, it is advantageous to use self-synchronizing codes [86], selective retransmissions [87], and interleaved packet scheduling [88].
Noise and interference: the unlicensed industrial, scientific, and medical (ISM) bands that most IoT devices rely on are congested [89], leading to signal collisions, higher bit-error rates, and increased timing jitter. In industrial control systems, this congestion can destabilize control loops [90]. Countermeasures include frequency hopping, adaptive modulation and coding, and sensor fusion with uncertainty-aware weighting [91,92]. To illustrate the impact of physical-layer resilience mechanisms under realistic channel conditions, Figure 11 shows the packet loss probability versus the signal-to-noise ratio (SNR) for uncoded, forward-error-correction (FEC) coded, and FEC-plus-automatic-repeat-request (ARQ) transmission modes. The uncoded link shows the classical waterfall region, where even small SNR degradations cause orders-of-magnitude increases in loss. Introducing FEC shifts the reliability curve toward lower SNR values, representing a coding gain of several decibels. Adding a single ARQ layer further reduces effective loss, demonstrating that lightweight hybrid error-control strategies can enhance environmental resilience without protocol redesign. The shaded area marks the typical LPWAN SNR range of −3 to +5 dB, crucial for maintaining connectivity in noisy industrial and outdoor settings.
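The qualitative ordering of these curves follows from a simple binomial link model; a sketch with illustrative packet size and correction capability (not the exact parameters of Figure 11):

```python
import math

def packet_loss(ber, n_bits=256, correctable=0, arq_retries=0):
    """Toy link model: a packet survives if at most `correctable` of its
    n_bits flip (FEC corrects that many errors); each independent ARQ
    retry multiplies the residual loss probability by itself again."""
    p_ok = sum(math.comb(n_bits, k) * ber**k * (1 - ber)**(n_bits - k)
               for k in range(correctable + 1))
    return (1.0 - p_ok) ** (arq_retries + 1)

ber = 1e-3   # illustrative bit-error rate at some operating SNR
uncoded = packet_loss(ber)
fec = packet_loss(ber, correctable=4)
fec_arq = packet_loss(ber, correctable=4, arq_retries=1)
# Each mechanism cuts loss by orders of magnitude: uncoded > FEC > FEC+ARQ.
```

Sweeping `ber` as a function of SNR reproduces the waterfall shape, with FEC shifting the curve left and ARQ squaring the residual loss.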
Energy scarcity: battery-powered and energy-harvesting devices often enter aggressive sleep modes [93], resulting in sparse or delayed data streams. For instance, a remote soil sensor may report only once per hour on cloudy days. Helpful countermeasures include event-driven sensing, compressive sampling, and lightweight on-device learning (i.e., tiny machine learning, TinyML) [94,95,96,97].
Hardware degradation: over time, sensors drift due to temperature fluctuations, aging, or wear [98]. Physically unclonable functions (PUFs) used for device identification can also lose reliability under thermal stress [99]. These issues can be addressed through periodic recalibration, helper-data schemes for PUF correction, and redundant sensing with majority voting [100,101,102].
Non-stationary data (concept drift): IoT data often evolve as environments, users, or firmware change [103]. A model trained on winter energy patterns may perform poorly during the summer months [104]. Mitigations include sliding-window retraining, online learning, and drift detection algorithms such as adaptive windowing (ADWIN) or the drift detection method (DDM) [105,106].
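A minimal windowed detector conveys the idea behind such methods (a simplified sketch, not a faithful ADWIN or DDM implementation):

```python
from collections import deque

class WindowDriftDetector:
    """Compare the mean error of a recent sliding window against a
    frozen reference window and flag drift when the gap exceeds a
    threshold (a simplified cousin of ADWIN/DDM)."""
    def __init__(self, window=50, threshold=0.3):
        self.ref = deque(maxlen=window)
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def update(self, error):
        if len(self.ref) < self.ref.maxlen:
            self.ref.append(error)       # still filling the reference
            return False
        self.recent.append(error)
        if len(self.recent) < self.recent.maxlen:
            return False
        gap = abs(sum(self.recent) / len(self.recent)
                  - sum(self.ref) / len(self.ref))
        return gap > self.threshold

det = WindowDriftDetector()
stream = [0.05] * 100 + [0.6] * 100      # error rate jumps: concept drift
alarms = [i for i, e in enumerate(stream) if det.update(e)]
# The first alarm fires shortly after the shift at index 100.
```

Once drift is flagged, the reference window would be reset and the model retrained on recent data; production detectors additionally control the false-alarm rate statistically.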
Figure 11.
Packet loss probability versus SNR for different transmission strategies in low-power IoT networks.
Real-world IoT incidents rarely involve a single, isolated stressor; disruptions typically combine physical degradation and adversarial manipulation, which makes hybrid (cyber-physical) stressors common. For example, adversarial traffic might occur during a network outage, or GPS spoofing could coincide with heavy radio frequency (RF) interference. In autonomous drone fleets, an attacker might exploit simultaneous packet loss and model drift to induce collisions or miscoordination. These combined effects are especially hazardous because traditional countermeasures tend to address one dimension at a time.
Figure 12 illustrates a simulated IoT scenario experiencing simultaneous adversarial and environmental disturbances. The shaded disruption window corresponds to a period in which a PGD adversarial attack occurs concurrently with 30% network packet loss. Three cross-layer indicators are logged in parallel: at the physical layer, SNR drops from ≈20 dB to 8 dB due to interference and energy depletion; at the network layer, packet loss rises sharply to 30%, representing congestion or wireless fading; and at the application layer, model confidence (e.g., from a classifier or anomaly detector) declines from 0.94 to 0.60, showing the combined impact of noise and malicious perturbation. The red resilience curve tracks system-level utility (normalized performance) over time. During the disruption, performance drops sharply; after mitigation and adaptation mechanisms are triggered (e.g., retransmission, adversarial retraining, redundancy), the system gradually recovers, reaching its baseline within about 48 s (the recovery time). In some cases, the curve may overshoot, slightly exceeding its original baseline due to adaptive learning or parameter re-tuning. The area under the curve between the onset of disruption and recovery represents the loss of resilience, quantifiable as a function of the recovery speed and the magnitude of the drop. Overall, this figure illustrates how cross-layer monitoring (physical, network, and application) can characterize the resilience behavior of IoT systems under realistic compound stressors, providing a framework for quantitative benchmarking of recovery and adaptation mechanisms.
In summary, IoT resilience cannot be understood by analyzing a single layer or stressor in isolation. Actual robustness emerges only when adversarial, environmental, and hybrid stressors are jointly modeled, tested, and mitigated across the entire system stack, from hardware and protocol layers to AI-driven decision logic and governance mechanisms.
Figure 12.
Resilience under Compound Stressors (PGD Attack and 30% Packet Loss).
4.3. Layers of IoT Resilience
Resilience in the IoT is not a single mechanism but an emergent property that arises from coordinated behavior across architectural layers. Each layer, from low-level sensors to high-level governance, contributes to anticipating disruptions, absorbing their impact, restoring functionality, and, ideally, improving through adaptation. Resilience is usually assessed relative to the security assumptions being made. In real-world IoT systems, baseline security includes device identity and authentication, secure communication to protect data confidentiality and integrity, and safe mechanisms for device provisioning and updates. In this survey, we treat these protections as standard and use them to delimit the threat model; resilience mechanisms then address the problems that persist when systems face failures, attacks, limited visibility, or resource constraints.
The device and hardware layer comprises physical devices, sensors, actuators, and embedded controllers operating under tight power, memory, and processing constraints [107]. Typical stressors include noise, wear and tear, temperature drift, and physical tampering. Resilience strategies such as physically unclonable functions (PUFs), lightweight authentication, and self-calibrating redundant sensing secure identities and ensure reliable operation. PUFs derive unique, device-specific keys from manufacturing variations, removing the need for vulnerable stored secrets [108]. PUFs are one example of a security primitive that can also support secure boot, safe key storage, and device attestation in trusted settings [109]. Lightweight authentication (e.g., hash-based challenge-response) fits the kilobyte-scale memory and low-MHz CPUs found in microcontrollers. Side-channel protections (masking, jitter, current flattening) reduce leakage through timing or power profiles. Redundant sensing and self-calibration mitigate drift and counteract the effects of aging components: multiple sensors cross-validate readings and periodically re-baseline to ensure ongoing accuracy. For instance, in smart agriculture, a soil moisture sensor with redundant probes and supplementary data maintains precision even under temperature fluctuations or partial sensor failures.
Above the hardware lies the protocol and network layer, which moves data through LPWAN, mesh, and 5G or edge links [110]. Stressors in this layer include packet loss, interference, congestion, and selective jamming. Resilience methods, such as resilient routing, flooding or DoS defenses, decentralized consensus, and cognitive radio with adaptive spectrum use, focus on maintaining end-to-end delivery and trustworthy coordination. While confidentiality and integrity are often provided by standard secure communications in modern stacks (e.g., authenticated and encrypted channels), resilience concerns persist because availability and timeliness can still be degraded by jamming, congestion, and partitioning. Resilient routing (multi-path, opportunistic forwarding) re-routes around failed or jammed nodes. Flooding and DoS defenses differentiate legitimate bursts of activity, such as firmware rollouts, from malicious overloads through techniques like rate limiting and in-network filtering. Decentralized consensus mechanisms, including directed acyclic graph (DAG) ledgers and Proof-of-Authority, can tolerate network partitions while maintaining the auditability of updates and commands. Additionally, cognitive radio can adaptively shift to cleaner channels and modify coding and modulation in response to interference [111]. For instance, an IIoT mesh automatically detours telemetry through backup gateways during a channel-specific jamming incident, preserving control-loop stability.
The third layer is the learning and AI layer (inference and adaptation or processing layer) [
112]. IoT increasingly relies on machine learning for anomaly detection, prediction, and closed-loop control. The stressors we face are dynamic and adversarial, including concept drift, data imbalance, poisoned updates, and evasion attacks. Notably, these stressors can arise even in protected environments, for example, through compromised endpoints, manipulated sensing contexts, or adversarial behavior that targets the learning pipeline rather than the cryptographic channel. Resilience focuses on models that can adapt, recover, and resist manipulation, such as adversarial training, ensembles, generative augmentation, FL, continual learning, and graph neural networks (GNNs). Adversarial training and ensembles improve classifiers’ robustness against perturbed inputs and single-point failures [
113]. Generative augmentation, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), helps reconstruct missing modalities and balances rare events to stabilize learning when faced with loss or sparsity [
114]. It is a good practice to apply federated and continual learning with a local adaptation enabler to ensure privacy [
115]; robust aggregation methods (e.g., trimmed mean, median) mitigate the impact of poisoned clients [
116]. GNNs leverage the topology of devices and flows to detect correlated anomalies and localize faults. For example, a federated smart grid forecaster adapts to regional disturbances without sharing raw customer data, while robust aggregation downweights suspicious client updates.
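The robust aggregation method cited above (coordinate-wise trimmed mean) can be sketched as follows; the update vectors and trim ratio are illustrative and not tied to any particular FL framework:

```python
import numpy as np

def trimmed_mean_aggregate(client_updates: list, trim_ratio: float = 0.2) -> np.ndarray:
    """Coordinate-wise trimmed mean over client model updates.

    Sorts each coordinate across clients and discards the lowest and highest
    trim_ratio fraction before averaging, which bounds the influence of
    poisoned clients that submit extreme values.
    """
    updates = np.sort(np.stack(client_updates), axis=0)  # sort per coordinate
    k = int(len(client_updates) * trim_ratio)            # clients trimmed per side
    kept = updates[k:len(client_updates) - k] if k > 0 else updates
    return kept.mean(axis=0)

# Four honest clients and one poisoned client submitting an extreme update:
honest = [np.array([0.10, -0.20]), np.array([0.12, -0.18]),
          np.array([0.09, -0.22]), np.array([0.11, -0.20])]
poisoned = [np.array([50.0, 50.0])]
agg = trimmed_mean_aggregate(honest + poisoned, trim_ratio=0.2)
print(agg)  # close to the honest clients' mean; the poisoned update is trimmed away
```

The coordinate-wise median is an even simpler robust aggregator with the same intent; the trimmed mean trades a little extra robustness for lower variance when most clients are honest.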
The fourth layer is the application layer, which translates data into domain-specific actions (healthcare, transportation, manufacturing, energy) [
117]. Stressors include partial outages, delayed data, and degraded sensing. Resilience in sectors like smart healthcare, Industrial IoT, and smart grids focuses on maintaining service continuity and ensuring a graceful degradation of performance. At this layer, resilience is closely tied to safe fallback behavior under uncertainty, ensuring bounded performance and avoiding unsafe automation when inputs are incomplete or delayed. In smart healthcare, redundant biosignals from wearables are combined, so if a sensor fails, monitoring can continue with alerts that account for uncertainty. In IIoT, predictive maintenance and digital twins are utilized to simulate potential faults and develop pre-planned recovery strategies [
118]. Smart grids manage distributed generation and demand response to ensure stability during cyber or weather-related disturbances [
119]. For example, during a hospital network outage, electrocardiogram (ECG) wearables buffer data locally and synchronize later, maintaining clinical oversight with bounded data loss.
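The buffer-and-synchronize pattern in the ECG example can be sketched as a bounded store-and-forward buffer; the capacity and sample format below are hypothetical, and the bounded capacity is what makes the data loss bounded rather than open-ended:

```python
from collections import deque

class StoreAndForwardBuffer:
    """Bounded local buffer for sensor samples during a network outage.

    When capacity is exceeded, the oldest samples are evicted, so data loss
    stays bounded and predictable instead of causing unbounded memory growth.
    """
    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)
        self.dropped = 0

    def record(self, sample) -> None:
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1  # the oldest sample is about to be evicted
        self.buffer.append(sample)

    def sync(self) -> list:
        """Flush buffered samples once connectivity returns."""
        pending = list(self.buffer)
        self.buffer.clear()
        return pending

# During an outage, a wearable keeps the most recent 1000 samples:
buf = StoreAndForwardBuffer(capacity=1000)
for t in range(1200):
    buf.record(("ecg", t))
backlog = buf.sync()
print(len(backlog), buf.dropped)  # 1000 buffered, 200 oldest dropped
```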
At the top sits the organizational, regulatory, and ethical backbone of resilience (i.e., governance and trust). Stressors include opaque decision-making, ambiguity of provenance, and policy non-compliance. Methods such as blockchain-anchored logs, explainable AI (XAI), trust scoring, and compliance engines enhance transparency, auditability, and adaptive policy. Blockchain-anchored logs document device identity, software lineage, and security events for post-incident forensics and incident response [
120]. XAI supports operator trust during incident response by clarifying model rationale and failure modes [
121]. Trust scoring continuously estimates the reliability of devices, links, and data sources, and decays scores for anomalous behavior [
122]. Compliance engines codify jurisdictional rules and update enforcement policies as regulations evolve. For instance, in autonomous mobility, immutable event logs ensure that braking decisions are based on authentic, time-synchronized sensor data rather than spoofed inputs. At a system level, governance also constrains which recovery and adaptation actions are permissible and auditable, which is essential when resilience mechanisms trigger automated mitigation.
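The trust-scoring behavior described above, multiplicative decay on anomalous behavior with slow recovery under normal behavior, can be sketched as follows; the decay and recovery constants are illustrative assumptions:

```python
def update_trust(score: float, anomalous: bool,
                 decay: float = 0.5, recovery: float = 0.05) -> float:
    """Update a device's trust score in [0, 1].

    Anomalous behavior halves the score (multiplicative decay, illustrative
    rate), while normal behavior rebuilds it slowly and additively, so trust
    collapses quickly but is regained gradually.
    """
    if anomalous:
        score *= decay
    else:
        score = min(1.0, score + recovery)
    return score

# Two anomalies followed by two normal observations:
score = 1.0
for flag in [True, True, False, False]:
    score = update_trust(score, flag)
print(round(score, 3))
```

The asymmetry (fast decay, slow recovery) is a deliberate design choice: it makes a compromised device lose influence quickly while preventing it from regaining trust through a brief spell of good behavior.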
True resilience emerges when these layers operate together. A noisy sensor reading at the device layer may be identified by a GNN-based detector at the AI layer, isolated by a routing policy at the network layer, explained to operators using XAI, and recorded for audit at the governance layer. Conversely, a policy change at the governance layer can tighten model thresholds at the AI layer, which in turn reconfigures sampling rates at the device layer to conserve energy during sustained attacks. This cross-layer view also clarifies the scope of this survey. We analyze resilience mechanisms and evaluations under realistic assumptions about trusted environments and secure communications, and we highlight where failures persist when these protections are incomplete or degraded.
4.4. Metrics and Evaluation Frameworks
Assessing resilience in the IoT is fundamentally multi-faceted. Unlike static security or performance evaluations, measuring resilience requires capturing the evolving behavior of systems over time: how they deteriorate, recover, and adapt in response to challenges. Conventional metrics like accuracy or throughput offer only snapshots; genuine resilience requires measures that account for temporal, structural, and interpretability aspects, demonstrating both immediate robustness and long-term adaptability.
Figure 12 conceptually illustrates a typical resilience evaluation, where system performance drops after a disruption and then recovers over time. Key metrics, such as performance under stress, scalability, system-level trust, transparency, and interpretability, as well as hybrid and compound benchmarks, can be used to evaluate resilience.
Performance metrics under stress: the first dimension assesses how well a model or system maintains operational quality during and after disruptions. Common indicators include accuracy or macro-F1 under perturbation, area under the resilience curve (AURC), and latency and energy overhead. Accuracy, or macro-F1, evaluates prediction consistency under difficult conditions, such as data degradation or network issues (e.g., 30% packet loss). The AURC integrates performance over the interval from disruption onset to recovery; a higher AURC indicates faster or more complete restoration. Latency and energy overhead capture the efficiency cost of resilience mechanisms, such as self-healing routing or retraining after poisoning. For instance, a resilient intrusion detection model might drop from 95% to 70% accuracy under a PGD adversarial attack but recover to 90% within 200 s, yielding a higher AURC than a static model that stagnates at 75%.
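The AURC can be computed by trapezoid-rule integration of the sampled performance curve; normalizing by baseline performance and window length, as assumed here (one common convention), makes a value of 1.0 correspond to no degradation at all:

```python
def aurc(timestamps: list, performance: list, baseline: float = 1.0) -> float:
    """Normalized area under the resilience curve via the trapezoid rule.

    performance[i] is the measured quality (e.g., accuracy) at timestamps[i];
    the result is in (0, 1] when performance never exceeds the baseline.
    """
    area = sum((performance[i] + performance[i + 1]) / 2.0
               * (timestamps[i + 1] - timestamps[i])
               for i in range(len(timestamps) - 1))
    window = timestamps[-1] - timestamps[0]
    return area / (baseline * window)

# A resilient model dips under attack but recovers; a static one stagnates:
t = [0, 50, 100, 150, 200]
resilient = [0.95, 0.70, 0.80, 0.88, 0.90]
static    = [0.95, 0.75, 0.75, 0.75, 0.75]
print(aurc(t, resilient) > aurc(t, static))  # True
```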
Scalability and system-level metrics: resilience must extend beyond individual devices to distributed IoT environments where hundreds of clients cooperate through federated or edge learning. Evaluations therefore include client scalability, the network resilience index, and cross-layer coordination latency. Client scalability captures how performance changes as the number of participants grows (e.g., from 10 to 150 nodes in FL). The network resilience index is the ratio of sustained throughput or model convergence speed under partial connectivity loss (e.g., 20% of clients offline) to its baseline value. Cross-layer coordination latency measures the time between fault detection at one layer and adaptation at another, reflecting how tightly the layers of the resilience mesh are coordinated. In a federated healthcare network, an adaptive aggregator that maintains over 85% accuracy with a 30% client dropout rate is more resilient than one that drops below 70%.
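As a minimal sketch, the network resilience index can be expressed as a ratio of stressed to baseline throughput; the throughput figures below are hypothetical:

```python
def network_resilience_index(baseline_throughput: float,
                             stressed_throughput: float) -> float:
    """Ratio of sustained throughput under stress (e.g., partial connectivity
    loss) to baseline throughput; 1.0 means no degradation."""
    return stressed_throughput / baseline_throughput

# With 20% of clients offline, throughput drops from 500 to 430 msg/s:
print(round(network_resilience_index(500.0, 430.0), 2))  # 0.86
```

The same ratio form applies when the quantity of interest is model convergence speed rather than throughput.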
Trust, transparency, and interpretability metrics: because resilience also involves human oversight, operators must trust the system’s adaptation process. Interpretability metrics quantify this human–machine alignment using Shapley Additive exPlanations (SHAP) or local interpretable model-agnostic explanations (LIME) attribution stability, trust-score variance, and recovery transparency [
123]. SHAP and LIME attribution stability assess the consistency of feature importance across model recoveries, highlighting semantic preservation. Trust-score variance measures fluctuations in model reliability under stress, with lower variance signifying more stable behavior. Recovery transparency is a qualitative or quantitative measure of how well recovery actions are logged, explained, and verifiable (e.g., via blockchain audit trails). For example, a resilient anomaly detector should not only regain performance after retraining but also maintain stable SHAP attributions, ensuring that its reasoning process remains interpretable and trustworthy.
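Attribution stability can be quantified, for example, as the cosine similarity between mean absolute attribution vectors (e.g., SHAP values) computed before and after model recovery; the feature-importance values below are illustrative:

```python
import math

def attribution_stability(before: list, after: list) -> float:
    """Cosine similarity between two feature-attribution vectors
    (e.g., mean absolute SHAP values) computed before and after recovery.
    Values near 1.0 indicate the model's reasoning was preserved."""
    dot = sum(b * a for b, a in zip(before, after))
    norm = (math.sqrt(sum(b * b for b in before))
            * math.sqrt(sum(a * a for a in after)))
    return dot / norm if norm else 0.0

# Feature importances largely preserved after retraining:
pre  = [0.40, 0.30, 0.20, 0.10]
post = [0.38, 0.31, 0.22, 0.09]
print(round(attribution_stability(pre, post), 3))
```

Other distance measures (e.g., rank correlation of feature orderings) serve the same purpose; cosine similarity is used here only because it is simple and scale-invariant.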
Hybrid and compound benchmarks: disruptions in real-world IoT environments result from multiple simultaneous factors. Assessment frameworks should therefore use compound stress testing, introducing several stressors, such as adversarial perturbations, packet loss, and energy constraints, at the same time. Hybrid benchmarks, such as compound-scenario testing, cross-layer metrics, and resilience trade-off curves, remain scarce in the literature but are essential for realistic validation. Compound-scenario testing combines two or more stressors, e.g., PGD with 30% packet loss, or concept drift with node dropout. Cross-layer metrics combine physical-layer link reliability, network-layer throughput, and model-layer recovery accuracy. Resilience trade-off curves visualize the balance between recovery speed, energy cost, and trust stability across scenarios. For example, an antifragile federated model might slightly reduce accuracy during an attack but significantly shorten recovery time and energy cost across combined network and model stressors.
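Compound stress testing can be sketched by applying two stressors to the same sensor stream at once; here Gaussian noise stands in for an adversarial perturbation, and the noise level and loss rate are illustrative:

```python
import random

def apply_compound_stress(samples, noise_std: float = 0.1,
                          loss_rate: float = 0.3, seed: int = 0):
    """Apply two stressors simultaneously to a sensor stream:
    additive Gaussian noise (a stand-in for adversarial perturbation)
    and random packet loss. Dropped samples are returned as None so that
    downstream code is forced to handle gaps explicitly."""
    rng = random.Random(seed)  # fixed seed keeps the stress scenario reproducible
    stressed = []
    for x in samples:
        if rng.random() < loss_rate:
            stressed.append(None)                    # packet lost
        else:
            stressed.append(x + rng.gauss(0.0, noise_std))
    return stressed

clean = [float(i) for i in range(10)]
stressed = apply_compound_stress(clean)
print(sum(s is None for s in stressed), "of", len(stressed), "samples lost")
```

Fixing the random seed makes the compound scenario repeatable across the systems being compared, which is essential for fair benchmarking.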
Assessing the resilience of IoT demands a comprehensive framework that incorporates performance, scalability, interpretability, and the assessment of multiple stressors. In the absence of such multi-dimensional metrics, there is a danger that systems may be deemed resilient based solely on incomplete evidence. Using metrics like AURC for temporal, cross-layer coordination for structure, and trust stability for cognition marks a key advancement in standardizing resilience assessment in IoT research.