2. Materials and Methods
2.1. System Architecture Overview
The proposed DT–FL–FU architecture integrates digital twins, federated learning, and federated unlearning into a unified framework that enhances the resilience and trustworthiness of aircraft health monitoring. It enables distributed aviation stakeholders to collaboratively train fault prediction models while preserving data locality and allowing for the post hoc correction of corrupted or biased inputs.
The architecture consists of four interdependent layers (Figure 1):
aircraft–edge layer (local digital twins);
federated learning layer (privacy-preserving model training);
unlearning coordination layer (FU protocol manager);
monitoring and governance layer (audit and compliance).
At the first level, each aircraft or ground-based maintenance unit is associated with a local digital twin, which serves as a real-time computational replica of critical subsystems (e.g., engine, hydraulic, avionics). These DTs ingest continuous telemetry from onboard sensors—such as vibration signals, temperature readings, fuel flow rates, and pressure levels—and convert them into structured time-series sequences.
To support predictive modeling, the DTs apply local preprocessing (e.g., normalization, outlier filtering, window slicing) and extract temporal patterns indicative of degradation or anomaly buildup. These labeled sequences form the basis for local model training.
Each DT node implements a long short-term memory (LSTM) neural network to capture temporal dependencies across sensor readings. Let us suppose that an LSTM with two hidden layers and 64 units per layer is trained to predict the probability of hydraulic system failure based on the last 48 h of multivariate sensor data. The input sequence might include:
Pressure oscillations from redundant hydraulic lines;
Fluid temperature gradients during taxi and climb phases;
Actuator response time (command vs. actual deflection).
The model is trained using a binary cross-entropy loss function, with labels derived from historical fault logs (1 = failure within 12 h, 0 = no event).
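As an illustration of this configuration, the following sketch shows how such a local model could be assembled; this is an assumption on our part rather than the study's code, TensorFlow/Keras is assumed as the framework, and `build_local_lstm` is a hypothetical helper name:

```python
# Illustrative sketch of the local DT model described above (not the exact
# implementation). Assumes TensorFlow/Keras; window length and feature count
# are placeholders for the subsystem's actual telemetry configuration.
import tensorflow as tf

def build_local_lstm(seq_len: int, n_features: int) -> tf.keras.Model:
    """Two-layer LSTM (64 units each) predicting hydraulic-failure probability."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len, n_features)),
        tf.keras.layers.LSTM(64, return_sequences=True),  # first hidden layer
        tf.keras.layers.LSTM(64),                         # second hidden layer
        tf.keras.layers.Dense(1, activation="sigmoid"),   # P(failure within 12 h)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: a 48 h window at 1-minute resolution with 3 sensor channels
# (pressure oscillations, fluid temperature gradients, actuator response time).
model = build_local_lstm(seq_len=48 * 60, n_features=3)
```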
At the second federated learning layer, instead of sharing raw time-series data, each DT node computes gradient updates from its local LSTM training and transmits these encrypted updates to a central server. The central FL coordinator performs aggregation, typically using federated averaging (FedAvg), to update the global LSTM model weights.
Let $w_k^t$ denote the local model parameters at node $k$. Each node updates its weights via stochastic gradient descent and computes:

$$w_k^{t+1} = w^t - \eta \nabla \mathcal{L}_k(w^t),$$

where $\eta$ is the learning rate, $\mathcal{L}_k$ is the local loss at node $k$, and $w^t$ denotes the vector of global model parameters after the $t$-th communication round, representing the updated model state after aggregating all participating clients’ contributions.

The server then updates the global model by weighted averaging, as follows:

$$w^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k^{t+1},$$

where $K$ is the number of participating clients (DT nodes), $n_k$ is the size of client $k$'s local dataset, and $n = \sum_{k=1}^{K} n_k$.
This process enables model convergence across geographically distributed aircraft without centralized data pooling. Model checkpoints are broadcast back to nodes after each global round and LSTM inference is updated accordingly for local prediction tasks, e.g., estimating the remaining useful life (RUL).
At the third unlearning coordination layer, when corrupted, biased, or adversarial data are detected, e.g., due to faulty pressure sensors or injected noise in telemetry, the corresponding node’s contribution must be erased from the global model. The FU layer initiates this correction process by applying influence-reversal methods, such as:
Gradient projection removal, where the offending update is subtracted from the global model;
Reweighting-based approximation, adjusting the model to minimize the influence of the corrupted client’s data on the loss surface;
Fine-tuned retraining, constrained to a low-resource window using only clean data contributions.
For instance, if client $c$ contributed anomalous LSTM gradients due to a data injection attack, the FU controller uses a stored update log to identify its logged contribution $\Delta w_c$ and execute:

$$w_{\text{new}} = w - \lambda\, \Delta w_c,$$

where $\lambda$ is a rollback scaling factor adjusted to maintain stability.
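A minimal sketch of this rollback step, assuming model parameters are handled as flat NumPy vectors and the update log stores each client's delta (`rollback_update` is an illustrative name, not the study's implementation):

```python
# Gradient-projection rollback: w_new = w - lambda * dw_c, applied to flat
# NumPy parameter vectors retrieved from the governance layer's update log.
import numpy as np

def rollback_update(global_weights: np.ndarray,
                    offending_delta: np.ndarray,
                    lam: float = 1.0) -> np.ndarray:
    """Subtract a logged client delta from the global model.

    lam is the rollback scaling factor: lam = 1.0 removes the stored
    contribution in full; smaller values apply a partial, more
    stability-preserving correction.
    """
    return global_weights - lam * offending_delta

# Usage: the FU controller retrieves the offending delta from the update log.
# corrected = rollback_update(w_global, update_log["client_c"], lam=0.8)
```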
The fourth monitoring and governance layer logs all local update hashes, model weight transitions, unlearning events, and LSTM performance metrics for each node. It supports forensic audits and regulatory verification, e.g., European Union Aviation Safety Agency (EASA) or Federal Aviation Administration (FAA) inspections, by enabling full traceability of how models evolved and which data were removed.
Each global model version is tagged with a cryptographic signature and metadata indicating participation, rollback status, and error bounds. Visual dashboards enable operators to verify that no biased or damaged data continue to influence predictive outputs.
2.2. Digital Twin-Based Data Modeling
In the proposed framework, digital twins (DTs) act as real-time computational replicas of aircraft subsystems, continuously synchronized with onboard sensors and operational data. They transform raw, heterogeneous telemetry—such as pressure, vibration, and actuator signals—into structured datasets for fault detection and health monitoring. Each DT preprocesses data through filtering, normalization, and feature extraction, and assigns labels based on historical faults or anomaly thresholds. Local LSTM models are trained on these data to predict failures or estimate remaining useful life.
DTs maintain a dynamic state, tracking model versions, data quality, and participation in federated learning and unlearning. If data inconsistencies, like drift or corruption, are detected, affected sequences are excluded to prevent contamination of the global model. In parallel, unsupervised methods can provide anomaly scores to support early detection and trigger unlearning if needed. This structured lifecycle from data ingestion to gradient generation enables DTs to serve as autonomous, intelligent nodes in a resilient, federated AHM system capable of adapting to corrupted inputs and maintaining model integrity across diverse aircraft fleets.
Figure 2 illustrates the internal data-processing workflow of each local digital twin in the proposed aviation health-monitoring framework. The process begins with the acquisition of sensor data and operational records, which include raw telemetry from aircraft subsystems (e.g., temperature, pressure, vibration) as well as contextual metadata (e.g., flight phase, maintenance history). These inputs are passed into the data ingestion module, where initial synchronization, buffering, and integrity checks are performed.
Following ingestion, the preprocessing and feature extraction stage applies signal normalization, outlier filtering, and transformation techniques to convert raw time-series data into structured inputs suitable for learning. The extracted features are then used by the local learning model, which typically consists of a recurrent neural network such as an LSTM. This model is trained to detect degradation patterns and generate probabilistic forecasts of system faults.
Parallel to the training process, the system performs anomaly detection, which assesses the deviation of incoming signals from nominal baselines. If the data are determined to be inconsistent, adversarial, or corrupted, the anomaly detection module can initiate downstream unlearning procedures. Finally, the model produces labels or anomaly scores, which are fed back into both local decision-making systems and federated model update processes.
This modular data flow ensures that each digital twin maintains an adaptive, trustworthy representation of the aircraft’s condition while supporting fault detection, data quality assurance, and compatibility with federated learning and unlearning protocols.
2.3. Federated Learning for AHM Model Training
To enable scalable and privacy-preserving collaboration across multiple aviation stakeholders, the proposed framework employs FL as a core mechanism for training AHM models. This approach enables local digital twins, distributed across airlines, maintenance organizations, and equipment manufacturers, to contribute to a shared global model without transferring raw operational data. Instead, each participant performs localized model training and shares only the learned model updates, thereby reducing the risk of data leakage and ensuring compliance with regulatory and organizational data policies.
Each digital twin hosts a local learning model trained on structured time-series data derived from onboard telemetry and operational records. These models are tasked with predicting fault probabilities, estimating RUL, or classifying anomaly states within the monitored subsystems.
The FL process proceeds in a series of communication rounds coordinated by a central server or aggregation node. During each round, the workflow follows the steps below:
Step 1—Global model distribution. The server sends the current version of the global model to all selected participating nodes;
Step 2—Local training. Each client trains the model on its local dataset using stochastic gradient descent (SGD) or its variants to obtain updated parameters $w_k^{t+1}$;
Step 3—Update submission. Clients transmit their local model updates to the server;
Step 4—Aggregation. The server performs FedAvg or another aggregation strategy to compute the new global model:

$$w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n}\, w_k^{t+1},$$

where $n_k$ is the size of the local dataset at client $k$, $S_t$ is the set of clients participating in round $t$, and $n = \sum_{k \in S_t} n_k$ is the total sample size.
This aggregated model is then redistributed to all clients for the next round, continuing until convergence is reached or a specified stopping criterion is met.
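The four steps can be summarized in a short sketch, assuming each client object exposes a hypothetical `local_train` callable returning its updated weights and sample count; this illustrates FedAvg rather than the exact simulation code:

```python
# Sketch of one FedAvg communication round (Steps 1-4) over flat NumPy
# parameter vectors; the client interface is an assumption for illustration.
import numpy as np

def fedavg_round(global_w: np.ndarray, clients: list) -> np.ndarray:
    updates, sizes = [], []
    for client in clients:                       # Step 1: distribute w^t
        w_k, n_k = client.local_train(global_w)  # Step 2: local SGD training
        updates.append(w_k)                      # Step 3: update submission
        sizes.append(n_k)
    n = float(sum(sizes))                        # Step 4: weighted aggregation
    return sum((n_k / n) * w_k for w_k, n_k in zip(updates, sizes))
```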
Depending on the target AHM application, the global model may be trained using classification or regression loss functions. For binary fault prediction, a binary cross-entropy (BCE) loss is used, while RUL estimation typically employs mean squared error (MSE), as follows:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right], \qquad \mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2,$$

where $y_i$ and $\hat{y}_i$ are the true and predicted outputs for sample $i$, respectively.
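For concreteness, the two losses can be written out as follows (a NumPy sketch; the clipping constant is our addition to guard against log(0)):

```python
# The two task losses written out explicitly; eps guards against log(0).
import numpy as np

def bce_loss(y: np.ndarray, y_hat: np.ndarray, eps: float = 1e-7) -> float:
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return float(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

def mse_loss(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean((y - y_hat) ** 2))
```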
The local models are optimized using mini-batch training, often with early stopping and dropout regularization to mitigate overfitting, particularly on the smaller or imbalanced datasets typical in aviation maintenance domains.
Not all clients participate in every FL round. Participation may be based on:
Availability (e.g., aircraft on ground vs. in-flight);
Data quality scores;
Trustworthiness (derived from anomaly detection results and governance metrics).
Selective participation helps to maintain training stability and prevents the propagation of biased or corrupted updates, particularly in the presence of non-IID (non-independent and identically distributed) data distributions among clients.
To protect model integrity and privacy, all client–server communications are secured using transport-layer encryption. In some implementations, secure aggregation protocols are employed to prevent the server from reconstructing individual client updates. These include additive homomorphic encryption, differential privacy noise injection, or secure multiparty computation.
In addition, each digital twin appends cryptographic hashes of its model update and metadata (e.g., training epoch count, data volume, error metrics) to ensure auditability and traceability in the governance layer. This allows for rollback or unlearning, should a participant’s data later be flagged as unreliable.
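A hypothetical sketch of this logging step, using SHA-256 over the serialized update together with its metadata (the field names are illustrative, not the framework's actual schema):

```python
# Illustrative audit-trail entry: each model update is hashed together with
# its metadata so the governance layer can later verify or roll it back.
import hashlib
import json
import numpy as np

def log_update(client_id: str, delta: np.ndarray, metadata: dict) -> dict:
    digest = hashlib.sha256(delta.tobytes()).hexdigest()
    record = {"client": client_id, "update_hash": digest, **metadata}
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    return record  # appended to the governance layer's append-only log

# Example metadata (hypothetical fields):
# {"round": 42, "epochs": 5, "n_samples": 4096, "val_loss": 0.21}
```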
Global model convergence is monitored using a validation dataset curated from clean, certified operational data, either at the aggregator node or through collaborative voting across clients. Performance is evaluated across key metrics, such as accuracy, precision, recall, F1-score, and RMSE, depending on the task type.
This federated learning approach enables continuous, privacy-preserving, and collaborative optimization of AHM models across a diverse aviation ecosystem. It lays the groundwork for robust, real-time health monitoring while enabling selective model correction through federated unlearning when compromised or erroneous data are identified.
Figure 3 presents the core workflow of the federated learning process applied to aircraft health monitoring across distributed digital twin clients. The process begins with the server-side distribution of the current global model parameters $w^t$, which are sent to all selected client devices (e.g., aircraft-based digital twins, MRO units, or OEM nodes). Each client then uses its own local dataset $D_k$ to perform local training, updating the model based on time-series telemetry and labeled operational data collected from onboard sensors.
After completing the local training phase, each client generates a locally updated version of the model, denoted as $w_k^{t+1}$, which is then securely transmitted back to the central server. The server performs an aggregation step, commonly implemented as FedAvg, to compute the new global model $w^{t+1}$. This aggregation accounts for the relative dataset sizes across clients to ensure proportional influence.
The updated global model is then redistributed in the next round, completing a cycle that allows for decentralized learning without compromising raw data privacy. This flow enables continuous, collaborative model improvement across multiple aviation stakeholders while supporting modular correction through federated unlearning mechanisms in subsequent stages.
2.4. Federated Unlearning Mechanism
While federated learning offers a robust solution for decentralized model training in aviation health-monitoring systems, it remains inherently vulnerable to the inclusion of compromised, corrupted, or misleading data in the global model. In practice, digital twins may generate unreliable training contributions due to faulty sensors, telemetry misalignment, software bugs, or even adversarial manipulation. Once such flawed data have been integrated into the global model via FL, the effects are typically non-trivial to reverse, as traditional federated architectures lack mechanisms for removing specific client influences post hoc.
To address this limitation, the proposed framework introduces a FU module, enabling the selective removal of specific data contributions from the global model without requiring full retraining. This capability is essential for maintaining the accuracy, trustworthiness, and auditability of aircraft health monitoring models deployed in safety-critical environments.
Unlearning may be triggered under several circumstances:
Detection of sensor malfunction, signal noise, or drift affecting training windows;
Post facto discovery of data poisoning or tampering by a compromised client;
Identification of non-compliant or outdated data contributions (e.g., due to maintenance record updates);
For regulatory or contractual obligations requiring data redaction or withdrawal.
Each client’s contribution to the global model is tracked and logged during the FL process. The monitoring and governance layer maintains metadata for every client update, including cryptographic hashes, model deltas $\Delta w_k$, training context, and local error metrics. This enables the unlearning coordinator to locate, isolate, and reverse a specific contribution with minimal disruption.
Figure 4 illustrates the rollback and correction workflow implemented in the federated unlearning mechanism. The process begins with local training performed by a client on its dataset, resulting in an updated model $w_k^{t+1}$. This model update is submitted to the central server, where it is incorporated during the aggregation step to generate a new global model $w^{t+1}$.
If the update from a particular client is later identified as corrupted or unreliable—due to sensor faults, poisoned data, or adversarial manipulation—the system initiates a rollback and correction operation. This is achieved by applying a negative offset to the previously aggregated update to remove the influence of that specific contribution. The rollback module adjusts the global model accordingly and revalidates it before it is reused or redistributed.
The flow emphasizes that federated unlearning does not require full retraining but instead operates by surgically reversing the impact of specific client updates. This ensures the continued integrity and reliability of predictive models within the aircraft health-monitoring framework while maintaining auditability of all correction actions.
2.5. Experimental Setup
To evaluate the effectiveness and robustness of the proposed FL–DT–FU framework for aircraft health monitoring, a series of simulation-based experiments were designed to replicate realistic aviation operating conditions, sensor data variability, and client heterogeneity. The experimental setup includes synthetic and semi-real datasets, a distributed simulation of digital twin clients, a central FL coordination server, and a federated unlearning control unit.
2.5.1. Simulated Aviation Environment
The simulation environment replicates a fleet of 30 aircraft-based digital twins operating as independent clients. Each client maintains a local time-series database representing telemetry from key aircraft systems, including:
Hydraulic pressure and flow sensors;
Engine vibration and exhaust gas temperature (EGT) profiles;
Control surface actuator feedback signals;
Environmental parameters (altitude, airspeed, temperature).
Synthetic data were generated using statistical models based on real-world distributions, enriched with periodic degradation patterns, flight-phase dynamics, and sensor noise. For fault events, labeled anomalies were injected based on maintenance event timelines and failure propagation models.
To represent data corruption, 20% of the clients were assigned randomly injected errors, drifted baselines, or adversarial label flips. These corruptions were used to evaluate both the degradation of federated model performance and the ability of FU to recover predictive reliability.
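The corruption modes can be sketched as follows (illustrative parameter values; the actual generators used in the simulation are not reproduced here):

```python
# Sketches of the three corruption modes injected into 20% of clients;
# drift magnitude, noise scale, and flip rate are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def apply_sensor_drift(X: np.ndarray, max_offset: float = 0.20) -> np.ndarray:
    """Linearly growing baseline offset (up to +20%) over the sequence."""
    ramp = np.linspace(0.0, max_offset, X.shape[0])[:, None]
    return X * (1.0 + ramp)

def inject_noise(X: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Additive Gaussian noise degrading the signal-to-noise ratio."""
    return X + rng.normal(0.0, sigma, size=X.shape)

def flip_labels(y: np.ndarray, rate: float = 0.2) -> np.ndarray:
    """Invert a random fraction of binary anomaly labels (0 -> 1, 1 -> 0)."""
    mask = rng.random(y.shape) < rate
    return np.where(mask, 1 - y, y)
```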
2.5.2. Model Configuration
Each digital twin node trains a long short-term memory (LSTM) neural network with the following structure:
Input sequence length: 120 time steps (e.g., 2 h of telemetry);
Hidden layers: 2;
Hidden units per layer: 64;
Output: binary classification (fault/no fault) or regression (remaining useful life);
Loss functions: binary cross-entropy (BCE) for classification and mean squared error (MSE) for RUL estimation;
Optimizer: Adam with learning rate 0.001;
Local training epochs per round: 5;
Mini-batch size: 64.
The federated server aggregates local updates using the FedAvg algorithm, with client participation randomized at each round (80% active per round). Each simulation run consists of 100 FL communication rounds, and the global model is evaluated after each round.
2.5.3. Unlearning Scenarios
Three federated unlearning strategies were implemented:
Gradient subtraction—subtracting the corrupted client’s delta from the global model;
Exact re-aggregation—recomputing the global model without the specific client’s update;
Constrained retraining—partial retraining using only verified clean clients.
Unlearning was triggered after round 60, once the corrupted contributions had already affected global convergence. The rollback module had access to stored deltas and participation logs. Model performance before and after unlearning was compared using standard classification metrics.
2.5.4. Evaluation Metrics
Model performance and unlearning efficacy were measured using the following indicators:
Accuracy, precision, recall, F1-score—for fault-detection tasks;
Root mean square error (RMSE)—for RUL estimation;
Model recovery rate—evaluating the improvement in accuracy after unlearning;
Unlearning latency—time required to perform rollback operation;
Model drift—difference between original and corrected model output on benchmark dataset;
False positive/negative rate change—to assess unlearning stability.
All experiments were executed on a virtualized cluster simulating client–server interactions using Python 3.11.12 with TensorFlow Federated [51] and Docker-based container orchestration [52]. Computational resources were scaled to reflect realistic onboard processing limits (CPU + limited GPU support per client).
Figure 5 illustrates the simulated federated environment used to evaluate the proposed FL–DT–FU framework for aircraft health monitoring.
At the center of the diagram is the federated learning coordination server, which orchestrates model aggregation and redistribution cycles across distributed digital twin clients.
Each client node represents a specific aircraft subsystem, including the hydraulics, engines, control surfaces, and environment, which are modeled as independent digital twin agents operating on local data streams. These digital twins simulate on-aircraft computational processes, including the local training of machine learning models based on system-specific telemetry and fault history.
The global model is distributed from the coordination server to the clients, who perform localized training on their respective datasets. Clients then return their local updates, which are aggregated at the server to produce the next iteration of the global model $w^{t+1}$. The process is repeated over multiple communication rounds, reflecting realistic federated learning dynamics in a heterogeneous fleet.
This simulated environment enables controlled experimentation with different failure scenarios, data corruption patterns, and unlearning strategies, allowing for the comprehensive validation of model accuracy, robustness, and recoverability across decentralized AHM settings.
2.6. Mathematical Framework of Study
The mathematical framework underlying the proposed DT–FL–FU architecture formalizes the core processes of local model training, federated aggregation, and selective unlearning in the context of aircraft health monitoring. This section defines the principal variables, functions, and optimization goals used across the system layers.
2.6.1. Local Model Training at Digital Twin Nodes
Each aircraft subsystem is modeled via a local digital twin $DT_k$, which receives telemetry data $X_k \in \mathbb{R}^{T \times d}$, where $T$ is the time window length and $d$ is the number of sensor channels. The goal of each twin is to learn a local predictive model $f_{w_k}$ that maps an input sequence $X_k$ to an output target $y_k$, where $w_k$ denotes the model parameters.
The model is trained to minimize a local loss function $\mathcal{L}_k(w_k)$.
For binary classification (fault detection), $\mathcal{L}_k$ is the binary cross-entropy (BCE). Given a true label $y \in \{0, 1\}$ and a predicted probability $\hat{y}$, the BCE loss for a single prediction is:

$$\mathcal{L}_{\text{BCE}}(y, \hat{y}) = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right].$$
In the aircraft health-monitoring system, $\mathcal{L}_{\text{BCE}}$ is used to train LSTM models to classify whether a given input sequence (e.g., telemetry data from the hydraulic system) is likely to lead to a fault (label = 1) or not (label = 0). It helps the model learn to predict faults with calibrated probability estimates.
For regression (remaining useful life estimation), the local loss is the mean squared error:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2.$$

Model parameters $w_k$ are updated locally using stochastic gradient descent:

$$w_k \leftarrow w_k - \eta\, \nabla_{w_k} \mathcal{L}_k(w_k),$$

where $\eta$ is the learning rate and $\nabla_{w_k} \mathcal{L}_k(w_k)$ is the vector of partial derivatives of the loss function with respect to each parameter in $w_k$; it is used to guide how the local digital twin model learns from its data.
After local training, the model represents a subsystem-specific approximation of the aircraft’s health state. Its parameter update $\Delta w_k$ is computed and transmitted to the federated coordinator for global aggregation, while raw telemetry data remain securely on-device. This decentralized optimization ensures data privacy while enabling each digital twin to contribute to the global model training process.
2.6.2. Federated Learning Aggregation
The server receives updates from a subset $S_t$ of participating clients and computes the global model $w^{t+1}$ using federated averaging, as follows:

$$w^{t+1} = \sum_{k \in S_t} \frac{n_k}{n}\, w_k^{t+1}, \qquad n = \sum_{k \in S_t} n_k.$$
This yields a weighted global model that incorporates operational heterogeneity across the fleet.
2.6.3. Federated Unlearning Mechanisms
When a corrupted client $c$ is identified, its influence must be removed from the global model. Let $\Delta w_c$ be the last submitted update from client $c$. Three unlearning strategies are formalized:
Gradient subtraction, which directly reverses the logged contribution:

$$w' = w - \lambda\, \Delta w_c,$$

where $\lambda$ is a rollback scaling factor controlling stability;
Exact re-aggregation, which recomputes the aggregate while excluding client $c$:

$$w' = \sum_{k \in S_t \setminus \{c\}} \frac{n_k}{n - n_c}\, w_k^{t+1};$$

Constrained retraining, in which the global model is reinitialized to a previous safe checkpoint and updated using only trusted clients $S_{\text{trusted}}$.
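Of the three strategies, exact re-aggregation is the most direct to express in code; the sketch below assumes the update log retains every client's round-$t$ weights and sample counts (gradient subtraction was sketched earlier in Section 2.1):

```python
# Exact re-aggregation sketch: the round-t global model is recomputed from
# stored client updates while excluding the corrupted client c.
import numpy as np

def reaggregate_without(updates: dict, sizes: dict, corrupted: str) -> np.ndarray:
    """updates/sizes map client id -> stored w_k^{t+1} / n_k from the update log."""
    kept = [k for k in updates if k != corrupted]
    n = float(sum(sizes[k] for k in kept))
    return sum((sizes[k] / n) * updates[k] for k in kept)
```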
2.6.4. Model Performance Metrics
To evaluate and compare model performance across different states (baseline, corrupted, recovered), the following metrics are used.
1. Accuracy—overall correctness:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$ (True Positives) is the number of instances where the model correctly predicts a fault, and the fault is indeed present; $TN$ (True Negatives) is the number of instances where the model correctly predicts no fault, and no fault is actually present; $FP$ (False Positives) is the number of instances where the model predicts a fault, but no fault actually exists; and $FN$ (False Negatives) is the number of instances where the model predicts no fault, but a fault actually exists;
2. Precision—fault prediction correctness:

$$\text{Precision} = \frac{TP}{TP + FP};$$

3. Recall (sensitivity or true positive rate)—fault-detection rate:

$$\text{Recall} = \frac{TP}{TP + FN};$$

4. F1-score—balance between precision and recall:

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}};$$

5. False Positive Rate (FPR)—false alarm rate:

$$FPR = \frac{FP}{FP + TN};$$

6. False Negative Rate (FNR)—missed fault rate:

$$FNR = \frac{FN}{FN + TP};$$

7. Specificity (True Negative Rate)—correctly ignoring non-faults:

$$\text{Specificity} = \frac{TN}{TN + FP}.$$
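These definitions translate directly into code; a small sketch over raw confusion-matrix counts (the zero-division guards are our addition):

```python
# All seven metrics computed from confusion-matrix counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0),
        "fpr": fp / (fp + tn) if fp + tn else 0.0,          # false alarm rate
        "fnr": fn / (fn + tp) if fn + tp else 0.0,          # missed fault rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }
```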
In our case, these formulas allow us to quantify how well the federated model detects actual faults (recall), avoids unnecessary alarms (precision and FPR), maintains overall system reliability (F1-score and accuracy), and ensures safety in critical systems by minimizing false negatives, which could otherwise result in undetected system degradation.
Recovery effectiveness after unlearning is quantified using the following metrics.
1. Model drift:

$$\text{Drift} = \left\| w_{\text{current}} - w_{\text{ref}} \right\|_2,$$

where $w_{\text{current}}$ is the current global model parameters (after corruption or after unlearning), $w_{\text{ref}}$ is the reference model parameters from a clean or known-good state (e.g., before data corruption), and $\|\cdot\|_2$ is the Euclidean norm, i.e., the Euclidean distance between the two sets of parameters. It measures how much the model has changed (or drifted) in parameter space, yielding a single scalar value that indicates the magnitude of parameter deviation;
2. Recovery rate:

$$\text{Recovery Rate} = \frac{F1_{\text{recovered}} - F1_{\text{corrupted}}}{F1_{\text{clean}} - F1_{\text{corrupted}}},$$

where $F1_{\text{clean}}$ is the F1-score of the clean, uncorrupted model before any data poisoning, $F1_{\text{corrupted}}$ is the F1-score after corruption, before unlearning, and $F1_{\text{recovered}}$ is the F1-score after applying an unlearning method.
The recovery rate in this study context defines how effectively the federated unlearning process restores model performance after it was degraded by corrupted or adversarial data contributions. In an AHM system, the recovery rate quantifies the effectiveness of the unlearning strategy in restoring fault-detection performance. It is especially important in aviation, where even small degradations in recall or the F1-score could lead to missed fault detection and reduced safety.
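Both recovery metrics are straightforward to compute; a sketch assuming flat parameter vectors and F1-scores measured on the same clean validation set:

```python
# Recovery metrics as defined above: parameter-space drift and the F1-based
# recovery rate normalizing restored performance against the corruption gap.
import numpy as np

def model_drift(w_current: np.ndarray, w_ref: np.ndarray) -> float:
    return float(np.linalg.norm(w_current - w_ref))  # Euclidean distance

def recovery_rate(f1_clean: float, f1_corrupted: float,
                  f1_recovered: float) -> float:
    # A value of 1.0 indicates full restoration of the clean-model F1-score.
    return (f1_recovered - f1_corrupted) / (f1_clean - f1_corrupted)
```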
2.6.5. Trust and Participation Scoring
An optional but valuable enhancement to the federated learning process in AHM is the integration of trust and participation scoring, where each digital twin node representing an aircraft or subsystem is assigned a dynamic trust score $\tau_k$. This score reflects the client’s behavioral integrity and data reliability over time and is used to regulate participation and influence during aggregation. In the context of AHM, data from different aircraft may vary significantly in quality due to sensor degradation, data transmission issues, inconsistent maintenance logs, or adversarial inputs. Incorporating trust scores helps to mitigate the risk of model contamination by down-weighting or excluding low-trust clients from global updates.
Trust scores can be computed based on metrics such as update similarity to the global model, anomaly score trends, validation loss deviations, rollback frequency due to unlearning, and data continuity. These metrics are aggregated into a composite score that can be updated over time using exponential smoothing or a rule-based reward–penalty mechanism. Clients with trust scores below a threshold may be excluded from participation in a given training round or flagged for future audit or unlearning actions.
The aggregation rule can be modified to include trust weighting as follows:

$$w^{t+1} = \sum_{k \in S_t} \frac{\tau_k\, n_k}{\sum_{j \in S_t} \tau_j\, n_j}\, w_k^{t+1},$$

where $n_k$ is the number of local samples and $\tau_k \in [0, 1]$ modulates the influence of each client’s contribution. This approach increases the resilience of the federated model to noisy or misleading data without fully excluding participants unless warranted.
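A sketch of this trust-weighted variant, including an exponential-smoothing update for $\tau_k$ as described above (the smoothing factor is an illustrative choice):

```python
# Trust-weighted FedAvg variant: each client's aggregation weight is scaled
# by its trust score tau_k, itself updated by exponential smoothing.
import numpy as np

def trust_weighted_aggregate(updates: dict, sizes: dict, trust: dict) -> np.ndarray:
    """updates/sizes/trust map client id -> w_k^{t+1}, n_k, tau_k in [0, 1]."""
    z = float(sum(trust[k] * sizes[k] for k in updates))
    return sum((trust[k] * sizes[k] / z) * updates[k] for k in updates)

def update_trust(tau_prev: float, behavior_score: float,
                 alpha: float = 0.3) -> float:
    """Exponential smoothing of the composite behavioral score."""
    return (1 - alpha) * tau_prev + alpha * behavior_score
```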
In addition to improving the robustness of model training, trust scores support transparency and compliance by enabling traceability and forensic auditing. They can also serve as unlearning triggers when accumulated evidence suggests that a client’s contributions may be corrupt. Within regulated aviation contexts, this scoring mechanism enhances both the technical and regulatory trustworthiness of the system by ensuring that only high-integrity data are integrated into the learning pipeline.
The presented mathematical framework underpins a robust and privacy-preserving learning architecture for aircraft health monitoring by combining local digital twin modeling, federated learning aggregation, and federated unlearning mechanisms. Each aircraft or subsystem acts as an autonomous learning agent, contributing structured and validated updates to a global model while preserving data locality and operational confidentiality. Through precise loss formulations, gradient-based optimization, trust-aware participation, and quantifiable recovery metrics, the system ensures not only predictive accuracy but also resilience against data corruption and model drift. This integrated formulation supports the dynamic and safety-critical nature of aviation maintenance operations, enabling scalable, adaptive, and certifiable intelligence across distributed fleets.
3. Results
3.1. Baseline Federated Model Performance
Before introducing data corruption or unlearning interventions, we first established a baseline to evaluate the performance of the federated learning framework in a clean, fully cooperative environment. All 30 digital twin clients participated in model training using verified and consistent local datasets, representing typical telemetry and operational profiles across various aircraft subsystems. This scenario reflects ideal FL conditions, where no data poisoning, labeling noise, or client-side faults are present.
The federated learning process was executed for 100 communication rounds, with 80% of clients participating randomly in each round. The model under evaluation was an LSTM network configured for binary fault classification based on time-series input sequences. The global model was initialized with uniform weights and updated using the FedAvg algorithm.
During the first 25 rounds, model accuracy increased rapidly as local updates captured subsystem-specific fault patterns and failure dynamics. After round 50, performance gains began to plateau, indicating convergence toward a generalizable model across client domains. Final performance metrics were calculated using a validation set comprising fault-labeled sequences not seen during local training.
The baseline model achieved the following global metrics on the validation dataset:
Accuracy 92.8%, precision 91.2%, recall 90.4%, F1-Score 90.8%, ROC-AUC 0.96, inference latency (average per sample) 23.5 ms (simulated edge environment).
These results confirm that the federated architecture, under ideal conditions, successfully learned generalized failure representations without centralized data pooling. Individual subsystems (e.g., engines vs. hydraulics) demonstrated minor variations in precision-recall tradeoffs due to differences in signal complexity and fault onset patterns; however, the unified model remained well-balanced overall.
To assess the consistency of learning across clients, we measured the cosine similarity between local model updates during rounds 30–50 and the final global model. Similarity scores averaged above 0.91, indicating strong alignment in parameter directions across participating clients. This reinforces that the FL setup converged smoothly and avoided client drift in the clean data regime.
For benchmarking purposes, a centralized model was also trained using concatenated datasets from all clients. The centralized model achieved 93.1% accuracy, which is only marginally higher than the federated result (92.8%), validating that the FL approach incurred a negligible performance tradeoff while maintaining privacy and data locality.
Figure 6 presents the performance graph illustrating the convergence of the federated model over 100 communication rounds under baseline conditions. It shows improvements in accuracy, precision, recall, and in the F1-score, reflecting stable and effective learning in a clean, federated environment.
3.2. Impact of Data Corruption on Model Integrity
To assess the vulnerability of the federated learning process to data integrity issues, a series of experiments were conducted in which a subset of digital twin clients contributed corrupted data to the training process. These simulated scenarios reflect real-world risks encountered in aircraft health monitoring, including sensor malfunctions, telemetry drift, labeling errors, and intentional data poisoning.
Starting from round 10 of the training, 6 out of 30 clients (20%) were selected to introduce one of the following corruption types into their local datasets:
Sensor drift—time-series features were gradually offset to simulate calibration faults (e.g., hydraulic pressure baseline increased by +20% over time);
Label flipping—a portion of anomaly labels (0 → 1 and 1 → 0) were inverted to mimic misclassification due to maintenance documentation errors;
Noise injection—random Gaussian noise was added to signal features, degrading the signal-to-noise ratio;
Adversarial training bias—clients trained on data distributions deliberately skewed to reinforce misleading patterns (e.g., correlating failure with non-failure conditions).
These corruptions were not disclosed to the federated server; all client updates were treated as valid during aggregation.
Compared to the baseline results, the presence of corrupted clients caused a measurable degradation in global model performance. The impact became evident after approximately 15 rounds, when the corrupted updates began influencing the aggregation process.
At round 60, the model evaluation on the clean validation set yielded the following: accuracy 84.1% (a decline of 8.7 percentage points from baseline); precision 80.2% (a decline of 11.0); recall 75.5% (a decline of 14.9); F1-score 77.8% (a decline of 13.0); and ROC-AUC 0.88 (a decline of 0.08).
Notably, the recall metric showed the largest decline, indicating an increased tendency for the model to miss true failure cases—a serious concern in safety-critical aviation applications.
To further analyze the effect of corruption, we measured the cosine similarity between the local gradients of corrupted and clean clients. While clean client updates remained closely aligned (average similarity > 0.90), the updates from corrupted clients diverged significantly (average similarity ≈ 0.42 by round 50). This divergence introduced client drift, which distorted the global optimization trajectory.
Confusion matrix analysis revealed an increase in false negatives (missed faults), particularly in subsystem classes where corrupted clients dominated data representation (e.g., hydraulic and engine control groups). Moreover, early-stage failures (e.g., incipient anomalies) became harder to detect, suggesting that the model was overfitting to misleading patterns introduced by faulty nodes.
These results highlight the fragility of federated models to localized data corruption, even when a relatively small fraction of clients is affected. Because federated learning relies on aggregated updates without direct access to raw data, the server lacks visibility into the quality of individual contributions—making post hoc correction (via federated unlearning, addressed in
Section 3.3) essential.
This scenario also emphasizes the need for client trust scoring, anomaly-aware update filtering, and integrity validation mechanisms within real-world aviation FL deployments.
Figure 7 presents the confusion matrix of the global federated model after the integration of corrupted updates from 20% of the digital twin clients. The matrix compares the model’s predicted fault classifications against the actual fault labels in the validation dataset. Notably, an increased number of false negatives is observed, where true fault cases (label = 1) were incorrectly classified as non-faults (predicted = 0). These false negatives represent critical missed detections that could allow actual degradation or failure conditions to go unnoticed during operation—an unacceptable outcome in safety-critical aviation contexts. The figure highlights how the presence of corrupted client contributions disproportionately impacts the model’s ability to correctly identify early-stage or low-signal anomalies, leading to a measurable decline in recall and fault sensitivity. This degradation underscores the necessity of robust unlearning mechanisms to maintain diagnostic accuracy across the federated AHM system.
3.3. Federated Unlearning Application and Effects
Following the performance degradation observed in the presence of corrupted client contributions, federated unlearning techniques were applied to remove the influence of the six previously identified faulty clients. This section evaluates the effectiveness, efficiency, and impact of unlearning operations in restoring model performance and integrity within the FL–DT–FU framework.
Federated unlearning was initiated at round 60, after confirming the negative impact of corrupted client updates on the global model. Three FU methods were tested:
Gradient subtraction (approximate unlearning);
Exact re-aggregation (precise rollback of aggregation);
Constrained retraining (using clean clients only, from the round 60 checkpoint).
Each method was executed independently on identical model snapshots, allowing for controlled comparison.
The unlearning procedures led to a significant recovery in the model’s quality.
Table 1 summarizes the key metrics before corruption (baseline), after corruption, and after each unlearning strategy.
Among the methods, constrained retraining provided the most accurate recovery, nearly matching the baseline model, but at the cost of significantly higher computational time. Exact re-aggregation achieved comparable results with less overhead. Gradient subtraction, while less precise, still restored a large portion of model fidelity rapidly and with minimal resource requirements.
Post unlearning, the corrected models were evaluated on the same clean validation dataset. In all cases, the false negative rate, previously elevated due to corrupted updates, was significantly reduced, particularly for early-stage fault conditions. ROC curves showed a return to smooth, well-separated distributions between faulty and normal sequences, indicating successful rebalancing of model decision boundaries.
The cosine similarity of the recovered model with the clean baseline increased from 0.76 (post-corruption) to 0.91 (gradient subtraction) and 0.97 (re-aggregation), confirming the re-alignment of the model parameters toward their intended optimization path. Moreover, federated round participation resumed without rejection or penalty for clean clients, demonstrating that FU preserved continuity in collaborative learning.
All FU operations were logged in the system’s audit trail, including metadata on the affected clients, method applied, update timestamps, and validation outcomes. A review of the audit records confirmed that each unlearning event was traceable and verifiable—supporting use cases where model correction must be demonstrated for certification or regulatory oversight.
Figure 8 presents the receiver operating characteristic (ROC) curve comparing model performance across five scenarios: the clean baseline; post corruption; and after applying the three federated unlearning strategies. The improvement in area under the curve (AUC) values following unlearning demonstrates effective restoration of the model’s fault-detection capability.
Figure 9 illustrates the comparative performance of the global AHM model under five scenarios: the clean baseline; the degraded post-corruption state; and the three recovery strategies using distinct FU methods—gradient subtraction, exact re-aggregation, and constrained retraining.
The baseline represents the model’s performance trained exclusively on verified, clean data from all clients, achieving optimal accuracy, precision, recall, and F1-score. The post-corruption scenario includes contributions from 20% of clients affected by sensor drift, label inversion, and adversarial noise, resulting in significant drops across all performance metrics, particularly in recall.
To mitigate this degradation, three FU methods are applied:
Gradient subtraction removes corrupted influence by subtracting the faulty client updates from the global model—fast but approximate;
Exact re-aggregation reconstructs the global model by excluding the affected clients’ updates—precise and efficient;
Constrained retraining reinitializes training from the last clean checkpoint using only validated clients—most accurate but resource-intensive.
Figure 9 demonstrates that all FU strategies substantially improve model performance compared to the corrupted state, with constrained retraining restoring metrics closest to the original baseline.
To better characterize the robustness of the proposed federated unlearning mechanism, 95% confidence intervals were computed for the observed recovery metrics across 10 independent simulation runs with randomized client orderings and gradient initialization seeds. The mean accuracy restoration after rollback was 93.4% with a 95% confidence interval of [91.8%, 95.1%], confirming that up to 95% restoration lies within the expected statistical variation.
3.4. Case Study: Faulty Sensor Series in Hydraulic System Monitoring
To demonstrate the real-world applicability of the proposed FL–DT–FU framework, a focused case study was conducted on the hydraulic system monitoring subsystem of an aircraft fleet. This subsystem plays a critical role in flight control actuation and gear operation and is commonly equipped with multiple pressure and flow sensors that report operational status in real time.
A single client (representing one aircraft’s digital twin) was selected to simulate a systematic sensor degradation fault over a sequence of federated rounds. The client originally functioned as part of the clean training population, contributing consistent, labeled telemetry data across the first 20 rounds. Beginning at round 21, the client’s pressure sensors began exhibiting a progressive calibration drift, resulting in a baseline offset of +15% across all readings.
This drift was not immediately identified at the local level due to the slow and consistent rate of change. However, as the client continued to participate in federated updates, its contributions began to bias the global model toward misclassifying degraded but operational states as nominal.
By round 40, a noticeable decline in fault-detection sensitivity was observed across all clients when evaluating test sequences involving early-stage hydraulic failures. Specifically, the global model’s recall for hydraulic faults dropped from 91.1% to 78.3%, and false negatives increased by 34%, particularly in cases involving low-pressure anomalies during descent and gear extension phases.
Further analysis revealed that the model had learned to tolerate slightly elevated pressure values as part of the “normal” class, effectively masking early failure symptoms due to the cumulative weight of the corrupted updates from the drifting sensor client.
At round 50, the anomaly was flagged by the system’s governance layer due to a divergence in model update patterns and a post hoc maintenance report confirming sensor miscalibration. A federated unlearning procedure was initiated using the exact re-aggregation method, which removed all the affected client’s updates from rounds 21 to 49 and recomputed the global model accordingly.
The results were immediate and substantial:
Recall for hydraulic faults recovered to 89.6%;
False negatives reduced by 29%;
Anomaly detection confidence scores realigned with baseline thresholds.
The unlearning event was logged and certified within the audit layer, with traceable documentation linking the update rollback to both the physical maintenance report and the internal telemetry anomaly scoring history.
This case study highlights several important features of the FL–DT–FU architecture:
Digital twins enable local representation of temporal anomalies, even when human oversight is delayed or unavailable;
Federated learning, while privacy-preserving, is inherently vulnerable to long-tail sensor degradation unless paired with validation and correction mechanisms;
Federated unlearning provides a non-destructive and efficient method for restoring model integrity after faulty contributions, avoiding full retraining;
The auditability of unlearning actions is essential for downstream compliance, particularly in regulated aviation environments where decisions may be subject to post-event review.
This example demonstrates that federated unlearning is not only a theoretical improvement but a practical requirement for maintaining trustworthy machine learning operations in complex, distributed aviation systems.
Figure 10 illustrates a simulated example of progressive sensor drift affecting hydraulic pressure readings within an aircraft’s digital twin system. The
x-axis represents federated learning rounds, corresponding to successive training iterations, in which the digital twin client contributes local model updates to the global federated learning process. The
y-axis indicates the reported hydraulic pressure in pounds per square inch (psi), a critical parameter for monitoring the integrity of aircraft control and landing systems.
The gray dashed line represents the true pressure baseline, assumed to be stable at 3000 psi. This reflects nominal system behavior without sensor faults.
The blue line represents the drifted sensor readings. Beginning at round 21, a systematic calibration drift is introduced, causing the sensor to report increasingly inflated values over time. The drift progresses linearly, reaching a deviation of +450 psi by round 50.
A vertical red dotted line marks the onset of drift at round 21. This point corresponds to the transition where the digital twin begins introducing biased training data into the federated learning process, unknowingly misrepresenting the true physical condition of the hydraulic system.
As the sensor drift continues, the local model learns from incorrectly elevated readings and transmits updates that gradually skew the global model. This leads to attenuation of early-stage failure signals, as pressure values that should be interpreted as anomalous are now falsely classified as normal.
Figure 10 underscores the insidious nature of slow, undetected sensor drift in distributed learning systems. Because the drift develops gradually, it can bypass local validation checks; its impact on global model behavior may not become apparent until predictive performance significantly declines. This scenario exemplifies the need for federated unlearning mechanisms to detect, isolate, and reverse the influence of corrupted data sources without requiring complete retraining of the system.
Figure 11 presents the bar chart comparing model performance before and after applying federated unlearning (exact re-aggregation). The visualization highlights substantial improvements across all metrics, especially in recall, demonstrating the effectiveness of unlearning in restoring model integrity after corruption.
Figure 12 presents a time-series visualization of validation accuracy across 100 federated learning rounds, annotated with key events that characterize the model’s lifecycle within the FL–DT–FU framework. The figure demonstrates how the global model evolves over time in response to client contributions, data corruption, and corrective unlearning actions. The
x-axis represents the progression of FL communication rounds (1 to 100), while the
y-axis shows the model’s validation accuracy (%), measured consistently on a clean, withheld dataset.
The accuracy curve is divided into three distinct behavioral phases:
Baseline Training Phase (Rounds 1–20). During this initial stage, all client nodes contribute high-quality, validated data. The model demonstrates steady improvement in validation accuracy, increasing from approximately 75% to 81%, reflecting the successful integration of diverse subsystem information across the aircraft fleet;
Corruption Phase (Rounds 21–59). Beginning at round 21, one or more clients start contributing corrupted data due to a progressive sensor drift or label misclassification (as described in
Section 3.2 and
Section 3.4). Although the model continues to receive updates, its performance deteriorates over time. This is visible in the declining curve, which drops from 81% to approximately 78% by round 59. The curve reflects model confusion and bias, particularly in fault-detection tasks, due to the inclusion of misleading or inaccurate information;
Recovery Phase (Rounds 60–100). At round 60, federated unlearning is triggered. The server identifies and removes the impact of the corrupted client’s contributions using the exact re-aggregation method. Following this correction, the global model begins to recover, with accuracy climbing back to above 90% by round 100. This rebound illustrates the effectiveness of the unlearning mechanism in restoring the predictive quality of the system without full retraining.
Each significant event in the model’s lifecycle is marked by a vertical dashed line and annotated with labels:
Baseline training at round 1;
Corruption begins at round 21;
Performance decline observed at round 40;
Unlearning triggered at round 60;
Recovery phase from round 61 onward.
Figure 12 emphasizes the temporal dynamics of model performance in real-world FL deployments and demonstrates the critical role of federated unlearning in mitigating long-term damage from erroneous or adversarial data sources.
4. Discussion
4.1. Advantages of Federated Unlearning in AHM Systems
The integration of federated unlearning into AHM systems introduces a transformative capability for maintaining the trustworthiness, adaptability, and compliance of predictive maintenance models in distributed aviation environments. While federated learning offers a scalable, privacy-preserving approach to model development across multiple stakeholders, it inherently lacks the ability to retract knowledge once it has been aggregated, a limitation that federated unlearning directly addresses.
One of the most significant advantages of FU in the AHM context is its ability to surgically remove the influence of corrupted or misleading data from global models without requiring full retraining. As demonstrated in
Section 3, even a small number of faulty clients, due to sensor degradation, mislabeled faults, or data poisoning, can substantially degrade model accuracy and increase the risk of undetected failures. Federated unlearning provides a targeted remedy by enabling the system to roll back or adjust the model parameters to exclude these harmful contributions.
This capability enhances model integrity and safety assurance, especially in scenarios where real-time data quality validation is infeasible. By using stored update histories and participation metadata, FU enables aviation operators to correct systemic errors promptly, preserving model performance without interrupting operations.
Another critical benefit lies in lifecycle resilience. AHM models deployed in dynamic, long-lifecycle systems, such as aircraft, must remain responsive to evolving operational contexts, changing sensor configurations, and retroactive maintenance findings. FU allows for these models to remain flexible and updatable, supporting not only adaptation but retraction, which is essential when historical data are reclassified or invalidated.
FU supports compliance with regulatory and ethical standards. In civil aviation, maintaining traceability and auditability of digital systems is paramount. FU operations can be transparently logged and linked to maintenance events or compliance reports, offering regulators and operators verifiable evidence of corrective action. This transparency aligns with the principles of explainable AI and safety-critical software assurance and enhances stakeholder confidence in automated decision-support systems.
From an architectural standpoint, FU also contributes to scalability and operational efficiency. Instead of retraining global models from scratch, which is computationally expensive and logistically disruptive, FU methods, such as gradient subtraction or exact re-aggregation, offer lightweight alternatives for rapid mitigation. These methods allow federated AHM systems to maintain performance standards while minimizing cloud bandwidth, model downtime, and energy consumption.
Federated unlearning transforms federated AHM systems from static, one-way learning architectures into adaptive, correctable, and auditable systems. Its advantages are particularly valuable in the aviation domain, where safety margins are tight, failure prediction is mission-critical, and the costs of model error are extremely high.
4.2. Trade-Offs and Design Considerations
While federated unlearning offers compelling benefits for improving reliability and accountability in federated AHM systems, its deployment introduces a series of technical trade-offs and architectural design considerations that must be carefully balanced. These trade-offs span dimensions such as precision versus efficiency, model stability versus adaptability, and transparency versus complexity.
1. Precision vs. Efficiency
The most fundamental trade-off in FU lies between unlearning precision and computational efficiency. Exact methods, such as re-aggregation or constrained retraining, offer high fidelity in removing unwanted model influence, but at the cost of increased memory usage (e.g., storing client gradients or intermediate model states) and computation time. In contrast, approximate methods, such as gradient subtraction or influence reweighting, are computationally lightweight and operationally fast but may only partially mitigate the contamination effects, especially when the corruptions occurred early in training or were aggregated over multiple rounds.
System designers must therefore choose an FU strategy appropriate to the operational context:
Mission-critical systems (e.g., in-flight fault detection) may demand exact rollback despite higher cost;
Real-time ground analysis tools may tolerate approximate unlearning if it ensures continuity and minimal latency;
2. Detection Sensitivity vs. False Alarm Risk
Another design challenge arises in determining when to trigger unlearning. Unlearning actions are often initiated in response to suspected data corruption, model drift, or post hoc validation failures. However, aggressive or mis-calibrated detection thresholds may lead to false positives, unnecessarily removing valuable client contributions and harming model performance.
Conversely, delayed or conservative unlearning actions can allow faulty data to continue influencing model behavior, increasing the risk of safety-critical errors. To manage this trade-off, AHM systems should incorporate multi-level validation strategies, combining anomaly detection scores, model divergence metrics, and maintenance log audits before initiating FU operations;
3. Model Stability vs. Adaptability
Federated unlearning introduces model perturbations by design. While it restores integrity by removing harmful updates, it may also destabilize the model’s convergence trajectory, particularly when performed repeatedly or without post-unlearning fine-tuning. In systems with continuously evolving data distributions (e.g., new aircraft configurations, updated sensors), there is a risk that unlearning introduces unwanted regressions in model performance for unrelated subsystems.
To mitigate this, systems should be equipped with revalidation pipelines and fallback checkpoints, ensuring that the post-unlearning model is both stable and generalizable. In highly dynamic fleets, unlearning should be followed by controlled rounds of reinforcement training using verified clients;
4. Transparency vs. Complexity in Governance
From a systems management perspective, federated unlearning demands additional governance infrastructure. Audit trails, rollback logs, version control, and compliance reporting must all be tightly integrated to support traceability and accountability. However, these mechanisms increase system complexity and administrative overhead, particularly when deployed across large fleets with heterogeneous data management standards.
Therefore, system architects must adopt modular FU architectures that enable integration with existing AHM governance tools (e.g., digital logbooks, maintenance records) without overloading operational workflows. The use of standard formats for update hashing, logging metadata, and model lineage records can help to balance auditability with system simplicity;
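One lightweight way to realize such standardized records is a hash-chained audit ledger, sketched below using only the Python standard library; the record fields are illustrative, not a proposed standard.

```python
import hashlib
import json
import time

def log_unlearning_event(ledger, model_bytes, client_id, reason):
    """Append a tamper-evident audit record: each entry hashes the
    post-unlearning model state and chains to the previous entry,
    so any later modification of the log is detectable."""
    prev_hash = ledger[-1]["entry_hash"] if ledger else "0" * 64
    entry = {
        "timestamp": time.time(),
        "client_id": client_id,
        "reason": reason,
        "model_hash": hashlib.sha256(model_bytes).hexdigest(),
        "prev_hash": prev_hash,
    }
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    ledger.append(entry)
    return entry
```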
5. Storage vs. Scalability
Some FU strategies—particularly exact unlearning—require persistent access to historical model updates, client-specific gradients, or local loss trajectories. In long-lived AHM systems, storing these metadata across hundreds of clients can challenge scalability and raise data retention policy issues, forcing a trade-off between the depth of retained history and long-term system scalability.
Techniques such as selective checkpointing, summary gradient representations, and temporal decimation can help to strike this balance, allowing for scalable yet effective FU implementation.
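As a concrete instance, temporal decimation can be sketched in a few lines; the retention parameters are illustrative.

```python
def decimate_checkpoints(checkpoint_rounds, keep_recent=10, stride=5):
    """Temporal decimation: retain every checkpoint from the most recent
    `keep_recent` rounds, but only every `stride`-th older checkpoint,
    bounding storage growth while preserving coarse rollback points."""
    older = checkpoint_rounds[:-keep_recent][::stride]
    recent = checkpoint_rounds[-keep_recent:]
    return older + recent

# e.g., 100 rounds -> 18 decimated older checkpoints + 10 recent = 28 retained
retained = decimate_checkpoints(list(range(100)))
```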
The practical application of federated unlearning in AHM systems involves more than selecting an algorithm; it requires thoughtful engineering trade-offs that align with system goals, safety requirements, operational constraints, and regulatory standards. When these design considerations are addressed holistically, FU can function not merely as a recovery tool, but as an essential component of resilient, adaptive, and trustworthy federated AHM infrastructure.
4.3. Forgetting Confidential Data as FU Secondary Role
While the primary motivation for implementing federated unlearning in AHM systems lies in mitigating the impact of corrupted or compromised data contributions, FU also plays a significant secondary role in supporting the removal of confidential or sensitive information upon request. This capability is particularly relevant in collaborative aviation ecosystems involving multiple stakeholders, such as airlines, MRO providers, OEMs, and regulatory bodies, each of which may impose distinct data governance, ownership, and privacy requirements.
In federated learning environments, raw data are never shared; however, derived knowledge from local datasets (e.g., gradient updates or model weights) still reflects statistical traces of the source data. As such, if a participating organization or regulatory authority later determines that specific data must be withdrawn (for instance, due to contractual obligations, legal discovery, intellectual property disputes, or privacy protections), FU provides the technical means to honor these obligations by removing the data’s influence from the trained global model.
This aspect of FU is especially valuable considering emerging and existing regulatory frameworks such as:
The General Data Protection Regulation (GDPR), which enshrines the “right to be forgotten” in European jurisdictions;
National security restrictions on dual-use aerospace data;
Cross-border data agreements that may restrict retention or derivative use of localized operational records.
Through methods such as gradient masking, secure rollback, or constrained retraining, FU can selectively erase model contributions associated with a particular client or dataset, thereby supporting compliance-driven model correction without compromising the privacy-preserving nature of federated learning.
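For the client-level case, a minimal sketch of exact re-aggregation for a single round is given below; it assumes (hypothetically) that per-client updates and dataset sizes were retained for that round, consistent with the storage trade-offs discussed in Section 4.2.

```python
import numpy as np

def reaggregate_without_client(round_client_updates, client_sizes, excluded_id):
    """Exact re-aggregation for one FL round: recompute the FedAvg
    weighted average over all clients except the one whose data
    must be forgotten."""
    kept = [cid for cid in round_client_updates if cid != excluded_id]
    total = sum(client_sizes[cid] for cid in kept)
    agg = {}
    for cid in kept:
        weight = client_sizes[cid] / total
        for name, params in round_client_updates[cid].items():
            agg[name] = agg.get(name, 0.0) + weight * params
    return agg
```

In a full pipeline, this per-round step would be replayed forward from the first round in which the excluded client participated.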
Moreover, this capability supports ethical AI practices in aviation digital systems by reinforcing transparency, autonomy, and accountability. Aircraft operators may, for example, request the removal of flight-specific health records from collaborative learning pools due to incidents under investigation or to ensure competitive data confidentiality. FU enables such selective removal without requiring full retraining or undermining the performance of the shared model across the broader fleet.
However, this secondary use case also introduces operational and legal design challenges. Specifically, systems must be equipped with:
Traceable participation records, linking model updates to specific data origins;
Verification mechanisms to confirm that unlearning requests are legitimate, authorized, and precisely targeted;
Versioned audit logs, documenting when and how data were removed from the model for compliance audits.
While the central contribution of FU in AHM systems is to safeguard model integrity from corrupted data, its ability to remove sensitive or confidential information post hoc provides critical legal and ethical safeguards. This dual capability (technical correction and regulatory compliance) positions federated unlearning as a foundational component of next-generation, trusted AI systems in aviation.
While this study emphasizes federated unlearning as a foundational mechanism for maintaining model integrity in safety-critical aviation environments, it is important to contextualize FU alongside established robust aggregation defenses such as Krum and trimmed mean strategies. These aggregation strategies proactively reduce the impact of corrupted or adversarial updates during the learning process by down-weighting or filtering suspicious contributions before they are incorporated into the global model.
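For reference, a minimal sketch of coordinate-wise trimmed mean aggregation is given below; the 20% trim ratio mirrors the corruption level used in the experiment but is an illustrative parameter.

```python
import numpy as np

def trimmed_mean_aggregate(client_updates, trim_ratio=0.2):
    """Coordinate-wise trimmed mean: for each parameter, discard the
    largest and smallest `trim_ratio` fraction of client values before
    averaging, limiting the leverage of any single corrupted update."""
    stacked = np.stack(client_updates)        # shape: (n_clients, n_params)
    n_clients = stacked.shape[0]
    k = int(n_clients * trim_ratio)
    sorted_vals = np.sort(stacked, axis=0)    # sort each coordinate independently
    trimmed = sorted_vals[k:n_clients - k] if k > 0 else sorted_vals
    return trimmed.mean(axis=0)
```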
To provide a quantitative perspective, additional simulation runs were performed on the same experimental setup described in Section 2.5. When 20% of the clients introduced corrupted updates, Krum preserved model accuracy at 88.7% and recall at 82.1%, while trimmed mean achieved 87.9% accuracy and 80.5% recall. These results show meaningful robustness compared to naïve aggregation, which degraded accuracy to 84.1% and recall to 75.5%.
However, because robust aggregation is proactive and operates only during the aggregation phase, it cannot retrospectively remove harmful updates once they have been accepted. By contrast, FU methods applied after contamination—such as gradient subtraction, exact re-aggregation, and constrained retraining—restored performance closer to the clean baseline. Exact re-aggregation improved accuracy to 90.7% and recall to 87.6%, while constrained retraining achieved near-baseline results of 92.5% accuracy and 90.2% recall.
Figure 13 illustrates these comparative results, showing side-by-side accuracy and recall values for each method. The figure also includes dashed reference lines for the baseline model’s performance (accuracy 92.8%, recall 90.4%), providing a clear visual indication of how closely each method approaches the original, uncontaminated state.
This comparison highlights that robust aggregation and FU methods are not competing but complementary. Robust aggregation reduces the probability of corruption at ingestion, whereas FU offers a corrective mechanism to surgically remove the residual effects of any harmful contributions. Together, these approaches provide a stronger and more resilient foundation for safety-critical federated aviation intelligence.
4.4. Broader Implications for Federated Aviation Intelligence
The integration of federated unlearning within aircraft health-monitoring systems has implications that extend far beyond isolated model corrections. It represents a foundational enhancement to the broader concept of federated aviation intelligence—an emerging paradigm in which distributed, privacy-preserving machine learning systems operate collaboratively across diverse aviation stakeholders to generate real-time, adaptive, and secure operational insights.
1. Strengthening Cross-Organizational Trust
Aviation ecosystems are inherently multi-actor, involving airlines, aircraft manufacturers, maintenance providers, regulatory authorities, and airports. Federated learning enables these actors to co-develop predictive models without centralizing sensitive data. However, until now, one of the fundamental concerns in such collaborations has been the irreversibility of model contamination, where a single corrupted client or unauthorized participant can compromise shared intelligence.
Federated unlearning introduces an assurance mechanism for participants; contributions can be revoked if later found to be harmful, incorrect, or unauthorized. This elevates the trustworthiness and governance readiness of federated AI systems, making them more attractive for large-scale, inter-organizational adoption;
2. Building Self-Healing Predictive Systems
With FU, federated models gain self-healing properties—the ability to recover from corrupted learning paths without requiring manual resets or full retraining. This is particularly valuable in aviation environments, where new aircraft variants, updated components, or evolving operational contexts frequently render older datasets obsolete or even misleading. FU allows the model to forget patterns that are no longer relevant and adapt more fluidly to operational drift, enabling resilient, continuously improving intelligence pipelines;
3. Enabling Regulatory-Ready AI Infrastructure
The aviation sector operates under some of the strictest safety and compliance regulations in the world. Any data-driven decision-support system, particularly one that influences maintenance, dispatch, or flight safety, must be transparent, traceable, and correctable. Federated unlearning adds a layer of model accountability that complements traceability and version control, addressing key expectations from regulatory agencies, such as EASA, FAA, and ICAO, regarding data use, model validation, and post-deployment adaptation.
This positions federated unlearning as a compliance enabler, helping federated aviation intelligence systems to meet the criteria for safety-critical deployment while satisfying evolving data protection laws, audit requirements, and technical oversight mandates;
4. Supporting Federated Twin Networks at the Fleet Scale
As the industry moves toward federated digital twin networks, where each aircraft maintains its own digital replica that communicates with a fleet-level intelligence system, FU plays a pivotal role in ensuring long-term system coherence. It ensures that faults introduced by faulty sensors, bad firmware updates, or ground-level data anomalies can be corrected retrospectively, preserving the semantic and behavioral alignment of the fleet-wide predictive model.
Over time, FU can be combined with federated learning personalization techniques to build hybrid models that are both globally optimized and locally adaptive, with mechanisms to selectively forget outdated configurations while preserving mission-relevant knowledge. This opens the door to dynamic, continuously certifiable models, where the model’s evolution is governed not just by optimization logic, but also by verifiable correction histories;
5. Strategic Implications for AI-Powered Aviation Infrastructure
At a strategic level, federated unlearning reinforces the sustainability and robustness of AI-powered aviation infrastructure. It ensures that AI deployments are not locked into static, non-reversible learning states, but instead evolve as adaptive, compliant, and ethically governed systems. As aviation stakeholders increase investment in AI for predictive maintenance, route optimization, airspace management, and operational risk forecasting, FU provides a key capability to manage both technical debt and institutional risk.
Embedding unlearning into the core of federated architectures enables the aviation sector to accelerate the adoption of AI at scale while maintaining the levels of resilience, compliance, and cross-stakeholder trust required for mission-critical operations.
4.5. Challenges, Limitations, and Future Research Directions
While the proposed FL–DT–FU framework demonstrates strong potential for improving the reliability, accountability, and resilience of AHM systems, this study also reveals several technical challenges and methodological limitations that must be addressed in future research to enable large-scale, real-world deployment.
The experimental results presented in this study were generated using a simulated federated environment with synthetic and semi-realistic data. While care was taken to replicate the statistical properties of actual aviation telemetry and failure distributions, the findings may not fully reflect the complexities, irregularities, and edge cases found in operational fleets. Real-world federated deployments introduce challenges such as non-synchronous updates, hardware heterogeneity, communication delays, and variability in data quality across aircraft types. Future research should therefore focus on real-world pilot deployments, incorporating live data streams from digital twin-enabled aircraft and diverse operational environments.
Current federated unlearning methods operate primarily at the client-level granularity, where the entire contribution of a client (i.e., an aircraft or node) is removed. However, in practice, only portions of a client’s data may be faulty, outdated, or sensitive. Removing full contributions may lead to unnecessary knowledge loss or model instability, particularly in smaller or imbalanced training populations. There is a clear need to develop fine-grained unlearning techniques, capable of removing specific data sequences or sub-model effects without destabilizing the global model.
Although unlearning provides a mechanism to correct known data contamination, it remains largely reactive in nature. The framework does not currently include proactive defenses against stealthy or adaptive adversarial attacks that may attempt to evade detection thresholds or mimic valid behavior patterns. Integrating federated unlearning with federated adversarial training, robust aggregation rules, and reputation-based client scoring could enhance long-term resilience to both accidental and intentional corruption.
The study identifies several key trade-offs (e.g., precision vs. efficiency, stability vs. adaptability); however, it does not yet implement a formal optimization framework for selecting unlearning strategies based on system goals or operational context. Future work should focus on developing multi-objective decision models that can dynamically recommend or orchestrate FU strategies across heterogeneous nodes, balancing safety, performance, and compliance requirements in real time.
While the study proposes a combined FL–DT–FU architecture, it does not yet fully explore the multi-layered interactions between digital twin lifecycle management and federated learning processes. Open questions remain regarding how unlearning decisions affect twin state synchronization, fault propagation modeling, and update timing coordination across the fleet. Future research should explore bi-directional coupling mechanisms, where twin-level anomaly detection and lifecycle events directly inform federated model trust calibration and vice versa.
5. Conclusions
This study introduced an integrated framework combining federated learning, digital twin modeling, and federated unlearning to enhance the robustness, transparency, and adaptability of AHM systems. Addressing the limitations of conventional federated learning, particularly the inability to remove corrupted or outdated data contributions, federated unlearning emerges as a crucial mechanism for ensuring model integrity in safety-critical aviation environments.
The proposed system architecture enables distributed learning across aircraft-specific digital twins while preserving data locality and complying with privacy and governance requirements. Through simulated experiments, this study has demonstrated that the incorporation of FU significantly mitigates the negative impact of faulty data, such as sensor drift or mislabeling, on predictive model performance. The results also show that FU can effectively restore validation accuracy and classification confidence without requiring full retraining or compromising system scalability.
Beyond technical performance, the study highlighted the broader implications of FU in aviation intelligence, including improved trust in cross-organizational collaborations, self-healing AI capabilities, regulatory readiness, and the long-term sustainability of federated digital ecosystems. The decision matrix and conceptual diagrams provide practical guidance for system designers navigating trade-offs between model precision, efficiency, traceability, and scalability.
At the same time, the study acknowledges several limitations, including reliance on simulated data, coarse unlearning granularity, and the lack of integration with regulatory certification workflows. These limitations underscore the need for future research focused on fine-grained unlearning, real-world validation, and formal regulatory alignment.
Federated unlearning represents a critical enabler of trustworthy AI in aviation. It transforms federated AHM systems from static learners into dynamic, correctable, and auditable platforms, capable of supporting safe, resilient, and ethical decision-making across the lifecycle of aircraft operations.