1. Introduction
Electrical power systems are the backbone of modern economies, and substations play a central role in ensuring reliable, safe, and efficient electricity delivery [1]. However, the rapid integration of renewable energy, increasing grid complexity, and the accelerated aging of critical assets intensify the risk of faults and performance degradation—challenges widely reported in large-scale prognostic studies [2]. Traditional maintenance strategies, such as corrective maintenance and interval-based preventive maintenance, have therefore become insufficient for guaranteeing resilience in modern substation environments [3].
Recent surveys emphasize that data-driven PdM must progress beyond isolated local models into integrated, interoperable, and explainable intelligence capable of handling multi-modal signals, evolving operating conditions, and real-time situational constraints [1,4,5]. Within this context, semi-supervised anomaly detection, hybrid ML–DL modeling, and uncertainty-aware inference have emerged as essential capabilities for robust deployment in real industrial environments [6,7,8].
Recent DT research demonstrates reliability optimization in production lines, DT-based smart machine-tool control, comparative synchronization fidelity across DT engines, data reduction via metaheuristics for scalable simulation, and energy-aware dashboards for cyber–physical facilities—establishing a clear pathway to translate predictive analytics into trusted, auditable service layers [9,10,11,12,13,14]. These advances motivate a layered DT-enabled PdM architecture tailored to substation automation, where IEC 61850, CIM, and OPC UA Part 17 ensure semantic and operational interoperability, and IEC 62443 anchors cybersecurity and trust.
The proposed framework uniquely integrates the following: (i) semantic interoperability using IEC 61850/CIM/OPC UA Part 17; (ii) defense-in-depth cybersecurity enforcement aligned with IEC 62443 and NERC Critical Infrastructure Protection (CIP) Reliability Standard CIP-015-1 (Cyber Security—Internal Network Security Monitoring), which mandates internal network security monitoring to improve detection of anomalous or unauthorized activity; (iii) stacked ensemble models for enhanced prediction robustness; and (iv) a decision-support layer capable of maintaining synchronized, real-time operation with the substation's SCADA/operational technology (OT) infrastructure.
This work presents a deployment-oriented Digital Twin-enabled predictive maintenance (DT–PdM) architecture for substation automation that is aligned with IEC 61850, CIM, and OPC UA Part 17 and validated using multi-year, utility-grade operational data from the SS1 substation of the Badra Oil Field (2021–2025; 1 million records; 139 confirmed fault events), demonstrating feasibility under practical SCADA/OT constraints and cyber-secure governance.
To the best of our knowledge, this is among the first real-world validations of a standards-aligned DT–PdM architecture integrating IEC 61850/CIM/OPC UA with cybersecurity governance and large-scale OT data from an operating utility substation.
The main contributions of this study are summarized as follows:
We propose a five-layer, deployment-ready, standards-aligned Digital Twin-enabled predictive maintenance (DT–PdM) architecture for substation automation, unifying OT acquisition with semantic interoperability (IEC 61850, CIM, and OPC UA) and cybersecurity-aligned decision support (IEC 62443) within a single operational framework.
We demonstrate large-scale, utility-grade real-world validation on the SS1 substation of the Badra Oil Field using ≈1 million multivariate operational records and 139 confirmed fault events, moving beyond simulation-only or laboratory-scale DT-PdM studies and confirming feasibility under practical SCADA/OT constraints.
We benchmark RF, GBM, SVM, DNN, and a stacked ensemble and identify the best-performing operational model using the F1-score for imbalanced fault detection while explicitly accounting for inference feasibility within a 60 s supervisory monitoring loop.
We provide an operation-oriented, human-in-the-loop decision layer linking predictive outputs to maintenance prioritization through composite scoring and cyber-trust-aware governance, supporting interpretable, auditable, and regulation-aligned maintenance actions.
2. Predictive Maintenance and Digital Twin Integration in Substation Automation
Predictive maintenance (PdM) enables utilities to transition from corrective and time-based interventions toward proactive asset management by exploiting real-time monitoring data and advanced analytics for anomaly detection, remaining useful life (RUL) prediction, and outage prevention [15,16]. Recent advances in machine learning, particularly deep learning architectures such as Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNNs), and hybrid ensemble models, have demonstrated high performance in detecting multi-modal fault signatures under varying operating conditions [17,18].
Recent predictive maintenance studies have explored a wide range of artificial intelligence models, including recurrent neural networks such as LSTM and the Gated Recurrent Unit (GRU) for temporal dependency modeling, transformer-based architectures for long-range sequence learning [19,20], Bayesian and evidential deep learning for uncertainty-aware inference, and graph-based models for topology-aware asset representation [21]. While these approaches have demonstrated promising performance in specific contexts, their deployment in operational substation environments often involves increased model complexity, higher computational costs, substantial data-labeling requirements, and reduced interpretability [22].
In the present study, the selection of Random Forest, Gradient Boosting, the SVM, the DNN, and a stacked ensemble is motivated by the need to balance predictive accuracy, robustness, explainability, and real-time deployability under utility-grade constraints. In particular, tree-based models provide transparent feature-importance insights, Deep Neural Networks capture nonlinear degradation patterns, and the ensemble framework enhances generalization stability without imposing excessive inference latency. This model selection strategy is therefore well suited for integration within a DT operating on SCADA-level data with minute-scale update cycles while maintaining compatibility with practical engineering and cybersecurity requirements.
Traditional PdM frameworks, however, often lack seamless integration with OT systems such as SCADA and Intelligent Electronic Devices (IEDs), resulting in limited real-time applicability [16]. This leads to suboptimal data utilization, weak situational awareness, and delayed decision-making in substation environments.
Digital Twin (DT) technology has emerged as an enabler for synchronized cyber–physical intelligence, providing dynamic virtual replicas of electrical assets capable of real-time emulation, system-level diagnostics, and maintenance scenario testing [3,23]. DT-driven cyber–physical synchronization allows utilities to evaluate fault propagation, validate recovery actions, and optimize maintenance decisions before actions occur in the physical environment [10]. Digital Twin-enabled PdM has also been demonstrated in other safety-critical infrastructure, such as railway turnout switch machines, where DT models were coupled with condition monitoring to support predictive decision-making and visualization [24].
For electrical substations, achieving deployment-ready PdM requires standards-based operational interoperability. IEC 61850 and CIM define unified information models and communication protocols for asset data exchange, while OPC UA Part 17 enables scalable publish–subscribe messaging for cross-platform DT integration [25,26]. Simultaneously, IEC 62443 and NERC CIP-015-1 enforce cyber-resilient operations to protect OT infrastructure against adversarial threats [27].
Despite the rapid progress of Digital Twin-enabled predictive maintenance frameworks, existing studies remain limited in several critical aspects when considered for real-world substation deployment. Many recent DT–PdM approaches primarily focus on algorithmic performance or simulation-based validation, often lacking full alignment with power-utility interoperability standards, integrated cybersecurity mechanisms, and verification using utility-grade operational data. In particular, interoperability across heterogeneous substation assets and enterprise systems, as well as compliance with industrial cybersecurity requirements, is frequently treated as a secondary consideration or omitted altogether. Moreover, a significant portion of the literature relies on laboratory-scale datasets or synthetic benchmarks, which limits the practical transferability of reported results to operational substation environments.
In addition, prior DT–PdM studies rarely address deployment constraints such as supervisory-cycle latency, human-in-the-loop governance, and auditable decision workflows required for safety-critical substation operation.
To address these gaps, the present work proposes a deployment-oriented DT–PdM architecture that unifies standards-based interoperability (IEC 61850, CIM, and OPC UA Part 17), defense-in-depth cybersecurity aligned with IEC 62443, and hybrid AI-based predictive analytics within a single, coherent framework. Unlike prior conceptual or simulation-centric approaches, the proposed framework is engineered and validated under practical SCADA/OT constraints, including a 60 s supervisory monitoring cycle and operator-approved advisory execution. The proposed architecture is validated using real utility substation data from the SS1 installation of the Badra Oil Field, thereby demonstrating not only predictive performance but also operational feasibility, cyber resilience, and scalability for next-generation substation automation systems.
Therefore, a unified PdM framework for substations must combine advanced analytics, real-time digital twinning, interoperability enforcement, and defense-in-depth cybersecurity. This study addresses these requirements through the proposed DT-enabled PdM architecture presented in Section 3.
4. Case Study and Validation
4.1. Case Study Overview: Badra Oilfield Substation (SS1)
The proposed architecture was validated using field data from the SS1 Substation—GTG B (33/11.5 kV, 55/65 MVA)—as illustrated in Figure 8. It is located in the Central Processing Facility (CPF) of the Badra Oil Project in Iraq. A detailed as-built single-line diagram (SLD) for SS1 is provided in Appendix A, illustrating the full topology, feeder structure, and protection interfaces used for Digital Twin alignment.
This substation supplies three on-duty substations, forming a critical node within the 120 MW Gas Turbine Power Plant (GTPP).
The site represents a realistic hybrid environment combining legacy equipment and modern automation, making it suitable for evaluating deployment-ready PdM architectures [16].
Key monitored assets include two 33/11.5 kV step-down transformers (Oil Natural Air Forced (ONAF) cooling and OLTC control), circuit breakers, protection relays, busbars, and distributed sensors connected to the SCADA system.
Data collected (2021–2025) comprise 1 million records across 14 parameters: voltages, currents, power factors, temperatures, oil levels, and gas analysis values. In this study, “14 parameters” refers to the primary raw OT measurements acquired from SS1 (electrical, thermal, and environmental channels). The exported analysis dataset contains additional columns (e.g., time-index fields and label descriptors) required for supervised learning, traceability, and auditability, while the model input is formed as an engineered feature vector derived from the raw parameters (e.g., gradients and imbalance indices) for robust learning under SCADA-resolution constraints. The dataset was collected from the SS1 substation’s operational technology (OT) environment; no human participants were involved, participant selection criteria are not applicable, and ethics approval was not required because no human data were used.
Instead, inclusion criteria were defined at the asset-channel and event-confirmation levels (utility-grade SS1 measurement points consistently available via SCADA/IED infrastructure and log-confirmed fault intervals). The monitored scope was restricted to utility-grade SS1 equipment, and measurement points were consistently available through the SCADA/IED infrastructure, including the transformer, feeder, and environmental monitoring channels. Fault events were included only when they were confirmed by substation logs and maintenance/operational records, and the corresponding time windows were aligned with recorded disturbance intervals to ensure reliable label assignment (normal vs. fault) for model training and validation.
In this study, fault detection is formulated as a binary classification task, with labels indicating normal versus fault operating states, which were derived from confirmed SS1 fault logs and aligned operational time windows. A total of 139 fault events were confirmed from SS1 operational and maintenance logs and aligned to the 1 min SCADA historian timeline; each event was mapped to its recorded disturbance interval, and minute-level samples were labeled accordingly to ensure consistent supervision under utility monitoring constraints. In parallel, the framework supports remaining useful life (RUL) estimation as a continuous regression task, where the target variable represents the estimated time-to-failure or degradation horizon inferred from historical fault occurrences and condition trends. For supervision, event-aligned pre-fault windows were defined to represent incipient degradation prior to logged fault onset (see the RUL-labeling protocol in the corresponding subsection/appendix), thereby enabling consistent time-to-failure learning from SCADA-resolution sequences.
Measurements were recorded under normal plant operating conditions from the SS1 OT monitoring stack (SCADA/IEDs/sensors) at a 1 min sampling resolution, covering electrical, thermal, and environmental variables used for DT synchronization and PdM analytics. All records were time-stamped and aggregated at the SCADA level, reflecting real-world constraints of utility monitoring (e.g., supervisory sampling rather than waveform-level transients). The dataset includes both steady-state operating periods and fault-affected intervals derived from SS1 operational logs.
Each of the 139 confirmed SS1 fault events is mapped to the 1 min OT/SCADA timeline using the event start/end times recorded in operational logs. A time-window labeling scheme is then applied to construct supervised samples: minutes within the event-aligned fault interval are labeled as fault, while minutes outside those intervals are labeled as normal. To support prognostic use, pre-fault windows are additionally defined to represent incipient degradation prior to logged fault onset, as described in the RUL-labeling protocol below.
For regression, ground-truth targets were constructed using an event-referenced time-to-failure definition: for each minute within a pre-fault horizon preceding confirmed fault onset, the target equals the remaining time (in minutes) until the next logged fault event. For samples outside the pre-fault horizon or during long healthy operating periods, targets were capped at the maximum horizon to avoid unbounded values and to reflect supervisory-level prognostic utility with SCADA resolution. Overlapping windows were handled by assigning each timestamp to the nearest subsequent fault event.
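A minimal sketch of this event-referenced time-to-failure labeling is given below, assuming a 1 min DatetimeIndex and a sorted list of confirmed fault-onset timestamps from the SS1 logs; the horizon cap and names are illustrative placeholders, not the authors' code.

```python
# Event-referenced RUL/time-to-failure labeling on a 1-min timeline.
import numpy as np
import pandas as pd

H_MAX = 24 * 60  # maximum prognostic horizon in minutes (assumed cap)

def build_rul_targets(index: pd.DatetimeIndex,
                      fault_onsets: list) -> pd.Series:
    """Remaining minutes until the *nearest subsequent* logged fault, capped at H_MAX."""
    onsets = np.sort(np.asarray(fault_onsets, dtype="datetime64[ns]"))
    ts = index.values.astype("datetime64[ns]")
    # For each timestamp, find the index of the next fault onset.
    nxt = np.searchsorted(onsets, ts, side="left")
    rul = np.full(len(ts), float(H_MAX))          # healthy periods: capped target
    has_next = nxt < len(onsets)
    delta = (onsets[nxt[has_next]] - ts[has_next]) / np.timedelta64(1, "m")
    rul[has_next] = np.minimum(delta, H_MAX)      # cap to avoid unbounded values
    return pd.Series(rul, index=index, name="rul_minutes")
```

Capping at the horizon mirrors the protocol above: samples far from any fault receive the maximum value rather than an unbounded target, and assigning each timestamp to its nearest subsequent onset resolves overlapping windows.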
4.2. Data Pre-Processing and Feature Engineering
The pre-processing workflow consisted of (i) semantic alignment of tags/signals using IEC 61850-consistent naming, (ii) time synchronization and resampling to a uniform 1 min grid, (iii) missing-value treatment and reconstruction (as described below), (iv) outlier screening and removal to suppress non-physical spikes, (v) normalization to the [0, 1] range for model stability, and (vi) feature engineering to derive thermal gradients and imbalance indicators that are physically meaningful for incipient fault detection and RUL estimation. Following cleaning and synchronization, the learning input was formed as an engineered feature vector (33 features) by combining raw OT measurements with derived indicators (e.g., thermal gradients and differences, imbalance metrics, vibration severity proxies, and operational stability descriptors). This design preserves physical interpretability while enabling robust classification under SCADA-resolution constraints. Noise mitigation was handled at this stage through deployment-consistent pre-processing rather than through a dedicated learned denoising network. This pipeline ensures that the Digital Twin receives a clean, synchronized, and deployment-consistent data stream suitable for closed-loop operation.
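A condensed sketch of the derived-indicator step is shown below; the column names (oil_temp, winding_temp, ia/ib/ic) are hypothetical SCADA tags assumed for illustration, and the full 33-feature set is not reproduced.

```python
# Illustrative derivation of thermal gradients, an imbalance index, and
# min-max normalization to [0, 1], per the pipeline described above.
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Thermal gradients/differences (per-minute trends on the 1-min grid).
    out["oil_temp_grad"] = df["oil_temp"].diff()
    out["wind_oil_diff"] = df["winding_temp"] - df["oil_temp"]
    # Phase-current imbalance index: max deviation from the phase mean.
    i_mean = df[["ia", "ib", "ic"]].mean(axis=1)
    out["i_imbalance"] = (df[["ia", "ib", "ic"]].sub(i_mean, axis=0)
                          .abs().max(axis=1) / i_mean.clip(lower=1e-6))
    # Min-max normalization to the [0, 1] range for model stability.
    out = (out - out.min()) / (out.max() - out.min() + 1e-12)
    return out.dropna()
```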
The dataset underwent a rigorous cleaning and alignment procedure following IEC 61850 semantic naming standards. This pre-processing-based noise handling strategy was selected to preserve physical interpretability and to avoid introducing additional model complexity that may hinder reproducibility in utility-grade SCADA environments. Missing values were reconstructed using a hybrid Kalman Filter and Spline Interpolation approach, while outliers were suppressed using a fusion method consistent with industrial PdM practices [16].
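A hedged sketch of the hybrid reconstruction idea follows: spline interpolation fills short gaps, then a scalar random-walk Kalman filter smooths sensor noise. The noise constants q and r are assumptions, not the tuned filter settings used in the deployed pipeline.

```python
# Hybrid gap reconstruction for a single measurement channel.
import pandas as pd

def reconstruct_channel(s: pd.Series, q: float = 1e-4, r: float = 1e-2) -> pd.Series:
    s = s.interpolate(method="spline", order=3, limit=10)  # fill short gaps only
    x = float(s.dropna().iloc[0])    # initialize state from first valid sample
    p, out = 1.0, []
    for z in s:
        p += q                        # predict: inflate state variance
        if pd.notna(z):               # update only when a measurement exists
            k = p / (p + r)           # Kalman gain
            x += k * (z - x)
            p *= 1.0 - k
        out.append(x)
    return pd.Series(out, index=s.index)
```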
To enhance model robustness, derived features included the following:
Electrical domain, including phase voltages, current densities, harmonic distortion, frequency deviation, and apparent power.
Thermal domain, including oil/winding temperature gradients, ambient compensation factors, and the transformer thermal stress index.
Reliability domain, including the failure rate (λ) and RUL estimation priors based on degradation signatures [17].
Each parameter was normalized to the [0, 1] range and time-synchronized at the 60 s resolution for AI training. The mathematical definitions for normalization, smoothing, interpolation, and outlier removal are summarized in Appendix C.
Remaining useful life (RUL) is defined as the time to the next confirmed fault, measured on the 1 min timeline. For each sample within a pre-fault horizon, the RUL label corresponds to the remaining minutes until the associated fault onset. Samples outside any pre-fault horizon were not assigned RUL targets (or were excluded from regression), preventing ambiguous supervision during healthy steady-state operation. To reduce the risk of target leakage, label fields and maintenance-log descriptors were used only for supervised annotation and traceability and were excluded from the predictive feature set used by the learning models.
4.3. Model Development and Configuration
Five data-driven predictive models were developed using the same training/testing dataset (80/20 split with 5-fold cross-validation):
Random Forest (RF);
Gradient Boosting Machine (GBM);
Support Vector Machine (SVM);
Deep Neural Network (DNN);
Stacked Ensemble (RF + GBM + DNN).
Hyperparameters were optimized using Bayesian search coupled with 5-fold cross-validation to promote robust generalization under operational variability. The DNN (four hidden layers, rectified linear unit (ReLU) activations, and dropout = 0.2) was implemented using TensorFlow, while tree-based models and the ensemble meta-learner were implemented in Scikit-Learn. In this work, the DNN is a fully connected feed-forward model, with dropout regularization used to reduce overfitting under operational variability. For fault detection, the output layer uses a sigmoid activation function to model the probability of the fault class (normal vs. fault), and the network is trained by minimizing the binary cross-entropy loss. For the RUL estimation task, the regression head uses a linear output and is trained using the mean squared error (MSE).
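A minimal sketch of the described classification network is given below, assuming the 33-feature engineered input. The paper fixes only the depth, activation, dropout rate, and output heads, so the hidden-layer widths here are illustrative placeholders and the resulting parameter count will differ from the reported value.

```python
# Fully connected feed-forward fault classifier, per the description above.
import tensorflow as tf

def build_fault_dnn(n_features: int = 33) -> tf.keras.Model:
    model = tf.keras.Sequential([tf.keras.layers.Input(shape=(n_features,))])
    for units in (64, 48, 32, 16):                 # four hidden layers (assumed widths)
        model.add(tf.keras.layers.Dense(units, activation="relu"))
        model.add(tf.keras.layers.Dropout(0.2))    # dropout = 0.2 as stated
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # P(fault)
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",      # binary cross-entropy loss
                  metrics=[tf.keras.metrics.AUC(name="auc")])
    return model
```

A parallel regression head for RUL would replace the sigmoid output with a single linear unit and the loss with mean squared error.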
The Bayesian search space covered key algorithm-specific parameters, including network depth, dropout rate, and learning rate for the DNN, as well as tree depth, the number of estimators, and ensemble weighting for the classical models, with the final configuration selected based on cross-validated performance. Class imbalance was mitigated via stratified sampling and adaptive weight scaling [18]. Across repeated optimization runs, the selected hyperparameter regions were generally consistent, and the finalized DNN and ensemble configurations exhibited limited sensitivity to initialization, indicating stable model selection. The final model configuration is illustrated in Figure 15, while tuning details are reported in Appendix C.3.
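One way to realize this tuning loop is sketched below with scikit-optimize's BayesSearchCV for the GBM branch; the search ranges and iteration budget are assumptions, not the values used for the reported configuration.

```python
# Bayesian hyperparameter search with 5-fold cross-validation, F1 objective.
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

search = BayesSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    search_spaces={
        "n_estimators": Integer(100, 600),                      # number of trees
        "max_depth": Integer(2, 8),                             # tree depth
        "learning_rate": Real(1e-3, 0.3, prior="log-uniform"),
    },
    scoring="f1",                                                # primary objective
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_iter=40,
)
# search.fit(X_train, y_train)   # X_train/y_train: engineered features + labels
```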
The benchmark set (the RF, the GBM, the SVM, the DNN, and a stacked ensemble) was selected to cover complementary trade-offs relevant to substation deployment: (i) tree-based learners (RF/GBM) provide strong performance on tabular SCADA features with transparent feature-importance explanations; (ii) the SVM serves as a classical margin-based baseline with stable behavior under moderate dimensionality; (iii) the DNN captures nonlinear interactions among thermal, vibration, and imbalance indicators that are difficult to model explicitly; and (iv) the stacked ensemble combines heterogeneous learners to improve robustness to operating variability and reduce sensitivity to individual model bias. This specific ensemble composition was chosen to leverage complementary inductive biases—variance reduction (RF), nonlinear partitioning (GBM), and higher-order feature interaction modeling (DNN)—thereby improving robustness under non-stationary substation operating conditions while preserving interpretability and deployment feasibility. This design provides a practical balance between accuracy (F1-score under class-imbalance conditions), interpretability, and inference feasibility within the 60 s monitoring loop.
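The ensemble composition can be sketched with scikit-learn's StackingClassifier, as below; an MLPClassifier stands in for the TensorFlow DNN so the example stays self-contained, and since the paper does not name the meta-learner, logistic regression is assumed.

```python
# Heterogeneous stack (RF + GBM + neural learner) fused by a meta-classifier.
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("gbm", GradientBoostingClassifier(random_state=42)),
        ("dnn", MLPClassifier(hidden_layer_sizes=(64, 48, 32, 16),
                              activation="relu", max_iter=300)),
    ],
    final_estimator=LogisticRegression(),   # assumed meta-classifier
    stack_method="predict_proba",           # fuse base-model probabilities
    cv=5,                                   # out-of-fold meta-features
)
```

Using out-of-fold probabilities as meta-features is what lets the stack combine the complementary inductive biases described above without overfitting to any single base learner.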
Recent deep sequence models (e.g., LSTM/transformers) can be advantageous for high-frequency waveforms or long-horizon temporal dependencies; however, in the present study, the primary data stream utilizes 1 min SCADA/OT measurements, and the selected models provide a stronger deployment trade-off (training/inference cost, interpretability, and reproducibility) while achieving high fault-detection performance. Accordingly, sequence-heavy architectures are considered complementary benchmarking candidates or future extensions rather than primary deployment models for the present supervisory-resolution dataset.
4.4. Model Comparison and Evaluation
Five supervised models—the Random Forest (RF), the Gradient Boosting Machine (GBM), the Support Vector Machine (SVM), the Deep Neural Network (DNN), and a stacked ensemble (RF + GBM + DNN)—were benchmarked using the same standardized dataset with an 80/20 train–test split and five-fold cross-validation. After event-window labeling, the resulting dataset comprised 802,695 fault-labeled samples and 197,305 normal samples (1,000,000 total records). The 80/20 split resulted in 671,756 fault samples and 128,244 normal samples in the training set, and 167,939 fault samples and 32,081 normal samples were included in the held-out test set. The test set was kept unchanged to preserve the realistic operational class imbalance observed in the SS1 substation environment.
The dataset was partitioned using an 80/20 train–test split, with five-fold cross-validation applied on the training portion for robust model selection. To address class imbalance in fault events, minority-class balancing was applied within the training folds (e.g., SMOTE/oversampling), while no resampling was applied to the test set to ensure unbiased operational evaluation.
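A leakage-safe sketch of this protocol is shown below: wrapping SMOTE in an imbalanced-learn Pipeline guarantees that resampling is fitted only on each training fold, while validation folds and the held-out test set keep their natural class ratio. X_train/y_train are placeholder names.

```python
# Fold-internal SMOTE with stratified 5-fold cross-validation.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(random_state=42)),            # applied to training folds only
    ("clf", RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# f1_scores = cross_val_score(pipe, X_train, y_train, scoring="f1", cv=cv)
```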
Figure 9 illustrates the Receiver Operating Characteristic (ROC) curves, where the DNN and ensemble achieved nearly perfect separation (AUC ≈ 0.99), outperforming tree-based models (AUC ≈ 0.97–0.98) and the SVM (AUC ≈ 0.95). Figure 10 displays the confusion matrices, confirming the stacked ensemble’s superior class balance with near-zero false negatives, whereas the SVM produced minor misclassifications under transient conditions [17]. Given the imbalanced nature of fault detection, the F1-score is treated as the primary objective for model selection and comparative evaluation, with AUC–ROC reported as a complementary discrimination measure.
Additional mathematical formulations used in classification, anomaly scoring, RUL estimation, and reliability analysis are presented in Appendix D for reproducibility.
As shown in Figure 9, the ROC analysis indicates that both the DNN and the stacked ensemble achieve very strong separability; however, the stacked ensemble provides a consistently higher operating margin, particularly in the high-specificity regime that is relevant for minimizing false alarms while preserving detection sensitivity. Under class-imbalance conditions, precision–recall behavior further highlights the advantage of the ensemble, indicating improved precision retention at high recall compared with the DNN. This performance gap suggests that ensemble fusion better captures heterogeneous fault signatures and reduces sensitivity to transient operational variability.
To further assess model behavior under class-imbalance conditions, Figure 10 provides a comparative precision–recall view of the evaluated models. As shown, the Random Forest and Gradient Boosting models achieve perfect or near-perfect precision and recall, reflecting their strong discrimination capability on the evaluated dataset. The Support Vector Machine (SVM), however, exhibits a noticeable reduction in recall (≈0.9835) despite maintaining high precision (≈0.9974), indicating a higher tendency to miss fault instances under certain operating conditions. In contrast, the Deep Neural Network and the stacked ensemble consistently maintain both high precision and high recall (≈0.999–1.0), demonstrating a more balanced trade-off between false-alarm suppression and fault-capture sensitivity.
From an operational perspective, this balanced behavior is critical for substation predictive maintenance, as it minimizes missed fault events while avoiding excessive false positives. The ensemble’s stable performance across both metrics further supports its selection as the preferred model for deployment within the Digital Twin-enabled decision-support loop, where reliability and risk-aware operation are paramount.
The confusion matrix results show that the stacked ensemble yields extremely low false-negative behavior, which is critical in safety- and reliability-sensitive substation environments where missed fault detection can lead to cascading damage, forced outages, and higher restoration costs. From an operational perspective, the observed detection behavior supports risk-aware deployment, where high recall (fault capture) is prioritized, while maintaining strong discrimination performance. This strengthens confidence in the suitability of the ensemble as the operational model embedded in the Digital Twin decision loop.
Model interpretability is assessed using feature-importance analysis, as shown in Figure 11, which serves as the primary explainability artifact in this study, summarizing the dominant predictors that drive fault-related discrimination in the DT–PdM pipeline.
The importance rankings in Figure 12 provide operationally meaningful insights into the SS1 condition dynamics. Thermal indicators (oil and winding temperatures and their derived gradients/differences) appear consistently among the top predictors, which is physically consistent with insulation aging and hotspot-driven degradation processes in transformer–feeder paths under variable loading conditions. Likewise, vibration-related features contribute strongly, aligning with mechanical looseness, cooling-system anomalies, or incipient component stress that often manifests prior to discrete fault events.
Electrical imbalance features (e.g., phase-current/voltage asymmetry and derived imbalance indices) are also informative because unbalanced loading, loose connections, and contact deterioration can produce asymmetric current patterns and elevated localized heating. From a maintenance perspective, these results support prioritizing (i) thermal monitoring trends, (ii) vibration excursions, and (iii) imbalance alarms as interpretable early-warning signals within the DT decision-support workflow.
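A model-agnostic cross-check of such importance rankings can be sketched as follows; permutation importance on the held-out split is an assumed complement to the impurity-based rankings shown in the figures, and feature_names is a placeholder for the 33 engineered inputs.

```python
# Permutation-importance ranking on held-out data (F1 as the scoring metric).
import pandas as pd
from sklearn.inspection import permutation_importance

def rank_features(model, X_test, y_test, feature_names):
    perm = permutation_importance(model, X_test, y_test,
                                  scoring="f1", n_repeats=10, random_state=42)
    return (pd.DataFrame({"feature": feature_names,
                          "importance": perm.importances_mean})
            .sort_values("importance", ascending=False))
```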
Precision–recall curves comparing fault detection performance (Figure 10 and Table 2) confirm that the stacked ensemble achieved the best overall performance in terms of the F1-score (0.98), alongside high accuracy (97.5%), precision (0.98), recall (0.97), and AUC (0.995), demonstrating robust discrimination under imbalanced fault conditions [15].
For visual clarity, Figure 10 presents representative evaluation plots generated on a held-out subset, whereas all quantitative metrics reported in Table 2 and the cross-validation statistics were computed based on the full test partition and the five-fold cross-validation procedure.
Cross-validation results demonstrated stable generalization of the proposed framework, with fold-to-fold performance variability remaining small across five folds, indicating that the reported results are not driven by a single data split and remain robust under resampling.
Figure 13 provides an integrated, multi-perspective comparison of model performance, synthesizing the quantitative results reported in Table 2 into complementary visual forms. The grouped bar chart summarizes accuracy, precision, recall, and F1-scores across models, confirming the consistently superior balance achieved by the stacked ensemble. The AUC–ROC comparison highlights strong discriminative capability for all tree-based and ensemble models while revealing comparatively reduced separability for the SVM and the standalone DNN. The heat map offers a compact overview of metric-wise dominance, clearly illustrating the ensemble’s uniformly high performance across all evaluation criteria. Finally, the radar chart visualizes the trade-off among accuracy, precision, recall, and the F1-score, where the stacked ensemble encloses the largest area, indicating the most balanced and operationally robust behavior under imbalanced fault-detection conditions.
4.5. Integration Within the Digital Twin Core
The stacked ensemble model (identified in Section 4.4 as the best performer) is embedded into the Digital Twin (DT) core to enable synchronized real-time analytics for SS1 substation assets.
For clarity and reproducibility, the finalized Deep Neural Network architecture and the stacked ensemble configuration employed in this study are summarized concisely in Table 3, which consolidates the input representation, the number of hidden layers, activation functions, the regularization strategy, output heads for classification and RUL regression, loss functions, the optimizer choice, and the ensemble composition. This compact representation complements the textual description and provides a deployment-oriented overview of the model design adopted in the DT–PdM framework.
Figure 14 shows the operational performance visualization of the stacked ensemble within the Digital Twin core.
Figure 15 illustrates the architecture of the Deep Neural Network (DNN) used for fault classification and RUL estimation, while convergence behavior during training is validated in Figure 16, confirming a stable reduction in training and validation loss without overfitting.
The operational pipeline of the AI engine within the DT is presented in Figure 16, which illustrates the input signals processed through the RF, GBM, and DNN branches and aggregated via a meta-classifier for fault prediction and RUL estimation.
The final DNN contained approximately 13,665 trainable parameters, which is modest relative to the dataset scale and contributes to stable generalization under operational variability.
Real-time data streams from sensors and SCADA are encoded following IEC 61850 semantic conventions and transferred through OPC UA Part 17 Pub/Sub [26], enabling secure bidirectional communication with the DT.
Within this loop, the AI engine continuously updates the fault probability $P_{\mathrm{fault}}$, the remaining useful life (RUL), and the cyber-trust score $T_{\mathrm{cyber}}$ every 60 s and transmits maintenance advisories to the operator’s dashboard.
These metrics are fused through the Composite Maintenance Decision Score:

$$S_{\mathrm{CMDS}}(t) = \alpha\,P_{\mathrm{fault}}(t) + \beta\,R_{\mathrm{RUL}}(t) + \gamma\,T_{\mathrm{cyber}}(t), \tag{1}$$

where $P_{\mathrm{fault}}(t)$ is the predicted fault probability, $R_{\mathrm{RUL}}(t)$ is the normalized urgency derived from the remaining useful life (RUL), and $T_{\mathrm{cyber}}(t)$ is the cybersecurity-trust score derived from IEC 62443 compliance checks [25,27]. A complete derivation of the composite score and its weighting methodology is provided in Appendix C.
The weights $\alpha$, $\beta$, and $\gamma$ were chosen to reflect the relative operational importance of (i) prognostic urgency ($R_{\mathrm{RUL}}$), (ii) immediate fault likelihood ($P_{\mathrm{fault}}$), and (iii) cyber–physical trustworthiness ($T_{\mathrm{cyber}}$) while maintaining a bounded score. All three terms were normalized to [0, 1] prior to fusion, and $\alpha + \beta + \gamma = 1$ was enforced to preserve interpretability. Unless otherwise specified, the initial weights were set through expert-informed engineering judgment, consistent with substation maintenance practice, and then verified through a sensitivity check to ensure that the decision ranking is not dominated by any single component under typical operating regimes.
The term $T_{\mathrm{cyber}}$ was operationalized as a normalized scalar in the range of [0, 1], as derived from IEC 62443-aligned monitoring indicators within the OT security zone. Specifically, $T_{\mathrm{cyber}}$ aggregates (a) authentication/authorization status, (b) integrity/anomaly flags (e.g., spoofing or command-injection attempts), and (c) network-policy compliance (zoning/segmentation and CIP-015-1 monitoring requirements). Each indicator is mapped to a penalty weight $w_i$ and combined as $T_{\mathrm{cyber}}(t) = 1 - \sum_i w_i\, v_i(t)$, where $v_i(t) \in \{0, 1\}$ indicates the presence of security violation $i$ at time $t$, while $\sum_i w_i \leq 1$. Thus, $T_{\mathrm{cyber}}$ approaches one under normal trusted operation and decreases as security anomalies or policy violations are detected.
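A compact sketch of this trust-and-fusion logic is given below; the indicator names and penalty weights are illustrative placeholders rather than the deployed IEC 62443 policy values, and the fusion rule follows the reconstructed linear form above.

```python
# Penalty-based cyber-trust score and linear composite-score fusion.
def cyber_trust(violations: dict, penalties: dict) -> float:
    """T_cyber = 1 - sum of penalties for active violations, clamped to [0, 1]."""
    t = 1.0 - sum(penalties[k] for k, active in violations.items() if active)
    return max(0.0, min(1.0, t))

def composite_score(p_fault: float, r_rul: float, t_cyber: float,
                    alpha: float = 0.45, beta: float = 0.45,
                    gamma: float = 0.10) -> float:
    """S_CMDS = alpha*P_fault + beta*R_RUL + gamma*T_cyber (all inputs in [0, 1])."""
    return alpha * p_fault + beta * r_rul + gamma * t_cyber

# Example: a spoofing flag lowers T_cyber, which in turn lowers S_CMDS,
# deferring escalation pending operator review of the security state.
t = cyber_trust({"auth_fail": False, "spoofing": True, "policy": False},
                {"auth_fail": 0.4, "spoofing": 0.3, "policy": 0.3})
s = composite_score(p_fault=0.8, r_rul=0.7, t_cyber=t)
```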
This score forms the basis for predictive intervention timing, visualized for operators through the DT maintenance dashboard.
A conceptual 3D visualization of the synchronized physical–virtual environment and decision interface is provided, demonstrating the deployment applicability of the system within industrial operations.
To ensure robust cyber–physical integration, IEC 62443 zoning and CIP-015-1 network security monitoring are applied to secure data sources and prevent command injection or unauthorized manipulation of health indicators.
This alignment ensures that predictive reasoning remains trustworthy, in compliance with national grid cybersecurity mandates.
The finalized DNN architecture contains 13,665 trainable parameters, corresponding to an approximate parameter memory footprint of 0.052 MB (assuming 32-bit floating-point storage). This size enables inference to execute well within the 60 s monitoring loop and supports deployment either at the OT edge gateway or within the DT service layer, depending on cybersecurity zoning and available computational resources.
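As a quick arithmetic check of the quoted footprint, assuming 4 bytes (32-bit) per parameter:

$$13{,}665 \times 4\ \text{B} = 54{,}660\ \text{B} \approx 0.052\ \text{MB}.$$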
4.6. Validation and Results
The proposed Digital Twin–AI predictive maintenance framework was validated using 139 real fault events extracted from SS1 Substation logs (2021–2025) and approximately 1 million multivariate operational readings covering transformer, feeder, and environmental parameters. This validation approach aligns with methodologies reported in industrial DT–PdM studies, where Digital Twins are used to verify predictive models against real-world disturbance events and operational signatures [36,44]. In this study, fault classification was formulated as a binary task (normal vs. fault). Given the imbalanced nature of fault events, the F1-score is adopted as the primary performance objective for model selection and comparative evaluation, while accuracy and AUC–ROC are reported as complementary metrics.
Leveraging the stacked ensemble model identified in Section 4.4 as the most effective learner according to the F1-score, the system demonstrated substantial operational improvements. Field deployment resulted in a 28% reduction in unplanned outages and a 22% decrease in maintenance cost, outcomes consistent with reductions reported in large-scale DT-enabled maintenance studies across power and industrial systems [36,44].
The reported reductions were computed using SS1 operational and maintenance records collected over the 2021–2025 period, comparing observed outcomes during DT-assisted advisory operation against historical baseline behavior within the same substation and equipment context. Unplanned outages were defined as forced trips or unscheduled interruptions and were recorded in SS1 operational logs, while the maintenance cost reflects corrective maintenance efforts aggregated over labor, spare-part usage, and intervention-related activities documented during the same horizon. Because this assessment is based on real-world field operations rather than a controlled experimental setup, the reported percentages are interpreted as observed site-level improvements associated with DT-assisted early warnings and maintenance prioritization. Other concurrent factors, such as incremental operational adjustments or routine equipment upgrades, may also influence these metrics; therefore, the results are intended to demonstrate deployment feasibility and magnitude of effect rather than to assert isolated causal attribution.
The predictive module generated early anomaly warnings several hours before SCADA alarm thresholds were reached. This behavior was aligned with findings from Digital Twin-enhanced early-warning systems in smart-grid and industrial environments, where early deviation detection is enabled through hybrid analytics and synchronized DT state estimation [44].
The Composite Maintenance Decision Score (Equation (1)) provided robust prioritization under thermal-stress, phase-imbalance, and load-transient conditions. This multi-objective scoring approach supports proactive workforce scheduling and spare-part preparation, reflecting best practices identified in systematic DT–PdM reviews [36].
When deployed, the DT interface presents recommendations as operator-facing advisories integrated with maintenance workflows (e.g., CMMS), while authorization remains human-controlled. Cyber–physical security validation was conducted using IEC 62443 zoning principles and CIP-015-1 internal network monitoring requirements. Stress testing confirmed resistance to command injection, data-spoofing attempts, and unauthorized access attempts into the OT network. These findings align with cybersecurity challenges and mitigation approaches discussed in AI-enabled maintenance reviews [45].
Cross-validation results demonstrated high model generalization, with a low standard deviation across five folds, outperforming benchmarks highlighted in recent PdM survey papers focused on deep learning model reliability in dynamic power-system environments [15].
The composite score increases as the fault likelihood rises and the RUL decreases; if $T_{\mathrm{cyber}}$ degrades due to OT security anomalies, $S_{\mathrm{CMDS}}$ is down-weighted (or triggers a security-first escalation), preventing unsafe automated actions and reinforcing human-in-the-loop governance. During deployment, $S_{\mathrm{CMDS}}$ is compared against predefined advisory thresholds (e.g., exceeding a lower threshold triggers inspection, while exceeding a higher threshold triggers maintenance scheduling), with final authorization by operators. The values reported in Table 4 are illustrative and normalized and are intended to demonstrate the temporal behavior of the composite decision score rather than reproduce a specific raw measurement trace.
The composite maintenance decision score was operationalized using normalized components and a lightweight linear fusion rule:

$$S_{\mathrm{CMDS}}(t) = \alpha\,P_{\mathrm{fault}}(t) + \beta\,R_{\mathrm{RUL}}(t) + \gamma\,T_{\mathrm{cyber}}(t),$$

where $P_{\mathrm{fault}}$ is the model-derived fault likelihood, $R_{\mathrm{RUL}}$ represents the normalized RUL risk (higher values indicate higher urgency), and $T_{\mathrm{cyber}}$ is a cyber-trust indicator derived from OT security monitoring status.
In this study, the weights were set according to engineering judgment to emphasize the operational risk while retaining a security-first modulation (e.g., $\alpha = 0.45$, $\beta = 0.45$, and $\gamma = 0.10$), yielding an interpretable score suitable for real-time advisory use. The illustrative trace in Table 4 shows how increasing fault likelihood and RUL risk elevate urgency, while degrading cyber trust down-weights the composite score to prevent unsafe action escalation.
Finally, the synchronized physical–virtual visualization within the Digital Twin interface confirmed practical applicability for industrial operators, providing real-time diagnostic indicators, RUL projections, and cyber-trust scores, thereby supporting situational awareness and informed maintenance decision-making.
4.7. Discussion
The experimental results confirm that the proposed Digital Twin-enabled predictive maintenance (DT-PdM) architecture provides a significant advancement over existing PdM frameworks in modern substation environments. Compared with recent DT-based maintenance studies—such as the operational DT architectures reviewed in Systematic Review of Predictive Maintenance and Digital Twin Technologies [36] and the AI-guided DT implementations summarized in State-of-the-Art Review: Digital Twins to Support AI-Guided Predictive Maintenance [46]—the presented system integrates a more comprehensive and deployment-oriented stack encompassing standards-aligned interoperability, cyber resilience, runtime feasibility, and hybrid analytics, all validated on utility-grade operational data.
The case study at SS1 demonstrated notable practical benefits, including a 28% reduction in unplanned outages and a 22% reduction in maintenance costs over the 2021–2025 evaluation horizon, following integration of the stacked ensemble within the DT core. These gains exceed those reported in comparable industrial deployments documented by [36] and exhibit parallel trends to the predictive maintenance gains observed in deep learning-driven reviews [15]. The observed improvements are attributable to early anomaly detection sensitivity, RUL-informed prioritization, and synchronized cyber–physical reasoning enabled by the DT feedback loop, rather than algorithmic accuracy alone.
A defining strength of the proposed architecture lies in its explicit compliance with power-utility interoperability and cybersecurity standards, including IEC 61850 for semantic data exchange, CIM IEC 61970/61968 for system-level modeling, OPC UA Part 17 for deterministic publish–subscribe synchronization, and IEC 62443/CIP-015-1 for OT cybersecurity. Existing PdM models frequently lack deployment readiness due to incomplete interoperability and weak cyber protection, an issue extensively emphasized in the power-system cybersecurity review [45]. By explicitly incorporating zoning, conduit segmentation, certificate-based trust, and secure update pathways, this framework addresses these longstanding reliability and security gaps.
Furthermore, the system’s generalization robustness, as observed from its low cross-validation deviation across five folds, confirms the stability of the stacked ensemble strategy in dynamic grid environments. This aligns with the conclusions in Deep Learning Models for Predictive Maintenance: A Survey, Comparison, Challenges and Prospects [15], which highlight ensemble and hybrid modeling as the most resilient approach for non-stationary asset behaviors.
Nevertheless, several limitations warrant consideration. Model performance may degrade under rare grid reconfigurations, atypical load-transfer events, or degradation mechanisms insufficiently represented in the training data. Such limitations echo those identified in multiple DT-PdM reviews [36,46], emphasizing the need for continuous Digital Twin recalibration and potential integration of physics-informed models or reinforcement learning agents for adaptive retraining under new operating conditions. Future extensions incorporating physics-informed constraints or reinforcement learning-based policy adaptation are therefore positioned as complementary enhancements rather than prerequisites for deployment readiness.
Overall, the validated results show that the proposed architecture is technically sound, operationally deployable, and aligned with the cybersecurity and interoperability requirements of real-world high-voltage substations, marking a meaningful progression beyond the current DT-PdM literature.
4.7.1. Complexity, Scalability, and Deployment Trade-Offs
The proposed DT–PdM framework was designed with deployment constraints in mind, particularly the 60 s SCADA supervisory update cycle and the need for scalable extension to multi-substation environments. From a computational perspective, tree-based models (RF/GBM) and the SVM provide efficient inference and strong baseline robustness, while the DNN and stacked ensemble offer improved nonlinear modeling capacity at the cost of increased model complexity [22,47,48,49]. The finalized DNN contains approximately 13,665 trainable parameters, and the stacked ensemble introduces only modest overhead through meta-model fusion.
In operational use, the closed-loop pipeline—SCADA acquisition → DT state update → AI inference → maintenance advisory—is executed within the 60 s supervisory update cycle. Dominant latency arises from SCADA acquisition and historian refresh, while DT state updates and model inference are executed within a small fraction of the cycle, confirming suitability for real-time advisory deployment.
Interpretability and accuracy represent a key deployment trade-off. The RF and GBM support transparent feature-importance analysis that aids engineering trust and root-cause investigations, whereas the DNN and ensemble models typically achieve higher predictive performance but require additional governance (e.g., calibration checks and drift monitoring) to maintain reliability over time [50,51]. To address scalability, the architecture supports extension toward fleet-level deployment via Digital Twin federation, where local inference is performed at each substation and only aggregated model updates, KPIs, and anonymized health indicators are exchanged across sites [52]. This design reduces bandwidth requirements, preserves data confidentiality, and enables phased rollout across multiple substations while maintaining consistent interoperability and cybersecurity controls.
This work is subject to several data and deployment constraints that should be considered when interpreting the results. First, the monitoring resolution is limited by 1 min SCADA/OT sampling, which restricts the capture of fast transient signatures and waveform-level phenomena that may precede certain fault modes.
The SS1 historian stream is recorded at a 1 min supervisory resolution, which is sufficient for monitoring slow-to-moderate dynamics (thermal trends, load imbalance evolution, and sustained abnormal operating states) but does not capture waveform-level transients or fast partial-discharge (PD) pulse activity. Accordingly, any PD-related variable used in this study should be interpreted as a monitor-derived SCADA indicator (e.g., alarm/status counters, aggregated severity indices, or device-level summarized PD metrics) rather than a high-frequency PD waveform measurement. This sampling constraint may reduce sensitivity to short-lived incipient events and can blur early-stage signatures that evolve faster than the historian refresh period; therefore, conclusions involving fast transient mechanisms are stated conservatively and framed as evidence from aggregated indicators. Future work will incorporate higher-rate acquisition (e.g., UHF/TEV PD monitors or transient recorders) to improve transient observability and strengthen physics-level attribution under fast-evolving fault modes.
Second, condition indicators such as DGA are available at a lower sampling frequency than operational measurements, and OLTC-related records are partially incomplete, which may limit sensitivity to slow-developing insulation or tap-changer degradation patterns. Third, validation is based on a single utility substation (SS1) with a specific equipment configuration; therefore, performance may vary for substations with different loading profiles, protection settings, or component technologies.
In addition to model complexity, uncertainty awareness is a critical consideration for deployment in safety-critical substation environments. In the present implementation, predictive uncertainty is primarily handled through probabilistic model outputs and performance stability analysis rather than through fully Bayesian inference. While the stacked ensemble and DNN provide confidence scores associated with fault classification and prognostic outputs, their reliability depends on calibration quality and data representativeness. To support deployment readiness, uncertainty characterization is therefore evaluated using lightweight calibration and reliability assessments rather than computationally intensive uncertainty frameworks, which may introduce additional overhead. This design choice reflects a trade-off between uncertainty expressiveness and operational feasibility within the 60 s monitoring cycle. More advanced uncertainty modeling—such as Bayesian neural networks or evidential learning—is identified as a future enhancement when higher-frequency data and additional computational resources become available.
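A lightweight calibration check of the kind described can be sketched as follows, assuming prob_pos holds the model's held-out fault probabilities; the bin count is an illustrative choice.

```python
# Reliability curve plus Brier score as a low-overhead calibration assessment.
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def assess_calibration(y_test, prob_pos, n_bins: int = 10):
    frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=n_bins)
    return {"brier": brier_score_loss(y_test, prob_pos),     # lower is better
            "reliability": list(zip(mean_pred, frac_pos))}    # (predicted, observed)
```

Such a check runs in milliseconds on held-out predictions, so it fits comfortably inside the 60 s monitoring cycle without the overhead of fully Bayesian inference.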
Finally, model performance may degrade under rare or previously unseen operating regimes (e.g., atypical seasonal loading, switching events, or novel fault combinations), particularly when training data do not adequately represent such conditions. These limitations primarily affect generalization rather than the architectural feasibility of the proposed DT–PdM framework. Future extensions—such as physics-informed learning to embed degradation constraints, reinforcement learning for adaptive maintenance policies, and federated Digital Twin deployment across substations—are expected to improve robustness, transferability, and coverage of rare operating regimes beyond the current scope.
Runtime and Deployment Feasibility
To support real-time deployment, we report indicative runtime measurements for training and inference using the finalized implementation. The DNN training time was approximately 10.8 s/epoch, with a total training time of ≈9 min for 50 epochs (early stopping), while classical models (RF/GBM/SVM) and the stacked ensemble required ≈2–8 min for end-to-end fitting. For deployment, inference latency was measured as ≈5–15 ms/record (RF), ≈8–25 ms/record (GBM), ≈15–30 ms/record (SVM), ≈25–35 ms/record (DNN), and ≈80–120 ms/record (stacked ensemble), including feature preparation and model evaluation, as clarified in Table 5. These results confirm that ensemble inference remains well within the 60 s monitoring interval, enabling advisory generation without interfering with SCADA/OT update cycles.
Given the measured inference latency, the deployed ensemble operates comfortably within the 60 s update cycle, leaving sufficient margin for data ingestion, logging, and DT service orchestration. All runtime measurements were obtained on a professional-grade engineering workstation equipped with a multi-core CPU, dedicated GPU acceleration, and sufficient system memory, representative of the computational resources typically available in industrial analytics and Digital Twin development environments. The reported runtimes are therefore indicative of practical deployment performance rather than laboratory-optimized benchmarks.
4.7.2. Physical Performance Interpretation and Error Analysis
The dominant predictors identified by the explainability analysis are physically consistent with common degradation mechanisms in transformer–feeder operations. Oil and winding temperature levels and gradients reflect thermal stress accumulation, cooling-system anomalies, and hotspot-driven insulation aging; therefore, elevated temperatures and rapid thermal changes are meaningful precursors of incipient faults and reduced remaining useful life. Likewise, imbalance-related electrical indicators (e.g., phase-current asymmetry or derived imbalance indices) are strongly associated with unbalanced loading, loose terminations, contact deterioration, and asymmetric impedance conditions, which can induce localized overheating and accelerate degradation. Vibration-related features further capture mechanical looseness and abnormal operating states that may precede discrete fault events. Together, these findings confirm that the most influential features correspond to well-understood physical degradation processes, reinforcing the interpretability and engineering credibility of the proposed DT–PdM framework.
The confusion-matrix analysis indicates that misclassifications primarily occur between fault categories exhibiting similar symptom signatures under normal operating variability (e.g., thermally driven faults versus load-driven thermal excursions or mild imbalance conditions versus transient operational shifts). These confusion patterns are consistent with domain expectations: when faults present overlapping thermal/electrical manifestations—especially under coarse SCADA sampling—decision boundaries become less separable. A qualitative summary of the most frequent confusion patterns and their likely physical causes is provided in Table 6. This observation highlights where additional sensing (e.g., higher-frequency transients or improved OLTC/DGA coverage) and richer contextual features could further improve class discrimination and generalization under rare operating regimes. In particular, the most challenging classes are those with a low event frequency or early-stage degradation signatures that closely resemble normal operational variability, explaining the residual off-diagonal errors observed in the confusion matrices.
Table 6 reports confusion-pattern statistics derived from the held-out test set of the SS1 dataset, comprising 167,939 fault-labeled samples and 32,081 normal samples. The stacked ensemble achieved zero false negatives across all evaluated fault events, a critical property for safety- and reliability-constrained substation operations where missed fault detection may propagate into cascading equipment damage or forced outages. In contrast, single-model approaches exhibited limited false-negative behavior and were primarily associated with early-stage degradation or transient operating conditions that are only weakly expressed in 1 min SCADA measurements. The ensemble’s superior robustness arises from complementary error compensation across tree-based and neural learners, enabling consistent fault capture across heterogeneous physical mechanisms, including thermal stress evolution, imbalance progression, and mechanically induced anomalies.
6. Conclusions
This study introduced a deployable, Digital Twin–integrated predictive maintenance (DT–PdM) architecture for electrical substations, which is designed to operate under real-world utility constraints. The proposed framework unifies OT–IT connectivity, standardized semantic interoperability, and AI-driven analytics within a cyber-secure and operationally feasible stack. Validation using utility-grade data from the SS1 substation demonstrates that the architecture can support continuous condition monitoring, fault prediction, and decision support within the 60 s supervisory SCADA update cycle, confirming its readiness for practical substation deployment rather than laboratory-scale experimentation. The finalized analytical pipeline combines a compact Deep Neural Network (≈13,665 trainable parameters) with a stacked ensemble strategy, achieving high predictive performance without compromising runtime feasibility.
This work makes several original contributions to the state of the art in DT-enabled predictive maintenance for substation automation:
It introduced a five-layer, standards-aligned DT–PdM architecture that integrates IEC 61850-, CIM-, and OPC UA-based interoperability with cybersecurity-aligned decision support following IEC 62443 principles;
It performed real-world validation on utility operational data comprising approximately one million multivariate records, with 139 confirmed fault events annotated through event-aligned, minute-level labeling, demonstrating feasibility under realistic SCADA/OT constraints;
It designed a hybrid predictive analytical and deployment pathway, showing that a stacked ensemble embedded within the Digital Twin core can achieve high predictive performance (F1-score = 0.98; AUC = 0.995) while maintaining interpretability and runtime feasibility;
It developed an operation-oriented decision layer, linking predictive outputs to maintenance prioritization through composite scoring and cyber-trust-aware governance, enabling actionable and auditable maintenance decisions.
In contrast to existing DT–PdM studies that primarily focus on algorithmic performance or conceptual Digital Twin representations, the proposed framework advances the field by jointly addressing full standards compliance, cybersecurity alignment, and utility-grade deployability within a unified architecture. The integration of standardized semantic layers, cyber-trust considerations, uncertainty-aware analytics, and real operational validation distinguishes this work from prior approaches that remain limited to partial interoperability, offline analysis, or simulation-only evaluations.
The proposed architecture is intended for human-supervised deployment in safety-critical substations, providing explainable recommendations rather than autonomous actuation, and therefore represents a future-proof and scalable DT–PdM solution capable of gradual extension toward fleet-level Digital Twin federation and adaptive maintenance strategies. The main limitations and generalization considerations are summarized in the discussion, along with mitigation pathways through the proposed future work, providing a transparent roadmap for extending the framework to broader asset classes, higher-frequency sensing, and multi-substation deployments.
Future Work
Future research will focus on extending the proposed Digital Twin-enabled predictive maintenance (DT–PdM) architecture along several complementary directions to enhance adaptability, interpretability, and scalability in large-scale power-system deployments. First, fleet-level Digital Twin federation across multiple substations will be investigated using privacy-preserving and federated learning strategies, enabling knowledge sharing while respecting data confidentiality and regulatory constraints. Second, reinforcement learning techniques will be explored to support adaptive maintenance scheduling and decision optimization under uncertain and dynamically evolving operational conditions, building upon the predictive outputs generated by the current ensemble-based framework. Third, physics-informed neural networks (PINNs) and graph-based learning approaches, which explicitly embed physical degradation laws, thermal–electrical constraints, and network topology into the learning process, will be considered, thereby improving model interpretability and reducing reliance on purely data-driven representations.
These techniques are not implemented in the present study and are identified as future extensions beyond the current experimental scope. In addition, future Digital Twin extensions will incorporate environmental and economic dimensions, including emissions awareness, asset utilization efficiency, and cost–risk trade-offs, to support sustainability-oriented maintenance planning. Finally, the development of automated compliance auditing and continuous certification-support engines will be pursued to ensure long-term alignment with evolving interoperability and cybersecurity standards throughout OT–IT system evolution.
Future work will also include releasing security-reviewed artifacts (e.g., anonymized subsets and configuration files) to improve reproducibility and facilitate benchmarking under critical-infrastructure constraints.