A Standards-Aligned Hybrid AI–Digital Twin Framework for Robust Predictive Maintenance Under Data Scarcity

Park, Dongwook; Jeong, Jaeyoung; Kang, Jiwon; Shin, Dongkyoo

doi:10.3390/app16115303

Open Access Article


¹ Department of Computer Engineering, Sejong University, Seoul 05006, Republic of Korea

² Defense AI Cyber Convergence Research Institute, Sejong University, Seoul 05006, Republic of Korea

³ Department of Convergence Engineering for Intelligent Drones, Sejong University, Seoul 05006, Republic of Korea

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5303; https://doi.org/10.3390/app16115303

Submission received: 8 April 2026 / Revised: 17 May 2026 / Accepted: 20 May 2026 / Published: 25 May 2026

(This article belongs to the Special Issue AI- and Digital Twin-Driven Intelligent Diagnostics and Predictive Maintenance for Transportation Systems)


Abstract

This paper proposes a standards-aligned hybrid artificial intelligence–digital twin (DT) framework for predictive maintenance (PdM) in the maritime domain under data scarcity and heterogeneous sensor environments. The framework adopts a DT-ready reference architecture centered on an ISO 19848-aligned data contract that enables consistent signal naming across vessels and equipment. On this foundation, the prognostics module is designed as a Domain-Knowledge Enhanced LSTM (DK-LSTM), a constraint-regularized sequence model in which three domain-informed constraints, namely (i) RUL non-negativity, (ii) monotonic degradation, and (iii) operating-range upper bounds, are formulated within the learning objective. Constraints (i) and (iii) are active throughout, while constraint (ii) is reserved for future work because the batch-sort approximation it relies on is structurally limited in single-output architectures. An asymmetric safety penalty further suppresses hazardous over-predictions. Scenario-based virtual experiments are conducted on the NASA C-MAPSS turbofan degradation benchmark under (1) sensor missingness, handled via masking indicators, and (2) structural domain shift, comprising operational-condition shift (E3a: FD001 → FD002) and fault-mode shift (E3b: FD001 → FD003). Following a systematic ablation of loss weights and stabilization techniques with multi-seed verification (seeds 0, 42, 123), the final stabilized configuration (DK-LSTM-v4) delivers robust safety-critical predictions in the zero-shot domain-shift scenarios, improving the NASA Score over the strongest baseline (GRU) by 43.7% under E3a and by 20.8% under E3b. The model trades modest in-domain performance for substantial cross-domain robustness, in line with a core requirement of safety-critical maritime and defense applications, where target-domain training data are unavailable.

Keywords:

predictive maintenance; LSTM; ISO 19848; digital twin; NASA C-MAPSS; sensor missingness; domain shift; multi-seed verification; loss function tuning; reproducibility; deep learning

1. Introduction

Modern vessels and offshore infrastructure systems serve as critical assets in global supply chains and energy transportation, with ever-increasing demands for safety, reliability, and availability. The transition from reactive maintenance to predictive maintenance (PdM) has accordingly emerged as a key challenge for preventing unexpected downtime, reducing operational costs, and ensuring crew safety [1]. In particular, condition-based maintenance (CBM) for key shipboard equipment such as main propulsion systems and generators can substantially improve resource efficiency compared to traditional maintenance practices [2].

However, data-driven PdM implementation in the maritime domain faces two fundamental challenges. First, actual vessel failure data are extremely scarce and are predominantly classified as proprietary information. Whereas the aviation (NASA C-MAPSS [3]), rolling-bearing (PRONOSTIA [4]), and manufacturing (PHM Data Challenge [5]) domains have established public benchmarks, publicly available marine machinery degradation datasets are severely limited [6]. This data scarcity problem has been reported as a prevailing issue across PHM applications in general [7]. Second, heterogeneous sensor systems installed across diverse vessels often operate without standardized data labeling, impeding the scalability of model training and deployment across multiple equipment types and fleets [8].

Digital twin (DT) technology offers a promising solution to these challenges. The convergence of AI and DT enables the integration of diagnostic and predictive capabilities with physics-based virtual assets [9,10], and DT-based PdM architectures emphasize the importance of standards-based data layers [11]. However, as Mauro and Kana [12] highlight in their review of ship life-cycle digital twins, many DT implementations lack genuine bidirectional data exchange between physical and virtual entities, and the term “DT” may be overused. Accordingly, the present study targets not a “full DT” but a DT-ready workflow equipped with a standards-aligned data layer and an auditable prognostics layer.

This study proposes a hybrid AI–DT workflow that integrates an ISO 19848-aligned data contract [13] with a domain-knowledge enhanced deep learning model, and employs the NASA C-MAPSS benchmark as proxy data for scenario-based robustness evaluation against maritime conditions.

Beyond recurrent networks, recent prognostics research has explored diverse architectures including Transformer-based attention models [14], hybrid CNN-LSTM frameworks [15], deep convolutional networks [16], and physics-informed neural networks [17,18]. While these advanced models often achieve lower RMSE on benchmark datasets, the present study focuses on isolating the contribution of domain-knowledge constraints through a controlled comparison, with multi-seed verification across the core domain-shift scenarios.

The remainder of this paper is organized as follows. Section 2 reviews related work spanning maritime PdM and data availability, DT-based PdM architectures, physics-/domain-knowledge integrated RUL prediction, missingness-aware sequence modeling, and the NASA C-MAPSS benchmark. Section 3 presents the proposed process, including the DT-ready reference architecture, the ISO 19848-aligned data contract, the DK-LSTM model and its domain-knowledge enhanced loss function, and the missingness handling strategy. Section 4 describes the experimental setup, including datasets, preprocessing, scenario design, evaluation metrics, and reproducibility provisions. Section 5 reports the experimental results across all four scenarios. Section 6 discusses the implications and limitations of the findings, and Section 7 concludes the paper with a summary and directions for future research.

2. Related Works

2.1. Maritime Predictive Maintenance and Data Availability

Section 1 introduced the twofold challenge of maritime PdM (data scarcity and heterogeneous sensors) and the rationale for a DT-ready, standards-aligned approach. This section reviews the relevant literature in five areas and, for each, identifies the gap that motivates the present work.

The maritime PdM field has advanced substantially over the past decade, yet the absence of publicly available vessel degradation datasets remains a key bottleneck. Kalafatelis et al. [6] surveyed component-level PdM needs and the diversity of data-driven architectures in the maritime industry, noting that practical barriers to deployment across heterogeneous vessel subsystems remain substantial. As publicly available failure datasets for ship propulsion systems are extremely limited, the majority of maritime PHM studies rely on public benchmarks as proxies [19]. Lazakis et al. [1] combined analytical reliability-based approaches with artificial neural networks for vessel machinery condition prediction, and Velasco-Gallego and Lazakis [19] developed a real-time intelligent anomaly detection system for marine machinery fault diagnosis. In particular, Velasco-Gallego and Lazakis [2] explicitly addressed the handling of missing sensor data as a practical challenge in maritime CBM/PdM, showing that missing-data imputation for marine sensor streams is a key preprocessing concern for real-time decision support.

While these studies establish the importance of data-driven PdM in the maritime domain, none integrates a standards-aligned data contract layer with a constraint-regularized prognostic model under multi-seed verification. The present work fills this gap.

2.2. Digital Twin-Based Predictive Maintenance Architectures

DT-based PdM reference architectures present a multi-layered structure comprising physical assets, data connectivity, virtual models, and service layers [11,20]. Van Dinter et al. [11] argued that reference architectures serve as blueprints for consistent application architecture design when DT-based PdM is integrated into complex systems. Furthermore, van Dinter et al. [21] reported that the maturity gap of the data integration layer was a key finding in their systematic literature review connecting DT and predictive maintenance. Chen et al. [22] further examined the role of machine learning within DT-based PdM systems, reinforcing the need for mature data integration as a prerequisite for effective AI–DT convergence. In the maritime domain, Mauro and Kana [12] conducted a critical systematic review of ship life-cycle DTs, addressing state synchronization, standards-based data exchange, and computational complexity, and warned that the term “DT” may be misused when bidirectional physical–virtual data exchange is absent. Yang and Xiang [23] reviewed emerging trends in maritime digital twins, identifying data interoperability and standardization as persistent gaps across the sector. Fonseca et al. [24] reported a standards-based ship digital twin case study grounded in ISO 19847/19848, demonstrating that data standards are meaningful for scalable DT implementations.

Accordingly, this study posits the ISO 19848-aligned data contract as a necessary sub-component, while explicitly scoping the work to a DT-ready workflow rather than a fully synchronized physical–virtual loop.

2.3. Physics-/Domain-Knowledge Integrated RUL Prediction

Integrating physical constraints into neural network loss functions enables physically consistent predictions even under data-sparse conditions [25,26]. The review by Li et al. [25] on physics-informed data-driven RUL prediction noted that purely data-driven methods may produce physically infeasible or inconsistent predictions, and that embedding physics/domain knowledge can reduce data quality and quantity requirements. Raissi et al. [17] established the physics-informed neural network (PINN) paradigm, a representative approach that incorporates physical laws as constraints (e.g., soft penalties) into the learning objective, which has since been extended to prognostics applications. Liao et al. [18] further demonstrated the integration of self-attention mechanisms with physics-informed neural networks for RUL prediction, achieving improved prognostic accuracy under constrained data conditions. Lu et al. [27] specifically formulated monotonic decrease and non-negativity constraints as regularization terms in an LSTM-based RUL model for power electronic devices. Arias Chao et al. [28] demonstrated that fusing a physics-based thermodynamic model with deep learning extends the prediction horizon on the N-CMAPSS dataset.

This study combines non-negativity, operating-range upper bounds, and an asymmetric safety penalty within a single loss function, and links this to an ISO 19848-aligned data contract. Rather than inserting complex governing equations, the approach adopts constraint losses reflecting “safe prediction boundaries,” yielding a practical formulation tailored to maritime safety requirements.

Meanwhile, for RUL prediction of safety-critical electromechanical devices, Hu et al. [29] proposed a spatiotemporal attention-based multi-branch network and reported strong performance in long-term multivariate prediction, and the study in [30] integrated physical constraints (band energy preservation) into an adaptive style transfer network, demonstrating data efficiency in small-sample fault diagnosis when physical information is scarce.

Building on Lu et al. [27], the present work extends the constraint-regularization paradigm by (1) systematically ablating the constraint weights and stabilization techniques and (2) verifying multi-seed robustness under zero-shot domain-shift conditions, which are not addressed in prior LSTM-based constraint-regularization studies.

2.4. Missingness-Aware Sequence Modeling

The notion that missingness patterns themselves may be informative was explicitly addressed in the GRU-D study by Che et al. [31]. GRU-D directly integrates masking and temporal gap information into the RNN structure, presenting a principled direction for missingness handling beyond simple imputation.

This serves as the academic basis for the E2 (30% missingness) scenario design of the present study, and simultaneously provides the rationale for future research directions in response to the inferior E2 performance of DK-LSTM.

Furthermore, Li et al. [32] addressed RUL prediction under partial sensor malfunctions using deep adversarial networks, confirming that sensor degradation and missing observations remain a practical challenge for prognostics in real-world deployment. Similarly, Hu et al. [33] addressed simultaneous fault diagnosis of sensors and equipment in autonomous rail systems, showing that diagnosis must remain reliable even when the sensors themselves are faulty or missing.

2.5. NASA C-MAPSS Benchmark

NASA C-MAPSS provides turbofan engine simulation data encompassing diverse operational conditions and fault modes, and serves as the de facto standard benchmark in the PHM field [3]. Heimes [34] provided one of the earliest applications of recurrent neural networks for RUL estimation on C-MAPSS, establishing the foundation for subsequent deep learning approaches. Ramasso and Saxena [35] benchmarked multiple prognostic methods on C-MAPSS subsets and confirmed that the PHM’08 asymmetric scoring function is one of the most widely used metrics. The piecewise linear RUL labeling convention was popularized by Zheng et al. [36]. Since shipboard main propulsion gas turbines and large diesel engines share thermomechanical degradation mechanisms with aircraft turbofan engines (high temperature, high pressure, and high-speed rotation) [25], C-MAPSS serves as a suitable proxy for testing the generalization capability of maritime PHM algorithms.

The cross-evaluation protocol employed in this study—FD001 → FD002 (operational-condition shift) and FD001 → FD003 (fault-mode shift)—represents a standard approach in domain generalization research [35,37].

3. Proposed Process

The proposed process consists of four interconnected layers, depicted schematically in Figure 1. The remainder of this section details each layer.

3.1. DT-Ready Reference Architecture

The reference architecture of the AI–DT workflow proposed in this study comprises four layers. This layer separation is consistent with the discussion by van Dinter et al. [11] on the role of reference architectures in DT-based PdM.

(1) Physical Layer: Encompasses shipboard equipment and sensor interfaces. Generates raw telemetry, which may be noisy, subject to packet loss, and tagged with manufacturer-specific identifiers.

(2) Data Contract Layer: Transforms heterogeneous signals into structured time-series formats through ISO 19848-aligned channel mapping. Ensures that indicators such as temperature, pressure, and speed can be uniformly identified regardless of vessel source.

(3) Prognostics Layer: Embeds the DK-LSTM-based RUL prediction module.

(4) Service Layer: Links prediction outputs to maintenance decision-making, explicitly accounting for costs (unexpected failure vs. premature replacement).

Each layer has a clearly defined input/output interface: the physical layer outputs raw vendor-specific telemetry; the data contract layer outputs ISO 19848-aligned channel-identified time-series; the prognostics layer consumes this structured input and produces RUL estimates whose robustness is assessed through multi-seed verification (Section 5); and the service layer translates RUL estimates into maintenance decisions weighted by the asymmetric cost of failure versus premature replacement. This separation enables modular substitution—e.g., replacing the prognostics layer’s DK-LSTM with an alternative model without affecting the data contract above or the service logic below.
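To make these interface boundaries concrete, the minimal Python sketch below wires the four layers together through narrow function signatures. All class names, the vendor tag `VENDOR_T3`, and the fixed-threshold decision rule are hypothetical; a real service layer would weigh the cost of unexpected failure against premature replacement rather than compare against a constant.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol

VendorRecord = Dict[str, float]    # physical layer: vendor tag -> value
ContractRecord = Dict[str, float]  # data contract layer: channel name -> value


class PrognosticsModel(Protocol):
    """Any RUL model the prognostics layer can host (DK-LSTM or an alternative)."""

    def predict_rul(self, window: List[ContractRecord]) -> float: ...


@dataclass
class DataContractLayer:
    tag_map: Dict[str, str]  # hypothetical vendor tag -> ISO 19848-aligned channel name

    def to_contract(self, raw: VendorRecord) -> ContractRecord:
        # Rename mapped tags; tags outside the contract are dropped.
        return {self.tag_map[k]: v for k, v in raw.items() if k in self.tag_map}


@dataclass
class ServiceLayer:
    threshold_cycles: float = 30.0  # hypothetical stand-in for a cost-based rule

    def decide(self, rul: float) -> str:
        return "schedule_maintenance" if rul <= self.threshold_cycles else "continue_operation"


def run_pipeline(raw_window: List[VendorRecord], contract: DataContractLayer,
                 model: PrognosticsModel, service: ServiceLayer) -> str:
    """Physical-layer records -> data contract -> prognostics -> service decision."""
    window = [contract.to_contract(r) for r in raw_window]
    return service.decide(model.predict_rul(window))


class ConstantModel:
    """Placeholder model, used only to show that the prognostics layer is swappable."""

    def __init__(self, rul: float) -> None:
        self.rul = rul

    def predict_rul(self, window: List[ContractRecord]) -> float:
        return self.rul
```

Because `run_pipeline` depends only on the `predict_rul` signature, replacing `ConstantModel` with a DK-LSTM wrapper leaves the contract and service layers untouched.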

3.2. ISO 19848-Aligned Data Contract

ISO 19848:2024 provides naming and description conventions for shipboard machinery and equipment sensor data [13]. This study defines a reference schema in JSON format that maps C-MAPSS sensors to shipboard physical quantities, comprising a total of 16 channels. The target variable (RUL) includes physical constraint specifications (non-negative, upper bound = 125 cycles).
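A minimal sketch of what an excerpt of such a JSON schema might look like, loaded with Python's standard `json` module. Only two of the 16 channels are shown, reusing the channel names from the naming examples in this section; the field names (`source`, `channel`, `quantity`, `target`) are assumptions, not the authors' schema. The helper flags target values that fall outside the declared RUL range.

```python
import json

# Hypothetical two-channel excerpt of the reference schema; field names are assumed.
CONTRACT_JSON = """
{
  "channels": [
    {"source": "s_4",  "channel": "HPC_OutletTemperature", "quantity": "temperature"},
    {"source": "s_12", "channel": "HPT_OutletPressure",    "quantity": "pressure"}
  ],
  "target": {"name": "RUL", "unit": "cycles", "min": 0, "max": 125}
}
"""


def load_contract(text: str) -> dict:
    return json.loads(text)


def rul_violations(contract: dict, values: list) -> list:
    """Return indices of RUL values outside the contract's [min, max] range."""
    lo, hi = contract["target"]["min"], contract["target"]["max"]
    return [i for i, y in enumerate(values) if not lo <= y <= hi]


contract = load_contract(CONTRACT_JSON)
```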

This implementation is a reference mapping inspired by the naming principles of ISO 19848, and does not claim full technical compliance with the ISO 19848 specification. Full compliance would require additional steps including formal channel registration with classification societies, data quality verification procedures, and integration with onboard data infrastructure conforming to ISO 19847 (ship data servers) [38].

Table 1 summarizes the ISO 19848-aligned channel mapping schema, including the selected C-MAPSS features, units, semantic channel names, maritime vessel counterparts, and the RUL target constraints.

The “Maritime Vessel Counterpart” column provides reference information indicating which measurement points on a ship’s main propulsion gas turbine or large diesel engine physically correspond to the C-MAPSS turbofan sensors, and is included to visually demonstrate semantic interoperability in heterogeneous sensor environments.

The ISO 19848 mapping process is carried out in three steps.

First, physical quantity classification: of the 26 raw columns in the C-MAPSS data (unit ID, time cycle, three operational settings, and 21 sensors), the unit and cycle identifiers are set aside and the channels with zero or near-zero variance (s_1, s_5, s_6, s_10, s_16, s_18, s_19, and setting_3) are excluded. The remaining 16 features are classified into six physical quantity categories: temperature, pressure, efficiency, flow, ratio, and operational condition.
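Step 1 can be sketched as a variance filter over the standard 26-column C-MAPSS layout. The tolerance `tol` is an assumed value chosen so that near-constant channels are dropped along with strictly constant ones; the exact exclusion criterion is not specified in the text.

```python
import numpy as np

# Standard C-MAPSS layout: unit ID, cycle, 3 operational settings, 21 sensors (26 columns).
COLUMNS = (["unit", "cycle"] + [f"setting_{i}" for i in range(1, 4)]
           + [f"s_{i}" for i in range(1, 22)])
ID_COLUMNS = {"unit", "cycle"}


def select_features(data: np.ndarray, columns=COLUMNS, tol: float = 1e-4) -> list:
    """Return non-identifier column names whose variance exceeds `tol` (assumed threshold)."""
    return [name for j, name in enumerate(columns)
            if name not in ID_COLUMNS and np.var(data[:, j]) > tol]
```

Applied to FD001 training data, this filter would be expected to return the 16 features listed above, provided `tol` is set so that the near-constant channels fall below it.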

Second, ISO 19848 naming convention application: each feature is assigned a hierarchical channel identifier following the “component_location_quantity” pattern. For instance, s_4 (high-pressure compressor outlet temperature) is named HPC_OutletTemperature, and s_12 (high-pressure turbine outlet pressure) is named HPT_OutletPressure. This pattern references the hierarchical DataChannelID naming structure recommended by ISO 19848 [13,24].
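Step 2 can be sketched as a small identifier builder with a format check. The regular expression encodes only the simplified “component_location_quantity” pattern described in the text (for example, HPC_OutletTemperature); the Local ID syntax defined by ISO 19848 itself is considerably more elaborate, and the `MAPPING` entries simply restate the two examples above.

```python
import re

# Assumed pattern: uppercase component code, underscore, then CamelCase location + quantity.
CHANNEL_PATTERN = re.compile(r"^[A-Z][A-Z0-9]*_(?:[A-Z][a-z]+)+$")


def channel_id(component: str, location: str, quantity: str) -> str:
    """Compose a 'component_location_quantity' identifier and validate its format."""
    name = f"{component.upper()}_{location.capitalize()}{quantity.capitalize()}"
    if not CHANNEL_PATTERN.fullmatch(name):
        raise ValueError(f"identifier does not follow the naming pattern: {name}")
    return name


# Illustrative source column -> (component, location, quantity), restating the text's examples.
MAPPING = {"s_4": ("HPC", "outlet", "temperature"), "s_12": ("HPT", "outlet", "pressure")}
names = {src: channel_id(*parts) for src, parts in MAPPING.items()}
```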

Third, maritime equipment semantic mapping: based on the correspondence of thermodynamic components (compressor, turbine, combustor) shared between aircraft turbofan engines and shipboard gas turbines/large diesel engines [25], each ISO 19848 channel is mapped to the physical quantity it would measure on actual shipboard equipment.

This three-step process is designed such that semantic consistency can be maintained by repeating the same procedure when new vessel equipment types or heterogeneous sensor systems are added.

3.3. DK-LSTM PdM Model

DK-LSTM is a lightweight sequence model that combines a standard LSTM encoder with a domain-knowledge-based loss function. The input is a normalized sensor sequence with sliding window length L = 30, and the output is a single scalar RUL estimate. The single-output structure (return_sequences = False) was selected for deployment efficiency.

$$\mathrm{LSTM}(64) \to \mathrm{Dropout}(0.2) \to \mathrm{LSTM}(32) \to \mathrm{Dropout}(0.2) \to \mathrm{Dense}(32,\ \mathrm{ReLU}) \to \mathrm{Dense}(1,\ \mathrm{Linear}) \tag{1}$$
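As a rough check on the “lightweight” claim, the trainable-parameter count of Eq. (1) can be worked out by hand. The sketch assumes Keras layer conventions (one bias vector per LSTM gate; Dropout adds no weights), with 16 input features in E1/E3 and 32 in E2 once the mask indicators are appended.

```python
def lstm_params(input_dim: int, units: int) -> int:
    # Keras-style LSTM: 4 gates, each with an input kernel, a recurrent kernel and one bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim: int, units: int) -> int:
    return input_dim * units + units


def dk_lstm_params(n_features: int) -> int:
    # Dropout layers contribute no trainable parameters.
    return (lstm_params(n_features, 64) + lstm_params(64, 32)
            + dense_params(32, 32) + dense_params(32, 1))


print(dk_lstm_params(16))  # 34241 (E1/E3 inputs)
print(dk_lstm_params(32))  # 38337 (E2 inputs: features + masks)
```

PyTorch's `nn.LSTM` keeps two bias vectors per gate, so the same stack would report 4 × units additional parameters per LSTM layer there.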

As a key design decision for fair comparison, a linear activation function (activation = ‘linear’) is applied to the final output layer of all models (Standard LSTM, GRU, DK-LSTM). This rules out structurally blocking negative outputs via ReLU or similar activations, so the models can satisfy the physical constraints only through the penalties embedded in the loss function; the loss function thus becomes the sole independent variable.

The hidden dimensions follow a stepwise compression pattern (LSTM 64 → 32, Dense 32 → 1), reflecting standard practice in C-MAPSS RUL prediction literature [34,36] for balancing representational capacity against overfitting risk in single-engine-unit time-series. The 32-node penultimate dense layer is consistent with the latent representation capacity of the second LSTM layer. No additional hyperparameter tuning was performed, to maintain fair comparison with baseline models.

Before presenting the detailed loss formulation, Algorithm 1 summarizes the optimization procedure used to train the DK-LSTM model. For multi-seed verification, the same procedure was repeated independently using seeds 0, 42, and 123.

Algorithm 1. DK-LSTM training procedure with a constraint-regularized objective.

Input: Training set {(Xi, yi)}, validation set {(Xv, yv)}, learning rate η,
       maximum epochs E, patience P = 15, gradient clipping threshold c,
       constraint weights {λsafety, λneg, λmono, λupper}, random seed s

Output: Trained model parameters θ

1:  Initialize model parameters θ using seed s
2:  for epoch = 1 to E do
3:      for each mini-batch (Xb, yb) do
4:          ŷb ← fθ(Xb)
5:          LMSE    ← mean((yb − ŷb)²)
6:          Lsafety ← asymmetric_penalty(yb, ŷb)          // weights of Eq. (5)
7:          Lneg    ← mean(ReLU(−ŷb)²)                    // Constraint (i)
8:          Lmono   ← batch_sort_violation(yb, ŷb)        // Constraint (ii), reserved for future work
9:          Lupper  ← mean(ReLU(ŷb − 125)²)               // Constraint (iii)
10:         Ltotal  ← LMSE + λsafety·Lsafety + λneg·Lneg + λmono·Lmono + λupper·Lupper
11:         θ ← Adam_step(θ, ∇θLtotal, η, clipnorm = c)
12:     end for
13:     Evaluate validation MAE on {(Xv, yv)}
14:     if validation MAE shows no improvement for P epochs then
15:         break
16:     end if
17: end for
18: return θ
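Steps 13–16 of Algorithm 1 amount to patience-based early stopping on validation MAE. A framework-free sketch of that rule follows; `min_delta` is an added assumption (default 0.0, so any decrease counts as an improvement).

```python
class EarlyStopping:
    """Stop when validation MAE has not improved for `patience` consecutive epochs."""

    def __init__(self, patience: int = 15, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = -1
        self.wait = 0

    def step(self, epoch: int, val_mae: float) -> bool:
        """Record one epoch's validation MAE; return True when training should stop."""
        if val_mae < self.best - self.min_delta:
            self.best, self.best_epoch, self.wait = val_mae, epoch, 0
            return False
        self.wait += 1
        return self.wait >= self.patience
```

In TensorFlow 2.15 the closest built-in is `tf.keras.callbacks.EarlyStopping(monitor="val_mae", patience=15)`, assuming the model is compiled with an `"mae"` metric; the text does not state whether `restore_best_weights` was enabled.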

Domain-Knowledge Enhanced Loss Function

The learning objective of DK-LSTM comprises a data loss term, a practical safety loss term, and domain-informed constraint terms.

$$L_{\text{total}} = L_{\text{MSE}} + \lambda_{\text{safety}} L_{\text{safety}} + \lambda_{\text{neg}} L_{\text{neg}} + \lambda_{\text{mono}} L_{\text{mono}} + \lambda_{\text{upper}} L_{\text{upper}} \tag{2}$$

Constraint (i) Non-negativity: RUL is physically non-negative.

$$L_{\text{neg}} = \mathbb{E}\left[\mathrm{ReLU}(-\hat{y})^{2}\right], \qquad \lambda_{\text{neg}} = 1.0 \tag{3}$$

Constraint (ii) Monotonic Degradation: Predicted RUL should decrease monotonically along the degradation trajectory. Since DK-LSTM is a single-output model, direct intra-sequence comparison is infeasible; a batch-sort approximation was therefore adopted, in which target values are sorted in descending order within each mini-batch and any increase in the correspondingly ordered predictions is penalized. However, experiments showed that in domain-shift environments with mixed operational conditions, such as FD002 (six operating conditions), the batch-sort approximation distorts the physical temporal ordering of the time series and destabilizes optimization. Accordingly, this component was simplified in the final experimental version, and only constraints (i) and (iii) are active in the reported results. Trajectory-level monotonicity enforcement is deferred to future work.

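For reference, the batch-sort approximation can be sketched in NumPy as below. The squared-hinge form of the penalty is an assumption; the text specifies only that targets are sorted in descending order within the mini-batch and that increases in the correspondingly ordered predictions are penalized. Because a mini-batch mixes windows from different engines and operating conditions, this sort order need not match any single engine's timeline, which is the distortion noted above.

```python
import numpy as np


def batch_sort_violation(y_true, y_pred) -> float:
    """Batch-sort monotonicity penalty (squared-hinge form assumed).

    Samples are ordered by descending true RUL; along that order the predicted RUL
    should not increase, so each positive step pred[k+1] - pred[k] is penalized.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(-y_true, kind="stable")  # indices by descending true RUL
    steps = np.diff(y_pred[order])              # pred[k+1] - pred[k] in that order
    if steps.size == 0:
        return 0.0
    return float(np.mean(np.maximum(steps, 0.0) ** 2))
```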
Constraint (iii) Operating-Range Upper Bound: A penalty is applied when the predicted value exceeds MAX_RUL (=125). This constraint prevents the model from generating unrealistically high RUL predictions when encountering out-of-distribution noise.

$$L_{\text{upper}} = \mathbb{E}\left[\mathrm{ReLU}(\hat{y} - 125)^{2}\right], \qquad \lambda_{\text{upper}} = 1.0 \tag{4}$$

Safety Penalty: An asymmetric penalty is applied with a weight of 2.0 for over-predictions and 0.2 for under-predictions. In safety-critical maritime environments, RUL over-prediction directly translates to delayed maintenance timing and thus to equipment failure risk; consequently, over-prediction is penalized more heavily than under-prediction.

$$w(d) = \begin{cases} 2.0, & d > 0 \ \text{(over-prediction)} \\ 0.2, & d < 0 \ \text{(under-prediction)} \end{cases}, \qquad d = \hat{y} - y \tag{5}$$
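Putting Eqs. (2)–(5) together, a NumPy sketch of the loss as reported is shown below, with the monotonicity term omitted because only constraints (i) and (iii) are active. The form of L_safety is an assumption: the text gives the asymmetric weights but not the base error, so an asymmetrically weighted squared error with d = ŷ − y is used here. A training implementation would express the same operations with TensorFlow ops so that gradients flow.

```python
import numpy as np

MAX_RUL = 125.0


def dk_loss(y_true, y_pred, lam_safety=1.0, w_over=2.0, w_under=0.2,
            lam_neg=1.0, lam_upper=1.0) -> float:
    """Eq. (2) without the monotonicity term (lambda_mono = 0).

    Defaults follow the original DK-LSTM configuration; the weighted-squared-error
    form of the safety term is an assumption.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    d = y_pred - y_true                                        # d > 0: over-prediction
    l_mse = np.mean(d ** 2)
    l_safety = np.mean(np.where(d > 0, w_over, w_under) * d ** 2)
    l_neg = np.mean(np.maximum(-y_pred, 0.0) ** 2)             # constraint (i)
    l_upper = np.mean(np.maximum(y_pred - MAX_RUL, 0.0) ** 2)  # constraint (iii)
    return float(l_mse + lam_safety * l_safety + lam_neg * l_neg + lam_upper * l_upper)
```

Under these assumptions, the DK-LSTM-v2 setting of Section 4.5 would correspond to `dk_loss(..., lam_safety=0.5, w_over=1.5, w_under=0.5)`.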

Table 2 summarizes the composition of the DK-LSTM loss function, including the data loss, asymmetric safety penalty, and the three domain-informed constraint terms.

3.4. Missingness Handling

The sensor data missingness rate for the E2 scenario is set at 30%. Forward-fill imputation is applied, combined with a binary masking indicator that expands the input dimension from 16 to 32. This borrows from GRU-D [31] the key idea of explicitly leveraging missingness information, while the same input representation is applied to all baseline models to ensure a fair comparison.

Limitation: The initialization of leading consecutive missing values using the first valid value may allow future-timepoint information to leak into the past, which deviates from strictly causal imputation. Accordingly, this study designates the approach as “leakage-minimized” rather than “leakage-free.” Fully leak-proof strategies, such as learnable tokens or constant initialization, are deferred to future work.
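The forward-fill-plus-mask preprocessing can be sketched as follows. The (T, F) array layout, the zero fill for a feature with no observed value, and the element-wise random (MCAR) injection of 30% missingness in the usage lines are assumptions; the leading-gap rule reproduces the non-causal first-valid-value initialization described in the limitation above.

```python
import numpy as np


def ffill_with_mask(x) -> np.ndarray:
    """Forward-fill NaNs per feature and append binary observation masks.

    x: (T, F) array with NaN marking missing readings.
    Returns (T, 2F): [imputed features | masks], where mask = 1 if observed.
    """
    x = np.asarray(x, dtype=float)
    mask = (~np.isnan(x)).astype(float)
    out = x.copy()
    for f in range(out.shape[1]):
        col = out[:, f]                       # view into `out`
        valid = np.flatnonzero(~np.isnan(col))
        if valid.size == 0:
            col[:] = 0.0                      # assumed fill for a fully missing feature
            continue
        col[: valid[0]] = col[valid[0]]       # leading gap: first valid value (non-causal)
        for t in range(valid[0] + 1, col.size):
            if np.isnan(col[t]):
                col[t] = col[t - 1]           # forward fill
    return np.concatenate([out, mask], axis=1)


# Usage: 30% element-wise missingness on one 30-step window of 16 features (assumed MCAR).
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 16))
x[rng.random(x.shape) < 0.3] = np.nan
features = ffill_with_mask(x)                 # shape (30, 32)
```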

4. Experiments

4.1. Datasets: NASA C-MAPSS Proxy Benchmark

In the absence of publicly available maritime-specific failure datasets, this study employs the NASA C-MAPSS turbofan degradation dataset as a proxy benchmark [3]. C-MAPSS is used as a proxy because of the similarity of rotating machinery degradation patterns [25] and its established standing as a multi-condition, multi-fault benchmark in the PHM field. Three subsets are used:

FD001: single operating condition, one fault mode (HPC degradation);
FD002: six operating conditions, one fault mode (HPC degradation);
FD003: single operating condition, two fault modes (HPC and fan degradation).

4.2. Data Preprocessing

After the unit and cycle identifiers are set aside, 16 features with non-negligible variance are selected from the 24 setting and sensor columns. RUL labels are generated using piecewise linear labeling (MAX_RUL = 125), following standard practice for C-MAPSS benchmarks [36] and ensuring consistency with constraint (iii). The MinMax scaler is fitted only on the FD001 training data and then applied unchanged to the validation and test data, including FD002/FD003, to prevent information leakage at the scaling stage.
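The labeling and scaling steps can be sketched in NumPy as below. The per-unit RUL convention `last_cycle − cycle` (so the final recorded cycle has RUL 0) and the guard for constant columns are assumptions; in scikit-learn 1.3 the scaler corresponds to `sklearn.preprocessing.MinMaxScaler`, fitted on FD001 training data and reused through `transform` for the other sets.

```python
import numpy as np

MAX_RUL = 125


def piecewise_rul(cycles, max_rul: int = MAX_RUL) -> np.ndarray:
    """Piecewise-linear RUL for ONE run-to-failure unit: min(max_rul, last_cycle - cycle)."""
    cycles = np.asarray(cycles)
    return np.minimum(max_rul, cycles.max() - cycles)


def fit_minmax(train: np.ndarray):
    """Column-wise min and range, estimated on FD001 training data only."""
    lo, hi = train.min(axis=0), train.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return lo, scale


def apply_minmax(x: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Shifted domains (e.g. FD002 operating conditions) may fall outside [0, 1].
    return (x - lo) / scale
```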

Time-dependent samples are handled through a sliding window of length L = 30, where each window represents a contiguous sequence of sensor observations. Windows are constructed within each engine unit independently to preserve temporal causality, ensuring that no information from future time steps leaks into the input of past predictions.
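A sketch of per-unit window construction: windows are cut separately for each engine, so no window spans two units, and each target is the RUL at the window's last time step (an assumed convention). Rows are assumed to be sorted by cycle within each unit, and units shorter than L = 30 are skipped here, since the text does not say how they are handled.

```python
import numpy as np


def make_windows(features, units, rul, window: int = 30):
    """Build (N, window, F) inputs and (N,) targets without crossing unit boundaries."""
    features = np.asarray(features, dtype=float)
    units = np.asarray(units)
    rul = np.asarray(rul, dtype=float)
    xs, ys = [], []
    for u in np.unique(units):
        idx = np.flatnonzero(units == u)           # this unit's rows, assumed cycle-ordered
        for end in range(window, idx.size + 1):
            rows = idx[end - window:end]
            xs.append(features[rows])
            ys.append(rul[rows[-1]])               # target: RUL at the last step
    if not xs:
        return np.empty((0, window, features.shape[1])), np.empty(0)
    return np.stack(xs), np.asarray(ys)
```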

4.3. Experimental Procedure

E1 Scenario (FD001 Baseline Performance): Baseline comparison of Standard LSTM, GRU, and DK-LSTM under normal conditions. Input window L = 30.
E2 Scenario (Sensor Missingness Robustness): Robustness evaluation under 30% missingness. Forward-fill leakage-minimized imputation + binary mask indicator (input dimension: 16 → 32).
E3a Scenario (Operational-Condition Shift): FD001 training → FD002 testing. Single condition → six-condition distribution shift.
E3b Scenario (Fault-Mode Shift): FD001 training → FD003 testing. Single fault → two-fault distribution shift.

Table 3 summarizes the four evaluation scenarios, their train/test configurations, and the primary stress condition represented by each scenario.

4.4. Evaluation Metrics

RMSE/MAE: Overall prediction accuracy. Lower is better.
NASA Score: Asymmetric scoring function [3]. Lower is better. Penalizes over-prediction (delayed maintenance) more heavily than under-prediction, making it suitable for safety-critical settings [39].

$$s = \exp(d/10) - 1, \qquad d > 0 \ \text{(over-prediction)}, \quad d = \hat{y} - y \tag{6}$$

$$s = \exp(-d/13) - 1, \qquad d < 0 \ \text{(under-prediction)} \tag{7}$$

Neg% (Negative Prediction Ratio): the percentage of raw model outputs that are negative. Post hoc non-negative clipping is then applied uniformly to all models at evaluation time.
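The four metrics can be sketched as follows. The sketch assumes that Neg% is computed on the raw outputs before clipping, that RMSE, MAE, and the NASA Score use the clipped predictions, and that the NASA Score is summed over all evaluated samples as in the PHM’08 convention, with d = ŷ − y throughout.

```python
import numpy as np


def nasa_score(y_true, y_pred) -> float:
    """PHM'08 asymmetric score, Eqs. (6)-(7), summed over samples."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    s = np.where(d > 0, np.exp(d / 10.0) - 1.0, np.exp(-d / 13.0) - 1.0)
    return float(np.sum(s))


def evaluate(y_true, y_pred_raw) -> dict:
    y_true = np.asarray(y_true, dtype=float)
    raw = np.asarray(y_pred_raw, dtype=float)
    neg_pct = 100.0 * float(np.mean(raw < 0))  # measured before clipping
    y_pred = np.clip(raw, 0.0, None)           # post hoc non-negative clipping
    err = y_pred - y_true
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "NASA": nasa_score(y_true, y_pred),
        "Neg%": neg_pct,
    }
```

The asymmetry is easy to see on a single sample: over-predicting by 10 cycles scores e¹ − 1 ≈ 1.72, while under-predicting by 10 cycles scores e^(10/13) − 1 ≈ 1.16.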

4.5. Baseline and Proposed Model

Three baseline models and four DK-LSTM configurations are evaluated under an identical training protocol. Standard LSTM and GRU share the same recurrent backbone as DK-LSTM and serve to isolate the contribution of the loss function. A CNN-LSTM hybrid is additionally included following the architectural principles of Ren et al. [15], with two 1D convolutional layers (32 filters, kernel size 3) preceding the LSTM encoder, while sharing the same dense head and training protocol as the other baselines.

The four DK-LSTM configurations form an ablation study designed to examine the effects of loss weighting and training stabilization on domain-shift robustness (Section 5.5). Specifically:

DK-LSTM (original): λ_safety = 1.0, asymmetric weights (2.0, 0.2), λ_upper = 1.0, clipnorm = 1.0, dropout = 0.2.
DK-LSTM-v2: Safety penalty weakened (λ_safety = 0.5, weights (1.5, 0.5)).
DK-LSTM-v3: v2 settings + λ_upper strengthened (1.0 → 2.0).
DK-LSTM-v4 (final stabilized): v3 settings + clipnorm 1.0 → 0.5, dropout 0.2 → 0.3, cosine learning rate scheduling with 5-epoch warmup.
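The ablation ladder and the v4 learning-rate schedule can be written down compactly. The variant values are copied from the list above and the schedule shape (linear warmup over 5 epochs, then cosine decay) follows the text, while the total epoch count of 100 and the floor `min_lr = 0` are placeholder assumptions, since the text does not state them.

```python
import math

# Ablation ladder of Section 4.5; each variant inherits from the previous one.
BASE = dict(lam_safety=1.0, w_over=2.0, w_under=0.2, lam_upper=1.0,
            clipnorm=1.0, dropout=0.2, cosine_warmup=False)
VARIANTS = {"original": BASE}
VARIANTS["v2"] = {**BASE, "lam_safety": 0.5, "w_over": 1.5, "w_under": 0.5}
VARIANTS["v3"] = {**VARIANTS["v2"], "lam_upper": 2.0}
VARIANTS["v4"] = {**VARIANTS["v3"], "clipnorm": 0.5, "dropout": 0.3, "cosine_warmup": True}


def warmup_cosine_lr(epoch: int, base_lr: float = 1e-3, warmup: int = 5,
                     total_epochs: int = 100, min_lr: float = 0.0) -> float:
    """Linear warmup for `warmup` epochs, then cosine decay to `min_lr` (0-based epochs).

    total_epochs and min_lr are assumed placeholder values.
    """
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = min(1.0, (epoch - warmup) / max(1, total_epochs - warmup))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In Keras such a function could be attached as `tf.keras.callbacks.LearningRateScheduler(lambda epoch, lr: warmup_cosine_lr(epoch))`.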

Table 4 summarizes the architectures, loss functions, output activation settings, and experimental roles of the baseline models and DK-LSTM variants.

4.6. Reproducibility

Software environment: Python 3.10.12, TensorFlow 2.15.0, NumPy 1.26.4, and scikit-learn 1.3.2, executed on Google Colab (Google LLC, Mountain View, CA, USA).
Multi-seed setting: seeds 0, 42, and 123.
Data Split: Training/validation data fixed at 80:20 ratio (unit-based split).
EarlyStopping: Based on val_mae, patience = 15.
Hyperparameters: Batch size 128, Adam optimizer (lr = 0.001, clipnorm = 1.0).
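A unit-based 80:20 split keeps every window of an engine on one side of the split, so the validation MAE used for early stopping is never computed on engines seen during training. The sketch below shuffles engine IDs with NumPy's `default_rng`; the authors' exact shuffling mechanism is not specified, so this is an assumption. FD001's training set contains 100 engines, which would yield 80 training and 20 validation units.

```python
import numpy as np


def unit_split(unit_ids, val_fraction: float = 0.2, seed: int = 42):
    """Split engine units (not windows) into sorted train/validation ID arrays."""
    units = np.unique(unit_ids)
    shuffled = np.random.default_rng(seed).permutation(units)
    n_val = int(round(val_fraction * units.size))
    return np.sort(shuffled[n_val:]), np.sort(shuffled[:n_val])


# One split per verification seed (0, 42, 123).
splits = {s: unit_split(np.repeat(np.arange(1, 101), 3), seed=s) for s in (0, 42, 123)}
```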

Hyperparameters (batch size 128, learning rate 0.001, hidden dimensions 64–32, sequence length L = 30) were set to values commonly adopted in the C-MAPSS prognostics literature [34,36] without further tuning. Identical training data, validation split, random seed, EarlyStopping criteria, and optimizer settings are applied uniformly to all models; the only differences are the loss function (DK-LSTM variants), the stabilization settings of DK-LSTM-v4 (Section 4.5), and architecture-specific layers (CNN-LSTM). Multi-seed verification (seeds 0, 42, 123) is performed for the domain-shift scenarios (E3a, E3b). Additional DK-LSTM-v4 runs under E1 are reported to assess the in-domain trade-off introduced by the final stabilized configuration.

5. Experimental Results

This section presents fixed-seed results for E1, E2, E3a, and E3b, followed by multi-seed verification and ablation focused on the zero-shot domain-shift scenarios E3a and E3b. Additional E1 multi-seed results for DK-LSTM-v4 are used to characterize the in-domain performance trade-off of the final stabilized configuration.

5.1. Learning Curves

This subsection presents the validation MAE learning curves of the four models trained on FD001 (in-domain). The curves provide insight into the optimization characteristics induced by each model’s loss function design.

Figure 2 presents the validation MAE learning curves under E1 (FD001, normal conditions). GRU converges rapidly at approximately epoch 10, while Standard LSTM exhibits a step-wise descent around epoch 17. DK-LSTM shows high variability in early epochs and continues training until approximately epoch 65, attributable to the optimization complexity introduced by the composite loss function (Safety + Neg + Upper).

The learning curves visually reflect how the network balances the safety, non-negativity, and upper-bound penalties against MSE minimization. Although the final converged values are broadly comparable, the in-domain E1 results show that CNN-LSTM achieves the lowest NASA Score, while DK-LSTM performs comparably to GRU. Accordingly, the main advantage of the proposed stabilized DK-LSTM configuration lies not in in-domain optimization but in its robustness-oriented behavior under the zero-shot domain-shift scenarios analyzed in Section 5.4 and Section 5.5.

All models share an identical weight initialization seed (42) for this single-seed visualization, which explains the common starting point of the validation MAE learning curves; subsequent divergence reflects the differing optimization landscapes induced by each loss function. Standard LSTM and GRU optimize a single MSE objective, while DK-LSTM optimizes a composite loss with multiple constraint terms, leading to higher early-epoch variability and longer convergence.
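The following NumPy sketch makes the composite objective concrete by combining MSE with an asymmetric safety term and the two active boundary penalties. It is an illustrative reconstruction, not the loss defined in Section 3.3. The squared penalty shapes, the value of `lam_neg`, and the reading of the asymmetric weights (2.0, 0.2) as (over-prediction, under-prediction) are all assumptions. The values λ_safety = 1.0 and λ_upper = 1.0 and the 125-cycle upper bound are taken from the text.

```python
import numpy as np

def composite_loss(y_true, y_pred, lam_safety=1.0, w_over=2.0, w_under=0.2,
                   lam_neg=1.0, lam_upper=1.0, rul_max=125.0):
    """Illustrative composite loss: MSE + asymmetric safety + boundary penalties.

    Assumptions: squared penalty shapes, lam_neg = 1.0, and (w_over, w_under)
    ordering. lam_safety, lam_upper, and rul_max follow the text.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true                      # > 0 means over-prediction (late)
    mse = np.mean(err ** 2)
    # Asymmetric safety term: over-predictions weighted by w_over, under by w_under.
    safety = np.mean(np.where(err > 0, w_over, w_under) * err ** 2)
    # Non-negativity: penalize predicted RUL below zero.
    neg = np.mean(np.maximum(-y_pred, 0.0) ** 2)
    # Operating-range upper bound: penalize predicted RUL above rul_max.
    upper = np.mean(np.maximum(y_pred - rul_max, 0.0) ** 2)
    return float(mse + lam_safety * safety + lam_neg * neg + lam_upper * upper)

print(composite_loss([50.0], [60.0]))  # 10 cycles late:  100 + 2.0 * 100 = 300.0
print(composite_loss([50.0], [40.0]))  # 10 cycles early: 100 + 0.2 * 100 = 120.0
```

With the same absolute error, the late prediction costs 2.5 times as much as the early one under these assumed weights. This asymmetry is what the safety term is meant to encode.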

5.2. E1 Scenario: FD001 Baseline Performance

Under in-domain conditions (FD001), CNN-LSTM achieves the lowest NASA Score (275.8) by leveraging local pattern extraction, while DK-LSTM (393.6) and GRU (401.9) perform comparably. Standard LSTM lags substantially (596.3). This suggests that when the test distribution matches training, architectural complexity (CNN-LSTM) can matter more than constraint regularization (DK-LSTM). However, as shown in Section 5.4, this ordering changes substantially under operational-condition shift: CNN-LSTM degrades sharply there, while DK-LSTM shows stronger safety-oriented robustness.

Table 5 reports the fixed-seed E1 performance comparison under the in-domain FD001 setting.

For reference, state-of-the-art CNN-LSTM and Transformer-based methods report RMSE of approximately 10.9–12.5 on C-MAPSS FD001 [15,40]; the purpose of this study is not to minimize in-domain RMSE but to verify constraint-regularized loss behavior under zero-shot domain shift.

5.3. E2 Scenario: Sensor Missingness Robustness

Under 30% sensor missingness, CNN-LSTM (488.5) and DK-LSTM (427.9) degrade relative to their E1 performance. Standard LSTM (380.6) and GRU (387.0), by contrast, do not degrade and achieve the two lowest NASA Scores in this scenario. The composite loss function of DK-LSTM and the local convolution kernels of CNN-LSTM both appear sensitive to the 32-dimensional input space formed by concatenating the 16 original features with 16 binary mask indicators. These results motivate future research into missingness-aware recurrent architectures (e.g., GRU-D [31]) or joint impute-then-predict approaches.
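The 32-dimensional E2 input can be sketched as follows. Missing readings are filled by carrying the last observation forward, and a binary indicator per feature records which values were actually observed. The following are assumptions of this sketch: the mask convention (1 = observed), back-filling leading gaps with the first valid value (cf. limitation 5 in Section 6.6), the zero fill for an entirely missing column, and the function name.

```python
import numpy as np

def impute_with_mask(x):
    """Forward-fill missing values and append binary mask indicators.

    x: (T, F) array with np.nan marking missing readings.
    Returns a (T, 2F) array: imputed features followed by masks (1 = observed).
    Leading gaps take the first valid value (not strictly causal); an entirely
    missing column is filled with 0.0. Both choices are assumptions.
    """
    out = np.array(x, dtype=float)             # copy so the input is untouched
    mask = (~np.isnan(out)).astype(float)
    for j in range(out.shape[1]):
        col = out[:, j]                        # view: edits write into `out`
        valid = np.flatnonzero(~np.isnan(col))
        if valid.size == 0:
            col[:] = 0.0
            continue
        col[:valid[0]] = col[valid[0]]         # back-fill the leading gap
        for t in range(valid[0] + 1, col.size):
            if np.isnan(col[t]):
                col[t] = col[t - 1]            # carry the last observation forward
    return np.concatenate([out, mask], axis=1)

# 30% random missingness on a hypothetical 50-cycle, 16-feature sequence.
rng = np.random.default_rng(42)
x = rng.normal(size=(50, 16))
x[rng.random(x.shape) < 0.3] = np.nan
print(impute_with_mask(x).shape)  # (50, 32)
```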

Table 6 reports the fixed-seed performance comparison under the E2 sensor-missingness scenario.

5.4. E3 Scenario: Structural Domain Shift

This subsection presents results under zero-shot domain shift, which represents the core contribution of this study. Both single-seed (seed = 42) results and multi-seed verification (seeds 0, 42, 123) are reported. The multi-seed analysis provides a statistically more stable assessment of domain-shift robustness and motivates the stabilized configuration (DK-LSTM-v4) presented in Section 5.5.

Under operational-condition shift (E3a, fixed seed = 42), the original DK-LSTM achieves the lowest NASA Score (334,736.8), outperforming Standard LSTM (667,607.1), GRU (847,741.0), and CNN-LSTM (12,539,335). Notably, CNN-LSTM exhibits catastrophic degradation under E3a, suggesting that local convolutional pattern extraction may overfit to source-domain temporal patterns and fail to generalize across the six operating conditions of FD002. This contrast supports the hypothesis that constraint-regularized learning can provide a useful inductive bias under operational-condition shift.

Table 7 reports the fixed-seed performance comparison for E3a, representing operational-condition shift from FD001 to FD002.

Under fault-mode shift (E3b, fixed seed = 42), CNN-LSTM achieves the lowest NASA Score (677,130), while DK-LSTM (886,388.4) outperforms GRU (1,017,999.0) and Standard LSTM (1,738,546.9). DK-LSTM is therefore not uniformly superior across domain-shift settings. The benefit of constraint-regularized learning is more pronounced under operational-condition shift (E3a). Under fault-mode shift (E3b), CNN-LSTM remains competitive; here FD003 shares the single operating condition of FD001, and only the fault-mode distribution differs.

Table 8 reports the fixed-seed performance comparison for E3b, representing fault-mode shift from FD001 to FD003.

The exponential nature of the NASA Score amplifies large individual errors under zero-shot transfer. These fixed-seed results are complemented by the multi-seed verification and stabilization analysis in Section 5.5.
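For reference, the sketch below implements the NASA (PHM08) scoring function as it is commonly defined for C-MAPSS. We assume it matches the definition used in this study. The usage lines illustrate the amplification effect: a single 50-cycle late prediction outweighs ten 5-cycle late predictions by more than an order of magnitude.

```python
import math

def nasa_score(y_true, y_pred):
    """Sum of per-unit NASA (PHM08) penalties; lower is better.

    d = predicted RUL - true RUL. Late predictions (d >= 0) use a 10-cycle
    time constant and early predictions (d < 0) a 13-cycle one, so
    over-prediction is penalized more steeply.
    """
    total = 0.0
    for true_rul, pred_rul in zip(y_true, y_pred):
        d = pred_rul - true_rul
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0
        else:
            total += math.exp(d / 10.0) - 1.0
    return total

# One large late error versus many small late errors.
print(round(nasa_score([100], [150]), 2))            # 147.41
print(round(nasa_score([100] * 10, [105] * 10), 2))  # 6.49
```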

5.5. Multi-Seed Verification and Ablation

To evaluate whether the observed domain-shift behavior is robust to random initialization and training variability, we conduct multi-seed verification under the E3a and E3b scenarios using seeds 0, 42, and 123. In addition, we examine whether loss-weight adjustment and training-stabilization techniques can improve the stability of DK-LSTM under zero-shot domain shift. This analysis compares the original DK-LSTM with three stabilized variants, culminating in DK-LSTM-v4.

5.5.1. Loss Weight and Stabilization Ablation

Four DK-LSTM configurations were evaluated:

DK-LSTM (original): as defined in Section 3.3.
DK-LSTM-v2: Safety penalty weakened (λ_safety: 1.0 → 0.5; asymmetric weights: (2.0, 0.2) → (1.5, 0.5)).
DK-LSTM-v3: v2 + λ_upper strengthened (1.0 → 2.0).
DK-LSTM-v4 (final stabilized): v3 + clipnorm tightened (1.0 → 0.5), dropout strengthened (0.2 → 0.3), cosine learning rate scheduling with 5-epoch warmup.
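The v4 learning-rate policy can be sketched as a linear warmup over the first five epochs followed by cosine decay. The base rate of 0.001 comes from the experimental settings. The total epoch budget (100), the decay floor (`min_lr = 0`), per-epoch granularity, and the function name are hypothetical, since the text does not specify them.

```python
import math

def warmup_cosine_lr(epoch, base_lr=1e-3, warmup_epochs=5, total_epochs=100, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay.

    total_epochs and min_lr are hypothetical defaults for illustration.
    """
    if epoch < warmup_epochs:
        # Ramp linearly so the last warmup epoch reaches base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    progress = min(progress, 1.0)
    # Half-cosine from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine_lr(e) for e in range(100)]
```

In Keras, such a schedule could be attached through the `tf.keras.callbacks.LearningRateScheduler` callback. It would need a thin wrapper such as `lambda epoch, lr: warmup_cosine_lr(epoch)`, because the callback passes the current rate as a second argument. The tightened clipping would go in the optimizer's `clipnorm` argument. The text does not state whether the authors used these exact mechanisms.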

Table 9 reports the multi-seed verification results for E3a, comparing the baseline models and the DK-LSTM variants under operational-condition shift.

Table 10 reports the corresponding multi-seed verification results for E3b under fault-mode shift.

5.5.2. Findings

Original DK-LSTM under multi-seed. The single-seed results in Table 7 and Table 8 appear to reflect a favorable seed for the original DK-LSTM. As shown in Table 9 and Table 10, the original DK-LSTM has mean NASA Scores of 1,011,947 in E3a and 1,084,459 in E3b. Both are higher (worse) than the corresponding GRU means.

Loss-weight tuning alone is insufficient. DK-LSTM-v2 (weakened safety) and DK-LSTM-v3 (v2 + strengthened upper bound) further amplify the seed sensitivity in E3a, with NASA Score means above 11.6 × 10⁶. This indicates that loss-weight rebalancing, while improving E3b moderately, does not address the underlying optimization instability under multi-condition operational shift.

Comprehensive stabilization (DK-LSTM-v4) resolves E3a instability. With cosine learning rate scheduling, tightened gradient clipping (0.5), and stronger dropout (0.3), DK-LSTM-v4 achieves a mean NASA Score of 269,778 ± 215,387 in E3a. This is a 43.7% improvement over GRU (479,241) and a 78.7% improvement over Standard LSTM (1,265,523). Under E3b, DK-LSTM-v4 maintains a 20.8% improvement over GRU. CNN-LSTM remains highly unstable in E3a (mean 2.9 × 10⁶) and is competitive with DK-LSTM-v4 in E3b (823,932). This indicates that architectural complexity alone does not guarantee cross-domain robustness.
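The reported percentages follow directly from the mean NASA Scores quoted above. The short check below uses a helper whose name is illustrative.

```python
def relative_improvement(candidate, reference):
    """Percentage reduction of a lower-is-better score relative to a reference."""
    return 100.0 * (reference - candidate) / reference

v4_e3a = 269_778  # DK-LSTM-v4 mean NASA Score under E3a
print(round(relative_improvement(v4_e3a, 479_241), 1))    # vs GRU: 43.7
print(round(relative_improvement(v4_e3a, 1_265_523), 1))  # vs Standard LSTM: 78.7
```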

Trade-off in in-domain performance. Additional multi-seed runs for DK-LSTM-v4 under E1 show that the final stabilized configuration is not an in-domain optimizer. DK-LSTM-v4 records an E1 NASA Score of 3,532.26 ± 5,457.79 across seeds 0, 42, and 123, which is substantially worse than the less strongly regularized DK-LSTM-v3 configuration (462.5 ± 112.5). The standard deviation exceeds the mean, which suggests that this in-domain degradation is also strongly seed-dependent. The stronger stabilization strategy (dropout 0.3, tighter gradient clipping, and cosine learning-rate scheduling) thus sacrifices in-domain accuracy while improving robustness to zero-shot operational-condition shift under E3a. DK-LSTM-v4 is therefore positioned as a domain-shift-oriented stabilized configuration rather than a general-purpose in-domain accuracy optimizer.

5.6. Result Summary

Figure 3 summarizes the multi-seed domain-shift behavior and stabilization effects under E3a and E3b, while Figure 4 provides a compact fixed-seed overview across E1, E2, E3a, and E3b. The scenario-specific quantitative results are reported in the corresponding tables in Section 5.2, Section 5.3, Section 5.4 and Section 5.5. In the in-domain E1 setting, CNN-LSTM achieves the best NASA Score, while DK-LSTM remains comparable to GRU. Under E2 sensor missingness, DK-LSTM and CNN-LSTM show degradation relative to simpler recurrent baselines. The central empirical finding is therefore not universal superiority across all scenarios, but the multi-seed robustness of the stabilized DK-LSTM-v4 configuration under zero-shot domain shift, particularly E3a and, relative to GRU and Standard LSTM, E3b.

6. Discussion

6.1. Effect of Domain Constraints Under Zero-Shot Domain Shift

The primary contribution of this study lies in zero-shot domain-shift performance under multi-seed verification. In the in-domain E1 setting, CNN-LSTM achieves the lowest (best) NASA Score, while the original DK-LSTM remains comparable to GRU rather than dominating all baselines (Section 5.2). The stabilized DK-LSTM-v4, by contrast, is designed for domain-shift robustness. It achieves substantial NASA Score reductions under both shift scenarios: 43.7% over GRU and 78.7% over Standard LSTM in E3a, and 20.8% over GRU in E3b (Table 9 and Table 10).

CNN-LSTM, despite stronger E1 performance, shows catastrophic degradation under E3a (single-seed: 12.5 × 10⁶; multi-seed mean: 2.9 × 10⁶). This indicates that architectural complexity alone does not provide a cross-domain inductive bias. In contrast, the constraint-regularized loss combined with comprehensive training stabilization (clipnorm 0.5, dropout 0.3, cosine LR) produces predictions that remain bounded under unseen operating conditions. Through the asymmetric safety penalty, these predictions are also steered away from over-prediction.

In safety-critical maritime and defense environments, RUL over-prediction directly translates into delayed maintenance and equipment failure risk. The NASA Score penalizes such late predictions more heavily than early ones, because its exponential time constant is 10 cycles for over-prediction versus 13 cycles for under-prediction. This makes it the primary metric for these applications. The 43.7% NASA Score improvement of DK-LSTM-v4 over the strongest baseline (GRU) under E3a therefore carries practical significance beyond what RMSE alone would indicate.
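A worked example makes this asymmetry concrete. Let d denote predicted minus true RUL. The per-unit NASA Score is then e^{d/10} - 1 for late predictions and e^{-d/13} - 1 for early ones. Errors of equal magnitude (20 cycles, chosen here for illustration) are therefore scored quite differently:

$$
s(+20) = e^{20/10} - 1 \approx 6.39, \qquad s(-20) = e^{20/13} - 1 \approx 3.66
$$

A 20-cycle over-prediction thus costs roughly 1.7 times as much as a 20-cycle under-prediction, and the ratio grows as the error magnitude increases.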

6.2. Training Instability Under Sensor Missingness (E2)

Both DK-LSTM and CNN-LSTM degrade under 30% sensor missingness (Section 5.3). DK-LSTM's composite loss function may generate conflicting gradients in the 32-dimensional (16 features + 16 masks) input space, while CNN-LSTM's local convolution kernels appear sensitive to imputation artifacts. Standard LSTM, trained with a plain MSE loss, appears to cope with the corrupted input space more flexibly. These results suggest that neither composite domain constraints nor architectural depth confer universal benefits. They motivate future research into missingness-aware recurrent architectures (GRU-D [31]) or joint impute-then-predict approaches (e.g., BRITS).

6.3. Deactivation of the Monotonic Degradation Constraint

This study does not claim as its core contribution a “fully activated physics-informed LSTM with monotonic degradation constraint.” The current implementation approximates monotonicity via in-batch target sorting in a model that outputs a single RUL value per input window. This approximation can distort the physical temporal ordering in data with mixed operating conditions such as FD002.
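The following NumPy sketch illustrates what an in-batch target-sorting monotonicity penalty can look like, and why it mixes units. Samples are ordered by true RUL across the whole batch, regardless of which engine or operating condition they came from. This is a hypothetical reconstruction for illustration only. The hinge form and the function name are assumptions, and the authors' implementation may differ.

```python
import numpy as np

def batch_sort_mono_penalty(y_true, y_pred):
    """Hinge penalty on predictions that rise as the sorted true RUL falls.

    Samples are ordered by true RUL (descending) across the whole batch, so
    windows from different units and operating conditions are compared with
    each other, which is the distortion discussed in the text.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(-y_true, kind="stable")   # descending true RUL
    p = y_pred[order]
    violations = np.maximum(p[1:] - p[:-1], 0.0)  # any increase is penalized
    return float(violations.mean()) if violations.size else 0.0

print(batch_sort_mono_penalty([10, 50, 30], [10, 50, 30]))  # 0.0
print(batch_sort_mono_penalty([10, 50, 30], [50, 10, 30]))  # 20.0
```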

In repeated experiments, setting λ_mono > 0 caused training divergence or substantially unstable convergence in E3a/E3b. Accordingly, this study designates the monotonicity constraint as “implemented but deactivated in the final reported experiments.”

Trajectory-level monotonicity enforcement techniques that directly constrain outputs to decrease consistently along the time-series order are deferred to future work.

To further examine whether this instability could be mitigated through training stabilization, we explored hyperparameter-level interventions—gradient clipping intensification (clipnorm 1.0 → 0.5), dropout strengthening (0.2 → 0.3), and cosine learning-rate scheduling—as documented in Section 5.5. While these interventions successfully stabilize DK-LSTM-v4 under operational-condition shift (E3a), they do not enable activation of the monotonicity constraint, because the underlying limitation is structural (batch-sort approximation in single-output architecture) rather than optimization-related. Trajectory-level monotonicity enforcement via sequence-to-sequence formulations is identified as the natural next step.
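A trajectory-level penalty of the kind identified as future work only compares consecutive predictions within the same unit. The minimal sketch below assumes a sequence-to-sequence model that emits one RUL prediction per cycle, with at least two cycles per unit. The function name and hinge form are illustrative assumptions.

```python
import numpy as np

def trajectory_mono_penalty(rul_seq):
    """Penalize any increase in predicted RUL along each unit's time axis.

    rul_seq: (n_units, T) predictions ordered by cycle within each unit, with
    T >= 2. Units are never compared with each other, unlike in-batch sorting.
    """
    r = np.asarray(rul_seq, dtype=float)
    increases = np.maximum(r[:, 1:] - r[:, :-1], 0.0)
    return float(increases.mean())

print(trajectory_mono_penalty([[100, 99, 98], [60, 59, 58]]))   # 0.0
print(trajectory_mono_penalty([[100, 102, 98], [60, 59, 58]]))  # 0.5
```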

6.4. Significance of the ISO 19848-Aligned Data Contract

The ISO 19848-aligned data contract is materialized as a JSON schema and provided as Supplementary Material. This schema serves as a reference point for maintaining semantic consistency when integrating actual vessel data into the same pipeline or extending to other shipboard equipment. The physical constraint specifications (non-negativity, upper bound) included in the data contract are directly linked to the model loss function design, ensuring consistency between the data layer and the learning algorithm.
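A pipeline might consume such a contract as sketched below in plain Python: raw C-MAPSS field names are renamed to contract channel IDs, and the RUL bounds declared in the contract are checked. The three entries are copied from Table 1. The dictionary layout, field names, and helper functions are hypothetical and do not reproduce the schema of File S1. The sensor values in the example are made up.

```python
# Hypothetical excerpt of the Table 1 mapping; the layout is illustrative only.
CHANNEL_MAP = {
    "setting_1": {"channel_id": "OC_ALT", "iso_name": "OperationalCondition_Altitude", "unit": "ft"},
    "s_2": {"channel_id": "FAN_IN_T", "iso_name": "FanInlet_Temperature", "unit": "degR"},
    "s_11": {"channel_id": "HPC_OUT_P", "iso_name": "HPC_OutletPressure", "unit": "psia"},
}
RUL_BOUNDS = (0.0, 125.0)  # non-negativity and upper bound declared for the target

def to_contract(record):
    """Rename raw C-MAPSS fields to contract channel IDs, dropping unmapped fields."""
    return {CHANNEL_MAP[k]["channel_id"]: v for k, v in record.items() if k in CHANNEL_MAP}

def rul_within_contract(rul):
    """Check a target value against the contract's RUL bounds."""
    lo, hi = RUL_BOUNDS
    return lo <= rul <= hi

print(to_contract({"s_2": 642.1, "s_11": 47.3, "s_5": 14.62}))
# {'FAN_IN_T': 642.1, 'HPC_OUT_P': 47.3}
```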

Full ISO 19848 compliance requires additional steps including formal channel registration with classification societies, data quality verification procedures, and integration with onboard data infrastructure conforming to ISO 19847 (ship data servers) [24,38].

6.5. Comparison with State-of-the-Art

The focus of this study is not to achieve the lowest in-domain RMSE on C-MAPSS FD001. Recent Transformer-based architectures [14,40], hybrid CNN-LSTM models [15,28], and deep convolutional networks [16] report RMSE of approximately 10.9–12.5 on FD001. The stabilized DK-LSTM-v4 configuration instead deliberately trades in-domain accuracy for stronger cross-domain robustness through stabilization and constraint-regularized learning. This design choice follows from two considerations:

(1) DK-LSTM is intentionally implemented as a simple two-layer LSTM to isolate the contribution of the loss function, and (2) DK-LSTM-v4 targets zero-shot domain-shift robustness, where target-domain training data is unavailable, rather than in-domain accuracy maximization.

6.6. Limitations

This study is intentionally scoped toward evaluating whether domain-knowledge constraints and training stabilization improve RUL prediction under stress scenarios, rather than toward achieving the lowest in-domain RMSE on C-MAPSS FD001.

(1) Absence of real vessel data: The experimental results constitute a proxy stress test on aviation-domain data and do not represent maritime environment validation.

(2) Monotonic degradation constraint deactivated: The batch-sort approximation in a single-output architecture does not permit trajectory-level monotonicity enforcement. This limitation is structural, not hyperparameter-related.

(3) In-domain performance trade-off: DK-LSTM-v4’s comprehensive stabilization (cosine LR, clipnorm 0.5, dropout 0.3) reduces in-domain (E1) accuracy. The model is positioned as a domain-shift-oriented stabilized configuration, not an in-domain optimizer.

(4) E2 performance: DK-LSTM does not outperform the simpler baselines under 30% sensor missingness.

(5) Imputation rigor: Initializing with the first valid value uses an observation from a later time step and therefore departs from strictly causal imputation. The procedure is accordingly documented as “leakage-minimized” rather than “leakage-free.”

(6) Full DT not implemented: The current framework is at a DT-ready stage without physical–virtual state synchronization.

(7) Multi-seed evaluation scope: Multi-seed baseline comparison is limited to E3a and E3b, which represent the core zero-shot domain-shift contribution of this study. Additional E1 multi-seed runs are conducted only for DK-LSTM-v4 to characterize the in-domain trade-off of the stabilized configuration. Full multi-seed evaluation across all models under E1 and E2 remains future work.

(8) Architectural comparison scope: Direct multi-seed Transformer baselines under the same protocol remain future work.

(9) Disturbance scope: The robustness evaluation addresses sensor missingness; additional disturbances (Gaussian noise, sensor drift, outlier injection) remain future work.

6.7. Scope of Data Scarcity and Future Extension

In this paper, “data scarcity” is defined at two levels. The first is domain-level scarcity: no publicly available failure datasets exist for the maritime domain, which necessitates using C-MAPSS as a proxy. The second is observation-level scarcity: missing data arising from sensor faults, communication failures, and packet loss during vessel operation, which the E2 (30% missingness) scenario represents.

Separately, experiments that reduce the training data volume itself (sample-level scarcity) can be viewed as another facet of data-scarce environments. For example, the 100 FD001 training units could be reduced to 50% or 10% to evaluate whether DK-LSTM's domain constraints act as a regularizer on small data.

However, such experiments were not included in the scope of this study for the following reasons. First, the E3a/E3b scenarios in the current experimental design already constitute zero-shot cross-domain transfer in which target-domain training data is 0%, representing a more severe form of data insufficiency than training data reduction. Second, reducing training units to 10 would necessitate hyperparameter readjustment to prevent overfitting, compromising the controlled experimental design in which the loss function contribution is the sole independent variable. Third, under a single-seed (seed = 42) condition, small-unit experiments may be dominated by unit-selection bias, making interpretation difficult without concurrent multi-seed experiments. Accordingly, DK-LSTM robustness evaluation under training data reduction conditions is deferred to future work when a multi-seed experimental infrastructure is in place. This is also a meaningful follow-up task from the perspective of PHM methodology evaluation in small-data environments as highlighted by Li et al. [7].
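If sample-level scarcity experiments are pursued, drawing the reduced training sets reproducibly per seed is one way to separate unit-selection effects from model effects. Below is a minimal sketch using Python's standard `random` module. The helper name and the rounding rule are assumptions.

```python
import random

def subsample_units(unit_ids, fraction, seed):
    """Draw a reproducible subset of training units (at least one unit)."""
    k = max(1, round(len(unit_ids) * fraction))
    return sorted(random.Random(seed).sample(list(unit_ids), k))

units = range(1, 101)  # FD001 has 100 training units
for seed in (0, 42, 123):
    print(seed, subsample_units(units, 0.10, seed))
```

Repeating each reduced-data configuration over the same seed list used elsewhere (0, 42, 123) would let unit-selection variance be reported alongside model variance.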

The multi-seed ablation in Section 5.5 also indicates that DK-LSTM-v4’s robustness in E3a relies on the combination of constraint-regularized loss and training stabilization. Future work will examine whether the same stabilization techniques transfer to alternative architectures (Transformer, CNN-LSTM) when paired with the proposed loss function.

7. Conclusions

This paper proposed a standards-aligned hybrid AI–DT workflow for predictive maintenance of maritime vessels and offshore plants under data scarcity and heterogeneous sensor environments. The main contribution of the study is the integration of an ISO 19848-aligned data contract with a constraint-regularized DK-LSTM prognostics module, thereby linking standards-based signal semantics with safety-oriented RUL prediction. The proposed workflow further contributes a scenario-based robustness evaluation protocol that separates sensor missingness, operational-condition shift, and fault-mode shift, together with multi-seed ablation analysis and an additional CNN-LSTM hybrid baseline for architectural comparison. The core predictive component embeds two active physical-boundary constraints—RUL non-negativity and operating-range upper bound—together with an asymmetric safety penalty, while the monotonic degradation constraint is explicitly treated as an implemented but deactivated component due to the structural limitation of batch-sort approximation in a single-output architecture.

Through systematic ablation across multi-seed verification (seeds 0, 42, 123), the final stabilized configuration (DK-LSTM-v4) achieves a 43.7% NASA Score improvement over the strongest baseline (GRU) and a 78.7% improvement over Standard LSTM under operational-condition shift (E3a). It also achieves a 20.8% improvement over GRU under fault-mode shift (E3b). CNN-LSTM, despite stronger in-domain performance, exhibits catastrophic degradation under E3a, indicating that architectural complexity alone does not guarantee cross-domain robustness. DK-LSTM-v4 is positioned as a domain-shift-oriented stabilized configuration that trades in-domain accuracy for substantial cross-domain robustness. This aligns with the practical requirement of safety-critical maritime and defense applications, where target-domain training data is unavailable.

We acknowledge several limitations. First, the deactivation of the monotonic degradation constraint (λ_mono = 0.0) reflects the structural limitation of batch-sort approximation in single-output architectures, rather than a hyperparameter issue; trajectory-level monotonicity enforcement via sequence-to-sequence formulations is the natural next step. Second, while multi-seed verification confirms statistical robustness in the domain-shift scenarios, full multi-seed evaluation across all models under the in-domain (E1) and sensor-missingness (E2) scenarios remains future work. Third, the in-domain trade-off of DK-LSTM-v4 is acknowledged as a deliberate consequence of the stabilization-oriented design.

Future work will address (a) domain adaptation validation using real vessel sensor data, (b) mask-attention-based missingness handling, (c) trajectory-level monotonicity constraint implementation, (d) full DT extension including physical–virtual state synchronization, (e) application of the DK-LSTM-v4 loss and stabilization combination to more advanced architectures (Transformer, CNN-LSTM hybrids) under multi-seed protocol, (f) extension of robustness evaluation to additional data disturbances (Gaussian noise, sensor drift, outlier injection), and (g) full multi-seed evaluation of E1 and E2.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16115303/s1, File S1: ISO 19848-Aligned Channel Mapping Schema (iso19848_data_contract.json).

Author Contributions

Conceptualization, D.P. and D.S.; methodology, D.P. and J.K.; software, D.P.; validation, J.J. and J.K.; formal analysis, D.P.; investigation, D.P. and J.J.; resources, D.P.; data curation, D.P.; writing—original draft preparation, D.P. and J.K.; writing—review and editing, J.J. and D.S.; visualization, D.P.; supervision, D.S.; project administration, D.S.; funding acquisition, D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant, funded by the Ministry of Culture, Sports and Tourism in 2025 (Project Name: Training Global Talent for Copyright Protection and Management of On-Device AI Models, Project Number: RS-2025-02221620, Contribution Rate: 100%).

Data Availability Statement

The NASA C-MAPSS turbofan degradation dataset used in this study is publicly available at https://data.nasa.gov/dataset/cmapss-jet-engine-simulated-data (accessed on 17 April 2026) [41]. The experimental scripts, quantitative results, and ISO 19848-aligned data contract are openly available at https://github.com/dongwook-park/dk-lstm-maritime-pdm (accessed on 17 April 2026) [42]. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lazakis, I.; Raptodimos, Y.; Varelas, T. Predicting ship machinery system condition through analytical reliability tools and artificial neural networks. Ocean Eng. 2018, 152, 404–415.
2. Velasco-Gallego, C.; Lazakis, I. Real-time data-driven missing data imputation for short-term sensor data of marine systems: A comparative study. Ocean Eng. 2020, 218, 108261.
3. Saxena, A.; Goebel, K.; Simon, D.; Eklund, N. Damage propagation modeling for aircraft engine run-to-failure simulation. In Proceedings of the 2008 International Conference on Prognostics and Health Management, Denver, CO, USA, 6–9 October 2008; pp. 1–9.
4. Nectoux, P.; Gouriveau, R.; Medjaher, K.; Ramasso, E.; Chebel-Morello, B.; Zerhouni, N.; Varnier, C. PRONOSTIA: An experimental platform for bearings accelerated degradation tests. In Proceedings of the IEEE International Conference on Prognostics and Health Management, Denver, CO, USA, 18–21 June 2012; pp. 1–8.
5. PHM Society. PHM Data Challenge 2008. Available online: https://phmsociety.org/ (accessed on 17 April 2026).
6. Kalafatelis, A.S.; Nomikos, N.; Giannopoulos, A.; Alexandridis, G.; Karditsa, A.; Trakadas, P. Towards predictive maintenance in the maritime industry: A component-based overview. J. Mar. Sci. Eng. 2025, 13, 425.
7. Li, C.; Li, S.; Feng, Y.; Gryllias, K.; Gu, F.; Pecht, M. Small data challenges for intelligent prognostics and health management: A review. Artif. Intell. Rev. 2024, 57, 214.
8. IACS. Recommendation on Cyber Resilience (Rec. 166); International Association of Classification Societies: London, UK, 2020.
9. Tao, F.; Cheng, J.; Qi, Q.; Zhang, M.; Zhang, H.; Sui, F. Digital twin-driven product design, manufacturing and service with big data. Int. J. Adv. Manuf. Technol. 2018, 94, 3563–3576.
10. Grieves, M.; Vickers, J. Digital twin: Mitigating unpredictable, undesirable emergent behavior in complex systems. In Transdisciplinary Perspectives on Complex Systems; Springer: Cham, Switzerland, 2017; pp. 85–113.
11. van Dinter, R.; Tekinerdogan, B.; Catal, C. Reference architecture for digital twin-based predictive maintenance systems. Comput. Ind. Eng. 2023, 177, 109099.
12. Mauro, F.; Kana, A.A. Digital twin for ship life-cycle: A critical systematic review. Ocean Eng. 2023, 269, 113479.
13. ISO 19848:2024; Ships and Marine Technology—Standard Data for Shipboard Machinery and Equipment, 2nd ed. International Organization for Standardization (ISO): Geneva, Switzerland, 2024.
14. Zhang, Z.; Song, W.; Li, Q. Dual-aspect self-attention based on transformer for remaining useful life prediction. IEEE Trans. Instrum. Meas. 2022, 71, 2505711.
15. Ren, L.; Dong, J.; Wang, X.; Meng, Z.; Zhao, L.; Deen, M.J. A data-driven auto-CNN-LSTM prediction model for lithium-ion battery remaining useful life. IEEE Trans. Ind. Inform. 2021, 17, 3478–3487.
16. Li, X.; Ding, Q.; Sun, J.-Q. Remaining useful life estimation in prognostics using deep convolution neural networks. Reliab. Eng. Syst. Saf. 2018, 172, 1–11.
17. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707.
18. Liao, X.; Chen, S.; Wen, P.; Zhao, S. Remaining useful life with self-attention assisted physics-informed neural network. Adv. Eng. Inform. 2023, 58, 102195.
19. Velasco-Gallego, C.; Lazakis, I. RADIS: A real-time anomaly detection intelligent system for fault diagnosis of marine machinery. Expert Syst. Appl. 2022, 204, 117634.
20. Grieves, M. Digital Twin: Manufacturing Excellence Through Virtual Factory Replication; White Paper; Michael Grieves, LLC.: Melbourne, FL, USA, 2014; pp. 1–7.
21. van Dinter, R.; Tekinerdogan, B.; Catal, C. Predictive maintenance using digital twins: A systematic literature review. Inf. Softw. Technol. 2022, 151, 107008.
22. Chen, C.; Fu, H.; Zheng, Y.; Tao, F.; Liu, Y. The advance of digital twin for predictive maintenance: The role and function of machine learning. J. Manuf. Syst. 2023, 71, 581–594.
23. Yang, S.; Xiang, X. Digital twin in the maritime domain: A review and emerging trends. J. Mar. Sci. Eng. 2023, 11, 1021.
24. Fonseca, Í.A.; Gaspar, H.M.; de Mello, P.C.; Sasaki, H.A. A standards-based digital twin of an experiment with a scale model ship. Comput.-Aided Des. 2022, 145, 103191.
25. Li, H.; Zhang, Z.; Li, T.; Si, X. A review on physics-informed data-driven remaining useful life prediction: Challenges and opportunities. Mech. Syst. Signal Process. 2024, 209, 111120.
26. Li, Y.; Chen, Y.; Hu, Z.; Zhang, H. Remaining useful life prediction of aero-engine enabled by fusing knowledge and deep learning models. Reliab. Eng. Syst. Saf. 2023, 229, 108869.
27. Lu, Z.; Guo, C.; Liu, M.; Shi, R. Remaining useful lifetime estimation for discrete power electronic devices using physics-informed neural network. Sci. Rep. 2023, 13, 10167.
28. Arias Chao, M.; Kulkarni, C.; Goebel, K.; Fink, O. Fusing physics-based and deep learning models for prognostics. Reliab. Eng. Syst. Saf. 2022, 217, 107961.
29. Hu, X.; Tan, L.; Tang, T. M²BIST-SPNet: RUL prediction for railway signaling electromechanical devices. J. Supercomput. 2024, 80, 16744–16774.
30. Hu, X.; Li, J.; Huang, Y.; Zhang, X.; Wang, H.; Wang, H.; He, Y. PCASTNet: A physics-constrained adaptive style transfer network for sample generation in cross-machine small-sample fault diagnosis. IEEE Trans. Instrum. Meas. 2025, 74, 3568417.
31. Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; Liu, Y. Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 2018, 8, 6085.
32. Li, X.; Xu, Y.; Li, N.; Yang, B.; Lei, Y. Remaining useful life prediction with partial sensor malfunctions using deep adversarial networks. IEEE/CAA J. Autom. Sin. 2023, 10, 121–134.
33. Hu, X.; Zhang, X.; Chen, F.; Liu, Z.; Liu, J.; Tan, L.; Tang, T. Simultaneous fault diagnosis for sensor and railway point machine for autonomous rail system. In Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), Edmonton, AB, Canada, 24–27 September 2024; pp. 1–8.
34. Heimes, F.O. Recurrent neural networks for remaining useful life estimation. In Proceedings of the 2008 International Conference on Prognostics and Health Management, Denver, CO, USA, 6–9 October 2008; pp. 1–6.
35. Ramasso, E.; Saxena, A. Performance benchmarking and analysis of prognostic methods for CMAPSS datasets. Int. J. Progn. Health Manag. 2014, 5, 1–15.
36. Zheng, S.; Ristovski, K.; Farahat, A.; Gupta, C. Long short-term memory network for remaining useful life estimation. In Proceedings of the 2017 IEEE International Conference on Prognostics and Health Management (ICPHM), Dallas, TX, USA, 19–21 June 2017; pp. 88–95.
37. da Costa, P.R.O.; Akçay, A.; Zhang, Y.; Kaymak, U. Remaining useful lifetime prediction via deep domain adaptation. Reliab. Eng. Syst. Saf. 2020, 195, 106682.
38. ISO 19847:2024; Ships and Marine Technology—Shipboard Data Servers for Sharing Field Data at Sea, 2nd ed. International Organization for Standardization (ISO): Geneva, Switzerland, 2024.
39. Saxena, A.; Celaya, J.; Saha, B.; Saha, S.; Goebel, K. Metrics for offline evaluation of prognostic performance. Int. J. Progn. Health Manag. 2010, 1, 1–20.
40. Fan, Z.; Li, W.; Chang, K.-C. A two-stage attention-based hierarchical transformer for turbofan engine remaining useful life prediction. Sensors 2024, 24, 824.
41. National Aeronautics and Space Administration (NASA). CMAPSS Jet Engine Simulated Data. Available online: https://data.nasa.gov/dataset/cmapss-jet-engine-simulated-data (accessed on 17 April 2026).
42. Park, D.; Jeong, J.; Kang, J.; Shin, D. DK-LSTM: Standards-Aligned Hybrid AI–Digital Twin Framework for Maritime Predictive Maintenance. Available online: https://github.com/dongwook-park/dk-lstm-maritime-pdm (accessed on 17 April 2026).

Figure 1. Four-layer DT-ready reference architecture proposed in this study. The data contract layer (Layer 2) provides standards-aligned signal naming, while the prognostics layer (Layer 3) embeds constraint-regularized learning. Multi-seed verification (seeds 0, 42, 123) is conducted at the prognostics layer to confirm robustness under zero-shot domain-shift conditions.

Figure 2. Validation MAE learning curves (E1 scenario, FD001 normal conditions).

Figure 3. Multi-seed verification and ablation box plots for E3a and E3b. NASA Score is shown on a log scale. DK-LSTM-v4 denotes the final stabilized configuration. “↓” denotes that lower is better; “→” denotes the train→test transition (FD001 → FD002 for E3a; FD001 → FD003 for E3b).

Figure 4. Scenario-based robustness summary under the fixed-seed setting (seed = 42). “→” denotes the train→test transition (FD001 → FD002 for E3a; FD001 → FD003 for E3b).

Table 1. ISO 19848-aligned channel mapping schema.

Channel ID	ISO 19848 Channel Name	C-MAPSS Feature	Unit	Type	Maritime Vessel Counterpart
OC_ALT	OperationalCondition_Altitude	setting_1	ft	setting	Vessel operational sea-state conditions
OC_MN	OperationalCondition_MachNumber	setting_2	dimensionless	setting	Ship speed/engine load ratio
FAN_IN_T	FanInlet_Temperature	s_2	degR	sensor	Gas turbine air intake temperature/diesel intake manifold temperature
LPC_OUT_T	LPC_OutletTemperature	s_3	degR	sensor	Low-pressure compressor (turbocharger LP stage) outlet temperature
HPC_OUT_T	HPC_OutletTemperature	s_4	degR	sensor	High-pressure compressor (turbocharger HP stage) outlet temperature
OGV_T	BypassDuct_Temperature	s_7	degR	sensor	Bypass duct/exhaust gas bypass temperature
HPC_IN_T	HPC_InletTemperature	s_8	degR	sensor	High-pressure compressor inlet temperature (intercooler outlet)
HPT_IS_TS	HPT_IsentropicEfficiency	s_9	pct	sensor	High-pressure turbine isentropic efficiency
HPC_OUT_P	HPC_OutletPressure	s_11	psia	sensor	High-pressure compressor outlet pressure (boost pressure)
HPT_OUT_P	HPT_OutletPressure	s_12	psia	sensor	High-pressure turbine outlet pressure (exhaust manifold pressure)
FAN_EFF	Fan_IsentropicEfficiency	s_13	pct	sensor	Fan/propeller drive efficiency
LPC_EFF	LPC_IsentropicEfficiency	s_14	pct	sensor	Low-pressure compressor efficiency
HPT_EFF	HPT_IsentropicEfficiency	s_15	pct	sensor	High-pressure turbine efficiency
BPR	BypassRatio	s_17	ratio	sensor	Bypass ratio (exhaust gas distribution ratio)
HPC_FLOW	HPC_FlowSensor	s_20	lbm/s	sensor	High-pressure compressor flow (air/fuel flow)
LPT_EFF	LPT_IsentropicEfficiency	s_21	pct	sensor	Low-pressure turbine efficiency (power turbine efficiency)
RUL (target)	RemainingUsefulLife	—	cycles	target	Remaining useful life (non-negative, upper bound = 125)

Note. The above mapping is a reference mapping inspired by the naming principles of ISO 19848, and does not claim full technical compliance with the standard.
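To make the data contract concrete, the sketch below renames raw C-MAPSS columns to the Table 1 channel IDs and derives a capped RUL target. It assumes the common piecewise-linear labelling convention, RUL = min(remaining cycles, 125), which is consistent with the upper bound listed in Table 1 but is not spelled out in this section; the names `CHANNEL_MAP`, `apply_contract`, and `capped_rul_labels` are hypothetical and do not come from the authors' repository.

```python
# Hypothetical sketch of the Table 1 data contract; names are illustrative.
RUL_CAP = 125  # RUL upper bound listed in Table 1

# C-MAPSS feature -> channel ID, transcribed from Table 1.
CHANNEL_MAP = {
    "setting_1": "OC_ALT",
    "setting_2": "OC_MN",
    "s_2": "FAN_IN_T",
    "s_3": "LPC_OUT_T",
    "s_4": "HPC_OUT_T",
    "s_7": "OGV_T",
    "s_8": "HPC_IN_T",
    "s_9": "HPT_IS_TS",
    "s_11": "HPC_OUT_P",
    "s_12": "HPT_OUT_P",
    "s_13": "FAN_EFF",
    "s_14": "LPC_EFF",
    "s_15": "HPT_EFF",
    "s_17": "BPR",
    "s_20": "HPC_FLOW",
    "s_21": "LPT_EFF",
}


def apply_contract(record: dict) -> dict:
    """Rename contracted C-MAPSS features to channel IDs; drop everything else."""
    missing = [name for name in CHANNEL_MAP if name not in record]
    if missing:
        raise KeyError(f"record violates the data contract; missing {missing}")
    return {channel: record[name] for name, channel in CHANNEL_MAP.items()}


def capped_rul_labels(num_cycles: int, cap: int = RUL_CAP) -> list:
    """Piecewise-linear RUL target for a run-to-failure unit observed for num_cycles cycles.

    Cycle c (1-indexed) gets min(num_cycles - c, cap), so the final cycle has RUL 0.
    """
    return [min(num_cycles - c, cap) for c in range(1, num_cycles + 1)]
```

For example, `capped_rul_labels(5)` gives `[4, 3, 2, 1, 0]`, and for a 200-cycle unit the first 75 labels are held at 125. Failing loudly on a missing channel, rather than silently filling it, is one way a data contract can keep signal naming consistent across vessels.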

Table 2. DK-LSTM loss function composition.

Term	Formula	λ	Rationale
L_MSE	E[(y − ŷ)²]	1.0	Data-fitting (regression) loss
L_safety	Asymmetric over-prediction penalty	1.0	Suppresses hazardous RUL over-prediction in safety-critical settings
L_neg (i)	E[relu(−ŷ)²]	1.0	RUL non-negativity
L_mono (ii)	Batch-sort monotonicity violation	0.0	Implemented but deactivated (instability under domain shift)
L_upper (iii)	E[relu(ŷ − 125)²]	1.0	Operating-range upper bound
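The composite objective in Table 2 can be sketched in plain Python as a weighted sum of per-batch terms. L_MSE, L_neg, and L_upper follow the formulas in the table. Because the table does not give formulas for the safety and monotonicity terms, both forms below are assumptions: the safety term adds a squared penalty only when ŷ > y, and the batch-sort term orders the batch by true RUL (descending) and penalizes any increase between consecutive predictions. The function names are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


def dk_loss(y_true, y_pred, lam_safety=1.0, lam_neg=1.0, lam_mono=0.0,
            lam_upper=1.0, rul_cap=125.0):
    """Weighted sum of the Table 2 terms over one batch (default weights from Table 2)."""
    pairs = list(zip(y_true, y_pred))
    l_mse = mean((y - p) ** 2 for y, p in pairs)
    # ASSUMED safety form: extra squared penalty only for over-predictions (p > y).
    l_safety = mean(relu(p - y) ** 2 for y, p in pairs)
    # (i) RUL non-negativity: penalize negative predictions.
    l_neg = mean(relu(-p) ** 2 for _, p in pairs)
    # (iii) Operating-range upper bound: penalize predictions above the cap.
    l_upper = mean(relu(p - rul_cap) ** 2 for _, p in pairs)
    # (ii) ASSUMED batch-sort monotonicity: sort by true RUL (descending) and
    # penalize a prediction that rises while the true RUL falls.
    ordered = [p for _, p in sorted(pairs, key=lambda t: t[0], reverse=True)]
    diffs = [relu(nxt - cur) ** 2 for cur, nxt in zip(ordered, ordered[1:])]
    l_mono = mean(diffs) if diffs else 0.0
    return (l_mse + lam_safety * l_safety + lam_neg * l_neg
            + lam_mono * l_mono + lam_upper * l_upper)
```

With the Table 2 weights, an over-prediction of 10 cycles costs twice as much as an under-prediction of 10 cycles (200 vs. 100 for a single sample), which is the asymmetry the safety term is meant to introduce. Setting `lam_mono=0.0` reproduces the deactivated monotonicity term.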

Table 3. Scenario-based evaluation design.

Scenario	Objective	Train/Test Configuration	Primary Stress Condition
E1	Baseline performance verification	FD001 train → FD001 test	In-domain distribution
E2	Missingness robustness	FD001 train/test + 30% masking	Sensor missingness and mask expansion
E3a	Operational-condition shift	FD001 train → FD002 test	Operational-condition distribution shift
E3b	Fault-mode shift	FD001 train → FD003 test	Fault-mode distribution shift
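The E2 stress condition can be illustrated with a masking routine that hides sensor values and appends binary masking indicators so the model can distinguish a missing reading from a genuine zero. The sketch assumes independent element-wise masking at the 30% rate from Table 3, zero filling (i.e., applied after normalization), and indicators concatenated after the values; the function name and these choices are hypothetical, not the authors' exact procedure.

```python
import random


def mask_with_indicators(window, rate=0.3, seed=42, fill=0.0):
    """Mask a sensor window and append observation indicators.

    window: list of time steps, each a list of n sensor values.
    Returns time steps of length 2n: masked values followed by
    indicators (1.0 = observed, 0.0 = masked).
    """
    rng = random.Random(seed)  # fixed seed for reproducible masks
    out = []
    for step in window:
        values, indicators = [], []
        for v in step:
            observed = rng.random() >= rate  # drop each value with prob. `rate`
            values.append(v if observed else fill)
            indicators.append(1.0 if observed else 0.0)
        out.append(values + indicators)
    return out
```

Doubling the input width in this way is one reading of the "mask expansion" listed for E2; the E3a and E3b scenarios, by contrast, change only the test set while training stays on FD001.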

Table 4. Model configuration and comparison.

Model	Architecture	Loss Function	Output	Role
Standard LSTM	LSTM (64,32)	MSE	Linear	Baseline 1
GRU	GRU (64,32)	MSE	Linear	Baseline 2
CNN-LSTM	Conv1D (32,32) → LSTM (64,32)	MSE	Linear	Baseline 3 (hybrid)
DK-LSTM (orig)	LSTM (64,32)	MSE + Safety + Neg + Upper	Linear	Proposed (initial)
DK-LSTM-v2	LSTM (64,32)	DK-loss with weakened Safety	Linear	Ablation
DK-LSTM-v3	LSTM (64,32)	DK-loss with strengthened Upper	Linear	Ablation
DK-LSTM-v4	LSTM (64,32) + cosine LR + clipnorm 0.5 + dropout 0.3	DK-loss (v3 weights)	Linear	Proposed (final stabilized)
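Two of the DK-LSTM-v4 stabilizers, cosine learning-rate decay and gradient clipping by norm 0.5, can be sketched without a deep-learning framework. The name "clipnorm" matches the Keras optimizer argument, which clips each weight's gradient individually, and the schedule mirrors the formula of Keras' `CosineDecay`; whether the authors used Keras is not stated in this section. The initial learning rate (1e-3), decay horizon (1000 steps), and the plain SGD update are illustrative assumptions, and dropout (0.3) is a layer setting not shown here.

```python
import math


def cosine_decay_lr(step, initial_lr=1e-3, decay_steps=1000, alpha=0.0):
    """Cosine decay from initial_lr to alpha * initial_lr over decay_steps."""
    step = min(step, decay_steps)  # hold the final value after decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


def clip_by_norm(grad, clipnorm=0.5):
    """Rescale one gradient vector so that its L2 norm is at most clipnorm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        return list(grad)
    scale = clipnorm / norm
    return [g * scale for g in grad]


def sgd_step(params, grad, step, clipnorm=0.5):
    """Hypothetical plain-SGD update combining both stabilizers."""
    lr = cosine_decay_lr(step)
    clipped = clip_by_norm(grad, clipnorm)
    return [w - lr * g for w, g in zip(params, clipped)]
```

For instance, a gradient `[3.0, 4.0]` (norm 5) is rescaled to `[0.3, 0.4]` before the update, which bounds the step size even when a domain-shifted batch produces a large loss spike.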

Table 5. E1—FD001 in-domain performance comparison under the fixed-seed setting (seed = 42).

Model	RMSE (↓)	MAE (↓)	NASA Score (↓)
Standard LSTM	15.53	11.15	596.3
GRU	14.83	10.93	401.9
CNN-LSTM	14.00	10.32	275.8
DK-LSTM (Proposed)	14.82	10.86	393.6
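The three metrics in Table 5 can be computed as follows. The NASA Score here is the asymmetric exponential scoring function commonly used with C-MAPSS (introduced for the PHM08 challenge): with d = ŷ − y, early predictions contribute exp(−d/13) − 1 and late predictions exp(d/10) − 1, and the contributions are summed rather than averaged over test units. This is a sketch of the standard definitions; the exact evaluation windowing used for Table 5 is not specified in this section.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(p - y) for y, p in zip(y_true, y_pred)) / len(y_true)


def nasa_score(y_true, y_pred):
    """Asymmetric score summed over units; late predictions (d > 0) cost more."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        d = p - y
        total += (math.exp(-d / 13.0) - 1.0) if d < 0 else (math.exp(d / 10.0) - 1.0)
    return total
```

Because the score is summed and grows exponentially with error, a few large over-predictions can dominate it, which is why it separates models more sharply than RMSE or MAE in Table 5.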