1. Introduction
Modern vessels and offshore infrastructure systems serve as critical assets in global supply chains and energy transportation, with ever-increasing demands for safety, reliability, and availability. The transition from reactive maintenance to predictive maintenance (PdM) has accordingly emerged as a key challenge for preventing unexpected downtime, reducing operational costs, and ensuring crew safety [1]. In particular, condition-based maintenance (CBM) for key shipboard equipment such as main propulsion systems and generators can substantially improve resource efficiency compared to traditional maintenance practices [2].
However, data-driven PdM implementation in the maritime domain faces two fundamental challenges. First, actual vessel failure data are extremely scarce and are predominantly classified as proprietary information. Unlike the aviation (NASA C-MAPSS [3], PRONOSTIA [4]) and manufacturing (PHM Data Challenge [5]) domains, publicly available marine machinery degradation datasets are severely limited [6]. This data scarcity problem has been reported as a prevailing issue across PHM applications in general [7]. Second, heterogeneous sensor systems installed across diverse vessels often operate without standardized data labeling, impeding the scalability of model training and deployment across multiple equipment types and fleets [8].
Digital twin (DT) technology offers a promising solution to these challenges. The convergence of AI and DT enables the integration of diagnostic and predictive capabilities with physics-based virtual assets [9, 10], and DT-based PdM architectures emphasize the importance of standards-based data layers [11]. However, as Mauro and Kana [12] highlight in their review of ship life-cycle digital twins, many DT implementations lack genuine bidirectional data exchange between physical and virtual entities, and the term “DT” may be overused. Accordingly, the present study targets not a “full DT” but a DT-ready workflow equipped with a standards-aligned data layer and an auditable prognostics layer.
This study proposes a hybrid AI–DT workflow that integrates an ISO 19848-aligned data contract [13] with a domain-knowledge enhanced deep learning model, and employs the NASA C-MAPSS benchmark as proxy data for scenario-based robustness evaluation against maritime conditions.
Beyond recurrent networks, recent prognostics research has explored diverse architectures including Transformer-based attention models [14], hybrid CNN-LSTM frameworks [15], deep convolutional networks [16], and physics-informed neural networks [17, 18]. While these advanced models often achieve lower RMSE on benchmark datasets, the present study focuses on isolating the contribution of domain-knowledge constraints through a controlled comparison, with multi-seed verification across the core domain-shift scenarios.
The remainder of this paper is organized as follows.
Section 2 reviews related work spanning maritime PdM and data availability, DT-based PdM architectures, physics-/domain-knowledge integrated RUL prediction, missingness-aware sequence modeling, and the NASA C-MAPSS benchmark.
Section 3 presents the proposed process, including the DT-ready reference architecture, the ISO 19848-aligned data contract, the DK-LSTM model and its domain-knowledge enhanced loss function, and the missingness handling strategy.
Section 4 describes the experimental setup, including datasets, preprocessing, scenario design, evaluation metrics, and reproducibility provisions.
Section 5 reports the experimental results across all four scenarios.
Section 6 discusses the implications and limitations of the findings, and
Section 7 concludes the paper with a summary and directions for future research.
2. Related Works
2.1. Maritime Predictive Maintenance and Data Availability
Section 1 introduced the two-fold challenges of maritime PdM (data scarcity, heterogeneous sensors) and the rationale for a DT-ready, standards-aligned approach. This section reviews the relevant literature in five areas, identifying for each area the gap that motivates the present work.
The maritime PdM field has advanced substantially over the past decade, yet the absence of publicly available vessel degradation datasets remains a key bottleneck. Kalafatelis et al. [6] surveyed component-level PdM needs and the diversity of data-driven architectures in the maritime industry, noting that practical barriers to deployment across heterogeneous vessel subsystems remain substantial. As publicly available failure datasets for ship propulsion systems are extremely limited, the majority of maritime PHM studies rely on public benchmarks as proxies [19]. Lazakis et al. [1] combined analytical reliability-based approaches with artificial neural networks for vessel machinery condition prediction, and Velasco-Gallego and Lazakis [19] developed a real-time anomaly detection intelligent system for marine machinery fault diagnosis. In particular, Velasco-Gallego and Lazakis [2] explicitly addressed the handling of missing sensor data as a practical challenge in maritime CBM/PdM, showing that missing data imputation for marine sensor streams is a key preprocessing concern for real-time decision support.
While these studies establish the importance of data-driven PdM in maritime, none integrate a standards-aligned data contract layer with a constraint-regularized prognostic model under multi-seed verification. The present work fills this gap.
2.2. Digital Twin-Based Predictive Maintenance Architectures
DT-based PdM reference architectures present a multi-layered structure comprising physical assets, data connectivity, virtual models, and service layers [11, 20]. Van Dinter et al. [11] argued that reference architectures serve as blueprints for consistent application architecture design when DT-based PdM is integrated into complex systems. Furthermore, van Dinter et al. [21] reported that the maturity gap of the data integration layer was a key finding in their systematic literature review connecting DT and predictive maintenance. Chen et al. [22] further examined the role of machine learning within DT-based PdM systems, reinforcing the need for mature data integration as a prerequisite for effective AI–DT convergence. In the maritime domain, Mauro and Kana [12] conducted a critical systematic review of ship life-cycle DTs, addressing state synchronization, standards-based data exchange, and computational complexity, and warned that the term “DT” may be misused when bidirectional physical–virtual data exchange is absent. Yang and Xiang [23] reviewed emerging trends in maritime digital twins, identifying data interoperability and standardization as persistent gaps across the sector. Fonseca et al. [24] reported a standards-based ship digital twin case study grounded in ISO 19847/19848, demonstrating that data standards are meaningful for scalable DT implementations.
Accordingly, this study posits the ISO 19848-aligned data contract as a necessary sub-component, while explicitly scoping the work to a DT-ready workflow rather than a fully synchronized physical–virtual loop.
2.3. Physics-/Domain-Knowledge Integrated RUL Prediction
Integrating physical constraints into neural network loss functions enables physically consistent predictions even under data-sparse conditions [25, 26]. The review by Li et al. [25] on physics-informed data-driven RUL prediction noted that purely data-driven methods may produce physically infeasible or inconsistent predictions, and that embedding physics/domain knowledge can reduce data quality and quantity requirements. Raissi et al. [17] established the physics-informed neural network (PINN) paradigm, a representative approach that incorporates physical laws as constraints (e.g., soft penalties) into the learning objective, which has since been extended to prognostics applications. Liao et al. [18] further demonstrated the integration of self-attention mechanisms with physics-informed neural networks for RUL prediction, achieving improved prognostic accuracy under constrained data conditions. Lu et al. [27] specifically formulated monotonic decrease and non-negativity constraints as regularization terms in an LSTM-based RUL model for power electronic devices. Arias Chao et al. [28] demonstrated that fusing a physics-based thermodynamic model with deep learning extends the prediction horizon on the N-CMAPSS dataset.
This study combines non-negativity, operating-range upper bounds, and an asymmetric safety penalty within a single loss function, and links this to an ISO 19848-aligned data contract. Rather than inserting complex governing equations, the approach adopts constraint losses reflecting “safe prediction boundaries,” offering a practical approach tailored to maritime safety requirements.
Meanwhile, for RUL prediction of safety-critical electromechanical devices, Hu et al. [29] proposed a spatiotemporal attention-based multi-branch network and reported strong performance in long-term multivariate prediction, and the study in [30] integrated physical constraints (band energy preservation) into an adaptive style transfer network, demonstrating data efficiency in small-sample fault diagnosis under physics-information scarcity.
Building on Lu et al. [27], the present work extends the constraint-regularization paradigm by (1) systematically ablating the constraint weights and stabilization techniques and (2) verifying multi-seed robustness under zero-shot domain-shift conditions, which are not addressed in prior LSTM-based constraint-regularization studies.
2.4. Missingness-Aware Sequence Modeling
The notion that missingness patterns themselves may be informative was explicitly addressed in the GRU-D study by Che et al. [31]. GRU-D directly integrates masking and temporal gap information into the RNN structure, presenting a principled direction for missingness handling beyond simple imputation.
This serves as the academic basis for the E2 (30% missingness) scenario design of the present study, and simultaneously provides the rationale for future research directions in response to the inferior E2 performance of DK-LSTM.
Furthermore, Li et al. [32] addressed RUL prediction under partial sensor malfunctions using deep adversarial networks, confirming that sensor degradation and missing observations remain a practical challenge for prognostics in real-world deployment. Hu et al. [33] addressed the problem of simultaneous fault diagnosis for sensors and equipment in autonomous rail systems, confirming that fault diagnosis under sensor fault/missingness conditions is a practical challenge.
2.5. NASA C-MAPSS Benchmark
NASA C-MAPSS provides turbofan engine simulation data encompassing diverse operational conditions and fault modes, and serves as the de facto standard benchmark in the PHM field [3]. Heimes [34] provided one of the earliest applications of recurrent neural networks for RUL estimation on C-MAPSS, establishing the foundation for subsequent deep learning approaches. Ramasso and Saxena [35] benchmarked multiple prognostic methods on C-MAPSS subsets and confirmed that the PHM’08 asymmetric scoring function is one of the most widely used metrics. The piecewise linear RUL labeling convention was established by Zheng et al. [36]. Since shipboard main propulsion gas turbines and large diesel engines share thermomechanical degradation mechanisms with aircraft turbofan engines, namely high temperature, high pressure, and high-speed rotation [25], C-MAPSS serves as a suitable proxy for testing the generalization capability of maritime PHM algorithms.
The cross-evaluation protocol employed in this study, FD001 → FD002 (operational-condition shift) and FD001 → FD003 (fault-mode shift), represents a standard approach in domain generalization research [35, 37].
3. Proposed Process
The proposed process consists of four interconnected layers, depicted schematically in Figure 1. The remainder of this section details each layer.
3.1. DT-Ready Reference Architecture
The reference architecture of the AI–DT workflow proposed in this study comprises four layers. This layer separation is consistent with the discussion by van Dinter et al. [11] on the role of reference architectures in DT-based PdM.
(1) Physical Layer: Encompasses shipboard equipment and sensor interfaces. Generates raw telemetry, which may be noisy, subject to packet loss, and tagged with manufacturer-specific identifiers.
(2) Data Contract Layer: Transforms heterogeneous signals into structured time-series formats through ISO 19848-aligned channel mapping. Ensures that indicators such as temperature, pressure, and speed can be uniformly identified regardless of vessel source.
(3) Prognostics Layer: Embeds the DK-LSTM-based RUL prediction module.
(4) Service Layer: Links prediction outputs to maintenance decision-making, explicitly accounting for costs (unexpected failure vs. premature replacement).
Each layer has a clearly defined input/output interface: the physical layer outputs raw vendor-specific telemetry; the data contract layer outputs ISO 19848-aligned channel-identified time-series; the prognostics layer consumes this structured input and produces RUL estimates whose robustness is assessed through multi-seed verification (Section 5); and the service layer translates RUL estimates into maintenance decisions weighted by the asymmetric cost of failure versus premature replacement. This separation enables modular substitution, e.g., replacing the prognostics layer’s DK-LSTM with an alternative model without affecting the data contract above or the service logic below.
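The modular-substitution property can be illustrated with a minimal sketch. The interface name `Prognostics`, the stand-in model `ConstantRul`, the function `service_decision`, and the 30-cycle decision threshold are all hypothetical and are not part of the reported implementation; the sketch only shows that the service layer depends on the prognostics interface rather than on DK-LSTM itself.

```python
from typing import Protocol, Sequence


class Prognostics(Protocol):
    """Prognostics-layer interface: structured sensor window in, RUL out."""

    def predict_rul(self, window: Sequence[Sequence[float]]) -> float: ...


class ConstantRul:
    """Hypothetical stand-in model; any object with predict_rul() fits."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict_rul(self, window: Sequence[Sequence[float]]) -> float:
        return self.value


def service_decision(model: Prognostics,
                     window: Sequence[Sequence[float]],
                     threshold: float = 30.0) -> str:
    """Service layer: map an RUL estimate to a maintenance action.

    The threshold is an illustrative assumption, not a value from the study.
    """
    rul = model.predict_rul(window)
    return "schedule_maintenance" if rul <= threshold else "continue_operation"
```

Swapping `ConstantRul` for any other object that exposes `predict_rul` leaves `service_decision` unchanged, which is the property the layer separation is meant to guarantee.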
3.2. ISO 19848-Aligned Data Contract
ISO 19848:2024 provides naming and description conventions for shipboard machinery and equipment sensor data [13]. This study defines a reference schema in JSON format that maps C-MAPSS sensors to shipboard physical quantities, comprising a total of 16 channels. The target variable (RUL) includes physical constraint specifications (non-negative, upper bound = 125 cycles).
This implementation is a reference mapping inspired by the naming principles of ISO 19848, and does not claim full technical compliance with the ISO 19848 specification. Full compliance would require additional steps including formal channel registration with classification societies, data quality verification procedures, and integration with onboard data infrastructure conforming to ISO 19847 (ship data servers) [38].
Table 1 summarizes the ISO 19848-aligned channel mapping schema, including the selected C-MAPSS features, units, semantic channel names, maritime vessel counterparts, and the RUL target constraints.
The “Maritime Vessel Counterpart” column provides reference information indicating which measurement points on a ship’s main propulsion gas turbine or large diesel engine physically correspond to the C-MAPSS turbofan sensors, and is included to visually demonstrate semantic interoperability in heterogeneous sensor environments.
The ISO 19848 mapping process is carried out in three steps.
First, physical quantity classification: the 26 raw columns in the C-MAPSS data comprise the unit identifier, the cycle index, three operational settings, and 21 sensors. The eight zero-variance channels (s_1, s_5, s_6, s_10, s_16, s_18, s_19, setting_3) are excluded, and the remaining 16 features are classified into six physical quantity categories: temperature, pressure, efficiency, flow, ratio, and operational condition.
Second, ISO 19848 naming convention application: each feature is assigned a hierarchical channel identifier following the “component_location_quantity” pattern. For instance, s_4 (high-pressure compressor outlet temperature) is named HPC_OutletTemperature, and s_12 (high-pressure turbine outlet pressure) is named HPT_OutletPressure. This pattern references the hierarchical DataChannelID naming structure recommended by ISO 19848 [13, 24].
Third, maritime equipment semantic mapping: based on the correspondence of thermodynamic components (compressor, turbine, combustor) shared between aircraft turbofan engines and shipboard gas turbines/large diesel engines [25], each ISO 19848 channel is mapped to the physical quantity it would measure on actual shipboard equipment.
This three-step process is designed such that semantic consistency can be maintained by repeating the same procedure when new vessel equipment types or heterogeneous sensor systems are added.
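The first two mapping steps can be sketched as follows. Only the channel names for s_4 and s_12 and the RUL target constraints are taken from the text; the dictionary layout, the helper names, and the sample readings used in the usage note are hypothetical and do not reproduce the actual JSON schema of Table 1.

```python
import statistics

# Channel names for s_4 and s_12 are quoted from the text; the remaining
# entries would follow the same pattern and are omitted here.
CHANNEL_NAMES = {
    "s_4": "HPC_OutletTemperature",
    "s_12": "HPT_OutletPressure",
}

# Target specification: non-negative, upper bound = 125 cycles.
RUL_TARGET = {"name": "RUL", "unit": "cycles", "min": 0, "max": 125}


def drop_constant_columns(columns):
    """Step 1: keep only columns whose values actually vary."""
    return {k: v for k, v in columns.items() if statistics.pvariance(v) > 0.0}


def apply_contract(columns):
    """Step 2: rename retained columns to their contract channel names."""
    kept = drop_constant_columns(columns)
    return {CHANNEL_NAMES.get(k, k): v for k, v in kept.items()}
```

For example, calling `apply_contract({"s_1": [518.67, 518.67], "s_4": [1400.6, 1403.1]})` drops the constant `s_1` column and returns the remaining readings under the key `HPC_OutletTemperature`. Columns without an entry in `CHANNEL_NAMES` keep their raw name in this sketch.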
3.3. DK-LSTM PdM Model
DK-LSTM is a lightweight sequence model that combines a standard LSTM encoder with a domain-knowledge-based loss function. The input is a normalized sensor sequence with sliding window length L = 30, and the output is a single scalar RUL estimate. The single-output structure (return_sequences = False) was selected for deployment efficiency.
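The windowing step can be sketched as below. The sketch assumes a stride of one cycle and assumes that each window is labeled with the RUL at its final time step; neither detail is stated in this section, and the function name `make_windows` is hypothetical.

```python
def make_windows(sequence, rul_labels, length=30):
    """Return (window, label) pairs for one engine unit.

    sequence:   list of per-cycle feature vectors, oldest first
    rul_labels: RUL value for each cycle, same length as sequence
    Each label is the RUL at the last time step of its window (assumption).
    """
    pairs = []
    for end in range(length, len(sequence) + 1):
        window = sequence[end - length:end]
        pairs.append((window, rul_labels[end - 1]))
    return pairs
```

A unit with 35 recorded cycles therefore yields six windows of length 30, and units shorter than the window length yield none.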
As a key design decision for fair comparison, a linear activation function (activation = ‘linear’) is applied to the final output layer of all models (Standard LSTM, GRU, DK-LSTM). This excludes the expedient of structurally blocking negative values via ReLU or similar activations, forcing only the penalties embedded in the loss function to learn the physical constraints, thereby establishing the loss function contribution as the sole independent variable.
The hidden dimensions follow a stepwise compression pattern (LSTM 64 → 32, Dense 32 → 1), reflecting standard practice in C-MAPSS RUL prediction literature [34, 36] for balancing representational capacity against overfitting risk in single-engine-unit time-series. The 32-node penultimate dense layer is consistent with the latent representation capacity of the second LSTM layer. No additional hyperparameter tuning was performed, to maintain fair comparison with baseline models.
Before presenting the detailed loss formulation, Algorithm 1 summarizes the optimization procedure used to train the DK-LSTM model. For multi-seed verification, the same procedure was repeated independently using seeds 0, 42, and 123.
Algorithm 1. DK-LSTM training procedure with a constraint-regularized objective.
Input: Training set {(X_i, y_i)}, validation set {(X_v, y_v)}, learning rate η, maximum epochs E, patience P = 15, gradient clipping threshold c, constraint weights {λ_safety, λ_neg, λ_mono, λ_upper}, random seed s
Output: Trained model parameters θ
1: Initialize model parameters θ using seed s
2: for epoch = 1 to E do
3:     for each mini-batch (X_b, y_b) do
4:         ŷ_b ← f_θ(X_b)
5:         L_MSE ← mean((y_b − ŷ_b)^2)
6:         L_safety ← asymmetric_penalty(y_b, ŷ_b)
7:         L_neg ← mean(ReLU(−ŷ_b)^2) // Constraint (i)
8:         L_mono ← batch_sort_violation(y_b, ŷ_b) // Constraint (ii)
9:         L_upper ← mean(ReLU(ŷ_b − 125)^2) // Constraint (iii)
10:        L_total ← L_MSE + λ_safety L_safety + λ_neg L_neg + λ_mono L_mono + λ_upper L_upper
11:        θ ← Adam_step(θ, ∇_θ L_total, η, clipnorm = c)
12:    end for
13:    Evaluate validation MAE on {(X_v, y_v)}
14:    if validation MAE shows no improvement for P epochs then
15:        break
16:    end if
17: end for
18: return θ
Domain-Knowledge Enhanced Loss Function
The learning objective of DK-LSTM comprises a data loss term, a practical safety loss term, and domain-informed constraint terms.
Constraint (ii) Monotonic Degradation: Predicted RUL should monotonically decrease along the degradation trajectory. Since DK-LSTM is a single-output model, direct intra-sequence comparison is infeasible; a batch-sort approximation was adopted whereby target values are sorted in descending order within each mini-batch and violations of monotonic decrease in predictions are penalized. However, in domain-shift environments with mixed operational conditions such as FD002 (six operating conditions), the batch-sort approximation was found in experiments to distort the physical temporal ordering of time series, destabilizing optimization. Accordingly, this component was simplified in the final experimental version. Trajectory-level monotonicity enforcement is deferred to future work.
Constraint (iii) Operating-Range Upper Bound: A penalty is applied when the predicted value exceeds MAX_RUL (=125). This constraint prevents the model from generating unrealistically high RUL predictions when encountering out-of-distribution noise.
Safety Penalty: An asymmetric penalty is applied with a weight of 2.0 for over-predictions and 0.2 for under-predictions. In safety-critical maritime environments, RUL over-prediction directly translates to delayed maintenance timing and thus to equipment failure risk; consequently, over-prediction is penalized more heavily than under-prediction.
Table 2 summarizes the composition of the DK-LSTM loss function, including the data loss, asymmetric safety penalty, and the three domain-informed constraint terms.
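As a framework-free illustration, the composite objective in line 10 of Algorithm 1 can be sketched in plain Python. The non-negativity and upper-bound terms follow lines 7 and 9 of Algorithm 1 directly. The squared-error form of `asymmetric_penalty`, the exact form of `batch_sort_violation`, and the default λ_neg = 1.0 are assumptions made for this sketch, since their precise definitions are not given in this section. The default λ_mono = 0.0 reflects the deactivation of the monotonicity term in the final experiments.

```python
MAX_RUL = 125.0  # operating-range upper bound from the data contract


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def asymmetric_penalty(y_true, y_pred, w_over=2.0, w_under=0.2):
    """Weighted squared error (assumed form): over-prediction costs more."""
    terms = []
    for t, p in zip(y_true, y_pred):
        err = p - t
        weight = w_over if err > 0 else w_under
        terms.append(weight * err * err)
    return _mean(terms)


def batch_sort_violation(y_true, y_pred):
    """Order predictions by descending target, then penalize any increase."""
    order = sorted(range(len(y_true)), key=lambda i: y_true[i], reverse=True)
    ranked = [y_pred[i] for i in order]
    rises = [max(0.0, nxt - cur) for cur, nxt in zip(ranked, ranked[1:])]
    return _mean([r * r for r in rises])


def dk_loss(y_true, y_pred, lam_safety=1.0, lam_neg=1.0,
            lam_mono=0.0, lam_upper=1.0):
    """Composite objective mirroring line 10 of Algorithm 1."""
    mse = _mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])
    neg = _mean([max(0.0, -p) ** 2 for p in y_pred])
    upper = _mean([max(0.0, p - MAX_RUL) ** 2 for p in y_pred])
    return (mse
            + lam_safety * asymmetric_penalty(y_true, y_pred)
            + lam_neg * neg
            + lam_mono * batch_sort_violation(y_true, y_pred)
            + lam_upper * upper)
```

Under these assumptions, an over-prediction of 10 cycles at a true RUL of 50 costs 100 + 2.0 × 100 = 300, whereas an under-prediction of the same size costs 100 + 0.2 × 100 = 120, which is the asymmetry the safety penalty is intended to create.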
3.4. Missingness Handling
The sensor data missingness rate for the E2 scenario is set at 30%. Forward-fill imputation is applied, combined with a binary masking indicator to expand the input dimension from 16 to 32. This borrows the key idea from GRU-D [31] of explicitly leveraging missingness information, while applying the same architecture to all baseline models to ensure fair comparison.
Limitation: The initialization of leading consecutive missing values using the first valid value may allow future-timepoint information to leak into the past, which deviates from strictly causal imputation. Accordingly, this study designates the approach as “leakage-minimized” rather than “leakage-free.” Fully leak-proof strategies, such as learnable tokens or constant initialization, are deferred to future work.
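The imputation-plus-mask step, including the non-causal seeding of leading gaps described in the limitation, can be sketched as follows. The mask convention (1.0 = observed, 0.0 = imputed), the 0.0 fallback for a feature that is missing at every step, and the function name are assumptions of this sketch.

```python
def ffill_with_mask(rows):
    """Forward-fill missing readings and append a binary observation mask.

    rows: list of time steps, each a list of floats or None (missing).
    Returns rows of doubled width: imputed values followed by mask bits.
    """
    n_feat = len(rows[0])
    # Seed leading gaps with the first valid value of each feature. This is
    # the step that can leak future information into earlier time steps.
    last = [next((r[j] for r in rows if r[j] is not None), 0.0)
            for j in range(n_feat)]
    out = []
    for row in rows:
        values, mask = [], []
        for j, v in enumerate(row):
            if v is not None:
                last[j] = v
            values.append(last[j])
            mask.append(0.0 if v is None else 1.0)
        out.append(values + mask)
    return out
```

With 16 input features, each output row has 32 entries, matching the expanded input dimension. Replacing the seeding line with a constant would give a strictly causal variant at the cost of less informative leading values.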
5. Experimental Results
This section presents fixed-seed results for E1, E2, E3a, and E3b, followed by multi-seed verification and ablation focused on the zero-shot domain-shift scenarios E3a and E3b. Additional E1 multi-seed results for DK-LSTM-v4 are used to characterize the in-domain performance trade-off of the final stabilized configuration.
5.1. Learning Curves
This subsection presents the validation MAE learning curves of the four models trained on FD001 (in-domain). The curves provide insight into the optimization characteristics induced by each model’s loss function design.
Figure 2 presents the validation MAE learning curves under E1 (FD001, normal conditions). GRU converges rapidly at approximately epoch 10, while Standard LSTM exhibits a step-wise descent around epoch 17. DK-LSTM shows high variability in early epochs and continues training until approximately epoch 65, attributable to the optimization complexity introduced by the composite loss function (Safety + Neg + Upper).
The curve makes visible how the network negotiates the safety, non-negativity, and upper-bound penalties while simultaneously minimizing MSE. Although the final converged values are broadly comparable, the in-domain E1 results show that CNN-LSTM achieves the lowest NASA Score, while DK-LSTM performs comparably to GRU. This indicates that the main advantage of the proposed stabilized DK-LSTM configuration should not be interpreted as in-domain optimization, but as robustness-oriented behavior under the zero-shot domain-shift scenarios analyzed in Section 5.4 and Section 5.5.
All models share an identical weight initialization seed (42) for this single-seed visualization, which explains the common starting point of the validation MAE learning curves; subsequent divergence reflects the differing optimization landscapes induced by each loss function. Standard LSTM and GRU optimize a single MSE objective, while DK-LSTM optimizes a composite loss with multiple constraint terms, leading to higher early-epoch variability and longer convergence.
5.2. E1 Scenario: FD001 Baseline Performance
Under in-domain conditions (FD001), CNN-LSTM achieves the lowest NASA Score (275.8) by leveraging local pattern extraction, while DK-LSTM (393.6) and GRU (401.9) perform comparably. Standard LSTM lags substantially (596.3). This suggests that under in-domain conditions where the test distribution matches training, architectural complexity (CNN-LSTM) can be more impactful than constraint-regularization (DK-LSTM). However, as shown in Section 5.4, this ordering changes substantially under operational-condition shift, where CNN-LSTM degrades sharply and DK-LSTM shows stronger safety-oriented robustness.
Table 5 reports the fixed-seed E1 performance comparison under the in-domain FD001 setting.
For reference, state-of-the-art CNN-LSTM and Transformer-based methods report RMSE of approximately 10.9–12.5 on C-MAPSS FD001 [15, 40]; the purpose of this study is not to minimize in-domain RMSE but to verify constraint-regularized loss behavior under zero-shot domain shift.
5.3. E2 Scenario: Sensor Missingness Robustness
Under 30% sensor missingness, CNN-LSTM (488.5) and DK-LSTM (427.9) show degradation relative to their E1 performance, while Standard LSTM (380.6) and GRU (387.0) remain more stable. The composite loss function of DK-LSTM and the local convolution kernels of CNN-LSTM both appear sensitive to the 32-dimensional input space formed by concatenating the 16 original features with 16 binary mask indicators. These results motivate future research into missingness-aware recurrent architectures (e.g., GRU-D [31]) or impute-then-predict joint approaches.
Table 6 reports the fixed-seed performance comparison under the E2 sensor-missingness scenario.
5.4. E3 Scenario: Structural Domain Shift
This subsection presents results under zero-shot domain shift, which represents the core contribution of this study. Both single-seed (seed = 42) results and multi-seed verification (seeds 0, 42, 123) are reported. The multi-seed analysis provides a statistically more stable assessment of domain-shift robustness and motivates the stabilized configuration (DK-LSTM-v4) presented in Section 5.5.
Under operational-condition shift (E3a, fixed seed = 42), the original DK-LSTM achieves the lowest NASA Score (334,736.8), outperforming Standard LSTM (667,607.1), GRU (847,741.0), and CNN-LSTM (12,539,335). Notably, CNN-LSTM exhibits catastrophic degradation under E3a, suggesting that local convolutional pattern extraction may overfit to source-domain temporal patterns and fail to generalize across the six operating conditions of FD002. This contrast supports the hypothesis that constraint-regularized learning can provide a useful inductive bias under operational-condition shift.
Table 7 reports the fixed-seed performance comparison for E3a, representing operational-condition shift from FD001 to FD002.
Under fault-mode shift (E3b, fixed seed = 42), CNN-LSTM achieves the lowest NASA Score (677,130), while DK-LSTM (886,388.4) outperforms GRU (1,017,999.0) and Standard LSTM (1,738,546.9). This indicates that DK-LSTM is not uniformly superior across all domain-shift settings. Instead, the benefit of constraint-regularized learning is more pronounced under operational-condition shift (E3a), whereas CNN-LSTM remains competitive under fault-mode shift (E3b), where the operational-condition distribution is less different from the source domain.
Table 8 reports the fixed-seed performance comparison for E3b, representing fault-mode shift from FD001 to FD003.
The exponential nature of the NASA Score amplifies large individual errors under zero-shot transfer. These fixed-seed results are complemented by the multi-seed verification and stabilization analysis in Section 5.5.
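To make the amplification effect concrete, the sketch below implements the widely used PHM'08 form of the scoring function, with time constants of 13 for early predictions and 10 for late predictions. The metric definition used in this study belongs to the experimental setup and is not reproduced in this section, so the constants here are the standard ones and should be read as an assumption.

```python
import math


def nasa_score(y_true, y_pred):
    """Sum of asymmetric exponential penalties over all test units.

    d = predicted RUL - true RUL. Late predictions (d >= 0) are penalized
    with exp(d / 10) - 1, early predictions (d < 0) with exp(-d / 13) - 1.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        d = p - t
        if d >= 0:
            total += math.exp(d / 10.0) - 1.0
        else:
            total += math.exp(-d / 13.0) - 1.0
    return total
```

Because the penalty is exponential in the error, a single unit over-predicted by 100 cycles contributes exp(10) − 1, on the order of 2 × 10^4, so a handful of badly transferred units can dominate the total score of an entire test set.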
5.5. Multi-Seed Verification and Ablation
To evaluate whether the observed domain-shift behavior is robust to random initialization and training variability, we conduct multi-seed verification under the E3a and E3b scenarios using seeds 0, 42, and 123. In addition, we examine whether loss-weight adjustment and training-stabilization techniques can improve the stability of DK-LSTM under zero-shot domain shift. This analysis compares the original DK-LSTM with three stabilized variants, culminating in DK-LSTM-v4.
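The aggregation behind a "mean ± standard deviation" entry can be sketched as follows. The callable `train_and_score` is a hypothetical placeholder for one full training and evaluation run, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the estimator is not specified in this section.

```python
import statistics


def multi_seed_summary(train_and_score, seeds=(0, 42, 123)):
    """Run one independent training per seed; return (mean, std) of scores."""
    scores = [train_and_score(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


# Dummy scorer for demonstration only; it does not reproduce any reported result.
mean_score, std_score = multi_seed_summary(lambda seed: float(seed))
```

With only three seeds the standard deviation is itself a noisy estimate, which is worth keeping in mind when comparing configurations whose intervals overlap.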
5.5.1. Loss Weight and Stabilization Ablation
Four DK-LSTM configurations were evaluated: the original DK-LSTM and the following three variants.
DK-LSTM-v2: Safety penalty weakened (λ_safety: 1.0 → 0.5; asymmetric weights: (2.0, 0.2) → (1.5, 0.5)).
DK-LSTM-v3: v2 + λ_upper strengthened (1.0 → 2.0).
DK-LSTM-v4 (final stabilized): v3 + clipnorm tightened (1.0 → 0.5), dropout strengthened (0.2 → 0.3), cosine learning rate scheduling with 5-epoch warmup.
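A cosine schedule with a 5-epoch warmup, as used in DK-LSTM-v4, can be sketched as below. The linear warmup shape, the base learning rate of 1e-3, and the floor of 0.0 are assumptions for illustration; only the warmup length and the cosine form are taken from the configuration list.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=1e-3, warmup_epochs=5, min_lr=0.0):
    """Learning rate for a zero-indexed epoch.

    Ramps linearly to base_lr over the warmup epochs, then follows a
    half-cosine decay from base_lr down to min_lr at total_epochs.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The warmup keeps early updates small while the composite loss terms are still far from balance, and the cosine decay shrinks the step size gradually later in training.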
Table 9 reports the multi-seed verification results for E3a, comparing the baseline models and the DK-LSTM variants under operational-condition shift.
Table 10 reports the corresponding multi-seed verification results for E3b under fault-mode shift.
5.5.2. Findings
Original DK-LSTM under multi-seed. The single-seed results in Table 7 and Table 8 reflected a favorable seed for the original DK-LSTM. As shown in Table 9 and Table 10, the original DK-LSTM shows mean NASA Scores of 1,011,947 in E3a and 1,084,459 in E3b, both higher than GRU’s mean.
Loss-weight tuning alone is insufficient. DK-LSTM-v2 (weakened safety) and DK-LSTM-v3 (v2 + strengthened upper bound) further amplify the seed sensitivity in E3a, with NASA Score means above 11.6 × 10^6. This indicates that loss-weight rebalancing, while improving E3b moderately, does not address the underlying optimization instability under multi-condition operational shift.
Comprehensive stabilization (DK-LSTM-v4) resolves E3a instability. With cosine learning rate scheduling, tightened gradient clipping (0.5), and stronger dropout (0.3), DK-LSTM-v4 achieves a mean NASA Score of 269,778 ± 215,387 in E3a, a 43.7% improvement over GRU (479,241) and a 78.7% improvement over Standard LSTM (1,265,523). Under E3b, DK-LSTM-v4 maintains a 20.8% improvement over GRU. CNN-LSTM remains highly unstable in E3a (2.9 × 10^6) and competitive with DK-LSTM-v4 in E3b (823,932), confirming that architectural complexity alone does not guarantee cross-domain robustness.
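The E3a improvement percentages follow from the reported mean NASA Scores as a relative reduction with respect to each baseline. The short check below uses only the means quoted above; the helper name `improvement_pct` is introduced here for illustration.

```python
def improvement_pct(baseline, candidate):
    """Relative reduction of a lower-is-better score, in percent."""
    return 100.0 * (baseline - candidate) / baseline


dk_v4_e3a = 269_778  # DK-LSTM-v4 mean NASA Score in E3a

print(round(improvement_pct(479_241, dk_v4_e3a), 1))    # vs. GRU -> 43.7
print(round(improvement_pct(1_265_523, dk_v4_e3a), 1))  # vs. Standard LSTM -> 78.7
```

Note that these percentages compare means only; the ± 215,387 spread of DK-LSTM-v4 across seeds is not reflected in them.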
Trade-off in in-domain performance. Additional multi-seed runs for DK-LSTM-v4 under E1 confirm that the final stabilized configuration is not an in-domain optimizer. DK-LSTM-v4 records an E1 NASA Score of 3,532.26 ± 5,457.79 across seeds 0, 42, and 123, substantially worse than the less strongly regularized DK-LSTM-v3 configuration (462.5 ± 112.5). This confirms that the stronger stabilization strategy—dropout 0.3, tighter gradient clipping, and cosine learning-rate scheduling—sacrifices in-domain accuracy while improving zero-shot operational-condition shift robustness under E3a. DK-LSTM-v4 is therefore positioned as a domain-shift-oriented stabilized configuration rather than a general-purpose in-domain accuracy optimizer.
5.6. Result Summary
Figure 3 summarizes the multi-seed domain-shift behavior and stabilization effects under E3a and E3b, while Figure 4 provides a compact fixed-seed overview across E1, E2, E3a, and E3b. The scenario-specific quantitative results are reported in the corresponding tables in Section 5.2, Section 5.3, Section 5.4 and Section 5.5. In the in-domain E1 setting, CNN-LSTM achieves the best NASA Score, while DK-LSTM remains comparable to GRU. Under E2 sensor missingness, DK-LSTM and CNN-LSTM show degradation relative to simpler recurrent baselines. The central empirical finding is therefore not universal superiority across all scenarios, but the multi-seed robustness of the stabilized DK-LSTM-v4 configuration under zero-shot domain shift, particularly E3a and, relative to GRU and Standard LSTM, E3b.
6. Discussion
6.1. Effect of Domain Constraints Under Zero-Shot Domain Shift
The primary contribution of this study lies in zero-shot domain-shift performance under multi-seed verification. In the in-domain E1 setting, CNN-LSTM achieves the strongest NASA Score, while the original DK-LSTM remains comparable to GRU rather than dominating all baselines (Section 5.2). The stabilized DK-LSTM-v4, by contrast, is designed for domain-shift robustness and achieves substantial NASA Score reduction under E3a and E3b: 43.7% over GRU and 78.7% over Standard LSTM in E3a, and 20.8% over GRU in E3b (Table 9 and Table 10).
CNN-LSTM, despite stronger E1 performance, shows catastrophic degradation under E3a (single-seed: 12.5 × 10^6; multi-seed mean: 2.9 × 10^6), confirming that architectural complexity alone does not provide cross-domain inductive bias. In contrast, the constraint-regularized loss combined with comprehensive training stabilization (clipnorm 0.5, dropout 0.3, cosine LR) produces predictions that remain bounded and safety-shaped under unseen operating conditions.
In safety-critical maritime/defense environments, RUL over-prediction directly translates to delayed maintenance and equipment failure risk. The asymmetric NASA Score weights such errors more heavily, making it the primary metric for these applications. The 43.7% NASA Score improvement of DK-LSTM-v4 over the strongest baseline (GRU) under E3a thus carries practical significance beyond what RMSE alone would indicate.
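The asymmetry of the NASA Score can be illustrated with a minimal sketch. It assumes the standard C-MAPSS (PHM08) scoring form with time constants 13 (early predictions) and 10 (late predictions); the function names `nasa_score` and `rmse` and the example values are illustrative and are not taken from the paper's code.

```python
import math

def nasa_score(y_true, y_pred):
    """Asymmetric C-MAPSS scoring function (lower is better).

    Assumes the standard PHM08 form: d = predicted - true RUL,
    with late predictions (d >= 0) penalized more steeply.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        d = p - t
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0   # early: maintenance too soon
        else:
            total += math.exp(d / 10.0) - 1.0    # late: RUL over-predicted
    return total

def rmse(y_true, y_pred):
    """Symmetric root-mean-square error, for comparison."""
    sq = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

# Two predictions with the same absolute error of 20 cycles.
true_rul = [50.0]
early = [30.0]   # under-prediction
late = [70.0]    # over-prediction

same_rmse = rmse(true_rul, early) == rmse(true_rul, late)
late_is_worse = nasa_score(true_rul, late) > nasa_score(true_rul, early)
```

Both predictions have identical RMSE, but the over-prediction receives the larger score, which is why an RMSE comparison alone can understate the operational cost of late maintenance.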
6.2. Training Instability Under Sensor Missingness (E2)
Both DK-LSTM and CNN-LSTM show degradation under 30% sensor missingness (Section 5.3). DK-LSTM’s composite loss function generates conflicting gradients in the 32-dimensional (16 + 16 mask) input space, while CNN-LSTM’s local convolution kernels appear sensitive to imputation artifacts. Standard LSTM with simple MSE loss navigates the corrupted space more flexibly. These results demonstrate that neither composite domain constraints nor architectural depth confers universal benefits, motivating future research into missingness-aware recurrent architectures (e.g., GRU-D [31]) or impute-then-predict joint approaches (e.g., BRITS).
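To make the mask-augmented input concrete, the sketch below builds rows of imputed values followed by an observation mask, so that 16 sensors would yield the 32-dimensional input described above. It is an assumption-laden illustration: the forward-fill rule, the per-window scope of the first-valid-value initialization, and the name `impute_with_mask` are hypothetical and may differ from the actual preprocessing.

```python
def impute_with_mask(window):
    """Forward-fill missing readings and append a 0/1 observation mask.

    window: list of time steps, each a list of sensor values (None = missing).
    Returns one row per time step: [imputed values..., mask values...].
    """
    n_sensors = len(window[0])

    # Initialize each sensor with its first valid value in the window
    # (0.0 if the sensor is never observed). When the first reading is
    # missing, this borrows a later value, so it is not strictly causal.
    last = []
    for j in range(n_sensors):
        first_valid = next((row[j] for row in window if row[j] is not None), 0.0)
        last.append(first_valid)

    rows = []
    for row in window:
        values, mask = [], []
        for j, v in enumerate(row):
            if v is None:
                mask.append(0.0)      # reading was missing at this step
            else:
                last[j] = v           # update the carried-forward value
                mask.append(1.0)
            values.append(last[j])
        rows.append(values + mask)
    return rows

# Two sensors, three time steps; None marks a dropped reading.
example = impute_with_mask([[None, 2.0], [1.0, None], [None, 3.0]])
```

In the first row of `example`, sensor 0 is filled with a value observed one step later, which is the kind of residual leakage that a "leakage-minimized" label acknowledges.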
6.3. Deactivation of the Monotonic Degradation Constraint
This study does not claim its core contribution is a “fully activated physics-informed LSTM with monotonic degradation constraint.” The current implementation approximates monotonicity via in-batch target sorting in a single-time-point output model, which can distort the physical temporal ordering in data with mixed operating conditions such as FD002.
In numerous iterative experiments, setting λ_mono > 0 caused training divergence or substantially unstable convergence in E3a/E3b; accordingly, this study designates the monotonicity constraint as “implemented but deactivated in the final reported experiments.”
Trajectory-level monotonicity enforcement techniques that directly constrain outputs to decrease consistently along the time-series order are deferred to future work.
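One plausible reading of the batch-sort approximation is sketched below to show why it can distort temporal ordering. This is not the paper's implementation: the function name `batch_sort_mono_penalty`, the squared-hinge form, and the example values are assumptions made for illustration.

```python
def batch_sort_mono_penalty(y_true, y_pred):
    """Penalize predictions that increase when the batch is ordered by
    descending true RUL (a stand-in for 'later in the degradation')."""
    order = sorted(range(len(y_true)), key=lambda i: y_true[i], reverse=True)
    sorted_pred = [y_pred[i] for i in order]
    # A violation occurs when a sample with lower true RUL gets a higher prediction.
    violations = [max(0.0, b - a) for a, b in zip(sorted_pred, sorted_pred[1:])]
    return sum(v * v for v in violations) / max(1, len(violations))

# Samples in one batch may come from different engines and operating
# conditions, so the sorted order is not the time order of any trajectory.
consistent = batch_sort_mono_penalty([100.0, 20.0, 60.0], [90.0, 30.0, 50.0])
violating = batch_sort_mono_penalty([100.0, 20.0, 60.0], [90.0, 70.0, 50.0])
```

Because the ordering is induced by the targets rather than by time within a single unit, the penalty couples unrelated samples, which is consistent with the structural limitation described above.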
To further examine whether this instability could be mitigated through training stabilization, we explored three hyperparameter-level interventions, documented in Section 5.5: tighter gradient clipping (clipnorm 1.0 → 0.5), stronger dropout (0.2 → 0.3), and cosine learning-rate scheduling. While these interventions successfully stabilize DK-LSTM-v4 under operational-condition shift (E3a), they do not enable activation of the monotonicity constraint, because the underlying limitation is structural (batch-sort approximation in single-output architecture) rather than optimization-related. Trajectory-level monotonicity enforcement via sequence-to-sequence formulations is identified as the natural next step.
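Two of the stabilization interventions can be written down in a few lines. The sketch below is framework-independent and only illustrates the mechanics; the peak learning rate `lr_max`, the step counts, and the function names are hypothetical, and dropout is omitted because it acts inside the network.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_min=0.0):
    """Cosine learning-rate schedule: decays from lr_max to lr_min."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def clip_by_norm(grad, clipnorm=0.5):
    """Rescale one gradient vector so its L2 norm does not exceed clipnorm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm or norm == 0.0:
        return list(grad)
    scale = clipnorm / norm
    return [g * scale for g in grad]

# A gradient of norm 5.0 is scaled down to norm 0.5; its direction is kept.
clipped = clip_by_norm([3.0, 4.0], clipnorm=0.5)
clipped_norm = math.sqrt(sum(g * g for g in clipped))

# Learning rate at the start, midpoint, and end of a 100-step run.
lr_start, lr_mid, lr_end = (cosine_lr(s, 100) for s in (0, 50, 100))
```

Lowering `clipnorm` from 1.0 to 0.5 halves the largest allowed update for a given learning rate, which bounds step size but does not change which terms of the loss are in conflict.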
6.4. Significance of the ISO 19848-Aligned Data Contract
The ISO 19848-aligned data contract is materialized as a JSON schema and provided as Supplementary Material. This schema serves as a reference point for maintaining semantic consistency when integrating actual vessel data into the same pipeline or extending to other shipboard equipment. The physical constraint specifications (non-negativity, upper bound) included in the data contract are directly linked to the model loss function design, ensuring consistency between the data layer and the learning algorithm.
Full ISO 19848 compliance requires additional steps including formal channel registration with classification societies, data quality verification procedures, and integration with onboard data infrastructure conforming to ISO 19847 (ship data servers) [24, 38].
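The link between the contract's constraint specifications and the loss function can be sketched as follows. Everything named here is a hypothetical placeholder rather than the schema in the Supplementary Material: the field names, the `rul_estimate` channel, the upper bound of 125 cycles, and the squared-hinge penalty form are assumptions chosen for illustration.

```python
# Hypothetical contract fragment; field names are not taken from ISO 19848
# or from the paper's JSON schema.
CONTRACT = {
    "channel": "rul_estimate",
    "unit": "cycles",
    "constraints": {"min": 0.0, "max": 125.0},
}

def out_of_range(values, contract):
    """Data-layer check: indices of values violating the contract bounds."""
    lo = contract["constraints"]["min"]
    hi = contract["constraints"]["max"]
    return [i for i, v in enumerate(values) if v < lo or v > hi]

def boundary_penalty(y_pred, contract):
    """Learning-layer term: mean squared violation of the same bounds."""
    lo = contract["constraints"]["min"]
    hi = contract["constraints"]["max"]
    below = [max(0.0, lo - p) ** 2 for p in y_pred]   # non-negativity
    above = [max(0.0, p - hi) ** 2 for p in y_pred]   # upper bound
    return (sum(below) + sum(above)) / len(y_pred)

predictions = [-5.0, 50.0, 130.0]
flagged = out_of_range(predictions, CONTRACT)
penalty = boundary_penalty(predictions, CONTRACT)
```

Reading both the validation rule and the penalty bounds from one contract object is one way to keep the data layer and the loss function from drifting apart when bounds are revised.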
6.5. Comparison with State-of-the-Art
The focus of this study is not to achieve the lowest in-domain RMSE on C-MAPSS FD001. Recent Transformer-based architectures [14, 40], hybrid CNN-LSTM models [15, 28], and deep convolutional networks [16] report RMSE of approximately 10.9–12.5 on FD001. By contrast, the stabilized DK-LSTM-v4 configuration is not optimized for the lowest in-domain RMSE; it deliberately trades in-domain accuracy for stronger cross-domain robustness through stabilization and constraint-regularized learning. This design choice follows from two considerations:
(1) DK-LSTM is intentionally implemented as a simple two-layer LSTM to isolate the contribution of the loss function, and (2) DK-LSTM-v4 is targeted at zero-shot domain-shift robustness rather than in-domain accuracy maximization.
6.6. Limitations
This study is intentionally scoped toward evaluating whether domain-knowledge constraints and training stabilization improve RUL prediction under stress scenarios, rather than toward achieving the lowest in-domain RMSE on C-MAPSS FD001.
(1) Absence of real vessel data: The experimental results constitute a proxy stress test on aviation-domain data and do not represent maritime environment validation.
(2) Monotonic degradation constraint deactivated: The batch-sort approximation in a single-output architecture does not permit trajectory-level monotonicity enforcement. This limitation is structural, not hyperparameter-related.
(3) In-domain performance trade-off: DK-LSTM-v4’s comprehensive stabilization (cosine LR, clipnorm 0.5, dropout 0.3) reduces in-domain (E1) accuracy. The model is positioned as a domain-shift-oriented stabilized configuration, not an in-domain optimizer.
(4) E2 performance: DK-LSTM does not outperform the simpler baselines under 30% sensor missingness.
(5) Imputation rigor: Initialization with the first valid value departs from strictly causal imputation; this is documented as “leakage-minimized” rather than “leakage-free.”
(6) Full DT not implemented: The current framework is at a DT-ready stage without physical–virtual state synchronization.
(7) Multi-seed evaluation scope: Multi-seed baseline comparison is limited to E3a and E3b, which represent the core zero-shot domain-shift contribution of this study. Additional E1 multi-seed runs are conducted only for DK-LSTM-v4 to characterize the in-domain trade-off of the stabilized configuration. Full multi-seed evaluation across all models under E1 and E2 remains future work.
(8) Architectural comparison scope: Direct multi-seed Transformer baselines under the same protocol remain future work.
(9) Disturbance scope: The robustness evaluation addresses sensor missingness; additional disturbances (Gaussian noise, sensor drift, outlier injection) remain future work.
6.7. Scope of Data Scarcity and Future Extension
In this paper, “data scarcity” is defined at two levels. The first is domain-level scarcity: publicly available failure datasets for the maritime domain are absent, which necessitates the use of C-MAPSS as a proxy. The second is observation-level scarcity: data go missing because of sensor faults, communication failures, and packet loss during vessel operation, which the E2 (30% missingness) scenario represents.
Separately, experiments that reduce the training data volume itself (sample-level sparsity) may also be discussed as one facet of data-scarce environments. For example, the 100 FD001 training units could be reduced to 50% or 10% to evaluate whether DK-LSTM’s domain constraints exert a regularization effect on small data.
However, such experiments were not included in the scope of this study for the following reasons. First, the E3a/E3b scenarios in the current experimental design already constitute zero-shot cross-domain transfer in which target-domain training data is 0%, representing a more severe form of data insufficiency than training data reduction. Second, reducing training units to 10 would necessitate hyperparameter readjustment to prevent overfitting, compromising the controlled experimental design in which the loss function contribution is the sole independent variable. Third, under a single-seed (seed = 42) condition, small-unit experiments may be dominated by unit-selection bias, making interpretation difficult without concurrent multi-seed experiments. Accordingly, DK-LSTM robustness evaluation under training data reduction conditions is deferred to future work when a multi-seed experimental infrastructure is in place. This is also a meaningful follow-up task from the perspective of PHM methodology evaluation in small-data environments as highlighted by Li et al. [7].
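The unit-selection concern can be made concrete with a small sketch of unit-level subsampling. It assumes FD001 training units are identified by integers 1 to 100; the function name `subsample_units` and the selection rule are hypothetical and do not describe an experiment performed in this study.

```python
import random

def subsample_units(unit_ids, fraction, seed):
    """Pick a reproducible subset of training units for a given seed."""
    unit_ids = list(unit_ids)
    k = max(1, round(len(unit_ids) * fraction))
    rng = random.Random(seed)          # local generator, no global state
    return sorted(rng.sample(unit_ids, k))

all_units = range(1, 101)              # assumed FD001 unit identifiers

# At 10%, each seed trains on a different set of 10 engines, so a
# single-seed result reflects which engines happened to be drawn.
subsets = {seed: subsample_units(all_units, 0.10, seed) for seed in (0, 42, 123)}
half = subsample_units(all_units, 0.50, seed=42)
```

Reporting the spread of results across such subsets, rather than a single draw, is what a multi-seed infrastructure would add to a training-data reduction experiment.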
The multi-seed ablation in Section 5.5 also indicates that DK-LSTM-v4’s robustness in E3a relies on the combination of constraint-regularized loss and training stabilization. Future work will examine whether the same stabilization techniques transfer to alternative architectures (Transformer, CNN-LSTM) when paired with the proposed loss function.
7. Conclusions
This paper proposed a standards-aligned hybrid AI–DT workflow for predictive maintenance of maritime vessels and offshore plants under data scarcity and heterogeneous sensor environments. The main contribution of the study is the integration of an ISO 19848-aligned data contract with a constraint-regularized DK-LSTM prognostics module, thereby linking standards-based signal semantics with safety-oriented RUL prediction. The proposed workflow further contributes a scenario-based robustness evaluation protocol that separates sensor missingness, operational-condition shift, and fault-mode shift, together with multi-seed ablation analysis and an additional CNN-LSTM hybrid baseline for architectural comparison. The core predictive component embeds two active physical-boundary constraints—RUL non-negativity and operating-range upper bound—together with an asymmetric safety penalty, while the monotonic degradation constraint is explicitly treated as an implemented but deactivated component due to the structural limitation of batch-sort approximation in a single-output architecture.
In systematic ablation with multi-seed verification (seeds 0, 42, 123), the final stabilized configuration (DK-LSTM-v4) achieves, under operational-condition shift (E3a), a 43.7% NASA Score improvement over the strongest baseline (GRU) and a 78.7% improvement over Standard LSTM, and, under fault-mode shift (E3b), a 20.8% improvement over GRU. CNN-LSTM, despite stronger in-domain performance, exhibits catastrophic degradation under E3a, confirming that architectural complexity alone does not guarantee cross-domain robustness. DK-LSTM-v4 is positioned as a domain-shift-oriented stabilized configuration that trades in-domain accuracy for substantial cross-domain robustness, aligning with the practical requirement of safety-critical maritime and defense applications where target-domain training data is unavailable.
We acknowledge several limitations. First, the deactivation of the monotonic degradation constraint (λ_mono = 0.0) reflects the structural limitation of batch-sort approximation in single-output architectures, rather than a hyperparameter issue; trajectory-level monotonicity enforcement via sequence-to-sequence formulations is the natural next step. Second, while multi-seed verification confirms statistical robustness in the domain-shift scenarios, full multi-seed evaluation across all models under the in-domain (E1) and sensor-missingness (E2) scenarios remains future work. Third, the in-domain trade-off of DK-LSTM-v4 is acknowledged as a deliberate consequence of the stabilization-oriented design.
Future work will address (a) domain adaptation validation using real vessel sensor data, (b) mask-attention-based missingness handling, (c) trajectory-level monotonicity constraint implementation, (d) full DT extension including physical–virtual state synchronization, (e) application of the DK-LSTM-v4 loss and stabilization combination to more advanced architectures (Transformer, CNN-LSTM hybrids) under multi-seed protocol, (f) extension of robustness evaluation to additional data disturbances (Gaussian noise, sensor drift, outlier injection), and (g) full multi-seed evaluation of E1 and E2.