1. Introduction
Structural health monitoring (SHM) of fibre-reinforced composites under cyclic mechanical loading has become a core component of reliability engineering practice for aerospace, civil, and energy infrastructure assets. A central operational artefact in such pipelines is the health indicator, a scalar trajectory that summarises the cumulative damage state of a structural specimen and is consumed downstream by maintenance-decision modules that schedule inspection, repair, and retirement [1]. Many artificial intelligence methods have been developed for health management [2,3,4,5,6]. Three properties are demanded of any health indicator construction method that is to be deployed in field settings: monotonicity (the indicator must not decrease over time), trendability (similar specimens must produce similar trajectories), and robustness (the indicator must be insensitive to high-frequency sensor noise) [7]. Monotonicity is required only in the absence of self-healing or maintenance-induced restoration events; where such events occur (e.g., bond-line repair on aircraft primary structures), the constraint must be relaxed or applied piecewise between maintenance events. The composite of these three quality measures, hereafter denoted the Prognostics and Health Management (PHM) Composite and abbreviated PCS, has become the de facto evaluation metric for health indicator quality in the prognostics and health management community [1].
However, in real composite-fatigue datasets, the only reliably labelled timestamp per specimen is the failure event, after which the specimen is removed from service [8]. Per-time labels of the underlying damage state are unavailable, since damage is unobservable except through the very acoustic emission, strain, and fibre Bragg grating channels that the health indicator must interpret. This regime is referred to here as the failure-only-supervised setting.
Two predominant modelling families have been deployed for failure-only-supervised health indicator construction. The first family applies endpoint-supervised regressors, including convolutional and recurrent architectures [9,10,11], transformer-style attention networks [12], and physics-informed hybrids [13,14]. These regressors typically attain high rank correlation against normalised life, but the inferred per-window indicator values are often non-monotone in time, particularly on noisy specimens whose acoustic-emission descriptors fluctuate rapidly. A jagged indicator, even one whose Spearman correlation against ground-truth life fraction is high, is operationally invalid for SHM deployment because downstream maintenance-decision logic does not tolerate non-monotone inputs: a temporary decrease in the indicator is interpreted as evidence of self-healing, which is inconsistent with the accumulation-of-damage physics that the SHM pipeline is built around. The second family applies unsupervised, isotonic, or contrastive estimators [15], which produce monotone indicators by construction but cannot exploit the cross-specimen failure-event evidence and tend to degrade under cross-condition transfer.
A further complication is that the multi-source observation channels in field-deployed composite SHM rigs are heterogeneous in count, in scale, and in informativeness. Acoustic emission (AE) descriptors are computed as summary statistics over per-window waveform segments and span twenty-five descriptors per window in the present dataset. Strain channels are scalar and well-conditioned but are sensitive to global rigid-body effects that are uncorrelated with damage. Fibre Bragg grating channels are spatial measurements distributed across the specimen and span a variable count of zero to five channels depending on instrumentation. A deployed health indicator construction method must be agnostic to the fibre channel count, must accommodate the absence of the fibre stream entirely (the so-called fibre-mask robustness regime), and must produce indicators whose calibration is invariant to the underlying stress level. Existing multi-sensor fusion strategies based on graph neural networks [16,17] address the channel-count heterogeneity but introduce non-trivial parameter overhead. In an empirical pre-check on the present dataset, the modality-conditional gating variant was found to be statistically indistinguishable from a simple mask-aware sum-pool, and the gating was consequently discarded.
The motivation for the present work is sketched in Figure 1. In the left panel a conventional endpoint-supervised regressor produces a high-Spearman but jagged indicator that violates the monotonicity requirement and is consequently unusable for maintenance-decision deployment. In the right panel the indicator produced by SAMS-Net under the same supervision regime is smooth, monotone, and bounded, and consequently passes the engineering acceptance criteria. The inset illustrates the supervision regime: the only available label is the failure event, and the time series preceding it is otherwise unlabelled.
The present method is deliberately minimal. Stacking many theoretically motivated architectural components without per-component ablation risks over-claiming, because the operative mechanism may in fact be a single element, while the remainder add parameters without measurable benefit. An effective method should therefore commit to a small number of contributions, include an a priori empirical pre-check before adopting each component, and report null ablation findings transparently. Accordingly, SAMS-Net retains only the smooth-latent provider (a neural differential equation backbone, in either its stochastic SDE form or its deterministic ODE limit) and replaces every other heuristic by the two-level Pool-Adjacent-Violators projection, whose dominance is demonstrated in Section 4.
The methodological insight is that the failure mode of conventional endpoint-supervised regressors is a constraint-satisfaction failure: they learn a useful representation but produce trajectories lacking the structural inductive bias that engineering demands. The remedy is a hard projection onto the constraint manifold, applied at training time so the gradient through the projection reshapes the upstream representation, and again at inference to enforce the global constraint. Pool-Adjacent-Violators (PAV) is the natural projector for monotonicity because it computes the least-squares (Euclidean) projection onto the cone of non-decreasing sequences in amortised linear time [7].
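As a minimal illustration of the projector discussed above, the following Python sketch implements unweighted Pool-Adjacent-Violators with a block stack. It is a generic textbook formulation and not the SAMS-Net projection head: the function name `pav_projection` and the example values are assumptions made for illustration, and the straight-through gradient used at training time is not shown.

```python
from typing import List, Sequence


def pav_projection(values: Sequence[float]) -> List[float]:
    """Least-squares projection onto non-decreasing sequences (unweighted PAV)."""
    # Each block stores [sum, count]; its fitted value is sum / count.
    blocks: List[List[float]] = []
    for v in values:
        blocks.append([float(v), 1.0])
        # Merge backwards while the previous block mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand every block back to one fitted value per original element.
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


# Illustrative (made-up) jagged indicator values.
jagged = [0.1, 0.3, 0.2, 0.6, 0.5, 0.9]
print(pav_projection(jagged))
```

Each element is pushed once and each merge removes one block, which is where the amortised linear cost comes from. Violating neighbours (here 0.3, 0.2 and 0.6, 0.5) are replaced by their block means, and values that already respect the ordering are left unchanged.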
The present method, SAMS-Net (Smoothness-Anchored Monotone Neural Differential Equation Network), is the minimal proposal consistent with these constraints. Three contributions are claimed.
A two-level Pool-Adjacent-Violators projection head is introduced, in which a within-window projection is applied during training with a straight-through gradient and an across-window projection is applied at inference. This is the dominant contribution per the ablation study reported in Section 4: removing the projection drops the PCS by roughly 0.39.
A smoothness-stratified two-phase (SSTP) training schedule is introduced, in which an initial portion of the E training epochs is allocated to specimens whose per-specimen median local-smoothness coefficient exceeds 0.5, after which a full-set fine-tuning phase covers the entire training pool.
A neural differential equation backbone (either a stochastic SDE or its deterministic ODE limit, the two variants being operationally equivalent on the present dataset as established by ablation A4) with smoothness-derived drift weighting is adopted, providing the smooth latent on which the projection acts. The backbone is presented as a smooth-latent provider rather than as an inductive-bias claim, since both null findings below contradict any such claim. Two architectural choices that did not materialise as positive contributions in ablation, namely the smoothness-adaptation of the drift weighting and the stochasticity of the diffusion, are reported transparently in Section 4 as null findings.
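The two-phase schedule of the second contribution can be sketched as follows. The sketch assumes that a list of local-smoothness coefficients is already available for each specimen. The function name, the specimen identifiers, and in particular the number of phase-one epochs (`phase_one_epochs`) are hypothetical, since the split point is not stated in this section.

```python
from statistics import median
from typing import Dict, List, Sequence


def two_phase_schedule(
    smoothness: Dict[str, Sequence[float]],
    total_epochs: int,
    phase_one_epochs: int,  # hypothetical: the split point is not given here
    threshold: float = 0.5,
) -> List[List[str]]:
    """Return, for every epoch, the specimen ids that are trained on."""
    if not 0 <= phase_one_epochs <= total_epochs:
        raise ValueError("phase_one_epochs must lie in [0, total_epochs]")
    # Phase one: specimens whose median smoothness coefficient exceeds 0.5.
    smooth = sorted(
        sid for sid, coeffs in smoothness.items() if median(coeffs) > threshold
    )
    # Phase two: full-set fine-tuning on the entire training pool.
    everyone = sorted(smoothness)
    return [
        smooth if epoch < phase_one_epochs else everyone
        for epoch in range(total_epochs)
    ]


# Made-up coefficients for three hypothetical specimens.
coefficients = {
    "A": [0.7, 0.8, 0.6],
    "B": [0.2, 0.4, 0.9],
    "C": [0.55, 0.6, 0.5],
}
for epoch, pool in enumerate(two_phase_schedule(coefficients, 5, 2)):
    print(epoch, pool)
```

With these illustrative values, specimens A and C (medians 0.7 and 0.55) form the phase-one pool, and B joins only in the fine-tuning epochs.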
The remainder of the paper is organised as follows. Section 2 reviews failure-only-supervised health indicator construction, neural-SDE prognostics, and isotonic-regression-based health indicators. Section 3 formalises the problem and describes the architecture, training procedure, and loss. Section 4 reports the experimental study, including the main results, ablation, sensitivity, and multi-seed variance analysis. Section 5 concludes.
4. Experimental Study
4.1. Dataset
The empirical study is conducted on a seventeen-specimen open-hole carbon-fibre composite fatigue dataset spanning two cyclic load levels (8 and 10 kN). Each specimen is instrumented with synchronous acoustic-emission, strain, and fibre Bragg grating channels. The fibre channel count varies across specimens (zero, one, two, four, or five channels), and the middle-channel rule is applied to extract a single representative fibre stream per specimen. Specimens are partitioned into three groups: a high-load multi-stage group (G1, nine specimens at 10 kN with multi-stage loading and pre-set cycles), a high-load run-to-fail group (G2, five specimens at 8 or 10 kN run to failure), and a low-load run-to-fail group (G3, three specimens at 8 kN run to failure). Sliding windows of 100 cycles with stride 100 (i.e., non-overlapping) are extracted, yielding per-specimen trajectories of approximately 100 to 400 windows. The only label per specimen is the failure-event cycle, after which the specimen is removed from service; per-time damage-state labels are unavailable.
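The windowing arithmetic can be made concrete with a short sketch. It assumes that each specimen record is indexed by cycle and that an incomplete trailing window is dropped (an assumption, not stated in the text). Under these assumptions, the quoted 100 to 400 windows correspond to records of roughly 10,000 to 40,000 cycles. The helper name `window_bounds` is hypothetical.

```python
from typing import List, Tuple


def window_bounds(
    n_cycles: int, length: int = 100, stride: int = 100
) -> List[Tuple[int, int]]:
    """Half-open [start, end) cycle ranges of all complete sliding windows."""
    # Assumption: a trailing window shorter than `length` is discarded.
    return [
        (start, start + length)
        for start in range(0, n_cycles - length + 1, stride)
    ]


print(len(window_bounds(10_000)))  # 100 windows
print(len(window_bounds(40_000)))  # 400 windows
print(window_bounds(250))          # [(0, 100), (100, 200)]
```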
The acoustic-emission descriptor vector contains twenty-five summary statistics per window: moment-based features (mean, variance, skewness, kurtosis), spectral features (centroid, kurtosis, skewness, roll-off), peak rate, amplitude, and several derived ratios. The descriptor list and the per-feature normalisation constants are fixed a priori. Strain and fibre Bragg grating streams are scalar and z-scored per specimen, and mask-aware fusion ensures that the fibre stream contributes zero for fibre-absent specimens.
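A minimal sketch of per-specimen z-scoring and mask-aware sum-pooling is given below. It assumes that every modality has already been mapped to an embedding of the same dimension. The function names, the modality keys, and the numbers are illustrative assumptions rather than the SAMS-Net implementation.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def zscore(stream: Sequence[float]) -> List[float]:
    """Standardise one scalar stream using that specimen's own statistics."""
    mu = mean(stream)
    sigma = pstdev(stream)
    if sigma == 0.0:
        # A constant stream carries no variation; map it to zeros.
        return [0.0 for _ in stream]
    return [(x - mu) / sigma for x in stream]


def masked_sum_pool(
    embeddings: Dict[str, Sequence[float]], present: Dict[str, bool]
) -> List[float]:
    """Sum per-modality embeddings, skipping modalities flagged as absent."""
    dim = len(next(iter(embeddings.values())))
    fused = [0.0] * dim
    for name, vector in embeddings.items():
        if present.get(name, False):
            fused = [f + v for f, v in zip(fused, vector)]
    return fused


# Made-up two-dimensional embeddings for the three modalities.
embeddings = {"ae": [1.0, 2.0], "strain": [0.5, -1.0], "fibre": [9.0, 9.0]}
all_on = {"ae": True, "strain": True, "fibre": True}
no_fibre = {"ae": True, "strain": True, "fibre": False}
print(masked_sum_pool(embeddings, all_on))    # [10.5, 10.0]
print(masked_sum_pool(embeddings, no_fibre))  # [1.5, 1.0]
```

Because the mask removes the fibre term from the sum, the fused vector for a fibre-absent specimen depends only on the AE and strain embeddings, whatever placeholder values the fibre slot holds.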
4.2. Implementation Details and Compared Methods
SAMS-Net is implemented in PyTorch 2.11 on an NVIDIA RTX-class GPU. Window length is 100 and the stride is 100. Batch size is 192. AdamW is used with learning rate
and weight decay
. All methods are trained for five epochs uniformly to remove training-budget confounding. The cosine-annealing schedule of Section 3 is applied over the same five-epoch horizon, and Algorithm 1 uses the integer-epoch phase transition
(matching the experimental setup). The default hyperparameters of SAMS-Net are
,
,
,
,
. SAMS-Net has approximately one hundred and eighty-eight thousand trainable parameters, smaller than the strongest baselines.
Five baselines are reported. A convolutional neural network combined with a long short-term memory network (CNN-LSTM, ≈191k parameters) [9] and a transformer-style attention regressor (Transformer-RUL, ≈232k) [12,22] are mainstream references. GRU-ODE-Bayes (≈33k) [33] is a continuous-time non-monotone reference. Isotonic-SK [36] is a monotone-by-construction reference. A multi-layer perceptron regressor (MLP-RUL, ≈15k) is a weak feed-forward reference. All methods receive the same supervision (failure-event endpoint) and are trained on identical hardware with the same batch size, learning rate, and epoch count. The across-window PAV is applied only to SAMS-Net since baselines do not claim trajectory-level monotonicity. SAMS-Net’s parameter count (≈188k) lies between those of MLP-RUL and CNN-LSTM and below that of Transformer-RUL, so its advantage over the two strongest baselines cannot be attributed to a larger parameter budget.
Six scenarios are defined: S1 = leave-one-specimen-out (LOSO) on the high-load group (five instances), S2 = LOSO on the low-load group (three instances), S3 = high-to-low cross-condition transfer (G1 + G2 → G3, three instances), S4 = low-to-high transfer (G3 → high, three instances), S5 = multi-stage to single-stage transfer (G1 → G2, three instances), S6 = fibre-mask robustness (training with random fibre masking, testing with fibre absent, three instances). The full grid is 120 training runs plus the ablation and multi-seed variance studies reported below.
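The stated run count can be checked by simple arithmetic. The sketch below assumes one training run per method and scenario instance with a single seed, an assumption that reproduces the stated total of 120.

```python
# Instance counts per scenario, as listed in the scenario definitions.
instances_per_scenario = {"S1": 5, "S2": 3, "S3": 3, "S4": 3, "S5": 3, "S6": 3}

# SAMS-Net plus the five reported baselines.
methods = [
    "SAMS-Net",
    "CNN-LSTM",
    "Transformer-RUL",
    "GRU-ODE-Bayes",
    "Isotonic-SK",
    "MLP-RUL",
]

total_instances = sum(instances_per_scenario.values())
total_runs = total_instances * len(methods)
print(total_instances, total_runs)  # 20 120
```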
All health indicator quality metrics reported in Tables 1 to 7 and in all figures of Section 4 are computed exclusively on the held-out test specimens of each scenario instance. The training pool of each LOSO and cross-condition instance is strictly disjoint from the test specimen. The test-unit composite (TUC) metric defined in Equation (3) is also reported per scenario to address the concern that health indicator (HI) metrics aggregated across training and test units can mask a failure to generalise.
4.3. Main Results
The principal results are reported in Table 1. On the present seventeen-specimen open-hole carbon-fibre dataset, SAMS-Net attains the highest PHM Composite on every one of the six scenarios, giving a mean rank of one. Cross-site or cross-material transferability remains to be established (Section 5). Per-scenario margins against the strongest baseline range from 0.220 (S6) to 0.482 (S1). The smallest margin (S6) is approximately seven times the per-seed standard deviation of SAMS-Net at that scenario, and the paired t-statistic exceeds eight on every tested comparison (Table 7). The largest margins are attained on LOSO scenarios, where conventional regressors are most exposed to distribution shift. The smallest margin is attained on S6, where the strain and AE streams partially compensate for the absent fibre stream.
The per-scenario Spearman bar chart of Figure 5 confirms SAMS-Net rank one on every scenario, with the advantage most pronounced on LOSO and reduced but persistent on cross-condition and modality-dropout scenarios.
The qualitative trajectory comparison of Figure 6 shows that the SAMS-Net indicator is smooth and monotone end-to-end, with endpoints reliably converging to one, while the two strongest baselines zigzag in the early-life phase and scatter endpoints between 0.7 and 1.1. This is the operational reason SAMS-Net is preferred in maintenance settings even though it does not lead on per-window remaining-life MAE.
The secondary remaining-life MAE in Table 2 and Figure 7 shows that SAMS-Net ranks fifth or sixth on five of the six scenarios (third on S2). The trajectory-level monotonicity constraint is incompatible with arbitrary per-window value adjustment, so per-window RUL error is bounded below by the non-monotonicity of the underlying signal. The trade-off is discussed in Section 4.9.
4.4. Ablation Study
The ablation study is reported in Table 3 and visualised in Figure 8. Four variants are evaluated: A1 removes the smoothness-adaptive weighting (i.e., the weights are applied uniformly), A2 removes the two-level Pool-Adjacent-Violators projection (within-window and across-window both off), A3 removes the smoothness-stratified two-phase training (single-phase training on the full pool), and A4 removes the diffusion noise (i.e., the deterministic ODE limit). The ablation covers all six experimental scenarios.
Three findings emerge. First, the two-level PAV projection (A2) is the dominant contribution: removing it drops the PHM Composite by 0.33–0.46 across all six scenarios (mean drop 0.388), essentially the full margin over the strongest baseline. Second, the smoothness-stratified two-phase training (SSTP, A3) contributes a small lift of up to 0.05 PCS on five of the six scenarios and is marginally negative on S3. Third, the smoothness-adaptation (A1) and the stochastic diffusion (A4) do not contribute measurably, and both are reported as null findings rather than claimed as positive contributions. The neural differential-equation backbone functions operationally as a smooth-latent provider on which the projection is meaningful.
4.5. Control Experiment: Across-Window PAV Applied to the Strongest Baselines
To isolate the contribution of the across-window PAV projection from that of the upstream representation, the projection was applied post hoc at inference to the two strongest baselines (CNN-LSTM and Transformer-RUL). All other settings match Section 4 (five epochs, AdamW with the same learning rate and weight decay, batch size 192). Results are summarised in Table 4 and visualised in Figure 9.
The PAV projection lifts the baselines’ PHM Composite by 0.38 on average (CNN-LSTM 0.503 → 0.882, Transformer-RUL 0.505 → 0.894). Even with the projection, the strongest PAV-projected baseline mean (0.894) sits 0.003 below SAMS-Net (0.897). SAMS-Net wins on three of six scenarios (S2, S4, S5) and is matched within 0.046 on S1, S3, and S6. The across-window projection accounts for the majority of the SAMS-Net margin over unprojected baselines, and the remainder is attributable to the within-window training-time projection (which cannot be replicated by inference-time post-processing alone) and to the smooth latent of the neural differential-equation backbone.
4.6. Prognosability and Test-Unit Composite
The Prognosability metric of Equation (2) and the four-component test-unit composite (TUC) of Equation (3) are reported in Table 5. Figure 10 visualises the pattern: SAMS-Net attains the highest prognosability on every scenario by construction, since the across-window PAV clamps the trajectory endpoint for every test specimen. The PAV-projected baselines reach 0.89–0.99 but do not match this strict end-anchoring because the inference-time projection lacks the endpoint-anchor loss that drives SAMS-Net to one. On TUC, SAMS-Net is the highest on four of six scenarios and on the mean (0.92), with the strongest PAV-projected baseline within 0.01–0.03 on the remaining two.
4.7. Sensitivity to Hyperparameters
A sensitivity sweep is reported in Table 6 and visualised in Figure 11. Three hyperparameters are varied one at a time while the other two are held at their defaults. The PCS is averaged across three representative held-out scenarios.
PCS varies by at most 0.005 across the ten-point grid, and the default operating point is within 0.001 of the best observed value. The insensitivity is consistent with the ablation finding that the trajectory-level PAV projection absorbs backbone-level variation in the latent. A loss-weight sweep on the same three held-out scenarios (factor-of-two variations on the loss weights) produces PCS variations below 0.01, consistent with the broader insensitivity.
4.8. Statistical Significance via Multi-Seed Variance
A three-seed variance analysis is reported in Table 7. Three random seeds (7, 42, 123) are evaluated on three representative scenarios (S1_LOSO_018, S2_LOSO_022, S6_FMASK_026), and SAMS-Net is compared against the two strongest baselines (CNN-LSTM and Transformer-RUL). The reported t value is the mean paired difference in PCS divided by its standard error across the three seeds.
The smallest paired t-statistic is 8.7 (S1_LOSO_018 vs Transformer-RUL) and the largest is 32.2 (S2_LOSO_022 vs CNN-LSTM); under a two-sided paired t-test with two degrees of freedom, every tested comparison reaches p < 0.05. Given the three-seed sample, these values are best read as large standardised effect sizes rather than as small-sample tail probabilities. SAMS-Net’s three-seed standard deviation is 0–0.032 PCS, generally smaller than that of the baselines (0.007–0.077): the across-window projection produces near-identical monotone trajectories even when the backbone training varies between seeds, which supports the reproducibility needed for safety-critical SHM deployment.
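The paired statistic and its tail probability can be reproduced with the standard library alone. The sketch below uses the closed-form CDF of Student's t with two degrees of freedom, F(t) = 1/2 + t / (2 sqrt(t^2 + 2)), so no statistics package is needed. The function names are hypothetical, and the per-seed PCS values passed to `paired_t` are placeholders, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean paired difference divided by its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error


def two_sided_p_df2(t: float) -> float:
    """Two-sided p-value of Student's t with exactly two degrees of freedom."""
    # From F(t) = 1/2 + t / (2 * sqrt(t^2 + 2)): p = 1 - |t| / sqrt(t^2 + 2).
    return 1.0 - abs(t) / sqrt(t * t + 2.0)


# Placeholder per-seed PCS values for two methods (not taken from Table 7).
method_a = [0.90, 0.91, 0.89]
method_b = [0.50, 0.45, 0.52]
print(round(paired_t(method_a, method_b), 1))

# Tail probabilities for the smallest and largest t values quoted above.
print(round(two_sided_p_df2(8.7), 4))   # about 0.013
print(round(two_sided_p_df2(32.2), 4))  # about 0.001
```

With only three seeds the reference distribution has heavy tails, which is why a t-statistic of 8.7 corresponds to a p-value of about 0.013 rather than something far smaller.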
4.9. Trade-Off Between Trajectory Monotonicity and Per-Window Remaining-Life Precision
SHM deployment consumes the indicator shape, not the per-window remaining-life value: the maintenance module triggers inspection when the indicator crosses a threshold. A jagged indicator with lower per-window MAE is operationally invalid because it can cross the threshold repeatedly, so the decision is subject to spurious triggering. A smooth monotone indicator with slightly larger MAE is operationally valid. This trade-off aligns with [1], who report that downstream maintenance utility correlates with PHM Composite and is largely insensitive to per-window error magnitude. SAMS-Net’s per-window RUL error is within about 11% of the best baseline on the two leave-one-specimen-out scenarios and larger on the cross-condition and modality-dropout scenarios, which is acceptable given the preserved indicator shape.
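The spurious-triggering argument can be illustrated with a toy threshold check. The indicator values and the threshold of 0.8 below are made up for illustration, and `upward_crossings` is a hypothetical helper rather than part of any maintenance module.

```python
from typing import Sequence


def upward_crossings(indicator: Sequence[float], threshold: float) -> int:
    """Count moves from below the threshold to at or above it."""
    return sum(
        1
        for previous, current in zip(indicator, indicator[1:])
        if previous < threshold <= current
    )


# A jagged indicator and a non-decreasing counterpart (illustrative values).
jagged = [0.20, 0.50, 0.82, 0.74, 0.85, 0.78, 0.90, 1.00]
monotone = [0.20, 0.50, 0.78, 0.78, 0.815, 0.815, 0.90, 1.00]

print(upward_crossings(jagged, 0.8))    # 3
print(upward_crossings(monotone, 0.8))  # 1
```

A non-decreasing sequence can cross a fixed threshold upward at most once, so the trigger time is unambiguous, whereas the jagged sequence raises and withdraws the alarm several times.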