3.1. Distinguishability of Residual Signals Under Different Operating Conditions
To quantitatively verify the above hypothesis, we compute the Jacobian matrix
along a standard
zigzag maneuver using central differences, and perform sampling analysis at multiple operating points. To avoid contamination from near-straight-motion segments, only samples with effective rudder excitation (
) are retained for statistical evaluation. The results are presented in
Figure 7.
This analysis is included to support the physical basis of the disturbance-specific expert decomposition. If different disturbance sources generate distinguishable residual directions, then assigning separate expert networks to different disturbance channels is justified.
Figure 7a shows that all three singular values of the Jacobian matrix remain nonzero during the active steering phases, indicating that the disturbance directions maintain nontrivial projections in the residual space.
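The Jacobian computation and singular-value check described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: `toy_residual`, its coefficients, the operating point `theta0`, and the excitation threshold `DELTA_MIN` are all hypothetical stand-ins, since the actual residual model and threshold are not reproduced here.

```python
import numpy as np

def toy_residual(theta, delta, u=7.0, v=0.3, r=0.01):
    """Hypothetical residual [surge, sway, yaw] for intensities theta = (SW, HF, RD).

    Stand-in for (disturbed model - nominal model); all coefficients are made up.
    """
    sw, hf, rd = theta
    surge = -0.02 * sw * u**2 - 0.05 * hf * u**2
    sway = -0.30 * sw * u * v - 0.40 * rd * u**2 * np.sin(delta)
    yaw = -0.01 * sw * u * r - 0.20 * rd * u**2 * np.sin(delta)
    return np.array([surge, sway, yaw])

def central_jacobian(f, theta, h=1e-4):
    """Jacobian d f / d theta by central differences, one column per parameter."""
    theta = np.asarray(theta, dtype=float)
    columns = []
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        columns.append((f(theta + step) - f(theta - step)) / (2.0 * h))
    return np.stack(columns, axis=1)  # shape: (channels, parameters)

theta0 = np.array([0.5, 0.5, 0.5])            # hypothetical operating point
rudder = np.deg2rad([0.0, 5.0, 20.0, -20.0])  # sampled rudder angles [rad]
DELTA_MIN = np.deg2rad(2.0)                   # hypothetical excitation threshold

def singular_values_at(delta):
    """Singular values of the residual Jacobian at one rudder angle."""
    J = central_jacobian(lambda th: toy_residual(th, delta), theta0)
    return np.linalg.svd(J, compute_uv=False)

# Keep only samples with effective rudder excitation, as in the text.
excited = [d for d in rudder if abs(d) > DELTA_MIN]
sv_excited = np.array([singular_values_at(d) for d in excited])
sv_straight = singular_values_at(0.0)

print("min singular value, excited samples:", sv_excited.min())
print("min singular value, zero rudder:    ", sv_straight.min())
```

In this toy model the rudder-degradation column vanishes at zero rudder angle, so the smallest singular value collapses there. This is the kind of near-straight-motion contamination that the excitation filter is meant to exclude.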
Figure 7b further shows that the residual fingerprints of different disturbance sources are not fully collinear. The relatively higher similarity between SW and HF is consistent with the channel-sensitivity pattern in
Figure 7c: both disturbances mainly affect the
and
channels, while their influence on
is weak. However, HF is more concentrated in
, whereas SW affects both
and
, so the two directions are similar but not identical. The SW–RD similarity remains at an intermediate level because both affect
, whereas RD shows a stronger relative contribution to
.
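The notion of residual fingerprints that are similar but not collinear can be made concrete with a pairwise cosine similarity between Jacobian columns. The sketch below assumes that absolute cosine similarity is the similarity measure and uses a made-up matrix `J`; neither is taken from the results in Figure 7.

```python
import numpy as np

# Hypothetical averaged Jacobian: rows = residual channels, columns = SW, HF, RD.
# The numbers are illustrative only and are not taken from Figure 7.
J = np.array([
    [-1.00, -2.40,  0.00],
    [-0.60, -0.10, -0.70],
    [-0.05,  0.00, -0.90],
])
labels = ["SW", "HF", "RD"]

def fingerprint_similarity(jacobian):
    """Absolute cosine similarity between the columns of the Jacobian."""
    unit = jacobian / np.linalg.norm(jacobian, axis=0, keepdims=True)
    return np.abs(unit.T @ unit)

S = fingerprint_similarity(J)
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]}-{labels[j]}: {S[i, j]:.3f}")
```

A value of 1 would mean two disturbance directions are collinear and therefore indistinguishable from the residual alone; values below 1 leave room for separate experts to be identified.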
Figure 7c provides an averaged view of the Jacobian structure over the trajectory. A clear channel-wise sensitivity pattern can be observed: hull fouling produces the strongest response in the surge residual
; shallow water affects both
and
; and rudder degradation mainly affects
and
. These results indicate that the residual signals contain sufficient discriminative information for disturbance classification and intensity estimation, thereby supporting the subsequent disturbance-specific expert decomposition and CNN-SE-BiLSTM encoder design.
3.3. Closed-Loop Prediction Performance
The closed-loop validation adopts a two-stage evaluation protocol consisting of observation warm-up and strict closed-loop autoregressive prediction. First, during the initial (or ), the encoder is driven by the ground-truth observation sequence, and the disturbance-intensity estimate is obtained from the residual features over the most recent time windows, so as to mitigate the accumulation of errors caused by unstable encoder estimates in the initial phase.
The system is then switched to the strict closed-loop prediction mode, in which the vessel states are propagated entirely by the model itself without reference to any ground-truth trajectory information. Specifically, the current estimate is used to activate the corresponding expert network, which generates the correction term . This correction is added to the nominal MMG model output, and the subsequent vessel trajectory over (or ) is propagated using an RK4 integrator. During this phase, the encoder input is no longer constructed from real observations; instead, the residual sequence and the corresponding time windows are reconstructed from the model’s own closed-loop predicted trajectory and are then fed back into the encoder to update .
Therefore, the entire prediction stage constitutes a strict autoregressive closed-loop rollout, in which state propagation, residual construction, and disturbance estimation are all recursively driven by the model’s own previous predictions. To account for potentially time-varying disturbances, is updated every based on the latest time window.
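The two-stage protocol above can be illustrated on a scalar toy system. This is only a sketch under strong assumptions: the nominal model, the linear "expert", the least-squares "encoder", and every constant below are hypothetical placeholders for the MMG model, the expert networks, and the CNN-SE-BiLSTM encoder.

```python
import math
from collections import deque

A, DT = 0.5, 0.01            # hypothetical nominal coefficient and step size [s]
THETA_TRUE = 0.3             # hypothetical disturbance intensity
WARMUP, HORIZON = 200, 300   # steps of observation warm-up and closed-loop rollout
WINDOW, UPDATE_EVERY = 100, 50

def f_nominal(x, u):
    return -A * x + u

def f_true(x, u):
    return f_nominal(x, u) - THETA_TRUE * x

def f_model(x, u, theta_hat):
    # Nominal model plus the expert correction term (toy expert: -theta_hat * x).
    return f_nominal(x, u) - theta_hat * x

def rk4_step(f, x, u, dt, *args):
    k1 = f(x, u, *args)
    k2 = f(x + 0.5 * dt * k1, u, *args)
    k3 = f(x + 0.5 * dt * k2, u, *args)
    k4 = f(x + dt * k3, u, *args)
    return x + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def residual(x_now, x_next, u):
    # Finite-difference state derivative minus the nominal prediction.
    return (x_next - x_now) / DT - f_nominal(x_now, u)

def encoder(window):
    # Toy "encoder": least-squares fit of residual = -theta * x over the window.
    num = sum(r * x for x, r in window)
    den = sum(x * x for x, _ in window)
    return -num / den

controls = [1.0 + 0.5 * math.sin(0.02 * k) for k in range(WARMUP + HORIZON)]

# Ground-truth trajectory, used for warm-up observations and final scoring only.
x_true = [2.0]
for u in controls:
    x_true.append(rk4_step(f_true, x_true[-1], u, DT))

# Stage 1: observation warm-up, residual windows built from real observations.
window = deque(maxlen=WINDOW)
for k in range(WARMUP):
    window.append((x_true[k], residual(x_true[k], x_true[k + 1], controls[k])))
theta_hat = encoder(window)

# Stage 2: strict closed-loop rollout, windows rebuilt from the model's own states.
x_pred = [x_true[WARMUP]]
for k in range(HORIZON):
    u = controls[WARMUP + k]
    x_next = rk4_step(f_model, x_pred[-1], u, DT, theta_hat)
    window.append((x_pred[-1], residual(x_pred[-1], x_next, u)))
    x_pred.append(x_next)
    if (k + 1) % UPDATE_EVERY == 0:
        theta_hat = encoder(window)  # periodic re-estimation from predicted data

final_error = abs(x_pred[-1] - x_true[-1])
print(f"theta_hat = {theta_hat:.4f}, final state error = {final_error:.5f}")
```

Only the control sequence is shared with the ground truth during stage 2; the states, residual windows, and intensity estimate are all driven by the model's own previous predictions.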
All comparative methods share the same ground-truth control input sequence and identical initial observation conditions. The evaluation metrics include position RMSE (m), heading RMSE (°), terminal position error (m), and the normalized root-mean-square errors of surge velocity, sway velocity, and yaw rate .
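For reference, the evaluation metrics listed above can be computed as in the following sketch. Two points are assumptions rather than details given in the text: heading errors are wrapped to [-180°, 180°) before averaging, and the normalized RMSE is normalized by the range of the ground-truth signal. The function names are illustrative.

```python
import math

def position_rmse(pred_xy, true_xy):
    """RMSE of the Euclidean position error over a trajectory [m]."""
    sq = [(px - tx) ** 2 + (py - ty) ** 2
          for (px, py), (tx, ty) in zip(pred_xy, true_xy)]
    return math.sqrt(sum(sq) / len(sq))

def wrap_deg(angle):
    """Wrap an angle difference to [-180, 180) degrees (assumed convention)."""
    return (angle + 180.0) % 360.0 - 180.0

def heading_rmse_deg(pred_deg, true_deg):
    """RMSE of the wrapped heading error [deg]."""
    sq = [wrap_deg(p - t) ** 2 for p, t in zip(pred_deg, true_deg)]
    return math.sqrt(sum(sq) / len(sq))

def terminal_position_error(pred_xy, true_xy):
    """Euclidean distance between the final predicted and true positions [m]."""
    (px, py), (tx, ty) = pred_xy[-1], true_xy[-1]
    return math.hypot(px - tx, py - ty)

def nrmse(pred, true):
    """RMSE divided by the ground-truth range (assumed normalization)."""
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))
    return rmse / (max(true) - min(true))

# Small usage example with made-up trajectories.
true_xy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred_xy = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(position_rmse(pred_xy, true_xy), terminal_position_error(pred_xy, true_xy))
print(heading_rmse_deg([1.0], [359.0]))  # wrapped error, not 358 degrees
```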
Three representative test scenarios are considered, with the trajectory and state-prediction results presented in
Figure 9,
Figure 10 and
Figure 11, and the corresponding averaged performance metrics summarized in
Table 6,
Table 7 and
Table 8. The first is a random steering scenario under a single severe shallow-water disturbance, corresponding to
Figure 9 and
Table 6, and is used to examine the upper limit of the model’s compensation capability under a strong single-source disturbance. The second is a random steering scenario under combined disturbances (SW + HF + RD), corresponding to
Figure 10 and
Table 7, and is used to evaluate the model’s ability to disentangle and compensate for coupled multi-source disturbances. The third is a standard zigzag
maneuver under a single severe shallow-water condition, corresponding to
Figure 11 and
Table 8, and is used to assess the applicability and stability of the model under a typical maneuvering scenario.
Table 9 summarizes the detailed configurations of the baseline and ablation models, including their architectures, hidden-layer settings, parameter counts, input-output formats, and training trajectories. These settings provide the basis for the subsequent closed-loop performance comparisons.
The comparison with UKF should mainly be understood as a comparison against a model-based reference baseline, rather than a direct competition with a practically deployable final engineering solution. The purpose of introducing UKF in this work is to show that, even when a high-fidelity explicit state equation and a classical recursive filtering framework are available, the estimation of can still be affected by observational ambiguity; therefore, the true physical parameters do not naturally correspond to the optimal disturbance-compensation coordinates. It should also be noted that UKF benefits from stronger modeling priors in this comparison, including a high-accuracy explicit state-space model and a recursive residual-correction mechanism. Even under such conditions, the proposed method still outperforms UKF in the single shallow-water random steering scenario and the triple-source disturbance random steering scenario, while achieving comparable performance in the single shallow-water zigzag maneuver. These results indicate that the proposed method can still realize stable and effective disturbance compensation without relying on an explicit filtering framework or strong model priors.
The comparison with BiLSTM and MLP baselines is mainly intended to highlight the essential difference between the proposed physically structured closed-loop framework and purely data-driven methods in complex disturbance-compensation tasks. First, the overall poor performance of MLP under various mismatch conditions is consistent with its methodological characteristics: since this type of model is essentially closer to a one-step mapping and lacks sufficient capability to capture temporal memory and contextual evolution of disturbances, it is difficult for it to maintain stable predictions in scenarios with strong coupling, time-varying effects, and significant distribution mismatch; this is also reflected during training, where its validation loss shows pronounced fluctuations.
In contrast, BiLSTM achieves high accuracy at the one-step residual prediction level, and its major performance degradation appears only during multi-step autoregressive closed-loop rollout. To determine whether the failure of the BiLSTM baseline stems from insufficient one-step residual fitting or from distribution shift during autoregressive rollout, we evaluated the same trained BiLSTM checkpoint under five random steering seeds and two representative scenarios, namely the single severe shallow-water scenario and the triple-source disturbance scenario, using three settings: teacher-forced one-step residual prediction, GT-window closed-loop evaluation, and fully autoregressive closed-loop rollout. The results show that the model is highly accurate in one-step residual prediction. For example, in a separate 180 s diagnostic rollout under the single severe shallow-water scenario, the correlation coefficients for reach 0.911/0.989/0.983, with RMSE values of only , , and , respectively. When a ground-truth window is still provided at each step as context, the closed-loop position RMSE remains m. However, under fully autoregressive residual rollout, the position RMSE increases to m and the heading RMSE rises to . The corresponding mean per-seed degradation ratios are 7.70× for position RMSE and 10.35× for heading RMSE, respectively. A similar mean per-seed position-RMSE deterioration of 8.08× is also observed in the triple-source disturbance scenario.
This difference arises from the feedback mechanism used to construct the prediction context. In the ground-truth-window setting, the residual window remains anchored to the true trajectory, which keeps the input context close to the training distribution and suppresses the propagation of local residual errors. In the fully autoregressive setting, by contrast, both the vessel states and residual windows are reconstructed from the model’s own previous predictions. Small residual-amplitude errors, phase shifts, or low-frequency biases can therefore be recursively fed back into subsequent integration steps and accumulated into much larger ship-position errors. Since all three evaluations use exactly the same network weights, these results indicate that the main limitation of BiLSTM does not lie in the learnability of the one-step residual itself, but rather in the distribution shift and self-amplifying error accumulation induced by autoregressive residual rollout.
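To make the "mean per-seed degradation ratio" quoted above unambiguous, the sketch below contrasts it with the ratio of seed-averaged errors. The RMSE values are hypothetical and serve only to show that the two aggregates generally differ; they are not results from this work.

```python
from statistics import mean

# Hypothetical per-seed position RMSE values [m]; NOT results from this work.
rmse_gt_window = [10.0, 20.0, 40.0]         # closed loop with ground-truth windows
rmse_autoregressive = [80.0, 100.0, 400.0]  # fully autoregressive rollout

# Degradation ratio computed seed by seed, then averaged.
per_seed_ratio = [ar / gt for ar, gt in zip(rmse_autoregressive, rmse_gt_window)]
mean_per_seed_ratio = mean(per_seed_ratio)

# Alternative aggregate: ratio of the seed-averaged errors.
ratio_of_means = mean(rmse_autoregressive) / mean(rmse_gt_window)

print(f"mean per-seed ratio: {mean_per_seed_ratio:.2f}x")
print(f"ratio of means:      {ratio_of_means:.2f}x")
```

The per-seed form weights every seed equally, whereas the ratio of means is dominated by the seeds with the largest absolute errors.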
The comparison with ConcatMLP (True ) and Oracle is mainly intended to examine the actual role of in residual compensation and to further investigate whether the true is naturally identical to the optimal coordinate for closed-loop compensation.
First, it should be noted that ConcatMLP (True ) is directly provided with the ground-truth as input, and therefore it is naturally expected to achieve the smallest error; this in itself can also be regarded as a direct validation of the effectiveness of . From the results, ConcatMLP (True ) generally outperforms Oracle , although the gap is not large. This suggests that residual compensation is influenced not only by the true disturbance parameter itself, but also by additional effective information arising from multi-channel dynamic coupling, modeling errors, and compensation interactions.
It should also be emphasized that the training objective of the proposed encoder is not to reconstruct the true with high fidelity. Instead, the loss function is jointly defined by the normalized expert RMSE and the residual term. Therefore, the encoder output is not a pointwise restoration of the true physical parameter, but rather an optimal equivalent coordinate that is more beneficial to the current expert system after jointly accounting for multi-channel residual superposition, interaction effects, and the approximation error of the expert itself. In other words, the that yields the best closed-loop performance does not necessarily coincide with the true . This also explains why Oracle can even perform worse than the proposed method in the single shallow-water scenario: although the true parameter provides the physically correct reference, it does not necessarily correspond directly to the most effective representation for closed-loop compensation, whereas the learned is more closely aligned with the coordinate representation that is optimal for closed-loop prediction and compensation performance.
3.4. Closed-Loop Validation Under the Four-Source Disturbance Scenario and Wind-Expert Ablation
The wind-expert ablation further shows that the role of the wind expert is not merely to reduce instantaneous error but, more importantly, to provide much more stable compensation for external wind disturbances during the early stage of closed-loop prediction. As shown in Figure 12, with the wind expert enabled, both the BF5 and BF10 cases remain substantially more stable immediately after the observation phase, and the predicted trajectory and the evolutions of u, v, and r follow the ground truth more closely. In contrast, without the wind expert, the model deviates much earlier once autonomous rollout begins, and the error continues to accumulate over time.
The different trajectory trend observed in the BF5 case without the wind expert is mainly caused by the absence of explicit wind-load compensation. In the early stage after the prediction starts, the yaw-rate response still follows a trend similar to the ground truth. However, the surge-velocity, sway-velocity, and yaw-rate channels are not independently sufficient to determine long-horizon trajectory accuracy at each instant; even moderate residual errors in these channels can change the integrated heading and position over time. Therefore, the BF5 no-expert case gradually departs from the reference trajectory during the fully autoregressive rollout. With the wind expert included, the wind-induced residual forces and moments are partially compensated, so the predicted trajectory remains closer to the ground truth.
The quantitative results in
Table 10 confirm this trend. Under BF5, adding the wind expert reduces the position RMSE from 296.1 m to 64.2 m and the endpoint error from 763.0 m to 170.9 m. Under the stronger BF10 condition, the position RMSE is reduced from 586.8 m to 99.5 m, while the endpoint error decreases from 1389.9 m to 163.7 m. These correspond to reductions of 78.3% and 83.0% in position RMSE under BF5 and BF10, respectively.
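The relative reductions can be reproduced directly from the values quoted above. The short check below uses only the numbers stated in the text; the helper name `reduction_percent` is illustrative.

```python
def reduction_percent(baseline, improved):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - improved) / baseline

# (without wind expert, with wind expert), in metres, as quoted in the text.
quoted = {
    ("BF5", "position RMSE"): (296.1, 64.2),
    ("BF5", "endpoint error"): (763.0, 170.9),
    ("BF10", "position RMSE"): (586.8, 99.5),
    ("BF10", "endpoint error"): (1389.9, 163.7),
}

for (case, metric), (baseline, improved) in quoted.items():
    print(f"{case} {metric}: {reduction_percent(baseline, improved):.1f}% reduction")
```

The position-RMSE entries evaluate to 78.3% and 83.0%, matching the figures stated above; the same formula applied to the endpoint errors gives 77.6% (BF5) and 88.2% (BF10).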
It should also be noted that, even with the wind expert, the predicted trajectory under BF10 still exhibits a non-negligible deviation from the ground truth. This discrepancy is reasonable because BF10 represents a much stronger external wind-load condition, which introduces larger sway-force and yaw-moment residuals and makes the closed-loop rollout more sensitive to small compensation errors. During the long fully autoregressive prediction horizon, any remaining wind-load approximation error, phase mismatch, or velocity-state deviation can be recursively fed back and accumulated through numerical integration. Therefore, the wind expert substantially improves wind compensation and delays divergence, but it cannot completely eliminate long-horizon drift under strong wind forcing.
Taken together, the results in Figure 9, Figure 10, Figure 11 and Figure 12, together with Table 6, Table 7, Table 8 and Table 10, demonstrate stable long-horizon closed-loop autoregressive prediction across the main disturbance scenarios and wind-inclusive ablation cases. As summarized in
Table 11, the proposed framework achieves position-RMSE reductions of 74.7–91.7% relative to the corresponding nominal-MMG and wind-ablation baselines, confirming its overall effectiveness under complex disturbance conditions.