4.2. Evaluation Criteria
To evaluate the predictive performance of the proposed model, three metrics were selected: the coefficient of determination (R²), the mean absolute error (MAE), and the root mean square error (RMSE).
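First, R² is commonly used to assess how well a regression model fits the data. It is defined as follows:
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where $SS_{res}$ is the sum of squared residuals; $SS_{tot}$ is the total sum of squares; $\bar{y}$ is the mean of the true values; $\hat{y}_i$ is the value predicted by the model for the $i$-th sample; $y_i$ is the corresponding true value; and $n$ is the number of samples.
An R² value close to 1 indicates that the model fits the data well, meaning that more data points can be accurately predicted by the model. An R² value close to 0 shows that the model has not learned meaningful information from the dataset and essentially does not fit the data. A negative R² value indicates that the model performs worse than simply predicting the mean value, that is, it severely underfits the dataset.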
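Moreover, MAE represents the average of the absolute errors between the predicted and true values. It is commonly used to indicate the prediction accuracy of a model. It is expressed as follows:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
where $n$ is the number of samples, $y_i$ is the true value, and $\hat{y}_i$ is the predicted value of the $i$-th sample.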
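Finally, RMSE is employed to quantify the deviation between predicted and true values. Because the errors are squared before averaging, RMSE is highly sensitive to outliers in the data, which makes it well suited to exposing large deviations in forecasting. It is mathematically formulated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
with $n$, $y_i$, and $\hat{y}_i$ denoting the number of samples, the true values, and the predicted values, respectively.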
Both the MAE and RMSE indicate smaller errors when their values are closer to 0.
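For readers who wish to reproduce the three metrics, the following is a minimal NumPy sketch of the standard definitions. The function name `regression_metrics` and the dictionary keys are illustrative choices and do not come from the original study.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE for one set of predictions.

    Illustrative helper (not from the original study). Assumes y_true is
    not constant, otherwise the total sum of squares is zero and R^2 is
    undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # sum of squared residuals
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(residuals)),
        "rmse": np.sqrt(np.mean(residuals ** 2)),
    }
```

Under these definitions, predicting the mean of the true values for every sample gives an R² of 0, and any predictor with a larger squared error than that constant baseline gives a negative R².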
4.3. Ablation Experiment
To systematically assess the effectiveness of each core component within the proposed BO-CNN-LSTM model, this study conducted a series of ablation experiments. Specifically, six model structures were evaluated: the complete BO-CNN-LSTM, the standalone CNN, the standalone LSTM, the CNN with Bayesian Optimization only (BO-CNN), the LSTM with Bayesian Optimization only (BO-LSTM), and the basic CNN-LSTM combination without Bayesian Optimization. All models were evaluated under identical experimental conditions using the same five test subsets, and R², RMSE, and MAE were recorded. Percentage error distribution plots and fitting effect plots were also generated to provide a comprehensive comparison. The results are presented in Table 4, Table 5 and Table 6 and in Figure 8 and Figure 9.
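The ablation design can be read as a set of on/off switches for the three components (CNN, LSTM, and BO). The sketch below encodes the six structures in that form and loops over the test subsets. The names `Variant`, `VARIANTS`, and `run_ablation` are hypothetical, and `evaluate` is a placeholder callback standing in for training and scoring a model; none of these appear in the original study.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Variant:
    """One ablation structure, described by which components are enabled."""
    name: str
    use_cnn: bool
    use_lstm: bool
    use_bo: bool


# The six structures listed in the text, expressed as component switches.
VARIANTS: List[Variant] = [
    Variant("BO-CNN-LSTM", use_cnn=True, use_lstm=True, use_bo=True),
    Variant("CNN", use_cnn=True, use_lstm=False, use_bo=False),
    Variant("LSTM", use_cnn=False, use_lstm=True, use_bo=False),
    Variant("BO-CNN", use_cnn=True, use_lstm=False, use_bo=True),
    Variant("BO-LSTM", use_cnn=False, use_lstm=True, use_bo=True),
    Variant("CNN-LSTM", use_cnn=True, use_lstm=True, use_bo=False),
]


def run_ablation(
    evaluate: Callable[[Variant, object], Dict[str, float]],
    subsets: Sequence[object],
) -> Dict[str, List[Dict[str, float]]]:
    """Score every variant on the same test subsets.

    `evaluate(variant, subset)` is a hypothetical callback that builds the
    variant, predicts on the subset, and returns its metrics.
    """
    results: Dict[str, List[Dict[str, float]]] = {}
    for variant in VARIANTS:
        # Every variant sees the same subsets in the same order.
        results[variant.name] = [evaluate(variant, s) for s in subsets]
    return results
```

Keeping the subsets fixed across variants, as the text describes, means that differences in the recorded metrics can be attributed to the enabled components rather than to the data split.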
As shown in Table 4, Table 5 and Table 6, the BO-CNN-LSTM model consistently achieved the highest R² values across all five repeated experiments, reaching a maximum of 0.98792, with minimal fluctuation between runs. This indicates that its goodness-of-fit and stability are superior to those of the other ablation variants. In contrast, the CNN-LSTM model without BO had R² values ranging from 0.976 to 0.980, slightly lower than those of BO-CNN-LSTM, suggesting that the introduction of BO enhances the performance of the CNN-LSTM architecture.
Moreover, the BO-LSTM model exhibited a sharp drop in R² to 0.60536 in the fourth experiment, accompanied by a dramatic increase in RMSE and MAE, which reached 44.9242 and 25.3151, respectively, demonstrating considerable instability. This anomalous fluctuation suggests that, when only the LSTM is optimized with BO (without incorporating the CNN for feature extraction), the model becomes highly sensitive to the data distribution, making it prone to local optima and overfitting and thus compromising prediction stability.
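A drop of this kind can also be detected programmatically rather than by inspecting the tables. The following sketch flags runs whose R² falls well below the median of the repeated experiments. The function name `flag_unstable_runs` and the default threshold of 0.05 are assumptions made for illustration only; the original study does not define a formal stability criterion.

```python
import numpy as np


def flag_unstable_runs(r2_per_run, max_drop=0.05):
    """Return 1-based indices of runs whose R^2 is far below the median.

    `max_drop` is an arbitrary illustrative threshold, not a value taken
    from the original study.
    """
    r2 = np.asarray(r2_per_run, dtype=float)
    median = np.median(r2)
    # A run is flagged when it sits more than `max_drop` under the median.
    return [i + 1 for i, value in enumerate(r2) if median - value > max_drop]
```

The median is used as the reference instead of the mean so that a single collapsed run does not pull the baseline toward itself.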
In addition, the standalone CNN model generally yielded relatively low R² values (ranging from 0.931 to 0.970), while its RMSE and MAE were significantly higher than those of any model incorporating LSTM. This confirms the limitations of a purely convolutional architecture in temporal modeling tasks. Although the standalone LSTM model outperformed the CNN, its R² values remained lower than those of BO-CNN-LSTM, and its error metrics were also higher, indicating that a single recurrent structure struggles to fully leverage the local spatial features present in AIS data.
Finally, the BO-CNN model achieved R² values below 0.945 across all five experiments, with RMSE ranging between 16 and 18 and MAE lying in the interval [10, 11]. This performance is noticeably worse than that of both BO-LSTM and BO-CNN-LSTM. It further demonstrates that, in the absence of temporal modeling capability, the local features extracted by the CNN alone are insufficient for accurately predicting total voyage duration.
Moreover, Figure 8 presents the percentage error distributions of the models. As shown in Figure 8a, the BO-CNN-LSTM model exhibits the most concentrated error distribution, with a median error close to zero and very few outliers, indicating high consistency and reliability in its predictions.
As shown in Figure 8b, the CNN model exhibits a wide error distribution, with a considerable number of samples showing large positive and negative deviations and a noticeable bias toward overestimation. This behavior is closely related to the CNN's lack of temporal modeling capability, which limits its ability to capture temporal variations.
As shown in Figure 8c, the LSTM model shows an improved error distribution compared to the CNN. However, it still produces several high-deviation samples, suggesting that a single temporal modeling structure has limited adaptability when dealing with complex trajectory patterns.
As shown in Figure 8d, the BO-CNN model displays a clear bimodal error distribution. While some samples have errors concentrated near zero, others show significant deviation, indicating that the CNN model, optimized solely by BO, yields accurate predictions for some voyages but suffers from systematic bias in others.
As shown in Figure 8e, the BO-LSTM model exhibits an extreme error distribution in the fourth experiment, with a significantly expanded error range, further highlighting its instability under certain conditions.
Although the CNN-LSTM model shows an improved error distribution compared to the standalone CNN or LSTM (Figure 8f), its results remain relatively dispersed when compared to the BO-CNN-LSTM framework. This indicates that, in the absence of hyperparameter optimization, the straightforward combination of CNN and LSTM struggles to achieve optimal synergistic performance.
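The qualitative descriptions above (median close to zero, few outliers, bias toward overestimation) correspond to simple summary statistics of the percentage errors. The sketch below computes them under two assumptions that the text does not state explicitly: the percentage error is taken as 100 × (predicted − actual) / actual, so that positive values mean overestimation, and outliers are counted with the conventional 1.5 × IQR rule. The function name `percentage_error_summary` is illustrative.

```python
import numpy as np


def percentage_error_summary(y_true, y_pred):
    """Summarize the percentage error distribution of one model.

    Assumptions (not stated in the source): errors are signed, with
    positive values meaning overestimation, and outliers follow the
    1.5 * IQR rule. Actual values must be non-zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    pct_err = 100.0 * (y_pred - y_true) / y_true
    q1, median, q3 = np.percentile(pct_err, [25, 50, 75])
    iqr = q3 - q1

    # Points beyond the whisker fences are counted as outliers.
    is_outlier = (pct_err < q1 - 1.5 * iqr) | (pct_err > q3 + 1.5 * iqr)

    return {
        "median": float(median),
        "iqr": float(iqr),
        "n_outliers": int(is_outlier.sum()),
        "share_overestimated": float(np.mean(pct_err > 0)),
    }
```

A concentrated distribution shows up as a small IQR and a low outlier count, while a systematic bias shows up as a median away from zero or an overestimation share far from one half.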
Figure 9 illustrates the fit between predicted and actual values for each model. As shown in Figure 9a, the scatter points of the BO-CNN-LSTM model are densely concentrated along the ideal diagonal line, exhibiting a strong linear relationship with no obvious systematic bias across the entire voyage range. This indicates that the model has good generalization capability within the current dataset.
As shown in Figure 9b, the CNN model exhibits a pronounced “plateau effect” in the short-voyage range, where predicted values are clustered within a narrow lower band and fail to reflect the continuous variation in actual voyage duration. This behavior highlights the limitations of a purely convolutional structure in modeling temporal dynamics.
As shown in Figure 9c, although the LSTM model improves fitting performance in the short-voyage range, a degree of dispersion and bias remains in the medium- to long-voyage range.
As shown in Figure 9d, the BO-CNN model displays poor fitting performance, with widely scattered points and several significant outliers where predictions deviate substantially from actual values. This suggests that relying solely on CNN and BO is insufficient to capture temporal dependencies in voyage prediction tasks.
As shown in Figure 9e, the BO-LSTM model performs well in most experimental cases; however, a significant number of outliers appear in the fourth experiment, further confirming its instability under certain conditions.
As shown in Figure 9f, the CNN-LSTM model achieves improved fitting compared to the standalone CNN or LSTM models. Nevertheless, the scatter points remain relatively dispersed, with noticeable deviations in some samples. This observation indicates that, without the introduction of BO for hyperparameter tuning, the combined model still has room for improvement in both prediction accuracy and stability.
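The visual notion of points lying along the ideal diagonal can be complemented with a numerical check. One possible approach, sketched below, fits a least-squares line to the predicted-versus-actual scatter and reports its slope, intercept, and the Pearson correlation. This is an illustrative diagnostic, not part of the original analysis, and the function name `diagonal_fit` is hypothetical.

```python
import numpy as np


def diagonal_fit(y_true, y_pred):
    """Fit y_pred ~ slope * y_true + intercept and report the correlation.

    A slope near 1 and an intercept near 0 mean the scatter follows the
    ideal diagonal. A slope well below 1 is consistent with predictions
    compressed into a narrow band, such as a plateau.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # np.polyfit returns coefficients from the highest degree down.
    slope, intercept = np.polyfit(y_true, y_pred, deg=1)
    pearson_r = np.corrcoef(y_true, y_pred)[0, 1]

    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "pearson_r": float(pearson_r),
    }
```

Applying the same check separately to short, medium, and long voyages would localize range-specific bias of the kind described for the CNN and LSTM panels.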
4.4. Comparative Analysis of Model Performance
To further evaluate the stability and generalization capability of the BO-CNN-LSTM model, this study conducted repeated experiments using five different test subsets, each containing 480 voyage records. For each subset, the R², RMSE, and MAE metrics were recorded and compared. The results are shown in Table 7, Table 8 and Table 9, and the corresponding percentage error distributions and fitting effects are illustrated in Figure 10 and Figure 11.
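The text does not specify how the five test subsets were drawn. As one possible reading, the sketch below samples each subset of 480 record indices at random without replacement within the subset, using a fixed seed so that every model is scored on identical subsets. The function name `draw_test_subsets`, the seed, and the sampling scheme itself are assumptions.

```python
import numpy as np


def draw_test_subsets(n_records, n_subsets=5, subset_size=480, seed=0):
    """Draw index arrays for repeated evaluation.

    Assumed scheme (not confirmed by the source): each subset is sampled
    without replacement from the available records, and different
    subsets may overlap.
    """
    if subset_size > n_records:
        raise ValueError("subset_size cannot exceed n_records")

    rng = np.random.default_rng(seed)  # fixed seed keeps subsets reproducible
    return [
        rng.choice(n_records, size=subset_size, replace=False)
        for _ in range(n_subsets)
    ]
```

Reusing the returned index arrays for every compared model keeps the comparison paired, so differences in the per-subset metrics reflect the models rather than the sampling.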
As shown in Table 7, Table 8 and Table 9, the BO-CNN-LSTM model achieved the highest R² values across the five repeated experiments, reaching a maximum of 0.98792. At the same time, it produced the lowest RMSE and MAE values, demonstrating both stable and excellent predictive performance. In contrast, the AdaBoost model exhibited R² values consistently below 0.34, accompanied by significantly larger error metrics and substantial fluctuations, indicating poor suitability for this task. Moreover, the Bi-GRU and Elman models showed intermediate performance, but their R² values and error metrics varied notably across experiments, reflecting insufficient stability. Finally, the Random Forest (RF) model maintained a stable R² of approximately 0.97; however, its MAE and RMSE remained higher than those of the BO-CNN-LSTM model, indicating a gap in both goodness-of-fit and error minimization.
Figure 10 illustrates the percentage error distributions of the models across repeated experiments. The BO-CNN-LSTM model exhibited the most concentrated error distribution, with the smallest median error and minimal outliers, demonstrating strong prediction consistency and robustness. In contrast, AdaBoost showed a highly dispersed error distribution, with predictions generally underestimating the actual values. Some overestimations were also observed, and the errors were large overall, indicating extremely poor prediction accuracy. Although the Bi-GRU and Elman models performed reasonably in some experiments, their error distributions remained wide, with certain experiments showing substantial deviations, suggesting weak adaptability to data variations. Among the baseline models, the RF model had the most concentrated error distribution, but it was still more dispersed than that of BO-CNN-LSTM, indicating comparatively weaker robustness.
Figure 11 illustrates the relationship between predicted and actual values for each model. The scatter points of the BO-CNN-LSTM model were densely concentrated along the ideal diagonal line, indicating a strong linear relationship and excellent fitting performance under the current experimental conditions. In contrast, the other models exhibited varying degrees of systematic bias across different voyage ranges. For instance, the RF model tended to underestimate voyage durations in medium- to long-distance scenarios. Meanwhile, the Elman and Bi-GRU models displayed substantial fluctuations in short-distance voyage predictions, further confirming their limitations in modeling complex spatiotemporal characteristics.