4.2. Performance Metrics
We begin by examining confusion matrix results for the CNN and LSTM, averaged over 10 independent runs, for the four categories: 2010 West Coast, 2010 Gulf of Mexico, 2024 West Coast, and 2024 Gulf of Mexico.
Confusion-matrix conventions.
Recall that for each model the confusion matrix contains the following four values (we adopt the standard convention that the positive class denotes an anomaly):
$TP$ (true positive): the true label is an anomaly and is correctly identified by the model (anomaly → anomaly).
$FP$ (false positive): the true label is normal but is incorrectly identified as an anomaly (normal → anomaly). Such a misclassification is a Type I error (false alarm).
$TN$ (true negative): the true label is normal and is correctly identified as normal (normal → normal).
$FN$ (false negative): the true label is an anomaly but is incorrectly identified as normal (anomaly → normal). This is a Type II error (missed detection).
Finally, let $P = TP + FN$ and $N = FP + TN$, which represent the total number of true anomalies and true normals, respectively.
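To make these conventions concrete, the following minimal Python sketch computes the four counts and the totals $P$ and $N$ from hypothetical label arrays; the variable names and data are illustrative, not taken from the study's code.

```python
import numpy as np

# Hypothetical binary label arrays: 1 = anomaly (positive), 0 = normal (negative).
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # anomaly -> anomaly
fp = np.sum((y_true == 0) & (y_pred == 1))  # normal  -> anomaly (Type I error)
tn = np.sum((y_true == 0) & (y_pred == 0))  # normal  -> normal
fn = np.sum((y_true == 1) & (y_pred == 0))  # anomaly -> normal (Type II error)

p = tp + fn  # total number of true anomalies
n = fp + tn  # total number of true normals
print(f"TP={tp} FP={fp} TN={tn} FN={fn}  P={p} N={n}")
```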
The averaged confusion matrix results for the West Coast are displayed in
Figure 5 and summarized in
Table 4.
For the West Coast dataset, the per-class totals of true “+” and “−” labels for 2010 and 2024 are reported in Table 4. The total number of labels ($P + N$) increased from 8495 in 2010 to 14,331 in 2024, an increase of about 70%, driven by a large increase in the number of detection devices. The dataset nevertheless remained highly imbalanced: the proportion of anomalies ($P/(P + N)$) was about 5.8% in 2010 and increased to 7.3% in 2024.
The averaged confusion matrix results for the Gulf of Mexico are displayed in
Figure 6 and summarized in
Table 4.
For the Gulf of Mexico, the corresponding per-class totals for 2010 and 2024 are likewise reported in Table 4. The total number of labels ($P + N$) increased only slightly, from 11,838 to 11,960, about a 1% increase, in contrast to the roughly 70% increase for the West Coast. As with the West Coast dataset, the class distribution in this region remains highly imbalanced: the proportion of anomalies ($P/(P + N)$) was about 3.2% in 2010 and increased slightly to 3.7% in 2024, a much smaller change than on the West Coast.
The modest increase in the Gulf of Mexico anomaly rate, from approximately 3.2% to 3.7%, is smaller than the West Coast increase, but the 2010 baseline coincides with the timing and location of the Deepwater Horizon oil spill. This suggests a potential correlation between the spill and elevated anomalies, reflecting the influence of large-scale environmental disruptions on the behavior of ocean drifters.
The averaged confusion matrix results are summarized in
Table 4. As the table shows, no single model dominates across time periods and regions: the CNN is typically more accurate at identifying normal behavior and avoiding false alarms, whereas the LSTM is typically better at detecting anomalies. The takeaway is that the CNN is the more conservative model, rarely flagging normal sequences as anomalies, while the LSTM is the more sensitive one.
4.3. Performance Ratios
Next, the following performance ratios derived from the confusion matrices were examined:
Accuracy: Fraction of all labels (both positive and negative) that are predicted correctly, $(TP + TN)/(P + N)$. This metric may not be very useful for highly imbalanced datasets, as is the case in this investigation: a trivial classifier that assigns every point to the majority (negative) class attains high accuracy simply because of the imbalance.
Precision: Accuracy of positive label predictions, $TP/(TP + FP)$.
Recall: Sensitivity or true positive rate, $TP/(TP + FN) = TP/P$. This is the proportion of actual positive labels that are correctly identified by a model. This ratio is especially useful for imbalanced datasets, as is the case in this investigation.
$F_1$ Score: The harmonic mean of the precision and recall, $F_1 = 2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}/(\mathrm{Precision} + \mathrm{Recall})$. This metric represents both precision and recall symmetrically in a single number.
ROC AUC: Receiver Operating Characteristic Area Under the Curve, which measures the model’s ability to distinguish between positive and negative labels across all thresholds.
PR AUC: Precision-Recall Area Under the Curve, which is preferred for heavily imbalanced datasets when optimizing for the positive labels.
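The following is a minimal sketch of how these ratios can be computed with scikit-learn, assuming binary ground-truth labels and continuous anomaly scores; the arrays are hypothetical, and average_precision_score is used as a standard estimator of the PR AUC.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

# Hypothetical ground truth and anomaly scores in [0, 1].
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.2, 0.8, 0.1, 0.2])
y_pred = (y_score >= 0.5).astype(int)  # threshold the scores at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(P+N)
print("Precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("F1       :", f1_score(y_true, y_pred))
# The AUC metrics are threshold-free and use the raw scores.
print("ROC AUC  :", roc_auc_score(y_true, y_score))
print("PR AUC   :", average_precision_score(y_true, y_score))
```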
Figure 7 compares standard statistical metrics (accuracy, precision, recall, $F_1$-score, ROC AUC, and PR AUC) of the CNN and LSTM over 10 independent runs. The bar graphs show that across a large majority of runs, the LSTM achieves better metrics than the CNN. However, in the Gulf of Mexico in 2024, the situation is more complex, with the CNN performing better on some metrics and the LSTM on others.
The average metrics from each model during each time period for the West Coast and the Gulf of Mexico are shown in
Table 5.
The CNN and LSTM models show different performance patterns across regions and years. On the West Coast datasets, for both 2010 and 2024, the LSTM generally achieves higher precision, recall, $F_1$ score, and ROC AUC than the CNN, suggesting that it benefits from learning temporal dependencies in the drifter trajectories. For instance, in the 2024 West Coast data, the LSTM reached a mean $F_1$ score of 0.282 compared to 0.212 for the CNN.
In contrast, the performance differences between the two models on the Gulf of Mexico datasets are less consistent. In the 2010 Gulf data, both models perform similarly, with the LSTM showing a slight advantage in $F_1$ and ROC AUC. However, in the 2024 Gulf data, the CNN achieves higher accuracy and $F_1$ values than the LSTM, suggesting stronger generalization to more recent or spatially variable Gulf conditions. These patterns indicate that the relative strengths of the CNN and LSTM architectures depend on regional and temporal factors in the data: while the LSTM tends to perform better when temporal continuity is prominent, the CNN remains effective under more variable environmental conditions. Further analysis in
Section 5 explores these regional and temporal differences in greater depth.
4.5. Statistical Testing
To evaluate the statistical significance of the performance difference between the CNN and LSTM models, we performed a block-shuffled bootstrap analysis on the $F_1$-scores obtained over 10 experimental runs.
We resampled contiguous blocks of size 2 to partially preserve potential dependence between consecutive runs. Block bootstrap methods are appropriate when the independence assumption is suspect and are designed to retain short-range correlation structure in the resamples [
81]. Variants such as overlapping and circular block bootstraps are well studied and are appropriate under weak dependence. We used block shuffling as a pragmatic approach, given the small number of runs. We generated 10,000 bootstrap samples and focused on the mean difference in $F_1$-score between the CNN and LSTM (CNN–LSTM). Although bootstrapping can provide approximate sampling distributions without strict distributional assumptions, its theoretical justification typically relies on asymptotic arguments, and bootstrap confidence intervals may be unstable with very small original samples because the resampled distributions cannot fully reflect the underlying variability [
82]. We therefore emphasize caution: the limited number of runs (10) constrains the precision of the inference.
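The following minimal sketch illustrates one plausible reading of this procedure, assuming two hypothetical length-10 arrays of per-run $F_1$ scores; resampling non-overlapping contiguous blocks of size 2 with replacement is an assumption, since the exact shuffling scheme is not fully specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-run F1 scores for the two models (10 runs each).
f1_cnn = np.array([0.21, 0.20, 0.23, 0.19, 0.22, 0.21, 0.20, 0.22, 0.21, 0.23])
f1_lstm = np.array([0.28, 0.27, 0.30, 0.26, 0.29, 0.28, 0.27, 0.29, 0.28, 0.30])

diff = f1_cnn - f1_lstm                # paired per-run differences (CNN - LSTM)
block_size, n_boot = 2, 10_000
blocks = diff.reshape(-1, block_size)  # contiguous blocks of size 2

means = np.empty(n_boot)
for b in range(n_boot):
    # Resample whole blocks with replacement to retain short-range dependence.
    idx = rng.integers(0, blocks.shape[0], size=blocks.shape[0])
    means[b] = blocks[idx].mean()

lo, hi = np.percentile(means, [2.5, 97.5])
print(f"mean diff = {diff.mean():.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```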
The resulting 95% bootstrap confidence intervals are shown in
Figure 9 (West Coast and Gulf of Mexico, 2010 and 2024). Numerically, the bootstrap summaries for the mean $F_1$ difference (CNN–LSTM) and the 95% CI are given in
Table 6. The West Coast results are unambiguous: for both 2010 and 2024, the confidence intervals lie entirely below zero, indicating that the LSTM outperforms the CNN with the reported confidence. For the Gulf of Mexico, the 2010 interval crosses zero: although the point estimate favors the LSTM, the interval includes zero, so the difference is not statistically significant at the 95% level. By contrast, the 2024 Gulf of Mexico interval lies entirely above zero, indicating a statistically supported advantage for the CNN in that region and year.
The Gulf of Mexico shows a significant temporal shift in model performance. In 2010, despite the Deepwater Horizon oil spill, the mean difference between CNN and LSTM performance was similar to that in the West Coast regions, with LSTM outperforming CNN. This can be explained by the LSTM’s ability to capture long-range temporal dependencies in the drifter trajectories, which is particularly useful when the flow patterns are influenced by a strong, localized disturbance such as the oil spill. One can speculate that the spill may have introduced coherent temporal structures in surface currents that the LSTM could exploit, leading to more consistent performance across runs.
By 2024, however, the CNN outperforms the LSTM. This reversal likely reflects changes in the Gulf’s dynamics over time, including increased mesoscale variability [
83], altered Loop Current behavior [
84], and higher spatial heterogeneity [
85]. These changes reduce the temporal coherence that benefits the LSTM, while the CNN’s ability to detect short-term, localized patterns becomes more advantageous. Thus, the observed reversal in mean performance is not random but rather demonstrates an evolution in the data regime, where the CNN becomes better suited to capture the anomalies in the 2024 drifter trajectories.
The proportion of anomalies increased on the West Coast but remained largely stable in the Gulf of Mexico. From 2010 to 2024, the West Coast sample size increased by about 70% (8495 to 14,331) and the anomaly rate rose from 5.8% to 7.3%, while the Gulf sample size increased by only about 1% (11,838 to 11,960) with a small anomaly rate change (3.2% to 3.7%). The West Coast increase is consistent with expanded detection coverage and may also reflect heightened oceanographic variability. By contrast, the Gulf’s persistent mesoscale circulation likely contributed to its relatively stable anomaly proportion.
4.6. Hyperparameter Tuning
To evaluate the influence of hyperparameter values, consider the 2024 West Coast results. The baseline configuration uses a sliding window of length $w = 10$, an 80/20 train-test split, and 64 units per LSTM layer.
The models were re-run with the following hyperparameter changes:
A larger window size ($w = 20$);
A 50/50 train-test split (as opposed to 80/20);
A CNN with three layers and the larger window size ($w = 20$);
An LSTM with 32 units per layer and the larger window size ($w = 20$).
In addition, logistic regression [
12] was used as a baseline: this classifier requires little hyperparameter tuning and is often used as a benchmark in machine learning. A sketch of such a baseline is given below.
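The following minimal sketch shows one way to set up this baseline, assuming windowed sequences are flattened into fixed-length feature vectors before fitting; the synthetic data, class-weighting choice, and decision threshold are illustrative, not the study’s exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical data: 1000 windows of length 10 with 2 features each,
# with roughly 7% anomalies to mimic the class imbalance in the text.
X = rng.normal(size=(1000, 10, 2))
y = (rng.random(1000) < 0.07).astype(int)

X_flat = X.reshape(len(X), -1)     # flatten windows for the linear model
split = int(0.8 * len(X))          # 80/20 train-test split
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X_flat[:split], y[:split])

y_prob = clf.predict_proba(X_flat[split:])[:, 1]
y_pred = (y_prob >= 0.5).astype(int)
print("F1     :", f1_score(y[split:], y_pred, zero_division=0))
print("ROC AUC:", roc_auc_score(y[split:], y_prob))
```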
For each new choice of hyperparameters, the models were re-run 10 times. Average performance metrics were computed and summarized in
Table 7.
First, consider the train-test split. For the CNN model, the 50/50 split resulted in a slight increase in the mean $F_1$ score and ROC AUC, indicating that it can handle a larger test set without a significant loss in performance; the 80/20 split, however, yields higher overall accuracy, showing that more training data benefits the model’s ability to correctly classify the majority of examples. The LSTM model, on the other hand, experienced a decrease in overall accuracy and precision with the 50/50 split, indicating that it would benefit from additional training data. These results demonstrate the trade-offs inherent in choosing a train-test split. Overall, however, model performance with a 50/50 split is roughly comparable to that with the standard 80/20 split. Given the small differences in accuracy, $F_1$, and other metrics, the 80/20 split was adopted as the default. This choice ensures sufficient training data for learning while maintaining a reliable test set for evaluation, aligning with common practice for reproducibility and consistency.
Figure 10 plots validation loss against the percentage of training data, showing how the validation loss changes with different train-test splits. The CNN consistently achieves lower validation loss than the LSTM, indicating better generalization across all tested data fractions. For the CNN, validation loss decreases as the training fraction increases from 50 percent to 70 percent, suggesting improved learning with more data, before slightly increasing at 80 percent, possibly due to variance or mild overfitting. In contrast, the LSTM exhibits relatively stable validation loss across all splits, implying less sensitivity to the amount of training data. The error bars indicate higher variability for the CNN at intermediate splits, while the LSTM remains more consistent but at a higher loss level.
Next, the effect of using a sliding window of $w = 20$ instead of the baseline value of $w = 10$ was considered. Across the 10 runs, the CNN achieved an average accuracy of approximately 0.86, with an average $F_1$ score of 0.33. Precision and recall show greater variability, reflecting the challenge of detecting the minority class in this dataset. The ROC AUC and PR AUC metrics indicate moderate discriminative performance, consistent with the imbalanced nature of the anomaly detection task. Overall, the model demonstrates stable performance across different seeds with this sliding window configuration. Comparing the CNN results obtained with a sliding window of 20 to the baseline CNN results for the 2024 West Coast in
Table 7, one observes that the performance metrics are not substantially different. The per-run accuracy, $F_1$ score, and AUC values with the 20-window sequences are broadly comparable to those reported for the standard configuration, indicating that increasing the window length does not significantly alter overall model performance. This consistency across different input sequence lengths justifies retaining the window size of 10 in the evaluation procedure, providing a reliable estimate of model performance without additional computational cost.
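For illustration, the following numpy sketch shows one common way to build such sliding-window inputs, assuming a per-drifter feature series with per-step labels; labeling each window by its final time step is an assumption made here for concreteness.

```python
import numpy as np

def make_windows(series, labels, w=10):
    """Slice a (T, n_features) series into overlapping windows of length w.

    Each window is labeled by its final time step (an illustrative choice;
    other conventions, e.g., any-anomaly-in-window, are equally plausible).
    """
    X = np.stack([series[i:i + w] for i in range(len(series) - w + 1)])
    y = labels[w - 1:]
    return X, y

T, n_features = 200, 2
series = np.random.default_rng(1).normal(size=(T, n_features))
labels = np.zeros(T, dtype=int)

X10, y10 = make_windows(series, labels, w=10)  # baseline window
X20, y20 = make_windows(series, labels, w=20)  # larger window
print(X10.shape, X20.shape)  # (191, 10, 2) (181, 20, 2)
```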
Next, consider the mean performance metrics for the LSTM model with a sliding window of 20 and the 3-layer Deep CNN, respectively. For the LSTM model with a window of $w = 20$, the average accuracy across 10 runs is approximately 0.851, with an average $F_1$ score of 0.338. Precision shows moderate variability, while recall is relatively high, indicating that the model captures a substantial portion of the positive (anomalous) events. The ROC AUC and PR AUC values suggest moderate discriminative performance.
The 3-layer Deep CNN achieves higher overall accuracy, 0.922 on average, but a slightly lower $F_1$ score compared to the LSTM. Precision and recall are more balanced, and the ROC AUC and PR AUC indicate reliable discrimination of positive events. Overall, both models demonstrate stable performance across runs, with the LSTM better at capturing anomalies and the deep CNN achieving higher overall classification accuracy.
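As a rough illustration of this variant, the following Keras sketch builds a three-convolutional-layer 1-D CNN; the filter counts, kernel sizes, pooling, and output head are assumptions, not the paper’s exact architecture.

```python
from tensorflow.keras import layers, models

def build_deep_cnn(window=20, n_features=2):
    # Three stacked Conv1D blocks; widths and kernel sizes are illustrative.
    return models.Sequential([
        layers.Input(shape=(window, n_features)),
        layers.Conv1D(32, 3, activation="relu", padding="same"),
        layers.Conv1D(64, 3, activation="relu", padding="same"),
        layers.Conv1D(64, 3, activation="relu", padding="same"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),  # binary anomaly probability
    ])

model = build_deep_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```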
The 2024 West Coast CNN and LSTM models using a sliding window of 20 show performance broadly consistent with the overall confusion matrix results reported in
Table 4. For the CNN, the 20-window configuration achieved a mean accuracy of 0.857 and an $F_1$ score of 0.322, which aligns with the higher true positive and true negative counts observed in the 2024 average confusion matrix. The minor differences in performance metrics reflect the variation introduced by the sliding window input sequences, but the overall classification behavior remains similar. Similarly, the LSTM with window $w = 20$ shows a mean accuracy of 0.851 and an $F_1$ score of 0.338, consistent with the higher true positive and true negative counts for the 2024 West Coast in
Table 4. The sliding window approach does not significantly alter the overall model behavior, confirming that the standard evaluation using 10 runs provides a reliable estimate of model performance.
To evaluate the effect of reducing the number of LSTM units, a model with 32 units per layer was trained using the same input data and hyperparameters as the standard LSTM-64 model. Compared to the standard LSTM-64 configuration (mean accuracy 0.851, $F_1$ score 0.338), the LSTM-32 model shows slightly lower accuracy and $F_1$ score, with modest reductions in recall and ROC AUC. Precision remains comparable, suggesting that reducing the number of units has minimal impact on correctly identifying positive events but slightly reduces the model’s overall capacity to capture all anomalies.
Overall, the LSTM-32 model performs reasonably well, providing a lighter alternative to LSTM-64 that trades a small reduction in performance for a decrease in model complexity and computational cost. This suggests that the standard LSTM-64 configuration is preferable when the goal is to maximize anomaly detection performance, while LSTM-32 may be more suitable for faster training or deployment in resource-constrained environments.
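A hedged Keras sketch of this comparison is given below; the two-layer stacked topology and the sigmoid head are assumptions, with only the per-layer unit counts (64 vs. 32) taken from the text.

```python
from tensorflow.keras import layers, models

def build_lstm(units=64, window=10, n_features=2):
    # Stacked two-layer LSTM; depth and output head are illustrative choices.
    return models.Sequential([
        layers.Input(shape=(window, n_features)),
        layers.LSTM(units, return_sequences=True),
        layers.LSTM(units),
        layers.Dense(1, activation="sigmoid"),
    ])

lstm64 = build_lstm(units=64)  # standard configuration
lstm32 = build_lstm(units=32)  # lighter variant with reduced capacity
print(lstm64.count_params(), lstm32.count_params())
```

Comparing the parameter counts makes the complexity trade-off concrete: halving the units roughly quarters the recurrent weight matrices, which is the source of the reduced training and inference cost.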
Finally, a baseline logistic regression model was also considered in comparison to the CNN and LSTM models. The average performance of a simple linear model, with values averaged over 10 runs, is presented in
Table 7.
Logistic regression achieved an average accuracy of 0.888, precision of 0.207, recall of 0.396, $F_1$ score of 0.267, ROC AUC of 0.793, and PR AUC of 0.287. While it has relatively high recall, it is limited in its ability to capture temporal and nonlinear patterns in the drifter trajectories. The CNN and LSTM models exploit the sequential information within sliding windows, with CNNs capturing local patterns and LSTMs modeling long-range dependencies. In our results, the CNN achieved an accuracy of 0.896, a precision of 0.190, a recall of 0.270, an $F_1$ of 0.212, a ROC AUC of 0.662, and a PR AUC of 0.164, performing slightly worse than logistic regression in terms of $F_1$ score and ROC AUC. This is likely due to the limited window size ($w = 10$) and the relatively small number of sequential features, which reduce the CNN’s ability to capture local patterns. In contrast, the LSTM achieved the best overall performance, with an accuracy of 0.911, a precision of 0.252, a recall of 0.341, an $F_1$ score of 0.282, a ROC AUC of 0.806, and a PR AUC of 0.236, demonstrating its ability to effectively model long-range temporal dependencies. While CNNs outperform LSTMs in other regions, such as the 2024 Gulf of Mexico data, for the 2024 West Coast trajectories the LSTM is better suited for high-quality anomaly detection. Logistic regression remains a fast, interpretable baseline or initial filter.
The above results suggest that the choice of the CNN and LSTM with the baseline hyperparameter values is appropriate.