This section presents the training, validation, and blind testing results for seven CNN architectures: VGG16, ResNet50, InceptionV3, DenseNet121, InceptionResNetV2, EfficientNetB0, and a lightweight custom CNN. Performance metrics and confusion matrices are used to assess each model’s ability to classify underwater images into three classes: gas leak, liquid leak, and non-leak.
4.1. Dataset Summary
The datasets used in this study were constructed from experimentally generated leak imagery, with an additional scenario including synthetic data. Two configurations were considered. In the first, the dataset comprised 252 images, balanced across the three classes, with 210 images used for training and 42 for validation. In the second configuration, the same experimentally acquired images were retained and complemented with synthetic samples, resulting in a dataset of 888 images, evenly balanced across the three classes (non-leak, liquid leak, and gas leak), with 708 images used for training and 180 for validation. By subtracting the experimental portion from this combined set, it is possible to isolate the synthetic contribution, which accounts for 636 images, divided into 498 training samples (166 per class) and 138 validation samples (46 per class).
Table 2 summarizes the dataset for each case.
In both cases, an additional blind test set of 30 images (10 per class) was held out and never used during training or validation, serving exclusively for the final evaluation of generalization performance. This design ensured that differences in performance could be directly attributed to the contribution of synthetic data, while preserving equal class representation and a separate unseen subset for robust assessment.
The synthetic data scenario was designed to extend the experimental dataset rather than replace it. Specifically, the same experimentally acquired images were used in both configurations, with synthetic images added only in the second case. This scenario is part of the four training configurations considered in this study (no synthetic/no augmentation, synthetic only, augmentation only, and synthetic + augmentation), which together provide a controlled basis for assessing the relative contribution of each strategy to model generalization.
It is important to note that data augmentation does not create a new dataset, but rather applies transformations (e.g., rotation, shift, brightness jitter) to the original training images during the learning process. Thus, unlike the synthetic data configurations, the number of training samples remains unchanged; only their variability as seen by the model is increased. This approach is consistent with standard practices in computer vision, where augmentation has long been used to improve generalization under limited data conditions [62,63].
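For illustration, a minimal sketch of such an on-the-fly augmentation pipeline is given below. It assumes a TensorFlow/Keras training setup; the transformation ranges, directory path, and image size are illustrative assumptions rather than the exact settings used in this study.

```python
# Sketch of on-the-fly augmentation (rotation, shift, brightness jitter).
# Parameter values and paths are illustrative assumptions, not the study's settings.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,            # normalize pixel intensities
    rotation_range=15,            # random rotation (degrees)
    width_shift_range=0.1,        # horizontal shift (fraction of width)
    height_shift_range=0.1,       # vertical shift (fraction of height)
    brightness_range=(0.8, 1.2),  # brightness jitter
)

# Transformations are applied per epoch; the number of training images on disk
# is unchanged, only the variability seen by the model increases.
train_flow = train_datagen.flow_from_directory(
    "data/train",                 # hypothetical directory with one folder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",     # three classes: gas leak, liquid leak, non-leak
)
```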
4.2. Performance Comparison of CNN Architectures
To assess the impact of synthetic data and augmentation, four experimental training configurations were evaluated:
Without synthetic data, without augmentation (Configuration 1).
With synthetic data, without augmentation (Configuration 2).
Without synthetic data, with augmentation (Configuration 3).
With synthetic data and augmentation (Configuration 4).
Table 3 reports the validation performance across all training configurations (best checkpoint based on the highest validation accuracy).
Figure 7 and Figure 8 show validation accuracy and loss curves across epochs for all backbones, grouped by training configuration.
Appendix A provides full Training and Validation Accuracy and Loss together with the blind test confusion matrices for all backbones and training configurations.
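As a reference for how the "best checkpoint based on the highest validation accuracy" selection is typically implemented, a minimal Keras sketch is given below. It assumes a compiled `model` and the `train_flow`/`val_flow` generators from a setup such as the augmentation sketch above; the epoch count and output path are illustrative assumptions.

```python
# Sketch of best-checkpoint selection on validation accuracy (illustrative settings).
# Assumes a compiled `model` and `train_flow` / `val_flow` generators already exist.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    "best_model.h5",             # hypothetical output path
    monitor="val_accuracy",      # keep the epoch with the highest validation accuracy
    mode="max",
    save_best_only=True,
)

history = model.fit(
    train_flow,
    validation_data=val_flow,
    epochs=50,                   # illustrative epoch count
    callbacks=[checkpoint],
)
```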
The four training configurations enabled controlled evaluation of the relative contribution of synthetic data and augmentation. Configuration 1 (no synthetic, no augmentation) served as the baseline, relying exclusively on experimentally acquired images. In this setting, models were able to distinguish leaks from background cases, but recurrent confusion occurred between gas leak and liquid leak. This reflects the intrinsic visual differences of the phenomena: gas leaks generate small and rapidly dissipating bubble plumes with low contrast, whereas liquid leaks produce larger, more persistent visual disturbances that are easier to detect. Thus, the observed misclassifications align with physical expectations and indicate that the networks were not simply memorizing training data.
Configuration 2 (synthetic only) demonstrated that synthetic data stabilized training and mitigated class imbalance. By enriching under-represented classes, particularly gas leaks, the models reduced their tendency to collapse into a single dominant class. Performance metrics improved in terms of class balance, and confusion matrices revealed more reliable separation between gas plumes and background cases. These findings suggest that synthetic data are effective for extending the coverage of experimental datasets, even if the generated images are not perfect replicas of real-world scenes.
Configuration 3 (augmentation only) yielded counterintuitive results. Instead of improving generalization, augmentation reinforced plume-like structures that were already dominant in the training set. As a result, ambiguous cases were often mapped to the liquid leak class, reducing the sensitivity to gas leaks. This illustrates a key limitation of augmentation when used in isolation: transformations such as rotations, shifts, or brightness adjustments increase visual diversity, but do not create fundamentally new patterns for under-represented classes. Consequently, augmentation amplified existing dataset biases rather than compensating for them.
Configuration 4 (synthetic + augmentation) achieved the most consistent and balanced results across backbones. Here, augmentation increased variability, while synthetic data prevented class collapse, enabling more robust generalization. Confusion matrices showed more even distributions across classes, and borderline predictions (confidence values near 0.5) were represented more faithfully. This synergy highlights the complementary roles of synthetic imagery (class enrichment) and augmentation (variability), confirming the importance of combining both approaches in data-limited underwater monitoring tasks.
It is also worth noting that the modest size of the experimental dataset likely constrained model performance. As widely reported in the deep learning literature, model generalization improves significantly with larger and more diverse datasets [64,65]. Thus, increasing the volume of real annotated samples would potentially enhance accuracy and class discrimination even further, complementing the gains already observed with synthetic data.
However, collecting large-scale experimental datasets in subsea environments is often costly, logistically complex, and sometimes unfeasible, which reinforces the relevance of studies such as the present work that explore strategies like smaller experiments, synthetic data generation, and augmentation to mitigate data scarcity.
4.2.1. Configuration 1: No Synthetic, No Augmentation
Table 4 demonstrates that, for Configuration 1 (no synthetic, no augmentation), when models were trained exclusively with the experimentally acquired dataset, validation metrics suggested near-perfect performance for some backbones (e.g., VGG16 and ResNet50 reached validation accuracy and F1-scores of 1.0). However, this apparent success did not translate to the blind test. Systematic misclassifications emerged, particularly between the gas leak and liquid leak classes. This pattern indicates that the networks learned to capture generic features of plume-like structures (e.g., turbidity and bubble dispersion), but failed to distinguish the subtle differences between gaseous and liquid leak events.
Confusion with the non-leak class was also observed. In models such as DenseNet121 and EfficientNetB0, background patterns with noise, poor illumination, or seabed textures were misclassified as leaks. The lightweight CustomCNN presented an opposite behavior, showing a strong bias toward the liquid leak class and rarely predicting non-leak, thus over-detecting leaks and ignoring background variability. These behaviors underscore the limited representativeness of training solely on experimental data.
Another limitation of this configuration was the poor calibration of predictive confidence. Several backbones produced highly confident wrong predictions, such as classifying non-leak samples as liquid leak with probabilities above 0.95. Although InceptionV3 and EfficientNetB0 occasionally provided intermediate probability values (0.4–0.7) for ambiguous cases, most models exhibited overconfidence in their errors, which is problematic in safety-critical subsea applications.
Architectural comparisons reinforce these findings. VGG16 and ResNet50 learned well the internal dataset distribution but generalized poorly. Inception-based models showed more cautious predictions, yet still confused gas with liquid leaks. DenseNet121 and EfficientNetB0 offered more balanced trade-offs, although both struggled with the non-leak class. The CustomCNN was computationally efficient but strongly biased, making it unsuitable for deployment without complementary models.
Overall, this configuration exposed three critical weaknesses: overfitting to the training/validation split, lack of discriminative power between liquid and gas leaks, and frequent false positives in non-leak conditions. These limitations demonstrate that relying solely on limited data is insufficient for robust underwater leak detection. The results highlight the necessity of introducing additional variability through new experiments, synthetic imagery, or augmentation to mitigate overfitting and improve class separation in subsequent configurations.
4.2.2. Configuration 2: Impacts of Adding Synthetic Data
Table 5 shows that the introduction of synthetic data produced consistent improvements in model robustness, particularly when comparing validation and blind test performance.
In the configuration without synthetic data, some backbones (e.g., VGG16, ResNet50) reached nearly perfect validation scores, but failed to generalize to blind test samples, indicating overfitting to the experimental dataset. By contrast, when synthetic images were included, validation metrics decreased slightly, but blind test results became more aligned with observed training performance. This shift demonstrates that synthetic data reduced overfitting and improved the capacity to generalize.
A key distinction emerged in the classification of gas leak versus liquid leak. Without synthetic images, most models exhibited strong confusion between these two classes, frequently predicting gas events as liquid leaks. With synthetic data, class separation improved, especially for DenseNet121 and InceptionResNetV2. Although confusion still occurred, prediction probabilities were more moderate (0.4–0.6), indicating that the networks recognized the ambiguity rather than producing overconfident errors. This is a desirable behavior for operational deployment, as it enables threshold calibration or reject strategies.
Performance on the non-leak class also benefited from synthetic augmentation. In the purely experimental setup, several models—including CustomCNN—systematically misclassified non-leak images as leaks, leading to a high false positive rate. With synthetic data, recognition of the non-leak class improved considerably. Backbones such as VGG16, EfficientNetB0, and ResNet50 assigned more balanced probabilities (0.6–0.8) to background scenes, reducing the tendency to over-detect leaks.
Confidence calibration further confirmed this trend. In the baseline configuration, misclassifications were frequently made with very high probabilities (0.9–1.0), reflecting poorly calibrated models that were “certain” of incorrect decisions. When synthetic data were included, errors persisted but with more moderate confidence values (0.4–0.7). This improvement is practically important: it allows the definition of adjustable thresholds and the implementation of reject options for uncertain predictions.
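Such a reject option can be implemented as a simple confidence threshold on the softmax output. The sketch below illustrates the idea; the 0.7 threshold and the example probability vectors are illustrative assumptions, not values derived from the study.

```python
# Sketch of a confidence-based reject option on softmax outputs.
# The 0.7 threshold and example probabilities are illustrative assumptions.
import numpy as np

CLASS_NAMES = ["gas_leak", "liquid_leak", "non_leak"]
REJECT_THRESHOLD = 0.7

def classify_with_reject(probabilities, threshold=REJECT_THRESHOLD):
    """Return the predicted class, or 'uncertain' if the top probability is low."""
    probabilities = np.asarray(probabilities)
    top_idx = int(np.argmax(probabilities))
    top_prob = float(probabilities[top_idx])
    if top_prob < threshold:
        return "uncertain", top_prob   # defer to an operator or a secondary check
    return CLASS_NAMES[top_idx], top_prob

# A borderline gas/liquid prediction is flagged rather than forced into a class.
print(classify_with_reject([0.48, 0.45, 0.07]))   # ('uncertain', 0.48)
print(classify_with_reject([0.05, 0.90, 0.05]))   # ('liquid_leak', 0.9)
```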
Architectural comparisons reinforce these findings. DenseNet121, ResNet50, and InceptionResNetV2 benefited most from synthetic data, showing robust and balanced performance across classes. VGG16 maintained high accuracy but with less overfitting, while EfficientNetB0 presented the clearest improvements in confidence calibration, particularly reducing false positives in non-leak cases. The lightweight CustomCNN also improved by no longer ignoring the non-leak class, though it remained less stable than transfer learning backbones.
In summary, the use of synthetic data did not guarantee perfect validation metrics, but it yielded models that were more balanced, better calibrated, and more reliable under blind test conditions. This robustness is more valuable for practical subsea monitoring applications than the artificially inflated performance observed when training only on real experimental data.
4.2.3. Configuration 3: Augmentation Only
Table 6 demonstrates that when augmentation was applied in the absence of synthetic data, the results diverged from expectations. Although validation accuracy and F1-scores remained high for most backbones, blind test results revealed a recurrent collapse of predictions into the liquid leak class. This outcome reflects a reinforcement of dataset biases: geometric and photometric transformations (rotations, shifts, brightness jitter) increased variability within existing classes but did not introduce fundamentally new features, especially for under-represented gas leak cases.
As a consequence, models tended to amplify plume-like cues already dominant in the dataset, leading to oversimplification of the decision boundary.
Confusion matrices confirmed that ambiguous samples, including many gas leak instances, were consistently misclassified as liquid leak. In addition, false positives increased for the non-leak class, as background textures and turbidity were exaggerated by augmentation, making them visually closer to leak scenarios. Confidence calibration was also problematic: several misclassifications occurred with high certainty (0.9–1.0), suggesting that augmentation reinforced spurious correlations rather than encouraging caution.
Across backbones, ResNet50 and VGG16 again appeared strong in validation but unstable in blind evaluation. DenseNet121 and EfficientNetB0 showed more resilience, yet still misclassified most gas leaks as liquid. The CustomCNN remained heavily biased toward liquid leak, confirming that shallow architectures are particularly vulnerable to bias reinforcement when augmentation is applied without additional data diversity.
In summary, augmentation alone did not improve generalization. On the contrary, it exacerbated existing imbalances, producing high apparent performance in validation but poor robustness in blind testing. This highlights that augmentation cannot substitute for class enrichment and is most effective when combined with synthetic or real additional samples.
4.2.4. Configuration 4: Synthetic + Augmentation
The combined use of synthetic data and augmentation yielded the most balanced and consistent results, as demonstrated in Table 7. Validation accuracy and F1-scores were moderate but aligned closely with blind test outcomes, indicating improved generalization. Unlike Configurations 1 and 3, models did not collapse into a single dominant class. Synthetic images enriched under-represented cases, particularly gas leaks, while augmentation added realistic variability in lighting, orientation, and turbidity. Together, these strategies acted synergistically, reducing bias and promoting robustness across architectures.
Confusion matrices showed that all three classes (gas leak, liquid leak, non-leak) were represented with fewer systematic misclassifications. Errors that remained were typically associated with borderline cases and were accompanied by moderate prediction confidence (0.4–0.7), indicating that the models recognized ambiguity instead of making overconfident mistakes. This calibration is a key advantage for real-world deployment, as it allows the definition of adaptive thresholds or the rejection of uncertain predictions.
Among the backbones, DenseNet121 and ResNet50 achieved the most stable balance between accuracy and calibration. InceptionV3 and InceptionResNetV2 also performed well, though with slightly higher variability in gas–liquid separation. EfficientNetB0 stood out for offering strong performance with reduced computational cost, while VGG16 remained competitive but less efficient. The CustomCNN improved compared to previous scenarios, recognizing all three classes more evenly, though still with lower absolute performance compared to transfer learning models.
Overall, this configuration demonstrated that synthetic data and augmentation are not redundant but complementary. Synthetic images expand class coverage, while augmentation increases variability within each class. The combined effect produced the most reliable generalization among all tested scenarios, supporting the value of a data-centric approach in underwater leak detection tasks.
4.3. Model Architecture Considerations
The evaluation across seven backbones revealed important differences in how architectures handle the underwater leak detection task. Residual and densely connected networks (ResNet50, DenseNet121, InceptionResNetV2) consistently showed stable training behavior, reflecting their ability to propagate features effectively and reuse information in data-limited conditions. Among them, ResNet50 provided a robust and widely recognized baseline, while DenseNet121 offered parameter efficiency and strong validation performance. InceptionResNetV2 achieved competitive accuracy but showed some sensitivity to class imbalance in scenarios without synthetic data.
InceptionV3, with its multi-scale inception modules, captured visual features at different receptive fields and performed well in most cases, although its behavior was less stable for gas-leak detection. VGG16, despite being a canonical reference for transfer learning, presented limitations due to its large parameter count and comparatively rigid structure, which hindered generalization in the multi-class setting. EfficientNetB0 demonstrated a favorable accuracy–efficiency trade-off, achieving solid results with fewer parameters, though its robustness decreased in visually degraded conditions such as turbid or low-contrast imagery. Finally, the lightweight custom CNN confirmed the feasibility of domain-specific models optimized for efficiency, but its reduced depth limited its capacity to capture subtle visual cues, making it less reliable for distinguishing gas leaks from similar background patterns.
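For reference, the sketch below shows how one of these transfer learning backbones (ResNet50 is used here as an example) can be adapted to the three-class leak task. The classification head size, dropout rate, and freezing strategy are illustrative assumptions rather than the exact configuration used in this study.

```python
# Sketch of adapting an ImageNet-pretrained backbone to the 3-class leak task.
# Head size, dropout, and freezing strategy are illustrative assumptions.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False            # freeze convolutional features for the first stage

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(3, activation="softmax"),   # gas leak, liquid leak, non-leak
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```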
Overall, these observations emphasize that while lightweight architectures are appealing for real-time applications in resource-constrained environments (e.g., AUVs or ROVs), deeper networks with residual or dense connectivity currently offer the most reliable performance in underwater leak detection. The systematic benchmarking across diverse backbones provides valuable insight into trade-offs between accuracy, computational cost, and robustness to challenging visual conditions.
4.4. Application to Field Data: Blind Test with Deepwater Horizon Oil Spill and the Atlantic Margin Natural Methane Seeps
To assess the practical applicability of the leak classification models in real-world scenarios, a set of publicly available images from the 2010 Deepwater Horizon oil spill in the Gulf of Mexico was employed. These field images were sourced from the repository of the NOAA Office of Response and Restoration of the United States [66]; they were not used during training or validation and served exclusively as an additional test of generalization performance beyond the experimental dataset. A subset of leak images was selected and processed with the trained models, including the custom CNN, to evaluate classification performance under field conditions for all four configurations studied.
It is important to note that, due to the lack of publicly available images of verified gas leaks in subsea pipelines, the blind test set for the gas leak class was composed of imagery of natural methane seeps. These images were obtained from NOAA campaigns along the Atlantic margin of the United States, where the ROV Deep Discoverer (D2) explored methane plumes rising from the seafloor [67]. While such seeps are not identical to gas discharges from damaged pipeline sections, they represent the closest available real-world analogue of underwater gas leaks, providing valuable visual references of bubble plumes, turbidity, and dispersion under deep-sea conditions. We acknowledge that this substitution introduces a potential mismatch between training and testing distributions, since the network was trained on controlled pipeline leak experiments rather than natural seepage phenomena. Nevertheless, incorporating these samples was considered scientifically justified, as they expose the models to authentic subsea gas plumes and allow a more realistic assessment of generalization in the absence of dedicated pipeline datasets.
For clarity, a summary of the blind test performance under the best-performing setup, Configuration 4 (synthetic + augmentation), is reported in Table 8. Among the evaluated backbones, InceptionV3 achieved the highest performance, with a macro F1-score of 0.671 and an accuracy of 0.667, demonstrating its superior generalization capability in this scenario. Representative prediction outcomes for Configurations 1 to 4 are shown in Figure 9, Figure 10, Figure 11 and Figure 12.
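The macro F1-score and accuracy reported for this blind test can be computed directly from the model predictions. A minimal scikit-learn sketch is given below; the label and prediction lists are placeholders standing in for the actual blind test outputs, not the study's data.

```python
# Sketch of computing blind test metrics (accuracy, macro F1, confusion matrix).
# y_true / y_pred are illustrative placeholders, not the study's blind test results.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = ["gas_leak", "liquid_leak", "non_leak", "gas_leak"]      # ground-truth labels
y_pred = ["liquid_leak", "liquid_leak", "non_leak", "gas_leak"]   # model predictions

labels = ["gas_leak", "liquid_leak", "non_leak"]
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, labels=labels, average="macro"))
print(confusion_matrix(y_true, y_pred, labels=labels))   # rows: true, columns: predicted
```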
Figure 9 shows blind test predictions for Configuration 1 (no synthetic data, no augmentation). Models trained solely on the original dataset showed a partial ability to discriminate leak types (images A and B), but recurrent confusion between gas leak and liquid leak (images D to F) and unstable predictions for background cases (image C). Although the models captured plume-like structures, their decision boundaries were unstable, leading to inconsistent performance across backbones.
The blind test predictions for Configuration 2, shown in Figure 10, demonstrated that synthetic data considerably improved training stability and reduced class imbalance effects. The models trained with synthetic data achieved more robust discrimination of gas plumes (images A, B, D, and F) and background scenes, avoiding collapse into a single class and producing more balanced confusion matrices. Stability and class balance improved, and synthetic images supported the correct classification of gas leak events that were otherwise misclassified.
In Configuration 3, whose results are shown in Figure 11, the introduction of light augmentation operations (rotation, shift, brightness jitter) without synthetic data produced a counterintuitive outcome: instead of improving generalization, the models tended to oversimplify the classification boundary (images B, C, D, and F). Many ambiguous samples were consistently mapped to the liquid leak class, indicating that the augmentation reinforced plume-like visual cues without introducing enough diversity for non-leak and gas-leak contexts.
Finally, Configuration 4, whose predictions are shown in Figure 12, yielded the most consistent and robust results. The combination of synthetic data and augmentation enabled the networks to generalize across challenging blind test images. Predictions were more evenly distributed across classes (images A, B, C, and F), and borderline cases (e.g., with confidence near 0.5, as in images D and E) were more faithfully represented. This synergy highlights the complementary roles of dataset enrichment through synthetic imagery and augmentation-induced variability.
4.5. Implications and Practical Considerations
The comparative analysis across scenarios provides key insights into the role of data diversity in underwater leak detection tasks.
First, synthetic data proved essential to stabilize training. Models without synthetic samples (Configurations 1 and 3) often converged to biased solutions, particularly in the presence of augmentation, which exaggerated plume-like structures and caused class collapse into liquid leak. This confirms that augmentation alone cannot address intrinsic dataset imbalance nor compensate for under-represented leak modes.
Second, augmentation amplified existing biases when applied without synthetic data. The tendency to misclassify ambiguous cases as liquid leak suggests that the operations (blur, intensity jitter, spatial shifts) magnified features already abundant in the dataset, while failing to generate meaningful variations for gas leaks or background patterns. This explains the apparent paradox: augmentation degraded generalization when used in isolation.
Third, the best-performing models emerged from combining synthetic data and augmentation (Configuration 4). In this setting, augmentation acted synergistically with synthetic data by diversifying the visual space, while synthetic images prevented class collapse. The result was a more balanced performance across leak types, with representative examples correctly classified even in borderline cases (e.g., Figure 12A,B, where predictions hovered near 0.5).
Overall, these findings indicate that data-centric strategies are indispensable for underwater leak detection. Synthetic data enriches the representation of under-sampled classes, while augmentation, when applied cautiously, further enhances robustness. Future work should investigate adaptive or class-specific augmentation schemes, as well as advanced generative approaches (e.g., diffusion-based synthetic imagery) to further reduce residual misclassifications.
Although the absolute values of validation and blind test accuracy do not reach the highest levels typically reported in large-scale vision benchmarks, the results provide meaningful insights into the challenges of underwater leak detection. In particular, gas leaks proved consistently harder to classify than liquid leaks, reflecting their subtle visual signatures, characterized by small bubble plumes and limited turbidity. This behavior highlights the sensitivity of CNNs to physical properties of the leak phenomena and confirms that the models are learning realistic patterns rather than overfitting spurious cues.
A comparison between validation and blind test performance provides additional insight into the generalization capacity of the models. Across backbones, validation accuracy and macro F1-scores were generally higher than those obtained on blind test samples, which is expected given the limited dataset size and the inherent variability of unseen images. However, the blind test results did not collapse to random guessing, indicating that the networks learned meaningful representations of leak phenomena. For example, models that achieved strong validation stability under Configuration 4 also retained balanced behavior on the blind test set, confirming that the synergy between synthetic data and augmentation improved robustness. The relative consistency of performance trends across validation and blind test evaluations suggests that the models were not overfitting to the training conditions, but rather captured transferable visual cues related to subsea leak events. This observation, while modest in absolute performance, reinforces the feasibility of applying CNN-based approaches to underwater monitoring tasks, even when training data availability is limited.