Author Contributions
Conceptualization, G.L.; methodology, G.L.; software, G.L.; validation, G.L., A.D. and E.S.; formal analysis, G.L.; investigation, G.L.; resources, E.S.; data curation, G.L.; writing—original draft preparation, G.L.; writing—review and editing, G.L., A.D. and E.S.; visualization, G.L.; supervision, A.D. and E.S.; project administration, A.D. and E.S.; funding acquisition, A.D. and E.S. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Three tiles representative of each year.
Figure 2.
t-SNE visualisation of YOLOv8 backbone features ( tiles per partition), illustrating the temporal domain shift. The 2018 partition demonstrates clear feature separation from the 2021 and 2022 partitions.
Figure 3.
Visual demonstration of methodology-induced border fragmentation noise caused by the sliding-window tiling process. (a) A polymetallic nodule located within the 96-pixel overlap region of a 640 px tile boundary. (b) Physical splitting of the nodule across adjacent frames. Because the left fragment constitutes less than 50% of the original bounding box area, its ground-truth annotation is discarded during dataset generation, while the right fragment retains its annotation. (c) The consequence during model training and inference is that the model correctly detects the left fragment based on morphological features, but due to the discarded annotation, it is strictly penalized as a false positive, corrupting the pseudo-label pool.
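The discard rule described in this caption can be sketched as follows. This is a minimal illustration, not the authors' dataset-generation code: the function name `clip_annotation`, the `(x_min, y_min, x_max, y_max)` box layout, and the example nodule coordinates are assumptions; only the 640 px tile size, the 96 px overlap, and the 50% area rule come from the caption.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in image pixels


def clip_annotation(box: Box, tile: Box, min_area_ratio: float = 0.5) -> Optional[Box]:
    """Clip `box` to `tile`; return tile-relative coordinates, or None if discarded."""
    bx0, by0, bx1, by1 = box
    tx0, ty0, tx1, ty1 = tile
    # Intersection of the annotation with the tile window.
    ix0, iy0 = max(bx0, tx0), max(by0, ty0)
    ix1, iy1 = min(bx1, tx1), min(by1, ty1)
    if ix1 <= ix0 or iy1 <= iy0:
        return None  # the box does not touch this tile
    full_area = (bx1 - bx0) * (by1 - by0)
    kept_area = (ix1 - ix0) * (iy1 - iy0)
    # Fragments smaller than half the original box lose their annotation.
    if kept_area / full_area < min_area_ratio:
        return None
    return (ix0 - tx0, iy0 - ty0, ix1 - tx0, iy1 - ty0)


# Two horizontally adjacent 640 px tiles with a 96 px overlap (stride 544 px).
left_tile = (0, 0, 640, 640)
right_tile = (544, 0, 1184, 640)
nodule = (620, 100, 700, 180)  # hypothetical 80 x 80 px box inside the overlap

left = clip_annotation(nodule, left_tile)    # only 25% of the box is visible
right = clip_annotation(nodule, right_tile)  # the whole box is visible
```

In this example the left tile still shows a visible nodule fragment but carries no label for it, which is the situation panel (c) describes.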
Figure 4.
Visual taxonomy of pseudo-label noise. The red shaded region denotes the 64-pixel border margin. Predictions within this zone are highly susceptible to becoming border false positives (red dashed line) due to tiling fragmentation, even when detecting legitimate nodule fragments. Beyond this margin, the tile interior contains true positives (green solid line) alongside a persistent floor of interior false positives (orange solid line), which arise when the source-trained model hallucinates targets from unfamiliar, target-domain benthic textures. Ground truth annotations are shown in grey.
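A border-margin filter of the kind pictured here could look like the sketch below. The distance definition is an assumption: the caption does not say whether the margin is measured from the box edge or the box centre, so this version uses the nearest box edge. The names `border_distance` and `spatial_filter` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # tile-relative (x_min, y_min, x_max, y_max)


def border_distance(box: Box, tile_size: int = 640) -> float:
    """Smallest gap between any box edge and the tile boundary (assumed definition)."""
    x0, y0, x1, y1 = box
    return min(x0, y0, tile_size - x1, tile_size - y1)


def spatial_filter(boxes: List[Box], margin: int = 64, tile_size: int = 640) -> List[Box]:
    """Keep only predictions that lie at least `margin` pixels inside the tile."""
    return [b for b in boxes if border_distance(b, tile_size) >= margin]


preds = [
    (10, 200, 60, 250),    # 10 px from the left edge: dropped
    (300, 300, 360, 350),  # well inside the tile: kept
    (560, 100, 630, 170),  # 10 px from the right edge: dropped
]
kept = spatial_filter(preds)
```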
Figure 5.
Spatial decay of border fragmentation noise. False positive rate as a function of distance from the tile boundary. A sharp spike is observed at the border edge, which rapidly decays and stabilizes at a persistent interior noise floor by the 64-pixel mark.
Figure 6.
True positive rate across confidence thresholds. The monotonic increase in true positive rate across all three architectures demonstrates that higher-confidence predictions are significantly better localised (under standard matching). The vertical dashed line denotes the selected 0.90 operating point, which effectively isolates the domain-stable prediction subspace.
Figure 7.
Final cascade-filtering pipeline.
Figure 8.
Scaling curve for Condition F at seed 42.
Figure 9.
Distribution of macro mAP50:95 scores for Condition F across four random seeds at varying pseudo-label pool sizes. The red dashed line represents the multi-seed mean of the supervised baseline (Condition B: 0.4467 across the same four seeds). The dark blue dots represent the individual scores for each of the four random seeds, and the solid red dots indicate the mean value for that pool size.
Table 1.
Dataset division by year and total number of high-resolution images.
| Expedition Year | Number of Images | Resolution |
|---|---|---|
| 2018 | 1331 | 1024 × 683 and 5184 × 3456 |
| 2021 | 216 | 5184 × 3456 |
| 2022 | 597 | 5184 × 3456 |
Table 2.
Dataset division by year and total number of tiles.
| Expedition Year | Number of Tiles |
|---|---|
| 2018 | 31,674 |
| 2021 | 11,664 |
| 2022 | 32,238 |
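The 2021 and 2022 tile counts are consistent with a 640 px window moved at a stride of 640 − 96 = 544 px, with windows that do not fit entirely inside the image dropped. The sketch below checks that arithmetic. The stride and the edge handling are assumptions inferred from the counts rather than stated in the source, and the helper names are hypothetical. The 2018 count cannot be checked the same way because that year mixes two resolutions and the split between them is not given.

```python
def tiles_per_axis(length: int, tile: int = 640, overlap: int = 96) -> int:
    """Number of full windows along one axis, dropping any partial window."""
    stride = tile - overlap
    if length < tile:
        return 0
    return (length - tile) // stride + 1


def tiles_per_image(width: int, height: int) -> int:
    return tiles_per_axis(width) * tiles_per_axis(height)


per_image = tiles_per_image(5184, 3456)  # 9 columns x 6 rows
tiles_2021 = 216 * per_image
tiles_2022 = 597 * per_image
```

Under these assumptions each 5184 × 3456 image yields 54 tiles, which reproduces the 11,664 and 32,238 tiles reported for 2021 and 2022.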
Table 3.
Image statistics across the 2018, 2021, and 2022 datasets.
| Metric | 2018 | 2021 | 2022 |
|---|---|---|---|
| Brightness | | | |
| Contrast | | | |
| Sharpness | | | |
| Entropy | | | |
| Red | | | |
| Green | | | |
| Blue | | | |
| Saturation | | | |
Table 4.
Summary of fold definitions and tile counts per split. Labelled sets from the non-test partitions are divided into training and validation subsets using an 80/20 split. The pseudo-label pool is strictly restricted to unlabelled tiles from the non-test partitions to prevent temporal leakage.
| Fold | Test Year | Train/Val Years | Train Tiles 1 | Val Tiles | Test Tiles | Pseudo-Label Tiles 2 |
|---|---|---|---|---|---|---|
| 0 | 2018 | 2021, 2022 | 288 | 70 | 148 | 36,772 |
| 1 | 2021 | 2018, 2022 | 238 | 58 | 210 | 47,195 |
| 2 | 2022 | 2018, 2021 | 294 | 72 | 140 | 27,967 |
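The leave-one-year-out protocol summarised in this table can be sketched as below. It is an illustration under assumptions: the source does not say how the 80/20 split is drawn, so a seeded random shuffle at tile level is used here, and the function name `make_fold` and the toy tile identifiers are hypothetical.

```python
import random
from typing import Dict, List

YEARS = [2018, 2021, 2022]


def make_fold(
    labelled: Dict[int, List[str]],
    unlabelled: Dict[int, List[str]],
    test_year: int,
    val_fraction: float = 0.2,
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Hold out one year for testing; build every other split from the remaining years."""
    source_years = [y for y in YEARS if y != test_year]
    pool = [t for y in source_years for t in labelled[y]]
    random.Random(seed).shuffle(pool)  # assumed: random tile-level split
    n_val = round(len(pool) * val_fraction)
    return {
        "train": pool[n_val:],
        "val": pool[:n_val],
        "test": list(labelled[test_year]),
        # Unlabelled tiles from the test year are never pseudo-labelled.
        "pseudo_pool": [t for y in source_years for t in unlabelled[y]],
    }


# Toy data: 10 labelled and 5 unlabelled tiles per year.
labelled = {y: [f"{y}_L{i}" for i in range(10)] for y in YEARS}
unlabelled = {y: [f"{y}_U{i}" for i in range(5)] for y in YEARS}
fold0 = make_fold(labelled, unlabelled, test_year=2018)
```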
Table 5.
Summary of the selected teacher models, their architectural paradigms, and their baseline performance on the source dataset.
| Teacher Model | Architectural Paradigm | Source Benchmark mAP@50:95 |
|---|---|---|
| DINO [4] | Transformer-based set-prediction | 0.899 |
| Faster R-CNN [2] | Anchor-based two-stage | 0.832 |
| YOLOv8s [3] | Anchor-free single-stage | 0.856 |
Table 6.
Distribution of object sizes across the target and source datasets.
| Category | Target Dataset: Count | Target Dataset: Percentage | Source Dataset: Count | Source Dataset: Percentage |
|---|---|---|---|---|
| Tiny | 23 | 0.90% | 0 | 0.00% |
| Small | 385 | 15.01% | 8 | 0.12% |
| Medium | 480 | 18.71% | 3730 | 58.04% |
| Large | 1677 | 65.38% | 2689 | 41.84% |
Table 7.
Distribution of border annotations by year, detailing the counts and percentages for each spatial zone.
| Year | Clipped | Filter Zone | Overlap Zone | Interior |
|---|---|---|---|---|
| 2018 | 17 (1.9%) | 286 (32.7%) | 132 (15.1%) | 440 (50.3%) |
| 2021 | 45 (4.7%) | 231 (24.0%) | 196 (20.4%) | 489 (50.9%) |
| 2022 | 50 (6.9%) | 173 (23.7%) | 166 (22.8%) | 340 (46.6%) |
Table 8.
Filter cascade yield across the full unlabelled pool.
| Stage | Boxes Retained | % of Raw |
|---|---|---|
| Raw WBF fused boxes (conf ≥ 0.001) | 7,666,887 | 100.0% |
| Spatial Filter | 481,008 | 6.3% |
| Confidence Filter | 224,858 | 2.9% |
| Ensemble Agreement | 167,011 | 2.2% |
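The stage order in this table (spatial, then confidence, then ensemble agreement) can be expressed as a short cascade. The sketch below is illustrative only: the `FusedBox` record and its fields are assumptions, in particular that each fused box carries a precomputed border distance and the number of teachers that contributed to it. The thresholds (64 px, 0.90, at least 2 of 3 teachers) are those listed for Condition F.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FusedBox:
    border_dist: float  # px from the box to the nearest tile edge (assumed precomputed)
    conf: float         # fused confidence score
    n_models: int       # teachers contributing to this fused box (assumed available)


def run_cascade(
    boxes: List[FusedBox],
    margin: float = 64,
    min_conf: float = 0.90,
    min_models: int = 2,
) -> Dict[str, int]:
    """Apply the three filters in order and record how many boxes survive each stage."""
    counts = {"raw": len(boxes)}
    boxes = [b for b in boxes if b.border_dist >= margin]
    counts["spatial"] = len(boxes)
    boxes = [b for b in boxes if b.conf >= min_conf]
    counts["confidence"] = len(boxes)
    boxes = [b for b in boxes if b.n_models >= min_models]
    counts["agreement"] = len(boxes)
    return counts


def percent_of_raw(retained: int, raw: int) -> float:
    return round(100.0 * retained / raw, 1)


toy = [
    FusedBox(10, 0.95, 3),   # fails the spatial filter
    FusedBox(200, 0.95, 3),  # survives all stages
    FusedBox(200, 0.40, 2),  # fails the confidence filter
    FusedBox(200, 0.97, 1),  # fails the agreement filter
]
toy_counts = run_cascade(toy)

# Recomputing the "% of Raw" column from the reported box counts.
table8_pct = [percent_of_raw(n, 7_666_887) for n in (481_008, 224_858, 167_011)]
```

Recomputing the percentages from the reported counts gives 6.3, 2.9, and 2.2, matching the table.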
Table 9.
Fine-tuning hyperparameters.
| Hyperparameter | Value |
|---|---|
| Base model | YOLOv8 (source checkpoint) |
| Optimizer | AdamW |
| Initial learning rate (lr0) | 0.0001 |
| Final learning rate factor (lrf) | 0.01 |
| Learning rate schedule | Cosine decay |
| Momentum | 0.937 |
| Weight decay | 0.0005 |
| Warmup epochs | 3 |
| Maximum epochs | 100 |
| Early stopping patience | 20 |
| Frozen layers | 10 (backbone) |
| Batch size | 16 |
| Image size | |
| Mosaic augmentation | 1.0 |
| Horizontal/vertical flip probability | 0.5/0.5 |
| HSV hue/saturation/value jitter | 0.015/0.7/0.4 |
| Scale augmentation | 0.5 |
| Translation augmentation | 0.1 |
| Rotation augmentation | |
| Single class mode | True |
| Seed | 42 (varied across runs: 42, 365, 1234, 2026) |
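For readers who want to reproduce this configuration, the settings map naturally onto keyword arguments of the Ultralytics `YOLO.train` interface. The source does not name the training framework, so this mapping is an assumption based on the `lr0` and `lrf` parameter names. The image size and rotation values are not recoverable from the table and are left out rather than guessed. The checkpoint and dataset paths in the commented call are hypothetical.

```python
# Fine-tuning settings from the table, expressed as Ultralytics-style train arguments.
TRAIN_ARGS = {
    "optimizer": "AdamW",
    "lr0": 0.0001,          # initial learning rate
    "lrf": 0.01,            # final learning rate factor
    "cos_lr": True,         # cosine decay schedule
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "warmup_epochs": 3,
    "epochs": 100,
    "patience": 20,         # early stopping patience
    "freeze": 10,           # freeze the first 10 layers (backbone)
    "batch": 16,
    "mosaic": 1.0,
    "fliplr": 0.5,          # horizontal flip probability
    "flipud": 0.5,          # vertical flip probability
    "hsv_h": 0.015,
    "hsv_s": 0.7,
    "hsv_v": 0.4,
    "scale": 0.5,
    "translate": 0.1,
    "single_cls": True,
    "seed": 42,
}

SEEDS = [42, 365, 1234, 2026]  # seeds used across repeated runs

# Hypothetical usage (requires the ultralytics package and a dataset YAML):
# from ultralytics import YOLO
# for seed in SEEDS:
#     YOLO("source_checkpoint.pt").train(data="fold0.yaml", **{**TRAIN_ARGS, "seed": seed})
```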
Table 10.
Experimental condition definitions. Each condition adds a new component to the pipeline.
| Condition | Description | Teacher | Conf Filter | Spatial Filter | Ensemble |
|---|---|---|---|---|---|
| A | Zero-shot transfer | — | — | — | — |
| B | Supervised fine-tuning | — | — | — | — |
| C | Naïve single-model pseudo-labelling | YOLOv8 | 0.001 | × | × |
| D | Single-model + spatial filter | YOLOv8 | 0.001 | 64 px | × |
| E | Single-model + spatial filter + confidence threshold | YOLOv8 | 0.90 | 64 px | × |
| F | Multi-model ensemble pseudo-labelling | YOLOv8 + DINO + FRCNN | 0.90 | 64 px | ≥2 |
| G | Early distillation extension | YOLOv8 + DINO + FRCNN | 0.90 | 64 px | ≥2 |
| H | Mean Teacher (EMA) | EMA student | — | — | — |
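Condition H differs from the other conditions in that its teacher is not a fixed checkpoint but an exponential moving average (EMA) of the student weights. A minimal sketch of the standard Mean Teacher update is given below. The decay value used in the source is not recoverable from the text, so `alpha` here is a hypothetical placeholder, and weights are shown as plain lists of floats rather than framework tensors.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], alpha: float) -> List[float]:
    """Standard Mean Teacher update: teacher <- alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


teacher_w = [1.0, 0.0]
student_w = [0.0, 1.0]
teacher_w = ema_update(teacher_w, student_w, alpha=0.9)  # alpha is illustrative only
```

With a decay close to 1 the teacher changes slowly, which is what makes its predictions more stable than those of the student at any single training step.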
Table 11.
Baseline condition results (mAP50:95). Condition A is a fixed zero-shot checkpoint with no seed variation. Conditions B and H report mean ± standard deviation across four random seeds.
| Condition | Description | Macro mAP50:95 |
|---|---|---|
| A | Zero-shot transfer | 0.2530 |
| B | Supervised fine-tuning | |
| H | Mean Teacher (EMA, ) | |
Table 12.
Condition F cross-validation results. Macro mAP50:95 performance across three folds and the performance delta compared to Condition B.
| Pseudo Count | Fold_0 | Fold_1 | Fold_2 | Macro mAP50:95 | vs. Cond B |
|---|---|---|---|---|---|
| B (0 pseudo) | 0.2617 | 0.5293 | 0.5481 | 0.4464 | baseline |
| H (Mean Teacher) | 0.2757 | 0.5020 | 0.5198 | 0.4325 | −0.0139 |
| 100 | 0.3049 | 0.5291 | 0.5734 | 0.4691 | +0.0227 |
| 200 | 0.3130 | 0.5215 | 0.5434 | 0.4593 | +0.0129 |
| 500 | 0.3131 | 0.5000 | 0.5514 | 0.4548 | +0.0084 |
| 1000 | 0.2811 | 0.4100 | 0.4500 | 0.3804 | −0.0660 |
| 2000 | 0.2861 | 0.4900 | 0.5500 | 0.4420 | −0.0044 |
| 5000 | 0.2589 | 0.4500 | 0.5200 | 0.4096 | −0.0368 |
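The macro column in this table is the unweighted mean of the three fold scores, and the final column matches the difference between the rounded macro values. The short check below reproduces three rows. The dictionary keys and helper names are hypothetical labels; the scores are copied from the table.

```python
from typing import Dict, List

fold_scores: Dict[str, List[float]] = {
    "B": [0.2617, 0.5293, 0.5481],
    "H": [0.2757, 0.5020, 0.5198],
    "F-100": [0.3049, 0.5291, 0.5734],
}


def macro_map(scores: List[float]) -> float:
    """Unweighted mean over folds, rounded to four decimals as in the table."""
    return round(sum(scores) / len(scores), 4)


def delta_vs_b(condition: str) -> float:
    """Difference between rounded macro scores, which is how the table values line up."""
    return round(macro_map(fold_scores[condition]) - macro_map(fold_scores["B"]), 4)


macro_b = macro_map(fold_scores["B"])
macro_f100 = macro_map(fold_scores["F-100"])
delta_f100 = delta_vs_b("F-100")
delta_h = delta_vs_b("H")
```

Because every fold carries equal weight, the weak 2018 fold (Fold_0) pulls the macro score down regardless of how many test tiles it contains.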
Table 13.
Component ablation at the 100-tile pseudo-label budget. Each condition adds one pipeline component over the previous. All pseudo-labelling conditions report mean ± standard deviation across four random seeds. Condition E-FT reports seed 42 only.
| Condition | Added Component | Macro mAP50:95 | Δ vs. B |
|---|---|---|---|
| B | Supervised Baseline | | baseline |
| H | EMA teacher updates | | |
| C-100 | Naïve pseudo-labelling | | |
| D-100 | + 64 px spatial filter | | |
| E-100 | + conf ≥ 0.90 | | |
| F-100 | + ensemble ≥ 2 of 3 | | |
| G-100 | + early distillation | | |
Table 14.
Condition F—multi-seed macro mAP50:95 across four random seeds (42, 365, 1234, 2026). Mean ± standard deviation reported.
| Pseudo Count | Macro mAP50:95 | vs. Cond B |
|---|---|---|
| 100 | 0.4745 ± 0.0042 | +0.028 |
| 200 | 0.4647 ± 0.0070 | +0.018 |
| 500 | 0.4414 ± 0.0199 | −0.0053 |
| 1000 | 0.4020 ± 0.0295 | −0.0448 |
| 2000 | 0.4331 ± 0.0112 | −0.0136 |
| 5000 | 0.4316 ± 0.0155 | −0.0151 |
Table 15.
Teacher checkpoint evaluation (conf = 0.001, d_border ≥ 64 px).
| Teacher Model | Test Year | Evaluation Metric | True Positives | False Positives | FP Rate (%) |
|---|---|---|---|---|---|
| Zero-Shot Source | 2018 | Centre-Point | 384 | 233 | 37.76 |
| Zero-Shot Source | 2018 | IoU @ 0.50 | 366 | 251 | 40.68 |
| Condition B (Fine-Tuned) | 2018 | Centre-Point | 412 | 451 | 52.26 |
| Condition B (Fine-Tuned) | 2018 | IoU @ 0.50 | 383 | 480 | 55.62 |
| Zero-Shot Source | 2021 | Centre-Point | 335 | 268 | 44.44 |
| Zero-Shot Source | 2021 | IoU @ 0.50 | 248 | 355 | 58.87 |
| Condition B (Fine-Tuned) | 2021 | Centre-Point | 295 | 732 | 71.28 |
| Condition B (Fine-Tuned) | 2021 | IoU @ 0.50 | 289 | 738 | 71.86 |
| Zero-Shot Source | 2022 | Centre-Point | 235 | 165 | 41.25 |
| Zero-Shot Source | 2022 | IoU @ 0.50 | 173 | 227 | 56.75 |
| Condition B (Fine-Tuned) | 2022 | Centre-Point | 224 | 1160 | 83.82 |
| Condition B (Fine-Tuned) | 2022 | IoU @ 0.50 | 215 | 1169 | 84.47 |
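The two matching criteria in this table can give different verdicts for the same prediction, and the FP rate in each row equals FP / (TP + FP). The sketch below illustrates both points. The centre-point rule is assumed to mean that the predicted box centre falls inside a ground-truth box, since the source does not define it here. The greedy one-to-one matching also assumes predictions are already sorted by confidence, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_match(pred: Box, gt: Box) -> bool:
    return iou(pred, gt) >= 0.5


def centre_match(pred: Box, gt: Box) -> bool:
    """Assumed rule: the predicted box centre lies inside the ground-truth box."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]


def count_tp_fp(preds: List[Box], gts: List[Box],
                match: Callable[[Box, Box], bool]) -> Tuple[int, int]:
    """Greedy one-to-one matching; each ground-truth box can be claimed once."""
    used = set()
    tp = 0
    for p in preds:
        hit = next((i for i, g in enumerate(gts) if i not in used and match(p, g)), None)
        if hit is not None:
            used.add(hit)
            tp += 1
    return tp, len(preds) - tp


def fp_rate(tp: int, fp: int) -> float:
    return 100.0 * fp / (tp + fp)


# A loosely localised prediction: its centre is on the nodule, but IoU is only 0.36.
pred, gt = (0, 0, 100, 100), (40, 40, 100, 100)
centre_result = count_tp_fp([pred], [gt], centre_match)
iou_result = count_tp_fp([pred], [gt], iou_match)
zero_shot_2018 = round(fp_rate(384, 233), 2)
```

The gap between the two criteria in a given row therefore indicates how many detections land on a real nodule but are poorly localised.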
Table 16.
Border vs. interior mAP50:95 split evaluation across four random seeds (Mean ± SD). Condition A is a fixed zero-shot checkpoint with no seed variation.
| Condition | Overall mAP50:95 | Interior mAP50:95 | Border mAP50:95 |
|---|---|---|---|
| A | 0.2530 | 0.2667 | 0.1337 |
| B | | | |
| F-100 | | | |
Table 17.
Evaluation comparison of small objects mAP50:95 for test year 2018.
| Condition | mAP50:95 Small | AR50:95 Small |
|---|---|---|
| B | 0.232 | 0.285 |
| F-100 | 0.315 | 0.366 |