Cross-Expedition Domain Adaptation for Polymetallic Nodule Detection: A Multi-Model Pseudo-Labelling Approach

Loureiro, Gabriel; Dias, André; Silva, Eduardo

doi:10.3390/jmse14111048

by Gabriel Loureiro ¹,*, André Dias ¹,² and Eduardo Silva ¹,²

¹ INESCTEC (Institute for Systems and Computer Engineering, Technology and Science), Rua Dr. Roberto Frias, 4200-465 Porto, Portugal
² ISEP (School of Engineering, Polytechnic Institute of Porto), Rua Dr. António Bernardino de Almeida 431, 4200-072 Porto, Portugal
* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(11), 1048; https://doi.org/10.3390/jmse14111048

Submission received: 6 May 2026 / Revised: 30 May 2026 / Accepted: 31 May 2026 / Published: 3 June 2026

(This article belongs to the Special Issue Application of Deep Learning in Underwater Image Processing—2nd Edition)


Abstract

The automated detection of deep-sea polymetallic nodules is critical for processing large volumes of benthic imagery. However, its scalability faces challenges from cross-expedition covariate shifts, such as changes in lighting, altitude, and camera payloads, which lower zero-shot model performance. While semi-supervised pseudo-labelling presents a potential alternative to time-consuming re-annotation, simple implementations can quickly lead to confirmation bias. This study identifies two primary sources of this degradation: spatial noise from tiling fragmentation at tile borders and an architecture-agnostic interior false positive floor caused by semantic domain shift. This work proposes using a multi-model ensemble for pseudo-labelling to reduce the noise impact. Using a spatial border filter and confidence stratification, three architecturally distinct teacher models (YOLOv8, Faster R-CNN, and DINO) are employed to determine a reliable and domain-invariant subspace. Under a strict anti-leakage Leave-One-Partition-Out protocol, the proposed approach surpasses the supervised fine-tuning baseline at a 100-tile pseudo-label budget across four random seeds (macro $\mathrm{mAP}_{50:95}$ of $0.4745 \pm 0.0042$ versus $0.4467 \pm 0.0079$), with gains concentrated in the most domain-shifted fold. Beyond this budget, our findings highlight two important adaptation trends: a pool-size degradation trend where excessive pseudo-label volume actively degrades generalisation, and the observation that fine-tuned models yield lower pseudo-label fidelity despite higher precision, providing evidence for the advantage of using frozen source checkpoints for cross-domain adaptation.

Keywords: polymetallic nodules; domain adaptation; object detection; pseudo-labelling; benthic imagery

1. Introduction

Exploration campaigns for polymetallic nodules increasingly rely on automated object detection to process the large volumes of seafloor imagery collected during survey operations [1]. State-of-the-art architectures such as Faster R-CNN [2], YOLOv8 [3], and DINO [4] achieve high precision for in-domain detection tasks [7]. (The DINO detector used in this work, DETR with Improved DeNoising Anchor Boxes, is distinct from the self-supervised vision transformers DINO/DINOv2 [5,6].) Despite this, direct deployment of the same models across multiple missions is limited, since campaigns usually span months or years and hardware and operational changes are inevitable over that time. These changes affect underwater light attenuation, scattering, and the pixel-to-metric resolution. The resulting operational disparities introduce severe covariate shifts across expeditions, triggering a drop in prediction performance [8] and creating a significant bottleneck for scalable nodule mapping. Standard supervised methods necessitate costly expert re-annotation for every new survey condition, whereas zero-shot applications of pre-trained models fail to generalise across these domain shifts.

Despite this domain gap, the visual characteristics of polymetallic nodules (their spherical morphology, surface texture, and contrast against surrounding sediment) are determined by mineralogy and remain consistent across expeditions regardless of imaging conditions [9]. Although overall zero-shot performance drops, the models continue to extract domain-invariant morphological features for a subset of the target data. This implies that source-trained models retain useful knowledge about nodule appearance even when low-level image statistics differ between surveys. Semi-supervised pseudo-labelling can exploit this retained knowledge and therefore offers a scalable alternative to full re-annotation [10]. However, a naive application of pseudo-labelling on cross-expedition benthic imagery typically induces confirmation bias and model degradation. Specifically, the tiling required to process high-resolution deep-sea images generates fragmentation artefacts along tile boundaries. Standard pseudo-labelling pipelines treat the resulting false positives (FP) as high-confidence predictions, propagating source-biased noise into the training signal.

The current literature generally relies on single-architecture teacher models, leading to rapid confirmation bias and confident false positives under severe covariate shifts [11,12]. Beyond that, the existing literature overlooks spatially structured, methodology-induced noise, such as the severe tiling fragmentation inherent to processing large-format benthic imagery, treating border effects merely as an inference-time issue rather than a fundamental source of training-time annotation noise [13]. Additionally, current work relies on random train-test splitting, which fails to account for inherent cross-expedition domain shifts and therefore introduces spatial data leakage that inflates performance metrics [14,15].

To address these gaps, this work introduces a multi-model ensemble pseudo-labelling pipeline, evaluated under a strict, anti-leakage Leave-One-Partition-Out cross-validation protocol, presenting three main contributions:

(1) A diagnostic framework for pseudo-label noise: the framework identifies architecture-agnostic false positives and border fragmentation, demonstrating that stratifying labels by confidence helps to isolate a domain-agnostic subspace robust to covariate shift.
(2) Analysis of teacher-model degradation: the data reveal that fine-tuning enhances target-domain average precision (AP) while simultaneously eroding pseudo-label fidelity, thereby suggesting that AP is an insufficient proxy for teacher quality and highlighting the advantage of using frozen source checkpoints.
(3) A cascaded filtering pipeline: by rectifying domain shifts and under a strict anti-leakage evaluation, this methodology surpasses supervised baselines at the 100-tile pseudo-label budget, with improvement concentrated in the most domain-shifted partition. The results also reveal a pool-size degradation effect where excessive pseudo-label volume actively degrades generalisation.

This work is structured as follows: Section 2 reviews the existing literature on semi-supervised object detection, domain adaptation, benthic object fragmentation, and ensemble evaluation methodologies. Section 3 presents the dataset and defines the strict, anti-leakage Leave-One-Partition-Out evaluation protocol used alongside three source-trained teacher models. Then, Section 4 quantifies the domain shifts. Section 5 identifies and measures the two primary sources of pseudo-label degradation. Section 6 outlines the proposed pipeline. Section 7 presents experimental results and ablation studies. Finally, Section 8 discusses the implications of the findings and Section 9 summarises the work’s main conclusions.

2. Related Work

2.1. Semi-Supervised and Domain Adaptive Object Detection

Semi-supervised learning is a machine learning paradigm that connects supervised and unsupervised learning methods [16]. The main idea is to leverage a small labelled dataset alongside a larger volume of unlabelled data. Classical approaches such as that of Rosenberg et al. [17] utilised a self-training method, in which a baseline model trained on the labelled subset estimates labels for unlabelled samples, acting as a wrapper for the usual object detection training regime. Sohn et al. [18] proposed the STAC framework, a two-stage offline approach in which the teacher is first trained on the labelled data and used to generate pseudo-labels for the unlabelled data. These pseudo-labels are then filtered. In the second stage, the model is trained using both labelled and strongly augmented unlabelled data.

A fundamental method for semi-supervised learning is the Mean Teacher framework, developed by Tarvainen and Valpola [19], in which a student model is updated by gradient descent while a teacher model is maintained as an exponential moving average (EMA) of the student’s weights. Modern deep-learning-based Semi-Supervised Object Detection (SSOD) typically builds on the Mean Teacher framework. Xu et al. [20] introduced Soft Teacher, an end-to-end framework that mitigated spatial uncertainty via geometric box jittering and continuous loss weighting. Similarly, Liu et al. [12,21] developed Unbiased Teacher and its successor v2, which together address class imbalance in pseudo-labels and extend pseudo-label filtering to anchor-free detectors via teacher–student uncertainty comparison. Zhou et al. [22] proposed Dense Teacher in order to overcome background noise. The authors’ algorithm bypasses discrete bounding boxes, favouring dense, pixel-level pseudo-supervision.

Standard SSOD suffers from confirmation bias when deployed in different environments. Li et al. [23] proposed an Adaptive Teacher to address the domain shift caused in this scenario. Their approach uses a feature-level adversarial training to align target data with source data. However, there are scenarios in which the source may not be available. In order to overcome this drawback, Hao et al. [24] demonstrated that applying simple self-training strategies to a Faster R-CNN detector, such as combining batch statistics adaptation (AdaBN) with weak–strong augmentations on a fixed set of pseudo-labels, provides competitive source-free adaptation without requiring complex adversarial alignment or unstable teacher–student mutual learning. In the same source-free scenario for video object detection, Zhang et al. [25] mitigated catastrophic failure by proposing STAR-MT, refining the temporal aggregation modules and the spatial backbone rather than fine-tuning them simultaneously. An empirical study conducted by Ericsson et al. [26] highlights the severe risks of these adaptation processes; the authors showed that without highly reliable validation criteria for model selection, source-free adaptation often fails, resulting in a performance that is inferior to an unadapted source checkpoint.

Recently, Grounding DINO [27] and the Segment Anything Model (SAM) [28] have achieved zero-shot capabilities across diverse terrestrial domains. However, deploying these massive architectures on specialised benthic imagery introduces substantial computational overhead when a large number of images must be processed. These models also rely heavily on semantic textual grounding, which may struggle to differentiate between subtle variations in deep-sea sediment and polymetallic nodules without domain-specific fine-tuning. Thus, adapting lightweight, classical architectures (such as YOLO) via pseudo-labelling remains the most computationally viable option for scalable, multi-expedition deployment.

2.2. Object Fragmentation in Benthic Computer Vision

Mbani et al. [29] developed the FaunD-Fast framework, utilising a Faster R-CNN to localise and classify benthic megafauna in the Clarion–Clipperton Zone (CCZ). FaunD-Fast reduces the manual annotation of deep-sea fauna. First, an unsupervised approach is used to evaluate superpixels, isolating visual anomalies in the seafloor. These anomalies are converted into weak bounding-box annotations that human experts simply verify and categorise, which are then used to train a Faster R-CNN model. Using a single-stage architecture, Cui et al. [30] optimised a YOLOv5 model to detect varying scales of deep-sea manganese nodules. Zurowietz and Nattkemper [31] introduced UnKnoT, which applied an unsupervised geometric scale transfer to adapt bounding box dimensions based on changes in camera altitude. However, this method requires explicit telemetry metadata. Baseline studies, such as that of Park et al. [32], report that polymetallic nodules in the south-central CCZ exist in highly concentrated patches, which makes border fragmentation an inherent challenge for tile-based detection pipelines.

Because processing full high-resolution imagery may exceed computational constraints, images are typically tiled, and solutions proposed for aerial or medical imaging can be transferred to address the resulting fragmentation of objects in benthic imagery. Frameworks like Slicing-Aided Hyper Inference (SAHI), proposed by Akyon et al. [33], mitigate the truncation of small objects by slicing images into overlapping patches and merging the resulting predictions via inference-time Non-Maximum Suppression (NMS). Rigid tile borders, however, also cause severe morphological damage during model training. In the medical image domain, Isensee et al. [34] observed that accuracy was reduced in the border regions of the patches, motivating the use of increased weighting of the centre voxels for prediction. According to Xiao et al. [35], overlapping tiles can recover the missing objects in aerial images but lower precision because of redundancy, motivating their proposed semantic filtering to suppress duplicate predictions.

2.3. Ensemble Methodologies and Evaluation Protocols

Combining multiple classifiers can overcome the statistical, computational, and representational limitations of single learning algorithms and filter pseudo-label noise reliably. Lakshminarayanan et al. [36] established that deep ensembles offer a scalable, non-Bayesian approach to uncertainty estimation that performs competitively with approximate Bayesian neural networks. Xu et al. [37] proposed a framework that combines multi-scale feature fusion and dilated convolutions to improve detection precision. Casado-García and Heras [38] addressed output-level aggregation by introducing an ensemble method that merges bounding box predictions and adaptable voting protocols. Labao and Naval [39] designed a cascaded ensemble of region-based Convolutional Neural Networks (CNNs) linked by Long Short-Term Memory (LSTM) networks to progressively refine bounding box coordinates for fish detection in highly distorted underwater environments.

Regarding the evaluation protocols, Meyer et al. [40] demonstrated that random cross-validation of spatially contiguous environmental data leads to severe spatial autocorrelation, requiring spatial blocking to prevent the network from memorising overlapping local features. In addition, Gulrajani and Lopez-Paz [41] formalised the Leave-One-Domain-Out cross-validation protocol via the DomainBed benchmark, demonstrating that entire-domain datasets must be held in reserve to measure algorithmic generalisation accurately.

3. Dataset and Evaluation Framework

3.1. Dataset Description

The dataset used in this work was provided by the International Seabed Authority (ISA) and is composed of high-resolution images collected during deep-sea polymetallic nodule exploration surveys conducted in 2018, 2021, and 2022. Since the dataset was provided by the ISA strictly as a repository of visual assets, a significant constraint of this dataset is the complete absence of physical acquisition metadata. Camera system specifications, platform type, operating altitude, and seafloor depth are not reported. Rather than discarding this data, this study operates under the methodological assumption that varying visual footprints in the images reflect changes in the missing operational telemetry. An initial visual inspection indicated that the images exhibited significant inter-annual domain shifts. Thus, the first step in curation was to partition the data by year, as demonstrated in Table 1.

The 2018 partition contains images with two resolutions (1024 × 683: 804 images; 5184 × 3456: 527 images), reflecting a camera change. The difference in resolution introduces a secondary within-partition shift, which was treated as an additional source of covariate shift that the pipeline must tolerate, consistent with the real-world heterogeneity of multi-year expedition archives.

Since processing high-resolution inputs during training is inherently computationally expensive and memory-intensive, the same tiling approach adopted in our prior benchmark study [7] was carried out, yielding a combined pool of 75,576 tiles (summarised in Table 2). The window size was set to 640 × 640 pixels with a 32-pixel overlap in both the horizontal and vertical directions. A shift-to-fit strategy was employed to handle image boundaries, ensuring the window remained entirely within the image limits. Figure 1 displays three representative tiles for the years 2018, 2021, and 2022. The images were chosen based on the median brightness of each partition to illustrate the temporal domain shifts between partitions.
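The tiling geometry described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `tile_origins` and `tile_boxes` are hypothetical, and the shift-to-fit rule is assumed to mean that the final window along each axis is moved back so that it ends flush with the image boundary.

```python
def tile_origins(length: int, window: int = 640, overlap: int = 32) -> list[int]:
    """Start offsets of sliding windows along one image axis."""
    if length <= window:
        # Axis no longer than the window: one window at the origin.
        # (How such images are padded or resized is not stated in the source.)
        return [0]
    stride = window - overlap
    origins = list(range(0, length - window + 1, stride))
    # Assumed shift-to-fit rule: if the last regular window stops short of
    # the image edge, add a final window that ends exactly at the edge.
    if origins[-1] + window < length:
        origins.append(length - window)
    return origins


def tile_boxes(width: int, height: int, window: int = 640, overlap: int = 32):
    """All tile rectangles (x1, y1, x2, y2) for one parent image."""
    return [
        (x, y, x + window, y + window)
        for y in tile_origins(height, window, overlap)
        for x in tile_origins(width, window, overlap)
    ]


if __name__ == "__main__":
    # The two image resolutions reported for the 2018 partition.
    for w, h in [(5184, 3456), (1024, 683)]:
        print((w, h), len(tile_boxes(w, h)), "tiles")
```

Under these assumptions every window stays inside the image, at the cost of the last window overlapping its neighbour by more than the nominal 32 pixels. Passing `overlap=96` gives the geometry used for the labelled subset.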

Before defining the training subsets, it is necessary to quantify the disparities between the temporal partitions. In order to achieve this, a feature-space analysis using t-distributed Stochastic Neighbour Embedding (t-SNE) was carried out. We extracted 512-dimensional penultimate layer features from the source YOLOv8 backbone (specifically, the C2f module of the last backbone stage) using global average pooling, applied to n = 500 randomly sampled tiles per temporal partition. The value of n = 500 was chosen as a practical trade-off between the computational cost and statistical representativeness of each partition. Features were first reduced to 50 dimensions using Principal Component Analysis (PCA) before the t-SNE projection (perplexity = 30, seed = 42). The projection is shown in Figure 2.

The t-SNE visualisation reveals clear separation between temporal partitions in the feature space learned on the source domain. The 2018 partition forms a largely isolated cluster in the lower-left region of the projection, while 2021 and 2022 overlap partially in another region. This spatial separation confirms that the source backbone encodes substantially different representations for tiles from different expeditions, consistent with the visual differences in illumination, contrast, substrate texture, and nodule density visible in the representative tile examples.

In the absence of physical telemetry, we quantify the operational domain shift through image-level visual proxies. Table 3 summarises the statistics. The calculated statistics provide observational evidence of shifting hardware configurations and operating altitudes. For instance, the 2021 partition exhibits a marked increase in brightness (194.95) compared to 2018 (121.97), strongly suggesting a lower operating altitude or a higher-intensity strobe configuration. The steady increase in sharpness (from 26.26 in 2018 to 48.54 in 2022), alongside shifts in the dominant red, green, and blue (RGB) channels, indicates that the underwater physics of light attenuation and scattering varied substantially between deployments. Additionally, the elevated sharpness variance in the 2018 partition reflects the mixed-resolution composition (1024 × 683 and 5184 × 3456 images), where the higher-resolution subset exhibits substantially greater sharpness values. These variations empirically establish the severity of the covariate shift that the pseudo-labelling pipeline must overcome.

To establish the ground-truth labelled subset for the experiments, 10 high-resolution parent images were selected prior to the general tiling process. While this selection was randomised, it was strictly structured to satisfy two key conditions: (1) the inclusion of representative samples from all three temporal partitions (2018, 2021, and 2022) to capture the aforementioned domain shifts, and (2) the guaranteed visual presence of polymetallic nodules.

These selected images were comprehensively annotated and subsequently processed using the established sliding window methodology modified to a 96-pixel overlap. Crops retaining less than 50% of the original bounding box area were discarded. This process yielded a total of 498 patches, which constituted our few-label ground-truth dataset. The resulting dataset is composed of 148 tiles from 2018, 210 tiles from 2021, and 140 tiles from 2022.

The original high-resolution images contained negative samples that solely depict seafloor background and are entirely devoid of polymetallic nodules. These samples were also tiled and incorporated into the dataset to improve model robustness and mitigate false positives. Since the negative samples lack temporal metadata, they were confined to the training subsets to prevent any leakage during evaluation. This labelled set is treated as a diagnostic budget: it defines the supervised baseline against which the departure of a pseudo-labelling pipeline is measured.

3.2. Leave-One-Partition-Out Evaluation Protocol

Since image tiles collected within the same year may share correlated visual properties, random train–test splitting would introduce leakage during model evaluation, potentially inflating the results. For this reason, model generalisation across survey expeditions is evaluated using a Leave-One-Partition-Out protocol, in which each of the three temporal partitions (2018, 2021, and 2022) serves as the test set in turn, producing three evaluation folds. The test partition is strictly isolated from training both in the labelled subset and in the pseudo-label pool. A complete summary of the fold definitions and tile counts is provided in Table 4.

Beyond temporal isolation, a critical implementation requirement is the enforcement of a spatial anti-leakage group constraint. The tiling process subdivides large parent images into multiple tiles with overlaps of 96 pixels in the labelled pool and 32 pixels in the unlabelled pool. Consequently, adjacent tiles share correlated visual content from the same seafloor region. To prevent spatial data leakage, where contextually similar features could be observed during both training and evaluation, all tiles derived from the same parent image are considered as an indivisible unit. Using the parent image identifier encoded in the filenames, no parent image contributes tiles to more than one split (train, validation, or test) within any given fold.
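The protocol and the group constraint can be sketched together as below. This is an illustrative sketch only: the filename pattern parsed by `parent_id`, the validation fraction, and the seed are hypothetical, since the source states only that the parent image identifier is encoded in the filenames.

```python
import random
from collections import defaultdict


def parent_id(tile_name: str) -> str:
    # HYPOTHETICAL naming scheme: "<year>_<parent>_x<col>_y<row>.png".
    # Everything before the last "_x" is taken as the parent identifier.
    return tile_name.rsplit("_x", 1)[0]


def lopo_folds(tiles, val_fraction=0.2, seed=0):
    """Yield (held_out_year, split) for each temporal partition.

    tiles: list of (tile_name, year) pairs.
    split: dict with 'train', 'val', and 'test' lists of tile names.
    """
    years = sorted({year for _, year in tiles})
    for held_out in years:
        groups = defaultdict(list)  # parent id -> tiles outside the test year
        test = []
        for name, year in tiles:
            if year == held_out:
                test.append(name)  # the whole partition is held out
            else:
                groups[parent_id(name)].append(name)
        parents = sorted(groups)
        random.Random(seed).shuffle(parents)
        n_val = max(1, round(val_fraction * len(parents)))
        val_parents = set(parents[:n_val])
        split = {"train": [], "val": [], "test": test}
        # Whole parents are assigned to one split, never individual tiles.
        for p in parents:
            split["val" if p in val_parents else "train"].extend(groups[p])
        yield held_out, split
```

Because assignment happens at the parent level, overlapping tiles cut from the same seafloor region can never straddle the train and validation splits, and the held-out year never appears in either.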

3.3. Source Domain and Teacher Models

This work uses three teacher models sourced from a benchmark evaluation conducted in [7]: DINO [4], Faster R-CNN [2], and YOLOv8s [3]. These models were originally trained and evaluated on a publicly available dataset—hereafter referred to as the source dataset—comprising seafloor imagery collected during the RV SONNE expeditions (SO268/1+2), as published by Purser et al. [42].

Besides raw performance, these three models represent fundamentally distinct detection paradigms: transformer-based set-prediction (DINO), anchor-based two-stage (Faster R-CNN), and anchor-free single-stage (YOLOv8s). The architectural diversity produces largely uncorrelated prediction errors, thereby enhancing the robustness of the subsequent pseudo-labelling process. Table 5 summarises the models and baseline performance.

4. Domain Adaptation Analysis

4.1. Sources of Domain Shift

Domain shifts in deep-sea surveys cause models trained under one set of imaging conditions to lose performance under different conditions [8,43]. Three categories are relevant: image-level shifts from underwater light propagation (attenuation, turbidity, scattering); instance-level shifts from morphological variation and background complexity across locations [43]; and operational shifts from changes in camera sensors, lens distortions, and operating altitude [44]. The severity of the domain shift in our dataset is confirmed by the t-SNE projection in Section 3 (Figure 2), where the temporal partitions form geometrically distinct clusters.

4.2. Object Size Distribution Mismatch

Beyond environmental variations, a notable source of artificial domain shift is introduced by divergent annotation protocols and physical size distributions between the independent datasets. Differences in annotator strictness during training data annotation can lead to systematically different bounding box extents. In 640 × 640 images, even minor pixel-level variations can substantially affect Intersection over Union (IoU) scores. Consequently, a model that predicts tighter bounding boxes may obtain a low IoU against a more permissive reference annotation, despite correctly localising the object and, in some cases, doing so more precisely than the ground truth.
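A small worked example makes the sensitivity concrete. The box coordinates below are invented for illustration: a 32 × 32 reference annotation is compared with predictions that are centred on the same object but tighter by a few pixels on each side.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    reference = (100, 100, 132, 132)  # illustrative permissive annotation
    for margin in (2, 4, 5):
        tight = (100 + margin, 100 + margin, 132 - margin, 132 - margin)
        print(f"margin {margin} px per side -> IoU {iou(tight, reference):.4f}")
```

Shrinking the box by 4 pixels per side leaves an IoU of 576/1024 = 0.5625, while 5 pixels per side gives 484/1024 ≈ 0.473, which already fails an IoU ≥ 0.5 match even though the prediction lies entirely on the object.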

Object sizes are another significant source of distribution shift between the source and target domain. Table 6 summarises the proportions of tiny (<16² pixels), small (16²–32²), medium (32²–96²), and large (>96²) objects in the source and target datasets. As shown in Table 6, tiny objects are absent from the source dataset, and small objects are nearly absent (0.12%), whereas they represent a non-negligible fraction of the target dataset (0.90% tiny; 15.01% small). The source dataset is composed almost entirely of medium and large objects. Hence, source-trained models are likely to be biased towards the appearance of larger polymetallic nodules, which may impair detection of smaller nodules or reduce confidence in those predictions.
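The size bins can be expressed as a short helper. This is a sketch under one assumption that the text leaves open: objects whose area falls exactly on a bin edge (16² or 32²) are assigned to the larger category, and 96² itself counts as medium because large is defined as strictly greater than 96².

```python
from collections import Counter


def size_category(width: float, height: float) -> str:
    """Assign a bounding box to a size bin by its pixel area."""
    area = width * height
    if area < 16 ** 2:
        return "tiny"
    if area < 32 ** 2:   # assumption: 16^2 itself counts as small
        return "small"
    if area <= 96 ** 2:  # assumption: 32^2 counts as medium; 96^2 is not > 96^2
        return "medium"
    return "large"


def size_proportions(boxes):
    """Percentage of boxes per bin; boxes are (x1, y1, x2, y2)."""
    counts = Counter(size_category(x2 - x1, y2 - y1) for x1, y1, x2, y2 in boxes)
    total = sum(counts.values())
    return {name: 100.0 * counts[name] / total
            for name in ("tiny", "small", "medium", "large")} if total else {}
```

Running `size_proportions` separately on the source and target annotations would produce the kind of comparison summarised in Table 6.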

4.3. Tiling Fragmentation and Border Effects

The tiling procedure described in Section 3 was designed to reduce image size while avoiding the loss of visual detail associated with resizing. Although the 96-pixel overlap aims to limit boundary truncation, the procedure can still introduce artefacts and annotation noise when nodules are located near tile borders and become fragmented across tiles, as illustrated in Figure 3. For instance, if a large nodule cluster is split such that only a minor fragment (for example, 35%) is retained within a tile, the 50% rule discards the corresponding ground-truth annotation. The source-trained model may nevertheless detect the visible fragment successfully. Because the annotation was removed as part of the tiling procedure, this valid partial detection has no corresponding ground-truth match and is therefore penalised as a false positive.
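The 50% retention rule and the 35% example can be reproduced with a few lines. The coordinates are invented for illustration, and the helper names are hypothetical.

```python
def retained_fraction(box, tile):
    """Fraction of a bounding box's area that lies inside a tile.

    Both arguments are (x1, y1, x2, y2) in parent-image pixels.
    """
    bx1, by1, bx2, by2 = box
    tx1, ty1, tx2, ty2 = tile
    iw = max(0, min(bx2, tx2) - max(bx1, tx1))
    ih = max(0, min(by2, ty2) - max(by1, ty1))
    area = (bx2 - bx1) * (by2 - by1)
    return (iw * ih) / area if area > 0 else 0.0


def keep_annotation(box, tile, min_fraction=0.5):
    """Crops retaining less than min_fraction of the box are discarded."""
    return retained_fraction(box, tile) >= min_fraction


if __name__ == "__main__":
    nodule = (605, 200, 705, 300)    # illustrative 100 x 100 box
    left_tile = (0, 0, 640, 640)     # only 35 px of the box width is inside
    right_tile = (544, 0, 1184, 640) # neighbour at a 96-pixel overlap
    print(retained_fraction(nodule, left_tile), keep_annotation(nodule, left_tile))
    print(retained_fraction(nodule, right_tile), keep_annotation(nodule, right_tile))
```

In the left tile the annotation is dropped although 35 pixels of the nodule remain visible, so any detection of that fragment is scored as a false positive. The neighbouring tile keeps the full annotation.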

Table 7 summarises the spatial distribution of annotations relative to tile boundaries across the three temporal partitions. A substantial proportion of annotations lies within the 96-pixel overlap region, ranging from 15.1% in 2018 to 22.8% in 2022, while a further 1.9–6.9% are clipped at the tile edge. These results indicate that border-adjacent annotations are common across all partitions, showing that tiling-induced fragmentation is not a marginal artefact but a systematic property of the labelled dataset.

5. Pseudo-Label Noise Diagnostic Framework

5.1. Motivation and Noise Taxonomy

Direct pseudo-labelling from a single source-trained model at a low confidence threshold generally fails to produce improvements over supervised fine-tuning. Given that the source models achieved strong within-expedition performance and the unlabelled pool contains 75,576 tiles, this failure requires targeted investigation.

If pseudo-label noise is concentrated in identifiable subsets of the prediction distribution, targeted filtering can remove the majority of false positives while maintaining True Positives (TP). The 506 labelled target samples are used as a diagnostic set, allowing direct measurement of the noise structure. Through this diagnosis, two independent categories of noise are identified, visually summarised in Figure 4.

Border Fragmentation False Positives: As established in Section 4, tiling splits nodules across adjacent frames. Due to the 50% area threshold, the fragment containing less than half the nodule receives no ground-truth annotation. Any detection of that fragment is inherently a false positive (Figure 4, red dashed line). This noise is strictly localised to tile boundaries and remains orthogonal to the visual domain gap.

The Interior False Positive Floor: After removing border-proximate predictions, a residual false positive rate persists. These predictions occur when the source model detects visual features in the target domain (e.g., sediment textures or lighting artefacts) that are absent in the source domain (Figure 4, orange solid line). Unlike border fragmentation, interior false positives are not spatially localised.

5.2. Border Fragmentation Analysis

Inference was computed at a permissive confidence threshold (conf = 0.001) across all three teacher models on the diagnostic set, enabling the quantification of the boundary effect. To avoid the annotation-induced noise described in Section 4, in which the standard $\mathrm{IoU} \geq 0.5$ matching penalises spatially correct detections, centre-point containment is adopted for the spatial analysis: a prediction is a true positive if its centre point falls within any ground-truth bounding box. To prevent multiple predictions from matching a single large ground-truth box and inflating the true positive count, aggressive NMS is applied.
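Centre-point containment can be sketched as follows. The sketch assumes that NMS has already been applied to the predictions and that boxes are (x1, y1, x2, y2) in tile pixels; the function names are hypothetical.

```python
def centre_in_box(pred, gt):
    """True if the centre of the predicted box lies inside the ground-truth box."""
    cx = (pred[0] + pred[2]) / 2
    cy = (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]


def label_predictions(preds, gts):
    """One boolean per prediction: True = true positive, False = false positive.

    A prediction counts as a true positive if its centre falls within ANY
    ground-truth box, as described in the text. Predictions are assumed to
    have been de-duplicated by NMS beforehand.
    """
    return [any(centre_in_box(p, g) for g in gts) for p in preds]


if __name__ == "__main__":
    gts = [(100, 100, 200, 200)]
    preds = [(120, 120, 180, 180),  # tighter box, centre inside -> TP
             (190, 190, 260, 260),  # centre at (225, 225), outside -> FP
             (300, 300, 340, 340)]  # unrelated region -> FP
    print(label_predictions(preds, gts))
```

Unlike IoU matching, this criterion does not depend on how tightly the reference box was drawn, which is why it suits the spatial analysis of border effects.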

The absolute border distance of each prediction is computed as

$$d_{\mathrm{border}} = \min(x_{1},\; y_{1},\; W - x_{2},\; H - y_{2}) \qquad (1)$$

where $(x_{1}, y_{1}, x_{2}, y_{2})$ are the predicted bounding box coordinates in absolute pixels and $W = H = 640$.

Predictions are binned by border distance in 8-pixel increments and the false positive rate is computed within each bin. As illustrated in Figure 5, at $d_{\mathrm{border}} = 0$, false positive rates reach 77.8% for YOLOv8, 86.5% for DINO, and 48.7% for Faster R-CNN. All curves drop sharply within 16–32 pixels of the boundary. The weaker border effect for Faster R-CNN aligns with its two-stage proposal mechanism, which applies an implicit object completeness check absent in anchor-free and attention-based architectures.
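The binned analysis can be sketched as below. This is an illustration under stated assumptions: true/false-positive flags are taken as given (for example from centre-point containment), boxes that extend slightly past the tile are clamped to a distance of zero, and the function names are hypothetical.

```python
from collections import defaultdict


def border_distance(box, width=640, height=640):
    """Equation (1): distance from the box to the nearest tile border."""
    x1, y1, x2, y2 = box
    return min(x1, y1, width - x2, height - y2)


def fp_rate_by_bin(preds, is_tp, bin_width=8):
    """False positive rate per border-distance bin.

    preds: list of (x1, y1, x2, y2); is_tp: matching list of booleans.
    Returns {bin_start_px: fp_rate}, ordered by bin start.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [false positives, total]
    for box, tp in zip(preds, is_tp):
        d = max(0, border_distance(box))  # assumption: clamp overshoot to 0
        start = int(d // bin_width) * bin_width
        counts[start][1] += 1
        if not tp:
            counts[start][0] += 1
    return {start: fp / n for start, (fp, n) in sorted(counts.items())}
```

Plotting the returned rates against the bin start for each teacher model would give curves of the kind shown in Figure 5.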

Based on the convergence point visible in Figure 5, where all three models stabilise at their respective interior floors, a spatial filter discarding all predictions with $d_{\mathrm{border}} < 64$ pixels is established. The choice of the 64-pixel spatial boundary is not arbitrary; it is intrinsically linked to the geometry of the tiling process. Since the labelled data generation utilised a 96-pixel overlap, boundary artefacts are concentrated within this margin. As demonstrated in Figure 5, the false positive rate spikes severely at the immediate edge (0–32 pixels) but decays and stabilises by the 64-pixel mark. For that reason, masking the outer two-thirds (64 pixels) of the overlap region may efficiently eliminate the methodology-induced fragmentation noise while preserving valid interior detections. These border false positive rates represent a conservative lower bound.
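The spatial filter itself reduces to a single comparison per prediction. A minimal sketch, assuming predictions are plain (x1, y1, x2, y2) tuples in tile pixels; the function name is hypothetical.

```python
def border_filter(preds, margin=64, width=640, height=640):
    """Keep only predictions whose border distance is at least `margin` pixels."""
    kept = []
    for x1, y1, x2, y2 in preds:
        d_border = min(x1, y1, width - x2, height - y2)
        if d_border >= margin:  # predictions with d_border < margin are discarded
            kept.append((x1, y1, x2, y2))
    return kept


if __name__ == "__main__":
    preds = [(64, 64, 120, 120),    # d_border = 64 -> kept
             (63, 200, 120, 260),   # d_border = 63 -> discarded
             (500, 500, 600, 600)]  # d_border = 40 -> discarded
    print(border_filter(preds))
```

With a 640 × 640 tile and a 64-pixel margin, only boxes lying entirely inside the central 512 × 512 region survive.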

The 64-pixel convergence point is consistent across all three teachers, indicating that the threshold reflects a property of the tiling procedure rather than the response characteristics of any individual model. Although the threshold was identified from the diagnostic data, it is independently derivable from the tiling geometry (two-thirds of the 96-pixel overlap), meaning it does not constitute an artificial optimisation against downstream performance. A formal cross-validated sweep at neighbouring values (32, 64, 96 pixels) on downstream AP is identified as a prioritised extension in Section 8.7.

5.3. The Interior False Positive Floor

Applying the 64-pixel spatial filter isolates the residual semantic noise in the tile interior. Under centre-point containment matching, the interior false positive rate at a 0.001 confidence threshold is 22.8% for YOLOv8, 21.9% for DINO, and 23.1% for Faster R-CNN, as illustrated in Figure 5.

The narrow spread of 1.2 percentage points across a transformer-based detector, an anchor-free single-stage detector, and an anchor-based two-stage detector rules out model-specific failure modes. This result implicates the domain shift itself: source models generate spurious low-confidence detections on target-domain features that superficially resemble source training data. This suggests that a complementary strategy operating on the confidence score is required.

5.4. Confidence Stratification: Isolating the Domain-Invariant Subspace

In order to determine whether the false positive rate is uniform across the confidence spectrum, the matching criterion switches from centre-point containment to standard

IoU \geq 0.5

. Theoretically, high-confidence predictions are better localised and fitted to object boundaries; therefore, IoU serves as the appropriate quality measure for this specific validation.

Figure 6 plots the true positive rate at increasing confidence thresholds. The pattern is consistent across all three models: the true positive rate rises monotonically from 57.3–68.1% at a 0.1 confidence threshold to 71.6–82.4% at a 0.9 confidence threshold. Therefore, the ∼22% interior false positive floor observed previously is a methodological artefact of retaining poorly calibrated, low-confidence predictions. At confidence levels

\geq 0.90

, predictions map to a domain-stable morphological subspace that remains robust to covariate shift.

The 0.90 operating point is selected at the plateau of Figure 6, where the true positive rate stabilises for all three teacher architectures. This cross-architecture consistency suggests that a threshold of conf ≥ 0.90 isolates a domain-stable prediction subspace rather than a model-specific calibration artefact. By deriving this parameter directly from the diagnostic convergence behaviour, iterative tuning against the held-out test partitions is avoided. A formal sweep at neighbouring thresholds (0.80, 0.85, 0.95) to quantify downstream sensitivity remains a natural extension of this work, as discussed in Section 8.7.
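
The stratification analysis can be sketched as follows. The sketch assumes that the true positive rate denotes the fraction of retained predictions that overlap some ground-truth box at IoU ≥ 0.5, and it omits one-to-one matching for brevity; the data and function names are hypothetical.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def true_positive_rate(preds: List[Tuple[Box, float]], gts: List[Box],
                       conf_thr: float, iou_thr: float = 0.5) -> Optional[float]:
    """Share of predictions with score >= conf_thr that match a ground-truth box."""
    kept = [box for box, score in preds if score >= conf_thr]
    if not kept:
        return None  # no predictions survive this threshold
    hits = sum(any(iou(box, gt) >= iou_thr for gt in gts) for box in kept)
    return hits / len(kept)


# Hypothetical tile: two annotated nodules, four predictions of varying confidence.
gts = [(10, 10, 50, 50), (100, 100, 140, 140)]
preds = [((10, 10, 50, 50), 0.95), ((102, 100, 142, 140), 0.60),
         ((200, 200, 240, 240), 0.30), ((60, 60, 90, 90), 0.15)]
curve = {thr: true_positive_rate(preds, gts, thr) for thr in (0.1, 0.5, 0.9)}
```

Sweeping `conf_thr` in this way yields the kind of curve plotted in Figure 6, with the low-confidence false positives dropping out as the threshold rises.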

6. Fragmentation-Aware Teacher–Student Pseudo-Labelling

6.1. Pipeline Overview and Ensemble Formulation

Mitigating the covariate shifts in cross-expedition benthic imagery requires a robust filtering approach. To this end, a multi-model ensemble pseudo-labelling pipeline is employed to systematically isolate both methodology- and domain-induced noise. Rather than relying on a single architecture, which risks rapid confirmation bias under domain shift, the adopted approach uses an orthogonal ensemble of three frozen teachers trained on the source domain: an anchor-free single-stage detector (YOLOv8), an anchor-based two-stage detector (Faster R-CNN), and a transformer-based model (DINO).

As demonstrated in the diagnostic framework (Section 5.2), these architectural differences produce uncorrelated prediction errors. A false detection triggered in YOLOv8 by a localised artefact is unlikely to be replicated by DINO’s global self-attention mechanism. Owing to the models’ orthogonality, multi-model consensus can be leveraged to suppress architecture-specific false positives. All teacher models remain frozen; no gradient updates are applied on the target domain data. The objective is to prevent the models from overspecialising to localised target-domain imaging characteristics, preserving the domain-agnostic representations necessary for cross-expedition generalisation.

6.2. Pseudo-Label Generation and Diagnostic-Guided Filtering

Unlabelled target tiles are processed by the frozen ensemble, and raw predictions are fused using Weighted Box Fusion (WBF) [45]. To preserve the confidence calibration established in the diagnostic framework, WBF is configured with an Intersection over Union (IoU) threshold of 0.50 and conf_type='max'. This ensures the fused bounding box reflects the maximum evidence for the detection across all contributing models, rather than an artificially deflated average that would penalise high-confidence predictions missed by a single architecture.

The fused predictions are subsequently passed through a cascaded filtering mechanism, with each stage directly motivated by the diagnostic findings in Section 5:

Spatial Border Filter: This mitigates the methodology-induced fragmentation noise inherent to tiling large-format benthic imagery. All predictions with a border distance d_border < 64 pixels are discarded. This addresses the 49–87% false positive rate previously identified at tile boundaries.
Confidence Stratification: This bypasses the architecture-agnostic interior false positive floor: predictions with a confidence score below 0.90 are removed.
Multi-Model Agreement: Finally, predictions must be supported by at least two of the three teacher models (IoU ≥ 0.5). This majority-consensus criterion (≥2) exploits the architectural diversity while tolerating individual model failures. A stricter agreement (three teachers) could discard predictions where one teacher fails due to architecture-specific limitations—for instance, the attention mechanism’s sensitivity in DINO—reducing pseudo-label recall without proportional precision gain. Requiring only single-model support (≥1) would fail to suppress architecture-specific hallucinations, undermining the ensemble’s noise-filtering purpose. The criterion is not a tunable hyperparameter in the conventional sense but the unique value that operationalises the ensemble’s architectural orthogonality.
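
The three filters above can be sketched as a single cascade. The sketch assumes that d_border is the distance from the box centre to the nearest tile edge, that tiles are 640 pixels wide (a hypothetical value), and that a teacher "supports" a fused box when any of its raw boxes overlaps it at IoU ≥ 0.5; all names are illustrative and the code is not the authors' implementation.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def border_distance(box: Box, tile_size: int) -> float:
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    return min(cx, cy, tile_size - cx, tile_size - cy)


def cascade_filter(fused: List[Tuple[Box, float]], teacher_boxes: Dict[str, List[Box]],
                   tile_size: int = 640, margin: float = 64.0, conf_thr: float = 0.90,
                   min_votes: int = 2, iou_thr: float = 0.5) -> List[Tuple[Box, float]]:
    kept = []
    for box, score in fused:
        if border_distance(box, tile_size) < margin:  # stage 1: spatial border filter
            continue
        if score < conf_thr:                          # stage 2: confidence stratification
            continue
        votes = sum(any(iou(box, tb) >= iou_thr for tb in boxes)
                    for boxes in teacher_boxes.values())
        if votes >= min_votes:                        # stage 3: multi-model agreement
            kept.append((box, score))
    return kept


# Hypothetical raw teacher outputs and fused boxes for one tile.
teacher_boxes = {
    "yolov8": [(300, 300, 340, 340), (100, 400, 140, 440)],
    "faster_rcnn": [(302, 300, 342, 340)],
    "dino": [(10, 10, 50, 50)],
}
fused = [((301, 300, 341, 340), 0.96),  # interior, confident, two teachers agree
         ((100, 400, 140, 440), 0.93),  # interior, confident, single teacher only
         ((10, 10, 50, 50), 0.97),      # too close to the tile border
         ((200, 200, 240, 240), 0.50)]  # below the confidence threshold
kept = cascade_filter(fused, teacher_boxes)
```

Only the first fused box passes all three stages in this example; each of the others is rejected by exactly one stage.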

After applying all filters, the resulting high-fidelity pseudo-label pool contains 167,011 bounding boxes across 55,967 tiles. Table 8 summarises the number of predictions surviving each filter stage across the full unlabelled pool. The cascade begins with 7,666,887 raw WBF-fused boxes at a permissive threshold (

conf \geq 0.001

), representing the unfiltered teacher ensemble output. The spatial filter removes 93.7% of raw predictions, reflecting the high border false positive rates detailed in Section 5.2.
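
The headline figures of the cascade can be cross-checked with simple arithmetic. The sketch below uses only the counts quoted above; because the 93.7% removal rate is rounded, the count after the spatial filter is approximate, and Table 8 remains the authoritative source.

```python
raw_boxes = 7_666_887            # WBF-fused boxes at conf >= 0.001
final_boxes = 167_011            # boxes surviving the full cascade
final_tiles = 55_967             # tiles carrying at least one surviving box
spatial_removed_fraction = 0.937

# Approximate number of boxes left after the spatial border filter alone.
after_spatial = raw_boxes * (1 - spatial_removed_fraction)

# Share of raw boxes that survive all three filters, and mean boxes per retained tile.
overall_survival = final_boxes / raw_boxes
boxes_per_tile = final_boxes / final_tiles

print(f"after spatial filter: ~{after_spatial:,.0f} boxes")
print(f"overall survival: {overall_survival:.2%}")
print(f"mean boxes per retained tile: {boxes_per_tile:.2f}")
```

Roughly 2% of the raw fused boxes reach the final pool, at about three boxes per retained tile.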

Based on inference profiling on a 1000-tile random sample, pseudo-label generation for the full 75,576-tile unlabelled pool is estimated to require approximately 4.5 GPU-hours on a single NVIDIA GeForce GTX 1080 Ti (DINO: ~2.6 h, Faster R-CNN: ~1.3 h, YOLOv8: ~0.5 h, run sequentially), excluding I/O overhead; the reported value is therefore a lower bound on total wall-clock time. This is a one-time offline cost per target expedition partition; the deployed student model runs single-architecture YOLOv8 inference at standard speed.

6.3. Manifest Construction and Anti-Leakage Constraints

To enforce the evaluation protocol (Section 3.2), the pseudo-label pool is dynamically filtered prior to training to prevent temporal data leakage. All tiles in the pseudo-label pool are parsed by acquisition year; any tile matching the temporally held-out test partition is entirely excluded from the training manifest.

From the eligible remaining unlabelled tiles, a random sample of tiles is drawn without replacement for each training setup, balanced to preserve the approximate year distribution of the available non-test partitions. This sampled subset is then concatenated with the fold’s labelled ground-truth training set. After temporal exclusion, the maximum available pseudo-label pool sizes are 36,772 tiles for fold 0, 47,195 tiles for fold 1, and 27,967 tiles for fold 2.
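
A minimal sketch of the manifest construction is given below. It assumes that each tile carries its acquisition year and that "balanced" means proportional allocation per year with largest-remainder rounding; the tile identifiers, the function name, and the allocation rule are illustrative assumptions rather than the authors' code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def build_pseudo_manifest(tiles: List[Tuple[str, int]], test_year: int,
                          budget: int, seed: int = 42) -> List[str]:
    """Sample `budget` tile ids without replacement, excluding the held-out year."""
    by_year: Dict[int, List[str]] = defaultdict(list)
    for tile_id, year in tiles:
        if year != test_year:  # anti-leakage: drop every tile from the test year
            by_year[year].append(tile_id)

    eligible = sum(len(ids) for ids in by_year.values())
    if budget > eligible:
        raise ValueError("budget exceeds the eligible pseudo-label pool")

    # Proportional quota per year, then hand out the rounding leftovers
    # to the years with the largest fractional remainders.
    exact = {y: budget * len(ids) / eligible for y, ids in by_year.items()}
    quota = {y: int(v) for y, v in exact.items()}
    leftover = budget - sum(quota.values())
    for y in sorted(exact, key=lambda y: exact[y] - quota[y], reverse=True)[:leftover]:
        quota[y] += 1

    rng = random.Random(seed)
    manifest: List[str] = []
    for y in sorted(by_year):
        manifest.extend(rng.sample(by_year[y], quota[y]))
    return manifest


# Hypothetical pool: 10 tiles from 2018, 30 from 2021, 60 from 2022.
tiles = [(f"{y}_{i:04d}", y) for y, n in ((2018, 10), (2021, 30), (2022, 60))
         for i in range(n)]
manifest = build_pseudo_manifest(tiles, test_year=2018, budget=9)
```

With 2018 held out, the nine sampled tiles split three to six between 2021 and 2022, mirroring the 30:60 ratio of the eligible pool.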

6.4. Student Fine-Tuning Configuration

The student model (YOLOv8) is fine-tuned from the source domain checkpoint using the combined ground truth and pseudo-labelled tiles. Training hyperparameters are standardised across all experimental conditions to ensure performance deltas are attributable solely to training data composition (Table 9).

The model is updated using the AdamW optimizer with a cosine learning rate decay (lrf = 0.01). We define a low initial learning rate lr = 0.0001 and freeze the first 10 layers of the student’s backbone, preventing catastrophic forgetting of the features learned during source training. This partial freezing strategy allows the detection head and deeper semantic layers to cautiously adapt to the target domain without overwriting the early backbone features that transfer across expeditions. Figure 7 displays a summary of the pipeline.
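
The learning rate schedule implied by these settings can be sketched as follows, assuming that lrf denotes the final learning rate as a fraction of the initial one and that the decay follows a standard half-cosine; the epoch count of 50 is a hypothetical value, since the actual training length is given in Table 9.

```python
import math


def cosine_lr(epoch: int, total_epochs: int, lr0: float = 1e-4, lrf: float = 0.01) -> float:
    """Cosine decay from lr0 at epoch 0 to lr0 * lrf at epoch == total_epochs."""
    progress = (1 - math.cos(math.pi * epoch / total_epochs)) / 2  # goes from 0 to 1
    return lr0 * (1 + progress * (lrf - 1))


total_epochs = 50  # hypothetical value for illustration
schedule = [cosine_lr(e, total_epochs) for e in range(total_epochs + 1)]
```

Under these assumptions the learning rate starts at 1e-4 and ends at 1e-6, so the unfrozen layers receive progressively smaller updates as training proceeds.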

7. Experimental Results

7.1. Experimental Conditions

We evaluate seven experimental conditions under the evaluation protocol described in Section 3.2. Table 10 summarises their definitions. Conditions A and B are baselines requiring no pseudo-labels. Conditions C through G progressively add pipeline components, forming a structured ablation that isolates the contribution of each design decision identified in Section 5. All conditions use identical fine-tuning hyperparameters (Table 9), ensuring that performance differences are attributable solely to training data composition.

Each condition adds exactly one element over the previous, so the performance difference between adjacent conditions isolates the contribution of that element. Condition C is naïve pseudo-labelling, in which every detection is treated as a pseudo-label. The transition from C to D isolates the spatial filter, while the transition from D to E isolates the confidence threshold. The transition from E to F isolates the ensemble agreement. Finally, the transition from F to G isolates an early distillation step, in which the model is warmed up for a few epochs using the labelled data and the pseudo-labels are then re-evaluated. Condition H (Mean Teacher) is evaluated independently as a standard SSOD baseline.

7.2. Baselines

7.2.1. Condition A—Zero-Shot Transfer

Applying the source checkpoint directly to target test tiles without any fine-tuning yields macro mAP_50:95 = 0.253, mAP₅₀ = 0.520, and mAP₇₅ = 0.175. The disproportionately low mAP₇₅ relative to mAP₅₀ reflects a localisation deficit consistent with the annotation-prediction granularity mismatch described in Section 5: the source model detects nodule regions but produces boxes that do not conform to the relaxed annotation style of the target dataset, failing the tighter IoU thresholds. This result establishes the performance floor—that is, the cost of no adaptation—against which all subsequent conditions are measured.

7.2.2. Condition B—Supervised Fine-Tuning

Across four random seeds (42, 365, 1234, 2026), Condition B achieves a macro mAP_50:95 of 0.4467 ± 0.0079 as displayed in Table 11. Table 12 reports per-fold results at seed 42. The wide gap between fold_0 (mAP_50:95 = 0.262) and folds 1–2 (0.529, 0.548) is consistent with the feature space separation documented in Section 4: the 2018 partition occupies a largely isolated region of the t-SNE projection, indicating that a model trained on 2021 and 2022 labelled tiles must generalise to a substantially different visual domain. The macro supervised baseline of 0.4467 is the primary comparison target for all pseudo-labelling conditions.
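
For reference, the macro figure can be reproduced from the per-fold values, assuming that "macro" denotes the unweighted mean over the three folds. The sketch uses the seed-42 per-fold values quoted above, so it yields the seed-42 macro score rather than the four-seed mean of 0.4467.

```python
# Seed-42 per-fold mAP_50:95 for Condition B, as quoted in the text.
fold_map = {"fold_0": 0.262, "fold_1": 0.529, "fold_2": 0.548}

# Unweighted (macro) average: every fold counts equally, regardless of tile count.
macro_map = sum(fold_map.values()) / len(fold_map)
print(f"macro mAP_50:95 (seed 42): {macro_map:.4f}")
```

Because every fold is weighted equally, the weak fold_0 result pulls the macro score well below the level of folds 1 and 2.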

7.2.3. Semi-Supervised Baseline (Mean Teacher)

To benchmark against SSOD methods, an iterative Mean Teacher baseline was evaluated at the 100-tile pseudo-label budget. The student model was trained while an Exponential Moving Average (EMA) of its weights acted as the teacher to continuously generate pseudo-labels. Across the four random seeds, the Mean Teacher baseline achieved a macro

mAP_{50:95}

of

0.4189 \pm 0.0215

as displayed in Table 11. This approach underperforms the Condition B baseline (

0.4467 \pm 0.0079

), indicating that without targeted spatial and confidence filtering, standard mutual-learning SSOD frameworks are highly susceptible to confirmation bias under severe cross-expedition domain shifts, as the teacher actively propagates border fragmentation and interior semantic noise into the student’s training signal.
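
The EMA teacher update underlying Mean Teacher can be sketched as follows. The decay value and the dictionary-of-floats parameter representation are illustrative assumptions; the decay used in the experiments is not stated in this section.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float = 0.999) -> Dict[str, float]:
    """In-place update: teacher <- decay * teacher + (1 - decay) * student."""
    for name, value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * value
    return teacher


# Toy parameters: the teacher drifts towards the student after every step.
teacher = {"w": 1.0}
student = {"w": 0.0}
ema_update(teacher, student, decay=0.9)  # teacher["w"] is now about 0.9
ema_update(teacher, student, decay=0.9)  # teacher["w"] is now about 0.81
```

Since the teacher is a smoothed copy of the student, any border or interior noise absorbed by the student is fed back into subsequent pseudo-labels, which is the confirmation-bias loop described above.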

Table 12 reports the per-fold breakdown at seed 42. Mean Teacher underperforms Condition B on folds 1 and 2 (

Δ = - 0.0273

and

Δ = - 0.0283

, respectively), where the supervised baseline is already strong. On fold 0, where the domain shift is largest, Mean Teacher performs comparably to the supervised baseline (

Δ = 0.0140

). This asymmetry mirrors the performance pattern of Condition F-100, where the most significant gains are similarly concentrated in fold 0.

Table 11 summarises the three baseline conditions against which all pseudo-labelling pipelines are compared.

7.3. Component Ablation

Table 13 presents all conditions at the 100-tile pseudo-label budget, the only pool size at which the proposed pipeline produces positive results. Each ablation condition adds exactly one pipeline component over the previous, isolating the contribution of each design decision. Multi-seed results across all pool sizes (100–5000 tiles) for each condition are reported in Appendix A.

7.3.1. Condition C—Naive Single-Model Pseudo-Labelling

Condition C uses a single YOLOv8 teacher at conf = 0.001 without spatial filtering, representing the simplest possible pseudo-labelling approach. At the smallest pool size (100 tiles), Condition C achieves a macro mAP_50:95 of 0.4407 ± 0.0077, statistically indistinguishable from the supervised baseline (0.4467 ± 0.0079). This indicates that at very small pool sizes, naive pseudo-labelling neither helps nor hurts: the random sample draws too few border-fragmentation false positives to corrupt training, but also too few high-quality interior detections to produce a measurable improvement. Performance degrades monotonically with pool size, falling to

0.3458 \pm 0.0332

at 5000 tiles (Appendix A, Table A1). This degradation pattern is consistent with the diagnostic framework predictions in Section 5. At small pool sizes, the absolute number of injected false positives is small enough to be absorbed by the labelled ground-truth signal; as the pool size grows, the false positive count grows proportionally and overwhelms the useful signal.

7.3.2. Condition D—Single-Model with Spatial Filter

Condition D adds the 64-pixel spatial border filter to Condition C, retaining the single YOLOv8 teacher at conf = 0.001. Multi-seed macro mAP_50:95 across four random seeds (42, 365, 1234, 2026) is reported in Table A2.

At 100 pseudo-labelled tiles, Condition D achieves a macro mAP_50:95 of

0.4409 \pm 0.0050

, statistically indistinguishable from Condition C at the same pool size (

0.4407 \pm 0.0077

) and from the supervised baseline (

0.4467 \pm 0.0079

). Across all evaluated pool sizes, the two single-model conditions are within 0.01 of each other in macro mAP_50:95, and their degradation curves are nearly identical: Condition D declines monotonically from

0.4409 \pm 0.0050

at 100 tiles to

0.3491 \pm 0.0276

at 5000 tiles, mirroring Condition C’s decline from

0.4407 \pm 0.0077

to

0.3458 \pm 0.0332

over the same range.

At conf = 0.001, applying the spatial border filter alone provides no measurable improvement over no filtering at all. The interpretation is consistent with the diagnostic framework. At low-confidence thresholds, the dominant noise contribution is not border fragmentation but the architecture-agnostic interior false positive floor. Removing border-region predictions alone leaves the interior FP burden untouched, and the small reduction in border noise is offset by the loss of legitimate edge detections discarded by the same filter. The spatial filter therefore requires confidence stratification to become useful, motivating the move to high-confidence thresholding in Condition E.

7.3.3. Condition E—Single-Model with Spatial Filter and Confidence Threshold

Condition E differs from Condition D by adding the high-confidence threshold. The objective is to analyse whether the confidence filter isolates the domain-invariant subspace and potentially breaks the interior false positive floor. Table A3 shows the multi-seed macro results.

Despite applying both the spatial border filter and the high-confidence threshold, Condition E fails to surpass the supervised baseline at low pseudo-label counts. At 100 tiles, Condition E achieves 0.4353, marginally below Condition B (0.4467), whereas Condition F reaches 0.4745 at the same count. This gap isolates the contribution of multi-model ensemble agreement: a single YOLOv8 teacher, even at

conf \geq 0.90

, still admits high-confidence artefacts that no single-architecture filter can suppress. The ensemble agreement requirement in Condition F acts as a second independent gate—a false positive that YOLOv8 fires on with high confidence is unlikely to be replicated by DINO’s global attention mechanism or Faster R-CNN’s two-stage proposal filter, and is therefore discarded. This is consistent with the diagnostic finding in Section 5.3, where the interior FP floor was architecture-agnostic, meaning that while all three models share the floor, their individual high-confidence errors are not correlated.

7.3.4. Condition F—Multi-Model Ensemble Pseudo-Labelling

Condition F implements the full pipeline: three frozen source teacher models,

conf \geq 0.90

,

d_{border} < 64

px spatial filter, and ensemble agreement

\geq 2

. Table 12 reports results across all six pseudo-label pool sizes at seed 42 and Table 14 presents the macro multi-seed results.

At 100 pseudo-labelled tiles, Condition F achieves macro mAP_50:95

= 0.4745 \pm 0.0042

across four random seeds, compared to

0.4467 \pm 0.0079

for the supervised fine-tuning baseline (Condition B). The improvement of

0.028

is robust across all seeds: every F-100 seed value (range

0.4691

–

0.4782

) exceeds every Condition B seed value (range

0.4362

–

0.4549

). A paired-samples t-test confirmed that this macro

mAP_{50:95}

improvement in the proposed pipeline (Condition F-100) over the supervised baseline (Condition B) is statistically significant (

t(3) = 5.845, p < 0.01

), with a large standardised effect size (Cohen’s

d = 4.42

).
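
The order of magnitude of the effect size can be checked from the reported summary statistics. The sketch below assumes the pooled-standard-deviation form of Cohen's d for two equally sized groups, which is only one of several common definitions; the text does not state which variant was used, and the paired t statistic cannot be recomputed without the per-seed values.

```python
import math


def cohens_d_pooled(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Cohen's d using the pooled SD of two equally sized groups (assumed variant)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled_sd


# Reported four-seed means and standard deviations for F-100 and Condition B.
d = cohens_d_pooled(0.4745, 0.0042, 0.4467, 0.0079)
print(f"Cohen's d (pooled SD): {d:.2f}")
```

Under this assumption the result is approximately 4.4, close to the reported 4.42; the small gap is plausibly due to rounding in the reported means and standard deviations.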

The macro improvement over the supervised baseline is driven primarily by fold_0, the fold with the largest visible domain gap in the t-SNE analysis (Section 3), while folds 1 and 2 show marginal gains. The marginal gains on folds 1 and 2 are consistent with the already strong performance of the supervised baseline on those folds. This suggests that the pipeline’s pseudo-labels contribute most where source–target shift is most severe, and contribute little where supervised learning already generalises well. Furthermore, Condition F-100 outperforms the standard Mean Teacher SSOD baseline (

0.4745 \pm 0.0042

vs.

0.4189 \pm 0.0215

), suggesting the necessity of the filtering cascade for cross-domain benthic applications.

F-100 also dominates the three single-architecture ablations at the same pool size: vs. C-100 (

0.4407 \pm 0.0077

), vs. D-100 (

0.4409 \pm 0.0050

), and vs. E-100 (

0.4353 \pm 0.0049

). This dominance pattern isolates the contribution of multi-architecture consensus: neither the spatial filter alone (D), nor the spatial filter combined with the confidence filter on a single teacher (E), nor their absence (C), is sufficient to surpass the supervised baseline. Only the combination of spatial filtering, confidence threshold, and multi-architecture ensemble agreement produces a statistically significant improvement.

Beyond 100 pseudo-labelled tiles, macro mAP_50:95 decreases monotonically to

0.4020 \pm 0.0295

at 1000 tiles before partially recovering at 2000 (0.4331 ± 0.0112) and declining again at 5000 (0.4316 ± 0.0155). Figure 8 plots this pool-size degradation curve. This pool-size degradation pattern mirrors that observed in Conditions C, D, and E, indicating that the phenomenon is a property of the underlying pseudo-label noise distribution rather than of any specific filtering scheme.

Figure 9 visualises the multi-seed distribution. At the 100-tile operating point, the entire Condition F distribution lies above the observed Condition B range (every pipeline seed exceeds every baseline seed). The widening interquartile ranges at 500 and 1000 tiles illustrate the network destabilisation caused by accumulating false positive noise as the pool grows.

7.3.5. Condition G—Early Distillation

Condition G extends Condition F with a two-phase early distillation-inspired training procedure: Phase 1 warms up the student on ground-truth and pseudo-labels for 10 epochs, after which the source model re-screens all pseudo-label boxes, retaining only those with

IoU \geq 0.3

overlap with a source model prediction at conf = 0.3. Phase 2 trains on the refined pseudo-label set. Multi-seed macro mAP_50:95 across four random seeds (42, 365, 1234, 2026) is reported in Table A4.
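
The re-screening step between the two phases can be sketched as follows. The sketch assumes that "at conf = 0.3" means source predictions with a score of at least 0.3 and that a pseudo-label is kept when it overlaps any such prediction at IoU ≥ 0.3; the boxes and function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def rescreen(pseudo_boxes: List[Box], source_preds: List[Tuple[Box, float]],
             iou_thr: float = 0.3, conf_thr: float = 0.3) -> Tuple[List[Box], float]:
    """Keep pseudo-labels confirmed by a sufficiently confident source prediction."""
    confident = [box for box, score in source_preds if score >= conf_thr]
    kept = [pb for pb in pseudo_boxes
            if any(iou(pb, sb) >= iou_thr for sb in confident)]
    keep_rate = len(kept) / len(pseudo_boxes) if pseudo_boxes else 0.0
    return kept, keep_rate


# Hypothetical tile: the second pseudo-label is only backed by a low-confidence box.
pseudo_boxes = [(100, 100, 140, 140), (300, 300, 340, 340)]
source_preds = [((105, 100, 145, 140), 0.8), ((300, 300, 340, 340), 0.2)]
kept, keep_rate = rescreen(pseudo_boxes, source_preds)
```

The keep rate returned here is the quantity referred to in the next paragraph: when it stays close to 1, the re-screening removes almost nothing.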

The keep rate across all pseudo-labels indicates that the conf

\geq 0.90

pre-filter applied during pseudo-label generation already retains only predictions about which the source model is highly confident, leaving almost nothing for the early-distillation re-screening step to remove. At 100 pseudo-labelled tiles, Condition G achieves macro mAP_50:95 =

0.4372 \pm 0.0158

, which is statistically indistinguishable from Conditions C-100 (

0.4407 \pm 0.0077

), D-100 (

0.4409 \pm 0.0050

), E-100 (

0.4353 \pm 0.0049

), and the supervised baseline (

0.4467 \pm 0.0079

); and substantially below F-100 (

0.4745 \pm 0.0042

).

Condition G establishes a redundancy result: when pseudo-labels are pre-filtered at conf ≥ 0.90, the early-distillation source-model re-screening step provides no additional discriminative signal because the upstream confidence filter has already isolated the same clean subset. In essence, the two approaches are substitutes rather than complements: stacking both yields no further gain and increases variance across seeds (±0.0158 for G-100 versus ±0.0042 for F-100).

7.4. Ablation Studies

7.4.1. Bounding Box Fusion Strategy and Spatial Convergence

In order to evaluate the impact of bounding box fusion strategies, the WBF was compared against a standard greedy consensus clustering approach on a sample of 500 unlabelled target tiles drawn from the pseudo-label pool. During evaluation, it was observed that configuring WBF to average confidence scores aggressively penalized the ensemble when models exhibited varying confidence calibrations, causing a 33% artificial drop in surviving boxes (941 versus 1412). However, when WBF was configured to retain the maximum confidence score, it produced 1495 surviving boxes, nearly identical to the 1494 boxes generated by the standard consensus clustering. This negligible difference of a single bounding box across 500 tiles provides empirical evidence for the spatial convergence hypothesized in Section 6. At the strictly filtered

\geq 0.90

operating point, the spatial predictions of the three architecturally distinct models are already tightly aligned. Because there is virtually no localisation jitter or overlapping box noise remaining for WBF to correct, sophisticated box fusion becomes effectively equivalent to simple consensus agreement in this high-confidence subspace.
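
The effect of the confidence aggregation mode can be illustrated with a simplified scoring sketch. It assumes equal model weights and that, in averaging mode, the mean score of a cluster is additionally scaled by the fraction of models contributing to it, which is how averaging-style WBF is commonly understood to behave; it is not the fusion library's code, and the scores are hypothetical.

```python
from typing import List


def fused_score(scores: List[float], n_models: int, conf_type: str = "avg") -> float:
    """Confidence assigned to one fused cluster (simplified, equal model weights)."""
    if conf_type == "max":
        return max(scores)
    # Averaging mode: mean of the contributing scores, scaled down
    # when some of the models did not contribute a box to the cluster.
    mean_score = sum(scores) / len(scores)
    return mean_score * min(len(scores), n_models) / n_models


# Two of three teachers detect the object with high confidence; the third misses it.
scores = [0.95, 0.93]
avg_conf = fused_score(scores, n_models=3, conf_type="avg")
max_conf = fused_score(scores, n_models=3, conf_type="max")
survives_avg = avg_conf >= 0.90
survives_max = max_conf >= 0.90
```

In this example the averaged score falls to roughly 0.63 and the box is rejected by the 0.90 threshold despite two-teacher agreement, whereas the maximum-score mode keeps it. This is consistent with the drop in surviving boxes observed for the averaging configuration.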

7.4.2. Fine-Tuned Teacher

Despite Condition B achieving a significantly higher target-domain mAP_50:95 (0.4467) than the zero-shot source model (0.253), deploying it as a teacher becomes detrimental to pseudo-label fidelity. Table 15 shows a comparison of the false positive rates for the two models. Both checkpoints were evaluated on identical image and label sets using the same matching protocol (centre-point and IoU ≥ 0.5).

On the 2018 fold, the fine-tuned teacher produced an interior false positive rate of 52.26% (451 catastrophic hallucinations), compared to the 37.76% rate (233 hallucinations) maintained by the frozen source model. The same pattern repeats on the other folds. This contrast suggests that while fine-tuning on a small target subset improves confident bounding box regression, it triggers severe overspecialisation to local background textures.

Additionally, we conducted a single-seed end-to-end pseudo-labelling run using the fold-specific Condition B fine-tuned checkpoint as the YOLOv8 teacher, matching the Condition E configuration at the 100-tile budget. Condition E was selected as the test bed rather than Condition F because substituting a fine-tuned teacher within the full three-teacher ensemble would be confounded by the ensemble agreement filter, which would absorb fine-tuned teacher false positives through the remaining frozen Faster R-CNN and DINO teachers. This validation run yielded a macro

mAP_{50:95}

of 0.4447, failing to surpass the comparable supervised baseline (0.4464). The elevated false positive rate of the fine-tuned teacher does not translate into downstream gains, with single-teacher fine-tuned pseudo-labelling performing similarly to its frozen-source counterpart and both remaining near the supervised baseline.

7.5. Border vs. Interior Detection Analysis

To verify that the proposed pipeline improves interior detection specifically, rather than detection across the tile indiscriminately, Common Objects in Context (COCO) metrics evaluation is split by ground truth annotation position. A ground truth box is classified as border if its centre falls within 64 pixels of any tile edge, and as interior otherwise. Table 16 reports per-condition mAP_50:95 for each split.

Interior mAP exceeds overall mAP for Conditions B and F-100 because border annotations, which are subject to tiling fragmentation and annotation clipping, suppress the overall AP when included; their removal reveals the model’s stronger performance on complete, unambiguous interior objects.

As shown in Table 16, Condition F improves both interior and border mAP over Condition B. This indicates that filtering border noise during pseudo-label generation improves the quality of interior pseudo-labels, which in turn improves the student’s interior detection performance and also benefits border detection, even though border pseudo-labels were discarded by design.

7.6. Summary

The proposed pipeline at 100 pseudo-labelled tiles achieves the largest improvement over the supervised baseline (Table 13). Neither the spatial filter alone (D), nor the spatial filter combined with the confidence filter on a single teacher (E), nor their absence (C), nor the addition of early-distillation refinement on top of the multi-architecture filter (G), produces a measurable improvement over the baseline at any evaluated pool size, indicating the necessity of the multi-architecture ensemble agreement. In addition, all conditions exhibit performance degradation as the pool size grows (Table 14 and Appendix A). The replication of this pattern across the five pipelines indicates that pool-size effects are a property of the underlying pseudo-label noise distribution under cross-expedition domain shift, not an artefact of any specific filtering scheme.

8. Discussion

8.1. Why the Pipeline Succeeds at Low Pseudo-Label Counts

As established in Section 7.3.4, the macro mAP_50:95 improvement at 100 pseudo-labelled tiles is concentrated in fold 0 (

Δ = + 0.043

at seed 42), the fold with the largest domain shift. Fold_0 holds out 2018 as the test partition and trains on 2021 and 2022 labelled tiles only. The filter cascade produces a high-fidelity but small effective signal; at low pool sizes, this signal is informative because it is not yet overwhelmed by either (a) within-pool noise or (b) the dilution of the labelled ground-truth contribution.

Folds 1 and 2 benefit less since their supervised baselines are already strong (

0.529

and

0.548

at seed 42) and the gap between supervised and optimal performance is smaller, leaving less room for pseudo-labels to add useful signal before noise begins to dominate.

8.2. The Pool-Size Degradation Trend

The monotonic degradation beyond 100 pseudo-labelled tiles is replicated across all five pseudo-labelling conditions evaluated in this work. While the absolute values differ, the trend is consistent: a small optimum near 100–200 tiles, a monotonic decline through 1000–2000 tiles, and partial stabilisation at the largest pool sizes.

Because this pattern recurs across distinct pipelines, the degradation trend appears to be a property of the underlying pseudo-label noise distribution rather than of any specific filtering condition. As the pool size grows, the cumulative false positive count grows proportionally and eventually exceeds the recovery capacity of the labelled ground-truth signal.

Within Condition F specifically, the 1000-tile point has the highest variance, suggesting that the network destabilises before partially re-equilibrating at larger pool sizes. To fully evaluate this effect, per-tile pseudo-label profiling would be required.

8.3. The Fine-Tuned Teacher Overspecialisation Effect

Replacing the frozen source model with the Condition B model degrades teacher quality, despite Condition B’s superior detection metrics. This confirms that fine-tuning on small target subsets induces overfitting. With only a few training tiles per fold, the model overspecialises to partition-specific imaging characteristics rather than extracting the generalised representations required for robust target-domain performance. Similarly, the Mean Teacher result provides corroboration through a different teacher-update mechanism. EMA-based updates cause the student to underperform the supervised baseline.

A counter-intuitive aspect is worth noting. The degradation is largest on Folds 1 and 2, not on Fold 0, where the t-SNE analysis indicated the most severe domain shift. In Fold 0, where the supervised baseline is weakest (0.2617), one hypothesis is that the model has a less coherent partition-specific signal to specialise to, limiting the magnitude of overspecialisation. The implication is that fine-tuned teacher quality degrades fastest precisely where the labelled target set is most representative of the training partitions.

In low-annotation cross-domain settings like the one evaluated here, fine-tuning is likely to degrade teacher quality even as it improves detection performance on the labelled evaluation set. The standard practice of using the best-performing checkpoint as the pseudo-label teacher should therefore be treated with caution in such settings.

8.4. Architecture-Agnostic Nature of the Interior FP Floor

The architecture-agnostic nature of the interior false positive floor established in Section 5.3 is a stronger result than the individual measurements suggest. These architectures are structurally heterogeneous, differing fundamentally in their prediction mechanisms (dense grid vs. region proposal vs. bipartite matching), spatial context aggregation, and confidence calibration. This shared performance floor strongly suggests that the bottleneck stems from the domain shift between source and target images, rather than a flaw in the models themselves. Consequently, further architectural search is unlikely to yield improvements; overcoming this plateau necessitates interventions within the confidence dimension.

8.5. Small Nodules Detection

As detailed in Table 6, the source model never saw small targets during training and may therefore be unable to detect them in the target domain with sufficient confidence to survive the filtering cascade. The labelled ground truth contains a representative distribution of tiny (

0.90 %

) and small (

15.01 %

) nodules, while analysis of the filtered pseudo-label pool shows that it is composed entirely of medium (

53.62 %

) and large (

46.38 %

) nodules. Hence, the student model relies entirely on the small labelled ground-truth set to learn representations for small and tiny nodules, while the pseudo-label pool strictly reinforces medium and large morphologies.

The small-object analysis is restricted to fold 0 (2018) because the 2021 and 2022 test partitions contain no small-object ground-truth annotations in their held-out tiles, reflecting a genuine property of the dataset’s size distribution across survey periods rather than an evaluation gap. Evaluating fold 0 demonstrates that the proposed pipeline (Condition F-100) improves small-nodule detection over the supervised baseline (Condition B), thereby not inducing catastrophic forgetting. As shown in Table 17, the proposed pipeline increases precision and recall for small objects. This improvement can be attributed to the scale-invariant morphology of polymetallic nodules. Because nodules share the same surface texture regardless of physical size, the abundant medium and large pseudo-labels effectively train the network’s backbone to separate general nodule features from the target-domain background, improving its localisation of small nodules without requiring size-specific pseudo-supervision. Cross-fold evaluation of this finding requires future survey data with richer small-object coverage.

8.6. Limitations

A major limitation of this study is the small size of the fully annotated dataset, which comprises only 498 tiles gathered from 10 high-resolution parent images. Although the results suggest that few annotations can yield an improvement over the supervised baseline at the optimal pool size and that the evaluation protocol prevents spatial leakage, the limited size restricts the assessment of more severe morphological outliers. Furthermore, sensitivity to this specific selection, for instance, whether reshuffling the ten parent images would shift the measured results between the proposed pipeline and the supervised baseline, is not quantified in this work.

Multi-seed evaluation was performed for all primary conditions (B, C, D, E, F, G, H) across the full pool-size sweep, providing statistical comparisons for the central ablation findings. Beyond sampling variance, the evaluation protocol assesses generalisation across the three temporal partitions in the dataset but cannot evaluate generalisation to entirely unseen survey areas or camera systems not represented in any partition. The protocol is rigorous within its scope but does not constitute a claim about out-of-distribution generalisation beyond the dataset. In addition, the absence of physical acquisition metadata requires that the operational domain shifts be inferred from visual proxies rather than confirmed directly through instrumentation telemetry.

The 64-pixel spatial filter and 0.90 confidence threshold are derived from plateau behaviour in the diagnostic curves (Figure 5 and Figure 6) rather than from a formal threshold sweep on downstream AP. A systematic sensitivity analysis at neighbouring threshold values would strengthen the generality of the design choices and is a natural extension.
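
As an illustration of how the two thresholds act together, the following minimal sketch applies a 64-pixel border margin and a 0.90 confidence cut-off to tile-local detections. It is not the released pipeline: the `Detection` type, the function names, and the definition of border distance as the distance from the box centre to the nearest tile edge are assumptions made for this example.

```python
from dataclasses import dataclass

TILE_SIZE = 640        # tile side in pixels
BORDER_MARGIN = 64     # spatial filter margin in pixels
CONF_THRESHOLD = 0.90  # confidence operating point


@dataclass(frozen=True)
class Detection:
    """Axis-aligned box in tile-local pixel coordinates with a confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    conf: float


def border_distance(det, tile_size=TILE_SIZE):
    """Distance from the box centre to the nearest tile edge (assumed definition)."""
    cx = (det.x1 + det.x2) / 2.0
    cy = (det.y1 + det.y2) / 2.0
    return min(cx, cy, tile_size - cx, tile_size - cy)


def cascade_filter(dets, margin=BORDER_MARGIN, conf=CONF_THRESHOLD):
    """Apply the spatial filter first, then the confidence filter."""
    interior = [d for d in dets if border_distance(d) >= margin]
    return [d for d in interior if d.conf >= conf]


example = [
    Detection(10, 300, 50, 340, 0.97),    # confident, but centred 30 px from the left edge
    Detection(300, 300, 340, 340, 0.95),  # interior and confident
    Detection(300, 300, 340, 340, 0.60),  # interior, but below the confidence cut-off
]
kept = cascade_filter(example)
```

Only the second detection survives both stages. A sensitivity analysis of the kind described above would amount to re-running such a filter with neighbouring values of `margin` and `conf` and measuring downstream AP.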

8.7. Future Work

The confidence analysis in Section 5.4 identifies $\text{conf} \geq 0.90$ as the operating point at which the true positive rate stabilises. Although this threshold was derived empirically from diagnostic convergence, a formal cross-validated sweep at neighbouring spatial and confidence values to quantify downstream sensitivity remains a natural extension of this work. Additionally, a dynamic confidence threshold, such as the Gaussian Mixture Model (GMM) used in Consistent Teacher [46], could adapt to partition-specific calibration differences and potentially improve pseudo-label quality at lower confidence levels, expanding the effective pool size beyond 100 tiles. This could recover smaller nodules that produce lower confidence scores without readmitting border noise artefacts.
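
To make the idea of a dynamic threshold concrete, the sketch below fits a two-component one-dimensional Gaussian mixture to a set of confidence scores with plain expectation-maximisation and takes, as the threshold, the lowest score assigned to the higher-mean component. This is a simplified illustration written for this section, not the Consistent Teacher implementation; the function names, the initialisation, and the thresholding rule are assumptions.

```python
import math


def fit_two_component_gmm(scores, iters=100):
    """Fit a 1-D two-component Gaussian mixture to scores with EM."""
    lo, hi = min(scores), max(scores)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    var0 = max((hi - lo) ** 2 / 16.0, 1e-6)
    variances = [var0, var0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [
                w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
            variances[k] = max(var, 1e-6)  # variance floor for numerical stability
    return weights, means, variances


def adaptive_threshold(scores):
    """Lowest score whose posterior favours the higher-mean (reliable) component."""
    weights, means, variances = fit_two_component_gmm(scores)
    pos = 0 if means[0] > means[1] else 1
    neg = 1 - pos

    def density(x, k):
        v = variances[k]
        return weights[k] * math.exp(-((x - means[k]) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)

    positives = [x for x in scores if density(x, pos) > density(x, neg)]
    return min(positives) if positives else max(scores)


scores = [0.10, 0.15, 0.20, 0.25, 0.30, 0.85, 0.88, 0.90, 0.93, 0.95]
threshold = adaptive_threshold(scores)
```

On this toy input the two clusters are well separated, so the threshold falls at the bottom of the high-confidence cluster. Under domain shift the clusters overlap, which is exactly where a per-partition fit could differ from a fixed 0.90 cut-off.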

The single-seed Condition E-FT result warrants multi-seed confirmation to establish statistically whether fine-tuned single-teacher pseudo-labelling differs from its frozen-source counterpart. Similarly, the Condition H comparison could be extended to include Soft Teacher and Unbiased Teacher as additional SSOD reference points.

In addition, the border vs. interior split evaluation in Section 7.5 opens the possibility of separate pseudo-label strategies for border and interior predictions. Rather than being discarded entirely, border predictions might be recovered by jointly post-processing the predictions of adjacent tiles, using geometric consistency checks to reclaim a portion of the detections currently counted as fragmentation false positives. Furthermore, extending the evaluation protocol to include a fourth held-out partition, either a different survey area or a different vehicle platform, would test the generalisability of the diagnostic findings beyond the three temporal partitions used. Finally, the small-object recovery finding is currently evaluable only on fold 0 due to the absence of small-object annotations in the 2021 and 2022 test partitions; validating this scale-invariant learning effect across multiple survey years will rely on future datasets containing a higher density of annotated small targets.

9. Conclusions

The cross-expedition deployment of deep-sea nodule detectors is constrained by the cost of re-annotation and the severity of the covariate shift between survey periods. This work diagnoses two noise sources that cause failure of naive pseudo-labelling: false positives concentrated near tile boundaries, and an architecture-agnostic interior false positive floor arising from source domain bias in low-confidence predictions. The diagnostic framework characterises both noise sources on labelled target tiles and derives principled filtering thresholds from empirical evidence rather than hyperparameter search.

Predictions at $\text{conf} \geq 0.90$ constitute a domain-stable subspace of the teacher output distribution, achieving a high true positive rate across three models independently, suggesting that the reliability of pseudo-labels under domain shift is a property of the confidence regime rather than the model architecture.

Additionally, fine-tuning degrades teacher quality despite improving detection performance, indicating that target-domain AP is not a reliable proxy for pseudo-label quality in low-annotation cross-domain settings.

On the dataset evaluated here, performance peaks at or near 100 pseudo-labelled tiles and declines as the pool grows, although not strictly monotonically (Table 14 and Appendix A). This pattern recurs across all five evaluated pseudo-labelling pipelines and is interpreted as a property of the underlying noise distribution rather than of any specific filtering scheme. The improvement is concentrated in the most domain-shifted fold, indicating that pseudo-labelling is most useful where the supervised signal is weakest. Finally, the results suggest that frozen ensemble teachers with different architectures provide a viable basis for cross-expedition adaptation without additional annotation costs, subject to the noise distribution constraints characterised in this work.

Author Contributions

Conceptualization, G.L.; methodology, G.L.; software, G.L.; validation, G.L., A.D. and E.S.; formal analysis, G.L.; investigation, G.L.; resources, E.S.; data curation, G.L.; writing—original draft preparation, G.L.; writing—review and editing, G.L., A.D. and E.S.; visualization, G.L.; supervision, A.D. and E.S.; project administration, A.D. and E.S.; funding acquisition, A.D. and E.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the TRIDENT project, financed by the European Union’s Horizon Europe (HE) programme under grant agreement No. 101091959.

Data Availability Statement

The photographic materials are publicly available on the DeepData website of the International Seabed Authority (ISA) and were used with permission for this specific publication. The authors do not have the authority to distribute these images. The WBF configuration, cascade filter parameters, diagnostic-curve analysis scripts, and LOPO training manifests are available at https://github.com/gabloureiro/cross_expedition_adaptation (accessed on 29 May 2026). The full pseudo-label generation pipeline will be made publicly available in a future release upon completion of ongoing related research.

Acknowledgments

The authors gratefully acknowledge the International Seabed Authority (ISA) for providing the photographic images used in this research through its DeepData database, available at https://isa.org.jm/deepdata-database/ (accessed on 27 April 2026). Their support in supplying these visual assets is highly appreciated.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AP	Average Precision
CCZ	Clarion–Clipperton Zone
CNN	Convolutional Neural Network
COCO	Common Objects in Context
EMA	Exponential Moving Average
FP	False Positive
GMM	Gaussian Mixture Model
IoU	Intersection over Union
ISA	International Seabed Authority
LOPO	Leave-One-Partition-Out
mAP	mean Average Precision
NMS	Non-Maximum Suppression
PCA	Principal Component Analysis
RGB	Red, Green, Blue
SAHI	Slicing-Aided Hyper Inference
SSOD	Semi-Supervised Object Detection
t-SNE	t-distributed Stochastic Neighbour Embedding
TP	True Positive
WBF	Weighted Box Fusion

Appendix A. Per-Condition Pool-Size Tables

Table A1, Table A2, Table A3 and Table A4 report per-condition multi-seed macro mAP_50:95 across all evaluated pool sizes. These tables support the pool-size degradation analysis in Section 7.

Table A1. Condition C—Naive single-model pseudo-labelling, multi-seed macro mAP_50:95.

Pseudo Count	Macro mAP
100	$0.4407 \pm 0.0077$
200	$0.4318 \pm 0.0118$
500	$0.4284 \pm 0.0068$
1000	$0.4184 \pm 0.0256$
2000	$0.4139 \pm 0.0119$
5000	$0.3458 \pm 0.0332$

Table A2. Condition D—Single-model + spatial filter, multi-seed macro mAP_50:95.

Pseudo Count	Macro mAP
100	$0.4409 \pm 0.0050$
200	$0.4334 \pm 0.0150$
500	$0.4275 \pm 0.0121$
1000	$0.4258 \pm 0.0040$
2000	$0.4068 \pm 0.0175$
5000	$0.3491 \pm 0.0276$

Table A3. Condition E—Single-model + spatial + confidence, multi-seed macro mAP_50:95.

Pseudo Count	Macro mAP
100	$0.4353 \pm 0.0049$
200	$0.4297 \pm 0.0059$
500	$0.4256 \pm 0.0201$
1000	$0.4171 \pm 0.0205$
2000	$0.4035 \pm 0.0186$
5000	$0.3706 \pm 0.0405$

Table A4. Condition G—Early distillation, multi-seed macro mAP_50:95.

Pseudo Count	Macro mAP
100	$0.4372 \pm 0.0158$
200	$0.4381 \pm 0.0072$
500	$0.4302 \pm 0.0038$
1000	$0.4168 \pm 0.0175$
2000	$0.4224 \pm 0.0157$
5000	$0.3706 \pm 0.0417$

References

1. Loureiro, G.; Dias, A.; Almeida, J.; Martins, A.; Hong, S.; Silva, E. A survey of seafloor characterization and mapping techniques. Remote Sens. 2024, 16, 1163.
2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
3. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS); IEEE: New York, NY, USA, 2024; pp. 1–6.
4. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
5. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660.
6. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
7. Loureiro, G.; Dias, A.; Almeida, J.; Martins, A.; Silva, E. Evaluation of deep learning models for polymetallic nodule detection and segmentation in seafloor imagery. J. Mar. Sci. Eng. 2025, 13, 344.
8. Folkman, L.; Pitt, K.A.; Stantic, B. A data-centric framework for combating domain shift in underwater object detection with image enhancement. Appl. Intell. 2025, 55, 272.
9. Peukert, A.; Schoening, T.; Alevizos, E.; Köser, K.; Kwasnitschka, T.; Greinert, J. Understanding Mn-nodule distribution and evaluation of related deep-sea mining impacts using AUV-based hydroacoustic and optical data. Biogeosciences 2018, 15, 2525–2549.
10. Chen, G.; Mao, Z.; Shen, J.; Cheng, Z. Pseudo-Label Guided Object Detection in Sparsely Annotated Underwater Optical Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024.
11. Arazo, E.; Ortego, D.; Albert, P.; O’Connor, N.E.; McGuinness, K. Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning. arXiv 2020, arXiv:1908.02983. Available online: https://arxiv.org/abs/1908.02983 (accessed on 25 April 2026).
12. Liu, Y.-C.; Ma, C.-Y.; He, Z.; Kuo, C.-W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased teacher for semi-supervised object detection. arXiv 2021, arXiv:2102.09480.
13. Kloster, M.; Burfeid-Castellanos, A.M.; Langenkämper, D.; Nattkemper, T.W.; Beszteri, B. Improving deep learning-based segmentation of diatoms in gigapixel-sized virtual slides by object-based tile positioning and object integrity constraint. PLoS ONE 2023, 18, e0272103.
14. Gazis, I.Z.; Greinert, J. Importance of spatial autocorrelation in machine learning modeling of polymetallic nodules, model uncertainty and transferability at local scale. Minerals 2021, 11, 1172.
15. Koh, P.W.; Sagawa, S.; Marklund, H.; Xie, S.M.; Zhang, M.; Balsubramani, A.; Liang, P. Wilds: A benchmark of in-the-wild distribution shifts. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 5637–5664.
16. Engelen, J.E.V.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440.
17. Rosenberg, C.; Hebert, M.; Schneiderman, H. Semi-supervised self-training of object detection models. In Proceedings of the Seventh IEEE Workshop on Applications of Computer Vision (WACV), Breckenridge, CO, USA, 5–7 January 2005.
18. Sohn, K.; Zhang, Z.; Li, C.L.; Zhang, H.; Lee, C.Y.; Pfister, T. A simple semi-supervised learning framework for object detection. arXiv 2020, arXiv:2005.04757.
19. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
20. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Liu, Z. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 3060–3069.
21. Liu, Y.C.; Ma, C.Y.; Kira, Z. Unbiased teacher v2: Semi-supervised object detection for anchor-free and anchor-based detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 9819–9828.
22. Zhou, H.; Ge, Z.; Liu, S.; Mao, W.; Li, Z.; Yu, H.; Sun, J. Dense teacher: Dense pseudo-labels for semi-supervised object detection. In European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2022; pp. 35–50.
23. Li, Y.J.; Dai, X.; Ma, C.Y.; Liu, Y.C.; Chen, K.; Wu, B.; Vajda, P. Cross-domain adaptive teacher for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 7581–7590.
24. Hao, Y.; Forest, F.; Fink, O. Simplifying source-free domain adaptation for object detection: Effective self-training strategies and performance insights. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 196–213.
25. Zhang, X.; Chou, C.-H. Source-free domain adaptation for video object detection under adverse image conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024.
26. Ericsson, L.; Li, D.; Hospedales, T. Better practices for domain adaptation. In Proceedings of the International Conference on Automated Machine Learning (AutoML), PMLR, Potsdam, Germany, 12–15 November 2023; pp. 1–25.
27. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Zhang, L. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 38–55.
28. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Girshick, R. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4015–4026.
29. Mbani, B.; Buck, V.; Greinert, J. An automated image-based workflow for detecting megabenthic fauna in optical images with examples from the Clarion–Clipperton Zone. Sci. Rep. 2023, 13, 8350.
30. Cui, C.; Ma, P.; Zhang, Q.; Liu, G.; Xie, Y. Grabbing Path Extraction of Deep-Sea Manganese Nodules Based on Improved YOLOv5. J. Mar. Sci. Eng. 2024, 12, 1433.
31. Zurowietz, M.; Nattkemper, T.W. Unsupervised knowledge transfer for object detection in marine environmental monitoring and exploration. IEEE Access 2020, 8, 143558–143568.
32. Park, C.; Simon-Lledó, E.; Fleming, B.F.; Ju, S.J. Environmental drivers of abyssal benthic megafaunal biodiversity in the south-central Clarion-Clipperton Zone, Pacific Ocean. Elem. Sci. Anthr. 2026, 14, 00060.
33. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 966–970.
34. Isensee, F.; Petersen, J.; Klein, A.; Zimmerer, D.; Jaeger, P.F.; Kohl, S.; Maier-Hein, K.H. nnU-Net: Self-adapting framework for U-Net-based medical image segmentation. arXiv 2018, arXiv:1809.10486.
35. Xiao, Y. Group Evidence Matters: Tiling-based Semantic Gating for Dense Object Detection. arXiv 2025, arXiv:2509.10779.
36. Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
37. Xu, J.; Wang, W.; Wang, H.; Guo, J. Multi-model ensemble with rich spatial information for object detection. Pattern Recognit. 2020, 99, 107098.
38. Casado-García, Á.; Heras, J. Ensemble methods for object detection. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI), Santiago de Compostela, Spain, 29 August–8 September 2020; pp. 2688–2695.
39. Labao, A.B.; Naval, P.C., Jr. Cascaded deep network systems with linked ensemble components for underwater fish detection in the wild. Ecol. Inform. 2019, 52, 103–121.
40. Meyer, H.; Reudenbach, C.; Wöllauer, S.; Nauss, T. Importance of spatial predictor variable selection in machine learning applications—Moving from data reproduction to spatial prediction. Ecol. Model. 2019, 411, 108815.
41. Gulrajani, I.; Lopez-Paz, D. In search of lost domain generalization. arXiv 2020, arXiv:2007.01434.
42. Purser, A.; Bodur, Y.; Ramalo, S.; Stratmann, T.; Schoening, T. Seafloor Images of Undisturbed and Disturbed Polymetallic Nodule Province Seafloor Collected During RV SONNE Expeditions SO268/1+ 2; PANGAEA: Bremerhaven, Germany, 2021; p. 935856.
43. Han, L.; Zhai, J.; Yu, Z.; Zheng, B. See you somewhere in the ocean: Few-shot domain adaptive underwater object detection. Front. Mar. Sci. 2023, 10, 1151112.
44. Walker, J.L.; Zeng, Z.; Wu, C.L.; Jaffe, J.S.; Frasier, K.E.; Sandin, S.S. Underwater object detection under domain shift. IEEE J. Ocean. Eng. 2024, 49, 1209–1219.
45. Solovyev, R.; Wang, W.; Gabruseva, T. Weighted boxes fusion: Ensembling boxes from different object detection models. Image Vis. Comput. 2021, 107, 104117.
46. Wang, X.; Yang, X.; Zhang, S.; Li, Y.; Feng, L.; Fang, S.; Lyu, C.; Chen, K.; Zhang, W. Consistent-teacher: Towards reducing inconsistent pseudo-targets in semi-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 3240–3249.

Figure 1. Three tiles representative of each year.

Figure 2. t-SNE visualisation of YOLOv8 backbone features ($n = 500$ tiles per partition), illustrating the temporal domain shift. The 2018 partition demonstrates clear feature separation from the 2021 and 2022 partitions.

Figure 3. Visual demonstration of methodology-induced border fragmentation noise caused by the sliding-window tiling process. (a) A polymetallic nodule located within the 96-pixel overlap region of a 640 px tile boundary. (b) Physical splitting of the nodule across adjacent frames. Because the left fragment constitutes less than 50% of the original bounding box area, its ground-truth annotation is discarded during dataset generation, while the right fragment retains its annotation. (c) The consequence during model training and inference is that the model correctly detects the left fragment based on morphological features, but due to the discarded annotation, it is strictly penalized as a false positive, corrupting the pseudo-label pool.
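
The retention rule described in this caption can be sketched as follows. The example assumes square 640 px tiles with a 96 px overlap (hence a 544 px stride between tile origins) and treats a visible fraction of exactly 50% as retained; the function names and the boundary convention are assumptions made for illustration, not the dataset-generation code.

```python
TILE_SIZE = 640                # tile side in pixels
OVERLAP = 96                   # overlap between adjacent tiles in pixels
STRIDE = TILE_SIZE - OVERLAP   # 544 px between tile origins (derived, assumed)
MIN_VISIBLE_FRACTION = 0.5     # annotations below this visible fraction are discarded


def visible_fraction(box, tile_origin, tile_size=TILE_SIZE):
    """Fraction of a parent-image box (x1, y1, x2, y2) that falls inside a tile."""
    x1, y1, x2, y2 = box
    ox, oy = tile_origin
    ix1, iy1 = max(x1, ox), max(y1, oy)
    ix2, iy2 = min(x2, ox + tile_size), min(y2, oy + tile_size)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (x2 - x1) * (y2 - y1)
    return inter / area if area > 0 else 0.0


def keeps_annotation(box, tile_origin):
    """True if the clipped annotation would be retained for this tile."""
    return visible_fraction(box, tile_origin) >= MIN_VISIBLE_FRACTION


# A 100 x 100 px nodule straddling the right edge of the first tile.
nodule = (600, 100, 700, 200)
left_tile = (0, 0)         # covers x in [0, 640)
right_tile = (STRIDE, 0)   # covers x in [544, 1184)

left_kept = keeps_annotation(nodule, left_tile)
right_kept = keeps_annotation(nodule, right_tile)
```

In the left tile only 40% of the nodule is visible, so its annotation is dropped even though the fragment remains in the image; a detector that fires on that fragment is then scored as a false positive.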

Figure 4. Visual taxonomy of pseudo-label noise. The red shaded region denotes the 64-pixel border margin ($d_{border} < 64$). Predictions within this zone are highly susceptible to becoming border false positives (red dashed line) due to tiling fragmentation, even when detecting legitimate nodule fragments. Beyond this margin, the tile interior contains true positives (green solid line) alongside a persistent floor of interior false positives (orange solid line), which arise when the source-trained model hallucinates targets from unfamiliar, target-domain benthic textures. Ground truth annotations are shown in grey.

Figure 5. Spatial decay of border fragmentation noise. False positive rate as a function of distance from the tile boundary. A sharp spike is observed at the border edge, which rapidly decays and stabilizes at a persistent interior noise floor by the 64-pixel mark.

Figure 6. True positive rate across confidence thresholds. The monotonic increase in true positive rate across all three architectures demonstrates that higher-confidence predictions are significantly better localised (under standard $IoU \geq 0.5$ matching). The vertical dashed line denotes the selected 0.90 operating point, which effectively isolates the domain-stable prediction subspace.
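
A curve of this kind can be computed along the following lines. The sketch assumes that the true positive rate is the share of retained predictions that match a ground-truth box, TP / (TP + FP), with greedy one-to-one matching in descending confidence order; both the definition and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def true_positive_rate(preds, gts, conf_min, iou_min=0.5):
    """TP / (TP + FP) among predictions with confidence >= conf_min.

    preds: list of (box, confidence); gts: list of boxes.
    Returns None when no prediction survives the confidence cut.
    """
    kept = sorted((p for p in preds if p[1] >= conf_min), key=lambda p: -p[1])
    if not kept:
        return None
    matched = set()
    tp = 0
    for box, _ in kept:
        best_j, best_iou = None, iou_min
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    return tp / len(kept)


gts = [(0, 0, 10, 10)]
preds = [((0, 0, 10, 10), 0.95), ((50, 50, 60, 60), 0.50)]
curve = {t: true_positive_rate(preds, gts, t) for t in (0.1, 0.5, 0.9)}
```

Sweeping `conf_min` over a grid of thresholds for each teacher yields one curve per architecture.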

Figure 7. Final cascade-filtering pipeline.

Figure 8. Scaling curve for Condition F at seed 42.

Figure 9. Distribution of macro mAP_50:95 scores for Condition F across four random seeds at varying pseudo-label pool sizes. The red dashed line represents the multi-seed mean of the supervised baseline (Condition B: 0.4467 across the same four seeds). The dark blue dots represent the individual scores for each of the four random seeds, and the solid red dots indicate the mean value for each pool size.

Table 1. Dataset division by year and total number of high-resolution images.

Expedition Year	Number of Images	Resolution
2018	1331	1024 × 683 and 5184 × 3456
2021	216	5184 × 3456
2022	597	5184 × 3456

Table 2. Dataset division by year and total number of tiles.

Expedition Year	Number of Tiles
2018	31,674
2021	11,664
2022	32,238

Table 3. Image statistics across the 2018, 2021, and 2022 datasets.

Metric	2018	2021	2022
Brightness	$121.97 \pm 29.20$	$194.95 \pm 18.04$	$150.34 \pm 13.95$
Contrast	$11.29 \pm 4.63$	$14.84 \pm 5.35$	$17.14 \pm 4.60$
Sharpness	$26.26 \pm 45.46$	$44.46 \pm 20.09$	$48.54 \pm 16.49$
Entropy	$5.30 \pm 0.47$	$5.17 \pm 0.59$	$5.45 \pm 0.42$
Red	$103.49 \pm 29.49$	$184.28 \pm 21.72$	$139.48 \pm 16.51$
Green	$130.73 \pm 29.91$	$200.23 \pm 17.85$	$153.02 \pm 13.36$
Blue	$113.21 \pm 24.13$	$151.35 \pm 15.32$	$110.01 \pm 9.53$
Saturation	$61.53 \pm 12.81$	$62.92 \pm 4.64$	$71.71 \pm 5.59$

Table 4. Summary of fold definitions and tile counts per split. Labelled sets from the non-test partitions are divided into training and validation subsets using an 80/20 split. The pseudo-label pool is strictly restricted to unlabelled tiles from the non-test partitions to prevent temporal leakage.

Fold	Test Year	Train/Val Years	Train Tiles ¹	Val Tiles	Test Tiles	Pseudo-Label Tiles ²
0	2018	2021, 2022	288	70	148	36,772
1	2021	2018, 2022	238	58	210	47,195
2	2022	2018, 2021	294	72	140	27,967

¹ Train/Val splits approximate 80/20 but fluctuate slightly due to the strict parent-image grouping constraint. ² The eight negative background samples account for the mathematical discrepancy in the total tile counts.
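
The fold construction summarised in Table 4 can be sketched as below: each year is held out in turn, and the labelled tiles of the remaining years are split into training and validation sets by parent image, so that no parent contributes tiles to both. The tuple layout, the function name, and the use of a seeded shuffle are assumptions for illustration; the restriction of the pseudo-label pool to unlabelled non-test tiles is not modelled here.

```python
import random
from collections import defaultdict


def lopo_folds(tiles, val_fraction=0.2, seed=42):
    """Yield one Leave-One-Partition-Out fold per year.

    tiles: iterable of (tile_id, year, parent_id) tuples.
    """
    tiles = list(tiles)
    years = sorted({year for _, year, _ in tiles})
    for test_year in years:
        test = [t for t in tiles if t[1] == test_year]
        # Group the remaining tiles by parent image to prevent spatial leakage.
        by_parent = defaultdict(list)
        for t in tiles:
            if t[1] != test_year:
                by_parent[t[2]].append(t)
        parents = sorted(by_parent)
        random.Random(seed).shuffle(parents)
        n_val = max(1, round(val_fraction * len(parents)))
        val = [t for p in parents[:n_val] for t in by_parent[p]]
        train = [t for p in parents[n_val:] for t in by_parent[p]]
        yield {"test_year": test_year, "train": train, "val": val, "test": test}


# Toy data: 3 years, 4 parent images per year, 5 tiles per parent.
toy_tiles = [
    (f"{year}_{parent}_{i}", year, f"{year}_img{parent}")
    for year in (2018, 2021, 2022)
    for parent in range(4)
    for i in range(5)
]
folds = list(lopo_folds(toy_tiles))
```

Because whole parent images are assigned to one side of the split, the realised ratio only approximates 80/20, which is consistent with the first footnote to Table 4.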

Table 5. Summary of the selected teacher models, their architectural paradigms, and their baseline performance on the source dataset.

Teacher Model	Architectural Paradigm	Source Benchmark mAP@50:95
DINO [4]	Transformer-based set-prediction	0.899
Faster R-CNN [2]	Anchor-based two-stage	0.832
YOLOv8s [3]	Anchor-free single-stage	0.856

Table 6. Distribution of object sizes across the target and source datasets.

	Target Dataset		Source Dataset
Category	Count	Percentage	Count	Percentage
Tiny	23	0.90%	0	0.00%
Small	385	15.01%	8	0.12%
Medium	480	18.71%	3730	58.04%
Large	1677	65.38%	2689	41.84%

Table 7. Distribution of border annotations by year, detailing the counts and percentages for each spatial zone.

Year	Clipped	Filter Zone	Overlap Zone	Interior
2018	17 (1.9%)	286 (32.7%)	132 (15.1%)	440 (50.3%)
2021	45 (4.7%)	231 (24.0%)	196 (20.4%)	489 (50.9%)
2022	50 (6.9%)	173 (23.7%)	166 (22.8%)	340 (46.6%)

Table 8. Filter cascade yield across the full unlabelled pool.

Stage	Boxes Retained	% of Raw
Raw WBF fused boxes (conf ≥ 0.001)	7,666,887	100.0%
Spatial Filter	481,008	6.3%
Confidence Filter	224,858	2.9%
Ensemble Agreement	167,011	2.2%
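
The percentages in Table 8 follow directly from the box counts. The short check below recomputes them and also reports the share retained relative to the previous stage, which is not tabulated in the paper; the variable names are illustrative.

```python
stages = [
    ("Raw WBF fused boxes", 7_666_887),
    ("Spatial filter", 481_008),
    ("Confidence filter", 224_858),
    ("Ensemble agreement", 167_011),
]

raw = stages[0][1]
rows = []
previous = raw
for name, kept in stages:
    rows.append((name, kept, 100.0 * kept / raw, 100.0 * kept / previous))
    previous = kept

for name, kept, of_raw, of_prev in rows:
    print(f"{name:<22} {kept:>9,d}  {of_raw:5.1f}% of raw  {of_prev:5.1f}% of previous stage")
```

Read stage by stage, the spatial filter alone removes roughly 94% of the raw boxes, the confidence filter keeps roughly 47% of what remains, and ensemble agreement keeps roughly 74% of the confident boxes.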

Table 9. Fine-tuning hyperparameters.

Hyperparameter	Value
Base model	YOLOv8 (source checkpoint)
Optimizer	AdamW
Initial learning rate (`lr0`)	0.0001
Final learning rate factor (`lrf`)	0.01
Learning rate schedule	Cosine decay
Momentum	0.937
Weight decay	0.0005
Warmup epochs	3
Maximum epochs	100
Early stopping patience	20
Frozen layers	10 (backbone)
Batch size	16
Image size	$640 \times 640$
Mosaic augmentation	1.0
Horizontal/vertical flip probability	0.5/0.5
HSV hue/saturation/value jitter	0.015/0.7/0.4
Scale augmentation	0.5
Translation augmentation	0.1
Rotation augmentation	$15^{\circ}$
Single class mode	True
Seed	42 (varied across runs: 42, 365, 1234, 2026)

Table 10. Experimental condition definitions. Each condition adds a new component to the pipeline.

Condition	Description	Teacher	Conf Filter	Spatial Filter	Ensemble
A	Zero-shot transfer	—	—	—	—
B	Supervised fine-tuning	—	—	—	—
C	Naïve single-model pseudo-labelling	YOLOv8	0.001	×	×
D	Single-model + spatial filter	YOLOv8	0.001	64 px	×
E	Single-model + spatial filter + confidence threshold	YOLOv8	0.90	64 px	×
F	Multi-model ensemble pseudo-labelling	YOLOv8 + DINO + FRCNN	0.90	64 px	≥2
G	Early distillation extension	YOLOv8 + DINO + FRCNN	0.90	64 px	≥2
H	Mean Teacher (EMA)	EMA student	—	—	—
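
The "≥2" entry in the Ensemble column denotes agreement between at least two of the three teachers. The sketch below shows a simplified agreement check based on pairwise IoU; it is not Weighted Box Fusion and not the authors' implementation, and the IoU threshold of 0.5, the function names, and the rule of emitting each agreed object from the first model that detects it are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def agreed_boxes(per_model, iou_min=0.5, min_votes=2):
    """Return boxes supported by at least min_votes distinct models.

    per_model: dict mapping a model name to its list of boxes for one tile.
    """
    names = list(per_model)
    agreed = []
    for i, name in enumerate(names):
        for box in per_model[name]:
            def overlaps(other_name):
                return any(iou(box, other) >= iou_min for other in per_model[other_name])

            votes = 1 + sum(overlaps(o) for o in names if o != name)
            # Avoid duplicates: skip if an earlier model already covers this object.
            seen_earlier = any(overlaps(o) for o in names[:i])
            if votes >= min_votes and not seen_earlier:
                agreed.append(box)
    return agreed


tile_predictions = {
    "yolov8": [(0, 0, 10, 10), (100, 100, 110, 110)],
    "dino": [(1, 1, 11, 11)],
    "frcnn": [],
}
consensus = agreed_boxes(tile_predictions)
```

Here only the first YOLOv8 box is retained, because DINO confirms it; the second box has a single vote and is dropped.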

Table 11. Baseline condition results (mAP_50:95). Condition A is a fixed zero-shot checkpoint with no seed variation. Conditions B and H report mean ± standard deviation across four random seeds.

Condition	Description	Macro mAP_50:95
A	Zero-shot transfer	0.2530
B	Supervised fine-tuning	$0.4467 \pm 0.0079$
H	Mean Teacher (EMA, $α = 0.95$ )	$0.4189 \pm 0.0215$
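
For reference, the exponential moving average update used by a Mean Teacher baseline with $α = 0.95$ can be written as a one-line rule, shown here on plain dictionaries of scalar weights. The function name and data layout are assumptions for illustration; a real implementation would apply the same rule to each parameter tensor.

```python
def ema_update(teacher, student, alpha=0.95):
    """One EMA step: teacher <- alpha * teacher + (1 - alpha) * student."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}


teacher = {"w": 1.0}
student = {"w": 0.0}
after_one = ema_update(teacher, student)
after_two = ema_update(after_one, student)
```

With $α = 0.95$ each step moves the teacher 5% of the way towards the current student, so the teacher lags the student by roughly 20 update steps.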

Table 12. Condition F cross-validation results. Macro mAP_50:95 performance across three folds and the performance delta compared to Condition B.

Pseudo Count	Fold_0	Fold_1	Fold_2	Macro mAP_50:95	vs. Cond B
B (0 pseudo)	0.2617	0.5293	0.5481	0.4464	baseline
H (Mean Teacher)	0.2757	0.5020	0.5198	0.4325	−0.0139
100	0.3049	0.5291	0.5734	0.4691	+0.0227
200	0.3130	0.5215	0.5434	0.4593	+0.0129
500	0.3131	0.5000	0.5514	0.4548	+0.0084
1000	0.2811	0.4100	0.4500	0.3804	−0.0660
2000	0.2861	0.4900	0.5500	0.4420	−0.0044
5000	0.2589	0.4500	0.5200	0.4096	−0.0368

Table 13. Component ablation at the 100-tile pseudo-label budget. Each condition adds one pipeline component over the previous. All pseudo-labelling conditions report mean ± standard deviation across four random seeds. Condition E-FT reports seed 42 only.

Condition	Added Component	Macro mAP_50:95	Δ vs. B
B	Supervised Baseline	$0.4467 \pm 0.0079$	baseline
H	EMA teacher updates	$0.4189 \pm 0.0215$	$- 0.028$
C-100	Naive pseudo-labelling	$0.4407 \pm 0.0077$	$- 0.006$
D-100	+ 64 px spatial filter	$0.4409 \pm 0.0050$	$- 0.006$
E-100	+ conf ≥ 0.90	$0.4353 \pm 0.0049$	$- 0.011$
F-100	+ ensemble ≥ 2 of 3	$0.4745 \pm 0.0042$	$+ 0.028$
G-100	+ early distillation	$0.4372 \pm 0.0158$	$- 0.010$

Table 14. Condition F—multi-seed macro mAP_50:95 across four random seeds (42, 365, 1234, 2026). Mean ± standard deviation reported.

Pseudo Count	Macro mAP_50:95	vs. Cond B
100	0.4745 ± 0.0042	+0.028
200	0.4647 ± 0.0070	+0.018
500	0.4414 ± 0.0199	−0.0053
1000	0.4020 ± 0.0295	−0.0448
2000	0.4331 ± 0.0112	−0.0136
5000	0.4316 ± 0.0155	−0.0151

Table 15. Teacher checkpoint evaluation (conf = 0.001, $d_{border} \geq 64$ px).

Teacher Model	Test Year	Evaluation Metric	True Positives	False Positives	FP Rate (%)
Zero-Shot Source	2018	Centre-Point	384	233	37.76%
Zero-Shot Source	2018	IoU @ 0.50	366	251	40.68%
Condition B (Fine-Tuned)	2018	Centre-Point	412	451	52.26%
Condition B (Fine-Tuned)	2018	IoU @ 0.50	383	480	55.62%
Zero-Shot Source	2021	Centre-Point	335	268	44.44%
Zero-Shot Source	2021	IoU @ 0.50	248	355	58.87%
Condition B (Fine-Tuned)	2021	Centre-Point	295	732	71.28%
Condition B (Fine-Tuned)	2021	IoU @ 0.50	289	738	71.86%
Zero-Shot Source	2022	Centre-Point	235	165	41.25%
Zero-Shot Source	2022	IoU @ 0.50	173	227	56.75%
Condition B (Fine-Tuned)	2022	Centre-Point	224	1160	83.82%
Condition B (Fine-Tuned)	2022	IoU @ 0.50	215	1169	84.47%
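
The FP rate column in Table 15 is consistent with FP / (TP + FP). The check below recomputes it for the centre-point rows; the dictionary layout and function name are illustrative.

```python
def fp_rate(tp, fp):
    """False positive rate in percent among all predictions: FP / (TP + FP)."""
    return 100.0 * fp / (tp + fp)


# (teacher, test year) -> (true positives, false positives), centre-point matching
centre_point = {
    ("zero-shot", 2018): (384, 233),
    ("fine-tuned", 2018): (412, 451),
    ("zero-shot", 2021): (335, 268),
    ("fine-tuned", 2021): (295, 732),
    ("zero-shot", 2022): (235, 165),
    ("fine-tuned", 2022): (224, 1160),
}

rates = {key: round(fp_rate(tp, fp), 2) for key, (tp, fp) in centre_point.items()}
```

In every test year the fine-tuned checkpoint has the higher FP rate, and the rise is driven by the false positive count growing while the true positive count changes little.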

Table 16. Border vs. interior mAP_50:95 split evaluation across four random seeds (Mean ± SD). Condition A is a fixed zero-shot checkpoint with no seed variation.

Condition	Overall mAP_50:95	Interior mAP_50:95	Border mAP_50:95
A	0.2530	0.2667	0.1337
B	$0.4467 \pm 0.0079$	$0.4741 \pm 0.0081$	$0.3017 \pm 0.0085$
F-100	$0.4745 \pm 0.0041$	$0.4939 \pm 0.0040$	$0.3347 \pm 0.0053$

Table 17. Evaluation comparison of small objects mAP_50:95 for test year 2018.

Condition	mAP_50:95 Small	AR_50:95 Small
B	0.232	0.285
F-100	0.315	0.366
