Error Mitigation Teacher for Semi-Supervised Remote Sensing Object Detection

Lu, Junhong; Chen, Hao; Gao, Pengfei; Wang, Yu

doi:10.3390/rs17152592

Open AccessArticle

Error Mitigation Teacher for Semi-Supervised Remote Sensing Object Detection

¹

School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China

²

China Unicom, Heilongjiang Branch, Harbin 150001, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(15), 2592; https://doi.org/10.3390/rs17152592

Submission received: 9 May 2025 / Revised: 23 July 2025 / Accepted: 23 July 2025 / Published: 25 July 2025

(This article belongs to the Special Issue Target Detection, Recognition, Tracking, and Positioning Using Remote Sensing and AI Techniques)

Download

Browse Figures

Versions Notes

Abstract

Semi-supervised object detection (SSOD) in remote sensing is challenged by the accumulation of pseudo-label errors in complex scenes with dense objects and high intra-class variability. While teacher–student frameworks enable learning from unlabeled data, erroneous pseudo-labels such as false positives and missed detections can be reinforced over time, which degrades model performance. To address this issue, we propose the Error-Mitigation Teacher (EMT), a unified framework designed to suppress error propagation during SSOD training. EMT consists of three lightweight modules. First, the Adaptive Pseudo-Label Filtering (APLF) module removes noisy pseudo boxes via a second-stage RCNN and adjusts class-specific thresholds through dynamic confidence filtering. Second, the Confidence-Based Loss Reweighting (CBLR) module reweights training loss by evaluating the teacher model’s ability to reconstruct its own pseudo-labels, using the resulting loss as an indicator of label reliability. Third, the Enhanced Supervised Learning (ESL) module improves class-level balance by adjusting supervised loss weights according to pseudo-label statistics. EMT demonstrates consistent performance gains over representative state-of-the-art SSOD methods on DOTA, DIOR, and SSDD datasets. Notably, EMT achieves a 2.9% absolute mAP₅₀ improvement on DIOR using only 10% of labeled data, without incurring additional inference cost. These results highlight EMT’s effectiveness in improving SSOD for remote sensing.

Keywords:

remote sensing images; semi-supervised object detection; teacher–student network

Graphical Abstract

1. Introduction

Remote sensing object detection (RSOD) plays a critical role in a wide range of applications [1,2,3], including disaster response, urban monitoring, and maritime surveillance. However, the high cost of annotating dense and diverse objects in high-resolution satellite images severely limits the scalability of fully supervised detection models. To address this limitation, semi-supervised object detection (SSOD) has emerged as a promising paradigm that leverages abundant unlabeled data alongside a small set of labeled samples [4].

Among the recent developments in SSOD, teacher–student frameworks have become the most widely adopted. In these approaches, a teacher model generates pseudo-labels on unlabeled data, which are then used to supervise the student model. The teacher is typically updated via an exponential moving average (EMA) [5] of the student weights, aiming to provide stable supervision. Despite recent advances, SSOD in remote sensing faces several unique challenges. A major issue is the progressive accumulation of pseudo-label errors during training. In complex scenes with dense targets and varying object scales, pseudo boxes generated by the teacher often contain false positives or localization errors, particularly in early stages. Once accepted, these unreliable boxes influence the student model and are further reinforced through the EMA update cycle, ultimately degrading detection performance.

Although pseudo-labeling has become a central component in SSOD, its quality remains a critical bottleneck. One major limitation lies in how pseudo-labels are selected. Most existing methods, whether based on static thresholds or dynamic strategies, rely solely on classification confidence scores to accept or reject predictions. However, these scores often fail to capture localization quality, especially in dense or cluttered scenes, leading to the inclusion of inaccurate boxes. Another issue arises in the way pseudo-labels are used during optimization. Once selected, all pseudo boxes typically contribute equally to the training loss, regardless of their actual reliability. This uniform weighting introduces gradient noise and weakens model convergence, particularly when teacher and student predictions diverge. Furthermore, some object categories exhibit highly complex visual patterns that make them difficult to learn from noisy pseudo-label supervision. These hard classes tend to be poorly represented in the unsupervised branch, as pseudo-labels for them are often missing or inaccurate. Consequently, they receive limited gradient updates and underperform.

To address these challenges, we propose Error-Mitigation Teacher (EMT), a modular framework designed to reduce pseudo-label noise and suppress error accumulation in semi-supervised remote sensing object detection. EMT introduces a three-stage error control pipeline that jointly considers classification confidence, localization precision, and training dynamics throughout training.

First, the Adaptive Pseudo-Label Filtering (APLF) module mitigates noisy pseudo-label selection by combining class-wise confidence thresholds with geometric verification using a second-stage RCNN head. This dual-criteria filtering ensures that only pseudo boxes with both semantic and spatial reliability are retained. Second, the Confidence-Based Loss Reweighting (CBLR) module addresses label-level unreliability by evaluating how well the teacher model can reconstruct its own pseudo-labels during a secondary forward pass. The resulting regression and classification losses serve as reliability indicators, which are normalized and used to dynamically weight each instance during optimization, reducing the influence of uncertain pseudo-labels. Third, the Enhanced Supervised Learning (ESL) module improves class-level robustness by identifying hard-to-learn categories based on the learning difficulty estimated from pseudo-label in the unsupervised branch. These categories receive increased attention in the supervised branch, enabling joint optimization that balances the learning across all classes.

EMT consistently outperforms state-of-the-art SSOD methods, as demonstrated by extensive experiments on three representative remote sensing benchmarks: DOTA, DIOR, and SSDD. The remainder of this article is organized as follows: Section 2 provides an overview of related work in object detection for remote sensing images and semi-supervised object detection. The proposed method is detailed in Section 3. Section 4 presents a comprehensive account of the experiments and analysis. Finally, Section 5 concludes our work.

2. Related Work

In this section, we briefly introduce the literature related to our work, encompassing remote sensing object detection and Semi-Supervised Object Detection.

2.1. Remote Sensing Object Detection

Unlike natural image datasets like MS COCO, remote-sensing images typically feature multi-scale and clustered targets. CNN-based detectors have achieved excellent performance in RSOD. Depending on whether candidate regions are generated, these methods can be categorized into single-stage and two-stage detectors. Two-stage detectors generally yield better results, while single-stage detectors offer faster detection speeds.

Single-stage detectors consider detection tasks as regression problems and utilize convolutional layers to efficiently process feature maps, producing localization and classification results. Owing to the advantages of lightweight deployment, single-stage detectors hold broad application potential in the field of remote sensing. Wei et al. [6] introduced a bilateral attention mechanism in YOLOX to enhance the detection of small-sized objects. S2A-Net [7] employed deformable ConvNets [8] to align anchors and features for aerial image object detection. DAFNe [9] introduced a centrality function for bounding boxes of arbitrary orientations, improving localization performance.

Two-stage detectors segment the process into two steps. The first step generates a range of possible proposals, similar to single-stage detectors, and the second step refines these suggestions. Faster RCNN [10] has become the most widely used foundational framework in RSOD [1], which abandoned the Selective Search algorithm in RCNN [11] and designed the RPN subnet to derive proposals from the anchor mechanism. The multi-scale feature fusion of FPN [12] and the RoI Align method proposed in Mask RCNN [13], which reduces the quantization error, have also been applied to recent implementations of Faster RCNN. Additionally, Ding et al.’s RoI Transformer [14] applies spatial transformations to Regions of Interest (RoIs), enhancing the detection performance of rotating targets in remote sensing imagery. ReDet [15] offers a rotation-equivariant detector that extracts rotation-invariant features of targets.

While most existing RSOD methods focus on enhancing localization precision through architectural adaptations such as rotation-equivariant modules and anchor refinements, they typically rely on fully supervised training paradigms. However, in remote sensing applications where annotations are scarce and expensive, these methods struggle to maintain performance. In contrast, our EMT framework addresses this challenge by incorporating pseudo-label noise mitigation strategies under a semi-supervised setting, thereby achieving stronger generalization with limited labeled data. In this work, we focus on horizontal bounding box (HBB) detection, which is compatible with a broader range of remote sensing datasets and yields more stable performance under semi-supervised training settings.

2.2. Semi-Supervised Object Detection

Semi-supervised learning has seen remarkable success in object detection by effectively leveraging unlabeled data through mechanisms such as pseudo-labeling [16], consistency regularization [17], and transfer learning [18]. In recent years, the teacher–student paradigm, combined with pseudo-label learning, has become the dominant framework in semi-supervised object detection (SSOD). Under this paradigm, a teacher model generates pseudo-labels for unlabeled images, and a student model is trained using both labeled data and high-confidence pseudo-labeled data. Although conceptually simple, this approach has undergone significant evolution to overcome core challenges such as pseudo-label noise, confirmation bias, and class imbalance. Below, we review representative SSOD methods following this developmental trajectory.

Early works such as STAC [16] proposed a two-stage training pipeline, where the teacher model, trained on labeled data, generates pseudo-labels for unlabeled images using a fixed confidence threshold. The student model is then trained with both labeled data and pseudo-labeled data under strong augmentation. While effective, this method is prone to confirmation bias and label noise. To alleviate such issues, Unbiased Teacher [19] introduced an EMA strategy, adapted from Mean Teacher [5], to update the teacher model dynamically using the student model during training.

Subsequent works have focused on improving the quality of pseudo-labels used to train the student. ISMT [20] aggregates detection results from multiple historical iterations to stabilize pseudo-labels and reduce temporal noise, while RPL [21] and PCT [22] propose a dynamic thresholding mechanism that jointly considers classification and localization confidence to improve label reliability. However, methods that rely solely on classification confidence often fail to account for localization quality, leading to underestimated noise in early training and missed detections in later stages.

Building upon these ideas, more recent methods have employed consistency regularization to further enhance the learning of the unsupervised branch. Soft Teacher [23] proposes a soft weighting strategy that assigns each unlabeled bounding box a classification loss proportional to the confidence score from the teacher. It also introduced box jittering to refine localization targets. Unbiased Teacher v2 [24] extended this framework to anchor-free detectors like FCOS and introduced the Listen2Student mechanism to enhance the learning of localization signals. Consistent-Teacher [25] redesigned the sample allocation strategy and detection head to improve pseudo-label stability and reduce oscillation noise in one-stage detectors.

Some studies have explored the adaptation of the teacher–student framework to transformer-based architectures. For example, Semi-DETR [26] extends SSOD to DETR-style detectors by employing hybrid matching strategies and introducing cross-view query consistency as well as cost-based pseudo-label mining. In the context of remote sensing, SSOD-QCTR [27] further customizes the semi-supervised transformer framework by designing dedicated loss functions for remote sensing.

While prior SSOD methods have made notable progress, they often rely on classification confidence alone for pseudo-label selection, apply uniform loss weighting regardless of label quality, and offer limited support for learning under-represented categories. In contrast, our EMT framework introduces a unified strategy that jointly considers semantic and geometric cues during label selection, modulates supervision strength based on prediction reliability, and enhances learning for hard classes through adaptive supervision. These innovations help mitigate error accumulation and promote more stable and balanced training in semi-supervised remote sensing detection.

3. Materials and Methods

3.1. Overview

As illustrated in Figure 1, the proposed Error-Mitigation Teacher (EMT) framework adopts a teacher–student architecture composed of supervised and unsupervised branches, allowing the model to fully leverage both labeled and unlabeled data. Each branch uses a Faster RCNN detector with an FPN backbone to support robust multi-scale feature extraction. The teacher network is updated via an exponential moving average (EMA) of the student model, ensuring stability during pseudo-label generation.

In the supervised branch, we introduce Enhanced Supervised Learning (ESL), which adjusts class-wise loss weights based on learning difficulty inferred from the pseudo-label distribution. Specifically, ESL identifies underperforming classes based on the confidence thresholds estimated by APLF and increases their supervised loss contributions. This strategy directs the model’s focus toward hard-to-learn categories in the labeled data, thereby improving class-level detection robustness.

In the unsupervised branch, unlabeled images undergo both weak and strong augmentations [22]. The teacher network generates pseudo-labels from weakly augmented views, which are refined by the proposed Adaptive Pseudo-Label Filtering (APLF) module before being treated as targets for the student’s training on strongly augmented samples. APLF re-evaluates pseudo-labels using a second-stage RCNN head and dynamically adjusts confidence thresholds to suppress low-quality predictions. During loss computation on these samples, the Confidence-Based Loss Reweighting (CBLR) module evaluates the reliability of each pseudo-labeled region by performing a secondary forward pass in the teacher network. Specifically, the teacher attempts to reconstruct its own pseudo-labels, and the resulting classification and regression losses are used as reliability indicators. These values are then used to apply a soft weighting function, reducing the impact of unreliable supervision.

Together, these three components form a unified and lightweight error-mitigation pipeline. APLF provides high-quality pseudo-labels, CBLR suppresses noise during optimization, and ESL adaptively enhances class-level supervision. This design systematically reduces error accumulation and improves generalization in complex remote sensing scenarios.

3.2. Adaptive Pseudo-Label Filtering

In semi-supervised object detection (SSOD), the quality of pseudo-labels generated by the teacher network critically influences the training stability and overall performance of the student model. However, remote sensing imagery typically presents greater complexity and diversity compared to natural scenes, exhibiting significant intra-class variability, densely packed objects, and cluttered backgrounds. Such conditions commonly result in numerous false positives and unstable predictions in pseudo-labels, especially during the initial training phases. As illustrated in Figure 2a, naively accepting pseudo-labels from early-stage teacher predictions can significantly degrade detection performance due to the propagation of noisy supervision signals.

To mitigate these issues, we introduce Adaptive Pseudo-Label Filtering (APLF), a robust and adaptive filtering module that selectively retains high-quality pseudo-labels based on both classification consistency and class-specific confidence dynamics. The APLF module comprises two key components: Second-Stage RCNN Evaluation (SSRE) and Dynamic Threshold Filtering (DTF).

3.2.1. Second-Stage RCNN Evaluation

The design of the SSRE component is motivated by a structural limitation in the standard Faster RCNN detection pipeline. In most implementations, the RCNN head predicts classification scores and per-class bounding box regressions for each region proposal generated by the Region Proposal Network (RPN). Following this stage, Non-Maximum Suppression (NMS) is applied independently for each class, which helps balance category-wise detections. However, this per-class NMS mechanism also leads to an unintended consequence: a large number of low-confidence or ambiguous bounding boxes may survive, especially for frequent or spatially overlapping categories.

These surviving boxes, although structurally allowed by the NMS step, may include clearly erroneous predictions. If used as pseudo-labels, they can contaminate the training signal and interfere with the adaptive threshold estimation of downstream components such as DTF. To mitigate this, we propose Second-Stage RCNN Evaluation (SSRE) as a lightweight verification strategy that reuses the RCNN head to filter out the most unreliable detections based on label consistency.

Notably, SSRE does not attempt to comprehensively denoise the pseudo-label set, but instead aims to remove only the most obviously unstable predictions. This is achieved by re-evaluating each bounding box through the RCNN head a second time, using the same features. If the predicted class label changes between the two evaluations, the detection is considered inconsistent and is discarded. Although our SSRE is not derived from cascaded architectures, it adopts a similar intuition to multi-stage refinement strategies used in fully supervised detectors such as Cascade R-CNN [28] and Deformable DETR [29], where iterative prediction improves localization quality. In our case, this principle is applied in a novel semi-supervised context to identify noisy pseudo-labels via internal consistency checks, with minimal computational overhead.

Formally, let

b_{i}

be a bounding box retained after per-class NMS. Let

y_{i} \in {1, \dots, C}

denote the predicted class label from the first RCNN pass in teacher’s prediction process. The same box is then re-evaluated to obtain a second label

y_{i}^{'}

. Candidate boxes for which

y_{i} \neq y_{i}^{'}

are considered unstable and thus removed from the pseudo-label set.

This label consistency check provides an efficient approximation of prediction stability. Compared with threshold-based heuristics, SSRE leverages the model’s own classification behavior to identify unreliable detections. It avoids additional network branches or confidence calibration components, thereby preserving computational efficiency while improving pseudo-label reliability in complex remote sensing scenarios.

3.2.2. Dynamic Threshold Filtering

After filtering unstable predictions via SSRE, the remaining pseudo-labels are further refined using DTF. Unlike conventional fixed-score thresholds that apply a uniform confidence cutoff across all classes, DTF adaptively adjusts thresholds on a per-class basis, based on the evolving prediction quality over training.

This design is motivated by the observation that pseudo-label confidence scores can differ significantly across classes due to factors such as object size, class frequency, and visual ambiguity. Rigid thresholds tend to unfairly penalize rare or difficult classes, leading to excessive false negatives. DTF addresses this by computing a class-wise threshold T that dynamically adapts using EMA of softmax scores.

Let

P \in ℝ^{N \times C}

be the matrix of softmax classification probabilities output by the second RCNN pass, where N is the number of retained boxes and C is the number of object classes. Define the global confidence

{\bar{s}}_{g} \in ℝ^{1}

as the EMA of the maximum scores per box:

{\bar{s}}_{g}^{(t)} = λ \cdot {\bar{s}}_{g}^{(t - 1)} + (1 - λ) \cdot \frac{1}{N} \sum_{i = 1}^{N} \max_{j} P_{i j}

(1)

Similarly, the local per-class confidence vector

{\bar{s}}_{c} \in ℝ^{C}

is updated using:

{\bar{s}}_{c}^{(t)} = λ \cdot {\bar{s}}_{c}^{(t - 1)} + (1 - λ) \cdot \frac{1}{N} \sum_{i = 1}^{N} P_{i}

(2)

where

λ \in [0, 1]

controls the momentum.

Finally, the adaptive threshold vector

T \in ℝ^{C}

T^{(t)} = \frac{{\bar{s}}_{g}^{(t)}}{\max ({\bar{s}}_{c}^{(t)})} \cdot {\bar{s}}_{g}^{(t)}

(3)

This normalization ensures that each class threshold is scaled relative to its own average confidence, while remaining anchored to the global detection quality. Pseudo-labels whose confidence scores fall below their class-specific threshold

T^{(t)}

are discarded.

Together, SSRE and DTF constitute a coherent pseudo-label filtering strategy. SSRE eliminates clearly unstable predictions via classification consistency, while DTF adjusts selection strictness based on class-specific learning dynamics. By combining these two complementary components, APLF significantly improves the quality and balance of pseudo-labels used for training, enhancing both convergence stability and final detection accuracy in remote sensing scenarios.

3.3. Confidence-Based Loss Reweighting

Unlike ground-truth annotations, pseudo-labels inherently contain noise due to imperfect localization and class ambiguity. Even with filtering mechanisms such as APLF, these labels cannot match the precision of human annotations. When such imperfect labels are used to guide sample assignment, especially under IoU-based anchor assignment schemes, they may lead to incorrect positive or negative assignments for region proposals. For instance, slightly inaccurate bounding boxes may still be accepted as positives, resulting in suboptimal regression targets and unstable gradients, as also observed in Consistent Teacher [25].

To mitigate this issue, we introduce the Confidence-Based Loss Reweighting (CBLR) module, which assigns soft, sample-wise weights to pseudo-labeled instances based on their estimated reliability. In contrast to prior works that rely on hard filtering or modify the sample assignment process, CBLR retains the original assignments and instead modulates each sample’s influence during loss optimization. This design allows the model to learn from uncertain labels in a more stable and graded manner, improving robustness while preserving valuable supervision signals.

Specifically, we reuse the RoIs generated in the student branch and feed them into the teacher model’s RCNN head (no gradients) for evaluation. During this secondary forward pass, the teacher treats its own pseudo-labels, originally predicted under weak augmentation on the same image, as ground-truth targets and attempts to re-predict them. The classification and regression losses from this pass indicate how well the teacher can recover its own predictions. Since loss magnitude reflects the ease of fitting a label, it serves as a proxy for label reliability: low loss implies a self-consistent and likely correct pseudo-label, whereas high loss suggests inconsistency or noise. These per-sample loss values are normalized within each mini-batch and used to scale the corresponding student loss. This dynamic weighting strategy ensures that noisy labels contribute less to gradient updates, while reliable ones retain full influence.

Let

p_{i}^{T}

and

p_{i}^{S}

denote the class probability vectors output by the teacher and student models, respectively, for the i-th pseudo-labeled RoI. Let

b_{i}^{T}

and

b_{i}^{S}

denote the bounding box predictions, and let

{\hat{y}}_{i}

and

{\hat{b}}_{i}

be the pseudo-label targets. We define the classification and regression confidence weights as:

w_{i}^{c l s} = Softmax (- ℒ_{c l s}^{T} (p_{i}^{T}, {\hat{y}}_{i})), w_{i}^{r e g} = Softmax (- ℒ_{r e g}^{T} (b_{i}^{T}, {\hat{b}}_{i}))

(4)

Here,

ℒ_{c l s}

and

ℒ_{r e g}

are the RCNN loss functions for classification and regression, respectively. The softmax operation is computed over all pseudo-labels in the mini-batch, ensuring normalized weights and stabilizing learning dynamics. High-loss samples receive lower weights, reducing their impact on the loss calculation.

For the regression branch, each pseudo-label corresponds to a group of positively assigned anchors. We compute the loss at the pseudo-label level:

L_{r c n n_r e g}^{u n s u p} = \frac{1}{\sum_{k} w_{k}^{r e g}} \sum_{k = 1}^{K} w_{k}^{r e g} \cdot \frac{1}{N_{k}} \sum_{j = 1}^{N_{k}} ℒ_{r e g}^{S} (b_{j}^{S}, {\hat{b}}_{k})

(5)

Here,

K

is the number of pseudo-labels and

N_{k}

the number of assigned anchors for the k-th box. For the classification branch, we apply confidence weights to each labeled anchor directly:

L_{r c n n - c l s}^{u n s u p} = \frac{1}{\sum_{k} w_{k}^{c l s}} \sum_{i = 1}^{N} w_{k}^{c l s} \cdot ℒ_{c l s}^{S} (p_{j}^{S}, {\hat{y}}_{k})

(6)

To ensure stability, all weights

w_{i}

are detached from the computation graph. The teacher model does not participate in gradient updates, and its loss values are used solely to guide student training through differentiable, static reweighting.

Overall, CBLR serves as the second stage of error mitigation: while APLF removes clearly erroneous pseudo-labels, CBLR dynamically suppresses the impact of remaining uncertainty. Because loss-based reliability is grounded in the model’s own predictive certainty, it adapts dynamically to varying levels of label noise, class imbalance, and sample complexity, thereby enhancing stability across heterogeneous domains.

3.4. Enhanced Supervised Learning

Although the focus of semi-supervised learning lies in exploiting unlabeled data, the effective use of the limited labeled set remains equally crucial, especially in remote sensing scenarios where annotations are expensive and typically class-imbalanced. Conventional supervised training treats all categories equally, which may cause the model to overfit frequent classes while under-learning rare or difficult ones. Moreover, in a multi-class detection context, uniform treatment can further exacerbate the long-tail nature of remote sensing datasets.

To address this issue, we propose Enhanced Supervised Learning (ESL), which dynamically reweights the supervised classification loss to emphasize classes with higher learning difficulty. The core idea is to use the class-wise pseudo-label quality, derived from the DTF module (Section 3.2), to guide how much weight should be given to each class during supervised training.

Let

T = [t_{1}, \dots, t_{C}]

be the vector of adaptive class-specific thresholds computed in DTF. These thresholds represent the average pseudo-label confidence for each class. A lower value

t_{c}

indicates that pseudo-labels for class c tend to have lower confidence, suggesting that the model is struggling to learn or generalize to that class.

To encourage the model to focus more on such difficult classes, we apply an exponential inverse transformation:

w_{c} = e x p (- t_{c})

(7)

This ensures that classes with low-confidence pseudo-labels receive higher weight. To maintain training stability and ensure balanced optimization, we normalize the total supervised classification loss using the following formulation:

L_{r c n n_c l s}^{s u p} = \frac{\sum_{i = 1}^{N} ℒ_{c l s} (p_{i}^{\sup}, c_{i}) \cdot w_{c_{i}}}{\sum_{i = 1}^{N} w_{c_{i}}}

(8)

Here,

p_{i}^{\sup}

is the predicted classification probability vector for the i-th labeled RoI, and

c_{i} \in {1, \dots, C}

is its assigned ground-truth class label. The exponential term

w_{c_{i}}

dynamically scales the loss according to the current difficulty of class

c_{i}

, inferred from the pseudo-label statistics of the unlabeled branch.

This design offers three advantages. First, ESL adaptively emphasizes classes with lower pseudo-label confidence, mitigating the under-learning of rare or hard-to-detect categories. Second, by reusing pseudo-label statistics from the unsupervised branch, ESL aligns the supervised learning process with the evolving difficulty of each class. Third, the use of normalized exponential weights ensures training stability and avoids distorting the overall loss scale. In combination with APLF and CBLR, ESL completes a coherent error mitigation framework that jointly improves pseudo-label quality, suppresses residual errors, and enhances class-wise learning balance.

3.5. Overall Training Objective

EMT is trained with the proposed unsupervised losses for unlabeled data as well as the supervised loss for labeled data. The overall loss L is defined as:

\begin{array}{l} L = L^{s u p} + μ L^{u n s u p} \\ L^{s u p / u n s u p} = L_{r p n}^{s u p / u n s u p} + L_{r c n n}^{s u p / u n s u p} \\ L_{r c n n}^{s u p / u n s u p} = L_{r c n n_r e g}^{s u p / u n s u p} + L_{r c n n_c l s}^{s u p / u n s u p} \end{array}

(9)

L^{s u p}

and

L^{u n s u p}

represent the total supervised and unsupervised losses, respectively.

L_{r p n}^{s u p / u n s u p}

are the losses from the RPN for supervised and unsupervised data, respectively.

L_{r c n n}^{s u p / u n s u p}

are the RCNN losses, further divided into regression

L_{r c n n_r e g}^{s u p / u n s u p}

and

L_{r c n n_c l s}^{s u p / u n s u p}

components.

μ = 2

is a balancing parameter that determines the relative importance of the unsupervised loss compared to supervised loss. We also set an additional 5000 iterations for model warming-up to ensure meaningful pseudo-labels, during which

L^{u n s u p}

is not calculated. The teacher network is updated using the EMA of the student’s parameters and is not involved in gradient computation. During inference, only the teacher detector is used, and none of the error mitigation modules introduce additional runtime overhead.

4. Experimental Results

4.1. Datasets

DOTA: We perform experiments on the DOTA1.0 [30] dataset, a comprehensive and widely-employed collection for object detection in remote sensing. Encompassing a substantial variety of 15 common categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP) and helicopter (HC). This large dataset has become a benchmark in the field. It features images sourced from diverse sensors and platforms, ranging in size from 800 × 800 to an expansive 20,000 × 20,000 pixels. Its breadth and diversity have established DOTA as a standard for assessing object detection algorithms in remote sensing applications.

DIOR: Another prevalent dataset in the remote sensing object detection domain, DIOR [31] includes 23,463 images with a total of 190,288 instances. Uniformly sized at 800 × 800 pixels, the dataset spans 20 distinct categories: airplane (PL), airport (PO), baseball field (BD), basketball court (BC), bridge (BR), chimney (CH), dam (DM), expressway service areas (ESA), expressway toll station (ETS), golf field (GF), ground track field (GTF), harbor (HA), overpass (OP), ship (SH), stadium (SD), storage tank (ST), tennis court (TC), train station (TS), vehicle (V), and windmill (WM). Like DOTA, DIOR is utilized for its variety and representativeness of the remote sensing object detection challenges, providing a rich resource for validating the robustness and effectiveness of detection models.

SSDD: The SAR Ship Detection Dataset (SSDD) [32] is a publicly available dataset specifically created for ship detection in SAR images. It comprises a total of 1160 images containing 2456 annotated ships, and it captures diverse variations in sea state, image resolution, and sensor. We use the officially provided training and test sets for model training and evaluation. Moreover, due to the inherent challenges in annotating SAR images, such as low contrast and high noise levels, there is a heightened need for effective SSOD methods to alleviate the annotation burden and improve detection performance.

4.2. Training Protocol

For our experiments on the DOTA and DIOR datasets, we adhere to a structured partitioning methodology to effectively evaluate the performance of our SSOD framework. Specifically, for DOTA dataset, we first partition the images into 640 × 640 patches following the approach outlined in PCT [22]. This preprocessing step ensures that even the smallest labeled subset (1%) contains a sufficient number of labeled images, facilitating robust training despite the reduced labeled data. For the DIOR dataset, we do not perform image partitioning.

Subsequently, we divide the training sets of both the DOTA and DIOR datasets into subsets comprising 1%, 5%, and 10% of images with ground truth annotations. The remaining portions of each training set consist exclusively of unlabeled images, rigorously testing the network’s capability to leverage unlabeled data for improved detection performance. Each dataset’s validation set is reserved for assessing the performance of the trained models.

Contrary to the training regimen established in PCT and Soft Teacher, which involves 180,000 iterations for the 1% protocol and 360,000 iterations for both the 5% and 10% protocols, our approach emphasizes training efficiency. We employ a reduced number of training iterations: 90,000 iterations for the 1% protocol and 180,000 iterations for both the 5% and 10% protocols. This adjustment demonstrates the efficacy of our method in achieving comparable or superior performance with fewer computational resources.

For SSDD, since the training set contains only 928 images, applying a 1% protocol would result in merely 9 labeled images, leading to unstable training. To ensure more reliable learning while testing the model’s generalization at different labeled data proportions, we adopt 5%, 10%, and 20% protocols instead. All SSDD protocols are trained for 18,000 iterations.

4.3. Evaluation Metrics

In this work, we adopt horizontal bounding box (HBOX) detection across all datasets. This choice aligns with the annotation format of most RSOD benchmarks (e.g., DIOR, SSDD), and avoids the additional complexity and instability introduced by angle regression in Rotated Bounding Boxes (RBOX). Moreover, HBOX enables more efficient training and fairer comparison with recent SSOD methods, many of which are limited to horizontal annotations.

To effectively evaluate the performance of our model, we employ the widely recognized mean Average Precision (mAP) metrics, specifically mAP₅₀ and mAP_50:95. The mAP₅₀ metric applies an Intersection over Union (IoU) threshold of 0.5 to classify detections as true positives, which is particularly favored in the RSOD community for its relevance in scenarios where moderate localization accuracy is sufficient. On the other hand, the mAP_50:95 metric, as proposed by the MS-COCO benchmark, provides a more comprehensive assessment by averaging the model’s performance across ten IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. This broader measure captures the model’s localization accuracy across various levels of precision, making it essential for a thorough performance analysis. By utilizing both mAP₅₀ and mAP_50:95, we ensure a balanced evaluation of our model’s detection capabilities, accounting for both general and fine-grained localization performance.

4.4. Implementation Details

All experiments described in this paper were conducted on one NVIDIA RTX4090. To ensure a fair comparison, our method employs the same training and testing strategies as PCT and Soft Teacher. We train the model with a batch size of 5, maintaining a labeled to unlabeled image ratio of 1:4. The weak data augmentations include two operations: random resize and random flip. The strong data augmentations are more comprehensive, comprising five operations: random resize, random flip, random color space transformation, random geometric transformation, and erasing. Both types of augmentations are applied to the same unlabeled image.

Our optimizer of choice is SGD, initialized with a learning rate of 0.004, a momentum of 0.9, and a weight decay rate of 0.0001. In line with prevalent SSOD methodology, like PCT, we utilize the Faster RCNN framework with FPN for both student and teacher networks, maintaining separate weights. ResNet50 [33] serves as the backbone for all experiments. We employ the EMA, widely used in semi-supervised object detection, to update the teacher model parameters based on the student’s parameters, adhering to the same momentum decay settings used in APLF.

4.5. Comparison with the State-of-the-Art Methods

To comprehensively evaluate the effectiveness of our proposed EMT framework, we compare it against several state-of-the-art SSOD methods, which represent the current leading approaches in both general and remote sensing domains. These include Consistent Teacher [25], a single-stage detector-based method that improves pseudo-label stability via enhanced feature alignment; Soft Teacher [23], which builds upon the Faster RCNN framework and employs soft labels and jittering strategies to reduce label noise; MixPL [34], which mitigates pseudo-label imbalance across quantity, scale, and class via mixup and mosaic augmentations; and Semi-DETR [26], a transformer-based method incorporating deformable attention and hybrid bipartite matching to improve detection in complex scenes. These methods are consistently ranked among the top-performing SSOD approaches under standard benchmarks such as COCO 10% labeled data.

We also include PCT [22], which is one of the most advanced SSOD methods specifically tailored for remote sensing object detection. PCT enhances pseudo-label quality via contrastive learning and introduces scale-sensitive learning mechanisms, making it especially suitable for remote sensing scenarios with multi-scale, dense objects.

For fair evaluation, all methods are tested on DOTA, DIOR, and SSDD under consistent protocols, using mAP₅₀ and mAP_50:95 as performance metrics. Our results show that EMT outperforms all compared baselines across datasets and protocols, demonstrating its superior ability to mitigate pseudo-label errors and enhance detection robustness in challenging remote sensing environments. We re-implemented all methods with publicly available code; for methods without released code (e.g., PCT), we report the results directly from their original publications.

4.5.1. Analysis Across All Categories

Table 1 demonstrates that under both mAP₅₀ and mAP_50:95 evaluation standards, EMT effectively learns from unlabeled data even with a small fraction of annotated training data. Specifically, on the DOTA dataset, EMT outperforms all compared methods, including the transformer-based Semi-DETR, in both mAP₅₀ and mAP_50:95. This performance gain is attributed to EMT’s ability to reduce error accumulation in SSOD for remote sensing, allowing for more efficient utilization of data that surpasses the performance improvements seen in more complex detector models. Notably, for the commonly used mAP₅₀ metric in RSOD, EMT achieves improvements of 1.3, 1.8, and 1.1 points over the best competing methods under 1%, 5%, and 10% labeled data protocols, respectively. For the more localization-sensitive mAP_50:95 metric, EMT records enhancements of 0.5, 2.2, and 2.6 points, respectively.

On DIOR, EMT also achieves comprehensive leading performance on the mAP₅₀ metric, particularly breaking through the 70 mAP₅₀ threshold under the 10% labeled data protocol, being the only method to do so. For mAP_50:95, although Consistent Teacher with its more complex detection head shows advantages in localization, EMT maintains a leading position among Faster RCNN based methods, such as PCT and Soft Teacher.

We further evaluate the generalization of EMT on the SSDD dataset, which focuses on ship detection in SAR. Since SSDD contains only one class (ship), the mAP₅₀ reported corresponds to the average precision of ship detection. As shown in Table 1, EMT achieves state-of-the-art performance across all labeling ratios in mAP₅₀. Under the 20% labeled protocol, EMT attains an mAP₅₀ of 91.2, outperforming Soft Teacher and Semi-DETR by 2.5 and 1.9 mAP₅₀, respectively. Even with only 5% labeled data, EMT achieves an mAP₅₀ of 81.3, maintaining a strong advantage over prior works. From the perspective of localization accuracy, EMT also excels on the mAP_50:95 metric, with values of 50.5, 51.5, and 60.1 for 5%, 10%, and 20% protocols, respectively. These results demonstrate that EMT’s improvements are not solely derived from better classification, but also from superior bounding box precision, which is critical in maritime surveillance tasks.

4.5.2. Analysis with Specific Categories

Table 2 presents detailed category-wise evaluations on the DOTA dataset, demonstrating the consistent superiority of our proposed method compared to existing semi-supervised object detection frameworks. Across varying annotation rates (1%, 5%, and 10%), our approach achieves either leading or second-best performance in most categories, highlighting robust generalization capabilities. Specifically, our method notably excels in complex and structurally distinctive categories. For instance, at the 10% annotation level, our model achieves the highest mAP₅₀ of 85.4 in plane and 93.4 in tennis court, significantly surpassing competing methods.

Although our model demonstrates strong overall performance, categories like Basketball Court exhibits comparatively limited results at the 1% annotation rate for all methods. The challenges are likely due to limited training samples and indistinct object boundaries. Despite these inherent difficulties, our method frequently remains among the top two performers, underscoring its robustness. Notably, our semi-supervised model surpasses the fully supervised Oracle Faster RCNN in certain categories such as plane and large vehicle. This superior performance is partly attributed to the EMA mechanism, which effectively enlarges the equivalent training batch size, thus enhancing feature representation and generalization.

An additional observation is found in the helicopter category, where performance at the 10% supervision level declines across all methods compared to lower supervision rates. This decline likely arises from visual confusion with similar categories such as plane, commonly encountered in airport contexts. This underscores the importance of precise pseudo-label refinement strategies to mitigate class confusion.

Table 3 further validates the robustness and effectiveness of our method on the DIOR dataset. Our model consistently demonstrates superior or highly competitive performance across various annotation scenarios, particularly excelling in semantically challenging categories like storage tank, expressway service area and vehicle.

However, the model’s performance in train station indicates persistent challenges due to inherent visual ambiguities and sparse annotations at the 1% annotation rate. Despite these challenges, our method consistently ranks among the top-performing approaches. Additionally, increasing annotation rates to 10% further enhances performance, achieving notable mAP₅₀ scores of 44.0 in both vehicle and dam, and an impressive 83.1 in ground track field, matching or slightly exceeding the fully supervised Oracle Faster RCNN. Compared with representative baselines under the same protocols, our method maintains clear advantages across most categories, including recent strong methods like Soft Teacher, Semi DETR and MixPL. This further underscores the robustness and generalization ability of EMT across diverse object types.

Analyzing cross-dataset performances reveals that competing approaches such as MixPL and Semi-DETR exhibit strengths primarily within certain categories. In contrast, our method demonstrates more consistent advantages across diverse categories and datasets. This consistent cross-dataset performance indicates our model’s capability to capture generalizable semantic representations, reinforcing its effectiveness in varied remote sensing scenarios.

4.6. Ablation Studies

4.6.1. Ablation Studies of Each Component in the EMT

In this section, we thoroughly investigate the DOTA dataset to assess the effectiveness and robustness of the EMT. We employed the same training strategies and testing settings as previous standard SSOD methods to ensure a fair comparison. Specifically, we randomly divided the DOTA training set into three subsets containing 1%, 5%, and 10% of labeled images, while treating the remainder of the training set as unlabeled data.

Table 4 presents the ablation studies of each component within EMT, evaluating their contributions under different labeled data protocols. The baseline method represents the original teacher–student network without the inclusion of APLF, CBLR, or ESL.

APLF corresponds to the pseudo-label threshold filtering mechanism in the teacher–student framework. In the ablation study, we replaced APLF with a fixed high threshold of 0.9, following existing methodologies. Removing APLF resulted in a significant decline in performance, especially under the 1% labeled data protocol, where mAP₅₀ dropped by 5.3 points. This substantial decrease is attributed to the model’s increased reliance on pseudo-labels due to limited labeled data, highlighting APLF’s critical role in balancing true positive retention and false positive elimination. In higher labeled data protocols (5% and 10%), the performance drops were around 2 points of mAP₅₀, demonstrating APLF’s effectiveness in more data-rich scenarios as well.

CBLR is responsible for mitigating error accumulation in the unsupervised learning branch by weighting the loss based on pseudo-label confidence. The ablation involved resetting the sample loss weights to 1, reverting to the original Faster RCNN loss computation. This removal led to performance decline of approximately 3 points in mAP₅₀ across all protocols. Notably, the degradation was more pronounced for CBLR compared to APLF in the 5% and 10% protocols, underscoring CBLR’s crucial role in managing noise and enhancing robustness These findings indicate that traditional supervised loss calculation methods are inadequate for SSOD in the unsupervised branch, even when pseudo-label quality is high.

ESL is responsible for mitigating error accumulation by reweighting the supervised classification loss based on information from the unsupervised branch. Specifically, ESL leverages class-specific adaptive thresholds in APLF to assess the learning difficulty of each class and adjusts the loss weights accordingly. In the ablation study, we reset the loss weights to 1, reverting to the original Faster RCNN loss functions. This removal caused decreases in both mAP₅₀ and mAP_50:95 across all protocols, with more significant drops observed in the 5% and 10% protocols (from 43.8 to 43.6 at 1%, from 62.0 to 60.9 at 5%, and from 62.8 to 61.5 at 10%). The greater impact in higher protocols is due to ESL’s enhanced ability to exploit the increased labeled data, effectively suppressing error accumulation and ensuring balanced learning across classes. When ESL is used in isolation, the performance gains are limited. This is because the removal of APLF and CBLR leads to accumulated errors in the unsupervised branch, resulting in inaccurate class difficulty estimations that ESL relies on for loss reweighting.

Overall, the ablation studies demonstrate that APLF, CBLR, and ESL collectively play essential roles in mitigating error accumulation within SSOD for remote sensing. Each component contributes uniquely to enhancing pseudo-label quality, reducing the influence of noisy data, and ensuring balanced learning across diverse and challenging object categories. Moreover, when used in combination, these modules exhibit clear synergistic effects. The full configuration, which incorporates all three modules, consistently achieves the best performance across all labeled data protocols. This highlights the complementary nature of APLF, CBLR, and ESL in addressing different aspects of pseudo-label reliability and class imbalance, thereby significantly boosting the overall effectiveness and robustness of the EMT framework.

4.6.2. Ablation Studies of Each Component in the APLF

APLF is designed to improve the quality of pseudo-labels in remote sensing SSOD by balancing the retention of true positives with the removal of false positives under challenging conditions. It achieves this by the cooperation of two internal modules: SSRE, and DTF. To further verify the effectiveness of each component within APLF, we conducted additional experiments focusing on the 1% training protocol of the DOTA dataset. As indicated by Table 3, the 1% protocol was chosen because it contains the smallest amount of labeled data, making the model’s performance more dependent on the pseudo-label filter.

APLF comprises two sequential sub-modules: SSRE and DTF. In our ablation experiments, disabling SSRE means that the corresponding module is removed from the filtering process, while ablating DTF is achieved by replacing the dynamic threshold with a fixed threshold of 0.9 (as used in many previous implementations). Table 5 presents the ablation results of each sub-module within APLF under the 1% training protocol on DOTA.

Ablating the SSRE module results in a decrease in mAP₅₀ from 43.8 to 41.3, a drop of 2.5 points. This indicates that SSRE plays a role in refining pseudo-labels. By evaluating the classification results from two RCNN stages, SSRE effectively removes a significant number of low-quality predictions, preventing the dynamic threshold in DTF from being estimated too low. This effect is particularly notable in categories such as small vehicle (SV), where mAP₅₀ drops significantly from 47.5 to 16.2, highlighting SSRE’s importance in scenarios involving densely distributed, small-sized objects commonly found in urban environments.

Ablating the DTF module causes a substantial decrease in mAP₅₀ from 43.8 to 40.1, a drop of 3.7 points. This highlights the critical role of dynamic thresholding in balancing the retention of true positives and the elimination of false positives based on the model’s evolving confidence levels. The most significant declines occur in categories like storage tank (ST), where mAP₅₀ drops drastically from 58.6 to 30.6. This clearly indicates that categories with variable confidence distributions rely heavily on dynamic threshold adjustments to maintain accurate pseudo-label filtering.

Overall, the ablation studies of APLF’s sub-components confirm that each module contributes uniquely to enhancing pseudo-label quality. SSRE is crucial for refining and ensuring the precision of pseudo-labels, particularly in medium and small objects, while DTF dynamically adjusts thresholds to adapt to varying data distributions, thereby preventing error accumulation and improving the model’s robustness and performance, especially for large objects.

4.7. Analysis of Transformer Backbone

Given that the recent Semi-DETR [26] method employs a Transformer-based detector, we extend the backbone of our EMT framework to compare the performance differences between CNN-based and Transformer-based architectures.

ResNet [33] is a widely used CNN model for object detection tasks, while Transformers have emerged as a powerful alternative framework for image classification and detection. In recent years, Transformer-based backbones such as Swin Transformer [35] and Pyramid Vision Transformer (PVT) [36] have demonstrated strong performance across various dense prediction tasks. To evaluate the robustness and generalizability of our EMT framework, we replace the ResNet backbone with these two representative Transformer-based architectures while keeping all other components unchanged. This allows us to assess EMT’s compatibility with different types of backbone paradigms under semi-supervised settings.

The results in Table 6 demonstrate that our EMT framework consistently improves detection performance across all tested backbones and supervision levels. Notably, when using Swin Transformer, EMT achieves the highest mAP scores among all configurations, with particularly large gains in the 1% and 10% settings. This suggests that Swin’s ability to model long-range dependencies and hierarchical features is especially beneficial when labeled data is scarce or object appearances vary significantly.

In contrast, the Pyramid Vision Transformer (PVT), although less complex than Swin, still shows clear performance improvements when combined with EMT. Compared to its supervised-only baseline, EMT with PVT improves mAP_50:95 by 6.2, 4.7, and 4.2 under the 1%, 5%, and 10% settings, respectively. This confirms that EMT remains effective even with more lightweight Transformer architectures.

Furthermore, across all backbones (ResNet50, Swin, and PVT), EMT consistently outperforms both the fully supervised baseline and the strong PCT baseline. This highlights EMT’s backbone-agnostic design and its robustness in mitigating pseudo-label noise and class imbalance under varying architectural choices. These findings validate the generality and practical adaptability of EMT in real-world remote sensing SSOD.

4.8. Visualization of Predictions in Test Set

To intuitively demonstrate the effectiveness of our proposed model, we visualize the prediction results on the test sets of DOTA, DIOR, and SSDD, as shown in Figure 3, Figure 4, and Figure 5, respectively. These visualizations correspond to EMT models trained with only 10% of labeled data from each dataset.

Despite challenges such as diverse object scales, SAR noise patterns, and cluttered or complex backgrounds, the proposed EMT accurately detects and localizes objects across various scenes and categories. Notably, in Figure 5, the model effectively identifies ships in noisy SAR imagery from the SSDD dataset, demonstrating robustness to low contrast and low SNR conditions.

These results highlight that EMT maintains strong generalization and high detection quality even in semi-supervised settings, validating its applicability to real-world remote sensing tasks with limited annotation budgets.

4.9. Visualization of Pseudo-Labels

To further illustrate the effectiveness of APLF, we visualize the pseudo-labels in Figure 6. Specifically, ground truth annotations are marked with green boxes, while pseudo-labels are indicated by red boxes. The selected images are sampled from the unlabeled training data on 10% labeled DOTA and SSDD with weak augmentation, representing the input data processed by the teacher model during SSOD training. By comparing the visualization results before and after applying APLF, we observe that the teacher model’s predictions contain a significant number of false positives, including high-confidence erroneous pseudo-labels. However, APLF effectively reduces the number of false positives while preserving true positives as much as possible, thereby providing the student model with a more reliable supervisory signal.

4.10. Inference Cost Analysis

To assess the computational efficiency and deployment feasibility of the proposed EMT framework, a comprehensive inference cost evaluation was conducted. The analysis includes inference speed (measured in frames per second, FPS) and GPU memory usage under various labeling regimes.

All evaluations were performed on the DIOR dataset, which consists of images that have been uniformly cropped to a fixed size of 800 × 800 pixels. This characteristic eliminates the need for additional resizing or padding operations during inference, ensuring consistent and reproducible measurement conditions. The experiments were conducted on a single NVIDIA RTX 3060 GPU with 12 GB of memory to simulate realistic deployment scenarios on resource-constrained or embedded platforms, rather than relying on high-performance server-class GPUs.

Inference performance was evaluated for three model categories under different labeled data regimes: (1) supervised models trained with 1%, 5%, and 10% labeled data, (2) EMT models trained under identical labeling ratios, and (3) an Oracle model trained using 100% labeled data. For each model, inference throughput and memory usage were measured with and without automatic mixed precision (AMP) enabled. A total of 2000 test images were used per configuration, and the reported values represent the average across all samples after removing warm-up iterations. All models were tested with consistent settings, including a batch size of 1 without gradient computation.

The results are summarized in Table 7. Across all configurations, the EMT framework achieves inference speed and GPU memory usage comparable to the supervised and Oracle models, confirming that the proposed modules (APLF, CBLR, and ESL) do not introduce any computational overhead at inference time. The model deployed during inference maintains the standard structure of the Faster RCNN detector, ensuring architectural consistency.

A consistent improvement in inference speed is observed when AMP is enabled, with an average FPS increase of approximately 10 FPS and a reduction of around 50 MB in memory consumption across all models. This demonstrates the effectiveness of AMP in accelerating inference and reducing resource requirements, particularly beneficial for low-power or embedded environments. It is also noted that a slight variation in FPS between the EMT and supervised models under identical label ratios. This difference may be partially attributed to minor system-level resource fluctuations during benchmarking. In addition, the supervised models, trained with highly limited labeled data, tend to produce more false positives, potentially leading to a marginally longer NMS stage.

These observations further confirm that EMT not only enhances detection performance in low-label regimes, but also preserves inference efficiency, making it well-suited for deployment in real-time or resource-constrained remote sensing applications.

5. Discussion

5.1. Computational Efficiency and Complexity

Computational efficiency is a critical aspect in deploying object detection frameworks. The EMT framework introduces additional modules such as APLF and CBLR, raising potential concerns about computational overhead. To address these concerns, we provide a detailed analysis of the computational complexity introduced by EMT.

During the pseudo-label generation phase in standard SSOD, the teacher model typically processes approximately 1000 region proposals generated by the RPN through the RCNN head. In contrast, the proposed SSRE module within APLF reassesses only pseudo-labels identified by the teacher as potential positives. Given that the number of these pseudo-labels typically does not exceed 100, the additional computational cost introduced by the SSRE is minimal (less than 10% of the computational burden associated with processing standard RPN proposals). Furthermore, the pseudo-label reassessment does not require gradient computation, significantly reducing memory and computational time.

Similarly, the CBLR module introduces minor additional computations by requiring the teacher network to perform an extra forward pass and loss computation for RoIs generated by the student. However, since these operations do not involve gradient calculations, the associated computational cost is substantially lower than a standard forward-backward pass of the student model.

Overall, the primary computational cost in Faster RCNN-based object detection arises from its ResNet50 backbone. Therefore, the incremental computation introduced by EMT is negligible compared to the backbone’s computational requirements. Importantly, during inference, the EMT framework does not alter the computational complexity or memory usage compared to the underlying detector, ensuring that deployment efficiency remains unaffected.

5.2. Comparison Between CNN-Based and Transformer-Based Detectors

Transformer-based object detectors, such as Semi-DETR, have garnered attention due to their ability to capture global context through self-attention mechanisms, inherently modeling the interactions between multiple objects. Such global contextual understanding could potentially enhance detection performance in remote sensing scenarios, especially those involving dense and multi-scale targets. However, transformer-based models typically require substantial amounts of training data and computational resources to converge effectively. In scenarios with extremely limited labeled data, transformer models may exhibit suboptimal performance due to insufficient training signals.

In contrast, EMT, built upon the CNN-based Faster RCNN architecture, offers significant advantages in terms of data efficiency and computational practicality, particularly under low-labeling conditions. The convolutional architecture inherently requires fewer training samples to achieve robust performance, making EMT especially suitable for remote sensing tasks where labeled data is scarce or expensive to obtain.

Importantly, the error mitigation strategies incorporated in EMT, such as pseudo-label filtering and consistency-based reweighting, are fundamentally model-agnostic. These strategies can theoretically be integrated into transformer-based detectors to similarly reduce error accumulation and enhance detection accuracy. As future work, we intend to explore hybrid architectures that leverage both the global context modeling strengths of transformer detectors and the effective error mitigation strategies developed in EMT. Such integrations hold promise for further advancing semi-supervised object detection in remote sensing applications.

6. Limitations and Future Work

Despite the demonstrated effectiveness of EMT, several challenging yet important aspects remain to be addressed. Firstly, although EMT has been evaluated extensively on optical and SAR datasets separately, its robustness and generalization capabilities in extremely complex scenarios remain to be fully verified, such as disaster-affected regions with drastic appearance changes, strong interference, or unusual imaging angles. Such conditions typically introduce significant domain shifts, thus posing considerable challenges to the model’s current design.

Secondly, EMT’s practical deployability on truly resource-constrained embedded platforms has not been empirically validated. Although our inference efficiency experiments simulated resource-constrained scenarios using an NVIDIA RTX 3060 GPU (12 GB), constraints prevented us from implementing and evaluating EMT directly on embedded hardware. Future work should therefore involve deploying and testing EMT on actual embedded devices or edge-computing platforms to further verify and improve its applicability in real-world remote sensing systems.

Thirdly, although we visually demonstrated the effectiveness of pseudo-label refinement through the APLF module, the interpretability of the model itself remains inherently limited. This limitation partly stems from the historical difficulty and complexity of interpretability in deep learning itself, a longstanding open research problem. Consequently, future studies should explore advanced visualization and attribution methods to enhance transparency and deepen our understanding of model behavior across different training stages.

Fourth, while our framework focuses explicitly on semi-supervised training methodologies, its compatibility with multi-modal data (e.g., optical and SAR imagery fusion) was not experimentally investigated. In principle, EMT’s architecture can inherently accommodate multi-modal input, but it is also important to recognize that multi-modal fusion introduces additional complexities, including modality alignment and complementary information integration. Investigating EMT’s effectiveness and required adaptations for multi-modal remote sensing data thus presents an interesting and valuable direction for future research.

Finally, our current method does not explicitly model prediction uncertainty. Nevertheless, EMT incorporates implicit uncertainty-aware mechanisms through modules such as APLF which dynamically filters low-confidence pseudo-labels and CBLR which downweights noisy samples based on reconstruction loss. These strategies have been shown to improve robustness, as evidenced in our ablation studies and pseudo-label visualizations. In future work, we plan to explore more principled uncertainty modeling approaches, such as entropy-based filtering or agreement-aware selection, which may further enhance the reliability of pseudo-label supervision under severe domain shifts.

7. Conclusions

In this paper, we proposed EMT, a novel framework designed to improve SSOD performance in remote sensing. EMT addresses the common issue of error accumulation from noisy pseudo-labels, which is particularly severe in remote sensing scenarios due to complex backgrounds, class imbalance, and dense object distributions.

Our framework integrates three complementary components. First, APLF enhances pseudo-label quality through second-pass consistency checks and class-aware dynamic thresholds. Second, CBLR down-weights unreliable pseudo-labels using teacher–student consistency, suppressing residual error propagation during training. Third, ESL dynamically adjusts class-wise loss weights based on pseudo-label confidence, ensuring better use of limited labeled data.

Extensive experiments on three widely-used remote sensing datasets (DOTA, DIOR, and SSDD) demonstrate that EMT achieves consistent improvements over state-of-the-art SSOD baselines, including both convolutional and transformer-based methods. Notably, EMT maintains high detection performance even with only 5–10% labeled data and introduces negligible additional inference cost.

Author Contributions

Conceptualization, J.L., H.C. and Y.W.; methodology, J.L., H.C., P.G. and Y.W.; writing—original draft preparation, J.L., H.C. and Y.W.; resources, J.L., H.C. and P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the authors of “Enhancing Prospective Consistency for Semisupervised Object Detection in Remote-Sensing Images” [22] for sharing their dataset partition protocol details. The authors also appreciate the anonymous reviewers for their insightful feedback and suggestions.

Conflicts of Interest

Author Pengfei Gao was employed by the company China Unicom, Heilongjiang Branch. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SSOD	Semi-Supervised Object Detection
CNN	Convolutional Neural Network
FPN	Feature Pyramid Network
RCNN	Region-based Convolutional Neural Network
RPN	Region Proposal Network
NMS	Non-Maximum Suppression
mAP	mean Average Precision
IOU	Intersection over Union

References

Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep Learning-Based Object Detection Techniques for Remote Sensing Images: A Survey. Remote Sens. 2022, 14, 2385. [Google Scholar] [CrossRef]
Bai, C.; Bai, X.; Wu, K. A Review: Remote Sensing Image Object Detection Algorithm Based on Deep Learning. Electronics 2023, 12, 4902. [Google Scholar] [CrossRef]
Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote Sensing Object Detection in the Deep Learning Era—A Review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
Wang, Y.; Liu, Z.; Lian, S. Semi-Supervised Object Detection: A Survey on Recent Research and Progress. arXiv 2023, arXiv:2306.14106. [Google Scholar] [CrossRef]
Tarvainen, A.; Valpola, H. Mean Teachers Are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018. [Google Scholar]
Wei, J.; Liu, Y.; BenganSu; Li, L.; Xie, W.; Zhao, S.; Zhao, Z. Improved YOLO X with Bilateral Attention for Small Object Detection. In Proceedings of the 2023 International Conference on Applied Intelligence and Sustainable Computing (ICAISC), Dharwad, India, 16–17 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA; pp. 764–773. [Google Scholar]
Lang, S.; Ventola, F.; Kersting, K. DAFNe: A One-Stage Anchor-Free Approach for Oriented Object Detection. arXiv 2022, arXiv:2109.06148. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 580–587. [Google Scholar]
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI Transformer for Oriented Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
Han, J.; Ding, J.; Xue, N.; Xia, G.-S. ReDet: A Rotation-Equivariant Detector for Aerial Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 15–20 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 2785–2794. [Google Scholar]
Sohn, K.; Zhang, Z.; Li, C.-L.; Zhang, H.; Lee, C.-Y.; Pfister, T. A Simple Semi-Supervised Learning Framework for Object Detection. arXiv 2020, arXiv:2005.04757. [Google Scholar] [CrossRef]
Jeong, J.; Lee, S.; Kim, J.; Kwak, N. Consistency-Based Semi-Supervised Learning for Object Detection. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
Tang, Y.; Wang, J.; Gao, B.; Dellandréa, E.; Gaizauskas, R.; Chen, L. Large Scale Semi-Supervised Object Detection Using Visual and Semantic Knowledge Transfer. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2119–2128. [Google Scholar]
Liu, Y.-C.; Ma, C.-Y.; He, Z.; Kuo, C.-W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased Teacher for Semi-Supervised Object Detection. arXiv 2021, arXiv:2102.09480. [Google Scholar] [PubMed]
Yang, Q.; Wei, X.; Wang, B.; Hua, X.-S.; Zhang, L. Interactive Self-Training with Mean Teachers for Semi-Supervised Object Detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 5937–5946. [Google Scholar]
Li, H.; Wu, Z.; Shrivastava, A.; Davis, L.S. Rethinking Pseudo Labels for Semi-Supervised Object Detection. In Proceedings of the 36th AAAI Conference on Innovative Applications of Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2021. [Google Scholar]
Shen, J.; Zhang, C.; Yuan, Y.; Wang, Q. Enhancing Prospective Consistency for Semisupervised Object Detection in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5619312. [Google Scholar] [CrossRef]
Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-End Semi-Supervised Object Detection with Soft Teacher. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
Liu, Y.-C.; Ma, C.-Y.; Kira, Z. Unbiased Teacher v2: Semi-Supervised Object Detection for Anchor-Free and Anchor-Based Detectors. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 9809–9818. [Google Scholar]
Wang, X.; Yang, X.; Zhang, S.; Li, Y.; Feng, L.; Fang, S.; Lyu, C.; Chen, K.; Zhang, W. Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Zhang, J.; Lin, X.; Zhang, W.; Wang, K.; Tan, X.; Han, J.; Ding, E.; Wang, J.; Li, G. Semi-DETR: Semi-Supervised Object Detection with Detection Transformers. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 23809–23818. [Google Scholar]
Ma, X.; Lv, P.; Gong, X. SSOD-QCTR: Semi-Supervised Query Consistent Transformer for Optical Remote Sensing Image Object Detection. Remote Sens. 2024, 16, 4556. [Google Scholar] [CrossRef]
Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 6154–6162. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3974–3983. [Google Scholar]
Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
Li, J.; Qu, C.; Shao, J. Ship Detection in SAR Images Based on an Improved Faster R-CNN. In Proceedings of the 2017 SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), Beijing, China, 13–14 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2015. [Google Scholar]
Chen, Z.; Zhang, W.; Wang, X.; Chen, K.; Wang, Z. Mixed Pseudo Labels for Semi-Supervised Object Detection. arXiv 2023, arXiv:2312.07006. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9992–10002. [Google Scholar]
Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved Baselines with Pyramid Vision Transformer. Comp. Visual. Med. 2022, 8, 415–424. [Google Scholar] [CrossRef]

Figure 1. Overview of the proposed Error-Mitigation Teacher (EMT) framework. Pseudo-labels are generated by the teacher network from weakly augmented unlabeled images and refined using the Adaptive Pseudo-Label Filtering (APLF) module. The student network is trained using both labeled and pseudo-labeled data: the Confidence-Based Loss Reweighting (CBLR) module modulates the unsupervised loss based on teacher–student consistency, while the Enhanced Supervised Learning (ESL) module adjusts the supervised loss to emphasize underrepresented or difficult classes.

Figure 2. Illustration of the pseudo-labels before (a) and after (b) APLF at 10,000 training iterations.

Figure 3. Predictions on test images: DOTA. (a) Tennis-court and small vehicle; (b) Large vehicle; (c) Tennis-court under complex backgrounds; (d) Storage-tank in panchromatic image.

Figure 4. Predictions on test images: DIOR. (a) Various sports field facilities; (b) Chimney; (c) Storage tanks; (d) Trainstation in urban environment.

Figure 5. Predictions on test images: SSDD. (a) Dense nearshore ships; (b) Isolated ship; (c) Single nearshore ship.

Figure 6. Illustrations of pseudo-labels (red) and ground truth (green) via weak augmentation with 10% labeled train images of DOTA (a–c) and SSDD (d) before (top) and after (bottom) APLF.

Table 1. Comparison with state-of-the-art methods.

Dataset	Method	1%		5%		10%
Dataset	Method	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀
DOTA	Supervised Only	17.6	32.3	29.5	52.4	31.6	55.2
	Soft Teacher	19.0	36.6	33.1	57.6	34.1	58.7
	Consistent Teacher	23.2	40.4	32.2	53.8	33.4	54.7
	Semi DETR	23.7	42.5	34.6	58.9	35.8	60.1
	MixPL	34.1	36.6	31.0	54.0	30.7	52.4
	PCT	20.8	40.9	34.5	60.2	35.8	61.7
	EMT	24.2	43.8	36.8	62.0	38.4	62.8
DIOR	Supervised Only	15.5	30.5	24.7	46.2	28.3	51.9
	Soft Teacher	21.5	41.4	35.0	61.7	40.0	65.9
	Consistent Teacher	26.3	41.8	41.8	62.5	45.6	66.9
	Semi DETR	25.3	41.6	41.6	64.5	44.2	67.3
	MixPL	20.1	36.2	36.4	62.5	36.2	62.8
	PCT	20.2	40.9	32.4	58.6	36.4	63.3
	EMT	23.8	44.6	38.3	64.5	44.2	70.2
Dataset	Method	5%		10%		20%
Dataset	Method	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀
SSDD	Supervised Only	42.1	78.5	51.1	82.8	58.2	90.3
	Soft Teacher	51.1	79.8	54.1	82.4	58.8	88.7
	Consistent Teacher	47.4	80.7	45.4	81.3	57.0	89.4
	Semi DETR	35.8	74.5	44.1	83.7	50.8	89.3
	MixPL	45.4	78.4	46.5	83.3	50.5	85.3
	EMT	50.5	81.3	51.5	84.5	60.1	91.2

Highest result is shown in bold, while second best is underlined.

Table 2. Class-wise performance comparison on the DOTA dataset.

Percent	Method	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP₅₀
1%	Supervised Only	61.4	25.2	15.1	12.5	33.9	54.3	66.5	72.9	0.0	34.6	9.5	22.0	41.3	23.9	12.0	32.3
	Soft Teacher	77.0	37.9	25.9	21.0	36.3	48.0	64.8	90.8	0.0	11.9	29.3	6.4	50.1	31.8	21.5	36.6
	Consistent Teacher	56.8	36.1	23.9	17.3	16.8	48.0	45.7	77.2	4.6	34.0	24.4	25.0	54.1	22.9	10.9	40.4
	Semi DETR	76.4	30.3	24.1	20.3	31.7	68.4	59.8	83.8	0.6	54.2	20.3	31.6	49.8	40.1	35.4	42.5
	MixPL	77.1	37.4	19.2	2.7	19.9	56.3	57.3	88.2	0.0	34.6	19.7	27.6	44.4	32.7	11.5	36.6
	PCT	76.3	39.2	19.1	12.4	40.6	62.4	69.5	89.5	1.3	33.9	26.7	29.4	52.0	36.1	25.3	40.9
	EMT	77.8	45.7	24.6	10.8	47.5	72.6	72.1	90.4	1.7	58.6	23.3	29.5	43.1	37.6	21.6	43.8
5%	Supervised Only	77.7	49.9	28.9	33.4	48.2	70.6	72.1	86.4	40.2	62.2	37.0	45.9	64.7	37.3	30.9	52.4
	Soft Teacher	82.4	61.1	35.4	43.8	52.1	71.8	73.2	91.7	48.3	64.8	45.8	44.2	66.1	43.2	33.3	57.6
	Consistent Teacher	75.1	43.4	32.2	46.2	14.6	62.1	52.0	83.9	39.8	51.6	36.1	34.4	61.6	35.1	27.1	53.8
	Semi DETR	79.0	66.9	28.7	36.1	51.6	73.5	80.3	90.4	45.4	63.5	46.0	44.5	65.4	42.1	43.4	58.9
	MixPL	83.3	55.1	31.1	34.3	50.3	71.9	69.0	92.2	48.1	63.5	47.4	38.8	66.1	40.9	18.2	54.0
	PCT	82.9	63.3	36.3	46.3	54.3	78.4	74.0	91.3	54.9	71.8	53.3	46.3	67.6	48.9	29.5	60.2
	EMT	83.3	58.4	37.1	60.2	53.8	78.0	74.7	92.5	57.6	68.7	55.4	49.0	73.1	50.0	38.2	62.0
10%	Supervised Only	80.9	54.8	36.3	40.4	53.3	76.8	73.3	90.9	44.1	62.1	48.8	49.9	69.4	43.5	3.0	55.2
	Soft Teacher	83.1	64.3	45.9	51.6	55.0	77.5	73.1	92.3	50.3	63.8	51.3	50.8	71.2	47.1	0.0	58.7
	Consistent Teacher	76.2	40.2	36.6	40.5	14.0	64.0	49.4	84.8	44.0	50.7	42.8	25.3	61.3	30.9	1.0	54.7
	Semi DETR	82.1	67.3	40.9	44.0	55.7	75.0	80.3	90.6	44.4	64.6	52.2	47.1	69.2	42.1	38.8	60.1
	MixPL	83.5	58.1	31.9	36.7	50.1	72.7	66.5	92.0	45.6	62.6	39.7	32.1	69.5	44.9	0.0	52.4
	PCT	84.4	66.6	45.6	51.6	55.3	81.1	74.5	92.8	58.7	71.2	53.9	51.7	72.0	50.9	14.0	61.7
	EMT	85.4	65.6	43.2	64.7	55.9	81.5	74.5	93.4	63.2	67.0	59.7	48.7	75.5	50.1	13.4	62.8
100%	Oracle Faster RCNN	84.2	65.6	47.5	70.8	57.3	76.5	76.5	92.2	62.6	62.0	60.8	57.9	76.4	54.0	55.0	66.6

Highest result is shown in bold, while second best is underlined.

Table 3. Class-wise performance comparison on the DIOR dataset.

Percent	Method	PL	PO	BD	BC	BR	CH	DM	ESA	ETS	GF	GTF	HA	OP	SH	SD	ST	TC	TS	V	WM	mAP₅₀
1%	Supervised Only	29.1	14.0	60.0	10.9	2.6	71.9	4.8	14.2	18.8	2.9	32.2	19.0	18.6	60.1	32.8	39.2	66.0	0.1	23.3	59.5	30.5
	Soft Teacher	39.2	12.0	67.4	56.8	5.1	74.6	19.3	46.6	45.5	17.4	58.3	15.5	36.2	65.7	50.8	33.7	78.6	1.2	31.9	72.8	41.4
	Consistent Teacher	17.5	48.0	52.0	54.7	2.7	69.4	26.8	48.5	28.2	40.9	53.8	19.4	20.4	48.8	47.7	16.5	71.0	0.6	24.7	63.5	41.8
	Semi DETR	62.3	33.1	61.7	9.0	7.0	73.9	42.0	51.7	44.2	13.2	34.1	17.7	31.7	62.2	57.0	31.8	51.5	10.3	29.3	70.3	42.5
	MixPL	29.2	8.3	66.7	48.8	4.1	74.0	10.4	42.2	30.4	3.1	44.4	20.0	26.4	57.0	46.5	37.3	78.4	1.2	31.1	64.2	36.2
	EMT	60.2	23.7	64.5	55.2	4.5	73.9	18.2	36.8	40.3	28.4	71.0	20.3	26.3	70.8	56.1	54.5	79.5	1.5	35.4	71.1	44.6
5%	Supervised Only	40.7	33.3	64.0	58.5	14.3	73.7	19.9	44.2	39.2	52.7	62.1	23.2	36.4	68.0	43.2	43.9	81.4	16.2	33.9	74.9	46.2
	Soft Teacher	69.7	73.6	68.1	84.4	27.6	75.1	30.5	65.8	61.3	75.4	73.5	38.0	52.6	72.8	62.6	51.5	87.8	40.5	42.3	81.6	61.7
	Consistent Teacher	36.2	77.5	59.3	75.3	29.5	71.3	43.4	66.2	48.4	69.2	69.8	28.9	52.7	57.8	52.8	23.8	78.3	52.9	30.0	66.0	62.5
	Semi DETR	68.0	75.0	66.5	73.0	28.3	74.2	49.3	79.2	51.5	69.6	69.7	41.8	49.0	75.2	64.8	57.7	77.8	52.3	35.0	74.0	64.5
	MixPL	78.3	65.3	69.1	88.7	30.8	75.7	25.3	70.0	52.9	68.1	76.9	49.9	56.4	70.6	66.6	59.7	89.0	35.9	40.5	80.3	62.5
	EMT	78.5	66.3	69.1	87.0	37.0	76.3	30.7	85.1	58.1	73.2	78.0	45.0	52.8	73.9	63.3	65.3	90.6	31.2	42.8	85.7	64.5
10%	Supervised Only	57.9	43.1	67.4	71.2	19.0	74.4	30.8	46.6	73.1	52.8	71.1	22.0	42.2	68.4	45.8	51.9	83.1	29.2	37.7	79.8	51.9
	Soft Teacher	75.3	76.7	71.8	89.2	34.8	75.1	44.1	76.5	64.6	75.5	79.4	42.6	56.5	73.1	55.9	58.3	88.5	51.2	43.2	86.5	65.9
	Consistent Teacher	45.7	81.5	60.5	81.3	31.2	70.5	62.6	72.8	51.8	68.6	70.5	26.3	54.7	55.6	55.2	33.6	77.2	48.2	31.1	72.6	66.9
	Semi DETR	67.2	79.9	69.8	80.3	32.3	75.6	58.5	84.1	60.2	70.8	74.1	41.6	53.6	76.7	52.1	59.8	79.8	57.9	39.4	80.2	67.3
	MixPL	80.5	62.8	69.9	89.3	28.9	75.7	36.0	68.8	53.0	68.1	78.2	47.0	54.5	70.2	63.7	59.0	88.8	39.6	39.8	82.7	62.8
	EMT	79.6	78.2	72.8	87.8	44.4	76.3	44.0	89.5	73.3	76.8	83.1	47.9	61.0	75.1	73.6	70.3	91.0	45.0	44.0	88.9	70.2
100%	Oracle Faster RCNN	76.5	86.8	82.7	89.5	50.7	80.3	70.5	89.3	76.7	85.8	84.4	58.2	63.8	74.2	82.3	66.9	89.2	63.8	48.7	89.7	75.5

Highest result is shown in bold, while second best is underlined.

Table 4. Ablation study of each component in the EMT framework.

APLF	CBLR	ESL	1%		5%		10%
APLF	CBLR	ESL	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀
√	√	√	24.2	43.8	36.8	62.0	38.4	62.8
×	√	√	21.3	38.5	35.9	60.0	36.5	60.7
√	×	√	21.2	39.8	34.2	58.5	35.7	59.9
√√	√	×	23.9	43.6	35.9	60.9	37.2	61.5
√	×	×	20.4	38.9	32.3	56.9	34.0	58.3
×	√	×	20.4	38.4	33.5	58.0	35.3	59.4
×	×	√	15.2	30.4	30.6	53.8	32.9	56.9
×	×	×	15.0	29.7	29.9	53.3	33.3	57.4

Highest result is shown in bold. “√” indicates the corresponding module is used, and “×” means it is not used.

Table 5. Ablation study of pseudo-label filtering strategies.

Method	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP_50:95	mAP₅₀
APLF	77.8	45.7	24.6	10.8	47.5	72.6	72.1	90.4	1.7	58.6	23.3	29.5	43.1	37.6	21.6	24.2	43.8
Ablation SSRE	77.6	34.2	24.1	12.6	16.2	65.8	70.2	89.4	0.0	39.7	22.3	32.0	48.6	32.7	23.3	23.3	41.3
Ablation DTF	77.6	40.7	24.9	14.7	46.8	61.3	71.6	90.4	1.0	30.6	16.8	30.9	40.4	31.7	21.6	22.7	40.1

Highest result is shown in bold.

Table 6. Comparison of Transformer backbones on DOTA.

Backbone Type	1%		5%		10%
Backbone Type	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀	mAP_50:95	mAP₅₀
Resnet50 (sup)	17.6	32.3	29.5	52.4	31.6	55.2
Resnet50 (PCT)	20.8	40.9	34.5	60.2	35.8	61.7
Resnet50 (EMT)	24.2	43.8	36.8	62.0	38.4	62.8
Swin (sup)	18.8	36.3	30.4	56.1	31.5	56.5
Swin (PCT)	21.2	41.3	31.9	59.1	35.7	62.3
Swin (EMT)	28.4	51.3	39.3	65.1	47.7	75.5
PVT (sup)	17.7	33.7	28.9	51.6	30.4	53.8
PVT (EMT)	23.9	43.9	33.6	58.2	34.6	59.1

Highest result is shown in bold for each backbone.

Table 7. Inference speed and GPU memory usage comparison on DIOR.

Model	Inference Speed (FPS)		CUDA Memory (MB)
Model	AMP On	AMP Off	AMP On	AMP Off
EMT 1%	27.3	17.4	604	655
EMT 5%	27.3	17.4	604	655
EMT 10%	27.3	17.3	604	655
supervised 1%	26.0	17.0	604	655
supervised 5%	27.4	17.3	604	655
supervised 10%	26.4	17.1	604	655
Oracle Faster RCNN	27.3	17.4	604	655

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lu, J.; Chen, H.; Gao, P.; Wang, Y. Error Mitigation Teacher for Semi-Supervised Remote Sensing Object Detection. Remote Sens. 2025, 17, 2592. https://doi.org/10.3390/rs17152592

AMA Style

Lu J, Chen H, Gao P, Wang Y. Error Mitigation Teacher for Semi-Supervised Remote Sensing Object Detection. Remote Sensing. 2025; 17(15):2592. https://doi.org/10.3390/rs17152592

Chicago/Turabian Style

Lu, Junhong, Hao Chen, Pengfei Gao, and Yu Wang. 2025. "Error Mitigation Teacher for Semi-Supervised Remote Sensing Object Detection" Remote Sensing 17, no. 15: 2592. https://doi.org/10.3390/rs17152592

APA Style

Lu, J., Chen, H., Gao, P., & Wang, Y. (2025). Error Mitigation Teacher for Semi-Supervised Remote Sensing Object Detection. Remote Sensing, 17(15), 2592. https://doi.org/10.3390/rs17152592

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Error Mitigation Teacher for Semi-Supervised Remote Sensing Object Detection

Abstract

1. Introduction

2. Related Work

2.1. Remote Sensing Object Detection

2.2. Semi-Supervised Object Detection

3. Materials and Methods

3.1. Overview

3.2. Adaptive Pseudo-Label Filtering

3.2.1. Second-Stage RCNN Evaluation

3.2.2. Dynamic Threshold Filtering

3.3. Confidence-Based Loss Reweighting

3.4. Enhanced Supervised Learning

3.5. Overall Training Objective

4. Experimental Results

4.1. Datasets

4.2. Training Protocol

4.3. Evaluation Metrics

4.4. Implementation Details

4.5. Comparison with the State-of-the-Art Methods

4.5.1. Analysis Across All Categories

4.5.2. Analysis with Specific Categories

4.6. Ablation Studies

4.6.1. Ablation Studies of Each Component in the EMT

4.6.2. Ablation Studies of Each Component in the APLF

4.7. Analysis of Transformer Backbone

4.8. Visualization of Predictions in Test Set

4.9. Visualization of Pseudo-Labels

4.10. Inference Cost Analysis

5. Discussion

5.1. Computational Efficiency and Complexity

5.2. Comparison Between CNN-Based and Transformer-Based Detectors

6. Limitations and Future Work

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI