Article

Multi-Granularity Domain-Adaptive Teacher for Unsupervised Remote Sensing Object Detection

1 School of Computer Science, China University of Geosciences, Wuhan 430074, China
2 Engineering Research Center of Natural Resource Information Management Digital Twin Engineering Software, Ministry of Education, Wuhan 430074, China
3 Wuhan Huitong Zhiyun Information Technology, Wuhan 430200, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(10), 1743; https://doi.org/10.3390/rs17101743
Submission received: 24 March 2025 / Revised: 13 May 2025 / Accepted: 15 May 2025 / Published: 16 May 2025

Abstract: Object detection in remote sensing images (RSIs) is pivotal for various tasks such as natural disaster warning, environmental monitoring, and urban planning. Object detection methods based on domain adaptation have emerged that effectively decrease the dependence on annotated samples, making significant advances in unsupervised scenarios. However, these methods fall short in their ability to learn remote sensing object features of the target domain, thus limiting detection capabilities in many complex scenarios. To fill this gap, this paper integrates a multi-granularity feature alignment strategy and the teacher–student framework to enhance the capability of detecting remote sensing objects, and proposes a multi-granularity domain-adaptive teacher (MGDAT) framework to better bridge the feature gap between source and target domain data. MGDAT incorporates the teacher–student framework at three granularities, including pixel-, image- and instance-level feature alignment. Extensive experiments show that MGDAT surpasses state-of-the-art baselines in detection accuracy and exhibits strong generalizability. The proposed method can serve as a methodological reference for various unsupervised interpretation tasks on RSIs.

1. Introduction

Object detection in remote sensing images (RSIs) refers to detecting whether a given RSI contains some objects in specified categories and locating the objects in the image [1]. As the basic task of RSI applications, object detection of RSIs holds substantial importance in various applications, including geographic information updating, geological disaster recognition, land cover mapping and urban planning [1,2].
Benefiting from the advances of deep learning [3,4,5,6,7,8,9,10], fully supervised object detection methods for RSIs have experienced rapid development [11,12,13,14,15,16,17,18,19,20,21]. For example, Sharifuzzaman Sagar et al. [22] designed a comprehensive method, multiscale-attention R-CNN, to improve RSI object detection and scene understanding. Xiao et al. [23] introduced a network that decouples features and refines localization to address the challenge of objects in RSIs exhibiting significant variations in scale and arbitrary orientations. The above methods have greatly facilitated the capability of object detection in RSIs. However, they employ the fully supervised machine learning paradigm that requires extensive annotated datasets for model training, thus suffering from limitations in terms of scalability and practicality.
Domain adaptation (DA)-based methods utilize both unlabeled samples in the target domain and labeled samples in the source domain to train their models, which effectively decreases the dependence on annotated samples. They have been extended to the RSI object detection task [24,25,26,27,28,29,30,31]. These methods reduce the feature gap between the source and target domains with feature alignment strategies, thereby improving model generalization in detecting remote sensing objects. However, they are still deficient in learning remote sensing object features in the target domain, thus limiting detection capabilities in many complex images.
To tackle this issue, this paper integrates a multi-granularity feature alignment strategy and the teacher–student model [32] to improve the detection of remote sensing objects, and proposes a multi-granularity domain-adaptive teacher (MGDAT) method to better bridge the feature gap between source and target domain data. MGDAT incorporates the teacher–student model at three granularities through the pixel-, image- and instance-level alignment modules, namely PFA, IMFA and INFA, respectively. Specifically, the PFA module employs two sets of pixel-level domain discriminators and gradient reversal layer (GRL) components to improve the detection of small objects. The IMFA module performs discrimination on the deeper feature map of the RSI, which incorporates global features for alignment to identify objects against complex backgrounds. The INFA module aligns the features of object instances detected by RPN and ROI to guide the discriminator to capture object instances. To the best of our knowledge, this is the first effort that incorporates multi-granularity feature alignment and the teacher–student framework to improve object detection in RSIs. The main contributions of this paper are delineated as follows:
  • This study proposes a method to improve the detection of remote sensing objects by incorporating three levels of feature alignment and the teacher–student framework.
  • A multi-granularity domain-adaptive teacher (MGDAT) method is developed, which designs three modules, including PFA, IMFA and INFA, to be incorporated with the domain-adaptive teacher to map features between the source domain and the target domain. The method helps domain feature alignment while mitigating the adverse effects of complex backgrounds, providing a solution for various RSI interpretation tasks.
  • Experimental results show that the proposed method outperforms the state-of-the-art methods in terms of detection accuracy, which has advantages in object detection for varying object sizes and complex backgrounds.

2. Related Work

2.1. Object Detection of RSIs

Traditional object detection methods for RSIs primarily comprise template matching methods and prior knowledge methods. The template matching methods rely on hand-crafted templates of a specific object class, and achieve object detection by calculating the similarity between those templates and regions in the image [33,34,35]. For example, Zhou et al. [33] used orthogonal and parallel relationships with road directions to design templates for road detection in RSIs. The prior knowledge methods utilize geometric and spatial contextual knowledge to detect objects [36,37,38]. For instance, Irvin and McKeown [36] determined the position and geometry of buildings by examining the correlation between artificial structures and the shadows they cast. These traditional methods are based on manually designed features, which limits their accuracy and generalization.
Machine learning approaches treat object detection as a classification task by searching for regions of interest in an image that may contain objects and extracting features from them [39,40,41,42]. For instance, Shi et al. [40] achieve ship detection in RSIs by combining HOG features and circle frequency features. Recently, deep learning-based methods have rapidly advanced the object detection of RSIs [39]. These methods design detectors based on convolutional neural networks (CNNs) and combine them with various strategies to improve detection accuracy. For instance, Zhang et al. [11] enhance object features in RSIs by extracting multiscale convolutional features and leveraging multiple fully connected layer features, thereby elevating the accuracy of the detector. These methods are grounded in the fully supervised paradigm [11,12,13,14,15,16,17,18,19], which relies heavily on large amounts of data with precise annotations. In practice, annotating RSIs is time-consuming and laborious. Furthermore, these methods assume that the feature distributions of the training set and the inference data are identical. However, due to a variety of factors, the distribution of features in the inference data often differs significantly from that of the training dataset in practice, making it extremely difficult to accurately detect the objects in RSIs.
In recent years, Transformer-based object detection methods have developed rapidly and have been increasingly applied to RSI object detection tasks. For instance, Zheng et al. [43] incorporated Transformers into a lightweight Feature Pyramid Network (FPN) to enhance the semantic representation of feature maps, thereby improving detection performance. Similarly, Zhang et al. [44] introduced a parallel Transformer branch alongside the backbone network to improve the CNN’s capacity for capturing global features. Although these methods have achieved performance improvements, they still fail to overcome the drawbacks inherent in fully supervised training, being constrained by label scarcity and limited generalization capability.

2.2. DA-Based Object Detection

In the computer vision field, DA object detection methods reduce the reliance on extensive annotated data and enhance model performance in detecting objects in images whose feature distributions differ from the labeled set [45]. These methods usually align features between labeled samples of the source domain and unlabeled samples of the target domain during model training. For example, some approaches adopt adversarial learning strategies, employing a domain discriminator and a GRL to enable the feature extractor to capture invariant features in the source and target domain samples [46]. Some methods employ unpaired image-to-image translation algorithms, such as Cycle-GAN [47], to reduce the feature disparity between source and target domain data at the image level [48,49,50]. Additionally, some methods leverage the teacher–student framework, which effectively alleviates the domain gap [51,52,53]. The teacher–student framework enables better utilization of the features from unlabeled target domain data, thereby achieving superior domain adaptation performance.
Unsupervised DA methods have been extended to the object detection of RSIs [24,25,26,27,28,29,30,31]. For example, Chen et al. [24] tackled the specificity of RSIs by achieving prototype-level feature alignment with a relation-aware graph. Zhu et al. [26] employed two feature alignment strategies for low- and high-level feature alignment to optimize domain feature alignment. Koga et al. [28] employed adversarial learning to align predictions in class confidences and object locations simultaneously. Zhu et al. [27] designed double heads to mitigate the impact of biased information. Biswas et al. [54] presented a DA remote sensing object detection method utilizing contrastive learning, enhancing the feature extraction.
However, the previous methods do not sufficiently align RSI features between the source and target domains, such as the straightforward instance-level features of detection objects. This paper introduces multi-granularity feature alignment and the teacher–student framework to improve the detection of objects for RSIs.

3. Methods

In this paper, $\{(x_i^s, b_i^s, c_i^s)\}_{i=1}^{N_s}$ represents the set of labeled samples in the source domain, where $b_i^s$ represents the bounding-box annotations, $c_i^s$ represents the corresponding category labels for the source domain image $x_i^s$, and $N_s$ denotes the number of images in the source domain; $\{x_i^t\}_{i=1}^{N_t}$ denotes the set of $N_t$ unlabeled images in the target domain. Meanwhile, $S$ and $T$ denote the source domain and target domain, respectively.
The proposed method, named MGDAT, employs multi-granularity feature alignment to enhance the detection of remote sensing objects, as illustrated in Figure 1. MGDAT is derived from the teacher–student framework and designs three feature alignment modules: PFA, IMFA and INFA. The PFA module aligns features in two-scale feature maps at the pixel level. The IMFA module focuses on achieving global feature alignment. The INFA module aligns the features of object instances detected with the Region Proposal Network (RPN) and regions of interest (ROIs). Note that MGDAT does not divide a remote sensing image into instances and compare features instance by instance; rather, the model treats each remote sensing image as a sample and aligns all instances in the image jointly. The alignment of features at each granularity is achieved by domain discriminators and GRLs.
In the proposed method, the source domain data are input to the student model for supervised training, which optimizes the model parameters based on RSIs and their corresponding labels from the source domain. Meanwhile, the target domain data are input to both the teacher model and the student model, where the teacher model generates pseudo-labels to support the training of the student model.
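As noted above, each alignment module relies on a gradient reversal layer placed between the feature extractor and its domain discriminator. The snippet below is a minimal PyTorch sketch of such a GRL; the class name, helper function, and the `lambda_` scaling factor are illustrative assumptions rather than the paper's actual implementation.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_=1.0):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flipping the sign makes the feature extractor learn to fool the domain discriminator.
        return -ctx.lambda_ * grad_output, None


def grad_reverse(x, lambda_=1.0):
    return GradReverse.apply(x, lambda_)
```

During training, features would be passed through `grad_reverse` before entering a discriminator, so minimizing the discriminator loss simultaneously pushes the backbone toward domain-invariant features.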

3.1. Pixel-Level Feature Alignment Module (PFA)

As illustrated in Figure 1a, PFA is an adversarial learning module comprising two sets of pixel-level domain discriminators and GRL components, aiming to improve feature alignment for objects with diverse sizes. The module aligns pixel-wise features by adversarial learning on two downsampled feature maps, denoted as $f_1$ and $f_2$. The two feature maps are fed into the domain discriminator $D_{pix}$, which generates a score between 0 and 1 for each pixel to determine which domain it belongs to. A score close to 0 indicates a closer match with the source domain, while a score close to 1 indicates a closer match with the target domain. The pixel-level adversarial loss is computed from these scores and the domains the features truly belong to, and the GRLs are used to optimize the feature extractor toward extracting domain-invariant features. The loss function adopts binary cross-entropy loss, formulated as follows:
$\mathcal{L}_{D_{pix}}^{s} = -\frac{1}{N_s W_1 H_1}\sum_{i=1}^{N_s}\sum_{w=1}^{W_1}\sum_{h=1}^{H_1}\log\left(1 - D_{pix}(f_{1i}^{s})\right) - \frac{1}{N_s W_2 H_2}\sum_{i=1}^{N_s}\sum_{w=1}^{W_2}\sum_{h=1}^{H_2}\log\left(1 - D_{pix}(f_{2i}^{s})\right),$
$\mathcal{L}_{D_{pix}}^{t} = -\frac{1}{N_t W_1 H_1}\sum_{i=1}^{N_t}\sum_{w=1}^{W_1}\sum_{h=1}^{H_1}\log D_{pix}(f_{1i}^{t}) - \frac{1}{N_t W_2 H_2}\sum_{i=1}^{N_t}\sum_{w=1}^{W_2}\sum_{h=1}^{H_2}\log D_{pix}(f_{2i}^{t}),$
$\mathcal{L}_{D_{pix}} = \mathcal{L}_{D_{pix}}^{s} + \mathcal{L}_{D_{pix}}^{t},$
where $W_1$, $H_1$, $W_2$ and $H_2$ denote the widths and heights of the feature maps, respectively; $f_{1i}^{s}$, $f_{1i}^{t}$, $f_{2i}^{s}$ and $f_{2i}^{t}$ denote the $f_1$ and $f_2$ feature maps of the i-th image in the source and target domains, respectively.
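For illustration, the following PyTorch sketch shows one possible realization of $D_{pix}$ and the loss above. The layer widths, the class and function names, and the assumption that the feature maps have already passed through a GRL are all ours, not details confirmed by the paper.

```python
import torch
import torch.nn as nn


class PixelDiscriminator(nn.Module):
    """Fully convolutional discriminator that outputs one domain score per pixel."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(256, 128, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # per-pixel score in (0, 1): 0 -> source-like, 1 -> target-like
        )

    def forward(self, x):
        return self.net(x)


def pixel_adv_loss(d_pix, feats_src, feats_tgt, eps=1e-6):
    """BCE-style pixel-level loss over the f1/f2 maps: source pixels are pushed toward 0,
    target pixels toward 1. feats_src/feats_tgt are lists such as [f1_s, f2_s] / [f1_t, f2_t]."""
    loss = 0.0
    for f_s in feats_src:
        loss = loss + (-torch.log(1.0 - d_pix(f_s) + eps)).mean()
    for f_t in feats_tgt:
        loss = loss + (-torch.log(d_pix(f_t) + eps)).mean()
    return loss
```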

3.2. Image-Level Feature Alignment Module (IMFA)

To minimize the negative impact of complex backgrounds on feature alignment, the IMFA module identifies remote sensing objects based on the feature map that presents semantic information from the entire image. As illustrated in Figure 1b, the feature map with a larger downsampling factor, $f_{img}$, is input to the image-level domain discriminator, $D_{img}$, to assign a score within the range of 0 to 1. The closer the score is to 0, the more the image matches the source domain, and vice versa. $D_{img}$ and the GRLs guide the feature extractor to extract domain-invariant features. Inspired by SWDA [55], Focal Loss [56] is employed as the loss function for image-level adversarial learning to balance the impact of different features on the model parameters. The loss function of this module is defined as follows:
$\mathcal{L}_{D_{img}}^{s} = -\frac{1}{N_s}\sum_{i=1}^{N_s}\left(D_{img}(f_{img,i}^{s})\right)^{\gamma}\log\left(1 - D_{img}(f_{img,i}^{s})\right),$
$\mathcal{L}_{D_{img}}^{t} = -\frac{1}{N_t}\sum_{i=1}^{N_t}\left(1 - D_{img}(f_{img,i}^{t})\right)^{\gamma}\log D_{img}(f_{img,i}^{t}),$
$\mathcal{L}_{D_{img}} = \mathcal{L}_{D_{img}}^{s} + \mathcal{L}_{D_{img}}^{t},$
where $\gamma$ is a hyperparameter to regulate the weight of difficult-to-discriminate samples; $f_{img,i}^{s}$ and $f_{img,i}^{t}$ represent the $f_{img}$ feature map generated by the i-th image in the source domain and target domain, respectively.
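A minimal sketch of the focal-style image-level loss is given below, assuming $D_{img}$ returns one score per image (after a GRL). The function name and the default value of `gamma` are illustrative assumptions rather than the paper's settings.

```python
import torch


def image_adv_loss(score_src, score_tgt, gamma=5.0, eps=1e-6):
    """Focal-style image-level adversarial loss. score_src / score_tgt are (N,) tensors of
    D_img outputs in (0, 1). Source images are pushed toward 0 and target images toward 1;
    hard-to-discriminate images receive larger weights via the gamma exponent."""
    loss_s = -(((score_src + eps) ** gamma) * torch.log(1.0 - score_src + eps)).mean()
    loss_t = -((((1.0 - score_tgt) + eps) ** gamma) * torch.log(score_tgt + eps)).mean()
    return loss_s + loss_t
```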

3.3. Instance-Level Feature Alignment Module (INFA)

INFA aims to align the features of object instances detected by the RPN and ROI, which encourages the domain discriminator to concentrate on object features while ignoring background features. As illustrated in Figure 1c, after the feature map $f_{img}$ is input to the RPN and ROI, object instances are detected. These instances are fed into an adversarial learning network, consisting of an instance-level domain discriminator $D_{ins}$ and a GRL, to align instance-level features. $D_{ins}$ is designed to maximize the preservation of detected instance features during feature alignment. The formula of the loss function for instance-level feature alignment is given below:
$\mathcal{L}_{D_{ins}}^{s} = -\frac{1}{N_s^{ins} W_{ins} H_{ins}}\sum_{i=1}^{N_s^{ins}}\sum_{w=1}^{W_{ins}}\sum_{h=1}^{H_{ins}}\log\left(1 - D_{ins}(F(x_i^{s}))\right),$
$\mathcal{L}_{D_{ins}}^{t} = -\frac{1}{N_t^{ins} W_{ins} H_{ins}}\sum_{i=1}^{N_t^{ins}}\sum_{w=1}^{W_{ins}}\sum_{h=1}^{H_{ins}}\log D_{ins}(F(x_i^{t})),$
$\mathcal{L}_{D_{ins}} = \mathcal{L}_{D_{ins}}^{s} + \mathcal{L}_{D_{ins}}^{t},$
where $F(x_i^{s})$ and $F(x_i^{t})$ represent the object instances detected by the Faster R-CNN [57] network in the source and target domain images, respectively; $W_{ins}$ and $H_{ins}$ denote the width and height of the detected objects in the feature map, respectively.
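For illustration, the sketch below shows one way to score ROI-pooled instance features with a small fully connected discriminator, matching the role of $D_{ins}$ above. The layer sizes and names are illustrative assumptions, and in the full pipeline the instance features would first pass through a GRL.

```python
import torch
import torch.nn as nn


class InstanceDiscriminator(nn.Module):
    """Scores each ROI-pooled instance feature vector with a domain probability in (0, 1)."""

    def __init__(self, in_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 1),
            nn.Sigmoid(),
        )

    def forward(self, roi_feats):  # roi_feats: (num_instances, in_dim)
        return self.net(roi_feats).squeeze(-1)


def instance_adv_loss(d_ins, roi_src, roi_tgt, eps=1e-6):
    """Source instances are pushed toward 0, target instances toward 1."""
    loss_s = -torch.log(1.0 - d_ins(roi_src) + eps).mean()
    loss_t = -torch.log(d_ins(roi_tgt) + eps).mean()
    return loss_s + loss_t
```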

3.4. Model Training

Following the training strategy of the teacher–student framework, the proposed method is trained in two stages: an initial burn-up stage, followed by a teacher–student collaborative training stage. During the burn-up phase, the student model is trained in a supervised manner on source domain data to learn the model parameters, which are then propagated to the teacher model. In the teacher–student collaborative training phase, as shown in Figure 1, images in both the source and target domains are strongly augmented before being fed into the student model. Moreover, the weakly augmented target domain data are also fed into the teacher model. For the source domain data, the detection results of the student model along with their corresponding labels are employed for supervised training. For the target domain data, the detection results generated by the teacher model serve as pseudo-labels $P_i^t$, along with the results detected by the student model, for unsupervised training. $P_i^t = (\hat{b}_i^t, \hat{c}_i^t)$ represents the pseudo-label of the i-th image in the target domain, comprising the bounding box $\hat{b}$ and its corresponding category $\hat{c}$. The total loss of this module consists of a supervised loss and an unsupervised loss. The supervised loss function is as follows:
$\mathcal{L}_{det}^{sup} = \mathcal{L}_{reg}(F(x^{s}), b^{s}) + \mathcal{L}_{cls}(F(x^{s}), c^{s}),$
where $\mathcal{L}_{reg}$ represents the regression loss, and $\mathcal{L}_{cls}$ represents the classification loss.
For unlabeled images in the target domain, the unsupervised loss using pseudo-labels as supervision signals is as follows:
$\mathcal{L}_{det}^{unsup} = \mathcal{L}_{reg}(F(x^{t}), \hat{b}^{t}) + \mathcal{L}_{cls}(F(x^{t}), \hat{c}^{t}),$
where $\mathcal{L}_{reg}$ represents the regression loss; $\mathcal{L}_{cls}$ represents the classification loss.
Throughout the entire training phase, the student model continuously refines its model parameters through backpropagation. To enhance the quality of pseudo-labels produced by the teacher model, exponential moving average (EMA) [32] is employed to update the parameters of the teacher model as follows:
$\theta_t \leftarrow \alpha \theta_t + (1 - \alpha)\theta_s,$
where $\theta_t$ represents the parameters of the teacher model; $\theta_s$ represents the parameters of the student model; and $\alpha$ is the parameter update rate.
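A minimal PyTorch sketch of this EMA update is shown below, using the $\alpha = 0.9996$ reported in Section 4.2 and assuming the teacher and student share the same architecture; the function name is an illustrative assumption.

```python
import torch


@torch.no_grad()
def ema_update(teacher, student, alpha=0.9996):
    """theta_t <- alpha * theta_t + (1 - alpha) * theta_s, applied after each student update."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```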
The total loss of the proposed method is calculated as the weighted sum of $\mathcal{L}_{det}^{sup}$, $\mathcal{L}_{det}^{unsup}$, $\mathcal{L}_{D_{pix}}$, $\mathcal{L}_{D_{img}}$ and $\mathcal{L}_{D_{ins}}$ as follows:
$\mathcal{L}_{total} = \mathcal{L}_{det}^{sup} + \lambda_{unsup}\mathcal{L}_{det}^{unsup} + \lambda_{dis}\left(\mathcal{L}_{D_{pix}} + \mathcal{L}_{D_{img}} + \mathcal{L}_{D_{ins}}\right),$
where $\lambda_{unsup}$ and $\lambda_{dis}$ are hyperparameters to control the weights of the unsupervised loss and the adversarial learning loss, respectively.
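To make the weighting explicit, a minimal sketch of the combination is given below; the default weights follow the configuration reported in Section 4.2, and the function name is an illustrative assumption.

```python
def total_loss(l_sup, l_unsup, l_pix, l_img, l_ins, lambda_unsup=1.0, lambda_dis=0.001):
    """Weighted sum of the supervised, unsupervised and adversarial alignment losses."""
    return l_sup + lambda_unsup * l_unsup + lambda_dis * (l_pix + l_img + l_ins)
```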

4. Experiments and Results

4.1. Datasets

NWPU VHR-10 [58] is an extensively used dataset in RSI object detection tasks, which was released by Northwestern Polytechnical University (NWPU) in 2014. This dataset contains 800 images of 10 types of remote sensing objects. The images are derived from two sources: 715 images with a Ground Sampling Distance (GSD) of 0.5 to 2 m from Google Earth, and 85 images with a GSD of 0.08 m from the Vaihingen dataset [59]. Of the 800 images, 650 are labeled as positive samples containing at least one remote sensing object, and the remaining 150 are labeled as negative samples containing no remote sensing objects. During the experiment, the positive sample images are randomly assigned to the training, validation, and test sets in a ratio of 14:6:5.
DIOR [60] is a large-scale benchmark dataset specifically designed for RSI object detection. It consists of RSIs with a resolution of 800 × 800 pixels, covering more than 80 countries worldwide. These images were collected under various climate conditions, seasonal changes, and different imaging environments. There are 20 remote sensing object categories in this dataset, and 10 of these categories overlap with the NWPU VHR-10 dataset. To facilitate the experiments, the 10 categories from NWPU VHR-10 and their corresponding labels are extracted from the DIOR dataset, forming a subset named DIOR* containing 6997 RSIs and 50,410 objects. The DIOR* dataset is used in our experiments, in which the training and validation sets consist of 5597 RSIs, and the test set consists of 1400 RSIs. The dataset exhibits significant inter-class similarity and intra-class variation, and the backgrounds are complex, posing severe challenges for remote sensing object detection [60].
HRRSD [11] is widely used for RSI object detection, covering 13 categories of remote sensing objects. The RSIs are acquired from Google Earth and Baidu Maps, with a GSD ranging from 0.15 m to 1.2 m. The 10 categories shared with NWPU VHR-10 are extracted from HRRSD to build a subset named HRRSD* for our experiments. This subset contains 17,016 RSIs, split into 13,612 RSIs for training and validation, and 3404 RSIs for testing. The images in HRRSD* have complex and diverse backgrounds with a wide range of object sizes, shapes, orientations, and densities, posing challenges for object detection. Table 1 details the distribution of instances for each category in the NWPU VHR-10, DIOR*, and HRRSD* datasets, while also reporting the total number of images and instances in these datasets.

4.2. Experimental Settings

In the proposed method, the teacher and student models utilize a Faster R-CNN with a ResNet101 backbone pre-trained on ImageNet to implement DA RSI object detection. The input image is downsampled at scales of 4× and 8×, outputting the feature maps $f_1$ and $f_2$, respectively. Within the proposed feature alignment modules of three granularities, $f_1$ and $f_2$ are employed for PFA. Moreover, $f_{img}$, the feature map downsampled at a scale of 16×, is fed into the RPN and the ROI generation to produce object instances for INFA. The hyperparameter $\alpha$ in EMA is set to 0.9996 as configured in Unbiased Teacher [32]. During the training process, pseudo-labels with confidence scores above 0.6 are recognized as valid annotations. Additionally, $\lambda_{unsup}$ and $\lambda_{dis}$ are set to 1.0 and 0.001, respectively. $D_{pix}$ consists of three convolutional layers with 3 × 3 kernels and LeakyReLU activation functions. $D_{img}$ is structured with three convolutional layers with 3 × 3 kernels, batch normalization, and linear layers. $D_{ins}$ is composed of linear layers and ReLU activation functions. These discriminators collectively form the adversarial learning network, aiming to achieve feature alignment at different granularities. The proposed method is implemented with Detectron2. During training, the initial learning rate is set to 0.00125, the batch size to 2, the momentum to 0.9, and the weight decay to 0.0001. The training process performs 24,000 iterations, of which 4000 iterations constitute the burn-up phase, accounting for one-sixth of the whole training.
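As an illustration of the pseudo-label filtering described above, the sketch below keeps only teacher predictions whose confidence exceeds 0.6; the dictionary layout of the predictions and the function name are assumptions, not the Detectron2 data structures actually used.

```python
def filter_pseudo_labels(teacher_outputs, conf_thresh=0.6):
    """teacher_outputs: list of per-image dicts with 'boxes' (N, 4), 'labels' (N,)
    and 'scores' (N,) tensors; predictions below the threshold are discarded."""
    pseudo_labels = []
    for out in teacher_outputs:
        keep = out["scores"] > conf_thresh
        pseudo_labels.append({"boxes": out["boxes"][keep], "labels": out["labels"][keep]})
    return pseudo_labels
```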

4.3. Evaluation Metrics

The mean average precision ($mAP$) is used as the evaluation metric in this paper. $mAP$ is the mean of the $AP$ for each class, where $AP$ is calculated from precision ($P$) and recall ($R$). They are formulated as follows:
$P = \frac{TP}{TP + FP},$
$R = \frac{TP}{TP + FN},$
$AP = \int_{0}^{1} P(R)\,dR,$
$mAP = \frac{1}{N_c}\sum_{c=1}^{N_c} AP_c,$
where $TP$, $FP$ and $FN$ denote True Positive, False Positive, and False Negative, respectively. A prediction is considered a $TP$ if its IoU with the ground truth exceeds 0.5; otherwise it is counted as an $FP$. An $FN$ means that an object appears in the ground truth but is not predicted by the model. $N_c$ is the total number of classes, and $AP_c$ denotes the $AP$ value of class $c$.
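For illustration, a simplified single-class AP computation following the definitions above is sketched below; it assumes detections have already been matched to ground truth at an IoU threshold of 0.5 and is not the official evaluation code.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """AP for one class: rank detections by confidence, accumulate TP/FP counts,
    and sum precision over recall increments (a discrete form of the P(R) integral)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-12)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```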

4.4. Baselines

Five methods were included for comparison to evaluate the effectiveness of the proposed method: SourceOnly, DA-Faster [61], AT [51], RFA-Net [26] and ConfMix [62]. Specifically, SourceOnly is a Faster R-CNN that is trained on the source domain and tested on the target domain. DA-Faster is an approach to cross-domain object detection, which enhances Faster R-CNN by addressing the domain shift at both the image level and the instance level. AT is based on the teacher–student framework, which utilizes domain adversarial learning and weak–strong data augmentation to decrease the domain gaps. RFA-Net employs different feature alignment strategies, and learns target domain knowledge through pseudo-label generation. ConfMix is a state-of-the-art DA method, which introduces a sample mixing strategy to learn an adaptive object detector.

4.5. Overall Results

Three sets of experiments are designed, namely from NWPU VHR-10 to DIOR*, NWPU VHR-10 to HRRSD*, and DIOR* to HRRSD*, to validate the performance of the proposed method.

4.5.1. Results of NWPU VHR-10 to DIOR*

In this experiment, the NWPU VHR-10 dataset is used as the source domain, and the DIOR* dataset serves as the target domain. The experimental results are presented in Table 2. The table shows that the proposed method achieved the highest mAP value of 55.07%, surpassing SourceOnly by 23.73%, DA-Faster by 17.28%, RFA-Net by 3.33%, AT by 6.29% and ConfMix by 1.02%. Additionally, the proposed method exhibits a low standard deviation of 29.12. Specifically, the proposed method achieved the highest AP values in 6 of the 10 categories, including “Vehicle”, “Ship”, “Airplane”, “Tennis court”, “Baseball field”, and “Bridge”. It should be noted that the objects in the categories “Storage tank”, “Ship” and “Airplane” exhibit significant feature variations and size differences. Furthermore, for the “Harbor” and “Basketball court” categories, the proposed method achieved the second-highest AP values. We argue that the proposed method leverages multi-granularity features to improve the effect of feature alignment, thus improving the detection accuracy.
To further illustrate the results, five samples are plotted in Figure 2. The first row of the figure shows that most baseline methods failed to detect the small vehicles in the image, whereas the proposed method identified them. As demonstrated in the second row, the SourceOnly, AT and RFA-Net methods did not identify the bridge object, while the proposed method recognized the bridge and provided a more precise location than DA-Faster. The objects in the third row of images have a significant size disparity. The baseline methods failed to detect some small-scale objects presented in the image, whereas the proposed method accurately identified them. As illustrated in the fourth and fifth rows of Figure 2, the proposed method detects objects with significant intra-class variations, such as ships and storage tanks with varying styles. In contrast, the baseline methods suffer from missed detections. The figure suggests that the proposed method is effective in detecting objects with varying sizes, and outperforms the baselines in detecting objects with large intra-class feature variations.

4.5.2. Results of NWPU VHR-10 to HRRSD*

In this experiment, the NWPU VHR-10 dataset serves as the source domain and the HRRSD* dataset is employed as the target domain. Due to significant differences in both the volume and features of the two datasets, this experiment is more challenging than the one from NWPU VHR-10 to DIOR*. As presented in Table 3, the proposed method shows outstanding performance in terms of mAP, exceeding SourceOnly by 27.17%, DA-Faster by 14.29%, RFA-Net by 4.37%, AT by 6.01% and ConfMix by 3.88%. Additionally, it achieves a low standard deviation of 24.67. Specifically, the proposed method shows remarkable improvement across all categories compared to the SourceOnly method. The proposed method achieved the highest AP values on the “Tennis court”, “Basketball court”, and “Bridge” categories, and the second-highest AP values on the “Vehicle”, “Storage tank”, “Ship”, “Airplane”, “Ground track field”, and “Baseball field” categories. This suggests that the proposed method alleviates domain shift, as evidenced by the accurate detection results on the target domain HRRSD* dataset.
Figure 3 illustrates the visualization results of this experiment. In the first row, the proposed method accurately identifies all the objects, while the baseline methods fail to detect or misidentify the airplanes due to their similar colors to the background. In the second row, the baseline methods struggle to detect objects accurately in complex scenes where they are similar to the background. In contrast, the proposed method successfully distinguishes objects from the background. As demonstrated in the third row of the figure, the proposed method can still accurately detect the basketball court under the image blurring caused by weather conditions. In the fourth and fifth rows, our method can detect densely distributed objects. These visualized samples further suggest that the proposed method can significantly mitigate the complex background noise, thus recognizing objects correctly.

4.5.3. Results of DIOR* to HRRSD*

The experiment of DIOR* to HRRSD* was conducted to further evaluate the proposed method. Compared to the NWPU VHR-10 dataset, the DIOR* dataset exhibits greater diversity in terms of viewing angle, occlusion, appearance, background, and object pose for each object type, which makes it more challenging to detect objects from the RSIs in this dataset. As depicted in Table 4, the proposed method achieves the highest mAP value of 54.47%, outperforming the SourceOnly approach by a margin of 15.31%, DA-Faster by 14.71%, RFA-Net by 1.76%, AT by 2.66% and ConfMix by 0.66%. In addition, it exhibits a low standard deviation of 30.32. Moreover, it attains the highest AP values in the “Storage tank”, “Airplane”, “Ground track field”, and “Bridge” categories, and the second-highest AP values in the “Vehicle” and “Ship” categories. The results indicate that the proposed method is promising and robust. By using multi-granularity feature alignment, our method is effective in mitigating the adverse effects of background and minimizing the feature discrepancies between source and target domain data.
Figure 4 illustrates some examples that further demonstrate the effectiveness of our method. The proposed method detects the objects more accurately than the baselines in these challenging scenarios. For instance, the proposed method accurately identifies airplanes that have a similar color and texture to the backgrounds, as presented in the first row of Figure 4. In addition, our method is superior in recognizing large objects, especially in distinguishing bridges, as evidenced by the samples in the second row. As shown in the third through fifth rows of the figure, the background greatly hampers object recognition, and most baseline methods fail to distinguish objects from irrelevant backgrounds. In contrast, the proposed method has the ability to reliably detect objects in these complex backgrounds. The visualizations show that the proposed method is capable of effectively distinguishing the background from the foreground and accurately identifying objects.

5. Discussion

5.1. Ablation Study

An ablation study was conducted on the experiment of NWPU VHR-10 to DIOR* to investigate the effectiveness of the three feature alignment modules. The experimental results are detailed in Table 5. As shown in the table, the proposed method integrates all three modules, which significantly improves the detection accuracy and yields the best mAP.
Removing any module, i.e., “w/o PFA”, “w/o IMFA” and “w/o INFA”, decreases the performance of the model. This suggests that all the modules at each granularity help to align features between the source and target domains, thus reducing domain gaps at different levels.
In addition, “w/o IMFA and INFA”, “w/o PFA and INFA” and “w/o PFA and IMFA” indicate the removal of two of the three key modules, resulting in lower detection accuracy compared to removing only one module, but the performance still surpasses the baselines. This further indicates that the detection performance of the model can be effectively enhanced by independently introducing any feature alignment module.
To further illustrate the impact of the three modules, eight samples are illustrated in Figure 5. Figure 5a shows that the PFA module enhances the model’s capability to detect smaller objects with the ability to adapt to size variations. As shown in the first row of the figure, the model fails to detect the small-sized “car” objects when the PFA module is removed. In the fourth row, which presents a scenario with significant size variation among objects, the model equipped with the PFA module accurately identifies all instances, whereas its absence leads to duplicate detections of the same objects. As shown in Figure 5b, the IMFA and INFA modules play a crucial role in mitigating the interference from complex backgrounds, allowing the model to accurately differentiate between objects and background regions. As shown in the first and second rows of the figure, models without the IMFA and INFA modules tend to misclassify background regions as objects. This issue becomes more pronounced in the fourth row, where the background exhibits substantial feature similarity to the objects.

5.2. Effect of $\lambda_{dis}$

The hyperparameter $\lambda_{dis}$ is used to control the weight of the adversarial learning losses of the three feature alignment modules, balancing the domain difference between the source and target domains. To investigate the effects of $\lambda_{dis}$, five sets of experiments were conducted by setting $\lambda_{dis}$ to 0.001, 0.005, 0.01, 0.05, and 0.1, respectively. As shown in Figure 6, the impact of the feature alignment modules throughout the training process increases as the $\lambda_{dis}$ value increases. The highest mAPs are obtained when the $\lambda_{dis}$ value is 0.01, indicating that this setting effectively balances the adversarial loss in the total loss function. As the value of $\lambda_{dis}$ increases further, the model performance gradually decreases. We argue that if the $\lambda_{dis}$ weight is set too small, the detection performance cannot be effectively improved; conversely, if it is set too large, the model overfits the adversarial objective, which weakens its ability to detect objects accurately.

5.3. Effect of $\lambda_{unsup}$

The hyperparameter $\lambda_{unsup}$ is used to weight the unsupervised loss from target domain data within the total loss. To verify the effect of $\lambda_{unsup}$, four sets of experiments were conducted by setting $\lambda_{unsup}$ to 0.1, 0.5, 1.0, and 1.5, respectively. As depicted in Figure 7, the proportion of the unsupervised loss in the total loss increases as the value of $\lambda_{unsup}$ increases. The highest mAPs are obtained when the $\lambda_{unsup}$ value is 1.0. This particular setting effectively balances the unsupervised loss in the total loss function. As the value of $\lambda_{unsup}$ increases further, the proportion of the unsupervised loss in the total loss function gradually increases, while the mAPs show a decreasing trend. We argue that proper weight settings can improve the overall performance, while weights that are too large or too small may affect the learning effect of the model.

5.4. Model Efficiency

In this section, the complexity and time cost of the model are examined. Figure 8 provides a visualization of the number of trainable parameters and the frames per second (FPS) for the proposed method and the comparative baselines.
The proposed method, despite requiring more parameters for model training compared to the baseline methods, achieves a higher FPS than most of them. This can be explained by the fact that the three feature alignment modules are only computed during the training phase and thus do not increase the computational cost of the inference process. Specifically, the FPS of the proposed method is 8.85, which is significantly higher than that of SourceOnly, DA-Faster and RFA-Net, and slightly lower than that of AT. Compared to ConfMix, the proposed method has a slight disadvantage in terms of FPS. This is attributed to the fact that ConfMix is derived from the single-stage detection model YOLOv5, which is characterized by a small number of parameters. We argue that the proposed method effectively maintains a commendable trade-off between accuracy and efficiency.

6. Conclusions

This paper proposes a novel unsupervised domain adaptation (UDA) method to improve object detection in RSIs. The proposed method utilizes multi-granularity features at the pixel, image and instance levels to align the features of the source and target domains. Extensive experiments suggest that the proposed method outperforms the baseline methods, achieving significant improvement in detection performance in challenging scenarios. This study provides a new attempt to combine multi-granularity feature alignment and the teacher–student framework, which can be used for various object detection scenarios with RSIs, such as building detection and ship detection.
In the future, asymmetric semi-supervised frameworks deserve to be further considered to generate more accurate pseudo-labels. In addition, enhanced model generalization using multi-source domain information is expected to contribute to unsupervised RSI object detection, and how to align multi-source domain features to optimize the model is a worthwhile research direction.

Author Contributions

Conceptualization, F.F. and S.L.; methodology, F.F. and J.K.; data curation, S.L., J.K. and C.L.; validation, F.F. and P.T.; formal analysis, F.F., Y.L. and S.Z.; writing—review and editing, F.F., J.K. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Hubei Key Laboratory of Intelligent Geo-Information Processing under the Open Research Project Grant KLIGIP-2023-B11, and was also partially funded by the National Natural Science Foundation of China, Grant No. 42371420.

Data Availability Statement

The NWPU VHR-10 and DIOR datasets can be obtained from https://gcheng-nwpu.github.io/ (accessed on 14 May 2025), and the HRRSD dataset can be obtained from https://github.com/CrazyStoneonRoad/TGRS-HRRSD-Dataset (accessed on 14 May 2025).

Conflicts of Interest

Author Chaoliang Luo was employed by the company Wuhan Huitong Zhiyun Information Technology. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
  2. Zheng, Z.; Lei, L.; Sun, H.; Kuang, G. A Review of Remote Sensing Image Object Detection Algorithms Based on Deep Learning. In Proceedings of the 2020 IEEE 5th International Conference on Image, Vision and Computing (ICIVC), Beijing, China, 10–12 July 2020; pp. 34–43. [Google Scholar] [CrossRef]
  3. Zhang, W.; Tang, P.; Zhao, L. Remote sensing image scene classification using CNN-CapsNet. Remote Sens. 2019, 11, 494. [Google Scholar] [CrossRef]
  4. Fang, F.; Zheng, K.; Li, S.; Xu, R.; Hao, Q.; Feng, Y.; Zhou, S. Incorporating Superpixel Context for Extracting Building From High-Resolution Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1176–1190. [Google Scholar] [CrossRef]
  5. Cheng, G.; Han, J.; Guo, L.; Liu, Z.; Bu, S.; Ren, J. Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4238–4249. [Google Scholar] [CrossRef]
  6. Liu, X.; He, J.; Yao, Y.; Zhang, J.; Liang, H.; Wang, H.; Hong, Y. Classifying urban land use by integrating remote sensing and social media data. Int. J. Geogr. Inf. Sci. 2017, 31, 1675–1696. [Google Scholar] [CrossRef]
  7. Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient transformer for remote sensing image segmentation. Remote Sens. 2021, 13, 3585. [Google Scholar] [CrossRef]
  8. Fang, F.; Xu, R.; Li, S.; Hao, Q.; Zheng, K.; Wu, K.; Wan, B. Semi-supervised building instance extraction from high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5619212. [Google Scholar] [CrossRef]
  9. Zhou, S.; Feng, Y.; Li, S.; Zheng, D.; Fang, F.; Liu, Y.; Wan, B. DSM-assisted unsupervised domain adaptive network for semantic segmentation of remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5608216. [Google Scholar] [CrossRef]
  10. Zheng, D.; Li, S.; Fang, F.; Zhang, J.; Feng, Y.; Wan, B.; Liu, Y. Utilizing Bounding Box Annotations for Weakly Supervised Building Extraction from Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4702517. [Google Scholar] [CrossRef]
  11. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  12. Shi, G.; Zhang, J.; Liu, J.; Zhang, C.; Zhou, C.; Yang, S. Global context-augmented objection detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 10604–10617. [Google Scholar] [CrossRef]
  13. Tian, Z.; Zhan, R.; Hu, J.; Wang, W.; He, Z.; Zhuang, Z. Generating anchor boxes based on attention mechanism for object detection in remote sensing images. Remote Sens. 2020, 12, 2416. [Google Scholar] [CrossRef]
  14. Yu, D.; Ji, S. A new spatial-oriented object detection framework for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4407416. [Google Scholar] [CrossRef]
  15. Li, X.; Deng, J.; Fang, Y. Few-shot object detection on remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601614. [Google Scholar] [CrossRef]
  16. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature split–merge–enhancement network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5616217. [Google Scholar] [CrossRef]
  17. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22. [Google Scholar] [CrossRef]
  18. Li, Q.; Chen, Y.; Zeng, Y. Transformer with transfer CNN for remote-sensing-image object detection. Remote Sens. 2022, 14, 984. [Google Scholar] [CrossRef]
  19. Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A context-aware detection network for objects in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024. [Google Scholar] [CrossRef]
  20. Sun, X.; Wang, P.; Wang, C.; Liu, Y.; Fu, K. PBNet: Part-based convolutional neural network for complex composite object detection in remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 173, 50–65. [Google Scholar] [CrossRef]
  21. Wang, C.; Bai, X.; Wang, S.; Zhou, J.; Ren, P. Multiscale visual attention networks for object detection in VHR remote sensing images. IEEE Geosci. Remote Sens. Lett. 2018, 16, 310–314. [Google Scholar] [CrossRef]
  22. Sharifuzzaman Sagar, A.; Chen, Y.; Xie, Y.; Kim, H.S. MSA R-CNN: A comprehensive approach to remote sensing object detection and scene understanding. Expert Syst. Appl. 2024, 241, 122788. [Google Scholar] [CrossRef]
  23. Xiao, J.; Yao, Y.; Zhou, J.; Guo, H.; Yu, Q.; Wang, Y.F. FDLR-Net: A feature decoupling and localization refinement network for object detection in remote sensing images. Expert Syst. Appl. 2023, 225, 120068. [Google Scholar] [CrossRef]
  24. Chen, Y.; Liu, Q.; Wang, T.; Wang, B.; Meng, X. Rotation-invariant and relation-aware cross-domain adaptation object detection network for optical remote sensing images. Remote Sens. 2021, 13, 4386. [Google Scholar] [CrossRef]
  25. Xu, T.; Sun, X.; Diao, W.; Zhao, L.; Fu, K.; Wang, H. FADA: Feature aligned domain adaptive object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5617916. [Google Scholar] [CrossRef]
  26. Zhu, Y.; Sun, X.; Diao, W.; Li, H.; Fu, K. Rfa-net: Reconstructed feature alignment network for domain adaptation object detection in remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5689–5703. [Google Scholar] [CrossRef]
  27. Zhu, Y.; Sun, X.; Diao, W.; Wei, H.; Fu, K. Dualda-net: Dual-head rectification for cross domain object detection of remote sensing. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5612616. [Google Scholar] [CrossRef]
  28. Koga, Y.; Miyazaki, H.; Shibasaki, R. Adapting Vehicle Detector to Target Domain by Adversarial Prediction Alignment. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2341–2344. [Google Scholar]
  29. Li, X.; Luo, M.; Ji, S.; Zhang, L.; Lu, M. Evaluating generative adversarial networks based image-level domain transfer for multi-source remote sensing image segmentation and object detection. Int. J. Remote Sens. 2020, 41, 7343–7367. [Google Scholar] [CrossRef]
  30. Chen, J.; Sun, J.; Li, Y.; Hou, C. Object detection in remote sensing images based on deep transfer learning. Multimed. Tools Appl. 2022, 81, 12093–12109. [Google Scholar] [CrossRef]
  31. Koga, Y.; Miyazaki, H.; Shibasaki, R. A method for vehicle detection in high-resolution satellite images that uses a region-based object detector and unsupervised domain adaptation. Remote Sens. 2020, 12, 575. [Google Scholar] [CrossRef]
  32. Liu, Y.C.; Ma, C.Y.; He, Z.; Kuo, C.C.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased Teacher for Semi-Supervised Object Detection. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021. [Google Scholar]
  33. Zhou, J.; Bischof, W.F.; Caelli, T. Road tracking in aerial images based on human–computer interaction and Bayesian filtering. ISPRS J. Photogramm. Remote Sens. 2006, 61, 108–124. [Google Scholar] [CrossRef]
  34. Kim, T.; Park, S.R.; Kim, M.G.; Jeong, S.; Kim, K.O. Tracking road centerlines from high resolution remote sensing images by least squares correlation matching. Photogramm. Eng. Remote Sens. 2004, 70, 1417–1422. [Google Scholar] [CrossRef]
  35. Zhang, J.; Lin, X.; Liu, Z.; Shen, J. Semi-automatic road tracking by template matching and distance transformation in urban areas. Int. J. Remote Sens. 2011, 32, 8331–8347. [Google Scholar] [CrossRef]
  36. Irvin, R.B.; McKeown, D.M. Methods for exploiting the relationship between buildings and their shadows in aerial imagery. IEEE Trans. Syst. Man. Cybern. 1989, 19, 1564–1575. [Google Scholar] [CrossRef]
  37. Huertas, A.; Nevatia, R. Detecting buildings in aerial images. Comput. Vis. Graph. Image Process. 1988, 41, 131–152. [Google Scholar] [CrossRef]
  38. Weidner, U.; Förstner, W. Towards automatic building extraction from high-resolution digital elevation models. ISPRS J. Photogramm. Remote Sens. 1995, 50, 38–49. [Google Scholar] [CrossRef]
  39. Li, Z.; Wang, Y.; Zhang, N.; Zhang, Y.; Zhao, Z.; Xu, D.; Ben, G.; Gao, Y. Deep learning-based object detection techniques for remote sensing images: A survey. Remote Sens. 2022, 14, 2385. [Google Scholar] [CrossRef]
  40. Shi, Z.; Yu, X.; Jiang, Z.; Li, B. Ship detection in high-resolution optical imagery based on anomaly detector and local shape feature. IEEE Trans. Geosci. Remote Sens. 2013, 52, 4511–4523. [Google Scholar]
  41. Zhang, W.; Sun, X.; Fu, K.; Wang, C.; Wang, H. Object detection in high-resolution remote sensing images using rotation invariant parts based model. IEEE Geosci. Remote Sens. Lett. 2013, 11, 74–78. [Google Scholar] [CrossRef]
  42. Zhang, W.; Sun, X.; Wang, H.; Fu, K. A generic discriminative part-based model for geospatial object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2015, 99, 30–44. [Google Scholar] [CrossRef]
  43. Zheng, Y.; Sun, P.; Zhou, Z.; Xu, W.; Ren, Q. ADT-Det: Adaptive dynamic refined single-stage transformer detector for arbitrary-oriented object detection in satellite optical imagery. Remote Sens. 2021, 13, 2623. [Google Scholar] [CrossRef]
  44. Zhang, Y.; Liu, X.; Wa, S.; Chen, S.; Ma, Q. GANsformer: A detection network for aerial images with high performance combining convolutional network and transformer. Remote Sens. 2022, 14, 923. [Google Scholar] [CrossRef]
  45. Oza, P.; Sindagi, V.A.; Sharmini, V.V.; Patel, V.M. Unsupervised domain adaptation of object detectors: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 14, 4018–4040. [Google Scholar] [CrossRef] [PubMed]
  46. Zheng, Y.; Huang, D.; Liu, S.; Wang, Y. Cross-domain object detection through coarse-to-fine feature adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13766–13775. [Google Scholar]
  47. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  48. Deng, J.; Li, W.; Chen, Y.; Duan, L. Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4091–4101. [Google Scholar]
  49. Zhang, D.; Li, J.; Xiong, L.; Lin, L.; Ye, M.; Yang, S. Cycle-consistent domain adaptive faster RCNN. IEEE Access 2019, 7, 123903–123911. [Google Scholar] [CrossRef]
  50. Arruda, V.F.; Paixao, T.M.; Berriel, R.F.; De Souza, A.F.; Badue, C.; Sebe, N.; Oliveira-Santos, T. Cross-domain car detection using unsupervised image-to-image translation: From day to night. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
  51. Li, Y.J.; Dai, X.; Ma, C.Y.; Liu, Y.C.; Chen, K.; Wu, B.; He, Z.; Kitani, K.; Vajda, P. Cross-domain adaptive teacher for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7581–7590. [Google Scholar]
  52. Cai, Q.; Pan, Y.; Ngo, C.W.; Tian, X.; Duan, L.; Yao, T. Exploring object relation in mean teacher for cross-domain detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11457–11466. [Google Scholar]
  53. Tang, S.; Cheng, Z.; Pu, S.; Guo, D.; Niu, Y.; Wu, F. Learning a domain classifier bank for unsupervised adaptive object detection. arXiv 2020, arXiv:2007.02595. [Google Scholar]
  54. Biswas, D.; Tešić, J. Domain Adaptation With Contrastive Learning for Object Detection in Satellite Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5620615. [Google Scholar] [CrossRef]
  55. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6956–6965. [Google Scholar]
  56. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  57. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef]
  58. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  59. Cramer, M. The DGPF-test on digital airborne camera evaluation overview and test design. Photogramm.-Fernerkund.-Geoinf. 2010, 2010, 73–82. [Google Scholar] [CrossRef]
  60. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  61. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  62. Mattolin, G.; Zanella, L.; Ricci, E.; Wang, Y. Confmix: Unsupervised domain adaptation for object detection via confidence-based mixing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 423–433. [Google Scholar]
Figure 1. An overview of MGDAT. MGDAT comprises three modules: (a) PFA, (b) IMFA, and (c) INFA.
Figure 2. The visualization results of different methods in the NWPU VHR-10 to DIOR* experiment. (a) SourceOnly. (b) DA-Faster. (c) AT. (d) RFA-Net. (e) ConfMix. (f) Ours. (g) Ground truth.
Figure 3. The visualization results of different methods in the NWPU VHR-10 to HRRSD* experiment. (a) SourceOnly. (b) DA-Faster. (c) AT. (d) RFA-Net. (e) ConfMix. (f) Ours. (g) Ground truth.
Figure 4. The visualization results of different methods in the DIOR* to HRRSD* experiment. (a) SourceOnly. (b) DA-Faster. (c) AT. (d) RFA-Net. (e) ConfMix. (f) Ours. (g) Ground truth.
Figure 5. Visualization results of the ablation study. (a) Visualization results of the role of PFA. (b) Visualization results of the role of IMFA and INFA.
Figure 6. Effects of $\lambda_{dis}$.
Figure 7. Effects of $\lambda_{unsup}$.
Figure 8. Comparisons of the trainable parameters and FPS of different methods.
Table 1. Image and instance statistics of NWPU VHR-10, DIOR* and HRRSD*.
Dataset | NWPU VHR-10 | DIOR* | HRRSD*
Vehicle | 598 | 11,582 | 4756
Ship | 302 | 23,319 | 3975
Harbor | 224 | 2041 | 3902
Airplane | 757 | 1712 | 4901
Ground track field | 163 | 998 | 4033
Tennis court | 524 | 4142 | 4402
Storage tank | 655 | 2502 | 4424
Bridge | 124 | 1122 | 4651
Basketball court | 159 | 909 | 4064
Baseball field | 390 | 2083 | 4042
Image number | 650 | 6997 | 17,016
Instance number | 3896 | 50,410 | 43,168
Table 2. Comparison of the different detection methods on NWPU VHR-10 to DIOR*. The best results are highlighted in bold and the second-best results are underlined.
Method | Vehicle | Storage Tank | Ship | Harbor | Airplane | Ground Track Field | Tennis Court | Basketball Court | Baseball Field | Bridge | mAP
SourceOnly | 25.62 | 10.33 | 5.80 | 4.38 | 50.67 | 44.65 | 71.97 | 25.61 | 72.72 | 1.70 | 31.34
DA-Faster | 23.21 | 16.78 | 3.92 | 10.94 | 54.66 | 64.28 | 77.71 | 41.42 | 80.95 | 4.02 | 37.79
RFA-Net | 37.93 | 42.78 | 15.98 | 17.94 | 80.47 | 70.29 | 88.62 | 66.86 | 87.95 | 8.64 | 51.74
AT | 35.55 | 60.92 | 12.11 | 6.76 | 73.18 | 63.59 | 90.53 | 43.79 | 90.32 | 11.08 | 48.78
ConfMix | 37.28 | 66.61 | 20.62 | 17.73 | 77.94 | 63.52 | 91.76 | 62.66 | 92.35 | 10.04 | 54.05
Proposed method | 39.75 | 64.83 | 21.88 | 14.78 | 82.13 | 63.91 | 92.68 | 61.93 | 94.11 | 14.67 | 55.07
Table 3. Comparison of the different detection methods on NWPU VHR-10 to HRRSD*. The best results are highlighted in bold and the second-best results are underlined.
Method | Vehicle | Storage Tank | Ship | Harbor | Airplane | Ground Track Field | Tennis Court | Basketball Court | Baseball Field | Bridge | mAP
SourceOnly | 56.95 | 12.49 | 24.29 | 21.74 | 67.38 | 21.83 | 50.37 | 16.42 | 10.42 | 12.57 | 29.45
DA-Faster | 72.51 | 14.92 | 35.27 | 47.55 | 71.64 | 66.57 | 55.83 | 17.82 | 15.50 | 25.72 | 42.33
RFA-Net | 46.11 | 65.86 | 43.09 | 56.19 | 88.15 | 70.29 | 64.34 | 22.06 | 19.25 | 47.16 | 52.25
AT | 56.45 | 93.02 | 39.21 | 20.56 | 70.11 | 83.82 | 67.27 | 25.00 | 16.59 | 34.05 | 50.61
ConfMix | 69.91 | 73.25 | 34.66 | 37.93 | 73.21 | 67.08 | 72.18 | 29.24 | 16.93 | 53.05 | 52.74
Proposed method | 70.48 | 92.06 | 41.70 | 26.04 | 80.77 | 74.65 | 76.36 | 31.78 | 17.08 | 55.30 | 56.62
Table 4. Comparison of the different detection methods on DIOR* to HRRSD*. The best results are highlighted in bold and the second-best results are underlined.
Method | Vehicle | Storage Tank | Ship | Harbor | Airplane | Ground Track Field | Tennis Court | Basketball Court | Baseball Field | Bridge | mAP
SourceOnly | 17.85 | 79.00 | 19.69 | 21.87 | 86.02 | 57.80 | 66.75 | 14.14 | 10.57 | 17.87 | 39.16
DA-Faster | 21.63 | 88.61 | 9.32 | 30.95 | 82.46 | 58.04 | 66.15 | 14.21 | 11.52 | 14.66 | 39.76
RFA-Net | 27.17 | 83.21 | 52.58 | 36.84 | 90.05 | 83.73 | 70.07 | 22.35 | 24.65 | 37.46 | 52.81
AT | 20.94 | 86.02 | 23.70 | 45.87 | 96.78 | 86.83 | 85.72 | 24.30 | 18.02 | 24.89 | 51.31
ConfMix | 25.47 | 86.21 | 38.25 | 41.73 | 97.26 | 86.41 | 87.02 | 25.32 | 17.21 | 33.26 | 53.81
Proposed method | 26.75 | 90.09 | 41.65 | 33.35 | 97.52 | 88.42 | 85.04 | 23.00 | 16.37 | 41.53 | 54.47
Table 5. Results of the combinations of different modules on NWPU VHR-10 to DIOR*.
Method | PFA | IMFA | INFA | mAP
Proposed method | ✓ | ✓ | ✓ | 55.07
w/o PFA | – | ✓ | ✓ | 51.34
w/o IMFA | ✓ | – | ✓ | 52.61
w/o INFA | ✓ | ✓ | – | 51.98
w/o IMFA and INFA | ✓ | – | – | 50.22
w/o PFA and INFA | – | ✓ | – | 49.18
w/o PFA and IMFA | – | – | ✓ | 49.59
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
