Article

WMFA-AT: Adaptive Teacher with Weighted Multi-Layer Feature Alignment for Cross-Domain UAV Object Detection

1 Key Laboratory of Space Precision Measurement Technology, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China
2 The College of Agriculture, Nanjing Agricultural University, Nanjing 210095, China
3 The College of Geo-Exploration Science and Technology, Jilin University, Changchun 130026, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(23), 3854; https://doi.org/10.3390/rs17233854
Submission received: 21 October 2025 / Revised: 20 November 2025 / Accepted: 26 November 2025 / Published: 28 November 2025

Highlights

What are the main findings?
  • The proposed method effectively mitigates domain discrepancies in UAV object detection by integrating teacher–student mutual learning, domain adversarial learning, and weighted multi-layer feature alignment.
  • Experimental results on four challenging UAV cross-domain benchmarks—covering cross-time, cross-camera, cross-view, and cross-weather scenarios—demonstrate that WMFA-AT consistently improves detection accuracy and robustness under severe domain shifts.
What are the implications of the main findings?
  • The proposed method enables accurate UAV object detection in unseen domains without requiring additional bounding box annotations, substantially reducing the cost of manual labeling for new environments.
  • The effectiveness of the weighted multi-layer feature alignment strategy highlights a new direction for designing domain-adaptive detection frameworks capable of generalizing across complex aerial imaging conditions.

Abstract

Unmanned Aerial Vehicle (UAV) object detection has witnessed rapid progress in recent years. However, its heavy reliance on labeled data and the assumption of consistent data distributions between training and deployment domains limit its generalization ability, leading to significant performance degradation under domain shifts. To address this challenge arising from substantial discrepancies in feature distributions across UAV images captured under diverse conditions, we propose a novel framework: Adaptive Teacher with Weighted Multi-layer Feature Alignment (WMFA-AT) for cross-domain UAV object detection. WMFA-AT adopts a teacher–student mutual learning paradigm, integrating domain adversarial learning with weighted multi-layer feature alignment and strong-weak data augmentation to effectively mitigate domain discrepancies. Specifically, the student model performs adversarial alignment using multiple domain discriminators applied to different feature layers, where layer-wise transferability is quantitatively estimated and used to adaptively weight the alignment process. This strategy ensures that features from the source and target domains are aligned in a distribution-aware manner. Meanwhile, the teacher model benefits from the student model via mutual learning, incorporating knowledge from both source and target domains while avoiding overfitting to the source. To comprehensively evaluate the proposed approach, we construct four challenging cross-domain UAV object detection benchmarks covering cross-time, cross-camera, cross-view, and cross-weather scenarios. Experimental results demonstrate that WMFA-AT consistently improves detection accuracy across diverse domain shifts, highlighting its robustness, generalization capability, and practical applicability in real-world UAV deployment settings.

1. Introduction

Unmanned aerial vehicles (UAVs), as an emerging remote sensing platform, have been widely applied in various fields such as smart agriculture, environmental monitoring, and traffic management [1,2,3,4,5]. In these applications, object detection in UAV images is a critical component. In recent years, convolutional neural network-based methods have achieved state-of-the-art performance on multiple UAV image benchmarks [6,7,8,9,10,11,12]. However, the success of these detectors relies on a large amount of labeled data with bounding boxes and requires the training and test datasets to have independent and identically distributed properties. These constraints significantly weaken the generalization ability of the detectors when confronted with new domains, leading to a substantial performance drop in the presence of domain shift [13,14]. One intuitive solution to address this issue is to collect and annotate images from all new environments to mitigate the impact of domain shift. However, this approach is clearly impractical, as obtaining high-quality bounding box annotations is both expensive and time-consuming. This has led to the emergence of unsupervised domain adaptation tasks [15,16,17], aiming to transfer knowledge from labeled source domain data to the unlabeled target domain.
While excellent performance has been achieved on benchmark datasets, object detection in real-world UAV applications still faces substantial challenges caused by variations in viewpoints, object appearance, backgrounds, illumination, atmospheric conditions, and imaging quality [18]. These factors introduce domain shifts between the training (source) domain and the deployment (target) domain. For example, UAV images captured by different camera sensors, flight altitudes, or in different cities may exhibit distinct visual characteristics. Moreover, real-world UAV systems must remain reliable under diverse environmental conditions—such as rain, fog, or nighttime—whereas training datasets are often collected under clear-weather conditions for better visibility. Similar discrepancies also arise when models trained on synthetic images are deployed on real-world UAV data. These factors collectively lead to strong domain variations across datasets.
Such domain shifts commonly result in significant degradation in detection accuracy. Although obtaining more annotated training data could mitigate this issue, bounding box annotation is costly and labor-intensive, particularly for UAV imagery that typically contains small, dense, and diverse targets. Therefore, it is desirable to develop techniques that can adapt an existing detector to a visually different target domain without requiring new manual annotations [13].
The core idea of Domain Adaptive Object Detection (DAOD) is to learn domain-invariant feature representations by aligning the source and target domains [15]. Domain invariance refers to feature representations that capture the essential semantic information of objects (e.g., shape, structure, and contextual cues) while suppressing domain-specific variations (e.g., lighting, color tone, imaging noise, or background patterns). By aligning feature distributions between the source and target domains—typically through adversarial learning, feature alignment, or pseudo-label-based mutual learning—the detector becomes less sensitive to superficial appearance changes and more robust to diverse real-world conditions. Achieving strong domain invariance is therefore foundational for building UAV detection models capable of reliable performance across time, sensors, viewpoints, and weather conditions.
Existing alignment strategies can be categorized into three types: feature-level adaptive methods [18,19,20,21], pixel-level adaptive methods [22,23,24], and self-training-based methods [13,25]. Feature-level adaptive methods reduce the distribution discrepancy between the source and target domain features using adversarial learning or explicit metric learning strategies. However, existing methods treat different feature layers equally during the alignment process, neglecting the varying transferability of features across different layers. Pixel-level adaptive methods first synthesize the source domain into an intermediate domain with target-style images and then train detectors using supervised learning. However, training an ideal image translator is extremely challenging, and in extreme cases, synthetic images from certain new domains may even exacerbate domain shift. Self-training-based methods align classes by generating pseudo-labels, but such methods are prone to generating low-quality pseudo-labels in the target domain.
It is noteworthy that most current domain adaptive object detection methods are modeled based on ground-level images, such as Cityscapes [26] and BDD100k [27], as shown in Figure 1. However, the possibility of capturing images from various perspectives and altitudes results in more severe instance-level domain shift (e.g., appearance and size) in UAV images compared to ground-level images. Furthermore, factors such as weather (sunny vs. rainy) and lighting conditions (daytime vs. nighttime) have a significant impact on the quality of aerial images [18]. In other words, domain shift differences at the image level are more pronounced in UAV images than in ground-level images, as shown in Figure 2. Overall, the main challenges of domain adaptive object detection in UAV images stem from two aspects: (1) the significant image-level style differences make it difficult to transfer the knowledge learned from the source domain to the target domain; (2) the difficulty of generating high-quality pseudo-labels.
Domain-adaptive (cross-domain) object detection seeks to address the challenges posed by deploying detection models in unconstrained or previously unseen environments [19,20,28]. Since its initial introduction by Chen et al. [19], who employed adversarial learning with domain discriminators to align domain-invariant visual features at both image and instance levels, a variety of strategies have emerged to mitigate domain shifts. For instance, Saito et al. [20] proposed strong alignment on low-level and weak alignment on high-level features, while He and Zhang introduced Multi-Adversarial Faster-RCNN (MAF) [28] with multi-level alignment using hierarchical domain discriminators. Hsu et al. [29] focused alignment on foreground-centric regions, and Hierarchical Transferability Calibration Network (HTCNet) [30] adjusted the discriminability-transferability trade-off hierarchically. Other notable efforts include foreground-guided adaptation [31], asymmetric label distribution modeling [32], and auxiliary multi-label classification regularization [33,34]. While these methods improve transferability by encouraging domain-invariant representations, they often compromise the model’s discriminative capacity, especially when alignment is applied indiscriminately across features.
Recently, self-training with pseudo-labels has garnered significant attention in domain adaptation. In this paradigm, a model generates pseudo-labels for unlabeled target data to iteratively supervise itself [35]. To enhance label reliability, several improvements have been proposed: temporal ensembling [36], exponential moving average-based Mean Teacher models [37], mutual learning strategies [38], and consistency-based techniques such as FixMatch [39].
To transfer the success of self-training from semi-supervised to domain-adaptive object detection, researchers have extended the teacher–student framework to the cross-domain setting [37,40,41]. In this framework, a teacher model generates pseudo-labels for unlabeled target data, while a student model is trained with these labels to gradually reduce the domain discrepancy. For example, Mean Teacher with Object Relations (MTOR) [42] introduced consistency-based graph relation modeling, while Unbiased Mean Teacher (UMT) [24] incorporated image translation with CycleGAN [43]. More recent frameworks, such as Adaptive Teacher (AT) [13] and Probabilistic Teacher (PT) [44], integrate uncertainty-aware labeling and strong-weak augmentation strategies to enhance performance. Harmonious Teacher [25] further improves pseudo-label quality by enforcing consistency between classification and localization confidence. Nonetheless, these teacher–student paradigms often suffer from noisy pseudo-labels due to unresolved domain shifts, ultimately leading to suboptimal generalization.
Pseudo-labeling has become one of the most widely adopted strategies in domain-adaptive object detection because it effectively leverages abundant unlabeled target-domain UAV data without additional annotation cost [13,18,35]. By allowing the teacher model to iteratively generate supervision for target samples, self-training enables progressive refinement, where both the model and pseudo-label quality improve over time, thus facilitating stable adaptation even under large domain gaps. Training on pseudo-labeled target data also provides direct target-guided supervision, helping the detector learn domain-invariant representations and reducing sensitivity to appearance shifts. Furthermore, consistency regularization helps suppress noise in pseudo-labels, while pseudo-labeling remains highly compatible with adversarial or feature-alignment approaches, offering complementary adaptation at both feature and instance levels. These combined advantages make pseudo-label-based self-training a practical, scalable, and highly effective solution for adapting UAV detectors to new environments without human annotations.
In this study, we address the cross-domain object detection task in UAV imagery, which presents pronounced domain gaps between labeled source and unlabeled target domains due to variations in altitude, weather, lighting, and viewpoint. Although the teacher–student framework offers a promising solution, it remains susceptible to degraded performance caused by uniform treatment of features with different transferability and the reliance on biased pseudo-labels. To overcome these limitations, inspired by Adaptive Teacher [13], we propose WMFA-AT (Adaptive Teacher with Weighted Multi-layer Feature Alignment), a novel self-training framework that integrates (i) weighted multi-layer feature alignment based on transferability estimation and adversarial learning, and (ii) strong-weak augmentation-driven mutual learning. Specifically, the student model employs multiple adversarial domain classifiers on different backbone layers, and each layer is weighted according to its transferability—measured by adversarial loss—allowing the model to selectively align the most informative features. This targeted alignment enables the student to learn domain-invariant representations, which in turn improves the teacher model via exponential moving average updates. To further alleviate domain bias, the student model incorporates a gradient reversal layer-based discriminator to perform adversarial alignment on dual-domain inputs. Collectively, these strategies result in a substantial improvement in pseudo-label quality and cross-domain detection performance. The main contributions of this work are as follows:
1.
We propose a novel teacher–student framework (WMFA-AT) for cross-domain UAV object detection that requires no bounding box annotations in the target domain, thereby significantly reducing annotation costs while maintaining high detection accuracy.
2.
A weighted multi-layer feature adaptive alignment mechanism is introduced, enabling layer-wise domain adaptation based on estimated transferability. This approach effectively enhances pseudo-label quality and improves model generalization under complex domain shifts.
3.
We construct four challenging UAV cross-domain datasets spanning cross-time, cross-camera, cross-view, and cross-weather scenarios to comprehensively validate our approach. Extensive experiments demonstrate that WMFA-AT consistently outperforms state-of-the-art methods across all scenarios, highlighting its robustness and versatility.

2. Materials and Methods

2.1. Datasets

This study utilizes four publicly available visible-band (RGB) datasets, including two optical satellite remote sensing datasets—DIOR [45] and DOTA [6]—and two UAV remote sensing datasets—VisDrone [46] and UAVDT [47]. All datasets adopt the COCO-style annotation format with axis-aligned bounding boxes.
The VisDrone dataset [46] contains 10 object categories captured by various UAV platforms under different weather, illumination, and viewpoint conditions. Image resolutions range from 1500 × 2000 pixels to 360 × 480 pixels. The dataset consists of 10,209 images (6471 for training, 548 for testing, and 3190 for validation). In this study, images were preprocessed through standard cropping and category filtering to ensure consistency with the experimental settings.
The UAVDT dataset [47] consists of 80,000 image frames extracted from over 10 h of video recordings. The videos were captured using UAV platforms in different urban areas, covering various perspectives and altitudes. Considering the redundancy in sequential video frames, this study uses a preprocessed version of the UAVDT dataset, which includes 8058 training images and 3325 testing images, each with a resolution of 1024 × 540 pixels. The dataset includes three vehicle categories (car, bus, truck), and COCO-format annotations are provided.
The DIOR dataset [45] comprises 23,463 optical remote sensing images, each with a resolution of 800 × 800 pixels, covering 20 object categories and 190,288 object instances. It is widely recognized as a benchmark dataset for object detection in remote sensing imagery. The dataset exhibits several notable characteristics: (1) it contains a large number of object categories, instances, and images, supporting large-scale object detection tasks; (2) it features significant variations in object sizes, reflecting differences in spatial resolution and intra-inter-object size variations, which increase detection complexity; (3) it includes images captured under diverse imaging conditions, weather, seasons, and backgrounds, demonstrating high environmental diversity; and (4) it presents high inter-category similarity and rich intra-category diversity.
The DOTA dataset [6] is a large-scale remote sensing object detection dataset designed for developing and evaluating detectors for aerial images. The images were collected from various sensors and platforms, with sizes ranging from 800 × 800 pixels to 20,000 × 20,000 pixels, and containing objects of varying scales, orientations, and shapes. This study uses version 1.0 of the dataset, which includes 2806 images, 188,282 object instances, and 15 object categories.
Based on these four datasets, we construct new benchmark datasets to evaluate the proposed method across four domain-shift scenarios: cross-time, cross-camera, cross-view, and cross-weather. The details are described as follows:

2.1.1. Cross-Time UAV Object Detection Dataset

Temporal domain shifts arise primarily from variations in lighting conditions between daytime and nighttime imagery. To evaluate the performance of the proposed method under cross-time domain adaptation, two benchmark datasets—VisDrone_daytime_to_night and UAVDT_daytime_to_night—are constructed.
For VisDrone_daytime_to_night, 6000 daytime images and 1200 nighttime images are randomly selected from the original VisDrone dataset to serve as the source and target domains, respectively. Due to the natural variability in object distributions across different lighting conditions, these domains exhibit differing object densities. Object categories with insufficient representation in nighttime imagery (e.g., bicycles, tricycles, and awning-tricycles) are excluded from training and evaluation. Additionally, the categories “people” and “pedestrian” are merged into a unified “people” class, resulting in a total of six object categories.
Similarly, for UAVDT_daytime_to_night, 8000 daytime images and 2000 nighttime images are randomly selected from the UAVDT dataset to serve as the source and target domains, respectively. Given the inherent scene diversity between day and night, this dataset also reflects distinct object densities across domains and includes three object categories. Detailed statistics for both datasets are presented in Table 1 and Table 2.

2.1.2. Cross-Camera UAV Object Detection Dataset

Due to significant differences in images captured by different devices, we construct a cross-camera object detection dataset, VisDrone_to_UAVDT, to evaluate the performance of the proposed method in cross-camera object detection. From the VisDrone training set, 6000 images are selected as the source domain (VisDrone_to_UAVDT_source_training). From the UAVDT training set, 6000 images are randomly selected as the target domain’s training data (VisDrone_to_UAVDT_target_training). Additionally, 2000 images are randomly selected from the UAVDT test set as the target domain’s test data (VisDrone_to_UAVDT_target_testing). These datasets include three common object categories: car, truck, and bus. Detailed statistics are shown in Table 3.

2.1.3. Cross-View UAV Object Detection Dataset

The DIOR and DOTA datasets primarily consist of nadir-view imagery, while UAVDT and VisDrone datasets contain frontal and side-view images. Scale variations resulting from viewpoint differences significantly affect detection performance. To evaluate the proposed method’s adaptability to cross-view variations, four datasets are prepared: DIOR-to-VisDrone, DIOR-to-UAVDT, DOTA-to-VisDrone, and DOTA-to-UAVDT. The DIOR dataset includes 20 object categories, with all vehicle types annotated as a single category, “car.” Accordingly, the “car,” “truck,” and “bus” categories in VisDrone and UAVDT are merged into “car,” and unrelated categories are excluded. From 6428 images containing “car” in the DIOR dataset, 6000 images are selected as source domain data (DIOR_source_training). Similarly, 6000 images are selected from the VisDrone training set (VisDrone_target_training) and the UAVDT training set (UAVDT_target_training) as target domain data, with the respective test sets used for evaluation (VisDrone_target_testing, UAVDT_target_testing).
The DOTA dataset is processed with horizontal bounding boxes and split into 800 × 800 subimages with 100-pixel overlaps. Regions smaller than 800 × 800 are padded with zero pixels. Among its 15 object categories, only “large-vehicle” and “small-vehicle” are relevant to the “car” category. After filtering, 4357 subimages containing these categories are selected as source domain data (DOTA_source_training). The target domain data remains consistent with the DIOR-to-VisDrone/UAVDT datasets. Details are presented in Table 4.
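The tiling step just described can be sketched in a few lines of Python/NumPy. The function below is a minimal illustration (the function name and the annotation-remapping detail are our assumptions, not the released preprocessing code): it cuts an oversized DOTA scene into 800 × 800 patches with 100-pixel overlap and zero-pads border patches.

```python
import numpy as np

def tile_image(image, tile=800, overlap=100):
    """Cut a large aerial image into tile x tile patches with overlapping strides;
    border patches smaller than the tile size are zero-padded (illustrative sketch)."""
    h, w = image.shape[:2]
    stride = tile - overlap                      # 700-pixel step for the 800/100 setting
    patches = []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            patch = image[y:y + tile, x:x + tile]
            ph, pw = patch.shape[:2]
            if ph < tile or pw < tile:           # pad regions smaller than 800 x 800 with zeros
                padded = np.zeros((tile, tile) + image.shape[2:], dtype=image.dtype)
                padded[:ph, :pw] = patch
                patch = padded
            patches.append(((x, y), patch))      # offsets are kept so box annotations can be shifted per patch
    return patches
```

In practice, the horizontal bounding boxes would be shifted by the patch offset and clipped to the patch extent, and patches left without any relevant instances would be discarded.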

2.1.4. Cross-Weather UAV Object Detection Dataset

To enhance model generalization under varying weather conditions, a UAVDT_sunny_to_rainy dataset is constructed. A total of 6000 sunny images are selected as the source domain (UAVDT_sunny_to_rainy_training), and 1500 rainy images are selected as the target domain (UAVDT_sunny_to_rainy_testing). Due to the limited number of “bus” and “truck” instances in the target domain, only the “car” category is considered for evaluation. Dataset details are provided in Table 5.

2.2. Methods

2.2.1. Basic Network Architecture

Before presenting the proposed WMFA-AT, we first define the problem of cross-domain object detection [13]. Given a labeled source domain dataset containing $N_s$ images, denoted as $D_s = \{X_s, B_s, C_s\}$, and an unlabeled target domain dataset containing $N_t$ images, denoted as $D_t = \{X_t\}$, where $B_s = \{B_s^i\}_{i=1}^{N_s}$ represents the bounding box annotations for the source domain images $X_s = \{X_s^i\}_{i=1}^{N_s}$, and $C_s = \{C_s^i\}_{i=1}^{N_s}$ represents the corresponding class labels. The target domain images $X_t = \{X_t^j\}_{j=1}^{N_t}$ have no annotations. The ultimate goal of cross-domain object detection is to design a domain-invariant detector leveraging both $D_s$ and $D_t$.
The basic network architecture of the proposed method is illustrated in Figure 3, which consists of two main modules: a Target-Domain Teacher Model and a Cross-Domain Student Model. The teacher model processes weakly augmented images from the target domain ($D_t$), while the student model processes strongly augmented images from both the source domain ($D_s$) and the target domain ($D_t$). Two training strategies are employed to optimize the model: a teacher–student mutual learning strategy and an adversarial learning strategy.
Applying weak augmentation to the teacher model enables it to generate more stable and reliable pseudo-labels, as heavy augmentation may distort image content and introduce noise into the pseudo-labeling process. This design is aligned with findings from UT [48] and AT [13], where weakly augmented inputs help preserve semantic consistency and support high-quality pseudo-label prediction. In contrast, the student model is trained using strong augmentation, which exposes it to substantial appearance variations and encourages the learning of robust and domain-invariant representations. Importantly, this weak–strong augmentation pipeline stabilizes the mutual learning dynamics: weak augmentation prevents the teacher from producing noisy pseudo-labels, while strong augmentation enhances the student’s generalization capability across domains.
To initialize the feature encoder and detection head, a target detector is first trained using source-domain annotations. During the mutual learning stage, this initialized detector is duplicated into two structurally identical models—the teacher and the student. The teacher model generates pseudo-labels to supervise the student model, while the student updates the teacher model through an Exponential Moving Average (EMA) mechanism [37]. As training progresses, the quality of the pseudo-labels guiding the student model improves iteratively.
Furthermore, the student model incorporates multi-layer discriminators (domain classifiers) together with a Gradient Reversal Layer (GRL) [49] to perform adversarial domain alignment across multiple feature levels. This multi-level alignment strategy reduces domain discrepancies and enhances the student model’s feature transferability, which in turn enables the teacher model to produce more reliable pseudo-labels for the target domain.

2.2.2. Teacher–Student Mutual Learning Strategy

Following the initial teacher–student framework proposed for semi-supervised object detection [37], the proposed model consists of two structurally identical components: a student model and a teacher model. The student model is updated through standard gradient-based optimization, while the teacher model is updated using the EMA of the student model’s weights. To generate accurate pseudo-labels for target domain images, weakly augmented images are fed into the teacher model to provide reliable pseudo-labels, while strongly augmented images are input into the student model. Specifically, in the teacher model, object samples undergo weak augmentations such as random horizontal flipping and cropping. In contrast, in the student model, object samples are subjected to strong augmentations, including random color jittering, grayscale conversion, Gaussian blur, and random cropped patches, to introduce perturbations.
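As a concrete illustration of this weak-strong split, the snippet below builds the two augmentation pipelines with standard torchvision transforms. The exact magnitudes and probabilities are illustrative assumptions rather than the authors' released configuration, and in a detection pipeline the geometric transforms (flip, crop) would also have to be applied to the bounding boxes.

```python
import torchvision.transforms as T

# Weak augmentation for the teacher: only mild geometric changes, so pseudo-labels stay stable.
weak_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

# Strong augmentation for the student: heavy photometric perturbations plus cutout-style erasing.
strong_aug = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # random color jittering
    T.RandomGrayscale(p=0.2),                                    # grayscale conversion
    T.RandomApply([T.GaussianBlur(kernel_size=9)], p=0.5),       # Gaussian blur
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.2)),                   # randomly erased (cutout-style) patches
])
```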
(1)
Model Initialization
In a self-training framework, initialization is crucial, as it relies on the teacher model to generate reliable pseudo-labels for optimizing the student model on the unlabeled target domain [13]. To achieve this, the proposed model is first optimized using the available labeled source data $D_s = \{X_s, B_s, C_s\}$ with a supervised loss $\mathcal{L}_{sup}$. The supervised loss for training and initializing the student model with labeled source data is defined as:
$$\mathcal{L}_{sup}(X_s, B_s, C_s) = \mathcal{L}_{cls}^{rpn}(X_s, B_s, C_s) + \mathcal{L}_{reg}^{rpn}(X_s, B_s, C_s) + \mathcal{L}_{cls}^{roi}(X_s, B_s, C_s) + \mathcal{L}_{reg}^{roi}(X_s, B_s, C_s) \quad (1)$$
here, the RPN loss $\mathcal{L}^{rpn}$ is used to train the Region Proposal Network (RPN) for generating candidate proposals, while the Region of Interest (ROI) loss $\mathcal{L}^{roi}$ is employed for the ROI prediction branch. Both RPN and ROI losses include components for bounding box regression (reg) and classification (cls). In this paper, Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss) is adopted for $\mathcal{L}_{cls}^{rpn}$ and $\mathcal{L}_{cls}^{roi}$, while the $\ell_1$ loss is used for $\mathcal{L}_{reg}^{rpn}$ and $\mathcal{L}_{reg}^{roi}$.
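For clarity, the four-term supervised loss in Equation (1) can be assembled as in the sketch below. The prediction and target tensors are assumed to come from the usual RPN/ROI matching step (omitted here), and the helper name is illustrative.

```python
import torch.nn.functional as F

def supervised_loss(rpn_logits, rpn_deltas, roi_logits, roi_deltas, targets):
    """Sketch of Equation (1): BCE-with-logits for the classification terms and
    an L1 loss for the regression terms of both the RPN and the ROI head."""
    l_cls_rpn = F.binary_cross_entropy_with_logits(rpn_logits, targets["rpn_labels"])
    l_reg_rpn = F.l1_loss(rpn_deltas, targets["rpn_deltas"])
    l_cls_roi = F.binary_cross_entropy_with_logits(roi_logits, targets["roi_labels"])
    l_reg_roi = F.l1_loss(roi_deltas, targets["roi_deltas"])
    return l_cls_rpn + l_reg_rpn + l_cls_roi + l_reg_roi
```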
(2)
Optimizing the Student Model Using Target Pseudo-Labels
Since no labels are available in the target domain, a pseudo-labeling approach is adopted to generate virtual labels for target domain images to train the student model [13]. To filter out noisy pseudo-labels, a confidence threshold $\delta$ is applied to the bounding boxes predicted by the teacher model, removing false positives. Additionally, non-maximum suppression (NMS) is performed for each class to eliminate duplicate bounding box predictions. After obtaining pseudo-labels for target domain images from the teacher model, the student model can be updated using the following loss function:
$$\mathcal{L}_{unsup}(X_t, \hat{C}_t) = \mathcal{L}_{cls}^{rpn}(X_t, \hat{C}_t) + \mathcal{L}_{cls}^{roi}(X_t, \hat{C}_t) \quad (2)$$
where $\hat{C}_t$ denotes the pseudo-labels generated by the teacher model for the target domain. In this process, the unsupervised loss is not applied to the bounding box regression task because the confidence scores of the predicted bounding boxes on unlabeled data only indicate the confidence of object class predictions, rather than the accuracy of the generated bounding box positions.
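A minimal sketch of this pseudo-label filtering step is given below: predictions from the teacher on a weakly augmented target image are thresholded at the confidence level $\delta$ and then deduplicated with class-wise NMS. The function name and the NMS IoU threshold are assumptions for illustration.

```python
import torch
from torchvision.ops import nms

def filter_pseudo_labels(boxes, scores, labels, delta=0.8, iou_thr=0.5):
    """Keep teacher predictions with score >= delta, then apply per-class NMS."""
    keep = scores >= delta                       # confidence thresholding (delta = 0.8 in Section 3.2)
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]

    kept = []
    for c in labels.unique():                    # class-wise non-maximum suppression
        idx = (labels == c).nonzero(as_tuple=True)[0]
        kept.append(idx[nms(boxes[idx], scores[idx], iou_thr)])
    kept = torch.cat(kept) if kept else torch.empty(0, dtype=torch.long)
    return boxes[kept], scores[kept], labels[kept]
```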
(3)
Gradually updating the teacher model from the student model
To obtain high-quality pseudo-labels from target images, this study employs the EMA method [37], which updates the teacher model by progressively copying the weights of the student model. The update formula is defined as:
$$\theta_t \leftarrow \alpha \theta_t + (1 - \alpha)\theta_s \quad (3)$$
where $\theta_t$ and $\theta_s$ denote the network parameters of the teacher model and student model, respectively, and $\alpha$ denotes the exponential moving average smoothing coefficient for the teacher model.
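Equation (3) corresponds to only a few lines of PyTorch, sketched below with the smoothing coefficient reported in Section 3.2; in practice, non-trainable buffers such as batch-norm statistics are usually copied from the student as well.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.9996):
    """Exponential moving average update of the teacher weights (Equation (3))."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)   # theta_t <- alpha*theta_t + (1-alpha)*theta_s
```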

2.3. Weighted Multi-Layer Feature Alignment Strategy for Adversarial Learning

In cross-domain UAV object detection, the absence of labeled data in the target domain often leads both teacher and student models to exhibit a bias toward the source domain during mutual learning. Specifically, the pseudo-labels generated by the teacher for target-domain images are heavily dependent on source-domain knowledge, which undermines their reliability and constrains the model’s generalization ability. Therefore, narrowing the domain gap between source and target domains is crucial. Shallow features primarily encode local positional information, whereas deep features capture global semantics. Due to the weak correlation between these levels of representation, this study adopts a coarse-to-fine adaptation strategy by performing image-level alignment at shallow layers and instance-level alignment at deeper layers. Furthermore, given that relying on a single feature map for image-level alignment is inadequate for UAV imagery—owing to its large scale variations and complex domain shifts—we propose a weighted multi-layer feature alignment strategy to enhance cross-domain transferability.
Our weighted multi-layer feature alignment module with multiple adversarial domain classifiers is proposed to align distributions and learn domain-invariant features in the student model, which processes both domains. To enable multi-layer adversarial learning, multiple domain classifiers ($D$) are placed after the feature encoder (backbone, denoted as $E$) of the student model, as illustrated in Figure 3. The role of each domain classifier is to distinguish whether the derived feature $E(X)$ originates from the source domain or the target domain. For each input sample, the probability of belonging to the target domain is defined as $D(E(X))$, while the probability of belonging to the source domain is $1 - D(E(X))$. Given the domain label $d$ of each input image, the domain classifier $D$ is updated using a binary cross-entropy loss. Specifically, images from the source domain are labeled as $d = 0$, while those from the target domain are labeled as $d = 1$. The discriminator loss for the $i$-th feature layer (e.g., the res2, res3, and res4 feature layers in the feature encoder) can be expressed as:
$$\mathcal{L}_{dis}^{i} = -d \log D_i(E_i(X)) - (1 - d) \log\big(1 - D_i(E_i(X))\big) \quad (4)$$
During adversarial learning, the feature encoder $E$ is trained to confuse the domain classifier $D$, while the domain classifier $D$ attempts to distinguish features based on their domain origin. Accordingly, the adversarial optimization objective for the output features of the $i$-th feature layer can be formulated as:
$$\mathcal{L}_{adv}^{i} = \max_{E_i} \min_{D_i} \mathcal{L}_{dis}^{i} \quad (5)$$
To simplify this min-max optimization, a gradient reversal layer (GRL) is inserted between the feature encoder and the domain classifier. The GRL reverses the gradient direction during backpropagation, so that a single minimization of $\mathcal{L}_{dis}^{i}$ updates the domain classifier $D_i$ to reduce the loss while simultaneously pushing the feature encoder $E_i$ to maximize it, i.e., to confuse the classifier.
The overall multi-layer feature alignment loss can thus be expressed as:
$$\mathcal{L}_{adv} = \sum_{i=2}^{4} \mathcal{L}_{adv}^{i} \quad (6)$$
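To make the min-max objective of Equations (4)-(6) concrete, the sketch below shows a gradient reversal layer together with a simple per-layer domain classifier. The discriminator architecture (two 1 × 1 convolutions followed by global average pooling) is an assumption for illustration, not the exact design used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    """Per-layer domain discriminator D_i attached to one backbone stage (illustrative)."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=1),
        )

    def forward(self, feat, domain_label):
        feat = GradReverse.apply(feat)                             # encoder gradients are reversed (GRL)
        logits = self.net(feat).mean(dim=(2, 3))                   # image-level domain logit, shape (B, 1)
        target = torch.full_like(logits, float(domain_label))      # 0 = source, 1 = target
        return F.binary_cross_entropy_with_logits(logits, target)  # Equation (4)
```

Minimizing this loss trains each $D_i$ to separate the domains, while the reversed gradients push the corresponding encoder stage $E_i$ toward domain confusion, realizing Equation (5) without an explicit alternating optimization.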
Given the specific properties of features extracted from different layers—such as their varying adaptability to scene factors like illumination and viewpoint—this study introduces a weighting mechanism to emphasize the imbalanced feature alignment capabilities of different domain classifiers. The detailed steps are as follows:
1. Compute the adversarial loss for each domain classifier using Equation (5). A higher loss indicates that the features are less distinguishable between domains, suggesting higher transferability.
2. Quantify the transferability of the i-th feature layer. Transferability reflects the contribution of the features to cross-domain adaptation, with higher transferability indicating greater importance for domain adaptation. This is defined as:
$$w_i = \frac{e^{\mathcal{L}_{adv}^{i}}}{\sum_{j=2}^{4} e^{\mathcal{L}_{adv}^{j}}} \quad (7)$$
Here, $w_i$ represents the weighting factor for the $i$-th domain classifier: a higher adversarial loss corresponds to a higher weight, indicating that feature alignment should focus more on that layer.
3. The weighted multi-layer feature alignment loss is given by:
$$\mathcal{L}_{WMFA} = \sum_{i=2}^{4} w_i \mathcal{L}_{adv}^{i} \quad (8)$$
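The weighting of Equations (7) and (8) amounts to a softmax over the per-layer adversarial losses, as sketched below. Detaching the weights so that they rescale, but are not themselves driven by, the alignment loss is our assumption about a reasonable implementation.

```python
import torch

def weighted_alignment_loss(layer_adv_losses):
    """Sketch of Equations (7)-(8): softmax-weighted sum of the per-layer adversarial
    losses (e.g., from the discriminators attached to res2, res3, and res4)."""
    losses = torch.stack(layer_adv_losses)            # shape (3,)
    weights = torch.softmax(losses.detach(), dim=0)   # higher loss -> higher transferability -> larger weight
    return (weights * losses).sum()                   # L_WMFA
```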
By incorporating the above weighted feature alignment loss, the student model effectively addresses domain biases in visual features. Through repeated exponential moving average updates, the teacher model is further refined to generate more accurate pseudo-labels.

2.4. Loss Function

The total loss function L of the proposed method is summarized as follows:
$$\mathcal{L} = \mathcal{L}_{sup} + \lambda_{unsup} \cdot \mathcal{L}_{unsup} + \lambda_{dis} \cdot \mathcal{L}_{WMFA} \quad (9)$$
here, $\lambda_{unsup}$ and $\lambda_{dis}$ are hyperparameters that control the weights of the corresponding loss components. It is important to note that $\mathcal{L}_{sup}$ and $\mathcal{L}_{unsup}$ are used to train the feature encoder and detector in the student model, while $\mathcal{L}_{WMFA}$ is employed to update both the feature encoder and the domain classifiers. The teacher model is updated solely through the exponential moving average mechanism.
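For reference, one mutual-learning iteration combining Equation (9) might look like the sketch below. The methods `supervised_loss`, `predict_and_filter`, `pseudo_label_loss`, and `alignment_loss` are hypothetical wrappers around the components sketched in the previous subsections (including the earlier `ema_update` helper), not APIs from the released code; the hyperparameter values follow Section 3.2.

```python
import torch

LAMBDA_UNSUP, LAMBDA_DIS = 1.0, 0.05   # values reported in Section 3.2

def training_step(batch_src, batch_tgt_weak, batch_tgt_strong, student, teacher, optimizer):
    # supervised loss on labeled, strongly augmented source images (Equation (1))
    l_sup = student.supervised_loss(batch_src)

    # teacher pseudo-labels on weakly augmented target images (threshold + per-class NMS)
    with torch.no_grad():
        pseudo = teacher.predict_and_filter(batch_tgt_weak)
    l_unsup = student.pseudo_label_loss(batch_tgt_strong, pseudo)        # Equation (2)

    # weighted multi-layer adversarial alignment over both domains (Equation (8))
    l_wmfa = student.alignment_loss(batch_src, batch_tgt_strong)

    loss = l_sup + LAMBDA_UNSUP * l_unsup + LAMBDA_DIS * l_wmfa          # Equation (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(teacher, student, alpha=0.9996)                           # Equation (3)
    return loss.item()
```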

3. Results

3.1. Comparison Methods

To evaluate the effectiveness of the proposed method, a series of comparative experiments were conducted. The comparison methods include: (1) the baseline Faster RCNN [50], trained solely on the source domain dataset and tested on the target domain; (2) feature-level adaptation methods, namely DA-Faster RCNN [19], SWDA (Strong-Weak Distribution Alignment) [20], and H2FA (Holistic and Hierarchical Feature Alignment) [51]; and (3) self-training-based methods, namely UT (Unbiased Teacher) [48], PT [44], AT [13], and CMT (Contrastive Mean Teacher) [52].

3.2. Implementation Details

The experiments in this paper were conducted on a workstation equipped with an Intel(R) Core(TM) i7-7820X processor (3.60 GHz), 32 GB RAM, and two NVIDIA GeForce RTX 2080 Ti GPUs with 12 GB of video memory each. The experiments were conducted using the PyTorch 2.2.0 framework and the publicly available object detection framework Detectron2 [53]. For the network structure, Faster RCNN [50] was selected as the base detection model. Specifically, ResNet50, pre-trained on ImageNet [54], was used as the backbone network. All images were resized by scaling the shorter side to 800 pixels while maintaining the original aspect ratio.
For hyperparameters, $\lambda_{unsup}$ was set to 1.0 and $\lambda_{dis}$ to 0.05 in all experiments, with a confidence threshold of $\delta = 0.8$. The training was conducted using SGD with a fixed learning rate of 0.001 and a batch size of 4. During the initialization phase, following prior studies [13,48], the proposed model was trained on source domain data for 10k iterations. At the beginning of the mutual learning stage, the initialized weights were copied to both the teacher and student models, after which the network was further trained for 50k iterations. The 50k training length was selected empirically to allow stable convergence of the adversarial alignment and pseudo-label refinement processes.
Two types of data augmentation strategies were employed: (1) weak augmentation, consisting of only random horizontal flipping, which preserves the semantic structure of images and is widely adopted in self-trained and domain-adaptive detection frameworks to maintain pseudo-label stability [13,48]; and (2) strong augmentation, which introduces larger appearance variations through color jittering, grayscaling, Gaussian blurring, and patch cutout. The augmentation strengths and parameters were selected following commonly used configurations in strong–weak augmentation pipelines [13,48], which have been shown to improve robustness while avoiding excessive distortion that could harm pseudo-label quality. The exponential moving average smoothing coefficient $\alpha$ for the teacher model was set to 0.9996.
Following standard evaluation protocols in domain-adaptive object detection [13,19,20], the mean average precision ($mAP_{50}$) of the model was assessed using an Intersection-over-Union (IoU) threshold of 0.5.

3.3. Quantitative Evaluation

3.3.1. Cross-Time Domain Adaptive Object Detection

Domain shifts occur in images captured at different times due to changes in illumination conditions (e.g., daytime and nighttime). To evaluate the effectiveness of the proposed method in cross-time domain adaptive object detection, two experiments were conducted: VisDrone cross-time UAV object detection (VisDrone_daytime_to_night dataset) and UAVDT cross-time UAV object detection (UAVDT_daytime_to_night dataset). Details are provided in Section 2.1.1.
As shown in Table 6 and Table 7, the proposed WMFA-AT method achieves state-of-the-art performance on both day-to-night adaptation tasks, reaching $mAP_{50}$ scores of 33.4% on VisDrone and 64.6% on UAVDT. These results correspond to absolute gains of 10.4% and 21.9% over the source-only model, highlighting the effectiveness of the proposed cross-domain alignment and pseudo-label refinement mechanisms under severe illumination shifts. More importantly, the performance improvements extend to hard-to-transfer categories such as bus and truck, which typically suffer from increased appearance ambiguity and low visibility at night. The consistent gains across both common and challenging categories indicate that WMFA-AT not only enhances overall detection accuracy but also improves category-level robustness, demonstrating its strong capability to model domain-invariant features and maintain stable performance in low-light UAV scenarios.

3.3.2. Cross-Camera Domain Adaptive Object Detection

Domain discrepancies arise between images captured by different devices. To study cross-camera adaptability, this work employed the VisDrone and UAVDT datasets. Specifically, the training set of the VisDrone dataset was used as the source domain, while the training set of the UAVDT dataset was treated as the target domain, with final evaluations performed on the UAVDT test set. Details are provided in Section 2.1.2.
As shown in Table 8, the results of the VisDrone-to-UAVDT domain adaptation task demonstrate that the proposed WMFA-AT method achieved the best $mAP_{50}$ of 27.5%. This result confirms the effectiveness of the proposed method in handling domain shifts in cross-camera adaptation scenarios.

3.3.3. Cross-View Domain Adaptive Object Detection

The DIOR and DOTA datasets consist of satellite orthophotos captured from a vertical perspective, while the UAVDT and VisDrone datasets contain front-view and side-view images. Viewpoint differences cause significant scale variations, affecting detection performance. To assess the proposed method’s adaptability to cross-view changes, four experiments were designed.
In Experiments 1 and 2, the DIOR dataset’s training set was the source domain, with VisDrone and UAVDT as target domains. Experiments 3 and 4 used the DOTA dataset’s training set as the source domain, with VisDrone and UAVDT as target domains. Only the common category (“car”) was evaluated. Details are provided in Section 2.1.3.
As shown in Table 9, across all four domain adaptive tasks, the performance differences between the proposed method and training on source domain data alone were 32.8%, 31.7%, 33.2%, and 40.8%, respectively. The results indicate that viewpoint variations significantly hinder the model’s generalization ability. Compared to CMT, the proposed method achieved performance improvements of 6.9%, 5.8%, 12.4%, and 6.1% across the four challenging domain adaptive tasks, respectively. These gains are primarily attributed to the weighted multi-layer feature alignment loss, which effectively alleviates the impact of scale variations during cross-domain adaptation by leveraging multi-layer feature alignment.

3.3.4. Cross-Weather Domain Adaptive Object Detection

In real-world applications, object detectors must operate under various weather conditions. Thus, enhancing a model’s generalization ability to different weather scenarios is crucial. To simulate such scenarios, this work selected 6000 sunny images from the UAVDT dataset as the source domain and 1500 rainy images as the target domain. Details are provided in Section 2.1.4. Due to the relatively small number of buses and trucks in the target domain, only the “car” category was evaluated in this experiment.
As shown in Table 10, the proposed WMFA-AT method outperformed PT, AT, and CMT, with performance gains of 23.7%, 1.9%, and 1.1%, respectively. These results strongly validate the effectiveness and applicability of the proposed method in cross-weather domain adaptation scenarios.

3.4. Qualitative Evaluation

The detection results of different comparison methods across various cross-domain adaptation scenarios are illustrated in Figure 4. Here, false positives and false negatives follow the standard definitions in object detection evaluation—predicted boxes that do not correspond to any ground-truth object, and ground-truth objects missed by the detector, respectively [50]. As shown in the visualization, the Faster RCNN model trained solely on source-domain data exhibits a large number of such errors due to severe domain shift. After applying domain adaptation, the performance of SWDA, UT, and AT improves notably. In comparison, our proposed approach further reduces both types of errors, which can be clearly observed in the qualitative results.
The detection results of the proposed method are shown in Figure 5. In different domain adaptation tasks, the proposed method achieves excellent results. In cross-time tasks, the method effectively learns image-level domain-invariant features through domain adaptation, enabling robust detection performance even under extreme lighting differences, with fewer false positives and false negatives. In cross-view tasks, the method further adapts to scale variations caused by changes in perspective, demonstrating outstanding detection performance. However, due to the substantial viewpoint differences between satellite and UAV images, the proposed method still encounters false positive issues for targets with significant viewpoint changes, such as large-scale vehicles in oblique UAV views.
Thus, for tasks with severe domain shifts, such as cross-time and cross-view tasks, there remains room for further performance improvement. This also indicates that illumination variations and scale changes have a significant impact on the generalization ability of the model. Overall, the proposed method provides a novel approach to domain-adaptive object detection in UAV imagery.

4. Discussion

This section conducts a comprehensive comparative analysis of the proposed method through ablation studies. The experiments focus on the following three aspects: the impact of the weighted multi-layer feature alignment loss, the effect of the weak-strong data augmentation strategy, and the contribution of the teacher–student mutual learning strategy. All experiments were evaluated under four cross-domain settings: cross-time, cross-camera, cross-view, and cross-weather. The results are summarized in Table 11, where “WS Aug” refers to weak-strong augmentation, and “w/o” denotes the removal of the corresponding module.

4.1. Impact of Weighted Multi-Layer Feature Alignment Loss $\mathcal{L}_{WMFA}$

To analyze the importance of the adversarial learning strategy for weighted multi-layer feature alignment in WMFA-AT, $\mathcal{L}_{WMFA}$ was removed, and the model’s performance across the four cross-domain settings was recorded in Table 11. In scenarios with significant domain gaps, such as cross-time (daytime to nighttime), cross-view (satellite to UAV), and cross-weather (sunny to rainy), the model’s performance dropped by 6.7%, 9.8%, and 7.3%, respectively. However, in scenarios with smaller domain gaps, such as cross-camera (VisDrone to UAVDT), the performance only decreased by 3.2%.
Additionally, the impact of weights assigned to different layers in the weighted multi-layer feature alignment process was further analyzed. Figure 6 illustrates the weights assigned to various features in the weighted multi-layer feature alignment process. Higher weights indicate greater transferability, and vice versa. The results reveal two key characteristics: (1) within the same domain adaptation task, the weights of features from different layers vary, indicating differences in their transferability; (2) across different domain adaptation tasks, the weights of features from the same layer vary significantly, indicating task-specific transferability. This highlights the importance of selecting appropriate features for domain adaptation.
To further validate the effectiveness of the weighted multi-layer feature alignment strategy, three image-level feature alignment methods were compared across four domain adaptation tasks: single-layer feature alignment (SFA, using only “res2” features from the backbone), multi-layer feature alignment (MFA, using “res2,” “res3,” and “res4” features from the backbone), and weighted multi-layer feature alignment (WMFA, applying weights to “res2,” “res3,” and “res4” features from the backbone). The results are presented in Table 12, where “w/SFA,” “w/MFA,” and “w/WMFA” denote the use of SFA, MFA, and WMFA in the proposed model, respectively. The results show that SFA performs worse than MFA, likely due to the wide scale distribution of objects in UAV images. Aligning features across multiple layers helps the model learn richer domain-invariant features. Moreover, introducing the proposed reweighting strategy in MFA improved the adaptive detector’s performance by 1.7%, 1.4%, 2.0%, and 2.5% across the four cross-domain tasks, respectively, demonstrating the necessity of treating feature transferability across different layers non-uniformly.

4.2. Impact of Data Augmentation Strategy

The effectiveness of the weak-strong (WS) data augmentation strategy in WMFA-AT was also evaluated. Removing WS augmentation resulted in performance drops of approximately 3.2%, 2.5%, 6.0%, and 5.5% across the four domain adaptation tasks (see Table 11). These results indicate that the simple modification of applying weak augmentation to the teacher model and strong augmentation to the student model during training is critical for improving performance.

4.3. Impact of $\lambda_{unsup}$ and EMA

Finally, the importance of the teacher–student mutual learning strategy was analyzed, as shown in Table 11. Following prior studies [13,48], mutual learning and the teacher model were removed, and the performance of a student model trained solely with strong augmentation and the adversarial loss $\mathcal{L}_{WMFA}$ was evaluated on the cross-domain tasks. The results show a significant performance drop, indicating that the improvement primarily stems from mutual learning via pseudo-labeling in the target domain.

4.4. Limitations of the Proposed Method

The results highlight the method’s effectiveness in handling cross-domain challenges such as illumination changes, sensor variations, viewpoint differences, and weather conditions. However, some limitations remain. The reliance on manually set hyperparameters, such as λ u n s u p and λ d i s , may limit the model’s adaptability to highly diverse domains. Additionally, while the method improves detection accuracy in various scenarios, it can still face challenges in extreme domain shifts, where pseudo-label quality can degrade. Furthermore, the computational overhead associated with adversarial alignment and EMA updates may be a barrier to real-time applications. Despite these challenges, the method demonstrates clear advantages in improving detection performance in cross-domain UAV object detection tasks, making it a promising approach for future research and practical applications.

5. Conclusions

This paper presents WMFA-AT, a novel cross-domain UAV object detection framework that integrates a weighted multi-layer feature adaptive alignment strategy within a teacher–student mutual learning paradigm. To effectively mitigate domain gaps arising from significant shifts in UAV image distributions, the proposed method incorporates feature-level adversarial learning in the student model, where feature layers are adaptively weighted based on their transferability. This enables more precise alignment of domain-specific features and facilitates robust knowledge transfer. In parallel, weak–strong data augmentation and bidirectional knowledge exchange between teacher and student models further enhance pseudo-label quality and reduce bias toward the source domain, collectively contributing to improved domain generalization.
Extensive experiments were conducted on four newly constructed UAV cross-domain object detection benchmarks, spanning cross-time, cross-camera, cross-view, and cross-weather scenarios. The results demonstrate the effectiveness of WMFA-AT in addressing various types of domain shifts. The ablation studies validate the necessity of each core component—including weighted multi-layer alignment and adversarial learning—in achieving optimal cross-domain feature adaptation. Quantitative evaluations show that the proposed method consistently outperforms baseline and state-of-the-art methods, achieving substantial improvements in $mAP_{50}$ across all tasks. Qualitative results also highlight the method’s robustness and adaptability in practical UAV detection scenarios. Overall, WMFA-AT demonstrates strong performance and generalization capabilities in domain-adaptive UAV object detection, establishing it as a promising solution for real-world UAV deployment in dynamic and unconstrained environments.
Future work will explore two main directions. First, we plan to extend the framework toward broader application scenarios, including multimodal sensing (e.g., infrared or LiDAR) and fully unsupervised or continual domain adaptation, enabling the model to operate reliably under extreme conditions and in dynamically evolving environments. Second, we aim to further improve the efficiency and stability of the adaptation process by developing lightweight or sparse architectures suitable for onboard deployment, as well as more reliable pseudo-label filtering or uncertainty-aware learning mechanisms to reduce noise and enhance mutual learning robustness.

Author Contributions

G.C.: Conceptualization, Methodology, Data curation, Formal analysis, Software, Visualization, Investigation, Validation, Writing—original draft, Writing—review & editing; H.Y.: Conceptualization, Resources, Methodology, Writing—review and editing; Y.T.: Methodology, Writing—review and editing, Supervision; M.X.: Formal analysis, Writing—review and editing, Supervision; Q.D.: Resources, Validation, Writing—Reviewing and Editing; C.D.: Validation, Writing—Reviewing and Editing; X.F.: Software, Writing—Reviewing and Editing, Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Youth Innovation Promotion Association CAS (Grant No. 2023419), the Natural Science Foundation of Jilin Province (20250102183JC), the Jilin Province Youth Talent Support Project (QT202401), the Postdoctoral Fellowship Program of CPSF (GZC20252644) and Jiangsu Province Outstanding Postdoctoral Program (2025ZB155).

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Weber, I.; Bongartz, J.; Roscher, R. Artificial and Beneficial–Exploiting Artificial Images for Aerial Vehicle Detection. ISPRS J. Photogramm. Remote Sens. 2021, 175, 158–170. [Google Scholar] [CrossRef]
  2. Lu, Y.; Xue, Z.; Xia, G.S.; Zhang, L. A survey on vision-based UAV navigation. Geo-Spat. Inf. Sci. 2018, 21, 21–32. [Google Scholar] [CrossRef]
  3. Shao, Z.; Cheng, G.; Li, D.; Huang, X.; Lu, Z.; Liu, J. Spatio-temporal-spectral-angular observation model that integrates observations from UAV and mobile mapping vehicle for better urban mapping. Geo-Spat. Inf. Sci. 2021, 24, 615–629. [Google Scholar] [CrossRef]
  4. Cheng, G.; Feng, X.; Tian, Y.; Xie, M.; Dang, C.; Ding, Q.; Shao, Z. ASCDet: Cross-space UAV object detection method guided by adaptive sparse convolution. Int. J. Digit. Earth 2025, 18, 2528648. [Google Scholar] [CrossRef]
  5. Zhao, X.; Yang, Z.; Zhao, H. DCS-YOLOv8: A Lightweight Context-Aware Network for Small Object Detection in UAV Remote Sensing Imagery. Remote Sens. 2025, 17, 2989. [Google Scholar] [CrossRef]
  6. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796. [Google Scholar] [CrossRef] [PubMed]
  7. Shao, Z.; Cheng, G.; Ma, J.; Wang, Z.; Wang, J.; Li, D. Real-time and accurate UAV pedestrian detection for social distancing monitoring in COVID-19 pandemic. IEEE Trans. Multimed. 2021, 24, 2069–2083. [Google Scholar] [CrossRef]
  8. Yu, H.; Wang, J.; Bai, Y.; Yang, W.; Xia, G.S. Analysis of large-scale UAV images using a multi-scale hierarchical representation. Geo-Spat. Inf. Sci. 2018, 21, 33–44. [Google Scholar] [CrossRef]
  9. Zhang, S.; Shao, Z.; Huang, X.; Bai, L.; Wang, J. An internal-external optimized convolutional neural network for arbitrary orientated object detection from optical remote sensing images. Geo-Spat. Inf. Sci. 2021, 24, 654–665. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Xie, Y.; Shi, C.; Li, Q.; Yang, B.; Hao, W. Identifying vehicle types from trajectory data based on spatial-semantic information. Geo-Spat. Inf. Sci. 2024, 28, 1757–1773. [Google Scholar] [CrossRef]
  11. Xu, Z.; Zhao, H.; Liu, P.; Wang, L.; Zhang, G.; Chai, Y. SRTSOD-YOLO: Stronger Real-Time Small Object Detection Algorithm Based on Improved YOLO11 for UAV Imageries. Remote Sens. 2025, 17, 3414. [Google Scholar] [CrossRef]
  12. Qu, S.; Dang, C.; Chen, W.; Liu, Y. SMA-YOLO: An Improved YOLOv8 Algorithm Based on Parameter-Free Attention Mechanism and Multi-Scale Feature Fusion for Small Object Detection in UAV Images. Remote Sens. 2025, 17, 2421. [Google Scholar] [CrossRef]
  13. Li, Y.J.; Dai, X.; Ma, C.Y.; Liu, Y.C.; Chen, K.; Wu, B.; He, Z.; Kitani, K.; Vajda, P. Cross-Domain Adaptive Teacher for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7581–7590. [Google Scholar]
  14. Fang, F.; Kang, J.; Li, S.; Tian, P.; Liu, Y.; Luo, C.; Zhou, S. Multi-Granularity Domain-Adaptive Teacher for Unsupervised Remote Sensing Object Detection. Remote Sens. 2025, 17, 1743. [Google Scholar] [CrossRef]
  15. Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Vaughan, J.W. A Theory of Learning from Different Domains. Mach. Learn. 2010, 79, 151–175. [Google Scholar] [CrossRef]
  16. Liu, W.; Liu, J.; Su, X.; Nie, H.; Luo, B. Multi-level domain perturbation for source-free object detection in remote sensing images. Geo-Spat. Inf. Sci. 2024, 28, 1034–1050. [Google Scholar] [CrossRef]
  17. Zhang, G.; Wang, L.; Chen, Z. A Step-Wise Domain Adaptation Detection Transformer for Object Detection under Poor Visibility Conditions. Remote Sens. 2024, 16, 2722. [Google Scholar] [CrossRef]
  18. Ma, Y.; Chai, L.; Jin, L.; Yan, J. Hierarchical alignment network for domain adaptive object detection in aerial images. ISPRS J. Photogramm. Remote Sens. 2024, 208, 39–52. [Google Scholar] [CrossRef]
  19. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain Adaptive Faster R-CNN for Object Detection in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3339–3348. [Google Scholar]
  20. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Strong-Weak Distribution Alignment for Adaptive Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6956–6965. [Google Scholar]
  21. Zhao, L.; Wang, L. Task-Specific Inconsistency Alignment for Domain Adaptive Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14217–14226. [Google Scholar]
  22. Wang, W.; Zhang, J.; Zhai, W.; Cao, Y.; Tao, D. Robust Object Detection via Adversarial Novel Style Exploration. IEEE Trans. Image Process. 2022, 31, 1949–1962. [Google Scholar] [CrossRef]
  23. Arruda, V.F.; Berriel, R.F.; Paixão, T.M.; Badue, C.; De Souza, A.F.; Sebe, N.; Oliveira-Santos, T. Cross-Domain Object Detection Using Unsupervised Image Translation. Expert Syst. Appl. 2022, 192, 116334. [Google Scholar] [CrossRef]
  24. Deng, J.; Li, W.; Chen, Y.; Duan, L. Unbiased Mean Teacher for Cross-Domain Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4091–4101. [Google Scholar]
  25. Deng, J.; Xu, D.; Li, W.; Duan, L. Harmonious Teacher for Cross-Domain Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23829–23838. [Google Scholar]
  26. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  27. Yu, F.; Xian, W.; Chen, Y.; Liu, F.; Liao, M.; Madhavan, V.; Darrell, T. Bdd100k: A Diverse Driving Video Database with Scalable Annotation Tooling. arXiv 2018, arXiv:1805.04687. [Google Scholar]
  28. He, Z.; Zhang, L. Multi-Adversarial Faster-RCNN for Unrestricted Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6668–6677. [Google Scholar]
  29. Hsu, C.C.; Tsai, Y.H.; Lin, Y.Y.; Yang, M.H. Every Pixel Matters: Center-Aware Feature Alignment for Domain Adaptive Object Detector. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12354, pp. 733–748. [Google Scholar] [CrossRef]
  30. Chen, C.; Zheng, Z.; Ding, X.; Huang, Y.; Dou, Q. Harmonizing Transferability and Discriminability for Adapting Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8869–8878. [Google Scholar]
  31. Zheng, Y.; Huang, D.; Liu, S.; Wang, Y. Cross-Domain Object Detection through Coarse-to-Fine Feature Adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13766–13775. [Google Scholar]
  32. He, Z.; Zhang, L. Domain Adaptive Object Detection via Asymmetric Tri-Way Faster-RCNN. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12369, pp. 309–324. [Google Scholar] [CrossRef]
  33. Xu, C.D.; Zhao, X.R.; Jin, X.; Wei, X.S. Exploring Categorical Regularization for Domain Adaptive Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11724–11733. [Google Scholar]
  34. Zhao, Z.; Guo, Y.; Shen, H.; Ye, J. Adaptive Object Detection with Dual Multi-label Prediction. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12373, pp. 54–69. [Google Scholar] [CrossRef]
  35. Lee, D.H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In Proceedings of the Workshop on Challenges in Representation Learning, ICML, Atlanta, GA, USA, 20–21 June 2013; Volume 3, p. 896. [Google Scholar]
  36. Laine, S.; Aila, T. Temporal Ensembling for Semi-Supervised Learning. arXiv 2017, arXiv:1610.02242. [Google Scholar] [CrossRef]
  37. Tarvainen, A.; Valpola, H. Mean Teachers Are Better Role Models: Weight-averaged Consistency Targets Improve Semi-Supervised Deep Learning Results. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  38. Zhang, Y.; Xiang, T.; Hospedales, T.M.; Lu, H. Deep Mutual Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4320–4328. [Google Scholar]
  39. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. Fixmatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. Adv. Neural Inf. Process. Syst. 2020, 33, 596–608. [Google Scholar]
  40. Inoue, N.; Furuta, R.; Yamasaki, T.; Aizawa, K. Cross-Domain Weakly-Supervised Object Detection through Progressive Domain Adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5001–5009. [Google Scholar]
  41. Khodabandeh, M.; Vahdat, A.; Ranjbar, M.; Macready, W.G. A Robust Learning Approach to Domain Adaptive Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 480–490. [Google Scholar]
  42. Cai, Q.; Pan, Y.; Ngo, C.W.; Tian, X.; Duan, L.; Yao, T. Exploring Object Relation in Mean Teacher for Cross-Domain Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11457–11466. [Google Scholar]
  43. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  44. Chen, M.; Chen, W.; Yang, S.; Song, J.; Wang, X.; Zhang, L.; Yan, Y.; Qi, D.; Zhuang, Y.; Xie, D.; et al. Learning Domain Adaptive Object Detection with Probabilistic Teacher. arXiv 2022, arXiv:2206.06293. [Google Scholar] [CrossRef]
  45. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  46. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef] [PubMed]
  47. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  48. Liu, Y.C.; Ma, C.Y.; He, Z.; Kuo, C.W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased Teacher for Semi-Supervised Object Detection. arXiv 2021, arXiv:2102.09480. [Google Scholar] [CrossRef]
  49. Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1180–1189. [Google Scholar]
  50. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  51. Xu, Y.; Sun, Y.; Yang, Z.; Miao, J.; Yang, Y. H2FA R-CNN: Holistic and Hierarchical Feature Alignment for Cross-Domain Weakly Supervised Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14329–14339. [Google Scholar]
  52. Cao, S.; Joshi, D.; Gui, L.Y.; Wang, Y.X. Contrastive Mean Teacher for Domain Adaptive Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23839–23848. [Google Scholar]
  53. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 1 April 2024).
  54. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. Imagenet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Figure 1. An example of domain shift in ground images [26,27].
Figure 2. An example of domain shift in UAV images.
Figure 3. The basic network architecture of WMFA-AT. WMFA-AT consists of two main modules: a Target-Domain Teacher Model and a Cross-Domain Student Model. The teacher model processes weakly augmented images from the target domain, while the student model processes strongly augmented images from both the source and target domains.
Figure 4. Object detection results of the comparison methods across the four cross-domain tasks.
Figure 5. Object detection results of the proposed method across the four cross-domain tasks.
Figure 6. Variation of feature-layer weights in weighted multi-layer feature alignment.
Table 1. Cross-time UAV object detection dataset VisDrone_daytime_to_night.

Dataset | Image Num | People | Car | Van | Truck | Bus | Motor | Instance Num
VisDrone_daytime_to_night_source | 6000 | 109,422 | 133,589 | 24,235 | 12,198 | 6565 | 28,947 | 314,956
VisDrone_daytime_to_night_target | 1200 | 8706 | 21,065 | 3060 | 1695 | 1497 | 2359 | 38,382
Table 2. Cross-time UAV object detection dataset UAVDT_daytime_to_night.

Dataset | Image Num | Car | Truck | Bus | Instance Num
UAVDT_daytime_to_night_source | 8000 | 157,901 | 6298 | 3815 | 168,014
UAVDT_daytime_to_night_target | 2000 | 22,968 | 264 | 624 | 23,856
Table 3. Cross-camera UAV object detection dataset VisDrone_to_UAVDT.

Dataset | Image Num | Car | Truck | Bus | Instance Num
VisDrone_to_UAVDT_source_training | 6000 | 134,500 | 12,044 | 5528 | 152,072
VisDrone_to_UAVDT_target_training | 6000 | 97,879 | 4313 | 2622 | 104,814
VisDrone_to_UAVDT_target_testing | 2000 | 44,195 | 918 | 913 | 46,026
Table 4. Cross-view UAV object detection dataset.

Cross-View | Dataset | Image Num | Car
DIOR → VisDrone | DIOR_source_training | 6000 | 37,814
DIOR → VisDrone | VisDrone_target_training | 6000 | 151,813
DIOR → VisDrone | VisDrone_target_testing | 548 | 15,065
DIOR → UAVDT | DIOR_source_training | 6000 | 37,814
DIOR → UAVDT | UAVDT_target_training | 6000 | 104,814
DIOR → UAVDT | UAVDT_target_testing | 2000 | 46,026
DOTA → VisDrone | DOTA_source_training | 4357 | 103,283
DOTA → VisDrone | VisDrone_target_training | 6000 | 151,813
DOTA → VisDrone | VisDrone_target_testing | 548 | 15,065
DOTA → UAVDT | DOTA_source_training | 4357 | 103,283
DOTA → UAVDT | UAVDT_target_training | 6000 | 104,814
DOTA → UAVDT | UAVDT_target_testing | 2000 | 46,026
Table 5. Cross-weather UAV object detection dataset UAVDT_sunny_to_rainy.

Dataset | Image Num | Car
UAVDT_sunny_to_rainy_training | 6000 | 106,180
UAVDT_sunny_to_rainy_testing | 1500 | 69,270
Table 6. Detection results on the VisDrone cross-time UAV object detection dataset.

Method | AP50 | People (AP50) | Car (AP50) | Van (AP50) | Truck (AP50) | Bus (AP50) | Motor (AP50)
Faster RCNN (source-only) | 23.0 | 14.3 | 54.0 | 14.9 | 16.0 | 31.2 | 7.6
DA-Faster RCNN | 30.5 | 25.2 | 57.9 | 25.8 | 26.8 | 33.6 | 14.0
SWDA | 31.1 | 29.2 | 63.3 | 26.6 | 31.8 | 34.3 | 16.6
UT | 31.3 | 22.4 | 63.0 | 26.1 | 23.3 | 37.9 | 15.1
PT | 26.7 | 11.4 | 58.5 | 28.6 | 20.3 | 33.3 | 7.8
H2FA | 28.4 | 17.8 | 57.8 | 23.8 | 27.0 | 34.4 | 9.9
AT | 31.6 | 20.5 | 61.8 | 27.1 | 32.0 | 37.7 | 10.4
CMT | 32.1 | 21.3 | 62.2 | 26.0 | 31.7 | 39.1 | 12.4
WMFA-AT | 33.4 | 21.6 | 64.5 | 27.7 | 34.1 | 38.9 | 13.8
Table 7. Detection results on the UAVDT cross-time UAV object detection dataset.

Method | AP50 | Car (AP50) | Truck (AP50) | Bus (AP50)
Faster RCNN (source-only) | 42.7 | 65.8 | 29.5 | 32.9
DA-Faster RCNN | 48.7 | 76.4 | 35.5 | 34.1
SWDA | 57.5 | 69.7 | 57.4 | 45.4
UT | 62.1 | 66.1 | 74.8 | 45.3
PT | 55.8 | 81.0 | 40.2 | 46.3
H2FA | 62.7 | 75.8 | 50.2 | 62.2
AT | 63.0 | 81.7 | 53.8 | 53.5
CMT | 63.4 | 83.9 | 58.1 | 48.2
WMFA-AT | 64.6 | 81.0 | 60.7 | 52.0
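Assuming the AP50 column reports the class-averaged score (the mAP50 metric named in Tables 11 and 12), the overall value in Tables 7 and 8 can be recovered from the per-class columns; a minimal worked check for the Faster RCNN (source-only) row of Table 7, with C = 3 classes, is:

\[
\mathrm{mAP}_{50} \;=\; \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP}_{50}^{(c)} \;=\; \frac{65.8 + 29.5 + 32.9}{3} \;\approx\; 42.7
\]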
Table 8. Detection results on the cross-camera UAV object detection dataset.

Method | AP50 | Car (AP50) | Truck (AP50) | Bus (AP50)
Faster RCNN (source-only) | 22.6 | 40.9 | 3.5 | 23.4
DA-Faster RCNN | 23.6 | 42.2 | 4.7 | 23.8
SWDA | 25.7 | 45.4 | 8.6 | 23.2
UT | 26.8 | 43.9 | 5.1 | 31.3
PT | 27.1 | 42.7 | 9.1 | 29.6
H2FA | 27.3 | 47.6 | 4.5 | 29.8
AT | 26.5 | 44.6 | 6.5 | 28.4
CMT | 26.8 | 46.0 | 7.9 | 26.5
WMFA-AT | 27.5 | 45.6 | 5.0 | 32.0
Table 9. Detection results on the cross-view UAV object detection dataset.

Method | DIOR → VisDrone (Car AP50) | DIOR → UAVDT (Car AP50) | DOTA → VisDrone (Car AP50) | DOTA → UAVDT (Car AP50)
Faster RCNN (source-only) | 7.5 | 9.8 | 5.5 | 8.3
DA-Faster RCNN | 16.0 | 16.4 | 15.5 | 19.4
SWDA | 7.6 | 17.6 | 8.8 | 15.4
UT | 16.2 | 22.6 | 10.5 | 18.9
PT | 10.3 | 19.5 | 5.2 | 15.7
H2FA | 9.9 | 13.5 | 15.4 | 20.1
AT | 35.8 | 36.3 | 35.6 | 44.1
CMT | 33.4 | 35.7 | 26.3 | 43.0
WMFA-AT | 40.3 | 41.5 | 38.7 | 49.1
Table 10. Detection results on the cross-weather UAV object detection dataset.

Method | Car AP50
Faster RCNN (source-only) | 21.4
DA-Faster RCNN | 33.2
SWDA | 25.3
UT | 38.4
PT | 38.8
H2FA | 37.4
AT | 60.6
CMT | 61.4
WMFA-AT | 62.5
Table 11. Ablation study of WMFA-AT. The metric reported in the table is mAP50 (%).

Method | Cross-Time (UAVDT) | Cross-Camera | Cross-View (DOTA → UAVDT) | Cross-Weather
WMFA-AT | 64.6 | 27.5 | 49.1 | 62.5
WMFA-AT w/o L_WMFA | 57.9 | 24.3 | 39.3 | 55.2
WMFA-AT w/o WS Aug | 61.4 | 25.0 | 43.1 | 57.0
WMFA-AT w/o λ_unsup & EMA | 52.5 | 23.4 | 29.5 | 41.8
Table 12. Results of different image-level feature alignment methods. The metric reported in the table is mAP50 (%).

Method | Cross-Time (UAVDT) | Cross-Camera | Cross-View (DOTA → UAVDT) | Cross-Weather
w/SFA | 61.0 | 25.4 | 44.1 | 56.6
w/MFA | 62.9 | 26.1 | 47.1 | 60.0
w/WMFA | 64.6 | 27.5 | 49.1 | 62.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
