Article

Reconstructing Domain-Specific Features for Unsupervised Domain-Adaptive Object Detection

1 Zhongshan Institute, University of Electronic Science and Technology of China, Zhongshan 528400, China
2 Mathematics and Information Institute, South China Agricultural University, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Information 2025, 16(6), 439; https://doi.org/10.3390/info16060439
Submission received: 17 April 2025 / Revised: 23 May 2025 / Accepted: 24 May 2025 / Published: 26 May 2025
(This article belongs to the Special Issue Emerging Research in Object Tracking and Image Segmentation)

Abstract

Unsupervised domain adaptation (UDA) effectively transfers knowledge learned from a labeled source domain to an unlabeled target domain. The teacher–student framework, which generates pseudo-labels for target domain samples and uses them for pseudo-supervised training, enables self-training and improves generalization in UDA object detection. However, for one-stage detection models, pseudo-labels are unreliable when positive and negative samples are imbalanced. This may lead the model to overfit the source domain and overlook important target-domain information. In this work, we propose a novel domain-specific student–teacher framework to address this issue. The innovations of the proposed framework can be summarized in two aspects. First, we employ two domain-specific heads (DSHs) in the student model to handle inputs from the source domain and the target domain separately. These two heads are optimized independently with samples from their respective domains. This design allows for reducing the impact of unreliable pseudo-labels and fully leveraging unique information specific to the target domain. Second, we introduce an auxiliary reconstruction branch, named the multi-scale mask adversarial alignment (MMAA) module, into the teacher–student framework. The MMAA is tasked with reconstructing randomly masked multi-scale features of the source domain, which enhances the student model’s semantic representation capability and facilitates the generation of high-quality pseudo-labels. Experimental results on six diverse cross-domain scenarios demonstrate the effectiveness of our framework.


1. Introduction

Object detection, which aims to jointly predict the categories and bounding boxes of foreground objects in images, is a fundamental task in the field of computer vision. Effective training of object detection models heavily depends on the availability of large-scale annotated datasets. However, in many practical applications, such training data may not be readily available due to constraints in cost, time, and other factors. In such cases, a common strategy is to train the model on publicly available datasets and then deploy it in specific application scenarios. Consequently, the model’s performance may suffer a significant degradation due to the discrepancy between the training data and the application environment, commonly referred to as the domain shift [1,2].
To address this issue, researchers have explored unsupervised domain adaptation (UDA) detection [3], aiming to transfer knowledge learned from an annotated source domain to an unlabeled target domain. Specifically, UDA methods extract valuable information from unlabeled target domain samples and integrate it into the training process to improve model generalization. Beyond object detection, UDA has also been extensively studied in other vision tasks such as classification [4] and segmentation [5]. Although the performance and generalization capabilities of large visual models, such as SAM [6] or Grounding DINO [7], have significantly improved in recent years, these models often require substantial computational resources. As a result, UDA-based detection models remain a preferred choice for practical engineering applications where efficiency and adaptability are critical.
Existing UDA methods can be generally categorized into six main classes [8]: domain-invariant feature learning [9,10], pseudo-label-based self-training [11], image-to-image translation [12], domain randomization [13], mean-teacher training [14], and graph reasoning [15]. Each of these approaches offers distinct advantages, which has motivated researchers to combine multiple strategies to improve performance. A particularly effective strategy involves employing the teacher–student framework to generate pseudo-labels for target domain samples and utilizing these labels for pseudo-supervised training [16,17,18,19].
In previous works, the student network typically employs a single detection head for both the source and target domains, while the teacher network inherits the student model’s weights via the exponential moving average (EMA) algorithm. However, when positive and negative samples are imbalanced, which is common in domain adaptation, the resulting pseudo-labels tend to be unreliable. In two-stage detection models, most of these unreliable pseudo-labels are filtered out during proposal generation, which mitigates their negative impact. One-stage detection models lack such a filtering stage, so unreliable pseudo-labels may lead the model to overfit the source domain and thereby neglect critical information specific to the target domain.
To address the challenges outlined above, we propose a novel domain-specific teacher–student framework with FCOS [20] as the baseline detector. The proposed approach involves two primary modifications.
First, a target-domain-specific detection head is incorporated into the student model and is independently optimized using target-domain samples. As shown in Figure 1, the dual domain-specific heads (DSHs) facilitate more effective extraction of latent domain-specific features and mitigate the negative impact of unreliable pseudo-labels. This design fully exploits the information unique to the target domain.
Second, we introduce an auxiliary reconstruction task into the teacher–student framework, referred to as the multi-scale mask adversarial alignment (MMAA) module. This module enhances the student model’s semantic representation capability, thereby enabling the generation of more reliable pseudo-labels. In particular, multi-scale sparse features are extracted through random masking and subsequently reconstructed by a dedicated decoder. The auxiliary task is trained under an adversarial paradigm to encourage the backbone to learn domain-invariant feature representations.
The rest of this paper is organized as follows: Section 2 provides an overview of related work. Section 3 defines the problem and introduces the self-training framework. Section 4 describes the core methodology in detail. Section 5 presents the experimental results. Finally, Section 6 concludes the study.

2. Related Work

2.1. UDA Detection

Object detection models can be generally divided into three main categories: one-stage detectors [21,22,23,24], two-stage detectors [25,26], and transformer-based detectors [27,28,29]. One-stage detectors directly regress the locations and categories of foreground objects from each pixel center of the feature map or based on a set of predefined anchor boxes. Two-stage detectors first generate region proposals in the initial stage and then classify and refine these proposals in the second stage, effectively filtering out unreliable ones. In a typical transformer-based detector such as the Detection Transformer (DETR) [27], the encoder–decoder architecture processes a fixed-length sequence of tokens derived from the input image’s feature map. This architecture enables the model to generate a set of box predictions in an end-to-end fashion based on these tokens.
Due to the reliance of object detection networks on large-scale training data, UDA detection, which aims at transferring knowledge from a labeled source domain to an unlabeled target domain, has undergone rapid development in recent years. Despite their differences, these UDA detection approaches aim to learn feature representations that reduce the domain gap, thereby enabling better knowledge transfer and improved detection performance. In the early stages of UDA research, most studies focused on two-stage detectors. Chen et al. [9] were the first to incorporate the generative adversarial learning paradigm (GALP) [30] and the gradient reversal layer (GRL) [2] into the Faster R-CNN architecture [26], aligning features at both the image level and the instance level. Following their work, GALP has increasingly become a fundamental component in UDA detection, owing to its capability not only to align the source and target domains within the feature space but also to minimize the distributional discrepancies between them. Xu et al. [31] explored the consistency between image-level and instance-level predictions by introducing additional classifiers at the instance level, thereby enhancing the alignment of corresponding objects across domains. Building on this foundation, subsequent studies have extended feature alignment to multiple semantic levels, including the image level [32], pixel level [33,34,35], instance level [8,36], and category level [10,37,38]. Furthermore, aligning fine-grained feature distributions has been shown to improve model performance [39,40], enhancing robustness and adaptability by addressing discrepancies at finer granularities.
At the same time, researchers have also explored one-stage UDA detectors. Hsu et al. [35] introduced a center-aware feature alignment method that leverages the properties of FCOS [22] to achieve pixel-level feature alignment. Zhang et al. [41] enhanced the generalization performance of YOLOv3 [42] by regularizing image-level features with instance-level features.
Inspired by the success of transformers in natural language processing, researchers have increasingly turned their attention to transformer-based unsupervised domain adaptation (UDA) detectors [27,43]. These detectors typically employ an encoder–decoder architecture, wherein the encoder is responsible for extracting high-level semantic features from the input data, and the decoder generates instance-level predictions based on learned object queries or tokens. Adversarial learning and consistency regularization techniques have been integrated into transformer-based UDA detectors [16,43,44]. Simultaneously, semantic prompts have been shown to effectively enhance the domain awareness of detectors and demonstrate significant potential in UDA detection tasks [45,46]. For a more comprehensive discussion on language-assisted UDA detectors, please refer to the review by Feng et al. [47].
Despite the strong performance achieved by two-stage and transformer-based detectors on many publicly available datasets, these models typically require substantial computational resources. In contrast, one-stage object detectors offer a more favorable trade-off between detection accuracy and computational efficiency, making them better suited for deployment on low-power edge devices. Many previous studies have adopted VGG16 as the backbone of one-stage object detectors [19,48]. In this work, we adopt FCOS [22] with a ResNet-50 backbone as our baseline detector, which achieves superior detection performance with significantly fewer parameters compared to VGG16.

2.2. Self-Training Based on Teacher–Student Learning

Li et al. [17] proposed an adaptive mean-teacher framework based on [49] to enhance the generalization and robustness of UDA detectors. However, their method is sensitive to the uncertainty of pseudo-labels. To address this issue, Chen et al. [50] introduced uncertainty-guided consistency training to facilitate both classification and localization adaptation. More recently, Deng et al. [19] proposed a framework that integrates a harmonious sample re-weighting module with a harmonious model learning strategy, aiming to improve the quality of pseudo-labels. Although self-training has significantly improved detector performance, most previous studies have primarily focused on enhancing the quality of pseudo-labels, with relatively little attention paid to learning domain-specific information in the target domain. To bridge this gap, we propose a novel domain-specific teacher–student framework that enhances the self-training paradigm by decoupling domain-specific knowledge acquisition and reconstructing masked multi-scale features.

3. Problem Formulation and Self-Training Framework

3.1. Problem Formulation

A domain consists of a feature space and a marginal probability distribution, corresponding to the feature data and their underlying distribution, respectively. Classical machine learning assumes that the training and test sets are sampled from the same domain. However, this assumption does not always hold in real-world scenarios, where discrepancies in domain distributions often arise between different data sources. Assume a labeled source domain dataset $\mathcal{D}_S = \{(x_n^S, y_n^S)\}_{n=1}^{N_S}$ and an unlabeled target domain dataset $\mathcal{D}_T = \{x_n^T\}_{n=1}^{N_T}$, where $y_n^S$ denotes the class labels and bounding box annotations corresponding to the source image $x_n^S$, and $N_S$ and $N_T$ represent the total number of samples in the source and target domains, respectively. The objective is to train an adaptive detector using the labeled source samples $\mathcal{D}_S$ and unlabeled target domain samples $\mathcal{D}_T$, with the aim of improving the generalization performance on the target domain by aligning the marginal probability distributions across domains.

3.2. Domain-Specific Teacher–Student Framework

Self-training. Current self-training-based domain adaptive object detection methods commonly employ the mean-teacher framework and integrate feature-level adversarial domain adaptation techniques to enhance detector performance on the target domain dataset. In particular, both the teacher and student models share an identical architecture. The teacher model is responsible for generating pseudo-labels from weakly augmented samples in the target domain, while the student model processes strongly augmented samples from the same domain, producing predictions and performing pseudo-supervised optimization. This optimization treats the pseudo-labels as a form of supervisory information, which inherently introduces uncertainty into the unsupervised loss. Taking the FCOS model as an example, the unsupervised loss $L_{uns}^{T}$ usually consists of three components: the classification loss $L_{cls}^{T}$, the regression loss $L_{reg}^{T}$, and the centerness loss $L_{ctn}^{T}$. The specific formulation of the loss is as follows:

$$ L_{uns}^{T} = L_{cls}^{T}(x_t, \hat{y}_t) + L_{reg}^{T}(x_t, \hat{y}_t) + L_{ctn}^{T}(x_t, \hat{y}_t) \quad (1) $$

where $L_{cls}^{T}$ denotes the focal loss [51] for classification, $L_{reg}^{T}$ is the IoU loss for bounding box regression, and $L_{ctn}^{T}$ refers to the binary cross-entropy loss for localization. $\hat{y}_t$ indicates the pseudo-label generated by the teacher model.
During the training of the adaptive teacher–student framework, strongly augmented source domain samples are typically fed directly into the student model, and the detection loss is computed using the ground-truth labels from the source domain. The supervised loss for the source domain, denoted as $L_{sup}^{S}$, is formulated as follows:

$$ L_{sup}^{S} = L_{cls}^{S}(x_s, y_s) + L_{reg}^{S}(x_s, y_s) + L_{ctn}^{S}(x_s, y_s) \quad (2) $$
Thus, the overall optimization objective of self-training can be summarized as follows:
$$ L_{st} = L_{sup}^{S} + L_{uns}^{T} \quad (3) $$
It is worth noting that during the self-training process, the student model is first optimized using the supervised loss and subsequently undergoes joint optimization with both the supervised and unsupervised losses. In addition, the teacher model is updated exclusively through the EMA of the student model’s weights and does not accumulate gradient updates. The specific update rules are as follows:
$$ \theta_t \leftarrow \alpha \theta_t + (1 - \alpha)\theta_s \quad (4) $$

where $\theta_t$ and $\theta_s$ denote the parameters of the teacher and student models, respectively, and $\alpha$ is a hyperparameter.
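As a minimal illustration of the update rule in Equation (4), the following PyTorch sketch copies an exponential moving average of the student's parameters into the teacher; the function name and the buffer handling are our own additions, not part of the paper:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.9996):
    """EMA update of Equation (4): the teacher receives no gradients and
    only tracks a moving average of the student's parameters."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s.detach(), alpha=1.0 - alpha)
    # Buffers (e.g., BatchNorm running statistics) are copied directly.
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)
```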
Adversarial domain adaptation. Following previous studies, we adopt the generative adversarial paradigm in the teacher–student framework to enable the extraction of domain-invariant features from both the source and target domains. Specifically, in this paradigm, the domain discriminator is tasked with identifying the origin domain of the input features, while the feature extractor is trained to generate features that are invariant across domains. The adversarial interaction between these two components is implemented through the GRL [2], which inverts the gradient values during back-propagation. The adversarial optimization is fundamentally formulated as a mini-max game, defined as follows:
$$ L_{adv} = \max_{D}\min_{G} L_{dis} \quad (5) $$

where $G$ and $D$ denote the feature extractor and the domain discriminator, respectively, and $L_{dis}$ represents the classification loss of the domain discriminator. Accordingly, the overall optimization objective of the adaptive teacher–student framework is as follows:

$$ L_{base} = L_{st} + L_{adv} \quad (6) $$
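The mini-max game in Equations (5) and (6) is typically realized with a gradient reversal layer placed in front of a small domain discriminator. Below is a hedged PyTorch sketch; the class names and channel widths are illustrative, while the two-layer 3 × 3 discriminator follows the description in Section 4.3:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in
    the backward pass, implementing the GRL used for Equation (5)."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainDiscriminator(nn.Module):
    """A small per-pixel discriminator (two 3x3 conv layers, Section 4.3)."""
    def __init__(self, in_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, kernel_size=3, padding=1),
        )

    def forward(self, feat, lam=1.0):
        return self.net(GradientReversal.apply(feat, lam))

def adversarial_loss(disc, feat_src, feat_tgt, lam=1.0):
    # Domain labels: 1 for source, 0 for target.
    logit_s = disc(feat_src, lam)
    logit_t = disc(feat_tgt, lam)
    return F.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s)) + \
           F.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t))
```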

4. Proposed Method

4.1. Overall Framework

In this section, we construct a domain adaptive object detection network based on a self-training framework, as illustrated in Figure 2. We select FCOS as the baseline detector due to its anchor-free architecture, which eliminates the need for managing extensive anchor configurations and enables effective filtering of low-quality pseudo-labels by leveraging its dense prediction mechanism. A dedicated detection head is introduced for the target domain, which does not share weights with the detection head for the source domain. After passing through the shared backbone network and feature fusion layer, the augmented source domain samples and the target domain samples are fed into their respective detection heads to generate domain-specific predictions. With the decoupled detection heads, the model is capable of extracting domain-specific information, thereby significantly enhancing its robustness. Furthermore, we incorporate a global feature alignment (GFA) [16,50] module and an MMAA module into the backbone and feature fusion layers of the student model, respectively. The dominance of feature alignment is dynamically shifted from GFA to MMAA through a scheduled weighting strategy during training. For the MMAA module, the input features are weighted by a random mask and then fed into a fully convolutional decoder to reconstruct multi-scale selective features. These features are further combined with an adversarial domain adaptation method to align the pixel-level feature distributions.
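To make the workflow of Figure 2 concrete, the sketch below outlines one co-training iteration under stated assumptions: `generate_pseudo_labels`, `forward_source`, `forward_target`, and `alignment_losses` are hypothetical wrappers around the detector heads and the GFA/MMAA modules, and `ema_update` is the EMA sketch given in Section 3.2. This is not the authors' implementation:

```python
import torch

def train_step(teacher, student, batch, optimizer, alpha=0.9996):
    """One co-training iteration: the teacher labels weakly augmented target
    images, the student is optimized on strongly augmented source and target
    images, and the teacher is refreshed by EMA (a simplified sketch)."""
    src_strong, src_labels = batch["source_strong"], batch["source_labels"]
    tgt_weak, tgt_strong = batch["target_weak"], batch["target_strong"]

    with torch.no_grad():
        pseudo_labels = teacher.generate_pseudo_labels(tgt_weak)  # placeholder API

    # Domain-specific heads: source and target samples use separate heads.
    loss_sup = student.forward_source(src_strong, src_labels)
    loss_uns = student.forward_target(tgt_strong, pseudo_labels)
    loss_align = student.alignment_losses(src_strong, tgt_strong)  # GFA + MMAA

    loss = loss_sup + loss_uns + loss_align
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(teacher, student, alpha)  # see the EMA sketch in Section 3.2
    return loss.item()
```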

4.2. Domain-Specific Teacher–Student Framework

In self-training-based domain-adaptive object detection, the student model typically employs a shared detection branch for both source and target domains, as illustrated in the left panel of Figure 3. However, the unsupervised training on target domain samples often yields inferior performance compared to supervised training on source domain data. This is primarily due to the fact that the pseudo-labels generated by the teacher model for the target domain exhibit significant variations in quality. These low-quality pseudo-labels can mislead the student model, potentially causing it to converge to sub-optimal solutions. Moreover, most existing methods overlook the importance of domain-specific characteristics of the target data. Evidence from the literature [52] suggests that achieving generalization across arbitrary distributions remains challenging without effectively leveraging such target domain-specific information.
Figure 2. The overall structure of our proposed framework for domain-adaptive object detection. The framework consists of a teacher model and a student model. The teacher model receives weakly augmented target domain images and generates pseudo-labels, while the student model receives strongly augmented source and target domain images; after predictions are produced through the backbone network, feature fusion layer, and the respective detection heads, the loss is optimized with ground-truth labels and pseudo-labels. The feature fusion layer is a Feature Pyramid Network (FPN) [53]. The GFA module and the MMAA module act on the backbone network and the feature fusion layer of the student model, respectively. The parameters of the teacher model are updated only by the EMA of the student model.
To address the aforementioned issues, we introduce another detection head specifically for the target domain, as illustrated in the right panel of Figure 3. After feature encoding, source domain samples are fed into the source domain detection head, i.e., the original detection branch of the detector, while target domain samples are directed to the newly added target domain detection head. The latter is trained in an unsupervised manner using a mixture of source and target domain samples. This mixed training paradigm allows the target domain detection head to better capture domain-specific characteristics, thereby enhancing the generalization capability of the student network on the target domain. This improvement further facilitates the generation of higher-quality pseudo-labels, which are subsequently fed back to the teacher network for model refinement. Similar to the adaptive teacher–student framework, the optimization objective of the student network combines both supervised and unsupervised loss components.
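A minimal sketch of the dual-head routing described above, assuming `backbone`, `fpn`, and `head` are pre-built FCOS components; `DomainSpecificDetector` is an illustrative stand-in rather than the actual implementation:

```python
import copy
import torch.nn as nn

class DomainSpecificDetector(nn.Module):
    """Shared backbone + FPN with two detection heads that do not share
    weights: one for source-domain images, one for target-domain images."""
    def __init__(self, backbone, fpn, head):
        super().__init__()
        self.backbone = backbone
        self.fpn = fpn
        self.head_source = head                  # original FCOS head
        self.head_target = copy.deepcopy(head)   # independently optimized copy

    def forward(self, images, domain="source"):
        feats = self.fpn(self.backbone(images))
        head = self.head_source if domain == "source" else self.head_target
        return head(feats)  # classification, regression, centerness maps
```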
The dense prediction capability of FCOS enables us to make targeted modifications to the loss components in Equation (1). For the classification loss, we adopt the method proposed by Dense Teacher [11], which computes filtering scores for pseudo-labels as the Hadamard product of the teacher model’s classification predictions and its centerness predictions. During the filtering process, these scores are sorted, and the top $\sigma$ fraction of pixels is selected as positive samples $\hat{y}_t$, as formally defined below:

$$ \hat{y}_t = \mathrm{topk}\left(P_{cls}^{t} \odot P_{ctn}^{t}, \sigma\right) \quad (7) $$

where $P_{cls}^{t}$ and $P_{ctn}^{t}$ represent the classification and centerness predictions of the teacher model, respectively.
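A sketch of the top-σ filtering in Equation (7), assuming the teacher's classification and centerness maps have already been passed through a sigmoid; the tensor layout is our assumption:

```python
import torch

def topk_pseudo_labels(p_cls_t, p_ctn_t, sigma=0.1):
    """Keep the top sigma fraction of pixels ranked by the Hadamard product
    of teacher classification and centerness scores (Equation (7)); the
    remaining positions are zeroed so they act as soft negatives."""
    # p_cls_t: (N, C, H, W) sigmoid scores, p_ctn_t: (N, 1, H, W)
    scores = p_cls_t * p_ctn_t                          # element-wise product
    flat = scores.flatten(1)                            # (N, C*H*W)
    k = max(1, int(sigma * flat.shape[1]))
    thresh = torch.topk(flat, k, dim=1).values[:, -1:]  # k-th largest per image
    keep = (flat >= thresh).float().view_as(scores)
    return scores * keep                                # dense, continuous pseudo-labels
```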
Since the filtered pseudo-labels are continuous values, we adopt the Quality Focal Loss (QFL) [54] as the classification loss function. QFL not only accommodates continuous-valued labels but also retains the advantages of the original Focal Loss, effectively balancing the contributions of positive and negative samples as well as hard and easy samples. The classification loss is defined as follows:
$$ L_{cls}^{T} = -\left|\hat{y}_t - P_{cls}^{s}\right|^{\beta} \left( (1 - \hat{y}_t)\log(1 - P_{cls}^{s}) + \hat{y}_t \log(P_{cls}^{s}) \right) \quad (8) $$

where $P_{cls}^{s}$ represents the classification predictions of the student model, $\hat{y}_t$ denotes the filtered pseudo-label generated by the teacher model, and $\beta$ serves as the balancing factor of the loss function.
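A compact sketch of the Quality Focal Loss in Equation (8), assuming sigmoid classification scores and continuous pseudo-labels in [0, 1]; it mirrors the formula above rather than any reference implementation:

```python
import torch

def quality_focal_loss(p_cls_s, y_hat, beta=2.0, eps=1e-6):
    """Quality Focal Loss (Equation (8)): cross-entropy against continuous
    pseudo-labels, modulated by |y - p|^beta to down-weight easy samples."""
    p = p_cls_s.clamp(eps, 1.0 - eps)
    ce = -((1.0 - y_hat) * torch.log(1.0 - p) + y_hat * torch.log(p))
    weight = (y_hat - p).abs().pow(beta)
    return (weight * ce).sum()
```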
The model localization loss consists of two components: the regression loss $L_{reg}^{T}$ and the centerness loss $L_{ctn}^{T}$. We employ the Generalized Intersection over Union (GIoU) loss and the binary cross-entropy (BCE) loss for $L_{reg}^{T}$ and $L_{ctn}^{T}$, respectively. The specific formulas are defined as follows:
$$ L_{reg}^{T} = \mathrm{GIoU}\left(P_{reg}^{s}, \hat{y}_t\right) \quad (9) $$

$$ L_{ctn}^{T} = -\hat{y}_t \log P_{ctn}^{s} - (1 - \hat{y}_t)\log(1 - P_{ctn}^{s}) \quad (10) $$

where $P_{reg}^{s}$ and $P_{ctn}^{s}$ represent the regression predictions and centerness predictions of the student model, respectively, and $\hat{y}_t$ denotes the filtered pseudo-label generated by the teacher model.
The overall optimization objective of the student model is formally defined as follows:
$$ L_{uns}^{T} = L_{cls}^{T} + L_{reg}^{T} + L_{ctn}^{T} \quad (11) $$
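The localization terms in Equations (9) and (10) can be sketched as follows, assuming a recent torchvision that provides `generalized_box_iou_loss` and boxes already decoded to (x1, y1, x2, y2) corners for the selected positive pixels:

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def localization_loss(p_reg_s, p_ctn_s, y_boxes, y_ctn):
    """Localization terms of Equations (9)-(10): GIoU loss on the boxes of
    positive pixels and binary cross-entropy on the centerness predictions."""
    # p_reg_s, y_boxes: (M, 4) decoded boxes for positive pixels
    # p_ctn_s, y_ctn:   (M,) centerness predictions and pseudo-targets in [0, 1]
    loss_reg = generalized_box_iou_loss(p_reg_s, y_boxes, reduction="sum")
    loss_ctn = F.binary_cross_entropy(p_ctn_s, y_ctn, reduction="sum")
    return loss_reg + loss_ctn
```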

4.3. Global Feature Alignment

Based on the work of [16,50], we incorporate a global feature alignment module into the backbone architecture. The primary objective of this module is to align image-level feature distributions in order to mitigate domain discrepancies. Specifically, it employs a global domain discriminator that leverages an adversarial domain adaptation strategy to determine whether the feature distribution of each input sample originates from the source or target domain. To implement this, we enhance the network architecture by integrating the GRL along with domain discriminators applied to the final three feature maps of the backbone. Each discriminator consists of two convolutional layers with  3 × 3  kernels. During the pre-training phase, the adaptive teacher–student framework exclusively utilizes data from the source domain, thereby improving the generalization capability of the detector through GFA. As a result, a robust teacher model is obtained before the co-training phase begins. However, in later stages of training, the model often tends to converge to sub-optimal solutions, which compromises the reliability of the global features. To address this issue, we systematically reduce the loss weight assigned to the GFA module by introducing a carefully designed weighting factor. This facilitates a smooth transition toward the multi-scale mask-based adversarial alignment module. The formulation of this weighting factor is described as follows:
$$ \gamma = \frac{2}{1 + \exp(\zeta \cdot \eta)} \quad (12) $$

where $\eta$ controls the smoothness of the transition and $\zeta$ denotes the ratio of the current iteration number to the pre-training end iteration threshold. Consequently, the optimization objective of the GFA module is formulated as follows:

$$ L_{GFA} = \frac{\gamma}{N_S}\sum_{n=1}^{N_S} \mathrm{BCE}\big(D(G(x_n^{s})), y_s\big) + \frac{\gamma}{N_T}\sum_{n=1}^{N_T} \mathrm{BCE}\big(D(G(x_n^{t})), y_t\big) \quad (13) $$

where $G$ and $D$ are the backbone network and the domain discriminator, respectively, $y_s$ and $y_t$ are the domain labels used in the adversarial domain adaptation method and are set to 1 and 0, respectively, and $\mathrm{BCE}$ denotes the binary cross-entropy loss function.
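A sketch of the scheduled GFA weight of Equation (12) and the weighted discriminator loss of Equation (13). The text uses ζ and η somewhat differently in Section 4.3 and Section 5.2, so here `progress` is the iteration ratio and `smoothness` the constant (5.0 in Section 5.2); per-sample sums are also replaced by batch means:

```python
import math
import torch
import torch.nn.functional as F

def gfa_weight(iteration, pretrain_end, smoothness=5.0):
    """Equation (12): gamma starts near 1 and decays toward 0 as training
    progresses, shifting the alignment emphasis from GFA to MMAA."""
    progress = iteration / float(pretrain_end)   # progress ratio from Section 4.3
    return 2.0 / (1.0 + math.exp(progress * smoothness))

def gfa_loss(disc, feats_src, feats_tgt, gamma):
    """Equation (13): gamma-weighted BCE of the global domain discriminator
    on source (label 1) and target (label 0) backbone features."""
    logit_s = disc(feats_src)
    logit_t = disc(feats_tgt)
    loss = F.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s)) + \
           F.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t))
    return gamma * loss
```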

4.4. Multi-Scale Mask Adversarial Alignment

By leveraging domain-specific information, the generalization capability of the student network is significantly enhanced, and the precision of pseudo-labeling is concurrently improved. However, errors accumulated during the EMA updates of the teacher model may lead to the generation of low-quality pseudo-labels. To mitigate the risk of sub-optimal outcomes caused by erroneous pseudo-labeling, we implement a dense-sparse training strategy. Inspired by Masked Autoencoders (MAEs) [55], we incorporate a reconstruction-based learning mechanism, in which the model learns to reconstruct randomly masked inputs. This process has been shown to enhance the model’s semantic expressiveness and improve its generalization ability. Moreover, the introduction of randomness helps the student model avoid suboptimal learning trajectories that are prone to being misled by inaccurate pseudo-labels. Accordingly, we adopt a reconstruction-based training approach to develop the MMAA module, aiming to address the issue of suboptimal learning.
Feature Masking: In contrast to MAE, this approach does not involve image-level mask reconstruction; instead, it applies feature masking at each level of the feature pyramid. Specifically, a random mask vector $m \in \mathbb{R}^{N \times H}$ with the same resolution as the input features is first generated with a masking rate denoted by $\epsilon$, where $N$ and $H$ represent the width and height of the input features, respectively. Furthermore, we introduce a learnable token vector $v \in \mathbb{R}^{C}$ corresponding to each channel of the input features, where $C$ denotes the number of channels in the input feature maps. The multi-scale features are then masked according to the following equation:

$$ \hat{f}_{mask} = f \odot (1 - m(\epsilon)) + m(\epsilon) \times v \quad (14) $$

where $f$ and $\hat{f}_{mask}$ represent the input and mask-weighted features, respectively.
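A minimal sketch of the masking step in Equation (14): a random spatial mask drops positions and replaces them with a learnable per-channel token; the (N, C, H, W) layout is our assumption:

```python
import torch
import torch.nn as nn

class FeatureMasking(nn.Module):
    """Equation (14): randomly mask spatial positions of a feature map and
    fill them with a learnable per-channel token vector v."""
    def __init__(self, channels, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.token = nn.Parameter(torch.zeros(1, channels, 1, 1))  # learnable v

    def forward(self, f):
        # f: (N, C, H, W); m = 1 marks a masked position
        n, _, h, w = f.shape
        m = (torch.rand(n, 1, h, w, device=f.device) < self.mask_ratio).float()
        return f * (1.0 - m) + m * self.token
```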
Mask Decoding: After masking, the input features become sparse. To address this, we design a lightweight mask decoder inspired by the Fully Convolutional Mask Autoencoder (FCMAE) [56]. This decoder reconstructs multi-scale features from each masked feature layer in both the source and target domains. Finally, the adversarial learning paradigm is employed to enable the backbone network to learn domain-invariant features. Focal loss is adopted as the training objective, and the optimization goal of this module is formulated as follows:
$$ L_{MMAA} = \frac{\gamma_1}{N_S}\sum_{n=1}^{N_S} \mathrm{FL}\big(D(\hat{f}_{mask}^{S}), y_s\big) + \frac{\gamma_1}{N_T}\sum_{n=1}^{N_T} \mathrm{FL}\big(D(\hat{f}_{mask}^{T}), y_t\big) \quad (15) $$

where $\hat{f}_{mask}^{S}$ and $\hat{f}_{mask}^{T}$ denote the multi-scale reconstructed features from the source and target domains, respectively, and $y_s$ and $y_t$ are the domain labels used in the alignment process, set to 1 and 0, respectively. GFA dominates the early training phase, facilitating rapid alignment of image-level feature distributions between the source and target domains. During the self-training phase, to alleviate the issue of sub-optimal convergence, we introduce a confidence factor $\gamma_1$ so that the MMAA takes the dominant role, reconstructing the sparse features for domain alignment.
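A hedged sketch of the MMAA branch: a lightweight fully convolutional decoder (a stand-in for the FCMAE-style decoder in the text) reconstructs the masked features, and Equation (15) is approximated with torchvision's `sigmoid_focal_loss` applied to a per-pixel domain discriminator:

```python
import torch
import torch.nn as nn
from torchvision.ops import sigmoid_focal_loss

class MaskDecoder(nn.Module):
    """A lightweight fully convolutional decoder that reconstructs a masked
    feature map (an illustrative stand-in for the FCMAE-style decoder)."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, f_masked):
        return self.net(f_masked)

def mmaa_loss(disc, f_rec_src, f_rec_tgt, gamma1):
    """Equation (15): focal loss of the domain discriminator on reconstructed
    source (label 1) and target (label 0) features, weighted by gamma1."""
    logit_s = disc(f_rec_src)
    logit_t = disc(f_rec_tgt)
    loss_s = sigmoid_focal_loss(logit_s, torch.ones_like(logit_s), reduction="mean")
    loss_t = sigmoid_focal_loss(logit_t, torch.zeros_like(logit_t), reduction="mean")
    return gamma1 * (loss_s + loss_t)
```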

4.5. Loss and Optimization

Within the teacher–student framework, we jointly optimize the student model for all loss components in an end-to-end manner. The overall optimization objective of our model is therefore formulated as follows:
$$ L_{total} = L_{sup}^{S} + \lambda_1 L_{uns}^{T} + \lambda_2 L_{MMAA} + \lambda_3 L_{GFA} \quad (16) $$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are trade-off parameters. Specifically, we use labeled data from the source domain to train the backbone, FPN, and source prediction head of FCOS by minimizing the supervised loss $L_{sup}^{S}$ defined in Equation (2) of Section 3.2. $L_{uns}^{T}$ denotes the unsupervised loss, which is computed using pseudo-labels generated by the teacher model on the unlabeled target domain and is used to update the student model.

5. Experiments and Analysis

In this section, we conduct experiments on six common domain-shift scenarios typically encountered in real-world applications. These scenarios include transitions from sunny to foggy weather conditions, virtual-to-real highway environments, cross-camera outdoor settings with varying capture conditions, and three realistic-to-artistic translation tasks. For each scenario, we provide a detailed description of the corresponding datasets and highlight their distinguishing characteristics. We then outline the experimental setup, including specific hyperparameter configurations. To evaluate performance on domain-adaptive object detection, we define clear evaluation metrics. Following this, we benchmark our model against state-of-the-art methods under the same experimental conditions. Finally, we conduct ablation studies and visual analyses to validate the effectiveness of each component in our method.

5.1. Datasets

In this subsection, we provide an overview of the six scenarios.
Cityscapes→Foggy Cityscapes: Cityscapes [57] is a large-scale database focusing on semantic understanding of urban scenes. It contains 2975 training images and 500 validation images, annotated with eight categories: people, cars, trains, riders, trucks, motorcycles, bicycles, and buses. Foggy Cityscapes [58] is a synthetic version of Cityscapes generated by applying fog simulation and thus shares the same annotations as the original dataset. This experiment aims to evaluate the effectiveness of our method under weather-induced domain shifts. In this setting, we use the Cityscapes training set as the source domain and the Foggy Cityscapes training set as the target domain. The Foggy Cityscapes validation set is used as the test set, and performance is evaluated across all eight object categories.
Sim10k→Cityscapes: Sim10K [59] is a synthetic dataset rendered from the computer game Grand Theft Auto V (GTA V), which introduces a significant domain gap compared to real-world scenes such as Cityscapes. It consists of 10,000 images with 58,071 bounding box annotations of cars. This experiment evaluates the performance of our method in synthetic-to-real domain adaptation scenarios. We train the model using Sim10K training images as the source domain and Cityscapes training images as the target domain. Since Sim10K only includes annotations for the “car” category, we evaluate performance solely on this class using the Cityscapes test set.
KITTI→Cityscapes: KITTI [60] is a real-world dataset collected using various sensors, including 14,999 images annotated with classes such as person, car, and truck. Because KITTI and Cityscapes were captured with different vehicle-mounted camera setups and viewpoints, this pairing introduces a cross-camera domain shift. This task evaluates the robustness of our method under domain discrepancies caused by different imaging sources. We use KITTI training images as the source domain and Cityscapes training images as the target domain. Following [19], we focus on the “car” category and evaluate performance on the Cityscapes test set.
Pascal VOC→Artistic datasets: To further evaluate the generalization capability of our method, we conduct experiments on three challenging benchmarks: Pascal VOC [61] → Clipart, Watercolor, and Comic [62]. Pascal VOC is a large-scale real-world dataset containing 20 object categories. We use the 2012 version of Pascal VOC, which includes 13,276 training images. Clipart is a dataset of 1000 cartoon-style images with bounding box annotations aligned with the 20 Pascal VOC categories. Both Watercolor and Comic contain 1000 training and 1000 test images in artistic styles, sharing six common categories with Pascal VOC. These benchmarks provide more challenging domain shifts and allow evaluation in multi-class settings.

5.2. Implementation Details

Following [35], we adopt FCOS [20] as the base detector and evaluate its performance on the target domain test sets. The backbone of the model is a ResNet-50 network [63] pre-trained on ImageNet [64]. In all experiments, input images are resized such that the shorter side does not exceed 600 pixels and the longer side does not exceed 1300 pixels. Data augmentation follows the same strategy as in [17]: weak augmentation includes random horizontal flipping and padding, while strong augmentation incorporates random color jittering, grayscale conversion, padding, and Gaussian blur. Each experiment is conducted on a workstation equipped with a 2.8 GHz CPU, 64 GB RAM, and four Nvidia 2080Ti GPUs, with a batch size of four. Each training batch consists of two images from the source domain and two from the target domain. The learning rate is initialized to 0.01. The total number of training iterations is set to 40,000, with a warmup strategy applied during the first 1000 iterations. At iterations 32,000 and 38,000, the learning rate is decayed to 0.001 and 0.0001, respectively. Notably, we pre-train the model for 8000 iterations using only labeled source samples to initialize both the teacher and student models. Afterward, data from both the source and target domains are used jointly for the remaining training iterations. We employ stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001 as the optimizer. The weight smoothing parameter $\alpha$ of the EMA is set to 0.9996, and we further set $\beta = 2.0$, $\sigma = 0.1$, and $\zeta = 5.0$. The masking rate $\epsilon$ is assigned stepwise: 0.1 for large-scale feature maps and 0.5 for small-scale ones. For the other parameters, $\gamma$ and $\delta$ are both set to 1.0, and the coefficients $\lambda_1$, $\lambda_2$, and $\lambda_3$ in Equation (16) are set to 1.0, 1.0, and 0.5, respectively. Our implementation is based on PyTorch [65].
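The optimization schedule above can be expressed with standard PyTorch components; the following sketch folds the linear warmup and the two step decays into a single LambdaLR and is illustrative only:

```python
import torch

def build_optimizer_and_scheduler(model, base_lr=0.01, warmup_iters=1000,
                                  milestones=(32000, 38000)):
    """Optimization setup of Section 5.2: SGD (momentum 0.9, weight decay
    1e-4), linear warmup for the first 1000 iterations, and 10x learning-rate
    decays at 32k and 38k iterations (a simplified sketch)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)

    def lr_lambda(it):
        scale = 1.0
        if it < warmup_iters:                    # linear warmup
            scale *= (it + 1) / float(warmup_iters)
        for m in milestones:                     # step decay
            if it >= m:
                scale *= 0.1
        return scale

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```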
For evaluation, we adopt mean Average Precision (mAP) with a threshold of 0.5 as the primary performance metric for all approaches, along with the Average Precision (AP) for each class.

5.3. Results and Comparisons

In this section, we compare the performance of our model with that of current state-of-the-art methods on the six domain-shift scenarios described in the previous subsections. The experimental results demonstrate the effectiveness of our model. In the tables below, the superscript asterisk (*) indicates that the corresponding method was re-implemented using FCOS with a ResNet-50 backbone as the base detector, ensuring a fair comparison.
Cityscapes→Foggy Cityscapes. In this experiment, we evaluate the performance of various algorithms designed for domain adaptive object detection under the scenario involving weather variation-induced domain shifts. As described in Section 5.1 and Section 5.2, we use mAP$_{50}$ as the evaluation metric and conduct a category-wise comparison across eight object classes. The results on the Cityscapes→Foggy Cityscapes benchmark are summarized in Table 1. In addition to FCOS-based methods, we also list results from approaches based on Faster R-CNN, DETR, and RegionCLIP for reference. Our proposed model outperforms OADA and HT by 4.7% and 2.0% in mAP$_{50}$, respectively. Among all FCOS-based methods, our approach achieves the highest performance. Notably, our model even surpasses the Oracle baseline, demonstrating strong generalization ability. A class-wise analysis of AP reveals that our model achieves the best or near-best performance across most categories, except for train and bus. Given the limited number of train samples and the high visual similarity between train and bus instances, the teacher model often fails to provide reliable pseudo-labels for these two categories, which limits the learning capability of the student model.
Sim10k→Cityscapes. The adaptation results from Sim10k to Cityscapes are summarized in Table 2. In this experiment, the evaluation is based solely on the car class in the target domain test set. Our proposed model outperforms OADA and HT by margins of 1.2% and 0.2%, respectively, in terms of car AP$_{50}$. Among all FCOS-based approaches, our model achieves the best performance on this synthetic-to-real domain shift scenario.
KITTI→Cityscapes. The objective of this experiment is to evaluate the robustness of our model under cross-camera domain shifts. As shown in Table 3, our model achieves a performance of 50.9%, surpassing other FCOS-based cross-domain detectors. Compared to the performance gaps observed in the Cityscapes→Foggy Cityscapes and Sim10k→Cityscapes scenarios, FCOS-based methods exhibit a significantly larger discrepancy when compared to other detectors in the KITTI→Cityscapes scenario.
This can be attributed to the unique image characteristics of the KITTI dataset, where images have a resolution of 1242 × 375 and are typically narrow and elongated. Furthermore, as depicted in Figure 4, instances in the KITTI dataset are predominantly located in the lower half of the images, with some appearing near the boundaries and corners. Since FCOS involves resizing images to a fixed scale while preserving the aspect ratio, it often introduces substantial padding, leading to significant information loss for objects at the edges and corners. Consequently, anchor-free detectors like FCOS are more significantly impacted when dealing with images that have large aspect ratios and contain object instances located near the image boundaries or corners. In contrast, anchor-based detectors such as Faster R-CNN or RetinaNet can better handle such cases by leveraging multi-scale anchors that adapt to various object sizes and positions. Additionally, transformer-based models like DETR benefit from global attention mechanisms, enabling more robust detection even for marginally positioned objects.
Realistic→Artistic. The results of the three artistic domain adaptation scenarios are presented in Table 4, Table 5 and Table 6. On Cityscapes-based domain adaptation scenarios, MGCAMT and DA-Pro consistently achieve the best and second-best performance across all tasks, respectively. In contrast, for the artistic-domain-related tasks, no existing method demonstrates superior performance across all settings, which highlights the increased difficulty and diversity of these cross-domain adaptation challenges. Our method performs closely to DC on Clipart and Watercolor and is comparable to TFD on Comic. These results demonstrate that our approach achieves consistently stable performance across different artistic domains.

5.4. Ablation Study and Analysis

In this section, we begin with ablation studies and detailed analyses of the proposed method. We then visualize the detection results to provide a qualitative assessment of our model. All analyses are conducted on the Cityscapes→Foggy Cityscapes domain adaptation task.
In the ablation study, we evaluate the effects of the GFA, MMAA, and DSH components. The results are summarized in Table 7. The baseline refers to the detector trained using the teacher–student framework with GFA, achieving 41.3% in mAP$_{50}$. MMAA and DSH achieve better performance of 46.1% and 48.7%, respectively. When both modules are included, the mAP$_{50}$ increases to 50.1%. As illustrated in Figure 5, the integration of MMAA and DSH leads to a notable improvement in the quality of the pseudo-labels generated during the training process.
To more comprehensively demonstrate the superiority of our approach, we present two confusion matrices in Figure 6 to visualize the performance of the baseline and our method. These matrices evaluate the category-level similarity across the eight classes in the target domain, where each matrix element represents the degree of similarity between the class corresponding to the row and that corresponding to the column. As illustrated, our method significantly reduces the confusion among the truck, bus, and train categories, which are often misclassified in the baseline model. Overall, these results highlight that our method effectively reduces misdetections in the target domain, improves classification accuracy, and substantially enhances the efficiency of domain adaptation.
To better understand the mechanism of MMAA, we compare the results obtained under different mask settings, as presented in Table 8. Here, $F_i$ denotes the application of MMAA to the output of the $i$-th layer of the FPN. Applying MMAA at any single layer effectively improves model performance, with the most significant gain observed at layer $F_0$. Reconstructed features at larger scales help the model avoid sub-optimal solutions even under a high masking rate, thereby improving detection accuracy on the target domain data. However, such large-scale feature reconstruction leads to sparser representations, which can reduce the recall rate. In contrast, small-scale feature reconstruction effectively alleviates these drawbacks. This is because a lower masking rate at smaller scales allows for a larger portion of the original features to be preserved, resulting in a more stable recall rate after domain alignment. By integrating reconstructed features across multiple scales for adversarial domain alignment, the model achieves a significant improvement in recall while maintaining high precision. This enhancement effectively mitigates the sub-optimality issue commonly encountered in cross-domain detection models.
Figure 7 presents several representative detection results from three typical domain adaptation scenarios. As can be seen, our method achieves superior detection performance compared to both the source-only model and the baseline method, particularly for objects such as cars and trucks. Notably, in challenging cases involving occlusion or high visual similarity between object categories (e.g., the examples in the last row of the figure, where cars and buses are often confused), the baseline method frequently misclassifies buses as cars. In contrast, our method accurately detects and classifies all car instances. These results clearly demonstrate that the proposed method offers significant advantages in reducing object omission, suppressing false detections, improving bounding box prediction accuracy, and enhancing detection confidence.
In our method, we select the top $\sigma$ most confident pixels as the pseudo-labels $\hat{y}_t$. To demonstrate the effectiveness of this strategy, we compare the top-$\sigma$ selection approach with threshold-based methods under the same experimental settings as described in Section 5.2. The results are summarized in Table 9. The adaptive threshold strategy starts with an initial threshold of $\eta = 0.3$, which is increased to 0.4 at 20,000 iterations and further to 0.5 at 30,000 iterations. Compared to the fixed-threshold approach, the top-$\sigma$ method offers clear advantages in both performance and parameter tuning. A threshold that is too low may admit many unreliable pseudo-labels, while one that is too high may result in insufficient label coverage, thereby slowing down training. Although the adaptive threshold strategy improves performance significantly, it still underperforms the top-$\sigma$ strategy by 1.9%.

6. Conclusions

In this paper, we propose a domain-specific teacher–student framework that leverages two individual detection heads to exploit domain-specific features. This design enhances the teacher network’s capability in distinguishing positive and negative samples within the target domain. To address the issue of sub-optimal feature alignment, we introduce a multi-scale mask adversarial alignment module. This module employs a random masking strategy to sparsify the features, which are then reconstructed into novel feature maps by a decoder. Subsequently, pixel-level feature distributions are aligned using an adversarial domain alignment method. Additionally, a transition factor is incorporated into the domain alignment module to guide the model’s shift from global feature alignment to pixel-level alignment through dynamically adjusted loss weights. Our model achieves 50.1%, 60.%, and 50.9% mAP on the Cityscapes→Foggy Cityscapes, Sim10k→Cityscapes, and KITTI→Cityscapes benchmarks, respectively, outperforming other FCOS-based methods. On three more-challenging Realistic→Artistic scenarios, our method reaches performance levels comparable to mainstream Faster R-CNN-based approaches. These experiments across various cross-domain settings validate the effectiveness and robustness of the proposed method.
As a conservative pseudo-label selection strategy, the top-$\sigma$ approach tends to overlook a small fraction of target classes, thereby limiting the effectiveness of cross-domain object detection. In future work, we plan to explore the integration of a memory bank strategy to improve feature alignment, especially for categories with long-tailed distributions.

Author Contributions

Conceptualization and methodology, S.D.; software and validation, K.D.; writing—original draft preparation, S.D.; writing—review and editing, K.Z.; visualization, K.D.; supervision, K.Z.; project administration, K.Z.; funding acquisition, S.D. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by National Natural Science Foundation of China grant numbers 62002053 and 62271130, Natural Science Foundation of Guangdong Province grant number 2023A1515010066, Key Area Special Fund of Guangdong Provincial Department of Education grant number 2022ZDZX3042, and Social Public Welfare and Basic Research Project of Zhongshan City grant number 2024B2021.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773. [Google Scholar]
  2. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1180–1189. [Google Scholar]
  3. VS, V.; Oza, P.; Patel, V.M. Towards online domain adaptive object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 478–488. [Google Scholar]
  4. Chen, L.; Chen, H.; Wei, Z.; Jin, X.; Tan, X.; Jin, Y.; Chen, E. Reusing the task-specific classifier as a discriminator: Discriminator-free adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7181–7190. [Google Scholar]
  5. Chen, M.; Zheng, Z.; Yang, Y.; Chua, T.S. Pipa: Pixel-and patch-wise self-supervised learning for domain adaptative semantic segmentation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1905–1914. [Google Scholar]
  6. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003. [Google Scholar]
  7. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H. Grounding DINO: Marrying DINO with grounded pre-training for open-Set object detection. In Proceedings of the European Conference on Computer Vision, Paris, France, 26–27 March 2025; pp. 38–55. [Google Scholar]
  8. Guan, D.; Huang, J.; Xiao, A.; Lu, S.; Cao, Y. Uncertainty-aware unsupervised domain adaptation in object detection. IEEE Trans. Multimed. 2022, 24, 2502–2514. [Google Scholar] [CrossRef]
  9. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Gool, L.V. Domain adaptive faster R-CNN for object detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3339–3348. [Google Scholar]
  10. Vibashan, V.S.; Oza, P.; Sindagi, V.A.; Gupta, V.; Patel, V.M. MeGA-CDA: Memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4514–4524. [Google Scholar]
  11. Zhou, H.; Ge, Z.; Liu, S.; Mao, W.; Li, Z.; Yu, H.; Sun, J. Dense teacher: Dense pseudo-labels for semi-supervised object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 35–50. [Google Scholar]
  12. Xie, X.; Chen, J.; Li, Y.; Shen, L.; Ma, K.; Zheng, Y. Self-supervised CycleGAN for object-preserving image-to-image domain adaptation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 498–513. [Google Scholar]
  13. Zakharov, S.; Kehl, W.; Ilic, S. Deceptionnet: Network-driven domain randomization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 532–541. [Google Scholar]
  14. Deng, J.; Li, W.; Chen, Y.; Duan, L. Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4089–4099. [Google Scholar]
  15. Li, W.; Liu, X.; Yuan, Y. Sigma: Semantic-complete graph matching for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5291–5300. [Google Scholar]
  16. Zhao, Z.; Wei, S.; Chen, Q.; Li, D.; Yang, Y.; Peng, Y.; Liu, Y. Masked retraining teacher-student framework for domain adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 19039–19049. [Google Scholar] [CrossRef]
  17. Li, Y.J.; Dai, X.; Ma, C.Y.; Liu, Y.C.; Chen, K.; Wu, B.; He, Z.; Kitani, K.; Vajda, P. Cross-domain adaptive teacher for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7581–7590. [Google Scholar]
  18. Cao, S.; Joshi, D.; Gui, L.Y.; Wang, Y.X. Contrastive mean teacher for domain adaptive object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23839–23848. [Google Scholar]
  19. Deng, J.; Xu, D.; Li, W.; Duan, L. Harmonious teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23829–23838. [Google Scholar]
  20. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef]
  21. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multi-box detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  22. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar] [CrossRef]
  23. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  24. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  25. Girshick, R. Fast R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  28. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2021, arXiv:2010.04159. [Google Scholar]
  29. Zong, Z.; Song, G.; Liu, Y. DETRs with collaborative hybrid assignments training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6748–6758. [Google Scholar]
  30. Goodfellow, I.; Abadie, J.P.; Mirza, M.; Xu, B.; Farley, D.W.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. arXiv 2014, arXiv:1406.2661v1. [Google Scholar]
  31. Xu, M.; Wang, H.; Ni, B.; Tian, Q.; Zhang, W. Cross-domain detection via graph-induced prototype alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12355–12364. [Google Scholar]
  32. Li, C.; Du, D.; Zhang, L.; Wen, L.; Luo, T.; Wu, Y.; Zhu, P. Spatial attention pyramid network for unsupervised domain Adaptation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 481–497. [Google Scholar]
  33. Kim, T.; Jeong, M.; Kim, S.; Choi, S.; Kim, C. Diversify and match: A domain adaptive representation learning paradigm for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12448–12475. [Google Scholar]
  34. Hsu, H.K.; Yao, C.H.; Tsai, Y.H.; Hung, W.C.; Tseng, H.Y.; Singh, M.; Yang, M.H. Progressive Domain Adaptation for Object Detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 738–746. [Google Scholar]
  35. Hsu, C.C.; Tsai, Y.H.; Lin, Y.Y.; Yang, M.H. Every pixel matters: Center-aware feature alignment for domain adaptive object detector. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 733–748. [Google Scholar]
  36. Su, P.; Wang, K.; Zeng, X.; Tang, S.; Wang, X. Adapting object detectors with conditional domain normalization. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 403–419. [Google Scholar]
  37. Xu, C.D.; Zhao, X.R.; Jin, X.; Wei, X.S. Exploring categorical regularization for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11721–11730. [Google Scholar]
  38. Zhao, L.; Wang, L. Task-specific inconsistency alignment for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14197–14206. [Google Scholar] [CrossRef]
  39. Jiang, J.; Chen, B.; Wang, J.; Long, M. Decoupled adaptation for cross-domain object detection. arXiv 2021, arXiv:2110.02578. [Google Scholar]
  40. Zhou, W.; Du, D.; Zhang, L.; Luo, T.; Wu, Y. Multi-granularity alignment domain adaptation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9571–9580. [Google Scholar]
  41. Zhang, S.; Tuo, H.; Hu, J.; Jing, Z. Domain adaptive Yolo for one-stage cross-domain detection. In Proceedings of the Asian Conference on Machine Learning, Online, 17–19 November 2021; pp. 785–797. [Google Scholar]
  42. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  43. He, L.; Wang, W.; Chen, A.; Sun, M.; Kuo, C.H.; Todorovic, S. Bidirectional alignment for domain adaptive detection with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 18729–18739. [Google Scholar] [CrossRef]
  44. Weng, W.; Yuan, C. Mean teacher DETR with masked feature alignment: A robust domain adaptive detection transformer framework. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5912–5920. [Google Scholar] [CrossRef]
  45. Li, H.; Zhang, R.; Yao, H.; Song, X.; Hao, Y.; Zhao, Y.; Li, L.; Chen, Y. Learning domain-aware detection head with prompt tuning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 4248–4262. [Google Scholar]
  46. Chen, Z.; Cheng, J.; Xia, Z.; Hu, Y.; Li, X.; Dong, Z.; Tashi, N. Focusing on feature-level domain alignment with text semantic for weakly-supervised domain adaptive object detection. Neurocomputing 2025, 622, 129435. [Google Scholar] [CrossRef]
  47. Feng, Y.; Liu, Y.; Yang, S.; Cai, W.; Zhang, J.; Zhan, Q.; Huang, Z.; Yan, H.; Wan, Q.; Liu, C. Vision-language model for object detection and segmentation: A review and evaluation. arXiv 2025, arXiv:2504.09480. [Google Scholar]
  48. Chen, J.; Liu, L.; Deng, W.; Liu, Z.; Liu, Y.; Wei, Y.; Liu, Y. Refining pseudo labeling via multi-granularity confidence alignment for unsupervised cross domain object detection. IEEE Trans. Image Process. 2025, 34, 279–294. [Google Scholar] [CrossRef]
  49. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv 2017, arXiv:1703.01780. [Google Scholar]
  50. Chen, M.; Chen, W.; Yang, S.; Song, J.; Wang, X.; Zhang, L.; Yan, Y.; Qi, D.; Zhuang, Y.; Xie, D.; et al. Learning domain adaptive object detection with probabilistic teacher. arXiv 2022, arXiv:2206.06293. [Google Scholar]
  51. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  52. Zhang, Y.F.; Wang, J.; Liang, J.; Zhang, Z.; Yu, B.; Wang, L.; Tao, D.; Xie, X. Domain-specific risk minimization for domain generalization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 3409–3421. [Google Scholar]
  53. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  54. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. arXiv 2020, arXiv:2006.04388. [Google Scholar]
  55. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  56. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt v2: Co-designing and scaling Convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  57. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  58. Sakaridis, C.; Dai, D.; Van Gool, L. Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vis. 2018, 126, 973–992. [Google Scholar] [CrossRef]
  59. Johnson-Roberson, M.; Barto, C.; Mehta, R.; Sridhar, S.N.; Rosaen, K.; Vasudevan, R. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? arXiv 2016, arXiv:1610.01983. [Google Scholar]
  60. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The Kitti vision benchmark suite. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  61. Everingham, M.; Van Gool, L.; Williams, C.; Winn, J.; Zisserman, A. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  62. Inoue, N.; Furuta, R.; Yamasaki, T.; Aizawa, K. Cross-domain weakly-supervised object detection through progressive domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5001–5009. [Google Scholar]
  63. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  64. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  65. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. 2017. Available online: https://openreview.net/pdf?id=BJJsrmfCZ (accessed on 23 May 2025).
  66. Liu, D.; Zhang, C.; Song, Y.; Huang, H.; Wang, C.; Barnett, M.; Cai, W. Decompose to adapt: Cross-domain object detection via feature disentanglement. IEEE Trans. Multimed. 2022, 25, 1333–1344. [Google Scholar] [CrossRef]
  67. Yoo, J.; Chung, I.; Kwak, N. Unsupervised domain adaptation for one-stage object detector using offsets to bounding box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 691–708. [Google Scholar]
  68. Liu, Y.; Wang, J.; Huang, C.; Wang, Y.; Xu, Y. CIGAR: Cross-modality graph reasoning for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 23776–23786. [Google Scholar] [CrossRef]
  69. Liu, F.; Zhang, X.; Wan, F.; Ji, X.; Ye, Q. Domain contrast for domain adaptive object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8227–8237. [Google Scholar] [CrossRef]
  70. Wang, H.; Jia, S.; Zeng, T.; Zhang, G.; Li, Z. Triple feature disentanglement for one-stage adaptive object detection. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5401–5409. [Google Scholar] [CrossRef]
  71. Xu, S.; Zhang, H.; Xu, X.; Hu, X.; Xu, Y.; Dai, L.; Choi, K.S.; Heng, P.A. Representative feature alignment for adaptive object detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 689–700. [Google Scholar] [CrossRef]
Figure 1. Traditional adaptive teacher–student frameworks often suffer from suboptimal feature discrimination due to the shared detection head, which applies a shared decision boundary for both source and target domains. In contrast, our method introduces two independent detection heads tailored to each domain, enabling more accurate feature separation and reducing the interference caused by negative samples during adaptation.
Figure 3. A comparison of the detection head structure in the conventional adaptive teacher–student framework (left) and the domain-specific teacher–student framework proposed in this paper (right). In the previous framework, source-domain and target-domain samples are encoded and fed into a weight-sharing detection head for supervised loss optimization. In contrast, our framework splits the detection head into two weight-independent heads that decode source-domain and target-domain features separately.
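To make the routing on the right of Figure 3 concrete, the following PyTorch-style sketch pairs shared multi-scale features with two weight-independent detection heads. It is a minimal illustration rather than the implementation used in this paper: the toy head, class names, and channel sizes are assumptions, and a real FCOS head would additionally predict centerness and use shared convolution towers.

```python
import torch
import torch.nn as nn


class SimpleDenseHead(nn.Module):
    """Toy stand-in for an FCOS-style dense head (illustration only)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv2d(in_channels, 4, kernel_size=3, padding=1)

    def forward(self, fpn_feats):
        # One classification map and one box-regression map per pyramid level.
        return [(self.cls(f), self.reg(f)) for f in fpn_feats]


class DomainSpecificHeads(nn.Module):
    """Route shared multi-scale features to a per-domain detection head."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Two weight-independent heads: one decodes source-domain features,
        # the other decodes target-domain features.
        self.source_head = SimpleDenseHead(in_channels, num_classes)
        self.target_head = SimpleDenseHead(in_channels, num_classes)

    def forward(self, fpn_feats, domain: str):
        head = self.source_head if domain == "source" else self.target_head
        return head(fpn_feats)


# Toy usage: two pyramid levels of 256-channel features from a shared encoder.
feats = [torch.randn(2, 256, 64, 64), torch.randn(2, 256, 32, 32)]
heads = DomainSpecificHeads(in_channels=256, num_classes=8)
src_out = heads(feats, domain="source")  # optimized with source ground truth
tgt_out = heads(feats, domain="target")  # optimized with target pseudo-labels
```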
Figure 4. Instance position distribution heatmaps: Cityscapes vs. KITTI. More instances in KITTI are near the boundaries and corners.
Figure 5. The accuracy of the pseudo-labels produced during training is a key indicator of model performance.
Figure 6. Confusion matrix. The values in the matrix indicate the degree of similarity between the categories on the horizontal axis and the categories on the vertical axis.
Figure 7. Visual results on three domain adaptation scenarios. From top to bottom: (a) Cityscapes→Foggy Cityscapes, (b) Sim10k→Cityscapes, (c) KITTI→Cityscapes.
Table 1. Quantitative adaptation results from Cityscapes to Foggy Cityscapes. Our approach achieves the highest performance among all FCOS-based methods.
Cityscapes→Foggy Cityscapes
Method | Venue | Detector | Person | Rider | Car | Truck | Bus | Train | Motor | Bicycle | mAP50
DDF [66] | IEEE TMM 2022 | Faster R-CNN | 37.6 | 45.5 | 56.1 | 30.7 | 50.4 | 47.0 | 31.1 | 39.8 | 42.3
PT [50] | ICML 2022 | Faster R-CNN | 43.2 | 52.4 | 63.4 | 33.4 | 56.6 | 37.8 | 41.3 | 48.7 | 47.1
CMT [18] | CVPR 2023 | Faster R-CNN | 45.9 | 55.7 | 63.7 | 39.6 | 66.0 | 38.8 | 41.4 | 51.2 | 50.3
BiADT [43] | ICCV 2023 | DETR | 52.2 | 58.9 | 69.2 | 31.7 | 55.0 | 45.1 | 42.6 | 51.3 | 50.8
MRT [16] | ICCV 2023 | DETR | 52.8 | 51.7 | 68.7 | 35.9 | 58.1 | 54.5 | 41.0 | 47.1 | 51.2
MTM [44] | AAAI 2024 | DETR | 51.0 | 53.4 | 67.2 | 37.2 | 54.4 | 41.6 | 38.4 | 47.7 | 48.9
DA-Pro [45] | NeurIPS 2023 | RegionCLIP | 55.4 | 62.9 | 70.9 | 40.3 | 63.4 | 54.0 | 42.3 | 58.0 | 55.9
MGCAMT [48] | IEEE TIP 2025 | RetinaNet | 60.2 | 66.6 | 76.5 | 33.2 | 60.1 | 43.2 | 49.8 | 57.9 | 55.9
Source only | - | FCOS | 44.0 | 46.5 | 51.6 | 25.6 | 35.0 | 19.0 | 29.6 | 45.9 | 37.1
Baseline | - | FCOS | 47.7 | 50.4 | 63.5 | 22.4 | 38.8 | 28.3 | 30.7 | 48.7 | 41.3
EPM * [35] | ECCV 2020 | FCOS | 39.9 | 38.1 | 57.3 | 28.7 | 50.7 | 37.2 | 30.2 | 34.2 | 39.5
OADA [67] | ECCV 2022 | FCOS | 47.8 | 46.5 | 62.9 | 32.1 | 48.5 | 50.9 | 34.3 | 39.8 | 45.4
CIGAR [68] | CVPR 2023 | FCOS | 46.1 | 47.3 | 62.1 | 27.8 | 56.6 | 44.3 | 33.7 | 41.3 | 44.9
HT * [19] | CVPR 2023 | FCOS | 50.3 | 54.2 | 65.6 | 32.0 | 51.5 | 38.8 | 39.4 | 50.6 | 48.1
Ours | - | FCOS | 51.4 | 54.3 | 68.4 | 37.8 | 54.6 | 40.2 | 42.4 | 51.8 | 50.1
Oracle | - | FCOS | 52.5 | 52.1 | 70.3 | 30.9 | 50.2 | 36.1 | 41.6 | 51.8 | 48.2
Table 2. Quantitative adaptation results from Sim10k to Cityscapes. Our approach achieves the highest performance among all FCOS-based methods.
Sim10k→Cityscapes
Method | Venue | Detector | Car AP50
DDF [66] | IEEE TMM 2022 | Faster R-CNN | 44.3
PT [50] | ICML 2022 | Faster R-CNN | 55.1
BiADT [43] | ICCV 2023 | DETR | 56.6
MRT [16] | ICCV 2023 | DETR | 62.0
MTM [44] | AAAI 2024 | DETR | 58.1
DA-Pro [45] | NeurIPS 2023 | RegionCLIP | 62.9
MGCAMT [48] | IEEE TIP 2025 | RetinaNet | 67.5
Source only | - | FCOS | 44.8
Baseline | - | FCOS | 55.0
EPM * [35] | ECCV 2020 | FCOS | 51.2
OADA [67] | ECCV 2022 | FCOS | 59.2
CIGAR [68] | CVPR 2023 | FCOS | 58.5
HT * [19] | CVPR 2023 | FCOS | 60.2
Ours | - | FCOS | 60.4
Oracle | - | FCOS | 74.9
Table 3. Quantitative adaptation results from KITTI to Cityscapes. Our approach achieves the highest performance among all FCOS-based methods.
KITTI→Cityscapes
Method | Venue | Detector | Car AP50
DDF [66] | IEEE TMM 2022 | Faster R-CNN | 46.0
PT [50] | ICML 2022 | Faster R-CNN | 60.2
DA-DETR [44] | CVPR 2023 | DETR | 58.1
DA-Pro [45] | NeurIPS 2023 | RegionCLIP | 61.4
MGCAMT [48] | IEEE TIP 2025 | RetinaNet | 62.2
Source only | - | FCOS | 42.7
Baseline | - | FCOS | 46.5
EPM * [35] | ECCV 2020 | FCOS | 45.0
OADA [67] | ECCV 2022 | FCOS | 47.8
CIGAR [68] | CVPR 2023 | FCOS | 48.5
HT * [19] | CVPR 2023 | FCOS | 50.3
Ours | - | FCOS | 50.9
Oracle | - | FCOS | 74.9
Table 4. Quantitative adaptation results from Pascal VOC to Clipart. The performance of our method is on par with that of DC. Due to space constraints, the “Venue” and “Detector” columns are omitted; the corresponding information is provided in Table 5 and Table 6.
Method | Aero | Bicycle | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Hrs | Bike | Prsn | Plnt | Sheep | Sofa | Train | Tv | mAP
DC [69] | 47.1 | 53.2 | 38.8 | 37.0 | 46.6 | 45.9 | 52.6 | 14.5 | 39.1 | 48.4 | 31.7 | 23.7 | 34.9 | 87.0 | 67.8 | 54.0 | 22.8 | 23.8 | 44.9 | 51.0 | 43.2
CMT [18] | 31.9 | 66.9 | 33.9 | 30.2 | 26.3 | 65.2 | 43.6 | 12.6 | 44.5 | 46.3 | 47.9 | 19.3 | 29.9 | 53.1 | 63.3 | 40.5 | 17.1 | 41.2 | 49.6 | 43.9 | 40.4
TFD [70] | 27.9 | 64.8 | 28.4 | 29.5 | 25.7 | 64.2 | 47.7 | 13.5 | 47.5 | 50.9 | 50.8 | 21.3 | 33.9 | 60.2 | 65.6 | 42.5 | 15.1 | 40.5 | 45.5 | 48.6 | 41.2
DA-Pro [45] | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 46.9
Source only | 27.3 | 60.4 | 17.5 | 16.0 | 14.5 | 43.7 | 32.0 | 10.2 | 38.6 | 15.3 | 24.5 | 16.0 | 18.4 | 49.5 | 30.7 | 30.0 | 2.3 | 23.0 | 35.1 | 29.9 | 26.7
Baseline | 30.8 | 65.5 | 18.7 | 23.0 | 24.9 | 57.5 | 40.2 | 10.9 | 38.0 | 25.9 | 36.0 | 15.6 | 22.6 | 66.8 | 52.1 | 35.3 | 1.0 | 34.6 | 38.1 | 39.4 | 33.8
Ours | 47.0 | 53.4 | 39.8 | 37.6 | 47.0 | 45.8 | 52.5 | 14.0 | 38.0 | 48.5 | 32.5 | 24.1 | 35.8 | 86.1 | 68.7 | 54.0 | 24.6 | 23.4 | 45.6 | 50.3 | 43.5
Oracle | 30.1 | 51.4 | 47.2 | 42.5 | 30.7 | 55.7 | 59.4 | 25.1 | 47.4 | 52.5 | 37.8 | 43.3 | 42.6 | 61.6 | 73.3 | 41.9 | 44.3 | 25.5 | 59.0 | 51.3 | 46.1
Table 5. Quantitative adaptation results from Pascal VOC to Watercolor. The performance of our method is on par with that of DC.
Method | Venue | Detector | Bike | Bird | Car | Cat | Dog | Person | mAP
DC [69] | IEEE TCSVT 2022 | Faster R-CNN | 76.7 | 53.2 | 45.3 | 41.6 | 35.5 | 70.0 | 53.7
CMT [18] | CVPR 2023 | Faster R-CNN | 87.1 | 48.7 | 50.2 | 37.1 | 31.5 | 66.3 | 53.5
TFD [70] | AAAI 2024 | Faster R-CNN | 93.0 | 52.6 | 47.6 | 39.2 | 33.7 | 63.9 | 55.0
RFA [71] | IEEE TCSVT 2023 | Faster R-CNN | 97.1 | 55.3 | 53.8 | 48.7 | 40.9 | 67.2 | 60.5
DA-Pro [45] | NeurIPS 2023 | RegionCLIP | - | - | - | - | - | - | 58.1
Source only | - | FCOS | 67.8 | 45.8 | 36.2 | 31.7 | 20.3 | 59.7 | 43.6
Baseline | - | FCOS | 77.5 | 46.1 | 44.6 | 30.0 | 26.0 | 58.6 | 47.1
Ours | - | FCOS | 54.6 | 55.5 | 31.8 | 35.2 | 69.1 | 76.2 | 53.8
Oracle | - | FCOS | 83.5 | 55.7 | 44.9 | 50.3 | 51.4 | 74.5 | 60.1
Table 6. Quantitative adaptation results from Pascal VOC to Comic. The performance of our method is on par with that of TFD.
Method | Venue | Detector | Bike | Bird | Car | Cat | Dog | Person | mAP
DC [69] | IEEE TCSVT 2022 | Faster R-CNN | 51.9 | 23.9 | 36.7 | 27.1 | 31.5 | 61.0 | 38.7
CMT [18] | CVPR 2023 | Faster R-CNN | 49.8 | 19.2 | 29.8 | 15.2 | 29.1 | 54.1 | 32.9
TFD [70] | AAAI 2024 | Faster R-CNN | 53.4 | 19.2 | 35.0 | 16.1 | 33.2 | 49.2 | 34.4
RFA [71] | IEEE TCSVT 2023 | Faster R-CNN | 46.6 | 24.2 | 33.3 | 21.7 | 29.0 | 61.2 | 36.0
DA-Pro [45] | NeurIPS 2023 | RegionCLIP | - | - | - | - | - | - | 44.6
Source only | - | FCOS | 43.3 | 9.4 | 23.6 | 9.8 | 10.9 | 34.2 | 21.9
Baseline | - | FCOS | 50.5 | 15.9 | 27.2 | 11.4 | 18.4 | 46.1 | 28.3
Ours | - | FCOS | 52.2 | 21.5 | 32.6 | 15.9 | 30.8 | 51.6 | 34.1
Oracle | - | FCOS | 38.3 | 30.8 | 34.9 | 51.8 | 47.5 | 72.8 | 46.0
Table 7. Ablation studies of our method on Cityscapes→Foggy Cityscapes using FCOS with a ResNet-50 backbone. GFA alone achieves 41.3% mAP50; adding MMAA and DSH improves performance by 4.8% and 7.4%, respectively. Combining all components yields a total improvement of 8.8%, demonstrating their complementary benefits.
Method | GFA | MMAA | DSH | mAP50
Baseline | ✓ | - | - | 41.3
- | ✓ | ✓ | - | 46.1 (4.8 ↑)
- | ✓ | - | ✓ | 48.7 (7.4 ↑)
- | ✓ | ✓ | ✓ | 50.1 (8.8 ↑)
Table 8. Ablation study of the proposed multi-scale mask adversarial alignment module. We report multiple metrics for masking applied to features at different scales.
Mask Scale | mAP | mAP50 | mAP75 | mAR (small) | mAR (medium) | mAR (large)
without MMAA | 20.5 | 41.3 | 19.5 | 4.1 | 29.8 | 60.3
F0 (ε = 0.5) | 27.3 | 45.9 | 26.4 | 8.4 | 37.4 | 67.2
F1 (ε = 0.4) | 27.6 | 45.8 | 27.2 | 7.9 | 36.3 | 66.5
F2 (ε = 0.3) | 27.0 | 45.3 | 25.6 | 8.4 | 37.9 | 65.9
F3 (ε = 0.2) | 27.0 | 44.8 | 26.2 | 9.7 | 37.0 | 66.3
F4 (ε = 0.1) | 26.9 | 44.4 | 26.6 | 8.2 | 36.8 | 66.5
F0–F4 | 27.1 | 46.1 | 26.3 | 10.2 | 38.9 | 67.4
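As a reading aid for the mask ratios ε in Table 8, the sketch below applies patch-wise random masking to multi-scale feature maps, one ratio per pyramid level F0–F4. It illustrates only the masking step that precedes reconstruction in MMAA; the function name, patch size, and tensor shapes are assumptions, and the reconstruction and adversarial alignment branches are omitted.

```python
import torch
import torch.nn.functional as F


def random_mask_feature(feat: torch.Tensor, mask_ratio: float, patch: int = 8) -> torch.Tensor:
    """Zero out a random fraction of patch-sized blocks in a feature map.

    feat: (N, C, H, W) feature map from one pyramid level.
    mask_ratio: fraction of patches to drop (the epsilon of Table 8).
    """
    n, _, h, w = feat.shape
    gh, gw = max(h // patch, 1), max(w // patch, 1)
    # One keep/drop decision per patch, shared across all channels.
    keep = (torch.rand(n, 1, gh, gw, device=feat.device) >= mask_ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")
    return feat * keep


# Toy usage with the per-level ratios reported in Table 8.
ratios = {"F0": 0.5, "F1": 0.4, "F2": 0.3, "F3": 0.2, "F4": 0.1}
fpn_feats = {name: torch.randn(2, 256, 64 // (2 ** i), 64 // (2 ** i))
             for i, name in enumerate(ratios)}
masked = {name: random_mask_feature(f, ratios[name]) for name, f in fpn_feats.items()}
```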
Table 9. Performance comparison between the top σ strategy and the threshold strategy.
Selection Strategy | Parameter | mAP50
threshold | η = 0.1 | 44.4
threshold | η = 0.3 | 45.7
threshold | η = 0.5 | 45.8
threshold | η = 0.7 | 45.1
threshold | η = 0.9 | 43.9
threshold | adaptive η | 48.2
top σ | σ = 0.05 | 49.1
top σ | σ = 0.1 | 50.1
top σ | σ = 0.2 | 49.4
top σ | σ = 0.3 | 47.6
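To make the two selection rules in Table 9 concrete, the sketch below contrasts filtering pseudo-labels with a fixed confidence threshold η against keeping the top σ fraction of predictions ranked by confidence. This is an illustrative reading of the table, not the authors' code; the function names and the flat score tensor are assumptions.

```python
import torch


def select_by_threshold(scores: torch.Tensor, eta: float) -> torch.Tensor:
    """Keep predictions whose confidence exceeds a fixed threshold eta."""
    return torch.nonzero(scores > eta, as_tuple=False).flatten()


def select_top_sigma(scores: torch.Tensor, sigma: float) -> torch.Tensor:
    """Keep the top sigma fraction of predictions, ranked by confidence."""
    k = max(int(sigma * scores.numel()), 1)
    return torch.topk(scores, k).indices


# Toy usage: confidence scores of candidate detections in one target image.
scores = torch.rand(1000)
kept_fixed = select_by_threshold(scores, eta=0.5)
kept_top = select_top_sigma(scores, sigma=0.1)  # the best setting in Table 9
```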