1. Introduction
Object detection is a core technique in computer vision that aims to recognize and localize objects of interest in images or videos, helping users extract key information from complex visual environments. Traditional object detection algorithms rely on hand-crafted features and suffer from drawbacks such as high computational costs, poor robustness, and low accuracy. In recent years, deep learning has gained popularity [1], as it can learn more robust and generalizable deep feature representations. In particular, object detection has made impressive progress thanks to convolutional neural networks (CNNs) and has become an essential part of many real-world applications. The three mainstream detectors are Faster RCNN [2], You Only Look Once (YOLO) [3], and the Single Shot Multi-box Detector (SSD) [4]. However, the superior performance of advanced detectors relies on the availability of a large amount of high-quality annotated data, which is time-consuming and labor-intensive to acquire. Additionally, well-trained detectors may experience a sudden drop in performance due to the domain shift problem when processing new data or tasks.
The aforementioned problems seriously affect the application and deployment of detection models when data distributions differ. Unsupervised Domain Adaptation (UDA) [5,6,7] has been proposed to address the domain shift problem and reduce the dependence of model training on target labels. It aims to transfer source knowledge to the target and reduce the distribution discrepancy between domains, which enhances the model's generalization ability and discriminative capability. Deep UDA methods can be divided into two categories: moment matching methods [8,9,10], which explicitly match the feature distributions across domains during network training based on pre-defined metrics, and adversarial learning methods [11,12,13], which implicitly learn domain-invariant representations in an adversarial paradigm.
Unlike classification and semantic segmentation, the object detection task predicts both bounding box locations and the corresponding object categories [14]. This brings potential problems and challenges for cross-domain object detection, which has likewise attracted considerable attention. Recent studies have made significant efforts to improve cross-domain detection capabilities. Following Domain Adaptive Faster RCNN [15], the first attempt at cross-domain object detection, most existing DAOD methods are still built on the two-stage detector Faster RCNN [16,17,18,19,20,21,22,23]. A few works have adopted one-stage detectors (e.g., SSD [4] and FCOS [24]) for computational efficiency [25,26,27,28], and some methods based on the YOLO series have been proposed for lightweight and practical use [29,30,31,32]. Overall, the development of DAOD methods is closely related to the key technologies in the fields of object detection and domain adaptation.
Since the scene layout, number of objects, patterns between objects, and background may differ considerably across domains in object detection tasks, a potential problem in DAOD is that blindly and directly adapting feature distributions can lead to negative transfer, which degrades the model's cross-domain performance. This is also why strategies such as prototype alignment and entropy regularization, which work well in image classification or segmentation, become inapplicable in object detection. In addition, some methods utilize image generation techniques to introduce auxiliary data [22,31], or adopt a student–teacher network paradigm [32,33] for training. The former prevents end-to-end training, while the latter significantly increases model complexity and training difficulty; both greatly restrict the training and application scenarios of detectors. Finally, the efficiency and accuracy of the dominant Faster RCNN lag behind more recent detectors, making it unsuitable for resource-limited and time-critical real-world applications [32].
For the above reasons, we propose a novel Hierarchical Multi-scale Domain Adaptive method, HMDA-YOLO, built on the simple but effective one-stage YOLOv5 [34] detector. HMDA-YOLO is easy to implement and achieves competitive performance; it consists of hierarchical backbone adaptation and multi-scale head adaptation. Considering the differences in the representation information of features at various depths, we design the hierarchical backbone adaptation strategy, which promotes comprehensive distribution alignment and suppresses negative transfer. Specifically, we adopt pixel-level adaptation, image-level adaptation, and weighted image-level adaptation for the adversarial training of shallow-level, middle-level, and deep-level feature maps, respectively. To make full use of the rich discriminative information of the feature maps to be detected, we further design the multi-scale head adaptation strategy, which performs pixel-level adaptation at each detection scale to reduce local instance discrepancy and the impact of background noise. Note that in this paper, "pixel-level" refers to each location of the corresponding feature map, while "image-level" treats the entire feature map as a whole. The proposed HMDA-YOLO significantly improves the model's cross-domain capability. We empirically verify HMDA-YOLO on four cross-domain object detection scenarios comprising different domain shifts. Experimental results and analyses demonstrate that HMDA-YOLO achieves competitive performance with high detection efficiency.
The contributions of this paper can be summarized as follows: (1) we propose a simple but effective DAOD method, HMDA-YOLO, for more accurate and efficient cross-domain detection; (2) we design a hierarchical adaptation strategy for the backbone network and a multi-scale adaptation strategy for the head network to simultaneously ensure the model's generalization and discriminative capability; (3) HMDA-YOLO achieves competitive performance on several cross-domain object detection benchmarks compared with state-of-the-art DAOD methods.
The remainder of this paper is organized as follows: Section 2 reviews the techniques related to DAOD; Section 3 explores the technical details of HMDA-YOLO; Section 4 presents the experimental results and analysis; finally, Section 5 provides a summary.
3. Method
In this section, we introduce the technical details of the proposed HMDA-YOLO. An overview of HMDA-YOLO based on the YOLOv5 architecture is shown in Figure 1. HMDA-YOLO hierarchically extracts deep features of the image and performs depth-specific adaptation by embedding multiple domain discriminators, which are distributed in the backbone network and the head network of YOLOv5. More details are described below.
3.1. Preliminary Knowledge
We first briefly introduce the YOLOv5 framework. YOLOv5 is a simple but effective one-stage detector that has been widely used in real-world applications. It mainly contains three parts, as shown in the top part of Figure 1. (1) The backbone network is responsible for deep feature extraction; CSPDarknet53 [37] is the most commonly used backbone, composed of convolutional modules, C3 modules, and the Spatial Pyramid Pooling (SPP) module [41]. (2) The neck network further processes the extracted features and performs feature fusion to enhance the representational capability; it achieves top-down and bottom-up feature fusion through the Feature Pyramid Network (FPN) [42] and the Path Aggregation Network (PAN) [43]. (3) The head network implements multi-scale detection of small, medium, and large objects.
In cross-domain object detection tasks, there are two domains: a labeled source domain with $N_s$ images, denoted as $\mathcal{D}_s = \{(x^s_i, b^s_i, c^s_i)\}_{i=1}^{N_s}$, where $x^s_i$ denotes the source image, $b^s_i$ the bounding box coordinates, and $c^s_i$ the object category, and an unlabeled target domain with $N_t$ images, denoted as $\mathcal{D}_t = \{x^t_j\}_{j=1}^{N_t}$. The label spaces of the source and target domains are identical, but their data distributions are different. The goal of domain adaptive object detection is to train a detector that reduces the domain shift and learns transferable representations, so that the model generalizes well to the target objects.
3.2. Hierarchical Multi-Scale Adaptation
Adversarial DAOD methods usually learn domain-invariant feature representations with the help of a domain discriminator (i.e., a domain classifier). Assume that the framework is composed of a feature extractor $F$ and a domain discriminator $D$. The feature extractor $F$ tries to confuse the domain discriminator $D$, i.e., to maximize the domain classification loss. Conversely, the domain discriminator $D$ is trained to distinguish source samples from target samples, i.e., to minimize the domain classification loss. The feature extractor $F$ also learns to minimize the source supervision loss. The adversarial loss can be written as follows:

$$\min_{D}\max_{F}\;\; \mathcal{L}_{adv} = \mathbb{E}_{x^s \sim \mathcal{D}_s}\big[\ell_d\big(D(F(x^s)),\,0\big)\big] + \mathbb{E}_{x^t \sim \mathcal{D}_t}\big[\ell_d\big(D(F(x^t)),\,1\big)\big],$$

where $\ell_d$ is the domain classification criterion, and the two expectation terms denote the expected domain classification error over the source and target domain, respectively. By utilizing the Gradient Reverse Layer (GRL) [11], the min–max adversarial training can be unified in one back-propagation pass.
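To make the GRL concrete, below is a minimal PyTorch sketch in the style of [11]; the class name `GradientReverseFn`, the helper `grad_reverse`, and the scaling coefficient `alpha` are illustrative choices rather than the paper's actual code.

```python
import torch

class GradientReverseFn(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -alpha in the
    backward pass, so one backward pass updates the discriminator to
    minimize the domain loss while the feature extractor maximizes it."""

    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the features.
        return -ctx.alpha * grad_output, None

def grad_reverse(x, alpha=1.0):
    return GradientReverseFn.apply(x, alpha)
```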
For object detection tasks, each activation on a feature map corresponds to a patch of the input image, and the detector performs classification and regression at each location. Therefore, the domain discriminator can implement not only image-level distribution alignment but also pixel-level distribution alignment. Since HMDA-YOLO hierarchically aligns feature distributions in the backbone network and the head network, we focus on these two parts below.
3.2.1. Hierarchical Backbone Adaptation
The backbone network hierarchically adapts the output features of each C3 module (i.e., layers 4, 6, and 9 in the YOLOv5 framework) in different ways, according to the representation information of the features at each depth.
Shallow features capture rich detailed information (e.g., edges, colors, textures, and angles), which not only facilitates the detection of small objects but also helps align the local features of cross-domain objects. According to [44], the least-squares loss function can stabilize the training of the domain discriminator and is advantageous for aligning shallow representations; we therefore use it as the criterion for shallow-level feature adaptation. The domain discriminator $D_{sha}$ is a fully convolutional network with $1 \times 1$ kernels that predicts the pixel-level domain label of the source and target feature maps. The shallow-level adaptation loss can be formulated as follows:

$$\mathcal{L}_{sha} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\Big[\big(D_{sha}(f^{s}_{i})_{(h,w)}\big)^{2} + \big(1 - D_{sha}(f^{t}_{i})_{(h,w)}\big)^{2}\Big],$$

where $f^{s}_{i}$ and $f^{t}_{i}$ are the source and target shallow-level feature maps of the input image $x_i$, respectively, and $D_{sha}(\cdot)_{(h,w)}$ is the domain prediction at location $(h, w)$ of the corresponding feature map.
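As a concrete illustration of a pixel-level discriminator and the loss above, here is a minimal sketch assuming 128-channel shallow feature maps; the layer widths and the helper name `ls_pixel_loss` are assumptions, not details from the paper.

```python
import torch.nn as nn

class PixelDiscriminator(nn.Module):
    """Fully convolutional domain classifier built from 1x1 convolutions;
    it outputs one domain probability per spatial location."""

    def __init__(self, in_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
            nn.Sigmoid(),  # per-pixel domain prediction in (0, 1)
        )

    def forward(self, x):
        return self.net(x)

def ls_pixel_loss(pred_src, pred_tgt):
    """Least-squares loss over all locations: source pixels are pushed
    toward label 0 and target pixels toward label 1."""
    return (pred_src ** 2).mean() + ((1.0 - pred_tgt) ** 2).mean()
```

In training, the features would first pass through the GRL, e.g., `ls_pixel_loss(disc(grad_reverse(f_src)), disc(grad_reverse(f_tgt)))`.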
Medium-level features contain information about localized shapes and simple objects, and the Binary Cross-Entropy (BCE) loss is used for the adversarial loss calculation. The domain discriminator $D_{mid}$ differs from $D_{sha}$ in that $D_{mid}$ treats each feature map as a whole and predicts an image-level domain label. Specifically, $D_{mid}$ is composed of common convolutional layers, a global average pooling layer, and a fully connected layer. The middle-level adaptation loss can be formulated as follows:

$$\mathcal{L}_{mid} = -\Big[(1 - d_i)\log\big(1 - D_{mid}(f^{m}_{i})\big) + d_i \log D_{mid}(f^{m}_{i})\Big],$$

where $f^{m}_{i}$ is the medium-level feature map of the input image $x_i$, $D_{mid}(f^{m}_{i})$ is the image-level domain prediction of $x_i$, and $d_i$ is the ground-truth domain label, which is 0 for the source domain and 1 for the target domain in this paper.
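A plausible shape for this image-level discriminator, sketched under the same caveats (the channel counts and the class name `ImageDiscriminator` are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class ImageDiscriminator(nn.Module):
    """Convolutional layers, global average pooling, and a fully connected
    layer, producing a single domain logit per feature map."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(128, 1)

    def forward(self, x):
        x = self.conv(x)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)  # global average pooling
        return self.fc(x)  # image-level domain logit

# BCE against the domain label (0 = source, 1 = target), e.g.:
# loss = F.binary_cross_entropy_with_logits(logit, domain_label_tensor)
```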
The scene layout, number of objects, patterns between objects, and background may be quite different across domains. According to [16], alignment is likely to hurt performance under larger shifts. For example, a source domain containing rural images and a target domain containing urban images may have a large domain discrepancy even if they share the same object categories. In this case, blind alignment may lead to negative transfer and impair the model's cross-domain capability. HMDA-YOLO alleviates this problem by adjusting the weights of hard-to-classify and easy-to-classify samples to improve cross-domain performance more robustly. Specifically, for deep-level feature adaptation, the BCE loss is extended with the focal loss [45] to re-weight different samples. The domain discriminator $D_{deep}$ is trained to distinguish the source from the target, with a structure similar to $D_{mid}$. The deep-level adaptation loss can be written as follows:

$$\mathcal{L}_{deep} = -\Big[(1 - d_i)\big(D_{deep}(f^{d}_{i})\big)^{\gamma}\log\big(1 - D_{deep}(f^{d}_{i})\big) + d_i\big(1 - D_{deep}(f^{d}_{i})\big)^{\gamma}\log D_{deep}(f^{d}_{i})\Big],$$

where $f^{d}_{i}$ is the deep-level feature map of the input image $x_i$, and $\gamma$ is the parameter that controls the weight of different samples. If a sample is easy to classify (i.e., far from the decision boundary), it should incur a low loss to avoid negative transfer; conversely, if it is hard to classify, we want it to incur a high loss for domain confusion. Therefore, the value of $\gamma$ needs to be greater than 1 to assign low weights to easy-to-classify samples and high weights to hard-to-classify samples.
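The focal re-weighting can be written compactly in code. Below is a minimal sketch of a focal-modulated domain BCE under the paper's label convention (0 = source, 1 = target); the function name and the default `gamma` value are illustrative.

```python
import torch

def focal_domain_loss(pred, domain_label, gamma=2.0):
    """Focal-weighted BCE for deep-level domain adaptation.

    pred:         discriminator output after sigmoid, in (0, 1)
    domain_label: 0.0 for source samples, 1.0 for target samples
    gamma:        focusing parameter that down-weights easy-to-classify
                  samples and up-weights hard-to-classify ones
    """
    eps = 1e-6
    pred = pred.clamp(eps, 1.0 - eps)
    # p_t: probability assigned to the correct domain label.
    p_t = domain_label * pred + (1.0 - domain_label) * (1.0 - pred)
    # (1 - p_t)^gamma is small for confident (easy) samples, large for hard ones.
    return (-((1.0 - p_t) ** gamma) * torch.log(p_t)).mean()
```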
Combining the hierarchical domain adaptation losses, HMDA-YOLO promotes comprehensive distribution alignment and suppresses negative transfer. The overall adaptation loss of the backbone network can be written as follows:

$$\mathcal{L}_{bac} = \mathcal{L}_{sha} + \mathcal{L}_{mid} + \mathcal{L}_{deep}.$$
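Reusing the sketches above, the three backbone losses could be combined as follows; the discriminator instances, the tapped channel widths, and the helper name `backbone_adaptation_loss` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Assumed discriminators for the three tapped depths (layers 4, 6, 9);
# channel widths must match the actual backbone outputs.
d_sha = PixelDiscriminator(128)
d_mid = ImageDiscriminator(256)
d_deep = ImageDiscriminator(512)

def backbone_adaptation_loss(feats, domain_label, alpha=1.0):
    """Hierarchical backbone adaptation for one batch from one domain.

    feats:        dict with 'shallow', 'mid', 'deep' feature maps
    domain_label: 0.0 for source batches, 1.0 for target batches
    """
    # Shallow: pixel-level least-squares loss.
    p_sha = d_sha(grad_reverse(feats["shallow"], alpha))
    l_sha = ((p_sha - domain_label) ** 2).mean()

    # Medium: image-level BCE on one logit per image.
    logit_mid = d_mid(grad_reverse(feats["mid"], alpha))
    l_mid = F.binary_cross_entropy_with_logits(
        logit_mid, torch.full_like(logit_mid, domain_label))

    # Deep: focal-weighted BCE (focal_domain_loss sketched earlier).
    p_deep = torch.sigmoid(d_deep(grad_reverse(feats["deep"], alpha)))
    l_deep = focal_domain_loss(p_deep, domain_label)

    return l_sha + l_mid + l_deep
```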
3.2.2. Multi-Scale Head Adaptation
The feature maps of the head network play an important role in multi-scale object detection and are crucial for object recognition and localization. Therefore, HMDA-YOLO implements fine-grained pixel-level adaptation at each scale for efficient cross-domain training, exploiting the rich discriminative information of the feature maps to be detected. On the one hand, multi-scale adaptation reduces local instance divergence and preserves the model's multi-scale detection capability. On the other hand, it helps distinguish objects from the background, which reduces the impact of background noise. In this way, the model learns more robust and generalizable representations that improve detection accuracy.
Specifically, the head adaptation is performed at three different scales, i.e., layers 17, 20, and 23 in the YOLOv5 framework. The pixel-level domain discriminator $D^{(s)}_{head}$ is similar to $D_{sha}$ in that it predicts pixel-level domain labels, and the BCE loss is used for the loss calculation. The adaptation loss at each scale can be written as follows:

$$\mathcal{L}^{(s)}_{head} = -\frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\Big[(1 - d_i)\log\big(1 - D^{(s)}_{head}(f^{(s)}_{i})_{(h,w)}\big) + d_i \log D^{(s)}_{head}(f^{(s)}_{i})_{(h,w)}\Big],$$

where $f^{(s)}_{i}$ denotes the feature map to be detected at scale $s$. Note that the discriminators $D^{(s)}_{head}$ at the three different scales do not share weights. Considering the feature adaptation at all three scales, the head adaptation loss can be unified as follows:

$$\mathcal{L}_{head} = \sum_{s=1}^{3} \mathcal{L}^{(s)}_{head},$$

where $s$ denotes the different scales.
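A sketch of the three-scale head adaptation, reusing the pixel-level discriminator and GRL from above; the assumed channel widths (128, 256, 512) and helper names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One independent pixel-level discriminator per detection scale
# (weights are NOT shared across the three scales).
head_discriminators = nn.ModuleList(
    PixelDiscriminator(c) for c in (128, 256, 512)
)

def head_adaptation_loss(head_feats, domain_label, alpha=1.0):
    """Sum of per-pixel BCE domain losses over the three detection scales.

    head_feats:   list of feature maps from YOLOv5 layers 17, 20, 23
    domain_label: 0.0 for source batches, 1.0 for target batches
    """
    loss = 0.0
    for feat, disc in zip(head_feats, head_discriminators):
        pred = disc(grad_reverse(feat, alpha))        # (B, 1, H, W) in (0, 1)
        target = torch.full_like(pred, domain_label)  # per-pixel domain label
        loss = loss + F.binary_cross_entropy(pred, target)
    return loss
```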
3.3. Overall Formulation
The main discriminative capability of the network is learned from the labeled source samples in a supervised way. The source supervised training loss of YOLOv5 can be formulated as follows:

$$\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{obj} + \mathcal{L}_{box},$$

where $\mathcal{L}_{cls}$ is the classification loss and defaults to the BCE loss, $\mathcal{L}_{obj}$ is the object confidence loss and defaults to the BCE loss, and $\mathcal{L}_{box}$ is the bounding box regression loss and defaults to the CIoU loss.
The domain adversarial loss, combining the hierarchical backbone adaptation strategy and the multi-scale head adaptation strategy, can be written as follows:

$$\mathcal{L}_{adv} = \mathcal{L}_{bac} + \mathcal{L}_{head} = \mathcal{L}_{sha} + \mathcal{L}_{mid} + \mathcal{L}_{deep} + \sum_{s=1}^{3}\mathcal{L}^{(s)}_{head}.$$
Combining the source supervised loss function with the cross-domain adversarial loss function, the overall optimization objective can be formulated as follows:

$$\mathcal{L} = \mathcal{L}_{det} + \lambda\,\mathcal{L}_{adv},$$

where $\lambda$ is the trade-off parameter that balances the detection loss and the domain adversarial loss, and the adversarial loss involves the set $D$ of all domain discriminators, including $D_{sha}$, $D_{mid}$, $D_{deep}$, and $D^{(s)}_{head}$. The network can be trained in an end-to-end manner using a standard stochastic gradient descent algorithm, and the adversarial min–max training is achieved through the GRL, which reverses the gradient during back-propagation so that the discriminators minimize the adversarial loss while the feature extractor learns to maximize it. The structures of these domain discriminators are summarized in Figure 2.
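Putting the pieces together, one training step might look like the sketch below; the detector interface (`return_feats=True`, `detection_loss`) is hypothetical and only illustrates how the source detection loss and the two adversarial losses would be combined.

```python
def train_step(model, src_images, src_targets, tgt_images, lam=0.1):
    """One end-to-end training step (illustrative interface).

    The GRLs inside the adversarial losses flip the gradient, so a single
    backward pass trains the discriminators to minimize the domain losses
    while the detector learns to confuse them.
    """
    # Hypothetical forward returning detections plus the tapped backbone
    # (layers 4/6/9) and head (layers 17/20/23) feature maps.
    src_out, src_feats = model(src_images, return_feats=True)
    _, tgt_feats = model(tgt_images, return_feats=True)

    # Supervised detection loss (cls + obj + box) on labeled source data.
    loss_det = detection_loss(src_out, src_targets)

    # Adversarial losses on both domains (label 0 = source, 1 = target).
    loss_adv = 0.0
    for feats, d in ((src_feats, 0.0), (tgt_feats, 1.0)):
        loss_adv = loss_adv + backbone_adaptation_loss(feats["backbone"], d)
        loss_adv = loss_adv + head_adaptation_loss(feats["head"], d)

    # Overall objective L_det + lambda * L_adv; caller backprops and steps.
    return loss_det + lam * loss_adv
```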
3.4. Theoretical Analysis
Reference [46] designed the $\mathcal{H}$-distance to measure the divergence between two sets of samples that have different data distributions. Let us consider a source domain $\mathcal{D}_s$, a target domain $\mathcal{D}_t$, and a domain discriminator $D$, which tries to predict the source and target domain labels to be 0 and 1, respectively. Assuming $\mathcal{H}$ to be the set of possible domain discriminators, the $\mathcal{H}$-distance can be defined as follows:

$$d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = 2\Big(1 - \min_{D \in \mathcal{H}}\big[\mathrm{err}_s(D) + \mathrm{err}_t(D)\big]\Big),$$

where $\mathrm{err}_s(D)$ and $\mathrm{err}_t(D)$ denote the expected domain classification errors over the source domain and the target domain, respectively. The combined error of the ideal hypothesis (e.g., domain discriminator) can be denoted as follows:

$$C = \epsilon_s(h^*) + \epsilon_t(h^*), \qquad h^* = \arg\min_{h \in \mathcal{H}}\big[\epsilon_s(h) + \epsilon_t(h)\big],$$

where $h^*$ is the ideal joint hypothesis, and the terms $\epsilon_s(h^*)$ and $\epsilon_t(h^*)$ are the expected risks on the source and target domains, respectively. The combined error can be used to measure the adaptability between different domains: if the ideal joint hypothesis performs poorly, i.e., the error $C$ is large, the domain adaptation process is difficult to realize. Based on the above knowledge, Reference [46] gives an upper bound on the target error as follows:

$$\epsilon_t(h) \leq \epsilon_s(h) + \frac{1}{2}\, d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + C, \qquad \forall h \in \mathcal{H}.$$
The target error is thus upper bounded by three terms: the expected error on the source domain $\epsilon_s(h)$, the domain divergence $d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t)$, and the constant term $C$. Since $\epsilon_s(h)$
can be directly minimized by supervised learning of the network and the constant term is difficult to handle, most existing methods minimize the upper bound on the target error by reducing the domain divergence between the source and target domains. Our proposed method not only hierarchically reduces the distribution discrepancy between different domains in the backbone network, but also aligns the multi-scale features before detection in the head network. Thus, it can effectively reduce the distribution discrepancy and significantly improve cross-domain detection performance.
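As an aside on how this divergence is measured in practice, the $\mathcal{H}$-distance is commonly estimated by the proxy $\mathcal{A}$-distance, $2(1 - 2\,\mathrm{err})$, where err is the held-out error of a domain classifier trained on extracted features; the sketch below uses a scikit-learn logistic regression purely for illustration and is not part of the paper's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(src_feats, tgt_feats):
    """Estimate the domain divergence as 2 * (1 - 2 * err), where err is the
    held-out error of a domain classifier (label 0 = source, 1 = target)."""
    X = np.concatenate([src_feats, tgt_feats])
    y = np.concatenate([np.zeros(len(src_feats)), np.ones(len(tgt_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)
    return 2.0 * (1.0 - 2.0 * err)
```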