Spatial Alignment for Unsupervised Domain Adaptive Single-Stage Object Detection

Domain adaptation methods aim to improve the performance of object detectors in new domains without additional annotation costs. Recently, methods that use adversarial learning to align source and target image distributions have proven effective. For object detection, however, image-level alignment also forces the alignment of non-transferable background regions, which hurts performance on the object regions that matter, so balancing the alignment of background and targets remains a challenge. In addition, the most effective existing approaches are built on two-stage detectors, and single-stage detectors have received comparatively little study. To address these issues, in this paper we propose a selective domain adaptation framework for the spatial alignment of a single-stage detector. The framework distinguishes background from targets and pays different attention to each; even though a single-stage detector generates no region proposals, it achieves domain feature alignment and reduces the influence of the background, enabling transfer between different domains. We validate the effectiveness of our method on weather discrepancy, camera angles, synthetic to real-world, and real images to artistic images. Extensive experiments on four representative adaptation tasks show that the method effectively improves the performance of single-stage object detectors in different domains while maintaining good scalability.


Introduction
Object detection is a fundamental and widely used topic in the field of computer vision. In the past few years, deep neural networks (DNNs) have achieved satisfactory results [1][2][3][4][5][6] in the presence of large amounts of labeled data. However, most of these studies are based on supervised learning methods, assuming that the training data and the test data have the same distribution, which obviously cannot always be satisfied in practical applications [7]. Because the real environment is uncontrollable, changes in environmental conditions, camera angles, and shooting distances may cause domain shifts, resulting in the trained model being unable to achieve expected results on other test sets. This hinders the deployment of the model in a real environment.
An obvious solution is to retrain with data collected in the new environment. Unfortunately, real-world environments usually do not come with annotated images, and finely annotating each object instance in each image is expensive and time-consuming. Unsupervised domain adaptation (UDA) [8], which assumes that no labeled data are available in the target domain, is an attractive alternative that guides the model to transfer knowledge from the source domain to the target domain using only labeled data in the source domain.
In recent years, domain adaptation has achieved remarkable success in many fields, and mainstream methods adopt the idea of generative adversarial learning to align the feature distributions of the source and target domains [9][10][11]. In unsupervised object detection domain adaptation, DA Faster R-CNN [9] considers both image-level and instance-level alignment. In addition, previous methods only learn domain-invariant features or restrict source-specific features while ignoring target-specific features, even though the ultimate goal is to achieve the best results on the target domain.
In this work, we address the problem of adaptively aligning background and instance features for single-stage detectors and propose a new method suitable for single-stage object detection. The goal of this paper is to build an end-to-end single-stage domain-adaptive deep learning model based on YOLOv5. The key idea is to apply different attention to background and target features when aligning them, setting different domain confidences for background and targets so that the network pays more attention to the target areas and reduces the interference of the background, as shown in Figure 1. Since the target-domain data have no ground-truth boxes, we introduce a self-training [18,19] method to obtain pseudo-labels for the target-domain data to guide domain alignment and assist training. However, using only high-confidence outputs leaves a large number of targets unidentified; to reduce the impact of the resulting false-negative samples in self-training, a false-negative suppression (FNS) loss is used in the target domain to replace the original object loss.
The contributions of this work mainly include the following three aspects:
1. We propose a domain adaptation framework suitable for single-stage detectors that combines image-level and instance-level alignment, balancing the domain shift by applying different attention to the background and targets through spatial domain alignment (SDA) and spatial consistency regularization (SCR).
2. We design a new target-domain objectness loss that reduces the negative impact of false-negative samples by suppressing their loss contribution.
3. We validate the effectiveness of our method from four different perspectives (weather difference, camera angle, synthetic to real-world, and real images to artistic images). Experimental results show that this method effectively improves the cross-domain detection performance of single-stage object detectors, with almost no increase in inference time, so real-time detection on a GPU is preserved.

Related Work
Object detection-The task of object detection is to find the categories and positions of all objects of interest in an image, one of the core problems in the field of computer vision. Object detection algorithms based on deep learning are mainly divided into two categories: two-stage detectors and single-stage detectors. A two-stage detector first performs region generation to obtain region proposals and then performs sample classification in the second stage through a convolutional neural network, such as R-CNN [2], Fast R-CNN [12], Faster R-CNN [13], R-FCN [4], etc. Single-stage detectors predict object classification and location directly from extracted features. Common single-stage object detection algorithms include YOLO [20][21][22], SSD [23], and RetinaNet [24]. Compared with two-stage detectors, which emphasize accuracy, recent single-stage detectors emphasize speed while maintaining acceptable accuracy, making their predictions comparable to two-stage algorithms; they have therefore become a popular paradigm. This paper selects YOLOv5 as the base detector, which enriches the research on domain adaptation of single-stage detectors.
Unsupervised domain adaptation-Unsupervised domain adaptation [25][26][27][28][29] is a frontier research direction within computer vision and transfer learning, which attempts to transfer knowledge from one domain to another by mitigating distributional changes, improving the cross-domain performance of deep neural network models. Earlier studies were based on statistical discrepancy minimization, trying to measure and minimize the distances between features from different domains in a feature space, using, e.g., maximum mean discrepancy (MMD) [30], the Wasserstein distance [31], or the Kullback-Leibler (KL) divergence [32]. In recent years, methods based on adversarial learning have gradually become the mainstream framework for UDA, and a lot of research has built on them. The main idea of adversarial domain adaptation is to confuse the domains by adding a domain discriminator and a gradient reversal layer during training, learning domain-invariant features that perform well on both source and target domains and reducing the impact of domain shift. In addition, some methods have explored domain adaptation from other perspectives, such as using a style transfer network to reduce the differences between domains and improve the adaptation effect.
Domain adaptation in object detection-Domain adaptation has achieved satisfactory results in image classification and semantic segmentation, but domain adaptation research in object detection is still in its early stages. Following the widespread success of adversarial learning-based methods in other areas, many adversarial domain adaptation frameworks have emerged in object detection. DA Faster R-CNN [9] was the first to introduce an adversarial learning-based method into object detection domain adaptation, achieving image-level and instance-level alignment to reduce the corresponding distribution differences. On this basis, recent studies have made a series of improvements to DA Faster R-CNN. Specifically, Xu et al. [33] proposed a classification regularization framework to achieve cross-domain matching of key image regions and important instances, and Saito et al. [8] proposed a detector based on strong local alignment and weak global alignment to help DA Faster R-CNN focus on aligning key regions and important objects across domains. Wang et al. [34] divided the weight direction and gradient into a domain-specific part and a domain-invariant part and removed the interference of the domain-specific direction while learning the domain-invariant direction. Besides, Deng et al. [35] proposed an unbiased mean teacher (UMT) model and achieved state-of-the-art results on various object detection benchmarks.

Proposed Methods
In this section, we present the technical details of the proposed method, adopting YOLOv5 as the baseline.

Framework Overview
Our research involves scenes from two domains: a source domain with images and annotations (i.e., bounding boxes and target categories) and a target domain with only images. The task is to use data from both domains to train a detector that performs well on the target domain. To this end, we propose a framework for selective domain adaptation based on spatial alignment, using YOLOv5 as the baseline. The overall architecture of the proposed method is shown in Figure 2. To address the differing alignment requirements of background and instances, we propose a spatial domain alignment module that balances the influence of background and targets to better achieve domain feature alignment. During training, this module guides the model to learn domain-invariant features and reduces the gap between domains through backpropagation. Considering that every region of an image comes from the same domain, i.e., image domain classification should be consistent, we additionally apply spatial consistency regularization.

Figure 2. The overall structure of our framework. The spatial domain alignment module is added after the backbone network to predict the domain confidence of different locations and align the target and background effectively.
Our framework does not rely on an RPN to extract regions of interest, so it can be directly applied to single-stage detectors to better align background and targets across domains and improve adaptive detection performance. Additionally, our framework is flexible and generalizable and does not depend on specific algorithms for image or instance alignment. The training loss of the network is the sum of each part, which can be written as

L = L_det + λ(L_d + L_cst),

where λ is a trade-off parameter balancing the YOLOv5 detection loss against the newly added domain-adaptive components, L_det is the detection loss, and L_d and L_cst are the domain identification loss and the domain consistency regularization loss, respectively. The network can be trained end to end. The adversarial training of the domain adaptation components is realized with a gradient reversal layer (GRL), which automatically reverses the gradient during backpropagation. During inference, only the detection branch is used. We detail each component below.
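To make the adversarial setup concrete, the following minimal Python sketch shows the two behaviors of a gradient reversal layer (identity forward, negated gradient backward) and the composition of the overall training loss; the function names and the hand-written backward pass are illustrative assumptions, not the paper's implementation.

```python
def grl_forward(x):
    # Forward pass: the GRL is an identity mapping.
    return x

def grl_backward(grad_output, lam=1.0):
    # Backward pass: the gradient is reversed (and optionally scaled),
    # so the backbone learns features that *confuse* the domain classifier.
    return -lam * grad_output

def total_loss(l_det, l_d, l_cst, lam=0.1):
    # Overall objective: detection loss plus weighted adaptation losses,
    # matching L = L_det + lambda * (L_d + L_cst).
    return l_det + lam * (l_d + l_cst)
```

In an autograd framework, the two GRL functions would be registered as the forward and backward of a single custom operation.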

Spatial Domain Alignment Module
We utilize an improved domain classifier for feature alignment of the source and target domains. Generally, a domain classifier judges the input as a whole to be from the source or the target domain, and domain confusion is then conducted through a gradient reversal layer (GRL) to achieve feature alignment. As mentioned earlier, it is difficult and unnecessary to align the background and the targets of interest with the same settings; aligning the objects of interest in the feature space is more important than aligning the background. To overcome this problem, the spatial domain alignment module is designed to align the background and instances of an image separately. Specifically, we introduce a domain classifier after the YOLOv5 backbone network. Through the gradient reversal layer, the gradient direction is automatically reversed during backpropagation while the identity transformation is applied during forward propagation; a series of convolution and downsampling operations then produces a feature output of dimension m × m, where each value represents the domain confidence of that location. In addition, a residual attention module is added to the domain classifier to better extract spatially effective information.
For each image, we wish to identify the important regions of the objects of interest and reweight the region representations. Single-stage detectors have no region proposals from an RPN, so a natural idea is to use the final predictions. We assume x_s, x_t are images from the source domain S and target domain T, respectively, and y_s, y_t are the source-domain labels and the model-predicted pseudo-labels, respectively. From (x_s, y_s) and (x_t, y_t), the positions of the targets in each image can be computed. We then divide the image into m × m grid cells and assign attention to each cell according to the area of the targets falling into it. We use the IoU as the measure: if the IoU is greater than 1/3, the cell is considered foreground; otherwise, it is background. Different ground-truth values are set for background and foreground. Domain-invariant features are better learned by optimizing the domain classification loss to confuse the source and target domain features. We let D_{i,j,k} denote the domain label of the i-th training image at the j-th row and k-th column: 0 for a source-domain target, 0.3 for the source-domain background, 1 for a target-domain target, and 0.7 for the target-domain background. Denoting the output of the domain classifier as p_{i,j,k} and using the cross-entropy loss, the spatial domain identification loss can be written as

L_d = −Σ_{i,j,k} [ D_{i,j,k} log p_{i,j,k} + (1 − D_{i,j,k}) log(1 − p_{i,j,k}) ].
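The label assignment and loss described above can be sketched as follows; this is an illustrative reconstruction, and in particular the overlap measure (fraction of each cell covered by boxes, standing in for the paper's IoU rule) and all function names are assumptions.

```python
import numpy as np

def grid_domain_labels(boxes, img_size, m=8, is_target=False, thr=1/3):
    # Soft domain labels on an m x m grid: foreground cells get 0 (source)
    # or 1 (target); background cells get 0.3 (source) or 0.7 (target).
    labels = np.full((m, m), 0.7 if is_target else 0.3)
    cell = img_size / m
    for j in range(m):
        for k in range(m):
            cx1, cy1 = k * cell, j * cell
            cx2, cy2 = cx1 + cell, cy1 + cell
            covered = 0.0
            for (x1, y1, x2, y2) in boxes:
                iw = max(0.0, min(cx2, x2) - max(cx1, x1))
                ih = max(0.0, min(cy2, y2) - max(cy1, y1))
                covered += iw * ih
            if covered / (cell * cell) > thr:
                labels[j, k] = 1.0 if is_target else 0.0
    return labels

def spatial_domain_loss(p, d, eps=1e-7):
    # Cross-entropy between predicted domain confidences p and soft labels d.
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(d * np.log(p) + (1 - d) * np.log(1 - p))))
```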

Spatial Consistency Regularization
We assume that the source-domain sample distribution is P_s(C, B, I) and the target-domain sample distribution is P_t(C, B, I), where I is the image representation, B is the bounding box of an object, and C is the class of the object. Because of the domain shift, P_s ≠ P_t. We then use p_{i,j,k} to denote the domain prediction for the grid cell at the j-th row and k-th column of the i-th image, and D_i to denote the image-level domain prediction of the i-th image. Since all cells come from the same image, we should have p_{i,j,k} = D_i. We can therefore alleviate the prediction bias among different regions of the same image by enforcing spatial domain classification consistency. The spatial consistency regularization (SCR) can be written as

L_cst = (1/N) Σ_{i,j,k} || p_{i,j,k} − D_i ||_2,

where N is the total number of grid cells into which each image is divided, ||·||_2 is the L2 norm, and D_i is taken as the mean of p_{i,j,k} over all cells of image i.
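A minimal sketch of the SCR term, under the assumption (consistent with the text) that D_i is the mean of the per-cell predictions of image i and that the L2 norm of a scalar difference reduces to its absolute value:

```python
import numpy as np

def scr_loss(p):
    # p: (m, m) per-cell domain predictions for one image.
    d_i = p.mean()                          # image-level domain estimate D_i
    n = p.size                              # N: number of grid cells
    return float(np.sum(np.abs(p - d_i)) / n)
```

A perfectly consistent prediction map incurs zero loss; the more the cells disagree about the image's domain, the larger the penalty.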

False-Negative Suppression Loss
In the target domain, to enable the model to learn target-domain features, we guide learning with high-confidence model predictions. In our experiments, directly applying the detection loss to the target domain produced a negative gain, because wrong negative samples reduce the model's recall. We therefore minimize the impact of false negatives by modifying the target-domain loss and propose a false-negative suppression (FNS) loss to stabilize training. Following the definition in YOLO, the original loss is

L = L_box + L_obj + L_cls,

where the object loss L_obj is a binary cross-entropy between the predicted objectness and the objectness targets, computed over all predicted boxes. YOLOv5 judges, according to the label information and certain rules, whether each predicted box contains a target: if so, the corresponding position in a mask matrix M is set to 1; otherwise, it is set to 0, yielding the mask matrix of the image. The classification loss and bounding-box loss are calculated only for predicted boxes whose mask value is 1. However, all predicted boxes contribute to the object loss, so false-negative samples in the target domain suppress the prediction that a target exists at those locations. We use the mask matrix to eliminate this effect and improve the target object loss as

L_obj^FNS = Σ M ⊙ BCE(p_obj, M),

i.e., the objectness loss is kept only at positions where the mask indicates a target, so positions missed by the pseudo-labels no longer push the objectness prediction toward zero.
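A numerical sketch of the FNS idea, with illustrative names and the simplifying assumption that the objectness targets equal the mask:

```python
import numpy as np

def bce(p, t, eps=1e-7):
    # Elementwise binary cross-entropy.
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def objectness_loss(p_obj, mask, use_fns=True):
    # p_obj: predicted objectness per position; mask: 1 where a
    # (pseudo-)label places an object, 0 elsewhere.
    per_cell = bce(p_obj, mask.astype(float))
    if use_fns:
        per_cell = per_cell * mask          # suppress unlabeled positions
        return float(per_cell.sum() / max(mask.sum(), 1))
    return float(per_cell.mean())           # standard YOLO: all positions
```

With a confident prediction p_obj = 0.9 at an unlabeled position, the standard loss penalizes it heavily, while the FNS variant ignores it.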

Experiments
In this section, we introduce typical datasets and implementation details for domain adaptation, validate our approach on weather discrepancy, camera angles, synthetic-to-real-world discrepancy, and real-to-artistic images, and compare our results with other methods. Furthermore, the effectiveness of each proposed component is verified by an ablation study.

Datasets and Evaluation
Cityscapes [36]-The Cityscapes dataset is a large-scale urban streetscape dataset collected from different cities in normal weather. It contains 2975 training images and 500 validation images.
Foggy Cityscapes [37]-The Foggy Cityscapes dataset is created by adding synthetic fog to the Cityscapes dataset. The synthetic haze transmittance is generated automatically, and the semantic annotations of the original images are inherited. Each Cityscapes image has fog added at three density levels, so the dataset contains 8925 training images and 1500 validation images.
KITTI [38]-KITTI is one of the most important datasets in the field of autonomous driving. Its images were collected from rural, urban, and highway areas. KITTI contains 7481 annotated images. We randomly select two-thirds of the images for training and other images for verification.
SIM10K [39]-SIM10K is a synthetic dataset generated by the Grand Theft Auto (GTAV) engine. It contains 10,000 images and 58,071 bounding box annotations, with car as the only category. All SIM10K images are utilized as the source domain.
PASCAL VOC [40]-The PASCAL VOC dataset is collected from the real world and can be used for detection and segmentation. For detection tasks, it mainly includes 20 categories. We use the training and validation sets of PASCAL VOC 2007 and 2012, consisting of about 15k images, for training.

Implementation Details
In all experiments, we use YOLOv5-l as the baseline, keeping the default training settings and hyperparameters of YOLOv5. The input image size is adjusted to 640 × 640 by scaling and padding, and stochastic gradient descent (SGD) is used for training. The initial learning rate is 0.01, the weight decay is 0.0005, and the SGD momentum is 0.937. During each experiment, data from both domains are loaded for training, and the source-domain and target-domain samples are shuffled together before training. Pseudo-labels are updated at every iteration during training. For the initial 100 iterations, the target-domain loss weight α = 0, so the target domain contributes only the domain classification loss; after 100 iterations, α gradually increases. For all experiments, we report the mean average precision (mAP) at an IoU threshold of 0.5 for comparison with other methods.
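The α schedule described above can be sketched as a simple warm-up ramp; the ramp length and final value here are assumptions for illustration, since the text only states that α is zero for the first 100 iterations and then gradually increases.

```python
def alpha_schedule(it, warmup=100, ramp=1000, alpha_max=1.0):
    # Target-domain loss weight: 0 during warm-up, then a linear ramp
    # up to alpha_max (the ramp length and alpha_max are assumed values;
    # only the 100-iteration warm-up is stated in the text).
    if it < warmup:
        return 0.0
    return min(alpha_max, (it - warmup) / ramp * alpha_max)
```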

Experimental Results
In this section, we evaluate the object detection of our proposed spatially adaptive model in four different domain transfer scenarios: weather discrepancy (Cityscapes → Foggy Cityscapes), camera angle (Cityscapes → KITTI), synthetic to real-world discrepancy (SIM10K → Cityscapes), and real to artistic (PASCAL VOC → Clipart and Watercolor).

Weather Discrepancy
Setting-In this section, we use the Cityscapes dataset for clear weather and the Foggy Cityscapes dataset for foggy weather to study adaptation to weather discrepancy, taking Cityscapes, with images and annotations, as the source domain and Foggy Cityscapes, with images only, as the target domain. The source-domain and target-domain data are shuffled together when loaded for training. Evaluation is performed on the validation set of Foggy Cityscapes.
Results-The comparison results are shown in Table 1. We used eight categories in our final evaluation. Since pre-training on different datasets has been shown to improve the generalization ability of the model, all methods are pre-trained on the COCO dataset for the convenience of comparison. Source only denotes the YOLOv5-l model trained on the source domain only. The results show that our method can improve the detection performance under different weather conditions, effectively solving the domain shift problem.

Camera Angle
Setting-In this section, we use Cityscapes as the source domain and KITTI as the target domain to verify the performance of our method under different camera configurations. During training, the training sets of the KITTI and Cityscapes datasets are used, and evaluation is performed on the KITTI validation split. The final evaluation uses the only class common to both domains, car.
Results-The results are shown in Table 2. It can be observed that our method achieves better performance in the end. Specifically, our method achieves an improvement of about 8.5%.

Synthetic to Real
Setting-Since data acquisition can be difficult, generating synthetic data can significantly reduce sampling and labeling costs, so this adaptation is meaningful. During training, we use SIM10K as the source domain and Cityscapes as the target domain, and evaluation is performed on the validation set of Cityscapes. The evaluation uses the only class common to both domains, car.
Results-The final results are shown in Table 3. Experimental results demonstrate that our method can effectively utilize synthetic data and achieve domain adaptation from the synthetic to the real world, reducing the domain shift.

Table 3. Synthetic-to-real adaptation: results on adaptation from SIM10K to Cityscapes.

Real to Artistic
Setting-In this section, we will verify the effectiveness of adapting real images to artistic images. The source domain images are from Pascal VOC 2007 and 2012, and Clipart and Watercolor are used as the target domain, respectively. Finally, it is evaluated on the Clipart and Watercolor test set.
Results-The performance comparisons for PASCAL VOC → Clipart and PASCAL VOC → Watercolor are shown in Tables 4 and 5. Compared with the baseline, the proposed spatial selective domain alignment method improves the mAP on the Clipart and Watercolor target domains by 12.8% and 8.9%, respectively. The improved performance shows that our method is effective in realizing domain adaptation.

Ablation Experiment
In this section, to verify the effectiveness of the different components of our method, thorough ablation experiments are performed. We design several variants of the model to verify the contributions of the different components; the results are shown in Table 6. All experiments are based on the adaptation from Cityscapes to Foggy Cityscapes. Table 6 shows that the various components of our proposed method are effective and complementary. Specifically, the performance of Ours-Type1, Ours-Type2, and Ours-Type7 shows that finer partitioning contributes to domain adaptation. Compared with Source Only, Ours-Type3 achieves a performance gain of +3.4% using the spatial domain alignment module. In addition, Ours-Type4 achieves an additional gain of +1.4% using only spatial consistency regularization (SCR), while Ours-Type5 achieves a gain of +5.5% using only FNS. Ours-Type7, combining all components, reaches an mAP of 37.4%.

Visualization
We show an example of detection results in Figure 3. The first row of images is the result of training with only source domain images, illustrating the degraded performance of the detector when the distributions of the source and target domains do not match due to dense fog. The second row of images is domain-adapted using our method. We can see that our model locates the object correctly in these cases.

Conclusions
In this work, we investigate the alignment of background and object instances for unsupervised domain adaptation on a single-stage detector. To this end, we propose a new UDA framework suitable for single-stage detectors that applies different attention settings to background and object instances for better feature alignment. Our key contribution is to address the problem of adaptively aligning background and instance features for single-stage detectors. We demonstrate the effectiveness of our approach from different perspectives, such as weather discrepancy, camera angles, synthetic to real-world images, and real images to artistic images. We conduct extensive experiments and ablation studies, which demonstrate that our method achieves competitive performance.

Data Availability Statement:
The data present in this study are openly available at https://ieeexplore.ieee.org/document/6248074/, accessed on 26 July 2012.

Conflicts of Interest:
The authors declare no conflict of interest.
