Cross-Domain Object Detection by Dual Adaptive Branch

The object detection task usually assumes that the training and test samples obey the same distribution. This assumption rarely holds in reality, which motivates the study of cross-domain object detection. Compared with image classification, cross-domain object detection presents a greater challenge, since it requires both accurate classification and accurate localization of samples in the target domain. The teacher–student framework (the student model is supervised by pseudo-labels from the teacher model) has produced large accuracy improvements in cross-domain object detection. Feature-level adversarial training is used in the student model, which encourages features in the source and target domains to share a similar distribution. However, features can be divided into domain-specific and domain-invariant components, and the purpose of domain adaptation is to focus on the domain-invariant features while eliminating interference from the domain-specific ones. Inspired by this, we propose a teacher–student framework named dual adaptive branch (DAB), which uses domain adversarial learning to align the domain distributions. Specifically, we ensure that the student model aligns domain-invariant features and suppresses domain-specific features in this process. We further validate our method on multiple domain pairs. The experimental results demonstrate that our proposed method significantly improves the performance of cross-domain object detection and achieves competitive results on common benchmarks.


Introduction
With the development of deep neural networks, many computer vision tasks have achieved great success. Convolutional neural network-based object detection has achieved excellent performance on various benchmark datasets. However, these successes depend on a large amount of labeled data [1], whose collection is costly and time-consuming. Moreover, the performance of object detection models trained on annotated data might degrade substantially in real scenarios, mainly because of changes in viewpoint, appearance, background, lighting, or image quality. To cope with these changing visual conditions, some work has begun to investigate adaptive methods [2]. Domain adaptation methods [3] can transfer knowledge from a labeled source domain to an unlabeled target domain, which is a more cost-effective and practical option than annotating enough object samples. Therefore, domain adaptation methods are widely applied in many fields.
Common domain adaptation methods include global feature adaptation [4], instance feature adaptation [5], and local feature adaptation [6]. Global feature adaptation easily confuses object features of different categories, since each image contains multiple objects. Instance feature adaptation easily confuses foreground and background features, since the detector output on the target domain is unstable and much of the predicted foreground is actually background. Local feature adaptation is typically used to address distribution shifts at lower semantic levels. Domain adaptation for image classification only requires aligning features within the same object category. In object detection, however, deciding which part of the features to align becomes a critical research problem, because the locations of objects are uncertain.
A cross-domain object detection task generally involves domain-invariant features and domain-specific features. A domain-invariant feature expresses the composition of the object, while a domain-specific feature expresses, for example, whether the object has clear boundaries. As shown in Figure 1, we have analyzed the meaningful error types in cross-domain object detection. These error types are related to overall performance in order to minimize any confounding variables. This improves the interpretability of design decisions and describes more clearly the strengths and weaknesses of the model. The pie chart shows the relative contribution of each error, while the bar plots show their absolute contribution. We can observe that the prediction results usually contain a large number of errors and false positives, because existing methods cannot focus on domain-invariant feature alignment. If features are aligned as a whole instead of distinguishing between domain-invariant and domain-specific features, it is difficult to resolve the error and false-positive problems in cross-domain object detection. A desirable domain adaptation method should align the domain-invariant features; otherwise, it might have a "negative transfer" effect on the target domain. A distribution difference metric function can minimize the difference between the source and target domains, and we can assume that the domains have the same or similar feature distributions when the difference is sufficiently small. However, feature alignment considered in this way does not distinguish the domain-invariant and domain-specific features of the downstream task. Inspired by the above, we propose to suppress domain-specific features in the model and to focus on learning domain-invariant features in the higher-level semantic space.
In this work, we propose a novel cross-domain object detection network with a DAB structure. The purpose is to train a well-performing model on a target domain whose annotations are unknown. We use adversarial learning and mutual learning to improve detection performance on the target domain. Our model consists of two modules: a target-domain teacher model and a cross-domain student model. In the student model, we propose a domain-invariant feature alignment branch and a domain-specific feature suppression branch, together with a distribution difference measure function that focuses on domain-invariant feature alignment while suppressing domain-specific features. We map the features to a high-dimensional space and use discriminators with gradient reversal for adversarial learning, which adjusts the distribution differences between the source and target domains in the student model.
The main contributions of this paper are as follows: (1) A novel cross-domain object detection framework with a dual adaptive branch, which we call DAB, is designed. This framework utilizes two branches, a domain-invariant feature alignment branch and a domain-specific feature suppression branch, to overcome errors and false positives in cross-domain object detection.
(2) The feature alignment branch and the feature suppression branch are each designed for a specific purpose. For domain-invariant feature alignment, we propose to map the features into a high-dimensional space and restrict the gradient using a distribution difference measure function, which minimizes the difference in domain-invariant features between the two domains. For domain-specific feature suppression, we propose to impose constraints on domain-specific features, which eliminates their influence on cross-domain object detection. (3) Extensive experiments are conducted on various cross-domain benchmarks, and the results demonstrate that our method achieves a significant performance improvement.

Related Works
Object Detection. Early object detection methods were based on sliding windows, applying hand-crafted features and classifiers on dense image grids to find objects [7,8]. However, traditional hand-crafted feature extraction has limitations, such as poor robustness to object variations, high time complexity, and redundant detection windows. Deep convolutional neural networks overcome these problems and improve both the speed and accuracy of detection. The object detection task is now dominated by the convolutional neural network (CNN), with detectors divided into two-stage [9][10][11] and one-stage [12][13][14][15][16] approaches.
Cross-Domain Object Detection. The purpose of cross-domain object detection is to detect objects across different domains. Xu et al. [16] proposed to alleviate the domain shift problem of deformable part-based models (DPMs) by introducing an adaptive support vector machine (SVM). Raj et al. [17] proposed subspace alignment methods to align the features extracted by R-CNN models. The works mentioned above are either not trained in an end-to-end manner or focus on a specific case. Cross-domain object detection methods can generally be divided into two categories, namely, adversarial feature alignment [4][5][6][18][19][20][21][22][23][24][25][26] and self-training [27][28][29][30][31]. Besides standard cross-domain object detection, the source-free and multi-source settings have been studied in [21,32,33], respectively. In addition, refs. [34,35] have explored domain generalization for object detection. This direction of research was first carried out by Chen et al. [5], who proposed a domain-adaptive Faster R-CNN, which reduces the difference between image-level and instance-level distributions by embedding adversarial feature adaptation in a two-stage detection pipeline. Saito et al. [6] proposed to align shallow local receptive fields with deeper image-level features, i.e., strong local alignment and weak global alignment. He et al. [35] proposed a hierarchical domain feature alignment module and a weighted GRL to reweight the training samples. Kim et al. [36] randomly expanded the source and target domains into multiple domains, addressing the adaptation problem from the perspective of domain diversification. Ref. [37] proposed a novel approach to domain adaptation for object detection that mines discriminative regions and focuses on aligning them across both domains. Ref. [38] adopted multi-level domain feature alignment. Ref. [39] utilized the classification consistency of image-level and instance-level predictions with the assistance of a multi-label classification model. Ref. [20] proposed a center-aware feature alignment method that enables the discriminator to focus on features from object regions. Refs. [4,25] emphasized different strategies for dealing with foreground and background features. Another popular methodology [27,28,39-41] is dedicated to addressing the problem of inaccurate labels in the target domain.
In summary, these methods do not properly address the potential contradiction between transferability and discriminability in cross-domain object detectors. Therefore, we propose a novel dual adaptive branch, which handles domain-invariant feature alignment and domain-specific feature suppression in the cross-domain object detection task.

Framework Overview
As shown in Figure 2, our model consists of two parts: a target-domain teacher model and a cross-domain student model. The teacher–student model is trained by mutual learning and adversarial learning. The target domain images are fed into the teacher model to generate pseudo-labels, which are used to train the student model. The teacher model is in turn updated from the student model by the exponential moving average (EMA) method. In the student model, we design two branches for feature alignment and feature suppression, respectively. The features are mapped to a high-dimensional space, and the differences between domain-invariant features in the two domains are minimized using a distribution difference measure function. Additionally, the domain-specific features are constrained to eliminate their influence on cross-domain object detection. Figure 2. Framework Overview. This framework consists of two parts: a target-domain teacher model and a cross-domain student model. In the target-domain teacher model, the target domain images are fed into the teacher model to generate pseudo-labels. In the cross-domain student model, we propose a DAB structure. In the feature alignment branch, the features are mapped into a high-dimensional space and the gradient is restricted using a distribution difference measure function, which minimizes the difference in domain-invariant features between the two domains. In the feature suppression branch, the domain-specific features are constrained, which eliminates their influence on cross-domain object detection.
The source domain image x_i^s and the target domain image x_i^t are taken as inputs, and the features F_S(x_i^s) for the source domain and F_T(x_i^t) for the target domain are obtained by the feature encoder. We design a novel DAB structure, which allows us to efficiently align domain-invariant features and suppress domain-specific features from different domains. Basically, our network consists of a target-domain teacher model and a cross-domain student model.
The feature encoder constructs a feature space over all images, from which the features of each image are obtained. The DAB is designed to focus on domain-invariant feature alignment while suppressing domain-specific features, and the detector outputs the predicted result. In particular, the purpose of the feature alignment branch is to minimize the differences between domain-invariant features using a measure function, and the purpose of the feature suppression branch is to eliminate the influence of domain-specific features.
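The EMA update mentioned above can be sketched as follows. This is only an illustrative sketch: the momentum value m = 0.999 and the dict-of-lists parameter representation are assumptions, not details specified in the paper; in practice this would run over framework tensors.

```python
def ema_update(teacher, student, m=0.999):
    """Update teacher parameters as an exponential moving average of the
    student parameters: t <- m * t + (1 - m) * s.
    `teacher` and `student` are dicts mapping names to lists of floats."""
    for name, s_params in student.items():
        t_params = teacher[name]
        for i, s in enumerate(s_params):
            t_params[i] = m * t_params[i] + (1.0 - m) * s
    return teacher

# Hypothetical one-parameter example: the teacher drifts slowly toward
# the student, which stabilizes the pseudo-labels it produces.
teacher = {"w": [0.0]}
student = {"w": [1.0]}
ema_update(teacher, student, m=0.9)
print(teacher["w"][0])  # ~0.1 after one step
```

Because m is close to 1, the teacher changes slowly even when the student is updated aggressively, which is what makes its pseudo-labels usable as supervision.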

Optimization Problem in Cross-Domain Student
In cross-domain object detection, we have N_s labeled samples from the source domain, defined as D_S = {x_i^s, y_i^s}_{i=1}^{N_s}, and N_t unlabeled samples from the target domain, defined as D_T = {x_i^t}_{i=1}^{N_t}. Here, x_i^s and x_i^t denote the input samples in the source and target domains, respectively. The label y_i^s = (c_i^s, b_i^s) corresponds to the input sample in the source domain, where c_i^s is the category label and b_i^s is the bounding box label. We denote the feature of a source-domain sample as f_i^s = F_S(x_i^s) and that of a target-domain sample as f_i^t = F_T(x_i^t). The regression representations in the source and target domains are denoted as P_S(f_i^s) and P_T(f_i^t), respectively. Our objective is to achieve the same or a similar distribution of features in the source and target domains; therefore, the problem is converted into minimizing the distance between the feature matrices.
Firstly, we map the features in the source and target domains based on the Gaussian distribution. The mean values with different distributions are calculated by finding a continuous function in the sample space, which is used to evaluate the difference in distribution between the source and target domains, as shown in Equations (1) and (2).
where F̂_S and F̂_T denote the means of the feature mappings in the source and target domains, respectively, and γ = −1/δ², where δ² is the variance of the Gaussian function. We use the Frobenius norm to represent the distance between the means of the samples with different distributions, as shown in Equation (3).
Therefore, the objective function for minimizing the difference in feature distribution can be denoted as Equation (4).
As shown in Figure 3a, Equation (4) minimizes the distance between the domain-invariant feature distributions, but it only considers the means of the feature distributions in the source and target domains and ignores the effect of the variance δ² on the feature distributions, as shown in Figure 3b. Therefore, we design a domain adaptive kernel function that aligns the domain-invariant feature distributions and suppresses the domain-specific feature distributions in the source and target domains in parallel.
To this end, we design a regularization term f_C(F̂_S, F̂_T) to suppress the variances of the domain-specific feature distributions, and obtain the optimization problem shown in Equation (5).
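Since Equations (1)–(5) are rendered as images in the original, the following is only a plausible numerical sketch of the idea they describe: features from both domains pass through a Gaussian mapping, the squared difference of the mean mappings stands in for the Frobenius-norm alignment term, and a variance penalty stands in for the suppression regularizer f_C. The function names, the scalar form, and the weight `mu_reg` are all assumptions for illustration.

```python
import math

def gaussian_map(features, center, delta=1.0):
    """Map each feature vector through a Gaussian function centered at
    `center`: phi(f) = exp(gamma * ||f - center||^2), gamma = -1/delta^2."""
    gamma = -1.0 / (delta ** 2)
    return [math.exp(gamma * sum((fi - ci) ** 2 for fi, ci in zip(f, center)))
            for f in features]

def mean_and_var(values):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def dab_objective(src_feats, tgt_feats, center, delta=1.0, mu_reg=0.1):
    """Alignment term: squared difference of the mean mappings (a scalar
    stand-in for the Frobenius norm); suppression term: sum of the mapping
    variances, weighted by mu_reg (a stand-in for f_C)."""
    m_s, v_s = mean_and_var(gaussian_map(src_feats, center, delta))
    m_t, v_t = mean_and_var(gaussian_map(tgt_feats, center, delta))
    return (m_s - m_t) ** 2 + mu_reg * (v_s + v_t)
```

When the two domains share the same feature set, the alignment term vanishes and only the variance penalty remains, matching the intuition that Equation (5) keeps pulling the distributions together while shrinking their spread.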
We define the domain adaptive kernel function K(f_i^s, f_i^t) as Equation (6).
When n → ∞, we can ignore the residual term to obtain Equation (8).
We introduce the substitution shown in Equation (9). Therefore, the domain adaptive kernel function can be expressed as Equation (10).
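Equations (6)–(10) are not recoverable from the extracted text. One standard derivation consistent with the surrounding description (a Gaussian kernel with γ = −1/δ², expanded by a Taylor series whose residual vanishes as n → ∞) would read as follows; this is a hedged reconstruction, not the paper's exact formulas:

```latex
% Gaussian domain-adaptive kernel (cf. Eq. (6)); delta^2 is the variance
K(f_i^s, f_i^t) = \exp\!\left(-\frac{\lVert f_i^s - f_i^t \rVert^2}{\delta^2}\right)

% Taylor expansion of the exponential (cf. Eq. (7));
% the residual term R_n can be ignored as n \to \infty (cf. Eq. (8))
K(f_i^s, f_i^t) = \sum_{k=0}^{n} \frac{1}{k!}
  \left(-\frac{\lVert f_i^s - f_i^t \rVert^2}{\delta^2}\right)^{\!k} + R_n
```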
Then, the optimization problem Equation (5) can be expressed as Equation (11).
Theorem 1. In the kernel mapping operator, we define ||F̂_S − F̂_T||_F and f_C(F̂_S, F̂_T). Proof of Theorem 1. Since ||F̂_S − F̂_T||_F → 0, the mappings φ(P_S) and φ(P_T) are similar, and then we have E(φ(P_S)) ≈ E(φ(P_T)). By the theorem of expectation, we know that

Feature Alignment Branch in Cross-Domain Student
In the domain adaptation task, it is easy to confuse the object features of different categories, since each image contains multiple objects. Therefore, we propose a feature alignment branch, which aims to minimize the difference in domain-invariant features between the source and target domains, as shown in Figure 4.
The feature encoder obtains the features of the source and target domains, respectively. First, the quality focal loss is used to calculate the classification loss on the source domain samples, which provides dense supervision over the whole image. The objective function is shown in Equation (12).
where y_i^gt denotes the pseudo-label from the teacher model, and the parameter β smoothly controls the down-weighting rate; we set β = 2.
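A minimal sketch of the quality focal loss in its standard scalar form with β = 2 follows; treating the target y as a soft quality score and the per-prediction scalar formulation are assumptions about details Equation (12) does not show here.

```python
import math

def quality_focal_loss(sigma, y, beta=2.0):
    """Quality focal loss for one prediction:
    QFL = -|y - sigma|^beta * ((1 - y) * log(1 - sigma) + y * log(sigma)),
    where sigma is the predicted score and y is the soft (quality) target.
    The |y - sigma|^beta factor smoothly down-weights easy examples."""
    eps = 1e-12  # guard against log(0)
    ce = (1.0 - y) * math.log(max(1.0 - sigma, eps)) + y * math.log(max(sigma, eps))
    return -abs(y - sigma) ** beta * ce

# A confident, accurate prediction contributes far less loss than a poor one.
print(quality_focal_loss(0.9, 0.9))  # 0.0: modulating factor vanishes
print(quality_focal_loss(0.1, 0.9))  # large: hard example stays in focus
```

The down-weighting is what makes dense whole-image supervision practical: the many easy background predictions are driven toward zero loss, so training focuses on ambiguous regions.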
Second, we train the feature encoder to obtain the features F_S(x_i^s) for the input x_i^s in the source domain and F_T(x_i^t) for the input x_i^t in the target domain, and we map these features to F̂_S and F̂_T, respectively. We use a gradient reversal layer (GRL) in the feature alignment branch, which reverses the sign of the gradient when it passes through the GRL.
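The GRL is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass. The tiny hand-rolled forward/backward pair below is an illustrative sketch, with the scale parameter `lam` as an assumption; in practice this would be a custom autograd function in the training framework.

```python
class GradReverse:
    """Gradient reversal layer: y = x on the forward pass,
    dL/dx = -lam * dL/dy on the backward pass, so the feature encoder is
    trained to *fool* the domain discriminator while the discriminator
    itself is trained normally."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # identity: features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient

grl = GradReverse(lam=0.5)
print(grl.forward(3.0))   # 3.0
print(grl.backward(2.0))  # -1.0
```

Because only the gradient sign flips, a single backward pass simultaneously improves the discriminator and pushes the encoder toward domain-indistinguishable features.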
We design the feature alignment branch to align the domain-invariant features of the source and target domains. We can derive two distributions, the source anchor distribution D_S^anc and the target anchor distribution D_T^anc. Therefore, the objective function is shown in Equation (13).
where each source-domain anchor x_i^s ∈ D_S^anc has a ground truth label y_i^gt. Equation (12) guides the feature alignment branch toward correct classification in the source domain. Equation (13) encourages the alignment of the mapping F̂_T on the target domain with the mapping F̂_S on the source domain, which reduces the difference in the domain-invariant feature distributions. In summary, the objective function of the feature alignment branch is shown in Equation (14).
where λ is the trade-off parameter.

Feature Suppression Branch in Cross-Domain Student
In the domain adaptation task, we aim to align the domain-invariant features of the two domains while suppressing the domain-specific features, which gradually drives the feature distributions to their corresponding local optima. Therefore, we propose a feature suppression branch for constraining domain-specific features, as shown in Figure 5. This branch can significantly improve the performance of cross-domain object detection, especially when the model is well pre-trained on the source domain: although the constraint inhibits learning of the final task, pre-training on the source domain compensates for this limitation, removing a key obstacle in the cross-domain object detection task. We have the ground truth category and bounding box labels for each object in the source domain. First, we calculate the regression loss on the source domain. The objective function is shown in Equation (15).
where v = (4/π²)(arctan(w_i^gt/h_i^gt) − arctan(w_i/h_i))², and w and h denote the widths and heights of the ground truth and predicted bounding boxes, respectively. The ground truth labels for the target domain are unknown, and we use P_T(f_i^t) in their place. However, it is difficult to obtain a satisfactory detector on the target domain because of the domain shift. Therefore, we propose a feature suppression branch for domain-specific features.
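The aspect-ratio term v above matches the standard CIoU regression loss, so, assuming Equation (15) follows that formulation, a sketch of v and its usual trade-off weight α reads as follows; the box representation and the α definition are the standard CIoU choices, not details taken verbatim from the paper.

```python
import math

def aspect_ratio_term(w_gt, h_gt, w, h):
    """CIoU aspect-ratio consistency term:
    v = (4 / pi^2) * (arctan(w_gt / h_gt) - arctan(w / h))^2.
    v is 0 when the predicted box has the same aspect ratio as the GT box."""
    return (4.0 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2

def trade_off_weight(v, iou):
    """Standard CIoU trade-off: alpha = v / ((1 - IoU) + v), so the
    aspect-ratio penalty matters more once the boxes already overlap well."""
    return v / ((1.0 - iou) + v)

# Matching aspect ratios give v = 0; mismatched ratios are penalized.
print(aspect_ratio_term(2.0, 1.0, 4.0, 2.0))       # 0.0: same ratio
print(aspect_ratio_term(2.0, 1.0, 1.0, 2.0) > 0)   # True
```

The α weighting keeps regression focused on overlap early in training and only refines shape once localization is roughly correct.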
In the target domain, we train the feature encoder to obtain the feature F_T(x_i^t) for the input x_i^t, and P_T(f_i^t) is obtained through the mapping representation F̂_T of the feature F_T(x_i^t).
In the feature suppression branch, we measure the differences across domains and suppress the differences in domain-specific features between the source and target domains. The objective function is shown in Equation (16).
Note that the regression loss in the source domain is defined only for the bounding box corresponding to y_i^gt, while L_reg^adv in the target domain is defined only for the bounding box associated with the predicted category in P_T(f_i^t). Equation (15) guides correct prediction in the source domain, and Equation (16) reduces the difference in the regression representations, aligning the regression representations P_T(f_i^t) and P_S(f_i^s). In summary, the objective function of the feature suppression branch is shown in Equation (17).
where µ is the trade-off parameter.

Datasets
We perform experiments on popular benchmarks for cross-domain object detection. The details of the datasets are shown in Table 1.
Pascal VOC. It is a dataset collected from the real world and can be used for detection and segmentation. For the detection task, it mainly consists of 20 categories with 2501 training images, 2510 validation images, and 4952 test images.
Clipart. It is an artistic image created by manual production. Clipart contains 1000 images in a total of 20 categories.
Watercolor. It contains watercolor-style images from six categories.
DT Clipart. It is produced by using CycleGAN for style transfer, which converts the Pascal VOC dataset to the style of the Clipart dataset. Therefore, the annotation information is identical to that of the original Pascal VOC dataset.
Cityscapes. It is a semantic segmentation dataset consisting of 2975 training images, 500 validation images, and 1525 test images, each of size 1024 × 2048. Each image is annotated at the pixel level and can be used for object detection after conversion. The images are all urban scenes of different cities under normal weather, and the objects are mainly pedestrians, vehicles, etc.
Foggy Cityscapes. It is created by adding synthetic fog into the Cityscapes dataset. Therefore, the annotation information is exactly the same as the original Cityscapes dataset.

Scenario
We evaluate our method in two adaptation scenarios.
Dissimilar domains. The purpose is to perform adaptation between dissimilar domains. First, we use Pascal VOC and Clipart as the source and target domains, respectively; results are reported on the Clipart val set (Pascal VOC → Clipart). Second, we use Pascal VOC and DT Clipart as the source and target domains, respectively; results are reported on the DT Clipart test set (Pascal VOC → DT Clipart).
Adverse weather. The purpose is to perform adaptation under different weather conditions. We use the Cityscapes and Foggy Cityscapes as the source and target domains, respectively. The results are presented on the Foggy Cityscapes val set (Cityscapes → Foggy Cityscapes).

Implementation Details
The models were implemented and evaluated with the PyTorch toolbox on the Python 3.6 platform. All experiments were run on an NVIDIA RTX 3090Ti GPU. We train the network with a batch size of 16, an initial learning rate of 0.125, and a decay rate of 0.1 every 400K steps. The different detection scales correspond to different receptive fields, and there are a total of 10,647 proposal boxes. We enrich the training set by data augmentation, which also enhances generalization: four images are randomly cropped, scaled, randomly arranged, and stitched into a single image. While enriching the dataset, the data of four images are processed at once during the normalization operation, which reduces the memory requirement of the model.
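The four-image crop-and-stitch augmentation described above can be sketched as follows; representing images as 2D nested lists of pixel values and using a fixed 2×2 grid (with the random cropping, scaling, and arrangement omitted) are simplifying assumptions for illustration.

```python
def mosaic(imgs):
    """Stitch four equally sized 2D images into one 2x2 mosaic.
    `imgs` is [top_left, top_right, bottom_left, bottom_right]; in real
    training each tile would first be randomly cropped and scaled, and
    box annotations would be shifted into the stitched coordinates."""
    tl, tr, bl, br = imgs
    top = [row_l + row_r for row_l, row_r in zip(tl, tr)]      # concat columns
    bottom = [row_l + row_r for row_l, row_r in zip(bl, br)]
    return top + bottom                                         # concat rows

# Four 2x2 single-value "images" become one 4x4 image.
tiles = [[[k] * 2 for _ in range(2)] for k in range(4)]
out = mosaic(tiles)
print(len(out), len(out[0]))  # 4 4
```

Since the four tiles are normalized together as one image, each optimizer step sees four scenes' worth of objects without four images' worth of activation memory.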

Adaptation between Dissimilar Domains
First, we report the adaptation experiments on dissimilar domains. We use the Pascal VOC dataset as the source domain and the Clipart dataset as the target domain. Source only indicates a model trained with only source domain data, and oracle indicates a model trained with labeled data from both the source and target domains. FA branch only denotes the method with the feature alignment branch only, FS branch only denotes the method with the feature suppression branch only, and Proposed Method denotes the method with the DAB.
As shown in Table 2, our method achieves 44.1% mAP with only the feature alignment branch (FA branch only), which equals the advanced algorithm UMT [45]. This demonstrates that the feature alignment branch is effective at aligning domain-invariant features. Our method achieves 45.9% mAP with only the feature suppression branch (FS branch only), exceeding UMT [45] by +1.8%, which illustrates that the feature suppression branch can effectively improve detection performance by constraining domain-specific features. With the dual adaptive branch (Proposed Method), our method achieves 48.4% mAP, outperforming all the other methods by a substantial margin. Although our method does not perform best in some categories ('chair', 'cow', and 'mbike'), the differences in average precision are small, and in these categories the Proposed Method still outperforms both FA branch only and FS branch only. Additionally, from the confusion matrix in Figure 6, it can be observed that 'aero' may be recognized as 'bird', and 'cat' may be recognized as 'dog'. We attribute this to the variation of image styles across the cross-domain datasets: it is easy to learn only approximate features for some small-sample categories, which reduces detection accuracy. Figure 7 shows the heat maps for three example images from Clipart, where the main objects of categories such as "sheep", "person", "chair", "cow", and "bottle" are localized. Figure 7a-c show the attention heat maps for FA branch only, FS branch only, and Proposed Method, respectively.
It can be observed that the proposed method aligns the critical regions and instances more accurately. Therefore, it helps the model activate the main objects of interest more accurately and achieve improved detection performance. In addition, we use the Pascal VOC dataset as the source domain and the DT Clipart dataset as the target domain. As shown in Table 3, our method achieves 52.1% mAP with only the feature alignment branch (FA branch only), 53.1% mAP with only the feature suppression branch (FS branch only), and 54.7% mAP with the dual adaptive branch (Proposed Method), which outperforms all the other methods by a clear margin. We further evaluate the generalization ability of our model on unknown domains. We trained the model on a labeled source dataset (Pascal VOC) and another artistic dataset without labels (Watercolor), and then ran inference on a target dataset (Clipart) that is unknown during training. We only trained on the overlapping classes (six classes) between Clipart1k and Watercolor, and the results are given in Table 4. Compared with AT [51] and MT [50], our model achieved the best performance in several categories. In some categories (such as bicycle, dog, and person), our method does not achieve the optimal performance, but the differences in average precision are small. This demonstrates that our model can generalize to unknown domains.

Adaptation between Adverse Weather
We report the experimental results of domain adaptive object detection under adverse weather conditions. Source only indicates a model trained with only source domain data, and oracle indicates a model trained with labeled data from both the source and target domains. FA branch only denotes the method with the feature alignment branch only, FS branch only denotes the method with the feature suppression branch only, and Proposed Method denotes the method with the DAB. Table 5 shows the experimental results on the Cityscapes → Foggy Cityscapes transfer. Our method achieves 43.6% mAP with the feature alignment branch only (FA branch only), 43.5% mAP with the feature suppression branch only (FS branch only), and 47.4% mAP with the DAB (Proposed Method). It is worth noting that the mAP values of the FA branch only and FS branch only methods are similar; however, in the pedestrian and rider categories, the FS branch only method performs better. The state-of-the-art method TDD [52] achieves 43.1% mAP, over which our method gains +4.3%. This shows that our method can reliably solve the domain adaptation problem under adverse weather. The confusion matrices for the source only method and our method are shown in Figure 8, where the improvement in detection quality is clearly visible. The results show that the proposed method significantly improves performance, especially localization accuracy.

Ablation Study
In this study, we investigate the performance of various strategies for aligning feature representations. We have used Pascal VOC → Clipart to conduct the study. For a fair comparison, all experiments have been performed under the same settings.
Trade-off parameter. First, we investigate the effect of different trade-off parameters on the performance of domain adaptive object detection. We present the results of the ablation experiments for the trade-off parameter in Table 6. The trade-off parameter of EXP.2 achieves the optimal performance: both AP_0.5 and AP_0.5:0.95 outperform the results of EXP.1 (by +4.7% and +0.9%) and EXP.3 (by +2.7% and +0.2%), respectively.
Scales. There is a potential scale shift between the source and target domain datasets. To investigate the effect of image scale on our method, we change the size of the images in the target domain while fixing the scale in the source domain at 640 pixels. We plot the detection performance at different image scales by changing the scales of the target domain images. FA branch only denotes the method with the feature alignment branch only, FS branch only denotes the method with the feature suppression branch only, w/o DAB denotes the method without the DAB, and Proposed Method denotes the method with the DAB.
As shown in Figure 9a,b, when changing the scales under the same experimental conditions, the model with the DAB achieves better results at most scales. Figure 9c shows the detection performance of our model with the DAB at each scale, where EXP.1-EXP.5 indicate experiments performed at various model depths. Figure 9d shows the inference speed at different scales. Comparing the two branches, we observe that the feature alignment branch is more robust to scale variation than the feature suppression branch. The reason is that scale variation is a global shift that affects all objects and backgrounds; in our method, the global domain shift is mainly handled by the feature alignment branch through domain-invariant feature alignment, while the feature suppression branch is used to constrain domain-specific features. When a significant global domain shift is present, the localization error increases, and the accuracy of the feature suppression branch is therefore affected by the error in the domain-specific features. Nevertheless, the DAB consistently provides the optimal results at all scales.
Conv kernel. The depth of the model may affect the quality of feature extraction. To investigate the effect of different model depths on our method, we conduct experiments under different convolution kernel conditions. The reported results are for Pascal VOC → Clipart, with all experiments conducted at the same image scale, as shown in Table 7. We observe that EXP.5 obtains the best performance, achieving 46.2% and 25.9% mAP in AP 0.5 and AP 0.5:0.95, respectively. We also report the results of the different convolution kernels in Figure 10, which evaluates the accuracy and speed of each model; for the trade-off between accuracy and speed, EXP.5 again achieves the best performance.

Branch structure. To verify the effectiveness of our dual adaptive branch structure, we perform a set of ablation studies. The reported results are for Pascal VOC → Clipart, and the results of the different experiments are shown in Table 8, with Proposed Method as the baseline. Proposed Method outperforms FA branch only in AP 0.5 and AP 0.5:0.95 by +2.5% and +0.9%, respectively, and outperforms FS branch only by +4.3% and +1.1%, respectively. Proposed Method outperforms all of the above experiments, achieving 48.4% and 26.5% mAP in AP 0.5 and AP 0.5:0.95. This demonstrates that model performance gradually improves as the DAB is involved in training, which illustrates the utility of each branch. The full DAB is superior to all single-branch methods, indicating that our method effectively preserves useful source-domain knowledge while exploring target-domain information in parallel.

Error Analysis
To obtain a meaningful distribution of errors and identify the components of the mAP, we separate all false positives and false negatives of the model into four types. Let IoU_max denote the maximum IoU between a false positive and the ground truth of a given category. The foreground IoU threshold is denoted t_f and the background threshold t_b, set to 0.5 and 0.1, respectively [57]. IoU_max ≥ t_f is denoted a classification error (Cls), which indicates localized correctly but classified incorrectly. t_b ≤ IoU_max < t_f is denoted a localization error (Loc), which indicates classified correctly but localized incorrectly. IoU_max < t_b is denoted a background error (Bkg), which indicates background detected as foreground. Additionally, a missed GT error (Miss) indicates an undetected ground truth.
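A minimal sketch of this error typing, assuming a per-detection IoU_max value and a class-correctness flag are available (the function name is ours; Miss is counted separately over undetected ground truths):

```python
def error_type(iou_max: float, class_correct: bool,
               t_f: float = 0.5, t_b: float = 0.1):
    """Assign a false positive to Cls, Loc, or Bkg using the foreground
    threshold t_f = 0.5 and the background threshold t_b = 0.1."""
    if iou_max >= t_f and not class_correct:
        return "Cls"  # localized correctly, classified incorrectly
    if t_b <= iou_max < t_f and class_correct:
        return "Loc"  # classified correctly, localized incorrectly
    if iou_max < t_b:
        return "Bkg"  # background detected as foreground
    return None       # not a false positive under these definitions
```

Tallying the returned labels over all false positives, plus the undetected ground truths as Miss, yields the per-type error ratios of the kind plotted in Figure 11.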
The error ratios for each model on Pascal VOC → Clipart are shown in Figure 11. The main errors in the target domain come from Miss (undetected ground truth), Cls (incorrect classification), and Loc (incorrect localization). As shown in column 2 of Figure 11, the error ratio of Cls is effectively reduced by the feature alignment branch, but the error ratio of Loc increases, which illustrates the need for the feature suppression branch. As shown in column 3 of Figure 11, the error ratio of Loc effectively decreases once the domain-specific features are constrained. In summary, this confirms that the DAB design is reasonable.

Figure 12 shows qualitative results for Pascal VOC → Clipart cross-domain detection. From top to bottom, the visualization results of the ground truth, source only, DA-Faster [5], and the proposed method are shown, respectively. There are many missing and incorrect results for source only. Compared with DA-Faster [5], our method improves localization accuracy more significantly, indicating that the problem of errors and false positives is alleviated; the improvement in detection quality is clearly visible. The results demonstrate a significant performance improvement with our method. Figure 13 shows qualitative results for Pascal VOC → DT Clipart cross-domain detection. Figure 14 shows qualitative results for Cityscapes → Foggy Cityscapes cross-domain detection. From left to right, the visualization results of the ground truth, source only, DA-Faster, and the proposed method are shown, respectively. The detection results of source only contain incorrect detections and mis-localizations, and the DA-Faster [5] results contain several omissions; our method significantly reduces both types of failure.
Notably, our method shows competitive performance with the oracle model, which demonstrates that our model can perceive knowledge of the target domain while retaining useful information from the source domain.

Conclusions
In this work, we address the domain shift in high-level semantic features by proposing a novel DAB structure. Since the purpose of domain adaptation is to focus on domain-invariant features and eliminate the interference of domain-specific features, we propose the feature alignment and feature suppression branches, respectively. This strategy eliminates the effect of feature shift, which reduces the probability of false positives and errors in detection. Specifically, we exploit a distribution-difference metric function to improve prediction consistency, which allows the model to focus on object-relevant features aligned in the high-level semantic space. Experimental results on common benchmarks indicate that our model achieves performance comparable with advanced methods, achieving 48.4% mAP and 47.4% mAP and outperforming the next best methods by +1.8% and +4.3%, respectively. The results also indicate that our detector is highly robust across different scales, which is very effective and advantageous in cross-domain object detection.