Patch-Based Auxiliary Node Classification for Domain Adaptive Object Detection

Qiu, Yuanyuan; Xu, Zhijie; Zhang, Jianqin

doi:10.3390/electronics13071239

by Yuanyuan Qiu ¹, Zhijie Xu ¹,* and Jianqin Zhang ²

¹ School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, China

² School of Geomatics and Urban Spatial Informatics, Beijing University of Civil Engineering and Architecture, Beijing 102616, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(7), 1239; https://doi.org/10.3390/electronics13071239

Submission received: 18 February 2024 / Revised: 8 March 2024 / Accepted: 19 March 2024 / Published: 27 March 2024

Abstract

Domain adaptive object detection (DAOD) aims to leverage labeled source domain data to train object detection models that generalize well to unlabeled target domains. Recently, many researchers have considered implementing fine-grained pixel-level domain adaptation using graph representations. Existing methods construct semantically complete graphs and align them across domains via graph matching. Before domain alignment through graph matching, these methods introduce an auxiliary node classification task, which uses the inherent information of graph nodes to classify them, in order to avoid suboptimal graph matching results caused by node class confusion. However, they neglect the contextual information of graph nodes, leading to biased node classification and suboptimal graph matching. To solve this issue, we propose a novel patch-based auxiliary node classification method for DAOD. Unlike existing methods that use only the inherent information of nodes for node classification, our method exploits the local region information of nodes and employs multi-layer convolutional neural networks to learn the local region feature representation of nodes, enriching the node context information. Thus, accurate and robust node classification results are produced and the risk of class confusion is reduced. Moreover, we propose a progressive strategy to fuse the inherent features and the learned local region features of nodes, which ensures that the network can stably and reliably utilize local region features for accurate node classification. We conduct extensive experiments on various DAOD scenarios and demonstrate that our proposed model outperforms existing works.

Keywords:

object detection; domain adaptation; node classification; patch; graph matching

1. Introduction

In the past few years, computer vision tasks based on deep neural networks, such as object detection [1,2], have made significant breakthroughs with the swift advancement of deep learning. These high-performing models depend on abundant labeled data, but data annotation is extremely labor-intensive. Therefore, an attractive solution is to transfer a model trained on a labeled source domain to an unlabeled target domain. However, this approach faces the challenge of domain shift, which considerably degrades model performance. This limits the universality and portability of the model in applications such as autonomous driving under adverse weather conditions and object detection under different camera settings.

To address the aforementioned issues, researchers have proposed the unsupervised domain adaptation (UDA) method. This method aims to enhance the performance of the model on the target domain without labels by utilizing the labeled data from the source domain under the condition that the source and target domains do not satisfy the independent and identically distributed assumption. The main purpose of UDA methods is to align the feature distribution between the source domain and target domain. In early work, a common strategy was to minimize domain difference measures [3]. Recent work has introduced Generative Adversarial Networks (GANs) for the adversarial training of UDA applications [4,5]. Besides the above two methods, there are also UDA methods based on reconstruction and sample generation. At present, the UDA method has been widely applied in vehicle detection [6,7], oil palm tree detection [8], road extraction [9], dehazing [10], and other fields.

The combination of domain adaptation and object detection addresses the decreased performance of object detection models caused by domain shift, and has become a research focus in recent years. Earlier adaptive works in object detection tasks [11,12,13,14] were based on the Faster R-CNN [2] model and reduce domain discrepancy at the image and instance levels by adding two domain adaptive components, one for each level. Subsequent work [15] achieved category-level adaptation by minimizing the distance between domain class prototypes. Recent work [16] modeled domain information by completing incomplete semantics and transformed the domain adaptation problem into a graph matching problem. This approach matches graph nodes in a one-to-one manner by creating pairwise node correspondences between two graphs, thereby achieving fine-grained pixel-level adaptation. It also introduced an auxiliary node classification task to enhance the semantics of nodes before graph matching. More concretely, it adopted a node-based method, which uses the inherent information of nodes to classify the fused semantically complete graph nodes, in order to reduce the risk of class confusion. However, it used only the information of the nodes themselves for classification, ignoring their contextual information. As a result, inaccurate classifications were produced, which is not conducive to subsequent graph matching.

To tackle these aforementioned difficulties, we propose a patch-based auxiliary node classification method for DAOD which combines the contextual information of nodes to construct patches and utilizes multi-layer convolutional neural networks to learn the local region feature representations of nodes. As a result, the lack of node context information is compensated, enabling accurate classification results and reducing category confusion, which facilitates subsequent graph matching to achieve domain adaptive object detection. Specifically, for semantically complete nodes that have undergone domain fusion, we first utilize the local region information of nodes to construct patches. Then, the multi-layer convolutional neural networks are used to perceive nodes and their contextual information, and learn the local region feature representations of nodes. Next, we employ a classification network to integrate the region representations of all nodes for classification, obtaining accurate and authentic classification results. Moreover, we propose a progressive strategy to fuse the inherent and the learned local region features of nodes. In this strategy, the importance of local region features is continuously increased at appropriate stages of network training to ensure that the network can stably and reliably use local region features for accurate node classification. We integrate the above methods and strategies with the DAOD model, which constitutes our model. To evaluate our model, we perform experiments on different datasets, including Cityscapes [17], Foggy Cityscapes [18], and Sim10k [19], achieving SOTA performance on different domain migration scenarios.

We summarize our primary contributions below:

We propose a patch-based method to fuse information from the local regions of nodes, which addresses the problem of inaccurate classification results due to the missing contextual information of nodes in node classification tasks, and reduces the risk of node category confusion after domain fusion.
We design a progressive strategy to integrate the inherent features of nodes and the learned local region features, enabling the network to stably utilize local region features for accurate node classification.
We develop a model that incorporates the above method and strategy into a DAOD model. We assess our proposed model in different domain adaptation scenarios, including adaptation to adverse weather conditions and adaptation from synthesized data to real data, and achieve SOTA results.

2. Related Work

2.1. Object Detection

Object detection is a basic computer vision task that identifies the location and content of objects of interest in images. With the rapid development of deep learning, CNN-based detectors have improved significantly in the past decade. CNN-based methods for object detection can be divided into the following two categories: one-stage and two-stage. Faster R-CNN [2] is a representative two-stage model that uses a Region Proposal Network (RPN), rather than traditional selective search [20], to improve the speed and precision of object detection. Among one-stage models, the two representatives YOLO [21] and SSD [1] detect targets directly from the features of the backbone network, achieving end-to-end object detection. More recently, anchor-free detectors have become popular. For example, FCOS [22] predicts the category, offset, and center-ness at each location of the feature map instead of using anchor boxes. In this article, we apply the anchor-free detector FCOS to develop our domain adaptive model.

2.2. Unsupervised Domain Adaptation

UDA transfers a trained model from a source domain with labeled data to a target domain with unlabeled data by using deep neural networks to align the data distribution between the domains. According to the role of deep neural networks, UDA methods can be classified into the following four types: distribution difference-based methods, adversarial methods, sample generation-based methods, and reconstruction-based methods. The first type extracts domain-invariant features by reducing the discrepancies in domain distributions. The second type aligns the feature distributions via adversarial training, which involves a game between the feature extractor and the domain discriminator. The third type synthesizes labeled target domain samples from source domain samples, and trains the target domain network using the synthesized samples. The fourth type utilizes autoencoders to extract transferable features, which are decoupled into domain-invariant and domain-specific features. The former transfer knowledge, while the latter reduce the generalization error of the target domain. Despite the significant progress in UDA research, most methods are applied to image classification tasks, and their application to cross-domain object detection remains a challenge.

2.3. Domain Adaptive Object Detection

DAOD reduces domain discrepancy between source and target domains in object detection tasks using domain adaptation methods. Based on the Faster R-CNN model, Chen et al. [11] designed and established a DAOD model with image-level and instance-level components. He et al. [12] introduced a multi-adversarial Faster R-CNN method to minimize domain disparity in feature representation. Xu et al. [14] developed a classification regularization framework that solves DAOD problems to some extent and can be applied as a plug-and-play module to various domain adaptive Faster R-CNN models.

In the field of DAOD, the image-level alignment of global features may confuse foreground and background information, while the instance-level alignment of local features may suffer from background noise. Hsu et al. [23] addressed these issues by predicting the objectness and center-ness of each pixel. Xu et al. [15] performed graph-induced prototype alignment, which achieves category-level domain adaptation by seeking class prototype representations. Tian et al. [24] designed a knowledge transfer network, which was built on a basic detector with inherent knowledge extraction and relational knowledge restrictions. Zhao et al. [25] introduced a task-specific inconsistency alignment method to decouple and improve the performance of detectors in the two independent task spaces of classification and localization.

The recent work aimed to achieve fine-grained domain adaptation at the pixel level using graph matching. Li et al. [16] proposed a framework for semantically complete graph matching, which employed a semantic completion module to produce imaginary graph nodes for missing categories, completing mismatched semantics. The framework then constructed a graph structure to perform graph matching and enable domain adaptive object detection. However, their method relied on the node-based classification of graph nodes before graph matching, which only input the intrinsic information of nodes into the classifier and ignored the contextual information of nodes, leading to suboptimal classification results. To address these problems, we designed a patch-based auxiliary node classification method that replaces the node-based classification method. Our method integrates information from the local regions of nodes to overcome the deficiency of contextual information in nodes. This allows the classifier to achieve accurate classification results, thus minimizing the risk of class confusion. Furthermore, we propose a progressive strategy to integrate the inherent features and the learned local region features of nodes to ensure that the network can stably and reliably utilize the learned local region features for accurate node classification.

3. Method

In DAOD, we denote the source domain as $S = \{(x_i^s, y_i^s, B_i^s)\}_{i=1}^{p_s}$, where $x_i^s$ is the i-th labeled image, $y_i^s = \{\gamma_{iq}^s\}_q$ and $B_i^s$ are the class labels and bounding box coordinates in $x_i^s$, respectively, and $p_s$ is the number of labeled images. Analogously, we denote the target domain as $T = \{x_i^t\}_{i=1}^{p_t}$, where $x_i^t$ is the i-th unlabeled image and $p_t$ is the number of unlabeled images. The two domains share the same category labels but have different data distributions. The structure of this section is as follows. First, we present the overall process of our model. Then, we describe our proposed patch-based auxiliary node classification method and progressive node feature fusion strategy in detail. Finally, we explain the loss function and the details of optimization.

3.1. Overview

Figure 1 illustrates the process of the proposed model, which is based on the framework of [16]. Given a batch of labeled source images $S = \{(x_i^s, y_i^s, B_i^s)\}_{i=1}^{b}$ and unlabeled target images $T = \{x_i^t\}_{i=1}^{b}$, we first apply a shared feature extractor to obtain image-level features $F_{s/t} = \{F(x_i^{s/t})\}_{i=1}^{b}$, where $F(x_i^{s/t}) \in \mathbb{R}^{H \times W \times D}$, $\forall i$. Then, we sample the spatial features of the source domain uniformly and obtain a set of pixels. A predefined ratio of pixels inside and outside the ground truth boxes are, respectively, used as foreground and background nodes in the source domain. Similarly, we propagate the target domain features forward and select the foreground and background nodes in the target domain based on a predefined threshold. Next, we perform non-linear mapping to transform the features from visual space to graph space and generate the original node set $V_{s/t}$. The class prototype of the corresponding domain is used as the mean, and the distribution of the opposite domain (i.e., the other domain) is used as the standard deviation. Gaussian sampling and linear mapping are applied to obtain the imaginary nodes for the missing classes. These imaginary nodes and the original nodes form a semantically complete node set $\hat{V}_{s/t}$. After that, we construct the graph structure $G_{s/t}$ by calculating the inner product of the feature vectors of these semantically complete nodes. Domain fusion on the graph nodes is then performed to obtain the fused node set $\tilde{V}_{s/t}$. To enhance the semantics of the graph, we fuse the semantically complete nodes and their local regions progressively to construct patches. We perform patch-based auxiliary node classification tasks and obtain accurate and realistic classification results. Next, the cross-domain-aware graph nodes $\tilde{V}_{s/t}$ are concatenated through linear mapping. We utilize multi-layer perceptron layers, instance normalization layers [26], and differentiable Sinkhorn layers [27] to obtain the similarity matrix $M$. Finally, fine-grained domain adaptation is guided through structure-aware matching loss.
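To make the role of the Sinkhorn layers more concrete, the following is a minimal sketch of Sinkhorn normalization in plain Python. It is an illustration under assumptions, not the layer used in the model: the function name `sinkhorn`, the exponentiation of raw scores, and the fixed iteration count are all choices made here, and the multi-layer perceptron and instance normalization steps that precede it are omitted.

```python
import math


def sinkhorn(scores, n_iters=20):
    """Turn a score matrix into an (approximately) doubly stochastic matrix.

    Assumption: scores are exponentiated first so every entry is positive;
    rows and columns are then normalised alternately.
    """
    m = [[math.exp(s) for s in row] for row in scores]
    n_cols = len(m[0])
    for _ in range(n_iters):
        # Row step: make every row sum to 1.
        normalised = []
        for row in m:
            total = sum(row)
            normalised.append([x / total for x in row])
        m = normalised
        # Column step: make every column sum to 1.
        col_sums = [sum(row[j] for row in m) for j in range(n_cols)]
        m = [[x / col_sums[j] for j, x in enumerate(row)] for row in m]
    return m
```

For a square score matrix, each row of the result can be read as a soft assignment of one node to the nodes of the other graph; larger scores keep larger matching weights after normalization.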

3.2. Patch-Based Auxiliary Node Classification

Before applying graph matching for DAOD, we fuse the semantics of cross-domain nodes to establish dense connections among nodes from different domains, which enables sparse and fine-grained adaptation using interactive semantics. Although this is indispensable and highly effective for DAOD using graph matching, the cross-domain fusion method may cause class confusion to a certain extent. To enhance graph semantics, previous work [16] introduced an auxiliary node classification task which used the inherent information of nodes to classify the fused graph nodes across domains so that the model can perceive class differences more sensitively, as follows:

$$L_{node} = -\sum_{i=1}^{N_s} \tilde{y}_i^s \log\{\mathrm{softmax}[f(\tilde{v}_i^s)]\} - \sum_{i=1}^{N_t} \tilde{y}_i^t \log\{\mathrm{softmax}[f(\tilde{v}_i^t)]\}, \tag{1}$$

where $\tilde{v}_i^{s/t} \in \tilde{V}_{s/t}$ represents the node after the fusion of the source and target domain, $f$ represents a classifier with cross-entropy loss, $\tilde{y}_i^s \in y_r^s$ represents the ground-truth category label of $\tilde{v}_i^s$ (if $\tilde{v}_i^s$ is sampled from $x_r^s$), $\tilde{y}_i^t$ represents the pseudo-label of $\tilde{v}_i^t$ (obtained through propagating target domain features forward), and $N_s$ and $N_t$ are the number of nodes in the source and target domains, respectively. However, previous work adopted a node-based approach, which simply considers the inherent features of nodes without fusing the local region information of nodes when learning the node features, ultimately resulting in biased classification results.
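As a small worked illustration of Equation (1), the sketch below computes the summed cross-entropy over a list of nodes in plain Python. It assumes that the classifier outputs are already available as lists of logits and that labels are integer class indices (equivalent to one-hot labels in the equation); the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one node's class logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def node_classification_loss(logits_per_node, labels):
    """Sum of -log softmax(f(v))[y] over nodes, i.e. one sum of Equation (1)."""
    loss = 0.0
    for logits, label in zip(logits_per_node, labels):
        loss -= math.log(softmax(logits)[label])
    return loss
```

Under these assumptions, the full loss of Equation (1) is the value of this function on the source nodes with ground-truth labels plus its value on the target nodes with pseudo-labels.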

Intuitively, compared to the feature of the node itself, the local region features contain richer semantic information, which is beneficial for enhancing the class distinguishability of the node in node classification tasks and obtaining accurate classification results. Therefore, we designed a novel patch-based auxiliary node classification method (PANC) that integrates the contextual information of nodes during node classification, thereby achieving accurate node classification results. Specifically, for each node $\tilde{v}_i^{s/t}$, we utilize the K-Nearest Neighbor (KNN) method to seek its k-nearest neighbor nodes $N(\tilde{v}_i^{s/t}) = \{\tilde{v}_{i1}^{s/t}, \tilde{v}_{i2}^{s/t}, \dots, \tilde{v}_{ik}^{s/t}\}$. These k neighbor nodes are then fused with node $\tilde{v}_i^{s/t}$ to construct a patch representation:

$$P(\tilde{v}_i^{s/t}) = \mathrm{Concat}\Big(\tilde{v}_i^{s/t}, \frac{\alpha}{k} \sum_{\tilde{v}_j^{s/t} \in N(\tilde{v}_i^{s/t})} \tilde{v}_j^{s/t}\Big), \tag{2}$$

where $\alpha \in (0, 1]$ denotes a weighting factor designed to enhance the effectiveness and reliability of local region features during the feature fusion, which increases with the training process. In Section 3.3, we explain in detail why $\alpha$ is needed and how it is designed. Because the node classification task is added to prevent class confusion among nodes after domain fusion, we search for neighbor nodes across both domains to enhance the perception of the network for each class and mitigate the impact of domain discrepancy. Next, multi-layer convolutional neural networks are utilized to perceive the local region information of nodes and learn the local region feature representation $\tilde{P}(\tilde{v}_i^{s/t})$. Finally, the classification network integrates the local region representations of all nodes for classification, not only obtaining accurate and authentic classification results, but also reducing the risk of class confusion:

$$L_n = \sum_{i=1}^{N_s} l_i^s + \sum_{i=1}^{N_t} l_i^t, \tag{3}$$

where

$$l_i^{s/t} = -\tilde{y}_i^{s/t} \log\{\mathrm{softmax}[f(\tilde{P}(\tilde{v}_i^{s/t}))]\}. \tag{4}$$

The structure diagram and detailed process of PANC are shown in Figure 2 and Algorithm 1, respectively.

Algorithm 1. Patch-based Auxiliary Node Classification (PANC)

Input: domain fusion nodes $\tilde{V}_{s/t} = \{\tilde{v}_i^{s/t}\}_{i=1}^{N_{s/t}}$

Output: node classification loss $L_n$

1. for $i = 1, 2, \dots, N_{s/t}$ do
2.     Find the k neighbor nodes closest to $\tilde{v}_i^{s/t}$, i.e., $N(\tilde{v}_i^{s/t}) = \{\tilde{v}_{i1}^{s/t}, \tilde{v}_{i2}^{s/t}, \dots, \tilde{v}_{ik}^{s/t}\}$
3.     Merge these k neighbor nodes $\tilde{v}_{i1}^{s/t}, \tilde{v}_{i2}^{s/t}, \dots, \tilde{v}_{ik}^{s/t}$ with node $\tilde{v}_i^{s/t}$ to construct a patch representation $P(\tilde{v}_i^{s/t})$ using Equation (2)
4.     Employ multi-layer convolutional neural networks to perceive local region information of $\tilde{v}_i^{s/t}$ and learn local region feature representation $\tilde{P}(\tilde{v}_i^{s/t})$, i.e., $\tilde{P}(\tilde{v}_i^{s/t}) = \mathrm{CNNs}(P(\tilde{v}_i^{s/t}))$
5.     Calculate the classification cross-entropy loss $l_i^{s/t}$ of $\tilde{v}_i^{s/t}$ based on the local region feature representation $\tilde{P}(\tilde{v}_i^{s/t})$ using Equation (4)
6. end for
7. Calculate the node classification loss $L_n$ using Equation (3)
8. return $L_n$
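Steps 2 and 3 of Algorithm 1, i.e., the patch construction of Equation (2), can be sketched in plain Python as follows. This is only an illustration: the function names are hypothetical, nodes from both domains are assumed to be pooled into one list of feature vectors, and the inner product is assumed as the similarity measure for the KNN search, which the text does not specify.

```python
def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_patch(i, nodes, k, alpha):
    """Patch of node i: Concat(v_i, alpha / k * sum of its k nearest neighbours)."""
    if not 0 < k < len(nodes):
        raise ValueError("k must be between 1 and len(nodes) - 1")
    v = nodes[i]
    # Rank every other node by similarity to v (inner product is an assumption).
    others = [j for j in range(len(nodes)) if j != i]
    others.sort(key=lambda j: dot(v, nodes[j]), reverse=True)
    neighbours = others[:k]
    # Weighted mean of the neighbour features, scaled by alpha.
    pooled = [alpha / k * sum(nodes[j][d] for j in neighbours)
              for d in range(len(v))]
    # Concatenation doubles the feature length.
    return list(v) + pooled
```

With a small `alpha`, the second half of the patch stays close to zero and the classifier sees mostly the node's own feature, which is the behavior the progressive strategy of Section 3.3 relies on.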

In this section, we elaborate on our designed patch-based auxiliary node classification method. By combining the local region information of nodes to construct patches and fusing the inherent and local features of nodes, our method compensates for the lack of contextual information of nodes, making node classification more accurate. Thus, we reduce the risk of class confusion, which facilitates fine-grained domain adaptation. However, we found that the local region features of nodes may not be effective or reliable during some stages of network training. We will discuss this issue in Section 3.3 and introduce our progressive node feature fusion strategy as a solution.

3.3. Progressive Node Feature Fusion Strategy

In the last section, we attempted to compensate for the missing contextual information used for node classification by fusing the inherent features of nodes with local region features. This enriched the semantics input into the network and produced accurate node classification results. However, during some stages of network training, the local region features of nodes may not be effective and reliable. If we directly integrate the inherent and local region features of nodes throughout the entire model training period, i.e., $\alpha$ is always 1 in Equation (2), suboptimal results will be produced, which is not conducive to achieving the most accurate node classification. Next, we will divide the entire process of network training into early, middle, and later stages, and analyze the characteristics of the network during these three stages, as well as the effectiveness and reliability of the local region features of nodes in the corresponding periods. Based on our analysis results, we will provide our progressive node feature fusion strategy. Our strategy can sufficiently utilize effective and reliable local region features while minimizing the negative impact of ineffective and unreliable local region features, ensuring the effectiveness and reliability of the proposed PANC.

In the early stage of training, the network has not yet sufficiently learned the features and patterns of the input data, so its outputs are unstable; the node features it learns and the connections among nodes are unreliable at this time. For example, the network may focus excessively on color and output a high similarity between black cars and pedestrians wearing black hats. Note that the local region features of a node originate from its neighbor nodes, i.e., the k nodes with the highest similarity. Unreliable connections or similarities between nodes therefore imply unreliable local region features. Consequently, if the inherent features of nodes are directly fused with local region features in the early stage of training, node features from black cars and from pedestrians wearing black hats are highly likely to be fused into the same patch representations, resulting in confused node classification results. We summarize the characteristics of the network in the early stage of training as a weak foundation. During this period, the local region features of nodes are unreliable and thus not effective. Although the network is continuously updated during training and can attempt to correct early errors, the experimental results in Section 4.4 show that this updating ability can only prevent chaotic results; it cannot pull the network out of the errors it has already fallen into in the early stage. Therefore, we cannot expect the network to counteract, on its own, the unreliability of local region features introduced in the early stage of training. It is essential to avoid relying on these unreliable local region features in the early stage.

Here, we temporarily set aside the middle stage of network training and discuss the later stage. We assume that the network is almost unaffected by the local region features of nodes in the early and middle stages. In the later stage, the network has learned the features and patterns of the input data sufficiently, or even excessively, so its results are stable; most parameters have settled and are unlikely to change significantly. The network's understanding of the features and patterns of the input data has also become stable, making fundamental changes difficult. We summarize the characteristics of the network in the later stage of training as rigid thinking. During this period, the local region features of nodes are reliable. However, because of this rigid thinking, even reliable local region features cannot fundamentally update the network's understanding of the input data. Therefore, the local region features of nodes are not effective in this stage, and we cannot expect them to bring rapid changes to the network's understanding. The reliable local region features of nodes should be fully learned before the later stage of training so that they can effectively enhance the network's understanding of the features and patterns of the input data.

The middle stage of network training lies between the early and later stages. By the middle stage, the network has gone through the exploration of the early stage and has sufficiently learned the features and patterns of the input data. Its results are generally stable and reliable, which implies that its understanding of the features of nodes and their local regions is now trustworthy. Moreover, the network has not yet entered the later stage of training, so its understanding of the input data is still active, i.e., it can be effectively improved through the influence of the local region features of nodes. We summarize the characteristics of the network in the middle stage of training as a solid foundation and active thinking. During this period, the local region features of nodes are both reliable and effective. Therefore, we rely on them in this stage to improve the network's understanding of the features and patterns of the input data as much as possible.

Based on the above analysis, we designed a weighting factor $\alpha$ that increases with the training process for feature fusion in Equation (2) to gradually adjust the importance of the local region features of nodes at different stages of network training. The weighting factor we designed can guide the network to effectively utilize and integrate contextual information to achieve accurate node classification, which can be written as:

$$\alpha = \begin{cases} \dfrac{\sigma}{i'} \times \mathit{iter}, & \mathit{iter} \leq i', \\ \sigma + \dfrac{1 - \sigma}{i'' - i'} \times (\mathit{iter} - i'), & i' < \mathit{iter} \leq i'', \\ 1, & \mathit{iter} > i'', \end{cases} \tag{5}$$

where $\mathit{iter}$ denotes the epoch of the network training; $\sigma$, $i'$ and $i''$ are hyperparameters that satisfy $\sigma \in (0, 1)$, $i', i'' \in \mathbb{Z}^{+}$, and $i' < i''$. In the early stage of training, i.e., $\mathit{iter} \leq i'$, the weight $\alpha$ of local region features increases linearly from 0 to $\sigma$ to avoid involving overly complex information while the network has not yet learned enough. Since $\sigma$ is usually set very small, this growth is quite slow. In the middle stage of training, i.e., $i' < \mathit{iter} \leq i''$, $\alpha$ increases linearly from $\sigma$ to 1, so that the understanding of the network for the features and patterns of input data can be improved through local region feature information during this period of solid foundation and active thinking. If $\sigma$ is set very small, this growth is comparatively rapid. In the later stage of training, i.e., $\mathit{iter} > i''$, we simply maintain $\alpha = 1$ to slightly adjust the parameters, because the recognition of the network for the features and patterns of input data has been basically improved in the middle stage of training. By designing the weight of local region features for segmented growth, we have found a good method to fuse the inherent and local region features of nodes during network training to produce accurate classification results. In order to visually summarize our progressive node feature fusion strategy, the above description is integrated into Table 1 and an example of $\alpha$ changing with $\mathit{iter}$ is provided in Figure 3.
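The piecewise schedule of Equation (5) can be written as a short Python function. The function and parameter names (`fusion_weight`, `i1` for i′, `i2` for i″) are hypothetical, and the values used in the example afterwards are chosen only for illustration; they are not the settings reported in Section 4.2.

```python
def fusion_weight(epoch, sigma, i1, i2):
    """Weight alpha of the local region features, following Equation (5).

    epoch: current training epoch (iter in the text)
    sigma: value reached at the end of the early stage, 0 < sigma < 1
    i1, i2: stage boundaries i' and i'', positive integers with i1 < i2
    """
    if not (0.0 < sigma < 1.0 and 0 < i1 < i2):
        raise ValueError("need 0 < sigma < 1 and 0 < i1 < i2")
    if epoch <= i1:
        # Early stage: slow linear growth from 0 to sigma.
        return sigma / i1 * epoch
    if epoch <= i2:
        # Middle stage: linear growth from sigma to 1.
        return sigma + (1.0 - sigma) / (i2 - i1) * (epoch - i1)
    # Later stage: the weight stays at 1.
    return 1.0
```

For instance, with the illustrative values sigma = 0.1, i1 = 10, and i2 = 20, the weight is 0.05 at epoch 5, reaches 0.1 at epoch 10, climbs to 0.55 at epoch 15, and stays at 1 from epoch 20 onward.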

We analyze the characteristics of the network at different training periods and the reliability and effectiveness of the local region features of nodes in this section. In addition, we explain the reasons why the local region features of nodes may not be reliable and effective in the early and later stages of network training. The progressive local region feature weighting factor $\alpha$ we designed can fully utilize effective and reliable local region features, while minimizing the negative impact of ineffective and unreliable local region features as much as possible. Therefore, fused features compensate for missing contextual information used for node classification, improving the reliability and effectiveness of the recognition of the network for the features and patterns of input data, and achieving accurate node classification. We will provide a detailed introduction to the specific parameter settings in Section 4.2 and further analyze the superiority of the optimal parameters in Section 4.4.

3.4. Loss Function and Model Optimization

Compared with previous methods, our proposed method obtains more accurate and realistic node classification results, which benefits the subsequent graph matching and domain adaptive object detection. Next, the cross-domain-aware graph nodes $\tilde{V}_{s/t}$ are concatenated through a linear mapping. The semantic-aware node similarity matrix $M$ is then obtained using multi-layer perceptron layers, instance normalization layers [26], and differentiable Sinkhorn layers [27], which further models the correspondence between nodes in the graph. Finally, we use the edges $E_{s/t}$ of graph $G_{s/t}$ as a quadratic constraint to optimize graph matching, guiding fine-grained domain adaptation through a structure-aware matching loss. The total loss function of our model is as follows:

$$L = L_{det} + L_{GA} + L_{NA} + λ_{1} L_{n} + λ_{2} L_{m}, \qquad (6)$$

where $L_{det}$ is the detection loss [16], $L_{GA}$ is the global alignment loss [23], $L_{NA}$ is the node alignment loss [16], $L_{n}$ is our proposed node classification loss, and $L_{m}$ is the graph matching loss [16]. $λ_{1}$ and $λ_{2}$ are both set to 0.1.
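To make the weighting in Equation (6) concrete, the sketch below combines five already-computed loss terms with the stated weights. The function and argument names are hypothetical, and the numbers in the usage line are illustrative values rather than results from the paper.

```python
def total_loss(l_det, l_ga, l_na, l_n, l_m, lambda_1=0.1, lambda_2=0.1):
    """Weighted sum of the five loss terms in Equation (6).

    Hypothetical helper: the inputs may be plain floats or any objects
    that support + and * (for example, framework tensors).
    """
    return l_det + l_ga + l_na + lambda_1 * l_n + lambda_2 * l_m


# Illustrative values only: 1.0 + 0.5 + 0.25 + 0.1 * 2.0 + 0.1 * 4.0
example = total_loss(1.0, 0.5, 0.25, 2.0, 4.0)
```

The detection and alignment terms enter with unit weight, so only the node classification and graph matching terms are scaled down by the factor 0.1.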

4. Experiments

This section first introduces the datasets used in the experiments and the implementation details. We then compare our model with existing state-of-the-art works to demonstrate the effectiveness of the proposed method, perform a detailed parameter analysis to validate the parameter choices in our model, and finally conduct qualitative visualization analyses to verify the advantages of our model from an intuitive perspective.

4.1. Datasets

Cityscapes→Foggy Cityscapes. Cityscapes [17] is a street-view dataset captured under clear weather conditions in 50 different cities. It has 2975 training images and 500 validation images, with eight categories. Foggy Cityscapes [18] is a synthetic dataset obtained by adding simulated fog to the Cityscapes images, so the two datasets share the same scenes and annotations. Cityscapes is used as the source domain and Foggy Cityscapes as the target domain, so this scenario explores the domain disparity caused by different weather conditions.

Sim10k→Cityscapes. Sim10k [19] is a simulated dataset rendered from the video game Grand Theft Auto V. It contains 10,000 synthetic images and 58,701 annotated bounding boxes, all of the car category. In this adaptation scenario, Sim10k and Cityscapes are used as the source and target domain, respectively, to explore the domain discrepancy between synthetic and real data.

4.2. Implementation Details

We perform experiments using the FCOS [22] detector with a VGG-16 [28] backbone. In all experiments, we evaluate using mAP with an IoU threshold of 0.5. We train and test our model on a V100-SXM2-32 GB GPU, training for 30,000 iterations with a batch size of 4. We use the SGD optimizer with a learning rate of 0.0025, a momentum of 0.9, and a weight decay of $5 \times 10^{-4}$. In addition, we set $σ$ to 0.2, $i'$ to 10,000, and $i''$ to 20,000. Our model is implemented in the PyTorch 1.5.1 [29] deep-learning framework.
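The IoU threshold of 0.5 used for evaluation can be illustrated with a short, self-contained sketch. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` format and shows only the overlap criterion. A full mAP evaluation additionally ranks detections by confidence and matches each ground-truth box at most once, which is omitted here. The function names are hypothetical.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, pred_label, gt_label, thresh=0.5):
    """A prediction counts as correct if the class matches and IoU >= thresh."""
    return pred_label == gt_label and box_iou(pred_box, gt_box) >= thresh
```

For example, two 10 × 10 boxes shifted by 5 pixels overlap with an IoU of 1/3 and would not count as a match, whereas a shift of 2 pixels gives an IoU of 2/3 and would.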

4.3. Comparison with SOTA Methods

Cityscapes→Foggy Cityscapes. Table 2 compares the proposed model with other models. Compared with the EPM [23], CTRP [30], GPA [15], DBGL [31], RPA [32], FGRR [33], and SIGMA [16] domain adaptation models, our proposed model achieves the highest mAP (42.1%), 0.7 points above the strongest baseline, SIGMA (41.4%). For the person and car categories in particular, the average precision reaches 44.8% and 60.8%, respectively, exceeding the results of all other models. This performance indicates that our proposed model adapts well to different weather conditions.
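As a quick arithmetic check on the mAP column, the snippet below recomputes it from the per-class APs in Table 2 for EPM, SIGMA, and our model. It assumes that mAP is the unweighted mean of the eight per-class APs. The dictionary and function names are hypothetical.

```python
# Per-class AP (%) copied from Table 2, in the order:
# person, rider, car, truck, bus, train, motor, bike.
PER_CLASS_AP = {
    "EPM": [43.9, 41.2, 60.1, 22.5, 49.8, 35.3, 23.7, 36.4],
    "SIGMA": [44.4, 43.3, 60.4, 25.5, 43.9, 45.4, 31.9, 36.7],
    "Ours": [44.8, 42.6, 60.8, 30.9, 48.2, 43.8, 28.3, 37.7],
}


def mean_ap(per_class_ap):
    """Unweighted mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)


for name, aps in PER_CLASS_AP.items():
    print(f"{name}: {mean_ap(aps):.1f}")
```

The recomputed means round to 39.1, 41.4, and 42.1, matching the mAP values reported for these three rows.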

Sim10k→Cityscapes. Table 3 reports the results of our model and other models on car, the category shared by the two datasets. Compared with the EPM [23], KTNet [24], MeGA [34], RPA [32], MGA [35], FGRR [33], and SIGMA [16] domain adaptation models, our proposed model reaches the highest AP (53.0%), ahead of EPM (52.3%) and SIGMA (51.9%), verifying the effectiveness of our method. The domain gap between synthetic and real data is significant, which places higher demands on domain adaptive object detectors. That our model performs best in this setting indicates a stronger domain adaptive detection ability.

4.4. Parameter Analysis

In this section, we provide the parameter analysis of the adaptation from Sim10k to Cityscapes using VGG-16 as the backbone network, as shown in Table 4, in order to validate the design of the optimal parameters in our method.

In Table 4, the first column indicates whether the progressive node feature fusion strategy (PS) is used, the second column gives the value of the parameter $σ$, and the third column gives the k value in the KNN, i.e., the number of searched neighbor nodes. The experimental results show that fusing the inherent and local region information of nodes can compensate for missing contextual information, achieving accurate node classification and enhancing the effectiveness of the model. Furthermore, the AP in the second row of Table 4 is unsatisfactory, which is consistent with our analysis in Section 3.3 that the network struggles for a long time to correct the erroneous cognition caused by its weak foundation in the early stage of training. Our progressive node feature fusion strategy is therefore also indispensable: it guides both the degree and the timing of node feature fusion, improving the network's understanding of the features and patterns of the input data, which is needed for the fused features to compensate for missing contextual information. The experiments show that setting the k value of the KNN to 5 and $σ$ to 0.2 achieves the best performance (53.0% AP on car).
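To illustrate what the parameter k controls, the sketch below finds the k nearest neighbor nodes of a query node and averages their features together with the query's own. This is only one possible reading, not the authors' implementation: the exact patch construction is defined in Section 3.2, and the Euclidean distance between node feature vectors, the inclusion of the query node in the average, and the function names are all assumptions made for illustration.

```python
import math


def knn_indices(features, query_idx, k=5):
    """Indices of the k nodes closest to features[query_idx].

    Assumption: closeness is Euclidean distance between feature vectors.
    """
    query = features[query_idx]
    dists = [
        (math.dist(query, feat), i)
        for i, feat in enumerate(features)
        if i != query_idx
    ]
    dists.sort()
    return [i for _, i in dists[:k]]


def mean_patch_feature(features, query_idx, k=5):
    """Average the query node's feature with those of its k nearest neighbors."""
    idx = [query_idx] + knn_indices(features, query_idx, k)
    dim = len(features[query_idx])
    return [sum(features[i][d] for i in idx) / len(idx) for d in range(dim)]
```

Under this reading, a larger k pulls in more distant nodes, which would be consistent with the lower AP at k = 7 and k = 9 in Table 4, although the table alone does not establish the cause.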

4.5. Visualization Analysis

4.5.1. Result Visualization Analysis

We present an intuitive comparison of the results of (a) the EPM [23], (b) SIGMA [16], (c) our proposed model, and (d) the ground truth in Figure 4. In the EPM and SIGMA detection results, some targets are not detected, and many bounding boxes are assigned incorrect labels. Compared with the EPM and SIGMA, our proposed model yields more precise detection results; that is, it recognizes more foreground targets and provides higher-quality bounding boxes and more accurate classification labels. For example, in the fourth row, the EPM mistakenly detects the bus as a train and SIGMA fails to detect the bus, while our model detects it correctly; in the fifth row, both the EPM and SIGMA mistakenly detect the truck as a bus, while our model properly detects and locates the truck. Our model therefore generates the detection results closest to the ground truth, which intuitively validates the effectiveness of our method.

4.5.2. Feature Visualization Analysis

As displayed in Figure 5, we use t-SNE [36] to visualize the distribution of foreground and background features obtained by the EPM [23], SIGMA [16], and our model in the cross-domain scene from Cityscapes to Foggy Cityscapes. The foreground and background features extracted by the EPM show prominent domain differences. Although the features extracted by SIGMA effectively separate foreground from background, the two domains are not well aligned for the foreground category, which appears as a concentrated red region. In contrast, our model distinguishes foreground and background features well, keeping them as far apart as possible. Furthermore, the representations of the same class in the two domains are better aligned by our method, i.e., the foreground (background) of the source domain lies as close as possible to the foreground (background) of the target domain. The above analysis confirms that our proposed model improves feature alignment.

Similarly, as illustrated in Figure 6, we use t-SNE [36] to visualize the distribution of the eight foreground category features in the source and target domains obtained by the EPM [23], SIGMA [16], and our model in the cross-domain scene from Cityscapes to Foggy Cityscapes. For each category, we randomly sample the same number of node representations from the features of each domain. The EPM can neither distinguish the classes well nor align the features of the two domains satisfactorily. Although SIGMA can generally distinguish the categories, it cannot effectively separate the truck and bus classes, so the two classes merge in the visualization. In comparison, our model clearly distinguishes each category in feature space, including similar categories such as truck and bus, and the class clusters are more distinct. Overall, the visualization results support the superior performance of our model.

5. Conclusions

In this paper, we proposed a novel patch-based auxiliary node classification method which provides a more comprehensive model framework for DAOD based on graph matching. Specifically, our main contributions are as follows: (1) The proposed method combines the local region information of nodes and constructs patch representations of nodes for node classification (i.e., the PANC part in Figure 1), compensating for missing contextual information and obtaining accurate and authentic classification results. The risk of class confusion is thus reduced, which facilitates the subsequent graph matching domain adaptation. (2) We designed a progressive node feature fusion strategy (i.e., parameter $α$ in Figure 2) to ensure that the network can stably and reliably utilize the learned local region features for accurate node classification. (3) We conducted extensive experiments on multiple domain adaptation scenarios to validate the effectiveness of our model. The experiments showed that our model outperformed the compared works in both scenarios, reaching 42.1% mAP on Cityscapes→Foggy Cityscapes and 53.0% AP on Sim10k→Cityscapes.

In the future, we will consider using more reliable methods to replace the averaging operation when constructing patch representations in order to reduce the loss of local region information. Moreover, we will try to provide an adaptive parameter-tuning method to avoid the uncertainty and huge cost of manual parameter tuning. Finally, it may also be beneficial to design a non-linear scheme for the weight of local region features to vary with the current training iteration, as the non-linear scheme may better fit the practical situation of network training compared to the linear scheme.

Author Contributions

Conceptualization, Y.Q., Z.X. and J.Z.; methodology, Y.Q.; software, Y.Q. and Z.X.; validation, Y.Q.; formal analysis, Y.Q.; investigation, Y.Q. and J.Z.; resources, Y.Q. and Z.X.; data curation, Y.Q.; writing—original draft preparation, Y.Q.; writing—review and editing, Y.Q. and Z.X.; visualization, Y.Q.; supervision, Y.Q. and J.Z.; project administration, Y.Q.; funding acquisition, Y.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation Project (grant No. 42371416) and the Beijing University of Civil Engineering and Architecture Graduate Student Innovation Project (grant No. PG2023147).

Data Availability Statement

The datasets used in our study are publicly available at https://www.cityscapes-dataset.com/downloads/ (accessed on 20 March 2024) and https://fcav.engin.umich.edu/projects/driving-in-the-matrix (accessed on 20 March 2024).

Acknowledgments

We thank all the reviewers who provided useful recommendations.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 October 2016; pp. 21–37.
2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1440–1448.
3. Tzeng, E.; Hoffman, J.; Zhang, N.; Saenko, K.; Darrell, T. Deep domain confusion: Maximizing for domain invariance. arXiv 2014, arXiv:1412.3474.
4. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1180–1189.
5. Bousmalis, K.; Silberman, N.; Dohan, D.; Erhan, D.; Krishnan, D. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 3722–3731.
6. Koga, Y.; Miyazaki, H.; Shibasaki, R. A method for vehicle detection in high-resolution satellite images that uses a region-based object detector and unsupervised domain adaptation. Remote Sens. 2020, 12, 575.
7. Koga, Y.; Miyazaki, H.; Shibasaki, R. Adapting Vehicle Detector to Target Domain by Adversarial Prediction Alignment. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2341–2344.
8. Wu, W.; Zheng, J.; Fu, H.; Li, W.; Yu, L. Cross-regional oil palm tree detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 56–57.
9. Lu, X.; Zhong, Y. A Novel Global-Local Adversarial Network for Unsupervised Cross-Domain Road Detection. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2775–2778.
10. Shao, Y.; Li, L.; Ren, W.; Gao, C.; Sang, N. Domain adaptation for image dehazing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 2808–2817.
11. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain adaptive Faster R-CNN for object detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 19–21 June 2018; pp. 3339–3348.
12. He, Z.; Zhang, L. Multi-adversarial Faster-RCNN for unrestricted object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6668–6677.
13. Saito, K.; Ushiku, Y.; Harada, T.; Saenko, K. Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6956–6965.
14. Xu, C.D.; Zhao, X.R.; Jin, X.; Wei, X.S. Exploring categorical regularization for domain adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11724–11733.
15. Xu, M.; Wang, H.; Ni, B.; Tian, Q.; Zhang, W. Cross-domain detection via graph-induced prototype alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 12355–12364.
16. Li, W.; Liu, X.; Yuan, Y. SIGMA: Semantic-complete graph matching for domain adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 5291–5300.
17. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
18. Sakaridis, C.; Dai, D.; Van Gool, L. Semantic foggy scene understanding with synthetic data. Int. J. Comput. Vis. 2018, 126, 973–992.
19. Johnson-Roberson, M.; Barto, C.; Mehta, R.; Sridhar, S.N.; Rosaen, K.; Vasudevan, R. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In Proceedings of the IEEE International Conference on Robotics and Automation, Singapore, 29 May–3 June 2017; pp. 746–753.
20. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
22. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
23. Hsu, C.C.; Tsai, Y.H.; Lin, Y.Y.; Yang, M.H. Every pixel matters: Center-aware feature alignment for domain adaptive object detector. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 733–748.
24. Tian, K.; Zhang, C.; Wang, Y.; Xiang, S.; Pan, C. Knowledge mining and transferring for domain adaptive object detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9133–9142.
25. Zhao, L.; Wang, L. Task-specific inconsistency alignment for domain adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 14217–14226.
26. Fu, K.; Liu, S.; Luo, X.; Wang, M. Robust point cloud registration framework based on deep graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8893–8902.
27. Sinkhorn, R. A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Stat. 1964, 35, 876–879.
28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
29. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
30. Zhao, G.; Li, G.; Xu, R.; Lin, L. Collaborative training between region proposal localization and classification for domain adaptive object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 86–102.
31. Chen, C.; Li, Z.; Zheng, Z.; Huang, Y.; Ding, X.; Yu, Y. Dual bipartite graph learning: A general approach for domain adaptive object detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2703–2712.
32. Zhang, Y.; Wang, Z.; Mao, Y. RPN prototype alignment for domain adaptive object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12425–12434.
33. Chen, C.; Li, J.; Zhou, H.Y.; Han, X.; Huang, Y.; Ding, X.; Yu, Y. Relation matters: Foreground-aware graph-based relational reasoning for domain adaptive object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3677–3694.
34. VS, V.; Gupta, V.; Oza, P.; Sindagi, V.A.; Patel, V.M. MeGA-CDA: Memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4516–4526.
35. Zhou, W.; Du, D.; Zhang, L.; Luo, T.; Wu, Y. Multi-granularity alignment domain adaptation for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 9581–9590.
36. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

Figure 1. Overview of our model. First, a shared feature extractor is used to obtain image-level features, and node sampling is performed to generate the original node set. We obtain imaginary nodes for the missing classes through node completion, yielding a semantically complete node set. Then, we construct the graph structure and perform domain fusion on the graph nodes to obtain the fused node set. Subsequently, we perform the patch-based auxiliary node classification task. The cross-domain-aware graph nodes are concatenated through linear mapping. We use MLP, instance normalization, and Sinkhorn layers to obtain the similarity matrix. Finally, fine-grained domain adaptation is guided through the structure-aware matching loss.

Figure 2. The structure diagram of patch-based auxiliary node classification (PANC).

Figure 3. An example of $α$ changing with $iter$. In the early stage of training, $α$ increases linearly from 0 to $σ$. Since $σ$ is usually set very small, this growth is quite slow. In the middle stage, $α$ increases linearly from $σ$ to 1, which is a rapid growth. In the later stage, $α = 1$ is simply maintained to fine-tune the parameters.

Figure 4. Comparison of results among the (a) EPM [23], (b) SIGMA [16], (c) our proposed model, and (d) the ground truth from Cityscapes to Foggy Cityscapes. Compared with the EPM and SIGMA, our model can eliminate some mistaken detection errors (false positives), such as car in the first row, bus in the fourth row, and truck in the fifth row. Moreover, our proposed model also reduces some missed detection errors (false negatives), such as car in the second row, truck in the third row, and bus in the fourth row. (Best viewed in color).

Figure 5. The t-SNE visualization of the foreground and background feature embeddings among the (a) EPM [23], (b) SIGMA [16], and (c) our proposed model for cross-domain tasks from Cityscapes to Foggy Cityscapes. (Best viewed in color).

Figure 6. The t-SNE visualization of eight foreground category feature embeddings among the (a) EPM [23], (b) SIGMA [16], and (c) our proposed model for cross-domain tasks from Cityscapes to Foggy Cityscapes. (Best viewed in color).

Table 1. Characteristics of the network during the early, middle, and later stages of training, the reliability and effectiveness of the local region features of nodes in each stage, and how the weight of the local features changes.

Stage	Network’s Characteristic	Reliable	Effective	Weight of Local Features ( $α$ )
Early	Weak foundation	No	No	Increase slowly
Middle	Solid foundation and active thinking	Yes	Yes	Increase rapidly
Later	Rigid thinking	Yes	No	Keep unchanged

Table 2. The experimental results (%) from Cityscapes to Foggy Cityscapes using VGG-16. GPA uses ResNet-50 as the backbone network. The data in bold represent the best results in their respective categories.

Method	Person	Rider	Car	Truck	Bus	Train	Motor	Bike	mAP
EPM ECCV’20	43.9	41.2	60.1	22.5	49.8	35.3	23.7	36.4	39.1
CTRP ECCV’20	32.7	44.4	50.1	21.7	45.6	25.4	30.1	36.8	35.9
GPA(Res-50) CVPR’20	32.9	46.7	54.1	24.7	45.7	41.1	32.4	38.7	39.5
DBGL ICCV’21	33.5	46.4	49.7	28.2	45.9	39.7	34.8	38.3	39.6
RPA CVPR’21	33.6	43.8	49.6	32.9	45.5	46.0	35.7	36.8	40.5
FGRR TPAMI’22	33.5	46.4	49.7	28.2	45.9	39.7	34.8	38.3	39.6
SIGMA CVPR’22	44.4	43.3	60.4	25.5	43.9	45.4	31.9	36.7	41.4
Ours	44.8	42.6	60.8	30.9	48.2	43.8	28.3	37.7	42.1

Table 3. The experiment results (%) from Sim10k to Cityscapes using VGG-16. The data in bold represents the best results in their respective categories.

Method	AP on Car
EPM ECCV’20	52.3
KTNet ICCV’21	50.7
MeGA CVPR’21	44.8
RPA CVPR’21	45.7
MGA CVPR’22	49.8
FGRR TPAMI’22	44.5
SIGMA CVPR’22	51.9
Ours	53.0

Table 4. Parameter analysis of adaptation from Sim10k to Cityscapes using VGG-16 as the backbone network (%). PS represents our proposed progressive node feature fusion strategy. The data in bold represents the best results in their respective categories.

Use of PS	$σ$	k	AP on Car
-	-	5	50.1
√	0.1	5	49.1
√	0.2	5	53.0
√	0.3	5	51.2
√	0.2	7	51.6
√	0.2	9	51.6

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
