TransConv: Transformer Meets Contextual Convolution for Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) aims to transfer a classifier trained on a labeled source domain to a related unlabeled target domain. Recent progress in this line has evolved with the advance of network architectures, from convolutional neural networks (CNNs) to transformers or hybrids of the two. However, this advance comes at the cost of high computational overhead or complex training processes. In this paper, we propose an efficient alternative hybrid architecture that marries the transformer with contextual convolution (TransConv) to solve UDA tasks. Unlike previous transformer-based UDA architectures, TransConv has two special aspects: (1) reviving the multilayer perceptron (MLP) of transformer encoders with Gaussian channel attention fusion for robustness, and (2) feeding contextual features into highly efficient dynamic convolutions for cross-domain interaction. As a result, TransConv is able to calibrate interdomain feature semantics from both global and local features. Experimental results on five benchmarks show that TransConv attains remarkable results with high efficiency compared to existing UDA methods.


Introduction
Deep neural networks have achieved impressive success in a wide range of computer vision applications [1], but this success usually demands massive quantities of labeled data for better representations. It also rests on the assumption that the training and testing sets come from the same data distribution. Nevertheless, this assumption does not always hold in practice. One way out is unsupervised domain adaptation (UDA), which trains deep neural network models on richly labeled data from a related source domain. However, such supervised learning suffers from the domain shift issue, resulting in poor generalization performance on new target domains. To address this issue, considerable research effort has been devoted to UDA tasks [2][3][4][5], e.g., by bridging the distribution discrepancy, minimizing distance metrics, or adversarial learning. Among these works, most existing approaches build on convolutional neural network (CNN) frameworks to learn domain-invariant feature representations. Such features typically come from local receptive fields.
With the success of transformers in various visual tasks, recent UDA methods focus more on global features by using encoder-decoder frameworks, in contrast to the local features learned by CNN frameworks. The most advanced domain adaptation methods extract global image features by using a transformer architecture as the backbone network. Recent studies show that models with transformers clearly outperform those with pure convolutional neural networks. For example, the transferable vision transformer (TVT) [5] utilizes the transferability adaptation module of vision transformers (ViT) [6] for domain adaptation. The cross-domain transformer (CDTrans) [4] exploits the robustness of cross-attention in transformers to build a three-branch transformer model for UDA tasks. To take full advantage of both transformer and CNN architectures, a natural idea is to combine them. However, CDTrans uses a two-stage training method, which takes a long time and hinders rapid migration of the model. The challenge for hybrid models is how to maintain the robustness of cross-attention with high efficiency.
On one hand, we introduce a Gaussian attended MLP module to further strengthen the robustness of the transformer encoder by paying more attention to the major channel dimensions of features, thus improving feature quality. As shown in Figure 1, Gaussian attention attends to more important visual clues than the baseline. This is because the Gaussian distribution smooths the distribution of the attention weights, thereby filtering out noisy values. Moreover, since it only involves the mean and deviation, forming the attention is lightning-fast, and the corresponding extra overhead is negligible. On the other hand, the context information of features can enhance the spatial semantics of the 'Class Token (CLS-Token)' features. Inspired by ConvNeXt [7], which reparameterizes the transformer architecture into a fully CNN model for efficiency, we design an efficient dynamic convolution module with context information by using the Gaussian error linear unit (GELU) activation function and layer normalization. This module is also lightweight. In summary, the contributions of this paper are summarized as follows:
• We propose a novel hybrid model of transformers and convolution networks, termed TransConv. It improves the robustness of cross-attention with a Gaussian attended MLP module and meanwhile absorbs more semantics via a context-aware dynamic convolution module.

• TransConv better trades off model performance and efficiency as compared to the state of the art on five datasets.
The rest of this paper is organized as follows. First, we review the related work in Section 2. Then, Section 3 introduces the overall architecture of the proposed TransConv model and describes each improved module in detail. Section 4 reports the experimental results on five commonly used datasets along with ablation experiments. Finally, the conclusion of this paper and future work are given in Section 5.

Related Work
In this section, we will introduce the related work in four aspects: unsupervised domain adaptation, vision transformers, dynamic convolution, and contextual information.

Unsupervised Domain Adaptation
From the perspective of training methods, there are two main families of UDA: UDA based on metric learning and UDA based on adversarial learning. Metric learning UDA [8,9] measures the distribution difference between domains by defining a distance metric, so UDA can be formulated as a distance minimization problem. For example, the maximum mean discrepancy (MMD) [10] metric has been widely used in UDA methods. Adversarial learning UDA [11,12] trains a domain discriminator and a feature learning network in an adversarial manner. The feature learning network learns domain-invariant features and attempts to fool the domain discriminator; when the discriminator cannot distinguish whether the input data come from the source domain or the target domain, the distributions of the two domains are assumed to be well aligned. From the perspective of alignment granularity, UDA methods can also be divided into domain-level UDA and category-level UDA. Domain-level UDA [13,14] alleviates the distribution difference by reducing the discrepancy between the overall distributions of the source and target domains. Category-level UDA [15,16] achieves more accurate fine-grained alignment by reducing the distribution discrepancy of each category across the source and target domains. The method adopted in this paper is a category-level UDA method based on metric learning. By exploring a hybrid model of transformers and CNNs, our method fully combines the advantages of both architectures to solve the UDA problem.
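For concreteness, the MMD metric mentioned above admits a simple kernel estimator. The following NumPy sketch is illustrative only (the RBF bandwidth `gamma` and the toy Gaussian data are assumptions, not the paper's setup); it computes the standard biased estimate of the squared MMD between two feature batches:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel matrix: k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2).
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(source, target, gamma=1.0):
    # Squared MMD estimate: E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)].
    return (rbf_kernel(source, source, gamma).mean()
            + rbf_kernel(target, target, gamma).mean()
            - 2.0 * rbf_kernel(source, target, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(0, 1, (64, 8)), rng.normal(0, 1, (64, 8)))
shifted = mmd2(rng.normal(0, 1, (64, 8)), rng.normal(2, 1, (64, 8)))
# A mean-shifted target distribution yields a larger discrepancy than a matched one.
```

Minimizing such a statistic over learned features is what turns UDA into the distance minimization problem described above.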

Vision Transformer
Transformers [17] were first proposed in the natural language processing (NLP) field and have shown excellent performance on NLP tasks [18][19][20]. As transformers moved from NLP to computer vision, many studies have shown their effectiveness in computer vision tasks [21][22][23]. ViT [6] was the first work to apply transformers from NLP to computer vision; it is a pure transformer model without convolution. ViT-based variants [24,25] are widely used in image classification and downstream tasks such as object detection [26][27][28] and image segmentation [29,30]. In the unsupervised domain adaptation task, compared to pure convolutional models such as ResNet-50, transformer-type models are better at capturing global features through attention mechanisms. In addition, ResNet-50 relies on image-specific inductive biases, whereas transformers have no such inductive bias and benefit from large-scale pretraining data. As for hybrid network models, several studies [31,32] mix transformers with CNNs, further improving feature quality. This paper also explores the advantages of a hybrid model of ViT and convolutional neural networks from other viewpoints, such as context information and robustness.

Dynamic Convolution
In traditional regular convolution, the learned convolution kernel is fixed, which leads to performance degradation under domain shift. In contrast, the kernel in dynamic convolution [33][34][35] is generated dynamically from the input. To better adapt to domain shift across domains, dynamic convolution is used in our hybrid model instead of regular convolution. Recently, the sparse region-based convolutional neural network (Sparse R-CNN) [28] has also used dynamic convolution to improve the performance of transformer architectures on object detection tasks. To better understand the attention mechanism, [36] also compared dynamic convolution with regular convolution, deformable convolution, and transformer attention. ConvNeXt explores many strategies from convolutional neural networks to improve performance within the transformer model architecture. In this paper, the proposed dynamic convolution module also borrows some of ConvNeXt's schemes to improve the convolutional neural network module.

Contextual Information
Contextual information [37,38] plays a key role in image recognition. Without the help of contextual information, it is easy to misidentify objects, while integrating contextual information effectively improves computer vision systems. Therefore, compared with the local features of convolutional neural networks, the advantage of the transformer architecture lies in its use of global context features. In ViT, the learned 'CLS-Token' features serve as the global context features fed to the classifier for recognition, while Swin transformers neglect the 'CLS-Token' and instead use a global average pooling operation to output global context features for classification. The two ways are, in fact, orthogonal. In this paper, we mix them to form the new dynamic convolution module.

The Proposed Method
In this section, we first introduce the self-attention module in ViT and the improved Gaussian attended MLP module. After that, we improve the performance of the ViT-based hybrid model by combining contextual information with the dynamic convolution module. Lastly, we introduce our method, TransConv, which consists of three parts: a transformer encoder, contextual information combination, and dynamic convolution. The overall structure of the proposed TransConv is shown in Figure 2. The source domain image and the target domain image are each split into multiple patches and rearranged by patch embedding to output token features. These are fed into the transformer encoder, where layer normalization normalizes the features, the multi-head attention module adjusts the attention weights of spatial features, and the Gaussian attended MLP module adjusts the attention weights of channel features. The attended features are divided into the 'CLS-Token' branch and the average pooling branch, which provide global class-wise semantics and spatial features, respectively. Dynamic convolution makes them domain-agnostic. They are concatenated together and then classified by the classifier. Our method simultaneously optimizes the classification (cls) loss and the local maximum mean discrepancy (lmmd) loss; both losses are introduced in Section 3.3.

Transformer Encoder
The model designed in MLP-Mixer [39] uses a pure MLP structure with two types of MLP layers: channel-mixing MLPs and token-mixing MLPs. Inspired by MLP-Mixer, TransConv uses the self-attention module in ViT for token mixing and the Gaussian channel MLP module for channel mixing.
Self-Attention in Transformer. The basic module of ViT is the self-attention (SA) module. The inputs of SA are Q, K, and V, which represent the query, key, and value, respectively. To obtain the meaning of each token in the whole image, the query is dot-multiplied with the transposed keys, the result is scaled, and the softmax function finally yields the weights of the values. To provide more representational flexibility, multiple self-attention modules are concatenated to form the multi-head self-attention (MSA) module.
Attention(Q, K, V) = softmax(QKᵀ/√d)V,

where d is the dimension of Q and K.
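The scaled dot-product attention and its multi-head extension described above can be sketched in NumPy as follows. This is a minimal illustrative sketch: the weight matrices, their initialization, and the toy shapes are assumptions for demonstration, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = Q.shape[-1]
    return softmax(Q @ K.swapaxes(-1, -2) / np.sqrt(d)) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    # Project X, split into h heads, attend per head, concatenate, map by Wo.
    B, N, C = X.shape
    def heads(W):
        return (X @ W).reshape(B, N, h, C // h).transpose(0, 2, 1, 3)
    out = attention(heads(Wq), heads(Wk), heads(Wv))   # (B, h, N, C/h)
    out = out.transpose(0, 2, 1, 3).reshape(B, N, C)   # concatenate the heads
    return out @ Wo

rng = np.random.default_rng(0)
B, N, C, h = 2, 197, 64, 8                             # 196 patches + 1 CLS token
X = rng.normal(size=(B, N, C))
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (C, C)) for _ in range(4))
Y = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
```

Each head attends over all N tokens, which is exactly the global-receptive-field property contrasted with CNNs earlier in the paper.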
MSA(Q, K, V) = Concat(head_1, …, head_h)W^O, head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),

where QW_i^Q, KW_i^K, and VW_i^V are the projections of the different heads and W^O is a mapping function.

Gaussian Attended MLP in Transformer is an improvement of the MLP module in ViT. Gaussian channel attention is an alternative way to improve feature quality; it helps performance on UDA tasks and does not require complex training, because the Gaussian attended MLP module gains its denoising ability purely through end-to-end training. In effect, a scaling operation is performed on the MLP module: the attention weights are applied to the channel dimensions in TransConv, attending to important channels and decreasing the focus on unimportant ones. Inspired by channel attention methods such as the Gaussian context transformer (GCT) [40], the Gaussian attended MLP module adds a scaling operation to the MLP module, and the 'CLS-Token' feature is used to compute the weights that adjust the channel dimensions of the input feature, as shown in Figure 3. Specifically, given a feature map X ∈ R^{B×HW×C}, the global feature is represented by the learnable 'CLS-Token' feature, CLS-Token ∈ R^{B×1×C}, where B is the number of images in a batch, HW is the spatial dimension, and C is the channel dimension. First, the 'CLS-Token' feature is normalized:

ĈLS = (CLS-Token − µ)/σ,

where µ denotes the mean of the 'CLS-Token' feature and σ denotes its standard deviation. Then, a Gaussian function is used to calculate the attention weights:

G(x) = a · exp(−(x − b)²/(2c²)),

where a denotes the amplitude of the Gaussian function, b denotes its mean, and c denotes its standard deviation.
To simplify the operation, a is set to the constant 1, b to the constant 0, and c is left as a learnable parameter that controls the channel attention activation. The Gaussian function therefore simplifies to

g = exp(−ĈLS²/(2c²)).

Combining the above operations forms the Gaussian attended MLP, which can be formulated as

Y = X ⊙ g,

where X denotes the input features before the Gaussian attended MLP scaling, Y denotes the output features after it, and ⊙ denotes the broadcast element-wise product.
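A minimal NumPy sketch of the simplified Gaussian channel attention follows. It is illustrative only: the learnable parameter c is shown as a fixed constant, and for simplicity the sketch rescales all tokens by the channel weights, whereas Figure 3 applies the weights to the MLP output features.

```python
import numpy as np

def gaussian_channel_attention(x, c=0.5):
    # x: (B, HW+1, C) token features; x[:, :1] is the learnable 'CLS-Token'.
    cls = x[:, :1, :]                                  # (B, 1, C)
    mu = cls.mean(axis=-1, keepdims=True)
    sigma = cls.std(axis=-1, keepdims=True) + 1e-6
    cls_hat = (cls - mu) / sigma                       # normalize the CLS feature
    g = np.exp(-cls_hat ** 2 / (2.0 * c ** 2))         # simplified Gaussian (a=1, b=0)
    return x * g                                       # broadcast channel-wise scaling

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 197, 768))                # ViT-B/16: 196 patches + CLS
out = gaussian_channel_attention(tokens)
```

Since 0 < g ≤ 1, the operation only attenuates channels whose CLS activations deviate strongly from the mean, which is the smoothing/denoising effect described above; the mean and deviation are the only statistics needed, keeping the overhead negligible.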

Dynamic Convolution
Compared with regular convolution, dynamic convolution is more suitable for unsupervised domain adaptation because its kernels adapt to the input image; this requires an additional kernel generation module. The output features Y ∈ R^{B×(HW+1)×C} of the transformer encoder consist of two parts: the 'CLS-Token' features ∈ R^{B×1×C} and the features X ∈ R^{B×HW×C}.
Convolution kernel generation. First, X undergoes global average pooling to obtain a global spatial feature, while the 'CLS-Token' feature is already global, so no additional pooling is needed. Then, the global spatial feature is mapped to K dimensions through two fully connected (FC) layers with a GELU activation between them, and finally the softmax function normalizes the result. The K attention weights obtained in this way are assigned to the K kernels of this layer. Here, unlike the Gaussian attended MLP module, dynamic convolution takes the kernels as the attention objects.
Dynamic convolution. K 1 × 1 kernels are convolved with the global spatial feature, and the result of dynamic convolution is obtained by layer normalization (LN), as shown in Figure 5. Finally, to obtain the contextual information, the results of the 'CLS-Token' features and the features X obtained by dynamic convolution are concatenated and delivered to the classifier. The implementation of dynamic convolution is similar to a dynamic perceptron and can be summarized by the following formula:

y = Σ_{k=1}^{K} π_k(x)(W_k^T x + b_k),

where π_k denotes the attention weight of the kth linear function W_k^T x + b_k, which is generated by the kernel generation module and differs for different inputs x.
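The kernel generation and dynamic 1 × 1 convolution steps can be sketched together in NumPy. This is an illustrative sketch under stated assumptions: the number of kernels K, the width of the generation MLP, and the weight initialization are all hypothetical choices, not the paper's configuration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(z):
    # tanh approximation of the GELU activation.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))

class DynamicConv1x1:
    # K candidate 1x1 kernels mixed by input-dependent attention weights pi_k(x).
    def __init__(self, dim, K=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (K, dim, dim))    # K candidate kernels
        self.b = np.zeros((K, dim))
        self.fc1 = rng.normal(0, 0.02, (dim, dim // 4))  # kernel generation: FC-GELU-FC
        self.fc2 = rng.normal(0, 0.02, (dim // 4, K))

    def __call__(self, x):
        # x: (B, C) global spatial feature (CLS token or pooled patch features).
        pi = softmax(gelu(x @ self.fc1) @ self.fc2)    # (B, K) kernel attention
        W = np.einsum('bk,kcd->bcd', pi, self.W)       # per-sample aggregated kernel
        b = pi @ self.b
        y = np.einsum('bc,bcd->bd', x, W) + b          # dynamic perceptron sum_k pi_k (W_k^T x + b_k)
        mu = y.mean(-1, keepdims=True)
        sd = y.std(-1, keepdims=True) + 1e-6
        return (y - mu) / sd                           # layer normalization

dim = 16
layer = DynamicConv1x1(dim)
feat = np.random.default_rng(1).normal(size=(2, dim))
out = layer(feat)
```

Because the mixing weights π are a function of the input, the effective kernel differs per sample, which is what lets the convolution adapt across domains; the GELU and LN mirror the two ConvNeXt-inspired improvements ablated in Table 7.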

TransConv: Transformer Meets Convolution
The framework of the proposed TransConv is shown in Figure 2. It consists of two weight-sharing hybrid models, each comprising a transformer encoder and dynamic convolution. The source domain images and target domain images are sent to the source domain branch and the target domain branch, respectively. In these two branches, the hybrid model learns the representation of the corresponding domain. In the training phase, the classification result of the source domain is supervised by the labels of the source domain dataset. In the image classification task of UDA, a labeled source domain D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s} with n_s examples and an unlabeled target domain D_t = {x_j^t}_{j=1}^{n_t} with n_t examples are provided. The supervised classification loss L_cls for the source domain can be expressed as

L_cls = (1/n_s) Σ_{i=1}^{n_s} J(f(x_i^s), y_i^s),

where J(·,·) is the cross-entropy loss function and f is the TransConv hybrid network model. The target domain, however, has no labels; its classification results serve as pseudo-labels, and the distribution difference between the source and target domains is reduced by minimizing a metric learning loss between source domain features and target domain features. The metric learning loss selected in this paper is L_lmmd [41], a loss function for subdomain adaptation, which aligns the subdomain distributions of the source and target domains more accurately than global adaptation.
L_lmmd = (1/C) Σ_{c=1}^{C} [ Σ_{i=1}^{n_s} Σ_{j=1}^{n_s} w_i^{sc} w_j^{sc} k(z_i^{sl}, z_j^{sl}) + Σ_{i=1}^{n_t} Σ_{j=1}^{n_t} w_i^{tc} w_j^{tc} k(z_i^{tl}, z_j^{tl}) − 2 Σ_{i=1}^{n_s} Σ_{j=1}^{n_t} w_i^{sc} w_j^{tc} k(z_i^{sl}, z_j^{tl}) ],

where k(·,·) is a kernel function [42,43], z^l is the lth (l ∈ L = {1, 2, …, |L|}) layer activation, and w_i^{sc} and w_j^{tc} denote the weights of z_i^{sl} and z_j^{tl} belonging to class c, respectively. To summarize, the objective function of TransConv is

L = L_cls + α L_lmmd,

where α is a hyperparameter. The main steps of our method are reported in Algorithm 1.
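The class-weighted LMMD loss and the overall objective can be sketched as follows. This is a DSAN-style illustrative sketch under stated assumptions: a single RBF kernel on one feature layer (the paper uses kernel functions from [42,43] over layers l ∈ L), one-hot source labels as w^{sc}, and target pseudo-label probabilities as w^{tc}.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    # RBF kernel matrix: k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2).
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def lmmd(zs, zt, ws_probs, wt_probs, gamma=1.0):
    # Subdomain (class-weighted) MMD: source one-hot labels give ws_probs,
    # target pseudo-label probabilities give wt_probs; weights are normalized
    # per class within the batch.
    ws = ws_probs / (ws_probs.sum(0, keepdims=True) + 1e-6)   # (ns, C)
    wt = wt_probs / (wt_probs.sum(0, keepdims=True) + 1e-6)   # (nt, C)
    Kss, Ktt, Kst = rbf(zs, zs, gamma), rbf(zt, zt, gamma), rbf(zs, zt, gamma)
    C = ws.shape[1]
    loss = 0.0
    for c in range(C):  # per-class source/source + target/target - 2 source/target terms
        loss += (ws[:, c] @ Kss @ ws[:, c]
                 + wt[:, c] @ Ktt @ wt[:, c]
                 - 2.0 * ws[:, c] @ Kst @ wt[:, c])
    return loss / C

def objective(l_cls, l_lmmd, alpha=0.1):
    # Total TransConv objective: L = L_cls + alpha * L_lmmd.
    return l_cls + alpha * l_lmmd

rng = np.random.default_rng(0)
zs = rng.normal(size=(8, 4))
ys = np.eye(3)[rng.integers(0, 3, 8)]
aligned = lmmd(zs, zs, ys, ys)   # identical subdomains give (numerically) zero loss
```

When the per-class source and target feature distributions coincide, the three kernel terms cancel class by class, so the loss vanishes; misaligned subdomains leave a positive residual that the optimizer drives down.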

Algorithm 1 TransConv
Input: source domain data X_s and target domain data X_t; labels y_s for the source domain data. Parameter: hyperparameter α (fixed to 0.1 in our experiments; see Section 4).

Experiments
To verify the effectiveness of our model, we evaluate the proposed method on four widely used object recognition datasets, Office-31, Office-Home, ImageCLEF-DA, and VisDA-2017, and on MNIST, USPS, and SVHN for digit classification, and we compare against state-of-the-art UDA methods.
Digit classification is a UDA benchmark consisting of MNIST [44], USPS, and Street View House Numbers (SVHN) [45]. We use the same settings as previous work to train our model: the training phase uses the training sets of each pair of source and target domains, and the testing phase uses the test set of the target domain for evaluation.
The Office-31 dataset [46] contains 4652 images in 31 categories and consists of three domains: Amazon (A), DSLR (D), and Webcam (W). Amazon (A) contains 2817 images downloaded from www.amazon.com; the 498 images in DSLR (D) and 795 images in Webcam (W) were captured in an office environment by digital SLR and web cameras, respectively.
The Office-Home dataset [47] consists of 15,588 images in 65 object categories. It contains images from four different domains: artistic images (A), clip art (C), product images (P), and real-world images (R). Images in each domain are collected in office and home environments. There are 2427 images in (A), 4365 images in (C), 4439 images in (P), and 4357 images in (R), respectively.
The ImageCLEF-DA dataset contains 1800 images in 12 categories. It consists of three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). There are 600 images in each domain and 50 images in each category.
The VisDA-2017 dataset [48] contains about 280k images in 12 categories. It includes three domains: training, validation, and test. It is a simulation-to-real dataset: the training set has 152,397 images generated from the same objects under different circumstances, while the 55,388 images in the validation set and the 72,372 images in the test set are real-world images.
Implementation Details. The ViT-B/16 model pretrained on ImageNet-21k is used as the backbone network to extract image features. The input image size in our experiments is 256 × 256, and the size of each patch is 16 × 16. The transformer encoder of ViT-B/16 consists of 12 transformer encoder layers. We train the model using a minibatch stochastic gradient descent (SGD) optimizer with a momentum of 0.9. The learning rate is initialized to 0, linearly increased to 3 × 10⁻² over the first 500 training steps, and then decreased by a cosine decay strategy. Experiments are conducted on a single 2080 Ti card with 11 GB memory. The batch size is set to 16.
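The warmup-then-cosine learning rate schedule described above can be sketched as a plain function of the step index; `total_steps` is an assumption here, as the paper does not state the total training length.

```python
import math

def lr_at(step, total_steps, base_lr=3e-2, warmup=500):
    # Linear warmup from 0 to base_lr over `warmup` steps,
    # then cosine decay from base_lr down to 0.
    if step < warmup:
        return base_lr * step / warmup
    t = (step - warmup) / max(1, total_steps - warmup)  # progress in [0, 1]
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

The schedule peaks at exactly 3 × 10⁻² at step 500 and decays smoothly to zero at the final step, which avoids the abrupt drops of a step schedule.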

Results of Digit Recognition
The classification results of the three digit recognition tasks are shown in Table 1. Since currently compared methods only evaluate three cases (i.e., SVHN→MNIST, USPS→MNIST, and MNIST→USPS) and there are no comparisons for the remaining three cases, we use the same settings as previous studies. TransConv achieves the same best accuracy as TVT on the MNIST→USPS task and is 0.2% below the best average classification accuracy. These results demonstrate the effectiveness of the TransConv model in alleviating the domain shift problem.

Results of Object Recognition. We evaluate four datasets for object recognition tasks: Office-31, ImageCLEF-DA, Office-Home, and VisDA-2017. The results are shown in Tables 2-5. In Table 3, TransConv achieves the best average classification accuracy on ImageCLEF-DA, a significant improvement over the best prior UDA method (92.3% vs. 91.3%). However, TransConv falls below the best prior UDA methods on Office-31, Office-Home, and VisDA-2017. In Tables 2 and 4, TransConv is lower than TVT on Office-31 (92.8% vs. 93.9%) and Office-Home (82.9% vs. 83.6%). In Table 5, TransConv is lower than CDTrans on VisDA-2017 (80.9% vs. 88.4%). From the differences in sample numbers and results across the three domains (Amazon, DSLR, and Webcam) of the Office-31 dataset in Table 2, it can be seen that the larger the source domain dataset, the higher the corresponding performance. Moreover, as shown in Table 6, TransConv surpasses the baseline (92.8% vs. 91.7%). This is also evidenced by the t-SNE visualization of learned features in Figure 6, where we visualize the network activations of the baseline and TransConv for task A→W of the Office-31 dataset; red points are source samples and blue points are target samples. Figure 6a shows the baseline result: the source and target domains are not aligned very well, and some points are hard to classify. In contrast, Figure 6b shows that for TransConv the source and target domains are aligned very well. These results show that a hybrid model using the Gaussian attended MLP module, with its denoising capability, and the highly efficient dynamic convolution module can mitigate the domain adaptation problem to some extent. TransConv is an effective attempt at a hybrid model of transformers and convolutional neural networks.

Ablation Study. To learn the individual contributions of the Gaussian attended MLP, dynamic convolution, and context information in improving the knowledge transferability of ViT, we conduct an ablation study, shown in Table 6, removing each component from the TransConv model in turn. Without the Gaussian attended MLP, the average classification accuracy drops by 0.2%; without dynamic convolution, by 0.5%; and without the context information, by 0.6%. The baseline removes all three at once, reducing the average classification accuracy by 1.1%, which indicates the significance of the three improvements for model performance. To understand the effect of each improvement inside dynamic convolution, we conduct a further ablation study, shown in Table 7: compared to the full TransConv model, removing LN reduces the average classification accuracy by 0.4%, and removing GELU reduces it by 0.1%. The baseline without both LN and GELU reduces the average classification accuracy by 0.2%, indicating that both improvements contribute to model performance.

Parameter Sensitivity and Robustness. In our model, the hyperparameter α weights L_lmmd. To better understand its effect, we report the sensitivity of α in Figure 7a. TransConv achieves the best results when α = 0.1, so this article fixes α = 0.1. We also examine the robustness of our model by changing the distance between the source and target domain distributions. In Figure 7b, d represents the interdomain distance, '−' represents reducing the interdomain distance, and '+' represents increasing it. When the change in distance is less than d, the model performance decreases only slightly; when the distance is increased beyond d, the larger the distance, the more the model performance decreases.

Discussion
To assess the efficiency of TransConv, we compare it in particular with TVT and CDTrans, as shown in Table 8. First, we compare TransConv with TVT. TVT uses a pure transformer architecture with adversarial training, while TransConv uses a hybrid architecture that is easy to implement without adversarial training, which saves some overhead. In addition, TVT has multiple loss functions and three hyperparameters, while TransConv uses only two loss functions and a single hyperparameter, thereby reducing hyperparameter tuning. Second, we compare TransConv with CDTrans. CDTrans uses a three-branch pure transformer architecture, which requires a large amount of cross-attention computation and two separate training phases. Our hybrid architecture does not require extensive cross-attention computation and only needs end-to-end single-phase training. TransConv balances model improvement against computation overhead. Overall, these results verify the advantages of TransConv.

Conclusions
In this paper, we tackle the problem of unsupervised domain adaptation by improving transformer encoders and using context information in a novel hybrid way, inducing a new hybrid network structure, TransConv. Specifically, TransConv improves the robustness of features through a Gaussian attended MLP module and improves the semantics of local features through context-aware dynamic convolution. Experimental results on widely used benchmarks demonstrate the effectiveness of the TransConv model. TransConv is still restricted by inadequate transferability of global class-wise semantics and global spatial representations, because feature confusion needs more fine-grained feature interaction. Future work will further investigate other strategies to efficiently achieve state-of-the-art (SOTA) performance on UDA tasks.

Figure 1 .
Figure 1. Attention visualization of bike, bike helmet, and letter tray in the Office-31 dataset. The hotter the color, the higher the attention.

Figure 3 .
Figure 3. The Gaussian attended MLP framework. The MLP is a scaling operation implemented by two FC modules. The input X_MLP of the MLP has dimension R^{B×(HW+1)×C}; in the output, the dimension of X is R^{B×HW×C} and the dimension of CLS is R^{B×1×C}. The normalization module normalizes the CLS feature to obtain ĈLS, and the Gaussian function module calculates the attention weights g for ĈLS; g represents the attention activations. ⊙ denotes the broadcast element-wise product.

Robustness to Noise. The pseudo-labels in the target domain usually contain noise. To further analyze whether Gaussian channel attention can denoise the pseudo-labels, we design a careful experiment. Specifically, we sample the same number of same-category images from the source and target domains in the W→A task of the Office-31 dataset as training data, i.e., the training images of the source and target domains in each batch belong to the same category. Then, we manually replace image pairs of the same category with image pairs of different categories to inject noise, and we observe how UDA performance changes with the ratio of different-category image pairs, as shown in Figure 4. The x-axis represents the ratio of different-category image pairs in the training data, and the y-axis represents accuracy on the UDA task. A value of 0.0 on the x-axis means all image pairs in a batch share the same category; a value of 1.0 means the categories of all image pairs differ. The red curve shows the results with the Gaussian attended MLP module, while the blue curve shows the results without it. The red curve outperforms the blue curve up to a ratio of 80 percent, which implies the robustness of the Gaussian attended MLP module to noise.

Figure 4 .
Figure 4. The model with Gaussian attended MLP modules vs. without Gaussian attended MLP modules. The red and blue curves represent the model with and without the Gaussian attended MLP modules, respectively.

Figure 5 .
Figure 5. Improved dynamic convolution module framework. The red parts represent the improvements over the original dynamic convolution.

Figure 6 .
Figure 6. Feature visualization of (a) the baseline and (b) TransConv using t-SNE on the task A→D of the Office-31 dataset, where red and blue points indicate the source and target domains, respectively.

Figure 7 .
Figure 7. Model analysis on evaluation W→A. (a) Parameter sensitivity of α. (b) Performance changes with the interdomain distance between the source and target domains.

Table 1 .
Performance comparison on the digit datasets. The best performance is marked in bold.

Table 4 .
Performance comparison on the Office-Home dataset. * indicates the results of using the ensemble learning strategy.

Table 7 .
Ablation study of dynamic convolution on the Office-31 dataset.

Table 8 .
Performance comparison of TransConv, TVT, and CDTrans on the Office-31 dataset. The running time is the convergence time, measured on a single 2080 Ti GPU.