Synthetic Source Universal Domain Adaptation through Contrastive Learning

Universal domain adaptation (UDA) is a crucial research topic for efficiently training deep learning models on data from various imaging sensors. However, its development is hindered by the lack of labels for the target data. Moreover, the absence of prior knowledge about the source and target label sets makes UDA training even more challenging. I hypothesize that trained models degrade in the target domain because there is no direct training loss that improves the discriminative power of the target domain features; as a result, the target data adapted to the source representations become biased toward the source domain. I found that this degradation is more pronounced when synthetic data are used for the source domain and real data for the target domain. In this paper, I propose a UDA method with target domain contrastive learning. The proposed method enables models to leverage synthetic data for the source domain and to train the discriminativeness of target features in an unsupervised manner. In addition, the target domain feature extraction network is shared with the source domain classification task, preventing unnecessary computational growth. Extensive experiments on VisDA-2017 and MNIST to SVHN demonstrate that the proposed method significantly outperforms the baseline by 2.7% and 5.1%, respectively.


Introduction
The application of deep learning has been rapidly improving performance in various fields of computer vision, such as object detection [1], human pose estimation [2,3], semantic segmentation [4,5], and image classification [6]. In addition, its application in real-world industries has also been promoted [7,8].
Achieving significant improvement in deep learning relies on the abundance of labeled samples [9] in a supervised manner. However, since images are collected from various sensors, test samples may be different from those used in the training phase. In this case, even if the test sample is semantically the same as the training samples, the performance is significantly reduced [10]. This is known as the domain shift problem [11]. As a result, deep learning models require new training for a new domain. Training data must be collected again, and the labeling process must be repeated continuously even for the same task in different domains. By contrast, humans are capable of robustly recognizing images and inferring their meanings across different domains. For example, a person who has seen and understood pictures of numerous cars (source domain) can also recognize them on artwork depicting cars (target domain). Hence, recent studies on deep learning have focused on increasing the efficiency of training models by leveraging information that has already been learned [12].
Domain adaptation (DA) [13,14] is the task of adapting algorithms trained on one or more source domains to related target domains for the same task. Many studies have successfully transferred knowledge learned from the source domain to the target domain without target data labels, under the common assumption that the label sets of the source and target domains are identical, i.e., closed-set domain adaptation (CDA) [15-18]. However, closed-set domain adaptation cannot effectively bridge label gaps between domains when the target domain has fewer classes than the source domain, i.e., partial domain adaptation (PDA) [19-21], or when the target domain includes "unknown" classes that do not appear in the source domain, i.e., open-set domain adaptation (ODA) [22-24].
Currently, research is underway on how to adapt learned knowledge to different domains with no restrictions to the source and target domain label sets. Universal domain adaptation (UDA) [25,26] includes CDA, PDA, and ODA. As a result, target domain samples should be correctly classified into "known" class labels in a source domain or "unknown" classes if the label is not in the source domain [25,26]. Saito et al. [26] achieved UDA by proposing neighbor clustering and entropy separation in a self-supervised manner. Neighbor clustering brings target samples closer to known source class prototypes or unknown target samples. At the same time, entropy separation separates unknown target samples from known class boundaries using confident target samples.
Even though [26] demonstrated compelling performance for universal domain adaptation, I found that it is insufficient when synthesized data are used as the source domain. My hypothesis is that, if the source and target domains are too different, self-supervision by the two losses becomes erroneous, and the feature extraction network trained by [26] does not obtain discriminative feature representations for the target domain. This drawback stems from the absence of a direct training loss for the target domain. Ideally, universal domain adaptation should adapt knowledge learned from synthesized source domain data to the real-world target domain, as depicted in Figure 1; this is a fundamental solution that can automatically generate labeled training data without human intervention, thereby reducing the cost of large-scale data labeling. Therefore, I focus on leveraging synthetic source domain data to reduce human labeling costs.
In this paper, I propose a novel universal domain adaptation method that improves the discriminative power of the target feature representation of [26]. Here, discriminative means that positive samples are pulled together while negative samples are pushed apart. However, because no labels are available in the target domain, the proposed method utilizes contrastive learning [27] for target domain training: I first generate a positive pair by augmenting the same target domain sample twice; negative samples are then drawn from the remaining samples of the target minibatch. This process maximizes the mutual information between different views of the same data in the target domain without requiring any labels for positive pairs. Moreover, the improved discriminative power in the target domain helps neighbor clustering and known/unknown separation. Thereby, the proposed method achieves better performance than the baseline method [26] when a synthetic source domain is adapted to a real target domain.
The contributions are summarized as follows:
• In contrast to the existing universal domain adaptation methods [25,26], I focused on using synthetic source domain data to reduce human labeling costs.
• Owing to the limitations of data synthesis, the source domain has different characteristics from the real data in the target domain; thus, the target domain information should be fully utilized to learn the feature representation. I used contrastive learning [27] to extract discriminative features from the target domain.
• For contrastive learning, the target domain feature extraction network is not constructed separately. The source and target domains share a common feature extraction network, thus avoiding unnecessary computation surges.
• The experiments conducted on the VisDA-2017 dataset and MNIST to SVHN dataset indicate that the proposed method significantly outperforms the baselines by 2.7% and 5.1%, respectively.
The remainder of this paper is organized as follows: Section 2 introduces related work. In Section 3, the proposed method is described. Section 4 presents the implementation and experimental results. Finally, Section 5 concludes the paper.

Domain Adaptation
Let L_s and L_t denote the sets of class labels present in the source and target domains, respectively. Domain adaptation can be categorized into three main topics according to the label-set constraints between domains. I introduce recent domain adaptation methods for each category below.

Closed-Set Domain Adaptation
Haeusser et al. [28] produced similar feature representations of the source and target domains by utilizing a bipartite graph. Tzeng et al. [29] first outlined a unified framework for adversarial domain adaptation by combining discriminative modeling, untied weight sharing, and generative adversarial loss. Long et al. [30] designed a conditional domain adversarial network with multilinear and entropy conditioning to improve discriminability and transferability.

Partial Domain Adaptation
Cao et al. [19] introduced partial domain adaptation as a new challenge and proposed a down-weighting solution to handle outlier source classes that do not appear in the target domain. Cao et al. [20] further improved PDA and proposed an example transfer network that was designed using a weighting scheme to quantify the transferability of examples in the source domain. Therefore, it alleviates negative transfers and promotes positive transfers. Zhang et al. [21] extended the adversarial nets-based domain adaptation that identifies the importance score of source samples based on a two-domain classifier strategy.

Open-Set Domain Adaptation
Busto et al. [22] introduced the concept of open sets to domain adaptation and proposed a method to fit in both closed and open-set scenarios by solving the assignment problem of targets that are potentially known classes in the source domain. Saito et al. [23] proposed a method for learning feature representations that separate unknown targets from known target samples based on adversarial training.
Recently, You et al. [25] introduced universal domain adaptation and proposed a universal adaptation network that distinguishes the shared label set from the private label set of each domain by quantifying sample-level transferability. Saito et al. [26] improved UDA by proposing neighbor clustering and entropy separation losses, which are trained in a self-supervised manner. However, I found that this method [26] is insufficient when synthesized samples are used for the source domain. In this paper, I propose a contrastive-learning-based UDA method for leveraging synthetic source data.

Contrastive Learning
Recent studies have shown that unsupervised feature representations for various downstream tasks can be learned in a contrastive manner. Oord et al. [31] captured useful feature representation by predicting the future in latent space based on autoregressive models and probabilistic contrastive loss. MoCo [32] is a seminal method for contrastive learning that maintains a dynamic dictionary (memory bank) for computing the contrastive loss and shows competitive results on ImageNet [9] classification. Chen et al. [27] conducted systematic studies to understand the factors that enable contrastive prediction tasks to learn useful representations. Grill et al. [33] achieved an improved accuracy of the ImageNet classification task without negative pairs. Instead, they relied on online target networks that interact and learn from each other. Based on the abovementioned success of contrastive learning, I extracted the target domain information in a contrastive manner for universal domain adaptation.

Let the dataset for the source domain be $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$, where $x_i^s$ and $y_i^s$ indicate the $i$-th input sample and its corresponding true label, respectively, and $N_s$ is the number of samples in the source domain. The target domain dataset $D_t = \{x_i^t\}_{i=1}^{N_t}$ has no true labels; its samples $x^t$ either belong to the same classes as the source domain or are unknown samples. Let $\tilde{x}_i^s$ denote a source domain sample augmented by a data augmentation $\mu$, e.g., random cropping, random color distortion, etc. The feature representation $\tilde{f}_i^s \in \mathbb{R}^d$ is calculated using a feature extraction network $G$, i.e., $\tilde{f}_i^s = G(\tilde{x}_i^s)$. For the target domain, two different data augmentations $\mu$ and $\mu'$ are applied to a target domain sample $x_i^t$. The augmented samples are denoted by $\tilde{x}_i^t$ and $\hat{x}_i^t$, and their feature representations by $G$ are denoted by $\tilde{f}_i^t$ and $\hat{f}_i^t$. The same feature extraction network $G$ is shared for both the source and target domains. Let $C(\cdot\,; W)$ be a linear classification network, where the weights are represented as $W = [w_1, w_2, \ldots, w_K]$. The $k$-th weight vector $w_k \in \mathbb{R}^d$ is normalized by the $\ell_2$ norm and used as a prototype representing the $k$-th class. The proposed method uses a memory bank $F^t = [\tilde{f}_1^t, \tilde{f}_2^t, \ldots, \tilde{f}_{N_t}^t] \in \mathbb{R}^{d \times N_t}$ that saves the $N_t$ feature representations of samples in the target domain. I also denote the total memory bank, which appends the class prototypes, as $V = [F^t, W] = [v_1, v_2, \ldots, v_{N_t+K}]$.
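As a concrete toy sketch of this bookkeeping, the memory bank can be held as a $d \times N_t$ array whose columns are $\ell_2$-normalized target features, with the class prototypes appended to form the total bank. The sizes, the update rule, and the normalization of stored features are my assumptions for illustration:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Normalize a feature vector to unit length, as done for the
    # class prototypes w_k (and assumed here for stored features too).
    return v / (np.linalg.norm(v) + eps)

d, n_t, K = 8, 5, 3                 # toy feature dim, #target samples, #classes
memory_bank = np.zeros((d, n_t))    # F^t: one column per target sample

def update_bank(bank, i, feat):
    # Overwrite column i with the newest normalized feature of sample i.
    bank[:, i] = l2_normalize(feat)

rng = np.random.default_rng(0)
for i in range(n_t):
    update_bank(memory_bank, i, rng.normal(size=d))

# Class prototypes w_k, also l2-normalized, appended to form V = [F^t, W].
prototypes = np.stack([l2_normalize(rng.normal(size=d)) for _ in range(K)], axis=1)
total_bank = np.concatenate([memory_bank, prototypes], axis=1)
```

Each stored column stays unit length, so similarities against the bank reduce to cosine similarities up to the scale of the query feature.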

Architecture
In this study, domain adaptation aims to transfer the knowledge of labeled samples in the source domain to train a classification model for unlabeled target domain samples. Universal domain adaptation makes this problem more challenging by including unknown samples in the target domain; furthermore, some classes in the source domain may have no samples in the target domain. Therefore, the feature extraction network should move target domain samples closer to known class samples from the source domain while simultaneously making it easier to distinguish between unknown and known samples. To achieve this goal, a previous work (DANCE) [26] trained the feature extraction network through two losses: neighbor clustering and entropy separation. However, no direct loss function is available to learn a target-domain-specific representation because the target domain samples lack true labels. My hypothesis is that the absence of a target-domain-specific loss function causes the feature extraction network to create feature representations biased toward source domain classification, which prevents useful feature representations for unknown target domain samples. To resolve this, I propose a method that learns a feature representation for the source domain classification task in a supervised manner and for a target domain instance-level classification task in an unsupervised manner.
To summarize, the proposed method consists of four loss functions, as depicted in Figure 2: the loss for source domain classification ($L_{cls}$), the instance-level classification loss for the target domain ($L_{ct}$), and the neighbor clustering ($L_{nc}$) and entropy separation ($L_{es}$) losses proposed in [26]. I sequentially explain the latter three loss functions; the source domain classification loss is a standard supervised classification loss.

Figure 2. Overview. The proposed method learns a feature representation space considering the discriminative power in both the source ($L_{cls}$) and target ($L_{ct}$ in Section 3.2.1) domains. The target domain feature is enhanced using contrastive learning, an instance-level classification task, to overcome the limitations of data synthesis in the source domain and the lack of labels in the target domain. The remaining losses ($L_{nc}$ in Section 3.2.2 and $L_{es}$ in Section 3.2.3) cluster class samples and separate known/unknown classes, respectively. Notably, the feature extraction network $G$ is shared between the source and target domains. $T$ denotes the pool of data augmentations.

Target Domain Contrastive Loss
Contrastive representation learning aims to learn a discriminative embedding space devoid of any true labels through self-supervised learning wherein similar pairs of samples are close to each other, and dissimilar pairs of samples are distant from each other. To achieve this, each image in a given dataset is considered its own class, i.e., instance-level classification [27].
I used the SimCLR-based contrastive learning method [27] to learn the target domain feature representation. The procedure is as follows:

• Generate different views of the same target sample: Randomly selected data augmentations $\mu$ and $\mu'$ are applied to a target domain sample $x_i^t$, producing two augmented samples $\tilde{x}_i^t$ and $\hat{x}_i^t$. As a result, two minibatches are obtained, $\tilde{B}^t = \{\tilde{x}_k^t\}$ and $\hat{B}^t = \{\hat{x}_k^t\}$, of equal size, i.e., $|\tilde{B}^t| = |\hat{B}^t|$.
• Extract features from the augmented target samples: The augmented samples are fed into the feature extraction network $G$, which is shared with the source domain classification task, to obtain the feature representations $\tilde{f}_i^t = G(\tilde{x}_i^t)$ and $\hat{f}_i^t = G(\hat{x}_i^t)$. Although any backbone network can be used, I use ResNet [6,27].
• Obtain target projection features for contrastive learning: Projection feature representations $\tilde{z}_i^t = M(\tilde{f}_i^t)$ and $\hat{z}_i^t = M(\hat{f}_i^t)$ are obtained via a projection network $M$, for which I use a shallow multilayer perceptron (MLP), and the contrastive learning loss is applied to them.
• Minimize the contrastive learning loss in the target domain: Given the two augmented minibatches $\tilde{B}^t$ and $\hat{B}^t$, contrastive learning aims to identify $\hat{x}_i^t$ from $\tilde{x}_i^t$, or vice versa. The loss function is defined using the projection representations of the target samples as
$$L_{ct} = -\frac{1}{N_t} \sum_{i=1}^{N_t} \log \frac{\exp(s(\tilde{z}_i^t, \hat{z}_i^t)/\tau)}{Z},$$
where $\tau$ is the temperature parameter [34], $Z = \sum_{k=1}^{N_t} \exp(s(\tilde{z}_i^t, \hat{z}_k^t)/\tau) + \sum_{k=1, k \neq i}^{N_t} \exp(s(\tilde{z}_i^t, \tilde{z}_k^t)/\tau)$, and $s(\cdot, \cdot)$ is a similarity function, for which I use the cosine similarity $s(\tilde{z}_i^t, \hat{z}_i^t) = \tilde{z}_i^{t\top} \hat{z}_i^t / (\|\tilde{z}_i^t\| \|\hat{z}_i^t\|)$. Minimizing the contrastive learning loss makes the two projection feature representations from the same sample similar and those from different samples dissimilar. This yields powerful feature representations for instance-level classification even without any labeled samples; the contrastive loss thus finds a feature space that better represents target domain samples by discarding unnecessary information.
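The steps above can be sketched in numpy as a minimal version of this loss, following the normalizer $Z$ defined in the text. The batch size, feature dimension, and $\tau$ are illustrative; a real implementation would operate on the projection features produced by $M$:

```python
import numpy as np

def cosine_sim(a, b):
    # s(a, b) = a.b / (||a|| ||b||), computed between all row pairs.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def target_contrastive_loss(z1, z2, tau=0.5):
    # z1[i], z2[i]: projection features of the two augmented views of sample i.
    n = z1.shape[0]
    s12 = cosine_sim(z1, z2) / tau   # cross-view similarities
    s11 = cosine_sim(z1, z1) / tau   # same-view similarities
    total = 0.0
    for i in range(n):
        pos = np.exp(s12[i, i])      # the positive pair (two views of sample i)
        # Z: all cross-view pairs plus same-view pairs, excluding i with itself.
        denom = np.exp(s12[i]).sum() + np.exp(np.delete(s11[i], i)).sum()
        total += -np.log(pos / denom)
    return total / n

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
# Two nearly identical views give a low loss; unrelated views give a high loss.
loss_aligned = target_contrastive_loss(z, z + 0.01 * rng.normal(size=(4, 16)))
loss_random = target_contrastive_loss(z, rng.normal(size=(4, 16)))
```

Because the positive term also appears inside $Z$, the per-sample loss is always positive and shrinks as the two views of the same sample align.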

Neighbor Clustering (NC) Loss
The purpose of this loss function [26] is to move each target sample either toward a known class prototype of the source domain or toward a neighboring target sample, which may be unknown. This can be achieved by minimizing the following entropy:
$$L_{nc} = -\frac{1}{N_t} \sum_{i=1}^{N_t} \sum_{j=1}^{N_t+K} p_{i,j} \log p_{i,j},$$
where $p_{i,j}$ is a probability based on the similarity between the $i$-th target feature representation $\tilde{f}_i^t$ and the $j$-th total memory bank element $v_j$:
$$p_{i,j} = \frac{\exp(v_j^\top \tilde{f}_i^t / \tau)}{\sum_{l=1}^{N_t+K} \exp(v_l^\top \tilde{f}_i^t / \tau)},$$
where $\tau$ is the temperature parameter controlling the distribution concentration degree [34].
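This entropy over memory-bank similarities can be sketched in numpy as follows (the bank size, feature dimension, and $\tau$ are illustrative toy values):

```python
import numpy as np

def neighbor_clustering_loss(feats, bank, tau=0.5):
    # feats: (n, d) target features; bank: (m, d) rows v_j of the total
    # memory bank (stored target features plus the K class prototypes).
    sims = feats @ bank.T / tau                     # v_j^T f_i / tau
    sims = sims - sims.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)  # p_{i,j}
    # Minimizing the entropy pushes each target sample toward a single
    # prototype or a single neighboring target sample.
    ent = -np.sum(p * np.log(p + 1e-12), axis=1)
    return ent.mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
bank = rng.normal(size=(10, 8))
loss = neighbor_clustering_loss(feats, bank)
```

A smaller $\tau$ sharpens the similarity distribution, so the entropy (and hence the loss) approaches zero once each sample commits to one bank entry.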

Entropy Separation Loss
Even with the neighbor clustering loss, it is challenging to separate unknown samples from known classes. To enhance separation, the entropy separation loss function proposed in [26] is minimized:
$$L_{es}(p_i) = \begin{cases} -|H(p_i) - \rho| & \text{if } |H(p_i) - \rho| > m, \\ 0 & \text{otherwise}, \end{cases}$$
where $p_i$ is the source class probability vector for the $i$-th target sample, calculated by applying the softmax to the classifier output $W^\top \tilde{f}_i^t$, and $H(\cdot)$ is the entropy function. If $H(p_i)$ is much lower, the sample is nearly indistinguishable from one particular source class (low entropy), whereas if it is much higher, there is no similar class in the source domain (high entropy). Thus, minimizing the entropy separation loss drives $H(p_i)$ toward an extreme value, which creates a separation effect in which the unknown samples move away from the source domain classes. Here, $\rho$ is a threshold boundary value used to separate unknown samples from known classes, and $m$ is a margin parameter that restricts the loss to reliable samples only.
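A numpy sketch of this per-sample rule, using $\rho = \log(K)/2$ and $m = 0.5$ as in Section 4 (the probability vectors themselves are illustrative):

```python
import numpy as np

def entropy(p, eps=1e-12):
    # H(p) = -sum_k p_k log p_k
    return -np.sum(p * np.log(p + eps))

def entropy_separation_loss(probs, rho, m):
    # probs: (n, K) source-class probability vectors for target samples.
    # A sample contributes only when |H(p_i) - rho| > m (a "reliable"
    # sample); minimizing -|H - rho| then pushes H toward an extreme.
    losses = []
    for p in probs:
        gap = abs(entropy(p) - rho)
        losses.append(-gap if gap > m else 0.0)
    return np.mean(losses)

K = 6
rho, m = np.log(K) / 2, 0.5
confident = np.array([[0.95, 0.01, 0.01, 0.01, 0.01, 0.01]])  # low entropy: likely known
uniform = np.full((1, K), 1 / K)                              # high entropy: likely unknown
ambiguous = np.array([[0.7, 0.2, 0.05, 0.03, 0.01, 0.01]])    # H near rho: ignored
```

The confident and uniform vectors both clear the margin and contribute negative loss (pushing their entropies further apart), while the ambiguous vector falls inside the margin and is excluded.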

Total Loss
Finally, I combine all four loss functions described above with two hyperparameters, $\lambda_1$ and $\lambda_2$, as follows:
$$L = L_{cls} + \lambda_1 L_{ct} + \lambda_2 (L_{nc} + L_{es}).$$
Minimizing the total loss function results in learning a discriminative feature representation for both the source and target domains, clustering neighborhoods in the feature space, and maximizing the separation of unknown samples from known classes.
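As a sketch, with $\lambda_2$ weighting the two DANCE losses jointly (this grouping is my reading of the text; only $\lambda_1 = 0.05$ is stated in Section 4), the combination is simply:

```python
def total_loss(l_cls, l_ct, l_nc, l_es, lam1=0.05, lam2=0.05):
    # L = L_cls + lambda_1 * L_ct + lambda_2 * (L_nc + L_es)
    # lambda_1 = 0.05 follows Section 4; lambda_2's value is assumed here.
    return l_cls + lam1 * l_ct + lam2 * (l_nc + l_es)

# Toy scalar values for the four losses (L_es can be negative by design).
loss = total_loss(1.2, 3.0, 0.8, -0.4)
```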

Implementation Details
I conducted experiments in PyTorch [35] on a single NVIDIA Titan RTX GPU, following the experimental settings of [26]; the proposed method is not hardware dependent. The feature extractor G was set to ResNet50 [6], pretrained on ImageNet [9], after removing the last linear layer in all experiments. In addition, I added a new source classification layer W. For the contrastive projection features, I used a two-layer perceptron with ReLU activation, i.e., Linear-ReLU-Linear. For the baselines, I used the implementation of a previous work [26]. For the proposed method, the values of ρ, m, and λ_1 were set to log(K)/2, 0.5, and 0.05, respectively, where K is the number of shared classes. The batch size was set to 36. I used the SGD optimizer, and the initial learning rate and weight decay were set to 0.01 and 0.0005, respectively. Table 1 compares the number of parameters and GFLOPs between the baseline [26] and the proposed method. Since the proposed method requires the contrastive projection features during training, an additional 9.4 M parameters and 5.3 GFLOPs are required. However, for inference, the proposed method uses the same network as the baseline, consisting of one feature extractor G and one classifier C; notably, the parameters and computational costs for inference do not increase. Table 1. Comparison of the number of parameters and GFLOPs of the baseline [26] and the proposed method during training. For inference, the proposed method uses the same network as the baseline.
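The Linear-ReLU-Linear projection head can be sketched in numpy as follows. The 2048 → 2048 → 128 widths are my assumption for illustration; only the 2048-dimensional ResNet50 output is fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 2048, 2048, 128   # assumed widths of the MLP head
W1 = rng.normal(scale=0.01, size=(d_hidden, d_in))
W2 = rng.normal(scale=0.01, size=(d_out, d_hidden))

def project(f):
    # M(f): Linear -> ReLU -> Linear. The head is used only during training;
    # inference drops it, so deployment cost matches the baseline.
    h = np.maximum(W1 @ f, 0.0)
    return W2 @ h

z = project(rng.normal(size=d_in))
```

Dropping this head at inference is why Table 1's extra parameters and GFLOPs apply to training only.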

Evaluation and Data Augmentation
The goal of the experiments was to compare the proposed method with DANCE [26] across the subcases of UDA, i.e., CDA, PDA, and ODA, on classification tasks with a synthetic source domain and a real target domain. Following the evaluation metrics of a previously published study [26], I calculated the accuracy over all target samples in CDA and PDA. In ODA, I used the average per-class accuracy, including the "unknown" class; for example, VisDA-2017 ODA reports an average over seven classes, i.e., six shared classes and one unknown class. I ran each experiment three times and report the average results.
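The two metrics can be sketched as follows (the labels are a toy example; class index 6 stands in for "unknown"):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    # CDA/PDA metric: accuracy over all target samples.
    return np.mean(y_true == y_pred)

def per_class_accuracy(y_true, y_pred, classes):
    # ODA metric: average of per-class accuracies, with "unknown"
    # counted as one class alongside the shared classes.
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(accs)

# Toy example: classes 0 and 1 are shared, class 6 is "unknown".
y_true = np.array([0, 0, 1, 6, 6, 6])
y_pred = np.array([0, 1, 1, 6, 6, 0])
```

The per-class average weights a rare class (here, class 1) as heavily as a frequent one, which is why it is the fairer metric when the "unknown" class dominates the target set.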
I denote the accuracies reported in [26] as DANCE in the tables in this paper. In addition, I present two additional results based on [26], DANCE-R and DANCE-A, for a fair comparison. DANCE-R is the reproduced result of [26] based on their code (https://github.com/VisionLearningGroup/DANCE. Accessed: 12 July 2021) with the same random seeds. DANCE-A is another trained version of DANCE that uses the same augmentation µ as the proposed method; it validates my hypothesis that data augmentation of synthetic source data is not sufficient to cover the real target domain. I set µ to random flip, Gaussian blur, color jitter, and grayscale, and µ′ to random flip and a scale transform that adjusts the input size for the feature extraction network G, because contrastive learning needs different views generated by the augmentations µ and µ′ from the same data.
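The two augmentation pools might be written with torchvision as follows. This is a sketch: the operation magnitudes and probabilities are my assumptions, loosely following SimCLR defaults; only the operation lists come from the text:

```python
from torchvision import transforms

# Pool mu: random flip, Gaussian blur, color jitter, and grayscale.
# Jitter strengths, grayscale probability, and blur kernel are assumed.
mu = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# Pool mu': random flip plus a scale transform to the extractor's
# input size (256 x 256 for VisDA-2017).
mu_prime = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
```

Applying `mu` and `mu_prime` to the same target image yields the two views fed through G and M for the contrastive loss.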

VisDA-2017 Dataset
VisDA-2017 [36] is a large-scale dataset. It contains images of 12 categories in various sizes across two domains: one contains synthetic 2D renderings of 3D objects (152,397 images), and the other contains photographs of real-world objects (55,388 images). I resized the images to 256 × 256 for the feature extraction input. Figure 3 depicts examples of the VisDA-2017 dataset; the first and second rows depict the source and target domain images, respectively. This dataset exhibits a significant domain shift. I followed [26] to construct the closed-set, partial, and open-set domain adaptation tasks to validate the proposed method in large-scale synthetic-to-real domain adaptation. The values in parentheses denote the numbers of shared classes, source private classes, and target private classes, respectively; for instance, (6/6/0) indicates partial domain adaptation with |L_s ∩ L_t| = 6, |L_s − L_t| = 6, and |L_t − L_s| = 0.
As expected, the DANCE-* results in Table 2 were unfavorable compared to the proposed method in all cases. The highest improvement was observed in the open-set case: the contrastive loss helps more there because the open-set task contains many unknown classes. I also checked the effect of the hyperparameter λ_1. As shown in Table 2, the proposed method is insensitive to λ_1; both λ_1 = 0.1 and λ_1 = 0.03 were better than the baseline methods. Table 2 also compares the accuracy of the proposed method with other existing methods. The baseline DANCE-A achieved significantly better accuracy than traditional domain adaptation methods; nevertheless, the proposed method further improved the accuracy over this baseline on the VisDA-2017 dataset. For open-set domain adaptation in particular, learning in the target domain is important, and the proposed method achieved an accuracy improvement of 4.1% over the baseline. The proposed method thus outperforms the other methods without significant performance degradation elsewhere.

MNIST to SVHN Dataset
The second experiment evaluates adaptation between the digit datasets MNIST [37] and SVHN [38].
MNIST consists of 28 × 28 gray-scale images of handwritten digits from zero to nine. The standard training and test splits contain 60,000 and 10,000 images, respectively. I resized the images to 32 × 32 for the experiments. Examples are shown in the first row of Figure 4.
Images in the MNIST dataset are not synthetic but appear synthetic because they are black-and-white, as depicted in Figure 4. By contrast, the SVHN images in Figure 4 were obtained from vision cameras. Accordingly, I set the training split of MNIST as the source domain and the test split of SVHN as the target domain. I did not apply the random crop, flip, and translation augmentations because SVHN images often include a second digit beside the centered ground-truth digit, as depicted in the second row of Figure 4. The results in Table 3 are consistent with those of the experiment discussed in Section 4.3.1. DANCE-A shows significantly better results than DANCE-R; therefore, data augmentation can help handle domain gaps. However, the proposed method outperforms both DANCE-R and DANCE-A regardless of the value of the hyperparameter λ_1, i.e., for both λ_1 = 0.1 and λ_1 = 0.03. Even though the baseline DANCE-A uses the same data augmentation as the proposed method, its results are, on average, approximately five percentage points lower than those of the proposed method.

Analysis of the Target Domain Contrastive Loss Function
The results of both experiments in Sections 4.3.1 and 4.3.2 validated that (1) augmenting synthetic source domain data does not generalize sufficiently to the target domain even within universal domain adaptation algorithms, (2) the added contrastive loss helps generalize the feature space to cover the target domain, and (3) the feature network G can be shared for target contrastive learning for efficiency. Figure 5 depicts the effect of batch size on the MNIST to SVHN task, where 'SO', 'DANCE-A', and Proposed (λ_1 = 0.3) are the same as those in Table 3. In all compared methods, both baseline and proposed, the accuracy decreased as the batch size increased in the closed-set, partial, and open-set domain adaptations: the larger the batch, the more significant the impact of the target domain in the feature space, which makes it difficult to transfer supervised classification features from the source domain to the target domain.
I also analyzed the effect of the target domain contrastive learning loss function on domain adaptation. To do this, I added the target domain contrastive learning loss to 'SO', which applies the supervised classification loss only to the source domain; this variant is marked 'SO+Target_Contrastive'. In Figure 5, the cyan triangles consistently demonstrate better accuracy than the red squares regardless of batch size; the corresponding accuracy improvement rates are represented by green bars. This means that the additional target domain contrastive loss helps the model adapt to the target domain. Thus, the proposed method, which adds the contrastive loss to the baseline DANCE-A, achieved the best results by efficiently learning the target domain; the accuracy improvement rates relative to the baselines are represented by yellow bars. As the batch size increased, the baseline failed to optimize the model parameters on the closed-set and partial domain adaptations; therefore, those entries remain blank. Figure 5. Effect of batch size on MNIST to SVHN, where 'SO', 'DANCE-A', and the proposed method are the same as in Table 3. 'SO+Target_Contrastive' denotes the addition of the target contrastive learning loss to 'SO'. The accuracy improvement rates are represented by green and yellow bars; where optimization was unstable, the entry remains blank.

Conclusions
Universal domain adaptation (UDA) is an important research topic for the efficient use of trained models across various imaging sensors. I found that the baseline method has no direct training loss for improving the discriminative power in the target domain, and I hypothesized that, when the source and target domains are too different, the feature extraction network does not obtain discriminative feature representations for the target domain. To overcome the limitations of synthetic data, the information in the target domain data should be fully utilized to learn feature representations; to this end, I used contrastive learning [27] in the target domain. The experimental results validated that the proposed method significantly improves the UDA task. In addition, the target domain feature extraction network is shared with the source domain classification task, avoiding unnecessary computational increases. The proposed method can be easily extended to support efficient model training in various applications, such as the imaging sensors of self-driving cars.
Author Contributions: J.C. conceived the idea, and he designed and performed the experiments; he also wrote the paper. All authors have read and agreed to the published version of the manuscript.