Unsupervised Domain Adaptation with Coupled Generative Adversarial Autoencoders

When large-scale annotated data are not available for certain image classification tasks, training a deep convolutional neural network model becomes challenging. Some recent domain adaptation methods try to solve this problem using generative adversarial networks and have achieved promising results. However, these methods are based on a shared latent space assumption and do not consider the situation in which shared high-level representations across domains do not exist or are not as ideal as assumed. To overcome this limitation, we propose a neural network structure called coupled generative adversarial autoencoders (CGAA) that allows a pair of generators to learn the high-level differences between two domains by sharing only part of the high-level layers. Additionally, by introducing a class-consistency loss calculated by a stand-alone classifier into the generator optimization, our model is able to generate class-invariant style-transferred images suitable for classification tasks in domain adaptation. We apply CGAA to several domain-transferred image classification scenarios including several benchmark datasets. Experimental results show that our method achieves state-of-the-art classification results.


Introduction
Large-scale well-annotated datasets such as Microsoft COCO [1], ImageNet [2] and KITTI [3] have played a vital role in the recent success of deep learning based models on computer vision tasks such as image classification, target detection, semantic segmentation and so on. However, models trained with large datasets still cannot generalize well to novel datasets when these datasets have different feature distributions. The typical solution is to further fine-tune these models on task-specific datasets. However, creating such datasets can be expensive and time-consuming. Unsupervised domain adaptation offers a solution to this problem by learning a mapping between a labeled dataset (source domain) and an unlabeled dataset (target domain), or by learning domain-invariant features. Conventional domain adaptation approaches for image classification are usually developed in two separate steps: designing and extracting fixed features, and then training models to reduce the differences in either the marginal distributions or the conditional distributions between domains [4][5][6][7]. Recent deep learning based domain adaptation approaches avoid the difficulty of feature design by extracting features automatically through convolutional neural networks [8][9][10][11][12][13].
Among all kinds of deep neural network based domain adaptation approaches, the generative adversarial network (GAN) [14] has become a popular branch. A typical GAN trains a generator and a discriminator to compete against each other. The generator is trained to produce synthetic images as real as possible, whereas the discriminator is trained to distinguish the synthetic and real images. When applying GANs to domain adaptation for image classification, there are two major types of approaches. The first type trains a GAN to generate unlabeled target domain images, thus enlarging the data volume to train a more robust image classifier [15][16][17]. In these methods, the training strategy of the final classifier needs to be carefully designed since the newly generated images have no labels. Approaches of the other type generate labeled target domain images directly by transferring the source domain images into target domain style and have achieved some state-of-the-art results, such as CoGAN [18] and UNIT [19]. These methods are based on the shared latent space assumption, which assumes that the differences between the source domain and the target domain are primarily low-level, and that the two domains share a common high-level latent space. This assumption works well for simple scenarios such as digit adaptation between MNIST [20] and USPS [21] but faces challenges when the semantic features are more complex. When a shared high-level latent space in different domains does not exist or such a latent space is not as ideal as assumed, these methods will fail [18].
In this paper, we propose an unsupervised domain adaptation method for image classification that combines generative adversarial networks with autoencoders. We call the proposed network architecture Coupled Generative Adversarial Autoencoders (CGAA). Our work is perhaps most similar to CoGAN and UNIT, but we address the aforementioned shortcomings of these methods with the following designs. CGAA consists of a pair of generative adversarial networks (GANs) and a domain adaptive classifier. The architecture of the generator in each GAN is based on the autoencoder. During training, part of the layers in the generators are forced to share their weights, which gives our model the ability to learn the domain transformation in an unsupervised manner and generate synthetic target domain images with labels. By decoupling the highest-level layer, we give our model the capacity to tolerate the differences in high-level features between the domains. The classifier provides a class-consistency loss to help the generator produce images more suitable for the classification task in domain adaptation. The main contributions of this work are:

•
We propose an unsupervised domain adaptation method for image classification. Our method trains a pair of coupled generative adversarial networks in which each generator has an encoder-decoder structure.

•
We force part of the layers in the generators to share weights during training so as to generate labeled synthetic images, and decouple the highest-level layer to allow for different high-level representations.

•
We introduce a class-consistency loss into the GAN training, calculated from the output of a stand-alone domain adaptive classifier. It helps the generator produce images better suited for domain adaptation.

Related Work
The goal of unsupervised domain adaptation is to transfer knowledge from a labeled source dataset to a target dataset where labeled data is not available. Recent studies have tried to learn transferable features with deep neural networks. The DDC method [11] learned domain invariant representations by introducing an adaptation layer and a Maximum Mean Discrepancy (MMD) domain confusion loss. The work in [22] extended the MMD to jointly mitigate the gaps of marginal and conditional distributions between source and target domain. The DAN method [9] embedded task-specific layers in a reproducing kernel Hilbert space to enhance the feature transferability. The DANN method [8,23] suggested that the features suitable for domain adaptation should be both discriminative and domain-invariant and added a domain classifier at the end of the feature extractor to learn domain invariant features. CAN [24] suggested that some characteristic information from target domain data may be lost after learning domain-invariant features with DANN. Therefore, CAN introduced a set of domain classifiers into multiple blocks to learn domain-informative representations at lower blocks and domain-uninformative representations at higher blocks. The work of [25] proposed to learn a representation that transferred the semantic structure from a well labeled source domain to the sparsely labeled target domain by adding a domain classifier and a domain confusion loss. The DRCN [12] proposed a model with two pipelines: the first was label prediction for the source domain and the second was data reconstruction for the target domain. ADDA [26] learned the representation of the source domain and then mapped the target data to the same space through a domain-adversarial loss.
Other works have attempted to apply GANs [14] to image-to-image translation and domain adaptation. The "pix2pix" framework [27] used a conditional generative adversarial network to learn a mapping from input to output images with paired images. CycleGAN [28] learned the mapping without paired training examples using a cycle-consistency loss. The method in [29] used a GAN to translate unpaired images between domains while keeping high-level semantic information aligned by introducing an attention-consistency loss. CoGAN [18] learned a joint distribution of images without corresponding supervision by training two GANs to generate the source and target images respectively given the same noise input and tying the high-level layer parameters of the two GANs. Instead of generating images from noise vectors, PixelDA [30] generated style-transferred images conditioned on the source images. CoGASA [31] integrated a stacked autoencoder with CoGAN, and UNIT [19] proposed an image-to-image translation framework based on CoGAN and VAE [32].

Proposed Approach
In this section, we introduce the model structure of CGAA and explain our training strategy. As illustrated in Figure 1, CGAA contains seven sub-networks: two image encoders ENC S and ENC T, two image decoders DEC S and DEC T, two adversarial discriminators D S and D T, and a classifier C.

Figure 1. Overview of our model architecture. x S and x T are images from the source and target domain. The encoders ENC S and ENC T are two sequences of convolution layers (including ResNet blocks [33]) that map images to a code in a higher-level latent space; DEC S and DEC T are two sequences of de-convolution layers (including ResNet blocks) that generate images from the outputs of the encoders. The discriminators D S and D T determine whether an image is real or synthesized. During training, we share the weights of the two encoders except for the first and the last layer; similarly, the weights of the decoders are tied except for the first and the last layer. x S→S = DEC S (ENC S (x S )) and x T→T = DEC T (ENC T (x T )) are reconstructed images; x T→S = DEC S (ENC T (x T )) and x S→T = DEC T (ENC S (x S )) are style-transferred images. C is the classifier trained on the source images and the style-transferred source images.
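The four encoder/decoder compositions can be sketched with toy stand-ins; the module sizes below are illustrative placeholders of ours, not the paper's actual layers (those are given in Figure 2):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the four generator sub-networks (single conv layers;
# the real networks are deep stacks with ResNet blocks).
enc_S = nn.Conv2d(3, 8, 3, padding=1)   # ENC_S: source image -> latent code
enc_T = nn.Conv2d(3, 8, 3, padding=1)   # ENC_T: target image -> latent code
dec_S = nn.Conv2d(8, 3, 3, padding=1)   # DEC_S: latent code -> source-style image
dec_T = nn.Conv2d(8, 3, 3, padding=1)   # DEC_T: latent code -> target-style image

x_S = torch.randn(1, 3, 32, 32)  # a source-domain image
x_T = torch.randn(1, 3, 32, 32)  # a target-domain image

x_SS = dec_S(enc_S(x_S))  # reconstruction of the source image
x_TT = dec_T(enc_T(x_T))  # reconstruction of the target image
x_ST = dec_T(enc_S(x_S))  # source image rendered in target style
x_TS = dec_S(enc_T(x_T))  # target image rendered in source style
```

Swapping which decoder consumes which encoder's code is what turns the pair of autoencoders into a style-transfer pair.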

Image Reconstruction and Autoencoder
The encoder ENC S and decoder DEC S constitute an autoencoder for the source domain X S : ENC S maps an input image x S ∈ X S to a code in a latent space and, based on this code, DEC S reconstructs the input image as x S→S. Similarly, ENC T and DEC T constitute an autoencoder for the target domain X T. The aim of these two autoencoders is to reconstruct images as similar as possible to their input images in each domain. We use the mean squared error as the loss function to penalize the differences between inputs and outputs:

L REC S = (1/k) ||x S − x S→S ||²,  L REC T = (1/k) ||x T − x T→T ||²,

where k is the number of pixels in the input and ||·||² is the squared L2-norm.
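A minimal sketch of this per-pixel reconstruction loss (the function name is ours):

```python
import torch

def reconstruction_loss(x, x_rec):
    """L_REC = (1/k) * ||x - x_rec||^2, with k the number of pixels in x."""
    k = x.numel()
    return ((x - x_rec) ** 2).sum() / k

# For an all-zero input "reconstructed" as all ones, every pixel is off by
# exactly 1, so the per-pixel squared error averages to 1.0.
loss = reconstruction_loss(torch.zeros(1, 3, 4, 4), torch.ones(1, 3, 4, 4))
```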

Style Transfer and GAN
Style-transferred synthetic images can be generated by changing the combination of encoders and decoders. More specifically, by letting DEC T take the output of ENC S, and DEC S take the output of ENC T, we are able to transfer the style of images between domains.
When training an autoencoder, element-level penalties such as the squared error are the classic choice. However, as discussed in [34], they are actually not ideal for image generation, and the generated images tend to be blurred. Therefore, we combine the autoencoder with a GAN in our method. By jointly training an autoencoder and a GAN, we can generate better images under the feature-level metric expressed by the discriminator. In CGAA, ENC S, DEC T and D T constitute a generative adversarial network. During training, DEC T takes the output of ENC S, mapping an input source domain image x S into a target domain style synthetic image x S→T, and the discriminator D T is trained to distinguish between synthetic images x S→T and real images x T from the target domain. Similarly, ENC T and DEC S generate synthetic source-style images x T→S conditioned on the target domain images x T, and D S is trained to distinguish between real source domain images x S and synthetic images x T→S. With this pair of GANs, our goal is to minmax the following objective:

min over {ENC S, ENC T, DEC S, DEC T}, max over {D S, D T}:  α(L GAN S + L GAN T ) + β(L REC S + L REC T ),

where α and β are weights that balance the GAN loss and the reconstruction loss. L GAN S and L GAN T represent the GAN losses:

L GAN S = E x S [log D S (x S )] + E x T [log(1 − D S (x T→S ))],
L GAN T = E x T [log D T (x T )] + E x S [log(1 − D T (x S→T ))].

Weight Sharing
Previous methods based on the shared latent space assumption, such as CoGAN [18] and UNIT [19], are able to conduct domain transfer training without paired images in different domains by sharing weights in the generators. They assume that images from different domains only have low-level differences due to noise, resolution, illumination, color, etc., and that a pair of corresponding images in the two domains shares the same high-level concepts. Therefore, the layers responsible for high-level representations are forced to share their weights. However, these methods rely on the existence of shared high-level representations in the two domains. If the high-level semantic features are complex and such shared representations do not exist or are hard to find, these methods will not work well. To this end, our method extends the previous works by sharing only part of the high-level layers and decoupling the rest. More specifically, we do not share the weights of the last layer in the encoder, the first layer in the decoder, and the last two layers in the discriminator, as shown in Figure 2. Under this structure, the generative models will, to some extent, tolerate different high-level representations in different domains.
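This partial weight sharing can be implemented by letting both networks hold references to the same module object for the shared layers. The encoders below are a toy sketch of ours (layer sizes are illustrative): the middle layers are one shared module, while the first and the last (decoupled high-level) layers are private to each domain.

```python
import torch
import torch.nn as nn

def make_coupled_encoders():
    # Shared middle layers: both encoders reference the SAME module, so
    # gradients from either domain update a single set of weights.
    shared = nn.Sequential(
        nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 256, 3, padding=1), nn.LeakyReLU(0.2))
    enc_S = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),     # private first layer
                          shared,
                          nn.Conv2d(256, 256, 3, padding=1))  # private (decoupled) last layer
    enc_T = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1),
                          shared,
                          nn.Conv2d(256, 256, 3, padding=1))
    return enc_S, enc_T

enc_S, enc_T = make_coupled_encoders()
code_S = enc_S(torch.randn(1, 3, 16, 16))  # latent code, shape (1, 256, 16, 16)
```

The decoders and discriminators can be coupled the same way, with the decoupled layers chosen as described above.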

[Figure 2 layer listing, only partially recoverable from the original figure: Decoder (DEC S /DEC T ): ResBlock ×3, dconv-N256-S2-LReLU, dconv-N128-S2-LReLU, dconv-N64-S1-TanH. Encoder (ENC S /ENC T ): conv-N64-S1-LReLU, ..., ResBlock ×3. ResBlock internals: conv-N256-S1, BN, ReLU, conv-N256-S1, BN.]

Figure 2. The network architecture. ENC S and ENC T have the same structure as the encoder shown in this figure; so do the two decoders and the two discriminators. The convolution layer is denoted as conv, the transposed convolution layer (deconvolution layer) as dconv, and the residual block as ResBlock. N means neurons (channels), S means stride, and LReLU means leaky ReLU. BN stands for batch normalization layer and Fc stands for fully connected layer. We share the weights of the dark-color layers in the coupled models during training.
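Following the layer notation of Figure 2, a decoder might be assembled as below. This is our sketch under assumptions: kernel sizes are not given in the figure, and we emit 3 output channels (RGB) from the final tanh layer, which may differ from the figure's channel count for that layer.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """conv-N256-S1 -> BN -> ReLU -> conv-N256-S1 -> BN, plus the skip connection."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

# Decoder roughly following the recovered spec: ResBlocks, then two stride-2
# transposed convolutions with LeakyReLU, then a stride-1 output layer with tanh.
decoder = nn.Sequential(
    ResBlock(), ResBlock(), ResBlock(),
    nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 3, 3, stride=1, padding=1), nn.Tanh())

img = decoder(torch.randn(1, 256, 8, 8))  # an 8x8 code -> 32x32 image in [-1, 1]
```

Each stride-2 transposed convolution doubles the spatial resolution, so the two dconv-S2 layers upsample the latent code by 4x.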

Domain Adapted Classifier
The focus of the unsupervised domain adaptation method described in this paper is to extend to the unlabeled target domain the generalization ability of a classifier originally trained on the source domain. To this end, we train a classifier C with the source domain images and the synthetic target domain images generated by {ENC S , DEC T }. Unlike some other domain adaptation works where the discriminator is modified into a classifier, our classifier has a stand-alone structure, shown in Figure 1, which is easy to detach from the whole network for further training. We do not describe the detailed architecture of the classifier in Figure 2 because it is task-specific. During training, we use the typical cross-entropy loss to optimize C:

L C = E (x S ,y S ) [−y S · log C(x S )] + E (x S ,y S ) [−y S · log C(x S→T )].

In addition, the classifier C plays a second role in CGAA: it takes part in the optimization of the generator {ENC S , DEC T } through a class-consistency loss. When training the generator, C assigns a label ŷ to the generated image x S→T, and the class-consistency loss is defined as the cross-entropy between ŷ and the source label:

L CC = E (x S ,y S ) [−y S · log C(x S→T )],

where y S is the class label of the input x S. The class-consistency loss ensures that the output image x S→T remains class-invariant, which is essential for the classification task in domain adaptation. With L C and L CC, our final optimization objective becomes:

min over {ENC, DEC, C}, max over {D S , D T }:  α(L GAN S + L GAN T ) + β(L REC S + L REC T ) + γL C + λL CC,   (8)

where γ and λ weight the classification and class-consistency terms. The minmax optimization of Equation (8) is carried out in two alternating steps. In the first step, we keep the discriminators and the classifier fixed and optimize the generators, minimizing the reconstruction losses and the class-consistency loss at the same time. In the second step, we keep the generators fixed and optimize the discriminators and the classifier.
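The first alternating step can be sketched as follows. This is our toy sketch, not the paper's code: the sub-network sizes, optimizer, and loss weights (alpha, lam_cc) are illustrative assumptions, and the reconstruction terms are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sub-networks (sizes are illustrative, not the paper's).
enc_S = nn.Conv2d(3, 8, 3, padding=1)
dec_T = nn.Conv2d(8, 3, 3, padding=1)
D_T = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 1), nn.Sigmoid())
C = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 7))  # 7 expression classes

opt_G = torch.optim.Adam(list(enc_S.parameters()) + list(dec_T.parameters()))

def generator_step(x_S, y_S, alpha=1.0, lam_cc=1.0):
    """Step 1 of the alternating scheme: D_T and C are held fixed while
    {ENC_S, DEC_T} is updated with the GAN loss plus the class-consistency
    loss (reconstruction terms omitted here for brevity)."""
    opt_G.zero_grad()
    x_ST = dec_T(enc_S(x_S))                 # style-transferred image
    d_fake = D_T(x_ST)
    loss_gan = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_cc = F.cross_entropy(C(x_ST), y_S)  # C's prediction must match the source label
    loss = alpha * loss_gan + lam_cc * loss_cc
    loss.backward()
    opt_G.step()                             # only the generator's weights move
    return loss.item()

loss = generator_step(torch.randn(2, 3, 8, 8), torch.tensor([0, 3]))
```

The second step mirrors this: the generators are frozen and separate optimizers update the discriminators (real vs. synthetic) and the classifier (cross-entropy on labeled and style-transferred source images).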

Experiment Results and Evaluation
To evaluate our method, we conduct experiments on various domain adaptation scenarios and compare our results with other recently reported methods.

Facial Expression Recognition
We first evaluate our method on cross domain facial expression recognition task with three publicly available facial expression datasets: JAFFE, MMI and CK+. The images in these datasets have different resolutions and illuminations, and the subjects vary in gender, age and cultural background. Figure 3 shows some of the sample images from these datasets.
JAFFE dataset [35,36] contains 213 facial expression images. These images are from 10 Japanese females with seven expressions (angry, disgust, fear, happy, sad, surprise and neutral). We use all of the images in JAFFE in our experiments.
MMI dataset [37,38] consists of over 2900 videos as well as still images of 75 subjects, in which 235 videos have emotional labels. We choose the peak frame of each video that has the six basic emotions (angry, disgust, fear, happy, sad and surprise) and the first frame of these videos as neutral emotion images. In total we use 242 images from MMI.
CK+ dataset [39] consists of 593 image sequences from 123 subjects, 327 sequences of which have emotional labels. The dataset labels seven expressions: angry, disgust, fear, happy, sad, surprise and contempt. We only choose the peak frame from the sequences labelled with the first six expressions. In addition, we choose the first frame from some of the sequences as neutral samples. In total we use 363 images from CK+.

In this experiment, the network structure of our method is shown in Figure 2. Since the facial expression datasets are rather small, to avoid over-fitting we use the AlexNet model pre-trained on ImageNet as the base model of the classifier and fine-tune it in our experiment. Table 1 shows the classifier's accuracy tested on the target domain. In Table 1, the source model is trained with only the labeled source dataset. As for the adapted model, to evaluate the effectiveness of our proposed method, we train with three different settings. In all three settings, the parameters of the low-level layers in the encoders, decoders and discriminators are not shared; these are the first layer of the encoder, the last layer of the decoder and the first layer of the discriminator. As for the high-level layers, the first setting shares the weights of all of these layers, a structure similar to UNIT [19]. The second setting decouples the high-level layers in ENC and DEC, meaning we do not share the last layer of the two encoders and the first layer of the two decoders. The last setting decouples the high-level layers in ENC, DEC and D, which is the setting described in Figure 2; a decoupled D means we do not share the last two layers of the discriminators. Figure 4 shows examples of the style-transferred images generated by UNIT and by the last setting of CGAA.
Figure 4 shows some of the style-transferred images generated in the C→J domain adaptation under our last network setting. The experiment results in Table 1 show that our CGAA model with partially-decoupled high-level layers outperforms the model with all the high-level layers tied in all six domain adaptations. In addition, we find that decoupling the encoder-decoder leads to a significant increase in recognition accuracy, whereas decoupling the discriminator has only a small impact on the result. Therefore, in the other experiments described in this paper, we use the last setting in Table 1 as CGAA for evaluation. We visualize the feature distributions of the two domains before and after the adaptation (J→C), as shown in Figure 5. Figure 5 shows that our model brings the feature distributions of the two domains much closer together, which leads to higher classification accuracy. To further evaluate the effectiveness of our method, we compare the confusion matrices of the class-wise classification accuracy on the target domain before and after adaptation.
As shown in Figure 6, the blue matrices are obtained when only source domain images are used for training, the green ones when UNIT is used for domain adaptation, and the red ones when our method is used for domain adaptation. When trained on the source domain only, the model has difficulty separating Angry and Neutral between CK+ and JAFFE (see Figure 6a,b), and also cannot separate Angry and Sad between MMI and CK+ (see Figure 6c,d). When trained on MMI and tested on JAFFE, the model misclassifies a lot of images as Surprise (see Figure 6e), whereas when trained on JAFFE and tested on MMI, the model misclassifies most of the images as Angry (see Figure 6f). These misclassifications are caused by the semantic gap between domains. Figure 6 shows that our domain adaptation method can help the model cross the semantic gap between domains and increase the class-wise classification accuracies.

Office Dataset
In this experiment, we evaluate our method on the Office dataset [41]. This is the most popular benchmark dataset for object recognition in the domain adaptation field. It has 4110 images across 31 classes of everyday objects in three domains: amazon (A), webcam (W), and dslr (D). The amazon domain contains product pictures with no background from the Amazon website, while images in webcam and dslr contain similar real-world objects at different resolutions. Following previous domain adaptation work [26], we use ResNet-50 as the model structure for the classifier. Other sub-parts of the model are the same as those shown in Figure 2. We adopt the common "fully-transductive" training protocol [8,9,26]. We also implement two other methods based on the shared latent space assumption, CoGAN and UNIT, in PyTorch. Note that in the original papers of these two methods, the classifier is obtained by attaching a softmax layer to the last hidden layer of the discriminator, whereas in our implementation we train a stand-alone classifier with the same structure as in our method for a fair comparison. The experiment results in Table 2 show that our method is competitive and achieves state-of-the-art results compared with previously reported methods, except for D→W. The method proposed in this paper aims to solve the problem of domain adaptation when the high-level features in the two domains are different and a shared high-level latent space cannot be established. As shown in Figure 7, the images of webcam (W) and dslr (D) are actually very similar, differing only in illumination and image resolution; in other words, their high-level features are the same. Therefore, our method does not achieve better results than other methods on this particular task, but obtains better results on other, more challenging tasks with obvious high-level feature differences, such as W→A and D→A.

Office-Home Dataset
Finally, we test our model on the Office-Home dataset [13]. This is a newer, larger and more challenging dataset compared to the classic Office dataset. It has about 15,500 images across four domains, each containing images from 65 classes of everyday objects. As shown in Figure 8, the four domains are Art, Clipart, Product and Real-World. Images in Art are artistic depictions of objects and Clipart contains clipart images; Product consists of images of objects without background and Real-World consists of images of objects captured with a camera. We conduct this experiment with the same settings as the classic Office dataset experiment. Table 3 shows that our results outperform competitors in all of the domain adaptations in this experiment.

Table 3. Recognition accuracy evaluation for domain adaptation on the Office-Home dataset. Art (A), Clipart (C), Product (P), Real-World (R). A→C indicates A is the source dataset and C is the target dataset. Bold numbers are the best results.

Conclusions
In this paper, we proposed an unsupervised domain adaptation method called coupled generative adversarial autoencoders. The weight-sharing training strategy proposed in this paper extends the shared high-level latent space assumption and improves the tolerance of the model to the differences in high-level semantic features between domains. Under this training strategy, our model can generate style-transferred images with unpaired images in the two domains and domain adaptation is done by training a classifier with the target-style images generated from the source images. With this proposed method, we achieve state-of-the-art experiment results on various domain adaptation scenarios including popular benchmark datasets.
Author Contributions: X.W. (Xiaoqing Wang) performed the experiments, analyzed the data and wrote the paper; X.W. (Xiangjun Wang) contributed the GPU used in the experiments and modified the paper.