Adversarial Hard Attention Adaptation

Abstract: Domain adaptation is critical for transferring invaluable source domain knowledge to the target domain. In this paper, for a particular visual attention model, namely hard attention, we consider adapting the learned hard attention to the unlabeled target domain. To tackle this hard attention adaptation problem, a novel adversarial reward strategy is proposed to train the policy of the target domain agent. In this adversarial training framework, the target domain agent competes with a discriminator which takes as input the attention features generated by the agents of both domains and tries its best to distinguish them; the target domain policy is thus learned to align the local attention feature with its source domain counterpart. We evaluated our model on benchmark cross-domain tasks, including the centered digits datasets and the enlarged non-centered digits datasets. The experimental results show that our model outperforms ADDA and other existing methods.


Introduction
In recent years, deep convolutional neural networks have achieved state-of-the-art performance in many visual tasks, such as image classification, object detection and semantic segmentation [1,2]. However, in many practical cases, there exist problems of distribution mismatch [3] and domain shift [4] between different visual tasks, which result in poor generalization to the new task. To some extent, transfer learning can address this challenge by fine-tuning the model in the target domain. However, such transfer learning is hindered when only a few or even no labels are available in the target domain.
Unsupervised domain adaptation has a wide range of applications, such as speech recognition, natural language processing and computer vision. For example, Ref. [5] proposed a Bayesian adaptive learning framework to obtain the maximum a posteriori (MAP) estimate of the hidden activation function parameters in a deep-model-based automatic speech recognition (ASR) system, addressing unsupervised adaptation in the speech recognition field. Moreover, for visual tasks, several unsupervised domain adaptation methods have been proposed to reduce the harmful effects of domain shift and distribution mismatch. In particular, since the emergence of the seminal work on generative adversarial networks (GAN) [6], adversarial adaptation methods have been attracting great attention [7][8][9][10][11]; they exploit the advantages of pitting two networks against each other to reduce the distribution difference between the source and target domains. Specifically, the adversarial discriminative domain adaptation (ADDA) approach [12] first trains a feature encoder S and a classifier C using the label information in the source domain; then a critic network D and a target feature encoder T are trained by competing with each other via adversarial training; finally, the classifier C can correctly classify target images because the target domain features have been mapped into the common space of the source domain.
Previous adversarial adaptation works mainly aim to match the global features extracted from entire images across domains. However, few works regarding attention adaptation have been studied, even though attention methods have dominated the state of the art in vision tasks [13][14][15][16][17][18][19][20][21]. It is worth noting that, for this interesting attention adaptation challenge, Ref. [22] proposed two types of transferable attention, local and global, by considering the transferability of different regions of images. In this paper, we focus on the transferability of hard attention, inspired by [23,24], which also has wide applications such as long document classification [25] and weakly labeled sensor data recognition [26]. Notably, a hard attention model usually consists of a recurrent structure that selects the most discriminative local features from image patches; hard attention is therefore non-differentiable due to its sampling and cropping operations, which poses a challenge to adapting such hard attention from the source domain to the target domain.
To tackle the challenge of hard attention adaptation, in this paper we propose a novel adversarial hard attention adaptation framework, shown in Figure 1. Leveraging the popular ADDA [12] framework, our model mainly consists of three components, namely a source agent S, a target agent T, and a discriminator D. Different from the classical ADDA framework, to overcome the non-differentiable nature of hard attention, the most important contribution of this work is an adversarial reward strategy that trains the target agent T via reinforcement learning. The target agent T and the discriminator D naturally constitute adversarial training counterparts. On one hand, the target domain policy π_T is trained to extract a target attention feature T(x_t) close enough to the source model's; on the other hand, the discriminator D is trained to distinguish the target attention feature T(x_t) from the source attention feature S(x_s). Figure 1. Illustration of the adversarial hard attention adaptation framework. Source images and target images are represented as the source environment E_S and the target environment E_T respectively in this reinforcement learning framework. Policy π_S is learned with the classical hard attention method in the source domain. Policy π_T is trained via our proposed adversarial reward strategy to extract a target attention feature T(x_t) close enough to the source model's; the discriminator D is trained to distinguish the target attention feature from the source attention feature.
The rest of the paper is organized as follows: Section 2 reviews related work on domain adaptation and attention methods. In Section 3, we propose the adversarial hard attention adaptation framework and describe how to train it in detail. In Section 4, we evaluate the proposed framework on standard cross-domain datasets and further assess its capability on cross-domain non-centered digits datasets. Section 5 concludes the paper and discusses future work.

Domain Adaptation
Recent unsupervised domain adaptation methods can be divided into two main categories: instance-based adaptation and feature representation adaptation. Instance-based methods directly compute re-sampling weights by matching the distributions of the source and target domains in a non-parametric way [27,28]. For the latter category, many studies attempt to reduce the distribution discrepancy between the source and target domains by projecting the two domains into a common feature space [29][30][31][32]. For example, the deep domain confusion method (DoC) [31] uses the Maximum Mean Discrepancy (MMD) [33] loss to learn a representation that transfers from the labeled source domain to the unlabeled target domain.
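To make the discrepancy measure concrete, the following is a minimal NumPy sketch of a biased MMD estimate with a Gaussian kernel between two feature batches. The function name and the bandwidth choice are our own illustrative assumptions, not the exact formulation used in [31,33]:

```python
import numpy as np

def mmd_gaussian(xs, xt, sigma=1.0):
    """Biased MMD^2 estimate with a Gaussian kernel between two batches
    xs (source features) and xt (target features), each of shape (n, d)."""
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then Gaussian kernel values.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return kernel(xs, xs).mean() + kernel(xt, xt).mean() - 2 * kernel(xs, xt).mean()
```

Two identical batches yield an MMD of zero, while batches drawn from shifted distributions yield a larger value, which is the quantity such feature-alignment methods minimize.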
Recently, enlightened by generative adversarial networks (GAN) [6], approaches utilizing adversarial objectives have been proposed to align the distributions across domains [7][8][9][10][11][12]. Specifically, Ref. [12] introduced a novel adversarial domain adaptation framework in which the discriminator D and the target domain feature generator T compete with each other, so that T learns to generate a feature distribution close to its source counterpart. Inspired by CycleGAN [34], Ref. [35] proposed a discriminatively-trained cycle-consistent adversarial domain adaptation model that performs representation adaptation at both the pixel level and the feature level. Unlike the usual GAN-style domain adaptation approaches, Ref. [11] utilizes a very different adversarial training technique for unsupervised domain adaptation, in which a classifier with Dropout regularization detects the classification boundary, and the target generator is encouraged to generate discriminative features far away from that boundary.

Attention Mechanism
In recent years, the attention mechanism has played an important role in many fields such as machine translation [13], speech recognition [36], and image captioning [15]. Like the human visual system, attention does not need to process the whole image, but only its key areas. For a large image, focusing on the salient areas makes it possible to process fewer pixels and save computing resources. Attention models can be divided into two categories: soft (deterministic) and hard (stochastic).
For the former, soft attention computes a weight vector as the attention probability, which is differentiable and can be embedded in a vanilla model for end-to-end training. Soft attention is used in a variety of visual tasks, such as image captioning [16][17][18] and visual question answering (VQA) [19][20][21].
For the latter, hard attention is non-differentiable due to its sampling and cropping operations, and thus it is usually optimized by reinforcement learning [37]. The recurrent attention model (RAM) [23] and its extension, the deep recurrent attention model (DRAM) [24], are representative hard attention models; both contain a controller module that decides where to glimpse the next image patch. Compared to the usual convolutional neural network models for visual tasks, these approaches significantly reduce the computational complexity for large images. Furthermore, Ref. [14] proposed a hard-attention-based image captioning method, which was the state-of-the-art model for the COCO image captioning task in 2015. Besides successes in visual tasks, hard attention models have also been introduced into natural language processing [25] and weakly labeled human activity recognition [26].
Although few works regarding attention adaptation have been studied, it is worth noting that [22] presents transferable attention for domain adaptation (TADA), which focuses on two complementary forms of transferable attention, local and global. On one hand, the local attention generated by region-level domain discriminators highlights transferable regions; on the other hand, the global attention generated by a single image-level domain discriminator highlights transferable images.

Model
For unsupervised domain adaptation, it is assumed that a source image x_s and its corresponding label y_s can be accessed from the richly labeled source domain {X_s, Y_s}; on the other hand, from the target domain {X_t}, a target image x_t can be accessed without any label information. In our adversarial hard attention adaptation work, there are two reinforcement learning agents, S and T, which correspond to the source domain agent and the target domain agent respectively; S(x_s) denotes the attention feature extracted by S, and T(x_t) denotes the attention feature extracted by T. The policies of the two agents are denoted π_S and π_T.

Problem Formulation
For classical hard attention models, for example the recurrent attention model (RAM) [23], the core idea is to train a reinforcement learning agent that interacts with a visual environment to learn the policy π_S in the source domain. Such a learned policy π_S can safely be treated as the target domain policy π_T only when there is no significant discrepancy between the source domain and the target domain. In this work, however, we break this assumption; that is to say, there is a large discrepancy between the source domain and the target domain. For example, in one of the following domain adaptation tasks, we consider adapting the hard attention policy π_S trained on the SVHN dataset to a target domain attention policy π_T suitable for the very different MNIST dataset.
Unlike ADDA [12], in which the feature encoder S utilizes only a conventional CNN architecture, a hard attention model usually contains a recurrent structure that selects the most discriminative local features from image patches according to its learned policy π_S. It is well known that hard attention is non-differentiable due to such sampling and cropping operations, which makes it challenging to directly apply the ADDA framework to hard attention adaptation. For hard attention adaptation, not only should the local feature extractor be adapted to the source domain, but, most importantly, the target domain attention policy π_T should be aligned with the source domain attention policy π_S. The critical problem for hard attention adaptation is therefore how to train the target domain attention policy π_T without any label information. In this work, we propose to train the target domain policy π_T via a novel adversarial reward strategy in which π_T is encouraged by a positive reward if the collected hard attention feature is close enough to the source domain attention feature. Since our work leverages the seminal ADDA framework but introduces such a novel adversarial reward strategy, it is referred to as adversarial hard attention adaptation.

Adversarial Hard Attention Adaptation
In this adversarial hard attention adaptation work, we mainly focus on the RAM model [23], which consists of four subnetworks as shown in Figure 2: the Glimpse network, the Localization network, the Context network, and the Controller network. To apply RAM to domain adaptation, following ADDA [12], on one hand a discriminator D is used to distinguish which agent the attention feature was extracted by, the source or the target. On the other hand, the target agent T tries to generate a "real" source domain attention feature to fool the discriminator D even though the images come from the target domain. The discriminator D and the target agent T thus naturally constitute adversarial training counterparts. The discriminator receives attention features from both domains and predicts which domain each comes from. In this sense, the discriminator D plays the role of a binary classifier without requiring any labeled target samples. Naturally, the discriminator D can be trained via the standard discriminator loss function L_advD as in ADDA [12], shown in Equation (1).
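As a concrete illustration, the standard discriminator objective can be sketched in NumPy as the binary cross-entropy over the discriminator's outputs for source and target attention features. The function name and the epsilon for numerical stability are illustrative assumptions; the actual discriminator is the network described in the Experiments Setup section:

```python
import numpy as np

def disc_loss(d_source, d_target, eps=1e-8):
    """Standard GAN-style discriminator loss in the spirit of Equation (1).
    d_source and d_target are D's predicted probabilities that a feature is
    from the source domain, evaluated on batches of S(x_s) and T(x_t)."""
    return -(np.log(d_source + eps).mean() + np.log(1.0 - d_target + eps).mean())
```

Minimizing this loss pushes D toward assigning high source-probability to S(x_s) and low source-probability to T(x_t); an undecided D (outputting 0.5 everywhere) incurs a loss of 2 log 2.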
For the target agent T, since our adaptation work follows ADDA [12], T is cloned from the source agent S, and thus the network structure of T is the same as that of S, shown in Figure 2. Because the target images do not have labels, T cannot be trained in the same way as S. Moreover, owing to the non-differentiable nature of the Localization network of T (due to the sampling and cropping operations), the target agent T must also be trained via reinforcement learning. Here a novel adversarial reward strategy is proposed to train the target agent T. Specifically, the discriminator D is leveraged as the critic to classify whether T(x_t) comes from the target domain or the source domain. If T(x_t) is wrongly classified by the discriminator D as an attention feature from S, a positive reward +1 is given to the target agent T, because T is encouraged to beat D in this adversarial training framework. Otherwise, zero reward is given to T. At each step, the reward is given after the target agent T takes a glimpse following its current policy π_T, and the goal of the agent is to maximize the sum of the rewards: R_T = ∑_{i=1}^{T} r_i. Note that, to simplify the reward calculation, following the reward strategy of RAM [23], the positive reward +1 is only given after the T-th glimpse is taken; that is, r_i = 0 (i = 1, . . . , T − 1) and r_T = 1 if T(x_t) is recognized as a source domain attention feature by D. As Figure 2 shows, the target agent T is composed of not only the non-differentiable part, namely the Localization network, but also the differentiable parts, namely the Context network, the Glimpse network, and the Controller network.
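The delayed adversarial reward described above can be sketched as follows. The helper takes D's source-probability for the final attention feature T(x_t) and the number of glimpses; the function and argument names, and the 0.5 decision threshold, are our own illustrative choices:

```python
import numpy as np

def adversarial_rewards(d_prob_source, n_glimpses):
    """Delayed adversarial reward: r_i = 0 for all intermediate glimpses,
    and the final reward is 1 only if the discriminator (wrongly)
    classifies the target attention feature as a source-domain feature."""
    rewards = np.zeros(n_glimpses)
    if d_prob_source > 0.5:      # D is fooled, so T earns the reward
        rewards[-1] = 1.0
    return rewards
```

The target agent therefore only receives credit when its whole glimpse trajectory produces a feature that passes as source-domain, which is exactly the signal REINFORCE needs to improve π_T.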
To adversarially train the target agent T against the discriminator D, the loss function of T also consists of two parts, as shown in Equation (2). Here L_logllT, referring to Equation (3), represents the log likelihood optimized for the non-differentiable part of T, and L_advT, referring to Equation (4), represents the classical adversarial loss optimized for the differentiable parts of T. Figure 3 shows the complete framework of our proposed adversarial hard attention adaptation. We discuss the training details of this attention adaptation model in the next subsection. Through this kind of adversarial training, the hard attention can be successfully transferred to the target domain.
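Under the usual REINFORCE formulation, the two parts of the target loss can be sketched as below. Here log_pi holds the log-probabilities of the sampled glimpse locations under π_T; the names and the scalar-reward simplification are illustrative, and the paper's exact Equations (2)-(4) are not reproduced:

```python
import numpy as np

def target_loss(log_pi, total_reward, d_prob_source, eps=1e-8):
    """Sketch of L_T = L_logllT + L_advT:
    - L_logllT: REINFORCE surrogate for the non-differentiable
      Localization network, -R * sum_i log pi_T(l_i).
    - L_advT:  generator-style adversarial loss for the differentiable
      parts, -log D(T(x_t))."""
    l_logll = -total_reward * np.sum(log_pi)
    l_adv = -np.log(d_prob_source + eps)
    return l_logll + l_adv
```

When the adversarial reward is zero the REINFORCE term vanishes and only the adversarial loss drives the differentiable parts; a positive reward additionally reinforces the sampled glimpse locations.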

Training Procedure
The complete training procedure of unsupervised adversarial hard attention adaptation consists of the following three stages.
Stage 1: Referring to Figure 3a, we aim to pre-train the source agent S and the classifier C in the source domain for K-class classification. Training S and C is essentially the same as training the classical RAM, that is, by minimizing a hybrid loss function L_S, shown in Equation (5). L_logllS, referring to Equation (6), represents the log likelihood loss optimized for the non-differentiable part of S. L_xent, referring to Equation (7), represents the classical cross-entropy loss optimized for the differentiable parts of S and the classifier.
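The stage-1 hybrid loss can be sketched as follows, using RAM's classical reward of +1 for a correct prediction. The function name, the eps term, and the per-sample scalar form are our own simplifications of Equations (5)-(7):

```python
import numpy as np

def source_loss(log_pi, class_probs, label):
    """Sketch of L_S = L_logllS + L_xent for one sample:
    - L_xent:   cross-entropy of the classifier output for the true label.
    - L_logllS: REINFORCE term, rewarded (+1) when the prediction is
      correct, as in the classical RAM reward strategy."""
    l_xent = -np.log(class_probs[label] + 1e-8)
    reward = 1.0 if np.argmax(class_probs) == label else 0.0
    l_logll = -reward * np.sum(log_pi)
    return l_xent + l_logll
```

Note the structural symmetry with the target loss: the cross-entropy term plays the role that the adversarial loss plays in stage 2, while the REINFORCE term is driven by a classification reward rather than the adversarial reward.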
Stage 2: This stage performs the adversarial unsupervised domain adaptation, referring to Figure 3b. By alternating between optimizing the discriminator D and the target agent T, D tries to distinguish which domain the attention feature comes from, where "real" means from the source domain, and T tries to make its attention feature as "real" as possible. Specifically, D is optimized by minimizing its loss function L_advD, shown in Equation (1), and T is optimized by minimizing its loss function L_T, shown in Equation (2).

Stage 3: In the previous two stages, the source agent S, the source classifier C, and the target agent T have been trained accordingly. To evaluate the adapted hard attention extracted by the target agent T, following ADDA [12], the target attention feature T(x_t) is simply classified by the source classifier C trained in the source domain in stage 1. Figure 3c illustrates this evaluation stage.
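The stage-2 alternation can be summarized as the following schematic loop, where update_D and update_T are placeholder callables standing in for a single optimizer step on L_advD and L_T respectively (this is a scheduling sketch, not the actual training code):

```python
def adversarial_adaptation(update_D, update_T, n_iters):
    """Stage-2 schematic: alternate one discriminator step (minimizing
    L_advD) with one target-agent step (minimizing L_T)."""
    history = []
    for _ in range(n_iters):
        history.append(("D", update_D()))  # D learns to tell domains apart
        history.append(("T", update_T()))  # T learns to fool D
    return history
```

A 1:1 update ratio is assumed here; in practice the balance between discriminator and generator updates is a tunable choice in adversarial training.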

Experiments
We evaluated our proposed adversarial hard attention adaptation model on popular unsupervised domain adaptation datasets. We first focused on the typical domain adaptation tasks between three centered digits datasets: SVHN [38], MNIST [39] and USPS. To further evaluate the capability of hard attention adaptation, we then explored a challenging adaptation task from the enlarged non-centered SVHN dataset to the enlarged non-centered MNIST dataset, which can be regarded as weak datasets. For both types of experiments, ADDA [12] was used as the baseline adaptation model.

Experiments Setup
We describe the specific network configuration and hyper-parameters of our model as follows. Hard Attention: Since the hard attention model considered in our work is RAM [23], the network structure for both the source domain and the target domain follows the protocol used in [23]. Specifically, the Glimpse network is a two-layer fully connected network with 128 and 256 hidden units respectively, each followed by a ReLU activation function. The Localization network consists of a fully connected hidden layer that emits the location tuple l_t. The recurrent structure of the Controller is an LSTM [40], and the size of the LSTM cell is set to 256. Different from the original RAM structure, a Context network is used to provide a global overview of the image as the initial state of the Controller network. The Context network consists of 3 convolutional layers with pooling layers; the convolution kernel size is 5 for each layer, and the numbers of filters are 32, 64 and 64 respectively.
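The Glimpse network's fully connected stack, with the sizes above, amounts to the following NumPy sketch. Random weights stand in for learned parameters, and the exact layer wiring in [23] (e.g. how location and glimpse features are combined) is not reproduced:

```python
import numpy as np

def glimpse_network(glimpse_flat, rng=np.random.default_rng(0)):
    """Two fully connected layers with ReLU: flattened glimpse -> 128 -> 256.
    Weights are random placeholders; in the model they are learned."""
    w1 = rng.normal(0, 0.1, (glimpse_flat.shape[-1], 128))
    w2 = rng.normal(0, 0.1, (128, 256))
    h = np.maximum(glimpse_flat @ w1, 0)   # ReLU
    return np.maximum(h @ w2, 0)           # 256-dim glimpse feature
```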
Discriminator: The discriminator D in this adversarial training framework is a three-layer fully connected network with 128 hidden units in each of the first two layers, each followed by a Leaky-ReLU activation function. The last layer predicts the probability that the input attention feature comes from the source domain.
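A NumPy sketch of this discriminator follows; random placeholder weights, the Leaky-ReLU slope of 0.2, and the sigmoid output are our illustrative assumptions:

```python
import numpy as np

def discriminator(feature, rng=np.random.default_rng(1)):
    """Three fully connected layers: feature -> 128 -> 128 -> 1.
    Leaky-ReLU on the two hidden layers, sigmoid on the output, giving
    the probability that the feature comes from the source domain."""
    def leaky_relu(x, alpha=0.2):
        return np.where(x > 0, x, alpha * x)
    dims = [feature.shape[-1], 128, 128, 1]
    h = feature
    for i in range(3):
        w = rng.normal(0, 0.1, (dims[i], dims[i + 1]))
        h = h @ w
        if i < 2:
            h = leaky_relu(h)
    return 1.0 / (1.0 + np.exp(-h))        # sigmoid output in (0, 1)
```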
Classifier: The classifier C shared by the source agent S and the target agent T is composed of a fully connected layer followed by a Softmax layer.
The whole model is trained by Adam [41] with a batch size of 128. In stage 1, i.e., the supervised learning stage on the source domain, the learning rate is initially set to 1 × 10^−3 and then decays by a factor of 0.97 every epoch. During the adversarial training of stage 2, the learning rates are set to 2 × 10^−4 and 5 × 10^−5 for the optimization of D and T respectively. All our experiments are implemented in TensorFlow 1.4 and trained on a workstation with an NVIDIA Titan X GPU and 32 GB of system RAM.
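The stage-1 learning-rate schedule, for example, amounts to a simple exponential decay (the function name is ours):

```python
def stage1_lr(epoch, base_lr=1e-3, decay=0.97):
    """Stage-1 schedule: initial learning rate 1e-3, multiplied
    by 0.97 after every epoch."""
    return base_lr * decay ** epoch
```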

Adaptation between Centered Digits Datasets: SVHN, MNIST and USPS
We validated the performance of our model on the following adaptation tasks: SVHN to MNIST, MNIST to USPS, and USPS to MNIST. All of these settings follow ADDA [12]. Specifically, for adaptation between SVHN and MNIST, the full datasets are used, and the MNIST images are resized to the SVHN scale of 32 × 32. For adaptation between MNIST and USPS, 2000 images are sampled from MNIST and 1800 images from USPS, and the MNIST images are resized to 16 × 16, the scale of the USPS images. For both the source agent S and the target agent T, 6 glimpses are taken for better performance; the glimpse size is 8 × 8 for adaptation between SVHN and MNIST, and 4 × 4 for adaptation between USPS and MNIST.
The results of these basic adaptation experiments are shown in Table 1. From the table, it can be seen that the proposed adversarial hard attention adaptation approach achieves better performance than the baseline ADDA [12] and also outperforms other existing methods. In particular, for adaptation from SVHN to MNIST, our method surpasses the others by a large margin. It is reasonable that our adaptation method outperforms its competitors, since it considers both the global feature provided by the Context network and the local attention feature glimpsed by the Glimpse network. Our approach thus not only projects the global features of the two domains into a common space, as ADDA [12] does, but also aligns the local attention features across the two domains. Figure 4 visualizes the target policy π_T adapted from the source policy π_S. For each sub-figure, the left two rows correspond to the actions (glimpses) taken by the source domain agent S, and the right two rows correspond to the actions (glimpses) taken by the target domain agent T. It is evident that the target policy π_T emits discriminative hard attention just as the source policy π_S does, which verifies the effectiveness of our model. Furthermore, Figure 5 shows the t-SNE [43] visualization of the original image distributions of both domains (SVHN and MNIST), the feature distribution generated by the source agent S, and the adapted feature distribution generated by the proposed adversarial hard attention adaptation model. It clearly shows that our model does align the global and local attention features into the same space after hard attention adaptation.

Adaptation between the Enlarged Non-Centered Datasets: SVHN to MNIST
To further examine the capability of hard attention adaptation, enlightened by the experiments in RAM [23], we considered a challenging task of adaptation between enlarged non-centered datasets. Specifically, we created the enlarged SVHN dataset by placing an SVHN digit at a random location on a large blank canvas; the enlarged MNIST dataset was constructed in the same way. In this experiment, we only considered adaptation from the enlarged SVHN to the enlarged MNIST. The size of the canvas is 60 × 60 and the size of the original digits is 32 × 32 in both domains. Figure 6a shows a sample image of the source domain, the enlarged non-centered SVHN dataset, and Figure 6b shows a sample image of the target domain, the enlarged non-centered MNIST dataset. Since the images are relatively large in both domains, two glimpse scales are adopted by the Glimpse network, 12 × 12 and 24 × 24 respectively. The Context network takes a coarse global overview of the down-sampled 32 × 32 images. To compare with ADDA on these enlarged weak datasets, we validated two versions of ADDA: a down-sampled version, in which the images of both domains are all resized to 32 × 32, corresponding to the coarse scale of our Context network; and the typical version, in which ADDA processes the original-size images of both domains. Table 2 lists the accuracy evaluated on the target domain for each model. The training and testing time consumed per mini-batch and the total training time for each model are also reported in this table. Note that the time cost comparison of the different models only covers the adversarial unsupervised domain adaptation training and the test process; that is, it does not include the time for training a source model directly on the source domain dataset.
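The enlarged datasets can be constructed with a few lines of NumPy; the digit and canvas sizes follow the description above, while the function name and RNG handling are our own choices:

```python
import numpy as np

def enlarge_digit(digit, canvas_size=60, rng=np.random.default_rng(0)):
    """Place a digit image (e.g. 32x32) at a uniformly random location
    on a blank canvas_size x canvas_size canvas."""
    h, w = digit.shape
    canvas = np.zeros((canvas_size, canvas_size), dtype=digit.dtype)
    top = rng.integers(0, canvas_size - h + 1)
    left = rng.integers(0, canvas_size - w + 1)
    canvas[top:top + h, left:left + w] = digit
    return canvas
```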
The results in the table demonstrate that our model obtains better performance than ADDA. For ADDA with down-sampling, because the images are down-sampled to 32 × 32, the coarse image scale clearly deteriorates its adaptation performance, even though it takes the least time to train. In our model, however, the Context network uses the coarse-scale image only to provide a global overview for the agents of both domains, which helps the agents locate the discriminative hard attention. Since our model deals with coarse global features and local attention features at the same time, it takes more time to converge to its best accuracy. For the typical ADDA, since the input images contain a large blank background, aligning cross-domain features extracted from whole images pays much attention to the useless background. On the contrary, in our model the target agent T learns its policy through adversarial training, which highlights the important areas of the target images despite the large blank background. Figure 7 visualizes the source policy π_S and the target policy π_T on the same MNIST digits: 7, 8, 6, and 4. Figure 7a shows that the source policy π_S cannot locate the MNIST digits well, and thus the predicted probability of the true label is low. However, Figure 7b shows that the adapted target policy π_T locates the MNIST digits nicely and predicts the correct digit label with high probability. Therefore, our model is more suitable for cross-domain adaptation on such enlarged non-centered datasets.

Conclusions
Adapting attention from a label-rich source domain to an unlabeled target domain can substantially improve the overall performance of transfer learning. In this paper, we proposed a novel adversarial hard attention adaptation framework to transfer the learned source policy π_S to a target domain policy π_T. By introducing a novel adversarial reward strategy, we construct two adversarial training counterparts, a reinforcement learning agent and a discriminator: the reinforcement learning agent works in the target domain to generate "real" source domain attention features to fool the discriminator, while the discriminator tries its best to distinguish which domain the attention feature was extracted from, the source or the target.
Although the popular recurrent attention model (RAM) has been the main focus of our adversarial hard attention adaptation framework, we believe that other kinds of hard attention, namely attention generated via non-differentiable operations, could also benefit from our framework, in particular the proposed adversarial reward strategy. In future work, we plan to explore other interesting hard attention methods for domain adaptation and for other challenging visual tasks, such as transferring attention knowledge for fine-grained classification.