P-Norm Attention Deep CORAL: Extending Correlation Alignment Using Attention and the P-Norm Loss Function

CORrelation ALignment (CORAL) is an unsupervised domain adaptation method that uses a linear transformation to align the covariances of the source and target domains. Deep CORAL extends CORAL with a nonlinear transformation implemented by a deep neural network and adds the CORAL loss as a part of the total loss so that the covariances of the two domains are aligned during training. However, two problems remain in Deep CORAL: the features extracted by AlexNet are not always a good representation of the original data, and joint training with both the classification and CORAL losses may not align the distributions of the source and target domains efficiently. In this paper, we propose two strategies: an attention mechanism to improve the quality of the feature maps and a p-norm loss function to align the distributions of the source and target features, further reducing the offset caused by the classification loss. Experiments on the Office-31 dataset indicate that the proposed methodologies improve the performance of Deep CORAL.


Introduction
Deep learning-based applications have exceeded human expectations in many areas, such as computer vision, speech recognition, natural language processing, and audio recognition [1][2][3], but domain shift dramatically damages the performance of deep learning methods [4,5]. In such a scenario, features extracted by a deep neural network pre-trained on existing datasets (called the source domain) can become meaningless for the target task (referred to as the target domain). Essentially, the different data distributions of the source and target domains hinder generalization to the target task, meaning the knowledge learned from the source domain cannot be transferred to the target domain.
To relieve the domain shift issue, which is common in practical scenarios, one can collect labeled data and train a new classifier for every possible scenario to compensate for the degradation in performance. However, acquiring huge volumes of labeled data remains expensive and time-consuming. Domain Adaptation (DA) [6] is an alternative solution that, instead of collecting new labeled data, utilizes known (labeled) data to learn a classifier for unknown (unlabeled) data. Domain adaptation is a particular case of Transfer Learning (TL), which has become commonplace in today's deep learning-centric computer vision.

Related Work
CORrelation ALignment (CORAL) [7] works well by aligning the distributions of the source and target features in an unsupervised manner. However, it relies only on a linear transformation that minimizes the squared Frobenius norm distance between the covariances of the source and target features, which limits its flexibility and adaptability. Furthermore, CORAL must first calculate the second-order statistics (covariances) of the source and target data and then transform the source domain toward the target domain to align their distributions. An extra classifier, such as a Support Vector Machine (SVM), must then be trained on the transformed source data and finally used to classify the target dataset. Because an external classifier must be involved to obtain the final category in this somewhat tedious process, we call it "not-end-to-end".
Deep CORAL [8] extends CORAL with a Deep Neural Network (DNN), a kind of nonlinear transformation. Deep CORAL adds the objective function of basic CORAL as a part of the total loss function, making full use of the characteristics of DNNs, which can minimize the loss to align the covariances of the source and target domains. Hence, Deep CORAL essentially overcomes the linear transformation dependence of CORAL thanks to the nonlinear characteristics of DNNs. Meanwhile, in order to address the not-end-to-end dilemma, Deep CORAL introduces joint training into the neural network to reduce the influence of degenerated features induced by minimizing the CORAL loss alone. Nevertheless, several problems remain in Deep CORAL.
First of all, Deep CORAL is not concerned with the quality of the extracted features, which influences accuracy. Deep CORAL extracts features of the source and target datasets using AlexNet only. AlexNet [9], designed primarily by Alex Krizhevsky, is a Convolutional Neural Network (CNN) that became famous in 2012 by winning the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2012) with an error rate of 15.3%. However, not all features extracted by convolutional layers can perfectly represent the original data. Our experiments illustrate this point (see Section 4).
Secondly, according to [8], relying on the CORAL loss alone could cause the AlexNet used by Deep CORAL to project the source and target domains onto a single point. Therefore, Deep CORAL adopts joint training with both the classification loss and the CORAL loss to avoid this degenerate solution, but the classification loss can introduce an offset when Deep CORAL tries to align the distributions of the source and target domains by minimizing the CORAL loss.
Basically, minimizing the CORAL loss alone may align the second-order statistics of the source and target domains properly. However, when the classification loss is added to the CORAL loss, it acts as a term extraneous to CORAL, and the alignment is disturbed by it. To overcome this problem, we introduced the p-norm [10] to further align the distributions and improve the generalization accuracy.
In this paper, we introduce P-norm Attention Deep CORAL (P-ADC) to address the above challenges. The key insight underlying P-ADC is that we add attention to the DNN of Deep CORAL, which not only retains the advantages of AlexNet, but also uses attention to highlight image features, which has an effect similar to image preprocessing. Meanwhile, our experimental results show that Attention Deep CORAL provides an effective improvement over traditional Deep CORAL. Furthermore, we extend the loss function of Deep CORAL from two terms to n ∈ [1, ∞) terms to ease the second challenge mentioned above. The first term of the extended loss function maintains the original classification loss, and the remaining terms, which we introduce, contain the p-norm to balance the offset caused by the classification loss.

Method
Suppose the source domain training set D_S = {(x_i, y_i)}, x_i ∈ R^d, i ∈ {1, ..., n_S}, y_i ∈ {1, ..., L}, consists of n_S image-label pairs (x_i, y_i), where x_i is a source domain image and y_i is its corresponding label, and the target domain data D_T = {u_i}, u_i ∈ R^d, are unlabeled. Throughout, n_S and n_T denote the numbers of samples, μ_s and μ_t the feature vector means, and C_S and C_T the covariance matrices of the source and target data, respectively.

CORrelation ALignment
CORAL works by aligning the distributions of the source and target features in an unsupervised manner. It matches the second-order statistics (covariances) by applying a linear transformation M to the source features that minimizes the Frobenius distance:

min_M ‖C_Ŝ − C_T‖²_F,   (1)

where C_Ŝ is the covariance of the transformed source features and C_S and C_T are the source and target covariance matrices. Let C_S = U_S Σ_S U_S^T and C_T = U_T Σ_T U_T^T be the singular-value decompositions (SVDs) of C_S and C_T. Then, the optimal value of M is M* = (U_S Σ_S^{+1/2} U_S^T)(U_{T[1:r]} Σ_{T[1:r]}^{1/2} U_{T[1:r]}^T), where r = min(rank(C_S), rank(C_T)), the subscript [1:r] keeps the r largest singular values and their singular vectors, and Σ^+ denotes the Moore-Penrose pseudoinverse of Σ.
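In practice, the linear CORAL transformation is often applied as whitening with C_S^{−1/2} followed by re-coloring with C_T^{1/2}. The following is a minimal NumPy sketch of this view; the function name `coral_transform` and the small regularizer `eps` are our own additions, and matrix powers are computed by eigendecomposition rather than the SVD form used in the paper:

```python
import numpy as np

def coral_transform(Xs, Xt, eps=1e-5):
    """Align source features to the target covariance: whiten with
    Cs^{-1/2}, then re-color with Ct^{1/2}.

    Xs, Xt: (n, d) feature matrices. Returns transformed source features.
    A sketch with identity regularization, not the paper's exact SVD form.
    """
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def mat_pow(C, p):
        # symmetric matrix power via eigendecomposition: V diag(w^p) V^T
        w, V = np.linalg.eigh(C)
        w = np.maximum(w, eps)
        return (V * w ** p) @ V.T

    # M = Cs^{-1/2} Ct^{1/2}: whitening followed by re-coloring
    M = mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)
    return (Xs - Xs.mean(axis=0)) @ M + Xt.mean(axis=0)
```

After this transformation, the covariance (and mean) of the source features matches that of the target features up to the regularization, which is the alignment that the CORAL objective asks for.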

Deep CORAL
Deep CORAL minimizes the difference in covariance between the source and target domain with the aid of a DNN. We defined the CORAL loss (Equation (2)) as a part of the total loss function (Equation (3)). Figure 1 shows the architecture of Deep CORAL.

l_CORAL = (1 / (4d²)) ‖C_S − C_T‖²_F,   (2)

l = l_CLASS + Σ_{i=1}^{t} λ_i l_CORAL,   (3)

where d is the dimension of the features, t is the number of CORAL loss layers, λ_i is the trade-off weight of the ith CORAL loss, and l_CLASS indicates the classification loss function, e.g., cross-entropy, square loss, etc. Cross-entropy was adopted in our experiments. Figure 1. The architecture of Deep CORAL. φ denotes any deep neural network (e.g., AlexNet). 256 × 6 × 6 denotes the size of the feature maps extracted by AlexNet: 256 is the number of channels, and 6 × 6 is the width × height of a single feature map.
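The CORAL loss on a pair of feature batches is straightforward to compute; the following NumPy sketch (the function name is ours) shows the computation on plain arrays, whereas in the actual model it runs on framework tensors so that gradients flow through both covariances:

```python
import numpy as np

def coral_loss(fs, ft):
    """CORAL loss: squared Frobenius distance between the source and
    target feature covariances, scaled by 1/(4 d^2).

    fs, ft: (n, d) mini-batches of d-dimensional deep features.
    """
    d = fs.shape[1]
    cs = np.cov(fs, rowvar=False)  # source covariance
    ct = np.cov(ft, rowvar=False)  # target covariance
    return float(np.sum((cs - ct) ** 2)) / (4.0 * d * d)
```

The loss is zero exactly when the two batch covariances coincide and is symmetric in the two domains, which is why it can be minimized jointly with the classification loss.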

Our Method
The Deep CORAL model is built on AlexNet, whose convolutional layers are inefficient at modeling global dependencies in images because of their local receptive fields. We adapted the attention mechanism to overcome this shortcoming of AlexNet, enabling the image features extracted by the convolutional layers to provide more representative information. Because of the added attention mechanism, we call the proposed method P-norm Attention Deep CORAL (P-ADC) (see Figure 2). Suppose image features X = {x_1, x_2, ..., x_N} are provided by the previous layer, where x_i ∈ R^{C×W×H} is the feature map of the ith sample; C denotes the number of channels, and W and H are the width and height of the feature map x_i. Two feature spaces, v_i and t_i, are calculated from the feature map x_i of the previous hidden layer, and the energy s_i of x_i is then given by Equation (4), where j and k index the spatial positions within s_i. Here, W_V ∈ R^{C×C}, W_K ∈ R^{C×C}, and W_Q ∈ R^{C×C} are learned weight matrices implemented as convolutional layers with kernel_size = 1, stride = 1, and padding = 0. Following [11], we reduce the channel number from C to C/k, choosing k = 8 in our experiments to cut the number of parameters without significantly decreasing performance.
Attention mechanisms [12,13] have been employed successfully in sequence modeling and transduction problems such as speech recognition, neural captioning, etc., to capture the long-range interactions that convolutions miss. Recently, attention mechanisms have also been applied in computer vision models to provide contextual information. The essence of the attention mechanism is an addressing process: given a query vector q related to the task and a key vector k, a distribution of weights is calculated from q and k and then applied to the value vector v. The main attention models are as follows.
Additive model: s(k, q) = v^T tanh(W k + U q);
Dot-product model: s(k, q) = k^T q;
Scaled dot-product model: s(k, q) = k^T q / √d;
Bilinear model: s(k, q) = k^T W q.
Equation (4) belongs to the family of additive models. The attention matrix of x_i is given by

α_{i,jk} = exp(s_{i,jk}) / Σ_{m=1}^{M} exp(s_{i,jm}),   (5)

where M = W × H and α_i denotes the attention matrix of x_i. The output of the attention layer is then

o_i = (W_V x_i) α_i,   (6)

where x_i is flattened to C × M, o_i is reshaped back so that o_i ∈ R^{C×W×H} is the output of the attention layer. According to [13], additive attention and dot-product attention are the two most popular attention functions. Here, we also defined dot-product attention for the convenience of application.
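To make the attention computation concrete, here is a minimal NumPy sketch of dot-product spatial attention over a single feature map. The function names are ours, and the 1 × 1 convolutions are represented as plain matrix multiplies over the flattened C × (W·H) map (a 1 × 1 convolution is exactly a per-position matrix multiply); this is an illustrative dot-product variant, not the paper's exact layer:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax along one axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x, Wq, Wk, Wv):
    """Dot-product spatial attention over one feature map.

    x: (C, W, H) feature map; Wq, Wk: (C//k, C) key/query projections
    (channel-reduced, as in the text); Wv: (C, C) value projection.
    Returns a (C, W, H) map where each position is an attention-weighted
    sum of values over all M = W*H positions.
    """
    C, W, H = x.shape
    flat = x.reshape(C, W * H)        # M = W*H spatial positions
    q = Wq @ flat                     # query features
    k = Wk @ flat                     # key features
    v = Wv @ flat                     # value features
    energy = k.T @ q                  # (M, M) pairwise energies
    alpha = softmax(energy, axis=0)   # attention weights over positions
    out = v @ alpha                   # weighted sum of values
    return out.reshape(C, W, H)
```

Note that when all energies are equal (e.g., a zero key projection), the weights become uniform and every output position reduces to the mean of the value features, which is the "global context" limit of the layer.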
The p-norm of a vector x is defined as

‖x‖_p = (Σ_i |x_i|^p)^{1/p}.   (7)

We defined the p-norm loss between the two domains for a single feature layer as

l_{p-norm} = (1 / (4d²)) ‖C_S − C_T‖²_p,   (8)

where C_S and C_T denote the feature covariance matrices, the p-norm is taken entrywise, and d is set to the number of categories, i.e., the dimension of the output of the last fully connected layer. Since the Frobenius norm is the entrywise 2-norm, the CORAL loss is recovered as l_CORAL = l_{2-norm}. The total loss function is

l_TOTAL = l_CLASS + Σ_{p=a}^{b} Σ_{i=1}^{t} λ_i l_{p-norm},   (9)

where t is the number of p-norm loss layers and λ trades off the adaptation and the classification accuracy on the source domain.
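The p-norm loss and the extended total loss can be sketched as follows in NumPy; the function names are ours, the entrywise reading of the covariance p-norm is our assumption, and in the real model these quantities are computed on framework tensors with gradients:

```python
import numpy as np

def p_norm_loss(fs, ft, p):
    """Entrywise p-norm loss between feature covariances, scaled like
    the CORAL loss; p = 2 recovers the standard CORAL loss."""
    d = fs.shape[1]
    diff = np.cov(fs, rowvar=False) - np.cov(ft, rowvar=False)
    pn = np.sum(np.abs(diff) ** p) ** (1.0 / p)  # entrywise p-norm
    return pn ** 2 / (4.0 * d * d)

def total_loss(class_loss, fs_layers, ft_layers, lambdas, p_range=(2, 3)):
    """Total loss: classification loss plus p-norm losses summed over
    p = a..b and over the t adapted feature layers."""
    a, b = p_range
    loss = class_loss
    for p in range(a, b + 1):
        for lam, fs, ft in zip(lambdas, fs_layers, ft_layers):
            loss += lam * p_norm_loss(fs, ft, p)
    return loss
```

With p_range = (2, 3) this corresponds to the P-ADC (2−3) configuration evaluated in the experiments, and with p_range = (2, 4) to P-ADC (2−4).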

Experiment Results
To evaluate our method, we performed experiments on a famous domain adaptation benchmark, the Office-31 dataset [14]. This dataset contains three image domains, DSLR, Amazon, and Webcam, each with the same 31 classes.
In Figure 3, we compare the information content of the feature maps when training with vs. without attention on the Amazon domain. We can clearly see that adding attention helped the classifier acquire much more information, which translates into higher test accuracy. For comparative analysis of our method (P-ADC), in addition to Deep CORAL, we tested other well-known algorithms, Deep Domain Confusion and Conditional Domain Adversarial Networks, on the Office-31 benchmark dataset. Deep Domain Confusion (DDC) [15] adds an adaptation layer and a domain confusion loss to AlexNet. Conditional Domain Adversarial Networks (CDANs) [16] introduce multilinear conditioning and entropy conditioning to improve discriminability and guarantee transferability.
Following [8], we initialized the weight of the last fully connected layer (fc8) with N (0.0, 0.005) and set the dimension to 31, the number of categories. The other layers of AlexNet were initialized with the pre-trained model parameters of ImageNet [17], keeping the layerwise parameter settings. We also set batch size = 128, learning rate = 10 −3 , weight decay = 5 × 10 −4 , and momentum = 0.9 for all of the experiments below (Table 1) for a fair comparison.
From Table 1, we can see that P-ADC achieved a higher average performance than Deep CORAL and the other baseline methods. In three of the six shifts, P-ADC (2−3) achieved the highest accuracy (l_TOTAL = L_C + Σ_{p=2}^{3} Σ_{i=1}^{t} λ_i l_{p-norm}, where t is the number of p-norm loss layers in the deep network and P-ADC (2−3) means that p ranges from two to three). For the other three shifts, P-ADC (2−4) (where p ranges from two to four) obtained the best scores. In this experiment, we only tried P-ADC (2−3) and P-ADC (2−4), because the p-norm loss consumes substantial computing resources as p increases, slowing computation dramatically. In addition, as Table 1 shows, the test accuracies do not fully reach the officially reported results for all methods, owing to the fine-tuned AlexNet model from PyTorch as well as differences in the software and hardware environment.
Table 1. Accuracy on the Office-31 dataset. P-ADC (a−b) denotes P-ADC trained with l_TOTAL = L_C + L_(a−b), where L_C = l_CLASS is the classification loss function, L_(a−b) = Σ_{p=a}^{b} Σ_{i=1}^{t} λ_i l_{p-norm} denotes the p-norm loss function with p ranging from a to b, and a and b are natural numbers greater than one. Bold denotes the highest accuracy.
Figure 4 shows three plots generated for the shift D → W to assist in analyzing P-ADC. In Figure 4a, we visualize the training and testing process of Deep CORAL and P-ADC; our method outperformed Deep CORAL in test accuracy. Figure 4b shows the average loss in the training and test stages; our method was more stable in the test stage. Comparing Figure 4b,c, we can conclude that the p-norm loss did not always decrease during training as the CORAL loss did; nevertheless, the two losses were about the same after hundreds of training iterations. Furthermore, our p-norm loss finally converged, constraining the distance between the source and target domains and maintaining an equilibrium in the target domain even more effectively than the CORAL loss.

Conclusions
In this paper, we extended Deep CORAL, a simple yet effective end-to-end adaptation method for deep neural networks, with an attention mechanism that provides more information to the network. Meanwhile, we replaced the CORAL loss with the p-norm loss function to balance the offset caused by the classification loss. Experiments on the standard Office-31 benchmark showed state-of-the-art performance.
We tested our method on the classic benchmark dataset Office-31, and the experimental results demonstrated its effectiveness. One future research direction is applying our method to a more diverse range of real-world applications and datasets. In addition, we are researching image recognition of vegetable diseases and insect pests in greenhouse environments, which are very complicated: different diseases and pests overlap, and the lighting changes in real time. We hope to improve the accuracy of vegetable disease and insect pest identification with domain adaptation methods, including the one proposed here.