Optimal Transport with Dimensionality Reduction for Domain Adaptation

Abstract: Domain adaptation aims to learn a robust classifier for the target domain using the source domain, although the two domains often follow different distributions. To bridge the distribution shift between the two domains, most previous works align their feature distributions through feature transformation. Among these, optimal transport for domain adaptation has attracted researchers' interest, as it can exploit the local information of the two domains while mapping the source instances to the target ones by minimizing the Wasserstein distance between their feature distributions. However, it may weaken the feature discriminability of the source domain and thus degrade domain adaptation performance. To address this problem, this paper proposes a two-stage feature-based adaptation approach, referred to as optimal transport with dimensionality reduction (OTDR). In the first stage, we apply dimensionality reduction with intradomain variance maximization and source intraclass compactness minimization, to separate data samples as much as possible and enhance the feature discriminability of the source domain. In the second stage, we leverage an optimal transport-based technique to preserve the local information of the two domains. Notably, the desirable properties obtained in the first stage mitigate the degradation of source feature discriminability in the second stage. Extensive experiments on several cross-domain image datasets validate that OTDR is superior to its competitors in classification accuracy.


Introduction
In order to train a classifier with strong robustness to the test data set, the traditional machine learning methods usually assume that the training and test data sets follow the same distribution, and require sufficient labeled training samples for the model training process [1,2]. However, in practical applications, especially in some areas of computer vision, it is difficult to label samples. For example, in object image recognition, manual annotation of images of different objects is time-consuming and expensive [3,4].
Although some object image datasets with annotations have been established for training classification models, the distribution of the real-world images to be recognized is usually different from that of the training sets due to the differences of viewpoint, resolution, brightness, occlusion, background, and other factors [5,6]. As a result, the classifier with good training performance in one scene (e.g., Caltech 256 [7]) performs poorly in another scene (e.g., images taken by mobile phone).
To address these challenges, domain adaptation (DA) aims to make a classifier, trained on a source domain that contains rich label information, robust to an unlabeled target domain by reducing the distribution discrepancy between the two domains. DA methods can be broadly classified into three categories: instance-based adaptation [10,11], classifier-based adaptation [12,13], and feature-based adaptation [14][15][16][17][18]. With the recent growth of deep neural network technology, deep DA methods have been developed and have achieved satisfying performance on image classification [19][20][21]. However, deep DA methods need to retrain deep neural networks and tune a large number of hyper-parameters, which is cumbersome and expensive to operate and is generally not stable [22].
Feature-based adaptation, which manages to learn domain-invariant features to reduce the difference between domains, can be implemented on both shallow and deep feature representations, and has aroused great interest in DA works. A fruitful line of feature-based adaptation works statistically utilizes distribution alignment [16,17,23,24,25] to learn shared features, while another line of works geometrically adopts subspace alignment to find domain-specific features [14,15,26]. All these works aim to achieve global feature alignment to reduce the divergence between domains, while the local information is ignored.
In recent works [18,27], optimal transport (OT) [28] is used for domain adaptation by adding different forms of regularization to the OT problem formulation, which can map each source instance to target instances like graph matching to achieve feature alignment, thus the local information of the domains can be preserved. However, these methods usually reduce the variation of the source data dramatically and make the source samples crowded, which may weaken the feature discriminability of the source domain and lead to degradation of the DA performance.
Therefore, we propose a two-stage feature-based adaptation approach referred to as optimal transport with dimensionality reduction (OTDR). In the first stage, we construct a dimensionality reduction framework similar to principal component analysis (PCA) in which the intradomain dispersion is maximized and the intraclass compactness of the source domain is minimized. As such, both the source and target data samples are separated as much as possible, the source data samples in the same class are drawn as close as possible, and the feature discriminability is enhanced accordingly. In the second stage, by virtue of the above properties, we solve the group-lasso regularized OT problem with source label information in the low-dimensional space, thus the degradation of source feature discriminability can be alleviated. Finally, we obtain an optimal transport plan (OTP) with more discriminant information, which not only bridges the distribution shift by mapping the source instances into the target domain with Wasserstein distance minimization, but also generates a more discriminative representation of the source domain. Therefore, the DA performance can be improved. The whole pipeline of our algorithm is briefly depicted in Figure 2.
Figure 2. (c) First stage: dimensionality reduction to promote source intraclass compactness, and improve source and target intradomain dispersion; (d) second stage: optimal transport to align the source and target distributions with the desirable properties obtained in the first stage.
We summarize the contributions of this paper as follows: (1) Combining optimal transport and dimensionality reduction, a two-stage feature-based adaptation approach is proposed for domain adaptation. Compared with global feature alignment methods, our approach can preserve the local information of the domains and has a relatively simple structure, which does not need continuous iteration to learn pseudo labels of the target domain; (2) To address the source sample crowding problem generated by previous regularized optimal transport methods, which transform the source data in the original space, we solve the OT problem in a low-dimensional space where the intradomain instances are dispersed as much as possible.
In this way, the solution OTP will have larger variance, and the separability of the source samples will be enhanced in the new representation generated by the OTP; (3) To enhance the discriminability of the source data, we exploit the source label information and add a source intraclass compactness regularization to the dimensionality reduction framework in the first stage. Besides, we add a class-based regularization to the OT problem in the second stage. By solving the OT problem, we obtain an OTP that makes a target instance more likely to be associated with source instances from only one class. Therefore, the OTP can generate a more discriminative representation of the source domain; (4) Comprehensive experiments on several image datasets with shallow or deep features demonstrate that the proposed approach is competitive compared with several traditional and deep DA methods.

Related Works
This section presents the related works on domain adaptation (DA) from the following two aspects: dimensionality reduction for DA and optimal transport for DA.


Dimensionality Reduction for Domain Adaptation
Domain adaptation aims to deal with the distribution shift between the source and target domains. In this area, dimensionality reduction, as a very popular strategy, compensates for cross-domain divergence by aligning feature subspaces, feature distributions, or both simultaneously.

Subspace Alignment
Subspace alignment attributes the distribution shift to the source and target data being geometrically located in different subspaces [29]. Hence, subspace alignment aims to match the source and target subspaces, to implicitly and geometrically minimize the distribution shift between the two domains. In [5,14], the authors adopt dimensionality reduction to learn low-dimensional subspaces of the two different domains and regard the subspaces as two different points, then connect them using intermediary points on the Grassmann manifold to achieve subspace alignment. Fernando et al. [15] introduce a subspace alignment method which learns a mapping function to align the source and target low-dimensional subspaces. These methods align the subspaces of different domains without specifically considering the cross-domain distribution shift from a statistical viewpoint.
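As a concrete illustration of this idea, the following numpy sketch mirrors the closed-form subspace alignment of [15]: a PCA basis is learned per domain and the source basis is aligned via M = P_s^T P_t. The function names and toy data are ours, and basis-selection details of the original method are simplified.

```python
import numpy as np

def pca_basis(X, k):
    # top-k principal directions (columns) of the centered data
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T                      # shape (d, k)

def subspace_alignment(Xs, Xt, k):
    Ps, Pt = pca_basis(Xs, k), pca_basis(Xt, k)
    M = Ps.T @ Pt                        # closed-form alignment matrix
    Zs = Xs @ Ps @ M                     # source features in the aligned subspace
    Zt = Xt @ Pt                         # target features in its own subspace
    return Zs, Zt

rng = np.random.default_rng(0)
Xs = rng.normal(size=(40, 10))
Xt = rng.normal(size=(50, 10)) + 0.5     # toy shifted target domain
Zs, Zt = subspace_alignment(Xs, Xt, k=3)
print(Zs.shape, Zt.shape)                # (40, 3) (50, 3)
```

The alignment matrix M is optimal in Frobenius norm for mapping one orthonormal basis onto the other, which is why no iterative optimization is needed.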

Distribution Alignment
In contrast, distribution alignment holds that the cause of distribution shift is that the data distribution functions of the source and target domains are different [29]. Thus, distribution alignment commits to separately constructing statistics of the source and target distribution functions, then narrows the distance between each pair of statistics, to explicitly and statistically reduce the distribution divergence between the source and target domains. Pan et al. [16] construct a dimensionality reduction framework based on the maximum mean discrepancy (MMD) metric to align the source and target marginal distributions. Long et al. [17] further propose class-wise MMD by learning pseudo target labels and building a dimensionality reduction framework to jointly achieve marginal and conditional distribution alignment across domains. Based on MMD and class-wise MMD, Li et al. [24] explore discriminative information of the source and target domains to learn both domain-invariant and class-discriminative low-dimensional features of the two domains.
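As a minimal illustration of the statistics being matched (not the class-wise MMD of [17]), the sketch below computes the squared MMD under a linear feature map, which reduces to the squared distance between the domain means; the function name and toy data are ours.

```python
import numpy as np

def linear_mmd2(Xs, Xt):
    # squared MMD with the identity (linear) feature map:
    # the distance between the two empirical domain means
    d = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(d @ d)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(100, 5))
Xt = rng.normal(size=(120, 5)) + 1.0     # shifted target domain
print(linear_mmd2(Xs, Xt))               # clearly positive under the shift
print(linear_mmd2(Xs, Xs))               # 0.0 for identical domains
```

Distribution alignment methods minimize such a discrepancy over a learned transformation; richer kernels replace the linear map when higher-order statistics must be matched.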

Joint Subspace Alignment and Distribution Alignment
To take advantage of the above two techniques jointly, Zhang et al. [29] propose to combine both the subspace alignment and distribution alignment, which can bridge the distribution shift geometrically and statistically. Li et al. [30] propose a novel approach to exploit feature adaptation with subspace alignment and distribution alignment, and conduct sample adaptation with landmark selection. Based on this, Li et al. [31] further propose a novel landmark selection algorithm to reweight samples, i.e., increase the weight of pivot samples and decrease the weight of outliers. All of these methods, either subspace alignment or distribution alignment or both, aim to bridge the domain shift by global feature alignment, but they ignore the local information.
Different from these methods, the proposed approach pertains to a two-stage feature-based adaptation and utilizes optimal transport for domain adaptation, in which the local information of the domains can be preserved.

Optimal Transport for Domain Adaptation
Optimal transport (OT) can learn the optimal transport plan (OTP) according to the Wasserstein distance, so that the source instances can be transported to the target domain at a minimum transport cost. However, high-dimensional source and target data usually lead to irregularities in the OTP and incorrect transport of instances. To address this challenge, several regularized OT methods [32][33][34] have been proposed to relax some constraints of the OTP. Among them, Cuturi et al. [34] propose entropy regularization based on information theory to smooth the transport, which has gained popularity due to its fast computation speed.
Courty et al. [18] first attempt to apply such information theoretic regularized optimal transport (OT-IT), mapping source instances into the target domain to bridge the cross-domain shift, and since then, optimal transport for domain adaptation (OTDA) has raised great interest. Based on OT-IT, group-lasso regularized optimal transport (OT-GL) [27] is developed to utilize class-based regularization to explore the source label information. Courty et al. [35] further explore joint distribution optimal transport (JDOT), which can directly obtain the prediction function to label target instances by finding the optimal transport plan from the source joint distribution to the target joint distribution. Zhang et al. [36] use correlation alignment to learn the kernel Gauss-optimal transport map (KGOT) in reproducing kernel Hilbert spaces so as to narrow the cross-domain gap.
The above mentioned OTDA methods also belong to one-stage feature-based adaptation. In addition, they achieve feature alignment across domains by minimizing the Wasserstein distance, but usually result in weak discriminability of the source domain, thereby degrading the DA performance.
To address this problem, we propose a two-stage feature-based adaptation approach, and perform optimal transport (the second stage) in a low-dimensional feature space by constructing a dimensionality reduction framework to maximize the intradomain dispersion and minimize the source intraclass compactness (the first stage), so as to enhance feature discriminability of the source domain.

Theoretical Background
In this section, we first present the domain adaptation definition and then a brief overview of the optimal transport for domain adaptation.
Domain adaptation (DA) definition: Let Ω_s, Ω_t ⊂ R^d be d-dimensional feature spaces and Ψ_s, Ψ_t ⊂ R be label spaces. Given a source data set {x_i^s}_{i=1}^{n_s} = X_s ∈ R^{n_s×d} associated with its label set {y_i^s}_{i=1}^{n_s} = Y_s, and a target data set {x_j^t}_{j=1}^{n_t} = X_t ∈ R^{n_t×d} without labels (n_s and n_t are the numbers of the source and target data samples), domain adaptation aims to infer the corresponding target labels. Notably, the superscripts s and t denote the source and target domains, and the same goes for the subscripts s and t. Let

$$\hat{\mu}_s = \frac{1}{n_s}\sum_{i=1}^{n_s}\delta_{x_i^s}, \qquad \hat{\mu}_t = \frac{1}{n_t}\sum_{j=1}^{n_t}\delta_{x_j^t}$$

be the respective empirical marginal distributions over X_s and X_t, where δ_x is the Dirac function at location x. 1_m denotes an m-dimensional vector whose elements are all ones, where m ∈ {n_s, n_t}. With a cost matrix C, the optimal transport (OT) problem defined by Kantorovich [28] is formulated as follows:

$$\gamma^* = \arg\min_{\gamma \in \Pi} \; \langle \gamma, C \rangle_F, \quad (1)$$

where $\Pi = \{\gamma \in (\mathbb{R}^+)^{n_s \times n_t} \mid \gamma^T \mathbf{1}_{n_s} = \hat{\mu}_t,\; \gamma \mathbf{1}_{n_t} = \hat{\mu}_s\}$ and ⟨·,·⟩_F denotes the Frobenius dot product. Equation (1) can be seen as the Wasserstein distance between μ̂_s and μ̂_t, and C(i, j) usually adopts the squared Euclidean distance between x_i^s and x_j^t. To efficiently deal with the transportation among high-dimensional instances, entropy regularization, rooted in information theory, is added to the OT problem, and the purpose is to find the optimal transport plan (OTP) as below:

$$\gamma^* = \arg\min_{\gamma \in \Pi} \; \langle \gamma, C \rangle_F + \lambda \, \Omega_e(\gamma), \qquad \Omega_e(\gamma) = \sum_{i,j} \gamma(i,j)\log \gamma(i,j), \quad (2)$$

where λ > 0 is the weighting parameter of the entropy regularization. Courty et al. [18] use such information theoretic regularized optimal transport (OT-IT) to align source and target features to address real-world DA problems, such as cross-domain image classification.
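A minimal numpy sketch of the Sinkhorn scaling iterations commonly used to solve the entropy-regularized problem in Equation (2); the function name, toy data, and fixed iteration count are our assumptions, not the paper's implementation.

```python
import numpy as np

def sinkhorn(mu_s, mu_t, C, lam, n_iter=500):
    """Entropy-regularized OT via Sinkhorn scaling iterations."""
    K = np.exp(-C / lam)                 # Gibbs kernel of the cost matrix
    u = np.ones_like(mu_s)
    for _ in range(n_iter):
        v = mu_t / (K.T @ u)             # rescale to match the target marginal
        u = mu_s / (K @ v)               # rescale to match the source marginal
    return u[:, None] * K * v[None, :]   # transport plan gamma

rng = np.random.default_rng(0)
Xs = rng.normal(size=(5, 2))
Xt = rng.normal(size=(7, 2)) + 1.0
C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
mu_s, mu_t = np.full(5, 1 / 5), np.full(7, 1 / 7)
gamma = sinkhorn(mu_s, mu_t, C, lam=1.0)
print(gamma.sum(axis=1))   # recovers mu_s: the marginal constraints of Pi hold
```

Each iteration only requires matrix–vector products, which is the source of the method's speed advantage over exact linear-programming solvers.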
To promote DA performance, several class-based regularized OT methods [18,27] with different forms of the class-based regularization term Θ(γ) have been proposed based on OT-IT. These methods take advantage of the source label information to promote group sparsity w.r.t. the columns of γ*, thus preventing source instances with different labels from being matched to the same target instances. The class-based regularized OT problem can be formulated as:

$$\gamma^* = \arg\min_{\gamma \in \Pi} \; \langle \gamma, C \rangle_F + \lambda \, \Omega_e(\gamma) + \alpha \, \Theta(\gamma), \quad (3)$$

where α > 0 is the weighting parameter of the class-based regularization. With γ*, the source data features can be aligned into the target domain to minimize the cross-domain Wasserstein distance, and the new features of the transported source data can be represented as:

$$\hat{X}_s = n_s \, \gamma^* X_t. \quad (4)$$

Training a classifier on the new feature representation of the transported source instances with their labels, the labels of the target instances can be predicted.
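Given a transport plan, the barycentric mapping of Equation (4) can be sketched as follows: each source point moves to the γ-weighted average of the target points it is matched with. The deliberately degenerate uniform plan used here is only for illustration; with uniform source weights 1/n_s, each row of n_s·γ sums to 1, i.e., forms convex combination weights.

```python
import numpy as np

rng = np.random.default_rng(0)
ns, nt = 4, 6
Xt = rng.normal(size=(nt, 2))
gamma = np.full((ns, nt), 1.0 / (ns * nt))   # a valid (here: uniform) plan
Xs_hat = ns * gamma @ Xt                     # barycentric mapping of the source
print(np.allclose(Xs_hat, Xt.mean(axis=0)))  # True: uniform plan collapses all
                                             # source points onto the target mean
```

This collapse under a uniform plan is exactly the discriminability loss discussed later in the Motivation section; a sparse, class-aware plan instead spreads the transported source points over distinct target regions.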

Motivation and Main Idea
The discriminability of the source data is essential for feature-based DA performance. The reviewed regularized OT methods for DA, which belong to feature-based DA methods, are based on OT-IT, which is efficient for high-dimensional cross-domain image classification. However, due to the entropy regularization, they usually lead to weak discriminability of the source data. In particular, when the entropy regularization dominates, γ*(i, j) = 1/(n_s n_t), ∀i, j [27]; this causes all the source instances to be mapped to the same point in the target domain, and the discriminability of the source domain disappears unexpectedly.
To address this challenge, a two-stage procedure for feature-based adaptation is presented in this paper. In the first stage, we construct a dimensionality reduction framework, similar to PCA, to learn a low-dimensional space in which the variation of both the source and target data is maximized and the intraclass compactness of the source domain is minimized. In the second stage, based on the source and target low-dimensional features with the properties described in the first stage, we adopt class-based regularized optimal transport, i.e., Equation (3), to promote interclass sparsity in the OTP rows (the OTP rows from different classes are sparse, so that source instances from different classes are not associated with the same target instance) using the source label information.
Notably, the desirable properties in the first stage can mitigate the degradation of feature discriminability of the source domain in the second stage. Specifically, with the low-dimensionality space obtained in the first stage, the fluctuation range of all elements is getting larger in the cost matrix C, and the rows of C from the same classes tend to be similar. As such, the matrix C enables the OTP to have larger variance and enhances the intraclass density in rows of the OTP (that is, the OTP rows from the same classes are similar, so the source instances from the same classes are associated with one or more target instances simultaneously).
In this two-stage feature-based adaptation strategy, we can obtain an OTP with more discriminant information which can generate a more discriminative representation of the source domain when aligning the source features to the target features by minimizing the Wasserstein distance.

A Dimensionality Reduction Framework
OTDA usually reduces the source data variation, which may weaken the source interclass dispersion and degrade the discriminability of the source data. To address this challenge, we use dimensionality reduction to maximize the variation of both the source and target data and minimize the source intraclass compactness, which leads to an OTP with larger variance whose same-class rows are intraclass denser when we conduct OT in the learned low-dimensional space.
Maximizing Intradomain Dispersion: Similar to principal component analysis (PCA), we propose to learn an orthogonal transformation matrix A ∈ R^{d×k} to separately maximize the variances of the k-dimensional embedded representations of the source and target domains. In order to ensure the existence of the orthogonal transformation matrix, a symmetric matrix is constructed to represent the variance, so that the orthogonal transformation matrix is the solution of the following optimization problem:

$$\max_{A^T A = I_k} \; \operatorname{tr}\!\left(A^T M_1 A\right), \quad (5)$$

where $M_1 = X_s^T H_s X_s + X_t^T H_t X_t$ is the symmetric matrix, $H_s = I_{n_s} - \frac{1}{n_s}\mathbf{1}_{n_s \times n_s}$ and $H_t = I_{n_t} - \frac{1}{n_t}\mathbf{1}_{n_t \times n_t}$ are centering matrices, I_p is an identity matrix, p ∈ {k, n_s, n_t}, and 1_{q×q} is a q × q matrix whose elements are all ones, q ∈ {n_s, n_t}.
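The matrix M_1 above can be assembled directly from the centering matrices; a small numpy sketch (the function name and toy data are ours):

```python
import numpy as np

def dispersion_matrix(Xs, Xt):
    """M1 = Xs^T Hs Xs + Xt^T Ht Xt with centering matrices H = I - (1/n) 11^T."""
    ns, nt = len(Xs), len(Xt)
    Hs = np.eye(ns) - np.full((ns, ns), 1.0 / ns)
    Ht = np.eye(nt) - np.full((nt, nt), 1.0 / nt)
    return Xs.T @ Hs @ Xs + Xt.T @ Ht @ Xt

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(30, 4)), rng.normal(size=(40, 4))
M1 = dispersion_matrix(Xs, Xt)
print(np.allclose(M1, M1.T))   # symmetric, as Eq. (5) requires
```

Because the centering matrices are idempotent, M_1 equals the sum of the scatter matrices of the mean-centered source and target data, which is why maximizing tr(A^T M_1 A) maximizes the total projected variance of the two domains.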
Minimizing the Source Intraclass Compactness: Intraclass compactness is a crucial indicator to measure the effectiveness of a model to produce discriminative features, where intraclass compactness indicates how close the features with the same label are to each other [37].
To retain the discriminative information of the source domain, minimization of the source intraclass compactness is added to the dimensionality reduction framework. The source intraclass compactness can be formulated as below:

$$\min_{A} \; \sum_{i=1}^{C} \frac{1}{n_s^{(i)}} \sum_{x_p^s,\, x_q^s \in \Omega_s^{(i)}} \left\| A^T x_p^s - A^T x_q^s \right\|^2 = \min_{A} \; \operatorname{tr}\!\left(A^T M_2 A\right), \quad (6)$$

where $M_2 = X_s^T L X_s$ is a symmetric matrix, L is the corresponding Laplacian-style weight matrix, and $\Omega_s^{(i)}$ denotes the set of source instances from class i, of size $n_s^{(i)}$. From Equation (6), we aim to minimize the distance between each pair of data instances that come from the same class in the source domain, so that the source intraclass compactness is promoted. Notably, the weight $\frac{1}{n_s^{(i)}}$ is designed to pay more attention to smaller classes, to deal with the imbalanced dataset problem.
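One standard construction of M_2 from the source labels is sketched below: same-class pairs are weighted by 1/n_s^{(i)}, giving a Laplacian-style matrix L = D − W. The function name and toy data are ours, and the exact weighting (constant factors can be absorbed into the trade-off parameter β) is an assumption.

```python
import numpy as np

def compactness_matrix(Xs, ys):
    """M2 = Xs^T L Xs, where L = D - W and W puts weight 1/n_c on each
    same-class pair, so smaller classes receive larger per-pair weight."""
    ns = len(ys)
    W = np.zeros((ns, ns))
    for c in np.unique(ys):
        idx = np.where(ys == c)[0]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)
    D = np.diag(W.sum(axis=1))           # here D = I, since each row of W sums to 1
    L = D - W
    return Xs.T @ L @ Xs

rng = np.random.default_rng(0)
Xs = rng.normal(size=(20, 3))
ys = np.repeat([0, 1], 10)
M2 = compactness_matrix(Xs, ys)
# tr(A^T M2 A) is proportional to the weighted sum of squared
# within-class pairwise distances in the projected space
print(np.allclose(M2, M2.T))
```

Since L is positive semidefinite, M_2 is as well, so the compactness term in Equation (6) is always nonnegative.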
According to the generalized Rayleigh quotient, Equation (6), when minimized, can be integrated into Equation (5), and the feature reduction framework aims to find a transformation matrix by solving the following optimization problem:

$$\min_{A^T M_1 A = I_k} \; \operatorname{tr}\!\left(A^T (\beta M_2) A\right), \quad (7)$$

where β is a trade-off parameter. Since $M_1 = X_s^T H_s X_s + X_t^T H_t X_t$ and $M_2 = X_s^T L X_s$ are symmetric matrices, applying Lagrange techniques, the transformation matrix A can be obtained from the k smallest eigenvectors of the following generalized eigen-decomposition:

$$\beta M_2 A = M_1 A \Phi, \quad (8)$$

where Φ ∈ R^{k×k} is a diagonal matrix of Lagrange multipliers. Therefore, the low-dimensional features of the instances in both domains can be, respectively, represented as Z_s = X_s A, Z_t = X_t A, where the intradomain samples are dispersed as much as possible and the source intraclass samples are compacted as much as possible.
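Under the formulation above, the first stage reduces to a generalized symmetric eigenproblem, which can be sketched with scipy.linalg.eigh (its generalized mode solves a·v = w·b·v and returns eigenvalues in ascending order, so the leading columns are the k smallest eigenvectors). The small ridge term and toy data are our assumptions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
Xs, Xt = rng.normal(size=(30, 6)), rng.normal(size=(40, 6))
ys = np.repeat([0, 1, 2], 10)
beta, k = 0.1, 3

# M1: intradomain dispersion (Eq. (5))
Hs = np.eye(30) - np.full((30, 30), 1 / 30)
Ht = np.eye(40) - np.full((40, 40), 1 / 40)
M1 = Xs.T @ Hs @ Xs + Xt.T @ Ht @ Xt

# M2: source intraclass compactness (Eq. (6))
W = np.zeros((30, 30))
for c in np.unique(ys):
    idx = np.where(ys == c)[0]
    W[np.ix_(idx, idx)] = 1 / len(idx)
M2 = Xs.T @ (np.diag(W.sum(1)) - W) @ Xs

# small ridge keeps M1 numerically positive definite for the generalized solver
vals, vecs = eigh(beta * M2, M1 + 1e-8 * np.eye(6))
A = vecs[:, :k]                          # k smallest generalized eigenvectors
Zs, Zt = Xs @ A, Xt @ A                  # low-dimensional representations
print(Zs.shape, Zt.shape)                # (30, 3) (40, 3)
```

Note that eigh normalizes its eigenvectors so that A^T M_1 A = I_k, matching the constraint of Equation (7).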

OT Based on Low-Dimensional Representation
With the low-dimensional representations $Z_s = [z_1^s, z_2^s, \cdots, z_{n_s}^s]^T$ and $Z_t = [z_1^t, z_2^t, \cdots, z_{n_t}^t]^T$ learned above, we adopt the OT-IT method to align the source and target features by transporting the source data to the target domain at a minimum transport cost. Group-lasso regularized optimal transport (OT-GL) further utilizes the source label information by adding an $\ell_1$-$\ell_2$ class-based regularization to the OT-IT formulation to promote interclass sparsity in the OTP rows, and it can efficiently use the generalized conditional gradient (GCG) algorithm [38] to achieve better DA performance than other class-based regularized OT methods. We therefore use OT-GL to obtain the OTP and achieve source and target feature alignment.
Based on this representation, the elements of the cost matrix can be computed with the squared Euclidean distance as follows:

$$\widetilde{C}(i, j) = \left\| z_i^s - z_j^t \right\|^2. \quad (10)$$

Applying OT-GL, the OTP can be solved from the following formulation:

$$\gamma^* = \arg\min_{\gamma \in \Pi} \; \langle \gamma, \widetilde{C} \rangle_F + \lambda \, \Omega_e(\gamma) + \alpha \sum_{j} \sum_{Cl} \left\| \gamma(L_{Cl}, j) \right\|_2, \quad (11)$$

where $L_{Cl} = \{ i : y_i^s = Cl \}$ contains the indices of the source instances from class Cl, and $\gamma(L_{Cl}, j)$ denotes the corresponding coefficients of the j-th column of γ. The GCG algorithm can efficiently solve this formulation to obtain the OTP. Since we adopt the above-mentioned low-dimensional features and apply the class-based regularization, the resulting OTP can generate a more discriminative feature representation. Specifically, the OTP has larger variance, and the rows of the OTP show greater intraclass density and interclass sparsity.
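A simplified sketch of solving a problem of this group-lasso regularized form: at each outer step the class term is linearized around the current plan and the resulting entropic OT problem is re-solved with Sinkhorn iterations. This majorization-style loop is a stand-in for the GCG algorithm of [38], not a faithful reproduction; all names, step counts, and toy data are ours.

```python
import numpy as np

def sinkhorn(mu_s, mu_t, C, lam, n_iter=300):
    K = np.exp(-C / lam)
    u = np.ones_like(mu_s)
    for _ in range(n_iter):
        v = mu_t / (K.T @ u)
        u = mu_s / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_group_lasso(mu_s, mu_t, C, ys, lam=1.0, alpha=1.0, n_outer=10, eps=1e-12):
    """Linearize the l1-l2 class term around the current plan, then
    re-solve the entropic OT problem on the adjusted cost."""
    gamma = sinkhorn(mu_s, mu_t, C, lam)
    for _ in range(n_outer):
        G = np.zeros_like(gamma)          # gradient of sum_j sum_cl ||gamma(I_cl, j)||_2
        for cl in np.unique(ys):
            rows = ys == cl
            norms = np.sqrt((gamma[rows] ** 2).sum(axis=0)) + eps
            G[rows] = gamma[rows] / norms
        gamma = sinkhorn(mu_s, mu_t, C + alpha * G, lam)
    return gamma

rng = np.random.default_rng(0)
Zs, Zt = rng.normal(size=(10, 3)), rng.normal(size=(12, 3))
ys = np.repeat([0, 1], 5)
C = ((Zs[:, None] - Zt[None]) ** 2).sum(-1)      # Eq. (10) cost
mu_s, mu_t = np.full(10, 0.1), np.full(12, 1 / 12)
gamma = ot_group_lasso(mu_s, mu_t, C, ys)
print(gamma.shape)    # (10, 12); columns tend to concentrate on one class
```

The adjusted cost penalizes a target instance for drawing mass from a class it is already weakly coupled to, which is what drives the interclass sparsity in the OTP columns.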
With the OTP, we can align the source and target features by mapping the source instances into the target domain to minimize the Wasserstein distance between their distributions, and get a new representation of the source data as follows:

$$\hat{Z}_s = n_s \, \gamma^* Z_t. \quad (12)$$
In view of the above properties of the OTP, the new representation of the source data not only disperses the source interclass samples but also compacts the source intraclass samples, thus being more discriminative.
The proposed OTDR approach learns a low-dimensional feature representation to disperse the source/target instances and compact the source intraclass instances in the first stage. Then, with the desirable properties obtained in the first stage, the OT-GL method is adopted to map the source data instances into the target domain to achieve feature alignment in the second stage. Therefore, we can get a discriminative representation of the source data based on the OTP with interclass sparsity and intraclass density in its rows. OTDR is summarized in Algorithm 1.

Algorithm 1 OTDR
Input: source data X_s with labels Y_s, target data X_t, parameters k, β, λ, α, maxiter.
1: Construct the matrices M_1 and M_2.
2: Solve Equation (8) to obtain the transformation matrix A.
3: Let Z_s = X_s A, Z_t = X_t A, and compute the cost matrix C̃ by Equation (10).
4: Adopt the GCG algorithm, and obtain the optimal transport plan γ* by solving Equation (11).
5: Generate Ẑ_s by Equation (12), and train an adaptive classifier f on {Ẑ_s, Y_s}.
Output: transformation matrix A, optimal transport plan γ*, and adaptive classifier f.

Experiments
In this section, we conduct comprehensive experiments on cross-domain image classification datasets to validate the effectiveness of our approach.

Data Descriptions
The widely used cross-domain datasets Office10 + Caltech10, ImageCLEF-DA, Office-31, and Office-Home were adopted in this paper in the form of A→B, which denotes a DA task from source domain A to target domain B. The statistics of these datasets are listed in Table 1 and some exemplary images from Office10 + Caltech10, Office-Home are shown in Figure 3.
Office-31 [39]: Office-Home [42]: it contains four domains, namely Art (A65), Clipart (C65), Product (P65), and Real World (R65), from which 12 DA tasks can be constructed. Each domain has more samples and categories; consequently, the DA tasks on this dataset are more challenging. Additionally, ResNet-50 deep features are considered in our experiments.
For the feature-based adaptation methods in all experiments, a 1-NN classifier is used to predict the target labels. Since there are no labeled target instances, the optimal parameters of the DA methods cannot be obtained by cross-validation. For the sake of fairness, we set the parameters of the comparison methods either to those recommended in the corresponding original papers or obtain them through an empirical search for satisfactory DA performance. In our OTDR, as in the OT-GL algorithm, the group-lasso regularization parameter, the entropy regularization parameter, and the maximum number of iterations in the GCG algorithm are set to α = 2, λ = 0.1, and maxiter = 20. Besides, the reduced feature dimension is set to k = 200. For the trade-off parameter in OTDR, according to the number of categories in the different datasets, we set β = 0.001, 0.01, 0.1 for Office-Home, Office-31, and Office10 + Caltech10/ImageCLEF-DA, respectively, which is analyzed in the parameter sensitivity section.
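The 1-NN prediction step on the transported source features can be sketched as follows (the function name and toy data are ours):

```python
import numpy as np

def knn1_predict(Zs_hat, ys, Zt):
    """1-NN prediction: each target sample takes the label of its
    nearest (squared-Euclidean) transported source sample."""
    d2 = ((Zt[:, None] - Zs_hat[None]) ** 2).sum(-1)   # (n_t, n_s) distances
    return ys[d2.argmin(axis=1)]

Zs_hat = np.array([[0.0, 0.0], [5.0, 5.0]])   # transported source features
ys = np.array([0, 1])
Zt = np.array([[0.2, -0.1], [4.8, 5.3]])      # target features
print(knn1_predict(Zs_hat, ys, Zt))           # [0 1]
```

Because the classifier is parameter-free, differences in accuracy directly reflect the quality of the adapted feature representation rather than classifier tuning.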

Experimental Results
In this section, we adopt classification accuracy to evaluate the effectiveness of the proposed OTDR and the compared DA methods. As in [14,22,24,27,35,46], the classification accuracy metric is formulated as below:

$$\text{Accuracy} = \frac{\left| \{ x : x \in X_t \wedge \hat{y}(x) = y(x) \} \right|}{\left| X_t \right|},$$

where y(x) is the ground-truth label of x and ŷ(x) is the predicted label. The classification accuracies on Office10 + Caltech10 (SURF features) and Office10 + Caltech10 (Decaf6 features) under different DA methods are shown in Tables 2 and 3, where we can observe that OTDR outperforms all of the feature-based adaptation methods with an average classification accuracy of 53%.
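The accuracy metric above amounts to a simple fraction of correctly labeled target instances (the helper name and example are ours):

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of target instances whose predicted label matches the ground truth."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float((y_pred == y_true).mean())

print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))   # 0.75
```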
JGSA combines subspace alignment and distribution alignment to reduce the cross-domain divergence, and thus performs better than the pure subspace alignment methods, i.e., GFK and SA, and the pure distribution alignment methods, i.e., TCA, JDA, DICD, and ESDM.
Although those seven methods achieve DA through global feature alignment, they ignore the local information of the two domains. STSC and the OTDA methods, i.e., OT-IT, OT-GL, KGOT, and JDOT, utilize sample-to-sample matching to exploit the local information, but they do not further explore the source label information. As a result, their DA performance is degraded and they cannot beat JGSA on average. Based on OT-GL, our OTDR further uses the source label information and a two-stage feature-based adaptation strategy to alleviate the source discriminability degradation caused by OT-GL. Therefore, OTDR stands out among these sample-to-sample matching methods on most of the tasks (17/24 tasks). Moreover, compared with the best baseline JGSA, our OTDR achieves a 3.0% average improvement.
To further evaluate the performance of OTDR, we conducted experiments on three datasets with ResNet-50 features. The results of OT-GL, the best baseline feature-based DA method (JGSA), the state-of-the-art classifier-based DA method (ARTL), and several end-to-end deep DA models are reported in Tables 4 and 5. It can be seen that the traditional DA methods are, on average, superior to some deep DA methods, i.e., DAN, DANN, and JAN. In this sense, research on traditional DA methods is still meaningful.
In addition, OTDR also outperforms OT-GL, JGSA, and ARTL with ResNet-50 features on most of the tasks (19/24 tasks), which further confirms the value of OTDR among traditional methods. More importantly, OTDR is on average better than all of the baseline deep DA methods.
In particular, on the challenging, large Office-Home dataset, OTDR achieves a 0.5% improvement over the best baseline HAN. Hence, the results indicate that OTDR achieves competitive DA performance on cross-domain image classification tasks compared with both traditional and deep DA methods.

Distribution of the OTP Matrix
To verify the effectiveness of the proposed OTDR, we first inspected the distribution of the OTP matrix on the randomly selected task C10→A10 with SURF features, as shown in Figure 4. It can be seen from Figure 4a that the OTP γ* obtained by OT-IT, which uses no source label information, is the smoothest overall, which will result in a source feature representation with relatively poor discriminability. When the class-based regularization is added to OT-IT, the interclass sparsity in the OTP rows is enhanced, as shown in Figure 4b, where the 10 small matrix blocks on the diagonal represent the transport plans between the source and target samples of the same class. However, there are some wrong transport directions in this OTP, i.e., some source instances are transported to target instances of different classes.
To enhance the discriminability of the source data, our OTDR approach performs a dimensionality reduction procedure before OT. In this way, the obtained OTP has a larger variance of 1.3 × 10⁻⁶ (while the variance of the OTP obtained by OT-GL is 9.1 × 10⁻⁷). Furthermore, the OTP shows higher intraclass density in its rows, as shown in Figure 4c, which prompts all of the source samples from the same class to be associated with one or more target instances simultaneously. Therefore, the OTP will generate a more discriminative representation of the source domain.
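The connection between the entropy regularization and the smoothness (variance) of the OTP can be illustrated with a plain NumPy Sinkhorn sketch. This is an illustrative stand-in on random data, not the paper's OT-GL solver with group-lasso regularization: a larger entropy weight yields a smoother, lower-variance plan, mirroring the contrast between Figure 4a and Figure 4c.

```python
import numpy as np

def sinkhorn_plan(a, b, M, reg, n_iter=200):
    """Entropy-regularized OT plan gamma via Sinkhorn iterations."""
    K = np.exp(-M / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
ns, nt = 30, 40
Xs = rng.normal(size=(ns, 5))
Xt = rng.normal(size=(nt, 5)) + 1.0        # shifted target domain
M = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
a = np.full(ns, 1 / ns)                    # uniform source weights
b = np.full(nt, 1 / nt)                    # uniform target weights

# Larger entropy regularization -> smoother plan -> smaller entry variance.
gamma_smooth = sinkhorn_plan(a, b, M, reg=10.0)
gamma_sharp = sinkhorn_plan(a, b, M, reg=1.0)
print(gamma_smooth.var(), gamma_sharp.var())
```

The row marginals of the returned plan match the source weights by construction, and comparing the two variances shows the "sharper" (less regularized) plan concentrating its mass, as OTDR's OTP does.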

Statistics of Feature Discriminability
In addition, we evaluated the discriminability of the feature representation of the transported source instances on five randomly selected DA tasks (i.e., C10→A10 with SURF features, W10→C10 with Decaf6 features, W31→A31, P12→C12, and C65→R65). Figure 5 shows the ratio of the source intradomain dispersion ("S_dispersion") to the target intradomain dispersion ("T_dispersion"), and the ratio of the source intraclass compactness ("S_compactness") to the source intradomain dispersion. As can be seen from Figure 5a, "S_dispersion/T_dispersion" under OT-GL is sharply reduced; that is, the source instances become crowded relative to the target instances, which may lead to weaker discriminability of the source domain. This observation corroborates our motivation.
Using OTDR, we obtain an OTP that shows more obvious interclass sparsity and intraclass density in its rows, so a more discriminative representation of the source data can be generated. This discriminability is indicated by the trends of "S_compactness/S_dispersion" shown in Figure 5b, where we can see that, compared with OT-GL, "S_compactness/S_dispersion" under OTDR is smaller.
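The two ratios above can be computed from standard scatter statistics. The sketch below uses one plausible definition (mean squared deviation from the domain mean for dispersion, and from each class mean for compactness); the paper's exact formulas may differ, and the data are synthetic.

```python
import numpy as np

def intradomain_dispersion(X):
    """Mean squared deviation of samples from the domain mean."""
    return ((X - X.mean(axis=0)) ** 2).sum() / len(X)

def intraclass_compactness(X, y):
    """Mean squared deviation of samples from their class means."""
    total = 0.0
    for c in np.unique(y):
        Xc = X[y == c]
        total += ((Xc - Xc.mean(axis=0)) ** 2).sum()
    return total / len(X)

rng = np.random.default_rng(0)
centers = rng.normal(scale=3.0, size=(4, 10))       # 4 class centers
ys = rng.integers(0, 4, size=200)
Xs = centers[ys] + rng.normal(scale=0.5, size=(200, 10))  # clustered source
Xt = rng.normal(scale=2.0, size=(150, 10))                # target domain

ratio_disp = intradomain_dispersion(Xs) / intradomain_dispersion(Xt)
ratio_comp = intraclass_compactness(Xs, ys) / intradomain_dispersion(Xs)
print("S_dispersion/T_dispersion:", ratio_disp)
print("S_compactness/S_dispersion:", ratio_comp)
```

For well-separated source classes, "S_compactness/S_dispersion" is well below 1, which is exactly the behavior OTDR is designed to preserve after transport.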

Feature Visualization of Source Domain
Moreover, the discriminability of the source data can be illustrated intuitively by displaying the t-SNE feature visualization [40] of the transported source features on task C10→A10 with SURF features in Figure 6. We observe that the source instances under OT-GL from different classes cluster together, which is consistent with our speculation mentioned above, while the source instances under OTDR are more discriminative.
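A visualization like Figure 6 can be produced by projecting the transported features to 2-D with scikit-learn's t-SNE and scattering the points colored by class. The features below are synthetic placeholders for the transported source representation.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-in for transported source features with 4 classes.
centers = rng.normal(scale=5.0, size=(4, 50))
y = np.repeat(np.arange(4), 50)
X = centers[y] + rng.normal(size=(200, 50))

# Project to 2-D; class separation in the embedding indicates discriminability.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)
```

Plotting `emb` with `matplotlib.pyplot.scatter(emb[:, 0], emb[:, 1], c=y)` then gives a figure in the style of Figure 6, where tight, well-separated clusters correspond to a discriminative source representation.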

Ablation Study
Finally, we conducted experiments on the five DA tasks used above to verify the effectiveness of the intraclass compactness minimization and the intradomain dispersion maximization.
As shown in Figure 7, by removing the intraclass compactness regularization (OTDR/ICR), the classification accuracy is reduced on all five tasks compared with OTDR. In addition, when we further remove the intradomain dispersion maximization, that is, drop the dimensionality reduction procedure, OTDR degenerates to OT-GL and the classification accuracy is further reduced, as shown in Figure 7. Therefore, in the DA process, it is effective to minimize the intraclass compactness and maximize the intradomain dispersion.
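The first-stage objective ablated here, maximizing intradomain dispersion while minimizing source intraclass compactness, admits a natural spectral sketch: maximize trace(Wᵀ S_t W) − β·trace(Wᵀ S_w W) under WᵀW = I, solved by the top-k eigenvectors of S_t − β S_w. This is a plausible reconstruction for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def dr_projection(X_all, Xs, ys, k, beta):
    """Top-k eigenvectors of (total scatter - beta * source within-class scatter).

    Maximizing trace(W^T S_t W) - beta * trace(W^T S_w W) subject to W^T W = I
    spreads all samples out while keeping each source class compact.
    """
    Xc = X_all - X_all.mean(axis=0)
    S_t = Xc.T @ Xc                       # intradomain dispersion (total scatter)
    S_w = np.zeros_like(S_t)              # source intraclass scatter
    for c in np.unique(ys):
        Xsc = Xs[ys == c] - Xs[ys == c].mean(axis=0)
        S_w += Xsc.T @ Xsc
    vals, vecs = np.linalg.eigh(S_t - beta * S_w)
    return vecs[:, np.argsort(vals)[::-1][:k]]  # eigenvectors of largest eigenvalues

rng = np.random.default_rng(0)
Xs = rng.normal(size=(80, 30)); ys = rng.integers(0, 4, size=80)
Xt = rng.normal(size=(60, 30)) + 0.5
W = dr_projection(np.vstack([Xs, Xt]), Xs, ys, k=10, beta=0.01)
Zs, Zt = Xs @ W, Xt @ W                   # reduced features fed to the OT stage
print(W.shape, Zs.shape)
```

Setting β = 0 in this sketch drops the compactness term (the OTDR/ICR ablation), and skipping the projection entirely reduces the pipeline to plain OT, matching the two ablation settings in Figure 7.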

Parameter Sensitivity
Since α = 2, λ = 0.1, and maxiter = 20 are fixed to be the same as in OT-GL on all DA tasks, we only evaluate the sensitivity of the reduced feature dimension k and the trade-off parameter β on the five DA tasks used above, fixing one parameter while analyzing the other. Specifically, we set k ∈ {20, 40, 60, · · · , 400} and β ∈ {10⁻⁵, 5 × 10⁻⁵, 10⁻⁴, 5 × 10⁻⁴, · · · , 5 × 10¹, 10²}, respectively. The classification accuracy curves under our OTDR are presented in Figure 8; the trends on all other tasks are similar. From Figure 8a, we observe that as the value of k increases, the accuracy rises and then tends to stabilize within the range k ∈ {80, 100, · · · , 400}. Therefore, we can choose the parameter k in a wide range to obtain optimal performance. For the trade-off parameter β, small values of β make the source intraclass compactness minimization more effective, whereas an infinite value of β causes the source intraclass compactness term to be ignored. As the value increases within the range β ∈ {10⁻³, 10⁻², · · · , 10²}, the accuracy results shown in Figure 8b first increase, reaching the top performance, then decrease slightly and tend to stabilize, so OTDR can achieve good performance in a relatively wide range of β. Specifically, the top performance of OTDR is achieved in β ∈ [5 × 10⁻³, 10⁰] and β ∈ [10⁻⁴, 10⁻¹] for the few-category datasets (i.e., Office10 + Caltech10, Image-CLEF-DA) and the multicategory datasets (i.e., Office-31, Office-Home), respectively.
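The sensitivity protocol above, a linear grid for k and a half-decade log grid for β with one parameter fixed while the other varies, can be sketched as follows; the `evaluate` callback is a hypothetical placeholder for a full OTDR train/test run.

```python
import numpy as np

# Search grids matching the sensitivity analysis: k in {20, 40, ..., 400}
# and beta on a half-decade log grid from 1e-5 to 1e2.
k_grid = list(range(20, 401, 20))
beta_grid = sorted({10.0 ** e for e in range(-5, 3)}
                   | {5 * 10.0 ** e for e in range(-5, 2)})

def sweep(evaluate):
    """Fix one parameter at its default, vary the other (hypothetical helper)."""
    acc_k = [evaluate(k=k, beta=1e-3) for k in k_grid]
    acc_b = [evaluate(k=200, beta=b) for b in beta_grid]
    return acc_k, acc_b

# Dummy evaluator standing in for an actual OTDR run returning accuracy.
acc_k, acc_b = sweep(lambda k, beta: 0.5)
print(len(k_grid), len(beta_grid))
```

Plotting `acc_k` against `k_grid` and `acc_b` against `beta_grid` (log x-axis) reproduces curves in the style of Figure 8a,b.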

Conclusions
In this paper, a novel two-stage feature-based adaptation approach for domain adaptation (DA), referred to as optimal transport with dimensionality reduction (OTDR), was proposed to promote DA performance. We attempted to enhance the discriminability of the source domain while aligning the source and target features. OTDR uses source label information in both stages to obtain an OTP with larger variance, promoting the interclass sparsity and intraclass density of its rows; this generates a more discriminative feature representation of the source domain when the source features are aligned to the target features by minimizing the Wasserstein distance between the source and target distributions. Comprehensive experiments conducted on different DA tasks demonstrate that OTDR is competitive with traditional and deep DA baselines.

Future Work
OTDR achieves feature adaptation for DA in two separate stages, i.e., low-dimensional representation learning and optimal transport; however, in the first stage, some original information might be distorted. In future work, we will consider embedding the preservation of source data discriminability and the optimal transport from source to target data into deep architectures for end-to-end deep domain adaptation.

Conflicts of Interest:
The authors declare no conflict of interest.