A Subspace Based Transfer Joint Matching with Laplacian Regularization for Visual Domain Adaptation

In real-world applications, images taken by different cameras under different conditions often exhibit illumination variation, low resolution, different poses, blur, etc., which leads to a large distribution difference (or gap) between training (source) and test (target) images. This distribution gap is challenging for many primitive machine learning classification and clustering algorithms such as k-Nearest Neighbor (k-NN) and k-means. In order to minimize this distribution gap, we propose a novel Subspace based Transfer Joint Matching with Laplacian Regularization (STJML) method for visual domain adaptation that jointly matches the features and re-weights the instances across different domains. Specifically, the proposed STJML-based method includes four key components: (1) considering the subspaces of both domains; (2) instance re-weighting; (3) simultaneously reducing the domain shift in both the marginal and conditional distributions between the source and target domains; and (4) preserving the original similarity of data points via Laplacian regularization. Experiments on three popular real-world domain adaptation datasets demonstrate a significant performance improvement of our proposed method over published state-of-the-art primitive and domain adaptation methods.


Introduction
Indoor-outdoor camera surveillance systems [1,2] are widely used in urban areas, railway stations, airports, smart homes, and supermarkets. These systems play an important role in security management and traffic management [3]. However, cameras with different properties and positions deployed in these systems can create a distribution difference among the captured images. This difference leads to poor system performance when primitive machine learning algorithms are used for recognition [4]. For example, if a classifier (or primitive algorithm) is trained on source domain images (taken with a DSLR camera), the trained classifier will not give the expected results when tested on images collected from some other target domain (taken with a webcam camera). A simple solution to improve the classifier's performance would be to train it only with target domain images. However, in practice there are no labeled images in the target domain, and labeling target domain images is a time-consuming process. Let us consider the example shown in Figure 1 to discuss in detail how images collected from different environments can cause differences in distribution across domains. Figure 1 presents various possibilities that can cause distribution differences; for example, the images (keyboards and headphones) shown in Figure 1a,b are collected from cameras of different quality, i.e., a low-quality camera (webcam) and a high-quality camera (DSLR).

Recently, the literature [2,4] has seen a growing interest in developing transfer learning (TL) or domain adaptation (DA) algorithms to minimize the distribution gap between domains, so that the structure or information available in the source domain can be effectively transferred to understand the structure available in the target domain.
In previous work [5][6][7][8][9][10][11][12], two learning strategies for domain adaptation are considered independently: (1) instance re-weighting [9][10][11][12], which reduces the distribution gap between domains by re-weighting the source domain instances and then training the model with the re-weighted source domain data; (2) feature matching [5,6,8,13,14], which finds a common feature space across both domains by minimizing the distribution gap.
If the distribution difference between the domains is large, there will always be situations where some source domain instances are not relevant to the target domain instances, even after finding a common feature space. In this situation, jointly optimizing instance re-weighting and feature matching is an important and unavoidable task for robust transfer learning. To understand the need for jointly learning instance re-weighting and feature matching more deeply, let us consider an example in which we have source domain data with outlier samples (or irrelevant instances), as shown in Figure 2a, and target domain data, as shown in Figure 2b. In this case, if we learn only the common feature space between the domains with existing methods such as Joint Geometrical and Statistical Alignment (JGSA) [8] and Joint Distribution Adaptation (JDA) [6], the new representation of the source and target domain data is shown in Figure 2c, where it can be seen that the domain difference remains large after feature matching due to the outlier samples or irrelevant instances (the symbols with circles). However, if we jointly learn feature matching and instance re-weighting, the data representation is shown in Figure 2d, where it can be seen that all the outlier samples are down-weighted to further reduce the domain difference.
Fortunately, in the literature, there is a method called Transfer Joint Matching (TJM) that performs joint feature matching and instance re-weighting by down-weighting irrelevant source domain instances [7]. However, performing only feature matching and instance re-weighting is insufficient for successfully transferring knowledge from the source domain to the target domain. Other DA and TL methods consider additional essential properties to minimize the distribution difference between the domains. For example, the JDA method considers the conditional distribution in addition to the marginal distribution, which is needed if the data is conditionally (class-wise) distributed. Subspace Alignment (SA) [15] makes use of subspaces (composed of 'd' eigenvectors induced by Principal Component Analysis (PCA)), one for each domain, and suggests minimizing the distribution difference between the subspaces of both domains rather than between the original-space data. JGSA preserves source domain discriminant information, in addition to properties considered by SA and the marginal and conditional distributions, to further improve on JDA. However, the feature space obtained by JGSA is not ideal because data samples in this space may lose their original similarity and can thus easily be misclassified. The Kernelized Unified Framework for Domain Adaptation (KUFDA) [16] improves JGSA by adopting an original-similarity weight matrix term so that samples do not lose their original similarity in the learned space. KUFDA incorporates most of the important properties discussed above but still suffers from outlier samples because it does not consider an instance re-weighting term.
In this paper, to solve all of the above-discussed challenges and to efficiently transfer knowledge from the source domain to the target domain, we propose a novel Subspace based Transfer Joint Matching with Laplacian Regularization (STJML) method for visual domain adaptation by jointly matching the features and re-weighting instances across both the source and the target domains.

The major contributions of this work can be listed as follows:

• The proposed STJML method is the first framework that goes beyond all the compared cutting-edge methods by considering, in a common framework, all the inevitable properties: projecting both domains' data into a low-dimensional manifold, instance re-weighting, minimizing the marginal and conditional distributions, and preserving the geometrical structure of both domains.

• With the help of the t-SNE tool, we graphically visualize the features learned by the proposed method after excluding each component, to illustrate the reason for including all the components (or inevitable properties).

Related Work
Recently, various DA and TL approaches have been proposed for transferring structure or information from one domain to another domain in terms of features, instances, relational information, and parameters [4,9]. However, the TL approaches, which are closely related to our work, can be divided into three types: feature-based transfer learning [6], instance-based transfer learning [7], and metric-based transfer learning [9].
In the first type, the objective is to minimize the distribution difference between the source and target domains based on feature learning. For example, Pan et al. [17] proposed a dimensionality reduction method called Maximum Mean Discrepancy Embedding (MMDE) for minimizing the distribution gap between domains. MMDE learns a common feature space for the domains in which the distance between the distributions is minimized while data variance is preserved. Pan et al. [5] further extended the MMDE algorithm by proposing a learning method called Transfer Component Analysis (TCA). TCA tries to learn a feature space across domains in a reproducing kernel Hilbert space using Maximum Mean Discrepancy (MMD). With the new representation in this feature space, standard machine learning methods such as k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) can be used to train classifiers on the source domain for use in the target domain. Long et al. [6] extended TCA by considering not only the marginal distribution but also the conditional distribution, with the help of pseudo labels in the target domain. Fernando et al. [18] introduced a subspace-centric method called Subspace Alignment (SA). SA aims to align the source domain subspace basis (E) with that of the target domain (F) with the help of a transformation matrix (M). Here, E and F are obtained by Principal Component Analysis (PCA) on the source domain and the target domain, respectively. Shao et al. [19] proposed a low-rank transfer learning method that matches samples of both domains in a subspace for transferring knowledge. Zhang et al. [8] proposed a unified framework, called Joint Geometrical and Statistical Alignment (JGSA), that minimizes the distribution gap between domains both statistically and geometrically.
With the help of two coupled projections E (for source domain) and F (for target domain), JGSA projects the source domain and the target domain data into low dimensional feature space, where both domain samples are geometrically and statistically aligned.
In the second type, the objective is to re-weight the domain samples so as to minimize the distribution difference between the domains. The TrAdaBoost TL method [10] re-weights the labeled source domain data to filter out samples that are most likely not from the target domain distribution. In this way, the re-weighted source domain samples approximate the distribution of the target domain, and the re-weighted samples can be considered additional training samples for learning the target domain classifier. As the original TrAdaBoost method applies to classification problems, Pardoe et al. [11] extended it by proposing the ExpBoost.R2 and TrAdaBoost.R2 methods to deal with regression problems.
In the final type, the target domain metric is to be learned by establishing a relationship between the source domain and the target domain tasks. Kulis et al. [20] introduced a method, called ARC-t, to learn a transformation matrix between the source domain and the target domain based on metric learning. Zhang et al. [21] proposed a transfer metric learning (TML) method by establishing the relationship between domains. Ding et al. [22] developed a robust transfer metric learning (RTML) method to effectively assist the unlabeled target learning by transferring the information from source domain labeled data.

A Subspace Based Transfer Joint Matching with Laplacian Regularization
This section presents the Subspace based Transfer Joint Matching with Laplacian Regularization (STJML) method in detail.

Problem Definition
To understand transfer learning or domain adaptation, the domain and the task must be explicitly defined. A domain D consists of two parts: a feature space X and a marginal probability distribution P(x), i.e., D = {X, P(x)}, where x ∈ X. Two domains are said to be different if they differ in their feature space or marginal distribution. Given a domain D, a task T also consists of two parts: a label space Y and a classifier function f(x), i.e., T = {Y, f(x)}, where y ∈ Y and the classifier function f(x) predicts the label of a new instance x. This classifier function can also be interpreted as the conditional probability distribution Q(y|x).
Transfer learning, given a labeled source domain D_s = {(x_1, y_1), ..., (x_{n_s}, y_{n_s})} and an unlabeled target domain D_t = {x_1, ..., x_{n_t}}, under the assumptions X_s = X_t, Y_s = Y_t, P_s(x_s) ≠ P_t(x_t), and Q_s(y_s|x_s) ≠ Q_t(y_t|x_t), aims to improve the performance of the target domain classifier function f_t(x) in D_t using the knowledge in D_s.

Formulation
To address the limitations of existing TL methods, the STJML method minimizes the distribution gap statistically and geometrically through the following components: finding the subspaces of both domains, matching features, re-weighting instances, and exploiting the shared geometrical structure. In our proposed STJML approach, we first exploit the subspaces of both domains; then, with the help of a common projection matrix Z for both domains, we perform feature matching, instance re-weighting, and exploitation of the geometrical structure in a Reproducing Kernel Hilbert Space (RKHS) to match both first- and higher-order statistics.

Subspace Generation
Even though the data of both domains lie in the same D-dimensional feature space, they are drawn according to different marginal distributions. Consequently, according to [15], instead of working in the original feature space, we should work on more robust representations of both domains' data that support stronger classification and are not subject to local perturbations. For this subspace generation, we use the Principal Component Analysis (PCA) technique, which selects the 'd' eigenvectors corresponding to the 'd' largest eigenvalues. These 'd' eigenvectors are used to project the original-space data onto the subspace. For example, given an input data matrix X = [x_1, x_2, ..., x_n] ∈ R^{D×n}, where n = n_s + n_t and D is the dimension of each data sample in the original space, PCA generates the subspace matrix X ∈ R^{d×n} by projecting the input data matrix onto the selected 'd' eigenvectors.
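As a concrete illustration, this subspace generation step can be sketched in a few lines of NumPy (a minimal sketch: the function name, toy dimensions, and random data are ours, not from the paper):

```python
import numpy as np

def pca_subspace(X, d):
    """Project D-dimensional samples (columns of X) onto the 'd'
    eigenvectors of the covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=1, keepdims=True)      # center the data
    C = Xc @ Xc.T / X.shape[1]                  # D x D covariance matrix
    _, eigvecs = np.linalg.eigh(C)              # eigenvalues in ascending order
    top_d = eigvecs[:, ::-1][:, :d]             # keep the d largest
    return top_d.T @ X                          # (d, n) subspace data

# Source and target samples are stacked column-wise (n = n_s + n_t), so
# both domains are projected onto the same d-dimensional subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 60))                  # D = 100, n = 60 toy data
X_sub = pca_subspace(X, d=20)
```

Stacking the domains before projection is what makes the resulting 'd'-dimensional representations directly comparable.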

Feature Transformation
As a dimensionality reduction method such as PCA can learn a transformed feature representation by reducing the reconstruction error of the given data, it can also be utilized for data reconstruction. Let us consider the subspace data matrix X ∈ R^{d×n} and the data centering matrix H = I − (1/n)1, where 1 is an n × n matrix of ones. The covariance matrix of the combined subspace data matrix X can then be computed as XHX^T. The objective of PCA is to maximize the variance of both domains by finding an orthogonal transformation matrix W ∈ R^{d×σ}, where σ is the number of selected eigenvectors onto which the subspace data matrix X is projected:

max_W tr(W^T X H X^T W) s.t. W^T W = I,   (1)

where tr(·) denotes the trace of a matrix and I is an identity matrix. As the problem in Equation (1) is an eigendecomposition problem, it can be solved as XHX^T W = WΦ, where Φ = diag(Φ_1, ..., Φ_σ) is the diagonal matrix of the σ largest eigenvalues. After projecting the subspace data matrix X onto the projection matrix W_σ formed by the eigenvectors corresponding to the σ largest eigenvalues, the optimal σ-dimensional learned projection matrix is V = [v_1, ..., v_σ] = W_σ^T X.

To achieve our goal, we need to work in an RKHS using some kernel function, e.g., linear, polynomial, or Gaussian. Let the chosen kernel mapping θ map a data sample x to θ(x), and let the kernel matrix be K = θ(X)^T θ(X) ∈ R^{n×n}. After applying the Representer theorem W = θ(X)Z, Equation (1) can be written as

max_Z tr(Z^T K H K Z) s.t. Z^T Z = I,   (2)

where Z ∈ R^{n×σ} is the transformation matrix and the subspace embedding becomes V = Z^T K.
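The kernelized variance-maximization step can be sketched as follows (a sketch under our own choices: a linear kernel and helper names that are not from the paper):

```python
import numpy as np

def centering_matrix(n):
    """H = I - (1/n) * ones(n, n), the data centering matrix."""
    return np.eye(n) - np.ones((n, n)) / n

def kernel_embedding(K, sigma):
    """Maximize tr(Z^T K H K Z) s.t. Z^T Z = I by eigendecomposition of
    the symmetric matrix K H K, keeping the sigma leading eigenvectors."""
    H = centering_matrix(K.shape[0])
    _, eigvecs = np.linalg.eigh(K @ H @ K)      # ascending eigenvalues
    Z = eigvecs[:, ::-1][:, :sigma]             # n x sigma matrix Z
    return Z.T @ K                              # embedding V = Z^T K

rng = np.random.default_rng(0)
X_sub = rng.normal(size=(20, 60))               # d x n subspace data
K = X_sub.T @ X_sub                             # linear kernel, n x n
V = kernel_embedding(K, sigma=10)
```

Any positive-definite kernel could replace the linear one here; only the construction of K changes.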

Feature Matching with Marginal Distribution
However, even after maximizing the subspace data variance, the distribution difference between the domains may still be quite large. Therefore, the main problem is to minimize this distribution difference by applying an appropriate distance measure.
There are many distance measures (such as the Kullback-Leibler (KL) divergence) that could be used to compute the distance between the domain samples. However, many of these are parameterized or require estimating an intermediate probability density [5]. Therefore, in this paper, we adopt a non-parametric distance estimate called Maximum Mean Discrepancy (MMD) [23] to compare the distribution difference in a Reproducing Kernel Hilbert Space (RKHS) [5]. MMD estimates the distance between the sample means of the two domains in the σ-dimensional embedding:

Dist(D_s, D_t) = || (1/n_s) Σ_{i=1}^{n_s} Z^T k_i − (1/n_t) Σ_{j=n_s+1}^{n} Z^T k_j ||^2 = tr(Z^T K M_d K Z),   (3)

where M_d is the MMD matrix, determined as follows:

(M_d)_{ij} = 1/n_s^2 if k_i, k_j ∈ D_s; 1/n_t^2 if k_i, k_j ∈ D_t; −1/(n_s n_t) otherwise.   (4)
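The marginal MMD matrix has a simple rank-one construction; the sketch below (our own helper, assuming the source-first sample ordering described above) builds it from the domain sizes alone:

```python
import numpy as np

def marginal_mmd_matrix(n_s, n_t):
    """MMD matrix with entries 1/n_s^2 (source block), 1/n_t^2 (target
    block), and -1/(n_s * n_t) (cross blocks); samples are source-first."""
    e = np.vstack([np.full((n_s, 1), 1.0 / n_s),
                   np.full((n_t, 1), -1.0 / n_t)])
    return e @ e.T                              # (n_s + n_t) square matrix

M_d = marginal_mmd_matrix(3, 5)
# The MMD distance in the embedding is then tr(Z^T K M_d K Z).
```

Because the matrix is the outer product of a vector whose entries sum to zero, all its entries sum to zero as well.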

Feature Matching with Conditional Distribution
Minimizing the marginal distribution difference does not guarantee that the conditional distribution difference between the source and target domains will also be minimized. However, robust transfer learning requires minimizing the gap between the conditional distributions Q_s(y_s|x_s) and Q_t(y_t|x_t) as well [24]. Reducing the conditional distribution difference is not trivial because there are no labeled data in the target domain, so we cannot model Q_t(y_t|x_t) directly.
Long et al. [6] proposed the Joint Distribution Adaptation (JDA) method for modeling Q_t(y_t|x_t) by generating pseudo labels for the target data. Initial pseudo labels can be generated by training a classifier with X_s and Y_s of the source domain and testing it on the target domain subspace X_t. Now, with Y_s, X_s, and the pseudo labels, the conditional distribution difference between the domains can be minimized by modifying MMD to estimate the distance between the class-conditional distributions Q_s(x_s|y_s = c) and Q_t(x_t|y_t = c) for each class c ∈ {1, ..., C}:

Σ_{c=1}^{C} || (1/n_s^c) Σ_{k_i ∈ D_s^c} Z^T k_i − (1/n_t^c) Σ_{k_j ∈ D_t^c} Z^T k_j ||^2 = tr(Z^T K (Σ_{c=1}^{C} M_c) K Z),   (5)

where D_s^c = {k_i : k_i ∈ D_s ∧ y(k_i) = c} is the set of samples belonging to the cth class in the source domain, y(k_i) is the true label of k_i, and n_s^c = |D_s^c|. Similarly, for the target domain, D_t^c = {k_j : k_j ∈ D_t ∧ ŷ(k_j) = c} is the set of samples belonging to the cth class in the target domain, ŷ(k_j) is the pseudo label of k_j, and n_t^c = |D_t^c|. Thus, the MMD matrix M_c involving the class labels of both domains can be determined as follows:

(M_c)_{ij} = 1/(n_s^c)^2 if k_i, k_j ∈ D_s^c; 1/(n_t^c)^2 if k_i, k_j ∈ D_t^c; −1/(n_s^c n_t^c) if (k_i ∈ D_s^c ∧ k_j ∈ D_t^c) or (k_j ∈ D_s^c ∧ k_i ∈ D_t^c); 0 otherwise.   (6)

By minimizing Equation (5) such that Equation (2) is maximized, the conditional distributions of the source and target domains are drawn close under the new representation V = Z^T K. With each iteration, this representation V becomes more robust until convergence. As there are differences in both the marginal and conditional distributions, the initial pseudo labels of the target domain are partly incorrect. However, we can still take advantage of them and improve the performance of the target domain classifier iteratively.
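A JDA-style sketch of this step is shown below; the 1-NN pseudo-labeling and all names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def conditional_mmd_matrix(ys, yt_pseudo, c):
    """Class-c MMD matrix built from source labels and target pseudo
    labels; samples are ordered source-first, as in the formulation."""
    n_s, n_t = len(ys), len(yt_pseudo)
    e = np.zeros((n_s + n_t, 1))
    src = np.where(np.asarray(ys) == c)[0]                # class-c source
    tgt = n_s + np.where(np.asarray(yt_pseudo) == c)[0]   # class-c target
    if len(src):
        e[src] = 1.0 / len(src)
    if len(tgt):
        e[tgt] = -1.0 / len(tgt)
    return e @ e.T

# Initial pseudo labels: train a simple classifier (here 1-NN) on the
# source features, predict on the target features, and refine these
# labels at every iteration of the overall algorithm.
rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(40, 20)), rng.integers(0, 3, 40)
Xt = rng.normal(size=(30, 20))
yt_pseudo = KNeighborsClassifier(n_neighbors=1).fit(Xs, ys).predict(Xt)
M_1 = conditional_mmd_matrix(ys, yt_pseudo, c=1)
```

Summing the per-class matrices over c gives the conditional term used in Equation (5).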

Instance Re-Weighting
However, matching features with the marginal and conditional distributions is not sufficient for transfer learning, as it only matches first- and higher-order statistics of the distributions. In particular, when the domain difference is significant, even after feature learning there will always be some source instances that are not related to the target instances. In this condition, an instance re-weighting method should be included alongside feature learning to deal with the problem.
In this paper, we adopt an L_{2,1}-norm structured sparsity regularizer as proposed in [7]. This regularizer introduces row-sparsity to the transformation matrix Z. Because each row of the matrix Z corresponds to an example, row sparsity can substantially facilitate instance re-weighting. Thus, the instance re-weighting regularizer can be defined as follows:

||Z_s||_{2,1} + ||Z_t||_F^2,   (7)

where Z_s := Z_{1:n_s} is the part of the transformation matrix corresponding to the source samples and Z_t := Z_{n_s+1:n_s+n_t} is the part corresponding to the target samples. As the objective is to re-weight the source domain instances, we impose the L_{2,1}-norm only on the source domain part (the target part is regularized with the squared Frobenius norm). Thus, by minimizing Equation (7) such that Equation (2) is maximized, the source domain samples that are dissimilar (or similar) to the target domain are re-weighted with less (or greater) importance in the new learned space V = Z^T K.
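The L_{2,1} row norm and the diagonal subgradient matrix used later in the optimization can be sketched as follows (the eps guard against zero rows and the helper names are our own additions):

```python
import numpy as np

def l21_norm(Z):
    """||Z||_{2,1}: the sum of the l2 norms of the rows of Z."""
    return np.sqrt((Z ** 2).sum(axis=1)).sum()

def subgradient_matrix(Z, n_s, eps=1e-12):
    """Diagonal G with G_ii = 1/(2*||z^i||_2) for the n_s source rows
    (subgradient of the L_{2,1} norm) and G_ii = 1 for the target rows
    (gradient of the squared Frobenius norm)."""
    row_norms = np.sqrt((Z ** 2).sum(axis=1))
    g = np.ones(Z.shape[0])
    g[:n_s] = 1.0 / (2.0 * np.maximum(row_norms[:n_s], eps))
    return np.diag(g)

Z = np.ones((5, 2))                 # 5 samples, sigma = 2
G = subgradient_matrix(Z, n_s=3)
```

Rows whose norm shrinks toward zero get a large G entry, which pushes them further toward zero in the next iteration, i.e., the corresponding instances are down-weighted.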

Exploitation of Geometrical Structure with Laplacian Regularization
However, feature matching and instance re-weighting alone are not enough for knowledge transfer, as they do not capture the intrinsic structure of the labeled source domain samples and the unlabeled target domain samples. In particular, the labeled source domain samples combined with the unlabeled target domain samples are used to construct a graph that encodes neighborhood information of the data samples. The graph provides a discrete approximation to the local geometry of the data manifold. With the help of the Laplacian regularization term L, a smoothness penalty on the graph can be included. Basically, the regularizer L allows us to incorporate prior knowledge about the domains, namely that nearby samples are likely to share the same class labels [25].
Given a kernelized data matrix K, we can use an nn-nearest-neighbor graph to establish relationships between nearby data samples. Specifically, we draw an edge between two samples i and j if k_i and k_j are "close", i.e., k_i is among the nn nearest neighbors of k_j or vice versa. Thus, the similarity weight matrix W can be determined as follows:

W_{ij} = 1 if k_i ∈ N_nn(k_j) or k_j ∈ N_nn(k_i), and W_{ij} = 0 otherwise,   (8)

where N_nn(k_j) represents the set of the nn nearest neighbors of k_j. Here, two data samples are connected by an edge if they are likely to be from the same class. Thus, the regularizer term can be defined as follows:

Σ_{i,j} ||v_i − v_j||^2 W_{ij} = tr(Z^T K L K Z),   (9)

where D is the diagonal degree matrix with D_{ii} = Σ_j W_{ij}, and L = D − W is the Laplacian matrix.
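A sketch of the graph construction with scikit-learn (the symmetrization via the elementwise maximum implements the "either is a neighbor of the other" rule; the helper name and toy data are ours):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_laplacian(features, nn=5):
    """Build the 0/1 similarity matrix W from an nn-nearest-neighbor
    graph and return the Laplacian L = D - W together with W."""
    A = kneighbors_graph(features, nn, mode='connectivity').toarray()
    W = np.maximum(A, A.T)       # edge if either sample is a nn of the other
    D = np.diag(W.sum(axis=1))   # diagonal degree matrix
    return D - W, W

rng = np.random.default_rng(0)
L, W = graph_laplacian(rng.normal(size=(30, 8)), nn=5)
```

By construction, each row of the Laplacian sums to zero, which is what makes tr(Z^T K L K Z) penalize differences between embeddings of neighboring samples.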

Overall Objective Function
The objective of this work is to minimize the distribution difference between the domains by jointly matching the features of both domains, re-weighting the source domain samples, and preserving the original similarity of the samples of both domains. So, by incorporating Equations (3), (5), (7), and (9), the proposed objective function is obtained as follows:

min_Z tr(Z^T K ((1 − δ) M_d + δ Σ_{c=1}^{C} M_c + η L) K Z) + λ (||Z_s||_{2,1} + ||Z_t||_F^2) s.t. Z^T K H K Z = I,   (10)

where δ is a trade-off parameter that balances the marginal and conditional distributions [13], η is the trade-off parameter that regularizes the Laplacian term, and λ is the regularization parameter that trades off feature matching against instance re-weighting.

Optimization
By using the Lagrange multiplier Φ, Equation (10) can be written as the Lagrange function

L_f = tr(Z^T K ((1 − δ) M_d + δ Σ_{c=1}^{C} M_c + η L) K Z) + λ (||Z_s||_{2,1} + ||Z_t||_F^2) + tr((I − Z^T K H K Z) Φ).   (11)

To find an optimal value of the projection matrix Z, we take the partial derivative of L_f with respect to Z and set it to zero, which yields

(K ((1 − δ) M_d + δ Σ_{c=1}^{C} M_c + η L) K + λ G) Z = K H K Z Φ.   (12)

Since ||Z_s||_{2,1} is non-smooth at zero, its partial derivative is computed through a diagonal subgradient matrix G whose ith element is

G_{ii} = 1/(2 ||z^i||_2) for the source rows (with z^i ≠ 0) and G_{ii} = 1 for the target rows,   (13)

where z^i denotes the ith row of Z. As the problem in Equation (12) is a generalized eigendecomposition problem, we can solve it to find Φ = diag(φ_1, ..., φ_σ) (the σ smallest eigenvalues of the generalized problem) and Z = (z_1, ..., z_σ) (the corresponding σ eigenvectors). The pseudo code of our proposed method is given in Algorithm 1.
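One optimization step, i.e., solving the generalized eigenproblem in Equation (12), can be sketched with SciPy; the small ridge added to K H K for numerical stability and all names here are our assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def solve_projection(K, M, L, G, eta, lam, sigma):
    """Solve (K (M + eta*L) K + lam*G) Z = K H K Z Phi and keep the
    sigma eigenvectors with the smallest eigenvalues (minimization)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = K @ (M + eta * L) @ K + lam * G
    B = K @ H @ K + 1e-6 * np.eye(n)    # ridge keeps B positive definite
    _, Z = eigh(A, B)                   # eigenvalues in ascending order
    return Z[:, :sigma]

# Toy example: M plays the role of (1 - delta)*M_d + delta*sum_c M_c.
rng = np.random.default_rng(0)
n, sigma = 20, 4
K = (lambda X: X.T @ X)(rng.normal(size=(5, n)))
e = np.vstack([np.full((10, 1), 0.1), np.full((10, 1), -0.1)])
M = e @ e.T
Z = solve_projection(K, M, np.zeros((n, n)), np.eye(n),
                     eta=1.0, lam=0.1, sigma=sigma)
```

In the full algorithm this solve is repeated: the embedding V = Z^T K gives new pseudo labels, which rebuild the conditional MMD matrices and the subgradient matrix G for the next iteration.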

Experiments
In this section, we present the performance of the proposed STJML method through experiments on various visual domain classification problems.

Data Preparation
We considered three public image datasets for experimentation: Office + Caltech10 with Speeded Up Robust Features (SURF), Office + Caltech10 with VGG-FC6 features, and the Pose, Illumination, and Expression (PIE) face recognition dataset. These datasets are well known in domain adaptation and are widely used in recent works (such as [25,26]).
Caltech-256 has 30,607 images and 256 classes, while Office-31 is made up of three object domains: DSLR (D), Amazon (A), and Webcam (W), and contains a total of 4652 images with 31 classes. As the images in Office and Caltech-256 have different distributions, DA methods can help with cross-domain recognition. Since the two datasets contain 10 common classes, we considered the Office + Caltech 10 dataset from [8], which has 12 tasks: A → D, . . . , C → W. For experimentation, we considered both the SURF features and the deep features (VGG-FC6) of this dataset.
The Carnegie Mellon University (CMU) PIE (Pose, Illumination, and Expression) face dataset [27] contains over 40,000 facial images of 68 people. The images of each person were taken across 13 different poses, under 43 different illumination conditions, and with 4 different expressions. As there are subsets for many different poses, we considered only five poses (C05, C07, C09, C27, and C29) for experimentation, where each pose contains images with illumination and expression variation. Similar to the Office + Caltech 10 dataset, 20 combinations of source and target domains (tasks), such as C05 → C07, . . . , C29 → C27, can be constructed.
In this paper, we use notation P → Q to show knowledge transfer from source domain P to the target domain Q.

t-SNE Representation of Feature Spaces Learned by the Proposed Method (STJML)
In order to visualize the feature space learned by our proposed method, we used the t-SNE tool [28], which projects high-dimensional data into a 2-D (or low-dimensional) space. To show t-SNE representations of the feature spaces for the tasks A → D (SURF features) and A → W (VGG-FC6 features), we randomly selected 150 samples from each domain and used two different symbols (circles and pluses) to represent the two domains and 10 different colors ('black', 'red', 'lime', 'blue', 'orange', 'cyan', 'magenta', 'green', 'chocolate', and 'maroon') to represent the 10 different classes. Furthermore, to clearly show the distribution differences between the domains, we used colored ellipses to represent the variances of different classes in each domain, colored lines to indicate the distribution gap between same-class samples from different domains, and square and star symbols to mark the average (mean) point of each class in the source and target domains, respectively. For example, Figure 3a shows the initial feature representation of the A → D task with SURF features, where it can be seen that samples of different classes from different domains are too close together, and there is no uniform cluster for the different classes. Therefore, classification or clustering algorithms can easily misclassify samples that are too close to other clusters or near the edge of their own cluster. However, recent advances in deep learning allow us to obtain deep features such as the VGG-FC6 features for the Office + Caltech10 dataset. The deep feature (VGG-FC6) representation of the task A → W is shown in Figure 3b. Comparing the two feature types, it can be seen that the representation of the VGG-FC6 features (Figure 3b) is much better than that of the SURF features (Figure 3a).
Therefore, primitive machine learning algorithms perform better on deep features. The t-SNE representation of the feature spaces learned by the proposed method (STJML) for the tasks A → D (SURF features) and A → W (VGG-FC6 features) is shown in Figure 4. To examine the misclassified samples in the feature space learned by the proposed STJML method for the task A → D, we show two illustrations: the first with the predicted class labels for the target domain (Figure 4a) and the second with the given class labels for both domains (Figure 4b). After carefully analyzing both graphs, i.e., Figure 4a,b, we find that many samples (highlighted by asterisks (*) and arrows (→)) are misclassified by the proposed STJML method. However, if we compare the graph in Figure 3a with the graph in Figure 4a, it can be seen that the distribution difference between the source domain samples and the target domain samples is reduced. For example, in Figure 3a there is a visible distribution gap between the red class samples of the circle (source) domain and the red class samples of the plus (target) domain, whereas Figure 4a shows that our proposed STJML method minimizes this gap.
Similar to the A → D (SURF features) task, if we compare the graphs in Figures 3b and 4c for A → W (VGG-FC6 features), we observe that the distribution difference between the domains is satisfactorily reduced by the proposed STJML method. However, comparing Figure 4c with Figure 4d, it can also be seen that a few samples (marked by asterisks (*) and arrows (→)) are still misclassified by our proposed method.
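For reference, a 2-D visualization of this kind can be produced with scikit-learn's t-SNE implementation; the random data below is a synthetic stand-in for the actual domain features:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
Xs = rng.normal(size=(100, 64))                 # 100 source samples
Xt = rng.normal(loc=0.5, size=(100, 64))        # 100 shifted target samples
emb = TSNE(n_components=2, perplexity=30,
           random_state=0).fit_transform(np.vstack([Xs, Xt]))
# Rows 0-99 are source points (plot as circles), 100-199 target (pluses).
```

Plotting the two halves of `emb` with different markers and per-class colors reproduces the style of visualization discussed above.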

What Happens if One Component Is Omitted from the Proposed Method (STJML)
To reveal the importance of including all of the components discussed above, we ran our proposed method on the tasks A → D (SURF features) and A → W (VGG-FC6 features) while omitting one of its components at a time. Excluding one component at a time from STJML yields five new variants: STJML_s (omitting the subspaces of both domains), STJML_w (omitting the instance re-weighting term), STJML_m (omitting the marginal distribution term), STJML_c (omitting the conditional distribution term), and STJML_l (omitting the Laplacian regularization term).

Omitting Consideration of Subspaces of Both Domains (STJML_s)
If we execute STJML_s on the original VGG-FC6 features of the task A → W, the learned feature representation is shown in Figure 5. In Figure 5, the first graph (Figure 5a) shows the feature representation learned by STJML_s with the given source domain labels and the predicted target domain labels, while the second graph (Figure 5b) presents the learned features with the given labels of both domains. Comparing the representation of the original features for the task A → W (Figure 3b) with that of the features learned by STJML_s (Figure 5a), the feature space learned by STJML_s is much better: the distribution difference between the domains is small, the distance between samples of the same class is small, and the distance between samples of different classes is large. Thus, if this learned feature space is given to a classification algorithm such as a 1-NN classifier, the classifier's accuracy is 86.44%, a gain of about 23% over a classifier trained on the original feature space. Although the learned feature space is much better than the original one, some samples are still misclassified. To investigate the samples misclassified by the STJML_s method, we also visualized the learned features with the given labels of both domains (Figure 5b). Comparing Figure 5a,b, it can be observed that some class labels predicted by STJML_s (indicated with asterisks (*) and arrows (→)) are incorrect.
Again, if we compare the feature space learned by the STJML_s method (Figure 5a) with that learned by the full STJML method (Figure 4c), the clusters for the different classes obtained by our proposed method with all components are slightly better than those obtained by STJML_s. For example, the black class cluster obtained by the STJML method is slightly separated from the maroon class samples, whereas that obtained by the STJML_s method collides with the maroon class samples. Similarly, the orange class cluster obtained by the STJML_s method is the worst compared to that obtained by the STJML method.

Omitting Instance-Re-Weighting Term (STJML w )
Since the results obtained by the STJML w and STJML methods for the task A → W (VGG-FC6 features) were similar, we considered another task, A → D (SURF features), to show the effect of the instance-re-weighting term. The feature spaces learned by the STJML w method on the task A → D (SURF features) are shown in Figure 6a,b. Comparing the two plots, we can see that many samples (some of them highlighted by asterisks (*) and arrows (→)) are misclassified by the STJML w method. As the clusters obtained by STJML w for the task A → D (SURF features) in Figure 6a,b are not as good as those obtained by the proposed method for the task A → W (VGG-FC6 features), we compare the graph obtained by the STJML w method (Figure 6b) with the graph obtained by the STJML method (Figure 4b). From this comparison, it can be concluded that some source domain samples of the 'lime' colored class in Figure 6b (see the 'lime' colored ellipse) are not down-weighted as effectively as in Figure 4b. Therefore, the performance of the STJML w method (41.70% accuracy) is not as good as that of the STJML method (49.10%).
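The idea behind instance re-weighting can be sketched loosely in the spirit of TJM's l2,1-norm structured sparsity (this is an illustration, not the exact STJML optimization): the l2 norm of each source instance's row in the adaptation matrix acts as that instance's weight, so instances whose rows are driven toward zero are effectively excluded from matching. The matrix below is a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical adaptation matrix: one row per source instance (kernelized form).
A_src = rng.normal(size=(50, 10))
A_src[:5] *= 0.01          # pretend the solver shrank 5 outlier rows

# l2,1-style structured sparsity: a row's l2 norm acts as the instance weight,
# so instances with near-zero rows contribute almost nothing to the matching.
weights = np.linalg.norm(A_src, axis=1)
weights /= weights.max()   # normalize to [0, 1] for readability
print(weights[:5])         # the shrunk "outlier" rows get tiny weights
```

In TJM-style methods this shrinkage is not done by hand, of course; it emerges from penalizing the l2,1 norm of the source rows during optimization.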

Omitting Marginal Distribution Term (STJML m )
If we omit the marginal distribution from our proposed STJML method, the t-SNE views of the feature spaces learned by STJML m for the A → W (VGG-FC6 features) task are shown in Figure 7a,b. After excluding the marginal distribution, the STJML m method achieves 90.85% accuracy, which is similar to the accuracy achieved by the STJML method. Thus, dropping this distribution does not have much effect on the performance of the STJML method for this task. Moreover, after carefully examining the graphs learned by the STJML m (Figure 7a,b) and STJML (Figure 4c,d) methods, we find that the graphs learned by both methods are almost identical.
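The marginal distribution term is commonly measured with the empirical Maximum Mean Discrepancy (MMD), as in TCA/JDA-style methods. The sketch below (NumPy, synthetic data) builds the standard MMD matrix M0 and verifies that tr(X^T M0 X) equals the squared distance between the two domain means:

```python
import numpy as np

def marginal_mmd_matrix(ns, nt):
    """JDA/TJM-style MMD matrix M0: tr(X^T M0 X) measures the squared
    distance between the source and target sample means."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, (40, 5))
Xt = rng.normal(1.0, 1.0, (60, 5))
X = np.vstack([Xs, Xt])              # n x D, rows are samples
M0 = marginal_mmd_matrix(40, 60)

mmd = np.trace(X.T @ M0 @ X)
diff = Xs.mean(axis=0) - Xt.mean(axis=0)
print(np.isclose(mmd, diff @ diff))  # True
```

In the projected space, the same quantity becomes tr(A^T X^T M0 X A) for a projection A, which is what adaptation methods minimize.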

Omitting Conditional Distribution Term (STJML c )
If we exclude the conditional distribution from our proposed STJML method, the t-SNE views of the feature spaces learned by STJML c for the A → W (VGG-FC6 features) task are shown in Figure 8a,b. Without this term, the STJML c approach achieves 73.90% accuracy, which is much lower than the accuracy (90.85%) achieved by the STJML method. Therefore, this term greatly impacts the performance of the proposed STJML method. If we compare the graph obtained by the STJML method (Figure 4c) with the graph obtained by STJML c (Figure 8a), it can be seen that the STJML c method has not effectively reduced the distribution difference between the two domains. For example, the distribution difference between the green colored class samples of the two domains (plus and circle markers) is not minimized in Figure 8a (the green colored class samples are scattered across different green colored circles), whereas in Figure 4c all green colored class samples lie within one cluster. Because the STJML c method does not minimize this distribution gap, several samples in Figure 8a (highlighted by asterisks (*) and arrows (→)) are misclassified, as can be seen by comparing with the graph in Figure 8b.
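The conditional distribution term is typically handled JDA-style: for each class c, source samples with true label c are compared against target samples whose pseudo label (e.g., a 1-NN prediction) is c. A minimal sketch with toy labels, assuming the JDA formulation that STJML builds on:

```python
import numpy as np

def conditional_mmd_matrix(ys, yt_pseudo, c):
    """Class-c MMD matrix (JDA-style): compares source samples with true
    label c against target samples whose *pseudo* label is c."""
    ns, nt = len(ys), len(yt_pseudo)
    e = np.zeros(ns + nt)
    src = np.where(ys == c)[0]
    tgt = ns + np.where(yt_pseudo == c)[0]
    if len(src) and len(tgt):
        e[src] = 1.0 / len(src)
        e[tgt] = -1.0 / len(tgt)
    return np.outer(e, e)

ys = np.array([0, 0, 1, 1, 1])
yt_pseudo = np.array([0, 1, 1])      # e.g. 1-NN predictions on target data
Mc = conditional_mmd_matrix(ys, yt_pseudo, c=1)
print(Mc.shape)  # (8, 8)
```

Because the pseudo labels are refined after each iteration, these per-class matrices are rebuilt every round, which is why dropping the term removes the class-wise alignment seen in Figure 4c.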

Omitting Laplacian Regularization Term (STJML l )
Samples that would otherwise lose their original similarity in the learned feature space can preserve it through the Laplacian regularization term. As a result, samples that would drift away from their respective groups or clusters may be pulled closer together. Thus, to see the impact of this term, we omit it from the proposed STJML method and execute the algorithm. The t-SNE representation of the feature spaces learned by STJML l is depicted in Figure 9a,b. If we compare the graphs generated by STJML l (Figure 9a,b) with those generated by STJML (Figure 4c,d), the samples in each class cluster generated by STJML l are widely spread around their mean point, whereas they are less spread in the clusters generated by STJML. Therefore, the performance of the STJML l method (82.03% accuracy) is slightly lower than that of the STJML method. Comparing the graphs in Figure 9a,b, we can see that some samples (highlighted by asterisks (*) and arrows (→)) are misclassified by the STJML l method.
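The Laplacian regularization term can be sketched as follows (NumPy/scikit-learn, synthetic data): an nn-nearest-neighbour graph is built on the original features, and the identity tr(Z^T L Z) = 1/2 * sum_ij W_ij ||z_i - z_j||^2 shows why penalizing this trace keeps originally similar points close in the projected space:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

# nn-nearest-neighbour affinity graph on the ORIGINAL features,
# symmetrized so W is a valid similarity matrix.
W = kneighbors_graph(X, n_neighbors=3, mode='connectivity').toarray()
W = np.maximum(W, W.T)
L = np.diag(W.sum(axis=1)) - W      # unnormalized graph Laplacian

Z = X @ rng.normal(size=(4, 2))     # a stand-in 2-D projection of the data
reg = np.trace(Z.T @ L @ Z)
pair = 0.5 * sum(W[i, j] * np.sum((Z[i] - Z[j]) ** 2)
                 for i in range(30) for j in range(30))
print(np.isclose(reg, pair))  # True
```

Minimizing tr(Z^T L Z) therefore penalizes projections that place graph-neighbouring points far apart, which is exactly the similarity-preservation effect lost in STJML l.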

Comparison with State-Of-The-Art Methods
The proposed STJML method was verified and compared with many state-of-the-art primitive and domain adaptation algorithms. A brief description of the comparative methods is as follows: • NN, PCA+1NN, and SVM: These are traditional machine learning algorithms, which assume that the training and test data follow the same distribution.

•
Kernelized Unified Framework for Domain Adaptation (KUFDA) [16]: This TL method improves the JGSA method by adding the Laplacian regularization term.

Parameter Sensitivity
Our proposed STJML method, like other state-of-the-art domain adaptation methods [5,8,13,16], contains various parameters: nn, k, σ, η, λ, and δ. Similar to previous methods [14,40], we analyze the parameter sensitivity of the STJML method on all possible tasks of the datasets to validate that an appropriate value of each parameter can be chosen to obtain satisfactory performance. To analyze the sensitivity of one parameter, we vary its value while keeping the other parameter values constant. For example, we vary the parameter k from 1 to 10 with an interval of 1 while keeping nn = 1, σ = 100, η = 10^-1, λ = 10^-3, and δ = 0.5 constant. Below we describe each parameter; the sensitivity test was performed on all the considered datasets, but we show the analysis graphs only for Office + Caltech10 with VGG-FC6 features and for the PIE face dataset. In our proposed method, we use the k-NN classifier to predict the labels of the target domain, and the performance of this classifier depends on an appropriate value of k for each task. Therefore, for each task of the considered datasets, we varied k from 1 to 10 with an interval of 1 while keeping the other parameter values constant. The resulting graphs for Office + Caltech10 with VGG-FC6 features and the PIE face dataset are shown in Figure 10a,b, from which it can be seen that the STJML method performs best at k = 1 for most tasks of both datasets.
Therefore, we keep k = 1 for most tasks of all datasets, except for the tasks 9 → 7, 9 → 27, 9 → 29, 27 → 7, 27 → 09, C → D (VGG-FC6 features), C → A (SURF features), C → W (SURF features), D → A (SURF features), D → C (SURF features), and W → A (SURF features), for which the k values are set to 9, 9, 5, 10, 10, 9, 3, 3, 10, 3, and 2, respectively.
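The k sensitivity test above amounts to a simple sweep; the sketch below uses synthetic stand-ins for the learned source and target features (the real experiment would use the STJML-projected domains):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins for the projected source and target features.
Xs, ys = make_classification(n_samples=200, n_features=10, random_state=0)
Xt, yt = make_classification(n_samples=100, n_features=10, random_state=1)

accs = {}
for k in range(1, 11):                       # vary k, hold other params fixed
    clf = KNeighborsClassifier(n_neighbors=k).fit(Xs, ys)
    accs[k] = clf.score(Xt, yt)              # accuracy on the "target" data
best_k = max(accs, key=accs.get)
print(best_k)
```

The same loop structure, with the classifier retrained on the adapted features per task, produces the curves in Figure 10a,b.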

nn Parameter
Similar to the parameter k, an appropriate value of the parameter nn is required for constructing the Laplacian graph, as discussed in Section 3.6. We therefore vary nn from 1 to 10 while keeping the other parameter values constant. As shown in Figure 10c,d, there is no single value of nn for which STJML performs best on all tasks of the Office + Caltech dataset with SURF features. However, STJML performs best for nn = 1 or 2 on the tasks of the PIE dataset. Therefore, we keep nn = 1 for all tasks of the PIE face dataset (except 5 → 27 (nn = 2), 7 → 5 (nn = 2), and 7 → 09 (nn = 2)). Similarly, we keep nn = 3 for all tasks of Office + Caltech with VGG-FC6 features (except A → C (nn = 1) and A → W (nn = 7)), and nn = 10 for all tasks of Office + Caltech with SURF features (except A → D (nn = 1), A → W (nn = 1), C → A (nn = 8), C → D (nn = 2), and W → C (nn = 1)).

δ Parameter
This parameter quantitatively evaluates the relative importance of aligning the marginal and conditional distributions in domain adaptation. Existing DA work [6,8] assumes that both distributions are equally important, which may not hold for real-world problems. Wang et al. [13] introduced an adaptive factor to measure the importance of the two distributions dynamically. In this paper, however, we performed manual parameter sensitivity tests to ascertain a reasonable value of this factor for each task. We varied its value from 0 to 1 with an interval of 0.1, and the resulting graphs are shown in Figure 11, which shows that STJML performs well at different values of δ for different tasks. Thus, to achieve the best performance of the STJML method, we keep δ = 0.5 for all tasks of the PIE face dataset (except 5 → 7 (δ = 0.9) and 7 → 29 (δ = 0.6)). Similarly, we keep δ = 0.5 for all tasks of Office + Caltech with VGG-FC6 features (except A → W (δ = 0.9)), and δ = 0.9 for all tasks of Office + Caltech with SURF features (except C → W (δ = 0.9)).
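The role of δ can be illustrated with a BDA/MEDA-style weighted combination of the marginal MMD matrix and the summed per-class conditional MMD matrices; the matrices below are placeholders, not those of the actual tasks:

```python
import numpy as np

def combined_mmd(M0, Mc_list, delta):
    """Weighted MMD objective, BDA-style: delta trades off the marginal
    term M0 against the summed per-class conditional terms."""
    return (1.0 - delta) * M0 + delta * sum(Mc_list)

n = 6
M0 = np.eye(n)                    # placeholder marginal MMD matrix
Mc_list = [0.5 * np.eye(n)]       # placeholder conditional MMD matrices
M_mid = combined_mmd(M0, Mc_list, delta=0.5)
print(np.allclose(M_mid, 0.75 * np.eye(n)))  # True
```

Setting δ = 0 reduces the objective to pure marginal alignment, δ = 1 to pure conditional alignment; the manual sweep above searches this interval per task.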

Parameter: σ
As the performance of the proposed STJML method depends on the number of eigenvectors (σ) corresponding to the leading eigenvalues, we ran STJML with varying values of σ (70 to 130 with an interval of 5 for the PIE dataset, and 9 to 30 with an interval of 3 for the Office + Caltech dataset) and plot the classification accuracy against σ in Figure 13a,b. It can be seen that the proposed STJML method gives the best accuracy at different values of this parameter for different tasks of both datasets; for good overall accuracy, we keep σ = 100 for all tasks (with a few task-specific exceptions).

d Parameter
In order to find a low-dimensional subspace for both domains, we need to project the original data from the D-dimensional space to a d-dimensional subspace. However, this projection can lose some information. Therefore, we need to find an appropriate value of d so that the original information of both domains is retained in the low-dimensional space.
Like the other parameters, we vary its value while keeping the other parameter values constant, and find that the proposed STJML method performs best at d = 140 for all tasks of the PIE face dataset, while d = 100 works best for SURF features and d = 200 for VGG-FC6 features of the Office + Caltech datasets.
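The trade-off governed by d can be illustrated with a PCA-style projection (an illustration only; STJML's subspace learning is not plain PCA): sweeping candidate dimensions and tracking retained variance mirrors the manual search described above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))          # stand-in for D = 50 dimensional data

# Sweep candidate subspace dimensions d and track how much variance survives.
retained = {}
for d in (5, 10, 20, 40):
    pca = PCA(n_components=d).fit(X)
    retained[d] = pca.explained_variance_ratio_.sum()
print(retained)
```

Larger d retains more of the original information but increases the cost of every downstream step, which is why a per-dataset value (140, 100, or 200 above) is chosen rather than simply d = D.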

Experimental Setup
To show the strength of the STJML method over previous state-of-the-art methods, we considered 12 tasks of Office + Caltech10 with SURF features, 12 tasks of Office + Caltech10 with VGG-FC6 features, and 20 tasks of the PIE face dataset. With the help of the parameter sensitivity tests, we found an appropriate value of each parameter of the STJML method and then used those values to ascertain the proposed method's accuracy on each task of the considered datasets. The resulting accuracy of each task of all datasets is stated in Tables 1-3. The accuracies of the other comparative methods in Tables 1-3 were taken directly from their respective papers or from previous papers [5,7,8,13,14,16,37,38].

Experimental Results and Analysis
The recognition performance of the proposed method and the other compared state-of-the-art methods on three widely used domain adaptation datasets is reported in Tables 1-3. From these results, we can draw the following observations:

•
Primitive machine learning approaches such as NN, PCA, and SVM do not perform well due to the distribution gap between the training (source) and test (target) datasets.

•

Among the domain adaptation methods, the GFK method has the worst average accuracy over all tasks of the Office + Caltech dataset for both SURF and VGG-FC6 features.

•
The JDA method's performance on all three datasets is higher than that of the TCA method because it adopts the conditional distribution in addition to the marginal distribution.

•
The ILS method performs well compared to other subspace alignment methods (such as SA, GFK, and CORAL) on the Office + Caltech dataset with deep features because it considers a more robust discriminative loss function.

•
As TJM adopts the instance re-weighting term, its performance is better than that of other DA methods such as TCA, GFK, JDA, SA, CORAL, ILS, and BDA for the deep features of the Office + Caltech dataset. However, for the SURF features, TJM gives better average accuracy than SCA, ARTL, GFK, and TCA, but performs worse than JGSA, CORAL, LDADA, DICE, RTML, ILS, and JDA. As KUFDA improves JGSA by adding the Laplacian regularization term, its average accuracy is much higher than that of the other methods for the deep features of the Office + Caltech dataset, but lower than that of the DICE method for most tasks of the PIE face dataset.

•
For the PIE face and Office + Caltech10 with SURF features datasets, DICE performs better than all other methods (except STJML) because it takes care of the intra-domain structure, especially for the target domain. However, its performance is abysmal for the deep features of the Office + Caltech10 dataset.

•

Since our proposed method covers all the important objectives and works on the projected subspaces of both domains, the average accuracy of the proposed STJML method over all tasks of all the considered datasets is higher than that of all the other comparative methods. However, KUFDA beats our proposed algorithm on some tasks of the Office + Caltech dataset with deep features, such as A → C, D → W, W → A, and W → C. Similarly, DICE beats the proposed method on eight tasks of the PIE face dataset.

Computational Complexity
Here, we analyze the computational complexity of Algorithm 1 using big-O notation. The computational cost is detailed as follows: O(D^3 + nDd) for finding the subspaces of both the source and the target domains, where D < n (Line 1); O(n^2 d) for constructing the Laplacian matrix (Line 2); O(n^2 d) for computing the kernel matrix (Line 3); O(n^2) for generating the initial pseudo labels (Line 5); O(t(dn^2 + Cdn^2)) for constructing the marginal and conditional distribution matrices (Line 7); O(tn^2 d) for solving the generalized eigendecomposition problem with dense matrices (Line 9); O(t(σnn_s + σnn_t)) for computing the X_s and X_t matrices (Line 11); O(tn^2) for generating pseudo labels (Line 12); and O(tn^2) for computing the subgradient matrix (Line 13). In total, the computational complexity of Algorithm 1 is O(D^3 + nDd + n^2 d + t(dn^2 + Cdn^2 + σnn_s + σnn_t + n^2)). The complexity of this model can be greatly reduced by low-rank approximation.
Since the proposed method computes the subspaces of both domains and adds the Laplacian regularization term, it has a higher cost than TJM but a lower cost than JGSA.
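The low-rank approximation mentioned above can be illustrated with the Nystrom method, a common way to avoid forming the full n x n kernel matrix (a sketch under the assumption of an RBF kernel and randomly chosen landmark points; not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 50                          # n samples, m landmarks (m << n)
X = rng.normal(size=(n, 3))

def rbf(A, B, gamma=0.1):
    """RBF kernel between row-sample matrices A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Nystrom approximation: K ~ C W^+ C^T, built from an n x m kernel slice,
# dropping storage from O(n^2) to O(nm) and making downstream algebra cheaper.
idx = rng.choice(n, size=m, replace=False)
C = rbf(X, X[idx])                      # n x m slice of the kernel
W = C[idx]                              # m x m block at the landmarks
K_approx = C @ np.linalg.pinv(W) @ C.T

K_exact = rbf(X, X)
err = np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)
print(err)
```

For smooth kernels the relative error decays quickly with m, so the quadratic and cubic terms in the complexity above can be traded for terms linear in n.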

Conclusions and Future Work
In this paper, we proposed a novel Subspace based Transfer Joint Matching with Laplacian Regularization (STJML) method for efficiently transferring knowledge from the source domain to the target domain. Because it jointly optimizes all the essential components, the proposed STJML method robustly reduces the distribution differences between the two domains. Extensive experiments on several cross-domain image datasets suggest that the STJML method performs much better than state-of-the-art primitive and transfer learning methods.
In the future, there are several directions in which the proposed STJML method can be extended. Firstly, we will extend the STJML method to multi-task learning environments [42], where multiple tasks may contain some labeled samples; by using the label information of all tasks, the generalization performance of all of them can be enhanced.
Secondly, the STJML method has many parameters, and conducting manual parameter sensitivity tests to find appropriate values is a tedious and time-consuming process. Furthermore, the STJML method uses the original features to find a common feature space; if the original features themselves are distorted, the STJML method will not yield a robust classifier. Therefore, in the future, we will use particle swarm optimization [43] to select an appropriate value for each parameter and a proper subset of good features across both domains. This will simplify parameter selection for the STJML method and improve its performance by eliminating distorted features.
Lastly, given the increasing interest in neural-network-based learning models [44] due to their outstanding performance, we will also extend the STJML method to a deep learning framework, in which deep features will be extracted with respect to the overall objective function of our proposed method.

Conflicts of Interest:
The authors declare no conflict of interest.