## 1. Introduction

Indoor–outdoor camera surveillance systems [1,2] are widely used in urban areas, railway stations, airports, smart homes, and supermarkets. These systems play an important role in security and traffic management [3]. However, cameras with different properties and positions deployed in these systems can create a distribution difference among the captured images. This difference degrades system performance when primitive machine learning algorithms are used for recognition [4]. For example, if a classifier (or primitive algorithm) is trained on source domain images (taken with a DSLR camera), the trained classifier will not perform as expected when tested on images collected from some other target domain (taken with a webcam). A simple way to improve the classifier's performance would be to train it only on target domain images. In practice, however, there are no labeled images in the target domain, and labeling target domain images is a time-consuming process. Let us consider the example shown in Figure 1 to discuss in detail how images collected from different environments can cause differences in distribution across domains. Figure 1 presents various possibilities that can cause distribution differences: (1) the images (keyboards and headphones) shown in Figure 1a,b are collected with cameras of different quality, i.e., a low-quality camera (webcam) and a high-quality camera (DSLR); (2) the images shown in Figure 1c,d are taken under different weather conditions, i.e., day or clear weather versus night or rainy weather.

Recently, the literature [2,4] has seen a growing interest in developing transfer learning (TL) or domain adaptation (DA) algorithms that minimize the distribution gap between domains, so that the structure or information available in the source domain can be effectively transferred to understand the structure of the target domain. In previous work [5,6,7,8,9,10,11,12], two learning strategies for domain adaptation have been considered independently: (1) instance re-weighting [9,10,11,12], which reduces the distribution gap between domains by re-weighting the source domain instances and then training the model with the re-weighted source domain data; (2) feature matching [5,6,8,13,14], which finds a common feature space across both domains by minimizing the distribution gap.

If the distribution difference between the domains is large enough, there will always be source domain instances that are not relevant to the target domain instances, even after finding a common feature space. In this situation, jointly optimizing instance re-weighting and feature matching is an important and unavoidable task for robust transfer learning. To understand the need for jointly learning instance re-weighting and feature matching more deeply, consider an example in which we have source domain data with outlier samples (or irrelevant instances), as shown in Figure 2a, and target domain data, as shown in Figure 2b. In this case, if we learn only the common feature space between both domains with existing methods such as Joint Geometrical and Statistical Alignment (JGSA) [8] and Joint Distribution Adaptation (JDA) [6], the new representation of the source and target domain data is shown in Figure 2c, where it can be seen that the domain difference remains large due to outlier samples or irrelevant instances (the circled symbols). However, if we jointly learn feature matching and instance re-weighting, the data representation is shown in Figure 2d, where it can be seen that all the outlier samples are down-weighted to further reduce the domain difference.

Fortunately, there is a method in the literature, Transfer Joint Matching (TJM) [7], that performs joint feature matching and instance re-weighting by down-weighting irrelevant source domain instances. However, feature matching and instance re-weighting alone are insufficient for successfully transferring knowledge from the source domain to the target domain. Other DA and TL methods consider additional essential properties to minimize the distribution difference between domains. For example, the JDA method considers the conditional distribution in addition to the marginal distribution, which is needed when the data is conditionally (class-wise) distributed. Subspace Alignment (SA) [15] makes use of subspaces (composed of 'd' eigenvectors induced by Principal Component Analysis (PCA)), one for each domain, and suggests minimizing the distribution difference between the subspaces of both domains rather than between the original-space data. JGSA preserves source domain discriminative information, in addition to the properties considered by SA and JDA (subspaces, marginal and conditional distributions), to further improve the performance of JDA. However, the feature space obtained by JGSA is limited because data samples in this space may lose their original similarity, so they can easily be misclassified. The Kernelized Unified Framework for Domain Adaptation (KUFDA) [16] improves JGSA by adding an original-similarity weight matrix term so that samples do not lose their original similarity in the learned space. KUFDA incorporates most of the important properties discussed above but still suffers from outlier samples because it does not include an instance re-weighting term.

In this paper, to solve all of the above-discussed challenges and to efficiently transfer knowledge from the source domain to the target domain, we propose a novel Subspace based Transfer Joint Matching with Laplacian Regularization (STJML) method for visual domain adaptation by jointly matching the features and re-weighting instances across both the source and the target domains.

The major contributions of this work can be listed as follows:

- The proposed STJML method is the first framework that goes beyond all comparable cutting-edge methods by considering all inevitable properties in a common framework: projecting both domains' data into a low-dimensional manifold, instance re-weighting, minimizing the marginal and conditional distributions, and preserving the geometrical structure of both domains.

- With the help of the t-SNE tool, we graphically visualize the features learned by the proposed method after excluding each component, to illustrate the reason for including all the components (or inevitable properties).

## 2. Related Work

Recently, various DA and TL approaches have been proposed for transferring structure or information from one domain to another in terms of features, instances, relational information, and parameters [4,9]. The TL approaches most closely related to our work can be divided into three types: feature-based transfer learning [6], instance-based transfer learning [7], and metric-based transfer learning [9].

In the first type, the objective is to minimize the distribution difference between the source and target domains through feature learning. For example, Pan et al. [17] proposed a dimensionality reduction method called Maximum Mean Discrepancy Embedding (MMDE) for minimizing the distribution gap between domains. MMDE learns a common feature space for the domains in which the distance between distributions is minimized while preserving data variance. Pan et al. [5] further extended the MMDE algorithm with a learning method called Transfer Component Analysis (TCA). TCA tries to learn a feature space across domains in a reproducing kernel Hilbert space using Maximum Mean Discrepancy (MMD). With the new representation in this feature space, standard machine learning methods such as k-Nearest Neighbor (k-NN) and Support Vector Machine (SVM) can be used to train classifiers in the source domain for use in the target domain. Long et al. [6] extended TCA by considering not only the marginal distribution but also the conditional distribution, with the help of pseudo-labels in the target domain. Fernando et al. [18] introduced a subspace-centric method called Subspace Alignment (SA). SA aims to align the source domain subspace (E) with the target domain one (F) with the help of a transformation matrix (M). Here, E and F are obtained by Principal Component Analysis (PCA) on the source domain and the target domain, respectively. Shao et al. [19] proposed a low-rank transfer learning method to match both domains' samples in the subspace for transferring knowledge. Zhang et al. [8] proposed a unified framework that minimizes the distribution gap between domains both statistically and geometrically, called Joint Geometrical and Statistical Alignment (JGSA). With the help of two coupled projections E (for the source domain) and F (for the target domain), JGSA projects the source and target domain data into a low-dimensional feature space, where the samples of both domains are geometrically and statistically aligned.

In the second type, the objective is to re-weight the domain samples so as to minimize the distribution difference between the domains. The TrAdaBoost TL method [10] re-weights the source domain labeled data to filter out samples that are most likely not from the target domain distribution. In this way, the re-weighted source domain samples follow the same distribution as the target domain. Finally, the re-weighted samples can be used as additional training samples for learning the target domain classifier. As the original TrAdaBoost method applies only to classification problems, Pardoe et al. [11] extended it with the ExpBoost.R2 and TrAdaBoost.R2 methods to handle regression problems.

In the final type, the target domain metric is learned by establishing a relationship between the source domain and target domain tasks. Kulis et al. [20] introduced a method, called ARC-t, to learn a transformation matrix between the source and target domains based on metric learning. Zhang et al. [21] proposed a transfer metric learning (TML) method that establishes the relationship between domains. Ding et al. [22] developed a robust transfer metric learning (RTML) method to effectively assist unlabeled target learning by transferring information from the labeled source domain data.

## 3. A Subspace Based Transfer Joint Matching with Laplacian Regularization

This section presents the Subspace based Transfer Joint Matching with Laplacian Regularization (STJML) method in detail.

#### 3.1. Problem Definition

To understand transfer learning or domain adaptation, the domain and the task must be explicitly defined. A domain $\mathcal{D}$ consists of two parts: a feature space $\mathcal{X}$ and a marginal probability distribution $P(x)$, i.e., $\mathcal{D}=\{\mathcal{X},P(x)\}$, where $x\in \mathcal{X}$. Two domains are said to be different if they differ in their feature spaces or marginal distributions. Given a domain $\mathcal{D}$, a task, denoted by $\mathcal{T}$, also consists of two parts: a label space $\mathcal{Y}$ and a classifier function $f(x)$, i.e., $\mathcal{T}=\{\mathcal{Y},f(x)\}$, where $y\in \mathcal{Y}$ and the classifier function $f(x)$ predicts the label of a new instance $x$. The classifier function $f(x)$ can also be interpreted as the conditional probability distribution $Q(y|x)$.

Given a labeled source domain ${\mathcal{D}}_{s}=\{({x}_{1},{y}_{1}),\cdots,({x}_{{n}_{s}},{y}_{{n}_{s}})\}$ and an unlabeled target domain ${\mathcal{D}}_{t}=\{{x}_{1},\cdots,{x}_{{n}_{t}}\}$, under the assumptions ${\mathcal{X}}_{s}={\mathcal{X}}_{t}$, ${\mathcal{Y}}_{s}={\mathcal{Y}}_{t}$, ${P}_{s}({x}_{s})\ne {P}_{t}({x}_{t})$, and ${Q}_{s}({y}_{s}|{x}_{s})\ne {Q}_{t}({y}_{t}|{x}_{t})$, transfer learning aims to improve the performance of the target domain classifier function ${f}_{t}(x)$ in ${\mathcal{D}}_{t}$ using the knowledge in ${\mathcal{D}}_{s}$.

#### 3.2. Formulation

To address the limitations of existing TL methods, the STJML method minimizes the distribution gap statistically and geometrically by combining the following components: finding subspaces for both domains, feature matching, instance re-weighting, and exploiting the shared geometrical structure. In our proposed STJML approach, we first compute the subspaces of both domains and then, with the help of a common projection matrix Z for both domains, perform feature matching, instance re-weighting, and geometrical structure exploitation in a Reproducing Kernel Hilbert Space (RKHS) to match both first- and higher-order statistics.

#### 3.3. Subspace Generation

Even though both domains' data lie in the same D-dimensional feature space, they are drawn from different marginal distributions. Consequently, following [15], instead of working in the original feature space, we work on more robust representations of both domains' data to enable stronger classification that is not subject to local perturbations. For subspace generation, we use Principal Component Analysis (PCA), which selects the 'd' eigenvectors corresponding to the 'd' largest eigenvalues; the original-space data are then projected onto these 'd' eigenvectors. For example, given an input data matrix $\mathcal{X}=[{x}_{1},{x}_{2},\cdots,{x}_{n}]\in {\mathbb{R}}^{D\times n}$, where $n={n}_{s}+{n}_{t}$ and D is the dimension of each sample in the original space, PCA generates the subspace matrix $\mathbb{X}\in {\mathbb{R}}^{d\times n}$ by projecting the input data matrix $\mathcal{X}$ onto the selected 'd' eigenvectors.
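The subspace-generation step described above can be sketched in plain NumPy as follows. This is an illustrative sketch, not the authors' implementation; the function name `pca_subspace` and the sample sizes are our own choices.

```python
import numpy as np

def pca_subspace(X, d):
    """Project D-dimensional samples (columns of X) onto the 'd' eigenvectors
    of the covariance matrix with the largest eigenvalues.
    X: (D, n) matrix with source and target samples stacked column-wise."""
    Xc = X - X.mean(axis=1, keepdims=True)   # center the data
    C = Xc @ Xc.T / X.shape[1]               # (D, D) covariance matrix
    vals, vecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    P = vecs[:, -d:][:, ::-1]                # top-d eigenvectors, largest first
    return P.T @ X                           # (d, n) subspace matrix

# Example: 100 source + 80 target samples in a 50-dimensional original space
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 180))
X_sub = pca_subspace(X, d=10)
print(X_sub.shape)  # (10, 180)
```

Both domains are stacked into one matrix before PCA, matching the paper's use of a single subspace matrix $\mathbb{X}$ over $n={n}_{s}+{n}_{t}$ samples.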

#### 3.4. Feature Transformation

As a dimensionality reduction method such as PCA learns a transformed feature representation by reducing the reconstruction error of the given data, it can also be utilized for data reconstruction. Let us consider the subspace data matrix $\mathbb{X}\in {\mathbb{R}}^{d\times n}$ and the centering matrix $H=I-\frac{1}{n}\mathbf{1}$, where $\mathbf{1}$ is an $n\times n$ matrix of ones. The covariance matrix of the subspace data matrix $\mathbb{X}$ of both domains can then be calculated as $\mathbb{X}H{\mathbb{X}}^{T}$. The objective of PCA is to maximize the variance of both domains by finding an orthogonal transformation matrix $W\in {\mathbb{R}}^{d\times \sigma}$, where $\sigma$ is the number of selected eigenvectors onto which the subspace data matrix $\mathbb{X}$ is projected. Thus,

$$\underset{{W}^{T}W=I}{\mathrm{max}}\ \mathrm{tr}({W}^{T}\mathbb{X}H{\mathbb{X}}^{T}W),\qquad (1)$$

where tr(·) is the trace of a matrix and I is an identity matrix. The problem in Equation (1) can easily be solved by eigendecomposition as $\mathbb{X}H{\mathbb{X}}^{T}W=W\Phi$, where $\Phi =\mathrm{diag}({\Phi}_{1},\cdots,{\Phi}_{\sigma})$ is the matrix of the $\sigma$ largest eigenvalues. After projecting the subspace data matrix $\mathbb{X}$ onto the projection vector matrix ${W}_{\sigma}$ corresponding to the $\sigma$ largest eigenvalues, the optimal $\sigma$-dimensional learned representation is $V=[{v}_{1},\cdots,{v}_{\sigma}]={W}_{\sigma}^{T}\mathbb{X}$.

To achieve our goal, we need to work in an RKHS using some kernel function (linear, polynomial, Gaussian, etc.). Let us consider the chosen kernel mapping $\theta$, which maps a data sample $x$ to $\theta(x)$, i.e., $\theta :x\to \theta(x)$, or $\theta(\mathbb{X})=[\theta({x}_{1}),\cdots,\theta({x}_{n})]$, and the kernel matrix $K=\theta{(\mathbb{X})}^{T}\theta(\mathbb{X})\in {\mathbb{R}}^{n\times n}$. After applying the Representer theorem $W=\theta(\mathbb{X})Z$, Equation (1) can be written as follows:

$$\underset{{Z}^{T}Z=I}{\mathrm{max}}\ \mathrm{tr}({Z}^{T}KH{K}^{T}Z),\qquad (2)$$

where $Z\in {\mathbb{R}}^{n\times \sigma}$ is the transformation matrix, and the subspace embedding becomes $V={Z}^{T}K$.
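A minimal sketch of the kernelization step, assuming a Gaussian (RBF) kernel; the bandwidth `gamma` and all matrix sizes here are our own illustrative choices, and `Z` is a random stand-in for the transformation matrix that the method actually learns.

```python
import numpy as np

def rbf_kernel(Xsub, gamma=1.0):
    """Gaussian kernel K[i, j] = exp(-gamma * ||x_i - x_j||^2) over the
    subspace samples (columns of Xsub)."""
    sq = np.sum(Xsub ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Xsub.T @ Xsub)  # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))             # clamp tiny negatives

rng = np.random.default_rng(0)
Xsub = rng.standard_normal((10, 180))   # d = 10 subspace dims, n = 180 samples
K = rbf_kernel(Xsub, gamma=0.5)         # (n, n) kernel matrix
Z = rng.standard_normal((180, 20))      # stand-in transformation, sigma = 20
V = Z.T @ K                             # subspace embedding V = Z^T K, (sigma, n)
print(K.shape, V.shape)
```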

#### 3.4.1. Feature Matching with Marginal Distribution

However, even after maximizing the subspace data variance, the distribution difference between the two domains can still be quite large. Therefore, the main problem is to minimize the distribution difference between them with an appropriate distance metric (or measure). Many distance measures (such as the Kullback–Leibler (KL) divergence) could be utilized to compute the distance between the domain samples. However, many of these are parameterized or require estimating an intermediate probability density [5]. Therefore, in this paper, we adopt a non-parametric distance estimate called Maximum Mean Discrepancy (MMD) [23] to compare the distribution difference in a Reproducing Kernel Hilbert Space (RKHS) [5]. MMD estimates the distance between the sample means of the two domains in the $\sigma$-dimensional embedding,

$$\underset{Z}{\mathrm{min}}\ {\Bigg\Vert \frac{1}{{n}_{s}}\sum_{i=1}^{{n}_{s}}{Z}^{T}{k}_{i}-\frac{1}{{n}_{t}}\sum_{j={n}_{s}+1}^{{n}_{s}+{n}_{t}}{Z}^{T}{k}_{j}\Bigg\Vert}^{2}=\mathrm{tr}({Z}^{T}K{M}^{d}{K}^{T}Z),\qquad (3)$$

where ${M}^{d}$ is the MMD matrix and can be determined as follows:

$${M}_{ij}^{d}=\begin{cases}\frac{1}{{n}_{s}^{2}}, & {k}_{i},{k}_{j}\in {\mathcal{D}}_{s}\\ \frac{1}{{n}_{t}^{2}}, & {k}_{i},{k}_{j}\in {\mathcal{D}}_{t}\\ \frac{-1}{{n}_{s}{n}_{t}}, & \text{otherwise.}\end{cases}\qquad (4)$$
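The MMD matrix ${M}^{d}$ has a simple rank-one structure, which the following NumPy sketch exploits; the sample counts are illustrative.

```python
import numpy as np

def mmd_matrix(ns, nt):
    """Marginal MMD matrix M^d: 1/ns^2 for source-source pairs, 1/nt^2 for
    target-target pairs, and -1/(ns*nt) for cross-domain pairs.
    Built as the outer product e e^T of the indicator vector e."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

M = mmd_matrix(ns=100, nt=80)
# tr(Z^T K M K^T Z) then measures the squared distance between domain means
print(M.shape)  # (180, 180)
```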

#### 3.4.2. Feature Matching with Conditional Distribution

Minimizing the marginal distribution difference does not guarantee that the conditional distribution difference between the source and target domains will also be minimized. However, for robust transfer learning, minimizing the difference between the conditional distributions ${Q}_{s}({y}_{s}|{x}_{s})$ and ${Q}_{t}({y}_{t}|{x}_{t})$ is required [24]. Reducing the conditional distribution difference is not trivial because there is no labeled data in the target domain, so we cannot model ${Q}_{t}({y}_{t}|{x}_{t})$ directly.

Long et al. [6] proposed the Joint Distribution Adaptation (JDA) method, which models ${Q}_{t}({y}_{t}|{x}_{t})$ by generating pseudo-labels for the target data. Initial pseudo-labels for the target data can be generated by training a classifier with the source domain subspace data ${\mathbb{X}}_{s}$ and labels ${\mathcal{Y}}_{s}$, and testing the classifier on the target domain subspace ${\mathbb{X}}_{t}$. With ${\mathcal{Y}}_{s}$, ${\mathbb{X}}_{s}$, and the pseudo-labels, the conditional distribution difference between the domains can be minimized by modifying MMD to estimate the distance between the class-conditional distributions ${Q}_{s}({x}_{s}|{y}_{s}=c)$ and ${Q}_{t}({x}_{t}|{y}_{t}=c)$ for each class $c\in \{1,\dots,C\}$:

$$\underset{Z}{\mathrm{min}}\ \sum_{c=1}^{C}{\Bigg\Vert \frac{1}{{n}_{s}^{c}}\sum_{{k}_{i}\in {\mathcal{D}}_{s}^{c}}{Z}^{T}{k}_{i}-\frac{1}{{n}_{t}^{c}}\sum_{{k}_{j}\in {\mathcal{D}}_{t}^{c}}{Z}^{T}{k}_{j}\Bigg\Vert}^{2}=\sum_{c=1}^{C}\mathrm{tr}({Z}^{T}K{M}^{c}{K}^{T}Z),\qquad (5)$$

where ${\mathcal{D}}_{s}^{c}=\{{k}_{i}:{k}_{i}\in {\mathcal{D}}_{s}\wedge y({k}_{i})=c\}$ is the set of samples belonging to the cth class in the source domain, $y({k}_{i})$ is the true label of ${k}_{i}$, and ${n}_{s}^{c}=|{\mathcal{D}}_{s}^{c}|$. Similarly, ${\mathcal{D}}_{t}^{c}=\{{k}_{j}:{k}_{j}\in {\mathcal{D}}_{t}\wedge \widehat{y}({k}_{j})=c\}$ is the set of samples belonging to the cth class in the target domain, $\widehat{y}({k}_{j})$ is the pseudo-label of ${k}_{j}$, and ${n}_{t}^{c}=|{\mathcal{D}}_{t}^{c}|$. Thus, the class-wise MMD matrix ${M}^{c}$ can be determined as follows:

$${M}_{ij}^{c}=\begin{cases}\frac{1}{{({n}_{s}^{c})}^{2}}, & {k}_{i},{k}_{j}\in {\mathcal{D}}_{s}^{c}\\ \frac{1}{{({n}_{t}^{c})}^{2}}, & {k}_{i},{k}_{j}\in {\mathcal{D}}_{t}^{c}\\ \frac{-1}{{n}_{s}^{c}{n}_{t}^{c}}, & {k}_{i}\in {\mathcal{D}}_{s}^{c},{k}_{j}\in {\mathcal{D}}_{t}^{c}\ \text{or}\ {k}_{j}\in {\mathcal{D}}_{s}^{c},{k}_{i}\in {\mathcal{D}}_{t}^{c}\\ 0, & \text{otherwise.}\end{cases}\qquad (6)$$
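A sketch of the class-wise MMD matrix ${M}^{c}$, built from source labels and target pseudo-labels as a JDA-style outer product; the toy labels below are illustrative only.

```python
import numpy as np

def class_mmd_matrix(ys, yt_pseudo, c):
    """Conditional MMD matrix M^c for class c: within-domain class-c pairs get
    1/ns_c^2 (source) or 1/nt_c^2 (target), cross-domain class-c pairs get
    -1/(ns_c*nt_c), and all other entries are zero."""
    src = (np.asarray(ys) == c).astype(float)
    tgt = (np.asarray(yt_pseudo) == c).astype(float)
    ns_c, nt_c = src.sum(), tgt.sum()
    e = np.concatenate([src / ns_c if ns_c else src,
                        -tgt / nt_c if nt_c else tgt])
    return np.outer(e, e)

ys = np.array([0, 0, 1, 1, 1])       # true source labels
yt = np.array([0, 1, 1])             # target pseudo-labels
M1 = class_mmd_matrix(ys, yt, c=1)   # (8, 8) matrix for class 1
print(M1.shape)
```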

By minimizing Equation (5) such that Equation (2) is maximized, the conditional distributions of the source and target domains are drawn close under the new representation $V={Z}^{T}K$. In each iteration, this representation V becomes more robust until convergence. As there is a difference in both the marginal and conditional distributions, some of the initial pseudo-labels of the target domain are incorrect. However, we can still take advantage of them and improve the performance of the target domain classifier iteratively.
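The pseudo-labeling step can be sketched with a plain 1-nearest-neighbor rule over the current embedding (the classifier choice, names, and toy data are illustrative; in the full method the pseudo-labels feed back into rebuilding ${M}^{c}$ each iteration).

```python
import numpy as np

def knn1_predict(Vs, ys, Vt):
    """1-NN pseudo-labeling: each target sample takes the label of its
    closest source sample in the learned embedding."""
    d2 = ((Vt[:, None, :] - Vs[None, :, :]) ** 2).sum(-1)  # (nt, ns) distances
    return ys[np.argmin(d2, axis=1)]

rng = np.random.default_rng(0)
Vs = rng.standard_normal((100, 20))                  # source rows of embedding V
ys = rng.integers(0, 3, 100)                         # source labels
Vt = Vs[:10] + 0.01 * rng.standard_normal((10, 20))  # targets near known sources

yt_pseudo = knn1_predict(Vs, ys, Vt)
print(yt_pseudo)  # recovers ys[:10], since each target sits beside its source
```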

#### 3.5. Instance Re-Weighting

However, matching features with marginal and conditional distributions is not sufficient for transfer learning, as it can only match first- and higher-order statistics. In particular, when the domain difference is significant enough, even in the feature learning, there will always be some source instances or samples that are not related to the target instance. In this condition, an instance re-weighting method with feature learning should also be included to deal with such a problem.

In this paper, we adopt a

${\mathcal{L}}_{2,1}$-norm structured sparsity regularizer as proposed in [

7]. This regularizer can introduce row-sparsity to the transformation matrix

Z. Because each entry of the matrix

Z corresponds to an example, row sparsity can substantially facilitate instance re-weighting. Thus, instance re-weighting regularizer can be defined as follows.

where ${Z}_{s}:={Z}_{1:{n}_{s}}$ is the part of the transformation matrix corresponding to the source samples, and ${Z}_{t}:={Z}_{{n}_{s}+1:{n}_{s}+{n}_{t}}$ is the part corresponding to the target samples. As the objective is to re-weight the source domain instances, we impose the ${\mathcal{L}}_{2,1}$-norm only on the source part. Thus, by minimizing Equation (7) such that Equation (2) is maximized, the source domain samples that are similar (or dissimilar) to the target domain are re-weighted with greater (or less) importance in the new learned space $V={Z}^{T}K$.
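The ${\mathcal{L}}_{2,1}$-norm and the diagonal subgradient matrix used later in the optimization can be sketched as follows (a smoothed form with a small `eps` to avoid division by zero; function names are our own):

```python
import numpy as np

def l21_norm(Zs):
    """||Z_s||_{2,1}: the sum of the Euclidean norms of the rows, so whole
    rows (i.e., whole source instances) can be driven to zero at once."""
    return np.sqrt((Zs ** 2).sum(axis=1)).sum()

def subgradient_G(Z, ns, eps=1e-12):
    """Diagonal subgradient matrix G for ||Z_s||_{2,1} + ||Z_t||_F^2:
    source rows get 1/(2*||z_i||), target rows get 1."""
    g = np.ones(Z.shape[0])
    row_norms = np.sqrt((Z[:ns] ** 2).sum(axis=1))
    g[:ns] = 1.0 / (2.0 * np.maximum(row_norms, eps))
    return np.diag(g)

Z = np.array([[3.0, 4.0],   # source row with norm 5
              [0.0, 0.0],   # source row already zeroed out
              [1.0, 1.0]])  # target row
print(l21_norm(Z[:2]))      # 5.0
G = subgradient_G(Z, ns=2)
```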

#### 3.6. Exploitation of Geometrical Structure with Laplacian Regularization

However, matching features and instance re-weighting are not enough to convey knowledge transfer by capturing the intrinsic structure of the source domain labeled samples and target domain unlabeled samples. In particular, labeled data samples of the source domain combined with unlabeled data samples of the target domain are used to construct a graph that sets the information of the neighborhood data samples. Here, the graph provides discrete approximations to the local geometry of the manifold data. With the help of the Laplacian regularization term

$\mathcal{L}$, the smooth penalty on the graph can be included. Basically, the term regularizer

$\mathcal{L}$ allows us to incorporate prior knowledge on certain domains, i.e., nearby samples are likely to share same class labels [

25].

Given the kernelized data matrix K, we can use an $nn$-nearest-neighbor graph to establish a relationship between nearby data samples. Specifically, we draw an edge between any two samples i and j if ${k}_{i}$ and ${k}_{j}$ are “close”, i.e., if ${k}_{i}$ and ${k}_{j}$ are among each other's $nn$ nearest neighbors. Thus, the similarity weight matrix $\mathcal{W}$ can be determined as follows:

$${\mathcal{W}}_{ij}=\begin{cases}1, & {k}_{i}\in {N}_{nn}({k}_{j})\ \text{or}\ {k}_{j}\in {N}_{nn}({k}_{i})\\ 0, & \text{otherwise,}\end{cases}\qquad (8)$$

where ${N}_{nn}({k}_{j})$ represents the set of $nn$ nearest neighbors of ${k}_{j}$.

Here, two data samples are connected by an edge if they are likely to belong to the same class. Thus, the regularization term $\mathcal{L}$ can be defined as follows:

$$\mathcal{L}=\sum_{i,j}{\mathcal{W}}_{ij}{\Vert {Z}^{T}{k}_{i}-{Z}^{T}{k}_{j}\Vert}^{2}=\mathrm{tr}({Z}^{T}KL{K}^{T}Z),\qquad (9)$$

where $\mathbb{D}$ is the diagonal degree matrix, i.e., ${\mathbb{D}}_{ii}={\sum}_{j}{\mathcal{W}}_{ij}$, and $L=\mathbb{D}-\mathcal{W}$ is the Laplacian matrix.
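The graph construction and Laplacian can be sketched in NumPy as follows, treating kernel values as similarities (larger means closer); the neighbor count and data sizes are illustrative.

```python
import numpy as np

def graph_laplacian(K, nn=2):
    """Build the binary nn-nearest-neighbor weight matrix W (symmetrized, so
    an edge exists if either endpoint selects the other) and return L = D - W."""
    n = K.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(-K[i])       # most similar first
        idx = idx[idx != i][:nn]      # nn nearest neighbors, excluding self
        W[i, idx] = 1.0
    W = np.maximum(W, W.T)            # symmetrize
    D = np.diag(W.sum(axis=1))        # diagonal degree matrix
    return D - W

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 30))                          # 10 samples, dim 30
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian kernel
L = graph_laplacian(K, nn=2)
print(L.shape)  # (10, 10)
```

A graph Laplacian always has zero row sums and is symmetric, which the penalty $\mathrm{tr}({Z}^{T}KL{K}^{T}Z)$ relies on.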

#### 3.7. Overall Objective Function

The objective of this work is to minimize the distribution difference between domains by jointly matching the features of both domains, re-weighting the source domain samples, and preserving the original similarity of both domains' samples. By incorporating Equations (3), (5), (7), and (9), the proposed objective function can be obtained as follows:

$$\underset{{Z}^{T}KH{K}^{T}Z=I}{\mathrm{min}}\ \mathrm{tr}\Big({Z}^{T}K\big({M}^{d}+\delta \sum_{c=1}^{C}{M}^{c}\big){K}^{T}Z\Big)+\eta\,\mathrm{tr}({Z}^{T}KL{K}^{T}Z)+\lambda \big({\Vert {Z}_{s}\Vert}_{2,1}+{\Vert {Z}_{t}\Vert}_{F}^{2}\big),\qquad (10)$$

where $\delta$ is a trade-off parameter that balances the marginal and conditional distributions [13], $\eta$ is the trade-off parameter that regularizes the Laplacian term, and $\lambda$ is the regularization parameter that trades off feature matching and instance re-weighting.

#### 3.8. Optimization

Using the Lagrange multiplier $\Phi$, Equation (10) can be written as follows:

$${L}_{f}=\mathrm{tr}\Big({Z}^{T}K\big({M}^{d}+\delta \sum_{c=1}^{C}{M}^{c}+\eta L\big){K}^{T}Z\Big)+\lambda \big({\Vert {Z}_{s}\Vert}_{2,1}+{\Vert {Z}_{t}\Vert}_{F}^{2}\big)+\mathrm{tr}\big((I-{Z}^{T}KH{K}^{T}Z)\Phi \big).\qquad (11)$$

To find an optimal value of the projection matrix Z, we take the partial derivative of ${L}_{f}$ with respect to Z and set it to zero. ${\Vert {Z}_{s}\Vert}_{2,1}$ is non-smooth at zero, and its partial derivative can be computed as $\frac{\partial ({\Vert {Z}_{s}\Vert}_{2,1}+{\Vert {Z}_{t}\Vert}_{F}^{2})}{\partial Z}=2GZ$, where G is a diagonal subgradient matrix whose $i$th diagonal element can be calculated as

$${G}_{ii}=\begin{cases}\frac{1}{2{\Vert {z}_{i}\Vert}}, & {k}_{i}\in {\mathcal{D}}_{s},\ {z}_{i}\ne 0\\ 0, & {k}_{i}\in {\mathcal{D}}_{s},\ {z}_{i}=0\\ 1, & {k}_{i}\in {\mathcal{D}}_{t}.\end{cases}$$

Setting the derivative to zero yields the generalized eigendecomposition problem

$$\Big(K\big({M}^{d}+\delta \sum_{c=1}^{C}{M}^{c}+\eta L\big){K}^{T}+\lambda G\Big)Z=KH{K}^{T}Z\Phi .\qquad (12)$$

As the problem in Equation (12) is a generalized eigendecomposition problem, we can solve it to find $\Phi =\mathrm{diag}({\varphi}_{1},\cdots,{\varphi}_{\sigma})$ (the $\sigma$ leading eigenvalues) and $Z=({z}_{1},\cdots,{z}_{\sigma})$ (the corresponding $\sigma$ leading eigenvectors). The pseudo-code of our proposed method is given in Algorithm 1.
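Solving a generalized eigenproblem of the form $A\,z=\varphi\,B\,z$ is a one-liner with SciPy. Below, `A` and `B` are random symmetric positive-definite stand-ins for the assembled matrices on the two sides of Equation (12), not the actual STJML matrices.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, sigma = 40, 5
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)   # stand-in for the left-hand side matrix
R = rng.standard_normal((n, n))
B = R @ R.T + n * np.eye(n)   # stand-in for K H K^T on the right-hand side

vals, vecs = eigh(A, B)       # generalized symmetric eigensolver: A v = w B v
Z = vecs[:, :sigma]           # sigma leading eigenvectors (smallest eigenvalues)
print(Z.shape)                # (40, 5)
```

`scipy.linalg.eigh` returns eigenvalues in ascending order, so taking the first `sigma` columns selects the minimizing directions.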

## 5. Conclusions and Future Work

In this paper, we proposed a novel Subspace based Transfer Joint Matching with Laplacian Regularization (STJML) method for efficiently transferring knowledge from the source domain to the target domain. Because of jointly optimizing all the inevitable components, the proposed STJML method is robust for reducing the distribution differences between both domains. Extensive experiments on several cross-domain image datasets suggest that the STJML method performs much better than state-of-the-art primitive and transfer learning methods.

In the future, there are several ways through which we can extend our proposed method STJML, and some of them are:

Firstly, we will extend the STJML method to multi-task learning environments [42], where multiple tasks may contain some labeled samples. Thus, by using the label information of all tasks, the generalization performance of all of them can be enhanced.

Secondly, the STJML method has many parameters, and conducting manual parameter sensitivity tests to find appropriate values is a hectic and time-consuming process. Furthermore, the STJML method uses the original features to find a common feature space; if the original features are themselves distorted, the STJML method will not yield a robust classifier. Therefore, in the future, we will use particle swarm optimization [43] to select an appropriate value for each parameter and a proper subset of good features across both domains. This will strengthen the STJML method's parameter selection, and its performance will also improve due to the elimination of distorted features.

Lastly, given the increasing interest in neural-network-based learning models [44] due to their outstanding performance, we will also extend the STJML method to a deep learning framework. In this deep variant, we will extract deep features in accordance with the overall objective function of our proposed method.