Domain Adaptation Based on Semi-Supervised Cross-Domain Mean Discriminative Analysis and Kernel Transfer Extreme Learning Machine

Good data feature representation and high precision classifiers are the key steps for pattern recognition. However, when the data distributions between testing samples and training samples do not match, the traditional feature extraction methods and classification models usually degrade. In this paper, we propose a domain adaptation approach to handle this problem. In our method, we first introduce cross-domain mean approximation (CDMA) into semi-supervised discriminative analysis (SDA) and design semi-supervised cross-domain mean discriminative analysis (SCDMDA) to extract shared features across domains. Secondly, a kernel extreme learning machine (KELM) is applied as a subsequent classifier for the classification task. Moreover, we design a cross-domain mean constraint term on the source domain into KELM and construct a kernel transfer extreme learning machine (KTELM) to further promote knowledge transfer. Finally, the experimental results from four real-world cross-domain visual datasets prove that the proposed method is more competitive than many other state-of-the-art methods.


Introduction
Traditional classification tasks deal with situations where the distribution of source domain samples and target domain samples are the same, and a classifier trained from source domain samples can be directly applied to the target domain samples. Theoretical studies on classifiers are also based on this assumption [1]. However, in real-world environments, the distribution of source domain samples and target domain samples is often different due to factors such as lighting, viewpoints, weather conditions, and cameras [2]. The most advanced classifiers trained on source domain samples may dramatically degrade when applied to target domain samples. One possible solution is to annotate the new data and retrain the model. Unfortunately, labeling a large number of samples is costly and time-consuming. Another promising solution is domain adaptation (DA), which aims to minimize the distribution gap between the two domains and learn a fairly accurate shared classifier for both domains, where the target domain samples do not have available labels.
One strategy of DA is to find invariant feature representation subspaces between domains to minimize distribution divergence. A large number of existing methods applying this strategy learn a shared feature subspace in which the distribution mismatch between the two domains is reduced, and then employ standard classification methods in this subspace [3]. Metrics that measure the distribution mismatch between domains include maximum mean discrepancy (MMD) [4], Kullback-Leibler (KL) divergence [5], Bregman divergence [6], and Wasserstein distance [7], among others. Because MMD is a nonparametric estimation criterion for distance, researchers have commonly built adaptive models with this metric; Pan et al. [8], for example, proposed a classical DA method of this kind. In this paper, we first design SCDMDA to extract shared features across domains, and then add a cross-domain mean constraint term into KELM to further enhance the knowledge transfer ability of our method, as shown in Figure 1. Finally, we investigate the performance of our approach on public domain adaptation datasets, and the experimental results show that it can learn better domain-invariant features with high accuracy and outperform other existing shallow and deep domain adaptation methods.


In this paper, we make the following contributions:

•
We introduce CDMA into SDA and then propose SCDMDA. It extracts shared discriminative features across domains by using CDMA to minimize the marginal and conditional discrepancies between domains and applying SDA to exploit the label information and original structure information.

•
We present KTELM by designing a cross-domain mean approximation constraint into KELM for classification in domain adaptation.

•
We obtain a classifier with the ability of knowledge transfer by combining SCDMDA and KTELM and implement a classification task on public image datasets. The results show the superiority of our approach.
The rest of this paper is as follows: In Section 2, we briefly describe SDA, ELM, KELM, and DA. Section 3 provides SCDMDA and KTELM algorithms. The experimental results and analysis are presented in Section 4. Finally, Section 5 is the conclusion of this paper.

Preliminary
In this section, we will briefly introduce SDA, ELM, KELM, and DA.

Semi-Supervised Discriminant Analysis (SDA)
As a classical semi-supervised feature extraction algorithm, SDA extends LDA by adding a geometrical regularizer to prevent overfitting. Given a sample set X = X_l ∪ X_u, it is divided into two parts: a labeled subset X_l and an unlabeled subset X_u. x_r is a labeled sample with label y_r, and x_v is an unlabeled sample. n_l and n_u are the numbers of labeled and unlabeled samples, respectively, and N = n_l + n_u. The dimension of each sample is d. The objective function of SDA can be written as

J_1(P) = max_P (P^T S_b P) / (P^T (S_w + α X Q X^T) P),    (1)

where P is the projection matrix and Q = D − G is the graph Laplacian matrix. G is the graph neighbor similarity matrix of X, with G_ij = 1 if x_i and x_j are among the p nearest neighbors of each other and G_ij = 0 otherwise, where N_p(x_i) denotes the set of p nearest neighbors of x_i. D is a diagonal matrix whose entries are the column (or row, since G is symmetric) sums of G, D_ii = ∑_j G_ij. α is the parameter balancing the graph regularization. S_b and S_w denote the between-class and within-class scatter matrices, respectively, calculated as

S_b = ∑_c N_c (m^(c) − m)(m^(c) − m)^T,    (2)
S_w = ∑_c ∑_i (x_i^(c) − m^(c))(x_i^(c) − m^(c))^T,    (3)

where x_i^(c) is the i-th sample belonging to class c, N_c is the number of samples in class c, m^(c) is the mean of the samples of class c, and m is the mean of all labeled samples.
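To make the construction above concrete, the following sketch (a minimal NumPy/SciPy illustration with hypothetical variable names, not the authors' code) builds S_b, S_w, and the graph Laplacian Q, and solves the generalized eigenproblem behind Equation (1):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def sda_projection(X_l, y, X_u, k=2, alpha=0.1, p=5):
    """Sketch of SDA: maximize P^T S_b P subject to P^T (S_w + alpha X^T Q X) P."""
    X = np.vstack([X_l, X_u])                     # (N, d), rows are samples
    # between-class (S_b) and within-class (S_w) scatter from labeled data
    m = X_l.mean(axis=0)
    d = X.shape[1]
    S_b, S_w = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(y):
        Xc = X_l[y == c]
        mc = Xc.mean(axis=0)
        S_b += len(Xc) * np.outer(mc - m, mc - m)
        S_w += (Xc - mc).T @ (Xc - mc)
    # p-nearest-neighbor graph G and Laplacian Q = D - G over all samples
    dist = cdist(X, X)
    G = np.zeros((len(X), len(X)))
    for i in range(len(X)):
        G[i, np.argsort(dist[i])[1:p + 1]] = 1    # skip self at index 0
    G = np.maximum(G, G.T)                        # symmetrize
    Q = np.diag(G.sum(axis=1)) - G
    # generalized eigenproblem: S_b v = lambda (S_w + alpha X^T Q X) v
    A = S_w + alpha * X.T @ Q @ X
    vals, vecs = eigh(S_b, A + 1e-6 * np.eye(d))  # jitter keeps A positive definite
    return vecs[:, np.argsort(vals)[::-1][:k]]    # top-k discriminative directions
```

The top eigenvectors maximize the between-class/within-class ratio while the Laplacian term keeps neighboring samples close after projection.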

Extreme Learning Machine (ELM) and Kernel Extreme Learning Machine (KELM)
As a high-performance classifier, ELM with a single hidden layer has a fast training process and high classification accuracy. Given a dataset {(x_i, y_i)}, i = 1, ..., N, with samples x_i and labels y_i, we can construct an ELM network with L hidden nodes as follows:

y_i = ∑_{j=1}^{L} W_j g(w_j · x_i + b_j),    (4)

where y_i is the output prediction for sample x_i, and W_j is the output weight connecting the hidden layer and the output layer. g(·) is an activation function that improves the nonlinear fitting ability of the network. w_j and b_j are the input weight and bias between the input layer and the hidden layer, which can be set randomly. It can be seen that W = [W_1, ..., W_L]^T is the key parameter of an ELM network. To obtain an optimal result, the following objective function needs to be solved:

J_2(W) = min_W ||W||^2 + λ||Y − HW||^2,    (5)

where H = [g(x_1), ..., g(x_N)]^T is the hidden-layer output matrix and Y = [y_1, ..., y_N]^T. ||W||^2 is a regularizer to avoid model overfitting, ||Y − HW||^2 is the training error, and λ is a tradeoff parameter between minimizing the training error and the regularizer.
We consider Equation (5) to be a regularized least squares problem and set ∂J_2/∂W = 0. The optimal solution of W is

W* = H^T (I/λ + H H^T)^{-1} Y.    (6)

For a testing sample x_Te with hidden-layer output h_Te = g(x_Te), the prediction is

f(x_Te) = sign(h_Te W*) for binary classification, f(x_Te) = argmax_c (h_Te W*)_c for multi-class classification.    (7)

If the activation function g(x) is unknown, we let K = H H^T with K_ij = K(x_i, x_j), where K(·,·) is a kernel function. The output of KELM can then be defined as:

f(x) = [K(x, x_1), ..., K(x, x_N)] (I/λ + K)^{-1} Y.    (8)
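The kernel formulation in Equation (8) can be sketched compactly as follows (an RBF kernel and one-hot labels are my assumptions for illustration; this is not the authors' implementation):

```python
import numpy as np

def kelm_fit_predict(X_tr, Y_tr, X_te, lam=1.0, gamma=0.5):
    """Sketch of KELM: solve (I/lam + K) beta = Y on the training set,
    then score test points with their kernel row. Y_tr is one-hot (N, C)."""
    def rbf(A, B):
        # squared Euclidean distances, then Gaussian kernel
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    K = rbf(X_tr, X_tr)                              # (N, N) kernel matrix
    beta = np.linalg.solve(np.eye(len(K)) / lam + K, Y_tr)
    scores = rbf(X_te, X_tr) @ beta                  # (M, C) network outputs
    return scores.argmax(axis=1)                     # multi-class decision rule
```

Because the closed form only needs one linear solve, training is fast even though the model is kernelized, which is why the paper can afford to retrain it at each pseudo-label iteration.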

Domain Adaptation (DA)
In practical application scenarios, it is difficult to collect enough training samples with the same distribution as the testing samples. For example, in photography, it is hard to avoid changes in pixel distribution caused by changes in lighting, posture, weather, and background. If a large number of samples with the same distribution are selected and manually marked for model training, the cost will be high [22]. The above distribution changes will generate a domain bias between the training dataset and the testing dataset, which could degrade the performance of traditional machine learning methods. DA can not only apply knowledge from other domains (source domains) that contain substantial labeled samples and are related to the target domain for model training, but also eliminate domain bias. Existing DA methods are mainly divided into instance-based adaptation, feature-based adaptation, and classifier-based adaptation.
The instance-based adaptation methods seek a strategy which selects "good" samples in the source domain to participate in model training and suppresses "bad" samples to prevent negative transfer. Kernel mean matching (KMM) [23] minimizes the maximum mean difference [24] and reweights the source data and target data in the reproducing kernel Hilbert space (RKHS), which can correct the inconsistent distribution between domains. As a classic instance-based adaptation method, transfer adaptive boosting (TrAdaBoost) [25] extends the AdaBoost algorithm to reweight source-labeled samples and target-labeled samples to match the distributions between domains. In [26], a simple two-stage algorithm was proposed to reweight the results of testing data from the training classifier using their signed distance to the domain separator. Moreover, it applied manifold regularization to propagate the labels of target instances with larger weights to those with smaller weights. Instance-based adaptation methods are efficient for knowledge transfer, but negative transfer [27] can easily occur when there are no shared samples between domains.
The feature-based adaptation methods attempt to learn a subspace with a better shared features representation, in which the marginal or conditional distribution divergences between domains are minimized to facilitate knowledge transfer. Transfer component analysis (TCA), joint distribution adaptation (JDA), balanced distribution adaptation (BDA), and multiple kernel variant of maximum mean discrepancy (MK-MMD) apply the maximum mean discrepancy (MMD) strategy to perform distribution matching, thus minimizing the marginal or conditional distribution divergences between domains [28]. In [29], Wang et al. use the locality preserving projection (LPP) to learn a joint subspace from the source and target domains.
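As a concrete illustration of the MMD strategy that these feature-based methods share, the sketch below (hypothetical helper name) builds the constant matrix M_0 used in TCA/JDA-style objectives, so that tr(X M_0 X^T) equals the squared distance between the two domain means when samples are the columns of X:

```python
import numpy as np

def mmd_matrix(ns, nt):
    """M0 with entries 1/ns^2, 1/nt^2, and -1/(ns*nt), so that
    tr(X M0 X^T) = || mean of source columns - mean of target columns ||^2."""
    e = np.concatenate([np.full(ns, 1.0 / ns),
                        np.full(nt, -1.0 / nt)])[:, None]
    return e @ e.T                     # (ns+nt, ns+nt) rank-one matrix
```

Minimizing tr(P^T X M_0 X^T P) over a projection P then drives the projected source and target means together, which is exactly the marginal-distribution matching these methods perform.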
In recent years, deep neural networks (DNNs) have performed well in high-level semantic feature extraction with strong expression ability, and have broad application prospects in domain adaptation. Domain adaptation networks (DAN) [30], faster domain adaptation networks (FDAN) [31], residual transfer networks (RTN) [32], domain adversarial neural networks (DANN) [33], and collaborative and adversarial networks (CAN) [34] use MMD to align distribution discrepancies in an end-to-end learning paradigm for feature extraction across domains. However, DNNs also have their shortcomings: the models contain a huge number of parameters and require a large amount of data for training, which makes them unsuitable for small-sample data. DNNs have a strong ability to fit data and can approximate any complex objective function, but the computational cost is high. Our method has a small number of parameters and a fast calculation speed, and it is suitable for small-sample data classification.
The goal of the classifier-based adaptation methods is to adjust the parameters of the classifier so that the classifier trained on the source domain performs well on the target task. Adaptive support vector machine (Adapt-SVM) [35] introduced a regularizer into SVM to minimize the classification error and the discrepancy between two classifiers trained on the source- and target-labeled samples. Based on LS-SVM, Tommasi et al. [36] presented an adaptation model called multi-model knowledge transfer (Multi-KT), in which the objective function imposed the requirement that the target classifier and a linear combination of the source classifiers be close to each other for knowledge transfer. Considering the simplicity and efficiency of ELM, many ELM-based adaptation models have been proposed for domain adaptation, such as joint domain adaptation semi-supervised extreme learning machine (JDA-S2ELM) [37], cross-domain ELM (CDELM) [38], domain space transfer ELM (DST-ELM) [21], and domain adaptation extreme learning machine (DAELM) [39].
In addition, adaptation models based on generative adversarial networks (adversarial-based adaptation) [34,40-42] have recently shown exceptional performance. Many target samples with transferable and domain-invariant features are produced by the generator during adversarial learning and then used to confuse the domain discriminator trained on the source domain, thereby optimizing the generator. This effectively reduces the distribution differences between domains and transfers knowledge efficiently. Since generating samples with domain-invariant features is its main task, adversarial-based adaptation is commonly attributed to feature-based adaptation.
In this paper, our method is a combination of feature- and classifier-based adaptation. SCDMDA is a shallow feature-based adaptation method, and KTELM is a classifier-based adaptation method. At the feature-based adaptation stage, SCDMDA finds domain-invariant features. It not only applies SDA to maximize between-class scatter, minimize within-class scatter, and keep the original structure information using pseudo-labeled target samples and labeled source samples, but also reduces the distribution differences between domains using CDMA. At the classifier-based adaptation stage, a cross-domain mean constraint term is added to KELM to further enhance its knowledge transfer ability.

Proposed Method
In a domain adaptation environment, we are given a source domain with labeled samples {X_S, Y_S} and a target domain with unlabeled samples X_T, where the distributions of X_S and X_T are not the same but are related, and the label space Y = {y_c}, c = 1, ..., C, is shared; the number of categories is denoted by C. To address the domain adaptation problem, our proposed method is divided into two stages: feature-based adaptation and classifier-based adaptation.
At the feature-based adaptation stage, we design SCDMDA, whose goal is to find an optimal transformation matrix P = (p_1, p_2, ..., p_k) ∈ R^{d×k} that embeds X_S and X_T into low-dimensional subspaces z_S = P^T x_S ∈ R^k and z_T = P^T x_T ∈ R^k, where k is the subspace dimension. In this subspace, the distributions of Z_S and Z_T are closer than those of X_S and X_T. At the classifier-based adaptation stage, KTELM is constructed by adding a cross-domain mean constraint term into KELM, which improves the knowledge transfer ability of KELM.
From Equation (1) in Section 2.1, we can see that although SDA has better feature extraction performance than LDA, it can produce poor feature representations when the training and testing samples have inconsistent data distributions. To address this issue, we reduce the distribution discrepancy between training and testing samples with the help of domain adaptation technology based on CDMA, expressed in Equation (9), where q ∈ R^n is the column vector with all elements equal to 1 and n = n_S + n_T. In Equation (9), CDMA adapts both the marginal distribution, in the first term, and the conditional distribution, in the second term.
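Since Equation (9) did not survive extraction here, the following is only a plausible sketch of a CDMA-style discrepancy, under my own reading (not the paper's exact definition) that each sample is compared with the mean of the opposite domain, both marginally and per pseudo-class:

```python
import numpy as np

def cdma_style_distance(Zs, ys, Zt, yt_pseudo):
    """Hypothetical CDMA-style measure: every sample is pulled toward the mean
    of the *other* domain. First term: marginal (whole-domain means);
    second term: conditional (per-class means via target pseudo-labels)."""
    marg = np.sum((Zs - Zt.mean(0)) ** 2) + np.sum((Zt - Zs.mean(0)) ** 2)
    cond = 0.0
    for c in np.unique(ys):
        Sc, Tc = Zs[ys == c], Zt[yt_pseudo == c]
        if len(Sc) and len(Tc):                   # skip empty pseudo-classes
            cond += np.sum((Sc - Tc.mean(0)) ** 2) + np.sum((Tc - Sc.mean(0)) ** 2)
    return float(marg + cond)
```

Unlike MMD, which only compares the two domain means with each other, a per-sample formulation like this also reflects how individual samples scatter around the cross-domain mean, which matches the paper's claim that CDMA "mines individual information".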

Semi-supervised Cross-Domain Mean Discriminative Analysis (SCDMDA)
To enhance the knowledge transfer ability of SDA, we combine Equations (9) and (1) and design the objective function of SCDMDA for shared feature extraction across domains, given in Equation (10). Furthermore, this formula is equivalent to Equation (11), where ||P||_F^2 with tradeoff parameter λ_1 is introduced into the objective function to ensure the sparsity of matrix P, and α and β balance the influence of the graph regularizer and LDA, respectively.
Setting ∂J_3/∂P = 0 turns Equation (11) into a generalized eigen-decomposition problem, Equation (12). Finally, Equation (12) is solved for the k smallest eigenvectors, which form the optimal adaptation matrix P that reduces the distribution difference between domains.
To handle nonlinear data, we transform Equations (11) and (12) into a high-dimensional space by the kernel mapping x → ϕ(x). The kernel matrix corresponding to the dataset X is K = ϕ^T(X)ϕ(X), and we obtain Equations (13) and (14), where S_w^ϕ and S_b^ϕ are the nonlinear forms of S_w and S_b under the kernel map, respectively, which can be obtained through Ref. [43]. The complete procedure of SCDMDA is summarized in Algorithm 1.
Step 1: According to Equations (1) and (9), construct Q, G, and M_0, and set S_w and S_b to 0.
Step 4: Project X_S and X_T by P into the k-dimensional subspace to obtain Z_S and Z_T.
Step 7: Let t ← t + 1.
Step 8: If t ≥ T_max or Y_T does not change, output P; otherwise, go to Step 3.
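The alternation between solving for P and refreshing target pseudo-labels can be sketched as follows (a simplified illustration with a stand-in `solve_projection` for the eigen-solver of Equation (12); the helper names are hypothetical, not the authors' code):

```python
import numpy as np

def nn1_predict(Ztr, ytr, Zte):
    """Minimal 1-NN classifier used for pseudo-labeling the target domain."""
    d2 = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(-1)
    return ytr[d2.argmin(axis=1)]

def scdmda_style_loop(Xs, ys, Xt, solve_projection, T_max=20):
    """Sketch of the iterative procedure: alternate between learning the
    projection and refreshing target pseudo-labels until they stabilize."""
    yt = None
    for _ in range(T_max):
        P = solve_projection(Xs, ys, Xt, yt)     # stand-in for Equation (12)
        Zs, Zt = Xs @ P, Xt @ P                  # Step 4: project both domains
        yt_new = nn1_predict(Zs, ys, Zt)         # refresh target pseudo-labels
        if yt is not None and np.array_equal(yt_new, yt):
            break                                # Step 8: labels stabilized
        yt = yt_new
    return P, yt
```

Each pass uses the current pseudo-labels to estimate the conditional terms, so label quality and subspace quality improve together until convergence or T_max iterations.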

Classifier-Based Adaptation of KTELM
After the feature adaptation of SCDMDA, we obtain P and compute Z_S = P^T X_S and Z_T = P^T X_T. As shown in Section 2.2, we can then train a KELM on {Z_S, Y_S} and predict Z_T. However, we can see from Equation (5) that KELM has no capacity for knowledge transfer. In this section, we therefore design a cross-domain mean constraint on the source domain, given in Equation (15), where the term denotes the mean of the samples of category c in H_T. We then add Equation (15) into Equation (5) and obtain Equation (16), where λ_2 and θ balance the influence of the training error and the cross-domain mean constraint on the source domain, respectively. Similar to ELM, we set ∂J_5/∂W = 0 and obtain the optimal solution in Equation (17). We kernelize Equation (17), letting K(x_i, x_j) be a kernel function, and the output of KTELM is then defined in Equation (18), analogously to Equation (8).
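Since Equations (15)-(17) are lost in this copy, the sketch below gives one plausible reading of a KTELM-style solve (my assumption, not the paper's exact formulation): ridge-regularized ELM output weights with an extra penalty pulling the per-class source mean responses toward the target mean responses.

```python
import numpy as np

def ktelm_style_weights(Hs, Ys, Ht, yt_pseudo, ys, lam=1.0, theta=0.1):
    """Hypothetical KTELM-style closed form: minimize
    ||W||^2 + lam ||Ys - Hs W||^2 + theta * sum_c ||(mean Hs_c - mean Ht_c) W||^2.
    Hs/Ht are hidden-layer outputs, Ys is one-hot, yt_pseudo are pseudo-labels."""
    L = Hs.shape[1]
    # per-class gaps between source and target mean hidden responses
    D = np.vstack([Hs[ys == c].mean(0) - Ht[yt_pseudo == c].mean(0)
                   for c in np.unique(ys)])
    A = np.eye(L) + lam * Hs.T @ Hs + theta * D.T @ D
    return np.linalg.solve(A, lam * Hs.T @ Ys)   # closed-form output weights
```

The added quadratic term only changes the matrix being inverted, so the transfer-aware classifier keeps the one-shot training cost that makes ELM attractive in the first place.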

Discussion
In this paper, our method is proposed for domain adaptation, which includes feature adaptation based on SCDMDA and classifier adaptation based on KTELM.
•
From Equations (11) and (18), it can be seen that, compared with SDA, SCDMDA adopts CDMA to reduce the distribution discrepancy between domains, which makes it better suited to domain adaptation than SDA. Moreover, as a semi-supervised feature extraction method, SCDMDA pays more attention to individual information with the help of the category separability of LDA and the original structure information of the graph regularizer.

•
Compared with MMD, CDMA reflects individual differences. In our method, CDMA mines individual information through M_0 and M_c, which is a more effective interdomain distribution difference measurement mechanism than MMD. In addition, we verify the improved method on the 1-NN, KELM, and KTELM classifiers.

•
In the classical ELM, solving the output weight W that connects the hidden layer and the output layer is the key step, and the optimal solution is obtained by solving Equation (5). However, for samples with interdomain distribution differences, the solution obtained from Equation (5) is not optimal. Adding domain adaptation to the ELM yields Equation (16), from which the optimal W* can be obtained. By adopting the cross-domain mean constraint on the source domain to achieve cross-domain knowledge transfer, the interdomain distribution difference can be reduced, which shows that KTELM achieves higher domain adaptation accuracy.

Experiments and Analysis
In this section, extensive experiments were conducted on four widely adopted datasets, including Office+Caltech [10] and Office-31 [44] for object recognition, USPS [45] and MNIST [46] for digital handwriting, and PIE [47] for face recognition, to test SCDMDA+KTELM. Table 1 gives descriptions of these datasets. All experiments were conducted on a PC with Win10, an Intel i5 10400F 2.9 GHz CPU, and 8 GB RAM, using MATLAB 2017b. We repeated each experiment 20 times and recorded the average results.

Dataset Description
Office + Caltech: This is a visual domain adaptation benchmark dataset, which contains two sub-datasets, Office and Caltech (C).
Office-31: This is a public dataset commonly used to investigate domain adaptation algorithms, which includes three distinct domains: amazon, webcam, and dslr. It contains 4652 images in 31 categories. In our experiments, we carried out the classification task in a 2048-dimensional ResNet-50 feature space and performed six groups of experiments.
USPS+MNIST: USPS and MNIST are two closely correlated, yet differently distributed, handwritten-digit datasets sharing the 10 numeric categories 0-9 (as shown in Figure 3b,c). The USPS dataset contains 7291 training samples and 2007 test samples, each with 16 × 16 pixels. The MNIST database contains 60,000 training images and 10,000 test images with 28 × 28 pixels. In this experiment, 1800 pictures from USPS and 2000 pictures from MNIST were selected to construct the source and target domain datasets. All pictures were converted into grayscale images with 16 × 16 pixels. We constructed two recognition tasks: USPS as the source domain and MNIST as the target domain (USPS vs. MNIST), and vice versa (MNIST vs. USPS).

Experiment Setting
To evaluate the performance of our algorithm, we compared our approach with several state-of-the-art domain adaptation approaches categorized into three classes: the traditional methods, e.g., 1-NN [48], KELM, and SDA; the shallow feature adaptation methods, e.g., GFK, JDA, STDA-CMC [49], W-JDA, and JDA-CDMAW [50]; and the deep transfer methods, e.g., DAN, DANN, and CAN. Our method is evaluated in three configurations: SCDMDA0 denotes SCDMDA+1-NN, SCDMDA1 denotes SCDMDA+KELM, and SCDMDA2 denotes SCDMDA+KTELM. Apart from KELM, SCDMDA1, and SCDMDA2, all other algorithms use 1-NN as the base classifier. We applied classification accuracy as the evaluation metric for each algorithm. To achieve optimal functionality, the parameters of SCDMDA and KTELM were tuned for each task.
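The accuracy metric used throughout the experiments is simply the fraction of target samples predicted correctly, which can be written as:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Classification accuracy: |{x : f(x) = y(x)}| / |target test set|."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))
```

Every method in Tables 2 and 3 is scored with this same quantity, so the comparison is unaffected by class-imbalance weighting.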

Results and Analysis
We applied our method and the compared algorithms to classification experiments on the Office+Caltech, USPS+MNIST, PIE, and Office-31 datasets. The results are presented in Tables 2 and 3. From Table 2, the following conclusions can be drawn. The semi-supervised feature extraction method SCDMDA0 works better than SWTDA-CMC; an explanation is that CDMA is a better distribution discrepancy measurement criterion than MMD, so our method works quite well in reducing domain bias. SCDMDA0 and SWTDA-CMC are better than JDA, JDA-CDMAW, and W-JDA, which illustrates that the category separability of LDA and the original structure information of the graph regularizer are important for the classification task.
In addition, the results on the Office-31 dataset with ResNet-50 features are shown in Table 3. We compare the classification results obtained by 1-NN, KELM, JDA, STDA-CMC, JDA-CDMAW, and SCDMDA (0-2) with those of deep domain adaptation methods such as DAN, DANN, and CAN to verify the effectiveness of our approach on this more complex dataset. SCDMDA2 outperforms the other methods on all six groups of domain adaptation tasks. In particular, for the amazon vs. dslr task, it achieves the highest accuracy (91.16%). This further validates that our method, equipped with deep generic features, can further reduce the cross-domain distribution discrepancy and achieve the best adaptation performance, demonstrating its potential.
Table 4 shows the running-time comparisons of SCDMDA (0-2) with 1-NN, KELM, JDA, STDA-CMC, and JDA-CDMAW on the PIE1 vs. PIE2 dataset. The following can be seen: (1) Because JDA, STDA-CMC, JDA-CDMAW, and SCDMDA (0-2) require 20 iterations for label refinement, these approaches consume more time than 1-NN and KELM. (2) JDA-CDMAW needs less running time than JDA because it does not require the construction of the interdomain distribution divergence matrix.
(3) SCDMDA (0-2) and STDA-CMC take more time than JDA-CDMAW because they must compute the within- and between-class scatter matrices and the graph Laplacian matrix. (4) SCDMDA2 requires more time than SCDMDA1 to compute the cross-domain mean constraint term for domain adaptation, while SCDMDA1 needs less time than SCDMDA0 because KELM is faster than 1-NN.

Sensitivity Analysis
In this section, we conducted experiments to investigate the influence of several factors on the accuracy of SCDMDA2: the subspace dimension, the graph tradeoff parameter α, the SDA tradeoff parameter β, and the sparse regularization parameter λ_1 in SCDMDA; and the cross-domain mean constraint control parameter θ and the model regularization tradeoff parameter λ_2 of KTELM. We also observed the convergence of the designed algorithm. The experimental analysis was carried out on four cross-domain tasks, A vs. D, MNIST vs. USPS, PIE1 vs. PIE2, and amazon vs. dslr, and the results are shown in Figure 4.
In Figure 4a, we observe that the accuracy of SCDMDA2 grows with the increasing number of iterations on the four datasets and gradually stabilizes after the 8th iteration, verifying its good convergence. Figure 4b illustrates the experiments with the subspace dimension k ∈ {20, 40, ..., 200} on three datasets, which indicate the impact of the subspace dimension on our method. It can be observed that SCDMDA2 is robust for k ∈ [80, 120]. We did not perform this experiment on the amazon vs. dslr dataset, because the Office-31 dataset with ResNet-50 features is robust with regard to different dimensions for feature transfer learning. We then ran SCDMDA2 with the other parameters fixed and α ∈ [10^-6, 10^1]. From Figure 4c, we can see that the accuracy first rises and then falls as α increases, reaching optimal performance at α = 10^-2 on all four datasets. When α is too small, the original structure information cannot be exploited; on the contrary, when it is too large, the other terms in Equation (13) stop working and hurt the proposed algorithm.
We ran SCDMDA with KTELM on the A vs. D, MNIST vs. USPS, PIE1 vs. PIE2, and amazon vs. dslr datasets with the optimal parameter settings, and computed the CDMA distance (shown in Figure 5) according to its definition. We can see that as the iteration number increases, SCDMDA reduces the between-domain marginal and conditional distribution differences, leading to a smaller CDMA distance and higher accuracy. Therefore, SCDMDA can extract better shared features across domains.


Visualization Analysis
We further display the t-distributed stochastic neighbor embedding (t-SNE) visualization plots of the feature embeddings for the PIE1 vs. PIE2 classification task. t-SNE, as a dimension-reduction method, transforms high-dimensional data into a low-dimensional space for visualization. In t-SNE visualization plots, the closer the clusters of the same color and the more distant the clusters of different colors, the more discriminative the extracted features. The visualized source and target features derived from raw data, JDA, STDA-CMC, and SCDMDA are drawn in the feature scatterplots of Figure 6a-h, in which each color represents one of the 68 categories.
As we can see from Figure 6a,b, the original data of the source and target domains before adaptation are highly mixed and difficult to distinguish; a classifier trained on the source domain would therefore perform poorly. It can be seen from Figure 6c-f that JDA and STDA-CMC can unite samples of the same category and partly separate samples of different categories. Compared with JDA and STDA-CMC in Figure 6d,f, only a small number of easily misclassified samples of SCDMDA can be observed in the boxes of Figure 6h, showing that SCDMDA can effectively obtain a more discriminative and domain-invariant feature representation.
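A typical way to produce such plots is sketched below (a generic scikit-learn illustration, not the authors' script; the perplexity value is an illustrative choice):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embed(Zs, Zt, random_state=0):
    """Embed stacked source/target features into 2-D for scatter plotting;
    the first len(Zs) rows of the result belong to the source domain."""
    Z = np.vstack([Zs, Zt])
    emb = TSNE(n_components=2, perplexity=5, init="pca",
               random_state=random_state).fit_transform(Z)
    return emb[:len(Zs)], emb[len(Zs):]
```

Plotting the two returned arrays with per-class colors (e.g., one color per PIE identity) reproduces the kind of scatterplot shown in Figure 6.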


Conclusions
In this paper, we develop a semi-supervised domain adaptation approach which contains feature-based adaptation and classifier-based adaptation. At the feature-based adaptation stage, we introduce CDMA into SDA to reduce the cross-domain distribution discrepancy and develop SCDMDA for domain-invariant feature extraction. At the classifier-based adaptation stage, we select KELM as the baseline classifier. To further enhance knowledge transfer ability of the proposed approach, we design a cross-domain mean constraint term and add it to KELM to construct KTELM for domain adaptation. Through feature and classifier adaptation, we can learn more discriminatory feature representations and obtain higher accuracy. Comprehensive experiments on several visual cross-domain datasets show that SCDMDA with KTELM significantly outperforms many other state-of-the-art shallow and deep domain adaptation methods.
In the future, we plan to explore the interdomain metric criterion in more depth. At present, CDMA is superior to MMD in reducing cross-domain distribution differences; however, its computational complexity is greater than that of MMD. To address this problem, better metrics could be developed that reduce the distribution differences between domains at a lower computational cost.

Data Availability Statement: The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.