Deep Transfer Learning Method Based on Automatic Domain Alignment and Moment Matching

Domain discrepancy is a key research problem in the field of deep domain adaptation. Two main strategies are used to reduce the discrepancy: the parametric method and the nonparametric method. Both methods have achieved good results in practical applications. However, whether combining the two can further reduce domain discrepancy has not been investigated. Therefore, in this paper, a deep transfer learning method based on automatic domain alignment and moment matching (DA-MM) is proposed. First, an automatic domain alignment layer is embedded in front of each domain-specific layer of a neural network structure to preliminarily align the source and target domains. Then, a moment matching measure (such as the MMD distance) is added between the domain-specific layers to map the source and target domain features output by the alignment layers into a common reproducing kernel Hilbert space. The results of an extensive experimental analysis over several public benchmarks show that DA-MM can reduce the distribution discrepancy between the two domains and improve domain adaptation performance.


Introduction
Deep neural networks have achieved substantial success in all aspects of machine learning applications [1][2][3]. However, the training and testing data for these networks may not have the same distribution. In addition, obtaining sufficient labeled target data is difficult [4]. Transfer learning [5,6] attempts to build effective classifiers that can be used in the target domain by using the labeled data of the source domain that obey different but related distributions. In such cases, the source and target data are obtained from similar but not identical domains and usually follow different distributions. Therefore, reducing the distribution differences between the source and target domains has become the main obstacle [7][8][9][10][11].
Recently, research based on deep learning has achieved remarkable results in different fields [12][13][14][15]. In most studies, the discrepancy between the source domain and target domain is reduced by learning domain-invariant features, using two main strategies. One strategy is based on minimizing a domain adaptation loss, which contains hyperparameters for a regularization term, such as the minimization of the maximum mean discrepancy (MMD) [16][17][18] or the domain-confusion loss [19,20]. The common purpose of this strategy is to obtain more domain-invariant representations. We call this the parametric approach.
The other strategy uses a nonparametric method [21][22][23][24] with the goal of reducing the domain shift between the two domains by designing a specific normalized distribution layer, such as adaptive batch normalization (AdaBN) [25] and domain alignment layers (DA layers) [26]. This strategy endows these layers with the ability to automatically learn the degree of alignment at different layers of the network without introducing additional loss terms (e.g., MMD or domain confusion), or their related hyperparameters, into the optimization function. In contrast, most parametric methods require additional optimization steps and hyperparameters in order to establish a connection between the training and testing domains. For example, the hyperparameter that controls the tradeoff between the supervised learning loss term of the source domain and the additional loss term (e.g., MMD, JMMD, B-JMMD, domain confusion) needs to be fine-tuned, and some well-designed additional loss terms (B-JMMD, BDA) even require hyperparameters with specific functions, for example, to balance the marginal and conditional distribution adaptations [27]. Such an additional computational burden can greatly complicate the training of a deep neural network, which is unappealing to most researchers.
The advantage of the nonparametric method is that it does not require prior knowledge (of which layers need to be aligned), and there are no hyperparameters that need to be fine-tuned [25,26]. Distributions can be aligned automatically, as in the BN layer [25]. In addition, AutoDial [26] can learn domain-invariant features without introducing additional loss terms (e.g., MMD or domain confusion) into the optimization function or any associated hyperparameters. However, even in powerful deep learning models, the problem of domain shift can be alleviated but not eliminated, which raises the question: does adding additional loss terms such as the MMD after the initial DA-layer alignment further improve the alignment?
The contribution of this work can be summarized from the following aspects: (a) The performance of the MMD parametric method can be significantly improved by embedding an automatic domain alignment layer (DA layer) before each domain-specific layer in a deep neural network; the MMD term benefits from the preliminary alignment performed by the DA layer between the source and target domains. (b) For the automatic alignment method, although AutoDial causes the model to learn domain-invariant features without introducing additional loss terms (e.g., MMD or domain confusion) into the optimization function and the associated hyperparameters, there is still room for improvement: incorporating the MMD can further improve the degree of alignment and achieve better performance. (c) Compared with the automatic alignment method, our method needs to add DA layers to only a small number of domain-specific layers, which greatly reduces the number of added layers while improving performance. (d) We conducted numerous comparative experiments to demonstrate the effectiveness of the proposed method.

Related Work
In this section, previous research on deep transfer learning and deep domain adaptation is discussed, and relevant differences between these approaches and our proposed method are identified. One of the main problems in deep transfer learning is reducing the discrepancy between the distributions of the source domain and target domain through two main strategies.
The first strategy is based on a parametric approach, L_s(X_s, Y_s) + λ·L_DA(X_s, Y_s, X_t), where X_s and Y_s represent the labeled source data and X_t represents the target data. L_s(X_s, Y_s) is the source domain loss applied to the source samples, while L_DA(X_s, Y_s, X_t) is the domain adaptation loss applied to the target samples, and λ is the regularization coefficient of the target domain predictor.
(a) Feature alignment based on MMD: L_s + λ·L_MMD, where L_s represents the source domain loss applied to the source samples and L_MMD represents the MMD loss applied across domains. The minimization of the MMD [16][17][18] can be described as follows: the source and target data are projected into a common subspace, and the distributions of the source and target representations are made as similar as possible by minimizing the mean embedding distance between the two domains. The deep domain confusion (DDC) [28] approach introduces one adaptation layer into AlexNet [29] that uses a linear-kernel MMD and an additional domain-confusion loss, causing the model to learn domain-invariant representations. DDC uses the classical MMD loss to regularize the representation in the final CNN layer. To improve the effectiveness of adaptation, DAN [16] matches the mean embeddings of the marginal distributions by introducing the multi-kernel MMD in the corresponding domain-specific layers; DAN thus extends the method to multi-kernel MMD and multi-layer adaptation.
RevGrad [20] proposed a gradient reversal layer to compensate for the domain-specific back-propagated gradients. Bousmalis et al. [22] devised a domain separation network that can extract better domain-invariant features by explicitly modeling the private and shared components of the domain representations in a network. Different from the previous deep transfer methods, JMMD approximates the shift of joint distributions after the network activations in the second-order tensor product Hilbert space [30]. However, it is unclear how to determine which components of the representations support the reasoning about the original joint distributions. B-JMMD adaptively utilizes the importance of marginal and conditional distributions behind multiple domain-specific layers across domains and realizes the adaptive effect of a balanced distribution of deep network architectures [27]. At the same time, however, a balance factor is introduced.
(b) Adversarial-based deep transfer learning: L_s + λ·L_adv, where L_adv represents the adversarial loss applied to the target samples. This strategy [19,20] relies on a domain-confusion loss, obtained by training an auxiliary classifier to predict whether a sample comes from the source domain or the target domain. Intuitively, domain-invariant features can be obtained by maximizing this term (i.e., features for which the auxiliary classifier performs poorly are encouraged). Researchers have attempted to minimize the domain classification loss by making the feature distributions of the two domains as indistinguishable as possible [31]. This approach assumes that, to transfer effectively, a good representation should not discriminate between the source domain and the target domain but should be discriminative for the main learning tasks.
(c) Embedding domain adaptation modules into network-based deep transfer learning: L_s + λ·L_others [16], where L_others represents another loss applied to the target samples. This approach reduces the domain discrepancy by embedding domain adaptation modules into deep networks [28] and jointly learns adaptive classifiers and transferable features from the labeled data in the source domain and the unlabeled data in the target domain. The model explicitly learns a residual function with reference to the target classifier by inserting several additional layers into the deep network. Bousmalis et al. [22] learned domain adaptation and deep hash features simultaneously using a DNN.
A second strategy for unsupervised domain adaptation is the nonparametric approach.
Recently, researchers have begun to investigate alternative directions [21][22][23][24], such as reducing the domain shift by introducing specific distribution normalization layers. Inspired by the popular batch normalization (BN) technique [23], a simple nonparametric approach called AdaBN [25] was used to modify the Inception-BN network. AdaBN aligns the learned source and target representations by using the separate mean/variance terms of the source and target domains when performing BN at inference time. Inspired by AdaBN [25], AutoDial [26] introduces novel domain alignment layers (DA layers) embedded at different levels of a deep architecture. Whereas all previous deep domain adaptation methods determine in advance which layers should be adapted, AutoDial endows the DA layers with the ability to automatically learn the alignment degree. This nonparametric approach causes the model to learn domain-invariant features without introducing additional loss terms (e.g., MMD or domain confusion) into the optimization function or associated hyperparameters.
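The AdaBN idea above can be sketched in a few lines (a minimal NumPy sketch, not the authors' implementation; the function name and interface are hypothetical):

```python
import numpy as np

def adabn_inference(x_t, gamma, beta, eps=1e-5):
    """Sketch of AdaBN-style inference: re-estimate the batch-norm
    statistics on the *target* data instead of reusing the source
    statistics, while keeping the learned scale/shift (gamma, beta)."""
    mu_t = x_t.mean(axis=0)   # target-domain per-feature mean
    var_t = x_t.var(axis=0)   # target-domain per-feature variance
    return gamma * (x_t - mu_t) / np.sqrt(var_t + eps) + beta
```

Because only the normalization statistics change, no extra loss term or hyperparameter tuning is involved, which is the appeal of the nonparametric strategy.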

Problem Definition
In this paper, we focus on unsupervised domain adaptation, in which we are given a labeled source domain and an unlabeled target domain. Generally, in machine learning problems, we assume the feature space χ_s = χ_t and the label space y_s = y_t, but in transfer learning, the marginal distributions differ, P_s(x_s) ≠ P_t(x_t), as do the conditional distributions, P_s(y_s|x_s) ≠ P_t(y_t|x_t). Transfer learning aims to obtain the labels of the target domain through knowledge learned in the source domain, and domain adaptation solves this cross-domain transfer problem by reducing the difference between the two distributions.

Domain Alignment Layers (DA Layers)
The idea behind the AdaBN [25] algorithm is to align the source and target domain distributions independently to a standard normal distribution by using a certain method. In this process, the target samples do not affect the predictor network parameters. Due to insufficient target domain samples, the domain adaptability of the network structure still has some deficiencies. Compared with the AdaBN algorithm, the DA-layer approach considers the roles of the target domain samples and capitalizes on them. Specifically, DA layers [26] introduce a coupling parameter to mix the source and target domain samples and a cross-domain bias. The coupling parameter and the cross-domain bias jointly influence the model network parameters so that the DA layers of the network branches where the source and target domain predictors are located correspond with each other [26].
Generally, the inputs of the DA layers in the two predictors are represented by x^s and x^t, and the corresponding distributions are expressed as q_s and q_t. The coupling parameter is denoted by δ. The samples of the source and target domains follow independent distributions in the first through sixth layers. In the seventh and eighth layers, the samples of the target domain become involved in generating the predictors through the coupling parameter δ, which blurs the boundary between the source and target domains and thereby reduces domain discrepancies [26]. The outputs of the DA layers of the source and target domain predictors are given by Formulas (1) and (2), respectively:

x̂^s = (x^s − µ_st,δ) / √(σ²_st,δ + ε),   (1)

x̂^t = (x^t − µ_ts,δ) / √(σ²_ts,δ + ε),   (2)

To avoid a zero denominator in the case of zero variance, we introduce a small constant ε > 0. Here, µ_st,δ and σ²_st,δ represent the expectation and variance of x ∼ q^st_δ, respectively; similarly, µ_ts,δ and σ²_ts,δ represent the expectation and variance of x ∼ q^ts_δ. The distributions q^st_δ and q^ts_δ represent cross-domain mixtures of the source and target domains and are denoted by Formulas (3) and (4) as follows:

q^st_δ = δ q_s + (1 − δ) q_t,   (3)

q^ts_δ = δ q_t + (1 − δ) q_s,   (4)

where δ ∈ [0.5, 1]. When the coupling coefficient δ = 0.5, q^st_δ = q^ts_δ; that is, complete coupling is achieved, the DA layers compute the same function for both predictors, and no domain alignment is generated. When δ = 1, an independent alignment transformation is performed on the two domains, which is equivalent to AdaBN [25], and the DA layers may compute different functions for the two predictors; that is, the source and target domains are completely aligned. The coupling coefficient δ thus interpolates between the independent alignment transformation of the two domains and a fully coupled state. It should be noted that δ is acquired during training and can automatically adjust the degree of alignment between specific domains without requiring manual parameter adjustment.
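To make Formulas (1)–(4) concrete, the forward pass of a DA layer can be sketched as follows (a NumPy sketch under our reading of [26]; the function name and the use of per-feature batch statistics are assumptions):

```python
import numpy as np

def da_layer_forward(x_s, x_t, delta, eps=1e-5):
    """Hypothetical DA-layer forward pass for one mini-batch per domain."""
    # Mixture means of q_st (Formula (3)) and q_ts (Formula (4)).
    mu_st = delta * x_s.mean(0) + (1 - delta) * x_t.mean(0)
    mu_ts = delta * x_t.mean(0) + (1 - delta) * x_s.mean(0)
    # Mixture variances via the mixtures' second moments.
    m2_st = delta * (x_s ** 2).mean(0) + (1 - delta) * (x_t ** 2).mean(0)
    m2_ts = delta * (x_t ** 2).mean(0) + (1 - delta) * (x_s ** 2).mean(0)
    var_st = m2_st - mu_st ** 2
    var_ts = m2_ts - mu_ts ** 2
    # Formulas (1) and (2): normalize each branch with its mixed statistics.
    xs_hat = (x_s - mu_st) / np.sqrt(var_st + eps)
    xt_hat = (x_t - mu_ts) / np.sqrt(var_ts + eps)
    return xs_hat, xt_hat
```

With delta = 1 the two branches are normalized independently (the AdaBN case); with delta = 0.5 both branches share identical statistics (full coupling).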

Loss Function
The DA layers [26] fully consider the influence of the target domain and can make full use of the unique components of each domain during alignment. The network parameters are constrained by both the source and target domain functions. During training, under the constraints of the two functions and through continuous learning, the network ultimately learns the most suitable parameters to construct the optimal source and target domain predictors. This approach effectively utilizes the unlabeled samples of the target domain to better separate the samples that represent different categories. The predictor for the source domain network branch is measured by the SoftMax loss function, expressed by L_s(φ) according to Formula (5):

L_s(φ) = −(1/n_s) Σ_{i=1}^{n_s} log f_φ^s(y_i^s; x_i^s),   (5)

where f_φ^s(y_i^s; x_i^s) is the probability that sample point x_i^s takes label y_i^s according to the source predictor. The predictor for the target domain network branch is measured by the entropy loss function, expressed by L_t(φ), as shown in Formula (6):

L_t(φ) = −(1/n_t) Σ_{i=1}^{n_t} Σ_y f_φ^t(y; x_i^t) log f_φ^t(y; x_i^t),   (6)

where f_φ^t(y; x_i^t) represents the probability that sample point x_i^t takes label y according to the target predictor. The loss function of the complete network structure is a weighted combination of Formulas (5) and (6).
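The two losses can be sketched numerically as follows (a minimal NumPy sketch assuming the predictors already output class probabilities; the function names are ours):

```python
import numpy as np

def source_log_loss(probs_s, labels_s):
    """Formula (5): negative log-likelihood of the true source labels."""
    picked = probs_s[np.arange(len(labels_s)), labels_s]
    return -np.mean(np.log(picked + 1e-12))

def target_entropy_loss(probs_t):
    """Formula (6): entropy of the target predictions; minimizing it
    pushes the target predictor toward confident decisions."""
    return -np.mean(np.sum(probs_t * np.log(probs_t + 1e-12), axis=1))
```

Note that the target loss needs no labels: it only penalizes uncertain predictions, which is what lets the unlabeled target samples shape the predictor.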

MMD Metric
The MMD [16], which measures the distance between two distributions in a reproducing kernel Hilbert space H, is a kernel learning method. If P and Q are used as the inputs to the MMD distance metric [16], then the distance over the domain-specific layers can be estimated according to Formula (7) as follows:

D_L(P, Q) = Σ_{l ∈ L} D_H(P, Q),   (7)

where l ∈ L indexes the domain-specific layers, and D_H(P, Q) measures the cross-domain distribution discrepancy at layer l of the deep neural network, which is estimated by Formula (8):

D_H(P, Q) = (1/n_s²) Σ_{i,j} k(x_i^sl, x_j^sl) + (1/n_t²) Σ_{i,j} k(x_i^tl, x_j^tl) − (2/(n_s n_t)) Σ_{i,j} k(x_i^sl, x_j^tl),   (8)

where k(x_i^*, x_j^*) is a Hilbert space mapping in the form of an inner product, used to map the original variables into a high-dimensional space, x_i^sl denotes the activation generated by the source domain in layer l of the neural network, and x_j^tl denotes the activation generated by the target domain in layer l. In the AlexNet [29] network structure, the last three layers are the domain-specific layers; in the GoogLeNet [32] network structure, an inner product layer acts as the domain-specific layer.
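For illustration, a biased empirical estimate of the squared MMD in Formula (8) with an RBF kernel can be written as follows (a sketch; the kernel choice and bandwidth are illustrative, not the paper's exact configuration):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x_s, x_t, gamma=1.0):
    """Biased estimate of the squared MMD between two samples."""
    k_ss = rbf_kernel(x_s, x_s, gamma).mean()
    k_tt = rbf_kernel(x_t, x_t, gamma).mean()
    k_st = rbf_kernel(x_s, x_t, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st
```

The estimate is near zero when the two samples come from the same distribution and grows as the domains drift apart, which is why it can serve as a differentiable alignment penalty.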

Algorithm Design
In this paper, we propose a deep transfer learning method based on automatic domain alignment and moment matching. First, DA layers are embedded in front of each domain-specific layer of the neural network structure for preliminary alignment of the source and target domains. Then, an MMD term is added between every two domain-specific layers to map the source and target domain features output by the DA layers into a common reproducing kernel Hilbert space (RKHS), thereby further reducing the distribution discrepancy between the two domains. A deep transfer network architecture based on automatic domain alignment and moment matching is shown in Figure 1. Domain-based automatic alignment and moment matching can also be applied to deeper network structures, such as ResNet [33], VGGNet [34], or GoogLeNet [32]. The total loss function of the structure is determined according to Formula (9):

L(φ) = L_s(φ) + λ L_t(φ) + γ MMD(P, Q),   (9)

where the term L_s(φ) is the standard log-loss applied to the source samples, while L_t(φ) is the entropy loss applied to the target samples. MMD is the first-order MMD, and P and Q are the outputs of the DA layers that are fed into the MMD distance layer. Here, λ and γ are the regularization coefficients of the target domain predictor and the MMD distance, respectively. The specific flow of the algorithm is shown in Algorithm 1.

Algorithm 1:
Input: Labeled source data X_s, Y_s and target data X_t; λ, the regularization coefficient of the target predictor; γ, the regularization coefficient of the MMD distance.
Output: Test accuracy; test loss.
(1) Set i = 0. Train a baseline neural network with X_s, Y_s and test with X_t.
(2) For each iteration i, learn the coupling coefficient δ via Formulas (3) and (4) and calculate the outputs of the DA layers embedded in front of the 7th and 8th layers using Formulas (1) and (2).
(3) Learn the source domain and target domain predictors from Formulas (5) and (6) and fine-tune the parameter λ to achieve the best classification results.
(4) Fine-tune the parameter γ of the MMD between the domain-specific layers to achieve the best alignment effect.
(5) Return accuracy-test, loss-test.
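How the three terms of Formula (9) combine can be sketched in one line (the function name is ours; the defaults follow the experimental settings reported later):

```python
def da_mm_total_loss(loss_s, loss_t, loss_mmd, lam=0.1, gamma=1.0):
    """Formula (9): L(phi) = L_s(phi) + lambda * L_t(phi) + gamma * MMD."""
    return loss_s + lam * loss_t + gamma * loss_mmd
```

In a framework with automatic differentiation, minimizing this scalar jointly trains the predictors (through L_s and L_t) and tightens the alignment produced by the DA layers (through the MMD term).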

Experiments and Discussion
In this section, we evaluate the performance of the proposed DA-MM algorithm by conducting experiments on two popular datasets, Office-31 [35] and Office-Caltech, using both the AlexNet [29] and GoogLeNet [32] models. All our methods are implemented in the Caffe [36] framework. The Office-Caltech dataset is another standard benchmark used in the domain adaptation field. It consists of the Office 10 and Caltech 10 datasets and contains the 10 categories shared by the Office-31 [35] and Caltech-256 [37] datasets. Each constituent dataset is considered an independent domain. The Office-Caltech dataset provides an additional 12 transfer learning tasks. In the experiments, to observe the bias of the dataset more fairly, only the six combinations that include domain C were considered as the source and target domains.

Implementation Details
This paper validates the proposed method based on the AlexNet and GoogLeNet network structures and adapts the two network models under the Caffe framework; DDC [28], DAN [16], JAN [30], and AutoDial [26] serve as comparison methods for target domain classification. We use mini-batch stochastic gradient descent with momentum to train our networks with the following meta-parameters: momentum 0.9, weight decay 0.0005, and initial learning rate 0.003. For AlexNet, the batch_size of the source domain and target domain is 64 in the training phase, and the batch_size of the target domain is 1 in the test phase; one training epoch covers 795 (webcam) samples with batch_size 64 (the target-domain sizes are 2817 for amazon, 795 for webcam, and 498 for dslr), max_iteration is 10,000, and the training time is about 60 min (GeForce GTX TITAN 6G). For GoogLeNet, the batch_size of the source domain and target domain is 16 in the training phase, and the batch_size of the target domain is 1 in the test phase; one training epoch covers 795 (webcam) samples with batch_size 16, max_iteration is 100,000, and the training time is about 300 min (GeForce GTX TITAN 6G). The other hyperparameters are consistent with those in [26], except for λ and γ: for the Office-31 dataset, we set λ to 0.1 and γ to 1, and for the Office-Caltech dataset, we set λ to 0.2 and γ to 1. Given the stability of the DA-MM results and the form of the AutoDial results, we report the average value of the results for each method.

Results
The results of unsupervised domain adaptation on the Office-31 dataset based on AlexNet and GoogLeNet are presented in Tables 1 and 2, respectively. Table 3 shows the result of unsupervised domain adaptation on the Office-Caltech dataset based on GoogLeNet network structure. To fairly compare DDC [28], DAN [16], RevGrad [20], DRCN [38], RTN [17], JAN [30], and AutoDial [26] in the same evaluation scenario, the results of unsupervised domain adaptation and the classification accuracies of these methods were taken directly from the literature. Based on the results, we can make the following observations. The proposed method is superior to all the comparison methods in most transfer tasks (11 out of 12 tasks). Specifically, the classification accuracy of the proposed AlexNet-based method on the Office-31 dataset exceeds that of the comparison method DAN (MMD only) by 5.8% and that of the method AutoDial (DA layer only) by 1.6%, while the GoogLeNet result exceeds that of DAN (MMD only) by 6.7% and that of AutoDial (DA layer only) by 2.3%. On the Office-Caltech dataset, the average classification accuracy of the GoogLeNet model is 93.2%, which is an improvement of 2.0% on average compared with the best comparison method, AutoDial. These results imply that (1) the performance of an MMD parametric method can be significantly improved by embedding the automatic domain alignment layer (DA layer) in front of the MMD parameter between each domain-specific layer in the deep neural network, and the MMD parameter benefits from this 'preliminary alignment' by the DA layer between the source domain and the target domain. (2) Although AutoDial leads to learning domain-invariant features without requiring additional loss terms (e.g., MMD, domain confusion) in the optimization function or associated hyperparameters, there is still room for improvement. Adding the MMD further improves the degree of alignment and achieves better performance.

Ablation Study and Discussion
(1) Ablation study: As shown in Table 4, the results of the proposed method are significantly better than those of the method with no DA layers (exceeding them by 7.1% for the AlexNet-based network architecture and 6.7% for the GoogLeNet-based network architecture). The proposed method also outperforms the method with no MMD layers (by more than 1.6% for the AlexNet-based network architecture and 2.3% for the GoogLeNet-based network architecture). The corresponding network structure diagrams of these two methods are shown in Figures 1 and 2.

(2) Discussion: Comparisons with the MMD-based method: Table 5 shows the classification accuracy of each method (%) for unsupervised domain adaptation, with the results on task A→W as an example. The results of the proposed method outperform those of the MMD-based method (DAN) and the DA-layer-based method (AutoDial); their network architectures are shown in Figure 2. Table 5. Ablation study: a description of the network architecture and the classification accuracy achieved by each method (%) for unsupervised domain adaptation (the results on task A→W were used as an example).

These results indicate that, to a certain extent, preliminary alignment by adding automatic domain alignment layers can help unleash the potential of the MMD term (see Figure 2e, JAN based on GoogLeNet).

Comparisons with AutoDial: Figure 3 shows the results of the A→D domain adaptation, comparing the accuracy and loss of the proposed method with those of AutoDial during the testing phase. As shown in Figure 3, the accuracy curve of DA-MM is substantially higher than that of AutoDial, while the loss curve of DA-MM is lower. Combined with the conclusions drawn from Figure 3, the proposed method partially alleviates the difficulty of domain adaptation compared with AutoDial. These results show that although AutoDial can automatically align across domains, the degree of alignment can be further improved. The proposed method uses the MMD to improve the domain alignment degree on top of automatic domain alignment, which further reduces the distribution discrepancy between domains. In addition, comparing the network architecture of AutoDial (Figure 2d), which has 69 × 2 = 138 DA layers, with that of the proposed method (Figure 1b), the proposed method requires only one DA layer. Therefore, although the network structure of the proposed method is simpler, it achieves better results.
Figure 3. Comparison with the AutoDial method: (a) accuracy and test loss for task D→A; (b) accuracy and test loss for task A→D.


Parameter Sensitivity
We also conducted sensitivity tests to investigate the effects of the parameters λ and γ. Figure 4 illustrates the changes in the transfer classification performance as λ ∈ {0, 0.2, . . . , 1} and γ ∈ {0.1, 0.5, 1, 2, 3, . . . , 10} on the A→W task. The accuracy of the proposed method first increases and then decreases as λ and γ vary, forming a bell-shaped curve. This result confirms that a good trade-off between the standard log-loss applied to the source samples, the entropy loss applied to the target samples, and the MMD distance can enhance feature transferability.
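The sensitivity study amounts to a grid sweep over λ and γ; such a sweep can be sketched as follows (the `evaluate` function here is a stand-in surface for illustration, with its peak placed at plausible values; in practice it would be a full train-and-test run):

```python
import itertools

lambdas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
gammas = [0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def evaluate(lam, gam):
    # Stand-in accuracy surface; replace with a real train-and-test run.
    return 1.0 - (lam - 0.2) ** 2 - 0.01 * (gam - 1) ** 2

# Pick the (lambda, gamma) pair with the highest held-out accuracy.
best = max(itertools.product(lambdas, gammas), key=lambda p: evaluate(*p))
```

Because each grid point requires a full training run, such sweeps are the main practical cost of reintroducing tunable hyperparameters on top of the automatic alignment.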
Mathematics 2022, 10, x FOR PEER REVIEW 13 of 15 alignment can be further improved. The proposed method uses MMD to improve the domain alignment degree based on automatic domain alignment, which further reduces the distribution discrepancy between domains. In addition, comparing the network architecture of AutoDial (Figure 2d) and the proposed method (Figure 1b), which has 69 * 2 = 138 DA layers, the proposed method requires only one DA layer. Therefore, although the network structure of the proposed method is simpler, it achieves better results.
(a) (b) Figure 3. Comparison with the AutoDial method: (a) accuracy and test loss for task D→A; (b) accuracy and test loss for task A→D.



Conclusions
In this paper, we propose an automatic domain alignment and moment matching (DA-MM) approach for deep domain adaptation. DA-MM aims to reduce the discrepancy between the source and target domains and improve domain adaptation performance. Our results show that combining the moment matching method with the automatic domain alignment method can further reduce domain discrepancy and significantly outperform several state-of-the-art domain adaptation methods. Although we combine the advantages of the two methods, we also introduce new hyperparameters that need to be manually adjusted while improving performance; this problem requires continued exploration and optimization in future work. In addition, the method proposed in this paper is a combination strategy that can be applied more widely: the moment matching module, represented here by MMD, and the automatic domain alignment module, represented here by AutoDial, can each be replaced by any similar method to achieve similar effects and performance, which also requires further verification in future work.