Multi-Source Deep Transfer Neural Network Algorithm

Transfer learning can enhance the classification performance of a target domain with insufficient training data by utilizing knowledge related to the target domain from a source domain. Nowadays, two or more source domains are often available for knowledge transfer, which can improve the performance of learning tasks in the target domain. However, the classification performance on the target domain decreases when the probability distributions of the domains are mismatched. Recent studies have shown that deep learning can resist this mismatch by building deep structures that extract more effective features. In this paper, we propose a new multi-source deep transfer neural network algorithm, MultiDTNN, based on convolutional neural networks and multi-source transfer learning. In MultiDTNN, joint probability distribution adaptation (JPDA) is used to reduce the mismatch between the source and target domains and thereby enhance the transferability of source-domain features in deep neural networks. Then, a convolutional neural network is trained on the datasets of each source domain and the target domain to obtain a set of classifiers. Finally, the designed selection strategy selects the classifiers with the smallest classification error on the target domain from this set to assemble the MultiDTNN framework. The effectiveness of the proposed MultiDTNN is verified by comparing it with other state-of-the-art deep transfer learning algorithms on three datasets.


Introduction
In the past two decades, machine learning has progressed dramatically and has moved from the laboratory to widespread commercial use [1]. Currently, machine learning is one of the fastest growing technologies at the core of artificial intelligence and data science, and it has been widely and successfully used in intrusion detection [2,3], speech recognition [4,5], computer vision [6,7], spam detection [8,9], pattern recognition [10], text classification [11], and other fields. However, in order to obtain a highly accurate classification model, many machine learning algorithms need to satisfy two basic conditions: (1) the training and test data come from the same feature space and the same distribution, that is, they are independent and identically distributed; (2) enough training samples are available. Nevertheless, these assumptions are not always met in practical applications [11,12]. Especially in emerging applications such as text mining, bioinformatics, distributed sensor networks, and social network research, the training and test data cannot satisfy the independent and identically distributed condition because of the influence of time, environmental changes, or the instability of sensor devices. When the data distribution changes, most models require the training data to be collected again, and the previous training data is not reused, which wastes data resources. In addition, data samples in some areas are often scarce, and collecting data is very expensive or even impossible. In this case, knowledge transfer between task domains is desirable [12,13].
Transfer learning, also known as domain adaptation, provides an effective means to solve the above problems. On the one hand, it no longer requires the training and test data to be independent and identically distributed. On the other hand, when the training data in the target domain is too scarce to obtain a good classifier, data from a source domain that is similar to the target domain (and often contains a large number of labeled samples) can be used to assist the learning tasks in the target domain. Transfer learning has achieved remarkable results in meeting this challenge by transferring knowledge from source to target domains with different distributions [13]. Therefore, transfer learning has attracted growing attention from researchers and has made great progress: Gao et al. [14] proposed a locally weighted ensemble transfer learning algorithm, LWE; a feature-space-based transfer learning method, LMPROJ, was proposed by Brain et al. [15]; Lu et al. [16] proposed a selective transfer algorithm, STLCF, for collaborative filtering; Long et al. [17] proposed an SVM-based least squares transfer learning framework, ARTL; Xie et al. [18] applied transfer learning to incremental learning and proposed the STIL algorithm; Li et al. [19] proposed a new transfer learning algorithm, TL-DAKELM, based on the extreme learning machine; Li et al. [20] proposed a transfer learning algorithm, RankRE-TL.
The above transfer algorithms can only handle a single source domain, but in many real-world applications, data from more than one source domain can be collected. Therefore, transfer learning algorithms with multiple source domains have naturally been studied, and their classification performance is better than that obtained with only one source domain [21]. Recently, several multi-source transfer learning algorithms have been proposed. Yao et al. [22] extended the boosting framework and proposed MultiSource-TrAdaBoost and TaskTrAdaBoost; Sun et al. [23] proposed a two-stage domain adaptation method that combines weights of data based on marginal probability differences (first stage) and conditional probability differences (second stage) between multiple source domains and the target domain; Duan et al. [24] proposed a multi-source domain adaptation method, DAM; the authors of [25] proposed a new online transfer learning algorithm that uses labeled data from multiple source domains to improve classification performance in the target domain; Ding et al. [26] attempted to carry out effective knowledge transfer with incomplete multi-source domains and proposed an incomplete multi-source transfer learning method that improves knowledge transfer in two directions; in [27], Jun et al. explored two problems of domain adaptation and proposed the A-SVM algorithm.
Whether multi-source or single-source, the transfer learning algorithms above achieve acceptable classification results, but they all have shallow structures. Therefore, they cannot uncover the deeper and more complex knowledge behind the data, or find more information shared between domains, to further improve the classification performance in the target domain.
Deep learning has more powerful expressive ability than learning algorithms with shallow structures, and it has gained a lot of attention for its better feature representations. Consequently, deep learning can generate more domain-invariant features for knowledge transfer between domains. At present, there are many research works on the combination of transfer learning and deep learning. Huang et al. [28] proposed a shared-hidden-layer multilingual DNN (SHL-MDNN), in which the hidden layers are shared across many languages, while the softmax layer is language dependent. Ding et al. [29] developed a new deep transfer low-rank coding based on a deep convolutional neural network, which obtains a multi-layer general dictionary shared across two domains to bridge domain gaps, so that rich domain-invariant knowledge can be captured layer by layer. The deep transfer learning framework proposed by Te et al. [30] extended marginal distribution adaptation to joint distribution adaptation and used the unambiguous structures associated with the labeled samples of the source domain to adjust the conditional distribution of the unlabeled samples in the target domain, which ensures more accurate distribution matching. The Deep Adaptation Network (DAN) architecture proposed in [31] extended the deep convolutional neural network to the domain adaptation scenario; the architecture learns transferable features with statistical guarantees and can scale linearly through unbiased estimation of the kernel embedding. A CNN framework that utilizes unlabeled or sparsely labeled data in the target domain was proposed in [32] to facilitate transfer by optimizing for domain invariance. Zhang et al. [33] proposed a new deep convolutional neural network method, Deep Convolutional Neural Networks with Wide First-layer Kernels (WDCNN), which uses the raw vibration signal as input and wide kernels in the first convolutional layer to extract features and suppress high-frequency noise. The DHN algorithm [34] seeks informative hash codes by combining deep structure learning with domain alignment. DDC [35] was the first to incorporate a domain confusion loss into a top layer of AlexNet to reduce drift during domain transfer. However, these algorithms consider only the differences in the marginal probability distributions of the domains and knowledge from a single source domain, and they ignore the conditional probabilities and the intrinsic information of the domains.
In this paper, inspired by previous research on combining deep neural networks and transfer learning, we propose a new multi-source deep transfer neural network algorithm (MultiDTNN) based on convolutional neural networks and multi-source transfer learning. The core idea of this work is as follows. First, we enhance feature transferability in specific layers of the deep neural network by using joint probability distribution adaptation (JPDA) to reduce the domain differences between each source domain and the target domain. Then, we train convolutional neural networks (CNNs) on each source domain and the target domain to obtain a set of classifiers. Finally, to obtain MultiDTNN, we apply the second stage of the TaskTrAdaBoost [22] algorithm to design a selection strategy that selects the classifiers with the smallest classification error on the target domain from the classifier set. To the best of our knowledge, we are the first to apply multi-source transfer learning and JPDA to cross-domain classification tasks with deep neural networks.
Our contributions are threefold: (1) deep transfer structures are constructed based on JPDA and a convolutional neural network, which can transfer more features of the source-domain data to the target domain; (2) more knowledge from multiple source domains is provided to assist in building the learning model of the target domain, so the classification performance of the model is better; (3) an ensemble of classifiers is more advantageous than a single classifier in terms of prediction effectiveness and stability.
The remaining parts of the paper are organized as follows: In Section 2, the related works of multi-source transfer learning, convolutional neural networks, and maximum mean discrepancy are briefly discussed. The MultiDTNN is proposed and implementation details are also explained in Section 3. Section 4 verifies the effectiveness of MultiDTNN by comparing with state-of-the-art benchmark algorithms on three cross domain datasets. The last section summarizes the conclusions of this paper.

Multi-Source Transfer Learning
Transfer learning has been studied extensively since it was proposed at NIPS-95 in 1995 [12]. However, in real-world applications, we can easily collect auxiliary data from multiple source domains. Therefore, multi-source transfer learning has gradually attracted the interest of researchers [13-25]. Compared with earlier transfer learning algorithms that use a single source domain, it can transfer knowledge from multiple source domains to the learning tasks of the target domain [26]. In addition, if the target and source domains are uncorrelated or only weakly correlated, transfer learning not only fails to improve the performance of the target-domain classifier but can also lead to negative transfer, which reduces that performance. Therefore, when extracting knowledge from two or more source domains, the knowledge from the source domains most closely related to the target domain should be selected as far as possible to build the prediction model in the target domain [27]. As shown in Figure 1, multi-source transfer learning makes use of the relationships between the multiple source domains and the target domain to improve prediction performance on the target-domain samples and to assist the target domain in establishing a prediction model. Multi-source transfer learning methods can be divided into two categories: boosting-based methods [22,25] and regularization-based methods [26,27]. Regularization-based methods add a regularization term to the learning model and solve the resulting optimization problem, whereas boosting-based methods use a boosting algorithm to generate a set of classifiers. The proposed MultiDTNN belongs to the boosting-based category. While multiple source domains can provide more knowledge, the differences between domains also make transfer learning challenging. To this end, many methods for handling multiple source domains have been proposed in practical applications [22-27].

In Figure 1, $(D_{S_1}, T_{S_1}), (D_{S_2}, T_{S_2}), \ldots, (D_{S_n}, T_{S_n})$ represent the source domains and their corresponding learning tasks. Similarly, $(D_T, T_T)$ is the target domain and its corresponding learning task. $f_t$ denotes the classifier obtained by training the transfer learning system on the datasets of the target and source domains.

Convolutional Neural Network
In the past few years, deep learning has achieved good performance in solving various problems. Among the different types of deep neural networks, the CNN has been studied extensively [36]. In 2006, Hinton et al. published a paper in Science that renewed interest in deep neural networks [37]. As one of the most effective deep learning models, the CNN has been widely used in image processing [38-40], face recognition [41], and feature extraction [42]. In general, a CNN consists of three parts: convolutional layers, pooling layers, and fully connected layers. The convolutional and pooling layers are arranged alternately; that is, each convolutional layer is followed by a pooling layer, and so on. After the convolutional and pooling layers, one or more fully connected layers are attached. The first step in a CNN convolves the input signal with a convolution kernel to obtain a feature map and then applies a nonlinear activation function (ReLU) to the feature map. The convolutional layer operation is formally described as follows:
$$c_n^r = \mathrm{ReLU}\Big(\sum_{m} v_m^{r-1} * w_n^r + b_n^r\Big) \tag{1}$$
In Equation (1), $c_n^r$ is the $n$-th output of convolutional layer $r$, $n$ indexes the convolution kernels of convolutional layer $r$, $w_n^r$ and $b_n^r$ represent the convolution kernel and the bias, respectively, $v_m^{r-1}$ is the $m$-th output of convolutional layer $r-1$, and $*$ is the convolution operation. After calculating Equation (1), we obtain the feature map and then apply average or max pooling over non-overlapping regions of the feature map. Finally, the fully connected layer is used for classification. Given a data set, the network is trained by minimizing the empirical loss in Equation (2), where $l(\cdot)$ denotes the loss function that estimates the cost between the true label $Y(x_j)$ and the label predicted by the network.
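The convolution, ReLU, and pooling steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the function names (`conv2d_valid`, `conv_layer`, `max_pool`) are hypothetical, the code assumes one kernel slice per input feature map, and, like most CNN libraries, it computes cross-correlation rather than a flipped-kernel convolution.

```python
import numpy as np

def conv2d_valid(v, w):
    """'Valid' cross-correlation of a 2-D input v with a 2-D kernel w."""
    kh, kw = w.shape
    oh, ow = v.shape[0] - kh + 1, v.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(v[i:i + kh, j:j + kw] * w)
    return out

def conv_layer(inputs, kernels, biases):
    """inputs: (M, H, W); kernels: (N, M, kh, kw); biases: (N,).
    Returns N feature maps: ReLU(sum over m of inputs[m] * kernels[n, m] + biases[n])."""
    maps = []
    for n in range(kernels.shape[0]):
        s = sum(conv2d_valid(inputs[m], kernels[n, m]) for m in range(inputs.shape[0]))
        maps.append(np.maximum(s + biases[n], 0.0))  # ReLU activation
    return np.stack(maps)

def max_pool(fmap, size=2):
    """Non-overlapping max pooling on (N, H, W); rows/columns that do not
    fill a complete window are dropped."""
    n, h, w = fmap.shape
    h2, w2 = h // size, w // size
    x = fmap[:, :h2 * size, :w2 * size].reshape(n, h2, size, w2, size)
    return x.max(axis=(2, 4))
```

For example, a 4x4 input of ones with a single 3x3 kernel of ones and bias -8 yields a 2x2 feature map of ones, which pooling reduces to a single value.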

Maximum Mean Discrepancy
Since the proposed MultiDTNN needs to measure the distribution differences between domains, a suitable measure of distribution distance must be chosen. It has been demonstrated that the maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space (RKHS) is a valid method for estimating the distance between two distributions [43]. For convenience of calculation, the squared form of MMD is generally used. The process of estimating the difference between two domains using MMD is as follows.
Given a labeled dataset in a source domain $D_s = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and an unlabeled dataset in the target domain $D_t = \{z_1, \ldots, z_m\}$, let $\varphi$ be the nonlinear mapping function into the RKHS $\mathcal{H}$. The squared form of MMD is defined as follows:
$$\mathrm{MMD}^2(D_s, D_t) = \Big\| \frac{1}{n}\sum_{i=1}^{n}\varphi(x_i) - \frac{1}{m}\sum_{j=1}^{m}\varphi(z_j) \Big\|_{\mathcal{H}}^2 \tag{3}$$
In Equation (3), the difference between the two domains is measured as the distance between the mean embeddings of the two data distributions. The smaller the MMD value, the closer the two domains are. If the value is 0, the two domains match. At present, MMD has been widely used in transfer learning algorithms [15,21,23,24,26,29,30,32], where it can be used to construct regularization terms so that the features learned in different domains are more similar. In neural-network-based transfer learning algorithms, MMD is often added to the loss function for optimization [30].
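As an illustration of how the squared MMD of Equation (3) can be estimated in practice, the sketch below uses the kernel trick with a Gaussian (RBF) kernel. The kernel choice, the bandwidth parameter `gamma`, and the function names are assumptions made for this example; the text does not fix them.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(xs, zt, gamma=1.0):
    """Biased empirical estimate of the squared MMD between samples xs and zt.
    Expanding the squared norm in Equation (3) gives
    mean k(x, x') + mean k(z, z') - 2 * mean k(x, z)."""
    return (rbf_kernel(xs, xs, gamma).mean()
            + rbf_kernel(zt, zt, gamma).mean()
            - 2.0 * rbf_kernel(xs, zt, gamma).mean())
```

Calling `mmd2` on two copies of the same sample returns a value of essentially zero, while shifting one sample far away from the other gives a clearly positive value.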

Multi-Source Deep Transfer Neural Network
This section describes the multi-source deep transfer neural network algorithm in detail. For convenience, we only consider the binary classification problem. The $N$ source domains are defined as $D_{s_i} = \{(x_j^{s_i}, y_j^{s_i})\}_{j=1}^{n_{s_i}}$, $i \in \{1, \ldots, N\}$, where $x_j^{s_i}$ denotes the $j$-th sample of the $s_i$-th source domain, $y_j^{s_i}$ is the corresponding class label, $n_{s_i}$ is the number of samples in the $s_i$-th source domain, and $P_{s_i}$ and $Q_{s_i}$ denote its marginal and conditional probability distributions. Analogously, the target domain is $D_T = \{x_i\}_{i=1}^{n_t}$, with marginal and conditional probability distributions $P_t$ and $Q_t$. Normally, $P_{s_i} \neq P_t$ and $Q_{s_i} \neq Q_t$.
In this paper, the goal of the proposed MultiDTNN is to use knowledge from multiple source domains to assist the learning tasks of the target domain and create an efficient classifier model that can accurately label the unlabeled samples in the target domain. In MultiDTNN, knowledge transfer from the source domains to the target domain is achieved through transfer learning [11]. Transfer learning is a machine learning approach that solves learning problems in a different but related domain (the target domain) by using knowledge from existing historical data (the source domain) [44,45]. At present, most of the transfer learning techniques commonly used by researchers are instance-based methods, which select representative instances from the source domain to assist the learning tasks in the target domain [22]. However, the target and source domains can differ greatly in practical applications. If instance data from a source domain that is unrelated to the target domain is forcibly transferred, it will not help learning in the target domain, a phenomenon known as negative transfer. Negative transfer has accompanied transfer learning from the start and has always been a focus of researchers. In order to avoid negative transfer and better assist the learning tasks in the target domain, it is particularly important to select source-domain samples with high similarity to the target domain [12,13]. MultiDTNN transfers knowledge from multiple source domains into the target domain to improve classifier performance, so we must fully consider the difference between each source domain and the target domain and maximize the knowledge transferred from the source domains most similar to the target domain in order to avoid negative transfer. The composition strategy, the knowledge transfer from multiple source domains, and the classifier training process of the MultiDTNN model are described in detail below.

Joint Probability Distribution Adaptation
In practical applications, each source domain differs from the target domain not only in marginal probability but also, significantly, in conditional probability. If only the marginal probability between the source and target domains is considered, negative transfer may occur and transfer learning cannot achieve good classification performance. Therefore, in order to give the proposed MultiDTNN better classification performance, we consider the marginal and conditional probabilities simultaneously. The literature [30,46] points out that minimizing the differences of both the marginal and conditional distributions can effectively avoid negative transfer and improve the classification performance of transfer learning algorithms.
In Equations (4) and (5), $\varphi(\cdot)$ represents a feature mapping to a reproducing kernel Hilbert space, $x^{s_i}$ is the sample vector and $y^{s_i}$ is the label vector of the $s_i$-th source domain, and $x^t$ is the sample vector and $y^t$ is the label vector of the target domain. $\mathrm{Diff}$ represents a function that calculates the difference between the source and target domains.
Equation (4) minimizes the distance between the data distributions of the target and source domains. We apply MMD (Equation (3)) to calculate Equation (4):
$$D_H(P_{s_i}, P_t) = \Big\| \frac{1}{n_{s_i}}\sum_{j=1}^{n_{s_i}} \varphi(x_j^{s_i}) - \frac{1}{n_t}\sum_{k=1}^{n_t} \varphi(x_k^t) \Big\|_{\mathcal{H}}^2 \tag{6}$$
The conditional distribution in Equation (5) is intractable because $y^t$ is unknown, so we rewrite it as Equation (7). To handle the unknown labels of the target-domain samples, the literature [30,31] proposed an indirect approach: Equation (7) is evaluated using pseudo labels of the target-domain data. That is, a model pre-trained on the labeled source data is used to obtain pseudo labels in the target domain. The pseudo labels are computed as follows: first, the similarity weights of the samples in the source and target domains are calculated using the MMD method; then, the CNN classifier is trained using the source-domain samples and the corresponding weights; finally, the classifier labels the target-domain samples. Suppose there are $C$ categories in total in the target domain, $c \in \{1, \ldots, C\}$. We utilize Equation (3) to measure the mismatch between the conditional distributions $Q_{s_i}(x^{s_i} \mid y^{s_i} = c)$ and $Q_t(x^t \mid y^t = c)$:
$$D_H(Q_{s_i}^{(c)}, Q_t^{(c)}) = \Big\| \frac{1}{n_{s_i}^{(c)}}\sum_{x_j^{s_i} \in D_{s_i}^{(c)}} \varphi(x_j^{s_i}) - \frac{1}{n_t^{(c)}}\sum_{x_k^t \in D_t^{(c)}} \varphi(x_k^t) \Big\|_{\mathcal{H}}^2 \tag{8}$$
where $D_{s_i}^{(c)}$ is the set of samples of class $c$ in the $s_i$-th source domain, $D_t^{(c)}$ is the set of target-domain samples whose pseudo label is $c$, and $n_{s_i}^{(c)}$ and $n_t^{(c)}$ are their sizes. The initial pseudo labels of the target data certainly contain many errors, but we can iteratively update them in subsequent model optimization stages until the best prediction accuracy is obtained.
Combining Equations (6) and (8) gives the JPDA regularization term:
$$D_H(J_{s_i}, J_t) = D_H(P_{s_i}, P_t) + \sum_{c=1}^{C} D_H(Q_{s_i}^{(c)}, Q_t^{(c)}) \tag{9}$$
In Equation (9), $D_H(J_{s_i}, J_t)$ is the JPDA between the $s_i$-th source domain $D_{s_i}$ and the target domain $D_t$. Minimizing Equation (9) ensures that the marginal and conditional distributions are matched with sufficient statistics.
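To make the structure of the JPDA term in Equation (9) concrete, the following sketch adds a marginal discrepancy to a sum of per-class discrepancies computed with target pseudo labels. It is a simplified illustration under stated assumptions: a linear kernel is used (so each discrepancy reduces to the squared distance between feature means), the terms are summed without weights, and the names `linear_mmd2` and `jpda` are hypothetical.

```python
import numpy as np

def linear_mmd2(a, b):
    """Squared MMD with a linear kernel: squared distance between sample means."""
    d = a.mean(axis=0) - b.mean(axis=0)
    return float(d @ d)

def jpda(xs, ys, xt, yt_pseudo, classes):
    """Marginal discrepancy plus the sum of class-conditional discrepancies.
    xs, ys: source features and true labels; xt, yt_pseudo: target features
    and pseudo labels (assumed to come from a pre-trained classifier)."""
    total = linear_mmd2(xs, xt)                    # marginal term
    for c in classes:
        s_c, t_c = xs[ys == c], xt[yt_pseudo == c]
        if len(s_c) and len(t_c):                  # skip classes absent on either side
            total += linear_mmd2(s_c, t_c)         # conditional term for class c
    return total
```

With a nonlinear kernel, `linear_mmd2` would be replaced by a kernel-based estimate; the combination of marginal and per-class terms stays the same.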

Construction of MultiDTNN
Based on the JPDA of Section 3.1, we use a convolutional neural network to establish a multi-source deep transfer neural network framework. The framework of MultiDTNN is shown in Figure 2.

As shown in Figure 2, we divide MultiDTNN into two parts: (1) a set of $N$ classifiers TCNN$_{s_i}$, each obtained by training a CNN with JPDA on source domain $D_{s_i}$ and the target domain; (2) a selection strategy, similar to that in [22], that chooses an ensemble of classifiers, which composes the MultiDTNN model. An ensemble is a system that uses multiple predictors, statistically independent to some extent, to attain an aggregated prediction [47]. Such systems usually perform better than a single predictor and are more stable. The two parts are described in detail below.
A. Construction of TCNN$_{s_i}$. The structure of TCNN$_{s_i}$ is shown in Figure 3. In general, we can train the CNN model from scratch on sufficient data in the source domain by using the optimization task defined in Equation (2). When applying the pre-trained CNN model to the target domain, we integrate JPDA as a regularization term of the loss function and redefine the objective function as Equation (10): the classification loss of Equation (2) plus $\lambda\, D_H(J_{s_i}, J_t)$, where the minimization is over the parameter set of a CNN with $l$ layers and $\lambda$ is a non-negative regularization parameter. For a CNN, as the number of layers increases, the features change from general to specific. The upper layers tend to represent more abstract features, which leads to larger domain differences. Therefore, we deploy the regularization on the fully connected layer. By minimizing Equation (10), we can adapt the pre-trained CNN to the classification task of the target domain. We use mini-batch stochastic gradient descent (SGD) [29,30] and the backpropagation algorithm to optimize the CNN. The gradient of Equation (10) with respect to the network parameters combines the gradient of the classification loss with $\lambda$ times the gradient $\nabla D_H(J_{s_i}, J_t)$ of the JPDA term. The training procedure mainly consists of two subprocesses: (1) pre-training a CNN on each labeled source domain; (2) adapting the network from (1) to the target domain using the labeled source-domain data and the unlabeled target-domain data. Therefore, we obtain a collection of classifiers $H = \{\mathrm{TCNN}_i\}_{i=1}^{N}$ on the $N$ source domains. The detailed procedure is shown in Step 1 of Table 1.
When the source-domain data becomes large, training the CNN requires the support of high-performance computers, which is also a future need of deep learning. Therefore, various performance tuning tools are used to record performance indicators during operation and support the optimization of the CNN.
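As a small worked example of the gradient $\nabla D_H$ that backpropagation needs, the sketch below differentiates a mean-matching discrepancy with respect to the features and checks the result numerically. It assumes a linear kernel, so the discrepancy is the squared distance between the feature means of two batches; the function names are hypothetical and the code is not taken from the paper.

```python
import numpy as np

def mean_gap2(a, b):
    """Squared distance between the feature means of batches a and b."""
    d = a.mean(axis=0) - b.mean(axis=0)
    return float(d @ d)

def mean_gap2_grads(a, b):
    """Analytic gradients of mean_gap2 with respect to every row of a and b.
    Each row of a contributes 1/len(a) to the mean, hence the 2*d/len(a) factor."""
    d = a.mean(axis=0) - b.mean(axis=0)
    grad_a = np.tile(2.0 * d / len(a), (len(a), 1))
    grad_b = np.tile(-2.0 * d / len(b), (len(b), 1))
    return grad_a, grad_b

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of scalar function f at array x."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp, xm = x.copy(), x.copy()
        xp[idx] += eps
        xm[idx] -= eps
        g[idx] = (f(xp) - f(xm)) / (2 * eps)
    return g
```

In a full network these feature gradients would be scaled by $\lambda$ and propagated back through the fully connected layer together with the gradient of the classification loss.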
B. Selection strategy. To obtain a powerful set of classifiers, we follow [22] and implement an efficient selection strategy. The AdaBoost algorithm is executed iteratively on the target-domain dataset. In each iteration, a candidate classifier is taken from the classifier set and trained on the target domain; to ensure that the transferred source-domain knowledge is closely related to the target task, the error rate of the classifier on the target-domain dataset is calculated, and the classifier is selected if its error rate meets the requirement and discarded otherwise; in addition, the weights of the target-domain samples are updated for the next iteration. The detailed selection process is shown in Steps 2 to 13 of Table 1. In the end, we obtain a set of classifiers with better classification performance on the target domain, which is our proposed MultiDTNN model.
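The selection loop described above can be sketched as a standard AdaBoost-style procedure over a fixed pool of candidate classifiers. This is an illustrative sketch rather than the exact procedure of Table 1: it assumes binary labels in {-1, +1}, a small labeled target set, candidates given as plain Python callables, and the usual AdaBoost weight and coefficient formulas; `select_ensemble` and `predict` are hypothetical names.

```python
import numpy as np

def select_ensemble(candidates, x, y, rounds=5):
    """Pick, in each round, the candidate with the smallest weighted error on
    the target data (x, y); returns a list of (alpha, classifier) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # target sample weights
    preds = [h(x) for h in candidates]          # cache candidate predictions
    ensemble = []
    for _ in range(rounds):
        w = w / w.sum()                         # normalize weights to 1
        errors = [float(np.sum(w * (p != y))) for p in preds]
        k = int(np.argmin(errors))
        eps = errors[k]
        if eps >= 0.5:                          # no candidate beats chance: stop
            break
        eps = max(eps, 1e-10)                   # avoid division by zero
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        ensemble.append((alpha, candidates[k]))
        w = w * np.exp(-alpha * y * preds[k])   # up-weight misclassified samples
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the selected classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return np.where(score >= 0, 1, -1)
```

Candidates whose weighted error never drops below that of the others are simply never chosen, which is how poorly related source domains are kept out of the final ensemble in this sketch.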

Training Strategy of MultiDTNN
According to Sections 3.1-3.3, the training process of the proposed MultiDTNN is summarized in Table 1.
Table 1. Training process of MultiDTNN.
Step 1. Obtain the pseudo labels of the $n_t$ target samples on $D_T$ by using CNN$_i$
    Repeat
        $j = j + 1$
        Compute the regularization term JPDA according to Equation (9)
        Obtain TCNN$_i$ by optimizing CNN$_i$ with Equation (10)
        Update the pseudo labels $\hat{Y}_j$ with the optimized network TCNN$_i$
    Until convergence or $\hat{Y}_j = \hat{Y}_{j-1}$
    The weight vector $w^T$ is normalized to 1
Step 4. Empty the current weak classifier set, $F \leftarrow \emptyset$; for $t \leftarrow 0$ to $N$ do
Step 5. Compute the error
Step 8. Find the weak classifier $h_t : x \rightarrow y$ with $(h_t, \varepsilon_t) = \arg\min_{(h_k, \varepsilon_k) \in F} \varepsilon_k$
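The pseudo-label loop in Step 1 of Table 1 (repeat until the labels stop changing) can be illustrated with a deliberately simple stand-in model. In the sketch below a nearest-centroid classifier replaces the CNN, and "optimizing the network" is replaced by refitting the class centroids on the source data plus the pseudo-labeled target data; these substitutions, the function names, and the assumption that every class appears in the source data are all made only for this example.

```python
import numpy as np

def nearest_centroid_labels(x, centroids):
    """Assign each row of x to the index of its closest centroid."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def refine_pseudo_labels(xs, ys, xt, n_classes, max_iter=20):
    """Iteratively update target pseudo labels until they no longer change."""
    centroids = np.stack([xs[ys == c].mean(axis=0) for c in range(n_classes)])
    y_hat = nearest_centroid_labels(xt, centroids)   # initial pseudo labels (source-only model)
    for _ in range(max_iter):
        # Stand-in for the adaptation step: refit on source + pseudo-labeled target.
        new_centroids = np.stack([
            np.vstack([xs[ys == c], xt[y_hat == c]]).mean(axis=0)
            for c in range(n_classes)
        ])
        new_y_hat = nearest_centroid_labels(xt, new_centroids)
        if np.array_equal(new_y_hat, y_hat):          # labels unchanged: stop
            break
        y_hat = new_y_hat
    return y_hat
```

The stopping rule mirrors the "until convergence or the pseudo labels are unchanged" condition; only the model being refit differs from the procedure in the table.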

Experimental Results
In this section, in order to analyze the effectiveness of the proposed MultiDTNN, we evaluate it on three cross-domain standard datasets. First, the experimental setup is introduced in Section 4.1. Then, Section 4.2 describes the three cross-domain datasets in detail. Finally, in Section 4.3 we compare the proposed MultiDTNN with several state-of-the-art deep transfer learning algorithms.

Experimental Setting
The following state-of-the-art transfer learning methods are chosen as benchmark algorithms for comparison with MultiDTNN: ARTL [17], STLCF [16], TaskTrBoost [22], FastDAM [24], IMTL [26], DTLC [29], DAN [31], SDT [32], DHN [34], DDC [35], CNN [38], and Deep CORAL [40]. Among these benchmark algorithms, CNN is a non-transfer learning algorithm; TaskTrBoost, FastDAM, and IMTL are transfer learning algorithms that can utilize knowledge from multiple source domains; and STLCF and ARTL are non-deep transfer learning algorithms. For the baseline methods, we follow the standard procedures described in their respective works. We implement the proposed MultiDTNN using TensorFlow and train it with stochastic gradient descent (SGD). The initial learning rate is set to $10^{-3}$, and the SGD momentum is 0.9. The parameter $\lambda$ is searched in the range from 0.01 to 100. The MultiDTNN model can easily adopt other CNN structures, e.g., VGGNet, ResNet, and GoogLeNet. Deeper CNN structures may improve the performance to some extent. Since we focus on specific layers, we only evaluate the AlexNet structure in this paper. We mainly follow the standard evaluation protocol for unsupervised domain adaptation and use all labeled samples of the source domain and all unlabeled samples of the target domain. For fairness, a 5-fold cross-validation strategy is used for all experiments and repeated twice. Each experiment is run 10 times, and the average classification accuracy and its standard deviation are recorded. The classification accuracy is defined as follows:
$$\mathrm{Accuracy} = \frac{|\{x : x \in D_t \wedge f(x) = y(x)\}|}{|\{x : x \in D_t\}|}$$
where $D_t$ is the dataset of the target domain, $y(x)$ represents the true class label of $x$, and $f(x)$ is the class label of $x$ predicted by the classifiers.
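The accuracy measure and the mean and standard deviation over repeated runs can be computed as in the short sketch below. The function names are hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not say which variant is reported.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of target samples whose predicted label equals the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def summarize(accuracies):
    """Mean and (population) standard deviation of accuracies over repeated runs."""
    a = np.asarray(accuracies, dtype=float)
    return float(a.mean()), float(a.std())
```

For instance, `summarize` would be called once per cross-domain task with the list of accuracies collected over the repeated runs.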

Datasets
Office-31, Office-10+Caltech-10, and Office+Home [30][31][32] are well-known standard cross-domain datasets in transfer learning applications, so all experiments in this paper are performed on these datasets. The datasets are described in detail below.
Office-31 is a standard dataset that contains 4,652 images from the domains Amazon (A), Webcam (W), and DSLR (D). These images are divided into 31 categories. The samples in Amazon are from www.amazon.com, and the samples in Webcam and DSLR are captured with web cameras and digital SLR cameras in different environments. We construct six cross-domain tasks from source to target domains: A->D, A->W, W->A, W->D, D->A, and D->W. On each of these cross-domain tasks, the proposed multi-source MultiDTNN algorithm uses A, W, and D as source domains.
Office-10+Caltech-10 contains the 10 common object categories shared by the Office-31 and Caltech-256 (C) datasets, and it has been widely used to evaluate domain adaptation methods. Following the same construction as for Office-31, we build 12 cross-domain tasks. MultiDTNN uses 4 source domains on this dataset.
Office+Home collects objects from 4 domains: Art (Ar, artistic drawings of objects), Clipart (Cl, images collected from www.clipart.com), Product (Pr, similar to Amazon's samples, with an almost clean background), and Real-World (Re, object images taken with a regular camera). The dataset has 65 object categories with 15,500 image samples. We again construct 12 cross-domain tasks in the same way as for Office-31, with MultiDTNN using 4 source domains simultaneously on each task.
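The task counts above follow from enumerating all ordered (source, target) pairs of distinct domains: 3 domains give 3 x 2 = 6 tasks and 4 domains give 4 x 3 = 12. The short sketch below reproduces this enumeration; the function name is hypothetical and the domain abbreviations are those used in this section.

```python
from itertools import permutations


def cross_domain_tasks(domains):
    """All ordered source->target pairs in which source and target differ."""
    return [f"{source}->{target}" for source, target in permutations(domains, 2)]


office31 = cross_domain_tasks(["A", "W", "D"])             # 6 tasks
office_caltech = cross_domain_tasks(["A", "W", "D", "C"])  # 12 tasks
office_home = cross_domain_tasks(["Ar", "Cl", "Pr", "Re"]) # 12 tasks
```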

Analysis of Experimental Results
In this section, the experimental results of the MultiDTNN algorithm and the 12 benchmark algorithms on real datasets are analyzed and compared. We compare the average accuracy over 10 runs on the three datasets. Table 2 shows the results of six cross-domain tasks on Office-31. The results of 12 cross-domain tasks on Office-10+Caltech-10 are shown in Table 3. Table 4 shows the results on 12 cross-domain tasks of Office+Home.

Table 3. Average accuracy rate (%) with standard deviation on Office-10+Caltech-10 dataset. Columns: Algorithms, A->C, D->C, W->C, A->W, C->W, D->W, A->D, C->D, W->D, C->A, D->A, W->A.

From the results in Tables 2-4, we can draw the following conclusions:
(1) On the cross-domain tasks of the three datasets, the average accuracy of the deep-learning-based methods exceeds that of the conventional transfer learning algorithms ARTL and STLCF, which shows that deep-learning-based methods are clearly superior to shallow transfer learning algorithms.
(2) CNN-based deep transfer learning algorithms (e.g., DAN, DTN, SDT, D-CORAL, DTLC, and DHN) can use the knowledge of the source domain to assist the learning task in the target domain, so their classification performance is better than that of the standard deep learning method (CNN). This indicates that, when a deep neural network model is combined with transfer learning, source-domain data can improve the learning task of a target domain with unlabeled data.
(3) Among the benchmark algorithms, TaskTrBoost, FastDAM, and IMTL can utilize the sample features of multiple source domains to help the target-domain learning task build classifier models. Their classification performance is therefore better than that of ARTL and STLCF, which are non-deep single-source transfer learning algorithms, and in some cases it is even clearly superior to that of the CNN-based deep transfer learning algorithms.
(4) Compared with Office-31 and Office-10+Caltech-10, Office+Home contains more categories and a larger distribution difference between categories, so none of the algorithms achieves promising performance on it. However, the experimental results show that our proposed model obtains better performance in most cases. On Office+Home in particular, MultiDTNN achieves better performance than the benchmark algorithms.
(5) Compared with the benchmark algorithms, our proposed MultiDTNN model can transfer knowledge from more than one source domain, so it can help the learning task of the target domain build a more effective classifier model. For example, for the cross-domain task A->W of Office-31, the deep transfer neural network algorithms DTLC, DAN, SDT, DHN, D-CORAL, and DDC can only transfer the knowledge of one source domain, A, to the target domain W. In contrast, the proposed MultiDTNN can simultaneously use the knowledge of three source domains, A, W, and D, for the learning task of the target domain. Similarly, MultiDTNN can utilize 4 source domains on the Office-10+Caltech-10 and Office+Home datasets. A careful analysis of all experimental results on the three datasets shows that MultiDTNN performs best overall. In addition, the experimental results demonstrate that, in deep neural networks, multi-source transfer can effectively compensate for the limitations of single-source transfer.
Table 1 shows that our proposed MultiDTNN model is an iterative algorithm with a key parameter λ, so it is necessary to analyze its convergence and the influence of λ on the model. Both are analyzed below.
A. Convergence analysis
The training process of MultiDTNN in Table 1 shows that the proposed algorithm consists of two iterative sub-processes: in the first, the CNN is trained on the source and target domains to obtain a set of classifiers; in the second, classifiers are selected from this set to compose an ensemble. It is therefore theoretically challenging to prove its convergence. Instead, following common practice among researchers, we examine convergence empirically and plot the convergence curve of our model in Figure 4. As can be seen from Figure 4, our model converges well.

B. Parameter analysis
The parameter λ is the regularization coefficient in the objective function of MultiDTNN, and it strongly affects the correlations between source and target domains. Therefore, we evaluate the influence of λ on the model. Figure 5 shows the classification performance on three cross-domain tasks over a range of λ values. The accuracy of MultiDTNN follows a bell-shaped curve and is best when λ is around 0.5. This also confirms that a good compromise between deep feature learning and distribution difference adaptation can enhance the transferability of features.
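A parameter search of this kind can be sketched as a sweep over a logarithmic grid between 0.01 and 100, keeping the value of λ with the highest accuracy. In the sketch below, the grid size, the function names, and in particular `toy_accuracy` are assumptions for illustration: `toy_accuracy` is a hypothetical stand-in for training and evaluating MultiDTNN, shaped only so that it peaks near λ = 0.5, and it does not reproduce Figure 5.

```python
import math


def log_grid(low=0.01, high=100.0, steps=9):
    """`steps` values spaced evenly on a log10 scale from low to high."""
    lo, hi = math.log10(low), math.log10(high)
    return [10 ** (lo + i * (hi - lo) / (steps - 1)) for i in range(steps)]


def select_lambda(candidates, evaluate):
    """Return the candidate with the highest score, together with that score."""
    scored = [(evaluate(lam), lam) for lam in candidates]
    best_score, best_lam = max(scored)
    return best_lam, best_score


def toy_accuracy(lam):
    # Hypothetical placeholder for "train with this lambda, then measure
    # target accuracy"; it is a smooth bump centered at lambda = 0.5.
    return 1.0 / (1.0 + math.log10(lam / 0.5) ** 2)


grid = log_grid()
best_lam, best_score = select_lambda(grid, toy_accuracy)
```

With a coarse grid the selected value is only the grid point closest to the peak, so a finer grid around the best coarse value would be a natural second step.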

Conclusions
In this paper, we design a new deep transfer neural network framework, the multi-source deep transfer neural network (MultiDTNN), which integrates multi-source transfer learning, CNN, and JPDA into one optimization procedure. Multi-source transfer provides more knowledge to the target domain by drawing on multiple source domains when the target-domain classification models are built; CNN extracts more complex features from the dataset; and JPDA reduces the difference in probability distribution between domains and increases the transferability of source-domain features. Specifically, to enhance the transferability of features in deep neural networks, MultiDTNN utilizes JPDA to reduce the probability distribution difference between each source domain and the target domain. Then, on each pair of source and target domains, we train a CNN to obtain a set of deep learning classifiers. Finally, inspired by TaskTrAdaBoost, a selection strategy is designed that picks the classifier with the smallest classification error on the target domain from the classifier set to assemble the MultiDTNN framework. The experimental results on the three cross-domain benchmark datasets demonstrate the effectiveness of our proposed model and its advantages over the benchmark algorithms. Although the experimental results show that MultiDTNN has better classification performance than the benchmark algorithms, further work is needed in two respects: improving the convergence efficiency of the MultiDTNN model, and increasing the number of source domains to more than 10, which is an interesting challenge.
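The selection step summarized above can be sketched as choosing, from a set of candidate classifiers, the one with the smallest classification error on target-domain samples. This is only a minimal illustration: it assumes that a few labeled target samples are available for measuring the error, which this section does not specify, and the classifier names, the threshold rules, and the data are hypothetical toy values rather than trained CNNs.

```python
def classification_error(classifier, samples, labels):
    """Fraction of samples for which the classifier's prediction is wrong."""
    wrong = sum(1 for x, y in zip(samples, labels) if classifier(x) != y)
    return wrong / len(labels)


def select_classifier(classifiers, samples, labels):
    """Return (name, error) of the classifier with the smallest error."""
    errors = {
        name: classification_error(clf, samples, labels)
        for name, clf in classifiers.items()
    }
    best = min(errors, key=errors.get)
    return best, errors[best]


# Toy candidates: one threshold rule per hypothetical source domain.
candidates = {
    "source_A": lambda x: x > 0,
    "source_D": lambda x: x > 2,
}
target_samples = [-1, 1, 3, 4]
target_labels = [False, True, True, True]

best_name, best_error = select_classifier(candidates, target_samples, target_labels)
```

Here `source_A` classifies all four toy samples correctly, while `source_D` misclassifies the sample with value 1, so `source_A` is selected.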