A Two-Stage Deep Domain Adaptation Method for Hyperspectral Image Classification

Abstract: Deep learning has attracted extensive attention in the field of hyperspectral image (HSI) classification. However, supervised deep learning methods heavily rely on a large amount of labeled data. To address this problem, in this paper, we propose a two-stage deep domain adaptation method for hyperspectral image classification, which can minimize the data shift between two domains and learn a more discriminative deep embedding space with very few labeled target samples. A deep embedding space is first learned by minimizing the distance between the source domain and the target domain based on the Maximum Mean Discrepancy (MMD) criterion. The Spatial-Spectral Siamese Network is then exploited to reduce the data shift and learn a more discriminative deep embedding space, by minimizing the distance between samples from different domains with the same class label and maximizing the distance between samples from different domains with different class labels based on a pairwise loss. For the classification task, the softmax layer is replaced with a linear support vector machine, whose training minimizes a margin-based loss instead of the cross-entropy loss. The experimental results on two sets of hyperspectral remote sensing images show that the proposed method outperforms several state-of-the-art methods.


Introduction
Hyperspectral images (HSIs) contain rich spectral and spatial information, which is helpful to identify different materials in the observed scene. HSIs have been widely applied in many fields such as agriculture [1], environmental sciences [2], mineral exploitation [3], scene recognition [4], and defense [5]. Recently, supervised deep learning methods have attracted extensive attention in the field of hyperspectral image classification [6][7][8][9][10][11]. Although such supervised learning methods work well, they heavily rely on a large amount of labeled data. However, it is very time-consuming and expensive to collect labeled data for hyperspectral images. To alleviate this problem, semi-supervised learning [12] and active learning [13] are widely used in HSI classification. These methods all assume that pixels of the same surface coverage class have the same distribution in the feature space.
In real remote sensing applications, due to high labor costs of labeling or other natural limitations, the HSI scene of interest (called the target domain) has only a few labeled samples or even no labeled samples, while another similar scene (the source domain) may have sufficient labeled samples. To better classify the target domain, a natural idea is to use the class-specific information in the source domain to help target-domain classification. However, when the source and target domains are spatially or temporally different, their data distributions differ and direct transfer degrades performance. One prior deep domain adaptation approach trains the network by jointly minimizing the cross-entropy error, the MMD criterion, and the geometrical structure of the target data.
Wang et al. [25] proposed a domain adaptation method based on a neural network to learn a manifold embedding and match the discriminant distribution of the source domain. They matched the distribution of the target domain to the class distribution of the source domain with MMD, and at the same time added a manifold regularization on the target domain to avoid mapping distortion. Although deep domain adaptation methods based on MMD can reduce the data shift, they cannot learn a more discriminative embedding space.
In order to better solve the problems mentioned above, in this paper, we propose a two-stage deep domain adaptation method (TDDA) for hyperspectral image classification. In the first stage, according to the MMD criterion, the distribution distance between the source domain and the target domain is minimized to learn a deep embedding space, so as to reduce the distribution shift between domains. In the second stage, the Siamese architecture is exploited to further reduce the distribution shift and learn a more discriminative deep embedding space. In training, the pairwise loss minimizes the distance between samples from different domains with the same class label and maximizes the distance between samples from different domains with different class labels. In addition, a margin-based loss is minimized instead of the cross-entropy loss in the second stage. The softmax layer minimizes cross-entropy, while support vector machines (SVMs) instead seek the maximum margin between data points of different classes. In [27], Tang demonstrated a small but consistent advantage of replacing the softmax layer with a linear support vector machine: simply replacing softmax with a linear L2-SVM gave significant gains on popular deep learning datasets such as MNIST and CIFAR-10. Inspired by [27], we replace the cross-entropy loss with a margin-based loss.
The three major contributions of this paper are listed as follows: (1) A two-stage deep domain adaptation method for hyperspectral image classification is proposed, and this method only needs very few labeled target samples per class to obtain better classification performance. (2) Three criteria including MMD, pairwise loss and margin-based loss are minimized at different stages, which can minimize the distribution shift between two domains and learn a more discriminative feature embedding space to the target domain. (3) The Spatial-Spectral Siamese Network is exploited to learn deep spatial-spectral features, which tend to be more discriminative and reliable.
The rest of this paper is organized as follows. Section 2 presents the details of the proposed TDDA method. Section 3 evaluates the performances of TDDA compared with those of other hyperspectral image classifiers. A discussion of the results is provided in Section 4. Finally, the conclusions are drawn in Section 5.

Proposed Method
First, we introduce the symbols used throughout this paper; they are summarized in Table 1. Let D_l^s = (X_s, Y_s) = {(x_i^s, y_i^s)}_{i=1}^{N} be the N labeled samples in the source domain, D_u^t = {x_j^t}_{j=1}^{M} be the M unlabeled samples in the target domain, and D_l^t = (X_t, Y_t) = {(x_k^t, y_k^t)}_{k=1}^{Q} be the Q labeled samples in the target domain (the few labeled samples). x_i^s, x_j^t, x_k^t ∈ R^chn are the pixels in D_l^s, D_u^t, and D_l^t, respectively, each with chn bands. y_i^s, y_k^t ∈ {1, 2, ..., L} are the corresponding labels, where L is the number of classes.

Table 1. Symbols and their meanings.

Symbol                           Meaning
x_i^s, x_j^t, x_k^t ∈ R^chn      Pixels in D_l^s, D_u^t, and D_l^t with chn bands
y_i^s, y_k^t ∈ {1, 2, ..., L}    Corresponding class labels
L                                Number of classes

A Two-Stage Deep Domain Adaptation Framework
The framework of the TDDA method is shown in Figure 1, which consists of a training part and a testing part. In the training part, we divide the training process of TDDA into two stages to train the Spatial-Spectral Network. In the testing part, a large number of unlabeled pixels in the target domain are classified by the trained Spatial-Spectral Network. The two training stages are detailed below.

In the first stage, the inputs are the labeled samples of the source domain and the unlabeled samples of the target domain, and the classification loss (margin-based loss) and the domain-alignment loss (MMD) are minimized. The sample features of the source and target domains are extracted by the Spatial-Spectral Siamese Network (with weight sharing) [28]. Then, the distribution shift between the source domain and the target domain is minimized based on MMD. For the classification function, we use a linear support vector machine instead of the softmax layer and minimize a margin-based loss rather than the cross-entropy loss. After the first training stage is completed, the learned weights are used as the initial weights of the second stage.
In the second stage, the inputs are the labeled samples of the source domain and the few labeled samples of the target domain, and the classification loss (margin-based loss) and the domain-discriminative loss (pairwise loss) are minimized. Based on the pairwise loss, the distance between samples from different domains but of the same class is minimized, and the distance between samples from different classes in different domains is maximized. For the testing part, the inputs are the unlabeled samples of the target domain, and the outputs are the predicted labels.

The Spatial-Spectral Network
The CNN architecture generally consists of convolutional layers, pooling layers, and fully connected layers, and each layer is connected to its previous layer, so that abstract features of higher layers can be extracted from lower layers. Generally, deeper networks can extract more discriminative information [29], which is helpful for image classification. Neural networks usually have several fully connected layers that learn abstract features and output the network's final predictions. We assume the given training data is (X, Y) = {(x_i, y_i)}_{i=1}^{N}, so the feature output of the kth layer is:

ϕ^(k)(x_i) = g(W_k ϕ^(k−1)(x_i) + B_k),

where W_k represents the weight matrix, ϕ^(k−1) is the feature output of the (k − 1)th layer, B_k is the bias of the kth layer, and g(·) is a non-linear activation function, for example, the rectified linear unit g(x) = max(0, x) [30]. Hyperspectral images have abundant spatial and spectral information. Extracting advanced features from spatial and spectral branches respectively and fusing them can improve classification accuracy [31,32]. Therefore, in this section, the joint spatial-spectral features are extracted through the Spatial-Spectral Network.
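As a minimal sketch, the per-layer update ϕ^(k) = g(W_k ϕ^(k−1) + B_k) with a ReLU activation can be written as follows (toy shapes and random weights, purely illustrative):

```python
import numpy as np

def dense_layer(phi_prev, W, b):
    # One fully connected layer: phi_k = g(W_k @ phi_{k-1} + B_k), with g = ReLU.
    return np.maximum(0.0, W @ phi_prev + b)

rng = np.random.default_rng(0)
phi_prev = rng.normal(size=4)        # features from layer k-1
W = rng.normal(size=(3, 4))          # weight matrix W_k
b = np.zeros(3)                      # bias B_k
phi_k = dense_layer(phi_prev, W, b)  # features of layer k, all non-negative
```

Stacking such layers (with convolutions and pooling in the lower layers) yields the abstract features the paper feeds into the fully connected part.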
As shown in Figure 2, the Spatial-Spectral Network has two CNN branches, which are used to extract spatial and spectral features, respectively. In the spatial branch, we first reduce the dimensionality of the input hyperspectral image with Principal Component Analysis (PCA) [33,34], and then take a pixel and its neighborhood (with neighborhood size r = 4) as the input (9 × 9 × 10); the spatial output of this branch is ϕ_spa^k(x_i). In the spectral branch, we take the spectra of this pixel and its neighborhood (r = 1) as the input (3 × 3 × chn); the spectral output of this branch is ϕ_spe^k(x_i). We simultaneously feed the outputs of the two branches to the fully connected layer, and the joint spatial-spectral feature output is:

ϕ(x_i) = ϕ_spa^k(x_i) ⊕ ϕ_spe^k(x_i),

where ⊕ indicates that the spatial output and spectral output are concatenated, and the output ϕ(x_i) can be regarded as the final joint spatial-spectral feature.
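The patch extraction and ⊕ concatenation can be sketched as below. This is a hedged illustration on a toy cube: PCA is done via SVD, and the two CNN branches are stood in for by simple flattening, since only the input shapes (9 × 9 × 10 and 3 × 3 × chn) and the concatenation are taken from the paper:

```python
import numpy as np

def pca_reduce(cube, n_comp=10):
    # Flatten the H x W x chn cube to pixels and project onto the top
    # n_comp principal components (SVD of the mean-centered pixel matrix).
    h, w, chn = cube.shape
    X = cube.reshape(-1, chn)
    X = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return (X @ Vt[:n_comp].T).reshape(h, w, n_comp)

def joint_patch(cube, row, col, n_comp=10):
    # Spatial branch input: 9 x 9 x n_comp neighborhood (r = 4) after PCA.
    # Spectral branch input: 3 x 3 x chn neighborhood (r = 1), full bands.
    reduced = pca_reduce(cube, n_comp)
    spa = reduced[row - 4:row + 5, col - 4:col + 5, :]
    spe = cube[row - 1:row + 2, col - 1:col + 2, :]
    # Stand-ins for the two CNN branch outputs; here we just flatten and
    # concatenate, which plays the role of the "⊕" operation.
    return np.concatenate([spa.ravel(), spe.ravel()])

rng = np.random.default_rng(1)
cube = rng.normal(size=(20, 20, 30))   # toy HSI: 20 x 20 pixels, 30 bands
feat = joint_patch(cube, 10, 10)       # 9*9*10 + 3*3*30 = 1080 values
```

In the actual network each branch ends in convolutional feature maps rather than raw flattened patches, but the two-branch-then-concatenate structure is the same.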

The First Stage of TDDA
In [27], Tang demonstrated a small but consistent advantage of replacing the softmax layer with a linear support vector machine; inspired by this, we replace the softmax layer with an SVM in our method. An SVM is generally used for binary classification. We assume the label of a given training sample is p_i ∈ {−1, 1}. Because the L1-SVM is not differentiable, its variant, the L2-SVM, is adopted, which minimizes the squared hinge loss:

L_svm = (1/2)‖w‖² + C Σ_i max(0, 1 − p_i f(x_i))²,

where w is the normal vector of the separating hyperplane, C is the penalty coefficient, and f(·) is the prediction of the training data. To solve the multi-class classification problem, we adopt the one-versus-rest approach, which constructs L SVMs for an L-class problem; each SVM only needs to distinguish the data of one class from the data of all other classes.
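A minimal sketch of the squared hinge loss and one-versus-rest prediction follows (toy data; the bias term is omitted and C = 1 for brevity):

```python
import numpy as np

def l2_svm_loss(w, X, y, C=1.0):
    # Squared hinge (L2-SVM) loss: 0.5 * ||w||^2 + C * sum(max(0, 1 - y*f(x))^2),
    # with linear prediction f(x) = w . x and labels y in {-1, +1}.
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * w @ w + C * np.sum(margins ** 2)

def ovr_predict(Ws, x):
    # One-versus-rest: L linear SVMs; pick the class whose score is largest.
    return int(np.argmax([w @ x for w in Ws]))

X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
loss = l2_svm_loss(w, X, y)   # both margins satisfied -> only 0.5*||w||^2 remains

Ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two hypothetical class SVMs
pred = ovr_predict(Ws, np.array([0.0, 2.0]))
```

Unlike cross-entropy, this loss is zero for samples already beyond the margin, which is the margin-based behavior the paper exploits.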

In measuring the differences between domains, we consider that the source and target domains have similar but shifted distributions. Therefore, in the first stage, we use MMD to measure the distance between the two different but related distributions, which can be defined as:

MMD(X_s, X_t) = ‖ (1/N) Σ_{i=1}^{N} ϕ(x_i^s) − (1/M) Σ_{j=1}^{M} ϕ(x_j^t) ‖²,

A common embedding can be obtained by minimizing this distribution distance with MMD, and the main statistical properties of the data in the two domains are preserved.
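The empirical MMD above can be sketched as follows. This is a hedged simplification: it computes the squared distance between batch feature means directly (a linear-kernel estimate), whereas a kernelized RKHS variant could equally be used:

```python
import numpy as np

def mmd_squared(src_feats, tgt_feats):
    # Empirical MMD between source and target feature batches:
    # || mean(phi(x_s)) - mean(phi(x_t)) ||^2
    diff = src_feats.mean(axis=0) - tgt_feats.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(2)
src = rng.normal(loc=0.0, size=(200, 16))   # source-domain embeddings
tgt = rng.normal(loc=1.0, size=(200, 16))   # target embeddings, mean-shifted
shifted = mmd_squared(src, tgt)             # large: distributions differ
aligned = mmd_squared(src, src)             # zero: identical batches
```

Minimizing this quantity with respect to the embedding network's weights pulls the two domain distributions together in the learned feature space.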
Therefore, the weighted training criterion of the first stage can be written as:

L_1 = (1 − α) L_svm + α MMD(X_s, X_t).

To balance the classification part against the domain-alignment (MMD) part of the loss, the classification part is normalized and weighted by 1 − α and the MMD part by α.
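The convex combination of the two loss terms is straightforward; α = 0.25 below is the equilibrium value the experiments later settle on:

```python
def stage1_loss(cls_loss, mmd_loss, alpha=0.25):
    # Stage-1 objective: (1 - alpha) * classification (L2-SVM) loss
    #                    + alpha     * MMD domain-alignment loss.
    return (1.0 - alpha) * cls_loss + alpha * mmd_loss

total = stage1_loss(0.8, 0.4)  # 0.75 * 0.8 + 0.25 * 0.4
```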

The Second Stage of TDDA
In the second stage, the label information of a few target-domain samples is available. The network parameters learned in the first stage are used as initialization to retrain the network. In this stage, we use the pairwise loss to minimize the distribution distance between samples from different domains with the same class label and maximize the distance between samples from different domains with different class labels, so as to reduce the distribution shift and learn a more discriminative deep embedding space. The Euclidean distance between samples from the two domains is:

d(x_i^s, x_j^t) = ‖ ϕ(x_i^s) − ϕ(x_j^t) ‖_2,

where ‖·‖_2 represents the Euclidean norm. The pairwise loss between domains is then:

L_pair = (1 − δ) d(x_i^s, x_j^t)² + δ max(0, γ − d(x_i^s, x_j^t))²,

where δ = 0 means that the source-domain and target-domain samples belong to the same class (y_i^s = y_j^t), δ = 1 means that their classes differ (y_i^s ≠ y_j^t), and γ represents the margin threshold. Therefore, the weighted training criterion of the second stage can be expressed as:

L_2 = (1 − α) L_svm + α L_pair.

To balance the classification part against the domain-discriminative (pairwise) part of the loss, the classification part is normalized and weighted by 1 − α and the pairwise part by α.
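The pairwise term is a contrastive-style loss on one cross-domain pair; since the printed formula is garbled, the standard contrastive form matching the text's description is assumed here:

```python
import numpy as np

def pairwise_loss(f_src, f_tgt, delta, gamma=1.0):
    # Contrastive-style pairwise loss on one cross-domain feature pair.
    # delta = 0: same class      -> pull features together (d^2 term).
    # delta = 1: different class -> push apart up to margin gamma.
    d = np.linalg.norm(f_src - f_tgt)  # Euclidean distance between embeddings
    return (1 - delta) * d ** 2 + delta * max(0.0, gamma - d) ** 2

same = pairwise_loss(np.array([0.0, 0.0]), np.array([0.3, 0.4]), delta=0)
far = pairwise_loss(np.array([0.0, 0.0]), np.array([2.0, 0.0]), delta=1)
```

Same-class pairs are penalized by their squared distance (here d = 0.5), while different-class pairs already separated by more than γ contribute nothing, which is exactly the "minimize same-class, maximize different-class" behavior described above.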

Experiments

Data Sets Description
In this experiment, we conducted experiments on two sets of real-world hyperspectral remote sensing images: the Pavia University-Pavia Center dataset pair and the Shanghai-Hangzhou dataset pair.
First, the Pavia University and Pavia Center datasets were acquired by the ROSIS sensor during a flight campaign over Pavia in northern Italy [16]. The numbers of spectral bands acquired for the Pavia University and Pavia Center datasets are 103 and 102, respectively; by removing one band from the Pavia University dataset, both datasets have 102 spectral bands. Pavia University is a 610 × 610 pixel image, while Pavia Center is a 1096 × 1096 pixel image, but some regions in both images contain no information and must be discarded before analysis. Therefore, the Pavia University image is cropped to 610 × 315 pixels, and the Pavia Center image to 1096 × 715 pixels, as shown in Figures 3a and 4a, respectively. We select the seven classes that both scenes share, including trees, asphalt, self-blocking bricks, bitumen, shadows, meadows, and bare soil, as shown in Figures 3b and 4b, respectively. The names of the land cover classes and the numbers of samples for the Pavia University-Pavia Center dataset pair are listed in Table 2.
Second, the Shanghai and Hangzhou datasets were both captured by the EO-1 Hyperion hyperspectral sensor over Shanghai and Hangzhou [14]. The sensor acquired 220 spectral bands in both scenes, leaving 198 spectral bands after removing bad bands. Shanghai is 1600 × 230 pixels, and Hangzhou is 590 × 230 pixels, as shown in Figure 5a. In this experiment, we selected three classes, including water, ground/buildings, and plants, as shown in Figure 5b. The names of the land cover classes and the numbers of samples for the Shanghai-Hangzhou dataset pair are listed in Table 3.

Experimental Setup
In the Spatial-Spectral Network, each branch of the model consists of two convolutional layers, one pooling layer, and one dropout layer. The spatial and spectral features obtained from the two branches are combined to obtain a joint spatial-spectral feature, and the final joint spatial-spectral feature is obtained through three fully connected layers. The parameters of all network layers except the last layer are transferred to the second stage to retrain the network.
In the TDDA method, both training stages are optimized using the Adam optimization algorithm [35]. In the first stage, the number of training epochs is set to 100, the batch size to 128, and the learning rate to 0.001. In the second stage, the number of training epochs is set to 80, the batch size to 80, and the learning rates for the Pavia University-Pavia Center and the Shanghai-Hangzhou dataset pairs are set to 0.0001 and 0.00001, respectively. The specific parameters of the network are shown in Table 4. In addition, we performed a comparative experiment on the choice of the equilibrium parameter α on the Pavia University → Pavia Center dataset. As shown in Table 5, when the equilibrium parameter α is 0.25, both the OA and AA of the experiment achieve their best values, and the result has the smallest degree of dispersion. Therefore, we take α = 0.25 as the equilibrium parameter of the experiments. In order to verify the effectiveness of our method, we compared the TDDA method with several recent methods, namely Mei et al. [16], Yang et al. [17], and Wang et al. + fine-tuning (FT) [25]; in addition, the first stage of TDDA followed by fine-tuning (First-Stage + FT) is included as a baseline. For the classification results on the different datasets, we consider the following four cases: Pavia University → Pavia Center, Pavia Center → Pavia University, Shanghai → Hangzhou, and Hangzhou → Shanghai, where each pair denotes source dataset → target dataset. All of the above experiments were performed on a workstation equipped with an AMD Ryzen 5 4000 quad-core processor at 3.2 GHz and 8 GB of RAM.
For a fair comparison, the same training and test sets are used for all methods. The overall accuracy (OA), average accuracy (AA), and kappa coefficient are used to evaluate the classification performance of all methods. All methods are run 10 times, and the average result, together with the standard deviation over the 10 runs, is reported to reduce the impact of random selection.
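For reference, the three evaluation metrics can be computed from a confusion matrix as sketched below (a small hand-checkable example; class labels are zero-indexed here):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    # Overall accuracy (OA), average per-class accuracy (AA), and Cohen's
    # kappa coefficient, all derived from the confusion matrix.
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                      # fraction correct overall
    aa = np.mean(np.diag(cm) / cm.sum(axis=1)) # mean of per-class accuracies
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
oa, aa, kappa = classification_metrics(y_true, y_pred, 3)  # 5/6, 5/6, 0.75
```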

Experimental Results
To demonstrate the superiority of the proposed TDDA method, we compared our method with the other methods on the two dataset pairs. In these experiments, the training set consists of two parts: 200 labeled samples randomly selected from the source domain and 5 labeled samples randomly selected from the target domain. The remaining samples in the target domain are used as the test set. In Tables 6-9, the overall accuracy (OA), average accuracy (AA), and kappa coefficient are used as performance criteria. Tables 6-9 list the experimental results of the different methods on the Pavia University → Pavia Center, Pavia Center → Pavia University, Shanghai → Hangzhou, and Hangzhou → Shanghai cases, and Figures 6-9 show the corresponding classification maps of all methods. As can be seen from the bold classification accuracies in Tables 6-9, TDDA achieves higher OA and AA than the other methods in all cases and has a smaller standard deviation in most cases, which demonstrates the effectiveness and stability of the TDDA method. In addition, as can be seen from the classification maps in Figures 6-9, the classification maps obtained by the proposed TDDA method are the most accurate. The detailed analysis of Tables 6-9 is as follows. Table 6 shows the classification performance from the Pavia University to the Pavia Center dataset. It can be seen from Table 6 that, compared with Wang et al. + FT, our proposed TDDA increases OA by 1.09%, which shows that TDDA not only reduces the domain shift, but also obtains a more discriminative feature space. Table 7 shows the classification performance from the Pavia Center to the Pavia University dataset. As can be seen from Table 7, the methods with the domain shift reduction + fine-tuning strategy are better than the methods with the fine-tuning strategy alone; compared with Mei et al., TDDA increases OA by 11.49%.
The training and testing times provide a direct measure of the computational efficiency of TDDA. All experiments were carried out on a workstation equipped with an AMD Ryzen 5 4000 quad-core processor at 3.2 GHz and 8 GB of RAM. Table 10 shows the training and testing times of the different methods in the different cases. It can be seen that the training of the TDDA method takes the longest time.
This is because the training of TDDA is divided into two stages; in particular, in the second stage, cross-domain sample pairs with the same label and with different labels are fed into the network in pairs, which increases the computational time. Although the training time of TDDA is longer than that of the other methods, its classification accuracies are better than those of all the other methods. Table 9 shows the classification performance from the Hangzhou to the Shanghai dataset. In addition, in order to further verify the effectiveness of the proposed method, we extend the above experiment, where one to five labeled samples per class are randomly selected from the target domain. Figure 10 shows the classification performance of each method on Pavia University → Pavia Center.
It can be seen from Figure 10 that the classification performance of the TDDA method is better than that of the other methods regardless of the number of labeled samples in the target domain, which indicates that even when the number of labeled target samples is small, TDDA can still learn discriminative features and achieve better classification performance on hyperspectral images. Figure 11 shows the classification performance of all methods on the Pavia Center → Pavia University dataset. As can be seen from Figure 11, TDDA has obvious advantages over the other methods. Figures 12 and 13 show the classification performance of all methods on the Shanghai → Hangzhou and Hangzhou → Shanghai datasets. As can be seen from Figures 12 and 13, as the number of labeled target samples per category increases, the OAs and AAs of TDDA do not change significantly, although overall there is still a slight upward trend. This result is due to the pairwise loss and margin-based loss in the second stage, which extract more discriminative features. In addition, the classification performance of the model is already good when there is only one labeled sample per class in the target domain, possibly because of the small number of categories in the Shanghai-Hangzhou dataset (only three).

As can be seen from Figures 10-13, the methods using only the fine-tuning strategy are not stable compared with the methods using the domain shift reduction + fine-tuning strategy in the different experiments. From these results, it can be concluded that when the source domain has sufficient labeled samples and the target domain has only a small number of labeled samples, TDDA is the most effective and stable classification method among those compared. Secondly, the Wang et al. + FT and First-Stage + FT methods use the domain shift reduction strategy to minimize the distribution distance between domains, and then the fine-tuning strategy is used to perform knowledge transfer. As can be seen from Tables 6-9 and Figures 10-13, the methods using the domain shift reduction + fine-tuning strategy are more stable than the methods using only the fine-tuning strategy, which indicates that a common feature space better suited to the source and target domains can be obtained by minimizing the distribution distance between domains, and more stable classification results can be achieved by fine-tuning on this common feature space.

Discussion
Thirdly, TDDA uses the domain shift reduction strategy to minimize the distribution distance between the two domains at different stages, where the labeled samples from the target domain are used to learn a more discriminative feature space rather than merely to fine-tune the corresponding network model. The above experimental results show that, in the case of very few labeled samples in the target domain, the proposed TDDA method has a very obvious advantage in classification accuracy over the other deep-learning-based methods. As can be seen from Tables 6-9, compared with the First-Stage + FT method, TDDA increases OA by 1.36%, 6.08%, 9.99%, and 3.39%, respectively, which fully demonstrates the effectiveness of the second stage. Compared with the Wang et al. + FT method, TDDA increases OA by 1.09%, 4.42%, 6.84%, and 2.98%, respectively. It can be seen from these experiments that TDDA not only reduces the domain shift between the source and target domains, but also learns a discriminative embedded feature space that is better suited to the target domain. As can be seen from Tables 6-9, compared with the fine-tuning-only strategy (the Mei et al. and Yang et al. methods), TDDA also achieves better classification performance. The OA of TDDA is 3.28%, 11.49%, 1.17%, and 1.17% higher than that of the Mei et al. method, respectively, and 18.68%, 17.44%, 16.45%, and 3.6% higher than that of the Yang et al. method, respectively. In addition, it can be seen from Figures 10-13 that even with fewer labeled samples from the target domain, TDDA still has better classification performance than the other methods, which demonstrates the effectiveness and stability of TDDA.
Finally, the proposed TDDA method is divided into two training stages, which leads to a relatively long training time and means that TDDA is more computationally expensive than the other methods. Fortunately, the adoption of GPUs greatly alleviates this extra computational cost.

Conclusions
In this paper, we propose a novel two-stage deep domain adaptation method for hyperspectral image classification. Compared with previous networks, TDDA consists of two training stages and designs a Spatial-Spectral Siamese Network for extracting spatial-spectral features. The first stage obtains a deep common embedding feature space by minimizing the MMD and a margin-based loss, which reduces the domain shift between the source and target domains. In the second stage, based on the pairwise loss and the margin-based loss, the few labeled samples from the target domain are used to learn a deep common embedding feature space that is more discriminative for the target domain. Compared with other methods, this method can simultaneously extract the abundant joint spatial-spectral information of the source and target domains through the Spatial-Spectral Siamese Network; minimize three criteria (MMD, pairwise loss, and margin-based loss) to reduce the distribution shift between the two domains; and use a few labeled target-domain samples to learn a more discriminative deep common embedding space, thereby improving the classification performance on the target domain. Analysis of the experimental results on two sets of hyperspectral remote sensing images demonstrates that our method not only performs better than the other methods, but also extracts feature representations that are more discriminative for the target domain. In the future, we will further study the classification of hyperspectral images based on heterogeneous transfer learning.