Hyperspectral Image Classification Based on Cross-Scene Adaptive Learning

Abstract: To address few-shot classification of hyperspectral remote sensing images, this paper proposes a classification method based on cross-scene adaptive learning. First, cross-scene knowledge transfer is carried out with unsupervised domain adaptation to reduce the differences between the source scene and the target scene. At the same time, depthwise over-parameterized convolution is used in the deep embedding model to improve convergence speed and feature extraction ability. Second, two symmetrical subnetworks are designed in the model to further reduce the differences between the source scene and the target scene. Then, the Manhattan distance is learned in the Manhattan metric space to reduce the computational cost of the model. Finally, a weighted K-nearest neighbor classifier is introduced, in which the weighted Manhattan metric distance is assigned to the clustered samples to improve the handling of imbalanced hyperspectral image data. The effectiveness of the proposed algorithm is verified on the Pavia and Indiana hyperspectral datasets, where the overall classification accuracy is 90.90% and 65.01%, respectively. Compared with six other hyperspectral image classification methods, the proposed cross-scene method achieves higher classification accuracy.


Introduction
A hyperspectral sensor (i.e., a spectral imager) images the object to be detected simultaneously in tens to hundreds of continuous, finely subdivided spectral bands of the electromagnetic spectrum. Hyperspectral sensing images (HSI) are three-dimensional images combining spatial and spectral information [1]. With the rich information in the third dimension, HSI can more accurately subdivide and identify ground objects in the spectral space and have been widely used in military target reconnaissance [2], forestry monitoring [3], vegetation research [4], agriculture [5], chemistry [6][7][8], environmental science [9,10] and other fields. However, labeled samples of hyperspectral images are limited, and manually collecting them is time-consuming and expensive. Therefore, how to classify with limited samples without expanding the data source becomes important [11].
In the early stage of hyperspectral image classification, researchers used only spectral information, with classifiers including K-nearest neighbor (KNN) [12], random forest (RF) [13] and support vector machine (SVM) [14,15]. Although these traditional methods effectively solved the problem of spectral information redundancy, they still have limitations. They do not exploit the inherent spatial structure of hyperspectral data, so the classification model has difficulty handling the phenomena of "different objects with the same spectrum" and "the same object with different spectra" [16,17]. To solve this problem, researchers incorporated spatial information into hyperspectral classification and developed methods such as the extended morphological profile (EMP). Liu et al. proposed a visual saliency-based extended morphological profile (VS-EMP) model, which is combined with SVM to improve the classification accuracy [18]. However, the feature extraction of this method depends on manual settings, and the implementation process is somewhat complex.
In recent years, deep learning has been widely used in image classification because of its powerful performance in automatic feature extraction and in learning hierarchical structures [19]. In particular, convolutional neural networks (CNNs), with their powerful feature expression ability, have been successfully applied to HSI classification [20][21][22][23][24][25]. To achieve good classification results, these methods often need a large number of training samples. However, labeled samples are difficult to obtain, which makes it harder to train network models. To alleviate this problem, Pan et al. proposed the multi-grained network (MugNet), which can be regarded as a simplified deep learning model. Its multi-grained scanning strategy makes full use of spectral and spatial information to improve the feature acquisition ability of the model, and it uses a semi-supervised method to generate convolution kernels, which reduces the model's dependence on samples [26]. Sun et al. combined the attention mechanism with a CNN to suppress the influence of interfering pixels, capture the most significant features, and improve the CNN's ability to distinguish ground objects [27]. He et al. proposed a heterogeneous transfer learning method that fully trains the VGG-16Net model on ImageNet and then adjusts the network parameters to transfer it to hyperspectral datasets for effective classification [28]. Although these methods reduce the dependence of models on samples to a certain extent, they still need a certain number of training samples to achieve good classification results.
To further reduce the dependence of the model on samples, some researchers have recently proposed classification models suitable for small-sample HSI data. Yang et al. proposed a new network called the relationship network, which learns to compare the categories of samples based on the similarity measure between sample pairs [29]. Rao et al. proposed the space-spectral relationship network to measure the deep similarity between samples and to increase the discrimination ability of the model on a small number of features by exploring the similarity measure between different samples [30]. Zhang et al. proposed a global prototype network, which projects the original data space into an embedded feature space, learns the vectors represented by the global prototypes, and completes the classification with a KNN classifier through the similarity between vectors [31]. Although these models perform better on small-sample classification, in practical applications the data to be classified often has no labeled samples at all. In that case, the above models cannot obtain the similarity between samples and cannot classify them effectively.
Because labeling samples is time-consuming and laborious, many HSI scenes contain few or no labeled samples. However, as the number of HSI increases, a similar HSI scene can often be found [32]. Some researchers have therefore proposed cross-scene classification, which uses a similar source hyperspectral scene with many labeled samples to classify a target hyperspectral scene with no labeled samples or only a few [33]. Kemker et al. input a large amount of source scene data into a stacked convolutional autoencoder to learn similar features, and the resulting encoder can be used to classify the target scene after fine-tuning [34]. However, because of the differences between hyperspectral scenes, the source scene cannot be directly used to train the classifier, and this method does not fully consider these differences, which is a major problem faced by cross-scene HSI classification. Therefore, Du et al. proposed the idea of domain adaptation, which reduces the differences between scenes and transfers knowledge for target scene classification by learning the features of a common subspace [35]. Deng et al. proposed a cross-scene classification model based on deep metric learning, using unsupervised domain adaptation to reduce the differences between scenes and effectively use the source scene to classify the target scene [36]. However, this method still has shortcomings: it ignores the insufficient feature acquisition ability of the traditional 2D convolutional neural network [37] and the imbalance of data categories in hyperspectral data [38].
This paper proposes a cross-scene adaptive learning classification model, which reduces the dependence of the model on samples and improves its ability to handle imbalanced data categories. Compared with traditional methods (RBF-SVM, EMP-SVM) and deep learning methods (DCNN, ED-DMM-UDA, MDDUK, MDUWK), the classification accuracy is significantly improved. The rest of this paper is organized as follows. Section 2 briefly introduces the relevant algorithms and the improvements made in this paper. Section 3 describes the experimental results and their analysis. Section 4 summarizes the conclusions and future work.

The Proposed Methods
The cross-scene classification model proposed in this paper is shown in Figure 1. The model is composed of four core parts: the deep hyperparametric embedding model, the discriminator model, the Manhattan metric model and the WKNN classifier. The scene with training samples is called the source scene, and the scene without training samples is called the target scene. First, the samples of the different scenes are input into the deep hyperparametric embedding model, where the multi-dimensional feature extraction of the deep hyperparametric convolution layers is used to generate clusters of samples with the same category. Embedding spaces of the same size are generated by the symmetrically structured networks to reduce the difference between the source scene and the target scene. Then, the clustered samples are projected into the discriminator, and unsupervised domain adaptation is used to transfer cross-scene knowledge, such that the target scene forms a distribution similar to that of the source scene. Next, the processed target scene samples are mapped into the Manhattan metric space to learn the metric distance between any two samples, and the samples close to the cluster center are given greater weight. Finally, the classification result is obtained by the weighted K-nearest neighbor classifier.

The Deep Hyperparametric Embedding Model
The deep hyperparametric embedding model is composed of a deep convolutional neural network (DCNN) and depthwise over-parameterized convolution (DO-Conv). After the introduction of DO-Conv, the automatic feature extraction ability of the cross-scene model is retained, and the slow convergence caused by the depth of the embedding model is alleviated. The model can also be regarded as the feature extractor of the cross-scene model. In order to extract the features of small samples more effectively, a small convolution kernel is used in the feature extractor. A small convolution kernel reduces the receptive field of the neural network, but increasing the kernel size also increases the number of parameters of the network model, which is not conducive to small-sample training. To address this, Li et al. proposed depthwise over-parameterized convolution. Replacing the conventional convolution layer with depthwise over-parameterized convolution, without changing the size of the convolution kernel, accelerates the convergence of the model and adds learnable parameters without increasing the computational complexity [39]. There are two composition methods in DO-Conv, namely feature composition and convolution kernel composition. This paper trains the network with kernel composition in the deep hyperparametric embedding model to improve computational efficiency.
First, in order to learn the embedded features, a training set of n samples is defined, where n is the total number of training samples. Each source training sample $x_i$ and target training sample $x_j$ has a corresponding label, $y_i$ and $y_j$, respectively. The embedded source features are defined as $E_{\phi}(x_i)$ and the embedded target features as $E_{\phi}(x_j)$. In this space, the original samples are reconstructed to form clusters of similar samples, so that they can be better projected into the Manhattan metric space. The deep hyperparametric embedding model is shown in Figure 2, which also gives the output of each layer; dots of different colors represent different categories, and red circles represent cluster centers. The deep hyperparametric embedding model in this paper is composed of two subnetworks with symmetrical structure, DO-CNN1 and DO-CNN2. Symmetrical structure means that the layers and parameter settings of the two subnetworks are the same, which yields outputs of the same size and reduces the differences between the source scene and the target scene, such that the source scene can better guide the classification of the target scene. DO-CNN1 consists of deep hyperparametric convolution layers, batch normalization (BN), the ReLU activation function, average pooling (AvgPool) and a fully connected layer (FC). In order to train on a small number of samples more effectively, the size of the deep hyperparametric convolution kernel is set to 1 × 1 with a padding of 0 and a stride of 1, the size of the average pooling kernel is 5 × 5, and the output size of the fully connected layer is 1 × 128. Assuming that the input sample is 5 × 5 × nBand, the low-level features are extracted through the first hyperparametric convolution layer, which outputs a feature map of size 5 × 5 × 200. Higher-level features are then extracted by the second, third and fourth layers to generate a feature map of size 5 × 5 × 200. The average pooling layer then compresses the spatial size, producing an output feature map of size 1 × 1 × 200. Finally, the fully connected layer produces a network output of size 1 × 128. The output of each layer of DO-CNN2 is the same as that of DO-CNN1. In this way, embedding spaces of size 1 × 128 are formed, namely the source embedding space $E_{\phi_S}(S)$ and the target embedding space $E_{\phi_T}(T)$.
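The layer sizes described above can be checked with a short shape trace. This is a minimal sketch, not the authors' code: the function names `conv_out` and `trace_do_cnn` are hypothetical, and the pooling stride is assumed to equal the pooling kernel size, which the text does not state.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_do_cnn(n_band, patch=5, filters=200, embed=128):
    """Return (name, shape) pairs for one subnetwork (hypothetical helper)."""
    shapes = [("input", (patch, patch, n_band))]
    side = patch
    for layer in range(1, 5):
        # Four 1 x 1 convolution layers with padding 0 and stride 1
        side = conv_out(side, kernel=1, stride=1, padding=0)
        shapes.append((f"do_conv{layer}", (side, side, filters)))
    # 5 x 5 average pooling; stride equal to the kernel size is an assumption
    side = conv_out(side, kernel=5, stride=5)
    shapes.append(("avg_pool", (side, side, filters)))
    # Fully connected layer producing the 1 x 128 embedding
    shapes.append(("fc", (1, embed)))
    return shapes


if __name__ == "__main__":
    for name, shape in trace_do_cnn(n_band=220):
        print(name, shape)
```

Because the 1 × 1 kernels leave the spatial size unchanged, the 5 × 5 window is preserved until the pooling layer collapses it to 1 × 1.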
Depthwise over-parameterized convolution is composed of a conventional convolution with kernel $W \in \mathbb{R}^{C_{out} \times D_{mul} \times C_{in}}$ and a depthwise convolution with kernel $D \in \mathbb{R}^{(M \times N) \times D_{mul} \times C_{in}}$. In conventional convolution, the convolution layer processes the input data in a sliding-window manner, and each element of the output feature is the dot product of a horizontal slice of the convolution kernel and an image patch P. In the depthwise convolution layer, the depthwise convolution kernel is convolved with each input channel separately during the training stage. After training, the multi-layer composite linear operation used for over-parameterization is folded into a compact single-layer representation. Only this single layer is used at inference, which makes the computation exactly equivalent to that of a conventional layer.
In Figure 3, M and N are the spatial dimensions of P, $C_{in}$ is the number of input feature maps, $D_{mul}$ is the depth multiplier of the depthwise convolution, $C_{out}$ is the number of output feature maps, $D^T \in \mathbb{R}^{D_{mul} \times (M \times N) \times C_{in}}$ is the transpose of $D$, and the composed convolution kernel of DO-Conv is $W'$. First, the depthwise convolution kernel $D^T$ is combined with the conventional convolution kernel $W$ to generate $W'$; then $W'$ is convolved with P to output the features O.
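The two composition orders can be compared on a single image patch. The following is a minimal pure-Python sketch under stated assumptions: the tensors are nested lists indexed as D[k][d][c], W[o][d][c] and P[k][c] (k over the M × N positions, d over the depth multiplier, c over input channels, o over output channels), and all function names are hypothetical. It is not the DO-Conv reference implementation.

```python
import random


def random_tensor(rng, *shape):
    """Nested list of the given shape filled with uniform random values."""
    if len(shape) == 1:
        return [rng.uniform(-1.0, 1.0) for _ in range(shape[0])]
    return [random_tensor(rng, *shape[1:]) for _ in range(shape[0])]


def compose_kernel(D, W):
    """Kernel composition: W'[o][k][c] = sum_d D[k][d][c] * W[o][d][c]."""
    c_out, d_mul, c_in = len(W), len(W[0]), len(W[0][0])
    return [[[sum(D[k][d][c] * W[o][d][c] for d in range(d_mul))
              for c in range(c_in)]
             for k in range(len(D))]
            for o in range(c_out)]


def apply_kernel(W_prime, P):
    """O[o] = sum over positions k and channels c of W'[o][k][c] * P[k][c]."""
    return [sum(W_prime[o][k][c] * P[k][c]
                for k in range(len(P)) for c in range(len(P[0])))
            for o in range(len(W_prime))]


def feature_composition(D, W, P):
    """Apply the depthwise kernel to the patch first, then the conventional kernel."""
    d_mul, c_in = len(D[0]), len(D[0][0])
    P_prime = [[sum(D[k][d][c] * P[k][c] for k in range(len(D)))
                for c in range(c_in)]
               for d in range(d_mul)]
    return [sum(W[o][d][c] * P_prime[d][c]
                for d in range(d_mul) for c in range(c_in))
            for o in range(len(W))]


if __name__ == "__main__":
    rng = random.Random(0)
    D = random_tensor(rng, 4, 5, 3)   # (M x N) = 4, D_mul = 5, C_in = 3
    W = random_tensor(rng, 2, 5, 3)   # C_out = 2
    P = random_tensor(rng, 4, 3)
    print(apply_kernel(compose_kernel(D, W), P))
    print(feature_composition(D, W, P))
```

Both orders give the same output up to floating-point rounding. With kernel composition, the composed kernel has the shape of a conventional kernel, which is consistent with the statement that the inference cost matches that of a conventional layer.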

The Discriminator Model
The purpose of the discriminator model is to reduce the difference between the source scene and the target scene through unsupervised domain adaptation, such that the target scene forms a distribution similar to that of the source scene and can be better projected into the metric space. The network structure is shown in Figure 4. It is composed of fully connected layers (FC), batch normalization (BN) and the ReLU activation function. The input consists of two isolated embedding spaces of size 1 × 128, the source embedding space $E_{\phi}(x_i)$ and the target embedding space $E_{\phi}(x_j)$. The two isolated embedding spaces are mapped into common embedding space 1 through the first fully connected layer (FC = 1 × 128) to learn the characteristics of the source scene in this space. After common embedding space 1 is optimized by the second fully connected layer (FC = 1 × 64), the size of common embedding space 2 is 1 × 64. Then, the third fully connected layer (FC = 1 × 64) further optimizes common embedding space 2 so that the target scene forms a distribution similar to that of the source scene, giving common embedding space 3. However, this space still contains the samples of the source scene. Therefore, the samples of the source scene are removed through the last fully connected layer (FC = 1 × 2), and only the samples of the target scene are left to generate common embedding space 4. The difference between the scenes is reduced by the confusion discriminator. The domain confusion loss function is defined in Equation (2), where $\phi_T$ and $\theta$ are the parameters of the target embedding model $E_{\phi_T}$ and the discriminator module $D_{\theta}$, respectively, and the feature embedding distributions of the source $E_{\phi_S}(S)$ and the target $E_{\phi_T}(T)$ are adjusted by $\theta$.
The discriminator $D_{\theta}$ trains $\phi_T$ to confuse the source scene and the target scene, trains $\theta$ to distinguish the embedding features $E_{\phi_S}(S)$ and $E_{\phi_T}(T)$ of the two scenes in the projected space, and constantly updates the embedding space of the target scene.
The clustering center $C_K$ is calculated as the average of the embedding features of class K. Minimizing Equation (4) promotes the embedding of scene T to form a distribution similar to that of scene S, which helps the discriminator to confuse the data of the source scene and the target scene, such that the target scene can better learn from the source scene.
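The cluster-center computation described here, a per-class average of embedding features, can be sketched as follows. The function name `cluster_centers` and the list-based data layout are assumptions made for illustration only.

```python
from collections import defaultdict


def cluster_centers(embeddings, labels):
    """Return {label: mean embedding vector} for every class present."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        # Accumulate the element-wise sum of the embeddings of this class
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


if __name__ == "__main__":
    feats = [[0.0, 0.0], [2.0, 4.0], [1.0, 1.0]]
    print(cluster_centers(feats, ["a", "a", "b"]))
```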

Manhattan Metric Model
The Manhattan metric model learns the metric distance between samples and thereby converts the sample distances into a beneficial structure of the metric space, because a good distance metric has small intraclass spacing and large interclass spacing [40]. The reconstructed space still has the high dimensionality of hyperspectral data. In order to retain the characteristic information of the different dimensions, this paper uses the Manhattan metric to obtain the attributes of samples. It requires less computation than the commonly used Euclidean metric and makes up for the shortcoming that the Euclidean metric does not consider the variability of values in all dimensions [41]. The Manhattan distance metric, defined in Equation (5), reduces the calculation cost of the cross-scene model. After any two samples $x_p$ and $x_q$ are embedded in the reconstructed space, their metric value can be calculated through the Manhattan metric model, as shown in Figure 5. $E_{\phi_T}(x_p)$ and $E_{\phi_T}(x_q)$ are the embedding features formed by the different samples after passing through the embedding model, and the corresponding metric value $m_{p,q}$ is generated after passing through the metric model. The triplet loss is given in Equation (6), where samples $x_p$ and $x_q$ have the same label, and $x_p$ and $x_l$ have different labels. Equation (6) encourages the embedding of $x_p$ to be closer to $x_q$ than to $x_l$ by at least a margin of 1. Optimizing this term moves the samples with the same label into a cluster and pushes those with different labels far away from each other. Therefore, the expected embedding, shown in Figure 5, allows classification to be implemented efficiently with the KNN classifier based on the metric distance defined in Equation (5) (the Manhattan distance in this case).
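To make the margin condition concrete, the sketch below uses the plain Manhattan (L1) distance and a standard hinge-style triplet loss with a margin of 1. It is an illustration under assumptions: the paper's metric value is produced by a learned metric model, whereas here the distance is computed directly on the embedding vectors, and the function names are hypothetical.

```python
def manhattan(u, v):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss: zero once the negative is farther than the
    positive by at least `margin` (assumed standard form)."""
    return max(0.0, manhattan(anchor, positive) - manhattan(anchor, negative) + margin)


if __name__ == "__main__":
    x_p, x_q, x_l = [0.0, 0.0], [1.0, 0.0], [1.5, 0.0]
    print(manhattan(x_p, x_q))          # distance to the same-label sample
    print(triplet_loss(x_p, x_q, x_l))  # positive loss: margin not yet satisfied
```

The Manhattan distance needs only subtractions, absolute values and additions, with no squaring or square root, which is one way to read the claim of lower computation than the Euclidean metric.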

The Weighted K-Nearest Neighbor
In order to improve the ability of the cross-scene model to handle imbalanced data categories, different weights are given to the samples after they pass through the Manhattan metric model. The weighted classifier is called the weighted K-nearest neighbor (WKNN) classifier. First, this paper defines the adaptive weight of each class of samples in the hyperspectral dataset as shown in Equation (7), where $W_{C_k}$ is the weight of each class of samples, $K$ is the category of the sample and $CN_K$ is the number of samples in the $K$-th class.
Table 1 lists the weight of each class of samples in the Indiana and Pavia datasets. In the Indiana dataset, wheat is 3.65, 6.86 and 6.55 times higher than concrete/asphalt, orchard and soybeans cleanTill EW, respectively. These two datasets therefore have imbalanced data categories. However, it is difficult for the traditional KNN classifier to process category-imbalanced data effectively [42]. This makes cross-scene hyperspectral data harder to classify, and the loss of feature information at some sample points reduces the ability of the model to discriminate ground objects. Therefore, this paper gives adaptive weights to the different sample points in the Manhattan metric space, where the weight is adjusted automatically according to the sample distance. As shown in Figure 6, the darker the color of a sample point, the greater its weight. Such sample points provide the most significant feature information for the model, which improves its robustness.
In order to enable KNN to process category-imbalanced data more effectively, this paper increases the weight of the samples close to the embedding center $C_K$, which are considered to contribute more to the classification, and gives less weight to the samples that contribute less. In WKNN, the Manhattan distances calculated by the metric model are sorted and the corresponding weight $W_{C_k}$ is assigned. Then the K nearest neighbor sample points are found to calculate the occurrence probability of each category, and the category with the highest probability of occurrence is the ground object category finally determined by the KNN classifier. The weight is given in Equation (8), and the weighted distance is given in Equation (9), where D c W C k represents the distance information of the K th class sample.
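The general idea of class-weighted voting among the nearest neighbors can be sketched as follows. This is only an illustration: Equations (7) to (9) are not reproduced here, so the inverse-frequency weight in `class_weights` and the weighted vote in `wknn_predict` are assumed stand-ins, and both function names are hypothetical.

```python
from collections import Counter


def manhattan(u, v):
    """Manhattan (L1) distance between two vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def class_weights(labels):
    """Assumed inverse-frequency weights: rare classes get larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}


def wknn_predict(query, samples, labels, k=3):
    """Sort samples by Manhattan distance, keep the k nearest, and let each
    neighbor vote with the weight of its class (assumed voting rule)."""
    weights = class_weights(labels)
    ranked = sorted(zip(samples, labels), key=lambda sl: manhattan(query, sl[0]))
    scores = {}
    for _, lab in ranked[:k]:
        scores[lab] = scores.get(lab, 0.0) + weights[lab]
    return max(scores, key=scores.get)


if __name__ == "__main__":
    xs = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [1.0]]
    ys = ["a"] * 6 + ["b"]
    print(wknn_predict([0.8], xs, ys, k=3))
```

In this toy example, two of the three nearest neighbors belong to the majority class, so an unweighted vote would pick it, while the weighted vote favors the minority class whose single sample is closest to the query.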

Experimental Datasets Description
In order to verify the effectiveness of the proposed model for hyperspectral data classification, two public hyperspectral datasets, Indiana and Pavia, are used to test the classification algorithm. The Indiana scene data were acquired by the AVIRIS sensor over northwest Tippecanoe County [43]. Two separate datasets were selected as the source scene and the target scene, both of which have the same size of 400 × 300 pixels and 220 bands. The two scenes share seven land cover classes for classification. The Pavia scene data were obtained by the DAIS sensor over the city of Pavia, Italy. The Pavia University image, of size 243 × 243 × 72, is used as the source scene for training, and the image of the central area of Pavia, of size 400 × 400 × 72, is used as the target scene [44]. The two scenes share six land cover classes for classification. The details of the two datasets are shown in Table 2, and the ground truth maps are shown in Figures 7 and 8. C1 to C7 represent the different categories.

Experimental Platform Parameters Setting
In this paper, Windows 7 is used as the operating system. The experimental environment is an Intel(R) Core(TM) i5-6500 CPU @ 3.2 GHz, 16 GB of RAM and an NVIDIA GeForce GTX 1060 GPU. The deep learning framework is PyTorch, programmed in Python. All experimental results are the average of 10 experiments. Because reducing the number of parameters in the network is conducive to small-sample training, the spatial size of all input data is set to 5 × 5. The number of epochs is set to 1000, SGD is used as the optimizer, the momentum is set to 1 and the learning rate is set to 0.001. The training samples of all experiments are randomly selected from the hyperspectral data. A large number of experiments show that accuracy and running time are balanced when the number of training samples is 180.
In addition, in order to evaluate the performance of the proposed unsupervised domain adaptive weighted KNN cross-scene classification method combining the Manhattan metric and depthwise over-parameterized convolution (MDDUWK), this paper uses six different methods for comparison. These include two traditional classification algorithms, the Radial Basis Function Support Vector Machine (RBF-SVM) and the Extended Morphological Profile Support Vector Machine (EMP-SVM), and four classification algorithms based on deep learning: the Deep Convolutional Neural Network (DCNN), unsupervised domain adaptive classification based on the Euclidean metric (ED-DMM-UDA), unsupervised domain adaptive K-nearest neighbor classification based on the Manhattan metric and depthwise convolution (MDDUK), and unsupervised domain adaptive weighted K-nearest neighbor classification based on the Manhattan metric (MDUWK). The evaluation indexes are overall accuracy (OA), average accuracy (AA) and the kappa coefficient (k).
In this paper, the DO-Conv neural network is introduced as the main feature extraction network of the deep hyperparametric embedding model, which is composed of four depthwise over-parameterized convolution layers, an average pooling layer and a fully connected layer. DO-Conv is introduced into the neural network to extract more feature information and improve the convergence of the model; it does not increase the parameters of the embedding model at inference, yet it improves the feature extraction ability of the model. In order to make full use of the spectral and spatial information of the hyperspectral data, each pixel together with its bands (nBand) is used as the input, so the model input size is 5 × 5 × nBand and the pooling kernel is 5 × 5. The output of the final fully connected layer is 1 × 128. The number of filters per layer is set to 200 and the depth multiplier of the depthwise convolution is set to 5. Table 3 shows the parameter settings of the deep hyperparametric embedding model. The discriminator consists of three fully connected layers, two activation functions (ReLU) and one Softmax layer, in which the output of the fully connected layer is 1 × 64 and the output of the Softmax layer is 1 × 2. The Manhattan metric model contains three conventional convolution (Conv) layers with 1 × 1 kernels, two BN layers and one Sigmoid layer. The number of training samples in this paper is 180, which is 0.42% compared to the 42,159 training samples in the Indiana dataset, and 8.27% compared to the 1489 training samples in the Pavia dataset. Compared with the total number of training samples, the number of samples selected for training in this paper is small; thus, we call it "small sample classification" or "few-shot classification".
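The three evaluation indexes can be computed from a confusion matrix using their standard definitions. The sketch below is illustrative: the function name `accuracy_metrics` and the convention that rows are true classes and columns are predicted classes are assumptions, and every class is assumed to have at least one sample.

```python
def accuracy_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix where
    confusion[i][j] counts samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    # Overall accuracy: fraction of all samples classified correctly
    oa = correct / total
    # Average accuracy: mean of the per-class accuracies
    aa = sum(confusion[i][i] / sum(confusion[i]) for i in range(n)) / n
    # Kappa: agreement corrected for the agreement expected by chance
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(sum(confusion[i]) * col_sums[i] for i in range(n)) / total ** 2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


if __name__ == "__main__":
    print(accuracy_metrics([[8, 2], [1, 9]]))
```

OA is dominated by the large classes, whereas AA weights every class equally, which is why both are reported for category-imbalanced data.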
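The 5 × 5 × nBand input around each pixel can be illustrated with a simple patch extraction routine. This is a sketch under assumptions: the image cube is a nested list `cube[row][col]` holding the spectrum of each pixel, the function name `extract_patch` is hypothetical, and border pixels are handled by clamping coordinates to the image edge, which the paper does not specify.

```python
def extract_patch(cube, row, col, size=5):
    """Return a size x size x nBand neighborhood centered on (row, col)."""
    half = size // 2
    height, width = len(cube), len(cube[0])
    patch = []
    for dr in range(-half, half + 1):
        # Clamp the row index to the image border (assumed border handling)
        r = min(max(row + dr, 0), height - 1)
        patch_row = []
        for dc in range(-half, half + 1):
            c = min(max(col + dc, 0), width - 1)
            patch_row.append(cube[r][c])
        patch.append(patch_row)
    return patch


if __name__ == "__main__":
    # Toy 6 x 6 image with 2 bands per pixel
    cube = [[[r * 10 + c, 0] for c in range(6)] for r in range(6)]
    patch = extract_patch(cube, 3, 3)
    print(len(patch), len(patch[0]), len(patch[0][0]))
```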

Comparison Experiments and Analysis
In this paper, the values of OA, AA and the kappa coefficient are compared and analyzed. It can be found from Tables 4 and 5 that the classification accuracy of the cross-scene learning method is greatly improved compared with RBF-SVM and EMP-SVM. Because the convolution kernel in the conventional convolution is only 1 × 1, the receptive field is small; thus, this paper introduces DO-Conv into the deep hyperparametric embedding model. DO-Conv convolves each channel of the input data to extract finer spatial information, which makes up for the poor feature acquisition ability of a model limited by a small convolution kernel. At the same time, weighted KNN is introduced to increase the weight of small-sample classes in HSI and enhance the ability of the network model to handle imbalanced data. Because of the high complexity of the network model, the Manhattan distance metric is introduced to reduce the calculation cost of the model and keep a balance between accuracy and running time. Therefore, compared with the six comparative classification methods, this model has higher classification accuracy with a comparatively short running time.
It can be seen from Table 4 that on the Indiana dataset, the OA obtained by the classification model in this paper reaches 65.01%, which is an increase of 19.83%, 11.46%, 7.34%, 6.62%, 2.56% and 0.66% compared with RBF-SVM, EMP-SVM, DCNN, ED-DMM-UDA, MDDUK and MDUWK, respectively. It can be seen from Table 5 that on the Pavia dataset, the OA reaches 90.90%, an increase of 10.08%, 9.59%, 5.92%, 2.14%, 1.55% and 0.89% compared with RBF-SVM, EMP-SVM, DCNN, ED-DMM-UDA, MDDUK and MDUWK, respectively, which confirms the effectiveness of the MDDUWK model in the HSI classification task.
It can be seen from Figure 9 that in the Indiana dataset, the small-sample categories concrete/asphalt, orchard and soybeans cleanTill EW are weighted separately. Comparing MDUWK with ED-DMM-UDA, the accuracy of concrete/asphalt and soybeans cleanTill EW is improved by 0.18% and 2.3%, respectively. Compared with the improvement in these two categories, the reduction of 0.54% for orchard is acceptable. It can be found from Figure 10 that in the Pavia dataset, the accuracy of the small-sample categories parking lot and bitumen increases by 2.21% and 4.04%, respectively, when weighting alone is used (comparing MDUWK with ED-DMM-UDA). This indicates that the adaptive weighting method in this paper can effectively handle category-imbalanced data. In addition, MDDUWK obtains the best classification accuracy in four of the seven terrain categories of interest in the Indiana dataset and in four of the six terrain categories of interest in the Pavia dataset; thus, the proposed algorithm also achieves the best classification effect in the majority of the individual categories.
Table 6 shows the computation time of the seven models. When MDDUWK classifies hyperspectral data, its time consumption ranks third on the Indiana dataset and second on the Pavia dataset, which saves computing resources. Although MDDUWK takes more time than MDDUK, the slight increase in time is acceptable given the increase in accuracy.
In order to evaluate the classification effect visually, Figures 11 and 12 show the ground truth maps of the two HSI datasets and the pseudo-color maps of the classification results of each method. In Figure 11, asphalt and bitumen both refer to the same black, sticky, semi-solid or liquid substance derived from crude oil. However, in regular use, asphalt can also be a shortened term for asphalt concrete, a popular construction composite made up of bitumen and mineral aggregates. Asphalt is called asphaltene. In Figure 12, CleanTill means cultivating by stripping the soil clean of weeds and other harmful growth. EW stands for different longitudes and is mainly used to distinguish features at different longitudes, since features at different longitudes differ somewhat in their characteristics. Researchers add EW to identify features more accurately. It can be seen that the result of the method in this paper is closer to the real ground object distribution, and the misclassified area is greatly reduced. Compared with RBF-SVM, EMP-SVM, DCNN, MDDUK and MDUWK, the classification effect is greatly improved.

Conclusions
In this paper, a new cross-scene adaptive learning terrain classification model for hyperspectral images is proposed. Based on unsupervised domain adaptation, the model introduces depthwise over-parameterized convolution into the embedding model to accelerate the convergence of the deep convolutional neural network. At the same time, learning the Manhattan metric distance reduces the computational cost of the cross-scene model. Finally, a weighted KNN classifier is introduced to enhance the ability of the model to handle data category imbalance. Experiments are carried out on the Pavia and Indiana datasets. Compared with six other classification algorithms, the proposed method has higher classification accuracy and a short running time. When the number of training samples is 180, the overall accuracy on the Pavia and Indiana datasets reaches 90.9% and 65.0%, respectively. The proposed cross-scene classification model improves the classification accuracy on hyperspectral images without training samples and can be applied to crop yield estimation, pest detection, atmospheric environment monitoring and other fields. However, the computational cost of the model is still relatively large. In future work, the network model will be made more lightweight to reduce the number of model parameters and improve training efficiency.