Depthwise Separable Relation Network for Small Sample Hyperspectral Image Classification

Abstract: Although hyperspectral data provide rich feature information and are widely used in many fields, the data are still scarce. Training on small samples remains a major challenge for HSI classification based on deep learning. Recently, mining the relationships between samples has proved to be an effective way to train with small samples. However, this strategy requires high computational power, which makes the network model harder to train. This paper proposes a modified depthwise separable relational network to deeply capture the similarity between samples. To mine this similarity effectively, the feature vectors of support samples and query samples are symmetrically spliced, and the metric distance between the symmetrical structures is used to reduce the dependence of the model on samples. Firstly, depthwise separable convolution is introduced to reduce the computational cost of the model and improve its training efficiency. Secondly, the Leaky-ReLU function effectively activates all neurons in each layer of the neural network, which further improves training efficiency. Finally, the cosine annealing learning rate adjustment strategy is introduced to keep the model from falling into a local optimal solution and to enhance its robustness. Experimental results on two widely used hyperspectral remote sensing image datasets (Pavia University and Kennedy Space Center) show that, compared with seven other advanced classification methods, the proposed method achieves better classification accuracy when training samples are limited.


Introduction
A hyperspectral remote sensing image (HSI) is acquired by an imaging spectrometer that images a ground area in dozens or even hundreds of contiguous bands. It is a three-dimensional image combining spatial information and spectral information [1]. With the rich third-dimensional information, ground objects can be classified and identified more accurately in the spectral space, so HSI has been widely used in agriculture [2], forestry monitoring [3], environmental monitoring [4] and other fields. However, manual collection of labeled samples for hyperspectral data is very time-consuming and expensive, so small-sample HSI classification has become very important [5].
Many researchers have used the spectral information in HSI for terrain classification. Melgani et al. used the steepest ascent (SA) search strategy to remove redundant information from the original data dimensions, and then used a radial basis function as the kernel function of an SVM to identify the terrain category [6]. Chen et al. proposed a hyperspectral image classification algorithm based on dictionary sparsity. Spectral pixels are recovered by solving optimization problems constrained by sparsity and reconstruction accuracy, and the category of each test pixel is determined from the features of the recovered sparse vectors [7]. Gislason et al. built a random forest following the idea of ensemble learning, combining a set of decision trees with a voting strategy to improve on the accuracy of the basic classifier [8]. Bandos et al. classified hyperspectral remote sensing images by regularized linear discriminant analysis (RLDA) [9]. Bachmann et al. proposed an algorithm that exploits the nonlinear structure of hyperspectral images to seek a manifold coordinate system that maintains the geodesic distance in the high-dimensional hyperspectral data space. The manifold coordinate representation can distinguish categories with similar spectra [10]. However, these traditional classification methods have limitations. When "different objects with the same spectrum" or "the same object with different spectra" occur [11], a model that relies only on spectral information can hardly classify them effectively, and the Hughes phenomenon appears easily.
Therefore, researchers have incorporated spatial information into hyperspectral classification [12]. Gu et al. proposed a multi-kernel learning framework integrating spectral and spatial features, which uses multiple structural elements to generate extended morphological profiles (EMP) to represent spatial-spectral information. To better mine the similarity between EMP scales and structures, a nonlinear multi-kernel learning framework is introduced to learn the optimal combination kernel from the predefined linear basis kernels, which is combined with a support vector machine (SVM) to recognize ground objects [13]. Mura et al. proposed the attribute profile, which describes the scene more completely and models spatial information more accurately [14]. However, feature extraction in these methods depends heavily on manual settings, and the manual setting process is complex and sensitive.
In recent years, deep learning has been widely used in hyperspectral remote sensing because it is very powerful at automatic feature extraction and at learning hierarchical structures [15]. The convolutional neural network (CNN) is currently the most popular and successful deep learning framework. It uses different convolution layers to automatically extract deeper and more abstract features, and has achieved good results in hyperspectral image classification [16][17][18]. However, when a CNN has too many layers, vanishing and exploding gradients appear. Therefore, He et al. proposed a residual network structure based on skip connections, which effectively alleviates the vanishing gradient caused by very deep networks [19]. Paoletti et al. proposed combining pyramid convolution with a deep residual network to further improve the feature acquisition ability [20]. Paoletti et al. also redesigned ResNet as a continuous-time evolution model, redefining the deep residual network as a more flexible model by evaluating an ordinary differential equation and adjusting the parameters [21]. Although deep learning algorithms can automatically extract features and make more effective use of HSI spatial and spectral information, they still have shortcomings, such as the need for a large number of training samples, difficult parameter adjustment, and slow model convergence [22]. In particular, the lack of labeled training samples becomes more intractable due to the data-hungry nature of deep neural networks, where millions of network weights need to be tuned.
To solve the above problems, some researchers combine transfer learning with deep learning to classify HSI: the network model is fully trained on a source dataset with a large number of labeled samples and then transferred to the target dataset with few samples to complete the classification. Chen et al. trained the model structure and weights of VGG-16Net on the ImageNet dataset and transferred the trained network to an HSI classification model, achieving good classification results [23]. Jiang et al. trained the three-dimensional separable residual network (3-D-SRNet) on source hyperspectral data to adjust the model parameters and then classified the target hyperspectral data, which effectively reduces the dependence of the model on samples [24]. Deng et al. proposed an active transfer learning strategy: stacked sparse autoencoder networks (SSAE) are trained on the source hyperspectral data, and then a limited number of labeled samples are selected from the source and target data through active learning to fine-tune the network for terrain classification [25]. Zhong et al. proposed a multi-scale spectral spatial unified network (MSSN) model with a cross-scene training strategy. The network is composed of two branch architectures and a multiscale library, and can extract spectral and spatial features at the same time [26]. Chen et al. proposed an ensemble transfer learning method to train a deep residual network. Firstly, a sub-classifier network is trained, and then the weights of the first layer of the network are transferred to each sub-classifier to obtain the HSI terrain category through a voting strategy [27]. Although these methods have achieved good classification results, they still need relatively large training sets.
To further reduce the dependence of the model on training samples, some researchers have recently proposed classification models suited to small samples. For HSI classification, Zhang et al. proposed a global prototype network that maps the data to an embedding space to learn the Euclidean metric distance between samples and completes the classification with a nearest neighbor classifier [28]. Hoffer et al. proposed the triplet network, an improved Siamese network, which has been successfully applied to small sample classification [29]. Sung et al. proposed a relational network that uses a neural network to automatically match the similarity of different samples and obtain the feature category [30]. The global prototype network and the triplet network use a manually specified metric distance. Although the relational network learns the metric automatically with a neural network, it cannot be directly applied to hyperspectral image classification. Therefore, Deng et al. designed a hyperspectral image classification model based on the relational network to effectively reduce the dependence of the model on samples, which has achieved good results in hyperspectral image classification [31]. However, this method is difficult to train and has a high computational cost. Ma et al. designed the two-phase relation learning network. Although the number of training samples is reduced to a certain extent, the complexity of the network increases the number of model parameters, and stochastic gradient descent easily falls into local optimal solutions, which hinders training [32]. Rao et al. proposed the spatial spectral relationship network (SSRN), in which a 3D-CNN extracts spectral and spatial features at the same time to capture the deep similarity between samples [33]. However, to achieve a good classification result, this method often needs a long training time and a large computational cost, which increases the dependence of the algorithm on computational power.
For small-sample classification of hyperspectral images, this paper uses a modified relational network combined with depthwise separable convolution, which lets the network learn how to measure the similarity between two samples while reducing the amount of computation. The rest of this paper is organized as follows. Section 2 briefly introduces the relevant algorithms and the improvements made in this paper. Section 3 describes the experimental results and their analysis, and a discussion is given in Section 4. Section 5 summarizes the conclusions and future work.

Proposed Methods
The classification model proposed in this paper is shown in Figure 1 and is composed of a depthwise separable embedding model and a comparison metric model. To keep the classification model from falling into local optimal solutions, the network is trained with the cosine annealing algorithm. Firstly, the samples are input into the depthwise separable embedding model, whose first layer extracts rough features of the samples. To activate the neurons more effectively, the Leaky-ReLU activation function is added to each layer of the neural network. The obtained features are then input into the depthwise separable convolution, which can be divided into depthwise convolution and pointwise convolution. Depthwise convolution convolves each input channel separately to obtain finer spatial features. Pointwise convolution then fuses the spatial information across channels to preserve the interaction between channels.
The feature vectors output by the embedding model are input into the relation model. The feature vectors of support samples and query samples are symmetrically spliced one by one. The spliced feature vectors are regarded as sample pairs with symmetrical structure, and the sample pairs are mapped to the comparison metric model to obtain the metric distance between the symmetrical structures. The matching degree between samples is analyzed by the neural network, and finally the terrain category is obtained.

Depthwise Separable Embedding Model
The purpose of the depthwise separable embedding model is to obtain high-quality feature vectors, so that high-quality symmetric feature spaces can be spliced for the comparison metric model. In deep learning, the convolutional neural network is widely used in image classification because of its strong feature expression ability [34]. Its neurons capture information from different feature maps at different positions, so the extracted image features are richer. As the number of layers grows, the convolutional neural network can handle more complex real environments; however, it also faces problems such as increasing computational cost, difficult convergence, and heavy dependence on samples [35]. Therefore, while keeping the ability of the convolutional neural network to extract rich image features, this paper introduces a lightweight depthwise separable convolutional network to reduce the number of model parameters and speed up training, and introduces the Leaky-ReLU activation function to effectively activate the neurons in each layer and improve the ability of the model to handle complex environments.
In each training process, the samples are defined as $X = (X_{train}, X_{test})$, the training samples are defined as $X_{train} = \{x_i, x_j, y_i, y_j\}^{n}$, and the embedded feature is defined as $E_{\phi}(X)$, where $n$ is the total number of training samples. The training samples are divided into query samples $x_j$ and support samples $x_i$, which have corresponding label values $y_j$ and $y_i$. Each time, one query sample $x_j$ and several support samples $x_i$ are randomly selected from the training set to train the model. The query sample is the sample to be classified, and each support sample represents the information of its own category. Query samples are randomly selected from $X_{test}$ in the same way as the samples of $X_{train}$ are constructed, as shown in Figure 2.
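The episode construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_episode`, the choice of one support sample per class, and the rule that the query is never one of the support samples are not given in the text.

```python
import random
from collections import defaultdict


def sample_episode(samples, labels, n_support_per_class=1, rng=None):
    """Draw support samples for every class plus one query sample.

    samples: list of items (for example 5 x 5 x nBand patches).
    labels:  list of class ids, aligned with samples.
    Returns (support, query), each a list of (sample, label) pairs.
    """
    rng = rng or random.Random()

    # Group sample indices by class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Assumption: the same number of support samples is drawn per class.
    support_idx = []
    for label in sorted(by_class):
        support_idx += rng.sample(by_class[label], n_support_per_class)

    # Assumption: the query is drawn from the samples not used as support.
    used = set(support_idx)
    remaining = [i for i in range(len(samples)) if i not in used]
    query_idx = rng.choice(remaining)

    support = [(samples[i], labels[i]) for i in support_idx]
    query = [(samples[query_idx], labels[query_idx])]
    return support, query
```

Calling `sample_episode` once per training step gives a fresh support set and query, which matches the "each time" wording of the paragraph.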
Firstly, the training samples are input into the first convolution layer, which extracts rough features from the samples and generates a 64-channel feature map. Here $F_{Conv}$ denotes the standard convolution operation and $F_{FM}$ denotes the feature map extracted from $I_{HSI}$. Then, a depthwise separable convolution is applied to $F_{FM}$ to complete the nonlinear mapping, as shown in Formula (1).
Depthwise separable convolution (DS) can be divided into depthwise convolution (DW) and pointwise convolution (PW). DW convolves each channel independently to obtain finer spatial features $F_{DWConv}$. The depthwise separable embedding model is shown in Figure 3. Because each feature map (FM) is convolved by only one filter, the feature information of different channels at the same spatial position cannot be used effectively. Therefore, pointwise convolution is introduced to fuse the information at the same spatial position across channels and preserve the interaction between channels, yielding the spatial feature $F_{PWConv}$. The sample feature vector is then obtained from the convolution layer $F_{Conv}$, as shown in Formula (2).
Symmetry 2021, 13, x FOR PEER REVIEW 6 of 21
Figure 3. Depthwise separable embedding model.
Figure 3 shows the output of each layer of the depthwise separable embedding model. In this paper, the depthwise separable embedding model consists of three neural network layers. The first and third layers are each composed of a conventional convolution layer, a batch normalization (BN) layer and the Leaky-ReLU activation function. The second layer is the depthwise separable convolution layer, which includes depthwise convolution and pointwise convolution. The depthwise convolution is composed of 64 groups of convolution layer, BN layer and Leaky-ReLU activation function. The pointwise convolution is composed of one group of convolution layer, BN layer and Leaky-ReLU activation function. To train on small samples more effectively, the convolution kernel size of each layer is set to 1 × 1, with padding 0 and stride 1. Assuming the input sample is 5 × 5 × nBand, the feature map after the first convolution layer has a size of 5 × 5 × 64. The second layer further filters $F_{FM}$ with depthwise convolution to generate 64 isolated feature maps of size 5 × 5 × 1, then $F_{PWConv}$ fuses the information of the different feature maps into one group of size 5 × 5 × 64, and finally the conventional convolution output $E_{\phi}(X)$ has a size of 5 × 5 × 64.
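The shape walk-through above can be checked with a small shape and weight-count trace. The sketch below is not the authors' implementation: the helper names are hypothetical, `n_band=103` is taken from the Pavia University description later in the paper, and bias and BN parameters are ignored as a simplifying assumption.

```python
def conv_out(size, kernel=1, padding=0, stride=1):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_embedding(patch=5, n_band=103, channels=64, kernel=1):
    """Return (stage name, output shape, weight count) for each stage.

    Assumption: bias terms and batch-normalization parameters are not counted.
    """
    stages = []
    size = patch

    # Layer 1: standard convolution, nBand -> 64 channels.
    size = conv_out(size, kernel)
    stages.append(("conv1", (size, size, channels), kernel * kernel * n_band * channels))

    # Layer 2a: depthwise convolution, one filter per channel (64 groups).
    size = conv_out(size, kernel)
    stages.append(("depthwise", (size, size, channels), kernel * kernel * channels))

    # Layer 2b: pointwise (1 x 1) convolution mixing the 64 channels.
    size = conv_out(size, 1)
    stages.append(("pointwise", (size, size, channels), channels * channels))

    # Layer 3: standard convolution, 64 -> 64 channels.
    size = conv_out(size, kernel)
    stages.append(("conv3", (size, size, channels), kernel * kernel * channels * channels))
    return stages
```

With 1 × 1 kernels, padding 0 and stride 1, every stage keeps the 5 × 5 spatial size, so the final output has the 5 × 5 × 64 shape stated in the text.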

Depthwise Separable Convolution
Hyperspectral images are different from general two-dimensional images, and contain a lot of information in the spatial dimension. The spatial information of hyperspectral images determines the spatial features between adjacent pixels in the spatial dimension, and the spatial features can make up for the shortcomings of spectral domain features to improve the ability of model to capture features. Although the two-dimensional convolution neural network can also extract the spatial features of hyperspectral images, it is not conducive to extracting the spectral and spatial features of pixels at the same time. In view of the insufficient use of hyperspectral data information by two-dimensional convolution, a depthwise separable convolution layer is added after the two-dimensional convolution layer, which can reduce the parameters and increase the spatial features at the same time [36]. More abundant spatial-spectral features are extracted to ensure that the model can distinguish the spatial information of different bands without loss.
The two-dimensional convolution generates new high-level features by filtering and merging features with the convolution kernel. Each channel of the convolution layer can be expressed as Formula (3), where $D_i$ is the $i$th channel of the convolution feature map, $W_i$ is the convolution kernel, $b_i$ is the bias term of the feature map, and $X_j$ is the $j$th channel of the previous layer. The operator in Formula (3) denotes the two-dimensional convolution operation, and $\Omega(\cdot)$ denotes the nonlinear Leaky-ReLU activation function, which speeds up the training of the network model.
Formula (4) represents the calculation amount of standard convolution operation. The size of the output feature map is defined as W × H, the number of channels is C, the size of the convolution kernel is K × K × C, the number of convolution kernels is N, and the feature map is convoluted with each convolution kernel.
The depthwise separable convolution first filters each input channel separately (depthwise convolution) and then applies a pointwise convolution to the filtered result. Assume that the input to the depthwise convolution has a size of W × H × C and is divided into C groups. Each group performs a conventional convolution, which is equivalent to extracting the spatial features of each of the C input channels, that is, the DW features, and the output has a size of W × H × C. The pointwise convolution then acts on the 1 × 1 × C feature at each spatial position to produce the PW features, and its output has a size of W × H × N. Formula (5) represents the depthwise convolution and Formula (6) represents the pointwise convolution. The structure of depthwise separable convolution is shown in Figure 4, where * is the convolution operation.
Because the second and third convolution layers are processed together after the depthwise separable convolution is introduced, the parameters of the depthwise separable convolution are compared with those of the second and third layers of a conventional two-layer convolution network. Therefore, the parameter count in this paper is compared with that of a two-layer convolutional neural network.
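As a rough illustration of the cost argument, the sketch below counts multiplications for a standard convolution and for a depthwise plus pointwise pair, using the symbols W, H, C, N and K from the text. The formulas are the widely used textbook counts and are an assumption here, since Formulas (4) to (6) are not reproduced in this text; the function names are hypothetical.

```python
def standard_conv_cost(w, h, c, n, k):
    """Multiplications of a standard convolution: N kernels of size K x K x C."""
    return k * k * c * n * w * h


def depthwise_separable_cost(w, h, c, n, k):
    """Multiplications of a depthwise convolution followed by a pointwise one."""
    depthwise = k * k * c * w * h   # one K x K filter per input channel
    pointwise = c * n * w * h       # N kernels of size 1 x 1 x C
    return depthwise + pointwise


def cost_ratio(w, h, c, n, k):
    """Separable cost divided by standard cost; algebraically 1/N + 1/K**2."""
    return depthwise_separable_cost(w, h, c, n, k) / standard_conv_cost(w, h, c, n, k)
```

For example, `cost_ratio(5, 5, 64, 64, 3)` evaluates the ratio for a 5 × 5 feature map with 64 input and output channels and a 3 × 3 kernel; the ratio depends only on N and K, not on the spatial size.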

Leaky-ReLU Activation Function
In a multilayer neural network, each neuron node takes the output of the neurons in the previous layer as its input and passes its own output to the next layer. The output of the upper node and the input of the lower node are connected through the activation function. To allow the neurons to fit complex nonlinear relationships, the ReLU activation function is generally added to the neural network. ReLU stands for rectified linear unit. A neuron is effectively activated only when its input is positive, so the model loses the part of the feature information carried by negative signals. When the input is negative, learning with the ReLU activation function becomes very slow and the neuron may even become invalid: the gradient is zero, so its weights cannot be updated. The mathematical expression of the ReLU function is:

$$ \mathrm{ReLU}(x_i) = \max(0, x_i) $$

Therefore, a leaky rectified linear unit is introduced, namely the Leaky-ReLU function. The function introduces a leakage value in the negative interval of the ReLU function and has a small slope for negative inputs, so that the derivative is always non-zero, which reduces the number of silent neurons and effectively activates the neurons in the network. The mathematical expression of the Leaky-ReLU function is:

$$ \mathrm{LeakyReLU}(x_i) = \begin{cases} x_i, & x_i > 0 \\ a_i x_i, & x_i \le 0 \end{cases} $$

where $x_i$ is the input and $a_i = 0.01$ is a small slope, so that negative input information is retained without increasing the computational cost of model training. On the one hand, the function modifies the data distribution; on the other hand, it weights negative values with a constant to retain the input information more completely. Compared with the ordinary ReLU function, Leaky-ReLU keeps the features of negative HSI signal data through a very small weighting constant.
After Leaky-ReLU replaces the ReLU activation function, it is applied to the output of each convolution layer and fully connected layer. When the input signal is less than 0, the output is the linearly weighted input signal $a_i x_i$. When it is greater than zero, the output is equal to the input.
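The behaviour of the two activations on negative inputs can be illustrated with scalar functions. This is a minimal sketch with hypothetical function names; the slope `a = 0.01` follows the value of $a_i$ given in the text.

```python
def relu(x):
    """ReLU: passes positive inputs, outputs zero otherwise."""
    return x if x > 0 else 0.0


def leaky_relu(x, a=0.01):
    """Leaky-ReLU: passes positive inputs, scales the rest by the slope a."""
    return x if x > 0 else a * x


def relu_grad(x):
    """Derivative of ReLU: zero for non-positive inputs, so no weight update."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, a=0.01):
    """Derivative of Leaky-ReLU: never zero, equal to a for non-positive inputs."""
    return 1.0 if x > 0 else a
```

For a negative input such as -2.0, `relu` returns zero with zero gradient, while `leaky_relu` returns a small negative value with gradient `a`, which is the property the text relies on to avoid silent neurons.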

Cosine Annealing Algorithm
The gradient descent method is an iterative algorithm that computes the gradient vector of the objective function and updates the parameter values along the negative gradient direction until the objective function converges to a minimum. It is therefore necessary to set a learning rate $\eta_t$ and then update each parameter of the network along the negative gradient direction to minimize the loss function. In deep learning, batch gradient descent (BGD) and stochastic gradient descent (SGD) are the main methods used to update the parameter values. BGD uses the whole dataset for each parameter update, so if the number of samples is large, training becomes slow and the computational cost increases. SGD trains quickly, but it easily falls into a local optimal solution because it uses only part of the information in the data. Therefore, to balance training speed and computational cost, this paper introduces the cosine annealing algorithm, which reduces the learning rate following a cosine function:

$$ \eta_t = \eta^{i}_{min} + \frac{1}{2}\left(\eta^{i}_{max} - \eta^{i}_{min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right) $$

where $\eta^{i}_{min}$ and $\eta^{i}_{max}$ are the minimum and maximum learning rates, respectively, $T_i$ is the total number of epochs, and $T_{cur}$ is the current epoch. $\eta^{i}_{max}$ usually decays. When $T_{cur} = T_i$, $\eta_t$ reaches the minimum learning rate $\eta^{i}_{min}$.
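The schedule can be written as a one-line function. The sketch below assumes a single annealing cycle without restarts; the default `lr_max=0.1` follows the learning rate reported in the experimental settings, while `lr_min=0.0` and the function name are assumptions made for illustration.

```python
import math


def cosine_annealing_lr(t_cur, t_total, lr_max=0.1, lr_min=0.0):
    """Learning rate at epoch t_cur of a cosine cycle of length t_total."""
    cosine = math.cos(math.pi * t_cur / t_total)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# Learning rates at a few checkpoints of a 4000-epoch run (epoch count from the paper).
schedule = [cosine_annealing_lr(t, 4000) for t in (0, 1000, 2000, 3000, 4000)]
```

The rate starts at `lr_max`, passes the midpoint of the two bounds halfway through, and ends at `lr_min`. PyTorch ships a comparable scheduler, `torch.optim.lr_scheduler.CosineAnnealingLR`, although the text does not say whether it was used.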

Comparison Metric Model
The support samples $x_i$ and query samples $x_j$ pass through the depthwise separable embedding model to generate two sets of feature vectors, $E_{\phi}(x_i)$ and $E_{\phi}(x_j)$, each of size 5 × 5 × 64. A feature vector $Con^{j}_{i}$ of size 5 × 5 × 128 is then generated by the symmetric splicing operation $C(\cdot,\cdot)$, as shown in Formula (11). If the numbers of $x_i$ and $x_j$ are equal, the symmetric splicing operation recombines the unevenly distributed $E_{\phi}(x_i)$ and $E_{\phi}(x_j)$ and splices them one to one to form a feature space with symmetric structure. Each time, a group of samples is mapped onto the comparison metric model. By measuring the metric distance between the symmetric structures of the samples, a small number of features can be used effectively to obtain the attributes of the samples and reduce the dependence of the model on training samples.
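The splicing step amounts to concatenating two 5 × 5 × 64 feature maps along the channel axis to obtain a 5 × 5 × 128 pair. The sketch below uses nested Python lists in height × width × channel order so that it runs without any library; the helper names and the list layout are assumptions for illustration only.

```python
def make_feature(value, height=5, width=5, channels=64):
    """Build a constant H x W x C feature map as nested lists (toy data)."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


def concat_channels(feat_a, feat_b):
    """Concatenate two H x W x C feature maps along the channel axis."""
    return [
        [pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
        for row_a, row_b in zip(feat_a, feat_b)
    ]


def build_pairs(support_feats, query_feat):
    """Pair every support feature map with the query feature map."""
    return [concat_channels(support, query_feat) for support in support_feats]
```

Each resulting pair keeps the support channels first and the query channels second, so one pair is produced per support sample and then scored by the comparison metric model.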
$M_{\varphi}$ is a neural network composed of three convolution layers. The convolution size of the first two layers is 1 × 1 × 64, and each of them is followed by the Leaky-ReLU activation function for nonlinear feature mapping. The third convolution layer has a size of 5 × 5 × 1 and is followed by the sigmoid function, shown in Formula (12), which outputs a score on a fixed scale to describe the similarity between samples. The value range of the function is [0, 1].
Then, the spliced feature vector is mapped to $M_{\varphi}$, and the distance between the feature vectors of $x_i$ and $x_j$ is measured by a convolution layer of size 1 × 1 × 64. A feature vector of size 5 × 5 × 64 is generated, and $M_{\varphi}$ then produces a relationship score $m_{i,j}$ of size 1 × 1 to describe the similarity between the samples. Finally, the query sample $x_j$ is assigned to the class with the highest relationship score. The comparison metric model is shown in Figure 5.
To determine the label of the query sample, the feature map of each combination is put into $M_{\varphi}$ to generate a similarity, defined in Equation (13), which represents the similarity between any two embeddings. The output value $m_{i,j}$ lies in the range [0, 1] and is regarded as the relationship score. The higher the output value, the greater the similarity.
The comparison metric model is trained with the mean square error (MSE) loss function applied to the relationship scores computed from $E_{\phi}(x_i)$ and $E_{\phi}(x_j)$. The loss function is given by Formula (14), where the indicator $1(y_i == y_j)$ is 1 when the support sample $x_i$ and the query sample $x_j$ belong to the same category, that is, $y_i = y_j$, and 0 otherwise.
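The training target and the decision rule can be sketched as follows. The relationship scores are taken as given numbers in [0, 1], as if already produced by the comparison metric model; the function names are hypothetical, and averaging the squared errors (rather than summing them) is an assumption, since Formula (14) is not reproduced in this text.

```python
def relation_targets(support_labels, query_label):
    """Indicator targets: 1.0 where the support label equals the query label."""
    return [1.0 if label == query_label else 0.0 for label in support_labels]


def mse_loss(scores, targets):
    """Mean of squared differences between relationship scores and targets."""
    return sum((m - t) ** 2 for m, t in zip(scores, targets)) / len(scores)


def predict_label(scores, support_labels):
    """Assign the query to the class of the support sample with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return support_labels[best]


# Toy episode: three support classes and one query of class 0.
scores = [0.9, 0.2, 0.1]
support_labels = [0, 1, 2]
loss = mse_loss(scores, relation_targets(support_labels, query_label=0))
predicted = predict_label(scores, support_labels)
```

During training the loss pushes the score of matching pairs toward 1 and of non-matching pairs toward 0; at test time only `predict_label` is needed.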

Hyperspectral Dataset Description
To verify the effectiveness of the proposed model for hyperspectral data classification, this paper uses two public hyperspectral datasets, the Pavia University (PaviaU) dataset and the KSC dataset. The Pavia University dataset is an urban scene collected from hyperspectral images of Pavia University in Italy in 1992, including multiple types of urban features. The image size is 610 × 340 pixels with a spatial resolution of 1.3 m, and 103 spectral bands remain after denoising (band range from 0.43 to 0.86 µm). A total of nine classes and 42,776 labeled samples were calibrated, as shown in Figure 6.
The KSC dataset was collected over mixed vegetation near the Kennedy Space Center by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) of the National Aeronautics and Space Administration (NASA) on 23 March 1996. The image size is 512 × 614 pixels with a spatial resolution of 18 m. Some atmospheric water absorption bands and low signal-to-noise ratio (SNR) bands were discarded, and only 176 bands were retained for analysis. A total of 13 categories and 5211 labeled samples are annotated, as shown in Figure 7 [37].

Experimental Platform Parameter Settings
In this paper, Windows 7 was used as the operating system. The hardware consisted of an Intel(R) Core(TM) i5-6500 CPU @ 3.2 GHz, 16 GB of RAM and an NVIDIA GeForce GTX 1060 GPU. The deep learning framework was PyTorch, with Python as the programming language. To obtain more stable statistics, all experimental results presented in this paper are the average of 10 runs. In the network training part, all experiments adopt batch processing. Because reducing the number of parameters in the network is conducive to small sample training, the spatial size of the input data was set to 5 × 5, the number of training epochs was set to 4000, the momentum was set to 1 and the learning rate used with cosine annealing was set to 0.1.

To ensure the reliability of the experiments, the training samples of all experiments were randomly selected from the hyperspectral data. In addition, eight methods were compared to evaluate the performance of the proposed relational network: extended morphological profile support vector machine (EMP-SVM), deep convolutional neural network (DCNN), deep residual network (ResNet), pyramid residual network (PyResNet), relational network (RLNet), relational network based on depthwise separable convolution (DRNet), Leaky-ReLU relational network (LRNet) and the depthwise separable Leaky-ReLU relational network (DLRNet).
The classification results were evaluated with the overall accuracy (OA), average accuracy (AA) and Kappa coefficient (K), which are commonly used in remote sensing data classification. OA is the ratio of the number of correctly classified samples to the total number of samples and measures the overall classification effect. AA is the mean of the per-class accuracies, which better reflects the classification effect on classes with few samples. The Kappa coefficient measures not only the classification accuracy but also the consistency between the model predictions and the ground truth.
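For concreteness, the three indexes can be computed from a confusion matrix as in the following sketch. The function name `classification_metrics` and the toy matrix are illustrative assumptions, and the code assumes every class has at least one true sample.

```python
def classification_metrics(confusion):
    """Compute (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j. Every class is assumed to have at least one sample.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    # Overall accuracy: correctly classified samples over all samples.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies.
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes

    # Kappa: agreement corrected for the agreement expected by chance.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa


# Toy two-class example with one large class and one small class.
oa, aa, kappa = classification_metrics([[90, 10], [5, 5]])
```

In the toy example the small class is classified poorly, so AA falls well below OA, which is the behavior the text attributes to AA for small sample categories.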

Design of Relation Network Model
In this paper, a neural network with depthwise separable convolution is used as the feature extraction network of the embedding model. The network is composed of a convolution layer, a depthwise separable convolution layer, the Leaky-ReLU activation function and a normalization layer. To train on small amounts of sample data more fully, the lightweight depthwise separable convolution is introduced, which both lowers the computational cost and improves the training efficiency of the model. The Leaky-ReLU activation function is introduced to further accelerate convergence and reduce the occurrence of silent neurons. Because hyperspectral data contain rich spectral and spatial information, each hyperspectral pixel and its spectral bands (nBand in number) are taken as the input, and the model input size is 5 × 5 × nBand. Since the model is trained on a small number of labeled samples, reducing the number of parameters in the network is more conducive to training; therefore, the convolution kernel size is set to 1 × 1. The number of filters per layer is set to 64, so the final output is 5 × 5 × 64. The relational model contains three conventional convolution layers with 1 × 1 convolution kernels, the Leaky-ReLU activation function, two BN layers and one Sigmoid layer. Table 1 shows the parameter settings of the depthwise separable embedding model.
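With 1 × 1 kernels, a depthwise separable convolution reduces to a per-channel scaling (the depthwise step) followed by a linear map across channels (the pointwise step), applied independently at every pixel. The sketch below illustrates one such block followed by Leaky-ReLU in plain Python. It is a simplified illustration rather than the authors' implementation: batch normalization is omitted, the function names are hypothetical, and the negative slope of 0.01 is an assumed value that the text does not specify.

```python
def leaky_relu(value, negative_slope=0.01):
    """Leaky-ReLU: pass positives through, scale negatives by a small slope."""
    return value if value >= 0 else negative_slope * value


def depthwise_1x1(pixel, dw_weight, dw_bias):
    """Depthwise 1 x 1 step: one weight and one bias per input channel."""
    return [w * v + b for v, w, b in zip(pixel, dw_weight, dw_bias)]


def pointwise_1x1(pixel, pw_weight, pw_bias):
    """Pointwise 1 x 1 step: each output channel mixes all input channels."""
    return [sum(w * v for w, v in zip(row, pixel)) + b
            for row, b in zip(pw_weight, pw_bias)]


def ds_block(patch, dw_weight, dw_bias, pw_weight, pw_bias):
    """Apply depthwise + pointwise 1 x 1 convolution and Leaky-ReLU.

    patch is an H x W x C nested list; the result is H x W x C_out.
    """
    output = []
    for row in patch:
        out_row = []
        for pixel in row:
            hidden = depthwise_1x1(pixel, dw_weight, dw_bias)
            hidden = pointwise_1x1(hidden, pw_weight, pw_bias)
            out_row.append([leaky_relu(v) for v in hidden])
        output.append(out_row)
    return output
```

In the setting described in the text, the patch would be 5 × 5 with 64 input and 64 output channels, so the spatial size of 5 × 5 is preserved by every layer.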

Comparison of Experimental Results
In this section, the OA, AA and Kappa coefficients are compared and analyzed on the two public HSI datasets. Tables 2 and 3 show that the proposed algorithm outperforms the other classification methods, and in particular that DLRNet performs better than RLNet. The introduction of depthwise separable convolution reduces the number of model parameters and the dependence of the model on computational power, and improves the universality of the model. At the same time, the Leaky-ReLU activation function adds a leakage value so that the neurons of the network model can be activated effectively. To avoid the model falling into local optimal solutions, the network is trained with the cosine annealing algorithm. As a result, the model in this paper achieves higher classification accuracy than the other classification methods. Table 2 shows that the highest OA on the PaviaU dataset is 91.12%, an increase of 2.16%, 1.48%, 0.8%, 0.22%, 1.56%, 2.05% and 1.84% over EMP-SVM, DCNN, ResNet, PyResNet, RLNet, DRNet and LRNet, respectively. Table 3 shows that the highest OA on the KSC dataset is 98.61%, an increase of 6.4%, 5.6%, 1.82%, 1.68%, 0.83%, 0.99% and 0.49% over the same seven methods, respectively, which demonstrates the effectiveness of the DLRNet model for hyperspectral data classification with small samples. Figures 8 and 9 show the classification results of the different methods for each class on the PaviaU and KSC datasets, respectively.
The running time is an important evaluation index of a deep learning classification model. Table 4 shows the computation time of the eight models. Comparing DLRNet with RLNet shows that introducing depthwise separable convolution and Leaky-ReLU improves the training efficiency: the running time is reduced by 0.47 min on the PaviaU dataset and by 0.49 min on the KSC dataset, which supports the effectiveness of the proposed method. Although DLRNet takes more time than EMP-SVM and DCNN, these traditional algorithms rely heavily on the number of samples, which is not conducive to classification in real environments. Given the difficulty of obtaining labeled samples, the increase in time is acceptable.
To subjectively evaluate the classification effects, Figures 10 and 11 show, for the two HSI datasets, the ground truth and the pseudo-color maps of the classification results of each method. The classification results obtained by our method are closer to the real terrain distribution, and the misclassified area is greatly reduced. Compared with the seven other methods, the classification accuracy is clearly improved.

The Selection of Small Sample Number
The number of small samples is defined as the number of training samples per class, which is selected from 5, 10, 20, 30 and 40. Comparing the five sample sizes in Figure 12 shows that OA increases rapidly until the number of samples per class reaches 20 and grows relatively slowly beyond 20. Therefore, this paper sets the number of training samples per class to 20.
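The per-class random selection of training samples can be sketched as follows. The function name `sample_per_class`, the fixed seed and the use of all remaining samples for testing are assumptions made for this illustration; the text only states that training samples are drawn at random and that 20 are used per class.

```python
import random
from collections import defaultdict


def sample_per_class(labels, per_class=20, seed=0):
    """Randomly pick `per_class` training indices from every class.

    labels is a flat list of class labels for the labeled pixels. The
    remaining indices are returned as a test set (an assumption here).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, test_indices = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        train_indices.extend(indices[:per_class])
        test_indices.extend(indices[per_class:])
    return train_indices, test_indices
```

Repeating the split with different seeds and averaging the results would correspond to the averaging over 10 experiments mentioned in the experimental settings.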

The Selection of Learning Rate
To better verify the influence of the cosine annealing algorithm on the proposed model, this paper compares experimental results with learning rates of 0.5, 0.1, 0.05, 0.01, 0.005 and 0.001. Figure 13 shows the values of OA, AA and Kappa corresponding to the different learning rates on the two datasets. Although the cosine annealing algorithm adjusts the learning rate automatically, the selected learning rate should not be too large. When the learning rate is 0.1, OA, AA and Kappa reach their best values, and when the learning rate is 0.5, the values of OA, AA and Kappa decrease. Therefore, this paper sets the learning rate to 0.1.

Figure 14 shows the training loss of the network on the two datasets. As the number of epochs increases, the loss continues to decrease, and the networks on both datasets essentially converge at about 4000 epochs. This indicates that the network model in this paper converges easily.
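A common form of the cosine annealing schedule discussed in this subsection is sketched below. The text does not state which variant is used, so a single decay cycle over all 4000 epochs, a minimum learning rate of 0 and the function name `cosine_annealing_lr` are all assumptions.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=4000, lr_max=0.1, lr_min=0.0):
    """Learning rate at a given epoch under single-cycle cosine annealing.

    The rate starts at lr_max, follows half a cosine period and reaches
    lr_min at total_epochs (assumed variant without warm restarts).
    """
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)


# Learning rates at the start, midpoint and end of training.
schedule = [cosine_annealing_lr(e) for e in (0, 2000, 4000)]
```

Under these assumptions the rate decays slowly at first, fastest around the midpoint and slowly again near the end, which is why the starting value of 0.1 still matters even though the rate changes automatically.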

Comparison of Parameters between Different Classification Models
To better describe the parameter reduction achieved by depthwise separable convolution, this paper compares it with an ordinary four-layer convolutional neural network, as shown in Table 5. After the depthwise separable convolution is introduced, the second and third convolution layers are processed together. As shown in Table 5, the total number of parameters of Conv-1 and Conv-2 is 8320 for the CNN, while DSConv-2 has 4288 parameters, meaning that the depthwise separable convolution has 48.46% fewer parameters than the two ordinary convolution layers. The parameters of the overall four-layer embedding model for the PaviaU and KSC datasets are reduced by 20% and 16.58%, respectively. Compared with ordinary convolution, depthwise separable convolution has fewer parameters and better performance.
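The 8320 and 4288 figures can be reproduced with a short calculation. The sketch assumes that each layer maps 64 channels to 64 channels with 1 × 1 kernels, that every convolution has a bias term, and that the depthwise step has one weight and one bias per channel. These assumptions match the counts quoted in the text, but they are inferred rather than taken from Table 5, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, kernel=1, bias=True):
    """Parameter count of an ordinary convolution layer."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, kernel=1, bias=True):
    """Parameter count of a depthwise step followed by a pointwise step."""
    depthwise = c_in * kernel * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Two ordinary 64 -> 64 layers versus one depthwise separable 64 -> 64 layer.
two_conv_layers = 2 * conv_params(64, 64)
ds_layer = depthwise_separable_params(64, 64)
reduction_percent = 100.0 * (two_conv_layers - ds_layer) / two_conv_layers
```

With 1 × 1 kernels the saving comes from replacing one full 64 × 64 channel-mixing layer with a per-channel scaling, rather than from the spatial factorization that dominates for larger kernels.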

Conclusions
In this paper, a hyperspectral image classification algorithm based on a depthwise separable relational network is proposed for small sample classification. Depthwise separable convolution is introduced into the deep embedding model to improve the training efficiency of the model. In addition, a learning rate adjustment strategy reduces the training time needed to reach good performance and improves the classification accuracy of the model. We carried out experiments on two public HSI datasets and compared DLRNet with seven other methods to verify the effectiveness of the proposed method. The experimental results show that the proposed method is competitive: its OA on the PaviaU and KSC datasets reaches 91.12% and 98.61%, respectively, which is better than the other classical methods and the ordinary relational network methods. This suggests that the proposed model is a promising direction for addressing the limited training sample problem, especially in the context of HSI classification.