1. Introduction
Hyperspectral remote sensing images (HRSI) have attracted the attention of researchers because of their rich spatial and spectral information [1]. They have been applied widely in atmospheric exploration [2], space remote sensing [3], earth resource surveys, military reconnaissance [4], environmental monitoring [5], agriculture [6], marine studies [7], and other fields [8]. Hyperspectral classification is an important approach for extracting thematic information and monitoring the dynamic changes of the earth [9,10,11]. In particular, the classification of HRSI based on deep learning has recently become one of the hot topics in remote sensing, and the convolutional neural network (CNN) [12,13] is its typical representative model, which has achieved high precision and high efficiency in HRSI classification. However, the high-dimensional information of HRSI combined with limited training samples is prone to the Hughes phenomenon. Meanwhile, processing high-dimensional information consumes a great deal of time and computing power, and the extracted features may not be representative, resulting in unsatisfactory classification efficiency and accuracy. To solve these problems, more effective data compression, feature extraction, and feature screening methods need to be considered for HRSI classification.
At the beginning of the 21st century, as computing power gradually increased, machine learning [14,15,16,17,18] began to be used for HRSI classification, and hybrid schemes that integrate two or more techniques became widely adopted, especially combinations of dimensionality reduction and machine learning methods. To address the curse of dimensionality in hyperspectral data, both dimensionality reduction methods (such as PCA, LDA, and AE) [19] and sparse representation methods [20] can be applied to HRSI classification: the former reduce the dimensionality before classification with machine learning methods, while the latter represent each sample as a linear combination of dictionary elements. For example, Chen proposed an HRSI classification algorithm based on Principal Component Analysis (PCA) and the Support Vector Machine (SVM), which significantly improved classification accuracy but lacked a discussion of time consumption [21,22]. Dinc et al. combined the random forest (RF) algorithm with the K-FKT transformation to classify HRSI and obtained about 84% overall classification accuracy [23]. Subsequently, with the rise of deep learning, its strong feature extraction ability was combined with dimensionality reduction methods, producing a series of approaches for HRSI classification. Hinton [24,25] proposed the theory of deep learning, which mines deep semantic information by learning from raw data with multilayer neural networks. Tien-Heng Hsieh explored the classification performance of 1D/2D CNNs combined with PCA for HRSI and alleviated label misclassification by enlarging the input vector, improving classification accuracy [26]. In addition, the three-dimensional convolutional neural network (3DCNN), which uses both spectral and spatial information, was proposed to extract joint features for HRSI classification tasks. Chen et al. applied 3DCNN to hyperspectral image classification for the first time; by extracting joint spatial-spectral features, they obtained better feature maps and good classification accuracy [27]. Shi used a super-pixel segmentation method to obtain preliminary classification results before extracting features with a 3DCNN, which enabled the network to better extract deep image features and thus improved classification accuracy [28]. Liu et al. adopted a 3DCNN-based classification model without preprocessing, that is, the hyperspectral images are input directly into the 3DCNN [29]; however, without dimensionality reduction, this takes a great deal of computing time. Palsson et al. [30] reduced the band dimension of the HRSI before extracting features with a 3DCNN, but this caused the image data to lose band continuity, which affected classification accuracy. Many mature neural network architectures also exist, and these models play an important role in image processing [31], target detection [32,33], and computer-aided diagnosis [34].
Attention mechanisms (AM) are widely used in image and speech recognition, natural language processing [35], and many other deep learning tasks, and they are regarded as one of the developments most worthy of attention for a deeper understanding of deep learning [36]. Heeyoul Choi et al. applied AM to neural machine translation (NMT), where attention-based models have become the state of the art [37]. An attention mechanism in a neural network mimics the selective attention of human vision and greatly increases the capacity for visual information processing, especially its efficiency and accuracy [38]. Its core purpose is to quickly select high-value information from a large amount of disordered information under limited attention resources [39]. In essence, it is a mechanism for allocating computing resources: instead of being distributed evenly, resources are adjusted according to learned weights, so that the important parts receive larger weights. In recent years, AM, by virtue of its ability to capture detailed information, has gradually become a hot topic in hyperspectral classification. Although existing deep learning algorithms have achieved good results, there is still considerable room for improvement [40].
Depthwise separable convolution (DSC) was proposed by Laurent Sifre and has since been widely used and developed [41,42]. Hoang et al. applied DSC to human pose estimation, replacing vanilla convolutions with depthwise separable convolutions to reduce model size, FLOPs, and inference time [43]. Lu et al. used DSC to achieve low power consumption and high recognition accuracy in keyword spotting [44]. Previous studies on DSC show that it can reduce time consumption as much as possible while maintaining accuracy [45]. On the other hand, hyperspectral imagery contains many bands with substantial redundancy between them, and classifying them directly consumes too much time [46]. Therefore, introducing DSC may greatly reduce the time consumption of HRSI classification while guaranteeing accuracy [47].
Based on the above analysis, a 3DCNN model assisted by AM and DSC is proposed to improve information screening ability and reduce time consumption, and three classic hyperspectral datasets are used to analyze its performance. AM helps to screen high-value features, and DSC reduces parameters to improve operational efficiency in the classification process. Meanwhile, to reduce training time and eliminate information redundancy, two dimensionality reduction algorithms and the DSC pruning method are applied in the HRSI processing stage. The main contributions of this paper can be divided into three aspects:
A lightweight approach called DSC is introduced into HRSI classification to reduce time consumption. With fewer kernel moves, DSC reduces the number of parameters and the amount of computation. In our experiments, DSC reduces the training time by up to 91.77% without significantly reducing accuracy in the HRSI classification task;
A new method called 3DCNN-AM-DSC is proposed for hyperspectral image classification. It combines deep feature extraction, high-value information selection, and lightweight convolution to extract various high-value features and to improve classification efficiency. The performance of the model is evaluated on three classic HRSI datasets;
The influence of patch size, the ratio of training samples to test samples, and classic dimensionality reduction methods on classification performance is illustrated. Results show that appropriately increasing the patch size and choosing an appropriate dataset allocation ratio can improve the overall average accuracy, and that data processed by dimensionality reduction and DSC reduce the sample size for model training, greatly improving the efficiency of HRSI classification.
The remainder of this article is organized as follows. Section 2 briefly introduces the related work and the proposed 3DCNN-AM and 3DCNN-AM-DSC methods. Three classic HRSI datasets are then described in detail, and the preprocessing procedure of the experiment is provided in Section 3. Section 4 reports the extensive experimental results and analysis. The strengths and limitations of the proposed method are discussed in Section 5, followed by the conclusion in Section 6.
2. The Proposed 3DCNN-AM/3DCNN-AM-DSC Method
HRSIs [48] are spectral images with narrow spectral intervals and numerous bands, which improve spectral resolution and reflect more continuous spectral features of ground objects [49]. Hyperspectral image data are organized as a three-dimensional data cube combining two-dimensional spatial features and one-dimensional spectral characteristics, which gives HRSI classification its unique advantages. For HRSI, the higher spectral resolution, larger number of spectral bands, and stronger correlation between bands provide abundant features, but they may also result in redundant information.
2.1. Data Dimension Reduction
Because the three-dimensional structure of hyperspectral data is prone to information redundancy in the spectral dimension, dimensionality reduction of the hyperspectral data is added in the preprocessing stage. This operation ensures that enough information is retained for deep learning, image feature extraction, and classification, while reducing training and testing time. The essence of data dimensionality reduction is to map data from the original high-dimensional space to a low-dimensional space; methods are divided into linear and nonlinear dimensionality reduction, or into supervised and unsupervised dimensionality reduction according to whether labels are involved [50]. Next, this paper focuses on two dimensionality reduction methods commonly used in HRSI classification: PCA and AE.
2.1.1. Principal Component Analysis
PCA [51] is one of the most important dimensionality reduction methods in HRSI classification and belongs to unsupervised dimensionality reduction. It only needs an eigenvalue decomposition of the data to achieve data compression and redundancy elimination, that is, dimensionality reduction. PCA maps the original n features to a smaller number of m features, where each new feature is a linear combination of the old features. These linear combinations maximize the sample variance and attempt to make the new m features mutually uncorrelated.
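As a concrete illustration, the following sketch (not the authors' code; the cube size, component count, and function name are illustrative) applies scikit-learn's PCA to the spectral dimension of a hyperspectral cube by treating every pixel spectrum as one sample:

import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube, n_components=30):
    """Map an (H, W, B) hyperspectral cube to (H, W, n_components)."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)              # one pixel spectrum per row
    pca = PCA(n_components=n_components)    # new features = linear combinations of bands
    reduced = pca.fit_transform(flat)       # mutually uncorrelated, variance-maximizing
    return reduced.reshape(h, w, n_components)

# Example on a synthetic 145 x 145 cube with 200 bands (Indian Pines-like size)
cube = np.random.rand(145, 145, 200).astype(np.float32)
print(pca_reduce(cube).shape)               # (145, 145, 30)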
2.1.2. Autoencoder
AE [52] is an unsupervised neural network model that consists of two parts: encoding and decoding. The encoding stage learns the implicit features of the input data, and the decoding stage reconstructs the original input from the learned features. Because a neural network can learn more effective new features, the feature representation obtained from data processed by an AE is stronger. The representation learned by an AE is data-dependent: it can only compress data similar to its training data, and a specific encoder is trained on inputs of a specified kind so that it learns automatically from the data samples. AE belongs to unsupervised learning, and its learning goal is to reconstruct the input without labels. It can be regarded as a three-layer neural network structure with an input layer, a hidden layer, and an output layer, where the input and output layers have the same size; the structure of the AE is shown in Figure 1.
The hidden layer feature output by the encoder, i.e., the "coding feature", can be regarded as a characterization of the input data X. At the same time, the hidden layer feature is the feature obtained by the encoder's dimensionality reduction: the hidden layer Z has lower dimensionality than the input layer X and the output layer X̂, that is, dim(Z) < dim(X) and dim(Z) < dim(X̂). Z is calculated from the input layer X according to the mapping matrix W from X to the hidden layer Z, and X̂ is then calculated from Z according to the mapping matrix W′ from the hidden layer Z to the output layer X̂. The above process can be expressed by Equation (1):
Z = f(WX),  X̂ = g(W′Z),    (1)
where 𝒳 = ℝ^d represents the embedding input space (which is also the output space) and m (with m < d) represents the size of the hidden space. Given the input space 𝒳 and the feature space ℱ = ℝ^m, the autoencoder solves for the mappings f and g between the two spaces so as to minimize the reconstruction error of the input features.
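A minimal sketch of such an encoder-decoder pair, assuming a Keras/TensorFlow backend and illustrative layer sizes (not the authors' exact architecture), could look as follows; the encoder output plays the role of the coding feature Z:

import tensorflow as tf

n_bands, n_hidden = 200, 30                                     # assumed sizes

inputs = tf.keras.Input(shape=(n_bands,))
z = tf.keras.layers.Dense(n_hidden, activation="relu", name="encoder")(inputs)    # f: X -> Z
outputs = tf.keras.layers.Dense(n_bands, activation="linear", name="decoder")(z)  # g: Z -> X_hat

autoencoder = tf.keras.Model(inputs, outputs)
encoder = tf.keras.Model(inputs, z)                             # used to extract the coding feature
autoencoder.compile(optimizer="adam", loss="mse")               # minimize reconstruction error

# spectra: array of per-pixel spectra with shape (num_pixels, n_bands)
# autoencoder.fit(spectra, spectra, epochs=50, batch_size=256)
# reduced = encoder.predict(spectra)                            # shape (num_pixels, n_hidden)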
2.2. Attention Mechanism
The essence of the AM technique is to quickly filter out valuable information from a large amount of chaotic information using limited attention resources, locating the information of interest and suppressing useless information, and presenting the final result in the form of a probability map or a probabilistic feature vector [53]; the former acts on image data and the latter on sequence information. In practical applications, attention includes: (1) soft attention, which considers all the data without any filter and calculates an attention weight for every element; and (2) hard attention [54], which applies a filter after generating the attention weights, discarding ineligible features by setting their attention weights to 0. We used spatial attention and soft attention in this experiment.
The essential idea of AM is shown in Figure 2: the source domain is composed of a series of key/value pairs. When an element in the target domain is queried, the weight coefficient corresponding to each key is obtained by calculating the similarity or correlation between the query and that key; the corresponding values are then weighted and summed to obtain the final attention value [55]. The attention model is intended to alleviate these challenges by giving the decoder access to the complete encoded input sequence. The central idea is to introduce attention weights over the input sequence to prioritize the positions that carry relevant information for producing the next output. The attention module in the architecture automatically learns the attention weights, which capture the correlation between the encoder hidden state (called the candidate state) and the decoder hidden state (called the query state). These attention weights are then used to construct the context vector V, which is passed as input to the decoder. Therefore, the attention value is obtained by a weighted sum of the elements in the source domain, and the query and the keys are used to calculate the weight coefficients of the corresponding values. In other words, it can be roughly expressed as Equation (2):
Attention(Q, Source) = Σ_{i=1..L_x} Similarity(Q, K_i) · V_i,    (2)
where L_x represents the length of the source domain, Attention(·) represents the AM formula, Similarity(·, ·) is a function that measures the similarity or correlation between the query value and each key value, K_i and V_i represent the i-th key/value pair, and Q represents the query value in the target. Conceptually, attention is often understood as selectively sifting a small amount of important information out of a large amount of information and focusing on it while ignoring the mostly unimportant remainder. Focusing is achieved by computing the weight coefficients: the larger the weight, the more the corresponding value is attended to. That is, the weight represents the importance of the information, and the value is a measure of its content.
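As a rough illustration of Equation (2), the following NumPy sketch computes the attention value as a similarity-weighted sum of the values; dot-product similarity with softmax normalization is assumed here, and other similarity functions could be substituted:

import numpy as np

def attention(query, keys, values):
    """query: (d,), keys: (L_x, d), values: (L_x, d_v) -> attention value of shape (d_v,)."""
    scores = keys @ query                       # Similarity(Q, K_i) for i = 1..L_x
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # normalized attention weights
    return weights @ values                     # weighted sum of the values V_i

rng = np.random.default_rng(0)
L_x, d, d_v = 6, 8, 8
out = attention(rng.normal(size=d), rng.normal(size=(L_x, d)), rng.normal(size=(L_x, d_v)))
print(out.shape)                                # (8,)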
2.3. Depthwise Separable Convolution
Depthwise separable convolution (DSC) is one of the two types of separable convolution [56]; it deals not only with the spatial dimensions but also with the depth dimension (the number of channels), so more attributes can be extracted while more parameters are saved. DSC is a common technique in deep learning and consists of two steps: the first is a depthwise convolution, which convolves the input image without changing its depth; the second is a pointwise convolution, which uses a 1 × 1 kernel to increase the number of channels, i.e., to enlarge the depth. Essentially, depthwise separable convolution is a factorization of the convolution kernel along the channel dimension. Compared with a standard 2D convolution, a depthwise separable convolution requires much less computation, accounting for only about 1/N + 1/D_K² of the cost of the standard 2D convolution [57], where N is the number of output channels and D_K is the spatial kernel size. The separable convolution performs fewer kernel moves, reduces the number of parameters required and the amount of computation, and enables the network to process more data in a shorter time. It can improve efficiency under the right circumstances, often significantly and without sacrificing model performance, which makes it a very popular choice [58].
The network structure of DSC is shown in Figure 3. Assume that an input with D_F × D_F pixels and M channels (shape D_F × D_F × M) is processed by depthwise separable convolution. In the first step, the depthwise convolution operates entirely in two dimensions: the number of convolution kernels equals the number of channels of the upper layer, and the channels correspond to the kernels one by one, so an M-channel input produces M feature maps, with each filter containing a single kernel of size D_K × D_K × 1. The kernel size of the second step (the pointwise convolution) is 1 × 1 × M, where M is the number of channels of the upper layer; this convolution forms a weighted combination of the maps produced in the previous step to generate new feature maps. The time complexity of the depthwise separable convolution is D_K · D_K · M · D_F · D_F + M · N · D_F · D_F, whereas the time complexity of an ordinary convolution is D_K · D_K · M · N · D_F · D_F. Therefore, the cost of the depthwise separable convolution is about 1/N + 1/D_K² times that of an ordinary convolution. When the same N feature maps are obtained from the same input, the number of parameters of the separable convolution is likewise about 1/N + 1/D_K² of that of the conventional convolution. Therefore, with the same number of parameters, a neural network using separable convolutions can be made deeper.
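The cost comparison above can be verified with a few lines of arithmetic; the sketch below (illustrative sizes, MobileNet-style variable names) counts multiply-accumulate operations for a standard convolution and a depthwise separable convolution:

def standard_conv_cost(d_k, m, n, d_f):
    # d_k: kernel size, m: input channels, n: output channels, d_f: feature-map size
    return d_k * d_k * m * n * d_f * d_f

def separable_conv_cost(d_k, m, n, d_f):
    depthwise = d_k * d_k * m * d_f * d_f       # one d_k x d_k kernel per input channel
    pointwise = m * n * d_f * d_f               # 1 x 1 convolution across channels
    return depthwise + pointwise

d_k, m, n, d_f = 3, 64, 128, 32
ratio = separable_conv_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)
print(f"separable / standard = {ratio:.3f}")    # ~0.119, i.e. about 1/n + 1/d_k**2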
2.4. Classification Model
In existing HRSI classification studies, 3DCNN is often used to obtain rich spatial-spectral features from hyperspectral images, but there is still considerable room for improvement in classification accuracy and time consumption. At the same time, considering the importance of AM for feature selection, a classification model integrating AM and 3DCNN can improve the robustness and accuracy of individual features to a certain extent [59]. To reduce the time consumption caused by 3DCNN, DSC is introduced into hyperspectral image classification, and we call the resulting model 3DCNN-AM-DSC. The 3DCNN-AM/3DCNN-AM-DSC classification models and the data processing flow designed in this article are shown in Figure 4.
At the preprocessing stage, the raw hyperspectral data are processed with conventional remote sensing operations such as data standardization, and a dimensionality reduction method (PCA/AE) is applied to reduce the data dimension. Specifically, we define a sampling function that first shuffles the samples of each category and then assigns them randomly in proportion. All training samples are stored in the training set, all validation samples are placed in the validation set, and all test samples are stored in the test set. The assignment of these samples is random and fully automatic, and no supervision is required; therefore, there is no overlap between these datasets. As hyperspectral images are rich in spectral information, 1D convolution is used to extract spectral features, 2D convolution is used to extract spatial information, and 3D convolution is used to extract joint spectral-spatial features.
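A minimal sketch of such a per-class random split is given below; the function name, ratios, and label array are illustrative and not the authors' released code:

import numpy as np

def split_per_class(labels, train_ratio=0.1, val_ratio=0.1, seed=0):
    """Shuffle the samples of each class and split them into disjoint index sets."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx, test_idx = [], [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)                                   # disrupt each category's samples
        n_train = int(len(idx) * train_ratio)
        n_val = int(len(idx) * val_ratio)
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
        test_idx.extend(idx[n_train + n_val:])             # remainder: no overlap between sets
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)

labels = np.random.randint(0, 16, size=10000)              # e.g., 16 ground-object classes
tr, va, te = split_per_class(labels)
print(len(tr), len(va), len(te))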
In the 3DCNN-AM model, convolution layers, max-pooling layers, and fully connected layers are stacked alternately. The convolution layer is the most important part of a CNN: in each convolution layer, learnable filters are convolved with the input to generate multiple feature maps. After several convolution layers, max-pooling layers are inserted regularly to eliminate information redundancy in the image. Max-pooling further reduces the spatial size of the feature maps and decreases the training parameters and the computation of the neural network; as the feature maps shrink, the extracted feature representation becomes more abstract. After the max-pooling operation, the feature maps of the upper layer are flattened and input into the fully connected (FC) layer. In traditional neural networks, the FC layer reshapes the feature maps into n-dimensional vectors to extract deeper and more abstract features. A simple hidden layer built with a lambda function transforms the data inherited by the attention module; the two branches are then stacked, deep features are extracted with 3D convolution, and the deep features are processed through a multiply operation. The effect is equivalent to a weighted representation of the overall features, so the function of this module is to further perform feature selection on the inherited data.
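The following Keras sketch reflects our reading of the described 3DCNN-AM structure (it is not the authors' released code; all kernel sizes, filter counts, and the exact form of the attention branch are illustrative): a small 3D CNN whose feature maps are re-weighted by a Lambda/multiply branch before the fully connected head:

import tensorflow as tf
from tensorflow.keras import layers

def build_3dcnn_am(patch=11, bands=30, n_classes=16):
    inp = tf.keras.Input(shape=(patch, patch, bands, 1))
    x = layers.Conv3D(8, (3, 3, 7), activation="relu", padding="same")(inp)
    x = layers.MaxPooling3D((2, 2, 2))(x)
    x = layers.Conv3D(16, (3, 3, 5), activation="relu", padding="same")(x)

    # attention-style branch: squash the channels with a Lambda layer, learn
    # per-position weights, and multiply them back onto the feature maps
    att = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    att = layers.Conv3D(1, (1, 1, 1), activation="sigmoid")(att)
    x = layers.Multiply()([x, att])

    x = layers.MaxPooling3D((2, 2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_3dcnn_am()
model.summary()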
Unlike 3DCNN-AM, in 3DCNN-AM-DSC the convolutions are replaced by DSC, the first pooling layer uses average pooling instead of max-pooling, and the second max-pooling is replaced with global average pooling. Average pooling takes more local information into account by averaging over each window, and can retain feature invariance while reducing parameters. Global average pooling uniformly pools each whole feature map and feeds the result into a softmax layer to obtain the score of each category. In a traditional classification network, the fully connected layers account for a large proportion of the parameters, so replacing the fully connected layer with global average pooling greatly reduces the network parameters and over-fitting. In addition, each channel of the output feature maps is given a class meaning, removing the black-box nature of the fully connected layer.
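A minimal sketch of these two modifications, assuming a TensorFlow >= 2.3 backend (illustrative layer sizes, not the authors' code): the depthwise step is emulated with a grouped Conv3D followed by a 1 × 1 × 1 pointwise Conv3D, and the fully connected head is replaced by global average pooling:

import tensorflow as tf
from tensorflow.keras import layers

def separable_conv3d(x, filters, kernel_size):
    channels = x.shape[-1]
    x = layers.Conv3D(channels, kernel_size, padding="same",
                      groups=channels, activation="relu")(x)     # depthwise step: one kernel per channel
    x = layers.Conv3D(filters, (1, 1, 1), activation="relu")(x)  # pointwise 1x1x1 step
    return x

def build_3dcnn_am_dsc(patch=11, bands=30, n_classes=16):
    inp = tf.keras.Input(shape=(patch, patch, bands, 1))
    x = layers.Conv3D(8, (3, 3, 7), padding="same", activation="relu")(inp)
    x = layers.AveragePooling3D((2, 2, 2))(x)                    # average pooling first
    x = separable_conv3d(x, 16, (3, 3, 5))
    x = layers.GlobalAveragePooling3D()(x)                       # replaces the FC head
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_3dcnn_am_dsc()
print(model.count_params())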
2.5. Evaluation Indicators
Four evaluation indicators are considered in our experimental analysis: the overall accuracy (OA), the average accuracy (AA), the kappa coefficient (KC), and the K-fold cross-validation accuracy (A_cv) [60]. The confusion matrix clearly represents, for each class, the number of correctly classified samples, the number of misclassified samples, and the total number of samples. However, the confusion matrix does not directly reflect classification accuracy, so various accuracy indicators are derived from it, among which OA, AA, and KC are the most widely used.
Overall classification accuracy (OA) refers to the proportion of correctly classified samples to the total number of samples. It can be expressed using Equation (3):
OA = ( Σ_{i=1..C} m_i ) / ( Σ_{i=1..C} N_i ),    (3)
where C represents the number of classes, N_i is the number of samples to be classified in class i, and m_i is the number of samples correctly classified in class i.
Average classification accuracy (AA) is the average of the classification accuracies of all classes, reflecting the per-class performance. Its calculation is shown in Equation (4):
AA = (1/C) Σ_{i=1..C} ( m_i / N_i ).    (4)
The KC represents the reduction in error of the classification relative to a completely random classification; it is derived from the confusion matrix and used for consistency testing. It can be calculated using Equation (5):
KC = (p_o − p_e) / (1 − p_e),    (5)
where p_o is the observed agreement (equal to OA) and p_e is the expected chance agreement computed from the row and column totals of the confusion matrix.
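The three indicators can be computed directly from a confusion matrix, as in the short sketch below (the example matrix is illustrative):

import numpy as np

def oa_aa_kappa(cm):
    """cm: square confusion matrix, rows = reference classes, columns = predictions."""
    total = cm.sum()
    oa = np.trace(cm) / total                                # Equation (3)
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))               # Equation (4)
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total**2  # chance agreement
    kc = (oa - pe) / (1 - pe)                                # Equation (5)
    return oa, aa, kc

cm = np.array([[50,  2,  1],
               [ 3, 45,  2],
               [ 0,  4, 43]])
print(oa_aa_kappa(cm))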
The K-fold cross-validation accuracy (A_cv) [61] can avoid over-learning and under-learning, and its results are more convincing. First, 10% of the total samples are randomly selected as the validation set, and the remaining 90% are split into K sub-datasets. One sub-dataset is reserved as testing data, and the other K − 1 sub-datasets are used for training. Cross-validation is repeated K times, with each sub-dataset used exactly once for testing, yielding the accuracy A_k of the k-th fold. A single estimate A_cv is obtained by averaging the K results. The advantage of this method is that it repeatedly uses randomly generated subsamples for training and verification, and each result is verified once. There is no overlap between the training and testing sets: each sample is used either for training or for testing, and never for both in the same experiment. A_cv can be calculated using Equation (6):
A_cv = (1/K) Σ_{k=1..K} A_k.    (6)
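A minimal sketch of Equation (6) is given below; train_and_evaluate is a hypothetical placeholder for fitting the model on the K − 1 training folds and scoring the held-out fold, and the prior 10% validation split is omitted for brevity:

import numpy as np
from sklearn.model_selection import KFold

def kfold_accuracy(X, y, train_and_evaluate, k=5, seed=0):
    """Average accuracy over K folds: A_cv = (1/K) * sum_k A_k."""
    accs = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        accs.append(train_and_evaluate(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
    return float(np.mean(accs))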
5. Discussion
DSC is introduced to improve the efficiency of 3DCNN/3DCNN-AM, and 3DCNN-AM-DSC is proposed for classifying hyperspectral images; it can significantly reduce time consumption while maintaining comparable accuracy. We compared the classification performance on the three datasets across the different models (3DCNN and 3DCNN-AM), their DSC variants, and two dimensionality reduction methods (PCA and AE). On the IP dataset, 3DCNN-AM with PCA gives the better classification results, and 3DCNN-AM-DSC requires less computation time to obtain comparable results. On the UP dataset, 3DCNN combined with PCA obtains higher accuracy and 3DCNN-DSC achieves the best efficiency. On the UH dataset, better classification results were obtained using 3DCNN with PCA, while 3DCNN-AM-DSC took the least time. The results show that DSC is superior to traditional dimensionality reduction methods in terms of time while retaining a reasonable degree of accuracy and applicability, although the benefit of AM varies by dataset. In addition, the classification performance of SVM is closely related to the characteristics of the dataset itself [65]: it is higher on some datasets and lower on others. The performance of our proposed method is broadly similar across the three datasets, greatly improving the efficiency of HRSI classification while obtaining good classification accuracy.
From the perspective of overall classification performance, 3DCNN with PCA and 3DCNN-AM with PCA achieve the better classification results, whereas 3DCNN-DSC and 3DCNN-AM-DSC achieve relatively high time efficiency while still taking classification accuracy into account. 3DCNN with PCA obtains high accuracy on the UP and UH datasets, with corresponding time consumptions of 4253.49 s and 1139.67 s, respectively. Similarly, 3DCNN-AM with PCA obtains the best classification accuracy on the IP dataset, requiring 898.07 s. 3DCNN-DSC achieves the best time efficiency on the UP and UH datasets, with corresponding classification accuracies of 97.65% and 97.22%; its time efficiency is improved by 77.10% and 91.77%, and its classification accuracies are 1.96% and 2.38% lower than the best ones, respectively. 3DCNN-AM-DSC obtains the best time efficiency on the IP dataset, with a classification accuracy of 96.68%; its time efficiency is improved by 87.08%, and its classification accuracy is 1.79% lower than the best one. Across the three datasets, the time efficiency of 3DCNN-AM-DSC is improved by at least 75.77% and at most 87.37%, and that of 3DCNN-DSC by at least 77.10% and at most 91.77%. In short, the introduction of DSC can reduce time consumption by a maximum of 91.77%, at the cost of at most a 2.38% reduction in accuracy compared with the other methods.
In order to compare our results with those of other researchers, we limit the discussion to papers that use the Indian Pines dataset and classify the same ground objects as ours. The comparison results are shown in Table 11. Owing to differences in batch size and allocation ratio, most of the compared work focuses on improving OA and reducing time consumption. In this study, with the IP dataset and the 3DCNN-AM-DSC model, an OA of 96.68% and a time consumption of 234.48 s were obtained.
In a study by Lu et al. [10], CSMS-SSRN uses an attention mechanism to enhance the expressiveness of image features in both the channel and spatial domains; classification accuracy is thereby improved, achieving 95.58% OA with a time consumption of 605.86 s. That experiment compares models such as CNN, SSRN, and FDSSC and obtains better classification accuracy than them, but the network structure of CSMS-SSRN is more complex, so it takes more time to reach higher accuracy. In the experiments of Li et al. [12], 3DCNN and three other deep learning models (SAE, DBN, and 2DCNN) were tested; 3DCNN obtains 99.07% accuracy, outperforming the other three methods, but time consumption is not discussed. Chen et al. [21] carried out a similar study and proposed an HRSI classification method based on PCA and SVM; compared with traditional dimensionality reduction methods such as PCA and LLE, the accuracy of PCA-SVM is as high as 99.73%, but again time is not discussed. Hsieh and Kiang [26] developed several versions of convolution to address possible misclassification between similar labels by augmenting the input vector; the 2DCNN on principal components reaches a relatively high accuracy of 91.05% in 1020.00 s. To solve the HSI feature extraction and classification problems with limited training samples, Chen et al. [27] proposed a 3DCNN-LR model; compared with the 3DRBF-SVM model, 3DCNN-LR improves OA by 5.14% but increases time consumption by about five times. Mou and Zhu [66] used a spectral attention module and obtained an OA of 92.22%, without comparing time consumption; however, directly processing the original data without dimensionality reduction or DSC pruning consumes a lot of time. He et al. [67] used the bidirectional encoder of the BERT transformer and obtained a high accuracy of 98.77% with a time consumption of 432 s. Compared with our proposed 3DCNN-AM-DSC, sparse representation-based methods such as those in [68,69,70,71] also obtain good classification accuracy, but computing the representation coefficients of intra- and inter-class samples takes a lot of time. Compared with the SLRC (98.86%), MSS-GF (97.58%), and SDSC-AI (96.3%) methods, 3DCNN-AM-DSC gives an OA at least 0.94% lower, but with lower time consumption. Overall, the maximum accuracy difference between 3DCNN-AM-DSC and these methods is 2.18%, and in terms of time efficiency 3DCNN-AM-DSC is very competitive. The experiments show that, with only five convolution layers, 3DCNN-AM-DSC achieves results comparable to complex neural networks, informative simple neural networks, and sparse representation-based models on HRSI classification tasks [72].
In addition, the classification performance of different classifiers depends on the number of training samples per class and on the patch size. In terms of both class-specific and overall classification accuracy obtained with different types of features, the proposed strategy can significantly improve classification performance even when the number of training samples per class and the patch size differ.
6. Conclusions
A method called 3DCNN-AM-DSC is proposed for HRSI classification and prediction. Specifically, AM is introduced into the HRSI classification task to ensure a good representation of the learned features, and DSC is introduced to reduce the convolution parameters and improve computational efficiency. The experimental results demonstrate the advantages of the proposed method compared with several state-of-the-art HRSI classification methods. The 3DCNN-AM-DSC method provides an alternative to dimensionality reduction for hyperspectral classification: instead of reducing the data dimension, the model parameters are reduced through DSC, which also greatly improves time efficiency and reduces the amount of computation, at the cost of only a slight decrease in accuracy. Compared with traditional dimensionality reduction methods, our approach is more time efficient, simpler to apply, and does not damage the band continuity of the ground objects in the original image.
However, it should be noted that, while 3DCNN-AM-DSC substantially reduces time consumption while keeping comparable classification accuracy, it still struggles when the HRSI sample sizes are unbalanced. In follow-up work, we will focus on the case of unbalanced sample sizes and consider other feature selection and screening strategies, combined with other lightweight convolutions, pruning techniques, and neural architecture search, to further improve classification accuracy and reduce the time consumption of HRSI classification. Meanwhile, meta-learning will be used to improve the generalization ability of the trained model and reduce negative transfer between different types of hyperspectral data, thereby reducing long-term, large-scale training pressure and improving time efficiency.