Joint Alternate Small Convolution and Feature Reuse for Hyperspectral Image Classification

A hyperspectral image (HSI) contains fine and rich spectral and spatial information about ground objects, which gives it great application potential; it is widely used in precision agriculture, marine monitoring, military reconnaissance, and many other fields. In recent years, convolutional neural networks (CNNs) have been successfully applied to HSI classification and have markedly improved classification performance. To mitigate the strong correlation among bands that hampers HSI classification, an effective CNN architecture is proposed in this work. The proposed architecture has several distinct advantages. First, each 1D spectral vector that corresponds to a pixel in an HSI is transformed into a 2D spectral feature matrix, thereby emphasizing the differences among samples. This transformation not only weakens the influence of strong inter-band correlation on classification but also allows the spectral information of the hyperspectral data to be fully utilized. Furthermore, 1 × 1 convolutional layers are adopted to better process HSI information, and all convolutional layers in the architecture use small convolutional kernels. Moreover, cascaded composite layers consisting of 1 × 1 and 3 × 3 convolutional layers are employed; the input and output of each composite layer are concatenated to form the input of the next composite layer, thereby accomplishing feature reuse. This module, which combines alternating small convolutions with feature reuse, can extract high-level features from hyperspectral data meticulously and comprehensively and alleviates the overfitting problem to an extent, yielding a considerable classification gain. Finally, global average pooling replaces the traditional fully connected layer at the end of the architecture to reduce the number of model parameters while extracting high-dimensional features from the hyperspectral data.
Experimental results on three benchmark HSI datasets show the high classification accuracy and effectiveness of the proposed method.


Introduction
Hyperspectral remote-sensing technology has been an important part of comprehensive Earth observation research since the 1980s and is a key point in the competition for international Earth observation technology. Its applications have gradually extended to many fields, such as environmental monitoring, land use, resource investigation, and atmospheric research. Hyperspectral image (HSI) data combine images and spectra; that is, each pixel in an HSI corresponds to a spectral curve. Because of this characteristic, the class of ground truth can be identified on the basis of its spectral reflectance. Each pixel in an HSI corresponds to hundreds or even thousands of bands. Moreover, the spectral resolution of an HSI reaches the nanoscale and contains abundant, detailed ground truth information. Traditional multispectral data, by contrast, contain only a few spectral bands and have low spectral resolution. Therefore, hyperspectral data are more advantageous than multispectral data for identifying and classifying ground truth. With hyperspectral data, a substantial variety of fine ground truth can be identified, and various samples and band combinations can be selected flexibly to obtain different features and fulfill varying task demands during ground truth recognition and classification. Furthermore, with the development of big data technology, HSI will have a bright future in remote-sensing big data research because of its considerable potential application value.
HSIs bring substantial opportunities to ground truth recognition and classification because of their rich information, but they remain a challenge for traditional remote-sensing image classification methods due to the large number of bands and the insufficient training samples of HSI data. On the one hand, as the number of bands increases, the classification accuracy obtained by directly using the information of all bands is likely to decrease, the so-called Hughes phenomenon [1]. On the other hand, classification speed also restricts the application of HSIs as the number of bands increases. Many research institutions and scholars have explored HSI classification in theory and application and have proposed many mature, classic HSI classification methods to fully utilize the advantages of hyperspectral remote-sensing technology. From the perspective of HSI description space, HSI classification methods can be divided into two categories [2]: methods based on spectral space and methods based on feature space. The former use the spectral curve, which reflects the spectral feature of ground truth, to identify the ground truth; representative methods include minimum distance spectrum matching and spectral angle matching. The latter use the statistical characteristics of ground truth in the feature space to build classification models; representative methods include the decision tree [3], the artificial neural network (ANN) [4], and the support vector machine (SVM) [5]. The ground truth that corresponds to the same pixel is not unique, and many mixed pixels exist due to the "same objects with different spectra, different objects with the same spectrum" phenomenon and the low spatial resolution of HSIs. Consequently, the classification accuracy of methods based on spectral space is seriously affected. By contrast, classification methods based on feature space are not constrained by these factors and are hence favored by scholars. For instance, Sun et al. [5] proposed a band-weighted SVM (BWSVM) method to classify HSIs. Li et al. [6] adopted a radial basis function to implement a kernel extreme learning machine (KELM), which provided better classification performance on HSIs than kernel SVM. Wei et al. [7] functionalized hyperspectral pixels, adopted functional principal component analysis to reduce the dimensionality of the functional data, and classified the reduced-dimensionality data with KELM. Although scholars have attained many excellent research achievements in HSI classification, the characteristics of hyperspectral data, such as high dimensionality, large computation, and strong correlation among bands, still restrict traditional methods from improving classification accuracy. Furthermore, owing to their shallow structures, the above methods have limited feature-extraction ability, so their classification performance on HSIs easily encounters bottlenecks. The convolutional neural network (CNN), with its deep structure and end-to-end learning mode, has strong feature-learning capability. In such a learning mode, low-level features are abstracted layer by layer into the high-level features required by the relevant tasks. In particular, CNNs have demonstrated remarkable performance in image classification and target detection [8-10], which makes them promising for effectively improving HSI classification performance.
A CNN is a special feedforward ANN. About 20 years ago, LeCun et al. [11] trained a CNN using the back-propagation algorithm and gradient-based learning and demonstrated its effectiveness on a handwritten digit recognition task. However, the development of CNNs subsequently declined due to the constraints of computing power and the difficulty of theoretical analysis for neural networks. In 2012, Hinton et al. [12] won the ImageNet image classification challenge with AlexNet, whose accuracy was approximately 12% higher than that of the immediate runner-up. Since then, CNNs such as NIN (Network in Network) [13], GoogLeNet [14], DeepID2+ [15], and ResNet [16] have made great historical breakthroughs and drawn a resurgence of attention to various visual recognition tasks. CNNs are also being increasingly applied to HSI classification. For example, Chen et al. [17] exploited a CNN to extract deep features of hyperspectral data and addressed the problems of high dimensionality and limited samples in hyperspectral data. Yue et al. [18] proposed a classification framework composed of an exponential-momentum deep CNN and an SVM, which constructs high-level spectral-spatial features hierarchically in an automated manner. Hu et al. [19] proposed a CNN model that classifies HSIs directly in the spectral domain; its architecture contains only one convolutional layer and one fully connected layer, which may make it hard to extract robust spectral features when the number of training samples per class is small. The 3D-CNN has also been applied to hyperspectral classification [20-22]; it can extract spectral and spatial features simultaneously and provides excellent classification performance. Makantasis et al. [23] encoded the spatial and spectral information of HSIs with a CNN and adopted randomized PCA along the spectral dimension to reduce the dimensionality of the raw input data. The use of CNNs effectively improves HSI classification. However, the strong correlation among bands in hyperspectral data, an important factor that restricts the improvement of HSI classification accuracy, has rarely been studied. Furthermore, the application of CNNs to HSI classification is not yet mature, and shortcomings such as weak generalization capability and susceptibility to overfitting remain.
A new deep CNN architecture that realizes the classification task by learning hyperspectral data features layer by layer is proposed in this work to solve the aforementioned problems. The major contributions of this work can be summarized as follows.

1.
Unlike existing HSI classification methods, this work transforms the 1D spectral vectors of hyperspectral data into 2D spectral feature matrices. The spectral features are mapped from 1D to 2D space, and the variations among different samples, especially those among samples from various classes, are highlighted. This enables the CNN to fully use the spectral information from each band and to extract the spectral features of the hyperspectral data accurately. Meanwhile, the interference of highly correlated bands with HSI classification is weakened.

2.
The entire network architecture adopts small convolution kernels with sizes of 3 × 3 or 1 × 1 to form the convolutional layers. The conversion of the 1D spectral vector into a 2D spectral feature matrix weakens, but cannot eliminate, the interference of highly correlated bands with HSI classification. Adopting convolutional kernels of different sizes yields local receptive fields of varying sizes; after multilayer abstraction, the correlation among different spectral bands is gradually weakened, and the entire network can learn the features of HSIs meticulously and robustly. Furthermore, cascaded 1 × 1 convolutional layers increase the non-linearity of the network and make the spectral features of the hyperspectral data increasingly abstract. Simultaneously, the correlation among bands in the hyperspectral data is weakened, and the features of the hyperspectral data are learned effectively.

3.
1 × 1 and 3 × 3 convolutional layers are cascaded to form a special composite layer. The 1 × 1 convolutional layer integrates the high-level spectral features output by the preceding layer from a global perspective and increases the compactness of the proposed CNN architecture. The 3 × 3 convolutional layer then learns the features integrated by the 1 × 1 convolutional layer in detail from multiple local perspectives. Multiple composite layers are cascaded so that 1 × 1 and 3 × 3 convolutional layers alternate in the network. In a cross-layer connection, the input and output of each composite layer are spliced into new features along the feature dimension and passed to the next composite layer, thus accomplishing feature reuse. This combination of alternating small convolutions and feature reuse is called the ASC-FR module.
When extracting the features of hyperspectral data, the ASC-FR module constantly switches between global and local perspectives. Therefore, this module ensures that the spectral features are fully utilized after multilayer abstraction, that the deep features of hyperspectral data are extracted comprehensively and meticulously, and that the adverse effects of strong inter-band correlation on classification are weakened. To a certain extent, overfitting and gradient vanishing in the proposed CNN architecture are alleviated. Thus, this module can effectively improve the accuracy of HSI classification.
The remainder of this paper is organized as follows. Section 2 briefly introduces the advantages of the small convolution kernel and describes some relevant work on CNN-based HSI classification. Section 3 describes the overall design of the proposed method. Section 4 evaluates the classification performance of this work through comparative experiments and analyses. Section 5 concludes the paper.

Small Convolution
Generally, a CNN contains an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. Each convolutional layer comprises several convolutional kernels, which extract local features from the input feature maps. Small convolutional kernels (3 × 3 or 5 × 5) have more advantages than large ones. Through multilayer stacking, small convolutional kernels can provide a receptive field of the same size as that provided by large convolutional kernels. Moreover, the use of small convolutional kernels brings two advantages. First, stacking multiple layers of small convolutional kernels increases the network depth, thereby enhancing the capacity and complexity of the network. Second, the number of model parameters can be reduced effectively. In VGG-Net [24], proposed by the Visual Geometry Group of the University of Oxford, all convolutional layers adopt 3 × 3 convolutional kernels, which reduces the network parameters effectively and enhances the fitting capability of the network. As for the even smaller 1 × 1 convolutional kernel, its capacity was first demonstrated in the NIN [13] in 2014. This network has a multilayer perceptron convolutional (mlpconv) layer, which performs cascaded cross-channel parametric pooling (CCCP) on a normal convolutional layer. The CCCP structure allows the network to learn complex and useful cross-channel integration features, and a CCCP layer is equivalent to a convolutional layer with 1 × 1 convolutional kernels. After the appearance of NIN, GoogLeNet, ResNet, and their families also widely adopted small convolutional kernels and demonstrated that such kernels can greatly improve the feature-learning capability of networks.
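As a rough illustration of the parameter savings, the following sketch (plain Python, with hypothetical channel widths) compares the weights needed by one 5 × 5 convolution with those of two stacked 3 × 3 convolutions, which cover the same 5 × 5 receptive field:

```python
def conv_params(k, c_in, c_out, bias=True):
    """Weight count of a k x k convolutional layer (ignoring any BN)."""
    return k * k * c_in * c_out + (c_out if bias else 0)

# Hypothetical widths: 64 input channels and 64 output channels.
c = 64

# One 5 x 5 layer: receptive field 5 x 5.
single_5x5 = conv_params(5, c, c)

# Two stacked 3 x 3 layers: receptive field also 5 x 5,
# with an extra non-linearity in between.
stacked_3x3 = conv_params(3, c, c) + conv_params(3, c, c)

print(single_5x5, stacked_3x3)  # 102464 73856
```

The stacked pair uses roughly 28% fewer parameters while adding depth, which is the trade-off VGG-Net exploits.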

Convolutional Neural Network (CNN)-Based Classification for Hyperspectral Image (HSI)
CNNs have demonstrated remarkable performance in HSI classification, denoising [25], segmentation [26], and so on. According to the manner of implementation, CNN-based HSI classification methods can be divided into 1D-CNN-based, 2D-CNN-based, and 3D-CNN-based methods, and methods that combine a CNN with other approaches. Methods based on 1D-CNNs [17,19] usually assume that each pixel in an HSI corresponds to only one class and directly use the 1D spectral information for classification. Although such methods are easy to implement, they are seriously affected by mixed pixels and by the strong correlation among bands, so the model cannot effectively learn the spectral features of each pixel and has poor adaptability. Methods based on 2D-CNNs [27,28] usually adopt some approach (such as principal component analysis or an autoencoder) to extract the main components of the spectral information in advance and then extract neighborhood pixel blocks from the image as samples. Methods based on 3D-CNNs utilize pixel cubes to implement hyperspectral classification [20-22]. Both of these kinds of methods introduce spatial information, which enables the model to learn more useful features and significantly improves the classification effect. However, they require a large number of labeled pixels to generate sufficient samples for training the CNN model; given the limited number of labeled pixels in HSIs, this aggravates the shortage of training samples and makes the model prone to overfitting.
These two kinds of methods also usually assume that the pixels in the same spatial neighborhood have spectral characteristics similar to those of the central pixel and belong to the same class as the central pixel. This ignores the fact that pixels in the same neighborhood may represent different classes of ground truth; that is, it is impossible to ensure that only one class of objects is included in the same neighborhood. Therefore, if the spatial size of the neighborhood pixel blocks is large, these methods will be disturbed by heterogeneous noise (pixels in a neighborhood block that belong to a different class from the central pixel). The last kind of method [29,30] combines a CNN with other approaches, which compensate for some shortcomings of the CNN in HSI classification and thereby improve the classification performance of the CNN model. This kind of method requires fully exploiting the characteristics of the CNN and deep theoretical study; the implementation process is relatively tedious, but the performance is often excellent.


HSI Classification Method Based on Alternating Small Convolutions and Feature Reuse (ASC-FR)

Data Pre-processing
Hyperspectral data contain a large number of bands and rich spectral values. Considering these characteristics, the 1D spectral vector of each pixel is converted into a 2D feature matrix. Each pixel in the original HSI then corresponds to a spectral feature map, and the differences among samples become more remarkable. Feeding the spectral feature maps into the CNN not only lets the network learn the spectral features of the hyperspectral data effectively but also reduces the influence of highly correlated bands on HSI classification and removes the constraint that the high dimensionality of hyperspectral data imposes on classification. Therefore, this pre-processing technique can improve classification accuracy effectively. The procedure of data pre-processing is displayed in Figure 1. Note that a few bands should be removed before the spectral vector is transformed into a feature map. Removing a few bands eliminates redundant information, weakens the correlation among bands, and enhances inter-class separability, which benefits HSI classification accuracy. However, if too many bands are removed, useful information is lost and the classification performance of the model degrades. Therefore, the number of removed bands must be controlled reasonably so that the advantages outweigh the disadvantages. Moreover, to convert the 1D spectral vector into a square spectral feature map, the number of reserved bands should be a square number.
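This pre-processing step can be sketched as follows (NumPy, with hypothetical band indices and sizes; the actual bands removed are dataset-specific, as described later):

```python
import numpy as np

def preprocess_pixel(spectrum, bands_to_remove):
    """Drop selected bands, reshape the remaining 1D spectral vector
    into a square 2D feature map, and apply zero-mean treatment."""
    kept = np.delete(spectrum, bands_to_remove)
    side = int(np.sqrt(kept.size))
    if side * side != kept.size:
        raise ValueError("number of reserved bands must be a square number")
    fmap = kept.reshape(side, side).astype(np.float64)
    return fmap - fmap.mean()  # zero-mean treatment

# Hypothetical example: a 220-band pixel with 24 bands removed -> 196 = 14 x 14.
spectrum = np.random.rand(220)
remove = np.arange(24)  # placeholder indices; the real lists are dataset-specific
fmap = preprocess_pixel(spectrum, remove)
print(fmap.shape)  # (14, 14)
```

The resulting 14 × 14 maps are what the network consumes in place of raw 1D spectra.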

Proposed CNN Architecture
In designing the network structure of this work, improving the HSI classification accuracy is the goal, and deepening the abstraction of the hyperspectral data features while avoiding gradient vanishing and overfitting as far as possible are the main principles. The proposed CNN architecture is shown in Figure 2. At the input end, three 3 × 3 convolutional layers are stacked, followed by two cascaded 1 × 1 convolutional layers, that is, an mlpconv layer. This layer enhances the complexity of the network and makes the features of the hyperspectral data increasingly abstract. The output features are then integrated by overlapping max pooling and forwarded to two cascaded composite layers. The obtained features are transmitted to a 1 × 1 convolutional layer and an average pooling layer. Finally, global average pooling (GAP) is applied to the whole feature map, and the classification result is output by a Softmax layer. The output of each convolutional layer is processed by batch normalization (BN) and a rectified linear unit (ReLU) in turn, that is, Conv → BN → ReLU.
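The spatial bookkeeping through this pipeline can be traced with a short sketch (plain Python; the pooling window and stride are assumptions, since Figure 2 is not reproduced here):

```python
def pool_out(size, window, stride):
    """Spatial output size of a valid-padded pooling window."""
    return (size - window) // stride + 1

# Input: a 14 x 14 spectral feature map (196 reserved bands).
size = 14

# The three 3x3 conv layers and the two-layer 1x1 mlpconv block are
# assumed to use 'same' padding, leaving the spatial size at 14 x 14.
size_after_convs = size

# Overlapping max pooling, assumed here to be a 3x3 window with stride 2
# (a window larger than the stride is what makes the pooling "overlapping").
size_after_pool = pool_out(size_after_convs, 3, 2)

# The composite layers and final 1x1 conv again keep the spatial size;
# GAP then collapses each channel map to a single value for Softmax.
print(size_after_pool)  # 6
```

Under these assumptions the composite layers operate on 6 × 6 feature maps, and GAP removes the need for a parameter-heavy fully connected layer.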
Width expansion rate. In this work, the width (channel number) of each convolutional layer is set as a multiple of g. Because the width of the network increases with g, g is called the width expansion rate. Different width expansion rates mean that the amount of new feature information received by each layer varies. Therefore, the width of the network can be adjusted, and the efficiency of the network parameters improved, by adjusting the width expansion rate. In the experiments, the output dimension of nearly all convolutional layers, except the composite layers, is set to 2g so that the width and depth of the network can be adjusted conveniently.
ASC-FR module. The composite layer consists of 1 × 1 and 3 × 3 convolutional layers, and the output dimensions of both are set to 4g. In this manner, the output dimension of each composite layer is kept constant and the expansion of the network is facilitated. Through a stitching operation, the input and output features of each composite layer are combined into new features that form the input of the next composite layer. This operation realizes feature reuse, which helps prevent overfitting. These details constitute the ASC-FR module. When the feature dimension is high after splicing, the composite layer can reduce the dimensionality of the features, thereby increasing the compactness of the network structure and reducing computation.
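The feature-reuse splice can be illustrated with NumPy alone (the convolutions are replaced by a stub that only fixes the output channel count, which is what matters for the channel bookkeeping; g and the input width are hypothetical values):

```python
import numpy as np

g = 16  # hypothetical width expansion rate

def composite_layer(x, g):
    """Stand-in for the 1x1 + 3x3 composite layer: both convolutions
    preserve the spatial size, and the final output has 4g channels.
    A real implementation would use learned convolution weights."""
    h, w, _ = x.shape
    out = np.random.rand(h, w, 4 * g)   # placeholder for the conv output
    # Feature reuse: splice input and output along the channel axis
    # to form the input of the next composite layer.
    return np.concatenate([x, out], axis=-1)

x = np.random.rand(6, 6, 2 * g)         # features after max pooling (2g wide)
x = composite_layer(x, g)               # channels: 2g + 4g = 6g
x = composite_layer(x, g)               # channels: 6g + 4g = 10g
print(x.shape)  # (6, 6, 160) for g = 16
```

Because each composite layer contributes a fixed 4g channels, the input width grows linearly with depth rather than geometrically, which is what keeps the module compact.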

Experiments and Analysis
The proposed CNN architecture is implemented in TensorFlow 1.4.2. The experiments are conducted on a desktop PC with a Windows 7 64-bit OS, an Intel(R) Core(TM) i5-4460 CPU, 8 GB of RAM, and an NVIDIA GeForce GTX 1070 8 GB GPU. Three benchmark datasets are used to evaluate the classification performance of the proposed method. This section introduces the datasets, provides details of the experimental design, and presents analyses of the experimental results. The Indian Pines image was captured by the AVIRIS sensor with a spatial resolution of 20 m over the Indian Pines test site in north-western Indiana. The image contains 145 × 145 pixels and 224 spectral reflectance bands whose wavelengths range from 0.4 μm to 2.5 μm. After four zero bands are discarded, 220 bands remain. There are 16 ground-truth classes and 10,249 labeled pixels. Figure 3 displays the details of the Indian Pines image, and Table 1 lists the number of samples per class in the Indian Pines dataset. For the Indian Pines dataset, 20 bands that cover the region of water absorption and four bands with a low signal-to-noise ratio (SNR), namely bands [104-108, 150-165, 218-220], are removed, and the 196 preserved bands are used for classification. Figure 4 displays the mean spectral signatures of the 16 ground-truth classes. The 1D spectral vector that corresponds to each sample is converted into a spectral feature map of 14 × 14 pixels, and the data are then processed by zero-mean treatment. Figure 5 shows the spectral feature maps of the 16 ground-truth classes after the spectral values are quantized at 20 levels. Figure 4 shows that the differences between the curves are noticeable in most bands but remarkably small in others. A few spectral curves nearly coincide in all bands and are hard to distinguish. In the classification process, this situation easily leads to the misclassification of some classes and is not conducive to improving the overall classification accuracy. Therefore, training the network directly with 1D spectral vectors hampers feature extraction from the original data, leading to overfitting and poor overall classification performance. After the samples are converted into 2D form, the spectral feature maps of different classes become easy to distinguish, as displayed in Figure 5. This illustrates that the data-processing method helps the network learn the hyperspectral data features and thus improves the classification performance of the network.
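A quick consistency check of the Indian Pines band removal (plain Python; the band indices are read as 1-based inclusive ranges, following the paper's notation):

```python
def reserved_bands(total, removed_ranges):
    """Count the bands left after removing 1-based inclusive ranges."""
    removed = set()
    for lo, hi in removed_ranges:
        removed.update(range(lo, hi + 1))
    return total - len(removed)

# Indian Pines: 220 usable bands; remove bands [104-108, 150-165, 218-220].
ip = reserved_bands(220, [(104, 108), (150, 165), (218, 220)])
print(ip)  # 196, a perfect square, giving a 14 x 14 feature map
```

The 24 removed bands leave exactly 196 = 14², satisfying the square-number constraint from the pre-processing step.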
Figure 5 shows the spectral feature maps of the 16 ground truth classes after quantizing spectral values at 20 levels.Figure 4 shows that the differences between different curves are noticeable in most bands, while being remarkably small in the other bands.A few spectral curves nearly coincide even in all the bands, and it is hard to distinguish them.In the process of classification, this situation will easily lead to the misclassification of some classes, which is not conducive to the improvement of overall classification accuracy.Therefore, training the network using 1D spectral vectors directly is not conducive to the extraction of features from the original data, thereby leading to an overfitting problem and poor overall classification performance of the proposed method.After converting the samples into 2D forms, spectral feature maps from different classes are easy to distinguish, as displayed in Figure 5.This illustrates that the data processing method is conducive to the network learning of the hyperspectral data features, thus improving the classification performance of the network.For the Salinas dataset, 20 bands that cover the region of water absorption and 8 bands with low SNR, including [107-113,153-168,220-224], are removed.The 196 remaining bands are used for classification.The spectral vectors of the preserved data are transformed into feature maps with 14 × 14 pixels.The data are processed by zero mean treatment.Accordingly, the mean spectral signatures of 16 classes in the Salinas dataset are shown in Figure 7, and the spectral feature maps after   quantizing spectral values at 20 levels are shown in Figure 8. 
Figures 7 and 8 demonstrate that transforming the samples from 1D to 2D emphasizes the difference among the samples from various ground-truth classes for the Salinas dataset.For the Pavia University image, only three bands were removed and 100 bands were retained to convert spectral vectors into feature maps because there exist obvious differences among different mean spectral curves (Figure 10).The corresponding spectral feature maps are shown in Figure 11 below.For the Pavia University image, only three bands were removed and 100 bands were retained to convert spectral vectors into feature maps because there exist obvious differences among different mean spectral curves (Figure 10).The corresponding spectral feature maps are shown in Figure 11 below.For the Pavia University image, only three bands were removed and 100 bands were retained to convert spectral vectors into feature maps because there exist obvious differences among different mean spectral curves (Figure 10).The corresponding spectral feature maps are shown in Figure 11 below.
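The pre-processing steps above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the helper name is ours, and it assumes the number of retained bands is a perfect square (e.g., 196 bands giving a 14 × 14 map). The vector is reshaped into a 2D feature map, zero mean treatment is applied, and the values are quantized at 20 levels for the Figure 5-style visualization.

```python
# Hedged sketch of the described pre-processing: reshape the 1D spectral
# vector of a pixel into a square 2D feature map, subtract the mean
# (zero mean treatment), and quantize to 20 discrete levels.
import numpy as np

def spectral_vector_to_map(vec, side=14, levels=20):
    assert len(vec) == side * side, "band count must equal side * side"
    fmap = np.asarray(vec, dtype=float).reshape(side, side)
    fmap -= fmap.mean()                      # zero mean treatment
    lo, hi = fmap.min(), fmap.max()
    # map the value range onto integer levels 0 .. levels-1
    q = np.floor((fmap - lo) / (hi - lo + 1e-12) * levels)
    return fmap, q.clip(0, levels - 1).astype(int)

fmap, q = spectral_vector_to_map(np.arange(196))
```

For the 100-band Pavia University vectors the same routine would be called with `side=10`.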

Experimental Design
In this section, several comparative experiments are designed to evaluate the classification performance of the proposed method on HSIs. The details of the experiments are as follows.
(1) Comparison of classification performances of the proposed method under different parameter settings. Two comparisons are needed because different width expansion rates (g) and network depths lead to varying classification performances of the proposed method. ① The number of composite layers, denoted as nc_layer, is set to 2, and the classification performances of the proposed method when g is 8/20/32 are compared. ② g is set to 20, and the classification performances of the proposed method when nc_layer is 2/4/8 are compared.
(2) Comparison with other methods. The classification performance of the proposed method is compared with those of deep learning and non-deep learning methods on the HSI datasets.
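To make the roles of g and nc_layer in (1) concrete, the following is a minimal NumPy sketch of our own (random weights, illustrative shapes; not the authors' implementation) of one composite layer: a 1 × 1 convolution followed by a 3 × 3 convolution producing g feature maps, whose output is stitched to the layer input along the channel axis so that later layers can reuse earlier features.

```python
# Hedged sketch of a composite layer with feature reuse: 1x1 conv, then
# 3x3 conv with 'same' padding, then concatenation of input and output.
import numpy as np

def conv1x1(x, w):            # x: (C_in, H, W), w: (C_out, C_in)
    return np.einsum('oc,chw->ohw', w, x)

def conv3x3(x, w):            # w: (C_out, C_in, 3, 3), 'same' padding
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            out += np.einsum('oc,chw->ohw', w[:, :, i, j],
                             xp[:, i:i + h, j:j + wd])
    return out

def composite_layer(x, g=20, rng=np.random.default_rng(0)):
    c = x.shape[0]
    y = conv1x1(x, rng.standard_normal((g, c)))
    y = conv3x3(y, rng.standard_normal((g, g, 3, 3)))
    return np.concatenate([x, y], axis=0)  # feature reuse: stitch input and output

x = np.ones((1, 14, 14))       # one 14 x 14 spectral feature map
h1 = composite_layer(x)        # channels: 1 + g
h2 = composite_layer(h1)       # channels: (1 + g) + g
```

Because every composite layer appends g channels on top of its input, the channel count grows linearly with nc_layer, which is why both parameters directly trade classification accuracy against computation.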
In the training process, the training samples are divided into batches of 64 samples each. In each epoch, the whole training set is learned by the proposed CNN, and the total number of epochs is 200 in each experiment. The Adam optimizer is applied to train the proposed CNN, and the MSRA initialization method [31] is used for weight initialization. The epsilon parameter of the Adam optimizer is set to 1 × 10⁻⁸. The initial learning rate for the Indian Pines dataset is 1 × 10⁻², which is reduced by factors of 10/100/200 at epochs 20/60/100, respectively. For the Salinas and Pavia University datasets, the initial learning rate is 1 × 10⁻³, which is reduced by factors of 10/100/200 at epochs 40/100/150, respectively. Unless otherwise stated, all the parameters follow these settings. These parameters may not be optimal, but they are at least effective for the proposed CNN.
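The step schedule for the Indian Pines run can be written as a small helper. The function name is our own, and "reduced 10/100/200 times" is read here as dividing the initial rate by those factors:

```python
# Hedged sketch of the quoted step-wise learning-rate schedule for
# Indian Pines: base rate 1e-2, divided by 10/100/200 once training
# reaches epoch 20/60/100, respectively.
def indian_pines_lr(epoch, base_lr=1e-2):
    for milestone, factor in [(100, 200), (60, 100), (20, 10)]:
        if epoch >= milestone:
            return base_lr / factor
    return base_lr
```

The Salinas/Pavia schedule would use `base_lr=1e-3` with milestones 40/100/150.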
Division of training set and test set: Generally, insufficient training samples lead to serious overfitting of deep CNN models. However, a deep CNN model is equipped with strong feature extraction capacity due to its deep structure. Moreover, BN and the MSRA initialization method are adopted to train the proposed CNN effectively, which improves its generalization performance. Therefore, the proposed method can extract deep features from a small training set effectively. The limited number of labeled pixels in an HSI leads to insufficient training samples, which usually makes it a challenge to improve the classification accuracy of the HSI. Considering this, to evaluate the effectiveness of the proposed method under limited training samples, we randomly select 25% of the samples in each HSI dataset for the training set and use the rest for the test set.
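The random 25% split described above can be sketched as follows; the function name and the fixed seed are our choices, not the paper's:

```python
# Hedged sketch of the sample division: 25% of the labeled samples of a
# dataset are drawn at random for training, the rest for testing.
import random

def split_indices(n_samples, train_ratio=0.25, seed=0):
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(n_samples * train_ratio)
    return idx[:n_train], idx[n_train:]

train_idx, test_idx = split_indices(10249)  # Indian Pines: 10,249 labeled pixels
```

A per-class stratified split would be a natural variant when class sizes are very unbalanced, as in Indian Pines.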

Experimental Results and Analyses
In this work, overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa) are adopted to evaluate the classification performance of the proposed method on HSI data. OA refers to the ratio of the number of correctly classified pixels to the total number of labeled pixels. AA is the average of the classification accuracies of all classes. The Kappa coefficient assesses the agreement of the classification over all the classes; the greater the Kappa value, the better the overall classification effect. All the following results are averages of 10 experimental runs under the same conditions to ensure the objectivity of the experimental results.
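All three metrics can be computed from a confusion matrix. The sketch below is generic (the 2 × 2 matrix is illustrative, not from the paper's experiments) and follows the convention used later in Table 6, where rows are predicted classes and columns are actual classes:

```python
# Hedged sketch: OA, AA, and Kappa from a confusion matrix where
# cm[i][j] = number of samples of actual class j predicted as class i.
def classification_metrics(cm):
    k = len(cm)
    n = sum(sum(row) for row in cm)
    diag = sum(cm[i][i] for i in range(k))
    col = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # actual totals
    row = [sum(cm[i]) for i in range(k)]                       # predicted totals
    oa = diag / n                                              # overall accuracy
    aa = sum(cm[j][j] / col[j] for j in range(k)) / k          # mean per-class accuracy
    pe = sum(row[i] * col[i] for i in range(k)) / n ** 2       # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([[50, 3], [2, 45]])
```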

(1) Experimental Results in the Indian Pines Dataset
Classification performance of the proposed method under different width expansion rates: Figure 12 displays a bar chart of the classification results when nc_layer = 2 and g = 8/20/32. According to Figure 12, with the network depth fixed, the classification performance of the proposed method improves gradually as the width expansion rate (g), and thus the network width, increases. However, this improvement gradually saturates. OA/AA/Kappa increase by 3.05%/5.08%/0.0346, respectively, when g is increased from 8 to 20. When g is increased from 20 to 32, OA and Kappa are almost invariable and AA is slightly reduced. Within a certain range, increasing the network width can improve the feature extraction from hyperspectral data, thus enhancing classification accuracy, but this improvement has an upper limit. Widening the network also increases its computation and therefore the time consumed by classification. The width of the network should thus be controlled when designing an HSI classification method.

Classification performance of the proposed method under different numbers of composite layers: Figure 13 displays a bar chart of the classification results when g = 20 and nc_layer = 2/4/8. Increasing the number of composite layers does not increase the classification accuracy of the proposed method when the width expansion rate remains the same, while an increase in network depth greatly increases the computation of the network and prolongs training. Considering the large computational cost of hyperspectral data, the network depth should be controlled on the premise of ensuring high classification accuracy.
In summary, in terms of classification accuracy and speed, setting g = 20 and nc_layer = 2 is suitable.
Impact of band number: As can be seen from Figure 4, when the number of reserved bands is 196, many spectral curves still exhibit serious aliasing in some bands, which is unfavorable for classification. To determine the optimal number of reserved bands, the classification results with 100, 144 and 196 bands under nc_layer = 2 and g = 20 are compared (Table 4). According to Figures 4 and 14, removing more bands greatly reduces the aliasing of the spectral curves, which effectively enhances the inter-class separability of the Indian Pines data; unfortunately, it simultaneously discards much useful information. The enhanced inter-class separability benefits classification accuracy, while the loss of useful information harms it. Here the disadvantage outweighs the advantage: compared with the 196-band OA, the 144-band OA decreases by only 0.12%, and the 100-band OA decreases by 2.74% (see Table 4). As a result, reserving 196 bands is the best choice.

(2) Experimental Results on the Salinas Dataset
Classification performance of the proposed method for the Salinas dataset under different width expansion rates or network depths: Table 5 shows the classification results of the proposed method on the Salinas dataset when nc_layer = 2 and g = 8/20/32, and when g = 20 and nc_layer = 2/4/8. When g is increased from 8 to 20 under the same network depth, OA/AA/Kappa increase by 1.19%/0.48%/0.0132, respectively. When g is increased from 20 to 32, OA/AA/Kappa are nearly unchanged. Increasing the network width can improve the classification performance of the proposed method within a certain range; once the width is sufficient, widening the network further does not affect the classification capability. With the width expansion rate fixed, the classification accuracy is nearly unchanged as the network depth increases. This phenomenon may be caused by the sufficiently large number of samples in the Salinas dataset, for which the achieved accuracy is remarkably close to, or may even reach, the maximum capacity of the proposed method.

(3) Experimental Results on the Pavia University Dataset
To demonstrate the relationship between classification accuracies and inter-class separability, Tables 6 and 7 display the details of the classification results for the Pavia University data at nc_layer = 2 and g = 20. In Table 6, the value in the ith row and jth column is the number of samples of the jth class that are classified into the ith class. Each row represents the samples in a predicted class, and each column represents the samples in an actual class. The values on the diagonal are the numbers of correctly classified samples; the off-diagonal values are false positives or false negatives. For example, in the first row, 4790 samples of C1 are classified correctly, while 31 samples of C3 are misclassified into C1 (false positives). In the seventh column, 887 samples of C7 are classified correctly, while 108 samples are misclassified into C1 (false negatives).
In Table 7, the value in the ith row and jth column is the percentage of the samples classified into the ith class that come from the jth class. The diagonal values are the classification accuracies of the corresponding classes. For example, 93.37% of the C7 samples are classified correctly, but 6.32% of the samples classified into C7 are misclassified samples from C1. The smaller the difference between mean spectral curves, the smaller the difference among spectral feature maps from different classes, the worse the inter-class separability, and the easier it is to cause misclassification. As shown in Figure 10, the curves of C3 and C8 are very similar; that is, the difference between C3 and C8 is small. Hence, many samples (212) of C3 are misclassified into C8 and, similarly, many samples (179) of C8 are misclassified into C3. This also explains why the OA of the proposed method is only 89.95% on the Indian Pines dataset, but 96.01% on the Salinas dataset and 96.15% on the Pavia University dataset: in many bands there is serious aliasing among the mean spectral curves of the Indian Pines dataset (Figure 4), which means poor inter-class separability. Therefore, serious misclassification exists in some classes, reducing the overall classification accuracy for the Indian Pines dataset.

(4) Comparison with Other Methods
In order to further evaluate the effectiveness of the proposed method, we implement two classical CNN architectures, NIN and LeNet-5, taking nc_layer = 2 and g = 20; the architectures are shown in Table 8. The classification results on the three datasets are shown in Table 9, and Table 10 displays the total run times for training and testing on the Indian Pines dataset. Figures 15-17 display the corresponding classification maps. Clearly, the classification performance of the proposed method is better than that of all the comparison methods. Furthermore, as can be seen from Tables 8 and 10, the FLOPs (floating-point operations) of the proposed CNN are about 10⁶ fewer than those of NIN, and its training and testing times for Indian Pines classification are also significantly shorter than those of NIN. The OA of the proposed CNN is 0.98% higher than that of NIN, which demonstrates the effectiveness of the ASC-FR module. The FLOPs and computing consumption of LeNet-5 are much lower than those of the proposed CNN and NIN, but its shallow structure, together with the insufficient training samples of HSI, severely restricts its classification performance. As a result, the OA of LeNet-5 is 3.19% lower than that of NIN and 4.17% lower than that of the proposed CNN. Finally, the classification performance of the proposed method is compared with some other HSI classification methods, as shown in Table 11. In this table, the accuracies outside brackets are taken directly from the corresponding references, and the accuracies in brackets are obtained by the proposed method. Note that we obtain each classification accuracy by dividing the training and test sets according to the corresponding reference. The table demonstrates that the proposed method outperforms all the comparison methods. In Table 11, DBN means deep belief network and DAE means denoising autoencoder.
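As a rough guide to the FLOPs comparison above, the cost of a single convolution layer can be estimated with a standard back-of-the-envelope count; the shapes below are illustrative, not the exact configurations of the compared networks in Table 8:

```python
# Hedged sketch: counting one multiply and one add per weight application
# gives roughly 2 * k^2 * C_in * C_out * H_out * W_out operations for a
# k x k convolution layer.
def conv_flops(k, c_in, c_out, h_out, w_out):
    return 2 * k * k * c_in * c_out * h_out * w_out

# e.g., a 3 x 3 convolution, 20 -> 20 channels, on a 14 x 14 map:
flops = conv_flops(3, 20, 20, 14, 14)
```

Summing this estimate over all layers is how one architecture can be shown to cost on the order of 10⁶ fewer operations than another.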


Conclusions
In this work, an HSI classification method based on ASC-FR is proposed. In data pre-processing, each 1D spectral vector that corresponds to a labeled pixel is transformed into a 2D spectral feature map, thereby highlighting the differences among samples and weakening the influence of the strong correlation among bands on HSI classification. In the CNN design, 1 × 1 convolution layers are adopted to reduce the network parameters and increase network complexity, thus extracting increasingly accurate hyperspectral data features. Through the ASC-FR module, the utilization rate of the high-dimensional features in the network is improved, the features of the hyperspectral data are extracted meticulously and comprehensively, overfitting is prevented to a certain extent, and classification accuracy is improved. Overlapping pooling and GAP are used to integrate the data features, thereby greatly enhancing the learning capability of the network for spectral features and improving the generalization capability of the CNN. Experimental results show that when only 25% of the samples are selected for the training set, the classification accuracy of the proposed method reaches 89.95% on the Indian Pines dataset, 96.01% on the Salinas dataset, and 96.15% on the Pavia University dataset. Comparative experiments on three benchmark HSI datasets demonstrate that the proposed ASC-FR module can effectively improve the classification accuracy of CNNs for HSIs, and that the proposed classification method has excellent performance, outperforming all the comparison methods. However, the proposed method only configures the number of channels in each convolution layer in a simple way, so there is still much room for improvement. Furthermore, the spatial information of hyperspectral images is not effectively utilized by the proposed method. In future work, we will optimize the combination of convolution layer channels through extensive experiments, adopt approaches to augment the number of training samples and, more importantly, combine spectral and spatial information for hyperspectral classification. Finally, we plan to optimize the network structure of the proposed method by referring to the latest progress in CNN research. In these ways, we believe the performance of CNN-based HSI classification under a small training set can be improved effectively and significantly.

Figure 1 .
Figure 1. The diagram of the data pre-processing procedure.


4.1. Datasets and Data Pre-Processing
4.1.1. Indian Pines Dataset
The Indian Pines image was captured by the AVIRIS sensor with a spatial resolution of 20 m over the Indian Pines test site in north-western Indiana. This image contains 145 × 145 pixels and 224 spectral reflectance bands, whose wavelengths range from 0.4 µm to 2.5 µm. After discarding four zero bands, 220 bands are reserved. There are 16 ground-truth classes and 10,249 labeled pixels. Figure 3 displays the details of the Indian Pines image.

Figure 3 .
Figure 3. The Indian Pines image: (a) the 21st band image; (b) the ground truth of Indian Pines, where the white area represents the unlabeled pixels.


Figure 4 .
Figure 4. Mean spectral signatures of 16 classes in the Indian Pines dataset.

Figure 5 .
Figure 5. 2D spectral feature maps of 16 classes in the Indian Pines dataset.



Figure 6 .
Figure 6. The Salinas image: (a) the 21st band image; (b) the ground truth of Salinas, where the white area represents the unlabeled pixels.


Figure 7 .
Figure 7. Mean spectral signatures of 16 classes in the Salinas dataset.

Figure 8 .
Figure 8. 2D spectral feature maps of 16 classes in the Salinas dataset.


Figure 9 .
Figure 9. The Pavia University image: (a) the 21st band image; (b) the ground truth of Pavia University, where the white area represents the unlabeled pixels.

Figure 10 .
Figure 10. Mean spectral signatures of 9 classes in the Pavia University dataset.


Figure 11 .
Figure 11. 2D spectral feature maps of 9 classes in the Pavia University dataset.


Figure 14 .
Figure 14. Spectral curves of 100 bands and 144 bands in the Indian Pines dataset.


Figure 15 .
Figure 15. Classification maps of different CNN architectures on the Indian Pines dataset.

Figure 16 .
Figure 16. Classification maps of different CNN architectures on the Salinas dataset.

Figure 17 .
Figure 17. Classification maps of different CNN architectures on the Pavia University dataset.

Table 1 illustrates the number of samples per class in the Indian Pines dataset.

Table 1 .
Indian Pines image: the number of samples per class.


Table 2 .
Salinas image: the number of samples per class.


Table 3 .
Pavia University image: the number of samples per class.


Table 4 .
Classification results with different numbers of reserved bands in the Indian Pines dataset.


Table 5 .
Classification results under different width expansion rates or network depths in the Salinas dataset.

Table 6 .
The details of classification results for Pavia University data.

Table 7 .
The detailed classification accuracy of all the classes for Pavia University data.

Table 9 .
Classification results of different CNN architectures on three HSI datasets.

Table 10 .
Computing consumption of different CNN architectures on the Indian Pines dataset.


Table 11 .
Comparison with other methods for the Indian Pines dataset.
