A Hyperspectral Image Classification Method Based on the Nonlocal Attention Mechanism of a Multiscale Convolutional Neural Network

Recently, convolutional neural networks (CNNs) have been widely used in hyperspectral image classification and have achieved excellent performance. However, the fixed receptive field of a convolution kernel often leads to incomplete feature extraction, and the high redundancy of spectral information makes spectral feature extraction difficult. To solve these problems, we propose a 2D–3D hybrid CNN with a nonlocal attention mechanism (2-3D-NL CNN), which includes an inception block and a nonlocal attention module. The inception block uses convolution kernels of different sizes to equip the network with multiscale receptive fields to extract the multiscale spatial features of ground objects. The nonlocal attention module enables the network to obtain a more comprehensive receptive field in the spatial and spectral dimensions while suppressing the information redundancy of the spectral dimension, making spectral feature extraction easier. Experiments on two hyperspectral datasets, Pavia University and Salinas, validate the effectiveness of the inception block and the nonlocal attention module. The results show that our model achieves overall classification accuracies of 99.81% and 99.42% on the two datasets, respectively, higher than those of existing models.


Introduction
The bandwidth of a hyperspectral image is usually tens of nanometers, much narrower than that of a multispectral image [1]. Therefore, the hyperspectral image has more abundant spectral information; it is widely used in various fields [2][3][4]. In the geoscience field, the correct pixel-level classification of remote-sensing images is the premise of many research tasks [5,6]. Hyperspectral remote-sensing images have natural advantages in classification tasks with rich spectral information [7]. Therefore, hyperspectral remote sensing is widely used in precision agriculture [8], rock and mineral identification [9], environmental monitoring [10], marine remote sensing [11], and other fields.
Traditional methods of classifying remote-sensing images include classification based on spectral features and classification based on statistical data features. The spectral-feature-based method recognizes ground objects from their spectral curves, which reflect the optical properties of the objects. First, the spectral features of the hyperspectral image are extracted and transformed. Then, the pixels in the image are classified by matching them against known spectra in a spectral library. The representative method is spectral information angle mapping. The method based on statistical characteristics builds a classification model from the statistics of each class on the training set and then assigns pixels with similar characteristics on the test set to one class according to these statistics. Representative algorithms include the maximum likelihood and minimum distance methods.
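The spectral-angle matching idea can be sketched in a few lines of NumPy; the 4-band spectra and two-class library below are hypothetical toy values, not entries from a real spectral library:

```python
import numpy as np

def spectral_angle(pixel, reference):
    """Angle (radians) between a pixel spectrum and a reference spectrum."""
    cos_sim = np.dot(pixel, reference) / (np.linalg.norm(pixel) * np.linalg.norm(reference))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0))

def classify_by_sam(pixel, library):
    """Assign the pixel to the library class with the smallest spectral angle."""
    angles = {name: spectral_angle(pixel, ref) for name, ref in library.items()}
    return min(angles, key=angles.get)

# Hypothetical 4-band spectral library (real libraries use hundreds of bands).
library = {
    "vegetation": np.array([0.05, 0.08, 0.45, 0.50]),
    "soil":       np.array([0.20, 0.25, 0.30, 0.35]),
}
pixel = np.array([0.06, 0.09, 0.40, 0.48])  # vegetation-like spectrum
print(classify_by_sam(pixel, library))      # vegetation
```

Because the angle depends only on the direction of the spectrum, not its magnitude, this matching is insensitive to overall illumination scaling.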
With its rapid development, artificial intelligence has begun to play an important role in different engineering fields, such as disease diagnosis [12]. Deep learning models have since emerged and become a widely studied research topic. In recent years, some deep learning models have been introduced into hyperspectral remote-sensing image classification. The stacked autoencoder (SAE) [13] and the deep belief network (DBN) [14] have been used to extract spectral information from hyperspectral images and achieve higher classification accuracy than traditional methods. Compared with the SAE and DBN, a convolutional neural network (CNN) [15] is not limited by the input dimension. CNNs perform well in various tasks. For example, Yu et al. [16] proposed a vision-based automatic method for the surface condition identification of concrete structures, consisting of a state-of-the-art pretrained CNN, transfer learning, and decision-level image fusion. It could accurately identify crack contours despite incorrect predictions in limited areas, proving its potential in practical applications. Similarly, LeBien et al. [17] proposed an end-to-end pipeline for training a CNN for the multi-species, multi-label classification of soundscape recordings, starting from raw, unlabeled audio. Transfer learning from a pretrained model was used to reduce the necessary training data and time. The model achieved high classification accuracy for 24 species. In addition, CNNs have been widely used in image classification [18,19], object detection [20], semantic segmentation [21], and other fields.
CNNs are gradually being applied to the classification of hyperspectral images [22][23][24][25][26][27][28]. For example, Zhong et al. [29] designed a spatial-spectral residual network (SSRN) composed of spatial residual blocks and spectral residual blocks to jointly learn the spatial and spectral information in a hyperspectral image in order to further improve the recognition accuracy. Because the training time of an SSRN is too long, Wang et al. [30] designed a fast, dense spectral-spatial convolution (FDSSC) network that is faster than an SSRN and automatically extracts spatial features and spectral features in HSI by building dense spectral blocks and dense spatial blocks. Dense connections deepen the network and reduce the problem of gradient disappearance.
Multiscale features can better describe complex scenes; therefore, a multiscale strategy [31][32][33] is an effective way to improve the accuracy of HSI classification. Yang et al. [34] proposed a dual-channel convolutional neural network (two CNNs) to effectively extract images' spectral and spatial features. The network uses different channels of a CNN to learn image features from the spectral and spatial dimensions. He et al. [35] proposed a multiscale 3D deep-convolution neural network (M3D-DCNN) for HSI classification, which can learn spatial features and spectral features together from hyperspectral image data in an end-to-end manner and then extract spectral information with a 1D-CNN. Pooja et al. [36] combined a multiscale strategy with the CNN network to achieve effective hyperspectral image classification, reduce the interference of adjacent pixels, and improve the performance of features. Wu et al. [37] proposed a multi-branch spectral-spatial joint network (MSSN) based on a CNN. The MSSN structure consists of two branches, each of which can extract the spectral and spatial features of hyperspectral images. Lee et al. [23] proposed a deep CNN with a deeper and broader context, which uses multiscale filter banks to obtain different receptive fields in order to extract the multiscale spectral-spatial fusion features of images. An inception block can make the network wider, reduce the number of parameters, and extract high-dimensional features. It uses convolution kernels of different receptive fields on the same layer of the network to extract features at different scales. Furthermore, an inception block is an effective means of solving the problem of incomplete multiscale feature extraction in hyperspectral image classification. Bei et al. [38] proposed a 3D asymmetric inception network (AINet) to overcome the over-fitting problem of hyperspectral image classification. 
AINet uses two asymmetric inception units, a spatial inception unit and a spectral inception unit, to effectively convey and classify features. In addition, they developed a data fusion transfer learning strategy to improve the model initialization and classification performance. Experiments showed that AINet was superior to all of the most advanced methods.
In addition to these factors, the attention mechanism has been widely used in computer vision in recent years. Lu et al. [39] proposed a new multiscale spatial-spectral residual network (CSMS-SSRN) based on 3D channel and spatial attention. The network uses different 3D convolution kernels to learn spectral and spatial features from their respective residual blocks and then feeds the extracted deep multiscale features into the 3D attention module to enhance the expressiveness of image features in both the channel and spatial domains, thus improving classification accuracy. Because the SSRN and FDSSC networks require a large number of training samples to obtain good classification results, Sun et al. [40] proposed a spectral-spatial attention network (SSAN), which combines simple spectral-spatial networks with attention mechanisms to extract the spectral and spatial characteristics of images. The nonlocal attention mechanism works well in video classification. It can also be applied to image classification, target detection, target segmentation, and other visual tasks, with improvements of varying degrees. A hyperspectral image is a kind of data similar to video, to which nonlocal attention mechanisms are also applicable. Hu et al. [41] combined a nonlocal attention mechanism with a CNN to present a multilevel progressive HSI super-resolution network. The dense nonlocal and local block (DNLB) was constructed to combine local and global features, which are used to reconstruct super-resolution images at each level. They also developed a nonlocal channel attention block to extract the global features of HSIs efficiently. A number of experiments demonstrated that their method could reconstruct hyperspectral images more accurately than existing methods.
Because the convolution kernels of the same layer are the same size, the problem of insufficient information extraction can easily occur. Focusing on the problems of spectral information redundancy and insufficient feature extraction at a single scale in the existing hyperspectral image classification research, this study proposes a CNN algorithm that combines a nonlocal attention module [42] and an inception block [43] to classify hyperspectral images. The nonlocal attention module can suppress the redundant information of hyperspectral images. Thus, the network can focus on essential features and use inception to extract and fuse multiscale spatial information to avoid insufficient spatial feature extraction on a single scale. This study conducted experiments on two hyperspectral datasets, Pavia University (PU) and Salinas (SA). The experiments showed that the proposed model could achieve higher classification accuracy than other deep learning models. The ablation experiments showed that adding inception and nonlocal attention mechanisms to the network effectively improved the model's ability to extract spatial and spectral information from hyperspectral images.
The contributions of this study can be summarized as follows:
1. We use the inception and nonlocal attention mechanisms to solve the problems of insufficient spatial-spectral feature extraction and the high redundancy of spectral information in hyperspectral images and to achieve higher classification accuracy;
2. We compare the nonlocal attention block with two other attention mechanisms to verify its effectiveness for hyperspectral image classification;
3. We conduct experiments on other parameters that affect the classification accuracy of hyperspectral images with the deep learning model. The results provide a reference for further improving the classification accuracy of hyperspectral images.
Through undertaking this work, we hope to solve the problem of insufficient feature extraction for hyperspectral images and spectral feature extraction caused by spectral information redundancy. Moreover, we hope to further clarify the impact of different parameters on hyperspectral image classification, which will be helpful for follow-up research.
The remainder of this study is organized as follows. Section 2 introduces the principle of inception and the nonlocal attention mechanism. Section 3 introduces the structure of the proposed model. The comparison experiment and the discussion of the parameters that affect the method are presented in Section 4. Conclusions are given in Section 5.

Related Works
Inception and nonlocal blocks are commonly used in various computer vision tasks, and they also play an essential role in our network. These two structures are introduced below.

Inception Block
The increase in the depth and width of a network can improve its performance, but the cost of larger network parameters and a heavier load of calculation easily lead to overfitting. The inception structure proposed by Szegedy et al. [43] operates and integrates the feature map input from the previous layer by using convolution kernels and pooling operations of different sizes in the same layer to obtain a new feature map. This increases the width of the network but, at the same time, by the use of 1 × 1 convolution, it reduces the number of parameters and avoids excessive calculation in each layer. Therefore, the inception block is often used in computer vision studies. Many scholars have combined inception blocks with U-Net architecture and have proposed various image segmentation models. For example, Ibrahim et al. [44] added inception blocks to U-Nets to increase the network width and developed a new network structure aided by the feature extraction ability of inception blocks to improve building detection. Zhang et al. [45] integrated the inception block into U-Net, used the Res-inception module to replace the standard convolution layer to increase the width of the network, and used the inception block to extract features to build a deeper network structure and achieve higher performance than the existing algorithms. In addition, inception blocks have been applied in many computer vision applications, such as facial recognition [46], lithography hotspot detection [47], handwritten letter recognition [48], and breast cancer detection [49], with good results.
Compared with the original inception, inception V2 uses two layers of 3 × 3 small convolution kernels instead of one layer of large 5 × 5 convolution kernels. This modification reduces the model parameters while keeping the receptive field unchanged, and it provides more nonlinearity. Figure 1 shows the structure of inception V2.
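The two claims above (unchanged receptive field, fewer parameters) can be checked numerically. The sketch below uses a naive hand-written convolution, not any framework API: a unit impulse pushed through two stacked 3 × 3 convolutions influences a full 5 × 5 output region, while the two layers use 18 rather than 25 weights per input-output channel pair:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D cross-correlation with an odd-sized kernel."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * k)
    return out

# Unit impulse in the middle of a 9x9 field.
x = np.zeros((9, 9)); x[4, 4] = 1.0
k3 = np.ones((3, 3))

once  = conv2d_valid(x, k3)     # 7x7 map; impulse spreads to a 3x3 region
twice = conv2d_valid(once, k3)  # 5x5 map; impulse now covers all 25 cells
print(np.count_nonzero(twice))  # 25 -> effective 5x5 receptive field

# Per channel pair, two 3x3 layers use fewer weights than one 5x5 layer.
print(2 * 3 * 3, "<", 5 * 5)    # 18 < 25
```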


Nonlocal Block
The attention mechanism is useful in image classification tasks: it lets the model ignore irrelevant information and focus on key information. The nonlocal block was designed by Wang et al. [42] according to traditional nonlocal methods in computer vision. It breaks the restriction that a convolution layer can only process adjacent elements: the response at each position in the feature map is computed from all positions in the whole feature map. It directly captures long-range dependencies by calculating the interaction between any two positions on the image, thereby obtaining global information. Figure 2 shows the structure diagram of the nonlocal block.

The nonlocal block performs very well in video classification, target detection, and other fields. Shokir [50] proposed a new nonlocal fully convolutional network to capture global correlations more effectively for video salient object detection and obtained good results, proving the effectiveness of the nonlocal operation in salient object detection. Quan [51] added the nonlocal block to a CNN for electrocardiogram classification; classification was significantly improved because the nonlocal blocks capture the long-term dependence of features in the spatial and channel domains. Wang [52] added an improved nonlocal block, called the asymmetric pyramid nonlocal block (APNB), to U-Net to automatically extract buildings from high-resolution aerial images. APNB captured global context information and improved the classification accuracy of pixels inside large buildings.
The formula for a nonlocal block is as follows:

y_i = (1 / C(X)) ∑_{∀j} f(X_i, X_j) g(X_j),

where X represents the input data (in this study, the three-dimensional image block input to the network), y represents the output data, and i and j represent spatial positions of the input. f is a function that calculates the similarity between position i and all other positions, and g calculates a representation of the input data at position j. C(X) is a normalization factor. To simplify the problem, only the linear case is considered; that is, g(X_j) = W_g X_j. W_g is a weight matrix learned through training, which, depending on the input data, can be implemented in the neural network by a 1 × 1 or 1 × 1 × 1 convolution. There are many choices for f.
Here, we implement the embedded Gaussian, the formula for which is as follows:

f(X_i, X_j) = e^{θ(X_i)^T φ(X_j)},

where θ(X_i) = W_θ X_i and φ(X_j) = W_φ X_j are likewise implemented by 1 × 1 or 1 × 1 × 1 convolutions, and the normalization factor is C(X) = ∑_{∀j} f(X_i, X_j). As such, y_i can be fully expressed as follows:

y_i = ∑_{∀j} ( e^{θ(X_i)^T φ(X_j)} / ∑_{∀k} e^{θ(X_i)^T φ(X_k)} ) g(X_j),

which is a softmax-weighted sum over all positions j.
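As a minimal illustration of the embedded Gaussian nonlocal operation, the NumPy sketch below flattens the feature map to N positions and, for brevity, passes identity matrices for the learned 1 × 1 weight matrices W_θ, W_φ, and W_g:

```python
import numpy as np

def nonlocal_embedded_gaussian(x, w_theta, w_phi, w_g):
    """
    x:    (N, C) feature map flattened to N positions with C channels.
    w_*:  (C, C) learned 1x1(x1) convolution weights.
    Returns (N, C): y_i = sum_j softmax_j(theta_i . phi_j) * g_j
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g
    scores = theta @ phi.T                          # theta(X_i)^T phi(X_j), (N, N)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax = f / C(X)
    return weights @ g

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))     # 6 positions, 4 channels (toy sizes)
eye = np.eye(4)
y = nonlocal_embedded_gaussian(x, eye, eye, eye)
print(y.shape)                      # (6, 4)
```

Each output row is a convex combination of all input rows, so every position attends to every other position: the global receptive field described above.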

The Proposed Method
To solve the problems of insufficient spatial feature extraction at a single scale, the presence of many bands, and the high redundancy of spectral information in hyperspectral image classification, we introduce the inception and nonlocal attention mechanisms into the CNN. Unlike the existing models, our proposed model uses several convolution kernels of different sizes on the same layer of the network to extract features, and the inception block provides multiscale receptive fields for the network to make feature extraction more efficient. The nonlocal attention mechanism is not limited to adjacent pixels and can determine the correlation between any positions, which is equivalent to constructing a convolution kernel of the size of a feature map. Therefore, the network can extract more comprehensive spatial and spectral features while suppressing the redundant information between spectral bands.
The network structure of this study is shown in Figure 3. First, the hyperspectral image is processed into several small overlapping H × W × C data cubes, which are the inputs of the network. The front end of the model can be considered as two branches. One is a 3D CNN with a nonlocal attention module responsible for extracting spatial-spectral information from the input data. The nonlocal module acts like a greatly enlarged receptive field: nonlocal operations capture long-range correlations directly by calculating the interaction between any two locations, rather than being limited to adjacent points. This is equivalent to constructing a convolution kernel as large as the feature map to obtain more information. At the same time, the nonlocal block can also capture long-distance interactions between pixels in different bands, making better use of the rich spectral information of hyperspectral images. The nonlocal attention module also helps the model suppress irrelevant information and pay more attention to salient features, thereby enhancing the network's ability to extract spatial and spectral features.

Hyperspectral images are 3D data with spatial and spectral dimensions, and the nonlocal operation behaves differently according to the input data dimensions. When the input data dimension is H × W × C, the nonlocal operation works in the spatial dimension and is called the 2D nonlocal attention module. When the input data dimension is D × H × W × C, the nonlocal operation can also act on the spectral dimension and is called the 3D nonlocal attention module.
Because the input data have not undergone dimension reduction or band selection, the spectral dimension contains much redundant information. Therefore, the input data first pass through a 3D convolution layer with a 3 × 3 × 7 kernel and a stride of 2 along the spectral dimension to gather spectral information and simultaneously extract joint spatial-spectral features, and then through a 3D nonlocal module to enhance salient features and suppress redundant information. Then, we use two 3 × 3 convolution layers to further extract spatial features and a 1 × 1 × K convolution layer to obtain a 2D feature map.
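Assuming no padding and the standard convolution size formula, the effect of the spectral stride can be sketched as follows; the 103-band depth matches the Pavia University cube, while the 11 × 11 spatial patch size is an assumed example rather than the paper's setting:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

# Pavia University cubes have 103 spectral bands; the 11x11 spatial
# patch size here is an assumed example, not the paper's setting.
d, h, w = 103, 11, 11
d2 = conv_out(d, kernel=7, stride=2)   # spectral: (103 - 7)//2 + 1 = 49
h2 = conv_out(h, kernel=3)             # spatial:  (11 - 3)//1 + 1  = 9
print(d2, h2, h2)                      # 49 9 9
```

The stride-2 spectral convolution thus roughly halves the spectral depth before the nonlocal module, which keeps the (N × N) pairwise interaction matrix of the attention step affordable.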
Another branch is the multiscale spatial feature fusion module based on the inception structure. This module uses convolution kernels of different sizes to extract features at different scales. All of these features are fused through a concatenation operation to obtain multiscale spatial features. Through the multiscale spatial feature fusion module, spatial features of different sizes are extracted and fused, and the features of the input data at each scale become more prominent. The outputs of the two branches are fused, and the salient features are further enhanced through a 2D nonlocal attention module. After that, a 3 × 3 convolution and a global average pooling layer are applied to the feature map, and a softmax classifier performs the final classification.

Experimental Data and Evaluation Metrics
To verify the effectiveness of the network and the influence of other variables on the classification results of hyperspectral images, we used two published hyperspectral image datasets, Salinas (SA) and Pavia University (PU) [53], as experimental data. The datasets are shown in Figure 4.

This study used three quantitative indicators to assess the classification results: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. The OA is the proportion of correctly classified samples among all samples of the test set. The AA is the mean of the per-class accuracies: the classification accuracy is computed for each category and then averaged over all categories. The Kappa coefficient measures the consistency between the predicted classification and the actual classification. Its value generally lies between 0 and 1; the larger the value, the higher the classification accuracy.
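All three indicators can be computed directly from a confusion matrix; the 3-class matrix below is hypothetical:

```python
import numpy as np

def classification_metrics(cm):
    """OA, AA, and Kappa from a confusion matrix (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    oa = np.trace(cm) / total                                 # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))                # mean per-class accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# Hypothetical 3-class confusion matrix.
cm = [[50, 2, 3],
      [5, 40, 5],
      [2, 3, 45]]
oa, aa, kappa = classification_metrics(cm)
print(round(oa, 3), round(aa, 3), round(kappa, 3))
```

Note that Kappa discounts the agreement expected by chance (pe), so it is lower than OA whenever the class distribution alone would produce some correct predictions.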

Experimental Environment and Parameter Settings
All experiments were conducted on the same Dell laptop (made in Xiamen, China), configured with an Intel(R) Core(TM) i5-6300HQ CPU @ 2.3 GHz (Intel, Santa Clara, CA, USA), 16.0 GB of RAM, and an NVIDIA GeForce GTX 960 graphics card (NVIDIA, Santa Clara, CA, USA). The operating system was Windows 10, and the deep learning framework was PyTorch 1.6.0 with CUDA 10.1.

The convolutional kernel and fully connected layer parameters were initialized with the He normal method, the initial bias was 0, and the network was trained with the Adam optimizer. The learning rate was set to 0.001, the batch size was 32, and the number of epochs was 100.
The training and test samples were selected by random sampling. Over the whole hyperspectral image, the corresponding proportion of pixels was randomly selected as the training set for model training, and the remaining pixels formed the test set. Within the training set, 10% of the samples were used as the validation set for parameter optimization. To ensure that pixels at the edge of the image could also be selected, we padded the image borders according to the input patch size.
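The sampling procedure above can be sketched as follows; the patch size and ratios are illustrative rather than the paper's exact settings:

```python
import numpy as np

def extract_patches(image, patch=5):
    """Pad the image borders so every pixel, including edge pixels,
    yields a centered patch x patch x C neighborhood cube."""
    r = patch // 2
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
    h, w, _ = image.shape
    return np.array([padded[i:i+patch, j:j+patch, :]
                     for i in range(h) for j in range(w)])

def random_split(n, train_ratio=0.1, val_ratio=0.1, seed=0):
    """Random train/test split; the validation set is carved out of training."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(n * train_ratio)
    train, test = idx[:n_train], idx[n_train:]
    n_val = int(len(train) * val_ratio)
    return train[n_val:], train[:n_val], test

image = np.zeros((10, 12, 8))        # toy 10x12 scene with 8 bands
patches = extract_patches(image)
train, val, test = random_split(len(patches))
print(patches.shape, len(train), len(val), len(test))
```

Reflect padding is one reasonable border choice; zero padding would also work but injects artificial dark pixels into edge patches.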

Selection of the Network Backbone
The network backbone could be a 2D CNN, a 3D CNN, or a 3D-2D CNN. Compared with a 2D CNN, a 3D CNN has advantages on hyperspectral datasets but is more computation intensive. A 3D-2D CNN balances the classification effect and the amount of computation. Table 1 shows the classification performance of the three network structures on the two hyperspectral datasets. On the PU and SA datasets, the OA of a 3D CNN with the same number of layers is higher than that of a 2D CNN, but the number of network parameters is also greater, and the training time is several times that of a 2D CNN. The number of parameters of a 3D-2D CNN is significantly lower than that of a 3D CNN; even with many bands, it is lower than that of a 2D CNN, reducing the risk of model overfitting. Moreover, the training time of a 3D-2D CNN is also significantly shorter. Additionally, the overall classification accuracy of the 3D-2D model is comparable to that of the 3D model on the PU dataset and even better than that of the 3D CNN on the SA dataset. Thus, we selected the 3D-2D CNN as the backbone.

Comparing the Effectiveness of Multiscale Attention Modules
To verify the effectiveness of the multiscale spatial feature fusion module (MS) and the nonlocal attention module added in this study, we added them to the network in turn. Table 2 shows the classification accuracy of several model structures on the PU and SA datasets. On the PU dataset, when the MS module and a nonlocal module are added to the baseline model, the overall classification accuracy improves by 0.163% and 0.294%, respectively, indicating that both modules improve the performance of the model. For adding the nonlocal and MS modules to the network at the same time, Table 2 presents three variants: adding the 2D nonlocal module, adding the 3D nonlocal module, and adding both simultaneously. Regarding classification accuracy, adding nonlocal and MS to the network simultaneously produces better results than adding them separately. The 3D nonlocal module performs better than the 2D nonlocal module, and using the 2D and 3D nonlocal modules together further improves the model's performance, indicating that applying the attention mechanism in both the spatial and spectral dimensions achieves better results.
In addition to the nonlocal attention mechanism, SENet [54] and CBAM [55] are also high-quality attention mechanisms that are used in image classification. Therefore, we compared the effects of three attention mechanisms on the classification of two hyperspectral images. Table 2 also shows the OA of the network structure with three attention modules on the PU and SA datasets.
The results in Table 2 show that the three attention modules affect the two datasets differently. On the PU dataset, all three modules improve classification accuracy: SE and CBAM perform similarly, and the nonlocal module performs best. On the SA dataset, where the number of bands is larger and the various ground objects are more similar than in the PU dataset, SE and CBAM decrease classification accuracy, while the model with the nonlocal module still steadily improves. We therefore selected the nonlocal module, which captures long-distance correlations between spectra and better improves classification accuracy, as the final attention block.
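The structural difference behind this comparison can be illustrated with a minimal SE-style channel attention block (the channel counts are illustrative assumptions): SE reweights whole channels from one global statistic per channel, whereas the nonlocal block above models pairwise relations between individual positions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling followed by a small
    bottleneck MLP that rescales each channel independently."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze: per-channel statistic
        return x * w[:, :, None, None]       # excite: channel-wise reweighting

se = SEBlock(16)
y = se(torch.randn(2, 16, 11, 11))
print(y.shape)  # torch.Size([2, 16, 11, 11])
```

Because the squeeze step collapses all positions into a single value per channel, SE cannot express which distant positions are correlated, which is consistent with the nonlocal module's advantage reported here.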

Searching for the Optimal Parameters
In this section, we discuss the parameters affecting the classification accuracy of hyperspectral images, including the number of convolution kernels, the size of neighboring pixel blocks, and the ratio of training samples.

1.
The number of convolutional kernels. The number of convolutional kernels is an important parameter of the network structure, so we determined an appropriate value experimentally. Each layer in the model uses the same number of convolutional kernels. Table 3 shows the OA and the number of parameters of the model on the SA and PU datasets for different numbers of kernels. With 12 convolutional kernels, the classification accuracy on the PU and SA datasets is 98.962% and 96.640%, respectively. Gradually increasing the number of kernels significantly improves accuracy on both datasets. However, when the number of kernels increases from 24 to 30, the OA on the PU and SA datasets increases by only 0.084% and 0.054%, respectively (the feature extraction ability of the model becomes saturated), while the number of parameters increases by 31.67% and 30.72%, respectively. Further increasing the number of kernels is therefore uneconomical, so we set the number of convolutional kernels to 24.
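The super-linear growth of the parameter count with the kernel number can be seen from a quick calculation: inner layers scale quadratically in the kernel number, while the first layer (fixed input channels) scales linearly. The layer depth and channel numbers below are toy assumptions, not the paper's configuration, so the percentages differ from those in Table 3; the point is only the faster-than-linear trend.

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of one k x k convolution layer, including biases."""
    return c_in * c_out * k * k + c_out

def stack_params(n_kernels, in_ch=30, depth=3):
    """Toy stack: one input layer plus (depth - 1) inner layers,
    all using n_kernels output channels."""
    total = conv_params(in_ch, n_kernels)
    total += (depth - 1) * conv_params(n_kernels, n_kernels)
    return total

for n in (12, 18, 24, 30):
    print(n, stack_params(n))
```

Increasing the kernel number by 25% (24 to 30) thus inflates the inner layers' parameters by roughly 56%, which is why the overall count grows much faster than the accuracy once feature extraction saturates.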

2.
Neighboring pixel block size and proportion of training samples. Since the input of the model is a neighborhood pixel block extracted from the hyperspectral image, the block size determines how much data the model receives and has a significant impact on the final classification accuracy. The proportion of training samples also affects feature extraction; therefore, we combined four training sample ratios (5%, 10%, 15%, and 20%) with five spatial sizes (5, 7, 9, 11, and 13) on the experimental datasets to explore the influence of both parameters on classification accuracy. The results are shown in Table 4. As the neighborhood pixel block size increases from 5 to 11, the classification performance of the model on both datasets improves, and this pattern persists when the proportion of training data changes, because larger neighborhood blocks provide more information and allow the model to extract more discriminative features. However, the classification accuracy peaks at a size of 11; further increasing the block size reduces accuracy. Likewise, increasing the number of training samples significantly improves classification accuracy up to a ratio of 15%, beyond which accuracy is basically stable or even decreases.
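The neighborhood-block preparation described above can be sketched as follows: pad the hyperspectral cube, cut a size x size window around every labeled pixel, and split the windows by the training ratio. The function name, the toy cube, and the reflect padding are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def extract_patches(cube, labels, size=11, train_ratio=0.10, seed=0):
    """Cut a (size x size) neighborhood block around every labeled pixel
    and split the blocks into training and test sets by train_ratio."""
    pad = size // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    rows, cols = np.nonzero(labels)                  # labeled pixels only
    X = np.stack([padded[r:r + size, c:c + size] for r, c in zip(rows, cols)])
    y = labels[rows, cols]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                    # shuffle before splitting
    n_train = int(train_ratio * len(y))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], y[tr], X[te], y[te]

cube = np.random.rand(40, 40, 30)                    # toy H x W x bands cube
labels = np.random.randint(0, 4, (40, 40))           # 0 marks unlabeled pixels
X_tr, y_tr, X_te, y_te = extract_patches(cube, labels)
print(X_tr.shape[1:])  # (11, 11, 30)
```

Padding by half the block size keeps border pixels usable, so the number of samples equals the number of labeled pixels regardless of the chosen spatial size.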

Algorithm Comparison Experiments
To verify the effectiveness of the proposed model (2-3D-NL CNN), we compared it with the 2D CNN, 3D CNN, HybridSN [25], Two-CNN [32], SSRN [29], SSAN [40], FDSSC [30], Hamida [56], PResNet [57], and M3D-DCNN [35] models. Each model was trained with 10% of the training samples, and the spatial size of the input data was 5. Table 5 presents the parameter settings of each model, and Table 6 presents the classification performance of the algorithms on the PU and SA datasets. Table 6 shows that the 2-3D-NL CNN outperforms the other models in OA and AA, demonstrating its effectiveness. The 2-3D-NL CNN achieved higher classification accuracy than FDSSC, SSRN, PResNet, and the other models on the PU and SA datasets because it makes full use of multiscale spatial features, uses a nonlocal attention mechanism to effectively capture correlations between spectra, and thus extracts more discriminative spatial and spectral features. Figures 5 and 6 show the classification maps of the algorithms. From the classification maps of the two datasets, it is clear that simple 2D and 3D CNNs produce obvious pixel misclassification owing to their insufficient feature extraction capabilities. The other models perform much better, and the 2-3D-NL CNN proposed in this paper achieved the best classification results: almost no pixels were misclassified on the PU dataset, and the classification map obtained for the SA dataset is smoother than those of the other models, with fewer misclassified pixels.

Conclusions
Hyperspectral image classification is an important research area in the field of hyperspectral remote sensing, and CNN-based methods perform well in this task. However, a fixed convolution kernel size often leads to inadequate feature extraction, and the information redundancy of the spectral dimension makes spectral feature extraction difficult. To solve these problems, we proposed the 2-3D NL CNN, which introduces an inception block and a nonlocal attention mechanism into a convolutional neural network to improve the classification accuracy of hyperspectral images. Experiments were carried out on the PU and SA datasets. The results indicate the following: 1.
The 2-3D NL CNN effectively improves the classification accuracy of hyperspectral images. The inception block uses convolution kernels of different sizes to provide different sizes of receptive fields for the network, making feature extraction more comprehensive. The nonlocal attention mechanism enhances the spectral feature extraction ability of the network and suppresses the information redundancy of spectral dimension.

2.
The nonlocal attention mechanism is more suitable for hyperspectral image classification tasks. Our experiments compared three attention mechanisms, namely SENet, CBAM, and the nonlocal attention mechanism, and the nonlocal attention mechanism improved the classification accuracy more significantly on both datasets. This is mainly because the nonlocal attention mechanism can capture the correlation between a pixel to be classified and pixels at greater distances.
Although the model proposed in this paper showed excellent performance in hyperspectral image classification, it still has some shortcomings: the spectral features of the pixels to be classified are disturbed by the spectral information of neighboring pixels, and the generalization ability of the model was not verified. Future work should focus on extracting spatial features while avoiding the interference of neighboring pixels' spectral information with the spectral feature extraction of the pixel to be classified. More attention should also be paid to the generalization ability of the model by conducting experiments on new hyperspectral image datasets.