Hyperspectral Image Classification Based on Multi-Scale Residual Network with Attention Mechanism

Abstract: In recent years, image classification on hyperspectral imagery utilizing deep learning algorithms has attained good results. Spurred by those findings, and to further improve deep learning classification accuracy, we propose a multi-scale residual convolutional neural network model fused with an efficient channel attention network (MRA-NET) that is appropriate for hyperspectral image classification. The suggested technique comprises a multi-staged architecture


Introduction
A hyperspectral image presents a target region in a spectrum of continuous and narrow bands, containing both spatial and spectral feature information at pixel-level resolution [1]. Hyperspectral imagery is widely used in applications such as urban planning, agricultural development, and environmental monitoring [2]. Technically, a hyperspectral image is a three-dimensional data cube composed of two-dimensional images stacked along many narrow spectral bands. Analyzing a hyperspectral image's spatial and spectral characteristics can effectively contribute to the classification of ground objects. However, hyperspectral images are prone to the Hughes phenomenon due to the complexity of their structure, and the small number of labeled samples affects the overall performance of hyperspectral image classification. Due to these deficiencies, hyperspectral image classification remains a hot research topic.
Thus, the literature presents several attempts toward hyperspectral image classification. For example, traditional classifiers are utilized, such as the support vector machine (SVM), the K-nearest neighbor algorithm, and multinomial logistic regression (MLR) [3][4][5][6]. These algorithms mainly exploit the spectral information of the image due to its high dimensionality. Principal component analysis (PCA), independent component analysis (ICA), and image sparse representation (SR) methods are also used to process the spectral information by extracting its main features and reducing computational complexity [7][8][9][10][11]. However, as opposed to deep learning networks, these traditional methods are not able to extract deep-level features, imposing a relatively low classification accuracy. The deep learning method can extract abstract-level information from the image, making it more effective.

ECA-NET
For a feature map with input dimensions {W, H, C}, ECA-NET initially performs a global average pooling (GAP) [32] operation, which reduces the number of parameters and integrates the spatial information of the feature map. A reshape operation is carried out to obtain a matrix of size {1, 1, C}. A second matrix of shape {1, 1, C} is obtained via a one-dimensional convolution, and this matrix is then passed through a fully connected layer and a sigmoid activation function. Finally, the ECA-NET output feature is the multiplication of the original feature map {W, H, C} with the attention map {1, 1, C}. In this work, we used ECA-NET with the improved attention mechanism to extract useful features that are associated with the target feature classes and ultimately produce output feature information with stronger characterization capabilities that fully combines the spatial-spectral features of the hyperspectral images.

S2A Block
The Spectral-Spatial Attention (S2A) block [33] combines the SE-Net structure with the residual structure and uses two convolution kernels of different sizes to perform depthwise separable convolution on the input feature map. The feature maps obtained by the convolution kernels are transposed and multiplied. Finally, the resulting feature map is connected with the residual network that has an attention mechanism to realize the extraction of spatial and spectral feature information of the image. The S2A network structure diagram is shown in Figure 2.
Specifically, the S2A block comprises three parallel processing subnetworks, the outputs of which are ultimately fused into a single three-dimensional feature map. Given a feature map with dimensions {a, a, h}, it is input into the first subnetwork after a two-dimensional convolution, and it undergoes two depthwise separable convolutions with a (1, 1) convolution kernel, producing two feature maps of constant size. The first two dimensions of the two feature maps are merged to obtain two matrices of shape {a*a, h}; then, the latter matrix is transposed and multiplied with the former matrix to obtain an {a*a, a*a} matrix, which is output from the first subnetwork through a Softmax activation function. In the second subnetwork, the input feature map is subjected to two-dimensional convolution twice to obtain a feature map of shape {a, a, h2}, and the first two dimensions are merged to obtain a matrix of shape {a*a, h2}. In the third subnetwork, the operation is the same as in the first subnetwork: the input {a, a, h} is subjected to a two-dimensional convolution and then two depthwise separable convolutions to obtain two feature maps. The difference from the first subnetwork is that this time the former matrix is transposed and multiplied with the latter to obtain a matrix of shape {h2, h2}, which is then output through a Softmax activation function. Finally, the three distinct outputs of the corresponding subnetworks are multiplied, forming a new matrix {a*a, h2}, which is then reshaped to an {a, a, h2} matrix. After passing the input feature map through the two-dimensional convolution with h2 filters again, a new feature map with shape {a, a, h2} is obtained.
Ultimately, the newly obtained feature map, the feature map output by the second subnetwork after its second convolution, and the multiplied output of the three subnetworks are added together to obtain the final feature map, which is input to a batch normalization (BN) [34] layer, a ReLU [35] activation function, and a MaxPooling operation to produce the final output.
The S2A block employs several convolutions with different kernel sizes, as well as transpose-multiplication operations, to obtain matrices containing spatial and spectral features that are connected by a residual structure to better capture the relationship between the classified features and the spectral information.
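To make the transpose-multiplication concrete, the shape bookkeeping of the first and third subnetworks can be sketched in NumPy. This is a simplified illustration that omits the two-dimensional and depthwise separable convolutions, with hypothetical sizes a = 7 and h = h2 = 16:

```python
import numpy as np

def softmax_rows(m):
    # row-wise Softmax, numerically stabilized
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

a, h = 7, 16
rng = np.random.default_rng(0)
x = rng.random((a, a, h))                 # input feature map {a, a, h}

# merge the two spatial dimensions: {a*a, h}
q = x.reshape(a * a, h)
k = x.reshape(a * a, h)

# first subnetwork: {a*a, h} @ {h, a*a} -> spatial attention {a*a, a*a}
spatial_attn = softmax_rows(q @ k.T)

# third subnetwork: {h, a*a} @ {a*a, h} -> channel attention {h, h}
channel_attn = softmax_rows(q.T @ k)
```

The shapes match the {a*a, a*a} and {h2, h2} matrices described above; in the actual block, the two operands come from separate depthwise separable convolutions rather than the same reshaped tensor.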

Residual Convolutional Layer Structure (ECA_Residual_NET)
Considering that a deeper network may cause gradient dispersion, an improved residual structure was used to build a neural network model, i.e., the ECA residual net. This structure can retain weak image information, effectively deepen the neural network, extract high-level abstract feature information of the image without increasing the number of network parameters, and solve the problem of network degradation. This network structure is presented in Figure 3. Given an input feature of {c, c, d}, it undergoes a batch normalization (BN) operation, which greatly improves the processing speed of the subsequent data information. The output is then input to a two-dimensional convolutional layer utilizing a kernel of (3, 3) with a step size of a, and a ReLU activation function. The output then passes through a BN process. The latter convolutional BN process is repeated, and then, the output passes through an ECA-NET feature extraction module. The ECA-NET output result is added to the feature map that passed the BN layer for the first time, and ultimately the output is provided by a ReLU activation function. It is worth noting that the output result assigns lower weights to insensitive features in the hyperspectral domain and higher weights to abstract deep features.
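A minimal NumPy sketch of the ECA-NET channel attention used inside this residual block follows. It is a simplified illustration: a placeholder averaging kernel stands in for the learned one-dimensional convolution, and the fully connected layer is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eca(x, k=3, kernel=None):
    """ECA-style channel attention on a {W, H, C} feature map."""
    c = x.shape[-1]
    if kernel is None:
        kernel = np.ones(k) / k            # placeholder 1-D conv weights
    gap = x.mean(axis=(0, 1))              # global average pooling -> {C}
    padded = np.pad(gap, k // 2, mode='edge')
    conv = np.array([padded[i:i + k] @ kernel for i in range(c)])
    weights = sigmoid(conv)                # {1, 1, C} channel weights
    return x * weights                     # rescale the original map

x = np.random.default_rng(1).random((9, 9, 32))
y = eca(x)
```

The output keeps the {W, H, C} shape; channels whose pooled response maps to a small sigmoid weight are suppressed, which is the re-weighting behavior the residual structure adds back to the skip connection.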


CRE_Block
The CRE_Block comprises a two-dimensional convolution process, a ReLU activation function, a batch normalization, and an attention mechanism (ECA_NET) to aggregate the spatial spectrum characteristics of the image. The module structure is depicted in Figure 4.


Overall Network Structure of the Suggested Deep Classification Technique
The proposed deep network architecture uses the PCA algorithm to separate and extract the image space-spectrum features for the first time, and then uses the S2A module, which contains the attention mechanism and the improved residual network structure, to extract the spatial and spectral features of the image multiple times. Additionally, in the middle of the model, we also exploit the ECA-NET network, and finally perform a feature fusion process to merge the features from the distinct subnetwork, e.g., S2A module, ECA-NET, etc., and input the fused feature to a fully connected layer for classification. The proposed model structure is shown in Figure 5.
We utilize the PCA method to reduce the image spectral dimension to 3 and to 20 and select patches of different sizes to create two inputs (Input_1, Input_2) with sizes {27, 27, 3} and {7, 7, 20}, respectively. The feature map with the large patch size and the small spectral dimension, i.e., Input_1, contains more spatial feature information, while the feature map with the small patch size but large spectral dimension, i.e., Input_2, contains more spectral feature information.
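The two-branch input preparation can be sketched as follows. The cube dimensions and patch center here are hypothetical stand-ins, and PCA is computed from the band covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((64, 64, 103))       # random stand-in for a hyperspectral cube

flat = cube.reshape(-1, cube.shape[-1])
flat = flat - flat.mean(axis=0)        # center each band

# PCA via eigendecomposition of the band covariance matrix
cov = flat.T @ flat / (flat.shape[0] - 1)
vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
order = np.argsort(vals)[::-1]         # strongest components first
pc3 = (flat @ vecs[:, order[:3]]).reshape(64, 64, 3)
pc20 = (flat @ vecs[:, order[:20]]).reshape(64, 64, 20)

def patch(img, row, col, size):
    """Extract a size x size neighborhood centered on (row, col)."""
    r = size // 2
    return img[row - r:row + r + 1, col - r:col + r + 1, :]

input_1 = patch(pc3, 32, 32, 27)       # {27, 27, 3}: large patch, few bands
input_2 = patch(pc20, 32, 32, 7)       # {7, 7, 20}: small patch, more bands
```

In practice, one patch pair is extracted around every labeled pixel, and border pixels are handled by padding the projected cube.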
Initially, we feed the Input_1 feature map into our architecture and extract the spatial-spectral feature information of the image through an S2A block module with 128 filters and convolution kernels of (1, 1) and (5, 5). The output tensor of the S2A block is {13, 13, 128}, which is then input to the subnetworks Net_1 and Net_2, respectively. The Net_1 network first consists of an S2A block module with 64 filters and convolution kernels of (1, 1) and (3, 3), followed by two CRE modules with MaxPooling operations; the output shape of this subnetwork (map1) is {3, 3, 64}. Regarding the Net_2 subnetwork, the Input_1 feature tensor is input to two ECA_Residual_NET modules and then passes through two CRE modules and MaxPooling operations to obtain an output feature map (map2) of size {3, 3, 64}. The Input_2 feature map is initially input to two CRE modules to create a feature map of {7, 7, 192} that is sent to Net_3 and Net_4, respectively. The Net_3 network contains two CRE modules with different parameters followed by a MaxPooling operation and outputs a feature map (map3) with shape {3, 3, 64}. The Net_4 network consists of an S2A block module with 64 filters and (1, 1), (3, 3) convolution kernels followed by a MaxPooling layer, ultimately creating an output feature map (map4) with shape {3, 3, 64}. Finally, the four output feature maps are concatenated to realize the multi-scale feature extraction of the hyperspectral image and pass first through a global average pooling (GAP) layer and then through two fully connected layers with 200 and 100 units, respectively (the variable g represents the number of units in the legend), each followed by a sigmoid activation function. Finally, the hyperspectral image classification is obtained through a fully connected layer (the variable h represents the number of classes) with a Softmax activation function.
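The fusion head described above can be sketched with random stand-in weights to verify the shape flow (200 and 100 units for the two fully connected layers, and h = 9 classes as for Pavia University; the weights are illustrative, not trained):

```python
import numpy as np

rng = np.random.default_rng(0)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dense(x, w, b, act):
    return act(x @ w + b)

# four {3, 3, 64} branch outputs -> concatenate on channels -> {3, 3, 256}
maps = [rng.random((3, 3, 64)) for _ in range(4)]
fused = np.concatenate(maps, axis=-1)

gap = fused.mean(axis=(0, 1))                    # GAP -> {256}
h = 9                                            # number of classes
w1, b1 = rng.random((256, 200)), np.zeros(200)
w2, b2 = rng.random((200, 100)), np.zeros(100)
w3, b3 = rng.random((100, h)), np.zeros(h)

x = dense(gap, w1, b1, sigmoid)                  # FC(200) + sigmoid
x = dense(x, w2, b2, sigmoid)                    # FC(100) + sigmoid
probs = dense(x, w3, b3, softmax)                # FC(h) + Softmax
```

The final vector sums to one and its argmax gives the predicted class of the center pixel of the input patches.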
This paper extracts hyperspectral image information at four different scales multiple times and finally concatenates the extracted features, which effectively improves the accuracy of hyperspectral image classification.
Figure 5. Overall network structure.

Experimental Platform and Experimental Result
All experiments are performed on a Windows 10 system with an Intel Core i7-9600 CPU and an Nvidia GeForce GTX 2060S GPU with 8 GB of video memory, using the TensorFlow 2.3 deep learning framework and Python 3.7. During trials, we challenge the classification performance of the proposed network model against current hyperspectral classification models on various datasets. Additionally, we analyze the influence of various hyperparameters on the classification performance of our model. The evaluation metrics used are the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient.
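The three evaluation metrics can be computed from a confusion matrix as follows (standard definitions; the example matrix is illustrative):

```python
import numpy as np

def hsi_metrics(cm):
    """OA, AA, and Kappa from a confusion matrix (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                           # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))      # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

# illustrative 2-class example: 50 + 40 correct, 10 confused
oa, aa, kappa = hsi_metrics([[50, 0], [10, 40]])
```

For this example, OA = 0.9, AA = 0.9, and Kappa = 0.8; Kappa is lower than OA because it discounts the agreement expected by chance.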

Introduction of the Dataset
The Pavia University dataset (UP) is a collection of hyperspectral images obtained from Pavia, Italy, with an example presented in Figure 6a. The spatial image size is 610 × 340 pixels, the spectral information has 103 effective bands, and the wavelength range is 430~860 nm. The spatial resolution is 1.3 m, covering nine types of ground features such as grass, asphalt, and bricks. The ground truth feature map is shown in Figure 6b, with 42,776 pixels marked in total. During trials, as in work [20], we randomly select 10%, 10%, and 80% of the whole labeled samples as the training, validation, and testing sets. The dataset feature types, along with the training and test set sample information, are shown in Table 1.
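The random sample split can be sketched as follows (the seed and helper name are illustrative):

```python
import numpy as np

def split_indices(n, train, val, seed=0):
    """Randomly split n labeled pixels into train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_tr, n_va = int(train * n), int(val * n)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Pavia University: 42,776 labeled pixels, 10% / 10% / 80%
tr, va, te = split_indices(42776, 0.10, 0.10)
```

A stratified variant that permutes the pixel indices of each class separately keeps the per-class proportions of Table 1 in all three sets.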

The nine Pavia University ground-truth classes are Asphalt, Meadows, Gravel, Trees, Sheets, Soil, Bricks, Bitumen, and Shadows.

KSC Dataset
The KSC dataset is a hyperspectral imagery collection obtained from the Kennedy Space Center. An example is shown in Figure 7a. The spectral information has a total of 176 effective bands, the spatial image size is 614 × 512 pixels, and the wavelength range is 400~2450 nm, including scrub, oak hammock, slash pine, and 13 other class categories. The ground truth feature map is shown in Figure 7b, where a total of 5211 pixels are labeled. During trials, as in work [20], we randomly select 20%, 10%, and 70% of the whole labeled samples as training, validation, and testing sets for the datasets. The dataset feature types, along with the training and test set sample quantity information are shown in Table 2.

Indian Pines Dataset
The Indian Pines dataset is shown in Figure 8a. The spectral information has a total of 200 effective bands, the spatial image size is 145 × 145 pixels, the wavelength range is 400~2500 nm, and the spatial resolution is 20 m. This dataset includes 16 feature categories such as alfalfa, corn, and oats. The ground truth feature map is shown in Figure 8b, with 10,249 pixels labeled. Similarly to the previous datasets, as in work [20], we randomly select 20%, 10%, and 70% of the whole labeled samples as the training, validation, and testing sets. Table 3 shows the types of object classes in the dataset and the number of samples in the training and test sets.

Parameter Setting
In this section, we analyze the interplay between the parameter setup and the classification performance of the proposed model. The tuned parameters include learning rate, batch_size, and training sample ratio. Regarding learning rate, it controls the speed of the gradient descent during the training process, with the appropriate learning rate parameters effectively controlling the convergence ability and speed of the model. We evaluate our network by using six learning rates with different sizes, i.e., 0.00005, 0.0001, 0.0003, 0.0005, 0.001, and 0.005. The test results are shown in Figure 9, from which we observe that when the learning rate is 0.0003, the classification performance on the three datasets is better. Additionally, tuning the learning rate parameter has less impact on the accuracy of the Pavia University dataset and a greater impact on the Indian Pines dataset.
The next trial investigates how the batch size affects the overall accuracy of our method. The batch size refers to the number of samples selected per training step. Choosing a suitable batch_size can effectively improve memory utilization and the convergence accuracy of the model. We evaluate the performance with batch_sizes of 16, 32, 64, and 128, with the corresponding results presented in Figure 10. Our trials demonstrate that a batch_size of 16 attains the best classification on the three datasets. Moreover, in the case of fewer training samples, a smaller batch_size performs better.
Our final trial considers utilizing 5%, 10%, 20%, 30%, and 40% of the sample data as the training set. After our model is trained on each respective sample set, we test the network, with the corresponding results presented in Figure 11. From that figure, we conclude that as the training samples increase, the overall accuracy of our model increases. To compare with other networks, we adopt the strategy of [20] and randomly select 20%, 10%, and 70% of the whole labeled samples as the training, validation, and testing sets, respectively, for the Indian Pines and KSC datasets. For the Pavia University dataset, we employ a 10%, 10%, and 80% strategy.
Figure 12 shows the accuracy curve of each classification model on the Pavia University dataset. It can be seen that our deep network model converges quickly and that its classification accuracy is higher than that of the competitor techniques. The model is trained on the Pavia University dataset in just 3 minutes and 31 seconds. Table 4 shows the classification accuracy of each model for all nine object classes. From that table, we observe that our classification network attains 99.67% OA, 99.21% AA, and a Kappa coefficient of 0.9971. Compared to the competitor algorithms, our method attains higher classification results. Specifically, the OA of the PCA algorithm is 87.23%, the AA is 88.15%, and the Kappa coefficient is 0.85. Under the same conditions, the overall classification accuracy of the SVM algorithm is 3.3% higher than that of PCA, and the average classification accuracy is 2.1% higher, but still inferior to our method. The OA of 2D-CNN is 93.33%, the AA is 94.17%, and the Kappa coefficient is 0.92, while the OA of 3D-CNN is 94.68%, the AA is 95.37%, and the Kappa coefficient is 0.94. Compared to the traditional algorithms, the classification accuracy of both CNN methods is greatly improved, reflecting the superiority of deep learning in the hyperspectral classification problem.
Compared with 3D-CNN, the 3D residual network improves overall classification accuracy by 3.1% and average classification accuracy by 2.8%. The SSRN network presents an appealing classification performance, attaining 98.17% OA, 98.64% AA, and a Kappa coefficient of 0.98. We bolded the entries with the highest accuracy for each feature class. The experimental results show that the overall performance of our proposed network model is better than that of the other models. Figure 13 depicts some classification examples per method. The PCA and SVM classification algorithms have poor accuracy and produce more misclassifications. The 2D-CNN results are slightly improved but still contain many misclassifications. The 3D-CNN, RES-3D, and SSRN models attain improved classification accuracy. However, the classification maps of the suggested deep network architecture are even more accurate: they do not contain salt-and-pepper noise, and the class boundaries are smooth and well fitted.
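The OA, AA, and Kappa indicators compared throughout these tables can all be derived from a confusion matrix. A minimal sketch (the function name and example matrix are ours, for illustration only):

```python
import numpy as np

def classification_metrics(cm):
    """Compute OA, AA, and Cohen's kappa from a confusion matrix
    `cm` (rows = true classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                          # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))     # mean per-class accuracy
    pe = (cm.sum(axis=0) @ cm.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)                   # agreement beyond chance
    return oa, aa, kappa
```

OA weights every test pixel equally, while AA weights every class equally, which is why the two diverge on imbalanced scenes such as Indian Pines; Kappa additionally discounts agreement expected by chance.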

KSC Dataset
The accuracy curve of each classification model on the KSC dataset is presented in Figure 14. As the epochs increase, the classification accuracy rises, and our network starts to converge by the 4th epoch. The model is trained on the KSC dataset in just 1 minute and 43 seconds. Table 5 shows the precise classification indicators of each method, where the proposed technique attains the highest metrics compared to the competitor methods, i.e., 99.81% OA, 99.74% AA, and a 0.9952 Kappa coefficient. Additionally, from this table we observe that the margin by which our method outperforms PCA, SVM, 2D-CNN, 3D-CNN, RES-3D, and SSRN in overall accuracy is 17.72%, 11.05%, 9.56%, 6.29%, 2.34%, and 1.65%, respectively, larger than the corresponding margins on the Pavia University dataset evaluated in Section 3.2.2. Accordingly, the average-accuracy margins are 18.18%, 10.57%, 8.42%, 5.54%, 2.49%, and 1.61%. This trial shows that RES-3D and the suggested network achieve the best classification and highlights that our method is more faithful to the real landmark map. Classification examples per method are shown in Figure 15.

Figure 14. Overall accuracy curve of different models on the KSC dataset.

Indian Pines Dataset
Figure 16 shows the accuracy curve of each classification model on the Indian Pines dataset. Overall, the proposed deep classification network has the fastest convergence and the highest accuracy, and its classification performance is better than that of the competitor models. The model is trained on the Indian Pines dataset in just 1 minute and 37 seconds. Table 6 shows the precise classification index of each model for 13 classes of ground objects. From that table, we observe that the overall accuracy of the proposed network is higher by 24.05%, 18.81%, 13.22%, 5.09%, 1.79%, and 1.10% than that of PCA, SVM, 2D-CNN, 3D-CNN, RES-3D, and SSRN, respectively. Accordingly, the average accuracy attained by our model is higher by 24.02%, 18.02%, 14.22%, 5.48%, 2.09%, and 1.17%, respectively, reaching 99.45% average accuracy and a Kappa coefficient of 0.9961. The classification results of each model are shown in Figure 17. The classification maps of the PCA, SVM, and 2D-CNN models are poor, with more noise and speckles. The 3D-CNN and RES-3D classification results are less noisy, which improves the classification accuracy of these models. SSRN and the suggested network both attain an appealing classification accuracy. Table 6 also shows that our classifier's accuracy on Corn-min till is slightly lower, while its accuracy on the other classes is higher. In addition, Figure 16 shows that our network converges extremely fast, achieving good classification accuracy by the 6th epoch, which is superior to the other models.

Figure 17. Classification results of each model on the Indian Pines dataset.
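Statements such as "converges by the 6th epoch" can be made precise by recording the first epoch at which validation accuracy crosses a threshold. A small sketch of this bookkeeping (the threshold and accuracy history are illustrative, not taken from the paper):

```python
def convergence_epoch(acc_history, threshold=0.99):
    """Return the first (1-based) epoch whose validation accuracy
    reaches `threshold`, or None if it is never reached."""
    for epoch, acc in enumerate(acc_history, start=1):
        if acc >= threshold:
            return epoch
    return None
```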

Conclusions
This paper studies the application of deep learning to hyperspectral image classification. Addressing the wide coverage, high spectral resolution, and large amount of redundant information characteristic of hyperspectral images, we design a multi-scale residual convolutional network fused with ECA-NET to extract image feature information. The model uses ECA-Net, an improved residual network, and other structures to extract hyperspectral image information repeatedly at different scales; it fully fuses and extracts the space-spectrum characteristics of the image and effectively alleviates the problems of gradient dispersion and sample information redundancy. We compare our classification model against six current classification models, i.e., PCA, SVM, 2D-CNN, 3D-CNN, RES-3D, and SSRN, on the Pavia University, KSC, and Indian Pines datasets, and demonstrate that our algorithm can effectively classify various object classes and has clear advantages in dealing with hyperspectral classification problems. Our proposed method attains 99.82%, 99.81%, and 99.37% overall accuracy, respectively, on the three public datasets. All trials demonstrate the superiority of our method over the competitors, attaining a high classification accuracy. Future work shall focus on studying spatial and spectral feature fusion methods to enhance feature extraction, improve the network structure and parameters, accelerate model convergence, and reduce network training time.
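The channel-attention step around which the model is built can be sketched in NumPy under simplifying assumptions: a feature map of shape (H, W, C) is average-pooled over space, passed through a 1-D convolution across channels and a sigmoid, and the resulting per-channel gate rescales the input. This is only an illustration of the mechanism; a real ECA block uses learned 1-D convolution weights and an adaptively chosen kernel size, whereas the uniform kernel here is a placeholder:

```python
import numpy as np

def eca_attention(x, k=3):
    """Illustrative ECA-style channel attention on a (H, W, C) map:
    global average pooling, 1-D cross-channel convolution (placeholder
    uniform weights), sigmoid gating, and channel-wise rescaling."""
    desc = x.mean(axis=(0, 1))                     # (C,) channel descriptor
    kernel = np.ones(k) / k                        # placeholder conv weights
    conv = np.convolve(desc, kernel, mode="same")  # local channel interaction
    gate = 1.0 / (1.0 + np.exp(-conv))             # sigmoid gate in (0, 1)
    return x * gate[None, None, :]                 # rescale each channel
```

Because the gate has shape (1, 1, C), the multiplication reproduces the {W, H, C} × {1, 1, C} rescaling described for ECA-NET in the introduction.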
