Multiscale Spatial-Spectral Convolutional Network with Image-Based Framework for Hyperspectral Imagery Classification

Jointly using spatial and spectral information has been widely applied to hyperspectral image (HSI) classification. In particular, convolutional neural networks (CNNs) have gained attention in recent years due to their detailed representation of features. However, most CNN-based HSI classification methods use image patches as the classifier input. This limits the range of spatial neighbor information that can be used and reduces processing efficiency in training and testing. To overcome this problem, we propose an image-based classification framework that is efficient and straightforward. Based on this framework, we propose a multiscale spatial-spectral CNN for HSIs (HyMSCN) that integrates both fused features from multiple receptive fields and multiscale spatial features at different levels. The fused features are extracted by a lightweight block called the multiple receptive field feature block (MRFF), which contains several types of dilated convolution. By fusing multiple receptive field features and multiscale spatial features, HyMSCN obtains a comprehensive feature representation for classification. Experimental results on three real hyperspectral images demonstrate the efficiency of the proposed framework. The proposed method also achieves superior performance for HSI classification.


Introduction
Hyperspectral imagery (HSI) has attracted a lot of attention in recent years since it has hundreds of continuous observation bands throughout the electromagnetic spectrum, ranging from visible to near-infrared wavelengths. HSI has also been used in many applications due to its high dimensionality and distinct spectral features [1][2][3]. Supervised classification is one of the most critical applications and is widely used in remote sensing. However, spectral-based classification methods typically only measure the spectral characteristics of objects and ignore spatial neighborhood information [4]. Hyperspectral image classification (HSIC) can be improved by considering both spatial and spectral information [5]. Moreover, the multiscale spatial-spectral classification method is well adapted for HSI since regions at different scales contain complementary but interconnected information for classification.
Multiscale spatial-spectral classification methods can be categorized into two groups: multiscale superpixel segmentation [6][7][8][9] and multiscale image cubes [10][11][12][13][14][15]. Numerous methods have been developed to determine the optimal scale in multiscale superpixel segmentation. Yu et al. [6] proposed a multiscale superpixel-level support vector machine (SVM) classification method to exploit the spatial information within different superpixel scales. Zhang et al. [7] applied a multiscale superpixel-based sparse representation algorithm to obtain spatial structure information for different segmentation scales and generated a classification map using a sparse representation classifier. Similarly, Dundar et al. [8] added a guided filter to process the first three principal components and implemented multiscale superpixel segmentation. Chen et al. [9] used multiscale segmentation to obtain the spatial information from different levels, and the results of multiscale segmentation were treated as the input for a rotation forest classifier. Finally, the majority voting rule was used to combine the classification results from different segmentation scales.
An image pyramid refers to an image that is subjected to repeated smoothing and subsampling to generate a series of weighted, downsampled images [10]. Li et al. [11] proposed a segmented principal component analysis and Gaussian pyramid decomposition-based multiscale feature fusion algorithm for HSIC. The Gaussian image pyramid was used to generate images with multiple spatial sizes following PCA dimension reduction. Fang et al. [12] applied a multiscale adaptive sparse representation for selecting the optimal regional scales for different HSIs with various structures. Liu et al. [13] proposed a multiscale representation based on random projection. This method modeled the spatial characteristics at all reasonable scales comprising each pixel and its neighbors. A multiscale joint collaborative representation with a locally adaptive dictionary was developed to incorporate complementary contextual information into classification by multiplying different scales with distinct spatial structures and characteristics [14]. He et al. [15] proposed feature extraction with a multiscale covariance map for HSIC. In their work, a series of multiscale cubes were constructed for each pixel in the dimension-reduced imagery. A Gaussian pyramid and edge-preserving filtering were also used to extract multiscale features. The final classification map was produced using a majority voting method [16]. Most of these multiscale classification methods independently extract features or spatial structures at different image scales and fuse the classification results in the final prediction stage. Thus, there is no interaction between the features at different scales.
Deep learning-based models have also been introduced for HSIC in recent years. Spatial-spectral features can be extracted using a deep convolutional neural network, and these features represent low-to-high level semantic information. Different deep learning architectures have also been introduced for HSIC [17][18][19][20][21][22][23]. Li et al. [17] proposed a pixel-pair method to significantly increase the number of training samples. This was necessary to overcome the imbalance between the high dimensionality of spectral features and limited training samples (also known as the Hughes phenomenon). A similar cube-pair 3-D convolutional neural network (CNN) classification model has also been proposed [18]. Du et al. [19] proposed an unsupervised network to extract high-level feature representations without any label information. Recurrent neural network (RNN) and parametric rectified tanh activation functions have been introduced for HSIC [20]. Self-taught learning was used for unsupervised extraction of features from unlabeled HSI [21]. Lee et al. [22] proposed a network to extract features with multiple convolution filter sizes. Residual connection and 3-D convolution have also been applied to HSIC [23], and semi-supervised classification methods have been developed as well [24][25][26]. He et al. [24] proposed a semi-supervised model using generative adversarial networks (GAN) to use the limited labeled samples for HSIC. A multi-channel network was proposed to extract the joint spectral-spatial features using semi-supervised classification [25]. Zhan et al. [26] developed a 1-D GAN to generate hyperspectral samples that were similar to real spectral vectors. Ahmad et al. [27] proposed a semi-supervised multi-kernel class consistency regularizer graph-based spatial-spectral feature learning framework.
More recently, various convolutional neural networks with multiscale spatial-spectral features have been introduced for hyperspectral image classification [28][29][30][31][32][33][34][35][36][37][38][39][40][41]. Jiao et al. [28] used a pooling operation to generate multiple images from HSI, and a pretrained VGG-16 was introduced to extract multiscale features. Fusion features were then fed into classifiers. Liang et al. [29] also used pretrained VGG-16 to extract multiscale spatial structures and proposed an unsupervised cooperative sparse autoencoder method to fuse deep spatial features and spectral information. Multiscale feature extraction has also been proposed using multiple convolution kernel sizes and determinantal point process priors [30]. An automatic design CNN was introduced with automatic 1-D Auto-CNN and 3-D Auto-CNN [31]. An attention mechanism was also introduced for HSIC [32][33][34]. Fang et al. [32] proposed a 3-D dense convolutional network with a spectral attention network. Wang et al. [33] designed a spatial-spectral squeeze-and-excitation residual network to exploit the attention mechanism for HSIC. Mei et al. [34] used RNN and CNN to design a two-branch spatial-spectral attention network. Zhang et al. [35] proposed a 3-D lightweight CNN for limited training samples and also presented two transfer learning strategies: (1) a cross-sensor strategy and (2) a cross-modal strategy. An unsupervised spatial-spectral feature learning strategy was proposed using a 3-D CNN autoencoder to learn effective spatial-spectral features [36]. A multiscale deep middle-level feature fusion network was proposed to consider the complementary and related information among different scale features [37]. Two convolution capsule networks were proposed to enrich the spatial-spectral features [38,39]. Pan et al. [40] proposed a rolling guidance filter and vertex component analysis network to utilize spatial-spectral information. Several other CNN-based HSIC methods were also proposed [41][42][43]. However, some of these multiscale methods still focus on independently extracting multiscale features based on different image scales. Most importantly, these methods only use image patches as model inputs, which limits the understanding of the remote sensing image.
This paper proposes an image-based classification framework for HSI to address the inefficiency of existing methods. Based on this framework, this paper proposes a novel network, HyMSCN (Multiscale Spatial-spectral Convolutional Network for Hyperspectral Image), to improve the representation ability for HSI. The main contributions of the proposed approach can be summarized as follows:
(1) A novel image-based classification framework is proposed for hyperspectral image classification. The proposed framework is more universal, efficient, and straightforward for training and testing processes compared to the traditional patch-based framework.
(2) Local neighbor spatial information is exploited using a residual multiple receptive field fusion block (ResMRFF). This block integrates residual learning and multiple dilated convolutions for lightweight and efficient feature extraction.
(3) Multiscale spatial-spectral features are exploited using the proposed HyMSCN method. The method is based on the feature pyramid structure and considers both multiple receptive field fused features and multiscale spatial features at different levels. This approach allows for comprehensive feature representation.
The remainder of this paper is organized as follows. Section 2 introduces the dilated convolution and the feature pyramid. Section 3 presents the details of the proposed classification framework and HyMSCN. Section 4 evaluates the performance of our method compared with that of other hyperspectral image classifiers. Section 5 provides a discussion of the results, and the conclusion is presented in Section 6.

Dilated Convolution
Dilated convolution, or atrous convolution, has been shown to be effective for semantic segmentation [44][45][46][47]. Dilated convolution refers to "convolution with a dilated filter", and the operator can apply the same filter at different ranges using different dilation factors [44]. It can be constructed by inserting holes between the elements of the convolution kernel. The operation of one 3×3 dilated convolution with different dilation factors is shown in Figure 1. The dilation factor determines the sampling distance of the convolution kernel, so the extracted features represent spatial structure information at different scales. A larger convolution kernel (such as 5×5 or 7×7) can also be used to enlarge the receptive field. However, the number of parameters should be kept small to prevent overfitting since HSI datasets provide only a small number of training samples [5]. Dilated convolution can enlarge the receptive field with the same number of parameters and is therefore well adapted for HSIC.
A "gridding problem" is known to exist in the dilated convolution framework [44,45]. This can result in serration gridding, as shown in Figure 2. Wang et al. [46] addressed this problem by developing a hybrid dilated convolution (HDC) that chooses a series of proper dilation factors to avoid gridding. Chen et al. [45,47] proposed atrous spatial pyramid pooling (ASPP) to exploit multiscale features by employing multiple parallel filters with different dilation factors. The main purpose of both methods is to ensure that the final receptive field fully covers a square region without any holes or missing edges. Following their work, we also designed a lightweight dilated convolution block to fuse multiple receptive field features (see Section 3.2).
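The gridding effect can be illustrated with a small calculation. The following is a minimal sketch in one dimension, assuming stacked 3-tap dilated convolutions; the function names and the two example dilation sequences are illustrative choices and are not taken from the cited works. It lists which input offsets contribute to one output position.

```python
def effective_kernel_size(k, d):
    """Span of a k-tap kernel with dilation factor d: (k - 1) * d + 1."""
    return (k - 1) * d + 1


def covered_offsets(dilations, k=3):
    """Input offsets seen by one output position after stacking
    1-D k-tap dilated convolutions with the given dilation factors."""
    half = k // 2
    offsets = {0}
    for d in dilations:
        taps = [d * t for t in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return sorted(offsets)


def has_holes(offsets):
    """True if the covered offsets do not form a contiguous range."""
    return len(offsets) != offsets[-1] - offsets[0] + 1


same_factor = covered_offsets([2, 2, 2])   # repeated dilation factor
mixed_factor = covered_offsets([1, 2, 3])  # increasing dilation factors
print(same_factor, has_holes(same_factor))
print(mixed_factor, has_holes(mixed_factor))
```

With a repeated dilation factor of 2 only the even offsets are sampled, whereas the mixed sequence covers every offset in the same span, which is the property that HDC-style designs aim for.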

Feature Pyramid
Most hyperspectral image classification methods [11,14,15,16,28,29] construct image pyramids to extract multiscale features, such as in Figure 3a. These features are independently extracted for each image scale, which is a slow process. Most importantly, the image pyramid is deficient in semantic representation and lacks interaction between features at different scales.
The feature pyramid network was first proposed by Lin et al. [48] for object detection. The method was developed to detect objects at different scales. Multiscale features enable the model to cover a large range of scales at different pyramid levels for segmentation. A convolution network can extract hierarchical features layer by layer, and a convolution layer with a stride greater than one can change the spatial size of the features to construct inherent multiscale features within a pyramid (Figure 3b). The feature pyramid produces multiscale features with strong semantic representation even at high-resolution feature scales. Additionally, the architecture combines low-to-high level features through top-down feature fusion, which yields rich semantics at all levels [48]. The hyperspectral image classification network was designed based on this strategy (see Section 3.3).

Proposed Methods
In this section, we propose an image-based framework for HSI that has greater flexibility and efficiency compared with the traditional patch-based classification framework (see Section 3.1). A novel multiscale spatial-spectral convolutional network, HyMSCN, was developed based on this framework. The network is composed of two components: a residual multiple receptive field fusion block (ResMRFF) and a feature pyramid. ResMRFF mainly focuses on extracting multiple receptive field features with lightweight parameters to avoid overfitting (see Section 3.2). In Section 3.3, the network structure containing the feature pyramid is developed and used to extract multiple features that have strong semantics at different scales. The multiscale features are then fused for the final classification.
The patch-based classification method has several disadvantages. Firstly, the patch size restricts the receptive field of the classification model. In CNN-based hyperspectral image classification models, the model cannot obtain information from beyond the patch even though it is composed of a series of 3×3 convolution layers. Secondly, the model must be redesigned if the patch size changes. Most importantly, the optimal patch size depends on the ground sample distance (GSD) of the remote sensing image, so it is difficult to design a universal patch-based model to fit arbitrary images. Thirdly, the pixel-by-pixel processing of the test phase is inefficient. The testing patches consume more memory than the original image because of the redundant information between different patches. In addition, for a CNN-based model, processing the patches one by one consumes more bandwidth in the central processing unit (CPU) and graphics processing unit (GPU).
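The memory redundancy of patch-based testing can be estimated with a short calculation. The sketch below assumes, for illustration only, that one square patch is extracted per pixel and that all patches are stored at once; the 9×9 patch size is a hypothetical value rather than a setting reported here, and the function names are likewise illustrative.

```python
def image_values(height, width, bands):
    """Number of values stored for the whole image."""
    return height * width * bands


def patch_values(height, width, bands, patch):
    """Number of values stored when one patch x patch cube is kept per pixel
    (assumption: border pixels are padded so every pixel gets a full patch)."""
    return height * width * patch * patch * bands


# Indian Pines after band removal: 145 x 145 pixels, 200 bands.
whole = image_values(145, 145, 200)
patches = patch_values(145, 145, 200, patch=9)  # hypothetical 9 x 9 patches
print(whole, patches, patches // whole)
```

Under these assumptions the redundancy factor equals the number of pixels in a patch, so it grows quadratically with the patch size.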
In this study, we propose an image-based classification framework as an alternative to overcome the above issues associated with patch-based classification (Figure 5). The image-based framework can utilize an arbitrary semantic segmentation model for hyperspectral image classification. Because only a small number of labeled pixels can be used for training in HSIC, the training phase differs from that of the semantic segmentation models used in computer vision [34][35][36]. In the training phase, the image containing the training samples is used as the input, and the output is the predicted labels for all corresponding pixels. Since only some of the pixels in this image have labels, the positions of the labeled samples act as a mask over the output to select the corresponding pixels. The loss is calculated between the selected predictions and the labeled pixels. In the testing phase, an image is used as input and the corresponding labels are predicted for all the pixels. The image-based classification framework can fully utilize the graphics processing unit during testing to accelerate the inference process with non-redundant information. Moreover, the receptive field is not affected by the patch size since there is no slicing operation, and the testing process is simple and straightforward. The framework directly outputs the results for the test image in one inference instead of processing a patch for each pixel, which saves computing resources compared with the patch-based classification framework.
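The masked loss used in the training phase can be sketched in a few lines. The example below is a minimal, framework-free illustration written in plain Python; the `UNLABELED` marker value and the nested-list layout (height × width × classes) are assumptions made for this sketch and do not describe the authors' implementation.

```python
import math

UNLABELED = -1  # hypothetical marker for pixels without a training label


def masked_cross_entropy(logits, labels):
    """Mean cross entropy over labeled pixels only.

    logits: H x W x C nested lists of class scores for every pixel.
    labels: H x W nested lists of class indices, UNLABELED where unknown.
    """
    total, count = 0.0, 0
    for row_logits, row_labels in zip(logits, labels):
        for pixel_logits, label in zip(row_logits, row_labels):
            if label == UNLABELED:
                continue  # the mask removes this pixel from the loss
            peak = max(pixel_logits)  # subtract the max for numerical stability
            log_sum = peak + math.log(sum(math.exp(z - peak) for z in pixel_logits))
            total += log_sum - pixel_logits[label]
            count += 1
    if count == 0:
        raise ValueError("no labeled pixels in the image")
    return total / count


logits = [[[0.0, 0.0], [5.0, -5.0]],
          [[-3.0, 3.0], [1.0, 1.0]]]
labels = [[0, UNLABELED],
          [UNLABELED, 1]]
print(masked_cross_entropy(logits, labels))
```

Only the two labeled pixels contribute to the result, so changing the scores of the unlabeled pixels leaves the loss unchanged. In PyTorch, a comparable effect can be obtained with the `ignore_index` argument of `torch.nn.CrossEntropyLoss`; whether that mechanism was used here is not stated.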

Residual Multiple Receptive Field Fusion Block
The residual block is a popular CNN structure used in many computer vision tasks [49,50]. The method uses a skip connection to construct an identity mapping, which enables the block to learn the residual function. The residual function can be formulated as $X_i = f(X_{i-1}) + X_{i-1}$, where $X_{i-1}$ and $X_i$ refer to the input and output of one residual block and $f(\cdot)$ refers to the non-linear transformation. A typical illustration of the residual block is shown in Figure 6. With this structure, the forward and backward signals can be directly propagated from one residual block to another. This enhances the gradient transfer from the top to the bottom of the network and mitigates the vanishing gradient problem in deep networks.
Following the strategy of the residual block, we designed a new block called ResMRFF, which combines residual learning with a multiple receptive field fusion block (MRFF). In the new block, MRFF replaces the convolution layer of the original residual block. MRFF is designed to enlarge the receptive field of features with fewer parameters based on dilated convolution (Figure 7). The structure follows the reduce-transform-merge strategy. Firstly, a 1×1 convolution layer is used to reduce the feature dimension. Secondly, dilated convolution layers are used to extract features with a variety of dilation factors. Lastly, the multiple features are fused using a hierarchical feature fusion strategy [51]. Larger receptive field features are fused with smaller receptive field features to improve representation at different spatial ranges.
Because only a small number of labeled samples are available in HSIC, it is reasonable to reduce the number of parameters to avoid overfitting. For an input feature $X_{i-1} \in \mathbb{R}^{H_{in} \times W_{in} \times C_{in}}$, a standard convolution layer contains $C_{out}$ kernels of size $k \times k \times C_{in}$ to produce the output feature $X_i \in \mathbb{R}^{H_{out} \times W_{out} \times C_{out}}$, where $H$, $W$, $C$, and $k$ represent the height, width, and number of channels of the features and the kernel size, respectively. The number of parameters is $C_{in} \times k \times k \times C_{out}$ for a standard convolution layer. The MRFF module follows the reduce-transform-merge strategy to reduce the number of parameters and to make the network more computationally efficient [52][53][54]. Firstly, a 1×1 convolution layer is introduced to reduce the feature dimension by a factor of 4. Secondly, dilated convolution layers with different dilation factors are utilized to obtain multiple spatial features in parallel. Finally, the multiple spatial features are fused using a 1×1 convolution layer with a hierarchical feature fusion strategy. The MRFF module thus has $C_{in}^2/4 + (kC_{in})^2/4 + C_{in}C_{out}$ parameters, whereas a standard convolution has $C_{in} \times k \times k \times C_{out}$ parameters. Generally, the MRFF has fewer parameters, which makes the module faster and more efficient.
Additionally, MRFF can obtain multiple receptive field features. A standard convolution layer has a $k \times k$ receptive field. In contrast, the receptive field of one dilated convolution layer is $(k-1)d+1$ along each spatial dimension, where $k$ and $d$ refer to the kernel size and the dilation factor. In our MRFF module, four different dilation factors (1, 2, 3, 4) are used, and the receptive field of the merged features is similar to that shown in Figure 8. MRFF employs multiple dilated convolutions to increase the diversity of features and removes the gridding artifact through hierarchical fusion before the multiple features are concatenated. A dropout layer, an instance normalization layer, and a ReLU non-linear activation function are added following MRFF to prevent overfitting and to improve the generalization of the network. If $C_{in} \neq C_{out}$ or stride $\neq 1$, the skip connection is replaced by a 1×1 convolution layer so that the element-wise addition can still be performed, where stride refers to the pixel shift over the input features. The stride is set to 1 or 2 in our method, and the corresponding illustration of each situation is shown in Figure 9.
ResMRFF enables the output feature size to stay the same as the input when setting stride = 1. Conversely, ResMRFF will down-sample the feature size to produce hierarchical features at a lower spatial scale when stride = 2. A pooling layer can also implement the down-sampling operation. However, the fixed function used in pooling is disadvantageous for learning hierarchical features, and a max-pooling or min-pooling operation can result in a loss of information.
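The parameter counts stated above, $C_{in} k^2 C_{out}$ for a standard convolution and $C_{in}^2/4 + (kC_{in})^2/4 + C_{in}C_{out}$ for MRFF, can be compared numerically. The sketch below is only an evaluation of those two expressions; it ignores bias terms, assumes $C_{in}$ is divisible by 4, and reads the three MRFF terms as the reduce, transform, and merge steps, which is an interpretation rather than a statement from the text. The function names are illustrative.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (biases ignored)."""
    return c_in * k * k * c_out


def mrff_params(c_in, c_out, k=3):
    """Weights of an MRFF module according to the count stated in the text:
    C_in^2 / 4 + (k * C_in)^2 / 4 + C_in * C_out (biases ignored)."""
    if c_in % 4 != 0:
        raise ValueError("this sketch assumes C_in is divisible by 4")
    reduce_term = c_in * c_in // 4         # read as the 1 x 1 reduction
    transform_term = (k * c_in) ** 2 // 4  # read as the dilated branches
    merge_term = c_in * c_out              # read as the 1 x 1 fusion
    return reduce_term + transform_term + merge_term


for channels in (64, 128):
    std = standard_conv_params(channels, channels)
    mrff = mrff_params(channels, channels)
    print(channels, std, mrff, round(mrff / std, 3))
```

For equal input and output widths with $k = 3$, the MRFF count is roughly 39% of the standard convolution count under this formula.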

Multiscale Spatial-Spectral Convolutional Network
Two networks, HyMSCN-A and HyMSCN-B, were designed based on the ResMRFF module (Figure 10). Both networks were designed using the bottom-up pathway to compute hierarchical features at several feature levels. The difference between these two networks is whether the feature pyramid is used.
In both networks, a 1×1 convolution layer, an instance normalization layer, and a non-linear activation function are combined to form the basic feature extraction module. The module is repeated three times to extract spectral features from HSI.
The two networks differ in their feature extraction and utilization abilities. The first network (HyMSCN-A) is composed of several ResMRFF blocks with stride = 1. The spatial size of all feature maps is the same since it does not contain any spatial scaling structure (see Figure 10a). The network takes hyperspectral imagery as input and learns the high-level spatial-spectral features to produce classification results. In comparison, the second network (HyMSCN-B) employs the feature pyramid structure to take full advantage of the hierarchical features with different spatial scales, as shown in Figure 10b. There are several stages in the network with different feature levels, where one stage refers to the blocks that produce features with the same spatial size. The output feature maps of one stage are used as inputs in the next stage and are also transferred to one pyramid level. The feature pyramid is therefore constructed by arranging the multiscale features in a pyramid shape. The features of different pyramid levels are fused using a top-down pathway: a coarser-resolution but semantically stronger feature is upsampled and merged with the feature from the previous pyramid level. Here, bilinear interpolation is used to upsample coarser-resolution features. Additionally, the merged features are interpolated to the finest-resolution spatial size and concatenated together. Finally, two 1×1 convolution layers, a normalization layer, and an activation function are integrated to produce the output feature maps.
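The top-down pathway can be sketched on a toy example. The following is a simplified, one-dimensional illustration in plain Python: linear interpolation stands in for bilinear interpolation, merging is assumed to be element-wise addition (as in the feature pyramid network of Lin et al.), and any convolution applied before or after merging is omitted. The function names are hypothetical.

```python
def upsample_linear(values, size):
    """Linearly interpolate a 1-D feature to `size` samples (endpoints aligned)."""
    if size == 1 or len(values) == 1:
        return [values[0]] * size
    scale = (len(values) - 1) / (size - 1)
    out = []
    for i in range(size):
        pos = i * scale
        lo = min(int(pos), len(values) - 2)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


def top_down_fuse(levels):
    """Fuse pyramid levels from coarse to fine.

    levels: list of 1-D features ordered from finest to coarsest.
    Returns the merged features in the same order.
    """
    merged = [levels[-1]]  # the coarsest level is passed through unchanged
    for finer in reversed(levels[:-1]):
        upsampled = upsample_linear(merged[0], len(finer))
        merged.insert(0, [a + b for a, b in zip(finer, upsampled)])
    return merged


pyramid = [[1.0] * 5, [2.0] * 3, [4.0] * 2]  # finest to coarsest
for level in top_down_fuse(pyramid):
    print(level)
```

Each finer level receives the accumulated information of all coarser levels. In the network described above, the merged levels would then be interpolated to the finest spatial size and concatenated.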
The output feature is sliced into a vector according to the positions of the training samples, following the proposed image-based classification framework described in Section 3.1. Cross entropy [55] is used to calculate the loss between the predictions in this vector and the corresponding true labels.
The parameters of the proposed networks are shown in Table 1. Here, the Indian Pines image is used as an example, and its spatial size is 145×145. The differences between the output sizes for HyMSCN-A and HyMSCN-B can be clearly observed. The output size for HyMSCN-A remains the same, while HyMSCN-B features a multiscale output size. Furthermore, we set two feature dimensions for each network to evaluate the effect of feature size on performance. A-64 refers to the HyMSCN-A network with a feature size of 64, and A-128 refers to the HyMSCN-A network with a larger feature size.
Note (Table 1): (1) We used the spatial size of Indian Pines (145×145) as an example. (2) CNR refers to the convolution layer, normalization layer, and ReLU activation function layer.

Experiment Setup
Three hyperspectral images were used to evaluate the performance of the proposed model including Indian Pines, Pavia University, and Salinas.
(1) The first dataset covers Indian Pines in Indiana and was gathered by the AVIRIS sensor. The image is 145×145 pixels with 220 spectral bands covering a range of 400-2500 nm. However, 20 spectral bands were discarded due to atmospheric absorption. The GSD was 20 m and the available samples contained 16 classes, as shown in Figure 11.
(2) The second dataset covers Pavia University and was gathered by the ROSIS sensor during a flight campaign in Pavia, Italy. It contained 610×340 pixels and 115 spectral bands with a range of 430-860 nm. However, 12 bands were discarded due to atmospheric absorption. The GSD was 1.3 m. The color composite and reference data are displayed in Figure 12.
(3) The third dataset was collected by the AVIRIS sensor over Salinas Valley, California. The imagery contained 512×217 pixels with 224 spectral bands. A total of 20 water absorption bands were discarded (nos. 108-112, 154-167, and 224). The GSD was 3.7 m per pixel and the ground truth contained 16 classes, as shown in Figure 13.
The Kaiming weight initialization method [42] was adopted to initialize the parameters of the proposed network. Considering the convergence speed and convergence accuracy, the network was trained using an Adam optimizer [56] with the default parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. Taking account of the training stability, the initial learning rate was set to $3 \times 10^{-4}$ after selecting the optimal parameter. The learning rate decay was 0.9 per 500 epochs and the total number of epochs was 2000. The batch size was set to one since only one input image was used in each experiment. The PyTorch deep learning framework [57] was used to train the proposed network. The computing environment consisted of the following specifications: i7-6700 CPU, 16 GB of RAM, and a GTX 1070 8 GB GPU.
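The learning rate schedule can be written out explicitly. The sketch below assumes that "a decay of 0.9 per 500 epochs" means a step-wise multiplication by 0.9 after every 500 epochs; a continuous decay would give slightly different values. The constant and function names are illustrative.

```python
INITIAL_LR = 3e-4    # initial learning rate stated in the text
DECAY = 0.9          # decay factor stated in the text
DECAY_EVERY = 500    # epochs between decays
TOTAL_EPOCHS = 2000  # total number of training epochs


def learning_rate(epoch):
    """Step-wise schedule (assumed interpretation): multiply by DECAY
    once for every completed block of DECAY_EVERY epochs."""
    return INITIAL_LR * DECAY ** (epoch // DECAY_EVERY)


for epoch in (0, 499, 500, 1000, 1500, TOTAL_EPOCHS - 1):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same step-wise behavior corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=500` and `gamma=0.9`, although the scheduler actually used is not named in the text.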

Experiments Using the Indian Pines Dataset
The Indian Pines dataset was used to compare the proposed model with other well-established models such as the support vector machine (SVM), SSRN [23], 3DCNN [58], DCCNN [59], UNet [60], and ESPNet [51]. SSRN, 3DCNN, and DCCNN were used as representatives of patch-based classification. UNet and ESPNet were used as classic semantic segmentation methods for the image-based classification framework. Among these methods, SVM represents a typical classifier that only uses the information in the spectral domain, and a grid search was used to tune the hyper-parameters of SVM.
In our first test, 30 samples per class were randomly selected, giving a total of 444 training samples. This was a small training sample size that only accounted for approximately 4.3% of the total labeled samples. If a class contained fewer than 30 samples, only half of its samples were randomly selected for training. The overall accuracy (OA), average accuracy (AA), and kappa coefficient (k) were used to assess the performance of different methods. For SSRN, 3DCNN, and DCCNN, the patch size and optimizer were set as in the original papers. Table 2 reports the individual classification results of different methods. Figure 14 shows the corresponding classification maps. A number of observations can be made based on these data. Firstly, SVM was observed to have the worst performance. This indicates that classifiers that only use spectral information cannot achieve high classification accuracy when using a small number of training samples. Secondly, image-based classifiers (e.g., UNet, ESPNet, HyMSCN) provided higher classification accuracy compared to SSRN, 3DCNN, and DCCNN. This demonstrates that image-based classification is a powerful framework for HSI. Thirdly, HyMSCN and ESPNet achieved superior results compared to UNet, suggesting that multiple receptive field features can improve classification accuracy (ESPNet also contains multiple dilated convolutions). Lastly, HyMSCN-B-128 provided the best results, thereby demonstrating that a well-designed network combining multiscale and multi-level features is suitable for HSIC. It should be noted that the classification map for HyMSCN-B-128 preserved clear boundaries for ground objects, as shown in Figure 14j. Although the proposed method does not achieve the best accuracy in every class, its overall classification accuracy is the highest. Compared with other methods, the classification results of the proposed method are more balanced.
In our second test using the Indian Pines dataset, we validated the performance of our proposed method using different numbers of training samples. The sensitivity of the model to the number of samples was evaluated by randomly selecting 10, 20, 30, 40, and 50 samples per class. Table 3 reports the overall accuracy for these classification methods using different training samples. The results lead to conclusions similar to those of the first experiment. Firstly, although the classification accuracy of SVM can be improved by selecting optimal training samples to enhance the generalization performance of the classifier [61][62][63], in our work SVM was observed to have the worst classification accuracy with random, non-optimized training samples. Secondly, image-based classifiers provided superior performance compared to the patch-based methods. Lastly, HyMSCN-B-128 achieved the best accuracy in comparison to the other methods in each group. These results demonstrate that the proposed method is both reliable and robust.
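The three evaluation measures can be computed from a confusion matrix. The following is a minimal sketch using their standard definitions; it assumes rows index the true classes and columns the predicted classes, and that every class has at least one test sample. The function name and the toy matrix are illustrative and are not results from this paper.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) from a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    row_sums = [sum(confusion[i]) for i in range(n)]
    col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]

    oa = correct / total                                   # overall accuracy
    aa = sum(confusion[i][i] / row_sums[i] for i in range(n)) / n  # mean per-class accuracy
    expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (oa - expected) / (1 - expected)               # chance-corrected agreement
    return oa, aa, kappa


toy_confusion = [[8, 2],
                 [1, 9]]
print(classification_metrics(toy_confusion))
```

OA weights every sample equally, whereas AA weights every class equally, which is why the two can diverge on datasets with very unbalanced classes.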

Experiments Using the Pavia University Dataset
In this section, the Pavia University dataset was used to compare the performance of different classification methods. Firstly, 30 samples per class were randomly selected as training data, giving a total of 270 samples (approximately 0.63% of the labeled pixels). Table 4 displays the overall accuracy, average accuracy, k statistic, and individual class accuracy. From Table 4, we can draw conclusions similar to those of the Indian Pines experiment. The HyMSCN-B-128 network achieved the highest individual classification accuracy in most of the classes. Moreover, SSRN, 3DCNN, DCCNN, UNet, ESPNet, and HyMSCN provided more homogeneous classification maps compared to SVM, whose classification map contained a large amount of salt-and-pepper noise. This demonstrates that a spatial-spectral classifier can greatly improve classification performance by taking advantage of the spatial neighborhood information.
The performance of the classification methods was further assessed using different numbers of training samples from the Pavia University data. Training sets were generated by randomly selecting 10, 20, 30, 40, and 50 samples per class. Table 5 displays the overall accuracy for each group. The proposed network was observed to achieve a high classification accuracy with a limited number of training samples. For instance, HyMSCN-B-128 yielded a 95.80% overall accuracy when using only 20 training samples per class. A comparison of the proposed network with other methods reveals that multiscale spatial features can capture spatial neighbor structure information at multiple feature levels and greatly improve classification accuracy.

Experiments Using the Salina Dataset
In this experiment, we first randomly selected 30 samples per class, giving a total of 480 samples (approximately 0.88% of the labeled pixels), to form the training set. The classification maps and results obtained from the different methods are shown in Figure 16 and Table 6. The HyMSCN-B-128 network provided the best performance. The proposed HyMSCN-B enhanced the utilization of features from low to high levels in the network and improved the representation of multiscale features. Moreover, SSRN provided the worst performance, which indicates the instability and sensitivity of the method. HyMSCN-A also obtained higher classification accuracies than UNet and ESPNet by integrating local neighbor spatial information. HyMSCN-A and HyMSCN-B produced more homogeneous classification maps with clearer edges than the other methods. A second test on the Salina data evaluated the performance of the classification methods using a different number of training samples, with 10 to 50 samples per class randomly selected. Table 7 shows the overall accuracy produced by the various classifiers. As expected, the classification accuracy increased with the number of training samples. The patch-based method provided unstable performance for the Salina dataset. This can be explained by the patch-based classification not being able to acquire features from a larger receptive field, especially in the case of limited training samples. In contrast, the proposed HyMSCN-A-128 and HyMSCN-B-128 networks achieved robust performance and high accuracy along with local spatial consistency and multiscale feature representation.
The above results suggest that the proposed HyMSCN method achieves superior individual and overall classification performance and provides more homogeneous classification maps with clear object edges. The classification results for the different numbers of training samples also demonstrate the robustness of the proposed method.
The training and testing times were investigated to compare the efficiency of patch-based and image-based classification. SSRN [23] and HyMSCN-B-64 were used as examples of patch-based and image-based classification, respectively. For a fair comparison, the batch size for SSRN was set to the maximum allowed by the experimental conditions. In contrast, the batch size for HyMSCN-B-64 was set to 1, since image-based classification processes an entire image at one time. For these tests, 50 samples per class were randomly selected as training samples. If a class contained fewer than 50 samples, only half of the samples in that class were used for training. Furthermore, we also investigated the effect of different patch sizes on training and testing times.
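The per-class sampling rule described above can be sketched as follows. This is an illustrative reading of the rule, not the authors' code: the function name, the use of label 0 for unlabeled pixels, and the fixed seed are assumptions made for the example.

```python
import random


def select_training_indices(labels, n_per_class, seed=0):
    """Randomly pick training pixels per class from a flat list of labels.

    Assumption: label 0 marks unlabeled pixels and is skipped.
    A class with fewer than n_per_class pixels contributes half of its pixels.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        if label == 0:
            continue
        by_class.setdefault(label, []).append(index)

    selected = {}
    for label, indices in by_class.items():
        if len(indices) >= n_per_class:
            count = n_per_class
        else:
            count = len(indices) // 2  # small class: use half of its pixels
        selected[label] = sorted(rng.sample(indices, count))
    return selected


# Toy label list: class 1 is large, class 2 has only 20 pixels.
toy_labels = [1] * 100 + [2] * 20 + [0] * 5
picked = select_training_indices(toy_labels, n_per_class=50)
print({label: len(indices) for label, indices in picked.items()})
```

The remaining labeled pixels of each class would then form the test set.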
Table 8 lists the training and testing times for these two networks. The training times of HyMSCN were 6 to 10 times shorter than those of SSRN. The number of training samples for Pavia University was 450, so one iteration contained all training samples when the batch size for SSRN was set to 450. Even so, SSRN training was still slower than HyMSCN training. Similarly, a very large batch size was set for SSRN in the testing phase, and SSRN still took 160 to 800 times longer than HyMSCN. The patch-based classification method can be expected to consume even more time when processing a larger image. This is because adjacent patches overlap heavily, so a large amount of computation is wasted on repeated calculations. In contrast, image-based classification involves no such redundancy, and its testing phase is faster than that of patch-based classification.
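The redundancy argument can be made concrete by counting how many input pixels each approach reads at test time. The sketch below is only a back-of-the-envelope estimate, not a timing model: it uses the 610 × 340 spatial size of the Pavia University scene and a hypothetical 7 × 7 patch size chosen for illustration.

```python
def pixels_read(height, width, patch_size=None):
    """Count the input pixels read when classifying every pixel of an image.

    patch_size=None models image-based classification (one pass over the image).
    Otherwise one patch_size x patch_size patch is read per output pixel.
    """
    if patch_size is None:
        return height * width
    return height * width * patch_size * patch_size


height, width = 610, 340   # spatial size of the Pavia University scene
patch = 7                  # hypothetical patch size, for illustration only

image_based = pixels_read(height, width)
patch_based = pixels_read(height, width, patch)
print(image_based, patch_based, patch_based // image_based)
```

The ratio equals the patch area, so the amount of repeated input grows quadratically with the patch size. The measured speed-ups in Table 8 also depend on network depth, batch handling, and hardware, which this count ignores.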

Evaluating the Multiple Receptive Field Fusion Block
The effectiveness of the multiple receptive field fusion block (MRFF) was evaluated using HyMSCN-A-64 as the network backbone. For the comparison network, the dilation factors of all MRFF blocks were set to 1; this modified network is called HyMSCN-N. A total of 50 samples per class were randomly selected as training samples for each dataset, and the number of epochs was set to 1000.
Figure 17 displays the overall accuracy over the 1000 training epochs. On the three datasets, HyMSCN-A-64 outperformed HyMSCN-N by approximately 3%, 2.7%, and 1.1% for the Indian Pines, Pavia University, and Salina data, respectively. These results suggest that MRFF with multiple dilation factors can improve the classification performance on all three datasets. As the number of dilation factors increases, the classification accuracy also improves. Compared with a small number of dilation factors, a larger number of dilation factors extracts more diverse features. However, the classification accuracy begins to converge when the number of dilation factors reaches 5. This means that the representation capacity of the model reaches its maximum under the condition of limited training samples. Although a larger number of dilation factors yields a higher accuracy, we set the dilation factors to (1, 2, 3, 4) as a trade-off between the number of parameters and the efficiency of the model.
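To illustrate why several dilation factors yield more diverse features, the spatial extent covered by a single dilated convolution can be computed directly. The sketch below uses the standard relation k_eff = k + (k - 1)(d - 1) for a k × k kernel with dilation factor d; the function name is illustrative.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# 3 x 3 kernels with the dilation factors (1, 2, 3, 4) chosen for MRFF.
extents = {d: effective_kernel_size(3, d) for d in (1, 2, 3, 4)}
for dilation, extent in extents.items():
    print(f"dilation {dilation}: covers {extent} x {extent} pixels")
```

Each branch still has only 3 × 3 weights per input-output channel pair, so the covered area grows without adding parameters per branch. Setting every dilation factor to 1, as in HyMSCN-N, gives all branches the same 3 × 3 extent.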

Evaluating the Feature Pyramid
To validate the effectiveness of the feature pyramid structure, we compared the performance of the two proposed networks, HyMSCN-A and HyMSCN-B. As illustrated in Section 3.3, HyMSCN-A used MRFF blocks with the same feature size to construct the network and did not contain any multiscale features or feature pyramid structure. Conversely, HyMSCN-B included both multiple receptive field features and multiscale features. Furthermore, the effect of the feature dimension was also investigated using HyMSCN-A-64, HyMSCN-A-128, HyMSCN-B-64, and HyMSCN-B-128. A total of 50 samples per class were used as training data. The training process for each network was repeated five times using the same training samples. Figure 19 displays the average overall accuracy.
HyMSCN-B-64 and HyMSCN-B-128 achieved a higher classification accuracy than HyMSCN-A-64 and HyMSCN-A-128. Furthermore, it is worth noting that increasing the number of features did not significantly improve the classification accuracy. In the case of small sample sizes, the major factor that determined classification accuracy was the diversity and validity of the features rather than the number of features.

Conclusions
Hyperspectral images are characterized by an abundance of spectral features and spatial structure information. Convolutional neural networks have been shown to have a strong ability to extract spatial-spectral features for classification and feature representation. In this context, we proposed an image-based classification framework for hyperspectral images to overcome the inefficiency of patch-based classification. The results revealed that the image-based classification framework was up to 800 times faster than patch-based classification in the testing phase, and the advantage is expected to grow for larger hyperspectral images. Different regional scales are known to contain complementary but interconnected information for classification. Accordingly, the HyMSCN network was designed to integrate multiple local neighbor information and multiscale spatial features. Experiments performed on three hyperspectral images suggest that the proposed HyMSCN network can achieve a high classification accuracy and robust performance.

Figure 1. An illustration of the receptive field for one dilated convolution with different dilation factors. A 3×3 convolution kernel is used in the example.

Figure 2. The gridding problem resulting from dilated convolution. (a) True-color image of Pavia University. (b) Classification results with dilated convolution. (c) Enlarged classification results for (b).

Figure 3. The image and feature pyramid. (a) The image pyramid independently computes features for each image scale. (b) The feature pyramid creates features with strong semantics at all scales.

3.1. Patch-Based Classification and Image-Based Classification for HSI
Most deep learning-based hyperspectral image classification methods generate image patches as training data to extract spatial-spectral information. This is known as patch-based classification for hyperspectral imagery. The patch-based classification framework is displayed in Figure 4. Each training pixel and its neighboring pixels are selected to form a training patch. These patches are fed into the model to predict the label of the center pixel of each patch. During testing, this method generates patches pixel by pixel. The patches are then fed into the model one by one, and the predicted labels are finally reshaped to the spatial size of the input image.
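The patch generation step described above can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name, the reflect padding at the image border, and the toy cube are assumptions.

```python
import numpy as np


def extract_patches(cube, positions, patch_size):
    """Cut one square patch per pixel position from a hyperspectral cube.

    cube: array of shape (height, width, bands).
    positions: list of (row, col) centre pixels.
    patch_size: odd side length of each patch.
    Assumption: borders are handled by reflect padding.
    """
    half = patch_size // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, original pixel (r, c) sits at (r + half, c + half),
    # so the slice starting at (r, c) is centred on that pixel.
    patches = [padded[r:r + patch_size, c:c + patch_size, :]
               for r, c in positions]
    return np.stack(patches)


# Toy cube: 5 x 6 pixels with 3 bands (made-up values).
cube = np.arange(5 * 6 * 3, dtype=np.float32).reshape(5, 6, 3)
train_positions = [(0, 0), (2, 3)]
patches = extract_patches(cube, train_positions, patch_size=3)
print(patches.shape)  # (number of patches, 3, 3, bands)
```

At test time the same function would be called with every pixel position of the image, which is where the overlap between adjacent patches comes from.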

Figure 4. An illustration of patch-based classification for hyperspectral image (HSI).

Figure 5. An illustration of image-based classification for HSI. The model is similar to the segmentation models used in computer vision, which take an input of arbitrary size and produce an output of the corresponding size. During training, only the predicted labels at the training pixel positions are selected from the output.
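The selection of predictions at the training pixel positions can be sketched as a masked loss. The example below is an assumption-laden illustration: it assumes a softmax cross-entropy loss, uses label 0 for pixels that are not training samples, and the function name is hypothetical.

```python
import numpy as np


def masked_cross_entropy(logits, labels, ignore_label=0):
    """Cross-entropy averaged over the labeled training pixels only.

    logits: (height, width, n_classes) scores for the whole image.
    labels: (height, width) integers; ignore_label marks non-training pixels,
            and classes 1..n_classes map to channels 0..n_classes-1.
    """
    mask = labels != ignore_label
    picked = logits[mask]            # (n_train, n_classes)
    targets = labels[mask] - 1       # (n_train,)
    # Numerically stable log-softmax over the class axis.
    shifted = picked - picked.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(targets.size), targets].mean()


# Toy example: a 4 x 4 image, 3 classes, only two training pixels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 3))
labels = np.zeros((4, 4), dtype=int)
labels[0, 1] = 2
labels[3, 2] = 1
print(masked_cross_entropy(logits, labels))
```

Because the mask removes every other position, the outputs at unlabeled pixels do not contribute to the loss, even though the network processes the full image in one forward pass.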

Figure 6. An illustration of the residual block.

Figure 7. A schematic of the residual multiple receptive field fusion block (ResMRFF). The basic strategy of the multiple receptive field fusion block is represented as Reduce-Transform-Merge.

Figure 8. An illustration of the receptive field for the merged features in one MRFF block.

Figure 9. An illustration of two ResMRFF blocks with different strides, where s refers to the convolution stride and d refers to the dilation factor. (a) The ResMRFF block designed with stride = 1. (b) The ResMRFF block designed with stride = 2.

Figure 10. An illustration of the two proposed networks: (a) HyMSCN-A and (b) HyMSCN-B.

Figure 11. Indian Pines imagery with (a) color composite, (b) reference data, and (c) class names.
Figure 15 displays the corresponding classification maps and the training and testing samples.

Figure 18. Test accuracy for different numbers of dilation factors in the MRFF block. (a) Indian Pines data, (b) Pavia University data, (c) Salina data.

Table 1. The parameter settings of the proposed networks. The same colors refer to the same pyramid levels.

Table 2. Overall, average, kappa statistic, and individual class accuracies for the Indian Pines data with 30 training samples per class. The highest accuracies are highlighted in bold.

Table 3. The overall accuracies obtained from various classification methods for the Indian Pines data using a different number of training samples. The best performances for each group are highlighted in bold.

Table 4. Overall, average, kappa statistic, and individual class accuracies for the Pavia University data with 30 training samples per class. The highest accuracies are highlighted in bold.

Table 5. The overall accuracies produced by various classification methods for the Pavia University data using a different number of training samples. The best performances for each group are highlighted in bold.

Table 6. Overall, average, kappa statistic, and individual class accuracies for the Salina data with 30 training samples per class. The best results are highlighted in bold.

Table 7. The overall accuracies produced by different classification methods for the Salina data using a different number of training samples. The best results are highlighted in bold.

Table 8. The training and testing times for patch-based and image-based classification. Each control group is highlighted with the same color.