1. Introduction
A hyperspectral image (HSI) has hundreds of contiguous spectral bands and high spatial correlation, so it contains abundant spectral and spatial information that is useful for classifying different materials. HSI has been applied in many fields, including environment management [1], geological mapping [2], mineral exploitation [3], and scene recognition [4]. Although HSI contains rich spectral information, it is difficult to obtain enough training samples in practice, which often leads to the “curse of dimensionality”. In addition, neighboring bands of HSI are highly correlated, which means that only a few bands play a critical role; the redundant bands increase the computational complexity and degrade the subsequent classification. Therefore, dimensionality reduction (DR) is a necessary preprocessing step for HSI classification, since it reduces computational complexity while retaining the information useful for classification [5,6,7,8]. Feature selection and feature extraction are the traditional approaches to DR [9]. Feature selection aims to find the most discriminative bands in the raw HSI data to represent the entire image, and it preserves the physical meaning of the original data [10,11,12,13,14,15]. Some clustering-based methods [16,17] and ranking-based methods [18,19] find representative bands with which to separate distinct classes. In contrast, feature extraction [20,21,22,23,24,25,26] finds more useful features through a mathematical transformation to improve classification accuracy. Examples of such methods include multilinear principal component analysis (PCA) [27] and Fisher’s linear discriminant analysis [28]; however, these methods can only extract low-level features, which have limited capacity to represent the abundant spectral and spatial information.
Recently, to address this issue, many deep learning models have been proposed that learn more discriminative features with the goal of high classification accuracy [29,30]. Among typical deep learning models, stacked autoencoders (SAEs) can extract spatial and spectral information and then combine these features for HSI classification [10]. Deep belief networks (DBNs) [31] and restricted Boltzmann machines [32] have also been used to combine spatial and spectral information for classification. These methods are designed for 1-D input, so the input data loses the spatial structure information that is important for HSI classification. A deep convolutional neural network (CNN) [33] has been adopted to extract spatial features, and it places no such restriction on the input. A 3-D CNN can extract spectral-spatial features directly from the original image and achieves better classification accuracy. Reference [34] proposes an end-to-end framework that learns spectral and spatial features and exploits the correlation between the spectral and spatial domains. In this framework, however, the spectral input is 1-D and therefore lacks the neighborhood information of the spatial dimension. Moreover, the classification accuracy of these deep learning models decreases as the network becomes deeper. Reference [35] proposes a supervised spectral–spatial residual network in which identity mapping in the residual blocks mitigates this decreasing-accuracy phenomenon. However, this network first learns spectral features and then uses them as the input for extracting spatial information, so the spatial features are derived from transformed data and miss the original spatial correlation. Reference [36] applies a CNN to extract multiple spatial features and then stacks them with the spectrum to generate the spectral-spatial feature; this method would perform better if the spectral features were also extracted at multiple scales. Song et al. propose a deep fusion feature network [37] for classification. In this network, features from the low, middle and high layers are extracted by a residual network, and the features of the different layers are fused in the fully connected layer to classify the image. Although the network considers the influence of features from different layers on the classification, it extracts the features directly from the original image and does not consider spectral-spatial fusion features. Moreover, fusing features at the fully connected layer does not allow the entire network to fully use the fused features to learn more discriminative features. Most existing deep learning models consider the spectral-spatial fusion feature under a single-scale input and ignore the abundant correlation between the spectral and spatial domains in multiscale inputs. Even the models that do consider multiscale inputs cannot guarantee that the feature at each scale is optimal. Furthermore, these models cannot make full use of the strong complementary and related information among the multiscale fusion features, because the features are fused in the fully connected layer and used directly to classify the image.
To solve these problems and extract more discriminative fusion features, we propose a multiscale deep middle-level feature fusion network (MMFN) for hyperspectral image classification. Training consists of two stages. In the first stage, each input scale is used to train a separate model and the optimal model is saved; the middle-level feature is then extracted from the model of the corresponding scale, which ensures that the multiscale middle-level features are optimal. In the second stage, the multiscale middle-level features are fused in a convolutional layer, and the subsequent residual learning blocks can fully use the strong complementary and related information among the multiscale fusion features to extract more discriminative, higher-level features for classification. Furthermore, residual learning [38] helps the network maintain high accuracy as it becomes deeper and makes the network more robust.
The three major contributions of this paper include:
(1) Multiscale feature fusion is proposed; the fused features contain more abundant neighborhood correlation and low-level features, such as spatial structure and texture features, which are beneficial for classification.
(2) The training of the network consists of two stages. The first stage obtains the optimal model for each scale and extracts the middle-level features from the model of the corresponding scale. This ensures that the multiscale middle-level features are optimal, which helps the subsequent training stage extract more discriminative features. The second stage fuses the optimal multiscale middle-level features in a convolutional layer to train a new model for the final classification.
(3) Features at different scales carry strong complementary and related information. In contrast to fusing features directly in the fully connected layer to classify the image, the multiscale deep middle-level features are fused in a convolutional layer, which enables the network to make full use of this information. Moreover, the subsequent residual learning modules learn from the multiscale fusion features to extract more discriminative, higher-level features for classification and help the network maintain high accuracy with deeper layers.
The rest of this paper is organized as follows. Section 2 introduces the detailed architecture of our method. Section 3 presents the classification results on the four data sets and compares the performance of all methods. Finally, the conclusion is provided in Section 4.
  2. Methodology
A deep network can be regarded as a feature-learning process that builds a step-by-step abstract representation of the original input through its hidden layers. It learns the structure of the original input data and finds more useful features. Through feature combination, it transforms the original input into low-level features, middle-level features and high-level features, up to the final task objective. By learning hierarchical features, deep learning progresses from texture information in the low-level features, to local information in the middle-level features, to object information in the high-level features. In this process, it is easy to find the connection between the original input and the low-level features, and between the middle-level features and the high-level features, but it is difficult to cross directly from the original input to the high-level features. The MMFN framework consists of two training stages. The first stage obtains the optimal model for each scale and extracts the features of the last residual block of the optimal model at that scale. The second stage fuses the multiscale features from the first stage in a convolutional layer to train a new model for the final classification. Because the multiscale features are extracted from the residual block, they are neither low-level features close to the original image nor high-level features close to the fully connected layer; they are therefore defined as middle-level features in the MMFN. The multiscale fusion features are used as the input to a new model in the second stage to learn more discriminative features for classification.
  2.1. Extracting Multiscale Deep Middle-Level Features
HSI data can be denoted as 
, 
 is 
 the band image, 
 denote that the Hyperspectral Image has 
 pixels, and 
 bands, respectively. The main purpose of the first training stage of MMFN is to extract optimal multiscale deep middle-level features, and the 3-D data cube at each scale is used to train the corresponding model. The model contains a spectral learning module and a spatial learning module with convolution filters of different sizes. Let 
 be the input of a convolutional layer and 
 is the 
th feature map of 
. Supposing that the convolutional layer has 
k filters denoted as 
 and the bias parameter is 
b, the 
 output of the convolutional layer can be represented:
The features from the spectral and spatial modules are fused into spectral-spatial fusion features; the fusion operation is defined as:
 represents the ReLU (rectified linear unit) activation function, which sets negative elements to zero. 
 and 
 represent the outputs of the spectral and spatial learning modules, respectively. The subsequent residual learning module uses the spectral-spatial fusion feature to learn more discriminative features; the structure of the residual learning block is shown in Figure 1.
In 
Figure 1, 
 represents the input of the first residual block, 
 is the function learned through two convolutional layers, and it is defined as:
 is the convolutional kernel, 
F is residual function and can be written as:
 are the parameters of the first and second convolution kernels, respectively, and 
 are the biases of the first and second layers after the input layer, respectively. In residual learning, we apply batch normalization (BN) after every convolutional operation to regularize the learning process; BN is formulated as:
		and 
 represents the output of 
n th layer after BN operation, 
 mean the convolutional kernels and bias, respectively on the 
n th layer. And the 
 is defined as:
		which 
 is the output of (
n−1)th layer after the BN operation. After the residual block, average pooling is applied to its output; the operation is formulated as:
Suppose S is the filter size and 
C is the number of elements of S, 
 is the value of the corresponding position 
 in the input data 
, and in this paper, we use global average pooling. After the average pooling, the feature is sent to the softmax layer for HSI classification. The predicted value of the framework is a vector 
, and the truth label vector 
, 
c is the number of land-cover categories. The parameters of the framework are updated by back-propagating the gradients of the cross-entropy objective function, which is defined as:
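As an illustration of the operations described in this subsection, the following NumPy sketch chains batch normalization, a two-layer residual block with identity mapping, global average pooling, softmax and the cross-entropy loss. It is a minimal sketch using standard textbook forms, not the authors' implementation: the function names, the random weights, the batch size of 4, the ordering of BN and ReLU, and the use of 1 × 1 convolutions (in place of 3 × 3 kernels, to keep the code short) are all assumptions made for this example.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize every channel over the batch and spatial axes (no learned scale/shift).
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    # Set negative elements to zero.
    return np.maximum(x, 0.0)

def conv1x1(x, w, b):
    # x: (N, H, W, C_in), w: (C_in, C_out), b: (C_out,). Assumed stand-in for 3 x 3 kernels.
    return x @ w + b

def residual_block(x, w1, b1, w2, b2):
    # F(x) is learned by two convolutional layers; the block outputs F(x) + x.
    h = relu(batch_norm(conv1x1(x, w1, b1)))
    f = batch_norm(conv1x1(h, w2, b2))
    return relu(x + f)

def global_average_pool(x):
    # Average each channel over all spatial positions: (N, H, W, C) -> (N, C).
    return x.mean(axis=(1, 2))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Mean over the batch of -sum_i y_i * log(y_hat_i).
    return float(-np.mean(np.sum(y_true * np.log(y_pred + eps), axis=-1)))

rng = np.random.default_rng(0)
num_classes, channels = 16, 24                  # 16 classes as in IN, 24 feature maps
x = rng.normal(size=(4, 5, 5, channels))        # hypothetical batch of 4 feature cubes
w1 = rng.normal(size=(channels, channels)) * 0.1
w2 = rng.normal(size=(channels, channels)) * 0.1
b1, b2 = np.zeros(channels), np.zeros(channels)
w_out = rng.normal(size=(channels, num_classes)) * 0.1
b_out = np.zeros(num_classes)

features = residual_block(x, w1, b1, w2, b2)    # (4, 5, 5, 24)
pooled = global_average_pool(features)          # (4, 24)
probs = softmax(pooled @ w_out + b_out)         # (4, 16)
labels = np.eye(num_classes)[[0, 3, 7, 15]]     # one-hot ground-truth vectors
loss = cross_entropy(labels, probs)
```

Because the block adds its input back to the learned function, the spatial size and channel count must match between input and output, which is why the sketch keeps 24 channels throughout.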
For each input scale, the above operations are carried out to train a model for that scale. To obtain an optimal model, we check whether the classification accuracy on the validation set has improved over the training epochs and use this to decide whether the model is optimal. In this way, every input scale corresponds to an optimal model, and the feature from the last residual block of each model is extracted as the deep middle-level feature. Because these multiscale features are computed with the weights of the optimal trained models, they are the best available for each scale, which benefits the final classification.
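The model selection described above can be sketched as a loop that keeps the weights with the best validation accuracy. This is a hedged sketch rather than the authors' code: `train_one_epoch` and `validate` are hypothetical callables standing in for the real training and validation routines, and the toy accuracy curve is made up purely to exercise the loop.

```python
import copy

def train_with_best_checkpoint(model_state, train_one_epoch, validate, epochs=100):
    """Train for a fixed number of epochs and return the state with the best validation accuracy."""
    best_acc = -1.0
    best_state = copy.deepcopy(model_state)
    for _ in range(epochs):
        model_state = train_one_epoch(model_state)
        acc = validate(model_state)
        if acc > best_acc:  # save only when the validation accuracy improves
            best_acc = acc
            best_state = copy.deepcopy(model_state)
    return best_state, best_acc

# Toy usage: the "model" is a dict holding an epoch counter (hypothetical stand-in).
val_curve = [0.60, 0.72, 0.81, 0.79, 0.80]

def toy_train(state):
    return {"epoch": state["epoch"] + 1}

def toy_validate(state):
    return val_curve[state["epoch"] - 1]

best_state, best_acc = train_with_best_checkpoint({"epoch": 0}, toy_train, toy_validate, epochs=5)
```

In the two-stage scheme, one such loop would be run per input scale, and the middle-level features would then be extracted from each saved state.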
  2.2. Fusing Multiscale Deep Middle-Level Features
In the first training stage of MMFN, we obtain the optimal multiscale middle-level features, which have different spatial sizes because the inputs have different scales. Before these features are fused, they must be brought to the same spatial size. For example, suppose there are three features of different sizes, 
, whose spatial sizes are 5 × 5, 7 × 7 and 9 × 9, respectively. The 7 × 7 and 9 × 9 features can be convolved with 3 × 3 and 5 × 5 filters, respectively, to bring them to the same 5 × 5 size, after which the three features are fused. The fusion operation is formulated as: 
X represents the tensor after fusing the multiscale middle-level features in the convolution layer,  is the different convolution operations to guarantee the features with the same spatial size.  is concatenating the outputs from the multiscale features,  denote the convolutional kernels and bias in the convolution layer respectively. After getting the multiscale middle-level fusion feature, the residual block is used to learn higher-level and discriminative features, which are sent to the softmax layer for the final classification.
  2.3. Classifying HSI Based on the MMFN
We take the IN data set as an example to describe the architecture of our method in Figure 2 and Figure 3. Figure 2 shows the first training stage of the MMFN. Input data of size 7 × 7 × 200 are sent to the spatial learning module and the spectral learning module, which use 3 × 3, 128 and 1 × 1, 128 filters (kernel size, number of filters), respectively, and features of size 7 × 7 × 128 are obtained. These features are then concatenated into the spectral-spatial fusion features, which pass through a further convolution and BN operation, and the feature size remains 7 × 7 × 128. The residual learning module contains two residual blocks, and every block uses 3 × 3 × 128, 24 filters to extract features from the spectral-spatial fusion feature tensor; a feature of size 5 × 5 × 24 is generated after residual learning. BN is applied after every convolutional layer, which regularizes the learning process and improves the classification performance. The feature tensor produced by the residual blocks is fed to the average pooling layer, which yields a 1 × 1 × 24 vector that is sent to the softmax layer for the final classification. After several epochs of training, we obtain the optimal model. The other input scales, 9 × 9 × 200 and 11 × 11 × 200, are processed in the same way to obtain the models of the corresponding scales. From the optimal models corresponding to the multiscale inputs, features of sizes 5 × 5 × 24, 7 × 7 × 24 and 9 × 9 × 24, respectively, are extracted from the last residual block as the deep middle-level features.
Figure 3 shows the second training stage of the whole network, which fuses the multiscale middle-level features and sends them to the residual learning block to learn higher-level, discriminative features for the final classification. Because each input scale corresponds to a separately trained model, we save the parameters of the model that achieves the best classification accuracy on the validation set; in this way, the middle-level feature extracted from each scale model is computed under the optimal parameters. The multiscale middle-level features are therefore optimal, which improves the classification performance. The middle-level features have sizes of 5 × 5 × 24, 7 × 7 × 24 and 9 × 9 × 24. We use 3 × 3, 24 and 5 × 5, 24 kernels to convolve the 7 × 7 × 24 and 9 × 9 × 24 feature tensors, respectively, and reduce them to 5 × 5 × 24. These three scales of middle-level features are then concatenated into a tensor of size 5 × 5 × 72. A convolution with 3 × 3 filters transforms the fused 5 × 5 × 72 feature into a 5 × 5 × 24 tensor, which is the input to the residual learning module. This module contains two residual blocks, each consisting of 3-D convolution operations of size 3 × 3 × 128, 24. Finally, the resulting higher-level feature of size 5 × 5 × 24 is fed to the average pooling layer, which generates a 1 × 1 × 24 tensor for the final classification.
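The fusion step in Figure 3 can be illustrated with a small NumPy sketch. It assumes unpadded (valid) convolutions for the 3 × 3 and 5 × 5 size-reducing kernels, since 7 − 3 + 1 = 5 and 9 − 5 + 1 = 5, and zero padding for the 3 × 3 convolution that maps 5 × 5 × 72 to 5 × 5 × 24, since the spatial size is unchanged there. The helper names and random weights are hypothetical; this is not the authors' implementation.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """Unpadded 2-D convolution. x: (H, W, C_in), w: (k, k, C_in, C_out), b: (C_out,)."""
    k = w.shape[0]
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((h_out, w_out, w.shape[3]))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i:i + k, j:j + k, :]
            # Contract the patch with the kernel over height, width and input channels.
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

rng = np.random.default_rng(0)

def random_conv(k, c_in, c_out):
    # Hypothetical random kernel and zero bias, standing in for trained parameters.
    return rng.normal(size=(k, k, c_in, c_out)) * 0.05, np.zeros(c_out)

# Middle-level features from the three stage-one models (random stand-ins).
f5 = rng.normal(size=(5, 5, 24))
f7 = rng.normal(size=(7, 7, 24))
f9 = rng.normal(size=(9, 9, 24))

g7 = conv2d_valid(f7, *random_conv(3, 24, 24))      # 7 x 7 -> 5 x 5
g9 = conv2d_valid(f9, *random_conv(5, 24, 24))      # 9 x 9 -> 5 x 5
stacked = np.concatenate([f5, g7, g9], axis=-1)     # 5 x 5 x 72

# Zero-pad by one pixel so the 3 x 3 fusion convolution keeps the 5 x 5 size (assumption).
padded = np.pad(stacked, ((1, 1), (1, 1), (0, 0)))
fused = conv2d_valid(padded, *random_conv(3, 72, 24))  # 5 x 5 x 24
```

Fusing at this point, before the residual blocks, is what lets the later convolutional layers operate on the combined multiscale tensor instead of seeing it only at the classifier.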
   3. Experimental Results
  3.1. Data Description and Experimental Settings
In this section, the effectiveness of our method is evaluated on four real-world hyperspectral remote sensing data sets: the Indian Pines (IN), Pavia University (UP), Kennedy Space Center (KSC) and Salinas Valley (Salinas) data sets. The proposed method is compared with other state-of-the-art methods. The overall accuracy (OA) and the average accuracy (AA) are the metrics used to assess the classification performance of all methods.
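For reference, OA and AA are commonly computed from a confusion matrix: OA is the fraction of all test samples classified correctly, and AA is the mean of the per-class accuracies. The sketch below uses these standard definitions; the function names and the toy labels are illustrative assumptions, and it assumes every class appears at least once in the ground truth.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # cm[t, p] counts samples of true class t predicted as class p.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def overall_accuracy(cm):
    # Correctly classified samples divided by all samples.
    return np.trace(cm) / cm.sum()

def average_accuracy(cm):
    # Mean of per-class accuracies (diagonal divided by row sums).
    per_class = np.diag(cm) / cm.sum(axis=1)
    return per_class.mean()

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 0])
cm = confusion_matrix(y_true, y_pred, num_classes=3)
oa = overall_accuracy(cm)   # 7 of 10 samples correct
aa = average_accuracy(cm)   # mean of 3/4, 2/2 and 2/4
```

AA weights every class equally, so it is more sensitive than OA to errors on classes with few samples.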
The Indian Pines (IN) data set was collected by AVIRIS in 1992 in northwestern Indiana. This commonly used data set has 16 vegetation classes and 224 bands. The spatial size is 145 × 145 pixels and the spatial resolution is 20 m per pixel. To avoid the negative influence of water absorption and noise on classification, some bands are discarded and the remaining 200 bands are used for analysis.
The Pavia University (UP) data set was captured by the Reflective Optics System Imaging Spectrometer (ROSIS) optical sensor over an urban area surrounding the University of Pavia. The image is of size 610 × 340 × 115 with a resolution of 1.3 m per pixel, and 9 urban land-cover classes are considered in this experiment. After the useless bands are discarded, 103 bands remain.
The KSC data set was collected by AVIRIS in 1996 in Florida. It contains 512 × 614 pixels with a spatial resolution of 18 m per pixel and has 13 ground-truth classes. After the noisy bands are removed, 176 bands are retained for the experiment.
The Salinas data set was gathered by AVIRIS and consists of 224 bands with a spatial size of 512 × 217 pixels. The spatial resolution is 3.7 m per pixel and there are 16 ground-truth classes. Twenty noisy bands are removed, leaving 204 bands for the experiment.
The details of the data sets are given in Table 1, Table 2, Table 3 and Table 4. The corresponding false-color images and ground-truth maps are shown in Figure 4, Figure 5, Figure 6 and Figure 7. For all data sets, each experiment was repeated twenty times to reduce the influence of random effects caused by randomly choosing different training samples each time. The optimal weights of each model were saved by checking whether the accuracy on the validation set improved over the epochs. The results averaged over the runs were taken as the final results for evaluating the classification accuracy of every method. We evaluated all methods with small training sets to show that the proposed MMFN is robust and generalizes well; MMFN also achieved better classification accuracy when training samples were few. For the IN, UP, KSC and Salinas data sets, we split the data into training, validation and test sets with a ratio of 5%, 10% and 85%, respectively.
In our implementation, the number of training epochs was set to 100 and standard stochastic gradient descent was used as the optimizer. The batch size was set to 64, the learning rates for the IN, UP, KSC and Salinas data sets were fixed at 0.0003, 0.0001, 0.0001 and 0.0003, respectively, and the momentum was set to 0.9.
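The optimizer settings above correspond to SGD with momentum. A minimal sketch of one common form of the update is shown below; deep learning frameworks differ slightly in how they scale the velocity term, and the text does not say which framework or variant was used, so the exact rule here is an assumption. The default learning rate is the value reported for the IN and Salinas data sets.

```python
import numpy as np

def sgd_momentum_step(weights, grad, velocity, lr=0.0003, momentum=0.9):
    """One SGD-with-momentum update (classical form, assumed for illustration)."""
    velocity = momentum * velocity - lr * grad   # decayed past updates plus the new gradient step
    return weights + velocity, velocity

# Toy usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, grad=w, velocity=v, lr=0.1)
final_norm = float(np.linalg.norm(w))
```

In actual training, `grad` would be the gradient of the cross-entropy loss averaged over a mini-batch of 64 samples.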
The proposed method was compared with several state-of-the-art methods, including SVM [39], ResNet [38], SAE [10], 3-D CNN [23], Two-CNN [34] and SSRN [35]. For a fair comparison, the SVM was given spatial information through a Gaussian filter. The ResNet framework adopts the same residual blocks as our method but does not contain the spectral-spatial learning module.
  3.2. Influence of Parameters
  3.2.1. The Selection of Multiscale Inputs
The spatial scale of the input data was varied on the four data sets, and the appropriate multiscale inputs were determined from the classification accuracy achieved at each scale. For all four data sets, the data were split into 5%, 10% and 85% for the training, validation and test sets, respectively. The classification results for different spatial input scales are shown in Table 5. In most cases, the classification accuracy on the four data sets increased as the spatial scale of the input became larger. This indicates that, within a certain range, input data with a larger spatial scale contain more spatial structure information, which helps the network learn more discriminative features and obtain higher classification accuracy.
As Table 5 shows, when the spatial scale of the input was larger than or equal to 11 × 11, the improvement in classification accuracy on the four data sets was relatively small and the accuracy was basically stable. When the spatial scale of the input was 7 × 7, the classification accuracy on the four data sets was already high and clearly better than that achieved with 3 × 3 and 5 × 5 inputs. Therefore, to keep the spatial scale of the input relatively small while achieving high classification accuracy, 7 × 7, 9 × 9 and 11 × 11 were selected as the multiscale inputs of the whole network.
  3.2.2. The Effectiveness of Multiscale Inputs
To validate that multiscale inputs are more beneficial for HSI classification than a single-scale input, experiments were carried out on the four data sets. The results are shown in Table 6; the number of training samples ranges from 3% to 6% per class, and the classification accuracy is measured by the overall accuracy (OA). Table 6 shows that an input with a large spatial scale yields higher classification accuracy than an input with a small spatial scale. Although accuracy improved for both large and small spatial scales as the number of training samples increased, the accuracy of the larger-scale input was higher than that of the smaller-scale input in all cases. When 3% of the samples in the four data sets were used for training, the classification accuracy obtained with multiscale inputs was higher than that obtained with a single-scale input. The advantage of multiscale inputs was especially clear on the IN data set, which is difficult to classify. In Table 6, the classification accuracies of the multiscale inputs (in bold) are higher than those of the single-scale inputs regardless of the number of training samples. This indicates that multiscale inputs are more useful for classification than a single-scale input.
On the UP and KSC data sets, the classification accuracy did not improve significantly: both single-scale and multiscale inputs reached high accuracy because these two data sets have a higher spatial resolution, so the advantage of multiscale inputs was not obvious. In most cases, however, multiscale inputs generalized better and were more effective with few labeled samples. Compared with the spectral-spatial fusion feature generated by a single-scale input, multiscale inputs generate multiscale spectral-spatial fusion features that contain more abundant spatial-spectral correlation, spatial structure information and texture information. This information helps the network learn more discriminative features for better classification. As the number of training samples increased, the classification accuracy improved in most cases, and the multiscale inputs achieved better results than the single-scale input. The experimental results thus indicate that multiscale inputs are well suited to deep network classification and can improve the final classification accuracy.
  3.2.3. The Selection of Number of Residual Block
In the second stage of the MMFN, the residual learning module is used to learn higher-level features from the middle-level fusion features. The number of residual blocks in the network was determined experimentally in this section. For the IN, UP, KSC and Salinas data sets, a suitable number of residual blocks was selected according to the classification accuracy achieved while varying the number of residual blocks and training samples. The results are shown in Table 7. As the number of training samples increased, the classification accuracy improved on all four data sets regardless of the number of residual blocks, which confirms that more labeled samples help improve the classification results. For the same number of training samples, the classification accuracy on the four data sets was higher when the network contained two residual blocks than when it contained zero, one or three.
There is an exception on the UP and Salinas data sets: when the number of training samples reaches 6%, the accuracy of the network with two residual blocks is slightly lower than that with three. The UP and Salinas data sets have high spatial resolutions of 1.3 m and 3.7 m, respectively, which helps the network achieve high accuracy even with few training samples. Thus, with 6% training samples per class, the MMFN performs very well whether the network contains zero, one, two or three residual blocks. The results also show that networks with residual blocks achieved higher accuracy than the network without residual blocks on the four data sets, which indicates that the residual block helps improve classification accuracy. As the bolded accuracies show, in most cases the network with two residual blocks outperformed the other configurations. The results also suggest that a network with one residual block is too shallow to learn sufficiently discriminative features, whereas three residual blocks fuse features from too many layers, which may introduce redundant information and reduce classification accuracy. Based on these results, we chose two residual blocks for the structure of the MMFN in the second training stage.
  3.3. Experiment Results and Analysis
To demonstrate the advantage of the proposed MMFN when labeled samples are scarce, we compared MMFN with other state-of-the-art methods on the four data sets while varying the number of training samples from 3% to 6% per class; the classification results are shown in Figure 8. Figure 8a shows the classification performance of each method on the IN data set. MMFN has a distinct advantage over the other methods when the number of training samples is 3%, which shows that MMFN can learn more discriminative features for classifying the image even with few training samples. Figure 8b shows the classification results of each method on the UP data set. Although the accuracy curve of MMFN approaches that of SSRN as the number of training samples increases, MMFN performs better than the other methods when the number of training samples is small. Figure 8c shows the performance of all methods on the KSC data set. MMFN has obvious advantages over the other methods in most cases. This shows that the optimal middle-level features help the second training stage extract more discriminative features, and that fusing the multiscale middle-level features in the convolutional layer lets the network learn the strong complementary and related information among the multiscale features. Figure 8d shows the classification results on the Salinas data set. Because the spatial resolution of the Salinas data set is high, both MMFN and SSRN achieve high accuracy with few training samples; nevertheless, the figure shows that MMFN still has a clear advantage in classification.
Across the four data sets, MMFN fused the extracted multiscale middle-level features in the convolutional layer, which helped the residual network learn more discriminative, higher-level features. In Two-CNN, by contrast, the features were fused in the fully connected layer, so the network used the fused features only for classification, which may have reduced the classification accuracy.
Table 8 shows the classification accuracies of the different methods on the IN, UP, KSC and Salinas data sets. Table 9, Table 10, Table 11 and Table 12 list the class-specific accuracies of the different methods on the four data sets. The data are split into training, validation and test sets of 5%, 10% and 85%, respectively. The bolded classification accuracies in these tables show that MMFN performs better than the other methods in OA and AA in most cases, which demonstrates the effectiveness of the network. MMFN achieved higher classification accuracy than ResNet on the four data sets because MMFN makes full use of the multiscale middle-level features and sends the fused features to the residual blocks instead of learning directly from the original image as ResNet does. This also shows that the optimal middle-level feature obtained from each input scale in the first stage of MMFN is beneficial for classification and supports the validity of extracting middle-level features in MMFN. Compared with the spectral-spatial fusion in SSRN, MMFN introduces multiscale inputs, which provide the network with abundant complementary and related information among features of different scales. In addition, the spectral and spatial learning modules in MMFN both operate on the original image, so they can extract more primitive and accurate spatial structure information. In SSRN, the input of the spatial learning module is the features extracted by the spectral learning module, so the spatial learning misses the spatial information in the original image and the classification accuracy is lower than that of MMFN. The results also show that the variance of the MMFN classification results is smaller than that of the other methods in most cases, which indicates the stability of the network.
The training and testing times provide a direct measure of the computational efficiency of MMFN. All experiments were conducted on an HP z620 workstation with a GT 980Ti graphics processing unit (GPU). The loss function values on the four data sets were 0.2520 (IN), 0.2318 (UP), 0.0653 (KSC), and 0.1457 (Salinas), respectively. 
Table 13 shows the training and test times of all methods on the four data sets. For all methods, the data were split into training, validation, and test sets at 5%, 10%, and 85%, respectively. Two-CNN, ResNet, SAE, CNN, SSRN, and MMFN were iterated 100 times, and the SVM was trained 20 times. Table 13 shows that MMFN takes the longest to train, because its training is divided into two phases and its multiscale inputs add to the computational time. Although training MMFN takes about 1–6 minutes longer than training SSRN on the larger data sets such as UP and Salinas, MMFN achieves higher classification accuracy than SSRN when few labeled samples are available. Its advantage is most obvious on the IN data set, which is difficult to classify. Among the other methods, ResNet contains two residual blocks, so its training time is longer than that of Two-CNN, SAE, and CNN, but its classification accuracy is also higher.
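The 5%/10%/85% split used for all methods can be sketched as follows. The sketch assumes the split is drawn per class (stratified) with a fixed random seed. The text gives only the proportions, so the sampling scheme, the rounding rule, and the name `stratified_split` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.05, val_frac=0.10, seed=0):
    """Split sample indices into train/val/test lists, class by class.

    Per-class sampling and the minimum of one sample per class in the
    training and validation sets are assumptions, not the authors' protocol.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = max(1, round(train_frac * len(indices)))
        n_val = max(1, round(val_frac * len(indices)))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remaining samples (about 85%)
    return train, val, test


# Made-up labels: 100 samples of class 0 and 200 samples of class 1.
labels = [0] * 100 + [1] * 200
train_idx, val_idx, test_idx = stratified_split(labels)
```

Drawing the split per class keeps even small classes represented in the training set, which matters for data sets with very unbalanced class sizes.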
3.4. Discussions
In this section, we briefly discuss the experimental results presented earlier. First, the MMFN model generally performed better than the other models on all four data sets. There are three possible reasons for this improvement: 1) the multiscale model effectively fuses more abundant neighborhood correlation and low-level features; 2) the middle-level feature fusion structure can better exploit the strong complementary and related information among multiscale fusion features than a high-level feature fusion structure; 3) the residual learning modules can extract more discriminative, higher-level features and make deep learning models much easier to train. As shown in Figure 8, the residual-based network models achieved better performance on all four data sets than SAE, CNN, and Two-CNN. As shown in Figure 8 and Table 8, MMFN achieved higher classification accuracy than ResNet and SSRN on the four data sets. MMFN makes full use of the multiscale middle-level features and sends the fused features to the residual block, instead of learning directly from the original image as ResNet does. SSRN also adopts residual connections and treats spectral and spatial features separately in two consecutive blocks. However, because the input of the spatial block is based on the spectral block, the spatial learning misses spatial information.
Second, two factors influence the HSI classification accuracy: 1) the number and spatial size of the inputs to the network; 2) the number of labeled training samples. The number and spatial size of the MMFN inputs were chosen experimentally, which yields a higher accuracy. Multiscale inputs with a relatively larger spatial size contain more useful and abundant information, which can boost the classification performance. MMFN also performs better with relatively few labeled samples, and the network can be generalized to other remote-sensing scenarios because of its deep feature learning capacity.
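To make the notion of multiscale inputs concrete, the sketch below cuts several neighborhoods of different spatial sizes around the same center pixel. The patch sizes (3, 5, 7), the zero padding at the image border, and the function names are hypothetical choices for illustration and are not the settings used for MMFN.

```python
def extract_patch(cube, row, col, size):
    """Return a size x size neighborhood centered on (row, col).

    `cube` is indexed [row][col][band]. Positions outside the image are
    filled with zero vectors (zero padding is an assumption).
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the patch has a center pixel")
    height, width, bands = len(cube), len(cube[0]), len(cube[0][0])
    half = size // 2
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(list(cube[r][c]) if inside else [0.0] * bands)
        patch.append(patch_row)
    return patch


def multiscale_inputs(cube, row, col, sizes=(3, 5, 7)):
    """Build one patch per spatial size for the same center pixel."""
    return {size: extract_patch(cube, row, col, size) for size in sizes}


# Made-up 4 x 4 image with 2 bands; each pixel stores its own coordinates.
cube = [[[r, c] for c in range(4)] for r in range(4)]
inputs = multiscale_inputs(cube, 1, 2, sizes=(3, 5))
```

Every patch shares the same center pixel, so the larger patches add surrounding context to the smaller ones. This is the sense in which the scales are complementary.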
Finally, the disadvantage of the MMFN model is its relatively long training time, mainly because the training of the network is divided into two stages and the multiscale inputs add to the computation. As shown in Table 13, the training time of MMFN is about 1–6 minutes longer than that of SSRN and 2–12 minutes longer than that of CNN, which means that MMFN is more computationally expensive than both. Fortunately, the use of a GPU largely alleviates the extra computational cost and reduces the training times.