Fully Dense Multiscale Fusion Network for Hyperspectral Image Classification

Abstract: The convolutional neural network (CNN) can automatically extract hierarchical feature representations from raw data and has recently achieved great success in the classification of hyperspectral images (HSIs). However, most CNN based methods used in HSI classification neglect to adequately utilize the strongly complementary yet correlated information from each convolutional layer and only employ the last convolutional layer's features for classification. In this paper, we propose a novel fully dense multiscale fusion network (FDMFN) that takes full advantage of the hierarchical features from all the convolutional layers for HSI classification. In the proposed network, shortcut connections are introduced between any two layers in a feed-forward manner, enabling features learned by each layer to be accessed by all subsequent layers. This fully dense connectivity pattern achieves comprehensive feature reuse and enforces discriminative feature learning. In addition, various spectral-spatial features with multiple scales from all convolutional layers are fused to extract more discriminative features for HSI classification. Experimental results on three widely used hyperspectral scenes demonstrate that the proposed FDMFN achieves better classification performance than several state-of-the-art approaches.


Introduction
Hyperspectral images (HSIs) usually consist of hundreds of narrow contiguous wavelength bands carrying rich spectral information. With such abundant spectral information, HSIs have been widely used in many fields, such as resource management [1], scene interpretation [2], and precision agriculture [3]. In these applications, a commonly encountered problem is HSI classification, aiming to classify each pixel to one certain land-cover category based on its unique spectral characteristic [4].
In the last few decades, extensive efforts have been made to fully exploit the spectral information of HSIs for classification, and many spectral classifiers have been proposed, including support vector machines (SVMs) [5,6], random forest [7], and multinomial logistic regression [8]. However, the classification maps obtained are still noisy, as these methods only exploit spectral characteristics and ignore the spatial contextual information contained in HSIs. To achieve more accurate classification results, spectral-spatial classifiers were developed, which exploit both the spatial and spectral information embedded in HSIs [9][10][11][12]. In [11], extended morphological profiles (EMPs) were employed to extract spatial morphological features, which were combined with the original spectral features for HSI classification. In [12], Kang et al. proposed an edge preserving filtering method for optimizing the pixelwise probability maps obtained by the SVM. In addition, methods of multiple kernel learning [13,14], sparse representation [15,16], and superpixels [17] have also been introduced for spectral-spatial classification of HSIs. Nonetheless, the above mentioned methods rely on human engineered features, which need prior knowledge and expert experience during the feature extraction phase. Therefore, they cannot consistently achieve satisfactory classification performance, especially in the face of challenging scenarios [18].
Recently, deep learning based approaches have drawn broad attention for the classification of HSIs, due to their capability of automatically learning abstract and discriminative features from raw data [19][20][21][22][23]. Chen et al. first employed the stacked auto-encoder (SAE) to learn useful high level features for hyperspectral data classification [24]. In [25], a deep belief network (DBN) was applied to the HSI classification task. However, owing to the requirement of 1D input data in the two models, the spatial information of HSIs cannot be fully utilized. To solve this problem, a series of convolutional neural network (CNN) based HSI classification methods was proposed, which can exploit the relevant spatial information by taking image patches as input. In [26], Zhang et al. proposed a dual channel CNN model that combines 1D CNN with 2D CNN to extract spectral-spatial features for HSI classification. In [27], Zhao et al. employed the CNN and the balanced local discriminative embedding algorithm to extract spatial and spectral features from HSIs separately. In [28], Devaram et al. proposed a dilated convolution based CNN model for HSI classification and applied an oversampling strategy to deal with the class imbalance problem. In [29], a 2D spectrum based CNN framework was introduced for pixelwise HSI classification, which converts the spectral vector into 2D spectrum image to exploit the spectral and spatial information. In [30], Guo et al. proposed an artificial neural network (ANN) based spectral-spatial HSI classification framework, which combines the softmax loss and the center loss for network training. To exploit multiscale spatial information for the classification, image patches with different sizes were considered simultaneously in their model. 
In addition, 3D CNN models have also been proposed for classifying HSIs, which take original HSI cubes as input and utilize 3D convolution kernels to extract spectral and spatial features simultaneously, achieving good classification performance [31][32][33].
In a CNN model, shallower convolutional layers are sensitive to local texture (low level) features, whereas deeper convolutional layers tend to capture global coarse and semantic (high level) features [34]. In the above mentioned CNN models, only the last layer output, i.e., global coarse features, is utilized for HSI classification. However, in addition to global features, local texture features are also important for the pixel level HSI classification task, especially when distinguishing objects occupying much smaller areas [22,35]. To obtain features with finer local representation, methods that aggregate features from different layers in the CNN were proposed for HSI classification [36][37][38]. In [36], a multiscale CNN (MSCNN) model was developed, which combines features created by each pooling layer to classify HSIs. In [37], a deep feature fusion network (DFFN) was proposed, which fuses different levels of features produced at three stages in the network for HSI classification. Although feature fusing mechanisms were utilized in the MSCNN and the DFFN, only three layers were fused for HSI classification. In [38], Zhao et al. proposed a fully convolutional layer fusion network (FCLFN), which concatenates features extracted by all convolutional layers to classify HSIs. Nonetheless, FCLFN employs a plain CNN model for feature extraction, which suffers from the vanishing gradient and declining accuracy problems when learning deeper discriminative features [39]. In [40], a densely connected CNN (DenseNet) was introduced for HSI classification, which divides the network into dense blocks and creates shortcut connections between layers within each block. This connectivity pattern alleviates the vanishing gradient problem and allows the utilization of various features from different layers for HSI classification. 
However, only layers within each block are densely connected in the network, which presents local dense connectivity pattern and focuses more on the high level features generated by the last block for HSI classification. These methods have demonstrated that taking advantage of features from different layers in the CNN can achieve good HSI classification performance, but not all of them fully exploit the hierarchical features.
In this paper, inspired by [41], we propose a novel fully dense multiscale fusion network (FDMFN) to achieve full use of the features generated by each convolutional layer for HSI classification. Different from the DenseNet that only introduces dense connections within each block, the proposed method connects any two layers throughout the whole network in a feed-forward fashion, leading to fully dense connectivity. In this way, features from preceding layers are combined as the input of the current layer, and its own output is fed into the subsequent layers, achieving the maximum information flow and feature reuse between layers. In addition, all hierarchical features containing multiscale information are fused to extract more discriminative features for HSI classification. Experimental results conducted on three publicly available hyperspectral scenes demonstrate that the proposed FDMFN can outperform several state-of-the-art approaches, especially under the condition of limited training samples.
In the rest of this paper, Section 2 briefly reviews the CNN based HSI classification procedure. In Section 3, the proposed FDMFN method is described. In Section 4, the experimental results conducted on three real HSIs are reported. In Section 5, we give a discussion on the proposed method and experimental results. Finally, some concluding remarks and possible future works are presented in Section 6.

HSI Classification Based on CNN
Deep neural networks can automatically learn hierarchical feature representations from raw HSI data [42][43][44]. Compared with other deep networks, such as SAE [24], DBN [25], and the long short-term memory network (LSTM) [45], CNN can directly take 2D data as input, which provides a natural way to exploit the spatial information of HSIs. Different from natural image classification that uses a whole image input for CNNs, HSI classification, as a pixel level task, generally takes image patches as the input, utilizing the spectral-spatial information contained in each patch to determine the category of its center pixel.
Convolutional (Conv) layers are the fundamental structural elements of CNN models; they convolve the input image patches or feature maps with convolution kernels to generate various feature maps. Supposing the lth Conv layer takes x_{l−1} as input, its output x_l can be expressed as:

x_l = W_l ∗ x_{l−1} + B_l,

where ∗ represents the convolution operator, and W_l and B_l are the weights and biases of the convolution kernels in the lth Conv layer, respectively. Behind each Conv layer, a batch normalization (BN) [46] layer is generally attached to accelerate the convergence of the CNN model. The procedure of BN can be formulated as:

BN(x) = γ · (x − μ_B) / √(σ_B² + ε) + β,

where μ_B and σ_B² are the mean and variance of x computed over the current mini-batch, ε is a small constant added for numerical stability, and the learnable parameter vectors γ and β are used to scale and shift the normalized feature maps.
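The BN transform above can be sketched in a few lines of NumPy (a minimal didactic illustration, not a framework implementation; the mean and variance are taken over the batch and spatial axes of each channel):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x has shape (N, C, H, W); normalize each channel over the
    # batch and spatial axes, then scale by gamma and shift by beta
    # (both learnable in practice).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 4, 23, 23)            # a batch of 4-channel feature maps
y = batch_norm(x, np.ones(4), np.zeros(4))   # gamma = 1, beta = 0
```

With γ = 1 and β = 0, each channel of the output has approximately zero mean and unit variance.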
To enhance the nonlinearity of the network, the rectified linear unit (ReLU) function [20] is placed behind the BN layer as the activation layer, which is defined as:

ReLU(x) = max(0, x).

In addition, a pooling layer (e.g., average pooling or max pooling) is periodically inserted after several Conv layers to reduce the spatial size of the feature maps, which not only reduces the computational cost, but also makes the learned features more invariant to small transformations and distortions of the input data [47].
Finally, the size reduced feature maps are transformed into a feature vector through several fully connected (FC) layers. By feeding the vector into a softmax function, the conditional probability of each class can be obtained, and the predicted class is determined based on the maximum probability.

Local Dense Connectivity in DenseNet
It has been demonstrated that introducing shortcut connections in a network alleviates the vanishing gradient problem and enables feature reuse, which can effectively enhance classification performance [39]. He et al. proposed the residual network (ResNet), which introduces identity shortcut connections to improve the information flow in the network and pushes the depth of the network up to thousands of layers [39]. Deep ResNets can be constructed by stacking residual blocks, in which input features can be passed directly to deeper layers through an additive shortcut connection, as shown in Figure 1. To further enhance the information flow throughout the network, Huang et al. proposed a different network, called the densely connected convolutional network (DenseNet), in which shortcut connections concatenate the input features with the output features instead of adding them [48]. However, pooling layers, which increase the robustness of the learned features, change the spatial size of the feature maps, making the concatenation operation infeasible. To address this problem, Huang et al. divided the network into multiple dense blocks, applying dense connections within each block and adding a pooling layer behind each block, as shown in Figure 2 [48]. Note that only layers within each block are densely connected, presenting a local dense connectivity pattern. Figure 3 shows the architecture of a dense block. Let x_0 be the input of the block. The lth layer in the block receives x_0 and the features produced by all preceding layers, i.e., x_1, x_2, ..., x_{l−1}, as input, and its output can be formulated as:

x_l = H_l([x_0, x_1, ..., x_{l−1}]),

where [·] denotes the concatenation operator and H_l(·) is a composite Conv layer with the pre-activation structure BN-ReLU-Conv [49]. Finally, the input features and those generated by each Conv layer are concatenated as the output of the dense block, as shown in Figure 3.
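A dense block built from BN-ReLU-Conv composite layers can be sketched in PyTorch as follows (a minimal illustration with hypothetical channel counts, not the exact configuration used in [48]):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block: layer l receives the concatenation of the block input
    and all preceding layer outputs (BN-ReLU-Conv composite layers)."""
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for l in range(num_layers):
            c_in = in_channels + l * growth_rate   # input grows with each layer
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c_in),
                nn.ReLU(inplace=True),
                nn.Conv2d(c_in, growth_rate, kernel_size=3, padding=1),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        # block output = input features plus every layer's output
        return torch.cat(features, dim=1)
```

For a block with `in_channels` input channels and `num_layers` layers, the output has `in_channels + num_layers * growth_rate` channels.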
In DenseNet, only layers within each block are densely connected, leading to a local dense connectivity pattern (see Figure 2). In addition, behind the first and second dense blocks, Conv layers are employed to make the extracted features more compact, but the non-dense connections between each block make the network focus more on the high level features (i.e., global coarse and semantic features) extracted by the last dense block for image classification.

Fully Dense Connectivity
To fully exploit the hierarchical features from all the Conv layers, we propose a fully dense connectivity pattern in which shortcut connections are introduced between any two layers in a feed-forward manner, enabling features learned by any layer to be accessed by all subsequent layers. Specifically, the features produced by preceding layers are concatenated as the input of the current layer, and its output features are fed into all subsequent layers, achieving the maximum information flow. Figure 4 schematically shows the layout of the proposed connectivity pattern. To address the issue that different layers may have different feature map sizes, pooling layers are employed to down-sample feature maps with higher resolutions when they are fed into lower resolution layers; average pooling is adopted in this work. Let x_0^1 be the initial features extracted from the original HSI and x_l^s the output of layer l at the sth scale. Each layer (i.e., H_l^s) shown in Figure 4 has the composite structure BN-ReLU-Conv and receives x_0^1 and the feature maps produced by all preceding layers. Table 1 summarizes the output of each layer. For instance, H_2^1, Layer 2 at the first scale, receives x_0^1 and x_1^1 as input, and its output can be computed by x_2^1 = H_2^1([x_0^1, x_1^1]). Note that here, we illustrate the fully dense connectivity with only two Conv layers at each scale; situations with more layers can easily be deduced by extending Figure 4 and Table 1.
Table 1. The output x_l^s of layer l at the sth scale. Herein, [·] represents the concatenation operator, and P and P² refer to the 2 × 2 and 4 × 4 pooling layers, respectively.
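The cross-scale concatenation described above can be sketched with a small PyTorch helper (hypothetical names; `features` holds (feature map, scale) pairs from all preceding layers, and average pooling reduces higher-resolution maps to the target scale, mirroring the P and P² operators):

```python
import torch
import torch.nn.functional as F

def gather_inputs(features, target_scale):
    """Concatenate all previously computed feature maps as the input of the
    next layer. Maps from shallower scales are average-pooled down by a
    factor of 2**(target_scale - scale); maps already at the target scale
    pass through unchanged. Assumes no feature comes from a deeper scale."""
    pooled = []
    for feat, scale in features:
        factor = 2 ** (target_scale - scale)
        pooled.append(F.avg_pool2d(feat, factor) if factor > 1 else feat)
    return torch.cat(pooled, dim=1)
```

For example, a 24 × 24 map at scale 1 is pooled 2 × 2 before entering a scale-2 layer, and 4 × 4 before entering a scale-3 layer.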
The proposed fully dense connectivity pattern has the following advantages for the classification of HSIs. First, it enhances feature reuse and discriminative feature extraction. As shown in Figure 4, features produced by each Conv layer can be accessed by all subsequent Conv layers, which achieves more comprehensive feature reuse than DenseNet. The reuse of abundant learned features in the subsequent layers is effective for new feature exploration and improves efficiency [48]. In addition, this connectivity pattern further enhances the information flow and alleviates the problem of gradient disappearance. Furthermore, when a classifier is attached behind the output, each intermediate layer receives an implicit supervision signal through a shorter connection, forcing it to learn more discriminative and robust features, especially in the early layers [50].
Second, the complementary and correlated features from all Conv layers can be exploited for HSI classification. In a CNN model, the receptive field size of Conv layers increases as the number of Conv layers increases [34]. The shallower layers with a narrow receptive field tend to capture local features (e.g., shapes, textures, and lines) of the input objects, whereas the deeper layers with a larger receptive field tend to extract global coarse and semantic features (see Figure 4). Due to the complex spatial environment of HSIs, in which different objects tend to have different scales, only using the global coarse features cannot effectively recognize objects with multiple scales, particularly for those occupying much smaller areas [22]. Through fully dense connectivity, features containing structural information of different scales can be combined for classification, which is beneficial for more accurate recognition of various objects in HSIs.

Fully Dense Multiscale Fusion Network for HSI Classification
The complete HSI classification framework based on the proposed FDMFN is shown in Figure 5.
As an example, consider the Indian Pines (IP) scene: image patches of size 23 × 23 × 200 from the raw image are taken as inputs to FDMFN, fully exploiting the spectral and spatial information. At the beginning, a Conv layer with a 1 × 1 kernel is utilized to reduce the dimensionality of the input spectral-spatial data and extract features; the size of the input is therefore condensed from 23 × 23 × 200 to 23 × 23 × 2k, where k is a constant integer referred to as the growth rate, e.g., k = 20. Next, the obtained features are further processed by a series of 3 × 3 Conv layers and pooling layers to extract hierarchical feature representations. Note that the number of output features doubles whenever their spatial size shrinks (see Figure 5), because extracting diversified high level features with increased feature dimension is very effective for classification tasks [51]. After global average pooling (GAP), the multiscale hierarchical features from all Conv layers are fused by concatenation to generate more discriminative features. Finally, the fused features are fed to a fully connected (FC) layer for classification.
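The initial 1 × 1 Conv stem can be illustrated with a minimal PyTorch snippet (shapes follow the IP example above; the single layer is a stand-in, not the full FDMFN):

```python
import torch
import torch.nn as nn

k = 20                                       # growth rate used for the IP dataset
stem = nn.Conv2d(200, 2 * k, kernel_size=1)  # 1x1 conv: 200 bands -> 2k features
patch = torch.randn(1, 200, 23, 23)          # one 23x23x200 input patch (N, C, H, W)
features = stem(patch)                       # spatial size preserved, depth condensed
```

The 1 × 1 convolution mixes the 200 spectral channels at each pixel independently, so the spatial size stays 23 × 23 while the channel dimension drops to 2k = 40.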
Let v^m be the feature vector learned by the FC layer, where m = 1, 2, ..., M and M is the total number of training samples. For each sample, the probability distribution over the classes is obtained by the softmax function, which can be expressed as:

p_i^m = exp(v_i^m) / Σ_{j=1}^{T} exp(v_j^m),

where T denotes the total number of classes, v_i^m represents the ith value of v^m, and p_i^m refers to the probability of the mth training sample belonging to the ith class. The loss function of FDMFN is defined as:

L = −(1/M) Σ_{m=1}^{M} Σ_{i=1}^{T} t_i^m log(p_i^m),

where t_i^m denotes the ith value of the truth label vector t^m. Note that the truth label of each sample is encoded as a vector of length T in which the position of the correct label has the value "1" and all other positions have the value "0", i.e., one-hot encoding. The network is trained by minimizing the loss function using the Adam [52] optimization algorithm for 100 epochs. After the optimization is completed, for each test sample, the probability distribution over the classes is obtained by the trained FDMFN, and the predicted label is determined by the maximal probability.
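The softmax and cross-entropy loss above can be written compactly in NumPy (a didactic sketch; in practice, a deep learning framework's fused implementation is used):

```python
import numpy as np

def softmax(v):
    # v: (M, T) array of feature vectors; shift by the row max for stability.
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(v, t):
    # t: (M, T) one-hot truth labels; average the per-sample losses over M.
    p = softmax(v)
    return -np.mean(np.sum(t * np.log(p), axis=1))
```

Raising the logit of the correct class increases its softmax probability and therefore decreases the loss.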
Figure 5. Framework of the proposed fully dense multiscale fusion network (FDMFN) for HSI classification. For convenience, the BN and ReLU layers that precede the global average pooling (GAP) layer are not shown. Table 2 summarizes the details of the layers of FDMFN for the IP dataset. The stride of convolution is one. Note that for the 3 × 3 Conv layers, the inputs are zero-padded with one pixel on each side to keep the spatial size of the feature maps fixed during convolution.

Description of Datasets
Three publicly available hyperspectral datasets were utilized to verify the effectiveness of our FDMFN method, i.e., Indian Pines (IP), University of Houston (UH), and Kennedy Space Center (KSC) datasets [53,54].
The IP dataset was gathered in 1992 by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) instrument [55] from Northwest Indiana. This dataset mainly covers a mixed agricultural/forest area, consisting of 145 × 145 pixels and 224 spectral bands in the spectral range from 400 to 2500 nm. The geometric resolution is 20 m per pixel, and the ground reference map has 16 classes. After discarding four null bands and another 20 water absorption bands, 200 channels were utilized for the experiments.
The UH dataset was gathered in 2012 by the Compact Airborne Spectrographic Imager (CASI) sensor [56] over the campus of the University of Houston and its neighboring region. It was first presented in the 2013 Geoscience and Remote Sensing Society (GRSS) Data Fusion Contest [57]. This scene is composed of 349 × 1905 pixels, and its ground reference map contains 15 classes. The geometric resolution is 2.5 m per pixel. It has 144 spectral bands in the spectral range from 380 to 1050 nm.
The KSC dataset was acquired in 1996 by the AVIRIS instrument [55] in Florida with a geometric resolution of 18 m per pixel. This scene consists of 512 × 614 pixels and mainly includes 13 classes. After discarding noisy bands, the considered scene had 176 spectral bands for the classification.
In our experiments, the labeled samples of each dataset were divided into training, validation, and testing sets, and the split ratio was 5%:5%:90%. Tables 3-5 summarize the number of samples of the three datasets. Note that the network parameters were only tuned using the training set. During the training phase, the interim trained model that achieved the highest classification performance on the validation set was saved. Finally, the testing set was used to evaluate the preserved model's classification performance. Three widely used quantitative metrics, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient [58], were adopted to assess the classification performance. To avoid biased estimation, all experiments were repeated five times with randomly selected training samples, and the average values were reported for all the performance metrics.
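A per-class 5%:5%:90% split of the labeled pixels can be sketched as follows (an illustrative helper with hypothetical names, assuming label value 0 marks unlabeled background pixels):

```python
import numpy as np

def split_labeled_pixels(labels, train_ratio=0.05, val_ratio=0.05, seed=0):
    """Randomly split the labeled pixel indices of each class into
    training, validation, and test sets (5%:5%:90% by default)."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels[labels > 0]):       # skip unlabeled background
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_tr = max(1, int(train_ratio * len(idx)))
        n_va = max(1, int(val_ratio * len(idx)))
        train.extend(idx[:n_tr])
        val.extend(idx[n_tr:n_tr + n_va])
        test.extend(idx[n_tr + n_va:])
    return np.array(train), np.array(val), np.array(test)
```

Splitting per class keeps the class proportions of the full ground reference in each subset, which matters for the unbalanced scenes discussed later.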

Experimental Setup
We trained the proposed network for 100 epochs using the Adam [52] optimizer with a batch size of 100, as done in [40]. The network parameters were initialized using the He initialization method [59]. We used an L2 weight decay penalty of 0.0001 and a cosine-shaped learning rate schedule, which began at 0.001 and gradually decreased to zero [60]. The proposed network was implemented using the PyTorch framework [61]. All the experiments were conducted on a PC with a single NVIDIA GeForce RTX 2080 GPU and an AMD Ryzen 7 2700X CPU.
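This training configuration can be sketched in PyTorch as follows (a minimal illustration; the `Linear` layer is a stand-in for the actual network):

```python
import torch

model = torch.nn.Linear(10, 2)               # stand-in for the FDMFN model
epochs = 100
# Adam with an L2 weight decay penalty of 1e-4, starting at lr = 1e-3.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Cosine-shaped schedule: decays smoothly from 1e-3 toward zero over 100 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```

In the training loop, `scheduler.step()` is called once per epoch after the optimizer updates, so the learning rate traces a half cosine from its initial value down to (nearly) zero.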

Analysis of Parameters
In the proposed FDMFN method, apart from the weights and biases of the network, which are tuned automatically during the training phase, the number of Conv layers, the growth rate k, and the size of the input image patches are also important to the final classification performance. Figure 6a shows the impact of the number of Conv layers on the average accuracy (AA) of the proposed FDMFN. The number of Conv layers determines the network depth, an important parameter affecting the classification performance: although a deeper network can learn more abstract features for classification, it increases the possibility of overfitting. From Figure 6a, one can see that the best AA value was achieved with 16 Conv layers for the IP dataset, whereas for the UH and KSC datasets, the AA values reached their highest with 13 Conv layers. Therefore, in the following experiments, the number of Conv layers was set to 16, 13, and 13 for the IP, UH, and KSC datasets, respectively. Figure 6b illustrates the influence of the growth rate k, which also determines the representation capacity of the proposed FDMFN. We assessed k values from 8 to 24 with a step of 4 for each dataset. As shown in Figure 6b, the model obtained the best performance with k = 20 on the IP dataset and with k = 12 on the UH and KSC datasets; therefore, k was set to 20, 12, and 12 for the IP, UH, and KSC datasets, respectively. Figure 6c shows the tendency of the AA of the proposed FDMFN over different sizes of input image patches. As can be seen, with the increase of the patch size, the AA values tended to increase first and then decrease on all three datasets. Generally speaking, a larger image patch brings more spatial information, which helps to increase the classification accuracy.
However, a large image patch may contain pixels belonging to multiple classes, misleading the classification of the target pixel. From Figure 6c, one can see that when the patch size was 23 × 23, the proposed FDMFN could produce the best classification results for all three datasets. Therefore, we chose 23 × 23 as the default size of input image patches.
The proposed FDMFN was compared with SVM, 3D CNN, DFFN, FCLFN, and DenseNet [32,37,38,40]. Specifically, SVM only exploits the spectral information embedded in HSIs for classification; the remaining methods are spectral-spatial classification methods. 3D CNN directly extracts spectral-spatial features from the original HSI using 3D convolution kernels. DFFN fuses spectral-spatial features generated at different stages of a deep residual network for the classification of HSIs. FCLFN fuses spectral-spatial features produced by all Conv layers for HSI classification. For DenseNet, due to its local dense connectivity pattern, only spectral-spatial features from layers in the last block are fully combined for HSI classification. For SVM, the penalty parameter C and the RBF kernel parameter γ were determined through five-fold cross-validation (C = 2^−8, 2^−7, ..., 2^8; γ = 2^−8, 2^−7, ..., 2^8). For the other methods, we used the default parameter settings from the corresponding references [32,37,38,40]. Taking the Indian Pines dataset as an example, the default sizes of input image patches for the DFFN, FCLFN, and DenseNet methods were 25 × 25, 23 × 23, and 11 × 11, respectively. Figure 7 shows the classification maps obtained by the various approaches on the IP dataset (all classification maps are from the first of the five runs). One can see that the SVM classifier generated rather poor estimations in its classification map (see Figure 7c), as it only exploited the spectral information of the HSI. In contrast, by utilizing both spatial and spectral information, the other methods produced better visual results (see Figure 7d-h). Table 6 presents the quantitative results of the different methods. The proposed FDMFN outperformed the contrastive approaches in terms of the three overall metrics (i.e., OA, AA, and Kappa), demonstrating the effectiveness of our method. Note that the class distribution of this dataset is quite unbalanced.
The largest class, soybean-mintill (Category 11), contained 2455 samples, while the smallest class, oats (Category 9), had only 20 samples. When facing a dataset with such an uneven class distribution, the minority classes may be heavily underrepresented, leading to poor classification performance. From Table 6, one can see that SVM, DFFN, FCLFN, and DenseNet achieved rather poor results on the oats class (Category 9). However, the proposed FDMFN avoided this problem and achieved the highest classification accuracy on the oats class, again verifying the effectiveness of our method. Tables 7 and 8 report the quantitative classification results (averaged over five runs) on the UH and KSC datasets, respectively, and Figures 8 and 9 show the corresponding classification maps. Compared with DenseNet, FDMFN improved the OA from 95.78% to 97.41% on the UH dataset and achieved a significant gain of 2.41% in AA on the KSC dataset. Overall, FDMFN outperformed all other compared methods in terms of the three overall metrics on both datasets, which validates the effectiveness of our method.

Classification Results
To demonstrate whether the accuracy improvement of the proposed FDMFN over the compared methods was statistically significant, we performed the standardized McNemar's test [62], which is defined as:

Z = (f_12 − f_21) / √(f_12 + f_21),

where f_ij denotes the number of samples that are correctly classified by classifier i but incorrectly classified by classifier j. When Z is larger than 2.58, the performance improvement of Classifier 1 over Classifier 2 is statistically significant at the 99% confidence level. As shown in Table 9, all the Z values were much higher than 2.58, which confirms that our FDMFN method significantly outperformed the contrastive approaches. Finally, the computing times of the proposed FDMFN and the other deep neural networks on the three datasets are reported in Table 10. One can see that FCLFN consumed the least time, while the proposed FDMFN was the most time consuming on the three datasets, because FDMFN has more parameters than the other compared models on the IP dataset (see Table 11). For the UH and KSC datasets, although FDMFN had fewer parameters than DenseNet, it took a larger image patch as input (23 × 23 for FDMFN versus 11 × 11 for DenseNet [40]) and thus also required more time than DenseNet. Although more time consuming, FDMFN achieved better classification performance than the other deep networks.
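The standardized McNemar statistic can be computed directly (a one-line sketch of the formula above):

```python
import math

def mcnemar_z(f12, f21):
    """Standardized McNemar's test: f12 counts samples correct for
    Classifier 1 but wrong for Classifier 2, f21 the reverse. A value
    of Z > 2.58 indicates that Classifier 1's improvement over
    Classifier 2 is significant at the 99% confidence level."""
    return (f12 - f21) / math.sqrt(f12 + f21)
```

For example, with f12 = 90 and f21 = 10, Z = 80 / √100 = 8.0, well above the 2.58 threshold.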

Effect of Different Ratios of Training Samples
In this section, the effectiveness and robustness of the proposed method are investigated under different ratios of training samples. For each dataset, the ratio of training samples ranged from 2% to 10% with an interval of 2%. The OA values obtained by the different methods on the three datasets are illustrated in Figure 10. It can be observed that FDMFN provided better classification accuracies than the other methods under all ratios of training samples. Furthermore, with fewer training samples (e.g., using only 2% for training), the proposed FDMFN had a significant advantage over the other compared approaches on the IP and KSC datasets.

Input Patch Size Analysis
In general, larger input HSI patches with more spatial information tend to have an advantage over ones with less spatial information. However, larger input patches may contain pixels from multiple classes, confusing the recognition of the target pixel; they also reduce the inter-sample diversity and increase the possibility of overfitting. In this experiment, we further compared the proposed method with the other deep neural networks when they shared the same size of input HSI patches. The experiment was conducted on the KSC dataset, with the spatial size varied over the set {11 × 11, 15 × 15, 19 × 19, 23 × 23, 27 × 27}. For small spatial sizes (e.g., 11 × 11), we removed the pooling layers due to the rapid down-sampling of the input image patch. The overall accuracies obtained by the different methods are shown in Table 12. As can be seen, the proposed FDMFN outperformed the other methods regardless of the spatial size of the input HSI patches, which demonstrates the effectiveness of the proposed FDMFN method. In addition, all the overall accuracies obtained by the proposed method with different patch sizes were higher than 98%, which suggests that our FDMFN method is robust to the spatial size of the input patches.

Comparison with Other State-of-the-Art Approaches
In this section, we further compared the proposed FDMFN method with another three state-of-the-art deep learning based HSI classification approaches: the dilated convolution based CNN model (Dilated-CNN) [28], the 2D spectrum based CNN model [29], and the artificial neuron network with center-loss and adaptive spatial-spectral center classifier (ANNC-ASSCC) [30].
Specifically, we first compared the proposed method with the Dilated-CNN on the IP and Salinas datasets. Detailed information on the Salinas dataset can also be found in [53]. Following Dilated-CNN [28], for each dataset, 60% of the labeled samples per class were randomly selected for training. Next, the proposed FDMFN was compared with the 2D spectrum based CNN model [29] on the IP, KSC, and Salinas datasets. For a fair comparison, we used the same number of training samples as in [29]. Finally, the proposed method was compared with the ANNC-ASSCC method [30] on the Salinas dataset. Following [30], we randomly chose 200 labeled samples from each class for training.
The corresponding classification results are reported in Tables 13-15. As can be observed, the proposed method achieved improved performance in comparison with the other three deep learning models. By making full use of the hierarchical features, our method could exploit multiscale information for the classification of HSIs. In addition, the comprehensive feature reuse in the proposed network also facilitated discriminative feature learning. The results shown in Tables 13-15 further demonstrate the superiority of our FDMFN model for HSI classification.

Table 15. Comparison between the proposed FDMFN and the artificial neural network with center-loss and adaptive spatial-spectral center classifier (ANNC-ASSCC) [30] on the Salinas dataset. Note that the best results reported in [30] were used for comparison here. In addition, the best results are highlighted in bold font.

Discussion
There are two main reasons why FDMFN achieved superior classification performance. First, the proposed method achieved comprehensive reuse of the abundant information from different layers and provided additional supervision for each intermediate layer, enforcing discriminative feature learning. Second, the multiscale hierarchical features learned by all Conv layers were combined for HSI classification, which allowed finer recognition of various objects and hence enhanced the classification performance.
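The fully dense connectivity pattern behind the first point can be illustrated with a toy sketch (a schematic of the connection topology only, using simple linear layers in NumPy rather than the actual convolutional architecture; the growth rate and layer count are arbitrary):

```python
import numpy as np

def dense_forward(x, layer_fns):
    """Feed-forward pass with fully dense connectivity:
    each layer receives the channel-wise concatenation of the input and
    the outputs of *all* preceding layers (comprehensive feature reuse),
    and the final representation fuses every hierarchical feature.
    """
    features = [x]
    for fn in layer_fns:
        out = fn(np.concatenate(features, axis=-1))
        features.append(out)
    return np.concatenate(features, axis=-1)  # fuse all hierarchical features

rng = np.random.default_rng(0)
growth = 4          # channels contributed by each layer
in_ch = 8           # input channels
layers = []
for _ in range(3):
    w = rng.standard_normal((in_ch, growth))
    layers.append(lambda f, w=w: np.maximum(f @ w, 0.0))  # linear + ReLU stand-in for a Conv layer
    in_ch += growth  # the next layer sees all previous outputs as well

x = rng.standard_normal((10, 8))  # 10 samples, 8 input channels
fused = dense_forward(x, layers)
```

Because every layer's output is carried through to the final concatenation, the fused representation grows by `growth` channels per layer, and gradients from the classifier reach every intermediate layer directly, which is what provides the additional supervision mentioned above.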
From the comparison of the execution time of the different deep neural networks, we can see that the proposed model was not the most computationally efficient. However, our method achieved better classification performance than the other methods on the three real hyperspectral datasets. Furthermore, when limited training samples were used, the proposed method significantly outperformed the other approaches on the IP and KSC datasets, further demonstrating its effectiveness for HSI classification. In our future work, a memory-efficient implementation of the proposed network will be investigated to reduce the computational load.

Conclusions
In this work, we proposed a novel FDMFN to fully exploit the hierarchical features from all Conv layers for spectral-spatial HSI classification. The proposed FDMFN is characterized by shortcut connections between any two layers in the network. Through fully dense connectivity, the spectral-spatial features learned by each layer can be accessed by all subsequent layers, achieving comprehensive feature reuse. In addition, the proposed method enforces discriminative feature learning by providing additional supervision. Furthermore, the multiscale features extracted by all Conv layers are fused to obtain more discriminative features for HSI classification. Experimental results on three widely used hyperspectral scenes demonstrated that the proposed FDMFN outperformed other state-of-the-art methods.
Note that although the combination of all hierarchical features provided good classification performance, the contribution of features at different scales varied across objects. In our future work, attention mechanisms [63] will be considered to adaptively emphasize representative features and suppress less useful ones for each sample, further enhancing the classification performance.