Multipath Residual Network for Spectral-Spatial Hyperspectral Image Classification

Convolutional neural networks (CNNs) have recently shown outstanding capability for hyperspectral image (HSI) classification. In this work, a novel CNN model is proposed, which is wider than other existing deep learning-based HSI classification models. Based on the fact that very deep residual networks (ResNets) behave like an ensemble of relatively shallow networks, our proposed network, called multipath ResNet (MPRN), employs multiple residual functions in the residual blocks to make the network wider, rather than deeper. The proposed network consists of short-to-medium paths for efficient gradient flow and replaces the stacking of multiple residual blocks in ResNet with fewer residual blocks, each containing more parallel residual functions. Experimental results on three real hyperspectral data sets demonstrate the superiority of the proposed method over several state-of-the-art classification methods.


Introduction
Remote sensing hyperspectral images (HSIs) usually contain information about hundreds of spectral bands spanning from the visible to the infrared spectrum. Each pixel in an HSI is a high-dimensional vector whose entries correspond to the spectral reflectance at a specific wavelength, providing rich spectral information for distinguishing land covers of interest [1]. Recently, HSI classification, with the aim of identifying the land-cover type of each pixel, has become one of the most active research fields in the remote sensing community, because it is an essential step in a wide variety of earth monitoring applications, such as environmental monitoring [2] and precision agriculture [3].
The spectral and the spatial information of HSIs are two major characteristics that can be exploited for classification [4]. Traditional classification methods, such as random forest [5], support vector machine (SVM) [6] and multinomial logistic regression [7], mainly focus on making use of the abundant spectral information for classification. To improve classification performance, methods such as morphological profiles [8], multiple kernel learning [9], superpixels [10] and sparse representation [11] have been introduced to combine the spatial information with the spectral information for HSI classification [12,13]. For instance, Benediktsson et al. utilized extended morphological profiles (EMPs) to obtain spectral-spatial features of HSIs [8]. Fang et al. proposed a multiscale adaptive sparse representation (MASR) model to exploit the multiscale spatial information of HSIs [11]. Fauvel et al. proposed a morphological kernel based SVM classifier to jointly use the spatial and the spectral information for classification [14]. Nevertheless, these methods share common limitations that deep learning-based approaches aim to overcome. The experimental results are then presented and discussed in Section 4. Finally, some conclusions and suggestions are provided in Section 5.

CNN-Based HSI Classification
Let $\mathbf{U} \in \mathbb{R}^{H \times W \times C}$ be an HSI data set, where $H$ and $W$ represent the height and the width of the spatial dimensions, respectively, and $C$ is the number of spectral bands. Instead of directly classifying each hyperspectral pixel vector, CNN-based models generally take an image patch centered at each pixel for classification. In this way, the spatial and spectral information contained in such patches is combined in the task of classifying pixels, resulting in a reduction of the label uncertainty and intraclass variability [42].
Convolutional (Conv) layers are the key parts of a CNN model, in which the input HSI patches or feature maps are convolved with convolution filter banks (also called convolution kernels) to produce new feature maps as follows:

$$X_l = X_{l-1} * W_l + B_l,$$

where $X_{l-1}$ and $X_l$ represent the input and output of the $l$th Conv layer, respectively, $W_l$ and $B_l$ refer to the weights and biases of the Conv layer, respectively, and $*$ stands for the convolution operator.
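As an illustration of the Conv-layer computation above, the following is a minimal, single-channel sketch of the per-channel operation $X_l = X_{l-1} * W_l + B_l$ (in practice CNN frameworks apply many kernels across many channels; the helper name `conv2d_valid` and the toy inputs are illustrative, not from the paper):

```python
def conv2d_valid(x, w, b=0.0):
    """'Valid' cross-correlation of a 2-D feature map x with a square kernel w,
    plus a scalar bias b, as performed per channel inside a Conv layer."""
    H, W = len(x), len(x[0])
    k = len(w)  # assume a square k x k kernel
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            s = sum(x[i + u][j + v] * w[u][v] for u in range(k) for v in range(k))
            row.append(s + b)
        out.append(row)
    return out

# A 3x3 averaging kernel applied to a 4x4 input map yields a 2x2 output map.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
w = [[1 / 9] * 3 for _ in range(3)]
y = conv2d_valid(x, w)
```

Note that, as discussed later for MPRN, padding is normally added so that the output size matches the input size; the unpadded ("valid") form is shown here only for brevity.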
To alleviate the gradient vanishing problem and speed up the training process in a deep CNN, a batch normalization (BN) layer [43] is placed behind each Conv layer to reduce the internal covariate shift by imposing a Gaussian distribution on each batch of feature maps, allowing a more independent learning process in each layer. The BN layer can be expressed as

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta,$$

where $\mu$ and $\sigma^2$ are the mean and variance of the current batch, $\gamma$ and $\beta$ are learnable parameter vectors, respectively, and $\epsilon$ is a parameter for numerical stability. Following the BN layer, an activation layer is added to improve the nonlinearity of the network. To effectively avoid the vanishing gradient problem, a rectified linear unit (ReLU) is used as the nonlinear activation function [15]. In addition, a pooling layer is periodically placed behind several Conv layers in the CNN to reduce the size of the feature maps, decreasing the amount of computation of the network. Then, fully connected (FC) layers are adopted to transform the size-reduced feature maps into a one-dimensional vector $z$, which is input to a softmax function to compute the class probability distribution for each pixel:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \quad i = 1, \dots, n,$$

where $n$ refers to the number of classes. Finally, the predicted category of each pixel is determined by the maximal probability, $\hat{y} = \arg\max_i \, p_i$.
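The BN transform and the softmax prediction described above can be sketched as follows (a simplified, per-feature illustration, not the paper's code; real BN operates on feature-map channels and tracks running statistics):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch to zero mean / unit variance,
    then scale by gamma and shift by beta."""
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in batch]

def softmax(z):
    """Class probability distribution p_i = exp(z_i) / sum_j exp(z_j)."""
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(z):
    """Predicted category = index of the maximal probability."""
    p = softmax(z)
    return p.index(max(p))
```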

ResNet
It is well known that deeper networks usually lead to better performance than shallower ones. However, training very deep networks is difficult due to the vanishing gradient problem, that is, gradient signals shrink slightly when passing through each layer during the backpropagation process and become close to zero in the shallower layers, hampering the convergence of the network from the beginning [44]. In addition, based on approximation theory, the hypothesis space "drifts" away from the true solution when adding more layers [45]. As a consequence, when the network depth is increased to an extreme, the classification accuracy first saturates and then degrades rapidly. Deep ResNets avoid this problem by employing identity skip-connections, which help the gradient flow back to the shallow layers without vanishing and facilitate the training of very deep networks of up to thousands of layers [33].
ResNet is constructed by stacking multiple fundamental structural elements called residual blocks. Figure 1 illustrates the architecture of a typical residual block, called the bottleneck residual block [46]. The residual block performs the following computation:

$$x_l = x_{l-1} + f_l(x_{l-1}),$$

where $x_{l-1}$ and $x_l$ are the input and output of the $l$th residual block, respectively, and $f_l(\cdot)$ denotes the residual function to be learned. As can be seen from Figure 1, $f_l(\cdot)$ consists of three convolutional (Conv) layers, each of which is preceded by a batch normalization (BN) layer [43] and a ReLU activation function, a design known as pre-activation [46]. The kernel sizes of the three Conv layers are 1 × 1, 3 × 3 and 1 × 1, respectively. Here, the first 1 × 1 layer is used to reduce the feature dimension and the second 1 × 1 layer to expand it back.
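The residual computation $x_l = x_{l-1} + f_l(x_{l-1})$ can be sketched conceptually as follows (the helper `residual_block` and the toy residual functions stand in for the bottleneck stack of BN, ReLU and Conv layers; they are illustrative, not the paper's implementation):

```python
def residual_block(x, f):
    """Add the learned residual f(x) to the identity skip-connection."""
    return [xi + fi for xi, fi in zip(x, f(x))]

# If the residual function outputs zeros (e.g. with a zero-initialized final
# layer), the block reduces to the identity mapping, so gradient can flow
# through the skip-connection unimpeded.
zero_residual = lambda x: [0.0] * len(x)
x = [1.0, -2.0, 3.0]

# A non-trivial residual function simply perturbs the identity path.
y = residual_block(x, lambda v: [0.1 * vi for vi in v])
```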

Figure 1. Architecture of the bottleneck residual block: a residual function composed of Conv layers, combined with an identity skip-connection.

Methodology
This section is structured as follows. First, the phenomenon that deep ResNet behaves like a large ensemble of relatively shallow networks is described. Then, the proposed MPRN model is introduced. Finally, the HSI classification framework based on MPRN is described.

Deep ResNets Behave Like Ensembles
To better understand ResNets, Veit et al. interpreted them in an unraveled view and conducted a detailed experimental study revealing that ResNets act like ensembles of relatively shallow networks [40]. Consider a ResNet consisting of 3 residual blocks from input $x_0$ to output $x_3$. Its conventional graphical representation is shown in Figure 2a. Based on Equation (5), the computation process can be expressed as

$$x_3 = x_2 + f_3(x_2) = \big[x_1 + f_2(x_1)\big] + f_3\big(x_1 + f_2(x_1)\big) = \big[x_0 + f_1(x_0) + f_2\big(x_0 + f_1(x_0)\big)\big] + f_3\Big(x_0 + f_1(x_0) + f_2\big(x_0 + f_1(x_0)\big)\Big).$$

The graphical view of Equation (6) is illustrated in Figure 2b. It can be clearly seen that there are many paths the data can take when flowing from the input to the output. Each path denotes a unique configuration that decides which residual functions to perform and which to skip. For a ResNet consisting of $m$ residual blocks, there are $2^m$ possible paths from the input to the output (a quantity also known as the multiplicity of the network), which is different from classical networks such as AlexNet [15] or VGGNet [47], where data flows along a single path from input to output. Moreover, in classical networks, each layer depends only on the output of its previous layer. In ResNets, by contrast, each residual function $f_l(\cdot)$ receives data with $2^{l-1}$ different distributions, generated from every possible path through the previous $l-1$ residual blocks. Note that the paths in a ResNet are of differing length and the distribution of all possible path lengths follows a binomial distribution. For instance, there is one path that passes through all residual functions and there are $m$ paths that pass through only one residual function. During training, shallow paths contribute more gradient than deep paths [40]. For example, for a 110-layer ResNet, only paths between 10 and 34 layers deep make significant contributions to the gradient updates. These paths are called effective paths and are relatively shallow compared with the network depth. Additionally, paths in a ResNet exhibit ensemble-like behavior, that is, they do not strongly depend on each other.
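The path counting above is easy to verify by brute force. The sketch below (an illustration written for this discussion, not from the paper) enumerates every performed/skipped configuration for a ResNet with m residual blocks and tallies path lengths, reproducing the $2^m$ multiplicity and the binomial length distribution for m = 3:

```python
from itertools import product

def path_length_histogram(m):
    """Unraveled view of a ResNet with m residual blocks: each path is a choice,
    per block, of performing (1) or skipping (0) its residual function.
    Returns a map: path length (functions performed) -> number of such paths."""
    hist = {}
    for config in product([0, 1], repeat=m):
        length = sum(config)
        hist[length] = hist.get(length, 0) + 1
    return hist

# For m = 3 blocks there are 2**3 = 8 paths, with lengths distributed
# binomially: one of length 0, three of length 1, three of length 2, one of length 3.
hist = path_length_histogram(3)
```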
In addition, the performance of a ResNet correlates smoothly with the number of effective paths. Moreover, deep paths are not actually needed, as they do not contribute any gradient during training; for example, a ResNet trained only on the effective paths can obtain performance comparable to that of a full ResNet [40].

MPRN
In previous methods, the depth of ResNets is usually increased to extract deeper discriminative features and thereby improve the classification performance [33,46]. However, every percentage of improvement demands a significant increase in the number of layers; for example, a 164-layer ResNet achieved a test error rate of 5.46% and a 1001-layer ResNet 4.92% on the CIFAR-10 image classification data set, whereas the latter model has six times the computational complexity of the former [33]. One possible reason for this problem is that increasing the depth cannot improve the network performance in an efficient manner, since deep paths do not contribute any gradient during training. In addition, wide ResNets with 50 times fewer layers can outperform the original ResNet, indicating that the power of ResNet arises from the identity skip-connections rather than from an extreme increase in network depth [48].
To further improve the classification performance, in this work a multipath ResNet (MPRN) is proposed, in which each residual block consists of multiple residual functions, as shown in Figure 3b. By introducing multiple residual functions into Equation (5), the output of the $l$th block in MPRN can be computed as

$$x_l = x_{l-1} + \sum_{i=1}^{n} f_l^i(x_{l-1}),$$

where $f_l^i(\cdot)$ denotes the $i$th residual function in the $l$th residual block. In MPRN, all the residual functions have the same architecture as the residual function in the bottleneck residual block. Consider an MPRN with 3 residual blocks, each of which has 2 residual functions. For the $l$th block, there are 4 possible paths for gradient flow: (1) performing $f_l^1$ and skipping $f_l^2$; (2) performing $f_l^2$ and skipping $f_l^1$; (3) performing both $f_l^1$ and $f_l^2$; and (4) skipping both $f_l^1$ and $f_l^2$. Therefore, the multiplicity of each block is 4 and the multiplicity of the whole network is $4^3$. Given an MPRN with $m$ residual blocks and $n$ residual functions in each block, the multiplicity of each block is $2^n$, since every residual function can be either performed or skipped, and the total multiplicity of MPRN is $2^{mn}$. Now consider a baseline ResNet with $m$ residual blocks and let $c$ be a constant integer. To improve its classification performance, a deeper network can be constructed by increasing the number of residual blocks to $cm$. Alternatively, a wider network, that is, MPRN, can be constructed by increasing the number of residual functions in each block to $c$ while keeping the number of residual blocks at $m$. Note that the two improved networks have the same number of residual functions, that is, $cm$, and thus the same number of parameters. In addition, the multiplicity of both networks is $2^{cm}$. However, compared with the deeper ResNet, MPRN has more short-to-medium paths that can contribute significant gradient during training. This not only enhances the efficiency of parameter utilization but also improves performance.
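The multiplicity argument above can be checked by enumeration. The following sketch (illustrative helper, not from the paper) enumerates the $2^n$ per-block configurations and confirms the worked example of $4^3 = 64$ paths for m = 3, n = 2:

```python
from itertools import product

def multiplicity(m, n):
    """Multiplicity of an MPRN with m residual blocks and n residual functions
    per block: each function can be independently performed or skipped, so
    each block has 2**n configurations and the whole network has (2**n)**m."""
    per_block = len(list(product([0, 1], repeat=n)))  # enumerate 2**n configs
    return per_block ** m                             # blocks are independent

# The text's example: 3 blocks with 2 residual functions each -> 4**3 paths.
```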
Let $[a, b]$ be the range of effective path lengths (also known as the effective range) of the baseline ResNet. Due to the exponential reduction in gradient magnitude during backpropagation, the effective range of a deeper ResNet is shifted and/or scaled toward that of a shallower network [40]. Therefore, the effective range of the $c$ times deeper ResNet does not increase linearly, that is, it is not $[ca, cb]$; in fact, the upper bound is lower than $cb$, which means that every percentage of improvement in ResNet requires a significant increase in the number of layers. For MPRN, the distribution of all possible path lengths follows a multinomial distribution [41]. When the number of residual functions in each block is increased, the number of paths of each length increases and the effective range of MPRN can grow linearly. Therefore, of two networks with the same multiplicity, MPRN will reach better performance than ResNet.
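The contrast between the two equal-multiplicity networks can be made concrete by counting paths per depth. In the sketch below (an illustration under the stated assumption that parallel functions within one MPRN block add only a single sequential stage to a path, so a block contributes depth 1 whenever at least one of its functions is performed), a 2-times-deeper ResNet (6 blocks) is compared with an MPRN of 3 blocks and 2 functions each, both with $2^6 = 64$ paths:

```python
from itertools import product

def resnet_depths(blocks):
    """Path-depth histogram for a plain ResNet: one function per block,
    depth = number of blocks whose function is performed."""
    hist = {}
    for cfg in product([0, 1], repeat=blocks):
        d = sum(cfg)
        hist[d] = hist.get(d, 0) + 1
    return hist

def mprn_depths(m, n):
    """Path-depth histogram for an MPRN with m blocks of n parallel functions:
    a block adds one stage of depth iff any of its functions is performed."""
    hist = {}
    for cfg in product(range(2 ** n), repeat=m):  # per-block subset of functions
        d = sum(1 for c in cfg if c != 0)
        hist[d] = hist.get(d, 0) + 1
    return hist

deep = resnet_depths(6)   # c = 2 times deeper than an m = 3 baseline
wide = mprn_depths(3, 2)  # same multiplicity: 64 paths, but max depth 3 instead of 6
```

The enumeration shows that MPRN concentrates its paths at short-to-medium depths, consistent with the effective-path argument above.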

MPRN for HSI Classification
Taking the Indian Pines data set as an example, Figure 4 shows the framework of the proposed MPRN for HSI classification. From the framework, one can see that MPRN takes an image patch as input, with the patch size set to 11 × 11 × 200. In this way, both the spectral and the spatial information embedded in the HSI can be utilized for classification. First, a 1 × 1 Conv layer is employed to compress the input and extract features for the rest of the network. Through several multipath residual blocks, deep spectral-spatial features are then extracted. Next, the features extracted by the last block are transformed into a 1-D vector using a global average pooling layer. The vector is fed to a fully connected (FC) layer followed by a softmax function. Finally, the predicted label of the center pixel is determined by the maximal probability. The detailed structure of MPRN is summarized in Table 1, where K refers to the size of the convolution kernel and N_In and N_Out denote the number of input and output channels, respectively. In addition, the convolution stride is set to one and the padding to zero and one for the 1 × 1 and 3 × 3 Conv layers, respectively, in order to keep the size of the output feature maps unchanged during convolution. The initialization method in Reference [49] is employed to initialize the network parameters and the Adam algorithm [50] is used to optimize them by minimizing the cross-entropy loss. Note that the proposed network is trained in an end-to-end manner and hence the parameters and effective paths can be learned automatically.
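The data flow of this framework can be summarized as a shape-only sketch (the compressed channel count of 64 is an illustrative assumption, not a value given in the text; the actual widths are listed in Table 1):

```python
def pipeline_shapes(h=11, w=11, c=200, compressed=64, n_classes=16):
    """Trace the tensor shapes through the MPRN classification pipeline
    for an Indian Pines patch (11 x 11 x 200 spectral-spatial cube)."""
    shapes = [("input patch", (h, w, c))]
    shapes.append(("1x1 Conv (compress)", (h, w, compressed)))
    # Stride-1, padded convolutions keep the spatial size unchanged.
    shapes.append(("multipath residual blocks", (h, w, compressed)))
    shapes.append(("global average pooling", (compressed,)))  # 1-D vector
    shapes.append(("FC + softmax", (n_classes,)))             # class probabilities
    return shapes

for name, shape in pipeline_shapes():
    print(f"{name:28s} -> {shape}")
```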

Hyperspectral Data Sets
To demonstrate the effectiveness of our proposed method, now we consider three real hyperspectral data sets including Indian Pines, Houston University and Kennedy Space Center (KSC) data sets. These data sets are openly accessible online [51,52]. The number of samples per class of the three data sets are summarized in Table 2. The Indian Pines data set was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the agricultural Indian Pines test area with a spatial resolution of 20 m. This HSI consists of 145 × 145 pixels with 224 spectral bands ranging from 400 to 2500 nm. After removing 20 water absorption bands and four null bands, 200 channels were used for the classification. Its ground reference map covered 16 classes of interest.
The Houston University data set was captured by the Compact Airborne Spectrographic Imager (CASI) sensor over the Houston University campus and its neighboring region with a spatial resolution of 2.5 m. It was used in the 2013 GRSS Data Fusion Contest. The image consists of 349 × 1905 pixels with 144 spectral bands ranging from 380 to 1050 nm. The ground reference map of this data set includes 15 classes of interest.
The KSC data set was captured by the AVIRIS sensor over KSC, Florida. This HSI is composed of 512 × 614 pixels with a spatial resolution of 18 m. After removing noisy bands, 176 spectral bands were used for the classification. Its ground reference map covered 13 classes of interest.

Experimental Setup
For each data set, the labeled samples were split into training, validation and testing sets. The training set was used to tune the model parameters. The validation set was utilized to evaluate the interim models created during training, and the model with the highest validation accuracy was preserved. The testing set was employed to assess the classification performance of the saved model. For the Indian Pines and Houston University data sets, 10%, 10% and 80% of the labeled data per class were randomly selected to form the training, validation and testing sets, respectively. For the KSC data set, the split ratio was 2%, 2% and 96%, respectively. Note that each data set was standardized to zero mean and unit variance.
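The per-class (stratified) split described above can be sketched as follows (the helper name and the toy two-class label index are illustrative, not from the paper):

```python
import random

def stratified_split(indices_per_class, train_frac, val_frac, seed=0):
    """Randomly split sample indices class by class, so each class contributes
    the same fractions to the training, validation and testing sets."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in indices_per_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_train = round(train_frac * len(shuffled))
        n_val = round(val_frac * len(shuffled))
        train += shuffled[:n_train]
        val += shuffled[n_train:n_train + n_val]
        test += shuffled[n_train + n_val:]
    return train, val, test

# Toy example: two classes with 100 and 50 labeled samples, 10%/10%/80% split.
labels = {0: list(range(100)), 1: list(range(100, 150))}
tr, va, te = stratified_split(labels, 0.1, 0.1)
```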
To assess the classification performance of the proposed method, the overall accuracy (OA), the average accuracy (AA), the Kappa coefficient, the F1-score and the Precision were adopted as evaluation metrics [53]. To avoid biased estimation, the reported metrics were averaged over five repeated experiments with randomly selected training samples.
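For reference, the three class-level summary metrics can be computed from a confusion matrix as sketched below (the function name and the toy two-class matrix are illustrative; rows are reference classes and columns are predictions):

```python
def oa_aa_kappa(C):
    """Overall accuracy, average accuracy (mean per-class recall) and Kappa
    coefficient from a square confusion matrix C."""
    k = len(C)
    total = sum(sum(row) for row in C)
    diag = sum(C[i][i] for i in range(k))
    oa = diag / total
    aa = sum(C[i][i] / sum(C[i]) for i in range(k)) / k
    col = [sum(C[i][j] for i in range(k)) for j in range(k)]
    pe = sum(sum(C[i]) * col[i] for i in range(k)) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

C = [[45, 5],
     [10, 40]]
oa, aa, kappa = oa_aa_kappa(C)
```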
The proposed network was trained for 100 epochs with an L2 weight decay penalty of 0.0001. The batch size was set to 100 and a cosine-shaped learning rate was employed, which starts from 0.001 and gradually reduces to 0 [54]. In addition, our implementation was based on the PyTorch framework [55] and run on a PC with an AMD Ryzen 7 2700X CPU, 16 GB of RAM and an NVIDIA RTX 2080 GPU.
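A common form of the cosine-shaped schedule described above is the half-cosine decay below; the exact parameterization used in the paper is not given, so this particular formula is an assumption:

```python
import math

def cosine_lr(epoch, total_epochs=100, lr0=0.001):
    """Half-cosine learning-rate decay: starts at lr0 and reaches 0 at the
    final epoch (an assumed, standard form of the cosine schedule)."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * epoch / total_epochs))
```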

Parameters Discussion
It is well known that increasing the network depth can enhance the model representation capability and lead to better classification performance. In this section, we show that depth is not the only factor in achieving high classification accuracy; in addition, increasing the network width can obtain better performance than increasing the network depth. In the following experiments, the network depth (i.e., m) is represented by the number of residual blocks and the network width (i.e., n) is denoted by the number of residual functions in each block. Note that conventional ResNets have a single residual function in each block, that is, n = 1.
First, the network depth m and width n of MPRN are analyzed together. In our experiments, m ranges from 1 to 5 with step 1 and n from 1 to 21 with step 2. Extremely shallow networks, compared to deep ones, tend to have difficulty capturing the higher-level features that are beneficial for deep semantic feature extraction; however, an overly deep structure incurs a long running time. Therefore, a proper network depth should be set to balance classification accuracy against running time. As can be observed from the left column of Figure 5, when m and n are larger than 3 and 7, respectively, MPRNs achieve relatively stable high accuracy on all data sets, demonstrating the robustness of MPRN to different m and n values. Meanwhile, as the m and n values rise, the number of parameters of the corresponding models, and thus the computing time, increases rapidly, as shown in the right column of Figure 5. Therefore, to balance overall performance, we set m to 3 for all data sets.

To clearly show the impact of the network width n on the classification performance of MPRN, we fix m = 3 and examine the effect of n with values ranging from 1 to 20 with step 1. In addition, we further give a contrastive evaluation of our method against ResNets (n = 1) of different network depths. To make a fair comparison, for ResNet, m ranges from 3 to 60 with step 3. In this way, each pair of MPRN and ResNet has the same number of parameters; for example, MPRN with m = 3 and n = 9 has the same number of parameters as ResNet with m = 27. In addition, when m = 3 and n = 1, MPRN and ResNet have the same network architecture. Figure 6 shows the effect of the network width n on the performance (in terms of AA) of the proposed MPRN method over the three data sets, while Figure 7 shows the impact of the network depth m of the ResNet method. From Figures 6 and 7, one can see that increasing either dimension of the network, width or depth, improves the classification accuracy.
Clearly, when the network depth goes beyond a certain level, increasing the depth becomes less effective. In contrast, increasing the width can further improve the classification performance; similar trends are observed on the Houston University and KSC data sets, which further validates the effectiveness of our method. This is because the paths in MPRN are relatively shallow and therefore make significant contributions to the gradient updates during training. For ResNet, increasing the depth not only introduces more deep paths that do not contribute significant gradient during training but also results in overfitting (see Figure 7b,c). In the following experiments, the optimal architectures of ResNet and MPRN are employed for comparison (see Table 3).

Comparison Results of Different Methods
The proposed method was compared with several state-of-the-art classification methods available in the literature: (1) 3-D CNN [32]; (2) fully convolutional layer fusion network (FCLFN) [56]; (3) deep feature fusion network (DFFN) [35]; (4) DenseNet [57]; and (5) ResNet [46]. More specifically, 3-D CNN, FCLFN, DFFN, DenseNet and ResNet, together with the proposed MPRN, are spectral-spatial classifiers. 3-D CNN utilizes 3-D convolutional kernels to simultaneously extract the spatial and spectral features from HSIs. FCLFN combines features extracted from each Conv layer for classification. DFFN fuses multiple-layer features from a deep ResNet for classification. DenseNet employs shortcut connections between layers, in which the outputs of the previous layers are concatenated as inputs to all subsequent layers, and hence can combine various spectral-spatial features across layers for HSI classification. ResNet is constructed by stacking multiple conventional residual blocks (with a single residual function in each block). In addition, some parameters of the compared methods were set in advance. For 3-D CNN, FCLFN, DFFN and DenseNet, the parameters were set according to the default values in the corresponding references. For ResNet and MPRN, the optimal architectures were used according to Table 3. In addition, they were trained under exactly the same experimental settings, for example, using the same optimizer and L2 weight decay penalty.
The first experiment was conducted on the Indian Pines data set, with 10% of the labeled samples in each class randomly selected for training. The quantitative classification results, that is, the classification accuracy of each class and the OA, AA, Kappa, F1-score and Precision values obtained by the different approaches, are reported in Table 4. It can be seen that MPRN achieves the best results in terms of the five overall metrics, that is, OA, AA, Kappa, F1-score and Precision. From Table 4, one can also observe that MPRN improves the performance on 11 classes out of 16 compared with ResNet, indicating that MPRN is more effective than ResNet. Moreover, the false-color composite image, the ground reference map and the classification maps obtained by the six considered methods in a single experiment are shown in Figure 8.
Table 4. Classification accuracies (%) obtained by different methods on the Indian Pines data set. The improvement of MPRN over ResNet is also provided. The best results are highlighted in bold font. In addition, the positive and negative improvements are marked in blue and red, respectively.

The second and third experiments were conducted on the Houston University and KSC data sets, respectively. For the Houston University data set, 10% of the labeled samples in each class were randomly selected for training. For the KSC data set, 2% of the labeled samples per class were randomly chosen for training. Tables 5 and 6, respectively, show the quantitative classification results obtained by the different approaches on the two data sets. It can be seen that, compared with ResNet, the proposed MPRN improves the OA value from 98.53% to 98.88% for the Houston University data set and from 95.24% to 96.00% for the KSC data set. In addition, the proposed method obtains the best classification performance in terms of the five overall metrics (OA, AA, Kappa, F1-score and Precision) among all six methods on the two data sets, which demonstrates the effectiveness of the proposed method. The corresponding classification maps are illustrated in Figures 9 and 10, respectively. As shown in Table 7, the standardized McNemar's test [58] was performed to demonstrate the statistical significance of the accuracy improvement of the proposed MPRN. When the Z value of McNemar's test is larger than 1.96 or 2.58, the difference in accuracy between classifiers 1 and 2 is statistically significant at the 95% or 99% confidence level, respectively. A Z value larger than 0 means that classifier 1 is more accurate than classifier 2, and vice versa. In this experiment, the proposed MPRN is compared with the five other methods, that is, 3-D CNN, FCLFN, DFFN, DenseNet and ResNet. From Table 7, one can see that all the Z values are larger than 2.58, demonstrating that the proposed MPRN significantly outperforms the compared methods. Finally, the total number of parameters and the computing time of the six considered methods on the three data sets are reported in Tables 8 and 9, respectively.
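The standardized McNemar statistic used in Table 7 can be computed as sketched below (variable names are illustrative: f12 counts test samples correctly classified by classifier 1 but misclassified by classifier 2, and f21 the reverse):

```python
import math

def mcnemar_z(f12, f21):
    """Standardized McNemar's test statistic for comparing two classifiers
    on the same test set; |Z| > 1.96 (2.58) indicates significance at the
    95% (99%) confidence level, and Z > 0 favors classifier 1."""
    return (f12 - f21) / math.sqrt(f12 + f21)
```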
From Table 9, we can see that FCLFN achieves the lowest training times on the three data sets. In addition, MPRN spends less time than ResNet on the Indian Pines data set because it has fewer parameters. For the Houston University and KSC data sets, the proposed method is the most time-consuming, which is attributable to its large number of Conv layers.

Effect of Input Spatial Patch Size
In this experiment, we compare our MPRN method with the spatial-spectral ResNet (SSRN) of Reference [37]. In this case, the Indian Pines and KSC data sets are considered. Following Reference [37], 20% of the available labeled samples are randomly selected to form the training set. In addition, input patches with four different spatial sizes (5 × 5, 7 × 7, 9 × 9 and 11 × 11) are considered, because a patch that is too large may contain pixels from multiple classes that distract from the target pixel; moreover, it degrades intersample diversity, increasing the risk of overfitting as well as the curse of dimensionality. Table 10 shows the overall accuracies obtained in this experiment. From Table 10, one can see that MPRN achieves remarkable improvements in terms of OA regardless of the size of the considered image patches. For example, the proposed MPRN reaches an OA 6.58 percentage points higher than that of SSRN with the same amount of spatial information (5 × 5 patch size) on the Indian Pines data set. Furthermore, all the OAs obtained by MPRN with different patch sizes on the two data sets are higher than 99%, indicating the robustness of our MPRN method to the input patch size.

Effect of Limited Training Samples
Since manual labeling of hyperspectral data is expensive and time-consuming, labeled samples are usually limited in practice. Therefore, it is necessary to assess the performance of the proposed method when limited training data are available. Figure 11 illustrates the overall classification accuracies achieved by the different methods on the three data sets using limited numbers of training samples (ranging from 0.1% to 0.5% per class, with a step of 0.1%). As can be seen in Figure 11, for each data set, the proposed MPRN consistently performs best among all methods under all training-sample percentages, demonstrating the effectiveness and robustness of the proposed approach.
Figure 11. Overall classification accuracies (in %) obtained by 3-D CNN [32], FCLFN [56], DFFN [35], DenseNet [57], ResNet [46] and MPRN when considering different percentages of training samples on the (a) Indian Pines, (b) Houston University and (c) KSC data sets.
In the face of limited training data, deep networks with a large number of parameters tend to overfit the training set and thus obtain poor accuracy on the testing set. However, MPRN can be interpreted as an ensemble of exponentially many relatively shallow networks, each of which has a small number of parameters to be optimized and thus naturally avoids the overfitting problem. Therefore, the proposed method is able to provide superior performance when facing limited training data.

Conclusions
In this work, a novel network architecture named MPRN is proposed for spectral-spatial HSI classification. The proposed model employs multiple residual functions in the residual blocks in order to make the ResNet wider, rather than deeper. As a result, more short-to-medium paths are learned, which can effectively contribute gradient during training. Our analysis shows that ResNets integrating multiple residual functions in each residual block can achieve better performance than much deeper ones; moreover, our proposed MPRN, with fewer training parameters, not only achieves comparable or even better performance but also requires less computation than ResNet. Experimental results on three real HSI data sets demonstrate that the proposed method performs better than other state-of-the-art methods in terms of both visual quality and quantitative metrics, especially in the face of a limited number of training samples.
Note that designing a proper deep learning architecture is important for accurate HSI classification. However, it is a time-consuming and error-prone process. In our future work, neural architecture search methods [59] will be considered to engineer neural architectures in an automatic manner.