Mixed Structure with 3D Multi-Shortcut-Link Networks for Hyperspectral Image Classification

Abstract: A hyperspectral image classification method based on a mixed structure with a 3D multi-shortcut-link network (MSLN) is proposed to address the few labeled samples, excess noise, and heterogeneous homogeneity of features in hyperspectral images. First, the spatial–spectral joint features of hyperspectral cube data were extracted through 3D convolution operations; then, a deep network was constructed, and the 3D MSLN mixed structure was used to fuse shallow representational features and deep abstract features, while a hybrid activation function was utilized to ensure the integrity of nonlinear data. Finally, global self-adaptive average pooling and an L-softmax classifier were introduced to implement the terrain classification of hyperspectral images. The mixed structure proposed in this study can extract multi-channel features with a vast receptive field and reduce the continuous decay of shallow features while improving the utilization of representational features and enhancing the expressiveness of the deep network. The use of the dropout mechanism and the L-softmax classifier endowed the learned features with better generalization, intraclass cohesion, and interclass separation. Experimental comparative analysis on six datasets showed that this method, compared with existing deep-learning-based hyperspectral image classification methods, satisfactorily addresses the degeneration of deep networks and the issue of "the same object with distinct spectra, and distinct objects with the same spectrum." It also effectively improves the terrain classification accuracy of hyperspectral images, as evinced by the overall classification accuracies across all terrain classes in the six datasets: 97.698%, 98.851%, 99.54%, 97.961%, 97.698%, and 99.138%.


Introduction
Hyperspectral images (HSIs) contain rich spatial and spectral information [1,2] and are widely applied in precision agriculture [3], urban planning [4], national defense construction [5], and mineral exploitation [6], among other fields. The technology has also been very successful in allowing active users to participate in collecting, updating, and sharing the massive amounts of data that reflect human activities and social attributes [7][8][9][10].
The terrain classification of hyperspectral images is a fundamental problem for various applications; it aims to assign a label with unique class attributes to each pixel in the image based on the sample features of the HSI. However, an HSI is high-dimensional with few labeled samples, its wavebands are highly correlated, and terrain objects with heterogeneous structures may appear homogeneous; together, these present huge challenges for the terrain classification of HSIs.

Residual Networks (ResNets)
With the deepening of network layers, a 3D CNN is prone to gradient dispersion or gradient explosion. Proper use of normalized initialization and intermediate normalization layers can deepen the network, but the training set accuracy will saturate or even decrease. A ResNet [56] alleviates the gradient-disappearance problem that occurs in deep neural networks by adding skip connections between hidden layers.
A ResNet is built on a hypothesis about identity mapping: assuming a network n with K layers is the currently optimal network, the added layers of a deeper network built on it should be the identity mapping of the outputs at the Kth layer of network n, so the deeper network should not underperform relative to the shallower network. If the input and output dimensions of the network's nonlinear units are consistent, each unit can be expressed by the general formula

y = F(x, {W_i}) + x

Here, x and y are the input and output vectors of the layers considered, respectively, and F(·) is the residual function. In Figure 2, there are two layers, i.e., F = W_2 σ(W_1 x), in which σ denotes the ReLU and the biases are omitted to simplify notation. The process of F + x is a shortcut connection followed by element-wise addition.
The residual learning structure (Figure 2) functions by adding the output(s) from the previous layer(s) to the output computed at the current layer and feeding the result of the summation into the activation function as the output of the current layer, which addresses the degeneration of neural networks satisfactorily. A ResNet converges faster under the precondition of the same number of layers.
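The two-layer residual unit of Figure 2, extended to the 3D convolutions used later in this paper, can be sketched in PyTorch as follows (a minimal stand-in: the channel width and the absence of batch normalization are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ResidualUnit3D(nn.Module):
    """Basic 3D residual unit: y = F(x) + x, then activation."""
    def __init__(self, channels: int):
        super().__init__()
        # Two 3x3x3 convolutions form the residual function F;
        # padding=1 keeps the spatial-spectral shape so F(x) and x can be added.
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        f = self.conv2(self.act(self.conv1(x)))  # F = W2 sigma(W1 x)
        return self.act(f + x)                   # shortcut + element-wise addition

block = ResidualUnit3D(16)
x = torch.randn(2, 16, 8, 5, 5)  # (batch, channels, bands, H, W)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 8, 5, 5]) -- shape preserved, as required
```

Because the identity shortcut requires F(x) and x to have matching dimensions, the padding and channel counts are chosen so the addition is well defined.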


Activation Function
Since the gradient of the rectified linear unit (ReLU) [71] activation function (Figure 3) is always 0 when the input is negative, ReLU neurons will not be activated after the parameters are updated, leading to the "death" of some neurons during training.
The parametric rectified linear unit (PReLU) [72] addresses the neuronal death caused by the ReLU function; its parameters are learned through backpropagation. The scaled exponential linear unit (SELU) [73] activation function demonstrates high robustness against noise and drives the mean activation of neurons toward 0 so that the inputs become fixedly distributed after a certain number of layers.
Therefore, the algorithm in this study used PReLU as the activation function after the shallow multi-shortcut-link network convolution operations and SELU as the activation function in the deep residual structure of the block, which makes full use of the hyperspectral 3D cube data.
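The three activation functions above can be compared directly with scalar reference implementations (the PReLU slope a = 0.25 is an illustrative fixed value here; in the network it is a learned parameter):

```python
import math

def relu(x):
    return max(0.0, x)

def prelu(x, a=0.25):
    # The negative slope a is learned via backpropagation in a real network.
    return x if x > 0 else a * x

# SELU constants from the self-normalizing-network formulation
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772

def selu(x):
    return SELU_LAMBDA * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

print(relu(-2.0))   # 0.0  -> zero gradient: the "dying ReLU" problem
print(prelu(-2.0))  # -0.5 -> small negative slope keeps the neuron alive
print(selu(-2.0))   # negative saturating branch pushes the mean toward 0
```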


Loss Function
In deep learning, the softmax function (Equation (2)) is usually used as a classifier, mapping the outputs of multiple neurons into the interval (0, 1):

softmax(f)_j = e^{f_j} / Σ_{n=1}^{N} e^{f_n}    (2)

Define the ith input feature x_i with label y_i; f_j denotes the jth element (j ∈ [1, N], where N is the number of classes) of the vector of class scores f, and M is the number of training data. In Equation (2), f is the vector of activations of a fully connected layer W, and f_{y_i} = W_{y_i}^T x_i, in which W_{y_i} is the y_i-th column of W. Taking a dichotomy as an example, when ||W_1|| ||x|| cos(θ_1) > ||W_2|| ||x|| cos(θ_2) (where θ_j is the angle between W_j and x), the correct classification result of x is obtained. However, the learning ability of softmax is relatively weak for strongly discriminative features. This study therefore adopted large-softmax (L-softmax) as the loss function to improve the classification accuracy on the HSI datasets.
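For concreteness, the softmax mapping of Equation (2) can be computed with a plain-Python sketch:

```python
import math

def softmax(scores):
    # Subtracting the max is a standard numerical-stability trick;
    # it leaves the result unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax([2.0, 1.0, 0.1])
print([round(v, 3) for v in p])  # probabilities in (0, 1), largest score first
print(sum(p))                    # sums to 1 up to floating point
```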
The L-softmax loss function can be defined by the following expression:

L_i = −log( e^{||W_{y_i}|| ||x_i|| ϕ(θ_{y_i})} / ( e^{||W_{y_i}|| ||x_i|| ϕ(θ_{y_i})} + Σ_{j≠y_i} e^{||W_j|| ||x_i|| cos(θ_j)} ) )

where ϕ(θ) can be expressed as:

ϕ(θ) = (−1)^k cos(mθ) − 2k,  θ ∈ [kπ/m, (k+1)π/m],  k ∈ [0, m−1]

Experiments demonstrated that the features acquired by L-softmax have more distinctive distinguishability [74,75] and achieve better results than softmax in both classification and verification tasks. In a ResNet, the authors develop an architecture that stacks building blocks with the same shortcut connecting pattern, called "residual units (ResUs)." The original ResU can be computed using these formulas:

y_l = h(x_l) + F(x_l, W_l)
x_{l+1} = f(y_l)

Here, x_l is the input of the lth ResU; W_l = {W_{l,k} | 1 ≤ k ≤ K} is the set of weights and biases of the lth ResU, and K is the number of layers in the ResU. F denotes the residual function, e.g., a stack of two 3 × 3 convolutional layers in a ResNet; this study expanded the convolutional dimension to 3 × 3 × 3. The function f is the operation applied after the element-wise addition, here the ReLU activation function. The function h is set as the identity mapping: h(x_l) = x_l.
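The angular function ϕ(θ) is the only part of L-softmax that differs from softmax, so it is worth inspecting on its own (a scalar sketch; with m = 1 it reduces to cos θ, i.e., plain softmax, while larger m enforces a larger inter-class angular margin):

```python
import math

def phi(theta, m):
    """Piecewise angular function of L-softmax:
    phi(theta) = (-1)^k * cos(m*theta) - 2k for theta in [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta // (math.pi / m)), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

# With m = 1 there is no margin: phi(theta) = cos(theta).
print(phi(0.0, 1))  # 1.0
# With m = 4, phi decreases much faster than cos, so the target class
# must win by a larger angular margin.
print(phi(0.0, 4), phi(math.pi / 8, 4), phi(math.pi / 2, 4))
```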
This study mainly focused on creating a multi-shortcut-link path for propagating tensor information, not only within a ResU but also through the entire network model. As mentioned above, we denote s(x_0) as a shortcut link of the original ResU, which suggests y_0 = h(x_0) + F(x_0, W_0). If f is also an identity mapping, then x_{l+1} ≡ y_l; putting s(x_0) and Equation (2) into Equation (1) and adding a multi-shortcut link S gave

x_{l+1} = x_l + F(x_l, W_l) + S(x_0)

After recursion, we obtained

x_L = x_l + Σ_{i=l}^{L−1} F(x_i, W_i) + (L − l) S(x_0)    (9)

Equation (9) indicates that for any deeper unit L and shallower unit l, the feature x_L can be represented as the feature x_l plus a residual operation, which between any L and l presents as a residual function in an MSLN.
Denoting the loss function as ξ, the chain rule of backpropagation gives

∂ξ/∂x_l = (∂ξ/∂x_L)(∂x_L/∂x_l) = (∂ξ/∂x_L)(1 + ∂/∂x_l Σ_{i=l}^{L−1} F(x_i, W_i))    (10)

Equation (10) exhibits that the gradient ∂ξ/∂x_l can be decomposed into two additive terms: one, ∂ξ/∂x_L, propagates information directly without passing through any weight layers, while the other passes through the weight layers. The additive term ∂ξ/∂x_L guarantees that information is propagated back directly to any shallower unit l. Put another way, because the term ∂/∂x_l Σ_{i=l}^{L−1} F cannot always be −1 over a mini-batch, the gradient ∂ξ/∂x_l is unlikely to be canceled out. This indicates that even when the weights are extremely small, the gradient of a layer does not vanish.
This derivation reveals that if we add a shortcut link before a residual block and both h(x_l) and f(y_l) are identity mappings, the feature map signal can be propagated both forward and backward. This indicates that fusing shallow features and deep features via multi-shortcut-link networks can obtain strongly discriminative features, which was also shown in the experiments in Section 4.2.
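The forward recursion above can be checked numerically with a toy sketch (the residual function here is a stand-in scalar multiplication, not a real convolution stack; `msln_forward` is an illustrative helper):

```python
def F(x, w):
    # Stand-in residual function; in the MSLN this is a stack of 3x3x3 convolutions.
    return w * x

def msln_forward(x0, weights, s):
    """Iterate x_{l+1} = x_l + F(x_l, W_l) + S(x_0), with S(x_0) = s * x0."""
    trace = [x0]
    for w in weights:
        trace.append(trace[-1] + F(trace[-1], w) + s * x0)
    return trace

weights = [0.1, -0.2, 0.05]
x0, s = 2.0, 0.5
trace = msln_forward(x0, weights, s)

# Unrolled recursion: x_L = x_l + sum_{i=l}^{L-1} F(x_i, W_i) + (L - l) * S(x_0)
l, L = 0, len(weights)
x_L = trace[l] + sum(F(trace[i], weights[i]) for i in range(l, L)) + (L - l) * s * x0
print(abs(x_L - trace[L]) < 1e-9)  # True: step-by-step and unrolled forms agree
```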

Structure of an MSLN
Based on the network structure of ResNet-18 (Table 1), this study added a convolution layer preceding each of the 2nd, 6th, 10th, and 14th layers; its output is spliced in depth with the output of the previous layer and serves as the input of the next convolution layer. Meanwhile, the 3D convolution operations on the original HSI cube block (H × W × C) occur in the 1st, 2nd, 7th, 12th, and 17th layers of the MSLN, which implies that the input shape of all five of these layers is (batch size, input data, channels of HSI, kernel size, stride); the numbers of convolution kernels of these five layers, set according to Table 1, were 16, 16, 16, 32, and 64, respectively (Figure 4).
In order to make the MSLN more convenient and minimize conflicts when splicing channels, the numbers of convolution kernels in the blocks were set to 16, 32, 64, and 128, respectively. Compared with the numbers of channels in the ResNet (64, 128, 256, 512), this reduced the total parameter size from 130.20 ± 0.65 MB (ResNet) to 10.87 ± 0.27 MB and greatly improved the convergence speed.
As shown in Figure 5, the outputs of conv i_1 (i = 1, 2, 3, 4) were shallow features; these feature maps had a higher resolution, which could retain more feature information and better describe the overall characteristics of the data. As the depth of the network increased, the deep features became more and more abstract. Fusing shallow features and deep features via multi-shortcut-link networks could reduce the loss of shallow features and the correlation decay of gradients, boost the use ratio of features, and enhance the network's expressiveness.
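The depth splicing used to fuse shallow and deep features is an ordinary concatenation along the channel axis; a NumPy sketch with illustrative shapes:

```python
import numpy as np

# Shapes follow (batch, channels, bands, H, W); values are random stand-ins.
shallow = np.random.rand(2, 16, 8, 5, 5)  # conv i_1 output (shallow features)
deep = np.random.rand(2, 32, 8, 5, 5)     # residual-block output (deep features)

# "Splicing in depth" = concatenation along the channel axis, so both
# feature maps feed the next convolution layer together.
fused = np.concatenate([shallow, deep], axis=1)
print(fused.shape)  # (2, 48, 8, 5, 5)
```

Because only the channel axis grows, the spatial-spectral dimensions of the two feature maps must already match, which is why the shortcut convolutions use shape-preserving padding.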
Therefore, splicing the shallow feature conv i_1 (i = 1, 2, 3, 4) in depth with the output of each residual block (conv j_1 (j = 1, 2, 3, 4)) to implement the multi-shortcut-link fusion of features across different network layers could better alleviate gradient dispersion (explosion) and even network degeneration.
Figure 4 displays the overall process of the HSI classification framework of the MSLN. It can be seen from this figure that the multi-shortcut-link structure is bridged to four residual blocks (Figure 6) a total of four times (Figure 7); the last output layer of the third residual block is spliced with conv4_1 as the input tensor for the first layer of the fourth residual block; after the fourth residual block is processed, global self-adaptive average pooling downsampling expands the output tensor into a one-dimensional vector, and the fully connected layer maps the learned distributed features into the space of sample labels; finally, the L-softmax loss function is used for classification.
In this study, all the convolution kernels adopted the uniform size 3 × 3 × 3, which could both reduce the computational load and enlarge the receptive field of the convolution operation [56]. Figure 8 shows the visualization of the MSLN structure and training process using the Botswana dataset, which covers 145 wavebands. The gray elements in the graph indicate that a node is a backward operation, and the light blue elements indicate that a node is an input/output tensor. At the top of Figure 8 is a tensor of shape (64, 1, 145, 3, 3), meaning the batch size processed in the model was 64 and the input 3D tensor of the MSLN model was (145, 3, 3). There were five sets of bias and weight matrices with a light blue background in the second row, corresponding to the first convolutional layer and the four shortcut-link convolutional layers; the rectangles are named, in order, conv1, add 1.0, add 2.0, add 3.0, and add 4.0.
There are four residual blocks in the MSLN structure, referred to as layerx (x = 1, 2, 3, 4), and each contains four convolution layers, named layerx.0.conv1, layerx.0.conv2, layerx.1.conv1, and layerx.1.conv2 (x = 1, 2, 3, 4); the downsampling operation mainly manages the problem of an inconsistent number of convolution kernel channels. At the bottom of the figure is a green element, which is the final output of the HSI classification results. Deep features are abstract and associated with a small receptive field.
When the shallow features of a vast receptive field are mapped into the space of abstract features of a small receptive field, the number of parameters grows with the number of layers, which leads to an increased computational load, as well as a large loss of shallow representation information. The multi-shortcut-link network structure proposed in this study, combined with the ResNet, can make up for the deep network's loss of shallow-feature information very well and better address the difficulty in learning deep abstract features.

Datasets, Results, and Analysis
The MSLN proposed in this study was based on the Python language and PyTorch deep learning framework, with the test environment being a Windows 10 OS with 32 GB RAM, an Intel i7-8700 CPU, and an NVIDIA Quadro P1000 4 GB GPU.

Hyperspectral Test Datasets
To validate the robustness and generalization property of the proposed algorithm, six groups of open-source datasets collected by M. Graña et al. [76] were used to learn all types of labeled terrain objects without any human screening, and the ratio of training sets to validation sets to test sets was 0.09:0.01:0.9 (see Tables 2-7 for details). The class-wise sample distribution for the Salinas dataset, for example, was as follows:

No.  Class                       Training  Validation  Test
1    Brocoli_green_weeds_1       181       20          1808
2    Brocoli_green_weeds_2       355       37          3334
3    Fallow                      177       20          1779
4    Fallow_rough_plow           125       14          1255
5    Fallow_smooth               241       27          2410
6    Stubble                     357       39          3563
7    Celery                      322       36          3221
8    Grapes_untrained            1014      113         10,144
9    Soil_vinyard_develop        558       62          5583
10   Corn_senesced_green_weeds   292       33          2953
11   Lettuce_romaine_4wk         96        11          961
12   Lettuce_romaine_5wk         171       19          1737
13   Lettuce_romaine_6wk         81        9           826
14   Lettuce_romaine_7wk         96        11          963
15   Vineyard_untrained          646       71          6551
16   Vineyard_vertical_trellis   157       17          1633
     Total                       4869      539         48,721

(1) Indian Pines (IP) Dataset
The IP dataset was acquired using AVIRIS sensors at the test site in Indiana in 1996, with an image resolution of 145 × 145 pixels, a spectral range of 0.4-2.45 µm, and a spatial resolution of 20 m. A total of 200 effective wavebands remained for classification after the wavebands affected by noise and severe water-vapor absorption were eliminated, and 16 crop classes were labeled. This dataset was shot in June, when some crops, such as corn and soybean, were in an early growth stage. With coverage rates of less than 5%, pixels were prone to mixing, significantly increasing the difficulty of vegetative terrain classification.
(2) Salinas (S) Dataset
The S dataset was shot in Salinas Valley, California, using AVIRIS sensors, with an image resolution of 512 × 217 pixels, a spectral range of 0.43-0.86 µm, and a spatial resolution of 3.7 m. A total of 204 effective wavebands remained for classification after the wavebands affected by noise and severe water-vapor absorption were eliminated, and 16 crop classes were labeled, covering vegetables, bare soils, vineyards, etc.
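The per-class sample counts follow the 0.09:0.01:0.9 split described above; a rounded per-class computation reproduces the Salinas rows shown earlier (the paper does not state its exact rounding rule, so a few table rows may differ by one sample):

```python
def split_counts(n, train_frac=0.09, val_frac=0.01):
    # Rounded per-class train/validation split; the remainder is the test set.
    train = round(n * train_frac)
    val = round(n * val_frac)
    return train, val, n - train - val

# Brocoli_green_weeds_1: 181 + 20 + 1808 = 2009 labeled samples
print(split_counts(2009))  # (181, 20, 1808)
# Celery: 322 + 36 + 3221 = 3579 labeled samples
print(split_counts(3579))  # (322, 36, 3221)
```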

(3) Pavia Centre (PC) and Pavia University (PU) Datasets
The PC and PU datasets stem from two scenes captured using ROSIS sensors during a flight over Pavia in northern Italy. After the noise-affected wavebands and information-free regions were eliminated, 102 and 103 wavebands remained, respectively, with image resolutions of 1096 × 715 and 610 × 340 pixels and a spatial resolution of 1.3 m. Both images have nine classes of labeled terrain objects, though the categories are not fully congruent.

(4) Kennedy Space Center (KSC) Dataset
The KSC dataset was shot using AVIRIS sensors at Kennedy Space Center, Florida, on March 23, 1996, with an image resolution of 512 × 614 pixels, a spectral range of 0.4-2.5 µm, and a spatial resolution of 18 m. A total of 176 wavebands remained for analysis after the wavebands affected by water-vapor absorption and noise were eliminated, and 13 classes of terrain objects were labeled. The low spatial resolution, plus the similarity in the spectral signatures of some vegetation types, considerably increases the difficulty of terrain classification.

(5) Botswana (B) Dataset
The B dataset was shot using Hyperion sensors over the Okavango Delta in Botswana, with an image resolution of 1476 × 256 pixels, a spectral range of 0.4-2.5 µm, and a spatial resolution of 30 m, covering 242 wavebands in total. The UT Space Research Center eliminated the uncalibrated and noise-affected wavebands exhibiting moisture-absorption features, leaving a total of 145 wavebands for classification, with 14 labeled classes of observations; these classes include the seasonal and sporadic swamps, dry forests, and other land cover types in the delta.
The IP dataset and S dataset both contain 6 major categories and 16 sub-categories, so it is necessary to improve the discrimination between classes to improve the classification accuracy. However, the IP dataset has a lower resolution; if the training data are selected according to the above ratio, four terrain objects have no training data (alfalfa, grass-pasture-mowed, oats, and stone-steel-towers). The same situation also occurs in the KSC dataset (hardwood swamp) and B dataset (hippo grass and exposed soils), mainly due to too few labeled samples. To reduce the validation error and test error, one sample was randomly selected from the training samples for validation.
The PC dataset and PU dataset have fewer categories, with a higher resolution and richer labeled samples, but have the issue of "distinct objects with the same spectrum." There are 7 swamp types out of the 13 classes of terrain objects in the KSC dataset, which has the issue of "the same object with distinct spectra" and considerably increased difficulty regarding terrain classification.
In this study, the MSLN structure built on the ResNet, together with some hyperparameter settings, could resolve the above problems and improve the classification accuracy on the six groups of datasets.

Results and Analysis
To validate the effectiveness and classification results of the MSLN-22 network structure, an ordinary 3D CNN [37] was selected as a baseline network for comparative analysis with an RNN [32], a multiscale 3D CNN (MC 3D CNN) [44], a 3D CNN residual (3D CNN Res) [41], and a 3D ResNet. The hyperparameters were set as follows: the batch size was set to 64, which not only mitigated gradient oscillation but also allowed for better performance of the GPU; the initial learning rate was set to 0.01 and dropped to 0.001 after the loss function stabilized; the maximum number of iterations (epochs) was set to 600; cross-entropy was selected as the loss function in the comparison algorithms, whereas L-softmax was adopted as the loss function in the network produced in this study; and the dropout was set to 0.5 after the global self-adaptive average pooling and before the fully connected layer, so that 50% of the neurons were discarded at random to address overfitting and enhance the model's generalization property.
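The pooling/dropout/classifier head and the manual learning-rate drop described above can be sketched in PyTorch as follows (a stand-in, not the full MSLN: the 128-d fully connected layer follows Table 1, the 16 output classes match the IP/S datasets, L-softmax is replaced by a plain linear layer here, and `drop_lr` is an illustrative helper):

```python
import torch
import torch.nn as nn

# Stand-in classification head matching the settings described above.
head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),  # global self-adaptive average pooling
    nn.Flatten(),
    nn.Dropout(p=0.5),        # 50% of neurons dropped at random
    nn.Linear(128, 16),       # 128-d fc -> number of classes
)
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)

def drop_lr(optimizer, stabilized: bool):
    # Manual schedule: cut the learning rate to 0.001 once the loss stabilizes.
    if stabilized:
        for g in optimizer.param_groups:
            g["lr"] = 0.001

logits = head(torch.randn(2, 128, 4, 3, 3))  # (batch, channels, bands, H, W)
print(logits.shape)  # torch.Size([2, 16])

drop_lr(optimizer, stabilized=True)
print(optimizer.param_groups[0]["lr"])  # 0.001
```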
Tables 8-13 present the classification results corresponding to the six datasets. The multi-shortcut-link structure of the MSLN proposed in this study could extract spatial–spectral joint features and fuse shallow and deep features, and its evaluation criteria (kappa coefficient, average accuracy (AA), and overall accuracy (OA)) were all the highest among the networks tested. Figures 9-14 each present the comparison of the HSI classification results among all the network structures.
Tables 8-10 and Figures 9-11 indicate that the MSLN, which had 22 convolution layers, could extract rich deep features, and the multi-shortcut-link network's structure also fused the shallow features of a vast receptive field and mitigated the effect of noises via the global self-adaptive average pooling layer, achieving a significant improvement and exhibiting good robustness in the classification results, with the average accuracy of classification across all terrain objects reaching 98.1% (IP dataset), 99.4% (S dataset), and 99.3% (B dataset), including the terrain objects that share similar attributes (untrained grapes, untrained vineyard, and vineyard vertical trellis) and relatively few samples in the IP dataset (alfalfa, grass-pasture-mowed, oats, and stone-steel-towers) and the B dataset (hippo grass and exposed soils).  Figure 9. Classification maps and overall accuracy from using different methods on the Indian Pines dataset.  Figure 10. Classification maps and overall accuracy from using different methods on the Salinas dataset.    Figure 11. Classification maps and overall accuracy from using different methods on the Botswana dataset.
Tables 8-10 and Figures 9-11 indicate that the MSLN, which had 22 convolution layers, could extract rich deep features, and the multi-shortcut-link network's structure also fused the shallow features of a vast receptive field and mitigated the effect of noises via the global self-adaptive average pooling layer, achieving a significant improvement and exhibiting good robustness in the classification results, with the average accuracy of classification across all terrain objects reaching 98.1% (IP dataset), 99.4% (S dataset), and 99.3% (B dataset), including the terrain objects that share similar attributes (untrained grapes, untrained vineyard, and vineyard vertical trellis) and relatively few samples in the IP dataset (alfalfa, grass-pasture-mowed, oats, and stone-steel-towers) and the B dataset (hippo grass and exposed soils).  The PC and PU datasets each have nine disjoint classes of terrain objects with excellent connectivity and whose attributes are simplex, while the connectivity is slightly inferior for the terrain objects asphalt, shadows, gravel, and bare soil, which are exposed to more noise, highlighting the issue of "distinct objects with the same spectrum" and even resulting in as low as a 67.8% classification accuracy for some algorithms. Through the double constraints of shallow and deep features, the algorithm produced in this study could improve the classification results so that the classification accuracies for the above terrain objects were 97.9%, 99.9%, 98.3%, and 99.8%, respectively, and that the object-spectrum confounding issue was addressed effectively.  Figure 13. Classification maps and overall accuracy from using different methods on the Pavia University dataset.  
Remote Sens. 2022, 14, x FOR PEER REVIEW 22 of 29
Of the 13 classes of terrain objects in the KSC dataset, 7 are swamp types, which presented considerable difficulty for terrain classification: none of the comparison algorithms could effectively learn the discriminative features of the different swamp types, mainly because the "same object with distinct spectra" circumstance is very pronounced for swamps. Accordingly, the validation accuracies of all comparison algorithms shown in Figure 15e were highly unstable. The classification results indicate that the fused features extracted via the multi-shortcut-link network in this study could express the abstract attributes of "the same object with distinct spectra" very well: the classification accuracy was at least 96% for the swamp-type terrain objects, and all salt marsh samples were classified correctly. However, the results remained less satisfactory for cabbage palm versus oak hammock and for oak versus broadleaf hammock, which have extremely similar spectral signatures; these two pairs of terrain objects share quite similar shallow features and have inferior connectivity, and the dataset contains more noise around them, so their classification accuracies were only 88.6% and 89.4%, respectively.
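The fused features discussed above come from shortcut links that add shallow feature maps back into deeper ones, so early representational detail keeps constraining the deep abstractions. A toy sketch of such multi-shortcut fusion, using scalars as stand-ins for feature maps (an illustrative dense-shortcut scheme, not the exact MSLN wiring):

```python
def multi_shortcut_forward(x, blocks):
    """Run x through a chain of blocks, adding every earlier
    feature back into each block's output (shallow + deep fusion)."""
    features = [x]
    for block in blocks:
        out = block(features[-1]) + sum(features)  # shortcut links
        features.append(out)
    return features[-1]

# Two toy "blocks" that merely scale their input
blocks = [lambda v: 2 * v, lambda v: 3 * v]
print(multi_shortcut_forward(1.0, blocks))  # 13.0
```

Because each output still contains an additive copy of the shallow features, gradients reach the early layers directly, which is what mitigates the deep-network degeneration described above.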
To better indicate the parameter size of each network structure, the upper limit of the vertical axis in Figure 16 was set to 25 MB; this was mainly because of the great number of channels in the 3D ResNet algorithm, whose parameter size was large (130.20 ± 0.65 MB) when the six datasets were learned and whose training time was accordingly the longest. Figure 16 shows that the parameter size of the MSLN network structure was 10.87 ± 0.27 MB, while Figure 17 indicates that the learning time of the algorithm proposed in this study did not grow in proportion to the parameter size. This illustrates that the multi-shortcut-link network structure proposed in this study not only improved the overall classification accuracy but also shortened the model's training and learning times while diminishing overfitting. Therefore, the proposed algorithm reached the highest accuracy with a shorter training time than the other models.
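The parameter sizes reported in MB follow from simple arithmetic over the convolution layers: a 3D convolution with a k x k x k kernel contributes in_channels x out_channels x k^3 weights, and 32-bit floats occupy 4 bytes each. A sketch of that bookkeeping (the channel widths below are hypothetical, not the actual MSLN configuration):

```python
def conv3d_params(in_ch, out_ch, k):
    """Weight count of a 3D conv layer with a k x k x k kernel (no bias)."""
    return in_ch * out_ch * k ** 3

def size_mb(n_params, bytes_per_param=4):
    """Model size in MB assuming 32-bit floating-point parameters."""
    return n_params * bytes_per_param / (1024 ** 2)

# Hypothetical three-layer stack: (in_channels, out_channels, kernel)
layers = [(1, 32, 3), (32, 64, 3), (64, 64, 3)]
total = sum(conv3d_params(i, o, k) for i, o, k in layers)
print(round(size_mb(total), 3))  # 0.636
```

The same tally over a real 22-layer configuration is how figures such as 10.87 MB versus 130.20 MB arise: parameter size scales with the product of channel widths, which is why the wide-channel 3D ResNet is so much larger.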


Conclusions
In view of the characteristics of hyperspectral images, such as few labeled samples, excess noise, and homogeneity with heterogeneous structures, this study built a multi-shortcut-link network structure to extract the 3D spatial-spectral information of HSIs, based on the properties of a 3D CNN and the shortcut-link characteristics of a ResNet, and tested it on six groups of HSI datasets by making full use of the shallow representational features and deep abstract features of HSIs. The results showed the following: (i) The MSLN could directly take the cube data of the HSIs as input and effectively extract the spatial-spectral information. The hybrid use of activation functions preserved the integrity of the nonlinear features of the input data, which not only improved the utilization ratio of neurons but also increased the model's rate of convergence. (ii) The multi-shortcut-link structure fused shallow and deep features, which reduced the gradient loss of deep features, satisfactorily solved the degeneration of the deep network, and enhanced the network's generalization ability. The L-softmax loss function endowed the learned features with stronger discriminatory power, effectively addressing the issue of "the same object with distinct spectra, and distinct objects with the same spectrum" and achieving more significant classification results. Therefore, the MSLN proposed in this study could effectively improve the overall classification result.
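The discriminatory power attributed to L-softmax comes from its angular margin: for the target class, the logit ||W_y|| ||x|| cos(theta) is replaced by ||W_y|| ||x|| psi(theta), with psi(theta) = (-1)^k cos(m*theta) - 2k on the segment theta in [k*pi/m, (k+1)*pi/m]. A minimal sketch of that target-class logit, following the standard large-margin softmax formulation (function and parameter names are illustrative):

```python
import math

def lsoftmax_logit(w_norm, x_norm, theta, m=2):
    """Target-class logit under the large-margin (L-) softmax:
    ||W_y|| * ||x|| * psi(theta), psi(theta) = (-1)^k cos(m*theta) - 2k.
    A larger margin m demands a smaller angle theta for the same logit,
    tightening intraclass cohesion and widening interclass separation."""
    k = int(theta * m / math.pi)  # segment index for theta in [0, pi]
    psi = (-1) ** k * math.cos(m * theta) - 2 * k
    return w_norm * x_norm * psi

# At theta = pi/3 the ordinary softmax logit would be cos(pi/3) = 0.5,
# but with margin m = 2 it shrinks to cos(2*pi/3) = -0.5
print(lsoftmax_logit(1.0, 1.0, math.pi / 3))  # -0.5
```

Since psi(theta) < cos(theta) for theta > 0, the training signal keeps pushing samples toward their class weight vector even after plain softmax would already classify them correctly.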
Although the multi-shortcut-link network structure proposed in this study demonstrated clear advantages in performance and classification accuracy, it did not address the issues of information interaction and weight allocation between different channels. In future work, an attention mechanism will be introduced as the network is deepened, and the relevance between space and channels will be utilized to enhance the discriminatory power of the features of terrain objects with inferior classification accuracy, thereby achieving higher classification accuracy.