Small Sample Hyperspectral Image Classification Based on Cascade Fusion of Mixed Spatial-Spectral Features and Second-Order Pooling

Abstract: Hyperspectral images can capture subtle differences in the reflectance of ground features in hundreds of narrow bands, and their pixel-wise classification is the cornerstone of many applications requiring fine-grained classification results. Although three-dimensional convolutional neural networks (3D-CNN) have been extensively investigated in hyperspectral image classification tasks and have made significant breakthroughs, hyperspectral classification under small sample conditions is still challenging. To facilitate small sample hyperspectral classification, a novel mixed spatial-spectral features cascade fusion network (MSSFN) is proposed. First, the covariance structure of the hyperspectral data is modeled and dimensionality reduction is conducted using factor analysis. Then, two 3D spatial-spectral residual modules and one 2D separable spatial residual module are used to extract mixed spatial-spectral features. A cascade fusion pattern consisting of intra-block feature fusion and inter-block feature fusion is constructed to enhance the feature extraction capability. Finally, the second-order statistical information of the fused features is mined using second-order pooling, and classification is achieved by the fully connected layer after L2 normalization. On three publicly available hyperspectral datasets, Indian Pines, Houston, and University of Pavia, with only 5%, 3%, and 1% of the labeled samples used for training, MSSFN achieves accuracies of 98.52%, 96.31%, and 98.83%, respectively, which is far better than the contrast models and verifies the effectiveness of MSSFN in small sample hyperspectral classification tasks.


Introduction
Hyperspectral images can capture subtle differences in the reflectance of ground features in hundreds of narrow spectral bands, offering the possibility of accurately identifying features with comparable color and texture [1]. Therefore, hyperspectral image classification is the cornerstone of many applications requiring high classification granularity, such as agricultural yield estimation [2], tree species identification [3], natural resource survey [4], and disaster monitoring [5], and has long been a research hotspot in the field of remote sensing. Nowadays, deep learning methods represented by 3D-CNN are capable of automatically extracting joint spatial-spectral features, and have been extensively investigated and applied in hyperspectral remote sensing in recent years [6][7][8]. However, raw hyperspectral data have high redundancy, and it is difficult to obtain sufficient manually labeled samples [9]. The available training samples in real-world hyperspectral classification tasks are often scarce and contain considerable noise, so effective dimensionality reduction that does not lose crucial spatial-spectral features is essential. The R-HybridSN model combined residual learning with depth-separable convolution, which effectively improved the classification accuracy of hyperspectral images under small sample conditions [37]. Based on R-HybridSN, Feng et al. proposed M-HybridSN, whose core feature extraction module combines three densely connected 3 × 3 × 3 convolutional layers with one 7 × 7 × 7 convolutional layer for global information enhancement; Zhang et al. proposed AD-HybridSN, based on dense connectivity and two attention modules, with the aim of spatial-spectral feature refinement. These two models can be viewed as improved versions of R-HybridSN, and both improved the network's ability to learn robust spatial-spectral features [38,39]. How to make better use of the features extracted by a CNN is another key issue.
Aiming at the limitations of existing CNN models using global average pooling or fully connected layers, Zheng et al. proposed a method of classifying the features extracted from convolutional networks by mining covariance information and designed a covariance pooling-based mixed CNN model (MCNN-CP) [40]. The combination of mixed CNN models and dimensionality reduction algorithms will attract continuous attention due to their high classification accuracy and low computational cost.
The above residual learning and mixed CNN models serve as network optimization methods aimed at facilitating small sample hyperspectral classification at the network level. With the help of semi-supervised learning methods, which focus on the intra-dataset level, the training process can also benefit from a large number of unlabeled samples [41][42][43]. Wu et al. proposed a semi-supervised deep learning framework based on pseudo labels [44], in which a clustering method called the constrained Dirichlet process mixture model (C-DPMM) was adopted to generate pseudo labels, and a classic pre-training and fine-tuning scheme was utilized to further improve the classification performance of two convolutional recurrent neural networks (CRNN). Liu et al. proposed a deep active learning method using a densely connected CNN model [45]: several branches of the network form a loss prediction model, and the samples with large predicted losses are manually labeled and re-involved in the training process. Liu et al. also proposed a deep few-shot learning method (DFSL), building a connection between an HSI domain with very few labeled data (the target domain) and another HSI domain with enough labeled data (the source domain). DFSL focuses on the inter-dataset level, and a deep residual 3D-CNN is adopted to learn a metric space [46]. The well-trained network can be viewed as an embedding tool, and the classification of unlabeled data can be achieved by a simple distance-based classifier. Recently, inspired by the success of DFSL, several novel few-shot learning methods have been proposed for small sample hyperspectral classification, such as deep relation network-based few-shot learning [47] and deep cross-domain few-shot learning [48]. Few-shot learning methods have changed the paradigm of classifying features extracted by convolutional layers and opened up a promising research field for small sample hyperspectral classification.
In addition, network-level optimization is of non-negligible significance. On the one hand, supervised learning methods are easy to implement and apply in real-world remote sensing applications, since only network-related hyperparameters need to be fine-tuned. On the other hand, advanced CNN models with a discriminative feature learning ability can be combined with the above novel learning patterns to obtain better classification accuracy.
Based on the above observations, we propose a mixed spatial-spectral feature cascade fusion network (MSSFN) to facilitate small sample hyperspectral classification. Factor analysis is combined with our model to analyze the covariance structure of hyperspectral data and achieve effective dimensionality reduction that improves inter-class separability. MSSFN adopts two 3D multiple spatial-spectral residual blocks and one 2D separable multiple residual block to extract mixed spatial-spectral features. A cascade feature fusion pattern composed of intra-block feature fusion and inter-block feature fusion is proposed to make the learned features more robust across different kinds of hyperspectral datasets. Second-order pooling is employed to further mine the higher-order statistical information of the cascade-fused features. Extensive experiments were conducted on three real-world hyperspectral datasets, and the classification accuracy of MSSFN was far better than that of the contrast models.

Factor Analysis
Factor analysis (FA) is a multivariate statistical analysis method that examines the underlying structure of high-dimensional data. FA extracts the few factors that characterize the data and can thus serve as a dimensionality reduction method. It is similar to PCA in terms of the calculation process; the difference is that PCA focuses on the total variance of the variables, while FA focuses on the covariance [49,50]. Hyperspectral data are high-dimensional and highly nonlinear, so using FA as a dimensionality reduction method helps improve inter-class separability.
Suppose the hyperspectral image data are represented as X = (x_1, x_2, . . . , x_n)^T. To express the hyperspectral data mathematically, a latent variable model can be introduced as Equation (1),

X = WH + μ + ε, (1)

where H = (h_1, h_2, . . . , h_n)^T are the unobservable random variables, called the factors of X; μ = (μ_1, μ_2, . . . , μ_n)^T is an offset vector satisfying E(x_i) = μ_i; and ε = (ε_1, ε_2, . . . , ε_n)^T is the noise term, usually assumed to follow a normal distribution with diagonal covariance Ψ = diag(ψ_1, ψ_2, . . . , ψ_n). W is the coefficient matrix to be estimated, known as the factor loading matrix. If the variance of each component of X is equal, i.e., Ψ = σ²I (I is the identity matrix), this hypothesis leads to the PCA model; if the variances of the components are unequal, it leads to the FA model. Factor analysis focuses on finding the factor loading matrix, which can be proven to be the matrix of correlation coefficients between each factor and the original variables [51].
Suppose the eigenvalues of the covariance matrix of X satisfy λ_1 > λ_2 > λ_3 . . . > λ_n ≥ 0, with corresponding eigenvectors l_1, l_2, l_3, . . . , l_n. The covariance matrix Cov can then be decomposed as Equation (2),

Cov = Σ_{i=1}^{n} λ_i l_i l_i^T, (2)

When the last (n − m) eigenvalues are small, the corresponding eigenvectors are discarded and only the first m eigenvectors are retained. Thus, Cov can be approximately decomposed as shown in Equation (3),

Cov ≈ Σ_{i=1}^{m} λ_i l_i l_i^T + Ψ = WW^T + Ψ, (3)

where the ith column of the factor loading matrix is √λ_i l_i. The dimensionality reduction of hyperspectral data can be achieved by (3), and the detailed solution process of FA can be found in the literature [49].
In practical usage, to improve interpretability, the factor loading matrix of FA can be further rotated, e.g., to maximize the variance (varimax) or the fourth-power variance (quartimax). After rotation, the values of the factor loading matrix become sparser, which improves the interpretability of the extracted factors.
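As an illustration of Equations (2) and (3), the following is a minimal numpy sketch of principal-factor dimensionality reduction. The function name `factor_reduce` and the least-squares factor scores are illustrative choices, not the exact solver used in this paper; production code would typically use a maximum-likelihood FA implementation such as scikit-learn's `FactorAnalysis`.

```python
import numpy as np

def factor_reduce(X, m):
    """Reduce n-band pixels (N, n) to m factors via the eigen-decomposition
    of the band covariance matrix (principal-factor approximation)."""
    Xc = X - X.mean(axis=0)                    # center each band
    cov = np.cov(Xc, rowvar=False)             # n x n covariance matrix
    vals, vecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:m]         # keep the m largest eigenvalues
    W = vecs[:, order] * np.sqrt(vals[order])  # factor loading matrix, n x m
    # least-squares factor scores: solve W h = x for each centered pixel
    scores, *_ = np.linalg.lstsq(W, Xc.T, rcond=None)
    return scores.T, W                         # (N, m) scores, (n, m) loadings
```

Applied to a hyperspectral cube flattened to (pixels, bands), the returned scores form the reduced-band input patches.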

Cascade Fusion of Mixed Spatial-Spectral Features
In order to better learn spatial-spectral features, this paper proposes a mixed spatial-spectral feature cascade fusion pattern and constructs the MSSFN network accordingly. MSSFN is an improved mixed CNN model whose basic units are the 3D convolutional layer and the 2D convolutional layer. The two types of convolution differ considerably in how they extract image features. Two-dimensional convolution convolves the input data in two directions at a time, and the result of a single convolutional kernel is a two-dimensional tensor. The value at position (x, y) on the jth feature map of the ith layer is calculated by Equation (4),

s_{i,j}^{x,y} = act( b_{i,j} + Σ_m Σ_{h=0}^{H_i−1} Σ_{w=0}^{W_i−1} k_{i,j,m}^{h,w} s_{(i−1),m}^{(x+h),(y+w)} ), (4)

where k_{i,j,m}^{h,w} represents the value of the jth convolutional kernel of the ith layer at position (h, w) as it convolves the mth feature map extracted by the previous layer; H_i and W_i represent the height and width of the kernel, respectively; s_{(i−1),m}^{(x+h),(y+w)} denotes the value of the mth feature map of the previous layer at position (x + h), (y + w); s_{i,j}^{x,y} denotes the resulting value of the feature map at position (x, y); b_{i,j} represents the bias and act() represents the activation function. Three-dimensional convolution extends to the band dimension, and a single convolution kernel results in a three-dimensional tensor. The value at position (x, y, z) of the jth feature cube in the ith convolutional layer is calculated by Equation (5),

s_{i,j}^{x,y,z} = act( b_{i,j} + Σ_m Σ_{h=0}^{H_i−1} Σ_{w=0}^{W_i−1} Σ_{d=0}^{D_i−1} k_{i,j,m}^{h,w,d} s_{(i−1),m}^{(x+h),(y+w),(z+d)} ), (5)

where D_i is the kernel size in the band dimension. The proposed MSSFN is based on a modular design and can be seen as a duplication of the basic feature extraction module and its variants. Based on the two-dimensional MultiResBlock proposed in the literature [52], three improved multiple residual learning modules are proposed in this paper, which can be divided into two categories. One type is the 3D spatial-spectral multiple residual modules, used to extract joint spatial-spectral features; the other type is the 2D separable multiple residual module, used for spatial feature enhancement based on depth-separable convolutional layers. The general schematic diagram of the multiple residual modules is shown in Figure 1 and contains two branches. Branch A contains three convolutional layers for extracting multi-scale features, where the multi-scale characteristics are reflected in the concatenation of three consecutive convolutional layers. Branch B uses global residual connections for inter-block feature fusion.

In this paper, two types of multi-residual modules are designed to extract mixed spatial-spectral features using 3D convolution and 2D convolution, respectively, and the basic information of the modules is shown in Table 1. The convolutional kernel sizes of the two branches of the two 3D multi-residual modules are set to 1 × 1 × 3, 1 × 1 × 7 and 3 × 3 × 1, 7 × 7 × 1, respectively. The two 3D multi-residual modules are combined to extract spatial-spectral features, and this is much more memory-efficient than the kernel size combination of 3 × 3 × 3 and 7 × 7 × 7 adopted in M-HybridSN. The convolutional kernel size in branch B is larger than that in branch A so as to learn global spatial-spectral features [38]. The 2D multi-residual module uses depth-separable convolution with large convolution kernels (5 × 5) for spatial feature reinforcement, and global residual connections with 1 × 1 convolution for information transfer enhancement. Depth-separable convolution divides ordinary two-dimensional convolution into two steps, depth-wise convolution and point-wise convolution, which effectively reduces the number of parameters and operations [37][38][39]. The feature learning processes of the three multi-residual modules can be uniformly expressed in Equation (6),

y = k_global(x) ⊕ [k_1(x), k_2(k_1(x)), k_3(k_2(k_1(x)))], (6)

where k_i(x) denotes the non-linear transformation of the ith convolutional layer in branch A; k_global() represents the non-linear transformation of the global residual connection; [·] represents feature map concatenation and ⊕ denotes pixel-wise addition. The literature [53] stated that, when successive convolutional layers are used, there is a square relationship between the number of convolutional kernels in the first layer and the memory consumption. Therefore, in order to reduce memory consumption, the numbers of kernels in the three convolutional layers are no longer equal but set in the ratio 1:2:3, with reference to [52].
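The 2D convolution of Equation (4) can be illustrated with a deliberately naive numpy loop (`conv2d_single` is a hypothetical helper, not framework code; deep learning libraries implement this far more efficiently, and the 3D case of Equation (5) only adds an inner loop over the band dimension):

```python
import numpy as np

def conv2d_single(feature_maps, kernels, bias, act=lambda v: np.maximum(v, 0.0)):
    """Naive valid-mode 2D convolution following Eq. (4).
    feature_maps: (M, H, W) maps from the previous layer
    kernels:      (M, Hk, Wk) one kernel slice per input map
    Returns a single (H-Hk+1, W-Wk+1) output feature map."""
    M, H, W = feature_maps.shape
    _, Hk, Wk = kernels.shape
    out = np.zeros((H - Hk + 1, W - Wk + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # sum over all input maps m and kernel offsets (h, w)
            patch = feature_maps[:, x:x + Hk, y:y + Wk]
            out[x, y] = np.sum(patch * kernels) + bias
    return act(out)
```

A 3 × 3 input convolved with a 2 × 2 kernel yields a 2 × 2 output, matching the valid-mode size (H − H_k + 1) × (W − W_k + 1) implied by Equation (4).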

The above three multiple residual modules are the basic building blocks of MSSFN, and their pipelined stacking in the network learns features of increasing levels of abstraction. To further utilize the intermediate results and construct diverse feature learning paths, a cross-layer feature fusion structure is designed, as shown in Figure 2. Let the output dimension of the features extracted by the two 3D multi-residual blocks be h_3d × w_3d × b_3d × C; the dimension of the features is adjusted to h_3d × w_3d × 1 × C using a global 3D convolutional layer with kernel size 1 × 1 × b_3d. To meet the requirements of the subsequent 2D convolutional layer, the adjusted features are reshaped to a 3D tensor of h_3d × w_3d × C. This design avoids an excessive number of channels after direct reshaping, which is one of the most significant drawbacks of mixed CNN models. The multi-residual module implements intra-block feature fusion, and the structure shown in Figure 2 implements inter-block cross-layer feature fusion; together, the two feature fusion methods constitute a cascade feature fusion pattern. Compared with the idea, proposed in R-HybridSN, of fusing features extracted by different types of convolutional layers with max-pooling, the cascade feature fusion pattern in this paper is more thorough. In addition, the three modules extract clearly different features owing to their different kernel sizes and convolution types; the cascade feature fusion pattern makes full use of this large feature variability to improve applicability to different types of hyperspectral images.
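The band-collapsing step performed by the global 3D convolutional layer can be sketched as a single tensor contraction. `collapse_bands` is an illustrative stand-in for the 1 × 1 × b_3d convolution, omitting bias and activation; it shows only the shape bookkeeping that avoids an excessive channel count.

```python
import numpy as np

def collapse_bands(feat, kernel):
    """Mimic a global 3D conv with a 1 x 1 x b kernel that removes the band
    axis, so no costly reshape of bands into channels is needed.
    feat:   (h, w, b, C_in) features from a 3D multi-residual block
    kernel: (b, C_in, C_out) weights spanning all bands and input channels
    Returns (h, w, C_out), ready for the subsequent 2D layers."""
    return np.tensordot(feat, kernel, axes=([2, 3], [0, 1]))
```

For example, a 5 × 5 × 4 × 8 feature tensor contracted with a (4, 8, 16) kernel yields a 5 × 5 × 16 output, whereas directly reshaping would produce 4 × 8 = 32 channels per output position.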

Second-Order Pooling
Before classification, several fully connected layers are commonly used to further integrate the features extracted by the convolutional layers. This approach leads to a large number of parameters and cannot effectively eliminate the negative impact of redundant features on classification. Max-pooling and average pooling layers are usually adopted to filter the noise in the features; however, they utilize only first-order statistics and do not consider the relationships between different channels. Second-order pooling (SOP) was proposed by Carreira et al. and can mine the second-order statistical information of image features [54]. To make full use of the features extracted by the convolutional layers, SOP is inserted into MSSFN to further process the mixed spatial-spectral features extracted by the three modules in Section 2.2.

Let the fused features extracted by the three multi-residual blocks be denoted as F_fused, with dimension H × W × C. F_fused can be divided into three groups, corresponding to the three multi-residual blocks. The three groups of features have large variability, and direct pooling would take into account neither this variability nor the relationships among channels. Therefore, the fused features are further processed using second-order pooling. First, the spatial dimensions of F_fused are reshaped, and the reshaped features are denoted as F_resh with dimension HW × C. The second-order pooling is calculated as shown in Equation (7),

F_sop = F_resh^T F_resh, (7)

where F_sop represents the second-order pooling features, with dimension C × C, and F_resh^T denotes the transpose of F_resh, with dimension C × HW. The value f_ij in row i, column j is the result of multiplying the ith channel and the jth channel, which indicates the correlation between these two channels; the larger the value, the stronger the correlation. The C elements on the diagonal of F_sop can be regarded as weighted averages of the C channels, which express the characteristics of each channel itself. Therefore, F_sop contains not only the features of the original C channels but also the correlations between different channels. The schematic diagram of SOP processing for the fused features is shown in Figure 3.

To further improve feature robustness, L2 normalization is applied along the channel dimension, and the calculation process is shown in Equation (8),

f_i^{L2} = f_i / ||f_i||_2, (8)

where f_i^{L2} represents the L2 normalization result of the feature f_i of the ith channel. The SOP and L2 normalization can be inserted into the network and trained in an end-to-end manner. In fact, SOP is a special case of bilinear pooling, which was proposed in the literature [55].
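Equations (7) and (8) amount to a few lines of numpy. This is a sketch under the assumption that the L2 normalization is applied column-wise to F_sop; in the network, both operations are differentiable layers trained end-to-end.

```python
import numpy as np

def second_order_pool(feat, eps=1e-12):
    """Second-order pooling (Eq. (7)) followed by channel-wise L2
    normalization (Eq. (8)).
    feat: fused feature map of shape (H, W, C)
    Returns a (C, C) matrix of normalized channel co-activations."""
    H, W, C = feat.shape
    F = feat.reshape(H * W, C)        # F_resh, shape HW x C
    sop = F.T @ F                     # F_sop = F_resh^T F_resh, shape C x C
    # normalize each channel column by its L2 norm (eps avoids division by 0)
    norms = np.linalg.norm(sop, axis=0, keepdims=True)
    return sop / (norms + eps)
```

The (C, C) output is flattened and passed to the fully connected output layer, so the classifier sees channel correlations rather than raw spatial activations.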

Hyperspectral Image Classification Based on MSSFN
The structure of the MSSFN network is shown in Figure 4, where the convolutional kernel sizes and kernel numbers are marked for each convolutional layer. MSSFN takes the hyperspectral image patch after factor analysis processing as input, and the land-use type is determined by the center pixel. Every hyperspectral patch can be denoted as P_{M×M×C}, where M is the predefined neighborhood size and C is the band number after dimension reduction.

Hyperspectral image classification based on MSSFN can be divided into four stages, namely factor analysis for dimensionality reduction, mixed spatial-spectral feature extraction and cascade fusion, second-order pooling, and classification. Factor analysis models the covariance structure of the hyperspectral data while acting as a dimensionality reduction method that extracts features with high discriminative power and low redundancy. The mixed spatial-spectral features are extracted by the three multi-residual modules. The two 3D multi-residual modules extract spatial-spectral features from two directions, and the 2D separable multi-residual module further strengthens the spatial features as a complement to the 3D spatial-spectral features. The features extracted by the 3D modules are downscaled by global 3D convolution, and all the features are fused through concatenation. The intra-block feature fusion and the inter-block feature fusion together form the cascade feature fusion pattern. Second-order pooling and L2 normalization are then used to further extract second-order statistical information from the fused features and enhance feature robustness. Finally, classification is achieved by the output layer. The implementation details of MSSFN are shown in Table 2.

To accelerate network training and reduce overfitting, a batch normalization (BN) layer is inserted after each convolutional layer, and the calculation process is shown in Equation (9),

μ_B = (1/n) Σ_{i=1}^{n} x_i,  σ_B² = (1/n) Σ_{i=1}^{n} (x_i − μ_B)²,  y_i = γ (x_i − μ_B) / √(σ_B² + ε) + β, (9)

where n is the batch size; x_i is the ith sample of the batch; ε is a small value ensuring the denominator is not zero; and γ and β are learnable scale and shift parameters. The Rectified Linear Unit (ReLU) is adopted as the non-linear activation function, and its calculation process can be expressed as follows,

ReLU(x) = max(0, x). (10)

The softmax activation function is adopted in the output layer and can be calculated as Equation (11),

S(x)_j = e^{x_j} / Σ_{p=1}^{P} e^{x_p}, (11)

where x_j represents the value of the jth neuron of the output layer before activation; S(x)_j represents the value of the jth neuron after activation and denotes the probability that the center pixel of the input patch belongs to the jth class; and P represents the number of classes.
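Equations (9)-(11) can be sketched in numpy as follows (training-mode BN only; γ and β are the learnable scale and shift, and the softmax subtracts the row maximum for numerical stability, a standard trick not spelled out in Equation (11)):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-mode batch normalization over the batch axis, Eq. (9)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def relu(x):
    """Eq. (10): element-wise max(0, x)."""
    return np.maximum(0.0, x)

def softmax(x):
    """Eq. (11): class probabilities over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stability shift
    return e / e.sum(axis=-1, keepdims=True)
```

At inference time, frameworks replace the per-batch statistics in `batch_norm` with running averages accumulated during training.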

Experimental Datasets
Three real-world hyperspectral datasets with different spatial and spectral resolutions, Indian Pines (IP), Houston (HU), and University of Pavia (PU), were used to verify the effectiveness of MSSFN. The basic information of the three datasets is shown in Table 3; for the IP and PU datasets, the number of bands after denoising is given. It should be noted that the HU dataset was originally released during the 2013 Data Fusion Contest by the Image Analysis and Data Fusion Technical Committee of the IEEE Geoscience and Remote Sensing Society, so it is also named "2013_IEEE_GRSS_DF_Contest". In our experiments, the labeled samples of the three datasets were randomly divided into training, validation, and test sets. To investigate the performance of the proposed model under small sample conditions, only 5%, 3%, and 1% of the labeled samples of IP, HU, and PU, respectively, were used to train the model. The validation set, which has the same proportion of samples as the training set, is not involved in training and is only used to select the well-trained model. At these rates, the number of training samples per class of ground object is small; for example, some classes of the IP dataset contain only one or two training samples. The detailed distribution of training, validation, and testing samples for the IP, HU, and PU datasets is shown in Tables 4-6.
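The per-class random split described above can be sketched as follows. `stratified_split` is an illustrative helper; the paper's exact sampling procedure may differ, e.g., in how fractional per-class counts are rounded.

```python
import numpy as np

def stratified_split(labels, train_frac, val_frac, seed=0):
    """Randomly split labeled pixel indices per class into train/val/test,
    e.g. train_frac = val_frac = 0.05 for the IP dataset.
    labels: 1D integer class label per labeled pixel."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_tr = max(1, int(round(train_frac * idx.size)))  # at least 1 sample
        n_va = max(1, int(round(val_frac * idx.size)))
        train.extend(idx[:n_tr])
        val.extend(idx[n_tr:n_tr + n_va])
        test.extend(idx[n_tr + n_va:])
    return np.array(train), np.array(val), np.array(test)
```

Because the split is per class, even rare classes contribute at least one training sample, matching the one-to-two-sample situation noted for some IP classes.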

Contrast Models and Experimental Settings
The experimental hardware environment comprises an R7 5800H CPU, an RTX 3060 graphics card with 6 GB of video memory, and 32 GB of RAM. The software environment is TensorFlow 2.4 with Python 3.7. To verify the effectiveness of MSSFN, Res-3D-CNN [26], M-HybridSN [38], AD-HybridSN [39], DFFN [27], and MCNN-CP [40], all mentioned above, were selected as contrast models for the experiments. The convolution type, number of parameters, and input data size of each model are shown in Table 7. The input data size is expressed as the product of the input patch length, width, and number of bands, taking the number of bands of the IP dataset as an example. The number of parameters and the input data size reflect the complexity of a model to a certain extent. The number of parameters of the proposed MSSFN is much smaller than that of the contrast models, and its input data size is moderate.

Table 7. The convolution type, parameter number and the input data size of MSSFN and the contrast models.

The various settings of the contrast models in the experiments were kept consistent with the corresponding papers. MSSFN uses Adam as the optimizer, with the learning rate set to 0.001 and the number of training epochs set to 100. The classification accuracy on the validation set is monitored during training, and the model with the highest validation accuracy within the specified number of epochs is saved.

Experimental Results
The following three widely adopted evaluation indices are used to quantitatively evaluate the performance of MSSFN and the contrast models.
(1) Overall accuracy (OA). This is an overall evaluation index of the classifier, calculated as the number of correctly classified pixels divided by the total number of pixels. (2) Average accuracy (AA). This index is the mean of the per-class accuracies, so it is strongly affected by a small number of hard classes. (3) Kappa coefficient. Kappa is computed from the confusion matrix. It is considered a more robust evaluation metric, reflecting the degree of agreement between the ground truth map and the predicted map [56].

Tables 8-10 show the classification results of each model for IP, HU, and PU, containing the classification accuracy of each class and the results of the three overall indicators, OA, AA, and Kappa. Ten consecutive experiments were conducted with each model on each dataset, and the average accuracies are given in the three tables. For the three overall indicators, standard deviations are shown after ±. Bold format in the tables represents the best result. From Tables 8-10, it can be seen that the classification accuracy of Res-3D-CNN is significantly lower than that of the other methods. The biggest difference is that Res-3D-CNN does not adopt prior dimensionality reduction. It is speculated that hyperspectral data redundancy has a greater adverse effect on classification accuracy when training samples are very limited, and that a 2 × 2 × 4 max-pooling layer alone is not enough to remove this redundancy. The OA of DFFN, which uses only 2D convolution, outperforms Res-3D-CNN by 9.86%, 4.38%, and 7.79% on the three datasets, IP, HU, and PU, respectively. Moreover, on the IP and PU datasets, the OA of DFFN is even higher than that of the more simply structured mixed convolutional network, MCNN-CP. The above observations verify the importance of deep feature fusion.
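For reference, the three indices above can be computed directly from predicted and ground-truth label vectors; the sketch below is a plain NumPy implementation of the standard definitions.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA, and the Kappa coefficient from label vectors."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                              # confusion matrix
    total = cm.sum()
    oa = np.trace(cm) / total                      # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)       # accuracy of each class
    aa = per_class.mean()                          # average accuracy
    # chance agreement estimated from the row/column marginals
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

Because AA averages the per-class accuracies, a single poorly classified class lowers it noticeably even when OA stays high, which is why the two indices diverge on imbalanced datasets.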
The three improved mixed CNN models, M-HybridSN, AD-HybridSN, and MSSFN, outperform the other models, among which the MSSFN proposed in this paper achieves the best OA on all datasets. For example, on the IP dataset, the OA of MSSFN is 11.42%, 1.53%, 1.41%, 1.56%, and 1.65% higher than that of Res-3D-CNN, M-HybridSN, AD-HybridSN, DFFN, and MCNN-CP, respectively; on the PU dataset, the OA of MSSFN is 9.45%, 1.18%, 0.85%, 1.66%, and 1.66% higher than that of the same five models. On the HU dataset, although the proposed MSSFN significantly outperforms all the contrast models, all models perform poorly; this is presumably due to the large size of the dataset, the small number of available samples, and the large intra-class variation. Although MSSFN achieved the highest overall classification accuracy on all datasets, its accuracy for class 9 of the IP dataset is significantly lower than that of the other methods. No similar phenomenon was observed for other classes or other datasets. At present, the reason behind this phenomenon is not clear, and we will pay continued attention to the issue in the future. The classification result maps lead to a similar conclusion: generally speaking, the maps obtained by MSSFN for the three datasets show less noise and better homogeneity, and are closer to the real-world distribution maps. The above results validate the effectiveness of MSSFN. The proposed classification framework, composed of factor analysis, mixed spatial-spectral feature cascade fusion, and second-order pooling, can learn spatial-spectral features with stronger discriminative power.

Comparison with Other Dimension Reduction Methods
In order to further explore the applicability of different dimensionality reduction (DR) methods combined with MSSFN for hyperspectral classification, PCA, sparse PCA (SPCA), Gaussian random projection (GRP), and sparse random projection (SRP) were chosen for comparison with FA. PCA is the most widely adopted DR method, and SPCA is a variant that aims to extract the sparse components that best represent the data. GRP and SRP are two simple and computationally efficient DR methods; the dimensions and distribution of their random projection matrices are controlled to approximately preserve the pairwise distances between any two samples of the dataset. GRP generates the random matrix from a normal distribution, whereas SRP relies on a sparse random matrix. Furthermore, FA with two rotation methods, named FA-varimax and FA-quartimax after the varimax and quartimax rotation criteria, is also adopted for comparison with the original FA. The experimental settings were kept consistent with Section 3.3, and the experimental results are shown in Table 11. From Table 11, it is clear that FA significantly outperforms the other DR methods on the IP dataset, where its OA improves by 0.40%, 0.56%, 1.02%, and 1.16% compared with PCA, SPCA, GRP, and SRP, respectively. On the HU dataset, the OA obtained by FA is 0.31% higher than that of PCA. These results verify the usefulness of covariance information for hyperspectral classification. The two rotation methods of FA have a negative impact on classification accuracy; it is speculated that the interpretability and the discriminability of the extracted factors for hyperspectral images are to some extent contradictory. The accuracy of SPCA is lower than that of PCA and varies widely across datasets, indicating that its robustness for feature extraction from different hyperspectral data is insufficient.
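All of the DR methods compared above are available in scikit-learn, so the comparison pipeline can be sketched as follows. The data here is random noise standing in for flattened hyperspectral pixels, and the component count is illustrative; FA-varimax and FA-quartimax correspond to the `rotation` parameter of `FactorAnalysis` (scikit-learn ≥ 0.24).

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA, FactorAnalysis
from sklearn.random_projection import (GaussianRandomProjection,
                                       SparseRandomProjection)

# Toy stand-in for a hyperspectral cube: 200 pixels x 64 bands.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(200, 64))

n_components = 16
reducers = {
    "FA":   FactorAnalysis(n_components=n_components, random_state=0),
    # rotated variants: FactorAnalysis(..., rotation="varimax") or "quartimax"
    "PCA":  PCA(n_components=n_components),
    "SPCA": SparsePCA(n_components=n_components, random_state=0),
    "GRP":  GaussianRandomProjection(n_components=n_components, random_state=0),
    "SRP":  SparseRandomProjection(n_components=n_components, random_state=0),
}
# Each method maps the 64 bands down to 16 components per pixel.
reduced = {name: dr.fit_transform(pixels) for name, dr in reducers.items()}
```

In the actual pipeline, the reduced pixels would be reassembled into a cube and patched before being fed to the network; here only the band-reduction step itself is shown.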
GRP and SRP, as random projection methods, also lack robustness for hyperspectral classification with few labeled samples, and their accuracy varies more widely across the datasets than that of the other DR methods. The above results verify the effectiveness of the combination of FA and MSSFN for small sample hyperspectral classification.

Model Ablation Experiments
The MSSFN proposed in this paper is based on the cascade fusion of mixed spatial-spectral features, and the higher-order statistical information of the fused features is extracted using second-order pooling. To further verify the effectiveness of these designs, ablation experiments are conducted in this section. Four ablation models, named model 1 to model 4, are constructed for this experiment. The description of each model and the experimental results are shown in Table 12. The experimental results in Table 12 indicate that MSSFN has the best overall classification accuracy on the three datasets, verifying the effectiveness of cascade fusion and second-order pooling. The OA of MSSFN improves by 0.40% and 0.25% over model 1 on the IP and HU datasets, respectively, indicating that, for mixed spatial-spectral features, pixel-by-pixel addition causes some information loss, which is unfavorable for classification. The classification accuracy of model 2 and model 3 is lower than that of MSSFN, and the accuracy of model 2 is significantly better than that of model 3, confirming the effectiveness of the cascade feature fusion pattern consisting of inter-block feature fusion and intra-block feature fusion. The OA of MSSFN is significantly higher than that of model 4 on the IP and HU datasets. Meanwhile, model 4 and MSSFN have approximately equal OA on the PU dataset; it is presumed that first-order statistical features are sufficient to distinguish the classes of the PU dataset. Since the original channel features are included in the SOP calculation process, the classification accuracy is not degraded, and the above results verify the effectiveness of SOP for hyperspectral small sample classification.
However, how to better integrate first-order and second-order features on top of the convolutionally extracted features, so as to improve classification accuracy on different types of hyperspectral datasets, needs further study.
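As a rough illustration of the second-order pooling step, the following NumPy sketch computes a channel covariance descriptor from a spatial feature map and L2-normalizes it. The exact SOP formulation of MSSFN (for example, how the original channel features enter the calculation) may differ from this minimal version.

```python
import numpy as np

def second_order_pooling(features, eps=1e-12):
    """Covariance (second-order) pooling of a spatial feature map.

    features: array of shape (H, W, C), a grid of C-dimensional descriptors.
    Returns the upper triangle of the C x C channel covariance matrix as a
    1-D vector, L2-normalized before the fully connected classifier.
    """
    h, w, c = features.shape
    x = features.reshape(h * w, c)
    x = x - x.mean(axis=0, keepdims=True)      # center each channel
    cov = x.T @ x / (h * w - 1)                # C x C channel covariance
    iu = np.triu_indices(c)                    # covariance is symmetric,
    vec = cov[iu]                              # so keep the upper triangle
    return vec / (np.linalg.norm(vec) + eps)   # L2 normalization
```

Unlike first-order (average or max) pooling, which keeps C values, this descriptor captures pairwise channel correlations, giving C(C + 1)/2 values.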

The Performance of MSSFN under Extreme Small Sample Cases
The work in this paper revolves around the problems of small sample hyperspectral classification tasks, and the effectiveness of the related designs has been verified in Sections 4.1 and 4.2, respectively. To further verify the applicability of MSSFN under small sample conditions, two extreme small sample cases are considered in this section.
(1) Fixed small training sample ratio case. Since the training sample number of some ground objects in the IP dataset has been reduced to one at the 5% ratio, the PU and HU datasets are chosen for the fixed small sample ratio case experiments. The training sample ratios of PU and HU are further reduced to 0.75%, 0.5%, 0.25%, and 2.25%, 1.5%, 0.75% of the total number of labeled samples, respectively. The other experimental settings remained the same as in Section 3. The experimental results of the above two cases are shown in Tables 13 and 14, respectively. By analyzing the above experimental results, the following conclusions can be drawn.
(1) MSSFN achieved the highest classification accuracy in both extreme small sample cases. Furthermore, the advantage of MSSFN over the other methods enlarges as the sample size decreases. The experimental results in this section therefore further validate the effectiveness of our proposed methods in extreme small sample cases. (2) The classification accuracy of all models degrades as the number of training samples decreases, while the relative ranking of the models remains the same and the accuracy gaps gradually enlarge. In general, the three improved mixed CNN models, M-HybridSN, AD-HybridSN, and MSSFN, have obvious advantages over the other models in extreme small sample cases. It can be inferred that their superiority in terms of low parameter number and network structure is stable in small sample hyperspectral classification tasks; the smaller the sample size, the more noticeable this advantage. (3) The sample distribution has a significant influence on classification accuracy. In the balanced small training sample number case, the AA values of all models are larger than the OA; in the fixed small training sample ratio case, it is the other way round.
In real-world hyperspectral classification tasks, we believe that the fixed training sample ratio case is more common, since the difficulty of labeling different kinds of ground objects varies greatly. Since AA is a vital evaluation metric, an active learning strategy can be adopted to manually label the most valuable samples. The influence of sample distribution and the use of active learning need further investigation.

Computational Time
The computational time of most current 3D-CNN-based spatial-spectral methods for hyperspectral classification is very long, which affects their practicability in real-world hyperspectral classification tasks. Therefore, computational time has received considerable attention in our research from the very beginning. Many factors affect the running time of a model, such as the hardware environment, the number of model parameters, the frequency of residual connections, and structural complexity. In this section, the lightweight design of MSSFN is introduced, and the running time of MSSFN and the contrast models is discussed and analyzed.
The parameter scale of MSSFN is much smaller than that of the contrast models, largely because the convolutional kernels in MSSFN are designed in a spatial-spectral separable manner. For example, the 7 × 7 × 7 convolutional kernel is divided into 7 × 7 × 1 and 1 × 1 × 7 kernels in the two 3D multiple residual blocks. This spatial-spectral separable design significantly reduces the parameter number and computation time. Depth-separable convolutional layers, which are computationally efficient, are also adopted in our model. Table 15 shows the training time and testing time of MSSFN and each contrast model on the IP dataset, which has the largest training sample number (512) and the longest training time. The running time results in the table are the averages of ten experiments. Res-3D-CNN analyzes the raw hyperspectral data and extracts spatial-spectral features using consecutive 3 × 3 × 3 convolutional layers; DFFN improves classification accuracy by stacking 2D residual blocks and has far more layers than the other models. The running time of Res-3D-CNN and DFFN is much longer than that of the other models, indicating that excessive use of 3D convolutional layers and very deep networks has a negative influence on the running time. As for the four mixed CNN models, M-HybridSN, AD-HybridSN, MCNN-CP, and MSSFN, the complexity of the network structure appears to have a great influence on the running time. MCNN-CP, with the simplest structure, has the shortest running time, but its classification accuracy is not ideal in our experiments.
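The parameter saving from the spatial-spectral separable factorization can be verified with simple arithmetic; the channel counts below are hypothetical and only illustrate the ratio, not MSSFN's actual layer widths.

```python
def conv3d_params(k_h, k_w, k_d, c_in, c_out, bias=True):
    """Parameter count of one 3D convolutional layer."""
    return (k_h * k_w * k_d * c_in + (1 if bias else 0)) * c_out

# A full 7x7x7 kernel versus its 7x7x1 + 1x1x7 spatial-spectral factorization
# (channel counts are made up for illustration).
c = 16
full = conv3d_params(7, 7, 7, c, c)
separable = conv3d_params(7, 7, 1, c, c) + conv3d_params(1, 1, 7, c, c)
print(full, separable)   # the factorized pair needs roughly 6x fewer parameters
```

The ratio grows with the kernel size, since a k × k × k kernel costs k³ weights per channel pair while the factorized pair costs only k² + k.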
M-HybridSN and AD-HybridSN do not adopt a global feature fusion scheme, so the features learned by the shallow layers of the network cannot directly affect the final classification. In contrast, our proposed MSSFN adopts a cascade feature fusion pattern and improves hyperspectral classification accuracy under small sample conditions. The running time of MSSFN is shorter than that of Res-3D-CNN and DFFN. As an improved mixed convolutional network model, MSSFN has prominent lightweight characteristics compared with a 3D-CNN of similar depth and a deeper 2D-CNN. However, the running time of MSSFN is longer than that of M-HybridSN and AD-HybridSN, which can be seen as the cost of model complexity: frequent feature fusion produces intermediate results that must be stored and taken into account when calculating gradients, which increases the training time. We believe that the computational time of MSSFN is moderate and acceptable, but we will continue to pay attention to this issue, look for a more lightweight network design scheme, and strive to improve the classification accuracy without increasing the running time.

Comparison with Other Methods Which Are Not Focused on CNN Architectures
Some advanced CNN-based models have been compared with MSSFN in the sections above, together with in-depth analyses in terms of ablation experiments, running time, and extreme small sample cases. As discussed in the Introduction, many researchers have proposed novel methods for small sample hyperspectral classification that are not focused on CNN architectures. To clearly position the value of MSSFN in the field of hyperspectral classification, several recently released, advanced methods of this kind are compared with MSSFN near the end of this paper. The additional contrast methods are Rank-1 FNN [22], SS-LSTM [57], S-DMM [58], and A-SPN [59]; a brief introduction to each follows.
(1) Rank-1 FNN. Rank-1 FNN is a tensor-based method in which the weight parameters satisfy the rank-1 canonical decomposition property. The number of parameters required to train the classifier is significantly reduced, and the method can provide a clear explanation of hyperspectral classification results. (2) SS-LSTM. SS-LSTM is based on Long Short-Term Memory (LSTM) networks and has two branches; spatial-spectral feature learning is reflected in the different ways the hyperspectral input data are organized in each branch. (3) S-DMM. This method is based on deep metric learning. A simple 2D-CNN is adopted as the feature embedding tool, and a distance-based classifier, KNN, is used to classify unseen data. (4) A-SPN. PCA, batch normalization, and L2 normalization are adopted to extract first-order features, while spatial attention and second-order pooling are combined to extract higher-order features. This pure attention-based method abandons the complex hyperparameters of convolutional layers and has obvious lightweight characteristics.
We trained the A-SPN model from scratch using the publicly available code at https://github.com/ZhaohuiXue/A-SPN-release (accessed on 13 January 2022). For the other methods, the experimental results reported in the literature [10,22] are used for comparison. The experimental dataset is PU, and 10, 50, and 100 samples per class were randomly selected as the training set, respectively. The comparison results are shown in Table 16. By analyzing the above experimental results, the following conclusions can be drawn.
(1) The OA comparison with these advanced methods, which are not focused on CNN architectures, further verifies the effectiveness and research value of MSSFN. (2) A-SPN obtains competitive classification accuracies. Considering that it is a classification framework consisting only of PCA, normalization technologies, and attention-based second-order pooling, such performance is very impressive. Pure attention-based models will receive continued attention in our future research. (3) S-DMM, which is based on deep metric learning, obtains good accuracy under small sample conditions. However, when the number of training samples is increased from 50 to 100, the increase in accuracy is not significant; it is speculated that the feature learning ability of its feature embedding model is insufficient. We will consider combining our proposed model with metric learning to further improve the classification accuracy.
Based on the above observations, we will keep studying how to further optimize the structure of MSSFN on the one hand, and explore how to break through the limitations of CNN-based methods and how to effectively integrate CNN with other methods on the other hand.

Conclusions
In order to facilitate small sample hyperspectral classification, a mixed spatial-spectral feature fusion network, MSSFN, is proposed based on factor analysis, mixed spatial-spectral feature cascade fusion, and second-order pooling. First, the covariance structure of the hyperspectral data is modeled by factor analysis, and the raw data are reduced in dimension. Then, mixed spatial-spectral features are extracted by two 3D multi-residual modules and one 2D multi-residual module, and the features extracted by the three modules are concatenated. Finally, the second-order statistical features of the fused features are extracted by second-order pooling, and classification is achieved by the fully connected layer. In experiments on three real-world hyperspectral datasets with different spatial resolutions and spectral characteristics, IP, HU, and PU, with very few samples, MSSFN achieves the best classification accuracy compared with the other models. These extensive experimental results verify the effectiveness of MSSFN in small sample hyperspectral classification tasks.
Although MSSFN performs well in small sample hyperspectral classification, the improvement from second-order pooling is not obvious on some datasets. How to better integrate first-order and second-order features to improve the classification accuracy for different types of hyperspectral datasets needs further study. In our research, FA is used for dimension reduction, and its effectiveness has been verified through ablation experiments; dimensionality reduction methods that pair effectively with mixed CNN models will receive continued attention in our future research. Furthermore, deep few-shot learning and deep active learning will receive more attention, and our proposed MSSFN can serve as a baseline model. In addition, MSSFN contains some modular, highly reusable designs that can be improved or applied in other remote sensing image classification tasks. We hope that these designs will inspire other researchers and that the ideas behind MSSFN can be further expanded.