Fusion of Multidimensional CNN and Handcrafted Features for Small-Sample Hyperspectral Image Classiﬁcation

: Hyperspectral image (HSI) classiﬁcation has attracted widespread concern in recent years. However, due to the complexity of the HSI gathering environment, it is difﬁcult to obtain a great number of HSI labeled samples. Therefore, how to effectively extract the spatial–spectral feature with small-scale training samples is the crucial point of HSI classiﬁcation. In this paper, a novel fusion framework for small-sample HSI classiﬁcation is proposed to fully combine the advantages of multidimensional CNN and handcrafted features. Firstly, a 3D fuzzy histogram of oriented gradients (3D-FHOG) descriptor is proposed to fully extract the handcrafted spatial–spectral feature of HSI pixels, which is suggested to be more robust by overcoming the local spatial–spectral feature uncertainty. Secondly, a multidimensional Siamese network (MDSN), which is updated by minimizing both contrastive loss and classiﬁcation loss, is designed to effectively exploit the CNN-based spatial–spectral features from multiple dimensions. Finally, the proposed MDSN combined with 3D-FHOG is utilized for small-sample HSI classiﬁcation to verify the effectiveness of our proposed fusion framework. The experimental results on three public data sets indicate that the proposed MDSN combined with 3D-FHOG is signiﬁcantly better than the representative handcrafted feature-based and CNN-based methods, which in turn demonstrates the superiority of the proposed fusion framework.


Introduction
Compared with gray-scale and RGB images, hyperspectral image (HSI) can provide a rich amount of spatial and spectral information of objects. Since the additional spectral information may help to overcome the existing difficulties of traditional image processing technology, HSI has attracted widespread concern in recent years. HSI classification [1][2][3][4], which is the focus of research in the field of HSI processing, has been widely applied in various areas, such as scene understanding [5,6], disease examination [7,8], face recognition [9,10] and city planning [11,12]. Note that the feature extracted from HSI is the basis of these applications, thus it is essential to obtain the more robust and effective spatial-spectral feature for HSI classification. However, due to the complexity and potential fatalness of the HSI acquisition environment, it is a laborious and time-consuming job to collect a huge amount of labeled samples. Therefore, how to effectively extract the spatial-spectral feature with small-scale training samples has become a hot spot of current research. Existing HSI feature extraction methods can be divided into two types: handcrafted feature extraction methods and deep feature extraction methods.
Before the development of deep learning, handcrafted feature extraction was the mainstream approach in the field of image processing, and its effectiveness has been verified in image matching and classification [13,14]. Lowe [13] designed an image feature extraction method called the scale-invariant feature transform (SIFT), which shows its (1) A 3D-FHOG descriptor is proposed to fully extract the handcrafted spatial-spectral feature of HSI pixels. It calculates the HOG features from three orthogonal planes to generate the final 3D-FHOG descriptor based on fuzzy fusion operation, which is able to overcome the local spatial-spectral feature uncertainty; (2) An effective Siamese network, i.e., MDSN is designed for further exploiting the multidimensional CNN-based spatial-spectral feature in the scenery of small-scale labeled samples. It mainly utilizes the hybrid 3D-2D-1D CNN to learn the spatial-spectral feature from multiple dimensions and is updated by minimizing both contrastive loss and classification loss. Compared with the single-dimensional CNN-based networks, the performance of MDSN is significantly better in small-sample HSI classification; (3) It provides a novel extensible fusion framework for the combination of hand-crafted and multidimensional CNN-based spatial-spectral features. More importantly, experimental results indicate that our proposed MDSN combined with 3D-FHOG features can achieve better performance than the handcrafted features-based and CNN-based algorithms, which in turn verifies the superiority of the proposed fusion framework.
The rest of this paper is organized as follows. Section 2 presents the related works of this study. The proposed methodology is presented in Section 3. Then, Section 4 reports the experimental results and discussions on three public data sets. Finally, a conclusion of this study is presented in Section 5.

Histogram of Oriented Gradients
Histogram of oriented gradients (HOG) proposed by Dalal et al. [31] is a classical handcrafted feature descriptor, which is generated by computing the gradients of pixels in a local area. Additionally, it not only provides the rotation invariance and under-standability, but also has a strong capacity of shape feature expression, which makes it widely used in image recognition. Surasak et al. [32] applied the HOG algorithm to human detection in video, which is able to accurately obtain the number of people for each video frame. Mao et al. [33] utilized the HOG-based method and support vector machine (SVM) classifier to perform the preceding vehicle detection, which shows excellent performance in different traffic scenarios. Qi et al. [34] designed a ship histogram of oriented gradient (S-HOG) to characterize ship targets, which proved to be also effective when ship size varies. Since each HSI pixel corresponds to a spectral curve with different changing patterns, it is suggested that constructing the statistics histogram of local gradient change for HSI pixels is an Remote Sens. 2022, 14,3796 4 of 24 effective solution for describing its local spatial-spectral features. Chen et al. [35] proposed a novel algorithm for hyperspectral face recognition by extracting the HOG feature, which outperforms several existing methods in the experiment. However, existing HOG-based algorithms ignore the characteristics of HSI, such as strong correlation between bands, vast amounts of redundant information and spatial-spectral feature uncertainty. Therefore, in this study, by introducing the fuzzy logic theory, we design a novel handcrafted feature descriptor named 3D-FHOG to fully exploit the spatial-spectral information and to overcome the spatial-spectral feature uncertainty.

Siamese Network
For the problem of small-scale labeled samples, it is suggested that a model named Siamese network [36,37] will be an effective solution for small-sample HSI classification. Specifically, Siamese network consists of two branches with the same architecture, and the image pairs are adopted as the input of Siamese network to minimize the contrastive loss. Early research of Siamese network mainly focuses on the application for target tracking. Tao et al. [38] first proposed to utilize the Siamese network in tracking tasks, which achieves state-of-the-art performance. Bertinetto et al. [39] designed a fully convolutional Siamese network to locate an exemplar image within a larger search image. Since the number of labeled samples will be augmented by generating the image pairs, Siamese network has been employed for few-shot classification tasks. Koch et al. [40] applied the Siamese network to the one-shot image recognition task, which obtains promising results. With respect to HSI classification, Zhao et al. [41] utilized the Siamese network to enlarge the training set and extract the effective spatial-spectral features, which is able to improve the classification performance. Liu et al. [42] proposed a Siamese network supervised with a margin ranking loss function for HSI classification, which can obtain better classification results than those of the conventional methods. Very recently, Cao et al. [43] designed a hybrid Siamese network called 3DCSN to perform HSI classification, which is suggested to be a robust and accurate classifier in the scenery of small-scale training samples. As described in [44], it can be known that the spatial-spectral features extracted from different CNN layers may contain the semantic information of objects with different scales. Therefore, in this paper, an effective Siamese network named MDSN is proposed to fully exploit the multidimensional CNN-based spatial-spectral feature. Moreover, we train the proposed MDSN by using both the contrastive loss function and classification loss function. Especially, our proposed MDSN is integrated with the idea of prototypical network in the testing phase, which is suggested to be more effective for small-sample HSI classification.

Methodology
The fusion framework of multidimensional CNN-based and handcrafted features for small-sample HSI classification is shown in Figure 1. In this study, to verify the effectiveness of our proposed fusion framework, we design the 3D-FHOG and MDSN for the handcrafted and multidimensional CNN-based feature extraction, respectively. As shown in Figure 1, small-sample HSI classification of MDSN combined with 3D-FHOG features mainly consists of three parts: firstly, principal component analysis (PCA) [45] algorithm is implemented on HSI to extract the representative band data; next, 3D patches divided from HSI are utilized to perform the 3D-FHOG feature extraction and the hybrid 3D-2D-1D CNN feature extraction; finally, through the linear layers and the distance metric between labeled and unlabeled samples with the 3D-FHOG and MDSN features, three class-score vectors are obtained (i.e., P 1 , P 2 and P 3 ), which are fused to compute the probability of a HSI pixel classified into a specific class.
Remote Sens. 2022, 14, 3796 5 of 24 1D CNN feature extraction; finally, through the linear layers and the distance metric between labeled and unlabeled samples with the 3D-FHOG and MDSN features, three classscore vectors are obtained (i.e., 1 P , 2 P and 3 P ), which are fused to compute the probability of a HSI pixel classified into a specific class. Figure 1. The fusion framework of multidimensional CNN and handcrafted features for small-sample HSI classification. Initially, 3D-FHOG is adopted as the handcrafted feature extraction method, and MDSN is utilized for multidimensional CNN-based feature extraction.

The Proposed 3D-FHOG
By introducing the fuzzy logic theory, the proposed 3D-FHOG is utilized to fully extract the handcrafted spatial-spectral feature and to overcome the spatial-spectral feature uncertainty. Figure 2 shows the schematic of 3D-FHOG feature extraction.  The fusion framework of multidimensional CNN and handcrafted features for small-sample HSI classification. Initially, 3D-FHOG is adopted as the handcrafted feature extraction method, and MDSN is utilized for multidimensional CNN-based feature extraction.

The Proposed 3D-FHOG
By introducing the fuzzy logic theory, the proposed 3D-FHOG is utilized to fully extract the handcrafted spatial-spectral feature and to overcome the spatial-spectral feature uncertainty. Figure 2 shows the schematic of 3D-FHOG feature extraction. 1D CNN feature extraction; finally, through the linear layers and the distance metric between labeled and unlabeled samples with the 3D-FHOG and MDSN features, three classscore vectors are obtained (i.e., 1 P , 2 P and 3 P ), which are fused to compute the probability of a HSI pixel classified into a specific class. Figure 1. The fusion framework of multidimensional CNN and handcrafted features for small-sample HSI classification. Initially, 3D-FHOG is adopted as the handcrafted feature extraction method, and MDSN is utilized for multidimensional CNN-based feature extraction.

The Proposed 3D-FHOG
By introducing the fuzzy logic theory, the proposed 3D-FHOG is utilized to fully extract the handcrafted spatial-spectral feature and to overcome the spatial-spectral feature uncertainty. Figure 2 shows the schematic of 3D-FHOG feature extraction.  Let H be the HSI, thus the HSI pixel with a spatial coordinate of (x, y) and λ bands can be represented as a λ-dimensional vector, as follows: where H(x, y, z) denotes the spectral response of HSI pixel, and z represents the spectral domain coordinate. Then, the λ-dimensional vector of HSI pixel is converted into the 3D local spatial-spectral neighborhood for HOG feature extraction. It can be known that the 3-D local spatial-spectral neighborhood of HSI pixels can be expressed as a group of orthogonal planes, including XY, XZ and YZ planes. Therefore, HOG feature extraction is implemented on XY, XZ and YZ planes, respectively. For the XY planes, assuming that H(x, y, k) denotes the spectral response of HSI pixel with spatial coordinate (x, y) at kth band, thus the x-axis oriented gradient and y-axis oriented gradient of H(x, y, k) can be calculated as follows: where G xy1 (x, y, k) and G xy2 (x, y, k) denote the y-axis and x-axis oriented gradient of H(x, y, k), respectively. Therefore, the oriented gradient of HSI pixels in the kth XY plane of 3-D local spatial-spectral neighborhood can be expressed as follows: where G xy (x, y, k) represents the gradient magnitude of H(x, y, k), α xy (x, y, k) is the gradient direction of H(x, y, k). Then, by setting the suitable block size and cell size of HOG descriptor in 3-D local spatial-spectral neighborhood, the final expression of HOG descriptor for XY planes is obtained. According to [31], the nine-bin histogram h k xy is obtained from each cell of HOG descriptor. Therefore, let the cell size of HOG descriptor be M × N, the bin h k xy (b) of h k xy can be expressed as follows: where s(a, b) is defined as: Since the histogram channels are spread over 0 to 180 degrees, thus β is normally set to be 20. In general, each block of HOG descriptor contains 2 × 2 cells, thus the HOG descriptor for XY planes can be represented as: Because HSI has the characteristic of low spatial resolution and wide distribution of ground objects, thus the 3D local spatial-spectral neighborhood of a HSI pixel that belongs to a specific class may contain the spatial-spectral information of other classes. This leads to the problem of spatial-spectral feature uncertainty in the process of spatial-spectral feature extraction. Especially, when performing the feature fusion of three orthogonal planes, the confidence of HOG feature extracted from each plane is uncertain. Therefore, fusing the HOG feature of three orthogonal planes directly may result in the performance degradation of small-sample HSI classification.
Fuzzy logic proposed by Zadeh [46] is a significant approach for overcoming the uncertainties among the raw data. Inspired by this, we apply the theory of fuzzy integration to the process of 3D-HOG feature extraction. According to [47,48], let S V represent the fuzzy integration function, thus it can be expressed as follows: where q denotes the fuzzy factor. Hence, by performing the fuzzy integration for the HOG features extracted from three orthogonal planes, 3D-HOG descriptor is trans-formed into the 3D-FHOG descriptor with strong robustness, as below: where FHOG k 3D represents the kth 3D-FHOG descriptor. Let L be the step size of 3D-FHOG feature, thus the final expression of 3D-FHOG descriptor for → H λ (x, y) can be formulated as below: As mentioned above, our proposed 3D-FHOG is able to provide a stricter mathematical explanation, which makes it more reliable to be utilized in the high sensitive areas. Moreover, it can not only fully extract the handcrafted spatial-spectral feature of HSI pixels, but also overcome the spatial-spectral feature uncertainty. Therefore, in this study, the proposed 3D-FHOG feature is used to enhance the performance of multidimensional CNN.

The Proposed MDSN
The structure of the MDSN is presented in Figure 3. As shown in Figure 3, the spatial-spectral features are first extracted by the 3D convolutional blocks from the input patches. Secondly, the 2D convolutional block is performed to further enhance the spatial feature. Then, spectral features are further extracted by the 1D convolutional block. Finally, through the linear layer, the obtained MDSN feature is adopted to compute contrastive loss, and the classification loss is calculated based on the class probability output from the linear classifier.
The structure of the MDSN is presented in Figure 3. As shown in Figure 3, the spatialspectral features are first extracted by the 3D convolutional blocks from the input patches. Secondly, the 2D convolutional block is performed to further enhance the spatial feature. Then, spectral features are further extracted by the 1D convolutional block. Finally, through the linear layer, the obtained MDSN feature is adopted to compute contrastive loss, and the classification loss is calculated based on the class probability output from the linear classifier. Figure 3. The structure of MDSN. Each convolutional block contains a convolutional layer, a batch normalization layer and a ReLU nonlinearity corresponding to its convolutional dimension. When performing the contrastive learning, two different patches are adopted at the same time. In the classification phase, only one patch is used.
Let Ψ be the training set of N labeled samples, as follows: represents the 3D patches of HSI pixels divided from HSI,

{ }
During the training process, our proposed MDSN is updated by minimizing both contrastive loss and classification loss. When performing the contrastive learning, let θ be the nonlinear parameters in MDSN. Thus, the formula of θ updated by contrastive loss can be expressed as follows: where ( ) 1 g  denotes the encoder function in the branch of MDSN utilized for computing the contrastive loss, c L represents the contrastive loss function, as below: Figure 3. The structure of MDSN. Each convolutional block contains a convolutional layer, a batch normalization layer and a ReLU nonlinearity corresponding to its convolutional dimension. When performing the contrastive learning, two different patches are adopted at the same time. In the classification phase, only one patch is used.
Let Ψ be the training set of N labeled samples, as follows: represents the 3D patches of HSI pixels divided from HSI, y = {y 1 , y 2 , . . . , y N } denotes the corresponding label. Next, a pair of 3D patches x i , x j is randomly selected from x, which is adopted as the input of MDSN. Let y i,j be the label of x i , x j , the value of which is defined as below: During the training process, our proposed MDSN is updated by minimizing both contrastive loss and classification loss. When performing the contrastive learning, let θ be the nonlinear parameters in MDSN. Thus, the formula of θ updated by contrastive loss can be expressed as follows: where g 1 (·) denotes the encoder function in the branch of MDSN utilized for computing the contrastive loss, L c represents the contrastive loss function, as below: where margin is a constant, and its typical value is 1.25. In the classification phase, we adopt the cross-entropy loss function L s to compute the classification loss. Besides, only one patch x i is used at a time. Hence, the formula of θ updated by classification loss can be represented as follows: where g 2 (·) represents the encoder function in the branch of MDSN utilized for computing the classification loss, h(·) denotes the class-score mapping function of linear layers,ŷ i is the predicted label of x i . In summary, by training with contrastive loss and classification loss, our proposed MDSN can effectively exploit the multi-dimensional CNN-based spatialspectral feature. Compared with the single-dimensional CNN-based models, MDSN is able to achieve better performance in small-sample HSI classification.

MDSN Combined with 3D-FHOG for Small-Sample HSI Classification
As described early, in some special HSI classification tasks with only a few labeled samples, we need to achieve higher classification accuracy without considering the computational cost. However, the scarcity of labeled samples makes it difficult to train an effective CNN-based classifier. In terms of the handcrafted feature, it can provide a stricter mathematical explanation for its feature extraction process, but it is difficult to be applied in some complex data processing tasks. Therefore, in this paper, we design a fusion framework of multidimensional CNN and handcrafted features for small-sample HSI classification. Especially, our proposed MDSN combined with 3D-FHOG is utilized to verify the effectiveness of our proposed fusion framework.
According to Equations (12) and (13), let Ψ k be the training set of N 1 labeled samples labeled with class k. After 3D-FHOG feature extraction, Ψ k can be expressed as below: where F(·) represents the 3D-FHOG feature embedding function. According to the idea of prototypical network [49,50], 3D-FHOG prototype can be calculated through the mean method, as follows: Then, based on the distance metric with 3D-FHOG prototypes, the probability of pixel x classified as class k can be formulated as below: where d(·) denotes the distance function, G is the total number of classes. As mentioned above, MDSN feature vector is extracted from the branch of MDSN utilized for computing the contrastive loss. Therefore, class probability based on the distance metric with hybrid-CNN prototypes can be formulated as follows: where c k denotes the hybrid-CNN prototype labeled with class k. We assume that P 3 denotes the class-score vector output from the linear layers. Hence, by performing the fusion operation on P 1 , P 2 and P 3 , the final class probability of HSI pixels, which is obtained from the MDSN combined with 3D-FHOG features, can be expressed as below: where δ(·) denotes the fusion function. Specifically, the fusion method for P 1 , P 2 and P 3 , such as concatenation and average, can be designed according to the needs of computational cost. In our experiment, P 1 , P 2 and P 3 are fused by average. To sum up, by integrating with the idea of prototype calculation, MDSN and 3D-FHOG features are effectively fused to calculate the class probability of HSI pixels, which is able to obtain more accurate results in small-sample HSI classification. The detailed process of MDSN combined with 3D-FHOG for small-sample HSI classification is described as follows (Algorithm 1).

Algorithm 1. MDSN combined with 3D-FHOG for small-sample HSI classification
Input: HSI pixels x. Output: The fused class probability P.
Step 1. Generating the 3D patches x 1 ∈ R W 1 ×W 1 ×K 1 and x 2 ∈ R W 2 ×W 2 ×K 2 from the local spatial-spectral neighborhood of x.
Step 3. Computing the distance metric between F(x 1 ) and c k to obtain the class probability P 1 .
Step 4. Calculating the distance metric between g 1 (x 2 ) and c k to obtain the class probability P 2 .
Step 6. Fusing the class probability P 1 , P 2 and P 3 by using Equation (24).

Data Sets
The Indian Pine (IP) data set which contains 16 classes and 10,249 samples was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in northwestern Indiana. Its spatial size and resolution are 145 × 145 pixels and 20-m per pixel. The number of bands is reduced to 200 by removing bands covering the region of water absorption. The false-color image and corresponding ground-truth map of the IP data set are shown in Figure 4. Table 1 lists the samples of the IP data set.

Experimental Setup
In our experiment, we mainly perform the small-sample HSI classification to demonstrate the effectiveness and robustness of the proposed fusion framework. Since HSI contains a rich amount of redundant information, we need to preprocess the HSI data to improve the efficiency of subsequent feature extraction. Accroding to [26], PCA algorithm, which is a commonly used strategy for preprocessing the HSI data, is first employed for the dimensionality reduction in HSI to extract the representative band data. After preprocessing, different handcrafted feature-based and CNN-based methods are utilized for the feature extraction of HSI pixels in the experiment.
In the training process of MDSN, two phases are included. Firstly, the parameter of our proposed model is updated by Adam optimizer [53] and contrastive loss when performing the contrastive learning. Besides, we use an initial learning rate of 5 × 10 −3 , and the weight decay is set to be 0 in contrastive learning phase. In the classification training phase, we use another Adam optimizer and cross-entropy loss to update the parameters of our model. Moreover, the learning rate is set to be 1 × 10 −3 , and the weight decay is set to be 5 × 10 −5 in this phase. According to [43], the input patch size of MDSN is empirically set to be 25 × 25 × 30 for IP and 25 × 25 × 15 for PU and SA, respectively.
In order to compare the classification performance of the above different methods, overall accuracy (OA), average accuracy (AA) and kappa coefficient (κ) are adopted as the evaluation metric. Quantitatively, the greater the values of OA, AA, and κ are, the better the classification result is. Moreover, the classification experiments of each method are repeated 5 times to avoid the accidental phenomenon. Classification accuracy mean and variance of each method are shown in the experimental statistical table.

Experimental Result and Analysis
In this section, the classification results of different feature extraction methods on three public HSI data sets are analyzed visually and quantitatively.

Influence of the Input Patch Size for 3D-FHOG
First, influence of the input patch size for our proposed 3D-FHOG is examined. Specifically, the size of input patch for 3D-FHOG is set to be 7 × 7 × 7, 9 × 9 × 9, 11 × 11 × 11, 13 × 13 × 13, 15 × 15 × 15 and 17 × 17 × 17 for analysis. Table 4 shows the classification perfor-mance of 3D-FHOG with different input patch sizes on IP data set. It can be concluded that when the size of input patch increases, the values of OA, AA and κ of 3D-FHOG show a trend of increasing first and then decreasing. When input patch size is 15 × 15 × 15, the classification performance of 3D-FHOG is the best. The reason for the decrease in classification performance is the redundant information contained in the local spatial-spectral neighborhood. Therefore, in the following experiment, the input patch size is fixedly set to be 15 × 15 × 15 for 3D-FHOG. In order to verify the effectiveness of 3D-FHOG in small-sample HSI classification, the proposed 3D-FHOG is compared with seven handcrafted feature-based methods. When three labeled samples per class are adopted as the training set, detail classification results with different handcrafted feature-based methods are listed in Tables 5-7 for IP, PU and SA data sets, respectively. As observed from Tables 5-7, we can make four observations, which are described as follows.  Firstly, compared with the 2D handcrafted feature descriptors, spectral feature and 3D handcrafted feature descriptors can achieve better classification performance. For instance, the OA of 3D-LBP is 15.70% higher than that of HOG on the IP data set. This indicates that only extracting the spatial information will destroy the correlation between spectral bands.
Secondly, the classification performance of 3D handcrafted feature descriptors is always superior to the spectral feature. Because 3D handcrafted feature descriptors can exploit both spatial and spectral information, which makes them more effective in HSI classification.
Thirdly, EMAP feature descriptor shows more excellent classification performance on PU and SA data sets. Since EMAP is based on morphological attribute filters and multi-level analysis, it is suggested that combining the features with different dimensions or scales is an effective solution for small-sample HSI classification.
Finally, by comparing with spectral, 2D handcrafted and 3D handcrafted feature descriptors, we can observe that our proposed 3D-FHOG feature descriptor obtains the best classification results in three public data sets. By integrating with fuzzy logic, 3D-FHOG is able to overcome the local spatial-spectral feature uncertainty and extract more discriminative spatial-spectral features.

Compared with CNN-Based Methods
To further demonstrate the effectiveness and robustness of the proposed fusion framework, 3D-FHOG + MDSN method is compared with eight representative CNN-based methods. These CNN-based methods include Semi-1D CNN, 3D FCN, Semi-3D CNN, 3D CNN, 1D RNN, HybridSN, 3DCSN and MDSN. Table 8 reports the OA, AA, κ and the classification accuracy of each class for HSI classification with three labeled samples per class on the IP data set. The statistical results suggest that HybridSN, 3DCSN and MDSN methods, which incorporate the multidimensional CNN feature, are superior to the single-dimensional CNN-based methods. Additionally, both Semi-3D CNN and 1D RNN can achieve excellent classification performance. This verifies that 3D CNN combined with semi-supervised learning is effective for small-sample HSI classification, and 1D RNN is able to take full advantage of spectral correlation and band-to-band variability. Moreover, MDSN and 3DCSN, which are based on the learning mechanism of Siamese network, can obtain more accurate results. Especially, 3D-FHOG + MDSN method has the best classification results on the IP data set, which indicates that our proposed fusion framework can fully combine the advantage of multidimensional CNN and handcrafted features. As for the PU data set, detail classification results with nine different methods are listed in Table 9. As observed in Table 9, the performance of MDSN, 3DCSN and HybridSN is significantly higher than that of the Semi-1D CNN, 3D FCN, Semi-3D CNN, 3D CNN and 1D RNN on the PU data set, which further demonstrates the superiority of multidimensional CNN features and learning mechanism of Siamese network in small-sample HSI classification. Furthermore, 3D-FHOG + MDSN consistently provides the best classification results on PU data set. With respect to the SA data set, for three labeled samples per class, the OA, AA and κ measure for each class using different approaches are shown in Table 10. From  -1D CNN, 3D FCN, Semi-3D CNN, 3D CNN, 1D RNN, HybridSN, 3DCSN and MDSN, respectively. The same conclusion can be drawn that classification accuracies obtained by 3D-FHOG + MDSN are better than others, which in turn demonstrates the effectiveness of our proposed fusion framework.

Classification Maps
Furthermore, classification performances of nine different methods are visually investigated on three public HSI data sets. Figure 7 shows the classification maps of Semi- 1D CNN, 3D FCN, Semi-3D CNN, 3D CNN, 1D RNN, HybridSN, 3DCSN, MDSN and 3D-FHOG + MDSN on the IP data set with three labeled samples per class. Comparing the classification maps of each method in Figure 7, the maps of HybridSN, 3DCSN, MDSN and 3D-FHOG + MDSN are obviously more similar to the ground map than those of other methods. Especially, the map of 3D-FHOG + MDSN, which effectively combines the multidimensional CNN and handcrafted features, is the most similar. Additionally, the classification maps on the PU data set using nine different methods with three labeled samples per class are shown  Figure 8. It can be seen from Figure 8 that more query samples are obviously assigned to the correct class on the maps of multidimensional CNN-based methods than others, and the map of 3D-FHOG + MDSN is more consistent with the ground-truth map, which indicates that the performance of MDSN is effectively enhanced with the 3D-FHOG feature. In terms of the SA data set, Figure 9 displays the classification maps resulting from nine different methods for the SA data set with three labeled samples per class. The same conclusion can be drawn that the map of 3D-FHOG + MDSN is more similar to the ground-truth map than other methods, which further shows its robustness in small-sample HSI classification. methods. Especially, the map of 3D-FHOG + MDSN, which effectively combines the multidimensional CNN and handcrafted features, is the most similar. Additionally, the classification maps on the PU data set using nine different methods with three labeled samples per class are shown in Figure 8. It can be seen from Figure 8 that more query samples are obviously assigned to the correct class on the maps of multidimensional CNN-based methods than others, and the map of 3D-FHOG + MDSN is more consistent with the ground-truth map, which indicates that the performance of MDSN is effectively enhanced with the 3D-FHOG feature. In terms of the SA data set, Figure 9 displays the classification maps resulting from nine different methods for the SA data set with three labeled samples per class. The same conclusion can be drawn that the map of 3D-FHOG + MDSN is more similar to the ground-truth map than other methods, which further shows its robustness in small-sample HSI classification.  methods. Especially, the map of 3D-FHOG + MDSN, which effectively combines the multidimensional CNN and handcrafted features, is the most similar. Additionally, the classification maps on the PU data set using nine different methods with three labeled samples per class are shown in Figure 8. It can be seen from Figure 8 that more query samples are obviously assigned to the correct class on the maps of multidimensional CNN-based methods than others, and the map of 3D-FHOG + MDSN is more consistent with the ground-truth map, which indicates that the performance of MDSN is effectively enhanced with the 3D-FHOG feature. In terms of the SA data set, Figure 9 displays the classification maps resulting from nine different methods for the SA data set with three labeled samples per class. The same conclusion can be drawn that the map of 3D-FHOG + MDSN is more similar to the ground-truth map than other methods, which further shows its robustness in small-sample HSI classification. (a)    (a)

Influence of Training Sample Size
To further illustrate the superiority of 3D-FHOG + MDSN with different numbers of labeled samples, we take 3, 5, 7, 9, and 11 labeled samples for each class to build the training data set. Specifically, we have conducted three groups of experiment on three public HSI data sets. Then, the curve change of nine methods is obtained. It can be seen from Figure 10 that the OA of each method generally rises as the number of labeled samples increases. However, single-dimensional CNN-based methods are unstable in the scenery of small-scale training samples, which may result in the sharp decrease in classification accuracy when the number of labeled samples increases. Especially, the proposed 3D-FHOG + MDSN method outperforms other methods in most cases, which demonstrates its adaptability to the variance of the number of labeled samples. Additionally, Figures 11  and 12 display the AA and κ measure as functions of the number of labeled samples per class. The same conclusion can be drawn that the AA and κ of our proposed 3D-FHOG + MDSN method is always the best in terms of different training sample sizes. Besides, we also find that the gap between the classification accuracy of various methods increases when the number of training samples is fewer. This indicates that the performance of CNN-based method will be enhanced with the increase in labeled samples, which in turns minimizes the contribution of handcrafted features. Meanwhile, incorporating the handcrafted feature with the CNN-based method will not cause a decrease in classification accuracy, but will improve the reliability of classification result. Hence, it is suggested that our proposed 3D-FHOG + MDSN method is more robust and reliable in the scenery of small-scale training samples. To further illustrate the superiority of 3D-FHOG + MDSN with different numbers of labeled samples, we take 3, 5, 7, 9, and 11 labeled samples for each class to build the training data set. Specifically, we have conducted three groups of experiment on three public HSI data sets. Then, the curve change of nine methods is obtained. It can be seen from Figure 10 that the OA of each method generally rises as the number of labeled samples increases. However, single-dimensional CNN-based methods are unstable in the scenery of small-scale training samples, which may result in the sharp decrease in classification accuracy when the number of labeled samples increases. Especially, the proposed 3D-FHOG + MDSN method outperforms other methods in most cases, which demonstrates its adaptability to the variance of the number of labeled samples. Additionally, Figure 11 and Figure 12 display the AA and κ measure as functions of the number of labeled samples per class. The same conclusion can be drawn that the AA and κ of our proposed 3D-FHOG + MDSN method is always the best in terms of different training sample sizes. Besides, we also find that the gap between the classification accuracy of various methods increases when the number of training samples is fewer. This indicates that the performance of CNN-based method will be enhanced with the increase in labeled samples, which in turns minimizes the contribution of handcrafted features. Meanwhile, incorporating the handcrafted feature with the CNN-based method will not cause a decrease in classification accuracy, but will improve the reliability of classification result. Hence, it is suggested that our proposed 3D-FHOG + MDSN method is more robust and reliable in the scenery of small-scale training samples.  To further illustrate the superiority of 3D-FHOG + MDSN with different numbers of labeled samples, we take 3, 5, 7, 9, and 11 labeled samples for each class to build the training data set. Specifically, we have conducted three groups of experiment on three public HSI data sets. Then, the curve change of nine methods is obtained. It can be seen from Figure 10 that the OA of each method generally rises as the number of labeled samples increases. However, single-dimensional CNN-based methods are unstable in the scenery of small-scale training samples, which may result in the sharp decrease in classification accuracy when the number of labeled samples increases. Especially, the proposed 3D-FHOG + MDSN method outperforms other methods in most cases, which demonstrates its adaptability to the variance of the number of labeled samples. Additionally, Figure 11 and Figure 12 display the AA and κ measure as functions of the number of labeled samples per class. The same conclusion can be drawn that the AA and κ of our proposed 3D-FHOG + MDSN method is always the best in terms of different training sample sizes. Besides, we also find that the gap between the classification accuracy of various methods increases when the number of training samples is fewer. This indicates that the performance of CNN-based method will be enhanced with the increase in labeled samples, which in turns minimizes the contribution of handcrafted features. Meanwhile, incorporating the handcrafted feature with the CNN-based method will not cause a decrease in classification accuracy, but will improve the reliability of classification result. Hence, it is suggested that our proposed 3D-FHOG + MDSN method is more robust and reliable in the scenery of small-scale training samples.  In summary, the performance of MDSN enhanced with 3D-FHOG feature in smallsample HSI classification is better than those of representative handcrafted and CNNbased spatial-spectral feature extraction methods, especially when the training sample size is smaller. This in turn verifies the effectiveness of the proposed fusion framework.

Time Consumption
In this section, the running time of different methods is analyzed to evaluate their computational efficiency. Table 11 reports the running time of different methods on the three HSI data sets with three labeled samples per class. All the experiments are conducted on a computer with an Intel Core i3-4160 processor with 3.6 GHz, 8 GB of DDR3 RAM, an NVIDIA Geforce RTX 1060 graphical processing unit (GPU). For the higher computational cost methods including Semi-1D CNN, 3D FCN and Semi-3D CNN, the processing time is long. In terms of 3D CNN, it has a relatively short training time, but achieves poor classification performance. In addition, for the lightweight networks (i.e., 1D RNN and HybridSN), the running time is short, and these methods can obtain better classification results. Additionally, for the 3DCSN, MDSN and 3D-FHOG+ MDSN, since these methods are based on the idea of Siamese network and composed of CNN blocks with multiple dimensions, more time is consumed by learning the multi-dimensional CNN features from the HSI pixel pairs. Meanwhile, the classification performances of these methods are effectively improved.  In summary, the performance of MDSN enhanced with 3D-FHOG feature in smallsample HSI classification is better than those of representative handcrafted and CNN-based spatial-spectral feature extraction methods, especially when the training sample size is smaller. This in turn verifies the effectiveness of the proposed fusion framework.

Time Consumption
In this section, the running time of different methods is analyzed to evaluate their computational efficiency. Table 11 reports the running time of different methods on the three HSI data sets with three labeled samples per class. All the experiments are conducted on a computer with an Intel Core i3-4160 processor with 3.6 GHz, 8 GB of DDR3 RAM, an NVIDIA Geforce RTX 1060 graphical processing unit (GPU). For the higher computational cost methods including Semi-1D CNN, 3D FCN and Semi-3D CNN, the processing time is long. In terms of 3D CNN, it has a relatively short training time, but achieves poor classification performance. In addition, for the lightweight networks (i.e., 1D RNN and HybridSN), the running time is short, and these methods can obtain better classification results. Additionally, for the 3DCSN, MDSN and 3D-FHOG+ MDSN, since these methods are based on the idea of Siamese network and composed of CNN blocks with multiple dimensions, more time is consumed by learning the multi-dimensional CNN features from the HSI pixel pairs. Meanwhile, the classification performances of these methods are effectively improved. Especially, our proposed 3D-FHOG + MDSN contains the additional time of handcrafted feature extraction (HFE). Table 12 reports the HFE time of 3D-FHOG + MDSN on the three HSI data sets with three labeled samples per class. Note that the HFE time represents the time of HFE for all HSI samples in the data set. Hence, the more samples the data set contains, the more time it takes for 3D-FHOG feature extraction. In some high-sensitive areas, we can spend more time to obtain the more reliable and accurate results. Note that HFE time for each pixel is 0.07s. In real military applications, a military target contained in the HSI is generally composed of about 100 pixels, which only takes 7s for HFE. Therefore, the increase in time is acceptable. To sum up, in terms of some special small-sample HSI classification tasks without considering the computational cost, our proposed method will be an effective solution to achieve more accurate and reliable classification results.

Conclusions
In this paper, a fusion framework of multidimensional CNN and handcrafted features is proposed for small-sample HSI classification. Specifically, we design the 3D-FHOG descriptor to extract the handcrafted spatial-spectral feature, which is suggested to be more robust by overcoming the local spatial-spectral feature uncertainty. Then, to further extract the CNN-based spatial-spectral feature, an effective Siamese network, i.e., MDSN is proposed, which can effectively achieve the integration of CNN-based spatial-spectral features from multiple dimensions. Finally, our proposed MDSN combined with 3D-FHOG is employed for small-sample HSI classification to verify the effectiveness of our proposed fusion framework. Experiment results on three public HSI data sets indicate that our proposed MDSN combined with 3D-FHOG is superior to the representative handcrafted and CNN-based spatial-spectral feature extraction methods, which in turn demonstrates the effectiveness of the proposed fusion framework. More importantly, our proposed fusion framework has the advantage of expandability. In the future work, we will continue to explore the more discriminative and efficient spatial-spectral feature extraction methods, and integrate them into our proposed fusion framework, which helps to improve the small-sample HSI classification accuracy.