3D Octave and 2D Vanilla Mixed Convolutional Neural Network for Hyperspectral Image Classification with Limited Samples

Abstract: Owing to their outstanding feature extraction capability, convolutional neural networks (CNNs) have been widely applied to hyperspectral image (HSI) classification problems and have achieved impressive performance. However, it is well known that 2D convolution fails to consider spectral information, while 3D convolution requires a huge computational cost. In addition, the cost of labeling and the limitation of computing resources make it urgent to improve the generalization performance of the model with scarcely labeled samples. To relieve these issues, we design an end-to-end 3D octave and 2D vanilla mixed CNN, namely Oct-MCNN-HS, based on the typical 3D-2D mixed CNN (MCNN). It is worth mentioning that two feature fusion operations are deliberately constructed to further improve feature discriminability and practical performance. That is, a 2D vanilla convolution merges the feature maps generated by the 3D octave convolutions along the channel direction, and homology shifting aggregates the information of the pixels located at the same spatial position. Extensive experiments are conducted on four publicly available HSI datasets to evaluate the effectiveness and robustness of our model, and the results verify the superiority of Oct-MCNN-HS in both efficacy and efficiency.


Introduction
Hyperspectral images (HSIs), as an important outcome of remote sensing that records both abundant spatial information and hundreds of spectral bands of the Earth's surface, play a significant role in many fields, such as environmental monitoring [1][2][3], precision agriculture [4], and military applications [5], among others. In these multitudinous applications, how to deal with information redundancy and extract effective features for precisely distinguishing different targets is a fundamental issue that has drawn increasing attention in recent years.
Considering the rich spectral information in HSIs, several typical machine learning classifiers have been used for target discrimination, for example, the k-nearest neighbor (kNN) [6], decision tree [7], extreme learning machine (ELM) [8], support vector machine (SVM) [9], and random forest (RF) [10]. Although these algorithms can construct a feature representation based on the feature similarity among pixels, they perform suboptimally due to the spectral variability of intra-category pixels and the spectral similarity of inter-category pixels. To remedy this redundancy problem, some dimensionality reduction techniques focusing on extracting effective features are widely used, such as principal component analysis (PCA) [11], independent component analysis (ICA) [12], and linear discriminant analysis (LDA) [13]. However, the aforementioned methods merely utilize the properties of the spectral domain but ignore the inherent characteristics of the spatial domain, which may cause misclassified pixels to some extent and lead to "salt-and-pepper" noise on the classification maps. Thus, in order to further exploit the universal properties, some classification methods have been developed that jointly consider spatial-spectral correlation information, for instance, sparse representation [14], Markov random fields [15], and edge-preserving filtering [16], to name just a few.
Recently, research on HSI classification has been undergoing a paradigm shift. This phenomenon is attributed to the remarkable hierarchical learning ability of deep learning-based methods [17,18] for high-level abstract features, which has pushed the traditional handcrafted, feature-based models aside. The stacked autoencoder (SAE) [19] and the deep belief network (DBN) [20] pioneered the probing into feature acquisition of HSIs, owing largely to their strong nonlinear capabilities. However, due to their compulsory requirement of a one-dimensional input format, a significant amount of effective information is inevitably discarded when applying these methods to HSI data. To alleviate this issue, 1D CNNs [21] for extracting spectral features, 2D CNNs [22][23][24] for acquiring spatial context information, and 3D CNNs [25][26][27] for jointly obtaining spatial-spectral information have been successively proposed, which has promoted convolutional neural networks (CNNs) to the most popular networks for HSI processing. Specifically, both 1D and 2D convolutions lack consideration of certain feature correlations to some extent, while 3D convolution captures spatial-spectral priors at the expense of a huge computational cost. For the purpose of extracting the most universal features, different kinds of fusion strategies have been derived for better discriminative ability. Typically, in [28,29], a parallel dual-branch framework is proposed, where 1D convolution is used to extract spectral features and a 2D CNN is added for spatial information acquisition. Moreover, in our previous work [30], by attaching one 2D convolution layer after three 3D convolution layers, we accomplished the spatial-spectral feature integration by simply fusing the generated feature maps, and meanwhile implicitly realized a dimensionality reduction for better efficiency.
With the deepening understanding of traditional convolution operations, some novel convolution variants have recently been proposed and exhibit appealing performance. For instance, in [31], conditionally parameterized convolution is proposed to break through the setting of sharing convolution kernels for all of the examples in the vanilla convolution operation and learn specialized convolutional kernels for each sample. Reference [32] proposes a network that extracts vital information by performing relatively complex operations on the representative part disassembled from the input feature maps, and extracts hidden information by using lightweight operations on the remaining part, thereby improving the accuracy with an acceptable inference speed. In addition, some novel architectures have also been considered for further performance improvement. For instance, graph convolution networks (GCNs) [33,34] can introduce a local graph structure to enhance the convolution characteristics. However, how to transform the data into a graph structure and reveal the deep relationships between nodes is still challenging. Besides, recurrent neural networks (RNNs) [35][36][37] regard all of the spectral bands as an image sequence, whose drawback lies in the limited consideration of spatial features. Moreover, the high computational cost and disappointing performance under limited samples are significant bottlenecks of GCNs and RNNs in the HSI classification task, particularly when using large-scale image data. Therefore, considering only a laptop computer as the supporting device, convolution operations are still the core technology to jointly harvest spatial-spectral information.
In addition to constructing different convolution methods for better feature extraction, several other strategies have also been proposed to improve the effectiveness of information acquisition. Two representatives are the attention mechanism [38][39][40], which emphasizes key points while suppressing interference, and the covariance pooling operation for sublimating characteristic information. On the one hand, the attention mechanism pursues highlighting the spectral bands and spatial locations that have more obvious discriminative properties while suppressing the unnecessary ones, so that the representation ability of CNNs can be greatly enhanced. In [30], motivated by the channel-wise attention mechanism, a channel-wise shift scheme is proposed to highlight the important principal components and recalibrate the channel-wise feature response. On the other hand, the covariance pooling operation [22,30,41] attempts to obtain second-order statistics by calculating the covariance matrix between feature maps, which leads to a more significant and compact representation. However, in the process of mapping a covariance matrix on the Riemannian manifold space to the Euclidean space, the loss of partial effective information and the addition of several extra calculation operations constitute the evident disadvantages of the covariance pooling scheme.
To overcome the aforementioned drawbacks and capture more detailed spatial-spectral information, this paper proposes an improved CNN-based network architecture for the HSI classification problem. Specifically, based on the recently proposed MCNN-CP [30] model (3D-2D mixed CNN with covariance pooling), our model further employs 3D octave and 2D vanilla mixed convolutions with homology shifting, and is referred to as Oct-MCNN-HS for short. Through a comprehensive comparison with other state-of-the-art methods, Oct-MCNN-HS achieves twofold breakthroughs in both convergence speed and classification accuracy in the classification task with small-sized labeled samples. The main contributions of our work are listed as follows.

1. Aiming at the classification of tensor-type hyperspectral images, we design 3D octave and 2D vanilla mixed convolutions to mine potential spatial-spectral features. Specifically, we first decompose the feature maps into different frequency components, and then apply 3D convolutions to accomplish the complementation of inter-frequency characteristics. Finally, a 2D vanilla convolution is attached to fuse the feature maps along the channel direction, which reduces the output dimension and improves the generalization performance.

2. Note that the final feature maps are sent to the classifier along the channel dimension; that is, the information at the same spatial location is discretely distributed in vector form. Therefore, we propose the homology-shifting operation to aggregate the information of the same spatial location along the channel direction to ensure more compact features. Notably, homology shifting can enhance the generalization performance and stability of the model without any computational consumption.

3. Extensive experiments are conducted on four HSI benchmark datasets with small-sized labeled samples. The results show that the proposed Oct-MCNN-HS model outperforms other state-of-the-art deep learning-based approaches in terms of both efficacy and efficiency. The proposed model with optimized parameters has been uploaded online at https://github.com/ZhengJianwei2/Oct-MCNN-HS, and the source code will be released after the review phase.
The remainder of this paper is organized as follows. Section 2 gives a brief review of the related works. In Section 3, we present the proposed network architectures in detail. The comprehensive experiments on four HSIs are conducted in Section 4. Finally, Section 5 draws the conclusion and provides some suggestions for future work.

Related Work
Based on MCNN-CP [30], the backbone architecture of our model is formed by trimming the suboptimal parts and retaining the competitively capable ones. In this section, we briefly review the reserved components for self-containment. Based on the fact that HSIs naturally possess a tensor structure and contain redundant spectral information, MCNN-CP starts with a dimensionality reduction operation using PCA, followed by 3D-2D mixed convolution layers for obtaining representative features. Afterward, the covariance pooling scheme is appended to fully extract the second-order information from the feature maps. In addition, in [42], a 2D octave convolution based on multi-frequency feature mining was proposed and showed its superiority.

3D-2D Mixed Convolutions
By virtue of its dramatic performance, convolution has been the most favored modus operandi in the visual processing community since the birth of deep learning. In practice, 2D convolution and 3D convolution are the two most representative forms. Between these two, 2D convolution mainly seeks spatial features but neglects the valuable inter-spectral information, making it insufficient to fully acquire discriminative features and causing it to perform suboptimally in most applications. Relatively speaking, 3D convolution can naturally extract more discriminative spatial-spectral information due to the tensor essence of hyperspectral images. The most unacceptable point of 3D convolution is that, as the scale of the generated feature maps increases, the operation becomes much more complicated and demands a huge computational cost. Therefore, probing a balanced way to integrate the advantages of 3D convolutions, which capture more information, and 2D convolutions, which run with higher efficiency, is a worthwhile attempt. In our studies, we found that a significantly effective yet easy-to-accomplish way to this end is to simply add a layer of 2D convolution after several 3D convolutional operations. To be specific, the added 2D convolution can fuse the feature maps generated by the 3D convolutions, thereby playing the dual role of obtaining richer information while simultaneously reducing the dimension of the spectral bands.
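As a concrete illustration of this dual role, the following sketch propagates tensor shapes through a hypothetical 3D-2D stack. The patch size, component number, and kernel counts are borrowed from the experimental settings reported later; zero-padded ("same") convolutions are assumed, so this is a shape-bookkeeping sketch rather than the actual implementation.

```python
def conv3d_out_shape(shape, filters, kernel=(3, 3, 3), padding="same"):
    """Output shape of a 3D convolution on an (h, w, d, c) tensor."""
    h, w, d, _ = shape
    if padding == "same":  # zero-padding keeps the spatial/spectral size
        return (h, w, d, filters)
    kh, kw, kd = kernel
    return (h - kh + 1, w - kw + 1, d - kd + 1, filters)

def merge_for_2d(shape):
    """Merge the spectral and channel axes so a 2D convolution can fuse them."""
    h, w, d, c = shape
    return (h, w, d * c)

def conv2d_out_shape(shape, filters, kernel=(3, 3), padding="same"):
    """Output shape of a 2D convolution on an (h, w, c) tensor."""
    h, w, _ = shape
    if padding == "same":
        return (h, w, filters)
    kh, kw = kernel
    return (h - kh + 1, w - kw + 1, filters)

# An 11 x 11 patch with 30 principal components and one input channel.
s = (11, 11, 30, 1)
for f in (8, 16, 32):            # the three stacked 3D convolution layers
    s = conv3d_out_shape(s, f)
s2d = merge_for_2d(s)            # 30 bands x 32 channels merged to 960 maps
out = conv2d_out_shape(s2d, 64)  # the single fusing 2D convolution
```

Note how the trailing 2D layer collapses the 960 merged band-channel maps into 64 feature maps, which is precisely the implicit dimensionality reduction mentioned above.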

Covariance Pooling
The covariance pooling method computes the covariance matrix between different feature maps, and realizes the mapping from the Riemannian manifold space to the Euclidean space through the logarithmic function, so as to obtain a vector with second-order information for classification. To some extent, as a plug-and-play method, covariance pooling can be regarded as an operation to sublimate the information of the feature maps generated by convolutions.
However, the covariance pooling method also has some unavoidable shortcomings. Firstly, the non-trivial covariance computation inevitably requires additional calculation and storage consumption. Secondly, due to the need for space mapping, some effective information is bound to be lost to a certain extent. Therefore, it is necessary to hunt for a simpler feature sublimation operation that can endow the model with a more formidable mining capability.
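For clarity, the scheme can be sketched in a few lines of NumPy. This is a generic log-Euclidean implementation, not the authors' released code; the regularization constant `eps` is our own addition to keep the covariance matrix positive definite before taking the matrix logarithm.

```python
import numpy as np

def covariance_pooling(feature_maps, eps=1e-6):
    """Map (h, w, c) feature maps to a log-Euclidean covariance vector.

    Sketch of the scheme described in the text: compute the channel
    covariance matrix, take its matrix logarithm via eigendecomposition
    (the Riemannian -> Euclidean mapping), and keep the upper triangle.
    """
    h, w, c = feature_maps.shape
    X = feature_maps.reshape(h * w, c)
    X = X - X.mean(axis=0, keepdims=True)
    cov = X.T @ X / (h * w - 1) + eps * np.eye(c)  # regularize to stay SPD
    vals, vecs = np.linalg.eigh(cov)
    log_cov = (vecs * np.log(vals)) @ vecs.T       # matrix logarithm
    iu = np.triu_indices(c)
    return log_cov[iu]                             # length c * (c + 1) / 2
```

The extra eigendecomposition and the c(c+1)/2-dimensional output make the additional calculation and storage costs mentioned above tangible.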

2D Octave Convolution
In [42], the authors claim that the output feature maps of a convolution layer are characteristically similar to those of a natural image, which prompts a factorization into a high spatial frequency locally describing the fine details and a low spatial frequency representing the global structure. Consequently, 2D octave convolution is proposed to process the two components with different 2D vanilla convolutions (ordinary convolutions without additional operations) at their corresponding frequencies and to obtain global and local information under two resolution representations. Since the two frequency components are decomposed along the channels, the total number of feature maps remains constant, while the resolution of the low-frequency part is reduced. This reduces the spatial redundancy for inference acceleration and enlarges the receptive field for richer contextual information. However, this method incurs new hyperparameters, which cannot be set empirically and require an exhaustive search.
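The channel-wise decomposition at the heart of this idea can be sketched as follows. This is a simplified NumPy illustration of the split described in [42]; the function names and the average-pooling-based down-sampling are our own choices.

```python
import numpy as np

def octave_split(x, alpha=0.25, s=2):
    """Split (h, w, c) features into high/low-frequency parts.

    The first (1 - alpha) * c channels stay at full resolution (high
    frequency); the remaining alpha * c channels are average-pooled by a
    factor s (low frequency). `alpha` and `s` are the hyperparameters the
    text refers to.
    """
    h, w, c = x.shape
    c_low = int(alpha * c)
    x_high, x_low = x[..., : c - c_low], x[..., c - c_low:]
    # average pooling with kernel s x s and stride s
    x_low = x_low.reshape(h // s, s, w // s, s, c_low).mean(axis=(1, 3))
    return x_high, x_low
```

The total channel count is unchanged, while the low-frequency part occupies only 1/s² of the spatial positions, which is exactly the source of the spatial-redundancy saving.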

Methodology
In this section, as shown in Figure 1, we introduce our proposed Oct-MCNN-HS. Following a similar backbone architecture to MCNN-CP, we also preprocess the HSI data using the PCA technique and adopt a patch-wise treatment of the data as the input. Different from MCNN-CP, we first develop 3D octave convolutions to jointly extract the spectral-spatial information hidden in the principal components. By considering the benefits of different frequency components, the 3D octave convolutions disassemble the feature maps into two parts, i.e., high-frequency components and low-frequency components, through which the generated dual-branch and multi-scale convolutions can provide even more discriminative features. The mixed use of 3D octave convolutions and a 2D vanilla convolution is then proposed to dig deeply into the spatial-spectral features hidden in the principal components. Finally, the feature maps generated by the convolutions are fused through the homology-shifting operation, which aggregates the pixel information at the same spatial position along the spectral dimension to further sublimate the feature information, before being sent to the fully connected layers.

3D Octave and 2D Vanilla Mixed Convolutions
Although the 3D-2D mixed convolutions perform well in spatial-spectral feature extraction, they neglect the fact that most visual information can be conveyed at different frequencies. Following the fact that HSIs can be decomposed into low spatial frequency components describing a smooth structure and high spatial frequency components containing fine details, we decompose the output features of the convolutional layers into features with different spatial frequencies. As shown in Figure 1, we construct 3D octave convolutions to replace the 3D vanilla convolution layers, which yields a multi-frequency feature representation. Roughly speaking, the convolutional feature maps are up-sampled and down-sampled to obtain new maps with different spatial resolutions. Evidently, the low spatial resolution maps are employed as the low-frequency components, and the high-resolution maps serve as the high-frequency components.
Let X ∈ R^{h×w×c} denote the input tensor. As a whole, we regard it as a high-frequency module that can be disassembled into multi-frequency components through two-branch and multi-scale convolutions. One branch performs a 3D vanilla convolution to seek the high-frequency component X^H ∈ R^{h×w×c}, and the other branch executes both average pooling and 3D vanilla convolution to recruit the low-frequency component X^L ∈ R^{(h/s)×(w/s)×c}. In the original octave convolution, α ∈ [0, 1] is the proportion of channels allocated to the low-frequency part. Different from this, we apply the above-mentioned two-branch and multi-scale approach to fully capture the predominant information and avoid an exhausting search for the optimal hyperparameter α. The specific formulations are as follows:

X^H = Conv3D(X; W^{H→H}),
X^L = Conv3D(pool(X, s); W^{H→L}),

where Conv3D denotes the 3D convolution, W^{H→H} and W^{H→L} represent the intra-frequency and inter-frequency convolution kernels, respectively, and pool(X, s) is an average pooling operation with kernel size s × s and stride s.
Obtaining {X^H, X^L} from the input X can be seen as a simplified octave convolution, where the paths related to the low-frequency input are disabled and only two paths are adopted. In contrast, as shown in Figure 1, the module from {X^H, X^L} to {Y^H, Y^L} is a complete octave convolution with a four-branch structure, which effectively extracts the information of the different frequency components in their corresponding frequency tensors and synchronously fulfills inter-frequency communication. Specifically, for the high-frequency component Y^H, we first calibrate the impact of the high-frequency component X^H by using a regular 3D convolution for the intra-frequency update, and then up-sample the feature maps generated by convolving the low-frequency component X^L to achieve inter-frequency communication. Similarly, the low-frequency component X^L is employed for the intra-frequency update and the high-frequency component X^H is down-sampled for inter-frequency communication, thereby jointly obtaining the low-frequency feature maps Y^L. Thus, the equations of Y^H and Y^L can be formulated as

Y^H = X^{H→H} + X^{L→H} = Conv3D(X^H; W^{H→H}) + upsample(Conv3D(X^L; W^{L→H}), s),
Y^L = X^{L→L} + X^{H→L} = Conv3D(X^L; W^{L→L}) + Conv3D(pool(X^H, s); W^{H→L}),

where X^{H→H} and X^{L→L} denote the intra-frequency updates, X^{H→L} and X^{L→H} imply inter-frequency communication, and upsample(X, s) is an up-sampling operation by a factor of s via nearest interpolation. Finally, considering the repercussions of spatial resolution on subsequent operations, similar to the generation of Y^L, we obtain the low-frequency output Y by jointly performing 3D convolutions on the low-frequency Y^L and the feature maps obtained by down-sampling the high-frequency Y^H:

Y = Conv3D(Y^L; W^{L→L}) + Conv3D(pool(Y^H, s); W^{H→L}).

In a nutshell, although 3D octave convolutions slightly increase the consumption, they provide more discriminative features by decomposing the feature maps into different frequency components, and also accomplish the complementation of spatial features through two mechanisms, viz., intra-frequency update and inter-frequency communication.
In other words, this multi-scale and dual-branch structure deserves to be introduced for more information extraction. After that, a 2D vanilla convolutional layer is attached to pursue the fusion of the feature maps along the channel direction, through which the output Z achieves both dimensionality reduction and information aggregation:

Z = Conv2D(Y; W),

where Conv2D denotes the 2D vanilla convolution and W is its kernel. By constructing a hybrid of 3D octave and 2D vanilla convolutions, it is not only possible to effectively mine the potential spatial-spectral information from the tensor-type hyperspectral data, but also to achieve a balance between convergence performance and discrimination efficiency. In this way, the generalization performance of the designed model on different datasets is improved to a certain extent.
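As a hedged sketch of the four-branch update from {X^H, X^L} to {Y^H, Y^L}, the following NumPy code replaces each Conv3D with a simple 1 × 1 × 1 channel-mixing matrix multiplication (our simplification, not the trained layers), so that the data flow of pooling, up-sampling, and the two communication paths stays visible:

```python
import numpy as np

def pool(x, s):
    """Average pooling with kernel s x s and stride s over the spatial axes."""
    h, w, d, c = x.shape
    return x.reshape(h // s, s, w // s, s, d, c).mean(axis=(1, 3))

def upsample(x, s):
    """Nearest-neighbor up-sampling by a factor of s over the spatial axes."""
    return x.repeat(s, axis=0).repeat(s, axis=1)

def conv(x, Wm):
    """Stand-in for Conv3D: a 1 x 1 x 1 channel-mixing convolution."""
    return x @ Wm

def octave_step(x_h, x_l, W_hh, W_ll, W_hl, W_lh, s=2):
    """One complete octave update {X^H, X^L} -> {Y^H, Y^L} (four paths)."""
    y_h = conv(x_h, W_hh) + upsample(conv(x_l, W_lh), s)  # intra + inter
    y_l = conv(x_l, W_ll) + conv(pool(x_h, s), W_hl)      # intra + inter
    return y_h, y_l
```

The final low-frequency output Y is obtained in the same spirit by pooling the high-frequency result and mixing it with the low-frequency one.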

Homology Shifting
During our previous studies, we found that applying a certain sublimation operation before sending the generated feature maps to the classifier has the potential to reveal deeper information. For example, in [30], the covariance pooling scheme obtains second-order statistics by calculating the covariance between spatial-spectral feature maps, thereby significantly improving the classification accuracy when limited training samples are available. In this work, we explore the homology-shifting operation for further feature excavation without any extra computational cost.
Essentially, the homology-shifting operation can be regarded as an image reshaping operation, as shown in Figure 2, which only needs to aggregate the pixel information at the same spatial position along the channel dimension and requires no extra computational overhead. Specifically, the homology-shifting operation can be formulated as

X' = HS(X) = ConCat(B_1, B_2, ..., B_{h×w}),

where X ∈ R^{h×w×c²} denotes the feature maps following the convolution operations, HS is the abbreviation of the homology-shifting operation, A_i ∈ R^{1×1×c²} is composed of the corresponding c² pixels at the ith spatial position of X, B_i ∈ R^{c×c} is the block generated by regrouping A_i along the channel direction, X' ∈ R^{hc×wc} represents the final output feature map, and the function ConCat indicates that the blocks {B_1, B_2, ..., B_{h×w}} are spliced according to their original spatial positions. By sequentially shifting the feature maps of h × w spatial size, each block A_i corresponding to the ith spatial position on the c² feature maps is reorganized along the channel direction to form the block B_i. Consequently, the blocks {B_1, B_2, ..., B_{h×w}} are spliced together and an aggregated feature map X' is generated. After continuously extracting features through convolutions, the feature maps generated by the consistent spatial down-sampling treatment can already ensure that the obtained information is useful for a certain high-level semantic. On that basis, performing the additional homology-shifting operation further merges the spatial-spectral information of the multiple feature maps into a more discriminative one. Finally, the transformed feature map is flattened into a vector and input into the classifier; to a certain extent, the information of different spectral bands at the same spatial location is thus aggregated.
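Because homology shifting is pure reorganization, it reduces to a single depth-to-space-style reshape. The following NumPy sketch is our illustration of the described operation, not the released implementation:

```python
import numpy as np

def homology_shift(x):
    """Homology shifting: (h, w, c*c) feature maps -> (h*c, w*c) map.

    The c*c values at each spatial position are regrouped into a c x c
    block B_i, and the blocks are tiled at their original spatial
    positions; no arithmetic is performed, only a reshape.
    """
    h, w, c2 = x.shape
    c = int(round(c2 ** 0.5))
    assert c * c == c2, "channel count must be a perfect square"
    # (h, w, c, c) -> (h, c, w, c) -> (h*c, w*c)
    return x.reshape(h, w, c, c).transpose(0, 2, 1, 3).reshape(h * c, w * c)
```

Since every element of the input reappears exactly once in the output, the operation is lossless, in contrast to the space-mapping step of covariance pooling.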
Compared with the covariance pooling scheme, homology shifting has several significant advantages. On the one hand, covariance pooling requires the computation of a covariance matrix between different feature maps, which must then be fed into a spatial mapping step to obtain the final discriminant vector. In contrast, homology shifting only performs a simple pixel reorganization following the original convolution operations, which incurs no additional computational overhead. On the other hand, when covariance pooling is applied, a certain information loss inevitably happens during the mapping from the Riemannian manifold space to the Euclidean space. This phenomenon does not occur in the homology-shifting process, in which all of the elements of the convolution-generated feature maps are fully kept. Instead, better information utilization is achieved owing to the shifting mechanism. These merits guarantee an open prospect for the application of homology shifting in discriminative feature extraction.

Experiments
In this section, we specify the hyperparameters of the model configuration and evaluate the proposed methods using three classification metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (Kappa). To assess the classification performance of the Oct-MCNN-HS framework, we introduce four publicly available HSI datasets: Indian Pines (IP), the University of Houston (UH), the University of Pavia (UP), and Salinas Scene (SA). In all of the experimental scenarios, we run all of the algorithms five times with randomly selected training data and report the average results of the main classification metrics to reduce the influence of random initialization.
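For reference, the three metrics can be computed from a confusion matrix as follows. This is the standard recipe (assuming every class appears in `y_true`), not code from our released model:

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Compute OA, AA, and Kappa from integer label vectors."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n                             # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))        # mean per-class accuracy
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

OA weights every sample equally, AA weights every class equally, and Kappa discounts the agreement that would be expected by chance; reporting all three guards against the class-imbalance effects discussed later.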

Data Description
The four used datasets are illustratively shown in Tables 1-4. In addition, a more detailed description of each dataset is given as follows.

Framework Setting
Since the designed Oct-MCNN-HS model exploits PCA to achieve a balance between redundant information removal and effective feature retention, the number of preserved components is an indispensable hyperparameter for the classification of a new HSI. In practice, when sufficient samples are available, the classification results are less sensitive to the number of principal components. However, in this work, we focus more on the classification performance in the case of limited samples. In our experiments, for the different HSI datasets, different selections of the component number are tested to generate the classification results. The specific values are shown on the horizontal axis of Figure 3. Since the practical component number has a significant impact on both the classification accuracy and the running efficiency, we list both the OA value and the training duration. From this figure, our first observation is that the peak performance varies with the different HSIs as well as the number of retained components. Overall, following this figure and taking a balanced consideration of accuracy and training duration, we finally set 110, 25, 20, and 30 as the final values for the IP, UH, UP, and SA datasets, respectively.
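The PCA preprocessing itself amounts to projecting each pixel's spectrum onto the top-k eigenvectors of the band covariance matrix. A minimal NumPy sketch (illustrative only; actual implementations typically use an off-the-shelf PCA routine) is:

```python
import numpy as np

def pca_reduce(cube, k):
    """Reduce an (H, W, B) hyperspectral cube to its top-k principal
    components along the spectral axis."""
    H, W, B = cube.shape
    X = cube.reshape(-1, B).astype(float)
    X -= X.mean(axis=0)                    # center each band
    cov = X.T @ X / (X.shape[0] - 1)       # B x B spectral covariance
    vals, vecs = np.linalg.eigh(cov)       # ascending eigenvalues
    comps = vecs[:, ::-1][:, :k]           # top-k by explained variance
    return (X @ comps).reshape(H, W, k)
```

The hyperparameter k is exactly the component number whose selection is discussed above.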

Experimental Setup
To demonstrate the appealing performance of our model, we compare Oct-MCNN-HS with several recently published deep learning-based HSI classification methods, including HybridSN [43], SSRN [26], A-SPN [41], and MCNN-CP [30]. It is worth mentioning that, except for the missing evaluation on the UH dataset, MCNN-CP has demonstrated state-of-the-art classification performance under both large and small sample sizes on the IP, UP, and SA datasets. Moreover, as our experimental results will show, the newly proposed Oct-MCNN-HS achieves a further remarkable performance breakthrough over MCNN-CP. Four numerical metrics are used to record the quantitative results of each competing model, including the class-wise accuracy, OA, AA, and Kappa. In addition to the classification performance comparison in the case of limited samples, to further investigate the superiority of our model in various scenarios, we also exploit different training sizes in the experiments. It is worth noting that the number of instances of different classes may vary greatly, which results in the well-known class imbalance issue. To fairly balance the accuracies of the minority and majority classes, we set the "class_weight" attribute in the Keras framework to "auto", following the idea intensively discussed in [44]. Specifically, as shown in Figure 1, a class weight can optionally be assigned to each class, striving for an equal contribution of each class to the loss function. For the IP and UH datasets, certain classes would retain extremely few samples when a very low sample ratio, e.g., 1% or 0.5%, is adopted. For example, when one uses 1% of the samples as the training/validation set, the 7th and 9th classes of the IP data both contain zero samples. Accordingly, the class weight cannot be correctly derived and this mechanism is insufficient to tackle the imbalance issue.
Therefore, we borrow the sampling scheme used in EPF [16] to obtain a more balanced performance. In the implementation, a fixed number of training samples for each category is initially set. For the categories having fewer samples than the given number, half of the samples are used for training, and the remaining vacancy is filled by samples from the other, richer categories. For all of the hybrid models combining 3D and 2D convolutions, to circumvent an explosion of the parameter amount and ensure their feasibility on a commonly equipped laptop, we set three 3D convolution layers with 3 × 3 × 3 × 8 (8 kernels of size 3 × 3 × 3), 3 × 3 × 3 × 16, and 3 × 3 × 3 × 32 convolution kernels, respectively, to extract sufficient features, and one 2D convolution operation with a 3 × 3 × 64 kernel (64 kernels of size 3 × 3) to fuse the feature maps. It is worth noting that, to ensure full information discovery, we perform zero-padding for all of the 3D convolutions. In addition, different from the MCNN-based models, 1 × 1 × 512 is set as the kernel size of the 2D convolution layer in all of the Oct-MCNN-based models. A detailed summary of the proposed architecture in terms of the layer types, output map shapes, and the number of parameters is given in Table 5. In all of the experiments, the patch size, the dropout proportion of the fully connected layers, the training epochs, the batch size, the optimization algorithm, and the learning rate are set to 11 × 11, 0.4, 100, 256, Adam, and 0.001, respectively. The experiments are conducted on a personal laptop equipped with an Intel Core i7-9750H processor at 2.6 GHz, 32 GB of DDR4 RAM, and an NVIDIA GeForce GTX 1650 GPU. The coding is carried out under the TensorFlow framework with Python 3.6. Meanwhile, we configure the environment with CUDA 10.0 and cuDNN 7.6 for GPU acceleration.
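The EPF-style sampling scheme described above can be sketched as follows. This reflects our reading of the scheme; the function and variable names are ours, not those of the EPF implementation.

```python
import numpy as np

def balanced_sample(labels, per_class, rng=None):
    """Draw `per_class` training samples per category; if a category has
    fewer, take half of it and refill the deficit from richer classes."""
    rng = np.random.default_rng(rng)
    picked, deficit, rich = [], 0, []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) < per_class:
            take = len(idx) // 2          # poor class: use half its samples
            deficit += per_class - take
        else:
            take = per_class
            rich.append(c)                # candidate for refilling
        picked.append(rng.choice(idx, size=take, replace=False))
    for c in rich:                        # fill the vacancy from rich classes
        if deficit == 0:
            break
        idx = np.setdiff1d(np.flatnonzero(labels == c), np.concatenate(picked))
        extra = min(deficit, len(idx))
        picked.append(rng.choice(idx, size=extra, replace=False))
        deficit -= extra
    return np.concatenate(picked)
```

The returned index set always contains `per_class` times the number of classes samples (provided the rich classes hold enough surplus), keeping the total training size fixed while protecting the minority classes.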

Performance Comparison
The first comparison experiments are conducted using limited samples. Specifically, considering the imbalance problem of the training samples of the IP and UH datasets, 5 samples per category (a total of 80 samples) and approximately 100 samples per class (2000 in total) are respectively adopted as the training sets for model optimization. For the other two datasets, i.e., UP and SA, 0.1% of each labeled cube is selected for the training and verification groups, respectively, and the remaining 99.8% of the samples are used for performance evaluation. The generated results of our proposed model, as well as those of several representative models, i.e., HybridSN [43], SSRN [26], A-SPN [41], and MCNN-CP [30], are shown in Tables 6-9. From these tables, one can easily observe that HybridSN performs worse than the other methods. Although A-SPN achieves favorable results on the IP dataset, it performs worse than MCNN-CP on the other three datasets, i.e., UH, UP, and SA. Note that MCNN-CP has proven its remarkable capability of discriminative feature extraction on IP, UP, and SA [30]. Yet, encouragingly, the Oct-MCNN-HS developed in this paper consistently achieves better classification results, which demonstrates its superiority in extracting formidable features compared with MCNN-CP. Evidently, among all of the compared models, Oct-MCNN-HS achieves the optimal classification performance with a lower standard deviation as well, which demonstrates the advantage of the mixed convolutions and homology shifting.
Together with the false-color images of the original HSIs and their corresponding ground-truth maps, Figures 4-7 further visualize the classification results on the four datasets. From these figures, one can see that the qualitative comparison evidently confirms the quantitative one shown in Tables 6-9. Concretely, the classification maps generated by HybridSN are full of noisy points, while the other methods achieve better results with clearer contours and sharper edges. However, due to the limited training size, some scattered points still unavoidably exist in some categories. Again, by carefully comparing these maps with the ground-truth ones, we argue that the maps learned by Oct-MCNN-HS hold the highest fidelity. All of the remaining competitors generate more artifacts to some extent, e.g., broken lines, blurry regions, and scattered points.

For the sake of verifying the effectiveness of our model under various training sizes, in the third experiment, we evaluate the competing approaches with training sizes traversing the candidate sets {80, 160, 480, 960} for IP, {2000, 4000, 10,000, 20,000} for UH, and {0.1%, 0.5%, 1%, 5%} for UP and SA, respectively. Figure 8 illustrates the overall accuracies and training times of the different classifiers. Our first observation is that Oct-MCNN-HS ranks first in all of the cases. Although the gap gradually shrinks as the size of the training set increases, the gains of our method over the other competitors are quite significant when only limited samples are available. This phenomenon is reasonable, since all of the candidate methods perform well with sufficient samples. Overall, the practical performance can be arranged in descending order as Oct-MCNN-HS > MCNN-CP > A-SPN > SSRN > HybridSN. Note that, taking only the accuracy into account, the performance of A-SPN fluctuates more strongly across datasets than that of the others.
In terms of running efficiency, we can also conclude from this figure that, in all of the cases, the convergence speed of our Oct-MCNN-HS is significantly ahead of that of SSRN and MCNN-CP. Although our method lags behind HybridSN and A-SPN, owing to the larger number of components retained in the PCA step, the speed gap between them lies within an acceptable range. The efficiency, together with the accuracy, verifies that the proposed method enjoys high robustness, favorable generalization, and excellent convergence under different scenarios.
To dig deeply into the behaviors of the 3D octave and 2D vanilla mixed convolutions in our network, several variants that use the 3D-2D mixed CNN (MCNN) as the backbone are further introduced as competitors. Furthermore, the advantage of homology shifting over covariance pooling in the role of feature sublimation should also be explored. A total of seven deep learning-based approaches are used in this experiment: the original MCNN, MCNN plus covariance pooling (MCNN-CP), MCNN plus homology shifting (MCNN-HS), the 3D octave CNN, MCNN using 3D octave convolution (Oct-MCNN), Oct-MCNN plus covariance pooling (Oct-MCNN-CP), and Oct-MCNN-HS. In the implementation, 160 and 4000 samples for the IP and UH datasets, and 0.5% of the samples for UP and SA, are used for model optimization. The classification results are shown in Figure 9. Taking MCNN as the baseline, the first observation is that the two presented techniques, i.e., octave convolution and homology shifting, both benefit the model. However, the improvement from the covariance pooling scheme cannot be guaranteed: Oct-MCNN-CP even underperforms its counterpart, i.e., Oct-MCNN, in all four cases. With homology shifting involved, both MCNN-HS and Oct-MCNN-HS evidently outperform their counterparts. Comparing the three plain models, i.e., MCNN, the 3D octave CNN, and Oct-MCNN, 3D octave convolution enjoys better feature extraction capabilities. Moreover, attaching a layer of 2D vanilla convolution after the 3D octave convolutions to form the Oct-MCNN model further ensures better performance. In our studies, the addition of homology shifting consistently incurs gains both in efficacy and efficiency. With both of these techniques involved, the final Oct-MCNN-HS outperforms all of the competitors in all of the experiments. In terms of classification accuracy, these variants can be sorted in descending order as Oct-MCNN-HS > Oct-MCNN > MCNN-HS > MCNN.

Discussion
Regarding the roles of mixed convolutions and feature sublimation, several conclusions can also be drawn. First of all, the experiments on the MCNN- and Oct-MCNN-based models prove that, compared with the covariance pooling scheme, the homology-shifting operation involves more information, leading to better effectiveness on a variety of datasets. In addition, 3D octave convolutions provide multi-scale supplementary information through feature maps of different resolutions, which positively manifests the relationship between the feature maps, especially the difference between the low-frequency and high-frequency maps. What is more, by attaching a layer of 2D vanilla convolution after the 3D octave convolutions, the feature maps can be further merged along the channel direction, thereby reducing dimensionality and aggregating information. On that basis, covariance pooling obtains second-order statistics by calculating the covariance matrix between the feature maps, which does not pair well with the Oct-MCNN model. In contrast, homology shifting, as a pixel reorganization operation, concentrates the information of the pixels at the same location along the channel direction, pushing more features to be aggregated. Although its combination with the non-trivial Oct-MCNN model cannot achieve a significant speed-up in the training process, it improves the classification accuracy without incurring any extra computational cost.
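To illustrate this contrast, the snippet below sketches both schemes on a toy stack of feature maps. The exact layout produced by homology shifting in our implementation is an assumption here: it is modeled simply as a channel-last regrouping that places the responses of co-located pixels adjacently, which makes its parameter-free, reordering-only nature explicit.

```python
import numpy as np

# Toy feature maps F of shape (H, W, C): C maps over an H x W patch.
rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
F = rng.standard_normal((H, W, C))

# Covariance pooling: second-order statistics between the C feature maps,
# i.e., the C x C covariance of the flattened maps.
flat = F.reshape(H * W, C)         # rows: pixels, columns: feature maps
cov = np.cov(flat, rowvar=False)   # (C, C) covariance matrix

# Homology shifting (sketch): the C responses at each spatial location are
# concatenated along the channel direction. As a pure reordering it adds no
# trainable parameters and no extra computation.
hs = F.reshape(H * W * C)          # channel-last flatten keeps co-located
                                   # pixel responses contiguous

print(cov.shape, hs.shape)
```

Note that the first `C` entries of `hs` are exactly the channel vector of the top-left pixel, which is the aggregation behavior described above.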

Conclusions
In this work, a new network architecture for HSI classification, namely Oct-MCNN-HS, involving 3D octave and 2D vanilla mixed convolutions, is proposed. Specifically, we first adopt principal component analysis to reduce the dimensionality and redundancy of the spectral bands. In the feature extraction stage, we construct dual-branch, multi-scale 3D octave convolutions through up- and down-sampling strategies, as well as intra-frequency update and inter-frequency communication mechanisms, to obtain more discriminative information. Subsequently, a layer of 2D vanilla convolution is attached to fuse the feature maps generated by the 3D octave convolutions along the channel direction, thereby reducing dimensionality and aggregating information. In addition, for the sake of better feature sublimation, the homology-shifting operation is employed to assemble the information of the pixels located at the same spatial position across different maps. The final model, Oct-MCNN-HS, constructed with the above two information aggregation operations, i.e., 2D vanilla convolution and homology shifting, achieves breakthroughs both in convergence speed and classification accuracy on several publicly available HSI datasets with scarce labeled samples. In the near future, we will focus on automatic search mechanisms to get rid of the troublesome tuning of hyperparameters.