Deep Cube-Pair Network for Hyperspectral Imagery Classification

Abstract: Advanced classification methods that fully utilize the 3D characteristics of a hyperspectral image (HSI) and generalize well to test data given only limited labeled training samples (i.e., a small training dataset) have long been a research objective in HSI classification. Motivated by the success of deep-learning-based methods, this study proposes a cube-pair-based convolutional neural network (CNN) classification architecture to meet this objective, where cube-pairs are used to address the small training dataset problem while preserving the local 3D structure of HSI data. Within this architecture, a 3D fully convolutional network is further modeled, which has fewer parameters than a traditional CNN. Provided the same amount of training samples, the modeled network can go deeper than a traditional CNN and thus has superior generalization ability. Experimental results on several HSI datasets demonstrate that the proposed method achieves superior classification results compared with other state-of-the-art competing methods.

HSI classification, which aims at assigning each pixel a pre-defined class label, has been one of the most popular research areas in HSI analysis over the past several decades. Numerous methods have been proposed, which can be roughly divided into non-deep-learning-based and deep-learning-based methods [18][19][20][21][22][23][24][25][26]. Classifiers and feature extraction are the two ingredients of non-deep-learning-based HSI classification methods [27]. Typical classifiers include k-nearest neighbor (k-NN) [28,29], logistic regression (LR) [30][31][32], and support vector machine (SVM) [33][34][35][36]. By evaluating the distances between the training samples/pixels and the test sample, the k-NN method selects the k training samples that have the smallest distance to the test sample and then assigns the test sample the label that dominates among the selected k training samples. Logistic regression is used for HSI classification because it estimates class probabilities directly using the logit transform. The SVM seeks an optimal hyperplane that linearly separates features into two groups with a maximum margin, and it shows a powerful capability for classifying hyperspectral data. In addition, feature extraction methods, such as principal component analysis (PCA) [37], independent component analysis (ICA) [38], and minimum noise fraction (MNF) [39], are commonly used together with the above classifiers to cope with the high dimensionality and nonlinearity of the data. However, two problems limit the performance of non-deep-learning-based HSI classification methods. (1) They use shallow structures (e.g., the SVM can be regarded as a single-layer classifier, and PCA as a single-layer feature extractor), which have limited nonlinear representation capability and may not be able to represent the nonlinearity in the HSI. (2) The features adopted are usually hand-crafted, which may not fit the classification task very well.
In contrast, deep learning methods [40][41][42][43][44][45][46][47][48][49] are based on multi-layer structures and thus have superior nonlinear representation ability. The stacked autoencoder (SAE) [44,48,50,51] and the convolutional neural network (CNN) [40][41][42]45,52] are two representative categories. A commonly used strategy for building an SAE model is unsupervised pretraining over the unlabeled samples followed by supervised fine-tuning over the labeled samples. The deep belief network (DBN) [53,54] also belongs to this category, where the unsupervised pretraining over the unlabeled samples is accomplished via the DBN instead of the SAE. Compared with the SAE, the CNN is a completely supervised deep learning method and shows a more powerful classification capability, since it integrates feature extraction and a classifier naturally into one framework (i.e., the extracted feature is specific to the classifier). Thus, we focus on CNN-based HSI classification in this study. Several CNN-based methods have been proposed. Hu et al. [42] propose a CNN-based method that uses spectral information only. Slavkovikj et al. [46] incorporate both spatial and spectral information into a CNN within a local patch structure. Zhang et al. [41] propose a dual-stream CNN, where one stream extracts the spectral feature and the other stream extracts the spatial-spectral feature. Chen et al. [47] and Li et al. [45] propose 3D CNNs to take the 3D structure of HSI data into account.
Two characteristics are considered important for HSI classification. First, an HSI is inherently a 3D datacube, which contains both spectral and spatial structure. However, the majority of existing CNN-based HSI classification methods consider spectral information only, or destroy the original correlation between spatial and spectral information without fully exploiting the useful 3D structure. Second, since labeling an HSI is tedious, expensive, and can only be accomplished by experts, the labeled pixels provided for HSI classification are limited. However, CNN-based methods demand large amounts of labeled samples because of the large number of parameters in the network, and the number of parameters grows as the CNN becomes deeper. Given limited labeled samples, many CNN-based methods cannot be fully trained, i.e., the generalization ability of the neural network is unsatisfactory with insufficient labeled data. To address these problems, inspired by the recently proposed pixel-pair feature [40], we propose a cube-pair-based CNN classification architecture in this study, where cube-pairs are used to enlarge the training set and to model the local 3D structure simultaneously. Within this architecture, a 3D fully convolutional network (FCN) is further modeled, which has fewer parameters than the traditional CNN. Provided the same amount of training samples, the modeled network can go deeper than the traditional CNN and thus has superior generalization ability for HSI classification. The main ideas and contributions are summarized as follows.
(1) Cube-pairs are used when modeling the CNN classification architecture. The advantage of using cube-pairs is that they not only generate more samples for training but also utilize the local 3D structure directly.
(2) A 3D FCN is modeled within the cube-pair-based HSI classification architecture, which is a deep end-to-end 3D network suited to the 3D structure of HSI. In addition, it has fewer parameters than the traditional CNN. Provided the same amount of training samples, the modeled network can go deeper than the traditional CNN and thus has superior generalization ability.
(3) The proposed method obtains the best classification results compared with the pixel-pair CNN and other deep-learning-based methods.
The remainder of this paper is structured as follows. Section 2 describes the deep cube-pair network for HSI classification, including the cube-pair-based CNN classification architecture and the cube-pair-based FCN. Experimental results and analysis are provided in Section 3. Section 4 concludes the paper.

The Deep Cube-Pair Network for HSI Classification
First, in Section 2.1, we group the existing CNN-based methods into three categories: pixel-based architecture, pixel-pair-based architecture, and cube-based architecture. Then, in Section 2.2, we propose a new cube-pair-based HSI classification architecture that takes advantage of both cube-based and pixel-pair-based methods. Since any kind of 3D deep neural network can be used within this architecture (i.e., acting as a cube-pair network), we give a brief introduction to the cube-pair-based HSI classification architecture, including cube-pair generation for the training and test procedures and the class label inference for the test data. Finally, in Section 2.3, we model a specific 3D deep neural network. We introduce the structure of the modeled 3D fully convolutional network in detail and briefly introduce its training and test strategies.

Mathematical Formulation of Commonly Used CNN-Based HSI Classification Architecture
In this study, we denote an HSI dataset as $X \in \mathbb{R}^{w \times h \times d}$, where $w$, $h$, and $d$ represent the width, height, and number of bands (i.e., spectral channels/wavelengths), respectively. Among the total of $w \times h$ pixels, $N$ pixels are labeled and form the training set $T = \{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is the $d$-dimensional spectrum of one pixel, and $y_i$ is its corresponding label chosen from $\mathcal{K} = \{1, \cdots, K\}$. $K$ is the total number of classes.
The pixel-level HSI classification architecture is commonly used; it classifies each pixel from its own spectrum. Specifically, a prediction function of the following form is learned:
$$f : x_i \rightarrow y_i. \tag{1}$$
Then, the learned function $f$ is used to assign labels to the unlabeled pixels $x_j \notin T$. In this study, $f$ represents CNN-based methods.
A pixel-pair-based architecture is proposed to address the small training dataset problem. For an HSI, only limited labeled samples can be provided in real conditions (i.e., $N$ is small), since labeling an HSI is tedious and expensive, and can only be accomplished by experts. However, the CNN (i.e., $f$) demands large amounts of labeled training samples (i.e., a large $N$) to train its parameters, especially when the network goes deeper. To address this contradiction, Li et al. [40] proposed a pixel-pair-based HSI classification architecture, where they reformulated the pixel-level classification architecture as
$$f : \{x_i, x_t\} \rightarrow y_{it}. \tag{2}$$
The label $y_{it}$ for the pixel-pair $\{x_i, x_t\}$ in [40] is determined by
$$y_{it} = \begin{cases} y_i, & \text{if } y_i = y_t, \\ 0, & \text{if } y_i \neq y_t. \end{cases} \tag{3}$$
Though the number of labeled pixels may be limited, the number of labeled pixel-pairs can be huge, since the number of pixel combinations in the training set grows with the square of the number of training pixels. This narrows the gap between the number of labeled samples an HSI can provide and the number that deep learning methods demand. Then, a pixel-pair network (i.e., $f$) is constructed based on pixel-pairs. Finally, a voting strategy is used to obtain the final classification result for the test pixel based on the outputs of $f$. Though the pixel-pair-based architecture effectively increases the number of training samples, it ignores the useful 3D structure.
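The pair-labeling rule described above (the shared class if both pixels agree, Class 0 otherwise) and the growth in sample count can be illustrated with a minimal Python sketch. The function names `pair_label` and `make_pairs` and the toy label list are hypothetical; the sketch assumes ordered pairs of distinct pixels, which may differ in detail from the enumeration used in [40].

```python
from itertools import permutations


def pair_label(y_i, y_t):
    """Label of an ordered pair: the shared class if both agree, else Class 0."""
    return y_i if y_i == y_t else 0


def make_pairs(labels):
    """Enumerate ordered index pairs (i, t) with i != t, plus their pair labels."""
    return [((i, t), pair_label(labels[i], labels[t]))
            for i, t in permutations(range(len(labels)), 2)]


labels = [1, 1, 2, 2, 2]            # hypothetical labels of N = 5 training pixels
pairs = make_pairs(labels)
print(len(labels), "pixels ->", len(pairs), "ordered pairs")
```

With N labeled pixels this enumeration yields N(N - 1) labeled pairs, which is the square-level growth the paragraph refers to.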
The cube-based architecture is proposed to directly use the 3D structure of an HSI for classification, which can be represented as
$$f : C(x_i)_k \rightarrow y_i, \tag{4}$$
where $C(x_i)_k \in \mathbb{R}^{k \times k \times d}$ represents a local cube centered at $x_i$, whose width $k$ equals its height. The basic idea behind Equation (4) is that spatially neighboring pixels tend to have the same class label. However, a cube-based architecture alone does not address the small training dataset problem, i.e., $f$ in Equation (4) still needs a large amount of training samples. In addition, though the cube-based architecture is proposed to model the 3D structure of an HSI, the majority of existing CNN-based HSI classification methods do not model 3D data directly. Those methods first reshape the original 3D tensor structure of the HSI into vectors or matrices, and then construct a 1D or 2D CNN based on the reshaped data. Though those methods capture spectral and spatial information to some extent, the original 3D structure (e.g., the correlation between spatial and spectral information) is destroyed by the reshaping, which degrades HSI classification performance.
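As an illustration of the local cube centered at a pixel, the following NumPy sketch extracts a k × k × d cube from an HSI array. The function name `extract_cube`, the random data, and in particular the reflect-padding of border pixels are assumptions made for this example; the text does not state how pixels near the image border are handled.

```python
import numpy as np


def extract_cube(X, row, col, k=3):
    """Return the k x k x d cube centred at pixel (row, col); k must be odd."""
    r = k // 2
    # Assumption: reflect-pad the two spatial axes so border pixels get a full cube.
    Xp = np.pad(X, ((r, r), (r, r), (0, 0)), mode="reflect")
    # In the padded array, original pixel (row, col) sits at (row + r, col + r).
    return Xp[row:row + k, col:col + k, :]


X = np.random.rand(10, 12, 103)     # hypothetical HSI with 103 bands
cube = extract_cube(X, 4, 5, k=3)
print(cube.shape)                   # (3, 3, 103)
```

The spectral axis is left untouched, so the cube keeps all d bands of every pixel it covers.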

The Proposed Architecture
A small labeled sample set and a 3D structure are two important characteristics of an HSI. However, as shown above, the pixel-pair-based and cube-based architectures each address only one of them. To the best of our knowledge, no existing architecture addresses both simultaneously, which inspires us to propose the cube-pair-based HSI classification architecture as
$$f : \{C_i, C_t\} \rightarrow y_{it}, \tag{5}$$
where $C_i$ and $C_t$ denote the cubes centered at $x_i$ and $x_t$, respectively. From Equation (5), we can see that the cube-pair-based architecture is suitable for 3D data. In addition, more samples can be generated for training within this architecture, which addresses the small training dataset problem. Different strategies can be used to determine the label of the cube-pair $\{C_i, C_t\}$, which is denoted as $y_{it}$ in this paper. Considering that neighboring pixels in an HSI tend to belong to the same class, for simplicity, we select the pixel at the center of each cube and determine $y_{it}$ based on the selected pixels. The strategy proposed in [40], shown as Equation (3), can then be used to determine $y_{it}$. If the selected pixels are from the same class, we assign $y_{it}$ the class label of the selected pixels. If the pixels are from different classes, a new class label is generated, which is denoted as Class 0 in this paper. Thus, $y_{it}$ ranges from 0 to $K$.

Training and Test Procedures of the Proposed Architecture
Since the cube-pair architecture differs from other architectures, we briefly summarize its training and test procedures in this subsection. Because the proposed cube-pair-based architecture is a general framework, i.e., any kind of 3D deep neural network can be used within it, we introduce the training and test procedures without specifying a particular CNN.
Training procedure. Given a training set $T = \{x_i, y_i\}_{i=1}^{N}$, the training procedure consists of the following steps, which are also illustrated in the top half of Figure 1.
Step (1). We sample cubes centered at the training pixels in $T$ one by one, preserving their spatially neighboring pixels in the original HSI (in the following, we use cubes with a 3 × 3 spatial size as an example).
Step (2). We generate cube-pairs from the sampled cubes and determine their labels by Equation (3).
Step (3). We train the classifier $f$ using the generated cube-pairs and their labels, as Equation (5) shows ($f$ can be any 3D deep neural network; a specifically modeled FCN is given in Section 2.3).
As an example, consider a classification problem with 9 classes, where each class has 200 cubes. For each of Classes 1 to 9, we can obtain 200 × 199 cube-pairs (note that cube-pairs are ordered, so swapping the two cubes yields a different cube-pair). For Class 0, we can obtain many more cube-pairs, since there are far more combinations of cubes from different classes than from the same class. To keep the classes balanced, only part of the cube-pairs from Class 0 are generated in the experiment. Specifically, for each of Classes 1 to 9, we repeat the following operation to generate cube-pairs for Class 0: we use all 200 cubes in the class and, for each of them, randomly select 3 cubes from each of the 8 other classes to generate the cube-pairs. Thus, we obtain 9 × 200 × 8 × 3 cube-pairs. Since 9 × 200 × 8 × 3 equals 200 × 216, the number of cube-pairs generated from different classes is close to 200 × 199 (i.e., the number of cube-pairs generated within one class).
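The counting in this example can be checked with a short Python sketch. The helper names and the use of integer tuples as stand-ins for cubes are hypothetical, and the sketch assumes that the 3 partner cubes are drawn independently for every cube and every other class, which is one reading of the sampling rule above.

```python
import random


def count_cube_pairs(n_classes=9, per_class=200, n_other=3):
    """Return (same-class pairs per class, total Class 0 pairs)."""
    same = per_class * (per_class - 1)                          # ordered pairs, i != t
    diff = n_classes * per_class * (n_classes - 1) * n_other    # sampled Class 0 pairs
    return same, diff


def sample_class0_pairs(cubes_by_class, n_other=3, seed=0):
    """Pair every cube with n_other random cubes from each other class."""
    rng = random.Random(seed)
    pairs = []
    for c, cubes in cubes_by_class.items():
        for cube in cubes:
            for other, other_cubes in cubes_by_class.items():
                if other != c:
                    pairs.extend((cube, p) for p in rng.sample(other_cubes, n_other))
    return pairs


same, diff = count_cube_pairs()
# Stand-in "cubes": (class, index) tuples instead of real k x k x d arrays.
cubes_by_class = {c: [(c, j) for j in range(200)] for c in range(1, 10)}
class0_pairs = sample_class0_pairs(cubes_by_class)
print(same, diff, len(class0_pairs))    # 39800 43200 43200
```

Under these assumptions each of Classes 1 to 9 contributes 39,800 pairs and Class 0 contributes 43,200, so the ten pair classes are of comparable size.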
Test procedure. Once we obtain $f$, the procedure for inferring the label of an unlabeled pixel $x_j \notin T$ can be summarized as follows based on [40], which is illustrated in the bottom half of Figure 1.
Step (1). We sample an extended-cube, which is centered at $x_j$ and has a larger spatial size than the cubes used in the training procedure (e.g., 5 × 5 for the extended-cube versus 3 × 3 for training).
Step (2). We generate all cube-pairs within the extended-cube. For each generated cube-pair, one cube is centered at the central pixel (i.e., $x_j$) and the other is centered at a non-central pixel. Both cubes are of the same size as the cubes generated in the training procedure (i.e., 3 × 3).
Step (3). We apply $f$, which is obtained in the training procedure, to all cube-pairs generated in Step (2) one by one. We obtain a set of logit outputs, each of which is a $(K + 1)$-dimensional vector.
Step (4). We remove the first dimension (i.e., Class 0) from each logit output and use the remaining $K$-dimensional vector to predict the label of each cube-pair with a softmax function. After obtaining the predicted labels of all cube-pairs, we assign $x_j$ the class label that dominates the predicted labels.
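Steps (3) and (4) amount to a majority vote over per-pair predictions, which can be sketched in plain Python as follows. The function name `vote` and the example logits are hypothetical. Because softmax is monotonic, the sketch takes the arg-max of the remaining logits directly, which selects the same class as applying softmax first; ties are broken by first occurrence, which is an assumption, since the text does not specify tie handling.

```python
from collections import Counter


def vote(logits_per_pair):
    """Majority vote; each item is a (K+1)-length list whose index 0 is Class 0."""
    votes = []
    for logits in logits_per_pair:
        scores = logits[1:]                                   # drop the Class 0 entry
        best = max(range(len(scores)), key=scores.__getitem__)
        votes.append(best + 1)                                # classes are numbered 1..K
    return Counter(votes).most_common(1)[0][0]


# Hypothetical outputs of f for three cube-pairs, with K = 3 classes.
logits = [
    [5.0, 0.1, 2.0, 0.3],   # Class 0 scores highest but is ignored -> class 2
    [0.0, 1.5, 0.2, 0.1],   # -> class 1
    [9.0, 0.0, 3.0, 1.0],   # -> class 2
]
print(vote(logits))         # 2
```

Dropping the Class 0 entry means that a pair judged "different classes" still casts a vote for its most likely real class.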

The Proposed Deep Cube-Pair Network
The proposed cube-pair-based architecture is a general framework. Thus, any kind of 3D deep neural network can be used within this architecture. Existing HSI classification methods usually adopt a CNN. However, a CNN contains many parameters and thus demands a large amount of labeled training data, more than an HSI can provide. Thus, our motivation is to model a 3D network that has fewer parameters. The traditional CNN-based method is composed of convolutional layers, pooling layers, and a fully connected layer. Considering that most parameters are in the fully connected layer of the CNN, we use an FCN, which omits the fully connected layer and thus has fewer parameters than the CNN. On the one hand, the modeled FCN can be well trained with a smaller amount of training data than the CNN requires. On the other hand, when we use the FCN in the cube-pair architecture, we can build a much deeper network with superior generalization ability.
To cope with the 3D structure of the HSI data without flattening it into a matrix or a vector, we model a 3D FCN, which we term the deep cube-pair network (DCPN) in this study. Since the network is constructed from convolution layers only, we first explain how the 3D convolution layer works. We then introduce the constructed DCPN and its training and test strategies.

The Structure of the DCPN
We denote the $l$-th convolution kernel as $K^{l}$ and the activation function as $\Phi$. The relation between the input $I$ and the output $O$ of the convolution layer can be represented as
$$O^{l}_{uvt} = \Phi\Big(\sum_{z_1}\sum_{z_2}\sum_{z_3} K^{l}_{z_1 z_2 z_3}\, I_{(u+z_1)(v+z_2)(t+z_3)} + b\Big), \tag{6}$$
where $O^{l}$ represents the output (i.e., feature map) using the $l$-th convolution kernel and $O^{l}_{uvt}$ is the feature at position $(u, v, t)$. $I_{(u+z_1)(v+z_2)(t+z_3)}$ denotes the input of the convolution layer at position $(u + z_1, v + z_2, t + z_3)$, in which $(z_1, z_2, z_3)$ denotes its offset from $(u, v, t)$. $K^{l}_{z_1 z_2 z_3}$ represents the kernel weight connected to $I_{(u+z_1)(v+z_2)(t+z_3)}$, and $b$ is the bias. The rectified linear unit (ReLU) is adopted as the activation function $\Phi$, since it can improve model fitting without extra computational cost or over-fitting risk. It can be represented as
$$\Phi(x) = \max(0, x). \tag{7}$$
By concatenating the cube-pair $\{C_i, C_t\}$ as $[C_i, C_t] \in \mathbb{R}^{(2 \times k) \times k \times d}$, we use it as the input $I$ of the first convolution layer. Note that the order of the subscripts $i$ and $t$ matters, i.e., $[C_i, C_t] \neq [C_t, C_i]$. In addition, considering that the spectrum is essential to discriminate different classes, $d$ is set equal to the number of spectral bands of the HSI to preserve the global correlation along the spectrum. For clarity, we use the Pavia dataset as an example to show the modeled DCPN, which adopts a nine-layer structure (shown in Figure 2). After removing the absorption bands, we adopted 103 bands for the Pavia dataset and set $k$ equal to 3 (classification results with different $k$ are analyzed in Section 3.4), so the resulting input $I$ has a size of 6 × 3 × 103.
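To make the 3D convolution layer concrete, the following NumPy sketch computes one feature map as a sum of kernel weights times shifted inputs plus a bias, followed by ReLU, and applies it to a concatenated cube-pair. It is a deliberately naive loop for readability, assuming stride 1 and no padding; the function name `conv3d_relu`, the random data, and the 2 × 2 × 4 kernel are hypothetical and are not the settings of the DCPN.

```python
import numpy as np


def conv3d_relu(I, K, b=0.0):
    """One feature map: stride-1, no-padding 3D convolution followed by ReLU."""
    ku, kv, kt = K.shape
    U, V, T = I.shape[0] - ku + 1, I.shape[1] - kv + 1, I.shape[2] - kt + 1
    O = np.empty((U, V, T))
    for u in range(U):
        for v in range(V):
            for t in range(T):
                # Sum over offsets (z1, z2, z3) of K[z1, z2, z3] * I[u+z1, v+z2, t+z3].
                O[u, v, t] = np.sum(K * I[u:u + ku, v:v + kv, t:t + kt]) + b
    return np.maximum(O, 0.0)               # ReLU


rng = np.random.default_rng(0)              # hypothetical random data
C_i = rng.random((3, 3, 103))               # two 3 x 3 cubes with 103 bands
C_t = rng.random((3, 3, 103))
I = np.concatenate([C_i, C_t], axis=0)      # cube-pair input, shape (6, 3, 103)
O = conv3d_relu(I, rng.standard_normal((2, 2, 4)), b=0.1)
print(I.shape, O.shape)                     # (6, 3, 103) (5, 2, 100)
```

Concatenating in the opposite order, `np.concatenate([C_t, C_i], axis=0)`, produces a different input array, which is why the two orderings are treated as distinct samples.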
In the first convolution layer, six different small convolution kernels of size 1 × 1 × 1 were utilized, since small kernels make it easier to increase the depth of the network [55].
In the second convolution layer, six different 3D convolution kernels of size 3 × 1 × 8 were used, and the stride was set to 1 × 1 × 3. Multiple 3D convolution kernels were used to explore different kinds of spectral and local spatial feature patterns. The stride, applied together with the convolution kernel, was used for dimensionality reduction. According to Equation (6), six feature maps can be obtained, and each map is a 4 × 3 × 32 tensor.
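The feature-map sizes quoted for the first two layers follow from the usual output-size rule for a convolution without padding, floor((n - k) / s) + 1 along each axis. The short Python sketch below checks them; the helper name `conv_output_shape` is hypothetical, and the absence of padding is an assumption that happens to reproduce the 4 × 3 × 32 size stated above.

```python
def conv_output_shape(in_shape, kernel, stride):
    """Output size per axis of a convolution without padding."""
    return tuple((n - k) // s + 1 for n, k, s in zip(in_shape, kernel, stride))


layer1 = conv_output_shape((6, 3, 103), kernel=(1, 1, 1), stride=(1, 1, 1))
layer2 = conv_output_shape(layer1, kernel=(3, 1, 8), stride=(1, 1, 3))
print(layer1)   # (6, 3, 103)
print(layer2)   # (4, 3, 32)
```

Along the spectral axis, (103 - 8) // 3 + 1 = 32, so the stride of 3 reduces the 103 bands to 32 feature positions in a single layer.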
A structure similar to that of the second convolution layer was adopted for Layers 3 to 8, where the output of the $(n − 1)$-th layer was used as the input of the $n$-th layer. The difference between those layers and the first layer only comes from the number of convolution kernels, the size of the convolution kernel, and the convolution stride, which are listed in Table 1.
For the last layer, the softmax function was used together with the convolution operation in place of the activation function. Specifically, the input of this layer (i.e., the output of the eighth convolution layer) was first convolved with the convolution kernels in this layer. A softmax function was then applied to the convolution results. We set the number of convolution kernels in this layer equal to the number of classes, $K + 1$. Thus, the output of the softmax function can be used to represent the probability of the input cube-pair $[C_i, C_t]$ belonging to each class, which we denote as $y_{it}$.
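The softmax step of the last layer can be illustrated with a few lines of Python that turn K + 1 raw scores into class probabilities. The function name `softmax` and the example scores are hypothetical; subtracting the maximum score before exponentiating is a standard numerical-stability device and is an addition of this sketch, not something stated in the text.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to 1."""
    m = max(scores)                             # shift by the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


K = 9                                           # e.g., nine real classes plus Class 0
scores = [0.0] * (K + 1)                        # hypothetical last-layer outputs
scores[3] = 2.0
probs = softmax(scores)
print(len(probs), probs.index(max(probs)))      # 10 3
```

The resulting vector has one entry per class from 0 to K, matching the K + 1 convolution kernels of the last layer.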

Training and Test Schemes of DCPN
Since the DCPN is a feedforward network, i.e., the output of the $(n − 1)$-th layer is used as the input of the $n$-th layer, the mapping function $f$ (defined in Equation (5)) for the whole network equals $f = \phi^{(9)}(\phi^{(8)}(\cdots(\phi^{(1)})))$, where $\phi^{(n)}$ denotes the mapping function of the $n$-th convolutional layer. Considering that the parameters, including the kernel weights $K$ and the bias $b$ of the different layers, determine $f$, we should first address how those parameters are effectively set.
Cross entropy was used to estimate those parameters in this study, which for a cube-pair can be calculated as
$$E_{it} = -\hat{y}_{it}^{\top} \log(y_{it}),$$
where $\hat{y}_{it}$ represents the one-hot code of the true class label $y_{it}$ (e.g., we code 3 as [0, 0, 1, 0, 0] for a classification problem with five classes in total).
[...] extended-cube as 3 × 3 and 5 × 5, respectively. For a fair comparison, we set the spatial size to 3 × 3 for the competing methods that take spatial information into account.
In this study, we chose overall accuracy (OA), defined as the ratio of correctly labeled samples to all test samples, to measure HSI classification results.
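As a small illustration of this metric, the Python sketch below computes OA from two label lists. The function name `overall_accuracy` and the example labels are hypothetical.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


oa = overall_accuracy([1, 2, 2, 3], [1, 2, 3, 3])
print(f"OA = {100 * oa:.1f}%")      # OA = 75.0%
```

OA weights every test sample equally, so classes with many test samples influence it more than small classes.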

Comparison with Other Methods
In this section, we first chose 200 samples from each class as the training set and used the remaining samples for testing. The numbers of training and test samples for each dataset are given in Table 2. We then conducted experiments in which the number of training samples varied.
(4) Though cube-based methods, including 3D-CNN and 2D-CNN, perform better than pixel-level-based methods, they are inferior to both the proposed method and the pixel-pair-based method. This is caused by the limited training samples, which leave 3D-CNN and 2D-CNN insufficiently trained. Thus, they generalize poorly to the test data. In contrast, both the cube-pair and pixel-pair strategies effectively increase the number of training samples, which ensures that the network can be well trained.
Typical classification maps on the three datasets are given in Figures 3-5, where (a) represents the ground truth and (b)-(h) represent the classification maps from different methods. We use different colors to denote different categories in these figures, which are illustrated in Figure 6. We can see that the proposed method has the best classification results, which is consistent with the results analyzed above.

Experimental Results with Different Number of Training Samples
The classification results with different numbers of training samples are shown in Figures 7-9, where the number per class varied from 50 to 200 with an interval of 50. From the experimental results, we can see that the classification accuracy of deep-learning-based methods increases when more samples are introduced for training, which is natural since the classifier can be better trained with more training samples. Nevertheless, the proposed method consistently outperforms all competing methods for every number of training samples.
From the above results, we can conclude that the proposed method outperforms all competing methods.

Discussion
The cube size and the layer number (i.e., depth) are two important parameters of the DCPN. To further examine the influence of these two parameters on classification results, the following two experiments are described and discussed.
In the first experiment, we fixed the layer number and varied the cube size. The experimental results on the Indian Pines dataset can be seen in Table 6, where ecs denotes the size of the extended-cube and k denotes the size of the cube. Note that, when we set the cube size k to 1, the proposed method degenerates to a pixel-pair-based method. When we increase the cube size k from 1 to 3, the classification accuracy improves, which demonstrates that local neighboring pixels are indeed helpful for classification. However, when we increase the cube size k further (e.g., from 3 to 5), the classification performance drops slightly. This is because a larger cube is more likely to include pixels from different categories, which decreases the classification accuracy. Thus, we set the size of the cube k and the extended-cube ecs to 3 and 5, respectively, and fixed them in all experiments.
In the second experiment, we fixed the cube size and varied the layer number. The experimental results of the proposed DCPN and 3D-CNN on the Indian Pines dataset can be seen in Table 7, where the layer number is chosen as 3, 5, 7, and 10. It can be seen that, as the layer number increases, the classification accuracy of the DCPN improves, whereas the classification accuracy of 3D-CNN decreases. The comparison results are consistent with the above analysis. The FCN has fewer parameters; thus, given the same amount of training samples, it can go deeper than the CNN and has a superior nonlinear representation ability.

Conclusions
In this paper, we propose a cube-pair-based HSI classification architecture. The proposed architecture can utilize the 3D characteristics of HSI and generalize well to the test data given only limited labeled training samples. Within this architecture, a 3D fully convolutional network is further modeled, which has fewer parameters than the CNN. Thus, the proposed network has superior generalization ability compared with the CNN when given the same amount of training samples. Experimental results on several HSI datasets demonstrate that the proposed method achieves superior classification results compared with other state-of-the-art competing methods.

Figure 3. Classification maps of different methods on the Indian Pines dataset.

Figure 4. Classification maps of different methods on the PaviaU dataset.

Figure 5. Classification maps of different methods on the Salinas dataset.

Figure 6. Colors represent different classes for three different datasets.

Figure 7. Classification performance with different numbers of training samples on the Indian Pines dataset.

Figure 8. Classification performance with different numbers of training samples on the PaviaU dataset.

Figure 9. Classification performance with different numbers of training samples on the Salinas dataset.

Table 1. Parameter settings of different layers in the deep cube-pair network (DCPN) model for PaviaU.

Table 3. Classification accuracy (%) of different methods on the Indian Pines dataset.

Table 4. Classification accuracy (%) of different methods on the PaviaU dataset.

Table 5. Classification accuracy (%) of different methods on the Salinas dataset.

Table 6. Classification accuracy with different cube sizes on the Indian Pines dataset.

Table 7. Classification accuracy with different layer numbers on the Indian Pines dataset.