Multi-Stage Convolutional Broad Learning with Block Diagonal Constraint for Hyperspectral Image Classification

Abstract: By combining broad learning and a convolutional neural network (CNN), a block-diagonal constrained multi-stage convolutional broad learning (MSCBL-BD) method is proposed for hyperspectral image (HSI) classification. Firstly, as the linear sparse features extracted by the conventional broad learning method cannot fully characterize the complex spatial-spectral features of HSIs, we replace the linear sparse features in the mapped feature (MF) with the features extracted by the CNN to achieve more complex nonlinear mapping. Then, in the multi-layer mapping process of the CNN, information loss occurs to a certain degree. To this end, the multi-stage convolutional features (MSCFs) extracted by the CNN are expanded to obtain the multi-stage broad features (MSBFs). MSCFs and MSBFs are further spliced to obtain multi-stage convolutional broad features (MSCBFs). Additionally, in order to enhance the mutual independence between MSCBFs, a block-diagonal constraint is introduced, and the MSCBFs are mapped by a block-diagonal matrix, so that each feature is represented linearly only by features of the same stage. Finally, the output layer weights of MSCBL-BD and the desired block-diagonal matrix are solved by the alternating direction method of multipliers. Experimental results on three popular HSI datasets demonstrate the superiority of MSCBL-BD.


Introduction
With the rapid development of remote sensing technology, the spectral and spatial resolutions of hyperspectral images (HSIs) are increasing. HSIs can be used to identify subtle differences among ground objects, demonstrating strong discriminative ability [1,2], and have been applied in many fields such as crop monitoring [3], environmental analysis and prediction [4], climate detection [5], and forest surveys [6]. Pixel-wise classification is one of the common tasks in these applications. The commonly used classification methods include support vector machine (SVM) [7,8] and k-nearest neighbor [9]. However, due to the large number of bands and the strong correlation between adjacent bands, band redundancy exists in HSI data [10,11]. Direct classification with the original HSI will result in lower classification accuracy. Therefore, before HSI classification, performing feature extraction [12] (mapping the original data to another feature space through a mapping matrix) or feature selection (directly selecting several bands from the original bands according to certain criteria or strategies) [13] can effectively improve the classification performance. Simplified deep models have also been explored, as they can achieve better performance than CNN in situations with limited labeled samples. Chan et al. proposed a simplified CNN structure named principal component analysis network (PCANet) [44], which has attracted attention due to its simple training process and the fact that it does not require a large number of iterations. Furthermore, Pan et al. [45] used kernel PCA to replace PCA for convolutional kernel learning to solve the problem of insufficient nonlinear mapping ability in PCANet. By combining the rolling guidance filter and the vertex component analysis network, Pan et al. [46] proposed a simplified deep-learning model for HSI classification. In addition to the above work, sample expansion is another commonly used solution to the issue of insufficient labeled samples. For example, Li et al. [47] proposed a CNN with pixel-pair features (CNN-PPF), constructing a new training set with many more labeled samples than the original training set by comparing the labels between every two samples.
In summary, CNN has strong feature representation ability, but when labeled samples are limited, this ability is restricted to some extent. The structure of BLS is simple and flexible, but the linear mapped features it uses cannot fully express HSI data. To this end, we constructed a novel multi-stage convolutional broad learning with block-diagonal constraint (MSCBL-BD) method for HSI classification by taking full advantage of CNN and BLS. The main contributions of our work are summarized as follows: (1) By combining CNN and BLS, their advantages can be utilized simultaneously. When labeled samples are limited, the training set cannot characterize the complete distribution of HSIs. Although features extracted by CNN can provide strong discriminative ability, they may overfit the training set. The concatenation of convolutional features and broad features can be seen as the combination of fine- and coarse-designed features, which has stronger generalization ability than either of them alone. (2) After the multi-layer mapping of the CNN, some information of the original HSI is inevitably lost. Therefore, multi-stage convolutional features are utilized and expanded stage-by-stage to mitigate the information loss to a certain extent. (3) Due to the use of width expansion and multi-stage features, the similarity between the features of different stages may increase accordingly, which results in feature redundancy. Therefore, we use a block-diagonal matrix to impose constraints on the multi-stage convolutional broad features to enhance the independence between the convolutional broad features of different stages, which helps to seek diversity in features and learn a more accurate HSI classification model. The rest of this paper is organized as follows: We elaborate the proposed MSCBL-BD for HSI classification in Section 2. Experiments on three popular hyperspectral datasets are described in Section 3, followed by the discussion of the proposed method in Section 4. The conclusions are provided in Section 5.

Structure of MSCBL-BD
The structure of MSCBL-BD is shown in Figure 1, which mainly includes the following parts: (1) obtaining the low-dimensional neighboring region representation: PCA is used to perform band reduction on the original HSI, and the low-dimensional neighboring region representation is then constructed; (2) extracting the multi-stage CBFs: firstly, use the limited labeled samples to pre-train a three-stage CNN and extract the multi-stage convolutional features (CFs) of the HSI; secondly, perform channel-wise global average pooling on the features of the first two stages; thirdly, perform stage-wise width expansion on the CFs to obtain the broad features (BFs), and then combine the CFs and BFs to obtain the convolutional broad features (CBFs); (3) imposing the block-diagonal constraint on the multi-stage CBFs: map the CBFs of all three stages through a block-diagonal matrix to obtain the block-diagonal-constrained CBFs, which ensures that each CBF is linearly represented only by features of the same stage. The optimal solutions of the output layer weights and the block-diagonal matrix are found by the alternating direction method of multipliers (ADMM). A sketch of part (1) is given below; parts (2) and (3) are detailed in the following subsections.
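As a concrete illustration of part (1), the following NumPy sketch (our own helper functions; the 5 × 5 window and 15 retained bands are illustrative choices, not fixed by the method) performs PCA band reduction and extracts a neighboring region representation for one pixel.

```python
import numpy as np

def pca_reduce_bands(hsi, d3):
    """Reduce an H x W x B hyperspectral cube to H x W x d3 via PCA over the spectral axis."""
    h, w, b = hsi.shape
    pixels = hsi.reshape(-1, b).astype(np.float64)
    pixels -= pixels.mean(axis=0)                       # center each band
    cov = pixels.T @ pixels / (pixels.shape[0] - 1)     # B x B band covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    proj = eigvecs[:, np.argsort(eigvals)[::-1][:d3]]   # top-d3 principal directions
    return (pixels @ proj).reshape(h, w, d3)

def neighboring_region(cube, row, col, size=5):
    """Extract a size x size x d3 patch centered at (row, col), padding at the borders."""
    r = size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode='edge')
    return padded[row:row + size, col:col + size, :]

# Usage: reduce a 145 x 145 x 200 cube to 15 bands, then take one 5 x 5 region.
hsi = np.random.rand(145, 145, 200)                     # stand-in for a real HSI
low = pca_reduce_bands(hsi, d3=15)
chi = neighboring_region(low, 72, 72, size=5)           # chi has shape (5, 5, 15)
```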

CNN Pre-Training
The original hyperspectral image is presented as a 3D cube. If vectorization is performed directly on the HSI, it not only increases the dimensionality but also destroys the inherent structure of the HSI data. The neighboring region representation is a common spectral-spatial representation for HSIs [48], constructed by selecting several pixels around the target pixel. Figure 2 shows a schematic diagram of selecting the surrounding 24 pixels to form a 5 × 5 × B neighboring region representation, where B denotes the number of bands in the original HSI. With this representation, not only the band information of the target pixel but also the information of its neighboring pixels can be exploited. However, if the original high-dimensional neighboring region representation is directly used as the input of the CNN, redundant information will exist across the bands and the number of network parameters will increase dramatically, thereby degrading the final performance of the CNN. A common remedy is to utilize a dimensionality reduction technique, such as principal component analysis (PCA), to reduce the number of bands in the neighboring region representation of an HSI. Define a low-dimensional neighboring region representation $\chi \in \mathbb{R}^{d_1 \times d_2 \times d_3}$ as the input data of the CNN, where $d_1$, $d_2$, and $d_3$ are the width, height, and number of bands, respectively.

Due to the large number of parameters of the fully connected layer, the CNN used in our work only includes convolutional, pooling, nonlinear, and SoftMax layers. The input is connected to the convolutional layer by the convolutional kernel to obtain the output feature maps:

$$F_C = K_C * I_C + b_C \qquad (1)$$

where $b_C$ is the bias, $F_C$ represents the output features of the convolutional layer, and $I_C$ is the input of the convolutional layer. For the first layer, the input data is the neighboring region representation of the HSI, i.e., $I_{C_1} = \chi$. $K_C$ indicates the convolutional kernel of the convolutional layer and $*$ represents the convolution operation.

In general, a pooling layer is added after a convolutional layer to quickly reduce the dimensionality and enhance the invariance of the extracted features. The features $F_P$ obtained by the pooling layer are:

$$F_P = \mathrm{down}(I_P) \qquad (2)$$

where $\mathrm{down}(\cdot)$ indicates a max-pooling operation and $I_P$ is the input of the pooling layer, which is also the output of the previous convolutional layer.

To achieve nonlinear mapping, a convolutional layer or a pooling layer is typically followed by a nonlinear layer. Here, the activation function of the nonlinear layer is the sigmoid function:

$$F_N = \mathrm{sigmoid}(I_N) = \frac{1}{1 + e^{-I_N}} \qquad (3)$$

where $I_N$ is the input of the nonlinear layer, which is also the output of the previous pooling layer, and $F_N$ denotes the output nonlinear features of the nonlinear layer.

The SoftMax function is generally used as the last layer of the CNN, with the number of neurons equal to the number of classes. The SoftMax loss function is defined as [49]:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} 1\{y_i = j\} \log \frac{e^{W_{S_j}^T I_{S_i} + b_{S_j}}}{\sum_{l=1}^{C} e^{W_{S_l}^T I_{S_i} + b_{S_l}}} \qquad (4)$$

where $N$ denotes the number of training samples; $C$ is the number of classes; $y_i$ is the label of the $i$th sample; $W_S$ and $b_S$ are the weight and bias of the SoftMax layer, respectively; $I_S$ is the input of the SoftMax layer, obtained by the preceding convolutional, pooling, and nonlinear operations; and $1\{\cdot\}$ is the indicator function, so that 1{a true statement} = 1 and 1{a false statement} = 0. The CNN training process consists of two parts: forward and backward calculations.
In the forward calculation process, the output of each layer is calculated according to the current parameters. In the backward calculation process, the weight and bias of each layer are updated by minimizing the loss function. The batch stochastic gradient descent algorithm is used for weight and bias update.
Here, one convolutional layer plus one pooling layer plus two nonlinear layers plus one 1 × 1 convolutional layer is defined as one stage, and the pre-trained CNN yields the multi-stage convolutional features:

$$F_{\mathrm{stage}_i} = f(\chi; \theta_i), \quad i = 1, \cdots, s \qquad (5)$$

where $F_{\mathrm{stage}_i}$ represents the $i$th-stage feature; $s$ is the total number of stages; $\theta_i$ denotes the learnable parameter set from the input to the $i$th stage of the CNN, including convolutional kernels and biases; and $f(\cdot)$ denotes the nonlinear mapping procedure of the CNN under $\theta_i$.
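For concreteness, the following is a minimal PyTorch sketch of such a staged feature extractor, assuming the layer sizes of Table 1 (17 × 17 × 15 input, 30 feature maps, 9 classes); the class and variable names are ours, and the original implementation used MatConvNet rather than PyTorch.

```python
import torch
import torch.nn as nn

class StagedCNN(nn.Module):
    """Staged CNN: stages 1-2 are conv -> max-pool -> sigmoid -> 1x1 conv -> sigmoid;
    stage 3 is the final 2x2 convolution producing 1x1 class-score maps."""
    def __init__(self, in_bands=15, n_maps=30, n_classes=9):
        super().__init__()
        # Stage 1: C1 (4x4) -> P1 (2x2, step 2) -> N1 -> C2 (1x1) -> N2
        self.stage1 = nn.Sequential(
            nn.Conv2d(in_bands, n_maps, kernel_size=4), nn.MaxPool2d(2, 2), nn.Sigmoid(),
            nn.Conv2d(n_maps, n_maps, kernel_size=1), nn.Sigmoid())
        # Stage 2: C3 (4x4) -> P2 (2x2, step 2) -> N3 -> C4 (1x1) -> N4
        self.stage2 = nn.Sequential(
            nn.Conv2d(n_maps, n_maps, kernel_size=4), nn.MaxPool2d(2, 2), nn.Sigmoid(),
            nn.Conv2d(n_maps, n_maps, kernel_size=1), nn.Sigmoid())
        # Stage 3: C5 (2x2), whose 1x1 output maps feed the SoftMax layer
        self.stage3 = nn.Conv2d(n_maps, n_classes, kernel_size=2)

    def forward(self, x):
        f1 = self.stage1(x)     # F_stage_1: 7 x 7 x 30 for a 17 x 17 x 15 input
        f2 = self.stage2(f1)    # F_stage_2: 2 x 2 x 30
        f3 = self.stage3(f2)    # F_stage_3: 1 x 1 x n_classes
        return f1, f2, f3
```

After pre-training with the SoftMax loss, the three returned tensors play the role of $F_{\mathrm{stage}_1}$, $F_{\mathrm{stage}_2}$, and $F_{\mathrm{stage}_3}$ in Equation (5).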

MSCBL-BD
The conventional BLS can be regarded as a three-layer neural network, including an input layer, an intermediate layer (composed of an MF and an EN), and an output layer. The MF is obtained by mapping the input with the weights fine-tuned by a linear autoencoder. It is worth noting that the linear features obtained by linear mapping cannot fully express the complex spectral-spatial features of an HSI, thus affecting the final classification performance. The EN is obtained by mapping the MF with random weights to achieve the width expansion of the MF. The output layer is connected to both the MF and the EN. By minimizing the error between the output vector and the ground-truth label vector, the objective function of BLS is constructed as:

$$\min_{W_B^m} \left\| [Z_B \,|\, H_B] W_B^m - Y \right\|_2^2 + \lambda_1 \left\| W_B^m \right\|_2^2 \qquad (6)$$

where the first term is the empirical risk, which calculates the error between the model output vector and the real label vector; $Y$ denotes the labels of the samples; $\|\cdot\|_2$ is the $\ell_2$-norm; the second term is the structural risk, used to improve the generalization ability of the model; $Z_B$ and $H_B$ represent the features of the MF and the EN, respectively; $W_B^m$ indicates the connection weights of the output layer; and $\lambda_1$ is the coefficient of the structural risk term. Equation (6) can be solved by ridge regression theory.
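Equation (6) has the standard ridge-regression closed-form solution; a minimal NumPy sketch of it, with matrix names matching the notation above and random stand-in data, is:

```python
import numpy as np

def bls_output_weights(Z_B, H_B, Y, lam1=0.1):
    """Closed form of Eq. (6): W = (A^T A + lam1 * I)^{-1} A^T Y,
    where A = [Z_B | H_B] concatenates mapped and enhancement features."""
    A = np.hstack([Z_B, H_B])
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam1 * np.eye(d), A.T @ Y)

# Usage: 500 samples, 100 mapped + 200 enhancement features, 9 classes (one-hot labels).
Z_B, H_B = np.random.rand(500, 100), np.random.rand(500, 200)
Y = np.eye(9)[np.random.randint(0, 9, 500)]
W_m_B = bls_output_weights(Z_B, H_B, Y)        # shape (300, 9)
```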
Since the linear sparse features extracted by BLS cannot fully characterize the complex spectral-spatial features of an HSI, the CFs are used to replace the linear sparse features in the MF to achieve more complex nonlinear mapping. Furthermore, in order to reduce the information lost by the CNN in the multi-layer mapping process, the multi-stage features extracted by the CNN are utilized here and separately expanded in width. Given the multi-stage convolutional features $F_{\mathrm{stage}_i}$ of the pre-trained CNN, channel-wise global average pooling is first performed on the CFs of the first two stages:

$$P_{\mathrm{stage}_i} = \mathrm{down}_G\left(F_{\mathrm{stage}_i}\right), \quad i = 1, \cdots, s - 1 \qquad (7)$$

where $\mathrm{down}_G(\cdot)$ represents the channel-wise global average pooling, which summarizes the global information of the feature map in each channel, to some extent. After the pooling operation, several 1D vectors $P_{\mathrm{stage}} = \left[ P_{\mathrm{stage}_1} | \cdots | P_{\mathrm{stage}_{s-1}} \right]$ are obtained and, together with the CFs of the last stage (the vectorized $F_{\mathrm{stage}_s}$, denoted $P_{\mathrm{stage}_s}$), constitute the multi-stage CFs (MSCFs). Each stage of the MSCFs is then expanded in width with randomly generated weights $W_{E_i}$ and biases $\beta_{E_i}$ to obtain the multi-stage BFs (MSBFs):

$$H_i = \varphi\left( P_{\mathrm{stage}_i} W_{E_i} + \beta_{E_i} \right), \quad i = 1, \cdots, s \qquad (8)$$

where $\varphi(\cdot)$ is a nonlinear function, such as tansig. Splicing the MSCFs and the MSBFs gives the MSCBFs $A$, which can be rewritten as:

$$A = \left[ A_1 \,|\, A_2 \,|\, \cdots \,|\, A_s \right] \qquad (9)$$

where $A_i = \left[ P_{\mathrm{stage}_i} \,|\, H_i \right]$ concatenates the CFs and BFs of the $i$th stage.

Enhancing the linear independence among features can help to produce a more accurate classification model [25,26], so we introduce the block-diagonal constraint [50]. In the commonly used block-diagonal representation methods, each sample is represented only by samples of the same class, which enhances the mutual independence among different classes of samples [51,52]. Here, we use a block-diagonal matrix $D = \mathrm{diag}(D_{11}, \cdots, D_{ss})$ to map the MSCBFs into a subspace in which each feature is represented only by features of the same stage. MSCBL-BD optimizes the following objective function:

$$\min_{W^m, D} \left\| A D W^m - Y \right\|_2^2 + \lambda_1 \left\| W^m \right\|_2^2, \quad \text{s.t. } D \text{ is block diagonal} \qquad (10)$$

where $W^m$ denotes the output layer weights of MSCBL-BD. Further considering the error term $E$, Equation (10) can be rewritten as:

$$\min_{W^m, D, E} \left\| A D W^m - Y \right\|_2^2 + \lambda_1 \left\| W^m \right\|_2^2, \quad \text{s.t. } A = AD + E, \ D \text{ is block diagonal} \qquad (11)$$

Since the absolute block-diagonal structure is difficult to learn, the work of [50] was consulted: assuming that the components on the non-block diagonal are as small as possible, the inter-stage incoherence is boosted, and the coherent intra-stage representation is further enhanced at the same time [50]. Two terms are constructed to achieve these objectives: (1) $\|P \odot D\|_F^2$ is used to minimize the elements on the non-block diagonal, where $\odot$ is the Hadamard product, $\|\cdot\|_F$ denotes the Frobenius norm, and $P$ indicates the non-block-diagonal positions (constructed from $\mathbf{1}_D$, a $D$-dimensional vector whose elements are all 1); (2) a sparse term $\|Q \odot D\|_0$ is constructed to enhance the coherent intra-stage representation, where $\|\cdot\|_0$ represents the $\ell_0$-norm, which counts the number of non-zero elements in a matrix. By minimizing this sparse term, as many non-block-diagonal elements as possible are driven to 0. Since the optimization of the $\ell_0$-norm is NP-hard, the relaxed term $\|Q \odot D\|_1$ is used here, where $\|\cdot\|_1$ denotes the $\ell_1$-norm. Then, Equation (11) can be rewritten as:

$$\min_{W^m, D, E} \left\| A D W^m - Y \right\|_2^2 + \lambda_1 \left\| W^m \right\|_2^2 + \lambda_2 \left\| P \odot D \right\|_F^2 + \lambda_3 \left\| Q \odot D \right\|_1 + \lambda_4 \left\| E \right\|_F^2 + \lambda_5 \left\| D \right\|_*, \quad \text{s.t. } A = AD + E \qquad (12)$$

where $\lambda_2 \sim \lambda_5$ are the balancing coefficients, $\|D\|_*$ is used to explore the potential correlation patterns [53], and $\|\cdot\|_*$ denotes the nuclear norm. Due to the difficulty of directly solving the $\ell_1$-norm and nuclear-norm problems, auxiliary variables $M$ and $N$ are introduced, and Equation (12) can be rewritten as:

$$\min_{W^m, D, E, M, N} \left\| A D W^m - Y \right\|_2^2 + \lambda_1 \left\| W^m \right\|_2^2 + \lambda_2 \left\| P \odot D \right\|_F^2 + \lambda_3 \left\| Q \odot M \right\|_1 + \lambda_4 \left\| E \right\|_F^2 + \lambda_5 \left\| N \right\|_*, \quad \text{s.t. } A = AD + E, \ M = D, \ N = D \qquad (13)$$

Equation (13) can be solved by ADMM [50], and the augmented Lagrangian expression is:

$$\begin{aligned} \mathcal{L} = {} & \left\| A D W^m - Y \right\|_2^2 + \lambda_1 \left\| W^m \right\|_2^2 + \lambda_2 \left\| P \odot D \right\|_F^2 + \lambda_3 \left\| Q \odot M \right\|_1 + \lambda_4 \left\| E \right\|_F^2 + \lambda_5 \left\| N \right\|_* \\ & + \left\langle C_1, A - AD - E \right\rangle + \left\langle C_2, M - D \right\rangle + \left\langle C_3, N - D \right\rangle + \frac{\mu}{2} \left( \left\| A - AD - E \right\|_F^2 + \left\| D - M \right\|_F^2 + \left\| D - N \right\|_F^2 \right) \end{aligned} \qquad (14)$$

where $C_1$, $C_2$, and $C_3$ are Lagrangian multipliers, $\langle C_1, A - AD - E \rangle = \mathrm{tr}\left( C_1^T (A - AD - E) \right)$, and $\mu > 0$ is the penalty parameter. Each variable is updated alternately to find the desired solution, and the detailed calculation process for each variable is provided below.
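Before the update steps, the block pattern can be made concrete: the NumPy sketch below (our own illustration; the per-stage feature dimensions are hypothetical) builds the non-block-diagonal indicator $P$ and evaluates the two penalty terms on a given $D$, assuming $Q$ selects the same non-block-diagonal positions as $P$.

```python
import numpy as np

def nonblock_mask(stage_dims):
    """Indicator P: 1 on non-block-diagonal positions, 0 inside the s diagonal blocks."""
    d = sum(stage_dims)
    P = np.ones((d, d))
    start = 0
    for k in stage_dims:
        P[start:start + k, start:start + k] = 0.0   # zero out each k x k diagonal block
        start += k
    return P

# Example: three stages with hypothetical CBF dimensions 30, 30, and 9.
stage_dims = [30, 30, 9]
P = nonblock_mask(stage_dims)
D = np.random.randn(sum(stage_dims), sum(stage_dims))
frob_term = np.linalg.norm(P * D, 'fro') ** 2    # ||P .* D||_F^2: off-block energy
l1_term = np.abs(P * D).sum()                    # ||Q .* D||_1 with Q = P (assumed)
```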
(1) Update $W^m$. Fixing the other variables, the update of $W^m$ is equivalent to solving the following objective function:

$$\min_{W^m} \left\| A D W^m - Y \right\|_2^2 + \lambda_1 \left\| W^m \right\|_2^2 \qquad (15)$$

Calculating the derivative of Equation (15) with respect to $W^m$ and setting it to zero, we obtain the closed-form solution:

$$W^{m(t+1)} = \left( (A D^{(t)})^T A D^{(t)} + \lambda_1 I \right)^{-1} (A D^{(t)})^T Y \qquad (16)$$

(2) Update $D$. When the remaining variables are fixed, the expression of Equation (14) with respect to $D$ is:

$$\min_{D} \left\| A D W^{m(t+1)} - Y \right\|_2^2 + \lambda_2 \left\| P \odot D \right\|_F^2 + \frac{\mu^{(t)}}{2} \left( \left\| A - AD - E^{(t)} + \frac{C_1^{(t)}}{\mu^{(t)}} \right\|_F^2 + \left\| D - M^{(t)} - \frac{C_2^{(t)}}{\mu^{(t)}} \right\|_F^2 + \left\| D - N^{(t)} - \frac{C_3^{(t)}}{\mu^{(t)}} \right\|_F^2 \right) \qquad (17)$$

Calculating the derivative of Equation (17) with respect to $D$ and setting it to zero yields a linear equation in $D$, whose solution gives the optimal $D^{(t+1)}$ (Equation (18)).

(3) Update $N$. When the other variables are fixed, Equation (14) reduces to an expression only involving the variable $N$:

$$\min_{N} \lambda_5 \left\| N \right\|_* + \frac{\mu^{(t)}}{2} \left\| N - \left( D^{(t+1)} - \frac{C_3^{(t)}}{\mu^{(t)}} \right) \right\|_F^2 \qquad (19)$$

According to [54], the solution can be obtained by the singular value thresholding operation:

$$N^{(t+1)} = U\, h_{\lambda_5 / \mu^{(t)}}(\Sigma)\, V^T \qquad (20)$$

where $U \Sigma V^T$ is the singular value decomposition of $D^{(t+1)} - \frac{C_3^{(t)}}{\mu^{(t)}}$, and $h_{\lambda_5 / \mu^{(t)}}(\cdot)$ is the soft-thresholding operation, defined as:

$$h_{\lambda}(x) = \mathrm{sign}(x) \max(|x| - \lambda, 0) \qquad (21)$$

where $\lambda$ is the threshold value.
(4) Update $M$. The update of $M$ is equivalent to solving the following problem:

$$\min_{M} \lambda_3 \left\| Q \odot M \right\|_1 + \frac{\mu^{(t)}}{2} \left\| M - \left( D^{(t+1)} - \frac{C_2^{(t)}}{\mu^{(t)}} \right) \right\|_F^2 \qquad (22)$$

which can be solved element-wise by the point multiplication mechanism, with the optimal solution:

$$M^{(t+1)} = h_{\lambda_3 Q / \mu^{(t)}}\left( D^{(t+1)} - \frac{C_2^{(t)}}{\mu^{(t)}} \right) \qquad (23)$$

where the soft threshold of Equation (21) is applied entry-wise with threshold $\lambda_3 Q_{ij} / \mu^{(t)}$.

(5) Update $E$. When fixing the other variables, the objective function of Equation (14) with respect to $E$ becomes:

$$\min_{E} \lambda_4 \left\| E \right\|_F^2 + \frac{\mu^{(t)}}{2} \left\| G - E \right\|_F^2 \qquad (24)$$

According to [50], let $G = A - A D^{(t+1)} + \frac{C_1^{(t)}}{\mu^{(t)}}$; the closed-form solution is then $E^{(t+1)} = \frac{\mu^{(t)}}{2 \lambda_4 + \mu^{(t)}} G$.

The above steps are alternated until a predetermined maximum number of iterations is reached, thereby obtaining the desired output layer weight $W^m$ and block-diagonal matrix $D$; the prediction label vector $\hat{Y}$ is then calculated as:

$$\hat{Y} = A D W^m \qquad (25)$$

Generally, sparse signal recovery problems can be solved by a dozen different methods, such as orthogonal matching pursuit, K-SVD, and ADMM. Among them, ADMM was designed for general decomposition methods and decentralized algorithms in optimization problems. Furthermore, many state-of-the-art algorithms for $\ell_1$-norm-involved problems can be derived from ADMM. In addition, the problem in Equation (13) can be seen as a special case of the classical ADMM problem. Therefore, ADMM is used here to solve the problem in Equation (13).
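The two proximal operators used in the $N$ and $M$ updates have compact implementations; the following NumPy sketch (our own, following Equations (20)-(23), with $Q$ taken as all-ones for simplicity) shows them inside one iteration:

```python
import numpy as np

def soft_threshold(X, lam):
    """Entry-wise soft thresholding: h_lam(x) = sign(x) * max(|x| - lam, 0), Eq. (21)."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

def svt(X, lam):
    """Singular value thresholding U h_lam(Sigma) V^T, used for the nuclear-norm update."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(soft_threshold(s, lam)) @ Vt

# One iteration's auxiliary updates with stand-in values:
mu, lam3, lam5 = 1.0, 0.1, 0.1
D = np.random.randn(69, 69)
C2, C3 = np.zeros_like(D), np.zeros_like(D)
M_new = soft_threshold(D - C2 / mu, lam3 / mu)   # Eq. (23)
N_new = svt(D - C3 / mu, lam5 / mu)              # Eq. (20)
```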

Experiments and Analysis
To verify the validity of the proposed method, we selected three popular real HSI datasets: Indian Pines, Pavia University, and Salinas. The ground-truth maps and sample information of the three HSI datasets are shown in Figure 3. The spatial sizes of the three datasets are 145 × 145, 610 × 340, and 512 × 217 pixels, respectively. All experiments were performed on the MATLAB 2016a platform; the computer was configured with an Intel i7-4790 CPU, 16 GB of memory, and a GTX 980 GPU. In order to eliminate randomness, all experiments were repeated 10 times and the mean values are reported. A total of 200 labeled samples per class were randomly selected for training, and the others were used for testing. Four evaluation indicators were selected to evaluate the experimental results: the classification accuracy of each class of ground object, the average classification accuracy (AA), the overall classification accuracy (OA), and the Kappa coefficient.
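For reference, these four indicators can be computed from a confusion matrix as in the following NumPy sketch (our own helper, not taken from the paper's MATLAB code):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Per-class accuracy, AA, OA, and the Kappa coefficient from integer label arrays."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                    # confusion matrix
    per_class = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    aa = per_class.mean()                                # average accuracy
    oa = np.diag(cm).sum() / cm.sum()                    # overall accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
    kappa = (oa - pe) / (1 - pe)                         # chance-corrected agreement
    return per_class, aa, oa, kappa
```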

Parameter Setting
Except for the number of feature maps of the last convolutional layer (9, 9, and 16 for the Indian Pines, Pavia University, and Salinas datasets, respectively), the network structures of the CNNs for the three HSI datasets were the same. The detailed structure of the CNN is shown in Table 1. It can be seen that: (1) there were five convolutional layers in total, i.e., C1∼C5. The convolutional kernel sizes of C2 and C4 were 1 × 1, which helps to obtain a deeper network structure at the cost of a small increase in network parameters; the convolutional kernel sizes of C1 and C3 were 4 × 4, and the convolutional kernel size of C5 was 2 × 2. (2) There were two max-pooling layers, i.e., P1 and P2, both with a step size of 2. (3) A total of four nonlinear layers, denoted N1∼N4, were included; the activation function of all nonlinear layers was the sigmoid function.

Table 1. The detailed structure of the CNN (input sizes for the Indian Pines dataset; C5 has 9, 9, and 16 filters for Indian Pines, Pavia University, and Salinas, respectively).

Layer | Input (W × H × C)  | Filters (Number, W × H) | Step | Output (W × H × C) | Dimension
I1    | 17 × 17 × 15       | /                       | /    | 17 × 17 × 15       | 4335
C1    | 17 × 17 × 15       | 30, 4 × 4               | 1    | 14 × 14 × 30       | 5880
P1    | 14 × 14 × 30       | 2 × 2                   | 2    | 7 × 7 × 30         | 1470
N1    | 7 × 7 × 30         | /                       | /    | 7 × 7 × 30         | 1470
C2    | 7 × 7 × 30         | 30, 1 × 1               | 1    | 7 × 7 × 30         | 1470
N2    | 7 × 7 × 30         | /                       | /    | 7 × 7 × 30         | 1470
C3    | 7 × 7 × 30         | 30, 4 × 4               | 1    | 4 × 4 × 30         | 480
P2    | 4 × 4 × 30         | 2 × 2                   | 2    | 2 × 2 × 30         | 120
N3    | 2 × 2 × 30         | /                       | /    | 2 × 2 × 30         | 120
C4    | 2 × 2 × 30         | 30, 1 × 1               | 1    | 2 × 2 × 30         | 120
N4    | 2 × 2 × 30         | /                       | /    | 2 × 2 × 30         | 120
C5    | 2 × 2 × 30         | 9, 2 × 2                | 1    | 1 × 1 × 9          | 9

The CNN training process consists of two steps: forward calculation and back propagation. The former calculates the classification result based on the current network parameters, and the latter updates the network parameters. Here, we used a stochastic gradient descent algorithm to update the CNN with a batch size of 100, a learning rate of 0.1, and 1000 iterations. The CNN training was implemented with the MatConvNet toolbox [55].
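For reference, a minimal training-loop sketch under these settings (batch size 100, learning rate 0.1, 1000 iterations) is given below in PyTorch, reusing the StagedCNN sketch from Section 2 and random stand-in data; the original experiments used MatConvNet, so this is only an illustrative translation.

```python
import torch
import torch.nn as nn

# Stand-in data; in practice these are the 17 x 17 x 15 training patches and labels.
X_train = torch.rand(800, 15, 17, 17)
y_train = torch.randint(0, 9, (800,))

model = StagedCNN(in_bands=15, n_maps=30, n_classes=9)    # sketch defined earlier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # learning rate 0.1
criterion = nn.CrossEntropyLoss()                         # SoftMax loss of Eq. (4)

N = X_train.shape[0]
for it in range(1000):                                    # 1000 iterations
    idx = torch.randint(0, N, (100,))                     # random mini-batch of 100
    logits = model(X_train[idx])[-1].flatten(1)           # last-stage 1x1 maps as scores
    loss = criterion(logits, y_train[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```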

The parameter settings of MSCBL-BD on the three datasets are shown in Table 2, where $\lambda_1 \sim \lambda_5$ were tuned to achieve the best performance via fivefold cross-validation over {0.01, 0.1, 1, 5, 10}.

Comparative Experiments
In order to verify the validity of the proposed MSCBL-BD, six benchmark methods and two special cases of MSCBL-BD (CBL and MSCBL) were selected for comparative experiments: (1) a traditional classification method: SVM [8], whose optimal hyperparameters were selected through fivefold cross-validation; (2) deep learning methods: SS-DBN [34], CNN-PPF [47], and CNN [38]. The parameters of SS-DBN and CNN-PPF were set following the corresponding articles, and the configuration of CNN is shown in Table 1.

Discussion
From the experimental results, we can see the following: (1) On all three HSI datasets, in terms of the three evaluation indexes (AA, OA, and the Kappa coefficient), MSCBL-BD, MSCBL, and CBL achieved higher performance than the other methods. Taking the Indian Pines dataset as an example, CBL outperformed BLS by 11.52% in terms of OA. The main reasons are two-fold: the features learned by BLS are linear sparse features, and only the spectral information is utilized; therefore, the features of the HSI cannot be fully represented. In addition, CBL, MSCBL, and MSCBL-BD outperformed CNN by 1.68%, 2.7%, and 3.91% in OA, respectively, which verifies the superiority of width expansion. (2) On the three HSI datasets, MSCBL outperformed CBL by 1.02%, 0.24%, and 0.45%, respectively, because MSCBL utilizes the MSCFs and performs feature expansion for each stage, which makes the features learned by MSCBL more discriminative. Furthermore, MSCBL-BD outperformed MSCBL by 1.21%, 0.15%, and 0.6% on the three datasets, respectively, because the MSCBFs are mapped through a block-diagonal matrix so that each obtained feature is linearly represented only by the features of its own stage, thereby enhancing the linear independence of the MSCBFs of each stage. This helps to learn a more accurate HSI classification model, which in turn improves the final classification accuracy. (3) Both SS-DBN and CNN are deep spectral-spatial classification methods. Compared with SS-DBN, CNN yielded higher OAs on all HSI datasets. This is because SS-DBN stacks the spectral vector and the vectorized neighboring region representation as the input of the DBN, which not only increases the dimension of the input data but also destroys the inherent structure of the data. In addition, with a small number of labeled samples, SS-DBN, as a kind of fully connected neural network, may overfit during training. Figures 4-6 show the classification maps achieved by the different methods. It can be seen that the classification maps obtained by MSCBL-BD on all three HSI datasets are smoother and preserve more detail. Taking the Indian Pines dataset as an example, the benchmark methods misclassified more Soybean-clean, Soybean-notill, and Corn-notill pixels as Soybean-mintill, more Grass-trees as Woods, and more Corn-notill as Corn-mintill.

Conclusions
A novel BLS-based method (MSCBL-BD) for HSI classification was proposed in this paper. By replacing the linear sparse features with convolutional features, the nonlinear mapping ability of the model is improved and the complex spectral-spatial features of HSIs can be better represented; therefore, CBL achieves higher classification accuracy than BLS. Furthermore, in order to reduce the information loss occurring in the multi-layer mapping process of the CNN, the MSCBFs are utilized. Moreover, in order to train a more accurate HSI classification model, the block-diagonal constraint is introduced into MSCBL: the MSCBFs are mapped by a block-diagonal matrix, and the obtained features are linearly represented only by those of their own stages. Therefore, the linear independence of the MSCBFs is enhanced and the classification accuracy is ultimately improved. The experimental results demonstrated the superiority of MSCBL-BD compared to several competitive methods. In future work, we will improve the robustness of our method so that satisfactory performance can be achieved when bands are missing.
Author Contributions: All of the authors provided significant contributions to the work. Y.K. and X.W. conceived and designed the experiments; Y.K. and X.W. performed the experiments; Y.K. and Y.C. analyzed the data; Y.K. and Y.C. wrote the original paper; C.L.P.C. reviewed and edited the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China under grants 62006232, 61976215, and 61772532. This research was also funded by the Natural Science Foundation of Jiangsu Province under grant BK20200632.

Data Availability Statement:
Publicly available datasets were analyzed in this study. These data can be found at: http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes (accessed on 15 July 2018).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ADMM      Alternating direction method of multipliers
BASS-Net  Band-adaptive spectral-spatial feature learning neural network
BF        Broad feature
BLS       Broad learning system