A Novel Hyperspectral Image Classification Pattern Using Random Patches Convolution and Local Covariance

Abstract: Today, more and more deep learning frameworks are being applied to hyperspectral image classification tasks and have achieved great results. However, such approaches are still hampered by long training times. Traditional spectral-spatial hyperspectral image classification only utilizes spectral features at the pixel level, without considering the correlation between local spectral signatures. Our article tests a novel hyperspectral image classification pattern that uses random patches convolution and local covariance (RPCC). RPCC is an effective two-branch method that, on the one hand, obtains a specified number of convolution kernels from the image space through a random strategy and, on the other hand, constructs a covariance matrix between different spectral bands by clustering local neighboring pixels. In our method, the spatial features come from multi-scale and multi-level convolutional layers, while the spectral features represent the correlations between different bands. We use a support vector machine with the fused spectral and spatial features to obtain the classification results. In our experiments, RPCC is compared with five excellent methods on three public data sets. Quantitative and qualitative evaluation indicators show that the accuracy of our RPCC method can match or exceed the current state-of-the-art methods.


Introduction
There are more and more remote sensing applications based on hyperspectral images (HSIs). The latest hyperspectral sensors can obtain data in hundreds of spectral channels at high spatial resolution [1]. The rich spectral-spatial information of HSIs is widely used for scene recognition [2], regional variation analysis of urban areas [3], and feature classification [4][5][6]. The classification of ground objects in HSIs is widely applied in precision agriculture [7], urban mapping [8], and environmental monitoring [9]. As such, classification has attracted much attention, and a wide variety of methods have been developed. HSI classification uses a small number of manual tags to indicate the category label of each pixel [10]. Like other classification applications, HSI classification tasks involve significant challenges, such as the well-known Hughes phenomenon [11]: if the labeled data are very limited, more spectral bands will reduce the accuracy of classification [12].
In order to overcome this problem, a large number of studies [13,14] have proposed many effective methods, including dimensionality reduction (DR) [15,16] and band selection [17,18]. Hyperspectral dimensionality reduction has both supervised and unsupervised methods [19]; the difference lies in whether annotation information is used, for example, in locally linear embedding [20], principal component analysis (PCA), and the maximum noise fraction (MNF) [21]. Jiang et al. [22]
The main contributions of our article are as follows:
1. For the first time, we introduce an RPCC method combining random patches convolution and covariance matrix representation into hyperspectral image classification. RPCC has a simple structure, and the experiments show that its performance can match the state of the art.
2. Our RPCC is able to extract highly discriminative features, combining multi-scale, multi-layer convolution information with the correlation between different spectral bands, without any training.
3. We verify that the randomness and locality in our method act as a kind of regularization, with great potential to overcome the salt-and-pepper noise and over-smoothing phenomena in HSI processing.
Our article is organized as follows. In Section 2, we introduce the proposed RPCC method in detail. Section 3 presents the relevant experiments, whose results show the excellent performance of RPCC. Sections 4 and 5 provide a discussion and conclusions, respectively.
Figure 1 shows the framework of RPCC, which has two parallel branches. One branch obtains multi-scale convolution features via the random patches convolution [49] algorithm; the other calculates local covariance matrices to obtain the spectral correlation information of the whole image. Then, the obtained multi-scale spatial information and spectral covariance matrices are merged into spectral-spatial features. Finally, the fused spectral-spatial features are fed into an SVM for HSI classification.


Maximum Noise Fraction-Based Dimensionality Reduction
Each HSI band records the sunlight in a different spectral range, and the scales of these bands vary. Therefore, the principal components (PCs) generated by a PCA transformation computed from the covariance matrix do not represent the characteristics of the original data well. The research of Eklundh and Singh [50] indicates that PCs calculated using the correlation matrix consistently yield significant improvements in SNR compared to those calculated using the covariance matrix; their results show that PCA using the correlation matrix is the desirable mode of analysis in remote sensing applications. In most remotely sensed data, the signal at any point in the image is strongly correlated with the signal at neighboring pixels, while the noise shows only weak spatial correlation. Accordingly, Switzer [51] developed min/max autocorrelation factors (MAF), which in effect estimate the noise covariance matrix for salt-and-pepper noise as well as other forms of signal degradation such as striping. In response to the inadequacy of the PCA transformation, Green et al. [21] proposed the maximum noise fraction (MNF) transformation, which uses MAF to estimate the noise covariance matrix for the more complicated cases that often exist in remotely sensed multispectral scanner data and provides an optimal ordering of images in terms of image quality. The challenge of the MNF transform is obtaining the noise covariance matrix. Nielsen and Larsen [52] give four different ways of estimating it, all of which rely on the data being spatially correlated. One way is to compute the covariance of first-order differences, assuming the noise is temporally uncorrelated; in this case, the MNF transform is identical to the min/max autocorrelation factors transform [51,53]. Fang et al. [29] reported comparative experiments on HSI reduction using MNF, PCA, and independent component analysis (ICA), and found that MNF offers great advantages for the calculation of spectral relationships. Therefore, in the present study we use MAF to estimate the noise covariance matrix and MNF as an HSI preprocessing step to extract the spectral features.
We define the input HSI data as I ∈ R^{m×n×z}, where m, n, and z are the number of rows, columns, and spectral bands, respectively. Assuming that I is separated into the noise I_N and the signal I_S, we have

I = I_S + I_N. (1)

The transformation matrix V is obtained by maximizing the ratio of the signal covariance to the noise covariance:

V = argmax_V [V^T Cov(I_S) V / V^T Cov(I_N) V], (2)

where Cov(I_S) and Cov(I_N) are the covariance matrices of the signal and the noise, respectively. The optimization problem of Equation (2) is equivalent to

V = argmax_V [V^T Cov(I) V / V^T Cov(I_N) V], (3)

where Cov(I) represents the overall covariance of the HSI data, Cov(I) = Cov(I_N) + Cov(I_S); Cov(I_N) is estimated by MAF [51]. According to the Lagrange multiplier method, the optimal solution to Equation (3) satisfies the generalized eigenvalue problem

Cov(I_N)^{-1} Cov(I) v_i = λ_i v_i. (4)

The eigenvalues are then arranged from large to small, and the eigenvectors corresponding to the first d eigenvalues are used as the transformation matrix:

V = [v_1, v_2, ..., v_d], λ_1 ≥ λ_2 ≥ ... ≥ λ_d. (5)

Therefore, the number of MNF principal components is d, and the output data after the MNF transformation are

I_mnf = I V ∈ R^{m×n×d}. (6)
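As a concrete illustration, the MNF steps above can be sketched in NumPy. The function name and the first-order-difference noise estimator are our choices for illustration (one of the estimators discussed by Nielsen and Larsen), not the authors' exact implementation, which was written in Matlab:

```python
import numpy as np

def mnf_transform(img, d, noise_cov=None):
    """Sketch of an MNF transform for an (m, n, z) image cube.

    The noise covariance is estimated from first-order horizontal
    differences; the SNR maximization then reduces to the generalized
    eigenvalue problem Cov(I_N)^{-1} Cov(I) v = lambda v of Eq. (4).
    """
    m, n, z = img.shape
    X = img.reshape(-1, z).astype(float)
    X = X - X.mean(axis=0)

    cov_total = np.cov(X, rowvar=False)
    if noise_cov is None:
        # MAF-style noise estimate: covariance of neighbor differences.
        diff = (img[:, 1:, :] - img[:, :-1, :]).reshape(-1, z)
        noise_cov = np.cov(diff, rowvar=False) / 2.0

    # Solve Cov(I_N)^{-1} Cov(I) v = lambda v; sort eigenvalues descending
    # so the components with the best SNR come first.
    evals, evecs = np.linalg.eig(np.linalg.solve(noise_cov, cov_total))
    order = np.argsort(evals.real)[::-1]
    V = evecs[:, order[:d]].real          # z x d transformation matrix
    return (X @ V).reshape(m, n, d)
```

The generalized eigenproblem is solved here by folding the noise covariance into one non-symmetric matrix; a numerically nicer route would whiten by the noise covariance first, but the sketch keeps the one-to-one correspondence with Equations (3) and (4).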

Spatial Feature Extraction with Random Patches Convolution
For spatial feature extraction, the data I_mnf ∈ R^{m×n×d} were obtained by the MNF transformation. Then, I_mnf was whitened to reduce the correlation between different bands, yielding I_whiten ∈ R^{m×n×d}.
Inspired by the RPNet [49], we used multi-layer convolution, which not only has a simple architecture and requires no training but also can obtain multi-scale spatial information by extracting shallow and deep features. That can be seen in Figure 2.


We generated k random pixel locations using random functions from the data I_whiten. Then we obtained k random patches by taking an image window of size w × w × d around each pixel. For pixels at the edge of the image, the blank area is filled by mirroring the adjacent pixels.
Taking the Pavia University data set as an example, the original Pavia University data are subjected to the MNF transformation (PCs = 20), a whitening operation, and 20-pixel edge mirroring to obtain the data L_1 ∈ R^{m×n×20}. In Section 3.2, we specifically analyze the effect of the number of MNF PCs, the size of the patches, and the number of patches on classification accuracy. Figure 3a shows the first three-channel color composite image of L_1 with 20 randomly generated 40 × 40 × 20 red rectangular windows; each red window has a yellow number, from 1 to 20. Figure 3b gives the corresponding random patches extracted from L_1; each patch has a number corresponding to the red window on the left. All of these random patches are selected as convolution kernels. After obtaining the first-layer convolution features, the above process is repeated to obtain the second-layer convolution features. The specific process is as follows.
Each random patch acts as a convolution kernel, so the convolution between the random patches and the whitened data can be defined as follows:

S_i = Σ_{j=1}^{d} I_whiten(j) * P_i(j), i = 1, 2, ..., k, (7)

where S_i is the i-th feature map, I_whiten(j) is the j-th dimension of the whitened data, * is the convolution operator, and P_i(j) is the j-th dimension of the i-th random patch. The stride of the 2D convolution is 1 and the vacant area is filled by mirroring the adjacent pixels.
Let F_l and F_{l-1} be the l-th and (l-1)-th layer feature sets, respectively. At the beginning, the first-layer feature set is F_1 = {S_1, S_2, ..., S_k}. We performed MNF dimensionality reduction on F_1 to generate new random patches; then, by convolving F_1 with these random patches, we obtained the second-layer feature set F_2. In a similar manner, we obtained the l-th-layer feature set. Finally, we obtained all the features from the different layers, F_spatial ∈ R^{m×n×lk}:

F_spatial = {F_1, F_2, ..., F_l}. (8)
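A single random-patches convolution layer of the kind described above can be sketched as follows. The function name and interface are ours; the loop-based convolution is written for clarity rather than speed (implemented as correlation, which is equivalent up to a kernel flip):

```python
import numpy as np

def random_patches_convolution(data, k, w, rng):
    """Sketch of one random-patches convolution layer.

    data : (m, n, d) whitened cube; each of the k random w x w x d
    patches acts as a convolution kernel, giving one feature map per
    patch, as in Eq. (7).
    """
    m, n, d = data.shape
    r = w // 2
    # Mirror-pad so kernels centred on edge pixels see valid data.
    padded = np.pad(data, ((r, r), (r, r), (0, 0)), mode="reflect")

    # Draw k random centre pixels and cut out their w x w x d patches.
    rows = rng.integers(0, m, size=k)
    cols = rng.integers(0, n, size=k)
    patches = [padded[i:i + w, j:j + w, :] for i, j in zip(rows, cols)]

    # Slide each patch over the image with stride 1 and sum the
    # elementwise product over all w x w x d entries.
    feats = np.empty((m, n, k))
    for p, patch in enumerate(patches):
        for i in range(m):
            for j in range(n):
                window = padded[i:i + w, j:j + w, :]
                feats[i, j, p] = np.sum(window * patch)
    return feats
```

Stacking the outputs of successive layers (after re-reducing each layer with MNF, as the text describes) would yield the multi-scale feature set F_spatial.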


Spectral Feature Extraction
Local covariance can indicate the degree of correlation between features; references [26] and [27] used local covariance for facial recognition and image classification, and Fang et al. [29] used local spectral covariance for hyperspectral data classification. Compared with computing the spectral covariance over the entire image, the local spectral covariance can be obtained more accurately, with more feature information and less computational complexity. As shown in Figure 4, for every pixel in the data I_mnf we obtained a local spectral cube using an image window of size w × w × d, giving N local spectral cubes in total, where N = m × n is the number of HSI pixels. Furthermore, we used the k-nearest-neighbor method to select, within each local spectral cube, the T most relevant spectral vectors, where b_t^i is the t-th spectral vector in the local cube region B_i:

B_i = {b_1^i, b_2^i, ..., b_T^i}. (9)

For each local spectral cube, we constructed a corresponding covariance feature matrix, expressed as follows:

C_i = (1 / (T − 1)) Σ_{t=1}^{T} (b_t^i − μ)(b_t^i − μ)^T, (10)

where μ is the mean vector of B_i. Finally, we obtained all of the features from all spectral cubes:

F_spectral = {C_1, C_2, ..., C_N} ∈ R^{d×d×N}. (11)

Referring to references [26,54,55], we needed to make C_i strictly positive definite. Therefore, we let C_i* = C_i + εE, where E is the unit matrix and ε is a small positive scalar derived from trace(C_i). It is worth noting that, since F_spectral lies on a manifold [26], the covariance matrices are converted to Euclidean space using the method in [56]. Given two covariance matrices C_1 and C_2, the Log-Euclidean distance (LED) is defined as follows:

LED(C_1, C_2) = ||logm(C_1) − logm(C_2)||_F, (12)

where ||·||_F and logm are the Frobenius norm and the matrix logarithm operator, respectively.
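The covariance descriptor and the Log-Euclidean distance can be sketched as follows. The function names and the regularization constant eps_scale are our assumptions for illustration, not the paper's exact values; for symmetric positive-definite matrices, logm can be computed from the eigendecomposition:

```python
import numpy as np

def local_covariance_feature(cube_vectors, eps_scale=1e-3):
    """Covariance descriptor for one local cube, as in Eq. (10).

    cube_vectors : (T, d) array, the T most relevant spectral vectors
    selected by k-NN inside a local window. eps_scale is an assumed
    regularization constant making the result strictly positive definite
    (C* = C + eps * E).
    """
    mu = cube_vectors.mean(axis=0)
    centered = cube_vectors - mu
    C = centered.T @ centered / (cube_vectors.shape[0] - 1)
    return C + eps_scale * np.trace(C) * np.eye(C.shape[0])

def _logm_spd(C):
    """Matrix logarithm of a symmetric positive-definite matrix via
    its eigendecomposition: U diag(log(lambda)) U^T."""
    evals, evecs = np.linalg.eigh(C)
    return (evecs * np.log(evals)) @ evecs.T

def log_euclidean_distance(C1, C2):
    """Log-Euclidean distance ||logm(C1) - logm(C2)||_F of Eq. (12);
    logm maps the SPD matrices into a flat (Euclidean) space."""
    return np.linalg.norm(_logm_spd(C1) - _logm_spd(C2), ord="fro")
```

In the full method, one descriptor C_i is computed per pixel and the log-mapped matrices are what enter the fused feature vector.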


Classification Based on Spectral-Spatial Features
We transformed the spectral-spatial features obtained by the above methods to the same dimension. A reshape operator turns F_spatial ∈ R^{m×n×lk} into F_spatial ∈ R^{N×lk} and F_spectral ∈ R^{d×d×N} into F_spectral ∈ R^{N×d^2}. Then, the spectral-spatial feature vectors are

F_spectral-spatial = [F_spatial, F_spectral]. (13)

It is worth noting that these features must be normalized, the formula for which is as follows:

F_spectral-spatial^norm = (F_spectral-spatial − mean(F_spectral-spatial)) / var(F_spectral-spatial), (14)

where var(F_spectral-spatial) and mean(F_spectral-spatial) are the variance and mean of F_spectral-spatial, and F_spectral-spatial^norm is the normalized spectral-spatial feature. The fused spectral-spatial features F_spectral-spatial^norm are fed into the SVM classifier.
The spectral-spatial features extracted by our method combine the shallow and deep convolution features of the spatial domain, which means that the method can better characterize multi-scale feature information in hyperspectral remote sensing images. The method also combines spectral information from different local spectral cubes of the spectral domain. It can obtain spectral spatial information simply and efficiently.
Lastly, unlike the RPNet proposed in reference [49], our RPCC does not fuse the raw HSI data into the SVM, and there is no nonlinear activation operation. We used the same method as in references [29,54] to approximate the matrices in Euclidean space. The source code will be released soon (https://github.com/whuyang/RPCC).
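The reshape, concatenation, and normalization steps of Equations (13) and (14) can be sketched as follows; the function name is ours, and the division by the variance (rather than the standard deviation) follows Equation (14) as stated in the text:

```python
import numpy as np

def fuse_and_normalize(f_spatial, f_spectral):
    """Sketch of the spectral-spatial fusion step.

    f_spatial  : (m, n, l*k) stacked convolution features
    f_spectral : (d, d, N) covariance features, with N = m * n
    """
    m, n, lk = f_spatial.shape
    d = f_spectral.shape[0]
    N = m * n

    spatial = f_spatial.reshape(N, lk)          # N x lk
    spectral = f_spectral.reshape(d * d, N).T   # N x d^2
    fused = np.concatenate([spatial, spectral], axis=1)   # Eq. (13)

    # Eq. (14): subtract the per-feature mean, divide by the variance
    # (a small constant guards against zero-variance columns).
    fused = (fused - fused.mean(axis=0)) / (fused.var(axis=0) + 1e-12)
    return fused
```

The resulting N × (lk + d^2) matrix is what would be handed, row by row, to the SVM classifier.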

Experimental Setup and Analysis
In our experiments, to overcome the category imbalance problem, instead of splitting a data set by an average percentage of each class, we specified the number of labeled samples for each annotated class of each data set. The training set is generated randomly from the ground reference data, and the remaining reference samples constitute the testing sets; all are listed in Tables 1-3. We used the LibSVM [57] implementation for the SVM classification with five-fold cross validation; the regularization parameter ranges from 2^−8 to 2^10. To reduce the deviation caused by random sampling, all the experiments in this paper were randomly repeated 30 times using the same numbers of training and test samples, and both the average value and the standard deviation are reported. Moreover, the accuracy of each category, OA, and the kappa coefficient (Kappa) were chosen as the criteria for quantitative assessment. All algorithms were programmed in Matlab 2017a (MathWorks, Natick, MA, USA) and tested on a machine with an Intel E5-2667v2 CPU, 128 GB of RAM, and a GTX 1080Ti GPU.
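The per-class sampling scheme described above can be sketched as follows (function name and interface are ours):

```python
import numpy as np

def per_class_split(labels, n_train, rng):
    """Sketch of the paper's sampling scheme: draw a fixed number of
    training pixels per class instead of a fixed percentage, so that
    rare classes are not under-represented.

    labels  : 1-D int array of ground-truth class ids (0 = unlabeled)
    n_train : dict mapping class id -> number of training samples
    """
    train_idx, test_idx = [], []
    for cls, count in n_train.items():
        idx = np.flatnonzero(labels == cls)
        idx = rng.permutation(idx)           # shuffle within the class
        train_idx.extend(idx[:count])        # fixed count for training
        test_idx.extend(idx[count:])         # the rest for testing
    return np.array(train_idx), np.array(test_idx)
```

Repeating this split 30 times with fresh random states and averaging the resulting accuracies reproduces the evaluation protocol described in the text.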


Data Sets
Here, we conducted experiments and evaluated our approach on three common hyperspectral public data sets.
The Indian Pine HSI data set was used for the first experiment, as detailed in reference [58]. The data set has 145 × 145 pixels and 200 spectral bands. The wavelength ranges from 0.4 to 2.5 μm and the spatial resolution is 20 m. The labeled data comprise sixteen classes. Figure 5 gives the false color composite image (bands 36, 17, and 11 for R, G, and B, respectively) and the ground-truth color map of the data set. Table 1 gives the specific training and test sample information for the experiment.

The Pavia University HSI data set was used for the second experiment, as detailed in reference [58]. The data set has 610 × 340 pixels and 103 spectral bands. The wavelength ranges from 0.43 to 0.86 μm and the spatial resolution is 1.3 m. The labeled data comprise nine classes. Figure 6 gives the false color composite image (bands 10, 27, and 46 for R, G, and B, respectively) and the ground-truth color map of the data set. Table 2 gives the specific training and test sample information for the experiment.
The Kennedy Space Center (KSC) HSI data set was used for the third experiment, as detailed in [58]. The data set has 512 × 614 pixels and 176 spectral bands. The wavelength ranges from 0.4 to 2.5 μm and the spatial resolution is 1.8 m. The labeled data comprise thirteen classes. Figure 7 gives the false color composite image (bands 10, 34, and 19 for R, G, and B, respectively) and the ground-truth color map of the data set. Table 3 gives the specific training and test sample information for the experiment.


Parameter Analysis
There are several important parameters in the proposed RPCC: P is the number of MNF PCs, W is the size of the patches, K is the number of pixels in a local cube, N is the number of patches, and D is the number of convolutional layers. According to our algorithm design, N should be greater than or equal to P; considering the efficiency of the algorithm, we set N equal to P.
First, we designed an experiment and plotted the relationship between P and classification accuracy in our RPCC, with parameters W, K, N, and D set to 25, 160, 45, and 5, respectively. As shown in Figure 8, when P increases from 5 to 20, the classification accuracy on the Indian Pine and KSC data sets increases significantly, while the increase for Pavia University is not obvious. When P is larger than 20, the overall accuracy (OA) of all three data sets decreases, and the OA of the Pavia University data set drops significantly. As P increases, the experimental time for all three data sets also grows considerably. Considering the balance between accuracy and time consumption, we set P = 20.

Second, we analyzed the sensitivity of the classification accuracy to parameter W, with parameters P, K, N, and D set to 20, 160, 20, and 5, respectively. W was varied from 15 to 31 with a step of 2. Figure 9 shows that as W increases from 15 to 21, the classification accuracy of the Indian Pine data set gradually increases, and the classification accuracy of KSC and University of Pavia increases slightly. When W exceeds 21, the classification accuracy of the Indian Pine data set begins to decrease, while that of KSC and University of Pavia decreases slightly and then stabilizes. We therefore chose W = 21.

Third, we evaluated the classification accuracy of the RPCC with different values of K and D. In this experiment, K was varied from 40 to 360 with a step of 40, and D was varied from 1 to 9 with a step of 1. The parameters P, W, and N were 20, 21, and 20, respectively. Figure 10 shows that the OA for all three HSI data sets is maintained at a high level. When K and D increase, the OA of Indian Pine, University of Pavia, and KSC gradually increases. As can be seen in Figure 10a,c, the highest OA of the Indian Pine and KSC data sets was achieved at K = 160 and D = 5. The highest classification accuracy of the Pavia University data set was obtained at D = 5 and K = 160, 200, or 240. Taking into account the computational efficiency and the previous work of reference [29], we set K = 160 and D = 5. Table 4 summarizes all the parameters in our experiments.

Table 4. Parameters for the proposed random patches convolution and local covariance-based classification (RPCC) method.
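The K and D selection described above amounts to a simple grid search over the parameter space. The sketch below is not the authors' code: `evaluate` stands in for training and scoring the full RPCC pipeline, and the toy accuracy surface (peaking at K = 160, D = 5) is a placeholder, not measured data.

```python
import itertools

def select_parameters(evaluate, k_values, d_values):
    """Return the (K, D) pair with the highest overall accuracy (OA)."""
    best_oa, best_params = -1.0, None
    for k, d in itertools.product(k_values, d_values):
        oa = evaluate(k, d)
        if oa > best_oa:
            best_oa, best_params = oa, (k, d)
    return best_params, best_oa

# Toy accuracy surface standing in for a real train/evaluate run:
toy = lambda k, d: 1.0 - abs(k - 160) / 400 - abs(d - 5) / 20
params, oa = select_parameters(toy, range(40, 361, 40), range(1, 10))
```

In practice `evaluate` would retrain the classifier for each (K, D) pair, so the grid ranges above (K in 40..360 step 40, D in 1..9) match the sweep reported in Figure 10.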

Classification Results
Our method was compared with RAW [59], MNF [21], and five state-of-the-art HSI classification methods, SMLR-SpTV [60], Gabor-based [33], EMAP [35], LCMR [29], and RPNet [49], and a detailed analysis was carried out using the quantitative and qualitative experimental results. The last five methods were developed in recent years [61,62] and are closely related to ours. SMLR-SpTV uses a spatially adaptive hidden Markov field and spectral fidelity to obtain spectral-spatial information for HSI classification. The Gabor-based method uses the classical Gabor filter to obtain effective spectral features. EMAP extracts the geometric features of HSIs and forms a feature vector space describing the structural attribute information of HSIs, making it an effective spectral-spatial classification method. LCMR integrates spatial context information and spectral correlation information for HSI classification by means of a local covariance matrix representation. RPNet has a new multi-layer convolution structure that can quickly obtain high-accuracy classification results. The SMLR-SpTV method uses 10 Monte Carlo iterations. The EMAP attribute extraction thresholds range from 2.5% to 10% of the mean of each feature, the standard deviation attribute step is 2.5%, and the area attribute thresholds are 200, 500, and 1000, respectively. The parameters of the Gabor-based, LCMR, and RPNet methods are the same as those used in references [29,33,49]. Tables 5-7 show the quantitative results for all methods, and Figures 11-13 give the corresponding color classification maps.
The quantitative results of the aforementioned state-of-the-art methods in the first experiment are shown in Table 5, and Figure 11 shows the corresponding classification maps. Our RPCC approach clearly achieves the highest OA and Kappa as well as the best classification map. Table 5 also shows that, in most categories, RPCC has the best class accuracy; among the 16 categories, only Corn, Soybean-clean, Wheat, and Buildings-Grass-Trees-Drives are classified less accurately than by SMLR-SpTV and LCMR. Considering all categories, the proposed RPCC achieves an advantage of 2%-22% in OA and Kappa. The RAW and MNF methods use only spectral information, and their classification maps in Figure 11 show more noise and misclassification, for example for the Soybeans-min class in the middle of the map. Clearly, spatial information is beneficial for improving classification. The remaining six methods all consider spectral-spatial features. Comparing the classification maps obtained using the SMLR-SpTV, Gabor-based, and EMAP methods, it can be seen that SMLR-SpTV and Gabor-based produce smoother maps. In the integrated SMLR-SpTV method, the MRF improves classification performance and allows spatially smooth classification. The LCMR and RPNet methods both perform well; however, the proposed RPCC obtains better results by combining their advantages. RPCC not only uses spectral-spatial features to reduce misclassification, but also uses local k-NN clustering, covariance representation, and random patches to make the obtained spectral-spatial features more discriminative. RPCC therefore has a simple and efficient feature extraction strategy that produces competitive experimental results.
For the second experiment, Figure 12 and Table 6 show the classification results and classification accuracy, respectively, of the Pavia University data set. On the whole, the five classification methods based on spectral-spatial features achieve similar classification accuracy. Our method still has the highest OA and Kappa. Figure 12 also illustrates that the classification maps of RAW, MNF, and EMAP are quite noisy, especially in the Bare Soil and Meadows regions. The classification maps of SMLR-SpTV and Gabor-based show overfitting. Additionally, the LCMR and RPNet methods fail to distinguish between the Bricks, Bare Soil, and Meadows categories. The proposed RPCC method therefore achieves good performance in both classification mapping and accuracy.

For the third experiment, Figure 13 and Table 7 show the classification results, false-color images, corresponding ground-truth maps, and classification accuracy, respectively, of the KSC data set. Our proposed RPCC method is 1%-10% more accurate than the other seven methods and has the highest accuracy in all categories.
Figure 13 shows that our RPCC method achieves better classification for the Water, Mud flats, Salt marsh, and Cattail marsh classes than the other methods. As with the Pavia University and Indian Pine data sets, the classification maps obtained using the RAW and MNF methods contain considerable noise. The SMLR-SpTV, Gabor-based, EMAP, and RPNet methods typically misclassify Water and Cattail marsh in some areas. Moreover, the LCMR method produces a large amount of misclassification for the CP/Oak category.

Discussion
As shown in Figures 11-13 and Tables 5-7, a comparison with the other seven methods on the Indian Pine, Pavia University, and KSC data sets shows that the proposed RPCC method obtains better visual effects and higher accuracy. This demonstrates the validity of the spectral-spatial feature extraction pattern in our method, for three reasons. First, we perform spectral clustering on each pixel's neighborhood region and then calculate the spectral covariance matrix of the extracted pixels, so that we obtain spectral correlation information for all regions of the entire image. Second, the random-patch convolution can extract both shallow and deep features, allowing multi-scale and multi-layer spatial features to be combined. Third, the randomness and localization in RPCC act as a kind of regularization that has great potential to overcome the salt-and-pepper noise and over-smoothing phenomena in HSI processing. Given the above classification results and quantitative evaluation, our method can serve as a novel and effective spectral-spatial classification framework.
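The first point, forming a band-by-band covariance matrix from spectrally similar neighbors around each pixel, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window extraction, the Euclidean spectral similarity, and the random test patch are simplifying assumptions.

```python
import numpy as np

def local_covariance(window, n_neighbors):
    """window: (W, W, B) patch; returns a (B, B) spectral covariance matrix
    over the n_neighbors pixels most similar to the center pixel."""
    w, _, b = window.shape
    pixels = window.reshape(-1, b)            # all pixels in the window
    center = pixels[pixels.shape[0] // 2]     # center pixel's spectrum
    dists = np.linalg.norm(pixels - center, axis=1)   # spectral similarity
    nearest = pixels[np.argsort(dists)[:n_neighbors]] # local k-NN cluster
    return np.cov(nearest, rowvar=False)      # (B, B) band covariance

# Toy 21x21 window with 8 bands, matching the W = 21 setting above:
patch = np.random.default_rng(0).normal(size=(21, 21, 8))
cov = local_covariance(patch, n_neighbors=20)
```

The resulting symmetric matrix encodes the correlations between bands in the pixel's local neighborhood, which is the spectral branch's feature in this framework.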
In the field of machine learning, it is difficult to achieve the desired performance with a single feature or a single model; an important remedy is fusion. Typical fusion strategies are early fusion and late fusion. Early fusion is feature-level fusion, which concatenates different features and feeds them into one model for training; the spectral-spatial classification methods in [39,40] and our RPCC are early fusion methods. Late fusion operates at the score level: multiple models are trained, each produces a prediction score, and the results of all models are fused to obtain the final prediction. Here, we designed two late fusion methods as variants of RPCC. One is RPCC-LPR, which shares the same process as RPCC except that it uses SVMs with linear, polynomial, and radial basis function kernels. The other is S-LPR-S-LPR, which uses SVMs with linear, polynomial, and radial basis function kernels to classify the spatial and spectral features separately. Both methods use majority voting with default parameters to obtain the final classification result. Table 8 shows the classification accuracy of the three methods. It can be seen that the two simple fusion strategies do not improve the classification accuracy. On the one hand, the two methods may require more complicated parameter tuning to achieve their best results. On the other hand, since the most suitable kernel function may differ between features, a multiple kernel learning (MKL) method may be better suited to spectral-spatial feature fusion: by adopting a different kernel for each feature, multiple kernels are formed for different parameters, and the weight of each kernel can then be trained to learn the best combination of kernel functions for classification. In short, our method is well suited as a simple baseline for spectral-spatial feature classification, but there is still much room for improvement.
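The RPCC-LPR variant described above, three SVM kernels combined by majority vote, can be sketched with scikit-learn's hard-voting ensemble. The synthetic data set and default kernel parameters here are illustrative stand-ins, not the paper's settings or features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC

# Stand-in features and labels in place of the extracted RPCC features:
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)

# Late fusion: one SVM per kernel, hard (majority) vote over predictions.
fusion = VotingClassifier(
    estimators=[("linear", SVC(kernel="linear")),
                ("poly", SVC(kernel="poly")),
                ("rbf", SVC(kernel="rbf"))],
    voting="hard")
fusion.fit(X[:200], y[:200])
oa = fusion.score(X[200:], y[200:])   # overall accuracy on held-out samples
```

With `voting="hard"` each pixel's label is the class predicted by at least two of the three SVMs, which matches the majority-vote rule described for the two variants.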
The computation time of an algorithm has a significant impact on various remote sensing applications. Table 9 summarizes the computational time required by the SMLR-SpTV, Gabor-based, EMAP, LCMR, RPNet, and RPCC methods. Based on the overall performance of each method in the three experiments, the SMLR-SpTV and Gabor-based methods are the slowest. This is because the SMLR-SpTV method trains samples in 10 Monte Carlo runs and the high-dimensional features extracted per pixel by the Gabor-based method reduce its efficiency; both are very time-consuming processes. The RPNet method is clearly the fastest. Compared with the LCMR and EMAP methods, our proposed RPCC runs faster on two data sets, since efficient random convolution and simplified covariance representation operations are adopted when extracting spatial-spectral features. Our RPCC method is more time-consuming than RPNet on all three data sets because RPCC is constrained by the construction of covariance features, a process that takes up two-thirds of the method's total runtime. Since our algorithm has a two-branch structure, the spectral and spatial features can be computed in parallel, which would further improve its efficiency. To summarize, the comparisons across the three experiments show that the RPCC method extracts highly discriminative features by combining multi-scale, multi-layer convolution information with the correlations between different spectral bands. RPCC can thus be a competitive and robust approach for hyperspectral image classification. In particular, our experiments show that randomness and local clustering are reliable techniques with great potential to overcome the salt-and-pepper noise and over-smoothing phenomena in HSI classification.
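The parallelization suggested above is straightforward because the two branches share no state. A minimal sketch, with placeholder branch bodies standing in for the actual random-patch convolution and local covariance computations:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def spatial_branch(cube):
    # Placeholder for the random-patch convolution feature extraction.
    return cube.mean(axis=2, keepdims=True)

def spectral_branch(cube):
    # Placeholder for the local covariance feature extraction.
    b = cube.shape[2]
    return np.cov(cube.reshape(-1, b), rowvar=False)

# Toy 30x30 scene with 10 bands; the two branches run concurrently
# and their outputs are fused afterwards (e.g. fed to the SVM).
cube = np.random.default_rng(1).normal(size=(30, 30, 10))
with ThreadPoolExecutor(max_workers=2) as pool:
    spatial_future = pool.submit(spatial_branch, cube)
    spectral_future = pool.submit(spectral_branch, cube)
    spatial, spectral = spatial_future.result(), spectral_future.result()
```

Since numpy releases the GIL in its numeric kernels, even a thread pool gives real overlap here; a process pool would be the alternative for pure-Python branch code.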

Conclusions
In this study, a new hyperspectral image classification pattern using random patch convolution and local covariance is proposed. RPCC is an effective two-branch method. First, it obtains a specified number of convolution kernels from the image space through a random strategy for extracting deep spatial features. Second, a covariance matrix is constructed between spectral bands by clustering local neighboring pixels in order to explore the correlation between different bands. Then the obtained multi-scale spatial information and spectral covariance matrix are merged into spectral-spatial features, which are fed into an SVM classifier for HSIs classification. Experiments comparing the performance of our model with those of five closely related spectral-spatial methods showed that our RPCC method can match or exceed current state-of-the-art methods.
However, considering that the RPCC is not fast enough, we plan to design an effective and efficient spectral-feature representation method. Furthermore, the framework of spectral-spatial feature extraction is not sufficiently coupled in our method, and we will therefore further integrate randomness and localization techniques, for example by introducing a deep spectral feature or superpixel methods.