SLIC Superpixel-Based l2,1-Norm Robust Principal Component Analysis for Hyperspectral Image Classification

Hyperspectral Images (HSIs) contain enriched information due to the presence of various bands, which have gained attention for the past few decades. However, explosive growth in HSIs’ scale and dimensions causes “Curse of dimensionality” and “Hughes phenomenon”. Dimensionality reduction has become an important means to overcome the “Curse of dimensionality”. In hyperspectral images, labeled samples are more difficult to collect because they require many labor and material resources. Semi-supervised dimensionality reduction is very important in mining high-dimensional data due to the lack of costly-labeled samples. The promotion of the supervised dimensionality reduction method to the semi-supervised method is mostly done by graph, which is a powerful tool for characterizing data relationships and manifold exploration. To take advantage of the spatial information of data, we put forward a novel graph construction method for semi-supervised learning, called SLIC Superpixel-based l2,1-norm Robust Principal Component Analysis (SURPCA2,1), which integrates superpixel segmentation method Simple Linear Iterative Clustering (SLIC) into Low-rank Decomposition. First, the SLIC algorithm is adopted to obtain the spatial homogeneous regions of HSI. Then, the l2,1-norm RPCA is exploited in each superpixel area, which captures the global information of homogeneous regions and preserves spectral subspace segmentation of HSIs very well. Therefore, we have explored the spatial and spectral information of hyperspectral image simultaneously by combining superpixel segmentation with RPCA. Finally, a semi-supervised dimensionality reduction framework based on SURPCA2,1 graph is used for feature extraction task. Extensive experiments on multiple HSIs showed that the proposed spectral-spatial SURPCA2,1 is always comparable to other compared graphs with few labeled samples.

The above mentioned methods suggest that the pixels in a hyperspectral image lie in a low-rank manifold. However, the spatial correlation among pixels is not revealed. For HSIs, the adjacent pixels' correlation is typically extremely high [41], which has potential low-rank attributes. Pixels in homogenous areas usually consist of similar materials, whose spectral properties are highly similar and can be taken as being approximately in the same subspace. By extracting the low-rank matrix of the data, the low-dimensional structure of each pixel is revealed and we can classify each pixel more accurately. Xu et al. put forward a low-rank decomposition spectral-spatial algorithm, which incorporates global and local correlation [42]. Fan et al. integrated superpixel segmentation (SS) into Low-rank Representation (LRR) and proposed a novel denoising method called SS-LRR [43]. A multi-scale superpixel based sparse representation (MSSR) for HSIs' classification is proposed to overcome the disadvantages of utilizing structural information [44]. Sun et al. presented a novel noise reduction method based on superpixel-based low-rank representation for hyperspectral image [45]. In [46], Mei et al. proposed a new unmixing method with superpixel segmentation and LRR based on RGBM. Tong et al. proposed multiscale union regions adaptive sparse representation (MURASR) by multiscale patches and superpixels [47]. A novel method, robust regularization block low-rank discriminant analysis, is proposed for HSIs' feature extraction [48].
Spatial information can play a very important role in mapping data. However, many graph construction methods do not make full use of data's spatial information. For example, in hyperspectral images, adjacent pixels in a homogenous area usually belong to the same category. Therefore, how to consider the features of the actual data globally to further improve the semi-supervised dimensionality reduction performance is one of the main research components of this paper. Considering the spatial correlative and spectral low-rank characteristics of pixels in HSIs, we integrate the Simple Linear Iterative Clustering (SLIC) segmentation method into low-rank decomposition. Figure 1 shows the formulation of the proposed SLIC Superpixel-based l 2,1 -norm Robust Principal Component Analysis (SURPCA 2,1 ) for hyperspectral image classification. As shown in Figure 1, we preprocess the hyperspectral image by the Image Fusion and Recursive Filtering feature (IFRF), which eradicates the noise and redundant information concurrently [12]. The proposed method divide the image into multiple homogeneous regions by the superpixel segmentation algorithm SLIC. The pixels in each homogeneous region may belong to the same ground object category. Therefore, we stack the pixels in the same homogeneous region into a matrix. Due to the low-rank property of the matrix, RPCA is used to recover the low-dimensional structure of all pixels in the homogeneous region. Then, we combine these low-rank matrices together to an integrated low-rank graph. In addition, we process the semi-supervised discriminant analysis for dimension reduction, which takes advantage of the labeled samples and the distribution of the whole samples. The k-Nearest Neighbor (kNN) algorithm is applied to handle the low-rank graph for the regularized graph of semi-supervised discriminant analysis. Finally, we implement the Nearest Neighbor classifier method. We summarize the main contributions of the paper in the following paragraphs.
• Inspired by robust principal component analysis and superpixel segmentation, we put forward a novel graph construction method, SLIC superpixel-based l 2,1 -norm robust principal component analysis. The superpixel-based l 2,1 -norm RPCA extracts the low-rank spectral structure of all pixels in each uniform region, respectively. • The simple linear iterative clustering addresses the spatial characteristics of hyperspectral images. Consequently, the SURPCA 2,1 graph model can classify pixels more accurately. • To investigate the performance of the SLIC superpixel-based l 2,1 -norm robust principal component analysis graph model, we conducted extensive experiments on several real multi-class hyperspectral images. We start with a scientific background in Section 2. Section 3 deciphers the HSIs classification with l 2,1 -norm robust principal component analysis. Section 4 performs comparative experiments on real-world HSIs to examine the performance of the proposed graph. We provide the discussion in Section 5. Conclusions are presented in Section 6.

Graph-Based Semi-Supervised Dimensionality Reduction Method
In the dimensionality reduction method, the promotion of supervised method to semi-supervised method is mostly done by graph-based methods. Maier et al. [49] showed that different graph structures obtain different results in the same clustering algorithm. A good graph construction method has great influence in graph-based semi-supervised learning. Therefore, how to construct a good graph sometimes seems to be more important than a good objective function [50].
Graph is composed by nodes and edges, which represents the structure of the entire dataset. The nodes represent the sample points, and the edges' weight represents the similarity between the nodes. Generally, the greater the weight is, the higher the similarity between the nodes will be. If the edge does not exist, the similarity between the two points will be zero. The similarity between nodes is usually measured by distance, such as Euclid, Mahalanobis, and Chebyshev distance [51]. Different graph construction methods have great impact on the performance of classification results. The graph construction process mainly includes the graph structure and the edges' weight function. Two commonly used graph structures are Fully Connected graphs and Nearest Neighbor graphs (k-nearest neighbor graph and ε-ball Graph).

•
Fully Connected graphs [21,52]: In the fully connected graph, all nodes are connected by edges whose weights are not zero. The fully connected graphs are easy to construct and have good performance for semi-supervised learning methods. However, the disadvantage of the fully connected graphs is that it requires processing all nodes, which would lead to high computational complexity. • k-nearest neighbor graph [21]: Each node in the k-nearest neighbor graph is only connected with k neighbor in a certain distance. The samples x i and x j are considered as neighbor if x i is among the k nearest neighbor of x j or x j is among the k-nearest neighbor of x i . • ε-ball graph [21]: In the ε-ball graph, the connections between data points occur in the neighborhood of radius ε. That is to say, if the distance between x i and x j exists d(x i , x j ) ≤ ε, ε ∈ R, ε > 0, there will be a neighbor relationship between these two nodes. Therefore, the connectivity of the graph is largely influenced by the parameter ε.
It is also necessary to assign weight for the connected edges W. The followings are commonly used weight functions: • Inverse of Euclidean distance [53], • 0-1 weighting [21], • Heat kernel weighting [21], kNN graph can make full use of the local information of adjacent nodes. Since kNN graph is sparse, it can solve the computational complexity and storage problem in the fully connected graph. The parameters k and ε depend on the data density, which is difficult to select. The choice of neighbor size parameter is the key to the effectiveness of the graph-based semi-supervised learning method. The Nearest Neighbor graph lacks global constraints of the data points, and its representation performance is greatly reduced when the data is seriously damaged [21].

Simple Linear Iterative Clustering
The superpixel is an image segmentation technique proposed by Xiaofeng Ren in 2003 [54]. It refers to the image regions with certain visual meanings composed of adjacent pixels which have similar physical characteristics such as texture, color, brightness, etc. [55]. The superpixel segmentation methods segment pixels and replace a large number pixels with a small number of superpixels, which greatly reduces the complexity of image post-processing. It has been widely used in computer vision applications such as image segmentation, pose estimation, target tracking, and target recognition. The boundary information is relatively obvious between the superpixels. Achanta et al. introduced a simple linear iterative clustering algorithm to efficiently produce superpixels that are compact and nearly uniform [56].
The superpixel algorithms are broadly classified into graph-based and gradient-ascent-based algorithms. In graph-based algorithms, each pixel is treated as a node in a graph, and edge weights between two nodes are set proportional to the similarity between the pixels. Superpixel segmentations are extracted by effectively minimizing a cost function defined on the graph [57]. Simple linear iterative clustering was proposed in 2010, and is simple to use and understand. Simple linear iterative clustering is based on the k-means clustering algorithm [58], which is done in the five-dimensional Labxy space, where Lab is the pixel color vector in CIELAB color space, which is widely considered as perceptually uniform for small color distances, and x, y is the pixel position [57]. Starting from an initial rough clustering, the clusters from the previous iteration are refined to obtain better segmentation in each iteration by gradient ascent method until convergence. Then, the distance metrics of the five-dimensional feature vectors are constructed. Finally, the pixels are clustered in the image locally [57,59]. The method has a high comprehensive evaluation index in terms of computational speed and superpixel shape, which achieves good segmentation results. It has two significant advantages compared with other algorithms. One is that it restricts the search space to proportionate the size of the superpixels, which can significantly reduce the number of distance calculations during the optimization process. The other is that the weighted distance combines color and spatial metrics while control the number and compactness of superpixels.
The SLIC method performs a k-means-based local clustering algorithm in a five-dimensional space distance measurement, which achieves compactness and regularity in superpixel shapes. L represents luminosity, and ranges from 0 (black) to 100 (white). The color-associated elements a and b represent the range of colors from magenta to green and yellow to blue, respectively. The Lab color model not only contains the entire color gamut in RGB and CMYK, but also colors they cannot present. The Lab space can express all colors perceived by the human eyes, and is perceptually uniform for small color distance. Instead of directly using the Euclidean distance in this five-dimensional space, SLIC introduces a new distance measure that considers superpixels' size. For an image with N pixels, the SLIC algorithm takes a desired number of equally-sized superpixels N/K as input. There is a superpixel center at each grid interval S = √ N/K. SLIC is easy in practice application because it only needs to set the unique superpixels' desired number K.
The clustering procedure begins with K initial cluster centers (seeds) C k = [l k , a k , b k , x k , y k ] T , which are sampled on regular grids with an S pixels interval. Each seed is moved to the lowest gradient position in its 3 × 3 neighborhood to avoid centering the superpixel at the edge position and reduce the chance of seeding with noisy pixels. We assume that the cluster center is located within a 2S × 2S area since the spatial extent of a superpixel is about S 2 . This strategy can accelerate the convergence process, as shown in Figure 2. The SLIC superpixel segmentation algorithm is a simple local k-means algorithm, whose search area is the 2S × 2S area nearest to each cluster center. Since each pixel is searched by multiple seed pixels, the cluster center of the pixel is the seed with the minimum value.  Let [l i , a i , b i , x i , y i ] T be the five-dimensional vector of a pixel. Calculate the distance between each pixel and the seeds. The distance between the pixel and seed C k is shown as follows: The variable l controls the superpixels' compactness. The larger the value l is, the more compact the cluster will be.
The above steps are repeated iteratively until convergence, which means that the cluster center of each pixel is no longer changing. l ranges from 1 to 20. Here, we chose l = 10, which roughly matches the empirically perceptually meaningful Lab distance and provides good balance between spatial proximity and color similarity.
After the iterative optimization, superpixel multi-connected conditions may occur. Some superpixels' sizes may be too small, a single superpixel may be segmented into a plurality of discontinuous superpixels, etc., which can be solved by enhancing the connectivity of the superpixels. The solution is to reassign discrete and small-size superpixels to their adjacent superpixels until all the pixels have been traversed.

Robust Principal Component Analysis
Decomposing the matrix into low-rank and sparse parts can separate the main components from outliers or noise, which is suitable for mining low-dimensional manifold structures in high-dimensional data. We stack the pixels in the same homogeneous region into a matrix X k ∈ R N ×n k , where n k is the pixel number in the kth homogeneous region. Since the matrix has low-rank property, a low-rank matrix recovery algorithm is employed to recover the low-dimensional structure of all pixels in the homogeneous region. After recovering the low-rank matrix, the low-rank matrices are combined together into an integrated low-rank graph. The low-rank matrix restored by low-rank representation is a square matrix, while the dimensionality of the low-rank matrix restored by robust principal component analysis is the same as the original X k ∈ R N ×n k . Since the number of pixels in each homogeneous region is different, robust principal component analysis is suitable for restoring the low-dimensional structures of all pixels in a homogeneous region here.
Principal Component Analysis is a fundamental operation in data analysis, which assumes that the data approximately lies in a low-dimensional linear subspace [16]. The success and popularity of PCA reveals the ubiquity of low-rank matrices. When the data are slightly damaged by small noise, it can be calculated stably and efficiently by singular value decomposition. However, noise and outliers limit PCA's performance and applicability in real scenarios.
Suppose we stack samples as column vectors of a matrix X ∈ R m×n . The problem of classical PCA is to seek the best estimate of Z that minimizes the difference between Z and X: where . F denotes the Frobenius norm. r is the upper-bound rank of the low-rank matrix Z. This problem can be efficiently solved via singular value decomposition (SVD) when the noise E is independent and identically distributed (i.i.d.) small Gaussian noise. A major disadvantage of classic PCA is its robustness to grossly corrupted or outlying observations [16]. In fact, even though only one element in the matrix changes, the obtained estimation matrix is far from the ground truth.
Recently, Wright et al. [37] proposed an idealized robust PCA model to extract low-rank structures from highly polluted data. In contrast to the traditional PCA, the proposed RPCA can exactly extract the low-rank matrix and the sparse error E (relative to Z). The robust principal component analysis model is expressed as follows: min where λ is to balance low-rank matrix Z and the error matrix E. We recover the pair (Z 0 , E 0 ) that were generated from data X. Unfortunately, Equation (8) is a very non-convex optimization problem with no valid solution. Replacing the rank function with the kernel norm, and replacing the l 0 -norm with the l 1 -norm, results in the following relaxing Equation (8) convex surrogate: The optimization in Equation (9) can be treated as a general convex optimization problem. Currently, the optimization algorithms of RPCA mainly contain Iterative Thresholding (IT), Accelerated Proximal Gradient (APG), and DUAL. However, the iterative thresholding algorithm proposed in [37] converges slowly. The APG algorithm is similar to the IT algorithm, but the number of iterations is significantly reduced. In addition, the DUAL algorithm does not require the complete singular value decomposition of the matrix, so it has better scalability than the APG algorithm. Recently, some researchers proposed Augmented Lagrangian Multiplier (ALM) algorithm, which has a faster speed than previous algorithms. The exact ALM (EALM) method turns out that it has a satisfactory Q-linear convergence speed, whereas the APG is theoretically only sub-linear. A slight improvement over the exact ALM leads to an inexact ALM (IALM) method, which converges as fast as the exact ALM. Therefore, the IALM method is applied to obtain the optimization here.

Methodology
In the above section, we discuss the simple linear iterative clustering and robust principle component analysis. Furthermore, labeled and unlabeled samples can be exploited simultaneously by depicting the underlying superpixels' low-rank subspace structure. Considering this, we have attempted to decipher our present work the superpixel segmentation and l 2,1 -norm pobust principal component analysis feature extraction model.

Superpixel Segmentation and l 2,1 -Norm Robust Principal Component Analysis
Hyperspectral images preserve low-rank properties, but the category composition corresponding to a hyperspectral image is very complicated in practical applications. The robust principal component analysis model supposes that the data lie in one low-rank subspace, which makes it difficult to characterize the structure of the data with multiple subspaces. Hence, we explore the robust principal component analysis via superpixel, which is usually defined as a homogeneous region. Here, we employ the SLIC superpixel segmentation method to segment the hyperspectral image into spatially homogeneous regions. Each superpixel is a unit with consistent visual perception, and the pixels in one superpixel are almost identical in features and belong to the same class. Initially, we extract the information of the HSIs by image fusion and recursive filtering [12].
Suppose R = (r 1 , r 2 , · · · , r D ) ∈ R M×D represent the original hyperspectral image, which has M pixels and D-dimensional bands. We segment the whole hyperspectral bands into multiple subsets spectrally. Each subset is composed of L contiguous bands. The number of subsets N = D/L , which represents the largest integer not greater than D/L. The ith (i ∈ (1, 2, · · · , N )) subset is shown as follows: Afterwards, the adjacent bands of each subset are fused by an image fusion method (i.e., the averaging method). For example, the ith fusion band, that is, the image fusion feature F i , is as follows: Here, P i n refers to the nth band in the ith subset of a hyperspectral image. N i is the band number in the ith subset. After image fusion, we remove the noise pixels and redundant information for each subset. Then, we transform the domain recursive filtering algorithm on Q i to obtain the ith feature: Here, RF represents the domain recursive filtering algorithm. δ s is the filter's spatial standard deviation, and δ r is defined as the range standard deviation [60]. Then, we get the feature image Image fusion and recursive filtering (IFRF) is a simple yet powerful feature extraction method that aims at finding the best subset of hyperspectral bands that provide high classes' separability. In other words, the IFRF feature method is used to select better bands that remove the noise and redundant information simultaneously [12]. Hence, we preprocess the HSIs firstly by IFRF to eliminate redundant information. Let x M ] ∈ R N ×M be the preprocessed features vector, where x i represents a pixel with N band numbers in the hyperspectral image, and M is the pixels number. We begin by sampling the K cluster centers and moving them to the lowest gradient position in the 3 × 3 neighborhood of the cluster centers. The squared Euclidean norm of the image is calculated by Equation (13): where I(x, y) represents the L, a, b vector corresponding to the pixel at position (x, y), and . is the l 2 -norm. This takes both color and intensity information into account. Each pixel in the image is associated with the nearest cluster center whose search area overlaps the pixel. After each pixel is associated with its nearest cluster center, a new center is computed as the average Labxy vector of all the pixels in the cluster. Then, we iteratively associate the pixel with the nearest cluster center and recalculate the cluster center until convergence. We enforce the connection by relabeling the disjoint segmentations with their largest neighboring cluster's label. The simple linear iterative clustering algorithm is summarized in Algorithm 1.

Algorithm 1 SLIC superpixel segmentation.
Input: Processed HSIs image X ∈ R N ×M , Desired number of approximately superpixels K.
1: Initialize cluster centers C k = [l k , a k , b k , x k , y k ] T by sampling pixels at regular grid steps S. for each pixel i in a 2S × 2S region around C k do 8: Compute the distance D between C k and i.

9:
if D < d(i) then 10: set d(i) = D Hyperspectral images are segmented into many irregularly homogeneous regions. We stack the pixels in one homogeneous region into a matrix. We record the segment image as X = {X 1 , X 2 , · · · , X m }, where k ∈ {1, 2, · · · , m} is the index of the superpixels. m is the exact total number of superpixels in X. n k is the number of pixels in the superpixel X k , and each column represents the bands of one pixel. Due to the low-rank attribute of the data, robust principal component analysis is used to extract the low-dimensional structure of all pixels in the homogeneous region. Then, the observed data matrix X k ∈ R N ×n k can be decomposed into a low-rank matrix and a sparse matrix. That is, X k = Z k + E k , where Z k represents the low-rank matrix of X k , and E k indicates the sparse noise matrix. Further, the RPCA optimization problem for each superpixel region is converted to the following form: In the RPCA model, noise is required to be sparse, and the structural information of the noise is not considered. However, in machine learning or image processing fields, each column (or row) of the matrix has some meaning (e.g., a picture, and a signal data, etc.). Here, each column represents a pixel, which causes the noise to be sparse in column (or row). This structural information cannot be represented by the definition of an l 1 -norm. To generate structural sparsity, the l 2,1 -norm is proposed in the robust principal component analysis.
Unlike the l 1 -norm, the l 2,1 -norm produces sparsity based on columns (or rows). For the matrix X ∈ R m×n , its l 2,1 -norm is [61,62]: Thus, the l 2,1 -norm is the sum of the l 2 -norm of column vectors, which is also a measure of joint sparsity between vectors. For data that are interfered by structured noise, the following l 2,1 -norm RPCA model in Equation (16) is very suitable: In contrast to the original RPCA model, we call the error term with l 2,1 -norm in Equation (16) as RPCA 2,1 . Here, the inexact augmented Lagrangian multiplier (ALM) algorithm [63,64] is utilized for the optimal solution Z k .
The augmented Lagrangian multiplier method solves the optimization problem by minimizing the Lagrangian function as follows: where Y is defined as Lagrangian multipliers and µ > 0 is the penalty parameter. The sub-problem for all variables is convex, which can supply a relevant and unique solution. Alternate iterations Z k , E k , and Y: Note that Z i+1 and E i+1 are convergent to Z i+1 * and E i+1 * , respectively. Then, we update Y as follows: Finally, parameter µ is updated: where ρ > 1 is a constant and ε > 0 is a relatively small number. The optimization process outline of the inexact ALM method [63] is given in Algorithm 2.
The initialization Y 0 = X/J(X) is defined as where · ∞ is the maximum absolute value of the matrix entries.

Input:
Observation matrix X ∈ R m×n and parameter λ for local affinity. 1: Initialize: 7:

HSIs' Classification Based on the SURPCA 2,1 Graph
To overcome the "Curse of dimensionality" caused by the high dimension, we apply semi-supervised discriminant analysis to reduce the dimension. SDA is derived from linear discriminant analysis (LDA). The rejection matrix a can represent a priori consistency assumptions according to the regularization term [65]: Here, S t is the total class scatter matrix. Parameter α is used to balance the complexity and empirical loss of the model. Empirical loss J(a) controls the learning complexity of the hypothesis family. The regularizer term J(a) incorporates the prior knowledge into some particular applications. When a set of unlabeled examples available, we aim to construct a J(a) incorporating the manifold structure.
Given a set of samples {z i } M i=1 , we can construct the graph G to represent the relationship between nearby samples by kNN. The samples z i and z j are considered as k-nearest neighbor if z i is among the k-nearest neighbor of z j or z j is among the k-nearest neighbor of z i . Here, we employ the most simple 0-1 weighting [21] methods to assign weights for S: where N k (z i ) denotes k-nearest neighbor of z i . The SDA model incorporate the prior knowledge in the regularization term J(a), as follows: The diagonal matrix D, D ii = ∑ j S ij is column (or row, since S is symmetric) sum of S. L = D − S is the Laplacian matrix [66]. Then, SDA objective function is: We obtain the projective vector a by maximizing the generalized eigenvalue problem where d is the weight matrix's rank for the labeled graph.
The low-rank matrix extracted by RPCA captures the global correlation of the homogeneous region [37,39,67,68], whereas the k-nearest neighbor algorithm characterizes the local correlation of pixel points [69]. The probability that the neighboring pixels belong to the same category is large, which corresponds to the spatial distribution structure of the hyperspectral image. Therefore, the SURPCA 2,1 graph can achieve good feature representations for graph-based semi-supervised dimensionality reduction.
Given a set of samples {(z i , y i )} l i=1 and unlabeled samples {z i } m i=l+1 with c classes, the l k is the kth class's samples number. The algorithmic procedure of HSIs' classification by applying the SLIC superpixel-based l 2,1 -norm robust principal component analysis method is stated in the following paragraphs: Step 1 Construct the adjacency graph: Construct the block low-rank and kNN graph S in Equation (24) for the regularization term. Furthermore, calculate the graph Laplacian L = D − S.
Step 2 Construct the labeled graph: For the labeled graph, construct the matrix as: Define identity matrixĨ = I 0 0 0 ∈ R l×l .
Step 3 Eigen-problem: Calculate the eigenvectors for the generalized eigenvector problem: d is the rank of W, and {a 1 , a 2 , · · · , a d } is the d eigenvectors.
Step 4 Regularize discriminant analysis embedding: Let the transformation matrix A = [a 1 , a 2 , · · · , a d ] ∈ R N ×d . Then, embedding the data into d-dimensional subspace, Step 5 Finally, apply the simple and ubiquitously-used classifiers nearest neighbor and Support Vector Machine (SVM) for classification. Figure 3 shows a model diagram of the SURPCA 2,1 for hyperspectral image classification. The low-rank matrix measures the global correlation of the homogeneous regions, while the kNN algorithm preserves the local correlation of the pixels. The probability that adjacent pixels belong to the same category in the kNN is large, which corresponds to the spatial distribution structure of the hyperspectral image. Therefore, the SURPCA 2,1 graph can be used to achieve high-quality data representation.

Experiments and Analysis
To investigate the performance of the SLIC superpixel-based l 2,1 -norm robust principal component analysis graph model, we conducted extensive experiments on several real multi-class hyperspectral images. We performed experiments on a computer with CPU 2.60 GHz and 8 GB RAM.

•
The Indian Pines image is for the agricultural Indian Pine test site in Northwestern Indiana, which was a 220 Band AVIRIS Hyperspectral Image Data Set: 12 June 1992 Indian Pine Test Site 3. It was acquired over the Purdue University Agronomy farm northwest of West Lafayette and the surrounding area. The data were acquired to support soils research being conducted by Prof. Marion Baumgardner and his students [70]. The wavelength is from 400 to 2500 nm. The resolution is 145 × 145 pixels. Because some of the crops present (e.g., corn and soybean) are in the early stages of growth in June, the coverage is minuscule-approximately less than 5%. The ground truth is divided into sixteen classes, and are not all mutually exclusive. The Indian Pines false-color image and the ground truth image are presented in Figure 4.   • The Salinas image is from the Salinas Valley, California, USA, which was obtained from an AVIRIS sensor with 3.7 m spatial resolution. The image includes a size of 512 × 217 with 224 bands. Twenty water absorption bands were discarded here. There are 16 classes containing vegetables, bare soil, vineyards, etc. Figure A1 displays the Salinas image's false-color and ground truth images.

Evaluation Criteria
We give some evaluation criteria to evaluate the proposed method for HSIs, as follows. Classification accuracy (CA) is the classification accuracy of each category in the image. The confusion matrix [71] is often used in the remote sensing classification field, and its form is defined as: M = [m ij ] n×n , where m ij indicates that the number of pixels labeled by the j class should belong to the i class. n is the class number. The reliability of classification depends on the diagonal values of the confusion matrix. The higher the values on the diagonal of the confusion matrix are, the better the classification results will be.
We also used the following three main indicators: overall accuracy (OA), average accuracy (AA), and the kappa coefficient [72]. OA refers to the percentage of the overall correct classification. AA estimates the average correct classification percentage for all categories. The kappa coefficient takes the chance agreement into account and fixes it, whereas OA and AA check how many pixels are classified correctly. Assuming that the reference classification (i.e., ground truth) is true, then how well they agree is measured. Here, we assumed that both classification and reference classification were independent class assignments of equal reliability. The advantage of the kappa coefficient over overall accuracy is that the kappa coefficient considers chance agreement and corrects it. The chance agreement is the probability that the classification and reference classification agree by mere chance. Assuming statistical independence, we obtained this probability estimation [73,74].

Comparative Algorithms
We give several comparative methods to illustrate the great improvement by the SLIC superpixel-based l 2,1 -norm robust principal component analysis graph model in the HSIs' classification. For fairness, these comparative algorithms incorporate the simple linear iterative clustering (SLIC), which are shown below.
• PCA (principal component analysis) method [63]: PCA [16] seeks the best low-rank representation of the given data matrix that minimizes the difference between Z and X: • IFRF (image fusion and recursive filtering) algorithm. • Origin: Original bands of the unprocessed image.

Classification of Hyperspectral Images
We performed experiments on the three hyperspectral images to examine the performance of the SURPCA 2,1 graph. Ten independent runs of each algorithm were evaluated by resampling the training samples in each run. We chose the mean values as the results. In practice, it is difficult to obtain labeled samples, while unlabeled samples are often available and in large numbers. Unlike most of the existing HSI classifications, we tested the performance of all comparative methods using only small rate of the labeled samples. Table 1 shows the training and testing sets for all datasets, where the training sets are chosen randomly. The training samples were approximately 4%, 3%, and 0.4% for the three images, which were minimal sets to the entire dataset. Considering the classes with a meager number of samples, we incorporated a minimum threshold of training samples. Here, we set the minimum threshold of training samples for each class as five, which can eradicate the difference between the classes with a low number of samples. The SLIC superpixels' number for the three HSIs is 200, 600, and 400, respectively. Figures 4b, 5b and A1b show the segmentation maps. The amount of reduced dimension is 30 for the three hyperspectral images. The spatial standard deviation and range standard deviation of the filter are δ s and δ r with 200 and 0.3, respectively. The parameter σ in ) is 0.1, which is provided randomly.  C1  7  39  46  205  6426  6631  14  1995  2009  C2  63  1365  1428  565  18,084  18,649  20  3706  3726  C3  39  791  830  69  2030  2099  13  1963  1976  C4  15  222  237  97  2967  3064  11  1383  1394  C5  25  458  483  46  1299  1345  16  2662  2678  C6  35  695  730  156  4873  5029  21  3938  3959  C7  7  21  28  45  1285  1330  20  3559  3579  C8  25  453  478  116  3566  3682  51  11,220  11,271  C9  6  14  20  34  913  947  30  6173  6203  C10  44  928  972  ---19  3259  3278  C11  104  2351  2455  ---10  1058  1068  C12  29  564  593  ---13  1914  1927  C13  14  191  205  ---9  907  916  C14  56  1209  1265  ---10  1060  1070  C15  21  365  386  ---35  7233  7268  C16  9  84  93  ---13  1794  1807   Total  702  9547  10,249  1762  41,014  42,776  305 53,824 54,129 In the beginning, we utilized different manifold mapping graphs to obtain the manifold matrix of the hyperspectral images. Then, we applied the SDA algorithm for semi-supervised dimensionality reduction. Both the nearest neighbor and support vector machine algorithms were applied as the classifiers here to test and verify the proposed SURPCA 2,1 graph. Tables 2, 3 and A1 give detailed classification results for CA, OA, AA, and kappa coefficients received from different methods, where bold numbers represent the best ones for various graphs. Figures 6, 7 and A2 indicate classification maps of the hyperspectral images acquired by several methods (randomly selected from our experiments above) with the NN classifier. It was observed that the results in the figures were random for each method. We analyzed the running times of different models on the Indian Pines image, Pavia University scene, and Salinas scene image. Ten separate runs were calculated for the total time.  In Tables 2, 3, and A1, we can notice that the SURPCA 2,1 graph is superior to the other graph models. Therefore, it significantly improves the classification performance of hyperspectral images, which indicates that the SURPCA 2,1 graph is a superior HSIs feature extraction method with both NN and SVM classifiers. For example, in Table 2, although the overall accuracy and kappa coefficient of other graphs are high, the average accuracy is not good. The accuracy of Grass-pasture-mowed and Oats are more (i.e., 12-37%) improved with the NN classifier and with SVM classifier (i.e., 25-64%) than the compared graphs. In addition, the Corn, Soybean-notill, and Soybean-clean classes are significantly improved, especially the Grass-pasture-mowed, for the Indian Pines image. For the Salinas dataset, the classification of Fallow, Lettuce_romaine_6wk, and Vinyard_untrained classes are obviously improved. We can see that, although sometimes the overall accuracy is not very different, the average accuracy is far ahead of the compared graphs. This is because, when there are fewer samples in some classes, it is very easy to be confused. For the Indian Pines image, the accuracy of all categories is more than 89.81%, except for the 81.96% accuracy of the Grass-pasture-mowed with the NN classifier, and it is also more than 95.03% with the SVM classifier, which meets the general accuracy requirement of remote sensing. For the Pavia University scene and Salinas image, the accuracy of all categories is larger than 91.69% and 91.95%, and 91.7% and 97.45% with NN and SVM, respectively. We can also see that our graph performed well in all categories with both of the classifiers. However, for other methods, they might perform much worse in some confusing categories, especially with the NN classifier. This phenomenon shows that our proposed graph is robust to classifier methods.
We analyzed the running times of different models on the three hyperspectral images. We used ten separate runs to calculate the total time. We give the mean running time. As shown in Table 4, the execution time of the SURPCA 2,1 graph method is slightly longer than the other methods. Although our algorithm is slower than the traditional algorithms, the performance is much better than these baselines at an acceptable running time. Note that the RPCA 2,1 is significantly faster than the classic RPCA. This is owing to the superior ability of the l 2,1 -norm to converge more quickly than the l 1 -norm.
Although the classification results with NN are slightly worse with the SVM algorithm, the running time is much less than the SVM classifier. Moreover, in Tables 2, 3, and A1, we can see that the classification results with NN classifier are acceptable. Therefore, in the following experiments, the nearest neighbor is applied. We evaluated the performance of the proposed SURPCA 2,1 graph. To fully determine the superiority of the proposed graph model, we also analyzed the robustness of SURPCA 2,1 . Considering the practical situation, we analyzed the robustness of SURPCA 2,1 on the size of the labeled samples and noise situation.

Labeled Size Robustness
We analyzed the impact of different sizes of training and testing sets. The experiments were carried out with ten independent runs for each algorithm. We calculated the mean values and standard deviations of the results. Figure 8  In most cases, our graph model SURPCA 2,1 consistently achieves the best results, which are robust to the label percentage variations. With increasing size of the training sample, OA generally increases in all methods, showing a similar trend. When the labeled ratio is fixed, the SURPCA 2,1 method is usually superior to others. Similarly, the three classification criteria are increased with the increasing training samples simultaneously.
Note that, our proposed graph realize higher classification results, even with a low label ratio. The other comparison algorithms are especially not as robust as the SURPCA 2,1 graph with low label ratio. In particular, we can see that our graph model perform relatively better, and the difference between other comparison methods is larger when the labeling rate is smaller. Therefore, due to the high cost and difficulty of labeled samples, our proposed graph is much more robust and suitable for real-world hyperspectral images.

Noise Robustness
We evaluated the robustness to noise of the SURPCA 2,1 graph on the three hyperspectral images. We added zero-mean Gaussian noise with a signal-to-noise ratio (SNR) of 20 dB to each band. Figure 9 shows an example noise image of three randomly selected bands in the Indian Pines image, which is similar to the other two hyperspectral images.  We evaluated different graphs' performance in the noisy situation. Each algorithm ran ten times independently, with re-sampling of training samples. We calculated the mean and standard deviation results. The labeled sampling ratios were 4%, 3%, and 0.4%, respectively, as described in Table 5. The bold numbers represent the best results for various graphs. Figures 10, 11, and A3 display classification maps of the three noise hyperspectral images correlated with OA values (randomly selected from our experiments above). It is observed that the results in the figures were random for each method.
From the results of these different graph models, we can see that our graph is robust to noise and the sample size of the labeled set. In a noisy environment, our method is relatively less degraded and more robust. With few labeled samples, the SURPCA 2,1 graph is more forceful than the other graphs due to the l 2,1 -norm robust principal component analysis noise robustness. Moreover, the SURPCA 2,1 graph performs very well for all three experimental hyperspectral images.

Parameters of SURPCA 2,1 Model
We evaluated the SURPCA 2,1 graph's parameters superpixel number m and the reduced dimensions of SDA. We conducted ten independent runs per algorithm and calculated the mean values and standard deviations of the results. Table 6 and Figure 12 show the performance of different superpixel number m and the reduced dimensions, respectively. For these three hyperspectral images, the labeled ratio was about 4%, 3%, and 0.4%, respectively. The superpixel segmentation numbers are shown in Table 6 for the three hyperspectral images. In general, the classification results are not greatly affected by the number of superpixel segmentations, indicating that our graph is robust to the number of superpixel segmentations. It is observed that the increase in superpixel numbers simultaneously accelerated the running time considerably. However, the impact is relatively small when the number of samples is large. Therefore, we could use relatively small superpixel numbers in actual situations for the purpose of efficiency with minimal overall loss of classification accuracy. 5   We can observe that the reduction dimension is robust in Figure 12. It will reach steady high accuracy at a relatively low dimension such as 10 or 15.

Discussion
The hyperspectral images' classification plays a pivotal role in HSIs' application. In the present work, we propose a novel graph construction model for HSIs' semi-supervised learning, SLIC superpixel-based l 2,1 -norm robust principal component analysis. Our goal is to improve HSIs' classification results through useful graph model. The experimental results show that SURPCA 2,1 is a competitive graph model for hyperspectral images' classification. Tables 2, 3, and A1 show that SURPCA 2,1 is an effective graph model that achieves the highest classification accuracy. Even with simple classifiers and few labeled samples, the performance is excellent. In most cases, the graph model SURPCA 2,1 obtains the highest classification results, which indicates that the SURPCA 2,1 graph is a good graph construction model and can notably enhance the hyperspectral images' classification performance. In some cases, other graph models (e.g., RPCA and PCA) may perform well in some categories, but they are not as robust as our graph in all categories. This is because the structural information cannot be well represented from the definition of l 1 -norm, while the l 2,1 -norm RPCA 2,1 produces sparsity based on columns (or rows). The SLIC superpixel segmentation method considers the spatial correlation of pixels in hyperspectral images. For an HSI, the correlation of the adjacent pixels is usually very high since their spectral features are approximately in the same subspace [41]. By extracting a low-rank matrix in each homogeneous region, we can reveal the low-dimensional structure of pixels and enhance the classification results. Consequently, the SURPCA 2,1 graph model is a superior hyperspectral image's feature graph which significantly improves the performance of HSIs' classification.
We analyzed the robustness to variations in the labeled samples ratio and noisy situations. Figure 8 shows that the SURPCA 2,1 graph model provides the best results, which are robust to the varying label percentage. Figure 8 also shows that the proposed graph achieves high classification accuracy even at meager label rates, while other graphs may not be as robust as the SURPCA 2,1 graph, particularly when the label rate is low. Therefore, since labeling data is a costly and difficult task, our SURPCA 2,1 model is more suitable and robust for real-world applications. From these results of different graph models, we can see that our graph is robust to the sample ratio of the labeled set as well as the noisy environments. Upon adding noise, the performance of the SURPCA 2,1 graph does not see a large decrease. However, for the comparative graphs, the overall accuracy drops much more than that of SURPCA 2,1 . In addition to the India Pines image, we could see the same results from the other two hyperspectral images. This benefits from the robustness of the l 2,1 -norm robust principal component analysis to sparse noise, whereas RPCA and PCA may be sensitive to sparse noise. Therefore, the SURPCA 2,1 graph is robust to noisy environments and the labeled sample ratio, which indicates that it is highly competitive in practical situations.
We evaluated the performance of parameters with different superpixel segmentations number and reduction dimensions for the SDA method. In Table 6, we can see that the classification results are not greatly affected by the number of superpixel segmentations, indicating that our graph is robust to the number of superpixel segmentations. Therefore, we could use a relatively small number of superpixels in actual situations for the purpose of efficiency, with barely any classification accuracy loss. Moreover, in Figure 12, we can see that the classification results are robust after the dimension reached a certain numerical value. It reaches a high accuracy at a relatively low dimension (e.g., 10 or 15). Overall, our proposed SURPCA 2,1 graph is very much robust and shows excellent performance in the classification of HSIs.

Conclusions
In this paper, we present a novel graph for HSI feature extraction, referred to as SLIC superpixel-based l 2,1 -norm based robust principal component analysis. The SLIC superpixel segmentation method considers the spatial correlation of pixels in HSIs. The l 2,1 -norm robust principal component analysis is used to extract the low-rank structure of all pixels in each homogeneous area, which captures the spectral correlation of the hyperspectral images. Therefore, the classification performance of the SURPCA 2,1 graph is much better than the compared methods RPCA, PCA, etc. Experiments on real-world HSIs indicated that the SURPCA 2,1 graph is a robust and efficient graph construction model.