Pairwise Elastic Net Representation-Based Classification for Hyperspectral Image Classification

Representation-based algorithms have attracted great interest in hyperspectral image (HSI) classification. l1-minimization-based sparse representation (SR) selects only a few atoms and cannot fully reflect within-class information, while l2-minimization-based collaborative representation (CR) uses all of the atoms and therefore mixes class information. To address these problems, we propose the pairwise elastic net representation-based classification (PENRC) method. PENRC combines the l1-norm and l2-norm penalties and introduces a new penalty term built on a similarity matrix between dictionary atoms. This similarity matrix enables automatic grouped selection of highly correlated data, yielding more robust weight coefficients and better classification performance. To reduce the computation cost and further improve classification accuracy, we use a subset of the atoms as a local adaptive dictionary rather than the entire set of training atoms. Furthermore, we incorporate the neighborhood information of each pixel and propose a joint pairwise elastic net representation-based classification (J-PENRC) method. Experimental results on the chosen hyperspectral data sets confirm that our proposed algorithms outperform other state-of-the-art algorithms.


Introduction
A hyperspectral image is a 3D remote sensing image containing hundreds of bands, from visible to infrared spectra. Due to their abundant spectral information, HSIs have become a practical tool in remote sensing applications such as skin imaging [1], ground-element identification [2] and mineral exploration [3]. To date, many classification algorithms for hyperspectral datasets have been proposed. Among these techniques, the support vector machine (SVM) [4], the Gaussian mixture model (GMM) [5] and the Gaussian maximum-likelihood classifier (MLC) [6] have all proved effective for the HSI classification problem. The most active research directions in recent years can be roughly divided into two categories: representation-based algorithms and deep learning-based algorithms. On the one hand, in order to make full use of the spectral and spatial information of HSIs, some effective spectral-spatial feature extraction methods have been combined with sparse models to improve the characterization capability of the models, such as [7][8][9][10]. On the other hand, since convolutional neural networks (CNNs) with deep architectures have proven very effective at exploiting image features, methods using deep CNNs for hyperspectral feature extraction have stimulated various studies [11][12][13][14].
This paper mainly focuses on HSI classification algorithms based on representation learning. The classification principle of such methods is to assume that each testing pixel can be reconstructed from labeled training pixels. The abundance coefficients of the testing pixel are then obtained under an l1-norm or l2-norm penalty, giving sparse representation classification (SRC) [15] and collaborative representation classification (CRC) [16], respectively. In [17], Chen et al. first introduced the sparsity model into hyperspectral classification and proposed the joint sparse representation classification (JSRC) method by incorporating contextual information. In [18], considering that different atoms have different importance for the reconstruction process, Li et al. proposed the nearest regularized subspace (NRS) classifier with Tikhonov regularization. By wisely combining SRC and KNN, a class-dependent sparse representation classifier (cdSRC) was proposed in Ref. [19]. However, some research [18,20] shows that collaboration in the approximation, rather than competition, is what enhances classification results. Therefore, in Ref. [21], a joint within-class CRC was proposed for HSI classification tasks. In [22], the kernel version of CRC was further considered and the kernel-based CRC (KCRC) was proposed. Other investigations are dedicated to improving classification effectiveness. On the one hand, some focus on simpler and more robust dictionaries to reduce computation costs. On the other hand, some take neighborhood spatial information as an important factor in improving classification accuracy. In Ref. [23], the nonlocal joint collaborative representation (NJCRC) algorithm was proposed, utilizing a subdictionary whose atoms are obtained by k-nearest neighbor (KNN) search around the testing samples rather than the whole set of dictionary atoms. In [24], Fang et al. introduced a shape-irregular neighbor region into the joint SRC model and proposed the shape adaptive joint sparse representation (SAJSRC).
It is worth noting that both SRC-based and CRC-based algorithms have limitations. In these representation-based classification models, the obtained abundance coefficients reflect the importance of each training sample for reconstruction. Accordingly, the primary concern of this type of method is the solution of the abundance coefficients. Ideally, the test pixel should be linearly represented by atoms from the same category, so the nonzero entries of the sparse coefficients should be located at the positions of the corresponding class. SRC tends to select as few atoms as possible: excessive sparsity biases the absolute reconstruction error, and the sparsity assumption is weakened when the set of training atoms is small. CRC tends to select all the atoms for reconstruction, so class discrimination weakens when mixed-class information is included. Intuitively, balancing SRC and CRC is necessary to achieve better classification performance.
To solve the above problem, the elastic net representation-based classification (ENRC) method was proposed in Ref. [25]. The elastic net, originally proposed in [26], encourages both sparsity and grouping by forming a convex combination of the CRC and SRC penalties governed by a selectable parameter. Furthermore, the elastic net can yield a sparse estimate with more than n nonzero weights. Owing to these advantages, ENRC improves HSI classification performance. However, the optimal balance factors are all obtained by traversing a handcrafted parameter space, which makes the algorithm time-consuming and complex. Additionally, the pixelwise fusion algorithm cannot make full use of the spatial information of the HSI.
Fortunately, recent literature [27] has pointed out that the pairwise elastic net (PEN) model, which uses similarity measures between regressors, can establish a local balance between SRC and CRC. It achieves more flexible grouping than ENRC. Moreover, PEN allows the customization of the sparsity relationship between any two features. Hence, in this work, we propose the pairwise elastic net representation-based classification (PENRC) method to overcome the inherent disadvantages of ENRC, SRC and CRC. It automatically balances the l1-norm and l2-norm so that more robust weight coefficients can be estimated, thereby achieving between-class sparsity and within-class collaboration for better classification performance.
Specifically, the main contributions of the proposed PENRC can be briefly summarized as follows. First, considering the computation cost of using the entire dictionary, we adopt KNN to select the labeled atoms most similar to the testing pixel as an optimal sub-dictionary. Second, unlike ENRC, which assigns only a single global tradeoff between sparsity and collaboration, we introduce a similarity matrix over the sub-dictionary atoms into the penalty, yielding a local tradeoff between sparsity and collaboration that is more flexible than ENRC. After obtaining the abundance coefficients, we use the minimum-reconstruction-error principle to decide the final label. We also provide a further extension of our algorithm by incorporating the neighborhood information of each pixel.
In summary, it is expected that the abundance coefficients from PENRC reveal a more powerful discriminant ability, thereby outperforming the original SRC, CRC and ENRC.
The remaining parts of the paper are organized as follows: Section 2 briefly introduces the two classical classifiers, SRC and CRC. Section 3 details the proposed PENRC mechanism. Section 4 gives the experimental results on the chosen datasets. Finally, Section 5 concludes this paper.

Related Works
Denote a testing pixel as $y = [y_1, \ldots, y_B]^T \in \mathbb{R}^{B\times 1}$ and the dictionary composed of training atoms arranged by class as $X = [X_1, \ldots, X_C] \in \mathbb{R}^{B\times N}$, where $B$ is the number of spectral bands, $N = \sum_{c=1}^{C} N_c$ is the number of training atoms and $C$ is the total number of classes. The sub-dictionary $X_c \in \mathbb{R}^{B\times N_c}$ is the set of training atoms of the $c$-th class.

Sparse Representation for HSI Classification
The sparse model assumes that a testing pixel can be suitably approximated as a linear combination of a few dictionary atoms [15]. For a testing pixel $y$, the SRC model obtains the corresponding abundance coefficients by minimizing the reconstruction error $\|y - X\alpha^{SRC}\|_2^2$ under the sparsity constraint $\|\alpha^{SRC}\|_1$. Mathematically, the objective function can be written as

$$\alpha^{SRC} = \arg\min_{\alpha} \|y - X\alpha\|_2^2 + \lambda_1 \|\alpha\|_1, \qquad (1)$$

where $\lambda_1$ is the balancing parameter. The weight vector $\alpha^{SRC} \in \mathbb{R}^{N\times 1}$ is sparse and has only a few nonzero entries. It can be obtained by solving Equation (1) with the basis pursuit (BP) or basis pursuit denoising (BPDN) algorithms [28,29]. When the $\ell_0$-norm is used directly in place of the $\ell_1$-norm, the problem can be solved with greedy methods such as subspace pursuit (SP) and orthogonal matching pursuit (OMP) [30]. After obtaining $\alpha^{SRC}$, we assign the testing pixel the class label corresponding to the minimum reconstruction error:

$$\mathrm{class}(y) = \arg\min_{c} \|y - X_c \alpha_c^{SRC}\|_2,$$

where $\alpha_c^{SRC}$ is the subset of the sparse vector $\alpha^{SRC}$ belonging to the $c$-th class.
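To make the pipeline concrete, the following minimal NumPy sketch (our own illustration, not the implementation evaluated in this paper) solves the sparse coding step with a simple OMP loop and then applies the class-wise minimum-residual rule:

```python
import numpy as np

def omp(X, y, n_nonzero):
    """Greedy orthogonal matching pursuit: repeatedly pick the atom most
    correlated with the current residual, then refit the selected atoms
    by least squares."""
    residual = y.astype(float).copy()
    support = []
    alpha = np.zeros(X.shape[1])
    for _ in range(n_nonzero):
        idx = int(np.argmax(np.abs(X.T @ residual)))
        if idx not in support:
            support.append(idx)
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    alpha[support] = coef
    return alpha

def src_classify(X, labels, y, n_nonzero=3):
    """SRC decision: the class whose atoms give the smallest residual."""
    alpha = omp(X, y, n_nonzero)
    errors = {c: np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c])
              for c in np.unique(labels)}
    return min(errors, key=errors.get)
```

In practice the BP/BPDN and SP solvers cited above replace this toy OMP loop, but the class-wise residual decision is the same.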

Collaborative Representation for HSI Classification
Unlike the SRC model, CRC assumes that a testing pixel can be linearly combined from the whole training set [21]. CRC obtains the abundance coefficients by minimizing the reconstruction error $\|y - X\alpha^{CRC}\|_2^2$ with the term $\|\alpha^{CRC}\|_2^2$. Thus, CRC can be expressed as

$$\alpha^{CRC} = \arg\min_{\alpha} \|y - X\alpha\|_2^2 + \lambda_2 \|\alpha\|_2^2, \qquad (3)$$

where $\lambda_2$ balances the influence of the reconstruction error and the constraint term. Equation (3) has a simple closed-form solution: setting the derivative of the cost function to zero yields the optimal value of $\alpha^{CRC}$,

$$\alpha^{CRC} = \left(X^T X + \lambda_2 I\right)^{-1} X^T y,$$

where $I$ is the $N \times N$ identity matrix. After obtaining $\alpha^{CRC}$, the final class label $c$ of the testing pixel is determined with the minimum-residual rule introduced in the last section.

For the above representation-based methods, the sparsity constraint makes the training atoms "competitive" in SRC, whereas under the $\ell_2$-norm all atoms participate equally in the representation, so CRC tends to be "cooperative". The performance of SRC and CRC was compared in [21,22]: the experiments showed that in some cases SRC performs better than CRC, while CRC performs better in others. For example, in remote sensing images, the SRC algorithm gave a more remarkable improvement in the presence of mixed pixels [31]. Thus, appropriately combining SRC and CRC is an effective strategy. In fact, in Ref. [25], the FRC and ENRC algorithms were proposed to combine SRC with CRC. However, the dictionary chosen in [25] consists of all the training samples and brings a large computational burden. In addition, the algorithms in [25] set only a global trade-off between SRC and CRC, leading to an inflexible balance between different classes.
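The CRC closed form makes for a very compact classifier. A minimal sketch (illustrative names, not the paper's code):

```python
import numpy as np

def crc_classify(X, labels, y, lam=1e-2):
    """CRC: solve the ridge system (X^T X + lam I) alpha = X^T y in closed
    form, then assign the class with the smallest reconstruction error."""
    N = X.shape[1]
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
    errors = {c: np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c])
              for c in np.unique(labels)}
    return min(errors, key=errors.get)
```

Because the system matrix depends only on the dictionary, $(X^T X + \lambda_2 I)^{-1} X^T$ can be precomputed once and reused for every testing pixel.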

Proposed PENRC
The framework of our proposed PENRC algorithm is shown in Algorithm 1. First, we build a local adaptive dictionary to reduce the amount of computation: given a test pixel, we use the KNN algorithm to select the K training pixels most similar to it as the local adaptive dictionary. Second, we construct the PENRC model of the hyperspectral image: using the local adaptive dictionary, we build the PEN model and obtain the abundance coefficients corresponding to the testing pixel. We then compute the reconstruction error of each class from the abundance coefficients and classify the testing pixel by the minimum reconstruction error. In addition, to further improve the classification performance, we also integrate the spatial information of the pixel neighborhood into the model, yielding the joint pairwise elastic net representation-based classification (J-PENRC).
The final step of Algorithm 1 decides the label class(y) by the minimum-reconstruction-error principle of Equation (14).

Local Adaptive Dictionary
In representation-based methods, dictionaries are usually composed of all labeled training pixels [32,33]. In order to have a robust representation, it is necessary to ensure that the dictionary is complete (that is, enough training samples are needed). However, training samples are usually limited in practice. In addition, using all training pixels directly will lead to a large amount of computation. Therefore, to solve the above problems, we utilize the local adaptive dictionary to obtain a more robust representation.
For a testing pixel $y$, we utilize KNN to construct a similar-signal set $D$ as the adaptive dictionary. However, due to the high dimensionality of the hyperspectral image, directly using the Euclidean distance to measure the similarity of the spectral vectors is unreasonable. To increase the separability of the data, the LDA algorithm [34] is used to project the HSI into a low-dimensional space, finding an optimal projection direction that minimizes the within-class distance of the samples and maximizes the between-class distance. Let $\Gamma \in \mathbb{R}^{B' \times B}$ denote the LDA mapping matrix, where $B'$ is the reduced dimension. The distance between the testing atom $y$ and an arbitrary training atom $x_n$ in the projected space can then be expressed as

$$d(y, x_n) = \|\Gamma y - \Gamma x_n\|_2, \qquad n = 1, \ldots, N.$$

We then sort the distances $\{d(y, x_n)\}_{n=1}^{N}$ and keep the indices of the $K$ training atoms most similar to $y$. The adaptive dictionary $D \in \mathbb{R}^{B\times K}$ is formed from these $K$ atoms.
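A sketch of the local adaptive dictionary construction, using a small from-scratch Fisher LDA. This is an illustrative reimplementation under our own naming, not the authors' code:

```python
import numpy as np

def lda_projection(X_train, labels, dim):
    """Fisher LDA: directions maximizing between-class over within-class
    scatter. X_train has shape (B, N), columns are training pixels."""
    mu = X_train.mean(axis=1, keepdims=True)
    Sw = np.zeros((X_train.shape[0],) * 2)  # within-class scatter
    Sb = np.zeros_like(Sw)                  # between-class scatter
    for c in np.unique(labels):
        Xc = X_train[:, labels == c]
        mc = Xc.mean(axis=1, keepdims=True)
        Sw += (Xc - mc) @ (Xc - mc).T
        Sb += Xc.shape[1] * (mc - mu) @ (mc - mu).T
    # leading eigenvectors of pinv(Sw) @ Sb give the projection Gamma
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)[:dim]
    return vecs[:, order].real  # shape (B, dim)

def adaptive_dictionary(X_train, labels, y, Gamma, K):
    """Select the K training atoms closest to y in the projected space."""
    d = np.linalg.norm(Gamma.T @ X_train - (Gamma.T @ y)[:, None], axis=0)
    idx = np.argsort(d)[:K]
    return X_train[:, idx], labels[idx]
```

For two well-separated classes, the K nearest atoms in the projected space should all come from the class of the query pixel, which is exactly the locality the sub-dictionary is meant to capture.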

Pairwise Elastic Net Representation Based Classification
First, we introduce the concept of the correlation matrix. Consider, for example, the following two similarity matrices $R_1$ and $R_2$:

$$R_1 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}, \qquad R_2 = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0.5 \\ 0 & 0.5 & 1 \end{bmatrix}.$$

All three features in $R_1$ share the same pairwise similarity, so a global trade-off between the $\ell_1$-norm and $\ell_2$-norm is effective. In $R_2$, however, feature 1 is very similar to feature 2 (calling for the $\ell_2$-norm), feature 1 is independent of feature 3 (calling for the $\ell_1$-norm) and feature 2 is only slightly related to feature 3 (calling for the elastic net). Hence, we need a flexible trade-off scheme that matches the regularization term to the data structure. The objective function of our proposed PENRC is therefore

$$\alpha = \arg\min_{\alpha} \|y - D\alpha\|_2^2 + \lambda \left( \|\alpha\|_2^2 + \|\alpha\|_1^2 - |\alpha|^T R |\alpha| \right), \qquad (8)$$

where $R \in \mathbb{R}^{K\times K}$ is the similarity matrix between the atoms of the adaptive dictionary $D$. Frequently used similarity measures include the absolute atom correlation $R_{ij} = |D_i^T D_j|$ and the Gaussian kernel $R_{ij} = \exp(-\|D_i - D_j\|^2/\sigma^2)$. Two basic identities relate the abundance coefficients and the similarity matrix:

$$|\alpha|^T I |\alpha| = \|\alpha\|_2^2, \qquad |\alpha|^T \mathbf{1}\mathbf{1}^T |\alpha| = \|\alpha\|_1^2,$$

where $I$ is the identity matrix and $\mathbf{1}$ is the all-ones vector. The last term of Equation (8), $-|\alpha|^T R |\alpha|$, governs the trade-off between the $\ell_1$-norm and the $\ell_2$-norm, as follows. For completely similar features, $R = \mathbf{1}\mathbf{1}^T$, the squared $\ell_1$ terms cancel, and only the $\ell_2$-norm remains in Equation (8), removing the impact of the $\ell_1$ constraint. For completely dissimilar features, $R = I$ and Equation (8) reduces to an SRC-like model with only the (squared) $\ell_1$ constraint. That is to say, when two features are similar we take the CRC approach; when they are dissimilar we take the SRC approach; in the remaining cases we take the ENRC approach. The flexible trade-off scheme is thus realized through our proposed PENRC.
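The limiting behavior described above is easy to verify numerically. The helper below (an illustrative sketch) evaluates the PEN regularizer $|\alpha|^T(I + \mathbf{1}\mathbf{1}^T - R)|\alpha|$ and recovers the CRC and SRC limits:

```python
import numpy as np

def pen_penalty(alpha, R):
    """PEN regularizer |a|^T (I + 11^T - R) |a|,
    which equals ||a||_2^2 + ||a||_1^2 - |a|^T R |a|."""
    a = np.abs(alpha)
    K = len(a)
    P = np.eye(K) + np.ones((K, K)) - R
    return a @ P @ a
```

With `R = np.ones((K, K))` (all atoms fully similar) the penalty collapses to the squared l2-norm, and with `R = np.eye(K)` (all atoms dissimilar) it collapses to the squared l1-norm, matching the two limiting cases above.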
To further enhance the classification performance, we also incorporate the spatial information of the HSI pixels into the PENRC model. In [24], a shape-adaptive (SA) region was proposed for each pixel. In our work, we utilize the neighborhood information within the SA region, and the chosen pixel is represented by the average of all pixels in the SA window. For an arbitrary pixel $y$ in the HSI, the corresponding SA set matrix is denoted $Y_{SA} = [y_1, y_2, \ldots, y_T]$, where $T$ is the number of chosen pixels in the SA region. The pixel augmented with spatial information is then obtained as

$$\bar{y}_{SA} = \frac{1}{T}\sum_{t=1}^{T} y_t.$$

The sparse coefficients $\alpha_{SA}$ for $\bar{y}_{SA}$ are obtained by solving the PENRC model of Equation (8) with $\bar{y}_{SA}$ in place of $y$:

$$\alpha_{SA} = \arg\min_{\alpha} \|\bar{y}_{SA} - \bar{D}_{SA}\alpha\|_2^2 + \lambda\left(\|\alpha\|_2^2 + \|\alpha\|_1^2 - |\alpha|^T R |\alpha|\right).$$

Once $\alpha_{SA}$ is obtained, the final label is determined by the class with minimum reconstruction error:

$$\mathrm{class}(y) = \arg\min_{c} \left\|\bar{y}_{SA} - \bar{D}^c_{SA}\,\alpha^c_{SA}\right\|_2, \qquad (14)$$

where $\bar{D}^c_{SA}$ and $\alpha^c_{SA}$ represent the subsets of $\bar{D}_{SA}$ and $\alpha_{SA}$ corresponding to the $c$-th class, respectively.
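The joint step itself is lightweight: average the SA window, solve for the coefficients, and reuse the class-wise residual rule. A minimal sketch of the averaging and decision steps (illustrative names; the PEN solver is discussed in the next section):

```python
import numpy as np

def sa_average(Y_sa):
    """ybar: mean of the T pixels in the shape-adaptive window (columns)."""
    return Y_sa.mean(axis=1)

def residual_label(D, labels, alpha, y):
    """Minimum class-wise reconstruction error rule of Equation (14)."""
    errs = {c: np.linalg.norm(y - D[:, labels == c] @ alpha[labels == c])
            for c in np.unique(labels)}
    return min(errs, key=errs.get)
```

Averaging over the SA window suppresses pixel-level noise before coding, which is why J-PENRC tends to be more robust than the pixelwise model.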

Coordinate Descent
To solve Equation (8), we rewrite it as

$$\alpha = \arg\min_{\alpha} \|y - D\alpha\|_2^2 + \lambda\, |\alpha|^T P |\alpha|, \qquad (15)$$

where $P = I + \mathbf{1}\mathbf{1}^T - R$. As proved in [27], the term $|\alpha|^T P |\alpha|$ in the above model is convex only if $P$ has nonnegative entries and is positive semidefinite (PSD). However, the matrix $P$ in Equation (15) is not always PSD. Following [27], we can replace $P$ by

$$P_\theta = \theta I + (1-\theta)P, \qquad \frac{\tau}{\tau+1} \le \theta \le 1, \qquad \tau = -\min\{0, \lambda_{\min}(P)\},$$

which guarantees positive semidefiniteness. Equation (15) can then be seen as a quadratic program (QP) and solved with a QP solver. However, a generic QP solver does not scale to high-dimensional data. To obtain more exact results efficiently, we use the coordinate descent method [35] in this paper. The approach can be summarized as follows: given a convex function $f(\alpha)$, compute the derivative $\partial f/\partial \alpha_i$; update $\alpha_i$ by holding all $\alpha_j$ ($j \ne i$) fixed and solving $\partial f/\partial \alpha_i = 0$; and cycle through each $\alpha_i$ iteratively until the termination condition is satisfied. In PENRC, we have

$$f(\alpha) = \|y - D\alpha\|_2^2 + \lambda\, |\alpha|^T P |\alpha|,$$

where $P$ is PSD and nonnegative, $Q = D^T D$ and $q = D^T y$. Setting the derivative $\partial f/\partial \alpha_i$ to zero while holding $\alpha_{-i} = \alpha_{\{1:K\}\setminus i}$ fixed, and defining the scalars $a = Q_{ii} + \lambda P_{ii}$, $b = \lambda \sum_{j\ne i} P_{ij}\, |\alpha_j|$ and $c = q_i - \sum_{j\ne i} Q_{ij}\,\alpha_j$, the update equation is the soft-thresholding step

$$\alpha_i = \frac{\operatorname{sign}(c)\,\max\left(|c| - b,\ 0\right)}{a}.$$
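Putting the pieces together, the coordinate descent loop with the PSD shift can be sketched as follows. This is a simplified illustration of the update derived above, under our own naming and with a fixed iteration count rather than a convergence test:

```python
import numpy as np

def pen_coordinate_descent(D, y, R, lam=0.01, theta=None, n_iter=200):
    """Coordinate descent for min_a ||y - D a||^2 + lam * |a|^T P |a|,
    with P = I + 11^T - R shifted to P_theta = theta*I + (1-theta)*P
    so that it is PSD (theta in [tau/(tau+1), 1])."""
    K = D.shape[1]
    P = np.eye(K) + np.ones((K, K)) - R
    tau = max(0.0, -np.linalg.eigvalsh((P + P.T) / 2).min())
    if theta is None:
        theta = tau / (tau + 1.0)   # smallest admissible shift
    P = theta * np.eye(K) + (1 - theta) * P
    Q = D.T @ D
    q = D.T @ y
    alpha = np.zeros(K)
    for _ in range(n_iter):
        for i in range(K):
            a = Q[i, i] + lam * P[i, i]
            b = lam * (P[i] @ np.abs(alpha) - P[i, i] * abs(alpha[i]))
            c = q[i] - (Q[i] @ alpha - Q[i, i] * alpha[i])
            alpha[i] = np.sign(c) * max(abs(c) - b, 0.0) / a  # soft threshold
    return alpha
```

With `R = np.eye(K)` (dissimilar atoms) the solver behaves like a squared-l1 sparse coder; with `R = np.ones((K, K))` it behaves like ridge regression, matching the limiting cases of Section 3.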

Data Set
In this paper, we chose three HSI data sets for experimental evaluation. The first is the Indian Pines dataset, acquired by the AVIRIS sensor over the Indian Pines test site in Northwest Indiana [38]. The image size is 145 × 145 pixels with 224 spectral reflectance bands whose wavelengths range from 0.4 µm to 2.5 µm. After removing crops with low coverage, we chose 9 classes of crops from the given ground truth: corn-notill, corn-mintill, grass-pasture, grass-trees, hay-windrowed, soybean-notill, soybean-mintill, soybean-clean and woods. Figure 1a,b illustrate the corresponding false color composite and ground truth map, respectively. The second data set is the Pavia Centre data set, acquired by the ROSIS sensor during a flight campaign over Pavia. The geometric resolution is 1.3 m, and the image size is 1096 × 715 × 102. Some samples in the image do not contain any information and must be discarded before analysis. For Pavia Centre, we chose nine classes from the given ground truth: water, trees, asphalt, self-blocking bricks, bitumen, tiles, shadows, meadow and bare soil. Figure 2a,b illustrate the corresponding false color composite and ground truth map, respectively.
The third is the Pavia University data set, also collected by the ROSIS sensor. The image size is 610 × 340 pixels, and it contains 103 spectral bands. The Pavia University dataset contains nine classes in the given ground truth: asphalt, meadows, gravel, trees, painted metal sheets, bare soil, bitumen, self-blocking bricks and shadows. Figure 3a,b illustrate the corresponding false color composite and ground truth map, respectively.

Parameter Analysis
During the experiments, we used three evaluation indicators to measure classification performance: OA, AA and Kappa [39]. OA is the proportion of correctly classified atoms among the total number of testing atoms, while AA is the average of the per-class accuracies. Kappa measures the agreement between the classification map and the ground truth, corrected for the agreement expected by chance. Detailed definitions of each indicator can be found in [40].
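For reference, the three indicators can be computed from the confusion matrix as in the following sketch (an illustrative helper, not taken from [39,40]):

```python
import numpy as np

def classification_scores(y_true, y_pred):
    """OA, AA and Cohen's kappa from true vs. predicted labels."""
    classes = np.unique(y_true)
    # confusion matrix: rows = true class, columns = predicted class
    cm = np.array([[np.sum((y_true == t) & (y_pred == p)) for p in classes]
                   for t in classes], dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n                      # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean per-class accuracy
    pe = (cm.sum(axis=1) @ cm.sum(axis=0)) / n ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```

Note that OA weights classes by their sample counts while AA weights them equally, which is why both are reported alongside Kappa.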
There are two main parameters in our proposed PENRC that have a significant impact on the classification results: the number of adaptive dictionary atoms K and the balancing parameter λ. In this section, we analyze the impact of these two parameters by sweeping the chosen parameter space and find the optimal parameters according to Figure 4. For Indian Pines, we chose 10% of the pixels per class as training samples. For Pavia Centre, we chose 100 pixels per class as training samples, and the same number for the Pavia University dataset. From Figure 4, we can see that OA first increases and then decreases as K increases: too few adaptive dictionary atoms lack sufficient locality information, while too many may introduce redundant category information. With K fixed, the classification accuracy reaches a local maximum at an appropriate value of λ. Based on the maximum OA shown in Figure 4, we set K to 20 and λ to 1 × 10^−3, 1 × 10^−2 and 1 × 10^−4 for Indian Pines, Pavia Centre and Pavia University, respectively.

Comparisons with Other Approaches
To avoid any bias, we repeated the experiments five times and reported the average classification accuracy.
For Indian Pines, we employ 10% of the labeled samples per class as the training set and the rest as the testing set. The detailed partition strategy is illustrated in Table 1. Table 2 shows the classification performance of our proposed PENRC and J-PENRC as well as the chosen comparison algorithms, with the optimal results for each class indicated in bold. For certain classes, such as grass-pasture, grass-trees, hay-windrowed and woods, the classification accuracies of our proposed PENRC and J-PENRC are above 98%; for hay-windrowed in particular, the accuracy reaches 100%. For the soybean-clean category, our algorithm improves the classification accuracy by 19.08% relative to the best comparison algorithm, ENRC. Furthermore, from Table 2, we can clearly see that our algorithms are optimal in terms of OA, AA and Kappa. To demonstrate the effectiveness of our algorithm more comprehensively, we also compare the OAs calculated under different numbers of training samples. The classification results are shown in Figure 4, where the abscissa represents the number of training samples per class and the ordinate represents the classification accuracy. The dashed lines represent the OAs of the pixelwise algorithms, and the solid lines represent the OAs of the algorithms based on spatial information. From the figure, we can see that even with insufficient training samples, our algorithm achieves an ideal classification result. Furthermore, our algorithms are consistently optimal compared to algorithms of the same kind.
Table 1. Numbers of training and testing samples for the three data sets.

No. | Indian Pines class | Train | Test | Pavia Centre class | Train | Test | Pavia University class | Train | Test
1 | Corn-notill | 142 | 1286 | Water | 100 | 2500 | Asphalt | 100 | 800
2 | Corn-mintill | 83 | 747 | Trees | 100 | 2500 | Meadows | 100 | 800
3 | Grass-pasture | 49 | 434 | Meadow | 100 | 2500 | Gravel | 100 | 800
4 | Grass-trees | 73 | 657 | Self-Blocking Bricks | 100 | 2500 | Trees | 100 | 800
5 | Hay-windrowed | 48 | 430 | Bare Soil | 100 | 2500 | Painted metal sheets | 100 | 800
6 | Soybean-notill | 98 | 874 | Asphalt | 100 | 2500 | Bare Soil | 100 | 800
7 | Soybean-mintill | 246 | 2209 | Bitumen | 100 | 2500 | Bitumen | 100 | 800
8 | Soybean-clean | 60 | 533 | Tile | 100 | 2500 | Self-Blocking Bricks | 100 | 800
9 | Woods | 127 | 1138 | Shadows | 100 | 2500 | Shadows | 100 | 800

For Pavia Centre, we employ 100 labeled samples per class as the training set and 2500 per class as the testing set. The detailed partition strategy is illustrated in Table 1. Table 3 shows the classification performance of our proposed PENRC and J-PENRC compared to the other chosen algorithms, with the optimal results for each class indicated in bold. For meadow, the classification accuracy of our proposed PENRC is above 99.6%. For some classes, such as asphalt and tile, the classification accuracies of our J-PENRC are above 99%. Notably, for water, the classification accuracy of both PENRC and J-PENRC reaches 100%. Furthermore, Table 3 shows that our proposed algorithms are optimal in terms of OA, AA and Kappa compared to the other chosen algorithms. To further demonstrate the effectiveness of our algorithm, we also compare the OAs of the chosen algorithms under different numbers of training samples, from 50 to 300 samples per class. The classification results are shown in Figure 5: compared with similar algorithms, our algorithm always achieves the best classification effect.
With regard to the Pavia University dataset, we randomly selected 100 labeled samples per class as the training set and 800 per class from the rest as the testing set (except for the shadows class, which contains only 947 labeled samples). The detailed partition strategy is illustrated in Table 1. Table 4 presents the classification results of our proposed PENRC and J-PENRC together with the other comparison algorithms, with the optimal results for each class denoted in bold.
For bitumen, the classification accuracy of our proposed PENRC reaches 99.17%. For some classes, such as gravel and bare soil, the classification accuracy of our J-PENRC is above 97%. Notably, for meadows, painted metal sheets and shadows, the classification accuracy of J-PENRC reaches 100%. In addition, in terms of OA, AA and Kappa, Table 4 also shows that our proposed algorithms are optimal compared to the other chosen algorithms. To further demonstrate the effectiveness of our algorithm, we also compared the OAs of the chosen algorithms with different numbers of training samples, from 50 to 300 per class. The classification results are shown in Figure 5. It is evident that our algorithm always attains the best performance.

Computational Complexity
In this section, we compare the computational complexity of each classifier on the Indian Pines, Pavia University and Pavia Centre datasets. All of the above experiments were executed five times to avoid bias. Table 5 lists the total execution and verification time for the three datasets; all experimental settings and parameters were the same as described above. As can be seen from Table 5, ENRC has a lower time complexity than PENRC, for two reasons. First, ENRC uses artificial prior information to set a fixed weight parameter combining the l1-norm and l2-norm, while PENRC learns this weighting automatically through the similarity matrix. Second, the solvers used by the two methods have different time complexities because their mathematical models differ. On the other hand, Table 5 also compares the time complexity with and without LAD. Clearly, the use of LAD substantially reduces the computational complexity of PENRC while yielding better classification performance.

Conclusions
In this paper, we proposed a hyperspectral image classification algorithm named PENRC. A locally constrained dictionary was first constructed to reduce the computation cost. Then, by introducing a correlation matrix, the PENRC model was constructed to realize group sparsity with self-balancing between the l1-norm and l2-norm. The pairwise elastic net model was shown to be capable of grouped selection of highly correlated data by establishing local, pairwise tradeoffs based on the similarity between dictionary atoms, thereby yielding more robust weight coefficients. To further improve the classification performance, we also introduced spatial information and proposed the J-PENRC model. Experimental results on real hyperspectral images verified that the proposed algorithms outperform existing representation-based classifiers. Compared to existing pixelwise and spatial-based algorithms, experiments on the chosen datasets verified the effectiveness of our proposed PENRC and J-PENRC in both quantitative and qualitative terms.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: