Discovering the Representative Subset with Low Redundancy for Hyperspectral Feature Selection

Abstract: In this paper, a novel unsupervised band selection (BS) criterion based on maximizing representativeness and minimizing redundancy (MRMR) is proposed for selecting a set of informative bands to represent the whole hyperspectral image cube. The new selection criterion is denoted as the MRMR selection criterion and the associated BS method is denoted as the MRMR method. The MRMR selection criterion evaluates a band subset's representativeness and redundancy simultaneously. For a band subset, its representativeness is estimated by using orthogonal projection (OP) and its redundancy is measured by the average of the Pearson correlation coefficients among the bands in the subset. To find a satisfactory subset, an effective evolutionary algorithm, i.e., the immune clone selection (ICS) algorithm, is applied as the subset searching strategy. Moreover, we introduce two tricks that simplify the computation of the representativeness metric, so that the computational complexity of the proposed method is reduced significantly. Experimental results on different real-world datasets demonstrate that the proposed method is effective and that its selected bands obtain good classification performance in practice.


Introduction
Hyperspectral images contain a large number of bands, which brings several problems, such as a heavy computational burden and high storage cost. In addition, because of the high spectral resolution, the hyperspectral bands are highly correlated, so using all the bands is unnecessary. Therefore, it is necessary to perform dimensionality reduction (DR) to process hyperspectral data effectively. The commonly used DR techniques include feature extraction and band selection (i.e., feature selection). Feature extraction reduces the feature space by extracting a few new features from the original features through some function mapping; methods of this kind include principal component analysis (PCA) [1,2], nonnegative matrix factorization (NMF) [3], and so on [4][5][6][7]. Different from feature extraction, band selection directly selects a subset of features from the original ones. For hyperspectral imagery, band selection (BS) is preferable because the selected bands retain their physical meaning and preserve the relevant original information in the data. BS methods can be broadly split into supervised and unsupervised methods according to the availability of prior knowledge. Supervised BS methods try to find the most informative bands with respect to the available prior knowledge [8,9], whereas unsupervised methods do not use any object information [10,11]. Because prior knowledge is often unavailable in practice, developing unsupervised BS techniques is necessary.
Many unsupervised band selection methods have been proposed in recent years. Some of them use different information criteria to measure the importance of individual hyperspectral bands; all the bands are then sorted and the top-ranked bands are selected. These methods include information divergence BS (IDBS) [10], linearly constrained minimum variance (LCMV) [10], constrained band selection (CBS) [10], mutual information [12], maximum-variance principal component analysis (MVPCA) [13] and so on [14]. Other band selection methods take the correlation among bands into consideration. For instance, the maximum ellipsoid volume (MEV) based methods [15][16][17][18] measure the importance of a band subset by calculating its ellipsoid volume, and it has been proved that methods of this kind account well for the correlation among bands [19]. Recently, some BS methods based on orthogonal projection (OP) have been proposed [19][20][21][22]; these methods use OP to select bands with low redundancy and have shown good performance in practice. Additionally, many BS methods based on advanced machine learning algorithms have been proposed, including the clustering-based methods [23][24][25][26][27], manifold ranking (MR) [28], sparsity-based BS methods [29,30], graph-theory-based BS [11] and so on.
An analysis of the selection criteria of these BS methods shows that, in general, band-ranking-based methods mainly consider the information of individual bands but neglect the correlation among the selected bands [10,21]. Correlation-based methods pay sufficient attention to band correlation, but their selected bands are not always highly representative and thus the classification performance is not very satisfactory [11,20,21]. Clustering-based methods and some other advanced methods may take information and correlation into account implicitly, but their computational burden is usually heavy and there is still considerable room for improvement in their selection criteria [21,28]. Therefore, we aim to design a new BS method that explicitly considers the bands' information and redundancy simultaneously.
In this paper, we propose a novel BS criterion based on maximizing representativeness and minimizing redundancy, which is denoted as the MRMR selection criterion. Combining the MRMR selection criterion with immune clone selection (ICS) [31] yields a new method named the MRMR BS method. The MRMR selection criterion is based on orthogonal projection (OP) [19] and provides a novel perspective for evaluating the importance of band subsets. It consists of two metrics, i.e., the representativeness metric and the redundancy metric. Orthogonal projection is used to evaluate the representativeness of a band subset, and the average of the Pearson correlation coefficients among the bands in the subset is used to measure its redundancy. By combining these two metrics properly, the MRMR selection criterion can evaluate the importance of each candidate band subset. Once the selection criterion has been determined, the BS task reduces to a subset searching problem; namely, we need to search over the candidate band subsets to find a satisfactory one. Since an exhaustive search is impractical, we have to apply a suboptimal group searching strategy. In this paper, a simple but effective evolutionary searching algorithm, i.e., the immune clone selection (ICS) algorithm, is applied as the subset searching strategy. ICS allows the BS algorithm to obtain the desired band subset in a reasonable time. Furthermore, to ease the heavy computational burden caused by orthogonal projection, two efficient tricks are introduced to simplify the computation of the representativeness metric; these tricks may also help reduce the computational complexity of other similar OP-based methods. The major contributions of this paper can be summarized as follows: (1) A new perspective is provided for measuring the importance of a feature subset in unsupervised feature selection: the selected features should not only represent the whole feature set well but also have low redundancy among themselves. (2) Based on this idea, OP and the mean correlation coefficient are used to evaluate the representativeness and the redundancy of feature subsets, respectively, and the MRMR selection criterion is proposed. (3) Two ways of accelerating the proposed method are introduced, which may also help reduce the computational complexity of other similar methods.
The remainder of this paper is organized as follows: Section 2 introduces related work associated with the proposed method, and Section 3 explains the proposed method in detail. Section 4 presents experiments on different real-world hyperspectral images. Finally, Section 5 gives some concluding remarks.

Related Works
The proposed MRMR method aims to find the band subset that can best represent the whole image dataset. Some similar methods can be found in [10], in which the linearly constrained minimum variance (LCMV) based methods are proposed. The LCMV-based methods design a finite impulse response (FIR) filter for the bands, and by minimizing the averaged least squares filter output, band selection is transformed into an optimization problem that is similar to constrained energy minimization (CEM) [32]. LCMV can find the bands that best represent the whole image, but it treats each band individually and does not consider the redundancy among the selected bands, so the bands obtained by methods of this kind may be highly correlated with each other. In practice, high correlation among selected features often degrades classification performance, so a good band selection method should take band correlation into account.
For the MRMR method, we use OP to measure the representativeness of a subset relative to the whole image cube. Namely, for a candidate band subset, we orthogonally project each remaining band (i.e., each band not included in this subset) onto the vector space spanned by the bands of the subset; the sum of the distances of all the remaining bands to their OPs then reflects the representativeness of this band subset. Some methods based on similar processes have been proposed, for instance, the orthogonal-projection-based BS method (OPBS) [19], the OSP-based BS method (OSP-BSVD) [21], and the volume-gradient-based BS method (VGBS) [20]; all of them can be considered BS methods based on OP. OPBS and OSP-BSVD have almost the same selection criterion, but they are derived independently from different perspectives. Since OPBS and OSP-BSVD are quite similar, for convenience, we only compare the OPBS method with the proposed method in the following. The OPBS method applies sequential forward search (SFS) [33] as the searching strategy, so it selects one band at a time. At each round, the band that has the maximum OP onto the orthogonal complement of the vector space spanned by the currently selected bands is regarded as the target band and added to the selected band set [19]. VGBS is similar to OPBS, but it removes one band from the original band set iteratively until the desired number of bands remains [20]. These OP-based methods mainly consider the redundancy of bands but pay insufficient attention to their representativeness, so the selected bands usually have low redundancy but may not represent the whole dataset well [19,20].
The major differences between the proposed MRMR method and these similar methods can be summarized as follows: (1) Compared with the LCMV-based methods: although both the LCMV-based methods and the MRMR method evaluate the representativeness of bands, their explicit selection criteria are totally different. The LCMV-based methods measure one band's representativeness relative to the whole dataset by using a finite impulse response (FIR) filter [10], whereas the MRMR method evaluates the representativeness of a band subset relative to the remaining bands by using OP. Moreover, LCMV cannot consider the redundancy among selected bands [10,19], but the MRMR method can. (2) Compared with the existing OP-based methods such as OPBS, OSP-BSVD and VGBS: although both these methods and the MRMR method use OP to measure the relationship among bands, their objectives are totally different. For the OPBS, OSP-BSVD and VGBS methods, OP is used to evaluate the redundancy or the dissimilarity between a candidate band and the currently selected bands [19][20][21], while for the MRMR method, OP is used to measure the representativeness of a band subset relative to the remaining unselected bands. The existing OP-based methods mainly consider the redundancy among selected bands but do not pay sufficient attention to the selected bands' representativeness [19]; in contrast, the MRMR method considers both the redundancy and the representativeness of the selected band subset. (3) Finally, the LCMV, OPBS, OSP-BSVD and VGBS methods are all point-wise band selection methods, in which the desired bands are obtained individually [10,19,20,21], whereas the MRMR method is a group-wise method, in which the desired bands are obtained simultaneously. Because the selected bands actually work together in applications such as pixel classification, their effect should be considered jointly. The group-wise methods are usually more effective than the point-wise methods, since the group searching strategy is more suitable for evaluating the joint effect of multiple bands.

Background of OP
The selection criterion of the MRMR method is associated with the vector space. In linear algebra, a vector space is defined as a set that is closed under finite vector addition and scalar multiplication [34]. Suppose that there is a set of column vectors which is denoted as
$$A = [x_1, x_2, \ldots, x_m] \in \mathbb{R}^{N \times m}, \tag{1}$$
where $N$ and $m$ represent the numbers of elements and vectors, respectively. Then the vector space spanned by all the column vectors of $A$ can be denoted as follows:
$$W = \operatorname{span}(A) = \Big\{ \sum_{i=1}^{m} a_i x_i \Big\}, \tag{2}$$
where $a_i$ could be any scalar. Assume that there is another column vector $x_0$. If we want to evaluate its relationship with the vector set $A$, we can compute the distance of $x_0$ to the vector set $A$. The distance can be obtained by orthogonally projecting the vector $x_0$ onto the vector space $W$. In linear algebra, $W$ is also a linear subspace (or a linear manifold) of the vector space spanned by the vector set $\{x_0, x_1, x_2, \ldots, x_m\}$ (note that this set includes $x_0$), so $W$ can be considered as a hyperplane relative to the latter [34]. The orthogonal projection of $x_0$ onto the hyperplane $W$ can be computed by:
$$\hat{x}_0 = P x_0 = A (A^T A)^{-1} A^T x_0, \tag{3}$$
where $\hat{x}_0$ is the orthogonal projection (OP) of $x_0$ onto $W$, and $P$ is called the orthogonal projector. Then, the squared distance of $x_0$ to the hyperplane $W$ is
$$d = \| x_0 - \hat{x}_0 \|^2. \tag{4}$$
The squared distance $d$ is also the squared norm of the orthogonal projection of $x_0$ onto the orthogonal complement of $W$ [19].
From the perspective of linear regression, the orthogonal projection $\hat{x}_0$ is also the linear estimate or prediction of $x_0$ using the vectors in $A$, and the distance $d$ evaluates the prediction error [19]. More specifically, the term $(A^T A)^{-1} A^T x_0$ in (3) is an $m \times 1$ vector and thus can be denoted as follows:
$$(A^T A)^{-1} A^T x_0 = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T, \tag{5}$$
where $\alpha_i$ is a scalar. The OP $\hat{x}_0$ is then rewritten as
$$\hat{x}_0 = A [\alpha_1, \alpha_2, \ldots, \alpha_m]^T = \sum_{i=1}^{m} \alpha_i x_i. \tag{6}$$
Obviously, the OP $\hat{x}_0$ is a linear combination of the vectors in $A$, and $(A^T A)^{-1} A^T x_0$ is the weight vector that determines how each vector affects the prediction. In fact, it can be proved that the term $(A^T A)^{-1} A^T x_0$ is exactly a least squares solution [19]. Therefore, the distance $d$ is the linear prediction error and it reflects how difficult it is to use the vectors in $A$ to estimate the single vector $x_0$. Evidently, the smaller the distance $d$ is, the easier it is to linearly represent $x_0$ using the vectors in $A$. For instance, Figure 1 shows an intuitive example in 3-D. In Figure 1a, the vector $x_0$ cannot be completely linearly represented by the vectors $x_1$ and $x_2$; in other words, $x_0$ does not belong to the vector space $W$, and correspondingly, the distance $d$ does not equal zero. In Figure 1b, the vector $x_0$ belongs to the vector space $W$ and thus can be linearly represented by the other vectors completely; in this case, the distance $d$ equals zero. The distance of a vector to the hyperplane spanned by other vectors therefore reflects the similarity between this single vector and a set of vectors. In band selection, if each band image is reshaped into a column vector, we can use OP to compute a band's distance to a band set for measuring the relationship between this single band and a set of bands.
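The projection and the squared distance discussed above can be illustrated with a minimal NumPy sketch. The function name `op_distance` and the toy vectors are illustrative assumptions, not part of the original method; the weights are obtained with a least-squares solver, which coincides with $(A^T A)^{-1} A^T x_0$ when $A$ has full column rank.

```python
import numpy as np

def op_distance(A, x0):
    """Return the OP of x0 onto span(A) and the squared distance d (hypothetical helper)."""
    # Least-squares weights; equal to (A^T A)^{-1} A^T x0 for full-column-rank A.
    alpha, *_ = np.linalg.lstsq(A, x0, rcond=None)
    x0_hat = A @ alpha                      # linear combination of the columns of A
    d = float(np.sum((x0 - x0_hat) ** 2))   # squared prediction error
    return x0_hat, d

# 3-D toy example: W is the plane spanned by x1 and x2.
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.0, 1.0, 0.0])
A = np.column_stack([x1, x2])

_, d_out = op_distance(A, np.array([1.0, 2.0, 3.0]))  # x0 outside W, d is nonzero
_, d_in = op_distance(A, np.array([1.0, 2.0, 0.0]))   # x0 inside W, d is zero
```

The two calls mirror the two cases of Figure 1: a vector with a component outside the plane has a nonzero distance, while a vector lying in the plane is reproduced exactly.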

MRMR Selection Criterion
The objective of the proposed method is to find the band subset with the maximum representativeness and the minimum redundancy. In this section, we introduce how the selection criterion considers these two factors simultaneously.
We use OP to measure the representativeness of a band subset. Specifically, considering that the BS process drops most of the original bands, we want the selected bands to preserve as much of the information of the whole dataset as possible. Therefore, for a band subset, we orthogonally project all the remaining bands (i.e., all the bands not included in this subset) onto the hyperplane spanned by the bands of the subset; the sum of their distances to the hyperplane can then be used to measure the representativeness of this band subset. Suppose that the total dataset is $D \in \mathbb{R}^{N \times L}$, where $N$ and $L$ represent the numbers of pixels (samples) and total bands (features). Assume that we want to select $n$ bands out of the total bands, and a candidate subset of $D$ is denoted as $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{N \times n}$; the corresponding remaining band subset is denoted as $Y = [y_1, y_2, \ldots, y_{L-n}] \in \mathbb{R}^{N \times (L-n)}$. Obviously, we have the relationship $D = X \cup Y$, and the representativeness of $X$ is computed by
$$S_{rp}(X) = \sum_{i=1}^{L-n} \| y_i - \hat{y}_i \|^2, \tag{7}$$
where $S_{rp}(X)$ denotes the representativeness of $X$ relative to $Y$, and the term $\hat{y}_i$ is the OP of $y_i$ onto the hyperplane spanned by $X$.
According to the analysis of Section 2, the term $S_{rp}(X)$ can be explained as the difficulty of using the bands of $X$ to represent the bands of $Y$; thus, the larger $S_{rp}(X)$ is, the lower the representativeness of $X$. For instance, Figure 2 shows an intuitive example. If all the bands' distances to the hyperplane equal zero, any band of $Y$ can be linearly represented by using the bands of $X$; in this case, the bands that are not included in $X$ can be abandoned because they are totally redundant (in fact, this situation almost never occurs, because the hyperspectral dataset $D \in \mathbb{R}^{N \times L}$ is usually a matrix of rank $L$). Therefore, we can consider that a subset with small $S_{rp}(X)$ is highly representative.
Figure 2. A 3-D example illustrating the rationale of Equation (7). The round marks denote the bands of the remaining subset Y, and the hyperplane is spanned by X.
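The representativeness metric described above can be sketched directly from its definition. This is a minimal illustration, assuming the plain sum of squared residuals without any normalizing constant; the function name `representativeness` and the synthetic four-band dataset are hypothetical.

```python
import numpy as np

def representativeness(D, selected):
    """Sum of squared OP residuals of the unselected bands (smaller = more representative)."""
    mask = np.ones(D.shape[1], dtype=bool)
    mask[selected] = False
    X, Y = D[:, selected], D[:, mask]              # candidate subset and remaining bands
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # least-squares weights for every remaining band
    return float(np.sum((Y - X @ coef) ** 2))

rng = np.random.default_rng(0)
b0, b1, b3 = rng.normal(size=(3, 50))
D = np.column_stack([b0, b1, 2 * b0 - b1, b3])     # band 2 is a linear mix of bands 0 and 1

srp_good = representativeness(D, [0, 1, 3])  # only band 2 remains, and it lies in span(X)
srp_poor = representativeness(D, [0, 3])     # bands 1 and 2 cannot be represented exactly
```

In this toy case the subset `[0, 1, 3]` reproduces the only remaining band exactly, so its value is essentially zero, whereas `[0, 3]` leaves a large residual.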
On the other hand, the redundancy among the selected bands should also be considered in band selection. Hyperspectral bands usually have significant correlation with each other, so if the selected bands are highly correlated, the resulting redundancy means that they cannot provide sufficient useful information for further applications. Furthermore, using only the metric (7) carries the risk that the selected bands are similar to each other, because if one band in $X$ is highly representative, its neighboring bands may also be highly representative. Therefore, our proposed selection criterion further takes into account the redundancy among the selected bands through an explicit redundancy metric. In this paper, we compute the average of the Pearson correlation coefficients of the bands in a band set to measure its redundancy. For instance, for the band subset $X$, the redundancy $S_{rd}(X)$ is computed by (8) as the average of the correlation coefficients $c_{i,j}$ between the bands $x_i$ and $x_j$ of $X$,
$$c_{i,j} = \frac{E\big[(x_i - \mu_i)(x_j - \mu_j)\big]}{\sigma_i \sigma_j},$$
where the expectation is taken over the pixels, $\mu_i$ and $\mu_j$ denote the means of the bands $x_i$ and $x_j$, and $\sigma_i$ and $\sigma_j$ denote their standard deviations, respectively. Obviously, when $S_{rd}(X)$ is large, the bands of $X$ are highly correlated, and thus the redundancy of $X$ is high. In practice, repeatedly computing $c_{i,j}$ for different subsets is inefficient; instead, we can construct the correlation coefficient matrix of all the bands of $D$ before BS, and then any band pair's correlation coefficient can be conveniently acquired from this matrix.
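The lookup strategy described above, in which the correlation matrix of all bands is computed once and then indexed for each candidate subset, might look as follows. This is only a sketch: averaging the signed coefficients over distinct pairs $i < j$ is an assumption about how the average is taken, and the name `redundancy` and the synthetic data are hypothetical.

```python
import numpy as np

def redundancy(C, selected):
    """Mean Pearson correlation over distinct band pairs, read from the precomputed matrix C."""
    sub = C[np.ix_(selected, selected)]
    upper = np.triu_indices(len(selected), k=1)   # index pairs with i < j (assumed averaging set)
    return float(sub[upper].mean())

rng = np.random.default_rng(1)
base = rng.normal(size=200)
D = np.column_stack([base, 2.0 * base + 1.0, rng.normal(size=200), rng.normal(size=200)])
C = np.corrcoef(D, rowvar=False)   # L x L correlation matrix, computed once before band selection

rd_high = redundancy(C, [0, 1])    # band 1 is an affine copy of band 0
rd_low = redundancy(C, [0, 2, 3])  # independently generated bands
```

Because `C` is computed once, evaluating a new subset costs only an index operation on an $n \times n$ block rather than a pass over all $N$ pixels.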
The two metrics for constructing the selection criterion have now been introduced. Since the objective is to find the band subset with the maximum representativeness and the minimum redundancy, we should minimize both $S_{rp}(X)$ and $S_{rd}(X)$ as much as possible. Therefore, the MRMR selection criterion combines the two metrics into a single score $S(X)$ of the band subset $X$, given by (9), in which $\lambda$ is a nonnegative real number that controls the relative effects of the two metrics. The value of $\lambda$ can be set adaptively according to the value of $S_{rp}(X)$, as will be introduced in Section 3.4. The larger the score, the more important the band subset, so our objective is to find the band subset with the maximum score, i.e.,
$$X^{*} = \arg\max_{X} S(X). \tag{10}$$
Then, we need a subset searching method to search over the candidate subsets for the one with the largest score.

Subset Searching Strategy
When dealing with hyperspectral datasets, exhaustive strategies cannot be used because there is a huge number of feasible band combinations. In this case, many suboptimal searching methods such as greedy methods and evolutionary methods have been widely used in band selection [33,[35][36][37][38]. In greedy methods, the desired bands are obtained gradually; these methods include sequential forward search (SFS) [33], sequential backward search (SBS) [33], beam search [39] and so on. In evolutionary methods, the desired bands are obtained simultaneously; these methods include the genetic algorithm [37], immune clone selection (ICS) [31], particle swarm optimization (PSO) [40] and so on. Generally, greedy methods like SFS are sensitive to the initial feature set and they treat the candidate bands individually. Considering that our proposed method needs to compute the scores of band subsets, greedy methods cannot be applied. Among the commonly used evolutionary methods, ICS is chosen as the searching strategy because it is easy to implement and has a satisfactory performance. It should be pointed out that although we use ICS in this paper, other group searching methods like PSO can also be combined with the proposed MRMR selection criterion.
Immune clone selection is motivated by immunology and is a typical paradigm of artificial immune systems [31]. In the biological immune system, when a new type of antigen invades, the organism can perform immune clonal multiplication to evolve high-affinity antibodies for defense [31]. This process mainly involves three procedures, i.e., clone, mutation and selection. Correspondingly, ICS selects the desired antibody through these three operators. In this paper, an antibody denotes a candidate subset, and some candidate subsets are chosen to construct the antibody population $\mathcal{X} = \{X_1, X_2, \ldots, X_m\}$, where $m$ is heuristically set to 10. It should be noted that the bands of each initial antibody $X_i$ are not chosen purely at random from the total bands. Instead, if $X_i$ contains $n$ bands, we divide all the bands evenly into $n$ groups according to their band indices, and then randomly choose one band from each group to construct an initial candidate subset; this process is repeated $m$ times to acquire the initial antibody population $\mathcal{X}$. This initialization may help the ICS algorithm find a satisfactory subset in a shorter time. Once the initial antibody population $\mathcal{X}$ is obtained, it undergoes the following procedures:
$$\mathcal{X}(t) \xrightarrow{T_C} \mathcal{X}'(t) \xrightarrow{T_M} \mathcal{X}''(t) \xrightarrow{T_S} \mathcal{X}(t+1), \tag{11}$$
where $T_C$, $T_M$ and $T_S$ respectively denote the clone, mutation and selection operators; $\mathcal{X}'(t)$, $\mathcal{X}''(t)$ and $\mathcal{X}(t+1)$ are the associated evolved antibody populations.
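The grouped initialization described above can be sketched as follows. The helper name `init_population` is hypothetical, and splitting the band indices into $n$ contiguous, nearly equal groups is an assumption about how the even division is carried out.

```python
import numpy as np

def init_population(L, n, m=10, rng=None):
    """Build m initial antibodies, each with one random band from each of n index groups."""
    rng = np.random.default_rng() if rng is None else rng
    groups = np.array_split(np.arange(L), n)   # n contiguous groups of near-equal size (assumed)
    return [np.array([rng.choice(g) for g in groups]) for _ in range(m)]

# Example: 200 bands, 20 bands per antibody, population size m = 10.
population = init_population(L=200, n=20, m=10, rng=np.random.default_rng(0))
```

Each antibody produced this way spreads its bands over the whole spectral range, which avoids initial subsets made up of adjacent, highly correlated bands.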
In the clone stage $T_C$, antibodies conduct self-replication, and the clone number of each antibody is determined by its affinity [31]. The affinity of the antibody (candidate band subset) $X_i$ is computed by (12) from $S(X_i)$, the score of the subset given by (9). The clone number $N_C(X_i)$ of $X_i$ is then computed from its affinity by (13), in which Round(·) is a rounding-up function.
In the mutation stage $T_M$, mutation enriches the diversity of antibodies. We randomly choose some elements from each copied antibody and replace them with the same number of other candidate bands. Note that the candidate bands for an antibody refer to all the bands that are not included in this antibody. For a copy of the antibody $X_i$, we set the mutation number $N_M(X_i)$ to be a random number ranging from 1 to $\min[N_C(X_i), n]$, where $N_C(X_i)$ and $n$ represent the clone number of the parent antibody $X_i$ and the number of bands in each antibody, respectively. Obviously, the number of mutated bands is also related to the antibodies' affinities.
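A minimal sketch of the mutation operator described above is given below. The function name `mutate` and its arguments are illustrative assumptions; the clone number is passed in as an input because its formula is not reproduced here.

```python
import numpy as np

def mutate(antibody, L, n_clones, rng):
    """Replace a random number of bands in a copied antibody with bands it does not contain."""
    n = len(antibody)
    n_mut = int(rng.integers(1, min(n_clones, n) + 1))       # random in [1, min(N_C, n)]
    candidates = np.setdiff1d(np.arange(L), antibody)         # bands not in this antibody
    positions = rng.choice(n, size=n_mut, replace=False)      # which elements to replace
    child = antibody.copy()
    child[positions] = rng.choice(candidates, size=n_mut, replace=False)
    return child

rng = np.random.default_rng(0)
parent = np.array([3, 17, 42, 60, 88])
child = mutate(parent, L=100, n_clones=3, rng=rng)
changed = int(np.sum(child != parent))   # number of bands that were swapped
```

Drawing the replacements only from the candidate bands guarantees that the mutated antibody still contains $n$ distinct bands.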
Then, in the selection stage $T_S$, we preserve the antibodies with the highest affinities as the new parent antibody cells [31]. The number of preserved antibodies is also equal to $m$. These three procedures are repeated until the relative change rate of the largest score $S$ during the last $N_{step}$ steps falls below a predefined tolerance $\tau$ [31]. In this paper, $N_{step}$ and $\tau$ are set to 50 and $10^{-4}$, respectively. In the end, ICS finds a subset with a satisfactory score, and this subset is our final selected band set.
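The stopping rule above can be sketched as a check on the history of best scores. The exact definition of the relative change rate is not spelled out in the text, so the version below, which compares the current best score with the one recorded $N_{step}$ iterations earlier, is an assumption; `should_stop` is a hypothetical helper name.

```python
def should_stop(best_scores, n_step=50, tol=1e-4):
    """Return True when the best score changed by less than tol (relatively) over n_step steps."""
    if len(best_scores) <= n_step:
        return False                       # not enough history yet
    old, new = best_scores[-n_step - 1], best_scores[-1]
    # Assumed definition of the relative change; guarded against division by zero.
    return abs(new - old) / max(abs(old), 1e-300) < tol

flat_history = [1.0] * 51                             # no improvement for 50 steps
rising_history = [1.0 + 0.01 * i for i in range(60)]  # still improving

stop_flat = should_stop(flat_history)
stop_rising = should_stop(rising_history)
```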

Adaptive Determination of λ
For the selection criterion shown in (9), we need to set a suitable value for $\lambda$ to control the effects of representativeness and redundancy. Generally, the value of the first term $S_{rp}$ is quite small, e.g., about $10^{-4}$, whereas the value of $S_{rd}$ is much larger, e.g., about 0.5. Since $S_{rd}$ is usually much larger than $S_{rp}$, we should set $\lambda$ to a quite small value to limit the influence of $S_{rd}$. In this paper, the value of $\lambda$ is set adaptively according to the value of $S_{rp}$. Specifically, during the ICS, the value of $\lambda$ is set according to the minimum $S_{rp}$ of the antibodies in the previous generation (after clone, mutation and selection have been conducted, a new generation of the antibody population is generated). For instance, denote the minimum $S_{rp}$ in the previous generation as min_Srp; then $\lambda$ can be set as follows:
$$\lambda = \beta \cdot \mathrm{min\_Srp}, \tag{14}$$
where $\beta$ is another parameter. We can influence the value of $\lambda$ by changing $\beta$. In this paper, $\beta$ is set to 0.5 by default; therefore, we have $\lambda = 0.5 \cdot \mathrm{min\_Srp}$ (the initial min_Srp is set to $10^{-5}$). The key idea of (14) is to set $\lambda$ to a value that is close to $S_{rp}$, so that the two terms in (9) have similar effects on the antibodies' scores.
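The adaptive rule $\lambda = \beta \cdot \mathrm{min\_Srp}$ can be illustrated with a short sketch. The function name `adaptive_lambda` and the example values are hypothetical; the default $\beta = 0.5$ and the initial min_Srp of $10^{-5}$ follow the settings stated above.

```python
def adaptive_lambda(prev_srp_values, beta=0.5, initial_min_srp=1e-5):
    """Set lambda from the smallest S_rp of the previous generation (or the initial value)."""
    min_srp = min(prev_srp_values) if prev_srp_values else initial_min_srp
    return beta * min_srp

lam_first = adaptive_lambda([])                   # first generation: no history yet
lam_next = adaptive_lambda([3e-4, 1e-4, 2e-4])    # illustrative S_rp values of one generation
weighted_rd = lam_next * 0.5                      # lambda times an illustrative S_rd of 0.5
```

With these illustrative numbers, the weighted redundancy term is of the order of $10^{-5}$, which is on a scale comparable to $S_{rp}$ rather than thousands of times larger, as an unweighted $S_{rd}$ would be.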

Accelerating Tricks for Computing $S_{rp}$
Another problem of the proposed method is the heavy computational burden of computing $S_{rp}(X)$. According to (7), the computation of $\hat{y}_i$ is expensive. For instance, for one candidate subset $X$, the computational complexity of computing $\hat{y}_i$ is about $O(nN^2)$, so the complexity of computing $S_{rp}(X)$ is about $O(nLN^2)$. A hyperspectral image usually has hundreds of bands ($L$), of which only tens ($n$) are to be selected, whereas the pixel number $N$ is often larger than $10^5$. Considering that there are thousands of candidate subsets to be tested, the total cost is too high. Therefore, we introduce two tricks to reduce the computational complexity of (7).
The first way is to compute the Gram matrix of all the bands in $D$; then (7) can be computed simply by acquiring elements from the Gram matrix. As before, the raw dataset $D \in \mathbb{R}^{N \times L}$ is split into two portions: $X \in \mathbb{R}^{N \times n}$ and $Y \in \mathbb{R}^{N \times (L-n)}$. According to (7), the OP of the band $y_i$ can be obtained by
$$\hat{y}_i = X (X^T X)^{-1} X^T y_i. \tag{15}$$
For convenience, the term $X (X^T X)^{-1} X^T$ is denoted as $P$, and it is worth noting that $P$ is symmetric and idempotent, i.e.,
$$P = P^T, \tag{16}$$
$$P^2 = P. \tag{17}$$
Then, the term $\| y_i - \hat{y}_i \|^2$ in (7) equals
$$\| y_i - \hat{y}_i \|^2 = (y_i - P y_i)^T (y_i - P y_i) = y_i^T y_i - 2 y_i^T P y_i + y_i^T P^T P y_i = y_i^T y_i - y_i^T P y_i. \tag{18}$$
The first term $y_i^T y_i$ is exactly the squared norm of the band $y_i$ and is one of the diagonal entries of the Gram matrix of $D$, i.e., $D^T D$ [34]. As for the second term $y_i^T P y_i$, it can be further written as follows:
$$y_i^T P y_i = y_i^T X (X^T X)^{-1} X^T y_i = (X^T y_i)^T (X^T X)^{-1} (X^T y_i). \tag{19}$$
Obviously, (19) demonstrates that $y_i^T P y_i$ is also related to the Gram matrix $D^T D$. All the entries of $X^T X$ and $X^T y_i$ can be acquired from the matrix $D^T D$, since both $X$ and $Y$ are subsets of $D$. Therefore, we can rewrite (7) as follows:
$$S_{rp}(X) = \sum_{i=1}^{L-n} \left[ y_i^T y_i - (X^T y_i)^T (X^T X)^{-1} (X^T y_i) \right], \tag{20}$$
where all the terms, i.e., $y_i^T y_i$, $X^T y_i$ and $X^T X$, can be directly acquired from $D^T D$; thus the computation of $S_{rp}(X)$ is simplified significantly. In this way, the complexity of computing $S_{rp}(X)$ is only about $O(n^3 L)$, which is much smaller than the original complexity of $O(nLN^2)$.
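The Gram-matrix shortcut can be checked numerically against the direct computation. The sketch below is an illustration under stated assumptions: the names `srp_direct` and `srp_from_gram` are hypothetical, $S_{rp}$ is taken as the plain sum of squared residuals, and a linear solve is used in place of an explicit inverse of $X^T X$.

```python
import numpy as np

def srp_direct(D, selected):
    """Reference value: project the remaining bands onto span(X) using all N pixels."""
    mask = np.ones(D.shape[1], dtype=bool)
    mask[selected] = False
    X, Y = D[:, selected], D[:, mask]
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return float(np.sum((Y - X @ coef) ** 2))

def srp_from_gram(G, selected):
    """Same quantity using only entries of the Gram matrix G = D^T D."""
    mask = np.ones(G.shape[0], dtype=bool)
    mask[selected] = False
    remaining = np.flatnonzero(mask)
    Gxx = G[np.ix_(selected, selected)]      # X^T X
    Gxy = G[np.ix_(selected, remaining)]     # column i holds X^T y_i
    yy = np.diag(G)[remaining]               # y_i^T y_i
    # (X^T y_i)^T (X^T X)^{-1} (X^T y_i) for every remaining band at once.
    quad = np.sum(Gxy * np.linalg.solve(Gxx, Gxy), axis=0)
    return float(np.sum(yy - quad))

rng = np.random.default_rng(2)
D = rng.normal(size=(500, 12))
G = D.T @ D                                  # computed once, O(N L^2)
selected = [1, 4, 7]
direct = srp_direct(D, selected)
fast = srp_from_gram(G, selected)
```

After `G` has been formed, `srp_from_gram` never touches the $N$-dimensional band vectors, which is where the saving for thousands of candidate subsets comes from.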
The second way is to use the singular value decomposition (SVD) to map the original high-dimensional bands into a low-dimensional space. Specifically, $S_{rp}(X)$ is actually only related to the Gram matrix $D^T D$, so if we can reduce the dimensionality of each band through some mapping that does not change the Gram matrix $D^T D$, the computational complexity will be reduced significantly. The dataset $D \in \mathbb{R}^{N \times L}$, where $N$ and $L$ are the numbers of pixels and total bands, can be decomposed according to SVD, i.e.,
$$D = U \Sigma V^T, \tag{21}$$
where $U$ is an $N \times N$ real or complex unitary matrix, $\Sigma$ is an $N \times L$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $L \times L$ real or complex unitary matrix. Substituting (21) into $D^T D$ yields
$$D^T D = (U \Sigma V^T)^T (U \Sigma V^T) = V \Sigma^T U^T U \Sigma V^T = (\Sigma V^T)^T (\Sigma V^T), \tag{22}$$
which demonstrates that we can use $\Sigma V^T$ to replace the original $D$. Interestingly, we only need the first $L$ rows of $\Sigma$ to compute $\Sigma V^T$, because the remaining $N - L$ rows of $\Sigma$ are all zero vectors. Therefore, the dimensionality of $\Sigma V^T$ is actually reduced to $L \times L$, which means that the bands of $D$ have been mapped into an $L$-dimensional space. Then we can use the mapped dataset $\tilde{D} = \Sigma V^T \in \mathbb{R}^{L \times L}$ to compute $S_{rp}(X)$. In practice, it is unnecessary to conduct the full SVD of the matrix $D$, which includes a full unitary decomposition of the null space of the matrix. Instead, we can compute a reduced version of the SVD named the thin SVD. Since $D$ is an $N \times L$ matrix of rank $L$, the thin SVD only calculates the $L$ columns of $U$ corresponding to the row vectors of $V^T$, and the remaining column vectors of $U$ are not calculated, i.e.,
$$D = U_L \Sigma_L V^T, \tag{23}$$
where $U_L \in \mathbb{R}^{N \times L}$ contains the first $L$ columns of $U$ and $\Sigma_L \in \mathbb{R}^{L \times L}$ contains the first $L$ rows of $\Sigma$. The thin SVD is significantly quicker and more economical than the full SVD because $N$ is much larger than $L$ for hyperspectral datasets. Therefore, to simplify the calculation of (7), we can use the thin SVD in (23) to obtain the mapped dataset $\tilde{D} = \Sigma_L V^T \in \mathbb{R}^{L \times L}$; the subsets $X$ and $Y$ are then also mapped into the $L$-dimensional space, and the computational complexity of $S_{rp}(X)$ is reduced to $O(nL^3)$, which is also much smaller than the original complexity of $O(nLN^2)$.
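The claim that the thin SVD maps the bands into an $L$-dimensional space without changing the Gram matrix can be verified with a few lines of NumPy. The variable names and the random test matrix are illustrative assumptions; `numpy.linalg.svd` with `full_matrices=False` is used as the thin SVD.

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 2000, 8
D = rng.normal(size=(N, L))                        # stand-in for an N x L hyperspectral matrix

# Thin SVD: U is N x L, s holds the L singular values, Vt is L x L.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
D_mapped = s[:, None] * Vt                         # Sigma V^T, an L x L mapped dataset

# The mapped bands (columns of D_mapped) have the same Gram matrix as the original bands.
gram_preserved = np.allclose(D_mapped.T @ D_mapped, D.T @ D)
```

Column $j$ of `D_mapped` plays the role of band $j$, so any subset of columns can be projected in the $L$-dimensional space and gives the same squared residuals as in the original $N$-dimensional space.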
We have introduced two ways to reduce the computational complexity of computing $S_{rp}(X)$. The first is to compute the Gram matrix $D^T D$ and acquire elements from $D^T D$ to compute $S_{rp}(X)$ (using (20)). The second is to perform the thin SVD of the matrix $D$ to map it into a low-dimensional space, and then use the mapped dataset for computing $S_{rp}(X)$ (using (7)). Both ways reduce the computational complexity of computing $S_{rp}(X)$ significantly. It is worth noting that the first way computes $D^T D$ only once and, likewise, the second way performs the thin SVD only once; each of these preprocessing steps has a complexity of about $O(NL^2)$. Because the first way is a little more efficient, we use it to simplify the calculation in this paper.

The Number of Selected Bands
Another issue of band selection is to determine the number of bands to be selected, which is a challenging problem for unsupervised band selection in practice. In most cases, the number of selected bands is determined manually by users, and it is also reasonable to set the number of selected bands to a value that is close to the number of classes in the dataset [19,41]. Generally, the number of classes can be determined by using the virtual dimensionality (VD) estimation approach proposed in [41], but this leads to an additional computational burden, and the class number is sometimes not well estimated since choosing suitable values for the parameters in VD is also difficult. Finally, the basic procedures of the proposed method are shown in Algorithm 1, where the number of selected bands $n$ is set manually by users or determined by the estimated number of classes in the dataset.

Algorithm 1 The MRMR Algorithm
Input: Observations $D \in \mathbb{R}^{N \times L}$, the number of selected bands $n$.
Initialize: $m$, $N_{step}$, $\tau$ and min_Srp.
Step1: Compute the Gram matrix $G = D^T D$, then use it to compute the subsets' representativeness $S_{rp}$ (using (20)) in the following processes.
Step2: Compute the correlation coefficient matrix of $D$, then use it to compute the subsets' redundancy $S_{rd}$ (using (8)) in the following processes.
Step4: while the stop criterion is not met do
    1: Copy the antibodies according to their affinities.
    2: According to the clone selection strategy, randomly select some bands from each copied antibody and replace them with other candidate bands.
    3: Select the $m$ antibodies that have the highest affinities to construct the new antibody population.
end while
Step5: The antibody that has the largest affinity is regarded as the final selected band subset.
Output: $n$ selected bands.
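The clone, mutate and select loop of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the affinity function is a hypothetical stand-in (negative normalized projection residual minus a weight `lam` times the mean absolute correlation), the stop criterion is a fixed number of iterations, each clone has exactly one band replaced, and the names `make_affinity`, `clone_search`, `lam` and `clones` are invented for this example.

```python
import numpy as np

def make_affinity(D, lam=1.0):
    """Build a hypothetical affinity: higher means more representative and
    less redundant. The paper's actual criterion is not reproduced here."""
    G = D.T @ D                          # Gram matrix, computed once
    C = np.corrcoef(D, rowvar=False)     # band correlation matrix, computed once
    L = D.shape[1]
    total = np.trace(G)

    def affinity(subset):
        idx = sorted(subset)
        rest = [j for j in range(L) if j not in subset]
        Gxx, Gxy = G[np.ix_(idx, idx)], G[np.ix_(idx, rest)]
        sol = np.linalg.lstsq(Gxx, Gxy, rcond=None)[0]
        residual = np.sum(np.diag(G)[rest] - np.sum(Gxy * sol, axis=0)) / total
        iu = np.triu_indices(len(idx), k=1)
        redundancy = np.mean(np.abs(C[np.ix_(idx, idx)][iu]))
        return -(residual + lam * redundancy)

    return affinity

def clone_search(D, n, m=10, n_steps=30, clones=5, lam=1.0, seed=0):
    rng = np.random.default_rng(seed)
    L = D.shape[1]
    aff = make_affinity(D, lam)
    # Initial population: m random subsets of n bands (assumed initialization).
    pop = [frozenset(rng.choice(L, n, replace=False).tolist()) for _ in range(m)]
    history = [max(aff(a) for a in pop)]
    for _ in range(n_steps):             # assumed stop criterion: fixed budget
        pool = set(pop)                  # parents stay in the pool (elitism)
        ranked = sorted(pop, key=aff, reverse=True)
        for rank, antibody in enumerate(ranked):
            # Antibodies with higher affinity receive more copies.
            for _ in range(max(1, clones - rank)):
                child = set(antibody)
                candidates = [j for j in range(L) if j not in child]
                child.remove(int(rng.choice(sorted(child))))
                child.add(int(rng.choice(candidates)))
                pool.add(frozenset(child))
        pop = sorted(pool, key=aff, reverse=True)[:m]
        history.append(aff(pop[0]))
    return sorted(pop[0]), history

# Synthetic data: 5 groups of 4 strongly correlated bands (assumed example).
rng = np.random.default_rng(1)
base = rng.standard_normal((500, 5))
D = np.repeat(base, 4, axis=1) + 0.1 * rng.standard_normal((500, 20))
bands, history = clone_search(D, n=4)
print(bands, history[0], history[-1])
```

Because the parents are kept in the candidate pool, the best affinity never decreases from one iteration to the next in this sketch.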

Experiments
To assess the effectiveness of the proposed method, comparative tests are conducted. Three different hyperspectral datasets and five different unsupervised BS methods are used in our experiments. The competitor methods include maximum-variance PCA (MVPCA) [13], LCMV band correlation constraint (LCMVBCC) [10], LCMV band correlation minimization (LCMVBCM) [10], exemplar component analysis (ECA) [23], and orthogonal-projection-based BS (OPBS) [19]. We compare these methods in terms of three aspects, i.e., pixel classification accuracy, band correlation and computing time. Two different classifiers, i.e., the support vector machine (SVM) [42] and K-nearest neighbor (KNN) [43], are used for pixel classification in our experiments. For the KNN classifier, the number of neighbors is set to 3. For the SVM classifier, the Gaussian radial basis function (RBF) is used as the kernel function, the parameters of the SVM are set by grid search and cross validation, and the one-against-all scheme [44] is used for multi-class classification.

Indian Pines Dataset
The first hyperspectral image is the Indian Pines dataset, which has 145×145 pixels and 220 bands with a wavelength range from 400 to 2500 nm (Figure 3).

Classification Results
In the classification experiments, we first select some bands (i.e., features) by using different BS methods, then randomly split the samples into training and testing sets, and finally conduct pixel classification. To reduce the effect of randomness, we repeat the experiments five times, and the average results of the five runs are shown in Figure 4. Figure 4 shows the overall classification accuracies obtained with different numbers of selected bands, where the number of selected bands ranges from 2 to 50. Additionally, in Figures 5 and 6, we provide the classification maps obtained with the fifteen bands selected by different BS methods. It can be seen from these results that the MRMR method shows the best overall classification performance among all the BS methods we used. Specifically, we can see from Figure 4 that the classification accuracies of all the BS methods increase as the number of selected bands increases. When using the SVM classifier (Figure 4a), the MRMR method obtains the best overall performance, followed by ECA, OPBS, LCMVBCM and others. MRMR always outperforms the other competitors, and it obtains a significant gain in classification accuracy over the other methods. For instance, in most cases, the accuracy of MRMR is about 4% higher than that of the second best method, i.e., ECA. As for the KNN classifier (Figure 4b), the MRMR method likewise obtains the best overall classification results, followed by ECA, OPBS and others. Here, the classification accuracy of the proposed method is still about 3% higher than that of ECA.
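The evaluation protocol described above (select bands, split randomly, classify, average over five runs) can be sketched with a small 3-nearest-neighbor classifier. This is a hedged illustration on synthetic data, not the code used in the experiments: the 10% training fraction is the value stated for the Pavia University dataset and is assumed here, and `knn_predict`, `mean_overall_accuracy` and the toy data are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-nearest-neighbor majority vote with Euclidean distance."""
    d2 = (np.sum(X_test ** 2, axis=1)[:, None]
          - 2.0 * X_test @ X_train.T
          + np.sum(X_train ** 2, axis=1)[None, :])
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]                        # shape (n_test, k)
    n_classes = int(y_train.max()) + 1
    counts = np.zeros((len(X_test), n_classes), dtype=int)
    for c in range(n_classes):
        counts[:, c] = np.sum(votes == c, axis=1)
    return counts.argmax(axis=1)

def mean_overall_accuracy(X, y, bands, train_frac=0.1, runs=5, seed=0):
    """Average overall accuracy over several random train/test splits,
    using only the selected bands as features."""
    rng = np.random.default_rng(seed)
    Xb = X[:, bands]
    accs = []
    for _ in range(runs):
        perm = rng.permutation(len(y))
        n_train = max(1, int(round(train_frac * len(y))))
        tr, te = perm[:n_train], perm[n_train:]
        pred = knn_predict(Xb[tr], y[tr], Xb[te], k=3)
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

# Toy data: 3 well-separated classes, 100 labelled pixels each, 6 bands.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 100)
X = 6.0 * y[:, None] + rng.standard_normal((300, 6))
oa = mean_overall_accuracy(X, y, bands=[0, 2, 5])
print(oa)
```

Repeating the same call with the band subsets returned by different BS methods, and with different subset sizes, gives accuracy curves of the kind plotted in Figure 4.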
Figures 5 and 6 show the classification maps obtained with the fifteen bands selected by different methods. The classification results of MRMR are much better than those of the other five methods, which further supports the observations from Figure 4. Furthermore, Table 1 lists the overall accuracy (OA) and average accuracy (AA) of classification. Overall accuracy is the ratio of correctly classified samples to total samples, and average accuracy is the mean of the per-class accuracies. We can see from Table 1 that both the OAs and AAs of MRMR are much higher than those of the other methods, which further verifies that the proposed method is superior to them.
Therefore, the experimental results on the Indian Pines dataset demonstrate that the proposed method is an effective BS method and that its selected bands obtain much better classification performance than those of the competitors. Hence, this experiment verifies that the proposed method can select a band subset that represents the whole image dataset well and that the selected bands are informative for classification.
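The two accuracy measures can be computed from a confusion matrix, as in the following minimal sketch. The function name and the toy labels are hypothetical, and the code assumes that every class occurs at least once in the reference labels.

```python
import numpy as np

def overall_and_average_accuracy(y_true, y_pred, n_classes):
    """OA: correctly classified samples / all samples.
    AA: mean of the per-class accuracies."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)      # rows: true class, cols: predicted
    oa = np.trace(cm) / cm.sum()
    per_class = np.diag(cm) / cm.sum(axis=1)
    return float(oa), float(per_class.mean())

# Imbalanced toy example: 8 samples of class 0, 2 samples of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])
oa, aa = overall_and_average_accuracy(y_true, y_pred, n_classes=2)
print(oa, aa)  # 0.9 and 0.75
```

In this example a single error in the small class barely changes OA but lowers AA noticeably, which is why both measures are reported for datasets with unbalanced classes.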

Band Correlation Comparison
BS methods should also take the correlation among the selected bands into consideration, because high correlation among selected bands usually leads to considerable information redundancy and thus degrades pixel classification performance. In this section, we compare the average band correlation among the selected bands obtained by each BS method. The overall band correlation of a band subset is measured by the average of the correlation coefficients (ACC) of all the band pairs in the subset. The larger the ACC is, the higher the band correlation is.
The ACCs of the fifteen bands selected by different BS methods are listed in Table 2, from which we can see that the bands obtained by the MVPCA and LCMV-based methods are highly correlated, while those obtained by the other BS methods have much lower correlation. Among all the BS methods, the OPBS method selects the bands with the lowest correlation, because in each round it selects the band that is the most dissimilar (i.e., the least correlated) to the currently selected bands. The bands selected by MRMR also have low correlation, which is quite close to that of the bands selected by OPBS. This demonstrates that the selection criterion of MRMR takes the correlation among bands into account well.
Furthermore, Figure 7 shows 2D maps of the distribution of the bands, with the selected bands marked. In Figure 7, each curve denotes the spectrum of one category across a range of wavelengths, and the straight lines denote the selected bands. For each category, the average of its samples is used to represent the category. The 2D maps demonstrate that most of the bands selected by the MVPCA and LCMV-based methods are neighboring bands, while the ECA, OPBS and MRMR methods select far fewer neighboring bands. Neighboring hyperspectral bands are generally highly correlated with each other, so if a BS method selects many neighboring bands, the correlation among the selected bands will be high. This is exactly why the bands selected by the MVPCA and LCMV-based methods are so highly correlated. In practice, high correlation among selected bands leads to considerable redundancy, which degrades classification performance. For instance, we can observe from Tables 1 and 2 that bands with high correlation usually correspond to low classification accuracies, and that reducing the correlation among the selected bands helps improve the classification accuracies. Additionally, although ECA and OPBS select bands that have quite low correlation, their classification performance is not as good as that of the proposed MRMR method. This occurs because the bands selected by these two methods are less informative than those obtained by the MRMR method, which indicates that the selection criterion of MRMR is more effective for finding bands with good representativeness.
To sum up, a good band selection method should pay sufficient attention to the correlation among the selected bands, and the band correlation comparison has verified that the proposed method takes the band correlation into account and selects bands with low redundancy.
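The ACC of a band subset can be computed directly from the Pearson correlation matrix of the selected bands. The sketch below is illustrative only: the function name, the optional `use_abs` flag (whether absolute values are averaged is not stated in this section) and the synthetic data are assumptions.

```python
import numpy as np

def average_correlation(D, bands, use_abs=False):
    """Average Pearson correlation coefficient over all pairs of selected bands.
    D has one row per pixel and one column per band."""
    C = np.corrcoef(D[:, bands], rowvar=False)
    iu = np.triu_indices(len(bands), k=1)   # each unordered pair once
    vals = np.abs(C[iu]) if use_abs else C[iu]
    return float(vals.mean())

# Synthetic cube: bands 0-2 share one underlying signal (like neighboring
# bands), bands 3-5 are independent of everything else.
rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)
D = np.column_stack(
    [signal + 0.1 * rng.standard_normal(1000) for _ in range(3)]
    + [rng.standard_normal(1000) for _ in range(3)]
)

acc_neighbors = average_correlation(D, [0, 1, 2])
acc_spread = average_correlation(D, [0, 3, 5])
print(acc_neighbors, acc_spread)
```

The subset made of near-duplicate bands has an ACC close to 1, while the spread-out subset has an ACC close to 0, mirroring the contrast between methods that pick neighboring bands and methods that pick dispersed ones.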

Computing Time Comparison
The computing time of selecting 15 bands by different methods is also listed in Table 2, from which we can see that the MVPCA method runs the fastest, followed by OPBS, ECA, MRMR and the other methods. Although the MVPCA method is more computationally efficient than the proposed method, the MRMR method achieves much better classification accuracy, so it is acceptable that it costs a little more time. Among the methods other than MVPCA, the MRMR method takes a moderate amount of time, so it also has satisfactory computational efficiency and is able to find the desired bands in a reasonable time. It should be pointed out that, because the Indian Pines dataset is an image with a small number of pixels (N = 145 × 145 = 21,025), the acceleration effect of the tricks introduced in Section 3.4 is not very significant. For the following datasets of larger size, the advantage of the proposed method in computational efficiency becomes more significant.

Pavia University Image
The second image is the Pavia University dataset, which was acquired by the ROSIS-3 optical sensor (Germany). The dataset has 103 spectral bands and 610 × 340 pixels. There are nine classes in this image, and all of them are used in our experiments (Figure 8). For this dataset, we also randomly choose 10% of the pixels for training and use the rest for testing.

Classification Results
Likewise, different numbers of bands are selected from this dataset, ranging from 2 to 50. The average classification results of five runs are shown in Figure 9. Figures 10 and 11 show the classification maps obtained with the ten bands selected by different methods. It is evident that, for this dataset, the proposed method is also superior to the other competitors.
It can be observed from Figure 9 that the MRMR method performs the best, followed by the OPBS, ECA, MVPCA and LCMV-based methods. For the SVM classifier (Figure 9a), the MRMR method achieves the highest overall accuracies. When a small number of bands is selected, e.g., fewer than 12 bands, the accuracy of the MRMR method is about 3% higher than that of the second best method (i.e., OPBS). When more bands are selected, the accuracy of the OPBS method increases significantly and is sometimes slightly higher than that of MRMR. As for the KNN classifier, MRMR likewise achieves the highest classification accuracy, and OPBS also performs well. These two methods show a significant advantage over the remaining four methods. We also notice that when a large number of bands is selected, e.g., more than 25 bands, most BS methods obtain good classification performance. Since the major purpose of BS is to select a few informative bands to improve computational efficiency and ease the storage burden, achieving good classification performance with fewer bands is preferable. The proposed method achieves the best classification performance and shows a significant advantage over the other methods when quite few bands are selected (e.g., fewer than 15 bands), which indicates that the proposed method is valuable.
Furthermore, we can observe from Figures 10 and 11 that the classification maps of MRMR are the most accurate. Table 3 lists the OAs and AAs obtained with the ten bands selected by different methods. Similar to the results on the Indian Pines dataset, MRMR again achieves the highest OAs and AAs. These results further support the observations from Figure 9, and we can conclude that the MRMR method is superior to the others.

Band Correlation Comparison
Table 4 lists the average band correlation of the ten selected bands. For this dataset, the bands selected by the MVPCA and LCMV-based methods are again highly correlated, whereas the other BS methods select bands with lower correlation. Figure 12 shows that the MVPCA and LCMV-based methods select many neighboring bands, while the other three methods select fewer neighboring bands. It is worth noting that, although the bands selected by ECA and OPBS are not highly correlated, the distributions of their selected bands are more concentrated than that of the MRMR method. In other words, the distribution of the bands selected by MRMR is more dispersed. For instance, most bands selected by ECA are distributed among bands 1-10 and 65-85, and about one half of the bands selected by OPBS belong to bands 1-10, while the bands selected by MRMR are much more isolated. By observing the spectra of the categories, we find that, for ECA and OPBS, some selected bands such as bands 73 and 74 may be of little use for discriminating most categories in the dataset, and some bands such as bands 1-4 are similar to each other. This means that some bands selected by OPBS and ECA may be poorly representative (e.g., bands 73 and 74) or redundant (e.g., bands 1-4). In contrast, the distribution of the bands selected by MRMR is more dispersed, and each selected band is useful for discriminating categories, so we can intuitively conclude that the bands selected by MRMR are more useful for classification. Similar results can also be observed in Figure 7. Therefore, the classification results and the band correlation comparison indicate that the bands selected by the MRMR method are not only highly representative but also have low redundancy.

Computing Time Comparison
Table 4 also gives the computing time of selecting ten bands by different methods. For this dataset, MVPCA costs the shortest time, followed by MRMR, OPBS, and the other methods. The computing time of MRMR is only higher than that of MVPCA and is close to that of OPBS. This occurs because the Pavia University dataset has a large number of pixels (N = 610 × 340 = 207,400), so the accelerating effect of the tricks introduced in Section 3.4 becomes more significant. Compared with the results on the Indian Pines dataset, it can be concluded that the proposed method has a more significant advantage in computational efficiency when dealing with large-scale images. Therefore, the experimental results on this dataset further show that the MRMR method has satisfactory computational efficiency, especially when processing large-scale images.
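A rough back-of-the-envelope calculation illustrates why the acceleration matters more for larger images. The sketch below only uses the asymptotic expressions quoted earlier, $O(nLN^2)$ for the original computation and $O(nL^3)$ for the reduced one, and ignores all constant factors and the one-off $O(NL^2)$ preprocessing, so the numbers indicate orders of magnitude and are not predicted running times. The dictionary layout and variable names are hypothetical.

```python
# Image sizes (pixels N) and band counts (L) as stated for the two datasets.
sizes = {
    "Indian Pines": (145 * 145, 220),
    "Pavia University": (610 * 340, 103),
}

ratios = {}
for name, (N, L) in sizes.items():
    # O(n L N^2) / O(n L^3) = (N / L)^2, with constants ignored.
    ratios[name] = (N / L) ** 2
    print(f"{name}: N = {N}, L = {L}, nominal reduction factor = {ratios[name]:.3g}")
```

Under these assumptions the nominal reduction factor for the Pavia University image is several hundred times larger than for the Indian Pines image, which is consistent with the observation that the tricks pay off more on large-scale images.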

Salinas Dataset
The third image was also collected by the 224-band AVIRIS sensor, over Salinas Valley, California, and is characterized by a high spatial resolution (3.7-meter pixels) (Figure 13) [45]. The dataset has a medium size of 512 × 217 pixels, and the spectral range is from 370 to 2507 nm. In our experiments, all 16 classes in the Salinas dataset are used.

Classification Results
The classification results on this dataset are shown in Figures 14-16 and Table 5. It is evident that, for this dataset, the proposed method is also superior to the other competitors.
The classification accuracy curves in Figure 14 show that all methods perform well on this dataset, especially the MRMR, OPBS and ECA methods. Although OPBS and ECA perform quite well, the proposed method still obtains the best results in most cases. We also notice that, for this dataset, the proposed method obtains quite good classification performance when selecting a very limited number of bands (e.g., fewer than 5 bands). Furthermore, the results in Table 5 demonstrate that the proposed method obtains the highest OAs and AAs for both classifiers and, correspondingly, its classification maps are the most similar to the ground truth map among all the classification maps (Figures 15 and 16). Therefore, the classification results on the Salinas dataset further indicate that the proposed method is effective for finding bands that are informative for classification.

Band Correlation Comparison
Likewise, Table 6 lists the average band correlation of the fifteen selected bands. For this dataset, the bands selected by the MVPCA and LCMV-based methods are again highly correlated, whereas the other BS methods select bands with lower correlation. The bands selected by the proposed method have the lowest average correlation, and we can also see from Figure 17 that the MVPCA and LCMV-based methods select many neighboring bands, while the bands selected by the other methods are more dispersed. It is worth noting that, compared with the OPBS and ECA methods, the distribution of the bands selected by the proposed method is also more dispersed, and it can be intuitively seen that the bands selected by the proposed method are more reasonable.

Computing Time Comparison
The computing time of selecting fifteen bands by different methods is also listed in Table 6. The Salinas dataset has more pixels than the Indian Pines dataset, so the proposed method should show good computational efficiency. We can see that MVPCA costs the shortest time, followed by MRMR, OPBS, and the other methods. The computing time of MRMR is only higher than that of MVPCA and is slightly shorter than that of OPBS, which further indicates that the tricks introduced in Section 3.4 are effective. Therefore, the experimental results on this dataset also show that the MRMR method has satisfactory computational efficiency, especially when processing large-scale images.

Summary
Finally, some important results can be summarized from all the experiments. In unsupervised band selection, BS methods should evaluate the representativeness of the selected bands and the correlation among them jointly. The proposed method explicitly designs two metrics for evaluating these two factors and then combines them into an effective selection criterion. Experimental results have verified that the bands selected by the MRMR method are not only informative for pixel classification but also have low correlation. Among all the methods we used, the MRMR method shows the best classification performance, and it even outperforms state-of-the-art methods such as OPBS and ECA. The MRMR method is also clearly superior to similar methods, namely, the OPBS and LCMV-based methods, which demonstrates the effectiveness of the proposed selection criterion. Furthermore, considering that BS selects several bands to replace the whole dataset, it is preferable that BS methods select fewer bands while maintaining satisfactory classification performance. When selecting quite few bands, the MRMR method still obtains quite good classification performance, so this method is valuable. Finally, thanks to the accelerating tricks for computing the orthogonal projection, the MRMR method has satisfactory computational efficiency. In conclusion, the effectiveness of the proposed method has been verified.

Conclusions
In this paper, we proposed an unsupervised feature selection approach based on maximizing representativeness and minimizing redundancy to select important bands from hyperspectral images. The MRMR method aims to find the band subset that has the maximum representativeness and the minimum redundancy. The representativeness of a band subset is measured by the distances of the remaining bands to their orthogonal projections onto the hyperplane spanned by the bands of the subset. The redundancy of a band subset is measured by the average correlation coefficient of the bands in the subset. To find a subset with good representativeness and low redundancy, an effective evolutionary algorithm named immune clone selection (ICS) is applied as the search strategy. Moreover, to ensure that the proposed method can be used in practical applications, two useful tricks are introduced to accelerate the computation of the subsets' representativeness, either of which can be applied to reduce the computational burden of the MRMR method. The experimental results on three different datasets have verified that the proposed method is a highly effective BS method with satisfactory computational efficiency. Finally, our future research will seek other effective metrics for evaluating representativeness and redundancy in order to improve the performance of the proposed method.

Figure 1. An intuitive explanation of orthogonal projection. (a) The vector $x_0$ cannot be linearly represented by the vectors $x_1$ and $x_2$. (b) The vector $x_0$ can be linearly represented by the vectors $x_1$ and $x_2$.

Figure 3. Band 170 and the ground truth of the Indian Pines dataset. (a) Band 170. (b) The ground truth map (label 0 denotes background).

Figure 4. Overall classification accuracies of using the bands selected by different BS methods from the Indian Pines dataset. (a) SVM (b) KNN.

Figure 7. Spectra of the categories on the Indian Pines dataset. The straight lines denote the bands selected by different BS methods. (a) MVPCA (b) LCMVBCC (c) LCMVBCM (d) ECA (e) OPBS (f) MRMR.

Figure 8. Band 50 and the ground truth of the Pavia University dataset. (a) Band 50. (b) The ground truth (label 0 denotes background).

Figure 9. Overall classification accuracies of using the bands selected by different BS methods from the Pavia University dataset. (a) SVM (b) KNN.

Figure 12. Spectra of the categories on the Pavia University dataset. The straight lines denote the bands selected by different BS methods. (a) MVPCA (b) LCMVBCC (c) LCMVBCM (d) ECA (e) OPBS (f) MRMR.

Figure 13. Band 100 and the ground truth of the Salinas dataset. (a) Band 100. (b) The ground truth (label 0 denotes background).

Figure 14. Overall classification accuracies of using the bands selected by different BS methods from the Salinas dataset. (a) SVM (b) KNN.

Table 1. Overall Accuracies and Average Accuracies of Using the Fifteen Bands Selected from the Indian Pines Dataset. (The bold denotes the best result.)

Table 2. Average band correlation and computing time for selecting fifteen bands from the Indian Pines dataset.

Table 3. Overall Classification Accuracies and Average Classification Accuracies of Using the Ten Bands Selected from the Pavia University Dataset. (The bold denotes the best result.)

Table 4. Average band correlation and computing time for selecting ten bands from the Pavia University dataset.

Table 5. Overall Classification Accuracies and Average Classification Accuracies of Using the Fifteen Bands Selected from the Salinas Dataset. (The bold denotes the best result.)

Table 6. Average band correlation and computing time for selecting fifteen bands from the Salinas dataset.