Hyperspectral Dimensionality Reduction Based on Inter-Band Redundancy Analysis and Greedy Spectral Selection †

: Hyperspectral imaging systems are becoming widely used due to their increasing accessi-bility and their ability to provide detailed spectral responses based on hundreds of spectral bands. However, the resulting hyperspectral images (HSIs) come at the cost of increased storage requirements, increased computational time to process, and highly redundant data. Thus, dimensionality reduction techniques are necessary to decrease the number of spectral bands while retaining the most useful information. Our contribution is two-fold: First, we propose a ﬁlter-based method called interband redundancy analysis (IBRA) based on a collinearity analysis between a band and its neighbors. This analysis helps to remove redundant bands and dramatically reduces the search space. Second, we apply a wrapper-based approach called greedy spectral selection (GSS) to the results of IBRA to select bands based on their information entropy values and train a compact convolutional neural network to evaluate the performance of the current selection. We also propose a feature extraction framework that consists of two main steps: ﬁrst, it reduces the total number of bands using IBRA; then, it can use any feature extraction method to obtain the desired number of feature channels. We present classiﬁcation results obtained from our methods and compare them to other dimensionality reduction methods on three hyperspectral image datasets. Additionally, we used the original hyperspectral data cube to simulate the process of using actual ﬁlters in a multispectral imager.


Introduction
Optical remote sensing systems are predicated on analyzing the spatial and spectral information contained within the imagery they collect. Due to this ability, the applications of remote sensing are widespread, ranging from measurements for precision agriculture [1] to applications in forensic sciences [2]. In many of these applications, the spatial information carried in the image must be considered when classifying objects in a scene; however, the spectral information plays a crucial role in the process. The ability to maintain spatial detail while also generating several spectral channels has given rise to several different forms of passive remote sensing imaging systems, ranging from multispectral imaging (MSI) [3,4] to hyperspectral imaging systems (HSIs) [5,6]. As opposed to MSI systems, which capture several discrete spectral bands, HSI systems capture hundreds of contiguous spectral bands. The utility of each type of imaging system varies based on the goal of the work. For example, if spectral bands are known a priori, MSI systems are often advantageous as the specific spectral bands can be measured directly with a relatively simple system and data products. However, if the spectral bands for a certain application are unknown, HSI can become a powerful tool.
Hyperspectral imaging systems have the ability to capture good spatial resolution while also capturing rich spectral content. However, the spectrally dense data cubes captured by HSI systems often have large file sizes, high data density, and introduce additional computational complexity. Taken together, these factors present computational limitations when postprocessing imagery and storing data products. For example, the Hyperspectral Infrared Imager (HyspIRI) launched by the Jet Propulsion Laboratory (JPL) of the National Aeronautics and Space Administration (NASA) has an average continuous data rate of 65 Mb/s, which produces a daily data volume of 5.2 Tb [7]; therefore, the inherent computational cost makes fast onboard processing difficult.
In applications with smoothly varying spectral features, the additional complexity and cost introduced by HSI systems may be unnecessary. For example, plant stress is often estimated with the Normalized Difference Vegetation Index (NDVI) using a single red band and a single near-infrared band instead of an HSI system, as is the case for the Advanced Very High-Resolution Radiometer (AVHRR) instrument of the National Oceanic and Atmospheric Administration (NOAA) [8] or the drone-based MicaSense RedEdge-MX sensor [9]. Yet, for many applications, there is a need to resolve sharp spectral features or the salient wavelengths are not known a priori, meaning MSI systems would provide little value. Though HSI systems capture rich spectral data, the determination of the most relevant wavelengths in a data cube is a challenging task. The successful determination of salient wavelengths within a hyperspectral data cube would provide several distinct benefits. If using a band selection method, the processing and storage requirements can be relaxed as only relevant spectral bands are stored and processed [10]. When using feature extraction methods, all spectral bands are stored, but classification accuracy can be increased [11]. Regardless of which method is used, the identification of relevant spectral bands provides a means of developing multispectral imagers in place of hyperspectral imagers for a given classification task, with applications ranging from precision agriculture [12] to dermatology [13].
In this paper, we present a dimensionality reduction technique called interband redundancy analysis (IBRA), which removes redundant spectral bands, thus simplifying spectrally dense image data captured by a hyperspectral imager. IBRA is based on a recursive collinearity analysis between each spectral band and its neighbors, which allows for an approximation of the minimum number of bands we need to move away from a band to find spectral bands with sufficiently distinct information. Having calculated this distance metric for each spectral band, its distribution across the spectrum is used to cluster the spectrum into sets of similar and contiguous spectral bands. This allows us to identify a reduced set of independent bands that act as the centroid of their corresponding clusters in the spectrum.
Along with simplifying the hyperspectral data cubes, our IBRA method can be used as part of a novel two-step hybrid feature selection process. The first step applies IBRA to find a reduced set of independent representative bands; then, it applies our proposed wrapperbased method called greedy spectral selection (GSS) to select a user-defined number of spectral channels, as depicted in Figure 1. GSS ranks the candidate bands according to their information entropy to obtain an initial selection of bands, which is used to train a classifier that evaluates the effectiveness of the given selection of bands based on the achieved classification performance. Then, we remove from the selection the band that shows the most severe indication of multicollinearity and repeat the process by considering the next band with the highest information entropy, verifying if the classification performance improves. As such, GSS aims to maximize the discrimination between classes and preserve their spectral uniqueness. This band selection process is implemented as a part of achieving our goal of designing low-cost, high-accuracy MSI systems for various applications. That is, given a classification problem, our method allows us to select a user-defined number of salient wavelengths from an HSI dataset that can then be used for MSI while maintaining or improving the accuracy of the original HSI-based system. Furthermore, we explore the applicability of IBRA in combination with other dimensionality reduction methods besides the various feature selection methods. Specifically, we propose a feature extraction framework whose first step aims to remove the spectral redundancy by using IBRA. The second step consists of applying any desired dimensionality reduction method on the IBRA-selected set of spectral bands to obtain a new representation with the desired number of dimensions or feature channels. Thus, the structure of the proposed framework is similar to that shown in Figure 1, but including a dimensionality reduction method (e.g., principal component analysis) instead of the GSS block. In this work, we show experimental results for this framework considering both a supervised and an unsupervised feature extraction method for the second step. Our results show that our feature extraction framework yields better or competitive results in comparison to several alternative feature selection methods in the context of hyperspectral image classification.
This paper is an extension of work originally presented at the International Joint Conference on Neural Networks (virtual event, July 2021) [14]. Extensions include an extensive set of experiments that allow for a broader analysis of the behavior of the proposed band selection method. In particular, we include an additional hyperspectral image dataset for our experiments and comparisons to two recently published band selection methods. We also present the results of new experiments focused on testing performance using different dataset sizes (i.e., 100%, 75%, 50%, 25%) to evaluate the impact of data quantity on the effectiveness of our method. In our previous paper, we showed experiments selecting six and ten bands from a hyperspectral dataset used for herbicide resistance classification; here, we include results using eight bands as well in order to evaluate the tradeoff between the number of bands and classification performance more effectively and to explain the advantages of our proposed methods under different scenarios. Finally, we extend the use of our IBRA method to include feature extraction and show how a feature extraction method behaves in comparison to feature selection in hyperspectral image classification.
Our specific contributions are summarized as follows: 1.
We propose a filter-based method called interband redundancy analysis (IBRA) that works as a preselection method to remove redundant bands and reduce the search space dramatically; 2.
We present a two-step band selection method that first applies IBRA to obtain a reduced set of candidate bands and then selects the desired number of bands using a wrapper-based method called greedy spectral selection (GSS); 3.
We show that IBRA can be used as part of a more general two-step feature extraction framework where any dimensionality reduction method can be applied following IBRA to obtain the desired number of feature channels; 4.
Since one of the objectives of this work is to aid in the design of multispectral imaging systems based on the wavelengths recommended by a band selection method, we also present an extensive set of experiments that use the original hyperspectral data cube to enable simulating the process of using actual filters in a multispectral imager.
The remainder of this paper is structured as follows: In Section 2, we provide a brief review of previous related work performed with feature extraction and feature selection methods in the context of hyperspectral image classification. Section 3 provides further details about the hyperspectral image datasets used in this work, as well as how we preprocessed the data for our experiments. In this section, we also describe our IBRA and GSS algorithms and explain how they are used for dimensionality reduction. Section 5 discusses the experimental results that are presented in Section 4. Finally, Section 6 offers concluding remarks.

Related Work
Working with hyperspectral images often benefits from applying dimensionality reduction techniques as a preprocessing step in order to avoid unnecessarily high computational complexity and to reduce redundancy of the data. Dimensionality reduction techniques not only reduce the inherent computational cost and relax the need for advanced hardware requirements for processing the data, but they also combat the "curse of dimensionality" [15,16]; i.e., the fewer the training samples, the worse the performance of an HSI classifier when the dimensionality of HSI data (the number of features) increases [17]. Dimensionality reduction techniques rely on feature extraction or feature selection approaches; the former apply linear or nonlinear transformations to extract specific features from the original data, while the latter select the most useful subset of the features of the data without transforming them.
Feature extraction methods can be subdivided into two classes: unsupervised and supervised. Among unsupervised methods, principal component analysis (PCA) and its variants (e.g., folded-PCA and kernel PCA) are some of the most commonly used methods to remove spectral redundancy and reduce the dimensionality of the raw data [18]. Nevertheless, theoretical arguments and experimental evidence of the ineffectiveness of PCA for hyperspectral feature extraction have been presented [19]. According to this, PCA should not be used as a preprocessing step to solve small-sample-size problems as there is evidence that class separation deteriorates after PCA transformation.
Other unsupervised methods are based on independent component analysis (ICA) and its variants (e.g., kernel ICA and Random Fourier Feature-ICA) [20]. Alternatively, supervised methods such as Fisher's linear discriminant analysis (LDA) [21,22] or partial least squares discriminant analysis (PLS-DA) [23] are used widely for hyperspectral feature extraction. More recent feature extraction methods apply local manifold learning to find nonlinear embeddings that are used to project the data into lower-dimensional spaces. For instance, Hong et al. [24] proposed a robust local manifold representation that incorporates a hierarchical neighbor selection that aims to mitigate multicollinearity in the local manifold space. Moreover, the computed weights are designed to jointly embed spectral and spatial information. Feature extraction methods can also be applied with the objective of improving data quality. In that sense, Zhuang et al. [25] proposed a robust hyperspectral denoising technique that implements low-rank representations in order to project the data into lowerdimensional spaces while reducing the impact of noise.
On the other hand, feature selection methods select a subset of spectral bands without modifying the data or projecting it into a new basis. In other words, the aim of a hyperspectral band selection method is to identify which spectral wavelengths are most responsive or relevant for a particular classification task without modifying the data. Additionally, identifying a reduced subset of relevant spectral bands allows for a better understanding of the optical properties of the materials and provides information that is useful when designing cheaper task-specific multispectral imagers.
There exists four types of feature selection methods: filter-based, wrapper-based, hybrid, and embedded. Filter-based methods select bands based on various statistical tests that assess the correlation between each band and the outcome variable or among other bands [26,27]. Wrapper-based methods, on the other hand, rely on a learning model to evaluate the suitability of a given set of bands. Hence, the problem becomes a search problem, which is computationally more expensive than filter-based methods [28,29]. In order to avoid an exhaustive search, hybrid methods have been developed. They use a filterbased method as a first step to select some relevant bands; then, a wrapper-based method is used to make the final selection [16,30]. Finally, embedded methods use learning models that internally select the most relevant features during the training phase [31,32]. Below, we provide a brief review of previous related works corresponding to the aforementioned types of feature selection methods.
Given the potential benefits of feature selection, several methods have been proposed for hyperspectral image classification. In this context, some feature extraction methods can be adapted for filter-based feature selection. For example, Chang et al. [33] presented a joint band-prioritization method based on PCA where the spectral bands are ranked according to their variances. PLS-DA can also be adapted into a feature selection scheme using the resulting regression coefficients from full spectra to identify the most influential wavelengths [34,35].
Some filter-based methods analyze the statistics of the data and heuristically estimate a scoring index for each feature. For instance, [36] employed a filter feature-selection algorithm based on minimizing a tight bound on the Vapnik-Chervonenkis (VC) dimension [37]. Ranking-based methods are one of the most common filter-based feature selection methods; these methods estimate the importance of each spectral band using such metrics as the variance inflation factor (VIF) [38] in order to select the top-ranked bands. We will use the idea of calculating the VIF value to measure collinearity, but the spectral bands will not be ranked based on this simple measure alone. Another example of a ranking-based method is the fast density-peak-based clustering (FDPC) method proposed by Rodriguez et al. [39]; it ranks each band according to its ability to become a cluster center, which is measured using two factors: local density and distance from points of higher density. These factors are calculated using a similarity matrix based on Euclidean distance. Similarly, Xu et al. [40] constructed a similarity matrix based on structural similarity (SSIM) that is used to calculate two measures called average similarity and significant dissimilarity to evaluate the ability of each band to become a cluster center. Then, it ranks the bands according to the product of these two measures so that the top-ranked bands are selected.
Other methods estimate the relevance of each band as part of an optimization approach. For instance, Medjahed and Ouali [41] formulated the feature selection problem as a combinatorial optimization problem where feature relevance is measured by mutual information. Furthermore, Wang et al. [27] proposed an optimal clustering framework (OCF) that searches for an optimal band partition. Here, the clustering strategy is to evaluate the contribution of each cluster and then sum the contributions as a measure of the whole clustering result. After obtaining an optimal partition, the method ranks each band within a cluster according to a selected measure (e.g., information entropy) and selects those with higher rank values. In subsequent work, Wang et al. [26] proposed a fast neighborhood grouping method for hyperspectral band selection (FNGBS) that uses a neighborhood band grouping approach to partition the data into several groups based on Euclidean distance. After partitioning the data, the method ranks the bands within each group according to the product of local density and information entropy in order to obtain the most relevant and informative band for each cluster.
Recently, wrapper-based feature selection approaches based on genetic algorithms have gained more attention [28,42,43]. For instance, a recent method for bandwidth selection is known as histogram assisted genetic algorithm for reduction in dimensionality (HAGRID) [29]. This method maintains a population of index vectors identifying a specific number of wavelengths and fits a Gaussian mixture model to the converged population to identify the main wavelengths with their associated filter bandwidths.
Alternatively, some embedded-based approaches have been proposed in the context of deep learning. Taherkhani et al. [31] proposed regularizing the convolutional filters of the first layer of a convolutional neural network (CNN) using a group least absolute shrinkage and selection operator (LASSO) algorithm in order to sparsify the redundant spectral bands. Similar attempts, although they are not explicit feature selection methods, have been carried out in works such as [44,45], where a spectral-wise attention mechanism in the form of a fully connected layer is applied to the inner convolutional layers of the network with the objective of emphasizing informative spectral features and suppressing less useful spectral features. Furthermore, the use of saliency maps have gained popularity due to their ability to estimate which features (i.e., pixels) are more relevant for a model [46,47]. For example, Nagasubramanian et al. [48] adapted the concept of spatial saliency maps to an HSI classification scheme in order to estimate the relative importance of each wavelength given a specific class.

Dataset Descriptions
In our experiments, we used an in-greenhouse controlled HSI dataset of kochia leaves in order to classify three different herbicide resistance levels. We also used two well-known remote sensing HSI datasets: Indian Pines (IP) [49] and Salinas (SA) [50].
The Kochia dataset consists of images of the weed kochia (Bassia scoparia), which is considered one of the most problematic weeds in small grains and broadleaf crops such as soybeans and sugar beets [51][52][53]. This dataset was collected and analyzed by Scherrer et al. [1] with the aim of learning to differentiate between three different classes of herbicide resistance of this weed: (1) herbicide-susceptible, (2) glyphosate-resistant, and (3) dicamba-resistant, where glyphosate and dicamba are two components commonly found in commercial herbicides [53]. It is important to note that the difference between these three classes is imperceptible using standard color digital cameras that collect data for three bands in the visible spectrum of light (red, green, and blue), which justifies the need of using hyperspectral or multispectral imaging systems. The images were captured using a Resonon Pika L hyperspectral imager with 300 spectral channels across a spectral range of 387.12 nm to 1023.50 nm, resulting in a spectral resolution of approximately 2.12 nm. The kochia samples were illuminated using diffuse sunlight in a greenhouse setting. A total of 76 hyperspectral images of kochia with varying ages and spatial resolutions were captured at the Montana State University Southern Agricultural Research Center (SARC). Each image contains three kochia leaves of the same herbicide resistance class with a height of 900 pixels and width ranging from 700-1200 pixels.
The Indian Pines dataset [49] is an aerial 145 × 145-pixel image of the Indian Pines site in Northwestern Indiana. It was acquired in 1992 using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor [54], and it originally had 224 spectral bands in the wavelength range 400-2500 nm, resulting in a spectral resolution of approximately 9.5 nm. The number of bands was first reduced to 220 after removing four damaged bands without information (i.e., all of their pixels were zero value) [49], and then to 200 after removing 24 noisy bands and bands covering the region of water absorption corresponding to the wavelengths (1333-1373 nm), (1789-1893 nm), and (2499 nm) that correspond to the band indices (104-108), (150-163) and 220, as recommended by Tadjudin and Landgrebe [55]. The data are divided into 16 classes containing agriculture, forest, and other natural perennial vegetation.
Similarly, the Salinas dataset [50] is a 512 × 217-pixel aerial image of Salinas Valley, California, that was gathered in 199 using the same AVIRIS sensor as Indian Pines. Bands in the same wavelength ranges as Indian Pines were removed; however, in this case, they correspond to the following band indices: (108-112), (154-167) and 224. Thus, only 204 spectral bands are used. It is divided into 16 classes containing vegetables, vineyard fields, and bare soil.

Data Preprocessing
Each of the 76 collected images in the Kochia dataset contains a reflectance standard and three kochia leaves that correspond to the same herbicide resistance class. The white panel shown on the left side of each image in Figure 2 corresponds to the reflectance standard, a 99% reflectance Spectralon panel, which is commonly used as a Lambertian calibration reference. These images were captured by the Pika L hyperspectral imager in raw digital numbers, meaning the image data required preprocessing prior to analysis. For our experiments on the Kochia dataset, we converted the raw digital numbers to reflectance values using the Spectralon panel as the reflectance reference.
The first step of the reflectance correction was to select the pixels in the image that contain the Spectralon reflectance target. Next, we averaged the values of all pixels within the selected region at each spectral band, leaving us with a single, averaged digital number for each spectral band n ∈ [1, N] (where N represents the total number of spectral bands) denoted as DN (n) target . Consequently, we obtained a digital number representing 99% of the reflected light for each spectral band. Finally, we calculated the spectral reflectance at the i-th pixel of the n-th spectral band, ρ (n) i , as follows: scene i represents the digital number value captured at the i-th pixel of the n-th spectral band, DN (n) dark represents the dark current or background signal generated through sporadic electron generation in the imager's sensor, and ρ target represents the reflectivity of the reflectance target. From the calibrated images, we extracted 6316 25 × 25-pixel overlapping patches ( Figure 2c). Once extracted, we reduced the number of spectral bands within each patch from 300 to 150 by averaging adjacent pairs of bands, which can be interpreted as 2× spectral binning, where the resulting spectral resolution of each band was modified from approximately 2.12 nm to 4.24 nm. Thus, the final shape of this preprocessed dataset is 6316 × 25 ×25 × 150 pixels.
Given that one of the goals of this work is to aid in the design of multispectral imaging systems, we note that decreasing the overall spectral resolution is unlikely to affect our results, as it is unlikely that optical filters with a bandwidth less than 20 nm will be considered. Hence, this process gives us an upfront reduction in dimensionality that greatly reduces the potential overfitting impact in our following analysis without overly constraining the design parameters of a multispectral imager. In addition, averaging consecutive bands has a smoothing effect that helps to combat possible noise in some spectral bands. However, we used a different process for the IP and SA datasets. As described in Section 3.1, these datasets consist of single large images, meaning that we had to divide them into small patches so that each patch represents one class. In this case, we extracted 5 × 5-pixel patches around each pixel with an assigned label. From the IP and SA datasets, we extracted 10,249 and 54,129 patches, respectively. Hence, the size of the preprocessed IP dataset is 10,249 ×5 × 5 × 200 pixels while that of the preprocessed SA dataset is 54,129 ×5 × 5 × 204 pixels.

Interband Redundancy Analysis
Our first contribution consists of having developed a method called interband redundancy analysis (IBRA) that selects a subset of representative spectral bands from the original data cube aiming to reduce the interband redundancy. IBRA is a filter-based selection method that iteratively calculates the variance inflation factor (VIF) [56] between pairs of bands in order to determine how correlated they are; that is, to verify the presence of collinearity between them.
To calculate the VIF value between two bands, x 1 and x 2 , we built an ordinary least squares (OLS) regression model that takes one of the bands as the independent variable (x 1 ) and the other as the dependent variable (x 2 ). Then, we used the R-squared (coefficient of determination) value from the resulting model, R 2 x 1 ,x 2 , to calculate the VIF value as follows: VIF(x 1 , VIF values range from 1 upwards. The higher the VIF value, the more there is the risk of collinearity. If two bands are collinear, they explain the same variance within the dataset and can be considered to be redundant. We will consider VIF values greater than a threshold θ as representing the presence of collinearity between two spectral bands. In the literature, the recommended values of θ are between 5 and 10 [57,58], so we test different values of θ ∈ [5,12] to observe how the classification performance is affected by this threshold and to choose the best θ for a given classification task. Some methods, such as that proposed by Castaldi et al. [38], use the VIF metric to perform a stepwise selection process where the bands that show a VIF value greater than a threshold θ (i.e., bands that show a high risk of multicollinearity) are removed from the selection. Our approach is novel and distinct in that we use the VIF metric as part of a preselection step, assessing the collinearity degree between each band and its local neighbors iteratively in order to find independent salient bands that are suitable cluster centers.
Our IBRA preselection method first analyzes each spectral band x n with spectral index n ∈ [1, N] to calculate the minimum number of bands we need to move away to the left side from it in order to find bands sufficiently different from band x n , denoted as d le f t (x n ). Similarly, we calculated the minimum number of bands we need to move away to the right side from band x n in order to find bands sufficiently different from band x n , denoted as d right (x n ) (see Algorithm 1). From this, we assumed that all the bands within the interval are similar to x n and form a possible cluster.
We considered that a suitable cluster center should represent the bands that are on the left side of the cluster in a similar way that it represents those on the right side; therefore, we calculated the difference d(x n ) = |d le f t (x n ) − d right (x n )| to quantify the suitability of band x n to become a cluster center, where low values of d(x n ) indicate better suitability. Take Figure 3 as an example. Here, x 4 (d(x 4 ) = 0) is more suitable than x 5 (d(x 5 ) = 2) to be considered a bandwidth center because it is similar to both the left and right sides, so it better represents the bands within the interval [x 2 , x 6 ]. Recall that there exists a strong correlation between bands with close wavelengths [27], which means that the more distant two bands are, the more different their spectral information is.
Next, we analyzed how the differences d(x n ) were distributed across the spectrum. This distribution was used to find clusters with similar bands and their corresponding cluster centers. Figure 4 shows the distribution of the variable d(x n ) for the Kochia dataset given different thresholds θ. From this, we observed that each distribution consisted of a series of "V" patterns. In this context, a local minimum-the center of a "V" patternrepresents a salient band that explains the variance within the dataset in a way similar to its neighbors. Even though all the bands within a "V" pattern are similar, the band at the leftmost position is more similar to those bands on the left side of the cluster. Similarly, the band at the rightmost position of the "V" is more similar to those bands on the right side of the cluster, while the band corresponding to the local minimum is similar to both sides, acting as a centroid. In general, we kept the bands corresponding to a local minimum in the n vs. d(x n ) plot and remove the rest since they are assumed to be redundant. We considered only local minima where d(x n ) < 5; otherwise, they will not be suitable bandwidth centers.
The entire IBRA process is described in Algorithm 1 and Figure 5. Here, getVIF(a, b) calculates the VIF value between bands x a and x b . Note that getVIF(a, b) = getVIF(b, a); therefore, in order to avoid re-calculations and make the algorithm more efficient, we construct a symmetric matrix table that stores the pre-computed VIF values between pairs of bands. Furthermore, getLocalMinima(d) is a function that retrieves the position of the local minima of the vector d. Hence, the algorithm returns a vector called preselection, which is a set of numbers S c ∈ Z N (where N N) representing the indices of the preselected spectral bands. It is worth mentioning that the number of preselected bands N is not known a priori and depends on each dataset and the selected VIF threshold θ.     20: if preselection ← getLocalMinima(d) 28: return preselection

Band Selection Using Pre-Selected Bands
Our second contribution was the development of a two-step hybrid band selection method: First, it uses IBRA to obtain a set of independent candidate bands from the original spectrum, as we showed in Section 3.3. Then, it employs a wrapper-based method to select the best combination of bands given the desired number of bands k (whose indices are denoted by S f ∈ Z k ) from the set of N available candidate bands preselected by IBRA (whose indices are denoted by S c ∈ Z N ) instead of the entire set of spectral bands, which greatly reduces the search complexity. We call this process greedy spectral selection (GSS).
The GSS process starts by ranking each candidate band x c , where c ∈ S c , according to some criterion. In this work, we used information entropy to calculate an initial relevance score for each spectral band. The three datasets used in this work were acquired using a bit depth of 14 bits, allowing us to interpret a spectral band as a discrete random variable. The entropy of band x c , H(x c ), conveys the average level of information inherent in x c and is defined as follows: where P(z) is the probability mass function of random variable z, and Ω x c is the space that encompasses all the possible values that can occur in the spectral band x c .
Here, it is important to highlight the difference of our method with respect to other band selection methods that use information entropy as a ranking measure, such as those proposed by Wang et al. [26,27]. In particular, those methods use an optimization approach to group the data into k clusters of similar bands. Then, they use information entropy to find the most representative band from each cluster (i.e., the band with the largest amount of information) to obtain a final selection of k bands. Our approach is distinct in two aspects: First, when using IBRA, we already picked the most representative band from each cluster (defined as a local minimum in the n vs. d(x n ) plot, as shown in Figure 4). Then, we ranked the N preselected bands based on information entropy in order to select a subset S f of k bands. Thus, initially, S f consists of the indices of the top-k candidate bands of S c with the greatest entropy. The second aspect is that, while other methods use information entropy to obtain a final selection of bands directly, we used it to obtain an initial selection that will be refined further using a greedy algorithm based on a multicollinearity analysis.
We note that collinearity does not exist between pairs of candidate bands obtained by the IBRA preselection process because the calculated VIF value between two cluster centers must be lower than a given threshold θ. However, it is possible that three or more bands are highly correlated-a phenomenon known as multicollinearity [57]. In that case, we checked for the presence of multicollinearity within the k selected spectral bands the same way as described in Section 3.3 with collinearity; that is, taking one band at a time, considering it as the dependent variable while considering the rest of the k − 1 bands as the independent variables, and calculating its corresponding VIF value.
With this, after obtaining an initial set of k selected bands based on information entropy, GSS employs the following greedy algorithm (see Algorithm 2). First, we calculated the average classification performance (F1 score) when using the current k selected bands. To do this, we performed a 5 × 2-fold cross-validation using a convolutional neural network (CNN) classifier (see Section 3.5) and calculated the average F1 score obtained on the 10 validation folds. Then, after calculating the VIF value for each of the currently selected bands, we removed the one with the greatest VIF value and consider the next available band with the greatest entropy. With this new subset of k bands, we performed a new 5 × 2-fold cross-validation process to determine if the average classification performance had improved. If so, we considered the current selection of k bands as the current best selection and updated the indices of S f . We repeated this process until there were no more available bands. Finally, we selected the combination of bands that showed the best classification performance.
12: 15: if newF1 > bestF1 then 16: bestS f ← S f 17: bestF1 ← newF1 18: return bestS f Figure 6 depicts the process of selecting k = 6 bands for the Kochia dataset from the set of 17 band candidates obtained using IBRA with a VIF threshold of 10 (see Figure 4b).
Then, in the first iteration of GSS, the first six bands selected are those with the highest information entropy values; for example, those with indices [1, 18,31,54,64,68]. With this selection of bands, we performed 5 × 2-fold cross-validation training a CNN and obtained an average F1 score of 92.49%. Then, we determined that the band with index 64 had the highest VIF value (i.e., 86.8), which means that its variance could be explained well using the other five bands. For this reason, this band was removed from the selection and the next available band (i.e., the band with index 43) was considered in the second iteration.
We show the results obtained after six iterations only given that the following iterations could not achieve better classification performance scores. Therefore, the best selected bands were found at the fifth iteration and their indices were [1, 18,43,68,81,143]. The GSS algorithm is presented in Algorithm 2 and Figure 7. Note that it selects the indices of the best k bands from a list of candidate band indices S c . Function getEntropy(S c ) returns the entropy value of each of the candidate bands with indices S c . Then, S c .sort(key = H) sorts the elements of S c with respect to their corresponding entropy values in nonincreasing order. Function getVIFMulti(S f ) returns a list with the VIF value for each band in S f . Next getMax(l) returns the position where the maximum value in a list l was found. Finally, the performance of the currently selected bands was evaluated using trainSelection(S f ), which returns the average F1 score obtained using the bands with indices S f and 5 × 2-fold cross-validation.

Convolutional Neural Network Architecture
The type of classifier we used for the GSS process to evaluate the appropriateness of a given selection of spectral bands (line 6 in Algorithm 2) is a CNN. Preliminarily, we experimented with other types of classifiers to use in the GSS process; specifically, support vector machines, random forests, and feedforward neural networks. However, these approaches showed slow convergence rates and poor classification performance in comparison to our CNN, despite being a more complex model. For this reason, we continued to use CNNs over the other types of classifiers.
All the classifiers used in this work are CNNs that use a modified version of the Hyper3DNet architecture [59], which is a 3D-2D CNN architecture specifically designed for HSI classification problems using a reduced number of trainable parameters. The original Hyper3DNet architecture consists of two main modules: a 3D feature extractor and a 2D spatial encoder. The 3D feature extractor uses a sequence of 3D convolutional layers with filters that analyze not only spatial neighborhoods but also spectral neighborhoods while maintaining spatial resolution. Then, the 2D spatial encoder progressively reduces the spatial resolution using 2D separable convolutional layers, which are used due to their ability to reduce the number of trainable parameters and computational burden while improving the classification performance [60,61]. Experimental results demonstrated relative superiority of this architecture over state-of-the-art architectures.
The modified CNN architecture that we used in this paper, referred to as Hyper3DNet-Lite, is a simplification of the original Hyper3DNet architecture. The difference with the original architecture is that its 3D feature extractor consists of two 3D convolutional layers instead of a densely connected block with four layers; additionally, its 2D spatial encoder has three layers instead of four. The Hyper3DNet-Lite architecture used for the Kochia dataset is detailed in Table 1, where N denotes the number of spectral bands in the input, "SepConv2D" denotes a 2D separable convolutional layer, and "ReLU" denotes a rectified linear unit activation layer (where ReLU(x) = max(0, x)). The only difference with the network used to process the IP and SA datasets is that, since the input image is smaller (5 × 5 pixels), the stride used in the last two "SepConv2D" layers is (1, 1) instead of (2, 2) to avoid dimensionality inconsistencies.
Another difference of our Hyper3DNet-Lite with respect to the original Hyper3DNet is that, after the last "SepConv2D" layer, we used a "GlobalAveragePooling" operation instead of a "Flatten" operation. For example, for the case of the network shown in Table 1, we transformed our 3D tensor with dimensions 7 × 7 × 256 pixels into a 1D vector with 256 elements by averaging the 49 values from each channel. Conversely, if we were to use a flattening operation, we would obtain a 1D vector with 12,544 elements. This relaxes the need of using an intermediate fully connected layer to reduce the dimensionality of the tensor before the final classification layer; thus, we avoided using an excessive number of trainable parameters.
The simplified architecture of the Hyper3DNet-Lite network becomes especially suitable for datasets that use just a few spectral bands, given that these datasets often do not require models with a high level of complexity to process them, unlike those datasets that use all the available spectral bands. In this way, we avoided overparameterization, which results in our models being less prone to overfitting.

Multispectral Imager Design
A spectral band is determined by its wavelength center and its spectral resolution; that is, an imager captures the reflected light in a given band around its center wavelength within a range determined by its spectral resolution. The method described in Section 3.4 was used to select the most relevant spectral bands from the original hyperspectral data cube and, as a consequence, their wavelengths. Knowing which wavelengths are more useful for a given application allows us to discard the information captured at different wavelengths, thus reducing storage requirements. However, this knowledge also allows for the design of a compact multispectral imager that would be used instead of a full hyperspectral imager. Hence, to test our hypothesis that we can use our method to provide the specifications needed to design a multispectral imager, we assessed classification performance with a simulated multispectral imager by using the original hyperspectral data cube and simulate applying optical filters to capture data from an imager that would use our selected wavelengths.
The bandwidth of the simulated multispectral filters is set to be equivalent to the bandwidth encompassed by five bands, which is equivalent to 21 nm for the case of the Kochia dataset, and 50 nm for the case of the IP and SA datasets. These are commonly available optical filter bandwidths. To obtain the reflectance measured by the simulated multispectral sensor, we generated k Gaussian distributions, one for each of the selected bands, taking the spectral wavelengths selected by a band selection method as the centroids. The spread of the generated Gaussian curves is proportional to the filter bandwidth. In particular, the full-width at half-maximum (FWHM) of the curve was set to be equal to the bandwidth [29]. The FWHM is defined based on the standard deviation of the curve σ as FWHM = 2σ √ 2 ln 2. Then, the simulated multispectral reflectance measurement was obtained by multiplying the original hyperspectral data cube by the corresponding Gaussian distribution generated for each band, then integrating under the resulting response curve to obtain a single reflectance value. This process was repeated for each of the k Gaussian distributions.

Feature Extraction Framework
Feature extraction approaches project the N original spectral bands into a new representation basis of N features (where N < N). As such, transformed features can no longer be interpreted as spectral bands. In other words, we lose data interpretability. However, in some cases, feature extraction methods may present some performance advantages over feature selection methods. For example, projecting the data into a new basis with reduced dimensionality based on PCA helps to remove redundancy while trying to explain the original variance of the dataset. Conversely, feature selection methods simply discard the information of the bands that were not selected. Therefore, a classifier trained on N transformed features that explain most of the original variance of the dataset might perform better than a classifier trained on N selected bands, considering that the discarded bands did not contain trivial information.
We propose generalizing our method described above into a two-stage feature extraction framework. First, we applied IBRA to remove redundant bands; then, we can use any feature extraction method to obtain the desired number of feature channels k. In this work, we demonstrate the pipeline using two types of feature extraction methods: unsupervised and supervised. We chose PCA as the unsupervised method and denote the resulting instance as IBRA-PCA. In addition, we used PLS-DA as the supervised method; here, we used the class labels of the dataset to maximize the class separation within the reduced space. This instance is denoted as IBRA-PLS-DA.

Experimental Results
In this section, we present the experimental results for each step of our proposed methods and analyze their performance in comparison to other dimensionality reduction methods. For the sake of consistency and fairness, we used the same configuration for all the networks trained in our experiments, including those used during the GSS process. That is, we used the same network architecture (i.e., Hyper3DNet-Lite), optimizer, and batch-size. While this strategy does not guarantee the best possible results, it allowed us to compare the behavior of different band selection methods under the same conditions. All CNNs were trained using the Adadelta optimizer [62], which is a gradient descent method based on an adaptive learning rate, so that there is no need to select a global learning rate manually. The mini-batch size was set to 128. The last layer of the CNNs used a softmax activation function, and we employed a categorical cross-entropy loss function.
We applied 5 × 2-fold stratified cross-validation to train and evaluate all networks. That is, we randomly divide our datasets into two equally sized folds so that one fold works as the training set and the other as the validation set, train again with the roles of these folds reversed, and then repeat this process four more times. Furthermore, stratification indicates that each fold has the same proportion of samples of a given class. Note that z-score normalization (mean equal to zero and standard deviation equal to one) was applied onto each spectral band of each training set while the exact same scaling was applied to their corresponding validation set. In order to analyze the behavior of our models, we calculated four metrics by aggregating the results on all ten of the validation sets: overall classification accuracy, macro-average precision, macro-average recall, and macro-average F1 score. The source code and datasets are available online (Codebase: https://github.com/GiorgioMorales/HSI-BandSelection.git, accessed on 5 July 2021).

Training with Preselected Bands Alone
Previously (Figure 4), we showed some examples of applying the preselection method using IBRA on the Kochia dataset using three different VIF thresholds (12, 10, and 8), which reduced our search space from 150 bands to 19, 17, and 16 bands, respectively. Table 2 gives the number of preselected bands for the Kochia, IP, and SA datasets when using IBRA with a VIF threshold of 10 (θ = 10); it also gives the average performance for the four metrics and corresponding standard deviations using the Hyper3DNet-Lite network when training on the full hyperspectral spectrum and only the preselected bands. The number of parameters required to train each network is reported in the last column.

Greedy Spectral Selection
From Figure 4, we notice that different VIF thresholds can lead to slightly different selections of candidate bands. In general, the lower the VIF threshold, the more distinct two bands have to be to be considered noncollinear. Exactly how large this threshold needs to be is not clear. If it is too large, we may end up including too many candidate bands that do not represent suitable bandwidth centers. On the other hand, if it is too small, we may discard bands that could have been useful for the classification task. For that reason, we applied IBRA using different VIF thresholds θ ∈ [5,12] so that we obtained a set of preselected bands for each of them. Then, we applied GSS on each set of candidates to select the desired number of bands k. Finally, we selected the classifier that achieved the best classification performance based on the mean F1-score obtained after a 5 × 2-fold cross-validation.
For the Kochia dataset, we considered six, eight, and ten bands in order to evaluate the tradeoff between the number of bands and classification performance. For the IP and SA datasets, we selected only five bands. This is because we found that the ease of training these datasets when using k > 5 yielded to high-performance metrics (>99% accuracy) no matter which feature selection method was chosen. That is, the lack of sufficiently distinct results prevented us from being able to evaluate the existence of any performance difference among the various methods, which is why selecting more than five bands from these datasets was not considered appropriate for this work. In addition, for each dataset, we experimented with different dataset sizes (i.e., 100%, 75%, 50%, and 25%) to evaluate how consistent the band selection results as the data availability changes.
For the Kochia dataset, when selecting six bands (k = 6), the best results were obtained using a VIF threshold of ten (θ = 10), and the wavelengths of the selected bands (in nm) were [391.2, 463.3, 569.3, 675.3, 730.4, 993.3] for each of the four dataset size variations. Equivalently, the indices of the selected bands were [1, 18,43,68,81,143]. When selecting eight bands (k = 8), the best results were obtained using a VIF threshold of θ = 8 when using 100%, 50%, and 25% of the dataset and θ = 6 when using 75% of the dataset. In the four cases, the difference between the classification performance obtained using θ = 8, θ = 7, and θ = 6 is not statistically significant, so we considered the three options equally as good. Therefore, in this work, we simply chose the bands selected using θ = 8 since it achieved slightly higher performance metrics in three out of the four dataset variations. When selecting ten bands (k = 10), the best results were obtained using a VIF threshold of θ = 7 when using 100% and 50% of the dataset, and θ = 8 when using 75% and 25% of the dataset.  Table 3 shows the performance using IBRA and GSS on the full Kochia dataset for different VIF thresholds (θ ∈ [5,12]) when selecting six, eight, and ten bands. The bold entries in these tables represent the best VIF threshold and average F1 score.
For the IP dataset, the best results were obtained using a VIF threshold of ten (θ = 10) and the wavelength of the selected bands were [498.3, 626.9, 706.2, 821.8, 1023.7] (i.e., with band indices [11,25,34,39,67]) for all four dataset size variations except when we used 50% of the dataset, where the best VIF threshold was nine (θ = 9) and the wavelength 557.5 nm (band index 46) was selected instead of the wavelength 498.3 nm (band index 39). Table 4 shows the performance using IBRA and GSS on the full IP dataset.
For the SA dataset, the best classification performance was obtained using a VIF threshold of θ = 8 when using 100% and 75% of the dataset. The wavelength of the selected bands were [731.83, 950. 54, 1159.73, 1254.82, 2044.73] (nm), which correspond to the indices [37,60,82,92,175] of the corrected SA dataset after discarding the 20 water absorption bands (see Section 3.1). When we reduced the dataset size, different VIF thresholds were selected. Specifically, when we used 50% of the dataset, the best VIF threshold was six (θ = 6), and we selected the wavelengths 570. . Similar to what happened with the Kochia dataset when selecting eight bands, the classification performance obtained using θ = 8, θ = 7, and θ = 6 was not statistically significant, so we considered the three options equally as good; however, in this work, we chose to use the bands selected using θ = 8, as it yielded slightly better performance metrics in two out the four dataset variations. Table 5 shows the performance using IBRA and GSS on the full SA dataset.

Comparative Results
We compared our IBRA-GSS method to four other band selection methods: OCF [27], HAGRID [29], SR-SSIM [40], FNGBS [26], and PLS-DA [63]. For OCF, we used the normalized cut criterion as the objective function along with information entropy as the ranking method, given that this setting showed the best performance. For HAGRID, we used a grid search to choose the following hyperparameters: a crossover rate of 0.25, a mutation rate of 0.05, a tournament size of 5, a population size of 1000, and 300 iterations. For SR-SSIM, we used a window size of 3 for the SSIM index algorithm to construct a hyperspectral similarity matrix, as it achieved the best performance. For FNGBS, to calculate the number of shared neighbor elements between two bands, we considered a nearest neighbor of two elements. Finally, we adapted PLS-DA for feature selection by sorting the coefficients obtained by PLS-DA in order to select the top-k bands. We also compared our IBRA-GSS method to our IBRA-PCA and IBRA-PLS-DA methods to analyze the behavior of feature extraction methods in front of a feature selection method.
To assess the effectiveness of the dimensionality reduction methods, we compared the performance of eight Hyper3DNet-Lite CNNs, each trained on the features selected or extracted by the eight methods. This comparison was carried out using the same network architecture, hyperparameters, and other configurations for all of the methods. Finally, to determine if the difference in performance scores was statistically significant, we performed a paired t-test and a paired permutation test using the F1 scores at the α = 0.05 level.
The method comparison is shown in Table 6 for the Kochia dataset, Table 7 for the IP dataset, and Table 8 for the SA dataset, with the best performing metrics highlighted in bold. Here, the first row of each method represents the results obtained after training a model using the original selected bands (identified as "original band selection"), while the second row represents the results obtained after using simulated filters that take the position of the selected bands as their central wavelengths (identified as "multispectral filter simulation"), as explained in Section 3.6. Additionally, we chose the average F1 score to compare the results of all the methods graphically in Figure 8. The simulated filters used for the Kochia dataset were 20 nm, while for the IP and SA datasets, they were 50 nm. Note that the feature channels extracted by IBRA-PCA and IBRA-PLS-DA cannot be interpreted as spectral bands; thus, we could not apply the "multispectral filter simulation" approach directly on them. Instead, we applied a Gaussian transformation on the IBRA-selected bands to simulate filters that take the position of the preselected bands as their central wavelengths; then, we used PCA or PLS-DA on the transformed bands to obtain the desired number of feature channels.   Additionally, Table 9 shows the resulting p-values of the paired t-tests and paired permutation tests performed between IBRA-GSS and the other compared methods when selecting six, eight, and ten bands from the Kochia dataset. Similarly, Table 10 shows the results of the statistical significance tests between IBRA-GSS and the other methods when selecting five bands from the IP and SA datasets. In both cases, the upward-pointing arrow (↑) indicates that IBRA-GSS performed significantly better (i.e., p-value < 0.05), the downward-pointing arrow (↓) indicates that the compared method performed significantly better than IBRA-GSS, and the equal symbol (=) indicates that the difference between IBRA-GSS and the compared method is not statistically significant (i.e., p-value ≥ 0.05). According to both the t-test and the permutation test, our IBRA-GSS method performed significantly better than the other five band selection methods in each of the cases; however, it did not perform better than IBRA-PLS-DA nor IBRA-PCA (except for the "original band selection" case) when selecting eight or ten bands from the Kochia dataset.
We also evaluated how the distribution of the F1 scores resulting from using different dimensionality reduction techniques changes as k changes. To do this, we considered a population that consisted of all the F1 scores obtained by all the methods (i.e., a sample size of 80) on the full Kochia dataset. From this, we show in Table 11 that the overall mean F1 score increases as k increases, while the standard deviation decreases; that is, the F1 scores are more spread out when k is small. We also performed a one-way analysis of variance (ANOVA). In this case, each group consisted of the ten F1 scores obtained by each method. From Table 9, we already know that such statistically significant differences exist, so, as expected, the F statistics obtained for the three values of k are high and their corresponding p-values are quite small (p-value 0.05), as shown in Table 11. However, notice that the F statistic decreases and the p-value increases as k increases. In other words, as k increases, it becomes less likely that choosing a specific dimensionality reduction technique will cause a significant difference in the resulting average F1 score. This is not surprising, since, by increasing k, we are effectively reinserting what was considered to be overlapping information.
We also tested the four dataset size variations, as explained previously. For example, Figure 9 depicts the performance metrics comparison of the eight methods for the Kochia dataset using four dataset size variations and selecting six, eight, and ten bands. For the three tested datasets, we found that the improvement in performance of our IBRA-GSS method over the other feature selection methods was still statistically significant even when the dataset size was reduced, except when selecting eight bands (k = 8) from the Kochia dataset and using 25% of the dataset (Figure 9d).

Discussion
Our IBRA method is used to cluster the entire spectrum into sets of similar (or collinear) spectral bands and to identify their cluster centers directly. Furthermore, we can interpret these bands as representing a set of influencing spectral bands with wavelengths that are suitable for the design of multispectral filters. These preselected bands explain the variance of their neighbors (i.e., spectral bands that belong to the same cluster) in the original spectrum with a VIF value greater than a threshold θ ∈ [5,12]; therefore, keeping them and removing the other bands of each cluster allow us to avoid spectral bands that do not contain useful information for performing classification. That is, our method effectively identifies those spectral bands that carry information for performing classification while discarding redundant spectral bands. Results shown in Table 2 demonstrate that it is possible for a model trained on the subset of spectral bands determined by IBRA to achieve high accuracy values (∼97-99%) while discarding most of the bands, similar to those obtained when using the full spectrum. For example, for the case of the Salinas dataset when using a VIF threshold of 10, we reduced the number of bands from 204 to 14, discarding more than 93% of the bands while decreasing the average F1 score less than 0.2 percentage points, which was also shown not to be significantly different. Thus, we were able to train a CNN classifier that required 80% fewer trainable parameters than a model trained on the full spectrum yet without detriment to its performance.
Our GSS method uses information entropy as a ranking criterion to identify which bands incorporate more information among the bands preselected by IBRA. However, if we need to select at most k bands, the subset of bands with the greatest information entropy values may not be the best selection. For instance, for the Kochia dataset, for k = 6, the wavelengths of the bands with the highest entropy values were [391.2, 463.3, 518.4, 616.0, 658.4, 675.3] (nm); however, line 9 in Algorithm 2 detected strong multicollinearity between bands with wavelengths 616.0, 658.4, and 675.3 nm in the first iteration (see Figure 6). Rather than using redundant bands, our method selects a more diverse subset of bands if this helps to improve the classification performance.
It is worth noting that, according to the results described in Section 3.4, there does not necessarily have to be a single optimal VIF threshold θ to find the best set of spectral bands for a given classification task. Instead, more than one value of θ can lead to the same or similar selection of bands. For example, from Table 5, the only difference that we obtained using θ = 8 with respect to using θ = 7 is that we selected the wavelength 2224.7 nm (band index 175) instead of 2215.2 nm (band index 174). Considering that these two bands are contiguous, they explain practically the same variance, so both options are equally as good, which can be verified by observing that their resulting classification performance were very similar. There were other cases where, given two VIF thresholds, the GSS method selected bands that are located in different parts of the spectrum, and yet both options are considered equally as good as there was no statistical difference between their classification performance, as observed in Table 3 when using θ = 6 and θ = 8.
From Tables 6-8, we see that our IBRA-GSS method performed better than the other five band selection methods on the three datasets. The results remained consistent even when considering different dataset sizes except in one case when using the "multispectral filter simulation" approach on the Kochia dataset. In fact, Figure 9 shows that the gap in performance between our method and the others became more noticeable when the dataset size was reduced. In addition, Table 11 shows that the average F1 scores obtained by each method in both approaches are more spread out when selecting six bands than when selecting eight or ten bands (which can be visually verified in Figure 9). This confirms that the fewer spectral bands we select, the harder the band selection task will be; however, multispectral imagers generally become more practical computationally as the number of spectral channels becomes smaller. Thus, the fact that IBRA-GSS consistently outperformed the others when selecting a reduced set of spectral bands makes it the most suitable band selection method for multispectral imager design, at least for the data sets analyzed and the algorithms studied.
With our IBRA-GSS method, the classification performance resulting from the "original band selection" approach was very similar to the performance with the "multispectral filter simulation" approach, unlike some of the compared methods. This can be explained by the way we selected the first band candidates using IBRA. That is, a spectral band corresponding to a local minimum in the plot of spectral index vs. distance d(x) (Figure 4) acts as a centroid due to its similarity to the spectral bands located on either side. Therefore, if we take this local minimum as the central wavelength of a multispectral filter, generate a Gaussian distribution around it by considering a standard bandwidth, and integrate under the response curve, then we obtain reflectance values similar to those of the central band. Given that the simulated multispectral filter and the original spectral band present similar information, their classification performance was similar. This is convenient for a multispectral sensor design, as we would like the central wavelength of a filter to be the most representative.
Regarding our tested feature extraction methods, we observe from Table 9 that IBRA-PCA and IBRA-PLS-DA were able to obtain better or competitive results than those obtained by our IBRA-GSS band selection method when using eight (k = 8) or ten (k = 10) feature channels from the Kochia dataset. Nevertheless, they could not obtain better performance than IBRA-GSS or other band selection methods when extracting fewer channels. This is due to the fact that PCA tries to explain the total variance of the original features (i.e., the IBRA-selected bands) and PLS-DA tries to explain the maximum variance of the response variable (i.e., maximizes inter-class separability) so that the fewer the number of dimensions we choose in both cases, the worse this explanation is. Conversely, when k is small, our GSS method does not apply a poor transformation on the spectral bands. Instead, it seeks to find a set of original spectral bands that maximize class separability. For the specific case of PCA, we noticed that the performance obtained using the "original band selection" was significantly lower than that obtained using the "multispectral filter simulation" approach. This is because reflectance measures captured by narrow hyperspectral bands tend to introduce a considerable number of outliers and PCA is an outlier sensitive feature extraction technique. On the other hand, the Gaussian transformation applied to simulate multispectral filters has a smoothing effect that reduces outlier noise; thus, improving the classification performance dramatically.

Conclusions
Hyperspectral imaging systems capture highly complex data that allow us to extract information that could have not been achieved using only visual spectral bands or spectrometers. However, processing hyperspectral images entails a high computational cost that can be avoided using dimensionality reduction techniques. In that sense, feature extraction techniques are used to transform the data and reduce its dimensionality, which improves the computational efficiency when processing HSIs while avoiding performance degradation derived from the phenomenon known as the curse of dimensionality. Feature selection techniques, on the other hand, help to determine the most relevant wavelengths for a given application, which allows for the design of compact multispectral imagers that could be used in place of hyperspectral imagers. This approach leads to even more significant economic savings, as fewer specialized storage and processing devices are required.
Given these potential benefits, in this paper, we focused on the proposal of a filterbased (IBRA) and a hybrid (IBRA-GSS) band selection method in the context of hyperspectral image classification. Experimental results showed that our IBRA method alone is able to discard up to 93% of the spectral bands and achieve similar classification performance to that obtained using the full spectrum. Therefore, IBRA successfully removes redundant spectral bands while maintaining the most relevant spectral information for a given classification task.
Furthermore, our IBRA-GSS method was proposed as a hybrid selection method that allows for the selection of a specific number of spectral bands k. IBRA-GSS was shown to generally outperform the other competing band selection methods on the three tested datasets even when we reduced the dataset sizes. We also observed that more than one VIF threshold in the IBRA step can lead to the same or similar selection of bands that were found to be equally as effective. This behavior was found to be consistent when the dataset size was reduced.
We also showed how to use our IBRA method as part of a novel feature selection framework. Two variants of this framework were tested, namely IBRA-PCA and IBRA-PLS-DA, which were able to obtain better results than all of the band selection methods when k was sufficiently large. In that sense, we demonstrated that the dimensionality reduction task becomes harder when k becomes smaller. Thus, given that our IBRA-GSS method consistently outperformed the other dimensionality reduction techniques when selecting a reduced set of spectral bands, we consider it the most suitable band selection method for the design of compact multispectral imagers. This is also supported by the fact that our multispectral filter simulations showed that IBRA effectively provides representative filter centers that are suitable for the design of multispectral sensors.
In the future, we plan to develop multispectral sensors based on the wavelengths recommended by our IBRA-GSS method. This would allow us to study implementation issues that may not be noticed using our multispectral filter simulation. Having compact multispectral sensors would also enable us to easily acquire data from drones and other platforms of opportunity for cost-effective sensing of herbicide-resistant weeds and many other applications.