Using a Genetic Algorithm as an Optimal Band Selector in the Mid and Thermal Infrared (2.5–14 μm) to Discriminate Vegetation Species

Genetic variation between various plant species determines differences in their physio-chemical makeup and ultimately in their hyperspectral emissivity signatures. The hyperspectral emissivity signatures, on the one hand, account for the subtle physio-chemical changes in the vegetation, but on the other hand, highlight the problem of high dimensionality. The aim of this paper is to investigate the performance of genetic algorithms coupled with the spectral angle mapper (SAM) to identify a meaningful subset of wavebands sensitive enough to discriminate thirteen broadleaved vegetation species from the laboratory measured hyperspectral emissivities. The performance was evaluated using an overall classification accuracy and Jeffries Matusita distance. For the multiple plant species, the targeted bands based on genetic algorithms resulted in a high overall classification accuracy (90%). Concentrating on the pairwise comparison results, the selected wavebands based on genetic algorithms resulted in higher Jeffries Matusita (J-M) distances than randomly selected wavebands did. This study concludes that targeted wavebands from leaf emissivity spectra are able to discriminate vegetation species.


Introduction
Hyperspectral sensors, because of their high spectral detail over contiguous narrow bands, have proven to be a valuable tool for discriminating plants species [1][2][3][4] compared to multispectral resolution sensors [5]. However, due to high dimensionality, working with hyperspectral data poses challenging problems such as redundancy, intensive computation, and singularity of covariance matrix inversion [6][7][8][9][10]. To overcome these problems, the dimensionality of hyperspectral data needs to be reduced without compromising the information content. The dimensionality of the data is reduced through either band extraction or band selection [6]. In band selection a subset of the original bands is selected without affecting the physical meaning of the selected bands. In band extraction a certain number of bands is selected after transforming the original dataset [11]. Band selection is often preferred to band extraction as the physical meaning of the data remains unchanged [6,[12][13][14][15].
Genetic algorithms constitute problem solving optimization methods based on the philosophy of genetics and natural selection through "survival of the fittest" [16,17]. A genetic algorithm is a popular band selector and dimensionality reduction procedure for spectral analysis [8,[18][19][20][21][22]. The genetic algorithm as a band selector has performed with higher accuracy than other band selection algorithms for both synthetic [23] and real remote sensing data [8,18,19,24]. In remote sensing, genetic algorithms selected spectral bands for classification with hyperspectral data, as well as bands sensitive to the chemical content of plants and soils [18,19]. The majority of the studies used genetic algorithms as a band selector where the class information was broad (i.e., the spectral signatures of the different classes were distinct from each other) [25] and the genetic algorithms easily selected bands that differentiated between various classes. Using visible to short-wave infrared (VIS-SWIR; 0.4-2.5 μm) spectra, Vaiphasa et al. [8] discriminated between sixteen mangrove plant species with similar spectral characteristics. The present study extends the genetic algorithms to the mid to thermal infrared for optimal band selection for discriminating plant species.
Till recently, vegetation spectra in the mid to thermal infrared (2.5-14 µm) was perceived as a line without any spectral features [26]. However, the introduction of spectroradiometers sensitive to mid and thermal infrared revealed that certain spectral features are associated with the composition of leaf epidermal materials (i.e., cell walls and cuticular membranes), which can act as a fingerprint for discriminating vegetation [26][27][28][29]. The present study attempts to discriminate between 13 broadleaf vegetation species using genetic algorithms from high resolution mid to thermal infrared data (2.5-14.0 μm, comprising 3,024 spectral bands). The possibility of using genetic algorithm-based selected features for distinguishing vegetation species (from laboratory measured emissivity spectra) will be an important prerequisite for adjusting band positions of air-borne and space-borne floristic mapping campaigns.

Leaf Sampling
The dataset of leaf samples used in this study was the same as used in the [29]. The leaves were collected (between July and September 2010) from thirteen plant species (Table 1) species. To avoid pseudo-replication, leaves were collected from at least ten different plants of the same species. Leaves were acquired from different part of the plant (both on the sun and the shaded side). The leaves, attached to small twigs, were brought to the laboratory within 5 minutes, and placed in moist cotton to avoid desiccation. Spectral measurements were recorded as soon as possible.

Spectral Measurements
A Bruker VERTEX 70 FTIR spectrometer (Bruker Optics GmbH, Ettlingen, Germany) was used to acquire the Directional Hemispherical Reflectance (DHR) spectrum of each leaf. Nitrogen (N 2 ) gas was used to continuously purge the spectrometer from water vapor and carbon dioxide. A mid-band mercury-cadmium-tellurium (MCT) detector cooled with liquid nitrogen was used to measure the DHR spectrum of the adaxial (upper) surface of the leaf samples between 2.5 and 14 µm (Figure 1), with a spectral resolution of 4 cm −1 . Thirty five (35) leaves were measured per species, thus 455 leaves were measured in total. Each leaf measurement was referenced against a calibration measurement of gold plate (infragold; Labsphere reflectance technology) with a high reflectance (approximately 96%). One thousand (1,000) scans were averaged to produce each leaf spectrum. The spectra between 6 to 8 µm were noisy (due to water absorption) and were excluded from the analysis. The DHR spectra were converted to emissivity using Kirchhoff's law (Emissivity = 1 − R) [30][31][32]. For further detail about the spectrometer and data acquisition, see [29,33].

Concept of Genetic Algorithm
Genetic algorithms, introduced for the first time by Holland [17], are a popular type of evolutionary optimization computation based on the concept of natural selection. The innovation behind genetic algorithms is the random (stochastic) model that uses a population of solutions rather than a single solution. During each iteration, solutions are represented in the form of a "chromosome", with selected wavelength bands positioned as "genes". The algorithm commences with a population of random solutions, termed the first generation. A fraction of these solutions, with the best "fitness" according to a pre-defined objective function are then selected to produce (i.e., undergo the mechanism of crossover and mutation) a second generation that consists of hybridized offspring of the first generation. Of this second generation, again the solutions with the highest fitness are selected to reproduce a third generation, and so on, until the improvement in fitness between subsequent generations levels off to a pre-set threshold. Parameters that have to be selected before starting the algorithm are the chromosome size (i.e., how many bands can be selected per solution), the population size (i.e., the number of solutions per generation), the fraction of a generation that is selected to be the "parents" for the new generation, and when to stop the algorithm. The reproduction operators, objective function, and selection mechanism are summarized in the next subsection, while the detailed practical implementation (step by step procedure) can be found in Goldberg [16]. The genetic algorithms script was written at the Faculty of Geo-Information Science and Earth Observation (ITC), the Netherlands.

Reproduction Operators
For problem solving, the selected chromosomes directly undergo crossover and mutation. In the crossover operation the two selected parent chromosomes merge and produce offspring (new chromosomes) that share the properties of both parents. A single point crossover was used in this study, where two parent chromosomes split into four segments (two segments per parent). Then the exchange of gene segments produces two offspring from every two parents. In mutation, a single gene (band, in this case) in the offspring chromosome is randomly altered and as a result the characteristics of the offspring differ from the parental chromosome combination.

Objective Function
An objective function is required to assign a value to each chromosome. The associated value of each chromosome is an indication how well it fits the solution it represents. The spectral angle mapper (SAM) nearest neighbour classifier was used to evaluate the fitness values (in this case the overall classification accuracy) of the chromosome population during the process of evolution. The SAM determines the spectral similarity between two spectra (i.e., target and reference) by calculating the angle between them in an n-dimensional space. To calculate the fitness function, half of the spectra of each species (17 spectra per species) were used for training purposes, and the remaining half for validation purposes. For each species, the average spectrum of training dataset was used as a reference spectrum.

Selection
On the basis of fitness value (i.e., the classification accuracy resulted from the SAM), the parent chromosomes were selected to reproduce offspring using random (roulette wheel) selection. The chromosomes with higher fitness values have a higher chance of being selected for reproduction and to generate a new chromosome.

Preliminary Parameters and Chromosome Size
The initial parameters were configured as follows: Population size = 1,000, maximum number of generations = 500, crossover probability = 1, probability of mutation = 0.01, elite count (i.e., the number of chromosomes with best fitness values in the current generation that are guaranteed to survive into the next generation; these chromosomes are called elite children) = 2.
In order to define the number of genes in a chromosome for maintaining high classification accuracy, the genetic algorithms were run with different gene numbers per chromosome. The minimum threshold for class separability (i.e., classification accuracy) was set to 85% [25]. The minimum number of genes in a chromosome that exceeded the defined threshold was five. There was little increase in the classification accuracy when the genetic algorithm was executed with chromosomes with six bands (Figure 2). Therefore, a chromosome with five bands was chosen for further analysis. The consistency of the genetic algorithms for discriminating vegetation species was checked by repeating the analysis 40 times. The data was reshuffled at the beginning of each run. The algorithms start with a random initial population and undergo selection (based on fitness score), crossover, mutation and elite count processes.

Evaluating the Performance of the Genetic Algorithm
The performance of the genetic algorithms in separating the species was assessed by using the Jeffries Matusita (J-M) distance [34]. The J-M distance is the average distance between two class density functions. The J-M distance takes into account the distance between class mean and the distribution of values from the means. Another advantage is that it can be executed over a number of bands (unlike M-statistics). The J-M distance is a parametric test, of which values range between 0 and 2, providing an easy comparison of class separability [1,3]. The J-M distance was calculated between each pair of species using the genetic algorithm based winner chromosome (using the bands selected on the basis of the genetic algorithm) as well as a randomly selected chromosome. Prior to conducting the tests, the distribution of the spectral emissivity values across selected waveband was tested for normality and the homogeneity of variance (homoscedasticity) was verified for every spectral band.
The average J-M distance between each species pair selected using the genetic algorithm's selected bands were compared with the average J-M distance derived from the randomly selected bands. The significance of difference in the J-M distances between the genetic algorithm based bands and randomly selected bands was tested using a t-test.

Length of the Chromosome
The results (Figure 2) compare the fitness score against chromosome size for the thirteen species. The minimum number of genes in a chromosome that exceeded the defined threshold (classification accuracy of 85%) was five. There was no substantial increase in the classification accuracy using a six, compared to a five, band chromosome ( Figure 2).

Band Pruning Based on Genetic Search Algorithms
Illustrating the process of evolution, Figure 3 shows the result of a single run. The vertical (y) axis represents the count of the genes selected, while the horizontal axis (x) represents the wavelength. At the beginning (1st generation) the population consisted of randomly selected genes from all wavebands, and as the evolution proceeded the bands started to converge. The overall classification accuracy using the winning chromosome genes are illustrated in Table 2. The results ( Table 2) show that the classification accuracies of the winning chromosome were above the set threshold (85%) for both training and testing datasets.  PL  RH  SP  TP  AP  AN  CS  FJ  GB  HH  IL  LS  PO  PL  17  The genetic algorithm was run 40 times to check consistency. The wining chromosomes along with classification accuracies (based on the SAM) are reported in Appendix 1. The fitness scores of all winning chromosomes were above the defined threshold (classification accuracy over 85%). The frequency of the selected genes showed genes clustering around certain wavebands (Figure 4). The high frequency occurring at certain wavebands represents that waveband's importance for the separating of species. The selected genes were grouped into eight waveband regions based on the mean and standard deviation ( Table 3). Five of those lie in the mid infrared (2.5-6 µm) and the remaining three regions belong to the thermal infrared (8-12 µm).  The eight waveband regions (where selected genes were grouped) correspond to the spectral wavebands positions of the Mid-wave infrared Airborne Spectrographic Imager (MASI600) and the Thermal infrared Airborne Spectrographic Imager (TASI600). The MASI600 and TASI600 are pushbroom hyperspectral sensors operating in the mid-wave infrared (3-5 µm) and thermal infrared (8-11.5 µm), having 64 continuous spectral bands. These sensors can acquire data at a maximum altitude of 3,048 m (above sea level). The spatial resolution varies between 1 m and 3.5 m (depending on the altitude of the platform) with a spatial coverage of 600 pixels. The first four waveband regions (B, C, D and E) correspond to the wavebands of MASI600 and the last three regions (F, G and H) lay within the spectral range of TASI600.

Evaluation of the Performance of Genetic Algorithm
The Jeffries Matusita (J-M) distances between different species pairs calculated using the bands selected by the genetic algorithm, were compared with the randomly selected bands. The five selected bands (resulting from the genetic algorithms and the random selection) were used to calculate the J-M distance between each species. The average J-M distance values of genetic algorithm based selected bands were higher than the value of randomly selected bands. The result of the t-test (Table 4) confirms that the differences between most J-M distances (74 out of 78 ≈95%), based on genetic algorithms and random selection, are statistically significant at a 95% confidence level (p ≤ 0.05).
The classification accuracy based on genetic algorithms selected bands was higher than results obtained by Ullah et al. [29]. They used One-way analysis of variance (ANOVA) coupled with a post-hoc Tuckey HSD test. The spectral features (bands resulting in the highest number of statistically significantly different pairs) were then manually selected. In this study, the genetic algorithms selected the bands, further improving the classification accuracy.

Discussion
This study tested the applicability of genetic algorithms for the selection of bands from the mid and thermal infrared emissivity spectra to discern thirteen vegetation species. The visible to shortwave infrared domain have been widely used for discriminating vegetation species, but mid to thermal infrared emissivity spectra have received little attention. The outcome of the study (Table 2 and Appendix 1) demonstrated that the genetic algorithm based selected bands (subset of five bands) achieved an overall accuracy of more than 85%.
The improved classification accuracy of the bands selected by genetic algorithms compared to the randomly selected bands could be attributed to the fact that genetic algorithms provide several possible solutions, evaluate them on the basis of an objective function and pick the best one for the next generation.
The validity of the combination of genetic algorithm based selected bands used for the spectral discrimination of vegetation species in the mid to thermal infrared emissivity spectra may be attributed to the spectral positioning of the selected bands. The emissivity spectra of the different plant species contain unique features due to the variation in physio-chemical composition of the superficial epidermal layer of the plant leaves. The emissivity signature of plant leaves is dominated by a feature associated with major classes of cellulous of the epidermis [26][27][28][35][36][37][38]. The selected waveband positions, between 2.5 to 6 µm, may be attributed to the physical makeup of the surface, as well as the water and chemical content of different plant leaves [27,39,40]. The clustering of the winning genes at around 3.00 µm may be due to OH band stretching and bending in the water molecule [26,27,40]. The selection of bands at the wavelength position of 3.44 µm may be due to the presence of different amounts of nonacosane (a compound in wax occurring on the leaf surface), as a result of the stretching of the CH 2 bond of methylene in leaf surface waxes [41][42][43]. The stretching of carbonyl group (C=O) in ester has been linked to a spectral features at 5.80 µm [43,44]. Different amounts of leaf cutin and cutan (which are composed of esterified monomers) may be linked to the selection by the genetic algorithm of features at 5.80-5.92 µm (Figure 4). The bands selected between 9.40-9.70 µm ( Figure 3) could be attributed to cellulose thickness, creating two prominent features at 9.47 µm and 9.68 µm, associated with the C-O band stretching [26,41]. The next spectral region winner bands were selected from (mean at 9.87 µm and standard deviation ±0.121 µm) may have resulted from differences in hemicellulose and other pectins [45,46]. The winning gene clustering at 11.50 µm (mean 11.50 and standard deviation ±0.121 µm, Figure 4) may have resulted from the presence of different aromatic compounds in the plant species [27].
Discriminating vegetation species using laboratory measured emissivity spectra is prerequisite for the future vegetation mapping campaigns from air-borne and space-borne data. However, there are a number of problems associated with extending this work to field level. The calibration of remotely sensed signals in the MIR (around 3 µm) is complicated by the difficulty associated with the overlap of reflected and emitted energy in the MIR. Other problems associated with field condition are the distance between target and sensor, spectral and spatial resolution, atmospheric condition, and seasonal changes. The cavity effect of plant leaves causes blackbody emittance in the TIR and reduces spectral contrast in the signal. The cavity effect problem is noticeable in small and needle leaved species and also in species with funnel-like leaf arrangements [28]. One could extend this study to a field, airborne, and space-borne by using a sensing system with high signal to noise ratio (SNR) that allows small spectral differences in plant to be characterized.

Conclusions
This study has demonstrated the potential of genetic algorithms as band selectors using high resolution mid to thermal infrared emissivity spectra to differentiate between vegetation species at laboratory level. It is concluded that the bands selected by genetic algorithms are more useful for discriminating vegetation species than randomly selected bands are, when using laboratory emissivity spectra. The genetic algorithm based selected bands were actually found to have potential for floristic mapping. Bands selected with genetic algorithms may correspond to physiochemical characteristics of vegetation leaves (as seen in the previous studies) as leaves of different species possess unique surface materials. The genetic algorithm based selected bands help to understand the section of the electromagnetic spectrum that has a high potential for discriminating vegetation types, which may be useful when designing new sensors for vegetation studies. The outcome of this study is that the genetic algorithm band selection procedure can differentiate between plant species using laboratory measured thermal emission spectra. It would be very interesting to extend this work to the field and at airborne level with the advancement of hyperspectral thermal infrared sensors.