Evaluation of Informative Bands Used in Different PLS Regressions for Estimating Leaf Biochemical Contents from Hyperspectral Reflectance

Partial least squares (PLS) regression models are widely applied in spectroscopy to estimate biochemical components through hyperspectral reflected information. To build PLS regression models based on informative spectral bands, rather than strongly collinear bands contained in the full spectrum, is essential for upholding the performance of models. Yet no consensus has ever been reached on how to select informative bands, even though many techniques have been proposed for estimating plant properties using the vast array of hyperspectral reflectance. In this study, we designed a series of virtual experiments by introducing a dummy variable (C d) with convertible specific absorption coefficients (SAC) into the well-accepted leaf reflectance PROSPECT-4 model for evaluating popularly adopted informative bands selection techniques, including stepwise-PLS, genetic algorithms PLS (GA-PLS) and PLS with uninformative variable elimination (UVE-PLS). Such virtual experiments have clearly defined responsible wavelength regions related to the dummy input variable, providing objective criteria for model evaluation. Results indicated that although all three techniques examined may estimate leaf biochemical contents efficiently, in most cases the selected bands, unfortunately, did not exactly match known absorption features, casting doubts on their general applicability. The GA-PLS approach was comparatively more efficient at accurately locating the informative bands (with physical and biochemical mechanisms) for estimating leaf biochemical properties and is, therefore, recommended for further applications. Through this study, we have provided objective evaluations of the potential of PLS regressions, which should help to understand the pros and cons of PLS regression models for estimating vegetation biochemical parameters.


Introduction
Partial least squares (PLS) regression, a traditional linear statistical approach [1], is suitable for analyzing multi-collinear spectral datasets and has the ability to make full use of redundant information [2]. To date, various PLS regression (PLSR) models have been built to estimate plant biochemical and biophysical variables from hyperspectral remote sensing data [3-10], largely without regard to the serious collinearity among different bands. Many researchers have claimed the superiority of PLSR over other regression techniques, including vegetation indices, multiple linear regression, stepwise regression, and principal component regression [4,11-15]. However, previous studies have also revealed that the wavelengths used by PLSR models may contain irrelevant information, which would dramatically reduce the models' generality and predictive ability [16].
Due to the high collinearity or correlations among the reflectance values at different bands, one of the inherent challenges with the PLS approach, with so many wavelength variables, is the risk of overfitting [2]. It has been demonstrated in many previous studies, both experimentally and theoretically, that the performance of PLSR can be tremendously improved if only informative variables are included in the model [2,16-20]. After those variables which contain irrelevant or redundant information are removed, the robustness of the calibration models can be enhanced [21].
However, despite so many techniques having been proposed, no consensus has yet been reached on how to select informative bands for plant properties estimation among the vast array of hyperspectral remote sensing data, making physical and biochemical interpretations of selected informative bands nearly impossible. Taking chlorophyll as an example, diverse bands have been proposed using PLS regressions. For instance, the bands of 405, 435, 470, 525, 570, 630, 645, 660, 700, and 780 nm for chlorophyll a and the bands of 405, 435, 470, 505, 525, 570, 590, 645, 660, and 700 nm for chlorophyll b content in soybean leaves were used in [10], as well as the bands of 503, 551, 690, 717, and 770 nm for total chlorophyll of pepper leaves in [9]. In contrast, the bands of 460, 470, 480, 530, 540, 550, 730, 740, and 750 nm for chlorophyll were used in [3]. Furthermore, our recent research on informative spectral band selection for PLS models to estimate foliar chlorophyll content in four independent field-measured datasets clearly revealed that the bands finally picked up were inconsistent among the four datasets [28]. Thus, selecting mechanism-based informative bands remains a critical problem in ensuring the robustness of PLSR.
Clearly, the inconsistency of the reported informative bands resulted primarily from the particularity or sample size limitations of these studies. The success of statistical methods to assess plant parameters from optical properties also greatly depends upon the quality of the datasets used. Models calibrated based on small or specialized datasets often perform poorly when applied to other datasets [6]. Another important point worth mentioning is that chlorophyll absorbs light throughout the whole spectral region of 400-750 nm [29], making it difficult to judge the informativeness of one particular band. Furthermore, because of the absorption overlap with other pigments (e.g., carotenoids, anthocyanin), it is difficult to consider which variables are responsible for the property of interest from the aspect of model interpretation [16].
To address this problem, we have generated a rather comprehensive dataset for calibrating PLSR models, based on so-called hybrid methods, combining physically-based models to develop statistical models [6,30-32]. The physical law-based radiative transfer models are not only available to generate datasets containing numerous samples [6,30-34], but are also important tools for revealing the underlying relationships between vegetation biochemical parameters and reflectance in a more systematic way [35].
To assess the performance of different variable selection/elimination approaches to locate the optimal wavelengths, we designed virtual experiments based on the well-accepted leaf reflectance model PROSPECT-4 [29,36] by introducing a dummy variable (C d ) with convertible specific absorption coefficients (SAC) into the original version. The modified model enables us to identify the responsible wavelength regions related to the dummy input variable and will, therefore, provide an objective evaluation of the physical and biochemical mechanisms of informative bands later selected by different approaches. This is achieved through a series of virtual datasets based on the modified PROSPECT-4 model by changing the SAC (location, intensity, width) of C d , providing objective criteria for distinguishing mechanism-based informative bands. Based on these simulated datasets, we established PLSR models with bands selected by three commonly applied techniques, including: (1) the model prediction-based stepwise regression method (stepwise-PLS); (2) biological evolution Remote Sens. 2019, 11,197 3 of 15 theory-based genetic algorithms (GA-PLS); and (3) PLS regression coefficients-based uninformative variable elimination method (UVE-PLS). By comparing the bands selected with previously defined absorption features of C d , we were able to evaluate the efficiency of each method in locating the mechanism-holding wavelengths. We aimed to provide a comprehensive evaluation of the informativeness of selected bands in PLS models for a better understanding of PLS regression performance for vegetation biochemical parameters estimation.

PROSPECT-4 Modification
The PROSPECT-4 model is a well-accepted leaf-scale radiative transfer model, which considers the leaf as a succession of absorbing layers and simulates leaf hemispherical reflectance and transmittance between 400 and 2500 nm with a 1-nm step, as a function of the leaf structure parameter (N), leaf chlorophyll content (C ab , µg/cm²), leaf water content (C w , g/cm²) and leaf dry matter content (C m , g/cm²) [29-31]. Inside the model, the reflectance is calculated using the specific absorption coefficient, K, of each component, which depends on the wavelength [31]. The model has been widely applied in numerous studies [6,30-34].
In this study, we added a dummy variable (C d ) into the original PROSPECT-4 model. By tweaking different SACs of C d , we could, therefore, theoretically compare the properties of the selected wavelengths from different variable selection/elimination approaches for PLSR.
With the artificially added dummy variable C d and its specific SACs, the total absorption coefficient at each wavelength λ, k(λ), for one layer in the modified PROSPECT-4 model was calculated as follows:
$$k(\lambda) = \frac{C_{ab} K_{cab}(\lambda) + C_{w} K_{w}(\lambda) + C_{m} K_{m}(\lambda) + C_{d} K_{d}(\lambda)}{N}$$
where N is the leaf structure parameter, K cab (λ), K w (λ), and K m (λ) are the specific absorption coefficients at wavelength λ of total chlorophyll, water, and dry matter, respectively, which are already included in the original PROSPECT-4. K d (λ) is the specific absorption coefficient of the artificially added dummy variable C d . The SACs of the dummy variable (K d ) were generated using the following Gaussian function:
$$K_{d}(\lambda) = a \, e^{-\frac{(\lambda - b)^{2}}{2c^{2}}}$$
where e is Euler's number, a is the height of the absorption peak, b is the wavelength of the absorption peak, and c is the standard deviation which controls the width of absorption bands, whose value was generated from the predetermined half-width (W) [37]. The absorption of C d was limited to the wavelength region between b−1.5×W and b+1.5×W. The K d values beyond this wavelength region were set to 0.
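The construction of the dummy SAC and of the layer absorption coefficient can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names `dummy_sac` and `total_absorption` are hypothetical, and the conversion from the half-width W to the Gaussian standard deviation c is not recoverable from the text, so the sketch assumes that W is the full width at half maximum.

```python
import numpy as np

def dummy_sac(wavelengths, a, b, half_width):
    """Gaussian specific absorption coefficient K_d, truncated at b +/- 1.5 * W."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    # ASSUMPTION: W is treated as the full width at half maximum of the Gaussian.
    c = half_width / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    k_d = a * np.exp(-((wavelengths - b) ** 2) / (2.0 * c ** 2))
    # K_d is set to 0 outside [b - 1.5 W, b + 1.5 W], as stated in the text.
    k_d[np.abs(wavelengths - b) > 1.5 * half_width] = 0.0
    return k_d

def total_absorption(n_struct, contents, sacs):
    """k(lambda) = sum_i C_i * K_i(lambda) / N for one elementary layer."""
    total = sum(contents[name] * np.asarray(sacs[name], dtype=float)
                for name in contents)
    return total / n_struct

wl = np.arange(400, 801)  # 1-nm grid over 400-800 nm
k_d = dummy_sac(wl, a=0.2, b=450.0, half_width=10.0)
```

With a peak at 450 nm and W = 10 nm, the non-zero part of `k_d` spans 435-465 nm, which matches the affected domain quoted in the results.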

Experimental Design and Database Generation
Various SACs of C d , with different combinations of absorption peak locations, peak values, and half-widths, were used to produce different specific absorption coefficients. For simplification, we set the central wavelengths of the absorption peaks at 450 nm, 550 nm, and 680 nm, corresponding to the absorption peak of chlorophyll within the blue region (450 nm), the minimum absorption of chlorophyll within the green region (550 nm), and the absorption peak of chlorophyll within the red region (680 nm) [29,38]. Such treatments should cover most cases for biochemical components against the background of chlorophyll, which shapes the basic reflectance pattern within the wavelength domain of 400 to 800 nm. Furthermore, the absorption peak values of C d were set to 0.02, 0.04, 0.06, 0.1, 0.2, and 0.3 cm²/µg, while the half-widths were set to 10 nm, 30 nm, and 50 nm.
For each combination of absorption peak location, absorption peak value, and half-width, we built a database of 500 leaf reflectance spectra simulated with the modified PROSPECT-4 model, using parameters generated according to their actual distributions across all vegetation types contained in the Leaf Optical Properties Experiment (LOPEX) database [39]. Based on the means (1.67 for N, 47.28 µg/cm² for C ab , 0.0114 g/cm² for C w , and 0.0054 g/cm² for C m ) and standard deviations (0.33 for N, 17.30 µg/cm² for C ab , 0.0069 g/cm² for C w , and 0.0025 g/cm² for C m ) calculated from the LOPEX dataset, the parameters N, C ab , and C m were randomly drawn from normal distributions, while the values of C w followed a log-normal distribution [6]. For the newly added dummy variable C d , we assigned a normal distribution with a mean of 11.83 µg/cm² and a standard deviation of 4.32 µg/cm² (assuming that the C d values were 0.25×C ab ).
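The parameter sampling described above might be sketched as below. The function names are hypothetical, and two details are assumptions because the text does not specify them: the log-normal distribution of C w is parameterized by moment matching to the reported mean and standard deviation, and no truncation is applied to the normal draws (which can occasionally yield implausible values such as negative contents).

```python
import numpy as np

def lognormal_params(mean, sd):
    """Moment-match a log-normal distribution to a given mean and SD (assumption)."""
    sigma2 = np.log(1.0 + (sd / mean) ** 2)
    return np.log(mean) - 0.5 * sigma2, np.sqrt(sigma2)

def sample_leaf_parameters(n_samples=500, seed=0):
    """Draw PROSPECT-4 inputs plus the dummy variable from the reported statistics."""
    rng = np.random.default_rng(seed)
    mu_w, sigma_w = lognormal_params(0.0114, 0.0069)
    return {
        "N": rng.normal(1.67, 0.33, n_samples),          # leaf structure parameter
        "C_ab": rng.normal(47.28, 17.30, n_samples),     # chlorophyll, ug/cm^2
        "C_w": rng.lognormal(mu_w, sigma_w, n_samples),  # water, g/cm^2
        "C_m": rng.normal(0.0054, 0.0025, n_samples),    # dry matter, g/cm^2
        "C_d": rng.normal(11.83, 4.32, n_samples),       # dummy variable, ug/cm^2
    }

params = sample_leaf_parameters(500, seed=1)
```

In practice one would probably reject or clip draws outside the physically meaningful range before running the model; that step is left out here because the source does not describe it.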

PLS Analysis
For computational efficiency, we limited the spectral range to 400-800 nm, covering the spectral regions of chlorophyll and dummy variable absorption [31,40-43]. For the same reason, we further reduced the spectral resolution to 5 nm (resampled with a moving average filter). Three commonly applied variable selection/elimination approaches (stepwise, genetic algorithms and uninformative variable elimination) were coupled with PLS models to compare their ability to locate informative bands for leaf biochemical parameter estimation.
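A possible way to carry out the 5-nm resampling is sketched below. The helper `resample_5nm` is hypothetical; the window length (equal to the step) and the edge padding are assumptions, since the text only states that a moving average filter was used.

```python
import numpy as np

def resample_5nm(wavelengths, reflectance, step=5):
    """Smooth each spectrum with a centered moving average, then keep every `step`-th band.

    reflectance has shape (n_samples, n_bands) on a 1-nm grid.
    """
    half = step // 2
    kernel = np.ones(step) / step
    # Repeat the edge values so the first and last bands keep a full window.
    padded = np.pad(reflectance, ((0, 0), (half, half)), mode="edge")
    smoothed = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), 1, padded)
    return wavelengths[::step], smoothed[:, ::step]

wl_1nm = np.arange(400, 801)
spectra = 0.001 * wl_1nm[np.newaxis, :]  # toy spectrum, linear in wavelength
wl_5nm, spectra_5nm = resample_5nm(wl_1nm, spectra)
```

On a 400-800 nm grid this yields 81 bands centered at 400, 405, ..., 800 nm, consistent with the band positions reported in the results.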

Stepwise-PLS
Stepwise selection is the simplest and most pragmatic search method, in which subsequent variables are selected stepwise by their capability to improve a multiple linear regression (MLR) model [26,44-46]. Stepwise regression is a systematic method for adding or removing variables from a multilinear model based on their statistical significance in a regression. In this method, the P value of an F-statistic is computed to test models with and without a potential variable at each step [47]. In this study, the bidirectional elimination approach, a combination of forward selection and backward elimination, was applied for stepwise regression [48]. The maximum P values for a spectral band to be included or removed were defined as 0.05 and 0.10, respectively.
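The bidirectional procedure can be illustrated with the sketch below, which covers only the band selection step (the selected bands would then be passed to a PLS model). It is an assumption-laden illustration, not the authors' code: the function names are hypothetical, the partial F-test is computed from ordinary least squares residuals, and `scipy.stats.f.sf` supplies the P value.

```python
import numpy as np
from scipy import stats

def _rss(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(resid @ resid)

def _partial_f_pvalue(X, y, cols, j):
    """P value of the F-test comparing the model on `cols` with and without column j."""
    reduced = [c for c in cols if c != j]
    rss_full, rss_reduced = _rss(X, y, cols), _rss(X, y, reduced)
    df = len(y) - len(cols) - 1
    f_stat = (rss_reduced - rss_full) / (rss_full / df)
    return float(stats.f.sf(f_stat, 1, df))

def stepwise_select(X, y, p_enter=0.05, p_remove=0.10, max_steps=100):
    """Bidirectional stepwise selection with entry/removal P value thresholds."""
    selected = []
    for _ in range(max_steps):
        changed = False
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if candidates:  # forward step: add the most significant candidate
            pvals = [_partial_f_pvalue(X, y, selected + [j], j) for j in candidates]
            best = int(np.argmin(pvals))
            if pvals[best] < p_enter:
                selected.append(candidates[best])
                changed = True
        if selected:  # backward step: drop the least significant selected band
            pvals = [_partial_f_pvalue(X, y, selected, j) for j in selected]
            worst = int(np.argmax(pvals))
            if pvals[worst] > p_remove:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return sorted(selected)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 6))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + 0.1 * rng.normal(size=60)
bands = stepwise_select(X_demo, y_demo)
```

Because each band is tested at the 0.05 level, uninformative bands can still enter by chance, which is consistent with the non-affected bands reported for stepwise-PLS in the results.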

GA-PLS
Unlike statistical significance-based variable selection used in stepwise selection, genetic algorithms (GA) are developed on the basis of biological evolution theory and natural selection [17,49], and have significant effects on the band selection of the PLS model [50-52]. Following [22], the main steps of GA-PLS include:

1. Forming an initial population of variable sets randomly;
2. Fitting a PLS regression model to each variable set, and then evaluating the performance with leave-one-out cross-validation;
3. Selecting a collection of variable sets with higher performance to survive until the next "generation";
4. Generating new variable sets by crossover (50% probability in this study) and mutation (1% probability in this study) for each variable;
5. Using the surviving and modified variable sets as inputs in step 2, and repeating steps 2-5 for a preset number of times (200 in this study).
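The five steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: `pls1_fit`, `loo_rmse`, and `ga_pls` are hypothetical names, PLS is implemented as a basic single-response NIPALS routine, half of the population survives each generation, crossover is interpreted as uniform crossover per band, and the default number of generations is far below the 200 used in the paper to keep the example fast.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS); return (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)
        t = Xk @ w
        p = Xk.T @ t / (t @ t)
        qk = (yk @ t) / (t @ t)
        Xk = Xk - np.outer(t, p)  # deflate X and y
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

def loo_rmse(X, y, mask, n_components):
    """Leave-one-out RMSE of a PLS model restricted to the bands in `mask`."""
    if not mask.any():
        return np.inf
    Xs = X[:, mask]
    k = min(n_components, Xs.shape[1])
    errors = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, b0 = pls1_fit(Xs[keep], y[keep], k)
        errors[i] = y[i] - (Xs[i] @ beta + b0)
    return float(np.sqrt(np.mean(errors ** 2)))

def ga_pls(X, y, n_components=2, pop_size=20, n_generations=10,
           p_cross=0.5, p_mut=0.01, seed=0):
    """Return (best band mask, its leave-one-out RMSE)."""
    rng = np.random.default_rng(seed)
    n_bands = X.shape[1]
    population = rng.random((pop_size, n_bands)) < 0.5            # step 1
    for _ in range(n_generations):                                # step 5
        scores = np.array([loo_rmse(X, y, m, n_components)
                           for m in population])                  # step 2
        survivors = population[np.argsort(scores)[: pop_size // 2]]  # step 3
        children = []
        while len(survivors) + len(children) < pop_size:          # step 4
            pa, pb = survivors[rng.integers(len(survivors), size=2)]
            child = np.where(rng.random(n_bands) < p_cross, pb, pa)
            child = child ^ (rng.random(n_bands) < p_mut)
            children.append(child)
        population = np.vstack([survivors, np.array(children)])
    scores = np.array([loo_rmse(X, y, m, n_components) for m in population])
    best = int(np.argmin(scores))
    return population[best], float(scores[best])

rng_demo = np.random.default_rng(0)
X_ga = rng_demo.normal(size=(25, 8))
y_ga = 2.0 * X_ga[:, 2] + 0.1 * rng_demo.normal(size=25)
best_mask, best_rmse = ga_pls(X_ga, y_ga, pop_size=10, n_generations=3)
```

Keeping the survivors unchanged between generations means the best cross-validated error never gets worse, at the cost of somewhat slower exploration than schemes that replace the whole population.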

Uninformative Variable Elimination with PLS (UVE-PLS)
The UVE-PLS approach was proposed in [23] using a reliability criterion calculated from the PLS regression coefficients to evaluate the informativeness of each variable. The regression coefficient matrix was calculated through a leave-one-out validation, and the reliability criterion, c λ , of band λ was determined by the ratio of the mean value of the regression coefficient, b λ , and its standard deviation, std(b λ ), as:
$$c_{\lambda} = \frac{\mathrm{mean}(b_{\lambda})}{\mathrm{std}(b_{\lambda})}$$
The elimination threshold was estimated by adding an artificial normally distributed random variable matrix with a very small amplitude to the original data [24]. The highest absolute value of the reliability criterion of all artificial variables was defined as the cut-off level [23].
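The reliability criterion and cut-off can be illustrated with the sketch below. It is not the authors' code: `pls1_fit` and `uve_select` are hypothetical names, PLS is a basic single-response NIPALS routine, and the noise amplitude (1e-10) and the number of artificial variables (equal to the number of bands) are assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model (NIPALS); return (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)
        t = Xk @ w
        p = Xk.T @ t / (t @ t)
        qk = (yk @ t) / (t @ t)
        Xk = Xk - np.outer(t, p)
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

def uve_select(X, y, n_components, rng, noise_scale=1e-10):
    """Return (boolean mask of retained bands, reliability c per band, cut-off)."""
    n, p = X.shape
    # Append artificial random variables of very small amplitude (assumed scale).
    augmented = np.hstack([X, noise_scale * rng.normal(size=(n, p))])
    coefs = np.empty((n, 2 * p))
    for i in range(n):  # leave-one-out coefficient matrix
        keep = np.arange(n) != i
        coefs[i], _ = pls1_fit(augmented[keep], y[keep], n_components)
    c = coefs.mean(axis=0) / coefs.std(axis=0, ddof=1)
    cutoff = np.abs(c[p:]).max()  # largest |c| among the artificial variables
    return np.abs(c[:p]) > cutoff, c[:p], cutoff

rng = np.random.default_rng(0)
X_uve = rng.normal(size=(40, 5))
y_uve = 5.0 * X_uve[:, 0] + 0.05 * rng.normal(size=40)
selected, reliability, cutoff = uve_select(X_uve, y_uve, n_components=2, rng=rng)
```

Since the cut-off depends on one random draw of artificial variables, the set of retained bands can change between runs, which may partly explain why UVE-PLS retained non-affected bands in the experiments.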

Evaluation of Different PLSR Models
The commonly used statistical criteria, the normalized root-mean-square error (NRMSE, which is the RMSE normalized using the mean value) and the coefficient of determination (R²), were used in this study to evaluate the estimation accuracy of the PLSR models. However, as the bands involved in different PLSR models could vary, the corrected Akaike information criterion (AICc) [53,54] was instead used as the primary criterion to evaluate the goodness-of-fit of different PLSR models.
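The three criteria might be computed as in the sketch below. The exact formulas used in the study are not given in the text, so these are assumptions: R² is taken as 1 − RSS/TSS, AICc uses the least-squares form n·ln(RSS/n) + 2k with the small-sample correction 2k(k+1)/(n−k−1), and the number of parameters k is left to the caller.

```python
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE divided by the mean of the observed values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)) / np.mean(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - RSS / TSS (assumed definition)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - rss / tss)

def aicc(y_true, y_pred, n_params):
    """Corrected AIC for a least-squares fit with n_params parameters (requires n > n_params + 1)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    aic = n * np.log(rss / n) + 2 * n_params
    return float(aic + 2 * n_params * (n_params + 1) / (n - n_params - 1))

obs = [1.0, 2.0, 3.0, 4.0]
est = [1.1, 1.9, 3.2, 3.8]
scores = (nrmse(obs, est), r_squared(obs, est), aicc(obs, est, n_params=1))
```

Unlike NRMSE and R², AICc penalizes models that use more bands, which is why it is suited to comparing models built on band subsets of different sizes.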

Informative Bands Selected for PLSR Models under Different Absorption Peak Locations
To illustrate the performance of different variable selection/elimination approaches for locating the informative bands, we have specifically presented the bands selected for C d estimation under distinct absorption peak locations with a narrow absorption half-width (10 nm) in Figure 1. This specific dataset used for PLSR model calibration was generated using a fixed absorption peak intensity of 0.20 cm²/µg, which is higher than the SAC peaks of C ab throughout the spectral region of 400-800 nm. Results clearly indicated that when the absorption peak of C d was located within the domain with strong chlorophyll absorption (450 nm), the correlation coefficient (r) of C d and reflectance varied from 0.00 to −0.12 within the wavelength domain of 400 to 800 nm being examined, and its global extremum appeared at 450 nm. It is worth noting that the informative bands identified by both stepwise-PLS and GA-PLS approaches (450 nm) fell exactly within the wavelength domain that was affected by C d (435-465 nm). By comparison, many non-affected bands of C d (beyond the domain of 435 to 465 nm) were also picked up by the UVE-PLS method, besides the identified informative bands of 450 to 465 nm.
Alternatively, when the absorption peak of C d was located near 550 nm, a domain with gentle chlorophyll absorption, the correlation coefficient (r) between C d and reflectance reached −0.62 around 550 nm. Furthermore, the informative bands within the absorption region of C d were successfully captured by all three approaches. Unfortunately, several redundant, apparently non-affected, bands within 700-800 nm were also included. Furthermore, bands within 580-700 nm, which were also beyond the affected domain, were used in the UVE-PLS models.
In the case of the absorption peak location of C d being fixed to 680 nm, the extremum of the correlation coefficient (r) between C d and reflectance appeared at 680 nm (with a value of −0.28). Again, the informative bands affected by C d (665-695 nm) were captured in all the PLS models. However, besides these C d -affected bands, the stepwise-PLS and UVE-PLS models also used non-affected bands within 400-650 nm.

Informative Bands Selected for PLSR Models under Different Absorption Intensities
The informative bands involved in the PLSR models also varied with the absorption intensity of C d , as illustrated in Figure 2, which shows the informative bands picked up by different approaches under varying absorption intensities of C d . Here we have only presented the cases when the absorption peak location of C d was set to 680 nm, which is around the absorption peak of chlorophyll within the red region. The absorption half-width of C d was set to 30 nm in the meantime. Results showed that the extrema of the correlation coefficient of C d and reflectance were located around the peaks of K d /K Cab (the ratio between the SAC of the artificially added dummy variable and the SAC of chlorophyll). The strongest correlation coefficient of C d and reflectance was −0.15 at 695 nm when the absorption intensity of C d was set to 0.02 cm²/µg, while it was −0.29 at the same location (695 nm) when the absorption intensity of C d was set to 0.06 cm²/µg, and further strengthened to −0.39 at 700 nm when the absorption intensity of C d was set to 0.10 cm²/µg.
Although informative bands within the absorption regions of C d , and thus with physiochemical mechanisms, were selected to a greater or lesser extent by all the variable selection/elimination approaches examined, a large set of bands (with correlation coefficient (r) values for C d and reflectance of < 0.10) outside the wavelengths affected by C d , and thus lacking physiochemical mechanisms, were also used in the PLS models, especially the stepwise-PLS and UVE-PLS models.

Informative Bands Selected for PLSR Models under Different Absorption Half-Widths
We also investigated the selection results for C d with different absorption half-widths of 10 nm, 30 nm, and 50 nm (Figure 3). Here the absorption peak location of C d was fixed to 550 nm, with weak disturbance of chlorophyll, and the absorption intensity was set to 0.06 cm²/µg, a value approximating the SAC peak value of chlorophyll in the red region. The highest correlation coefficients between C d and reflectance were found at 550 nm, with the same value of −0.53 for all three cases under different absorption half-widths. For C d with the narrowest absorption half-width (10 nm), the C d -affected bands were found within the range of 535-563 nm and were involved in all the stepwise-PLS, GA-PLS and UVE-PLS models. However, besides these bands that were clearly affected by the dummy variable, all approaches used many more bands that were supposed to lack physiochemical mechanisms. By comparison, the stepwise-PLS models used more non-affected bands than the GA-PLS and UVE-PLS models. With the increase of absorption half-widths, broader wavelength regions with high correlation coefficient (r) values were identified, but none of the approaches examined eliminated the non-affected bands.

Statistical Criteria of Different PLSR Models for Estimating C d
According to the statistical criteria of NRMSE, R² and AICc, as presented in Table 1, all calibrated PLSR models efficiently estimated C d in most cases when the absorption peak location of C d was set to 550 nm or 680 nm, even when the bands selected were not strictly located within the affected bands of C d . However, when the absorption peak location of C d was set to 450 nm, because even low contents of C d are sufficient to saturate absorption due to the strong absorption of chlorophyll within the blue band, calibrated PLS models were far less efficient compared with those cases when the absorption peak locations of C d were set to 550 nm or 680 nm. Furthermore, when the half-width of C d was set to broader values, calibrated PLSR models performed better in tracing C d , as more informative bands were involved in PLSR models. In comparison, when the absorption peak locations of the dummy variable were set to 550 nm and 680 nm, regardless of the absorption half-widths and intensities, all calibrated PLSR models efficiently traced C d according to the statistical criteria of NRMSE and R².
We further examined the bands involved in these models, and the results are illustrated in Figure 4. It is apparent that many bands beyond the absorption regions of C d , and clearly not affected by the dummy variable, were selected by the stepwise-PLS and UVE-PLS methods for all cases. Compared with stepwise-PLS and UVE-PLS, the bands selected by the GA-PLS method better matched the defined absorption regions of C d , suggesting that they were more robust and had physiochemical mechanisms. Overall, the results revealed that the GA-PLS method, developed on the basis of biological evolution theory, was more efficient at locating the mechanism-supported informative bands for PLSR estimation of leaf biochemical properties than the other two approaches, stepwise-PLS and UVE-PLS.

The correlation patterns shown in Figure 5a used the same data as in Figure 1b. The reflectance values at the absorption peak (550 nm) were only weakly correlated (r < 0.40) with the reflectance values within the bands of 400-515 nm and 625-800 nm. However, the reflectance values around 540 nm and 560 nm were highly correlated (r > 0.60) with bands throughout the wavelength domain of 465 to 745 nm. The correlation coefficients between reflectance at Cd's weak absorption bands (540 and 560 nm) and bands within the range of 685-725 nm even exceeded 0.80.
The correlation coefficient map shown in Figure 5b illustrates the same specific case as in Figure 3c. The correlation coefficients between reflectance at Cd's absorption peak (550 nm) and bands within the range of 500-740 nm were all > 0.60. Moreover, the correlation coefficients between reflectance bands around 510 nm (within the absorption band region of Cd) and those within the range of 415-705 nm were even > 0.80, with the highest > 0.95 (between the reflectance band at 510 nm and bands within 585-695 nm).
High correlations between the reflectance at Cd-affected bands and non-affected bands were also identified, as shown in Figure 5c, in which the reflectance bands within the range of 625-685 nm were highly correlated (r > 0.70) with those within the range of 400-700 nm. In addition, the reflectance bands within the range of 690-725 nm were also highly correlated (r > 0.70) with those within the range of 520-620 nm. Furthermore, the correlation coefficients between the reflectance band at 725 nm and bands within the range of 730-800 nm were even > 0.82. Such highly correlated bands (most of them theoretically unaffected by the dummy variable) therefore become involved in the PLSR models, especially the stepwise-PLS and UVE-PLS models, and this collinearity is likely the primary reason that non-affected bands are used in PLSR models.
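The kind of band-to-band correlation map discussed above can be sketched on synthetic data. The example below is a toy illustration only: it does not use PROSPECT-4, and the spectral model (a leaf-wide brightness factor shared by all bands, minus a Gaussian absorption feature at 550 nm scaled by Cd), the 5 nm sampling, the noise level and the 60 nm cut-off for "non-affected" bands are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(1)
wavelengths = np.arange(400, 801, 5)   # nm, 81 bands
n_leaves = 200

# Toy spectra (NOT PROSPECT-4): shared brightness minus a Cd absorption feature.
brightness = rng.normal(0.4, 0.05, size=(n_leaves, 1))
cd = rng.uniform(0.0, 1.0, size=(n_leaves, 1))
absorption = np.exp(-0.5 * ((wavelengths - 550) / 15.0) ** 2)
noise = rng.normal(0.0, 0.005, size=(n_leaves, wavelengths.size))
reflectance = brightness - 0.1 * cd * absorption + noise

# Band-by-band Pearson correlation matrix (bands are columns).
corr = np.corrcoef(reflectance, rowvar=False)

# Bands far from the absorption feature that still correlate strongly with
# the 550 nm peak band, purely through the shared brightness factor.
peak = int(np.where(wavelengths == 550)[0][0])
outside = np.abs(wavelengths - 550) > 60
strong = outside & (corr[peak] > 0.7)
print(f"{strong.sum()} of {outside.sum()} non-affected bands have r > 0.7 with 550 nm")
```

Even in this simple setting, bands that carry no Cd signal are strongly correlated with the absorption peak, so a purely data-driven selector has little basis for preferring the mechanistically relevant band.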

Informative Bands Selected by Different Methods for Leaf Biochemical Parameters
Building PLSR models from informative spectral bands, rather than using full bands, could refine the performance of PLS analysis in field spectroscopy [2]. In this study, we investigated three different band selection methods for setting up PLSR models to retrieve Cd, which had clearly defined absorption features. It was apparent, in most cases, that the GA-PLS method outperformed the other two methods in locating the mechanism-based informative bands (i.e., the bands affected by the absorption of the dummy variable in this study).
Stepwise regression is a popular statistical technique for choosing terms to include in a regression model for large data sets [57]. In contrast to our results, a previous study by Huang et al. (2004) reported that the wavelengths selected by stepwise regression for estimating foliage chemical concentrations were closely related to known absorption features. However, stepwise regression has several critical weaknesses, including its inability to cope with redundant dimensions (it deteriorates in the presence of collinearity) and its inability to shrink regression coefficients [58,59]. The strong dependency or correlation between reflectance bands revealed in this study makes the stepwise approach problematic for locating the known absorption features of Cd.
On the other hand, UVE-PLS is strongly affected by the magnitude of the regression coefficients of the variables, as it selects variables based on an analysis of PLS regression coefficients [20]. This method is better at eliminating irrelevant variables than at selecting the best small subset of variables for fitting a model [3,23]. Our results also showed that several non-affected bands were involved in the UVE-PLS models for Cd estimation. Most likely, the artificially added random matrix influenced the model [22].
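The general idea behind UVE-PLS, as described above (an analysis of PLS regression coefficients with an artificially added random matrix), can be sketched as follows. This is a simplified illustration under stated assumptions, not the configuration used in this study: the PLS1 fit is a plain NIPALS implementation, the random matrix is scaled by 1e-10, coefficient stability is taken as mean/std over leave-one-out fits, and the cut-off is the largest absolute stability among the artificial variables. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def pls1_coefficients(X, y, n_components):
    """Regression coefficients of a PLS1 model fitted by NIPALS on centred data."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)          # weight vector
        t = Xk @ w                         # scores
        tt = t @ t
        p = Xk.T @ t / tt                  # X loadings
        c = (yk @ t) / tt                  # y loading
        Xk = Xk - np.outer(t, p)           # deflate X
        yk = yk - c * t                    # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def uve_select(X, y, n_components=2, seed=0):
    """Keep variables whose coefficient stability exceeds that of added noise."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    noise = 1e-10 * rng.normal(size=(n, p))    # artificial random variables
    Z = np.hstack([X, noise])
    # Coefficient vectors from leave-one-out refits.
    B = np.array([
        pls1_coefficients(np.delete(Z, i, axis=0), np.delete(y, i), n_components)
        for i in range(n)
    ])
    stability = B.mean(axis=0) / B.std(axis=0)
    cutoff = np.abs(stability[p:]).max()       # worst case among noise variables
    return np.abs(stability[:p]) > cutoff, stability[:p], cutoff

# Synthetic example: only column 0 drives the response.
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 6))
y = 3.0 * X[:, 0] + 0.05 * rng.normal(size=80)
keep, stability, cutoff = uve_select(X, y)
print("kept variables:", np.flatnonzero(keep))
```

Because the cut-off is derived from the random matrix itself, a different random draw can shift which borderline variables survive, which is consistent with the remark above that the added random matrix may influence the model.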
In contrast, the genetic algorithm (GA) approach has been shown to be effective for optimizing band selection in PLS models [17]. However, a simple GA implementation typically locates only near-optimal solutions and in most cases fails to converge on the optimal solution [60,61]. A previous study on estimating pasture mass and quality from field hyperspectral data with the GA-PLS model suggested that although the selected bands did not exactly match published absorption peaks of specific materials, most were within 30 nm of the peaks [2]. Similar results were obtained in our study.

Performance of PLS Models for Field-Measured Datasets
Because the study reported here was based mainly on simulated data from virtual experiments, a previous study by Jin and Wang [28] provides a useful comparison. That study evaluated the performance of different informative band selection techniques for PLS in estimating leaf chlorophyll contents of various species from reflected hyperspectral information, and was based on four field-measured datasets containing a total of 598 leaf samples. It reported that the calibrated stepwise-PLS and GA-PLS models can estimate chlorophyll content with high accuracy (R² > 0.70) at different spectral resolutions (1-50 nm). Taking a spectral resolution of 10 nm as an example, which is close to the specification of several existing hyperspectral sensors such as Hyperion, AVIRIS, MARTE VNIR imaging spectrometer and CRISM [62-64], the best estimation of chlorophyll content reached an R² of 0.77. However, some discrepancies in the informative bands used by the stepwise-PLS and GA-PLS models were noted. The bands selected by the GA-PLS method were within the ranges of  [31,43]. Thus, for field-measured datasets, the GA-PLS method also proved efficient for locating the informative bands for estimating leaf biochemical parameters.

Advantages and Disadvantages of PLS Models for Estimating Leaf Biochemical Contents
PLSR has proven to be a very versatile method for multivariate data analysis, especially for high-dimensional data, in diverse research fields such as bioinformatics, machine learning and chemometrics [22]. Previous studies on retrieving various plant characteristics from spectral data have also demonstrated that PLS outperformed other regression techniques (e.g., vegetation indices, multiple linear regression, stepwise regression, and principal component regression) based on statistical criteria such as the normalized root-mean-square error (NRMSE) and/or the coefficient of determination (R²) [4,11-15]. On the other hand, PLSR models usually involve more bands when applied to high-dimensional hyperspectral remote sensing data, which increases the possibility of overfitting. A previous study on canopy nitrogen content demonstrated that the predictive ability of a two-band index was comparable to that of a PLSR model using all the hyperspectral data [65]. Moreover, another study suggested that vegetation indices used fewer bands than a PLSR model (three versus four) but were more capable of detecting chlorophyll content during the shooting stage and trumpet stage [66]. Research on retrieving leaf area index and leaf chlorophyll concentration at different growth stages of winter wheat concluded that the narrow-band indices were optimal, and that a PLSR model using the information from all wavelengths in the hyperspectral region did not show any significant improvement [5]. Thus, there is a high risk of overfitting when using PLSR with many bands (especially all hyperspectral bands) [46], which may explain why data-oriented PLSR models performed worse than indices based on fewer bands in retrieving plant characteristics at different growth stages.
Furthermore, although PLS models could efficiently estimate leaf biochemical contents from hyperspectral reflectance, the bands selected by popular data-oriented methods did not exactly match known absorption features, so not all of the bands used were mechanism-based. Thus, selecting informative bands for PLS models to reduce the risk of overfitting remains a great challenge when applying PLSR to estimate plant biochemical contents from high-dimensional reflectance. Overall, based on the virtual analysis carried out in this study, we concluded that the GA-PLS method, developed on the basis of biological evolution theory, located the mechanism-based wavelengths for PLS estimation of leaf biochemical properties more efficiently and reliably than the other band selection/elimination methods.

Conclusions
To evaluate the efficiency of band selection methods for PLS models and their ability to locate physicochemical mechanism-based informative bands, we conducted a series of virtual analyses based on the modified PROSPECT-4 model. Although the PLSR models could efficiently estimate leaf biochemical contents from hyperspectral reflectance, the selected bands, unfortunately, did not exactly match known absorption features, resulting in poor robustness and generality of the models. Collinearity among the reflectance values at different bands may be the primary reason for the inclusion of non-affected bands in PLSR models, and this will be treated explicitly in future studies. In general, the GA-PLS method, developed on the basis of biological evolution theory, was more efficient at locating the physicochemical mechanism-based informative bands. The results obtained in this study should help to lay a basis for a better understanding of PLS regression performance in retrieving vegetation biochemical parameters.