Laboratory Visible and Near-Infrared Spectroscopy with Genetic Algorithm-Based Partial Least Squares Regression for Assessing the Soil Phosphorus Content of Upland and Lowland Rice Fields in Madagascar

As a laboratory proximal sensing technique, the capability of visible and near-infrared (Vis-NIR) diffused reflectance spectroscopy with partial least squares (PLS) regression to determine soil properties has previously been demonstrated. However, the evaluation of the soil phosphorus (P) content—a major nutrient constraint for crop production in the tropics—is still a challenging task. PLS regression with waveband selection can improve the predictive ability of a calibration model, and a genetic algorithm (GA) has been widely applied as a suitable method for selecting wavebands in laboratory calibrations. To develop a laboratory-based proximal sensing method, this study investigated the potential to use GA-PLS regression analyses to estimate oxalate-extractable P in upland and lowland soils from laboratory Vis-NIR reflectance data. In terms of predictive ability, GA-PLS regression was compared with iterative stepwise elimination PLS (ISE-PLS) regression and standard full-spectrum PLS (FS-PLS) regression using soil samples collected in 2015 and 2016 from the surface of upland and lowland rice fields in Madagascar (n = 103). Overall, the GA-PLS model using first derivative reflectance (FDR) had the best predictive accuracy (R2 = 0.796) with a good prediction ability (residual predictive deviation (RPD) = 2.211). Selected wavebands in the GA-PLS model did not perfectly match wavelengths of previously known absorption features of soil nutrients, but in most cases, the selected wavebands were within 20 nm of previously known wavelength regions. Bootstrap procedures (N = 10,000 times) using selected wavebands also confirmed the improvements in accuracy and robustness of the GA-PLS model compared to those of the ISE-PLS and FS-PLS models. These results suggest that soil oxalate-extractable P can be predicted from Vis-NIR spectroscopy and that GA-PLS regression has the advantage of tuning optimum bands for PLS regression, contributing to a better predictive ability.


Introduction
Phosphorus (P) deficiency is a major constraint for rice production in the tropics [1] because strongly weathered soils, which cover vast regions of the tropics, contain low concentrations of readily exchangeable inorganic phosphate [2,3].In tropical soils, available P can generally be even less due to strong sorption to aluminium (Al) and iron (Fe) oxides and often limits crop production in low input agricultural systems [4,5].
In our previous study [6], we examined highly weathered and typical P-deficient soils in the central highland of Madagascar and found that the P uptake of rice under flooded conditions is related not only to easily soluble P content but also to the amounts of active Fe and Al, which are bound to incalcitrant P fractions.Acid ammonium oxalate extraction is a powerful extraction method and covers both soluble and incalcitrant Fe-and Al-bound P fractions, unlike certain conventional extraction methods (e.g., the Olsen method) that normally consider only easily soluble P [6,7].In addition, recently, Helfenstein et al. [8] revealed that the incalcitrant P fraction (NaOH-extractable inorganic P pool) turns over in weeks to months, suggesting that the incalcitrant P fraction would potentially play a significant role as a P source within a cropping season.Rabeharisoa et al. [9] also found the amount of oxalate-extractable P in soils had a significant correlation with the P concentrations of rice leaves in farmers' fields in Madagascar.Therefore, we assumed that oxalate-extractable P reflects bioavailable P for rice production in the region and applied this assumption in the current study.
Soil P occurs in a variety of chemical forms that differ markedly in their behavior and bioavailability in the soil environment [6,10].Our previous study also revealed that these soil P contents and forms largely varied among neighboring fields.These observations indicate that P nutrient management for rice production can be further improved by understanding field-to-field variations in bioavailable P (i.e., oxalate-extractable P) in the tropics.Thus, the development of a rapid and accurate methodology for evaluating bioavailable P in soils is needed.However, a quantitative assessment of soil P using standard procedures (e.g., wet chemistry) can often be difficult, especially in spatially heterogeneous assessments that require numerous soil samples, a process that is costly and time consuming.
To overcome the issues with the standard procedure, laboratory visible and near-infrared (Vis-NIR) spectroscopy has been widely adopted for soil studies as a non-destructive, rapid and reproducible analytical method and has been used for the simultaneous prediction of a variety of primary and secondary soil attributes [11].Vis-NIR spectroscopy is an analytical technique that characterizes materials according to their reflectance at light absorption in the visible (400-700 nm) and NIR (700-2,500 nm) regions.These techniques measure the radiation absorbed by various bonds of O-H, C-H, N-H, C=O, C-N, N-H, or C=C, resulting in bending, twisting, stretching, or scissoring [11,12].Spectroscopy has been used in conjunction with chemometric (multivariate regression) analyses to relate soil spectra to soil attributes, such as carbon content, clay and iron oxide [13][14][15].
Although partial least squares (PLS) regression is the most commonly used approach for soil spectral analyses, waveband selection can refine the performance of a PLS analysis [16][17][18].The PLS regression method combines the most useful information from hundreds of wavebands into the first several PLS factors (or latent variables), whereas the less important factors might include background effects [19,20].Thus, many techniques for selecting wavebands or wavelength regions have been developed, such as iterative stepwise elimination-PLS (ISE-PLS) regression [21], uninformative variable elimination-PLS (UVE-PLS) regression [22], competitive adaptive reweighted sampling (CARS) regression [23], interval PLS (iPLS) regression [24], moving window-PLS (MW-PLS) regression [25], and genetic algorithm-PLS (GA-PLS) regression [26].In our previous study [18], removal of the redundant wavebands by ISE-PLS regression greatly improved the estimation of total carbon (TC) and total nitrogen (TN) in paddy soils.Among the waveband selection methods, GA-PLS regression has been used as a suitable method in chemometrics [27].Leardi and González [28] demonstrated that the GA-PLS method, after suitable modifications, produces more interpretable results because the selected wavelengths are less dispersed in this method than in other methods.Several studies have reported that the GA-PLS method obtained a better solution than did the ISE-PLS method [29][30][31].
To date, there have been several attempts to predict soil P using Vis-NIR spectroscopy at field and laboratory scales with the standard full-spectrum PLS (FS-PLS) method [32][33][34].However, the predictive accuracy was relatively low compared with that for other macro-nutrients (e.g., nitrogen, carbon), and waveband selection coupled with PLS regression analysis has not been evaluated.The objective of this study was to develop a laboratory-based proximal sensing method based on an empirical relationship between soil P and Vis-NIR spectral characteristics using PLS analyses.To improve the predictive ability, we investigated the potential to use GA-PLS regression analyses to estimate the soil P status of upland and lowland soils from laboratory Vis-NIR reflectance data.Here, we targeted amounts of oxalate-extractable P based on the above-noted field observations regarding its relative importance for rice production and on P uptake as noted by Rabeharisoa et al. [9] and Nishigaki et al. [6].The predictive ability of the GA-PLS method was compared with the predictive abilities of ISE-PLS and FS-PLS methods using first derivative reflectance (FDR) spectra data.Rapid measurements of soil P status at low cost and with less soil sample preparation could be an application of the present study instead of the routine chemical methods.

Study Site and Soil Sampling and Chemical Analyses
The field survey was conducted in the central highland of Madagascar (Figure 1).This region has a subtropical climate with an altitude of 1000-1500 m above sea level.The mean temperature is 14-17 • C in winter and 20-22 • C in summer.The average annual rainfall is 1100 mm (>80% occurs in November-March) [35].The area is dominated by inherently nutrient-poor soil types that are mainly classified into Ferralsols and Acrisols [36] or into Oxisols of semiarid to humid climates [37].
Remote Sens. 2018, 10, x FOR PEER REVIEW 3 of 18 carbon), and waveband selection coupled with PLS regression analysis has not been evaluated.The objective of this study was to develop a laboratory-based proximal sensing method based on an empirical relationship between soil P and Vis-NIR spectral characteristics using PLS analyses.To improve the predictive ability, we investigated the potential to use GA-PLS regression analyses to estimate the soil P status of upland and lowland soils from laboratory Vis-NIR reflectance data.Here, we targeted amounts of oxalate-extractable P based on the above-noted field observations regarding its relative importance for rice production and on P uptake as noted by Rabeharisoa et al. [9] and Nishigaki et al. [6].The predictive ability of the GA-PLS method was compared with the predictive abilities of ISE-PLS and FS-PLS methods using first derivative reflectance (FDR) spectra data.Rapid measurements of soil P status at low cost and with less soil sample preparation could be an application of the present study instead of the routine chemical methods.

Study Site and Soil Sampling and Chemical Analyses
The field survey was conducted in the central highland of Madagascar (Figure 1).This region has a subtropical climate with an altitude of 1000-1500 m above sea level.The mean temperature is 14-17°C in winter and 20-22°C in summer.The average annual rainfall is 1100 mm (>80% occurs in November-March) [35].The area is dominated by inherently nutrient-poor soil types that are mainly classified into Ferralsols and Acrisols [36] or into Oxisols of semiarid to humid climates [37].In 2015 and 2016, soil sampling was conducted in 103 rice fields-the major production system in the region -with 63 upland and 40 lowland fields under various management practices.The sampling positions were recorded with a handheld GPS (Colorado 300, Garmin, Ltd., Kansas, TX, USA).Surface soil samples were collected at a 0-10 cm depth as composites of three to four cores in each field.The spatial distributions of soil sampling points were plotted on the maps using satellite images (Figure 1a b,d) and ASTER global digital elevation model (GDEM) version 2 data (Figure 1c, e), which is a product of METI and NASA.
In the laboratory, soil samples were air dried for 14 days and sieved to <2 mm.Soil P was In 2015 and 2016, soil sampling was conducted in 103 rice fields-the major production system in the region -with 63 upland and 40 lowland fields under various management practices.The sampling positions were recorded with a handheld GPS (Colorado 300, Garmin, Ltd., Kansas, TX, USA).Surface soil samples were collected at a 0-10 cm depth as composites of three to four cores in each field.The spatial distributions of soil sampling points were plotted on the maps using satellite images (Figure 1a,b,d) and ASTER global digital elevation model (GDEM) version 2 data (Figure 1c,e), which is a product of METI and NASA.
In the laboratory, soil samples were air dried for 14 days and sieved to <2 mm.Soil P was extracted using the acid ammonium oxalate method as described by Schwertmann [38], and the concentration of P in the oxalate extraction was analyzed using the malachite green colorimetric method [39].

Vis-NIR Diffuse Reflectance Measurement
Laboratory soil reflectance measurements were conducted in a dark room at the Japan International Research Center for Agricultural Sciences (JIRCAS), Japan, on July 31-August 1, 2017.Soil samples were scanned by a portable spectroradiometer (FieldSpec 4 Hi-Res, Analytical Spectral Devices (ASD) Inc., Longmont, CO, USA) and an ASD contact probe.The ASD FieldSpec measures spectral reflectance in the 350-2500 nm wavelength region, which has one silicon array (350-1000 nm) and two indium gallium arsenide (InGaAs) detectors (1000-1800 and 1800-2500 nm).The spectral sampling interval was 1.4 nm in the 350-1000 nm range and 1.1 nm in the 1001-2500 nm range.The spectral resolution (full-width-half-maximum; FWHM) was 3 nm in the 350-1000 nm range and 6 nm in the 1000-2500 nm range, which were calculated to 1 nm resolution wavelengths for output data using the cubic spline interpolation function in ASD software (RS3 for Windows; ASD Inc., Longmont, CO, USA).
The contact probe light source (halogen lamp) was aligned at 12 • to the probe body, ensuring illumination at a fixed angle without the influence of ambient light.The fiber optic cable of the ASD FieldSpec was attached to the contact probe at a fixed measurement angle of 35 • .The sensed spot area had a diameter of ~1.1 cm with a field of view of 1.33 cm 2 .A Spectralon (Labsphere Inc., Sutton, NH, USA) reference panel (white reference) was used to optimize the ASD instrument prior to taking Vis-NIR reflectance measurements for each sample.
Bulk soil samples were spread in optical-glass Petri dishes that were 85 mm in diameter and pressed to form a layer ~19 mm thick.The soil surfaces were scanned 25 times with five replications for the soil samples, and the spectral readings were averaged.

Overview of Data Processing
In this section, an overview of the data processing process is described using a flowchart in Figure 2 that shows a general overview of the methodology.In this study, two types of validations were performed for the models: (i) a leave-one-out cross-validation (LOO-CV) procedure based on whole data sets (n = 103) and (ii) a modified bootstrap procedure based on an independent test data set, which was similar to our previous study [30].Here, the LOO-CV procedure included waveband selection in ISE-PLS and GA-PLS regression analyses, while the bootstrap procedure was performed using the selected wavebands; the best GA-PLS model and final wavebands from five GA runs were justified by the residual predictive values (RPD).More details on the predictive abilities are described in Section 2.9.

Preprocessing of Spectral Data
Spectral data in both edge wavelength regions (350-399 nm and 2401-2500 nm) were eliminated because of low signal-to-noise ratios in the instrument.Thus, a total of 2001 spectral bands between 400 nm and 2400 nm were used for the analyses.
FDR spectra were used to reduce baseline variation and enhance spectral features [40].The FDR was calculated using the Savitzky-Golay smoothing filter [41].A third-order, 15-band moving polynomial was fitted according to the original reflectance signatures.The parameters of this polynomial were subsequently used to calculate the derivative at the center waveband of the moving spline window.In addition, a standard normal variate transform (SNV) was employed to reduce the particle size effect [42].
To detect outliers, a principal component analysis was performed on spectral data for calculating the Mahalanobis distance H, and samples with H > 3 were eliminated as outliers.As a result, three samples were considered outliers, leaving 103 samples for further analyses.

Standard Full-Spectrum Partial Least Squares (FS-PLS) Regression
PLS regression analyses were performed to estimate soil parameters using reflectance and FDR data sets (n = 103).The standard FS-PLS regression equation was calculated as follows: where the response variable y is a vector of the soil oxalate-extractable P; the predictor variables x1 to xi are surface reflectance or FDR values for spectral bands 1 to i (400, 401, …, 2400 nm), respectively; β1 to βi are the estimated weighted regression coefficients; and ε is the error vector.The latent variables were introduced to simplify the relationship between response variables and predictor variables.To determine the optimal number of latent variables (NLV), a LOO-CV was performed to avoid over-fitting of the model and was based on the minimum value of the root mean squared error of cross-validation (RMSECV).The RMSECV was calculated as follows: All data handling and linear regression analyses were performed using PLS_Toolbox version 8.6 (Eigenvector Research Inc., Manson, WA, USA) in MATLAB software ver.9.3 (MathWorks Sherborn, MA, USA).

Preprocessing of Spectral Data
Spectral data in both edge wavelength regions (350-399 nm and 2401-2500 nm) were eliminated because of low signal-to-noise ratios in the instrument.Thus, a total of 2001 spectral bands between 400 nm and 2400 nm were used for the analyses.
FDR spectra were used to reduce baseline variation and enhance spectral features [40].The FDR was calculated using the Savitzky-Golay smoothing filter [41].A third-order, 15-band moving polynomial was fitted according to the original reflectance signatures.The parameters of this polynomial were subsequently used to calculate the derivative at the center waveband of the moving spline window.In addition, a standard normal variate transform (SNV) was employed to reduce the particle size effect [42].
To detect outliers, a principal component analysis was performed on spectral data for calculating the Mahalanobis distance H, and samples with H > 3 were eliminated as outliers.As a result, three samples were considered outliers, leaving 103 samples for further analyses.

Standard Full-Spectrum Partial Least Squares (FS-PLS) Regression
PLS regression analyses were performed to estimate soil parameters using reflectance and FDR data sets (n = 103).The standard FS-PLS regression equation was calculated as follows: where the response variable y is a vector of the soil oxalate-extractable P; the predictor variables x 1 to x i are surface reflectance or FDR values for spectral bands 1 to i (400, 401, . . ., 2400 nm), respectively; β 1 to β i are the estimated weighted regression coefficients; and ε is the error vector.The latent variables were introduced to simplify the relationship between response variables and predictor variables.
To determine the optimal number of latent variables (NLV), a LOO-CV was performed to avoid over-fitting of the model and was based on the minimum value of the root mean squared error of cross-validation (RMSECV).The RMSECV was calculated as follows: where y i and y p represent the respective measured and predicted soil parameters for sample i, and n is the number of samples in the data sets (n = 103).

Iterative Stepwise Elimination Partial Least Squares (ISE-PLS) Regression
ISE-PLS is a PLS model that incorporates a waveband elimination algorithm.The ISE method eliminates noisy variables and selects useful predictors.When PLS models include large numbers of redundant variables or outliers, the models' predictive abilities may perform poorly, while the ISE method can overcome such problems.Performance depends on the importance of predictors (z i ), described as follows: where s i is the standard deviation, and β i is the regression coefficient; both s i and β i correspond to the predictor variable of the waveband i.
Initially, all available wavebands (2001 bands, 400-2400 nm) are used to develop the PLS regression model.Then, to create a scope in which useless predictor variables are removed and predictive ability is improved, each predictor z i is evaluated, and the less informative wavebands are eliminated.Subsequently, the PLS model is re-calibrated with the remaining predictors [43].The model-building procedure is repeated until the final model is calibrated with the maximum predictive ability.

Genetic Algorithm Partial Least Squares (GA-PLS) Regression
The GA is an efficient numerical optimization method based on genetic principles and natural selection [44].In a GA, a population of individuals (or chromosomes) is created automatically and typically stored as binary strings in a computer memory, which means that a binary integer "zero" or "one" represent one gene.Then, each chromosome consists of sequences of a "gene" or "bits."During this evolutionary computation, one or more bits are swapped within or between individuals by computer operations using mechanisms of natural variation, selection and inheritance.Briefly, selection, crossover and mutation form the core of GA, and these three operations are applied to the initial populations to generate a new population.This process is repeated until a pre-defined number of generations is propagated.
Although GA is well suited for solving variable subset selection problems [45], the major risk associated with using GA-PLS regression is over-fitting due to the large number of variables (wavebands) used in the Vis-NIR spectroscopy data set.To minimize the risk of over-fitting, Leardi [27] developed the GA program used in the present study.The GA program was designed to contain the following features: (i) The parameters are set with the highest possible elitism, a very limited population size and a relatively high mutation rate to ensure a rapid response increase and to find a good solution very early in the process.Here, elitism means to encourage the propagation of the best band repressors between generations without being disrupted by crossover or mutation so that the search speed of the program can be improved.(ii) The final model is determined via 100 independent and short GA runs.(iii) A weighted average of the selection frequency of the variables from the starting run to the previous runs is used to describe the current frequency of selection of the variables.Thus, each run is able to 'learn' information from the previous runs.(iv) A moving average (window size 3) is applied to the selection of the variables to take into account high spectral correlations and to ensure that highly correlated spectral bands are selected together.
Before the GA run, the suitability of our data set for applying GA band selection was assessed by a fitness function [27,46].In the present study, five GA runs were performed on the FDR data set because each GA-PLS gave a slightly different model.The parameters and their conditions (Table 1) were taken from previous studies [27,30,31,46].

Predictive Ability of the PLS Models
To evaluate the predictive ability of the FS-PLS, ISE-PLS and GA-PLS models, two types of validation were used for the models (see Figure 2): (i) a waveband selection with a LOO-CV procedure based on whole data sets (n = 103), including waveband selection in ISE-PLS and GA-PLS models, and (ii) a modified bootstrap procedure based on an independent test data set using the selected wavebands from (i).
In waveband selections with the LOO-CV procedure, each sample is estimated using the remaining samples.This process means that for each variant, we developed 103 individual models, which were constructed with data from 102 observations.The calibration model was then used to predict the observation that was left out.As the predicted samples were not the same as the samples used to establish the models, the RMSECV was used as the accuracy indicator of the model in predicting unknown samples.The predictive abilities of the PLS models were assessed by calculating the coefficient of determination (R 2 ), RMSECV and the residual predictive deviation (RPD) using a LOO-CV.High R 2 and low RMSECV values indicate the best model for predicting the soil parameters.The RPD has been defined as the ratio of standard deviation (SD) of reference data for predicting RMSECV.For the performance ability of calibration models, an RPD of 3 has been suggested for agriculture applications, while RPD values between 2 and 3 indicate a model with good prediction ability; 1.5 < RPD < 2 is an intermediate model needing some improvement; and an RPD < 1.5 indicates that the model has poor prediction ability.
In the bootstrap procedure, the data were divided randomly into training (n = 69) and test (n = 34) data sets with replacement for N = 10,000 times.In each process, a PLS regression model was developed using the training data set.Here, ISE-PLS and GA-PLS were developed using selected wavebands in the LOO-CV procedure.The PLS model was then used to predict soil oxalate-extractable P in the test data set.The robustness of the calibration models was evaluated by the mean (±SD) values of R 2 and the root mean squares error of prediction (RMSEP) from 10,000 runs in the test data set.The RMSEP was calculated as follows: where y v i and y v p are the measured and predicted soil parameters, respectively, for sample i in the test data set.

Assessing Significant Wavelengths
To assess the importance of the wavelengths in the FS-PLS calibration, the variable importance in the projection (VIP) [47,48] was used and referred to the selected wavelength regions from the ISE-PLS and GA-PLS models.The VIP score provides a summary of the importance of an x variable (waveband) for an observed y variable and is calculated using the following equation: where VIP k (a) is the importance of the kth predictor variable based on a model with a factors, W ak is the corresponding loading weight of the kth variable in the ath PLS regression factor, SSY a is the explained sum of squares of y obtained from a PLS regression model with a factors, SSY t is the total sum of squares of y, and m is the total number of predictor variables.A high VIP score (>1) indicates an important x variable (waveband) [47,49].

A Wide Range of Soil Oxalate-Extractable P Contents in Upland and Lowland Rice Fields
The descriptive statistics of soil oxalate-extractable P in the whole (n = 103) upland (n = 63) and lowland (n = 40) data sets are shown in Figure 3, and Table 2 summarizes the minimum, maximum, median, mean, SD and coefficients of variation (CV) values.The soil oxalate-extractable P values in the upland and lowland data sets ranged between 30.73-1225.16mg kg −1 and 30.73-826.64 mg kg −1 , respectively.The mean value of the upland data set (588.74 mg kg −1 ) showed significantly higher values than that of the lowland data set (319.41 mg P kg −1 ) (p < 0.001, two sample t-test).The lower values in soil oxalate-extractable P is probably because of little fertilizer input in lowland fields compared to upland fields in the central highlands of Madagascar.Similarly, the soil TC was significantly higher (p < 0.05) in lowland soils due to the anaerobic condition, while there was no significant difference in soil clay contents (Figure S1).It is, therefore, suggested that soil physicochemical properties were inherently not different between upland and lowland soils as they were collected nearby fields, and have been changed by the agricultural practices, i.e., fertilization and flooding.
Remote Sens. 2018, 10, x FOR PEER REVIEW 8 of 18 where VIPk(a) is the importance of the kth predictor variable based on a model with a factors, Wak is the corresponding loading weight of the kth variable in the ath PLS regression factor, SSYa is the explained sum of squares of y obtained from a PLS regression model with a factors, SSYt is the total sum of squares of y, and m is the total number of predictor variables.A high VIP score (>1) indicates an important x variable (waveband) [47,49].

3.1.A wide Range of Soil Oxalate-Extractable P Contents in Upland and Lowland Rice Fields
The descriptive statistics of soil oxalate-extractable P in the whole (n = 103) upland (n = 63) and lowland (n = 40) data sets are shown in Figure 3, and Table 2 summarizes the minimum, maximum, median, mean, SD and coefficients of variation (CV) values.The soil oxalate-extractable P values in the upland and lowland data sets ranged between 30.73-1225.16mg kg -1 and 30.73-826.64 mg kg -1 , respectively.The mean value of the upland data set (588.74 mg kg -1 ) showed significantly higher values than that of the lowland data set (319.41 mg P kg -1 ) (p < 0.001, two sample t-test).The lower values in soil oxalate-extractable P is probably because of little fertilizer input in lowland fields compared to upland fields in the central highlands of Madagascar.Similarly, the soil TC was significantly higher (p < 0.05) in lowland soils due to the anaerobic condition, while there was no significant difference in soil clay contents (Figure S1).It is, therefore, suggested that soil physicochemical properties were inherently not different between upland and lowland soils as they were collected nearby fields, and have been changed by the agricultural practices, i.e., fertilization and flooding.
Our whole data set (upland + lowland) covered a wide range of variations in oxalate-extractable P content.The mean (and SD) values of soil oxalate-extractable P were 484.15 mg kg -1 (±319.10mg kg -1 ), with a range of 30.73-1225.16mg kg -1 , and CV = 65.91%.The SD and range of the sample affect the accuracy of soil property predictions using Vis-NIR spectroscopy [34].In the present study, the range of soil oxalate-extractable P values was considered sufficiently large to develop the calibration models using PLS regression analyses.Our data set also demonstrated that the oxalate-extractable P content had a good correlation with the total P content in soils [6].Our whole data set (upland + lowland) covered a wide range of variations in oxalate-extractable P content.The mean (and SD) values of soil oxalate-extractable P were 484.15 mg kg −1 (±319.10mg kg −1 ), with a range of 30.73-1225.16mg kg −1 , and CV = 65.91%.The SD and range of the sample affect the accuracy of soil property predictions using Vis-NIR spectroscopy [34].In the present study, the range of soil oxalate-extractable P values was considered sufficiently large to develop the calibration models using PLS regression analyses.Our data set also demonstrated that the oxalate-extractable P content had a good correlation with the total P content in soils [6].

Soil Spectral Response and Its Correlation to Oxalate-Extractable P in Soil
Figure 4 shows the original reflectance spectra, FDR and Pearson's correlation coefficient (r) values between soil oxalate-extractable P content and reflectance and FDR spectra at each waveband.Large variations in the reflectance spectra were obtained from heterogeneous soil samples, which were collected from upland and lowland soils under various rice-based cropping systems.In general, soils from different fields show variations in the absorbance at wavelengths associated with iron oxides (400-500 nm), clay minerals (OH bond: 1400 and 1900 nm, Al-OH bond: 2200 nm) and organic matter (CH bond: 2300-2400 nm) in soil [50].

Soil Spectral Response and Its Correlation to Oxalate-Extractable P in Soil
Figure 4 shows the original reflectance spectra, FDR and Pearson's correlation coefficient (r) values between soil oxalate-extractable P content and reflectance and FDR spectra at each waveband.Large variations in the reflectance spectra were obtained from heterogeneous soil samples, which were collected from upland and lowland soils under various rice-based cropping systems.In general, soils from different fields show variations in the absorbance at wavelengths associated with iron oxides (400-500 nm), clay minerals (OH bond: 1400 and 1900 nm, Al-OH bond: 2200 nm) and organic matter (CH bond: 2300-2400 nm) in soil [50].
Soil reflectance in visible regions (400-700 nm) is primarily associated with absorption in minerals containing Fe [51][52][53] and organic matter [54,55].NIR regions (700-2500 nm) are dominated by absorption related to water (1400 and 1900 nm), minerals (1300-1400, 1800-1900, and 2200-2500 nm) and organic matter (1100, 1600, 1700-1800, 2000, and 2200-2400 nm) [56].Carbonates also have weak absorption peaks in the NIR region [57].These absorptions in the NIR region are due to overtone and combination bands primarily of C-H, N-H and O-H groups with fundamental bands related to molecular stretching that occurs in the mid-infrared (MIR) spectral region.However, there is no specific absorption by P in the Vis-NIR region, and thus, differences in the shape of reflectance due to P are still in the process of being determined [33].In the reflectance spectra of this study, oxalate-extractable P had a positive correlation with the wavelength between the red and shorter range NIR regions (600-1300 nm), while oxalate-extractable P had negative correlations in the bluegreen region (400-580 nm) and NIR regions (1420-2100 nm).In the FDR spectra, positive correlations were found in the green-red region (508-676 nm) with some peaks in the NIR region (1000, 1410 and 2133 nm), while negative correlations were observed in the 800-1850 nm region with some peaks in the 2200-2400 nm region.Soil reflectance in visible regions (400-700 nm) is primarily associated with absorption in minerals containing Fe [51][52][53] and organic matter [54,55].NIR regions (700-2500 nm) are dominated by absorption related to water (1400 and 1900 nm), minerals (1300-1400, 1800-1900, and 2200-2500 nm) and organic matter (1100, 1600, 1700-1800, 2000, and 2200-2400 nm) [56].Carbonates also have weak absorption peaks in the NIR region [57].These absorptions in the NIR region are due to overtone and combination bands primarily of C-H, N-H and O-H groups with fundamental bands related to molecular stretching that occurs in the mid-infrared (MIR) spectral region.However, there is no specific absorption by P in the Vis-NIR region, and thus, differences in the shape of reflectance due to P are still in the process of being determined [33].In the reflectance spectra of this study, oxalate-extractable P had a positive correlation with the wavelength between the red and shorter range NIR regions (600-1300 nm), while oxalate-extractable P had negative correlations in the blue-green region (400-580 nm) and NIR regions (1420-2100 nm).In the FDR spectra, positive correlations were found in the green-red region (508-676 nm) with some peaks in the NIR region (1000, 1410 and 2133 nm), while negative correlations were observed in the 800-1850 nm region with some peaks in the 2200-2400 nm region.

Selected Wavebands from ISE-PLS and GA-PLS Models
Selected wavebands from ISE-PLS analysis and five GA-PLS runs using FDR spectra to estimate soil oxalate-extractable P contents are shown in Figure 5, with the regression coefficient and VIP score in the FS-PLS model as information to assist in considering the importance of the selected wavelengths.In comparison to the ISE-PLS model, the GA-PLS model selected a wider range of spectral wavelength regions from within the visible (400-699 nm) and NIR (700-2400 nm) spectra, with slightly different regions for the five runs.The commonly selected regions from the five GA-PLS runs (red bar in Figure 5) were 454-457, 506-508, 517, 518, 660, 1732, 1847-1849, 1957-1961, 2105, 2107, 2109, and 2312 nm.These wavelengths did not match those identified for soil characteristics in previous studies (Table 3); in most cases, the wavelengths were found within 20 nm of previously known wavebands.As P is not spectrally active in the Vis-NIR region, the wavelengths detected in this study are potentially important spectral bands in the FDR spectra to indirectly estimate the soil oxalate-extractable P content via a link to other soil components with spectral properties.
Remote Sens. 2018, 10, x FOR PEER REVIEW 10 of 18 Selected wavebands from ISE-PLS analysis and five GA-PLS runs using FDR spectra to estimate soil oxalate-extractable P contents are shown in Figure 5, with the regression coefficient and VIP score in the FS-PLS model as information to assist in considering the importance of the selected wavelengths.In comparison to the ISE-PLS model, the GA-PLS model selected a wider range of spectral wavelength regions from within the visible (400-699 nm) and NIR (700-2400 nm) spectra, with slightly different regions for the five runs.The commonly selected regions from the five GA-PLS runs (red bar in Figure 5) were 454-457, 506-508, 517, 518, 660, 1732, 1847-1849, 1957-1961, 2105, 2107, 2109, and 2312 nm.These wavelengths did not match those identified for soil characteristics in previous studies (Table 3); in most cases, the wavelengths were found within 20 nm of previously known wavebands.As P is not spectrally active in the Vis-NIR region, the wavelengths detected in this study are potentially important spectral bands in the FDR spectra to indirectly estimate the soil oxalate-extractable P content via a link to other soil components with spectral properties.When comparing the GA-PLS model-selected wavebands with the ISE-PLS models, the selected wavebands were different but the GA-PLS model selected wavebands in the visible region (22-bands in 454-457, 506-508, 517, 518, 660 nm) were all selected with the ISE-PLS model.In contrast, no overlapping wavebands in the NIR region were found between the GA-PLS and ISE-PLS models.ISE-PLS regression analysis is a model-wise elimination technique, while GA-PLS regression analysis is a numerical optimization method based on genetic principles and natural selection, which are slightly different models.Thus, we performed five GA-PLS runs and used the commonly selected wavebands.However, the GA-PLS models also contained the wavebands selected by ISE-PLS model plus additional bands, especially in the NIR region.In some cases, these extra regions could be readily interpreted, but in other cases, they could not [27].Nevertheless, previous studies have reported that the wavebands selected by GA-PLS models clearly contain relevant information since the waveband subset decreases the RMSECV, and the wavebands were consistently selected in independent GA runs [27,30,31].
The commonly selected wavebands in the visible region (454-457, 506-508, 517, 518 and 660 nm) seem to be closely relevant to Fe oxide minerals (Table 3).In addition, the waveband at approximately 2270 nm, which is relevant to gibbsite (Al oxide mineral) [59], had a high regression coefficient and VIP score, although it was not always selected.These results are supported by our previous findings [6], in which oxalate-extractable P was significantly and positively correlated with active Al and Fe, respectively.The other selected wavebands likely overlapped or corresponded to previously known wavebands, which are most likely related to soil organic matter.Based on the study by Knadel et al. [56], the selected wavebands were considered to be associated with organic matter (C-H bond: 1720, 2111 and 2300 nm) [55,60,61], methyls (C-H bond: 1730-1852 nm) [61,62] and phenolics (C-OH bond: 1961 nm) [61,62].Our previous study also found that active Al and Fe had a positive correlation with soil TC and organic phosphorus content, respectively [6].This finding suggested that active Al and Fe play a significant role as sorbents for both oxalate-extractable P mainly in inorganic forms and organic matter in the studied soils.Turner [63] investigated the chemical nature of P in a range of rice field soils in Madagascar and reported that a considerable proportion of the TP extracted by NaOH-EDTA occurred in organic forms (19-44%), mostly as phosphate monoesters.Since the acid ammonium oxalate method can partly extract organically bound P, our results also indirectly indicated the importance of organic compounds, probably containing organic P, in oxalate-extractable P. Thus, the selected wavelengths in our data set should also be relevant to chemical associations of oxalate-extractable P in soils.

Waveband Selection with Cross-Validated Calibration Results
Figure 6 shows the relationships among the cross-validated calibration results between the FDR spectra and soil oxalate-extractable P using FS-PLS, ISE-PLS and GA-PLS regression analyses, with the selected number of wavebands (NW) and the selected NW as a percentage of the full-spectrum (NW% = NW/whole waveband [n = 2001] × 100) (Table 4).The optimum NLVs were 7, 7 and 6 using FS-PLS, ISE-PLS and GA-PLS methods, respectively, and they were determined as the lowest RMSECV values calculated from LOO-CV to avoid over-fitting of the model.Overall, the best R 2 and lowest RMSECV values were obtained with the GA-PLS model for estimating the soil oxalate-extractable P content (R 2 = 0.796 and RMSECV = 143.625).Based on RPD > 2 in the GA-PLS and ISE-PLS models, the quality and future applicability of our results could be considered to have a good predictive ability.
spectra and soil oxalate-extractable P using FS-PLS, ISE-PLS and GA-PLS regression analyses, with the selected number of wavebands (NW) and the selected NW as a percentage of the full-spectrum (NW% = NW/whole waveband [n = 2001] × 100) (Table 4).The optimum NLVs were 7, 7 and 6 using FS-PLS, ISE-PLS and GA-PLS methods, respectively, and they were determined as the lowest RMSECV values calculated from LOO-CV to avoid over-fitting of the model.Overall, the best R 2 and lowest RMSECV values were obtained with the GA-PLS model for estimating the soil oxalateextractable P content (R 2 = 0.796 and RMSECV = 143.625).Based on RPD > 2 in the GA-PLS and ISE-PLS models, the quality and future applicability of our results could be considered to have a good predictive ability.
The NW (NW%) remaining after waveband selection was 158 (7.9%) in ISE-PLS and 94 (4.7%) in GA-PLS, which were considered useful wavelengths for estimating the soil oxalate-extractable P content.These results also suggested that over 92% of the waveband information from the soil reflectance spectrum was redundant and did not contribute to or disturb the prediction.These findings support previous findings that the performance of PLS models can be improved through waveband selection, and the most useful information in the Vis-NIR region (400-2400 nm) predicted less than 20% of the forage [30,70], water [71] and soil parameters [18].Moreover, the spectral data efficiency is also expected to improve by the optimization of the waveband subset using the GA-PLS model [30].

Evaluation of Predictive Ability Using Modified Bootstrapping
To evaluate the predictive ability of the PLS models, a modified bootstrap procedure (N = 10,000 times) was conducted using selected wavebands in the FDR data set.Table 5 summarizes the mean  The NW (NW%) remaining after waveband selection was 158 (7.9%) in ISE-PLS and 94 (4.7%) in GA-PLS, which were considered useful wavelengths for estimating the soil oxalate-extractable P content.These results also suggested that over 92% of the waveband information from the soil reflectance spectrum was redundant and did not contribute to or disturb the prediction.These findings support previous findings that the performance of PLS models can be improved through waveband selection, and the most useful information in the Vis-NIR region (400-2400 nm) predicted less than 20% of the forage [30,70], water [71] and soil parameters [18].Moreover, the spectral data efficiency is also expected to improve by the optimization of the waveband subset using the GA-PLS model [30].

Evaluation of Predictive Ability Using Modified Bootstrapping
To evaluate the predictive ability of the PLS models, a modified bootstrap procedure (N = 10,000 times) was conducted using selected wavebands in the FDR data set.Table 5 summarizes the mean values of NLV, R 2 and RMSECV in the training data set (n = 69) and R 2 , RMSEP and the percent difference of RMSEP (∆RMSEP) between FS-PLS and ISE-PLS or GA-PLS models in the test data set (n = 34).In addition, Figure 7 demonstrates the distribution of R 2 values in the test data set.The mean optimum NLV ranged from 5.364 in the GA-PLS model to 7.285 in the FS-PLS model.In the training data set, GA-PLS obtained the best mean R 2 (0.782) and the lowest mean RMSECV (148.930mg P kg −1 ) values, and ISE-PLS performed better than FS-PLS.Similarly, in the test data set, the GA-PLS model obtained the best mean R 2 (0.787) and the lowest mean RMSEP (149.013mg P kg −1 ) values for estimating soil oxalate-extractable P. In comparison with the FS-PLS model and GA-PLS, the ∆RMSEP showed greater predictive accuracies in ISE-PLS (−16.21%) and GA-PLS (−24.69%)models, respectively.These findings confirm previous results that showed that the performance of PLS models can be improved through wavelength selection [20, 70,72] and the predictive ability of GA-PLS can overcome the ISE-PLS [29][30][31].Yang et al. [16] suggested that reducing large spectral data sets is valuable for more efficient storage, computation, and transmission as well as for the ease of spectral analysis.
Although spectral data efficiency could be improved by the optimization of a wavelength subset in the PLS model, the over-fitting problems still remained in the GA-PLS method.Leardi and Nørgaard [73] addressed a limitation of GA-PLS, which was that a greater number of variables (> These findings confirm previous results that showed that the performance of PLS models can be improved through wavelength selection [20, 70,72] and the predictive ability of GA-PLS can overcome the ISE-PLS [29][30][31].Yang et al. [16] suggested that reducing large spectral data sets is valuable for more efficient storage, computation, and transmission as well as for the ease of spectral analysis. Although spectral data efficiency could be improved by the optimization of a wavelength subset in the PLS model, the over-fitting problems still remained in the GA-PLS method.Leardi and Nørgaard [73] addressed a limitation of GA-PLS, which was that a greater number of variables (> 200) would result in over-fitting and reduce the capability of obtaining a solution with good predictive ability.To solve this problem, they proposed a sequential application of backward-interval PLS (biPLS) and GA for the selection of relevant spectral regions.The biPLS removes the non-informative regions prior to GA runs, thereby reducing the number of variables.Future work needs to examine using such sequential application of biPLS and GA-PLS with a larger number of data sets collected in different fields with a range of various soil properties.

Conclusions
A timely and accurate assessment of the soil P content is crucial for resource-limited farmers in Madagascar to improve rice production by site-specific fertilizer management.In this study, we investigated the performance of GA-PLS regression analysis in laboratory Vis-NIR reflectance spectroscopy for estimating soil oxalate-extractable P contents as a diagnostic indicator of soil P status in rice fields of Madagascar.Our results showed that a large range of soil oxalate-extractable P (30.73 to 1225.16 mg P kg −1 ) can be rapidly and non-destructively predicted by Vis-NIR spectroscopy for rice fields irrespective of different cropping systems and geographical locations and that the predictive ability was improved by GA-based waveband selection coupled with PLS regression analysis.GA-based waveband selection in the PLS calibration suggested that the important wavebands for estimating soil oxalate-extractable P were 4.7% of all 2001 wavebands in the 400-2400 nm range.The selected wavebands were different from previously published absorption peaks of specific materials.However, most of the peaks were within the 20 nm vicinity of such a peak and apparently relevant to chemical associations of oxalate-extractable P in soils bound to Al and Fe oxides and organic compounds.Thus, the selected wavelength in our study should be considered informative for estimating soil oxalate-extractable P contents.Based on the selected FDR wavebands in the GA-PLS model, soil oxalate-extractable P was determined to provide a good prediction (RPD = 2.211), with 20.4% and 21.3% of errors when cross-validating and testing, respectively, the independent data set.Such timely P sensing in soils might allow Madagascar's farmers to implement better fertilizer management.

Figure 2 .
Figure 2. Flowchart depicting a general overview of the methodology.

Figure 2 .
Figure 2. Flowchart depicting a general overview of the methodology.

Figure 3 .
Figure 3. Box plot (a) and histogram (b) of soil oxalate-extractable P in the whole, lowland and upland data sets.

Figure 3 .
Figure 3. Box plot (a) and histogram (b) of soil oxalate-extractable P in the whole, lowland and upland data sets.

Figure 4 .
Figure 4. Raw reflectance spectra (a) and first derivative reflectance (FDR) spectra on a log10 scale (b) of the soil samples and their correlation of coefficients to soil oxalate-extractable P in each waveband (c,d).

Figure 4 .
Figure 4. Raw reflectance spectra (a) and first derivative reflectance (FDR) spectra on a log10 scale (b) of the soil samples and their correlation of coefficients to soil oxalate-extractable P in each waveband (c,d).

Figure 5 .
Figure 5. (a) Selected wavebands in ISE-PLS analysis (green bars) and GA-PLS (blue bars in each run) using the FDR data set (n = 103) to estimate oxalate-extractable P contents of upland and lowland soils, with commonly selected wavebands from five GA-PLS runs (red bars), (b) regression coefficients in the FS-PLS model, and (c) VIP score (>1, grey bars).Specific absorption wavebands for the different bonds present in soil are specified on the top x-axis (modified by[58]).

Figure 5 .
Figure 5. (a) Selected wavebands in ISE-PLS analysis (green bars) and GA-PLS (blue bars in each run) using the FDR data set (n = 103) to estimate oxalate-extractable P contents of upland and lowland soils, with commonly selected wavebands from five GA-PLS runs (red bars), (b) regression coefficients in the FS-PLS model, and (c) VIP score (>1, grey bars).Specific absorption wavebands for the different bonds present in soil are specified on the top x-axis (modified by[58]).

Figure 6 .
Figure 6.Relationships between observed and predicted values of soil oxalate-extractable P contents using (a) FS-PLS, (b) ISE-PLS and (c) GA-PLS regressions.

Figure 7 .
Figure 7. Comparisons of the frequency distributions of R 2 values in the test data (n = 34) from (a) FS-PLS, (b) ISE-PLS and (c) GA-PLS using FDR, with mean (red line) ± standard deviation (SD) values.

Figure 7 .
Figure 7. Comparisons of the frequency distributions of R 2 values in the test data (n = 34) from (a) FS-PLS, (b) ISE-PLS and (c) GA-PLS using FDR, with mean (red line) ± standard deviation (SD) values.

Table 1 .
Parameter conditions of the genetic algorithms-partial least squares (GA-PLS) regression.

Table 3 .
Commonly selected wavebands from five GA-PLS runs to estimate soil oxalate-extractable P using the FDR data set (n = 103) and possible soil components.

Table 5 .
Mean values of NLV, R 2 and RMSECV/RMSEP from N = 10,000 evaluations using independent training and test data sets with FS-PLS, ISE-PLS and GA-PLS.

Method Training Data Set (n = 69) Test Data Set (n = 34) Mean NLV Mean R 2 Mean RMSECV Mean R 2 Mean RMSEP ∆RMSEP 1
R2and RMSECV in the training data set (n = 69) and R 2 , RMSEP and the percent difference of RMSEP (ΔRMSEP) between FS-PLS and ISE-PLS or GA-PLS models in the test data set (n = 34).In addition, Figure7demonstrates the distribution of R 2 values in the test data set.The mean optimum NLV ranged from 5.364 in the GA-PLS model to 7.285 in the FS-PLS model.In the training data set, GA-PLS obtained the best mean R 2 (0.782) and the lowest mean RMSECV (148.930mg P kg −1 ) values, and ISE-PLS performed better than FS-PLS.Similarly, in the test data set, the GA-PLS model obtained the best mean R 2 (0.787) and the lowest mean RMSEP (149.013mg P kg −1 ) values for estimating soil oxalate-extractable P. In comparison with the FS-PLS model and GA-PLS, the ΔRMSEP showed greater predictive accuracies in ISE-PLS (−16.21%) and GA-PLS (−24.69%)models, respectively.

Table 5 .
Mean values of NLV, R 2 and RMSECV/RMSEP from N = 10,000 evaluations using independent training and test data sets with FS-PLS, ISE-PLS and GA-PLS.

method Training data set (n = 69) Test data set (n = 34) Mean NLV Mean R 2 Mean RMSECV Mean R 2
1 ΔRMSEP, percent difference in the RMSEP to FS-PLS.