Leaf Area Index Estimation Algorithm for GF-5 Hyperspectral Data Based on Different Feature Selection and Machine Learning Methods

: Leaf area index (LAI) is an essential vegetation parameter that represents the light energy utilization and vegetation canopy structure. As the only in-operation hyperspectral satellite launched by China, GF-5 is potentially useful for accurate LAI estimation. However, there is no research focus on evaluating GF-5 data for LAI estimation. Hyperspectral remote sensing data contains abundant information about the reflective characteristics of vegetation canopies, but these abound data also easily result in a dimensionality curse. Therefore, feature selection (FS) is necessary to reduce data redundancy to achieve more reliable estimations. Currently, machine learning (ML) algorithms have been widely used for FS. Moreover, the same ML algorithm is usually conducted for both FS and regression in LAI estimation. However, no evidence suggests that this is the optimal solution. Therefore, this study focuses on evaluating the capacity of GF-5 spectral reflectance for estimating LAI and the performances of different combination of FS and ML algorithms. Firstly, the PROSAIL model, which coupled leaf optical properties model PROSPECT and the scattering by arbitrarily inclined leaves ( SAIL) model, was used to generate simulated GF-5 reflectance data under different vegetation and soil conditions, and then three FS methods, including random forest (RF), K-means clustering (K-means) and mean impact value (MIV), and three ML algorithms, including random forest regression (RFR), back propagation neural network (BPNN) and K-nearest neighbor (KNN) were used to develop nine LAI estimation models. The FS process was conducted twice using different strategies: Firstly, three FS methods were conducted to search the lowest dimension number, which maintained the estimation accuracy of all bands. Then, the sequential backward selection (SBS) method was used to eliminate the bands having minimal impact on LAI estimation accuracy. Finally, three best estimation models were selected and evaluated using reference LAI. The results showed that although the RF_RFR model (RF used for feature selection and RFR used for regression) achieved reliable LAI estimates (coefficient of determination (R 2 ) = 0.828, root mean square error (RMSE) = 0.839), the poor performance (R 2 = 0.763, RMSE = 0.987) of the MIV_BPNN model (MIV used for feature selection and BPNN used for regression) suggested using feature selection and regression conducted by the same ML algorithm could not always ensure an optimal estimation. Moreover, RF selection preserved the most informative bands for LAI estimation so that each ML regression method could achieve satisfactory estimation results. Finally, the results indicated that the RF_KNN model (RF used as feature selection and KNN used for regression) with seven GF-5 spectral band reflectance achieved the better estimation results than others when validated by simulated data (R 2 = 0.834, RMSE = 0.824) and actual reference LAI (R 2 = 0.659, RMSE = 0.697). data set estimation GF-5 The results indicate that RFR and BPNN methods were robust with respect to the dimension number changing. When the dimension decreased from 20 to 8, the RMSE of RF_RFR and RF_BPNN decreased by 0.010 and 0.006, respectively. However, when the number of variables continued to decrease, the LAI estimation accuracy of those two models declined sharply. In contrast, the number of variables has a great influence on KNN. When the number decreased from 20 to 7, RMSE decreased by 0.151, which is almost 15 times larger than those of RF_RFR and RF_BPNN.


Introduction
Leaf area index (LAI), which is defined as the ratio of total leaf area per unit of horizontal ground surface area, is one of the key input parameters of numerous ecosystem models [1][2][3]. Since it can characterize the substance and energy exchange in vegetation canopy, LAI is also regarded as an essential indicator of vegetation condition, productivity and photosynthetic capacity [4]. Therefore, accurate LAI estimation at different scales (regional and global) is crucial for many applications, such as climate change, crop yield modeling, and ecological monitoring [5][6][7].
Remote sensing data with superiority in extensive coverage and nondestructive estimation provide effective information for LAI estimation at the regional scale [8]. Therefore, accurate LAI estimation becomes an essential issue in the field of quantitative remote sensing [9,10]. Recently, many global and regional LAI estimation algorithms have been proposed using different multispectral satellite data such as moderate resolution imaging spectroradiometer (MODIS), Satellite Pour l' Observation de la Terre (SPOT) VEGETATION, Landsat, Sentinel and GF-1 [11][12][13][14][15][16][17]. However, some studies show that the saturation phenomenon occurred in LAI estimation when using multispectral band reflectance or vegetation indices (VIs) [18][19][20]. In contrast, hyperspectral sensors acquire remote sensing data with hundreds of consecutive narrow bands that have the capacity of detecting small changes in light absorption and reflection [21,22]. Previous studies have shown that hyperspectral reflectance or VIs can estimate forest aboveground biomass [23], LAI [24,25], fractional vegetation cover (FVC) [26] and leaf chlorophyll content (LCC) accurately [27]. Some of them even indicated that it could resolve the underestimation problem in LAI estimation [28,29]. However, since just a few hyperspectral satellite data are available, most studies are conducted by airborne hyperspectral equipment or ground-based spectral devices. However, those devices are limited to the spatial coverage. Therefore, as the only in-operation hyperspectral satellite launched by China, GF-5 has a great potential for LAI estimation in a large area. Unfortunately, there is scarcely any study for LAI estimation using GF-5 data.
Although GF-5 hyperspectral data describe more details about absorption features of canopy or leaves than multispectral data, the contiguous bands also easily result in spectral autocorrelation, which is also known as "high dimensional disaster" or "Hughes" [30,31]. Therefore, dimensionality reduction methods are usually applied to overcome data redundancy, improve computational efficiency and reduce the risk of overfitting [32][33][34][35]. There are two different ways of dimensionality reduction: one is called feature extraction (FE), which is conducted to transform the original data into other feature spaces that allow the generated low-dimensional data contain the vast majority of information [36][37][38], such as principal component analysis (PCA). However, despite the principal components (PCs) obtained by those methods are independent with each other, the meaning of each PC is not as clear as the original data. Moreover, the PCs with small variance may contain important information on sample differences, thus, discarding that information may have an impact on final LAI estimation accuracy. The other dimensionality reduction approach is known as feature selection (FS), which is conducted to select a group of features that contains the most important and useful information of the original data set [39]. For instance, methods like simulated annealing (SA) [40], genetic algorithms (GA) [41] and correlation-based feature selection (CFS) [42] belong to this category. Compared with FE, FS methods not only can avoid the curse of dimensionality but also maintain the original data characteristics, which makes the results more interpretable. The purpose of dimensionality reduction is to eliminate the negative impact of redundant features and improve model performance. However, other than dimensionality reduction methods, the regression algorithm is another crucial aspect that also has a great influence on LAI estimation accuracy. Some machine learning (ML) algorithms possess the ability in both feature selection and regression. For instance, random forest (RF) [43] has its own built-in feature selection method that can derive the importance of each variable on the tree decision. Thus, the contribution of each variable to the estimation can be easily understood. RankSVM and kernel RankSVM [44,45], which are extended from the basic support vector machine (SVM) [46], ranking the features according to the performance of evaluation criterion such as root mean square error (RMSE). However, although those ML algorithms can be used in feature selection and regression, there is no evidence shows that using the same algorithm for feature selection and regression would achieve more accurate estimations. Therefore, the LAI estimation performance of different FS and ML regression combinations is still worth discussing.
In this study, three different FS methods with different searching standard, including K-means, RF and mean impact value (MIV), and three ML regression algorithms, including RF regression (RFR), back propagation neural network (BPNN) and K-nearest neighbor (KNN), were combined and compared to develop the LAI estimation algorithm for GF-5 hyperspectral data. For this purpose, the radiative transfer model PROSAIL [47], which coupled leaf optical properties model PROSPECT and the scattering by arbitrarily inclined leaves (SAIL), was used to generate a simulated data set of GF-5 band reflectance and corresponding LAI values under different conditions. Then, FS process was conducted to find the best input variables for LAI estimation. Finally, three best LAI estimation models for GF-5 hyperspectral data were selected and evaluated using reference LAI. Figure 1 shows the experiment workflow of this study. Firstly, the PROSAIL model was used to generate a number of simulated data that consist of the training and testing data sets. Then, the first FS process was used to select three different subsets (RF data set, MIV data set and K-means data set), and the second FS process, which combines the sequential backward selection (SBS) process with ML regression, was conducted to select the final variables for LAI estimation. Next, the testing data set was applied to select the top three best LAI estimation models. Finally, LAI was estimated using GF-5 reflectance by the three models and evaluated using the reference LAI data.

Study Area and Field Survey
The study area that covers approximately 3760 km 2 is located in Changchun (43°05′N-45°15′N; 124°18′E-127°05′E), Jilin province of China ( Figure 2). It has a flat terrain with an altitude varying from 137 to 160 m. The temperate continental humid climate type with annual precipitation range from 600 to 700 mm in this region is very suitable for crop growth. Field maize LAI measurements were collected from 16 to 21 July, 2019. There are 26 sample plots with size of 20 m × 20 m and seven LAI values were measured using a LAI-2200C plant canopy analyzer (LI-COR Inc., Lincoln, Nebraska) in each plot. However, not all of the measured LAI values could be used in this study because of the availability of GF-5 hyperspectral data. Therefore, Sentinel-2 data were firstly used to estimate LAI, and then both the Sentinel-2 LAI estimates and measured LAI values were regarded as a reference LAI to evaluate the estimation accuracy of each GF-5 model.

Sentinel-2 Data
Sentinel-2 is a widely used satellite launched by the European Space Agency (ESA) in June 2015 [48,49]. Users can search and download Sentinel-2 data from the Copernicus Open Access Hub. The Multi Spectral Instruments (MSI) installed in this satellite provide 13 spectral bands ranged from visible (VIS) to shortwave infrared (SWIR). Sentinel-2 has three different high spatial resolutions. For example, Band 1 (coastal aerosol), Band 9 (water vapor) and Band 10 (SWIR-Cirrus) have a spatial resolution of 60 m. While four vegetation RE bands (Band 5, 6, 7 and 8a) and two SWIR bands (Band 11 and 12) have a spatial resolution of 20 m. The last four bands (Band 2, 3, 4 and 8) that have a distribution in VIS and NIR have a spatial resolution of 10 m.
In this study, the sample plots were fully covered by one Sentinel-2B MSIL2A data (tile number 51TXK). To obtain high accuracy reference LAI, bands with 10 and 20 m spatial resolution, which are proven to be very useful in LAI estimation [50][51][52], were used in this study.

GF-5 Hyperspectral Data
The GF-5, which was launched in May 2018 from Taiyuan Satellite Launch Centre, is the only hyperspectral satellite in the China High-resolution Earth Observation System. Compared with other GF satellites, GF-5 is also the only one that carries six different sensors for various academic and application use. Among all the sensors carried by GF-5, the visible-shortwave infrared advanced hyperspectral imager (AHSI) is the main payload, which was used to obtain 330 bands ranging from 400 to 2500 nm, with 30 m spatial resolution and 60 km swath width. There are two different spectral resolutions of those bands: 5 nm in VNIR and 10 nm in SWIR. Compared to the Hyperion Imaging Spectrometer [53,54], which is also a hyperspectral satellite sensor (with spectral resolution of 10 nm and band number of 224), GF-5 not only provides more detailed information, but also has a higher signal-to-noise ratio, which guarantees better data quality.
GF-5 hyperspectral data can be searched from the Land Observation Satellite Service website. In this study, ortho-rectification and atmospheric correction are conducted to obtain the GF-5 land surface reflectance. After removing the invalid bands, overlapped bands in the SWIR region and blue bands, which are seriously affected by atmospheric scattering, 210 bands were selected for further LAI estimation algorithm development.

Using PROSAIL Model to Generate Simulated Data
PROSAIL model [55] is widely used in biophysical parameters estimation because of its high accuracy and computing efficiency [56,57]. In the PROSAIL model, the PROSPECT model was used to simulate the reflectance and transmittance of leaves from 400 to 2500 nm. Then the spectral information of leaves is regarded as the input of the SAIL model to generate canopy reflectance [56]. Since the parameter setting will affect the calculation speed and the redundancy of outputs, reasonable ranges and fixed values were applied according to the previous studies [58,59] (Table 1). As the representation of underlying surface information, soil reflectance is also a vital parameter for the PROSAIL model. In this study, 20 representative soil reflectance were generated from the International Soil Reference and Information Center (http://www/isric.org) using the method proposed by Wang et al. [57].
After simulating canopy reflectance at 400-2500 nm wavelength, resampling was conducted to obtain GF-5 band reflectance according to the center wavelength and bandwidth of the selected bands. Compared with the simulated data, canopy reflectance of plants driven by the satellite contains some noise generated by the sensors. Therefore, a Gaussian white noise of 1% was added into the simulated data [57,60]. The training set and validation set were produced separately. Considering the computational efficiency, the simulation generated 24,000 samples as a training set and 4800 samples as a validation set.

Feature Selection Methods
Feature selection is one of the core concepts in machine learning, which highly influences the model performance. In this study, three FS methods with different criteria (RF, MIV and K-means) were applied to the original data set to determine the appropriate dimension number, which still maintains satisfactory LAI estimation accuracy. Then, the SBS method was used to search for the optimal subsets for LAI estimation.

RF for Feature Selection
Random forest is one of the most popular and efficient algorithms for regression and classification problems. Different from other machine learning algorithms, RF combines the idea of bagger with random feature selection [61][62][63]. Based on the information of the randomly selected samples, RF can predict the category (for classification) and value (for regression) of the target by establishing multiple independent decision trees [64]. It achieves the final result and out-of-bag (OOB) error by voting for the independent results of decision trees. The OOB error not only indicates the accuracy of RF model but also can be used to evaluate the classification and estimation capacity of each variable. In the end, the RF model would output the importance score of each variable. Then users can choose the suitable variable subset according to the predefined dimension or accuracy.

MIV
MIV is another feature selection algorithm that belongs to the embedded category like RF. This method assesses each variable by testing their stability of estimation performance by adding noise to the original data set. The major steps are as follows [65,66]: Step 1: Training network using BPNN algorithm, and then recording outputs as i A ; Step 2: Every input variable value ( i P ) in the original training data set was transformed by increasing and decreasing 10% to form new training data sets.
Step 3: Taking 1 i P and 2 i P as input data, then training those data using network obtained in step 1. The output is recorded as A . Calculating the difference value between ij A and ij P , and recording as impact value (IV): Step 4: Calculating the mean value of IV (MIV) according to the number of samples, and then outputting the MIV for each variable. Finally, the variables are ranked according to the magnitude of the absolute value of MIV.

K-means
K-means is a simply and efficient iterative clustering algorithm that can be used as the dimensional reduction algorithm without label information [67,68]. Unlike RF and MIV, K-means need to combine with alternative criteria such as the feature correlation index, mutual information [69] to achieve the final results. In this study, the Pearson correlation coefficient (PCC) between the input variable and LAI was used as criteria for K-means feature selection. The main steps are as follows: Step 1: Selecting k initial clustering centers randomly in feature space; Step 2: Calculating the distance between each feature and cluster center, then classifying them into the closest category; Step 3: Calculating the average value of all data in each category, then using this value to determine the new center of each category; Step 4: Evaluating the astringency of the clustering function based on the category center. Final clustering result will be obtained when the function is convergent; Step 5: Calculating the PCC between each variable and LAI, and then selecting the variable with the highest PCC value in each category as the FS result.

RFR
Random forest regression is an ensemble learning technique developed by Breiman [70]. This algorithm achieves more accurate and stable results by combining the individual results of a large number of decision trees [71]. RFR uses a bootstrap resampling method to extract a number of sample sets (k) from the original data set, and then takes each set as a training sample to generate a single decision tree. Every node variable at each split of decision tree is the best one chosen from the input variables (m). The final estimation result is determined by the average value of each prediction result of decision trees [72]. The number of sample sets (k) and the number of input variables (m) need to be set by users in the RFR model. After several tests, those two parameters were set to 1/3 and 500, respectively.

BPNN
The back propagation neural network [73] is one of the effective and widely used neural networks for vegetation parameter estimation [74][75][76][77][78]. The construction of this network is mainly achieved by training the weights with a non-linear differentiable function. The basic concept of BPNN is to adjust the network parameters by calculating the error between outputs and the expected values to improve accuracy. This study adopted the three-layer network with a single hidden layer. Each layer is connected by different functions. In this BPNN model, "tansig" is selected as the activation function to construct a nonlinear mapping between the input layer and hidden layer, "pureline" is selected as the transfer function to build a linear mapping between the hidden layer and output layer, and "trainlm" is selected as the training function.
The number of nodes (n) in the hidden layer is an important factor affecting fitting accuracy. In this study, the number of nodes in the input layer (ni) was equal to the dimension of the subset that was selected by FS methods, and the output had one node (n0), which corresponded to the LAI values. The number of nodes in the hidden layer can be determined using Equation (5). Since a in Equation (5) ranged from 1 to 10, the number of nodes in the hidden layer needs to be determined by evaluating every possible value, and 15 was determined as the optimal value of n. a n n n KNN is one of the simple ML methods that determine the sample characteristic by considering the performance of its nearest neighbors [79]. It is not only suitable for classification, but also for quantitative estimation of vegetation parameters [80]. Since KNN does not depend on specific function distribution, it is suitable for feature fusion and missing value estimation of multi-mode remote sensing. The attribute values of the estimation pixels are obtained by weighting k pixels that are nearest to them. The formula is as follows: where Vp and Vpi represent the attribute value of estimation pixel p and k pixels named pi; Wp,pi represents the weight between p and pi. In this study, Wp,pi is defined by Euclidean distance.
In general, KNN estimates the value of the validation samples by taking average of some training data. Therefore, the estimation accuracy of KNN model depends on the parameter k, which represents the size of the training data near to the validation sample [81]. In this study, 10-fold cross-validation was introduced to find optimal k in which the search range was set to 2-20.
In this study, three feature selection methods and three machine learning algorithms generate nine LAI estimation models, and their abbreviations and descriptions are shown in Table 2. Table 2. The abbreviations and descriptions of nine LAI estimation models.

RF_RFR
Using the random forest algorithm as feature selection and regression methods.

RF_BPNN
Using the random forest algorithm and back propagation neural network algorithm as feature selection method and regression method, respectively.

RF_KNN
Using the random forest algorithm and K-nearest neighbor algorithm as the feature selection method and regression method, respectively.

MIV_RFR
Using the mean impact value algorithm and random forest regression algorithm as the feature selection method and regression method, respectively.

MIV_BPNN
Using the mean impact value algorithm and back propagation neural network algorithm as the feature selection method and regression method, respectively.

MIV_KNN
Using the mean impact value algorithm and K-nearest neighbor algorithm as the feature selection method and regression method, respectively.
K-menas_RFR Using the K-means algorithm and random forest regression algorithm as the feature selection method and regression method, respectively.

K-means_BPNN
Using the K-means algorithm and back propagation neural network algorithm as the feature selection method and regression method, respectively.
K-means_KNN Using the K-means algorithm and K-nearest neighbor algorithm as the feature selection method and regression method, respectively.

LAI Estimation Accuracy Evaluation
In this study, model validation consisted of two different parts. Firstly, the validation set that contained 4800 simulated samples was used to access the performance of different FS and ML combinations for LAI estimation. Secondly, the reference LAI contained field LAI measurements and Sentinel-2 data derived LAI values were used for LAI estimation accuracy assessment of actual GF-5 hyperspectral data. The reference LAI values from Sentinel-2 were also obtained by PROSAIL and machine learning algorithms. Firstly, PROSAIL model was used to generate simulated data including Sentinel-2 reflectance and corresponding LAI. Then, nine spectral band reflectance (Band 2 to Band 8a, Band 11 and Band 12) and five vegetation indices (Table 3) were selected as input variables. To prevent the ML regression method used in the Sentinel-2 LAI estimation model from affecting the validation of GF-5 LAI estimation, RFR, BPNN and KNN were all used to estimate LAI from Sentinel-2 data and the average values of the three models were determined as the reference LAI ( Figure 3). The reference LAI achieved satisfactory estimation accuracy with R 2 of 0.506 and RMSE of 0.679. Therefore, the Sentinel-2 derived LAI could be used as a reference to access the accuracy of LAI estimation models using GF-5 data.   (6) where i y , ' i y y and n represent the reference LAI value, estimated LAI value, average of the reference LAI value and the number of samples.

Determining the Dimension Number of the First FS Process
The nine models were derived using three FS methods (RF, MIV and K-means) and three ML algorithms (RFR, BPNN and KNN) in this study. The reason we chose these methods was that they represent different feature selection and regression criteria, which makes the LAI estimates more comparable. For example, although RF and MIV selection are both based on feature importance criteria, they rely on different ML algorithms. K-means is an unsupervised clustering algorithm that is only based on spectral similarity instead of the fixed regression method. Therefore, in theory, the features selected by K-means are not able to abide to one specific ML algorithm.
Different models show different RMSE trends with the changing of the input variable dimension (Figure 4). For example, the RMSE increased by 0.038 when the original data decreased to 10-dimension using the RF_RFR model. While the RMSE increased by 0.252, which is almost 7 times larger than the RF_RFR model when applying the RF_KNN model in the same dimensional condition. According to Figure 4, the changes in LAI estimation accuracy of all models can be divided into three categories: reducing continuously (MIV_KNN as category 1); decreasing sharply after a gradual change (RF_RFR, RF_KNN, RF_BPNN as category 2); increasing at initial and then decreasing (MIV_RFR, MIV_BPNN, K-means_RFR, K-means_KNN, K-means_BPNN as category 3). For category 2, the RMSE values only changed by 0.012, 0.008 and 0.015 along with the decreasing of dimension from 210 to 20. While with the dimension continuing to decrease (from 20 to 10), the RMSE values increased sharply by 0.026, 0.260 and 0.014. In addition, for category 3, the RMSE values continued to increase until the dimension reduced to 40 or 50 (depend on the models), then the RMSE started to decrease until the dimension decreased to 20. As the dimensions continue to decline, the LAI estimation accuracy reduced significantly. Therefore, 20 would be a suitable dimension number that maintains most information of the original data set while reduced nearly 90% redundant bands. Therefore, the complexity and computational efficiency of models would be improved obviously.  Table 4 shows the LAI estimation accuracy using different FS and ML algorithms. According to the accuracy indicators (R 2 and RMSE), RFR is the best ML algorithm with or without dimensionality reduction. When using original data set, RFR algorithm obtained higher estimation accuracy (with RMSE of 0.837) than KNN (with RMSE of 0.982) and BPNN (with RMSE of 0.910) algorithm. When the input variable dimension was reduced to 20, RFR still achieved the lowest RMSE values in each data set compared with other ML regression algorithms. In contrast, KNN achieved the lowest accuracy in all ML algorithms except MIV-based models.

LAI Estimation Using Features Selected by the First FS Process
Compared with MIV-selected and K-means-selected subsets (Table 4), using RF as feature selection generated the highest accuracy when using RFR and KNN. When using BPNN as the regression algorithm, the LAI estimation difference of the RF_BPNN and K-means_BPNN model could be neglected (with RMSE difference of 0.004). By contrast, those RMSE values of K-means-selected models (RMSE of 0.862, 1.008 and 0.921) and MIV-selected models (RMSE of 0.940, 1.004 and 1.010) were much larger than RF-selected models (RMSE of 0.849, 0.974 and 0.925). Therefore, the results indicate that the RF-selected data set retained more useful information of the original data set, while MIV selection discarded some important spectral bands that could reveal difference in estimation of LAI.
Therefore, RF shows excellent performance as both FS and regression methods, which proved the priority of RF among ML algorithms. Moreover, when using RF as the FS method, the RFR regression achieved the best estimation result. While the same FS and regression method combinations from MIV and BPNN have the worst LAI estimation accuracy among all the models, which indicates that using the same ML algorithm as the FS method cannot always guarantee the satisfactory LAI estimation accuracy. To explain the reason why MIV-selected models achieve worse LAI estimation than other models, the selected bands of each FS method and their Pearson correlation coefficients are shown in Table 5 and Figure 5, respectively. Each variable in Figure 5 is corresponding to the bands in Table 5 with the same order. Bands selected by MIV were concentrated in the SWIR region with great linear correlation, which have limitations on the representation of canopy spectral information and lead to low LAI estimation accuracy. In contrast, bands selected by K-means and RF were well distributed in all regions (VIS, red-edge, NIR and SWIR), which also result in the high similarity of estimation results using K-means and RF selected data set. Table 5. First feature selection (FS) process selected bands using random forest (RF), mean impact value (MIV) and K-means methods.

Optimal Bands Combination Searching in the Second FS Process
SBS is a heuristical search method to remove the bands that have a minimal impact on evaluating indicators. Figure 6 shows the performances of RF_RFR, RF_BPNN and RF_KNN models using SBS to search the optimal input variable data set for LAI estimation using GF-5 hyperspectral data. The results indicate that RFR and BPNN methods were robust with respect to the dimension number changing. When the dimension decreased from 20 to 8, the RMSE of RF_RFR and RF_BPNN decreased by 0.010 and 0.006, respectively. However, when the number of variables continued to decrease, the LAI estimation accuracy of those two models declined sharply. In contrast, the number of variables has a great influence on KNN. When the number decreased from 20 to 7, RMSE decreased by 0.151, which is almost 15 times larger than those of RF_RFR and RF_BPNN.  Figure 7 and Figure 8 show the performances of different ML methods using K-means and MIV as feature selection methods, respectively. RFR and BPNN methods were still robust to the input variable dimension changing. When using K-means as the FS method, RFR and BPNN achieved their highest LAI estimation accuracy when the input variables number decreased to 9 (RMSE of 0.809 and 0.906). While when using MIV as the FS method, RFR and BPNN achieved their highest LAI estimation accuracy when input variables number decreased to 8 and 11 (RMSE of 0.912 and 0.987).
However, the trend of LAI estimation accuracy by KNN is related to input variables selected by different FS methods. For instance, Figure 7 shows that RMSE of RF_KNN decreased by 0.026 when the number of input variables decreased from 20 to 14, and then the accuracy dropped significantly with the decrease of the input variables dimension. While using MIV as the FS method, KNN shows good stability like RFR and BPNN.  Table 6 shows the best performance of each FS and ML method combination. According to Table 4, Table 6 and Figure 6a, the bands distributed in RE and NIR (< 1100 nm) had a negative effect on the LAI estimation accuracy of the RF_KNN model. Removing those bands significantly improved the LAI estimation accuracy. However, the RE bands are indispensable to maintain the LAI estimation accuracy of the K-means_KNN model since the RMSE will at least increase by 0.132 if the dimension changed from 14 to 13. On the other hand, although RF_RFR and RF_KNN models achieved similar LAI estimation accuracy, their final selected bands were distributed differently.

Evaluation of GF-5 LAI Estimation
According to Table 6, there are only three models with RMSE less than 0.85. Therefore, RF_RFR, RF_KNN and K-means_RFR models were selected to access the LAI estimation accuracy using actual GF-5 hyperspectral data (Figure 9). RF_RFR model had the highest R 2 , but its RMSE value was the largest one among all the LAI estimation results. In addition, some low LAI estimates present in the reference value range of 2-4 when using the RF_RFR model. K-means_RFR shows an overestimation in the range of 2-5, whereas the LAI estimates from RF_KNN achieved the best performance (R 2 = 0.659, RMSE = 0.697). In Figure 9, the purple triangles represent the filed-measured LAI points. Similarly, RF_RFR and K-means_RFR models still show greater overestimation than the RF_KNN model.  Figure 10 shows the LAI estimation results using the 7-band RF_KNN model. According to the land-use map (FROM-GLC30) [82] shown in Figure 2, the LAI estimations of the RF_KNN model conform to the distribution characteristics of different land cover types. For instant, the forest region, which was mostly located in the southeast of Figure 2, had high LAI estimations ranging from 5 to 6. In the northwest is the corn filed (classified as cropland in Figure 2), which also had high LAI estimations ranging from 3 to 6. There are also some fallow land and vegetable filed regions (also classified as cropland in Figure 2) located in the northeast and southeast, they had low LAI estimations ranging from 0 to 3. The red regions in Figures 10, which had LAI estimations close to 0, were the impervious surfaces and water bodies. In general, the 7 band RF_KNN model is suitable for LAI estimation using GF-5 hyperspectral data.

Discussion
This study proposed a LAI estimation algorithm for GF-5 hyperspectral data based on the PROSAIL model as well as different feature selection and machine learning methods. The PROSAIL model can simulate extensive vegetation and underlying surface situations, which result in wide applicability of the proposed LAI estimation method. This LAI estimation method can be operated without any prior knowledge, it is also robust to the vegetation type and soil background. The seven selected spectral bands of the RF_KNN model maintain most information of the original hyperspectral data, which have achieved satisfactory LAI estimation accuracy validated by both simulated data and actual GF-5 data. This LAI estimation model developing method through comparison of different feature selection and machine learning methods is also suitable for other biophysical parameters inversion method development using hyperspectral data, such as FVC, fraction of absorbed photosynthetically active radiation (FPAR) and chlorophyll content.
Compared with multispectral data, hyperspectral data contains much more detailed information about reflectance properties of vegetation canopies. The effects of those abundant spectral bands in LAI estimation are always been discussed. For example, Lee et al. [83] confirmed the advantage of those abundant bands by comparing the LAI estimation accuracies of AVIRS and Landsat ETM+ data. Similarly, the research conducted by Das et al. [84] also indicated that Hyperion hyperspectral bands performed better than Landsat-8 and Sentinel-2 data in LAI estimation. The board-band reflectance and VIs show occurrence of saturation in high amounts of biomass, therefore they usually underestimate LAI in the high value region [85,86]. In contrast, the narrow-band VIs [28,29,87] are more sensitive to biomass change than board-band VIs, therefore those valuable variables, which provided by hyperspectral data contribute to achieve higher estimation accuracy in the high LAI value region [88,89]. Moreover, the advantages of higher spectral resolution and narrow-band VIs give hyperspectral data abilities in distinguishing vegetation types, thus the LAI could be estimated more accurately according to the different classification categories [90].
In general, the statistical regression methods applied in LAI estimation mainly based on the multiple-linear model with spectral bands or VIs as inputs [91][92][93]. However, it is difficult to describe the relationship between those independent variables and LAI using linear models. In contrast, the machine learning algorithms have excellent performances on establishing the nonlinear relationship between the dependent and independent variables [94,95]. Furthermore, traditional statistical regression methods are more sensitive to the multicollinearity between independent variables [96,97]. Although principal component regression (PCR) and partial least squared regression (PLSR) can improve the LAI estimation accuracy, some studies indicated that they still achieved lower accuracy than ML regression algorithms [98][99][100]. Table 7 shows the best LAI estimation accuracy using the PLSR algorithm based on different data sets selected by RF, MIV and K-means algorithms in this study. Those best LAI estimates generated by PLSR algorithms have higher RMSEs than ML-based methods, especially for the GF-5 data. Those results confirm the previous research [98][99][100] by revealing the low robustness to noise of the PLSR algorithm. Therefore, ML algorithms still will be considered as a better choice for LAI (and other biophysical parameters) estimation. The applications of ML regression methods to hyperspectral data processing in the field of quantitative remote sensing are developed rapidly [39,101]. However, the collinear relationship between adjacent bands hampers the vegetation parameter estimation accuracy. Most previous studies on LAI estimation employing the hybrid model and some vegetation indices (VIs) from hyperspectral data were preferred to be used to avert dimensionality curse [92,93,102]. Although VIs have great linear or nonlinear relationships with LAI in some cases, they only represent a small part of hyperspectral reflectance information. This study focused on searching the most representative input variables by investigating all the hyperspectral reflectance. It not only avoids the subjectivity of input variable selection for LAI estimation, but also makes full use of the advantage of hyperspectral bands. Furthermore, feature selection algorithms usually applied only once in most studies [39,103,104], such as choosing variables according to importance scores. However, most of the scores only represent the correlation between one specific input variable and the LAI. Therefore, it does not guarantee that the combination of high scored variables would be the optimal choice for LAI estimation. The two-step FS process confirmed this hypothesis by showing more accurate estimations using only a small part of the variables. Those RF-based and MIV-based models (selected by important score ranking) achieved better performances after the optimal variables searching strategy conducted by the SBS algorithm. Therefore, this two-step FS process has been proven to be suitable for LAI estimation.
Due to the limitation of available GF-5 hyperspectral data, the performance of the proposed algorithm was verified by reference LAI, which including field measured LAI and Sentinel-2 data derived LAI. According to the error transmission theory, the Sentinel-2 data derived LAI values have some influence on accuracy validation of LAI estimation from GF-5 data. Therefore, more field experiments are needed to conduct in the further to access the accuracy of this GF-5 LAI estimation algorithm.

Conclusion
In this study, a LAI estimation algorithm for GF-5 hyperspectral reflectance base on different FS and ML methods was proposed. According to the performances presented in this study, GF-5 hyperspectral reflectance data achieved satisfactory LAI estimations with R 2 of 0.659 and RMSE of 0.697 using a RF_KNN model (7 bands). The main conclusions are as follows: (1) Using the same ML algorithm as feature selection and regression methods could not always ensure an optimal LAI estimation result. In this study, the RF_RFR model using the random forest algorithm as both FS and regression methods achieved higher estimation accuracy than RF_BPNN and RF_KNN when using simulated data. The MIV_BPNN is another model that uses the same algorithm as the FS and regression method. However, this model yielded lower estimation accuracy than using other regression algorithms (MIV_RFR and MIV_KNN).
(2) The RF algorithm can be regarded as one of the most adaptable algorithms for further studies of biophysical parameters estimation using hyperspectral data. Not only RF-based features retained the most useful information for LAI estimation, but this algorithm was also less affected by the redundant variables when used as the regression method.
(3) The proposed two-step feature selection process can achieve more satisfactory estimations with even fewer inputs. The study indicates that the feature ranking provided by RF and MIV only represents the importance of a single feature, thus the combination of high-score features could not represent the best inputs of the LAI estimation model. While the additional selection process based on the SBS algorithm was very effective in the optimal subset searching in a small or moderate dimension. Therefore, this two-step feature selection method improved the model performance by taking advantage of two FS algorithms with different criteria (first to reduce dimension, then search for the optimal subset). This proposed method was not only suitable for LAI estimation, but also can be used for classification based on hyperspectral remote sensing data.
The approach provided in this study assessed the application of GF-5 reflectance data in LAI estimation. Further research will focus on developing some effective narrow-band vegetation indexes for biophysical parameters estimation using GF-5 data.