Modeling Bidirectional Polarization Distribution Function of Land Surfaces Using Machine Learning Techniques

: Accurate estimation of polarized reﬂectance ( R p ) of land surfaces is critical for remote sensing of aerosol optical properties. In the last two decades, many data-driven bidirectional polarization distribution function (BPDF) models have been proposed for accurate estimation of R p

models. For this goal, we (1) compared the accuracy of the proposed ML-based BPDF models with that of the GRNN-based model as well as the widely used semi-empirical models; (2) analyzed the advantages of the ML-based BPDF models over the semi-empirical BPDF models; (3) improved the accuracy of the ML-based models by selecting the optimal configuration of the model input variables; and (4) discussed the limitations and potential of the ML-based BPDF models.
The paper is organized as follows. Section 2 gives an introduction of the data used and brief descriptions on the ML-based algorithms and the semi-empirical models. The results of ML-based BPDF models and the comparison with the semi-empirical models are presented in Section 3. Section 4 discusses the advantages, improvement, limitations, and potential of the ML-based BPDF models. Finally, Section 5 summarizes this paper.

POLDER/PARASOL BRDF-BPDF Database
In this study, the most up-to-date POLDER/PARASOL BRDF-BPDF database released in January 2017 was used [41]. This database provides a great number of bidirectional reflectance and polarized reflectance data of globally distributed earth targets observed from POLDER/PARASOL. It has been widely used for the studies on both BRDF and BPDF modeling of land surfaces [11,23,42,43]. Although the data acquisition period of POLDER/PARASOL is from early 2005 to October 2013, only data acquired in 2008 were used to generate the database because of its best acquisition continuity. The atmospherically corrected surface bidirectional reflectance factor (BRF) at six wavebands from visible to near-infrared spectral region (i.e., 490 nm, 565 nm, 670 nm, 765 nm, 865 nm, and 1020 nm) was provided in the database. Besides, the top-of-atmosphere polarized reflectance (R p ) was measured at three wavebands, i.e., 490 nm, 670 nm, and 865 nm. However, only atmospherically corrected surface R p at 865 nm was provided in the database, because (i) the atmospheric correction has larger uncertainty for shorter wavelengths, and (ii) the surface R p is generally assumed to be spectrally neutral [41]. The data were allocated and stored according to 16 International Geosphere Biosphere Program (IGBP) classes. In each IGBP class, 50 selected targets with best quality were provided for every month of the year, i.e., ideally there were 600 selected targets for a given IGBP class. Note that although the POLDER pixels have a resolution of 6.2 × 6.2 km, only pixels with more than 75% occupation of one land cover type were kept, which guaranteed the relatively high homogeneity of the selected targets in the database [41]. The database can be freely downloaded from the PANGAEA website (doi:10.1594/PANGAEA.864090).
Statistics of the selected observations used in this study are listed in Table 1. It is notable that missing observations (with filled values in the database) were excluded from the database. Moreover, to suppress the aerosol effects, the observations with Aero field value bigger than 5 were also excluded from the data. A target was removed if all the corresponding observations were excluded. The number of selected targets was thus less than 600 per IGBP class. It is notable that there is a small occupation of negative polarized reflectance in each IGBP class. Unlike the positive values which indicate the polarizing direction of reflected light perpendicular to the scattering plane, the negative values represent such directions parallel to the scattering plane. The scattering plane here is the principal plane, i.e., the plane containing the sun and view direction. The negative R p always occur in the back-scattering direction when the scattering angle is near to 180 • . The negative R p takes up nearly 7.24% on average in the database, as shown in Table 1.

Machine Learning (ML)-Based BPDF Models
Four machine learning regression algorithms, involving GRNN, SVR, KNN, and RF, were used in this study to build the ML-based BPDF models. The configuration of the input variables is introduced in Section 2.2.1, after which brief description of the four ML algorithms is given. All the four algorithms were implemented in MATLAB (R2019b). The summary of the four ML algorithms was listed in Table A1.

Selection of Input Variables
The polarized radiation is primarily generated by the specular reflection which occurs at the land surfaces. The Fresnel function, F p , can describe this process and is widely used for BPDF modeling [9,23,34]. Analogously, F p is used as one of the input variables in this study. It is a function of the scattering angle (SA) and the refractive index of the air-surface interface (N), as where α is the incident angle, which can be calculated from SA through α = (π − SA)/2. α is the refractive angle and is related to α by sin(α) = N sin(α ). Here, N is fixed to 1.5, which is commonly accepted [24,29]. Another input variable is the scattering angle (SA), which has great impact on R p . Previous studies showed that for a given target, R p tends to be larger when SA is smaller and it approaches zero, or even negative, when SA becomes larger and approaches 180 • [10,23]. SA characterizes the sun-sensor geometry and can be calculated using sun zenith angle θ s , view zenith angle θ v and relative azimuth angle ϕ of the sun and view direction Moreover, as is documented in [10,21], the vegetation coverage was found to be negatively correlated to R p , given that vegetation produce lower R p than soil [21]. Consequently, surface reflectance at the two vegetation-sensitive wavebands, i.e., 670 nm (R 670 ) and 865 nm (R 865 ) was considered as the input variables.
Remote Sens. 2020, 12, 3891 5 of 21 Therefore, four variables, F p , SA, R 670 and R 865 were selected as the input of the ML-based models, i.e., the size of input matrix was n-by-4, where n was the number of observations.

Generalized Regression Neural Networks (GRNN)
GRNN is one of the radial basis function (RBF) based neural networks. It builds a joint probability density function of training input and the output variable. GRNN has four layers containing the input, the pattern, the summation and the output layer [44]. For each of the input variables X (1-by-4 vector in this study), the pattern layer is calculated by using the input and a specific training sample X i through a Gaussian RBF where p i is the i-th neuron of the pattern layer, and σ is the smoothing parameter of the Gaussian RBF, determining the size of the sensitive area of the model. Then two neurons of the summation layer are calculated by the simple sum of the pattern neurons and the sum weighted by the corresponding output (i.e., the R p ) of the training samples, respectively. Finally, the output layer gives the estimated R p by calculating the ratio between the two summation neurons. Small σ may lead to overfitting whereas large σ may result in under-fitting of the training samples. As a smoothing factor controlling the shape of the Gaussian RBF, σ has a significant impact on the model performance. Cross-validation method is applied to find the optimal selection of σ. The searching range of σ was set from zero to 0.2 with a step of 0.01, as documented in [34]. The GRNN was modeled in MATLAB using a function newgrnn.

Support Vector Regression (SVR)
SVR is the regression implementation of support vector machine (SVM). Among various types of SVR algorithm, the classical and widely used ε-SVR was used in this study [45][46][47]. ε-SVR uses a kernel function to transform the input data into a high-dimensional feature space, and then uses the support vectors whose training error lying outside the ε margin to build a regression function with hyper parameters. A nonlinear Gaussian RBF, with an adjustable parameter γ, was used in this study as a kernel function. γ controls the basis radius of the RBF, which reflects the size of the sensitive area of each support vector. A larger γ indicates a narrower sensitive area of the RBF, making the model more likely to be over-fitted. In the loss function, a regularization constant C is used to control the trade-off penalty on the support vectors, whereas observations within the ε-tube are not penalized. A larger C also leads to a bigger possibility of overfitting. In this study, ε was set to 0.01, and the appropriate setting of γ and C is crucial to the model performance.
The optimizing ranges of γ and C parameters were from 10 −5 to 10 2 and from 10 −2 to 10 2 , respectively, which is similar to the configuration used in [48]. The parameters were then optimized by minimizing the merit function where RMSECV(γ k , C k ) is the root mean square error of cross-validation using the k-th set of the combination of γ and C; m is the fold number of the cross-validation; RMSE ith is the root mean square error of the model estimation in the i-th iteration of the cross-validation. The optimization process was repeated for each IGBP class. The optimizing was performed using a MATLAB function fminsearchbnd, with the boundary values mentioned above. The ε-SVR was implemented using the MATLAB interface of the V3.24 LIBSVM [49].

K-Nearest-Neighbor (KNN) Regression
KNN is a simple supervised learning algorithm for both classification and regression. To solve the regression problem, KNN implements the estimation based on the values of K nearest neighbors of the query input point. To define the distance between the two multi-dimensional points (four-dimensional points in this study), the Euclidean distance is the most common choice and was thus used in this study. After the selection of the K neighbors, the estimation of the query point is given by the weighted value of the selected neighbors through a specific weighting function [50] where d i , I = 1, 2, . . . k, is the distance between the query point and the i-th nearest neighbor; t generally takes values from 0, 1, and 2 [35,50,51]. In this study, t was fixed to 1 [51,52]-i.e., the weights were assigned proportional to the inverse of the distance between the neighbor and the query points; that is, the nearby points contributed more to the estimated value than the faraway points. The model-estimated polarized reflectance of the query point is then calculated as the weighted average of the nearest points.
As a key parameter of the KNN algorithm, K has a great impact on the model performance. Smaller K may give under-fitting results whereas larger K is more likely to leads to overfitting. In this study, K was optimized within the range from zero to 200 using the cross-validation method, which was repeated for each IGBP class. The structure of fast query of nearest points was built using MATLAB function KDTreeSearcher, and the query was achieved using function knnsearch.

Random Forest (RF) Regression
RF regression is a non-parameter ensemble learning algorithm [53]. RF regression grows many decision trees simultaneously and implements the estimation based on the results of all trees, so it generally gives efficient and robust performance [54]. In the training procedure, RF regression uses bootstrapping approach to randomly select about two-thirds of samples from the training dataset, and the remaining one-third are used to calculate the out-of-bag (OOB) error to represent the performance of the built model. To grow each tree independently, RF randomly selected one-third (two variables of the total four in this study) every time for growing each individual tree. Using the standard Classification and Regression Tree (CART), the best splitting variable and the best splitting value are determined by minimizing a weighted impurity G from the left and the right node after the splitting where v ij , the splitting point, is the j-th value of the splitting variable x i ; n l , and n r are the number of training samples of the left node and right node after the splitting, respectively; N t is the number of training samples of the node to be split; H(X l ) or H(X r ) is the impurity function of the left or the right node. For the regression problem, the mean squared error (MSE) serves as the H(X). MSE is defined as the mean square of the deviation between the training targets and their average within a node after the splitting. In the prediction procedure, the average of the predictions of all trees are taken as the estimated output value for each of the query input observation. The two key parameters of RF regression-i.e., the number of the trees (ntree) and the minimum leafs of the terminal nodes (nodesize)-have significant impact on the model performance. Smaller ntree with larger nodesize indicate a less dense forest with deeper trees. nodesize was set to 5 in this study to balance the training accuracy and the generalizing ability of the model [36]. It was reported that a selection of ntree bigger than 100 guarantees the stability of the RF model [54], so ntree was set to 100 in this study. The RF regression was implemented using TreeBagger function in MATLAB, with the option "OOBPredictorImportance" set to "on" to calculate the importance of all input variables. The process of Remote Sens. 2020, 12, 3891 7 of 21 RF modeling and the calculation of importance was repeated for 100 times in this study to reduce the uncertainty, and the average results were finally used.

Semi-Empirical BPDF Models
To assess the performance of the ML-based BPDF models, semi-empirical models were taken as a reference. Semi-empirical BPDF models have been widely used to estimate the surface R p , given the specified sun-sensor geometry and the model-required information [55,56]. These models were initially proposed for different land covers and were built based on the measurements from different instruments [9][10][11]21,23,28,29,57]. They have been proven to yield relatively high accuracy by introducing empirically free parameters to the modeling process. Table 2 gives the brief introduction of the six representative semi-empirical models-i.e., Nadal-Bréon, Waquet, Maignan, Litvinov, Diner and Xie-Cheng models. The estimation accuracy of these models were used in this study for the comparison with the ML-based BPDF models.

Nadal-Bréon
Based on POLDER/ADEOS measurements, developed for natural land surfaces. [9] Waquet Incorporated a shadowing function S(θ s , σ). Based on air-borne MICROPOL and developed for forest, cropped and urban surfaces. [29] Maignan Added NDVI to the model and C· exp(−θ i ) considered the attenuation from leaf surface. Based on POLDER/PARASOL measurements and developed for 14 IGBP classes. [10] Litvinov Added a shadowing function f sh (SA, k r ) and considered a Gaussian distribution of facets, f (σ, ϑ). Based on air-borne RSP measurements and developed for vegetation and soil surfaces. [57] Based on the ground-based GroundMSPI and developed for grass surface. [28] Xie-Cheng R p = A·F p · f sh (SA, k r )· exp(−0.7NDVI) A, k r Based on POLDER/PARASOL measurements and developed for urban areas. [21]

Optimization and Selection of Model Parameters
For each IGBP class, the data were randomly divided into two parts, i.e., the training dataset, and the validation dataset. For SVR-, KNN-, and GRNN-based models, model parameters were optimized using 10-fold cross-validation within the training dataset. That is to say, for each candidate parameter within the optimizing range, the training dataset was randomly divided into 10 parts, nine of which were used to build the model and the other for the validation. The average of the root mean square errors (RMSEs) of the 10 iterations was calculated. After the searching of all possible parameters, the parameter yielding smallest average model RMSE was chosen to be optimal, and the model was then built based on the whole training dataset using the optimal parameter.
For each semi-empirical model and for each IGBP class, fitting with observations from each of the targets in the training dataset gave hundreds of sets of target-based free parameters. Then, the median value of these target-based parameters were chosen as the class-based free parameter, i.e., the a priori free parameter. The scheme for generating the a priori free parameters was widely used in previous studies [10,23,55]. Finally, the model with a priori free parameter was considered a priori model for the R p estimation.

Evaluation and Comparison
The procedure of evaluation and comparison of the ML-based and semi-empirical BPDF models are illustrated in Figure 1. The step-by-step approach can be briefly described as follows: optimal parameters selected in Step 2 to estimate Rp; (ii) Use the validation dataset to run the six semiempirical models with the a priori free parameters obtained in Step 2 to estimate Rp.
Step 4: Compare the estimated Rp with the measured Rp from the validation dataset, calculate the RMSE and evaluate models' performance for both ML-based and semi-empirical models. Intercompare the accuracy of the SVR-, KNN-, and RF-based BPDF models with that of the GRNNbased BPDF model.  Step 1: Divide the POLDER/PARASOL BPDF database into two parts, 75% as the training dataset and 25% as the validation dataset. Each dataset includes the input variables, involving F p , SA, R 670 , and R 865 , and the output R p .
Step 2: (i) For GRNN-, SVR-, and KNN-based BPDF models, use the training dataset to optimize the model parameters, i.e., σ for GRNN, C and γ for SVR and K for KNN. For the RF-based BPDF model, set the ntree to 100; (ii) For the six semi-empirical models, use the training dataset to obtain the a priori free parameters.
Step 3: (i) Use the input variables of the validation dataset to run the ML-based models with the optimal parameters selected in Step 2 to estimate R p ; (ii) Use the validation dataset to run the six semi-empirical models with the a priori free parameters obtained in Step 2 to estimate R p .
Step 4: Compare the estimated R p with the measured R p from the validation dataset, calculate the RMSE and evaluate models' performance for both ML-based and semi-empirical models. Intercompare the accuracy of the SVR-, KNN-, and RF-based BPDF models with that of the GRNN-based BPDF model.

Results
Among the four ML-based models, the GRNN-, SVR-, and KNN-based BPDF models required the optimizing process of model parameters, following Section 2.4 and Figure 1. The selected optimal parameters for these three models were shown in Table 3, whereas the ntree of the RF model was fixed to 100. ML-based models with the optimal model parameters, and semi-empirical models with the calculated a priori free parameters (not shown), were used to estimate R p in the validation dataset.
The RMSEs (×100) of the semi-empirical and the ML-based models were illustrated in Table 4. It can be seen that all the four ML-based BPDF models outperform the semi-empirical models for all IGBP classes. Among the ML-based models, the KNN-based BPDF model gives the best results in terms of RMSE for most of the IGBP classes, i.e., 10 out of 16 classes. Following are the GRNN-and the SVR-based models, with best performance for six and three classes, respectively. The relatively worst Remote Sens. 2020, 12, 3891 9 of 21 results among the four ML-based models were obtained by the RF-based model, which only gives the best results for IGBP 01. Nevertheless, the performances of the four ML-based models are quite close to each other, with the difference within 0.0001 in regards of RMSE. Conversely, the accuracy of the six semi-empirical models vary greatly, among which the Xie-Cheng model is the best performing one, yielding the best results for 12 IGBP classes, as shown in Table 4. The last column of Table 4 records the improvement (in percentage) of the KNN-based model over the Xie-Cheng model, and it is shown that the KNN-based model outperforms the Xie-Cheng model for all the IGBP classes. The largest improvement of more than 18% was seen in IGBP 06 and 07 (closed and open shrubland), whereas the smallest improvement of 3.57% was found in IGBP 13 (urban and built-up area), which is the surface type that Xie-Cheng model proposed for. Besides, IGBP 04 (deciduous broadleaf forest), 09 (savannas), 15 (snow and ice), and 16 (desert) were also found with the improvement of more than 10%. Overall, compared with the Xie-Cheng model, the KNN-based model improved the accuracy by 9.55%. It is thus clear that the ML-based BPDF models are more accurate than the semi-empirical models.   Figure 2 shows the measurement-model plots of the best performing semi-empirical model, i.e., the Xie-Cheng model, the GRNN-based BPDF model and the three newly built ML-based BPDF models. Both Xie-Cheng and ML-based BPDF models reproduced R p with good correlation with the measurement, with the correlation coefficient (Cor.) bigger than 0.8. ML-based models yield results with higher Cor., improving Cor. of the Xie-Cheng model by 0.01 to 0.05. The RMSE of the ML-based models are apparently lower than that of the Xie-Cheng model. The Xie-Cheng model tends to be saturated when R p becomes larger, i.e., R p is either underestimated or overestimated in such cases. This kind of saturation is obvious in Figure 2 especially for IGBP 02 (forest), 07 (shrubland), and 10 (grassland) when the measured R p is bigger than 0.3%, 0.8%, and 0.8%, respectively. Such saturation, however, does not appear in the four ML-based models. Additionally, the Xie-Cheng model failed to estimate the negative R p , whereas it can be reproduced by the ML-based models, as can be seen from Figure 2.    Figure 3 illustrated the ML-based BPDF models' capability to reproduce the negative R p when observations of all 16 IGBP classes (18580 observations) are taken into consideration. There is no much difference among the accuracy of the four models in terms of both RMSE (around 0.0025) and Cor. (around 0.076), among which the SVR-based BPDF model yields slightly lower RMSE. Most of the measured negative R p are no less than −0.3% except for IGBP 15 (ice and snow, 25.8% of all the observations with negative R p ), whose negative R p can reach −1%. However, there is an obvious overestimation for the four models, with most of the modeled R p ranging from −0.2% to 0.2% or even greater. It is notable that the bins with warmer colors indicate much more number of scatters than the bins with cooler colors.

Discussion
Polarized reflectance has been found to have a non-linear relationship with sun-sensor geometry and/or vegetation coverage [10,23,34]. In this study, ML techniques were adopted for parameterization of such non-linear relationship. We built three ML-based models, i.e., the SVR-, KNN-, and RF-based BPDF models for the estimation of the polarized reflectance. The performances of these ML-based BPDF models were compared to that of the GRNN-based model and six widely used semi-empirical BPDF models using the POLDER/PARASOL measurements. Experiments suggested that (1) the accuracy of the three newly proposed ML-based models and the GRNN-based model were quite close to each other, among which the KNN-based model showed slightly better performance; (2) the ML-based BPDF models provided comprehensively better performance than the widely used semi-empirical models (Table 4). These results thus confirmed the feasibility of accurate estimation of polarized reflectance of land surfaces using the machine learning techniques.

Discussion
Polarized reflectance has been found to have a non-linear relationship with sun-sensor geometry and/or vegetation coverage [10,23,34]. In this study, ML techniques were adopted for parameterization of such non-linear relationship. We built three ML-based models, i.e., the SVR-, KNN-, and RF-based BPDF models for the estimation of the polarized reflectance. The performances of these ML-based BPDF models were compared to that of the GRNN-based model and six widely used semi-empirical BPDF models using the POLDER/PARASOL measurements. Experiments suggested that (1) the accuracy of the three newly proposed ML-based models and the GRNN-based model were quite close to each other, among which the KNN-based model showed slightly better performance; (2) the ML-based BPDF models provided comprehensively better performance than the widely used semi-empirical models ( Table 4). These results thus confirmed the feasibility of accurate estimation of polarized reflectance of land surfaces using the machine learning techniques.

Advantages of the ML-Based BPDF Models
One advantage of the ML-based BPDF models over the semi-empirical models found in this study is their ability of avoiding modeling saturation in the forward scattering directions. For the semi-empirical models, the scheme of selection the class-based a priori free parameters led to a reduction of the a priori modeling accuracy. The selected priori free parameters determined the maximum polarized reflectance of land surfaces in the forward scattering directions. They, however, cannot represent the polarized characteristics of every target. The saturation phenomenon occurred especially for those targets whose polarized reflectance has a large bias from the one determined by the selected priori free parameters. In contrast, the ML-based models simultaneously considered the impact of input multi-variables, i.e., the Fresnel function, the scattering angle, and multi-band reflectance. They directly built a non-linear function between polarized reflectance of land surfaces and the input variables without a fixed mathematical expression. They thus can avoid the limitation of saturation.
Another advantage of the ML-based BPDF models is their ability of modeling negative polarized reflectance of land surfaces. Negative polarized reflectance was generally observed at the backscattering directions where multiple successive scattering occurred [10]. ML-based BPDF models recognized this phenomenon in the training dataset, and could reproduce similar signature when they were used for estimation of polarized reflectance. That is to say, if the polarized reflectance is negative under a specific case (e.g., backscattering directions) in the training dataset, the ML-based BPDF models were likely to give a negative estimation when the input variables were similar to the specific case. As for the semi-empirical BPDF models, they neglected such phenomenon and could only give positive estimations.

Further Improvements Using Different Configuration of Input Variables
Selection of input variables is critical for the performance of the ML algorithms [30,36,50,58]. Among various information provided in the POLDER/PARASOL BRDF-BPDF database, only four variables-i.e., F p , SA, R 670 , and R 865 -were used as the inputs of the ML-based BPDF models in this study. Naturally, one may question if different configurations of input variables-e.g., addition of reflectance at other wavebands or vegetation indices-could help to improve the performance of the ML-based BPDF models. For this purpose, we designed four experiments, for which different configuration of input variables were adopted. These input configurations are listed in Table 5. It is notable that F p and SA were used in all these experiments because they are directly related to the polarized characteristics of land surfaces. C0 is the original configuration used in Section 3. In C1, reflectance at 565 nm (R 565 ) was considered. Unlike C0 and C1, vegetation indices were used as input variables in C2. Here, the selected VIs included two types, i.e., normalized difference (ND) and simple ratio (SR) [59] where R ref and R idx are reflectance at the reference and index wavelengths, respectively. The index wavelength was fixed to 670 nm, whereas the reference wavelength referred to 865 nm or 565 nm.
Finally, C3 adopted reflectances at all the available bands in the BRDF-BPDF database. The optimized model parameters of different configurations were listed in Table A3. Figure 4 and Table 6 showed the improvements of the ML-based BPDF models with different configuration of input variables over the GRNN-based BPDF model with C0 as input variables. The results suggested that using multi-band reflectance as input variables provided better performance than that using VI, and reflectance of more wavebands leads to a more accurate model. Compared with C0, the RMSE of C1 and C3 were reduced for most of the IGBP classes. Moreover, C3 performed even better than C1, indicating that the more reflectances were utilized, the better performance was achieved. As for using VI as input variables (i.e., C2), the accuracy decreased for most of the IGBP classes (green values in Table 6). For IGBP 04 (deciduous broadleaf forest), 10 (grassland) and 16 (desert), the accuracy was even decreased by more than 5%. It thus suggested that VI could not yield results as good as reflectance, which was also indicated in [60].
When all available reflectances were utilized as input variables, the RF-based BPDF model performed the best. Compared to the GRNN-based model under C0, the overall RMSE of the RF-based BPDF model was decreased by 6.62% (subfigure in Figure 4). The decrease could be up to 16.97% and 14.90% for IGBP 06 and 07 (closed and open shrubland), respectively (Table 6). It was as expected from two aspects. On one hand, more input variables made the RF algorithm randomly select more variables each time for growing each tree, making the individual tree more robust. On the other hand, as an ensemble tree model, RF is more stable when solving the regression problem whose input variables are highly correlated, compared with other ML algorithms. As seen from the matrix of correlation coefficients of the input variables (Table A2), Fresnel factor and scattering angle were highly correlated, and reflectances from 565 nm to 865 nm were moderately correlated with each other. More input variables make RF less possible to choose a pair of well correlated variables for growing an individual tree. For the RF-based BPDF model of each IGBP class, importance of the eight input variables was given in Table 7. Among the eight input variables, R 490 is the most important one for seven IGBP classes; followed are F p and R 670 , which are with the highest value of importance for four and three IGBP classes, respectively. It can be explained by the fact that for most of the land surfaces, e.g., vegetation, polarized reflectance takes up more of total reflectance or even dominates the reflectance of absorbing spectral regions-e.g., 490 nm, 670 nm, and /or some shortwave infrared wavebands [55,59]. Conversely in non-absorbing wavebands, polarized scattering is negligible compared with total reflectance. As a result, reflectance at absorbing bands explain more polarized information. Fresnel factor conveys polarized reflection generated by specular reflection, and was assumed in this and previous studies to be the only generator of the polarized reflectance. It has thus been consistently utilized as a key parameter in all the physical and semi-empirical BPDF models. The importance of F p obtained here further confirmed the feasibility of such assumption. According to Table 7, Fresnel factor, scattering angle, and reflectance at 490 nm and 670 nm were overall important for all the IGBP classes in RF-based BPDF modeling, as they were always in the top two important input variables for every IGBP class. Scattering angle's relatively high importance can be attributed to its high correlation with the Fresnel factor (Table A2); the increase of polarized reflectance with the decrease of the scattering angle can also explain. Importance of the input variables provided helpful information for future studies on BPDF modeling.
In addition to the higher accuracy, the RF-based model required no parameter to be optimized. It made the training process simple and consequently less time-consuming, compared with the other ML-based BPDF models. The RF-based BPDF model with all the available six-band reflectance as input variables was thus recommended in this study for estimation of polarized reflectance of land surfaces.

Limitations of the ML-Based BPDF Models
There are two potential limitations of the ML-based BPDF models. One limitation is that the ML-based BPDF models cannot guarantee a significantly high accuracy for negative R p 's estimation, as shown in Figure 3. The overestimation might be attributed to the impact of neighbor observations in the backscattering area [34]. The observations with negative R p were generally concentrated in a very small region where the corresponding SA is close to 180 • . The R p at the surrounding directions were generally positive. Moreover, the value of negative R p is close to zero (generally greater than −0.003) and smaller than that of positive R p . These two factors made the estimation of negative R p easily affected by the surrounding positive R p .
Another limitation is that the parameters optimized in this study (Table 3) might not be directly applied to measurements from other platforms. It is because these parameters were optimized only against the POLDER/PARASOL measurements, and might be sensor-sensitive. Further optimizing and training process might be needed if one wants to apply these models to other polarization measurements, like measurements from airborne AirMSPI [61], MICROPOL [29], and RSP [62], and the space-borne sensors DPC/GF-5 [63] and 3MI/EPS-SG [64].

Potential Applications of the ML-Based BPDF Models
Using the information of model implementation following Table A1 and optimized input parameters listed in Table 3 (with only four input variables) or Table A3 (with eight input variables), the four ML-based BPDF models trained from POLDER BPDF database can be directly transferred to estimate polarized reflectance from remote sensing data of any platforms. If polarized observations are available, an optimizing process of model parameters should be conducted within the suggested boundaries listed in Table A1 except for the RF-based model. The optimization aims to find a parameter producing the minimum RMSE of cross validation. Although practical, this transfer requires further accuracy assessment to confirm its stability, by using polarized data from other polarimetric platforms. Further investigation using polarized observation from multi-sensors as the training database may produce optimized model parameters that are more generalized.
Polarized reflectance of land surface serves as a necessary boundary condition for retrieval of aerosol optical depth (AOD) [7,21]. The POLDER/PARASOL provided polarized reflectance of various earth targets with an accuracy with errors bigger than 0.001 [21]. It has been reported that a typical error of 0.001 on surface polarized reflectance could lead to an error of 0.04 on AOD [10], Therefore, accurate BPDF models are required for an accurate estimation of surface polarized reflectance. The semi-empirical BPDF models listed in this study produced an overall RMSE of 0.0030. Their error could be even greater for special surface types, e.g., urban area and snow and ice ( Table 4). The KNN-and RF-based BPDF model used in this study reduced the overall error to 0.0027 and 0.0025 (Tables 4 and 6), respectively. The reduction, however, could not significantly improve the AOD retrieval accuracy. Nevertheless, more accurate polarized reflectance is also helpful for retrieval of microphysical properties of aerosol. For this purpose, the four ML-based BPDF models used in this study can be utilized for remote sensing of the complete aerosol properties more precisely [56].

Conclusions
In this study, three ML-based BPDF models were proposed using SVR, KNN, and RF algorithms. These models were evaluated and compared to the GRNN-based and semi-empirical BPDF model using the popular POLDER/PARASOL measurements. When Fresnel factor, scattering angle and two-band reflectance (R 670 and R 865 ) were taken as the input variables, the performances of the ML-based models were quite similar. The KNN-based BPDF model gave slightly better accuracy, and compared with the best semi-empirical model, it decreased the overall RMSE by 9.55%. It was confirmed that the ML techniques performed comprehensively better than semi-empirical regressions, as, on one hand, they could avoid model saturation; and on the other hand, they were able to reproduce the negative polarized reflectance. If all available reflectances at six bands were taken as input variables, the RF-based BPDF model was the most accurate and is thus recommended. The improvement achieved by the RF-based BPDF model over the GRNN-based model is 6.62%. Analyses on the importance of input variables of the RF-based BPDF model suggested that reflectance at 490 nm is the most important input variables for most land surfaces, and Fresnel factor and reflectance at 670 nm also played necessary roles.
This study confirmed the feasibility of applying machine learning techniques to more accurate BPDF modeling. It is notable that the RF-based BPDF model proposed in this study could be used to improve the accuracy of complete aerosol retrieval. The efficiency of the RF algorithm allows its easy migration for training against measurements collected from other polarimetric sensors. This paper gives an insight of applying machine learning techniques to accurate BPDF modeling and thus provides an alternative solution for accurate estimation of polarized reflectance of land surfaces.