A Random Forest Algorithm for Retrieving Canopy Chlorophyll Content of Wheat and Soybean Trained with PROSAIL Simulations Using Adjusted Average Leaf Angle

Canopy chlorophyll content (CCC) is an important indicator for crop-growth monitoring and crop productivity estimation. The hybrid method, involving the PROSAIL radiative transfer model and machine learning algorithms, has been widely applied for crop CCC retrieval. However, PROSAIL’s homogeneous canopy hypothesis limits the ability to use the PROSAIL-based CCC estimation across different crops with a row structure. In addition to leaf area index (LAI), average leaf angle (ALA) is the most important canopy structure factor in the PROSAIL model. Under the same LAI, adjustment of the ALA can make a PROSAIL simulation obtain the same canopy gap as the heterogeneous canopy at a specific observation angle. Therefore, parameterization of an adjusted ALA (ALAadj) is an optimal choice to make the PROSAIL model suitable for specific row-planted crops. This paper attempted to improve PROSAIL-based CCC retrieval for different crops, using a random forest algorithm, by introducing the prior knowledge of crop-specific ALAadj. Based on the field reflectance spectrum at nadir, leaf area index, and leaf chlorophyll content, parameterization of the ALAadj in the PROSAIL model for wheat and soybean was carried out. An algorithm integrating the random forest and PROSAIL simulations with prior ALAadj information was developed for wheat and soybean CCC retrieval. Ground-measured CCC measurements were used to validate the CCC retrieved from canopy spectra. The results showed that the ALAadj values (62 degrees for wheat; 45 degrees for soybean) that were parameterized for the PROSAIL model demonstrated good discrimination between the two crops. 
The proposed algorithm improved the CCC retrieval accuracy for wheat and soybean, regardless of whether continuous visible to near-infrared spectra with 50 bands (RMSE from 39.9 to 32.9 μg cm−2; R2 from 0.67 to 0.76) or discrete spectra with 13 bands (RMSE from 43.9 to 33.7 μg cm−2; R2 from 0.63 to 0.74) and nine bands (RMSE from 45.1 to 37.0 μg cm−2; R2 from 0.61 to 0.71) were used. The proposed hybrid algorithm, based on PROSAIL simulations with ALAadj, has the potential for satellite-based CCC estimation across different crop types, and it also has a good reference value for the retrieval of other crop parameters.


Introduction
As the primary component of leaf photosynthesis, chlorophyll is a key determinant of the photosynthetic rate and dry matter accumulation [1][2][3]. The canopy chlorophyll content (CCC), that is, the chlorophyll mass per unit of ground area, is the product of the leaf area index (LAI) and the leaf chlorophyll content (LCC) expressed per unit leaf area. CCC is an important indicator of crop growth status and canopy photosynthetic capacity [4]. Accurate acquisition of crop CCC over large areas is very important for crop gross primary productivity estimation, field water and fertilizer management, pest control, and other agricultural applications [4][5][6]. However, traditional measurements of crop chlorophyll, including the destructive laboratory method [7] and the nondestructive field-portable instrument method [8], are inefficient and resource-intensive, so they are not practical over large areas.
Remote sensing technology has been widely used to retrieve vegetation parameters because it collects data quickly and non-destructively. Owing to pigment absorption in the visible bands and multiple scattering in the near-infrared bands, the vegetation spectrum has distinctive features in these bands, which makes it possible to monitor vegetation chlorophyll status. CCC quantifies total canopy chlorophyll content and contains both LAI and LCC information. The reflectance of red-edge bands (700-760 nm) is sensitive to chlorophyll [9] and is commonly used to retrieve the crop CCC [10,11].
Empirical, machine learning, and physics-based approaches are commonly used to retrieve vegetation CCC [9,12,13]. In empirical approaches, regression relationships established through statistical analysis of large amounts of experimental data are used to retrieve CCC from canopy spectra [13][14][15]. Compared to empirical statistical models, machine learning methods have a stronger ability to fit complex nonlinear relationships. Machine learning methods, such as the support vector machine (SVM), artificial neural network (ANN), and random forest regression (RFR), have been widely used to retrieve vegetation parameters [16][17][18][19]. Nevertheless, training an SVM with high-dimensional data can be extremely slow [20], while ANN is prone to overfitting and its parameter setting is more complicated [21]. Compared to SVM and ANN, RFR has proven to be a very robust machine learning algorithm for the retrieval of vegetation parameters, including the CCC [21,22]. However, retrieval models established using empirical methods or machine learning are mostly trained on locally sampled data. Although such a model can be accurate for a specific crop type, area, or time period, it generalizes poorly across time and space [20]. In recent years, a hybrid retrieval approach involving vegetation radiative transfer models (RTMs) and machine learning algorithms has been developed to compensate for limited field data [23]. The forward simulation of an RTM is an important way to obtain the extensive training data needed for machine learning and helps to obtain a more robust retrieval model. The PROSAIL model, a combination of the PROSPECT leaf model [24] and the Scattering by Arbitrarily Inclined Leaves (SAIL) canopy model [25], is one of the most widely used canopy reflectance models. As a single-layer homogeneous canopy model, PROSAIL has been widely used for retrieving crop biochemical and structural variables [26][27][28].
RTM-based crop-CCC modeling requires attention to the interference of multiple vegetation factors, especially canopy structure factors [26,27]. At the canopy level, reflectance is affected not only by leaf biochemical parameters but also by LAI, leaf inclination distribution, soil background, and other structure factors [29]. The leaf inclination distribution, often summarized simply by the average leaf angle (ALA), is an important canopy structure characteristic that determines canopy spectral behavior. In field observation experiments and RTM simulation, ALA is an important vegetation structure parameter [30]. Although the sample size based on the SAIL forward model is large, the effects of LAI and ALA on canopy reflectance are closely related to one another, which makes it difficult to estimate LAI and ALA accurately and simultaneously from the canopy spectrum [30,31]. Because CCC already contains LAI information, the variation in leaf inclination distribution is the main interference factor in the remote sensing retrieval of the CCC. It is therefore necessary to consider prior knowledge of the ALA parameter in PROSAIL-model-based CCC retrieval for different crop types.
Introducing the prior knowledge of the ALA parameter in the PROSAIL model is a way to reduce the uncertainty of CCC retrieval across different crop types. However, there is a conflict between the homogeneous canopy assumption of SAIL and the actual crop row structure, which makes it difficult to improve the spectral simulation and retrieval of SAIL with the prior knowledge of real ALA. In addition to the ALA, the row structure also affects the proportion of exposed background soil, which is related to the reflectivity of the whole canopy [32,33]. This leads to the problem of existing differences between the measured canopy spectra and outputs of the SAIL model even when the field measurement variables are defined as the inputs of the SAIL model [34,35]. Among the simplified parameters of the PROSAIL model, ALA expresses the canopy's geometrical properties. Therefore, the ALA input may be reasonably adjusted for the PROSAIL model in order to obtain simulated spectra that are comparable to those of the measured crop spectra [34]. It has been proven that an adjusted ALA (ALAadj) can be obtained by matching the PROSAIL model with measured spectra and non-ALA-measured parameters [34]. ALAadj for the PROSAIL model integrates leaf inclination and canopy heterogeneity information. Compared to in-situ ALA, ALAadj can better reflect the difference in canopy heterogeneity between different crop types. Therefore, prior knowledge of ALAadj has the potential to improve the performance of PROSAIL-based CCC retrieval across crop types. However, research on the application of ALAadj to vegetation parameter retrieval has not yet been carried out.
The objectives of this study were to use the ALAadj of crops as a prior knowledge input to improve the performance of RTM-based machine learning algorithms for CCC retrieval across different crop types. The mature and stable RFR algorithm was used in our work. Two important food crops, wheat and soybean, were used for algorithm testing. In this study, we evaluated the significance of ALA and other canopy parameters on crop spectra, using the PROSAIL model simulation experiment. Meanwhile, ALAadj parameterization of the PROSAIL model for wheat and soybean was carried out; then, a CCC hybrid retrieval model integrating prior ALAadj knowledge related to crop types was established. Finally, the accuracy of the improved retrieval model was evaluated using ground-measured CCC.

Experimental Areas and Ground Data
In this study, two ground-measured datasets, one for wheat and one for soybean, were used for algorithm validation; they were taken from experiments carried out in Beijing, China, and Nebraska, USA, respectively. Table 1 shows the experimental locations of the collected datasets of wheat and soybean.

Wheat Experimental Area
The wheat field campaigns were conducted in 2002, 2004, and 2019 (earlier campaign data are detailed in [36]). The experimental wheat plot was located at the National Station for Precision Agriculture, Xiaotangshan, Beijing, China (referred to below as the BJ-XTS site). The biophysical parameters of winter wheat in the growing season were determined. The wheat experiments covered the erecting, early jointing, late jointing, head emergence, and filling stages. After the erecting stage, the canopy changes from open to dense as the wheat grows. In the 2002 campaign, 48 plots included four water treatment plans, four nitrogen fertilization densities, and three winter wheat varieties. In the 2004 campaign, 21 winter wheat varieties were planted with non-differentiated fertilization and irrigation management in 42 plots. In the 2019 campaign, 32 plots were analyzed, with four nitrogen fertilization densities, four fertilization treatment plans, and two winter wheat varieties.

Soybean Experimental Area
The experimental soybean area consisted of two AmeriFlux sites (US-Ne2 and US-Ne3), located at the University of Nebraska-Lincoln Agricultural Research and Development Center near Mead, NE, USA (https://ameriflux.lbl.gov/sites/). The US-Ne2 and US-Ne3 sites were both planted in a maize-soybean rotation model, with soybean planted in even years. The US-Ne2 site used a center-pivot irrigation system, while US-Ne3 relied entirely on rainfall for moisture. At the US-Ne2 (2002 and 2004) and US-Ne3 (2002) sites, six sub-sampling points were randomly designed to obtain spectrum and biophysical parameters (detailed in [37]). Table 1 provides detailed information on the ground sample data.

Ground Data
(1) Determination of canopy chlorophyll content
At the BJ-XTS site, fresh leaves of wheat were collected in each 1 m² plot and then placed in a chilled box for transport to the laboratory for leaf chlorophyll content measurement (detailed in [38]). The LCC of the sampled leaves was measured by the laboratory ultraviolet-visible spectrophotometer method given in [7]. Based on the dry weight ratio of the sub-sample to the total leaf sample, the measured leaf area of the sub-sample was converted to the LAI at the BJ-XTS site. At the US-Ne2 and US-Ne3 sites, fresh leaves of soybean were collected; then, the LCC was determined through the standard procedures (detailed in [15,39]). The LAI was measured using an area meter (Model LI-3100, Li-Cor Inc., Lincoln, NE, USA). The CCC at the BJ-XTS, US-Ne2, and US-Ne3 sites was calculated by multiplying the LAI by the LCC.
(2) Canopy reflectance measurements
The ASD FieldSpec spectrometer (Analytical Spectral Devices, Boulder, CO, USA) was used to obtain the canopy reflectance at the BJ-XTS site. The spectral range of the spectrometer was 350-2500 nm, and the spectral resolution was 3 and 10 nm in the 350-1050 and 1050-2500 nm ranges, respectively. With a standard reference plate, the canopy spectra were measured under cloudless conditions from 10 a.m. to 2 p.m. local time. The wheat canopy spectra were acquired at a field of view of 25°. At the US-Ne2 and US-Ne3 sites, the canopy reflectance spectra were measured using the Ocean Optics USB2000 radiometer (Dunedin Ocean Optics, Dunedin, FL, USA) [40]. The spectral range was 400-1100 nm, with a spectral resolution of 1.5 nm and a spectral interval of 10 nm. One radiometer measured the upward-reflected radiation of the crop canopy at a field of view of 25°. The second radiometer measured the downward solar radiation with a hemispherical field of view. The spectrum measurements were obtained under clear sky conditions from 11 a.m. to 2 p.m. local time. Then, the canopy reflectance was calculated from the measured upward and downward radiation data (detailed in [5]).
The wheat field-measured spectra were resampled to the spectral range and interval of the soybean spectral data. To avoid bias from unequal sample sizes, 73 wheat samples were randomly selected from all the collected wheat samples to match the 73 soybean samples (shown in Table 1). The prior ALAadj knowledge of wheat and soybean was obtained by fitting the PROSAIL model using a random quarter of the measured samples. The remaining three-quarters of the measured samples were used to test the performance of our improved CCC retrieval algorithm.
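The random quarter versus three-quarters split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_quarter`, the seed, and the choice to round the quarter down are all assumptions, since the text does not state how 73 samples were divided.

```python
import random

def split_quarter(samples, seed=0):
    """Randomly split samples into a fitting quarter and a testing remainder."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_fit = len(shuffled) // 4         # assumption: the quarter is rounded down
    return shuffled[:n_fit], shuffled[n_fit:]

# Hypothetical sample identifiers standing in for the 73 wheat samples.
wheat_ids = list(range(73))
fit_ids, test_ids = split_quarter(wheat_ids, seed=42)
```

Under the rounding assumption, 18 samples would be used for ALAadj fitting and 55 for testing per crop.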

Spectra Simulation of Homogeneous Canopy Based on PROSAIL Model
As an improved version of the PROSAIL model, the PROSAIL-D model, coupled with the PROSPECT-D leaf model [41] and the 4SAIL canopy model [42], was used to simulate crop canopy spectrum in the 400-2500 nm wavelength range for training the CCC retrieval model, with the input parameter settings as in Table 2.
Among the leaf parameters of the PROSPECT-D model, the LCC value was set between 10 and 80 μg cm−2. The carotenoid content was set to 25% of the LCC and varied with it. The leaf dry matter content (Cm) ranged from 0.003 to 0.006 g cm−2, and the leaf structure parameter (N) ranged from 1.0 to 2.0. Equivalent water thickness was set to a fixed value because the water absorption bands are generally not used in the retrieval of vegetation chlorophyll. The canopy structure in the 4SAIL model was defined by the LAI, ALA, soil reflectance, and hot-spot parameters (shown in Table 2). The LAI value was set between 0.5 and 8. Two widely used forms of the leaf inclination angle distribution function (LIDF), Campbell's algorithm [43] and Verhoef's algorithm [44], have been implemented in multiple versions of the SAIL code. Horizontal leaves can have higher light interception capabilities, and a certain number of horizontal leaves is common in plant canopies [45]. The ellipsoidal function in Campbell's algorithm gives a probability density of zero when the leaf angle is zero or close to zero, which is inconsistent with the actual canopy situation [46]. Verhoef's leaf inclination distribution algorithm, which uses a linear combination of trigonometric functions, mitigates this problem [44]. Therefore, Verhoef's algorithm with the LIDF parameters was selected for the PROSAIL-D simulation in this study. LIDFa and LIDFb control the ALA and the distribution bimodality, respectively. The ALA value in degrees can be calculated from LIDFa as ALA = 45 − 360 × LIDFa / π² [44]; it ranged from 10 to 80 degrees in the simulations. Compared to ALA, LIDFb has a very small effect on canopy reflectance [29], and LIDFb was fixed at 0.
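The relation between LIDFa and ALA quoted above, ALA = 45 − 360 × LIDFa / π², is easy to get wrong in sign, so a small conversion sketch may help. The function names are illustrative and not part of any PROSAIL package.

```python
import math

def ala_from_lidfa(lidfa):
    """Average leaf angle in degrees from Verhoef's LIDFa parameter."""
    return 45.0 - 360.0 * lidfa / math.pi ** 2

def lidfa_from_ala(ala_deg):
    """Inverse relation: LIDFa that corresponds to a given ALA in degrees."""
    return (45.0 - ala_deg) * math.pi ** 2 / 360.0

# LIDFa values for the ALA limits used in the simulations (10 and 80 degrees).
lidfa_limits = (lidfa_from_ala(10.0), lidfa_from_ala(80.0))
```

With this relation, the 10 to 80 degree ALA range corresponds to LIDFa values of roughly +0.96 to −0.96, so larger LIDFa means flatter leaves.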
To represent the soil background variations, the measured spectra of bare and dry soil were multiplied by the brightness coefficients of five gradients (0.1, 0.25, 0.5, 0.75, and 1), and five soil spectra from light to dark were determined for the PROSAIL-D simulation. According to the field sampling of the canopy spectrum at nadir, a fixed observation zenith angle of 0 degrees and the variable sun angle values were set.
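A minimal sketch of how a training table could be drawn from the parameter ranges given in this section is shown below. It assumes uniform, independent sampling, which the text does not state (the actual distributions are in Table 2 of the paper), and it does not call a PROSAIL implementation; the dictionary keys and function names are hypothetical.

```python
import random

# Brightness coefficients applied to the measured bare-soil spectrum.
SOIL_BRIGHTNESS = (0.1, 0.25, 0.5, 0.75, 1.0)

def sample_inputs(n, seed=0):
    """Draw n parameter sets; uniform sampling is an assumption."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        lcc = rng.uniform(10.0, 80.0)            # leaf chlorophyll, ug cm-2
        lai = rng.uniform(0.5, 8.0)              # leaf area index
        rows.append({
            "LCC": lcc,
            "Car": 0.25 * lcc,                   # carotenoids tied to LCC
            "Cm": rng.uniform(0.003, 0.006),     # dry matter, g cm-2
            "N": rng.uniform(1.0, 2.0),          # leaf structure parameter
            "LAI": lai,
            "ALA": rng.uniform(10.0, 80.0),      # average leaf angle, degrees
            "soil_brightness": rng.choice(SOIL_BRIGHTNESS),
            "CCC": lai * lcc,                    # retrieval target, ug cm-2
        })
    return rows

def scale_soil(base_spectrum, coefficient):
    """Scale a measured soil reflectance spectrum by a brightness coefficient."""
    return [coefficient * r for r in base_spectrum]

rows = sample_inputs(200, seed=1)
```

Each row would then be passed to a PROSAIL-D forward run to produce the simulated spectrum paired with its CCC value.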

Spectra Simulation of the Heterogeneous Canopy with Mixed Pixels
The SAIL model as a part of the PROSAIL model is a turbid medium model that expresses a homogeneous canopy. The SAIL model assumes that the canopy is composed of extremely small leaves randomly distributed in the horizontal layer [25]. However, the assumption of homogeneous canopy conditions in the PROSAIL model is inconsistent with the actual heterogeneity of most row-sown crops. Different crops have different leaf inclination characteristics, and their planting structures also have large discrepancies. For example, a canopy of row-planted crops has more soil exposed than a uniform canopy [33,35]. Taking mixed pixels as a special case of a heterogeneous canopy, the influence of the canopy heterogeneity factor on reflectance was evaluated in this study.
The spectral linear mixing method [47] was used to fit the reflectance of vegetation-soil mixed pixels, named Rmixed:
$$R_{mixed} = (1 - FC_{soil}) \times R_{veg} + FC_{soil} \times R_{soil}$$
where Rveg is the reflectance of the pure vegetation-covered endmember based on PROSAIL simulations, Rsoil is the reflectance of the soil endmember, and FCsoil is the fractional cover of the soil endmember, which was set to range from 0% to 50%.
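The linear mixing of a vegetation endmember and a soil endmember can be sketched per band as below. The weighting (1 − FCsoil) for vegetation and FCsoil for soil is the standard two-endmember form and is assumed here; the reflectance values in the example are invented for illustration.

```python
def mix_reflectance(r_veg, r_soil, fc_soil):
    """Two-endmember linear mixture of vegetation and soil reflectance."""
    if not 0.0 <= fc_soil <= 1.0:
        raise ValueError("fc_soil must be a fraction between 0 and 1")
    if len(r_veg) != len(r_soil):
        raise ValueError("spectra must share the same bands")
    return [(1.0 - fc_soil) * v + fc_soil * s for v, s in zip(r_veg, r_soil)]

# Illustrative red and NIR reflectances (made-up numbers).
r_mixed = mix_reflectance([0.05, 0.45], [0.15, 0.25], fc_soil=0.5)
```

With these made-up inputs, adding soil raises the red reflectance and lowers the NIR reflectance, which mimics a sparser canopy.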

Methods
The specific flow chart of the method is shown in Figure 1.

Random Forest
As an advanced machine learning algorithm, random forest is considered to offer accurate prediction and good fault tolerance [48,49]. We used random forest for three tasks: evaluating the influence of ALA and other vegetation parameters on reflectance; adjusting the ALA for the PROSAIL model based on the measured spectrum and non-ALA parameters; and retrieving CCC using the crop-type-related ALAadj. Random forest uses many decision trees to train samples and make predictions. A decision tree is a supervised statistical model that is constructed from the training set for prediction or classification. Random forest is an ensemble of multiple decision trees, based on the bagging strategy [50]. After selecting samples, k attributes are randomly selected from the sample attributes, and then the information gain and Gini index methods are used to repeatedly find the best splitting attributes to build a CART decision tree. The above process is repeated to establish m trees; these decision trees then form a random forest, and prediction results are obtained by averaging.
Random forest is a common machine learning method that can be used for the nonlinear regression of a large amount of data and for the retrieval of vegetation parameters, using a radiative transfer model such as PROSAIL. The random forest method has the following characteristics [51,52]: (1) among current machine learning algorithms, it has a fairly high accuracy; (2) it can run effectively on huge datasets; (3) it is able to process input data with high-dimensional features; (4) it can evaluate the importance of each variable; (5) it can obtain unbiased estimates of internally generated errors; (6) it can also give good results for discrete input values. Using MATLAB software (The MathWorks, Natick, MA, USA) in this study, the simulation dataset was divided into 100 homogeneous subsets; that is, 100 trees were established and their results were averaged to obtain the final prediction.
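To make the bagging and random attribute selection described above concrete, the following is a small pure-Python sketch of random forest regression. It is not the MATLAB implementation used in the study: the squared-error split criterion (the usual choice for regression trees), the default of one third of the features per split, the leaf size, the depth limit, and the toy data are all assumptions made for illustration.

```python
import random
from statistics import fmean

def _sse(values):
    """Sum of squared deviations from the mean (split quality for regression)."""
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)

def _take(X, y, idx):
    return [X[i] for i in idx], [y[i] for i in idx]

def build_tree(X, y, rng, k, min_leaf=3, max_depth=8):
    """Grow one regression tree; each split considers k randomly chosen features."""
    if max_depth == 0 or len(y) < 2 * min_leaf:
        return fmean(y)                               # leaf: mean of targets
    best = None
    for f in rng.sample(range(len(X[0])), k):
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= thr]
            right = [i for i, row in enumerate(X) if row[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    if best is None:
        return fmean(y)                               # no admissible split
    _, f, thr, left, right = best
    return (f, thr,
            build_tree(*_take(X, y, left), rng, k, min_leaf, max_depth - 1),
            build_tree(*_take(X, y, right), rng, k, min_leaf, max_depth - 1))

def predict_tree(node, row):
    while isinstance(node, tuple):                    # internal node
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node

def fit_forest(X, y, n_trees=100, k=None, seed=0):
    """Bagging: every tree is grown on a bootstrap resample of the data."""
    rng = random.Random(seed)
    if k is None:
        k = max(1, len(X[0]) // 3)                    # assumed default
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]      # sample with replacement
        forest.append(build_tree(*_take(X, y, idx), rng, k))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return fmean(predict_tree(tree, row) for tree in forest)

# Toy data: the target depends only on the first of three features.
X = [[i / 10, (i * 7) % 5, (i * 3) % 4] for i in range(60)]
y = [3.0 * row[0] for row in X]
forest = fit_forest(X, y, n_trees=20, k=2, seed=1)
low, mid, high = (predict_forest(forest, [x0, 0, 0]) for x0 in (1.0, 3.0, 5.0))
```

In practice a library implementation would be used instead; the sketch only shows where the bootstrap sampling, the random choice of k attributes, and the final averaging enter.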

Importance Assessment Method of Vegetation Variables on Canopy Reflectance
The variable importance measures of RFR were used to evaluate the contribution of different vegetation parameters to reflectance at the visible and near-infrared (NIR) wavelengths below 900 nm. Random forest, as an ensemble of decision trees, can provide a flexible mapping between input and output and a measure of the importance of each variable. In addition to evaluating the importance of quantitative input-output relationships, global importance evaluation can be carried out for situations where the input or output variables are not quantifiable, such as soil types. Field-measured data have some inherent limitations; for example, the LCC and LAI are usually correlated to some degree, which hinders a global importance analysis of ALA on reflectance. In simulated data based on a physical model such as PROSAIL, each parameter can be set to be completely independent, so such data are suitable for evaluating the contribution of vegetation variables to vegetation reflectance [34,53].
In this study, the random forest algorithm is used to train the nonlinear relationship between parameters and reflectivity at different wavelengths, and to simultaneously calculate the importance of all independent variables versus dependent variables. Leaf and canopy parameters, including the LCC, N, Cm, LAI, Rsoil, ALA, and FCsoil, are the independent variables, while the PROSAIL-simulated reflectance for the homogeneous canopy case and the mixed pixels cases act as the dependent variable.

PROSAIL Parameterization Scheme of ALAadj Based on Measured Spectrum and Non-ALA Parameters
The same LAI corresponds to a larger soil fraction under a spatially heterogeneous canopy than under a homogeneous canopy [54]. ALA variation changes the gap of the homogeneous canopy in a specific observation direction [25,54,55], which is why ALAadj can be used to make the PROSAIL model suitable for spatially heterogeneous canopy cases. The conceptual diagram in Figure 2 describes the role of ALAadj in improving the applicability of the PROSAIL model, which assumes a homogeneous canopy. In Figure 2a, we simplified the homogeneous canopy as a leaf completely covering the ground, with a leaf angle of ALA. For the special case shown in Figure 2a, the corresponding ground area covered by the leaf is calculated based on ALA, as below:
$$A = LA \times \cos(ALA)$$
where A is the ground area and LA is the leaf area; here, LAI = LA/A. Figure 2b shows that, in the case of a mixed pixel canopy with the same LAI, the leaf clumping effect causes a gap fraction of the soil endmember. The gap area of the soil endmember is a, so a/A is equivalent to FCsoil, the ratio of the soil endmember in a mixed vegetation-soil pixel.
The limitation of the PROSAIL model in simulating heterogeneous canopy reflectance may affect the performance of a PROSAIL-based hybrid model of CCC retrieval. Figure 2c shows that the homogeneous canopy case with the leaf angle adjusted to ALAadj has the same gap fraction as the heterogeneous canopy case in Figure 2b. The ground area covered by the leaf, calculated based on ALAadj in Figure 2c, can be determined as below:
$$A - a = LA \times \cos(ALA_{adj})$$
where a is the same soil gap as that in Figure 2b. Figure 2c shows that the ALAadj can compensate for the increase in soil gap caused by the leaf aggregation effect at a specific observation direction, which may make the PROSAIL-simulated spectrum more consistent with that of an actual row-planted canopy. By fitting the PROSAIL simulations to the measured spectrum and non-ALA parameters such as LAI, the ALA can be adjusted [34]; that is, the prior ALAadj knowledge for wheat and soybean can be obtained. The prior ALAadj knowledge for one crop type is simply characterized by the average ALAadj. Inversion of the PROSAIL model generally obtains the fitted parameters that match the measured data in one of three ways: iterative optimization based on forward runs of the physical model, the look-up table (LUT), and machine learning. Compared to repeated forward-model calls, the LUT improves the calculation speed and avoids the risk of converging to local minima [56], while the machine learning approach saves considerable storage space compared to the LUT approach [32].
In this study, the RFR was used for optimizing the ALA of the PROSAIL model based on ground-measured data; then, the ALAadj of wheat and soybean was obtained. Optimizing ALA requires choosing the measured vegetation parameters and the bands to be used. Because the spectral response regions of the LAI and ALA overlap, and because the CCC contains LAI information, ALA optimization for the PROSAIL model needs the measured LAI data in addition to the spectrum. LCC was used as supplementary measured data (Figures S1 and S2); when available, it was also used in the PROSAIL model fitting process to obtain ALAadj. Based on the analysis of the measured and simulated data, it was found that the main spectral response regions of ALA are the red-edge and near-infrared regions [34,57]. We therefore used the bands sensitive to canopy ALA to optimize ALA from the measured spectra and non-ALA vegetation parameters. The specific ALAadj formula, based on RFR, is as follows:
$$ALA_{adj} = RFR(SPEC, LAI_{measured}, LCC_{measured})$$
where SPEC is the canopy reflectance, and LAImeasured and LCCmeasured are the ground-measured LAI and LCC values. Depending on whether chlorophyll is included, there are two combinations of inputs: (a) SPEC_LAI, meaning that the inputs are the reflectance and LAImeasured; (b) SPEC_LAI_LCC, meaning that the inputs are the reflectance, LAImeasured, and LCCmeasured.
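The geometric argument of Figure 2 can be turned into a short worked calculation. The sketch below assumes the simplified single-leaf geometry, with A = LA × cos(ALA) for the homogeneous case and A − a = LA × cos(ALAadj) for the adjusted case; combining the two with FCsoil = a/A gives cos(ALAadj) = (1 − FCsoil) × cos(ALA). The function name and the example numbers are illustrative only and are not the RFR-based parameterization used in the study.

```python
import math

def adjusted_ala(ala_deg, fc_soil):
    """ALAadj (degrees) that reproduces a soil gap fraction fc_soil at nadir.

    Assumes the idealized single-leaf geometry of the conceptual diagram:
    cos(ALAadj) = (1 - fc_soil) * cos(ALA).
    """
    if not 0.0 <= fc_soil < 1.0:
        raise ValueError("fc_soil must be in [0, 1)")
    cos_adj = (1.0 - fc_soil) * math.cos(math.radians(ala_deg))
    return math.degrees(math.acos(cos_adj))

# Example: a 45-degree canopy with 30% exposed soil (made-up numbers).
ala_adj = adjusted_ala(45.0, 0.3)
```

Under these assumptions the adjusted angle is always at least as large as the true ALA, which matches the idea that a more erect leaf angle opens the same nadir gap that row structure creates.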

CCC Retrieval Model, Based on ALAadj Related to Crop Types
In this study, the PROSAIL-D model was used to simulate the canopy's spectral reflectance, and a training dataset containing multiple parameter variations, such as ALA, LAI, and LCC was established. The CCC retrieval RFR models were established, whether prior ALAadj knowledge was used or not. We used RFR to identify the bands related to vegetation biochemical characteristics in different spectral datasets. RFR randomly selected the feature number for each decision tree and determined the best feature number. Its hyperparameters mainly included the number of growing decision trees. Too few trees result in weak learning ability, so the number of trees was set to 100.
In order to compare and analyze the influence of the different ALAadj of wheat and soybean on the retrieval model, two kinds of algorithms considering the prior ALAadj knowledge, related to crop types or not, were designed and tested. First, the algorithm advocated in this study was used to select the corresponding ALAadj-based RFR model to carry out CCC retrieval for wheat and soybean, named the Prior-ALA model. Second, we designed three NonPrior-ALA models for the CCC retrieval of soybean and wheat, without considering the prior ALA knowledge corresponding to the differences in vegetation types. The three NonPrior-ALA models included: (a) model training only using PROSAIL-simulated data with the wheat ALAadj; (b) model training with the soybean ALAadj; (c) model training with both wheat and soybean ALAadj.
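The difference between the Prior-ALA and NonPrior-ALA designs comes down to which trained model is applied to a given sample. A minimal sketch of the Prior-ALA selection step follows; the ALAadj values are those reported in this paper, while `retrieve_ccc`, the dictionary layout, and the stand-in models are hypothetical placeholders for trained RFR models.

```python
# Crop-specific ALAadj priors (degrees) reported in this study.
ALA_ADJ_PRIOR = {"wheat": 62.0, "soybean": 45.0}

def retrieve_ccc(spectrum, crop_type, models_by_ala):
    """Apply the model trained on simulations that used the crop's ALAadj."""
    try:
        ala_adj = ALA_ADJ_PRIOR[crop_type]
    except KeyError:
        raise ValueError(f"no ALAadj prior for crop type {crop_type!r}") from None
    return models_by_ala[ala_adj](spectrum)

# Hypothetical stand-ins for two trained regressors (not real CCC models).
models = {
    62.0: lambda spectrum: 100.0 * sum(spectrum),
    45.0: lambda spectrum: 80.0 * sum(spectrum),
}

wheat_ccc = retrieve_ccc([0.1, 0.4], "wheat", models)
soybean_ccc = retrieve_ccc([0.1, 0.4], "soybean", models)
```

A NonPrior-ALA model, by contrast, would apply one fixed regressor to every sample regardless of crop type.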
Because different remote sensors sample different numbers of bands, the impact of band number on machine-learning retrieval of canopy chlorophyll was investigated and evaluated. Continuous spectrum data and discrete band data with different numbers of bands were used to test the CCC retrieval model. The continuous spectrum adopted the 50 narrow-band ground spectra within 400-890 nm that are related to chlorophyll, and the discrete bands followed the settings of the Envisat MERIS and Sentinel-3 OLCI bands. MERIS and OLCI are two important satellite sensors with red-edge bands, and their band settings largely overlap (Table 3). Referring to MERIS reflectance products, we selected all available bands (13 bands) after removing the three main atmospheric bands (aerosol: B1; oxygen: B11; water vapor: B15). The blue band is susceptible to residual aerosol effects [58], so we also selected fewer bands (nine bands) after removing the bands adjacent to blue. As given in Table 3, the three combinations of independent variables for the CCC estimation were: (1) all continuous reflectance (50 bands) within 400-890 nm; (2) the reflectance of all available bands in MERIS (13 bands): B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, B12, B13, and B14; (3) nine selected MERIS bands from the green to near-infrared regions: B5, B6, B7, B8, B9, B10, B12, B13, and B14. The simulated spectra were resampled using the MERIS spectral response function to obtain the 13-band and 9-band datasets. The unit of band center and band width is nm.

Validation Method
Using ground reflectance data, CCC retrieval for wheat and soybean was carried out, and the accuracy of this CCC inversion was verified against ground validation data. The accuracy of CCC retrieval was evaluated with the coefficient of determination (R²) between the estimated and ground-measured values, the root mean square error (RMSE), and the bias (Bias). In these metrics, ŷ represents the estimated CCC value, y represents the measured CCC value, ȳ represents the mean of estimated CCC values, and n represents the number of test samples.

Figure 3 shows the estimated importance of multiple vegetation factors for visible and NIR reflectance, based on random forest regression analysis, for both a homogeneous canopy and a heterogeneous canopy with mixed pixels.
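The three validation metrics named in the Validation Method section (R², RMSE, and Bias) can be sketched as follows. The exact formulas are not recoverable from this text, so the definitions here are assumptions: R² is taken as the squared Pearson correlation between estimated and measured values, and Bias as the mean of estimated minus measured values.

```python
import math

def validation_metrics(estimated, measured):
    """Return (R2, RMSE, Bias) for paired estimated and measured values."""
    if len(estimated) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(measured)
    mean_est = sum(estimated) / n
    mean_meas = sum(measured) / n
    cov = sum((e - mean_est) * (m - mean_meas) for e, m in zip(estimated, measured))
    var_est = sum((e - mean_est) ** 2 for e in estimated)
    var_meas = sum((m - mean_meas) ** 2 for m in measured)
    r2 = cov ** 2 / (var_est * var_meas)          # squared Pearson correlation
    rmse = math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured)) / n)
    bias = sum(e - m for e, m in zip(estimated, measured)) / n
    return r2, rmse, bias

# Estimates that are perfectly correlated but uniformly 1 unit too low.
r2, rmse, bias = validation_metrics([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The example illustrates why R² alone is not enough: a constant offset leaves R² at 1 while RMSE and Bias both reveal the error.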

Influences of Vegetation Variables on Canopy Reflectance
For the first case in which the PROSAIL model simulated the reflectance of a uniform canopy, LCC, LAI and ALA contributed significantly to the tested bands before 900 nm. Of the three most important factors, the LCC and LAI are the composition parameters of the CCC. The LCC has the highest contribution importance on the canopy reflectance, especially before 710 nm. The N value has a certain influence on the 560 nm and 710 nm bands, but less influence on the other bands. There is little influence of Cm on all the visible and NIR bands that were tested. Among the canopy parameters, the LAI is still the most important driving factor affecting canopy reflectance, followed by the ALA. The importance of soil reflectance changes is equivalent to the ALA in the 490 nm and 680 nm bands and is far lower than that of the LAI and ALA in other bands. The global sensitivity analysis results of the PROSAIL model reported in [29,34] showed that the LCC had a leading role before 710 nm and that LAI and ALA were the two main driving forces for the reflectance change, which is basically consistent with our variable importance measures results for the homogeneous canopy scenario. This shows that the variable importance analysis method of using random forest to evaluate the multi-factor contributions of canopy reflectance is also reliable.
In the case of vegetation-soil mixed pixels, which is a special non-uniform canopy scenario, the importance values of LCC and other leaf parameters for the tested bands are similar to those for the uniform canopy scenario. However, an obvious difference is that FCsoil becomes a new factor influencing canopy reflectance, and its influence in the 750 nm and 860 nm bands is greater than that of the ALA and second only to that of the LAI. In addition, the mixed pixel effect greatly enhances the influence of the ALA and Rsoil on visible and NIR reflectance. It is worth noting that Rsoil has a greater impact than the ALA on the low vegetation reflectance in the visible bands, while having a lower impact in the red-edge and NIR bands. The results of the RFR importance analysis of the mixed-pixel scenario show that the spectral contribution of the LAI was easily confused not only with the ALA factor but also with FCsoil. Therefore, the sensitivity of canopy reflectance to the ALA and other heterogeneous canopy factors, such as FCsoil, should be considered in CCC retrieval for crops that are usually planted in rows.

ALAadj Parameterization of the PROSAIL Model for Wheat and Soybean
An ALA optimization model was established by the random forest algorithm, based on PROSAIL-simulated data (Section 2.2). As seen in Figure 3, the reflectance of the 710-890 nm bands was selected for ALA optimization. Table 4 shows that the average values of wheat ALAadj obtained by PROSAIL optimization, based on the canopy spectra and the measured LAI with or without LCC, were very close (62.0 and 61.8 degrees, respectively), and the prior knowledge of wheat ALAadj for this study was set to 62 degrees. The soybean ALAadj values obtained from the two sets of data were 45.1 and 45.7 degrees, respectively. In this study, the prior knowledge of soybean ALAadj was set to 45 degrees. Whether or not LCC was involved, the ALAadj results for wheat and soybean were stable, and there was a clear distinction between the ALAadj of wheat and that of soybean.
Figure 4 provides the validation of CCC retrieval based on the ground-measured continuous spectra, with and without prior ALAadj knowledge. Figure 4a shows that the NonPrior-ALA_45 model, constructed from PROSAIL simulations with an ALAadj of 45 degrees, gave a good correlation between the estimated and measured CCC values of soybean, but the CCC of wheat was underestimated, especially when the CCC was relatively large. Figure 4b shows that the NonPrior-ALA_62 model produced a very good CCC estimation for wheat but overestimated the soybean CCC. Figure 4c shows that the NonPrior-ALA_45_62 model underestimated the CCC for wheat and overestimated the CCC for soybean.
Compared to the three models without prior ALA, the test results of the random forest model with a prior (Prior-ALA_45_62) (Figure 4d) showed no obvious overestimation of wheat or soybean, and its retrieval accuracy (R2 = 0.76, RMSE = 32.9 μg cm−2) was better than that of all three CCC models without prior ALAadj knowledge related to crop types (NonPrior-ALA_45:
Table 5 shows the performance differences of the CCC retrieval models for wheat and soybean. In the validation results for the wheat samples, the NonPrior-ALA_62 model, trained with the PROSAIL-simulated data using our proposed wheat ALAadj (62 degrees), showed higher CCC retrieval accuracy (RMSE = 37.9 μg cm−2) than the other two models (NonPrior-ALA_45: RMSE = 56.2 μg cm−2; NonPrior-ALA_45_62: RMSE = 39.4 μg cm−2). In the validation results for the soybean samples, the CCC accuracy (RMSE = 26.9 μg cm−2) of the NonPrior-ALA_45 model, corresponding to the proposed soybean ALAadj (45 degrees), was better than that of the other two (NonPrior-ALA_62: RMSE = 48.2 μg cm−2; NonPrior-ALA_45_62: RMSE = 40.3 μg cm−2). The difference in the wheat or soybean CCC RMSE values among the above three models was large, and it was caused not only by the change in R2 but also by the large variation in Bias. This shows that when the ALA of the PROSAIL simulations did not match the prior ALAadj knowledge corresponding to the test crop, the RFR algorithm produced a larger CCC retrieval error. For the combined test samples of the two crops, the Prior-ALA_45_62 model, which considered prior knowledge of crop-type-related ALAadj, obtained the best performance.
Figure 5 shows the validation and comparison results of the CCC retrieval models using the continuous and discrete bands. Without the prior knowledge of crop-type-related ALAadj, the continuous-band reflectance gave the highest CCC retrieval accuracy (Figure 5a). As the spectral resolution decreased (from 50 to 13 and nine bands), the accuracy of CCC retrieval was slightly reduced (Figure 5b,c). In the scenario where the prior ALAadj knowledge was included, the accuracy had a similar relationship with the number of input bands (Figure 5d-f). Figure 5 also indicates that, after considering the prior ALAadj knowledge, the CCC retrieval accuracy using the continuous and discrete spectra significantly increased. The RMSE of the improved CCC retrieval model using the 50-band continuous reflectance decreased by 7.0 μg cm−2 and the R2 increased by 0.09; the RMSE of the improved model using the 13-band discrete reflectance decreased by 10.2 μg cm−2 and the R2 increased by 0.11; the RMSE of the improved model using the nine-band discrete reflectance decreased by 8.1 μg cm−2 and the R2 increased by 0.10. With any of the three band combinations, the accuracy of the improved CCC retrieval models (RMSE: 32.9 to 37.0 μg cm−2; R2: 0.71 to 0.76) was higher than that of the models not involving crop-type-related ALAadj parameterization (RMSE: 39.9 to 45.1 μg cm−2; R2: 0.61 to 0.67). In other words, the influence of crop-type-related ALAadj parameterization on CCC retrieval is larger than that of the difference among the three band selections in this study.
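The statement that RMSE differences stem from both R2 and Bias can be illustrated with a small Python sketch of the validation statistics. It assumes that Bias is the mean of estimated minus measured CCC and that R2 is the squared Pearson correlation; both definitions, the function name and the sample values are assumptions made for illustration, not data from this study.

```python
import math


def validation_stats(measured, estimated):
    """Return bias, RMSE, error spread and squared Pearson correlation."""
    n = len(measured)
    errors = [e - m for m, e in zip(measured, estimated)]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    # Spread of the errors around the bias (population standard deviation).
    sd_error = math.sqrt(sum((err - bias) ** 2 for err in errors) / n)

    mean_m = sum(measured) / n
    mean_e = sum(estimated) / n
    s_me = sum((m - mean_m) * (e - mean_e) for m, e in zip(measured, estimated))
    s_mm = sum((m - mean_m) ** 2 for m in measured)
    s_ee = sum((e - mean_e) ** 2 for e in estimated)
    r2 = s_me ** 2 / (s_mm * s_ee)
    return {"bias": bias, "rmse": rmse, "sd_error": sd_error, "r2": r2}


# Hypothetical CCC values (ug cm-2): every estimate is 30 units too high.
measured = [50.0, 100.0, 150.0, 200.0, 250.0]
estimated = [m + 30.0 for m in measured]
stats = validation_stats(measured, estimated)
```

In this constructed case the estimates are perfectly correlated with the measurements, yet the RMSE equals the bias of 30 because RMSE² = Bias² + (error spread)². A model trained with a mismatched ALA can therefore show a large RMSE mainly through systematic over- or underestimation.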

Contribution of Our Proposed Method to PROSAIL-Based Retrieval of Crop Parameters
This study proposed an ALAadj parameterization scheme to improve the performance of the PROSAIL model in CCC estimation for different crops. The PROSAIL model simplifies the scene and parameters, making it straightforward and easy to use, but its uniform canopy assumption also limits its application to non-uniform canopies. For example, the PROSAIL-simulated spectrum using measured vegetation parameters is inconsistent with the simultaneously measured spectra [34]. This problem limits the hybrid retrieval of vegetation parameters based on PROSAIL simulations, as shown in Sections 3.3 and 3.4 of this paper. Establishing model parameterization schemes according to actual scenes or research objects is an important way to improve the simulation accuracy of vegetation RTMs [19,59,60]. For the PROSAIL model, the ALA is considered to be the structural parameter that has the greatest impact on reflectance, besides the LAI [31,61]. The LAI, which quantifies the amount of leaf area, cannot express the degree of canopy spatial heterogeneity, whereas the ALAadj can achieve a different canopy gap fraction under the same LAI (shown in Figure 2) and make the PROSAIL simulations adapt to a heterogeneous canopy. It has been shown that PROSAIL simulations with ALAadj can maintain better consistency with the measured spectra of wheat and maize [34]. Because of the mismatch between PROSAIL simulations and the actual canopy spectra of row-planted crops, errors in the crop CCC estimates occur when appropriate ALAadj parameterization is not considered in the PROSAIL-based hybrid model of CCC retrieval; in particular, the CCC estimation error in medium and high LAI samples is obvious (Figure 6a). In this study, the RFR algorithm of CCC estimation with ALAadj parameterization substantially alleviated the problem of large errors in medium and high LAI crop samples (Figure 6b).
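The link between ALA and canopy gap fraction under a fixed LAI can be sketched with a deliberately simplified model. The Python example below assumes a Beer-Lambert canopy in which all leaves share a single inclination angle with random azimuth, so that the nadir projection coefficient is cos(ALA). This is not the leaf inclination distribution used by PROSAIL, and the function names are hypothetical.

```python
import math


def nadir_gap_fraction(lai, ala_deg):
    """Gap fraction at nadir for leaves at one fixed inclination (simplification)."""
    return math.exp(-math.cos(math.radians(ala_deg)) * lai)


def ala_from_nadir_gap(lai, gap_fraction):
    """Invert the relation: the leaf angle that yields the observed nadir gap."""
    g = -math.log(gap_fraction) / lai
    if not 0.0 <= g <= 1.0:
        raise ValueError("gap fraction cannot be reached at this LAI")
    return math.degrees(math.acos(g))


lai = 3.0
gap_soybean = nadir_gap_fraction(lai, 45.0)   # ALAadj used for soybean in this study
gap_wheat = nadir_gap_fraction(lai, 62.0)     # ALAadj used for wheat in this study
recovered = ala_from_nadir_gap(lai, gap_wheat)
```

Under these assumptions, an LAI of 3 gives a nadir gap fraction of about 0.12 at 45 degrees and about 0.24 at 62 degrees, so a steeper ALA roughly doubles the visible background without any change in LAI. The inverse function mirrors the idea of ALAadj: given the LAI and the gap seen at nadir, an equivalent leaf angle can be assigned to a canopy whose gaps partly come from row structure.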
Among previous studies on RTM-based CCC retrieval, Xing [62] reported a best result for wheat CCC estimates obtained from PROSAIL-based regression models using VIs (wheat: RMSE = 104.66 μg cm−2), which is worse than the result (shown in Table 5) of our unimproved RFR model (wheat: RMSE = 39.4 μg cm−2) or improved RFR model (wheat: RMSE = 37.9 μg cm−2). This indicates that the advantages of advanced machine learning algorithms such as RFR in non-linear learning may play a role. Sun [63] used a PROSAIL-based RFR model with two VI inputs (MTCI and MTVI2) to obtain a combined wheat and soybean CCC retrieval result with an RMSE of 37.76 μg cm−2, which is similar to that of our unimproved RFR model using reflectance spectra (wheat and soybean: RMSE = 39.9 μg cm−2) but worse than that of our improved RFR model (wheat and soybean: RMSE = 32.9 μg cm−2). In addition, the accuracy of soybean CCC estimates from our improved physically based model (soybean: RMSE = 26.9 μg cm−2) is not as good as the best accuracy of soybean CCC estimates from the locally calibrated empirical model using CIred-edge (soybean: RMSE = 16 μg cm−2) reported by Clevers [64]. This is understandable because hybrid algorithms based on RTMs have strong generalization abilities, whereas empirical models using local calibration have advantages in local areas [32]. To the best of our knowledge, this is the first time that prior information of crop-type-related ALAadj has been involved in the PROSAIL-based retrieval of vegetation parameters. This study provides a new design idea for RTM-based machine learning algorithms for retrieving other parameters across different crop types. This study used a special heterogeneous canopy scenario (an ideal mixture of vegetation and soil) to show that the canopy spatial heterogeneity of row crops can have a great impact on the spectral reflectance related to CCC.
There are other characteristics of crop canopy structure that are not considered by the PROSAIL model, such as leaf and stem morphology [65,66]. They may cause spectral differences between different crops under the same CCC and affect the performance of a PROSAIL-based CCC retrieval model. These canopy characteristics differ greatly between crop types, which makes CCC retrieval across crop types rather difficult. This study aimed to address this problem and improved CCC retrieval for wheat and soybean by introducing ALAadj parameterization into a PROSAIL-based RFR algorithm. In addition, the growth dynamics of crops within the same type are accompanied by changes in the attributes of these canopy structure factors [66]. Our wheat case (shown in Figure S2) shows that variations in growth stages increased the difficulty of CCC modeling even for the same crop, although our proposed method improved the accuracy of wheat CCC retrieval (especially in terms of Bias). Based on more crop structure measurements and advanced 3D canopy models, it will be very valuable in the future to quantify the impact of LAI, ALA, leaf morphology, stem composition, and other canopy heterogeneity factors on CCC retrieval across crop types.

Reliability and Limitation of the ALAadj Parameterization Associated with Crop Types
A feasible scheme of ALAadj parameterization for the PROSAIL model is crucial to improving the hybrid machine learning algorithms for CCC estimation. In addition to the spectral data, the LAI provides the necessary measured data for an ALAadj parameterization scheme. This is because the LAI and ALA have overlapping effects on reflectance, and they are difficult to separate by spectral retrieval [19]. Our research found that ALAadj calculation based on the measured LAI and canopy spectrum can obtain stable results, regardless of whether chlorophyll is involved or not (as shown in Table 4). In addition, LAI samples are easier to obtain than LCC samples in the field. Consequently, the use of the canopy spectrum and measured LAI is also a practical and feasible ALAadj parameterization scheme.
ALAadj, which contains canopy ALA information as well as other canopy structure information, can reflect the differences in canopy structure between crops; this was confirmed by our parameterized ALAadj results, which showed obvious differences between wheat and soybean. Our ALAadj parameterization result for the wheat samples (ALAadj = 62 degrees) was similar to the wheat result (ALAadj = 60.5 degrees) reported in the literature [34]. The wheat ALA values were reported to be generally greater than the soybean ALA [34,53,67], which is consistent with our own ALAadj comparison between wheat and soybean. However, the correlation between ALAadj and ALA was found to be not very strong [34], which shows that ALAadj contains not only ALA information but also canopy structure information other than LAI and ALA. The plant types of different crops clearly differ, whereas the planting structures of the same crop are similar. This results in more consistent ALAadj values within a crop type and larger differences in ALAadj between crop types, although ALA may vary with growth stage or variety for a given crop type [57,61]. Therefore, ALAadj related to crop types can serve as effective prior knowledge in CCC retrieval models across different crop types.
The ALAadj obtained for nadir observation in this study may not be suitable for all observation angles. Using our ALAadj parameterization model for nadir observation and PROSAIL-simulated data with a fixed ALA of 45 degrees, the influence of different view zenith angles (VZAs) on the ALAadj parameterization is shown in Figure 7. Because the PROSAIL-simulated spectra correspond to a homogeneous canopy, the ALA input of 45 degrees should be the true ALAadj value in this case. The ALAadj values obtained close to the nadir are more accurate than those far from the nadir, which implies that the extraction and application of prior ALAadj knowledge must consider VZA matching. To solve the problem of ALAadj application with large-angle observation, one possible approach is to generate a prior knowledge matrix of ALAadj related to both crop types and VZAs.
Verhoef's leaf inclination distribution mode was used for ALA optimization in this study. There are other leaf inclination distribution modes, and the choice of mode has a certain influence on canopy spectra simulation [55]. Therefore, the leaf inclination distribution modes used in the acquisition and application of crop-type-related ALAadj with the PROSAIL model need to be the same. This study only tested the cases of wheat and soybean. In future investigations, for more crops with a complex canopy structure, it is necessary to carry out new and more precise tests of ALAadj parameterization for the PROSAIL model.
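One possible shape for the prior knowledge matrix of ALAadj mentioned above is sketched below in Python. Only the nadir entries (62 degrees for wheat, 45 degrees for soybean) come from this study; the table layout, the function name and the 5-degree matching tolerance are hypothetical choices, and entries for other VZAs would have to be derived separately.

```python
# Prior ALAadj (degrees), indexed by crop type and view zenith angle (degrees).
# Only the nadir (VZA = 0) values are taken from this study.
ALA_ADJ_PRIOR = {
    "wheat": {0.0: 62.0},
    "soybean": {0.0: 45.0},
}


def lookup_ala_adj(crop, vza_deg, max_vza_gap=5.0):
    """Return the prior for the tabulated VZA nearest to vza_deg.

    Raises an error rather than extrapolating when no tabulated VZA lies
    within max_vza_gap degrees (a hypothetical tolerance).
    """
    if crop not in ALA_ADJ_PRIOR:
        raise KeyError(f"no ALAadj prior for crop {crop!r}")
    by_vza = ALA_ADJ_PRIOR[crop]
    nearest = min(by_vza, key=lambda v: abs(v - vza_deg))
    if abs(nearest - vza_deg) > max_vza_gap:
        raise ValueError(
            f"no ALAadj prior for {crop!r} within {max_vza_gap} degrees of VZA {vza_deg}"
        )
    return by_vza[nearest]


wheat_prior = lookup_ala_adj("wheat", 0.0)
soybean_prior = lookup_ala_adj("soybean", 3.0)
```

Refusing to return a value for an unmatched VZA, instead of silently reusing the nadir prior, follows the observation that ALAadj obtained near the nadir becomes less accurate at larger view angles.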

Prospects of Our Proposed Method of Crop CCC Estimation Based on Satellite Data
In addition to prior ALAadj knowledge for the different crop types, satellite-scale CCC retrieval also requires the spatial distribution of crop types to determine the ALAadj value. Vegetation-type mapping products are steadily improving [68][69][70], so the wider use of the hybrid vegetation retrieval model with ALAadj parameterization is likely to receive better support.
In this study, the improved method was based on the ground reflectance spectrum. In addition to continuous reflectance (50 bands), we also tested discrete band combinations (13 and nine bands) with reference to MERIS and Sentinel satellite data. It was found that the performance of the discrete spectrum was worse than that of the continuous spectrum, which is similar to other parameterization studies [71,72]. However, the reduction in accuracy with the discrete bands was small and acceptable (Figure 5), which shows that the method has the potential to be extended to discrete satellite remote-sensing data.
In addition, there are some uncertainties in actual satellite data, including remote sensor hardware specifications, atmospheric correction processing [73], and complex ground mixing [74]. Future research can use Sentinel-2, Sentinel-3, and other satellite data, together with synchronous ground measurements, to estimate and validate canopy chlorophyll content and to further verify the applicability of the proposed algorithm at the satellite scale.

Conclusions
In this study, through the evaluation of the variable importance of canopy reflectance based on the random forest algorithm, it was found that the PROSAIL model had difficulty in expressing a spatially heterogeneous canopy with vegetation-soil mixed effects caused by crop-row planting. Given that the ALA input of the PROSAIL model is related to the gap of the canopy, ALA adjustment based on measured spectra and non-ALA parameters can take into account the influence of the ALA and canopy spatial heterogeneity on the crop reflectance in a specific observation direction. In view of the limitations of the PROSAIL model in expressing the complexity of vegetation structure, this study proposed an RFR algorithm of crop CCC retrieval that is trained with PROSAIL simulations and prior ALAadj knowledge. The applicability of the algorithm was tested using measured wheat and soybean data. The results show that the parameterized values of ALAadj (62 degrees for wheat; 45 degrees for soybean), fitted based on ground-measured data, showed good discrimination between the two crop types. The CCC retrieval accuracy (RMSE from 39.9 to 32.9 μg cm−2; R2 from 0.67 to 0.76) of wheat and soybean ground continuous spectra (50 bands) was improved by integrating the prior knowledge of wheat and soybean ALAadj parameterization. In addition, this method also performed well under discrete reflectance data (13 bands: RMSE from 43.9 to 33.7 μg cm−2; R2 from 0.63 to 0.74; nine bands: RMSE from 45.1 to 37.0 μg cm−2; R2 from 0.61 to 0.71). The findings reveal that the ALAadj parameterization for the PROSAIL model helps to improve the performance of RFR hybrid algorithms of CCC retrieval across crop types.
With the support of crop type (survey or satellite) products, our proposed CCC retrieval algorithm involving ALAadj parameterization related to crop types will offer the beneficial potential of application on a large scale, as well as providing a good reference for an RTM-based hybrid algorithm for other crop parameter estimations.

Supplementary Materials:
The following supporting information can be downloaded at: www.mdpi.com/article/10.3390/rs14010098/s1, Figure S1: Five soil spectra from light to dark used in the PROSAIL simulations, Figure S2: Accuracy of CCC retrieval characterized by different growth stages for wheat using ground-measured continuous spectra (a) without and (b) with considering prior ALAadj knowledge related to crop types.