Corn Biomass Estimation by Integrating Remote Sensing and Long-Term Observation Data Based on Machine Learning Techniques

The accurate and timely estimation of regional crop biomass at different growth stages is of great importance in guiding crop management decision making. The recent availability of long time series of remote sensing data offers opportunities for crop monitoring. In this paper, four machine learning models, namely random forest (RF), support vector machine (SVM), artificial neural network (ANN), and extreme gradient boosting (XGBoost) were adopted to estimate the seasonal corn biomass based on field observation data and moderate resolution imaging spectroradiometer (MODIS) reflectance data from 2012 to 2019 in the middle reaches of the Heihe River basin, China. Nine variables were selected with the forward feature selection approach from among twenty-seven variables potentially influencing corn biomass: soil-adjusted total vegetation index (SATVI), green ratio vegetation index (GRVI), Nadir_B7 (2105–2155 nm), Nadir_B6 (1628–1652 nm), land surface water index (LSWI), normalized difference vegetation index (NDVI), Nadir_B4 (545–565 nm), and Nadir_B3 (459–479 nm). The results indicated that the corn biomass was suitably estimated (the coefficient of determination (R2) was between 0.72 and 0.78) with the four machine learning models. The XGBoost model performed better than the other three models (R2 = 0.78, root mean squared error (RMSE) = 2.86 t/ha and mean absolute error (MAE) = 1.86 t/ha). Moreover, the RF model was an effective method (R2 = 0.77, RMSE = 2.91 t/ha and MAE = 1.91 t/ha), with a performance comparable to that of the XGBoost model. This study provides a reference for estimating crop biomass from MOD43A4 datasets. In addition, the research demonstrates the potential of machine learning techniques to achieve a relatively accurate estimation of daily corn biomass at a large scale.


Introduction
Crop biomass is one of the most important biophysical indicators of crop growth [1,2]. The accurate and efficient estimation of the regional crop biomass at different growth stages is of great importance in guiding crop management decision making, such as in effective fertilization, water irrigation, weeding, and pest and disease management [3,4]. In addition, as an important aspect of biomass energy, the accurate estimation of crop biomass contributes to the rational and effective utilization of biomass energy [5]. Moreover, effective estimates of crop biomass during the growing season could play a key role in crop yield prediction [6]. This is currently one of the major challenges for agricultural researchers and farm managers [7].
The traditional crop biomass estimation method is mainly based on field measurements, which include destructive field sampling, laboratory drying and weighting. The traditional methods are relatively accurate, but they are labor intensive and time consuming. Moreover, the sparse nature of space makes it impossible to provide a comprehensive understanding at regional or larger scales [2,[8][9][10]. Thus, it is impossible to acquire the regional crop biomass via ground measurements alone. The satellite remote sensing technique is a unique and useful method for crop biomass monitoring in a repeatable manner due to the valuable information it provides on vegetation parameters, high spatial coverage and long time series, which effectively compensates for the deficiency of traditional biomass estimation methods [2,8,[11][12][13]. Previous studies have revealed that remote sensing is a reliable and effective technique to obtain biophysical and biochemical crop information [6,[14][15][16][17]. For example, the moderate resolution imaging spectroradiometer (MODIS) of the Earth Observation System (EOS) instrument provides long time series observations at a spatial resolution ranging from 250 to 1000 m in multiple spectral bands at the visible to shortwave infrared (SWIR) wavelengths, with a global coverage of one to two days, which has been widely applied in studies on the variation in vegetation parameters at the regional and global scales [18][19][20].
Various methods have been developed for crop biomass estimation based on optical remote sensing data, such as statistical, process-based, machine learning, and other methods [21]. Machine learning techniques have become a basic and effective way to model and extract patterns from remote sensing data due to their high computational efficiency, few required variables, and reliable results [14,[21][22][23]. Machine learning techniques, including artificial neural networks (ANNs) [24], random forests (RFs) [25][26][27], support vector machines (SVMs) [28,29], and extreme gradient boosting (XGBoost) [30][31][32] have been widely implemented to evaluate vegetation biomass based on spectral signature parameters [1,17,33,34]. Although great progress has been attained in improving the spatially explicit estimation of crop biomass with machine learning methods based on multisource data, regional crop biomass estimates still exhibit large uncertainties at local scales [6,21,35]. There are several sources of uncertainty in crop biomass estimation. The first limitation is the lack of adequate field observation data, which may easily lead to the model becoming overfitted and failing to capture local features. The second limitation is that there is a need for the further comparison of available techniques, particularly machine learning techniques, for crop biomass estimation. Comparisons of these machine learning techniques have shown that each has its own advantages and drawbacks [11,17,22,30,33]. The third uncertainty originates from the input variable selection process according to the remote sensing data during the simulation process. Crop biomass was highly correlated with spectral signature parameters derived from satellite imagery, such as vegetation indices, which were calculated from different integrations of visible and NIR reflectance, leaf area, and reflectance for certain bands [1,6,10,11,17]. The input variable selection was based on its sensitivity to vegetation biomass. The input variables varied between studies, which made it difficult to analyze the uncertainty of the estimation biomass for machine learning models.
In this study, the corn biomass was estimated based on daily MODIS datasets with a 500 m spatial resolution combined with long-term ground observation data in the middle reaches of the Heihe River Basin, and four machine learning algorithms, namely RF, SVM, ANN, and XGBoost, were adopted to predict the biomass. The objectives of this study were (1) to validate the potential of MOD43A4 datasets with regard to corn biomass estimation; (2) to compare the performance of the RF, SVM, ANN, and XGBoost models in the estimation of corn biomass and examine suitable models; and (3) to generate an optimized framework for MODIS-based corn biomass estimation with high accuracy.

Materials and Methods
In this study, we selected the middle area of the Heihe River basin as the research area to carry out corn biomass estimation based on machine learning models. Figure 1 shows a schematic diagram of the steps and processes followed in the research. Overall, there are four major components: data and variable preparation, model parameter optimization, shows a schematic diagram of the steps and processes followed in the research. Overall, there are four major components: data and variable preparation, model parameter optimization, model performance assessment and comparison, and model application based on MODIS data at the country level. The following is a detailed description of each component.

Study Sites
The study area is situated in the middle reaches of the Heihe River basin, Gansu Province, China ( Figure 2). The Heihe River basin, located in the middle of the Hexi Corridor, is the second largest inland river basin in China. It consists of three regions: the upper mountain zone, middle oasis zone and lower arid zone. The middle reaches of the Heihe River basin cover an area of approximately 8700 km 2 and are an important food base in Northwest China. The climate is continental, with an annual average temperature of 7.5 °C [36]. The annual rainfall ranges from 100 to 250 mm [37], and the annual potential evaporation reaches 1200-1800 mm [38]. Seed corn is the most important commercial crop in the middle reaches of the Heihe River. With the increase in seed corn sowing area, the study area has become an important seed corn-producing area in China [36]. It has been identified as the national seed corn production base since 2013.

Study Sites
The study area is situated in the middle reaches of the Heihe River basin, Gansu Province, China ( Figure 2). The Heihe River basin, located in the middle of the Hexi Corridor, is the second largest inland river basin in China. It consists of three regions: the upper mountain zone, middle oasis zone and lower arid zone. The middle reaches of the Heihe River basin cover an area of approximately 8700 km 2 and are an important food base in Northwest China. The climate is continental, with an annual average temperature of 7.5 • C [36]. The annual rainfall ranges from 100 to 250 mm [37], and the annual potential evaporation reaches 1200-1800 mm [38]. Seed corn is the most important commercial crop in the middle reaches of the Heihe River. With the increase in seed corn sowing area, the study area has become an important seed corn-producing area in China [36]. It has been identified as the national seed corn production base since 2013.

Field Data
Large, relatively homogeneous areas, which mainly functioned as corn growing areas, were selected as the sample sites. According to a field survey of the sample sites, a total of sixteen sites were selected in 2012, and three sites were selected for field measurement purposes every year from 2013 to 2019. One to three plots for each sample site were chosen for further biomass observation, with each plot covering an area of 100 m × 100 m. The number of plots for each sample site was determined by the planting structure of corn at the site. For each plot, three corn plants representing the level of the plot were randomly selected. All selected plants in each plot were collected, and the roots of the corn plants were washed. The fresh weight was measured, after which the plants were transported to the laboratory, chopped into easy-to-dry segments, and dried in an oven at 85 • C for 48-72 h until a constant dry biomass was obtained. The average biomass value of the plots was considered to represent the biomass at the site scale. In situ biomass measurements were implemented from 2012 to 2019 during the growing season of corn. The observations occurred from mid-May to late September each year, i.e., from the beginning of corn growth to corn harvest, over a five-day period during the rapid growth period (before August) and over a ten-day period for days thereafter. The beginning observation date differed each year because the corn sowing time continuously changed in the study area. The number of corn plants per unit area was calculated by combining various structural parameters of corn planting, such as the plant spacing, row spacing and uplift spacing.

Field Data
Large, relatively homogeneous areas, which mainly functioned as corn growing areas, were selected as the sample sites. According to a field survey of the sample sites, a total of sixteen sites were selected in 2012, and three sites were selected for field measurement purposes every year from 2013 to 2019. One to three plots for each sample site were chosen for further biomass observation, with each plot covering an area of 100 m × 100 m. The number of plots for each sample site was determined by the planting structure of corn at the site. For each plot, three corn plants representing the level of the plot were randomly selected. All selected plants in each plot were collected, and the roots of the corn plants were washed. The fresh weight was measured, after which the plants were transported to the laboratory, chopped into easy-to-dry segments, and dried in an oven at 85°C for 48-72 h until a constant dry biomass was obtained. The average biomass value of the plots was considered to represent the biomass at the site scale. In situ biomass measurements were implemented from 2012 to 2019 during the growing season of corn. The observations occurred from mid-May to late September each year, i.e., from the beginning of corn growth to corn harvest, over a five-day period during the rapid growth period (before August) and over a ten-day period for days thereafter. The beginning observation date differed each year because the corn sowing time continuously changed in the study area. The number of corn plants per unit area was calculated by combining various structural parameters of corn planting, such as the plant spacing, row spacing and uplift spacing.

Remote Sensing Data and Processing
MODIS MCD43A4 and MOD15A2 products from May to September from 2012 to 2019 were analyzed in this study. MCD43A4 provides daily 500-m reflectance data adjusted via the nadir bidirectional reflectance distribution function-adjusted reflectance retrieved from combined MODIS-Terra and MODIS-Aqua acquisitions. Since the product removes land surface reflectance anisotropy and provides similar observations to those of the nadir view angle, it is a more stable and consistent product [39]. MCD43A4 has been adopted to monitor vegetation biomass and land surface disturbance due to its improved accuracy of change detection in intra-and interannual temporal dynamics [2,40]. MOD15A2 provides 8-day composite 1 km resolution leaf area index (LAI) data from Terra. The MODIS reprojection tool (MRT) was employed to reproject and resample MODIS images.

Land Cover Data
The land cover map of the Heihe River Basin of August 2015 was used in this study to generate a corn map of the study area [41][42][43] (Figure 2). Previous studies have demonstrated that the main planting area in the study area is seed corn [36,44,45], and the crop of seed corn is more than three times that of field corn [44]. It is difficult to distinguish seed and field corn planting areas in land cover data. Therefore, we considered both field corn and seed corn as corn in this study.

Model Variables and Optimization
Twenty-seven variables, including the LAI derived from the MODIS MOD15A2 product, seven MCD43A4 bands (Table 1), and nineteen spectral vegetation indices (Table 2) based on the visible and near-infrared (NIR) bands of MCD43A4 related to the vegetation biomass were adopted in this study. We selected those variables based on their sensitivity to the vegetation structure and biomass according to previous studies [1,3,6,31,34,40]. For example, the normalized difference vegetation index (NDVI), normalized difference water index (NDWI), enhanced vegetation index (EVI), and soil-adjusted vegetation index (SAVI) have been successfully used for biomass accumulation research because they are strongly correlated with vegetation biophysical parameters [40]. The LAI was sensitive for crop biomass estimation [1]. Furthermore, additional indices that were modified based on the NDVI with different integrations of visible and NIR reflectance were used, such as the green normalized difference vegetation index (GNDVI), blue normalized difference vegetation index (BNDVI), and wide dynamic range vegetation index (WDRVI). Additionally, the band of visible and NIR reflectance, which are relatively sensitive to vegetation biomass, were frequently used in biomass studies [26,31]. The forward feature selection method was implemented to choose the most influential variables, which is an iterative method starting from the condition of no model features. In each iteration, we add a given feature that best improves the model until the addition of a new variable does not further improve the model performance.

Vegetation Indices Abbreviation Formula Reference
Red and Blue Normalized Difference Red, Blue and Green Normalized Difference [60]

Machine Learning Algorithms
The RF, SVM, ANN, and XGBoost algorithms were adopted in this study. With regard to the RF algorithm, only two parameters are tuned: ntree and mtry. Parameter mtry is the number of splits per node in each tree during the building process, and ntree is the number of decision trees or the number of bootstrap samples [61]. The accuracy of the model is mainly influenced by the value of mtry [30,62]. Therefore, we set ntree as the default value of 500.
The SVM algorithm is largely based on the structural risk minimization principle and statistical theory [63,64]. The choice of the positive definite kernel function is very important in this method [62,65]. Moreover, with regard to the SVM algorithm, the cost factor and gamma affect the punishment imposed for sample misclassification and algorithm complexity [63]. In this study, the radial kernel function was chosen, and it includes the sigma and c parameters.
The ANN includes one input layer, one output layer and one or multiple hidden layers. Neural networks learn the relationship between the input and output variables via the establishment of a set of nonlinear units, which are organized into layers and connected by weights with deviations equivalent to those in the regression parameters of classical parametric models [11]. In this study, the Bayesian regularized neural network (BRNN) was employed, which is one type of ANN widely applied in prediction research [66]. The BRNN fits a two-layer neural network and uses the Nguyen and Widrow algorithm to assign initial weights and the Gauss-Newton algorithm to perform optimization. One parameter of the BRNN function is considered here, namely the number of neurons.
XGBoost performs a second-order Taylor expansion of the objective function and employs the second derivative to accelerate the model convergence speed during training [67]. In addition, a regularization term is added to the objective function to control the tree complexity, which generates a relatively simple model and prevents overfitting. The XGBoost algorithm contains many parameters, and the most important parameters in this study are: (1) max_depth, which is the maximum depth of an individual tree; (2) nrounds, which is the maximum number of iterations; (3) eta, which reduces the feature weights to ensure that the boosting process becomes more conservative; (4) gamma, which is the minimum loss reduction required to further partition a given leaf node of the tree; (5) subsample, which is the subsampling ratio of the training instances or rows; (6) colsample_bytree, which is the subsampling ratio of the columns during tree construction; (7) rate_drop, which is the fraction of previous trees to eliminate during the dropout procedure; (8) skip_drop, which is the probability of skipping the dropout procedure in a given boosting iteration; and (9) min_child_weight, which is the minimum sum of the instance weight needed in a leaf node. XGBoost model tuning is a complicated task because changing any one parameter affects the optimal values of the other parameters [30].
R software was used to implement the above four machine learning algorithms with the randomForest, kernlab, brnn, and xgboost packages. In this study, the caret package in R software was applied to tune the parameters of the above four machine learning algorithms. The caret package comprises a set of functions to streamline the process of creating prediction models. It contains tools for data splitting, preprocessing, feature selection, model tuning by resampling and variable importance estimation. The root mean squared error (RMSE), coefficient of determination (R 2 ) and mean absolute error (MAE) were calculated during parameter tuning, and the minimum RMSE value was considered to select the optimal model. According to the optimal models determined in this study, the parameter mtry of the RF algorithm was set to 9, and sigma and C were set to 2.1 and 1, respectively, in the SVM algorithm, the number of neurons was set to 3 in the BRNN function of the ANN algorithm, and the max_depth, nrounds, eta, gamma, subsample, colsample_bytree, rate_drop, skip_drop, and min_child_weight parameters of the XGBoost algorithm were set to 3, 100, 0.4, 0, 1, 0.6, 0.5, 0.95, and 1, respectively.

Model Evaluation
The 10-fold cross validation method was implemented to evaluate the performance of the above four machine learning models due to the small number of training samples. According to the 10-fold cross validation method, the field observations were divided into 10 equal groups; nine groups were selected as training samples with which to build the corresponding regression model, and the remaining group was used as the validation sample to evaluate the trained model. R 2 (Equation (1)), RMSE (Equation (2)), and MAE (Equation (3)) were determined to quantify the performance of the above four models. A high R 2 value, low RMSE value and low MAE value indicate good model performance: whereŷ i is the predicted biomass value, y i is the ground-measured biomass value, y i is the mean value of the ground-measured biomass, and n is the number of samples.

Importance of Variable Optimization for Machine Learning Model Performance
The selection of suitable variables is a critical step in the development of a biomass estimation model because certain variables may be weakly correlated with the biomass, which may reduce the biomass estimation performance [35]. In this study, nine out of the twenty-seven variables were selected according to the forward feature selection method: SATVI, GRVI, Nadir_B7, Nadir_B6, LSWI, NDVI, Nadir_B4, Nadir_B3, and RVI. Figure 3 shows the relative correlation between the corn biomass and each of the nine optimal variables. The results indicated that all nine variables were significantly correlated with the corn biomass, and GRVI, LSWI, NDVI, and RVI were positively correlated with the biomass, while the remaining five variables were negatively correlated with the biomass. The correlation coefficients R of Nadir_B3, Nadir_B4, and NDVI were higher than 0.70, and the correlation coefficients R of the other variables were all above 0.60.

Model Performance and Accuracy Comparison
The performance of the four machine learning algorithms was further analyzed based on the nine optimal variables. The model performance was explained via scatter plots. Figure 4 shows the relationship between the observed and predicted biomass values. The results indicated that the XGBoost (R 2 = 0.78, and RMSE = 2.86 t/ha) and RF (R 2 = 0.77, and RMSE = 2.91 t/ha) models performed better than the SVM (R 2 = 0.72, and RMSE = 3.20 t/ha) and ANN (R 2 = 0.72, and RMSE = 3.20 t/ha) models with the same dataset, and the XGBoost model performed slightly better than the RF model because it attained the highest R 2 and lowest RMSE values. shows the relative correlation between the corn biomass and each of the nine optimal variables. The results indicated that all nine variables were significantly correlated with the corn biomass, and GRVI, LSWI, NDVI, and RVI were positively correlated with the biomass, while the remaining five variables were negatively correlated with the biomass. The correlation coefficients R of Nadir_B3, Nadir_B4, and NDVI were higher than 0.70, and the correlation coefficients R of the other variables were all above 0.60. Figure 3. Correlation matrix between the nine variables (SATVI, GRVI, Nadir_B7, Nadir_B6, LSWI, NDVI, Nadir_B4, Nadir_B3, and RVI) and the corn biomass. *** indicates regression significance at a p-value < 0.001. Figure 3. Correlation matrix between the nine variables (SATVI, GRVI, Nadir_B7, Nadir_B6, LSWI, NDVI, Nadir_B4, Nadir_B3, and RVI) and the corn biomass. *** indicates regression significance at a p-value < 0.001.
In addition, based on R 2 , MAE, and RMSE, as shown in Figure 5, we also found that the R 2 values followed the order of XGBoost > RF > ANN > SVM, the RMSE values followed the sequence of XGBoost < RF < ANN < SVM, and the MAE values followed the order of XGBoost < RF < ANN < SVM, which indicated that the XGBoost model performed better than the other three models, followed by the RF model. Moreover, there was little difference in R 2 , MAE, and RMSE between the RF and XGBoost models. This verified that the RF model was also an effective model, with the XGBoost model performing slightly better than the RF model. The SVM performed worse than the other models in most situations.

Visual Representation of the Spatiotemporal Characteristics of the MODIS-Estimated Biomass
To investigate the temporal performance of the RF, SVM, ANN and XGBoost models, a relatively homogeneous MCD43A4 pixel with long-term ground observation data was chosen to analyze the differences between the estimated and ground-observed biomass data. Figure 6 shows a temporal comparison of the ground-observed and estimated biomass data between the RF, SVM, ANN, and XGBoost models from 2012 to 2019. According to the visual assessment of the fitted time series curves (Figure 6), the estimated biomass trend curves of the four models were similar to the ground-observed biomass curves at most times. However, obvious differences were observed between the estimated and fieldobserved biomass values. Different degrees of biomass overestimation or underestimation occurred among the four models, especially after the tasseling stage, and the corn biomass suddenly decreased and then gradually increased (Figure 6a,e,g). The SVM model tended to overestimate the biomass during early corn growth (Figure 6c,d,f,h), while the data estimated with the other three models were very close to the observed data at this stage. It was difficult to distinguish the performance of the RF, ANN and XGBoost models based on the estimated data curves. In 2018, all four models tended to underestimate the corn biomass after mid-July (Figure 6g). The main reason is that the variety of corn in the observation area changed from seed corn to field corn, and the phenological periods of the two types of corn were significantly different. Unlike the field corn, seed corn is emasculated in mid-July, and then male-removed in early August. This is also the reason for the slower above ground biomass growth or sudden decrease in this period for most study years.
To examine the seasonal changes in the spatial distribution of the corn biomass, the biomass data estimated with the RF, SVM, ANN and XGBoost models in the study area were visualized. Figure 7 shows the seasonal differences in the spatial distribution of the four models based on the MCD43A4 data on the first day of June, July, August, and September 2019. Obvious differences occurred between the four models. For example, the estimated values of the SVM model in some regions in June were notably higher than those of the other three models, and the areas indicating a biomass between 3 and 5 t/ha of the SVM model were much larger than those of the RF, ANN, and XGBoost models. Moreover, the estimated values of the ANN model in certain regions in July and August were notably lower than those of the other three models, and the areas indicating a biomass between 9 and 12 t/ha of the ANN model were obviously smaller than those of the RF, SVM and XGBoost models.

Model Performance and Accuracy Comparison
The performance of the four machine learning algorithms was further analyzed based on the nine optimal variables. The model performance was explained via scatter plots. Figure 4 shows the relationship between the observed and predicted biomass values. The results indicated that the XGBoost (R 2 = 0.78, and RMSE = 2.86 t/ha) and RF (R 2 = 0.77, and RMSE = 2.91 t/ha) models performed better than the SVM (R 2 = 0.72, and RMSE = 3.20 t/ha) and ANN (R 2 = 0.72, and RMSE = 3.20 t/ha) models with the same dataset, and the XGBoost model performed slightly better than the RF model because it attained the highest R 2 and lowest RMSE values. In addition, based on R 2 , MAE, and RMSE, as shown in Figure 5, we also found that the R 2 values followed the order of XGBoost > RF > ANN > SVM, the RMSE values followed the sequence of XGBoost < RF < ANN < SVM, and the MAE values followed the order of XGBoost < RF < ANN < SVM, which indicated that the XGBoost model performed

Spatial Distribution of the Simulated Biomass with the Highest-Performance Models
Based on the accuracy comparison of R 2 , RMSE, and MAE, the XGBoost and RF models were chosen as the two best performing models to carry out further spatial distribution analysis. Figure 8 shows the interannual differences in the spatial distribution of the simulated corn biomass obtained with the two best performing models based on the MCD43A4 data on the first day of August from 2012 to 2019. As shown in Figure 8, the XGBoost and RF models exhibited similar spatial distribution characteristics, and the corn biomass value was high in the southeast and moderate and low in the west. Moreover, it was found that the spatial distribution of the estimated biomass by the two models revealed similar interannual changes in certain years. For example, the areas indicating a biomass ranging from 5 to 7 t/ha and 7 to 9 t/ha in 2014 and 2015 were larger than those in the other years, while the areas indicating a biomass over 9 t/ha were much smaller in the same years for both models. However, there were differences in the areas of different predicted biomass ranges between the two models. The percentage of low-and high-biomass areas for the XGBoost model was higher than that for the RF model, while the percentages of moderate biomass areas for the RF model were higher than those for the XGBoost model during the study period. For example, the percentages of the areas indicating a biomass lower than 5 t/ha and over 12 t/ha for the XGBoost model were higher than those for the RF model, while the opposite was true for the areas indicating a biomass between 5 and 9 t/ha. Based on the best-performing model-XGBoost-we obtained the spatial characteristics of the annual variation in corn maximum biomass for the study area ( Figure 9). Overall, the corn biomass was higher in the eastern and central parts than in the western part of the study area. The average biomass was 13.66-15.40 t/ha (the highest was in 2016, and the lowest was in 2013). The highest biomass (>16 t/ha) was concentrated in the central part of the study area where irrigation water resources are relatively abundant. The lowest biomass (<12 t/ha) was distributed in the western part of the study area.
Remote Sens. 2021, 13, x FOR PEER REVIEW 10 of 26 difference in R 2 , MAE, and RMSE between the RF and XGBoost models. This verified that the RF model was also an effective model, with the XGBoost model performing slightly better than the RF model. The SVM performed worse than the other models in most situations.

Visual Representation of the Spatiotemporal Characteristics of the MODIS-Estimated Biomass
To investigate the temporal performance of the RF, SVM, ANN and XGBoost models,   estimated values of the SVM model in some regions in June were notably higher than those of the other three models, and the areas indicating a biomass between 3 and 5 t/ha of the SVM model were much larger than those of the RF, ANN, and XGBoost models. Moreover, the estimated values of the ANN model in certain regions in July and August were notably lower than those of the other three models, and the areas indicating a biomass between 9 and 12 t/ha of the ANN model were obviously smaller than those of the RF, SVM and XGBoost models. (a)

Spatial Distribution of the Simulated Biomass with the Highest-Performance Models
Based on the accuracy comparison of R 2 , RMSE, and MAE, the XGBoost and RF models were chosen as the two best performing models to carry out further spatial distribution    Figure 10 shows the scatter plots of the ground-observed biomass and estimated biomass for the RF, SVM, ANN and XGBoost models based on all twenty-seven variables. Compared with Figure 4, it was found that the precision of the results after variable optimization was greatly improved for all four models. The R 2 value for both the RF and XGBoost models was 0.75 before variable optimization but increased to 0.77 and 0.78, respectively, after variable optimization. In addition, RMSE decreased from 3.03 to 2.901 t/ha and from 3.00 to 2.86 t/ha for the RF and XGBoost models, respectively. Similarly, for the ANN and SVM models, R 2 increased and RMSE decreased to different degrees.  Figure 10 shows the scatter plots of the ground-observed biomass and estimated biomass for the RF, SVM, ANN and XGBoost models based on all twenty-seven variables. Compared with Figure 4, it was found that the precision of the results after variable optimization was greatly improved for all four models. The R 2 value for both the RF and XGBoost models was 0.75 before variable optimization but increased to 0.77 and 0.78, respectively, after variable optimization. In addition, RMSE decreased from 3.03 to 2.901 t/ha and from 3.00 to 2.86 t/ha for the RF and XGBoost models, respectively. Similarly, for the ANN and SVM models, R 2 increased and RMSE decreased to different degrees.  Table 3 summarizes the variable selection results obtained with the forward feature selection method starting from the initial 27 variables, and Figure 11 shows the process of the forward feature selection method in this study. The best bivariate model includes SATVI and GRVI with mean R 2 and RMSE values of 0.7266 and 3162.03 kg/ha, respectively. When Nadir_B4 and Nadir_B7 are added to the model, the R 2 and RMSE values are 0.7392 and 3097.89 kg/ha, respectively. With the addition of the other five variables (LSWI, NDVI, Nadir_B6, Nadir_B3, and RVI), R 2 slightly increases from 0.7527 to 0.7739, and RMSE decreases from 3011.75 to 2886.80 kg/ha.  Table 3 summarizes the variable selection results obtained with the forward feature selection method starting from the initial 27 variables, and Figure 11 shows the process of the forward feature selection method in this study. The best bivariate model includes SATVI and GRVI with mean R 2 and RMSE values of 0.7266 and 3162.03 kg/ha, respectively. When Nadir_B4 and Nadir_B7 are added to the model, the R 2 and RMSE values are 0.7392 and 3097.89 kg/ha, respectively. With the addition of the other five variables (LSWI, NDVI, Nadir_B6, Nadir_B3, and RVI), R 2 slightly increases from 0.7527 to 0.7739, and RMSE decreases from 3011.75 to 2886.80 kg/ha.   The variable importance results in Table 3 show that the SATVI and GRVI play an important role in corn biomass estimation in this study, which is in stark contrast to other studies because SATVI has not been as widely adopted as other vegetation indices. However, SATVI, as an index of the amount of green and senescent vegetation, is highly correlated with vegetation coverage [52,68]. SATVI can capture 55% of the variability in ground measured total vegetation cover from diverse sites [69], and it has been utilized to analyze that of fractional plant cover changes [68]. There was a strong correlation between corn biomass and coverage, and it is a reasonable key variable in this study. GRVI showed great sensitivity to nitrogen deficiency in corn, and previous studies have shown a higher correlation with corn yield [70,71]. LSWI is a vegetation water content index, which is also a key variable for this research. This finding is consistent with the results reported by Wang et al., who indicated that LSWI had a better relationship with biomass than NDVI and EVI [72]. The RVI and NDVI play crucial roles in the dry matter of vegetation [73,74], and they have been demonstrated to be strongly correlated with vegetation biomass in previous studies [75][76][77]. In this study, the four MCD43A4 reflectance wavebands, namely SWIR from 2105 to 2155 nm, SWIR from 1628 to 1652 nm, green from 545 to 565 nm, and blue from 459 to 479 nm, attained a notable relationship with the corn The variable importance results in Table 3 show that the SATVI and GRVI play an important role in corn biomass estimation in this study, which is in stark contrast to other studies because SATVI has not been as widely adopted as other vegetation indices. However, SATVI, as an index of the amount of green and senescent vegetation, is highly correlated with vegetation coverage [52,68]. SATVI can capture 55% of the variability in ground measured total vegetation cover from diverse sites [69], and it has been utilized to analyze that of fractional plant cover changes [68]. There was a strong correlation between corn biomass and coverage, and it is a reasonable key variable in this study. GRVI showed great sensitivity to nitrogen deficiency in corn, and previous studies have shown a higher correlation with corn yield [70,71]. LSWI is a vegetation water content index, which is also a key variable for this research. This finding is consistent with the results reported by Wang et al., who indicated that LSWI had a better relationship with biomass than NDVI and EVI [72]. The RVI and NDVI play crucial roles in the dry matter of vegetation [73,74], and they have been demonstrated to be strongly correlated with vegetation biomass in previous studies [75][76][77]. In this study, the four MCD43A4 reflectance wavebands, namely SWIR from 2105 to 2155 nm, SWIR from 1628 to 1652 nm, green from 545 to 565 nm, and blue from 459 to 479 nm, attained a notable relationship with the corn biomass. This result is quite different from those reported in previous studies, as few previous studies have directly regarded reflectance wavebands as influencing factors. However, according to Wang et al., SWIR bands centered at 1649 nm and 1722 nm were highly correlated with leaf dry matter content [78]. A similar study by Shoko et al. also showed that SWIR centered at 2190 nm contributed more to biomass estimation over time [79]. In addition, the green and blue bands were found to be sensitive to corn biomass. This is most likely because the two bands play an important role in the photosynthesis of vegetation and have a direct effect on the chlorophyll content and biomass of corn.

Comparison of the Prediction Models
Previous studies on biomass estimation involving machine learning models have mainly been conducted in forestlands and grasslands [8,26,30,31]. With regard to crops, the application of machine learning models has largely focused on crop yield estimation [6,17,33,34]. In this research, we explored the performance of four machine learning models in estimating the corn biomass of the whole growing season (RF, SVM, ANN, and XGBoost), all of which exhibited acceptable accuracy (Figures 5 and 6).
The XGBoost model yielded the best performance with a high ratio of explained variance and low error, which was consistent with the results of Li et al. [30], who predicted the forestland above ground biomass (AGB) with the RF, XGBoost, and linear regression models and indicated that the XGBoost model is an effective method for AGB estimation and reduces the problems of over-and underestimation. According to Li et al. [30], XGBoost corrects the residual error to generate a new tree based on the previous tree. Trees are independent in the RF model, which is a more flexible algorithm and is the reason why XGBoost performs better. In this study, the performance of the RF model was comparable to that of the XGBoost model, and this finding is in accordance with the results reported by Herrero-Huerta et al. [80]. The RF model outperformed the SVC and ANN models, and the results are similar to those of the studies of An et al., Kayah et al., Xu et al., and Zhu et al. [14,[81][82][83], which focused on the performance and accuracy of the RF, SVC, ANN and other models, and the RF model was recommended due to its robustness and accuracy in those studies. Previous studies have indicated that the RF model has the ability to resist overfitting and address high-dimensional data [84], which is why the RF model is an effective model in this study. Conversely, the SVM and ANN models showed relatively lower performance. It is important to note that there are some limitations in the comparison of models. The advantages and disadvantages of the four machine learning models are not fully demonstrated due to the limited observations [84]. Taking the ANN model as an example, it requires many training repetitions to obtain an optimal neural network. The number of ground observation data points was not large enough, which may be one of the factors that influenced the results of the ANN model. In addition, the inner workings of the ANN and SVM were treated as black-box models, which are difficult to understand [85]. However, we mainly focused on the model accuracy in estimation in this research, and the XGBoost and RF models could be convenient and efficient models with which to estimate crop biomass.

Factors Influencing the Model Accuracy
Although a relatively high accuracy of regional daily corn crop biomass estimation was achieved in this research, there are some limitations associated with our data and methodology, which directly affect the accuracy of model estimation.
The first limitation is the representativeness of the field-observed values. Seed corn and field corn were included in this study area. The phenology of these two types of corn was quite different, which might affect the estimation accuracy of the model to some extent. Although relatively homogeneous sample sites were selected, it is difficult to match the scale between the plot size and MODIS pixel size, which covers a 500 m × 500 m area [8]. In addition, the heterogeneity of the sample sites within a pixel is inevitable, especially in cropland areas. Roads, bare land areas, ridges, and various crops grown intermittently all enhance heterogeneity [86]. The second factor is the saturation of the remote sensing data at high biomass densities. According to previous studies, the saturation of surface reflectance and vegetation indices often occurs at moderate to high vegetation cover [35]. This refers not only to some vegetation indices but also to certain wavebands [87]. In this study, twenty-seven variables, including LAI derived from MODIS MOD15A2, seven MCD43A4 wavebands (Table 1), and nineteen spectral vegetation indices (Table 2), were considered. According to Saatchi et al. [88], the use of multiple prediction variables may offset the underestimation caused by spectral saturation to a certain extent. However, the saturation effect remains present in this study. As shown in Figure 6a,c,h, the estimated biomass was lower than the ground-observed biomass when the biomass was higher than 10 t/ha. This may occur because the accumulated corn biomass was mainly allocated in the vertical plane at the later vegetative stages [89]. However, the saturation effect may be reduced by combining spaceborne and airborne light detection and ranging (LiDAR), radar and synthetic aperture radar [8,72]. Therefore, the integration of these data with optical data may reduce the saturation effect, which will be addressed in future studies.

Conclusions
In this study, field-measured corn biomass data collected during eight growing seasons from 2012 to 2019 were utilized to calibrate predictive models from coarse-resolution remote sensing data. Combined with the MODIS MCD43A4 product, which is a stable and consistent daily surface reflectance 500-m spatial resolution product in wavebands 1-7, we employed four machine learning models (the RF, SVM, ANN and XGBoost models) to predict the corn biomass in the middle reaches of the Heihe River basin. Twenty-seven parameters related to vegetation biomass were analyzed. Nine of these were finally selected according to the forward feature selection algorithm, namely SATVI, GRVI, Nadir_B7 (2105-2155 nm), Nadir_B6 (1628-1652 nm), LSWI, NDVI, Nadir_B4 (545-565 nm), and Nadir_B3 (459-479 nm), and their importance decreased in the order above. In contrast to previous studies, the 3, 4, 6, and 7 wavebands of MCD43A4 were sensitive to corn biomass.
All four models exhibited acceptable accuracy. The estimated biomass trend curves of the four models were similar to the ground-observed biomass curves at most times. However, by comparing the performances of the above four machine learning models based on R 2 , RMSE and MAE, the XGBoost model performed the best, followed by the RF model, while the performance of the ANN and SVM models fluctuated. The ANN model overestimated the corn biomass during early growth according to the spatiotemporal characteristics of the estimated biomass. The corn biomass was underestimated for all models during some periods, which is possibly related to the index and certain waveband saturations used in this study.
To improve the accuracy of corn biomass estimation at the regional scale, our future work will focus on the collection of more ground-observed data on different types of crops in a wide range of areas, and other sources of data (e.g., high-spatial resolution remote sensing images, hyperspectral data, and LiDAR data), and other auxiliary data (e.g., plant height, soil moisture and climate data) will be examined for biomass estimation purposes. Furthermore, exploring suitable deep learning approaches will play an important role in improving prediction accuracy.