Prediction of Water Carbon Fluxes and Emission Causes in Rice Paddies Using Two Tree-Based Ensemble Algorithms

: Quantiﬁcation of water carbon ﬂuxes in rice paddies and analysis of their causes are essential for agricultural water management and carbon budgets. In this regard, two tree-based machine learning models, which are extreme gradient boosting (XGBoost) and random forest (RF), were constructed to predict evapotranspiration (ET), net ecosystem carbon exchange (NEE)


Introduction
Rice is a major staple crop that feeds over 50% of the world's population. Planting rice requires many freshwater resources, coupled with increased demand from other competitive sectors and expanding sectors, which means better agricultural water management for rice. Evapotranspiration (ET) is the main component of water consumption in rice paddies. Therefore, accurate estimation of ET is essential for rice water management, which helps formulate and evaluate water-saving strategies and enhances the understanding of the rice water cycle [1][2][3]. In addition to the rice paddy water cycle, the carbon cycle is significant for maintaining the health and sustainable development of the rice paddy ecosystem [4]. Continuous measurement of net ecosystem CO 2 exchange (NEE) in rice paddy helps determine the source and sink status of the rice paddy ecosystem and analyze the temporal variation of carbon exchange [5]. Methane (CH 4 ) is the second largest radiatively forced greenhouse gas after CO 2 [6]. Rice paddies will cause an anaerobic soil environment under long-term flooding conditions. Organic matter, such as primary carbon and plant residues in the soil, is gradually decomposed into soluble organic matter utilized Seven rice paddy sites were selected from Version 1 of the FLUXNET-CH 4 database (Table 1, Figure 1) [17]. In this study, two tree-based machine learning models (XGBoost and RF) were used to predict ET, NEE, and FCH 4 in rice paddies. The input factors are shown in Tables 2-4 respectively. When soil water content (SWC) and soil temperature (TS) have more than one observation depth, the mean value of SWC is taken into the model, and TS selects the depth with the highest statistical correlation with the predictors. The most used input factors in predicting FCH 4 include air temperature (TA), incoming shortwave radiation (SW_IN), outgoing longwave radiation (LW_OUT), vapor pressure deficit (VPD), atmospheric pressure (PA), wind speed (WS), wind direction (WD), SWC, friction velocity (USTAR), net radiation (NETRAD), ecosystem respiration (RECO), sensible heat turbulent flux (H), gross primary productivity (GPP), soil heat flux (G), TS, normalized difference vegetation index (NDVI), latent heat turbulent flux (LE), NEE, and the temperature difference between the previous day and the current day (DeltaTA).  Figure 1. Location of eddy covariance sites, with sites colored by wetland type. Cyan represents the seven rice paddy sites studied.

Extreme Gradient Boosting (XGBoost)
The XGBoost model is a machine-learning algorithm implemented in a gradientboosting framework. The integration algorithm summarizes the modeling results of the sum of all the weak learners (classification and regression tree, CART). The XGBoost model adopts the training method of continuous accumulation of multiple weak learners to optimize the objective function; that is, the XGBoost model builds one CART at a time, and the newly established CART score will be accumulated with all previous CART scores. The t-th objective (O (t) ) function of the model can be expressed as follows: where l is the loss term of the t-th CART, C is a constant term, and Ω (f t ) is the regularization term of the model, defined as: where γ and λ are customization parameters. In general, the larger these two values are, the simpler the tree structure will be. Then, the problems of over-fitting may be effectively solved. Taking a second-order Taylor expansion with Equation (1), it can be written as follows: where g is the first derivative, and h is the second derivative. They can be described as: Substituting (2), (4), and (5) into (3) and taking the derivative, the solutions can then be obtained from (6) and (7): The XGBoost model has the advantages of high computational efficiency and high accuracy. More details about the XGBoost model can be found in Chen et al. [25].

Random Forest (RF)
RF is an ensemble algorithm proposed by Breiman [26]. RF consists of multiple classification and regression trees (CART), and the average of these tree prediction results is the final prediction result. RF model has various advantages: (1) it can handle very highdimensional data (that is, data with numerous features) and need no feature selection; (2) it can detect the interaction and importance of input factors; (3) it can be easily parallelized because trees are independent of each other during training; and, (4) it has high accuracy and can obtain good accuracy for the missing data problem.
In this study, a k-fold testing approach was applied to assess the performances of XGBoost and RF models. The complete data set is divided into 5 parts, with 4 parts for training and 1 part for testing. Further, the importance of the input factors of each predictor is calculated by the XGBoost model. The gain represents the fractional contribution of each factor to the model based on the total gain of this factor's splits. A higher percentage means a more important predictive feature.

Statistical Evaluation
The statistical indicators used in this study include root mean square error (RMSE), coefficient of determination (R 2 ), mean absolute error (MAE), mean bias error (MBE), and global relative indicator (GRI), which are expressed as: where Y i,o , Y i,p , Y o , and Y p are the observed and predicted factors, and the mean values of the observed and predicted factors, respectively; n is the number of observations; α j is a coefficient, which equals to 1 for RMSE, MAE, and MBE, and −1 for R 2 ; g j,max and g j,min represents the maximum and minimum of j, respectively; Y i,j is the scaled value of j. The models with a lower GRI value indicated higher accuracy.

Models Performance and Driving Factors of ET Prediction
The statistical result of the two machine learning models for predicting half-hour ET at different sites is provided in Table 5. The GRI of the XGBoost models was smaller than that of the RF models. In addition, the XGBoost models showed higher R 2 and lower RMSE and MAE as compared to the RF models, with RMSE ranging 0.0187~0.0314 mm hr −1 , R 2 ranging 0.8671~0.9377, and MAE ranging 0.0110~0.0169 mm hr −1 for XGBoost, RMSE ranging 0.0207~0.0325 mm hr −1 , R 2 ranging 0.8583~0.9240, and MAE ranging 0.0121~0.0198 mm hr −1 for RF, respectively. At the same time, the MBEs of the two models were almost close to zero.
The scatter plots of ET observed values and the values predicted by the two machine learning models are shown in Figure 2. Figure 2 shows that many points in the two models were located above and below the 1:1 line. The correlation between the observed and predicted values of the XGBoost models was slightly better than those of the RF models. The R 2 values of the XGBoost and RF model reached the maximum at IT-Cas sites, which were 0.9338 and 0.9240, respectively.
It is of great significance to analyze the sensitivity of rice paddy ET to input factors for further understanding the impact of global climate change on rice paddy ET. The XGBoost model has the ability to evaluate the importance of predictors. Figure 3 shows the importance of different factors at different sites to ET through the XGBoost model (a higher value of gain implies greater importance). It can be seen from Figure 3 that SW is the most important factor for ET prediction at each site; the mean gain value is 0.6872. It is quite normal, after all, that solar radiation provides the energy needed for evapotranspiration. Except for SW, the gain values of other factors were less than 0.3. After averaging the gain of each site factor, the order was SW (0.6872), VPD (0.0989), TA (0.0611), NDVI (0.0470), SWC (0.0402), WS (0.0385), PA (0.0177), and DeltaTA (0.0150).

Models Performance and Driving Factors of NEE Prediction
The statistical result of the two machine learning models for predicting half-hour NEE at different sites is provided in Table 6. As seen in the table, except for KR-CRK and US-HRC sites, the XGBoost models showed higher R 2 and lower RMSE and MAE as compared to the RF models. Only at the KR-CRK site was the GRI value of XGBoost larger than that of RF. Figure 4 shows the scatter plots of the observed and predicted NEE values based on the XGBoost and RF models. Both models made good predictions at most points, with R 2 ranging from 0.7525~0.9668 for XGBoost, and 0.7548~0.9581 for RF. When the predicted value was small, there was an overestimation, and when the predicted value was large, there was an underestimation.        . Scatter plots of the observed and predicted NEE at seven rice paddy sites usi XGBoost and RF models. Red points represent the observed and predicted NEE using the XG model. Blue points represent the observed and predicted NEE using the RF model. Figure 5 shows the importance of different factors at different sites to NEE thr the XGBoost model (a higher value of gain implies greater importance). It can be seen Figure 4 that except for the JP-Mse site NETRAD data missing, the gain of NETRA Sustainability 2023, 15, x FOR PEER REVIEW 1 Figure 4. Scatter plots of the observed and predicted NEE at seven rice paddy sites us XGBoost and RF models. Red points represent the observed and predicted NEE using the X model. Blue points represent the observed and predicted NEE using the RF model. Figure 5 shows the importance of different factors at different sites to NEE th the XGBoost model (a higher value of gain implies greater importance). It can be see Figure 4 that except for the JP-Mse site NETRAD data missing, the gain of NETR Figure 4. Scatter plots of the observed and predicted NEE at seven rice paddy sites using the XGBoost and RF models. Red points represent the observed and predicted NEE using the XGBoost model. Blue points represent the observed and predicted NEE using the RF model. Figure 5 shows the importance of different factors at different sites to NEE through the XGBoost model (a higher value of gain implies greater importance). It can be seen from Figure 4 that except for the JP-Mse site NETRAD data missing, the gain of NETRAD at other sites exceeded 0.2, and the contribution of NETRAD to the prediction of NEE ranked top. After averaging the gain of these sites, the gain of NETRAD was 0.38, ranking first. Followed by NDVI (0.26), LW (0.15), and SWC (0.12), the rest of the factor gain values were less than 0.10. TA was important at only one site (IT-Cas).

Models Performance and Driving Factors of FCH 4 Prediction
The statistical result of the two machine learning models for predicting half-hour FCH 4 at different sites is provided in Table 7. Comparing the GRI values, the XGBoost models had better performance than RF models except for the IT-Cas and US-HRA sites. In the US-HRA site, the XGBoost model had lower RMSE and higher R 2 , but MAE and MBE were further away from 0 compared to the RF model. However, the RF model showed higher R 2 and lower RMSE and MAE as compared to the XGBoost model in the IT-Cas site; only MBE performance was not good.  Figure 6 shows the scatter plots of observed and predicted FCH 4 values based on XGBoost and RF models. The accuracy of both models was within the acceptable range, and the minimum R 2 value was 0.6421. As seen in the figure, there were many sites with relatively serious underestimation, such as when the FCH 4 observed at the US-Twt site exceeded 200 nmol m −2 s −1 ; the underestimation was obvious. However, there were still many points distributed on both sides of the 1:1 line.

Analysis of Influencing Factors of Evapotranspiration in Rice Paddies
Evapotranspiration (ET) plays a crucial role in water resource management in rice paddies. With the growth of data, computing power, and storage capacity, the utilization of machine learning models for predicting ET has gradually increased [27,28]. Agrawal et al. [29] found that the XGBoost model significantly enhances the performance in predicting Penman-Monteith reference evapotranspiration (ET 0 ) compared with the RF model. Ge et al. [30] used the same models, XGBoost and RF, as this study did, based on three years of experimental data to predict crop evapotranspiration. The results revealed that XGBoost outperformed RF in predicting crop evapotranspiration, consistent with the findings of this study regarding XGBoost's prediction of rice evapotranspiration. In addition, Ge et al. revealed the order of importance of input factors obtained from the XGBoost model. Similarly, this study used the XGBoost model to rank the importance of the input factors in predicting rice ET.
Previously, Liu et al. [31] indicated that radiation was the dominant factor for rice ET through multiple stepwise regression analyses. Ahmadi et al. [32] found that solar radiation was the most critical factor in California and all its climatic zones. They used eight factors (TA, SW, VPD, PA, WS, NDVI, SWC, and DeltaTA) as factors for rice ET modeling, except for the absence of SWC at the US-Twt site. From the results of the gain value obtained from the XGBoost model, radiation showed the most significant effect on ET. The decisive reason why these studies found that radiation plays such a large role in predicting evapotranspiration is that radiation provides the energy required for rice evapotranspiration [33].
In addition, Zhang et al. [34] also found that the influence of radiation was excellent; the second and third factors were VPD and TA. They found that VPD was significantly better than TA at JP-MSE and US-Twt sites, consistent with the previously mentioned studies, and while at IT-Cas sites, TA had better feature importance than VPD. The former may be due to the influence of rice during the growth process by rough irrigation and water supply. When the leaf surface is humid and the ambient temperature is high, a higher VPD usually occurs. In this case, the rice may need to release water more quickly to adapt to environmental conditions, resulting in larger evapotranspiration [35]. The latter is that the ET of the IT-CAS site was more sensitive to temperature than VPD, which may be due to the fact that the site has a lower temperature, resulting in a relatively small VPD. However, the significant change in temperature makes TA have a greater effect on rice evapotranspiration.
Except for SW, TA, and VPD, other factors were less important in predicting evapotranspiration in rice paddies. In some water-scarce areas, soil moisture was the dominant factor of evapotranspiration [36], while this paper studies the evapotranspiration of rice paddies, which maintains a certain height of water layer for a long time through irrigation during the planting process, so soil moisture performance was poor in this study.

Analysis of Influencing Factors of Net Ecosystem Carbon Exchange in Rice Paddies
The net ecosystem carbon exchange (NEE) is a critical parameter for quantifying the rice ecosystem and its contributions to climate change. Liu et al. [37] indicated that both the XGBoost and RF models were applicable machine learning algorithms for predicting ecosystem NEE. However, the XGBoost model had higher computational efficiency than the RF model. Among environmental input factors, NETRAD, SWC, and TS were the most important factors, while precipitation and WS were less important in predicting NEE. It was different with this study.
The feature importance of the seven sites is shown in Figure 5, and it can be seen that the input factor importance ranking is different among sites. However, both NETRAD and NDVI were in the top three for all sites (except for the absence of net radiation factor at the JP-MSE site). It showed that NETRAD and NDVI were important factors in predicting NEE in rice paddies. Safa et al. [23] indicated that the most effective inputs on the NEE were NETRAD and LAI for irrigated sites through sensitivity analysis, similar to our findings.
The reason for this phenomenon is that NETRAD and NDVI, respectively, reflect the input of solar radiation energy and the degree of vegetation coverage.
Zhou et al. [38] indicated that SWC was one of the most important factors based on the RF model. However, in this study, SWC only performed better at the US-HRC site, ranking first. Five sites (no data available at the other site) showed a relatively moderate effect of SWC on the NEE at the half-hour time scale, which is similar to Xue et al. [39].
Temperature affects both photosynthesis and respiration, thus affecting NEE. Rice sensitivity to temperature varies in different regions. Too high or too low a temperature can affect the growth of rice. Among them, IT-Cas, JP-Mse, US-HRA, and US-HRC belong to humid subtropical climates, while KR-CRK, PH-RiF, and US-Twt have humid continental, tropical monsoon, and Mediterranean climates, respectively. Rice originates in tropical or subtropical regions, so in humid subtropical areas, the climate is warm and humid, which is very conducive to the growth of rice. However, in these humid subtropical regions, only the IT-Cas site was sensitive to temperature when predicting NEE, with a gain value of 0.3916. This may be due to the fact that the temperature at the IT-Cas site varied more than at other sites, ranging from −14.05 • C to 34.41 • C, resulting in NEE's increased sensitivity to temperature.
Differences in geographical location and climatic conditions lead to different effects of LW on NEE in different regions. In this study, when modeling JP-Mse and KR-CRK, LW had the greatest impact, and the remaining sites had a smaller impact (except for two sites missing LW data). These two sites are located in East Asia. The range of LW at these two sites was 299.71-509.32 W m −2 and 196.94-517.96 W m −2 , respectively. Compared with other sites, their LW variation span was larger. It was possible that this changing trend affected the growth of rice, and the respiration and photosynthesis of rice were also affected, which made the prediction of NEE more sensitive to LW.

Analysis of Influencing Factors of Methane Flux in Rice Paddies
Accurate prediction of methane flux (FCH 4 ) for rice paddies is crucial for understanding the greenhouse gas budget of rice paddy ecosystems and achieving environmental sustainability. Wu et al. [40] identified the XGBoost model as the most suitable model with outstanding efficiency and accuracy for predicting FCH 4 . However, Wu et al. have not studied the environmental factors influencing FCH 4 emissions. Understanding the causes of FCH 4 emissions contributes to more sustainable agricultural development. This study investigated the ability of the XGBoost and RF models to predict FCH 4 and its causes.
The importance of methane prediction input characteristics at each site is shown in Figure 7. The dominant factors of methane prediction were also changing with the location of the site. Among them, RECO performed best at IT-Cas and JP-Mse sites, while at the KR-CRK and US-Twt sites, NDVI had the greatest impact on the model. At the remaining three sites, SWC obtained the largest gain value.
Morin et al. [41] found that RECO was a powerful factor that is necessary to explain methane emissions. The importance of RECO in paddy field CH 4 prediction was also well reflected in this study. RECO played the most important role in the IT-Cas and JP-Mse sites. Although the gain value obtained by RECO was no more than 0.1, some sites also had the importance of the top 3. RECO had such a good performance, which is attributed to the fact that it may represent the final result of complex nonlinear processes that also affect CH 4 exchange and are the direct drivers of CH 4 production and methane transport in paddy fields.
Shi et al. [42] indicated that vegetation indexes could be used as input factors for determining CH 4 flux in rice paddies. NDVI was one of the indicators of vegetation change and performed best when predicting CH 4 at KR-CRK and US-Twt sites. In addition, it also had a good performance at IT-Cas and PH-RiF sites, with a gain value better than other variables. These results can prove the possibility of NDVI for FCH 4 prediction.
Ge et al. [43] indicated that TS and SWC were the most important factors controlling CH 4 emissions from rice paddies on seasonal timescales through the partial F tests.
Knox et al. [44] showed that the average annual soil temperature was the strongest predictor of annual CH 4 flux across wetland sites globally. This study found that TS and SWC were major factors in CH 4 prediction at the half-hour scale. For instance, TS ranked second in importance among two sites and third in one site; SWC ranked first in three sites, while the rest also performed in the top four. The main reason is that TS and SWC are important regulators of soil reduction conditions and enzymatic processes. Higher soil temperatures can enhance methane production, molecular diffusion, and transport within plants [45,46]. SWC regulates the balance between CH 4 production and oxidation by influencing the depth of anaerobic and aerobic zones in the soil [47].

Conclusions
In this study, two tree-based machine learning models (the XGBoost and RF models) were used to model the water carbon fluxes (ET, NEE, and FCH 4 ) of seven rice paddies, and the importance of these water carbon flux input factors were analyzed.
By analyzing the statistical indicator results, it was found that the XGBoost and RF model was available for predicting rice paddy ET, NEE, and FCH 4 , and the XGBoost model outperformed the RF model at most sites. Thus, utilizing the XGBoost model for predicting water carbon fluxes can achieve highly accurate results while reducing resource consumption, such as the installation and calibration of measuring instruments. This is particularly significant for sustainable development in the fields of rice paddy.
Similarly, understanding the causes of water carbon fluxes also contributes to sustainable development by allowing us to focus on the most important factors and save time and resources. When predicting ET, SW was the most important at all sites. NETRAD and NDVI were more sensitive in predicting NEE.TA, LW, and SWC were more sensitive at individual sites. FCH 4 exhibited varying sensitivity to input factors across different sites, with SWC, TS, RECO, and NDVI emerging as the most critical factors influencing its emissions.
For future research, the current study suggests further investigation of XGBoost's predictive water carbon fluxes performance in different wetland types. Additionally, investigating the combination of XGBoost with other optimization models, such as binary particle swarm optimization, to iteratively search for the optimal parameter combination of the XGBoost model and study the optimal input combination for water carbon fluxes.