Gap-Filling of Surface Fluxes Using Machine Learning Algorithms in Various Ecosystems

: Five machine learning (ML) algorithms were employed for gap-ﬁlling surface ﬂuxes of CO 2 , water vapor, and sensible heat above three di ﬀ erent ecosystems: grassland, rice paddy ﬁeld, and forest. The performance and limitations of these ML models, which are support vector machine, random forest, multi-layer perception, deep neural network, and long short-term memory, were investigated. Firstly, the accuracy of gap-ﬁlling to time and hysteresis input factors of ML algorithms for di ﬀ erent ecosystems is discussed. Secondly, the optimal ML model selected in the ﬁrst stage is compared with the classic method—the Penman–Monteith (P–M) equation for water vapor ﬂux gap-ﬁlling. Thirdly, with di ﬀ erent gap lengths (from one hour to one week), we explored the data length required for an ML model to perform the optimal gap-ﬁlling. Our results demonstrate the following: (1) for ecosystems with a strong hysteresis between surface ﬂuxes and net radiation, adding proceeding meteorological data into the model inputs could improve the model performance; (2) the ﬁve ML models gave similar gap-ﬁlling performance; (3) for gap-ﬁlling water vapor ﬂux, the ML model is better than the P–M equation; and (4) for a gap with length of half day, one day, or one week, an ML model with training data length greater than 1300 h would provide a better gap-ﬁlling accuracy.


Introduction
In order to study climate change, hydrological cycle, and atmosphere-surface interactions, information on fundamental factors such as carbon dioxide (CO 2 ), latent heat (LE), and sensible heat (H) fluxes are indispensable. To date, the most accurate and reliable method to obtain these surface flux data is the eddy-covariance method. However, this method relies on high frequency measurements of three-dimensional sonic anemometers and CO 2 /H 2 O infrared gas analyzers, which are often forced to stop by rain or instrument maintenance, and complete flux time series data sometimes cannot be obtained. Furthermore, the process of data quality inspection will also result in missing flux data. Therefore, gap-filling missing flux data is a key process for calculating mass and energy budgets, such as ecological carbon budgets, evapotranspiration, and water resource balance. Commonly used gap-filling methods adopt low-frequency meteorological information such as temperature, humidity, wind speed, and net radiation to perform linear or non-linear regressions to make up loss data. These methods include the following: non-linear regression, interpolation, mean diurnal variation, look-up tables, multiple imputation, and marginal distribution sampling [1,2]. The advantages and disadvantages of these methods can be found in [1,[3][4][5][6][7][8][9][10][11].
Aside from classic statistics-based and process-based gap-filling methods, gap-filling based on data-driven, that is, machine learning (ML) algorithms have also become popular in recent years. Van Wijk and Bouten [12] used artificial neural networks (ANNs), a type of classic ML algorithm, to simulate CO 2 and water vapor fluxes at six different forest stations in Europe. Their results data memorized in the model is still valuable. The valuable data will be used to forecast in this round and will be retained in the next round. This process would be implemented through a Sigmoid function filter. The input gate and output gate are used to determine whether the new input data are useful, and as the output value of the neuron. Beside three gate units, there is a candidate value, which will be used to determine the magnitude of the updated neuron. In Figure 1e, Xt is the input vector given to the LSTM model in this round of forecast; ht-1 and ht are the forecasting results of the previous and current rounds (ht-1 and ht also passed to the next forecasting round as inputs to deliver the model state); St-1 and St are the cell state at time t − 1 and time t. tanh is the hyperbolic tangent function. In this study, the LSTM contained two to three hidden layers, and each layer contained 4 to 32 memory cells (neurons). The activation function used here for the layer connection was ReLU. The optimizer was selected from Rmsprop, Adam, or Nadam.

Input Variables for Training the ML Models
In this study, the input variables for training the ML models included the following: (1) time factors: Julian day (JD) and day and night time (DN); the DN is a time index that converts the 24 h in a day into 0-1; (2) meteorological factors: air temperature (Ta), relative humidity (RH), net radiation (Rn), and wind speed (U); and (3) hysteresis factors.
The hysteresis phenomenon between surface fluxes and net radiation has been widely noted (e.g., Cui et al. [37]; Lin et al. [38]). However, none of the ML models for flux gap-filling have taken this into account. Other environmental factors (e.g., air temperature, wind speed, soil water content, relative humidity) also have hysteresis relationships with surface fluxes (Cui et al. [37]; Zhang et al. [39]). In addition to net radiation, we also select one of the environmental factors, air temperature, to explore its hysteresis relation with surface fluxes at these three sites. Hence, in this study, the third input variable group is the hysteresis factors consisting of Rn (t − 2), Rn (t − 1), Ta (t − 2), Ta (t − 1), and Tavg; Tavg is the average of air temperature at time t − 2, t − 1, and t.; Rn (t − 2) and Rn (t − 1) are the net radiation at time t − 2 and t − 1, respectively; Ta (t − 2) and Ta (t − 1) are the air temperature at time t − 2 and t − 1, respectively. In this study, each time step is 30 min.
All the input variables from these three groups are listed in Table 1. In this study, we explore the best input combinations from these three groups for training the ML models for various sites and fluxes and compare the model performance of the five ML models for gap-filling the three fluxes at the three sites.

Penman-Monteith Equation
In this study, the gap-filling of latent heat flux (LE) was also performed by the classic physicalprocess based model, the Penman-Monteith (P-M) equation: where ∆ (kPa K −1 ) represents the slope of the saturated vapor pressure-temperature curve at the air temperature Ta; (= . ) is the psychrometric constant; (= 1.2 kg m −3 ) and (= 1005 J kg −1 K −1 ) represent the air density and specific heat for the air, respectively; is the latent heat of vaporization and is equal to 2.45 × 10 (J kg −1 ); D (kPa) represents the vapor pressure deficit; and are the aerodynamic resistances of water vapor and stomatal resistance; Qn (= Rn − Gs) is the available energy; and Gs is the soil heat flux. The rav can be expressed as [36] = − − (2)

Random Forest (RF)
The present study also used the Scikit-learn package in Python to build the random forest, in which the number of decision trees was gradually increased from 100 to 500. The maximum number of features is automatically selected by the Scikit-learn package. In order to avoid overfitting, the maximum depth of the decision tree in this study ranged from 1 to 10. The structure of the RF is presented in Figure 1b; a detailed description can be found in Breiman [28].

Multi-Layer Perceptron (MLP)
The package used to build MLP in this study was TensorFlow developed by Google. The model structure is shown in Figure 1c; in between the input and output layers, only one hidden layer is used to build the neural network. The number of neurons used in the hidden layer was from 3 to 512. The grid-search method was used to determine the hyper-parameters. In order to distinguish from the novel deep learning networks, the activation function used here is Sigmoid function. A detailed description can be found in Rumelhart et al. [29].

Deep Neural Network (DNN)
The main improvement of DNN is the introduction of ReLU as the activation function to avoid the gradient vanishing; therefore, most DL models can contain three or more hidden layers. In addition, the invention of the new optimizers such as Adam or Nadam [30] allows us to train a large number of hyper-parameters more quickly; the regularization and dropout [31] enable us to avoid overfitting. The DNN used in this study was the TensorFlow package published by Google. The model structure is shown in Figure 1d, the number of hidden layers is three, and the number of neurons used in each hidden layer ranges from 4 to 32 neurons. The activation function used is ReLU, and the optimizer is either Rmsprop, Adam, or Nadam (based on optimization). More details about DNN can be found in Hinton et al. [32].

Long Short-Term Memory (LSTM)
LSTM is a type of improved recurrent neural network (RNN)-details of LSTM are given in Hochreiter et al. [33]. As many studies have shown [34][35][36], LSTM can more effectively predict the trend of mid-and long-term events than other ML algorithms. The LSTM used in this study was the Keras package launched by Google. The structure of an LSTM memory cell is presented in Figure 1e. The memory cell includes forget, input, and output gates. The forget gate is used to filter whether the data memorized in the model is still valuable. The valuable data will be used to forecast in this round and will be retained in the next round. This process would be implemented through a Sigmoid function filter. The input gate and output gate are used to determine whether the new input data are useful, and as the output value of the neuron. Beside three gate units, there is a candidate value, which will be used to determine the magnitude of the updated neuron. In Figure 1e, X t is the input vector given to the LSTM model in this round of forecast; h t-1 and h t are the forecasting results of the previous and current rounds (h t-1 and h t also passed to the next forecasting round as inputs to deliver the model state); S t-1 and S t are the cell state at time t − 1 and time t. tanh is the hyperbolic tangent function. In this study, the LSTM contained two to three hidden layers, and each layer contained 4 to 32 memory cells (neurons). The activation function used here for the layer connection was ReLU. The optimizer was selected from Rmsprop, Adam, or Nadam.

Input Variables for Training the ML Models
In this study, the input variables for training the ML models included the following: (1) time factors: Julian day (JD) and day and night time (DN); the DN is a time index that converts the 24 h in a day into 0-1; (2) meteorological factors: air temperature (T a ), relative humidity (RH), net radiation (R n ), and wind speed (U); and (3) hysteresis factors.
The hysteresis phenomenon between surface fluxes and net radiation has been widely noted (e.g., Cui et al. [37]; Lin et al. [38]). However, none of the ML models for flux gap-filling have taken this into account. Other environmental factors (e.g., air temperature, wind speed, soil water content, relative humidity) also have hysteresis relationships with surface fluxes (Cui et al. [37]; Zhang et al. [39]). In addition to net radiation, we also select one of the environmental factors, air temperature, to explore its hysteresis relation with surface fluxes at these three sites. Hence, in this study, the third input variable group is the hysteresis factors consisting of R n (t − 2), R n (t − 1), T a (t − 2), T a (t − 1), and T avg ; T avg is the average of air temperature at time t − 2, t − 1, and t.; R n (t − 2) and R n (t − 1) are the net radiation at time t − 2 and t − 1, respectively; T a (t − 2) and T a (t − 1) are the air temperature at time t − 2 and t − 1, respectively. In this study, each time step is 30 min.
All the input variables from these three groups are listed in Table 1. In this study, we explore the best input combinations from these three groups for training the ML models for various sites and fluxes and compare the model performance of the five ML models for gap-filling the three fluxes at the three sites.
average of the air temperatures measured at time t, t − 1, and t − 2 ( • C)

Penman-Monteith Equation
In this study, the gap-filling of latent heat flux (LE) was also performed by the classic physical-process based model, the Penman-Monteith (P-M) equation: where ∆ (kPa K −1 ) represents the slope of the saturated vapor pressure-temperature curve at the air temperature T a ; γ(= ρc p 0.622L v ) is the psychrometric constant; ρ (= 1.2 kg m −3 ) and c p (= 1005 J kg −1 K −1 ) represent the air density and specific heat for the air, respectively; L v is the latent heat of vaporization and is equal to 2.45 × 10 6 (J kg −1 ); D (kPa) represents the vapor pressure deficit; r av and r st are the aerodynamic resistances of water vapor and stomatal resistance; Q n (= R n − G s ) is the available energy; and G s is the soil heat flux. The r av can be expressed as [36] where k (=0.4) is the von Karman constant, z is the measurement height, d is the zero-plane displacement height (≈2/3 of the canopy height), z 0m is the surface roughness for momentum, and z 0v is the surface roughness for water vapor. With Equation (1), we can fill the gap of LE with measured low-frequency meteorological data and local r st derived from the measured latent heat flux (see Appendix B).

Flux Gap Scenario
Different gap lengths may require different data lengths to train the ML model. In this study, we consider four gap-length scenarios: one hour, half day, one day, and one week. The following steps were taken for each gap length to investigate the relation between gap length and data length of training. (1) The highest point of the measured flux was selected as the central point of the gap-length as this point was the most difficult point for gap-filling; (2) the closest neighbor points to the gap were selected as the training data; and (3) the training data length was started from 20 h and was increased gradually to a maximum of 1600 h with a time step of 20 h.

Research Process
The research process of this study includes three stages described below.
(1) Explore the optimal combinations of input variables for constructing the five ML models for gap-filling of surface fluxes at the three sites, and then compare the model performance.
For constructing the ML model, the ratio of data sets for training, validation, and testing was 5:3:2, and each of the data sets was randomly selected with uniform distribution. (2) In the second stage, the best ML model selected from the first stage was compared with the P-M equation to explore the water vapor flux gap-filling accuracy of both methods at the three ecosystems. The determination of r st for the P-M equation at the three sites is described in Appendix B. (3) In the third stage, the relation between gap length (one hour, half day, one day, and one week) and training data length (20 to 1600 h) was investigated by the steps in Section 3.3. The ML model adopted here was the best model selected from the first stage.

Performance Metrics
In order to objectively quantify the performance of each model, four commonly used performance metrics for linear or nonlinear data were selected and are described below.
(1) Root mean square error (RMSE) where C t andĈ t are the values from observation and model prediction, respectively, and n represents the total number of data points. (2) Mean absolute error (MAE) It is worth noting that the RMSE amplifies extreme errors more and ignores the small errors, while the MAE considers the average of all errors.
(3) Coefficient of determination (R 2 ) where C is the mean of the observed values andĈ represents the mean of the estimated values. (4) Coefficient of efficiency (CE) The CE evaluates whether the estimated value generated by the model is better than the estimated value using a direct average. The closer the value is to 1, the more ideal it is.

Optimal Input Combinations for Training ML Models
The optimal input combinations for CO 2 , latent heat, and sensible heat fluxes at the three sites determined by the model performance metrics are listed in Tables 2-4, respectively. It is noticed that, for the grassland site, the hysteresis factors played no roles in all three fluxes, and the time factors have some influences on all fluxes at the three sites. To further examine the influences of time factors and hysteresis factors on the three fluxes at the three sites, the averaged model performances with and without these two factors are summarized in Tables 5-7 for the grassland, rice paddy field, and forest, respectively. The individual model performance with and without time and hysteresis factors is listed in Appendix C. Tables 5-7 demonstrate the following: (1) For the grassland, the improvements by including time factors in the input combination are less than 5% for all three fluxes. For the rice paddy field, the improvements for RMSE for the three fluxes range from 8.6 to 19.7%, showing that time factors' influence is larger at this site. For the forest, this influence on CO 2 flux is small (2.9%), but it is larger for water vapor and sensible heat fluxes (7.26-7.9%, respectively). (2) Concerning the hysteresis factors, at the grassland site, these factors have no influence on all three fluxes. For the rice paddy, the influence on CO 2 flux is less important, but important for water vapor and sensible heat fluxes (RMSE improved by 8.72-9.50%). For the forest site, the influence on CO 2 flux is important (RMSE improved by 8.10%), but the influence on both LE and H is small (improvement rates of RMSE both less than 2%). Cui et al. [37] found that the magnitude of hysteresis between LE and net radiation is large on water surfaces and small on land surfaces. Our results for LE reveal that the hysteresis factors are stronger for the rice paddy field (flooded with water during growing season), but small for forest and grassland sites. This is consistent with the finding of Cui et al. [37]. Table 2. The optimal input combinations for CO 2 flux gap-filling at the three ecosystems using different machine learning models. SVM, support vector machine; RF, random forest; MLP, multi-layer perceptron; DNN, deep neural network; LSTM, long short-term memory.

MLP
Grassland JD, DN, T a (t), RH, R n (t), U(t) Rice paddy field JD, DN, T a (t), Table 3. Same as Table 2, but for latent heat flux.

LSTM
Grassland JD, T a (t), RH, R n (t), U(t) Rice paddy field JD, DN, T a (t), RH, R n (t), U(t), R n (t−2), R n (t−1), T a (t−2), T a (t−1), T avg (t) Forest JD, DN, T a (t), RH, R n (t), U(t), T avg (t) In addition to the time and hysteresis factors, at the rice paddy field, the LAI measurements were also carried out and the values varied with time; hence, LAI's influence on model performance was also considered. The results are presented in Appendix D. The ML model adopted in Appendix D was SVM. From Appendix D, it is noticed that the influences of LAI on gap-filling of CO 2 flux, LE, and H at the rice paddy field are all small.

Comparisons of Gap-Filling by ML Models
In this section, we compare the gap-filling results generated from the five ML models for CO 2 , water vapor, and sensible heat fluxes at the three ecosystems. Figure 2 shows a typical time series for CO 2 flux obtained from the experimental measurements and the five ML models at the three sites. As Figure 2a shows, in the grassland, the results of the five ML models were similar and could underestimate the peak values of CO 2 fluxes around noon. For the rice paddy field and forest ecosystem, all five models produced similar predictions and could accurately reproduce the flux peak values.  The regression analyses between simulated and measured latent heat fluxes are listed in Table  8. From Table 8 and Appendix C, though the differences between the five models' performances were marginal, for both the rice paddy field and forest ecosystem, SVM was the best model; for the grassland, the model with the best accuracy was LSTM. Moreover, from Tables 5-7, for gap-filling of LE, the ML models performed well over the grassland, rice paddy field, and forest, where R 2 values were 0.85, 0.89, and 0.71, respectively. The linear regression analyses between measured and model predicted fluxes at the three sites are summarized in Table 8. From Table 8 and Appendix C, notice that RF had the best performance in the grassland ecosystem, while the SVM performed best in both the rice paddy field and forest ecosystem. Moreover, from Tables 5-7, for gap-filling of CO 2 flux, the ML models performed best in the rice paddy field, followed by the forest, and lastly the grassland (R 2 values were around 0.9, 0.8, and 0.6, respectively).  The regression analyses between simulated and measured latent heat fluxes are listed in Table  8. From Table 8 and Appendix C, though the differences between the five models' performances were marginal, for both the rice paddy field and forest ecosystem, SVM was the best model; for the grassland, the model with the best accuracy was LSTM. Moreover, from Tables 5-7, for gap-filling of LE, the ML models performed well over the grassland, rice paddy field, and forest, where R 2 values were 0.85, 0.89, and 0.71, respectively.  The regression analyses between simulated and measured latent heat fluxes are listed in Table 8. From Table 8 and Appendix C, though the differences between the five models' performances were marginal, for both the rice paddy field and forest ecosystem, SVM was the best model; for the grassland, the model with the best accuracy was LSTM. Moreover, from Tables 5-7, for gap-filling of LE, the ML models performed well over the grassland, rice paddy field, and forest, where R 2 values were 0.85, 0.89, and 0.71, respectively.

Sensible Heat Flux
A typical time series of sensible heat flux gap-filling from the five ML models is shown in Figure 4. It is noticed that the five models produced similar results for all three sites and no model was found to systematically over-or underestimate the flux peak values.
Water 2020, 12, x FOR PEER REVIEW 13 of 25

Sensible Heat Flux
A typical time series of sensible heat flux gap-filling from the five ML models is shown in Figure  4. It is noticed that the five models produced similar results for all three sites and no model was found to systematically over-or underestimate the flux peak values. The regression analyses between measured and ML model predicted sensible heat fluxes and are also summarized in Table 8. Based on Table 8 and Appendix C, the performance of SVM was the best for all the three sites, though the differences between the five models were quite small. Moreover, on the basis of Tables 5-7, for gap-filling of H, the ML models performed well over the grassland, rice paddy field, and forest, where R 2 values were 0.83, 0.87, and 0.84, respectively. To summarize, we list the best ML models for gap-filling CO2 flux, LE, and H at the three sites in Table 9. Table 8. Summary of regression analysis between measured and ML model predicted fluxes at the three sites. Y = aX + b; Y is measured flux; X is predicted flux. The regression analyses between measured and ML model predicted sensible heat fluxes and are also summarized in Table 8. Based on Table 8 and Appendix C, the performance of SVM was the best for all the three sites, though the differences between the five models were quite small. Moreover, on the basis of Tables 5-7, for gap-filling of H, the ML models performed well over the grassland, rice paddy field, and forest, where R 2 values were 0.83, 0.87, and 0.84, respectively. To summarize, we list the best ML models for gap-filling CO 2 flux, LE, and H at the three sites in Table 9.

Comparison between Machine Learning Model and Penman-Monteith Equation
The comparisons between gap-filling results from the ML model and physical-process based model (P-M equation) over the three ecosystems are presented in this section. According to Section 4.2, the ML algorithm used in the rice paddy field and forest ecosystem was SVM; for the grassland ecosystem, LSTM was adopted. Figure 5 shows the comparison between measured and simulated latent heat flux by LSTM and the Penman-Monteith equation in the grassland ecosystem. The regression analyses for Figure 5 and model performance metrics are also summarized in Table 10. From Figure 5 and Table 10, we noticed that both the data-driven model (LSTM) and process-based model (P-M equation) performed well (both R 2 values greater than 0.82) for gap-filling LE at the grassland site. Moreover, it is noticed that, for the peak values of LE, the LSTM predictions were closer to the measurements than the P-M equation. As shown through the RMSE and MAE, the error of peak value and the average error obtained by LSTM were both smaller than those by the P-M equation. The comparison between measured and simulated latent heat flux by SVM and the Penman-Monteith equation at the rice paddy field is plotted in Figure 6. The regression analyses for Figure 6 and model performance metrics are also listed in Table 10. According to Figure 6 and Table 10, it is noticed that both SVM and the P-M equation could accurately reproduce LE (both R 2 values greater than 0.9) at the rice paddy field. Because of the lack of measurements of soil heat flux and water heat storage at the rice paddy field (the field was flooded with water), the available energy (Qn) was calculated by H+LE with the assumption of energy balance, and applied to Equation (A1) for calculating rst. The outstanding performance of the P-M equation benefits from this forced energy balance. This result implies that, if the surface energy balance of the experimental site is satisfied (the assumption of the P-M equation), then the P-M equation can effectively gap-fill the LE flux. In addition to the energy conservation assumption, the hysteresis between Rn and LE is another factor to influence the accuracy of the P-M equation (but this hysteresis factor is not considered in the P-M equation). Though this hysteresis in magnitude is stronger in the rice paddy field than the forest, the outstanding performance of the P-M equation in the rice paddy field implies that the energy conservation assumption might be a key factor in this field. Figure 7 shows the comparison between measured and predicted LE by SVM and the P-M equation at the forest. The regression analyses for Figure 7 and model performance metrics are also summarized in Table 10. It is obvious that, at the forest site, the ML model was better than the processbased model. The reason that the P-M equation did not reproduce LE well might be because the halfhourly rst varied strongly with the environmental factors (e.g., Rn, air temperature) and could not be predicted by the process in Appendix B. For the forest ecosystem, the two methods were both less accurate than the previous two ecosystems. Especially    The comparison between measured and simulated latent heat flux by SVM and the Penman-Monteith equation at the rice paddy field is plotted in Figure 6. The regression analyses for Figure 6 and model performance metrics are also listed in Table 10. According to Figure 6 and Table 10, it is noticed that both SVM and the P-M equation could accurately reproduce LE (both R 2 values greater than 0.9) at the rice paddy field. Because of the lack of measurements of soil heat flux and water heat storage at the rice paddy field (the field was flooded with water), the available energy (Q n ) was calculated by H+LE with the assumption of energy balance, and applied to Equation (A1) for calculating r st . The outstanding performance of the P-M equation benefits from this forced energy balance. This result implies that, if the surface energy balance of the experimental site is satisfied (the assumption of the P-M equation), then the P-M equation can effectively gap-fill the LE flux. In addition to the energy conservation assumption, the hysteresis between R n and LE is another factor to influence the accuracy of the P-M equation (but this hysteresis factor is not considered in the P-M equation). Though this hysteresis in magnitude is stronger in the rice paddy field than the forest, the outstanding performance of the P-M equation in the rice paddy field implies that the energy conservation assumption might be a key factor in this field. Many factors (e.g., energy conservation ratio, hysteresis, rst) can affect the accuracy of the P-M equation. The accuracy of the process-based model (P-M equation) at the three ecosystems clearly reflects the environmental and physiological uncertainties at the site. On the other hand, the ML models do not suffer from these environmental and physiological factors. However, when using ML models to extract features from historical data, it might not perform well when the missing data exceed the range of the training data.

Effect of Data Length on Flux Gap-Filling
In this section, we present the relation between gap length and data length of training for the three fluxes at the three sites. The ML model adopted here is SVM, with the optimal input combinations listed in Tables 2-4. Figure 8a shows the RMSE of the model predicted LE as a function of training data length for various gap lengths (one hour, half day, one day, and one week) at the grassland site. The curve for one-hour gap length (blue line) reveals that RMSE was reduced from the highest value (≈27.4 W/m 2 ) to a local minimum (≈17 W/m 2 ) as the training data length increased from 20 h to around 280 h. Once the data length was more than 280 h, the RMSE oscillated and increased to around 19.5 W/m 2 at 900 h data length. After 900 h, the RMSE was reduced again and   Table 10. It is obvious that, at the forest site, the ML model was better than the process-based model. The reason that the P-M equation did not reproduce LE well might be because the half-hourly r st varied strongly with the environmental factors (e.g., R n , air temperature) and could not be predicted by the process in Appendix B. For the forest ecosystem, the two methods were both less accurate than the previous two ecosystems. Especially   Many factors (e.g., energy conservation ratio, hysteresis, rst) can affect the accuracy of the P-M equation. The accuracy of the process-based model (P-M equation) at the three ecosystems clearly reflects the environmental and physiological uncertainties at the site. On the other hand, the ML models do not suffer from these environmental and physiological factors. However, when using ML models to extract features from historical data, it might not perform well when the missing data exceed the range of the training data.

Effect of Data Length on Flux Gap-Filling
In this section, we present the relation between gap length and data length of training for the three fluxes at the three sites. The ML model adopted here is SVM, with the optimal input combinations listed in Tables 2-4. Figure 8a shows the RMSE of the model predicted LE as a function Many factors (e.g., energy conservation ratio, hysteresis, r st ) can affect the accuracy of the P-M equation. The accuracy of the process-based model (P-M equation) at the three ecosystems clearly reflects the environmental and physiological uncertainties at the site. On the other hand, the ML models do not suffer from these environmental and physiological factors. However, when using ML models to extract features from historical data, it might not perform well when the missing data exceed the range of the training data.

Effect of Data Length on Flux Gap-Filling
In this section, we present the relation between gap length and data length of training for the three fluxes at the three sites. The ML model adopted here is SVM, with the optimal input combinations listed in Tables 2-4. Figure 8a shows the RMSE of the model predicted LE as a function of training data length for various gap lengths (one hour, half day, one day, and one week) at the grassland site. The curve for one-hour gap length (blue line) reveals that RMSE was reduced from the highest value (≈27.4 W/m 2 ) to a local minimum (≈17 W/m 2 ) as the training data length increased from 20 h to around 280 h. Once the data length was more than 280 h, the RMSE oscillated and increased to around 19.5 W/m 2 at 900 h data length. After 900 h, the RMSE was reduced again and reached a relative stable value (≈17 W/m 2 ) when the data length was longer than 1300 h. This result shows that a longer training data length does not promise better performance for the one-hour gap length. shows that a longer training data length does not promise better performance for the one-hour gap length.
In Figure 8a, the RMSE curve for one day (green line) had the same trend as the one hour curve did, but with slightly higher RMSE values. Concerning the curve for the half day gap length (red line), the RMSE decreased with increasing training data length from 20 h to the maximum point (1600 h), where after 400 h, the decrease in RMSE was relatively small. For the one week gap length case, the RMSE curve (purple line) decreased rapidly with the increase of data length and reached a stable RMSE value after 1000 h.
From these four curves, it is noticed that, when using the ML algorithm to gap-fill the missing LE at the grassland ecosystem, the error was larger when the length of training data was short. As the data length increased, the accuracy also increased, and then gradually converged to the model limitation; that is, the RMSE no longer decreased with an increase of data length. For the case herein, it is found that, for gaps less than one day, using about 400 h training data could result in a relatively small RMSE (less than 25 W/m 2 ). Figure 8b is the same as Figure 8a, but for the rice paddy field. The major trend in Figure 8b is the same as in Figure 8a, that is, the RMSE curves (for all four gap lengths) had the highest values at the beginning and then dropped as the data length increased. For the one hour gap length case, the RMSE had its minimum (≈10 W/m 2 ) at 900 h and then oscillated up and down with data length till the end at 1600 h. For cases of half day and one day, the minimum values (≈25 W/m 2 ) of RMSE happened at 1600 h, but the RMSEs only changed a bit after 1300 h. For the one week case, the RMSE had its lowest value (≈22 W/m 2 ) at 1000 h and remained stable afterwards.  Figure 8c is the same as Figure 8a, but for the forest ecosystem. For the one hour gap length case, the RMSE had its minimum (≈8 W/m 2 ) at 500 h and then increased to 11 W/m 2 with the increase of data length till 800 h, and then remained stable till the end at 1600 h, with an RMSE around 11 W/m 2 . Same as that in Figure 8b, for the one week gap length case, the RMSE also had its lowest value (≈31 W/m 2 ) at 1000 h and remained stable afterwards. For the cases of half day and one day, the minimum values of RMSE occurred at 1300 h and the RMSE remained stable afterwards.
In summary, Figure 8 reveals the following characteristics: (1) The gap-filling accuracy increased with the increase of data length and reached the model limitation when the data length is longer than 1300 h, except for the one hour gap length case. In Figure 8a, the RMSE curve for one day (green line) had the same trend as the one hour curve did, but with slightly higher RMSE values. Concerning the curve for the half day gap length (red line), the RMSE decreased with increasing training data length from 20 h to the maximum point (1600 h), where after 400 h, the decrease in RMSE was relatively small. For the one week gap length case, the RMSE curve (purple line) decreased rapidly with the increase of data length and reached a stable RMSE value after 1000 h.
From these four curves, it is noticed that, when using the ML algorithm to gap-fill the missing LE at the grassland ecosystem, the error was larger when the length of training data was short. As the data length increased, the accuracy also increased, and then gradually converged to the model limitation; that is, the RMSE no longer decreased with an increase of data length. For the case herein, it is found that, for gaps less than one day, using about 400 h training data could result in a relatively small RMSE (less than 25 W/m 2 ). Figure 8b is the same as Figure 8a, but for the rice paddy field. The major trend in Figure 8b is the same as in Figure 8a, that is, the RMSE curves (for all four gap lengths) had the highest values at the beginning and then dropped as the data length increased. For the one hour gap length case, the RMSE had its minimum (≈10 W/m 2 ) at 900 h and then oscillated up and down with data length till the end at 1600 h. For cases of half day and one day, the minimum values (≈25 W/m 2 ) of RMSE happened at 1600 h, but the RMSEs only changed a bit after 1300 h. For the one week case, the RMSE had its lowest value (≈22 W/m 2 ) at 1000 h and remained stable afterwards. Figure 8c is the same as Figure 8a, but for the forest ecosystem. For the one hour gap length case, the RMSE had its minimum (≈8 W/m 2 ) at 500 h and then increased to 11 W/m 2 with the increase of data length till 800 h, and then remained stable till the end at 1600 h, with an RMSE around 11 W/m 2 . Same as that in Figure 8b, for the one week gap length case, the RMSE also had its lowest value (≈31 W/m 2 ) at 1000 h and remained stable afterwards. For the cases of half day and one day, the minimum values of RMSE occurred at 1300 h and the RMSE remained stable afterwards.
In summary, Figure 8 reveals the following characteristics: (1) The gap-filling accuracy increased with the increase of data length and reached the model limitation when the data length is longer than 1300 h, except for the one hour gap length case. (2) For cases of one hour, the best model performance happened at different data lengths for different ecosystems (around 300, 900, and 500 h for grassland, rice paddy, and forest, respectively). Figure 9 plots the RMSE of gap-filling for CO 2 flux by the SVM at the three ecosystems as a function of data length for various gap lengths. In Figure 9, the following issues are noticed.
(1) For all three sites, the RMSE curve of half day and one day gap lengths had the highest value at the beginning and then dropped as the data length increased; after a local RMSE minimum was reached, the RMSE oscillated up and down with the increase of data length and then reached its minimum at around 1300 h and remained stable till the end at 1600 h. The one week gap length case at the grassland also followed this trend. (2) For the one week gap length case at both the rice paddy field and forest, the RMSE decreased with the increase of data length and had the minimum after 1300 h. (3) For the one hour gap length case, the RMSE curve trend differed from each ecosystem.
The minimum RMSE occurred at around 1000, 200, and 300 h for the grassland, rice paddy field, and forest, respectively. (4) Figure 9c shows that too much training data could result in a decrease in gap-filling accuracy for the short gap length cases (one hour, half day, and one day). This is because too much training data might average out the necessary features (e.g., peak values) of a short period. Figure 10 is the same as Figure 9, but for sensible heat flux. Figure 10b shows that the variations of all four RMSE curves at the rice paddy field were relative stable with the increase of data length. This is because the sensible heat flux measurements were small at this site. In Figure 10a,b, for both the one day and one week cases at both the grassland and forest ecosystems, the RMSE decreased with the increase of data length and had the minimum after 1300 h. However, the RMSE curves for the one hour and half day cases at these two sites oscillated a bit after the local minimum occurred.
In summary, (1) for the one week gap length case, the RMSE value decreased with the increase of the data length, and had the minimum value around 1300 h; (2) for both half day and one day cases, the RMSE value sometimes oscillated with the increase of the data length, but it would converge and reach the lowest value around 1300 h; (3) in the case of one hour, using the longest data length does not guarantee the best predicted value, and sometimes, the lowest RMSE value occurs in a shorter data length. Here, the shortest data length required to obtain an RMSE less than 1.05 times the lowest RMSE is summarized in Table 11. minimum RMSE occurred at around 1000, 200, and 300 h for the grassland, rice paddy field, and forest, respectively.
(4) Figure 9c shows that too much training data could result in a decrease in gap-filling accuracy for the short gap length cases (one hour, half day, and one day). This is because too much training data might average out the necessary features (e.g., peak values) of a short period.  Figure 10 is the same as Figure 9, but for sensible heat flux. Figure 10b shows that the variations of all four RMSE curves at the rice paddy field were relative stable with the increase of data length. This is because the sensible heat flux measurements were small at this site. In Figure 10a,b, for both the one day and one week cases at both the grassland and forest ecosystems, the RMSE decreased with the increase of data length and had the minimum after 1300 h. However, the RMSE curves for the one hour and half day cases at these two sites oscillated a bit after the local minimum occurred. In summary, (1) for the one week gap length case, the RMSE value decreased with the increase of the data length, and had the minimum value around 1300 h; (2) for both half day and one day cases, the RMSE value sometimes oscillated with the increase of the data length, but it would converge and reach the lowest value around 1300 h; (3) in the case of one hour, using the longest data length does not guarantee the best predicted value, and sometimes, the lowest RMSE value occurs in a shorter data length. Here, the shortest data length required to obtain an RMSE less than 1.05 times the lowest RMSE is summarized in Table 11.
The above analyses are for individual gap cases; it is also interesting to know the relation between total gap length and data length of training where the total length of the gaps is the same (but the individual gap length is different). The analysis is presented in Appendix E, and it shows that, when the total gap length was the same, the RMSE curves of the four gaps were quite similar (regardless of the length of each gap) and all of them decreased with the increase of data length and converged to the lowest values after 1300 h.   Grassland  280  440  200  860  1120  100  80  240  640  620  620  1280  Rice paddy field 880  1380  1380  920  160  100  100  1340  1180  1180  1180  580  Forest  460  740  1120  460  280  200  1340  640  240  380  1080  1200 The above analyses are for individual gap cases; it is also interesting to know the relation between total gap length and data length of training where the total length of the gaps is the same (but the individual gap length is different). The analysis is presented in Appendix E, and it shows that, when the total gap length was the same, the RMSE curves of the four gaps were quite similar (regardless of the length of each gap) and all of them decreased with the increase of data length and converged to the lowest values after 1300 h.

Conclusions
In this study, five ML algorithms were adopted, including three conventional algorithms (SVM, RF, and MLP) and two deep learning algorithms, (DNN and LSTM) to gap-fill the missing fluxes of CO 2 , water vapor, and sensible heat in three ecosystems (grassland, rice paddy field, and forest). We conclude the following.
(1) In addition to the mean meteorological parameters, including the time factors (i.e., Julian day and decimal time) is important for all fluxes of CO 2 , water vapor, and sensible heat at the rice paddy field. However, the influences of time factors on these three fluxes are small (less than 5%) at the grassland. For the forest, this influence on CO 2 flux is small, but it is larger for water vapor and sensible heat fluxes. (2) The hysteresis factors have no influence on all three fluxes at the grassland site. For the rice paddy, this influence on CO 2 flux is not important, but it is important for water vapor and sensible heat fluxes. For the forest site, the hysteresis influence is important on CO 2 flux, but it is small on both water vapor and sensible heat fluxes. (3) For all three ecosystems, the five ML models produced similar results for gap-filling of CO 2 , water vapor, and sensible heat fluxes. A list of the best ML model for flux gap-filling at the three sites is provided in Table 9. All in all, the SVM model is the most recommended model. (4) In terms of water vapor flux gap-filling, the ML model was better than the P-M equation, especially for forests; however, historical data are required a priori for training ML models. (5) The following general rule for the relation between gap length and data length of training can be made: if the gap length is less than one week, the training data length for achieving the best model performance is around 1300 h (i.e., 7.7 times the gap length). (6) For a particular gap that we are concerned about (especially where the flux peak values occurred), if training data length longer than 1300 h are not available when doing gap-filling, the data length listed in Table 11 is recommended. efficiency; the upper and lower boundaries and the grid size will determine the number of parameter combinations to be calibrated. (3) Establish models using all the parameter combinations separately to obtain the performance metrics, and select the best one as the optimal parameter combination. More details about this method can be found in Lerman [26].

Appendix B. Calculation of the Stomatal Resistance (rst)
When applying the P-M equation to do latent heat flux (LE) gap-filling, in addition to the general meteorological parameters, R n , G s , T a , RH, and U, the stomatal resistance r st should also be given. In this study, when there is a missing point at time t, the r st for this point is the average of the r st at t − 1 and t + 1 (each time period is 30 min). By the measured LE and rearranging Equation (1), the r st was calculated as If there was an outlier in the measured r st at t − 1 and t + 1, the following rules were applied to replace the outlier: firstly, an r st greater than 800 was replaced by 800; secondly, an r st less than 0 was replaced by 0. If the previous or next r st was missing, or if the data were the first or the last point of the data set, the long-term average of r st was used as a replacement. That is, the diurnal (07:00~17:00) missing r st was supplemented by the average of all r st from 10:00 to 14:00. On the other hand, the nocturnal (0:00~07:00 and 17:00~00:00) missing r st would be supplemented by the average of all r st during night time.
For the grassland, the measured data were divided into two seasons, the hot period (from May to October) and cold period (from November to April). In the hot season, the average r st for diurnal and nocturnal periods were 358.86 and 503.01 (s/m), respectively. In the cold period, the average r st for diurnal and nocturnal periods were 273.81 and 422.43 (s/m), respectively For the rice paddy field, the average r st for diurnal and nocturnal periods in the growing season were 321.35 and 361.41 (s/m), respectively.
For the forest ecosystem, similar to the grassland site, the whole dataset was divided into hot and cold seasons. The average r st for diurnal period in the hot and cold seasons were 196.63 and 339.33 (s/m), respectively. As the forest canopy was always wet during night time [38], the missing r st at night for both hot and cold seasons were filled in by 0 (s/m). Table A1. Summary of individual model performance with and without time factors and hysteresis factors as inputs for flux gap-filling at the grassland. The number in the first parenthesis is the result without time factors; the number in the second parenthesis is the result without hysteresis factors.  Table A4. Summary of model performance with and without leaf area index (LAI) as an input for flux gap-filling at the paddy rice field. The model adopted here is the support vector machine (SVM) with the optimal input combination listed in Tables 2-4.

Appendix E. Gap Scenario with Equal Total Gap Length
Here, we present the relation between total gap length and data length of training where the total length of the gaps is the same (but the individual gap length is different). The gap scenario is as follows.
(1) The total gap length was one week (=168 h). (2) Four individual gap lengths were considered: one hour, half day, one day, and one week. That is, if the gap-length of each gap is a half day (=12 h), there will be 14 gaps in this case. (3) The highest point of the measured flux was selected as the central point of the first gap because this point was the most difficult point for gap-filling. (4) The remaining gaps were randomly selected from the rest of the data with a uniform distribution. (5) The closest neighbor points to the first gap were selected as the training data. (6) The training data length started from 20 h and increased gradually to a maximum of 1600 h. Figure A1 plots the RMSE of the model predicted CO 2 flux as a function of training data length for four individual gap lengths (one hour, half day, one day, and one week) at the forest site. In Figure A1, the total gap length for the four individual gap lengths was the same (one week) and the CO 2 flux was predicted using the SVM model. Figure A1 shows that, when the total gap length was the same, the RMSE curves of the four different gaps were quite similar, and all of them decreased with the increase of data length and converged to the lowest values after 1300 h. In other words, if the total gap length is one week, the length of training data for achieving better model performance is around 1300 h, regardless of the length of each gap. Sensible Heat Flux SVM 9.71 6.07 0.89 0.89 (W/m 2 ) SVM with LAI 9.75 6.10 0.89 0.89

Appendix E. Gap Scenario with Equal Total Gap Length
Here, we present the relation between total gap length and data length of training where the total length of the gaps is the same (but the individual gap length is different). The gap scenario is as follows. (1) The total gap length was one week (=168 h). (2) Four individual gap lengths were considered: one hour, half day, one day, and one week. That is, if the gap-length of each gap is a half day (=12 h), there will be 14 gaps in this case. (3) The highest point of the measured flux was selected as the central point of the first gap because this point was the most difficult point for gap-filling. (4) The remaining gaps were randomly selected from the rest of the data with a uniform distribution. (5) The closest neighbor points to the first gap were selected as the training data. (6) The training data length started from 20 h and increased gradually to a maximum of 1600 h. Figure A1 plots the RMSE of the model predicted CO2 flux as a function of training data length for four individual gap lengths (one hour, half day, one day, and one week) at the forest site. In Figure  A1, the total gap length for the four individual gap lengths was the same (one week) and the CO2 flux was predicted using the SVM model. Figure A1 shows that, when the total gap length was the same, the RMSE curves of the four different gaps were quite similar, and all of them decreased with the increase of data length and converged to the lowest values after 1300 h. In other words, if the total gap length is one week, the length of training data for achieving better model performance is around 1300 h, regardless of the length of each gap. Figure A1. The RMSE of predicted CO2 flux at the forest site as a function of training data length where the total gap length is one week for the four individual gap lengths: one hour, half day, one day, and one week.  Figure A1. The RMSE of predicted CO 2 flux at the forest site as a function of training data length where the total gap length is one week for the four individual gap lengths: one hour, half day, one day, and one week.