Estimating Forest Carbon Fluxes Using Machine Learning Techniques Based on Eddy Covariance Measurements

: Approximating the complex nonlinear relationships that dominate the exchange of carbon dioxide ﬂuxes between the biosphere and atmosphere is fundamentally important for addressing the issue of climate change. The progress of machine learning techniques has offered a number of useful tools for the scientiﬁc community aiming to gain new insights into the temporal and spatial variation of different carbon ﬂuxes in terrestrial ecosystems. In this study, adaptive neuro-fuzzy inference system (ANFIS) and generalized regression neural network (GRNN) models were developed to predict the daily carbon ﬂuxes in three boreal forest ecosystems based on eddy covariance (EC) measurements. Moreover, a comparison was made between the modeled values derived from these models and those of traditional artiﬁcial neural network (ANN) and support vector machine (SVM) models. These models were also compared with multiple linear regression (MLR). Several statistical indicators, including coefﬁcient of determination ( R 2 ), Nash-Sutcliffe efﬁciency (NSE), bias error (Bias) and root mean square error (RMSE) were utilized to evaluate the performance of the applied models. The results showed that the developed machine learning models were able to account for the most variance in the carbon ﬂuxes at both daily and hourly time scales in the three stands and they consistently and substantially outperformed the MLR model for both daily and hourly carbon ﬂux estimates. It was demonstrated that the ANFIS and ANN models provided similar estimates in the testing period with an approximate value of R 2 = 0.93, NSE = 0.91, Bias = 0.11 g C m − 2 day − 1 and RMSE = 1.04 g C m − 2 day − 1 for daily gross primary productivity, 0.94, 0.82, 0.24 g C m − 2 day − 1 and 0.72 g C m − 2 day − 1 for daily ecosystem respiration, and 0.79, 0.75, 0.14 g C m − 2 day − 1 and 0.89 g C m − 2 day − 1 for daily net ecosystem exchange, and slightly outperformed the GRNN and SVM models. In practical terms, however, the newly developed models (ANFIS and GRNN) are more robust and ﬂexible, and have less parameters needed for selection and optimization in comparison with traditional ANN and SVM models. Consequently, they can be used as valuable tools to estimate forest carbon ﬂuxes and ﬁll the missing carbon ﬂux data during the long-term EC measurements.


Introduction
Forest ecosystems play a critical role in sequestering atmospheric carbon dioxide [1,2]. The magnitude of carbon sequestration in different forest stands at various timescales ranging from hour to inter-annual is seriously influenced by global environmental change, such as climate, land use, CO 2 enrichment, nitrogen deposition and biotic invasions. In addition, the substantial impacts of Table 1. Site characteristics used in this study. The mean annual temperature (MAT, • C) and total annual precipitation (TAP, mm year −1 ) are for the time period (Period). Vegetation types are deciduous broadleaf forest (DBF), evergreen needle-leaf forest (ENF) and mixed forest (MF).  [37] Based on the EC technique, continuous half-hourly carbon fluxes for six or more years were measured at each site. Positive NEE stands for a loss of CO 2 to the atmosphere, while negative NEE implies an uptake of CO 2 by the ecosystem. The method of gap-filling for eddy covariance-measured Sustainability 2018, 10, 203 4 of 26 NEE and its partitioning into GPP and R was according to the Fluxnet-Canada Research Network protocol [38]. Briefly, R was obtained from both nighttime and cold-season NEE. The gap-filling of missing R during nighttime and calculation of R during daytime were on the basis of an empirical relationship between R and soil temperature. Finally, GPP was estimated based on the daytime-observed NEE and calculated daytime R. During the cold season, GPP was set to zero. More details regarding the EC instrument characteristics, NEE gap-filling and its separation scheme as well as the random errors in NEE, R and GPP have been provided for CA-Oas and CA-Obs in Griffis et al. [35] and Zha et al. [34], and CA-Gro in McCaughey et al. [37].

Site
Meteorological variables were also observed, mainly including air temperature (T a ), net radiation (R n ), relative humidity (R h ), wind speed, volumetric soil moisture content and soil temperature (T s ). The EC measurements began in April 1996, May 1999, and July 2003 at CA-Oas, CA-Obs, and CA-Gro, respectively. Table 2 presents the daily statistical parameters of the used data during a period of six years. At the three sites, the correlation values for both GPP and R are generally greater than those for NEE. The variables, T a , T s and R n , show strong positive correlation with both GPP and R, while have negative correlation with NEE. Table 2. Daily statistical parameters of flux tower measured environmental variables including air temperature (T a , • C), net radiation (R n , mol m −2 ), relative humidity (R h , %), soil temperature (T s , • C), gross primary productivity (GPP, g C m −2 day −1 ), ecosystem respiration (R, g C m −2 day −1 ) and net ecosystem exchange (NEE, g C m −2 day −1 ) during the whole period in all three stands.

Stand
Variable X mean X max X min X sd X ku X sk CC 1 CC 2 CC 3 Note: X mean , X max , X min , X sd , X ku and X sk refer to the mean, maximum, minimum, standard deviation, kurtosis and skewness of each variable, respectively; CC 1 , CC 2 and CC 3 refer to the correlation coefficient between each variable and GPP, R and NEE, respectively.

Artificial Neural Network
The ANN model with a parallel computing system is widely recognized owing to its strong ability of approximating the non-linear relationship between inputs and outputs. ANN approach is widely recognized as an important supervised technique for addressing various issues, such as regression and classification, in a wide range of fields [39][40][41]. Commonly, the feed-forward neural Sustainability 2018, 10, 203 5 of 26 network is the most classic ANN model and was used here in the present study. This network structure includes three different types of layers, namely, input layer, hidden (intermediate) layer and output layer. The present work used only one hidden layer because even for a specific ANN model with a single hidden layer, it can also approximate any continuous multivariate function with reasonable precision. The optimal number of hidden units selected for a specific network with lowest error during the validation period is determined through varying the number of hidden neurons on the basis of the corresponding data set. For the purpose of determining the optimum weights and thresholds of multilayer neural networks, the back propagation algorithm was adopted in this study with the aim to minimize the error between the actually simulated and expected targets from outputs for each case. Moreover, Levenberg-Marquardt algorithm was used as the training function by this study, because of its superiority of faster learning speed over some traditional conjugate gradient algorithms.

Support Vector Machine
The SVM model has rapidly gained popularity over the past few decades, since it was first proposed by Vapnik [42]. This advanced machine learning technique is essentially focused on addressing the classification and regression problems, which are involved in a broad range of research fields. A noteworthy innovation of SVM method is that, when solving the estimated error between the measured and the calculated, SVM depends on the structural risk minimization principle, which was introduced by Vapnik [42] in order to obtain an optimum network structure through minimizing the so-termed penalized empirical risk. This principle seeks to constrict the upper boundary on the desired risk and has some distinct superiority over conventional empirical risk minimization algorithm, which is adopted by the traditional neural network techniques. Just owing to this noticeable difference, several lines of evidence support that the SVM method presents a more powerful generalization ability than many neural network methods. When developing a SVM model, selecting a proper kernel function is of particular importance. To ensure the predictive accuracy, several kernel functions, including polynomial, sigmoid, radial basis functions, were compared and evaluated based on our applied data sets. Many attempts were made by this study and the radial basis function (RBF), which achieved the best performance, was utilized for all the SVM models. In addition, three critical parameters, including kernel width of RBF, the value of cost function and insensitive loss coefficient, were together optimized according to the grid search procedure.

Adaptive Neuro-Fuzzy Inference System
The ANFIS method is considered to be a sophisticated machine learning technique, and was introduced by Jang [32], based on the aforementioned ANN and fuzzy inference system (FIS). ANFIS is a more interpretable model and can offer more robust results than traditional ANN method, primarily due to the fact that it is able to qualitatively and quantitatively elucidate the nonlinear relationship between inputs and outputs through utilizing systematically the respective strengths of both ANN and FIS techniques. When developing a specific ANFIS for a given training data set, selecting a proper FIS is a major task. There are two widely used types of FISs, namely, Sugeno system [43] and Mamdani system [44]. The former was adopted by this study due to the implicitly of design in the consequent part and high computational efficiency. For a specific ANFIS model, there are generally two types of parameters that need to be optimized, the premise parameters and the consequent parameters. In the present study, the hybrid learning algorithm are composed of gradient descent algorithm and least-squares method, which were used to optimize the premise parameters and the consequent parameters, respectively. Generating a FIS is an important issue, which can be performed using several methods, such as grid partitioning, mountain and subtractive clustering. The grid partitioning algorithm is widely utilized, but the use of this method is substantially impeded owing to the curse of dimensionality. For this reason, the subtractive clustering technique was adopted by this study. Detailed description of subtractive clustering algorithm can be found in Cobaner [45].

General Regression Neural Network
GRNN as a feed-forward neural network could be usually subsumed under the category of radial basis function network. It is proposed by Specht [33] based on nonlinear regression theory and hence can be used as an alternative machine learning technique for approximating any arbitrary function of input-output variables from a given training dataset. In comparison with the conventional ANN approach based on back propagation algorithm, the major strengths of GRNN method can be summarized in the following aspects: (1) the simplicity of the network structure that heavily depends upon the number of input and output variables as well as the sample size of a specific training dataset, and does not necessitate any modification during the learning process; (2) the quick speed of learning as the GRNN does not require to be trained by using an time-consuming iterative procedure which must be carried out for ANN according to the back propagation algorithm with a huge amount of computation; (3) the solution of the problem of local minimum through the use of Gaussian radial basis kernel function as the transfer function in the pattern layer. In addition to these strengths, another noteworthy point is that the GRNN model does not need many iterations to determine its inner algorithms or parameters, except for the spread factor that may substantial influence the model generalization capability. In this study, the optimal spread factor selected from a set of alternative ones in the range of 0.01 to 1 was determined by using an iterative algorithm with fourfold cross-validation. More details about the GRNN method can be found in Specht [33] and Kisi [46].

Model Implement and Evaluation
In this study, four different machine learning methods, ANN, GRNN, ANFIS and SVM, were investigated and compared to estimate daily carbon fluxes in three forest ecosystems. In addition, the MLR model was also used in this investigation with the intention of examining the strength of machine learning approaches against traditional regression technique. The correlation coefficient values between environmental variables and each carbon flux in all the ecosystems have been given in Table 2. The results showed that all the carbon fluxes were highly correlated with T a , T s and R n , implying that these variables were the most effective parameter influencing carbon fluxes. To obtain an accurate and reasonable comparison of modeling performance among different carbon fluxes, the applied methods with different variable combinations for inputs were tried in this study. For the sake of convenience, these abbreviated models are summarized in Table 3. We used all the models for each flux in order to make the comparison of various carbon fluxes as fair as possible. Table 3. The input combinations for different methods with environmental variables including air temperature (T a , • C), net radiation (R n , mol m −2 ), relative humidity (R h , %), soil temperature (T s , • C).

Models
Input Complete 6-year environmental and carbon flux data were used at all the sites. These data were divided into three datasets. The first 4-year dataset was used for training, the fifth year dataset was used for validation for the purpose of avoiding over-fitting and assuring the ability of model generalization, and the last 1-year dataset was for testing. It is noteworthy that all the applied models were separately trained and extrapolated for validation and prediction, owing to the differences in age among the three sites. Before the training of the applied models was started, all input and output variables were normalized between 0 and 1, respectively. The computer programs of all the machine learning models and statistical analysis of estimated results were undertaken in MATLAB (version 8.2, The Mathworks, Inc., Natick, MA, USA). Coefficient of determination (R 2 ), Nash-Sutcliffe efficiency (NSE), bias error (Bias) and root mean square error (RMSE) were considered for the performance evaluation of the used models. These performances indices are defined as below: where Y o and Y m denote the observed and modeled values of daily carbon fluxes, respectively; Y o and Y m are the means of observed and modeled values, respectively; N is the number of observed values.

Modeling of Daily Gross Primary Productivity
The performance indices, including R 2 , NSE, Bias and RMSE, are used to evaluate the accuracy of the employed models in predicting the daily GPP, R and NEE in three different forest stands in the present study. Table 4 shows the estimated accuracy of the applied models in the prediction of daily GPP in the testing period at the three sites. It is clearly seen from Table 4 that the models with full environmental variables (T a , R n , R h , T s ) have the best accuracy for each method and site. Therefore, the following comparisons of the used models are mainly based on the inputs using these four variables. The developed machine learning models are consistently superior to the corresponding MLR models at the three sites. Regarding the comparisons of the machine learning models at CA-Obs site, the ANN model is superior to the other models in the testing period. However, the GRNN model generally provides the worst performance. The overall model efficiency of the used models can be ranked as follows: ANN, ANFIS, SVM, GRNN and MLR according to the R 2 , NSE, Bias and RMSE metrics. Similar to CA-Obs site, the GRNN model also gives the worst performance among the four machine learning models in the testing period at CA-Oas site, while the ANFIS model generates the best estimates. The overall ranks of the used models in the testing period can be summarized as follows: ANFIS, SVM, ANN, GRNN and MLR. Unlike CA-Oas site, the ANN model in the testing period performs the best at CA-Gro site. Additionally, the ANFIS and SVM models have the similar model efficiency and they are both slightly superior to the GRNN model.
In order to examine the efficiency of each model in predicting the GPP, R and NEE in all the three stands, the overall mean accuracy of each model for all the stands in the testing period is provided in Table 5. It is apparent from the table that the MLR model in the prediction of GPP generally has the worst accuracy in the testing period, whereas the ANN, ANFIS and SVM models yield the similar performance with the higher values of R 2 ranging from 0.922 to 0.927 and NSE from 0.901 to 0.909. Furthermore, to provide deeper insight into the responses of different forest stands to the applied models in the prediction of daily GPP, R and NEE, the mean performance of all machine learning models for each site in the testing period is summarized in Table 6. As demonstrated in the table for GPP prediction, the models yield the best precision at CA-Obs site, and give similar results at both CA-Oas and CA-Gro sites. Table 4. Comparisons of machine learning models with different input combinations in the testing period for gross primary productivity (GPP, g C m −2 day −1 ) in the three boreal forest stands. Note: Bold numbers show the best performance among the models with different input combinations for each method and stand. Table 5. Comparisons of different models in the three stands for gross primary productivity (GPP, g C m −2 day −1 ), ecosystem respiration (R, g C m −2 day −1 ) and net ecosystem exchange (NEE, g C m −2 day −1 ) in the testing period.  Table 6. Comparisons of different stands using all the five models for gross primary productivity (GPP, g C m −2 day −1 ), ecosystem respiration (R, g C m −2 day −1 ) and net ecosystem exchange (NEE, g C m −2 day −1 ) in the testing period.  2 illustrate the comparisons of daily GPP between observed and predicted using the data-driven models at the three sites in the form of scatter plot and time series graph, respectively. As clearly seen from Figure 1, the fit lines of the models (ANN, GRNN, ANFIS and SVM) during the testing period for CA-Oas site appear to be consistent with the ideal fit lines (y = x), whereas at CA-Obs site, these models have higher values of R 2 and NSE than those of the corresponding models for CA-Oas and CA-Gro stands (Tables 4 and 6). Moreover, all the five models generate more scattered estimates at both CA-Oas and CA-Gro sites in comparison with CA-Obs site.  Table 6. Comparisons of different stands using all the five models for gross primary productivity (GPP, g C m −2 day −1 ), ecosystem respiration (R, g C m −2 day −1 ) and net ecosystem exchange (NEE, g C m −2 day −1 ) in the testing period.  2 illustrate the comparisons of daily GPP between observed and predicted using the data-driven models at the three sites in the form of scatter plot and time series graph, respectively. As clearly seen from Figure 1, the fit lines of the models (ANN, GRNN, ANFIS and SVM) during the testing period for CA-Oas site appear to be consistent with the ideal fit lines ( y = x ), whereas at CA-Obs site, these models have higher values of R 2 and NSE than those of the corresponding models for CA-Oas and CA-Gro stands (Tables 4 and 6). Moreover, all the five models generate more scattered estimates at both CA-Oas and CA-Gro sites in comparison with CA-Obs site. On the other hand, as demonstrated in Figure 2, the over-and under-estimation of the best model for each site in predicting daily GPP among the training, validation and testing periods are examined. It is apparent from Figure 2 that most of the simulated values from the best ANN model in the testing period (the last 1 year) are substantially underestimated at CA-Obs site, while greatly overestimated at CA-Gro site. This is also confirmed by the aforementioned scatter plots in Figure 1.

Modeling of Daily Ecosystem Respiration
The estimated precision of the used models for predicting daily R in the testing period in the three sites are provided in Table 7. As shown from the table, the models with full input variables (Ta, Rn, Rh, Ts) perform the best for each method and site. The machine learning models with these variables consistently outperform the corresponding MLR models at both CA-Obs and CA-Oas sites, but at CA-Gro site, all the five models provide similar estimates. Specifically at CA-Obs site, the ANN model performs the best among the four machine learning models in the testing period, with respect to R 2 , NSE and RMSE. The predictive precision of the ANFIS is similar to that of the SVM. These two models give the inferior performance. As a whole, the GRNN model provides the lowest precision in the prediction of daily R. At CA-Oas site, the GRNN model is also inferior to the other three models (ANN, ANFIS and SVM) in terms of R 2 , NSE and RMSE. Besides, the ANFIS model performs the best On the other hand, as demonstrated in Figure 2, the over-and under-estimation of the best model for each site in predicting daily GPP among the training, validation and testing periods are examined. It is apparent from Figure 2 that most of the simulated values from the best ANN model in the testing period (the last 1 year) are substantially underestimated at CA-Obs site, while greatly overestimated at CA-Gro site. This is also confirmed by the aforementioned scatter plots in Figure 1.

Modeling of Daily Ecosystem Respiration
The estimated precision of the used models for predicting daily R in the testing period in the three sites are provided in Table 7. As shown from the table, the models with full input variables (T a , R n , R h , T s ) perform the best for each method and site. The machine learning models with these variables consistently outperform the corresponding MLR models at both CA-Obs and CA-Oas sites, but at CA-Gro site, all the five models provide similar estimates. Specifically at CA-Obs site, the ANN model performs the best among the four machine learning models in the testing period, with respect to R 2 , NSE and RMSE. The predictive precision of the ANFIS is similar to that of the SVM. These two models give the inferior performance. As a whole, the GRNN model provides the lowest precision in the prediction of daily R. At CA-Oas site, the GRNN model is also inferior to the other three models (ANN, ANFIS and SVM) in terms of R 2 , NSE and RMSE. Besides, the ANFIS model performs the best among the five models. In conclusion, the overall model rankings of the applied models over the testing period can be ranked as follows: ANFIS, ANN, SVM, GRNN and MLR. In contrast to CA-Obs and CA-Oas sites as mentioned before, the GRNN model yields the best performance among the five models at CA-Gro site according to the four performance indices. In addition, the ANFIS performs the second best. The ANN and SVM models give similar accuracy. Table 7. Comparisons of machine learning models with different input combinations in the testing period for ecosystem respiration (R, g C m −2 day −1 ) in the three boreal forest stands. On the other hand, it is clearly seen from Table 5 that the GRNN model in the three stands for forecasting daily R gives the best accuracy in the testing period, in terms of NSE, Bias and RMSE. However, the MLR model gives the worst accuracy. The ANN, ANFIS and SVM models achieve similar performance, according to the four indices. As shown in Table 6, no apparent performance indicator is found among the three stands in the prediction of daily R using the five data-driven models. Nevertheless, it should be noted that at CA-Gro site, the models give the best accuracy in terms of R 2 and NSE indices. Figure 3 presents the comparisons of daily R between observed and predicted using the five models at the three sites in the form of scatter plot. As shown in Figure 3, although the regression fit lines between measured and estimated R by the ANN, GRNN, ANFIS and SVM models at CA-Oas stand are closer to the 1:1 lines, higher mean values of R 2 = 0.937 and NSE = 0.925 are obtained by the models at CA-Obs site (Table 6). Besides, these models at CA-Obs site display less scattered estimates than those at CA-Oas and CA-Gro sites. It is evident from Figure 4 that the distribution of predicted values versus observed values for daily R in the testing period varies among different sites. Similar to the time series comparison of daily GPP in Figure 2, the under-and over-estimation of the best model (especially in the peaks) in forecasting daily R in the testing period are clearly seen at CA-Obs and CA-Gro sites, respectively, as shown in Figure 4. This is consistent with the scatter plots in Figure 3 as mentioned before. It is evident from Figure 4 that the distribution of predicted values versus observed values for daily R in the testing period varies among different sites. Similar to the time series comparison of daily GPP in Figure 2, the under-and over-estimation of the best model (especially in the peaks) in forecasting daily R in the testing period are clearly seen at CA-Obs and CA-Gro sites, respectively, as shown in Figure 4. This is consistent with the scatter plots in Figure 3 as mentioned before.  Table 8 summarizes the accuracy of the applied models for forecasting daily NEE in the testing period. It can be clearly seen from the table that the models with full input variables (Ta, Rn, Rh, Ts) generate the best accuracy for each method and site. The four machine learning models with these variables consistently perform better than the corresponding MLR models at the three sites. At CA-Obs site, the ANN model generally gives the best precision among the five models. The SVM, GRNN and ANFIS models yield similar accuracy in terms of R 2 , NSE and RMSE. Unlike CA-Obs site, the ANFIS model at CA-Oas site is superior to the other models in terms of R 2 , NSE and RMSE. In addition, the ANN and SVM models have similar accuracy in predicting the daily NEE and they both outperform the GRNN model. At CA-Gro site, the modeling performance among the ANN, GRNN and ANFIS models is nearly consistent in the testing period. These three models are superior to the SVM and MLR models. As shown in Table 5, regarding the prediction of daily NEE, the ANN, ANFIS and SVM models achieve similar mean accuracy for the three sites. However, the MLR model gives  Table 8 summarizes the accuracy of the applied models for forecasting daily NEE in the testing period. It can be clearly seen from the table that the models with full input variables (T a , R n , R h , T s ) generate the best accuracy for each method and site. The four machine learning models with these variables consistently perform better than the corresponding MLR models at the three sites. At CA-Obs site, the ANN model generally gives the best precision among the five models. The SVM, GRNN and ANFIS models yield similar accuracy in terms of R 2 , NSE and RMSE. Unlike CA-Obs site, the ANFIS model at CA-Oas site is superior to the other models in terms of R 2 , NSE and RMSE. In addition, the ANN and SVM models have similar accuracy in predicting the daily NEE and they both outperform the GRNN model. At CA-Gro site, the modeling performance among the ANN, GRNN and ANFIS models is nearly consistent in the testing period. These three models are superior to the SVM and MLR models. As shown in Table 5, regarding the prediction of daily NEE, the ANN, ANFIS and SVM models achieve similar mean accuracy for the three sites. However, the MLR model gives the worst accuracy. Furthermore, it can be seen from Table 6 that the models for daily NEE prediction provide similar performance for CA-Obs and CA-Oas sites according to R 2 and NSE criteria. Generally, the applied models for these two sites perform better than those for CA-Gro site. The measured and predicted values of daily NEE using the five models for the three stands are compared in Figures 5 and 6. It is apparent from Figure 5 that the fit lines of the applied models during the testing period at CA-Obs and CA-Oas sites are closer to the ideal fit lines (y = x) than those obtained at CA-Gro site. Meanwhile, the mean values of R 2 and NSE generated by the five models at the former two sites are substantially higher than those at the latter site (Table 6). Additionally, all the models at the CA-Gro site have more scattered estimates in comparison with the other two sites. On the other hand, the over-and under-estimation of the best model in predicting the daily NEE for the three sites among the training, validation and testing periods are illustrated in Figure 6. It is clear that most of the estimated values closely follow the corresponding observed values in the three stands. In the growing season, however, the NEE values simulated by the GRNN model at CA-Gro site fail to well match the corresponding measured values. On the other hand, the over-and under-estimation of the best model in predicting the daily NEE for the three sites among the training, validation and testing periods are illustrated in Figure 6. It is clear that most of the estimated values closely follow the corresponding observed values in the three stands. In the growing season, however, the NEE values simulated by the GRNN model at CA-Gro site fail to well match the corresponding measured values.

Modeling of Hourly Carbon Fluxes
To examine the modeling ability of machine learning methods at a finer time scale, we further use these approaches to estimate hourly carbon fluxes at the three sites. The results estimated by different models with four input variables in the testing period are shown in Table 9. It should be noted that the gap-filled hourly carbon flux data were not used for these estimates. The number of observed NEE data available in the testing period in Table 9 is 5865 (2009), 6233 (2008) and 6099 (2010) for CA-Obs, CA-Oas and CA-Gro, respectively. As a whole, the machine learning models accurately quantify hourly carbon fluxes at the three sites. They consistently and substantially outperform the MLR model for GPP and NEE estimates in terms of R 2 , NSE, Bias and RMSE, while for R estimates, the MLR model achieves satisfactory results comparable to those of machine learning models. Specifically, for the comparison among the three sites in the prediction of hourly GPP, the developed machine learning models offered the highest accuracy at CA-Obs site, followed by CA-Oas site. Regarding the estimates of hourly R among the three sites, according to the R 2 indictor, the applied models perform the best at CA-Gro site, whereas they provide the worst statistical results with respect to NSE, Bias and RMSE indices. On the whole, these models generate the best R estimates in

Modeling of Hourly Carbon Fluxes
To examine the modeling ability of machine learning methods at a finer time scale, we further use these approaches to estimate hourly carbon fluxes at the three sites. The results estimated by different models with four input variables in the testing period are shown in Table 9. It should be noted that the gap-filled hourly carbon flux data were not used for these estimates. The number of observed NEE data available in the testing period in Table 9 is 5865 (2009), 6233 (2008) and 6099 (2010) for CA-Obs, CA-Oas and CA-Gro, respectively. As a whole, the machine learning models accurately quantify hourly carbon fluxes at the three sites. They consistently and substantially outperform the MLR model for GPP and NEE estimates in terms of R 2 , NSE, Bias and RMSE, while for R estimates, the MLR model achieves satisfactory results comparable to those of machine learning models. Specifically, for the comparison among the three sites in the prediction of hourly GPP, the developed machine learning models offered the highest accuracy at CA-Obs site, followed by CA-Oas site. Regarding the estimates of hourly R among the three sites, according to the R 2 indictor, the applied models perform the best at CA-Gro site, whereas they provide the worst statistical results with respect to NSE, Bias and RMSE indices. On the whole, these models generate the best R estimates in terms of NSE, Bias and RMSE. For NEE estimates, the machine learning models give similar results at both CA-Obs and CA-Oas sites, and slightly perform better than the models at CA-Gro site. Table 9. Comparisons of different models with four input variables for predicting hourly gross primary productivity (GPP, g C m −2 hour −1 ), ecosystem respiration (R, g C m −2 hour −1 ) and net ecosystem exchange (NEE, g C m −2 hour −1 ) in the testing period in the three boreal forest stands. As commonly known in the eddy covariance community, the requirement for turbulent atmospheric conditions limits the quality of data observed by the eddy covariance technique. Half-hourly NEE data are directly measured and may be rejected due to low turbulence conditions. The majority of missing NEE data occur during nighttime. Therefore, it is essential to provide some insights into the modeling ability of our proposed models for nighttime and daytime NEE data. In this study, the identification of nighttime and daytime was according to the values of photosynthetic photon flux density (PPFD). A positive PPFD implied daytime, while nighttime was defined by the periods of the day without light (the values of PPFD were equal to zero). The comparisons of different models with four input variables for predicting hourly NEE in the testing period for daytime and nighttime datasets at the three sites are summarized in Table 10. As can be seen from the Tables 9 and 10, the NEE estimates of each model for the daytime data at both CA-Obs and CA-Oas sites are similar to those for all the data (daytime + nighttime), and are considerably better than those for the nighttime data. In contrast, at CA-Gro site, all the models (except MLR) for the daytime and nighttime data give similar estimates according to the R 2 performance index (Table 10), while in terms of the NSE index, the estimates from all the models for the daytime data are consistently superior to those for the nighttime data.
As a case study, Figure 7 illustrates hourly NEE between eddy covariance measured and predicted by the ANFIS model in the testing period. The daily courses of hourly measured and predicted NEE flux for 7 days during the mid-growing season at the three sites are displayed in Figure 7a-c. It can be clearly seen that most of the modeled NEE values at the three sites well match the corresponding observed ones. In comparison with the estimates at both CA-Oas and CA-Gro sites, the simulated NEE values at CA-Obs site more closely match the corresponding measured ones. The scatter plots of ANFIS modeled and measured hourly NEE flux for both nighttime and daytime datasets at the three sites are presented in Figure 7d-f. As shown in the figures, considerable difference between nighttime and daytime datasets can be found in the magnitude and variation of hourly NEE for an entire testing year, which are largely reduced during nighttime primarily due to the absence of photosynthesis. For the daytime datasets among the three sites, the fit line of the ANFIS model at CA-Oas site is closest to the ideal straight line of slope one, while this model have the highest value of R 2 at CA-Obs site. However, for the nighttime datasets among the three sites, the best estimates occur at CA-Gro site in terms of both the slope of the fit line and R 2 value.

Discussion
For the first time, the suitability and potential of two advanced machine learning techniques, namely ANFIS and GRNN, were investigated in the prediction of terrestrial carbon fluxes using the EC-measured data in three forest ecosystems. Besides, another three traditional approaches (ANN, SVM and MLR) were also used as benchmarks in order to evaluate the modeling capability of all the applied models in the present study. In the following subsections, we primarily focused on discussing the estimates of carbon fluxes among the machine learning models, according to the different performance evaluation indices. In addition, we also provided the advantages and limitations of the present research, as well as the possible improvements in the future work.

Capability of Machine Learning Models and Their Comparison
Our results demonstrated that the applied machine learning models with full input variables (T a , R n , R h , T s ) did an excellent job of explaining approximately average 92%, 94% and 78% of diurnal variances in GPP, R and NEE for the three stands, respectively (Table 5). These models also exhibited great capability in mapping the hourly variation of different carbon fluxes in response to environmental factors. According to the estimates of each method with different input combinations (Tables 4, 7 and 8), it is clear that the relative importance of driving variables is different with respect to each carbon flux. More specifically, the contributions of T a and T s in estimating daily GPP are similar and more than those of R n and R h at the three sites. T s is the most predominant factor controlling the variability of R flux at each site, followed by T a and then R n . However, considerable difference among the three forest sites was found in the contributions of these driving variables to NEE. At CA-Obs site, T a and R n provided similar contributions and played more important role than T s and R h ; At CA-Oas site, T s offered the largest contribution to NEE, followed by T a and then R n ; The largest contribution to NEE at CA-Gro site was from R n , followed by T s and then T a . Overall, the contribution of R h to each carbon flux at the three sites was relatively lower, compared with the other three variables.
In addition, these advanced machine learning models consistently performed better than the traditional MLR model for each carbon flux prediction. The predictive performance of these techniques at various terrestrial ecosystems was also confirmed by previous studies, mainly involving ANN and SVM approaches [12,21,47]. Additionally, it is worth mentioning here that both ANN and SVM machine learning techniques in recent years have been extensively accepted as preferred tools to upscale the terrestrial carbon fluxes from site to regional scale [13,20,48,49]. More importantly, previous studies have pointed out that, based on the data derived from the remote sensing and EC techniques, data-driven models can perform better than land surface models for predicting the terrestrial carbon fluxes at regional and global scales [50,51].
Furthermore, regarding all the examined machine learning models, our results revealed that these models can consistently estimate hourly and daily carbon fluxes at the three sites (Tables 5  and 9). Specifically, conventional ANN method remains adequately competent in reproducing the carbon fluxes due to its strong generalization ability. For this reason, recently, it has been continuously utilized in the estimation of carbon fluxes [22,52,53]. In addition, ANFIS, as a state of the art method, is designed by combining the strength of FIS and adaptive neural networks for the purpose of identifying the beneficial knowledge within the trained networks in the form of fuzzy logic expression. Therefore, ANFIS seems to be able to overcome the weakness of knowledge interpretation which is still a great challenge for neural networks. In theory at least, ANFIS could perform better than ANN method and this assumption has been proved by many previous studies in practical terms [54,55]. However, in the present work, our results showed that ANFIS and ANN produced the similar estimates, which concurs with the results of both Karimi et al. [56] and Piri et al. [57] for the hydrological time series forecasting. A potential reason for the difference can be accounted for by the fact that the generalization capability of ANFIS is hugely influenced by FIS with various identification algorithms [45,58], indicating that other identification algorithms for generating a FIS, such as fuzzy c-means, mountain, and grid partitioning, may enhance the modeling performance of ANFIS method for the current research.
Moreover, in order to improve the predictive ability of data-driven techniques, packaging heuristic algorithms into data-driven models with the intention of optimizing the model structures and/or parameters is broadly adopted as a preferred solution. A large number of novel nature-inspired optimization algorithms (e.g., particle swarm optimization [59], artificial bee colony [60] and cuckoo search optimization [61]) have been developed over the last two decades [62]. It should be particularly stressed that the models developed in this study have been successfully optimized by various heuristic algorithms in diverse research fields (e.g., hydrological prediction, wind speed and solar radiation forecasting). For instance, Baghban et al. [63] used genetic algorithm to tune, optimize, and determine the respective key parameters of ANFIS and SVM methods for predicting the dew point temperature of moist air and obtained excellent results; Moosavi et al. [64] utilized particle swarm optimization algorithm in conjunction with ANFIS and SVM approaches to estimate the surface soil moisture and respectively improved the predictive performance of original ANFIS and SVM models without using optimization algorithm. Consequently, to improve the modeling accuracy of carbon fluxes in the present work, further studies can be devoted to optimizing the machine learning models by the aid of different meta-heuristic algorithms.

Advantages and Limitations of Present Research and Future Work
We modelled the carbon fluxes at three different forest ecosystems using four different modeling techniques. The ability of these techniques has been demonstrated in terms of dealing with the nonlinear processes that control the daily carbon fluxes. In general, the best predictive results for the three carbon fluxes were provided at the evergreen needle-leaf forest (CA-Obs), followed by the deciduous broadleaf forest (CA-Oas), whereas the worst performance was obtained at the mixed forest (CA-Gro), maybe due to the difficulty for our proposed models in seizing the useful knowledge resulting from the influences of more complex and instable underlying surface (Table 6). Moreover, this performance difference among the three sites may be due to the climate extreme events. For instance, at CA-Gro site, compared with the 4-year means over the training period, the annual average values of R n , T s and T a and in the testing period were higher by about 0.2 mol m −2 , 0.2 • C and 1.6 • C, respectively. These three variables have been found to be the most important driving factors dominating the variation of carbon fluxes (Tables 4, 7 and 8). On the other hand, for each estimated flux at all the three forest ecosystems, our applied models generally produced satisfactory results (Tables 5 and 6), especially for GPP and R fluxes. In addition, more remarkable difference among the three fluxes was that the predictive results of both GPP and R were substantially superior to those of NEE, which is in accordance with the previous studies [22,65]. As a matter of fact, NEE, as a balance between GPP and R fluxes, is controlled by the interplay of assimilation and respiration processes between the biosphere and the atmosphere, and this interplay highly influences the diurnal, seasonal and inter-annual variations in NEE involving amplitude and phase [66]. Consequently, accurately estimating NEE remains a great challenge for the researchers to date.
Furthermore, it is a generally acknowledged fact that pursuing excessively the convenience and simplification of the used models may cause the under-fitting for the trained models. In our study, a possible reason for the lower explanation of NEE from the applied models seems to be the missing of some key driving variables for model design, such as biomass pools and management activities. Therefore, in the follow-up work, the sensitivity analysis of a variety of variables in relation to carbon fluxes (particularly for NEE), as well as the addition of effective driving variables as model inputs should be further investigated to improve the predictive performance of our proposed models for the carbon flux forecasting. On the other hand, NEE measurements have been extensively used for estimating carbon budgets at different spatial and temporal scales. Therefore, data gaps in NEE measured by EC technique must be precisely filled [19,67]. However, no standard gap-filling approach for CO 2 flux time series data with the EC method has been widely accepted [15,68]. Consequently, it is of importance to identify optimal filling approaches for accurately determining carbon budgets.
According to the present estimates, our proposed methods could be used as useful alternatives to address the problem of gap-filling in EC measurements in the future.
In practical terms, our study presented the performance difference among the utilized models and also found that the instability of the data-driven models in terms of predictive accuracy for each case indeed existed (Tables 5, 9 and 10). To obtain more stable model as well as better predictive accuracy, a novel technique, namely ensemble learning, has gained extensive attention in recent years [69,70]. The purpose of ensemble learning is to combine the positive properties of a diverse set of learners generated from either the same or different machine learning methods. Apart from the applied methods, the performance of ensemble learning generally varies depending on both the generation of input data from the whole sample and the integration of various learners for the final output. At present, many studies focus on the ensemble generation approaches for inputs of different learners, mainly including bagging [71], boosting [72], and random forests [73]. These algorithms can be combined with machine learning methods for producing more accurate solution than an individual method, which has been proved by previous studies both in theory and practice [74,75]. For example, for the water quality forecasting, Barzegar et al. [76] reported that a boosting algorithm in conjunction with wavelet-based ELM and wavelet-based ANFIS models performed better than their respective single models without using the ensemble technique, respectively. Keenan et al. [77] found the variability in predictive capability among nine different machine learning approaches for estimating the species distributions in three forest stands and obtained a better predictive results through multiple model ensemble technique. To our best knowledge, the suitability and ability of ensemble learning techniques have never been explored for carbon flux forecasting research. Therefore, it may be effective to enhance the performance of our applied models in terms of stability and predictability for the present study. However, it is beyond the scope of the current investigation and will be undertaken in our follow-up work.

Conclusions
This study focused on investigating the feasibility and potential of both the GRNN and ANFIS models for modeling and forecasting daily carbon fluxes in three different forest ecosystems. Moreover, their predictive accuracy was compared with traditional ANN, SVM and MLR models. All the models were evaluated according to several performance indicators (R 2 , NSE, Bias and RMSE). It has been found that all the machine learning models proposed in this study, including ANN, GRNN, ANFIS and SVM, were capable of accounting for the most variance in each carbon flux at both daily and hourly time scales in the three stands, especially for both GPP and R. These modern models consistently performed better than conventional MLR model for both daily and hourly carbon flux estimates in terms of R 2 , NSE, Bias and RMSE. Therefore, these advanced models were able to elucidate precisely non-linear processes of controlling the exchanges of the carbon fluxes between land and the atmosphere at the ecosystem level. Moreover, among all the applied models, the ANFIS and ANN models provided similar estimates of carbon fluxes and slightly outperformed the GRNN and SVM models. In practical terms, however, the ANFIS and GRNN models are more robust and flexible, and have less parameters needed for selection and optimization, when compared with two common ANN and SVM models. Accordingly, both the ANFIS and GRNN models are valuable tools for estimating forest carbon fluxes and interpolating the missing carbon flux data during the long-term EC measurements.
Considering the findings above, our study has conclusively demonstrated for the first time that both the ANFIS and GRNN models can effectively reproduce carbon fluxes at the ecosystem level. Consequently, when upscaling the carbon and water fluxes from ecosystem to regional or global scale, these promising tools should be taken into consideration, which is comparatively important for the scientific community to determine the global carbon and water budgets and offer reliable information for policy makers responding to global climate change.