Estimating Ecosystem Respiration in the Grasslands of Northern China Using Machine Learning: Model Evaluation and Comparison

: While a number of machine learning (ML) models have been used to estimate RE, systematic evaluation and comparison of these models are still limited. In this study, we developed three traditional ML models and a deep learning (DL) model, stacked autoencoders (SAE), to estimate RE in northern China’s grasslands. The four models were trained with two strategies: training for all of northern China’s grasslands and separate training for the alpine and temperate grasslands. Our results showed that all four ML models estimated RE in northern China’s grasslands fairly well, while the SAE model performed best ( R 2 = 0.858, RMSE = 0.472 gC m − 2 d − 1 , MAE = 0.304 gC m − 2 d − 1 ). Models trained with the two strategies had almost identical performances. The enhanced vegetation index and soil organic carbon density (SOCD) were the two most important environmental variables for estimating RE in the grasslands of northern China. Air temperature (Ta) was more important than the growing season land surface water index (LSWI) in the alpine grasslands, while the LSWI was more important than Ta in the temperate grasslands. These ﬁndings may promote the application of DL models and the inclusion of SOCD for RE estimates with increased accuracy.


Introduction
Ecosystem respiration (RE) is a major flux in the global carbon cycle. Small changes in RE can have a significant impact on the atmospheric CO 2 concentration and thus be a potentially positive feedback mechanism to the warming climate [1,2]. However, RE varies greatly at temporal and spatial scales due to the complex interactions among chemical, physical, and biological processes [3]. Numerous studies at regional and global scales indicate that RE estimates remain rather uncertain [4,5]. Therefore, it is necessary but challenging to accurately estimate RE at regional and global scales to quantitatively assess the terrestrial carbon budget and its response to global changes.
RE is often quantified using process-based models (e.g., the Lund-Potsdam-Jena model [6]), semi-empirical models (e.g., RECO model [5]), and empirical models (e.g., Arrhenius model [7]). In recent years, empirical models based on machine learning (ML) algorithms have become widely used because they are driven by observational data without complex assumptions and a large number of parameters [8,9]. In general, there are three well-known and commonly applied ML models for RE estimation: the back propagation artificial neural network (BP-ANN) model, the support vector regression (SVR) model, and the random forests (RF) model. For example, Zhao et al. [10] developed a BP-ANN model to predict global soil respiration (Rs, which is a major component of RE) from 1960 to 2012 using field records reported in the scientific literature. Ueyama et al. [9] estimated RE in Alaska by using flux observations and remote sensing data with the SVR model. Jian et al. [11] obtained global Rs using different timescales of Rs and climate data with the RF model. Although these ML models have been successful in estimating RE at different temporal and spatial scales, some uncertainties still exist. For instance, these ML models are usually constructed based on different learning principles; however, few attempts have been made to compare the predictive performance of these models in estimating RE [12]. Some advanced ML models, such as deep learning (DL) models, have not yet been tested for estimating RE. Moreover, RE under different environmental conditions may have different responses to climate change [13,14], and whether separately training ML models for each ecosystem type can improve performance is unclear. In addition, few studies have adequately explored the effects of different environmental variables on the performance of ML models in estimating RE. Thus, systematic evaluation and comparison of the performance of these ML models in estimating RE are essential.
In this study, with the integration of flux, meteorological, remote sensing, and soil map data, we developed four ML models for estimating RE in northern China's grasslands. The four models include the three commonly used traditional ML models (BP-ANN, SVR, and RF) and a DL model named the stacked autoencoders (SAE) model. The main objectives of this study are (1) to compare the performance of the four ML models in estimating RE in the grasslands of northern China; (2) to compare the performance of the ML models trained for all of northern China's grasslands and separately trained for each ecosystem type; and (3) to analyze the effects of different environmental variables on the performance of the ML models in estimating RE.

Study Area
Northern China's grasslands are mainly composed of alpine and temperate grasslands, which are widely distributed in the Tibetan Plateau (TP) and Inner Mongolian Plateau (IM), respectively ( Figure 1). The TP and IM have distinct environmental features [13]. The TP is in the alpine climate zone, and the average elevation of TP is over 4000 m. The range of mean annual precipitation is from 200 to 600 mm, and the range of mean annual temperature is from −5.75 to 2.57 °C. The alpine grasslands represent a typical ecosystem in the central Asia alpine environment and cover more than 60% of the plateau [15]. The IM is under arid and semi-arid conditions, with an average elevation of approximately 1000 m. The range of mean annual precipitation is from 200 to 350 mm, and the range of mean annual temperature is from 3 to 6 °C. The temperate grasslands growing on the IM are an important component of the Eurasian grasslands, and represent a typical vegetation type under the temperate continental climate [16].
Fully considering the spatial representativeness of the flux sites in this study, we focused on three grassland types: (a) alpine meadow, (b) alpine meadow steppe, and (c) temperate steppe. According to the Atlas of Grasslands Resources of China (1:1,000,000) [17], the alpine meadow was further reclassified into three types: alpine swamp meadow, alpine shrub meadow, and alpine Kobresia meadow. The temperate steppe was also reclassified into three types: desert steppe, typical steppe, and meadow steppe. The distribution of these grassland types, as well as the typical grassland flux sites, is exhibited in Figure 1.

Flux and Meteorological Observations
In this study, we used 52 site-year eddy covariance (EC) flux and meteorological observations from 18 grassland sites over 2003 to 2014 in northern China's grasslands, including 34 site-years from 9 sites on the TP and 18 site-years from 9 sites on the IM ( Figure 1 and Table 1). These data were collected from the ChinaFLUX [18], the Coordinated Observations and Integrated Research over Arid and Semi-arid China (COIRAS) [19], and the Heihe Watershed Allied Telemetry Experimental Research (HiWATER) [20]. The flux and meteorological data were processed by the ChinaFLUX CO2 data processing system [21], including triple coordinate rotation, the Webb-Pearman-Leuning (WPL) correction [22], abnormal data rejection [23], and nighttime data filtering with the friction velocity Fully considering the spatial representativeness of the flux sites in this study, we focused on three grassland types: (a) alpine meadow, (b) alpine meadow steppe, and (c) temperate steppe. According to the Atlas of Grasslands Resources of China (1:1,000,000) [17], the alpine meadow was further reclassified into three types: alpine swamp meadow, alpine shrub meadow, and alpine Kobresia meadow. The temperate steppe was also reclassified into three types: desert steppe, typical steppe, and meadow steppe. The distribution of these grassland types, as well as the typical grassland flux sites, is exhibited in Figure 1.

Flux and Meteorological Observations
In this study, we used 52 site-year eddy covariance (EC) flux and meteorological observations from 18 grassland sites over 2003 to 2014 in northern China's grasslands, including 34 site-years from 9 sites on the TP and 18 site-years from 9 sites on the IM ( Figure 1 and Table 1). These data were collected from the ChinaFLUX [18], the Coordinated Observations and Integrated Research over Arid and Semi-arid China (COIRAS) [19], and the Heihe Watershed Allied Telemetry Experimental Research (HiWATER) [20]. The flux and meteorological data were processed by the ChinaFLUX CO 2 data processing system [21], including triple coordinate rotation, the Webb-Pearman-Leuning (WPL) correction [22], abnormal data rejection [23], and nighttime data filtering with the friction velocity (µ*) threshold obtained from the algorithm provided by Reichstein et al. [24]. The missing nighttime RE and daytime RE data were calculated with the Lloyed-Taylor equation based on nighttime net ecosystem exchange observations [24]. The air temperature (Ta) and photosynthetically active radiation (PAR) data were gap-filled following the methods provided by Schwalm et al. [25]. The half-hourly RE, Ta, and PAR data were integrated into daily values, and the site-years with more than 30% of the daily RE missing were eliminated. To match the 8-day compositing intervals of the moderate resolution imaging spectroradiometer (MODIS) data, the processed daily RE, Ta, and PAR data were averaged within the same periods. The following MODIS products (Collection 6) during 2003 to 2014 were used in this study: normalized difference vegetation index (NDVI), enhanced vegetation index (EVI) (MOD13A2) [26], and surface reflectance (MOD09A1) [27]. These data were obtained from the Land Processes Distributed Active Archive Center. NDVI and EVI are at 16-day temporal resolution and 1000 m spatial resolution, while surface reflectance is at 8-day temporal resolution and 500 m spatial resolution. The obtained data were further processed for quality control and gap-filling [28,29]. Each 16-day NDVI/EVI composite was used for two 8-day intervals corresponding to the composting interval of other MODIS products [30]. The surface reflectance data were further processed to generate the land surface water index (LSWI) data [31]. We used MODIS subsets of 3 × 3 km pixels centered on each flux tower to better represent the EC footprint area and to reduce the effect of geolocation errors [30].

SOCD Data
The Harmonized World Soil Database (HWSD) data were used to derive soil organic carbon density (SOCD) data of the surface (0-30 cm) soil layer [32]. SOCD (kgC m −2 ) was estimated from the organic carbon content (wt %), gravel content (vol %), layer thickness (m), and bulk density (kg . Further details about the estimation are provided by Carvalhais et al. [33]. Then, we extracted the SOCD of each site from the HWSD-derived SOCD data [34].

Environmental Variables
A variety of environmental factors influencing RE, including temperature [35,36], moisture [37,38], plant productivity [3,39] and substrate availability [1,40]. Grassland type represents the spatial distribution of different grassland community types, which have different physiological characteristics and responses to climate change [41,42]. The TP and IM have distinct elevational gradients, which influence the vertical zonality of grasslands [13,43]. Variations in Ta have significant effects on the intensity of vegetation and microbial activities, and many studies have indicated that Ta strongly regulates RE [1,3,13]. The growing season LSWI is strongly correlated with leaf water content and soil moisture in grasslands [37,44,45]. PAR is crucial for photosynthesis [46], thus influencing the photosynthetic supply of RE. The NDVI is closely correlated to fractional vegetation cover and vegetation biomass [45,47]. We also incorporated the EVI because, in comparison to the NDVI, it is more responsive to canopy structural variations, such as plant physiognomy, canopy type, and canopy architecture [48]. SOCD represents the quantity of carbon input of Rs [40]. Taken together, we selected grassland type, elevation, Ta, LSWI, PAR, NDVI, EVI, and SOCD as the environmental variables to account for the variation in RE in northern China's grasslands.

Machine Learning Algorithms
ML algorithms focus on how to automatically improve their performance through experience [49]. In regression problems, ML algorithms try to automatically learn the dependencies between input and target variables from historical observations and make predictions. These algorithms mainly differ in structure and learning principle [8,50]. In this study, we developed models to estimate RE using four ML algorithms. The four algorithms include three traditional ML algorithms and a DL algorithm, which represented four broad families in ML [8]. The four traditional algorithms are the back propagation artificial neural network (BP-ANN), the support vector regression (SVR), and the random forests (RF), respectively. The DL algorithm is the stacked autoencoders (SAE). The four algorithms were implemented using the "scikit-learn" and "keras" packages in Python [51]. A brief description of the characteristics of each algorithm follows.
BP-ANN is one of the most widely used artificial neural networks and simulates the behavior of biological nervous networks. BP-ANN is composed of input, hidden, and output layers. Each layer contains many artificial neurons. An artificial neuron can be seen as a linear function with a non-linear function connected with its output. The input layer contains the input variables to perform the prediction, and the output layer contains the target variable. The input and output layers are connected by a hidden layer. Weights in the three layers represent the linkages between the input and target variables. In the training of BP-ANN, the weights are automatically adjusted along a negative gradient to minimize the error between the predicted and observed target variables using the back propagation algorithm [10]. BP-ANN has self-organizing, self-adaptive, and self-learning abilities, which can represent the complex relationship between the input and target variables [50].
SVR is an algorithm based on the kernel method [8]. Generally, the original regression problem is represented in a low dimensional space that is nonlinear [50]. SVR can transform the nonlinear regression into linear regression by projecting the input data of the original low-dimensional space into a higher-dimensional space with a kernel function. Thus, a global optimal solution is obtained by solving the convex quadratic programming problem [52]. The radial basis kernel function is usually used as the kernel function [53]. The training of SVR can always converge to the global optimal solution [9].
RF is a tree-based algorithm that combines decision tree and ensemble methods [54]. For regression problems, the basic unit in the RF is the regression tree, which is the decision tree with a continuous target variable. Each independent regression tree is developed using input samples selected by the bootstrap sampling method, and at each node, a random subset of input variables is selected [55]. The prediction results of all trees are averaged to make final predictions. RF is able to handle high-dimensional data and avoid overfitting in practice [54].
SAE has been one of the most typical and widely used DL algorithms over the past few years [56,57]. A single autoencoder (AE) is a neural network that attempts to reconstruct its inputs [58]. A basic AE has three neural layers, the first layer and the third layer are the input layer and the output layer, respectively, and the second layer is the hidden layer. The number of artificial neurons in the hidden layer is smaller than the input layer and the output layer; thus the AE can learn the main features that form a good representation of its input. An SAE is a deep neural network consisting of multiple layers of AEs in which the outputs of each layer are wired to the inputs of the successive layer [59,60]. For the purpose of prediction, a regressor is often added on the top layer. The AEs are pre-trained one by one in an unsupervised layer-wise manner, and the trained weights are used to initialize the SAE. Then, the SAE is fine-tuned to achieve a better convergence using the back propagation algorithm [61]. SAE can capture high-level features from input data for robust prediction [57].

Model Training and Evaluation
The performances of the four ML models (BP-ANN, SVR, RF, and SAE) in estimating RE were evaluated using a sample-based 10-fold cross-validation strategy [62], in which all of the RE observations were averagely split into 10 folds randomly. In each training, one fold was used as the validation data, and the remaining nine folds were used as training data. This process was repeated 10 times. Parameters in the four models were optimized and determined using the grid-search method [63]. The predicted RE from all 10 folds were compared with the observed RE. In addition, we trained the ML models with two strategies: training for all of northern China's grasslands and separate training for the alpine and temperate grasslands.
The coefficient of determination (R 2 ), root mean squared error (RMSE), and mean absolute error (MAE) were selected as evaluation metrics. These metrics were calculated as follows: where O i is the ith observed value of RE, P i is the ith predicted value of RE, O and P are the means of the observed RE and predicted RE, respectively; and n is the number of RE observation samples.

Variable Relative Importance Evaluation
To analyze the effects of different environmental variables on the performance of the ML models in estimating RE, we evaluated the relative importance of each variable by sequentially removing one of the environmental variables and repeating the cross-validation process [53,64]. The relative importance of each variable was quantified by the variation in the RMSE in percentage form. Since NDVI and EVI are highly consistent in representing vegetation cover and growth [26], we first compared the predictive performance of NDVI and EVI and then selected one of them for subsequent analysis. Moreover, we also tested the predictive performance of combining the NDVI and EVI. As the alpine grasslands and the temperate grasslands exist under different environmental conditions [13,37], we evaluated the relative importance of the environmental variables in the alpine grasslands and in the temperate grasslands.

Model Performance
We first trained the BP-ANN, SVR, RF, and SAE models for all of northern China's grasslands. The predicted RE from the four models agreed well with the observed RE and occurred at approximately the 1:1 line, indicating that all four ML models estimated RE in the grasslands of northern China fairly well (Figure 2a). However, there were still appreciable differences in the performances among the four models. The SAE model produced the lowest RMSE (0.472 gC m −2 d −1 ) and MAE (0.304 gC m −2 d −1 ) and the highest R 2 (0.858), demonstrating that of the four models, SAE performed best in estimating RE. The SVR prediction performed the second-best with higher RMSE (0.492 gC m −2 d −1 ) and MAE (0.325 gC m −2 d −1 ) and lower R 2 (0.846) than those of SAE. RF had slightly higher RMSE (0.500 gC m −2 d −1 ) and lower R 2 (0.841) than those of SVR, but had slightly lower MAE (0.323 gC m −2 d −1 ) than that of SVR. BP-ANN produced the highest RMSE (0.508 gC m −2 d −1 ) and MAE (0.342 gC m −2 d −1 ) and the lowest R 2 (0.836). We then separately trained the ML models for the alpine and temperate grasslands. The two strategies had similar R 2 , RMSE, and MAE values in estimating RE for the alpine grasslands, temperate grasslands, and all of northern China's grasslands ( Figure 2). This result indicated that the difference between the performance of the two strategies was minimal. For example, in the alpine grasslands, the overall R 2 of the two strategies ranged from 0.885 to 0.903. In the temperate grasslands, the overall R 2 of the two strategies ranged from 0.656 to 0.718. In all of northern China's grasslands, the overall R 2 of the two strategies ranged from 0.836 to 0.858.
The results also showed that the performance of the four ML models varied with ecosystem type. More specifically, the four models performed better in estimating RE in the alpine grasslands than in the temperate grasslands in general (Figure 2). For the alpine grasslands, the four models with the two strategies produced RMSEs lower than 0.430 gC m -2 d -1 , MAEs lower than 0.300 gC m -2 d -1 , and R 2 values higher than 0.880. In contrast, for the temperate grasslands, the four models produced RMSEs higher than 0.640 gC m -2 d -1 , MAEs higher than 0.400 gC m -2 d -1 , and R 2 values lower than 0.720.

Relative Importance of Environmental Variables
All four ML models produced higher R 2 and lower RMSE while using the EVI as the input vegetation index than while using the NDVI as the input vegetation index ( Table 2), indicating that in comparison to the NDVI, the EVI better reflected the RE variations in the grasslands of northern China. By using both the EVI and NDVI as input vegetation indices, however, the BP-ANN, SVR, RF, and SAE models performed best in estimating RE. We then separately trained the ML models for the alpine and temperate grasslands. The two strategies had similar R 2 , RMSE, and MAE values in estimating RE for the alpine grasslands, temperate grasslands, and all of northern China's grasslands ( Figure 2). This result indicated that the difference between the performance of the two strategies was minimal. For example, in the alpine grasslands, the overall R 2 of the two strategies ranged from 0.885 to 0.903. In the temperate grasslands, the overall R 2 of the two strategies ranged from 0.656 to 0.718. In all of northern China's grasslands, the overall R 2 of the two strategies ranged from 0.836 to 0.858.
The results also showed that the performance of the four ML models varied with ecosystem type. More specifically, the four models performed better in estimating RE in the alpine grasslands than in the temperate grasslands in general (Figure 2). For the alpine grasslands, the four models with the two strategies produced RMSEs lower than 0.430 gC m −2 d −1 , MAEs lower than 0.300 gC m −2 d −1 , and R 2 values higher than 0.880. In contrast, for the temperate grasslands, the four models produced RMSEs higher than 0.640 gC m −2 d −1 , MAEs higher than 0.400 gC m −2 d −1 , and R 2 values lower than 0.720.

Relative Importance of Environmental Variables
All four ML models produced higher R 2 and lower RMSE while using the EVI as the input vegetation index than while using the NDVI as the input vegetation index ( Table 2), indicating that in comparison to the NDVI, the EVI better reflected the RE variations in the grasslands of northern China. By using both the EVI and NDVI as input vegetation indices, however, the BP-ANN, SVR, RF, and SAE models performed best in estimating RE. Considering that the four models performed better while using EVI than using NDVI, we evaluated the relative importance of the environmental variables while using the EVI as the input vegetation index. The removal of each environmental variable caused an increase in the RMSE (Figure 3), illustrating that the four models performed best in estimating RE in northern China's grasslands when all the environmental variables were used as input variables. However, the relative importance of the environmental variables varied with the model, indicating that environmental variables had different effects on different ML models in estimating RE.

Discussion
In recent years, several ML models have been widely used to estimate RE at regional and global The relative importance of the environmental variables also varied with the ecosystem type (Figure 3), indicating that the environmental variables had different effects on the RE estimation in different ecosystem types. For the Tibetan alpine grasslands (Figure 3a), the removal of the EVI caused the largest mean increase in RMSE (14.02%) of the four models, indicating that the EVI was the most important environmental variable for estimating RE in the alpine grasslands. The next most important environmental variables were SOCD, Ta, LSWI, grassland type, and elevation, with comparatively minor mean increases in the RMSE (10.64%, 5.53%, 4.70%, 3.50%, and 2.78%, respectively). The removal of PAR caused the smallest mean increase in the RMSE (1.50%). For the Inner Mongolian temperate grasslands (Figure 3b), the rank of the relative importance of the environmental variables was EVI, SOCD, LSWI, grassland type, Ta, elevation, and PAR. The removal of the EVI, SOCD, LSWI, grassland type, Ta, elevation, and PAR caused 8.44%, 6.57%, 5.07%, 4.09%, 3.44%, 2.55%, and 1.23% mean increases in the RMSE, respectively.

Discussion
In recent years, several ML models have been widely used to estimate RE at regional and global scales [8,9,30,62]. However, few studies have systematically evaluated and compared the performance of different ML models in estimating RE. In this study, we evaluated and compared the performance of three traditional ML models (BP-ANN, SVR, and RF) and a DL model (SAE) for estimating RE in the grasslands of northern China. We found that all four models could estimate RE fairly well in the study area ( Figure 2). This can be explained by the fact that RE has strong dependencies on the selected environmental variables in the grasslands of northern China, and all four ML models have sufficient capabilities to learn these underlying dependencies to estimate RE, although they have different learning principles (introduced in Section 2.2.2). However, REs predicted from ML models were underestimated while the RE was getting larger (Figure 2). This may be caused by two reasons. First, we estimated RE for each 8-day interval. The 8-day Ta, LSWI, PAR, NDVI and EVI values may not represent some weather events influencing RE during that period [45,65], such as extreme heat and precipitation in summer. Second, some large RE values may be caused by certain atmospheric turbulent events rather than real ecological processes [66]. ML models are well known for having the ability to automatically learn complex nonlinear relationships from input data [49,67]. Given that RE is difficult to constrain due to our limited understanding of the complex interactions among physical, chemical, and biological processes [5,14], ML models can help us accurately quantify the spatiotemporal variation in RE at regional or global scales.
In addition, our results show that the SAE model performed best in estimating RE in the grasslands of northern China ( Figure 2). As a DL model, SAE is able to automatically extract higher-level features in the environmental variables with unsupervised pre-training than traditional ML models. These higher-level features are more robust to outliers in input data and can better reflect the inherent nature of environmental variables [68,69]. Then, these higher-level features can be utilized to predict RE more effectively and accurately in the fine-tuning process [56]. Previous studies have shown that SAE or other DL models can perform better than traditional ML models in air quality prediction [59,70], digital soil mapping [71], and soil moisture prediction [72]. However, few studies have applied SAE or other DL models to estimate RE or other carbon fluxes. Our results suggest that the SAE model can be successfully applied to RE estimation and may perform better than traditional ML models, although with relatively small datasets. Moreover, with the rapid increase in EC flux observations and the development of DL approaches, we expect that DL models will be more widely used in quantifying RE or other carbon fluxes at different scales.
We developed ML models to estimate RE in northern China's grasslands using two different strategies: training for all of northern China's grasslands and separate training for each ecosystem type. Both strategies have been used to estimate RE or other carbon fluxes in previous studies. For example, Xiao et al. [30] developed a regression tree model to estimate RE for all ecosystem types over North America. In contrast, Liu et al. [73] separately trained general regression neural networks to estimate gross primary productivity and net ecosystem exchanges for each ecosystem type in the conterminous United States. Our results show that the two strategies differed little in estimating RE in northern China's grasslands (Figure 2). This can be explained by two aspects. First, by training with samples from all the ecosystem types together, the four ML models may have been able to learn the underlying dependencies between RE and environmental variables both in the alpine and temperate grasslands, although the RE may have different responses to environmental change in the two ecosystem types [13,37]. Then, the four models had the generalization capacity to accurately estimate RE in the entire study area. Second, this similar performance may be due to the high variability among sites within the same ecosystem type, which is comparable to the variability among different ecosystem types [74]. We suggest training ML models for all ecosystem types together, which is more expedient in practice and can be more independent of the land cover and vegetation maps and related uncertainty [74,75]. In addition, we found that the four models generally performed better in the alpine grasslands than in the temperate grasslands ( Figure 2). This may be due to the higher spatial and temporal variability in RE in the temperate grasslands (CV = 91.64%) than in the alpine grasslands (CV = 84.86%) [74].
All four ML models performed better in estimating RE while using the EVI as the input vegetation index than using the NDVI as the input vegetation index ( Table 2). The NDVI uses red and infrared reflectance, which are sensitive to the soil background [26]. Studies have also shown that the NDVI could be saturated in grasslands with highly fractional vegetation cover [76]. As an improved vegetation index, EVI overcomes these shortcomings [26]. Moreover, in comparison to the NDVI, the EVI is more responsive to canopy structural variations, such as canopy architecture, plant physiognomy, and canopy type [48]. Therefore, in comparison to the NDVI, the EVI can better reflect the variation in RE in the grasslands of northern China. Furthermore, we found that the four models performed best in estimating RE while using both the EVI and NDVI as input vegetation indices, which may be due to the synergy of the two vegetation indices.
By using all the environmental variables as input variables, the four models performed best (Figure 3), implying that all the selected environmental variables account for the variation in RE in northern China's grasslands. However, the results showed that environmental variables have different effects on different ML models in estimating RE ( Figure 3). As the same training data were used for the four models, this may be caused by their different sampling strategies or learning methods [50]. The environmental variables also had different effects on the RE estimation in different ecosystem types (Figure 3). In general, the EVI and SOCD were the two most important environmental variables for estimating RE in both the alpine and temperate grasslands. This is not surprising, given that the EVI and SOCD can represent the variability of plant productivity and soil organic carbon pool, respectively [44,77], which are two main sources of carbon substrate supply for RE [1,40,78]. The important role of plant productivity and the soil carbon pool in regulating RE variations is also consistent with the results of many previous studies [3,5,42,79]. The most obvious differences in the effects of environmental variables on estimating RE in the two ecosystem types are Ta and LSWI (Figure 3). For the Tibetan alpine grasslands, Ta was more important than LSWI in estimating RE (Figure 3a). This result indicates that RE is more responsive to temperature variations than moisture variations in the alpine grasslands, which is consistent with the results of previous studies [13,36,80,81]. Alpine grasslands on the TP are generally thermal limited [13]; thus, temperature strongly controls the RE process by affecting enzyme activity [1,82]. Geng et al. [38] indicated that the spatial variation in Rs in Tibetan alpine grasslands can be better explained by soil moisture than by soil temperature. The different conclusions may be due to the differential responses of RE components to temperature variations [83]. In contrast, RE is more responsive to moisture variations than temperature variations in temperate grasslands (Figure 3b). Being mainly distributed in arid and semi-arid environments, vegetation and microbial activities in the Inner Mongolian temperate grasslands are strongly limited by moisture, thus leading to the strong control of moisture on RE variations [13,42]. Although ML models are usually called "black box" models, our results suggest that they still have some level of interpretability to improve our understanding of RE dynamics [84].
We evaluated and compared the performance of four ML models in estimating RE in the grasslands of northern China in our study; however, some limitations still exist. On the one hand, some other ML models have been used to estimate RE, such as the model tree ensembles [62] and Cubist [30], which are not included in our study. However, they are similar to the RF model that we have already evaluated. On the other hand, although we found that the soil carbon pool is important for RE estimation, the application of ML models that consider the effects of the soil carbon pool on regional RE estimation is still limited since spatially explicit information on the soil carbon pool is not readily available or contains considerable uncertainty [30,33,85]. With the development of ML, especially DL, the increase in soil carbon inventory data, and the development of digital soil mapping, we may be able to quantify the dynamics of RE more accurately and reasonably at regional and global scales.

Conclusions
In this study, we systematically evaluated and compared three traditional machine learning (ML) models (back propagation artificial neural network, support vector regression, and random forests models) and a deep learning (DL) model (stacked autoencoders model) in terms of estimating ecosystem respiration (RE) in northern China's grasslands. Our results show that all four ML models estimated RE in the grasslands of northern China fairly well, while the stacked autoencoders model performed best (R 2 = 0.858, RMSE = 0.472 gC m −2 d −1 , MAE = 0.304 gC m −2 d −1 ). The ML models that were trained for all of northern China's grasslands and separately trained for the alpine and temperate grasslands had almost identical performances, indicating that ML models for RE estimations can be developed for all ecosystem types together. Moreover, by evaluating the relative importance of environmental variables, we found that the enhanced vegetation index (EVI) and soil organic carbon density (SOCD) were the two most important variables in estimating RE in the grasslands of northern China. Air temperature (Ta) was more important than the growing season land surface water index (LSWI) in the Tibetan alpine grasslands, while LSWI was more important than Ta in the Inner Mongolian temperate grasslands. We suggest that RE estimation will benefit from the advanced algorithms in ML models, and some important environmental variables, such as SOCD, should be incorporated into RE estimations at regional and global scales.