Estimating Surface Downward Longwave Radiation Using Machine Learning Methods

Abstract: The downward longwave radiation (L d , 4–100 µm) is a major component of the surface radiation energy budget and balance. In this study, we applied five machine learning methods, namely artificial neural network (ANN), support vector regression (SVR), gradient boosting regression tree (GBRT), random forest (RF), and multivariate adaptive regression spline (MARS), to estimate L d using ground measurements collected from 27 Baseline Surface Radiation Network (BSRN) stations. In situ L d measurements were used to validate the accuracy of the L d estimation models on daily and monthly time scales. A comparison of the results demonstrated that the estimates based on the GBRT method had the highest accuracy, with an overall root-mean-square error (RMSE) of 17.50 W m −2 and an R value of 0.96 for the test dataset on a daily time scale. These values were 11.19 W m −2 and 0.98, respectively, on a monthly time scale. The effects of land cover and elevation were further studied to comprehensively evaluate the performance of each machine learning method. All machine learning methods achieved better results over the grass land cover type but relatively worse results over the tundra. The GBRT, RF, and MARS methods showed good performance at both high- and low-altitude sites.


Introduction
Studying the Earth's surface radiation budget is important for understanding weather and climate phenomena. The downward longwave radiation (L d , 4–100 µm) is mainly emitted by H 2 O, CO 2 , and O 3 molecules and cloud droplets in the atmosphere near the Earth's surface [1] and is an indispensable part of the surface radiation energy balance and budget of the Earth's climate system [2]. Developing an accurate, global, and long-term L d database is critical for studying changes in the Earth's climate. However, the spatial coverage of L d observation sites is sparse, and measurements are entirely absent in some regions due to their high cost. Accurate estimation of L d at a denser network of stations is therefore of great importance for studying its distribution on global and regional scales.
Various models have been developed for L d estimation at stations under clear-sky conditions. These methods can be split into two types: parameterization methods and physics-based methods. Parameterization methods usually estimate L d from air temperature (Ta) and water vapor pressure under clear-sky conditions [3][4][5][6][7]. Wu et al. [3] utilized eight meteorological parameterization methods to estimate instantaneous L d from Moderate Resolution Imaging Spectroradiometer (MODIS) atmospheric products under clear-sky conditions. The evaluation showed that the root-mean-square error (RMSE) values of the instantaneous L d estimates from all eight methods were less than 35 W m −2 at 17 sites drawn from the FLUXNET, AmeriFlux, BSRN, AsiaFlux, and SURFRAD networks. Although much attention has been paid to parameterization methods, most existing ones are only valid for the local atmospheric conditions and geographic coverage for which they were calibrated. Physics-based methods estimate L d from satellite retrievals of Ta, humidity profiles, and top-of-atmosphere (TOA) radiance using a radiative transfer model [8,9]. Duguay et al. [8] utilized a radiative transfer model to estimate L d from satellite-retrieved and digital terrain data. Physics-based methods depend strongly on satellite-retrieved temperature and humidity profiles for the lower atmosphere, which are not always available and carry great uncertainties [9][10][11].
Quantifying cloud effects on L d estimates is important under cloudy-sky conditions. The cloud cover fraction, which is proportional to the total cloudiness, is usually used to measure the impact of clouds on L d and to correct L d under cloudy-sky conditions [9,12,13]. Reliable cloud cover fraction measurements can be obtained from meteorological observations and satellite cloud detection products [14]. Under cloudy skies, the cloud cover fraction can also be estimated from the ratio of the measured horizontal global solar radiation to the clear-sky horizontal global solar radiation [15,16]. The parameterization of the cloud effect on L d proposed by Crawford and Duchon [15] is considered one of the best methods [17,18] and has been widely used to estimate cloudy-sky L d . For instance, Wang et al. [19] used this method [15] to account for the impact of clouds on L d . Their evaluation showed that the instantaneous L d estimates had an average bias of 2 W m −2 (0.6%) under all-sky conditions at 36 ground observation sites. Since downward shortwave radiation (DSR, 0.3–3 µm) data are more widely available than L d , which is not a conventional observation parameter, Yang et al. [20] estimated L d over the Tibetan Plateau by calculating the cloud cover fraction from DSR. Their evaluation showed an error of less than 30 W m −2 at four China Meteorological Administration (CMA) stations in the Tibetan Plateau. The accuracy of cloudy-sky L d estimates is lower than that under clear-sky conditions. Most previous studies that estimated L d under cloudy-sky conditions over China only used measurements collected at the Heihe River Basin [21][22][23].
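For concreteness, a cloud correction of this general form can be sketched as follows. The blending expression below (effective sky emissivity moving from its clear-sky value toward a blackbody as cloudiness grows, with the cloud fraction inferred from the solar-radiation ratio) and the clear-sky emissivity value are illustrative assumptions; the original Crawford and Duchon formulation [15] should be consulted for the exact coefficients.

```python
# Hedged sketch of a Crawford-and-Duchon-style cloud correction:
# cloud fraction from the measured/clear-sky solar ratio, then the
# clear-sky emissivity blended toward a blackbody with cloudiness.
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

def ld_cloudy(s_measured, s_clear, eps_clear, ta_kelvin):
    clf = 1.0 - s_measured / s_clear         # cloud fraction from solar ratio
    eps_sky = clf + (1.0 - clf) * eps_clear  # cloudy-sky effective emissivity
    return eps_sky * SIGMA * ta_kelvin ** 4  # downward longwave, W m^-2

# Fully clear sky: the correction reduces to the clear-sky estimate
print(ld_cloudy(800.0, 800.0, 0.75, 288.0))
# Fully overcast: the sky radiates as a blackbody at Ta
print(ld_cloudy(0.0, 800.0, 0.75, 288.0))
```

The two limiting cases (clear and overcast) bracket the behavior: the correction only increases L d relative to the clear-sky value as cloud fraction rises.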
However, the accuracy of parameterization methods under cloudy-sky conditions relies strongly on the calibration data, and such methods do not fully consider the impact of cloud characteristics, so they may exhibit large biases outside their local calibration range [11].
In addition to the methods mentioned above, machine learning methods are increasingly used to retrieve both surface shortwave and longwave radiation [24][25][26][27][28][29][30][31]. Wei et al. [24] utilized four machine learning methods to estimate DSR using Advanced Very High Resolution Radiometer data; the gradient boosting regression tree (GBRT) method performed best on both daily and monthly time scales. Wang et al. [25] estimated all-sky surface shortwave net radiation (SSNR) using four machine learning methods with Landsat Thematic Mapper (TM)/Enhanced Thematic Mapper Plus (ETM+) TOA reflectance and other ancillary information, including the clearness index and water vapor. The SSNR estimates based on the GBRT method similarly performed best, with RMSEs of 73.23 W m −2 and 18.76 W m −2 , biases of 0.64 W m −2 and −1.74 W m −2 , and coefficients of determination (R 2 ) of 0.88 and 0.94 on the instantaneous and daily time scales, respectively. These results indicate that applying machine learning methods to estimate surface radiation is reliable. Moreover, Cheng et al. [30] developed linear and dynamic learning neural network models to estimate global clear-sky surface upwelling longwave radiation using MODIS TOA radiance. Compared with empirical methods [32][33][34][35][36], physics-based methods [8,37,38], parameterization methods [5,10,22,39–42], and hybrid methods [43][44][45][46][47][48] for L d estimation, machine learning methods automatically learn the relationship between input parameters and surface radiation and extract complex features from simple ones. All in all, estimating L d with machine learning methods is promising.
Therefore, the objective of this study was to compare the accuracy and precision of five machine learning methods, namely artificial neural network (ANN), GBRT, random forest (RF), multivariate adaptive regression spline (MARS), and support vector regression (SVR) for L d estimation. This study is organized as follows: The ground measurements and the five machine learning methods are introduced in Section 2. Section 3 provides the validation and comparison results of five machine learning methods for L d estimation. Finally, the discussion and conclusion are presented in Sections 4 and 5, respectively.

BSRN Datasets
In this study, ground observations collected at Baseline Surface Radiation Network (BSRN) stations from 2000 to 2015 were used to train and validate the machine-learning-based models, including L d (W m −2 ), downward shortwave radiation (DSR, W m −2 ), air temperature (Ta, °C) at 2 m height, relative humidity (RH, %) at 2 m height, and elevation (m). BSRN is a project established by the World Climate Research Programme (WCRP) to monitor changes in the Earth's surface radiation and to support the validation of satellite measurements and climate models [49].
The first four BSRN stations began recording data in January 1992 and have operated continuously ever since. Presently, 52 BSRN stations exist on seven continents: 14 in North America, 10 in Europe, 8 in Asia, 7 in Oceania, 5 in South America, 4 in Africa, and 4 in Antarctica [50]. Recent reports indicate that BSRN measurements have the highest possible accuracy at high temporal resolution across various climate zones, with uncertainties of approximately ±5 W m −2 [51]. Data collected at 27 sites from 2000 to 2015 were used to train and evaluate the estimation models after excluding sites with questionable or discontinuous observations. Figure 1 shows the spatial distribution of the 27 BSRN stations, and Table 1 lists their detailed information, including geographic location, elevation, and land cover.

Machine Learning Methods
Machine learning is dedicated to the study of how to use experience to improve a system's performance through computational means, generating a "model" from data that can make judgments about new situations [52,53]. These methods learn rules from large amounts of historical data to intelligently identify new samples or predict the future. Five machine learning methods, namely artificial neural network (ANN), support vector regression (SVR), gradient boosting regression tree (GBRT), random forest (RF), and multivariate adaptive regression spline (MARS), were used to estimate L d using ground measurements collected from 27 BSRN stations. Four variables extracted at the BSRN sites were selected as predictors: elevation, Ta, RH, and DSR. The ground-measured daily L d was used as the target variable. To acquire the best model, the dataset was randomly split into two parts: 80% (61,108 daily samples) as a training dataset and the remaining 20% (13,015 daily samples) as a test dataset. By looping over candidate values of each parameter, the optimal parameters with the lowest average RMSE were selected through 5-fold cross-validation on the training data for each method.
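The split-and-search protocol described above can be sketched with scikit-learn as follows. This is a minimal illustration: the synthetic data, the GBRT estimator chosen as the example, and the small parameter grid are placeholders, not the study's actual measurements or search grids.

```python
# Sketch of the protocol: 80/20 train/test split, then a 5-fold
# cross-validated grid search scored by RMSE on the training portion.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 4))                         # columns: elevation, Ta, RH, DSR
y = 300 + 50 * X[:, 3] + rng.normal(0, 5, 500)   # synthetic daily Ld target

# 80% training / 20% test, as in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 5-fold CV over an illustrative grid; lowest average RMSE wins
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"n_estimators": [50, 150], "learning_rate": [0.05, 0.1]},
    scoring="neg_root_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_)
```

The same scaffold applies to each of the five methods: only the estimator and its parameter grid change.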

Support Vector Regression
The support vector machine (SVM) is a maximum-margin classifier based on structural risk minimization in statistical learning theory, developed by Vapnik [54]. Support vector regression (SVR) is the application of the SVM algorithm to regression problems [55]. It uses a nonlinear mapping to project the sample data from the input space into a high-dimensional feature space, where an optimal separating hyperplane between data points can be found. The SVR algorithm can mitigate overfitting by introducing a regularization parameter that constrains the model's complexity. However, the SVR method is better suited to small samples due to the high computational complexity of the algorithm.
The accuracy of the SVR method is highly influenced by the kernel. Because previous studies have shown that the radial basis function (RBF) kernel outperforms other kernel functions for regression, we used an RBF kernel in this study [56]. When training an SVR model with the RBF kernel, two parameters must be considered: C and gamma. The C parameter is the penalty factor, which controls the trade-off between the complexity of the function and the frequency with which errors are allowed to fall outside the SVR regression tube. Gamma controls the influence of a single training example: a larger gamma means each example's influence is more localized, so other examples must be closer to be affected. The candidate parameter values for the SVR model were predefined as C ∈ {0.01, 1, 10, 100} and gamma ∈ {0.01, 0.1, 10, 100}. The 5-fold cross-validation results showed that an SVR model with a C parameter of 1 and a gamma parameter of 0.1 was optimal for estimating L d .
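A minimal sketch of such an SVR model, using scikit-learn's `SVR` with the RBF kernel and the reported optimal values C = 1 and gamma = 0.1. The standardization step and the synthetic data are assumptions of this illustration (SVR with an RBF kernel generally requires standardized inputs).

```python
# SVR with the RBF kernel and the reported optimal C and gamma,
# wrapped in a pipeline that standardizes the four predictors first.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((300, 4))                        # elevation, Ta, RH, DSR
y = 300 + 40 * X[:, 1] + rng.normal(0, 3, 300)  # synthetic daily Ld

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, gamma=0.1))
model.fit(X, y)
pred = model.predict(X)
print(pred.shape)
```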

Gradient Boosting Regression Tree
The gradient boosting regression tree (GBRT) algorithm is an ensemble of weak prediction models (typically decision trees) combined through boosting. Like other boosting methods, it builds models in a stagewise fashion and generalizes them by allowing optimization of any differentiable loss function. The GBRT algorithm generally reduces both variance and bias and has been shown to be a very accurate method. It can automatically capture nonlinear interactions through decision tree learning [57].
The GBRT model is influenced by four parameters: the number of iterations, the learning rate, the depth of each tree, and the sampling rate. The ranges of these parameters were predefined (number of iterations ∈ [10, 60], learning rate ∈ [2, 4], tree depth ∈ [2, 10], sampling rate ∈ [2, 6]). The validation results showed that a GBRT model with an iteration number of 150, a learning rate of 0.1, a tree depth of 5, and a sampling rate of 0.8 was optimal for estimating L d .
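Assuming a standard scikit-learn implementation, a GBRT model with the reported optimal settings might look like this (the data are synthetic placeholders, and the parameter-name mapping to `GradientBoostingRegressor` is this sketch's assumption):

```python
# GBRT with the reported optimal parameters: 150 iterations,
# learning rate 0.1, tree depth 5, sampling rate 0.8.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.random((400, 4))                                       # elevation, Ta, RH, DSR
y = 300 + 60 * X[:, 3] - 20 * X[:, 2] + rng.normal(0, 4, 400)  # synthetic daily Ld

gbrt = GradientBoostingRegressor(
    n_estimators=150,   # number of iterations
    learning_rate=0.1,
    max_depth=5,        # depth of each tree
    subsample=0.8,      # sampling rate (stochastic gradient boosting)
    random_state=42,
)
gbrt.fit(X, y)
print(gbrt.score(X, y))  # R^2 on the training data
```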

Random Forest
The random forest (RF) is an ensemble of multiple decision trees and was first proposed by Breiman for classification and regression problems [58]. Unlike other decision tree algorithms, each generated tree can grow to the maximum possible depth on its bootstrapped training dataset by utilizing a combination of different features. The trees in the RF method are not pruned, which further reduces the computational load [59]; this is a key advantage of the RF algorithm over other decision tree algorithms [60]. The RF algorithm can be considered an improved version of bagging, which trains many predictors on bootstrapped samples from a training set; bagging reduces the variance of the predictions. In addition, the RF algorithm is relatively robust to outliers and noise and is faster than bagging or boosting. The main factors influencing the RF model are the n-estimators, max-features, min-samples-split, and min-samples-leaf parameters. The ranges of these parameters were predefined (n-estimators ∈ [10, 60], max-features ∈ [2, 4], min-samples-split ∈ [2, 10], min-samples-leaf ∈ [2, 6]). The 5-fold cross-validation results showed that an RF model with an n-estimators parameter of 30, a max-features parameter of 4, a min-samples-split parameter of 3, and a min-samples-leaf parameter of 2 was optimal for estimating L d .
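A comparable sketch with scikit-learn's `RandomForestRegressor` and the reported optimal parameter values (synthetic placeholder data; the mapping of the text's parameter names onto the scikit-learn arguments is this illustration's assumption):

```python
# RF with the reported optimal parameters: 30 trees, max-features 4,
# min-samples-split 3, min-samples-leaf 2.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.random((400, 4))                        # elevation, Ta, RH, DSR
y = 300 + 60 * X[:, 3] + rng.normal(0, 4, 400)  # synthetic daily Ld

rf = RandomForestRegressor(
    n_estimators=30,
    max_features=4,        # all four predictors considered at each split
    min_samples_split=3,
    min_samples_leaf=2,
    random_state=42,
)
rf.fit(X, y)
print(rf.predict(X[:3]))
```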

Multivariate Adaptive Regression Splines
Multivariate adaptive regression splines (MARS), developed by Friedman [61], is a nonparametric regression method that builds relationships between explanatory and response variables. One advantage of the MARS method is that the input space is divided into multiple regions, each with its own regression equation, which makes MARS particularly suitable for problems with higher input dimensions.
The MARS method was implemented using the py-earth package on the Python platform. The estimation accuracy of the MARS model is mostly affected by the max-terms, max-degree, and endspan-alpha parameters. The ranges of these parameters were predefined (max-terms ∈ [100, 500], max-degree ∈ [1, 7], endspan-alpha ∈ [0.3, 0.9]). The 5-fold cross-validation results showed that a max-terms parameter of 100, a max-degree parameter of 2, and an endspan-alpha parameter of 0.3 were the most suitable parameters for the MARS model to estimate L d .
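Since py-earth may not be available in every environment, the MARS building block itself, the hinge function that splits the input axis into regions each receiving its own linear term, can be illustrated in plain numpy. This is a toy sketch of the basis, not the study's fitted model or the full py-earth forward/pruning algorithm.

```python
# A single MARS hinge basis function max(0, x - t) with knot t: the
# model is constant before the knot and linear after it, i.e. the
# input space is split into two regions with their own fits.
import numpy as np

def hinge(x, knot):
    return np.maximum(0.0, x - knot)

x = np.linspace(0.0, 10.0, 101)
knot = 4.0
y = 1.5 * hinge(x, knot) + 2.0   # intercept 2.0, slope 1.5 after the knot

print(y[x <= knot].max())        # constant region stays at the intercept
print(y[-1])                     # linear region: 1.5 * (10 - 4) + 2
```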

Artificial Neural Network
The artificial neural network (ANN) is a network formed by interconnecting many processing nodes, or neurons, in a certain order. One or more nonlinear hidden layers, each containing several neurons, can lie between the input and output layers. Through repeated learning and training on known information, the network gradually adjusts the weights of the neuron connections to process information and simulate the relationship between input and output [62]. The ANN method does not need to know the exact relationship between input and output and does not require many parameters. Therefore, compared with traditional data processing methods, the ANN method has obvious advantages in processing fuzzy, random, and nonlinear data and is especially suitable for large-scale data with complex structure.
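A minimal ANN sketch using scikit-learn's `MLPRegressor`. The text does not report the network architecture used in the study, so the single 32-neuron hidden layer here is purely an illustrative assumption, as are the synthetic data.

```python
# Feed-forward ANN with one hidden layer; inputs are standardized,
# which gradient-trained networks generally require.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.random((300, 4))                        # elevation, Ta, RH, DSR
y = 300 + 50 * X[:, 3] + rng.normal(0, 3, 300)  # synthetic daily Ld

ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=42),
)
ann.fit(X, y)
print(ann.predict(X[:2]))
```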

Validation against Ground Measurements
After confirming the optimal parameters, the machine learning methods were used to estimate L d on the training and test datasets. Ground observations of daily L d collected at 27 BSRN stations were used to validate the accuracy of each machine learning method. Table 2 summarizes the validation results. The RF model obtained the lowest RMSE on the training dataset but performed considerably worse on the test dataset, indicating overfitting. The other four models did not exhibit the same evident overfitting problem as the RF model. The overfitting may be related to a specific shortcoming of the RF method: when samples are concentrated in a particular range, the RF model tends to overfit.
Compared with the other models, the SVR model was the most inaccurate on the training dataset. When the SVR model was used to estimate daily L d , the overall RMSE, bias, and R values for the training dataset were 26.25 W m −2 , −8.87 W m −2 , and 0.98, respectively. For the SVR method, the slope and intercept of the fitted linear regression on the training dataset were 0.74 and 75.58, respectively, indicating that the daily L d estimates deviated seriously from the 1:1 line compared to the other four machine learning methods; the SVR method tended to overestimate L d at low values and underestimate it at high values. The overall RMSE of the SVR model on the test dataset decreased to 18.18 W m −2 , but this value was still higher than those of the other models except the RF model.
The best correlation was obtained between the L d estimates based on the MARS method and the ground measurements, with overall RMSE, bias, and R values on the test dataset of 16.56 W m −2 , −2.87 W m −2 , and 0.97, respectively. The slope and intercept of the fitted linear regression on the test dataset were 0.97 and 5.97, respectively, the closest to the 1:1 line among the five machine learning methods. The L d estimates based on the GBRT and ANN methods also correlated well with the ground measurements.
The RMSE, bias, and R values for the GBRT model were 17.5 W m −2 , −2.77 W m −2 , and 0.96, respectively; the corresponding values for the ANN model were 17.27 W m −2 , −5.6 W m −2 , and 0.98. The ANN and MARS models were similar in that the differences between their RMSEs on the training and test datasets were small, less than 1 W m −2 ; the RMSEs on the training dataset were 16.90 W m −2 and 17.27 W m −2 for the MARS and ANN models, respectively. Compared with the MARS model, the GBRT model achieved a lower RMSE of 13.43 W m −2 on the training dataset, and its RMSE on the test dataset differed by less than 1 W m −2 . Therefore, estimating daily L d with the GBRT model is reasonable and accurate.
Monthly L d estimates were obtained by averaging the daily values over a month. If more than nine daily L d estimates were missing in a single month at a station, the data for that month at that station were deleted. The averaged monthly L d estimates were then validated against the BSRN monthly L d measurements. Figures 4 and 5 display the evaluation results for the L d estimates based on each method on a monthly time scale.
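The monthly aggregation rule above can be sketched with pandas: average the daily estimates within each month and drop any month with more than nine missing daily values. The one-station daily series below is synthetic.

```python
# Monthly Ld from daily Ld: monthly mean, with months that have more
# than nine missing daily values discarded.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
days = pd.date_range("2010-01-01", "2010-03-31", freq="D")
ld = pd.Series(300 + rng.normal(0, 10, len(days)), index=days)
ld.iloc[31:43] = np.nan              # knock out 12 days in February

monthly = ld.resample("MS").mean()   # mean of the available daily values
missing = ld.isna().resample("MS").sum()
monthly[missing > 9] = np.nan        # >9 missing daily values: drop month
print(monthly)
```

With this rule, January and March keep their monthly means while February (12 missing days) is discarded.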
On the whole, the performance of the five models on the test dataset did not differ much from one another on a monthly time scale. The RMSE and bias on the training dataset ranged from 4.3 to 23.2 W m −2 and from −8.01 to 0.38 W m −2 , respectively; on the test dataset, they ranged from 10.21 to 11.81 W m −2 and from −2.99 to 5.33 W m −2 , respectively. When the RF method was used to estimate L d , the overall RMSE, bias, and R values for the training dataset were 4.3 W m −2 , 0.05 W m −2 , and 1, respectively, showing the best correlation. The corresponding values for the GBRT method on the training dataset were 7.95 W m −2 , 0.04 W m −2 , and 1, indicating that the GBRT method was also suitable for predicting L d . Compared with the other methods, the SVR method produced higher RMSE and bias on the training dataset, with values of 23.2 W m −2 and −8.01 W m −2 , respectively. As displayed in Figure 5, the MARS model obtained the lowest RMSE on the test dataset, with overall RMSE, bias, and R values of 10.21 W m −2 , −1.92 W m −2 , and 0.99, respectively; the same conclusion can be drawn from the test results on a daily time scale. The overall RMSE, bias, and R values of the GBRT method on the test dataset were 11.19 W m −2 , −1.62 W m −2 , and 0.98, respectively. Although the RMSE of the MARS method was lower than that of the GBRT method, the difference between them was less than 1 W m −2 on a monthly time scale, and the bias of the GBRT method was lower than that of the MARS method on both daily and monthly time scales. The RMSE, bias, and R values of the ANN method were 11.29 W m −2 , 5.25 W m −2 , and 0.99, respectively. The RF model performed worst on a monthly time scale, with an overall RMSE increasing to 11.81 W m −2 on the test dataset.
Therefore, it is also rational and accurate to conduct monthly L d estimation on the basis of the GBRT model.
According to the evaluation results, the GBRT, RF, ANN, and MARS methods have a significant advantage in accuracy over the SVR method. Compared with the other machine learning methods, the GBRT method showed relatively high precision on both daily and monthly time scales, obtaining RMSEs on the test dataset of 17.50 W m −2 and 11.19 W m −2 , respectively. In summary, among the five machine learning methods studied, the GBRT method is relatively the most accurate for L d estimation using elevation, Ta, RH, and DSR.
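The evaluation statistics used throughout this section (RMSE, bias, and the correlation coefficient R) can be computed as follows; the estimate/observation arrays are toy values, not the study's data.

```python
# RMSE, bias, and Pearson R between estimates and observations.
import numpy as np

def evaluate(est, obs):
    diff = est - obs
    rmse = np.sqrt(np.mean(diff ** 2))   # root-mean-square error
    bias = np.mean(diff)                 # mean of (estimate - observation)
    r = np.corrcoef(est, obs)[0, 1]      # Pearson correlation coefficient
    return rmse, bias, r

obs = np.array([300.0, 320.0, 340.0, 360.0])
est = obs + np.array([5.0, -5.0, 5.0, -5.0])  # errors of +/-5 W m^-2
rmse, bias, r = evaluate(est, obs)
print(rmse, bias, r)  # RMSE 5.0, bias 0.0, R close to 1
```

Note that a bias of zero does not imply a small RMSE: here the symmetric ±5 W m −2 errors cancel in the bias but still yield an RMSE of 5 W m −2.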

The Influence of Land Cover and Elevation on L d Estimation
Land cover and elevation can influence the interaction between the Earth and the atmosphere [64]. In this section, we investigate their influence on L d estimation with the five machine learning methods, using both the training data (61,108 daily samples) and the test data (13,015 daily samples) at all sites. We divided the sites into seven primary land cover types: desert, bare land, cropland, grass, glacier, ocean, and tundra; asphalt, concrete, and rock land covers were grouped into the bare land category, and ice land cover into the glacier category. The accuracy of each method over each land cover type was then evaluated. The statistical results for the five machine learning methods over each land cover type and surface elevation are provided in Tables 3 and 4 and Figures 6 and 7. The box plots in Figures 6 and 7 visually display the outliers, dispersion, and bias of the datasets: the box represents the middle 50% of the values, with its upper boundary, middle line, and lower boundary marking the upper quartile, median, and lower quartile, respectively; the dotted lines extending from the box edges represent the maximum and minimum values, and the scattered red asterisks represent outliers.
Although the selected machine learning methods performed diversely across land covers on the training dataset, all methods generally achieved better results over the grass land cover type and relatively poorer results over the tundra. Moreover, the five methods performed better on the training dataset than on the test dataset. On the training dataset, the RMSE ranged from 6.02 to 23.11 W m −2 and from 6.64 to 24.13 W m −2 over the desert and grass land cover types, respectively; the mean RMSE of the five methods was 13.28 W m −2 over desert and 13.75 W m −2 over grass. Compared with the grass land cover type, the larger bias values over the desert on the training dataset were −6.77 W m −2 , −1.81 W m −2 , −1.49 W m −2 , −1.15 W m −2 , and −16.53 W m −2 for the ANN, GBRT, MARS, RF, and SVR methods, respectively. On the test dataset, the RMSE of all methods over the desert and grass land cover types ranged from 10.66 to 21.89 W m −2 and from 11.85 to 24.39 W m −2 , respectively, with mean RMSEs of 15.31 W m −2 and 15.52 W m −2 . Meanwhile, the RMSE over the tundra ranged from 9.16 to 24.84 W m −2 on the training dataset and from 20.09 to 22.91 W m −2 on the test dataset, with mean RMSEs of 19.25 W m −2 and 21.12 W m −2 , respectively. Over cropland, the RMSE and bias of all methods on the training dataset varied from 8.06 to 19.24 W m −2 and from −5.82 to 0.84 W m −2 , respectively, indicating that all models except the ANN underestimated L d ; on the test dataset, the RMSE and bias varied from 14.76 to 19.13 W m −2 and from −5.32 to 1.47 W m −2 . Over bare land, the GBRT method obtained the lowest RMSE on the test dataset, 13.05 W m −2 , but this value was still higher than those obtained over the desert and grass land cover types.
The RMSE of all methods over the ocean was similar to that obtained over the glacier on both the training and test datasets, with differences of less than 1 W m −2 , except for the SVR method.
Among the five machine learning methods, the RF method performed best over all land cover types on the training dataset, with RMSE and bias ranging from 6.02 to 9.16 W m −2 and from −1.15 to 0.49 W m −2 , respectively. However, it achieved worse results on the test dataset, with RMSE and bias ranging from 15.70 to 22.91 W m −2 and from −5.07 to 1.05 W m −2 , which may be related to the RF method's overfitting issue. The SVR method showed larger RMSE and bias on the test dataset, with values ranging from 19.13 to 28.63 W m −2 and from −15.47 to 18.53 W m −2 ; unlike the other methods, it performed best over cropland and worst over the glacier. For the MARS method, the RMSE and bias over the different land cover types ranged from 10.58 to 20.96 W m −2 and from −4.52 to 0.49 W m −2 on the test dataset. Among the seven land cover types, the MARS method performed worst over the tundra, with overall RMSE and bias values of 20.96 W m −2 and −4.52 W m −2 on the test dataset, and best over the desert, where the RMSE and bias decreased to 10.58 W m −2 and 0.32 W m −2 . On the test dataset, all methods underestimated L d over the desert, grass, and tundra land cover types, and all except the RF and SVR methods overestimated L d over the cropland and ocean types. According to the validation results on the test dataset, the GBRT and MARS methods showed the same ordering of RMSE values across land cover types, ranking from smallest to largest as desert, grass, bare land, glacier, ocean, cropland, and tundra; however, the RMSE values of the GBRT method were lower than those of the MARS method, except over the desert and cropland types.
The performance of the GBRT method was second only to the RF method on the training dataset and was the best over most land cover types on the test dataset. For the GBRT method, the difference between the RMSEs on the training and test datasets was small, less than 1 W m −2 , over the different land cover types, indicating that this method did not suffer from overfitting and was relatively the most suitable for estimating L d .
Similar to the analysis of land cover types, we divided the site elevation (H, in m) into four ranges, namely H < 500 m, 500 m < H < 1000 m, 1000 m < H < 2000 m, and H > 2000 m, to evaluate the influence of site elevation. The validation results are shown in Figures 6 and 7 and Tables 3 and 4. According to the mean RMSE of the five methods at different elevations, all models achieved relatively better results at the 1000 m < H < 2000 m elevation sites on both the training and test datasets, with mean RMSEs of 12.00 W m −2 and 13.54 W m −2 , respectively. For the H > 2000 m elevation sites, the performance of the five methods on the training dataset differed considerably; the RF and SVR methods obtained the lowest and highest RMSEs, with values of 6.86 W m −2 and 33.74 W m −2 , respectively. For the H < 500 m elevation sites, all methods underestimated L d on the test dataset, except the RF method, whose bias was zero. Compared with the H < 500 m elevation sites, the SVR method fitted better at the 500 m < H < 1000 m elevation sites, obtaining lower RMSE and bias on both the training and test datasets, whereas for the ANN, GBRT, MARS, and RF methods, the RMSE at the 500 m < H < 1000 m elevation sites was higher than that at the H < 500 m elevation sites. From Tables 3 and 4 and Figures 6 and 7, it can be concluded that the RF method performed best at all elevation ranges on the training dataset, with RMSE and bias ranging from 6.22 to 9.70 W m −2 and from −0.44 to 0.37 W m −2 , respectively, whereas its performance on the test dataset was worse than those of the GBRT and MARS methods; the difference between its RMSEs on the training and test datasets was large (>6 W m −2 ).
The SVR method had larger RMSEs, with values ranging from 16.70 to 33.74 W m −2 and from 16.76 to 32.88 W m −2 on the training and test datasets, respectively. The performance of the GBRT and MARS methods was similar, both achieving the lowest RMSE at the 1000 m < H < 2000 m sites and the highest RMSE at the 500 m < H < 1000 m sites on both the training and test datasets; their bias values were close to zero on the test dataset, with differences of less than 1 W m −2 across the elevation ranges. When the ANN method was used to estimate L d , performance was best at the 1000 m < H < 2000 m sites and worst at the 500 m < H < 1000 m sites, with RMSEs of 13.98 W m −2 and 20.07 W m −2 on the test dataset, respectively. All in all, the GBRT method not only avoided overfitting but also performed best on the test dataset.

Discussion
In this study, we applied five machine learning methods, namely artificial neural network (ANN), support vector regression (SVR), gradient boosting regression tree (GBRT), random forest (RF), and multivariate adaptive regression spline (MARS), to estimate L d using ground measurements collected at 27 Baseline Surface Radiation Network (BSRN) stations from 2000 to 2015. The evaluation results indicated that the methods performed differently on the training and test datasets, on daily and monthly time scales, over different land cover types, and at different elevation ranges. On the whole, for the training dataset, the RMSE ranking of the five methods was consistent on the daily and monthly time scales, from lowest to highest: RF, GBRT, ANN, MARS, and SVR. The lowest RMSEs were 7.64 W m −2 and 4.3 W m −2 and the highest were 26.25 W m −2 and 23.2 W m −2 on the daily and monthly time scales, respectively. For the test dataset, although the RMSE ordering of the five methods varied between the daily and monthly time scales, the differences between methods were small, less than 3 W m −2 on both time scales. It is worth noting that, for the training dataset, the fitted regression line of the SVR method deviated seriously from the 1:1 line, with slopes of 0.74 and 0.75 and intercepts of 75.58 and 70.54 on the daily and monthly time scales, respectively. Although the RF method obtained the lowest RMSE on the training dataset, it performed worse than the GBRT method on the test dataset. The performance of the GBRT method was second only to the RF method on the training dataset and was the best for most land cover types and elevation ranges on the test dataset.
Furthermore, the difference between the RMSEs on the training and test datasets was small, less than 1 W m −2 , regardless of land cover type or elevation range, which indicates that this method did not suffer from overfitting and was the most suitable of the five for estimating L d .
Although machine learning methods can reliably estimate L d using elevation, Ta, RH, and DSR, they still have some disadvantages. The GBRT method does not have an explicit regularization term, and its regression tree learner acts as a black box, which limits the interpretability of its predictions [57]. The RF method has a known overfitting issue, especially when the samples are concentrated within a narrow range [65]. Moreover, the ground measurements themselves contain errors, which is another potential source of error in the models [66].
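To make the input–output setup concrete, the following sketch trains a GBRT on the four predictors named above (elevation, Ta, RH, and DSR) with scikit-learn and compares training and test RMSE, the overfitting check used throughout this study. The synthetic data, the toy target function, and the hyperparameters are assumptions for illustration, not the configuration used in this work.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins for the four predictors:
# elevation (m), Ta (K), RH (%), DSR (W m-2)
elev = rng.uniform(0, 3000, n)
ta = rng.uniform(250, 310, n)
rh = rng.uniform(10, 100, n)
dsr = rng.uniform(0, 400, n)
# Toy target loosely mimicking L_d's dependence on temperature and humidity
ld = (5.67e-8 * ta**4 * (0.6 + 0.2 * rh / 100)
      + 0.05 * dsr - 0.01 * elev + rng.normal(0, 5, n))

X = np.column_stack([elev, ta, rh, dsr])
X_train, X_test, y_train, y_test = train_test_split(
    X, ld, test_size=0.2, random_state=0)

# Assumed hyperparameters; the study's actual settings may differ
gbrt = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
gbrt.fit(X_train, y_train)

rmse_train = mean_squared_error(y_train, gbrt.predict(X_train)) ** 0.5
rmse_test = mean_squared_error(y_test, gbrt.predict(X_test)) ** 0.5
print(f"train RMSE: {rmse_train:.2f} W m-2, test RMSE: {rmse_test:.2f} W m-2")
```

A small gap between the training and test RMSE, as reported for GBRT in this study, is the usual indication that the model has not overfitted.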

Conclusions
L d is one of the key components for calculating the surface energy budget. Accurate estimation of L d is of great importance for understanding the Earth's radiation budget on global and regional scales. The primary goal of this paper was to demonstrate the use of five machine learning methods, i.e., ANN, GBRT, RF, MARS, and SVR, to estimate L d using ground measurements collected at 27 BSRN stations from 2000 to 2015. Validation was carried out against the L d ground observations on both daily and monthly time scales. According to the comparison of the evaluation results, the GBRT method showed the highest accuracy for L d estimation; its overall RMSE and R on the test dataset were 17.50 W m −2 and 0.96, respectively, on a daily time scale. These values were 11.19 W m −2 and 0.98, respectively, on a monthly time scale. The experiments were run on a Windows 10 machine with an Intel Core i7-8700 CPU (3.2 GHz) and 3.19 GB of RAM. The mean calculation time of the GBRT method was within 15 s. Therefore, the GBRT method is the most suitable for estimating L d in terms of both computation time and prediction accuracy.
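The three accuracy metrics reported throughout (RMSE, R, and bias) can be computed as in the sketch below; the observation and estimate arrays are hypothetical values used only to exercise the function.

```python
import numpy as np

def evaluate(obs, est):
    """Return the three accuracy metrics used above: RMSE, Pearson R, and bias."""
    obs, est = np.asarray(obs, float), np.asarray(est, float)
    rmse = np.sqrt(np.mean((est - obs) ** 2))  # root-mean-square error
    r = np.corrcoef(obs, est)[0, 1]            # Pearson correlation coefficient
    bias = np.mean(est - obs)                  # mean signed difference
    return rmse, r, bias

# Hypothetical daily L_d observations and estimates (W m-2)
obs = np.array([310.0, 295.0, 330.0, 280.0, 305.0])
est = np.array([312.0, 290.0, 335.0, 284.0, 300.0])
rmse, r, bias = evaluate(obs, est)
print(f"RMSE={rmse:.2f} W m-2, R={r:.3f}, bias={bias:.2f} W m-2")
```

A positive bias indicates overestimation of L d and a negative bias indicates underestimation, which is the convention followed in the bias ranges quoted below.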
The influence of land cover and elevation on the five machine learning methods was also discussed. Each machine learning method was found to be more accurate over the grass land cover type and relatively less accurate over the tundra. The evaluation results indicated that all models achieved better results at the 1000 m < H < 2000 m elevation sites. The GBRT method performed well across all elevation ranges, with RMSE and bias ranging from 11.57 to 17.48 W m −2 and from −0.59 to 1.03 W m −2 , respectively, on the test dataset, whereas the SVR method produced larger RMSE and bias, with values ranging from 16.76 to 32.88 W m −2 and from −10.27 to 22.57 W m −2 , respectively, on the test dataset.
We conclude that it is reasonable to apply machine learning methods to estimate L d by building relationships between the input variables and L d . Although the GBRT method showed the best performance, the differences among the machine learning methods were small. Analysis of bias showed that the L d estimates of the five machine learning methods were under- or overestimated by approximately 2 to 8 W m −2 . Reducing bias and overfitting by optimizing the selection of parameters and variables requires further research.