Integrating Reanalysis and Satellite Cloud Information to Estimate Surface Downward Long-Wave Radiation

The estimation of downward long-wave radiation (DLR) at the surface is very important for the understanding of the Earth’s radiative budget with implications in surface–atmosphere exchanges, climate variability, and global warming. Theoretical radiative transfer and observationally based studies identify the crucial role of clouds in modulating the temporal and spatial variability of DLR. In this study, a new machine learning algorithm that uses multivariate adaptive regression splines (MARS) and the combination of near-surface meteorological data with satellite cloud information is proposed. The new algorithm is compared with the current operational formulation used by the European Organization for the Exploitation of Meteorological Satellites (EUMETSAT) Satellite Application Facility on Land Surface Analysis (LSA-SAF). Both algorithms use near-surface temperature and dewpoint temperature along with total column water vapor from the latest European Centre for Medium-range Weather Forecasts (ECMWF) reanalysis ERA5 and satellite cloud information from the Meteosat Second Generation. The algorithms are trained and validated using both ECMWF-ERA5 and DLR acquired from 23 ground stations as part of the Baseline Surface Radiation Network (BSRN) and the Atmospheric Radiation Measurement (ARM) user facility. Results show that the MARS algorithm generally improves DLR estimation in comparison with other model estimates, particularly when trained with observations. When considering all the validation data, root mean square errors (RMSEs) of 18.76, 23.55, and 22.08 W·m−2 are obtained for MARS, operational LSA-SAF, and ERA5, respectively. The added value of using the satellite cloud information is accessed by comparing with estimates driven by ERA5 total cloud cover, showing an increase of 17% of the RMSE. The consistency of MARS estimate is also tested against an independent dataset of 52 ground stations (from FLUXNET2015), further supporting the good performance of the proposed model.


Introduction
The downward long-wave radiation (DLR hereafter), defined as the irradiance reaching the surface in the infrared range between 4 and 100 µm, is an essential component of the Earth's surface radiation budget [1][2][3]. DLR has a high dependency with the vertical profiles of atmospheric temperature, water vapor (the largest contributor to the greenhouse effect [4]), and cloud cover. Therefore, accurate estimations of DLR are important for a wide range of applications dealing with climate variability [5]. Since DLR is a key component of the land surface radiative balance, it is essential to model and estimate the land surface turbulent fluxes (latent and sensible), which are relevant to predict the effects that climate and land use changes have on water resources, ecosystems, and the agricultural sector [6]. Naud and Miller [7] have reported DLR high sensitivity to changes in water vapor in high-elevation regions, which are among the most sensitive regions to future climate change. This is of particular relevance, since in such remote regions there is usually DLR estimates from MARS against another set of ground stations previously excluded from the training process. The in situ measurements are provided by the Baseline Surface Radiation Network (BSRN, [41]) and the Atmospheric Radiation Measurement (ARM, [42]) user facility, all within the MSG-disk (as described in Section 2.1). Additionally, to assess the consistency of the proposed methodology, MARS estimates are also validated against an independent spatiotemporal dataset of 52 ground stations from FLUXNET2015. In this analysis, besides DLR estimates from the MARS model, other model estimates are also considered for comparison purposes, including the operational LSA-SAF model, a new LSA model calibrated with ERA5 and measured data, and ERA5 radiation fluxes. Additionally, the MARS and LSA models are also driven by ERA5 cloud information to assess the added value of the MSG cloud information in the estimation of DLR. More details regarding MARS application, as well as a description of other models used in this study, are presented further on in Section 2.2.
The remainder of this paper is structured as follows: in Section 2, data and methods are presented, including a description for the LSA and MARS models; Section 3 provides the results for the validation of the MARS model against in situ measurements and other models estimates within the MSG-disk; in Section 4, a discussion for the obtained results is presented; while conclusions and future perspectives are given in Section 5; additional information related to the analysis is added at the end of this paper in Appendix A, Appendix B, Appendix C, Appendix D.

Observations and Reanalysis
The BSRN [41] has been operational since 1995, as part of the world climate research program, supported by the World Meteorological Organization (WMO) and others. As a network of surface radiation monitoring, significant improvements have been achieved over the last decades with the simultaneous increase of globally scattered stations and the quality of ground measured data. Although there are currently 57 operational stations installed over different surface types, measured data from a total of 77 stations is freely available (https://bsrn.awi.de/; accessed on 31 March 2022), with quality control procedures.
In this work a total of 22 BSRN stations were used (Figure 1a), being located within the MSG-disk (i.e., longitude/latitude within +/− 75 • E/N), and with available data within the 16-year period from 2004 to 2019. Three BSRN stations located in the MSG-disk are not considered for analysis, either because of representative issues of the measurements (Izaña, in Tenerife, Spain) or due to complete absence of data (Ilorin, Nigeria; and Rolim de Moura, Brazil). Izaña, is a relatively high-altitude station (2372.9 m), often above the clouds that cover most of the island; under these conditions, the MSG pixel is usually correctly classified as "cloud covered", but DLR observations are characteristic of "clear sky" and therefore inconsistent with the satellite information. In addition to the BSRN stations, the Niamey station (13.4773 • N; 2.1758 • W) from the ARM user facility was also included (60-s downwelling irradiances from the sky radiation sensor SKYRAD60S). Niamey station was selected due to its particular local atmospheric features, in particular the aerosol load [43], which can be in the form of severe dust events, such as desert storms. More details concerning Niamey mobile facility and radiation observations are available in Sengupta et al. [44].
Although quality control procedures have been previously applied, several outliers were found across the 23 stations (see Table 1), as well as a period of about 5 months in SMS station with measurements made with malfunctioning equipment. Accordingly, the corresponding sets of data were removed from the analysis, increasing the number of gaps in the in situ observations. There is a diverse amount of temporal coverage across all stations. For instance, the station that has the most complete list of records (TAM) only has 0.83% of missing data during the 16-year period, while the station with the lowest number of records (BUD) has 99.49% of missing data. For comparison purposes, in this analysis all DLR observations were temporally aggregated to hourly frequency. Although quality control procedures have been previously applied, several outliers were found across the 23 stations (see Table 1), as well as a period of about 5 months in SMS station with measurements made with malfunctioning equipment. Accordingly, the corresponding sets of data were removed from the analysis, increasing the number of gaps in the in situ observations. There is a diverse amount of temporal coverage across all stations. For instance, the station that has the most complete list of records (TAM) only has 0.83% of missing data during the 16-year period, while the station with the lowest number of records (BUD) has 99.49% of missing data. For comparison purposes, in this analysis all DLR observations were temporally aggregated to hourly frequency.   To further improve and reinforce the proposed validation procedure, an independent dataset of ground observations from the FLUXNET2015 network has also been considered (https://fluxnet.org/data/fluxnet2015-dataset/, [45]; accessed on 31 March 2022). To this end, half-hourly measurements from 52 ground stations were aggregated to hourly values Remote Sens. 2022, 14, 1704 6 of 26 and then used for validation purposes. The selected measured variable was the incoming longwave radiation (LW_IN_F_MDS), which is gap-filled using the Marginal Distribution Sampling (MDS) method, i.e., taking into account observations made under similar meteorological, physical and temporal conditions [46]. It should be noted that only 52 stations (Table A8) were eligible for the present study, since these needed to follow several requirements at the same time, i.e., to be within the MSG-disk, within the 2004-2015 period, and should have representative values. Regarding the latter, during a quality control check it was observed that several stations had different measuring periods with MDS gap-filling applied, being characterized by a "poor-quality" flag (i.e., the lowest level) and, therefore, are removed from the analysis. In this context, out of 198 stations globally available, only 52 were suitable. Moreover, out of these, 48 stations are located in Europe (Figure 1b). More details regarding the FLUXNET2015 stations and respective validation results for the MARS model are shown in Appendix D.
In addition to ground measured data, several fields with hourly frequency of the most recent ECMWF reanalysis, ERA5 [47], were extracted from the Copernicus Data Store (CDS). The fields include total column water vapor (tcwv, mm), 2-metre temperature (t2m, K), 2-metre dewpoint temperature (d2m, K), total cloud cover (tcc), and downwelling surface thermal radiation (strd or DLR, W·m −2 ), with the latter being produced through the McRad radiation scheme [48]. For comparison purposes, the ERA5 fields were interpolated to the measuring station's location following a nearest neighbor approach. It is worth noting that, for the evaluation of MARS and LSA models against observations, both t2m and d2m temperatures were adjusted to each measuring altitude considering a reference temperature lapse rate of −6.5 K/km [49]. Similarly, ERA5 radiation fluxes were adjusted considering a correction factor of −2.8 W/m 2 per 100 m [13].
Cloud mask information retrieved from the SEVIRI sensor on board MSG is also used in this work (at 15-min frequency) for the definition of sky conditions. As described by Derrien and Gléau [50], the MSG cloud mask was developed by the LSA on support to Nowcasting and Very Short-Range Forecasting (NWC-SAF, https://www.eumetsat.int/ nwc-saf; accessed on 31 March 2022), allowing to identify cloud free areas where different products can be computed (e.g., total precipitable water, land, or sea surface temperatures), as well as cloudy areas from which other products can be derived (e.g., cloud type or cloud top temperature/height). Several research works have shown the added value of the MSG information for cloud detection (e.g., [14,51]). In particular, Trigo et al. [14] showed an overall good performance of the MSG cloud mask in cloud identification during the validation of DLR estimates. However, despite the satisfactory results, these authors also observed that, in regions under high aerosol load, the accuracy of the satellite cloud mask could contribute to a lower performance of the proposed method. In the context of DLR estimation, the present study uses cloud fraction (denoted cf ) retrieved from the SEVIRI sensor for the training and evaluation of both MARS and LSA algorithms. For this purpose, 15-min cloud mask data is aggregated to hourly cf using an hourly rolling mean. The procedure allows to select pure situations of clear (cf = 0) and cloudy (cf = 1) conditions during a particular hour.

Models
Two algorithms that estimate DLR are evaluated in this study: (i) the current operational semi-empirical algorithm used by LSA-SAF (referred as LSA hereafter) and (ii) a new MARS algorithm, a more flexible approach for the definition of the different atmospheric states under which DLR is calculated. The resulting models (LSA and MARS) are both driven by ERA5 atmospheric conditions (tcwv, t2m, and d2m), satellite cloud cover from MSG/SEVIRI observations, and are calibrated using DLR observations from the BSRN and ARM stations. Additionally, estimates of DLR from ECMWF-ERA5 reanalysis (ERA5) and of the current LSA-SAF operational product (LSA_OPER) are also considered in the analysis. For completeness, to assess the value of using satellite cloud information to calculate DLR with both algorithms, LSA and MARS models were also applied using ERA5 Remote Sens. 2022, 14, 1704 7 of 26 total cloud cover (denoted LSA* and MARS*, respectively) instead of the MSG cf. Table 2 resumes the key characteristics of the models used in the analysis, including the MARS algorithm for the MARS and MARS* models, the LSA-SAF for the LSA and LSA* models, and the ERA5 reanalysis for the ERA5 model. The LSA-SAF algorithm is presented in detail by Trigo et al. [14] and resumed in Appendix A. The piecewise regression approach used by LSA-SAF is based on three classes of atmospheric profiles, independent for clear and cloudy conditions, which were manually selected (Table A1). Similarly, MARS [40,52] is based on a weighted sum of piecewise functions, also known as basis functions, in which the MARS additive model follows the recursive partitioning regression form, as described by Friedman [40]. The resulting regression coefficients are then adjusted to find the best fitting to the data. The selection of the basis functions is a fundamental process in MARS: this consists of an automatic procedure following a two-stage building process, established by performing a forward and a backward step. Compared with the LSA-SAF algorithm, the automatic procedure to establish the piecewise regression in MARS is a key advantage. In this study, the MARS algorithm available in the py-earth python package (version 0.1.0, https://github.com/ scikit-learn-contrib/py-earth; accessed on 31 March 2022) is used. As in the case of the LSA algorithm, two MARS sub-models are trained for clear and cloudy conditions, respectively, considering "pure types" identified by the MSG cf (0 or 1); the all-sky DLR is computed following Equation (A4).
Both LSA and MARS models were calibrated with the same subsets of data independently for clear and cloudy conditions, using the MSG cf. The models used ERA5 tcwv, t2m, and d2m as predictors, and observations (BSRN and ARM) of DLR as predictand. Since there are large differences in data availability for each station (Table 1), the models were calibrated with a randomly selected sample of 40% of the full-time series in each station limited to a maximum of 6 months of data. The procedure allowed us to avoid the dominance of some stations with longer periods in the training dataset. In an initial phase, the MARS model was also tested with different combinations between the three predictors (tcwv, t2m, d2m). The results (not shown) indicated that the use of the three predictors provides the best outputs, although tcwv and t2m alone could already generate reasonable results. Moreover, the addition of the cf was also tested as an explicit input (predictor) in MARS, i.e., as an alternative to the two sub-models for clear and cloudy conditions. This approach did not perform as well as for the two sub-models (not shown), most likely due to the binary nature of the cloud information, which is not optimal for the MARS model. Considering these preliminary tests, and for consistency with LSA, it was decided to keep the three predictors in MARS and two independent sub-models, i.e., one for clear and another for cloudy conditions. The training of the LSA-SAF algorithm follows Trigo et al. [14], described briefly in Appendix A, where the resulting calibrated parameters are shown in Table A2. For MARS, a repeated k-fold cross-validation procedure was considered during the training phase. The process involves the performance of repeated cross-validation procedure several times and calculation of the mean result across all folds. For the present study, a 10-fold was considered, since this value is typically used in machine learning models (e.g., [53][54][55]), being found to provide a good trade-off of low computational cost and low bias. The results of the training of both MARS and LSA models for clear and cloudy conditions are presented in detail in Appendix B, showing the respective model performance in the training and validation datasets.

Evaluation Metrics
The performance of the different DLR estimates obtained from each model is assessed through a series of conventional error metrics, such as bias (µ), the root mean square error (RMSE), standard deviation of the error or unbiased root mean square error (σ), and the temporal correlation coefficient (R): where y i is the modelled value of the i-th sample (N is the number of samples), o i is the corresponding reference value, d i is the difference between the modelled and reference, with the overbar representing the temporal mean of a variable.

Model Evaluation
The models were evaluated considering the whole DLR observational dataset between 2004 and 2019; statistics obtained for the independent datasets used in model training and verification are shown in Appendix B. Since there is a large range of the temporal coverage among the stations, the evaluation was performed following two approaches: (i) merging all station data before the evaluation, computing the overall metrics, and displaying the results as density scatter plots; and (ii) computing the evaluation metrics for each station independently, displaying the distributions of the metrics as boxplots.
The overall performance of the models, i.e., merging all stations before the evaluation, is depicted in Figure 2. The corresponding density scatter plot for each model and sky condition (i.e., clear, cloudy, and all-sky) shows the model DLR as function of the observations along with the different metrics. The absolute biases are always below 2 W·m −2 for both LSA and MARS models. This is expected since the model's training aims at the minimization of systematic differences. For the case of ERA5, biases are also small under clear-sky, however these grow to −14.54 W·m −2 in cloudy conditions, which result in an all-sky bias of −5.25 W·m −2 . The root mean square error is dominated by the error variability (standard deviation of the error) in all models. This can be primarily attributed to temporal/spatial variability that is not captured by ERA5 or by the ERA5 predictors used in LSA and MARS. In terms of the RMSE, MARS has the best performance in all the different sky conditions, with an error of 18.76 W·m −2 under all-sky, being followed by LSA with 20.24, ERA5 with 22.08, and LSA_OPER with 23.55 W·m −2 . Moreover, the linear correlations are always above 0.91 in all models and sky conditions, with MARS showing a consistently better performance, although differences are small. W.m for both LSA and MARS models. This is expected since the model's training aims at the minimization of systematic differences. For the case of ERA5, biases are also small under clear-sky, however these grow to −14.54 W.m −2 in cloudy conditions, which result in an all-sky bias of −5.25 W.m −2 . The root mean square error is dominated by the error variability (standard deviation of the error) in all models. This can be primarily attributed to temporal/spatial variability that is not captured by ERA5 or by the ERA5 predictors used in LSA and MARS. In terms of the RMSE, MARS has the best performance in all the different sky conditions, with an error of 18.76 W.m −2 under all-sky, being followed by LSA with 20.24, ERA5 with 22.08, and LSA_OPER with 23.55 W.m −2 . Moreover, the linear correlations are always above 0.91 in all models and sky conditions, with MARS showing a consistently better performance, although differences are small.  The performance of each model can also be assessed considering different DLR ranges for all-sky conditions. The following ranges were established: (UL) upper limit for values above 400 W·m −2 , with a total of 100,521 samples; (ML) middle limit for between 200-400 W·m −2 , with a total of 1,571,217 samples; and (LL) lower limit for below 200 W·m −2 , with a total of 65,888 samples. Figure 1a gives a reasonable overview of the actual conditions represented by those ranges of DLR: there are values above 400 W·m −2 under warm and very moist conditions, such as those found in the tropics, while the other end of the range, with DLR below 200 W·m −2 , is only found under very cold and dry conditions, such as in high latitudes winter. Table 3 includes a summary of the results for all-sky conditions considering all the data (ALL) and the median (MED) of the metrics distribution when computed for each station. Moreover, as complementary material to these results, detailed information concerning the statistics found in each station under all, clear, and cloudy-sky conditions is provided in Appendix C (Tables A5-A7, respectively). The highest performance is found in all models within the "middle" range, with similar metrics to those computed when considering the entire dataset. This is expected due to a higher sampling, which impacts the training of the models prior to the evaluation. On the other hand, focusing on the upper and lower limits, a clear reduction of all model's performance is found. MARS and LSA underestimate the most extreme conditions, namely at higher values of DLR, while LSA_OPER and ERA5 shows a systematic underestimation of DLR in all conditions with an exception to the ERA5 overestimation at lower values. The large biases in MARS and LSA in the upper and lower conditions lead to high RMSE, with LSA_OPER showing a better overall performance. This is likely associated with the small sampling of these extreme conditions in the training dataset initially used. Despite the limitation in extreme conditions, these results are favorable to the MARS model. When pulling all stations together, the results will be dominated by those stations with larger temporal extend. This could potentially hide some problematic stations (or regions), which will be partially addressed in the following analysis.  The performance of each model in estimating hourly DLR in each station can be assessed through the distributions of the various metrics displayed, as shown in Figure 3 boxplots for all, clear, and cloudy-sky conditions (i.e., left, middle, and right column, respectively). Each boxplot has a reference at the top, corresponding to the median value (also shown in Table 3) found for each error metric (i.e., bias, standard deviation, RMSE and correlation coefficient) and each model. Additionally, Figure 3 shows the same boxplots for the LSA* and MARS* models, which will be discussed in the next subsection for the assessment of the cloud information in DLR estimation. The results are qualitatively consistent with the previous analysis, when all data was merged, with MARS always showing better adjustments to observations (being followed by LSA, LSA_OPER, and ERA5). However, quantitatively, the median of station metrics differs from the metric considering all the data.  (Figure 2j) when considering the full data. Moreover, the graphical display of the metrics distribution also allows to clearly identify a better performance in clear conditions (RMSE and correlation) when compared with cloudy conditions in all models. This is associated with the different radiative impact of clouds, in particular cloud base, which is not considered in the LSA and MARS models, and limitations due to model uncertainty in ERA5. Finally, it is worth noting the presence of outliers in all estimates, which are due to several factors that can affect model accuracy in a group of stations, leading to higher deviations from observations, as shown next.

MARS LSA
Remote Sens. 2022, 13, x FOR PEER REVIEW 12 of 28 In addition to these results, and as a parallel validation of the MARS model, we used a set of (52) stations from an independent network with DLR observations from FLUXNET2015 [45]. The results, shown in detail in Appendix D, demonstrate the MARS model consistency between the different networks. Similarly to Table 3, in Table A9 it is possible to observe that, despite an overall error increase in all models metrics, MARS has In addition to these results, and as a parallel validation of the MARS model, we used a set of (52) stations from an independent network with DLR observations from FLUXNET2015 [45]. The results, shown in detail in Appendix D, demonstrate the MARS model consistency between the different networks. Similarly to Table 3, in Table A9 it is possible to observe that, despite an overall error increase in all models metrics, MARS has the best performance. The best scores are found in the "middle range" for all models, where MARS presents the lowest bias and RMSE of 0.06 and 18.32 W·m −2 , respectively. The same behavior is observed when considering all data from FLUXNET2015, as well as when using data from each individual station. As previously noted, larger errors are also found at the lower and upper limits in all models. It is important to note that such results, beside reinforcing the proposed methodology, suggest BSRN observations are more appropriate for the MARS training (and validation) within the MSG-disk than the FLUXNET2015 network. BSRN operates exclusively for continuous radiation measurements at surface, where most sites provide both downward longwave and shortwave fluxes, following high standards in terms of instrument calibration and observations quality checks [41], while FLUXNET2015 targets a broader set of observations aimed at characterizing exchanges of energy, water, and carbon between the surface and the atmosphere, where available radiation plays an important role. Within FLUXNET2015, and despite the quality checks performed on measurements, availability of a complete set of measured variables (including longwave and shortwave radiation fluxes) is considered crucial [45]; across-variable quality checks are regularly performed, but the BSRN standards for radiation flux observations may not be always followed. The geographical distribution of FLUXNET2015 sites with acceptable quality radiation fluxes is limited to Europe, if we only consider sites within the MSG-disk as opposed to BSRN. These aspects are confirmed by the FLUXNET2015 validation, as depicted by the overall error increase in all models.
The evaluation procedure continues, now focusing on different case studies to highlight several aspects (positive and negative) of the different models. As previously mentioned, there are stations (most noticeably GVN, SMS, and SON) from which model estimates deviate further from observations. On the other hand, there are also stations (e.g., CAR and TAM) in which models have an overall good correspondence with observations.
The following examples presented in Figure 4 depict the behavior of each model during a 36 h-period in such stations, which also include the NIM station, due to particular atmospheric effects that occur in the region. The DLR time-series of each station are shown at the hourly resolution from the different models and observations, as well as the cloud information (in the bottom subplot) from ERA5 (tcc) and MSG (cf ). For the best performance cases (Figure 4a,b), the MARS model is the one that has better adjustments to observations, while ERA5 produces higher deviations. A suitable example is the CAR station (Figure 4a), where a good relation is found between the observations' DLR variability and the MSG cf, reflected in the MARS, LSA, and LSA_OPER simulations, while ERA5 DLR shows some deviations associated with tcc variability. In TAM station (Figure 4b), the reduced cloud variability clearly leads to lower deviations between models and observations, in which DLR values are found between 250-350 W·m −2 during this time of the year. When analyzing NIM station (Figure 4c), underestimation of DLR occurs in all models. Despite a slightly higher deviation in comparison with the LSA estimates, MARS shows a smoother variation than the former, closer to the observed behavior. Regarding the worst performances cases (Figure 4d-f), significant deviations are observed in all models. In GVN station (Figure 4d), all models deviated from the observations, missing the increase in DLR at the start of the period and underestimating DLR in the following hours. Such behavior can be explained by the fact that GVN is in Antarctica, at a very high latitude, near the MSG-disk limit (close to 80 • ), posing significant challenges to the identification of cloudy pixels under very high view angles and under circumstances that make it difficult to separate the signature of clouds from those of snow or ice in SEVIRI/MSG observations. In SON and SMS stations, an overall overestimation of models estimates towards observations is visible, particularly in SMS (Figure 4f). For the case of SON station, the measuring equipment is located at a relatively high altitude (about 3109 m), which, similarly to Izaña station in Tenerife, can be measuring DLR values above clouds, instead of recording values below cloud-base height. In SMS, the frequent occurrence of stratiform and shallow convective clouds [56] can lead to higher deviations, since under such conditions there is a higher difficulty to model DLR.  Hourly cloud cover (CC) information is also added in the inside plots with the total cloud cover (tcc) from the ERA5 (blue dots) and the cloud fraction (cf) from the MSG (black dots).

Impact of Satellite Information
Following the previous results describing MARS and LSA performances, the added value that the satellite cloud information has in the calculation of DLR in both MARS and LSA algorithms is clearly observed through the MARS* and LSA* results (Figure 3). In comparison to all the other models, MARS* and LSA* stand out with an overall error increase in all the metrics. This can be primarily associated with cloud misrepresentation in Hourly cloud cover (CC) information is also added in the inside plots with the total cloud cover (tcc) from the ERA5 (blue dots) and the cloud fraction (cf ) from the MSG (black dots).

Impact of Satellite Information
Following the previous results describing MARS and LSA performances, the added value that the satellite cloud information has in the calculation of DLR in both MARS and LSA algorithms is clearly observed through the MARS* and LSA* results (Figure 3). In comparison to all the other models, MARS* and LSA* stand out with an overall error increase in all the metrics. This can be primarily associated with cloud misrepresentation in ERA5. However, MARS* still shows an overall better performance than the LSA* model, although with small differences. For instance, for all-sky conditions, a correlation of 0.87 and 0.86 is found for MARS* and LSA*, respectively, while 0.88 is obtained with ERA5 ( Figure 3j). These deviations are generally higher than the ERA5 estimates due to the linear relation between clear/cloudy and tcc that both MARS and LSA algorithms consider, which does not happen in ERA5. Following the results obtained for different ranges of DLR in the remaining models, a similar behavior of both LSA* and MARS* models is observed in Table 4. Despite an overall error increase due to the use of the tcc to calculate DLR, lower deviations and higher correlations (0.88 and 0.89, respectively) occur in the "middle" limit.

Discussion
The present work focusses the estimation of DLR fluxes at surface using MARS combined with hourly observations of DLR (from BSRN and ARM stations), ERA5 atmospheric profiles (tcwv, t2m, and d2m), and the MSG cf. The fact that ground and remote sensed observations are used for model training under different sky conditions provides a novel approach to estimate DLR. Similarly, to all NWP models, despite an overall good result in comparison to other models estimates, the proposed MARS model also has a few limitations. The main source of uncertainty is related to the adopted training procedure, particularly to the 23 ground stations used, which provide some degree of data availability differential. This means that stations with higher samplings will induce a local bias dependency. Moreover, the selected stations are mainly distributed in Europe, which also creates a regional dependency. When using a spatiotemporal independent set of 52 ground stations from FLUXNET2015 to validate the MARS model over a period of about 11 years, it was possible to observed that, regardless of its limitations, the proposed methodology is consistent. Most of the FLUXNET2015 stations used are located in Europe (i.e., a total of 48 stations), which limits a more global assessment of the results. Nevertheless, the MARS model continues to demonstrate an overall better estimation of DLR when compared with the remaining models. Furthermore, we should keep in mind that in situ observations may also be subject to significant uncertainties. Other sources of error can result from the cf information used for the sky classification, in which the data quality is dependent on the satellite interpretation of clouds and associated errors.
Considering all the available data, the validation of the MARS model ( Figure 2) shows that, generally, and despite its limitations, using measured data for training (i.e., a total of about 5.86 years) purposes produces better adjustments to observed values. This is the case for the MARS (Figure 2a-c) and LSA (Figure 2d-f) model estimates. Particularly, MARS provides the best performance under the different sky conditions as a result of using an automatic piecewise regression method instead of the least square fitting method used in the LSA-SAF original algorithm. In comparison, the worst results are found with ERA5, which also depict an overall decrease in performance. As previously mentioned, the ERA5 negative bias in cloudy conditions might be partially explained by problems in the representation of clouds and their radiative effect. It is worth noting that this effect is observed not only during cloudy-sky periods (where higher deviations are found) but also during clear-sky periods (with a very low bias of −0.46 W·m −2 ), in which ERA5 is likely assuming some situations of cloud occurrence when none are to be observed, thus leading to overestimation that lowers bias close to zero. Moreover, the separation between clear and cloudy conditions is performed using the satellite information, which can introduce a few inconsistencies regarding the actual observations (due to satellite footprint and uncertainties in cloud detection) and ERA5 atmospheric conditions (e.g., cloud base errors). Therefore, the interpretation of the model's evaluation in clear versus cloudy conditions is not straightforward. Nevertheless, in the absence of accurate cloud information from NWP models, satellite information should be used instead of reanalysis data for model training purposes, as particularly shown by the improved DLR estimates of MARS in comparison to the ones found with MARS*. Another overall underestimation of DLR is similarly found with the LSA-SAF operational model (LSA_OPER), although with smaller deviations than in the ERA5 model. In that case, the poorer performance must be attributed to the original calibration carried out by Trigo et al. [13], where TIGR-like and MODTRAN-4 simulations (not observations) were used to calibrate the model parameters (Table A1). Additionally, a common feature of the original LSA-SAF algorithm is the 'S' shape curve in DLR estimates, as shown by the LSA_OPER results (Figure 2g-i), where its effects are particularly visible under cloudy conditions. Since the plots of MARS, the newly calibrated LSA model, and ERA5 do not present that characteristic, it is likely an artifact introduced by the MODTRAN-4 estimates used in the fitting. It should be noted that, although MARS eliminates most of the previous error signatures from LSA-SAF and ERA5 models, there is still room for improvement, particularly for cloudy conditions. In particular, the way to further incorporate satellite observations related to, e.g., cloud type and cloud phase, is still largely unexplored.
The results analyzed so far suggest an overall good performance of the considered models, but also reveal the presence of several outliers in all model estimates (Figure 3). Taking into account the information provided by Tables A5-A7, it is possible to find significant deviations towards observations in three stations (GVN, SMS, and SON), either due to the latitude or altitude effect, as well as measurement inaccuracies related to equipment malfunction. Figure 4 presents (36-h) examples of the behavior in each model for a selected group of very different stations, which are aimed to represent the best-and worst-case studies. Despite the good results in CAR and TAM stations (Figure 4a,b), it is important to consider the fact that the spatial sampling is not equally distributed throughout all the selected stations within the MSG-disk. In terms of the spatial distribution, the use of observations for validation allows to test the performance of different model estimates over different climate regions, strengthening the validation of the proposed formulation. However, as previous mentioned, regional dependencies should be expected in regions that have a higher number of stations (e.g., Europe), therefore contributing to an overall bias reduction to the MARS and LSA models. Nevertheless, for the best-case studies, ERA5 continues to produce higher deviations due deficiencies in cloud representation. When analyzing NIM station (Figure 4c), a clear underestimation of measured values is provided by all models (between 340-450 W·m −2 ). This aspect can be explained with the fact that NIM station may be subject to high aerosol loads, usually desert dust, which can lead to a significant increase in observed DLR when compared to similar aerosol free conditions, which is not captured well by any model. In particular, this station usually experiences higher occurrence of extreme dust events (e.g., desert storms), resulting in larger deviations of estimations from observations [44]. For the worst-case studies, the latitude effect in GVN (Figure 4d), the altitude effect in SON (Figure 4e), and the measuring inaccuracies in SMS (Figure 4f) result in very high errors between estimated and observed values, particularly in the former. At a very high latitude, the LSA_OPER seems to provide a higher deviation from observations, which is in accordance with one of the LSA-SAF model limitation stated by Trigo et al. [14], while the MARS and LSA models demonstrate to have a better approximation to observations, particularly MARS.

Conclusions
This work aimed at contributing to a new and improved formulation for the estimation of downward long-wave radiation (DLR) at the surface. The new formulation was built considering the combination of hourly reanalysis, ground, and remote observed inputs to train a state-of-the-art machine learning algorithm based on multivariate adaptive regression splines (MARS). The use of satellite data not only allows us to perform better estimates of DLR with suitable temporal and spatial samplings under different sky conditions, but also provides a wide spatial coverage at high-resolution.
When compared with the Satellite Application Facility on Land Surface Analysis (LSA-SAF) algorithm, results showed that the MARS algorithm performs very well, providing better adjustments to observed DLR fluxes than the former, where lower errors and higher correlations were evident under all, clear, and cloudy-sky conditions. This is mainly related to the fact that MARS allows to replace the previous least square fitting criterion for a set of pre-defined atmospheric states implemented in the LSA-SAF with a more refined discretization that accounts with the best fitting option based on maximum reduction on sum-of-squares residual error. Systematic differences and an overall underestimation were found in both LSA_OPER and ERA5 models, being linked to the original calibration with the Thermodynamic Initial Guess Retrieval (TIGR, [38]) atmospheric-profile database and the Moderate Resolution Atmospheric Transmittance and Radiance Code (MODTRAN-4, [37]) fluxes, and the cloud representation by the total cloud cover retrieved from ERA5, respectively. The role of satellite information in the calculation of DLR was also evaluated using both MARS and LSA models but considering ERA5 cloud information instead of the satellite cloud information to separate clear and cloudy situations (MARS* and LSA* models). The results clearly showed the added value of using remotely sensed data instead of reanalysis cloud cover.
The evaluation analysis, performed within the MSG-disk (i.e., longitude/latitude within 75 • E/N), continued to show that MARS provided best results in comparison to the remaining models (LSA, LSA_OPER, and ERA5). In particular, the use of ground observations, from the baseline surface radiation network (BSRN, [41]) and the atmospheric radiation measurement (ARM, [42]) user facility, to calibrate MARS led to improved adjustments with lower errors and higher correlations (in which sampling plays an important role). During the validation procedure it was shown that, when using all available data from the 23 stations, MARS allows us to obtain RMSEs of 18.76, 17.07, and 17.13 W·m −2 under all, clear, and cloudy conditions, respectively. Lower errors were found when considering the performance of the model at each measuring location, as shown by the median values of the RMSE for all, clear, and cloudy-sky, i.e., 16.96, 15.44, and 16.00 W·m −2 , respectively. Moreover, the reduction and elimination of previous systematic differences and overall underestimation carried out by LSA_OPER and ERA5 was achieved. The added value of using the satellite cloud information was accessed by comparing with estimates driven by ERA5 total cloud cover, showing an increase of 17% of the RMSE. Finally, the proposed methodology was further validated against independent observations gathered from 52 FLUXNET2015 ground stations over an 11-year period, showing that MARS DLR estimates have a better approximation to observations than the remaining models.
There is potential in using the proposed MARS formulation for operational purposes, however there are still a few steps towards improvement that need to be carried out in the future. These include: (i) the assessment of DLR estimates on a regional level, by producing regional maps and comparing MARS estimates with LSA-SAF product outputs (a fundamental procedure in order to operationalize MARS estimates); (ii) improvements of MARS estimates with enhanced input fields from ECMWF numerical weather prediction (e.g., increased resolution, better model physics, data assimilation); and (iii) other MARS model variants that can make use of other satellite products, such as measurements of thermal infrared bands and the top of atmosphere radiances as inputs for the training phase, similarly to Zhou et al. [23].
An application example for the estimation of hourly DLR values from the MARS model is made available in the Supplementary Materials, including a python code and the two calibrated MARS submodels (i.e., for clear and cloudy skies), and a synthetic test data for a 24-h period.

Conflicts of Interest:
The authors declare no conflict of interest. The sponsors had no role in the design, execution, interpretation, or writing of the study.

Appendix A. LSA-SAF Algorithm
In the LSA-SAF algorithm, DLR (F ↓ ) is estimated through a bulk parameterization given by the following equations, as described by Trigo et al. [14]: where σ is the Stefan-Boltzmann constant, sky and T 4 sky are the sky effective emissivity and the sky effective temperature, respectively. The former is given as a function of the total column of water vapor (tcwv), as follows: where m are values that best adjust to clear and cloudy conditions (0.5 and 1, respectively). For the case of the latter, the following relation is used: where T 0 is the 2-metre temperature corrected through the 2-metre observed dewpoint depression (δ∆d 0 ). The parameters α, β, γ, and δ in Equations (A2) and (A3) are fitted independently for cloudy-sky and clear-sky conditions, and the all-sky DLR is the sum of the clear F ↓ clear ) and cloudy F ↓ cloudy ) contributions considering the cloud fraction (cf ): More information regarding the calibration procedure of the parameters in Equations (A2) and (A3) is presented by Trigo et al. [13]. The method follows a piecewise regression independent for clear and cloudy conditions considering three classes of profiles: (i) dry cold, with tcwv ≤ 10 mm and t2m < 270 K; (ii) dry and warm, with tcwv ≤ 10 mm and t2m > 270 K; and (iii) moist, with tcwv > 8 mm. For the calibration phase of the operational LSA-SAF algorithm (LSA_OPER) atmospheric profiles from the TIGR-like database, namely the tcwv, t2m, d2m, and fluxes simulated with MODTRAN-4, were used. Moreover, the separation of clear and cloudy skies in the calibration database considers the total cloud cover (tcc), where clear and cloudy conditions are assigned for tcc = 0 and tcc > 0.9, respectively, in which a piecewise regression method is then applied to each set of clear and cloudy conditions, separately. The parameters used by the current LSA-SAF operational algorithm (LSA_OPER) are presented in Table A1. Table A1. Calibrated parameters for the LSA_OPER model [14], i.e., the LSA-SAF operational algorithm that makes use of TIGR-like database (1992)(1993) [38] for different atmospheric profiles under clear and cloudy-sky.

Clear-Sky
Cloudy-Sky

Appendix B. Models Training
For the training of both LSA and MARS models (i.e., for clear and cloudy conditions), the full dataset was divided in two components: training and verification. The training dataset was constructed by randomly selecting 40% of the full-time series for each station limited to a maximum of 6 months of data used from each station, with the remaining data being used as the verification dataset. This corresponded to a total of 51,386 (5.87 years) and 932,282 (106.42 years) hourly samples for the training and verification period, respectively. The bias and root mean square error (RMSE) for clear and cloudy conditions in the training and verification samples are presented in Tables A3 and A4, respectively. The results show that MARS always performs better than LSA under clear and cloudy-sky conditions. During the training stage, MARS presents a very small bias (close to zero) as opposed to LSA, which shows a relatively large bias in comparison with higher deviation for the training under clear-sky. The RMSE, besides being lower in MARS (although with less discrepancies than the bias found for each model), does not vary significantly from clear to cloud sky. A similar behavior is observed for the verification phase, despite an overall increase in the bias in both models, which is related to the sampling increase with a higher data availability differential.

Appendix C. Evaluation Detailed Results
The following tables comprise all the statistical error metrics obtained. Tables A5-A7 show the scores for each station in all, clear, and cloudy-sky conditions (respectively).

Appendix D. FLUXNET2015 Validation
The following results show the use of FLUXNET2015 [45] dataset for the validation of the MARS model. To this end, 30-min data from 52 ground stations within the MSG-disk were aggregated to hourly values and then used for validation purposes considering a period of 11 years between 2004 and 2015 (Table A8). In Table A9, a summary of the error metrics distribution for different ranges of DLR and under all-sky conditions is shown. Table A8. List of 52 stations from FLUXNET2015 used within the Meteosat Second Generation (MSG) disk for validation purposes of estimated downward long-wave radiation (DLR) at the surface. The name, acronym, location, geographical coordinates ( • ), elevation (m), availability (total number of years available between 2004 and 2015), and annual mean DLR (W·m −2 ) for each station is shown.