Comparison of machine learning algorithms for merging gridded satellite and earth-observed precipitation data

Gridded satellite precipitation datasets are useful in hydrological applications as they cover large regions with high density. However, they are not accurate in the sense that they do not agree with ground-based measurements. An established means for improving their accuracy is to correct them by adopting machine learning algorithms. This correction takes the form of a regression problem, in which the ground-based measurements have the role of the dependent variable and the satellite data are the predictor variables, together with topography factors (e.g., elevation). Most studies of this kind involve a limited number of machine learning algorithms, and are conducted for a small region and for a limited time period. Thus, the results obtained through them are of local importance and do not provide more general guidance and best practices. To provide results that are generalizable and to contribute to the delivery of best practices, we here compare eight state-of-the-art machine learning algorithms in correcting satellite precipitation data for the entire contiguous United States and for a 15-year period. We use monthly data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) gridded dataset, together with monthly earth-observed precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm). The results suggest that extreme gradient boosting (XGBoost) and random forests are the most accurate in terms of the squared error scoring function. The remaining algorithms can be ordered as follows from the best to the worst: Bayesian regularized feed-forward neural networks, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks, linear regression.


Introduction
Knowing the quantity of precipitation at a dense spatial grid and for an extensive time period is important in solving a variety of hydrological engineering and science problems, including many of the major unsolved problems listed in Blöschl et al. (2019). The main sources of precipitation data are ground-based gauge networks and satellites (Sun et al. 2018). Data from ground-based gauge networks are precise; however, maintaining such a network with high spatial density and for a long time period is costly. On the other hand, satellite precipitation data are cheap to obtain but not accurate (Mega et al.). By merging gridded satellite precipitation products and ground-based measurements, we can obtain data that are more accurate than the raw satellite data and, simultaneously, cover space with a much higher density compared to the ground-based measurements.
This merging is practically a regression problem in a spatial setting, with the satellite data being the predictor variables and the ground-based data being the dependent variables.
Such kinds of problems are also commonly referred to under the term "downscaling" and are special types of spatial interpolation. The latter problem is met in a variety of fields (see, e.g., the reviews by Bivand et al.). Spatial interpolation of precipitation by merging satellite precipitation products and ground-based measurements has been conducted at multiple temporal and spatial scales by using a variety of regression algorithms, including several machine learning ones. A non-exhaustive list of previous works on the topic and a summary of their methodological information can be found in Table 1. Notably, this table is indicative of the large diversity in the temporal and spatial scales examined and in the algorithms utilized.

In this study, we work towards filling the above-identified gap. More precisely, we compare a larger number of machine learning algorithms than usual (see Table 1) with respect to how accurate they are in providing estimates of total monthly precipitation in spatial interpolation settings by merging gridded satellite products and ground-based measurements. Also, the comparison is made for a long time period and for a large geographical area (again contrary to the most common strategy that appears currently in the literature), with this area also having a dense ground-based gauge network, thereby leading to trustworthy results for the monthly time scale. Moreover, proper evaluations are made according to theory and best practices from the field of statistics, with the methodological aspects developed in this endeavour contributing to the transfer of knowledge in the overall topic of spatial interpolation using machine and statistical learning algorithms.
The remainder of the paper is structured as follows: Section 2 describes the algorithms selected and the methodology followed for exploring the relevant regression setting.
Section 3 presents the data and the validation procedure. Section 4 presents the results.
Section 5 discusses the most important findings and provides recommendations for future research. Section 6 concludes the work.

Machine learning algorithms for spatial interpolation
Eight machine learning algorithms were implemented in this work for conducting spatial interpolation and were extensively compared with each other in the context of merging gridded satellite products and gauge-based measurements. In this section, we briefly describe these algorithms, while their detailed description can be found in Hastie et al. (2009).

Linear regression
A linear regression algorithm models the dependent variable as a linear weighted sum of the predictor variables (Hastie et al. 2009, pp 43-55). The algorithm is optimized with a squared error scoring function.
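The optimization with a squared error scoring function admits a closed-form solution. As a minimal illustration (a Python sketch for a single predictor via the normal equations, not the R implementation used in the study):

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing sum((y - a - b*x)^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares estimates for one predictor.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Toy data lying exactly on y = 1 + 2x, so OLS recovers the line.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_simple_ols(x, y)
```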

Multivariate adaptive regression splines
The multivariate adaptive regression splines (MARS; Friedman 1991, 1993) algorithm models the dependent variable as a weighted sum of piecewise linear basis functions of the predictor variables. Multivariate adaptive polynomial splines (poly-MARS; Kooperberg 2022) constitute a variant of MARS. In the present work, the poly-MARS model was implemented with the default parameters.

Random forests
Random forests (Breiman 2001a) are an ensemble of regression trees based on bagging (acronym for "bootstrap aggregation"). The benefits accompanying the application of this algorithm were summarized by Tyralis et al. (2019b), who also documented its recent popularity in hydrology with a systematic literature review. In random forests, a fixed number of predictor variables are randomly selected as candidates when determining the nodes of the regression tree. Herein, random forests were implemented with the default parameters. The number of trees was equal to 500.

Gradient boosting machines
Gradient boosting machines (gbm) are an ensemble learning algorithm. In brief, they iteratively train new base learners on the errors of the previously trained base learners, with these errors being computed according to a scoring function. In the implementation of this work, the optimization's scoring function was the squared error and the base learners were regression trees. Also, the number of trees was set equal to 500 for keeping consistency with the implementation of the random forest algorithm. The defaults were used for the remaining parameters.
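The iterative scheme described above can be sketched compactly. The following is a didactic Python sketch of boosting with the squared error scoring function, using depth-1 regression trees ("stumps") on a single predictor as base learners; it is not the gbm R implementation of the study:

```python
def fit_stump(x, r):
    """Best single split on x minimizing squared error against residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def boost(x, y, n_trees=50, lr=0.1):
    """Each iteration fits a stump to the current residuals and adds a
    shrunken copy of it to the ensemble's predictions."""
    pred = [sum(y) / len(y)] * len(y)   # start from the mean
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred

# A step function is recovered to high accuracy after 50 iterations.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 3.0, 3.0]
pred = boost(x, y)
```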

Extreme gradient boosting
Extreme gradient boosting (XGBoost; Chen and Guestrin 2016) is another boosting algorithm. It is considerably faster and performs better in comparison to traditional implementations of boosting algorithms. It is also further regularized compared to such implementations for controlling overfitting. In the implementation of this work, the maximum number of boosting iterations was set equal to 500. The remaining parameters were kept as default; for instance, the maximum depth of each tree was kept equal to 6.

Feed-forward neural networks
Artificial neural networks (or simply "neural networks") extract linear combinations of the predictor variables as derived features and, subsequently, model the dependent variable as a nonlinear function of these features (Hastie et al. 2009, p 389). Herein, we used feed-forward neural networks (Ripley 1996, pp 143-180). The number of units in the hidden layer and the maximum number of iterations were set equal to 20 and 1 000, respectively, while the remaining parameters were kept as default.
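The quoted description, linear combinations of the predictors as derived features followed by a nonlinear function of those features, can be written down directly. Below is an illustrative Python forward pass through a one-hidden-layer network with tanh activations and a linear output; the weights are fixed toy values (in practice they are learned), and this is not the R implementation used in the study:

```python
import math

def forward(x, w_hidden, b_hidden, w_out, b_out):
    # Derived features: one linear combination of x per hidden unit,
    # passed through the tanh nonlinearity.
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
              for w, b in zip(w_hidden, b_hidden)]
    # Linear output layer on top of the derived features.
    return sum(wo * h for wo, h in zip(w_out, hidden)) + b_out

# Two predictors, two hidden units (toy weights for illustration).
x = [0.5, -1.0]
w_hidden = [[1.0, 0.5], [-0.5, 1.0]]
b_hidden = [0.0, 0.1]
w_out = [2.0, -1.0]
b_out = 0.5
y_hat = forward(x, w_hidden, b_hidden, w_out, b_out)
```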

Feed-forward neural networks with Bayesian regularization
Feed-forward neural networks with Bayesian regularization (MacKay 1992) for avoiding overfitting were also employed herein. In the respective implementation, the number of neurons was set equal to 20 and the remaining parameters were kept as default. For instance, the maximum number of iterations was kept equal to 1 000.

Variable importance metric
We computed the random forests' permutation importance of the predictor variables, a metric measuring the mean increase of the prediction mean squared error on the out-of-bag samples when the values of the respective predictor variable are randomly permuted.
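The permutation-importance idea can be sketched as follows: permute one predictor column, recompute the mean squared error, and report the increase over the baseline error. In this illustrative Python sketch the "model" is a known function rather than a trained random forest, and out-of-bag handling is omitted:

```python
import random

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(model, X, y, col, rng):
    """MSE increase after randomly permuting column `col` of X."""
    baseline = mse(y, [model(row) for row in X])
    shuffled = [row[:] for row in X]
    values = [row[col] for row in shuffled]
    rng.shuffle(values)
    for row, v in zip(shuffled, values):
        row[col] = v
    return mse(y, [model(row) for row in shuffled]) - baseline

rng = random.Random(42)
X = [[float(i), rng.random()] for i in range(30)]   # col 0 informative, col 1 noise
model = lambda row: 2.0 * row[0]                    # depends only on column 0
y = [model(row) for row in X]
imp_informative = permutation_importance(model, X, y, 0, rng)
imp_noise = permutation_importance(model, X, y, 1, rng)
```

Permuting the informative column degrades the error, while permuting the irrelevant column leaves it unchanged, which is exactly the signal the importance metric exploits.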

Data
Our experiments relied totally on open databases that offer earth-observed precipitation data at the monthly temporal resolution, gridded satellite precipitation data and elevation data for all the gauged locations and grid points shown in Figure 1.

Earth-observed precipitation data
Total monthly precipitation data of the Global Historical Climatology Network monthly database, version 2 (GHCNm; Peterson and Vose 1997) were used for the verification of the algorithms implemented for spatial interpolation. From the entire database, 1 421 stations that are located in the contiguous United States were extracted, and data that span the time period 2001-2015 were selected. These data were sourced from the website of the National Oceanic and Atmospheric Administration (NOAA) (https://www.ncei.noaa.gov/pub/data/ghcn/v2; accessed on 2022-09-24).

Satellite precipitation data
For the application, we additionally used precipitation data from the current operational PERSIANN dataset. The final product spans a grid with a spatial resolution of 0.25° x 0.25°. We extracted a grid that spans the contiguous United States for the time period 2001-2015. We also transformed daily precipitation into total monthly precipitation for supporting the investigations of this work.

Elevation data
For all the gauged geographical locations and the grid points shown in Figure 1, elevation data were computed by using the get_elev_point function of the elevatr R package (Hollister 2022).

Validation setting and predictor variables
We define the earth-observed total monthly precipitation at the point of interest as the dependent variable. Notably, the ground-based stations are located irregularly in the region (see Figure 1); thus, the problem of defining predictor variables is not the usual one that is met in problems with tabular data. To form the regression settings, we found, separately for each station, the closest four grid points and we computed the distances di, i = 1, 2, 3, 4 (in meters) from those points. We also indexed the points Si, i = 1, 2, 3, 4 according to their distance from the station, where d1 < d2 < d3 < d4 (see Figure 2).

Figure 2. Setting of the regression problem. Note that the term "grid point" is used to describe the geographical locations with satellite data, while the term "station" is used to describe the geographical locations with ground-based measurements. Note also that, throughout the present work, the distances di, i = 1, 2, 3, 4 are also referred to as "distances 1−4", respectively, and the total monthly precipitation values at the grid points 1−4 are referred to as "PERSIANN values 1−4", respectively.
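The construction described above can be sketched in Python: for a station, find the four closest grid points and the distances d1 < d2 < d3 < d4 to them. Great-circle (haversine) distances are used here as a plausible choice; the exact distance definition of the study is not reproduced, and the grid below is a toy one:

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def four_closest(station, grid_points):
    """Return [(distance_m, (lon, lat)), ...] for the 4 nearest grid points."""
    lon_s, lat_s = station
    dists = sorted((haversine(lon_s, lat_s, lon, lat), (lon, lat))
                   for lon, lat in grid_points)
    return dists[:4]

# Toy 0.25-degree grid around a station located at (-100.1, 40.1).
grid = [(-100.0 + 0.25 * i, 40.0 + 0.25 * j)
        for i in range(-2, 3) for j in range(-2, 3)]
nearest = four_closest((-100.1, 40.1), grid)
d = [dist for dist, _ in nearest]   # d1 <= d2 <= d3 <= d4
```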
Possible predictor variables for the technical problem of the present work are the total monthly precipitation values at the four closest grid points (which are referred to as "PERSIANN values 1−4"), the respective distances from the station (which are referred to as "distances 1−4"), the station's elevation, and the station's longitude and latitude. We defined and examined three different regression settings. Each of these corresponds to a different set of predictor variables (see Table 2).
The predictor sets 1 and 2 do not account directly for possible spatial dependencies, as the station's longitude and latitude are not part of them. Still, by using these predictor sets, spatial dependence is modelled indirectly, through covariance information (satellite precipitation at close points and station elevation). The predictor set 2 includes more information than the predictor set 1, namely the distances between the station location and the closest grid points. The predictor set 3 allows spatial dependence modelling, as it comprises the station's longitude and latitude.
The dataset is composed of 91 623 samples. Each sample includes the total monthly precipitation observation at a specific earth-located station for a specified month and a specified year, as well as the respective values of the predictor variables, with the latter being dependent on the regression setting (see Table 2). The results of the performance comparison are obtained within a five-fold cross-validation setting.
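A minimal sketch of the five-fold split: the samples are partitioned into five disjoint folds, and each fold serves once as the test set while the remaining four are used for training. This is an illustrative Python sketch, not the study's R code:

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return a list of k disjoint index lists covering range(n_samples)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)   # random assignment of samples to folds
    return [idx[i::k] for i in range(k)]

# Five folds over the 91 623 samples of the dataset.
folds = k_fold_indices(91623, k=5)
```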
Overall, the validation setting proposed in this work offers several benefits.

To deliver exploratory insight into the technical problem investigated in this work, we additionally estimated the Spearman correlation (Spearman 1904) for all the possible pairs of the variables appearing in the regression settings. We also ranked the total of the predictor variables with respect to their importance in predicting the dependent variable.
The latter was performed after estimating the importance according to Section 2.2.

Performance metrics and assessment
To compare the algorithms outlined in Section 2.1 in performing the spatial interpolation, we used the squared error scoring function. This function is defined by

S(x, y) := (x − y)^2

In the above equation, y is the realization (observation) of the spatial process and x is the prediction. The squared error scoring function is consistent for the mean functional (Gneiting 2011).

To extend the comparison by also including the assessment of differences in performance across predictor sets, the procedures for computing the relative and mean relative scores were repeated by considering the set {linear regression, predictor set 1} as the reference case for all the sets {machine learning algorithm, predictor set}. In addition to the two types of relative improvements, we present information on the rankings of the machine learning algorithms. For obtaining the respective results, we first ranked the eight machine learning algorithms, separately for each set {case, predictor set} (with each case belonging to one test fold only). Then, we grouped these rankings per set {predictor set, test fold} and computed their mean. Lastly, we averaged the five mean ranking values corresponding to each predictor set and provided the results of this procedure, which are referred to in what follows as "mean rankings". Moreover, we repeated the mean ranking computation after computing the rankings collectively for all the predictor sets.
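The relative-improvement and mean-ranking computations can be sketched as follows, with made-up squared-error scores for three algorithms over a handful of cases (the numbers are purely illustrative, not results of the study):

```python
def relative_improvement(score, reference_score):
    """Fractional reduction of the score relative to the reference."""
    return (reference_score - score) / reference_score

def mean_rankings(scores_by_alg):
    """Rank algorithms per case (1 = best, lowest error), then average."""
    algs = list(scores_by_alg)
    n_cases = len(next(iter(scores_by_alg.values())))
    totals = {a: 0.0 for a in algs}
    for c in range(n_cases):
        ordered = sorted(algs, key=lambda a: scores_by_alg[a][c])
        for rank, a in enumerate(ordered, start=1):
            totals[a] += rank
    return {a: totals[a] / n_cases for a in algs}

# Illustrative per-case squared-error scores; linear regression is the reference.
scores = {
    "linear_regression": [4.0, 5.0, 6.0],
    "random_forests":    [2.0, 3.0, 2.5],
    "xgboost":           [1.8, 2.9, 2.6],
}
ri_rf = relative_improvement(sum(scores["random_forests"]) / 3,
                             sum(scores["linear_regression"]) / 3)
ranks = mean_rankings(scores)
```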
Notably, we did not compare the algorithms using alternative scoring functions (e.g., the absolute error scoring function) because such functions may not be consistent for the mean functional (excluding functions of the Bregman family; Gneiting 2011). It is also possible to use other skill scores (e.g., the Nash-Sutcliffe efficiency, which is used widely in hydrology). Here, we preferred to use the simple linear regression algorithm as a reference technique. We believe that this choice is credible because of the simplicity and ease of use of the algorithm.

Other relationships that are notably strong and, thus, expected, at least at an initial stage, to be particularly beneficial for estimating precipitation in the spatial setting adopted herein are those indicated by the values 0.45 and −0.40 (which again appear in the same column of the same heatmap; see Figure 3). The former of these two values refers to the relationship between the precipitation quantity observed at an earth-located station and the longitude at the location of this station, while the latter refers to the relationship between the precipitation quantity observed at an earth-located station and the elevation at the location of this station. The remaining relationships between the predictand and the predictor variables are found to be less strong; nonetheless, they could also be worth exploiting in the regression setting. Examples of the above-discussed relationships can be further inspected through Figure 4. Figure 5 additionally provides the ordering of the same estimates, which is also the ordering of the 11 predictor variables according to their importance.
The longitude at the location of the earth-located station is the most important predictor variable (probably because it is a spatial characteristic and the regression is made in a spatial setting), followed by the precipitation quantities drawn from the first, second and fourth closest points to the earth-located station at the PERSIANN grid. These latter three predictors are followed by the elevation at the location of the earth-located station. The next predictor in terms of importance is the precipitation quantity drawn from the third closest point to the earth-located station at the PERSIANN grid. The latitude at the location of the earth-located station follows, and the four variables referring to distances are the least important ones.

Extreme gradient boosting (XGBoost) and random forests are the two best-performing algorithms. In terms of mean relative improvements, the former of these algorithms showed a much better performance than the latter when they were both run with the predictor sets 1 and 2, and a slightly better performance than the latter when they were both run with the predictor set 3. Feed-forward neural networks with Bayesian regularization follow in line and, in terms of mean rankings, were empirically proven to have an almost equally good performance with random forests. Multivariate adaptive polynomial splines (poly-MARS) and gradient boosting machines (gbm) are the fourth- and fifth-best-performing algorithms, respectively. While the mean rankings corresponding to the latter two algorithms do not suggest large differences in their performance, the mean relative improvements favour poly-MARS to a notable extent. In terms of both mean relative improvements and mean rankings, feed-forward neural networks performed better than gbm and multivariate adaptive regression splines (MARS) when these three algorithms were run with the predictor set 1. The linear regression algorithm was the worst for all the predictor sets investigated in this work.
For the predictor sets 2 and 3, feed-forward neural networks were the second-worst algorithm, with performance very close to that of linear regression, probably due to overfitting.

For the predictor set 3 (see Figure 7c), the frequency with which each algorithm appeared in the various positions from the first to the last exhibits more similarities with what was found for the predictor set 2 than with what was found for the predictor set 1.

Yet, there are a few notable differences with respect to this good reference case. In fact, although the XGBoost algorithm appeared more often, here as well, in the first and last positions, the frequency of its appearance in the remaining positions was notably larger than the respective frequency for the case of the predictor set 2. In addition, the random forest algorithm appeared more often in the third, fourth, fifth and sixth positions than it did for the same reference case.

Discussion
In summary, the large-scale comparison showed that the machine learning algorithms of this work can be ordered from the best to the worst in regard to their accuracy in correcting satellite precipitation products at the monthly temporal scale as follows: extreme gradient boosting (XGBoost), random forests, feed-forward neural networks with Bayesian regularization, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks and linear regression.

Conclusions
Hydrological applications often rely on gridded precipitation datasets from satellites, as these datasets cover large regions with higher spatial density compared to the datasets that comprise ground-based measurements. Still, the former datasets are less accurate than the latter, with the various machine learning algorithms constituting an established means for improving their accuracy in regression settings. In these settings, the ground-based measurements play the role of the dependent variable and the satellite data play the role of the predictor variables, together with data for topography factors (e.g., elevation). The studies devoted to this important endeavour are numerous; still, most of them involve a limited number of machine learning algorithms, and are also conducted for a small region and a limited time period. Thus, their results are mostly of local importance, and cannot support the derivation of more general guidance and best practices.
In this work, we moved beyond the above-outlined standard approach by comparing eight machine learning algorithms in correcting precipitation satellite data for the entire contiguous United States and over a 15-year period. More precisely, we exploited monthly precipitation satellite data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) gridded dataset and monthly earth-observed precipitation data from the Global Historical Climatology Network monthly database, version 2 (GHCNm), and based the comparison on the squared error scoring function. Overall, extreme gradient boosting (XGBoost) and random forests were found to be the most accurate algorithms, with the former being more accurate than the latter to a small extent for the majority of the scores computed. The remaining algorithms can be ordered from the best- to the worst-performing as follows: feed-forward neural networks with Bayesian regularization, multivariate adaptive polynomial splines (poly-MARS), gradient boosting machines (gbm), multivariate adaptive regression splines (MARS), feed-forward neural networks, linear regression.
Aside from the above ordering, which constitutes, in our opinion, the most important finding of the present work, important findings on the selection of predictor variables in the field of satellite precipitation data correction were also obtained for the monthly time scale. Indeed, we found that the distances of the four closest grid points from a ground-based station, as well as this station's longitude and latitude, can offer improvements in predictive performance, when utilized as predictor variables for most of the machine learning algorithms assessed (including the best-performing ones), together with the monthly precipitation values at the four closest grid points and the station's elevation.
Also importantly, we proposed a new validation setting that could bring considerable benefits to future comparisons of machine and statistical learning algorithms in the field.
These benefits are enumerated in Section 3.2. Even more generally, we proposed an authentic methodological framework and contributed to the transfer of theory and best practices from the field of statistics to the field of satellite precipitation data correction.

Conflicts of interest:
The authors declare no conflict of interest. Acknowledgements: The authors are sincerely grateful to the Journal for inviting the submission of this paper, and to the Editor and Reviewers for their constructive remarks.
They would also like to acknowledge the contribution of the late Professor Yorgos Photis in the proposal of the research project BETTER RAIN.

Appendix A Statistical software information
We used the R programming language (R Core Team 2022) to implement the algorithms, and to report and visualize the results.
For data processing and visualizations, we used the contributed R packages caret