Comparison of tree-based ensemble algorithms for merging satellite and earth-observed precipitation data at the daily time scale

Merging satellite products and ground-based measurements is often required for obtaining precipitation datasets that simultaneously cover large regions with high density and are more accurate than pure satellite precipitation products. Machine and statistical learning regression algorithms are regularly utilized in this endeavour. At the same time, tree-based ensemble algorithms are adopted in various fields for solving regression problems with high accuracy and low computational cost. Still, information on which tree-based ensemble algorithm to select for correcting satellite precipitation products for the contiguous United States (US) at the daily time scale is missing from the literature. In this study, we worked towards filling this methodological gap by conducting an extensive comparison between three algorithms of the category of interest, specifically between random forests, gradient boosting machines (gbm) and extreme gradient boosting (XGBoost). We used daily data from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and the IMERG (Integrated Multi-satellitE Retrievals for GPM) gridded datasets. We also used earth-observed precipitation data from the Global Historical Climatology Network daily (GHCNd) database. The experiments referred to the entire contiguous US and additionally included the application of the linear regression algorithm for benchmarking purposes. The results suggest that XGBoost is the best-performing tree-based ensemble algorithm among those compared...


Introduction
Machine and statistical learning algorithms (e.g., those documented in Hastie et al. 2009 and Tyralis and Papacharalampous 2021) are widely applied for solving regression problems, and tree-based ensemble algorithms stand out among them for their high accuracy and low computational cost. Additionally, they usually do not require extensive preprocessing and hyperparameter tuning to perform well (Hastie et al. 2009). On the other hand, as they are highly flexible algorithms, they are less interpretable than simpler algorithms (e.g., linear regression), due to the well-recognized trade-off between interpretability and flexibility (James et al. 2013).
Notably, the known theoretical properties of the various tree-based ensemble algorithms (including random forests, gradient boosting machines − gbm, and extreme gradient boosting − XGBoost) further support their adoption. Tree-based ensemble algorithms are regularly applied and compared to other machine and statistical learning algorithms for the task of merging satellite products and ground-based measurements. This task is the general focus of this work, together with the general concept of tree-based ensemble algorithms. It is commonly executed in the literature with the aim of obtaining precipitation datasets that cover large geographical regions with high density and, simultaneously, are more accurate than pure satellite precipitation products, and its importance can be perceived through an inspection of the major research topics appearing in the hydrological literature. Nonetheless, a comparison of tree-based ensemble algorithms for the daily temporal resolution and the contiguous US is missing from the literature, with the closest investigations at the moment being those available in the work by Lei et al. (2022), which focuses on China. In this work, we fill this specific literature gap. Notably, the selection of the most accurate regression algorithm from the tree-based ensemble family could be particularly useful at the daily temporal scale, at which the size of the datasets for large geographical areas might impose significant limitations on the application of other accurate machine and statistical learning regression algorithms due to their large computational costs.

Methods
Random forests, gbm and XGBoost were applied in a cross-validation setting (see Section 3.2) for conducting an extensive comparison in the context of merging gridded satellite products and ground-based measurements at the daily time scale. Additionally, the linear regression algorithm was applied in the same setting for benchmarking purposes. In this section, we provide brief descriptions of the four aforementioned algorithms. Extended descriptions are available in the literature cited herein.

Linear regression
The results of this work are reported in terms of relative scores (see Section 3.3). These scores were computed with respect to the linear regression algorithm, which models the dependent variable as a linear weighted sum of the predictor variables (Hastie et al. 2009, pp 43-55). A squared error scoring function facilitated this algorithm's fitting.
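As a minimal illustration of this baseline (a Python/NumPy sketch with toy numbers; the study itself was implemented in R, and all data values here are hypothetical), the weights of the linear model can be obtained by minimizing the sum of squared errors:

```python
import numpy as np

# Toy data: y = 1 + 2x exactly, so the fitted weights are fully recoverable
X = np.array([[0.0], [1.0], [2.0], [3.0]])      # one hypothetical predictor
y = np.array([1.0, 3.0, 5.0, 7.0])              # hypothetical "observed" values

X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend an intercept column
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # least-squares (squared error) fit
print(beta)                                     # [intercept, slope] = [1.0, 2.0]
```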

Random forests
Random forests (Breiman 2001) are presented in detail by Tyralis et al. (2019b), along with a systematic review of their applications in water resources. Notably, random forests are an ensemble learning algorithm and, more precisely, an ensemble of regression trees that is based on bagging (an acronym for "bootstrap aggregation") with an additional randomization procedure. The latter aims at reducing overfitting. In this work, random forests were implemented with all their hyperparameters kept at their default values.
For instance, the number of trees was equal to 500. This methodological choice is adequate, as random forests are known to perform well without tuning as long as they are applied with a large number of trees (Tyralis et al. 2019b).
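This configuration can be sketched as follows (in Python with scikit-learn rather than the R implementation used in the study; the data are synthetic stand-ins for the precipitation predictors):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(300, 4))                  # stand-ins for satellite values/distances
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)  # stand-in for observed precipitation

# 500 trees, matching the default of the R implementation reported in the text;
# the remaining hyperparameters are likewise left at their defaults
rf = RandomForestRegressor(n_estimators=500, random_state=1)
rf.fit(X, y)
print(round(rf.score(X, y), 2))                        # in-sample R^2 on the toy data
```

Note that scikit-learn's own default is 100 trees; `n_estimators=500` is set explicitly here to match the default reported in the text.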

Gradient boosting machines
Another ensemble learning algorithm that was herein used with regression trees as base learners is gbm (Friedman 2001). This algorithm relies on boosting, in which base learners are added sequentially, with each new learner fitted to the errors of the current ensemble.
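The boosting mechanism can be illustrated as follows (a conceptual Python/NumPy sketch with one-split "stump" learners standing in for regression trees and synthetic data; this is not the study's gbm implementation):

```python
import numpy as np

def fit_stump(x, residual):
    """Best single split on x minimizing the squared error of the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_val, right_val = best
    return lambda z: np.where(z <= threshold, left_val, right_val)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)               # hypothetical single predictor
y = np.sin(x) + rng.normal(0, 0.05, 200)  # hypothetical nonlinear response

learning_rate, n_stages = 0.1, 200
prediction = np.full_like(y, y.mean())    # stage 0: the mean
for _ in range(n_stages):
    stump = fit_stump(x, y - prediction)  # fit the current residuals...
    prediction += learning_rate * stump(x)  # ...and add the shrunken new stage

print(round(float(np.mean((y - prediction) ** 2)), 4))  # small training MSE
```

The shrinkage factor (learning rate) is what distinguishes gradient boosting from naive stage-wise fitting: each stage contributes only a fraction of its fit, which slows the reduction of training error but improves generalization.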

Data
For our experiments, we retrieved and used daily earth-observed precipitation, gridded satellite precipitation and elevation data for the gauged locations and grid points shown in Figures 1 and 2.

Earth-observed precipitation data
Daily precipitation totals from the Global Historical Climatology Network daily (GHCNd) database (Durre et al. 2008, 2010; Menne et al. 2012) were used for comparing the algorithms. More precisely, data from 7 264 earth-located stations spanning the contiguous US (see Figure 1) were extracted. These data cover the two-year period 2014−2015.
Data retrieval was made from the website of the NOAA National Climatic Data Center (https://www1.ncdc.noaa.gov/pub/data/ghcn/daily; accessed on 2022-02-27).

Satellite precipitation data
Notably, the extracted PERSIANN data were used in the experiments with their original spatial resolution, while the extracted GPM IMERG data were used for forming data with a spatial resolution of 0.25° × 0.25° by applying bilinear interpolation on the CMORPH 0.25° grid. The herein formed data and their grid (see Figure 2b) are those referred to in what follows as "IMERG values" and "IMERG grid", respectively, and are the ones used in the experiments.

Elevation data
Elevation is a key predictor variable for many hydrological processes (Xiong et al. 2022).
Therefore, we estimated its value for all the geographical locations shown in Figure 1. For this estimation, we relied on the Amazon Web Services (AWS) Terrain Tiles (https://registry.opendata.aws/terrain-tiles; accessed on 2022-09-25).

Validation setting and predictor variables
To formulate the regression settings of this work, we first defined earth-observed daily total precipitation at a point of interest (which could be station 1 in Figure 3) as the dependent variable. Then, we adopted procedures proposed in Papacharalampous et al.
(2023) to compute the observations of possible predictor variables. Separately for each of the two satellite precipitation grids (see Figure 2), we determined the four closest grid points to each ground-based station from those depicted in Figure 1. We also computed the distances di, i = 1, 2, 3, 4 from these grid points and indexed the latter as Si, i = 1, 2, 3, 4 based on the following order: d1 < d2 < d3 < d4 (see Figure 3). Throughout this work, the distances di, i = 1, 2, 3, 4 are also called "PERSIANN distances 1−4" or "IMERG distances 1−4" (depending on whether we refer to the PERSIANN grid or the IMERG grid), and the daily precipitation values at the grid points 1−4 are correspondingly called "PERSIANN values 1−4" or "IMERG values 1−4".

Figure 3. Setting of the regression problem. The term "grid point" is used to describe the geographical locations with satellite data, while the term "station" is used to describe the geographical locations with ground-based measurements.
Based on the above, the predictor variables for the technical problem of interest could include the PERSIANN values 1−4, the IMERG values 1−4, the PERSIANN distances 1−4, the IMERG distances 1−4 and the station's elevation. We defined and examined three sets of predictor variables (see Table 1).
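The nearest-grid-point computation described above can be sketched as follows (a Python/NumPy illustration with a hypothetical station location and a toy 0.25-degree grid; planar Euclidean distances are used for simplicity, whereas the actual grids are in geographical coordinates):

```python
import numpy as np

station = np.array([0.30, 0.40])                  # hypothetical station (lon, lat)
grid = np.array([(i * 0.25, j * 0.25)             # toy 0.25-degree grid
                 for i in range(5) for j in range(5)])

d = np.sqrt(((grid - station) ** 2).sum(axis=1))  # distances to all grid points
order = np.argsort(d)[:4]                         # indices S1..S4 of the 4 closest
distances = d[order]                              # sorted so that d1 < d2 < d3 < d4
print(grid[order])
print(distances)
```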

Performance metrics and assessment
The performance assessment relied on procedures proposed by Papacharalampous et al. (2023). The results are reported in terms of relative scores, each computed with respect to a reference modelling approach, which was defined as the linear regression algorithm run with the same predictor set as the modelling approach to which the relative score referred. More precisely, the relative score was computed as the difference between the score of the set {algorithm, predictor set} and the score of the reference modelling approach, multiplied by 100 and divided by the score of the reference modelling approach. Then, mean relative scores (also referred to as "mean relative improvements" throughout this work) were computed by averaging, separately for each set {algorithm, predictor set}, the relative scores. The procedures for computing the relative scores and the mean relative scores were repeated by considering the set {linear regression, predictor set 1} as the reference modelling approach for all the sets {algorithm, predictor set}.
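The relative score definition above can be expressed compactly as follows (a Python sketch with hypothetical score values; lower squared-error scores are better, so negative relative scores indicate an improvement over the reference):

```python
def relative_score(score, reference_score):
    """100 * (score - reference) / reference; negative values mean improvement,
    since lower squared-error scores are better."""
    return 100.0 * (score - reference_score) / reference_score

lr_score = 2.0   # hypothetical median squared error of the linear regression reference
xgb_score = 1.5  # hypothetical median squared error of a tree-based algorithm
print(relative_score(xgb_score, lr_score))  # -25.0, i.e., a 25% improvement
```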
Mean rankings of the machine and statistical learning algorithms were also computed.
For that, and separately for each set {case, predictor set} (with each case belonging to one test fold only), we first ranked the four algorithms based on their squared errors. Then, we averaged these rankings, separately for each set {predictor set, test fold}. Lastly, we obtained the reported mean rankings by averaging, for each predictor set, the two previously computed mean ranking values (one per test fold). We also computed the rankings collectively for all the predictor sets.
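The ranking procedure can be sketched as follows (a Python/NumPy illustration with toy squared errors for two folds of three cases each; the toy numbers are hypothetical, and the actual experiments comprised far more cases):

```python
import numpy as np

algorithms = ["linear regression", "random forests", "gbm", "XGBoost"]
# errors[fold][case][algorithm]: toy squared errors, 2 folds x 3 cases x 4 algorithms
errors = np.array([[[4.0, 2.0, 3.0, 1.0],
                    [4.0, 1.0, 3.0, 2.0],
                    [3.0, 2.0, 4.0, 1.0]],
                   [[4.0, 2.0, 3.0, 1.0],
                    [4.0, 3.0, 2.0, 1.0],
                    [4.0, 1.0, 3.0, 2.0]]])

per_case_ranks = errors.argsort(axis=2).argsort(axis=2) + 1  # rank 1..4 per case
fold_means = per_case_ranks.mean(axis=1)                     # mean rank per test fold
mean_rankings = fold_means.mean(axis=0)                      # average the two folds
print(dict(zip(algorithms, mean_rankings)))
```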

Regression setting exploration
Regression setting explorations can facilitate interpretations of the results of prediction experiments, at least to some extent. Therefore, in Figure 4, we present the Spearman correlation estimates for the various variable pairs appearing in the regression settings examined in this work. As could be expected, the magnitude of the relationships between the predictand (i.e., the precipitation quantity observed at the earth-located stations, which is referred to as "true value" in Figure 4) and the 17 predictor variables seems to depend, to some extent, on the satellite rainfall product.

Figure 6 facilitates the comparison of the four machine and statistical learning algorithms in terms of the square error function, separately for each predictor set. The mean relative improvements (see Figure 6a) suggest that XGBoost is the best algorithm for all the predictor sets. For predictor set 1 (which incorporates, among others, information from the PERSIANN gridded precipitation dataset; see Table 1), random forests exhibit a performance very similar to that of XGBoost. At the same time, for predictor set 2 (which incorporates, among others, information from the IMERG gridded precipitation dataset; see Table 1), gbm exhibits a performance very similar to that of XGBoost.

Algorithm comparison
Additionally, the mean rankings (see Figure 6b) of random forests and XGBoost are of similar magnitude. In terms of the same criterion, gbm scores much closer to random forests and XGBoost than to linear regression. Indeed, linear regression is much more likely to be found in the fourth (i.e., the last) place than in any other place. It is also more likely to be found in the first place than in the second and third places. At the same time, random forests are more likely to be ranked first than second, third or fourth, and gbm appears most often in the second and third positions. The last position is the least likely for both gbm and XGBoost. The latter algorithm is more likely to be ranked second; yet, the first and third positions are also far more likely than the last.

Lastly, Figures 8 and 9 allow us to compare the degree of information that is offered by the two gridded precipitation products within the context of our regression problem, beyond the comparisons already allowed by the variable importance explorations using the gain metric incorporated into the XGBoost algorithm (see again Figure 5).
Overall, the IMERG dataset was proven to be far more information-rich than the PERSIANN dataset, in terms of both mean relative improvement (see Figure 8a) and mean ranking (see Figure 8b). Indeed, the relative improvements with respect to the linear regression algorithm run with predictor set 1 are much larger for the tree-based algorithms when these algorithms are run with predictor set 2 than when they are run with predictor set 1. Additionally, predictor set 3 (which contains information from both gridded precipitation datasets) does not notably improve the performance in terms of mean relative improvements with respect to predictor set 2, although it does in terms of mean ranking. While the best modelling choice is {XGBoost, predictor set 3}, random forests were ranked in the first two positions more often than any other algorithm for predictor sets 2 and 3, when the ranking was made collectively for all the predictor sets (see Figure 9). Still, for the same predictor sets, XGBoost appeared in the last few positions much less often and achieved the best performance in terms of mean ranking when run with predictor set 3.

Figure 8. Heatmaps of the: (a) relative improvement (%) in terms of the median square error metric, averaged across the two folds, as provided by each tree-based ensemble algorithm with respect to the linear regression algorithm run with predictor set 1; and (b) mean ranking of each machine and statistical learning algorithm, averaged across the two folds. The computations were made collectively for all the predictor sets. The more reddish the colour, the better the predictions on average.

Figure 9. Heatmaps of the percentages (%) with which the four machine and statistical learning algorithms were ranked from 1 to 12 for the predictor sets (a−c) 1−3. The rankings summarized in this figure were computed separately for each case and collectively for all the predictor sets.

Discussion
Overall, XGBoost was proven to perform notably better than random forests and gbm when merging gridded satellite precipitation products and ground-based precipitation measurements at the daily time scale. Further improvements in this context would require working on machine and statistical learning methods, such as those summarized and popularized in the reviews by Papacharalampous and Tyralis (2022a) and Tyralis and Papacharalampous (2022a).

Conclusions
Precipitation datasets that simultaneously cover large regions with high density and are more accurate than satellite precipitation products can be obtained by correcting such products using earth-observed datasets together with machine and statistical learning regression algorithms. Tree-based ensemble algorithms are adopted in various fields for solving regression problems with high accuracy and lower computational cost compared with other algorithms. Still, information on which tree-based ensemble algorithm to select when the merging is conducted for the contiguous United States (US) and at the daily time scale, at which the computational requirements might constitute a crucial factor to consider along with accuracy, is missing from the literature of satellite precipitation product correction.
Herein, we worked towards filling this methodological gap. We conducted an extensive comparison between three tree-based ensemble algorithms, specifically random forests, gradient boosting machines (gbm) and extreme gradient boosting (XGBoost). We exploited daily information from the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) and the IMERG (Integrated Multi-satellitE Retrievals for GPM) gridded datasets, and daily earth-observed information from the Global Historical Climatology Network daily (GHCNd) database. The entire contiguous US was examined, and generalizable results were obtained. These results indicate that XGBoost is more accurate than random forests and gbm. They also indicate that IMERG is more useful than PERSIANN in the context investigated.

Conflicts of interest:
The authors declare no conflict of interest.
Author contributions: GP and HT conceptualized and designed the work with input from AD and ND. GP and HT performed the analyses and visualizations, and wrote the first draft, which was commented on and enriched with new text, interpretations and discussions by AD and ND.

Acknowledgments:
The authors are sincerely grateful to the Journal for inviting the submission of this paper, and to the Editor and Reviewers for their constructive remarks.

Appendix A Statistical software
We used the R programming language (R Core Team 2022) to implement the algorithms, and to report and visualize the results.
For data processing and visualizations, we used the contributed R packages caret