Validation of Satellite, Reanalysis and RCM Data of Monthly Rainfall in Calabria (Southern Italy)

Skills in reproducing monthly rainfall over Calabria (southern Italy) have been validated for the Climate Hazards group InfraRed Precipitation with Station data (CHIRPS) satellite dataset, the E-OBS dataset and 13 Global Climate Model-Regional Climate Model (GCM-RCM) combinations belonging to the ENSEMBLES project output set. To this aim, 73 rainfall series for the period 1951–1980 and 79 series for the period 1981–2010 have been selected from the database managed by the Multi-Risk Functional Centre of the Regional Agency for Environmental Protection (Regione Calabria). The relative mean and standard deviation errors, and the Pearson correlation coefficient, have been used as validation metrics. Results showed that the CHIRPS satellite data (available only for the 1981–2010 validation period) and the RCMs driven by the ECHAM5 Global Climate Model performed better, both in mean error and in standard deviation error, than the other datasets. Moreover, a slight but appreciable improvement in performance for all ECHAM5-based models and for the E-OBS dataset has been observed in the 1981–2010 time-period. The whole validation-and-assessment procedure applied in this work is general and easily applicable wherever ground data and gridded data are available. This procedure might help scientists and policy makers to select, among the available datasets, those best suited for further applications, even in regions with complex orography and an inadequate number of representative stations.


Introduction
Climate monitoring and analysis have received growing attention. Indeed, assessments have evidenced that temperature change over the last 50 years of the 20th century to a great extent results from anthropogenic forcings [1]. Observations are essential to climate monitoring since they are the basis for: (i) assessing century-scale trends; (ii) the validation of climate models; (iii) the detection and attribution of changes in climate at the regional scale. In particular, precipitation is a subject of special concern: it is the main component of the global water cycle, a major contributor to extreme events, and a crucial parameter in water resources management.
Precipitation observation is based primarily on ground rain gauges, and secondarily on weather radars and satellite retrievals. While rain gauges generally produce the most reliable observational results, they are often sparsely distributed; thus, they may not be fully representative of a region, especially for large areas with few observations [2]. In regions with complex orography and scarce human settlements, rain gauges do not provide enough data to resolve precipitation processes in simulation studies. Satellite retrievals and climate reanalyses have thus been used to create regular data grids, in order to fill in missing observations and to address the scarcity of stations in ungauged regions [3]. The validation of such gridded products typically relies on direct comparisons with gauge data, on the analysis of bias and correlations, and on the use of probability distribution functions as possible metrics. To evaluate skills in estimating and reproducing total monthly precipitation in CHIRPS, Funk et al. [37,38] and Toté et al. [35] used mean absolute errors and correlation coefficients for the Sahel, Afghanistan, south-western North America, Colombia, Mexico, Peru and Mozambique. Dembele and Zwart [39] studied the performance of seven gridded satellite rainfall products, comparing them with data from nine weather stations in Burkina Faso. They worked on a point-to-pixel basis at various time steps (from daily to annual), using the Pearson correlation coefficient, mean errors, bias, the Root Mean Square Error (RMSE) and the Nash-Sutcliffe Efficiency coefficient.
Although rain gauges are most commonly used to validate datasets, the triple collocation (TC) technique has been increasingly used to characterize uncertainties in precipitation products, thanks to the availability of more and more datasets. This technique has been used to address problems arising from the validation of gridded data with a too coarse or sparse rain gauge network [40][41][42].
The goal of this study is to evaluate the skills of several datasets in reproducing monthly precipitation climatology. These sets include state-of-the-art reanalysis and satellite data, and well-tested model results, validated through an established and reliable rain gauge network. The study area is Calabria, a southern Italian region of about 15,000 km². Calabria is a challenging area for rainfall studies: it has a complex orography and a high vulnerability to climate change due to its position in the center-south of the Mediterranean basin. At the same time, it is equipped with a robust rain gauge network, available through the Multi-Risk Functional Centre of the Regional Agency for Environmental Protection (Regione Calabria). For this study, CHIRPS has been used as the satellite dataset, E-OBS as the reanalysis dataset, and 13 GCM-RCM combinations, available as outputs of the ENSEMBLES project, as the regional climate models. All these sets have been validated for the 1951-2010 time-period (CHIRPS only for 1981-2010) against the 79 rain gauges through (i) a two-metric set consisting of the adimensional relative mean error and standard deviation error, and (ii) the Pearson correlation.

Study Area
Located at the toe of the Italian peninsula, Calabria has a surface of 15,080 km² and an average altitude of 597 m above sea level (a.s.l. hereafter).
With its highest peak at 2266 m a.s.l., Calabria does not present many high summits, yet it is one of the most mountainous regions in the country. Mountains (areas above 500 m a.s.l.) occupy 42% of the region, while hills between 50 and 500 m a.s.l. cover 49% of the territory. Only 9% of the region lies below 50 m a.s.l. (Figure 1). Calabria's climate is typically Mediterranean. It features sharp contrasts due both to its position within the Mediterranean Sea and to its orography. Specifically, warm air currents coming from Africa affect the Ionian side, leading to high temperatures and to short, heavy precipitation events. The Tyrrhenian side, instead, is affected by western air currents, which cause milder temperatures and more intense precipitation compared to the Ionian side. Cold and snowy winters, and cool summers with some precipitation, are typical of the inner areas of the region [43].

Data Sources
The following data sets have been validated: the CHIRPS satellite dataset, the E-OBS reanalysis dataset and 13 GCM-RCM combinations from the ENSEMBLES project [17]. These GCM-RCM combinations use a Global Climate Model (Table 1) to drive a Regional Climate Model (Table 2). As an example, the HCH-RCA acronym refers to the Sveriges Meteorologiska och Hydrologiska Institut (SMHI) regional RCA Model driven by the global Hadley Climate Model 3 (HCH) with high sensitivity. See Table 3 for a full list of GCM-RCM combinations and acronyms, and Appendix A Table A1 for details on rain gauge codes and names.

Several reasons determined the choice of the datasets:
• CHIRPS and E-OBS are up-to-date, state-of-the-art products: they are regularly maintained and updated, and they have been the subject of several validation studies [44];
• E-OBS has been used in previous studies as a validation tool for model outputs, and in particular to validate precipitation of the ENSEMBLES project GCM-RCM combinations [45][46][47];
• the E-OBS fields are available on a grid consistent with that used by the ENSEMBLES RCMs; in fact, they were built with the goal of direct comparison with ENSEMBLES RCM outputs [25,48];
• the ENSEMBLES models have already been selected and studied in previous projects; they are easily comparable with E-OBS, as they share the same spatial resolution and space grid; furthermore, their data are available for the 1951-2010 time-period.

The timeframe of the study runs from 1951 to 2010 in order to split it into two 30-year periods (1951-1980 and 1981-2010). This split allows assessing the skill of the datasets from one 30-year period to the next. In particular, the 1981-2010 time-period was chosen because it coincides with the current climatological normal period; moreover, CHIRPS data are available only from 1981 onward.
The validation has been conducted for monthly precipitation, because of its importance and relevance both for research and for applications. First of all, it is used to build the monthly normal climatology, which is a fundamental climatological parameter. It is also used: in calculating drought indices such as the Standardized Precipitation Index (SPI; [49]); in evaluating rainfall seasonality, with, e.g., the Precipitation Concentration Index (PCI; [50]); and in investigating the seasonal correlation between precipitation and teleconnections (e.g., El Niño Southern Oscillation, North Atlantic Oscillation). Monthly precipitation is also used as a driver of hydrological runoff modeling (e.g., [51]). Finally, it must also be taken into account that many available datasets are only produced with monthly values, or restricted to monthly sampling (e.g., gauge-based precipitation products; see for instance [52]).
The validation set is based on daily data, available online, managed by the Multi-Risk Functional Centre of the Regional Agency for Environmental Protection. The database consists of high-quality, complete or near-complete records, available since the early 20th century and currently updated. This set has been widely employed in the study of the climate of Calabria (e.g., [53,54]). In particular, at the end of 2010, the Calabria database consisted of daily data collected at about 100 stations. All rainfall series presenting less than 80% of daily data in the observation period were discarded. As a result, data from 79 stations in the period 1951-2010 (Figure 1 and Appendix A Table A1), with an average density of 1 station per 190 km², were selected. Rain gauge density and distribution are crucial for an accurate description of the rainfall amount over a region. It is difficult to derive fundamental laws to determine the gauge density needed in a particular region [55]. However, several studies proved a linear correlation between the uncertainty in the spatially averaged rainfall and the spatial standard deviation [56][57][58]. More recent theoretical developments suggest that the uncertainty is directly proportional to the spatial standard deviation and inversely proportional to the square root of the total number of gauges [59]. World Meteorological Organization (WMO) guidelines indicate a density ranging from one station per 100 km² for complex, mountainous terrain to one station per 10,000 km² in arid and polar deserts [60]. Nonetheless, uncertainties are often calculated with different methods and metrics depending on the region of interest and on the applications for which precipitation is needed (e.g., [61,62]). Mishra, for instance, suggested that for southern India the rain gauge density acceptable for significantly reproducing total precipitation was around 1 station per 350 km² [61].
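The station-selection rule described above (discard any series with less than 80% of daily data in the observation period) can be sketched as follows; station names and values are illustrative, not actual Calabrian records.

```python
def select_stations(series_by_station, min_fraction=0.80):
    """Keep only the stations whose daily series has at least `min_fraction`
    of non-missing values (None marks a missing day)."""
    selected = {}
    for station, daily_values in series_by_station.items():
        n_total = len(daily_values)
        n_valid = sum(v is not None for v in daily_values)
        if n_total > 0 and n_valid / n_total >= min_fraction:
            selected[station] = daily_values
    return selected

# Toy example: 10-day records; station "B" has 3 missing days (70% complete).
records = {
    "A": [1.2, 0.0, 5.1, 0.0, 0.0, 2.3, 0.0, 0.4, 1.0, 0.0],
    "B": [1.2, None, 5.1, None, 0.0, 2.3, None, 0.4, 1.0, 0.0],
}
kept = select_stations(records)  # only "A" survives the 80% threshold
```

The same filter, applied to the full 1951-2010 daily archive, yields the 79 stations used in this study.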
Small-scale variability influences rainfall events and monthly accumulated precipitation, so validation at the smallest possible spatial scale is recommended. Interpolating gauge measurements into a gridded product results in large uncertainties [63]. Thus, for the CHIRPS-versus-observations comparison, we applied a point-to-pixel analysis which compares rainfall data observed at gauge stations with the respective grid cell; i.e., for each station and month, the time series observed at the selected rain gauge was compared to that of the corresponding CHIRPS pixel [36,64].
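The point-to-pixel matching can be sketched as below: for each station, the pixel whose centre is nearest to the station coordinates is selected, and that pixel's series is compared with the gauge series. The grid spacing and coordinates here are illustrative, not the actual CHIRPS grid definition.

```python
import numpy as np

def nearest_pixel(lat, lon, grid_lats, grid_lons):
    """Return the (row, col) indices of the grid cell whose centre is
    nearest to the station at (lat, lon); grid_lats and grid_lons are the
    1-D coordinates of the pixel centres of a regular grid."""
    i = int(np.argmin(np.abs(grid_lats - lat)))
    j = int(np.argmin(np.abs(grid_lons - lon)))
    return i, j

# Illustrative 0.05-degree grid spanning Calabria-like coordinates.
lats = np.arange(37.9, 40.2, 0.05)   # pixel-centre latitudes
lons = np.arange(15.6, 17.3, 0.05)   # pixel-centre longitudes
i, j = nearest_pixel(39.36, 16.23, lats, lons)   # a hypothetical station
# The series at pixel (i, j) is then compared, month by month, with the
# station's series.
```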
For the comparison of gauge measurements with E-OBS/ENSEMBLES grid points, grid data were interpolated to the station locations and the results were compared to the station data. Numerous studies have reviewed existing spatial interpolation methods for hydrological variables (e.g., [65]) or suggested new spatial interpolation approaches (e.g., [66]). Among these studies, there is no unanimous consensus on the best interpolation method: several authors have concluded that results depend on the sampling density (e.g., [67]). A bilinear interpolation has been chosen because it is a simple, two-dimensional (2D) interpolation and it allows a clear assessment of the improvements and uncertainties introduced.
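The bilinear interpolation of a gridded field to a station location can be sketched as follows. The grid and coordinates are synthetic; the sanity check exploits the fact that bilinear interpolation reproduces exactly any field that is linear in both coordinates.

```python
import numpy as np

def bilinear(grid, grid_lats, grid_lons, lat, lon):
    """Bilinearly interpolate a 2-D field `grid` (indexed [lat, lon], with
    ascending 1-D coordinates grid_lats/grid_lons) to the point (lat, lon)."""
    i = np.searchsorted(grid_lats, lat) - 1   # lower-left cell corner
    j = np.searchsorted(grid_lons, lon) - 1
    ty = (lat - grid_lats[i]) / (grid_lats[i + 1] - grid_lats[i])
    tx = (lon - grid_lons[j]) / (grid_lons[j + 1] - grid_lons[j])
    return ((1 - ty) * (1 - tx) * grid[i, j]
            + (1 - ty) * tx * grid[i, j + 1]
            + ty * (1 - tx) * grid[i + 1, j]
            + ty * tx * grid[i + 1, j + 1])

# Sanity check on a field linear in latitude and longitude.
glats = np.array([39.0, 39.25, 39.5])
glons = np.array([16.0, 16.25, 16.5])
field = 10.0 * glats[:, None] + 2.0 * glons[None, :]
value = bilinear(field, glats, glons, 39.30, 16.10)   # exact: 425.2
```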

Validation Metrics
Different metrics can evaluate different skills in reproducing precipitation. For this study, three common evaluation metrics were used: the relative mean error, the relative standard deviation error, and the Pearson correlation coefficient [2,25,36].
These metrics have been selected for the following reasons:
• mean error and standard deviation error are among the most commonly used tools in validation and error theory [68,69], and have also been used to compare E-OBS and the ENSEMBLES RCMs [25];
• the Pearson correlation coefficient is an important measure, commonly used in climate science for evaluating data from independent sources, such as precipitation values from gauge stations, satellites and models [70,71].
The use of the mean error and of the standard deviation assumes that the error distribution is Gaussian. We preliminarily checked that monthly precipitation data over Calabria roughly follow a normal distribution for the rain gauge data, E-OBS and CHIRPS. It is then a common assumption that the errors are normally distributed as well (e.g., Roebeling et al. [40]).
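One simple way to screen a sample for approximate normality, offered here as a sketch of such a preliminary check rather than the procedure actually used in the study, is to verify that the sample skewness is close to zero; the data below are synthetic.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment (zero for a normal)."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return float(np.mean(((x - m) / s) ** 3))

rng = np.random.default_rng(42)
normal_like = rng.normal(loc=100.0, scale=30.0, size=5000)  # synthetic "monthly totals"
skewed = rng.exponential(scale=100.0, size=5000)            # clearly non-normal
# |skewness| near 0 suggests approximate normality; exponential data give ~2.
```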
It is interesting to note that the approximately normal distribution of E-OBS, CHIRPS and of the gauge network, three independent sets with mutually uncorrelated errors, might satisfy the requirements for applying triple collocation validation. However, the mutual independence of the sets might be only apparent, as both E-OBS and (partially) CHIRPS are built using rain gauge data. Thus, more preliminary analyses are needed before using triple collocation to assess the performance of these sets.
There are many other methods and indices that can be used; however, many of these are not independent of mean and standard deviation, or are basically different methods to calculate correlation.
Others, such as the number of rainy days, could not be used as we do not have daily data for most of these sets.

Mean Error and Standard Deviation Error
The two-metric validation introduced in this section is based on the work of Deidda et al. [25], who used it to evaluate the ENSEMBLES RCMs' skill in reproducing precipitation (and temperature) against E-OBS reanalysis data. The performance indices are the adimensional errors on the monthly mean and standard deviation. The mean error evaluates how well the estimates correspond to the observed values, indicating whether rainfall totals are overestimated or underestimated. The standard deviation error evaluates the average magnitude of the estimation errors, and the capability of reproducing variability.
Let P_s(m, y) be the monthly precipitation for month m and year y of a generic dataset s that we want to validate. It can be collected at a station, at a grid point, or as a value averaged over an area (e.g., a hydrological basin or an administrative district).
Considering a climatological time-frame of N_y years of monthly precipitation P_s(m, y), starting with year y_0, the N_y-year average of the monthly precipitation for each month m of the annual cycle, μ_s(m), is

μ_s(m) = (1/N_y) ∑_{y=y_0}^{y_0+N_y−1} P_s(m, y)    (1)

and the standard deviation of the precipitation of month m is

σ_s(m) = [ (1/N_y) ∑_{y=y_0}^{y_0+N_y−1} (P_s(m, y) − μ_s(m))² ]^{1/2}    (2)

Each dataset has to be compared with the observed data (registered at the rain gauges). As in Equations (1) and (2), it is possible to estimate, for the observed dataset P_0(m, y), the N_y-year average of the monthly precipitation,

μ_0(m) = (1/N_y) ∑_{y=y_0}^{y_0+N_y−1} P_0(m, y)    (3)

and the standard deviation,

σ_0(m) = [ (1/N_y) ∑_{y=y_0}^{y_0+N_y−1} (P_0(m, y) − μ_0(m))² ]^{1/2}    (4)

With this aim, the following error metrics have been introduced:
(a) the average absolute error on the monthly mean,

E_μ(s) = (1/12) ∑_{m=1}^{12} |μ_s(m) − μ_0(m)|    (5)

(b) the average absolute error on the standard deviation,

E_σ(s) = (1/12) ∑_{m=1}^{12} |σ_s(m) − σ_0(m)|    (6)

The error metrics defined above provide information on the reliability of a single model in reproducing precipitation, whereas normalizing the metrics is needed to visualize more clearly the simultaneous performance of different datasets against each other.
To produce normalized metrics, each error has been divided by a factor obtained as the sum of the corresponding errors over all datasets in the set S:

F_μ = ∑_{s∈S} E_μ(s)    (7)

F_σ = ∑_{s∈S} E_σ(s)    (8)

The errors on the climatological mean and standard deviation of a single dataset s thus become

e_μ(s) = E_μ(s) / F_μ    (9)

e_σ(s) = E_σ(s) / F_σ    (10)

The results can be graphically represented in a simple Cartesian plane, drawing the mean error and the standard deviation error on the x- and y-axes, respectively. The origin (0,0) indicates the reference value: the closer the error metrics of a dataset s are to the origin, the better its performance.
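The mean and standard deviation error metrics and their normalization described above can be sketched as follows; the dataset values are synthetic, and the array shape convention (years × months) is an assumption of this sketch.

```python
import numpy as np

def monthly_errors(P_s, P_0):
    """Average absolute errors on the monthly mean and standard deviation.
    P_s, P_0: arrays of shape (N_y, 12), dataset vs observed."""
    e_mu = float(np.mean(np.abs(P_s.mean(axis=0) - P_0.mean(axis=0))))
    e_sigma = float(np.mean(np.abs(P_s.std(axis=0) - P_0.std(axis=0))))
    return e_mu, e_sigma

def normalized_errors(datasets, P_0):
    """Divide each dataset's errors by the sum of errors over all datasets,
    so that performances can be compared on a common (0, 0)-origin plane."""
    errs = {name: monthly_errors(P_s, P_0) for name, P_s in datasets.items()}
    sum_mu = sum(e[0] for e in errs.values())
    sum_sigma = sum(e[1] for e in errs.values())
    return {name: (e_mu / sum_mu, e_sig / sum_sigma)
            for name, (e_mu, e_sig) in errs.items()}

rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 40.0, size=(30, 12))            # synthetic observations
sets = {"good": obs + rng.normal(0, 5, obs.shape),   # small perturbation
        "bad": obs * 1.5}                            # strongly biased dataset
scores = normalized_errors(sets, obs)
# "good" lands closer to the origin than "bad" on both axes.
```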
This procedure can be easily generalized from a two-dimensional error to a N-dimensional error by including any relative indices relevant to the specific study. It might even be possible to provide every index with a weighting factor. The use of more and/or different indices would be up to the specific scientific problem's demands.
As the absolute standard deviation error represents the error in the degree of dispersion, Equation (6) can be used as a bias error estimate (see [72,73]). On the other hand, while the relative bias error is defined as the bias error divided by the mean precipitation, the relative error of Equation (10) is normalized over the errors of all the other models. Thus, this error measure is not a relative bias index.

Pearson Correlation Coefficient
The Pearson correlation coefficient (r) has been used to evaluate how well the estimates correspond to the observed values. For each month m and each dataset s, the coefficient is defined as

r_s(m) = [ ∑_{y=y_0}^{y_0+N_y−1} (P_s(m, y) − μ_s(m)) (P_0(m, y) − μ_0(m)) ] / [ N_y σ_s(m) σ_0(m) ]    (11)

with values ranging from −1 to 1, the extremes ±1 indicating perfect scores [70,71,74]. N_y indicates the number of years taken into account, i.e., 30 or 60 (except for CHIRPS, for which N_y = 30 only).
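Computing one Pearson coefficient per calendar month can be sketched as below on synthetic series; shapes and values are illustrative.

```python
import numpy as np

def monthly_pearson(P_s, P_0):
    """Pearson correlation between dataset and observed monthly precipitation,
    one coefficient per calendar month. P_s, P_0: arrays of shape (N_y, 12)."""
    return np.array([np.corrcoef(P_s[:, m], P_0[:, m])[0, 1]
                     for m in range(12)])

rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 40.0, size=(30, 12))        # synthetic observed months
good = 0.8 * obs + rng.normal(0, 1, obs.shape)   # nearly linear in the obs
r = monthly_pearson(good, obs)
# A dataset that is (noisily) linear in the observations correlates strongly
# in every month even though it is biased low -- which is why correlation
# and the relative error metrics answer different questions.
```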

Results
Results from the adimensional mean and standard deviation metrics (Figure 2) show that some model results from ENSEMBLES compare well with the satellite data (CHIRPS) and the reanalysis (E-OBS).
In particular, CHIRPS and the four ECHAM5-driven models are the best performers overall, both in terms of standard deviation and of mean. Among the other ENSEMBLES combinations, good relative performances were obtained by HCH-RCA and ARP-HIR.
There is very little change in relative performance between the two considered time periods, with the exception of E-OBS and of the ECHAM5-driven models. In particular, all of the latter models increase their relative skill in reproducing the standard deviation, and three out of four increase their skill in the mean from the 1951-1980 to the 1981-2010 time-period (Figure 3). Figure 4 shows the Pearson correlation coefficient, evaluated between monthly rain gauge precipitation and gridded dataset precipitation, for the whole observation period (1951-2010) and for the 1981-2010 time period only. The correlation shows that the same models that had good relative error metrics are the best-performing ones (ECH-driven models, HCH-RCA and ARP-HIR), with the addition of BCM-HIR. However, the most correlated dataset of all (r = 0.97) is E-OBS, which was only an average performer with regard to the relative metrics. CHIRPS is once again an excellent performer, with a correlation value of r = 0.94.
The other six models (all driven by the Hadley Center HadCM3 Model) score very poorly on the Pearson correlation. They all show anticorrelation, with values ranging from r = −0.23 (HCS-HRM) to r = −0.59 (HCH-HRM), which means that it is not possible to identify any linear relation between the model data and the rain gauge data.

A huge difference between the results obtained from the two regional models driven by the Hadley Model with high sensitivity has been detected: HCH-RCA is one of the best performers at all metrics, while HCH-HRM is one of the worst.

Figure 5 shows the seasonal correlation of monthly rain gauge precipitation with gridded dataset precipitation. The seasonal breakdown shows a good performance of nine datasets in the spring (MAM) and in the fall months (SON). E-OBS is the only dataset with an excellent correlation (r > 0.9) in the winter months (DJF), while all datasets show a strong decrease in correlation for the summer months.

To further understand the strengths and weaknesses of the examined datasets in reproducing precipitation features in Calabria, we have examined Quantile-Quantile plots (QQ-plots) of the monthly precipitation of all stations for each dataset. Figure 6 shows the QQ-plots of gridded data against station data for the full 1951-2010 time period, for E-OBS and the seven best-correlating (r > 0.9) models from ENSEMBLES. Figure 7 shows the results for 1981-2010 for CHIRPS.

Results from the QQ-plots and the relative metrics show that E-OBS is not one of the best models; however, it must be noted that it still correlates very well with the observed data. From the plots, it is clear that most of the well-performing models have good or even excellent results at the lower end of the precipitation range, but they are not able to correctly reproduce the months with most precipitation. This is probably due to a lack of skill in reproducing extreme events (for models) or to problems in observing them from satellites (TRMM). It is possible that this problem is transferred from raw satellite data to methods that integrate satellites with stations (CHIRPS) or use satellite data in reanalysis (E-OBS).

Analogous QQ-plots (not shown) were examined for the full 1951-2010 time period for the seven mildly anticorrelating (−0.53 < r < −0.23) models from ENSEMBLES. Here the gridded data do not follow any recognizable linear pattern for a long enough interval to allow some form of linear correlation to emerge.
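A numerical QQ comparison of the kind plotted in the figures above can be sketched by pairing the empirical quantiles of the two samples; the data here are synthetic, with the gridded sample deliberately underestimating rainfall.

```python
import numpy as np

def qq_pairs(sample_a, sample_b, n_quantiles=99):
    """Return paired empirical quantiles of two samples; points falling on
    the 1:1 line indicate identically distributed data."""
    q = np.linspace(0.01, 0.99, n_quantiles)
    return np.quantile(sample_a, q), np.quantile(sample_b, q)

rng = np.random.default_rng(2)
station = rng.gamma(2.0, 50.0, size=2000)        # synthetic station months
gridded = 0.6 * rng.gamma(2.0, 50.0, size=2000)  # dataset underestimating rain
qa, qb = qq_pairs(station, gridded)
# The upper quantiles of `gridded` fall well below those of `station`:
# the signature of under-reproduced heavy-precipitation months.
```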

Discussion
Results from this study show the importance of using multiple metrics in validation: combining the analysis from several tools provides a deeper understanding of the datasets' skills and shortcomings.
The relative error metrics show which datasets perform better with regard to the actual monthly precipitation values. However, the Pearson correlation is also an important validation tool: a high correlation (i.e., correlation values above 0.8) means that it might be possible to obtain better results with a simple bias correction. For example, one could use the Pearson value to select only high-correlating datasets, then bias-correct them (for instance, with a simple linear regression), and finally re-run the relative metrics to find the best-performing models. In general, however, one must always remember that metrics should be tailored to the desired end-use of the datasets: metrics are not a measure of quality per se.
Bias correction, for instance, might be useful with the E-OBS data: while this reanalysis gridded set shows the highest correlation values (r = 0.97), its performance is only average with respect to the relative, normalized error metrics. E-OBS shows a strong tendency to underestimate precipitation amounts in general and extreme events in particular. The mean yearly precipitation for 1951-2010 is 1057 mm/year according to the observations, while it is only 606 mm/year according to E-OBS (see Appendix A and Table A1 for more information on the datasets' mean yearly precipitation). This underestimation has also been found in other studies that have shown a bias in E-OBS toward lower values [44].
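The simple linear-regression bias correction mentioned above can be sketched as follows; the numbers are synthetic and only illustrate the mechanics, with the biased series mimicking an E-OBS-like underestimation.

```python
import numpy as np

def linear_bias_correction(dataset, observed):
    """Fit observed = a * dataset + b on a calibration sample and return a
    function that applies the correction to new dataset values."""
    a, b = np.polyfit(dataset, observed, deg=1)   # least-squares line
    return lambda x: a * np.asarray(x) + b

rng = np.random.default_rng(3)
obs = rng.gamma(2.0, 40.0, size=360)                # synthetic monthly observations
biased = 0.57 * obs + rng.normal(0, 3, obs.shape)   # strong dry bias
correct = linear_bias_correction(biased, obs)
corrected = correct(biased)
# After correction, the mean underestimation disappears (the fitted line
# passes through the sample means), though the scatter remains.
```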

With regard to possible bias corrections, it must be noted that the ENSEMBLES models come with associated elevation data, which can be used for vertical corrections. On the other hand, CHIRPS precipitation is based on satellite data and has no orographic reference. Caroletti and Deidda [75] applied bias correction and orographic correction to downscale the precipitation of 14 ENSEMBLES models over Sardinia. They improved the skill in reproducing results by using a combination of a multifractal model and a linear orographic model. This approach could be usefully applied to the mountainous regions of western Calabria, where precipitation has a strong orographic component. However, it would not solve the issue concerning months with high amounts of convective precipitation.
Balsamo et al. suggested spatial re-scaling as a way to improve gridded dataset results [66]. The main problem for Calabria would be to find a reliable gridded reference to use for re-scaling. An alternative approach could be the spatial interpolation of the rain gauge network into a regular-spaced grid.
E-OBS could be considered a reliable enough dataset for this purpose. However, E-OBS is provided at the same grid (and space resolution) as ENSEMBLES, so there would not really be a re-scaling. Even though E-OBS could be used to re-scale CHIRPS data, more problems could come from re-scaling satellite data at cell level into a point-grid system (see Section 2.2). Another choice of dataset could be ERA-INTERIM, which has a resolution of about 80 km, but is only available from 1979 to 2017 [76,77].
With regard to the interpolation of the rain gauge network to a regular grid, one of the main issues would be what method to use for building the grid. Several tools are available for sparse data interpolation: the Barnes method, for instance, has been used successfully for precipitation [25]. The problem of rescaling cell data to grid points, though, would still have to be addressed.
CHIRPS has a very high correlation value (0.94) and a high relative error skill compared to the other models. This dataset seems to be an excellent basis for further work, especially given its 0.05° resolution, compared to the 0.25° resolution of all the other datasets taken into account. Precipitation estimates derived from satellite data are indirect, are inevitably accompanied by a large degree of variability, and have difficulty representing precipitation with high spatiotemporal variability in areas of complex topography [52,78]. CHIRPS results, however, show no particular problem in capturing light precipitation. This is probably due to the incorporation of station data into the satellite product.
The good Pearson correlation results and the underestimation of high-precipitation events are in accordance with previous evaluations: Paredes-Trejo et al. [32] in northeast Brazil, and Luo et al. [79] in the Lancang-Mekong river basin, where CHIRPS data were used to drive hydrological modeling. On the other hand, Rivera et al. [33] noted some issues in the Andes region of Argentina, where CHIRPS underperformed especially in areas above 1000 m a.s.l. As only a handful of the stations in this study are located above that altitude, some caution is required in the use of the CHIRPS dataset. This is especially noteworthy given the importance of the mountain areas of Calabria as the main freshwater sources for the region. Bai et al. [3] investigated the performance of CHIRPS over China, finding very different results in different regional and basin areas, in wet versus arid zones, and in summer versus winter months. Once again, this underlines that regional validations should be taken as valid only for that specific region and for the specific use the data are validated and selected for.
Deidda et al. [25] used E-OBS to validate 14 ENSEMBLES models. In this study, however, although the Pearson correlation of E-OBS with ground data is excellent, the relative error metrics show that some of the ENSEMBLES models perform better than E-OBS. Since the ENSEMBLES models are built at least in part on the E-OBS results, this is not necessarily surprising; nevertheless, it calls into question the use of E-OBS as a validation reference for these models, as ENSEMBLES and E-OBS are not independent of each other.
In general, Pearson correlation results show that all datasets have considerable difficulty reproducing summer precipitation. However, this has a lesser impact on the relative error metrics, which are based on the sum of all errors on the monthly means: since summer precipitation in Calabria is significantly lower than that of the other months of the year, the contribution of the summer months to the error is much reduced.
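This point can be illustrated with simplified versions of the validation metrics used in this study; the exact formulas in the paper may differ in detail (e.g., in normalization), so the functions below are a hedged sketch, not the study's code. The second function shows why months with low totals contribute little to a relative error computed on monthly means.

```python
import numpy as np

def validation_metrics(obs, est):
    """Relative mean error, relative standard-deviation error and
    Pearson correlation between observed and estimated monthly series
    (illustrative definitions)."""
    rme = (est.mean() - obs.mean()) / obs.mean()
    rsde = (est.std(ddof=1) - obs.std(ddof=1)) / obs.std(ddof=1)
    r = np.corrcoef(obs, est)[0, 1]
    return rme, rsde, r

def relative_error_on_monthly_means(obs_means, est_means):
    """Sum of absolute errors on the 12 monthly means, relative to the
    total observed rainfall: dry summer months weigh little."""
    return np.abs(est_means - obs_means).sum() / obs_means.sum()
```

For example, a dataset that doubles rainfall in three dry summer months while matching the wet months still scores a small relative error on the monthly means, even though its summer correlation would be poor.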
Poor summer performance, on the other hand, might contribute to errors in the evaluation of extreme events for E-OBS and CHIRPS, as extreme events can affect Calabria in the form of late-summer thunderstorms and extreme convective precipitation.
The reliability of RCMs is in general strongly dependent on the quality of the climate forcing data, i.e., of the GCMs. It is plausible that the ECH-driven models are the best performers overall because of the quality of their forcing data, which matters more than the fine-tuning of the RCMs. The three models driven by the Hadley Centre HadRM3Q3 Model (HCL-HRM, HCS-HRM and HCH-HRM) perform better during 1951–1980 than during 1981–2010 according to the Pearson correlation coefficient. Looking at the seasonal correlations, the performance is almost the same in spring, while it worsened in the other seasons; most significantly, it worsened in the winter months, going from a good correlation to an anti-correlation. In the 1981–2010 time period, there was an increase in extreme precipitation and a decrease in the number of precipitation days in Calabria [80], especially in winter [81]. This change in precipitation patterns might have escaped the HRM climatology that drove these RCMs, even though the 30-year climatology of precipitation (i.e., the decrease of the yearly average) might have been correctly reproduced (see Appendix A Table A2).
The performance of RCMs with regard to extreme events, on the other hand, depends on the skill of the regional model in correctly capturing the spatial distribution of precipitation and the orographic enhancement. In this regard, we found that the RCA regional models were the best performers: even HCL-RCA, one of the lower-ranking datasets overall, actually correlates better for extreme events than for the rest of the precipitation spectrum (not shown). Results from the ENSEMBLES models run with the HIRHAM5 RCM show the opposite problem: even those with good correlation values strongly overestimate extreme events. Thus, it is not surprising that BCM-HIR and ECH-HIR are the datasets with the highest yearly precipitation values (see Appendix A Table A2).

Conclusions
Products distributed on regular grids, whether satellite data, reanalysis products, or model data, are currently used for most climatological studies, especially in regions where ground stations are inadequate for high-resolution regional studies. A common approach in future projection studies is to produce large ensembles of climate model results. Many studies (e.g., [13] and [14]) suggest using the results from all available models, so as to span the uncertainties arising from different approaches (e.g., in parameterizations). However, other studies [22,69,82] challenge this approach and suggest weighting the contribution of each model, or using a limited number of high-performing model results instead. The selection of the most accurate gridded products, i.e., validation, can play a major role in accurate climate projections, assessment studies and hydrological studies.
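The model-weighting alternative mentioned above can take many forms; one purely illustrative rule, not one prescribed by the cited studies, weights each model inversely to its validation error before averaging:

```python
import numpy as np

def skill_weighted_ensemble(model_fields, errors):
    """Combine model outputs with weights inversely proportional to each
    model's validation error (an illustrative choice, not a standard).

    model_fields : (n_models, ...) array of model outputs
    errors       : (n_models,) positive skill errors, e.g., |relative
                   mean error| from a validation against rain gauges
    """
    w = 1.0 / np.asarray(errors, dtype=float)
    w /= w.sum()  # normalize weights to sum to one
    # Contract the model axis: a weighted average of the fields.
    return np.tensordot(w, np.asarray(model_fields), axes=1)
```

With equal errors this reduces to the plain ensemble mean; as one model's error grows, its contribution vanishes, which is the limiting case of simply discarding the worst performers.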
The results of this study showed that, taking into account Pearson correlation, error metrics and extreme event performance, the best datasets are the satellite-based CHIRPS dataset and the ECH-RCA, HCH-RCA and ECH-REM models from the ENSEMBLES set. Given its high correlation values, E-OBS could be a good dataset for further applications, but only after bias correction. There was a slight but appreciable improvement in performance for all four ECHAM5-based models and for the E-OBS dataset from the 1951–1980 to the 1981–2010 time period.
Results from this validation show that, out of the 13 ENSEMBLES models, the ones with the worst error metrics are also the ones whose monthly precipitation data do not correlate with rain gauge data. Thus, calculating ensemble uncertainties using models with such poor performance might in fact be counterproductive.
The whole validation procedure presented in this study is general and easily applicable to any other region where ground and gridded data are available, as a supporting tool in the choice of data for precipitation assessments in areas with sparse ground data.