#### 4.1. Model Comparison

In this section, we compare the results of the downscaling analysis carried out for models B and RFB, described in Section 3. All available hourly rainfall gauges (735 series; see Figure 1) were used to fit and evaluate the predictive skill of the models by cross-validation.

Results presented in Table 2 show that the RFB model improves on the predictions of the B model in most climates at the 1-h temporal level. Only for the BWh and BSh climates is the prediction given by B better for the 1-h variance. These correspond to desert and arid climates, where TAS values are higher in the testing set than in the training set. In these cases, the B model, based on the MARS technique, is able to predict values larger than those observed (i.e., extrapolate), whereas the RFB model is not. This is because MARS is a parametric model, whereas random forests is an ensemble method whose predictions are bounded by the range of the training data.
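This contrast in extrapolation behaviour can be illustrated with a toy regression. MARS is not available in scikit-learn, so the sketch below uses an ordinary linear model as a stand-in for the parametric family; the point is only that tree ensembles cannot predict beyond the range of the training targets (all variable names and values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(200, 1))                 # predictor, e.g., TAS
y_train = 2.0 * X_train.ravel() + rng.normal(0, 0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

X_new = np.array([[20.0]])        # well outside the training range [0, 10]
rf_pred = rf.predict(X_new)[0]    # capped near max(y_train), around 20
lin_pred = lin.predict(X_new)[0]  # extrapolates the fitted slope, around 40
```

Because each tree predicts an average of training targets, the forest can never exceed the largest value seen during fitting, which matches the behaviour reported for the BWh and BSh climates.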

The limited skill of the B model in predicting the 1-h ACF-lag1 is due to the way in which the lag-one autocorrelation coefficients are derived. As seen in Equations (1) and (2), instead of applying specific regressions for this variable, the method draws on previously predicted quantities, which increases the uncertainty of the prediction.
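For reference, the lag-one autocorrelation coefficient of an observed series can be computed directly; a minimal sketch (the function name is ours, and this is the standard sample estimator, not Equations (1) and (2)):

```python
import numpy as np

def acf_lag1(x):
    """Sample lag-one autocorrelation coefficient of a series."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    # Correlate each value with its successor, normalized by total variance
    return np.sum(dev[:-1] * dev[1:]) / np.sum(dev**2)

# A strongly persistent series yields a coefficient close to one
print(round(acf_lag1(np.arange(100, dtype=float)), 2))  # → 0.97
```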

For larger aggregation periods (results for 12-h rainfall are presented in Table 2), the results for variance, skewness, and the probability of a dry interval (Pdry) improve remarkably with respect to the 1-h results in both models. Indeed, ${R}^{2}$ values around 0.9 were systematically obtained at the 12-h temporal aggregation. However, the prediction of the ACF-lag1 deteriorated for larger aggregation periods.

Table 3 shows the predictors and predictands selected for the B and RFB models. It is important to remark that the RFB model includes only a subset of the predictors originally used in Beuchat et al. [31]. As shown in Table 2, even with this reduction in the number of predictors, the RFB model improves the overall results obtained with the original B model for all statistics. Some of the predictors used in Beuchat et al. [31] explain a negligible percentage of the variance of our data; hence, we removed them to avoid overfitting. The predictors that were ultimately selected were those whose removal significantly degraded model performance. Average surface air temperature (TAS) was the only large-scale atmospheric variable selected as a predictor, except for the 1-h ACF-lag1 statistic, for which the surface air temperature variance (${\sigma}_{TAS}^{2}$), relative air humidity (HUR), and the elevation of the station were also selected.

#### 4.2. Performance Analysis of RFB

Figure 3 shows the predictions obtained by the RFB model for variance, proportion of dry intervals (Pdry), skewness coefficient, and lag-one autocorrelation (ACF-lag1) at the 1-h temporal level, as scatter plots of observed versus predicted values together with the corresponding error distributions; the associated performances in terms of ${R}^{2}$ for the 1-h and 12-h variance, Pdry, skewness, and ACF-lag1 are available in Table 2.

The scatter plots and error distributions show that the results for variance and Pdry were satisfactory at all time scales, with average ${R}^{2}$ values around 0.83 and 0.96, respectively (see Table 2). For skewness and ACF-lag1, the model was less accurate, with ${R}^{2}$ values around 0.73 and 0.61. Nevertheless, the error distributions were unbiased, with most of the mass concentrated around zero for all statistics.

The large dispersion in the observed values of skewness (ranging between zero and 125) compared to the other statistics makes its prediction more difficult, partially explaining its lower accuracy. The skill of the models in predicting skewness was significantly worse for climate D (see Table 2), corresponding to mountainous areas of high altitude (>1000 m), where local atmospheric conditions are not properly captured by large-scale atmospheric variables [65]. Indeed, we found that TAS was not correlated with the 1-h skewness for climate D, which hindered the ability of the models to predict 1-h skewness in this climate.

Predictions of the 1-h ACF-lag1 were significantly less accurate than those of the other three statistics. The reason is that supradaily rainfall predictors have practically no importance in the prediction of the 1-h ACF-lag1, unlike for the other three statistics (see Table 3). In fact, the one-day ACF-lag1 was not selected among the predictors. In contrast to the skewness, the worst results for the 1-h ACF-lag1 correspond to climates BWh, BWk, BSh, and BSk (desert and semi-arid climates), which appear in small areas in the southeast of the provinces of Almería, Murcia, and Alicante, as well as in some areas of the Canary Islands. These climates, affected by extreme convective storms, show values of 1-h ACF-lag1 very close to zero, conditions under which the random forest tends to lump the stations into a single group and cannot discriminate among them.

The lack of accuracy of the 1-h ACF-lag1 predictions led us to explore another way to capture the temporal structure of rainfall. The RFB model was therefore used in the same way to predict the transition probabilities (from a dry interval to a dry interval, ${\varphi}_{1h}^{DD}$, and from a wet interval to a wet interval, ${\varphi}_{1h}^{WW}$). As shown in Figure 4, most of the error distribution of ${\varphi}_{1h}^{DD}$ lay within the $\pm 1\%$ error interval, whereas it spanned $\pm 50\%$ in the case of ${\varphi}_{1h}^{WW}$. The relative abundance of two consecutive dry periods, compared with two consecutive wet periods (mostly due to the intermittency of rainfall), was likely the reason for this difference in performance. However, the results in terms of ${R}^{2}$ for the transition probabilities, with values around 0.97 and 0.77, were significantly better than those reached for the ACF-lag1 (0.61).
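The transition probabilities themselves are straightforward to estimate from a wet/dry sequence. A minimal sketch, with an assumed wet/dry threshold of 0.1 mm (illustrative, not necessarily the threshold applied to the observed records):

```python
import numpy as np

def transition_probs(rain, wet_threshold=0.1):
    """Estimate dry-to-dry and wet-to-wet transition probabilities.

    Values below `wet_threshold` are treated as dry (threshold assumed
    for illustration).
    """
    wet = np.asarray(rain, dtype=float) >= wet_threshold
    dry = ~wet
    # P(dry at t+1 | dry at t) and P(wet at t+1 | wet at t)
    p_dd = np.sum(dry[:-1] & dry[1:]) / np.sum(dry[:-1])
    p_ww = np.sum(wet[:-1] & wet[1:]) / np.sum(wet[:-1])
    return p_dd, p_ww

# Intermittent series: long dry blocks, short storms
series = [0, 0, 0, 0, 1.2, 2.0, 0, 0, 0, 0, 0, 0.5]
p_dd, p_ww = transition_probs(series)  # p_dd ≈ 0.78, p_ww = 0.5
```

Dry-to-dry transitions dominate in an intermittent record, which is why ${\varphi}^{DD}$ is both larger and easier to predict than ${\varphi}^{WW}$.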

Table 4 shows the ${R}^{2}$ values for the 1-h and 12-h ${\varphi}^{DD}$ and ${\varphi}^{WW}$. In contrast to the ACF-lag1, the predictive skill for the transition probabilities increased at larger temporal aggregations.

As mentioned in the Methodology Section, we carried out a sensitivity analysis to assess the effect on the results of improving the spatial resolution of the atmospheric reanalysis data. The CFSR database was used instead of the NCEP one, since it provides a better spatial resolution of the atmospheric variables. Despite this improvement, the results were not significantly affected. This suggests that either average atmospheric conditions are sufficient to explain the regression between daily and subdaily rainfall statistics, or the CFSR resolution (0.312${}^{\circ}$) is still unable to capture local atmospheric conditions in Spain.

Additional analyses were carried out to evaluate the robustness of the method. The first concerned the total number of stations used during fitting: an increasing number of hourly rainfall stations was selected in an iterative process and the overall performance computed. When the number of stations rose above 25% of the total available (around 200), the results converged in terms of ${R}^{2}$, and only minor changes in the error distribution were observed. Secondly, all stations belonging to one climate were removed from the database, and the model was used to predict hourly statistics for the missing climate; the process was repeated for each climate type. In this case, the model performance was severely degraded, indicating that the skill of the model to extrapolate hourly statistics to unobserved climates is limited.
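The second robustness check amounts to leave-one-group-out cross-validation. The sketch below uses synthetic data in which the "climate" labels and per-climate relationships are invented purely to show how skill collapses for a group absent from training:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(0, 30, size=(n, 1))      # a single predictor, e.g., TAS
climate = rng.integers(0, 3, size=n)     # hypothetical climate labels
slopes = np.array([1.0, 3.0, -2.0])      # each climate has its own relationship
y = slopes[climate] * X.ravel() + rng.normal(0, 0.5, size=n)

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=climate):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], rf.predict(X[test_idx])))
# R^2 collapses for at least one held-out climate: the forest cannot
# reproduce a relationship it never saw during fitting
```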

An important point to highlight is that, even fitting the model to our dataset, we could not obtain coefficients of determination as high as those reported in Beuchat et al. [31] for the 340 gauges scattered throughout Switzerland, the U.K., and the USA. An overall more homogeneous set of stations could explain this performance difference, although we would have expected a more heterogeneous dataset to have been used in Beuchat’s model fitting, given its larger spatial coverage. However, the gauges selected in Beuchat et al. [31] rarely exceeded a skewness value of 60, whereas in Spain this value is exceeded by a large percentage of the stations. Therefore, the strong climate variability, both in space and time, found in the Spanish dataset and the incorporation of more extreme rainfall regimes could explain the lower performance of the regressions.

#### 4.3. Performance of Simulated Rainfall

In this subsection, we investigate how the RFB-predicted statistics can be used in conjunction with Poisson cluster models to simulate synthetic hourly rainfall series in Spain.

Fifteen locations across Spain were used as case studies (see Figure 5) to illustrate the ability of NSRPMs, fed with RFB-predicted statistics, to generate synthetic hourly rainfall series. The gauges with the longest hourly rainfall records were selected, covering all climate types.

As in Beuchat et al. [31], we assessed the ability of NSRPMs to reproduce the characteristics of observed hourly series under three scenarios. The “exact scenario” corresponds to the situation where the NSRPM is fitted on observed daily and subdaily statistics. In the “target scenario”, the observed subdaily statistics are replaced by the RFB-predicted statistics. In the “simple scenario”, the NSRPM is fitted only on observed daily rainfall statistics. Comparing these three scenarios allows us to assess the loss of performance when the observed subdaily rainfall statistics are replaced by those predicted by the RF technique, and when subdaily rainfall data are not used in the fitting procedure at all.

Fitting the NSRPM involves minimizing an objective function that integrates a weighted sum of rainfall statistics at different temporal levels of aggregation. Table 5 shows the set of rainfall statistics and associated weights selected for fitting the models. In the case of the “simple scenario”, only the weights and statistics at the daily (1 d) temporal aggregation were used. One NSRPM was calibrated for each case study and scenario, except for the “simple scenario”, for which 10 NSRPMs were calibrated, since the resulting subdaily statistics might differ significantly from one calibration to another. One thousand years of continuous synthetic rainfall records were simulated for each calibration in order to ensure the stability of the results.
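A common form for such an objective function is a weighted sum of squared relative errors between modelled and observed statistics. The sketch below is illustrative only; the statistic names, weights, and exact functional form are assumptions, not those of Table 5:

```python
def nsrpm_objective(model_stats, obs_stats, weights):
    """Weighted objective for Poisson-cluster model fitting (a sketch).

    Sums weighted squared relative errors between the statistics produced
    by the model and the observed ones; the exact form used in practice
    may differ.
    """
    total = 0.0
    for key, w in weights.items():
        total += w * ((model_stats[key] - obs_stats[key]) / obs_stats[key]) ** 2
    return total

# Hypothetical daily statistics and weights, for illustration
obs = {"mean_1d": 2.0, "var_1d": 9.0, "pdry_1d": 0.7}
model = {"mean_1d": 2.1, "var_1d": 8.0, "pdry_1d": 0.72}
w = {"mean_1d": 5.0, "var_1d": 1.0, "pdry_1d": 2.0}
print(round(nsrpm_objective(model, obs, w), 4))  # → 0.0265
```

Minimizing this quantity over the NSRPM parameters (storm rate, cell duration, etc.) yields the calibrated model for each scenario.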

Figure 6 and Figure 7 show the 1-h rainfall performance of the NSRPM simulations. Figure 6 shows the performance for variance, Pdry, and skewness, while Figure 7 shows that of ACF-lag1, ${\varphi}^{DD}$, and ${\varphi}^{WW}$. Black corresponds to the “exact scenario”; the line and squares represent, respectively, the observed and simulated 1-h statistics. Red corresponds to the “target scenario”; the dashed line and squares represent, respectively, the RFB-predicted and simulated 1-h statistics. The blue-hatched area shows the range of statistics across the calibrations of the “simple scenario”. Each row corresponds to a case study. One-hour mean rainfall results are not shown, since the NSRPMs reproduce the observed values almost perfectly under all three scenarios.

As can be seen in Figure 6 and Figure 7, the NSRPMs were flexible enough to simulate the observed and the RFB-predicted values of variance, Pdry, and skewness, except in some specific cases such as Almería in Month 5 and Barcelona in Month 6, where the squares (statistics simulated by the NSRPMs) fall far from the lines (observed and predicted statistics). The temporal dependence terms (ACF-lag1, ${\varphi}^{DD}$, and ${\varphi}^{WW}$), however, agreed less accurately. ${\varphi}^{WW}$ was not correctly captured by the NSRPMs for several months in Granada, Huesca, and Barcelona. Predicted ${\varphi}^{DD}$ matched the observations with errors contained within $\pm 2\%$. A perfect fit of the Pdry and the transition probabilities was never obtained in any of the case studies because rainfall values below a specific threshold were deemed null.

Comparing the observed statistics (black lines) with those simulated under the “target scenario” (red squares) and the “simple scenario” (blue-hatched area), we concluded that the “target scenario” performed significantly better. The “simple scenario” underestimated variance and skewness in most cases (León, Albacete, Zaragoza, Huesca, Baleares, Cantabria); we will see later how this affects the generation of extreme values (see Figure 8). The “simple scenario” also underestimated Pdry, which in addition showed great dispersion across the 10 calibrations. This is because, when not fed with subdaily statistics, the NSRPMs did not preserve the observed dry/wet spells at shorter temporal aggregations; the calibration process found solutions with a smaller number of storms ($\lambda $) of longer duration ($\beta $), leading to series with higher values of ACF-lag1 and ${\varphi}^{WW}$ than observed.

Interesting results were also found when the empirical intensity–frequency curves derived from the observed and simulated scenarios were compared (see Figure 8). Black dots represent the exceedance probability values of the observed rainfall series. Black and red lines correspond to the exceedance probabilities of the “exact scenario” and the “target scenario”, respectively. The blue-colored area represents the range of exceedance probability values across the 10 calibrations of the “simple scenario”. Each panel corresponds to a case study.
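Empirical exceedance curves of this kind can be derived by ranking intensities and assigning plotting positions. A minimal sketch using the Weibull formula (an assumption; the plotting position actually used is not stated here):

```python
import numpy as np

def exceedance_curve(values):
    """Empirical exceedance probabilities via Weibull plotting positions.

    Returns intensities sorted in descending order and the estimated
    probability that each intensity is equalled or exceeded.
    """
    x = np.sort(np.asarray(values, dtype=float))[::-1]
    n = len(x)
    prob = np.arange(1, n + 1) / (n + 1)  # rank i → i / (n + 1)
    return x, prob

intensities = [1.0, 5.0, 2.0, 8.0]
x, p = exceedance_curve(intensities)
# x = [8, 5, 2, 1]; p = [0.2, 0.4, 0.6, 0.8]
```

Applying this to the observed record and to each simulated series gives the curves compared in the figure.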

Figure 8 shows that the “exact scenario” reproduced the observed values adequately. The same holds for the “target scenario”, except for Barcelona and Mallorca, where the intensity of the values with an exceedance probability below 0.001% (${10}^{-5}$) was underestimated. In contrast, the “simple scenario” failed to reproduce the observed exceedance probabilities in most cases. The average results of the 10 calibrations underestimated the observed values except for Málaga and La Coruña. In fact, the maximum simulated value in the “simple scenario” rarely exceeded the maximum observed value, which is implausible given that the observed series had at most 20 years of records, while the simulated series comprised 1000 years.