Performance Evaluation of a Nowcasting Modelling Chain Operatively Employed in Very Small Catchments in the Mediterranean Environment for Civil Protection Purposes

Abstract: The Hydro-Meteorological Centre (CMI) of the Environmental Protection Agency of Liguria Region, Italy, is in charge of hydrometeorological forecasting and in-event monitoring for the region. The region contains numerous small and very small basins, known for their high sensitivity to intense storm events, which are characterised by low predictability. Therefore, a radar-based nowcasting modelling chain tailored to such basins, called the Small Basins Model Chain, is employed at the CMI as a monitoring tool for civil protection purposes. The aim of this study is to evaluate the performance of this model chain in terms of: (1) correct forecast, false alarm and missed alarm rates, based on both observed and simulated discharge threshold exceedances and on the observed impacts of rainfall events in the region; (2) warning times with respect to discharge threshold exceedances. The Small Basins Model Chain proves to be an effective tool for flood nowcasting and helpful for civil protection operators during the monitoring phase of hydrometeorological events: thanks to radar technology, it detects with good accuracy both the location of intense storms and the occurrence of flash floods.


Introduction
In recent years, the Liguria Region, located on the north-west coast of Italy, has been affected by several flash floods which have caused significant losses in terms of human lives, livelihoods, and damage to the environment, property and infrastructure. The two most severe flash floods in recent memory in this region occurred on 25 October and 4 November 2011. As a result of these two flood events, 19 people died and tens of millions of euros in damage to property and infrastructure were recorded [1]. The Liguria region is a hilly and mountainous area with a large number of small and very small basins: more than 90% of the drainage areas do not exceed 15 km² and about 87% do not exceed 5 km²; due to their small size, such basins are characterized by very short response times. Most of them are heavily urbanized, so that in many cases the river bed is covered and water flows under streets and buildings. These characteristics make them particularly sensitive to intense storm events, which are frequent in the region, and prone to flash floods, while the population often does not perceive the risk related to these small watercourses. Furthermore, many of these basins are ungauged [2].
Several measures allow for flood risk reduction. Adequate land use planning is, for instance, a permanent measure intended to prevent the hazard or reduce its probability; Early Warning Systems represent, instead, a form of temporary risk mitigation, which operates by reducing vulnerability and exposure to the hazard event [3].
The choice of a lumped model over a distributed or semi-distributed one is driven by two main reasons: the very small size of the basins and optimization issues. Despite the high density of the rain gauge network in the region (on average, one gauge every 25 km²) and radar technology providing high-resolution data (1 km²), the drainage areas considered here are so small, ranging from 0.2 to 15 km², with an average value of 3.6 km² and a median value of 2.1 km² (Figure 2), that they are either ungauged or provided with one rain gauge only, and only one or just a few radar grid cells cover them. This implied the use of the basin-average rainfall field as hydrological model input. Optimization issues relate to the operational purpose of the model chain, which requires short computation times, and to the uncertainty that would affect the parameter estimation of a distributed model for these basins. Moreover, these basins are characterised by very short response times, mostly less than 1 h, and some of them respond almost instantly to rainfall.
Furthermore, information is so coarse for these numerous and small basins that a finer description of the processes is not possible. Observed discharge series are not available for comparison with nowcast scenarios, since neither rating curves nor water level time series are available: most of these watercourses have a torrential regime, meaning they are dry most of the year, and they are very numerous in the region; it is therefore not feasible to equip them all with water level gauges, to make periodic discharge measurements to keep rating curves updated [13], or to find updated hydraulic studies for all of them. For brevity, observed discharge here refers to the discharge simulated using observed precipitation as model input. With so little information available, the uncertainties in estimating a large set of parameter values would be so large that a better forecast could not be guaranteed [14]. Parameter estimation of the Nash model is carried out with a simplified approach based on literature values, adapted to the geomorphology of the Ligurian region [15]. All these reasons, unfortunately, do not allow us to employ a more complex distributed model: we need a model chain that is as fast as possible, i.e., an optimum compromise between model accuracy, uncertainties and computational time.
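To illustrate what a lumped Nash rainfall-runoff model fed with basin-average rainfall involves, the following is a minimal sketch, not the operational implementation. It uses the classical Nash instantaneous unit hydrograph (a gamma density with n reservoirs and storage constant k). The function names, the parameter values n = 2 and k = 0.25 h, and the treatment of all rainfall as effective rainfall (no loss model) are assumptions made only for this example.

```python
import math

def nash_iuh(t_hours, n, k_hours):
    """Nash instantaneous unit hydrograph (gamma density), in 1/h."""
    if t_hours <= 0:
        return 0.0
    return ((t_hours / k_hours) ** (n - 1) * math.exp(-t_hours / k_hours)
            / (k_hours * math.gamma(n)))

def simulate_discharge(rain_mm_per_h, area_km2, n=2.0, k_hours=0.25, dt_hours=1 / 6):
    """Convolve basin-average rainfall with the IUH; returns discharge in m^3/s.

    Assumptions (hypothetical, for illustration): 10 min time step,
    all rainfall becomes runoff, n and k are not calibrated values.
    """
    steps = len(rain_mm_per_h)
    # Sample the IUH at step mid-points so the weights sum to roughly 1.
    weights = [nash_iuh((j + 0.5) * dt_hours, n, k_hours) * dt_hours
               for j in range(steps)]
    # 1 mm/h over 1 km^2 equals 1000 m^3/h, i.e. 1/3.6 m^3/s.
    to_m3s = area_km2 / 3.6
    discharge = []
    for i in range(steps):
        runoff = sum(rain_mm_per_h[i - j] * weights[j] for j in range(i + 1))
        discharge.append(runoff * to_m3s)
    return discharge

# Six hours of constant 10 mm/h rainfall on a 3.6 km^2 basin.
q = simulate_discharge([10.0] * 36, area_km2=3.6)
```

With constant rainfall the simulated discharge rises towards the steady-state value (rainfall intensity times area), which is a quick sanity check on the unit conversion.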
Figure 3 shows an example of model output at a sample cross-section at two different time steps. In the upper box, the rainfall intensity and cumulated rainfall are shown. The vertical grey line corresponds to the current time step (now bar), separating the observed rainfall from the nowcast rainfall. The right side of the hyetograph represents the worst scenario of rainfall intensity obtained by the stochastic nowcasting algorithm. Each bar of the hyetograph corresponds to a 10 min time interval; at the moment, for operational use, the nowcast is made up to 1 h only, so that the most reliable rainfall estimates are shown, since uncertainty in the storm trajectory, etc., increases with increasing nowcasting time [2]. Observed cumulative rainfall is represented by a pink line up to the now bar; on the right side, each equiprobable cumulative rainfall scenario predicted by PhaSt is shown as a differently coloured line. In the bottom box, the corresponding hydrographs simulated with the Nash rainfall-runoff model are reported. Since, as mentioned above, the modelled cross-sections are provided with neither water level measurements nor rating curves, the river streamflow up to the current time, represented by the black line, is estimated as the simulated discharge obtained by using the observed rainfall as input; for the sake of simplicity, it is here called observed discharge.
On the right side of the now bar, the black line represents how the hydrograph would develop if the rainfall stopped at the current time; the other ten hydrographs are obtained by adding the equiprobable rainfall nowcast scenarios to the hydrological model input. Due to the small size of the basins, only one cross-section is modelled for each of them, located at the most critical section of the stream, for instance at a structure such as a bridge, which strongly reduces the cross-section area. These streams can be coastal watercourses or tributaries to major rivers, which are not considered here (Figure 2). Each cross-section is associated with two discharge thresholds: a pre-alarm threshold, coloured in yellow and here called threshold 1, and an alarm threshold, coloured in orange and here called threshold 2. Where hydraulic studies were already available, these correspond to the discharge values flowing through the cross-section with 1.5 and 1 m freeboard, respectively. Where detailed hydraulic information was instead not available for [...]
Figure 3. The date and time format on the X-axis is YYMMDDHHmm. At 18:50 UTC (a) all nowcast discharge scenarios exceed one or both thresholds (pre-alarm and alarm thresholds), but this has not been observed yet; the cross-section icon takes the colour of the worst-case scenario (orange, for alarm threshold nowcast exceedance) and the operator can take measures. At 19:50 UTC (b), 1 h later, alarm threshold (orange line) exceedance is observed and discharge is expected to increase further.
The model chain runs with a time resolution of ten minutes. Due to the quick dynamics of storm cells and the short response times of the basins, frequent update of nowcasting is necessary. Model simulations are not continuous, though: as the chain is designed for monitoring purposes, they respond to an activation threshold, varying according to the antecedent soil saturation conditions and the average rainfall intensity estimated at the basin scale during the previous six hours, which correspond to the time interval on the left side of the now bar. When the peak river discharge reaches at least 10% of the threshold 1 value at a cross-section, i.e., the drainage area showed a sort of response to the rainfall, the corresponding basin figure is saved, available for visualisation by the operator, and we say that the section is active. The colour of the corresponding icon on the map will then switch from grey to green and, at the nowcast and observed threshold exceedances, to yellow and orange, so that the operator can focus attention on those specific catchments.
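The icon colouring described above can be sketched as a simple classification. This is a hedged illustration, not the operational code: the function name `icon_colour` is hypothetical, and it assumes that the activation check (10% of threshold 1) and the threshold checks are both applied to the worst-case peak among the observed hydrograph and all nowcast scenarios.

```python
def icon_colour(observed_q, nowcast_scenarios, thr1, thr2):
    """Return the map icon colour for one cross-section at one model run.

    observed_q: discharges up to the now bar (m^3/s).
    nowcast_scenarios: list of nowcast hydrographs (one list per scenario).
    thr1, thr2: pre-alarm and alarm discharge thresholds (m^3/s).
    Assumption: the worst case over observed and nowcast peaks drives the colour.
    """
    peaks = [max(observed_q)] + [max(s) for s in nowcast_scenarios]
    worst = max(peaks)
    if worst > thr2:
        return "orange"   # alarm threshold exceeded (observed or nowcast)
    if worst > thr1:
        return "yellow"   # pre-alarm threshold exceeded
    if worst >= 0.1 * thr1:
        return "green"    # section is active, no exceedance
    return "grey"         # section not active

colour = icon_colour([10.0, 20.0], [[30.0, 90.0], [25.0, 40.0]], thr1=50.0, thr2=80.0)
```

In this example one of the two scenarios peaks above the alarm threshold, so the icon would be orange even though nothing has been observed yet, which matches the worst-case behaviour described for Figure 3a.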

Data and Preliminary Analyses
In order to estimate the reliability of the SBMC, three different analyses were performed. The first analysis evaluates the performance of the model chain by comparing forecast and observed discharge, from a purely modelling point of view; the second analysis is similar but performed at the warning area scale. When a flood alert is issued, it is usually not uniform over the whole region: the region is subdivided into five areas, called warning areas (Figure 4), to refine the scale of the area affected by the alert and to differentiate the alert according to the forecast over each area. Finally, the third analysis adds a comparison of the modelling results with the criticalities observed on-site, where information was available.
Firstly, all available archived simulations were retrieved from the server. A further selection was made to determine the significant events according to pre-established criteria, i.e., excluding those that had a very small number of activations and simulations per basin (fewer than 4 basins activated or less than 3 h of simulation, i.e., 18 runs) and for which, confirming their minor significance, a zonal criticality analysis had not been carried out. A total dataset of 99 events was thus obtained for the period between August 2006 and December 2019, with a gap between the autumn of 2015 and that of 2017, partly due to the lack of radar data.
In the same way, all registry files of the model chain cross-sections were recovered; these files contain information on which basins were present in a given time period, since the total number of cross-sections varied over time, and on the related threshold changes. In several cases, as a result of subsequent analyses and ex-post information on the impacts of hydrometeorological events, the first-guess thresholds established for the sections were replaced over time with values considered more appropriate; therefore, when verifying the simulated and observed threshold exceedances, only the most recent threshold values were used. As better illustrated in the dedicated paragraphs, while the first two analyses consider the activated basins only, i.e., the ones that in each event reached the discharge activation threshold, the third analysis also considers which basins were active and which were not, in relation to the observed ground effects. To better estimate the scores related to this last analysis, the variation in the number of modelled basins over time was therefore taken into account, i.e., for each event it was checked which basins were present in the registry file at that time, so as not to distort the statistics of correct non-activations and missed alarms. Four phases have therefore been identified, distinguished only by the inclusion or exclusion of modelled cross-sections in the registry, and not by variations in the related threshold values. Over the whole period, 219 cross-sections were considered.
In order to complete analyses 2 and 3, events with damage were then discriminated, based on a previous analysis that compared the forecast colour-coded classes of criticality with the observed criticalities per warning area. Colour-coding is practical for its intuitiveness, therefore it has widely been employed [16]: darker shades correspond to increasing criticalities, so to more severe damages or other consequences. Some sample records of this study are reported in Table 1. Firstly, the events with damage and, subsequently, those with criticality or damage in, at least, a certain warning area were classified; finally, as illustrated in the third analysis, the events for which it was possible to define the criticality at the single basin-scale were further classified (Table 2). Events with damage in D and E areas considered here are numerically much lower than those relating to the other areas because the cross-sections located in these areas have only been inserted in the most recent phase, so events that occurred before their inclusion are not of interest for the zonal analysis.
Subsequently, for each simulation, i.e., for each event, and for each basin the following information was extracted: (1) the number of activations; (2) the related nowcast and observed discharge threshold exceedances or non-exceedances (where, as mentioned above, observed discharge means the discharge simulated with observed precipitation data); (3) the warning times before the exceedance of the thresholds, i.e., the anticipation times Tw1 and Tw2 on the observed exceedance of threshold 1 or threshold 2, respectively, and the instant when this had first been nowcast (Table 3).
Table 3. Sample record for each basin activation, containing the information used to compute the model chain scores. Q1 stands for the threshold 1 discharge value, Q2 for the threshold 2 discharge value. Subscripts "obs" and "frc" denote observed and forecast discharge, respectively. Tw is the anticipation or warning time on threshold exceedances.
It is necessary to specify that instant does not mean the future time on the X-axis of the graph at which the exceedance is predicted with respect to the now bar (which may differ in subsequent model runs and for the different forecast scenarios generated by PhaSt), but the instant of the model run, i.e., the time of the first available simulation for which at least one threshold exceedance is predicted among all the nowcast scenarios. As a practical example, let us assume that Figure 3a represents the first model run in which at least one nowcast threshold exceedance appears; the current time is then 18:50 UTC. Every 10 min a new graph is generated based on the updated scenarios generated by PhaSt. Then, at 19:50 UTC (Figure 3b), a threshold exceedance is observed for the first time on the left side of the now bar: the observed discharge exceeded the threshold value 1 h after the time when this had first been nowcast. This interval is the anticipation or warning time available for taking measures. Since the model time step is 10 min, Tw takes discrete values that are multiples of 10 min.
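The definition of the warning time can be written down as a short sketch. The data layout (one tuple per model run) and the function name `warning_time_minutes` are assumptions for illustration; the calendar date is arbitrary, only the clock times follow the Figure 3 example. The Tw = 0 and Tw = NaN conventions are those used in this study for exceedances observed before being nowcast and for missing observed exceedances, respectively.

```python
import math
from datetime import datetime, timedelta

def warning_time_minutes(runs):
    """Warning time Tw for one activation and one threshold.

    runs: chronological list of (run_time, nowcast_exceeds, observed_exceeds),
    where nowcast_exceeds is True if at least one nowcast scenario of that run
    exceeds the threshold, and observed_exceeds is True if the observed
    discharge (left of the now bar) exceeds it.
    """
    first_nowcast = None
    for run_time, nowcast_exceeds, observed_exceeds in runs:
        if observed_exceeds:
            if first_nowcast is None:
                return 0.0  # observed before it could be nowcast
            return (run_time - first_nowcast).total_seconds() / 60.0
        if nowcast_exceeds and first_nowcast is None:
            first_nowcast = run_time  # time of the first run that nowcast it
    return math.nan  # no observed exceedance

# Runs every 10 min from 18:40; exceedance first nowcast at 18:50,
# first observed at 19:50 (the date itself is arbitrary).
t0 = datetime(2020, 1, 1, 18, 40)
runs = [(t0 + timedelta(minutes=10 * i), i >= 1, i >= 7) for i in range(8)]
tw = warning_time_minutes(runs)  # 60.0
```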
Furthermore, since the same basin can be activated more than once during the same event, especially in the case of events extending beyond the daily duration, multiple activations related to the same basin were considered, assuming a minimum of 6 h-time intervals between them, corresponding to the time window from the initialization of the model up to the now bar. For each one, any exceedance and related anticipation times were accounted for. A lower time interval would not be relevant for operational purposes, since once a dot points a potential or observed criticality out, the operator in charge of the event monitoring has already evaluated the situation and, if need be, warned the potentially affected municipalities, and kept monitoring. Since, based on the rainfall pattern, there can be even more threshold exceedances within the established 6 h time interval, only the first observed threshold exceedance was considered, for the same reason.
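One possible reading of the 6 h rule is sketched below: for a given basin and threshold, an observed exceedance is counted only if it occurs at least 6 h after the previously counted one, so that only the first exceedance in each window is retained. The function name and the input format are hypothetical, and the exact bookkeeping used in the study may differ.

```python
from datetime import datetime, timedelta

def first_exceedances(exceedance_times, min_gap=timedelta(hours=6)):
    """Keep only exceedances separated by at least `min_gap` from the last kept one."""
    kept = []
    for t in sorted(exceedance_times):
        if not kept or t - kept[-1] >= min_gap:
            kept.append(t)
    return kept

start = datetime(2020, 1, 1, 0, 0)  # arbitrary date for illustration
times = [start + timedelta(minutes=m) for m in (0, 60, 350, 360, 780)]
counted = first_exceedances(times)  # keeps the exceedances at 0 h, 6 h and 13 h
```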

Model Evaluation Criteria
Based on these preliminary results, the contingency tables of nowcast vs. observed threshold exceedances and the statistics for Tw1 and Tw2 were obtained, as better described in the following paragraph. Such scores are considered more functional for our purpose than other objective functions traditionally used in hydrology, such as the Nash-Sutcliffe Efficiency index, RMSE, or Percent Error in Peak [14]. While these goodness-of-fit criteria can easily be employed at a specific site to compare simulated and observed discharge series, and are extremely useful for the model calibration and validation phases, they might not be so convenient in this particular case, considering both the model structure and the operational procedures illustrated in Section 2.1. During an event, for each activated cross-section the operator is provided with a graph like those in Figure 3, which updates every 10 min, each time showing 10 different nowcast hydrographs; this raises the issue of how to compute the scores. Here we want to estimate the overall reliability of the model chain and its usefulness from a more operational point of view, which is why we focus on discharge threshold exceedances, i.e., those that provide a clearly visible signal to the operator during the event (Section 2.1), on anticipation times on these exceedances, and on event-related impacts on the territory. When, for instance, a threshold 1 exceedance is observed and a threshold 2 exceedance is expected, the operator will most probably take measures to make sure that field operators are alerted to the potentially dangerous situation. In other words, a nowcast threshold 2 exceedance may be sufficient to expect a criticality, and a communication will be sent. We are therefore more interested in qualitative nowcasting, i.e., in knowing whether a threshold will or will not be exceeded, than in the exact shape of the hydrograph.
To sum up, choosing different hydrograph descriptors as estimators would make the analysis much more computationally expensive without much improving the understanding of the model performance, while focusing on threshold exceedances allows us to evaluate the signals in the way they are presented to operators in-event.
Once the scores to define model performance were chosen, a priori criteria were established to describe the goodness of results [17]; these are reported in Table 4. Establishing a priori goodness criteria for the warning time is more complex. The warning time may vary according to the rainfall pattern of the specific event and the time of concentration of the basin, independently of the model setup: storm cells characterized by very fast dynamics and high rainfall intensities, combined with the smallest drainage areas, can lead to very limited anticipation times, regardless of how well the model simulates the process. Besides, it is reasonable to expect Tw2 to be, generally, greater than Tw1, since the pre-alarm threshold is reached earlier than the alarm threshold, so one could think of different criteria for them. The worst case is represented by Tw = 0, i.e., when no threshold exceedance is nowcast before being observed, while longer warning times are always desirable in order to be able to take measures. The maximum possible warning time is limited, as already mentioned, by the response time of the drainage area and by the inherent uncertainty in rainfall nowcasting, whose reliability is highest up to 1 h and inevitably decreases with lead time. So, we based the evaluation purely on operational aspects, i.e., the time available for civil protection and municipalities to act once communication is made.
As mentioned above, in this first phase the comparison is of a purely modelling type: the performance of the forecast flow with respect to the observed flow was estimated in terms of correct forecast, false alarm and missed alarm rates. Please note that observed flow here means the discharge simulated with the observed rainfall data only, from the initialization of the model run up to the now bar. One could relate these scores to the specific PhaSt component, rather than to the model chain as a whole.
For the given boundary conditions and thresholds, both nowcast and observed streamflow are computed with the same model and parameter values, while the estimated rainfall input is the only varying forcing. This is why analysis 3 accounts for observed criticalities, too.
From the information thus obtained, summarized in a file for each event (Section 2), a procedure was automated to obtain, for each event and each basin: the number of activations; the correct forecast (for both exceedances and non-exceedances), false alarm and missed alarm rates; and the values of Tw1 and Tw2. Computing the scores both by event and by basin and storing this information, in addition to the total contingency tables (structured as in Table 5), is useful for: verifying the results; investigating a particular event (if the scores are low, this could be due to PhaSt); investigating a particular basin (if the scores are low, this could be due to inadequate thresholds, see Analysis 3); and looking for a correlation between the average Tw per basin and the basin area, under the hypothesis that very low Tw values may also be due to the small size of the drainage areas. For simplicity, we called the threshold exceedance errors missed alarm and false alarm, without considering further evaluations made by the operator.
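The score computation can be sketched as follows for one threshold. This is an illustration under stated assumptions, not the automated procedure itself: the function name and record layout are hypothetical, and all three rates are assumed to be expressed as fractions of the total number of activations, so that they sum to one. The correct forecast rate counts both exceedances and non-exceedances that were correctly nowcast.

```python
def contingency_scores(records):
    """Contingency table and rates for one threshold.

    records: iterable of (nowcast_exceeded, observed_exceeded) booleans,
    one pair per basin activation.
    Assumption: CFR, FAR and MAR are all relative to the total sample size.
    """
    hits = misses = false_alarms = correct_negatives = 0
    for nowcast_exceeded, observed_exceeded in records:
        if nowcast_exceeded and observed_exceeded:
            hits += 1
        elif nowcast_exceeded:
            false_alarms += 1
        elif observed_exceeded:
            misses += 1
        else:
            correct_negatives += 1
    total = hits + misses + false_alarms + correct_negatives
    if total == 0:
        raise ValueError("no activations to score")
    return {
        "table": {"hits": hits, "false_alarms": false_alarms,
                  "misses": misses, "correct_negatives": correct_negatives},
        "CFR": (hits + correct_negatives) / total,  # correct forecast rate
        "FAR": false_alarms / total,                # false alarm rate
        "MAR": misses / total,                      # missed alarm rate
    }

sample = [(True, True), (True, False), (False, True), (False, False), (False, False)]
scores = contingency_scores(sample)  # CFR 0.6, FAR 0.2, MAR 0.2
```

Calling the same function on the records of a single event or a single basin gives the per-event and per-basin scores discussed above.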
The sample used and the contingency tables for each threshold are reported below. In Table 6, the actual sample is given by the number of activations, which is basically the number of records structured as in Table 3. The number of events and basins considered is reported, too, together with the total correct forecast rates. Table 7a,b shows the contingency tables for thresholds 1 and 2, respectively, structured as in Table 5.
Results are overall better for threshold 2 (Thr2) than for threshold 1 (Thr1); this matters, since Thr2 is more directly related to criticalities in the territory. According to the pre-established criteria (Table 4), the correct forecast rate (CFR) was sufficient for Thr1 and satisfactory for Thr2, about 17% higher. As displayed in Table 7a,b, the greatest contribution to the percentage of correct forecasts, in particular for threshold 2, is given by the non-exceedances rather than the exceedances. Events with damage are investigated more thoroughly in analyses 2 and 3. False alarm rates (FAR) were comparable and good in both cases, while missed alarm rates (MAR) were satisfactory for Thr1 and very good for Thr2, as they were reduced by around a third. So, the Thr2 FAR was higher than the MAR, which indicates a positive bias; this is still preferable to a negative bias.
As mentioned above, these results represent the overall scores over the whole time period and all basins considered; we now wish to investigate the scores for each of the 99 events. Results are summarised in Table 8, which can be read as follows: considering the first two columns, 10 out of 99 events (about 10%) had a very good correct forecast rate for threshold 1, i.e., between 90 and 100% correct forecasts; 36 events had a very good rate for one of the error scores (false or missed alarms) with respect to threshold 1, i.e., between 0 and 10%; and so on. The results are promising, particularly with regard to forecast threshold 2 exceedances: 86 events had a satisfactory to very good correct forecast rate (more than 70%), while only 6 had poor rates; 92 events recorded a good to very good missed alarm rate (less than 20%) and 80 events had a good to very good false alarm rate.
Below is the relative frequency histogram of the warning times Tw (Figure 5a). For both the pre-alarm and alarm thresholds Thr1 and Thr2, the warning times were divided into 13 classes of 10 min each, consistent with the temporal resolution of the model (Section 2.1), except for the last class, which incorporates all Tw greater than or equal to 120 min. When a threshold exceedance is observed before it could be forecast, Tw = 0 is assigned; this often occurred at the activation of the catchment, basically when the first hydrograph is saved. If there are no observed exceedances, or no exceedances at all, Tw = NaN is assigned. Therefore, the Tw sample is always equal to or smaller than the total activation sample: it was 1898 and 921 for Tw1 and Tw2, respectively.
It can be deduced from the histogram that Tw1 and Tw2 have a right-skewed (positively skewed) distribution, i.e., most values are small and higher values of Tw appear with a lower frequency. Higher values of Tw2 appear with higher frequency than those of Tw1, as expected, as threshold 1 is exceeded sooner. Tw1 values higher than 50 min and Tw2 values higher than 70 min can be considered outliers.
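The classification of warning times into 13 classes can be sketched as below. The function name is hypothetical; the class layout (12 classes of 10 min starting at 0, plus one open class for Tw of 120 min or more) and the exclusion of NaN values follow the description above.

```python
import math

def tw_relative_frequencies(tw_minutes, n_classes=13, width=10):
    """Relative frequency of warning times per class.

    Class i covers [i * width, (i + 1) * width) minutes; the last class
    collects every value >= (n_classes - 1) * width. NaN entries
    (no observed exceedance) are excluded from the sample.
    """
    counts = [0] * n_classes
    for tw in tw_minutes:
        if math.isnan(tw):
            continue
        counts[min(int(tw // width), n_classes - 1)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_classes
    return [c / total for c in counts]

# Illustrative values only, not data from the study.
freqs = tw_relative_frequencies([0, 0, 10, 60, 120, 150, math.nan])
```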
Out of these values, two more samples were then extracted for each threshold: the average warning time in each event, Twea, and the average warning time by basin, Twba. We wanted to investigate these derived distributions to gain insight into the mean warning time in the situations operators have to face, and to search for any correlation between a basin and its mean warning time. The Twea sample size therefore corresponds to the number of events considered, 99, and the Twba sample size to the number of basins, 219. Unfortunately, only a very small sample is available for each basin, on average 9 values for Tw1 and 4 for Tw2; therefore, the reliability of the estimated mean warning time for a basin is very low. For each event, these values increase to 19 and 10, which means that, on average, the mean warning time for a specific event is computed on 19 values for Tw1ea and 10 values for Tw2ea, which is still a small sample, but roughly double. During the more severe or widespread events, the forecast and observed threshold exceedances were so many that the mean warning time can be computed over more than a hundred values (Table 9).
Table 9. Available sample size to compute basin average and event average warning time for a given basin or event.
        Tw1ba   Tw2ba   Tw1ea   Tw2ea
mean    9       4       19      10
max     39      33      158     121
Moreover, no significant correlation was found between Tw1ba or Tw2ba and the drainage area: the correlation was around 10% for the former and even lower for the latter. This might lead to the conclusion that the warning time depends mostly on the dynamics of the particular hydrometeorological event and on the configuration of the nowcast rainfall field, rather than on the specific basin and its geomorphological characteristics; this could be investigated again in the future, when a larger sample is available for each basin.
Probably, since the time of concentration of basin measures a basin response time, this could happen because we are considering basins of comparable sizes and possibly a stronger correlation would be found comparing a set of basins with It can be deduced that Tw1 and Tw2 have a left-skewed distribution so that higher values of Tw appear with a lower frequency. Higher values of Tw2 appear with higher frequency than Tw1, as expected, as Tw1 is exceeded sooner. Their average values are Tw1 values higher than 50 min and Tw2 values higher than 70 min can be considered outliers.

Sample n°
From these values, two more samples were then extracted for each threshold: the average warning time in each event, Twea, and the average warning time by basin, Twba. We investigated these derived distributions to gain insight into the mean warning time operators can expect in the situations they face, and to search for any correlation between a basin and its mean warning time. The Twea sample size therefore corresponds to the number of events considered, 99, and the Twba sample size to the number of basins, 219. Unfortunately, only a very small sample is available for each basin, on average 9 values for Tw1 and 4 for Tw2, so the reliability of the estimated mean warning time for a basin is very low. For each event, these values increase to 19 and 10, which means that, on average, the mean warning time for a specific event is computed on 19 values for Tw1ea and 10 values for Tw2ea: still a small sample, but about twice as large. During the more severe or widespread events, the forecast and observed threshold exceedances were so numerous that the mean warning time could be computed over more than a hundred values (Table 9).

Moreover, no significant correlation was found between either Tw1ba or Tw2ba and the drainage area: it was around 10% for the former and even lower for the latter. This might lead to the conclusion that the warning time depends mostly on the dynamics of the particular hydrometeorological event and on the configuration of the nowcast rainfall field, rather than on the specific basin and its geomorphological characteristics; this could be investigated further when a larger sample is available for each basin. Since the time of concentration measures a basin's response time, the weak correlation probably arises because we are considering basins of comparable sizes; a stronger correlation might be found by comparing a set of basins with greater variance in size.
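As an illustration of how the event-averaged and basin-averaged samples can be derived, the sketch below groups individual warning times by event or by basin and computes a correlation coefficient. The record layout (keys `event`, `basin`, `tw`) is hypothetical, and the use of the Pearson coefficient is an assumption, since the text does not state which correlation measure was adopted.

```python
import math
from collections import defaultdict


def grouped_means(records, key):
    """Mean warning time per group, ignoring NaN entries.

    records: iterable of dicts with hypothetical keys 'event', 'basin', 'tw'.
    key: 'event' to obtain Twea, 'basin' to obtain Twba.
    """
    groups = defaultdict(list)
    for rec in records:
        if not math.isnan(rec["tw"]):
            groups[rec[key]].append(rec["tw"])
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical records: one warning time per (event, basin) activation.
records = [
    {"event": "e1", "basin": "b1", "tw": 10.0},
    {"event": "e1", "basin": "b2", "tw": 30.0},
    {"event": "e2", "basin": "b1", "tw": 20.0},
    {"event": "e2", "basin": "b2", "tw": float("nan")},
]
tw_ea = grouped_means(records, "event")
tw_ba = grouped_means(records, "basin")
```

The basin means in `tw_ba` could then be paired with the drainage areas and passed to `pearson` to check for a dependence on basin size.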
For all these reasons, we report statistics of the event-averaged warning times. Table 10 compares the interquartile range and mean values of Tw and Twea for both thresholds, together with the related sample sizes. Over the whole dataset of warning times, the mean value is rated sufficient for both thresholds; the third quartile of Tw2, in particular, is satisfactory. The event-averaged warning times show similar values. The median is poor but still increases. The mean values decrease by around 5 and 10 min for thresholds 1 and 2, respectively. The 75th percentiles increase by 5 min for Thr1 and decrease by 5 min for Thr2.

Analysis 2: Modelling Performance Related to the Observed Zonal Criticalities
As mentioned in Section 2.2, this analysis uses, in addition to the results obtained from the comparison of forecast and observed threshold exceedances, a previous study on the observed criticalities by alert area. In summary, the same procedure was adopted as in analysis 1, but, since in Table 7a,b the greatest contribution to the percentage of correct predictions comes from the correct non-exceedances rather than from the correct exceedances, we focused on the events with damage (Table 2). Results are shown in Table 11. The first row includes the results of analysis 1, for comparison; the second row shows the results considering events with damage or any other criticality; the third row considers the 62 cross-sections included in the A area and the events which had impacts on at least the A area; the following rows do the same for the B and C warning areas. The D and E warning areas were excluded because of the very limited activation sample available for them (23 and 32), for which the scores cannot be considered as reliable as for the other areas. A comparison is nevertheless possible between the A, B and C areas, for which we wanted to check for any bias with respect to the overall results, e.g., whether an area was characterized by an appreciably higher false or missed alarm rate or by shorter warning times.

Considering the selection of critical events over the whole region, the sample is reduced by only about 30%, consistent with the fact that most activations were observed during those events, but the results did not change substantially from those of analysis 1. As expected, there was an increase in correct threshold 1 and 2 exceedances, i.e., +4% and +3%, respectively, offset by a decrease in correct non-exceedances, while the false and missed alarm rates remained practically unchanged.
This may be because many events had consequences on medium-large basins only, typically during non-convective meteorological events, characterised by high cumulated rainfall rather than high intensities; because the damage occurred in areas where no cross-sections are modelled in this chain; or because the criticalities were related to landslides and similar phenomena. Therefore, a more detailed analysis including information on observed criticalities is reported in Section 3.
As for the three warning areas considered, the A area showed a worse performance in terms of missed alarms for both thresholds, while the FAR1 values are comparable. Overall, the B and C area basins had comparable scores, with a higher rate of correct forecasts than A (good for threshold 2 in both areas) and good to very good FAR2 and MAR2, around 10% or less. These results may point to further investigation of the PhaSt performance for these particular events, or of the cross-section thresholds; the latter aspect is analysed further in Section 3.
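The per-area scores discussed above can be tallied along the following lines. This is only a sketch: the function names are hypothetical, and the normalisation (each rate expressed as a fraction of all activations in the area, with CFR summing correct exceedances and correct non-exceedances) is an assumption inferred from the way the scores are discussed, not a definition taken from the text.

```python
from collections import Counter


def outcome(forecast_exceeds, observed_exceeds):
    """Label one activation for one threshold."""
    if forecast_exceeds and observed_exceeds:
        return "correct_exceedance"
    if forecast_exceeds:
        return "false_alarm"
    if observed_exceeds:
        return "missed_alarm"
    return "correct_non_exceedance"


def area_scores(activations):
    """Outcome shares per warning area.

    activations: iterable of (area, forecast_exceeds, observed_exceeds).
    Rates are fractions of the activations in each area (assumed normalisation).
    """
    counts = {}
    for area, forecast, observed in activations:
        counts.setdefault(area, Counter())[outcome(forecast, observed)] += 1
    scores = {}
    for area, c in counts.items():
        n = sum(c.values())
        scores[area] = {
            "CFR": (c["correct_exceedance"] + c["correct_non_exceedance"]) / n,
            "FAR": c["false_alarm"] / n,
            "MAR": c["missed_alarm"] / n,
            "n": n,
        }
    return scores


# Hypothetical activations for two warning areas.
scores = area_scores([
    ("A", True, True), ("A", False, False), ("A", True, False), ("A", False, True),
    ("B", False, False), ("B", True, True),
])
```

Keeping the sample size `n` next to the rates makes it easy to discard areas, such as D and E here, whose scores rest on too few activations.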
The statistics of the warning times follow, again including the results of analysis 1 for comparison (Table 12). The results for the A and C areas are similar to each other and slightly better than for the B area. For instance, the mean Tw1 and the 75th percentiles in the B area are about 5 to 10 min lower than in A. Longer anticipation times on threshold 2 exceedances are more frequent in A and C, where the 50th percentile (and, in C, the 75th percentile too) increases by 10 min; this might seem little for larger basins, but it becomes more relevant in the smallest drainage areas and can be useful during the operational phases.

Analysis 3: Modelling Performance Related to Observed Criticalities at the Basins Scale
As mentioned, the preparatory work for this last analysis concerned the verification of the criticalities which occurred in the region, this time at the basin scale. As part of its activities, when a relevant hydrometeorological event occurs, ARPAL publishes a Hydrometeorological Event Report (REM). As a first step, based on such reports, the events causing severe ground effects or any other criticalities were selected. Then, the availability of the corresponding simulations in the archive was checked and, based on the available official information sources, namely the Civil Protection archives, the largest possible number of basins in which these criticalities were found was listed. Comparing model results with observed impacts implies that it is not sufficient to look at forecast threshold exceedances: basin activations must be considered, too. For example, when damage occurred in a drainage area whose corresponding cross-section did not activate at all, this counts as a missed alarm. Unfortunately, the limited availability of information further reduced the sample; indeed, in order not to alter the results, in particular the percentage of correct non-activations and false alarms (the cases of observed zero criticality tend to be more numerous, at least apparently), the events with little information available had to be excluded. In addition, events which caused criticalities on medium-large basins but had no consequences on small basins were also excluded; the same was done when the reported damage was extremely localised, when it was impossible to identify the basins, or when the basin was present in the damage file but with insufficient information. The result was a dataset of 25 events.
If the observed criticality flag was 0 (see Section 2), the modelling was considered correct when there was no activation, when there was activation only, or when the basin activated with a forecast threshold 1 exceedance, as none of these cases implies the occurrence of a criticality; a forecast threshold 2 exceedance, on the other hand, was considered a false alarm.
If the observed criticality flag was equal to 1:
• if there was no activation, the case is considered a missed alarm;
• if the basin activated without any threshold exceedance, the case is considered an underestimation rather than a full missed alarm, since the model did provide a signal on that basin, even if not a sufficiently clear one; at the operational level, however, that signal can prompt those who monitor in SOR to follow the situation with greater attention and proceed with further evaluations;
• if the basin activated with a forecast exceedance of threshold 1 only, the case is considered a less serious underestimation than activation alone;
• if threshold 2 is exceeded, the modelling is correct.
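The scoring rules for both values of the criticality flag can be summarised in a short Python sketch. The function name, argument names and outcome labels are hypothetical; only the decision logic follows the rules stated above.

```python
def classify_case(criticality_flag, activated, thr1_forecast, thr2_forecast):
    """Score one (event, basin) case against the observed criticality flag.

    criticality_flag: 1 if a criticality was observed in the basin, else 0.
    activated: True if the cross-section activated during the event.
    thr1_forecast / thr2_forecast: True if the nowcast exceeded that threshold.
    """
    if criticality_flag == 0:
        # No observed criticality: only an alarm (threshold 2) is a false alarm.
        return "false_alarm" if thr2_forecast else "correct"
    if not activated:
        return "missed_alarm"
    if thr2_forecast:
        return "correct"
    if thr1_forecast:
        return "light_underestimation"
    # Activation alone: a signal was given, but without any exceedance.
    return "underestimation"


# Example: criticality observed, cross-section activated, only threshold 1 nowcast.
label = classify_case(1, activated=True, thr1_forecast=True, thr2_forecast=False)
```

Ordering the checks from the most to the least severe signal keeps the four outcomes for flag 1 mutually exclusive.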
The case of multiple activations during the same event also needs attention. For a given basin and event, for example, the following could occur: during a hydrometeorological event, damage occurs in a drainage area whose cross-section activated twice, the first time without alarm threshold exceedance and the second time with exceedance. While the exceedance counts as correct, nothing can be said about the first outcome, since the time at which the damage occurred is unknown and information about the exact forecast time would have to be stored (Table 13). Looking for such additional information would be inefficient; besides, these cases occurred with very low frequency. Therefore, they were excluded, without substantially affecting the activation sample size.

Table 13. Exclusion of the records returning occurred criticality and threshold 2 non-exceedance in the case of multiple activations with divergent outcomes. (Columns: flag for the given event and basin; basin activation n°; Qfrc values; score.)

This is not an issue in the case of null observed criticality, as the flag is taken as a reference for all activations. The total sample size used to compute the scores is summarised in Table 14. It is even larger than the one used in Section 3.2, since previously only activations were considered.

Table 14. Sample used in analysis 3. Basins number refers to the overall time period; "cases" refers to both basin activations and non-activations.

Area      Events considered   Basins number   Total cases
Liguria   25                  219             5413

Based on the previous considerations, the contingency tables had a slightly more complex structure, as in Table 15, which reports the scores resulting from the model forecast. Table 16 sums up all correct forecasts and all underestimations from Table 15. Summing up all correct predictions (84.2%) and all totally wrong ones (missed and false alarms, 10.7%) gives promising scores.
The analysis was then repeated considering the observed discharge, too (i.e., the discharge simulated with the observed rainfall field), to measure how well the modelling chain as a whole can detect criticalities, since, even while these are occurring, it is still possible, although with a delay, to intervene and activate the rescue chain. To show at a glance how these scores changed in comparison with those derived from the forecast discharge scenarios only, the scores which improved or worsened with respect to Table 15 are highlighted in green or red, respectively (Table 17). Table 18 gives a compact representation of the results of Table 17. As expected, with the contribution of the observed discharges the underestimation rate decreased, in favour of an increase in minor underestimations and in the criticality detection rate; on the other hand, false alarms also increased, accompanied by a decrease in correct activations. The latter can be explained, on the one hand, by the need to intervene on the thresholds; on the other hand, there are necessarily limits to the analysis of criticalities at the basin scale, due to the limited availability of information: while the criticality flag 1 is assigned with certainty when damage is found, the null flag is always in doubt.
By summing up all correct predictions (83.8%) and totally wrong ones (missed and false alarms, 12.7%), similar scores are obtained.
Afterwards, the scores conditional on the occurrence, or not, of any criticality or damage were computed. Sättele et al. [18] estimate the reliability of a debris flow alarm system by first computing the probability of detection, POD, and the probability of false alarms, PFA, given that a hazard event occurs or not. They estimate POD as the expected value of the ratio of the number of detected events to the number of total events, and PFA as the expected value of the ratio of the number of days with false alarms to the number of event-free days. Here, we are considering a (limited) set of hydrometeorological events, each of them recording the occurrence, or not, of any criticality for each basin. Events with criticalities were therefore isolated to compute the scores conditional on the occurrence of any criticality or damage, to further investigate correct criticality detections in relation to missed alarms, given that a criticality occurs; similarly, the probability of false alarm, given that no criticality occurs, was estimated considering criticality-free events only. Criticality detection means alarm threshold exceedance. If we define POD and PFA based on these criteria, we obtain:

$$\mathrm{POD} = \frac{\text{number of detected criticalities}}{\text{number of observed criticalities}},$$

$$\mathrm{PFA} = \frac{\text{number of false nowcast criticalities}}{\text{number of criticality-free simulations}}.$$

Please note that with this definition of POD we do not refer to the total number of rainfall events, but only to the number of observed criticalities and impacts on the territory; similarly, when defining PFA, we do not refer to the number of rainfall-event-free days but, among these events, to those which resulted in null criticalities for specific basins. The resulting probability is therefore not a false alarm probability per year, for example, but per event or, in practice, per model simulation.
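A minimal sketch of these conditional scores is given below. It assumes that POD is the share of observed criticalities for which an alarm threshold (threshold 2) exceedance was nowcast, and that PFA is the share of criticality-free simulations with a nowcast threshold 2 exceedance; the function name and the input layout are hypothetical.

```python
def pod_and_pfa(cases):
    """Conditional detection and false alarm probabilities.

    cases: iterable of (criticality_flag, thr2_nowcast) pairs, one per basin
    and model simulation (hypothetical layout).
    """
    cases = list(cases)
    observed = [thr2 for flag, thr2 in cases if flag == 1]
    criticality_free = [thr2 for flag, thr2 in cases if flag == 0]
    # Fractions are undefined (NaN) when the conditioning sample is empty.
    pod = sum(observed) / len(observed) if observed else float("nan")
    pfa = (sum(criticality_free) / len(criticality_free)
           if criticality_free else float("nan"))
    return pod, pfa


# Hypothetical sample: two cases with criticality, four without.
pod, pfa = pod_and_pfa([
    (1, True), (1, False),
    (0, True), (0, False), (0, False), (0, False),
])
```

Because both denominators count simulations within hydrometeorological events, the resulting PFA is a per-simulation probability, not a per-day or per-year one.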
This choice also derives from the nature and structure of the model: as mentioned above (Section 2), the model chain is a monitoring chain, which does not run continuously but only at the occurrence of significant precipitation events. Therefore, exploiting the previously made classification into hit cases (correct threshold 2 exceedances), neutral cases (correct non-exceedances), false alarms and missed alarms, and by isolating the criticality occurrences [4], we obtain the scores in Table 19. The first row is derived by considering nowcast scenarios only; the second row accounts for both the discharge nowcast and the simulation with the observed rainfall field. The missed alarm rate cannot change, since it derives from non-activation, for which no observed or nowcast discharge is available.
With the second method, the activations without any exceedance decrease by more than 10%, as mentioned above, in favour of a slight increase in the minor underestimations (which only apparently fails to indicate an improvement) and an increase of about 10% in the correct detections, which reach about 50%. These outcomes can also be considered from a more operational point of view. When a severe event occurs, characterised by very intense rainfall, any cross-section activation leads the operator to focus on the area. The operator then does not interpret model results mechanically but proceeds with further evaluations. For instance, when an intense storm cell moves rapidly over an area, so that an alarm discharge is nowcast at the specific cross-section over which the most intense part of the cell is predicted to be located, the neighbouring cross-sections, which might not yet predict a threshold 2, or 1, exceedance, will be kept under close observation, too. Therefore, we can sum the correct detections with the minor underestimations, reaching 65%, and with any kind of signal, including activations only, reaching 82%. This means that, even with some delay and underestimation, the model chain provides a signal in more than 80% of the cases, which can be considered a good score.
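The cumulative reading of the signals can be made explicit with a small worked example. The shares below are the rounded percentages quoted in the text (about 50% correct detections, 65% and 82% cumulative); the intermediate shares of 15% and 17% are back-computed from those rounded figures for illustration only and are not values taken from Table 19.

```python
def cumulative_detection(shares):
    """Cumulative detection levels, given a criticality occurred.

    shares: percentages of cases by outcome, using hypothetical keys
    'correct', 'light_underestimation' and 'underestimation'.
    """
    full = shares["correct"]
    full_or_light = full + shares["light_underestimation"]
    any_signal = full_or_light + shares["underestimation"]
    return {
        "full": full,
        "full_or_light": full_or_light,
        "any_signal": any_signal,
        "missed": 100 - any_signal,  # remaining cases had no activation at all
    }


# Rounded shares in percent, back-computed from the figures quoted in the text.
levels = cumulative_detection(
    {"correct": 50, "light_underestimation": 15, "underestimation": 17}
)
```

Read this way, the choice of which level to report reflects how risk-averse the operator's interpretation of a partial signal is.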
The analogous table for PFA was not reported, since results did not substantially change with respect to Table 17: an overall PFA estimate of about 10% was obtained.
Besides the above considerations, it should be emphasised that analysis 3 is limited by the availability of information. Furthermore, in general, many false alarms may result from the impossibility of verifying that a criticality had occurred in the area, or from the choice of overly conservative thresholds; that is, it could later be investigated whether there are cases in which the threshold is exceeded both in the observation and in the forecast but no damage is reported.
Finally, similarly to Section 3.1, after estimating the overall scores we can look in more detail at the scores for each basin, rather than for each event. This choice is made because criticalities are evaluated at the basin scale and can be correlated with the estimated threshold values. Table 20 summarises the scores computed for the 219 basins and can be read as follows: considering the last row, 52% of the basins had a very good PFA, i.e., between 0 and 10%; 31% of the basins had a good PFA, i.e., between 10 and 20%; and so on.

Table 20. Performance ratings of forecast threshold exceedances by event. Subscript "b" stands for "basin"; corr stands for correct, lue stands for light underestimation and ues for any kind of underestimation.

It is very important to highlight that we computed PFA as the ratio between the number of false exceedances during hydrometeorological events with no damage and the number of total hydrometeorological events in the region, not the total number of days with no event at all. In other words, the total number of hydrometeorological events with no damage on a basin was used, and not the total number of days with no damage on that basin. This is why the PFA may seem too high.

Performance Rating by Basin
Even though such values are not highly reliable due to the limited sample available for each basin, they allow us to prioritise basins for further, deeper analysis. For instance, one could focus on the basins which provided the poorest scores in terms of false and missed alarms and investigate whether the discharge thresholds need to be modified, for instance with the support of up-to-date site-specific hydraulic studies.
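The reading of the basin ratings can be sketched as follows. Only the two PFA classes mentioned in the text (very good between 0 and 10%, good between 10 and 20%) are encoded; the treatment of the class boundaries and the catch-all label "other" for higher values are assumptions, as are the function names.

```python
from collections import Counter


def rate_pfa(pfa, classes=((0.10, "very good"), (0.20, "good"))):
    """Rating of a basin PFA given as a fraction.

    classes: (upper bound, label) pairs in increasing order; a value equal to
    an upper bound is assigned to the next class (assumed convention).
    """
    for upper, label in classes:
        if pfa < upper:
            return label
    return "other"  # ratings above 20% are not specified in this sketch


def rating_shares(pfa_by_basin):
    """Share of basins falling in each rating class."""
    labels = Counter(rate_pfa(value) for value in pfa_by_basin.values())
    n = len(pfa_by_basin)
    return {label: count / n for label, count in labels.items()}


# Hypothetical per-basin PFA values.
shares = rating_shares({"b1": 0.05, "b2": 0.15, "b3": 0.0, "b4": 0.5})
```

Sorting the basins by their rating would give the priority list for revisiting the discharge thresholds.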

Discussion
In order to estimate the reliability of SBMC, three different analyses were performed. The first analysis evaluated the performance of the model chain by comparing forecast and observed discharge, from a purely modelling point of view; the second analysis employed a similar method, though applied at the warning area scale and considering critical events. The last analysis included a comparison of the modelling results with the criticalities observed at the basin scale, where information was available. Criteria were established a priori to rate the results in terms of correct forecast, false alarm and missed alarm rates (CFR, FAR, MAR) and anticipation times (Tw) on threshold exceedances.
Here, observed discharge means the discharge simulated using the observed rainfall field as input; hence the first analysis relates more to the performance of the PhaSt component than to the chain as a whole, since the only varying forcing is precipitation. The performance of the PhaSt algorithm itself has already been investigated in the past, for instance by comparing the SBMC output under different forcings: observed rainfall data only; persistence forecast (assuming that the rainfall pattern after the current time is identical to the one which occurred during the last hour); and, as in the current operational chain, observed rainfall data coupled with the PhaSt nowcast scenarios. Thanks to the radar rainfall field estimates, which PhaSt needs as input, significant improvements in the nowcast were observed [2]. Here, results were promising, too. The overall CFR related to the alarm threshold (CFR2) is above 80%, classified as good, while FAR2 and MAR2 are around 10% and 5% (good and very good), respectively. Computing such scores for each of the 99 events available, 72 events and 67 events had very good MAR2 and FAR2 (less than 10%), respectively, while 86 events had satisfactory to very good CFR2. As for the warning times available prior to threshold exceedances, their distribution is skewed towards low values, which was expected given the very small size of the basins and their sensitivity to intense, rapidly evolving storm phenomena. Nevertheless, it is sometimes possible to have enough anticipation on the exceedances, particularly on threshold 2. The event-averaged mean value of the warning time for threshold 2 (Tw2ea) is estimated at around 15 min; the mean value of Tw2 over the original dataset is around 20 min, but in some cases Tw2 can reach 1 h or more, depending on the event dynamics.
The second analysis focuses on events that caused criticalities of any severity, at the warning area scale. Due to the very limited sample size of basins in the D and E warning areas, recently added to the model chain, a comparison was only possible for the three coastal areas, A, B and C. This revealed, overall, better performance for basins in B and C in terms of nowcast threshold exceedances (good CFR2, around 10% higher than in A, where it was still satisfactory) and slightly better performance for A and C in terms of expected warning times, between 20 and 30 min.
Finally, the third analysis accounts for both activations and non-activations and employs a more articulated contingency table structure: in the case of observed criticalities, only non-activations are considered completely missed alarms, while activations alone and threshold 1 exceedances, which still provide a signal at a cross-section, are considered detections with underestimation and light underestimation, respectively; a threshold 2 nowcast exceedance is then a full criticality detection. It is very complex to determine the exact trajectory and dynamics of intense and localised storm cells. Even when threshold 2 is not exceeded at a considered cross-section but is exceeded in neighbouring basins, all signals provided by the chain, e.g., threshold 1 exceedance only or just activation, could be accounted for, adopting a more risk-averse approach. The overall probability of detection of the model chain, given that a criticality occurs and considering all signals provided, is around 80%.
Similarly, the false alarm probability, given that no criticality occurs during a hydrometeorological event (not per day), is around 10%. The same scores were computed at the basin scale. Even if uncertain due to the limited sample available for each basin, they allow us to focus on the basins with the poorest performance and to investigate the reasons behind it; this may lead, for instance, to an update of the threshold values.
Ideally, one aims at reducing the forecast uncertainties to obtain a correct forecast rate as close as possible to 100% and to issue a flood warning early enough, or take timely measures in-event. Unfortunately, uncertainties cannot be reduced to zero and it is not always possible to forecast flash floods in time [6]. Particularly for very small basins (drainage area lower than 15 km²), small errors in the nowcast rainfall field can propagate through the model chain and strongly affect the nowcast discharge [9]. Nevertheless, thanks to the availability of radar data, characterised by high spatial and temporal resolution, useful tools can be developed to help fill the gap in discharge forecasts for small and very small basins, for which information is usually hard to retrieve and response times are very short. While natural variability has a stochastic nature, all types of uncertainty associated with imperfect knowledge or measurement errors are defined as epistemic or knowledge uncertainties; they are easier to detect and quantify and it should be possible to reduce them through further knowledge (e.g., by gathering additional data) [19]. Threshold values can be included in this type of uncertainty and may deserve further investigation and, possibly, updating.

Conclusions
The Small Basins Model Chain is a radar-based flood nowcasting chain employed in the Liguria region for event monitoring in basins with drainage areas lower than 15 km², which are numerous in the region. A monitoring tool for such small basins is fundamental, since they can be dangerous for the population, especially in heavily urbanised areas [2]. This nowcasting chain was implemented over several years and consolidated during previous works, so that it is now used for operational purposes. In particular, the performance of the PhaSt algorithm component has already been investigated in previous studies, showing significant nowcast improvements [2]. The nowcasting chain now counts more than 200 modelled cross-sections in the region, giving a detailed picture of the impacts that meteorological events have on the territory. This work aimed at analysing the performance of the model chain, assessing the extent to which it can improve the predictability of flash floods and the anticipation times for this particular kind of basin. Overall, the correct forecast, false alarm and missed alarm rates, based on both model outcomes and the observed impacts or damage in the region, were considered satisfactory. Improving the anticipation time of flash flood events in such small basins is difficult, but during the monitoring phases, while the whole Civil Protection System is already alerted and ready to activate, having information even a few tens of minutes in advance can be helpful to take measures at specific sites.