Flood Uncertainty Estimation Using Deep Ensembles

Abstract: We propose a probabilistic deep learning approach for the prediction of maximum water depth hazard maps at high spatial resolutions, which assigns well-calibrated uncertainty estimates to every predicted water depth. Efficient, accurate, and trustworthy methods for urban flood management have become increasingly important due to higher rainfall intensity caused by climate change, the expansion of cities, and changes in land use. While physically based flood models can provide reliable forecasts for water depth at every location of a catchment, their high computational burden hinders their application to large urban areas at high spatial resolution. While deep learning models have been used to address this issue, a disadvantage is that they are often perceived as "black-box" models and are overconfident about their predictions, which decreases their reliability. Our deep learning model learns the underlying phenomena a priori from simulated hydrodynamic data, obviating the need for manual parameter setting for every new rainfall event at test time. The only inputs needed at test time are a rainfall forecast and parameters of the terrain, such as a digital elevation model, to predict the maximum water depth with uncertainty estimates for complete rainfall events. We validate the accuracy and generalisation capabilities of our approach through experiments on a dataset consisting of catchments within Switzerland and Portugal and 18 rainfall patterns. Our method produces flood hazard maps at 1 m resolution and achieves mean absolute errors as low as 21 cm for extreme flood cases with water above 1 m. Most importantly, we demonstrate that our approach is able to provide an uncertainty estimate for every water depth within the predicted hazard map, thus increasing the model's trustworthiness during flooding events.


Introduction
The frequency of urban floods and their impact are escalating at an unprecedented rate due to changes in land use, population growth, and climate change [1,2]. Floods in urban areas occur when natural water sources or drainage systems in cities lack the capacity to convey the excess water (runoff) caused by intense rainfall events (pluvial flooding) [3]. Managing such events is a tremendous challenge for urban pluvial flood risk management, which can involve actions ranging from systematic offline analyses of different rainfall scenarios [4] to real-time flood predictions [5]. Rapid generation of hazard maps is key [6] for supporting flood alert systems and protecting the population.
Surface water flows can be simulated with hydrodynamic models that solve equations derived from the laws of fluid dynamics. Such models can simulate point-wise flood occurrence in urban areas with different degrees of complexity, using rainfall forecasts from weather services [7] coupled with digital elevation models (DEMs) or digital surface models (DSMs). However, hydrodynamic simulations that are well resolved in space and time carry a high computational burden, which hinders their application to large areas at high spatial resolution [7]. In practice, this limits their use to small areas at a resolution of ≤ 10 m [8]. Urban areas, however, require a higher spatial resolution of 1-5 m for meaningful flood predictions because of the large number of urban structures that influence the surface flow [8,9], further constraining the applicability of hydrodynamic models at a useful scale in practice.
In contrast to the slow and computationally intensive hydrodynamic models, machine learning models provide a powerful way to obtain fast predictions at a large scale in case of flooding. They implicitly learn the underlying hydrodynamic phenomena a priori from large amounts of offline data [10][11][12]. Modern deep learning models can learn both linear and non-linear relationships in flood events directly from data [13][14][15], turning them into a powerful tool for urban flood prediction. A disadvantage is that deep learning models usually do not provide uncertainty estimates. This can jeopardise their applicability in practice: policymakers need a sense of how trustworthy the models' results are [13] to implement meaningful flood risk management and evacuation plans. It is therefore crucial that deep learning models provide well-calibrated uncertainty estimates along with accurate water depth predictions densely at high spatial resolution. Here, the term well-calibrated refers to uncertainty estimates that correspond well with the model error. In this work, we take a first step in this direction and propose a probabilistic deep learning framework to rapidly predict maximum water depth in urban flooding scenarios together with uncertainty estimates for every water depth output.
We implement a probabilistic deep learning approach, which we name Deep Flood, based on an ensemble of deep neural networks (Figure 1) that can be viewed as an approximation of Bayesian marginalisation [16,17]. This allows estimating two types of uncertainty, aleatoric and epistemic, where the former describes the stochasticity inherent to the data and the latter the uncertainty due to modelling approximations and errors, including uncertainty in estimating the parameters of the model [18]. All deep learning models have exactly the same network architecture and loss function but are initialised with different random weights for training, which allows for their interpretation as samples from an approximate posterior distribution. At inference time, we run the test samples through each model separately and compute their individual standard deviations, which are then combined into one final uncertainty estimate per water depth. We validate the proposed method on a dataset with two catchments in Switzerland and one in Portugal, for 18 different rainfall patterns. Our Deep Flood approach provides 1 m resolution water depth hazard maps and dense, well-calibrated uncertainty estimates for every predicted maximum water depth, ultimately increasing the trustworthiness of the system for downstream tasks such as risk assessment and evacuation planning.

Figure 1. High-level illustration of our proposed approach, Deep Flood, for the prediction of flood hazard maps. The input data, containing parameters of the terrain and the rainfall forecast, is fed to the deep learning models during the training phase. All models are identical but trained independently starting from random initial weights. Each model outputs a pixel-wise maximum water depth µ and scale parameter 2b. The outputs are then combined, providing the final estimate of the water depth map together with its associated predictive uncertainty.

Flood Estimation
Robust and accurate flood predictions support water resource management strategies, policy suggestions and analysis, and further evacuation modelling [19]. Hydrological models have long been used to predict water-related events, such as storms [20,21], rainfall/runoff [22,23], shallow water conditions [24], hydraulic flow [25,26], and further global circulation phenomena [27], including coupled atmosphere, ocean, and flood processes [28]. Physically based hydrological models can be categorised by the dimensionality of their flow representation. In the specific case of flood models, they can be one- (1D), two- (2D), or three-dimensional (3D): 1D models simulate the flow along a centreline and are mainly applicable in confined situations such as a channel or a pipe [8]; in 2D models [29][30][31] the flow is represented as a 2D field, under the assumption that the water depth is shallow compared with the other spatial dimensions; 3D models are required wherever vertical features are crucial, e.g., for studying dam ruptures and tsunamis [32,33]. Developing such models requires expertise and in-depth knowledge of hydrological parameters [34], in addition to their intensive computational cost. One way to speed them up is to reduce the complexity of their associated differential equations, for example by disregarding the inertial and advection terms of the momentum equation [29,30] or by decoupling the flow into orthogonal directions [7,31,35].
Another group of models for flood estimation are the so-called physically simplified approaches [36,37], which predict flood occurrence through simplified hydraulic notions. As a result, these models provide predictions at a much lower computational cost, with a small loss of accuracy [38,39]. They are suitable whenever some flow properties, for instance velocity, are not needed. Cellular-automata flood models [40] have recently received considerable attention. Instead of solving complex shallow water equations, these models carry out faster flood modelling by using simple transition rules and a weight-based system. The transition rules predict the new state (e.g., the amount of water) of a cell based on the cell's previous state and its neighbours. Since the transition is applied to all raster cells in parallel, such models can benefit from GPU parallelism, greatly reducing the simulation time [40].
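To make the weight-based idea concrete, the following toy sketch (our own illustration, not the actual CADDIES rule set) performs one transition step in which every cell routes a fraction of its water to its lower-lying 4-neighbours, split proportionally to the water-surface head difference:

```python
import numpy as np

def ca_flood_step(depth, dem, frac=0.5):
    """One illustrative cellular-automata transition: each cell sends a
    fraction `frac` of its water to downhill 4-neighbours, weighted by the
    positive head difference. Boundaries wrap (np.roll) for simplicity."""
    head = dem + depth                              # water surface elevation
    shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    # positive head differences towards each neighbour (downhill only)
    diffs = [np.clip(head - np.roll(head, s, axis=(0, 1)), 0.0, None)
             for s in shifts]
    total = sum(diffs)
    safe_total = np.where(total > 0, total, 1.0)    # avoid division by zero
    new_depth = depth.copy()
    for s, d in zip(shifts, diffs):
        w = np.where(total > 0, d / safe_total, 0.0)
        flow = frac * depth * w                     # water leaving the cell
        # subtract outflow, deposit it at the corresponding neighbour
        new_depth = new_depth - flow + np.roll(flow, (-s[0], -s[1]), axis=(0, 1))
    return new_depth
```

Because each update only inspects a cell and its neighbours, every cell can be processed independently, which is what makes this class of models amenable to GPU parallelism.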
In addition to the models described above, data-driven machine learning models have recently gained popularity in flood modelling. Various machine learning techniques, such as logistic regression [41,42], artificial neural networks [11,12], support vector machines [43][44][45], and random forests [46], have been used for urban flood risk prediction over the last decade. Notably, within the machine learning umbrella, deep learning approaches have appeared more often in recent years, since they can perform complex tasks without requiring extensive feature engineering, work well with unstructured data, and can exploit parallel and distributed algorithms, thus gaining computational efficiency. In [47], the authors used a recurrent neural network to forecast the two-step-ahead river stream flow based on rainfall measurements from several gauge stations. This method was later extended to multiple steps ahead using an expandable neural network architecture in [48]. As traditional hydrological models suffer performance degradation when calibrated for multiple basins together instead of a single basin, the authors in [49] trained a single long short-term memory model on 531 basins using meteorological time series data and static catchment attributes. Their approach not only significantly outperformed hydrological models that were calibrated regionally but also achieved better performance than hydrological models calibrated for each basin individually. For flash flood susceptibility mapping, Tien Bui et al. [50] used a deep neural network model to construct a classification boundary that could determine the susceptibility to flash floods for areas within the studied region. The model took as input the flash flood influencing factors, which were selected based on a literature review and the parameters available for flash flood modelling. Finally, Guo et al. [51] showed that grid-based fluid simulations can also be approximated accurately with a convolutional neural network (CNN), which predicted the velocity field of a steady flow from discretised input geometries.

Uncertainty Estimation in Deep Learning
The ability of deep learning to provide useful flood predictions is clear, but assessing the reliability of such predictions remains a challenge. Indeed, estimators obtained from regression in its simplest form only output a prediction without any measure of uncertainty. In this work, we aim for a richer form of regression that not only predicts the value of interest but also estimates its uncertainty. The uncertainty estimate should be calibrated, i.e., it should correspond to the "true" expected error of the predictor [52]. Trained deep neural networks should be able to provide a prediction coupled with its uncertainty, where "unsure" predictions are associated with high uncertainty. Such behaviour is especially desirable for real-world decision-making systems where downstream actions are taken based on these predictions [53]. Specifically for our flood prediction task, we expect an uncertainty value (e.g., measured in metres) associated with the predicted water depth at each coordinate of the catchment map.
In the past years, different approaches have been proposed for uncertainty quantification in deep learning. In [54], the authors provided single-model computation of aleatoric and epistemic uncertainty for deep neural networks. To estimate aleatoric uncertainty they proposed simultaneous quantile regression, a loss function to learn all the conditional quantiles of a given target variable. These quantiles were then used to compute well-calibrated prediction intervals. For epistemic uncertainty evaluation, they used orthonormal certificates, a collection of diverse non-constant functions mapping out-of-distribution examples to non-zero values, which indicated epistemic uncertainty. Another approach [55][56][57], called deterministic uncertainty quantification, was built upon the idea of radial basis function networks, consisting of a deep neural network model and a set of feature vectors. These corresponded to the different classes or centroids. A prediction was made by measuring the distance between the feature vector provided by the model and the centroids. The uncertainty was modelled by computing the distance between the model prediction and the closest centroid. A significant, rapidly increasing family of models is the Dirichlet-based uncertainty (DBU) family [58]. The benefit of DBU models is to provide efficient uncertainty estimates at test time in a single forward pass by directly predicting the parameters of a Dirichlet distribution over categorical probability distributions. DBU models provide both aleatoric and epistemic uncertainty estimates, which can be quantified from Dirichlet distributions using different uncertainty measures such as differential entropy, mutual information, or pseudo-counts.
Despite the successful examples presented above, the majority of research revolves around the Bayesian formalism for uncertainty estimation [59]. A Bayesian neural network (BNN) is a regular neural network with a prior placed on the weights and biases [18]. Using the prior and the training data, the posterior distribution over the parameters can then be computed (or approximated). The obtained distribution is then employed to compute the posterior distribution of predictions, from which a mean estimate and its uncertainty can be extracted. As exact Bayesian inference is computationally intractable for neural networks, several approximations have been proposed [59]. MacKay [60] suggested a Laplace approximation of the posterior, but with limited performance. Markov chain Monte Carlo (MCMC) methods can be used to sample the posterior distribution over the set of model parameters. The main limitation of MCMC methods is their prohibitive storage cost: as the posterior distribution is represented by samples of deep neural network parameters, thousands of samples need to be stored. Considering that modern deep neural networks have millions of parameters or more, MCMC methods are not feasible in this case. Variational inference is an alternative Bayesian method which approximates the posterior distribution by a tractable variational distribution q θ (W) indexed by a variational parameter θ. The optimal variational distribution is obtained by minimising the Kullback-Leibler divergence between q θ (W) and the true posterior p(W|D). In contrast to MCMC, variational inference methods are more time and space efficient, but the gap between the approximate posterior and the true posterior degrades the model performance [61]. To tackle the limitations of such slow and computationally expensive methods, Monte Carlo (MC) dropout [62] was introduced, using dropout [63] as a regularisation term to compute the predictive uncertainty [53].
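The MC dropout idea can be illustrated with a short generic sketch (our own, not tied to any specific model): dropout is left active at test time, so repeated stochastic forward passes act as approximate posterior samples whose spread yields a predictive uncertainty.

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout sketch: run several stochastic forward passes
    with dropout kept on, and use their mean/variance as the prediction
    and its uncertainty. Note: model.train() keeps nn.Dropout stochastic,
    but also switches BatchNorm layers to batch statistics."""
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)
```

For a model without dropout layers, the passes are identical and the estimated variance collapses to zero, which is exactly the overconfidence that stochastic regularisation is meant to counteract.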
In our approach, we approximate the posterior over the BNN with a deep ensemble. In [16,17,64], the authors showed that deep ensembles can be seen as an approximate approach to Bayesian marginalisation, which selects for functional diversity by representing multiple basins of attraction in the posterior. The authors also showed that deep ensembles can provide a better approximation of the Bayesian predictive distribution than standard approaches, and argue that they also outperform some particular approaches to BNNs. Additionally, ensembles are reported to provide better predictive performance and more reliable uncertainty estimates [65,66] and have become a standard for uncertainty estimation [67].

Datasets
In this work we use a dataset containing two different catchment areas located within Switzerland, using raster data with 1 m spatial grid sampling. The data were simulated using the CADDIES cellular-automata flood model [40] based on [68]. As mentioned earlier, transition rules are applied here through a weight-based system to determine the flow movement, avoiding the solution of complex and computationally expensive equations. The benefit of the CADDIES model is that it achieves faster computational performance compared with physically based models that solve the shallow-water equations, without a significant sacrifice of accuracy [40]. In this way, the CADDIES simulator allows us to conveniently verify our approach with relatively low computational effort. Additionally, we employed a dataset for a catchment area located in Coimbra, Portugal, again with raster data at a 1 m spatial grid resolution. This dataset was generated using the Infoworks ICM software by Innovyze [69], a physically based model that also considers pipes and surface flow in urban areas. The characteristics of the catchments are summarised in Table 1 and an illustration of the catchments is provided in Figure 2. To conduct the simulations, we selected 18 different one-hour rainfall hyetographs representing different rainfall intensities, with approximately 2, 5, 10, 20, 50, and 100 year return periods, and with different time step discretisations (5 min, 10 min, and 15 min), as shown in Figure 3. The allocation of the rainfall events to the training, validation, and test sets is given in Table 2.

Rainfall events
For both CADDIES and Infoworks ICM, the simulated data (which we consider our "ground truth") consisted of the maximum water depth for each pair of catchment and hyetograph. No post-processing was applied, in order to remain faithful to the flood simulator we aim to replicate. Predicting maximum water depth maps amounts to anticipating the worst-case scenario (i.e., a hazard map) given a catchment area and a rain forecast. For this application, providing an uncertainty estimate for each predicted cell in the area is of extreme importance for downstream tasks such as decision making in evacuation plans.
We computed 11 terrain features as input data for our deep learning model, which are concatenated as multi-channel images:
• the catchment's DEM, representing the terrain elevation;
• a spatial differential DEM (DEM diff ) comprising four channels. A DEM can be viewed as a 2D grid whose adjacent columns (c) or rows (r) can be subtracted in four directions: rightward, leftward, downward, and upward. The DEM diff was obtained using the following equations:

DEM diff,right (i, j) = DEM(i, j + 1) − DEM(i, j),
DEM diff,left (i, j) = DEM(i, j − 1) − DEM(i, j),
DEM diff,down (i, j) = DEM(i + 1, j) − DEM(i, j),
DEM diff,up (i, j) = DEM(i − 1, j) − DEM(i, j),

where i and j are the row and column index of the DEM;
• the topographic index, the logarithm of the ratio between flow accumulation and local slope. Flow accumulation is related to the upstream drainage area of each raster cell and is computed from the raw DEM using the r.terraflow module in QGIS [70]. The topographic index is commonly used in hydrology as a steady-state wetness index;
• slope, defined as the rate of change of elevation in the direction of steepest descent. It reflects the steepness of the terrain and is the means by which gravity induces the flow of water;
• aspect, the orientation of the line of steepest descent; equivalently, for a selected point on a surface, aspect is the direction in which slope is maximised [71,72];
• three curvature attributes computed from second derivatives. Curvature combines the profile curvature, which measures the rate of change of slope down a flow line, and the plan curvature, which measures the rate of change of aspect along a contour [71,73].
In addition to the features described above, we also add the rainfall intensities as input channels to the model. The duration of the rainfall events is one hour, which results in twelve channels. Note that the selection of these features was inspired by the work of [7]. Compared to [7], we found that adding DEM diff and the topographic index, as well as the complete rainfall forecast, improved performance and led to faster convergence of the model during training.
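The four-channel differential DEM can be sketched as follows (a minimal version under an assumed zero-padding at the raster edges; the exact boundary handling in our pipeline may differ):

```python
import numpy as np

def dem_diff_channels(dem):
    """Four directional difference channels of a DEM: rightward, leftward,
    downward, and upward subtraction of adjacent columns/rows.
    Edges are zero-padded so each channel keeps the DEM's shape."""
    right = np.zeros_like(dem); right[:, :-1] = dem[:, 1:] - dem[:, :-1]
    left  = np.zeros_like(dem); left[:, 1:]  = dem[:, :-1] - dem[:, 1:]
    down  = np.zeros_like(dem); down[:-1, :] = dem[1:, :] - dem[:-1, :]
    up    = np.zeros_like(dem); up[1:, :]    = dem[:-1, :] - dem[1:, :]
    return np.stack([right, left, down, up])   # shape (4, H, W)
```

The resulting stack is concatenated with the other terrain channels and the twelve rainfall channels before being fed to the network.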

Deep Learning Model
We employ a convolutional neural network (CNN) architecture to predict the maximum flood height map at every location of the DEM. Our model design is inspired by the U-Net architecture, originally presented by Ronneberger et al. [74] for biomedical segmentation applications. The architecture is depicted in Appendix A, Figure A1, and consists of a contracting (encoder) and an expansive (decoder) path designed to encode and decode the input and produce an output of the same resolution as the input. The encoder consists of the repeated application of two 3 × 3 convolutions, each followed by a rectified linear unit (ReLU) and batch normalisation, and a 2 × 2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled and the spatial dimensions are halved. Every step in the decoder path consists of an upsampling of the feature map by a 2 × 2 transpose convolution, which halves the number of feature channels. Additionally, a concatenation with the corresponding high-resolution feature map from the contracting path is performed, and a subsequent convolutional layer learns to assemble a more precise output from this information. An important feature of the U-Net architecture is that the upsampling part has a large number of feature channels, which allows the model to propagate context information to higher resolution layers. As a result, the expansive path on the right side is nearly symmetric to the contracting path on the left, which gives the architecture its U-shape. To adapt the architecture to our application, we added two heads for the regression of [µ, log(2b)], as described later in Section 4.2. Each head consists of two 1 × 1 convolutions with stride 1 with a ReLU layer in between.
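A condensed PyTorch sketch of this design is shown below (two levels instead of the full depth of Figure A1; the channel counts are illustrative, and the input channel count of 23 assumes the 11 terrain plus 12 rainfall channels described above):

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Sequential):
    """Two 3x3 convolutions, each followed by ReLU and batch normalisation."""
    def __init__(self, cin, cout):
        super().__init__(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(cout),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True), nn.BatchNorm2d(cout),
        )

class TwoHeadUNet(nn.Module):
    """Minimal two-level U-Net with two 1x1-conv heads for [mu, log(2b)]."""
    def __init__(self, in_ch=23, base=32):
        super().__init__()
        self.enc1 = DoubleConv(in_ch, base)
        self.enc2 = DoubleConv(base, base * 2)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = DoubleConv(base * 2, base)     # input doubled by skip concat

        def head():                                 # two 1x1 convs with ReLU between
            return nn.Sequential(nn.Conv2d(base, base, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(base, 1, 1))
        self.mu_head, self.log2b_head = head(), head()

    def forward(self, x):
        s1 = self.enc1(x)                           # full resolution features
        s2 = self.enc2(self.pool(s1))               # half resolution features
        d1 = self.dec1(torch.cat([self.up(s2), s1], dim=1))  # upsample + skip
        return self.mu_head(d1), self.log2b_head(d1)
```

The two heads share the decoder features, so the water depth µ and the (log-)scale of its Laplace distribution are predicted jointly per pixel.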

Epistemic Uncertainty
Epistemic uncertainty is modelled by placing a prior distribution over a model's weights and then expressing how those weights vary given some data [18]. In regression we predict a single continuous target variable y from a given dataset of inputs X = {x 1 , x 2 , x 3 , ..., x N } with labels Y = {y 1 , y 2 , y 3 , ..., y N }. We denote the model as f W , with output f W (x). We would like to find the parameters W of a function f W (x) that is likely to have generated our outputs. Taking inspiration from [18], in this work we place a Laplacian prior distribution on the model weights. This distribution represents our prior belief as to which parameters are likely to have generated our data before we observe any input of the dataset. After we observe some data, the prior distribution is transformed to capture the more or less likely parameters. For this, we further need to define a likelihood distribution, for which we again choose a Laplacian. We then look for the posterior distribution that captures the set of plausible model parameters given the dataset X, Y by invoking Bayes' theorem:

p(W | X, Y) = p(Y | X, W) p(W) / p(Y | X). (1)

A key component in evaluating this posterior is the normalisation constant p(Y | X), the so-called model evidence, which can be obtained by integrating out the parameter distribution. Note that this is equivalent to computing the posterior distribution over the labels, i.e., performing inference for a new input x* sampled from the same distribution:

p(y* | x*, X, Y) = ∫ p(y* | x*, W) p(W | X, Y) dW. (2)

The marginalisation can be resolved analytically for simple models such as Bayesian linear regression, where the prior is conjugate to the likelihood. However, for complex architectures such as deep models, this marginalisation cannot be performed analytically. In such cases approximate techniques are required [75], as briefly described in Section 2.2. In our approach, we approximate the posterior over the BNN with an ensemble.
Wilson and Izmailov [16] have shown that deep ensembles can be seen as a proxy for Bayesian marginalisation, which selects for functional diversity by representing multiple basins of attraction in the posterior. Using deep ensembles comprises maximum a posteriori (MAP) training of the same architecture many times, starting from different random initialisations so that each run finds a different local optimum. Using these models in an ensemble as an approximate Bayesian model is therefore another way to generate a distribution over models, which can be employed to estimate the variability of the learning procedure. This corresponds to estimating an essential component of epistemic uncertainty. Instead of using a single point mass to approximate our posterior, as with classical training, we now use multiple point masses, enabling a better approximation of the integral in Equation (2) that we are trying to solve [64]. Finally, the variance computed over the members' respective predictions provides an estimate of the epistemic uncertainty.

Aleatoric Uncertainty
We can subcategorise aleatoric uncertainty into homoscedastic and heteroscedastic uncertainty. Homoscedastic uncertainty remains constant for different inputs, while heteroscedastic uncertainty depends on the inputs provided to the model and is learned through the training process [18,76,77]. Heteroscedastic uncertainty is notably meaningful for computer vision applications [18] and is obtained by training the model to estimate a probability density function, deriving the loss function for a Laplace likelihood, as explained below. The probability density function of the Laplace distribution is given by:

p(x | µ, b) = (1 / (2b)) exp(−|x − µ| / b),

where µ is a location parameter and b > 0 is a scale parameter, corresponding to the mean absolute deviation from the mean. The latter indicates how much the distribution is spread out. The negative log-likelihood of µ i , b i given a sample x i for the Laplace distribution therefore becomes:

−log p(x i | µ i , b i ) = log(2b i ) + |x i − µ i | / b i .

Note that a Gaussian likelihood could also be used in the above formulation. However, the loss resulting from a Laplace likelihood is more robust than a Gaussian one when the errors are heavy-tailed rather than Gaussian (see [18,78]). Therefore, in this work, we adapt the standard Gaussian negative log-likelihood loss and formulate our loss function using the Laplacian likelihood:

L NLL (µ 1 , ..., µ N , b̂ 1 , ..., b̂ N ) = (1/N) Σ i=1..N [ log(2b̂ i ) + |µ i − y i | / b̂ i ], (3)

where N is the number of output pixels of the input image indexed by i, µ i and b̂ i are the model outputs at location i, and y i is the ground truth value. Given this formulation of the loss function, labels for the uncertainty are not needed to learn it: the regression task is learned and the scale parameter b̂ i is implicitly learned from the loss function. The loss function is composed of two terms, log(2b̂ i ) and a residual term weighted by 1/b̂ i .
The first term discourages the model from predicting high uncertainty for all data, as a large uncertainty value increases the contribution of this term and thus the loss. Conversely, the model could learn to neglect the data and predict a very low uncertainty value for data points with a high error |µ i − y i |, but, by doing so, the denominator b̂ i would amplify the contribution of the residual and penalise the model [18]. Our approach is summarised in Figure 4. The model takes the input data and provides two outputs [µ, log(2b)] = f Ŵ (x). Note that converting the second output with an exponential function guarantees that the estimated b̂ is positive. For numerical stability, we also add a small number ε = 10⁻⁷ to the estimated 2b̂ value during training to avoid division by zero. Lastly, since the Laplace distribution has mean µ and variance 2b², the variance can be easily computed from the model output and is used for the final estimate of the predictive uncertainty.
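The loss in Equation (3) can be sketched in PyTorch as follows (a minimal version with our own variable names; the network is assumed to output log(2b)):

```python
import torch

def laplace_nll(mu, log_2b, y, eps=1e-7):
    """Pixel-wise Laplacian negative log-likelihood:
    log(2b_i) + |mu_i - y_i| / b_i, averaged over the N output pixels.
    Exponentiating the predicted log(2b) keeps b positive; eps guards
    the division against b -> 0."""
    two_b = torch.exp(log_2b) + eps
    b = two_b / 2.0
    return (torch.log(two_b) + torch.abs(mu - y) / b).mean()
```

Note the trade-off described above: inflating `two_b` grows the log term, while shrinking it amplifies the residual term, so the scale is driven towards the actual error magnitude.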

Combining Aleatoric and Epistemic Uncertainty
We calculate the predictive uncertainty of the deep ensemble by training M different models and considering them as random samples from the distribution of models. This is performed by applying a different random initialisation of the Laplacian prior weights at the start of each training run. The deep ensemble is treated as a uniformly weighted mixture of Laplacian distributions [79] and the predictions are combined using the following formula:

ȳ = (1/M) Σ m=1..M µ m ,

while the total predictive uncertainty is approximated as:

σ̂² = (1/M) Σ m=1..M (µ m − ȳ)² + (1/M) Σ m=1..M 2b m ²,

for the outputs [µ m , log(2b m )] of model m ∈ {1, ..., M}. In the above equation, the first term represents the epistemic uncertainty and the second the aleatoric uncertainty.
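The combination step can be sketched as follows (our own variable names; the inputs are per-pixel tensors from the M ensemble members):

```python
import torch

def combine_ensemble(mus, log_2bs):
    """Uniformly weighted mixture of M Laplacians: mean prediction plus a
    total predictive variance = epistemic (spread of the member means)
    + aleatoric (mean of the Laplace variances 2*b_m^2)."""
    mus = torch.stack(mus)                       # (M, ...) member means
    b = torch.exp(torch.stack(log_2bs)) / 2.0    # scale parameters b_m
    y_bar = mus.mean(dim=0)
    epistemic = ((mus - y_bar) ** 2).mean(dim=0)
    aleatoric = (2.0 * b ** 2).mean(dim=0)       # Laplace variance is 2b^2
    return y_bar, epistemic + aleatoric
```

When all members agree, the epistemic term vanishes and the predictive uncertainty reduces to the average aleatoric variance, matching the decomposition described above.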

Model Training
Different rainfall events were allocated for training, validation, and testing of our approach, as shown in Table 2. As is common practice in machine learning applications, the validation set is not employed in the optimisation process but for hyper-parameter tuning and loss monitoring on unseen data; the models were only evaluated on the test sets. The input data fed to the model is described in Section 3. The output consists of hazard maps across the DEM, corresponding to the maximum water depth that can occur within the forecasted rainfall period. Additionally, we output the scale factor of the Laplacian distribution, from which we then compute the predictive uncertainty. Table 2. Allocation of rainfall events for training, validation, and test sets.

Dataset          Rainfall Events
Training Set     tr5_1, tr20_1, tr50_1, tr2_2, tr10_2, tr20_2, tr50_2, tr5_3, tr10_3, tr100_3
Validation Set   tr100_2, tr2_3
Test Set         tr2_1, tr10_1, tr100_1, tr5_2, tr20_3, tr50_3

As the catchment size is too large to be fed to each model, we generate catchment patches. At each training iteration, a random patch of size 256 × 256 with a batch size of 8 samples is used, which are common choices in deep learning applications. We train each of the five individual models of the ensemble with its weights initialised from a Laplace prior distribution; in this way, each optimisation process starts from a different point, potentially resulting in a diverse final set of weights and performance characteristics. We then train each U-Net model in the ensemble using the Laplacian negative log-likelihood loss function of Equation (3) for 500 epochs. We use the Adam optimisation algorithm and train the model using back-propagation, which allows the information from the objective function to flow backwards through the model in order to compute the gradient [80]. To update the model weights in the negative gradient direction, we start with a learning rate of 0.001 and decrease it by a factor of ten at different epochs. These values were chosen empirically during the training and validation of our models, following standard hyperparameter settings in deep learning applications.

Hazard Map
Recall that Deep Flood is an ensemble of five independently trained U-Net models with identical architecture, as described in Section 4.3. The output of each individual U-Net in the ensemble is used to compute the final prediction and per-model uncertainty. For all experiments, we report the mean absolute error (MAE) as a performance metric, which quantifies the deviation between the predicted water depth and the ground truth.
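The MAE restricted to different water-depth ranges, as reported below, can be computed with a short helper. This is a sketch, assuming predicted and ground-truth depth maps are stored as NumPy arrays in centimetres (the function and argument names are ours):

```python
import numpy as np

def mae_above_threshold(pred_cm, truth_cm, threshold_cm=0.0):
    """MAE restricted to pixels whose ground-truth depth exceeds a threshold."""
    mask = truth_cm > threshold_cm
    if not mask.any():
        return float("nan")  # no pixels in this depth range
    return float(np.mean(np.abs(pred_cm[mask] - truth_cm[mask])))
```

Calling it with thresholds 0, 10, 20, 50, and 100 cm reproduces the column structure of Table 3 for a given catchment and event.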
We first evaluate the predicted hazard maps and consider the distribution of MAEs for the studied catchments at different ranges of water depth, i.e., all pixels and pixels with water depth > 10, > 20, > 50, and > 100 cm, in Table 3. Here, the MAEs are averaged over all the rainfall events in the test set. From Table 3, we can observe that at test time the Zurich and Lucerne catchments generally perform worse than the Portugal one, suggesting that the deep learning model learns the flood behaviour more effectively when presented with a complex, physically based dataset. For the Portugal dataset in particular, the MAE values for elements > 50 and > 100 cm are especially low, which makes these results attractive for hazard prediction, where deep water corresponds to a higher risk for the population. In Figure 5, we qualitatively illustrate the results by plotting the ground truth water depth against the predictions. Note that, while the results in Table 3 were combined across all rainfall events, we now focus only on the tr2_1 rainfall event in the test set to avoid excessive averaging. Here, the results for the rainfall event are divided into bins and then plotted. The shade of each bin represents the number of pixels falling into it, and bins containing fewer than ten elements are ignored. The black diagonal line represents the ideal prediction, where the trained model predicts exactly the ground truth water depth value: the more the plot diverges from the diagonal, the lower the accuracy. The plots agree with the results presented in Table 3. We notice from Figure 5b that the Lucerne catchment presents a wide pixel distribution along the upper side of the diagonal, which suggests that the prediction errors for this catchment are caused by an underestimation of the predicted water depth.
This effect is opposite (and reduced) for the Zurich terrain (Figure 5a), where the pixels are much less spread out. The Portugal catchment, on the other hand, presents a plot closer to the diagonal, indicating improved accuracy compared to the other catchments.
To further analyse the results of this rainfall event, we divide the pixels into five categories based on their ground truth water depth and show the error as a box plot in Figure 6, where the x-axis represents the binned water depth categories and the y-axis the associated absolute error. As we move from shallow to deeper areas, the absolute error increases. Specifically, the plot shows that most of the errors are in fact very low (from a few cm up to a median below 20 cm for high water depths). A few points show very high errors, especially for deep water. This can be explained by the behaviour of the U-Net model which, being based on convolution operations, tends to smooth the prediction where pixels present sudden, small changes in water depth (possibly caused by artefacts in the DEM). This is discussed in the next paragraph. In Figure 7a, we qualitatively show the complete reconstruction of the hazard map for the Portugal catchment. Reconstructions for the Zurich and Lucerne catchments are shown in Appendix A, Figures A2 and A3. The red bounding box depicts the patch area shown enlarged in Figure 7b. We can observe that the prediction clearly follows the ground truth water depth pattern. However, some edges of the prediction are smoothed out compared to the ground truth, especially where it presents sharp water depth changes. This might explain the presence of the outliers discussed earlier and is highlighted within the red box in Figure 7b (left and centre). Additionally, we highlight within the yellow box in Figure 7b (centre) an area of the reconstruction which presents small artefacts (accounting for the few-cm errors), most probably caused by the Slope feature (Figure 7b, right). These errors were already visible in Figure 5c, in the area of the plot close to the axes' origin.
Nevertheless, we found that the inclusion of the Slope feature still improves the results, especially for high water depths, which is why we decided to keep this feature for training the model.
(a) On the left, the ground truth maximum water depth for the Portugal catchment of rainfall event tr2_1. In the middle, the complete predicted reconstruction performed by our Deep Flood. On the right, the residual of ground truth and predicted reconstruction for the same area.
(b) On the left, one enlarged area of the ground truth. In the centre, the same area predicted by our Deep Flood. On the right, the Slope feature for the same area. Figure 7. Ground truth maximum water depth and prediction for the Portugal catchment and rainfall event tr2_1, with a bounding box in red depicting the patch area shown in (b). In the bottom figure, we highlight in red the areas where the edges around water basins are smoothed out and, in yellow, an area of the reconstruction which presents small artefacts.

Uncertainty Evaluation
Evaluating predictive uncertainties is a difficult task, as ground-truth or reference uncertainty estimates are unavailable. One approach to analysing the quality of our predictive uncertainty estimates is to use calibration plots, which measure the discrepancy between subjective forecasts and empirical long-run frequencies [59]. To form the calibration plot for our ensemble, we first take the mean value for each data point in our test set from the five trained U-Nets (Equation (4)). We then discretise the computed predictive uncertainty values of Equation (5) into bins and plot the mean estimated value of residual water depth for each bin. The closer the resulting plot is to the identity line, the better the uncertainties are calibrated. In fact, since regression tasks have continuous output values, the predictive uncertainty is expected to follow the residuals: a well-calibrated predictive uncertainty means that we can safely decide, based on the uncertainty, whether or not to trust a prediction. Figure 8a shows the calibration of our approach. A second way to analyse the uncertainty is with precision-recall curves [18]. These curves depict how the model's performance improves when data points with an uncertainty above various percentile thresholds are discarded. To create a precision-recall curve, we sort the test samples by their predictive uncertainties and then filter out the samples with the highest uncertainties, thereby reducing the overall error on the test set. We can observe this in Figure 8b,d,f, where the precision (MAE) quickly decreases as the recall decreases, then flattens and becomes stable. This shows that the predictive uncertainty measurements correlate well with the MAE. For instance, for a recall between 1.0 and 0.8 in Figure 8b, the MAE decreases from 2.5 cm to around 0.5 cm.
Since the uncertainty correlates with data points with high residuals, removing such points produces a sharp drop in the MAE. In Figure A4 in Appendix A, we present the same results for another rainfall event, tr100_1.
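The sparsification procedure behind these precision-recall curves can be sketched as follows; a minimal NumPy version, with hypothetical array names:

```python
import numpy as np

def precision_recall_curve(abs_err, uncertainty, recalls):
    """MAE after discarding the most uncertain fraction of test pixels.

    abs_err     : per-pixel absolute errors |prediction - ground truth|
    uncertainty : per-pixel predictive uncertainties
    recalls     : fractions of the data kept, e.g. [1.0, 0.9, ..., 0.1]
    """
    order = np.argsort(uncertainty)      # most certain pixels first
    sorted_err = abs_err[order]
    n = len(sorted_err)
    return [float(np.mean(sorted_err[: max(1, int(r * n))])) for r in recalls]
```

If the uncertainty correlates with the error, the returned MAE values decrease as the recall decreases, which is the behaviour observed in Figure 8b,d,f.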
Finally, in Figure 9 we show the qualitative results for our predictive uncertainty by plotting the epistemic, aleatoric, and predictive uncertainties together with the residuals in a specific area of the Zurich, Lucerne, and Portugal catchments for the tr2_1 rainfall event. We observe that the epistemic, aleatoric, and combined predictive uncertainties follow the pattern of the residuals. For higher residual values, we also observe higher values of uncertainty. Additionally, the epistemic uncertainty is much sharper around the edges, while the aleatoric uncertainty is fuzzier. This behaviour is in accordance with the results for such uncertainties presented in [18]. Figure 9. Qualitative evaluation of the epistemic, aleatoric, and total predictive uncertainty with respect to the residual error in different catchment areas for the tr2_1 rainfall event. We can observe that both uncertainties follow the pattern of the residuals, with epistemic uncertainty being slightly higher around the edges of high water depth areas.
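Under the Laplace parameterisation, the decomposition of the predictive uncertainty into its epistemic and aleatoric parts can be sketched as follows. This is a minimal NumPy version in the spirit of Equations (4) and (5), assuming each ensemble member outputs a per-pixel location `mu` and scale `b` (the variance of a Laplace(mu, b) distribution is 2b²); names and array shapes are ours:

```python
import numpy as np

def ensemble_uncertainty(mus, bs):
    """Split the predictive variance of a Laplace deep ensemble.

    mus, bs : arrays of shape (M, H, W) holding the per-member
              locations and scale factors of an M-model ensemble.
    Returns (mean prediction, aleatoric, epistemic, total predictive variance).
    """
    mean_pred = mus.mean(axis=0)
    aleatoric = (2.0 * bs ** 2).mean(axis=0)   # average member noise variance
    epistemic = mus.var(axis=0)                # disagreement between members
    return mean_pred, aleatoric, epistemic, aleatoric + epistemic
```

The sharp-edged maps in Figure 9 correspond to the member-disagreement term, while the fuzzier maps correspond to the averaged per-member noise term.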

Comparison with previous works
In the last decades, a large body of literature has been dedicated to flood prediction and management strategies. Emerging advances in computing technologies and big data are among the key elements that have favoured data-driven approaches such as machine learning and deep learning, which offer flexibility and scalability. Some recent works have applied deep learning to flood forecasting and prediction [14,81]. Song [81] presents a CNN model for daily runoff simulation, together with a method for creating two extra features representing the topographical and rainfall conditions. Analogously, Löwe et al. [14] trained a CNN to predict maximum water depth maps in urban pluvial flood events using hydrodynamic flood simulations as ground truth data. Another similar work is [7], where the generalisation capabilities of CNNs as flood prediction models are explored on the same dataset used in this work. In contrast to these studies, we present here a new framework that provides well-calibrated uncertainty estimates densely, at the same spatial resolution as the model's maximum water depth predictions.
Recently, in [13], deep learning techniques were used for predicting gauge height as well as for evaluating the associated uncertainty. The authors show that their deep learning model was more accurate than the physical and statistical models currently in use. To estimate the uncertainty of the predicted outputs, the forecasting process was repeated several times and the results averaged. Our approach differs in that, in that study, the authors did not report point-wise water depth estimates or generalisation capabilities across different rainfall events. To the best of our knowledge, our work is the first data-driven approach that jointly provides a two-dimensional flood occurrence prediction at high spatial resolution together with well-calibrated uncertainty estimates.

Advantages of the proposed approach
In this work, we proposed a probabilistic deep learning approach to predict urban flood occurrence at high spatial resolution. Compared to physically based flood models in particular, a major advantage of our data-driven approach is that it provides predictions much faster than the simulator. While physically based flood simulations for very small catchments can still be computed efficiently, we expect the computational load to rise drastically for larger urban catchments, making real-time flood prediction for large and complex scenarios almost infeasible [7]. Additionally, we have shown that our probabilistic deep learning approach provides a well-calibrated uncertainty per water depth prediction. This value indicates how "unsure" the model is about each prediction. We believe these findings advocate for deep learning models as a promising alternative to physics-based simulations for flood prediction in practice: the more reliable and trustworthy such systems are, the better they can support meaningful flood risk management and evacuation plans.

Predictive uncertainty estimation using deep ensembles
There is an ongoing debate in the literature on whether to place deep ensembles within the family of Bayesian approaches or rather to view them as an alternative outside the Bayesian framework. In this work, we have followed the method proposed by [18], who view deep ensembles as an approximation of Bayesian marginalisation, therefore placing them within the family of Bayesian methods. This interpretation is supported by [16], who argue that deep ensembles are effective at providing approximate Bayesian marginalisation, as well as by [17], who explicitly place deep ensembles within the framework of Bayesian deep learning. Likewise, the authors of [64] state that deep ensembles should not be seen as an alternative approach outside the family of Bayesian methods, but as approximate Bayesian marginalisation. Finally, Izmailov et al. [82] argue that deep ensembles provide a better Bayesian approximation than standard approximate inference procedures. On the other hand, Lakshminarayanan et al. [59] describe deep ensembles as an alternative to Bayesian neural networks, thus placing them outside the Bayesian framework. Ovadia et al. [83] include deep ensembles among the non-Bayesian methods, together with bootstrapping, despite showing how well ensembles perform empirically [84]. Overall, we found that, despite the ongoing debate on where to theoretically place deep ensembles with respect to the Bayesian framework, they work well empirically for calibrating uncertainties, as demonstrated by our experimental evaluation.

Limitations of the proposed approach and future works
In Figure 6 we have shown the box plots for the predicted water depth values and their associated absolute errors. While most of the errors are very low, it can be noted that some outliers go up to 150 cm. This is explained by the convolution operations in our deep learning model which tend to smooth the predictions when the pixels present sudden changes in the water depth. This drawback can negatively alter the prediction for such important and dangerous sharp changes and could be addressed in future works by testing different types of deep learning architectures.
Another strategy worth investigating for reducing severe outliers would be adding physical constraints in the design and training of the deep learning architecture. For example, hydraulic laws could be incorporated in the loss of the model such that training in an end-to-end fashion would simultaneously constrain the gradient flow to physically plausible values and improve the quality of the predictions.
An interesting research question for future work would be to include the intrinsic uncertainty of the rainfall forecast in the predictions, i.e., to determine how such uncertainty propagates to the outputs of the model, as well as to test the results on real-world data.

Conclusions
We have presented Deep Flood, a new probabilistic deep learning approach based on deep ensembles for dense flood prediction at high spatial resolution. Deep Flood predicts accurate water depths and assigns well-calibrated uncertainty estimates to every predicted water depth. Our approach uses terrain features of a catchment and a rainfall forecast as input to provide a water depth map. Through experiments over three catchments and eighteen rainfall patterns, we have shown that the deep learning model is able to accurately predict maximum water depth maps over full rainfall events at 1 m resolution in large catchment areas. Importantly, our approach also generalises well to new rainfall events, with mean absolute errors as low as 21 cm for the extreme, catastrophic cases where flooding is above 1 m. For the first time in flood prediction, our approach also returns pixel-wise, well-calibrated uncertainty estimates for every predicted water depth. Quantifying the predictive uncertainty increases the trustworthiness of the system, which translates into better downstream tasks such as risk assessment and evacuation plans.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
An overview of the network architecture used for the experiments in this paper is shown in Figure A1, and more visualisations of results are provided below. Figure A1. Illustration of the U-Net architecture as used in this paper.
(a) On the left, the ground truth maximum water depth for the Zurich catchment of rainfall event tr2_1. In the middle, the complete predicted reconstruction performed by our Deep Flood. On the right, the residual of ground truth and predicted reconstruction for the same area.
(b) On the left, one enlarged area of the ground truth. In the middle, the same area predicted by our Deep Flood. On the right, the residual of ground truth and predicted reconstruction for the same area. Figure A2. Ground truth maximum water depth and prediction for the Zurich catchment and rainfall event tr2_1, with a bounding box depicting the patch area shown in (b).
(a) On the left, the ground truth maximum water depth for the Lucerne catchment of rainfall event tr2_1. In the middle, the complete predicted reconstruction performed by our Deep Flood. On the right, the residual of ground truth and predicted reconstruction for the same area.
(b) On the left, one enlarged area of the ground truth. In the middle, the same area predicted by our Deep Flood. On the right, the residual of ground truth and predicted reconstruction for the same area. Figure A3. Ground truth maximum water depth and prediction for the Lucerne catchment and rainfall event tr2_1, with a bounding box depicting the patch area shown in (b).