A Spatiotemporal Deep Learning Approach for Urban Pluvial Flood Forecasting with Multi-Source Data

Abstract: This study presents a deep-learning-based forecast model for spatial and temporal prediction of pluvial flooding. The developed model can produce the flooding situation for the upcoming time steps as a sequence of flooding maps. Thus, a dynamic overview of the forthcoming flooding situation is generated to support the decision of crisis management actors. The influence of different input data, data formats, and model setups on the prediction results was investigated. Data from multiple sources were considered as follows: precipitation information, spatial information, and an overflow forecast. In addition, models with different layers and network architectures such as convolutional layers, graph convolutional layers, or generative adversarial networks (GANs) were considered and evaluated. The data required to train and test the models were generated using a coupled hydrodynamic 1D/2D model. The model setup with the inclusion of all available input variables and an architecture with graph convolutional layers presented, in general, the best results in terms of root mean square error (RMSE) and critical success index (CSI). The prediction results of the final model showed a high agreement with the simulation results of the hydrodynamic model, with drastic reductions in computation time, making this model suitable for integration into an early warning system for pluvial flooding.


Introduction
Pluvial flooding caused by heavy rainfall poses a high safety risk; for highly sealed urban areas in particular, precipitation becomes almost exclusively runoff. According to the sixth report of the Intergovernmental Panel on Climate Change (IPCC) [1], the number and intensity of heavy rainfall events have increased in recent years. This trend will likely persist because of global warming. In combination with the ongoing urbanization of cities, an increase in the frequency of pluvial flood events and in the resulting risk is expected in the future [2]. Since pluvial flooding, compared to fluvial flooding, can theoretically occur anytime and anywhere in urban catchments, comprehensive protection is not possible from a technical and economic point of view. Thus, early warning systems are essential to enable proactive protection in an incident. In addition, actors in municipal crisis management, as potential users of real-time warning systems, have high expectations of the reliability of the warning alerts. Spatially and temporally precise predictions are required for efficient action in a crisis and to avoid wrong decisions as far as possible. In summary, there are two competing requirements listed by Zhao et al. [3] for predictive models used in real-time warning systems:

• High temporal and spatial resolution of flood forecasts;
• Sufficient lead time between prediction and event occurrence.

Water 2023, 15, 1760
Hydrodynamic (HD) computational models are a widely used tool for spatiotemporal high-resolution modeling of pluvial flood events. Initially, the main focus was on modeling the sewer network, but in recent years modeling the surface flooding process has become increasingly essential, and coupled modeling of both systems has become state of the art. The outputs are high-resolution 2D water level maps showing the flood hazard. Various studies on the validation of simulation results using images from social networks [4,5], surveillance camera footage [6], or reported insurance claims [7] have shown the excellent quality of these models. However, this quality is accompanied by high computational costs [8], which means that computational times can quickly reach several hours to days for a single event, depending on the study area size. Compared to flash floods in natural watersheds, which are triggered by advective precipitation events and can be predicted with numerical weather models for multiple hours or several days [9], pluvial flash floods are usually caused by convective precipitation cells. With current nowcasting models, these events can only be predicted with lead times of up to two hours [10–12]. Due to the short lead times, hydrodynamic models are presently unsuitable for real-time usage because of their long computation times. Therefore, the field of application is limited to the simulation of individual scenarios to identify general flooding hazards.
To meet the second requirement of sufficient lead time, many approaches to developing real-time warning systems for pluvial flooding focus on minimizing the computation times of hydrodynamic computational models as far as possible. For this purpose, the level of detail of the models can be reduced by considering only hotspots or reducing the resolution of the computational network [13]. Both approaches were combined by Hofmann and Schüttrumpf [14] for application in a real-time warning system. Other methods focus on increasing computational speed through parallel data processing [15,16] or simplifying computational operations [3,17,18]. However, reducing the level of detail or simplifying computational processes is accompanied by reduced accuracy of the computational results.
Other investigated approaches to reducing computation times use data-driven models to estimate flood extents. Most of these approaches are based on machine learning and often on deep learning. Deep learning in particular has recently achieved great success in the field of image processing [19–22]. The developed methods are applied in more and more areas such as earth system sciences [23] or water resources management [24]. In terms of flood modeling, Mosavi et al. [25] provided a general overview of existing machine learning models and Bentivoglio et al. [26] reviewed deep learning models. These models are usually trained with the results of hydrodynamic models to achieve similar results in a fraction of the time. Some of the investigated approaches differ significantly concerning the considered methods. For example, Bermúdez et al. [27] used an artificial neural network (ANN) to predict the maximum overflow volumes in subcatchments for a precipitation event. Depending on the predicted overflow volumes, a suitable flooding situation is selected from a result catalog of pre-simulated events. A similar approach was taken by Jhong et al. [28] and Lin et al. [29] with a support vector regression (SVR)-based model. They first used an SVR model to predict the water level hydrograph on the ground surface at various reference points in the study area. Subsequently, an SVM model was used to determine the inundation areas depending on the predicted water level at these points in combination with geographic information.
Bermúdez et al. [30] also developed an SVR model to predict floodplains. However, the prediction of water levels is performed directly for 25,000 points instead of only for a few reference points so that spatial information about the flooding event is instantly available. Berkhahn et al. [31] applied the same approach with an ensemble of fully connected multilayer perceptrons. However, a disadvantage of fully connected networks is the high number of weights (parameters) to be trained and the associated high computational and memory requirements. This differs from convolutional neural networks (CNNs), which share the weight matrix in space, significantly reducing the number of weights [32]. Guo et al. [33] used CNNs in an autoencoder architecture and trained a model to compute the flooding situation for entire cities. Hofmann and Schüttrumpf [34] also adopted CNNs but organized them in a generative adversarial network (GAN). To enable the transferability of the trained models to other areas, Guo et al. [33] and Löwe et al. [35] utilized spatial information as an additional input variable. Seleem et al. [36] also took advantage of spatial inputs and highlighted the good performance of a CNN architecture over a random forest algorithm regarding transferability. In addition, do Lago et al. [37] used spatial information in combination with a GAN to distribute the rainfall-runoff volume determined by a hydrologic model in a study area. Moreover, deep learning models have shown promising results in similar tasks such as flood sensitivity modeling [38–40] or fluvial flooding prediction [41,42].
In this work, a deep learning model for temporal and spatial prediction of pluvial flooding is developed for integration into an early warning system. The target variable represents a sequence of flood grids with a 2 m × 2 m resolution for the forecast horizon. Three potential variables are considered as inputs whose effect on the prediction quality is investigated as follows: (i) the observed as well as the predicted precipitation; (ii) spatial information on terrain properties and degree of pavement; (iii) the predicted overflow hydrographs for the forecast horizon. The investigations aim to check which inputs are required and how they must be provided to the model. The main contributions of this study can be summarized as follows:

1. Development of a prediction model for pluvial flooding based on deep learning that can predict the spatial and temporal evolution of the flooding situation. In contrast to other studies investigating the use of deep learning to predict pluvial flooding [31,33–35], the model output is a flooding sequence for the upcoming time steps instead of the maximum water levels. The chosen model design also allows predictions to be generated at any point in an event and is not limited to specific durations of an event. The accuracy of the results is expected to be as close as possible to that of physically based models, with drastically reduced computation times at the same time.

2. Compared to existing studies on predicting pluvial flash floods using deep learning approaches [31,33–35], the sewer network is considered as an extra retention volume here. To achieve this, an event-specific overflow forecast is taken as an additional input variable informing whether the sewer network is overloaded or not. In subsequent operational use, this input can be provided either by hydrodynamic sewer network models or data-driven models.

3. Different model setups are evaluated. This refers to the considered model inputs and, in the case of the overflow forecast, to the data format and the model architecture that depends on it. Furthermore, different modern deep learning architectures such as encoder-decoder networks, graph neural networks, or generative adversarial networks are combined and compared with each other in the investigations.

Modeling Concept
The overall model structure is shown in Figure 1. The model aims to calculate the upcoming flooding situation for the next time steps starting from a time of observation. For this purpose, different precipitation information, spatial information, and an overflow forecast are available as potential input. Together with the predicted inundation areas, three different data formats are thus considered as follows:

• 1D time series (precipitation information and overflow forecast): These are time series whose values vary along the temporal axis but are assumed to be constant over the spatial extent of the study area (precipitation) or correspond only to a single spatial unit in the study area (overflow).
• 2D raster (spatial information): These are raster data sets whose values vary across the spatial extent of the study area but are assumed to be constant over time.
• 3D raster sequence (predicted inundation areas): These are grid sequences with the same format as video sequences. The values vary both spatially and temporally.
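The three data formats listed above can be pictured as nested arrays of increasing dimensionality. The following is a minimal sketch using plain Python lists; the toy dimensions and the variable names (`precipitation`, `imperviousness`, `flood_sequence`) are assumptions chosen for illustration and do not reflect the grid size or data handling used in the study.

```python
# Hypothetical toy dimensions, far smaller than the study's 2 m x 2 m grid.
T, ROWS, COLS = 12, 3, 4

# 1D time series: one value per time step, shared by the whole study area.
precipitation = [0.0] * T

# 2D raster: one value per grid cell, constant over time.
imperviousness = [[0.0] * COLS for _ in range(ROWS)]

# 3D raster sequence: one 2D grid per time step, like frames of a video.
flood_sequence = [[[0.0] * COLS for _ in range(ROWS)] for _ in range(T)]


def shape(data):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return tuple(dims)
```

Calling `shape` on the three objects gives a one-, two-, and three-element tuple, respectively, which mirrors the temporal, spatial, and spatiotemporal axes described in the list.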

The consideration of different data types is a special requirement for development of the machine learning (ML) model, which has to be capable of processing them together. This condition severely limits the number of suitable ML methods. Furthermore, the methods must be able to efficiently process image data (2D raster) and especially video data (3D raster sequence) and recognize structures within them. For this reason, the focus here is on artificial neural networks, which have proven to be particularly efficient in similar problems such as precipitation nowcasting (e.g., [43–45]) or various traffic forecasting tasks (e.g., [46–48]). Figure 1 presents the proposed model setup with the potential inputs and the predicted flooding situation as the target.

Considered Layers and Network Architectures

Fully Connected Layers
In artificial neural networks with fully connected layers, all neurons of one layer are fully connected to the neurons of the following layer. A widely used network architecture with fully connected layers is the multilayer perceptron (MLP), based on the perceptron developed by Rosenblatt [49]. This is a mathematical model for information processing that receives input, weights it, sums it up, and passes it on, according to an activation function. MLPs consist of multiple perceptrons organized in fully connected layers. The network output depends on the weights between the layers. These must be adjusted in a training process using an optimization algorithm to minimize the error between outputs and targets. MLPs represent a simple and widely used network architecture. In addition, individual layers are often used as part of more complex architectures, as practiced in this work.
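The "weight, sum, and pass through an activation function" step can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used in the study; the sigmoid activation, the function name `dense_layer`, and the hand-picked weights are assumptions.

```python
import math


def sigmoid(x):
    """Logistic activation function (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense_layer(inputs, weights, biases, activation=sigmoid):
    """Fully connected layer: every input feeds every output neuron.

    weights[k] holds the weights of output neuron k, one per input.
    """
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Hypothetical 3-2-1 multilayer perceptron with hand-picked weights.
hidden = dense_layer(
    [1.0, 0.5, -1.0],
    weights=[[0.2, -0.4, 0.1], [0.5, 0.3, -0.2]],
    biases=[0.0, 0.1],
)
output = dense_layer(hidden, weights=[[1.0, -1.0]], biases=[0.0])
```

In training, an optimizer would adjust `weights` and `biases` to reduce the error between `output` and the target; that step is omitted here.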

Convolutional Layer
The convolutional neural network (CNN) is a network architecture developed substantially through the work of LeCun et al. [50], which has proven to be highly effective in image recognition. Convolutional layers consist of a receptive field and a kernel containing the weights. When processing image data, the receptive field slides over the input images with a given step size, generating many small sections multiplied by the kernel's weights. In contrast to fully connected layers, the filter uses the same weight matrix at different image locations. Thus, the number of parameters to be learned is drastically reduced. Moreover, convolutional layers focus on recognizing particularly relevant features and can detect them at different locations in an image [51]. In addition to processing 2D data such as images, convolutional layers can also be used to process 1D data such as time series or 3D data sets such as video sequences.
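The sliding receptive field and the shared kernel can be illustrated with a small example. The sketch below is a plain-Python 2D convolution without padding (strictly a cross-correlation, which is what deep learning libraries usually compute under this name); the function name `conv2d` and the toy image and kernel are assumptions for illustration.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the weighted sections."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - k_rows + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k_cols + 1, stride):
            # The same kernel weights are reused at every location (r, c).
            row.append(sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k_rows)
                for j in range(k_cols)
            ))
        out.append(row)
    return out


image = [[1, 2, 3, 0],
         [0, 1, 2, 3],
         [3, 0, 1, 2],
         [2, 3, 0, 1]]
kernel = [[1, 0],
          [0, 1]]

feature_map = conv2d(image, kernel)            # 3 x 3 output
strided_map = conv2d(image, kernel, stride=2)  # 2 x 2 output
```

Here the kernel holds only 4 weights, whereas a fully connected layer mapping the 16 input cells to the 9 output cells would need 16 × 9 = 144 weights, which is the parameter saving described above.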

Recurrent Layer
Compared to fully connected layers, recurrent layers have feedback loops that allow the layer outputs to be fed back into the same layer again. This makes them highly suitable for modeling sequential data such as text or time series. The longer the input sequences, the more feedback is required. In the case of very long sequences, this leads to deep networks. Training these networks often causes the problem where the gradient toward the lower layers shrinks and vanishes, or it grows in the other direction and explodes. These problems, also called vanishing and exploding gradients, cause the network to stop converging in the deeper layers or the training to become unstable [52]. Therefore, these networks are said to have a "short-term memory", which leads to the problem where longer-term dependencies are only barely considered or not at all.
To counter this problem, Hochreiter and Schmidhuber [53] developed the so-called long short-term memory (LSTM) cells. The special feature of LSTM cells compared to classical recurrent neurons is that they have an additional cell state. This cell state makes it possible to take long-term dependencies into account. The cell state is controlled by gates which decide what information is added to the state, what is forgotten, and how the cell state influences the network output.
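The interplay of the gates and the cell state can be sketched for a single time step. The example below uses the commonly taught LSTM formulation with forget, input, and output gates, reduced to scalar input and hidden state for readability; the parameter layout (`params` as a dictionary of `(w_x, w_h, b)` triples) and all numeric values are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell (input and hidden size 1)."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # what is dropped from the cell state
    i = gate("input", sigmoid)         # how much new information is added
    g = gate("candidate", math.tanh)   # candidate values for the cell state
    o = gate("output", sigmoid)        # how the cell state shapes the output
    c = f * c_prev + i * g             # updated cell state (long-term memory)
    h = o * math.tanh(c)               # updated hidden state (layer output)
    return h, c


# Hypothetical parameters; a trained network would learn these values.
example_params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.8, -0.2, 0.0),
    "candidate": (1.0, 0.3, 0.0),
    "output": (0.6, 0.4, 0.0),
}

h, c = 0.0, 0.0
for x in [0.0, 0.2, 1.5, 0.4]:  # e.g., a short scaled rainfall series
    h, c = lstm_step(x, h, c, example_params)
```

Because the cell state `c` is updated additively and only scaled by the forget gate, information can persist across many time steps, which is the mechanism that mitigates the short-term memory problem described above.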

Graph Neural Networks (GNNs)
The network layers described above are suitable for processing categorical data, sequential data, or data structured as rasters. However, many data sets are also available as networks in the form of graphs, in which the connection of individual objects to other objects plays an important role. These include, for example, social networks, molecules, traffic networks, or even sewer networks. A particular type of neural network was developed for this kind of data, the so-called graph neural network (GNN). Scarselli et al. [54] introduced this architecture over a decade ago. It was made especially popular by the work of Defferrard et al. [55] and Kipf and Welling [56], who combined GNNs with CNNs to form graph convolutional networks (GCNs). This architecture has been used to optimize the performance of neural networks for many problems including traffic forecasting [57], forecasting pressure in drinking water supply networks [58], and forecasting COVID-19 infection events [59].
The basis of GNNs is a graph $G$, which can be represented in the simplest form as $G = (V, E)$, where $V$ stands for the nodes and $E$ for the edges. An edge from node $v_i \in V$ to node $v_j \in V$ can be described as $(v_i, v_j) \in E$. For efficient processing of graphs in ML applications, they are usually represented as a matrix. One way of doing this is to use an adjacency matrix $A \in \mathbb{R}^{N \times N}$ in which, for each position $i, j$ ($1 \le i \le N$; $1 \le j \le N$), the following is the case:

$$A_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}$$

Furthermore, graphs can be divided into directed and undirected. In directed graphs, the edges between two nodes can only be traversed in one direction, while in undirected graphs, the connection can be traversed in both directions. Furthermore, the edges in graphs can be weighted, whereby the entries in the adjacency matrix are multiplied by a given weight.
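How an edge list translates into an adjacency matrix, including the directed, undirected, and weighted cases, can be sketched as follows. The function name `adjacency_matrix` and the three-node example are assumptions for illustration; nodes are indexed from 0 here, whereas the text counts from 1.

```python
def adjacency_matrix(n_nodes, edges, directed=True, weights=None):
    """Build an N x N adjacency matrix from a list of (i, j) edges.

    An entry is 1 (or the edge weight) where an edge exists and 0 otherwise.
    """
    A = [[0.0] * n_nodes for _ in range(n_nodes)]
    for k, (i, j) in enumerate(edges):
        w = 1.0 if weights is None else weights[k]
        A[i][j] = w
        if not directed:
            A[j][i] = w  # undirected edges can be traversed both ways
    return A


# Hypothetical chain of three manholes: 0 -> 1 -> 2.
edges = [(0, 1), (1, 2)]
A_directed = adjacency_matrix(3, edges)
A_undirected = adjacency_matrix(3, edges, directed=False)
A_weighted = adjacency_matrix(3, edges, weights=[0.5, 2.0])
```

The undirected matrix is symmetric, while the directed one is not, which matches the distinction drawn in the paragraph above.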
GNNs can be used to predict features on the level of nodes, edges, or the whole graph. In addition to the adjacency matrix, a feature matrix $X \in \mathbb{R}^{N \times D}$ is considered as input. Here, $N$ describes the number of nodes and $D$ is the number of input features per node. The node-level output $Z \in \mathbb{R}^{N \times F}$ ($F$ stands here for the number of output features) is accordingly a function $f$ of the adjacency and feature matrix:

$$Z = f(X, A)$$

The value of $Z$ can be computed with different GNN architectures. Many studies rely on the GCNs described in Kipf and Welling [56]. GCNs transfer the convolution operation known from CNNs from image data to graph data. The main idea behind this is that the representation of a node always depends on its features and on the features of its neighboring nodes.
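One concrete choice for the function f is the layer-wise propagation rule of Kipf and Welling [56], in which self-loops are added to A, the result is symmetrically normalized by the node degrees, and the aggregated features are multiplied by a trainable weight matrix W. The sketch below implements a single such layer in plain Python with a ReLU activation; it assumes an undirected graph, and the text does not state whether the study uses exactly this variant. The helper names `matmul` and `gcn_layer` and the toy matrices are assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(A, X, W):
    """One graph convolutional layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(A)
    # Add self-loops so that each node also keeps its own features.
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]
    # Symmetric normalization by the node degrees.
    A_hat = [
        [A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    Z = matmul(matmul(A_hat, X), W)
    return [[max(0.0, z) for z in row] for row in Z]


# Two connected nodes, one input feature (D = 1), one output feature (F = 1).
A = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0], [3.0]]
W = [[1.0]]
Z = gcn_layer(A, X, W)
```

With these toy values, each node ends up with the average of its own and its neighbor's feature, which illustrates the idea that a node's representation depends on its neighborhood.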

Generative Adversarial Networks
The generative adversarial network (GAN) is an architecture first presented by Goodfellow et al. [60] consisting of two sub-models. First, a generator G uses a random distribution z to generate records similar to those in the training data set. Then, a classifier called discriminator D computes the probability of whether a data example x comes from the training data set or the generator. The models have contrary goals and compete against each other in a zero-sum game during training. The generator aims to produce outputs that the discriminator cannot distinguish from "real" content in the data set. Conversely, the discriminator aims to classify the generator's outputs as "fake" content with the highest possible probability. Both models are trained simultaneously and use the same loss function L, which indicates the likelihood of whether an input data set is "real" or "fake." While the discriminator's parameters are adjusted to maximize this probability, the generator's parameters are adjusted to minimize it. This results in the following function presented in Goodfellow et al. [60]:

$$\min_G \max_D L(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

In classical GAN models, which receive only a random distribution as input, there is no way to control how data are generated [61]. With this in mind, Mirza and Osindero [61] developed conditional GANs (cGANs), a modified version, which in addition to a random distribution z, also depends on latent information y. This gives the model some context that can be used to influence the output. For example, images can be generated from contours or labels [22] or the resulting flooding from a precipitation forecast [34,37].
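The opposing goals of the two sub-models can be made tangible by evaluating the GAN objective of Goodfellow et al. [60], the expected value of log D(x) on real samples plus the expected value of log(1 − D(G(z))) on generated samples, for a handful of discriminator outputs. The sketch below is a plain-Python estimate over small batches; the function name `gan_value` and the example probabilities are assumptions for illustration.

```python
import math


def gan_value(d_real, d_fake):
    """Batch estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) for real samples.
    d_fake: discriminator outputs D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from generated samples well ...
confident = gan_value(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
# ... versus one that has been fooled and outputs 0.5 everywhere.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator is trained to push this value up (here `confident` is larger than `fooled`), while the generator is trained to push it down; when the discriminator can only guess, the value equals −log 4.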

Case Study
Different deep-learning-based model setups were developed and tested for a study area to predict pluvial flooding. The data sets required for the training process were generated using an HD model of a study area and a data set with various precipitation events. In addition to the tests conducted to compare the model setups, this section also describes the HD model and the precipitation data set, as well as the data generation and preprocessing steps as an essential part of the research.
The deep learning models were developed in Python 3.8 using TensorFlow [62]. In addition, other common libraries such as scikit-learn, pandas, NumPy, and Matplotlib were used for data preprocessing and visualization. The two modules MIKE IO 1D and MIKE IO [63] were used to read the result files that are output by MIKE+. GeoPandas and GDAL were also used to process spatial data, and NetworkX was used to process data structured as graphs. The models were trained on a workstation with an NVIDIA RTX 6000 GPU with 48 GB of GPU memory.

Study Area and Hydrodynamic Model
A study area of 3.1 km² in the south of the city of Gelsenkirchen in Germany was selected for the investigations (see Figure 2). The site is primarily urban and is drained by a combined sewer system. The terrain has an average slope of 7.5% and is not influenced by rivers or slopes in terms of flooding. A railroad line runs across the area, dividing the catchment area into a northern and southern part. Both parts are connected by two underpasses, which are potentially at risk of flooding and were underwater during past extreme events.
To generate the flooding grids used as the target for the training process, a coupled 1D/2D HD model of the study area was implemented in the software MIKE+ [64]. The municipal drainage utility of Gelsenkirchen provided a sewer network model for the study area. Within the study area, the sewer network model comprised 975 manholes and 982 reaches. A grid-based computational mesh with a 2 m × 2 m resolution was created, which contained 772,415 elements to model the runoff behavior at the ground surface. Elevation information was added to the computational mesh using 3D survey data acquired by airborne laser scanning from the Cologne District Government [65]. Buildings were additionally raised so that they represented a flow barrier. The sewer network and surface models were then coupled in a bidirectional manner via the manholes.

Pluvial Flood Event Data Sets
When using deep learning models, the developed model's accuracy depends heavily on the data set provided for the training process. For the training, precipitation hydrographs were used, for which the respective target variables were determined with the help of the hydrodynamic model. Since sewer networks in Germany are designed for overflow frequencies in the range of 2 to 10 years, according to DIN EN 752:2017 [66] and DWA-A 118 [67], precipitation events that lead to sewer system overflow are quite rare. For this reason, the investigations were not limited to historical events in the study area. Instead, data from a total of eight terrestrial rain gauges near the study area with continuously measured data for a period of >60 years as well as different design rainfall events were used. The considered rainfall data have a temporal resolution of five minutes.
To consider only relevant events, partial duration series were created from the rainfall records of the eight rain gauges. Preliminary experiments have shown that only rainfall events with a return period of >5 years are likely to cause overflow and the formation of relevant flood areas. Therefore, only rare events with higher return periods were considered. In total, 153 events suitable for training were identified.
While the real measured data provide a realistic representation of the rainfall characteristics, design rainfall data offer the possibility of a representative coverage of all relevant durations and return periods. Different model rainfall patterns, durations, and return periods were considered to cover the full range of all possible precipitation loads. Following Schmitt [68], so-called increase factors were also considered to cover precipitation beyond a return period of 100 years. The highest factor was set at 4.0 based on the findings from studies to determine "Maximized Area Precipitation Heights for Germany (MGN)". This is a physical-empirical-based estimate of the probable maximum physically possible precipitation heights [69].
As a result, 258 events (105 model rainfall events, 153 natural rainfall events) were available for model training. Figure 3 shows the distribution of events as a function of their respective return periods for the two data sets.

Data Generation Process
With the calibrated hydrodynamic 1D/2D flood model and the generated precipitation series as model load, the necessary training data sets for the ML-based forecast models were produced. The precipitation was assumed to be spatially homogeneous, and an additional lag time of 120 min was considered for each event to represent the decay of floods. As a result, different hydrographs such as overflow hydrographs from spilling manholes as well as sequences of grids with inundation areas were obtained.
At the end of a simulation, in addition to a map with the maximum water levels at the ground surface, MIKE+ outputs a multidimensional grid data set. The data set contains a temporal sequence of grids with the water levels at the respective time of the simulated event. This data set allows an ML method to be trained to predict the desired temporal evolution of an upcoming event and was used as a target variable in the training process. In addition, it was also possible to output overflow hydrographs from spilling manholes, which are considered as potential inputs in the analyses carried out here. For the models developed here, only overflow onto the ground surface was considered, not the inflow from surface runoff into the sewer network.
As in other studies [33,35], spatial information was used as potential input for the deep learning model in this study. Löwe et al. [35] conducted extensive investigations on the relevance of different types of spatial information in their study. The spatial information found to be most suitable (terrain aspect, curvature, the depth of terrain depressions, imperviousness, and flow accumulation) was also considered here in the analyses.

Data Preprocessing
Because of the different units and value ranges of the considered data and their partially right-skewed distributions, the data were further preprocessed. The spatial information was transformed following the procedure in Löwe et al. [35] and scaled to the interval [−1, 1] if negative values were present, and to the interval [0, 1] otherwise. The remaining data were also scaled to the interval [0, 1], but no additional transformation was performed.
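The scaling step can be illustrated with a minimal sketch. The exact scaling formula is not stated in the text, so the following assumes min-max scaling for non-negative data and division by the largest magnitude for data with negative values (which keeps zero at zero). The function name `scale_values` is hypothetical, and the transformation after Löwe et al. [35] is not reproduced here.

```python
def scale_values(values):
    """Scale a flat list of numbers to [0, 1], or to [-1, 1] if negatives occur."""
    lo, hi = min(values), max(values)
    if lo < 0:
        # Assumption: divide by the largest magnitude so that zero stays zero.
        peak = max(abs(lo), abs(hi))
        return [v / peak for v in values]
    if hi == lo:
        # Constant input: avoid a division by zero.
        return [0.0 for _ in values]
    # Plain min-max scaling for non-negative data.
    return [(v - lo) / (hi - lo) for v in values]


print(scale_values([0.0, 5.0, 10.0]))   # non-negative data -> [0, 1]
print(scale_values([-2.0, 0.0, 4.0]))   # data with negatives -> [-1, 1]
```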
Predicting pluvial flooding was treated as a supervised learning problem in this study. Accordingly, the data for the training process were converted into pairs of input and target variables. The spatial information was relatively straightforward since it was static and did not change between training samples. However, this did not apply to the time series and grid sequences, which changed dynamically along the temporal dimension. Hence, a sliding window approach was used. For each time step t of an event, a window was opened over the past D time steps and the upcoming H time steps, resulting in the intervals [t−D+1, ..., t] for the past time steps and [t+1, ..., t+H] for the predicted time steps. D and H were set to 60 min for the studies performed here, corresponding to 12 time steps for the chosen temporal resolution of five minutes. The precipitation forecast for the forecast horizon of 60 min was set to be the measured precipitation of the corresponding time steps for the investigations carried out here. In the future, a forecast generated by a precipitation forecast model will be used. The procedure for generating the training pairs P is shown as an example for one observation time step in Figure 4. The total number of all generated training pairs from the 258 used events was 9045.
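The sliding window procedure can be sketched as follows. This is an illustration under assumptions, not the implementation used in the study: the function name `make_training_pairs` is hypothetical, and windows that would extend beyond the start or end of an event are simply skipped here, whereas the study may treat event boundaries differently.

```python
def make_training_pairs(series, past_steps=12, horizon=12):
    """Return (past, future) window pairs for every admissible time step t.

    series: list with one entry per time step (e.g. rainfall depths or grids).
    past window:   indices t-D+1 .. t   (D = past_steps)
    future window: indices t+1 .. t+H   (H = horizon)
    """
    pairs = []
    # Assumption: only time steps with complete past and future windows are used.
    for t in range(past_steps - 1, len(series) - horizon):
        past = series[t - past_steps + 1 : t + 1]
        future = series[t + 1 : t + 1 + horizon]
        pairs.append((past, future))
    return pairs


# 60 min at a 5 min resolution correspond to D = H = 12 time steps.
event = list(range(30))            # toy event with 30 time steps
pairs = make_training_pairs(event)
print(len(pairs), pairs[0])
```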
The data set was split into training, validation, and testing data sets event by event. Out of the 258 events, the data pairs of 26 events were retained for testing, all of which were from the station closest to the study area. Of the remaining events, 90% (209 events) were used for training and 10% (23 events) for validation.
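An event-wise split of this kind can be sketched as follows. The function name `split_events`, the integer event IDs, and the random assignment of validation events are assumptions made for illustration; the text only states the resulting event counts and that the test events come from the station closest to the study area.

```python
import random


def split_events(event_ids, test_ids, val_fraction=0.1, seed=0):
    """Split events (not individual training pairs) into train/val/test."""
    test = [e for e in event_ids if e in test_ids]
    rest = [e for e in event_ids if e not in test_ids]
    # Assumption: validation events are drawn at random from the remainder.
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_val = int(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test


events = list(range(258))
# Hypothetical IDs standing in for the 26 events of the closest station.
train, val, test = split_events(events, test_ids=set(range(26)))
print(len(train), len(val), len(test))
```

Splitting by event keeps all windows of one event in the same subset, so overlapping windows cannot leak between training and test data.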

Investigated Model Setups
As described in Section 2.1, artificial neural networks were used as ML methods to develop the prediction model. In addition, various potential inputs were available, which were examined to determine to what extent the developed model would benefit from their integration. At the beginning of the investigations, the architecture shown in Figure 5 was chosen as a starting point. Initially, only precipitation information was selected as an input to predict a sequence of flooding grids. The architecture was inspired by the work of Guo et al. [33], but it underwent various modifications. In this model architecture, the precipitation information is first processed using two convolutional 1D layers for feature extraction before a fully connected layer and a reshaping layer follow. The latter converts the data into a format that can be upscaled to the output format. This is followed by a decoding part consisting of four deconvolutional 3D layers that generate the flooding raster sequence from the extracted features. Other architectures such as LSTM layers or fully connected layers for feature extraction or convolutional 3D layers in combination with upsampling layers for decoding were also tested, but they led to worse results. All convolutional and deconvolutional layers, except the last one, are followed by a batch normalization layer [70] to stabilize the training process and to enable higher learning rates, as well as a rectified linear activation unit (ReLU) [71] as activation function. The last deconvolutional layer is followed by a sigmoid activation function [72] without batch normalization.
For training, all models used the mean squared error as the objective function (unless otherwise described) and were trained with the Adam optimization algorithm [73] for 100 epochs. The size of the batches was set to 16 since larger batches led to GPU memory overload. A value of 0.001 was selected as the learning rate, previously determined following the procedure described by Smith [74]. Only the models with the smallest error for the validation data set during the training were saved to avoid overfitting.
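The training settings and the rule of keeping only the model with the smallest validation error can be summarized in a short sketch. The dictionary keys and the function name `select_best_epoch` are hypothetical; the values are those stated in the text. The validation losses in the example are made up for illustration.

```python
# Training settings as stated in the text (key names are hypothetical).
TRAINING_CONFIG = {
    "loss": "mean_squared_error",
    "optimizer": "adam",
    "epochs": 100,
    "batch_size": 16,
    "learning_rate": 0.001,
}


def select_best_epoch(val_losses):
    """Return (epoch_index, loss) of the lowest validation loss."""
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # In a real training loop the model weights would be saved here.
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss


# Made-up validation losses for five epochs.
print(select_best_epoch([0.5, 0.3, 0.4, 0.2, 0.25]))
```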

Experiment 1: Comparison of Different Input Variables
In the first experiment, it was evaluated which combination of potential model inputs provided the best results. The precipitation information including the precipitation forecast was regarded as mandatory. The remaining two inputs, the overflow forecast and the spatial information, were varied in all possible combinations, so that four models with different input combinations were compared.
Figure 6 shows the baseline architecture with the additional input paths considered for feature extraction. The overflow prediction is processed similarly to the precipitation information and connected to the output architecture after the reshaping layer via a concatenate layer before the decoding path follows. At the same point, the spatial information is integrated into the network. The feature extraction for this type of data is conducted with an encoder structure consisting of several convolutional 2D layers with a stride of two to downsample the input raster. The output of the last convolutional 2D layer is stacked H times to obtain identical dimensions in front of the concatenate layer.

Experiment 2: Comparison of Different Preprocessing of the Overflow Data
Different formats to integrate the overflow data into the model were investigated in the second experiment. Initially, these were available as hydrographs for all nodes in the catchment area. The question was to what extent the model could benefit from these several hundred hydrographs without spatial relations. In this context, in addition to the unstructured overflow hydrographs (variant a), two other variants were investigated, adding the overflow data to the model as a raster sequence (variant b) and as a spatiotemporal graph (variant c). Figure 7 provides an overview of the possible architectures.
The raster sequences in variant (b) were created by intersecting the overflow hydrographs with a sink catchment raster. The result was a sequence of grids with the accumulated overflow volumes of all manholes per time step and for each sink catchment. In the model architecture, the raster sequences are processed with an encoder structure consisting of convolutional 3D layers with a stride of two. As in the decoding part, each layer is followed by a batch normalization layer and a ReLU activation function. The output of the last layer is then concatenated to the baseline architecture and further processed there.
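The creation of the raster sequence in variant (b) can be sketched as follows. The data structures and the function name `overflow_raster_sequence` are hypothetical, and "accumulated" is interpreted here as the sum over all manholes within a sink catchment at one time step, which is an assumption.

```python
def overflow_raster_sequence(catchment_raster, manhole_catchment, hydrographs):
    """Build one grid per time step holding the summed overflow per sink catchment.

    catchment_raster: 2D list of catchment IDs (0 = cell outside any catchment).
    manhole_catchment: dict manhole ID -> catchment ID.
    hydrographs: dict manhole ID -> list of overflow volumes per time step.
    """
    n_steps = len(next(iter(hydrographs.values())))
    sequence = []
    for step in range(n_steps):
        # Sum the overflow of all manholes lying in the same sink catchment.
        totals = {}
        for manhole, volumes in hydrographs.items():
            cid = manhole_catchment[manhole]
            totals[cid] = totals.get(cid, 0.0) + volumes[step]
        # Assign every cell the total of the catchment it belongs to.
        grid = [[totals.get(cid, 0.0) for cid in row] for row in catchment_raster]
        sequence.append(grid)
    return sequence


raster = [[1, 1, 2], [0, 2, 2]]
manholes = {"A": 1, "B": 1, "C": 2}
flows = {"A": [1.0, 0.0], "B": [2.0, 1.0], "C": [0.0, 4.0]}
for grid in overflow_raster_sequence(raster, manholes, flows):
    print(grid)
```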
In variant (c), the overflow forecast is processed with a temporal graph convolutional network (T-GCN) following Zhao et al. [75] and Yu et al. [57]. This approach combines a graph convolutional layer with a recurrent layer. While the graph convolutional layer captures the spatial dependencies of the sewer network, the recurrent layer captures the temporal dynamics of the overflow process at the individual manholes. This enables the modeling of the spatiotemporal learning problem presented here. The graph convolutional layer receives a feature matrix containing the overflow hydrographs and an adjacency matrix representing the sewer network as an unweighted and directed graph. An LSTM layer is used as the recurrent layer. The output of the T-GCN block is then passed to a fully connected layer followed by a reshaping layer, analogous to the precipitation data, before concatenating with the output architecture.
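The spatial part of the T-GCN block in variant (c) can be illustrated with a single graph convolution step on a toy sewer network. This is a simplified sketch, not the layer used in the study: it assumes self-loops and row normalization of the adjacency matrix, a convention in which entry `adj[i][j] = 1` means that manhole i receives information from manhole j, and it omits the recurrent (LSTM) part entirely. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Add self-loops and row-normalize a directed, unweighted adjacency matrix."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]


def graph_conv(adj, features, weights):
    """One graph convolution step: ReLU(A_hat @ X @ W)."""
    a_hat = normalize_adjacency(adj)
    out = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in out]


# Toy network with three manholes in a row: 0 -> 1 -> 2 (flow direction).
adj = [[0, 0, 0],
       [1, 0, 0],
       [0, 1, 0]]
overflow = [[2.0], [0.0], [4.0]]   # one feature per manhole (one time step)
weights = [[1.0]]                  # placeholder instead of trained weights
print(graph_conv(adj, overflow, weights))
```

Each manhole's output mixes its own overflow with that of its upstream neighbor, which is how the layer passes information along the sewer network.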

Experiment 3: Comparison of Different Model Setups
In a third experiment, the performance of the previously best-evaluated model was compared to a conditional generative adversarial network (cGAN). The structure of the cGAN is shown in Figure 8 and was inspired by the work of Isola et al. [22] and Hofmann and Schüttrumpf [34]. The latter had used the architecture successfully for flood prediction. Unlike a normal GAN, the cGAN receives context in addition to noise as input. In the present investigations, following the findings of Isola et al. [22], the noise was completely ignored as input and only context in the form of the potential model inputs from experiment 1 was considered. Moreover, following similar studies [22,34,37], a mean absolute error function (L1 loss) was integrated into the objective function (cf. Formula (3)). Thus, the generator aims not only to fool the discriminator, but also to minimize the error relative to the results of the HD model, which serve as the target variable.
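The combination of adversarial loss and L1 loss in the generator objective can be sketched as follows. This is an illustration in the spirit of Isola et al. [22], not the objective function of the study: the function name `generator_loss`, the commonly used non-saturating form of the adversarial term, and the value of the weighting factor `l1_weight` are assumptions.

```python
import math


def generator_loss(disc_scores, predicted, target, l1_weight=100.0):
    """Adversarial loss plus weighted L1 loss for the generator.

    disc_scores: discriminator outputs in (0, 1) for generated samples.
    predicted, target: flat lists of water levels of equal length.
    l1_weight: placeholder weighting factor (lambda), not taken from the study.
    """
    eps = 1e-12  # guards against log(0)
    # The generator is rewarded when the discriminator rates its output as real.
    adversarial = -sum(math.log(s + eps) for s in disc_scores) / len(disc_scores)
    # Mean absolute error between the prediction and the HD model result.
    l1 = sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)
    return adversarial + l1_weight * l1


print(generator_loss([0.5], [1.0, 0.0], [0.0, 0.0], l1_weight=2.0))
```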
The best model from experiments 1 and 2 was used as the architecture for the generator. A network structure suitable for classification was used as the discriminator. It first extracts the features from the individual inputs similar to the other model structures and then merges them with a concatenate layer (see Figure 9). Afterward, another convolutional 3D block with ReLU and batch normalization follows, as well as a convolutional 3D layer followed by a sigmoid activation function. The output represents a binary classification. In contrast to the generator, dropout [76] with a dropout rate of 0.5 was used for regularization in the discriminator. In this way, the training process could be stabilized and better results could be achieved.

Performance Evaluation
The predicted (ML model) and simulated (HD model) flooding grids were compared cell by cell to evaluate the prediction results. For this purpose, the root mean squared error (RMSE) and the critical success index (CSI) were used as two different types of quality criteria.
The RMSE is a continuous index that compares the exact water levels and evaluates the average deviation:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{NN} - y_i^{HD}\right)^2}$$
where n stands for the number of cells compared and y_i for the respective values of the individual cells determined with the neural network NN and the hydrodynamic model HD. The RMSE can assume values in the range [0, ∞), where 0 corresponds to the optimal fit. The absolute error is given as the result. Other metrics for determining the relative error, such as the relative mean squared error (MRSE), were also tested. However, it was found that pixels with low water levels sometimes resulted in extreme relative errors. In the subsequent averaging of error values over all the cells of a flooding grid, this problem led to poor results. However, the affected cells have only a low hazard potential and are thus of minor relevance compared to cells with high water levels.
The CSI is a categorical index for evaluating location accuracy and is a widely used measure for assessing extreme events in both precipitation [43–45] and flash flood forecasting [35,77]. Compared to other categorical indices such as the hit rate or the false alarm rate, the CSI considers both misses and false alarms. Since missed and falsely predicted inundation areas are equally unfavorable in an emergency, the CSI is best suited for the developed prediction model. To determine the CSI, a binary classification of the cells first needs to be performed. In the present case, the pixels were classified as flooded and non-flooded. Subsequently, the CSI was calculated as follows:
$$\mathrm{CSI} = \frac{TP}{TP + FP + FN}$$
where TP stands for the number of cells correctly predicted to be flooded, FP denotes the number of cells incorrectly predicted to be flooded, and FN indicates the number of cells incorrectly predicted not to be flooded. The values of the CSI are in the interval [0, 1], where 1 corresponds to the best result.
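The CSI calculation can be sketched for two flattened flooding grids as follows. The function name `critical_success_index`, the strict "greater than" comparison for the flooded class, and the return value `None` for grids without any flooded cell are assumptions.

```python
def critical_success_index(predicted, simulated, threshold=0.05):
    """CSI = TP / (TP + FP + FN) for two flat lists of water levels in metres."""
    tp = fp = fn = 0
    for p, s in zip(predicted, simulated):
        pred_wet, sim_wet = p > threshold, s > threshold
        if pred_wet and sim_wet:
            tp += 1      # correctly predicted as flooded
        elif pred_wet:
            fp += 1      # false alarm
        elif sim_wet:
            fn += 1      # miss
    total = tp + fp + fn
    # Assumption: the CSI is undefined if neither grid contains a flooded cell.
    return tp / total if total else None


nn = [0.1, 0.2, 0.0, 0.0, 0.3]   # prediction of the neural network
hd = [0.1, 0.0, 0.2, 0.0, 0.4]   # hydrodynamic simulation
print(critical_success_index(nn, hd))
```

In this toy example, two hits, one false alarm, and one miss give a CSI of 2 / 4 = 0.5. Cells that are dry in both grids do not enter the index.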
The evaluation procedure only calculated metrics for pixels where the HD model or the neural network predicted water levels above a given threshold. Following the procedure in Löwe et al. [35] or as common practice in precipitation forecasting [43], multiple thresholds were considered for the CSI to evaluate the location accuracy at different water levels. The same approach was used here for the RMSE to account for the deviation in dependence on various water levels.
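The threshold-dependent evaluation of the RMSE can be sketched as follows. The function names and the list of thresholds are illustrative assumptions; only cells where at least one of the two grids exceeds the threshold enter the calculation, as described above.

```python
import math


def thresholded_rmse(predicted, simulated, threshold):
    """RMSE over cells where either grid exceeds the threshold (metres)."""
    errors = [
        (p - s) ** 2
        for p, s in zip(predicted, simulated)
        if p > threshold or s > threshold
    ]
    # Assumption: the metric is undefined if no cell exceeds the threshold.
    return math.sqrt(sum(errors) / len(errors)) if errors else None


def rmse_by_threshold(predicted, simulated, thresholds=(0.05, 0.2, 0.5)):
    """Evaluate the RMSE separately for several water level thresholds."""
    return {d: thresholded_rmse(predicted, simulated, d) for d in thresholds}


nn = [0.0, 0.4, 1.0]   # prediction of the neural network
hd = [0.0, 0.0, 1.0]   # hydrodynamic simulation
print(rmse_by_threshold(nn, hd))
```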

Comparison of the Investigated Model Setups
The three experiments described were carried out one after another, and in each case, the best model was carried into the next experiment. Table 1 summarizes the results of all experiments. The metrics for the individual water level threshold values d were formed in each case as the mean value of all samples of the 26 test events and all prediction time steps. For experiment 1, it was clearly shown that considering the overflow forecast (models 2 and 4) led to better results and generally predicted high water levels more reliably. The additional consideration of spatial information led to higher accuracy for water depth values ≤ 0.2 m. On the other hand, at higher thresholds, model 2 without spatial information performed best. Altogether, the difference between the two models was small. The result was unsurprising because the spatial information was a static variable without changes among individual training pairs. Thus, the data set acted more as a mask and did not significantly impact the error signal required for training progress. Nevertheless, since various studies have shown that spatial information can enable the transferability of trained models [35,78], model 4 was included in the following experiment.
A comparison of the different model setups in experiment 2 showed that considering the overflow information as a raster sequence led to the worst results. In addition, the format required significantly more memory during model training and almost double the computation time. The unstructured input of the overflow hydrographs performed slightly better for the lower thresholds, although the differences were marginal, especially for the RMSE. The input as a graph gave the best results for the higher thresholds, which are the most relevant for flood prediction. Therefore, model 7 was carried into the final experiment.
The third and last experiment showed high location accuracy for the model setup as a conditional GAN (model 9). On the other hand, the "classic" T-GCN (model 8) showed a higher accuracy for the RMSE. For a better assessment of individual outliers that may negatively affect the metrics, further evaluations were subsequently performed with both models.

Assessment of the Prediction Accuracy
The metrics were determined for each event in a further step to obtain a more detailed evaluation of the model performance. Figure 10 shows the distribution of the results for the T-GCN and the T-GCN cGAN. It should be pointed out that only 4 of the 26 events resulted in water levels > 0.5 m, so the metrics determined for that threshold are only partially representative. The differences between the two models shown in Table 1 are also apparent. Moreover, a slightly more extensive spreading of metrics was observed at higher thresholds for the T-GCN cGAN than for the T-GCN. Still, a significant negative influence of extreme outliers on the results could not be detected.
For a threshold value of 0.05 m, Figure 11 shows the RMSE and the CSI values of the 26 events with respect to their return periods (T). Here, both models provided good results even for the particularly relevant very rare events with T > 100 a. For the CSI, the results were in the upper range of all test events. The results for the RMSE were also positive considering that the absolute deviations were included in the calculation. Neither of the two models presented extreme outliers at certain recurrence intervals, and their results were similar.
In a third analysis, it was assessed whether the models overestimated or underestimated water levels. For this purpose, the prediction error as a function of the simulated water levels is shown in Figure 12 in a 2D histogram. All pixels from all forecast grids of the 26 events above a threshold of 0.05 m were considered. The dashed line indicates the ideal fit between forecast and HD simulation. The deviations vary relatively evenly around the dashed line for both models, with a slight trend toward overestimating water levels. Again, this showed a slightly better performance of the T-GCN.

Forecast for a Historical Heavy Rainfall Event
For the final evaluation, the T-GCN was tested using the historical heavy rainfall event of 3 July 2009 in the study area of Gelsenkirchen (Figure 13). A forecast was generated at the beginning of the event for a forecast horizon of 60 min. The precipitation sum for the predicted period was 46 mm, and for some duration intervals, return periods of more than 200 years were reached. Figure 13 presents the HD simulation and forecast results, the location accuracy, and the water depth difference for three forecast time steps. The visual comparison shows a high agreement and that no unrealistic flooding patterns were produced. In addition, the figure shows the following characteristics of the model:
1. Predictions with shallow water depths and only a few flooded pixels often lead to large errors. This problem is particularly evident at step t = +15 min with a CSI of 0, the worst possible result. The RMSE also shows the worst value compared to the other time steps. The same problem was also found by Löwe et al. [35]. On the other hand, predicted flood maps with many flooded pixels usually show a high accuracy, as is the case for time steps t = +30 min and +60 min. Accordingly, the flooding patterns particularly relevant for crisis management are predicted with high accuracy.
2. The model reacts with a slight delay to the precipitation load. While increasing flood areas before the peak are underestimated, the extent of areas after the peak is slightly overestimated. This behavior is illustrated by the histogram with the error frequencies, and it is also displayed in other events.
3. In the center of the depicted section, there is an underpass where the most considerable differences of up to 25 cm occur. However, it should be noted that the water levels there are sometimes more than two meters high. In this case, the relative error would be in the range of about 10-15% and thus within an acceptable range.

Discussion
The results showed a good agreement between the predicted inundation areas and the HD model results. The visual comparison showed only small differences, which should play only a minor role in its use in warning systems or as a basis for decisions in crisis management. The RMSE and CSI values were in a range similar to other studies [34,35,37], where deep learning models were used to compute flood maps. However, it should be noted that the comparison with other studies is limited because of the different prediction model tasks (grid sequences instead of single grids) and the different characteristics of the used study areas (topography, imperviousness, etc.). The investigations also found that the model provided good results for rare events with high return periods. In addition, the chosen model structure is independent of the event duration. That is because the model does not require inputs such as the precipitation load for the entire event, but only considers a fixed number of past and future time steps of each forecast time point in an event. On the other hand, it must be considered that perfect precipitation and overflow forecasts were used in the experiments. In practice, both predictions are subject to uncertainties, which would affect the model developed here. Therefore, regarding operational use, further investigations with results from forecast models for precipitation and overflow are required to evaluate the effects of the uncertainties on the predicted inundation areas.
The final model needed only a few seconds to calculate the flooding sequence for the following 60 min. Even when extended to larger areas, the computation time is expected to be less than one minute, making the model suitable for real-time operation. This result was consistent with findings from other studies [31,34,35,78]. Although an extension of the prediction horizon will lead to higher computational and memory requirements because of the additional flooding grids, it will be otherwise technically feasible. However, the possible forecast horizon is limited by the uncertainties of precipitation forecasts. These increase dramatically after only a few minutes for extreme events using the currently available forecast models and thus do not allow meaningful forecasts for more than two hours [79].
As Hofmann and Schüttrumpf [34] indicated, generation of the training, validation, and test data sets with hydrodynamic models is very time intensive. For the relatively small test area used in this study, it took about two months to compute all 258 considered events on a high-performance workstation. If extended to the entire city area of Gelsenkirchen, without additional computing resources, the calculations would need several years, making the approach unfeasible. In addition to more extensive computing resources, a smaller data set could be another solution. For this purpose, investigations are planned on determining the data quantity required to achieve adequate prediction results. Especially for the events with lower return periods, it is quite possible that reducing the data set will not lead to a significant loss of quality.
In addition to the computational resources required for data generation, the scalability of the model was limited by the high memory requirements for model training. Because of the high dimensionality of the flooding sequences used as the target variable (12 images of 1024 × 768 px for the relatively small study area of 3.1 km²), an expansion to larger areas, and thus a higher number of raster cells to be computed, will quickly exceed the available GPU RAM. With currently available technology, this makes training impossible beyond a certain amount of data. One approach to covering entire urban areas is to divide the computational domain into sub-models and merge the results, as described in Berkhahn et al. [31]. Another method is to train models in parallel on multiple GPUs. One example is the Python library Mesh-TensorFlow [80], which allows developing large models with extreme memory requirements and training them in parallel on multiple GPUs.
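To give a feeling for the numbers, the sketch below estimates the size of a single target sequence and illustrates the idea of splitting a raster into tiles and merging the tile results again. It is a minimal sketch under stated assumptions: 32-bit floats are assumed for the grids, the orientation (768 rows by 1024 columns) and the 256 px tile size are arbitrary choices, and the helper names are hypothetical. It does not reproduce the sub-model approach of Berkhahn et al. [31], and it ignores flow across tile boundaries.

```python
BYTES_PER_VALUE = 4  # assumption: grids stored as 32-bit floats


def target_bytes(n_steps, height, width, bytes_per_value=BYTES_PER_VALUE):
    """Memory needed for one target sequence of flooding grids."""
    return n_steps * height * width * bytes_per_value


def tile_bounds(height, width, tile_h, tile_w):
    """Return (row0, row1, col0, col1) of non-overlapping tiles covering the grid."""
    return [
        (r, min(r + tile_h, height), c, min(c + tile_w, width))
        for r in range(0, height, tile_h)
        for c in range(0, width, tile_w)
    ]


def split(grid, bounds):
    """Cut a 2D grid (list of rows) into tiles according to the given bounds."""
    return [(b, [row[b[2]:b[3]] for row in grid[b[0]:b[1]]]) for b in bounds]


def merge(tiles, height, width):
    """Write per-tile results back into one full-size grid."""
    full = [[0.0] * width for _ in range(height)]
    for (r0, r1, c0, c1), tile in tiles:
        for i in range(r0, r1):
            full[i][c0:c1] = tile[i - r0]
    return full


# One target sample of this study: 12 grids of 1024 x 768 px.
mib = target_bytes(12, 768, 1024) / 2**20
print(mib)  # 36.0 MiB for a single sample, before batching and activations

# Tiling the same raster into 256 x 256 px tiles gives 3 x 4 = 12 tiles.
n_tiles = len(tile_bounds(768, 1024, 256, 256))
print(n_tiles)  # 12

# Round trip on a small toy grid: splitting and merging restores the grid.
toy = [[r * 6 + c for c in range(6)] for r in range(4)]
restored = merge(split(toy, tile_bounds(4, 6, 2, 4)), 4, 6)
print(restored == toy)  # True
```

The per-sample figure is only a lower bound on the training footprint, since batches, intermediate activations, and gradients multiply it. In a tiled setup, each tile would be handled by its own model or model call, and the handling of water exchange between neighbouring tiles is the part this sketch leaves out.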
Another way to scale the model to an entire urban area is to make a trained model transferable. A model could then be developed for a sub-area and used to forecast the rest of the municipal area. Furthermore, this characteristic would allow the model to adapt to changes in the catchment area, for example, in topography, land use, or the sewer network, without retraining. Some studies [33,35] used topographic information as an additional input to include physical system properties and thereby establish model transferability. This approach can be combined with transfer learning techniques to improve results for the target area with a small amount of additional training [36]. Because overflow is additionally considered as an input variable in the experiments presented here, topographic information alone is insufficient. Instead, the combination or intersection of the overflow forecast with other physical system properties is required as an additional input to enable transferability. The intersection of overflow forecasts with sink catchments performed in experiment 2 (see Section 3.4.2) is a step in this direction. A similar approach was proposed by Löwe et al. [35], who weighted flow paths by adjacent overflow volumes in a raster data set. However, both methods are likely to reduce prediction accuracy and lead to a significant increase in GPU memory demand.
Finally, with the current model setup, the neural network can only become as good as the HD model. Accordingly, the generally known limitations of HD models also apply here. These include the limited opportunities for validation due to the lack of measurement networks for recording water levels during flooding events. Water level detection using social media images [4,5] or recordings from surveillance cameras [6] could provide a solution in the future. In addition, spatial information on flooding extent would be highly desirable, for example, through extraction from satellite data [81,82]. However, because of the coarse temporal and spatial resolution of satellite data, this approach is currently limited to fluvial or tidal flooding. In the future, further technical developments might enable the use of satellite data for pluvial flood modeling and possibly even provide a data source for training ML models.

Conclusions
This paper presented experiments with different deep learning models for predicting pluvial flooding in urban areas. The special feature of these models was the spatial and temporal prediction of the flooding situation in the form of a sequence of flood maps. In addition, the models generated a forecast for the following 60 min at any given time step of an event. As part of the model development, experiments were conducted to determine the influence of different input variables, input formats, and model architectures on prediction quality. The best model proved to be a T-GCN, which was characterized by low computation times and, at the same time, produced flood maps with only small differences from the ones produced by an HD model. Furthermore, the best results were achieved with precipitation information, an overflow forecast, and spatial information as input. The main disadvantage of the presented model setup is its limited scalability and transferability. On the one hand, long computation times during training data generation and high demand for GPU RAM during model training limit the size of the considered area. On the other hand, due to the limited transferability, the trained model cannot be used for other catchments and must be retrained when structural changes occur in the catchment area. The additional consideration of overflow, which positively affects the model quality, means that the trained model cannot be transferred without further input information. Here, additional information about the spatial overflow structure is required as model input, which can be exchanged if the model is used for other catchment areas.
Further investigations will be needed at various points. This applies mainly to integrating real rainfall and overflow forecasts into the prediction process. Here, for use in real-time warning systems, it must first be shown that accurate flood forecast results can still be achieved despite the uncertainties transferred to the model. The likely lower model quality can be mitigated by systematic hyperparameter tuning, which was not part of this study. To the best of our knowledge, no other studies currently use machine learning models to predict flooding sequences considering the sewer network. Accordingly, no models were available that could be used as a benchmark in this study. Therefore, a transfer to an open-source data set such as the Bellinge data set [83] is envisaged to provide the T-GCN as a benchmark for future developments.
Furthermore, the model's scalability to an entire urban area must be examined. The time required for data generation by the HD model and the high GPU memory requirements for training remain constraints, but they can be countered with various approaches such as increased computing resources or targeted data volume reduction. It also remains to be examined how and to what extent it is possible to scale the model to large areas while considering manhole spilling as an input variable. In addition, further investigations are needed regarding the accuracy of the developed model in areas with different topography.

Figure 1. Model setup with all potential inputs (left) and the target variable (right).

Figure 2. Illustration of the study area's sewer network and surface model (area outlined in red).

Figure 3. Distribution of events in the data set. (a) Distribution of the maximum return periods T for all 153 natural rainfall events. (b) Schematic representation of the design rainfall events, with the selected durations, model rainfall types, and return periods/scenarios. For scenarios S 1.5 and S 4.0, the number indicates the factor by which the values of the 100-year model rains were multiplied.

Figure 4. Converting the data into a supervised learning problem using a single training pair as an example.

Figure 5. Baseline architecture for the following investigations.

3.4.2. Experiment 2: Comparison of Different Preprocessing of the Overflow Data
Different formats to integrate the overflow data into the model were investigated in the second experiment. Initially, these data were available as hydrographs for all nodes in the catchment area. The question was to what extent the model could benefit from these several hundred hydrographs without spatial relations. In this context, in addition to the unstructured overflow hydrographs (variant a), two other variants were investigated, in which the overflow data were passed to the model as a raster sequence (variant b) and as a spatiotemporal graph (variant c). Figure 7 provides an overview of the possible architectures.

Figure 6. Baseline architecture combined with the corresponding input paths for the overflow forecast and the spatial information.

Figure 7. Baseline architecture with the different input paths for the considered formats of the overflow forecast: (a) unstructured hydrographs, (b) raster sequences, and (c) spatiotemporal graphs.

Figure 8. Architecture of the conditional GAN.

Figure 10. Distribution of metrics over all 26 events in the test data set.

Figure 11. Distribution of metrics depending on different recurrence intervals for the threshold value d ≥ 0.5 m.

Figure 12. 2D histogram of the prediction error as a function of the simulated water depths.

Figure 13. Results and evaluation for three time steps of a single forecast with the T-GCN at the beginning of the event on 3 July 2009 in Gelsenkirchen. Instead of showing the entire study area, a section with a flooded underpass is enlarged for better visualization.

Table 1. Evaluation results for all models from the three experiments (for each experiment and metric, the best result is bolded).