2.2. Technical Architecture
In this study, a rapid urban rainstorm flood prediction model targeting power grid facilities was constructed by integrating deep learning algorithms within the framework of a physically based hydrodynamic model. The technical architecture of this model is illustrated in Figure 2. In step 1, topographical features, historical monitoring data, and multi-return-period rainfall sequences are input to drive the physically based simulation of the hydrodynamic model, generating a training sample database with physical consistency (step 2). This sample database undergoes data cleaning, normalization, and sequence reconstruction before being fed into the deep learning models (step 3). The deep learning models are trained and validated (step 4) using a spatiotemporal dual-branch architecture. In this architecture, the spatial prediction branch employs an MLP to achieve deep fusion of multi-source features such as rainfall sequence characteristics and topographical features, outputting the spatial distribution of inundation depth, while the temporal prediction branch utilizes a CNN-LSTM-ATT hybrid model to learn the temporal dependencies of rainfall sequences and predict the temporal water depth processes at key water-logging points. Through this dual-branch architecture, the model achieves rapid prediction of both the spatial distribution of inundation and the temporal water depth processes (step 5).
2.2.1. Urban Rainstorm Flood Model
In this study, a self-developed fully distributed model (DHMUrban) [13,14,15] was employed to construct the urban rainstorm flood mechanism model within the research area. This model consists of fully coupled core modules: a surface model, a pipe network model, and a river network model (as shown in Figure 3). It enables high-precision, high-efficiency simulation of the entire process in complex urban environments, from rainfall generation and surface runoff convergence through pipe network confluence to river network accumulation [16,17,18].
2.2.2. Deep Learning Models
(1) Spatial Prediction Module
The spatial prediction module employs an MLP (multi-layer perceptron) model, a feed-forward artificial neural network architecture that, in addition to an input layer and an output layer, can incorporate multiple hidden layers between them [19]. Compared with a single-layer perceptron, this multi-layer architecture effectively addresses nonlinearly separable problems by introducing nonlinear activation functions and a hierarchical feature extraction mechanism [20,21]. In this study, the MLP model was employed to learn the spatial characteristics of rainfall-induced inundation (Figure 2).
The architecture of each neural network is described using the notation “I–H1–…–Hn–O”, where I denotes the input dimension, Hk the number of neurons in the k-th hidden layer, and O the output dimension. This study assumes that all grid cells share a consistent resolution and are spatially aligned through their unique grid IDs.
- (1) Rainfall pattern feature extraction: a three-layer fully connected network (24-64-64-251,450) learns the complex nonlinear mapping from rainfall sequences to spatial inundation fields.
- (2) Topographical feature mapping: a parameter-sharing multi-layer perceptron (6-16-16-1) processes six static features (elevation, roughness coefficient, initial loss, initial infiltration rate, steady infiltration rate, and drainage capacity) to generate a topography-dominated flood-prone base field.
- (3) Feature fusion: a lightweight fusion network (2-16-1) performs pixel-wise weighting of the contributions from rainfall and topographical features, achieving a scenario-aware fusion mechanism.
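The three sub-networks above can be sketched in PyTorch as follows. The layer widths follow the notation in the text (24-64-64-251,450; 6-16-16-1; 2-16-1), but the module names, the ReLU activations, the broadcasting of the shared terrain output, and the small grid used in the shape check are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

N_CELLS = 251_450  # number of grid cells in the study area

class SpatialBranch(nn.Module):
    """Sketch of the MLP spatial branch: rainfall net (24-64-64-N),
    parameter-sharing terrain net (6-16-16-1), fusion net (2-16-1)."""
    def __init__(self, n_cells=N_CELLS):
        super().__init__()
        # (1) Rainfall branch: 24-step rainfall sequence -> per-cell field
        self.rain_net = nn.Sequential(
            nn.Linear(24, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_cells),
        )
        # (2) Terrain branch: 6 static features per cell, weights shared
        # across all cells (parameter sharing)
        self.terrain_net = nn.Sequential(
            nn.Linear(6, 16), nn.ReLU(),
            nn.Linear(16, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )
        # (3) Pixel-wise fusion of the two per-cell contributions
        self.fusion = nn.Sequential(
            nn.Linear(2, 16), nn.ReLU(),
            nn.Linear(16, 1),
        )

    def forward(self, rain, terrain):
        # rain: (B, 24); terrain: (n_cells, 6)
        r = self.rain_net(rain)                            # (B, n_cells)
        t = self.terrain_net(terrain).squeeze(-1)          # (n_cells,)
        t = t.expand_as(r)                                 # broadcast to (B, n_cells)
        fused = self.fusion(torch.stack([r, t], dim=-1))   # (B, n_cells, 1)
        return fused.squeeze(-1)                           # depth per cell

model = SpatialBranch(n_cells=100)  # small grid for a quick shape check
depth = model(torch.randn(2, 24), torch.randn(100, 6))
print(depth.shape)  # torch.Size([2, 100])
```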
(2) Temporal Prediction Module
The temporal prediction module is based on an LSTM (long short-term memory) neural network, a deep learning model designed primarily for processing and predicting time-series data. Through three gating structures, namely the forget gate, input gate, and output gate, the model filters and retains information, mitigating the vanishing and exploding gradient problems that traditional RNN models encounter when handling long-range dependencies. This makes the LSTM particularly well suited for capturing long-term dependencies in time series, leading to its widespread application in fields such as stock market forecasting, weather forecasting, and hydrological prediction [22]. The standard LSTM, however, lacks inherent feature extraction capabilities and has limited ability to capture local patterns. In this study, the traditional LSTM model was therefore extended into a CNN-LSTM-ATT architecture, as shown in Figure 2. The model takes rainfall sequences of shape (120, 1) as input and produces water depth predictions of shape (120,) as output.
First, local temporal features of the rainfall data are extracted using a CNN. The first layer employs 16 convolution kernels of size 5 to detect basic rainfall patterns such as the onset of a downpour or sustained rainfall. The second layer utilizes 32 kernels of size 3 to combine lower-level features and identify more complex composite rainfall patterns. Finally, adaptive pooling compresses the sequence to 32 time steps, improving computational efficiency while preserving key information.
Next, an LSTM layer models the temporal dynamics of the features extracted by the CNN. Through the coordinated operation of the forget, input, and output gates, the LSTM learns the dynamic patterns of the rainfall-flood system, including delayed flood response, peak accumulation, and water recession; its recurrent connections ensure effective capture of long-term dependencies. Subsequently, an attention (ATT) mechanism evaluates the importance of each LSTM time step by computing weight scores, automatically focusing on critical periods such as intense rainfall or rapid water rise. As a result, a weighted context vector is generated, which improves prediction accuracy.
Finally, a fully connected layer transforms the context vector into water-logging depth predictions. A two-layer structure with rectified linear unit (ReLU) activation and dropout regularization allows the model to capture complex relationships while avoiding overfitting. Through the collaboration of the CNN, LSTM, and attention mechanism, this architecture achieves hierarchical processing, from local feature extraction through temporal pattern learning to key-period weighting, thereby effectively reducing the complexity of rainfall-flood prediction.
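The temporal branch described above can be sketched in PyTorch as follows. The kernel counts and sizes, the 32-step pooled length, and the (120, 1) → (120,) input/output shapes follow the text; the LSTM hidden size (64), the dropout rate (0.2), the single-linear-layer attention scorer, and the head width (128) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNLSTMATT(nn.Module):
    """Sketch of the CNN-LSTM-ATT temporal branch: 1-D convolutions,
    adaptive pooling to 32 steps, an LSTM, attention pooling over time,
    and a two-layer FC head with ReLU and dropout."""
    def __init__(self, seq_len=120, lstm_hidden=64):
        super().__init__()
        # CNN: 16 kernels of size 5, then 32 kernels of size 3
        self.conv1 = nn.Conv1d(1, 16, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(32)    # compress to 32 time steps
        self.lstm = nn.LSTM(32, lstm_hidden, batch_first=True)
        self.att = nn.Linear(lstm_hidden, 1)    # one score per time step
        self.head = nn.Sequential(              # two-layer FC, ReLU + dropout
            nn.Linear(lstm_hidden, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, seq_len),
        )

    def forward(self, x):
        # x: (B, 120, 1) rainfall sequence
        h = F.relu(self.conv1(x.transpose(1, 2)))   # (B, 16, 120)
        h = F.relu(self.conv2(h))                   # (B, 32, 120)
        h = self.pool(h).transpose(1, 2)            # (B, 32 steps, 32 channels)
        out, _ = self.lstm(h)                       # (B, 32, hidden)
        w = torch.softmax(self.att(out), dim=1)     # weights over time steps
        context = (w * out).sum(dim=1)              # weighted context vector
        return self.head(context)                   # (B, 120) depth series

model = CNNLSTMATT()
pred = model(torch.randn(2, 120, 1))
print(pred.shape)  # torch.Size([2, 120])
```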
2.5. Training of Surrogate Model
Based on the rainstorm intensity formula recommended in the Beijing Hydrological Manual, 15 designed 24 h rainfall processes under different return periods were generated; their rainfall characteristics are shown in Table 2. In this study, each 24 h precipitation process was treated as spatially uniform rainfall input across the entire modeling area. These 15 designed rainfall schemes were individually input into the hydrodynamic model for sequential simulation to obtain the spatial distribution of inundation, the corresponding water depths, and the temporal water depth processes. These data served as the training samples for the surrogate model.
After a comprehensive consideration of common, relatively severe, and extreme rainfall scenarios, 12 rainfall scenarios were chosen as training samples (P1–P6, P8–P11, P14, P15), 2 as validation samples (P7, P12), and 1 as a test sample (P13). The validation scenarios were selected based on their practical relevance: the 10-year return period represents common storm events critical for routine flood management, while the 50–100-year return period represents moderate-extreme events important for infrastructure resilience planning.
Before model training, spatial and temporal datasets required separate data preprocessing, which was conducted with strict anti-leakage protocols:
(1) Spatial Data Processing
The spatial dataset encompasses rainfall scenarios P1–P15 and their corresponding flood data, as well as six types of topographical feature data for the 251,450 cells. To prevent information leakage, each complete rainfall scenario (including its corresponding inundation data across all grid points) was treated as an indivisible data unit. Both rainfall and topographical data were normalized using Z-score standardization (with ε = 1 × 10⁻⁸ added to the denominator to prevent division by zero) to unify dimensions and accelerate model training. Critically, all normalization parameters (mean and standard deviation) were calculated exclusively from the training subset of scenarios (P1–P6, P8–P11, P14, P15), and the same parameters were then applied to the validation (P7, P12) and test (P13) subsets.
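The anti-leakage normalization protocol can be sketched as follows: statistics are fit on the training scenarios only and then reused unchanged for validation and test. The per-scenario feature matrix and its dimensions here are placeholders for illustration.

```python
import numpy as np

def zscore_fit(train, eps=1e-8):
    """Compute normalization parameters from the training scenarios only."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return mu, sigma + eps  # eps guards against division by zero

def zscore_apply(data, mu, sigma):
    return (data - mu) / sigma

# Scenario-level split: each row stands for one complete rainfall scenario
features = np.random.default_rng(0).random((15, 6))    # placeholder features
train_idx = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 13, 14]    # P1-P6, P8-P11, P14, P15
mu, sigma = zscore_fit(features[train_idx])            # fit on training only
train_z = zscore_apply(features[train_idx], mu, sigma)
val_z = zscore_apply(features[[6, 11]], mu, sigma)     # P7, P12: same parameters
test_z = zscore_apply(features[[12]], mu, sigma)       # P13: same parameters
```

Applying the training-set mean and standard deviation to the held-out scenarios (rather than refitting) is what keeps the validation and test subsets statistically isolated.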
(2) Temporal Data Processing
The temporal dataset integrates the designed rainfall scenarios P1–P15 and their corresponding temporal water depth processes at the substations, with the 24 h rainfall sequences extended to 120 h using the “last-value padding” method to ensure temporal consistency. Consistent with the spatial data protocol, the dataset was partitioned at the scenario level, guaranteeing that no data from the same rainfall scenario were mixed across the training, validation, and test sets. After temporal alignment and min-max normalization (using training-set-derived parameters only), a rainfall-water depth mapping relationship was constructed for each substation as the temporal prediction input.
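The last-value padding and min-max normalization steps can be sketched as follows; the random 24 h sequence is a placeholder, and the fit/apply split mirrors the training-set-only protocol described above.

```python
import numpy as np

def pad_last_value(seq, target_len=120):
    """Extend a 24 h rainfall sequence to 120 h by repeating its last value."""
    pad = np.full(target_len - len(seq), seq[-1])
    return np.concatenate([seq, pad])

def minmax_fit(train):
    """Min-max parameters derived from training data only."""
    return train.min(), train.max()

def minmax_apply(x, lo, hi):
    return (x - lo) / (hi - lo)

rain_24h = np.random.default_rng(1).random(24)  # placeholder 24 h sequence
rain_120h = pad_last_value(rain_24h)            # shape (120,)
lo, hi = minmax_fit(rain_120h)                  # training-set-derived parameters
rain_norm = minmax_apply(rain_120h, lo, hi)     # scaled to [0, 1]
```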
(3) Overall Data Partitioning Strategy
The dataset partitioning follows a scenario-based approach where training, validation, and test sets contain mutually exclusive rainfall scenarios: training (12 scenarios: P1–P6, P8–P11, P14, P15), validation (2 scenarios: P7, P12), and test (1 scenario: P13). This ensures complete isolation between datasets and prevents any information leakage throughout the preprocessing pipeline.
This study adopts a lightweight architecture based on the PyTorch 2.9.1 framework, which can run on a regular computer equipped with only 8 GB of memory and an Intel Core i7-11700-level processor; model training and inference can be completed efficiently on such hardware. The implementation relies on mainstream scientific computing libraries, including PyTorch 2.9.1, NumPy 2.3.5, Pandas 2.3.3, and Matplotlib 3.10.7. The entire training process requires no GPU acceleration, the required storage space is within 500 MB, and the training time is within 20 min.
After multiple iterations and optimizations, the optimal parameter settings for the MLP and CNN-LSTM-ATT models were determined (Table 3).
Evaluation results of the models on the training and validation sets are presented in Table 4. The MLP model maintained a high goodness of fit (R² > 0.98) and low errors on both datasets, demonstrating stable performance. The CNN-LSTM-ATT model (using D3 as an example) shows moderate fitting on the training set (R² = 0.8905) but achieves an R² of 0.9923 on the validation set, with validation errors lower than training errors.
To properly interpret these results, it is important to note that the higher validation R² compared to training R² does not necessarily indicate superior generalization capability. This pattern can occur because (1) the validation subset (10-year and 50–100-year return periods) may contain scenarios that are less challenging or more homogeneous than the full training set, which spans a wider range of rainfall intensities; (2) R² is sensitive to the variance of the target variable, and the validation set may have lower variance in water depth values; and (3) the scenario-based split creates inherent differences in data distribution between the training and validation sets. Thus, while the CNN-LSTM-ATT model performs well on the specific validation scenarios, its true generalization ability should be assessed through additional independent testing across diverse rainfall conditions.
To rigorously assess the generalization capability of the surrogate model and mitigate the risk of overfitting due to the limited number of rainfall scenarios, we implemented k-fold cross-validation on the combined training and validation sets (P1–P12, P14–P15, totaling 14 scenarios), while completely reserving P13 as an independent test set. Specifically, we divided these 14 rainfall scenarios into k = 5 folds, with each fold containing approximately 2–3 scenarios. In each iteration, k − 1 folds were used for training the neural network, while the remaining fold was reserved for validation. This process was repeated k times until each fold had been used exactly once as the validation set.
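The scenario-level 5-fold split can be sketched as follows; the shuffling seed is an assumption (the text does not state whether scenarios were shuffled), and P13 is simply excluded from the pool as the held-out test set.

```python
import numpy as np

# 14 scenarios for cross-validation; P13 is reserved as the independent test set
scenarios = [f"P{i}" for i in range(1, 16) if i != 13]

def kfold_splits(items, k=5, seed=0):
    """Scenario-level k-fold split: each scenario appears in exactly one
    validation fold, so no rainfall event leaks between train and validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(items))
    folds = np.array_split(idx, k)  # fold sizes of 3, 3, 3, 3, 2
    for i in range(k):
        val = [items[j] for j in folds[i]]
        train = [items[j] for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for fold, (train, val) in enumerate(kfold_splits(scenarios, k=5)):
    print(f"fold {fold}: {len(train)} train scenarios, validation = {val}")
```

With 14 scenarios and k = 5, each fold holds 2–3 validation scenarios, matching the setup described in the text.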
The evaluation results based on 5-fold cross-validation (Table 5) indicate that both MLP and CNN-LSTM-ATT (using D3 as an example) exhibit excellent predictive performance. The MLP demonstrates a stable ability to explain the spatial distribution of water depth, with an average R² of 0.9663, while CNN-LSTM-ATT excels at capturing the temporal water depth process, with an average R² of 0.9856 and less fluctuation among folds (standard deviation 0.0074 vs. 0.0109). Notably, this cross-validation performance substantially exceeds the training-set R² observed in the fixed split (0.8905), suggesting that CNN-LSTM-ATT's temporal prediction capability is more robust than indicated by the single-partition evaluation.