Assessing Surface Water Flood Risks in Urban Areas Using Machine Learning

Abstract: Urban flooding is a devastating natural hazard for cities around the world. Flood risk mapping is a key tool in flood management. However, it is computationally expensive to produce flood risk maps using hydrodynamic models. To this end, this paper investigates the use of machine learning for the assessment of surface water flood risks in urban areas. The factors considered in the machine learning models include coordinates, elevation, slope gradient, imperviousness, land use, land cover, soil type, substrate, distance to river, distance to road, and normalized difference vegetation index. The machine learning models are tested using the case study of Exeter, UK. The performance of the machine learning algorithms, including naïve Bayes, perceptron, artificial neural networks (ANNs), and convolutional neural networks (CNNs), is compared based on a spectrum of indicators, e.g., accuracy, F-beta score, and the receiver operating characteristic curve. The results obtained from the case study show that flood risk maps can be accurately generated by the machine learning models. The performance of the models on the 30-year flood event is better than on the 100-year and 1000-year flood events. The CNNs and ANNs outperform the other machine learning algorithms tested. This study shows that machine learning can help provide rapid flood mapping, and contribute to urban flood risk assessment and management.


Introduction
Surface water flooding in urban areas is caused by heavy rainfall that exceeds the capacity of drainage systems, resulting in waterlogged streets [1]. It poses a severe threat to residents, properties, and economies. Flood risk is likely to increase with the growth of urban populations, land use change, and climate change in many cities around the world. In England, surface water flooding affects about 3 million properties, more than the number affected by river and coastal flooding (2.7 million) [2]. Thus, it is important to assess the risks of surface water flooding in urban areas to support informed decision-making for flood management and risk mitigation.
There are two types of flood risk assessment approaches, i.e., physically based modelling and data-driven modelling [3]. Physically based models are widely applied for flood prediction, considering various hydrological processes, such as precipitation, evaporation, and geomorphological factors in catchments [4]. Data-driven models have the capacity to predict floods by identifying relationships in various datasets [5,6], so in-depth knowledge of the physical processes in catchments is not strictly required [7]. Machine learning, as a type of data-driven modelling, can discover patterns directly from data without pre-defined rules. However, there are two main challenges in machine learning. First, although in-depth knowledge of hydrology is not required, choosing the input data (e.g., catchment features) is key to generating a high-quality model [8]. Second, machine learning normally has a generalization problem: a model may not perform well on a different data set [8]. Compared to traditional machine learning algorithms, deep learning algorithms have a significantly improved capacity for detecting low-level features and learning high-level features after being trained with a large dataset [9].
Machine learning has been used for surface water flood prediction for several decades [8]. Naïve Bayes (NB), perceptron, XGBoost, classification and regression trees [10], random forest [11], support vector machines [12], and artificial neural networks (ANNs) [13] are commonly used machine learning algorithms for classification problems. Among them, ANNs are the most popular benchmark algorithm in the literature [8]. In recent years, hybrid models [14] and deep learning [15] (e.g., convolutional neural networks (CNNs) [16,17] and long short-term memory networks), representing the state of the art in machine learning, have emerged with high performance on assessments of flood events [8]. We chose NB, perceptron, ANNs, and CNNs with the aim of investigating the capacity of CNNs compared to more conventional machine learning algorithms. Further, there is a lack of understanding of their performance on high-resolution flood assessments based on the criteria of F-beta score and area under the curve (AUC).
This study conducts an experiment which builds 12 models of the chosen algorithms on flood risks for rainfall events of 30-year, 100-year, and 1000-year return periods using the case study of Exeter, UK. The study demonstrates that CNN models are capable of providing accurate high-resolution flood risk predictions for rain events of specific return periods in urban areas using the static features, which include a range of urban features, such as roads and buildings. This implies that flood risk maps can be produced using machine learning, such as CNNs, for urban areas which are lacking in data for hydrodynamic modelling.

Urban Hydrological Features
For assessing surface water flood risks in urban areas, 11 catchment features are chosen for constructing the models. These include coordinates, elevation, slope gradient, imperviousness, land use, soil type, substrate, distance to river, normalized difference vegetation index (NDVI), land cover, and distance to road. These features are all static features.
The coordinates indicate the location of each grid, and adjacent grids often show similar characteristics in continuous areas. The elevation and slope data are generated from digital terrain model (DTM) data. Slope is an important feature: it determines the overland-flow rate, because the steeper the slope, the larger the runoff velocity [18]. Imperviousness is an index that measures the extent of impervious areas in a catchment (0~100%) [19]. Impervious areas include roofs and paved areas, and water accumulates easily in areas with higher imperviousness. Soil type and substrate determine the infiltration rate. Land cover and land use indicate the way that the land is used. They directly affect the hydrologic characteristics of a catchment regarding the volume and rate of runoff, infiltration, and groundwater recharge [20]. Distance to river and distance to road refer to the distance between a location and the nearest river and road, respectively. Rainfall accumulates in rivers and raises their water levels. The velocity of a flood can be high on flat roads with few obstructions, and roads are often lower than their surroundings, which may result in accumulation of water. NDVI measures the presence of green vegetation in an area from the reflectance of near-infrared and visible light, and it is an important feature for flood assessment [21]. It can be expressed as [22]:

NDVI = (NIR − Vis) / (NIR + Vis)

where Vis represents the reflectance of visible light, and NIR represents the reflectance of near-infrared light.
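The NDVI calculation above can be sketched in a few lines (a minimal numpy example; the band arrays and the small epsilon guard against division by zero are illustrative):

```python
import numpy as np

def ndvi(nir, vis):
    """Normalized difference vegetation index per pixel.

    nir, vis: arrays of near-infrared and visible (red) reflectance.
    The tiny epsilon guards against division by zero over no-data pixels.
    """
    nir = np.asarray(nir, dtype=float)
    vis = np.asarray(vis, dtype=float)
    return (nir - vis) / (nir + vis + 1e-12)

# Dense vegetation reflects NIR strongly, so NDVI approaches 1;
# bare or built-up surfaces give values near or below 0.
values = ndvi([0.5, 0.2], [0.1, 0.2])
```

In a real workflow the NIR and visible bands would be read from satellite imagery rasters before applying this per-pixel formula.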

Feature Selection
For machine learning algorithms, feature selection can simplify the model, improve the performance, and shorten the training time. In this study, variance inflation factor (VIF) is used for feature selection.
VIF represents the multicollinearity of features, i.e., the extent to which a feature can be predicted by the other features [23]. Generally, features with VIF > 10 have high multicollinearity and should be removed. The VIF for the ith feature is [23]:

VIF_i = 1 / (1 − R_i²)

where R_i² is the coefficient of determination obtained by regressing the ith feature on all other features.
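The VIF computation can be sketched as follows (a minimal numpy implementation for illustration; in practice a library routine, such as the one provided by statsmodels, would typically be used):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features).

    For feature i, regress it on the remaining features (with an intercept)
    via least squares, then return 1 / (1 - R_i^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for i in range(p):
        y = X[:, i]
        others = np.delete(X, i, axis=1)
        A = np.column_stack([np.ones(n), others])   # design matrix with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)
```

Features whose VIF exceeds 10 would be flagged for removal before model training, as described above.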

Naïve Bayes
The naïve Bayes classifier is a probabilistic classification method based on Bayes' theorem, with the assumption that features are independent of each other [24]. It classifies a sample by calculating the probability that the sample belongs to each class, and chooses the class with the largest probability. Naïve Bayes is therefore a stable algorithm with a strong mathematical foundation. In addition, it is not sensitive to missing data and requires few parameters.
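To illustrate the principle, a minimal Gaussian naïve Bayes classifier can be written directly from Bayes' theorem (a toy sketch; the study itself used the scikit-learn implementation):

```python
import numpy as np

def fit_gnb(X, y):
    """Fit per-class Gaussian parameters and priors. The 'naive' assumption:
    features are conditionally independent given the class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # (mean per feature, variance per feature + smoothing, class prior)
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9, len(Xc) / len(X))
    return params

def predict_gnb(params, X):
    """Choose the class with the largest posterior (computed in log space)."""
    classes = sorted(params)
    scores = []
    for c in classes:
        mu, var, prior = params[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(scores, axis=0)]
```

The per-class product of independent feature likelihoods is what makes training cheap and robust to limited data.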

Perceptron
The perceptron is the simplest ANN, with a single layer [25]. The principle is that the perceptron sums the input values by their weights and uses an activation function to generate the output. Though the perceptron is a binary linear classifier, it can be generalized to multiclass problems; the general idea is to train k binary classifiers, one per class, and combine their outputs.
It is a simple model, but suitable for large-scale learning. It has three advantages: first, it does not require a learning rate; second, it does not require a regularization term as a penalty; third, it only updates the model when a sample is misclassified [26], which makes its training fast.
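The update rule described above can be sketched for the binary case (labels in {−1, +1}; a toy illustration, not the scikit-learn implementation used in the study):

```python
import numpy as np

def train_perceptron(X, y, epochs=20):
    """Classic perceptron rule: no learning rate, no regularization,
    and the weights change only when a sample is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # wrong side of (or on) the boundary
                w += yi * xi             # update only on a mistake
                b += yi
    return w, b
```

On linearly separable data the rule is guaranteed to converge; k such classifiers, one per class, yield the multiclass generalization mentioned above.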

Multilayer Perceptron
The multilayer perceptron (MLP) is a kind of feedforward ANN. Its theoretical foundation is derived from modern neuroscience research: it tries to simulate the structure of neural networks in the human brain [27]. An MLP model consists of at least three layers, including one input layer, one or more hidden layers, and one output layer. Each node is a neuron which uses an activation function to generate its output signal. The model is trained by backpropagation under the supervised learning framework. The MLP is distinct from the perceptron in its multiple layers and non-linear activation function, which enable the MLP to solve non-linearly separable problems [28].
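A single forward pass through a minimal MLP illustrates the role of the non-linear activation (an illustrative numpy sketch with arbitrary weights; backpropagation training is omitted):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(X, W1, b1, W2, b2):
    """One hidden layer with a non-linear activation -- the property that
    distinguishes an MLP from a single-layer perceptron."""
    h = relu(X @ W1 + b1)               # hidden layer
    return softmax(h @ W2 + b2)         # class probabilities
```

The softmax output gives a probability per class, matching the multiclass flood risk levels used in this study.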

1D Convolutional Neural Networks
CNNs are one of the representative algorithms of deep learning, inspired by the process by which the visual cortex in the brain receives visual signals [29]. A CNN is a kind of deep-structured feedforward neural network that includes convolution operations. A CNN model includes an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. Compared with traditional machine learning algorithms, it generally has higher classification accuracy, but it needs more data and computing time for training.
A modified version of CNNs is the 1D CNN. Compared with conventional 2D CNNs, 1D CNNs accept 1D data as input, so they need less computing power [16] and less data [17] to fit, and can support city-scale studies. In previous studies, 1D CNNs have mainly been used for image and signal processing problems [30]. Recently, they have been applied to land cover classification and crop yield prediction [17,30]. These studies are similar to flood risk assessment, because the problems are non-linear with complex features, and they all perform spatial analysis by extracting features from satellite images and using environmental features. Therefore, 1D CNNs can be an effective algorithm for flood assessment in urban areas.
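The core operation of a 1D CNN, sliding a learned kernel across a 1D feature vector, can be illustrated as follows (a minimal sketch of valid-mode cross-correlation, the convention used by deep learning libraries):

```python
import numpy as np

def conv1d(x, kernel, bias=0.0):
    """Valid-mode 1D cross-correlation: slide the kernel over the input
    vector and take the dot product at each position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel + bias
                     for i in range(len(x) - k + 1)])
```

In a trained network the kernel weights are learned; stacking many such filters, followed by an activation and pooling, yields the convolutional layers described above.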

Performance Measures
Statistical measures, such as accuracy and F-beta score, are used for algorithm comparison. They are calculated by true positive (TP), true negative (TN), false positive (FP), and false negative (FN), which represent the number of cells that were correctly classified as positive and negative, and incorrectly classified as positive and negative, respectively.
Accuracy is the basic criterion of performance and indicates the proportion of samples that are correctly predicted:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, accuracy is misleading for imbalanced data sets, so other criteria are essential. Precision (also called the positive predictive value) indicates the proportion of the samples predicted as positive that are actually positive:

Precision = TP / (TP + FP)

Recall (also called sensitivity, or the true positive rate) indicates the proportion of observed positive samples that are correctly predicted:

Recall = TP / (TP + FN)

The F1 score is an overall assessment of precision and recall, which are equally weighted:

F1 = 2 × Precision × Recall / (Precision + Recall)

The F-beta score generalizes the F1 score. In real-world cases, precision and recall are not equally important, and a model cannot maximize both at the same time. Therefore, the parameter β is introduced. When β < 1, precision is more important than recall; the opposite holds when β > 1. When β = 1, precision and recall are equally important, and the F-beta score is the same as the F1 score. The F-beta score is calculated as:

F-beta = (1 + β²) × Precision × Recall / (β² × Precision + Recall)

The receiver operating characteristic (ROC) curve is a standard technique to evaluate the performance of models [31]. It is plotted by varying the classification threshold of a model: the vertical axis represents the true positive rate, and the horizontal axis represents the false positive rate. The AUC measures the area underneath the ROC curve, which ranges from 0 to 1; the larger the area, the better the model.
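These scores can be computed directly from the confusion counts (a minimal sketch; the β = 0.5 default mirrors the value adopted later in this study):

```python
def classification_scores(tp, tn, fp, fn, beta=0.5):
    """Accuracy, precision, recall, and F-beta from confusion-matrix counts.

    beta < 1 weights precision more heavily than recall; beta = 1
    recovers the F1 score.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fbeta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return accuracy, precision, recall, fbeta
```

Note how, on an imbalanced example with many true negatives, accuracy can stay high while the F-beta score exposes the poor recall on the minority class.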

Oversampling and Undersampling
The imbalance of data is the result of the uneven distribution of different classes in real-world problems. It makes models tend to classify samples into the majority class, resulting in poor performance on the minority classes [32]. Previous research shows that the combination of oversampling and undersampling methods can effectively mitigate the imbalance issue [33]. Therefore, this study adopted the combination of the synthetic minority oversampling technique (SMOTE) and Tomek links to alleviate the influence of data imbalance. The Tomek link method is an undersampling method that improves on the condensed nearest neighbor rule, and it can eliminate boundary samples that have a higher chance of being misclassified [32]. With this technique, the models gain two benefits: first, it eliminates noisy and redundant instances; second, it balances the number of samples across classes. However, the drawback is that it increases the variance of the independent variables. SMOTE is based on an improvement of the k-nearest neighbor (KNN) algorithm and is used to increase the number of samples in the minority classes. It only operates on the minority classes: for each minority sample, it finds the k nearest minority samples and generates synthetic samples along the connecting line segments [34]. The benefit is that, compared with undersampling, it increases the amount of data, which is important for data-driven algorithms. However, the added data lead to a lower variance of the variables. In addition, if the minority samples are scattered, the synthetic values can make the class boundary vague.
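The interpolation step at the heart of SMOTE can be sketched as follows (an illustrative simplification, not the imbalanced-learn implementation; parameter names are ours):

```python
import numpy as np

def smote_sample(minority, k=3, n_new=10, seed=0):
    """Generate synthetic minority samples: pick a random minority point,
    pick one of its k nearest minority neighbours, and interpolate at a
    random position along the segment between them."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # position along the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay within the region spanned by the minority class, which is also why scattered minority samples can blur the class boundary.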

Study Area
The city of Exeter is situated in southwestern England (Figure 1). It has about 129,000 residents [35] and covers a total area of 47.04 km². The climate in Exeter is warm and temperate due to the warm Atlantic Gulf Stream. The annual average temperature is 10.7 °C, and the annual rainfall is 825 mm [36]. Exeter lies downstream on the River Exe, which is joined by the River Creedy, and the geomorphology shows a ridge of land with a steep slope on one side and a wide, gentle floodplain and estuary on the other. The geology of Exeter is mainly sandstone and conglomerate [37], and the soils are stony, well-drained sandy silt loams or clay loams [38]. Flooding has been the main natural disaster in Exeter since the 13th century. The most recent devastating flood events happened in the 1960s and 1970s. Though the flood defense systems were recently upgraded, the city still faces risks of extreme floods [39]. Thus, we chose Exeter as the study area to validate the methods of flood risk mapping.

Flood Inventory Map
In this study, the flood inventory maps used data of 30-year, 100-year, and 1000-year flood events in Exeter provided by the Department for Environment, Food & Rural Affairs [40]. According to the UK Environment Agency, predicting flood risks of 30-, 100-, and 1000-year events plays a significant role in the strategic overview of flood management [41], so these flood events are discussed in this study.
The Environment Agency has produced flood risk maps for pluvial and fluvial flooding, and, in this study, we have used the map for pluvial flooding in order to capture the urban features, such as land uses, road networks, and buildings. The flood risks are represented by a flood hazard indicator used by the Environment Agency that considers both water depth and water velocity [41]:

HR = d × (v + 0.5) + DF

where HR is the flood hazard rating; d is the depth of flooding (m); v is the velocity of floodwater (m/s); and DF is the debris factor, which takes a value of 0, 1, or 2 based on the probability that debris will lead to a higher hazard level. In this study, the HR values, i.e., the flood risks, are classified into five levels: no risk (0~0.5); level 1 risk (0.5~0.75); level 2 risk (0.75~1.25); level 3 risk (1.25~2); level 4 risk (2 and above) [41].
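The hazard rating and the mapping onto the five risk levels can be computed as follows (a sketch; whether the band boundaries are inclusive or exclusive is not specified in [41], so the boundary handling here is illustrative):

```python
def hazard_rating(d, v, df=0.0):
    """Flood hazard rating HR = d * (v + 0.5) + DF, with depth d in metres,
    velocity v in m/s, and debris factor DF in {0, 1, 2}."""
    return d * (v + 0.5) + df

def risk_level(hr):
    """Map a hazard rating onto the five risk levels used in this study."""
    if hr < 0.5:
        return 0   # no risk
    if hr < 0.75:
        return 1
    if hr < 1.25:
        return 2
    if hr < 2.0:
        return 3
    return 4
```

For example, 0.5 m of water moving at 1 m/s with no debris gives HR = 0.75, which falls at the lower edge of level 2 risk under the banding above.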

Data Preprocessing
The experiment applied the models of NB, perceptron, ANNs, and CNNs for assessing the flood risk of 30-year, 100-year, and 1000-year flood events. To generate high-quality models, we chose the features according to the literature. In particular, we checked the multicollinearity of the features using the VIF, and included the distance to road as an urban feature affecting stormwater flow paths. In addition, the input data were substantially improved through the steps described in the preprocessing and training sections (e.g., choosing high-resolution images, removing no-value areas, unifying the formats and coordinates, checking multicollinearity, oversampling and undersampling, and testing the model structures).
The original data come in different sizes, formats, coordinate reference systems (CRS), and raster resolutions. Using QGIS, all maps were projected into EPSG:27700, the projected coordinate system for the United Kingdom. We converted vector maps into raster maps and clipped the maps with a mask layer based on the boundary of Exeter from an administrative division map of England. We then resampled the maps to a resolution of 1 m. Finally, the data were converted from raster images to numerical data with a grid size of 10 m using the Rasterio tool for the numerical experiments.
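The final conversion from raster layers to tabular samples can be sketched as follows (a toy numpy example with hypothetical feature grids; the actual study read the 10 m grids with Rasterio):

```python
import numpy as np

# Illustrative per-feature raster grids with identical shapes;
# NaN marks no-value cells that must be dropped before training.
elevation = np.array([[10.0, 12.0], [np.nan, 15.0]])
slope     = np.array([[ 1.0,  2.0], [np.nan,  3.0]])

stack = np.stack([elevation, slope], axis=-1)   # (rows, cols, n_features)
table = stack.reshape(-1, stack.shape[-1])      # one row per pixel
table = table[~np.isnan(table).any(axis=1)]     # remove no-value pixels
```

Each remaining row is one sample whose columns are the catchment features, which is the input format expected by the NB, perceptron, ANN, and 1D CNN models.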
After preprocessing, the geo-environmental data of Exeter were visualized, as seen in Figure 2. The city is characterized by hilly topography with some flat land in the southwest (Figure 2a). The terrain is generally undulating, and is steeper in the northwest (Figure 2b). The soil types in Exeter are mainly gravel and clay (Figure 2c). The substrate types are bedrock in the north, west, and east, and superficial deposits in the south and central areas (Figure 2d). Most of the land is urban and suburban, and improved grassland is scattered across the city (Figure 2e,f). The majority of areas in Exeter are buildings and roads, and there are two railways in the north and southeast (Figure 2g). The imperviousness percentage is larger in the southeast and around the railways in the north (Figure 2h). Two main rivers flow through Exeter in the southwest and central east, and the distance between the river and each pixel is calculated in Figure 2i. Figure 2j shows the distance of a pixel to the nearest road; the road network is sparse in several areas in the east and southwest.

Among the data, certain features may not contribute to the training, so the feature selection method was used before training. A multicollinearity diagnosis test can prevent inputting similar features into the models, which saves training time and computing power. Table 1 shows the result of the multicollinearity diagnosis test. The VIF of all features for the 30-year, 100-year, and 1000-year flood events is less than 10. This indicates that all features are linearly independent, so all features are applied to train the NB, perceptron, ANN, and CNN models.

The statistics of the data help in understanding the general distribution of flood risks. As shown in Table 2, the distribution of flood risks is extremely imbalanced. The non-flood area of the dataset accounts for about 98%, 95%, and 90% for the 30-year, 100-year, and 1000-year flood events, respectively.
The level 4 risk areas only account for 0.01% and 0.06% for 30-year and 100-year flood events, respectively. This can influence models on detecting the pattern of minority classes during the training stage, so oversampling and undersampling methods were introduced to balance the data.

Model Training
The process of machine learning includes three stages, i.e., model training, validation, and testing. There were about 260,000 samples in total after preprocessing, and the dataset was divided into training and test sets with a ratio of 8 to 2. Within the training set, 10% of the data were used for validation. In the training stage, each of the four algorithms was trained for assessing the 30-year, 100-year, and 1000-year flood events; therefore, 12 models were built. The NB and perceptron models were constructed using the scikit-learn tool [26], whereas the ANN and CNN models were built in the Keras and TensorFlow framework [42].
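The 8:2 train/test split, with a further 10% of the training data held out for validation, can be sketched as follows (an illustrative numpy version; the seed is arbitrary, and the study's actual pipeline used library utilities):

```python
import numpy as np

def split_indices(n, test_ratio=0.2, val_ratio=0.1, seed=42):
    """Shuffle sample indices, hold out test_ratio for testing, then
    reserve val_ratio of the remaining training samples for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(n * test_ratio)
    test, rest = idx[:n_test], idx[n_test:]
    n_val = int(len(rest) * val_ratio)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

With roughly 260,000 samples, this yields about 52,000 test samples, 20,800 validation samples, and 187,200 training samples.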
For ANNs and CNNs, we tested several models with different structures and compared them to choose the model that performs best in enhancing and recognizing the features. For all network structures, the Adam optimizer was used for gradient descent, with a learning rate of 0.0001. The loss function was categorical cross-entropy. The activation function was ReLU in the dense layers for ANNs and in the convolutional layers for CNNs, and SoftMax in the output layer for both. For the ANNs, 3-layer, 6-layer, 9-layer, and 12-layer models were tested; for the CNNs, 3-layer, 4-layer, 5-layer, and 6-layer models were tested. Sixty epochs were run for each model for all flood events. Based on the accuracy results (Figure 3), the best-performing structures, including the 6-layer ANN model, were selected. The validation stage happens during training, with the validation set checking the performance of the models at the end of each epoch. In the testing stage, the test set was used for testing the models, and the discussion is based on the results obtained from the test set.

Results
The accuracy of the machine learning algorithms on the flood events is shown in Table 3. The ANN and CNN models achieve higher accuracy than naïve Bayes and perceptron on all flood events. ANNs show higher accuracy on the 30-year and 100-year flood events than CNNs. However, higher accuracy does not necessarily mean an algorithm is better. The majority of Exeter is no-risk area, and ANNs partially sacrificed precision (i.e., underestimated the high-risk areas) to reach higher accuracy, even after applying oversampling techniques.

Table 3. Accuracy of models for the assessment of 30-year, 100-year, and 1000-year flood events in Exeter.

Underestimating flood risks can generally cause more serious consequences than overestimating them, so we focused more on precision than accuracy and chose a value of 0.5 for β. Table 4 shows the F-beta scores for all flood events. ANNs perform the best on the 30-year and 100-year events because the pattern of flood risks is relatively simple to recognize. However, the pattern of flood risks of the 1000-year event is more complex than those of the 30-year and 100-year events, because the number of samples in high-risk areas is larger (Table 2) and the entropy becomes larger. Since CNNs can detect low-level features, they perform well in recognizing such complex patterns. This is the reason that CNNs outperform ANNs on the 1000-year flood event, and also the reason that the performance of all models on the 30-year flood event is better than on the 100-year and 1000-year flood events.

Table 4. F-beta score of models for the assessment of 30-year, 100-year, and 1000-year flood events in Exeter (β = 0.5).

Figure 4 demonstrates the ROC curves and AUC for the flood events. CNNs perform the best on the 100-year and 1000-year flood events. This ranking for the 100-year flood event is slightly different from the ranking by F-beta score.
The two criteria both focus on recall, but differ slightly in detail. ROC curves and AUC give importance to specificity (i.e., the true negative rate), whereas the F-beta score gives importance to precision. This means that, when applied to real-world cases, a model with a higher AUC is more conservative and avoids misclassifying, whereas a model with a higher F-beta score is more active and endeavors to find all positive samples. This explains the different ranks by F-beta score and AUC. In addition, with the increase of the return period, it becomes more difficult to recognize all flood patterns due to the various factors and processes involved in the larger urban areas affected (Table 2). However, the performance of CNNs becomes better than that of ANNs, illustrating the increased capacity of CNNs in detecting complex flood patterns.

The main limitation of the study is that the prediction of level 1 risk areas is not accurate enough. All models tend to classify level 1 risk areas as non-risk areas in central Exeter, and as level 2 risk areas in the southwest. By comparing the values of the input features, two reasons were found. The first is that the resolution is not high enough, which makes the attributes of a grid cell ambiguous. For example, a 10 m grid cell can only represent one attribute in the raster, but it may contain both a building and a road in the vector map. The second reason is that, in the central area, the feature values of a level 1 risk grid cell are nearly the same as those of its neighbors, so the algorithms cannot recognize the pattern. This means that additional features that help recognize the pattern may exist.

Future work could focus on two aspects. To begin with, the study area could be divided into 1 m grids to improve the models, if higher-resolution data can be acquired for all layers. Moreover, more features could be introduced for training. We can manually select additional features with geographical knowledge. As an alternative, 2D CNNs could be used, because 2D CNNs perform well on detecting low-level features, and Landsat images contain more low-level features than 1D data.

Conclusions
This study applied machine learning algorithms to assess surface water flood risks in urban areas. We constructed NB, perceptron, ANN, and CNN models to assess the 30-year, 100-year, and 1000-year flood events. ANNs and CNNs are more complex algorithms and are good at detecting low-level features in non-linear problems such as flood risk assessment. The results show that ANNs and CNNs perform better than NB and perceptron. Moreover, as the affected areas become larger, CNNs perform better than the other algorithms. In addition, flood risk maps can be generated from the predicted results. By comparing predicted risk maps and true risk maps, we find that the models can learn the general patterns of floods, and the predictions are similar to the real-world case. The main limitation is the lower accuracy of prediction of level 1 flood risk in the central and southeast areas. Improvements can be made in two aspects in future: input data in higher resolution may improve the performance of the models, and 2D CNNs could be introduced as an alternative.
The goal of flood risk management is to reduce flood risk, but as the first step, this study focuses on the assessment of flood risks using machine learning algorithms. The flood risks used for training did not consider the impact of flood prevention measures. The machine learning models can easily be retrained with updated flood risks when flood prevention measures are considered. In future, various flood prevention measures will be considered in hydrodynamic models to assess their impact in practice.
This study reveals that machine learning models can use static data to assess the flood risk for a storm event of a specific return period without rainfall data. They are generic models that can assess the flood risks of similar areas without generating additional models for the same type of flood event. Therefore, CNNs can be used for rapid flood mapping. This study shows that machine learning is a useful tool for flood management, as it can help identify priority areas for risk reduction.