Flash Flood Risk Analysis Based on Machine Learning Techniques in the

Abstract: Flash flood, one of the most devastating weather-related hazards in the world, has become increasingly frequent in past decades. To support flood mitigation, it is necessary to understand the distribution of flash flood risk. In this study, a machine learning method (least squares support vector machine, LSSVM) and a classical statistical method (logistic regression, LR) are used to assess flash flood risk in Yunnan Province based on historical flash flood records and 13 meteorological, topographical, hydrological and anthropological factors. Results indicate that: (1) the LSSVM with the radial basis function (RBF) kernel performs best (Accuracy = 0.79) and LR performs worst (Accuracy = 0.75) in testing; (2) the flash flood risk distribution identified by the LSSVM in Yunnan Province is close to a normal distribution; (3) the high-risk areas are mainly concentrated in the central and southeastern regions, where the curve number is large; and (4) the factors contributing to the flash flood risk map, ranked from highest to lowest, are: Curve number > Digital elevation > Slope > River density > Flash flood prevention > Topographic Wetness Index > annual maximum 24 h precipitation > annual maximum 3 h precipitation.


Introduction
Flash flood is one of the most devastating natural disasters, characterized by high-velocity runoff, short lead time and fast-rising water [1]. Economic losses caused by flash floods increase year by year as population and infrastructure grow in flood-prone areas [2]. For instance, a total of 28,826 flash flood events occurred in the United States between 2007 and 2015, and 10% of them resulted in damages exceeding $100,000 [3]. According to the China Floods and Droughts Disasters Bulletin of 2015, an average of 935 people died each year in flash flood disasters from 2000 to 2015. Owing to climate change, flash flood risk is predicted to increase with more frequent extreme precipitation and sea level rise [4]. Therefore, accurate risk assessment is critical for flash flood prevention.
Flash flood risk is a combination of the flood hazard and the vulnerability of an area [5,6]. Flood risk is widely assessed by hydrological models or by data-driven models based on historical flood inventories.
[...] of which more than 50% is krasnozem. The climate is a low-mountain monsoon climate, mainly affected by atmospheric circulation. The annual average precipitation is 1102 mm, with significant spatial and temporal differences [18]. Meanwhile, extreme weather events occur frequently, especially during the summer flood season (June to September), and rainfall between May and October accounts for 85-95% of the annual total.
China has been building non-structural measures for flash flood prevention since 2011. In Yunnan Province, 206 flash flood events occurred from 2011 to 2015, causing 237 deaths. In 2014 and 2015 in particular, deaths in Yunnan accounted for 22.2% and 8.1% of the national total, respectively, placing the province among the areas most affected by flash floods. To defend against flash floods, Yunnan has been constructing non-structural flood prevention measures covering 129 counties since 2010. The average construction fund is $0.87 million per county. The preventive measures implemented include densifying the network of automatic rainfall stations to improve the quality of monitoring data, installing simple rainfall gauges with alarms, and building an alarm system consisting of radio broadcasts and simple alarm devices. Although Yunnan Province already has this defense base, it still suffers from severe flash flood disasters. Therefore, it is of great significance to study the flash flood risk in Yunnan Province. Figure 1 shows the historical flash floods in Yunnan Province from 2011 to 2015. Flash floods mainly occur on lower slopes: air rises on the windward slope and the water vapor condenses easily to form precipitation, so runoff accumulates in the valley and triggers flash floods. Precipitation forms less readily on the leeward slope, where the air sinks [19].

Data
The flash flood records are mainly from official departments, such as the Ministry of Water Resources (MWR), the Ministry of Land and Resources and some local government agencies in Yunnan Province. These data are divided into training and testing datasets: 70% of the records are randomly selected for training and the remaining 30% are used for testing. The split is made so that the samples in each set are evenly distributed and representative (Figure 1). It is important to emphasize that all the flash floods studied in this paper involve deaths or missing persons; incidents that caused no casualties are not considered. The remote sensing data and other data covered in this paper are shown in Table 1.
Remote Sens. 2019, 11, 170
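The random 70/30 split described above can be sketched as follows. This is a minimal illustration only: the function name, the fixed seed and the sample count of 100 are assumptions, not values taken from the study.

```python
import numpy as np

def split_train_test(n_samples, train_fraction=0.7, seed=0):
    """Return shuffled index arrays for a random train/test split.

    Hypothetical helper: the seed and the function name are illustrative.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)            # random order of sample indices
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

# Illustrative sample count (not the number of records used in the paper).
train_idx, test_idx = split_train_test(100)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same testing set.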

Flash Flood Triggering Factors
Flash flood disasters are mainly affected by meteorological, topographical, hydrological and anthropological factors. The related factors affecting flash flood risk are shown in Figure 2 and are described as follows:

(2) Topographical factors
The digital elevation model (DEM) is retrieved from NASA SRTM as a 90-m raster for the year 2000. DEM resolution mainly affects the representation of watershed topography, which in turn affects the accuracy of runoff generation and convergence. The higher the DEM resolution, the higher the accuracy of the extracted watershed features. However, a high-resolution DEM increases the computational burden and greatly lengthens the runtime of the model [20]. Slope (SL) is the ratio of vertical rise to horizontal distance and is suitable for the sensitivity analysis of floods. Generally, SL is calculated from the DEM data using the ArcGIS tool [17]. River density (RD) is derived from China's basic vector dataset and relates the length of the rivers in a grid cell to the area of that cell [21]. Vegetation coverage (VC) is calculated as the multi-year average of the normalized difference vegetation index (NDVI) based on MODIS images. It represents vegetation distribution and biomass levels from 2011 to 2015 [22].
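As an illustration of how slope follows from the DEM, the sketch below computes the rise-over-run ratio with simple finite differences. It is an assumption-laden stand-in: the study uses the ArcGIS slope tool, which applies a 3 x 3 neighbourhood method rather than `np.gradient`, and the function name and the synthetic DEM are hypothetical.

```python
import numpy as np

def slope_ratio(dem, cell_size):
    """Slope as vertical rise over horizontal distance for each DEM cell.

    dem: 2-D array of elevations (m); cell_size: grid spacing (m).
    Uses central differences (one-sided at the edges), not the ArcGIS algorithm.
    """
    dz_dy, dz_dx = np.gradient(dem, cell_size)    # derivatives along rows, then columns
    return np.hypot(dz_dx, dz_dy)                 # magnitude of the elevation gradient

# Synthetic 90-m DEM: a plane rising 9 m per cell in the x direction.
cols = np.arange(5)
dem = np.tile(9.0 * cols, (4, 1))
sl = slope_ratio(dem, cell_size=90.0)
slope_degrees = np.degrees(np.arctan(sl))         # optional conversion to degrees
```

For this synthetic plane the ratio is 0.1 everywhere, which corresponds to a slope angle of about 5.7 degrees.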

(3) Hydrological factors
The Curve Number (CN), derived from the Soil Conservation Service curve number (SCS-CN) model, is a comprehensive indicator calculated according to the US National Engineering Handbook. It primarily reflects the potential capacity for runoff generation in each grid cell. It is a non-dimensional index with a theoretical value between 0 (no runoff) and 100 (no infiltration). For details of CN, please refer to Zeng et al. (2017) [23]. The topographic wetness index (TWI), which combines the local upslope contributing area and the slope, is widely used to quantify the topographical control of flood concentration processes and can be calculated from the DEM [24]. Soil moisture (SM) data are from the European Space Agency (ESA) with a spatial resolution of 50 km. They estimate the moisture in the soil surface layer (down to 5 cm), which is important for hydrological modeling. SM governs the non-linear partitioning of precipitation into infiltration and runoff, and thus affects runoff through its effect on infiltration [25].
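To make the role of CN and TWI concrete, the sketch below evaluates the standard SCS-CN runoff equation and the usual TWI definition, ln(a / tan(beta)). These are the textbook forms, assumed here for illustration; the initial abstraction ratio of 0.2, the function names and the input values are not taken from the paper.

```python
import numpy as np

def scs_cn_runoff(p_mm, cn, ia_ratio=0.2):
    """Direct runoff depth Q (mm) from rainfall P (mm) for a curve number in (0, 100].

    Assumes the conventional initial abstraction Ia = 0.2 * S.
    """
    s = 25400.0 / cn - 254.0                      # potential maximum retention (mm)
    ia = ia_ratio * s                             # initial abstraction (mm)
    p = np.asarray(p_mm, dtype=float)
    # Runoff starts only once rainfall exceeds the initial abstraction.
    return np.where(p > ia, (p - ia) ** 2 / (p - ia + s), 0.0)

def topographic_wetness_index(upslope_area_per_width, slope_rad):
    """TWI = ln(a / tan(beta)); a is specific catchment area (m), beta is slope (rad)."""
    return np.log(upslope_area_per_width / np.tan(slope_rad))

# Illustrative values only.
q_high_cn = scs_cn_runoff(100.0, cn=80.0)         # about 50.5 mm of runoff
q_low_cn = scs_cn_runoff(100.0, cn=50.0)          # much less runoff for the same rain
twi = topographic_wetness_index(100.0, np.arctan(0.1))
```

The comparison between CN = 80 and CN = 50 shows why a larger curve number implies a higher runoff potential for the same rainfall.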

(4) Anthropological factors
The effects of flood risk are often related to human activity and are manifested as economic losses and casualties. The losses generally increase with population growth in flood-prone areas, especially in economically developed and densely populated areas. Therefore, Gross Domestic Product (GDP) and population (Pop) are selected as anthropological factors for flash flood assessment. GDP is defined as "an aggregate measure of production equal to the sum of the gross values added of all resident and institutional units engaged in production (plus any taxes and minus any subsidies, on products not included in the value of their outputs)", and mainly reflects the economic situation of the study area. Moreover, GDP is an aggregate indicator that combines indicators describing various aspects of the national economy through a series of scientific principles and methods. Therefore, GDP also captures contributing indicators such as over-exploitation [26]. The 1-km gridded GDP and population of Yunnan Province are collected from the Data Center for Resources and Environmental Sciences, Chinese Academy of Sciences (RESDC).
Flash flood prevention (FFP) is also considered. In 2010, the Chinese government initiated the construction of national-level non-structural measures for flash flood prevention. This investment is the largest non-structural project in China, involving a total area of 3.86 million km² in 29 provinces (autonomous regions and municipalities). The preventive measures include the national flash flood investigation and evaluation, the construction of monitoring and early warning platforms, automatic rainfall stations and water level stations, mass observation and mass prevention, and so forth. The FFP data are mainly from the MWR and local governments and use the investment funds to comprehensively reflect the flash flood prevention situation [27,28]. The related factors affecting flash flood risk in the LSSVM method are shown in Figure 3.

Methodology
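(1) LSSVM
LSSVM reduces the complexity of the optimization process by solving a set of linear equations, and the constrained optimization problem can be solved using Lagrange multipliers. For a given training set $\{x_i, y_i\}$, $i = 1, 2, \ldots, f$, with input data $x_i$ and output data $y_i$, the LSSVM problem can be written as follows:
$$\min_{m, b, n} J(m, n) = \frac{1}{2} m^{T} m + \frac{\beta}{2} \sum_{i=1}^{f} n_i^{2} \tag{1}$$
subject to
$$y_i = m^{T} \Phi(x_i) + b + n_i, \quad i = 1, 2, \ldots, f \tag{2}$$
where $m$ is the weight vector, $\beta$ is the penalty parameter, $n_i$ is the approximation error, $f$ is the number of training samples, $\Phi(x_i)$ is the nonlinear mapping function and $b$ is the bias term. The corresponding Lagrange function is given by Equation (3):
$$L(m, b, n, \alpha) = J(m, n) - \sum_{i=1}^{f} \alpha_i \left[ m^{T} \Phi(x_i) + b + n_i - y_i \right] \tag{3}$$
where $\alpha_i$ is the Lagrange multiplier. Using the Karush-Kuhn-Tucker (KKT) conditions, the solutions can be obtained by partially differentiating with respect to $m$, $b$, $n_i$ and $\alpha_i$:
$$\frac{\partial L}{\partial m} = 0 \Rightarrow m = \sum_{i=1}^{f} \alpha_i \Phi(x_i), \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{f} \alpha_i = 0,$$
$$\frac{\partial L}{\partial n_i} = 0 \Rightarrow \alpha_i = \beta n_i, \qquad \frac{\partial L}{\partial \alpha_i} = 0 \Rightarrow m^{T} \Phi(x_i) + b + n_i - y_i = 0,$$
for $i = 1, 2, \ldots, f$. Therefore, the LSSVM for regression can be obtained from Equation (6):
$$y(x) = \sum_{i=1}^{f} \alpha_i K(x, x_i) + b \tag{6}$$
where $K(x, x_i) = \Phi(x)^{T} \Phi(x_i)$ is the kernel function. For LSSVM, there are many kernel functions, including linear (LN, Equation (7)), polynomial (PL, Equation (8)), radial basis function (RBF, Equation (9)), sigmoid and so forth. The most widely used kernel functions are the RBF and polynomial kernels.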
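Linear (LN) kernel:
$$K(x, x_i) = x_i^{T} x \tag{7}$$
Polynomial (PL) kernel:
$$K(x, x_i) = \left( \gamma x_i^{T} x + \tau \right)^{d}, \quad \gamma > 0 \tag{8}$$
Radial basis function (RBF) kernel:
$$K(x, x_i) = \exp\left( -\gamma \lVert x - x_i \rVert^{2} \right), \quad \gamma > 0 \tag{9}$$
where $\gamma$, $\tau$ and $d$ are kernel parameters. The Matlab toolbox LSSVMLab is used to implement LSSVM in this study. The parameters of LSSVM are automatically calibrated during training with the 10-fold cross-validation method. More details regarding the principles and application of LSSVM can be found in the LSSVMLab Toolbox User's Guide [29,30].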
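The LSSVM described above reduces training to a single linear solve. The NumPy sketch below shows that step for an RBF kernel; it is not the LSSVMLab implementation used in the study, and the function names, the values of beta and gamma and the toy data are all assumptions. Eliminating m and n_i from the KKT conditions gives the bordered system [[0, 1^T], [1, K + I/beta]] [b; alpha] = [0; y], which the code solves directly.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF kernel matrix exp(-gamma * ||a_i - b_j||^2) between rows of a and b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negative round-off

def lssvm_fit(x, y, beta=10.0, gamma=1.0):
    """Solve the LSSVM linear system; returns the bias b and multipliers alpha."""
    f = x.shape[0]
    system = np.zeros((f + 1, f + 1))
    system[0, 1:] = 1.0                           # sum(alpha) = 0
    system[1:, 0] = 1.0                           # bias column
    system[1:, 1:] = rbf_kernel(x, x, gamma) + np.eye(f) / beta
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]

def lssvm_predict(x_train, b, alpha, x_new, gamma=1.0):
    """Evaluate y(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(x_new, x_train, gamma) @ alpha + b

# Toy data: label 1 = flash flood, 0 = no flash flood (illustrative only).
x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
b, alpha = lssvm_fit(x, y)
risk = lssvm_predict(x, b, alpha, x)
```

In practice beta and gamma would be tuned, for example by the 10-fold cross-validation mentioned in the text, rather than fixed as here.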
(2) LR
LR is a probabilistic statistical classification procedure used to predict a dependent variable from one or more independent variables. It is well suited to cases in which the dependent variable has only two outcomes, that is, occurrence and non-occurrence. The stochastic gradient ascent algorithm is generally used to reduce the periodic fluctuations and the computational complexity of the iterative fitting and thus further optimize the LR model, which can be expressed by the following equation [31]:
$$y = \beta_0 + \sum_{i} \beta_i x_i + e \tag{10}$$
where $y$ is the dependent variable, $x_i$ is the i-th explanatory variable, $\beta_0$ is a constant, $\beta_i$ is the i-th regression coefficient and $e$ is the error. The probability ($p$) of the occurrence of $y$ is
$$p = \frac{1}{1 + \exp(-y)} \tag{11}$$
If the estimated probability is greater than 0.5 (or another user-defined threshold), the object is assigned to the occurrence class; otherwise, it is assigned to the non-occurrence class. In this study, flash flood samples are labeled 1 and non-flood samples are labeled 0, so the output values range from 0 to 1, corresponding to the flash flood sensitivity of the basin from minimum to maximum. The result for each point is therefore a probability between 0 and 1. Equal interval classification is used to categorize the flash flood probability index into five risk zones: lowest (0-0.2), low (0.2-0.4), moderate (0.4-0.6), high (0.6-0.8) and highest (0.8-1).
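The LR procedure and the five equal-interval risk zones can be illustrated with a short sketch. It fits the coefficients by plain stochastic gradient ascent on the log-likelihood and then bins probabilities; the learning rate, number of epochs, seed, function names and toy data are assumptions and do not reproduce the settings of the study.

```python
import numpy as np

ZONES = ["lowest", "low", "moderate", "high", "highest"]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr_sga(x, y, lr=0.1, epochs=200, seed=0):
    """Fit [beta_0, beta_1, ...] by stochastic gradient ascent, one sample per update."""
    rng = np.random.default_rng(seed)
    xb = np.hstack([np.ones((x.shape[0], 1)), x])   # prepend the constant term
    w = np.zeros(xb.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(xb.shape[0]):      # visit samples in random order
            w += lr * (y[i] - sigmoid(xb[i] @ w)) * xb[i]
    return w

def predict_proba(w, x):
    xb = np.hstack([np.ones((x.shape[0], 1)), x])
    return sigmoid(xb @ w)

def risk_zone(p):
    """Map probabilities to the five equal-interval zones (edges 0.2, 0.4, 0.6, 0.8)."""
    return [ZONES[i] for i in np.digitize(p, [0.2, 0.4, 0.6, 0.8])]

# Toy one-factor example: 1 = flash flood, 0 = no flash flood.
x = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_lr_sga(x, y)
zones = risk_zone(predict_proba(w, x))
```

Shuffling the sample order in every epoch is one common way to damp the periodic fluctuations that arise when samples are always visited in the same order.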

(3) Evaluation index
In this study, five indices, namely Precision (P), Recall (R), Accuracy (ACC), Kappa (K) and F-score (F), are used to evaluate the results of the four models. ACC is the proportion of correctly classified cases among all cases in the set, but it does not show how the errors are distributed between the classes. P is the fraction of recognized instances that are relevant, while R is the fraction of relevant instances that are retrieved. The F-score, the harmonic mean of precision and recall, is therefore a better choice. Equations (12)-(15) show how each index is calculated to measure the accuracy of model prediction:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F = \frac{2 P R}{P + R}$$
where TP, FN, TN and FP denote the number of true positives, false negatives, true negatives and false positives, respectively. Cohen's kappa measures the agreement between two or more raters when categorizing items on a measurement scale. The values typically lie between 0 and 1, corresponding to no agreement and perfect agreement, respectively. The Kappa score is calculated by Equation (18):
$$k = \frac{P_p - P_{exp}}{1 - P_{exp}} \tag{18}$$
where $P_p$ is the relative observed agreement among raters and $P_{exp}$ is the hypothetical probability of chance agreement, calculated from the observed data as the probability that each rater randomly assigns each category. If the raters are in complete agreement, then k = 1. If there is no agreement among the raters other than what would be expected by chance (as given by $P_{exp}$), k ≤ 0.
Table 2 shows the model performances in the testing period. The accuracy, precision, recall, F-score and kappa range from 0.75 to 0.79, 0.76 to 0.82, 0.74 to 0.77, 0.75 to 0.79 and 0.5 to 0.59, respectively. All models have relatively high precision. Although there is no significant difference between the three kernel functions of the LSSVM model, they all perform better than the LR method, and model 2 (LSSVM with RBF kernel) performs best.
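The evaluation indices defined in this subsection (precision, recall, accuracy, F-score and kappa) can all be computed from the four confusion-matrix counts. The sketch below assumes the F-score is the balanced F1 form and uses hypothetical counts; they are not the counts behind Table 2.

```python
def classification_indices(tp, fp, fn, tn):
    """Precision, recall, accuracy, F-score and Cohen's kappa from a 2 x 2 confusion matrix."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / n
    f_score = 2.0 * precision * recall / (precision + recall)
    p_obs = accuracy                               # observed agreement
    # Chance agreement: product of the marginal frequencies, summed over both classes.
    p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_obs - p_exp) / (1.0 - p_exp)
    return {"P": precision, "R": recall, "ACC": accuracy, "F": f_score, "K": kappa}

# Hypothetical counts for illustration only.
scores = classification_indices(tp=40, fp=10, fn=12, tn=38)
print(scores)
```

With balanced classes, as in this example, chance agreement is 0.5, so kappa rescales an accuracy of 0.78 to 0.56.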
Receiver Operating Characteristic (ROC) curves, created by plotting the TP rate against the FP rate, are graphical tools for analyzing classification performance over the entire class distribution. The Area Under Curve (AUC) is the area under the ROC curve and usually lies between 0.5 and 1. An AUC of 0.5 corresponds to random classification and an AUC of 1 to perfect classification. Figure 4 shows that all four models obtain good AUC results: the LSSVM with the RBF kernel has the highest AUC (0.81), followed by LSSVM + LN (0.80) and LSSVM + PL (0.80), while the classic LR model (0.78) is relatively poor.
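The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below uses that pairwise formulation; the function name and the toy labels and scores are illustrative assumptions.

```python
import numpy as np

def auc_from_scores(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]             # every positive minus every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy example: one of the four positive/negative pairs is ranked in the wrong order.
auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

The pairwise form needs memory proportional to the number of positive samples times the number of negative samples, so a rank-based formula is preferable for large rasters.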

Flash Flood Risk Map Comparison
Based on the LR model and the LSSVM model with the three kernels (LN, RBF and PL), the flood risk maps of Yunnan Province are generated in a GIS environment. As shown in Figure 5, the high-risk areas are mainly concentrated in the south-central region, accounting for 32% of the total area. Although LSSVM is not significantly better than LR in training and testing, the resulting risk distributions are significantly different. Figure 6 shows that the flash flood risk obtained by LSSVM approximately follows a normal distribution, which is consistent with previous studies in Yunnan Province, China [32,33], whereas the risk obtained by LR follows a uniform distribution. Therefore, the flood risk maps obtained by LSSVM are considered more reliable than those obtained by LR.

Many studies have used statistical methods to conduct flash flood risk assessments in other areas. For example, Smith (2010) proposed the Flash Flood Potential Index (FFPI) model, which considers slope, land use, soil texture and so forth. FFPI values from 1 to 10 correspond to risk probabilities from minimum to maximum, and the index has been tested in central Iowa, Colorado, upstate New York and Pennsylvania [34,35]. Based on the AHP and information entropy theory, Zeng et al. (2016) selected relevant indicators (e.g., soil, slope, rainfall and flood control measures), used an expert scoring method to determine their weights and finally obtained the risk map of Yunnan Province [18]. In this study, the LSSVM method is used for flash flood risk assessment for the first time. LSSVM can assess flood risk directly without setting factor weights, which is a notable advantage. The contribution of each factor to flood risk is assessed by the correlation coefficient between the factor and the flood risk. Figure 7 shows the correlation coefficient of each factor with the flash flood risk from LSSVM-RBF.
The greater the correlation coefficient, the greater the impact of the indicator on flash flood risk. The correlation coefficient of CN is the largest, exceeding 0.5, followed by 7 indicators (DEM, SL, RD, FFP, TWI, 24-H-P, 3-H-P) with coefficients between 0.1 and 0.5, while the remaining 5 indicators (AP, POP, SM, GDP, VC) have coefficients below 0.1. As discussed above, CN describes the runoff generation capacity. DEM mainly reflects the topography of the study area, and SL, RD and TWI are all derived from DEM. Therefore, the flash flood risk of Yunnan Province is mainly affected by local runoff capacity and topography. Meanwhile, the correlation coefficient of FFP is 0.3, reflecting that active man-made measures can largely prevent the occurrence of flash floods. However, compared with the topographical factors, the precipitation factors show a relatively low correlation with the flash flood risk. This is mainly because flash floods are caused by intensive rainfall, but casualties usually occur and are reported in low-lying areas. In addition, the effects of short-term precipitation (e.g., 24-H-P, 3-H-P) are greater than that of annual precipitation. The proposed model can take all flash flood explanatory factors into account and give an accurate assessment of flash flood risk. In the future, we will further incorporate water depth and flow as more reasonable indicators for flood assessment.
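The factor ranking discussed above rests on the correlation coefficient between each factor and the modeled risk. The sketch below shows one way to compute and rank such coefficients, assuming the Pearson coefficient (the text does not name the type of correlation) and using made-up factor names and values.

```python
import numpy as np

def factor_correlations(factors, risk):
    """Pearson correlation of each factor with the risk values, ranked by magnitude.

    factors: dict mapping a factor name to a 1-D array of per-cell values.
    risk: 1-D array of modeled risk for the same cells.
    """
    corr = {name: float(np.corrcoef(values, risk)[0, 1])
            for name, values in factors.items()}
    # Sort by absolute value so strong negative correlations also rank high.
    return sorted(corr.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical values for five grid cells (not data from the study).
risk = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
factors = {
    "factor_a": 2.0 * risk + 1.0,                  # perfectly correlated with risk
    "factor_b": np.array([5.0, 1.0, 4.0, 2.0, 3.0]),
}
ranking = factor_correlations(factors, risk)
```

A correlation coefficient measures linear association only, so a factor with a strongly non-linear influence on the modeled risk can still receive a low value.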

Conclusions
Flash floods have brought huge economic losses and casualties to China. An accurate flash flood risk assessment can identify flood-prone areas and give people enough time to prepare for flood disasters in advance. In this study, LSSVM was selected to assess flash flood risk based on 13 explanatory factors. The main conclusions are as follows: (1) LSSVM can provide a more accurate risk assessment than LR, and LSSVM with the RBF kernel performs best.
In conclusion, this paper uses the LSSVM method to assess flash flood risk for the first time and verifies that LSSVM with the RBF kernel is suitable for assessing flash flood risk at large or medium scales. The method relies primarily on explanatory factors and local flood records, and the explanatory factors are mainly derived from public datasets (remote sensing images and statistical bulletins) that can easily be obtained for other areas. Thus, the method can be applied in other regions once local historical flood inventories are collected. However, the method is highly dependent on data and lacks an explicit physical mechanism. Some problems, such as the shortage and uncertainty of flood inventories, limit the accuracy of the model results. In particular, the historical flood records in this study were obtained through investigations by the authorities of Yunnan Province, which limits the application of the research results to other regions. With the development of data mining technology, historical flood records from websites or media could be used for model development in future work, especially for data-sparse areas.
Author Contributions: All of the authors contributed to the conception and development of this manuscript. M.M. and G.Z. carried out the analysis and wrote the paper. C.L. designed the system framework and developed the project implementation plan. P.J. collected data and drew the study area map. D.W. participated in the results analysis. H.X., H.W. and Y.H. proposed many useful suggestions to improve its quality.
Funding: This research was funded by the following projects: Application of remote sensing on water and soil conservation in Beijing and its demonstration (grant number Z161100001116102); Key technology on dynamic warning of flash flood in Henan Province (China) and its application (grant number HNSW-SHZH-2015-06); Study on infiltration mechanisms of special underlying surface in coalmine goal in Shanxi Province (China) and application of runoff generation and concentration theory (grant number ZNGZ2015-008_2); Research on spatial-temporal variable source runoff model and its mechanism (grant number JZ0145B2017); and the National Natural Science Foundation of China (NSFC) General Projects (grant number 41471430).