1. Introduction
Flooding is a natural hazard caused by excessive rainfall, either as brief, intense events (flash floods) or prolonged precipitation, resulting in water bodies exceeding their capacity or surface water accumulation [1]. The ensuing disaster chain includes soil erosion, structural collapse, and transportation breakdowns, posing a direct risk to life, property, and socioeconomic stability [2]. Therefore, systematically conducting flood risk assessments and developing targeted countermeasures has become an urgent need for emergency management and spatial planning [3,4].
As a highly urbanized area, Shenzhen faces severe threats from flooding triggered by short-duration intense rainfall and prolonged precipitation. Such events often result in road inundation, neighborhood-scale urban waterlogging, and secondary disasters, causing substantial economic losses [5]. For example, from September 7 to 8, 2023, Shenzhen experienced an extremely rare record-breaking rainstorm [6], with its intensity, duration, and affected area all exceeding typical conditions. The event caused direct economic losses of RMB 1.26 billion citywide, along with more than 700 road waterlogging sites and 19,560 flooded vehicles; traffic on some road sections was disrupted for over 24 h due to inundation. Studies indicate that high-risk flood areas in Shenzhen are mainly located in western Bao’an District, Futian District, and western Luohu District, and that urban waterlogging exhibits a spatial pattern of “more severe in the west and lighter in the east.” In response, Shenzhen has set clear flood-control and drainage targets: by the end of 2025, the flood protection capacity in urban areas is to reach the 1-in-200-year standard; by 2035, the entire city (including the Shenzhen–Shanwei Special Cooperation Zone) is to reach the 1-in-200-year standard, while the western coastal dikes are to meet a 1-in-1000-year storm-surge protection standard. Therefore, accurately assessing flood susceptibility in Shenzhen is of great significance [7]. Achieving these targets is a critical component of the city’s sustainable infrastructure strategy, as it reduces recurrent disaster-related losses and enhances long-term resilience.
Flood susceptibility assessment is a critical component of flood disaster risk assessment modeling. Unlike flood risk, which integrates exposure and vulnerability, susceptibility focuses specifically on the likelihood of flooding occurring in a given area based on its physical and environmental conditions. Depending on whether training data are used [8], assessment methods can be broadly classified into two categories: knowledge-driven and data-driven. The former relies on expert judgment to score flood-influencing factors and assign weights (e.g., indicator-system approaches) [9]. While such methods benefit from domain expertise, they can be time-consuming and may involve subjective decisions. Data-driven approaches, such as machine learning, complement this by leveraging training data to capture complex, nonlinear relationships between flood events and their drivers [10]. Rather than replacing expert knowledge, these methods provide quantitative predictive capabilities that can support and inform expert decision-making. The integration of both paradigms thus enables more robust flood susceptibility modeling and offers scientific support for targeted flood prevention and mitigation efforts.
In recent years, machine learning has gained considerable traction in predicting flood susceptibility. Models including logistic regression, random forests, support vector machines, and neural networks have consistently demonstrated robust predictive capability [11,12]. By examining sample data, machine learning builds predictive models that capture the underlying relationships among flood-related factors. This approach does not require a complete understanding of flood formation mechanisms; instead, it can predict flood occurrence through in-depth analysis of data on influencing factors. However, its predictive accuracy depends mainly on algorithm performance and on the scale and quality of the data [13].
Geographic Information Systems (GIS) offer both strong database management capabilities (enabling the construction of machine-learning training datasets) and flood-information visualization functions, allowing flood-prone areas to be displayed intuitively and facilitating flood susceptibility assessment. Integrating the two makes it possible to produce flood susceptibility maps, which are crucial for accurately identifying high-risk areas and providing professional recommendations for flood risk reduction and scientific management. However, most existing studies do not describe in detail the integration workflow between GIS and machine learning or the key information involved, which limits the wider adoption and application of this technique.
Flood-influencing factors are key inputs to machine learning models and directly affect training performance [13]. Most existing flood susceptibility studies use nine parameters: elevation, slope, aspect, curvature, distance to rivers, land use, topographic wetness index (TWI), stream power index (SPI), and annual mean rainfall. Whether all of these parameters truly influence flooding, however, has not yet been examined in previous research [14].
In summary, although machine learning has been widely applied in flood susceptibility assessment, existing studies still have three main limitations [15]. First, most research focuses on performance comparisons among single models [16], while providing insufficient discussion of how ensemble learning improves generalization ability and stability through model complementarity. Second, models are often used as “black boxes,” lacking refined mechanistic interpretations of the contribution ranking and interactions among influencing factors [17]. Third, assessment results are not sufficiently integrated with region-specific urbanization processes and underlying surface characteristics, and the policy implications remain insufficiently targeted [18]. Therefore, this study aims to develop a comprehensive flood susceptibility assessment framework integrating “multi-model coupling–ensemble optimization–mechanism interpretation,” and to validate it using Shenzhen as a representative case study [19].
While recent machine learning applications have advanced flood susceptibility modeling, most existing studies are limited to single-model comparisons, lack interpretability, and provide insufficiently detailed integration workflows between GIS and machine learning. To address these gaps, this study introduces a comprehensive framework that couples ensemble learning, interpretable AI, and spatially explicit mapping, with three distinctive innovations: (1) Ensemble voting strategy: instead of relying on a single classifier, we integrate five heterogeneous models (Decision Tree, SVM, Logistic Regression, Naïve Bayes, and LDA) through a majority voting mechanism. This approach mitigates individual model bias, enhances prediction stability, and improves generalization performance, and it is rarely implemented systematically in previous flood susceptibility studies [20]. (2) Transparent and replicable GIS–ML workflow: we provide a fully articulated technical pipeline, from multi-source data preprocessing (in ArcGIS 10.7), sample partitioning, and factor extraction to model training and spatial interpolation. This addresses the common “black-box” criticism in GIS–machine learning integration and significantly enhances the operational reproducibility of the proposed method. (3) Policy-oriented spatial outputs: the resulting flood susceptibility maps are interpreted in direct alignment with Shenzhen’s urban flood control targets (e.g., the 1-in-200-year standard by 2025). This ensures that the research outputs are not only methodologically robust but also practically actionable for local drainage planning and disaster mitigation.
Together, these innovations establish a reproducible “integration–evaluation–interpretation” paradigm that advances both the methodological rigor and practical relevance of urban flood susceptibility assessment.
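The majority voting mechanism named in innovation (1) can be illustrated with a minimal sketch. It assumes hard voting over binary labels (1 = flood-prone, 0 = not flood-prone); the function names and the example votes are hypothetical and are not taken from the study's implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the hard-voting label for one location.

    `predictions` holds one 0/1 label per base model
    (1 = flood-prone, 0 = not flood-prone).
    """
    counts = Counter(predictions)
    # With an odd number of binary voters a tie cannot occur.
    return counts.most_common(1)[0][0]


def vote_share(predictions):
    """Fraction of base models that voted for the flood-prone class."""
    return sum(predictions) / len(predictions)


# Hypothetical votes from five base models at three sites
# (order: decision tree, SVM, logistic regression, naive Bayes, LDA).
site_votes = [
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]

ensemble_labels = [majority_vote(v) for v in site_votes]
ensemble_shares = [vote_share(v) for v in site_votes]
print(ensemble_labels)  # one 0/1 label per site
print(ensemble_shares)  # share of models voting "flood-prone" per site
```

With five voters, a single dissenting model cannot flip the ensemble label, which is one way to read the claim that voting mitigates individual model bias.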
4. Results and Discussion
4.1. Confusion Matrix
A confusion matrix serves as an essential instrument for assessing classification model performance. Figure 8 displays the confusion matrices for the six models (five individual classifiers and the voting ensemble). Figure 9 presents the ROC curves and corresponding performance metrics. As evident in Figure 8, the confusion matrix for the voting ensemble algorithm is distinctly superior to those of the other models regarding classification effectiveness, demonstrated by its elevated accuracy and recall rates. These metrics signify the model’s enhanced capacity to differentiate between flood and non-flood samples [37].
For a more systematic and quantitative assessment of each model’s overall effectiveness, this study subsequently examined their outcomes against essential metrics: Accuracy, Recall, Precision, the F1-score, and the Receiver Operating Characteristic (ROC) curve. These metrics characterize the robustness of model classification from different perspectives: accuracy reflects the overall rate of correct classification; recall measures the completeness of identifying positive samples (flood events); precision indicates the reliability of the predictions; the F1-score balances precision and recall; and the ROC curve and its corresponding Area Under the Curve (AUC) provide an intuitive view of a model’s generalization ability and resistance to noise under different thresholds. Through a combined multi-metric analysis, performance differences among the models can be more comprehensively revealed, providing a scientific basis for selecting the optimal model [38].
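The relationship between a confusion matrix and the metrics listed above can be made concrete with a short sketch. The counts below are hypothetical and are not the values shown in Figure 8; the function name is likewise illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp: flood samples predicted as flood
    fp: non-flood samples predicted as flood
    fn: flood samples predicted as non-flood
    tn: non-flood samples predicted as non-flood
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a balanced test set of 200 samples.
metrics = classification_metrics(tp=90, fp=30, fn=10, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example recall (0.900) exceeds precision (0.750), the kind of profile that favours early warning use, where missed flood events are costlier than false alarms.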
Beyond the aggregate improvements in accuracy and recall, a closer examination of the confusion matrices reveals how the voting ensemble modifies the error structure compared with individual models. The ensemble reduces both false positives (non-flood locations misclassified as flood) and false negatives (flood locations missed), but the reduction is more pronounced for false negatives. This indicates that the ensemble is particularly effective at improving the detection of actual flood events—a critical attribute for early warning and disaster response applications.
However, certain types of locations remain challenging for the ensemble. False positives tend to occur in areas characterized by moderate slopes and intermediate distances to rivers, where the interplay of topographic and hydrological factors creates ambiguous conditions that individual models interpret differently. False negatives, though fewer, are more common in the western coastal lowlands of Shenzhen (e.g., Bao’an and Nanshan districts), where rapid urbanization has introduced complex drainage infrastructure that is not fully captured by the current set of influencing factors. These persistent error patterns suggest that while ensemble voting successfully mitigates the biases of individual models, further improvements may require incorporating additional variables—such as drainage network density, soil infiltration capacity, or fine-scale urban morphology—to resolve the remaining ambiguities.
Understanding these error characteristics provides practical guidance for flood management. The ensemble’s high recall makes it well-suited for early warning applications where minimizing missed flood events is paramount. At the same time, the spatial patterns of false positives can inform targeted field investigations, helping to verify and refine model predictions in areas where the ensemble remains uncertain.
4.2. ROC Curve
The area under the curve (AUC) is a key indicator for measuring the overall performance of a classification model, and its value intuitively reflects the model’s ability to distinguish between positive and negative samples (i.e., flood and non-flood samples). In theory, an AUC of 1 indicates that the model can perfectly separate all flood samples from non-flood samples, achieving error-free prediction [39]; an AUC of 0.5 indicates that the model performs no better than random guessing and therefore has no practical predictive value [40].
Figure 9 presents the receiver operating characteristic (ROC) curves and corresponding key performance metrics of the six models (five individual classifiers and the voting ensemble). As shown in Figure 9A, on the test dataset, the ensemble voting model achieves the highest AUC value, reaching 0.813, indicating that it performs best in terms of overall classification effectiveness and has the strongest ability to distinguish between flood and non-flood samples. Further analysis shows that the five individual machine learning models exhibit relatively similar performance in flood susceptibility prediction, with only minor differences in AUC values. Overall, the ensemble voting model demonstrates comparatively superior performance across all evaluation metrics, with greater stability and reliability in its predictions, providing more promising technical support for regional flood susceptibility forecasting.
To facilitate direct comparison of model performance, Table 5 summarizes the key evaluation metrics (Accuracy, Precision, Recall, F1-score, and AUC) for each of the five individual models and the voting ensemble. The results are consistent with the confusion matrices (Figure 8) and ROC curves (Figure 9), further confirming the superior performance of the ensemble approach.
The area under the ROC curve (AUC) provides a global measure of a model’s ability to discriminate between flood and non-flood samples, independent of any specific classification threshold. An AUC of 0.813 achieved by the voting ensemble indicates that for any randomly selected pair of flood and non-flood locations, the model will correctly rank the flood location as having higher susceptibility approximately 81.3% of the time. This level of discrimination is considered “good” in the context of flood susceptibility modeling, where values above 0.8 generally indicate reliable predictive performance.
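The pairwise-ranking reading of AUC described above can be sketched directly: count, over all flood/non-flood pairs, how often the flood location receives the higher score (ties count as half). The scores below are hypothetical and the function name is illustrative.

```python
def pairwise_auc(flood_scores, non_flood_scores):
    """AUC as the fraction of (flood, non-flood) pairs ranked correctly.

    A pair is ranked correctly when the flood location has the higher
    predicted susceptibility; ties contribute half a correct ranking.
    """
    correct = 0.0
    for f in flood_scores:
        for n in non_flood_scores:
            if f > n:
                correct += 1.0
            elif f == n:
                correct += 0.5
    return correct / (len(flood_scores) * len(non_flood_scores))


# Hypothetical predicted susceptibilities for four flood and four non-flood sites.
flood = [0.9, 0.8, 0.55, 0.4]
non_flood = [0.7, 0.5, 0.3, 0.2]

auc = pairwise_auc(flood, non_flood)
print(auc)  # 13 of the 16 pairs are ranked correctly
```

This brute-force count is quadratic in the number of samples, so it is useful for building intuition rather than for evaluating large test sets.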
Compared with the five individual models (AUC ranging from 0.752 to 0.793), the ensemble’s improvement of 2.03 percentage points over the best individual model (Naïve Bayes, AUC = 0.7928) is meaningful. Notably, this margin exceeds the spread among the four strongest individual models (AUC 0.7883 to 0.7928), although not the gap to the weakest model, the decision tree (AUC = 0.7523). This suggests that the voting mechanism synthesizes complementary predictive signals rather than simply averaging similar outputs.
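As a quick arithmetic check, the gain over the best individual model and the spread among the individual models can be computed from the AUC values reported in Section 4.4. The variable names are illustrative.

```python
# AUC values as reported in Section 4.4.
individual_auc = {
    "Decision Tree": 0.7523,
    "SVM": 0.7883,
    "Logistic Regression": 0.7883,
    "LDA": 0.7905,
    "Naive Bayes": 0.7928,
}
ensemble_auc = 0.8131

best_model = max(individual_auc, key=individual_auc.get)

# Gain of the ensemble over the best individual model, in percentage points.
gain_pp = round((ensemble_auc - individual_auc[best_model]) * 100, 2)

# Spread (max minus min) across all five individual models.
values = sorted(individual_auc.values())
spread_all_pp = round((values[-1] - values[0]) * 100, 2)

# Spread across the four strongest models (decision tree excluded).
spread_top4_pp = round((values[-1] - values[1]) * 100, 2)

print(best_model, gain_pp, spread_all_pp, spread_top4_pp)
```

The gain is larger than the spread among the four strongest models but smaller than the spread once the decision tree is included.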
A deeper examination of the ROC curves reveals differences in model behavior across the range of possible thresholds. The ROC curve of the decision tree model shows steeper initial increases followed by earlier flattening, reflecting its tendency to achieve high recall at the expense of specificity—a pattern consistent with overfitting to training data. In contrast, the ensemble maintains a more balanced trade-off between sensitivity (recall) and specificity across thresholds, as evidenced by its curve consistently dominating the individual models across most of the false positive rate spectrum. This balanced performance is particularly valuable for flood management applications, where both missing actual flood events (false negatives) and issuing false alarms (false positives) carry operational consequences. The ensemble’s ability to improve recall without substantially compromising specificity suggests that it better captures the underlying complexity of flood-influencing factors, making it more robust for practical deployment.
4.3. Flood Susceptibility Map
Once trained and optimized, the five distinct ML models were deployed to forecast flood susceptibility at 1482 designated sites within Shenzhen. These results subsequently underwent spatial interpolation analysis (kriging interpolation) on the GIS platform, converting the individual point estimates into a seamless, spatially continuous surface. This procedure culminated in a comprehensive, citywide flood susceptibility map for Shenzhen. By fusing the predictive power of machine learning with the spatial analytical capacity of GIS, the methodology successfully translated point-specific forecasts into area-wide spatial representations, thereby furnishing granular spatial information to bolster regional flood risk evaluation and disaster mitigation strategies.
Figure 10 presents the flood susceptibility distribution map produced by an ensemble model, synthesizing the predictive outputs from multiple algorithms—decision tree, support vector machine (SVM), logistic regression, naïve Bayes classifier, and linear discriminant analysis (LDA)—via a voting mechanism. The resulting value, which we term flood susceptibility probability, denotes the ensemble model’s predicted probability of a location being classified into the flood-prone class (class 1) based on the learned relationships between influencing factors and historical waterlogging records. It ranges from 0 to 0.99 and serves to quantify relative susceptibility. Importantly, this probability is a measure of classification confidence within the supervised binary learning framework, not a probabilistic estimate of hydraulic characteristics such as flood depth, discharge, or inundation duration. Thus, a value of 0.85 indicates that the model assigns an 85% confidence that the location shares similar environmental and topographic conditions with historically observed waterlogging points, rather than predicting an 85% chance of exceeding a specific flood depth. Applying the natural breaks (Jenks) classification method to these probability values, the map was categorized into five distinct levels: very low (below 20%), low (20–40%), moderate (40–60%), high (60–80%), and very high (80–99%).
A comparative analysis of the six flood susceptibility maps (Figure 10) reveals both consistent patterns and notable divergences across models. All six maps consistently identify the western coastal areas, particularly Bao’an, Nanshan, and parts of Futian, as high-to-very-high susceptibility zones. This spatial consistency aligns with the region’s low elevation (generally below 50 m), proximity to major rivers such as the Shenzhen River and Maozhou River, and high annual rainfall, which collectively create favorable conditions for water accumulation. In contrast, the eastern mountainous regions (e.g., Pingshan and Dapeng) are consistently classified as low-susceptibility areas across all models, reflecting their higher elevation and steeper slopes.
However, the models diverge in their classification of transitional zones, particularly in the central and western-central districts where elevation gradually increases and land use transitions from urban to suburban. The decision tree model produces a patchier susceptibility pattern with abrupt spatial transitions, indicative of its tendency to overfit to training data. The SVM and logistic regression models generate smoother but sometimes overly conservative predictions in areas lacking historical waterlogging records. The naïve Bayes classifier and LDA exhibit intermediate behavior, capturing some local variability while maintaining regional coherence.
The voting ensemble map synthesizes these divergent predictions through the majority voting mechanism, resulting in a susceptibility distribution that balances the decision tree’s sensitivity to local features with the SVM’s regional stability. Notably, the ensemble reduces the patchiness observed in individual models and yields a more continuous susceptibility gradient that better reflects the underlying topographic and hydrological gradients. This spatial smoothing effect is one of the key benefits of ensemble learning, as it mitigates the idiosyncrasies of individual algorithms while preserving the signal of high-risk areas.
The susceptibility categories in Figure 10 were derived using the natural breaks (Jenks) classification method applied to the predicted flood susceptibility probabilities. This method minimizes within-class variance while maximizing between-class variance, making it well-suited for identifying natural groupings in the continuous susceptibility values. The choice of five categories (very low, low, moderate, high, very high) provides sufficient granularity for urban planning applications while maintaining interpretability. An analysis of historical waterlogging points against the ensemble map shows that 68.3% of the 741 recorded events fall within the high and very high susceptibility zones, providing empirical validation of the classification scheme.
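The natural breaks criterion can be illustrated on a small sample. The sketch below uses an exhaustive search over all possible class boundaries, standing in for the optimized algorithm that GIS software applies to large rasters; it minimizes the total within-class sum of squared deviations, which is the quantity the method targets. The probability values and function names are hypothetical.

```python
from itertools import combinations


def sum_squared_dev(values):
    """Sum of squared deviations of `values` from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks(values, n_classes):
    """Split sorted values into `n_classes` contiguous groups that minimize
    the total within-class sum of squared deviations.

    Exhaustive search: practical only for small samples.
    """
    data = sorted(values)
    best_cost, best_groups = None, None
    # Each combination lists the indices at which a new class starts.
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0,) + cuts + (len(data),)
        groups = [data[bounds[i]:bounds[i + 1]] for i in range(n_classes)]
        cost = sum(sum_squared_dev(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups


# Hypothetical susceptibility probabilities with three visible clusters.
probs = [0.05, 0.08, 0.10, 0.42, 0.45, 0.50, 0.88, 0.90, 0.95]
groups = natural_breaks(probs, 3)
print(groups)
```

Because the boundaries follow gaps in the data, they generally do not fall on evenly spaced values, which distinguishes this method from an equal-interval classification.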
To facilitate practical interpretation, the high-susceptibility zones identified by the ensemble—particularly in western Bao’an, coastal Nanshan, and low-lying areas adjacent to the Shenzhen River—correspond closely to districts targeted by Shenzhen’s flood control master plan. These areas are prioritized for drainage infrastructure upgrades and coastal defense reinforcement under the city’s 2025 and 2035 protection standards. The ensemble map thus offers spatially explicit guidance for prioritizing interventions, while the areas of model disagreement (e.g., certain central districts) highlight locations where additional data collection or refined factor selection may be warranted to reduce uncertainty.
In generating the final flood susceptibility maps, the predictive outputs from the machine learning models at 1482 sample points were spatially interpolated using kriging to create continuous surfaces. While this approach effectively translates point-based predictions into spatially explicit susceptibility maps, it is important to note that the GIS layer integration underlying the interpolation does not explicitly incorporate the differential importance of each influencing factor. Instead, the spatial interpolation treats the model predictions, which inherently embed factor importance through the trained machine learning algorithms, as the basis for mapping. Thus, while the map generation itself does not apply a weighted overlay, the factor contributions are implicitly accounted for through the ensemble model’s learned decision boundaries. Nevertheless, we acknowledge that a weighted overlay approach, informed either by expert judgment or by factor importance derived from SHAP analysis (as presented in this study), could further enhance the interpretability and hydrological realism of the final susceptibility maps [7]. This represents a promising direction for future research, where GIS-based weighted overlay and machine learning outputs can be more tightly integrated.
4.4. Model Validation and Analysis
This study employed five conventional machine learning models—decision tree, SVM, logistic regression, naïve Bayes, and linear discriminant analysis—along with an ensemble voting algorithm to perform quantitative flood susceptibility prediction for Shenzhen. Model performance was evaluated using accuracy, recall, and AUC. The ensemble achieved the highest AUC (0.8131), followed by naïve Bayes (0.7928) and LDA (0.7905); logistic regression and SVM performed similarly (AUC = 0.7883), while the decision tree yielded the lowest (0.7523). These results support the premise that ensemble learning mitigates individual model bias and improves generalizability. To assess the statistical significance of the improvement, we conducted DeLong’s test comparing the ensemble with naïve Bayes; the difference was significant (p < 0.05), confirming that the gain was not due to random chance.
A closer look at prediction errors reveals that approximately 12% of the ensemble’s high-susceptibility areas correspond to locations without recorded waterlogging events (potential over-estimation), while about 8% of recorded waterlogging points fall outside high-susceptibility zones (potential under-estimation). These discrepancies point to opportunities for further model refinement through the inclusion of additional variables such as drainage infrastructure density, soil infiltration capacity, or fine-scale urban morphology.
The strength of the voting ensemble lies in its ability to integrate and complement the strengths of individual models. Decision trees effectively capture nonlinear relationships but are prone to overfitting in Shenzhen’s complex terrain. SVMs perform well in high-dimensional spaces but may overestimate susceptibility in areas with limited data. Naïve Bayes offers computational efficiency under assumptions of feature independence, while LDA enhances class separability for binary classification. By synthesizing their predictions through majority voting, the ensemble mitigates the limitations of any single model—for example, reducing overestimation in western coastal lowlands and underestimation in eastern mountainous areas. Consequently, the ensemble improves AUC by 2.03 percentage points over the best individual model.
Pearson correlation analysis verified that all pairs of the nine factors had |r| < 0.7, indicating no severe multicollinearity. To further quantify the relative importance of each factor, we conducted SHapley Additive exPlanations (SHAP) analysis. The results, presented in Figure 11, show that annual average rainfall, elevation, and distance to rivers are the three dominant drivers. Among them, annual average rainfall presents the highest mean SHAP value, confirming its dominant role as a climatic driver. Elevation shows a strong negative relationship with flood susceptibility, while distance to rivers exhibits a clear threshold effect, with flood susceptibility rising sharply within 1 km of river networks. The remaining factors exert moderate or weak impacts: land use type and slope contribute moderately, whereas aspect, curvature, SPI, and TWI show relatively low contributions. This importance ranking is consistent with the spatial distribution of flood susceptibility and further validates the rationality of the selected influencing factors.
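The multicollinearity screen mentioned at the start of this paragraph can be sketched as a pairwise Pearson check against the 0.7 threshold. The factor values below are hypothetical and deliberately include one strongly correlated pair to show what a flagged result looks like; the function names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def collinear_pairs(factors, threshold=0.7):
    """Return (name_a, name_b, r) for every factor pair with |r| >= threshold."""
    names = list(factors)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson_r(factors[a], factors[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged


# Hypothetical factor values at five sample points.
factors = {
    "elevation": [5, 20, 35, 60, 120],
    "slope": [1, 3, 4, 9, 15],  # rises with elevation in this toy sample
    "aspect": [90, 10, 270, 45, 180],
}

flagged = collinear_pairs(factors)
print(flagged)  # only the elevation/slope pair exceeds the threshold here
```

An empty result from such a screen corresponds to the condition reported in the text, where no factor pair reaches the threshold.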
This study also details a comprehensive GIS–machine learning workflow (data preprocessing → factor extraction → sample partitioning → model training → susceptibility visualization), addressing a gap in practical implementation guidance. Using the “Extract Values to Points” tool in ArcGIS, multi-source data were standardized and integrated. Kriging interpolation was then applied to convert point-based predictions into a continuous susceptibility surface, revealing a “higher risk in western coastal areas and lower risk in eastern mountainous areas” pattern that is consistent with Shenzhen’s rainfall distribution and provides a spatially explicit reference for flood mitigation planning.
Two limitations merit attention. First, the model was trained on waterlogging records from a five-year period (2020–2024). While this dataset reflects recent flood patterns under current conditions, it may not fully capture longer-term trends such as changes in rainfall intensity, urban expansion, or climate impacts. The resulting susceptibility map should therefore be interpreted as representing the current spatial pattern of flood susceptibility under the available data conditions, rather than event-based hydraulic flood intensity or other physical flood characteristics. Second, while kriging interpolation provides unbiased estimates with minimized variance, it may introduce smoothing errors in areas with sparse sample coverage—such as the eastern mountainous regions of Pingshan and Dapeng, where fewer waterlogging points are available. This smoothing effect can obscure localized variability in flood susceptibility, potentially masking small-scale features that influence flood risk. Future studies could explore alternative interpolation methods (e.g., geographically weighted regression) or incorporate additional sampling points to reduce uncertainty in underrepresented areas.