Peer-Review Record

Explainable Artificial Intelligence (XAI) for Flood Susceptibility Assessment in Seoul: Leveraging Evolutionary and Bayesian AutoML Optimization

Remote Sens. 2025, 17(13), 2244; https://doi.org/10.3390/rs17132244

by Kounghoon Nam¹,*, Youngkyu Lee¹, Sungsu Lee², Sungyoon Kim¹ and Shuai Zhang³

Reviewer 1: Tadesual Asamin Setargie

Reviewer 2: Zaheer Ahmed

Reviewer 3: Anonymous


Submission received: 31 March 2025 / Revised: 13 May 2025 / Accepted: 28 June 2025 / Published: 30 June 2025

(This article belongs to the Special Issue Artificial Intelligence for Natural Hazards (AI4NH))

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript assesses flood susceptibility in Seoul (Korea) employing six ensembled models, constructed by combining three baseline classifiers (GB, RF, and XGBoost) with an evolutionary AutoML algorithm (TPOT as a pipeline) and Bayesian optimization (Optuna to fine-tune hyperparameters). The study is well organized, and the findings are interesting, particularly in providing insights into the importance and interaction of conditioning factors influencing flood susceptibility mapping (FSM). The manuscript is suitable for publication in Remote Sensing after addressing the minor revisions outlined below.

General comments:

Avoid lengthy sentences throughout the manuscript so that the intended meaning reaches a general audience. Thorough English language editing by a native speaker is recommended; alternatively, use an online tool such as Grammarly.
Most figures are of low quality; improve the quality of most figures by increasing their DPI.
Consistently use two/three decimal places (e.g., 12.34 or 12.345) throughout the manuscript.

Specific comments:

Title:

The title is too general. Try a more specific one. I suggest one as “Combining Evolutionary and Bayesian AutoML Optimizations with Ensemble Classifiers for Flood Susceptibility Mapping in Seoul, Korea”.

Abstract:

Lines 21-22: Avoid explaining results before finishing explaining the methodology. Remove “, resulting in enhanced performance, with the GB model achieving the highest AUC of 0.966”.
Lines 25-29: Avoid general discussions, specifically highlight your key findings instead. For instance, highlight a few specific preferences of key factors such as lower values of elevation, slope, and stream distance, higher values of stream density, and built-up areas.
Following the results of key conditioning factors, add a sentence explaining their interaction.
Also, report the performance of all models, severity of the flood in the study areas based on the percentage coverage of high and very high susceptibility classes.

Keywords:

Lines 34-36: To increase your article’s visibility after publication, keywords should not be similar to words in the title. For instance, the first three keywords are not appropriate for the current manuscript title as they are already mentioned word-for-word.

2 Study area:

Line 109: Add “in” in the caption before “2022”.

3 Materials and Methods:

Line 125 (Figure 2): Show the last step in the flowchart (i.e., performance analysis with its metrics).

Section 3.1:

Lines 138-140: Always put “DEM” first when mentioning the conditioning factors, then their derivatives in their order of extraction. Be consistent throughout the manuscript, including in figures.
Line 155: Table 1: Add one column to show the ranges of continuous variables and classes of categorical variables with their assigned values.
Lines 143-144: The statement “All variables, except for LULC, were generated using ArcGIS Pro version 3.4” is incomplete. I think you meant to write “All variables, except for LULC, were generated from 5-m DEM and polylines, using ArcGIS Pro version 3.4”. Please rewrite it.
In Section 3.1, explain how the stream density is calculated.

Section 3.2:

Line 160-162: Citation needed for your statement “Variables with a pairwise correlation coefficient |r| ≥ 0.8 were flagged for further investigation, as high collinearity can distort variable importance estimates and reduce model interpretability.”
Line 174 (Figure 3): Enhance the quality of figures by increasing their DPI.
Line 174 (Figure 3d): I don't understand the meaning of the "None" LULC classes. Either give it a name or merge with other similar LULC classes. Also, show the values assigned to categorical factors.
Line 174 (Figure 3h): Add the unit of stream density.

4 Results:

Lines 261-279: Redundancy. Remove the entire paragraph.

Section 4.1:

Lines 283-286: It reads “A Pearson correlation matrix was calculated to examine the pairwise linear relationships among the seven continuous flood conditioning variables, excluding categorical features such as land use and land cover (LULC), flow direction, and aspect.”. Why have you excluded the categorical variables?
Line 302 (Figure 4): Show the statistical significance of Pearson's correlation coefficients in the figure.
Line 306 (Table 2): Two/three decimal places are enough to report VIF/TOL values. Consistent use of constant decimal places is needed throughout the manuscript.

Section 4.2:

Line 340 (Figure 6): Instead of reporting the predicted flood susceptibility values in the maps with continuous scales, report them in four or five susceptibility classes (as very low (optional), low, moderate, high, and very high).
Also, show the percentage coverage of each susceptibility class predicted by each model using bar charts.
What are the black areas in the maps?

Section 4.3:

Line 354 (Figure 7): While the performance of the first model (i.e., GB) remains almost the same despite tuning by Optuna, why do the latter two models (i.e., RF and XGB) improve more than it does?
In most studies, the use of RF as an individual or ensemble model shows superior performance over others. However, this is not the case in your study (referring to Sections 4.2 and 4.3). What do you think the reason could be?

Section 4.4:

Are the model predictions free from over-fitting? Could you explain what measures were taken in this study to test the overfitting of model predictions and avoid such a problem?
Line 392 (Figure 8): A table is more descriptive, in this case, than a figure. Could you replace the figure with a table?

Section 4.5:

Line 435 (Figure 9): In the caption, explain how the SHAP values are calculated or what they refer to.
Line 456 (Figure 10): In the figure caption, explain how the abscissa and ordinate values are calculated or what they reflect.
Lines 464-466: It states that “Figure 10 (e) reveals that built-up areas (0 = 1) consistently increase flood risk, particularly in regions with gentle slopes, as shown by the slope-based color gradient.” Readers are not informed that a categorical value of “1” is assigned to built-up areas. Show the assigned values for classes of the categorical variable in the preceding sections.
Line 471: Yet, you didn't satisfactorily explain the interactions of factors among themselves and in response to flood susceptibility. Explain how highly susceptible classes of each conditioning factor interact with each other. At the end of the paragraph, describe in one sentence which area is more susceptible by combining your results.
Line 477 (Figure 11): Add a statement in the caption explaining what the bold predicted score values mean or how they were calculated. For example, what does a predictive score value of 10.56 mean?
Line 502 (Figure 12): How were the combinations of conditioning factors with their critical values chosen for the four SHAP waterfall plots? Also, what are the 22 other features? Show all features as supplementary material.

5 Discussion:

Section 5.1:

Lines 568-575: You reported that Bersabe and Jun (2025) employed RF model and achieved the best results in your study area, compared to previous studies, considering urban infrastructure variables such as sewer density and proximity to storm drains. Hence, why didn’t you consider those factors?
Apart from comparison with the predicted performance of previous studies, what were the key features identified in those studies? Are they in agreement with your study? Which other factors showed a significant role in flood susceptibility in the study areas?

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Overall, the quality of English is good; however, the manuscript would benefit from further improvements, particularly by reducing lengthy sentences.

Author Response

Response to Reviewer 1

We sincerely appreciate the reviewer’s insightful and constructive comments, which helped us improve the clarity, scientific rigor, and presentation quality of our manuscript. Below are our point-by-point responses to each comment. All revisions have been marked in blue in the revised manuscript.

Comment 1: Avoid using lengthy sentences throughout the manuscript. A thorough English language editing by a native speaker is recommended. Or, use online tools such as Grammarly.

Response 1: Thank you. We revised the manuscript to break down lengthy sentences and improve readability. We also performed an in-depth grammar check using professional editing tools (Grammarly) to enhance clarity and fluency.

Comment 2: Most figures are of low quality; improve the quality by increasing their DPI.

Response 2: We have replaced all low-resolution figures with high-DPI versions (minimum 300 dpi).

Comment 3: Use consistent two/three decimal places.

Response 3: We revised all numerical values in the tables and text to consistently use two or three decimal places across the manuscript.

Comment 4: The title is too general. Consider using:

“Combining Evolutionary and Bayesian AutoML Optimizations with Ensemble Classifiers for Flood Susceptibility Mapping in Seoul, Korea.”

Response 4: Title has been revised as suggested.

Comment 5: (Lines 21-22) Avoid stating results before explaining the methodology.

Response 5: Thank you for your comment. In response to your suggestion, we removed the performance result (AUC 0.966). Additionally, the abstract has been completely revised to enhance clarity, focus, and alignment with the key contributions of the study. The revised abstract now more effectively summarizes the methodology and key findings.

Comment 6: (Lines 25-29) Be specific about the results, such as factor preferences and susceptibility distributions.

Response 6: Revised lines 25-29 to emphasize specific factor trends (e.g., lower elevation, slope, stream distance; higher stream density and built-up area). Also added interaction explanation and susceptibility coverage percentages at the end of the abstract.

Comment 7: Avoid using words from the title.

Response 7: Keywords have been revised to avoid overlap with the title. New keywords include:

(Lines 34-35) explainable artificial intelligence (XAI); TPOT (Evolutionary Optimization); Optuna (Bayesian Optimization); SHAP Interpretation; hyperparameter tuning

Comment 8: (Line 109) Add “in” before “2022” in the figure caption.

Response 8: Corrected the caption in Line 109 to:

(Line 109) flooded points recorded in 2022

Comment 9: (Line 125, Figure 2) Add performance metrics step to flowchart.

Response 9: The performance metrics step has been added to the flowchart in Figure 2 to clarify the evaluation process of the proposed models.

Comment 10: (Lines 138-140) Clarify DEM order

Response 10: We reordered the factors consistently, starting with DEM.

(Lines 136-139) Ten flood conditioning factors were selected based on previous studies, hydrological relevance, and data availability: elevation (DEM), slope, aspect, profile curvature, plan curvature, flow direction, stream density, distance to stream, topographic wetness index (TWI), and land use and land cover (LULC).

Comment 11: (Lines 143-144) The statement “All variables, except for LULC, were generated using ArcGIS Pro version 3.4” is incomplete. I think you meant to write “All variables, except for LULC, were generated from 5-m DEM and polylines, using ArcGIS Pro version 3.4”. Please rewrite it.

Response 11:

(Lines 139-141) All variables, except for LULC, were generated from 5-m DEM and polyline stream networks using ArcGIS Pro version 3.4.

Comment 12: (Section 3.1) Explain how stream density is calculated.

Response 12: We added the explanation in Section 3.1:
(Lines 141-142) Stream density was computed using the line density function in ArcGIS based on the total length of streams within a 1-km² neighborhood.
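
The calculation described above can be illustrated with a small sketch. It is not the ArcGIS implementation: it assumes a circular neighborhood whose area is 1 km², straight stream segments given in kilometre coordinates, and hypothetical function names (`length_inside_circle`, `stream_density`). Each segment is clipped to the circle, the clipped lengths are summed, and the total is divided by the neighborhood area to give km/km².

```python
import math

def length_inside_circle(p0, p1, center, radius):
    """Length of the straight segment p0 -> p1 that lies inside a circle."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    fx, fy = p0[0] - center[0], p0[1] - center[1]
    a = dx * dx + dy * dy
    if a == 0.0:
        return 0.0  # zero-length segment contributes nothing
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - radius * radius
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:
        return 0.0  # the line misses the circle or only touches it
    root = math.sqrt(disc)
    # Parameters (0..1 along the segment) where it enters and leaves the circle.
    t_in = max((-b - root) / (2.0 * a), 0.0)
    t_out = min((-b + root) / (2.0 * a), 1.0)
    return max(t_out - t_in, 0.0) * math.sqrt(a)

def stream_density(segments, center, area_km2=1.0):
    """Total stream length inside the neighborhood divided by its area (km/km^2)."""
    radius = math.sqrt(area_km2 / math.pi)  # assumed circular neighborhood
    total = sum(length_inside_circle(p0, p1, center, radius) for p0, p1 in segments)
    return total / area_km2

# Hypothetical stream segments in kilometre coordinates.
segments = [((-2.0, 0.0), (2.0, 0.0)), ((5.0, 5.0), (6.0, 6.0))]
density = stream_density(segments, center=(0.0, 0.0))
```

In this toy case only the first segment crosses the neighborhood, so the density equals the circle's diameter divided by 1 km².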

Comment 13: (Line 155, Table 1) Add one column to show the ranges of continuous variables and classes of categorical variables with their assigned values.

Response 13:
Thank you for your helpful comment. In response, we have revised Table 1 to include an additional column showing the value ranges for continuous variables and the assigned categories for categorical variables such as LULC. This addition provides greater transparency and supports reproducibility.

Comment 14: (Lines 160-162) Provide citation for correlation threshold.

Response 14: Thank you for pointing this out. To avoid unsupported statements without citation, we have removed the sentence from the manuscript.

Comment 15: (Line 174, Figure 3) Improve Figure 3 quality and labeling.

Response 15: Replaced Figure 3 with higher-resolution versions and clarified:

“None” in LULC was deleted in Figure 3 (d).
Categorical values (e.g., 1 = Built-up) were added in Figure 3 (d).
Added unit “km/km²” to stream density map in Figure 3 (h).

Comment 16: (Lines 261-279) Remove redundancy.

Response 16: The entire paragraph has been deleted to eliminate redundancy.

Comment 17: (Lines 283-286) Explain why categorical variables were excluded from correlation.

Response 17:

(Lines 261-263) Categorical variables such as land use and land cover (LULC), flow direction, and aspect were excluded, as their non-continuous nature violates the assumptions of linear correlation analysis.

Comment 18: (Line 302, Figure 4) Add significance to Figure 4.

Response 18: Thank you for your helpful suggestion. We have revised Figure 4 to include statistical significance indicators (p-values) for the Pearson correlation coefficients. This addition helps highlight which relationships are statistically meaningful and enhances the interpretability of the correlation matrix.

(Lines 280-281) Figure 4. Pearson correlation matrix of continuous flood conditioning factors. Asterisks indicate statistical significance (*p < 0.05, **p < 0.01, ***p < 0.001).

Comment 19: (Line 306, Table 2) Use consistent decimals in Table 2.

Response 19: All VIF and TOL values in Table 2 have been rounded to three decimal places.
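
For readers unfamiliar with the two diagnostics, the standard relationship between them can be sketched as follows. The R² value is hypothetical and the function name is illustrative; R² here is the coefficient of determination obtained by regressing one conditioning factor on all the others.

```python
def vif_and_tolerance(r_squared):
    """Return (VIF, TOL) rounded to three decimals for one predictor.

    TOL = 1 - R^2 and VIF = 1 / TOL, where R^2 comes from regressing
    the predictor on the remaining predictors.
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    tolerance = 1.0 - r_squared
    vif = 1.0 / tolerance
    return round(vif, 3), round(tolerance, 3)

# Hypothetical example: 80% of a factor's variance is explained by the others.
vif, tol = vif_and_tolerance(0.8)
```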

Comment 20: (Line 340, Figure 6) Replace continuous susceptibility maps with class maps. Also, show the percentage coverage of each susceptibility class predicted by each model using bar charts.

Response 20: We reclassified susceptibility maps (Figures 6 and 7) into five classes (very low to very high) and added bar charts for area proportions per class (Figure 8).
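
A reclassification of this kind, together with the per-class area shares, might look like the sketch below. The response does not state how the class breaks were chosen, so the equal-interval breaks (0.2, 0.4, 0.6, 0.8) and the pixel values are assumptions made purely for illustration.

```python
from bisect import bisect_right
from collections import Counter

CLASS_NAMES = ["very low", "low", "moderate", "high", "very high"]
BREAKS = [0.2, 0.4, 0.6, 0.8]  # assumed equal-interval breaks, not from the paper

def classify(probability):
    """Map a susceptibility probability in [0, 1] to one of five classes."""
    return CLASS_NAMES[bisect_right(BREAKS, probability)]

def class_coverage(probabilities):
    """Percentage of pixels falling in each susceptibility class."""
    counts = Counter(classify(p) for p in probabilities)
    n = len(probabilities)
    return {name: round(100.0 * counts[name] / n, 2) for name in CLASS_NAMES}

# Hypothetical predicted susceptibility values for eight pixels.
pixels = [0.05, 0.15, 0.35, 0.55, 0.62, 0.81, 0.93, 0.97]
coverage = class_coverage(pixels)
```

The resulting dictionary holds the values a bar chart of area proportions per class would display.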

Comment 21: (Figure 6 and 7) Explain black areas in maps.

Response 21: Thank you for your comment. The black areas in the maps have been removed to avoid confusion and improve visual clarity. The revised figures now include only relevant spatial information.

Comment 22: (Line 354, Figure 7) While the performance of the first model (i.e., GB) remains almost the same despite tuning by Optuna, why do the latter two models (i.e., RF and XGB) improve more than it does? In most studies, the use of RF as an individual or ensemble model shows superior performance over others. However, this is not the case in your study (referring to Sections 4.2 and 4.3). What do you think the reason could be?

Response 22: Thank you for your insightful questions. (1) Regarding the limited performance gain of the Gradient Boosting (GB) model after Optuna tuning, we believe this is due to the fact that the TPOT-derived GB pipeline was already near-optimal in terms of hyperparameter configuration. As a result, the additional tuning space explored by Optuna provided only marginal improvements. In contrast, the initial configurations for the RF and XGB models were more flexible, allowing Optuna to yield greater performance enhancements. (2) While Random Forest (RF) often outperforms other models in many studies, its relatively lower performance in our case may be attributed to the spatial heterogeneity and class imbalance in the flood susceptibility data. Tree-based ensembles like RF tend to perform well in balanced datasets, whereas Gradient Boosting is better at handling skewed or noisy data through iterative learning. Additionally, our model setup emphasized both accuracy and interpretability, for which GB showed superior trade-offs in this context.

(Lines 407-419) The limited performance improvement of the GB model after Optuna tuning can be attributed to the fact that the TPOT-derived GB pipeline was already close to optimal in terms of hyperparameter configuration. Consequently, the additional tuning space explored by Optuna led to only marginal gains. In contrast, the initial configurations of the RF and XGB models were more flexible, allowing Optuna to achieve more substantial performance enhancements. While RF often demonstrates strong predictive performance in other studies, its relatively lower performance in this study may be explained by the spatial heterogeneity and class imbalance present in the flood susceptibility data. Tree-based ensembles such as RF generally perform well on balanced datasets, whereas GB tends to be more robust in handling skewed or noisy data through its sequential learning mechanism. Moreover, the modeling approach adopted in this study prioritized both predictive accuracy and interpretability, and GB offered the most favorable balance between these two aspects.

Comment 23: Explain overfitting checks.

Response 23: Thank you for raising this important point. To mitigate and assess the risk of overfitting, we employed several measures throughout the model development process. First, we used stratified 5-fold cross-validation during training to ensure that the performance evaluation reflects generalizable results across different data subsets. Second, all models were trained and validated using spatially independent datasets to preserve geographic heterogeneity. Third, hyperparameter tuning with TPOT and Optuna was conducted with built-in validation schemes, and model selection was based on average cross-validation scores rather than training performance. Finally, we compared the performance metrics on both training and hold-out test sets, which showed consistent results without significant drops in test performance, indicating that the models generalize well to unseen data and are not overfitting.
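
The stratification step mentioned in this response can be sketched in plain Python. This is a simplified stand-in rather than the authors' code: the function name, the seed, and the 20/80 flooded/non-flooded labels are hypothetical. Indices are grouped by class, shuffled, and dealt round-robin so that every fold keeps roughly the class proportions of the full sample.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced sample: 20 flooded (1) and 80 non-flooded (0) points.
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, k=5)

flooded_per_fold = [sum(labels[i] for i in fold) for fold in folds]
# Each fold serves once as validation data; the other four form the training set.
train_for_fold_0 = [i for f, fold in enumerate(folds) if f != 0 for i in fold]
```

Comparing the mean score across the five validation folds with the training score is then one way to look for the train/test gap that signals overfitting.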

Comment 24: (Line 392, Figure 8) Replace Figure 8 with a table.

Response 24: Figure 8 has been replaced with Table 3, summarizing model metrics.

Comment 25: (Line 435, Figure 9) In the caption, explain how the SHAP values are calculated or what they refer to.

Response 25: Thank you for your helpful comment. We have revised the caption of Figure 9 to include a brief explanation of how SHAP values are calculated and what they represent in the context of model interpretation.

(Lines 429-432) Figure 9. SHAP summary plot illustrating the contribution of each conditioning factor to model predictions. Feature values are color-coded (red = high, blue = low), and SHAP values were computed using the TreeSHAP algorithm. dem, stream distance, and stream density exhibit the highest influence on the model output.

Comment 26: (Line 456, Figure 10) In the figure caption, explain how the abscissa and ordinate values are calculated or what they reflect.

Response 26: Thank you for your valuable suggestion. We have revised the caption of Figure 10 to clarify what the x-axis (abscissa) and y-axis (ordinate) represent and how these values relate to the SHAP dependence plots.

(Lines 453-460) Figure 10. SHAP dependence plots showing the relationship between input feature values (x-axis) and their corresponding SHAP values (y-axis) for major flood conditioning factors: (a) elevation (DEM), (b) slope, (c) stream distance, (d) stream density, (e) built-up area (LULC_7.0), and (f) topographic wetness index (TWI). Each point represents a prediction instance, with color indicating the value of an interacting feature. The x-axis represents the original value of the input feature, while the y-axis indicates the SHAP value, which quantifies the feature’s impact on the model output. Higher absolute SHAP values correspond to stronger influence, positive or negative, on the predicted flood susceptibility.

Comment 27: (Lines 464-466) It states that “Figure 10 (e) reveals that built-up areas (0 = 1) consistently increase flood risk, particularly in regions with gentle slopes, as shown by the slope-based color gradient.” Readers are not informed that a categorical value of “1” is assigned to built-up areas. Show the assigned values for classes of the categorical variable in the preceding sections.

Response 27: Thank you for your thoughtful question. The original LULC variable is a multi-class categorical feature, with each class representing a different land cover type (e.g., Water = 1, Trees = 2, Crops = 5, Built-up Area = 7, etc.). To make this variable compatible with the machine learning models and improve interpretability in SHAP analysis, we applied one-hot encoding to transform each LULC class into a separate binary variable. For example, lulc_7.0 is a binary indicator where a value of “1” means that the pixel belongs to the built-up area class (original LULC class 7), and a value of “0” means that the pixel belongs to any other land cover class. This approach allows the SHAP dependence plots to clearly show how individual LULC types influence flood susceptibility predictions.

(Lines 468-477) Figure 10 (e) presents a SHAP dependence plot for the binary variable lulc_7.0, which indicates whether a given pixel is classified as built-up area (LULC class 7). The x-axis represents the value of the binary indicator (0 for non-built-up areas and 1 for built-up areas), while the y-axis shows the corresponding SHAP value, quantifying its contribution to the model’s flood susceptibility prediction. The color gradient reflects slope values, where red represents low slope and blue indicates steep terrain. The plot reveals that built-up areas (lulc_7.0 = 1) consistently contribute to an increased flood risk, especially in areas with gentle slopes. In contrast, non-built-up areas (lulc_7.0 = 0) have minimal influence on the prediction, as reflected by SHAP values near zero.
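
The one-hot encoding described in Response 27 can be illustrated with a short sketch. The class codes (1, 2, 5, 7) and the column naming pattern `lulc_7.0` follow the response; the function name and the five example pixels are hypothetical.

```python
def one_hot_lulc(codes, prefix="lulc"):
    """Turn a list of LULC class codes into one binary column per class."""
    classes = sorted(set(codes))
    return {
        f"{prefix}_{float(c)}": [1 if code == c else 0 for code in codes]
        for c in classes
    }

# Hypothetical pixels: built-up, water, trees, built-up, crops.
pixels = [7, 1, 2, 7, 5]
encoded = one_hot_lulc(pixels)
built_up = encoded["lulc_7.0"]  # 1 where the pixel is built-up, else 0
```

Every pixel has exactly one column set to 1, which is why a dependence plot for `lulc_7.0` shows only two positions on the x-axis.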

Comment 28: (Line 471) Explain the interactions of factors among themselves and in response to flood susceptibility. Explain how highly susceptible classes of each conditioning factor interact with each other. At the end of the paragraph, describe in one sentence which area is more susceptible by combining your results?

Response 28: Thank you for your valuable suggestion. We have revised the paragraph to explain how the most influential conditioning factors interact and jointly contribute to flood susceptibility. A summary sentence has also been added to clarify which geographic areas are most vulnerable based on the combined effect of these factors.

(Lines 486-492) The SHAP dependence plots reveal that low elevation, low slope, short distance to streams, and high stream density jointly amplify flood susceptibility, particularly when combined with built-up land cover. These factors interact synergistically in flat, urbanized lowland areas, where water tends to accumulate and drainage is limited. Based on the interaction of these variables, the most flood-susceptible areas are low-lying, densely built-up regions located near concentrated stream networks.

Comment 29: (Line 477, Figure 11) Add a statement in the caption explaining what the bold predicted score values mean or how they were calculated. For example, what does a predictive score value of 10.56 mean?

Response 29: Thank you for your helpful comment. We have revised the caption of Figure 11 to clarify the meaning of the bold predicted score values. These values represent the final flood susceptibility prediction for each instance, calculated by summing the SHAP contributions of all features along with the model’s base value (i.e., average prediction across the dataset).

(Lines 498-502) Figure 11. SHAP force plots illustrating the contribution of individual features to the flood susceptibility prediction for selected instances. The x-axis shows SHAP values relative to the model’s base value (mean prediction), with red indicating features that increase the prediction and blue indicating those that decrease it. The final score (e.g., 10.56) represents the model’s predicted flood susceptibility for the given instance.
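
The additive relationship between the base value, the per-feature contributions, and the final score can be checked on a toy example. The sketch below computes exact Shapley values by enumerating feature subsets; it is not TreeSHAP, it uses a single reference point as the baseline instead of a background dataset, and the model, feature values, and names are all hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi

def toy_model(features):
    """Hypothetical susceptibility score with a slope/built-up interaction."""
    elevation, slope, built_up = features
    return (2.0 - 0.05 * elevation - 0.1 * slope
            + 1.5 * built_up + 0.02 * built_up * (10.0 - slope))

x = [12.0, 2.0, 1.0]         # low, flat, built-up pixel
baseline = [40.0, 8.0, 0.0]  # reference pixel standing in for the mean prediction
phi = shapley_values(toy_model, x, baseline)
base_value = toy_model(baseline)
reconstructed = base_value + sum(phi)
```

Here `reconstructed` matches `toy_model(x)`, which mirrors how the bold score in a force plot equals the base value plus the sum of the feature contributions.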

Comment 30: (Line 502, Figure 12) How were the combinations of conditioning factors with their critical values chosen for the four SHAP waterfall plots? Also, what are the 22 other features? Show all features as supplementary material.

Response 30: Thank you for your helpful comment. To improve the focus and clarity of the manuscript, we have removed Figure 12 and its corresponding explanation. We believe this deletion does not affect the overall interpretability of the study, as key insights from SHAP analyses are sufficiently conveyed through the summary and dependence plots.

Comment 31: (Lines 568-575) Bersabe and Jun (2025) employed RF model and achieved the best results in your study area, compared to previous studies, considering urban infrastructure variables such as sewer density and proximity to storm drains. Hence, why didn’t you consider those factors?

Apart from comparison with the predicted performance of previous studies, what were the key features identified in those studies? Are they in agreement with your study? Which other factors showed a significant role in flood susceptibility in the study areas?

Response 31:

(Lines 565-571) Bersabe and Jun [50] introduced urban infrastructure variables such as sewer density and proximity to storm drains into a Random Forest model, achieving a notable AUC of 0.902. While these results underscore the value of engineered features in urban flood modeling, such variables were not included in the present study due to the lack of spatially detailed and validated infrastructure data across the study area. To ensure consistency and reproducibility, this study focused on publicly available topographic and hydrologic factors.

Comment 32: Compare with previous key variables.

Response 32: We now include a comparison summary in Section 5.1, indicating overlaps and discrepancies in key contributing factors with previous studies.

(Lines 571-579) Despite differences in variable selection, both studies consistently identify elevation and stream proximity as dominant predictors. In addition, our findings highlight the critical influence of stream density and built-up areas (LULC), particularly in low-lying and densely developed zones. These patterns reflect the urban flood dynamics of Seoul, where terrain and land use characteristics strongly govern flood susceptibility. This convergence on topographic and hydrologic factors reaffirms their fundamental role in urban flood modeling for the region. In comparison, the current study achieved superior performance using only topographic and land-use variables, without infrastructure-specific inputs.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

Overall, the paper is well-written and presents a systematically sound study. Addressing the minor points needed to revise will increase its precision, readability, and impression.

Grammatical Review

Overall Quality: The language used is generally clear and academic. The writing is formal, which is appropriate for a scientific paper.

Introduction:

A wide range of machine learning (ML) techniques has been employed for flood prediction and susceptibility assessment. Among these, soft computing and statistical learning models have gained popularity in recent years.

The full form of machine learning (ML) has already been introduced in the sentence above; from this point on, use only the abbreviation (ML).

In this paper, there are many repetitions, such as using both the full form and the abbreviation (e.g., land use and land cover (LULC), machine learning (ML), and topographic wetness index (TWI)).

The author needs to check all terms and abbreviations.

Section 4.5.2

"Figure 10 (a) shows that lower elevation (dem) values correspond to higher SHAP values, confirming that low-lying areas are more flood prone."

Consider rewording: flood-prone

Images (Figures):

Consider adding scale bars to the thematic maps in Figure 3 to deliver a good sense of spatial dimensions.

In Figure 5, while the pipeline is well-represented, add brief annotations to clarify the purpose of each component. It would be helpful for readers less aware of AutoML.

Tables

Ensure that all abbreviations used in the tables are well-defined either in the table caption or in a footnote.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

The English used in the paper is of high quality and reflects grammatically correct academic writing, though minor formatting issues (e.g., hyphenation artifacts) are present.

Author Response

Response to Reviewer 2

Comment 1: In this paper, there are many repetitions, such as using both the full form and the abbreviation (e.g., land use and land cover (LULC), machine learning (ML), and topographic wetness index (TWI)).
The author needs to check all terms and abbreviations.

Response 1: We sincerely appreciate the reviewer’s observation. We have carefully reviewed the entire manuscript to identify and correct instances where both the full term and its abbreviation were used redundantly. In the revised version, we now introduce each term with its full form followed by the abbreviation upon first use, and then consistently use only the abbreviation throughout the rest of the manuscript (e.g., ML, LULC, TWI). This change improves the clarity and conciseness of the text.

Comment 2: "Figure 10 (a) shows that lower elevation (dem) values correspond to higher SHAP values, confirming that low-lying areas are more flood prone."
Consider rewording: flood-prone

Response 2: Thank you for pointing this out. We have revised the phrase “more flood prone” to the hyphenated form “more flood-prone” in accordance with proper grammatical usage.

(Line 459) flood-prone

Comment 3: Consider adding scale bars to the thematic maps in Figure 3 to deliver a good sense of spatial dimensions.

Response 3: Thank you for your valuable suggestion. In response, we have added scale bars to all thematic maps presented in Figure 3. This modification improves the readers’ ability to interpret spatial dimensions and enhances the overall readability of the figure.

Comment 4: In Figure 5, while the pipeline is well-represented, add brief annotations to clarify the purpose of each component. It would be helpful for readers less aware of AutoML.

Response 4: Thank you for your helpful suggestion. In the revised manuscript, we have added brief explanatory annotations to Figure 5 to clarify the purpose of each pipeline component. These annotations are intended to improve readability and provide guidance for readers who may be less familiar with AutoML workflows. We believe this enhancement makes the figure more accessible and informative.

(Lines 290-294) Figure 5. TPOT framework for automatic optimization of ML pipelines in flood susceptibility mapping. The pipeline includes preprocessing (categorical and numerical encoding), multiple feature selection strategies (SelectPercentile and SelectFromModel), a stacked MLP model, and the final Gradient Boosting classifier. Brief annotations indicate the role of each component to support readers unfamiliar with AutoML workflows.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

General Comments

Nam et al. implement explainable artificial intelligence (AI) to generate a flood map for Seoul. They tested different AI algorithms and found that the Gradient Boosting model performs best. The results show that surface elevation, slope, stream distance, stream density, and built-up areas are the critical flooding drivers. Data-driven methods have been widely used to identify relationships between flooded areas and relevant factors, so I do not see any novelty in this study. In addition, I find that the method is not robust, for the following reasons. (1) The flood map used to train the model does not look right. Specifically, there is no flooding in the river channel. Without reliable inputs, one cannot get robust results from AI models. (2) Only a single event was tested. Although the authors split the single flood image into training and validation data, both come from the same event. It is unclear whether the model works for another event, or whether the conclusions would change with data from a different event. Overall, I do not think the authors can draw generalized conclusions from a single event. Based on the above concerns, I cannot recommend publication of this manuscript in its current form. Please find additional comments below.

Specific Comments

Line 43 – Line 46: I cannot agree with this statement. High-fidelity hydrodynamic models have been used to simulate flooding dynamics in different environments at very detailed resolution. Such models have also been coupled with hydrological models to understand the impacts of climate change on flooding dynamics. I suggest a comprehensive literature review on flood modeling and an explanation of why a data-driven method is preferred over physical models.

Figure 1: Please specify what the red line is. Is it the watershed boundary or the city border?

Line 128 – Line 130: The flood map does not look right. There is no flooding in the river channel. The quality of the flood map will significantly affect the performance of the model. Please justify the flood map.

Author Response

Response to Reviewer 3

Comment 1: Nam et al. implement explainable artificial intelligence (AI) to generate a flood map for Seoul. They tested different AI algorithms and found that the Gradient Boosting model performs best. The results show that surface elevation, slope, stream distance, stream density, and built-up areas are the critical flooding drivers. Data-driven methods have been widely used to identify relationships between flooded areas and relevant factors, so I do not see any novelty in this study. In addition, I find that the method is not robust, for the following reasons. (1) The flood map used to train the model does not look right. Specifically, there is no flooding in the river channel. Without reliable inputs, one cannot get robust results from AI models. (2) Only a single event was tested. Although the authors split the single flood image into training and validation data, both come from the same event. It is unclear whether the model works for another event, or whether the conclusions would change with data from a different event. Overall, I do not think the authors can draw generalized conclusions from a single event. Based on the above concerns, I cannot recommend publication of this manuscript in its current form. Please find additional comments below.

Response 1: Thank you for your critical and constructive comments. We appreciate your feedback and address the major concerns as follows:

1) Regarding the novelty of the study:

While it is true that data-driven approaches and machine learning models have been widely used for flood susceptibility mapping, our study offers novelty in its methodological integration of evolutionary and Bayesian AutoML optimization (i.e., TPOT and Optuna) combined with SHAP-based explainability. This hybrid framework not only automates model selection and hyperparameter tuning but also enhances interpretability, a key requirement in disaster risk management. To the best of our knowledge, few studies have applied such a combined XAI–AutoML approach in the context of urban flood susceptibility.

2) Regarding the quality of the flood map and missing flood data in river channels:

We acknowledge your concern about the absence of flood pixels in river channels. However, the flood inventory map used in this study is based on officially reported inundation cases from government records, which tend to focus on surface and urban impacts rather than open channel flooding. Moreover, river areas were not labeled as flooded in the inventory because they are hydrologically expected to carry water during rainfall and are not treated as damage zones in administrative data.

3) Regarding model robustness and reliance on a single event:

We agree that model generalization is an important issue. While our analysis is based on one major flooding event, the model was trained and validated using spatial cross-validation to minimize overfitting and ensure spatial independence between training and test data. Furthermore, the objective of this study is to demonstrate the applicability of explainable AutoML frameworks for flood mapping in a real urban scenario, rather than to produce a universally generalizable flood model. Nonetheless, we fully agree that testing the model with data from additional flood events would improve the robustness and generality of the results.
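The response does not describe how the spatial cross-validation was set up. As an illustration only, the sketch below shows one common way to obtain spatially separated folds: samples are grouped into square grid blocks by their coordinates, and whole blocks are held out together, so no test sample shares a block with a training sample. The function name `spatial_block_folds`, the block size, and the toy coordinates are hypothetical and are not taken from the manuscript.

```python
from collections import defaultdict


def spatial_block_folds(coords, block_size, n_folds):
    """Yield (train_idx, test_idx) pairs in which whole grid blocks are held out.

    coords     : list of (x, y) sample coordinates (assumed projected, e.g. metres)
    block_size : side length of the square blocks (hypothetical tuning choice)
    n_folds    : number of folds; must not exceed the number of occupied blocks
    """
    # Group sample indices by the grid block their coordinates fall into.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[(int(x // block_size), int(y // block_size))].append(i)

    block_ids = sorted(blocks)  # sorted for reproducible fold assignment
    if len(block_ids) < n_folds:
        raise ValueError("fewer occupied blocks than folds")

    for k in range(n_folds):
        # Every n_folds-th block (starting at k) forms the test set of fold k.
        test = sorted(i for b in block_ids[k::n_folds] for i in blocks[b])
        held_out = set(test)
        train = [i for i in range(len(coords)) if i not in held_out]
        yield train, test


# Toy example: a 4 x 4 lattice of points split into 2 x 2 blocks.
coords = [(x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
folds = list(spatial_block_folds(coords, block_size=2.0, n_folds=2))
```

Holding out whole blocks reduces leakage between neighbouring pixels, but it does not address the reviewer's separate point about a single flood event: all folds would still come from the same event.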

Comment 2: (Lines 43-46) I cannot agree with this statement. High-fidelity hydrodynamic models have been used to simulate flooding dynamics in different environments at very detailed resolution. Such models have also been coupled with hydrological models to understand the impacts of climate change on flooding dynamics. I suggest a comprehensive literature review on flood modeling and an explanation of why a data-driven method is preferred over physical models.

Response 2: Thank you for your comment. We agree that physically based hydrologic and hydrodynamic models play a central role in flood modeling. Our original statement may have been overly general and unintentionally dismissive. To reflect a more balanced view, we revised the paragraph (Lines 43-46) to acknowledge the strengths of physically based models, while clarifying that our study adopts a data-driven approach as a complementary alternative, particularly suited for capturing complex interactions in urban areas with limited data availability.

(Lines 41-47) In this context, accurate flood susceptibility mapping (FSM) has become a vital tool for disaster risk reduction and urban resilience planning [2]. Physically based hydrologic and hydrodynamic models have been widely used to simulate flood dynamics. However, these approaches often require extensive input data and computational resources. As a complementary alternative, data-driven models offer greater flexibility in capturing complex, nonlinear relationships among environmental and anthropogenic factors, particularly in data-sparse urban settings [3].

Author Response File: Author Response.docx
