Nationwide Prediction of Flood Damage Costs in the Contiguous United States Using ML-Based Models: A Data-Driven Approach
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Area
2.2. Data Compilation and Sources
- Soil moisture and vertical variability: volumetric soil moisture is provided for multiple soil layers, along with soil saturation and ice fraction in the top 0.4 m.
- Evapotranspiration and evaporation processes: the model calculates accumulated total evapotranspiration (ACCET) and the direct soil evaporation rate (EDIR), which together represent the water and energy fluxes of the water cycle.
- Infiltration dynamics: using the Green-Ampt-derived LGARTO soil infiltration scheme, NWM explicitly models vertical infiltration and runoff partitioning. In semiarid regions, it even accounts for channel infiltration, where water is lost from ephemeral streams into the subsurface [22].
- Water body evaporation and reservoir dynamics: the model simulates evaporation from lakes and reservoirs by tracking surface elevation, inflow/outflow, and ponded depth, integrating these into the overall water balance and energy exchange.
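For orientation, the classical single-layer Green-Ampt relations that Green-Ampt-derived schemes build on are sketched below. This is the textbook form, not the layered LGARTO implementation used in NWM, and the symbols ($K_s$, $\psi_f$, $\Delta\theta$, $F$, $f$) are introduced here for illustration only.

```latex
f(t) = K_s \left( 1 + \frac{\psi_f \, \Delta\theta}{F(t)} \right),
\qquad
F(t) - \psi_f \, \Delta\theta \, \ln\!\left( 1 + \frac{F(t)}{\psi_f \, \Delta\theta} \right) = K_s \, t
```

Here $f$ is the infiltration rate, $F$ the cumulative infiltration depth, $K_s$ the saturated hydraulic conductivity, $\psi_f$ the wetting-front suction head, and $\Delta\theta$ the soil moisture deficit. Under ponded conditions, rainfall in excess of $f$ is partitioned to surface runoff.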
2.3. Methodology
2.3.1. Stage 1: Event Data Integration
2.3.2. Stage 2: Data Analysis and Pre-Processing
2.3.3. Stage 3: Predictive Model Development
2.3.4. Stage 4: Framework Deployment and Validation
3. Results
3.1. Stage 1: Data Preparation
3.1.1. Filling the Missing Data for the Median Home Prices Datasets
- Latitude and longitude are used to capture spatial heterogeneity in property values.
- Year and month are used to account for long-term trends and seasonal fluctuations in the housing market.
- Population serves as a proxy for demographic pressure and urban housing demand.
- County GeoID encodes administrative boundaries and captures the effects of the local housing market.
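The predictors listed above can be assembled into a gap-filling step roughly as follows. This is a minimal sketch, not the authors' model: it swaps in inverse-distance-weighted nearest-neighbor averaging as a stand-in estimator, and the record layout, the sample values, the cyclic month encoding, and the `other_county_penalty` parameter are all assumptions made for illustration.

```python
import math

# Hypothetical records: (lat, lon, year, month, population, county_geoid, price).
# price is None where the median home value is missing. All values are made up.
records = [
    (29.76, -95.37, 2015, 6, 4_500_000, "48201", 180_000.0),
    (29.80, -95.40, 2016, 6, 4_600_000, "48201", 190_000.0),
    (32.78, -96.80, 2015, 6, 2_500_000, "48113", 160_000.0),
    (29.78, -95.38, 2016, 3, 4_600_000, "48201", None),
]

def feature_vector(rec):
    """Numeric predictors: location, year, cyclic month, log population."""
    lat, lon, year, month, pop, _geoid, _price = rec
    # Encode the month on a circle so December and January are neighbors.
    angle = 2 * math.pi * (month - 1) / 12
    return [lat, lon, float(year), math.sin(angle), math.cos(angle), math.log10(pop)]

def fit_scaler(vectors):
    """Per-feature minimum and range for min-max scaling."""
    cols = list(zip(*vectors))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    return lows, spans

def scale(vec, lows, spans):
    return [(v - lo) / sp for v, lo, sp in zip(vec, lows, spans)]

def impute_prices(recs, k=2, other_county_penalty=1.0):
    """Fill missing prices from the k nearest records with known prices."""
    lows, spans = fit_scaler([feature_vector(r) for r in recs])
    known = [r for r in recs if r[6] is not None]
    filled = []
    for rec in recs:
        if rec[6] is not None:
            filled.append(rec)
            continue
        target = scale(feature_vector(rec), lows, spans)
        scored = []
        for other in known:
            d = math.dist(target, scale(feature_vector(other), lows, spans))
            if other[5] != rec[5]:
                d += other_county_penalty  # prefer neighbors in the same county
            scored.append((d, other[6]))
        scored.sort(key=lambda pair: pair[0])
        nearest = scored[:k]
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
        estimate = sum(w * p for w, (_, p) in zip(weights, nearest)) / sum(weights)
        filled.append(rec[:6] + (estimate,))
    return filled

filled_records = impute_prices(records)
```

Treating the county GeoID as a distance penalty rather than a numeric feature keeps the categorical code from being read as an ordered quantity; any tree-based or other regressor could replace the neighbor average without changing the surrounding steps.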
3.1.2. Exploring Machine Learning Algorithms
3.2. Stage 2: Data Analysis
3.2.1. Model Parameter Sensitivity and Feature Selection
3.2.2. Predictive Features of Importance
3.2.3. Check Data Probability Distribution
3.2.4. Anomaly Detection and Removal
3.3. Stage 3: Model Build
- Configuration 1 (Manual Zoning): The contiguous United States was subdivided into predefined geographic zones for region-specific predictions.
- Configuration 2 (Automated Clustering): The grouping was produced by clustering algorithms (K-Means, Agglomerative, and Gaussian Mixture; see Appendix A) rather than by predefined zones.
- Configuration 3 (Direct Regression): Continuous prediction of damage costs using a Random Forest regression framework.
- Configuration 4 (Hybrid Classification and Regression): A two-step structure in which flood events were first categorized into risk levels (classification) and then refined by regression within each class [16].
3.3.1. Configuration 1—Manual Zoning (Manual Clustering)
3.3.2. Configuration 2—Automated Clustering
3.3.3. Configuration 3—Direct Regression (Approach 1)
3.3.4. Configuration 4—Hybrid Classification + Regression (Approach 2)
- (1) A classification stage assigned each flood event to one of seven ordered risk categories based on damage cost;
- (2) A regression stage then estimated continuous damage values within each category.
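The two stages above can be sketched as a small pipeline. This is an illustration under stated assumptions rather than the authors' implementation: the category thresholds in `CLASS_BOUNDS`, the nearest-neighbor classifier, the per-class least-squares line on the first feature, and the clamping of estimates to the predicted category are all hypothetical stand-ins, and the toy data are invented.

```python
import bisect
import math

# Hypothetical upper bounds (USD) separating seven ordered risk categories 0..6.
CLASS_BOUNDS = [1e3, 1e4, 1e5, 1e6, 1e7, 1e8]

def risk_class(damage_usd):
    """Map a damage cost to an ordered category index in 0..6."""
    return bisect.bisect_right(CLASS_BOUNDS, damage_usd)

def fit_line(xs, ys):
    """Least squares for y = a + b*x; falls back to the mean for flat or single-point samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if n < 2 or sxx == 0.0:
        return mean_y, 0.0
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope

class HybridModel:
    def fit(self, features, damages):
        # Each training row keeps its features, damage, and derived category.
        self.train = [(tuple(f), d, risk_class(d)) for f, d in zip(features, damages)]
        self.lines = {}
        for cls in {c for _, _, c in self.train}:
            members = [(f, d) for f, d, c in self.train if c == cls]
            xs = [f[0] for f, _ in members]
            ys = [math.log10(d) for _, d in members]  # damages must be positive
            self.lines[cls] = fit_line(xs, ys)
        return self

    def predict_class(self, feature):
        # Stage 1 stand-in: category of the nearest training event.
        return min(self.train, key=lambda row: math.dist(row[0], feature))[2]

    def predict(self, feature):
        cls = self.predict_class(feature)
        intercept, slope = self.lines[cls]
        log_damage = intercept + slope * feature[0]
        # Stage 2: keep the estimate inside the bounds of the predicted category.
        lo = math.log10(CLASS_BOUNDS[cls - 1]) if cls > 0 else -math.inf
        hi = math.log10(CLASS_BOUNDS[cls]) if cls < len(CLASS_BOUNDS) else math.inf
        return cls, 10 ** min(max(log_damage, lo), hi)

# Toy training set: two hypothetical features per event and damage in USD.
features = [(1.0, 0.2), (1.5, 0.3), (4.0, 0.8), (4.5, 0.9), (8.0, 1.5)]
damages = [2_000.0, 5_000.0, 200_000.0, 600_000.0, 30_000_000.0]
model = HybridModel().fit(features, damages)
predicted_class, predicted_damage = model.predict((4.2, 0.85))
```

Fitting each regressor only on events of its own category lets rare, high-damage events get a dedicated model instead of being averaged out by the far more numerous low-damage events.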
Classification Process
Regression Process
3.3.5. Comparative Evaluation Across Configurations
3.4. Stage 4: Final User Model
4. Discussion
4.1. Benchmarking Against Previous Studies
4.2. Modeling Approaches for Flood Damage Prediction: Strengths, Limits, and Future Directions
5. Conclusions
- Integration of multi-source datasets, namely hydrologic (NWM), climatic (NOAA), topographic (NED), and socioeconomic (ZHVI, SEER), enabled event-specific, physically consistent predictions.
- Hybrid two-stage design (classification followed by regression) enhanced accuracy, interpretability, and robustness, especially for rare, high-damage events.
- Socioeconomic exposure and vulnerability proved as influential as hydrologic drivers, confirming the interdependence of hazard and exposure in national flood losses.
- The model’s scalable, automated workflow can be updated as new flood records become available, supporting its evolution into a real-time or near-real-time operational platform.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| ACCET | Accumulated Total Evapotranspiration |
| EDIR | Direct Soil Evaporation Rate |
| LGARTO | Layered Green-Ampt with Redistribution and Optimization |
| NCI | National Cancer Institute |
| NED | National Elevation Dataset |
| NOAA | National Oceanic and Atmospheric Administration |
| SEER | US National Cancer Institute’s Surveillance, Epidemiology, and End Results |
| SHAP | SHapley Additive exPlanations |
| USGS | United States Geological Survey |
| WRF-Hydro | Weather Research and Forecasting–Hydrology |
| ZHVI | Zillow Home Value Index |
Appendix A. Clustering Methods Visualization


| Method | cluster_0 R (Test) | cluster_0 R (Train) | cluster_0 Bias (Test) | cluster_0 Bias (Train) | cluster_1 R (Test) | cluster_1 R (Train) | cluster_1 Bias (Test) | cluster_1 Bias (Train) | cluster_2 R (Test) | cluster_2 R (Train) | cluster_2 Bias (Test) | cluster_2 Bias (Train) | cluster_3 R (Test) | cluster_3 R (Train) | cluster_3 Bias (Test) | cluster_3 Bias (Train) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K-Means | 0.369 | 0.944 | 921 | 183 | 0.509 | 0.954 | 431 | 129 | 0.619 | 0.959 | −179 | 123 | 0.516 | 0.956 | −892 | 125 |
| Agglomerative | 0.406 | 0.952 | 480 | 122 | 0.592 | 0.955 | 461 | 138 | 0.365 | 0.955 | 362 | 121 | 0.563 | 0.957 | 4298 | 159 |
| Gaussian Mixture | 0.446 | 0.973 | 528 | 196 | 0.633 | 0.980 | −747 | 126 | 0.452 | 0.952 | 1048 | 173 | 0.506 | 0.956 | −77 | 109 |
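The R and Bias columns above could be computed per cluster along the following lines. This is a hedged sketch: the source does not define Bias in this section, so it is assumed here to be the mean of predicted minus observed values, and the function names and toy numbers are illustrative only.

```python
import math
from collections import defaultdict

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    dev_o = math.sqrt(sum((o - mean_o) ** 2 for o in observed))
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (dev_o * dev_p)

def mean_bias(observed, predicted):
    """Assumed definition: average of (predicted - observed)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

def evaluate_by_cluster(labels, observed, predicted):
    """Return {cluster_label: (R, Bias)} for each cluster present in labels."""
    groups = defaultdict(lambda: ([], []))
    for label, o, p in zip(labels, observed, predicted):
        groups[label][0].append(o)
        groups[label][1].append(p)
    return {label: (pearson_r(o, p), mean_bias(o, p)) for label, (o, p) in groups.items()}

# Toy example with two clusters of three events each (made-up values).
labels = [0, 0, 0, 1, 1, 1]
observed = [10.0, 20.0, 30.0, 5.0, 6.0, 7.0]
predicted = [12.0, 22.0, 32.0, 7.0, 6.0, 5.0]
scores = evaluate_by_cluster(labels, observed, predicted)
```

Running the same function separately on the training and test splits gives the Train and Test pairs; a large gap between them, as in the table, points to overfitting within a cluster.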
Appendix B. Classification Model Results




Appendix C. Additional Interpretability Analyses


References
1. Wagenaar, D.; de Jong, J.; Bouwer, L.M. Multi-variable flood damage modelling with limited data using supervised learning approaches. Nat. Hazards Earth Syst. Sci. 2017, 17, 1683–1696.
2. Alipour, A.; Ahmadalipour, A.; Abbaszadeh, P.; Moradkhani, H. Leveraging machine learning for predicting flash flood damage in the Southeast US. Environ. Res. Lett. 2020, 15, 024011.
3. El-Jabi, N.; Rousselle, J. A Flood Damage Model for Flood Plain Studies. J. Am. Water Resour. Assoc. 1987, 23, 179–187.
4. Khalil, A.F.; McKee, M.; Kemblowski, M.; Asefa, T.; Bastidas, L. Multiobjective analysis of chaotic dynamic systems with sparse learning machines. Adv. Water Resour. 2006, 29, 72–88.
5. Ten Veldhuis, J.A.E. How the choice of flood damage metrics influences urban flood risk assessment. J. Flood Risk Manag. 2011, 4, 281–287.
6. Schröter, K.; Kreibich, H.; Vogel, K.; Riggelsen, C.; Scherbaum, F.; Merz, B. How useful are complex flood damage models? Water Resour. Res. 2014, 50, 3378–3395.
7. Sieg, T.; Vogel, K.; Merz, B.; Kreibich, H. Tree-based flood damage modeling of companies: Damage processes and model performance. Water Resour. Res. 2017, 53, 6050–6068.
8. Wagenaar, D.; Lüdtke, S.; Schröter, K.; Bouwer, L.M.; Kreibich, H. Regional and Temporal Transferability of Multivariable Flood Damage Models. Water Resour. Res. 2018, 54, 3688–3703.
9. Gutenson, J.L.; Ernest, A.N.S.; Oubeidillah, A.A.; Zhu, L.; Zhang, X.; Sadeghi, S.T. Rapid Flood Damage Prediction and Forecasting Using Public Domain Cadastral and Address Point Data with Fuzzy Logic Algorithms. J. Am. Water Resour. Assoc. 2018, 54, 104–123.
10. Ozger, M. Assessment of flood damage behaviour in connection with large-scale climate indices. J. Flood Risk Manag. 2017, 10, 79–86.
11. Vogel, K.; Weise, L.; Schröter, K.; Thieken, A.H. Identifying Driving Factors in Flood-Damaging Processes Using Graphical Models. Water Resour. Res. 2018, 54, 8864–8889.
12. Snehil; Goel, R. Flood Damage Analysis Using Machine Learning Techniques. Procedia Comput. Sci. 2020, 173, 78–85.
13. Lee, K.; Choi, C.; Shin, D.H.; Kim, H.S. Prediction of heavy rain damage using deep learning. Water 2020, 12, 1942.
14. Shaharkar, A.; Sonar, Y.; Sonar, A.; Pawar, C. Flood Damage Estimation using Machine Learning in GIS. Int. Res. J. Eng. Technol. 2020, 7, 5756–5760.
15. Alipour, A.; Ahmadalipour, A.; Moradkhani, H. Assessing flash flood hazard and damages in the southeast United States. J. Flood Risk Manag. 2020, 13, e12605.
16. Yang, Q.; Shen, X.; Yang, F.; Anagnostou, E.N.; He, K.; Mo, C.; Seyyedi, H.; Kettner, A.J.; Zhang, Q. Predicting Flood Property Insurance Claims over CONUS, Fusing Big Earth Observation Data. Bull. Am. Meteorol. Soc. 2022, 103, E791–E809.
17. Harris, R.; Furlan, E.; Pham, H.V.; Torresan, S.; Mysiak, J.; Critto, A. A Bayesian network approach for multi-sectoral flood damage assessment and multi-scenario analysis. Clim. Risk Manag. 2022, 35, 100410.
18. Parvin, F.; Ali, S.A.; Calka, B.; Bielecka, E.; Linh, N.T.T.; Pham, Q.B. Urban flood vulnerability assessment in a densely urbanized city using multi-factor analysis and machine learning algorithms. Theor. Appl. Climatol. 2022, 149, 639–659.
19. Collins, E.L.; Sanchez, G.M.; Terando, A.; Stillwell, C.C.; Mitasova, H.; Sebastian, A.; Meentemeyer, R.K. Predicting Flood Damage Probability across the Conterminous United States. Environ. Res. Lett. 2022, 17, 034006.
20. Johnson, J.M.; Munasinghe, D.; Munasinghe, M.; Cohen, S. Evaluating the National Water Model’s Height Above Nearest Drainage (HAND) Flood Mapping Methodology Across the Continental United States. Nat. Hazards Earth Syst. Sci. 2019, 19, 2405–2420.
21. Wang, Y.; Shen, H.; Xu, H. A Data-Driven Framework for Flood Inundation Forecasting Using the National Water Model and VIIRS Observations. Remote Sens. 2024, 16, 4357.
22. Lahmers, T.M.; Hazenberg, P.; Gupta, H.; Castro, C.; Gochis, D.; Dugger, A.; Yates, D.; Read, L.; Karsten, L.; Wang, Y.-H. Evaluation of NOAA National Water Model Parameter Calibration in Semiarid Environments Prone to Channel Infiltration. J. Hydrometeorol. 2021, 22, 2939–2969.
23. Gesch, D.B.; Oimoen, M.J.; Nelson, G.A.; Steuck, M.; Tyler, D. The National Elevation Dataset. Photogramm. Eng. Remote Sens. 2002, 68, 5–11.
24. Gesch, D.B.; Oimoen, M.J.; Evans, G.A. Accuracy Assessment of the U.S. Geological Survey National Elevation Dataset, and Comparison with Other Large-Area Elevation Datasets—SRTM and ASTER. USGS Open-File Rep. 2014, 2014, 10.
25. National Cancer Institute; National Center for Health Statistics; U.S. Census Bureau. U.S. Population Data SEER Program. Surveillance, Epidemiology, and End Results (SEER) Program. 2025. Available online: https://seer.cancer.gov/data-software/uspopulations.html (accessed on 11 December 2024).
26. Shastry, A.; Durand, M. Utilizing Flood Inundation Observations to Obtain Floodplain Topography in Data-Scarce Regions. Front. Earth Sci. 2019, 6, 243.
27. Abbaszadeh, P.; Moradkhani, H.; Daescu, D.N. The Quest for Model Uncertainty Quantification: A Hybrid Ensemble and Variational Data Assimilation Framework. Water Resour. Res. 2019, 55, 2407–2431.
28. Neri, A.; Villarini, G.; Salvi, K.A.; Slater, L.J.; Napolitano, F. On the Decadal Predictability of the Frequency of Flood Events across the U.S. Midwest. Int. J. Climatol. 2019, 39, 1796–1804.
29. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: New York, NY, USA, 1990.
30. Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
31. Sugar, C.A.; James, G.M. Finding the Number of Clusters in a Dataset: An Information-Theoretic Approach. J. Am. Stat. Assoc. 2003, 98, 750–763.
32. MacQueen, J.B. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965, 27 December 1965–7 January 1966; University of California Press: Berkeley, CA, USA, 1965; Volume 1, pp. 281–297.
33. McLachlan, G.; Peel, D. Finite Mixture Models; Wiley Series in Probability and Statistics; John Wiley & Sons: New York, NY, USA, 2000.
34. Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244.
35. Li, Z.; Tian, J.; Zhu, Y.; Chen, D.; Ji, Q.; Sun, D. A Study on Flood Susceptibility Mapping in the Poyang Lake Basin Based on Machine Learning Model Comparison and SHapley Additive exPlanations Interpretation. Water 2025, 17, 2955.
36. Soliman, M.; Morsy, M.M.; Radwan, H.G. Generalized Methodology for Two-Dimensional Flood Depth Prediction Using ML-Based Models. Hydrology 2025, 12, 223.

| Category | Factor | Description | No. of Citations | Source(s) |
|---|---|---|---|---|
| Climatic (NOAA) | C_P5days | Cumulative precipitation 5 days before the event | New | NOAA |
| Climatic (NOAA) | P7day | Maximum precipitation 7 days before the event | New | NOAA |
| Climatic (NOAA) | P5day | Maximum precipitation 5 days before the event | New | NOAA |
| Climatic (NOAA) | centroid_P | Precipitation at catchment centroid (gridded data) | New | NOAA |
| Climatic (NOAA) | P1day | Precipitation on event day | 4 | NOAA |
| Climatic (NOAA) | P_w_grid | Precipitation from rain stations merged with gridded data | New | NOAA |
| Climatic (NOAA) | Wet | Number of wet days before the event | New | NOAA |
| Climatic (NOAA) | T_min | Minimum daily temperature | New | NOAA |
| Climatic (NOAA) | T_max | Maximum daily temperature | New | NOAA |
| Climatic (NOAA) | T_avg | Average daily temperature | 2 | NOAA |
| Hydrology (NWM) | F_Range | Range of modeled streamflow | New | NWM |
| Hydrology (NWM) | F_Mean | Mean streamflow | New | NWM |
| Hydrology (NWM) | F_Median | Median streamflow | New | NWM |
| Hydrology (NWM) | F_Mode | Mode of streamflow | New | NWM |
| Hydrology (NWM) | Flow | Observed streamflow at event date | 6 | NWM |
| Hydrology (NWM) | F_Max Value | Maximum modeled flow within ±7 days of the event | 6 | NWM |
| Hydrology (NWM) | V_Range | Range of modeled velocity | New | NWM |
| Hydrology (NWM) | V_Mean | Mean velocity | New | NWM |
| Hydrology (NWM) | V_Median | Median velocity | New | NWM |
| Hydrology (NWM) | V_Mode | Mode of velocity | New | NWM |
| Hydrology (NWM) | V_Max Value | Maximum modeled velocity within ±7 days of the event | 6 | NWM |
| Hydrology (NWM) | Flow_Durat | Duration of high flows | 9 | NWM |
| Hydrology (NWM) | Stream_ord | Stream order | New | NWM |
| Hydrology (NWM) | Feature_id | Unique NWM identifier | New | NWM |
| Topography & Catchment | Elevation | Elevation of the event site | 2 | NED |
| Topography & Catchment | cart_avgsl | Average slope of catchment | 2 | NED |
| Topography & Catchment | cat_Area | Contributing catchment area | New | NED |
| Topography & Catchment | DTW | Distance to nearest stream/wadi | New | NED |
| Geographic/Temporal Attributes | Longitude | Event location longitude | 1 | NOAA, US Census, Geodata |
| Geographic/Temporal Attributes | Latitude | Event location latitude | 1 | NOAA, US Census, Geodata |
| Geographic/Temporal Attributes | Time | Event occurrence time | 1 | NOAA, US Census, Geodata |
| Geographic/Temporal Attributes | Month | Event month | 1 | NOAA, US Census, Geodata |
| Geographic/Temporal Attributes | state_code | Unique state identifier | New | NOAA, US Census, Geodata |
| Geographic/Temporal Attributes | city_status | Relation to city boundaries | New | NOAA, US Census, Geodata |
| Geographic/Temporal Attributes | GEOID | Census geographic code | New | NOAA, US Census, Geodata |
| Socioeconomic | Building_A | Nearest building footprint area | New | Esri, Zillow, SEER, US Census |
| Socioeconomic | Price | Median home value * | 3 | Esri, Zillow, SEER, US Census |
| Socioeconomic | Population | County population | 1 | Esri, Zillow, SEER, US Census |

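Several of the climatic factors in the table above are window statistics over a daily precipitation series, which the following sketch illustrates. The exact definitions are not given here, so the choices below are assumptions: antecedent windows exclude the event day, the wet-day count uses the 7 days before the event, and the 1.0 mm wet-day threshold and sample series are hypothetical.

```python
def precipitation_features(daily_mm, event_index, wet_threshold_mm=1.0):
    """Derive antecedent-precipitation factors for one event.

    daily_mm: daily precipitation totals in mm, ordered by date.
    event_index: position of the event day within daily_mm.
    """
    if event_index < 7 or event_index >= len(daily_mm):
        raise ValueError("need 7 days of record before the event day")
    prior5 = daily_mm[event_index - 5:event_index]  # 5 days before the event
    prior7 = daily_mm[event_index - 7:event_index]  # 7 days before the event
    return {
        "C_P5days": sum(prior5),          # cumulative precipitation, 5 days before
        "P5day": max(prior5),             # maximum daily precipitation, 5 days before
        "P7day": max(prior7),             # maximum daily precipitation, 7 days before
        "P1day": daily_mm[event_index],   # precipitation on the event day
        "Wet": sum(1 for p in prior7 if p >= wet_threshold_mm),  # wet days before
    }

# Made-up series: seven antecedent days followed by the event day.
series = [0.0, 12.0, 0.0, 3.0, 0.5, 8.0, 20.0, 45.0]
event_factors = precipitation_features(series, event_index=7)
```

The same slicing pattern extends to the streamflow and velocity statistics (range, mean, median, maximum within a window) by swapping the input series and the aggregation function.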

| Model | Number of Initial Features | Number of Final Features After Sensitivity Analysis |
|---|---|---|
| Config_1 | 38 | Multiple models |
| Config_2 | 38 | Multiple models |
| Config_3 | 38 | 18 |
| Config_4 | 39 * | 10 |

| Trial Number | Number of Clusters | Zone Designation |
|---|---|---|
| Try 1 | 7 | (Northwest), (Southwest), (Central North), (Central South), (Midwest), (Northeast), (Southeast) |
| Try 2 | 5 | (Northwest + Southwest), (Central North + Central South), (Midwest), (Northeast), (Southeast) |
| Try 3 ¹ | 3 | (Northwest + Southwest), (Central North + Central South), (Midwest + Northeast + Southeast) |
| Try 4 | 2 | (Northwest + Southwest + Central North + Central South), (Midwest + Northeast + Southeast) |
| Try 5 | 3 | (Northwest + Southwest), (Central North + Midwest), (Northeast + Central South + Southeast) |
| Try 6 ¹ | 4 | (Northwest + Southwest), (Central North + Central South), (Midwest + Southeast), (Northeast) |
| Try 7 | 2 | (Northwest + Central North + Midwest + Northeast), (Southwest + Central South + Southeast) |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Adel, K.M.; Radwan, H.G.; Morsy, M.M. Nationwide Prediction of Flood Damage Costs in the Contiguous United States Using ML-Based Models: A Data-Driven Approach. Hydrology 2026, 13, 31. https://doi.org/10.3390/hydrology13010031

