Article

Physics-Consistent Overtopping Estimation for Dam-Break Induced Floods via AE-Enhanced CatBoost and TreeSHAP

1. Zhejiang Institute of Hydraulics & Estuary (Zhejiang Institute of Marine Planning & Design), Hangzhou 310020, China
2. School of Hydraulic Engineering, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
3. Zhejiang Key Laboratory of River-Lake Water Network Health Restoration, Hangzhou 310020, China
* Author to whom correspondence should be addressed.
Water 2026, 18(1), 42; https://doi.org/10.3390/w18010042
Submission received: 24 November 2025 / Revised: 16 December 2025 / Accepted: 21 December 2025 / Published: 23 December 2025

Abstract

Dam-break-induced floods can trigger hazardous dike overtopping, demanding predictions that are fast, accurate, and interpretable. We pursue two objectives: (i) introducing a new alpha evolution (AE) optimization scheme to improve tree-model predictive accuracy, and (ii) developing a cluster-wise modeling strategy in which regimes are defined by wave characteristics. Using a dataset generated via physical model experiments and smoothed particle hydrodynamics (SPH) numerical simulations, we first group samples via hierarchical clustering (HC) on the Froude number ($Fr$), wave nonlinearity ($R$), and relative distance to the dike ($D$). We then benchmark CatBoost, XGBoost, and ExtraTrees within each cluster and select the best-performing CatBoost as the baseline, after which we train standard CatBoost and its AE-optimized variant. Under random train–test splits, AE-CatBoost achieves the strongest generalization for predicting the relative run-up distance $H_m$ (test $R^2 = 0.9803$, RMSE = 0.0599), outperforming particle swarm optimization (PSO)- and grid search (GS)-tuned CatBoost. We further perform TreeSHAP analyses on AE-CatBoost for global, local, and interaction attributions. SHAP analysis yields physics-consistent explanations: $D$ dominates, followed by $H$ and $L$, with a weaker positive effect of $Fr$ and minimal influence of $R$; $H \times D$ is the strongest interaction pair. Overall, AE optimization combined with HC-based cluster-wise modeling produces accurate, interpretable overtopping predictions and offers a practical route toward field deployment.

1. Introduction

Dam-break-induced floods generate rapidly propagating bores that can drive extreme wave run-up and dike overtopping, posing acute risks to downstream communities and coastal defenses [1,2,3]. Accurate estimation of the maximum run-up distance is central to design and risk assessment: underestimation can precipitate catastrophic flooding and loss of life, whereas overly conservative allowances inflate construction and maintenance costs. Beyond accuracy alone, engineering practice increasingly requires physics-consistent predictions that not only fit the data but also align with the governing hydrodynamics.
To predict wave run-up and overtopping induced by the propagation of a single wave, studies have employed theoretical models, laboratory experiments, computational fluid dynamics simulations, and field monitoring analysis [4,5,6]. Analytical solutions based on nonlinear shallow-water equations and Boussinesq-class models provide insight into dispersion and nonlinearity but rely on simplifying assumptions about bathymetry, wave form, etc. [7,8,9,10]. Empirical formulas offer practical tools yet remain valid only within tested ranges of wave nonlinearity, water depth, etc. [11,12,13,14,15]. Computational fluid dynamics simulations such as RANS and LES can resolve complicated fluid–structure interactions at the dike toe as well as aeration, but their computational expense confines them to a limited number of realistic case studies [16,17,18,19]. Field observations based on monitoring sensors have improved understanding of run-up processes, yet such datasets are typically scenario-limited [20,21]. Hybrid approaches, such as combined experimental–numerical and analytical–numerical models, have also been explored [22,23,24,25].
With the development of computer science, recent studies have applied machine learning to develop prediction models such as neural networks, random forests, XGBoost, etc. [26,27,28,29,30,31]. Power et al. (2019) studied beach run-up using gene expression programming alongside empirical relationships [32]. Rehman et al. (2022) studied incident wave run-up using response-surface methodology coupled with neural networks [33]. Li et al. (2024) proposed a temporal convolutional neural network approach for run-up on a semi-submersible [34]. Naeini et al. (2024) proposed a physics-informed machine learning model for time-dependent wave run-up prediction [35]. Li et al. (2025) compared LSTM and temporal convolutional network models for semi-submersible run-up [36]. These studies have advanced the state of machine learning for wave run-up problems, delivering notable gains in predictive accuracy while increasingly embedding or validating physical principles. However, most existing models do not explicitly account for heterogeneity across wave regimes, even though different wave characteristics can induce systematically different run-up responses. This omission may limit generalization and blur physical attributions.
Recent studies have increasingly incorporated clustering techniques into data-driven hydrodynamic modeling to account for regime-dependent behavior and to reduce data heterogeneity. Typical clustering-enhanced approaches first partition samples into distinct regimes based on selected physical or statistical features (e.g., wave characteristics or flow conditions), then train either separate predictive models for each cluster or a unified model augmented with regime information. Such strategies allow models to specialize across relatively homogeneous subspaces, and have been shown to improve predictive performance in wave run-up, dam break flows, and related hydraulic problems. For example, a very recent work by Li et al. (2025) introduced K-means clustering to account for wave-characteristic heterogeneity in run-up prediction [37]. While this work demonstrated the potential of clustering-based regime partitioning, it relied on a small-scale flume dataset, which limits similarity scaling and out-of-sample generalization. Moreover, the study did not conduct systematic model optimization or interpretability analyses, leaving open questions regarding robustness across regimes and the physical consistency of the learned relationships.
Motivated by these gaps, we introduce a cluster-wise strategy that partitions data by wave characteristics and then learns regime-specific predictors, with three key improvements: (i) we build upon a combined experimentally- and numerically-generated dataset to broaden similarity coverage and enhance generalizability; (ii) we benchmark tree learners and select the optimal model, then apply a novel optimization scheme to improve predictive accuracy; and (iii) we incorporate SHAP to provide physics-consistent global and local explanations clarifying how the wave parameters jointly shape overtopping outcomes.
This study advances physics-consistent overtopping estimation for dam break-induced floods by coupling hierarchical clustering (HC), alpha evolution (AE)-enhanced CatBoost, and TreeSHAP [38,39]. The dataset is generated via two-dimensional physical model experiments and smoothed particle hydrodynamics (SPH) numerical simulations. First, we develop a cluster-wise modeling strategy in which regimes are defined by wave characteristics and cases are grouped via HC using the Froude number $Fr$, nonlinearity index $R$, and relative distance to the dike $D$ as clustering criteria. Second, we benchmark three state-of-the-art tree learners—CatBoost, XGBoost, and ExtraTrees—and find that CatBoost outperforms the others, motivating its use as the baseline model. The target variable is the relative maximum run-up $H_m$. Third, we introduce an alpha evolution optimization scheme that enhances CatBoost beyond particle swarm optimization (PSO) and grid search (GS), improving its predictive accuracy and stability. Fourth, to ensure physics consistency, we establish a TreeSHAP-based interpretability pipeline on top of AE-CatBoost, providing explanations at the global, local, and interaction levels.
The rest of this paper is organized as follows. Section 2 describes the dataset, including the simplified physical model, the non-dimensional variables, and the physical model experiments and numerical simulations used to generate it. Section 3 presents the methodological framework, consisting of hierarchical clustering, CatBoost with the associated AE optimization, and the principles of TreeSHAP analysis. Section 4 reports the clustering results, the predictive performance of the proposed and baseline models, and SHAP-based interpretability analyses. Section 5 discusses the implications and methodological contributions of this study. Finally, Section 6 summarizes the key findings.

2. Dataset

2.1. Variables and Dimensionless Analysis

Figure 1 illustrates a two-dimensional simplified physical model of impulse-wave propagation toward a dike. Conceptually, the overall process can be divided into three stages. In a pre-stage (not modeled in this study), a dam break event in an upstream reservoir generates an impulse-like solitary wave. After the solitary wave has formed, Stage I (initial stage) describes its propagation over still water of uniform depth $h_0$, characterized by wave height $h_w$, wavelength $l_w$, wave velocity $u_0$, and an initial distance $d$ between the wave front and the dike toe. Stage II (impacting stage) corresponds to the interaction between the wave and the dike, during which the free surface runs up along the slope and reaches a maximum elevation $h_m$ above the still-water level. The present work focuses on the post-dam break evolution; the dam break generation process itself is treated as a pre-stage that is outside the scope of the quantitative analysis. The objective is to predict the maximum run-up height $h_m$ on the dike from the impulse-wave characteristics $(h_w, l_w, u_0, d, h_0)$ observed in the initial stage.
From the physical model in Figure 1, the maximum run-up height $h_m$ on the dike is governed by the incident-wave characteristics and the still-water depth. In dimensional form, this dependence can be written as
$$h_m = f\left(l_w, u_0, h_w, d, h_0, g\right),$$
where $l_w$ is the incident wavelength, $h_w$ is the incident wave height, $u_0$ is the wave velocity at the wave crest, $d$ is the initial distance from the wave front to the dike, $h_0$ is the still-water depth, and $g$ is gravitational acceleration. The ratio of $h_w$ to $l_w$ provides a convenient measure of wave nonlinearity.
To collapse scale effects and expose the governing similarity parameters, we introduce the following dimensionless groups:
$$L = \frac{l_w}{h_0}, \qquad Fr = \frac{u_0}{\sqrt{g h_0}}, \qquad D = \frac{d}{h_0},$$
$$H = \frac{h_w}{h_0}, \qquad R = \frac{H}{L} = \frac{h_w}{l_w}, \qquad H_m = \frac{h_m}{h_0},$$
where $L$ and $H$ are the relative wavelength and wave height, $Fr$ is the Froude number based on the still-water depth, $D$ is the relative distance to the dike, $R$ quantifies wave nonlinearity, and $H_m$ is the relative maximum run-up. Expressed in terms of these dimensionless variables, $(L, D, H, Fr, R)$ serve as the basis for the subsequent data-driven modeling and interpretation.
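For concreteness, the mapping from dimensional wave quantities to these similarity groups can be sketched as follows (the function and variable names are ours, not from the study's code):

```python
import math

def dimensionless_groups(l_w, u_0, h_w, d, h_0, g=9.81):
    """Map dimensional wave parameters to the similarity groups defined above."""
    L = l_w / h_0                   # relative wavelength
    Fr = u_0 / math.sqrt(g * h_0)   # Froude number on still-water depth
    D = d / h_0                     # relative distance to the dike
    H = h_w / h_0                   # relative wave height
    R = h_w / l_w                   # nonlinearity index, R = H / L
    return {"L": L, "Fr": Fr, "D": D, "H": H, "R": R}

# Illustrative flume-scale values (not a case from the dataset):
groups = dimensionless_groups(l_w=2.0, u_0=0.25, h_w=0.15, d=3.0, h_0=0.2)
```

Because all groups are ratios of like quantities, the same function applies unchanged at laboratory and prototype scale.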

2.2. Data Generation

To build the dataset for model training, we combined small-scale physical model tests with companion numerical simulations. Figure 2 shows a sketch of the laboratory flume, which has a total length of 3 m and a width of 0.2 m. The dam model is 30 cm wide at the base and 25 cm high, and the slope is inclined at 45°. Figure 3 shows the impacting stage, in which the dam break-induced impulse wave runs up and overtops the dike. Because the limited flume length restricts the range of incident wave conditions and propagation distances that can be realized experimentally, we complemented the experiments with numerical simulations of the same two-dimensional simplified configuration. The numerical model reproduces the dam break generation of an impulse wave, its subsequent propagation along the flume, and the overtopping of the dike. By extending the propagation distance and systematically varying the governing parameters, the simulations enrich the diversity of wave scenarios while remaining consistent with the physics observed in the experiments. Both the physical and numerical setups are based on the same two-dimensional simplified model of dam break-induced impulse wave propagation and overtopping. Compared with real-world coastal and reservoir geometries, this idealized layout omits detailed bathymetry and topographic complexity, which reduces the amount of information but allows us to focus on a controlled set of core variables and to perform systematic parameter variations. In total, 183 cases were generated from the combined physical and numerical dataset.
The smoothed particle hydrodynamics (SPH) method is adopted to perform the numerical simulations. Owing to its mesh-free and fully Lagrangian formulation, SPH is particularly well suited for modeling highly transient free-surface flows with large deformation, wave breaking, and strong fluid–structure interaction, which are characteristic of dam-break–induced impulse waves and dike overtopping. Based on the SPH simulations, Figure 4 illustrates the temporal evolution of the velocity magnitude during impulse wave propagation and overtopping, highlighting the key stages of wave development, impact, and energy dissipation.
Figure 5 shows histograms of the five dimensionless input variables for the 183 cases generated from the physical model experiments and numerical simulations. All variables are approximately unimodal, with limited skewness and no pronounced outliers. Specifically, $D$ concentrates around 5 to 20 with a mild right tail; $Fr$ is tightly grouped between 0.035 and 0.055 with a peak near 0.04, consistent with a low-Froude regime; $H$ lies mainly between 0.5 and 1.1 with a mode near 0.6 to 0.8; $L$ covers 5 to 12 and exhibits slight positive skew, with a few larger wavelengths up to 15; and $R$ centers at 0.07 to 0.11, with occasional cases as low as 0.05 and as high as 0.17. Overall, the sample provides broad yet balanced coverage of dam break-generated impulse wave conditions, supplying sufficient variability for learning the relation between the wave characteristics during propagation and overtopping of the sea dike.

3. Methods

3.1. Modeling Framework

In regression prediction and interpretability, combining tree models with SHAP is particularly advantageous. Tree-based learners flexibly capture nonlinear relations and higher-order interactions, while TreeSHAP provides exact, additively consistent Shapley attributions for ensembles, enabling quantitative explanations at both global and local levels.
Prior to model training, we clustered the dataset by wave characteristics using a hierarchical clustering method to stratify wave regimes, reduce heterogeneity, and enable stratified evaluation and subgroup-wise interpretation. We then compared three tree models that have proven strong in recent studies—CatBoost, XGBoost, and ExtraTrees—and selected CatBoost as the baseline based on its superior test-set performance in terms of $R^2$ and RMSE. Next, to further improve accuracy, we applied an alpha evolution (AE) scheme for hyperparameter optimization and contrasted AE-CatBoost with PSO-CatBoost, GS-CatBoost, and an unoptimized CatBoost. The results show that AE-CatBoost delivers gains in both accuracy and stability, confirming the effectiveness of AE for hyperparameter tuning. Building upon the selected predictive model, we constructed a TreeSHAP analysis framework to systematically assess the marginal contributions and interaction effects of the inputs $(D, Fr, H, L, R)$ on the output $H_m$.
In Figure 6, we present the mathematical underpinnings of the hierarchical clustering (HC), the chosen CatBoost tree model, the alpha evolution (AE) optimization scheme, and the TreeSHAP analysis. The pseudocode is provided in Appendix A.

3.2. Hierarchical Clustering

We stratify the data by wave characteristics via agglomerative hierarchical clustering (HC) on three features $(Fr, R, D)$ [40,41,42]. Figure 7 illustrates the principle of the HC method.
To remove scale effects, each feature is standardized using statistics computed on the training set to avoid leakage. For sample $i$,
$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad j \in \{Fr, R, D\},$$
yielding $z_i = [z_{i,Fr}, z_{i,R}, z_{i,D}] \in \mathbb{R}^3$. The Euclidean distance is used in the standardized space.
We adopt agglomerative clustering with Ward linkage. Starting from singletons, at each step we merge the pair of clusters that produces the smallest increase in the within-cluster sum of squares (WSS). Let $\mathcal{C} = \{C_1, \ldots, C_k\}$ denote the current partition, with centroids $\bar{z}_C$. The within-cluster dispersion is
$$\mathrm{WSS}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sum_{z_i \in C} \left\| z_i - \bar{z}_C \right\|_2^2.$$
Ward linkage selects the merge that minimizes the increment $\Delta \mathrm{WSS}$; equivalently, it minimizes the increase in the total error sum of squares at each agglomeration step.
To determine the number of clusters, we cut the dendrogram at the $k$ that maximizes the Calinski–Harabasz (CH) index over a candidate range (e.g., $k = 2, \ldots, 6$). Define the total sum of squares
$$\mathrm{TSS} = \sum_{i=1}^{n} \left\| z_i - \bar{z} \right\|_2^2, \qquad \mathrm{BSS}(k) = \mathrm{TSS} - \mathrm{WSS}(\mathcal{C}_k),$$
where $\bar{z}$ is the global mean and $\mathcal{C}_k$ is the $k$-cluster partition. The CH index is
$$\mathrm{CH}(k) = \frac{\mathrm{BSS}(k)/(k-1)}{\mathrm{WSS}(\mathcal{C}_k)/(n-k)}.$$
We set $k^{*} = \arg\max_k \mathrm{CH}(k)$ and use the resulting labels for downstream modeling.
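The Ward-linkage clustering and CH-index cut can be illustrated with a minimal sketch on synthetic stand-in data (the real standardized $(Fr, R, D)$ features are not reproduced here); `ch_index` implements the CH formula above directly:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ch_index(Z, labels):
    """Calinski-Harabasz index: [BSS/(k-1)] / [WSS/(n-k)]."""
    n = len(Z)
    zbar = Z.mean(axis=0)
    clusters = np.unique(labels)
    wss = sum(((Z[labels == c] - Z[labels == c].mean(axis=0)) ** 2).sum()
              for c in clusters)
    tss = ((Z - zbar) ** 2).sum()
    k = len(clusters)
    return ((tss - wss) / (k - 1)) / (wss / (n - k))

rng = np.random.default_rng(0)
# Two synthetic regimes standing in for the standardized (Fr, R, D) space.
Z = np.vstack([rng.normal(0.0, 0.3, (40, 3)), rng.normal(2.0, 0.3, (40, 3))])

link = linkage(Z, method="ward")        # agglomerative merges, Ward criterion
scores = {k: ch_index(Z, fcluster(link, t=k, criterion="maxclust"))
          for k in range(2, 7)}         # candidate cluster counts k = 2..6
k_best = max(scores, key=scores.get)    # cut the dendrogram at argmax CH(k)
```

For two well-separated synthetic regimes, the CH index peaks at `k_best = 2`, mirroring the two-regime result reported in Section 4.1.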

3.3. CatBoost Prediction

This study employs CatBoost (Categorical Boosting) to construct a regression model for $H_m$ [43]. See Figure 8 for the principle of the CatBoost algorithm.
Let the input vector be $x = (D, Fr, H, L, R)$ and the response be $H_m$. CatBoost is an additive model based on gradient boosting. After $M$ trees, the prediction is
$$\hat{H}_m(x) = F_M(x) = \sum_{m=1}^{M} \eta_m T_m(x).$$
Here, $\eta_m \in (0, 1]$ is the learning rate and $T_m(\cdot)$ is the $m$-th base learner. A depth-$d$ symmetric tree contains $2^d$ leaves, each with a constant value. To minimize the empirical risk, this study adopts the squared error loss
$$\mathcal{L} = \sum_{i=1}^{N} \left( H_{m,i} - F(x_i) \right)^2.$$
At iteration $m$, CatBoost fits the negative gradient of the previous model. Denote the first- and second-order derivatives for sample $i$ as
$$g_i^{(m)} = \left. \frac{\partial \mathcal{L}}{\partial F} \right|_{F = F_{m-1}(x_i)}, \qquad h_i^{(m)} = \left. \frac{\partial^2 \mathcal{L}}{\partial F^2} \right|_{F = F_{m-1}(x_i)};$$
then, the output of each leaf of the $m$-th tree is updated by a Newton-like step
$$v = -\frac{\sum_i g_i^{(m)}}{\sum_i h_i^{(m)} + \lambda},$$
where $\lambda > 0$ is an $\ell_2$ regularization term on the leaf values. Under squared error, $g_i^{(m)} = F_{m-1}(x_i) - H_{m,i}$ and $h_i^{(m)} = 1$. Hyperparameters (tree depth $d$, number of boosting iterations $M$, learning rate $\eta$, regularization $\lambda$, etc.) are selected via alpha evolution (AE) optimization. The primary metrics are the root mean squared error (RMSE) and the coefficient of determination $R^2$.
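A minimal numerical illustration of the Newton-like leaf update under squared error (a didactic sketch of the formula above, not CatBoost's internal implementation):

```python
import numpy as np

def leaf_value(y, pred_prev, lam=1.0):
    """Newton-step leaf value v = -sum(g) / (sum(h) + lam) for samples in one leaf."""
    g = pred_prev - y        # first derivative under squared error: F - y
    h = np.ones_like(y)      # second derivative under squared error: 1
    return -g.sum() / (h.sum() + lam)

y = np.array([1.0, 2.0, 3.0])
v_plain = leaf_value(y, pred_prev=np.zeros(3), lam=0.0)  # -> 2.0, the mean residual
v_reg = leaf_value(y, pred_prev=np.zeros(3), lam=1.0)    # -> 1.5, shrunk toward zero
```

With $\lambda = 0$ the leaf value reduces to the mean residual of the samples in the leaf; increasing $\lambda$ shrinks the leaf output toward zero, which is the regularizing role it plays in the loss.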

3.4. Alpha Evolution Optimization

To obtain a high-accuracy and stable predictor for H m from inputs D , Fr , H , L , R , we optimize CatBoost hyperparameters with an efficient evolutionary scheme termed alpha evolution (AE) [44]. Alpha evolution (AE) integrates domain-scaled exploration, adaptive evolution paths, and elite–non-elite contrastive guidance, enabling efficient global search while maintaining stable convergence. Unlike surrogate-based Bayesian optimization or covariance matrix-based strategies, AE avoids strong distributional assumptions and high computational overhead, making it particularly suitable for the mixed and nonlinear hyperparameter space of CatBoost. See Figure 9 for the flowchart of AE optimization.
The decision vector comprises the CatBoost hyperparameters:
$$\theta = (d, M, \eta, \lambda).$$
AE solves a bounded minimization of the cross-validated RMSE:
$$\min_{\theta \in [\mathrm{lb}, \mathrm{ub}]} f(\theta), \qquad f(\theta) = \mathrm{RMSE}_{\mathrm{CV}},$$
where K-fold CV is stratified by the hierarchical-clustering wave regimes; early stopping is applied within each fold.
A population $X \in \mathbb{R}^{N \times d}$ (row $X_i$ is individual $i$) is initialized uniformly:
$$X_{ij} \sim U(\mathrm{lb}_j, \mathrm{ub}_j), \qquad \mathrm{FEs} \leftarrow \mathrm{FEs} + N.$$
At each generation, evaluate $f(X_i)$, set $E \leftarrow X$, and sort the indices $\mathrm{ind} = \mathrm{argsort}\, f(X)$. Domain-scaled exploratory steps are
$$\Delta r = (\mathrm{ub} - \mathrm{lb}) \odot (2R_1 - R_2) \odot R_2 \odot S, \qquad R_1, R_2 \sim U(0, 1)^{N \times d}, \quad S \sim \mathrm{Bernoulli}(0.5)^{N \times d},$$
and the progress factor is
$$\alpha = \exp\!\left( \ln\!\left(1 - \frac{\mathrm{FEs}}{\mathrm{MaxFEs}}\right) \cdot 4\left(\frac{\mathrm{FEs}}{\mathrm{MaxFEs}}\right)^{2} \right).$$
A global reference $P \in \mathbb{R}^{d}$ is formed by reweighted subpopulation aggregation. Let $K = \lceil N \cdot \mathrm{rand}() \rceil$, draw $I_1 = \mathrm{randperm}(N, K)$, set $B = X(I_1, :)$, choose nonnegative weights $\omega$ with $\mathbf{1}^{\top}\omega = 1$, and define
$$c_a = 1 - \mathrm{FEs}/\mathrm{MaxFEs}, \qquad P \leftarrow c_a P + (1 - c_a) \sum_{j=1}^{K} \omega_j B_{j:}.$$
For contrastive guidance, pick $u$ from the elite half of $\mathrm{ind}$ and $v$ from the remainder, set $W_i = X_u$ and $L_i = X_v$, and draw a perturbation weight
$$\vartheta_i = I_2 \cdot \mathbf{1}_{1 \times d} \cdot \mathrm{rand}(0, 2) + (1 - I_2) \cdot \mathrm{rand}(0, 1)^{1 \times d}$$
with $I_2 \sim \mathrm{Bernoulli}(0.5)$. Then, update individual $i$ by
$$E_i^{t+1} = P + \alpha \Delta r_i + \vartheta_i \odot \left( W_i + E_i^{t} - P - L_i \right).$$
Apply bound handling and greedy selection:
$$E_i^{t+1} = \min\!\left( \max\!\left( E_i^{t+1}, \mathrm{lb} \right), \mathrm{ub} \right),$$
$$X_i^{t+1} = \begin{cases} E_i^{t+1}, & f(E_i^{t+1}) \le f(X_i^{t}), \\ X_i^{t}, & \text{otherwise}. \end{cases}$$
Increase $\mathrm{FEs}$ accordingly and stop when $\mathrm{FEs} \ge \mathrm{MaxFEs}$. The best $\theta$ is then used to train the final CatBoost model for predicting $H_m$ from $(D, Fr, H, L, R)$.
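As a simplified, illustrative sketch of an AE-style bounded search (it keeps the decaying progress factor, elite-based reference, contrastive elite versus non-elite difference, and greedy selection, but does not reproduce the exact operators of [44]; all names and constants below are our own choices, and a simple sphere function stands in for the CatBoost CV objective):

```python
import numpy as np

def ae_sketch(f, lb, ub, n_pop=20, max_fes=2000, seed=0):
    """Simplified alpha-evolution-style minimizer of f over the box [lb, ub]."""
    rng = np.random.default_rng(seed)
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    d = lb.size
    X = rng.uniform(lb, ub, (n_pop, d))          # uniform initialization
    fit = np.array([f(x) for x in X])
    fes = n_pop
    while fes < max_fes:
        order = np.argsort(fit)
        alpha = (1.0 - fes / max_fes) ** 2       # decaying progress factor
        P = X[order[: n_pop // 2]].mean(axis=0)  # reference from the elite half
        for i in range(n_pop):
            u = order[rng.integers(0, n_pop // 2)]      # elite individual
            v = order[rng.integers(n_pop // 2, n_pop)]  # non-elite individual
            step = alpha * (ub - lb) * rng.normal(0.0, 0.1, d)
            cand = P + step + rng.random(d) * (X[u] - X[v])
            cand = np.clip(cand, lb, ub)         # bound handling
            fc = f(cand)
            fes += 1
            if fc <= fit[i]:                     # greedy replacement
                X[i], fit[i] = cand, fc
            if fes >= max_fes:
                break
    best = np.argmin(fit)
    return X[best], fit[best]

lb, ub = -5.0 * np.ones(3), 5.0 * np.ones(3)
x_best, f_best = ae_sketch(lambda z: float((z ** 2).sum()), lb, ub)
```

In the study's setting, `f` would instead wrap a cross-validated CatBoost fit returning the fold-averaged RMSE, with integer hyperparameters rounded inside the objective.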

3.5. TreeSHAP Analysis

To quantify the marginal contributions of the inputs to $\hat{H}_m$, this study employs TreeSHAP based on Shapley values [45,46,47]. Figure 10 illustrates the principle of integrating tree models and SHAP analysis.
For any sample $x$, the prediction decomposes as
$$F(x) = \phi_0 + \sum_{j=1}^{5} \phi_j(x),$$
where $\phi_0 = \mathbb{E}[F(X)]$ is the baseline output (global expectation) and $\phi_j(x)$ denotes the contribution of feature $j \in \{D, Fr, H, L, R\}$ at $x$, satisfying the additivity property. TreeSHAP provides exact and efficient Shapley computations for tree models. Global importance is measured by the sample average of absolute SHAP values:
$$I_j = \mathbb{E}_x \left| \phi_j(x) \right|,$$
which yields an ordering of $D, Fr, H, L, R$ by their average impact on $H_m$. Local interpretation uses the set $\{\phi_j(x)\}$ for a single sample to explain how its prediction deviates from $\phi_0$.
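The additivity property can be checked with a brute-force Shapley computation on a toy model (an illustrative sketch with our own helper names; TreeSHAP obtains the same attributions in polynomial time for tree ensembles):

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, X_bg):
    """Exact Shapley attributions by enumerating all feature coalitions.
    Features outside a coalition are averaged over background samples, so
    phi_0 + sum(phi) equals f(x) exactly (the additivity/efficiency property)."""
    n = len(x)

    def value(S):
        # Expected model output with features in S fixed to x.
        Z = X_bg.copy()
        Z[:, list(S)] = x[list(S)]
        return float(np.mean([f(z) for z in Z]))

    phi = np.zeros(n)
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for S in itertools.chain.from_iterable(
                itertools.combinations(others, r) for r in range(n)):
            w = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                 / math.factorial(n))
            phi[j] += w * (value(set(S) | {j}) - value(set(S)))
    return value(set()), phi  # (baseline phi_0, per-feature contributions)

rng = np.random.default_rng(1)
X_bg = rng.normal(size=(50, 3))               # background distribution
model = lambda z: 2.0 * z[0] + z[1] * z[2]    # toy model with an interaction
x = np.array([1.0, 0.5, -2.0])
phi0, phi = shapley_values(model, x, X_bg)
```

This enumeration costs $O(2^n)$ coalition evaluations, which is why exact tree-specific algorithms such as TreeSHAP are needed for realistic models.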

4. Results

Figure 11 illustrates the Pearson correlation heatmap of the five input variables ($D, Fr, H, L, R$). Overall, pairwise correlations are modest. The largest absolute value is between $L$ and $R$ ($r = 0.53$), which remains below a common collinearity threshold of 0.6; hence, all five variables are retained as predictors. The relatively high $L$–$R$ association is physically reasonable: $R$ characterizes wave nonlinearity and $L$ reflects wavelength, which are inherently related. It is nevertheless useful to distinguish nonlinearity from amplitude- and length-related parameters, because the wave type can alter propagation and run-up.
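A screening of this kind can be sketched as follows (a generic helper of our own, not the study's code), flagging predictor pairs whose |Pearson r| reaches a chosen collinearity threshold:

```python
import numpy as np

def collinear_pairs(X, names, threshold=0.6):
    """Return (name_i, name_j, r) for column pairs with |Pearson r| >= threshold."""
    C = np.corrcoef(X, rowvar=False)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(C[i, j]) >= threshold:
                flagged.append((names[i], names[j], round(float(C[i, j]), 2)))
    return flagged

rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
# Synthetic demo: third column is a linear function of the first (r = 1).
X_demo = np.column_stack([a, b, 2.0 * a + 1.0])
flagged = collinear_pairs(X_demo, ["L", "Fr", "R"], threshold=0.6)
```

Applied to the study's feature matrix, no pair would be flagged at a 0.6 threshold, consistent with retaining all five predictors.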

4.1. Clustering Results

Figure 12 illustrates the Calinski–Harabasz (CH) index across candidate cluster counts $k \in \{2, 3, 4, 5, 6\}$. The index attains its global maximum at $k = 2$ with $\mathrm{CH} = 148.5$; for $k = 3$–6, $\mathrm{CH}(k) \in [118, 123]$, clearly below the two-cluster solution, yielding $k^{*} = 2$. Physically, this aligns with two regimes: a nearshore, more nonlinear, higher-Froude regime (small $D$, large $R$, higher $Fr$) favoring larger $H_m$, and an offshore, weakly nonlinear, lower-Froude regime (large $D$, small $R$, lower $Fr$) that tends to suppress $H_m$.
Figure 13 shows the Ward-linkage dendrogram. The two main clusters merge only at a Ward distance of 22, whereas most within-branch merges occur at distances smaller than 12, indicating much larger inter-cluster than intra-cluster merge costs and a stable two-cluster structure. Interpreted by variable meaning, one branch aggregates cases with small $D$, large $R$, and higher $Fr$ (energy/nonlinearity-driven uprush, larger $H_m$), while the other gathers cases with large $D$, small $R$, and lower $Fr$ (dissipation-dominated mild conditions, smaller $H_m$).
Figure 14 shows the pairwise scatter plots and marginal histograms in $(Fr, R, D)$ space. Cluster 0 (blue) concentrates in $D \in [12, 22]$, $R \in [0.07, 0.11]$, $Fr \in [0.035, 0.045]$ (offshore, weakly nonlinear, low-Froude); Cluster 1 (orange) concentrates in $D \in [6, 14]$, $R \in [0.10, 0.15]$, $Fr \in [0.045, 0.055]$ (nearshore, more nonlinear, higher-Froude). Overlap is confined to a narrow band with $R$ between 0.10 and 0.11, $D$ between 12 and 14, and $Fr$ between 0.04 and 0.06, marking transitional cases. Larger $H$ and higher $Fr$ increase the specific energy and momentum flux, larger $R$ strengthens bore-like fronts, and larger $D$ increases dissipation; the observed structure matches the expected effects on $H_m$. Given the moderate $L$–$R$ correlation ($r = 0.53 < 0.6$), both are retained to disentangle wavelength- versus nonlinearity-driven mechanisms.
Figure 15 is the 3D scatter in $(Fr, R, D)$ showing voxel-level separation. Cluster 0 occupies $D \gtrsim 12$, $R \lesssim 0.11$, $Fr \lesssim 0.045$; Cluster 1 concentrates in $D \lesssim 14$, $R \gtrsim 0.10$, $Fr \gtrsim 0.045$, intersecting only near $R = 0.10$–0.11 and $D = 12$–14. This geometry suggests that in the low-$D$, high-$R$, high-$Fr$ volume, TreeSHAP will yield positive contributions to $H_m$, whereas high-$D$ regions will show negative contributions ($\phi_D < 0$); the effect of $L$ is typically nonlinear or thresholded, with potential saturation at extreme values. Overall, the 3D separation is consistent with Figure 12, Figure 13 and Figure 14 and supports cluster-stratified validation and interpretation.

4.2. Prediction Results

We adopt an 80/20 random split: we train on 80% of the observations and hold out the remaining 20% as testing data. Figure 16 shows scatter plots of measured versus predicted values for the six models on the training and test sets. We first evaluated three tree-based baselines—CatBoost, XGBoost, and ExtraTrees—and found that CatBoost performed best ($R^2 = 0.9639$, RMSE = 0.0810 on the testing dataset). Using CatBoost as the baseline, we then compared three hyperparameter optimization strategies: AE-CatBoost as the proposed model, with PSO-CatBoost and GS-CatBoost as validation models. Across panels, most points cluster near the 1:1 line. AE-CatBoost exhibits the tightest cloud with the smallest deviations, indicating the strongest fit and generalization. PSO-CatBoost and GS-CatBoost follow, while XGBoost and ExtraTrees show greater dispersion. To further highlight the performance gain, Figure 16 should be interpreted in conjunction with the evaluation metrics in Figure 17 and the residual distributions in Figure 18. While the scatter plots provide a qualitative comparison across models, the residual histograms and KDE curves offer a clearer quantitative contrast between the baseline CatBoost and AE-CatBoost.
Figure 17 shows radar plots comparing $R^2$ and RMSE across all models. AE-CatBoost sits closest to the outer ring on $R^2$ and the inner ring on RMSE, achieving train/test $R^2 = 0.9783/0.9803$ and RMSE $= 0.0498/0.0599$. Relative to the CatBoost baseline (test $R^2 = 0.9639$, RMSE = 0.0810), AE-CatBoost increases the test $R^2$ by 0.0164 and reduces the test RMSE by 0.0211 (a 26% reduction), and it also lowers the test residual variance from 0.00652 to 0.00360 (a 45% reduction). PSO-CatBoost ranks second with test $R^2 = 0.9751$ and RMSE = 0.0673, a 17% error reduction versus the baseline. GS-CatBoost follows closely at $R^2 = 0.9725$ and RMSE = 0.0707 (a 13% reduction). In contrast, XGBoost ($R^2 = 0.9596$, RMSE = 0.0857) and ExtraTrees ($R^2 = 0.9457$, RMSE = 0.0994) trail behind; AE-CatBoost cuts the test error by roughly 30% and 40% compared with these two models, respectively. The small train–test gaps for AE-CatBoost further indicate stable generalization rather than overfitting. Overall, CatBoost outperforms XGBoost and ExtraTrees among the baselines, and the AE enhancement yields a clear and quantitatively larger gain than either PSO or grid search.
Figure 18 shows histograms and kernel density estimates of the residuals for the training and test sets. The bars show normalized histograms and the solid curves are kernel density estimates (KDEs); the KDE is a non-parametric estimator of the probability density function that places a smooth kernel at each residual and sums them to produce a continuous density curve. Residuals are centered around zero and are approximately Gaussian, with no obvious systematic bias. AE-CatBoost displays the sharpest peak, the narrowest spread, and the shortest tails, reflecting the lowest variance and the most stable errors. PSO-CatBoost and GS-CatBoost come next, followed by the baseline CatBoost, with XGBoost and ExtraTrees exhibiting wider spreads and heavier tails. These residual patterns corroborate the quantitative results in Figure 17 and further confirm AE-CatBoost as the best-performing model.
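The KDE construction described here can be sketched with `scipy.stats.gaussian_kde` on synthetic stand-in residuals (the actual model residuals are not reproduced):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 0.06, 500)       # stand-in for test-set residuals

kde = gaussian_kde(residuals)                # Gaussian kernel at each residual
grid = np.linspace(-0.3, 0.3, 601)
density = kde(grid)                          # continuous density curve
area = density.sum() * (grid[1] - grid[0])   # Riemann sum; ~1 for a density
mode = grid[np.argmax(density)]              # peak location; ~0 if unbiased
```

A peak near zero and a narrow spread correspond to the unbiased, low-variance residual pattern reported for AE-CatBoost; heavier tails would instead indicate occasional large errors, as seen for XGBoost and ExtraTrees.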

4.3. TreeSHAP Analysis Results

Based on the above evaluation of predictive performance, we selected AE-CatBoost as the tree model to pair with TreeSHAP for the subsequent interpretability analysis. AE-CatBoost delivers the highest test $R^2$, the lowest RMSE and residual variance, and a small train–test gap, indicating strong generalization. This choice supports stable local and global attributions, enables clear quantification of key feature importances and interactions, and provides a more reliable basis for decision-making and model refinement. Figure 19 shows the mean absolute SHAP value of each input variable, quantifying each feature's global contribution to the model output. It can be seen that $D$ is the most dominant input variable, with a mean $|\mathrm{SHAP}|$ of 0.125; $H$ and $L$ are the next most important influencing factors, while $Fr$ and $R$ contribute relatively weakly to the run-up distance. The mean $|\mathrm{SHAP}|$ quantifies effect size; however, it does not convey the direction of the effect, that is, whether increasing a feature raises or lowers $H_m$. This direction depends on the local context and can be shown using a SHAP bee-swarm plot.
Figure 20 presents the SHAP summary distribution for the AE-CatBoost model predicting the relative maximum run-up $H_m$. Each dot corresponds to one sample, and the color encodes the feature value from low (blue) to high (red). Red points indicate a local increase in $H_m$, while blue points indicate a decrease. The distribution shows that the relative distance to the dike $D$ is the most influential driver: high $D$ values cluster on the positive side and low $D$ values on the negative side, indicating a near-monotonic positive relationship with $H_m$. Mechanistically, a longer propagation distance allows waves to evolve and shoal over a greater fetch, accumulating momentum before impact and producing higher run-up. The relative wave height $H$ and wavelength $L$ also exhibit clear positive effects; red points for these two features lie predominantly to the right of zero, showing that taller incident waves and longer relative wavelengths carry more energy and momentum, pushing the uprush farther up the slope and increasing $H_m$. The Froude number $Fr$ shows a similar but weaker pattern: low $Fr$ tends to reduce $H_m$, while high $Fr$ tends to increase it, consistent with greater momentum flux and an approach to supercritical conditions enhancing run-up. $R$ contributes only weakly overall and is centered near zero, with a slight negative tendency at higher values. Stronger nonlinearity can trigger earlier breaking and greater energy dissipation, which limits the ultimate uprush and slightly suppresses $H_m$. In summary, the importance ranking is $D$ first, followed by $H$ and $L$, then $Fr$, with $R$ exerting the smallest influence. The signs and magnitudes of these effects align with hydrodynamic mechanisms: longer propagation distance, larger wave height, longer wavelength, and higher Froude number generally promote larger run-up, whereas stronger nonlinearity tends to damp it through dissipation.
Figure 21 shows the heatmap of pairwise feature-interaction importance. The diagonal terms reflect main effects, among which D dominates, matching the global importance ranking. Most off-diagonal entries are small, indicating a largely additive model. The standout interaction is the coupling between H and D, which exceeds all other pairs. Physically, greater wave height combined with a longer relative distance to the dike enhances shoaling and momentum buildup over the propagation path, amplifying uprush more than either factor alone. Interactions involving R remain negligible, consistent with its limited incremental value once H and L are accounted for.
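A hedged sketch of how such a heatmap is assembled: given a (samples × features × features) interaction tensor, as returned in practice by TreeSHAP interaction values, each heatmap entry is the mean absolute interaction value for that feature pair. The tensor below is synthetic, constructed only to reproduce the qualitative pattern (a dominant D main effect and an H × D coupling).

```python
import numpy as np

rng = np.random.default_rng(2)
features = ["D", "H", "L", "Fr", "R"]
n, p = 300, 5
phi_int = rng.normal(0, 0.002, size=(n, p, p))
phi_int = 0.5 * (phi_int + phi_int.transpose(0, 2, 1))  # enforce pairwise symmetry
phi_int[:, 0, 0] += rng.normal(0, 0.12, n)              # main effect of D on the diagonal
coupling = rng.normal(0, 0.03, n)                       # hypothetical H x D coupling
phi_int[:, 0, 1] += coupling
phi_int[:, 1, 0] += coupling                            # keep the tensor symmetric

heat = np.abs(phi_int).mean(axis=0)                     # Figure-21-style heatmap entries
off = heat - np.diag(np.diag(heat))                     # zero out the main-effect diagonal
i, j = np.unravel_index(off.argmax(), off.shape)
print("strongest interaction pair:", features[i], features[j])
```

The symmetry check and the dominant off-diagonal pair mirror the two observations drawn from Figure 21: an essentially additive model with H × D as the strongest coupling.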

5. Discussion

5.1. Methodological Contribution

Targeting dam break-induced flood propagation and dike overtopping, we establish CatBoost as the baseline and introduce alpha evolution (AE)-based hyperparameter optimization. AE-CatBoost attains a higher test R 2 , lower RMSE and residual variance, and a smaller train–test gap, indicating stronger generalization under the present data and variable setting. The dataset is split into training and test sets by proportional random sampling, without temporal structure; future assessments could add k-fold cross-validation and bootstrap confidence intervals for robustness. Methodologically, this work contributes in two ways: first, we propose a new AE-driven optimization scheme for tree models in which the evolutionary search adapts CatBoost hyperparameters to a regime-stratified objective and consistently enhances performance (higher test R 2 , lower RMSE , reduced residual variance); second, we stratify the data by hierarchical clustering on wave characteristics (Froude number Fr , nonlinearity index R, and relative distance to the dike D) and train a predictive model within each cluster. This yields sharper local fits, captures regime-dependent behavior, and improves interpretability by aligning learned relationships with distinct hydrodynamic regimes.
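The cluster-wise strategy can be sketched in a few lines: fit one model per regime and route each sample to its regime’s model. For a self-contained illustration we replace CatBoost with ordinary least squares and use synthetic two-regime data; the regime split, labels, and coefficients are hypothetical.

```python
import numpy as np

# Cluster-wise modeling sketch: one model per wave regime, with samples
# routed by their cluster label. CatBoost is replaced by least squares and
# the data are synthetic; regimes, labels, and coefficients are hypothetical.
rng = np.random.default_rng(3)
n = 600
X = rng.uniform(0.0, 1.0, size=(n, 3))            # stand-ins for (Fr, R, D)
labels = (X[:, 2] > 0.5).astype(int)              # two regimes split on "D"
coefs = {0: np.array([1.0, 0.2, 3.0]),            # regime-dependent response
         1: np.array([0.5, -0.1, 5.0])}
y = np.array([X[i] @ coefs[labels[i]] for i in range(n)])

# Fit one least-squares model per cluster.
models = {}
for c in np.unique(labels):
    mask = labels == c
    models[c], *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)

# Route each sample to its cluster's model for prediction.
y_hat = np.array([X[i] @ models[labels[i]] for i in range(n)])
rmse = float(np.sqrt(np.mean((y - y_hat) ** 2)))
print("cluster-wise RMSE:", rmse)
```

Because the synthetic response is exactly linear within each regime, the per-cluster fits recover it to machine precision, which is the idealized version of the “sharper local fits” claimed above; a single pooled linear model could not match both coefficient sets at once.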

5.2. Physics-Consistent Interpretability

TreeSHAP analyses provide a global ranking that aligns with the mechanics of dam-break waves impinging on a dike. The relative distance to the dike D is the dominant driver, followed by the relative wave height H and relative wavelength L; the Froude number Fr exerts a weaker but generally positive influence, while the nonlinearity indicator R contributes minimally. SHAP dependence patterns reveal nearly monotonic positive effects for D, H, and L, with the effect of D strengthening over the range D of 12 to 15, consistent with the idea that once the propagation distance is sufficient, shoaling and momentum buildup become more effective and amplify overtopping. The SHAP interaction matrix indicates a largely additive model overall, yet highlights H × D as the strongest coupling: larger incident waves traveling over longer distances accumulate energy and momentum in a way that amplifies run-up beyond simple linear superposition. Note that these attribution patterns are derived from global statistics aggregated over the full dataset and remain consistent across wave-regime clusters and train–test splits rather than arising from isolated cases or local effects. This stability across regimes and data partitions reduces the likelihood of coincidental correlations and supports the interpretation that the model has captured robust hydrodynamic mechanisms governing dam-break wave propagation and overtopping.

5.3. Implications for Practice

The dataset was generated using physical model experiments and numerical simulations of a two-dimensional idealized flume. It is designed to isolate fundamental hydrodynamic controls, such as distance to the dike, wave height, wavelength, Froude number, and nonlinearity, under controlled conditions, thereby enabling robust methodological development and physics-consistent interpretation. While this controlled configuration facilitates systematic parameter variation and physics-consistent learning, it may limit direct extrapolation to real-world terrains characterized by curved channels, irregular cross-sections, vegetation roughness, and built structures. In such environments, additional dissipation, wave scattering, and boundary effects may degrade predictive performance. Future work will incorporate high-resolution DEMs, realistic boundary conditions, and field observations to quantitatively assess model transferability and robustness.

6. Conclusions

This study investigates dam break-induced flood propagation and dike overtopping. We first benchmarked strong tree learners (XGBoost and ExtraTrees), finding that CatBoost outperforms both. This motivated us to focus our optimization efforts on CatBoost and to develop the AE-optimized variant (AE-CatBoost). Methodologically, we (i) introduce an alpha evolution (AE) optimization scheme that enhances tree-model predictive accuracy beyond particle swarm optimization (PSO) and grid search (GS); (ii) develop a cluster-wise modeling framework in which regimes are defined by wave characteristics and cases are grouped via hierarchical clustering using Fr , R, and D as clustering criteria; and (iii) establish a physics-consistent interpretability pipeline with SHAP (TreeSHAP) to quantify global importance, local effects, and pairwise interactions.
AE-CatBoost achieves the strongest generalization performance (e.g., test R 2 = 0.9803 , RMSE = 0.0599 ), outperforming the untuned CatBoost baseline, PSO and GS-optimized variants, and alternative tree models including XGBoost and ExtraTrees, with a concurrent reduction in residual variance. By incorporating hierarchical clustering based on Fr , R, and D, the proposed cluster-wise modeling strategy sharpens local fits and better captures regime-dependent behavior. Furthermore, TreeSHAP analyses yield physics-aligned explanations consistently identifying D as the dominant controlling factor, followed by strong positive contributions from H and L, a weaker positive influence of Fr , and a minimal role of R. The interaction between H and D emerges as the strongest coupling, consistent with momentum accumulation during wave propagation over longer distances.
One limitation of this study is that the dataset relies on an idealized flume with simplified topography, which may constrain extrapolation to complex field terrains and boundary conditions. Future work will integrate high-resolution DEMs and realistic boundaries and will adopt robustness-enhancing evaluation such as k-fold cross-validation and uncertainty-aware metrics.

Author Contributions

Conceptualization, H.L. and Z.M.; methodology, Y.F. and Z.M.; validation, X.Z., L.W. and J.Z.; formal analysis, H.L.; writing—original draft preparation, H.L.; writing—review and editing, Z.M.; supervision, L.W.; project administration, H.L.; funding acquisition, H.L. and Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Dean’s Research Fund of Zhejiang Institute of Hydraulics & Estuary (Grant No. ZIHE22Q004), the Zhejiang Provincial Natural Science Foundation of China (Grant No. LZJWY24E090005), and the Central Guidance Funds for Science and Technology Local Development Projects (Grant No. 2025ZY01091).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Pseudocode of the Proposed Modeling Procedure

Algorithm A1 illustrates the pseudocode of HC–AE–CatBoost–SHAP modeling. First, the raw dataset is split into training and test sets and wave-related variables are standardized. Agglomerative hierarchical clustering on standardized ( F r , R , D ) is then used to identify wave regimes, which define stratified folds for cross-validation. Alpha evolution (AE) is employed to search CatBoost hyperparameters by minimizing the regime-stratified CV-RMSE, yielding the optimal configuration. A final CatBoost model is trained on the full training set, and TreeSHAP is applied to this model to obtain global feature importance along with local sample-wise explanations for H m .
Algorithm A1 HC–AE–CatBoost–SHAP pseudocode
Require: Dataset D = {(x_i, y_i)}, i = 1, …, N
Require: Hyperparameter bounds [lb, ub]
Require: AE population size N_pop, maximum function evaluations MaxFEs
Require: Candidate cluster numbers k ∈ {2, …, 6}, number of CV folds K
 1: Split D into training set D_train and test set D_test
 2: On D_train, compute μ_j, σ_j for j ∈ {Fr, R, D}
 3: Standardize Fr, R, D in both sets using μ_j, σ_j
▹ Hierarchical clustering on wave features
 4: Extract standardized (Fr_i, R_i, D_i) as z_i from D_train
 5: for k = 2 to 6 do
 6:     Apply agglomerative HC (Ward linkage, Euclidean distance) to obtain partition C_k
 7:     Compute WSS(C_k), BSS(k), and CH(k)
 8: end for
 9: Set k* ← argmax_k CH(k)
10: Obtain cluster labels c_i for D_train under k*
11: Assign labels to D_test by the nearest-centroid rule
12: Use {c_i} to build wave-regime-stratified K folds for CV
▹ Alpha evolution (AE) hyperparameter optimization
13: Define decision vector θ = (d, M, η, λ)
14: Initialize population X ∈ R^(N_pop × d) uniformly in [lb, ub]
15: Set function evaluation counter FEs ← 0
16: for each individual X_i in the population do
17:     f(X_i) ← CV_RMSE(X_i, D_train, {c_i}, K)
18:     FEs ← FEs + 1
19: end for
20: while FEs < MaxFEs do
21:     Generate exploratory steps Δr based on [lb, ub]
22:     Compute progress factor α as a function of FEs
23:     Construct global reference P via subpopulation aggregation
24:     for i = 1 to N_pop do
25:         Select elite index u and non-elite index v from the sorted population
26:         Set W_i ← X_u, L_i ← X_v
27:         Sample perturbation weight vector ϑ_i
28:         Propose trial individual E_i^(t+1) ← P + α Δr_i + ϑ_i ⊙ (W_i + X_i^t − P − L_i)
29:         Project E_i^(t+1) onto [lb, ub] (bound handling)
30:         Evaluate f(E_i^(t+1)) ← CV_RMSE(E_i^(t+1), D_train, {c_i}, K)
31:         FEs ← FEs + 1
32:         if f(E_i^(t+1)) ≤ f(X_i^t) then
33:             X_i^(t+1) ← E_i^(t+1)
34:         else
35:             X_i^(t+1) ← X_i^t
36:         end if
37:     end for
38: end while
39: Select θ* as the best individual (minimal f)
▹ Final training and evaluation
40: Train CatBoost model F* on D_train with θ* and early stopping
41: Evaluate F* on D_test to obtain test RMSE and R 2
▹ TreeSHAP analysis
42: Compute baseline output φ_0 = E[F*(X)]
43: for each sample x_i in the dataset do
44:     Use TreeSHAP to compute φ_j(x_i) for j ∈ {D, Fr, H, L, R}
45: end for
46: Compute global importance I_j = E_x[|φ_j(x)|]
47: Generate global SHAP plots and local explanations from {φ_j(x_i)}
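The core AE trial-vector update (step 28) can be exercised on a toy objective. The sketch below minimizes a 4-D sphere function in place of the regime-stratified CV-RMSE; the elite/non-elite selection rule, exploratory step scale, and progress-factor schedule are illustrative simplifications rather than the published algorithm’s exact settings.

```python
import numpy as np

# Toy run of the AE trial-vector update from Algorithm A1 on a 4-D sphere
# function instead of the regime-stratified CV-RMSE. The update
#   E = P + alpha * dr + theta * (W + X - P - L)
# follows step 28; selection rules and schedules are simplified sketches.
rng = np.random.default_rng(4)

def f(x):                                    # stand-in objective, minimum at 0
    return float(np.sum(x ** 2))

lb, ub, dim, n_pop, max_fes = -5.0, 5.0, 4, 20, 2000
X = rng.uniform(lb, ub, size=(n_pop, dim))
fit = np.array([f(x) for x in X])
fes = n_pop
f_start = float(fit.min())

while fes < max_fes:
    order = np.argsort(fit)                  # best-to-worst indices
    P = X[order[: n_pop // 2]].mean(axis=0)  # elite aggregate reference
    alpha = 1.0 - fes / max_fes              # progress factor shrinks over time
    for i in range(n_pop):
        u = order[rng.integers(0, n_pop // 2)]             # elite index (W)
        v = order[rng.integers(n_pop // 2, n_pop)]         # non-elite index (L)
        dr = rng.normal(0.0, 1.0, dim) * (ub - lb) * 0.05  # exploratory step
        theta = rng.uniform(0.0, 1.0, dim)                 # perturbation weights
        E = P + alpha * dr + theta * (X[u] + X[i] - P - X[v])
        E = np.clip(E, lb, ub)                             # bound handling
        fE = f(E)
        fes += 1
        if fE <= fit[i]:                                   # greedy replacement
            X[i], fit[i] = E, fE

print(f_start, "->", float(fit.min()))
```

The greedy replacement makes each individual’s fitness non-increasing, so the population best can only improve as the evaluation budget is consumed, mirroring lines 32–36 of the pseudocode.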

Figure 1. Physical model of impulse-wave propagation: (a) initial stage and (b) impacting stage.
Figure 2. Sketch of the two-dimensional experimental flume with a dike.
Figure 3. Representative snapshots of the dam break-induced impulse wave impacting and overtopping the dike.
Figure 4. Velocity magnitude of a dam break-induced impulse wave during propagation and overtopping at t = 0.00 s, 1.00 s, 1.09 s, and 2.20 s, obtained from SPH numerical simulations.
Figure 5. The distribution of the five input variables: (a) D, (b) Fr , (c) H, (d) L, (e) R.
Figure 6. Overall flow of the HC–AE–CatBoost–SHAP modeling process.
Figure 7. The principle of the hierarchical clustering (HC) method.
Figure 8. Principle of the CatBoost algorithm.
Figure 9. Alpha evolution (AE) hyperparameter optimization for CatBoost.
Figure 10. Principle of the interpretation model based on a tree-based prediction model and SHAP analysis.
Figure 11. Pearson correlation heatmap of the five input variables ( D , Fr , H , L , R ).
Figure 12. Selection of optimal cluster number based on CH index.
Figure 13. The evolution of Ward distance of all samples distinguishing the two clusters.
Figure 14. The pairwise scatter plots and marginal histograms in ( Fr , R , D ) space distinguishing the two clusters.
Figure 15. The 3D scatter in ( Fr , R , D ) space distinguishing the two clusters.
Figure 16. Comparison of the predicted and recorded data for both training data and testing data: (a) CatBoost, (b) AE-CatBoost, (c) XGBoost, (d) ExtraTrees, (e) PSO-CatBoost, (f) GS-CatBoost.
Figure 17. Radar plots of (a) R 2 and (b) RMSE for the six prediction models.
Figure 18. Histogram of residual of the six prediction models: (a) training data and (b) testing data.
Figure 19. The absolute mean SHAP values of the five explanatory variables.
Figure 20. SHAP summary distribution based on the AE-CatBoost model.
Figure 21. Mean absolute SHAP value, indicating the interaction among each feature.

Share and Cite

MDPI and ACS Style

Li, H.; Fan, Y.; Meng, Z.; Zhang, X.; Zhang, J.; Wang, L. Physics-Consistent Overtopping Estimation for Dam-Break Induced Floods via AE-Enhanced CatBoost and TreeSHAP. Water 2026, 18, 42. https://doi.org/10.3390/w18010042

