Cluster-Aware Prediction of Rainfall-Induced Landslide Run-Out Distance Using AE-Optimized LightGBM with TreeSHAP Interpretation

Li, Dan; Wu, Kuanghuai; Li, Yiming; Huang, Jian; Liu, Xian

doi:10.3390/w18060740


by Dan Li ¹, Kuanghuai Wu ²,*, Yiming Li ¹, Jian Huang ¹ and Xian Liu ³

¹ Smart Building College, Guangzhou City Polytechnic, Guangzhou 511300, China
² School of Civil Engineering, Guangzhou University, Guangzhou 510006, China
³ School of Civil Engineering, Sun Yat-sen University, Zhuhai 519082, China
* Author to whom correspondence should be addressed.

Water 2026, 18(6), 740; https://doi.org/10.3390/w18060740

Submission received: 27 December 2025 / Revised: 28 February 2026 / Accepted: 4 March 2026 / Published: 22 March 2026

(This article belongs to the Special Issue Approaches to Water-induced Landslide Hazard Risk Forecasting and Assessment)


Abstract

Accurate prediction of landslide run-out distance is fundamental to hazard mapping, emergency planning, and risk-informed engineering design. However, many data-driven studies implicitly treat landslides as a homogeneous population and provide limited, physically interpretable insights into how geomorphic factors govern run-out behavior. To address these limitations, we propose a cluster-aware and explainable modeling framework to predict run-out distance L using four source-region and slope descriptors: crown–toe relief H, source area A, source volume V, and mean source-slope inclination $\theta$. The dataset consists of 10,159 rainfall-induced landslides compiled from official inventories and peer-reviewed literature. After standardizing predictors, the optimal number of clusters is determined using information criteria (AIC/BIC), followed by k-means clustering to identify distinct landslide regimes. We first benchmark Random Forest, eXtreme Gradient Boosting, CatBoost, and LightGBM on identical data splits without hyperparameter tuning, using $R^2$, RMSE, and MAE as performance metrics. LightGBM consistently outperforms the alternatives and is therefore selected as the base learner. Within each cluster, LightGBM is further optimized using the Alpha Evolution (AE) algorithm, with Particle Swarm Optimization and Bayesian Optimization serving as benchmarks. The resulting AE-LightGBM model achieves the highest predictive accuracy across clusters. Model interpretability is achieved using TreeSHAP, which decomposes predictions into cluster-specific baselines and additive contributions from H, A, V, and $\theta$. By integrating regime-sensitive learning with robust explainability, the proposed framework improves run-out distance prediction while providing transparent, physically meaningful insights to support scenario analysis and engineering decision-making.

Keywords: landslide run-out distance; k-means clustering; LightGBM; hyperparameter optimization; rainfall-induced landslide; explainable machine learning

1. Introduction

Landslides are sudden, far-reaching, and often cascading disasters that cause loss of life, property damage, infrastructure disruption, and secondary hazards [1,2]. Improving landslide prediction is fundamental to risk identification, spatial planning, engineering protection, and emergency management. Among candidate indicators, the run-out distance—the horizontal travel from the source to the distal deposit—most directly delineates the affected footprint. The magnitude of run-out distance depends on the geometry of the landslide source area and the slope, including crown–toe relief, source area, source volume, mean inclination, etc. [3,4]. Developing a robust, interpretable predictor of run-out distance from widely available descriptors enables rapid regional screening and supports site-scale engineering decisions, strengthening both scientific rigor and operational viability.

Four research lines have shaped the understanding of landslide run-out distance [5]. Analytical and semi-analytical models derive travel bounds from dynamics and energy considerations through effective friction, path roughness, and conversion between potential and kinetic energy; they offer clear physical interpretability and modest data requirements, but rely on simplifying assumptions and can struggle to represent complex topography or heterogeneous materials [6,7,8,9,10]. Numerical simulations reconstruct motion and deposition under detailed topography and material parameters. They provide high spatial fidelity and process richness, yet demand substantial input data and computational resources and can be sensitive to rheology choices and parameter uncertainty [11,12,13]. Physical modeling uses scaled experiments with controlled boundary conditions to test mechanisms and to provide calibration evidence; it enables direct observation and repeatable hypothesis testing, while remaining constrained by scaling laws, boundary effects, and practical limits on the range of conditions explored [14]. Empirical and statistical approaches fit transferable relations from historical cases, using geometric and topographic descriptors to quantify run-out under diverse settings; they are efficient to implement and validate across large datasets, but may face extrapolation risk, confounding from omitted variables, and reduced performance when regime heterogeneity is pronounced [15,16,17,18,19,20]. Despite their theoretical rigor, traditional physically based and empirical models rely on simplified mechanical assumptions and often require parameters that are difficult to measure at regional scales. These approaches are typically developed under specific geomorphic or material conditions, which may limit their applicability when event characteristics vary substantially. 
Moreover, heterogeneity in landslide size, slope geometry, and material properties can challenge the generalizability of single-regime formulations.

Building on these foundations, data-driven learning has emerged as a complementary avenue that leverages growing event inventories to provide flexible function approximation, scalable inference, and reproducible validation [21,22,23]. Machine learning methods impose fewer explicit assumptions regarding functional form and can flexibly capture nonlinear interactions among predictors [24,25]. Such approaches are particularly advantageous when dealing with heterogeneous, multi-regime datasets compiled from diverse environments [26,27]. However, data-driven models may suffer from reduced interpretability and limited extrapolation capability beyond the training domain [28]. Machine learning for landslide prediction spans several families: linear and generalized linear methods, kernel and distance-based methods, neural networks and deep learning, and tree-based ensembles [29,30,31]. For the nonlinearity, scale effects, and interactions typical of run-out, tree ensembles offer strong engineering practicality. Random Forest, gradient boosting, CatBoost, and LightGBM capture nonlinear structure and higher-order interactions, remain robust to mixed feature scales after standardization, and train efficiently [32,33,34,35]. Crucially, the SHAP framework provides theory-grounded attributions for trees, with TreeSHAP delivering exact Shapley values in polynomial time while satisfying local accuracy, missingness, and consistency [36,37,38]. This pairing unifies high predictive performance and rigorous interpretability within one framework.

In addition, most existing machine learning studies implicitly assume a single, homogeneous population and therefore overlook heterogeneity in geomorphic behavior. Here, we adopt a cluster-aware modeling framework in which statistically identified groups are interpreted as regime-like behavioral partitions rather than strictly physics-defined process regimes. Moreover, interpretability analyses are often conducted at the global level, making it difficult to assess whether feature attributions remain consistent and physically meaningful across different landslide types. These limitations motivate the need for a regime-aware and explainable modeling framework.

In this study, we address two core scientific questions. First, can a cluster-aware approach that partitions the population and models each group separately improve generalization and stability over a single-population baseline built from widely available descriptors? Second, can the resulting models provide principled, cross-cluster-comparable, and physically consistent additive explanations that support mechanism insight and decision transparency?

To address these questions, we develop a cluster-aware and explainable pipeline grounded in widely available predictors and a large, vetted dataset. We adopt four widely used, readily obtainable predictors: H, A, V, and $\theta$. The dataset comprises 10,159 rainfall-induced landslides compiled from official inventories and peer-reviewed publications, enabling stable clustering and cluster-wise modeling. The methodological framework is structured as follows. We standardize the four predictors, determine the number of clusters by information criteria under a spherical-mixture approximation, and fit k-means clustering [39,40]. Before any tuning, we benchmark Random Forest (RF), eXtreme Gradient Boosting (XGB), CatBoost, and vanilla LightGBM on consistent splits using $R^2$, RMSE, and MAE, then select LightGBM as the base learner [41,42]. Within each cluster, we perform a constrained hyperparameter search with the Alpha Evolution (AE) algorithm and use Particle Swarm Optimization (PSO) and Bayesian Optimization (BO) as budget-matched baselines [43]. From the optimized configurations, we select AE-LightGBM as the final model. For interpretability, we apply TreeSHAP with cluster-specific background distributions to decompose predictions into a baseline plus additive contributions from the four predictors, yielding explanations that are consistent and comparable across clusters.

The remainder of the paper is organized as follows. Section 2 describes the dataset and variable definitions. Section 3 presents the methodology, including the clustering criteria and the k-means method, the principles of LightGBM, the Alpha Evolution (AE) optimization, and the TreeSHAP implementation. Section 4 reports cluster-wise predictions and compares them with PSO- and BO-optimized baselines. Section 5 discusses the effects of cluster imbalance on performance and the rationale for choosing tree ensembles with TreeSHAP. Section 6 concludes with key findings and outlines directions for future work.

2. Dataset

2.1. Variables Definition

Figure 1 presents a simplified physical model for rainfall-induced landslides moving downslope and depositing. The landslide initiates in the source area, then transitions into a flow zone where displaced material travels along the slope, and finally decelerates and accumulates within the deposition zone. The boundary between the source area and the remainder of the landslide is usually well defined, whereas the interface between the flow and deposition zones can be more subjective due to gradual transitions and spatially variable geomorphic expression. Within this framework, the explanatory variables include the vertical distance between the landslide crown and its accumulation toe H, the area of the landslide source A, the volume of the landslide source V, and the average inclination of the source slope section $\theta$. The target variable is the landslide run-out distance L, defined as the horizontal projection of the line linking the upper part of the landslide source and the outermost edge of the landslide deposits.
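As a minimal illustration of these definitions, the sketch below derives H and L from two surveyed points. The function name, the coordinate convention (x, y, z in metres), and the example coordinates are hypothetical, and treating the crown as the "upper part of the landslide source" is an assumption made only for this example.

```python
import math


def relief_and_runout(crown, toe):
    """Return (H, L) for a crown point and a distal deposit toe point.

    Each point is an (x, y, z) tuple in metres (hypothetical convention).
    H is the vertical drop from crown to toe; L is the horizontal
    projection of the straight line joining the two points.
    """
    dx = toe[0] - crown[0]
    dy = toe[1] - crown[1]
    relief = crown[2] - toe[2]
    runout = math.hypot(dx, dy)
    return relief, runout


# Hypothetical coordinates, not taken from the dataset.
H, L = relief_and_runout(crown=(0.0, 0.0, 100.0), toe=(30.0, 40.0, 20.0))
print(f"H = {H} m, L = {L} m")
```

Note that L ignores the elevation difference by construction, so two events with the same map-view travel but different relief share the same L and differ only in H.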

2.2. Data Distribution

The dataset comprises 10,159 rainfall-induced landslide events compiled from official inventories and peer-reviewed publications [44,45,46,47]. All records correspond to historical landslide events that occurred in nature. The selection of predictor variables was guided by availability and comparability across reported cases. For most records, only geometric and slope descriptors, i.e., crown–toe relief H, source area A, source volume V, and mean slope inclination $\theta$, were consistently documented. Additional factors such as lithology, path roughness, or hydrological conditions were unavailable or inconsistently reported and would have significantly reduced the usable sample size. Therefore, the analysis focuses on widely reported geometric predictors commonly used in empirical run-out studies, while acknowledging that incorporating material and path-related variables remains an important direction for future work. As the dataset was compiled from multiple inventories and published case reports, a harmonization procedure was applied prior to modeling to ensure consistent measurement definitions. Only records reporting geometrically defined source parameters were retained. All variables were converted to consistent SI units and screened for geometric plausibility. Records with incomplete parameters or internally inconsistent geometry were removed. These procedures reduce inter-inventory bias and ensure that the compiled dataset represents a comparable statistical population describing landslide source geometry and mobility rather than differences in reporting methodology.

Figure 2 shows the data distributions of the four input variables H, A, V, and $\theta$, as well as the target variable L. The geomorphic size variables are right-skewed with heavy tails: H concentrates within 0–150 m but extends to roughly 400 m; A is mostly below 300 m² with occasional cases beyond 1500 m²; V is largely under 200 m³ yet reaches about 4000 m³ for a few events; and L is mostly under 300 m with some values exceeding 1000 m. The inclination $\theta$ exhibits an approximately unimodal, near-Gaussian distribution centered around 25° to 50° with an overall range of 15° to 75°. The broad coverage and pronounced scale variability indicated by these distributions highlight substantial heterogeneity in rainfall-induced landslides and provide a suitable foundation for training high-capacity, data-driven predictive models. The statistical characteristics shown in Figure 2 are directly relevant to the subsequent modeling strategy. The wide magnitude range and heavy-tailed behavior of H, A and V indicate heterogeneous event scales, which motivates partitioning the dataset into clusters prior to regression. Moreover, the comparatively narrower variability of $\theta$ and the physical interpretation of H as a proxy for potential energy help explain their different contributions in the explainable learning analysis presented later.

3. Methods

As presented in Section 2, we model the landslide run-out distance L as a function of four predictors describing the source-region geometry and slope: crown–toe relief H, source area A, source volume V, and mean inclination of the source-slope section $\theta$. The workflow comprises three stages. First, we standardize H, A, V, and $\theta$, determine the optimal number of clusters using information criteria, and separate the dataset into several groups using the k-means method. Second, within each cluster, we train a LightGBM model to predict L from $x = (H, A, V, \theta)$, with hyperparameters optimized by the Alpha Evolution (AE) algorithm and benchmarked against Particle Swarm Optimization (PSO) and Bayesian Optimization (BO). Third, we explain the model using TreeSHAP. Before cluster-wise optimization, we compare Random Forest (RF), eXtreme Gradient Boosting (XGB), CatBoost, and LightGBM using consistent data splits and the metrics $R^2$, RMSE, and MAE. LightGBM offers the best performance and is therefore selected as the base learner. See Figure 3 for the overall pipeline of the method.

3.1. k-Means Clustering

The dataset is classified using k-means clustering [48,49]. All predictors are standardized to zero mean and unit variance. For a given K, k-means partitions the feature space of $x = (H, A, V, \theta)$ by minimizing the within-cluster sum of squares

$$\mathrm{WCSS}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, \tag{1}$$

where $C_k$ is the set of samples assigned to cluster k and $\mu_k$ is its centroid. Adopting the spherical-Gaussian-mixture approximation of k-means, the shared variance is estimated as

$$\sigma^2 = \frac{\mathrm{WCSS}(K)}{N d}, \tag{2}$$

with N the number of samples and $d = 4$ the number of predictors. The maximized log-likelihood is

$$\ell(K) = -\frac{N d}{2} \log\left(2 \pi \sigma^2\right) - \frac{N d}{2}. \tag{3}$$

Counting $K d$ centroid parameters, $K - 1$ mixture weights, and the shared variance gives

$$p = K d + (K - 1) + 1. \tag{4}$$

The information criteria are then computed as

$$\mathrm{AIC}(K) = 2p - 2\ell(K), \qquad \mathrm{BIC}(K) = p \log N - 2\ell(K). \tag{5}$$

We evaluate these scores for $K = 1, \dots, K_{\max}$ and prioritize BIC when the two disagree. In our data, the information-criterion analysis supports $K = 4$, and we therefore fit k-means with $K = 4$ and assign each sample a cluster label $c \in \{1, 2, 3, 4\}$ for downstream modeling.

It is important to clarify that the clustering step is not intended to identify physically discrete geomorphic process classes. Landslide behavior is expected to follow complex, anisotropic, and heavy-tailed distributions, which cannot be strictly represented by spherical mixture models. In this study, k-means serves as a pragmatic partitioning operator that separates the heterogeneous population into statistically comparable subsets, enabling conditional learning and stable model interpretation. Therefore, the clusters should be interpreted as behavioral partitions in feature space rather than strict physical regimes, and the predictive conclusions do not rely on the generative validity of the spherical assumption.
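The scoring in Eqs. (1)–(5) can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' implementation: the function names and the tiny two-dimensional toy data are hypothetical, and the use of the population variance (dividing by N) in the standardization step is an assumption.

```python
import math


def standardize(column):
    """Scale one predictor column to zero mean and unit variance.

    Assumes the population variance (divide by N) and a non-constant column.
    """
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]


def wcss(points, labels, centroids):
    """Within-cluster sum of squares, Eq. (1)."""
    total = 0.0
    for x, k in zip(points, labels):
        total += sum((xi - mi) ** 2 for xi, mi in zip(x, centroids[k]))
    return total


def information_criteria(wcss_value, n, d, k):
    """AIC and BIC under the spherical-mixture approximation, Eqs. (2)-(5)."""
    sigma2 = wcss_value / (n * d)                                # Eq. (2)
    loglik = -0.5 * n * d * (math.log(2 * math.pi * sigma2) + 1)  # Eq. (3)
    p = k * d + (k - 1) + 1                                      # Eq. (4)
    aic = 2 * p - 2 * loglik                                     # Eq. (5)
    bic = p * math.log(n) - 2 * loglik
    return aic, bic


# Hypothetical toy data: four 2-D points already assigned to two clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]

w = wcss(points, labels, centroids)
aic, bic = information_criteria(w, n=len(points), d=2, k=2)
print(f"WCSS = {w}, AIC = {aic:.3f}, BIC = {bic:.3f}")
```

In a model-selection loop, the same function would be called once per candidate K with the WCSS returned by the k-means fit, and the K with the lowest BIC (or the elbow of the curve) would be retained.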

3.2. Cluster-Wise Modeling Based on AE-LightGBM

3.2.1. Principles of LightGBM

Within each cluster k, we predict L from $x = (H, A, V, \theta)$ using LightGBM, which constructs an additive ensemble of regression trees [50,51]:

$$F_T(x) = \sum_{t=1}^{T} f_t(x). \tag{6}$$

At boosting round t, LightGBM minimizes a second-order Taylor approximation of the regularized objective around the current predictions $\hat{y}_i^{(t-1)}$:

$$\mathrm{Obj}_t \approx \sum_i \left[ g_i \, \Delta f_t(x_i) + \frac{1}{2} h_i \, \Delta f_t(x_i)^2 \right] + \Omega(f_t), \tag{7}$$

where $g_i = \partial \ell(\hat{y}_i, L_i) / \partial \hat{y}_i$ and $h_i = \partial^2 \ell(\hat{y}_i, L_i) / \partial \hat{y}_i^2$ are the first and second derivatives of the loss, $\Delta f_t$ is the contribution of the new tree, and the tree complexity penalty is

$$\Omega(f_t) = \gamma \, \#\mathrm{leaves} + \frac{\lambda}{2} \sum_j w_j^2, \tag{8}$$

with $w_j$ the leaf values, $\gamma$ a penalty on leaf count, and $\lambda$ the $\ell_2$ regularization strength. For a leaf j with gradient sum $G_j = \sum_{i \in \text{leaf } j} g_i$ and Hessian sum $H_j = \sum_{i \in \text{leaf } j} h_i$, the optimal leaf value is

$$w_j^{\star} = -\frac{G_j}{H_j + \lambda}. \tag{9}$$

For a candidate split into left/right children with statistics $(G_L, H_L)$ and $(G_R, H_R)$, the improvement is

$$\mathrm{Gain} = \frac{1}{2} \left( \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right) - \gamma. \tag{10}$$

LightGBM adopts leaf-wise growth by selecting at each step the split that maximizes $\mathrm{Gain}$ while controlling complexity via $\gamma$, $\lambda$, and constraints such as maximum depth or maximum leaves. Figure 4 illustrates the principle of LightGBM.
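To make Eqs. (9) and (10) concrete, the following sketch evaluates the optimal leaf value and the split gain for a single candidate split. It is a hand-written illustration, not LightGBM's internal code; the function names, the four toy samples, and the choice of a squared-error loss $\ell = \tfrac{1}{2}(\hat{y} - L)^2$ (so that $g_i = \hat{y}_i - L_i$ and $h_i = 1$) are assumptions made for the example.

```python
def leaf_value(grad_sum, hess_sum, lam):
    """Optimal leaf value, Eq. (9)."""
    return -grad_sum / (hess_sum + lam)


def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain of splitting one leaf into left/right children, Eq. (10)."""
    def score(g, h):
        return g * g / (h + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


# Toy example (hypothetical numbers): current predictions and targets.
y_hat = [0.0, 0.0, 0.0, 0.0]
target = [1.0, 1.0, -1.0, -1.0]

# Squared-error loss 0.5 * (y_hat - target)^2: gradient = y_hat - target, Hessian = 1.
g = [p - t for p, t in zip(y_hat, target)]
h = [1.0] * len(g)

left, right = [0, 1], [2, 3]  # candidate partition of the sample indices
G_L, H_L = sum(g[i] for i in left), sum(h[i] for i in left)
G_R, H_R = sum(g[i] for i in right), sum(h[i] for i in right)

gain = split_gain(G_L, H_L, G_R, H_R, lam=0.0, gamma=0.0)
w_left = leaf_value(G_L, H_L, lam=0.0)
w_right = leaf_value(G_R, H_R, lam=0.0)
print(gain, w_left, w_right)
```

With $\lambda = 0$ the leaf values reduce to the mean residual in each child; increasing $\lambda$ shrinks them toward zero, and increasing $\gamma$ raises the bar a split must clear to be accepted.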

3.2.2. Alpha Evolution Hyperparameter Optimization

The Alpha Evolution (AE) algorithm, proposed in 2024 [43], has been reported to outperform many mainstream optimizers across diverse benchmarks. For each cluster c, we tune the LightGBM hyperparameters $\psi$ within a box-constrained search space $[lb, ub]$ by minimizing the cross-validated RMSE:

$$\min_{\psi \in [lb, ub]} f_c(\psi) = \mathrm{RMSE}_{\mathrm{CV}}^{(c)}\left(L, F(x; \psi)\right). \tag{11}$$

A population $X \in \mathbb{R}^{N_{pop} \times p}$ of candidate vectors (with p here the number of tuned hyperparameters) is initialized uniformly in $[lb, ub]$. At each generation, a domain-scaled exploratory step is formed as

$$\Delta r = (ub - lb) \odot (2 R_1 \odot R_2 - R_2) \odot S, \tag{12}$$

where $R_1$ and $R_2$ have i.i.d. $U(0, 1)$ entries, S is a Bernoulli(0.5) mask, and $\odot$ denotes element-wise multiplication. The progression factor

$$\alpha = \exp\left( \ln\left(1 - \frac{\mathrm{FEs}}{\mathrm{MaxFEs}}\right) - \left(\frac{4\,\mathrm{FEs}}{\mathrm{MaxFEs}}\right)^2 \right) \tag{13}$$

decreases with the number of function evaluations $\mathrm{FEs}$ up to a budget $\mathrm{MaxFEs}$. A global reference P is updated by weighted aggregation of a sampled subpopulation B via

$$P \leftarrow c_a P + (1 - c_a) \sum_j \omega_j B_j, \qquad c_a = 1 - \frac{\mathrm{FEs}}{\mathrm{MaxFEs}}, \tag{14}$$

with nonnegative weights $\omega_j$ summing to one. For contrastive guidance, each candidate forms a trial point

$$E_i^{t+1} = P + \alpha \, \Delta r_i + \vartheta_i \odot \left(W_i + E_i^t - P - L_i\right), \tag{15}$$

where $W_i$ and $L_i$ are elite and non-elite references (here $L_i$ is a candidate vector, not the run-out distance) and $\vartheta_i$ is an element-wise random weight drawn in $[0, 2]$ or $[0, 1]$. The trial is projected back into $[lb, ub]$ and accepted greedily if it does not worsen $f_c$. The final model $F_c$ is then trained in the cluster using the best configuration found, $\psi_c^{\star}$. See Figure 5 for the flowchart of the Alpha Evolution (AE) algorithm.
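The individual building blocks of this update can be sketched as follows. The snippet covers only the progression factor of Eq. (13), the exploratory step of Eq. (12), the projection onto the box, and the greedy acceptance rule; it does not implement the reference update of Eq. (14) or the trial point of Eq. (15), and it is not the reference AE code. The function names and the two-parameter search box are hypothetical. Eq. (13) is evaluated in the algebraically equivalent form $(1 - r)\exp(-(4r)^2)$ with $r = \mathrm{FEs}/\mathrm{MaxFEs}$, which avoids taking the logarithm of zero when the budget is exhausted.

```python
import math
import random


def progression_factor(fes, max_fes):
    """Eq. (13), rewritten as (1 - r) * exp(-(4 r)^2) with r = FEs / MaxFEs."""
    r = fes / max_fes
    return (1.0 - r) * math.exp(-((4.0 * r) ** 2))


def exploratory_step(lb, ub, rng):
    """Eq. (12): domain-scaled random step with a Bernoulli(0.5) mask."""
    step = []
    for lo, hi in zip(lb, ub):
        r1, r2 = rng.random(), rng.random()
        mask = 1.0 if rng.random() < 0.5 else 0.0
        step.append((hi - lo) * (2.0 * r1 * r2 - r2) * mask)
    return step


def project(x, lb, ub):
    """Clip a trial vector back into the box [lb, ub]."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(x, lb, ub)]


def greedy_accept(current, f_current, trial, f_trial):
    """Keep the trial only if it does not worsen the objective."""
    if f_trial <= f_current:
        return trial, f_trial
    return current, f_current


# Hypothetical search box: (learning_rate, num_leaves) treated as continuous.
lb, ub = [0.01, 8.0], [0.30, 128.0]
rng = random.Random(0)

for fes in (0, 25, 50, 100):
    print(fes, progression_factor(fes, 100))

step = exploratory_step(lb, ub, rng)
trial = project([0.15 + step[0], 64.0 + step[1]], lb, ub)
print(step, trial)
```

Because $\alpha$ starts at 1 and decays quickly, the random exploratory term dominates early evaluations and becomes negligible well before the budget is spent, leaving the guidance term to refine the candidates.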

3.3. Principles of TreeSHAP

To explain how the predictors affect L, we use TreeSHAP to express, for any instance i with features $x_i = (H_i, A_i, V_i, \theta_i)$, the prediction as an additive Shapley decomposition

$$\hat{L}_i = \phi_0 + \phi_{i,H} + \phi_{i,A} + \phi_{i,V} + \phi_{i,\theta}, \tag{16}$$

where $\phi_0 = \mathbb{E}_{x \sim \text{background}}[F(x)]$ is the baseline under a background distribution taken as the cluster-wise training distribution, and $\phi_{i,j}$ is the attribution for feature $j \in \{H, A, V, \theta\}$. In Shapley form with $M = 4$ features,

$$\phi_{i,j} = \sum_{S \subseteq \mathcal{F} \setminus \{j\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left( f_i(S \cup \{j\}) - f_i(S) \right), \tag{17}$$

where $\mathcal{F}$ is the full feature set, S is a coalition (subset), and $f_i(S)$ denotes the model output for instance i when only features in S are considered and the remaining features are integrated over the background distribution. TreeSHAP computes these quantities exactly in polynomial time by dynamic programming along decision-tree paths, using path probabilities to avoid explicit enumeration of all feature subsets, thereby yielding attributions that satisfy local accuracy, missingness, and consistency for the cluster-wise LightGBM models F [52,53]. Figure 6 illustrates the framework of the AE-LightGBM and TreeSHAP combination.
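With only $M = 4$ features, Eq. (17) can also be evaluated by brute-force enumeration of all coalitions, which is a convenient way to see what the attributions mean. The sketch below does exactly that; it is not the TreeSHAP algorithm (which reaches the same quantities through tree-path dynamic programming), and the toy model, the two background samples, and the instance values are hypothetical stand-ins for a fitted cluster model and its training data.

```python
from itertools import combinations
from math import factorial

FEATURES = ("H", "A", "V", "theta")


def toy_model(x):
    """Hypothetical stand-in for a fitted cluster model; theta is unused."""
    return 2.0 * x["H"] + 0.01 * x["A"] * x["V"]


def coalition_value(model, x, background, subset):
    """f_i(S): features in S fixed to x, the rest averaged over the background."""
    total = 0.0
    for b in background:
        mixed = {name: (x[name] if name in subset else b[name]) for name in FEATURES}
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background):
    """Exact Shapley attributions by enumerating coalitions, Eq. (17)."""
    m = len(FEATURES)
    phi = {}
    for j in FEATURES:
        others = [name for name in FEATURES if name != j]
        value = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                with_j = coalition_value(model, x, background, set(subset) | {j})
                without_j = coalition_value(model, x, background, set(subset))
                value += weight * (with_j - without_j)
        phi[j] = value
    phi0 = coalition_value(model, x, background, set())  # baseline, Eq. (16)
    return phi0, phi


# Hypothetical background samples and instance.
background = [
    {"H": 50.0, "A": 100.0, "V": 80.0, "theta": 30.0},
    {"H": 150.0, "A": 300.0, "V": 200.0, "theta": 40.0},
]
x = {"H": 130.0, "A": 200.0, "V": 150.0, "theta": 35.0}

phi0, phi = shapley_values(toy_model, x, background)
print("baseline:", phi0)
print("attributions:", phi)
print("reconstructed prediction:", phi0 + sum(phi.values()), "model:", toy_model(x))
```

The reconstructed prediction matches the model output (local accuracy), and the unused feature receives a zero attribution (missingness). The enumeration costs $2^{M-1}$ coalitions per feature, which is manageable for four predictors but is the cost that TreeSHAP's path-based computation avoids for larger feature sets.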

4. Results

4.1. Clustering Results Based on k-Means Method

Figure 7 reports information-criterion diagnostics for k-means clustering based on the four predictors $(H, A, V, \theta)$. Both BIC and AIC, under full and diagonal covariance approximations shown for comparability, decrease steeply as the number of clusters increases from 1 to 3, and then exhibit a clear elbow at $K = 4$. Beyond $K = 4$, additional clusters yield only marginal gains. Consistent behavior across criteria supports selecting four clusters as the optimal number.

Figure 8 displays pairwise scatter plots with diagonal marginal distributions in the $(H, A, V, \theta)$ space for the four identified clusters. The corresponding sample sizes are: Cluster 1 ($n = 4204$), Cluster 2 ($n = 1105$), Cluster 3 ($n = 60$), and Cluster 4 ($n = 4790$). Cluster centers are indicated by yellow crosses. Overall, the geomorphic size variables H, A, and V exhibit pronounced right skewness and heavy-tailed behavior, while A and V show a strong positive correlation, indicating that source area and volume scale jointly with event magnitude. The slope parameter $\theta$ is approximately unimodal but displays systematic shifts in central tendency across clusters. From a physical perspective, Cluster 4 is characterized by low-to-moderate H, A, and V values together with relatively small $\theta$, representing the most frequent small-to-moderate, lower-energy events. Cluster 1 comprises events with moderate geometric scales but comparatively larger $\theta$, consistent with steeper slopes and potentially enhanced run-out efficiency for medium-scale failures. Cluster 2 exhibits substantial dispersion in H, A, and V combined with intermediate $\theta$, forming a transitional group that spans predominantly small-to-moderate events while extending toward larger magnitudes. In contrast, Cluster 3, despite its limited sample size, is concentrated at high A and V with relatively large H, representing rare, high-energy, large-scale events that dominate the distributional tails. These results demonstrate that the clustering effectively captures heterogeneity in both geometric and dynamical characteristics of rainfall-induced landslides, providing robust physical interpretation and statistical justification for subsequent cluster-specific modeling.

4.2. Prediction Results Based on AE-LightGBM

Based on the clustering results in Section 4.1, the dataset was partitioned into four groups (Clusters 1–4), and separate predictive models were developed for each cluster to account for cluster-specific behavior and distributional heterogeneity. Because each record represents an independent historical landslide event compiled from multiple regions rather than spatially contiguous observations within a single study area, the samples approximate event-based independence. Accordingly, data within each cluster were randomly split into 80% for training and 20% for independent testing to ensure objective and comparable model evaluation. Spatial cross-validation commonly required for susceptibility mapping is therefore not directly applicable in this context.
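The within-cluster random split can be sketched as below. The function name, the fixed seed, and the rounding rule for the test-set size are assumptions for illustration; the text does not state how fractional sizes were handled or which random generator was used.

```python
import random


def split_within_clusters(labels, test_fraction=0.2, seed=0):
    """Randomly split sample indices into train/test sets inside each cluster.

    Returns {cluster_label: (train_indices, test_indices)}.
    """
    rng = random.Random(seed)
    by_cluster = {}
    for idx, c in enumerate(labels):
        by_cluster.setdefault(c, []).append(idx)

    splits = {}
    for c, indices in by_cluster.items():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(test_fraction * len(shuffled))  # assumed rounding rule
        splits[c] = (sorted(shuffled[n_test:]), sorted(shuffled[:n_test]))
    return splits


# Hypothetical labels: ten samples in cluster 1 and five in cluster 2.
labels = [1] * 10 + [2] * 5
for cluster, (train_idx, test_idx) in split_within_clusters(labels).items():
    print(cluster, len(train_idx), len(test_idx))
```

Splitting inside each cluster, rather than once over the pooled data, keeps the 80/20 ratio for every group, which matters most for the smallest cluster ($n = 60$), where a pooled split could leave very few test samples.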

To establish a robust and computationally efficient baseline, four widely used tree-based algorithms (Random Forest (RF), eXtreme Gradient Boosting (XGB), CatBoost, and LightGBM) were first evaluated. Based on preliminary performance metrics, including the coefficient of determination ($R^2$), root mean square error (RMSE), and mean absolute error (MAE), LightGBM was selected as the base learner for further optimization. Subsequently, the Alpha Evolution (AE) algorithm was employed to optimize the LightGBM hyperparameters. For benchmarking purposes, two well-established automated optimization methods, Particle Swarm Optimization (PSO) and Bayesian Optimization (BO), were also implemented. For fair comparison, AE, PSO, and BO were executed under an identical optimization budget defined by the same number of objective function evaluations and the same hyperparameter search space, rather than identical iteration counts.
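For reference, the three evaluation metrics can be written out directly from their standard definitions. The helper names and the three-sample toy vectors below are hypothetical and serve only to show the arithmetic.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted run-out distances (m).
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(rmse(observed, predicted), mae(observed, predicted), r2(observed, predicted))
```

RMSE and MAE carry the units of L (metres), so their magnitudes are not directly comparable between clusters whose run-out distances differ in scale, whereas $R^2$ is dimensionless.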

Figure 9 illustrates the convergence behavior of the best fitness value during the AE-based hyperparameter optimization of LightGBM. The fitness value decreases rapidly from 66 to 43 within the first two iterations and further declines to approximately 37 before entering a plateau phase characterized by minor fluctuations between 36.7 and 37.5. The best fitness value, 36.8468, is attained at the 18th iteration, as indicated by the red circle. This convergence pattern suggests that the algorithm efficiently approaches a low-error region in the early stages and maintains stable performance thereafter.

Figure 10 compares AE-LightGBM model outputs with observed values of landslide run-out distance L on the training set. Points generally align with the 1:1 reference line and the black fit line has a slope close to unity, indicating limited systematic bias. Residual histograms are narrowly centered at zero, suggesting predominantly small random errors. Importantly, the green fit line represents the AE-LightGBM model's learned response rather than a simple linear regression: it is a data-driven curve produced by the tree ensemble's piecewise nonlinear combinations of predictors. This indicates that the proposed model captures salient nonlinearities and interaction effects among covariates, consistent with the high $R^2$, low RMSE and MAE, and concentrated residuals. Overall, Clusters 1, 2, and 4 exhibit stronger agreement ($R^2$ larger than 0.95), whereas Cluster 3 is more dispersed but still coherent.

Figure 11 shows the corresponding test-set comparisons, where predictions remain tightly distributed around the 1:1 line and residuals stay concentrated near zero, indicating good generalization. As in the training dataset, Cluster 3 displays a looser pattern with heavier tails, which is likely due to its smaller sample size and higher variability. Clusters 1, 2, and 4 maintain higher consistency and narrower residual ranges. Collectively, the two figures show that AE-LightGBM reproduces L robustly, with the learned nonlinear structure playing a key role in its accuracy and stability. Due to space constraints, we present in the main text only the training and testing dataset comparisons and residual distributions for the best-performing AE-LightGBM model. Prediction results of testing data for the six benchmark models including RF, XGB, CatBoost, LightGBM, PSO-LightGBM, and BO-LightGBM are provided in Appendix B.

Figure 12 illustrates the $R^2$, RMSE and MAE of the seven models for each cluster. In Cluster 1, errors are very close, with RMSE varying between 5.92 and 6.48. AE-LightGBM leads with an RMSE of 5.9168 and an $R^2$ of 0.9593, while CatBoost attains a marginally better MAE of 3.5771. Cluster 2 shows moderate separation, with RMSE ranging from 44.91 to 50.25: AE-LightGBM is best on all three metrics, followed by PSO-LightGBM and BO-LightGBM, with XGB and CatBoost trailing. Cluster 3 is the hardest and most differentiated, with RMSE ranging from 20.32 to 83.67: AE-LightGBM dominates, with about 64% lower RMSE than vanilla LightGBM, while the baselines show poor fit. Cluster 4 is stable with small gaps, with RMSE between 6.03 and 6.80: AE-LightGBM again ranks first, with about 4.8% lower RMSE than LightGBM. Overall, CatBoost and AE-LightGBM are nearly tied in Cluster 1, and AE-LightGBM attains the lowest RMSE in all four clusters and the best MAE and $R^2$ in Clusters 2, 3, and 4. Averaged over the four clusters, AE-LightGBM attains the best overall performance. Relative to standard LightGBM, AE-LightGBM reduces error by roughly one third and increases $R^2$ by about 0.08. The other automated LightGBM variants, BO-LightGBM and PSO-LightGBM, form the next tier, trailing AE by a modest margin while clearly outperforming the untuned baselines. In contrast, XGB, CatBoost, and RF yield substantially larger errors and lower goodness of fit. Overall, these cross-cluster averages indicate that coupling LightGBM with the Alpha Evolution (AE) algorithm yields consistently stronger and more robust generalization than both vanilla LightGBM and alternative baselines. See Table 1 for the average RMSE, MAE, and $R^2$ across all clusters.

The high predictive accuracy mainly reflects strong geometric constraints on run-out distance rather than model memorization. Because mobility approximately follows energy-scaling relationships, a substantial fraction of variance is physically explainable, leading to higher $R^2$ than typical site-specific hazard models.

Figure 13 compares the residual distributions of the seven models across the four clusters. AE-LightGBM consistently produces the narrowest, highest peaks centered near zero, indicating superior predictive stability and accuracy. Vanilla LightGBM performs better than XGB, RF, and CatBoost, and among the optimized variants, AE-LightGBM outperforms both PSO-LightGBM and BO-LightGBM. Across clusters, the residuals are approximately symmetric and bell-shaped around zero, consistent with a near-normal distribution. Notably, Cluster 3 exhibits a much more dispersed residual pattern with heavier tails, implying lower predictive accuracy than Clusters 1, 2, and 4; this behavior is likely attributable to its smaller sample size and greater data heterogeneity. Synthesizing the evaluation metrics

R^{2}

, RMSE, and MAE with these residual analyses, we conclude that AE-LightGBM is the most competitive model. Accordingly, we adopt AE-LightGBM as the final predictor and develop a TreeSHAP-based explainer in the following section to quantify feature contributions and analyze the principal drivers of the response, including cross-cluster heterogeneity, interaction effects, and local and global attributions.
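The residual comparison can be summarized numerically as well as visually. The sketch below is an illustration only: it assumes residuals are defined as observed minus predicted run-out, and the function name is hypothetical. It reports the mean, spread, and skewness of a residual sample, where a narrow, symmetric distribution centered near zero corresponds to a small standard deviation and a skewness close to zero.

```python
import math

def residual_summary(y_true, y_pred):
    """Mean, population standard deviation, and skewness of residuals (observed - predicted)."""
    res = [t - p for t, p in zip(y_true, y_pred)]
    n = len(res)
    mean = sum(res) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in res) / n)
    # Skewness is the standardized third central moment; define it as 0 for constant residuals.
    skew = 0.0 if std == 0 else sum((r - mean) ** 3 for r in res) / (n * std ** 3)
    return {"mean": mean, "std": std, "skewness": skew}

# Toy examples: a symmetric residual sample and a right-skewed one.
symmetric = residual_summary([8.0, 9.0, 10.0, 11.0, 12.0], [10.0] * 5)
right_skewed = residual_summary([0.0, 0.0, 0.0, 0.0, 5.0], [0.0] * 5)
```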

4.3. Explanatory Analysis Based on TreeSHAP

For each cluster, an AE-LightGBM model was trained independently, and feature importance was quantified using the mean absolute SHAP value. As shown in Figure 14, across all four clusters, the results exhibit a consistent hierarchy: the vertical drop H is the dominant predictor of runout distance L, with an importance substantially exceeding that of the other covariates. The slope angle

θ

and source area A show comparatively minor and broadly similar contributions, whereas the source volume V has a negligible effect within the fitted models. Cluster 3, which represents large, high-energy events, is the only notable deviation from this baseline: although H remains predominant, the relative importance of A increases, indicating that planform extent provides additional explanatory power for extreme cases. Clusters 1, 2, and 4 conform to the general pattern: H dominant;

θ

and A secondary; and V minimal. Overall, the TreeSHAP analysis is consistent with physical expectations: the gravitational-potential proxy H governs runout across regimes, with source geometry becoming more influential in the tail of large events, while

θ

and V contribute only marginal improvements in predictive performance.
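To make the importance measure concrete, the sketch below computes exact Shapley values by brute-force enumeration of feature subsets and then averages their absolute values over samples. It is an illustration under stated assumptions, not the TreeSHAP algorithm used in this study: absent features are replaced by a single background point, and `toy_model`, the background values, and the three samples are hypothetical. With only four features, the enumeration is cheap.

```python
from itertools import combinations
from math import factorial

FEATURES = ("H", "A", "V", "theta")

def toy_model(x):
    """Hypothetical stand-in for a fitted regressor; not the AE-LightGBM model."""
    h, a, v, theta = x
    return 1.5 * h + 0.02 * a + 0.0 * v - 0.3 * theta

def shapley_values(f, x, background):
    """Exact Shapley values of f at x, with absent features set to the background point."""
    n = len(x)
    phi = [0.0] * n
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_j = [x[k] if k in present else background[k] for k in range(n)]
                with_j = list(without_j)
                with_j[j] = x[j]
                phi[j] += weight * (f(with_j) - f(without_j))
    return phi

def mean_abs_importance(phi_rows):
    """Global importance: mean absolute Shapley value per feature."""
    n = len(phi_rows)
    return [sum(abs(row[j]) for row in phi_rows) / n for j in range(len(phi_rows[0]))]

background = (80.0, 400.0, 1500.0, 30.0)
samples = [
    (100.0, 500.0, 2000.0, 30.0),
    (40.0, 200.0, 600.0, 25.0),
    (250.0, 900.0, 9000.0, 35.0),
]
phi_rows = [shapley_values(toy_model, s, background) for s in samples]
importance = dict(zip(FEATURES, mean_abs_importance(phi_rows)))
```

For each sample, the attributions sum to the difference between the prediction and the background prediction, which is the additivity property that TreeSHAP also satisfies.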

Figure 15 summarizes the global contribution of each input variable. In all clusters, H exhibits SHAP values that are predominantly positive with a pronounced right tail, indicating a consistently strong and positive impact on L. The contributions of

θ

and A are comparatively modest and centered near zero, reflecting weaker but directionally mixed effects. A displays a longer positive tail in Cluster 3, implying increased relevance for large events. The SHAP values of V are tightly concentrated around zero across clusters, indicating negligible marginal influence within the present modeling framework. Differences in the horizontal scale among panels further indicate that Clusters 2 and 3 are associated with greater output sensitivity than Clusters 1 and 4.

As H was identified as the most influential predictor, Figure 16 focuses specifically on the SHAP dependence of H in each cluster. Across all clusters,

SHAP (H)

increases approximately monotonically with H, indicating a robust positive marginal effect of H on the predicted runout distance L. Clusters with larger event scales (notably Cluster 2 and Cluster 3) display wider ranges of

SHAP (H)

, consistent with stronger output sensitivity to variation in H. Color gradients suggest limited but non-negligible interactions:

θ

modulates the effect of H in Clusters 1, 2, and 4, whereas A plays a more pronounced interacting role in Cluster 3. Overall, the dependence curves corroborate the conclusion that H is the primary driver of L across regimes, while interactions become more evident for large events.

Figure 17 presents two-dimensional response surfaces from AE-LightGBM, giving the predicted L over selected feature pairs. Surfaces involving H (panels A–H,

θ

–H, and V–H) display a dominant gradient along the H axis: predicted L increases markedly with H, with a steeper rise in the mid-to-high range of H. By comparison, variations along A,

θ

, or V at fixed H produce smaller changes in L, indicating weaker marginal effects. The A–V and

θ

–V surfaces are largely flat along V, corroborating the minimal role of V observed in the SHAP analyses. Moderate shifts along

θ

and, in specific ranges, along A yield secondary adjustments to L, consistent with their status as auxiliary modulating factors. These bivariate patterns align with the univariate SHAP findings and reinforce the conclusion that H governs runout across clusters, while

θ

and A provide limited modulation and V contributes negligibly.

Beyond confirming variable dominance, the response surfaces indicate regime-dependent sensitivity rather than purely monotonic dependence. The gradient of predicted L with respect to relief H increases at larger values, suggesting a progressive transition from friction-limited motion toward inertia-enhanced mobility. This behavior implies a continuous mechanical transition rather than a discrete threshold. Similarly, the increasing influence of source area in high-energy cases reflects enhanced lateral spreading and internal deformation processes. These patterns suggest that landslide mobility evolves gradually with scale, and the model captures nonlinear changes in response intensity rather than simple proportional scaling.

The SHAP attribution patterns can be interpreted in terms of landslide mobility mechanics rather than purely statistical importance. The dominance of the relief H indicates that run-out distance is primarily controlled by gravitational potential energy. This is consistent with classical mobility scaling relationships in which the ratio

H / L

approximates an effective friction coefficient. Under this framework, the relatively weak contribution of slope angle

θ

suggests that local inclination mainly modulates motion once the total drop height is fixed.

The negligible influence of volume V implies that mobility is not directly mass-controlled but energy-controlled, consistent with observations that landslides of different magnitudes often follow similar run-out scaling. In contrast, the increased contribution of source area A within the extreme-event cluster suggests enhanced lateral spreading and internal deformation in high-energy failures. Therefore, the explainable machine learning results recover physically meaningful mobility behavior rather than providing only statistical correlations.
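The link between the ratio H/L and an effective friction coefficient can be made explicit with the classical sliding-block energy balance. The derivation below is a textbook idealization added for illustration, not a result of the present model. It assumes a rigid mass m with a constant Coulomb friction coefficient μ, a local path inclination β, and L taken as the horizontal travel distance.

```latex
% Work done against friction along the path, with ds the path increment and dx = \cos\beta \, ds:
W_f = \int \mu \, m g \cos\beta \, ds = \mu \, m g \int dx = \mu \, m g L
% Energy balance from release to rest:
m g H = \mu \, m g L \quad\Longrightarrow\quad \frac{H}{L} = \mu
```

Under these assumptions the mass cancels from the balance, which is in line with the weak attribution to V reported above.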

5. Discussion

5.1. Cluster-Aware Modeling Implications Under Sample Imbalance

The clustering procedure yielded groups with markedly different sample sizes. In particular, cluster 3 contains substantially fewer observations than the remaining three clusters. This imbalance is reflected in model performance: the cluster-wise AE-LightGBM fitted in cluster 3 exhibits lower predictive accuracy and higher variance relative to the models trained in the larger clusters. We interpret this discrepancy as a property of the data rather than a methodological artifact. From a geomorphic perspective, the clustering partitions the dataset into statistically inferred regimes representing distinct ranges of event scale and slope configuration. The compiled sample is dominated by small- to medium-sized landslides, whereas cluster 3 corresponds to a comparatively rare high-magnitude regime. Consequently, the model in this subset observes a narrower and less representative span of

(H, A, V, θ)

–L relationships and has fewer opportunities to learn stable nonlinear interaction patterns.

It should be noted that the imbalance is intrinsic to the physical occurrence of landslides rather than a sampling artifact. The rare-event cluster represents high-magnitude failures occupying the heavy tail of the distribution. Applying resampling or synthetic data augmentation would artificially modify the frequency structure and potentially distort the relationship between predictors and run-out distance. Therefore, instead of enforcing balanced performance across clusters, the reduced accuracy in this regime is interpreted as increased epistemic uncertainty, which is consistent with the objective of regime-aware analysis rather than benchmark optimization.

Although cross-validated training and early stopping mitigate overfitting, residual epistemic uncertainty remains larger in this regime. This behavior also illustrates a key distinction between the present cluster-aware framework and conventional clustering-based modeling strategies. Rather than treating clustering as a preprocessing step alone, regime identification directly governs model construction, hyperparameter optimization, and interpretability analysis. As a result, data imbalance influences not only statistical representation but also regime-specific predictive stability and attribution consistency. From a practical standpoint, risk communication and decision thresholds should acknowledge the wider predictive intervals associated with the rare-event regime, and future data collection should prioritize underrepresented regimes to improve balance and reduce uncertainty.

5.2. Roles of Predictors and Modeling Implications

The predictors area A and volume V exhibit substantial positive correlation, as shown in Figure 18. A reductionist approach might remove one of the two to limit redundancy. However, landslide looseness, internal structure, and thickness vary across events, implying that bulk density and vertical extent can differ even for similar planform areas. Volume therefore conveys information about effective mass and potential energy that is not fully captured by area, while area may reflect lateral spreading potential and roughness interactions not encoded by volume alone. For these reasons, both A and V are retained. The cluster-wise AE-LightGBM can exploit their partially overlapping but distinct signals, and TreeSHAP then quantifies their conditional, marginal contributions relative to the other predictors. In parallel, we monitor attribution stability to guard against interpretability distortions arising from multicollinearity; where strong redundancy is detected, future extensions may incorporate composite features or penalization schemes that preserve physically meaningful signal.
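As a simple illustration of how the redundancy between A and V can be quantified, the sketch below computes a Pearson correlation coefficient on log-transformed values. The log transform, the function name, and the toy area and volume values are assumptions made for this example and are not taken from Figure 18.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy source areas (m^2) and volumes (m^3); thickness varies between events,
# so volume is strongly but not perfectly determined by area.
area = [100.0, 400.0, 900.0, 2500.0]
volume = [150.0, 1000.0, 1800.0, 10000.0]
r_log = pearson([math.log(a) for a in area], [math.log(v) for v in volume])

# Limiting case: volume is an exact power law of area, so the log-log correlation is 1.
r_exact = pearson([math.log(a) for a in area], [math.log(a ** 1.5) for a in area])
```

A coefficient close to but below one corresponds to the situation described above, in which V carries some information that A does not.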

The use of crown–toe relief H highlights an important distinction between explanatory and predictive modeling. Although H is generally unknown prior to failure, it serves as a physically meaningful descriptor of potential energy and event scale within historical datasets. The inclusion of H in this study was motivated by its widespread use in empirical and statistical runout analyses, where it serves as a proxy for gravitational potential energy and the overall geometric scale of the event. This situation is common in mobility studies of long runout mass movements, where geometric drop height becomes available only after the event. Consistent with this perspective, the TreeSHAP results indicate that H plays a dominant role in explaining model outputs, reinforcing its physical relevance for describing mobility behavior within historical-event datasets. Therefore, the present framework should not be interpreted as a pre-event forecasting tool but as a post-event mobility characterization and comparative analysis approach consistent with established empirical runout studies [54].

5.3. Limitations and Directions for Future Work

The proposed approach has defined applicability limits. The predictors

(H, A, V, θ)

describe geometric controls but do not explicitly represent material strength, path roughness, or hydrological conditions, and deviations may occur where rheological effects dominate. The compiled inventory contains observational bias typical of historical datasets, where accessible and moderate-sized events are preferentially recorded. Consequently, the learned relationships represent a conditional empirical mobility relation valid within the range of geomorphic conditions present in the training data. The framework is most reliable for interpolation within comparable environments, while extrapolation to regions with substantially different lithology, climate, or topographic configuration should be undertaken cautiously.

6. Conclusions

This study introduced a cluster-aware and explainable framework for predicting landslide run-out distance L from four predictors

(H, A, V, θ)

. Information-criterion diagnostics supported a four-cluster partition, reflecting the dominance of small- to medium-sized landslides in the compiled dataset. We first established a baseline by comparing Random Forest, eXtreme Gradient Boosting, CatBoost, and LightGBM on consistent splits, then selected LightGBM as the best learner, and then optimized it within each cluster using the Alpha Evolution (AE) algorithm. PSO- and BO-optimized LightGBM variants served as benchmarks. The final predictor was AE-LightGBM. AE-LightGBM achieved the best cross-cluster averages with mean

R^{2} = 0.948

. Relative to vanilla LightGBM, the error was reduced by approximately one third and

R^{2}

increased by about

+ 0.08

on average. Cluster-wise testing results corroborate this pattern. Residual distributions across clusters were approximately symmetric and narrowest for AE-LightGBM, indicating improved stability.

Explainable analysis indicates that run-out distance is primarily controlled by gravitational potential energy represented by the relief H, while slope inclination and source geometry act as secondary modulators. The increased contribution of source area in the rare-event regime suggests enhanced spreading processes in large, high-energy failures. These findings support an energy-dominated mobility scaling rather than a mass-dominated one, and demonstrate that predictive uncertainty itself becomes scale dependent, with extreme events inherently less predictable.

The framework should be interpreted as a conditional empirical mobility model rather than a universal predictive law. Because only geometric descriptors are included, deviations may occur in environments where material strength, rheology, or hydrological conditions dominate. In addition, the compiled inventory contains observational bias typical of historical datasets, and the model is therefore most reliable for interpolation within comparable geomorphic conditions rather than extrapolation to entirely different regions.

Overall, the study shows that combining regime-aware learning with interpretable attribution can provide both predictive capability and physical insight into landslide mobility. Future work should expand observations of rare events and incorporate proxies for material and path properties to improve generalization and reduce epistemic uncertainty.

Author Contributions

Conceptualization, K.W.; methodology, D.L.; validation, Y.L. and J.H.; formal analysis, D.L.; data curation, X.L.; writing—original draft preparation, D.L.; writing—review and editing, D.L.; supervision, K.W.; project administration, D.L.; and funding acquisition, D.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Project of the Science and Technology Program of the Guangdong Provincial Department of Education (Grant No. 2023ZDZX3074), the Industry–Education Integration Project of the Guangzhou Municipal Education Bureau (Grant No. 2024312529), and the Youth PhD Start-up Project of the Guangzhou Municipal Education Bureau (Grant No. 2024312019).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Overall Pipeline of the Current Study

Appendix A.1. Algorithm

Algorithm A1 describes the end-to-end pipeline for predicting landslide run-out distance L from source-region geometry and slope predictors

X = (H, A, V, θ)

. We first handle missing values and outliers, then standardize features. Over candidate cluster counts

K = 1 \dots K_{max}

, we compute k-means WCSS and, under a spherical-GMM approximation, evaluate AIC/BIC to select the optimal

K^{⋆}

; samples are then assigned cluster labels accordingly. Next, we benchmark RF, XGB, CatBoost, and LightGBM on the full dataset using

R^{2}

, RMSE, and MAE, and choose LightGBM as the base learner for optimization. For each cluster c, we train independently: using stratified cross-validation (by L quantiles) and early stopping, we employ Alpha Evolution (AE) to minimize validation RMSE within a shared search space and budget, with PSO and BO as optimization baselines. This yields the per-cluster optimal hyperparameters

ψ_{c}^{⋆}

and final model

F_{c}

. During evaluation, metrics are computed within clusters and aggregated across clusters without leakage. At inference, a new sample is routed to a cluster via k-means and predicted by the corresponding

F_{c}

to produce

\hat{L}

. Finally, TreeSHAP provides interpretability per cluster, reporting global importance, dependence relationships, cross-cluster contrasts, and case-level explanations that quantify the marginal and interaction effects of

(H, A, V, θ)

on L. The experiments were conducted on a workstation equipped with an Intel CPU and 32 GB RAM. The main software environment included Python 3.12, LightGBM, NumPy, and scikit-learn.
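The cluster-count selection step can be sketched as follows. This is a minimal illustration that takes the k-means WCSS values as given and scores each candidate K with AIC and BIC. The shared variance follows the definition used in Algorithm A1, but the closed-form log-likelihood and the parameter count (K centroids plus one variance) are assumptions of this sketch, and the WCSS values are toy numbers rather than results from this study.

```python
import math

def information_criteria(wcss, n, d, k):
    """AIC and BIC for a k-means solution under an assumed spherical Gaussian model."""
    sigma2 = wcss / (n * d)  # shared variance, sigma^2 = WCSS / (N * d)
    # Maximized log-likelihood of a spherical Gaussian with this variance (assumed form).
    log_lik = -0.5 * n * d * (math.log(2.0 * math.pi * sigma2) + 1.0)
    n_params = k * d + 1  # K centroids in d dimensions plus one variance (assumption)
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n)
    return aic, bic

def select_k(wcss_by_k, n, d):
    """Return the K that minimizes AIC and the K that minimizes BIC."""
    scores = {k: information_criteria(w, n, d, k) for k, w in wcss_by_k.items()}
    best_aic = min(scores, key=lambda k: scores[k][0])
    best_bic = min(scores, key=lambda k: scores[k][1])
    return best_aic, best_bic

# Toy WCSS curve that flattens after K = 3 (hypothetical values).
toy_wcss = {1: 4000.0, 2: 2000.0, 3: 1000.0, 4: 995.0, 5: 992.0}
k_aic, k_bic = select_k(toy_wcss, n=1000, d=4)
```

Because the BIC penalty grows with ln N, it tends to select fewer clusters than AIC once the WCSS curve flattens, which is one reason to prefer BIC and keep AIC as a reference.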

Algorithm A1 Overall pipeline for predicting landslide run-out L from

(H, A, V, θ)

1: Input: Dataset $D = \{(H_{i}, A_{i}, V_{i}, θ_{i}, L_{i})\}_{i = 1}^{N}$
2: Output: Cluster-wise models $\{F_{c}\}_{c = 1}^{K^{⋆}}$, metrics, TreeSHAP explanations
Preprocessing
3: Handle missing values/outliers in $(H, A, V, θ, L)$
4: Standardize $x_{i} = (H_{i}, A_{i}, V_{i}, θ_{i})$ via z-score $\to x_{i}^{std}$
Cluster selection & k-means
5: for $K \leftarrow 1$ to $K_{max}$ do
6:     Run k-means on $\{x_{i}^{std}\}$ (multi-inits); compute $WCSS(K)$
7:     $σ^{2} \leftarrow WCSS(K) / (N \cdot d)$ with $d = 4$; compute $ℓ(K)$, AIC/BIC
8: end for
9: $K^{⋆} \leftarrow \arg\min_{K} BIC(K)$ ▹ prefer BIC; AIC as reference
10: Fit final k-means with $K^{⋆}$; get labels $c_{i} \in \{1, \dots, K^{⋆}\}$
Baseline comparison
11: Train {RF, XGB, CatBoost, LightGBM} for $L \leftarrow (H, A, V, θ)$ on consistent splits
12: Compute $R^{2}$, RMSE, MAE; select LightGBM as base learner
Cluster-wise AE optimization
13: for $c \in \{1, \dots, K^{⋆}\}$ do
14:     $D_{c} \leftarrow \{(x_{i}, L_{i}) \mid c_{i} = c\}$
15:     Define hyperparameters $ψ = (d, M, η, λ, \dots)$
16:     $f_{c}(ψ) \leftarrow {RMSE}_{CV}^{(c)}(L, \hat{L}(x; ψ))$ with K-fold CV stratified by L-quantiles; early stopping
17:     AE init: initialize population $X \in \mathbb{R}^{N \times p}$ within bounds; $FEs \mathrel{+}= N$
18:     while $FEs < MaxFEs$ do
19:         Evaluate $f_{c}$ on all individuals; $E \leftarrow X$; sort $ind = argsort(f_{c}(X))$
20:         $Δr \leftarrow (ub - lb) ⊙ (2 R_{1} ⊙ R_{2} - R_{2}) ⊙ S$
21:         $α \leftarrow \exp\left(\ln\left(1 - \frac{FEs}{MaxFEs}\right) - \left(\frac{4 FEs}{MaxFEs}\right)^{2}\right)$
22:         Build global reference $P$ by weighted subpopulation aggregation
23:         for each individual i do
24:             Pick u from elite half of $ind$, v from remainder; $W_{i} \leftarrow X_{u}$, $L_{i} \leftarrow X_{v}$
25:             Draw $ϑ_{i} = I_{2} 1_{p} \cdot rand(0, 2) + (1 - I_{2}) \, rand(0, 1)^{1 \times p}$
26:             $E_{i}^{t + 1} \leftarrow P + α Δr_{i} + ϑ_{i} ⊙ (W_{i} + E_{i}^{t} - P - L_{i})$
27:             Bounds: $E_{i}^{t + 1} \leftarrow \min\{\max(E_{i}^{t + 1}, lb), ub\}$
28:             if $f_{c}(E_{i}^{t + 1}) \leq f_{c}(X_{i}^{t})$ then
29:                 $X_{i}^{t + 1} \leftarrow E_{i}^{t + 1}$
30:             else
31:                 $X_{i}^{t + 1} \leftarrow X_{i}^{t}$
32:             end if
33:         end for
34:         Update $FEs$
35:     end while
36:     $ψ_{c}^{⋆} \leftarrow$ best individual; train final LightGBM $F_{c}$ on $D_{c}$ with $ψ_{c}^{⋆}$
37:     (Benchmarks) Run PSO- and BO-optimized LightGBM on $D_{c}$ under same space/budget
38: end for
Evaluation and prediction
39: Compute $R^{2}$, RMSE, MAE within clusters and aggregate without leakage
40: For new $x_{new}$: assign $c_{new}$ by k-means (std. features) and output $\hat{L} = F_{c_{new}}(x_{new})$
Interpretability (TreeSHAP)
41: for each cluster c do
42:     For each sample i: $\hat{L}_{i} = ϕ_{0} + ϕ_{i, H} + ϕ_{i, A} + ϕ_{i, V} + ϕ_{i, θ}$
43:     Report global importance $\mathbb{E}_{i}[|ϕ_{i, j}|]$, dependence plots, cross-cluster contrasts, case-level explanations
44: end for
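Three steps of the AE loop in Algorithm A1 are specified completely enough to sketch directly: the step-size factor in line 21, the bound projection in line 27, and the greedy replacement in lines 28–32. The code below is a partial illustration only. The function names are hypothetical, and the construction of the reference point P, the perturbation Δr, and the weights ϑ is deliberately left out because it is not fully spelled out in the listing.

```python
import math

def alpha(fes: int, max_fes: int) -> float:
    """Step-size factor from line 21: exp(ln(1 - t) - (4t)^2) with t = FEs / MaxFEs."""
    t = fes / max_fes
    if t >= 1.0:
        return 0.0  # ln(0) is undefined; the expression tends to 0 as t approaches 1
    return math.exp(math.log(1.0 - t) - (4.0 * t) ** 2)

def clip(candidate, lb, ub):
    """Line 27: project each coordinate of a candidate back into [lb, ub]."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(candidate, lb, ub)]

def greedy_select(current, candidate, objective):
    """Lines 28-32: keep the candidate only if it is no worse than the current point."""
    return candidate if objective(candidate) <= objective(current) else current
```

Since the factor equals (1 - t) exp(-16 t^2), it starts at 1 and decays rapidly, so the random perturbation term matters mostly in the early part of the evaluation budget.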

Appendix A.2. Implementation Details and Reproducibility

All models were implemented in Python. Data preprocessing and evaluation were performed using scikit-learn utilities, and LightGBM was used as the base learner. Random seeds were fixed to ensure reproducibility of data splitting and model training. Hyperparameters of the LightGBM models were optimized independently for each cluster using the Alpha Evolution (AE) algorithm. Particle Swarm Optimization (PSO) and Bayesian Optimization (BO) were implemented under the same search space and optimization budget. Experiments were conducted in a standard CPU environment using Python together with NumPy, scikit-learn, and LightGBM libraries. Table A1 shows the hyperparameter search space and optimal ranges across clusters for AE-LightGBM. As hyperparameters were optimized independently for each cluster, the resulting optimal values differ slightly across clusters. The table therefore reports the observed optimal ranges rather than individual configurations, as the solutions consistently converged to a narrow region of the search space.

Table A1. Hyperparameter search space and optimal ranges across clusters for AE-LightGBM.

Parameter	Search Range	Optimal Range Across Clusters
Learning rate	[0.01, 0.20]	0.03–0.07
Number of leaves	[16, 128]	48–80
Maximum depth	[−1, 12]	adaptive
Min data in leaf	[10, 200]	30–90
Feature fraction	[0.6, 1.0]	0.7–0.9
Bagging fraction	[0.6, 1.0]	0.7–0.9
Bagging frequency	[1, 10]	3–6
Objective	regression	regression
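The search space in Table A1 can be written down as a small configuration. The sketch below is an illustration under stated assumptions: the keys are the standard LightGBM parameter names that appear to correspond to the table rows, and the `decode` helper, which maps a point in the unit hypercube to a parameter dictionary so that a continuous optimizer can be used, is hypothetical rather than the encoding used in this study.

```python
# (lower bound, upper bound, type) for each tuned parameter, following Table A1.
SEARCH_SPACE = {
    "learning_rate": (0.01, 0.20, float),
    "num_leaves": (16, 128, int),
    "max_depth": (-1, 12, int),  # in LightGBM, values <= 0 mean no depth limit
    "min_data_in_leaf": (10, 200, int),
    "feature_fraction": (0.6, 1.0, float),
    "bagging_fraction": (0.6, 1.0, float),
    "bagging_freq": (1, 10, int),
}

def decode(unit_vector):
    """Map a point in [0, 1]^7 to a LightGBM parameter dictionary (assumed encoding)."""
    params = {"objective": "regression"}
    for u, (name, (lo, hi, kind)) in zip(unit_vector, SEARCH_SPACE.items()):
        u = min(max(u, 0.0), 1.0)  # keep the coordinate inside the unit interval
        value = lo + u * (hi - lo)
        params[name] = int(round(value)) if kind is int else value
    return params

lower_corner = decode([0.0] * 7)
upper_corner = decode([1.0] * 7)
midpoint = decode([0.5] * 7)
```

The resulting dictionary could be passed as the parameter set of a LightGBM regressor. Rounding integer-valued parameters after scaling is one simple way to let a continuous optimizer handle them.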

Appendix B. Prediction Results of Benchmark Prediction Models

This appendix compiles the cluster-wise prediction results for the benchmark learners used in this study. Specifically, Figures A1–A5 report observed-versus-modeled comparisons on the testing sets for clusters 1–4 for BO-LightGBM, PSO-LightGBM, vanilla LightGBM, CatBoost, and Random Forest, respectively. The AE-LightGBM comparisons are presented in the main text as the reference model. The materials here provide a transparent, side-by-side view of non-AE baselines and alternative LightGBM optimization strategies under the same data splits.

Figure A1. Comparison between monitoring observations and BO-LightGBM-modeled values for the testing dataset in each cluster.

Figure A2. Comparison between monitoring observations and PSO-LightGBM-modeled values for the testing dataset in each cluster.

Figure A3. Comparison between monitoring observations and LightGBM-modeled values for the testing dataset in each cluster.

Figure A4. Comparison between monitoring observations and CatBoost-modeled values for the testing dataset in each cluster.

Figure A5. Comparison between monitoring observations and RF-modeled values for the testing dataset in each cluster.

References

1. McDougall, S. 2014 Canadian Geotechnical Colloquium: Landslide runout analysis—Current practice and challenges. Can. Geotech. J. 2017, 54, 605–620.
2. Jia, W.; Wen, T.; Li, D.; Guo, W.; Quan, Z.; Wang, Y.; Huang, D.; Hu, M. Landslide displacement prediction of Shuping landslide combining PSO and LSSVM model. Water 2023, 15, 612.
3. Okura, Y.; Kitahara, H.; Sammori, T.; Kawanami, A. The effects of rockfall volume on runout distance. Eng. Geol. 2000, 58, 109–124.
4. Zou, Z.; Xiong, C.; Tang, H.; Criss, R.E.; Su, A.; Liu, X. Prediction of landslide runout based on influencing factor analysis. Environ. Earth Sci. 2017, 76, 723.
5. Rickenmann, D. Runout prediction methods. In Debris-Flow Hazards and Related Phenomena; Springer: Berlin/Heidelberg, Germany, 2005; pp. 305–324.
6. Qarinur, M. Landslide runout distance prediction based on mechanism and cause of soil or rock mass movement. J. Civ. Eng. Forum 2015, 1, 29–36.
7. von Ruette, J.; Lehmann, P.; Or, D. Linking rainfall-induced landslides with predictions of debris flow runout distances. Landslides 2016, 13, 1097–1107.
8. Zhang, W.; Zhang, W.; Chen, Y.; Ji, J.; Gao, Y. Uncertainty evaluation of the run-out distance of flow-like landslides considering the anisotropic scale of fluctuation in the random field of internal friction angle. Acta Geotech. 2023, 18, 5839–5857.
9. Hungr, O.; Corominas, J.; Eberhardt, E. Estimating landslide motion mechanism, travel distance and velocity. In Landslide Risk Management; CRC Press: Boca Raton, FL, USA, 2005; pp. 109–138.
10. Liu, X.; Liu, Y.; Yang, Z.; Li, X. A novel dimension reduction-based metamodel approach for efficient slope reliability analysis considering soil spatial variability. Comput. Geotech. 2024, 172, 106423.
11. Peruzzetto, M.; Mangeney, A.; Grandjean, G.; Levy, C.; Thiery, Y.; Rohmer, J.; Lucas, A. Operational estimation of landslide runout: Comparison of empirical and numerical methods. Geosciences 2020, 10, 424.
12. Chen, X.; Li, D.; Tang, X.; Liu, Y. A three-dimensional large-deformation random finite-element study of landslide runout considering spatially varying soil. Landslides 2021, 18, 3149–3162.
13. Chen, G.; Wu, X.; Hu, L.; Chi, Y.; Jia, T.; Luo, Y. Numerical Analysis of 3D Slope Stability in a Rainfall-Induced Landslide: Insights from Different Hydrological Conditions and Soil Layering. Water 2025, 17, 3316.
14. Chen, K.T.; Chen, T.C.; Chen, X.Q.; Chen, H.Y.; Zhao, W.Y. An experimental determination of the relationship between the minimum height of landslide dams and the run-out distance of landslides. Landslides 2021, 18, 2111–2124.
15. Sun, X.; Zeng, P.; Li, T.; Zhang, T.; Feng, X.; Jimenez, R. Run-out distance exceedance probability evaluation and hazard zoning of an individual landslide. Landslides 2021, 18, 1295–1308.
16. Maragaño-Carmona, G.; Fustos Toribio, I.J.; Descote, P.Y.; Robledo, L.F.; Villalobos, D.; Gatica, G. Rainfall-induced landslide assessment under different precipitation thresholds using remote sensing data: A Central Andes case. Water 2023, 15, 2514.
17. Devoli, G.; De Blasio, F.V.; Elverhøi, A.; Høeg, K. Statistical analysis of landslide events in Central America and their run-out distance. Geotech. Geol. Eng. 2009, 27, 23–42.
18. Roman Quintero, D.C.; Ortiz Contreras, J.D.; Tapias Camacho, M.A.; Oviedo-Ocaña, E.R. Empirical Estimation of Landslide Runout Distance Using Geometrical Approximations in the Colombian North–East Andean Region. Sustainability 2024, 16, 793.
19. Troncone, A.; Pugliese, L.; Parise, A.; Conte, E. A practical approach for predicting landslide retrogression and run-out distances in sensitive clays. Eng. Geol. 2023, 326, 107313.
20. Apriani, D.W.; Credidi, C.; Khala, S. An empirical-statistical model for landslide runout distance prediction in Indonesia. Pondasi 2022, 27, 15.
21. Xu, Q.; Li, H.; He, Y.; Liu, F.; Peng, D. Comparison of data-driven models of loess landslide runout distance estimation. Bull. Eng. Geol. Environ. 2019, 78, 1281–1294.
22. Ju, L.Y.; Xiao, T.; He, J.; Wang, H.J.; Zhang, L.M. Predicting landslide runout paths using terrain matching-targeted machine learning. Eng. Geol. 2022, 311, 106902.
23. Chen, C.; Zhou, Y.; Wu, K.; Lu, Y.; Lv, X.; Cai, X.; Huang, W.; Vatin, N.I.; Huang, J. Computational insights into asphalt aging: A data-driven approach to model primary and secondary degradation of viscous behavior. Constr. Build. Mater. 2025, 496, 143478.
24. Hu, Y.; Wang, Y.; Wei, B.; Meng, Z.; Yuan, D.; Jin, T. An interpretable dynamic evaluation framework fusing multi-dimensional data for assessing the operational safety of concrete dams. Expert Syst. Appl. 2025, 299, 130225.
25. Li, J.; Meng, Z.; Zhang, J.; Chen, Y.; Yao, J.; Li, X.; Qin, P.; Liu, X.; Cheng, C. Prediction of seawater intrusion run-up distance based on K-means clustering and ANN model. J. Mar. Sci. Eng. 2025, 13, 377.
26. Qin, P.; Meng, Z.; Su, H.; Cheng, C. A Novel Permutation Entropy–Based Method for Assessing the Stability of Seawalls on Soft Soils. Struct. Control Health Monit. 2026, 2026, 3016498.
27. Chen, H.; Huang, S.; Qiu, H.; Xu, Y.P.; Teegavarapu, R.S.; Guo, Y.; Nie, H.; Xie, H.; Xie, J.; Shao, Y.; et al. Assessment of ecological flow in river basins at a global scale: Insights on baseflow dynamics and hydrological health. Ecol. Indic. 2025, 178, 113868.
28. Liu, X.; Jiang, S.H.; Xie, J.; Li, X. Bayesian inverse analysis with field observation for slope failure mechanism and reliability assessment under rainfall accounting for nonstationary characteristics of soil properties. Soils Found. 2025, 65, 101568.
29. Giarola, A.; Meisina, C.; Tarolli, P.; Zucca, F.; Galve, J.P.; Bordoni, M. A data-driven method for the estimation of shallow landslide runout. Catena 2024, 234, 107573.
30. Meng, Z.; Hu, Y.; Jiang, S.; Zheng, S.; Zhang, J.; Yuan, Z.; Yao, S. Slope Deformation Prediction Combining Particle Swarm Optimization-Based Fractional-Order Grey Model and K-Means Clustering. Fractal Fract. 2025, 9, 210.
31. Ma, G.; Rezania, M.; Nezhad, M.M.; Phoon, K.K. Multivariate copula-based framework for stochastic analysis of landslide runout distance. Reliab. Eng. Syst. Saf. 2024, 250, 110270.
32. Alkhasawneh, M.S.; Ngah, U.K.; Tay, L.T.; Mat Isa, N.A.; Al-Batah, M.S. Modeling and testing landslide hazard using decision tree. J. Appl. Math. 2014, 2014, 929768.
33. Yi, X.; Wang, Y.; Feng, W.; Zhao, J.; Xue, Z.; Huang, R. Towards Accurate Prediction of Runout Distance of Rainfall-Induced Shallow Landslides: An Integrated Remote Sensing and Explainable Machine Learning Framework in Southeast China. Remote Sens. 2025, 17, 3660.
34. Kavzoglu, T.; Teke, A. Advanced hyperparameter optimization for improved spatial prediction of shallow landslides using extreme gradient boosting (XGBoost). Bull. Eng. Geol. Environ. 2022, 81, 201.
35. Dang, V.H.; Dieu, T.B.; Tran, X.L.; Hoang, N.D. Enhancing the accuracy of rainfall-induced landslide prediction along mountain roads with a GIS-based random forest classifier. Bull. Eng. Geol. Environ. 2019, 78, 2835–2849.
36. Zhou, X.; Wen, H.; Li, Z.; Zhang, H.; Zhang, W. An interpretable model for the susceptibility of rainfall-induced shallow landslides based on SHAP and XGBoost. Geocarto Int. 2022, 37, 13419–13450.
37. Zhang, J.; Ma, X.; Zhang, J.; Sun, D.; Zhou, X.; Mi, C.; Wen, H. Insights into geospatial heterogeneity of landslide susceptibility based on the SHAP-XGBoost model. J. Environ. Manag. 2023, 332, 117357.
38. Zheng, D.; Li, Y.; Yan, C.; Wu, H.; Yamashiki, Y.A.; Gao, B.; Nian, T. Landslide susceptibility assessment using AutoML-SHAP method in the southern foothills of Changbai Mountain, China. Landslides 2025, 22, 1855–1875.
39. Kodinariya, T.M.; Makwana, P.R. Review on determining number of Cluster in K-Means Clustering. Int. J. 2013, 1, 90–95.
40. Sinaga, K.P.; Yang, M.S. Unsupervised K-means clustering algorithm. IEEE Access 2020, 8, 80716–80727.
41. Song, Y.Y.; Lu, Y. Decision tree methods: Applications for classification and prediction. Shanghai Arch. Psychiatry 2015, 27, 130.
42. Myles, A.J.; Feudale, R.N.; Liu, Y.; Woody, N.A.; Brown, S.D. An introduction to decision tree modeling. J. Chemom. 2004, 18, 275–285.
43. Gao, H.; Zhang, Q. Alpha evolution: An efficient evolutionary algorithm with evolution path adaptation and matrix generation. Eng. Appl. Artif. Intell. 2024, 137, 109202.
44. Gong, W.; Wang, G.; Li, L.; Chen, B. A dataset and review of empirical estimation relationships for landslide runout distances. Earth-Sci. Rev. 2025, 270, 105225.
45. Dias, A.; Hart, J.; Fung, E. The Enhanced Natural Terrain Landslide Inventory. 2009. Available online: https://ginfo.cedd.gov.hk/hkss/filemanager/common/publications-resources/list-of-technical-papers/511_Dias%20et%20al%20(2009)_The%20Enhanced%20Natural%20Terrain%20Landslide%20Inventory.pdf (accessed on 3 March 2026).
46. Bommer, J.J.; Rodríguez, C.E. Earthquake-induced landslides in Central America. Eng. Geol. 2002, 63, 189–220.
47. Einbund, M.M.; Baxstrom, K.W.; Schulz, W.H. Map Data from Landslides Triggered by Hurricane Maria in Four Study Areas in the Utuado Municipality, Puerto Rico; US Geological Survey (USGS) Data Release: Reston, VA, USA, 2021; p. 824.
48. Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295.
49. Likas, A.; Vlassis, N.; Verbeek, J.J. The global k-means clustering algorithm. Pattern Recognit. 2003, 36, 451–461.
50. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 1–9.
51. Yan, J.; Xu, Y.; Cheng, Q.; Jiang, S.; Wang, Q.; Xiao, Y.; Ma, C.; Yan, J.; Wang, X. LightGBM: Accelerated genomically designed crop breeding through ensemble learning. Genome Biol. 2021, 22, 271.
52. Inan, M.S.K.; Rahman, I. Explainable AI integrated feature selection for landslide susceptibility mapping using TreeSHAP. SN Comput. Sci. 2023, 4, 482.
53. Mitchell, R.; Frank, E.; Holmes, G. GPUTreeShap: Massively parallel exact calculation of SHAP scores for tree ensembles. PeerJ Comput. Sci. 2022, 8, e880.
54. Strom, A. Prediction of the dimensions of rock avalanches’ affected areas based on the empirical relationships derived from the Central Asian database. Landslides 2024, 21, 1961–1970.

Figure 1. Simplified physical model of rainfall-induced landslide motion and deposition, showing source, flow, and deposition areas.

Figure 2. Histograms of the input and output variables.

Figure 3. The overall pipeline of the modeling process.

Figure 4. The principle of LightGBM.

Figure 5. The flowchart of alpha evolution (AE) optimization.

Figure 6. The framework of AE-LightGBM and TreeSHAP combination.

Figure 7. Selection of cluster number based on BIC and AIC.

Figure 8. Pairwise scatter plots and marginal densities of the 4 clusters in the (H, A, V, θ) space.

Figure 9. Identification of the best iteration during AE-LightGBM hyperparameter optimization.

Figure 10. Comparison between recorded data and AE-LightGBM-modeled data of the training dataset for each cluster.

Figure 11. Comparison between recorded data and AE-LightGBM-modeled data of the testing dataset for each cluster.

Figure 12. Radar plot of $R^{2}$, RMSE and MAE of the 7 models for each cluster.

Figure 13. Residual distribution of the 7 models for each cluster.

Figure 14. Mean absolute SHAP values from the AE-LightGBM-based TreeSHAP analysis for each cluster.

Figure 15. SHAP beeswarm plots from the AE-LightGBM-based TreeSHAP analysis for each cluster.

Figure 16. SHAP dependence of SHAP(H) versus H for each cluster.

Figure 17. Two-dimensional response surfaces showing the predicted L over selected feature pairs.

Figure 18. The correlation matrix of the four input variables (A, H, V, θ).

Table 1. Comparison of average RMSE, MAE, and $R^{2}$ across all clusters.

Model	Average RMSE	Average MAE	Average $R^{2}$
AE-LightGBM	19.294336	12.939086	0.948469
BO-LightGBM	20.898742	14.043180	0.936909
PSO-LightGBM	21.581442	14.828819	0.933655
LightGBM	28.821419	19.483252	0.867471
XGB	34.453173	21.623989	0.789772
CatBoost	35.024452	23.168812	0.782134
RF	35.881306	23.290911	0.753934
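
The averages in Table 1 can be compared more directly by expressing each model's error as a fractional reduction relative to a baseline. The following is a minimal sketch, not part of the authors' pipeline: the dictionary simply copies the Table 1 values, and the helper names `relative_reduction` and `rank_models` are hypothetical, introduced here for illustration only.

```python
# Illustrative only: values copied from Table 1 (averages across all clusters).
# Helper names are hypothetical and are not taken from the paper.
TABLE_1 = {
    "AE-LightGBM":  {"rmse": 19.294336, "mae": 12.939086, "r2": 0.948469},
    "BO-LightGBM":  {"rmse": 20.898742, "mae": 14.043180, "r2": 0.936909},
    "PSO-LightGBM": {"rmse": 21.581442, "mae": 14.828819, "r2": 0.933655},
    "LightGBM":     {"rmse": 28.821419, "mae": 19.483252, "r2": 0.867471},
    "XGB":          {"rmse": 34.453173, "mae": 21.623989, "r2": 0.789772},
    "CatBoost":     {"rmse": 35.024452, "mae": 23.168812, "r2": 0.782134},
    "RF":           {"rmse": 35.881306, "mae": 23.290911, "r2": 0.753934},
}


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of an error metric relative to a baseline value."""
    return (baseline - candidate) / baseline


def rank_models(metric: str, lower_is_better: bool = True) -> list[str]:
    """Return model names sorted from best to worst on the given metric."""
    return sorted(
        TABLE_1,
        key=lambda name: TABLE_1[name][metric],
        reverse=not lower_is_better,
    )


if __name__ == "__main__":
    base = TABLE_1["LightGBM"]  # untuned LightGBM used as the baseline
    tuned = TABLE_1["AE-LightGBM"]
    print(f"RMSE reduction: {relative_reduction(base['rmse'], tuned['rmse']):.1%}")
    print(f"MAE reduction:  {relative_reduction(base['mae'], tuned['mae']):.1%}")
    print("Ranking by RMSE:", rank_models("rmse"))
    print("Ranking by R^2: ", rank_models("r2", lower_is_better=False))
```

With the tabulated averages, AE-LightGBM reduces the average RMSE of untuned LightGBM by roughly one third, and the model ordering is the same whether ranked by RMSE or by $R^{2}$.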
