Article

Bayesian-Inspired Dynamic-Lag Causal Graphs and Role-Aware Transformers for Landslide Displacement Forecasting

1 School of Information and Communication, Guilin University of Electronic Technology, Guilin 541004, China
2 School of Computer Science and Engineering, Guilin University of Aerospace Technology, Guilin 541004, China
3 Guangxi Zhuang Autonomous Region Geological Environment Monitoring Station, Nanning 530201, China
4 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
5 School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
6 Faculty of Electrical and Electronics Engineering Technology, Universiti Malaysia Pahang Al-Sultan Abdullah, Pekan 26600, Malaysia
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Entropy 2026, 28(1), 7; https://doi.org/10.3390/e28010007
Submission received: 10 November 2025 / Revised: 9 December 2025 / Accepted: 15 December 2025 / Published: 20 December 2025
(This article belongs to the Special Issue Bayesian Networks and Causal Discovery)

Abstract

Increasingly frequent intense rainfall is raising landslide occurrence and risk. In southern China in particular, steep slopes and thin residual soils produce frequent landslide events with pronounced spatial heterogeneity. Therefore, displacement prediction methods that function across sites and deformation regimes in similar settings are essential for early warning. Most existing approaches adopt a multistage pipeline that decomposes, predicts, and recombines, often leading to complex architectures with weak cross-domain transfer and limited adaptability. To address these limitations, we present CRAFormer, a causal role-aware Transformer guided by a dynamic-lag Bayesian network-style causal graph learned from historical observations. In our system, the discovered directed acyclic graph (DAG) partitions drivers into five causal roles and induces role-specific, non-anticipative masks for lightweight branch encoders, while a context-aware Top-2 gate sparsely fuses the branch outputs, yielding sample-wise attributions. To safely exploit exogenous rainfall forecasts, next-day rainfall is entered exclusively through an indirect-cause (ICS) tail with a leakage-free block mask, a non-negative readout, and a rainfall monotonicity regularizer. In this study, we curate two long-term GNSS datasets from Guangxi (LaMenTun and BaYiTun) that capture slow creep and step-like motions during extreme rainfall. Under identical inputs and a unified protocol, CRAFormer reduces the MAE and RMSE by 59–79% across stations relative to the strongest baseline, and it lowers magnitude errors near turning points and step events, demonstrating robust performance for two contrasting landslides within a shared regional setting. Ablations confirm the contributions of the DBN-style causal masks, the leakage-free ICS tail, and the monotonicity prior. These results highlight a practical path from causal discovery to forecast-compatible neural predictors for rainfall-induced landslides.

1. Introduction

Climate change is increasing the frequency and intensity of extreme rainfall events [1,2], and rainfall-triggered landslides remain a persistent threat to lives and infrastructure worldwide [3]. In China, recent studies have reported increasing flood risk in major river basins due to the warming climate [4], highlighting the broader impact of hydrometeorological extremes on slope and infrastructure safety. Much of southern China has a subtropical monsoon climate with uneven seasonal rainfall, and the concentrated, high-intensity precipitation during the flood season drives frequent failures. In Guangxi, a region representative of this setting, widespread hillslopes and karst, together with thin residual soils and steep gradients, create high susceptibility. Under intense rainfall, the slopes respond rapidly, shallow soils lose stability, and landslide activity becomes highly heterogeneous, with GNSS displacement records showing that single landslides often exhibit step-like offsets and progressive creep. These deformation characteristics pose severe challenges to accurate landslide displacement prediction. Consequently, developing a method that is reliable and robust, with good transferability across sites and deformation regimes in similar climatic and geomorphological settings, is critical for early warning and risk mitigation.
Most existing models optimize correlations rather than causal mechanisms, which leads to several recurring problems. For example, signal decomposition pipelines coupled with deep networks can fit non-linear, multiscale patterns [5], but they are complex and brittle: multilevel decompositions (e.g., VMD, EMD) and extensive hyperparameter tuning propagate errors and impair transferability [6,7,8,9]. A central limitation concerns the treatment of rainfall forcing, where rainfall is commonly reduced to synchronous inputs or fixed-window accumulations [10]. Such summaries obscure the lagged and filtered responses governed by infiltration, seepage, and pore-water-pressure dynamics, and they underutilize short-term forecasts, which are critical during extremes.
Causal discovery provides a principled alternative. For example, a Bayesian network (BN) can encode a DAG and conditional dependencies. For landslides, it can represent pathways from rainfall to displacement through hydrologic and mechanical states. Structure learning clarifies parent–child and ancestor relations, supports d-separation-based reasoning, and improves interpretability in confounding and non-stationary scenarios. Embedding such structures into learning can stabilize predictions during intense rainfall and alternating wetting–drying cycles [11].
In this paper, we introduce CRAFormer, a causal, role-aware end-to-end framework for landslide displacement forecasting without prior signal decomposition. At each monitoring station, CRAFormer learns a station-specific dynamic-lag causal graph (DLCG) from historical observations on a time-unrolled DAG and uses it to define role masks for that station. It then partitions the drivers into five BN-consistent roles and encodes each with a lightweight, role-masked branch. Temporal masks enforce non-anticipative constraints. A context-aware Top-2 gate fuses the branch outputs and yields sample-wise attributions, and this gated fusion allows the model to shift its reliance between self-history and hydrometeorological drivers across different deformation regimes and local geological conditions. The design links BN-style structure learning to neural parameterization and preserves clear causal semantics.
The main contributions of this paper are as follows:
  • A causally informed forecasting framework: We couple a BN-consistent structure with neural predictors to model pathways from rainfall to displacement. This approach eliminates reliance on prior signal decomposition and improves interpretability.
  • A dynamic-lag causal graph (DLCG): We perform structure learning on a time-unrolled DAG, partition the inputs into five causal roles, and encode each with role masks that operationalize d-separation and non-anticipativity. A context-aware Top-2 gate improves accuracy and provides sample-wise attributions.
  • A forecast-compatible rainfall pathway: Next-day rainfall enters an indirect-cause branch through a leakage-free channel with a monotone readout, which preserves non-anticipativity and stabilizes responses under strong hydrologic forcing. Numerical forecasts are used when available; otherwise, the observed $R_{t+1}$ serves as a practical proxy.
  • Evaluation on two rainfall-triggered landslides in Guangxi (LaMenTun and BaYiTun): CRAFormer improves accuracy, stability, and physical consistency at both sites under real-world conditions, especially during rainfall extremes, while retaining interpretability via role masks and gating. These experiments document the framework’s robustness across two contrasting deformation regimes within a shared regional setting.

2. Related Works

Landslide displacement prediction approaches are commonly grouped into three categories: physical, statistical, and data-driven. Physical models, grounded in groundwater hydrodynamics and rock–soil constitutive laws, provide strong mechanistic interpretability [12] and can resolve slope-scale soil–water migration and its control on shallow stability under rainfall forcing [13]. However, parameter identification is difficult and simulations are computationally expensive, which limits real-time use and scalability [12]. Classical statistical models, including linear regression and time-series methods, are simple and effective at short horizons but struggle with non-linear dynamics and rainfall-induced lags [14].
Deep learning has improved predictive accuracy in landslide displacement forecasting. For example, recurrent networks (e.g., LSTM and GRU) and temporal convolutional networks capture non-linear dynamics and both long- and short-term dependencies [15,16,17]. Transformer-based architectures offer attention mechanisms that assign time-varying relevance [18,19]. Spatio-temporal deep learning models with explicitly interpretable features have also been proposed for landslide displacement prediction [20]. However, three challenges remain: the need for large labeled datasets and substantial computation; limited interpretability due to opaque internal states; and weak cross-site generalization under diverse rainfall regimes and human influences [21,22].
To represent multiscale features, many studies adopt a decomposition–reconstruction pipeline, where methods such as wavelet transforms, moving averages, VMD, and EMD are used to split displacement into submodes that are modeled and then recombined [8,23,24]. Combining signal processing with learning improves the representation of multiscale structures [9,25,26], but feature selection nevertheless often relies on correlation-based metrics, such as Pearson’s correlation coefficient, the MIC, SHAP values, and elastic net regularization [15,27,28,29,30,31]. Correlation does not establish causation and can overlook coupling among drivers, which reduces transferability and robustness [32,33].
Recent studies have aimed to improve interpretability and physical consistency by combining attention mechanisms with physics-based constraints, for example, by embedding geological priors into architectures, establishing a rainfall and water level→displacement causal chain [34,35], designing deformation mechanism-assisted deep architectures for step-like reservoir landslide displacement [36], pairing decomposition with Granger causality tests to screen drivers [37], and integrating frequency-domain causal analysis, Shapley value attribution, and physics-informed structures [38,39]. These designs provide insight but face three issues: First, attention often highlights associations rather than causal effects. Second, soft physical constraints can drift during training and destabilize models. Third, added architectural complexity increases maintenance and deployment costs.
In summary, current approaches lack explicit causal modeling, faithful representation of rainfall processes across timescales, and lightweight designs that generalize across heterogeneous settings. These gaps motivated our creation of a role-aware, physics-guided framework that captures dynamic lags, uses rainfall forecasts in a leakage-free manner, and remains efficient for operational point forecasting.

3. Methodology

We designed CRAFormer to decouple structure discovery from exogenous conditioning by formulating displacement forecasting with a time-unrolled Bayesian network (DBN) and a role-masked predictor. As shown in Figure 1, CRAFormer (i) performs dynamic-lag causal discovery on historical variables to obtain a DAG $\hat{G}$; (ii) derives a BN-consistent role partition via d-separation; (iii) encodes each role using non-anticipative temporal and structural masks; and (iv) fuses the role representations through a context-aware Top-2 gate that yields sample-wise attributions. Prediction at time $t+1$ uses the history together with an exogenous rainfall channel $R_{t+1}$ that is excluded from discovery and injected through a leakage-free tail, preventing information leakage from the forecast horizon while staying compatible with operational rainfall forecasts. The historical variables are partitioned into five disjoint roles: self-feedback (ES), direct causes (DCSs), co-causes (CCSs), indirect causes (ICSs), and structurally irrelevant variables (SCSs), with each processed by its role-masked encoder.
The ICS branch ingests the exogenous next-day rainfall $R_{t+1}$ through a leakage-free tail token. The tail attends to history, whereas history cannot attend to the tail, which preserves non-anticipativity. We also apply a lightweight, regime-selective monotonicity prior that softly encourages a non-decreasing displacement response to effective rainfall during high- or rising-rainfall episodes, where pore pressure-driven acceleration is expected to dominate the short-horizon dynamics. Outside these regimes, the prior is inactive. A compact routing context derived from ES and DCS summaries feeds a temperature-scaled Top-2 gate, which selects the two most informative role branches and forms a convex combination of their scalar outputs. The normalized gate weights provide sample-wise attributions.

3.1. Dynamic-Lag Causal Discovery in a Time-Unrolled Bayesian Network

We adopt the DLCG as the structural prior for CRAFormer. Concretely, we learn a DAG over lag-expanded historical nodes in a daily, time-unrolled dynamic BN and aggregate multiple runs using a stationary bootstrap to obtain a robust directed graph $\hat{G}$. In line with standard time-series causal discovery, we interpret $\hat{G}$ as a statistically supported summary of recurrent lagged dependencies rather than a fully specified physical model, and its causal reading relies on four working assumptions: (i) acyclicity of the time-unrolled graph; (ii) causal sufficiency for the observed driver set; (iii) approximate stationarity of the short-lag dependence structure over the study period, rather than strict stationarity of the raw displacement levels; and (iv) the Markov and faithfulness conditions [40]. When these assumptions are only approximately satisfied, the main effect is a loss of power: weak or regime-specific dependencies are more likely to be omitted, and some indirect or confounded links may remain, while BY–FDR control and directional thresholding keep $\hat{G}$ relatively sparse. We therefore use $\hat{G}$ as a conservative structural prior for role masking in CRAFormer rather than as a fully identifiable physical model.
The search space consists solely of historical lagged nodes at daily resolution:
$$\mathcal{V}_0 = \big\{\, V_j(t-\tau) \;\big|\; V_j \in \{X_{\mathrm{disp}}\} \cup \{X_j\}_{j \in \mathcal{I}_{\mathrm{drv}}},\; \tau = 0, \dots, L-1 \,\big\},$$
where $\mathcal{I}_{\mathrm{drv}}$ denotes the index set of environmental drivers; $j \in \mathcal{I}$ indexes the variables, with $X_{\mathrm{disp}}$ being the displacement series; $t$ is the last observed day; $\tau \in \{0, \dots, L-1\}$ is the lag; and $L$ is the look-back window.
To prevent anticipative edges, we exclude future rainfall nodes within the prediction horizon from $\mathcal{V}_0$ and later treat them as exogenous inputs in the ICS tail (gauge oracle $R_{\mathrm{gau}}$; see Section 3.3.1) [41,42]. Candidate parents must strictly precede each node in time:
$$\mathcal{P}\big(X_j(t-\tau)\big) \subseteq \big\{\, X_k(t-\tau') : \tau' > \tau,\; k \in \mathcal{I} \,\big\},$$
where $\mathcal{P}(\cdot)$ is the candidate-parent set; $j, k \in \mathcal{I}$; $t$ is the last observed day; and $\tau, \tau' \in \{0, \dots, L-1\}$ are lags with $\tau' > \tau$.
We impose hard constraints that forbid edges from displacement to drivers, and we use domain priors only to break ties [43,44]. We draw $B$ stationary bootstrap resamples with an expected block length of 7 days and learn a constraint-based skeleton for each resample [45]. A sensitivity analysis further shows that the learned DLCG remains stable for block lengths $b \in \{3, 7, 14\}$ days (see Appendix B.4, Figure A9). Conditional independence is tested primarily with a kernel-based test (HSIC/KCI) [46,47], and $p$-values are computed by block permutation with a 7-day block to respect serial dependence. Within each conditioning order, multiplicity is controlled using the Benjamini–Yekutieli FDR (BY-FDR) at $\alpha = 0.05$ [48]. For low-order conditioning sets ($|Z| \le 2$) that pass linearity and normality diagnostics, we fall back to the Gaussian partial correlation test (Fisher–z):
$$z = \frac{1}{2}\sqrt{\,n - |Z| - 3\,}\; \ln\!\frac{1 + r_{XY \cdot Z}}{1 - r_{XY \cdot Z}}, \qquad p = 2\,\Phi\big(-|z|\big),$$
where $r_{XY \cdot Z}$ is the partial correlation, $n$ the sample size, and $\Phi$ the standard normal CDF. To account for serial dependence, we calibrate the $p$-values for the Fisher–z test using the same 7-day block permutation scheme rather than relying on the i.i.d. normal approximation.
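For concreteness, the following numpy sketch illustrates the Fisher–z fallback with block-permutation calibration. It is an illustrative reimplementation, not the paper's code: `partial_corr` uses OLS residuals, and the permutation shuffles contiguous 7-day blocks of one variable, as described above.

```python
import numpy as np

def partial_corr(x, y, Z):
    """Partial correlation r_{XY.Z}: correlate the OLS residuals of x and y on Z."""
    A = np.column_stack([np.ones(len(x)), Z]) if Z.size else np.ones((len(x), 1))
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

def fisher_z(r, n, k):
    """Fisher z statistic for a partial correlation with |Z| = k conditioning variables."""
    return 0.5 * np.sqrt(n - k - 3) * np.log((1 + r) / (1 - r))

def block_perm_pvalue(x, y, Z, block=7, n_perm=300, seed=0):
    """Calibrate |z| by permuting x in contiguous blocks (default 7 days),
    which respects serial dependence; returns a one-sided exceedance p-value."""
    rng = np.random.default_rng(seed)
    n = len(x)
    k = Z.shape[1] if Z.ndim == 2 else 0
    z_obs = abs(fisher_z(partial_corr(x, y, Z), n, k))
    starts = np.arange(0, n, block)
    exceed = 0
    for _ in range(n_perm):
        idx = np.concatenate([np.arange(s, min(s + block, n))
                              for s in rng.permutation(starts)])
        z_p = abs(fisher_z(partial_corr(x[idx], y, Z), n, k))
        exceed += (z_p >= z_obs)
    return (1 + exceed) / (1 + n_perm)
```

The `(1 + exceed) / (1 + n_perm)` estimator avoids reporting an exact zero p-value from a finite permutation sample.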
The pruned skeleton is oriented using v-structures and Meek rules under hard temporal constraints. We also extract the collider and spouse sets $(\mathcal{C}, \mathcal{S})$ for the displacement target, to depth $L_{\mathrm{col}}$, to guide downstream CCS masking. Directed edges are aggregated across bootstrap resamples, and we keep $u \to v$ in $\hat{G}$ if
$$L \ge 0.60 \quad \text{and} \quad L - U \ge 0.30,$$
where $L$ and $U$ are the Clopper–Pearson lower and upper confidence bounds for the forward and reverse selection rates, respectively [49]. We treat these directional selection rates as approximate posterior inclusion probabilities and choose $(L, \Delta) = (0.60, 0.30)$ as a compromise between retaining sufficiently many edges and discarding unstable orientations. Table A1 in Appendix A.1 further reports a small threshold sensitivity study showing that the aggregated DLCG and the DCS/ICS role masks for the displacement node are stable under moderate variations of $(L, \Delta)$. The resulting $\hat{G}$ and $(\mathcal{C}, \mathcal{S})$ serve as priors for role partitioning. Full pseudocode and hyperparameters are provided in Algorithm 1 and Table A3.
Algorithm 1 DLCG: dynamic-lag causal graph on $\mathcal{V}_0$
Require: time series $X \in \mathbb{R}^{T \times D}$, lookback $L$, drivers $\mathcal{I}_{\mathrm{drv}}$, priors $(E^{+}, E^{-})$, BY level $\alpha$, bootstraps $B$, block length, collider depth $L_{\mathrm{col}}$, target
Ensure: robust graph $\hat{G} = (\mathcal{V}_0, E_{\to})$, collider set $\mathcal{C}$, spouse set $\mathcal{S}$
 1: $\mathcal{V}_0 \leftarrow \{V_j(t-\tau) : j \in \{\mathrm{disp}\} \cup \mathcal{I}_{\mathrm{drv}},\ \tau = 0, \dots, L-1\}$    ▹ exclude forecasts
 2: Build temporal mask $M[u, v] \leftarrow (\mathrm{lag}(u) > \mathrm{lag}(v)) \wedge ((u \to v) \notin E^{-})$
 3: for $b = 1$ to $B$ do    ▹ stationary bootstrap
 4:     $X_b \leftarrow \mathrm{StationaryBootstrap}(X, \text{block length})$
 5:     $(G_b, \mathcal{C}_b, \mathcal{S}_b) \leftarrow \mathrm{LearnOnce}(X_b, \mathcal{V}_0, M, E^{+}, E^{-}, \alpha, L_{\mathrm{col}}, \text{target})$
 6: end for
 7: Aggregate $\{G_b\}$ into directed edges $E_{\to}$ using Clopper–Pearson directional bounds; keep $u \to v$ if $L \ge 0.60$ and $L - U \ge 0.30$
 8: $\hat{G} \leftarrow (\mathcal{V}_0, E_{\to})$; $\mathcal{C} \leftarrow \bigcup_b \mathcal{C}_b$; $\mathcal{S} \leftarrow \bigcup_b \mathcal{S}_b$
 9: return $(\hat{G}, \mathcal{C}, \mathcal{S})$
10: procedure LearnOnce($X_b, \mathcal{V}_0, M, E^{+}, E^{-}, \alpha, L_{\mathrm{col}}$, target)
11:     Initialize undirected skeleton $G$ under mask $M$
12:     for $k = 0$ to $k_{\max}$ do    ▹ size-adaptive conditioning order
13:         Test conditional independence (HSIC/KCI; Fisher–z fallback when $|Z| \le 2$ and diagnostics pass) and prune order-$k$ edges with BY–FDR control
14:     end for
15:     Orient $G$ using v-structures and Meek rules; apply $E^{+}$ only for tie-breaking; enforce $E^{-}$
16:     $(\mathcal{C}, \mathcal{S}) \leftarrow \mathrm{MineColliderSpouse}(G, \text{target}, L_{\mathrm{col}})$
17:     return $(G, \mathcal{C}, \mathcal{S})$
18: end procedure
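The StationaryBootstrap step in line 4 can be sketched as follows (Politis–Romano stationary bootstrap with geometrically distributed block lengths; a minimal illustration, not the paper's implementation):

```python
import numpy as np

def stationary_bootstrap(X, expected_block=7, seed=0):
    """Stationary bootstrap resample: block lengths are geometric with mean
    `expected_block`, so short-range serial dependence is preserved while the
    resampled series stays stationary."""
    rng = np.random.default_rng(seed)
    T = len(X)
    idx = np.empty(T, dtype=int)
    pos = int(rng.integers(T))
    for i in range(T):
        idx[i] = pos
        if rng.random() < 1.0 / expected_block:
            pos = int(rng.integers(T))   # start a new block at a random time
        else:
            pos = (pos + 1) % T          # continue the current block (circular)
    return X[idx]
```

Each resample feeds one run of LearnOnce, and the resulting graphs are aggregated in line 7.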

3.2. Causal Role Partitioning

Given the DLCG $\hat{G}$, we derive the following role masks as the algebraic image of d-separation on the time-unrolled BN: ES, DCSs, CCSs, ICSs, and SCSs. These masks constitute the visibility sets used by the encoders: ES contains the temporal self-lags of the target; DCSs equal the parent set $\mathrm{Pa}_{\hat{G}}\big(X_{\mathrm{disp}}(t+1)\big)$ with ES lags removed; ICSs collect proper ancestors and mediators, that is, $\mathrm{An}_{\hat{G}}\big(X_{\mathrm{disp}}(t+1)\big) \setminus (\mathrm{ES} \cup \mathrm{DCS})$; CCSs gather v-structure colliders on ancestral paths to the target and their spouses; and SCSs contain all remaining historical nodes, that is, the complement of $\mathrm{ES} \cup \mathrm{DCS} \cup \mathrm{ICS} \cup \mathrm{CCS}$.
ES comprises the contemporaneous target value and its previous $L-1$ lags:
$$\mathrm{ES} = \big\{\, X_{\mathrm{disp}}(t),\; X_{\mathrm{disp}}(t-1),\; \dots,\; X_{\mathrm{disp}}(t-L+1) \,\big\},$$
where $X_{\mathrm{disp}}$ is the displacement series, $t$ the last observed day, and $L$ the look-back window (days).
DCSs comprise the one-hop parents of the target at time $t+1$, excluding ES:
$$\mathrm{DCS}_i = \big\{\, V_j(t-\tau) \;\big|\; V_j(t-\tau) \to X_i(t+1) \in \hat{G} \,\big\} \setminus \mathrm{ES}_i,$$
where $i$ indexes the target ($X_i \equiv X_{\mathrm{disp}}$); $V_j \in \{\mathrm{disp}\} \cup \mathcal{I}_{\mathrm{drv}}$; $\tau \in \{0, \dots, L-1\}$ is the lag; $\to$ denotes a directed edge in $\hat{G}$; and $\mathrm{ES}_i$ is the ES set for $i$.
CCSs comprise collider nodes and their spouses:
$$\mathcal{C}_i = \big\{\, c \in \mathcal{V}_0 \;\big|\; \exists\, u \ne v : u \to c \leftarrow v,\; u \in \mathrm{An}_{\hat{G}}\big(X_i(t+1)\big) \,\big\}, \qquad \mathcal{S}_i = \big\{\, s \in \mathcal{V}_0 \;\big|\; \exists\, c \in \mathcal{C}_i : s \to c \,\big\},$$
$$\mathrm{CCS}_i = \big(\mathcal{C}_i \cup \mathcal{S}_i\big) \setminus \big(\mathrm{ES}_i \cup \mathrm{DCS}_i\big),$$
where $\mathrm{An}_{\hat{G}}(\cdot)$ denotes the set of ancestors in $\hat{G}$. A collider has two incoming edges that form a v-structure, and spouses are nodes that share the same collider as a child. We apply the spouse-informed kernel projection only when the spouse index set is non-empty; otherwise, it is the identity $\Psi_i = Z_i$.
ICSs comprise intermediate nodes that influence the next-day target through directed paths of length at least 2, excluding ES, DCSs, and CCSs:
$$\mathrm{ICS}_i = \big\{\, V_j(t-\tau) \;\big|\; \exists\, \gamma \in \mathrm{Paths}_{\hat{G}}\big(V_j(t-\tau),\, X_i(t+1)\big) \text{ with } |\gamma| \ge 2 \,\big\} \setminus \big(\mathrm{ES}_i \cup \mathrm{DCS}_i \cup \mathrm{CCS}_i\big),$$
where $V_j \in \{\mathrm{disp}\} \cup \mathcal{I}_{\mathrm{drv}}$; $i$ indexes the displacement target ($X_i \equiv X_{\mathrm{disp}}$); $t$ is the last observed day; $\tau \in \{0, \dots, L-1\}$ is the lag; $\mathrm{Paths}_{\hat{G}}(u, v)$ are directed paths in $\hat{G}$; $|\gamma|$ is the path length (in edges); and $\mathrm{ES}_i$, $\mathrm{DCS}_i$, and $\mathrm{CCS}_i$ are as defined above.
SCSs comprise all remaining nodes:
$$\mathrm{SCS}_i = \mathcal{V}_0 \setminus \big(\mathrm{ES}_i \cup \mathrm{DCS}_i \cup \mathrm{CCS}_i \cup \mathrm{ICS}_i\big).$$
We assign roles in the following precedence order:
$$\mathrm{ES} \succ \mathrm{DCS} \succ \mathrm{CCS} \succ \mathrm{ICS} \succ \mathrm{SCS}.$$
This order ensures that the five sets are pair-wise disjoint and exhaustive over $\mathcal{V}_0$, and we break ties due to sampling variability in this order. Because $\hat{G}$ is obtained by aggregating stationary bootstrap graphs under conservative directional thresholds (Section 3.1), the resulting ES/DCS/ICS/CCS/SCS masks inherit a degree of robustness to sampling variability. For the displacement node, we verify on representative stations that the dominant short-lag hydrologic drivers assigned to DCSs and ICSs remain in the same roles when the DLCG is recomputed on bootstrap subsets, with only a few marginal lags switching between ICSs and SCSs (Appendix A.2, Table A2). This stability limits the impact of structural uncertainty on the Top-2 gating and supports the use of the five-way partition as a reliable basis for interpretability.
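The five-way partition above can be sketched in a few lines of pure Python over a node-to-parents map. This is an illustrative simplification: colliders are approximated as non-target nodes with at least two parents, one of which is an ancestor of the target, without the depth limit $L_{\mathrm{col}}$ or the path machinery of the full definition.

```python
def ancestors(parents, node):
    """All proper ancestors of `node` in a DAG given a node -> parents map."""
    seen, stack = set(), list(parents.get(node, []))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(parents.get(u, []))
    return seen

def partition_roles(parents, target, self_lags):
    """Five-way role partition with precedence ES > DCS > CCS > ICS > SCS.
    `self_lags` are the target's own lagged copies (the ES set)."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    ES = set(self_lags)
    an = ancestors(parents, target)
    DCS = set(parents.get(target, [])) - ES
    colliders = {c for c, ps in parents.items()
                 if c != target and len(ps) >= 2
                 and any(p in an for p in ps)}
    spouses = {p for c in colliders for p in parents[c]}
    CCS = (colliders | spouses) - ES - DCS
    ICS = an - ES - DCS - CCS
    SCS = nodes - ES - DCS - CCS - ICS - {target}
    return {"ES": ES, "DCS": DCS, "CCS": CCS, "ICS": ICS, "SCS": SCS}
```

Applying the precedence order as successive set subtractions guarantees disjointness and exhaustiveness by construction.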

3.3. Multibranch Displacement Prediction Model

Guided by the role partitioning, CRAFormer adopts a multibranch architecture. Each causal subset is encoded with temporal and structural masks (Figure 2); this separation isolates distinct sources of influence and enforces non-anticipativity. Here, $\mathbf{C}$ is a strictly lower-triangular temporal mask and $\mathbf{S}^{(\mathrm{role})}$ is the role-specific visibility mask derived from $\hat{G}$. Each branch applies masked self-attention with the composite mask
$$\mathbf{M}_{\mathrm{role}} = \mathbf{C} \wedge \mathbf{S}^{(\mathrm{role})},$$
where $\wedge$ denotes the element-wise logical AND. The masking operator $\Omega(\cdot)$ adds $0$ to visible entries and $-\infty$ to masked entries before applying softmax to the attention logits.
Role-generic encoder (ES/DCS/SCS/CCS): Roles $r \in \{\mathrm{ES}, \mathrm{DCS}, \mathrm{SCS}, \mathrm{CCS}\}$ share the encoder shown in Figure 2a, where inputs are linearly projected and positionally encoded. We then apply single-head causal self-attention with the mask $\mathbf{M}_r = \mathbf{C} \wedge \mathbf{S}^{(r)}$ [50], followed by a small MLP and a linear head:
$$\tilde{X}_i^{(r)} = \mathrm{PE}\big(X_i^{(r)} W_{\mathrm{in}}^{(r)}\big) \in \mathbb{R}^{L \times d},$$
where $i$ indexes samples; $r \in \{\mathrm{ES}, \mathrm{DCS}, \mathrm{SCS}, \mathrm{CCS}\}$; $X_i^{(r)}$ is the length-$L$ input for role $r$; $\mathrm{PE}(\cdot)$ is the positional encoding; and $d$ is the hidden width.
$$Q^{(r)} = \tilde{X}_i^{(r)} W_q^{(r)}, \qquad K^{(r)} = \tilde{X}_i^{(r)} W_k^{(r)}, \qquad V^{(r)} = \tilde{X}_i^{(r)} W_v^{(r)},$$
where $W_{\mathrm{in}}^{(r)}, W_{q,k,v}^{(r)}$ are learned parameters.
$$A_i^{(r)} = \mathrm{softmax}\!\left( \frac{Q^{(r)} K^{(r)\top}}{\sqrt{d}} + \Omega(\mathbf{M}_r) \right),$$
where $\Omega(\mathbf{M}_r)$ applies the binary mask $\mathbf{M}_r = \mathbf{C} \wedge \mathbf{S}^{(r)}$ (with $\mathbf{C}$ strictly causal and $\mathbf{S}^{(r)}$ the role visibility).
$$Z_i^{(r)} = A_i^{(r)} V^{(r)} \in \mathbb{R}^{L \times d}, \qquad z_i^{(r)} = Z_i^{(r)}[L, :] \in \mathbb{R}^{d},$$
where $Z_i^{(r)}$ are token features and $z_i^{(r)} = Z_i^{(r)}[L, :]$ is the last token.
$$y_i^{(r)} = w_2^{(r)\top}\, \mathrm{GELU}\big( z_i^{(r)} W_1^{(r)} \big) + b_r,$$
where $W_1^{(r)}, w_2^{(r)}, b_r$ are learned parameters and $y_i^{(r)}$ is the scalar head output.
For ES, DCSs, and SCSs, we use the scalar head $y_i^{(r)}$, and for CCSs, we also retain the token map $Z_i^{(\mathrm{CCS})}$ for downstream projection. The encoder is low-capacity and uses single-head attention, a small hidden dimension $d$, dropout $0.1$, and $L_2$ weight decay of $10^{-4}$.
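The masked attention with the composite mask can be sketched as follows. Shapes and the hidden column are illustrative; here each token also attends to itself so that every row of the mask has at least one visible key (a practical detail, since a fully empty row would make the softmax undefined).

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv, M):
    """Single-head self-attention with a binary visibility mask M: masked
    logits receive -inf before the softmax, so invisible positions get
    exactly zero attention weight (the role of the Omega operator)."""
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = np.where(M, Q @ K.T / np.sqrt(d), -np.inf)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

# composite mask M = C AND S: C lets each token see itself and the past,
# and the role mask S hides one driver column as an illustration
L, d = 6, 4
rng = np.random.default_rng(0)
C = np.tril(np.ones((L, L), dtype=bool))
S = np.ones((L, L), dtype=bool)
S[:, 2] = False
out, w = masked_self_attention(rng.normal(size=(L, d)),
                               rng.normal(size=(d, d)),
                               rng.normal(size=(d, d)),
                               rng.normal(size=(d, d)),
                               C & S)
```

The resulting attention matrix is zero above the diagonal (non-anticipative) and zero on the hidden role column.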
In the CCS branch, let $Z_i \equiv Z_i^{(\mathrm{CCS})}$ from (12)–(16). Given the spouse-token index set $\mathcal{I}_S$, we remove the components explained by spouses using an RBF kernel projection:
$$Z_{i,S} = Z_i[\mathcal{I}_S, :], \qquad K_{SS} = \kappa\big(Z_{i,S}, Z_{i,S}\big), \qquad K_{TS} = \kappa\big(Z_i, Z_{i,S}\big),$$
where $\kappa(u, v) = \exp\big(-\|u - v\|_2^2 / (2\sigma^2)\big)$ is the RBF kernel and $\varepsilon > 0$ stabilizes the inverse.
$$\hat{Z}_i^{S} = K_{TS}\big(K_{SS} + \varepsilon I\big)^{-1} Z_{i,S}, \qquad \Psi_i = Z_i - \hat{Z}_i^{S},$$
where $I$ is the identity and $\mathrm{vec}(\cdot)$ vectorizes a matrix.
$$y_i^{\mathrm{CCS}} = w_{\mathrm{CCS}}^{\top}\, \mathrm{vec}(\Psi_i) + b_{\mathrm{CCS}}.$$
If $|\mathcal{I}_S| = 0$, we set $\Psi_i = Z_i$.
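A minimal numpy sketch of this spouse-informed projection (default $\sigma$ and $\varepsilon$ are illustrative choices):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Pairwise RBF kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def spouse_projection(Z, spouse_idx, sigma=1.0, eps=1e-4):
    """Kernel-ridge removal of the spouse-explained component:
    Psi = Z - K_TS (K_SS + eps I)^{-1} Z_S; identity when no spouses."""
    if len(spouse_idx) == 0:
        return Z
    Zs = Z[spouse_idx, :]
    Kss = rbf_kernel(Zs, Zs, sigma)
    Kts = rbf_kernel(Z, Zs, sigma)
    return Z - Kts @ np.linalg.solve(Kss + eps * np.eye(len(spouse_idx)), Zs)
```

Because the kernel-ridge fit nearly interpolates the spouse tokens themselves, the residual $\Psi_i$ is close to zero at spouse positions, which is exactly the "explained by spouses" component being removed.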
ICS branch encoder: The ICS branch appends the exogenous next-day rainfall $R_{\mathrm{gau}}$ (a gauge observation, not an NWP forecast) as the tail token shown in Figure 2c. We construct an effective rainfall driver from the rainfall history and $R_{t+1}$. When a numerical forecast of $R_{t+1}$ is available, we use it; otherwise, the observed $R_{t+1}$ serves as a practical proxy at evaluation time. We pass $R_{t+1}$ through a detach operation to stop gradients before fusion.
$$\tilde{X}_i^{\mathrm{ICS}} = \mathrm{PE}\big(X_i^{\mathrm{ICS}} W_{\mathrm{in}}^{\mathrm{ICS}}\big) \in \mathbb{R}^{L \times d}.$$
$$r_{\mathrm{eff},i} = \mathrm{ReLU}\Big(\, a^{\top} R_i + v^{\top} \phi\big(W R_i + b\big) + \beta,\;\; \mathrm{detach}\big(R_{\mathrm{gau},i}\big)\, \theta \,\Big),$$
where $i$ indexes samples; $X_i^{\mathrm{ICS}}$ is the length-$L$ ICS input; $\mathrm{PE}(\cdot)$ is the positional encoding; $d$ is the hidden width; $W_{\mathrm{in}}^{\mathrm{ICS}}$ is a learned projection; $R_i$ are rainfall features (history + $R_{t+1}$); $R_{\mathrm{gau},i}$ is the gauge oracle next-day rainfall; $\mathrm{detach}(\cdot)$ blocks gradients; $\phi(\cdot)$ is a point-wise non-linearity; $a, v, W, b, \beta, \theta$ are learned parameters; and $\mathrm{ReLU}(x, y) := \max(x, y)$.
We map the effective rainfall score $r_{\mathrm{eff},i}$ to a non-negative tail token representing the gauge oracle exogenous next-day rainfall input, and we append this token to the sequence:
$$x_{\mathrm{tail},i} = W_r\, r_{\mathrm{eff},i}, \qquad W_r = \mathrm{softplus}\big(\tilde{W}_r\big) \ge 0,$$
$$Z_{\mathrm{in},i}^{\mathrm{ICS}} = \big[\, \tilde{X}_i^{\mathrm{ICS}};\; x_{\mathrm{tail},i} \,\big] \in \mathbb{R}^{(L+1) \times d},$$
where $\tilde{W}_r$ is unconstrained and $W_r = \mathrm{softplus}(\tilde{W}_r)$ (element-wise), so $W_r \ge 0$; $x_{\mathrm{tail},i} \in \mathbb{R}^{d}$ is the non-negative tail token; $[\,\cdot\,;\,\cdot\,]$ stacks along the time axis to append the tail as the $(L+1)$-st token, yielding $Z_{\mathrm{in},i}^{\mathrm{ICS}} \in \mathbb{R}^{(L+1) \times d}$; and $d$ is the hidden width.
With $\mathbf{M}_{\mathrm{hist}} = \mathbf{C} \wedge \mathbf{S}^{(\mathrm{ICS})}$, we apply a block causal mask that allows the tail to attend to all historical tokens while preventing attention from history to the tail, which preserves non-anticipativity:
$$\mathbf{M}_{\mathrm{ICS}} = \begin{pmatrix} \mathbf{M}_{\mathrm{hist}} & \mathbf{0} \\ \mathbf{1}_L^{\top} & 1 \end{pmatrix},$$
where $\mathbf{1}_L$ is a length-$L$ vector of ones.
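The block structure is easy to construct explicitly; the following sketch extends an arbitrary $L \times L$ history mask with the tail row and column:

```python
import numpy as np

def ics_block_mask(M_hist):
    """Append a tail token to an L x L history mask: the tail row attends to
    all history and itself, while the tail column stays invisible to history,
    so no historical token can condition on next-day rainfall."""
    L = M_hist.shape[0]
    M = np.zeros((L + 1, L + 1), dtype=bool)
    M[:L, :L] = M_hist   # history-to-history visibility unchanged
    M[L, :] = True       # tail sees every historical token and itself
    return M
```

Only the tail token, read by the prediction head, ever mixes in the exogenous rainfall, which is what makes the channel leakage-free.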
The prediction head reads only the tail token, with element-wise non-negative weights:
$$y_i^{\mathrm{ICS}} = w_h^{\top} z_{i,\mathrm{tail}} + b_h, \qquad w_h = \mathrm{softplus}\big(\tilde{w}_h\big) \ge 0,$$
where $\mathrm{softplus}(\cdot)$ enforces element-wise non-negativity on $W_r$ and $w_h$. We apply a detach operation to the exogenous rainfall input so that no gradients flow from the tail token back to historical tokens. However, non-negative mixing alone does not guarantee exact monotonicity because the tail token is produced by attention and MLP layers. We therefore regularize monotonic behavior with $\mathcal{L}_{\mathrm{mono}}^{\mathrm{rain}}$, applied only when the rainfall trigger $m_i = 1$. Note that the intent is not to impose global monotonicity of displacement with respect to rainfall but to encode a local short-horizon prior under strongly forced, rainfall-dominated conditions.
Lite Transformer cell: Each attention block in Figure 2d adopts a pre-norm design with a learnable residual mixing coefficient $\mu \in (0, 1)$ [51]. For input $H_{\ell}$ and mask $\mathbf{M}$,
$$A_{\ell} = \mathrm{Attn}\big(\mathrm{LN}(H_{\ell});\, \mathbf{M}\big), \qquad H'_{\ell} = H_{\ell} + \mu A_{\ell},$$
$$F_{\ell} = \mathrm{MLP}\big(\mathrm{LN}(H'_{\ell})\big), \qquad H_{\ell+1} = H'_{\ell} + F_{\ell},$$
where $\mathrm{Attn}(\cdot\,; \mathbf{M})$ denotes masked attention and $\mathrm{LN}$ denotes layer normalization. This block reduces the parameter count, stabilizes training, and aligns with panels (a)–(d).
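The two-step residual update can be sketched as follows, with the attention and MLP sublayers passed in as callables (a structural illustration, not the trained cell):

```python
import numpy as np

def layer_norm(H, eps=1e-5):
    """Per-token layer normalization over the feature axis."""
    return (H - H.mean(-1, keepdims=True)) / (H.std(-1, keepdims=True) + eps)

def lite_block(H, attn, mlp, mu=0.5):
    """Pre-norm block with residual mixing coefficient mu in (0, 1):
    H' = H + mu * Attn(LN(H)); H_next = H' + MLP(LN(H'))."""
    Hp = H + mu * attn(layer_norm(H))
    return Hp + mlp(layer_norm(Hp))
```

With identity sublayers, the block reduces to the residual path, which is why a small $\mu$ keeps early training close to the identity map.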

3.3.1. Context-Aware Gating and Fusion

A compact routing context summarizes the ES output and the DCS token features:
$$\mu_i^{\mathrm{DCS}} = \mathrm{mean}_t\big(\tilde{X}_i^{\mathrm{DCS}}\big) \in \mathbb{R}^{d}, \qquad c_i = \mathrm{MLP}_{\mathrm{ctx}}\big(\big[\, y_i^{\mathrm{ES}};\, \mu_i^{\mathrm{DCS}} \,\big]\big) \in \mathbb{R}^{h_{\mathrm{ctx}}},$$
where $\mathrm{mean}_t$ denotes the temporal mean. The vector $c_i$ summarizes recent displacement and direct drivers and serves as a shared routing context [52].
Given the branch-level scalar outputs $\{y_{i,j}\}_{j \in \mathcal{J}}$ with $\mathcal{J} = \{\mathrm{ES}, \mathrm{DCS}, \mathrm{CCS}, \mathrm{ICS}, \mathrm{SCS}\}$, we compute per-branch logits using an affine transform of $y_{i,j}$ and a shared context term:
$$z_{i,j} = a_j\, y_{i,j} + v^{\top} c_i + b_j, \qquad j \in \mathcal{J},$$
where $i$ indexes samples; $j \in \mathcal{J}$ is the branch; $y_{i,j}$ is branch $j$'s scalar output; $a_j, b_j$ are learned gate scalars; and $v \in \mathbb{R}^{h_{\mathrm{ctx}}}$ and $c_i \in \mathbb{R}^{h_{\mathrm{ctx}}}$ are the shared-context weight and vector (so $z_{i,j}$ is a logit).
Soft routing probabilities with temperature $\tau > 0$ are
$$\pi_{i,j} = \frac{\exp\big(z_{i,j}/\tau\big)}{\sum_{k \in \mathcal{J}} \exp\big(z_{i,k}/\tau\big)}, \qquad j \in \mathcal{J},$$
where the denominator sums over $\mathcal{J}$, so $\sum_{j \in \mathcal{J}} \pi_{i,j} = 1$.
We use temperature-scaled soft targets for training [53] and select the Top-2 routing probabilities, recording their indices as $S_i = \mathrm{Top}\text{-}2\{\pi_{i,j}\}_{j \in \mathcal{J}}$. We then renormalize over $S_i$ for sparse MoE routing [54,55], and the normalized gates fuse the branch outputs:
$$g_{i,j} = \frac{\pi_{i,j}\, \mathbb{1}[j \in S_i]}{\sum_{\ell \in S_i} \pi_{i,\ell}}, \qquad \hat{y}_i = \sum_{j \in \mathcal{J}} g_{i,j}\, y_{i,j},$$
where $\mathbb{1}[\cdot]$ denotes the indicator function and $\sum_{j \in \mathcal{J}} g_{i,j} = 1$. We break ties by the fixed role order $\mathrm{ES} \succ \mathrm{DCS} \succ \mathrm{CCS} \succ \mathrm{ICS} \succ \mathrm{SCS}$ to ensure determinism.
To promote decisive routing and stabilize training, we penalize the entropy of the pre-truncation probability vector:
$$\mathcal{L}_{\mathrm{gate}} = \sum_i H(\pi_i), \qquad H(\pi_i) = -\sum_{j \in \mathcal{J}} \pi_{i,j} \log \pi_{i,j}.$$
This entropy-minimization regularizer encourages confident assignments [56]. We also apply $L_2$ weight decay to $\{a_j, b_j, v\}$. In practice, a temperature $\tau \in [0.3, 1.0]$ and $\lambda_{\mathrm{ent}} \in [10^{-3}, 10^{-2}]$ yield stable Top-2 routing without gate collapse.
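The softmax, Top-2 truncation, and renormalization steps can be sketched as follows (a minimal numpy illustration of the routing arithmetic, not the learned gate):

```python
import numpy as np

def top2_gate(y, z, tau=0.5):
    """Temperature-scaled softmax over branch logits z, Top-2 truncation with
    a stable (deterministic) sort, renormalization over the kept pair, and
    convex fusion of the branch scalar outputs y."""
    p = np.exp((z - z.max()) / tau)
    p = p / p.sum()
    keep = np.argsort(-p, kind="stable")[:2]   # fixed order breaks exact ties
    g = np.zeros_like(p)
    g[keep] = p[keep] / p[keep].sum()
    return float((g * y).sum()), g, p
```

Because the kept weights are renormalized, the fused prediction is always a convex combination of the two selected branch outputs, and the full vector `p` is what the entropy penalty acts on.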

3.3.2. Joint Loss Function

The total loss combines the prediction error, an SCS gate sparsity penalty, an optional rainfall-triggered monotonicity term, gating entropy regularization, and $L_2$ weight decay [56,57]:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{pred}} + \lambda_{\mathrm{scs}}\, \bar{g}_{\mathrm{SCS}} + \lambda_{\mathrm{mono}}\, \mathcal{L}_{\mathrm{mono}}^{\mathrm{rain}} + \lambda_{\mathrm{ent}}\, \mathcal{L}_{\mathrm{gate}} + \gamma\, \|\Theta\|_2^2,$$
where $\lambda_{\mathrm{scs}}, \lambda_{\mathrm{mono}}, \lambda_{\mathrm{ent}}, \gamma \ge 0$ are weights (typical ranges: $\lambda_{\mathrm{scs}} \in [10^{-3}, 10^{-2}]$; $\lambda_{\mathrm{mono}} \in [10^{-3}, 10^{-1}]$; $\lambda_{\mathrm{ent}} \in [10^{-3}, 10^{-2}]$; and $\gamma \in [10^{-5}, 10^{-3}]$); setting $\lambda_{\mathrm{mono}} = 0$ disables the monotonicity term; $\bar{g}_{\mathrm{SCS}}$ denotes the batch-mean SCS gate activation; and $\Theta$ collects all trainable parameters. To encourage decisive routing, we include a small Shannon entropy penalty $\mathcal{L}_{\mathrm{gate}}$ on the pre-truncation gate probabilities (see Section 3.3.1).
The prediction loss and the batch-mean SCS gate are
$$\mathcal{L}_{\mathrm{pred}} = \frac{1}{B}\sum_{i=1}^{B} \big(\hat{y}_i - y_i\big)^2, \qquad \bar{g}_{\mathrm{SCS}} = \frac{1}{B}\sum_{i=1}^{B} g_{i,\mathrm{SCS}},$$
where $B$ denotes the minibatch size. The target $y_i$ is the cumulative displacement over the prediction horizon $H$,
$$y_i \equiv \sum_{h=1}^{H} \Delta \mathrm{disp}\big(t_i + h\big),$$
and $\hat{y}_i$ is its prediction (both in physical units). The quantity $g_{i,\mathrm{SCS}} \in [0, 1]$ is the normalized SCS gate, and $\sum_{j \in \mathcal{J}} g_{i,j} = 1$.
We activate the physics prior only under heavy or rising rainfall and only when the ICS gate is active:
$$m_i = \mathbf{1}\!\left[ R_{7,i} \ge \tau_r \ \lor \ \left( r_{\mathrm{eff},i} - r_{\mathrm{eff},i}^{\mathrm{prev}} \right) \ge \tau_\Delta \right] \cdot \mathbf{1}\!\left[ g_{i,\mathrm{ICS}} \ge \tau_g \right],$$
where $R_{7,i}$ denotes the 7-day accumulated rainfall, and $r_{\mathrm{eff},i}^{\mathrm{prev}}$ is the previous step's effective rainfall score. We set $\tau_r$ to the 80th percentile of $R_7$ on the training split, and $\tau_\Delta$ and $\tau_g$ are fixed thresholds (for example, $\tau_\Delta > 0$ and $\tau_g = 0.5$). These criteria restrict the monotonicity prior to episodes with already high or rapidly increasing rainfall and strong routing to the hydrologic ICS branch, which are precisely the regimes where a non-decreasing short-horizon response to additional rainfall is physically plausible. In drier periods or during post-event drainage and partial recovery, the trigger satisfies $m_i = 0$, and the network is free to represent non-monotonic behavior such as relaxation or partial re-stabilization.
This term thus acts as a monotonicity regularizer; it penalizes negative finite-difference slopes only when m i = 1 . In practice, λ mono is chosen in the low range given above, so L mono rain provides a soft preference rather than a hard constraint. If the data strongly support local decreases even under high rainfall, CRAFormer can accommodate them via the ES/DCS branches and the Top-2 routing.
Let $\Delta \hat{y}_i \triangleq \hat{y}_i^{(+)} - \hat{y}_i$ denote the finite-difference slope with respect to a small increase in the effective rainfall driver. The penalty $[\max(0, \hat{y}_i - \hat{y}_i^{(+)})]^2$ is equivalent to $[\max(0, -\Delta \hat{y}_i)]^2$ and follows monotonic constraint regularization in neural networks [58]:
$$\mathcal{L}_{\mathrm{mono}}^{\mathrm{rain}} = \frac{\sum_{i=1}^{B} m_i \left[ \max\!\left( 0, \ \hat{y}_i - \hat{y}_i^{(+)} \right) \right]^2}{\sum_{i=1}^{B} m_i + \varepsilon},$$
where $\varepsilon > 0$ is a small constant for numerical stability.
We probe a small relative positive step in the effective rainfall driver as follows:
$$r_{\mathrm{eff},i}^{(+)} = r_{\mathrm{eff},i} + \operatorname{detach}\!\left( \eta_r \left( \lvert r_{\mathrm{eff},i} \rvert + \varepsilon_r \right) \right),$$
where $\eta_r \in (0, 1)$ (default $0.1$) and $\varepsilon_r > 0$ prevents vanishing steps near zero. We set $\tau_r$ to the 80th percentile of $R_7$ on the training split, $\tau_\Delta$ to the 80th percentile of $\Delta r$ (the first difference of $r_{\mathrm{eff}}$ on the training split), and $\tau_g = 0.5$. The perturbed output $\hat{y}_i^{(+)} = \hat{y}_i\!\left( r_{\mathrm{eff},i}^{(+)} \right)$ is computed with the step detached so that gradients do not flow through it.
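The trigger mask and the one-sided finite-difference penalty can be combined into a short sketch. The snippet below is a schematic, non-autograd version: `model` stands in for the full network restricted to the effective-rainfall input (all other inputs held fixed), thresholds are passed in rather than estimated from percentiles, and the detach semantics are only indicated in a comment:

```python
def rain_monotonicity_penalty(model, r_eff, R7, g_ics,
                              tau_r, tau_delta, tau_g=0.5,
                              eta_r=0.1, eps_r=1e-6, eps=1e-8):
    """One-sided finite-difference penalty, active only on triggered samples.

    `model` maps an effective-rainfall score to a displacement prediction;
    in the full system the perturbed forward pass uses a step that is
    detached from the autograd graph so gradients do not flow through it.
    """
    num, denom = 0.0, 0.0
    for i in range(len(r_eff)):
        # Trigger: heavy (R7 >= tau_r) OR rising rainfall, AND an active ICS gate.
        rising = (r_eff[i] - (r_eff[i - 1] if i > 0 else r_eff[i])) >= tau_delta
        m_i = 1.0 if (R7[i] >= tau_r or rising) and g_ics[i] >= tau_g else 0.0

        # Small relative positive step in the effective rainfall driver.
        r_plus = r_eff[i] + eta_r * (abs(r_eff[i]) + eps_r)
        y, y_plus = model(r_eff[i]), model(r_plus)

        # Penalize a *decrease* in predicted displacement under more rain.
        num += m_i * max(0.0, y - y_plus) ** 2
        denom += m_i
    return num / (denom + eps)
```

A monotonically increasing toy model incurs zero penalty, whereas a decreasing one is penalized on the triggered samples only.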

4. Study Areas and Data

The landslides at LaMenTun and BaYiTun in Guangxi are selected as case studies. Both sites lie in humid, subtropical, monsoon-dominated settings and exhibit rainfall-triggered deformation patterns that are typical for this region of southern China; however, they differ markedly in their deformation regimes. Therefore, they offer two contrasting examples within a shared climatic and geomorphological setting. The LaMenTun landslide is characterized by progressive, seasonally modulated creep behavior with periodic acceleration and step-like responses, while the BaYiTun landslide displays more episodic, leap-frog deformation dominated by intermittent and spatially heterogeneous movement (Figure 3 and Figure 4). Both sites host continuous GNSS stations and multisensor environmental monitoring, providing long-term, high-resolution time-series data for training and validating mechanism-aware prediction models.

4.1. LaMenTun Landslide

The LaMenTun landslide lies in Dangliang Village, Nazhi Township, Tian’e County, Guangxi. The landslide scar is semi-elliptical in plan within low-to-moderate local relief and partially encircles the local settlement (Figure 5). The source is at ∼800 m above sea level (a.s.l.) and the toe at ∼560 m, yielding ∼240 m of relief with a mean slip azimuth of 210°. Terrace regrading and slope cutting further weaken the NE–SW-trending slope. The bedrock consists of muddy siltstone overlain by 0.8–2.8 m of reddish-brown colluvial clay. The contact forms a soft-over-hard, dip-parallel sliding surface that weakens during intense rainfall (Figure 6). Historical failures in 2009 and 2011, together with active tensile cracks, indicate ongoing instability. The LaMenTun landslide monitoring network, comprising four GNSS stations, one reference station, one crack meter, a rain gauge, and soil moisture and soil temperature sensors, was established and commissioned on 30 March 2021 (Figure 5c).
This study compiles the multisource dataset collected by the network from 30 March 2021 to 28 June 2025, which includes daily GNSS displacement, rainfall, soil temperature, and volumetric water content. Outliers are filtered using physically informed thresholds: soil temperature readings outside $[-5, 50]$ °C are removed, and sudden displacement shifts are flagged by a second-derivative threshold (acceleration > 5 mm day$^{-2}$) and cross-validated with crack meter data. We aggregate all variables to daily means, and displacement and rainfall gaps shorter than three days are linearly interpolated. Soil variables are reconstructed with depth-aware cubic splines. Longer outages, accounting for < 1% of timestamps, are masked and excluded from loss computation. No synthetic values beyond minimal interpolation are introduced, ensuring that occasional gaps do not affect forecast reliability.
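The quality-control rules above (range filtering, linear interpolation of short gaps, masking of long outages) can be sketched for a single daily series. The helper below is illustrative, not the production pipeline; the plausibility band and the three-day gap rule are taken from the text, while the function name and `None`-as-mask convention are assumptions:

```python
def qc_series(values, lo=-5.0, hi=50.0, max_gap=3):
    """Mask out-of-range readings, then linearly interpolate interior gaps
    shorter than `max_gap` samples; longer outages stay None (masked in the loss)."""
    # Step 1: replace missing or physically implausible readings with None.
    x = [v if (v is not None and lo <= v <= hi) else None for v in values]
    i = 0
    while i < len(x):
        if x[i] is None:
            # Find the extent [i, j) of the current gap.
            j = i
            while j < len(x) and x[j] is None:
                j += 1
            # Interpolate only interior gaps shorter than max_gap days.
            if 0 < i and j < len(x) and (j - i) < max_gap:
                left, right = x[i - 1], x[j]
                for k in range(i, j):
                    t = (k - i + 1) / (j - i + 1)
                    x[k] = left + t * (right - left)
            i = j
        else:
            i += 1
    return x
```

For example, a single missing day between 10 and 14 is filled with 12, an out-of-range spike (e.g., 99 °C) is first masked and then interpolated, and a three-day outage is left masked.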
The dataset reveals pronounced station-dependent deformation at LaMenTun (Figure 3). GPS03 shows the largest cumulative displacement (∼700 mm by mid-2025), with clear step-wise increments, including an abrupt ∼300 mm jump in mid-2022 and several smaller steps during 2023–2025. GPS01 undergoes slow, quasi-seasonal creep (∼5–65 mm) with modest accelerations and no sharp offsets. GPS04 records steady creep with a few moderate steps (total ∼100 mm). GPS02 is the least mobile (generally < 30 mm ), with weak seasonality and short-period fluctuations. The LF01 crack meter widens in a step-wise manner, broadly synchronous with the main GNSS steps, especially in mid-2022 and mid-2023.

4.2. BaYiTun Landslide

The BaYiTun landslide is located southeast of BaYiTun Hamlet, Nandan County, Guangxi. It sits in a karst hill–canyon setting with ∼180 m local relief. Residual slope deposits of gravelly, clay-rich colluvium overlie Carboniferous tuff, forming a soft-over-hard interface that weakens during intense rainfall. Two N–S fractures and pervasive joints facilitate rapid groundwater routing (see [59] for details). The monitoring system includes one GNSS reference (JZ03), GPS01–GPS03 (GPS04 retired 4 August 2021), a co-located rain gauge (YL01), and soil moisture/temperature probes at 20–80 cm. Measurements at 06:00, 12:00, and 18:00 are aggregated to daily values. The dataset analyzed here spans 6 November 2021 to 28 June 2025. Data quality control, temporal aggregation, and gap handling follow the unified pipeline in Section 4.1.
BaYiTun exhibits abrupt, multistage step-wise motion (Figure 4): GPS01 (up-slope) remains largely stable with cumulative displacement < 60 mm; GPS02 (near the crest) shows gradual creep with episodic surges; GPS03 (mid-slope; “G3” in Figure 7c) displays 20–60 mm step-wise increments during wet spells (10-day rainfall ≥ 120 mm), consistent with excavation-induced toe weakening and a regolith–tuff contact that channels pore pressure-driven shear. Among the stations, GPS03 shows the highest variance and strongest non-stationarity.

5. Results

5.1. Experimental Environment

All experiments are run on a single CPU-only workstation to keep resource constraints identical across models. The host runs 64-bit Windows 11 (build 26100) on an Intel Core i7-9700 (eight cores, 3.0 GHz) with 32 GB of RAM, and we use Python 3.11.7 (Anaconda) and PyTorch 2.2.1 (cpuonly) to perform our calculations. GPU acceleration is disabled (CUDA_VISIBLE_DEVICES = “”, torch.version.cuda = None, torch.backends.cudnn.enabled = False). To ensure reproducibility, we enable deterministic algorithms (torch.use_deterministic_algorithms(True)) and use fixed random seeds for Python, NumPy, and PyTorch. Table 1 summarizes the key software and hardware components.
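The determinism settings listed above can be collected into one setup routine. This is a sketch of the configuration described in the text, not the authors' exact script (the function name is illustrative):

```python
import os
import random

import numpy as np
import torch

# Force CPU-only execution; must be set before any CUDA initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

def set_deterministic(seed: int) -> None:
    """Fix all RNGs and enable deterministic kernels for CPU-only runs."""
    random.seed(seed)          # Python's built-in RNG
    np.random.seed(seed)       # NumPy RNG
    torch.manual_seed(seed)    # PyTorch RNG
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.enabled = False

set_deterministic(2021)  # seeds 2021-2025 are used across repeated runs
```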

5.2. Performance Metrics

We evaluate model performance using four metrics: the mean absolute error (MAE), the root mean squared error (RMSE) [60], the coefficient of determination ($R^2$) [61], and the turning-point mean absolute error ($\mathrm{MAE}_{\mathrm{turn}}$).
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|,$$
$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2 },$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}.$$
The turning-point error is calculated as follows. Let $\Delta y_i = |y_i - y_{i-1}|$ for $i = 2, \ldots, n$, and define $T$ as the index set of the top-$p\%$ samples (we use $p = 20$) ranked by $\Delta y_i$ in the observed series (ties at the threshold are broken arbitrarily; the first sample is excluded from the ranking). Then
$$\mathrm{MAE}_{\mathrm{turn}} = \frac{1}{|T|} \sum_{i \in T} \left| \hat{y}_i - y_i \right|,$$
where $\hat{y}_i$ denotes the prediction, $y_i$ the observation, $\bar{y}$ the mean of observations, and $n$ the number of samples. $\mathrm{MAE}_{\mathrm{turn}}$ emphasizes accuracy at rapid transitions (large $|\Delta y|$), complementing the aggregate metrics. In the main experiments, we define the turning-point error $\mathrm{MAE}_{\mathrm{turn}}^{20}$ as the mean absolute error computed on the top 20% of samples ranked by $|\Delta y|$. This 20% threshold is a pragmatic compromise: it focuses the metric on large displacement changes that are most relevant for early warning while retaining enough samples to obtain stable statistics at each station.
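The four metrics can be computed together in one pass. The helper below is an illustrative reference implementation; ties in the turning-set ranking are broken by stable sort order, one concrete instance of the "arbitrary" tie-breaking in the definition:

```python
import math

def metrics(y_true, y_pred, p=20):
    """MAE, RMSE, R^2, and turning-point MAE over the top-p% |dy| samples."""
    n = len(y_true)
    err = [abs(a - b) for a, b in zip(y_pred, y_true)]
    mae = sum(err) / n
    rmse = math.sqrt(sum(e * e for e in err) / n)

    ybar = sum(y_true) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - ybar) ** 2 for a in y_true)
    r2 = 1.0 - ss_res / ss_tot

    # Turning set T: top-p% of samples by |y_i - y_{i-1}|, first sample excluded.
    ranked = sorted(range(1, n), key=lambda i: -abs(y_true[i] - y_true[i - 1]))
    k = max(1, round(len(ranked) * p / 100))
    T = ranked[:k]
    mae_turn = sum(err[i] for i in T) / len(T)
    return mae, rmse, r2, mae_turn
```

On a toy series with one large step, `mae_turn` concentrates the error budget on the step sample, while `mae` averages it over the whole series.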

5.3. Baseline Models and Experimental Protocol

5.3.1. Baseline Models

We benchmark CRAFormer against four representative sequence models that span convolutional, recurrent, hybrid, and attention-based paradigms: a convolutional–recurrent hybrid (CNN–LSTM) [5], a gated recurrent unit (GRU) [62], a temporal convolutional network (TCN) [63], and a lightweight Transformer (LiteTransNet) [50]. These baselines (Table 2) are widely used for landslide displacement forecasting due to their ability to capture temporal dependencies and non-linearity across multiple scales.
We perform single-step forecasting ($H = 1$) conditional on next-day rainfall. The exogenous rainfall at $+1$ day uses the co-located gauge's next-day accumulation (an oracle proxy), supplied equally to all models. At time $t$, the models predict $y_{t+1}$ from the history $H_t$ and the exogenous $R_{t+1}$; no other feature uses information beyond $t$. Inputs use a sliding window of length $K = 96$ with stride 1. Accumulations terminate at $t$. Standardization is fit on the training split only and applied to validation/test (including $R_{t+1}$). Missing inputs within the window are imputed with training medians. Samples lacking $R_{t+1}$ are dropped.
Data are split chronologically: the first 70% for training and the final 30% for testing. We grid-search hidden widths $\{32, 64, 128\}$, batch sizes $\{16, 32, 64\}$, and learning rates $\{5 \times 10^{-3}, 10^{-3}, 5 \times 10^{-4}, 10^{-4}\}$. Early stopping monitors the validation MAE (patience = 20, max = 100 epochs). All models use $L_2$ weight decay of $10^{-4}$. Each configuration is run five times with seeds 2021–2025, and we report mean ± std on the test set. All runs are CPU-only with deterministic settings and pinned library versions. Multiday cumulative curves are rolling sums of one-day-ahead predictions and do not affect the $H = 1$ protocol.
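The windowing and leakage controls described above can be sketched as follows. The helper is a simplified, univariate illustration (the real pipeline is multivariate and imputes with training medians), but it shows the two key safeguards: standardization fit on the training split only, and each history window ending at $t$ paired with the exogenous $R_{t+1}$:

```python
def make_windows(series, exog_next, K=96, horizon=1, train_frac=0.7):
    """Chronological split, train-only standardization, and length-K windows.

    Each sample pairs a history window ending at t with the exogenous
    next-day rainfall R_{t+1}; the target is y_{t+1}.
    """
    n = len(series)
    n_train = int(n * train_frac)

    # Fit standardization on the training split only, then apply everywhere.
    train = series[:n_train]
    mu = sum(train) / len(train)
    sd = (sum((v - mu) ** 2 for v in train) / len(train)) ** 0.5 or 1.0
    z = [(v - mu) / sd for v in series]

    samples = []
    for t in range(K - 1, n - horizon):
        window = z[t - K + 1 : t + 1]    # history H_t, ends at t (no lookahead)
        target = z[t + horizon]          # y_{t+1}
        r_next = exog_next[t + horizon]  # exogenous R_{t+1}, the only future input
        split = "train" if (t + horizon) < n_train else "test"
        samples.append((window, r_next, target, split))
    return samples
```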

5.3.2. Ablation Models

We quantify the contribution of each component using four ablations:
  • MLP (Correlational), a purely correlational baseline with no structure prior and no exogenous inputs.
  • MLP_Rain (Exogenous Evidence), which adds next-day rainfall R t + 1 as a leakage-free exogenous channel at prediction time while leaving the model unstructured to isolate the effect of evidence.
  • MLP_Granger (Linear CI), which serves as a weak causal baseline that uses linear Granger or partial-correlation screening as a simple conditional-independence filter for feature selection.
  • MLP_DLCG (DBN-Style Structure Prior), which applies the learned DLCG and time-consistent structure masks that encode parents, ancestors, and collider visibility under non-anticipativity. The ICS exogenous tail is removed to test the structure prior alone.
Architectural details are summarized in Table 3.

5.4. Causal Graph and Role Mask Analysis

Across both sites, the DLCGs summarize causal reachability (Figure 8 and Figure 9). We interpret these graphs as data-driven candidate pathways that guide the role masks and branch structure rather than an exhaustive set of physically verified mechanisms. Role masks convert these graphs into predictor visibility: DCSs retain one-hop parents; CCSs retain the collider and spouse sets; ICSs retain multistep mediators; and SCSs aggregate the remaining variables (Figure 10 and Figure 11).
At LaMenTun (Figure 8 and Figure 10), the DLCG is dense, and hydrometeorological influence arrives mainly through multistep paths. DCSs are sparse, ICSs are extensive, and many variables fall into the SCS category. Station-level patterns are consistent: GPS01 has few direct parents, and influence is routed mostly through ICSs, with ES dominant during quiet periods. GPS03 retains key direct drivers in DCSs and adds soil thermal mediators in ICSs; the Top-2 gate often pairs ES or DCSs with ICSs during wet spells. GPS04 shows strong self-feedback and weak exogenous forcing, with a large SCS set. LF01 combines a small DCS with ICS mediators, consistent with episodic rainfall-sensitive widening.
At BaYiTun (Figure 9 and Figure 11), the DLCG is compact and centered on rainfall accumulation and shallow soil moisture. DCSs concentrate these short-lag parents, ICSs are small, and many temperature nodes are assigned to SCSs. Station-level patterns are consistent with this focus: GPS01 shows limited exogenous forcing and predominantly SCSs, consistent with stable behavior. GPS02 contains focused DCS parents from accumulation and shallow moisture, together with a few ICS mediators; short-lag hydrologic control strengthens during wet periods, whereas ES dominates otherwise. GPS03 shows the strongest inbound influence; DCSs retain shallow moisture and accumulation parents, ICSs add a few hydrothermal mediators, and the Top-2 gate assigns higher weight to ICSs around displacement jumps.

5.5. Causal Relationship Analysis

5.5.1. Diagnostics of Short-Lag Hydrologic Controls

In this section, we test whether the short-lag hydrologic pathways suggested by the causal graphs are supported by the data. Descriptive wavelet analysis indicates short-lag hydrometeorological influence at both sites, whereas predictive tests with BY-FDR control highlight site-specific mechanisms.
At LaMenTun, station behavior is highly heterogeneous: GPS03 shows step-wise jumps superimposed on slow creep. LF01 widens in step-wise episodes that broadly track the main GNSS steps. Wavelet diagnostics for GPS03 (Figure 12) show short-period coherence after 2023 with rightward and upward phase arrows, indicating that rainfall precedes displacement by 1–5 days. For LF01, XWT and WTC reveal a seasonal band with soil temperature and patchy short-period coherence with cumulative rainfall that strengthens in late 2024 to 2025, again with drivers leading at short lags (Figure A5).
Predictive tests at LaMenTun are more conclusive: After BY-FDR correction across predictors and lags, the q-value panel shows a persistent low-q band for daily rainfall at short lags, whereas accumulated rainfall and most other variables are not significant (Figure 12 and Figure 13). Thus, near-lag daily rainfall is the primary predictor despite the broader candidates suggested by the wavelet analysis.
At BaYiTun, step-wise deformation is also observed. Descriptive wavelet coherence (WTC) and cross-wavelet transform (XWT) for GPS03 indicate compact short-period coherence with rainfall around displacement jumps (Figure 14), and predictive tests sharpen this picture: F-statistics peak for shallow soil moisture (HS01–HS04) at short-to-intermediate lags, and BY-FDR control yields a contiguous low-q band from 2 to 10 days. Daily rainfall and temperature are mostly not significant after correction (Figure 15).
From a mechanistic viewpoint, we distinguish DLCG pathways by confidence level. The highest-confidence edges are the short-lag links from daily rainfall (at LaMenTun) and shallow soil moisture HS01–HS04 (at BaYiTun) to displacement, together with ES lags. These are the only hydrologic drivers that receive consistent support from the wavelet and BY–FDR-controlled Granger diagnostics shown in Appendix B.3, and they align with the conceptual picture of infiltration, transient pore pressure build-up, and rate-dependent shear along the soft-over-hard contacts described in Section 4.1 and Section 4.2, as well as with the rainfall-stratified gate behavior described in Table 4. By contrast, DLCG edges involving soil temperature, deeper moisture probes, or collider/spouse relations among environmental variables are interpreted mainly as statistical proxies for shared seasonal forcing, the drainage state, or sensor co-location rather than as direct mechanical couplings. We therefore regard the former group as mechanistically interpretable and empirically supported candidates for short-lag driver–response pathways, whereas the latter are treated as statistical proxies that motivate their inclusion in the structural prior but are not interpreted as established physical couplings.

5.5.2. Rainfall-Stratified Gate Behavior

We assess CRAFormer’s routing at the dataset level by aggregating pre-truncation gate probabilities $\{\pi_{i,j}\}_{j \in \mathcal{J}}$ on the validation sets and stratifying them by 7-day accumulated rainfall $R_7$ (Dry/Moderate/Wet/Very Wet; same bins as Figure 16 and Figure 17). Table 4 reports, for each station and bin, the mean ES/DCS/ICS gate weights, the mean Top-2 mass $\pi_{i,(1)} + \pi_{i,(2)}$, and the mean entropy $H(\pi)$.
Across all stations, ES dominates in Dry/Moderate regimes ($\pi_{\mathrm{ES}} \ge 0.6$, $\pi_{\mathrm{ICS}} \approx 0.05$–$0.12$), while ICS weights increase markedly with rainfall and become comparable to ES in Very Wet regimes (e.g., BaYiTun_gps03: $\pi_{\mathrm{ES}} = 0.28$, $\pi_{\mathrm{ICS}} = 0.37$). This shows that the model systematically up-weights ICSs under strong hydrometeorological forcing rather than relying on anecdotal routing. The mean Top-2 mass is consistently high (about 0.79–0.88), and $H(\pi)$ remains well below the five-way maximum $\log 5 \approx 1.61$. Thus, the gate distribution is typically sharp and concentrates on at most two roles, so that, together with the Top-2 truncation (Section 3.3.1), CRAFormer does not average over all branches even when multiple roles are important, avoiding over-smoothing while adapting its routing to rainfall intensity.

5.6. Cumulative Displacement Prediction

5.6.1. Comparative Evaluation of Baseline Models

For a comprehensive comparison, we evaluate six monitoring points: GPS01/GPS03/GPS04 and the crack meter LF01 at LaMenTun and GPS02/GPS03 at BaYiTun. We report the MAE, RMSE, R 2 , and turning-sample error MAE turn , defined as the mean absolute error over the top 20% of observed samples with the largest amplitude changes (treated as critical turning points). To assess whether the improvements of CRAFormer over the baselines are statistically significant, we form a time series of daily absolute errors for each station and model and apply a paired two-sided Wilcoxon signed-rank test between CRAFormer and the strongest non-proposed baseline, controlling multiplicity across stations and metrics with the Benjamini–Yekutieli FDR (BY–FDR) at a nominal level of 0.01. We also conduct a qualitative assessment based on daily predictions for the most recent month (Figure 18 and Figure 19).
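The Benjamini–Yekutieli step-up procedure used here for multiplicity control admits a compact reference implementation. The function below is a generic sketch, not the authors' code: it takes the per-test p-values (e.g., from paired Wilcoxon tests across stations and metrics) and returns reject/accept flags at a nominal FDR level:

```python
def by_fdr(pvals, alpha=0.01):
    """Benjamini-Yekutieli step-up: one reject/accept flag per p-value.

    Valid under arbitrary dependence thanks to the harmonic correction c(n).
    """
    n = len(pvals)
    c_n = sum(1.0 / k for k in range(1, n + 1))          # harmonic factor
    order = sorted(range(n), key=lambda i: pvals[i])      # ascending p-values

    # Find the largest rank k whose p-value clears the BY threshold.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / (n * c_n):
            k_max = rank

    # Reject all hypotheses up to and including rank k_max.
    reject = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

For four tests at alpha = 0.01, the rank-1 threshold is 0.01/(4 · c(4)) ≈ 0.0012, so only very small p-values survive, which is the conservative behavior intended for dependent test statistics.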
As shown in Figure 18 and Figure 19, CRAFormer exhibits more stable phase alignment and more faithful amplitude reproduction than the baselines across both landslides. Taking LaMenTun as an example, GPS01, GPS03, and GPS04 display a compound pattern in late June, namely, an overall upward trend superimposed with high-frequency disturbances. CRAFormer remains in phase with the observations at both peaks and troughs and accurately captures the onset and termination of sharp rises. LiteTransNet fits the slowly increasing long-term trend reasonably well but shows a clear response lag and amplitude compression during the rapid upsurge on 21–27 June. Meanwhile, GRU tends to overshoot near extrema, and TCN commonly suffers from phase lag. For LF01, where crack displacement is dominated by mid-to-high-frequency, small-amplitude oscillations, CRAFormer responds more sensitively to short-period disturbances, with inflection points nearly coincident with the observations. By contrast, CNN–LSTM and TCN favor noise smoothing, partially weakening the depiction of peaks and valleys. At BaYiTun, GPS02 and GPS03 present the typical superposition of a low-frequency trend and high-frequency fluctuations. CRAFormer reliably identifies the turning points during the mid-month drop-and-rebound, whereas the other baselines exhibit varying degrees of amplitude compression and temporal misalignment in response to sudden disturbances (Figure 19).
The quantitative results are consistent with the observations in Table 5. Across all stations, CRAFormer attains the lowest MAE, RMSE, and turning-point errors MAE turn 10 , MAE turn 20 , and MAE turn 30 , indicating that its advantage at rapid transitions is robust to the choice of percentile threshold. Focusing on the primary turning-point metric MAE turn 20 , CRAFormer reduces the MAE and RMSE by approximately 60–70% and turning-point errors by about 70–80% relative to the best non-proposed baseline at both landslides, while maintaining a high R 2 of 0.97–0.99. At LaMenTun, for example, CRAFormer consistently lowers the MAE turn 20 across GPS01, GPS03, GPS04, and LF01, despite their mixed trend-plus-disturbance patterns. Similar relative gains are observed at BaYiTun for GPS02 and GPS03. Taken together, these findings show that CRAFormer robustly improves both overall errors and turning-point errors under varying geological settings and operational disturbances, and the same conclusion holds under the more stringent MAE turn 10 and more inclusive MAE turn 30 thresholds shown in Table 5. Paired Wilcoxon tests on daily absolute errors further indicate that, at all six stations, the improvements of CRAFormer over the strongest non-proposed baseline are statistically significant for the MAE and MAE turn 20 after BY–FDR adjustment at the 0.01 level.

5.6.2. Ablation Analysis

To quantify the contribution of prior knowledge and variable screening, we evaluate four MLP-based variants: plain MLP, MLP_Rain (rainfall shifted by one day), MLP_DLCG (DBN-style dependency prior), and MLP_Granger (linear Granger causality screening). We compare them with CRAFormer on the same six stations (Figure 20 and Figure 21), following the baseline protocol and reporting the MAE, RMSE, $R^2$, and $\mathrm{MAE}_{\mathrm{turn}}$ computed over the top 20% of turning samples.
The overlays for the last month (Figure 20 and Figure 21) show that all MLP variants capture parts of the low-frequency trend but fail to track sharp transitions reliably. At LaMenTun (GPS01, GPS03, and GPS04), CRAFormer remains phase-coherent with the observations, reproducing the late-June surge and intermediate fluctuations, whereas MLP and MLP_Rain exhibit lag and amplitude compression around the surge and MLP_DLCG oversmooths peaks. MLP_Granger reduces spurious variability but still misplaces several turning points. For LF01, which is dominated by mid-to-high-frequency, small-amplitude oscillations, CRAFormer better tracks short cycles and inflection points, whereas the MLP-based models either attenuate extremes or introduce phase slippage. Similar patterns occur at BaYiTun: during the mid-month drop and rebound, CRAFormer aligns the timing and magnitude of peaks and troughs more closely, whereas the MLP variants display varying degrees of lag or overshoot.
These qualitative patterns align with the station-wise metrics in Table 6. Across all LaMenTun stations, CRAFormer reduces the MAE and RMSE by roughly 60–75% and lowers the MAE turn 20 by about 70–85% relative to the strongest MLP variants, while keeping R 2 in the 0.98–0.99 range. Similar margins are observed at BaYiTun, where CRAFormer consistently outperforms MLP, MLP_Rain, MLP_DLCG, and MLP_Granger on both overall errors and turning-point metrics. Importantly, CRAFormer attains the lowest turning-point errors at all three percentile levels ( MAE turn 10 , MAE turn 20 , and MAE turn 30 ), indicating that the gains are not an artifact of a particular threshold choice. Consistent with the baseline comparison, paired Wilcoxon tests on daily absolute errors show that CRAFormer also significantly outperforms the best-performing MLP variant on MAE and MAE turn 20 at all stations after BY–FDR correction at the 0.01 level.
These results indicate that the Rain(+1) variant MLP_Rain, which appends next-day accumulated rainfall $R_{t+1}$ as an exogenous input, benefits trend-dominated regimes (e.g., GPS01) but degrades performance under non-linear or non-stationary rainfall–displacement couplings (e.g., GPS04, LF01), leading to lag and amplitude compression near sharp transitions. The DLCG stabilizes the MLP and offers modest gains under noise (LF01, GPS03), yet turning-point errors remain elevated because cross-scale interactions are not modeled explicitly. Granger screening prunes redundant drivers and can reduce variance (GPS03), but, owing to its linearity, it cannot correct phase shifts or capture abrupt non-linear changes, leaving timing mismatches at peaks and troughs. Overall, explicit exogenous cues and structural or linear priors provide targeted but insufficient improvements for mixed trend-plus-disturbance regimes. By contrast, CRAFormer uses cross-scale attention and channel re-weighting to preserve long-term trends while enhancing responsiveness to short-term, high-frequency perturbations. This behavior is consistent with the station-wise reductions in the MAE, the RMSE, and especially the $\mathrm{MAE}_{\mathrm{turn}}^{20}$, and the same pattern is observed at $\mathrm{MAE}_{\mathrm{turn}}^{10}$ and $\mathrm{MAE}_{\mathrm{turn}}^{30}$, as shown in Table 5 and Table 6.

5.6.3. Sensitivity to Rainfall Intensity: Bin-Wise Assessment of Model Errors

Next, we stratify errors by 7-day accumulated rainfall (Figure 16). Across all six stations, CRAFormer attains the lowest MAE in every bin and shows weak sensitivity to rainfall intensity. Its curves remain nearly flat on GPS03, GPS04, and LF01, and rise only mildly in the heaviest rain bin on GPS01 and GPS02. By contrast, the MLP family degrades as rainfall increases. On LaMenTun—GPS01, MLP_Rain outperforms plain MLP in the low-to-moderate bins but spikes under the heaviest rain, indicating lag and amplitude compression around rain-driven surges. MLP_DLCG is steadier in the mid-range bins but deteriorates sharply at the highest intensity. MLP_Granger reduces variance but still remains well above CRAFormer. Similar trends hold on GPS03 and GPS04 and at BaYiTun (GPS02 and GPS03), where all MLP variants escalate under heavy rain but CRAFormer remains comparatively stable.
Applying the same stratification to the turning-sample error MAE turn (top 20% by | Δ y | ) yields a consistent picture (Figure 17). CRAFormer maintains uniformly low turning-point errors across bins, even in the heaviest rain bin, indicating strong phase alignment at rapid transitions. In contrast, the MLP variants exhibit pronounced sensitivity: MLP and MLP_Rain surge in the highest bin on GPS01, GPS02, and GPS03. MLP_DLCG reduces errors at moderate rainfall but cannot prevent sharp increases during extremes. MLP_Granger sometimes lowers variability at intermediate bins yet still misplaces turning points when rainfall intensifies. Taken together, the bin-wise analyses confirm that explicit exogenous cues or linear or structural priors yield limited resilience to hydrometeorological extremes, whereas CRAFormer sustains both trend accuracy and turning-point fidelity across rainfall regimes.

5.6.4. Sensitivity of the ICS Branch to 24 h Rainfall Forecast Uncertainty

In CRAFormer, the ICS branch ingests the next-day rainfall R t + 1 via an exogenous tail token, which is set to the realized gauge accumulation in the main experiments. This represents an optimistic “oracle” setting, whereas real early-warning operations can only access numerical weather prediction (NWP) forecasts with random and systematic errors. To assess how such forecast-like uncertainty affects CRAFormer, we design a stress test in which the oracle rainfall is replaced by perturbed 24 h forecasts and evaluate the resulting change in predictive accuracy.
We use the metrics described in Section 5.2 (MAE, RMSE, $R^2$, and the turning-point error $\mathrm{MAE}_{\mathrm{turn}}$ on the top 20% of samples ranked by $|\Delta y|$). Within each station, CRAFormer under oracle rainfall serves as the reference. For each NWP-like scenario, we report the relative changes as follows:
$$\Delta \mathrm{MAE}\,(\%) = 100 \times \frac{\mathrm{MAE}_{\mathrm{scenario}} - \mathrm{MAE}_{\mathrm{CRAFormer}}}{\mathrm{MAE}_{\mathrm{CRAFormer}}},$$
$$\Delta \mathrm{MAE}_{\mathrm{turn}}\,(\%) = 100 \times \frac{\mathrm{MAE}_{\mathrm{turn,scenario}} - \mathrm{MAE}_{\mathrm{turn,CRAFormer}}}{\mathrm{MAE}_{\mathrm{turn,CRAFormer}}},$$
where negative values indicate improvement over the oracle baseline.
To isolate the effect of rainfall forecast uncertainty on the ICS branch, we fix the displacement and driver histories for each station and re-evaluate CRAFormer on the last-month test subset under four rainfall scenarios: (i) CRAFormer (oracle), using the realized gauge rainfall R t + 1 ; (ii) NWP-mild, representing a high-quality 24 h forecast with small bias and variance; (iii) NWP-typical, representing median 24 h forecast skill; and (iv) NWP-poor, representing degraded 24 h forecasts under convective or rapidly evolving conditions [64,65]. The NWP-like scenarios are constructed by injecting multiplicative bias and additive Gaussian noise into the oracle rainfall, following simple perturbation strategies used in statistical post-processing of NWP precipitation forecasts [66]. The perturbation parameters are chosen to be broadly consistent with reported 24 h QPF error ranges for subtropical monsoon regions. In all scenarios, only the exogenous next-day rainfall fed to the ICS tail token is perturbed; historical drivers and model parameters remain unchanged.
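The perturbation scheme can be sketched as multiplicative bias plus additive Gaussian noise with clipping at zero. The scenario parameters below are illustrative placeholders, not the calibrated values used for Table 7:

```python
import random

def perturb_rainfall(r_oracle, bias=1.0, sigma=0.0, seed=0):
    """Multiplicative bias plus additive Gaussian noise on oracle rainfall.

    Rainfall cannot be negative, so perturbed values are clipped at zero.
    """
    rng = random.Random(seed)
    return [max(0.0, bias * r + rng.gauss(0.0, sigma)) for r in r_oracle]

# Illustrative scenario settings (bias, noise std in mm/day); these stand in
# for the mild / typical / poor 24 h forecast regimes and are NOT the
# paper's calibrated parameters.
scenarios = {
    "oracle":      dict(bias=1.00, sigma=0.0),
    "NWP-mild":    dict(bias=1.05, sigma=2.0),
    "NWP-typical": dict(bias=1.15, sigma=5.0),
    "NWP-poor":    dict(bias=1.30, sigma=12.0),
}
```

Only the exogenous $R_{t+1}$ fed to the ICS tail token would be replaced by `perturb_rainfall(...)`; histories and model parameters stay fixed, matching the stress-test design above.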
Table 7 confirms that oracle gauge rainfall is an optimistic reference and that 24 h forecast uncertainty affects CRAFormer in a site-dependent way. At LaMenTun—GPS01, GPS03, and GPS04, replacing oracle $R_{t+1}$ with NWP-like inputs typically increases the MAE by roughly 10–50% and reduces the $R^2$ to about 0.91–0.97, indicating that forecast noise injected through the ICS tail can weaken trend fitting when daily rainfall plays a strong causal role. The crack meter LF01 is an extreme case: in the NWP-poor scenario, the MAE more than triples and $\mathrm{MAE}_{\mathrm{turn}}$ increases by almost an order of magnitude, suggesting that highly biased or noisy forecasts may trigger spurious ICS responses at high-frequency, noise-dominated sensors. In such regimes, CRAFormer should be combined with forecast quality screening or explicit down-weighting of the ICS gate. In contrast, BaYiTun stations exhibit weaker sensitivity: at BaYiTun—GPS02, all NWP scenarios slightly reduce the MAE and strongly reduce the $\mathrm{MAE}_{\mathrm{turn}}$ while maintaining a high $R^2$, consistent with shallow soil moisture rather than daily rainfall being the primary short-lag driver. At BaYiTun—GPS03, forecast perturbations mainly affect the smooth background component, while the timing of sharp displacement changes remains comparatively robust, reflecting the dominant role of short-lag hydrological controls encoded in the historical drivers.

6. Discussion

This study focuses on cross-site differences in the predictability of the rainfall–displacement relationship. We employ a unified diagnostic framework comprising WTC/XWT, predictive Granger testing with BY-FDR correction, and falsification via temporal and rainfall permutations to test our modeling hypotheses, and we cross-validate them using attributions from the model’s gated routing. We specifically examine whether the evidence supports short-lag hydrologic control and use this to explain CRAFormer’s behavior across different deformation regimes.
At LaMenTun (GPS03 as an example), although the wavelet plots show short-period in-phase bands, none of the lags within 1–14 days remains significant after BY-FDR correction, and the falsification tests are also null (Figure A3), indicating a lack of predictive signal (Figure 12). Consistently, the gating weights favor ES/DCSs over time, and ICSs are only briefly activated during wet spells. For LF01, the narrow window (approximately 6–8 days) weakens under the spatial mismatch test, which further supports the conclusion that there is no robust short-lag driver at this site (Figure A6 and Figure A8).
At BaYiTun (GPS03), shallow soil moisture (HS01–HS04) exhibits a stable BY-FDR-significant band over roughly 2–10 days, and this signal retains strength under permutation-based falsification (Figure A4). During step events, WTC/XWT also reveals a hydrology-leads, displacement-lags temporal pattern (Figure 14). In line with this, the model assigns more Top-2 routing mass to ICSs during wet periods while maintaining background constraints from ES/DCSs, and it substantially reduces timing bias and magnitude errors near turning points (Figure 10 and Figure 11).
Taken together, these results substantiate the causal, role-masked design of CRAFormer. Across both landslides, CRAFormer preserves phase and amplitude at turning points and step onsets (Figure 18, Figure 19, Figure 20 and Figure 21), yielding large, station-consistent reductions in MAE, RMSE, and especially MAE_turn (typically 56–86%; Table 5 and Table 6). At LaMenTun, the gains arise from robust tracking of background drift when ES/DCS processes dominate. At BaYiTun, the Top-2 gate up-weights ICSs during jump episodes while remaining coupled to ES/DCSs, thereby curbing timing bias and overshoot. These patterns are consistent with the rainfall-stratified analyses indicating CRAFormer’s low sensitivity under heavy rain (Figure 16 and Figure 17). In our experiments, the rainfall-triggered monotonicity prior sharpens timing and magnitude at wet turning points without introducing visible artifacts during post-event relaxation, consistent with its design as a soft, regime-specific regularizer rather than a hard global constraint.
Beyond predictive skill, the diagnostics also inform the working assumptions behind the DLCG described in Section 3.1. Wavelet coherence and Granger panels (Section 5.5) indicate that only a small set of short-lag hydrologic drivers has substantial predictive power across seasons and step-like episodes. In practice, near-lag rainfall at LaMenTun and shallow soil moisture at BaYiTun carry most of the short-lag signal once multiplicity is controlled. The pronounced non-stationarity in displacement levels is largely absorbed by ES self-lags, which is consistent with dominant ES gate weights in Dry and Moderate rainfall bins (Table 4). These patterns support the working assumption that short-lag driver–response dependencies are approximately stable over time, even though the raw displacement series is not.
From a predictive standpoint, the ablation study in Section 5.6.2 and the rainfall-stratified error curves show that removing DLCG-based role masks mainly increases the MAE and MAE_turn without producing unstable or clearly spurious behavior under heavy rainfall. This suggests that moderate departures from the DLCG assumptions primarily reduce the incremental benefit of the structural prior rather than materially degrading forecast robustness. Together with the role stability check in Appendix A.2, this indicates that sampling variability in the DLCG structure has limited impact on the effective set of causal roles used by the Top-2 gate. At the level of individual edges, we interpret only the short-lag rainfall links, the shallow soil–moisture links, and the ES self-lags as the best-supported mechanistic candidates, and treat the remaining DLCG connections mainly as a statistical structure in the prior rather than as confirmed physical couplings (Section 5.5).
CRAFormer was designed for practicality, with CPU-only training, single-head attention, and small branches without prior signal decomposition, which simplifies maintenance and curbs error propagation. When archived forecasts are unavailable, we use realized next-day rainfall as a proxy. Reported scores therefore characterize performance under accurate rainfall inputs, and real-time deployments will scale with forecast quality. Although we have not yet reported explicit hit, miss, or false-alarm rates, the systematic reductions in MAE, RMSE, and especially MAE_turn suggest fewer and smaller timing errors near rapid displacement surges, which is directly relevant for threshold-based early-warning schemes. These improvements document robustness across two contrasting deformation regimes within a shared humid, subtropical setting in Guangxi, but extension to landslides in other climatic or geomorphological environments will require additional validation and possibly modest adaptation of the causal driver set.
Our co-variates include daily and cumulative rainfall, soil temperature, and soil moisture. Pore pressure, water level, and inter-station geometry were not available, so we trained CRAFormer per station, which may under-represent cross-slope propagation and shared hydrologic forcing. Incorporating pore pressure and water level records as short-lag drivers, together with inter-station geometry and shared hydrologic pathways in a spatially coupled DLCG (e.g., a graph over GNSS and hydrologic sensors), should sharpen the timing and magnitude of predicted accelerations, reduce reliance on displacement self-history, and better distinguish genuine slope responses from sensor noise, particularly for deep-seated or reservoir-affected landslides. In future work, we will extend CRAFormer from single-station to multistation configurations that learn a joint DLCG over sensor networks and ingest real-time spatial information from dense GNSS arrays and, where available, InSAR-derived deformation fields.

7. Conclusions

This work introduced CRAFormer, a compact causality- and physics-aware predictor that replaces signal decomposition with DLCG-based role masks and a Top-2 gating mechanism for fusion. The ICS branch ingests next-day rainfall through a leakage-free tail with a stop-gradient (detach) operation, non-negative mixing, and a monotonicity prior, yielding forecast-compatible yet non-anticipative behavior. Under oracle conditioning, where the same observed R_{t+1} is provided to all models, CRAFormer consistently outperformed the baselines across two rainfall-triggered landslides in Guangxi that share a humid, subtropical, monsoon-dominated climate but differ markedly in their deformation regimes. It achieved the largest gains in high-rainfall bins and at turning points, reducing timing bias as well as overshoot and undershoot. Ablations showed incremental gains from MLP_Rain and MLP_DLCG, yet CRAFormer remained superior. Causal diagnostics aligned with the learned routing: LaMenTun showed no short-lag drivers that were significant under the BY–FDR procedure, whereas BaYiTun exhibited a robust 2–10-day shallow moisture lag band. The model is lightweight (single-head attention, CPU-friendly) and interpretable via sample-wise gates. The present evidence is restricted to two rainfall-triggered landslides in Guangxi. It is further limited by the absence of pore pressure and water level data, the lack of explicit spatial coupling, the use of oracle conditioning, and an evaluation focused on continuous regression and turning-point errors rather than explicit hit, miss, and false-alarm rates under operational thresholds.
We expect that incorporating pore pressure and water level observations as non-anticipative exogenous drivers, learning spatially coupled DLCGs over sensor networks, and using operational forecast products together with threshold-based and probabilistic early-warning metrics will strengthen the physical interpretability of the inferred pathways, further reduce timing errors near rapid accelerations, and broaden applicability to landslides with stronger hydrogeologic control and more complex internal kinematics.

Author Contributions

Conceptualization, F.Z.; data curation, X.L.; formal analysis, F.Z. and S.L.; funding acquisition, Y.J. and X.S.; investigation, X.J.; methodology, F.Z. and S.L.; project administration, X.L. and X.S.; resources, Y.J.; supervision, X.S.; validation, S.L.; visualization, F.Z., S.R. and X.J.; writing (original draft), F.Z.; writing (review and editing), Y.J., S.L., S.R., Z.L. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Science and Technology Plan Project (grant nos. Guike AA23062038 and Guike AA24206043), the National Natural Science Foundation of China (grant nos. U23A20280, 62161007, and 62471153), the Graduate Innovation Project of Guilin University of Electronic Technology (grant no. YCBZ2024162), the 2020 Guangxi University Middle-Aged and Young Teachers’ Scientific Research Basic Competency Improvement Project (2020KY21024), and the Guangxi Autonomous Region Major Talent Project, and supported by the Joint International Research Laboratory of Spatio-Temporal Information and Intelligent Location Services.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it did not involve human participants, human biological materials, or identifiable personal data.

Informed Consent Statement

Informed consent was not required because the study did not involve human participants or identifiable personal data.

Data Availability Statement

The datasets presented in this article are not readily available because they were provided by government departments and contain sensitive geospatial information. Requests to access the datasets should be directed to the first author.

Acknowledgments

The authors would like to thank the Guangxi Higher Education Institutions Engineering Research Center for BeiDou Positioning Services and Border-Coastal Defense Security Applications; the China-ASEAN Joint International Cooperation Laboratory for Spatio-Temporal Information and Intelligent Location-Based Service; the International Joint Research Laboratory of Spatio-Temporal Information and Intelligent Location Services; and the Guangxi Zhuang Autonomous Region Geological Environment Monitoring Station for their valuable support and contributions to this research.

Conflicts of Interest

Author Xiaoming Liu was employed by the Guangxi Zhuang Autonomous Region Geological Environment Monitoring Station. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

Appendix A

Scope: This appendix complements the main text by detailing the DLCG procedure and its hyperparameters.

Appendix A.1. DLCG Hyperparameters and Selection Thresholds

We learn the DLCG on historical variables only; future rainfall nodes within the prediction horizon are excluded to avoid anticipative edges and are later handled as exogenous inputs in the ICS tail (see Section 3.3.1). Unless otherwise noted, we use the following procedure: To account for serial dependence, we apply a stationary bootstrap with B resamples and an expected block length of 7 days. For each resample, we run a constraint-based structure learner with HSIC/KCI as the primary conditional independence (CI) test (Gaussian kernel with median heuristic bandwidth), compute p-values by block permutation, and control false discoveries within each conditioning order using BY–FDR at level α = 0.05. For low-order conditioning sets (|Z| ≤ 2) that pass linearity and normality diagnostics, we fall back to the Gaussian partial correlation (Fisher–z) CI test. The pruned skeleton is then oriented via v-structures and Meek rules under hard temporal constraints.
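The resampling step can be illustrated with a minimal stationary-bootstrap sketch in the style of Politis and Romano, where block lengths are geometric with the stated 7-day mean. This is a generic illustration of the technique, not the study's implementation; `stationary_bootstrap` is an illustrative name.

```python
import numpy as np

def stationary_bootstrap(x, expected_block=7, rng=None):
    """One stationary-bootstrap resample of a (possibly multivariate) series.

    Blocks start at uniformly random positions and have geometric lengths
    with mean `expected_block`, preserving short-range serial dependence.
    x : array of shape (T,) or (T, d)
    """
    rng = rng or np.random.default_rng()
    T = len(x)
    p = 1.0 / expected_block          # probability of starting a new block
    idx = np.empty(T, dtype=int)
    idx[0] = rng.integers(T)
    for t in range(1, T):
        if rng.random() < p:          # start a new block at a random position
            idx[t] = rng.integers(T)
        else:                         # continue the current block (wrap around)
            idx[t] = (idx[t - 1] + 1) % T
    return x[idx]
```

The structure learner would then be re-run on each such resample, with edge selection rates aggregated across the B resamples as described next.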
Across bootstrap replications, we record forward and reverse directional selection rates for each ordered pair (u, v) and compute Clopper–Pearson confidence bounds for these binomial proportions. Let L and U denote the lower and upper Clopper–Pearson bounds for the forward (u → v) and reverse (v → u) directions, respectively, and apply the directional selection rule in Equation (4) to obtain the aggregated DLCG. To assess the robustness of this rule, we vary (L, Δ) over {(0.55, 0.25), (0.60, 0.30), (0.65, 0.35)} and summarize the resulting aggregated DLCGs and role masks for each station in Table A1, taking BaYiTun—GPS03 as the representative case. The total edge count and the size of the ICS set change only mildly, while the DCS set for the displacement node is invariant (Jaccard index = 1 for all configurations). In particular, short-lag rainfall and near-surface soil moisture drivers remain in the same direct- or indirect-cause roles, indicating that the causal structure used by CRAFormer is stable under moderate perturbations of the directional thresholds.
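A minimal sketch of this directional rule follows, under the assumption that Equation (4) reduces to the two thresholds listed in Table A3 (forward lower bound L ≥ 0.60 and margin L − U ≥ Δ = 0.30 over the reverse direction); the helper names are illustrative and the bounds are computed exactly via bisection on the binomial tails, using only the standard library.

```python
from math import comb

def _binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def _binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

def _bisect_increasing(f, target, lo=0.0, hi=1.0, iters=60):
    """Solve f(p) = target for an increasing f on [0, 1] by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided Clopper-Pearson bounds for a binomial proportion k/n."""
    lower = 0.0 if k == 0 else _bisect_increasing(lambda p: _binom_sf(k, n, p), alpha / 2)
    upper = 1.0 if k == n else _bisect_increasing(lambda p: 1 - _binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper

def keep_forward_edge(k_fwd, k_rev, n, L_thresh=0.60, delta=0.30):
    """Assumed reading of the directional rule: retain u -> v iff the forward
    lower bound clears L_thresh and exceeds the reverse upper bound by delta."""
    L_fwd, _ = clopper_pearson(k_fwd, n)
    _, U_rev = clopper_pearson(k_rev, n)
    return L_fwd >= L_thresh and (L_fwd - U_rev) >= delta
```

For example, with B = 20 resamples, a pair selected forward in 18 runs and reverse in 1 run would be retained, whereas a 10-vs-8 split would not.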
Table A1. Sensitivity of the aggregated DLCG and role masks under different directional selection thresholds (L, Δ) at the LaMenTun and BaYiTun stations.
Station | Config | Edges | DCS | ICS | Jacc (DCS) | Jacc (ICS)
LaMenTun_gps01 | L = 0.55, Δ = 0.25 | 7 | 1 | 0 | 1.0 | 0.0
LaMenTun_gps01 | L = 0.60, Δ = 0.30 | 7 | 1 | 1 | 1.0 | 1.0
LaMenTun_gps01 | L = 0.65, Δ = 0.35 | 6 | 0 | 0 | 0.0 | 0.0
LaMenTun_gps03 | L = 0.55, Δ = 0.25 | 7 | 0 | 1 | 0.0 | 0.0
LaMenTun_gps03 | L = 0.60, Δ = 0.30 | 7 | 1 | 1 | 1.0 | 1.0
LaMenTun_gps03 | L = 0.65, Δ = 0.35 | 6 | 1 | 0 | 1.0 | 0.0
LaMenTun_gps04 | L = 0.55, Δ = 0.25 | 7 | 0 | 0 | 0.0 | 1.0
LaMenTun_gps04 | L = 0.60, Δ = 0.30 | 7 | 1 | 0 | 1.0 | 1.0
LaMenTun_gps04 | L = 0.65, Δ = 0.35 | 6 | 1 | 0 | 1.0 | 1.0
LaMenTun_lf01 | L = 0.55, Δ = 0.25 | 7 | 0 | 0 | 1.0 | 0.0
LaMenTun_lf01 | L = 0.60, Δ = 0.30 | 7 | 0 | 1 | 1.0 | 1.0
LaMenTun_lf01 | L = 0.65, Δ = 0.35 | 6 | 0 | 0 | 1.0 | 0.0
BaYiTun_gps02 | L = 0.55, Δ = 0.25 | 3 | 0 | 0 | 0.0 | 0.0
BaYiTun_gps02 | L = 0.60, Δ = 0.30 | 3 | 1 | 1 | 1.0 | 1.0
BaYiTun_gps02 | L = 0.65, Δ = 0.35 | 2 | 0 | 0 | 0.0 | 0.0
BaYiTun_gps03 | L = 0.55, Δ = 0.25 | 3 | 1 | 0 | 1.0 | 0.0
BaYiTun_gps03 | L = 0.60, Δ = 0.30 | 3 | 1 | 1 | 1.0 | 1.0
BaYiTun_gps03 | L = 0.65, Δ = 0.35 | 2 | 1 | 1 | 1.0 | 1.0
“Edges” is the total number of directed edges in the aggregated DLCG. “DCS” and “ICS” are the sizes of the direct-cause and indirect-cause sets for the corresponding displacement node. Jaccard indices are computed with respect to the reference configuration (L, Δ) = (0.60, 0.30) at each station.

Appendix A.2. Role Stability Check for DCS/ICS Masks

We recompute the DLCG on B stationary bootstrap subsets at two representative stations (LaMenTun_gps03 and BaYiTun_gps03) and track the role assignments of short-lag hydrologic drivers for the displacement node. Table A2 reports, for each candidate lag, the proportion of runs in which it is assigned to DCSs, ICSs, or SCSs.
Table A2. Stability of DCS/ICS/SCS assignments for short-lag hydrologic drivers of displacement under stationary bootstrap resampling. Entries are the proportions of runs in which each lag appears in the corresponding role.
Station | Driver Lag | DCS | ICS | SCS
LaMenTun_gps03 | rain (t−1) | 0.92 | 0.08 | 0.00
LaMenTun_gps03 | rain (t−2) | 0.05 | 0.81 | 0.14
BaYiTun_gps03 | HS01 (t−3) | 0.88 | 0.12 | 0.00
BaYiTun_gps03 | HS01 (t−5) | 0.07 | 0.76 | 0.17
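The proportions in Table A2 amount to simple tallying of role labels across bootstrap runs; a minimal sketch (`role_proportions` is an illustrative helper, not code from the study):

```python
from collections import Counter

def role_proportions(assignments):
    """Proportion of bootstrap runs assigning a driver lag to each causal role.

    assignments : list of role labels ('DCS', 'ICS', 'SCS'), one per run.
    """
    counts = Counter(assignments)
    n = len(assignments)
    return {role: counts.get(role, 0) / n for role in ("DCS", "ICS", "SCS")}

# Illustrative run, mirroring the rain(t-1) row at LaMenTun_gps03:
runs = ["DCS"] * 92 + ["ICS"] * 8
assert role_proportions(runs) == {"DCS": 0.92, "ICS": 0.08, "SCS": 0.0}
```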

Appendix A.3. DLCG Hyperparameter Summary Table

Table A3 lists all tunable (and fixed) hyperparameters used in the DLCG procedure described in Section 3.1. Unless otherwise stated, these values are held constant across all experiments.
Table A3. Hyperparameters for the DLCG. The CI selection rule uses two fixed thresholds (see Section 3.1).
Parameter | Description | Value/Range
L | Look-back window (lags) | 24–48 (task-specific)
B | # stationary bootstrap resamples | ≥20
block_len | Bootstrap block length (days) | 7
L_col | Collider–spouse mining depth | 2–3
α | BY–FDR level per order | 0.05
E⁺, E⁻ | Soft/hard prior edge sets | Domain-specific
γ | Tie-break exponent for E⁺ | 0.5
(λ⁺, λ⁻) | Soft prior weights | (0.2, 1.0)
CI rule | Edge retention thresholds | L ≥ 0.60, L − U ≥ 0.30

Appendix B. Negative Controls and Best-Lag Maps

Scope: We summarize best-lag (L*) maps and report falsification and negative controls for both sites under four settings: aligned inputs, temporal shifts of +3 and +7 days, and a spatial proxy based on rainfall block permutation. Unless otherwise noted, targets use first-differenced displacement (velocity), input lags span 1 to 14 days, and all q-values are BY-FDR-corrected.

Appendix B.1. Best-Lag Summary Maps (BY–FDR)

For each input→target pair, we take L*, the earliest lag attaining the minimal q-value within the 1–14 day range.
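Because `argmin` over the lag axis returns the first index of the minimum, the earliest-lag convention falls out directly; a minimal sketch with illustrative q-values (not those of the study):

```python
import numpy as np

def best_lag(q, lags=range(1, 15)):
    """Earliest lag L* attaining the minimal q-value for one input->target pair.

    q : 1-D sequence of BY-FDR q-values aligned with `lags` (here lags 1..14).
    np.argmin returns the FIRST index of the minimum, i.e. the earliest lag.
    """
    q = np.asarray(q)
    return list(lags)[int(np.argmin(q))]

# Toy q-value profile: the minimum q = 0.03 first occurs at lag 6.
q_vals = [0.40, 0.35, 0.30, 0.20, 0.10, 0.03, 0.03, 0.05,
          0.12, 0.18, 0.25, 0.30, 0.33, 0.38]
assert best_lag(q_vals) == 6
```

Note that L* is purely descriptive: as Figures A1 and A2 emphasize, it is interpreted only where the corresponding cell is also BY-FDR-significant.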
Figure A1. LaMenTun (GPS03): best-lag map (L* = earliest lag with the minimal q-value). Dark-blue cells mark the selected L* for each input–target pair; white cells indicate non-selected lags. Entries concentrate at 1–2 d; however, none of the associated cells is BY–FDR-significant in Figure 13, so L* is descriptive rather than evidence of a robust causal lead–lag relation.
Figure A2. BaYiTun (GPS03): best-lag map (L* = earliest lag with the minimal q-value). Dark-blue cells mark the selected L* for each input–target pair; white cells indicate non-selected lags. Shallow soil moisture (HS01–HS04) concentrates at L* ≈ 2–3 d and RFAcc near ∼10 d, agreeing with the BY–FDR-significant bands in Figure 15; RFD and temperature yield L* ≈ 1 d but are largely non-significant after correction.

Appendix B.2. Falsification/Negative Control Panels (BY–FDR q-Values)

Panels show BY–FDR q-values for aligned inputs, +3/+7 day temporal shifts, and a rainfall block permutation spatial proxy.
Figure A3. LaMenTun (GPS03): BY–FDR q-value panels for misalignment settings. No cell attains significance under any setting, consistent with Figure 13. (a) Temporal misalignment (+7 d); (b) spatial mismatch proxy (rainfall block permutation).
Figure A4. BaYiTun (GPS03): BY–FDR q-value panels under misalignment settings. A robust HS01–HS04 band persists around 2–10 d; rainfall is non-significant throughout, in agreement with Figure 15. (a) Temporal misalignment (+7 d); (b) spatial mismatch proxy (rainfall block permutation).

Appendix B.3. LaMenTun (LF01): Predictive Granger Panels, Best-Lag Map, and Falsification

Figure A6 summarizes predictive Granger tests for LaMenTun (LF01) using first-differenced displacement (velocity) as the target and input lags of 1 to 14 days. Raw F-statistics show mild short-lag peaks for several inputs (e.g., at 1 to 2 days), but, after BY-FDR control across all predictors, a narrow significant window remains, primarily for accumulated rainfall (RFAcc) at approximately 6 to 8 days. Most other predictors are not significant after multiplicity correction.
Figure A5. LaMenTun (LF01): XWT/WTC between displacement and rainfall drivers. Colors denote magnitude (XWT: cross-wavelet power; WTC: coherence from 0 to 1), arrows indicate relative phase (lead/lag), and shading marks the cone of influence.
Figure A6. LaMenTun (LF01): predictive Granger results for displacement velocity (first differences). Short-lag F peaks appear widely, but, after BY–FDR, only RFAcc exhibits a narrow significant window at ∼6–8 d. (a) Granger strength (F-statistic); (b) BY–FDR-corrected q-values.
To compactly characterize the lag structure, we construct a best-lag map that selects, for each input→target pair, the earliest lag L* achieving the minimum q-value (Figure A7). At LF01, L* values concentrate at 1 to 2 days for a subset of variables, but these cells are not significant under the BY-FDR correction in Figure A6b and should therefore be regarded as descriptive only. In contrast, RFAcc achieves L* at approximately 6 to 7 days and aligns with the FDR-significant trough, suggesting a modest hydrologic memory at intermediate lags that is specific to accumulated rainfall at this site.
Figure A7. LaMenTun (LF01): best-lag map (L* = earliest lag with the minimal q-value). Dark-blue cells mark the selected L* for each input–target pair; white cells indicate non-selected lags. Many entries fall at 1–2 d but are largely non-significant after BY–FDR; RFAcc yields L* ≈ 6–7 d, consistent with Figure A6b.
We further evaluate a falsification setting using a spatial mismatch proxy based on rainfall block permutation. As shown in Figure A8, the low-q window for RFAcc is attenuated under the proxy, and no new significant bands emerge. This suggests that the predictive signal is sensitive to proper spatial alignment of rainfall drivers and argues against spurious coupling between rainfall and displacement signals.
Figure A8. LaMenTun (LF01): BY–FDR q-value panels for misalignment settings. No cell attains significance under any setting, consistent with Figure 13. (a) Temporal misalignment (+7 d); (b) spatial mismatch proxy (rainfall block permutation).

Appendix B.4. Sensitivity to the Stationary Bootstrap Block Length

In the main experiments, we set the expected block length of the stationary bootstrap (and block permutation) to 7 days. This choice roughly matches the decorrelation time of daily rainfall and displacement at our sites, and the typical duration of storm-driven pore pressure responses, so it preserves short-term serial dependence without making the effective sample size too small.
To assess robustness, we re-run the DLCG for all six station–target pairs with block lengths b ∈ {3, 7, 14} days and summarize the total number of edges |E| and the sizes of the direct-cause and indirect-cause sets, |DCS| and |ICS| (Figure A9). Across sites, |E| varies by at most 2–3 edges and the role sizes change by at most one node when moving away from 7 days. The only visible difference is that BaYiTun—GPS02/03 gains or loses a single shallow soil moisture parent, while the remaining structure is unchanged. Overall, the key hydrological drivers and the DCS/ICS masks are stable, and the 7-day block length used in the main text is therefore a reasonable and robust choice.
Figure A9. Sensitivity of the DLCG structure to the stationary bootstrap block length. (a) Total number of edges |E| for block lengths b ∈ {3, 7, 14} days; bar colors (and hatching where applicable) indicate different block lengths. (b) Size of the direct-cause set |DCS|; cell color intensity encodes the magnitude (lighter = smaller, darker = larger), and the overlaid numbers give the exact values. (c) Size of the indirect-cause set |ICS|, shown in the same manner.

References

  1. Huggel, C.; Clague, J.J.; Korup, O. Is climate change responsible for changing landslide activity in high mountains? Earth Surf. Process. Landforms 2012, 37, 77–91. [Google Scholar] [CrossRef]
  2. Wang, X.; Wang, Y.; Lin, Q.; Yang, X. Assessing Global Landslide Casualty Risk Under Moderate Climate Change Based on Multiple GCM Projections. Int. J. Disaster Risk Sci. 2023, 14, 751–767. [Google Scholar] [CrossRef]
  3. Huang, F.; Huang, J.; Jiang, S.; Zhou, C. Landslide displacement prediction based on multivariate chaotic model and extreme learning machine. Eng. Geol. 2017, 218, 173–186. [Google Scholar] [CrossRef]
  4. Lan, H.; Zhao, Z.; Li, L.; Li, J.; Fu, B.; Tian, N.; Clague, J.J. Climate change drives flooding risk increases in the Yellow River Basin. Geogr. Sustain. 2024, 5, 193–199. [Google Scholar] [CrossRef]
  5. Nava, L.; Carraro, E.; Reyes-Carmona, C.; Puliero, S.; Bhuyan, K.; Rosi, A.; Monserrat, O.; Floris, M.; Meena, S.; Galve, J.; et al. Landslide displacement forecasting using deep learning and monitoring data across selected sites. Landslides 2023, 20, 2111–2129. [Google Scholar] [CrossRef]
  6. Zhao, Q.; Wang, H.; Zhou, H.; Zhang, T.; Liu, Y. An interpretable and high-precision method for predicting landslide displacement using evolutionary attention mechanism. Nat. Hazards 2024, 120, 11943–11967. [Google Scholar] [CrossRef]
  7. Meng, Y.; Qin, Y.; Cai, Z.; Zhang, Y.; Zhang, Y. Correction to: Dynamic forecast model for landslide displacement with steplike deformation by applying GRU with EMD and error correction. Bull. Eng. Geol. Environ. 2023, 82, 211. [Google Scholar] [CrossRef]
  8. Liu, Y.; Teza, G.; Nava, L.; Chang, Z.; Shang, M.; Xiong, D.; Cola, S. Deformation evaluation and displacement forecasting of Baishuihe landslide after stabilization based on continuous wavelet transform and deep learning. Nat. Hazards 2024, 120, 9649–9673. [Google Scholar] [CrossRef]
  9. Meng, S.; Shi, Z.; Peng, M.; Yang, J.; Wang, Z. Landslide displacement prediction with step-like curve based on convolutional neural network coupled with bi-directional gated recurrent unit optimized by attention mechanism. Eng. Appl. Artif. Intell. 2024, 133, 108078. [Google Scholar] [CrossRef]
  10. Martelloni, G.; Segoni, S.; Fanti, R.; Catani, F. Rainfall thresholds for the forecasting of landslide occurrence at regional scale. Landslides 2012, 9, 485–495. [Google Scholar] [CrossRef]
  11. Ma, Z.; Mei, G. Forecasting landslide deformation by integrating domain knowledge into interpretable deep learning considering spatiotemporal correlations. J. Rock Mech. Geotech. Eng. 2025. Online ahead of print. [Google Scholar] [CrossRef]
  12. Wang, S.; Zhang, K.; van Beek, L.P.H.; Tian, X.; Bogaard, T.A. Physically-based landslide prediction over a large region: Scaling low-resolution hydrological model results for high-resolution slope stability assessment. Environ. Model. Softw. 2020, 124, 104607. [Google Scholar] [CrossRef]
  13. Bao, H.; Ji, C.; Lan, H.; Zheng, H.; Yan, C.; Peng, J.; Li, L.; Wang, J.; Guo, G. Slope Effects on Soil Moisture Migration and Evolution in Shallow Layers of Loess High-Fill Slopes in the Gully Land Consolidation. Catena 2025, 258, 109206. [Google Scholar] [CrossRef]
  14. Li, X.; Zhang, Y.; Zhao, Q. Landslide displacement prediction from on-site deformation data based on time series ARIMA model. Front. Environ. Sci. 2023, 11, 1249743. [Google Scholar] [CrossRef]
  15. Jin, A.; Yang, S.; Huang, X. Landslide displacement prediction based on time series and long short-term memory networks. Bull. Eng. Geol. Environ. 2024, 83, 264. [Google Scholar] [CrossRef]
  16. Zhang, W.; Li, H.; Tang, L.; Gu, X.; Wang, L. Displacement prediction of Jiuxianping landslide using gated recurrent unit (GRU) networks. Acta Geotech. 2022, 17, 1367–1382. [Google Scholar] [CrossRef]
  17. Huang, D.; He, J.; Song, Y.; Guo, Z.; Huang, X.; Guo, Y. Displacement prediction of the Muyubao landslide based on a GPS time-series analysis and temporal convolutional network model. Remote Sens. 2022, 14, 2656. [Google Scholar] [CrossRef]
Figure 1. Architecture of CRAFormer with role-masked branches and gated fusion. (a) DLCG on lag-unfolded graph; (b) Five-Branch Causal Network. Five causal branches: ES (yellow, self-feedback), DCS (blue, direct causes), CCS (green, co-causes), SCS (purple, structurally irrelevant variables), and ICS (red, indirect causes).
Figure 2. Role-masked branches and lightweight encoders in CRAFormer: (a) Attention–MLP encoder: linear projection, causal/masked self-attention, GELU-MLP, and linear head. (b) Attention with role mask M_role = C S^(role). (c) ICS branch: an exogenous next-day rainfall token R_gau (gauge oracle) is appended as a tail; the block mask allows the tail to read history but not vice versa. (d) Lite Transformer cell with pre-norm, residual mixing μ, and a compact MLP.
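The leakage-free block mask of panel (c) can be sketched as follows. This is a minimal NumPy illustration under our reading of the figure, not the authors' implementation: T history tokens attend causally among themselves, the single appended rainfall-forecast tail token may read the full history, and no history token may read the tail, so the exogenous forecast cannot leak backward in time.

```python
import numpy as np

def ics_block_mask(T: int) -> np.ndarray:
    """Boolean attention mask for T history tokens plus one exogenous
    tail token at index T. mask[q, k] = True means query q may attend to key k."""
    n = T + 1
    mask = np.zeros((n, n), dtype=bool)
    # Non-anticipative (causal) attention among history tokens.
    mask[:T, :T] = np.tril(np.ones((T, T), dtype=bool))
    # The tail token (next-day rainfall) may read all history and itself...
    mask[T, :] = True
    # ...but mask[:T, T] stays False: history never attends to the tail,
    # which is what makes the exogenous forecast leakage-free.
    return mask

m = ics_block_mask(4)
```

In an attention layer, entries where the mask is False would be set to −∞ before the softmax.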
Figure 3. Representative time series at LaMenTun: displacement, rainfall metrics, volumetric water content, and soil temperature.
Figure 4. Representative time series at BaYiTun: displacement, rainfall metrics, volumetric water content, and soil temperature.
Figure 5. Layout of displacement and environmental monitoring instruments at the LaMenTun landslide. (a) Regional location map showing the LaMenTun landslide within Tian’e County, Hechi City, Guangxi, China; the five-pointed star marks the landslide site. (b) Satellite image of the landslide area, where the dashed boundary delineates the landslide extent, the arrow indicates the overall sliding direction, and the dashed line denotes the reference section line. (c) Field overview of the monitoring layout. Black dashed arrows are connector lines only, used to link corresponding locations/features between panels.
Figure 6. Spatial layout of the displacement and environmental monitoring instruments at the LaMenTun landslide site.
Figure 7. Layout of displacement and environmental monitoring instruments at the BaYiTun landslide. (a) Regional location in Nandan County, Hechi City, Guangxi, China; the five-pointed star marks the landslide site. (b) Plan view of the landslide showing instrument distribution and key features; symbol colors denote instrument types (see legend), and arrows/lines mark the landslide boundary/extent, tensile cracks, and the section line. (c) Geological profile along the section line in (b), showing stratigraphic units and the sliding surface/soil (see legend), with buildings for reference.
Figure 8. LaMenTun: DLCG-derived directed causal graph (DAG) over lagged variables. Edge orientations are determined by v-structures and Meek rules.
Figure 9. BaYiTun: DLCG-derived directed causal graph (DAG) over lagged variables. Edge orientations are determined by v-structures and Meek rules.
Figure 10. LaMenTun: role masks derived from the DLCG for the five-branch model. Panels (a–d) show the DCS, CCS, ICS, and SCS masks, respectively. Dark green cells indicate allowed connections/visible entries (mask = 1), and light cells indicate masked entries (mask = 0).
Figure 11. BaYiTun: role masks derived from the DLCG for the five-branch model. Panels (a–d) show the DCS, CCS, ICS, and SCS masks, respectively. Dark green cells indicate allowed connections/visible entries (mask = 1), and light cells indicate masked entries (mask = 0).
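Role masks like those in Figures 10 and 11 can be derived from a discovered DAG by partitioning variables into causal roles. The sketch below is our own illustration: the graph, variable names, and the exact role definitions (parents of the target as direct causes, other ancestors as indirect causes, co-parents of the target's children as co-causes, the remainder as structurally irrelevant) are assumptions for exposition, not the paper's code.

```python
from collections import deque

def ancestors(parents, node):
    """All ancestors of `node` in a DAG given as a parents-adjacency dict."""
    seen, queue = set(), deque(parents.get(node, ()))
    while queue:
        u = queue.popleft()
        if u not in seen:
            seen.add(u)
            queue.extend(parents.get(u, ()))
    return seen

def causal_roles(parents, target):
    """Partition non-target variables into DCS/ICS/CCS/SCS-style role sets."""
    dcs = set(parents.get(target, ()))        # direct causes: parents of target
    ics = ancestors(parents, target) - dcs    # indirect causes: remaining ancestors
    # Co-causes: other parents of the target's children (collider co-parents).
    children = {v for v, ps in parents.items() if target in ps}
    ccs = set()
    for c in children:
        ccs |= set(parents.get(c, ()))
    ccs -= dcs | ics | {target}
    everything = set(parents) | {u for ps in parents.values() for u in ps}
    scs = everything - dcs - ics - ccs - {target}  # structurally irrelevant rest
    return dcs, ics, ccs, scs

# Hypothetical lagged graph: rain -> moisture -> disp; temp is disconnected.
g = {"moisture": {"rain"}, "disp": {"moisture"}, "temp": set(), "rain": set()}
dcs, ics, ccs, scs = causal_roles(g, "disp")
```

Each role set then indexes the rows of the corresponding branch's visibility mask (1 for visible, 0 for masked).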
Figure 12. LaMenTun (GPS03): XWT/WTC between displacement and daily/cumulative rainfall. Colors denote magnitude (XWT: cross-wavelet power; WTC: coherence from 0 to 1), arrows indicate relative phase (lead/lag), and the shaded region marks the cone of influence.
Figure 13. LaMenTun (GPS03): predictive Granger panels on velocity. (a) F-statistics over lags of 1–14 days; (b) BY–FDR q-values (bright = significant). No cell attains BY–FDR significance; short-lag F peaks do not survive multiplicity correction.
Figure 14. BaYiTun (GPS03): XWT/WTC between displacement and rainfall drivers. Colors denote magnitude (XWT: cross-wavelet power; WTC: coherence from 0 to 1), arrows indicate relative phase (lead/lag), and shading marks the cone of influence.
Figure 15. BaYiTun (GPS03): predictive Granger panels on velocity. (a) F-statistics over lags of 1–14 days; (b) BY–FDR q-values (bright = significant). A robust HS01–HS04 band appears at ∼2–10 days; rainfall rows are largely non-significant.
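The q-values shown in panels (b) of Figures 13 and 15 follow the Benjamini–Yekutieli procedure, which controls the false discovery rate under arbitrary dependence between tests. A minimal NumPy sketch of BY-adjusted q-values (our illustration, not the authors' code):

```python
import numpy as np

def by_qvalues(p):
    """Benjamini-Yekutieli adjusted p-values (q-values), valid under
    arbitrary dependence between the tests."""
    p = np.asarray(p, dtype=float)
    m = p.size
    c_m = np.sum(1.0 / np.arange(1, m + 1))          # harmonic correction factor
    order = np.argsort(p)
    raw = p[order] * m * c_m / np.arange(1, m + 1)   # per-rank BY adjustment
    # Enforce monotonicity: the q-value at rank i is the minimum over ranks >= i.
    q = np.clip(np.minimum.accumulate(raw[::-1])[::-1], 0.0, 1.0)
    out = np.empty(m)
    out[order] = q                                    # back to input order
    return out

q = by_qvalues([0.001, 0.01, 0.04, 0.5])
```

A cell in the figure is marked significant when its q-value falls below the chosen level (α = 0.05 in this paper's screening).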
Figure 16. MAE vs. rainfall intensity (7-day accumulation) for six stations: (a) Lamen—GPS01; (b) Lamen—GPS03; (c) Lamen—GPS04; (d) Lamen—LF01; (e) Bayi—GPS02; (f) Bayi—GPS03.
Figure 17. MAE_turn vs. rainfall intensity (top 20% |Δy|) using 7-day accumulation across six stations: (a) Lamen—GPS01; (b) Lamen—GPS03; (c) Lamen—GPS04; (d) Lamen—LF01; (e) Bayi—GPS02; (f) Bayi—GPS03.
Figure 18. LaMenTun: comparison of baseline model prediction curves. Each panel shows daily predictions over the last month from LiteTransNet, GRU, CNN-LSTM, TCN, and CRAFormer against the observed series (black): (a) GPS01; (b) GPS03; (c) GPS04; (d) LF01.
Figure 19. BaYiTun: baseline model prediction curves versus observed displacement over the last month. Models shown: LiteTransNet, GRU, CNN-LSTM, TCN, and CRAFormer. (a) GPS02; (b) GPS03.
Figure 20. LaMenTun: ablation study prediction curves over the last month. Each panel compares MLP, MLP_Rain (exogenous evidence), MLP_DLCG (DBN-style structure prior), MLP_Granger (linear CI screening), and CRAFormer against the observed series (black): (a) GPS01; (b) GPS03; (c) GPS04; (d) LF01.
Figure 21. BaYiTun: ablation study prediction curves over the last month. Models shown are MLP, MLP_Rain, MLP_DLCG, MLP_Granger, and CRAFormer versus the observed series (black): (a) GPS02; (b) GPS03.
Table 1. Experimental environment summary.

| Component | Specification |
| --- | --- |
| Operating system | Windows 11 (Build 26100) |
| CPU/memory | Intel Core i7-9700 (8 cores, 3.0 GHz) / 32 GB RAM |
| GPU | None (CPU-only) |
| CUDA/cuDNN | Disabled |
| Python env | Python 3.11.7 (Anaconda) |
| PyTorch | 2.2.1 (cpuonly) |
| Core libraries | NumPy, Pandas, Matplotlib (v3.8.4), scikit-learn, statsmodels |
| Determinism | torch.use_deterministic_algorithms(True); fixed seeds |
Table 2. Architectural configurations of baseline models.

| Model | Structure Summary | Activation |
| --- | --- | --- |
| TCN | Three TemporalBlocks, each with two Conv1D layers (kernel = 16; dilations = 1, 2, 4); channel width H; residual connections | ReLU |
| CNN–LSTM | Conv stack: Conv1D (D→H, kernel = 16) → Conv1D (H→H, kernel = 32) → max-pool; LSTM (hidden = H) over the pooled sequence; head: fc H→1 | ReLU (Conv); tanh/sigmoid (LSTM) |
| GRU | Three-layer GRU (input dim = n_features, hidden units = H); output head: fc H→1 | ReLU (head) |
| LiteTransNet | 2 encoder + 2 decoder layers; each encoder: 1 × 4-head attention; each decoder: 2 × 4-head attention; FFN: H→4H→H; head: fc H→1 | ReLU |
Table 3. Architectural configurations of ablation variants and the full model, where K is the look-back window, D is the number of features, H is the MLP width, and d is the Transformer width.

| Model | Structure Summary | Key Settings |
| --- | --- | --- |
| MLP | Three-layer MLP on the flattened window: fc1 ℝ^(K×D)→H, fc2 H→H/2, fc3 H/2→1; no masks, no exogenous input | H ∈ {32, 64, 128}; ReLU; dropout 0; L2 = 10⁻⁴ |
| MLP_Rain | As MLP; appends leakage-free R_(t+1) at prediction time as an exogenous scalar; no structure masks | H ∈ {32, 64, 128}; ReLU; dropout 0; L2 = 10⁻⁴ |
| MLP_Granger | As MLP; inputs pre-screened by linear Granger/partial-correlation CI tests (per lag, BY–FDR); no learned masks | H ∈ {32, 64, 128}; ReLU; BY–FDR α = 0.05; L2 = 10⁻⁴ |
| MLP_DLCG | As MLP; applies DLCG time-consistent visibility masks on the lag-unrolled graph (parents/ancestors/colliders; non-anticipativity); no ICS tail | H ∈ {32, 64, 128}; ReLU; DLCG: HSIC/KCI + BY–FDR, bootstrap; L2 = 10⁻⁴ |
| CRAFormer | Five role branches (ES/DCS/CCS/ICS/SCS) with single-head causal self-attention (width d) and a lite Transformer cell; the ICS branch uses an exogenous R_(t+1) tail (leakage-free, non-negative readout, monotonic regularization); Top-2 context-aware gating; convex fusion | d ∈ {32, 64, 128}; GELU/ReLU; dropout 0.1; L2 = 10⁻⁴; λ_ent ∈ [10⁻³, 10⁻²]; λ_scs ∈ [10⁻³, 10⁻²]; λ_mono ∈ [10⁻³, 10⁻¹] |

Note: Inputs are standardized on the training split; default K = 96, stride 1. Optimizer: Adam; lr ∈ {5 × 10⁻³, 10⁻³, 5 × 10⁻⁴, 10⁻⁴}; early stopping on validation MAE (patience 20, max 100 epochs). Rain(+1) is used strictly as exogenous evidence (no gradients). DLCG discovery excludes future nodes and enforces forward-time edges; masks implement d-separation and non-anticipativity.
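The Top-2 context-aware gating with convex fusion in the last row can be sketched as follows. This is a minimal NumPy illustration under our own assumptions about the gate (softmax over five branch logits, truncation to the two largest weights, renormalization to a convex combination), not the authors' implementation.

```python
import numpy as np

def top2_gate(logits: np.ndarray) -> np.ndarray:
    """Softmax over branch logits, keep the two largest probabilities,
    and renormalize so the surviving weights form a convex combination."""
    z = logits - logits.max()               # shift for numerical stability
    pi = np.exp(z) / np.exp(z).sum()        # pre-truncation gate distribution
    keep = np.argsort(pi)[-2:]              # indices of the two largest weights
    sparse = np.zeros_like(pi)
    sparse[keep] = pi[keep]
    return sparse / sparse.sum()            # renormalized Top-2 weights

# One gate over the five role branches (ES, DCS, CCS, ICS, SCS).
w = top2_gate(np.array([2.0, 1.0, 0.1, -1.0, -2.0]))
```

Branch outputs would then be fused as y = Σ_k w[k]·y_k, so only two branches contribute per sample, which is what yields sample-wise attributions.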
Table 4. Rainfall-stratified mean pre-truncation gate weights for ES, DCSs, and ICSs, mean Top-2 gate mass, and mean gate entropy H(π) at LaMenTun and BaYiTun stations.

| Station | Bin | ES mean π | DCS mean π | ICS mean π | Mean Top-2 mass | Mean H(π) | N samples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LaMenTun_gps01 | Dry | 0.66 | 0.18 | 0.05 | 0.87 | 0.64 | 115 |
| LaMenTun_gps01 | Moderate | 0.58 | 0.19 | 0.10 | 0.85 | 0.71 | 110 |
| LaMenTun_gps01 | Wet | 0.45 | 0.21 | 0.20 | 0.82 | 0.82 | 135 |
| LaMenTun_gps01 | Very Wet | 0.34 | 0.21 | 0.29 | 0.80 | 0.93 | 58 |
| LaMenTun_gps03 | Dry | 0.62 | 0.20 | 0.07 | 0.86 | 0.68 | 108 |
| LaMenTun_gps03 | Moderate | 0.53 | 0.21 | 0.14 | 0.84 | 0.76 | 102 |
| LaMenTun_gps03 | Wet | 0.40 | 0.22 | 0.26 | 0.81 | 0.87 | 142 |
| LaMenTun_gps03 | Very Wet | 0.30 | 0.22 | 0.35 | 0.79 | 0.98 | 64 |
| LaMenTun_gps04 | Dry | 0.68 | 0.17 | 0.05 | 0.88 | 0.62 | 120 |
| LaMenTun_gps04 | Moderate | 0.60 | 0.18 | 0.09 | 0.86 | 0.69 | 118 |
| LaMenTun_gps04 | Wet | 0.49 | 0.20 | 0.17 | 0.83 | 0.80 | 130 |
| LaMenTun_gps04 | Very Wet | 0.38 | 0.21 | 0.24 | 0.81 | 0.90 | 55 |
| LaMenTun_lf01 | Dry | 0.63 | 0.18 | 0.06 | 0.86 | 0.66 | 96 |
| LaMenTun_lf01 | Moderate | 0.55 | 0.20 | 0.12 | 0.84 | 0.74 | 92 |
| LaMenTun_lf01 | Wet | 0.43 | 0.21 | 0.22 | 0.82 | 0.84 | 104 |
| LaMenTun_lf01 | Very Wet | 0.33 | 0.22 | 0.31 | 0.80 | 0.96 | 49 |
| BaYiTun_gps02 | Dry | 0.65 | 0.17 | 0.05 | 0.87 | 0.63 | 103 |
| BaYiTun_gps02 | Moderate | 0.57 | 0.19 | 0.10 | 0.85 | 0.71 | 123 |
| BaYiTun_gps02 | Wet | 0.46 | 0.21 | 0.19 | 0.82 | 0.82 | 113 |
| BaYiTun_gps02 | Very Wet | 0.35 | 0.22 | 0.27 | 0.80 | 0.93 | 113 |
| BaYiTun_gps03 | Dry | 0.61 | 0.20 | 0.07 | 0.86 | 0.67 | 90 |
| BaYiTun_gps03 | Moderate | 0.52 | 0.20 | 0.15 | 0.84 | 0.77 | 86 |
| BaYiTun_gps03 | Wet | 0.38 | 0.22 | 0.27 | 0.81 | 0.88 | 100 |
| BaYiTun_gps03 | Very Wet | 0.28 | 0.23 | 0.37 | 0.79 | 1.00 | 47 |

"ES mean π", "DCS mean π", and "ICS mean π" are the average pre-truncation gate weights for the ES, DCS, and ICS branches, respectively, within each 7-day rainfall regime. "Mean Top-2 mass" is the average sum of the two largest gate probabilities π_(i,(1)) + π_(i,(2)). H(π) denotes the mean entropy of the five-way gate distribution; N is the number of validation samples falling into each bin.
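The summary statistics in the footnote can be reproduced from per-sample gate distributions as sketched below. We assume natural-log entropy, consistent with the reported values topping out near ln 5 ≈ 1.61 for a five-way gate; this is an illustration, not the authors' code.

```python
import numpy as np

def gate_summary(pi: np.ndarray):
    """pi: (N, 5) array of per-sample gate distributions (rows sum to 1).
    Returns (mean Top-2 mass, mean entropy in nats)."""
    # Top-2 mass: sum of the two largest probabilities per sample.
    top2 = np.sort(pi, axis=1)[:, -2:].sum(axis=1)
    # Shannon entropy in nats, treating 0*log(0) as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -np.where(pi > 0, pi * np.log(np.where(pi > 0, pi, 1.0)), 0.0).sum(axis=1)
    return top2.mean(), h.mean()

# Sanity check on a uniform five-way gate: Top-2 mass 0.4, entropy ln 5.
top2, h = gate_summary(np.full((4, 5), 0.2))
```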
Table 5. Model performance on LaMenTun and BaYiTun stations with turning-point errors.

| Station | Model | MAE | RMSE | R² | MAE_turn10 | MAE_turn20 | MAE_turn30 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LaMenTun_gps01 | TCN | 1.799 | 2.255 | 0.735 | 3.219 | 2.740 | 2.423 |
| LaMenTun_gps01 | GRU | 1.592 | 2.038 | 0.762 | 3.522 | 2.899 | 2.496 |
| LaMenTun_gps01 | CNN_LSTM | 1.681 | 2.139 | 0.731 | 3.300 | 2.680 | 2.253 |
| LaMenTun_gps01 | LiteTransNet | 1.656 | 2.092 | 0.971 | 3.516 | 3.002 | 2.613 |
| LaMenTun_gps01 | CRAFormer | 0.359 | 0.456 | 0.977 | 0.403 | 0.379 | 0.356 |
| LaMenTun_gps03 | TCN | 6.136 | 9.078 | 0.958 | 13.891 | 12.138 | 10.260 |
| LaMenTun_gps03 | GRU | 6.554 | 9.220 | 0.953 | 12.860 | 11.308 | 9.535 |
| LaMenTun_gps03 | CNN_LSTM | 6.196 | 8.213 | 0.938 | 9.728 | 8.930 | 8.104 |
| LaMenTun_gps03 | LiteTransNet | 5.045 | 6.883 | 0.998 | 10.339 | 7.863 | 7.423 |
| LaMenTun_gps03 | CRAFormer | 1.525 | 1.947 | 0.996 | 1.741 | 1.660 | 1.596 |
| LaMenTun_gps04 | TCN | 1.561 | 2.101 | 0.904 | 3.955 | 3.033 | 2.605 |
| LaMenTun_gps04 | GRU | 1.664 | 2.249 | 0.876 | 3.693 | 2.847 | 2.293 |
| LaMenTun_gps04 | CNN_LSTM | 1.629 | 2.218 | 0.867 | 3.590 | 2.883 | 2.371 |
| LaMenTun_gps04 | LiteTransNet | 1.411 | 1.845 | 0.989 | 3.371 | 2.621 | 2.233 |
| LaMenTun_gps04 | CRAFormer | 0.374 | 0.471 | 0.983 | 0.391 | 0.391 | 0.382 |
| LaMenTun_lf01 | TCN | 0.359 | 0.569 | 0.978 | 0.773 | 0.556 | 0.425 |
| LaMenTun_lf01 | GRU | 0.226 | 0.455 | 0.974 | 0.568 | 0.397 | 0.328 |
| LaMenTun_lf01 | CNN_LSTM | 0.429 | 0.700 | 0.946 | 0.967 | 0.709 | 0.545 |
| LaMenTun_lf01 | LiteTransNet | 0.342 | 0.454 | 0.995 | 0.690 | 0.544 | 0.459 |
| LaMenTun_lf01 | CRAFormer | 0.090 | 0.144 | 0.995 | 0.128 | 0.113 | 0.093 |
| BaYiTun_gps02 | TCN | 1.530 | 1.994 | 0.815 | 3.944 | 3.323 | 2.787 |
| BaYiTun_gps02 | GRU | 1.740 | 2.257 | 0.888 | 4.127 | 3.210 | 2.616 |
| BaYiTun_gps02 | CNN_LSTM | 1.458 | 1.943 | 0.821 | 3.957 | 3.115 | 2.531 |
| BaYiTun_gps02 | LiteTransNet | 1.623 | 2.122 | 0.993 | 3.904 | 3.174 | 2.740 |
| BaYiTun_gps02 | CRAFormer | 0.594 | 0.765 | 0.973 | 0.820 | 0.721 | 0.647 |
| BaYiTun_gps03 | TCN | 5.253 | 7.164 | 0.825 | 7.436 | 5.864 | 5.607 |
| BaYiTun_gps03 | GRU | 3.115 | 4.997 | 0.929 | 7.396 | 5.253 | 4.553 |
| BaYiTun_gps03 | CNN_LSTM | 3.541 | 5.352 | 0.896 | 7.368 | 5.531 | 4.841 |
| BaYiTun_gps03 | LiteTransNet | 3.504 | 5.454 | 0.995 | 8.315 | 6.079 | 5.121 |
| BaYiTun_gps03 | CRAFormer | 1.131 | 1.588 | 0.991 | 1.792 | 1.399 | 1.342 |
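The metrics in Tables 5 and 6 can be computed as sketched below. The MAE, RMSE, and R² definitions are standard; the turning-point metric is our reading of MAE_turn (MAE restricted to the top q% of days ranked by |Δy|, with q ∈ {10, 20, 30}), and the toy series is hypothetical.

```python
import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mae_turn(y, yhat, top_pct=20):
    """MAE restricted to the top `top_pct`% of days by |dy| (turning points)."""
    dy = np.abs(np.diff(y))                            # day-over-day change
    k = max(1, int(np.ceil(dy.size * top_pct / 100)))  # number of days kept
    idx = np.argsort(dy)[-k:] + 1                      # align dy_t with day t
    return np.mean(np.abs(y[idx] - yhat[idx]))

# Toy example: a step series and a prediction uniformly 1 mm high.
y = np.array([0.0, 0.0, 5.0, 5.0, 5.0])
yhat = y + 1.0
```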
Table 6. Ablation results on LaMenTun and BaYiTun stations with turning-point errors.

| Station | Model | MAE | RMSE | R² | MAE_turn10 | MAE_turn20 | MAE_turn30 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LaMenTun_gps01 | MLP | 2.084 | 2.612 | 0.230 | 4.109 | 3.439 | 2.836 |
| LaMenTun_gps01 | MLP_Rain | 1.589 | 2.051 | 0.525 | 3.468 | 2.835 | 2.418 |
| LaMenTun_gps01 | MLP_DLCG | 1.609 | 2.066 | 0.518 | 3.457 | 2.847 | 2.403 |
| LaMenTun_gps01 | MLP_Granger | 1.729 | 2.231 | 0.438 | 3.639 | 3.050 | 2.466 |
| LaMenTun_gps01 | CRAFormer | 0.359 | 0.456 | 0.977 | 0.403 | 0.379 | 0.356 |
| LaMenTun_gps03 | MLP | 6.293 | 8.541 | 0.917 | 11.249 | 9.713 | 8.891 |
| LaMenTun_gps03 | MLP_Rain | 6.773 | 8.714 | 0.914 | 11.450 | 9.987 | 8.670 |
| LaMenTun_gps03 | MLP_DLCG | 5.514 | 7.551 | 0.935 | 9.425 | 8.662 | 7.865 |
| LaMenTun_gps03 | MLP_Granger | 5.214 | 7.978 | 0.928 | 10.837 | 9.250 | 7.920 |
| LaMenTun_gps03 | CRAFormer | 1.525 | 1.947 | 0.996 | 1.741 | 1.660 | 1.596 |
| LaMenTun_gps04 | MLP | 1.565 | 2.083 | 0.675 | 3.563 | 2.698 | 2.279 |
| LaMenTun_gps04 | MLP_Rain | 1.883 | 2.330 | 0.593 | 3.271 | 2.561 | 2.375 |
| LaMenTun_gps04 | MLP_DLCG | 1.681 | 2.172 | 0.647 | 3.377 | 2.622 | 2.297 |
| LaMenTun_gps04 | MLP_Granger | 1.360 | 1.826 | 0.750 | 3.372 | 2.701 | 2.334 |
| LaMenTun_gps04 | CRAFormer | 0.374 | 0.471 | 0.983 | 0.391 | 0.391 | 0.382 |
| LaMenTun_lf01 | MLP | 0.343 | 0.495 | 0.938 | 0.509 | 0.427 | 0.381 |
| LaMenTun_lf01 | MLP_Rain | 0.332 | 0.495 | 0.938 | 0.631 | 0.508 | 0.449 |
| LaMenTun_lf01 | MLP_DLCG | 0.205 | 0.423 | 0.955 | 0.605 | 0.422 | 0.339 |
| LaMenTun_lf01 | MLP_Granger | 0.350 | 0.520 | 0.932 | 0.685 | 0.554 | 0.474 |
| LaMenTun_lf01 | CRAFormer | 0.090 | 0.144 | 0.995 | 0.128 | 0.113 | 0.093 |
| BaYiTun_gps02 | MLP | 2.084 | 2.649 | 0.667 | 3.805 | 3.305 | 2.663 |
| BaYiTun_gps02 | MLP_Rain | 1.560 | 2.061 | 0.799 | 3.782 | 3.078 | 2.493 |
| BaYiTun_gps02 | MLP_DLCG | 1.443 | 1.936 | 0.822 | 3.761 | 3.046 | 2.470 |
| BaYiTun_gps02 | MLP_Granger | 1.566 | 2.077 | 0.795 | 3.742 | 2.868 | 2.350 |
| BaYiTun_gps02 | CRAFormer | 0.594 | 0.765 | 0.972 | 0.820 | 0.721 | 0.647 |
| BaYiTun_gps03 | MLP | 7.768 | 9.296 | 0.708 | 11.209 | 9.306 | 8.632 |
| BaYiTun_gps03 | MLP_Rain | 3.462 | 5.226 | 0.908 | 7.071 | 5.211 | 4.682 |
| BaYiTun_gps03 | MLP_DLCG | 3.277 | 5.138 | 0.911 | 7.497 | 5.537 | 4.753 |
| BaYiTun_gps03 | MLP_Granger | 4.215 | 6.014 | 0.878 | 7.753 | 5.689 | 4.974 |
| BaYiTun_gps03 | CRAFormer | 1.131 | 1.588 | 0.991 | 1.792 | 1.399 | 1.342 |
Table 7. CRAFormer under oracle and NWP-like 24 h rainfall scenarios at six stations.

| Station | Scenario | MAE | RMSE | R² | MAE_turn | ΔMAE (%) | ΔMAE_turn (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| lamen_gps01 | CRAFormer | 0.359 | 0.456 | 0.977 | 0.458 | 0.0 | 0.0 |
| lamen_gps01 | NWP-mild | 0.439 | 0.528 | 0.954 | 0.433 | 22.3 | −5.5 |
| lamen_gps01 | NWP-typical | 0.478 | 0.593 | 0.912 | 0.430 | 33.1 | −6.1 |
| lamen_gps01 | NWP-poor | 0.401 | 0.519 | 0.932 | 0.361 | 11.7 | −21.2 |
| lamen_gps03 | CRAFormer | 1.525 | 1.947 | 0.996 | 1.778 | 0.0 | 0.0 |
| lamen_gps03 | NWP-mild | 2.225 | 2.619 | 0.963 | 2.630 | 45.9 | 47.9 |
| lamen_gps03 | NWP-typical | 2.052 | 2.511 | 0.966 | 2.861 | 34.6 | 60.9 |
| lamen_gps03 | NWP-poor | 1.666 | 2.123 | 0.976 | 1.380 | 9.2 | −22.4 |
| lamen_gps04 | CRAFormer | 0.374 | 0.471 | 0.983 | 0.506 | 0.0 | 0.0 |
| lamen_gps04 | NWP-mild | 0.418 | 0.534 | 0.936 | 0.322 | 11.8 | −36.4 |
| lamen_gps04 | NWP-typical | 0.580 | 0.697 | 0.891 | 0.583 | 55.1 | 15.2 |
| lamen_gps04 | NWP-poor | 0.415 | 0.558 | 0.930 | 0.383 | 11.0 | −24.3 |
| lamen_lf01 | CRAFormer | 0.090 | 0.144 | 0.995 | 0.043 | 0.0 | 0.0 |
| lamen_lf01 | NWP-mild | 0.081 | 0.156 | 0.952 | 0.091 | −10.0 | 111.6 |
| lamen_lf01 | NWP-typical | 0.120 | 0.156 | 0.895 | 0.070 | 33.3 | 62.8 |
| lamen_lf01 | NWP-poor | 0.287 | 0.391 | 0.339 | 0.294 | 218.9 | 583.7 |
| bayi_gps02 | CRAFormer | 0.594 | 0.765 | 0.973 | 1.222 | 0.0 | 0.0 |
| bayi_gps02 | NWP-mild | 0.540 | 0.810 | 0.960 | 0.396 | −9.1 | −67.6 |
| bayi_gps02 | NWP-typical | 0.584 | 0.817 | 0.924 | 0.506 | −1.7 | −58.6 |
| bayi_gps02 | NWP-poor | 0.565 | 0.827 | 0.920 | 0.248 | −4.9 | −79.7 |
| bayi_gps03 | CRAFormer | 1.131 | 1.588 | 0.991 | 0.865 | 0.0 | 0.0 |
| bayi_gps03 | NWP-mild | 1.661 | 1.870 | 0.974 | 0.488 | 46.9 | −43.6 |
| bayi_gps03 | NWP-typical | 1.461 | 1.585 | 0.988 | 0.476 | 29.2 | −45.0 |
| bayi_gps03 | NWP-poor | 1.645 | 1.827 | 0.977 | 0.777 | 45.4 | −10.2 |
Share and Cite

Zhang, F.; Ji, Y.; Liu, X.; Liu, S.; Lu, Z.; Sun, X.; Ren, S.; Jia, X. Bayesian-Inspired Dynamic-Lag Causal Graphs and Role-Aware Transformers for Landslide Displacement Forecasting. Entropy 2026, 28, 7. https://doi.org/10.3390/e28010007