Article

Dam Seepage Analysis Based on Causal Testing and Regression Analysis

1
School of Hydraulic Engineering, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, China
2
Dongyang Reservoir Hydropower Operation Centre, Dongyang 322100, China
*
Author to whom correspondence should be addressed.
Water 2026, 18(11), 1359; https://doi.org/10.3390/w18111359
Submission received: 25 February 2026 / Revised: 21 May 2026 / Accepted: 22 May 2026 / Published: 3 June 2026
(This article belongs to the Special Issue Water Engineering Safety and Management, 2nd Edition)

Abstract

Dam seepage is a critical issue affecting the safe operation of reservoir dams, making the monitoring and early warning of abnormal seepage conditions particularly important. Current analyses of dam seepage primarily use finite element methods to invert seepage conditions and regression analysis or classical machine learning methods to predict seepage. However, the relationships among the various influencing factors have received limited attention. The subjectivity of input factors in seepage safety monitoring models, the imprecision of factor relationships, and the randomness of parameter selection can all lead to uncertainty in model predictions. Therefore, to identify the primary factors influencing reservoir seepage, we took a specific reservoir project as an example and employed stepwise regression analysis and the Granger causality test (GCT) to examine the relationships between reservoir water level, rainfall, seepage pressure at various locations, and seepage pressure around the dam. Based on this analysis, the key influencing factors for seepage pressure around the dam were identified. The results indicate that reservoir water level and seepage pressure in the dam body and foundation influence the seepage pressure around the dam. The stepwise regression method comprehensively screens potential influencing factors, while GCT uses time-lag characteristics to narrow the range further. Together, the two methods narrow the range of factors influencing seepage pressure around the dam, reduce interference among these factors, eliminate spurious correlations and redundant variables, and support efficient and reliable detection and early warning of abnormal dam seepage.

1. Introduction

Dams are indispensable for flood control, power generation, and water resource utilization. However, a significant number of these structures were built in the 1960s and 1970s. Due to the economic conditions and dam-building technologies of that era, construction standards were relatively low, resulting in issues such as poor embankment quality, seepage in the dam body, and substandard flood control capacity [1]. Among the various dam types, earth-fill dams and rockfill dams are particularly susceptible to seepage issues due to the inherent porosity and permeability of their materials.
Earth-rock dams are fundamental structures in hydraulic engineering. Currently, safety research on earth-rock dams primarily focuses on seepage, structural integrity, and stability [2,3,4,5]. Among the various failure modes of earth-rock dams, failure due to seepage is the most common. If damage caused by seepage within the dam body (such as piping) is not detected in a timely manner during routine operation, it can lead to the loss of the internal skeletal particles. At the same time, changes in seepage pressure can easily trigger slope instability, and in severe cases, result in dam failure. Therefore, obtaining reliable seepage parameters is a fundamental requirement for simulating dam seepage and assessing its seepage safety. Currently, routine monitoring of the dam generates a large volume of high-frequency, multi-channel, and multi-source monitoring data. Consequently, integrating monitoring data and determining the optimal objective function value is crucial for analyzing the seepage stability of a dam, as it helps identify the optimal seepage parameters from a large volume of monitoring data [6].
Current research on seepage stability has primarily focused on numerical simulation. Most researchers employ finite element analysis to assess the seepage stability of dam bodies; however, this method typically requires a large amount of reliable monitoring data and involves high computational costs. To improve computational efficiency, methods such as machine learning, multivariate adaptive regression splines, and support vector regression are increasingly being applied to the inversion of seepage parameters. Tong et al. [7] employed a seepage–stress coupling analysis to invert the seepage and mechanical parameters of the dam foundation rock mass, thereby assessing the deep-seated anti-sliding stability of the dam foundation following reservoir water level elevation. Li et al. [8] investigated the influence of permeation coefficient on seepage characteristics in rockfill dams with cracked concrete facing. Using a concrete-faced rockfill dam in Qinghai Province as a case study, they combined the equivalent block continuous medium model with the Van Genuchten model to propose a saturated–unsaturated seepage calculation method accounting for concrete facing cracks. Zhou et al. [9] employed the finite element method to calculate the seepage stability of a reservoir after hazard mitigation and reinforcement. Based on finite element results and combined with field measurements, they established a hybrid model to analyze the dam’s seepage conditions. Liu et al. [10] evaluated the seepage safety status of the Dalongdong Reservoir dam. Using measured water level data as a reference, they iteratively adjusted trial calculations to identify seepage parameters reflecting the material properties of each zone, then performed seepage stability analysis on the dam. It is evident that research on dam seepage safety monitoring primarily focuses on establishing deterministic input–output relationships between environmental variables and dam seepage.
However, dam seepage issues arise from the interaction of multiple factors such as reservoir water level, rainfall, temperature, and aging, making the causes of seepage complex and exhibiting a time-lag effect [11,12,13,14]. Currently, researchers predominantly employ regression analysis and classical machine learning methods (such as support vector machines and artificial neural networks) for dam seepage prediction. Arslan et al. [15] employed adaptive neuro-fuzzy inference systems (ANFIS) and artificial neural networks (ANN) to predict dam seepage. Their findings indicated that both single and multiple ANN and ANFIS models demonstrated comparable predictive capabilities, though ANNs yielded superior predictions for certain pressure gauges despite being easier to construct than ANFIS. Chen et al. [14] proposed a dam seepage assessment method based on Dempster–Shafer (D–S) theory and deep neural networks (DNNs), integrating attention mechanisms (AMs) with long short-term memory (LSTM) models to evaluate the seepage safety of dams at individual monitoring points. Ishfaque et al. [16] employed machine learning methods to predict dam seepage flow using input data such as temperature, rainfall, sediment discharge, and reservoir water level. Li et al. [17] utilized CNN-LSTM models, support vector regression (SVR), adaptive boosting, artificial neural networks, recurrent neural networks, and LSTM models to forecast dam seepage. Results indicated that the CNN-LSTM model demonstrated higher prediction accuracy. Utilizing machine learning and neural networks for nonlinear fitting to map input–output relationships can significantly improve computational efficiency. However, these approaches still have limitations regarding multi-parameter coupling and ensuring data accuracy. Additionally, the validity of the model’s output results is influenced by the comprehensiveness of the input dataset [18,19].
As can be seen, current research on dam seepage analysis primarily focuses on using finite element methods to invert seepage conditions and employing regression analysis and machine learning techniques to predict seepage. However, existing optimization algorithms suffer from weaknesses such as poor global search capabilities and slow search speeds. Furthermore, the overall precision and accuracy of monitoring data can affect the reliability and engineering applicability of the inversion and prediction results. The factors influencing dam seepage are highly complex, and current monitoring data include multiple components such as dam seepage, regional groundwater, and rainfall runoff. In actual pumped-storage power plant projects, the subjectivity and inaccuracy of input factors in seepage safety monitoring models, as well as the randomness in parameter selection, can all lead to uncertainty in model predictions. This results in significant fluctuations in the reliability of the predictions.
At the same time, most of the machine learning models currently in use are “black-box” models; these models typically only provide results without revealing the underlying reasons for those results. In contrast, traditional regression analysis methods yield explicit analytical expressions, where the coefficients in the model reflect the impact of a specific factor on dam seepage pressure. Furthermore, machine learning methods require a vast amount of historical observational data to avoid low accuracy in output values. In particular, since most dams were constructed in the last century, the sensors installed during construction may have suffered damage or have significant missing values, leading to incomplete seepage data and affecting the reliability of inversion or prediction results [20,21]. Therefore, the ability to accurately handle data by removing outliers and imputing missing values is fundamental to the development of early warning models and safety assessment decisions [22,23]. Currently, researchers are beginning to combine traditional data screening methods with machine learning techniques. First, they use traditional methods such as Granger causality tests and stepwise regression analysis to identify the core factors truly related to seepage. Subsequently, machine learning models are employed to predict dam seepage. Therefore, screening the data and analyzing the key factors influencing seepage are crucial for subsequent prediction.
In dams subject to seepage, it takes time for changes in water level to propagate to the interior of the dam; therefore, there is a time lag in the influence between relevant parameters. Methods such as models with built-in regularization (e.g., LASSO regression) or interpretability techniques (e.g., SHAP values) address the correlations between parameters but cannot account for the temporal causal relationships in such data. The Granger causality test is a method for assessing predictive capability in time-directed sequences, addressing causal relationships among parameters in time series. Stepwise regression is a technique for automatically selecting independent variables. In modern dam safety monitoring, although methods such as the Granger causality test and stepwise regression face competition from machine learning, they remain valuable due to their extremely fast computation speed, fully transparent white-box mechanism, and alignment with the logic of engineering hydraulics.
Therefore, to minimize the interference of confounding factors and identify the key factors influencing dam seepage, this study used a reservoir project in China as a case study. By employing both Granger causality testing (hereinafter referred to as GCT) and stepwise regression analysis, the study examined the effects of various factors on seepage pressure around the dam and identified the key influencing factors. The aim is to eliminate spurious correlations and redundant variables, thereby reducing the number of inputs to the predictive model at the source, and to provide an efficient and reliable method for the accurate identification of abnormal seepage in dams.

2. Method

2.1. Stepwise Regression Analysis

In practical dam engineering problems, the factors influencing a phenomenon are often complex. For instance, dam displacement is affected not only by reservoir water pressure, but also by temperature, seepage, construction practices, foundation conditions, surrounding environment, and aging effects. Seepage itself is influenced by reservoir water pressure, dam fill materials, stress field within the dam body, seepage control measures, and aging effects. Therefore, when establishing relationships between predicted variables and predictive factors, it is inevitable to consider multiple factors. Identifying the influence of various factors on a specific predicted variable, establishing mathematical expressions between them, and developing regression models are essential. This enables the calculation of predicted values for a given set of loads, which can then be compared with the measured values to assess the operational status of the structure and conduct monitoring.
In stepwise regression, the process begins with a single predictor variable, and subsequent variables are added to the regression equation one by one in descending order of their significance for the dependent variable. Conversely, when a previously included factor becomes insignificant due to the inclusion of a subsequent factor, it is removed. Therefore, in stepwise regression, some steps involve adding factors, while others involve removing them. A statistical test (F-test) is performed at each step to ensure that the regression equation contains only significant factors before a new significant factor is added, until all significant factors are included in the regression equation.
For a multiple linear regression model, suppose that there are n samples, and the model includes p independent variables:
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$$
After each model fit, the sum of squared errors (SSE) is calculated to assess the model’s performance.
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
In each step of the iterative reduction, the essence of the inclusion or exclusion calculation is to compare the change in the SSE after adding or removing a particular $X_j$. The specific calculation steps for stepwise regression are as follows:
(1) Initialization and setting of thresholds: Determine the significance levels for including and excluding variables from the model. In this paper, the inclusion threshold was set at 0.05 and the exclusion threshold at 0.1.
(2) Calculate the contribution of variables not yet included in the model: First, for all candidate feature variables that have not yet been included in the model, calculate the partial F-statistic that each variable would have if it were added to the model individually. Next, identify the variable among these that yields the largest F-statistic, that is, the one with the smallest p-value and the largest contribution to reducing the model’s sum of squared errors. If the p-value for this variable is less than the predefined inclusion threshold, it is included in the model.
(3) Calculate the degree of variable redundancy within the model: Since a new variable was added to the model in the previous step, the partial contributions of the variables already in the model have changed. Therefore, it is necessary to recalculate the partial F-statistics for all variables in the current model and identify the variable with the smallest F-statistic. If the p-value of this variable exceeds the exclusion threshold, it is removed from the model.
(4) Iteration: Alternate between steps (2) and (3) to identify new variables and eliminate old ones.
(5) Termination: The algorithm terminates when none of the remaining variables outside the model meet the inclusion criteria, and none of the variables inside the model meet the exclusion criteria.
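The five steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the implementation used in the study: the variable names (`level`, `p14`, `rain`) and the synthetic data are hypothetical, and the p-value of each one-degree-of-freedom partial F-test is approximated with the large-sample relation F(1, dof) ≈ z², which is an assumption made to avoid an F-distribution routine.

```python
import math
import random
from statistics import NormalDist


def ols_sse(columns, y):
    """SSE of an OLS fit of y on the given predictor columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, solved by Gaussian elimination.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    c = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for j in range(col, k):
                A[r][j] -= factor * A[col][j]
            c[r] -= factor * c[col]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(X, y))


def partial_p(data, y, base, extra):
    """Approximate p-value of the partial F-test for adding `extra` to `base`."""
    sse_small = ols_sse([data[v] for v in base], y)
    sse_big = ols_sse([data[v] for v in base + [extra]], y)
    dof = len(y) - len(base) - 2  # n minus the parameters of the larger model
    f_stat = (sse_small - sse_big) / (sse_big / dof)
    # Assumption: large-sample approximation F(1, dof) ~ z^2.
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(max(f_stat, 0.0))))


def stepwise(data, y, p_enter=0.05, p_remove=0.10, max_steps=50):
    """Alternate forward inclusion (step 2) and backward exclusion (step 3)."""
    selected = []
    for _ in range(max_steps):
        changed = False
        outside = [v for v in data if v not in selected]
        if outside:
            p_in = {v: partial_p(data, y, selected, v) for v in outside}
            best = min(p_in, key=p_in.get)
            if p_in[best] < p_enter:
                selected.append(best)
                changed = True
        if selected:
            p_out = {v: partial_p(data, y, [u for u in selected if u != v], v)
                     for v in selected}
            worst = max(p_out, key=p_out.get)
            if p_out[worst] > p_remove:
                selected.remove(worst)
                changed = True
        if not changed:  # step 5: nothing enters and nothing leaves
            break
    return selected


# Synthetic, hypothetical data: the response depends on "level" and "p14" only.
random.seed(1)
n = 200
data = {name: [random.gauss(0, 1) for _ in range(n)]
        for name in ("level", "p14", "rain")}
y = [2.0 * a + 1.0 * b + random.gauss(0, 0.5)
     for a, b in zip(data["level"], data["p14"])]
selected = stepwise(data, y)
print(selected)
```

Keeping the inclusion threshold below the exclusion threshold, as in step (1), means a variable that has just entered cannot be removed in the same pass, which is what prevents the procedure from cycling.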

2.2. Granger Causality Test (GCT)

The core principle of the GCT is as follows: if the past values of variable X significantly help predict the current and future values of variable Y beyond what the past values of Y alone provide, then X is considered a Granger cause of Y. In other words, it tests whether one time series helps to predict another. If A is a Granger cause of B, changes in A precede and carry predictive information about changes in B, which is interpreted as A being one of the causes of changes in B [24]. The calculation steps for the Granger causality test are as follows.
(1) Stationarity Test. Before conducting the GCT, it is necessary to first determine whether the time series is stationary, as non-stationary series can affect the reliability of the test results. Common methods for testing stationarity include the augmented Dickey-Fuller (ADF) test, the Phillips-Perron (PP) test, and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test. If the data are non-stationary, apply differencing.
(2) Information Criterion Method. After conducting stationarity tests on the data, the optimal lag order between variables is determined using information criteria. This lag order directly serves as the initial input step size for influencing factors in subsequent models. Commonly used information criteria include AIC, BIC, and HQ. This paper employed the AIC criterion to determine the model order, calculated as follows:
$$\mathrm{AIC}(K) = -2\ln L + 2K$$
Here, $K$ is the number of independent parameters in the model, and $L$ is the maximized value of the model’s likelihood function.
(3) Granger Causality Test. After determining the stationarity and the lag order, we proceed to the Granger causality test itself. Essentially, this involves conducting an F-test on the residual sums of squares from the restricted and unrestricted regression equations. The formulas for the Granger causality test are as follows [25].
Unrestricted regression, which involves using past values of X and Y to predict Y and calculating the sum of squared residuals for the model:
$$Y_t = \alpha + \sum_{i=1}^{p} \beta_i Y_{t-i} + \sum_{i=1}^{p} \gamma_i X_{t-i} + \epsilon_t$$
$$X_t = \alpha' + \sum_{i=1}^{p} \beta'_i Y_{t-i} + \sum_{i=1}^{p} \gamma'_i X_{t-i} + \epsilon'_t$$
In the equation, $Y_t$ is the dependent variable; $X_t$ represents the potential causal variable; $p$ denotes the lag order; $\beta_i$ and $\gamma_i$ are the coefficients of influence on $Y_t$; $\beta'_i$ and $\gamma'_i$ are the coefficients of influence on $X_t$; $\alpha$ and $\alpha'$ denote the constant terms; $\epsilon_t$ and $\epsilon'_t$ denote the error terms.
Restricted regression involves predicting Y using past values of Y and calculating the sum of squared residuals for the model:
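$$Y_t = \gamma_0 + \sum_{i=1}^{p} \gamma_i Y_{t-i} + u_t$$
In the equation, $\gamma_0$ denotes the constant term and $u_t$ denotes the error term; the coefficients $\gamma_i$ here belong to the restricted model and are distinct from those of the unrestricted regression.
$$F = \frac{(RSS_r - RSS_{ur})/p}{RSS_{ur}/(n - 2p - 1)}$$
Here, $RSS_r$ and $RSS_{ur}$ are the residual sums of squares of the restricted and unrestricted regressions, respectively, $n$ represents the sample size, and $p$ represents the lag order. The null hypothesis is that X is not a Granger cause of Y. Once the F-value and its corresponding p-value are calculated, if the p-value is below 0.05, the null hypothesis is rejected, indicating that past values of X help predict Y. If the p-value is 0.05 or greater, the null hypothesis cannot be rejected, meaning that no evidence was found that X is a Granger cause of Y.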
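The restricted and unrestricted regressions and the resulting F-test can be sketched as follows. This is an illustrative, standard-library-only sketch under stated assumptions: the function names (`ols_rss`, `f_sf`, `granger_f_test`) and the synthetic series are hypothetical, `n` is taken as the number of usable observations after dropping the first `p` samples, and the upper tail of the F-distribution is computed from the regularized incomplete beta function rather than from a statistics package.

```python
import math
import random


def ols_rss(rows, y):
    """Residual sum of squares of y regressed on rows (intercept added)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for col in range(k):  # Gaussian elimination with partial pivoting
        piv = max(range(col, k), key=lambda q: abs(A[q][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for q in range(col + 1, k):
            factor = A[q][col] / A[col][col]
            for j in range(col, k):
                A[q][j] -= factor * A[col][j]
            c[q] -= factor * c[col]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(X, y))


def _betacf(a, b, x):
    """Continued fraction of the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c, d = 1.0, 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 300):
        for num in (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                    -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-14:
            break
    return h


def f_sf(f, d1, d2):
    """Upper-tail probability P(F > f) for an F(d1, d2) distribution."""
    if f <= 0:
        return 1.0
    x = d2 / (d2 + d1 * f)
    a, b = d2 / 2.0, d1 / 2.0
    bt = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                  + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return bt * _betacf(a, b, x) / a
    return 1.0 - bt * _betacf(b, a, 1.0 - x) / b


def granger_f_test(x, y, p):
    """F-test of H0: lags 1..p of x do not help predict y given lags of y."""
    rows_r, rows_ur, target = [], [], []
    for t in range(p, len(y)):
        y_lags = [y[t - i] for i in range(1, p + 1)]
        x_lags = [x[t - i] for i in range(1, p + 1)]
        rows_r.append(y_lags)            # restricted: own lags only
        rows_ur.append(y_lags + x_lags)  # unrestricted: own lags plus lags of x
        target.append(y[t])
    rss_r, rss_ur = ols_rss(rows_r, target), ols_rss(rows_ur, target)
    dof = len(target) - 2 * p - 1
    f_stat = ((rss_r - rss_ur) / p) / (rss_ur / dof)
    return f_stat, f_sf(f_stat, p, dof)


# Synthetic example: x drives y with a one-step lag, but not the reverse.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(300)]
y = [0.0]
for t in range(1, 300):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 0.3))
f_xy, p_xy = granger_f_test(x, y, 2)
f_yx, p_yx = granger_f_test(y, x, 2)
print(f_xy, p_xy, f_yx, p_yx)
```

The test is run in both directions because Granger causality is directional: a small p-value for x to y says nothing about y to x.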

3. Project Overview

The reservoir studied is located in the Qiantang River Basin of Zhejiang Province, China. It is a large-scale (Category 2) water conservancy project primarily designed for irrigation and flood control while also serving water supply and power generation. With a total storage capacity of 274 million cubic meters and a total installed capacity of 9950 kW, the reservoir directly protects 126,000 mu of farmland, 550,000 residents, villages, and factories along its downstream banks.
The key components of the project include the river-blocking dam, spillway, drainage tunnel, water conveyance tunnel, and power station. The dam is a clay-core rockfill dam with a sand-shell lining. The crest elevation is 174.50 m (based on the 1985 National Elevation Benchmark, same below), with a maximum dam height of 57.5 m. A 1.2 m wave-breaking wall is installed on the dam crest, with a crest elevation of 175.70 m. The crest width is 7.0 m, and the crest length is 295 m. The dam is designed to withstand a 100-year flood event with a design flood level of 167.42 m. It is also verified against a 10,000-year flood event with a verification flood level of 173.32 m.
Since its operation, the reservoir has delivered significant benefits in flood control, irrigation, and water supply. However, in recent years, the groundwater level on both banks has gradually risen, and seepage around the dam’s left shoulder has been observed. After reservoir impoundment, water flows through seepage along the slopes on both sides of the earthen dam and infiltrates downstream. This phenomenon is termed dam-bypass seepage. Such seepage can occur along the interface between the dam body and the slopes, or it can flow downstream through the interior of the mountain at the dam toe, causing the wetting line to rise within the toe zone. Mild seepage manifests as dampness or minor clear water seepage on the backside of the slopes. Severe cases can soften slope soils, form concentrated seepage pathways, and even trigger slope collapses or landslides, jeopardizing dam safety. Therefore, to ensure the long-term, safe operation of Hengjin Reservoir, it is necessary to employ multiple methods tailored to the project’s actual conditions to analyze seepage “hotspots”.
The dam features two strain monitoring sections at station numbers Dam 0 + 131.00 m and Dam 0 + 194.00 m. Each section incorporates three piezometers on each of the upstream and downstream sides of the cut-off wall, one piezometer embedded at the secondary access ramp downstream, and one at the gravel foundation of the drainage prism at the dam toe. These are designated as P1-1 to P1-8 and P2-1 to P2-8, totaling 16 piezometers. For example, of the 8 piezometers installed at Observation Section 2, P2-1 to P2-3 and P2-4 to P2-6 are located 1.0 m upstream and downstream of the impermeable wall, respectively. To monitor seepage around the dam, six pressure monitoring pipes are installed on each bank at the dam crest (design designations L1 to L6 and R1 to R6). Automatic water level gauges are installed in the reservoir area for continuous monitoring of reservoir water levels. Water level gauges are installed at the dam toe for continuous monitoring of downstream water levels. Thermometers are installed in the dam area for continuous monitoring of air temperatures.

4. Analysis of Current Seepage Conditions

4.1. Regression Analysis of Seepage Pressure in Dams

Preliminary investigations reveal that regarding dam seepage, the seepage pressure water level upstream of the impermeable wall is higher than that downstream, and both seepage pressure water levels correlate positively with reservoir water levels. Seepage pressure water levels within the downstream dam body or foundation are lower, indicating that the impermeable wall and core wall provide significant seepage control. The lag in pressure measurement tubes within the core wall is significantly greater than that of the piezometers, indicating a lower permeability coefficient for the core wall.
Seepage has been observed at the left abutment, with elevated groundwater levels in the mountainous terrain on both banks. At the same time, water levels at monitoring points P1-4, LB1, and L1 have all risen abnormally, and their correlation with the reservoir water level is becoming increasingly strong, indicating that the seepage-control performance of the impermeable barrier in this area is gradually deteriorating. The process diagrams for changes in the left bank piezometer monitoring points are shown in Figure 1 and Figure 2. As can be seen from the figures:
(1) Left Bank Pressure Tube L1: When the reservoir water level operates above 155.0 m, the higher the reservoir water level, the greater the difference between L1 and the reservoir water level. When reservoir water level drops below 155.0 m, the lower the reservoir water level, the higher the water level in L1 becomes relative to the reservoir water level. This indicates that the groundwater table within the mountain slopes on both sides of the project is relatively high. As a result, seepage is not only driven by the reservoir water level but is also subject to natural environmental influences. Therefore, when the reservoir water level is below 155.0 m, L1 is primarily controlled by the groundwater table within the mountain slopes.
(2) When reservoir water levels are high, the water level in the L1 pressure tube exhibits a certain correlation with the reservoir water levels, with minimal head reduction. It has been confirmed that L1 is unrelated to water levels within the power tunnel, indicating seepage around the left dam toe. The L1 water level shows an upward trend, and its correlation with reservoir water levels has strengthened in recent years, suggesting a weakening of the impermeable barrier’s seepage control effectiveness.
(3) The difference between the L2 water level and the reservoir water level generally stabilizes around 6 m. However, during periods of heavy rainfall, the groundwater level significantly impacts this difference, with L2 primarily influenced by rainfall.
(4) L3–L4 exhibit a certain correlation with reservoir water level, but both experience significant head reduction. L5–L6 show little correlation with reservoir water level and exhibit substantial head reduction.
Preliminary analysis revealed abnormal water levels at two monitoring points, L1 and LB1, indicating that the impermeability of the cutoff wall in this area is gradually deteriorating. However, the specific factors causing the abnormal water levels at these two points remain unclear. Therefore, this study narrowed down the range of potential influencing factors through stepwise regression analysis and Granger causality tests.
As shown in Table 1, based on the preliminary survey results, this study selected five sets of characteristic data (reservoir water level, rainfall, seepage pressure P2-7, seepage pressure P2-8, and seepage pressure P1-4) for correlation analysis, in order to quantify the extent to which these factors influence the seepage pressures LB1 and L1 around the dam. The number of records differs among the feature series in Table 1 because, during long-term monitoring, zero values or other outliers can occur due to transient errors. To address this issue, outliers were identified and removed during the preprocessing stage, and the resulting gaps were filled by interpolation.
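The preprocessing step can be illustrated with a short sketch. The text does not state which outlier rule or interpolation method was used, so both are assumptions here: readings at or below zero are treated as transient errors, and gaps are filled by linear interpolation between the nearest valid neighbours. The function name `clean_series` and the sample readings are hypothetical.

```python
def clean_series(values, is_bad):
    """Replace flagged readings by linear interpolation between valid neighbours.

    Flagged readings at either end of the series take the nearest valid value.
    """
    good = [i for i, v in enumerate(values) if not is_bad(v)]
    if not good:
        raise ValueError("series contains no valid readings")
    cleaned = list(values)
    for i, v in enumerate(values):
        if not is_bad(v):
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            cleaned[i] = values[right]
        elif right is None:
            cleaned[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            cleaned[i] = values[left] + weight * (values[right] - values[left])
    return cleaned


# Hypothetical water-level readings (m) with zero values from transient errors.
readings = [150.2, 150.4, 0.0, 150.8, 151.0, 0.0, 0.0, 151.6]
cleaned = clean_series(readings, lambda v: v <= 0.0)
print(cleaned)
```

Passing the outlier rule as a function keeps the sketch usable with other criteria, such as a threshold on the jump between consecutive readings.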
Based on the results of the stepwise regression analysis in Table 2, the overall goodness-of-fit for the LB-1 model of seepage around the dam was 0.686. Through a process of repeatedly introducing new factors and removing old ones, the two independent variables—seepage pressure P1-4 and seepage pressure P2-7—were ultimately identified as having a high overall goodness-of-fit and passing the significance test. The unstandardized coefficient for seepage pressure P1-4 was 1.061, and that for seepage pressure P2-7 was 0.802. This indicates that the seepage pressures at both P1-4 and P2-7 have a positive influence on the seepage pressure at LB-1, with the effect of P1-4 being more pronounced. The model is robust overall and demonstrates good explanatory power.
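Written out, the fitted LB-1 model has the following form. The symbols $H_{LB\text{-}1}$, $H_{P1\text{-}4}$, and $H_{P2\text{-}7}$ are notation introduced here for the three seepage pressure readings, and the intercept is left as $\beta_0$ because its value is not quoted in the text.

```latex
\hat{H}_{LB\text{-}1} = \beta_0 + 1.061\, H_{P1\text{-}4} + 0.802\, H_{P2\text{-}7}
```

Read this way, a one-unit rise in the P1-4 reading with P2-7 held fixed corresponds to a fitted rise of 1.061 units at LB-1, against 0.802 units for the same rise at P2-7.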
The overall goodness-of-fit for the L-1 model of seepage around the dam was 0.954. The final model retained four variables: seepage pressure P1-4, reservoir water level, seepage pressure P2-7, and rainfall, while discarding seepage pressure P2-8. Among these, seepage pressure P1-4 was introduced into the model first, likely because it exhibited the strongest correlation with the L-1 seepage pressure around the dam or accounted for the largest share of the independently explained variance. The inclusion of reservoir water level and seepage pressure P2-7 further enhanced the model’s explanatory power, illustrating the significant impact of water level fluctuations and seepage responses at specific monitoring points on the L-1 seepage pressure around the dam. Rainfall, as an external environmental factor, also influences the seepage pressure around the dam at L-1, highlighting the necessity of incorporating hydrometeorological conditions into the model. Overall, the final model, through variable screening, highlights the roles of seepage pressure P1-4, reservoir water level, seepage pressure P2-7, and rainfall, providing a streamlined and effective empirical foundation for subsequent inversion and predictive analysis.

4.2. Determination of GCT Key Influence Factors

This study further employed Granger causality tests to analyze the causes of seepage around the dam. Analysis of engineering monitoring data indicates seepage around the dam’s left shoulder, with increased seepage pressure at both the LB-1 and L-1 locations. The P2-7 and P2-8 correlation coefficients were relatively low, while P1-4 initially showed minimal influence from reservoir water level but exhibited an overall trend similar to water level changes.
To identify statistical relationships among the data, thereby identifying key influencing factors and improving the accuracy of dam safety monitoring and analysis, as well as the accuracy of dam seepage prediction, this paper conducted a correlation analysis of the six types of feature data to quantify the extent to which each factor influences the L-1 and LB-1 seepage pressures around the dam, ensuring that the most representative variables were selected.
Before conducting the GCT, to avoid spurious regression caused by non-stationarity in time series variables, an ADF test was first applied to examine the stationarity of all variables. The significance level for the stationarity test was set at 0.05. Figure 3 illustrates the ADF test using the reservoir water level as an example. The original time series of reservoir water levels was non-stationary, with p = 0.4949. After applying a first-order difference, the resulting series became stationary, with p = 0.0010. The selection of the lag order also directly impacts the reliability of the GCT.
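The effect of first-order differencing on stationarity can be sketched with a simplified Dickey-Fuller statistic. This is not the ADF procedure used in the study: the sketch omits the lagged-difference (augmentation) terms, works on a synthetic random walk rather than the monitored water levels, and compares the statistic with the asymptotic 5% critical value of about -2.86 for a regression with a constant and no trend instead of computing a p-value. The function names are hypothetical.

```python
import math
import random


def difference(series):
    """First-order difference: d_t = y_t - y_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


def df_t_stat(series):
    """t-statistic of rho in  dy_t = a + rho * y_{t-1} + e_t  (no lagged differences)."""
    lagged = series[:-1]
    dy = difference(series)
    n = len(dy)
    mean_x = sum(lagged) / n
    mean_y = sum(dy) / n
    sxx = sum((v - mean_x) ** 2 for v in lagged)
    sxy = sum((v - mean_x) * (d - mean_y) for v, d in zip(lagged, dy))
    rho = sxy / sxx
    intercept = mean_y - rho * mean_x
    rss = sum((d - intercept - rho * v) ** 2 for v, d in zip(lagged, dy))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return rho / std_err


DF_CRIT_5PCT = -2.86  # asymptotic 5% critical value, constant and no trend

# Synthetic random walk standing in for a non-stationary level series.
random.seed(0)
level = [160.0]
for _ in range(499):
    level.append(level[-1] + random.gauss(0, 0.1))

t_level = df_t_stat(level)
t_diff = df_t_stat(difference(level))
print("level:", t_level, "stationary:", t_level < DF_CRIT_5PCT)
print("diff: ", t_diff, "stationary:", t_diff < DF_CRIT_5PCT)
```

A statistic more negative than the critical value rejects the unit-root hypothesis, which is the situation reported above for the differenced reservoir water level.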
To determine whether there is a long-term stable equilibrium relationship among these feature variables with long-term trends, the Johansen test was employed to conduct cointegration analysis. First, the lag order of the basic autoregressive model was determined based on information criteria, followed by the Johansen test. This study employed information criteria such as logL, FPE, AIC, SC, and HQ for order determination, with logL incorporated into the calculations of FPE, AIC, SC, and HQ. The final evaluation was based on the FPE, AIC, SC, and HQ metrics.
As shown in Table 3, the AIC, SC, and HQ values for lag order 0 were all positive and relatively high, while the FPE value was significantly elevated and the absolute value of logL was the highest, indicating that the fit at this lag order is poor. Starting from lag order 1, all information criteria values turned negative, suggesting that the introduction of lag terms significantly improves the model fit. Overall, lag order 10 performed best under both the AIC and FPE criteria, and the values of most criteria for other lag orders were close to the optimal levels; however, the SC criterion tended to favor the simpler lag order 1. Considering that the lag order is typically selected based on the majority of criteria or key criteria (such as AIC), lag order 10 was chosen as the final lag order for the cointegration test.
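The disagreement between the criteria follows from their penalty terms. In the same form as the AIC formula given in the Method section, the textbook definitions of SC (also called BIC) and HQ are shown below. These are assumed to match what the analysis software computes, although some packages divide every term by the sample size $n$, which does not change the ranking of lag orders.

```latex
\begin{aligned}
\mathrm{AIC}(K) &= -2\ln L + 2K \\
\mathrm{SC}(K)  &= -2\ln L + K \ln n \\
\mathrm{HQ}(K)  &= -2\ln L + 2K \ln(\ln n)
\end{aligned}
```

Because $\ln n > 2$ for any sample of eight or more observations, SC charges more per added parameter than AIC, which is why it tends to select a shorter lag order such as 1, while AIC accepts the larger lag order 10.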
After determining the lag order, this study conducted a cointegration test using the Johansen test for the multivariate data. Table 4 presents the test results, which indicate the presence of three significant cointegration relationships within the system of variables analyzed, implying a long-run equilibrium relationship among these variables. Although the original time series of each variable exhibited non-stationary characteristics, the existence of three long-run stable linear equilibrium constraints within the system effectively mitigated the risk of spurious regression in subsequent tests. Subsequently, Granger causality tests were conducted on the feature data.
As shown in Table 5, the Granger causality test for rainfall and reservoir water level gave a p-value of 0.047, which is statistically significant at the 5% level. We therefore rejected the null hypothesis, indicating that rainfall Granger-causes changes in reservoir water level. In the reverse direction, the p-value for reservoir water level affecting rainfall was 0.719, which is not statistically significant, so the null hypothesis cannot be rejected and there is no evidence that reservoir water level causes changes in rainfall. This physically reasonable asymmetry supports the validity of the Granger causality test results.
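To make the test concrete, the following is a minimal sketch of a bivariate Granger F statistic computed with ordinary least squares in NumPy. It is not the software used in this study: the function names and the synthetic series `rain` and `level` are assumptions for illustration only, and the lag of 1 was chosen for brevity.

```python
import numpy as np


def rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)


def granger_f(cause, effect, lag=1):
    """F statistic for H0: lags of `cause` add nothing to predicting `effect`."""
    n = len(effect)
    target = effect[lag:]
    ones = np.ones(n - lag)
    own = [effect[lag - k:n - k] for k in range(1, lag + 1)]
    other = [cause[lag - k:n - k] for k in range(1, lag + 1)]
    restricted = np.column_stack([ones, *own])            # own lags only
    unrestricted = np.column_stack([ones, *own, *other])  # plus lags of cause
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    dof = len(target) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


# Synthetic stand-ins (NOT the monitoring data): level follows rain by one step.
rng = np.random.default_rng(0)
n = 500
rain = rng.normal(size=n)
level = np.zeros(n)
level[1:] = 0.8 * rain[:-1] + 0.5 * rng.normal(size=n - 1)

f_rain_to_level = granger_f(rain, level)
f_level_to_rain = granger_f(level, rain)
print(f_rain_to_level, f_level_to_rain)
```

In this synthetic setting the first statistic is large and the second is small, mirroring the asymmetry reported for rainfall and reservoir water level. A p-value would follow from the F distribution with `lag` and `dof` degrees of freedom, for example via `scipy.stats.f.sf` if SciPy is available.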
Regarding dam-bypass seepage pressure, the selected data indicate that reservoir water level and seepage pressure at P2-7 may cause changes in dam-bypass seepage pressure at LB-1, while seepage pressure at P1-4 may cause changes at L-1. This is consistent with the results of the stepwise regression analysis. Changes in reservoir water level create a head difference between the upstream and downstream sides, and the resulting hydraulic gradient directly drives water into the rock at the dam abutments on both banks, producing seepage around the dam. Both the Granger causality test and stepwise regression analysis identified the influence of reservoir water level on seepage pressure around the dam at LB-1. This indicates that, after permeating through the rock and soil for a period of time, the reservoir water does reach the LB-1 monitoring point; long-term seepage has formed a relatively unobstructed recharge pathway between LB-1 and the reservoir. The lag order in the GCT indicates that the permeability of this pathway allows the water pressure to vary within a certain range.
In this paper, “P” refers to the pressure tubes in the dam body/foundation, and “LB” refers to the pressure tubes on the left bank/around the dam. The analysis results indicate that the seepage pressure at P2-7 may also cause changes in the seepage pressure around the dam at LB-1. This is because long-term water flow within the dam body forms a fixed flow path, i.e., a seepage path. P2-7 is located upstream of the main seepage path leading to LB-1; that is, seepage first passes through P2-7 and then, under pressure, continues to propagate toward both banks, ultimately affecting LB-1. It can therefore be inferred that a hydraulic connection exists across the impermeable curtain between P2-7 and LB-1, allowing water to flow from P2-7 to LB-1. Seepage pressure at P1-4 may likewise cause changes in the seepage pressure around the dam at L-1, by a mechanism similar to that by which water flows from P2-7 to LB-1. During subsequent inspections and investigations of the dam, particular attention should therefore be paid to the geological conditions and the impermeable curtain between these four seepage pressure gauges, in order to provide early warning of potential problems with the project.
Table 6 compares the results of stepwise regression analysis with those of GCT. The following conclusions were drawn for the seepage pressures around the dam at LB-1 and L-1, respectively:
① LB-1: The stepwise regression analysis indicates that seepage pressures P1-4 and P2-7 influence seepage around the dam at LB-1. The GCT indicates that reservoir water level and seepage pressure P2-7 influence seepage around the dam at LB-1.
② L-1: The stepwise regression analysis indicates that seepage pressures P1-4 and P2-7, reservoir water level, and rainfall influence seepage around the dam at L-1. The GCT indicates that seepage pressure P1-4 influences seepage around the dam at L-1.
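The comparison in Table 6 amounts to intersecting two factor sets per monitoring point. A minimal sketch follows, with the sets transcribed from Table 6; the variable names are illustrative assumptions.

```python
# Factors retained by each method, transcribed from Table 6.
stepwise = {
    "LB-1": {"P1-4", "P2-7"},
    "L-1": {"P1-4", "P2-7", "Reservoir water level", "Rainfall"},
}
gct = {
    "LB-1": {"Reservoir water level", "P2-7"},
    "L-1": {"P1-4"},
}

# Factors supported by both methods, and those found by stepwise regression only.
confirmed = {point: stepwise[point] & gct[point] for point in stepwise}
stepwise_only = {point: stepwise[point] - gct[point] for point in stepwise}

for point in stepwise:
    print(point, sorted(confirmed[point]), sorted(stepwise_only[point]))
```

Factors in `confirmed` are candidates for key model inputs, while those in `stepwise_only` are correlated with the target but lack a detected time-lagged relationship.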
Stepwise regression analysis captures synchronous correlation between two series, whereas the GCT tests whether past values of A help predict later values of B, that is, whether B changes only after A has changed for a certain period of time. The two methods can therefore be used together to investigate the causes of seepage around the dam. For the seepage pressure at LB-1, the stepwise regression analysis suggests a relationship with both P1-4 and P2-7, but the GCT indicates a relationship only with P2-7. P1-4 did not pass the Granger causality test, indicating that changes in P1-4 do not help predict changes in LB-1; the two are correlated only in the regression analysis. We believe that this may be because both are driven by the same external factor. For example, an overall rise in reservoir water level could cause the two series to fluctuate in sync even though there is no direct hydraulic connection between them. By contrast, the water pressure changes at P2-7 occur before those at LB-1, because it takes some time for the pressure to be transmitted from P2-7 to LB-1, which is consistent with the time-lag characteristics captured by the GCT.
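The distinction between synchronous and time-lagged association can be illustrated with a simple lagged cross-correlation, a different and cruder diagnostic than the GCT used in this study. Everything in the sketch is an assumption for illustration: the series are synthetic white noise rather than monitoring data, and the names `gauge_a`, `gauge_b`, and the delay of 3 steps are hypothetical.

```python
import numpy as np


def lagged_corr(lead, follow, lag):
    """Pearson correlation between lead[t] and follow[t + lag]."""
    if lag == 0:
        return float(np.corrcoef(lead, follow)[0, 1])
    return float(np.corrcoef(lead[:-lag], follow[lag:])[0, 1])


def peak_lag(lead, follow, max_lag=10):
    """Lag in 0..max_lag with the largest absolute correlation."""
    return max(range(max_lag + 1),
               key=lambda k: abs(lagged_corr(lead, follow, k)))


rng = np.random.default_rng(1)
n = 1000

# Case 1: two series driven by the same external factor at the same time.
common = rng.normal(size=n)
gauge_a = common + 0.1 * rng.normal(size=n)
target_sync = common + 0.1 * rng.normal(size=n)

# Case 2: the target responds to gauge_b only after a transmission delay.
delay = 3
gauge_b = rng.normal(size=n)
target_lagged = np.zeros(n)
target_lagged[delay:] = gauge_b[:-delay] + 0.1 * rng.normal(size=n - delay)

sync_peak = peak_lag(gauge_a, target_sync)
lagged_peak = peak_lag(gauge_b, target_lagged)
print(sync_peak, lagged_peak)
```

In the first case the association is strongest at lag 0, so past values of `gauge_a` carry no extra predictive information; in the second case it peaks at the transmission delay. Real monitoring series are autocorrelated and non-stationary, so this sketch would not replace the cointegration and Granger tests described above.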
Based on the above analysis, both methods can reduce the number of candidate factors for seepage around the dam. Stepwise regression analysis identifies all potentially influential factors, while GCT further identifies those with a time-lagged causal relationship. Used together, the two methods help screen for the key factors influencing seepage pressure around the dam, eliminate spurious correlations and redundant variables on a statistical basis, and reduce the number of inputs to the predictive model at the source.

5. Conclusions

Dam seepage safety monitoring data often exhibit non-stationarity and time lags. In practical engineering applications, subjectivity in the choice of input factors for data-driven models, together with the risk of spurious correlations, introduces uncertainty into model predictions and can cause their reliability to fluctuate significantly. Based on an empirical study of typical engineering time-series data, this paper demonstrates that combining stepwise regression analysis with Granger causality tests can identify key influencing factors of seepage pressure around dams, eliminate spurious correlations and redundant variables on a statistical basis, and reduce the number of inputs in the prediction model. The main scientific conclusions are as follows:
  • This study employs a hybrid approach combining multivariate Johansen cointegration analysis, Granger causality testing (GCT), and a two-way stepwise regression algorithm. This framework overcomes the issue of spurious correlations that often arise in variable selection when using traditional black-box machine learning models or simple statistical correlation analysis. The results reveal the spatial heterogeneity and local dynamic evolution characteristics of the seepage flow field around the dam. A distinct seepage flow path exists in the left abutment area of the dam, and the groundwater levels on both sides of the mountain have long remained in a state of high potential energy. The time-series evolution at the key monitoring points LB-1 and L-1 exhibits non-stationary trend-like steps, quantitatively confirming the dynamic restructuring process of the hydraulic gradient and permeability characteristics in this local area, thereby providing a basis for seepage early warning at the dam.
  • Combining Granger causality testing with stepwise regression analysis can effectively narrow down the range of characteristic parameters. The analysis indicates that the seepage behavior at LB-1 is primarily controlled by local hydraulic conduction within the dam, with P2-7 and P1-4 serving as the key driving factors; furthermore, the GCT confirms that changes at P2-7 precede and help predict changes at LB-1. In contrast, seepage at L-1 is influenced by both internal hydraulic conduction and external environmental factors: it depends on internal hydraulic conduction via P1-4, while reservoir water level and rainfall also affect it.
  • The results of the Granger causality test and stepwise regression analysis corroborate one another. Johansen’s multivariate cointegration analysis reveals the intrinsic long-term dynamic equilibrium of the dam seepage field, stepwise regression comprehensively screens for all potential factors associated with synchronous fluctuations, and GCT uses time-lag characteristics to further narrow down the range of influencing factors. The methods thus complement one another. Together they help identify the likely physical drivers of seepage around the dam and streamline the input variables of the predictive model at the source, reducing the spurious correlations to which purely data-driven machine learning models are prone. This process provides key influencing factors for the identification and reliable prediction of abnormal dam seepage in the future.

Author Contributions

L.L.: Conceptualization; methodology; data curation; writing—original draft preparation; writing—review and editing; funding acquisition. Y.J.: Conceptualization; validation; writing—review and editing. S.Z.: validation; visualization, supervision. F.C.: visualization, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. There are no copyright issues in this manuscript.

References

  1. Cen, W.; Zheng, X.; Deng, C.; Cao, Y. Seepage safety of a rockfill-based earth dam heightening project. J. Chang. River Sci. Inst. 2025, 42, 147–153. Available online: https://link.cnki.net/urlid/42.1171.TV.20250507.1136.004 (accessed on 7 May 2025).
  2. Li, J.; Chen, X.; Gu, C.; Huo, Z. Seepage Comprehensive Evaluation of Concrete Dam Based on Grey Cluster Analysis. Water 2019, 11, 1499. [Google Scholar] [CrossRef]
  3. Al-Janabi, A.M.S.; Ghazali, A.H.; Ghazaw, Y.M.; Afan, H.A.; Al-Ansari, N.; Yaseen, Z.M. Experimental and Numerical Analysis for Earth-Fill Dam Seepage. Sustainability 2020, 12, 2490. [Google Scholar] [CrossRef]
  4. Liang, M.-C.; Chen, H.-E.; Tfwala, S.S.; Lin, Y.-F.; Chen, S.-C. The Application of Wireless Underground Sensor Networks to Monitor Seepage inside an Earth Dam. Sensors 2023, 23, 3795. [Google Scholar] [CrossRef]
  5. Cheng, X.; Li, Q.; Zhou, Z.; Luo, Z.; Liu, M.; Liu, L. Research on a Seepage Monitoring Model of a High Core Rockfill Dam Based on Machine Learning. Sensors 2018, 18, 2749. [Google Scholar] [CrossRef] [PubMed]
  6. Li, D.-Q.; Kang, Q.; Yan, K.; He, J.-P.; Liu, Y. A dynamic multi-objective inversion framework for seepage parameters based on monitoring data: Case study of an earth-rockfill dam. J. Hydrol. 2026, 669, 135064. [Google Scholar] [CrossRef]
  7. Tong, G.; Zhang, H.; He, J.; Huang, S.; Zhang, Y.; Dong, Z. Anti-sliding stability analysis of the three gorges dam through seepage stress coupling inversion analysis. Water Resour. Power 2025, 43, 132–136. [Google Scholar] [CrossRef]
  8. Li, Y.; Sun, X.; Li, G. Sensitivity analysis of 3D seepage in rockfill dam considering concrete panel crack. Yellow River 2024, 46, 155–160. Available online: https://kns.cnki.net/kcms2/article/abstract?v=iMwhGHIyCLY2rtvIkC88kIJhE44VTlsDC4S6wKjheXDa8JkqN_Fx608x7HIyE2mp1Tk-zjXLvs9WM0V0U-tTzqrp1H55ym5wfoSg1x5ikgNVd-CigVquF_NbQjwCyZZLYUXo-0PvPPhXaTJ5GgEnhwCz372e8xgeaWEl9Fe9efuC4LdgSSaa4w==&uniplatform=NZKPT&language=CHS (accessed on 10 August 2024).
  9. Zhou, R.; Hu, C.; Li, D. Analysis of Seepage Stability in Reinforced Clay Core Wall Dams Using a Hybrid Model Approach. Yellow River 2024, 46, 102–103. Available online: https://kns.cnki.net/kcms2/article/abstract?v=iMwhGHIyCLYCFFJoXtepEKo_Vd8Xet6PzXpsaNtyHlzYTUF_zymBCKuuZhi1Yq-6z1G8_HBJA430UOeF1ZqWWrSXqecg1dZ3cSKmfkkJAIt6HU4IBlbYBTG6mKMGATXuix467f350c0ecpV-bSwPMNGJ0elKOJQ0prU9ZdDtdKqjK-0SRZRHCQ==&uniplatform=NZKPT&language=CHS (accessed on 28 June 2024).
  10. Liu, C.; Lyu, L.; Zhang, Y. Inverse Analysis of Seepage Characteristics of Dalongdong Reservoir Dam. Pearl River 2025, 46, 19–21. Available online: https://kns.cnki.net/kcms2/article/abstract?v=iMwhGHIyCLbJgM7KJMs7IVKAff6Qf0nvbfHAyPtSixTyiCOb5y28XoodqyApyE9nq0hNyvK45PtE45iW7x4dOOOlb11JrUhRZCFnh83Tris9hZJ9SjZRFTIjY1V9P0it7XgEr-yoqKeAbThoLCu1tOfHGXUc_T9wLywSeDXF9BCN79j9d8z2Pw==&uniplatform=NZKPT&language=CHS (accessed on 30 June 2024).
  11. Li, F.; Wang, Z.Z.; Liu, G. Towards an Error Correction Model for dam monitoring data analysis based on Cointegration Theory. Struct. Saf. 2013, 43, 12–20. [Google Scholar] [CrossRef]
  12. Wang, S.W.; Bao, T.F. Monitoring Model for Dam Seepage Based on Lag Effect. Appl. Mech. Mater. 2013, 353–356, 2456–2462. [Google Scholar] [CrossRef]
  13. Su, H.; Hu, J.; Yang, M. Dam Seepage Monitoring Based on Distributed Optical Fiber Temperature System. IEEE Sens. J. 2014, 15, 9–13. [Google Scholar] [CrossRef]
  14. Chen, X.; Xu, Y.; Guo, H.; Hu, S.; Gu, C.; Hu, J.; Qin, X.; Guo, J. Comprehensive evaluation of dam seepage safety combining deep learning with Dempster-Shafer evidence theory. Measurement 2024, 226, 114172. [Google Scholar] [CrossRef]
  15. Arslan, C.A.; Al-Jalabi, F.A. Artificial intelligence models for seepage analysis through embankment dam-case study: Khasa Chi Dam. Earth Sci. Inform. 2025, 18, 550. [Google Scholar] [CrossRef]
  16. Danish, A. Understanding the Effect of Hydro-Climatological Parameters on Dam Seepage Using Shapley Additive Explanation (SHAP): A Case Study of Earth-Fill Tarbela Dam, Pakistan. Water 2022, 14, 2598. [Google Scholar] [CrossRef]
  17. Li, D.; Kang, Q.; Wang, R.; He, J.; Liu, Y. Application of artificial intelligence models in the seepage flow prediction of dam: A case study of Shenzhen reservoir. Georisk Assess. Manag. Risk Eng. Syst. Geohazards 2025, 19, 944–965. [Google Scholar] [CrossRef]
  18. Zheng, C.; Cen, W.; Liu, B.; Qian, J.; Ding, Y.; Mo, C. Hybrid optimization and AI-driven surrogate model for seepage parameters inversion in complex dam foundations. J. Hydrol. 2026, 664, 134484. [Google Scholar] [CrossRef]
  19. Yin, Q.; Li, Y.; Li, W.; Wen, L.; Zhang, Y.; Wang, T.; Yang, T.; Zhou, T. Intelligent inversion analysis of seepage parameters for deep overburden dam foundations based on an improved grey wolf optimization algorithm. Comput. Geotech. 2010, 188, 19. [Google Scholar] [CrossRef]
  20. Zhou, Y.; Bao, T.; Shu, X.; Li, Y.; Li, Y. BIM and ontology-based knowledge management for dam safety monitoring. Autom. Constr. 2023, 145, 104649. [Google Scholar] [CrossRef]
  21. Tian, D.; Liu, H.; Chen, S.; Li, M.; Liu, C. Human Error Analysis for Hydraulic Engineering: Comprehensive System to Reveal Accident Evolution Process with Text Knowledge. J. Constr. Eng. Manag. 2022, 148, 13. [Google Scholar] [CrossRef]
  22. Xu, B.; Rong, Z.; Pang, R.; Tan, W.; Wei, B. A novel method for settlement imputation and monitoring of earth-rockfill dams subjected to large-scale missing data. Adv. Eng. Inform. 2024, 62, 102642. [Google Scholar] [CrossRef]
  23. Li, D.; Chen, G.; He, N.; Xu, X. Advances in data processing and evaluation techniques for safety monitoring of earth-rock dams. Hydro-Sci. Eng. 2025, 5, 88–100. [Google Scholar] [CrossRef]
  24. Shojaie, A.; Fox, E.B. Granger Causality: A Review and Recent Advances. Annu. Rev. Stat. Its Appl. 2021, 9, 289–319. [Google Scholar] [CrossRef] [PubMed]
  25. Zhang, M.; Wang, W.; Yang, W.; Zhang, T.; Li, Z.; Jin, L.; Jiang, Z. Weir Flow Prediction of Panel Rockfill Dam Based on Causality Test and Elman Neural Network. Water Power 2026, 52, 108–115. Available online: https://link.cnki.net/urlid/11.1845.TV.20251212.1117.002 (accessed on 9 February 2026).
Figure 1. Measured water level process lines (L1, L2, L3) from the left bank bypass seepage pressure gauge.
Figure 2. Measured water level process lines for the left bank bypass pressure pipes (L4, L5, L6).
Figure 3. Results of ADF test analysis for reservoir water level. (a) Original reservoir water level. (b) Distribution of original data. (c) Autocorrelation of original data. (d) Reservoir water level after first-order differencing. (e) Reservoir water level distribution after first-order differencing. (f) First-order autocorrelation coefficients.
Table 1. Time range and data volume for characteristic data.

| Feature Data | Time Range | Data Volume |
| --- | --- | --- |
| Reservoir water level | 27 July 2020–31 December 2022 | 890 |
| Rainfall | 28 July 2020–31 December 2022 | 560 |
| Seepage pressure P2-7 | 27 July 2020–31 December 2022 | 791 |
| Seepage pressure P2-8 | 27 July 2020–31 December 2022 | 158 |
| Seepage pressure P1-4 | 27 July 2020–31 December 2022 | 812 |
| Seepage pressure around the dam LB-1 | 27 July 2020–31 December 2022 | 1066 |
| Seepage pressure around the dam L-1 | 27 July 2020–31 December 2022 | 817 |

Note: P stands for the pressure tube in the dam body/foundation. L and LB stand for the pressure tubes on the left bank and around the dam, respectively.
Table 2. Results of regression analysis for seepage pressure water levels.

| Introducing Factors | Dependent Variable | R² | F | P | Regression Model |
| --- | --- | --- | --- | --- | --- |
| Seepage pressure P1-4 | Seepage pressure around the dam LB-1 | 0.686 | 169.541 | 0.000 *** | $y = 94.573 + 1.061 X_{\text{P1-4}} + 0.802 X_{\text{P2-7}}$ |
| Seepage pressure P2-7 | Seepage pressure around the dam LB-1 | 0.686 | 169.541 | 0.000 *** | |
| Seepage pressure P1-4 | Seepage pressure around the dam L-1 | 0.954 | 795.757 | 0.000 *** | $y = 45.293 + 1.141 X_{\text{P1-4}} + 0.188 X_{\text{P2-7}} + 0.082 X_{\text{Reservoir water level}} + 0.002 X_{\text{Rainfall}}$ |
| Seepage pressure P2-7 | Seepage pressure around the dam L-1 | 0.954 | 795.757 | 0.000 *** | |
| Reservoir water level | Seepage pressure around the dam L-1 | 0.954 | 795.757 | 0.000 *** | |
| Rainfall | Seepage pressure around the dam L-1 | 0.954 | 795.757 | 0.000 *** | |

Note: ***, **, and * denote significance levels of 1%, 5%, and 10%, respectively.
Table 3. Comparison of different lags.

| Order of Lag | AIC | SC | HQ | FPE | logL |
| --- | --- | --- | --- | --- | --- |
| 0 | 4.281 | 4.329 | 4.300 | 72.341 | −5779.300 |
| 1 | −16.208 | −15.875 * | −16.078 | 0.000 | −180.103 |
| 2 | −16.434 | −15.815 | −16.192 * | 0.000 | −82.378 |
| 3 | −16.425 | −15.519 | −16.070 | 0.000 | −48.699 |
| 4 | −16.363 | −15.169 | −15.896 | 0.000 | −29.041 |
| 5 | −16.566 | −15.083 | −15.986 | 0.000 | 61.846 |
| 6 | −16.549 | −14.777 | −15.856 | 0.000 | 93.650 |
| 7 | −16.480 | −14.418 | −15.674 | 0.000 | 111.407 |
| 8 | −16.386 | −14.033 | −15.465 | 0.000 | 122.475 |
| 9 | −16.442 | −13.797 | −15.407 | 0.000 | 173.784 |
| 10 | −16.641 * | −13.703 | −15.492 | 0.0 * | 263.155 |
| 11 | −16.606 | −13.374 | −15.341 | 0.000 | 289.872 |

Note: The order with the most * marks is considered the optimal lag order. If multiple orders have the same number of * marks, the smallest such order is chosen. Because the lag order is typically selected based on the majority of criteria or on key criteria (such as AIC), the order in this study was ultimately selected based on the AIC criterion.
Table 4. Results of the cointegration test.

| Null Hypothesis | Eigenvalue | Trace (Max Root) | 10% Threshold | 5% Threshold | 1% Threshold |
| --- | --- | --- | --- | --- | --- |
| No cointegration | 0.226 | 250.719 | 91.109 | 95.754 | 104.964 |
| Up to 1 cointegration | 0.085 | 112.366 | 65.820 | 69.819 | 77.820 |
| Up to 2 cointegration | 0.068 | 64.375 | 44.493 | 47.855 | 54.681 |
| Up to 3 cointegration | 0.030 | 26.414 | 27.067 | 29.796 | 35.463 |
| Up to 4 cointegration | 0.015 | 9.716 | 13.429 | 15.494 | 19.935 |
| Up to 5 cointegration | 0.002 | 1.282 | 2.705 | 3.841 | 6.635 |
Table 5. Granger causality test results.

| Paired Samples (first variable) | Paired Samples (second variable) | F | P |
| --- | --- | --- | --- |
| Rainfall | Reservoir water level | 4.059 | 0.047 ** |
| Reservoir water level | Rainfall | 0.13 | 0.719 |
| Reservoir water level | Seepage around the dam LB-1 | 2.968 | 0.088 * |
| Reservoir water level | Seepage around the dam L-1 | 1.586 | 0.211 |
| Rainfall | Seepage around the dam LB-1 | 0.924 | 0.339 |
| Rainfall | Seepage around the dam L-1 | 3.239 | 0.075 * |
| Seepage around the dam LB-1 | Reservoir water level | 20.465 | 0.000 *** |
| Seepage around the dam LB-1 | Rainfall | 0.105 | 0.746 |
| Seepage around the dam LB-1 | Seepage pressure P2-7 | 5.08 | 0.026 ** |
| Seepage around the dam LB-1 | Seepage pressure P2-8 | 0.45 | 0.504 |
| Seepage around the dam LB-1 | Seepage around the dam L-1 | 1.458 | 0.230 |
| Seepage around the dam LB-1 | Seepage pressure P1-4 | 3.281 | 0.073 * |
| Seepage around the dam L-1 | Reservoir water level | 0.67 | 0.415 |
| Seepage around the dam L-1 | Rainfall | 0.092 | 0.762 |
| Seepage around the dam L-1 | Seepage pressure P2-7 | 0.944 | 0.334 |
| Seepage around the dam L-1 | Seepage pressure P2-8 | 0.013 | 0.909 |
| Seepage around the dam L-1 | Seepage pressure P1-4 | 13.23 | 0.000 *** |
| Seepage around the dam L-1 | Seepage around the dam LB-1 | 0.637 | 0.427 |

Note: ***, **, and * denote significance levels of 1%, 5%, and 10%, respectively.
Table 6. Comparison of stepwise regression analysis and GCT results.

| Dependent Variable | Stepwise Regression Analysis | GCT |
| --- | --- | --- |
| Seepage around the dam LB-1 | Seepage pressure P1-4; Seepage pressure P2-7 | Reservoir water level; Seepage pressure P2-7 |
| Seepage around the dam L-1 | Seepage pressure P1-4; Seepage pressure P2-7; Reservoir water level; Rainfall | Seepage pressure P1-4 |

Liu, L.; Jin, Y.; Zhang, S.; Cheng, F. Dam Seepage Analysis Based on Causal Testing and Regression Analysis. Water 2026, 18, 1359. https://doi.org/10.3390/w18111359