1. Introduction
Groundwater, as a critical component of the global water cycle, plays an irreplaceable role in maintaining ecological balance and supporting socioeconomic development [1,2]. Accurate prediction of groundwater level dynamics is of great significance for water resources management, agricultural irrigation planning, flood mitigation, and ecological conservation [3,4]. However, under intensifying climate change and growing human disturbances, groundwater systems have become increasingly complex in their evolutionary behavior [5,6]. Traditional physically based hydrological models face severe challenges in parameterization and scale matching, particularly in highly heterogeneous aquifer systems such as karst.
Karst aquifers are characterized by dual-porosity structures and highly heterogeneous conduit–matrix systems, resulting in strong nonlinearity and rapid rainfall–response dynamics [7,8]. The coexistence of conduit and matrix flows makes the rainfall–groundwater relationship extremely complex, posing unique challenges for groundwater level prediction in karst regions [9,10,11]. In China, karst landscapes cover approximately one-third of the national territory [12], with the Guangxi Zhuang Autonomous Region exhibiting particularly well-developed karst landforms [13], providing a natural laboratory for advancing karst hydrogeological research.
In recent years, machine learning (ML) and deep learning (DL) methods have gradually become important tools for hydrological time series prediction due to their powerful nonlinear fitting capabilities [14]. In the field of groundwater level prediction, traditional ML models such as Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) have been proven effective in handling moderate-complexity nonlinear relationships [15,16,17]. The introduction of DL models has further enhanced prediction performance: Long Short-Term Memory networks (LSTM) and Convolutional Neural Networks (CNN) can automatically extract temporal features and capture long-term dependencies [18,19]. Hybrid models such as CNN-LSTM demonstrate promising potential in complex hydrological processes by integrating local feature extraction with sequence modeling capabilities [20,21]. Recently, Transformer models with self-attention mechanisms have demonstrated potential in handling long-range dependencies in hydrological forecasting [22].
In the specific domain of groundwater level prediction, studies have evaluated the performance of several models included in the present study. Traditional machine learning models, including RF and XGBoost, have been applied to groundwater level prediction in the Najafabad plain, demonstrating competitive accuracy in handling nonlinear feature interactions under data-limited conditions [23]. LSTM and Transformer models have been assessed for extended-horizon groundwater level forecasting in the Thames Basin, with the Transformer showing advantages in capturing long-range temporal dependencies over traditional recurrent architectures [24]. The CNN-LSTM hybrid architecture has been validated for groundwater level forecasting in arid regions, confirming the benefit of combining local convolutional feature extraction with sequential memory modeling [25]. Although Seq2Seq-LSTM and Attention-Seq2Seq-LSTM have not yet been applied to groundwater level prediction, both architectures have demonstrated strong performance in multi-step time series prediction tasks: Seq2Seq-LSTM has shown effectiveness in capturing temporal dynamics through its encoder–decoder structure [26], while the attention mechanism further enhances prediction stability by enabling the decoder to selectively focus on the most relevant historical states [27]. N-BEATS, despite achieving state-of-the-art results in energy and financial time series forecasting [28] and showing promise in streamflow prediction [29,30], similarly lacks any reported application in groundwater level estimation. The absence of systematic evaluation for these three architectures in groundwater level prediction, particularly in complex karst systems, motivates their inclusion in the present study.
Despite significant progress in hydrological prediction using machine learning and deep learning, their application in karst groundwater level prediction still faces numerous challenges. The rapid proliferation of model architectures has exacerbated the model selection dilemma in karst groundwater level prediction [31,32,33]. Moreover, existing studies predominantly focus on single-step-ahead prediction, with limited attention to multi-step prediction stability, while multi-step prediction is of greater practical relevance for long-term water resource management [34,35,36]. More critically, the influence of hydrogeological conditions on model performance has not been sufficiently elucidated, and their potential dominant role may exceed the differences arising from model architectures themselves.
To address the above challenges, this study systematically evaluates nine ML and DL models (RF, XGBoost, LSTM, CNN, Transformer, N-BEATS, CNN-LSTM, Seq2Seq-LSTM, and Attention-Seq2Seq-LSTM) for rainfall–groundwater level forecasting in a typical karst watershed in Guilin, Guangxi. Three monitoring sites representing distinct hydrogeological zones, namely the recharge, flow, and discharge zones, were selected, and two years of hourly high-frequency data were employed. The central scientific question addressed in this study is to what extent hydrogeological complexity, relative to model architecture, governs prediction feasibility in karst groundwater systems. To investigate this question, three specific objectives are pursued: (1) to quantify the relative influence of hydrogeological conditions versus model architecture on prediction performance by systematically comparing nine models across three sites with contrasting hydrogeological characteristics; (2) to evaluate single-step and multi-step prediction performance and characterize stability degradation trends across increasing forecast horizons; (3) to assess the trade-off between prediction accuracy and computational efficiency and identify model structures best suited to karst groundwater forecasting under different application scenarios. The scientific contributions of this work are threefold. First, we provide systematic cross-site evidence that hydrogeological complexity constitutes a dominant constraint on model predictive skill, consistently outweighing architectural differences across all nine model families, a finding that reframes model selection in karst systems as a hydrogeological problem rather than a purely algorithmic one.
Second, we propose a replicable multidimensional evaluation framework integrating single-step accuracy, multi-step stability, and computational efficiency, which can serve as a methodological reference for model selection and performance assessment in complex hydrological systems. Third, we present the first application of N-BEATS to karst groundwater level forecasting, expanding the methodological toolkit available for heterogeneous aquifer environments.
2. Materials and Methods
2.1. Study Area
The Maocun subterranean river catchment is situated in the southeastern part of Chaotian Township, Lingchuan County, Guilin City, Guangxi Zhuang Autonomous Region (110°30′–110°35′ E, 25°09′–25°13′ N), encompassing an area of approximately 11.2 km² (Figure 1). The region experiences a typical subtropical monsoon climate, with a long-term mean annual precipitation of 1903.9 mm (Figure 2) and a mean annual temperature of 18.6 °C [37]. Precipitation is unevenly distributed throughout the year: the concentrated rainfall period from April to July accounts for 60–70% of the annual total.
Carbonate lithologies occupy 7.6 km² of the catchment, approximately two-thirds of its total area [37]. Karst landforms are well developed, with numerous surface features such as sinkholes and dolines. Subsurface development includes an intricate network of caves and underground river systems, rendering the Maocun catchment a paradigmatic karst watershed [38].
2.2. Data Sources and Data Preprocessing
To compare the performance of different models in karst groundwater level prediction, this study utilized data from three groundwater monitoring wells: Shanwan (ZK1), Zhangshandi (ZK2), and Maocun outlet (ZK3), along with one automatic meteorological station. The three monitoring wells are characterized as follows. ZK1, located in the upper reaches of the watershed, has a shallow water table depth of approximately 2 m. ZK2, situated in the central catchment area, exhibits a significantly deeper water table depth of about 12 m and is characterized by complex topographic and geological conditions. ZK3, near the watershed outlet, also has a shallow water table depth of approximately 2 m. The observation period spans from 15 July 2021 to 10 July 2023, comprising 728 days of continuous monitoring records. Precipitation data were recorded at 15 min intervals, while groundwater level measurements were taken every 30 min.
Karst groundwater level variations exhibit pronounced lag and continuity characteristics, with response times typically measured in hours [39]. Although the original monitoring data were collected at 30 min intervals, groundwater level fluctuations are generally small at such short time scales, and high-frequency noise may interfere with model training. To achieve a balance between data precision and computational efficiency, this study employed mean resampling to convert the original 30 min data to hourly resolution, with precipitation data correspondingly resampled to hourly intervals. This processing strategy preserved the primary variational characteristics of the data while effectively reducing data volume and improving model training efficiency.
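The mean-resampling step above can be sketched as follows. This is an illustrative reconstruction, not the study's actual code: two consecutive 30 min groundwater level readings are averaged into one hourly value, and the function name and toy data are ours.

```python
import numpy as np

def resample_to_hourly(levels_30min):
    """Average consecutive pairs of 30 min readings into hourly means."""
    arr = np.asarray(levels_30min, dtype=float)
    n = (len(arr) // 2) * 2          # drop a trailing unpaired reading
    return arr[:n].reshape(-1, 2).mean(axis=1)

# five 30 min readings yield two full hourly means; the unpaired
# fifth reading is dropped
hourly = resample_to_hourly([2.00, 2.02, 2.10, 2.14, 2.08])
```

The same pairwise-mean logic applies to the precipitation series after it is first aggregated from 15 min to 30 min resolution.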
To ensure data quality and enhance model training effectiveness, several preprocessing procedures were applied to the original time series data. Outliers, defined as values significantly exceeding normal ranges, were identified in approximately 1% of the dataset, primarily caused by sensor handling during data collection. These outliers were replaced with the average of adjacent values to maintain data continuity. Missing values, resulting from sensor malfunctions or signal interruptions, were addressed as follows. For short gaps, linear interpolation was used to fill missing data points. For continuous data gaps exceeding 6 h, regression models were constructed using observations from nearby monitoring stations during corresponding periods to estimate missing values.
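The two gap-handling rules described above (linear interpolation for short gaps, station-regression reconstruction for gaps longer than 6 h) can be sketched as below, assuming missing values are stored as NaN. The function name and threshold handling are illustrative.

```python
import numpy as np

def fill_short_gaps(series, max_gap=6):
    """Linearly interpolate NaN runs no longer than max_gap hours;
    longer runs are left as NaN for regression-based reconstruction."""
    arr = np.asarray(series, dtype=float).copy()
    isnan = np.isnan(arr)
    idx = np.arange(len(arr))
    # interpolate everything first, then re-blank the long gaps
    filled = np.interp(idx, idx[~isnan], arr[~isnan])
    start = None
    for i, m in enumerate(np.append(isnan, False)):
        if m and start is None:
            start = i                        # a NaN run begins
        elif not m and start is not None:
            if i - start > max_gap:          # gap too long: keep as NaN
                filled[start:i] = np.nan
            start = None
    return filled
```

A 1 h gap is filled by interpolation, while a 7 h gap remains NaN and is handed to the cross-station regression step.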
The dataset from monitoring stations ZK1, ZK2, and ZK3 was analyzed for completeness. ZK3 data were complete, while ZK1 and ZK2 exhibited missing values, accounting for approximately 35% of their respective datasets. These missing periods occurred primarily from July to September 2021 and January to March 2022, coinciding with low precipitation and stable water levels. The missing data were attributed to sensor failures. Given the strong correlations between measured data from the stations (all pairwise correlations > 0.8), regression models were employed to reconstruct missing values. The correlation matrix for the measured data, after removing missing values, is shown in Table 1.
Cross-validation of the regression models used for data reconstruction demonstrated high accuracy, as shown in Table 2.
Following preprocessing, the dataset for model training comprised hourly precipitation and groundwater level observations from three monitoring wells, with each well containing 17,442 records (Figure 3). Basic statistical characteristics of the groundwater level data at each monitoring site are summarized in Table 3. To maintain temporal integrity and prevent data leakage, the dataset was chronologically divided into a training set (15 July 2021 to 31 December 2022, totaling 12,823 records, about 70%), a validation set (1 January 2023 to 15 April 2023, totaling 2520 records, about 15%), and a test set (16 April 2023 to 12 July 2023, totaling 2099 records, about 15%). This partitioning strategy ensures the training period captures multiple seasonal cycles for comprehensive pattern learning, while the test period covers the critical spring–summer transition for robust model evaluation.
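The leakage-free chronological split can be expressed as a simple slicing of the time-ordered records; the counts below follow the paper (12,823 + 2520 + 2099 = 17,442), while the function name is illustrative.

```python
def chronological_split(records, n_train=12823, n_val=2520):
    """Split an already time-ordered record list without shuffling,
    so no future observation leaks into training."""
    train = records[:n_train]
    val = records[n_train:n_train + n_val]
    test = records[n_train + n_val:]
    return train, val, test

records = list(range(17442))          # 17,442 hourly records per well
train, val, test = chronological_split(records)
```

Because the slices preserve temporal order, every training record strictly precedes every validation record, which in turn precedes every test record.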
2.3. Experimental Setup
The experimental environment configuration for this study is presented in Table 4, with all models utilizing identical random seeds (seed = 42) to ensure reproducibility. To establish a fair and consistent comparison, a unified experimental framework was used across all models. All input features and target variables were normalized using the MinMaxScaler from scikit-learn (version 1.5.1). A sliding window approach was adopted, defining an input sequence length of 6 time steps and a prediction window of 12 time steps. A consistent feature engineering process was applied to incorporate temporal dependencies, with details provided in Section 2.4.
While the input/output window was identical across models, the data was formatted differently to suit the inherent structure of each model family. For the tree-based models, namely Random Forest and XGBoost, the input sequence of 6 time steps was flattened into a single feature vector, transforming the time series problem into a standard tabular regression task. In contrast, the deep learning models received the input as a sequence of shape (6, number of features), allowing their architectures to directly process the temporal structure of the data.
Crucially, all models employed a Direct Multi-output strategy to generate the 12-step forecast, effectively avoiding the error accumulation issues inherent in recursive forecasting. For the machine learning models, this was achieved using the MultiOutputRegressor wrapper from the multioutput module of scikit-learn (sklearn.multioutput.MultiOutputRegressor) in Python, which trains an independent regressor for each of the 12 future time steps. For the deep learning models, this strategy was implemented architecturally through a final output layer designed to produce a 12-dimensional vector, where each dimension corresponds to a step in the prediction window.
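The windowing and Direct Multi-output strategy described above can be sketched as follows. The study wraps RF/XGBoost in sklearn.multioutput.MultiOutputRegressor; to keep this example dependency-free, a least-squares linear model stands in for each per-horizon regressor, so this is a structural sketch rather than the paper's pipeline. All names are illustrative.

```python
import numpy as np

IN_LEN, OUT_LEN = 6, 12                               # paper's window sizes

def make_windows(series, in_len=IN_LEN, out_len=OUT_LEN):
    """Flattened 6-step inputs and 12-step targets (tabular formulation)."""
    X, Y = [], []
    for i in range(len(series) - in_len - out_len + 1):
        X.append(series[i:i + in_len])
        Y.append(series[i + in_len:i + in_len + out_len])
    return np.array(X), np.array(Y)

def fit_direct_multioutput(X, Y):
    """One independent linear regressor per forecast step
    (the 'direct' strategy: no recursive error accumulation)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])         # add intercept column
    coefs, *_ = np.linalg.lstsq(Xb, Y, rcond=None)    # one column per horizon
    return coefs

series = np.arange(100, dtype=float)                  # toy ramp signal
X, Y = make_windows(series)
W = fit_direct_multioutput(X, Y)
preds = np.hstack([X, np.ones((len(X), 1))]) @ W      # shape (n_windows, 12)
```

The deep learning models realize the same idea architecturally: the final layer emits a 12-dimensional vector instead of training 12 separate regressors.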
All deep learning models were uniformly trained using the Adam optimizer with an initial learning rate of 0.001, a batch size of 32, and the mean squared error loss function. The training process was regularized using a ReduceLROnPlateau learning rate scheduler (patience = 5) and an early stopping mechanism (patience = 15), with a maximum of 100 training epochs. The machine learning models adopted fixed parameter configurations: Random Forest was set with 100 decision trees (n_estimators = 100) [17], while XGBoost was configured with 100 boosting rounds (n_estimators = 100), a maximum depth of 6 (max_depth = 6), and a learning rate of 0.1 [40].
2.4. Feature Engineering
Groundwater levels in karst regions exhibit distinct lag response characteristics to precipitation, with response patterns varying among monitoring sites due to differences in hydrogeological conditions. Statistical analysis of precipitation events during the study period revealed average response times of 5.75 h at ZK1, 3.95 h at ZK2, and 4.875 h at ZK3 (Supplementary Materials Table S1, Figures S1–S3). To adequately capture diverse response characteristics while establishing a unified modeling framework, this study configured precipitation accumulation windows of 3 and 6 h based on the response time distribution patterns of monitoring sites. The 3 h accumulation window primarily captures early response signals from sites with relatively rapid responses, while the 6 h accumulation window encompasses the main lag response processes of all monitoring sites, ensuring that models can fully utilize temporal information from precipitation–groundwater level responses.
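The 3 h and 6 h accumulation features amount to trailing rolling sums of the hourly precipitation series. A minimal sketch, with illustrative names and toy data, implemented via convolution:

```python
import numpy as np

def trailing_sum(precip_hourly, window):
    """Rainfall summed over the trailing `window` hours at each step."""
    p = np.asarray(precip_hourly, dtype=float)
    kernel = np.ones(window)
    # full convolution truncated to the series length keeps the
    # sum strictly trailing (only past and current hours contribute)
    return np.convolve(p, kernel)[:len(p)]

rain = [0, 0, 5, 10, 0, 0, 0, 2]     # toy hourly rainfall (mm)
p3 = trailing_sum(rain, 3)           # 3 h accumulation feature
p6 = trailing_sum(rain, 6)           # 6 h accumulation feature
```

At hour 3 the 3 h feature equals 5 + 10 = 15 mm, capturing the early pulse, while the 6 h feature continues to carry that rainfall for three further hours, mirroring the slower site responses.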
Karst groundwater level variations exhibit significant temporal autocorrelation, with current water levels largely influenced by historical values [41], particularly the water level at the previous time step, which plays an important role in current predictions. Based on this, we constructed lagged water level features and water level rate-of-change features in this study.
For lagged water level features, we incorporated the previous time step water level value as an input feature, which can be expressed as:

H_lag(t) = H(t − 1)

where H_lag(t) represents the water level value with a 1 h time lag, and H(t − 1) denotes the observed water level at the previous time step. Additionally, to better characterize water level variation trends and dynamic features, we constructed water level differential features as follows:

ΔH(t) = H(t) − H(t − 1)

This feature can effectively quantify the rate of water level change per unit time, enhancing the model’s perception of both the direction and intensity of water level fluctuations, thereby contributing to improved response prediction accuracy.
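Both features above are one-line array operations; the sketch below (illustrative names, toy data) aligns them with the targets, noting that the first time step has no predecessor and therefore no lag or difference value.

```python
import numpy as np

def lag_and_diff(levels):
    """1 h lagged level H(t-1) and hourly rate of change H(t) - H(t-1),
    both aligned with targets H(t) for t >= 1."""
    h = np.asarray(levels, dtype=float)
    h_lag = h[:-1]           # H(t-1), the lagged water level feature
    dh = np.diff(h)          # H(t) - H(t-1), the differential feature
    return h_lag, dh

h_lag, dh = lag_and_diff([2.0, 2.1, 2.4, 2.3])
```

The sign of dh carries the direction of fluctuation (rising vs. falling limb) and its magnitude the intensity, which is the information the differential feature is meant to expose.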
2.5. Prediction Models
2.5.1. Conventional Machine Learning Models
Traditional machine learning models have been widely applied in hydrological and environmental forecasting due to their strong performance in handling small samples, high-dimensional, and nonlinear problems [42,43]. Compared to deep learning methods, traditional machine learning models offer advantages such as faster training speed, better interpretability, and relatively lower data requirements. This study selects the following two representative algorithms.
As a typical ensemble learning method, RF (Figure S4a) constructs a nonlinear regression model by building multiple decision trees and averaging their predictions [44]. This method is capable of handling nonlinear interactions between features, while also being robust, resistant to overfitting, and easy to tune [45,46]. In groundwater forecasting, RF can automatically identify key influencing factors, making it well-suited for handling multivariate and heterogeneous hydrological data [15,47].
As an efficient implementation of gradient boosting algorithms, XGBoost (Figure S4b) constructs weak learners sequentially and combines them through weighted aggregation to form a strong predictive model [48]. The algorithm employs regularization techniques to control model complexity, effectively capturing complex feature–response relationships and performing excellently in structured data modeling scenarios. Its efficient parallel computing capability and superior generalization performance make it an important benchmark model for time series forecasting [40,49].
2.5.2. Single Deep-Learning Architectures
Deep learning models possess powerful nonlinear mapping capabilities and automatic feature extraction abilities, offering inherent advantages when dealing with high-dimensional, dynamic, and nonlinear problems such as groundwater systems [50]. Deep architectures can automatically learn hierarchical feature representations from raw data, eliminating the need for manual feature engineering, and are particularly well-suited for handling groundwater dynamics with complex spatiotemporal dependencies [51]. This study selects the following four deep learning models.
As an important variant of recurrent neural networks (RNNs), LSTM effectively addresses the gradient vanishing problem of traditional RNNs by introducing gating mechanisms (forget gate, input gate, output gate) (Figure S5a), where Xₜ₋₁, Xₜ, and Xₜ₊₁ represent input vectors at consecutive time steps, and hₜ₋₁, hₜ, and hₜ₊₁ denote the corresponding hidden state outputs. The symbol σ represents the sigmoid activation function (output range 0 to 1), tanh denotes the hyperbolic tangent activation function (output range −1 to 1), ⊗ indicates element-wise multiplication (Hadamard product), and ⊕ represents element-wise addition operations. These gating structures work collaboratively to control information flow and effectively capture long-term temporal dependencies. This model selectively remembers and forgets historical information, making it particularly suitable for capturing long-term dependencies and lag effects in groundwater level time series [52]. In karst groundwater systems, LSTM can model the complex lagged response relationships between rainfall, runoff, and groundwater levels [53].
CNN employs local receptive fields and parameter sharing mechanisms for efficient feature extraction [54]. In time series modeling, CNN can automatically identify local patterns, trend changes, and periodic features (Figure S5b). Starting from the input, the architecture extracts local temporal features through one-dimensional convolutional layers (Conv1D), introduces non-linear transformations via ReLU activation functions, further extracts high-level features through additional Conv1D and ReLU layers, applies Global Average Pooling (GAP) for feature dimensionality reduction, and finally generates predictions through a fully connected layer (FC). The 1D convolutional kernel captures short-term dependencies and local correlations in the time series, while the multi-layer convolutional structure extracts features at different temporal scales, providing rich feature information for groundwater level prediction [55,56].
Based on the self-attention mechanism, Transformer models the direct relationships between any two positions in an input sequence, enabling parallel processing of sequence information and capturing global dependencies [57] (Figure S5c). Figure S5c presents the attention-based encoder architecture, where the input time series is converted into high-dimensional vector representations through input embedding, combined with positional encoding to preserve temporal sequence information. The multi-head self-attention mechanism (h = 4 indicates 4 parallel attention heads) computes attention weights across different subspaces in parallel to capture complex intra-sequence dependencies. Add and LayerNorm represent residual connections and layer normalization operations for training stabilization, while the Feed Forward Network consists of two fully connected layers for feature transformation. The encoder stacks two such layers (Transformer Encoder Layer), with final predictions generated through the Last Token Selection mechanism. This model overcomes the limitations of recurrent structures, offering higher computational efficiency and stable training when handling long sequences. The self-attention mechanism can automatically learn the importance weights of different time steps in the sequence, making it particularly suitable for capturing complex, cross-time-scale association patterns in karst groundwater systems [58].
N-BEATS is a deep neural network architecture specifically designed for time series forecasting [28] (Figure S5d). This model adopts residual connections and hierarchical forecasting concepts, decomposing complex time series into trend and seasonal components for modeling. The Trend Stack captures long-term trend components, the Seasonality Stack models periodic patterns, and the Generic Stack handles other complex non-linear patterns. Each stack contains multiple structurally identical blocks (Block 1, Block 2, Block 3), with each block internally composed of four fully connected layers (FC1, FC2, FC3, FC4). The dual-output mechanism generates both forecast and backcast signals, implementing residual learning through “residual = residual − backcast” to progressively remove modeled components, while accumulating predictions from all stacks via “total_forecast += forecast” to achieve decomposed modeling and ensemble prediction of different frequency and pattern components in time series. N-BEATS performs decomposition and reconstruction of time series using learnable basis functions, capturing multi-scale temporal patterns and demonstrating exceptional performance in pure time series forecasting tasks [59].
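The double-residual bookkeeping described above (“residual = residual − backcast”, “total_forecast += forecast”) can be illustrated with a toy forward pass. The “blocks” below are trivial stand-ins for the trained FC stacks, chosen only to make the residual mechanics visible; nothing here reproduces the paper's trained model.

```python
import numpy as np

def nbeats_style_pass(x, blocks, horizon):
    """Run the N-BEATS double-residual loop: each block emits a backcast
    (subtracted from the running residual) and a forecast (accumulated)."""
    residual = np.asarray(x, dtype=float).copy()
    total_forecast = np.zeros(horizon)
    for block in blocks:
        backcast, forecast = block(residual)
        residual = residual - backcast        # remove the modeled component
        total_forecast = total_forecast + forecast
    return residual, total_forecast

# stand-in blocks: the first explains the mean level, the second the rest
mean_block = lambda r: (np.full_like(r, r.mean()), np.full(2, r.mean()))
rest_block = lambda r: (r.copy(), np.full(2, 0.0))

residual, forecast = nbeats_style_pass([1.0, 2.0, 3.0],
                                       [mean_block, rest_block], horizon=2)
```

After both blocks run, the residual is fully explained (all zeros) and the accumulated forecast carries the mean-level component, mimicking how successive stacks peel off trend, seasonality, and generic components.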
2.5.3. Hybrid Deep-Learning Architectures
To leverage the advantages of different network structures and enhance the model’s expressive power, this study introduces three hybrid deep learning models. By combining the strengths of different types of neural networks, hybrid architectures enable complementary advantages in feature extraction, sequence modeling, and temporal prediction, further improving the accuracy of modeling complex groundwater dynamic processes [60].
The Seq2Seq-LSTM model (Figure S6a) adopts a Sequence-to-Sequence (Seq2Seq) encoder–decoder architecture in which two LSTM networks are responsible for encoding the input sequence and generating the output sequence, respectively [26]. The encoder LSTM compresses the input historical groundwater level sequence into a fixed-length context vector, and the decoder LSTM generates the future prediction sequence step by step based on this context vector. This architecture is naturally suited for multi-step prediction tasks, maintaining semantic consistency between the input and output sequences, and provides an effective modeling framework for medium- to long-term groundwater level forecasting.
The CNN-LSTM (Figure S6b) hybrid architecture combines the local feature extraction ability of CNN with the sequence memory capability of LSTM, forming a hierarchical feature learning framework. The CNN layer first performs convolution on the input time series to extract local patterns and short-term dependency features. The LSTM layer then processes the feature sequences extracted by the CNN, modeling long-term temporal dependencies [61,62]. This structure is particularly suitable for handling groundwater level data with multi-scale temporal features, capable of capturing both short-term fluctuations and long-term trends.
Based on the Seq2Seq-LSTM model, the attention mechanism is introduced (Attention-Seq2Seq-LSTM model) (Figure S6c) to address the information bottleneck issue in the traditional encoder–decoder architecture. The attention mechanism allows the decoder to dynamically focus on different hidden states of the encoder when generating each prediction step, rather than relying solely on a fixed context vector. This dynamic attention allocation mechanism improves the model’s efficiency in utilizing key historical information, making it particularly suitable for modeling complex scenarios in groundwater systems where different historical periods contribute differently to future predictions [63].
2.6. Model Evaluation Metrics
To objectively evaluate the performance of constructed models in karst groundwater level prediction, this study employed strict temporal partitioning for model training, validation, and testing to avoid evaluation bias caused by data leakage. Based on the characteristics of regression prediction tasks, three widely adopted evaluation metrics were selected to quantify model performance.
Root Mean Squared Error (RMSE) measures the square root of the mean squared deviation between the predicted values and the observed values [64]. Its mathematical expression is as follows:

RMSE = √[ (1/n) Σᵢ (yᵢ − ŷᵢ)² ]

where yᵢ represents the true value of the i-th observation, ŷᵢ represents the predicted value of the i-th observation, and n is the number of observations. RMSE penalizes large errors more strongly and is sensitive to outliers, making it particularly suitable for evaluating the fitting ability of models in processes with abrupt changes, such as heavy rainfall responses. In karst groundwater level forecasting, RMSE effectively identifies the model’s prediction stability under extreme weather events.
Mean Absolute Error (MAE) reflects the average absolute deviation between the predicted values and the observed values [65]. Its calculation formula is as follows:

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|

MAE possesses intuitive interpretability, as its unit is consistent with that of groundwater level (meters), enabling direct physical understanding of prediction deviations. Unlike RMSE, MAE is less sensitive to outliers and extreme errors, making it more suitable for characterizing the overall prediction bias. By assigning equal weight to all errors, MAE better reflects the model’s average predictive capability across the entire dataset.
Coefficient of Determination (R²) represents the goodness-of-fit of the model to the trend of groundwater level variations [64] and is defined as:

R² = 1 − [ Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)² ]

where ȳ denotes the mean of the observed values. The closer the R² value approaches 1, the better the model’s fitting performance for groundwater level variation trends, indicating greater effectiveness in capturing the intrinsic patterns of groundwater level fluctuations.
Comprehensive evaluation using these three metrics quantifies model prediction performance from different perspectives, providing objective criteria for selecting appropriate karst groundwater level prediction methods.
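The three metrics can be written out explicitly as follows; these are the standard definitions from Section 2.6 applied to illustrative toy data, with function and variable names of our choosing.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error: sqrt of the mean squared deviation."""
    return float(np.sqrt(np.mean((np.asarray(y, float) - np.asarray(y_hat, float)) ** 2)))

def mae(y, y_hat):
    """Mean absolute error, in the same units as the water level (m)."""
    return float(np.mean(np.abs(np.asarray(y, float) - np.asarray(y_hat, float))))

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_obs = [2.0, 2.5, 3.0, 2.8]      # toy observed levels (m)
y_pred = [2.1, 2.4, 3.1, 2.7]     # toy predictions, each off by 0.1 m
```

With a uniform 0.1 m error, RMSE and MAE coincide at 0.1 m, while R² reflects how small that error is relative to the variance of the observed series.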
3. Results and Discussion
3.1. Performance Analysis of Single-Step Prediction
To comprehensively evaluate the single-step prediction performance of each model, this study conducted a comparative analysis of nine models across the test sets from three monitoring sites (ZK1, ZK2, and ZK3), employing the RMSE, MAE, and R² metrics for quantitative assessment. Results are presented in Figure 4.
The prediction model performance corresponded well with the hydrogeological characteristics at each monitoring site. ZK1, located in the upper catchment with a shallow water table, achieved the highest prediction accuracy (R² > 0.950, RMSE: 0.130–0.168), which can be attributed to its straightforward hydrological processes dominated by the direct influence of allogenic water and precipitation. The simple and direct rainfall–groundwater coupling at this site provides models with high-quality, low-noise input–output mappings that are amenable to accurate prediction regardless of architectural complexity. In contrast, ZK2, situated in the central catchment with complex topography and geology and a deep water table, presented the greatest prediction challenge (R²: 0.769–0.813, RMSE: 0.606–0.673). Its elevated prediction uncertainty reflects the complexity of groundwater flow paths in the highly karstified zone and the strong nonlinearity introduced by heterogeneous media structures. ZK3, located at the watershed outlet with a shallow water table, exhibited intermediate performance (R²: 0.855–0.877, RMSE: 0.278–0.302), consistent with its role as the regional discharge boundary that effectively dampens high-frequency fluctuations induced by internal catchment complexities.
Overall, deep learning models demonstrated superior performance compared to traditional machine learning methods in groundwater level prediction, consistent with findings from other studies [18]. Hybrid deep learning architectures outperformed single-model structures, with CNN-LSTM surpassing the individual CNN and LSTM models, and Seq2Seq-LSTM and Attention-Seq2Seq-LSTM both exceeding the basic LSTM model, aligning with previous research conclusions [34,66]. The performance gain of hybrid architectures over their single-network counterparts reflects the multi-scale nature of karst groundwater fluctuations, where short-term precipitation pulses and longer-term recession dynamics operate at distinct temporal scales that benefit from complementary feature extraction strategies. Specifically, the Transformer model exhibited optimal overall performance, achieving the lowest RMSE and highest R² values across all three monitoring sites. The CNN-LSTM model also demonstrated strong predictive capability, ranking second only to the Transformer, and obtained lower MAE values than the Transformer at the ZK1 and ZK3 sites, consistent with prior studies [22,62,67,68]. In contrast, traditional machine learning models (XGBoost, Random Forest) showed relatively lower prediction accuracy across all monitoring sites, particularly at the geologically complex ZK2 site, where R² values were only 0.774 and 0.769, significantly lower than those of the deep learning models.
Notably, N-BEATS, a deep learning architecture specifically designed for time series forecasting and initially applied in financial and energy forecasting, has seen limited application in the hydrological field. Based on the current literature review, N-BEATS has only been applied to monthly inflow prediction for dam reservoirs [30], with no reported applications in groundwater level forecasting, particularly in karst regions. The present study demonstrates that the N-BEATS model exhibits promising performance in groundwater level prediction, ranking third after the Transformer and CNN-LSTM models, and achieving lower MAE values than the Transformer at the ZK1 and ZK3 sites. This competitive performance may be attributed to the structural compatibility between the N-BEATS stacked block decomposition mechanism and the multi-scale composition of karst groundwater level signals, which comprise both slowly varying baseflow recession components and high-frequency storm-driven recharge pulses. This provides new evidence for extending N-BEATS applications in hydrological forecasting.
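The stacked block decomposition mechanism referred to above can be illustrated with a toy, untrained sketch of the N-BEATS doubly residual principle: each block fits part of the input window (the backcast), contributes a partial forecast, and passes the residual signal to the next block. The hand-set block rules and window values below are hypothetical stand-ins for the learned basis expansions of the real model.

```python
def trend_block(window, horizon):
    """Crude 'slow component' block: backcast the window mean, forecast a flat level."""
    level = sum(window) / len(window)
    return [level] * len(window), [level] * horizon

def residual_block(window, horizon):
    """Crude 'fast component' block: backcast the last residual and persist it."""
    last = window[-1]
    return [last] * len(window), [last] * horizon

def nbeats_forecast(window, horizon, blocks):
    residual = list(window)
    forecast = [0.0] * horizon
    for block in blocks:
        backcast, partial = block(residual, horizon)
        # Doubly residual stacking: subtract each block's backcast from the
        # signal and accumulate its partial forecast.
        residual = [r - b for r, b in zip(residual, backcast)]
        forecast = [f + p for f, p in zip(forecast, partial)]
    return forecast

window = [10.0, 10.5, 11.0, 11.5]  # hypothetical groundwater levels (m)
print(nbeats_forecast(window, horizon=3, blocks=[trend_block, residual_block]))
```

The first block absorbs the slowly varying level and the second absorbs what remains, which loosely mirrors how the architecture can separate baseflow recession from storm-driven pulses.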
3.2. Performance Analysis of Multi-Step Prediction
To evaluate the long-term prediction capability of each model, this study conducted 1- to 12-step prediction experiments at three monitoring sites (ZK1, ZK2, and ZK3), systematically analyzing the impact of prediction horizon on model performance using the RMSE, MAE, and R2 metrics (Figure 5); the complete MAE results are provided in Figure S7.
Results demonstrate that all models exhibit pronounced performance degradation during multi-step prediction, which is an inherent characteristic of long-term time series forecasting [69]. However, significant differences in performance degradation rates across monitoring sites reveal that prediction reliability depends not only on model architecture but is also constrained by the intrinsic hydrodynamic characteristics of groundwater systems. Among the three sites, ZK1 exhibited the slowest performance degradation, followed by ZK3, while ZK2 experienced the most severe deterioration, with R2 values of some models dropping below 0.6, rendering their predictions unreliable.
This spatial variability corresponds closely to the hydrogeological conditions at each site. ZK1 features relatively simple hydrological processes with weak nonlinearity, enabling models to readily capture its regular variations. In contrast, ZK2 is characterized by deep groundwater levels, prolonged system residence time, complex flow pathways, and coupling effects of multiple factors, resulting in significantly elevated prediction uncertainty. ZK3, situated in the watershed discharge zone, is influenced by upstream complex processes yet possesses buffering capacity that attenuates high-frequency fluctuations, thereby achieving intermediate prediction performance. These findings demonstrate that the inherent complexity of groundwater systems constitutes a critical constraint on multi-step prediction capability, and even advanced models remain limited by actual hydrogeological settings [70,71].
Overall, the traditional machine learning models (XGBoost and Random Forest) exhibit a marked decline in performance in multi-step-ahead prediction. As the prediction horizon increases, their R2 continuously decreases, remaining the lowest among all models across the three monitoring sites. This accelerating degradation is structurally consistent with the inability of tree-based models to represent temporal state evolution: because RF and XGBoost encode the input window as a static feature vector, they cannot model the propagation of hydrological states through time, causing predictive skill to collapse as the forecast horizon exceeds the temporal scale captured by the input features. This finding is consistent with previous studies, which have shown that while traditional machine learning models perform well in single-step or short-term predictions, they struggle to effectively model the dynamic evolution of hydrological systems over longer horizons [18]. The single-network LSTM architecture shows similar limitations, with prediction accuracy deteriorating significantly as the forecast step increases, particularly at ZK2, where hydrological responses are highly complex.
To address this challenge, Seq2Seq-LSTM and Attention-Seq2Seq-LSTM adopt encoder–decoder structures, the latter augmented with an attention mechanism, to enhance model capability. Although these architectures improve the model’s ability to focus on critical historical states and show relatively better adaptability at the ZK2 site, their overall performance improvement over the basic LSTM remains limited. This may be attributed to suboptimal network design, attention weight allocation, or gradient propagation efficiency during training, suggesting that further optimization is needed to strengthen their modeling of long-term dependencies.
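The attention-based reweighting of historical states mentioned above can be sketched as scaled dot-product attention for a single query over a short encoded history; the key, value, and query vectors below are hypothetical two-dimensional encodings, not outputs of the trained models.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over a short history.

    Returns attention weights and a weighted combination of `values`, where
    each weight reflects how similar the corresponding key (an encoded
    historical state) is to the query.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Hypothetical encodings of three historical time steps; the query matches
# the second step most closely, so it should receive the largest weight.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0], [20.0], [15.0]]
query = [0.0, 1.0]
weights, context = attention(query, keys, values)
print([round(w, 3) for w in weights], round(context[0], 2))
```

The decoder thus attends most strongly to the historical state most relevant to the current prediction step rather than relying solely on a compressed final hidden state.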
In contrast, the CNN and its hybrid with LSTM (CNN-LSTM) demonstrate superior predictive performance, indicating that convolutional neural networks are effective in extracting local features from input variables. By integrating CNN with LSTM, the hybrid architecture enables synergistic modeling of spatial local patterns and temporal dependencies, thereby improving prediction accuracy. This result aligns with findings from river conductivity forecasting [66] and runoff simulations in the headwater region of the Yellow River [34], validating the effectiveness of the CNN-LSTM framework in hydrological prediction. Nevertheless, its long-term predictive capability still lags behind that of N-BEATS and Transformer, reflecting limitations in modeling distant future dynamics under highly nonlinear conditions.
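The local feature extraction performed by the convolutional layers can be illustrated with a minimal valid-mode 1-D convolution; the level series and the difference kernel are illustrative only, not the learned filters of the actual CNN-LSTM.

```python
def conv1d(series, kernel):
    """Valid-mode 1-D convolution (cross-correlation) over a time series.

    Each output is a weighted sum of a local window, which is how the
    convolutional layers in a CNN-LSTM extract short-term patterns before
    the LSTM models their temporal evolution.
    """
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

# A first-difference kernel highlights abrupt rises, e.g. storm-driven
# recharge pulses in a groundwater level series (values are illustrative).
levels = [10.0, 10.0, 10.1, 11.2, 11.3, 11.3]
edges = conv1d(levels, kernel=[-1.0, 1.0])
print([round(e, 2) for e in edges])
```

A trained CNN learns many such kernels simultaneously, so the feature maps passed to the LSTM encode several complementary local patterns at once.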
The Transformer model demonstrates a distinct advantage in capturing seasonal fluctuations and abrupt events, exhibiting stronger robustness in medium- to long-term predictions. By dynamically weighting the importance of historical states, the self-attention mechanism effectively models long-range dependencies, thereby mitigating the prediction uncertainty caused by information decay. This advantage is consistent with prior research: the Transformer has been shown to significantly outperform traditional LSTM models in multi-step groundwater level prediction due to its powerful sequence modeling capacity [72]. Comparative experiments in karst spring flow prediction further confirm the applicability of the Transformer in complex aquifer systems [67]. This evidence collectively suggests that the Transformer is particularly well suited for modeling groundwater systems characterized by memory effects, nonlinear responses, and multi-scale dynamics.
Most remarkably, the N-BEATS model demonstrated exceptional performance. Whether at the hydrologically simple ZK1 site or at the dynamically complex, nonlinearly responsive ZK2 and ZK3 sites, N-BEATS consistently exhibited the strongest long-term stability, significantly outperforming the other models. This result is highly consistent with findings from multi-step prediction of harmful algal blooms [73] and monthly reservoir runoff prediction [30], further confirming the superiority of N-BEATS in long-horizon forecasting tasks. The results indicate that N-BEATS not only performs well in single-step prediction but also exhibits robust modeling capability for distant forecast horizons under a direct multi-step prediction framework, offering new perspectives for the selection and optimization of intelligent models in future groundwater prediction.
To provide a more intuitive visualization of model performance across different prediction horizons, this study conducted a comprehensive visual analysis of representative models at the ZK1 site. Figure 6 presents the time series fitting performance of the nine models for 1-step, 6-step, and 12-step predictions, intuitively illustrating their capability to track groundwater level dynamics. Figure 7 quantitatively evaluates the accuracy distribution and consistency of three typical deep learning models (CNN-LSTM, Transformer, and N-BEATS) across different prediction steps through scatter plots of predicted versus observed values. Figure 8 depicts the evolution of cumulative absolute errors over time during the testing period for each model, serving as a measure of their stability in long-term prediction. Together, these three figures systematically reveal how model performance evolves with increasing prediction steps from three perspectives: temporal dynamic fitting, numerical accuracy, and error growth trends. Furthermore, to comprehensively assess model generalization capability under different hydrogeological conditions, the prediction time series and cumulative error evolution for the other two monitoring sites (ZK2 and ZK3) are presented in the Supplementary Materials (Figures S8 and S9 and Figures S10 and S11, respectively).
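Cumulative-absolute-error curves of the kind discussed above are straightforward to reproduce for any model output; a minimal sketch with made-up observation and prediction series:

```python
from itertools import accumulate

def cumulative_abs_error(obs, pred):
    """Running sum of absolute errors over the test period.

    This is the quantity plotted in cumulative-error curves: a flatter curve
    indicates more stable long-term prediction.
    """
    return list(accumulate(abs(o - p) for o, p in zip(obs, pred)))

# Illustrative values: model A makes uniform small errors, while model B's
# errors are concentrated late in the period, which the cumulative curve
# makes visible even when total error magnitudes are comparable.
obs    = [1.0, 1.2, 1.1, 1.3, 1.4]
pred_a = [1.1, 1.1, 1.2, 1.2, 1.5]
pred_b = [1.0, 1.2, 1.1, 1.8, 2.0]
print([round(e, 2) for e in cumulative_abs_error(obs, pred_a)])
print([round(e, 2) for e in cumulative_abs_error(obs, pred_b)])
```

Because the curve is monotone, sudden jumps in slope directly flag the periods (for example, recharge events) where a model loses skill.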
3.3. Comparative Analysis of Computational Efficiency
Computational efficiency represents a critical factor for the practical deployment of groundwater level prediction models. This section evaluates nine models across multiple dimensions, including parameter complexity, training time, and convergence characteristics.
Model parameter counts exhibit substantial variation, ranging from 34,764 parameters for CNN to 490,770 parameters for N-BEATS, representing a roughly 14-fold complexity difference (Table 5). It should be noted that these values reflect static weight file sizes only; runtime memory consumption during training and inference is substantially larger and scales with batch size and sequence length. The traditional machine learning methods (XGBoost, Random Forest) employ tree-ensemble architectures that do not involve trainable parameters in the deep learning sense.
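How parameter counts of this kind arise from layer dimensions can be sketched for a simple dense network; the layer widths below are hypothetical and do not correspond to the architectures evaluated in this study.

```python
def mlp_param_count(layer_sizes):
    """Trainable parameters of a dense network: weights plus biases per layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical layer widths, chosen only to show how counts grow with width
# and depth: a shallow narrow network versus a deeper, wider one.
small = mlp_param_count([24, 64, 1])
large = mlp_param_count([24, 256, 256, 256, 1])
print(small, large)

# The complexity ratio reported in the text, from the Table 5 counts.
print(round(490770 / 34764, 1))
```

The quadratic weight terms (n_in * n_out) dominate, which is why widening hidden layers inflates parameter counts far faster than adding input features.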
Training time comparison results reveal significant differences in computational efficiency across model categories (Table 6). Traditional machine learning methods demonstrate superior training speed, with XGBoost (3.49 s) and Random Forest (15.85 s) completing training within 20 s. Among the deep learning models, CNN (24.25 s) and CNN-LSTM (27.97 s) achieved relatively high training efficiency, while the sequence-to-sequence models required significantly longer training times.
Convergence characteristic analysis reveals differences in convergence speed and stability among the deep learning models during training (Table 7). CNN-LSTM exhibited the fastest convergence, requiring only 28.0 training epochs on average to achieve optimal performance. Sequence-to-sequence architectures demonstrated slower convergence patterns, with Seq2Seq-LSTM requiring an average of 65.3 training epochs.
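Epochs-to-optimal-performance statistics of this kind are typically obtained with patience-based early stopping; a minimal sketch, assuming a patience of five epochs and an illustrative validation-loss curve (both hypothetical, not the training configuration of this study):

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which early stopping would halt training:
    when the validation loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses)

# Illustrative validation-loss curve: improvement stalls after epoch 4, so
# training halts once five further epochs pass without a new best loss.
losses = [0.9, 0.5, 0.4, 0.35, 0.36, 0.37, 0.36, 0.38, 0.37]
print(early_stopping_epoch(losses, patience=5))
```

The epoch of the best validation loss (here, epoch 4), rather than the stopping epoch, is what determines the restored model weights.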
Comprehensive analysis of the trade-off relationship between computational cost and model performance reveals distinct efficiency-accuracy characteristics across model categories. Traditional machine learning methods provide exceptional training efficiency but limited predictive performance in handling complex temporal patterns. Among deep learning models, CNN-based architectures (CNN, CNN-LSTM) offer a favorable balance between accuracy and computational cost. Although N-BEATS requires substantial computational resources (average training time 334.42 s), its superior long-term prediction stability may justify the increased computational investment in applications requiring extended prediction horizons.
From a practical application perspective, CNN-LSTM emerges as the most computationally efficient deep learning model, combining fast convergence (28.0 epochs) and reasonable training time (27.97 s) while maintaining competitive predictive performance. Traditional methods retain advantages in scenarios prioritizing training speed over prediction accuracy. In practical applications, model category selection should consider specific trade-offs between computational constraints and prediction requirements.
3.4. Limitations and Transferability
Data quality. Approximately 35% of the records at ZK1 and ZK2 required regression-based reconstruction prior to model training. Three considerations limit the practical impact of this constraint on the reported conclusions. First, the missing periods were concentrated in hydrologically stable intervals (July–September 2021 and January–March 2022), with low precipitation and minimal water-table variation, so the reconstructed values represent low-variance segments rather than the high-energy recharge events that most challenge model performance. Second, reconstruction accuracy was high for both wells (R2 > 0.90, Table 2), supported by strong inter-well correlations (>0.80, Table 1). Third, and most critically, the test set (April–July 2023) on which all performance metrics are based consists of more than 90% original, uninterpolated observations, ensuring that the comparative evaluation reflects genuine model behaviour. Nevertheless, future studies with continuous long-term monitoring records would eliminate this constraint and enable more robust assessment of model performance under extreme recharge conditions.
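Regression-based reconstruction of this kind can be sketched as ordinary least squares between a donor well and a target well over their overlapping record; the well values and the assumed linear relationship below are illustrative only, not the actual monitoring data.

```python
def fit_linear(x, y):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    a = cov / var
    return a, my - a * mx

# Hypothetical overlapping records at a donor well and a gappy target well;
# here the target is constructed as exactly 2*donor + 0.1 for illustration.
donor  = [10.0, 10.4, 10.8, 11.2, 11.6]
target = [20.1, 20.9, 21.7, 22.5, 23.3]
a, b = fit_linear(donor, target)

# Fill the target well's gaps from concurrent donor observations.
gap_donor = [10.2, 11.0]
filled = [a * d + b for d in gap_donor]
print(round(a, 3), round(b, 3), [round(f, 2) for f in filled])
```

In practice the fit would be validated on held-out overlapping data (as the R2 > 0.90 values in Table 2 report) before the filled segments are used for training.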
Interpretability. The present study concludes that hydrogeological complexity exerts a dominant control on model predictive skill, exceeding the influence of model architecture. This inference is grounded in a consistent cross-site performance gradient that persists across all nine architecturally distinct models, a pattern more parsimoniously explained by site-specific hydrogeological complexity than by any model-specific factor. The systematic and architecture-independent nature of this cross-site evidence provides strong indirect support for the conclusion. Nevertheless, no formal quantitative attribution analysis was conducted in the present study. Future work will couple a MODFLOW-based physically distributed groundwater model with SHAP analysis to explicitly partition the relative contributions of hydrogeological and architectural factors to prediction uncertainty, thereby providing direct mechanistic validation of the hydrogeological dominance finding.
Transferability. All findings derive from a single karst catchment in subtropical southern China, with a well-developed conduit–matrix system and pronounced monsoon seasonality. The transferability of the reported conclusions to karst systems with contrasting geological structures, recharge mechanisms, and climatic regimes remains to be validated. Future research could advance transferability through three directions: replicating the evaluation framework in karst systems with different conduit development and climate regimes; developing transfer learning protocols that leverage pre-trained weights from data-rich sites; and embedding physically interpretable parameters into physics-informed hybrid architectures to improve cross-site generalisation without requiring extensive local calibration.
4. Conclusions
This study addresses a fundamental question in karst hydrology: whether model architecture or hydrogeological complexity is the primary determinant of groundwater level prediction feasibility. By systematically evaluating nine ML and DL models across three hydrogeologically distinct monitoring sites within a unified multidimensional framework, we demonstrate that aquifer complexity exerts a dominant and consistent control on predictive skill that outweighs architectural differences, a finding with direct implications for how model selection should be approached in heterogeneous karst environments. Three principal conclusions emerge from this work.
First, among the evaluated architectures, the Transformer achieves the highest single-step prediction accuracy, benefiting from its self-attention mechanism that effectively captures multi-scale temporal dependencies. N-BEATS demonstrates superior long-term stability in multi-step prediction across all sites, suggesting that its stacked block architecture with backcast–forecast decomposition is particularly well-suited to systems with prolonged hydrological memory. CNN-LSTM achieves the best balance between prediction accuracy and computational cost, making it the most practically deployable option for engineering applications.
Second, and more fundamentally, hydrogeological complexity exerts a dominant control on predictive skill that systematically outweighs differences arising from model architecture. This cross-site performance contrast—persisting consistently across all nine model families—indicates that aquifer complexity, rather than model choice, is the primary constraint on prediction feasibility. Consequently, model selection for karst groundwater prediction should be treated as a hydrogeological problem first and an algorithmic problem second.
Third, this study presents the first application of N-BEATS to karst groundwater level forecasting and proposes a replicable multi-dimensional evaluation framework that can serve as a standardised paradigm for intelligent modelling of complex hydrological systems.
These findings collectively advocate a shift from one-size-fits-all model selection toward a site-adaptive, geology-informed modelling paradigm. Future research should prioritise physics-informed hybrid frameworks that embed hydrological prior knowledge into model design, multi-source data integration, and cross-basin transferability assessments to further advance intelligent modelling of heterogeneous karst systems.