3.1. Data Prediction
Using the aforementioned data combinations, water levels in boreholes O2-4, O2-6, XO5, and KM were predicted by applying their partitioned datasets to a Bidirectional LSTM (BiLSTM) network. To prevent the influence of post-accident mitigation measures (e.g., grouting) on model training, water level data from these boreholes were divided into training and testing sets based on the timeline of the water inrush incident. The training period spans from 1 April 2020 to 23 September 2021 and the testing period spans from 23 September 2021 to 1 December 2021.
The prediction model employed in this study is a BiLSTM network comprising a single bidirectional LSTM layer with 18 units, followed by batch normalization. The architecture also includes three fully connected layers with 16, 8, and 1 neuron(s), respectively, all using linear activation functions. Model training was conducted using the Adam optimizer (learning rate = 0.001, epsilon = 1 × 10⁻⁵), with mean squared error (MSE) as the loss function. Key training parameters included a batch size of 15, 50 epochs, a 10% validation split, and early stopping criteria (patience = 20, min_delta = 0.001) with restoration of the best weights.
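The early stopping rule quoted above (patience = 20, min_delta = 0.001, best weights restored) can be sketched in plain Python. This is a minimal illustration of the usual rule, not the study's training code; the function name and the validation-loss curve are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=20, min_delta=0.001):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    A loss counts as an improvement only if it beats the best value so far
    by more than min_delta. Training stops once `patience` consecutive
    epochs pass without such an improvement; the weights from best_epoch
    would then be restored.
    """
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Hypothetical curve: ten improving epochs, then a flat plateau up to epoch 50.
losses = [1.0 - 0.05 * i for i in range(10)] + [0.55] * 40
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

With this curve, training would halt 20 epochs after the last improvement rather than running the full 50 epochs.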
Using borehole XO5 as an example, Figure 6 illustrates the training and validation loss curves, which converge sufficiently after 50 epochs, indicating stable training. Experiments showed that increasing prediction timesteps beyond 12 did not significantly improve accuracy (as measured by RMSE and MAE) but considerably increased computational cost. Thus, the final model uses 15 historical timesteps (equivalent to 45 min of data) to predict the water level for the next 5-minute interval.
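The windowing step described above, where 15 historical timesteps are used to predict the next value, can be sketched as follows. This is a univariate illustration with a hypothetical helper name and synthetic water levels; the multivariate inputs used in the study would add further columns per timestep.

```python
def make_windows(series, n_steps=15):
    """Split a series into (window, next_value) pairs for supervised training."""
    inputs, targets = [], []
    for start in range(len(series) - n_steps):
        inputs.append(series[start:start + n_steps])  # n_steps past observations
        targets.append(series[start + n_steps])       # the value to be predicted
    return inputs, targets


# Hypothetical water levels (m): 20 samples yield 5 training pairs.
levels = [100.0 + 0.01 * k for k in range(20)]
X, y = make_windows(levels)
```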
Model performance was evaluated using the coefficient of determination (R²) and mean absolute error (MAE) across various cell configurations. As illustrated in Figure 7, the multivariate model (denoted as R²(2) and MAE(2)) consistently exhibited superior performance compared to the univariate model. The optimal configuration was uniquely attained with 18 cells, where R²(2) reached its maximum while MAE(2) simultaneously reached its minimum, a dual optimum that was not observed at other cell counts such as 10 or 12 cells.
A comprehensive set of eight evaluation metrics was selected to thoroughly evaluate the model’s performance in water-level prediction, addressing key aspects such as accuracy, stability, and operational safety (Table 2). These metrics were carefully chosen to align with the specific requirements of groundwater monitoring and early-warning systems for water inrush. R² was employed to evaluate the overall goodness-of-fit between predicted and observed water levels. RMSE and MAE were used to quantify absolute error magnitudes, with RMSE emphasizing larger deviations often critical in anomaly detection. MARE provided a relative error measure intuitive for interpreting percentage deviations in hydraulic head. MSRE complemented this by offering enhanced sensitivity to outliers, which is vital given the abrupt changes characteristic of inrush precursors. MaxARE was included to monitor worst-case prediction errors, essential for ensuring reliability in safety-critical forecasting. MBE helped identify any systematic bias in predictions (over- or under-estimation of water levels) that could influence risk decisions. Finally, SD quantified the consistency of prediction errors, reflecting the model’s stability under continuous operational deployment in dynamic mining environments [58].
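As a rough illustration of how these eight metrics can be computed, the sketch below uses commonly used definitions. These are assumptions: the exact formulas in Table 2 may differ, in particular the sign convention of MBE (taken here as predicted minus observed) and the population form of SD. The sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Compute the eight metrics under commonly used (assumed) definitions."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]      # predicted minus observed
    rel = [abs(e) / abs(o) for e, o in zip(errors, observed)]  # needs nonzero observations
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    mbe = sum(errors) / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MARE": sum(rel) / n,
        "MSRE": sum(r * r for r in rel) / n,
        "MaxARE": max(rel),
        "MBE": mbe,
        "SD": math.sqrt(sum((e - mbe) ** 2 for e in errors) / n),
    }


# Hypothetical observed and predicted water levels (m).
obs = [10.0, 11.0, 12.0, 13.0]
pred = [10.5, 10.5, 12.5, 12.5]
m = regression_metrics(obs, pred)
```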
A comparative analysis of four neural network models (BiLSTM, LSTM, GRU, and RNN), each trained over 50 runs using 15 time-step inputs and 18 hidden units (Table 3), reveals distinct performance differences across the four boreholes. BiLSTM achieves the highest predictive accuracy in boreholes O2-4 and O2-6, with R² values of 0.980 and 0.978, and MAE values of 0.019 and 0.029, respectively. It also attains the lowest values in MSE, MAE, MARE, and MSRE across most boreholes, reflecting high precision and robustness against outliers.
A slight negative MBE, such as −0.178 in borehole KM, indicates a conservative prediction tendency that reduces the risk of missed alarms—an essential trait for operational safety. Furthermore, BiLSTM exhibits among the smallest maximum absolute errors (erMAX) and low standard deviation (SD), emphasizing its exceptional stability for continuous groundwater monitoring.
Although LSTM shows competitive accuracy in borehole KM (R² = 0.965, MAE = 0.189), its overall performance is generally inferior to that of BiLSTM. GRU yields inconsistent results across boreholes, such as a relatively low R² of 0.739 in XO5, while RNN performs worst, with high MAE and low R², particularly in XO5 and KM.
In summary, BiLSTM demonstrates superior accuracy, reliability, and stability, making it highly suitable for real-time water-level forecasting and early-warning systems in water inrush prevention.
During actual water inrush incidents, the model proved highly responsive. For example, a sudden drop of 1.19 m was observed in borehole XO5, during which predicted values consistently exceeded actual measurements with a visible time lag. A peak prediction error of 0.768 m occurred at 11:55 on 25 October 2021 (Figure 8). This behavior stems from the disruption of natural periodic aquifer fluctuations under sudden hydraulic forcing, resulting in anomalous variations that are effectively captured by BiLSTM as abrupt increases in prediction error.
In conclusion, while water level prediction alone is insufficient for reliable forecasting of major water inrush events, the integration of specialized anomaly detection methods—such as the proposed BiLSTM-based framework—can significantly enhance early warning capabilities and hazard preparedness.
3.2. Anomaly Detection
Given that neither the water level nor the water inflow data followed a normal distribution, the 3σ method failed to identify any anomalies. Considering the atypical nature of water-level data after April 2021 and based on expert knowledge and common threshold-setting practices in the coal mining industry, a threshold range of 0.1 to 0.4 times the maximum absolute variation rate of borehole XO5 observed between April 2020 and April 2021 was established [50]. Data points exceeding this variation rate threshold were labeled as anomalous. In the identification of genuine anomalies, the following scenarios are explicitly classified as anomalous conditions: firstly, a simultaneous sharp rise in water levels across multiple boreholes during the rainy season, accompanied by rapid hydraulic increase; secondly, a synchronized rapid decline in water levels observed across several boreholes within a short period during water inrush events; thirdly, throughout the water control phase, such as during the implementation of grouting and water reduction engineering, persistent declining and fluctuating water levels resulting from the continuous intervention of such engineering measures. Additionally, since grouting engineering exerts a continuous and complex influence on water level dynamics, it is often difficult to accurately delineate the specific time periods affected using empirical methods. Therefore, the entire grouting construction phase was excluded from subsequent anomaly indicator scoring in the evaluation process. A comparison between anomalies identified under different thresholds and the actual anomaly events is illustrated in Figure 9.
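The labeling rule described above can be sketched as follows: a sample is flagged when its absolute change exceeds a chosen fraction of the largest absolute change seen in a baseline period. This is a minimal sketch; the function name and data are hypothetical, and the variation rate is assumed to be the absolute difference between consecutive samples.

```python
def label_anomalies(levels, baseline_levels, factor=0.3):
    """Flag steps whose absolute change exceeds factor * max baseline change."""
    def abs_rates(values):
        return [abs(b - a) for a, b in zip(values, values[1:])]

    threshold = factor * max(abs_rates(baseline_levels))
    # The first sample has no predecessor, so it is labeled normal.
    flags = [False] + [rate > threshold for rate in abs_rates(levels)]
    return threshold, flags


# Hypothetical water levels (m); the largest baseline step is 0.5 m.
baseline = [50.0, 50.1, 50.0, 50.5, 50.4]
monitored = [50.4, 50.45, 50.2, 50.21]
threshold, flags = label_anomalies(monitored, baseline)
```

Varying `factor` between 0.1 and 0.4 reproduces the kind of threshold sweep compared in the text: smaller factors flag more samples, larger ones flag fewer.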
The selection of evaluation metrics was guided by the critical need to minimize both missed detections and false alarms in mine water inrush early warning. Missed anomalies may lead to catastrophic safety failures, while false alerts can trigger unnecessary emergency responses, resulting in operational disruptions and economic losses.
To holistically evaluate model performance under these constraints, key metrics including Recall, Precision, F1-Score, False Positive Rate (FPR), Specificity, Matthews Correlation Coefficient (MCC), True Positives (TP), and False Positives (FP) were adopted (Table 4). These indicators collectively provide insights into the model’s ability to detect true anomalies while controlling false alarms.
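For reference, the rate-based metrics listed above follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the counts are hypothetical and are not taken from the study, and no guard against zero denominators is included.

```python
import math


def detection_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Recall": recall,
        "Precision": precision,
        "F1": 2 * precision * recall / (precision + recall),
        "FPR": fpr,
        "Specificity": 1.0 - fpr,
        "MCC": (tp * tn - fp * fn) / denom,
    }


# Hypothetical counts for a heavily imbalanced data set.
scores = detection_metrics(tp=84, fp=1, fn=16, tn=899)
```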
Because mine water inrush early warning aims to capture true anomalies (true positives) to the greatest extent while minimizing false alarms (false positives), the selection of an appropriate threshold is critical. As illustrated in Figure 9, a threshold set at or below 0.25 times the maximum variation rate results in a higher number of false alarms. Conversely, when the threshold is raised to 0.35 times or above, the missed detection rate increases significantly. In comparison, a threshold defined as 0.3 times the maximum absolute variation rate demonstrates a better balance between controlling false alarms and minimizing missed detections. This threshold accurately identifies two key genuine anomaly events: one is a more rapid synchronous rise occurring during the general water level increase in the rainy season, and the other is a synchronous rapid decline during the water inrush incident, which deviates from normal dynamic patterns. Furthermore, following the initiation of the third phase involving grouting and water reduction engineering, the 0.3-times threshold is also the first to trigger anomaly alerts, indicating its high sensitivity to sustained abnormal water level responses induced by continuous engineering interventions. This aligns well with the practical requirement for persistent anomaly detection during this phase.
Among the tested thresholds, 0.3 was identified as optimal. At this value, Recall reached 0.840, indicating high sensitivity to true anomalies, while Precision was 0.992 and FPR was only 0.00049—reflecting minimal false alarms. The MCC of 0.907 confirms well-balanced classification performance. With only six false positives across all samples, this threshold effectively meets the operational objective of maximizing detection without introducing significant false alerts.
Lower thresholds increased FPR substantially, raising false alarm rates, whereas higher thresholds severely reduced Recall, increasing the risk of missed events. A threshold of 0.3 therefore represents the most suitable trade-off, ensuring both safety and operational continuity in mine water inrush early warning.
From the BiLSTM prediction results, the following variables can be obtained: the borehole water level y1, the predicted water level y2, the prediction error (AE), the absolute rate of change in prediction error (ΔAE), the absolute rate of change in water level data (Δy1), the absolute rate of change in prediction data (Δy2), and other data. Combining these variables yields six combinations (Table 5).
Selecting optimal input variables is critical for enhancing anomaly detection performance. Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC) analysis were employed to evaluate classification model performance across variable combinations.
For the XO5 borehole, the combination (y1, y2, AE, Δy1, Δy2, ΔAE), designated as Combination 3, achieved the maximum AUC for the VAE model (Figure 10). Its superior performance can be attributed to the comprehensive multi-perspective input it provides: the raw water levels (y1, y2) convey the system’s current state; the prediction error (AE) reflects model deviation; and the temporal rates of change (Δy1, Δy2, ΔAE) capture dynamic trends. This integration of instantaneous, residual, and derivative information enables the VAE to better learn normal patterns and identify subtle anomalies.
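The six-variable input of Combination 3 can be assembled from the observed and predicted series as sketched below. Two assumptions are made for illustration: AE is taken as the absolute difference between y1 and y2, and each rate of change is taken as the absolute difference between consecutive samples. The function name and values are hypothetical.

```python
def build_features(y1, y2):
    """Return one [y1, y2, AE, d_y1, d_y2, d_AE] row per time step after the first."""
    ae = [abs(a - b) for a, b in zip(y1, y2)]  # assumed: absolute prediction error
    rows = []
    for t in range(1, len(y1)):
        rows.append([
            y1[t],
            y2[t],
            ae[t],
            abs(y1[t] - y1[t - 1]),  # absolute change in observed level
            abs(y2[t] - y2[t - 1]),  # absolute change in predicted level
            abs(ae[t] - ae[t - 1]),  # absolute change in prediction error
        ])
    return rows


observed = [10.0, 10.0, 9.0]    # hypothetical y1 (m)
predicted = [10.0, 10.5, 10.0]  # hypothetical y2 (m)
features = build_features(observed, predicted)
```

Each row would then serve as one input vector for the anomaly detector.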
This variable group was subsequently used for anomaly detection in all four boreholes. The detection results were validated using a hydraulic threshold method, with the threshold value set at 0.3 times the maximum absolute water-level change (|Δh|max).
Figure 11 shows the results of the VAE model’s detection of anomalies in the XO5 borehole, where combination 3 performs best in terms of Recall, Precision, and F1 Score, especially with a Recall value of 1, indicating the model’s high ability to identify anomalous samples. Despite the imbalance in the anomaly detection data, the high Precision and F1 Score of combination 3 further confirms its superior performance in anomaly detection.
To identify the most appropriate detection model, the Combination 3 variables from the XO5 prediction results were used as input data for six candidate detectors: VAE, autoencoder (AE), one-class support vector machine (OCSVM), local outlier factor (LOF), robust covariance estimation (Elliptic Envelope, EE), and Isolation Forest (iForest). The models were trained on the training set; their results, together with the anomalies determined by the threshold method, are shown in Figure 12.
In detecting water level anomalies under precipitation and mining impacts, iForest suffers from missed detections, while OCSVM, LOF, and EE exhibit higher false positive counts (EE being the most sensitive). AE and VAE demonstrate the best performance, with minimal false alerts and the highest accuracy.
Four standard evaluation metrics (Accuracy, F1-Score, Recall, and Precision) are used to evaluate the detection performance of the different models. Because normal samples far outnumber abnormal ones in the water level data, anomaly detection places more emphasis on F1-Score, Precision, and Recall than on Accuracy alone.
Figure 13 shows the performance of different methods for detecting abnormal water levels in four boreholes.
Based on a comprehensive analysis of multiple performance metrics, the VAE demonstrates significant advantages in anomaly detection, achieving the highest scores in both Accuracy (0.99) and Precision (0.81), indicating high overall classification accuracy with a very low false alarm rate. Its F1 Score (0.70) and Recall (0.61) further reflect a well-balanced performance in identifying both positive and negative instances. In contrast, although the AE attains relatively high Accuracy (0.92), its exceptionally low Precision (0.12) and F1 Score (0.22) reveal a high false alarm propensity. The OCSVM shows strong Recall (0.89) and high sensitivity, yet its severely limited Precision (0.066) leads to significant false positive issues. The LOF model delivers mediocre performance across all metrics (F1 = 0.11, Precision = 0.066), rendering it inadequate for practical applications. Both EE and iForest exhibit complete performance failure, with near-zero F1 Scores (0.0078, 0.10) and extremely low Precision values (0.0042, 0.053), indicating an almost total inability to effectively distinguish anomalous samples. In summary, the VAE is the only method that combines high accuracy and strong robustness, significantly outperforming all other compared algorithms.
Using Combination 3 and, for calibration, a threshold of 0.3 times the historical maximum absolute change in water level, the VAE detected anomalies across the four boreholes (with post-November 2021 data excluded for stability), demonstrating its effectiveness in stochastic water level scenarios (Figure 12 and Figure 14).
The anomaly detection results for the three boreholes are divided into four stages: pre-inrush, inrush, and post-inrush periods. The VAE model can effectively detect anomalies at all stages, and these anomalies correspond to those determined by the threshold method.
The VAE model demonstrated high accuracy and sensitivity in anomaly detection across the four stages for the four boreholes. By comparing the false alarm situations of each borehole over time, it was found that there were instances of multiple boreholes issuing warnings simultaneously, mutual warnings, and false alarms in one borehole corresponding to real anomalies in others. This occurs because the formation of a cone of depression during water inrush establishes spatial correlations between borehole water levels, leading to synchronized anomalies in the detection results that provide mutual verification.
In summary, an anomaly in a single borehole's water level is not sufficient to confirm a genuine anomaly at that time. The anomalies of all four boreholes must be considered together and analyzed further to obtain more reliable detection results and a more reliable basis for early warning of mine water inrush accidents.
3.3. Comprehensive Water Inrush Warning
When a water inrush occurs in a mine, the groundwater level near the outburst point drops rapidly, forming a cone of depression (Figure 15). Spatially, borehole water levels change in response to the inrush, and the change is more pronounced in boreholes closer to the inrush point. The higher the correlation between a borehole and the water inrush event, the better the anomalies detected in that borehole reflect the real inrush. Therefore, the anomalies detected in all boreholes should be considered together to capture genuine water inrush anomalies more accurately.
In water inrush events, weights are assigned dynamically based on each borehole's response speed and fluctuation magnitude: boreholes with rapid or significant changes receive higher weights, and those with delayed or minor variations receive lower weights. The weight assigned to each borehole is calculated as the proportion of its water level change rate relative to the total water level change rate across all boreholes. The formula for calculating the weight is as follows:

$$\Delta h_i = \frac{H_{\mathrm{before}} - H_{\mathrm{after}}}{\Delta t}, \qquad W_i = \frac{\Delta h_i}{\sum_{j=1}^{4} \Delta h_j}$$

In the formula, Δhᵢ represents the water level change rate (unit: m/min) before and after the water inrush event, calculated as the difference between the water level before the inrush (Hbefore, unit: m) and after the inrush (Hafter, unit: m) divided by the time interval (Δt, min); Wᵢ denotes the dimensionless weight coefficient derived from each borehole’s Δhᵢ value as a proportion of the total Δhᵢ sum across all four boreholes. The calculated weight values are presented in Table 6.
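The weight calculation described above can be sketched as follows. The water levels and time interval are hypothetical and are not the values behind Table 6; the function name is likewise illustrative.

```python
def borehole_weights(h_before, h_after, dt_min):
    """Weight each borehole by its share of the total water level change rate."""
    rates = {
        name: (h_before[name] - h_after[name]) / dt_min  # change rate in m/min
        for name in h_before
    }
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}


# Hypothetical water levels (m) before and after an inrush, 60 min apart.
before = {"O2-4": 100.0, "O2-6": 100.0, "XO5": 100.0, "KM": 100.0}
after = {"O2-4": 99.5, "O2-6": 99.0, "XO5": 98.0, "KM": 99.5}
weights = borehole_weights(before, after, dt_min=60.0)
```

Because every borehole shares the same time interval here, the weights reduce to each borehole's share of the total drop, and they sum to 1.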
The Comprehensive Alert Value (CAV) is derived through weighted integration of anomalies from the four boreholes, with values normalized to the range [0,1]. The thresholds for the warning levels were determined from the distribution patterns of historical normal monitoring data and validated against documented water inrush events:
Normal (CAV ≤ 0.4): 86.7% historical baseline coverage;
Low risk (0.4 < CAV ≤ 0.6): Coordinated micro-fluctuations;
Medium risk (0.6 < CAV ≤ 0.8): Engineered/rainfall-induced hydraulic coordination;
High risk (CAV > 0.8): Diagnostic of inrush events or extreme hydrological responses.
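A minimal sketch of how a CAV could be computed and mapped to the four warning levels is given below. It assumes that each borehole contributes a binary anomaly flag and that the weights sum to 1, so that the CAV stays within [0,1]; the exact integration used in the study may differ, and the weights shown are hypothetical.

```python
def comprehensive_alert_value(flags, weights):
    """Weighted sum of per-borehole anomaly flags (0 = normal, 1 = anomalous)."""
    return sum(weights[name] * flags[name] for name in weights)


def risk_level(cav):
    """Map a CAV in [0, 1] to the warning levels listed above."""
    if cav <= 0.4:
        return "normal"
    if cav <= 0.6:
        return "low"
    if cav <= 0.8:
        return "medium"
    return "high"


weights = {"O2-4": 0.125, "O2-6": 0.25, "XO5": 0.5, "KM": 0.125}  # hypothetical
flags = {"O2-4": 0, "O2-6": 1, "XO5": 1, "KM": 0}
cav = comprehensive_alert_value(flags, weights)
level = risk_level(cav)
```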
As shown in Figure 16, the comprehensive early warning results of the water inrush incident successfully identified three distinct types of anomalies: a rise in water level due to heavy precipitation, a rapid decline caused by the water inrush itself, and subsequent fluctuations resulting from post-accident mitigation measures such as grouting and plugging. These anomalies were detected with a low false alarm rate and triggered warnings at different risk levels. This demonstrates that the comprehensive early warning approach not only enables early detection of water inrush events but also effectively responds to diverse abnormal conditions. Compared to single-borehole anomaly detection, this method offers higher accuracy and sensitivity in identifying water inrush.
For the water inrush incident at Xin’an Mine (03:35, 25 October 2021), Figure 17 compares the warning times and risk levels between the cascaded BiLSTM–VAE model and a conventional threshold-based method.
As illustrated, the cascaded BiLSTM–VAE model demonstrates superior anomaly detection and early warning performance for the Xin’an Mine case compared to the conventional method. The first medium-risk warning was issued at 17:00 on 24 October 2021—10 h and 35 min prior to the accident—and the first high-risk warning at 18:05 on the same day, 9 h and 30 min before the inrush. These warnings were issued approximately 7 h 5 min and 6 h earlier, respectively, than the first alert from the threshold method, thus allowing substantially more time for implementing safety measures before the water inrush occurred.
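The lead times quoted above follow directly from the timestamps given in the text, as the short check below shows. Only the helper name is introduced here for illustration.

```python
from datetime import datetime

inrush = datetime(2021, 10, 25, 3, 35)           # water inrush incident
medium_warning = datetime(2021, 10, 24, 17, 0)   # first medium-risk warning
high_warning = datetime(2021, 10, 24, 18, 5)     # first high-risk warning


def lead_time(warning, event):
    """Return the lead time before the event as (hours, minutes)."""
    minutes = int((event - warning).total_seconds() // 60)
    return divmod(minutes, 60)


medium_lead = lead_time(medium_warning, inrush)  # (10, 35)
high_lead = lead_time(high_warning, inrush)      # (9, 30)
```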
3.4. Result Comparison
To systematically evaluate the performance of different models in mine water level prediction and water inrush early warning tasks, this study selects two types of prediction models for comparison: one is classical statistical models, including Seasonal Autoregressive Integrated Moving Average (SARIMA) and Holt–Winters Exponential Smoothing (HWES), which are characterized by clear structure and strong interpretability, suitable for modeling stationary time series; the other is deep learning models, including Gated Recurrent Unit (GRU) and Bidirectional Long Short-Term Memory Network (BiLSTM). Meanwhile, to deeply investigate the design necessity of the “prediction-detection” cascaded framework, this study introduces a Variational Autoencoder (VAE) as a comparison baseline that only performs anomaly detection, aiming to validate the contribution of the prediction component to enhancing anomaly detection performance through ablation comparison.
The water level prediction performance of each model at four boreholes (O2-4, O2-6, XO5, KM) is shown in Table 7. Owing to its bidirectional gated architecture, the BiLSTM model can integrate temporal context information and effectively model the complex nonlinear dynamics of mine water level influenced by the coupling effects of historical trends and external factors (such as rainfall, water inrush events, and grouting engineering), thus achieving the highest prediction accuracy (R²: 0.94–0.98) and the lowest errors (MAE: 0.019–0.265, MSE: 0.025–0.348) across all boreholes. As a lightweight variant of LSTM, GRU shows better prediction performance (R²: 0.739–0.972) than classical models in most boreholes; however, its simplified gated structure has limited capability in characterizing extreme nonlinear processes, leading to a significant increase in prediction deviation at borehole XO5 (R² = 0.739). SARIMA underwent rigorous series order determination: ACF diagnosis indicated that the original series was non-stationary (Figure 18a), but became stationary after first-order differencing (Figure 18c). Combined with PACF truncating after lag 2 (Figure 18b), the optimal order was determined as (2,1,0). However, its linear modeling nature struggles to adapt to the nonlinear and non-stationary characteristics of water level data, resulting in significantly higher prediction errors (MAE: 0.246–0.634) than neural network models. The HWES model, reliant on fixed seasonal patterns and smoothing coefficients, performed poorly when applied to water level series lacking significant seasonal patterns and containing sudden fluctuations, leading to prediction failure (negative R² values for some boreholes); consequently, it was excluded from subsequent analysis.
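The order-determination step for SARIMA relies on comparing the autocorrelation function (ACF) before and after first-order differencing. The sketch below shows that comparison on a synthetic trending series; the data and helper names are hypothetical, and the sample ACF estimator shown is the common biased form.

```python
def difference(series):
    """First-order differencing: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return [
        sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)) / var
        for lag in range(max_lag + 1)
    ]


# Hypothetical trending series: linear drift plus a small alternating term.
levels = [0.1 * t + (0.05 if t % 2 else -0.05) for t in range(200)]
acf_raw = acf(levels, max_lag=5)
acf_diff = acf(difference(levels), max_lag=5)
```

On the raw series the ACF decays very slowly, which is the usual sign of non-stationarity; after differencing the slow decay disappears.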
To further investigate the impact of model architecture on water inrush early warning effectiveness, this study constructed hybrid detection frameworks by integrating various prediction modules (SARIMA, GRU, BiLSTM) with a variational autoencoder (VAE) for anomaly detection and compared them against a pure VAE baseline. Both false alarms and missed detections can lead to severe consequences in mine safety management, and water inrush events are extremely rare, often accounting for only about 2% of the total samples, which renders accuracy (ACC) a misleading and ineffective metric due to extreme class imbalance. A comprehensive set of evaluation metrics, previously established in Table 4 specifically for high-stakes and imbalanced anomaly detection, was therefore employed to ensure a robust and meaningful assessment. The complete evaluation results are presented in Table 8.
The BiLSTM-VAE model demonstrates superior overall performance, excelling across a comprehensive set of evaluation metrics. It achieves an optimal balance between a high true positive rate (Recall: 0.846) and a high positive predictive value (Precision: 0.958), yielding an F1-Score of 0.898. This result reflects the model’s strong capability to accurately detect actual inrush events while substantially reducing false alerts. Such a balance is critical for operational safety in mining environments and is further supported by an exceptionally low false positive rate (FPR: 0.005) and high Specificity (0.995), indicating a robust capacity to avoid false alarms during normal operational periods. The Matthews Correlation Coefficient (MCC), a reliable metric for evaluating classification performance under imbalanced conditions, reaches 0.887, further confirming the model’s overall robustness and discrimination capability.
In comparison, the GRU-VAE and ARIMA-VAE models exhibit notable limitations. The GRU-VAE model shows significantly reduced performance, with Recall and Precision values of 0.527 and 0.439, respectively, indicating pronounced vulnerabilities to both missed detections and false alarms. These shortcomings are quantitatively reflected in its elevated FPR (0.092) and low MCC (0.403), suggesting that its simplified gating mechanism inadequately captures the complex precursor patterns of water inrush, thereby producing residual signals that hinder discriminative anomaly detection. The ARIMA-VAE model, while marginally outperforming GRU-VAE with Recall and Precision around 0.605 and 0.633 and an MCC of 0.568, remains constrained by its inherent linearity assumption. Although it achieves a comparatively lower FPR (0.048) and higher Specificity (0.952), indicating some ability to suppress false alarms, its moderate Recall level underscores a persistent tendency to miss true inrush events, a limitation rooted in its inability to model nonlinear hydrogeological dynamics.
The pure VAE model, which operates without a preceding prediction module, falls substantially below acceptable levels across all metrics (Recall: 0.034, Precision: 0.032, F1-Score: 0.033, MCC: −0.017) and does no better than random guessing. Given its comprehensively deficient detection capabilities, the pure VAE model is omitted from further visual analysis in Figure 19, which instead focuses on the three hybrid models that yielded practically meaningful anomaly detection performance.
The operational implications of these quantitative results are visually articulated in Figure 19. The BiLSTM-VAE model, for which intermediate detection results are further illustrated in Figure 16 and Figure 17, produces reliable early warnings 6 to 9.5 h preceding the inrush incident, with minimal false or missed alarms. This lead time provides a critical window for emergency response actions. In contrast, the ARIMA-VAE model fails to provide effective early warnings; although it produces fewer false alarms, it demonstrates a high missed alarm rate both before and during the inrush and grouting phases, severely limiting its practical utility. The GRU-VAE model, while achieving broader anomaly coverage with a lower missed alarm rate, triggers excessive false alerts and offers only minimal advance warning (approximately 3 h), which is operationally insufficient for effective hazard mitigation.
The significant performance disparity between the pure VAE model and the hybrid architectures underscores the essential role of the prediction module. The forecasting stage is instrumental in generating discriminative residual features that amplify the detectability of anomalous sequences, thereby establishing a necessary condition for the effectiveness of the two-stage early warning framework.