Article

Learning from Disturbances, Not Timestamps: A Dynamic Event-Driven Transformer for Rock Burst Forecasting

1
School of Resources, Environment and Safety Engineering, Hunan University of Science and Technology, Xiangtan 411201, China
2
Guizhou Ganxing Coal Industry Co., Ltd., Bijie 553300, China
*
Authors to whom correspondence should be addressed.
Processes 2026, 14(9), 1413; https://doi.org/10.3390/pr14091413
Submission received: 14 March 2026 / Revised: 13 April 2026 / Accepted: 20 April 2026 / Published: 28 April 2026
(This article belongs to the Section AI-Enabled Process Engineering)

Abstract

Rock bursts remain among the most destructive and unpredictable disasters in mining operations, yet existing deep learning methods face significant challenges in engineering practicality, noise robustness, and representing complex inter-event relationships for accurate prediction. To address these limitations, this paper proposes DynamiXFormer, a novel Transformer-based rock burst prediction model. Unlike traditional temporal prediction paradigms, DynamiXFormer establishes a direct mapping from working face advancement to rock burst risk, thereby linking predictions to mining-induced disturbances. The model integrates three innovative modules: an Adaptive Frequency Denoising module that suppresses noise while enhancing salient information from a frequency-domain perspective; a Relative Event Encoding module that constructs inter-event correlation graphs to capture physical attribute correlations and spatio-temporal dependencies; and a Dynamic Sparse Attention mechanism that introduces a strong inductive bias, enabling attention to focus on both local precursory patterns and global critical shifts. Experiments on real-world microseismic monitoring data demonstrate that DynamiXFormer significantly outperforms six baseline models across all prediction horizons and evaluation metrics. In short-term prediction tasks, it achieves a Mean Squared Error as low as 0.000518 and a Recall of up to 97.85%. Ablation studies further validate the individual effectiveness and synergistic effects of the proposed modules. This research provides a new methodology for rock burst early warning, with strong potential to enhance mine safety monitoring and engineering applications.

1. Introduction

Rock burst is a severe mine dynamic disaster where the quantitative assessment of risk levels presents a significant challenge [1]. This difficulty stems from the complex interaction of geological and mechanical factors, which makes the direct measurement of risk impossible with conventional physical sensors. To address this, research has focused on indirect measurement methodologies. Mainstream approaches rely on monitoring correlated physical phenomena, such as microseismic activity, geo-acoustic emissions, and electromagnetic radiation [2,3]. Among these, microseismic monitoring is the most widely applied technique, providing a rich stream of indirect data by tracking energy accumulation and release within the rock mass [4,5]. A key advantage of microseismic monitoring is its ability to delineate the three-dimensional spatial distribution of surrounding rock fractures, thereby enabling the visualization of the damage zone [6,7]. Recently, deep learning has emerged as a powerful tool for developing soft sensors capable of transforming this complex, indirect data into a quantitative risk assessment. By learning the underlying patterns from large-scale historical data, these data-driven models offer a new pathway for the quantification and early warning of rock bursts, moving beyond traditional statistical prediction [8]. Accordingly, the objective of this research is to develop a quantitative risk assessment framework to forecast rock burst risk, based on the indirect quantification of microseismic signals.
To address the limitations of traditional risk assessment methods for real-time dynamic warning of rock bursts, Qin et al. [9] proposed a predictive framework that combines ensemble learning with Bayesian optimization, aiming to improve the accuracy of risk prediction for rock bursts induced by high-energy seismic events. A key innovation of this study was the use of a sliding window method to construct a microseismic (MS) dataset correlated with future risk levels. The authors systematically addressed the data imbalance problem and enhanced the interpretability of the model’s decision-making process through SHAP analysis. Addressing two core challenges in rock burst prediction with neural networks—complex hyperparameter tuning and imbalanced training samples—Li et al. [10] introduced an intelligent prediction model based on a Feedforward Neural Network (FNN) integrated with Bayesian Optimization (BO) and the SMOTETomek technique. This research utilized BO for automatic optimization of the FNN’s architecture and employed the SMOTETomek hybrid sampling technique to resolve data imbalance, significantly enhancing the model’s predictive performance.
Confronting the complexity of rock burst prediction in hard coal mines, Wojtecki et al. [11] developed a method for assessing hazardous rock burst states using machine learning algorithms. The novelty of their work lies in using 11 parameters as model inputs, including an index that synthetically reflects the rock burst propensity of the coal seam and surrounding rock system, as well as vertical stress anomalies. They systematically evaluated the effectiveness of various algorithms, such as decision trees and multi-layer perceptrons, in distinguishing between destructive rock bursts and non-destructive mining tremors. To tackle the issues of existing warning methods failing to provide timely alerts and having ambiguous trigger conditions, Ma et al. [12] proposed a novel warning framework based on time-series prediction of Acoustic Emission (AE) parameters. The core contribution of this research is the use of a Long Short-Term Memory (LSTM) model to forecast the future evolution of AE parameters. It innovatively combines the Isolation Forest (IF) anomaly detection algorithm with the CRITIC objective weighting method to determine the warning thresholds and weights for multiple indicators, culminating in a quantifiable comprehensive warning coefficient (EC).
Yin et al. [13] introduced a new method for real-time prediction of rock burst intensity based on microseismic (MS) data. To preserve both the temporal and spatial features of microseismic events, their study innovatively converted MS sequences, which contain multiple source parameters, into two-dimensional numerical matrices as input. They developed an integrated CNN-Adam-BO algorithm, which uses Bayesian Optimization (BO) to tune a Convolutional Neural Network (CNN) model based on the Adam optimizer, and incorporated the SMOTE oversampling technique to handle data imbalance, ultimately achieving effective classification and prediction of rock burst intensity. To overcome the insufficient generalization ability of single machine learning models in rock burst prediction, Liu et al. [14] proposed a novel hybrid predictive model (NGO-CNN-BiGRU-Attention) based on intelligent optimization and deep learning techniques. Their research combines a Convolutional Neural Network (CNN), a Bidirectional Gated Recurrent Unit (BiGRU), and an Attention mechanism to fully leverage their respective advantages in feature extraction, sequence modeling, and focusing on key information. Furthermore, they employed the novel Northern Goshawk Optimization (NGO) algorithm to optimize the model’s hyperparameters.
Addressing the challenges of insufficient extraction of precursory information and the poor generalization of existing deep learning models for rock burst prediction in steeply inclined thick coal seams, Cui Feng et al. [15] proposed a predictive model based on deep learning and multivariate chaotic time series (PSR-LSTM). Their work provides a new methodology for the intelligent graded prediction of rock bursts in roadways driven through such geological formations. Qiao et al. [16] proposed a novel fusion model for rock burst risk prediction by combining physical indicators with deep learning. Their method introduces a coordinate attention mechanism to weight features derived from the spatial distribution of seismic sources. The model effectively integrates these multi-source features and achieves 91.1% overall accuracy in classifying different risk levels on the test set, demonstrating its viability for real-world engineering warning systems. To resolve the predicament of poor generalization in models based on physical indicators and insufficient feature extraction in data-driven models, Cao Anye et al. [17] introduced a time-series prediction method for rock bursts that is driven by the fusion of physical indicators and data-derived features.
Despite the considerable progress in this area, rock burst prediction based on microseismic monitoring still faces several limitations. To address these shortcomings, this study proposes a novel prediction model based on the Transformer architecture. The specific research gaps identified in previous studies and our corresponding contributions to overcome them are summarized in Table 1.

2. Prediction Model

Since its introduction by Vaswani et al. [18] in 2017, the standard Transformer model has become a cornerstone in the sequence-to-sequence domain, achieving revolutionary success, particularly in the field of natural language processing. Its core, the self-attention mechanism, enables the model to weigh the significance of all positions in the input data simultaneously, rather than processing them step by step like Recurrent Neural Networks (RNNs) or through fixed local receptive fields like Convolutional Neural Networks (CNNs). The Transformer consists of an encoder, responsible for feature encoding, and a decoder, which decodes the target sequence from the feature information provided by the encoder. However, directly applying the standard Transformer to specialized sequence prediction tasks, especially those with highly complex data such as microseismic monitoring records, reveals certain limitations, including an insufficient awareness of properties unique to the time series.
In contrast to the standard Transformer designed for general-purpose tasks, the DynamiXFormer model proposed in this paper is a deeply optimized and specialized framework. As illustrated in Figure 1, its core distinctions from the standard Transformer are as follows:
(1) The DynamiXFormer model replaces the Transformer’s full attention with a unique dynamic sparse attention mechanism. By fusing multiple sparse strategies—such as local, key-point, and global attention—in a manner driven by prior knowledge, it introduces an effective inductive bias tailored to the problem domain. This allows the model to efficiently focus on critical information points within the sequence, enhancing its ability to capture salient features while improving computational efficiency.
(2) To suppress the high-frequency noise inherent in microseismic data, the DynamiXFormer model introduces an adaptive frequency denoising module. This module serves as a general-purpose and efficient data processing unit that suppresses noise and enhances signal fidelity from a frequency-domain perspective.
(3) To capture the relative spatio-temporal relationships and physical attributes among microseismic events, the DynamiXFormer model incorporates a relative event embedding module within its embedding layer. This module constructs a global graph of inter-event relationships by capturing the relative evolutionary trends between events at different scales of variation, as well as their composite similarities.
(4) Unlike the standard Transformer, the DynamiXFormer model comprehensively adopts a Pre-Layer Normalization (Pre-LN) architecture. This results in more stable gradient flow, making the model better suited for the construction of deeper networks.

2.1. Adaptive Frequency Denoising Module

In the highly challenging task of rock burst prediction, microseismic data collected in coal mines embodies crucial information about the stress state and instability of the surrounding rock mass. This data is, however, inherently complex. Furthermore, monitoring equipment is highly susceptible to noise interference from mining machinery, personnel activities, and blasting operations. Consequently, microseismic data from underground mines is pervasively contaminated with high-frequency noise, the presence of which severely impedes a model’s ability to extract critical information [19].
To address this challenge, we have designed the Adaptive Frequency Denoise Block (AFDB), illustrated in Figure 2. The primary objective of AFDB is to create a versatile and efficient frequency-domain processing module capable of performing adaptive, fine-grained denoising on noisy time-series data, extending its applicability beyond just microseismic records. The design philosophy of this module is twofold: first, it dynamically learns the relative importance of different frequency scales, allowing the model to autonomously identify the scales most critical to the prediction target based on data characteristics. Second, it leverages the intrinsic properties of the data to automatically distinguish and differentially process noise components versus those containing salient information. Through this dual mechanism, AFDB enables the model to focus on the most relevant patterns for the prediction task, thereby enhancing both the robustness and accuracy of its forecasts.
Functioning as a general-purpose processing unit, the AFDB is embedded within both the encoder and decoder of the DynamiXFormer model. In the encoder, it serves as a pre-processing module positioned before the attention layers. Its primary role here is to perform initial denoising and enhance key features in the raw input data before the model engages in deeper information interactions, ensuring that subsequent attention mechanisms operate on a cleaner, more information-rich representation. Conversely, in the decoder, the AFDB is applied for post-processing, following the attention and feed-forward layers. In this capacity, its function is to refine the predictive sequence generated by the decoder, thereby improving the quality of the final output. The choice of the Discrete Cosine Transform (DCT) within AFDB is motivated by its strong energy compaction property, where a signal’s significant information becomes concentrated within a small number of low-frequency coefficients [20].
To learn the importance of each frequency scale, we define two sets of learnable filter parameters, $w_{high}$ and $w_{low}$, for high-pass and low-pass filtering, respectively. To weigh the significance of these different scales, we introduce an additional learnable parameter, $s$. This parameter is transformed into a set of attention weights, $\alpha$, via a softmax function. The output of the multi-scale filtering is then computed as the weighted sum of the results from all filtered scales. The final high-pass and low-pass components can be expressed as:
$$X_{high} = \sum_{i=1}^{L} \alpha_i \left( X_{dct} \odot \sigma\left(w_{high,i}\right) \right)$$
$$X_{low} = \sum_{i=1}^{L} \alpha_i \left( X_{dct} \odot \sigma\left(w_{low,i}\right) \right)$$
A core mechanism of this module is the generation of an adaptive mask. Specifically, we first calculate the energy, $E$, of the high-frequency component, $X_{high}$, at each frequency bin. To enhance the stability of the subsequent thresholding operation, we normalize the energy using a quantile of its distribution:
$$E_{norm} = \frac{E}{\mathrm{quantile}(E) + \epsilon}$$
where $\epsilon$ is a small constant added for numerical stability.
Based on this normalized energy, $E_{norm}$, a smooth mask, $M$, is generated using a learnable threshold parameter, $\theta$, and a sigmoid function, $\sigma(\cdot)$:
$$M = \sigma\left( k \left( E_{norm} - \theta \right) \right)$$
We establish two parallel channels to differentially process the high-frequency signal. One channel generates a denoising mask, $M_{denoise}$, to suppress noise, while the other generates a detail-enhancing mask, $M_{detail}$, to amplify salient features. To dynamically fuse the outputs of these channels, we introduce a learnable gating parameter, which is passed through a sigmoid function to produce a balancing weight, $\beta$. The final combined high-frequency component is thus:
$$X_{high}^{combined} = \beta \, X_{high}^{denoised} + (1 - \beta) \, X_{high}^{detailed}$$
Finally, this processed high-frequency component is recombined with the original low-frequency component via direct summation. The low-frequency part, which typically represents the signal’s overall trend and primary energy, is preserved without modification to yield the final frequency-domain representation:
$$X_{dct}^{recombined} = X_{high}^{combined} + X_{low}$$
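The frequency-domain steps of the AFDB can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the sigmoid slope k = 10, the 0.9 quantile, and the way the two masks act on the high-frequency component (multiplying by the mask for denoising, by one plus the mask for detail enhancement) are assumptions made for illustration, and the inverse DCT is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dct2(x):
    """Unnormalized DCT-II of a 1-D sequence (O(n^2), for illustration only)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def multiscale_filter(x_dct, filters, scale_logits):
    """Weighted sum over scales: sum_i alpha_i * (X_dct * sigmoid(w_i))."""
    alpha = softmax(scale_logits)
    return [sum(a * c * sigmoid(w[k]) for a, w in zip(alpha, filters))
            for k, c in enumerate(x_dct)]

def adaptive_mask(x_high, theta, k=10.0, q=0.9, eps=1e-8):
    """Smooth mask sigmoid(k * (E_norm - theta)); k and q are assumed values."""
    energy = [c * c for c in x_high]
    ordered = sorted(energy)
    ref = ordered[min(len(ordered) - 1, int(q * len(ordered)))]  # nearest-rank quantile
    return [sigmoid(k * (e / (ref + eps) - theta)) for e in energy]

def afdb_frequency_step(x, w_high, w_low, s, theta_dn, theta_dt, gate):
    x_dct = dct2(x)
    x_high = multiscale_filter(x_dct, w_high, s)
    x_low = multiscale_filter(x_dct, w_low, s)
    m_dn = adaptive_mask(x_high, theta_dn)   # denoising channel
    m_dt = adaptive_mask(x_high, theta_dt)   # detail-enhancing channel
    beta = sigmoid(gate)
    # Assumption: denoising scales by the mask, detail enhancement by (1 + mask).
    combined = [beta * h * a + (1.0 - beta) * h * (1.0 + b)
                for h, a, b in zip(x_high, m_dn, m_dt)]
    # The low-frequency part is added back unchanged.
    return [h + l for h, l in zip(combined, x_low)]
```

In a trained model the filter parameters, thresholds, and gate would be learned; here they are plain arguments so that each step can be inspected in isolation.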

2.2. Relative Event Encoding Module

The occurrence of microseismic events is typically characterized by non-linear and non-homogeneous patterns. Consequently, information precursory to rock bursts is often embedded not in isolated events, but in the relative spatio-temporal dynamics and physical interdependencies among them [21]. To capture these complex relationships, we design a Relative Event Embedding (REE) module. As illustrated in Figure 3, the core concept of this module is to enrich the representation of each microseismic event, moving beyond a simple positional index to a multi-dimensional embedding jointly defined by its intrinsic physical attributes and its complex interrelationships with other events.
The REE module comprises two core components:
(1) A Multi-Scale Relative Position Encoder, which captures the relative evolutionary trends of key physical attributes between events (such as spatial distance and energy magnitude) across various scales.
(2) An Event-Driven Similarity Encoder, which constructs a global graph of inter-event relationships based on a composite similarity measure between events.
The development of rock bursts is a multi-scale process, exhibiting different characteristics such as short-term energy release and long-term stress accumulation. Information precursory to such events is often embedded in the relative relationships between microseismic events across these scales. The multi-scale relative position encoder of the REE module aims to capture these dynamics. Specifically, for each event index t, we calculate the differences in distance and energy relative to the event l steps earlier. A distance weight is then generated based on exponential decay to suppress the influence of distant events, thereby assigning higher importance to closer ones. Finally, the distance difference, energy difference, and weighted distance across up to L_max scales are fused and mapped to the encoding dimension via a linear transformation to produce the final relative positional encoding:
$$\Delta d_t^{l} = d_t - d_{t-l}$$
$$\Delta e_t^{l} = e_t - e_{t-l}$$
$$d_{weighted,t}^{l} = w_l \exp\left( -\frac{\Delta d_t^{l}}{2^{l}} \right)$$
$$PE_{relative} = \sum_{l=1}^{L_{max}} \mathrm{Linear}\left( \left[ \Delta d^{l}, \Delta e^{l}, d_{weighted}^{l} \right] \right)$$
where $t$ is the event’s index in the sequence; $\Delta d_t^{l}$ and $\Delta e_t^{l}$ reflect the event’s spatial and intensity trends, respectively; $w_l$ is a learnable weight for the $l$-th scale, controlling the importance of different time spans; and the exponential function serves as a decay mechanism, reducing the importance of events with larger $\Delta d_t^{l}$.
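As a rough illustration of the per-scale features that feed the relative positional encoding, the sketch below computes the distance difference, energy difference, and decayed distance weight for one event. The decay is assumed here to take the form exp(-|Δd| / 2^l); the absolute value is an added safeguard so that the weight always decays, and zero-padding for events without an l-th predecessor is likewise an assumption. The final linear projection is left out.

```python
import math

def relative_features(dist, energy, t, scale_weights):
    """Return one (delta_d, delta_e, d_weighted) tuple per scale l = 1..L_max."""
    feats = []
    for l, w_l in enumerate(scale_weights, start=1):
        if t - l < 0:
            # Assumption: zero-pad when the event has no l-th predecessor.
            feats.append((0.0, 0.0, 0.0))
            continue
        delta_d = dist[t] - dist[t - l]
        delta_e = energy[t] - energy[t - l]
        # Assumed decay form: larger |delta_d| yields a smaller weight.
        d_weighted = w_l * math.exp(-abs(delta_d) / 2 ** l)
        feats.append((delta_d, delta_e, d_weighted))
    return feats
```

For example, `relative_features([0.0, 1.0, 3.0], [100.0, 150.0, 120.0], 2, [1.0, 1.0])` yields two tuples, one for the immediately preceding event and one for the event two steps back.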
Furthermore, to capture the dependencies between microseismic events at different time steps, we design a dynamic event-aware mechanism that follows the relative event encoding. This is achieved by first computing a hybrid similarity matrix between all time steps. This matrix incorporates both trend similarity (measuring directional change via cosine similarity) and numerical similarity (measuring magnitude change via Euclidean distance):
$$S_{cos}(i,j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert_2 \, \lVert x_j \rVert_2}$$
$$S_{euc}(i,j) = 1 - \frac{\lVert x_i - x_j \rVert_2}{\max\left( \lVert x_i - x_j \rVert_2 \right) + \epsilon}$$
$$S_{hybrid}(i,j) = \sigma(\alpha) \, S_{cos}(i,j) + \left( 1 - \sigma(\alpha) \right) S_{euc}(i,j)$$
where $S_{cos}(i,j)$ and $S_{euc}(i,j)$ are the cosine similarity and Euclidean similarity matrices between time steps $i$ and $j$; $x_i$ and $x_j$ are their respective feature vectors; $\lVert \cdot \rVert_2$ is the L2 norm; $\epsilon$ is a small constant for numerical stability; and $\alpha$ is a learnable parameter that controls the weighting between trend and numerical similarity. The final similarity matrix is denoted as $S_{hybrid}$.
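The hybrid similarity can be sketched in a few lines of plain Python. This is only an illustration: `alpha` is passed in as a fixed number rather than learned, and a small `eps` is also added to the cosine denominator (an assumption, to guard against zero-norm vectors).

```python
import math

def hybrid_similarity(X, alpha=0.0, eps=1e-8):
    """Blend cosine and Euclidean similarity with weight sigmoid(alpha)."""
    n = len(X)
    norms = [math.sqrt(sum(v * v for v in x)) for x in X]
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in dist)      # largest pairwise distance
    gate = 1.0 / (1.0 + math.exp(-alpha))      # sigmoid(alpha)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(X[i], X[j]))
            s_cos = dot / (norms[i] * norms[j] + eps)      # trend similarity
            s_euc = 1.0 - dist[i][j] / (d_max + eps)       # numerical similarity
            S[i][j] = gate * s_cos + (1.0 - gate) * s_euc
    return S
```

With `alpha = 0` the two terms are weighted equally, so two events that point in the same direction but differ in magnitude score higher than two orthogonal events of similar size.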
Since the relationships between MS events vary in significance, an event attention matrix is computed to quantify the strength of connections between different events:
$$W_i = \sigma\left( W_e x_i \right)$$
$$A_{ij} = W_i W_j^{T}$$
$$\tilde{A}_{ij} = \frac{A_{ij}}{\sum_{k=1}^{N} A_{ik}}$$
where $W_i$ is the event attention weight at time step $i$, $W_e$ is a learnable weight matrix, $\sigma(\cdot)$ is the sigmoid activation function ensuring values remain within [0, 1], $A_{ij}$ represents the event attention correlation matrix, and $\tilde{A}_{ij}$ is the normalized attention matrix where each row sums to 1.
To further capture the nonlinear correlations between events, a Gaussian kernel function dynamically computes inter-event similarities, allowing the model to learn multi-scale dependencies:
$$E_{ij} = \tilde{A}_{ij} \exp\left( -\frac{S_{hybrid}(i,j)^{2}}{2\sigma^{2}} \right)$$
where $E_{ij}$ represents the final dynamic Gaussian similarity matrix, and $\sigma$ is a learnable parameter that controls the scale of the Gaussian kernel. If $S_{hybrid}$ is small (indicating high similarity), then $\exp\left(-S_{hybrid}^{2} / 2\sigma^{2}\right) \approx 1$, implying that the two time steps have strong information similarity. Conversely, if $S_{hybrid}$ is large (indicating low similarity), then $\exp\left(-S_{hybrid}^{2} / 2\sigma^{2}\right) \approx 0$, signifying weak correlation between the two time steps.
Finally, the Gaussian similarity matrix $E_{ij}$ is used to perform a weighted sum of input event features, achieving feature aggregation and event encoding:
$$PE_{event,i} = \sum_{j=1}^{N} E_{ij} \, x_j$$
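The chain from event attention weights to the aggregated event encoding can be traced with a small sketch. It assumes that the weight matrix `W_e` and the hybrid matrix `S_hybrid` are supplied as plain nested lists and that `sigma` is a fixed number; in the model these would be learned or computed upstream.

```python
import math

def event_encoding(X, W_e, S_hybrid, sigma=1.0):
    """Row-normalized event attention, Gaussian weighting, then aggregation."""
    n, dim = len(X), len(X[0])
    # W_i = sigmoid(W_e x_i): one gate vector per event.
    gates = [[1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, x))))
              for row in W_e] for x in X]
    # A_ij = W_i W_j^T, then normalize each row to sum to 1.
    A = [[sum(a * b for a, b in zip(gi, gj)) for gj in gates] for gi in gates]
    A_norm = [[a / sum(row) for a in row] for row in A]
    # E_ij = A_norm_ij * exp(-S_ij^2 / (2 sigma^2)).
    E = [[A_norm[i][j] * math.exp(-S_hybrid[i][j] ** 2 / (2.0 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    # PE_event,i = sum_j E_ij x_j.
    PE = [[sum(E[i][j] * X[j][d] for j in range(n)) for d in range(dim)]
          for i in range(n)]
    return A_norm, PE
```

Because the sigmoid gates are strictly positive, every row sum of `A` is positive and the normalization is well defined.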

2.3. Dynamic Sparse Attention Module

The standard self-attention mechanism, the core of the Transformer architecture, derives its power from its ability to learn dependencies among all elements without any prior structural constraints [22]. However, this unbiased nature can become a drawback when processing specific types of data, such as those with high noise content or weak inter-element correlations [23]. A completely unconstrained model may over-focus on noise and spurious patterns within the training data instead of learning the true underlying regularities, leading to poor generalization performance on unseen data [24].
To address this issue, we contend that it is essential to introduce an effective inductive bias into the attention mechanism. By injecting prior knowledge about the problem into the model architecture, an inductive bias guides the model toward learning solutions that better reflect the problem’s intrinsic structure. Building on this premise, we propose the Dynamic Sparse Attention (DSA) mechanism. As illustrated in Figure 4, the core of this mechanism is to act as a powerful regularizer that dynamically constructs a sparse attention graph. This focuses the model’s limited attentional resources on connections most likely to contain critical information, thereby actively filtering out noise in the process.
The core of our Dynamic Sparse Attention (DSA) consists of four distinct sparse strategies, rather than relying on a single, fixed prior assumption. The final attention pattern is the union of these four strategies, which collectively form a rich set of inductive biases. These four strategies are: Dynamic Local Attention, Key-point Attention, Global Attention, and Adaptive Random Connectivity.
(1) Dynamic Local Attention
During the gestation process of a rock burst, stress becomes locally concentrated around the seismic source, producing a series of closely related fracture events that form a continuous precursory pattern [25]. The core of this strategy is to capture these short-range yet continuous causal relationships. Crucially, the required local window size varies under different geological and stress conditions. We therefore design a window that can be automatically learned and dynamically adjusted based on data characteristics.
This is achieved using a small feed-forward network (FFN) to autonomously learn a dynamic scaling factor for each query. This factor then adjusts a base window size, generating an adaptive window for each point in the sequence:
$$f_{adjust}(q_i) = 0.5 + \mathrm{Sigmoid}\left( \mathrm{MLP}(q_i) \right)$$
$$W_{dyn}(i) = \mathrm{clamp}\left( \mathrm{round}\left( W_{base} \cdot f_{adjust}(q_i) \right), 1, W_{max} \right)$$
$$M_{local}(i,j) = 1 \quad \text{if } |j - i| \le W_{dyn}(i)$$
where $q_i$ is the query vector at step $i$; $f_{adjust}$ is the adjustment factor; $W_{base}$ is a preset base window size; $W_{dyn}$ is the computed dynamic window size for time step $i$; and $M_{local}$ is the final dynamic local attention mask.
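A minimal sketch of the dynamic window follows. The per-query network is not reproduced; `mlp_outputs[i]` is a hypothetical stand-in for the scalar that the small network would produce for query i.

```python
import math

def dynamic_local_mask(mlp_outputs, w_base, w_max):
    """Build a 0/1 local attention mask with a per-query window size."""
    n = len(mlp_outputs)
    mask = []
    for i, z in enumerate(mlp_outputs):
        f_adjust = 0.5 + 1.0 / (1.0 + math.exp(-z))   # lies in (0.5, 1.5)
        # Note: Python's round() rounds exact halves to the nearest even integer.
        w_dyn = min(max(round(w_base * f_adjust), 1), w_max)
        mask.append([1 if abs(j - i) <= w_dyn else 0 for j in range(n)])
    return mask
```

Since the adjustment factor stays between 0.5 and 1.5, the window can shrink to half or grow to one and a half times the base size before clamping.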
(2) Key-point Attention
In a microseismic sequence, points of high energy or high rates of change often correspond to critical fracture events or significant energy releases [26]. Such key points are focal to the rock burst process, and their appearance signals a change in the stress state. This strategy aims to identify and grant these points global attention—allowing key points to attend to all other points, and vice versa—to construct the causal chain associated with their occurrence.
The strategy first computes a composite measure of change via a multi-scale differencing operation. A point is then identified as a key point only if its composite change score exceeds a dynamic threshold (calculated from the sequence’s statistics) and it is also a local peak. Here, Mean(δ) and Std(δ) represent the mean and standard deviation of the overall rate of change (δ) for the current input sequence, respectively. The term C is a learnable coefficient that modulates the threshold’s sensitivity. This design enables the threshold to be adaptively adjusted according to the intrinsic volatility of each sequence, thereby ensuring the robust identification of genuine critical points across varying data distributions. A fallback mechanism of uniform sampling is enabled if no key points are detected, ensuring information flow:
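$$\Delta_s(t) = \frac{\lVert x_t - x_{t-s} \rVert_1}{s}$$
$$\delta(t) = \sum_{s \in S} w_s \, \Delta_s(t)$$
$$T = \mathrm{Mean}(\delta) + C \cdot \mathrm{Std}(\delta)$$
$$\mathrm{is\_keypoint}(t) = \left( \delta(t) > T \right) \wedge \left( \delta(t) > \delta(t-1) \right) \wedge \left( \delta(t) > \delta(t+1) \right)$$
$$M_{keypoint}(i,j) = 1 \quad \text{if } \mathrm{is\_keypoint}(i) \vee \mathrm{is\_keypoint}(j)$$
where $x_t$ is the input vector at step $t$; $s$ denotes different scales; $\lVert \cdot \rVert_1$ is the L1 norm; $\delta(t)$ is the composite measure of change; $w_s$ is the weight for scale $s$; $T$ is the key-point determination threshold; $C$ is a learnable threshold coefficient; and $M_{keypoint}$ is the key-point mask.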
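The key-point rule can be sketched as below. The scale weights and the coefficient C are fixed inputs here rather than learned, the population standard deviation is an assumed choice, and the uniform-sampling fallback is omitted for brevity.

```python
import statistics

def keypoint_flags(X, scales, weights, C=1.0):
    """Flag steps whose composite change exceeds Mean + C*Std and is a local peak."""
    n = len(X)
    delta = [0.0] * n
    for s, w in zip(scales, weights):
        for t in range(s, n):
            l1 = sum(abs(a - b) for a, b in zip(X[t], X[t - s]))
            delta[t] += w * l1 / s
    # Assumption: population standard deviation over the whole sequence.
    threshold = statistics.fmean(delta) + C * statistics.pstdev(delta)
    flags = [False] * n
    for t in range(1, n - 1):
        flags[t] = (delta[t] > threshold
                    and delta[t] > delta[t - 1]
                    and delta[t] > delta[t + 1])
    return flags

def keypoint_mask(flags):
    """A pair (i, j) is connected when either endpoint is a key point."""
    return [[1 if fi or fj else 0 for fj in flags] for fi in flags]
```

For a step signal that jumps once, only the jump position is flagged, and the resulting mask has a full row and a full column at that index.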
(3) Global Attention
Besides local correlations and abrupt events, the development of rock bursts is also influenced by long-range, non-abrupt factors, such as stress accumulation or periodic engineering activities. This strategy aims to sample a set of global attention points from the entire sequence using a comprehensive importance score and stratified sampling. This ensures the model maintains an awareness of global trends while focusing on local details.
First, an importance score, combining signal magnitude and the magnitude of local change, is calculated for each point in the sequence. The sequence is then divided into non-overlapping strata. Within each stratum, multinomial sampling is performed based on the importance scores, ensuring that more important points are sampled while maintaining global coverage:
$$I_i = w_{mag} \, \lVert x_i \rVert_2 + w_{chg} \, \frac{1}{d_{model}} \sum_{d=1}^{d_{model}} \left| x_{i,d} - x_{i-1,d} \right|$$
$$P_{seg}(k) = \frac{I_{reshaped}(k)}{\sum_{j=1}^{S_{seg}} I_{reshaped}(j) + \epsilon}$$
$$idx_{global} = idx_{local} + l \cdot S_{seg}$$
$$M_{global}\left[ b, :, idx_{global}[b, :] \right] = 1$$
where $I_i$ is the importance score for step $i$; $x_i$ is the input vector at step $i$; $\lVert \cdot \rVert_2$ is the L2 norm; $x_{i,d}$ is the $d$-th feature dimension of $x_i$; $d_{model}$ is the model’s embedding dimension; $w_{mag}$ and $w_{chg}$ are the weighting coefficients for magnitude and change, respectively; $P_{seg}(k)$ is the sampling probability for the $k$-th element within a stratum; $I_{reshaped}(k)$ is the importance score of the $k$-th element within that stratum; $S_{seg}$ is the size of each stratum; $\epsilon$ is a small positive constant for numerical stability; $idx_{global}$ is the global index in the original sequence; $idx_{local}$ is the local index sampled within a stratum; $l$ is the index of the current stratum; and $M_{global}$ is the global attention mask.
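The stratified sampling step can be illustrated as follows. This sketch draws a single point per stratum, treats the first step as having zero change, and ignores any trailing elements that do not fill a whole stratum; all three are assumptions, as are the default weights of 0.5.

```python
import math
import random

def importance_scores(X, w_mag=0.5, w_chg=0.5):
    """Combine vector magnitude with mean absolute change from the previous step."""
    scores = []
    for i, x in enumerate(X):
        mag = math.sqrt(sum(v * v for v in x))
        prev = X[i - 1] if i > 0 else x   # assumption: no change at the first step
        chg = sum(abs(a - b) for a, b in zip(x, prev)) / len(x)
        scores.append(w_mag * mag + w_chg * chg)
    return scores

def sample_global_indices(scores, seg_size, rng, eps=1e-8):
    """Draw one index per non-overlapping stratum, weighted by importance."""
    picks = []
    for l in range(len(scores) // seg_size):
        seg = scores[l * seg_size:(l + 1) * seg_size]
        total = sum(seg) + eps
        probs = [s / total for s in seg]
        # eps keeps the weights valid even when a whole stratum scores zero.
        idx_local = rng.choices(range(seg_size),
                                weights=[p + eps for p in probs], k=1)[0]
        picks.append(idx_local + l * seg_size)   # map back to a global index
    return picks
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when inspecting which points receive global attention.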
(4) Adaptive Random Connectivity
This strategy serves as a supplement to the three aforementioned approaches by introducing a controlled amount of random connections. This prevents the model from over-relying on the predefined priors and enhances its generalization ability.
The strategy first calculates the complexity of the input sequence, normalizes it, and then uses it to dynamically adjust a base sparsity parameter, yielding a target sparsity level. By comparing this target with the existing connection density from the other three strategies, the required number of additional random connections is calculated and then activated in the mask:
$$C_b = \mathrm{mean}_i\left( \mathrm{var}_d\left( x_{b,i,d} \right) \right)$$
$$\rho_{adapt} = \mathrm{clamp}\left( \rho_{base} \cdot \left( 0.5 + C_{norm} \right), \rho_{min}, \rho_{max} \right)$$
$$\rho_{current} = \mathrm{mean}\left( M_{sparse} \right), \quad N_{needed} = \left( \rho_{adapt} - \rho_{current} \right) \cdot L_q \cdot L_{kv}$$
$$M_{random}\left[ b, q_{rand}, kv_{rand} \right] = 1$$
where $C_b$ is the complexity of the $b$-th sample in the batch; $x_{b,i,d}$ is the value of the $d$-th feature dimension at the $i$-th time step for the $b$-th sample; $\mathrm{var}_d$ denotes the variance computed along the feature dimension $d$; $\mathrm{mean}_i$ denotes the mean computed along the time dimension $i$; $\rho_{adapt}$ is the adaptively computed target attention density; $\rho_{base}$ is the base sparsity rate; $N_{needed}$ is the number of random connections to be added; $\rho_{current}$ is the existing connection density; $L_q$ and $L_{kv}$ are the sequence lengths of the query and key/value, respectively; and $M_{random}$ is the final random connection mask.
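A single-sample sketch of the random-connectivity step is given below. The normalization of the complexity is not specified in the text, so the squashing C / (1 + C) used here is purely an assumption, as are the default density values and the truncation of the required count to a non-negative integer.

```python
import random
import statistics

def add_random_connections(mask, X, rho_base=0.2, rho_min=0.05, rho_max=0.5, rng=None):
    """Top up a 0/1 mask with random links until the adaptive density is reached."""
    rng = rng or random.Random()
    L_q, L_kv = len(mask), len(mask[0])
    # Complexity: mean over time of the variance across feature dimensions.
    complexity = statistics.fmean([statistics.pvariance(row) for row in X])
    c_norm = complexity / (1.0 + complexity)      # assumed normalization into [0, 1)
    rho_adapt = min(max(rho_base * (0.5 + c_norm), rho_min), rho_max)
    rho_current = sum(map(sum, mask)) / (L_q * L_kv)
    n_needed = max(0, int((rho_adapt - rho_current) * L_q * L_kv))
    empty = [(q, k) for q in range(L_q) for k in range(L_kv) if not mask[q][k]]
    out = [row[:] for row in mask]                # leave the input mask untouched
    for q, k in rng.sample(empty, min(n_needed, len(empty))):
        out[q][k] = 1
    return out
```

When the existing mask is already denser than the target, no connections are added and a copy of the input mask is returned.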
Finally, all masks are merged via a union operation, and the standard attention formula is applied to produce the final output:
$$M_{sparse} = M_{local} \cup M_{keypoint} \cup M_{global} \cup M_{random}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} + M_{sparse} \right) V$$
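The merge and the masked attention can be sketched as follows. One interpretation is assumed here: the binary union mask is converted into an additive bias of 0 for allowed pairs and negative infinity for blocked pairs before the softmax. The sketch also assumes every query row keeps at least one allowed key, which holds whenever the local mask includes the diagonal.

```python
import math

def union_masks(*masks):
    """Element-wise OR of several 0/1 masks of identical shape."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[1 if any(m[i][j] for m in masks) else 0 for j in range(cols)]
            for i in range(rows)]

def sparse_attention(Q, K, V, mask):
    """Scaled dot-product attention restricted to the pairs allowed by mask."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Blocked pairs receive -inf, so their softmax weight is exactly zero.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  if mask[i][j] else -math.inf
                  for j, k in enumerate(K)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] * V[j][d] for j in range(len(V))) / total
                    for d in range(len(V[0]))])
    return out
```

With an identity mask each query can only attend to its own position, so the output reproduces `V`; with a full mask the routine reduces to ordinary dense attention.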

3. Dataset Construction and Risk Quantification

3.1. Dataset Overview and Disturbance Correlation Analysis

The microseismic data for this study were sourced from a longwall panel with a weak rock burst propensity, located in a coal mine. This panel is the second to be developed in this mining area and the first to be excavated adjacent to a previously mined-out area (goaf). The panel has an average burial depth of approximately 650 m, a strike length of 1061 m, and a dip length of 185 m. An SOS-type microseismic monitoring system, manufactured by the Central Mining Institute of Poland (GIG), was employed on-site to continuously record signals generated by rock mass fracturing. To investigate the characteristics of microseismic activity, a preliminary analysis was conducted on the data monitored from January to March 2024.
As depicted in Figure 5a,b, the monitoring results indicate that the microseismic activity at this panel is characterized by high frequency and low energy. Over 72% of the events released energy below 1000 J, with only 3.4% exceeding 5000 J. The maximum recorded microseismic energy was 8280 J. This statistical distribution is consistent with the initial assessment of the panel’s weak rock burst propensity. Further analysis (Figure 6) reveals a notable positive linear trend between the cumulative daily microseismic energy release and the daily face advance distance. Although the coefficient of determination for a simple linear regression is not high, the underlying physical mechanism—that mining disturbance is the primary driver of energy release—is unequivocal. This provides a clear and sufficient motivation for developing a disturbance-driven predictive model.

3.2. Reconstruction of the Prediction Benchmark from a Disturbance-Driven Perspective

Traditional time-based prediction models are often inadequate for practical applications as they neglect the dynamic nature of mining disturbances. Therefore, motivated by the linear relationship between microseismic energy and advance distance revealed in Figure 6, we propose a framework that uses mining disturbance intensity—specifically, the working face advance distance—as the predictive benchmark. According to on-site engineering records, each web cut by the shearer corresponds to a face advance of 0.8 m. Consequently, we reconstruct the dataset using this 0.8 m advance as the minimum prediction unit (Figure 7). This transforms the rock burst prediction task into one of modeling a mapping from an “advancement step” to a “risk level,” a method that effectively circumvents the biases arising in time-based approaches from the non-uniform intensity of mining disturbances across fixed temporal intervals.
Specifically, the maximum energy within each advancement unit is adopted as the core indicator. This extreme value extraction strategy serves to suppress noise interference and accentuate the critical energy release events triggered by mining activities. Furthermore, retaining the peak energy within each unit effectively captures the dynamic relationship between a single web cut and a corresponding energy surge, thereby avoiding spurious predictions that might be caused by non-critical fluctuations in a continuous time series.
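The re-indexing from timestamps to advancement units can be sketched as follows. This is a minimal illustration that assumes each event already carries the face advance position (in metres) at which it occurred; the function name and the toy events are hypothetical.

```python
from collections import defaultdict

WEB_DEPTH_M = 0.8  # face advance per shearer web cut, as stated in the text

def max_energy_per_unit(events, step=WEB_DEPTH_M):
    """events: iterable of (face_advance_m, energy_J).

    Returns a sorted list of (unit_index, peak_energy_J), one entry per
    advancement unit that contains at least one event.
    """
    peak = defaultdict(float)
    for advance_m, energy_j in events:
        unit = int(advance_m // step)          # which 0.8 m cut the event falls in
        peak[unit] = max(peak[unit], energy_j)
    return sorted(peak.items())

toy_events = [(0.1, 120.0), (0.5, 900.0), (0.9, 300.0), (1.7, 50.0), (2.0, 4600.0)]
units = max_energy_per_unit(toy_events)
```

Units without any recorded event are simply absent from the result; how such empty units are treated in the actual dataset is not specified in the text.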

3.3. Feature Engineering by Integrating Statistical Metrics and Physical Parameters

Effective features play a crucial role in enhancing both the learning efficiency and predictive capabilities of a model, as their quality directly dictates the final performance. Therefore, based on the raw microseismic data, this study derives a set of statistical and physical features to serve as the core input variables, ensuring the model can more accurately capture critical information [27]. All features are derived from the microseismic data aggregated within each 0.8-m advancement unit, which corresponds to a single web cut.
The statistical features are primarily derived from signal processing principles and are designed to capture the dynamic evolutionary patterns of microseismic energy directly from the data’s morphology. To capture local characteristics, we calculated the moving average (MA), exponentially weighted moving average (EWMA), and moving standard deviation of the energy within a sliding window of size 3 [28]. To characterize the sequence’s trend and historical dependencies with respect to mining advancement, we derived differenced energy up to the second order and lagged energy values up to the fifth order [29]. Finally, the rate of energy change per unit of face advance was computed to explicitly model the relationship between energy release and spatial disturbance [30].
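A rough sketch of these statistical features on a per-unit peak-energy series is given below. Several details are assumptions rather than facts from the text: partial windows are used during warm-up, the EWMA smoothing factor is set to 0.5, and the rate of change is taken as the first difference divided by the 0.8 m step.

```python
import statistics

def window(x, i, w):
    """Last w values up to and including index i (shorter during warm-up)."""
    return x[max(0, i - w + 1): i + 1]

def moving_average(x, w=3):
    return [statistics.fmean(window(x, i, w)) for i in range(len(x))]

def moving_std(x, w=3):
    return [statistics.pstdev(window(x, i, w)) for i in range(len(x))]

def ewma(x, alpha=0.5):
    """Exponentially weighted moving average; alpha is an assumed value."""
    out = [x[0]]
    for v in x[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def lag(x, k):
    """Series shifted back by k advancement units; missing history is None."""
    return ([None] * k + x)[:len(x)]

def diff(x, order=1):
    """Differenced series of the given order; undefined entries are None."""
    for _ in range(order):
        x = [None] + [b - a if a is not None else None for a, b in zip(x, x[1:])]
    return x

def rate_per_metre(x, step=0.8):
    return [None if d is None else d / step for d in diff(x)]

energy = [100.0, 400.0, 250.0, 700.0]   # toy peak energies (J) per web cut
features = {
    "MA": moving_average(energy),
    "EWMA": ewma(energy),
    "MSTD": moving_std(energy),
    "Lag1": lag(energy, 1),
    "Delta2": diff(energy, 2),
    "Rate": rate_per_metre(energy),
}
```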
The physics-informed features, grounded in seismology and rock mechanics theory, are intended to provide the model with insights into the underlying physical mechanisms that transcend superficial data patterns. Comprehensive indicators, such as the S-value and the seismic activity scale, reflect the stress accumulation and instability state of the rock mass from multiple dimensions—including energy, frequency, and source parameters. These indicators compensate for the limitations of a single energy metric in describing the complex damage nucleation process. By fusing these two categories of features, we provide the model with an information-rich input that is both data-driven and constrained by physical principles, creating a complementary feature set.
In this work, we introduce two comprehensive physical indicators:
(1) S-value [31]: The S-value integrates seismic event frequency, energy, intensity levels, and source distribution, offering a three-dimensional perspective on MS activity. Variations in time, space, or intensity all influence the S-value, making it a sensitive indicator of rock burst activity.
$$S = 0.117\lg(N+1) + 0.029\lg\left(\frac{1}{N}\sum_{i=1}^{N}10^{1.5M_i}\right) + 0.0015M$$
where $N$ is the total number of MS events, $M_i$ is the magnitude of the i-th event, and $M$ is the maximum magnitude.
(2) Seismic Activity Scale [32]: Based on Soviet seismology research, Roland Granger established a correlation between the total stress $F_0$ on a fault plane and seismic magnitude. Before a strong rock burst event, the seismic activity scale exhibits an abnormally high value, making it a useful indicator for early warning.
$$F = \lg\frac{F_0}{T}, \qquad F_0 = 10^{6.11 + 1.09M}$$
where $T$ is the number of days, $M$ is the magnitude, and $F_0$ is the total stress on the fault plane.
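The two physical indicators can be evaluated with a few lines of Python. The sketch reads the S-value as 0.117 lg(N + 1) + 0.029 lg((1/N) Σ 10^(1.5 M_i)) + 0.0015 M and the activity scale as F = lg(F_0 / T) with F_0 = 10^(6.11 + 1.09 M). This reading of the formulas, the function names, and the use of a single representative magnitude for F are assumptions.

```python
import math

def s_value(magnitudes):
    """S-value for the events of one analysis unit (list of magnitudes)."""
    n = len(magnitudes)
    mean_energy_term = sum(10 ** (1.5 * m) for m in magnitudes) / n
    return (0.117 * math.log10(n + 1)
            + 0.029 * math.log10(mean_energy_term)
            + 0.0015 * max(magnitudes))

def activity_scale(magnitude, days):
    """Seismic activity scale F = lg(F0 / T), with F0 = 10^(6.11 + 1.09 M)."""
    f0 = 10 ** (6.11 + 1.09 * magnitude)
    return math.log10(f0 / days)

s_single = s_value([1.0])            # one event of magnitude 1
f_single = activity_scale(1.0, 1)    # magnitude 1 over one day
```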
To further evaluate the association between the selected features and the target variable, this study employs Spearman correlation analysis to validate the relevance of each feature. Spearman correlation is a non-parametric statistical method based on ranks that effectively measures the strength of a relationship between two variables. This approach is particularly suitable as it makes no assumptions about the underlying data distribution, is robust to outliers, and is adept at capturing non-linear relationships. Its formula is given by:
$$\rho = 1 - \frac{6\sum d_i^2}{n\left(n^2 - 1\right)}$$
where $\rho$ is the Spearman correlation coefficient, ranging from −1 to 1; $d_i$ is the rank difference for the i-th observation, i.e., the difference between the ranks of the two variables; and $n$ is the sample size.
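As a worked illustration of the rank-difference formula, the following sketch computes ρ for short toy series. It assumes there are no tied values, which is the case in which this closed form is exact; with ties, a library routine that averages tied ranks would be the safer choice.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

feature = [10.0, 20.0, 30.0, 40.0, 50.0]   # toy feature values
target = [1.0, 3.0, 2.0, 5.0, 4.0]         # toy target values
rho = spearman_rho(feature, target)
```

For the toy data the rank differences are 0, −1, 1, −1, 1, so Σd² = 4 and ρ = 1 − 24/120 = 0.8.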
Figure 8 illustrates the correlation between each engineered feature and the target variable. It is evident that the majority of the derived features exhibit a significant correlation with the target, indicating they can provide crucial predictive information for the model. In contrast, the first-order difference (Delta1_Energy) and second-order difference (Delta2_Energy) features show comparatively weak correlations with the target variable, Energy, with coefficients of 0.15 and −0.07, respectively, both having a magnitude below 0.2. Consequently, these two features were excluded to prevent the introduction of irrelevant noise. After this pruning step, a final set of 11 features was retained for model training, comprising two physical and nine statistical features.

3.4. Quantification of Risk Levels

Microseismic energy is one of the most widely used indicators for rock burst prediction. To establish rational thresholds for different risk levels, this study utilizes the relationship proposed by Gutenberg and Richter [33] to delineate energy values corresponding to various hazard degrees:
$$\lg N = a - b\,\lg E$$
where $N$ is the number of microseismic events with energy greater than or equal to $E$, $\lg N$ is the corresponding logarithmic frequency, $E$ is the microseismic event energy, $\lg E$ is the energy level of the events, and $a$ and $b$ are fitted constants.
The Gutenberg-Richter (G-R) relation is a widely accepted empirical law in seismology and mining seismology that describes the statistical distribution between the frequency of seismic events and their magnitude. Through laboratory experiments on rock mechanics, Scholz verified that the frequency-magnitude distribution of microfracturing events generated during the stress-induced failure process of rocks also follows the G-R relation [34]. This relation indicates that within a specific region and time period, the number of low-magnitude events is substantially greater than that of high-magnitude events. In the study of rock bursts, the G-R relation is frequently employed to analyze the cumulative energy release patterns of microseismic activity, thereby providing a quantitative basis for risk assessment [35]. According to the G-R law, the higher the energy of a microseismic event, the lower its frequency of occurrence, and high-energy events often signify an increased risk of rock bursts. Therefore, establishing energy thresholds based on the G-R curve provides a risk assessment methodology that integrates both statistical significance and physical meaning [36].
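A minimal sketch of fitting the G-R relation lg N = a − b lg E is shown below. It builds cumulative counts from a list of event energies and fits a straight line by ordinary least squares. The helper names and toy data are hypothetical, and the piecewise analysis used in the paper is not reproduced here.

```python
import math

def gr_points(energies):
    """(lg E, lg N) pairs, where N counts events with energy >= E."""
    return [(math.log10(e), math.log10(sum(1 for v in energies if v >= e)))
            for e in sorted(set(energies))]

def fit_gr(points):
    """Least-squares fit of lg N = a - b * lg E; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, -slope

toy_energies = [100.0, 1000.0, 1000.0, 10000.0]
toy_points = gr_points(toy_energies)
a, b = fit_gr([(2.0, 3.0), (3.0, 2.0), (4.0, 1.0)])   # points lying exactly on a line
```

A positive b reflects the expected pattern that high-energy events are rarer than low-energy ones.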
Following the G-R relationship and the analytical framework proposed in [37], the magnitude-frequency curve for microseismic events in the study area was plotted, as shown in Figure 9. On a log-log scale, this curve theoretically exhibits a piecewise characteristic, primarily consisting of an exponential segment at lower energy levels and a linear segment at higher energy levels. Based on this, the risk thresholds are defined as follows: The Low-Risk Threshold is defined at the critical inflection point of the linear segment, which signifies the onset of an expanding fracture zone in the coal-rock mass. The Medium- and High-Risk Thresholds are defined by two subsequent significant deviation points from this linear trend. These deviations indicate that the accelerated energy accumulation is transitioning into an unstable release phase.
Based on the preceding analysis, we performed a piecewise regression on the magnitude-frequency curve to quantitatively determine the microseismic energy thresholds for different risk levels. The Low-Risk Threshold corresponds to the inflection point between the exponential and linear segments of the curve. This optimal point was identified through an iterative process validated by goodness-of-fit tests. As shown in Figure 10, the ideal inflection point was determined to be at lg(E) = 3.66, where the coefficients of determination ($R^2$) for the respective fits on either side were 0.99 and 0.98. Therefore, the Low-Risk Threshold was set at $10^{3.66}$ J (approximately 4571 J).
Furthermore, an analysis of the linear segment (depicted in Figure 11) reveals two subsequent deviation points from the established trend, located at lg(E) = 3.755 and lg(E) = 3.85. Consequently, these points were designated as the medium- and high-risk thresholds. The specific energy thresholds for each risk level are summarized in Table 2.
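The mapping from an energy value to a risk level follows directly from the three lg(E) thresholds (3.66, 3.755, and 3.85). The sketch below is illustrative only; treating an energy exactly equal to a threshold as belonging to the higher level is an assumption.

```python
import bisect

LG_E_THRESHOLDS = (3.66, 3.755, 3.85)   # low, medium, high thresholds in lg(E)
LEVELS = ("No Risk", "Low Risk", "Medium Risk", "High Risk")
CUTOFFS_J = [10 ** t for t in LG_E_THRESHOLDS]   # roughly 4571 J, 5689 J, 7079 J

def risk_level(energy_j):
    """Return the risk label for a peak energy given in joules."""
    return LEVELS[bisect.bisect_right(CUTOFFS_J, energy_j)]

labels = [risk_level(e) for e in (1000.0, 5000.0, 6000.0, 8280.0)]
```

Under these thresholds, the maximum energy of 8280 J reported for the January to March 2024 data falls in the “High Risk” band.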
Based on the energy thresholds defined above, the dataset in this study was partitioned into four risk levels. “No Risk” events constitute 91.09% of the total samples, “Low Risk” events account for 6.89%, “Medium Risk” for 1.59%, and “High Risk” events for only 0.42%.
This severe class imbalance, particularly the extreme scarcity of “High Risk” samples, represents a common yet critical challenge in rock burst prediction. It poses a significant obstacle to the model’s learning process and its capability for long-horizon forecasting.
However, this study did not employ active mitigation techniques for class imbalance, such as SMOTE oversampling. This decision was based on the rationale that microseismic events are physically driven phenomena. Artificially synthesized samples risk introducing “pseudo-physical patterns” that lack genuine physical meaning, which could potentially compromise the model’s generalization performance in real-world engineering scenarios. Consequently, this research prioritizes enhancing the model’s capability to learn critical information from sparse samples through architectural innovations, such as the Dynamic Sparse Attention (DSA) mechanism.
It is important to note that the risk energy thresholds established in this study are dataset-specific and may exhibit data dependency. Consequently, when this risk classification framework is applied to different geological conditions or mining sites, a recalibration of these energy thresholds will likely be necessary to ensure predictive accuracy.

4. Experiments and Result Analysis

4.1. Experimental Setup

The processed dataset, as described in the preceding sections, was partitioned into training, validation, and test sets using a 7:1:2 ratio. During the training process, model parameters were learned exclusively on the training set. The validation set was used for performance evaluation and hyperparameter tuning, while the test set was strictly held out to ensure no information leakage, thereby guaranteeing the objectivity and reliability of the final model assessment. To reduce training complexity and improve efficiency, all data partitions were normalized using Min-Max Normalization, which scales features to the [0, 1] range. This method effectively mitigates scale disparities between different features, providing a more stable input that can accelerate convergence and enhance model performance. The normalization formula is as follows:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x'$ represents the normalized data, $x$ is the original data before normalization, and $\max(x)$ and $\min(x)$ denote the maximum and minimum values within the dataset, respectively.
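Min-Max normalization is straightforward to sketch for a single feature column. One point the text does not specify is which partition supplies the minimum and maximum; the example assumes they are computed on the training set and reused for the other partitions, a common way to avoid leaking test-set statistics.

```python
def fit_min_max(train_values):
    """Return the (min, max) statistics of one feature from the training set."""
    return min(train_values), max(train_values)

def min_max_scale(values, lo, hi):
    """Apply x' = (x - min) / (max - min) with previously fitted statistics."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]   # constant feature: map everything to 0
    return [(v - lo) / span for v in values]

train = [2.0, 4.0, 6.0]
lo, hi = fit_min_max(train)
train_scaled = min_max_scale(train, lo, hi)
test_scaled = min_max_scale([8.0], lo, hi)   # may fall outside [0, 1]
```

With training-set statistics, unseen values larger than the training maximum scale above 1, which is expected behaviour rather than an error.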
To comprehensively evaluate the performance of our proposed DynamiXFormer model, we selected six models that have recently demonstrated strong performance in this field for a comparative analysis: Transformer [18], CNN-BiGRU-Attention [14], CNN-LSTM [38], DNN [39], CNN-BiLSTM-Attention [40] and LSTM [41]. This selection not only includes top-performing models in rock burst prediction but also covers a diverse range of architectures, from classic recurrent neural networks to attention-based mechanisms.
To ensure a rigorous and fair comparison, all models were trained using a unified strategy. We employed the Adam optimizer with a Mean Squared Error (MSE) loss function and used ReLU as the activation function. Furthermore, an early stopping mechanism was implemented: training was halted if the validation loss did not show improvement for 10 consecutive epochs, thereby preventing model overfitting.
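The early stopping rule (halt after 10 consecutive epochs without a lower validation loss) can be expressed as a small helper. The class name and the toy loss sequence are hypothetical, and any minimum-improvement tolerance the authors may have used is not modeled.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=10)
val_losses = [1.0, 0.9] + [0.95] * 15        # toy curve that stalls after epoch 2
stopped_at = next(i for i, loss in enumerate(val_losses) if stopper.step(loss))
```

In a real training loop, `step` would be called once per epoch and the weights from the best epoch would typically be restored afterwards.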
The architectural hyperparameters for each baseline model (such as the number of layers, neurons/filters, kernel size, learning rate, and dropout rate) were optimized using a Bayesian optimization algorithm. For our proposed DynamiXFormer and the standard Transformer model, a different set of hyperparameters was optimized, also with Bayesian optimization, because of their distinct architectures: the number of encoder/decoder layers, model dimension, feed-forward network dimension, number of attention heads, and dropout rate. To ensure that each model achieved its best possible performance, we optimized a distinct set of hyperparameters for each prediction length.
To ensure the reproducibility of our results, we set a fixed random seed (random_seed = 42) for all relevant libraries (including NumPy==2.2.2 and PyTorch==2.5.1) to control processes such as data partitioning and model weight initialization. The reported results are based on a single deterministic run under this fixed seed. All experiments were conducted on the following hardware configuration: CPU: AMD Ryzen 5800H; GPU: NVIDIA RTX3050; RAM: 16 GB. The software environment included Python 3.11.11 and PyTorch 2.5.1.

4.2. Evaluation Metrics

To evaluate model performance, this study employs four metrics for assessing both regression and classification capabilities: Mean Absolute Error (MAE), Mean Squared Error (MSE), Recall, and False Positive Rate (FPR). These are defined as follows:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - y_i\right|$$
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - Y_i\right)^2$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{FPR} = \frac{FP}{FP + TN}$$
where $n$ is the number of data points, $Y_i$ is the actual value of the i-th data point, and $y_i$ is its predicted value. For the classification metrics, the terms are defined from a one-vs-rest perspective for each specific risk level (e.g., considering the “No Risk” class):
True Positive (TP): An instance is correctly predicted as belonging to the class of interest. For example, an actual “No Risk” sample is correctly predicted as “No Risk”.
False Positive (FP): An instance is incorrectly predicted as belonging to the class of interest. For example, an actual “Low Risk” sample is incorrectly predicted as “No Risk”.
True Negative (TN): An instance does not belong to the class of interest and is correctly predicted as not belonging to it. For example, relative to the “No Risk” class, an actual “Medium Risk” sample that is predicted as “Low Risk” constitutes one true negative.
False Negative (FN): An instance belongs to the class of interest but is incorrectly predicted as belonging to a different class. For example, an actual “No Risk” sample is predicted as “Low Risk”, “Medium Risk”, or “High Risk”.
In summary, lower values for MAE and MSE signify better regression performance. For classification, superior performance is indicated by a higher Recall score in conjunction with a lower FPR.
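The four metrics can be computed as in the sketch below, using the one-vs-rest counting described above. How the per-class values are aggregated into the single Recall and FPR figures reported later is not stated in this section, so the example stops at the per-class level; the labels and values are toy data.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_fpr(true_labels, pred_labels, positive):
    """One-vs-rest Recall and FPR for the class `positive`."""
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr

true_labels = ["No Risk", "No Risk", "No Risk", "Low Risk", "Medium Risk"]
pred_labels = ["No Risk", "No Risk", "Low Risk", "No Risk", "Low Risk"]
recall, fpr = recall_fpr(true_labels, pred_labels, positive="No Risk")
```

In the toy data, the “Medium Risk” sample predicted as “Low Risk” counts as a true negative for the “No Risk” class, matching the example given in the definitions.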

4.3. Comparison of Model Performance

To evaluate the model’s performance, we designed three sets of experiments on the test set, each corresponding to a different forecast horizon. The forecast horizon is defined in terms of the face advance distance, with a single coal cutting pass (approximately 0.8 m) serving as the fundamental unit. The experiments were configured as follows:
(1) Short-term forecast: A horizon of 0.8 m, predicting the rock burst risk for the next single cutting pass.
(2) Medium-term forecast: A horizon of 1.6 m, predicting the risk for the next two cutting passes.
(3) Long-term forecast: A horizon of 2.4 m, predicting the risk for the next three cutting passes.
The comprehensive experimental results for our proposed model and the baseline models are presented in Table 3 and visualized in Figure 12a–d.
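The three horizons correspond to predicting one, two, or three advancement units ahead. A minimal sketch of how input/target pairs might be cut from the per-unit series is shown below; the lookback length and the choice to predict all steps up to the horizon jointly are assumptions, since the section does not state them.

```python
STEP_M = 0.8  # one cutting pass

def make_windows(series, lookback, horizon):
    """Slice a series into (input window, next `horizon` values) pairs."""
    samples = []
    for start in range(len(series) - lookback - horizon + 1):
        inputs = series[start:start + lookback]
        targets = series[start + lookback:start + lookback + horizon]
        samples.append((inputs, targets))
    return samples

series = [0, 1, 2, 3, 4, 5]                      # toy per-unit values
medium_term = make_windows(series, lookback=3, horizon=2)
horizon_m = 2 * STEP_M                           # 1.6 m for the medium-term task
```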

4.4. Analysis of Experimental Results

The results demonstrate that the proposed DynamiXFormer model performs best across all prediction horizons, achieving MSE and MAE values that are significantly lower than all other models. At the 0.8 m horizon, DynamiXFormer records an MSE of just 0.000518 and an MAE of 0.015936, achieving remarkable MSE reductions of 60.7% and 65.0% compared to the standard Transformer (0.001318) and LSTM (0.001481), respectively. Its MAE is also reduced by 41.0% relative to the standard Transformer. This lead is maintained even at the longer 2.4 m prediction horizon, where DynamiXFormer’s MSE of 0.008479 represents a 51.8% reduction compared to the standard Transformer, remaining substantially lower than those of all competing models.
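The percentage reductions quoted above are relative reductions with respect to the baseline error. As a quick check using the 0.8 m MSE values given in the text:

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

vs_transformer = relative_reduction(0.001318, 0.000518)   # about 60.7 %
vs_lstm = relative_reduction(0.001481, 0.000518)          # about 65.0 %
```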
Furthermore, the experimental results in Table 3 reveal that for short-term prediction (0.8 m), the standard Transformer’s performance is second only to our proposed model, surpassing all other baselines. However, in medium- to long-term prediction tasks, the overall performance of LSTM exceeds that of the Transformer, which validates that the memory capacity of LSTMs contributes to more robust performance over longer sequences. Our proposed DynamiXFormer combines the advantages of both architectures—the global relational modeling of the Transformer’s attention mechanism and the strong sequential inductive bias inherent in LSTMs—enabling it to maintain a leading performance across all tested prediction scales.
Recall measures the model’s ability to correctly identify all true instances of a given risk level, while the False Positive Rate (FPR) quantifies how often a specific risk level is incorrectly assigned. At the 0.8 m horizon, the DynamiXFormer model achieves a Recall of 97.85%, an absolute improvement of 3.17 percentage points over the second-best LSTM model (94.68%), significantly outperforming the other baseline models. Even for the more challenging long-range prediction task (2.4 m), it maintains a high Recall of 80.90%, again superior to the baselines. This demonstrates that DynamiXFormer exhibits excellent robustness in risk prediction, effectively reducing instances of missed detections compared to other models. Concurrently, at the 0.8 m horizon, DynamiXFormer’s FPR is only 0.72%, a marked improvement over the baselines, and its FPR also remains superior at the 2.4 m horizon. A comparison plot of the actual versus predicted values for the DynamiXFormer model at different prediction lengths is provided in Figure 13.
The plots in Figure 13, which compare the true values (blue) with the model’s predictions (orange), reveal that the microseismic data exhibits strong fluctuations in certain regions, underscoring its highly non-linear nature. Accurately forecasting these volatile periods is inherently challenging, and this predictive capability tends to degrade as the forecast horizon extends. Our model demonstrates this trend: for one-step-ahead predictions (a 0.8 m advance), it accurately captures the highly volatile regions. At a two-step horizon (1.6 m), the predictions, while showing some lag, still successfully track the general fluctuations of the true values. By the three-step-ahead forecast (2.4 m), predicting the peaks of these rapid fluctuations becomes considerably more difficult, although the model’s overall performance is maintained at a commendably stable level.
Figure 14 presents the confusion matrices for the DynamiXFormer model’s risk classification performance across different prediction horizons. In each matrix, the horizontal axis indicates the predicted risk level, while the vertical axis represents the true risk level. The results clearly show that classification performance degrades as the prediction horizon extends.
At the 0.8 m horizon, the model performs optimally. All “No Risk” and “Medium Risk” instances are classified correctly. The only errors consist of three “Low Risk” instances being misclassified as one level higher (“Medium Risk”) and one “High Risk” instance being misclassified as one level lower (“Medium Risk”). Thus, for short-term prediction, the model achieves high classification accuracy, with confusion being minimal and limited to adjacent risk levels.
As the horizon increases, however, performance degrades. At the two-step horizon (1.6 m), confusion between the “Low Risk” and “Medium Risk” classes becomes more pronounced. By the three-step horizon (2.4 m), performance deteriorates substantially, particularly for the “High Risk” class, where all instances are misclassified. We attribute this performance degradation in long-range forecasting to three primary factors:
(1) The inherent difficulty of long-horizon prediction. As the forecast distance increases, the efficacy of precursory information diminishes, and the effect of cumulative error becomes significantly more pronounced.
(2) The extreme scarcity of “High-Risk” samples. The inherent class imbalance within the dataset provides the model with insufficient examples of high-risk events, hindering its ability to learn their distinguishing features.
(3) The smoothing effect of long-range forecasting. Models predicting far into the future tend to produce smoothed-out outputs. This can cause sharp energy peaks, which are often indicative of high risk, to be “averaged out” and consequently misclassified as medium or low risk.
Figure 15a compares the true and predicted microseismic energy values for the 0.8 m forecast horizon. The predicted values demonstrate a high degree of consistency with the overall morphology of the true energy series. The model rapidly responds to energy trends, both during accumulation and release phases. Crucially for rock burst prediction, the model also exhibits a strong capability to capture peak energy events. For the two significant high-energy events at face advances of approximately 700 m and 754 m, the predicted peaks closely match the true peaks. This validates that the proposed attention mechanism can effectively identify and focus on critical precursory information.
Figure 15b illustrates the model’s classification performance. The majority of events are correctly classified (indicated by matching colors and shapes) and are tightly clustered around the ideal prediction diagonal. This visually confirms the model’s high accuracy. Notably, the few misclassified samples are located near the decision boundaries between risk levels. This suggests that the model’s errors are not random but are concentrated on ambiguous, borderline cases, providing evidence that the model has learned the underlying data patterns rather than merely overfitting.
Interestingly, at these decision boundaries, the model exhibits a tendency towards conservative forecasting. For instance, three events with a true ‘Low-Risk’ label near the threshold were classified as ‘Medium-Risk’. This behavior, which prioritizes safety by elevating the risk level in uncertain situations, is a highly desirable characteristic for practical engineering applications.
Figure 16a examines the relationship between the prediction error (residual) and the magnitude of the predicted energy. In this plot, the vertical axis represents the residual (true value minus predicted value). Ideally, these points should be randomly scattered around the zero-error line, exhibiting no discernible pattern. As observed, the vast majority of residuals are tightly and randomly distributed around the zero line. Furthermore, the locally weighted regression (LOESS) line is nearly horizontal and closely tracks the zero-error line, confirming the absence of systematic bias in the model’s predictions across different energy ranges.
Figure 16b assesses the overall error characteristics using a histogram and a quantile-quantile (Q-Q) plot. The histogram clearly shows that the prediction errors approximate a normal distribution, with a mean of −20.97 J, which is negligible given that microseismic energy values can be on the order of thousands of Joules. Moreover, the Q-Q plot shows that the sample quantiles align closely with the theoretical quantiles of a normal distribution. These observations collectively validate that the DynamiXFormer model has successfully learned the underlying patterns in the data, and the remaining residuals can be attributed to random noise.

4.5. Ablation Study

To systematically validate the effectiveness of each core innovative component within the DynamiXFormer and its impact on the final performance, we designed a series of ablation studies. These experiments begin with a standard Transformer architecture as the baseline, which excludes all modules proposed in this paper. We then incrementally add our core modules to this baseline or replace its standard components to evaluate their individual contributions.
The task for all ablation experiments was set to two-step-ahead prediction. To ensure a fair comparison, all models in the study shared a unified hyperparameter configuration: 1 encoder layer, 1 decoder layer, a model dimension of 64, a feed-forward network dimension of 32, 4 attention heads, and a dropout rate of 0. All other environmental settings and training configurations, including the early stopping mechanism, were kept identical to those in the main comparative experiments.
The specific configurations for the ablation study are as follows:
  • Baseline: The standard Transformer model without any of our proposed modifications.
  • Baseline + AFDB: The baseline model augmented with our Adaptive Frequency Denoise Block.
  • Baseline + REE: The baseline model augmented with our Relative Event Embedding module.
  • Baseline + DSA: The baseline model with its standard full attention mechanism replaced by our Dynamic Sparse Attention mechanism.
  • DynamiXFormer (Full Model): The complete proposed model incorporating all three innovative modules (AFDB, REE, and DSA).
The results of our ablation study are presented in Table 4 and visualized in Figure 17. The baseline model, a standard Transformer, achieved an MAE of 0.0727. Adding each proposed module to this baseline individually improved performance in every case: the MAE decreased to 0.0525 with the Adaptive Frequency Denoise Block (AFDB; a 27.7% reduction), to 0.0535 with the Relative Event Embedding (REE) module (a 26.4% reduction), and to 0.0490 with the Dynamic Sparse Attention (DSA) mechanism (a 32.6% reduction). Each of the three modules therefore makes a distinct and positive contribution, with the replacement of standard attention by DSA yielding the largest single gain. Furthermore, the complete DynamiXFormer model, which integrates all three components, achieved the best performance, demonstrating a positive synergistic effect among the proposed modules in which their combined impact exceeds the sum of their individual contributions. This synergy arises because the adaptive denoising module provides higher-quality input for subsequent components, while the relative event encoding module furnishes the attention mechanism with richer inter-event correlations. Together, these enhancements enable the dynamic sparse attention mechanism to focus more precisely on critical precursory information.

4.6. Analysis of Model Complexity and Inference Speed

To evaluate the practical viability of the model for real-world engineering applications, we analyzed the parameter count and inference speed of the DynamiXFormer, a standard Transformer, and an LSTM model. The models used for this analysis were those previously trained for the 0.8 m forecast horizon experiment. The results are presented in Table 5.
The results show that, owing to its sparse attention mechanism, our proposed DynamiXFormer has only 68.1k parameters. This is substantially lower than both the LSTM (204.4k) and the standard Transformer (293.6k) models, validating the efficiency of the sparse attention design.
In terms of inference speed, the structurally simpler LSTM model exhibited the fastest performance (3207 FPS), which can be attributed to the highly optimized nature of its underlying implementation in PyTorch. Although the DynamiXFormer (1295 FPS) is slower than the LSTM due to its more complex dynamic sparsity and adaptive logic, it is still slightly faster than the standard Transformer, a benefit also conferred by its sparse architecture.
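Throughput figures of this kind are usually obtained by timing repeated forward passes. The sketch below shows one generic way to do so, assuming FPS means predictions per second on single samples. The `dummy_predict` function stands in for a trained model and is purely hypothetical, as are the warm-up and repetition counts.

```python
import time

def measure_fps(predict, sample, n_runs=200, warmup=20):
    """Average number of `predict(sample)` calls completed per second."""
    for _ in range(warmup):          # warm-up calls are excluded from timing
        predict(sample)
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(sample)
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

def dummy_predict(window):
    """Placeholder workload; a real benchmark would call the trained model."""
    return sum(v * v for v in window) / len(window)

fps = measure_fps(dummy_predict, [0.1] * 1000)
```

When timing GPU models, the device must also be synchronized before reading the clock, otherwise asynchronous execution makes the measurement look faster than it is.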
This analysis demonstrates that the proposed DynamiXFormer model is fully capable of meeting the near real-time requirements of practical engineering applications.

4.7. Analysis of Model Performance with Varying Training Sample Sizes

To evaluate the model’s performance under different data availability scenarios, we retrained and evaluated the models using 100%, 75%, 50%, and 25% of the full training dataset, respectively. The LSTM and standard Transformer models, which demonstrated strong overall performance in previous experiments, were selected as baselines. A superior model, owing to its effective inductive biases and architectural design, should be capable of learning generalizable features even from limited data, thus exhibiting greater stability. The results of this experiment are illustrated in Figure 18.
The results indicate that while the performance of all models degrades as the training sample size decreases, the performance of the DynamiXFormer and LSTM models declines at a significantly slower rate than that of the standard Transformer. This suggests that the vanilla attention mechanism of the Transformer is more data-hungry. In contrast, DynamiXFormer incorporates effective inductive biases that guide the model to focus on critical physical patterns. Furthermore, its relative event encoding module captures inter-event correlations, enabling the model to learn more valuable feature representations even when data is sparse.

4.8. Sensitivity Analysis of the Key Risk Classification Threshold

To further validate the sensitivity of the classification thresholds established in this study, this section analyzes the boundary between “No Risk” and “Low Risk”. This threshold was selected for two primary reasons. First, it represents a critical warning boundary, marking the transition from a “safe” state to one requiring “alertness”. Second, sufficient samples exist on both sides of this threshold, which ensures that the analysis results are statistically significant.
In this section, the optimal threshold value was adjusted by ±5% and ±10%, respectively. The classification performance of the DynamiXFormer model on the 0.8 m short-term prediction task was then re-evaluated. The results are presented in Table 6.
The results in Table 6 validate that the threshold with 0.0% variation—the inflection point derived from the G-R relation in Section 3.4—indeed provides the optimal performance, achieving the highest Recall (97.85%) and the lowest FPR (0.72%).
Furthermore, the model’s performance did not exhibit a sharp decline when the threshold fluctuated around its optimal value. Even with a larger deviation of ±10%, the model’s Recall consistently remained at an excellent level (above 94%). This demonstrates that the model’s performance shows “graceful degradation” rather than a sudden failure, confirming its robustness to the precise threshold value.
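The sensitivity procedure can be sketched in a few lines. The base threshold is taken from Table 2; everything else is an assumption made for illustration: the helper names, the binary reading of the task (an energy above the threshold counts as "risk"), and the small `actual_j` and `predicted_j` lists, which are invented values and not data from this study.

```python
from typing import Dict, Sequence, Tuple

BASE_THRESHOLD_J = 4570.8818  # "No Risk" / "Low Risk" boundary from Table 2


def shifted_thresholds(base: float, variations: Sequence[float]) -> Dict[float, float]:
    """Scale the base threshold by each relative variation (e.g. -0.10 for -10%)."""
    return {v: base * (1.0 + v) for v in variations}


def recall_and_fpr(actual: Sequence[float], predicted: Sequence[float],
                   threshold: float) -> Tuple[float, float]:
    """Binary Recall and FPR, treating energy > threshold as the positive class."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        is_risk, flagged = a > threshold, p > threshold
        if is_risk and flagged:
            tp += 1
        elif is_risk:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, fpr


thresholds = shifted_thresholds(BASE_THRESHOLD_J, (-0.10, -0.05, 0.0, 0.05, 0.10))
rounded = {v: round(t, 1) for v, t in thresholds.items()}

# Invented energies (J) used only to exercise the metric function.
actual_j = [3000, 4200, 4600, 5200, 6000, 4400, 4700, 3900]
predicted_j = [3100, 4300, 4500, 5100, 6100, 4650, 4800, 3800]
recall, fpr = recall_and_fpr(actual_j, predicted_j, BASE_THRESHOLD_J)
```

Scaling the base value this way gives 4113.8 J, 4342.3 J, 4570.9 J, and 4799.4 J for the −10%, −5%, 0%, and +5% settings, matching the energy thresholds listed in Table 6.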

5. Discussion

5.1. Innovative Forecasting Paradigm and Model Architecture

To address the challenges in existing microseismic-based rock burst prediction, this study proposes a novel forecasting model, DynamiXFormer. This model departs from the conventional time-series paradigm by establishing a disturbance-driven forecasting framework that directly correlates predictions with mining activities, thereby enhancing the practical engineering relevance and accuracy of the forecasts. From a metrological perspective, this approach ensures a direct link between the measurement baseline (mining advance) and the object of measurement (rock mass response), significantly improving the specificity of the risk assessment.

5.2. Module Contributions and Synergy

The primary contributions of the DynamiXFormer model lie in three innovative modules. The adaptive frequency denoising module suppresses high-frequency noise while enhancing valid signals. The relative event encoding module constructs an inter-event relationship graph, capturing physical and spatio-temporal dependencies missed by traditional methods. Finally, the dynamic sparse attention mechanism introduces a powerful inductive bias, enabling the model to focus on both local precursory patterns and global critical shifts.

5.3. Comprehensive Performance Advantages

The experiments support the effectiveness of DynamiXFormer. In comparisons against baselines, including a standard Transformer and various RNNs, we found that the Transformer excels in short-term forecasting, while the LSTM has an edge in medium- to long-term tasks. By combining attention with an explicit inductive bias, DynamiXFormer achieves the lowest MAE, MSE, and FPR and the highest Recall among the compared models at all three forecast horizons (Table 3).

5.4. Limitations

The dataset in this study was sourced from a single longwall panel with a weak rock burst propensity. Therefore, the model’s generalizability to sites with different geological conditions or stronger burst tendencies requires further validation. Additionally, the model’s classification performance declined as the forecast horizon increased: Recall fell from 97.85% at 0.8 m to 80.90% at 2.4 m, and the FPR rose from 0.72% to 6.37% (Table 3). This limits its utility for long-range forecasting.

5.5. Future Work

Building on this research, our future work will proceed in three directions. First, we will apply the DynamiXFormer model to diverse, multi-source datasets from various mines. These datasets will integrate information such as geological structures, panel layouts, and changes in support systems to systematically evaluate the model’s cross-site generalization capabilities. Second, we plan to explore the use of Graph Neural Networks (GNNs) to construct more complex event relationship graphs. This would extend the pairwise relationships in the current Relative Event Encoding (REE) module to model more holistic, global physical correlations among microseismic events. Third, class imbalance remains a critical challenge. Artificially synthesized samples risk introducing unrealistic or “pseudo-physical” patterns, which can mislead the model during training and ultimately compromise its generalization performance. We will therefore systematically investigate techniques that mitigate this issue, such as cost-sensitive loss functions, physics-informed data augmentation (PIDA) methods, and few-shot learning.

Author Contributions

Conceptualization, J.Z.; Methodology, J.Z.; Software, Q.X.; Validation, T.L.; Formal analysis, J.Z. and Q.W.; Investigation, Q.X.; Resources, Q.W.; Data curation, T.L.; Writing—review & editing, S.W.; Visualization, H.W.; Supervision, H.W.; Funding acquisition, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Joint Fund Project of the National Natural Science Foundation of China (Grant No. U24A2086), the National Natural Science Foundation of China (Nos. 51774133 and 52074117), and the Research Fund of the State Key Laboratory for Fine Exploration and Intelligent Development of Coal Resources, CUMT (SKLCRSM20KF007).

Data Availability Statement

The data supporting the findings of this study are available from a partner coal mine, but access to these data is restricted as they were used under license for this study and are not publicly available. Data are available from the corresponding author with permission from the partner coal mine.

Conflicts of Interest

Author Qiang Wu was employed by the company Guizhou Ganxing Coal Industry Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Jiang, F.; Wei, Q.; Yao, S.; Wang, C. Key theory and technical analysis on mine pressure bumping prevention and control. Coal Sci. Technol. 2013, 41, 6–9. [Google Scholar]
  2. Pan, J.; Mao, D.; Lan, H.; Wang, S.; Qi, Q. Study status and prospects of mine pressure bumping control technology in China. Coal Sci. Technol. 2013, 41, 21–25. [Google Scholar]
  3. Ma, T.; Tang, C.; Liu, F.; Zhang, S.-C.; Feng, Z.-Q. Microseismic monitoring, analysis and early warning of rockburst. Geomat. Nat. Hazards Risk 2021, 12, 2956–2983. [Google Scholar]
  4. Yin, X.; Liu, Q.; Pan, Y.; Huang, X. A novel tree-based algorithm for real-time prediction of rockburst risk using field microseismic monitoring. Environ. Earth Sci. 2021, 80, 504. [Google Scholar] [CrossRef]
  5. Dong, L.; Yan, X.; Wang, J.; Tang, Z.; Wang, H.; Wu, W. Case study on pre-warning and protective measures against rockbursts utilizing the microseismic method in deep underground mining. J. Appl. Geophys. 2025, 237, 105687. [Google Scholar] [CrossRef]
  6. Jiang, R.; Dai, F.; Liu, Y.; Li, A. Fast marching method for microseismic source location in cavern-containing rockmass: Performance analysis and engineering application. Engineering 2021, 7, 1023–1034. [Google Scholar] [CrossRef]
  7. Liu, Y.; Dai, F.; Liu, K.; Wei, M. Continuum analysis of the structurally controlled displacements for large-scale underground caverns in bedded rock masses. Tunn. Undergr. Space Technol. 2020, 97, 103288. [Google Scholar] [CrossRef]
  8. Ma, K.; Shen, Q.; Sun, X.; Ma, T.-H.; Hu, J.; Tang, C.-A. Rockburst prediction model using machine learning based on microseismic parameters of Qinling water conveyance tunnel. J. Cent. South Univ. 2023, 30, 289–305. [Google Scholar] [CrossRef]
  9. Qin, C.; Zhao, W.; Chen, W.; Zhang, X.; Xie, P. Prediction of rockburst risk induced by mine tremor using ensemble learning techniques. J. Rock Mech. Geotech. Eng. 2025, 18, 1937–1953. [Google Scholar]
  10. Li, D.; Liu, Z.; Xiao, P.; Zhou, J.; Armaghani, D.J. Intelligent rockburst prediction model with sample category balance using feedforward neural network and Bayesian optimization. Undergr. Space 2022, 7, 833–846. [Google Scholar] [CrossRef]
  11. Wojtecki, Ł.; Iwaszenko, S.; Apel, D.B.; Bukowska, M.; Makówka, J. Use of machine learning algorithms to assess the state of rockburst hazard in underground coal mine openings. J. Rock Mech. Geotech. Eng. 2022, 14, 703–713. [Google Scholar] [CrossRef]
  12. Ma, K.; Xie, H.; Ren, F.; Chang, Y. Rockburst early-warning method based on time series prediction of multiple acoustic emission parameters. Tunn. Undergr. Space Technol. 2024, 153, 106060. [Google Scholar] [CrossRef]
  13. Yin, X.; Liu, Q.; Huang, X.; Pan, Y. Real-time prediction of rockburst intensity using an integrated CNN-Adam-BO algorithm based on microseismic data and its engineering application. Tunn. Undergr. Space Technol. 2021, 117, 104133. [Google Scholar] [CrossRef]
  14. Liu, H.; Ma, T.; Lin, Y.; Peng, K.; Hu, X.; Xie, S.; Luo, K. Deep learning in rockburst intensity level prediction: Performance evaluation and comparison of the NGO-CNN-BiGRU-attention model. Appl. Sci. 2024, 14, 5719. [Google Scholar] [CrossRef]
  15. Cui, F.; He, S.F.; Luo, Z.; Zong, C.; Li, H.; Ma, L.; Zhao, Z.; Yang, X. Research on multi-index early warning of rock burst based on bayesian optimization algorithm and machine learning. J. China Coal Soc. 2025, 50, 297–313. [Google Scholar]
  16. Qiao, M.; Shi, Y. Prediction of rock burst risk level based on combination of physical indexes and deep learning. J. Saf. Sci. Technol. 2024, 20, 56–63. [Google Scholar]
  17. Cao, A.; Liu, Y.; Yang, X.; Li, S.; Wang, C.; Bai, X.; Liu, Y. Physical index and data fusion-driven method for coal burst prediction in time sequence. J. China Coal Soc. 2023, 48, 3659–3673. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017. [Google Scholar] [CrossRef]
  19. Zhang, Z.; Ye, Y.; Luo, B.; Chen, G.; Wu, M. Investigation of microseismic signal denoising using an improved wavelet adaptive thresholding method. Sci. Rep. 2022, 12, 22186. [Google Scholar] [CrossRef]
  20. Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 2006, 100, 90–93. [Google Scholar] [CrossRef]
  21. Maxwell, S. Microseismic Imaging of Hydraulic Fracturing: Improved Engineering of Unconventional Shale Reservoirs; Society of Exploration Geophysicists: Tulsa, OK, USA, 2014. [Google Scholar]
  22. Roy, A.; Saffar, M.; Vaswani, A.; Grangier, D. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Comput. Linguist. 2021, 9, 53–68. [Google Scholar] [CrossRef]
  23. Liu, L.; Qu, Z.; Chen, Z.; Ding, Y.; Xie, Y. Transformer acceleration with dynamic sparse attention. arXiv 2021, arXiv:2110.11299. [Google Scholar] [CrossRef]
  24. Zhao, G.; Lin, J.; Zhang, Z.; Ren, X.; Su, Q.; Sun, X. Explicit sparse transformer: Concentrated attention through explicit selection. arXiv 2019, arXiv:1912.11637. [Google Scholar] [CrossRef]
  25. Cao, A.; Dou, L.; Wang, C.; Yao, X.X.; Dong, J.Y.; Gu, Y. Microseismic precursory characteristics of rock burst hazard in mining areas near a large residual coal pillar: A case study from Xuzhuang coal mine, Xuzhou, China. Rock Mech. Rock Eng. 2016, 49, 4407–4422. [Google Scholar] [CrossRef]
  26. Tang, C. Numerical simulation of progressive rock failure and associated seismicity. Int. J. Rock Mech. Min. Sci. 1997, 34, 249–261. [Google Scholar]
  27. Anikiev, D.; Birnie, C.; Waheed, U.; Alkhalifah, T.; Gu, C.; Verschuur, D.J.; Eisner, L. Machine learning in microseismic monitoring. Earth-Sci. Rev. 2023, 239, 104371. [Google Scholar] [CrossRef]
  28. Zhang, X.; Hou, D.; Mao, Q.; Wang, Z. Predicting microseismic sensitive feature data using variational mode decomposition and transformer. J. Seismol. 2024, 28, 229–250. [Google Scholar] [CrossRef]
  29. Fei, Y.; Yang, X.; Chuan, J.; Wu, X.S.; Cheng, H.M.; Lü, X.F. Time series prediction of microseismic energy level based on feature extraction of one-dimensional convolutional neural network. Chin. J. Eng. 2021, 43, 1003–1009. [Google Scholar]
  30. Zorn, E.; Kumar, A.; Harbert, W.; Hammack, R. Geomechanical analysis of microseismicity in an organic shale: A West Virginia Marcellus Shale example. Interpretation 2019, 7, T231–T239. [Google Scholar] [CrossRef]
  31. Gu, J.; Wei, F. On the quantification of seismic activity: Seismic activity rate. Earthq. Res. China 1987, 3, 14–24. [Google Scholar]
  32. Luo, L.; Hou, J. Scaling of seismic activity. Earthquake 1987, 40–45. [Google Scholar]
  33. Gutenberg, B.; Richter, C.F. Frequency of earthquakes in California. Bull. Seismol. Soc. Am. 1944, 34, 185–188. [Google Scholar] [CrossRef]
  34. Scholz, H. The frequency-magnitude relation of microfracturing in rock and its relation to earthquakes. Bull. Seismol. Soc. Am. 1968, 58, 399–415. [Google Scholar]
  35. Kijko, A.; Funk, C.W. The assessment of seismic hazards in mines. J. South. Afr. Inst. Min. Metall. 1994, 94, 179–185. [Google Scholar]
  36. Cui, F.; Zong, C.; Lai, X.; He, S.; Zhang, S.; Jia, C. Intelligent prediction of time series and grade of rock burst in steeply inclined ultrathick coal seam excavation roadway. J. China Coal Soc. 2025, 50, 845–861. [Google Scholar]
  37. Xie, J.; Zhang, Y.; Zhang, Y.; Ding, G.L.; Shi, C.H.; Yao, R. Optimization of microseismic energy early-warning index based on energy level and frequency analysis. Coal Eng. 2021, 53, 67–72. [Google Scholar]
  38. Liu, H.; Xu, F.; Liu, B.; Deng, M. Time-series prediction method for risk level of rockburst disaster based on CNN-LSTM. J. Cent. South Univ. (Sci. Technol.) 2021, 52, 659–670. [Google Scholar]
  39. Shuang, G.; Yi, T.; Wen, W. Prediction and evaluation of coal mine coal bump based on improved deep neural network. Geofluids 2021, 2021, 5594019. [Google Scholar] [CrossRef]
  40. Shu, P.; Yang, Z.; Lai, X.; Xu, H.; Hu, Q.; Guo, Z. An analytical methodology of rock burst with fully mechanized top-coal caving mining in steeply inclined thick coal seam. Sci. Rep. 2024, 14, 651. [Google Scholar] [CrossRef]
  41. Hochreiter, S.; Schmidhuber, J.; Computation, N. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
Figure 1. Diagram of the DynamiXFormer Model.
Figure 2. Structure of the Adaptive Frequency Denoise Block (AFDB).
Figure 3. Structure of the Relative Event Encoding Module.
Figure 4. Structure of the Dynamic Sparse Attention Module.
Figure 5. (a) Daily Microseismic Frequency and Energy. (b) Energy Interval Distribution.
Figure 6. Relationship Between Daily Cumulative Energy Release and Working Face Advancement Distance.
Figure 7. Reconstructed Dataset Using 0.8 m Advancement as the Minimum Prediction Unit.
Figure 8. Correlation Heatmap Between Features and the Target Variable.
Figure 9. Magnitude-Frequency Curve of Microseismic Events.
Figure 10. Fitting Results of the Exponential and Linear Segments in the Magnitude-Frequency Curve.
Figure 11. Magnitude-Frequency Curve of the Linear Segment.
Figure 12. Visualization of Prediction Results for Different Models.
Figure 13. Comparison of Actual and Predicted Values for Different Prediction Horizons.
Figure 14. Confusion Matrices of DynamiXFormer Model for Different Prediction Lengths.
Figure 15. Performance evaluation of DynamiXFormer under a 0.8 m prediction unit.
Figure 16. Prediction error distribution of DynamiXFormer for microseismic energy forecasting.
Figure 17. Ablation Study of DynamiXFormer Components.
Figure 18. Model Performance Comparison vs. Training Sample Size.
Table 1. A Summary of Research Gaps and Our Corresponding Contributions.

| Research Gaps | Our Solutions and Contributions |
|---|---|
| Weak correlation between conventional time-based predictions and the non-uniform nature of mining activities. | Aligning predictions with engineering practices by using the mining face advance distance as the primary benchmark instead of time. |
| Difficulty for time-series models in handling ambient noise in mines, which often masks critical precursory information. | Introducing a frequency-domain perspective to suppress noise and adaptively amplify signals in key frequency bands. |
| The inherent spatio-temporal relationships among individual microseismic events are typically underutilized. | Constructing an event relationship graph to explicitly model the complex dependencies between microseismic events. |
| Inability of standard sequential models to simultaneously capture both local, abrupt precursors and long-term, cumulative trends. | Incorporating a specific inductive bias that enables the model to dynamically focus on both local key patterns and global evolutionary trends. |
Table 2. Classification of Rockburst Risk Levels Based on Energy Thresholds.

| Energy Level (J) | Risk Level |
|---|---|
| ≤4570.8818 | No Risk |
| 4570.8818~5688.5293 | Low Risk |
| 5688.5293~7079.4578 | Medium Risk |
| ≥7079.4578 | High Risk |
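The mapping in Table 2 can be written as a small lookup. This is a sketch under stated assumptions: the function name `classify_risk` is hypothetical, and because the table does not say which class owns the shared boundary at 5688.5293 J, assigning that exact value to "Low Risk" is an arbitrary choice made here.

```python
NO_RISK_MAX_J = 4570.8818    # energies at or below this value are "No Risk"
LOW_RISK_MAX_J = 5688.5293   # assumed upper (inclusive) bound of "Low Risk"
HIGH_RISK_MIN_J = 7079.4578  # energies at or above this value are "High Risk"


def classify_risk(energy_j: float) -> str:
    """Map a predicted energy (J) to a Table 2 risk level."""
    if energy_j <= NO_RISK_MAX_J:
        return "No Risk"
    if energy_j <= LOW_RISK_MAX_J:
        return "Low Risk"
    if energy_j < HIGH_RISK_MIN_J:
        return "Medium Risk"
    return "High Risk"


levels = [classify_risk(e) for e in (4000.0, 5000.0, 6000.0, 8000.0)]
```

Applying the same function to both the predicted and the observed energy turns the regression output into the class labels needed for Recall and FPR.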
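Table 3. Performance Comparison of Different Models Across Various Prediction Horizons.

| Model | Prediction Length (m) | MAE | MSE | Recall | FPR |
|---|---|---|---|---|---|
| LSTM | 0.8 | 0.029067 | 0.001481 | 94.68% | 1.77% |
| LSTM | 1.6 | 0.082305 | 0.011483 | 84.57% | 5.14% |
| LSTM | 2.4 | 0.140787 | 0.031744 | 79.79% | 6.74% |
| CNN-LSTM | 0.8 | 0.104130 | 0.017624 | 80.32% | 6.56% |
| CNN-LSTM | 1.6 | 0.147498 | 0.034646 | 77.13% | 7.62% |
| CNN-LSTM | 2.4 | 0.182309 | 0.053158 | 70.21% | 9.93% |
| DNN | 0.8 | 0.042605 | 0.003223 | 90.96% | 3.01% |
| DNN | 1.6 | 0.092905 | 0.014607 | 82.45% | 5.85% |
| DNN | 2.4 | 0.138497 | 0.031189 | 77.66% | 7.45% |
| CNN-BiLSTM-Attention | 0.8 | 0.093933 | 0.014566 | 83.51% | 5.50% |
| CNN-BiLSTM-Attention | 1.6 | 0.143154 | 0.032984 | 78.19% | 7.27% |
| CNN-BiLSTM-Attention | 2.4 | 0.189802 | 0.055854 | 72.87% | 9.04% |
| CNN-BiGRU-Attention | 0.8 | 0.098066 | 0.016346 | 82.98% | 5.67% |
| CNN-BiGRU-Attention | 1.6 | 0.152611 | 0.036255 | 74.47% | 8.51% |
| CNN-BiGRU-Attention | 2.4 | 0.179881 | 0.051004 | 73.40% | 8.87% |
| Transformer | 0.8 | 0.027019 | 0.001318 | 93.48% | 2.17% |
| Transformer | 1.6 | 0.067383 | 0.007320 | 81.97% | 6.01% |
| Transformer | 2.4 | 0.104134 | 0.017592 | 77.47% | 7.51% |
| DynamiXFormer (ours) | 0.8 | 0.015936 | 0.000518 | 97.85% | 0.72% |
| DynamiXFormer (ours) | 1.6 | 0.042359 | 0.003567 | 88.11% | 3.96% |
| DynamiXFormer (ours) | 2.4 | 0.070380 | 0.008479 | 80.90% | 6.37% |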
Table 4. Results of the Ablation Study on DynamiXFormer Model Performance.

| Model | MAE | MSE |
|---|---|---|
| Baseline (Transformer) | 0.072749 | 0.008827 |
| Baseline + AdaptiveFreqDenoiseBlock | 0.052566 | 0.005065 |
| Baseline + RelativeEventEmbedding | 0.053546 | 0.005313 |
| Baseline + DynamicSparseAttention | 0.049045 | 0.004447 |
Table 5. Comparison of Model Parameters and Inference Speed.

| Model | Total Parameters | Average Latency (ms, CPU) | Inference Speed (FPS, CPU) |
|---|---|---|---|
| LSTM | 204,417 | 0.3119 | 3206.58 |
| DynamiXFormer | 68,117 | 0.7724 | 1294.73 |
| Transformer | 293,601 | 0.8625 | 1159.46 |
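The two speed columns of Table 5 are related by simple arithmetic: a per-sample latency in milliseconds corresponds to roughly 1000 / latency samples per second. The sketch below checks this with the values as read from the table. The helper name `fps_from_latency_ms` is hypothetical, and small gaps from the reported FPS are expected because the latencies are rounded to four decimals.

```python
# (total parameters, average CPU latency in ms, reported CPU FPS) as read from Table 5
table5 = {
    "LSTM": (204_417, 0.3119, 3206.58),
    "DynamiXFormer": (68_117, 0.7724, 1294.73),
    "Transformer": (293_601, 0.8625, 1159.46),
}


def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert a per-sample latency in milliseconds to samples per second."""
    return 1000.0 / latency_ms


# Relative gap between the FPS implied by the latency and the reported FPS.
relative_gaps = {
    name: abs(fps_from_latency_ms(latency) - fps) / fps
    for name, (_, latency, fps) in table5.items()
}

# Parameter count of DynamiXFormer relative to the standard Transformer.
param_ratio_vs_transformer = table5["DynamiXFormer"][0] / table5["Transformer"][0]
```

On these numbers the implied and reported FPS agree to within 0.1%, and DynamiXFormer uses about 23% of the parameters of the standard Transformer.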
Table 6. Sensitivity Analysis of Key Risk Thresholds.

| Threshold Variation | Energy Threshold (J) | Recall (%) | FPR (%) |
|---|---|---|---|
| −10% | 4113.8 | 94.09 | 1.97 |
| −5% | 4342.3 | 96.24 | 1.25 |
| 0.0% (Optimal) | 4570.9 | 97.85 | 0.72 |
| +5% | 4799.4 | 97.31 | 0.90 |
Citation

Zhang, J.; Wu, H.; Wu, Q.; Xia, Q.; Wei, S.; Ling, T. Learning from Disturbances, Not Timestamps: A Dynamic Event-Driven Transformer for Rock Burst Forecasting. Processes 2026, 14, 1413. https://doi.org/10.3390/pr14091413