2.1. Gradient-Sensitive Temporal Adversarial Attention Network with Fractal-Aware Localization (GSLAN-BiLSTM)
Long short-term memory (LSTM) networks, proposed by Hochreiter and Schmidhuber (1997), have been widely applied in temporal modeling due to their gating architecture [
14]. LSTM effectively addresses the vanishing gradient problem in traditional recurrent neural networks (RNNs) through the collaborative operation of the forget gate, input gate, and output gate, enabling the capture of long-term dependencies in time-series data [
6]. Specifically, the forget gate determines which historical information to discard from the cell state, the input gate filters and adds new features to the cell state, and the output gate generates the current hidden state based on the cell state. However, carbon emission price sequences exhibit significant non-stationary characteristics and high-frequency noise coupling phenomena [
15]. A single LSTM struggles to capture cross-scale dynamic correlations due to limitations in feature extraction dimensions. Bidirectional LSTM (BiLSTM) networks construct bidirectional information flows through reverse temporal encoding layers, theoretically enhancing feature expression capabilities. The forward LSTM encodes sequence information in chronological order, while the backward LSTM processes data in reverse chronological order; the concatenated hidden states thus fuse past and future contextual information. Jamshidzadeh et al. (2024) proposed a novel hybrid model, BiLSTM-SVM, for predicting water quality parameters. By combining the bidirectional information processing capability of BiLSTM with the classification advantages of SVM, the model effectively extracts key features from the data, significantly improving prediction accuracy [16]. The role of BiLSTM lies in its bidirectional structure, which simultaneously captures past and future dependencies in time series, addressing the shortcomings of traditional SVM in feature extraction and thereby improving overall prediction performance. Liu et al. (2025) used BiLSTM to capture complex dynamic features in time series, demonstrating excellent performance in predicting pollutants such as PM2.5 (R² exceeding 0.94), highlighting BiLSTM's core advantages in handling high-dimensional time-series data and nonlinear relationships [17]. However, standard BiLSTM still faces challenges such as noise sensitivity and insufficient generalization on small samples in carbon price prediction. Carbon market data are limited, especially in emerging markets, making models prone to overfitting from insufficient training data and to degraded generalization from over-parameterization.
To address the aforementioned bottlenecks, existing studies primarily employ heterogeneous model integration strategies to optimize BiLSTM. Some of these studies focus on improving BiLSTM with Transformers. Dong et al. (2024) combined EMD with Transformer-BiLSTM, decomposing the air quality index into intrinsic mode functions (IMFs) and performing parallel predictions, achieving an RMSE as low as 1.80 on multiple Indian datasets [
18]. Fan et al. (2024) employed a fusion model of Transformer-BiLSTM, demonstrating outstanding performance in time-series prediction within the energy and environmental sectors [
19]. In photovoltaic output prediction, the self-attention mechanism of Transformer can directly capture the correlation between historical radiation, temperature, and future output, thereby avoiding information loss caused by overly long sequences in BiLSTM, reducing the RMSE to 5.685, an improvement of 42.38% over the traditional BiLSTM [
19]. Cao et al. (2024) proposed a hybrid architecture combining LSTM and Transformer, integrating online learning and knowledge distillation techniques to achieve high-precision real-time multi-task prediction in engineering systems, significantly enhancing computational efficiency and dynamic adaptability [
3]. However, the Transformer's self-attention mechanism requires sufficient training data to avoid overfitting, which limits its applicability in carbon price prediction scenarios. Other scholars have focused on attention-enhanced BiLSTM (Attention-BiLSTM), which reinforces key time step features through dynamic weight allocation. Zrira et al. (2024) improved sea surface temperature prediction accuracy by capturing bidirectional temporal information and allocating attention weights, outperforming LSTM, XGBoost, and other models [
20]. Guo et al. (2023) applied Attention-BiLSTM to lithium-ion battery degradation trend prediction, combining gray relational analysis (GRA) and empirical mode decomposition (EMD) to filter redundant features, achieving an RMSE below 1.15 in both open-source and real-vehicle datasets, validating the attention mechanism’s role in reinforcing key features [
21].
Teragawa et al. (2024) proposed introducing PGD adversarial training on a dual residual structure (TCN + BiLSTM) to force the model to learn invariant representations of key methylation features, optimizing feature extraction, mitigating overfitting, and significantly improving prediction accuracy and model robustness [22]. However, adversarial samples generated by traditional PGD may overly disturb global features, disrupting the local temporal relationships in time series and reducing the model's sensitivity to temporal patterns. While these methods partially improve prediction accuracy, they do not systematically address the co-optimization of noise robustness and small-sample generalization, leaving clear limitations in carbon price prediction scenarios.
This study proposes the GSLAN-BiLSTM model, which integrates local adversarial training (Local-PGD) and a dynamic gated attention network into the BiLSTM framework. By identifying critical time steps through gradient significance analysis and applying directed perturbations, combined with the multimodal fusion of multi-head attention and bidirectional LSTM features, the model enhances the noise robustness and abrupt-change responsiveness of carbon market price prediction. The core innovations are threefold: (1) Local adversarial training optimization: A multi-step adversarial perturbation generation module is embedded at the model input end. Based on gradient significance analysis, key time steps are selected and projected gradient descent perturbations are applied in a targeted manner. By constructing local worst-case perturbation scenarios, the model is forced to learn robust representations of market anomalies while maintaining the integrity of the temporal structure, thereby avoiding the feature distortion caused by traditional global perturbations. (2) Heterogeneous temporal feature fusion: A bidirectional LSTM is designed to capture long- and short-term forward and reverse dependencies, while multi-head self-attention dynamically focuses on key time step interaction patterns. Features from the dual pathways are fused across modalities through pooling compression and nonlinear mapping, enhancing the model's ability to represent the non-stationary characteristics of carbon market price sequences. (3) Dynamic robustness constraint mechanism: Lipschitz continuity constraints are introduced to smooth the model's decision boundary, combined with an alternating training strategy that balances learning of the original data patterns with invariance to adversarial samples. This mechanism adaptively adjusts perturbation intensity and loss weights to suppress overfitting risks in small-sample scenarios.
2.2. Fractal-Driven Implementation Principle of GSLAN-BiLSTM Core Module
Gradient sensitivity-driven local adversarial training: Distinct from traditional global perturbations, this model applies PGD perturbations only in regions sensitive to market fluctuations. This strategy identifies fractal critical time steps through multifractal detrended fluctuation analysis (MF-DFA) and constructs local worst-case perturbation scenarios, forcing the model to learn robust representations of abnormal fluctuations. The generalized Hurst exponent is calculated by dividing the time series into subintervals of length $s$ via MF-DFA, computing the fluctuation functions for each moment order, and performing log-log fitting; the critical time steps correspond to regions where the Hurst exponent changes significantly. Simultaneously, a dynamic batch processing strategy constrains the perturbation range: each training batch independently generates adversarial samples, with perturbation calculations relying solely on the current batch's input data and gradient information, thereby preventing gradient information leakage across batches or between cross-validation folds.
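For illustration, the following NumPy sketch estimates the generalized Hurst exponent via MF-DFA as described above; the segment lengths, moment orders, and linear detrending are illustrative choices rather than the authors' exact configuration. Applying it to sliding windows of the price series and flagging windows where h(q) shifts sharply is one way to mark the fractal critical time steps.

```python
import numpy as np

def mfdfa_hurst(series, scales, q_list, poly_order=1):
    """Simplified MF-DFA: the log-log slope of the q-th-order fluctuation
    function F_q(s) against the segment length s gives h(q)."""
    profile = np.cumsum(series - np.mean(series))          # integrated profile
    hurst = {}
    for q in q_list:
        log_f = []
        for s in scales:
            n_seg = len(profile) // s
            rms = []
            for v in range(n_seg):
                seg = profile[v * s:(v + 1) * s]
                t = np.arange(s)
                trend = np.polyval(np.polyfit(t, seg, poly_order), t)
                rms.append(np.sqrt(np.mean((seg - trend) ** 2)))
            rms = np.asarray(rms)
            if q == 0:                                      # limit case for q = 0
                f_q = np.exp(0.5 * np.mean(np.log(rms ** 2)))
            else:
                f_q = np.mean(rms ** q) ** (1.0 / q)
            log_f.append(np.log(f_q))
        hurst[q] = np.polyfit(np.log(scales), log_f, 1)[0]  # slope = h(q)
    return hurst

# Example: h(2) ~ 0.5 for white noise, > 0.5 for persistent series
h = mfdfa_hurst(np.random.randn(1024), scales=[16, 32, 64, 128], q_list=[-2, 0, 2])
```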
A differentiable perturbation module is embedded after the BiLSTM input layer and the attention mechanism, and directed adversarial examples are generated with the projected gradient descent (PGD) algorithm. This training mechanism combines a dual-path perturbation injection strategy with the synergistic optimization objectives of "original data robustness" and "key time step robustness." Perturbations are applied to the input layer to enhance the model's robustness to noise in the original data, while local perturbations are introduced into the attention weight matrix to focus on the dynamic sensitivity of critical time steps. The core innovation lies in a gradient saliency mapping mechanism, which uses gradient backpropagation to obtain the importance weight of each time step in the input sequence. The calculation formula is as follows:
$$S_t = \left| \frac{\partial L(\hat{y}, y)}{\partial x_t} \odot x_t \right|$$
where $S_t$ represents the significance score of time step $t$, measuring the contribution of that time step to the loss function; $L$ is the loss function; $x_t$ represents the feature values of the input sequence at time step $t$; $\hat{y}$ is the model prediction value; and $\odot$ indicates element-wise multiplication.
After calculating the gradient significance matrix, the top k% of sensitive time steps are selected to form the critical region M(K). Within this region, multi-step PGD perturbations are generated, with the perturbation strength ϵ and iteration count optimized via grid search, and the optimal parameter combination determined through cross-validation. The perturbation generation formula is as follows:
$$x_{n+1}^{adv} = \Pi_{\epsilon}\!\left(x_{n}^{adv} + \alpha \cdot M(K) \odot \operatorname{sign}\!\big(\nabla_{x} L(f(x_{n}^{adv}), y)\big)\right)$$
where $x^{adv}$ represents the input sequence after adversarial perturbation; $\alpha$ is the single-step perturbation size; $\operatorname{sign}(\cdot)$ denotes the sign function, ensuring that the perturbation direction aligns with the gradient direction; $\nabla_{x} L$ denotes the gradient of the loss function with respect to the input; and $\Pi_{\epsilon}(\cdot)$ projects the perturbed sequence back into the $\epsilon$-neighborhood of the original input.
To keep the perturbations within the current training batch, a dynamic batching strategy is used to avoid gradient information leakage across batches or cross-validation folds. By constraining the perturbation range and enforcing batch independence, the model's generalization to market events is maximized while the local continuity of the temporal structure is maintained, ensuring that the feature encoder satisfies δ-Lipschitz stability. If a function $f$ satisfies $\|f(x_1) - f(x_2)\| \le \delta \|x_1 - x_2\|$ for any inputs $x_1$ and $x_2$, then $f$ is called a δ-Lipschitz function. In the model, by limiting the perturbation intensity $\epsilon$, we ensure that small input perturbations do not cause drastic fluctuations in the output predictions, thereby enhancing the model's robustness to adversarial samples while avoiding feature distortion.
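A minimal TensorFlow sketch of the gradient-saliency-guided local PGD described above is given below; the hyperparameters (eps, step size, iteration count, top-k fraction) and the per-sample thresholding are illustrative assumptions, and in the full framework the resulting adversarial batches would be alternated with clean batches during training.

```python
import tensorflow as tf

def local_pgd(model, x, y, loss_fn, eps=0.01, alpha=0.002, steps=3, top_frac=0.2):
    """Sketch: perturb only the most salient time steps, then project
    the perturbation back into an L-infinity ball of radius eps."""
    # 1) Gradient-saliency scores S_t = |dL/dx_t * x_t|, summed over features
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(y, model(x, training=False))
    grad = tape.gradient(loss, x)                              # (batch, T, features)
    saliency = tf.reduce_sum(tf.abs(grad * x), axis=-1)        # (batch, T)

    # 2) Mask M(K): keep only the top-k% most sensitive time steps per sample
    k = max(1, int(x.shape[1] * top_frac))
    threshold = tf.sort(saliency, axis=1, direction="DESCENDING")[:, k - 1:k]
    mask = tf.cast(saliency >= threshold, x.dtype)[..., tf.newaxis]

    # 3) Multi-step PGD restricted to the masked region, projected to the eps-ball
    x_adv = x
    for _ in range(steps):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y, model(x_adv, training=False))
        g = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * mask * tf.sign(g)              # ascend only in M(K)
        x_adv = x + tf.clip_by_value(x_adv - x, -eps, eps)     # projection step
    return tf.stop_gradient(x_adv)
```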
Memory-Guided Gated Attention with Fractal Adaptive Decay: To address the issue of weight oscillation in traditional attention mechanisms during abrupt transition regions, we design a memory state-guided attention gating unit that introduces a gating factor gt to dynamically regulate the fusion weights between historical memory states and current hidden states. This enables selective attenuation of historical information while achieving adaptive focus on recent features. The mathematical formulation is presented as follows:
$$g_t = \sigma\big(W_g[C_{t-1}, h_t] + b_g\big)$$
where $C_{t-1}$ denotes the historical memory state, $h_t$ represents the current hidden state, $W_g$ and $b_g$ are the learnable gate parameters, and $\sigma$ is the Sigmoid activation function. The gating factor $g_t$ dynamically regulates the attention decay rate. During stable phases of carbon market prices, such as periodic fluctuations, $g_t$ approaches 1 to preserve long-term dependencies. In abrupt change phases, such as policy shocks or market panics, $g_t$ approaches 0 to attenuate historical information and rapidly focus on short-term dynamics. The computation of attention weights is further optimized as follows:
$$\alpha_t = \operatorname{softmax}\big(g_t \odot (W_q h_t)\big)$$
where $\alpha_t$ is the attention weight for time step $t$; $W_q$ is the query weight matrix, which maps the hidden state to the query space; and $g_t \odot (W_q h_t)$ is the element-wise multiplication of the gating factor and the query vector, achieving dynamic weight decay. This mechanism effectively alleviates the lag effect of traditional attention in non-stationary time series and improves the model's response speed to sudden events.
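The following Keras layer sketches this memory-guided gated attention: a gate computed from the shifted memory states and current hidden states scales the query before softmax normalization over time. The parametrization (the gate and query projections, the scalar scoring head with tanh) and the assumption that per-step memory states are exposed by the recurrent layer are illustrative choices, not the authors' exact formulation.

```python
import tensorflow as tf

class MemoryGatedAttention(tf.keras.layers.Layer):
    """Sketch of a gate g_t built from C_{t-1} and h_t that decays the query."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.gate_dense = tf.keras.layers.Dense(units, activation="sigmoid")  # W_g, b_g
        self.query_dense = tf.keras.layers.Dense(units, use_bias=False)       # W_q
        self.score_dense = tf.keras.layers.Dense(1, use_bias=False)           # scoring head (assumed)

    def call(self, hidden_states, memory_states):
        # hidden_states, memory_states: (batch, T, d) sequences of h_t and C_t
        # g_t = sigmoid(W_g [C_{t-1}, h_t] + b_g); shift C by one step so C_{t-1} aligns with h_t
        prev_memory = tf.concat(
            [tf.zeros_like(memory_states[:, :1, :]), memory_states[:, :-1, :]], axis=1)
        g_t = self.gate_dense(tf.concat([prev_memory, hidden_states], axis=-1))
        # gated query g_t * (W_q h_t), scored and normalized over time steps
        gated_query = g_t * self.query_dense(hidden_states)
        scores = self.score_dense(tf.tanh(gated_query))            # (batch, T, 1)
        alpha = tf.nn.softmax(scores, axis=1)                      # attention weights
        context = tf.reduce_sum(alpha * hidden_states, axis=1)     # fused representation
        return context, alpha
```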
Heterogeneous Temporal Feature Fusion Module: The forward and backward LSTM layers capture sequential data’s forward and reverse temporal evolution patterns, generating hidden state representations that incorporate bidirectional temporal dependencies. The multi-head attention pathway employs an 8-head attention mechanism (key_dim = 64), which computes cross-time-step attention weight matrices to extract interactive features between different time steps, effectively capturing long-range dependencies. The bidirectional LSTM and multi-head attention outputs are concatenated through a feature fusion layer and then processed by pooling operations before being fed into a fully connected layer. Nonlinear feature transformation is achieved via ReLU activation, ultimately generating the fused temporal feature representation.
Bidirectional LSTM pathway:
$$\overrightarrow{h}_t = \overrightarrow{\operatorname{LSTM}}\big(x_t, \overrightarrow{h}_{t-1}\big), \qquad \overleftarrow{h}_t = \overleftarrow{\operatorname{LSTM}}\big(x_t, \overleftarrow{h}_{t+1}\big), \qquad h_t = \big[\overrightarrow{h}_t;\ \overleftarrow{h}_t\big]$$
where $\overrightarrow{\operatorname{LSTM}}$ is the forward LSTM layer, encoded in chronological order; $\overleftarrow{\operatorname{LSTM}}$ is the backward LSTM layer, encoded in reverse chronological order; and $h_t$ is the concatenated hidden state of the bidirectional LSTM output.
Multi-head attention pathway:
$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, obtained by linear transformation of the input features; $d_k$ is the dimension of the key vector, used to scale the dot product and stabilize the gradient.
Feature fusion and output:
$$F = W_2\,\operatorname{ReLU}\!\big(W_1\,\operatorname{Pool}\big([H_{\mathrm{BiLSTM}};\ H_{\mathrm{Attn}}]\big)\big)$$
where $H_{\mathrm{BiLSTM}}$ is the hidden state output of the bidirectional LSTM; $H_{\mathrm{Attn}}$ is the feature vector output of the multi-head attention; $\operatorname{Pool}(\cdot)$ is the pooling operation used for dimensionality reduction; $W_1$ and $W_2$ are the weight matrices of the fully connected layers; and $\operatorname{ReLU}$ is the rectified linear activation function, which introduces a nonlinear transformation.
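A compact Keras sketch of the dual-pathway fusion backbone described above is shown below; the 8-head, key_dim = 64 attention follows the text, while the LSTM width, pooling type, dense size, and single-step output head are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fusion_backbone(window_len, n_features, lstm_units=64):
    """Dual-pathway backbone: BiLSTM + 8-head self-attention, fused and pooled."""
    inputs = layers.Input(shape=(window_len, n_features))

    # Pathway 1: bidirectional LSTM captures forward and backward dependencies
    bi = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(inputs)

    # Pathway 2: 8-head self-attention captures cross-time-step interactions
    attn = layers.MultiHeadAttention(num_heads=8, key_dim=64)(inputs, inputs)

    # Fusion: concatenate, pool over time, then nonlinear mapping with ReLU
    fused = layers.Concatenate(axis=-1)([bi, attn])
    pooled = layers.GlobalAveragePooling1D()(fused)
    hidden = layers.Dense(128, activation="relu")(pooled)
    output = layers.Dense(1)(hidden)               # next-step carbon price
    return tf.keras.Model(inputs, output)
```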
The combined structure of BiLSTM and attention layers centers on bidirectional temporal feature extraction and dynamic weight allocation, achieving precise responses to carbon price inflection points through memory-guided gating mechanisms. BiLSTM constructs feature representations incorporating bidirectional temporal dependencies through parallel computation of forward and reverse LSTM layers. The forward LSTM encodes historical trends in price sequences in chronological order, capturing the influence of past information on the current state. At the same time, the backward LSTM extracts future information in reverse chronological order to identify the correlation between future information and the current state, such as the potential impact of policy expectations or market sentiment on current prices. The forward and backward hidden states are then concatenated to form a bidirectional hidden state, creating a comprehensive feature that incorporates both past and future context, thereby addressing the limitations of single-directional information processing in a single LSTM network. The attention layer introduces a gating factor to dynamically adjust the fusion weights between historical memory and the current hidden state. The gating factor is calculated using the Sigmoid activation function based on the current hidden and historical memory states. The principle of gradient significance guiding local adversarial training is shown in
Figure 1. When carbon prices are stable, the gating factor approaches 1 to preserve long-term dependencies. When prices undergo sudden changes, it approaches 0 to decay historical information and rapidly focus on short-term dynamics. Attention weights are generated by multiplying the gating factor element-wise with the query vector obtained from the query weight matrix and then applying the softmax function, effectively mitigating the lag effect of traditional attention in non-stationary time series. The outputs of the bidirectional LSTM and attention layers interact across time steps through a multi-head attention mechanism. After feature concatenation, pooling operations, and nonlinear transformations via a fully connected layer, a fused temporal feature representation is generated, enhancing the model's ability to model the non-stationary characteristics of carbon price sequences. The improved BiLSTM-attention architecture is shown in
Figure 2.
2.3. Residual Correction Framework Based on Multiscale Wavelet Decomposition and Dynamic Weighted Fusion
In carbon market price forecasting, traditional wavelet packet decomposition (WPD) struggles to adapt to the non-stationary volatility characteristics of price series due to its fixed weight distribution, leading to overfitting or under-correction during residual correction. To address this limitation, this study proposes a residual dynamic correction framework comprising decomposition, forecasting, and final correction. The core of this framework lies in achieving residual fusion through multiscale frequency-domain decoupling and a data-driven dynamic weighting mechanism, thereby enhancing the model's ability to analyze complex volatility. Specifically, WPD is first used to adaptively decompose the prediction residual sequence of the base GSLAN-BiLSTM model into a set of frequency-band-separated high-frequency detail components $\{d_1, d_2, \ldots, d_K\}$ and low-frequency approximation components. The high-frequency components correspond to short-term market noise, while the low-frequency components reflect medium- to long-term trends. Selecting appropriate wavelet basis functions and decomposition levels avoids the spectral leakage issues of traditional decomposition methods, ensuring that each component's physical meaning is clear and distinguishable [
23].
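A possible PyWavelets sketch of this adaptive residual decomposition is given below; the db4 basis and three decomposition levels are illustrative, whereas the paper selects the wavelet basis and level to avoid spectral leakage.

```python
import numpy as np
import pywt

def wpd_components(residuals, wavelet="db4", level=3):
    """Decompose the residual series into frequency-band components,
    each reconstructed back to (approximately) the original signal length."""
    residuals = np.asarray(residuals, dtype=float)
    wp = pywt.WaveletPacket(data=residuals, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    components = []
    for node in wp.get_level(level, order="freq"):        # low -> high frequency bands
        single = pywt.WaveletPacket(data=None, wavelet=wavelet,
                                    mode="symmetric", maxlevel=level)
        single[node.path] = node.data                      # keep only this band
        components.append(single.reconstruct(update=False)[:len(residuals)])
    return components                                      # components sum back to the residuals
```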
For each decomposed component, an independent ARIMA(p,d,q) model is established for specific modeling. Before modeling, each component undergoes a stationarity test and is converted into a stationary sequence through differencing. The parameters are then optimized using the Akaike Information Criterion (AIC) to capture the fluctuation patterns of different frequency bands accurately. The high-frequency detail component, which is dominated by noise, uses a low-order ARIMA model. In contrast, the low-frequency approximation component [
23], which has a strong trend, uses a high-order model. The predicted values of each component are linearly superimposed to form the initial residual correction term:
$$\hat{e}_{k+1} = \sum_{i=1}^{K} \hat{d}_{i,k+1} + \hat{a}_{k+1}$$
where $\hat{d}_{i,k+1}$ and $\hat{a}_{k+1}$ are the predicted values of the high-frequency detail components and the low-frequency approximation component at time $k+1$, respectively. To address the limitations of traditional fixed weights, a dynamic weight fusion (DWF) strategy is introduced to optimize the residual correction term. This strategy calculates dynamic weights from the historical volatility $\sigma_i$ of each component:
$$w_i = \frac{1/\sigma_i}{\sum_{m=1}^{j} 1/\sigma_m}$$
where $w_i$ represents the dynamic weight of the $i$-th high-frequency detail or low-frequency approximation component obtained from the wavelet packet decomposition, with a value range of $(0, 1)$, reflecting the contribution of this component to the residual correction; $\sigma_i$ is the historical volatility of the $i$-th component, calculated as the standard deviation of its historical data and measuring the intensity of its fluctuations (higher volatility indicates more short-term noise or abnormal fluctuations, while lower volatility suggests that the component is closer to a stable medium-to-long-term trend); $1/\sigma_i$ is the reciprocal of the volatility, which reduces the weights of high-frequency components with intense volatility and increases the weights of low-frequency components with mild volatility; $j$ denotes the total number of components participating in the weighted fusion, including all high-frequency detail components and the low-frequency approximation components; and the denominator $\sum_{m=1}^{j} 1/\sigma_m$ is the sum of the reciprocals of the volatilities of all components, used to normalize the weights so that $\sum_{i=1}^{j} w_i = 1$.
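As a small numerical check of the inverse-volatility weighting, the sketch below (NumPy, with illustrative values) shows that a noisy detail component receives a smaller weight than a smooth trend component and that the weights sum to one.

```python
import numpy as np

def dynamic_weights(components):
    """w_i = (1/sigma_i) / sum_m (1/sigma_m): calmer components get larger weights."""
    sigma = np.array([np.std(np.asarray(c)) for c in components])
    inv = 1.0 / sigma
    return inv / inv.sum()

# Toy check: a noisy detail band (sigma ~ 2.0) vs. a smooth trend band (sigma ~ 0.5)
w = dynamic_weights([np.random.normal(0, 2.0, 500), np.random.normal(0, 0.5, 500)])
print(w, w.sum())   # roughly [0.2, 0.8], summing to 1
```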
Based on the dynamic weight calculation, the correction strength is further adjusted through a global coefficient $\lambda$, and the final predicted value is a linear combination of the base model's prediction and the dynamically weighted correction term:
$$\hat{y}_{k+1} = \hat{y}^{\,\mathrm{base}}_{k+1} + \lambda \sum_{i=1}^{j} w_i\, \hat{r}_{i,k+1}$$
where $\hat{y}_{k+1}$ is the final carbon price prediction at time $k+1$, which integrates the base model prediction and the residual correction term; $\hat{y}^{\,\mathrm{base}}_{k+1}$ is the original prediction of the base model at time $k+1$, reflecting the model's capture of the main trends and characteristics of the carbon price series; $\lambda$ is the global coefficient, with a value range of $[0, 1]$, which is adaptively adjusted based on the historical residual volatility: when market volatility increases, $\lambda$ automatically increases to strengthen the residual correction term and compensate for the prediction error of the base model, and when the market stabilizes, $\lambda$ automatically decreases to avoid overcorrection leading to prediction bias; $\hat{r}_{i,k+1}$ is the prediction correction value of the $i$-th component (high-frequency detail or low-frequency approximation) at time $k+1$, generated by the corresponding ARIMA model, reflecting that component's contribution to the residual prediction; and $\sum_{i=1}^{j} w_i\,\hat{r}_{i,k+1}$ is the dynamically weighted residual correction term, obtained by summing the products of each component's predicted value and its dynamic weight, enabling differentiated corrections for fluctuations across different frequency bands. The residual correction architecture of WPD-ARIMA is shown in
Figure 3.
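Putting the residual correction steps described above together, the following sketch fits one ARIMA model per decomposed component, forms the dynamically weighted correction term, and adds it to the base prediction scaled by λ; the fixed (1, 0, 1) order and λ = 0.5 are placeholders, since the paper tunes (p, d, q) per component via stationarity tests and AIC and adapts λ to the residual volatility.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def corrected_forecast(base_pred_next, components, lam=0.5, order=(1, 0, 1)):
    """One-step residual correction: per-component ARIMA forecasts combined
    with inverse-volatility weights and scaled by the global coefficient lambda."""
    sigma = np.array([np.std(np.asarray(c)) for c in components])
    weights = (1.0 / sigma) / np.sum(1.0 / sigma)           # dynamic weights w_i
    forecasts = []
    for comp in components:
        fit = ARIMA(np.asarray(comp, dtype=float), order=order).fit()
        forecasts.append(float(fit.forecast(steps=1)[0]))   # component prediction r_i
    correction = float(np.dot(weights, forecasts))           # sum_i w_i * r_i
    return base_pred_next + lam * correction                 # y_hat = base + lambda * correction
```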
2.4. Training and Prediction Framework
In the WPD-GSLAN-BiLSTM-ARIMA hybrid model, the GSLAN-BiLSTM core undertakes the high-frequency nonlinear component prediction task. After the high-frequency detail components of carbon market prices are separated through wavelet packet decomposition (WPD), GSLAN-BiLSTM leverages its noise resistance and generalization capabilities to accurately capture abrupt fluctuation characteristics. The ARIMA model processes the low-frequency trend components, and the final predictions are obtained through component reconstruction. The prediction process is illustrated in
Figure 4.
The specific implementation steps are as follows:
Step 1: Data Preprocessing: Apply missing value imputation, outlier correction, and standardization to the original carbon market price series to construct a normalized dataset.
Step 2: Model Training: Initialize LSTM unit counts, attention heads, dropout rates, etc. Input the training dataset into the model and optimize network weights through local adversarial training and dynamic regularization strategies.
Step 3: Model Prediction: Feed the test dataset into the GSLAN-BiLSTM model, which outputs the initial carbon market price prediction sequence $\{\hat{y}^{\,\mathrm{base}}_t\}$.
Step 4: Extract the prediction residual sequence $\{e_t\}$ and decompose it adaptively with wavelet packet decomposition (WPD) into $K$ frequency-band-separated modal components $\{d_1, d_2, \ldots, d_K\}$ and a residual trend term $R$. Each modal component is trained and predicted by a dedicated ARIMA model (ARIMA$_1$, …, ARIMA$_j$), which outputs that component's carbon market price residual prediction $\hat{r}_{i,k+1}$. Superimposing these component predictions yields the total residual prediction $\hat{e}_{k+1}$ for the carbon market price at time point $k + 1$.
Step 5: Calculate dynamic weights $w_i$ based on the historical volatility of each modal component, and obtain the combined correction term through weighted fusion of the modal component forecasts and the trend term $R$. Combining the base prediction with this correction term, the dynamic coefficient $\lambda$ adjusts the correction intensity to produce the final carbon market price prediction $\hat{y}_{k+1}$ at time point $k + 1$.