All experiments in this study were conducted under the following hardware configuration: CPU (Intel Core i5-13400F, 2.5 GHz), RAM (64 GB), and GPU (RTX 3060, 12 GB). The deep learning models were implemented using PyTorch 1.10.1 within the PyCharm 2024.1.1 environment. The Adam optimizer was employed for model training.
To objectively evaluate the experimental results, this study adopts four evaluation metrics: the enhanced Root Mean Square Error (eRMSE), the enhanced Mean Absolute Error (eMAE), the enhanced Mean Absolute Percentage Error (eMAPE), and the coefficient of determination (R2). The mathematical formulations of these evaluation metrics are defined as follows:

$$\mathrm{eRMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}, \qquad \mathrm{eMAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|,$$

$$\mathrm{eMAPE}=\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|, \qquad R^2=1-\frac{\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2},$$

where $\hat{y}_i$ denotes the predicted carbon price, $y_i$ represents the actual carbon price, and $\bar{y}$ is the mean of the actual carbon prices.
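As a reference implementation, the four metrics can be computed as follows; this sketch assumes the standard RMSE/MAE/MAPE/R2 formulations given above for the "enhanced" variants:

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Compute eRMSE, eMAE, eMAPE (%), and R^2 for a forecast.

    Assumes the standard formulations defined above; y_true and
    y_pred are 1-D arrays of actual and predicted carbon prices.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    ermse = np.sqrt(np.mean(err ** 2))
    emae = np.mean(np.abs(err))
    emape = 100.0 * np.mean(np.abs(err / y_true))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return ermse, emae, emape, r2
```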
4.1. Model Parameter Settings
To optimize model performance, this study conducts a systematic experimental analysis of key parameters across different modules.
- (1) Model architecture and parameter configuration
The hidden layer dimension of the TSMixer module influences its capacity to capture both temporal and feature-level representations, which in turn affects the KAN module's ability to extract deep correlations from the feature matrix. The hidden layer size of the KAN module directly impacts the model's ability to extract high-order features. Additionally, the number of neurons in the BiGRU module must strike a balance between effectively capturing bidirectional long- and short-term dependencies and maintaining computational efficiency, thereby ensuring high prediction accuracy without overfitting or excessive resource consumption. Based on these considerations, multiple sets of comparative experiments were designed to test combinations of hidden layer dimensions in the TSMixer and KAN modules, as well as different neuron counts in the BiGRU module. The optimal configuration for each module was determined from the experimental results. With a learning rate of 0.001, a batch size of 16, and 200 training epochs, the evaluation metrics and prediction errors for the one-step carbon market price prediction experiment on the EUA dataset are presented in Table 4 and Figure 12.
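For illustration, the search over module dimensions can be organized as a simple grid; the candidate values and the `train_and_evaluate` helper below are hypothetical placeholders for the paper's actual pipeline:

```python
from itertools import product
import random

# Candidate dimensions are illustrative, not the paper's exact grid.
TSMIXER_DIMS = (8, 16, 32)
KAN_DIMS = (16, 32, 64)
BIGRU_UNITS = (16, 32, 64)

def train_and_evaluate(ts_dim, kan_dim, gru_units,
                       lr=1e-3, batch_size=16, epochs=200):
    """Hypothetical hook into the real pipeline: build the model with
    the given dimensions, train it, and return the test-set eRMSE.
    A random score stands in here so the sketch runs end to end."""
    return random.random()

best = min(product(TSMIXER_DIMS, KAN_DIMS, BIGRU_UNITS),
           key=lambda cfg: train_and_evaluate(*cfg))
print(best)  # Table 4 and Figure 12 identify (16, 32, 32) as optimal
```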
As shown in Table 4 and Figure 12, the proposed TKMixer-BiGRU-SA model achieves the lowest error evaluation metrics, and the prediction errors are closest to zero, when the hidden layer dimensions of the TSMixer and KAN modules are set to 16 and 32, respectively, and the number of neurons in the BiGRU module is set to 32. These results indicate that the model delivers the best prediction performance under this configuration, demonstrating the strongest overall feature extraction capability and the most appropriate parameter settings.
- (2) Model training hyperparameter configuration
Hyperparameters have a significant impact on the training performance and effectiveness of the model. Different combinations of hyperparameters can lead to notable differences in accuracy, convergence speed, and overall model behavior. By conducting comparative experiments, we can systematically and comprehensively evaluate the influence of various hyperparameter settings, visually compare the advantages and disadvantages of each combination, and accurately identify the configuration that yields optimal model performance on a specific task and dataset. The experimental results under the optimal model parameters with different training hyperparameter settings are shown in Table 5, and the training loss curves are illustrated in Figure 13.
The experimental results indicate that appropriately increasing the number of training epochs (epochs = 200) significantly improves performance. The combination of batch size = 16 and learning rate = 0.001 achieves the best trade-off between error and model fitting, yielding the lowest eRMSE (0.0648) and the highest R2 (0.9997), while maintaining good training efficiency (323 s). In contrast, an excessively large batch size or an overly small learning rate leads to performance degradation. Overall, the 200-16-0.001 configuration proves to be the optimal and most stable setting, demonstrating a favorable balance between training efficiency and prediction accuracy.
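A minimal, self-contained sketch of the 200-16-0.001 training configuration is shown below; the stand-in model and data are placeholders for the TKMixer-BiGRU-SA network and the EUA training windows:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data and model so the loop runs end to end; in the paper,
# the model is the TKMixer-BiGRU-SA network and (X, y) the EUA windows.
X, y = torch.randn(512, 30, 8), torch.randn(512, 1)
model = nn.Sequential(nn.Flatten(), nn.Linear(30 * 8, 1))

loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, lr = 0.001
criterion = nn.MSELoss()

loss_curve = []                              # plotted as in Figure 13
for epoch in range(200):                     # epochs = 200
    total = 0.0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        total += loss.item() * len(xb)
    loss_curve.append(total / len(loader.dataset))
```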
4.2. Comparison Experiment with Different Inputs
To verify the effectiveness of the selected feature variables and the combined input strategy based on decomposition methods, the following series of experiments was designed (a sketch of the resulting input assembly appears after the list):
B1: uses the original carbon price data as a single-branch model input.
B2: builds upon B1 by adding relevant feature variables filtered through Pearson correlation analysis, still using a single-branch input.
B3: extends B2 by incorporating the VMD-decomposed components of the original carbon price series, forming a dual-branch model input.
B4: builds upon B2 by introducing EWT-decomposed components of the original carbon price series, maintaining a dual-branch structure.
B5: builds upon B2 by introducing CEEMDAN-decomposed components of the original carbon price series, maintaining a dual-branch structure.
B0: combines the inputs from B2, B3, and B4 to form a full three-branch model input.
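The sketch below illustrates how the B0 three-branch input could be assembled, assuming the third-party vmdpy and ewtpy packages for the decompositions; the price series, feature matrix, and mode counts are illustrative stand-ins rather than the paper's data and settings:

```python
import numpy as np
from vmdpy import VMD   # assumed third-party VMD implementation
import ewtpy            # assumed third-party EWT implementation

rng = np.random.default_rng(0)
price = np.cumsum(rng.normal(size=1000))   # stand-in for the carbon price
features = rng.normal(size=(1000, 4))      # stand-in Pearson-screened variables

# Branch 1 (B2): raw price plus screened external features.
branch1 = np.column_stack([price, features])

# Branch 2 (B3): VMD components; alpha/tau/K/DC/init/tol follow the
# vmdpy signature, and K = 5 is illustrative.
u, _, _ = VMD(price, alpha=2000, tau=0, K=5, DC=0, init=1, tol=1e-7)
branch2 = u.T          # (time, K); vmdpy may trim an odd-length series

# Branch 3 (B4): EWT components; N = 5 is illustrative.
ewt, _, _ = ewtpy.EWT1D(price, N=5)        # (time, N)
branch3 = ewt

# B0 feeds the three matrices to the model as parallel branches,
# each sliced into sliding windows before training.
```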
Traditional model frameworks have primarily focused on single-step prediction, where deep learning models infer the next day’s carbon price based solely on historical closing prices. However, multi-step prediction offers a significant advantage by uncovering longer-term trends in price dynamics, providing broader and more strategic insights for market decision-making.
To assess the model’s performance in multi-step forecasting scenarios, this study employs carbon price data from the EUA and HBEA markets and conducts forward prediction experiments for two-step, three-step, and four-step horizons. Specifically, two-step forecasting aims to estimate the carbon prices for the two trading days following the end of the training set; three-step and four-step forecasts extend this prediction window accordingly.
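For concreteness, multi-step samples can be constructed with a sliding window as sketched below; the look-back length is a hypothetical choice:

```python
import numpy as np

def make_windows(series, lookback=30, horizon=2):
    """Slice a 1-D series into (input window, multi-step target) pairs.

    horizon=2 corresponds to the two-step experiment: each sample
    predicts the next two trading days from the preceding window;
    horizon=3 and horizon=4 extend the window accordingly.
    """
    X, Y = [], []
    for t in range(len(series) - lookback - horizon + 1):
        X.append(series[t : t + lookback])
        Y.append(series[t + lookback : t + lookback + horizon])
    return np.array(X), np.array(Y)

# Example: two-step targets from a hypothetical 30-day look-back window.
# X, Y = make_windows(price_series, lookback=30, horizon=2)
```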
All experiments were conducted using the proposed TKMixer-BiGRU-SA model and its core variants. The prediction errors on the test sets for both datasets are presented in Table 6, and the linear regression results between predicted and actual values are shown in Figure 14.
The analysis results indicate that, under 1–4-step-ahead forecasting scenarios, prediction accuracy generally declines as the forecasting horizon increases. However, compared with the other input configurations, the proposed input strategy in experiment B0 consistently yields the lowest error levels across all forecast lengths. Furthermore, analysis of the normal distribution of prediction errors across different prediction step sizes shows that B0 achieves the smallest mean and standard deviation of errors, indicating higher model stability and reliability. These findings provide strong evidence supporting the predictive capability of the proposed model.
For the EUA dataset, Experiment B0 outperforms all other configurations across the four evaluation metrics. In the one-step forecast, the model achieves an eRMSE of 0.0648, an eMAE of 0.0504, an eMAPE of 0.2081%, and an R2 of 0.9997, demonstrating high prediction accuracy and an excellent fit. In the four-step forecast, comparative experiments B1 through B5 demonstrate the effect of the various input configurations. Compared with B1, B2 achieves a 1.127% reduction in eMAPE and a 0.6459% improvement in R2, indicating that the inclusion of relevant variables enhances the model's sensitivity and accuracy by providing a more comprehensive representation of carbon price dynamics. Further, B3 and B4 show significant improvements over B2, with eMAPE reductions of 51.732% and 12.8966%, and R2 increases of 6.3309% and 1.7840%, respectively. These results indicate that the introduction of the VMD and EWT decomposition branches substantially enhances prediction performance: the VMD algorithm effectively decomposes nonlinear and non-stationary signals, while the EWT algorithm extracts amplitude- and frequency-modulated components, both of which help uncover the intrinsic patterns of carbon price fluctuations. Although B5 still performs well, its accuracy is slightly inferior to that of B3 and B4, suggesting that CEEMDAN, despite decomposing the original signal into a greater number of components, may introduce additional noise-like disturbances, leading to overfitting and reduced prediction precision. Compared with B3 and B4, B0 achieves eMAPE reductions of 33.3514% and 63.0667%, and R2 increases of 1.4629% and 5.9955%, respectively. These results underscore that integrating additional input data leads to a more comprehensive and accurate predictive model.
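For clarity, the relative changes quoted throughout this section are assumed to follow the usual percentage-change convention, as in this small helper:

```python
def pct_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - improved) / baseline

# Example reproducing a figure quoted in Section 4.4: the one-step eMAPE
# falls from 1.109% (ET-MVMD-LSTM) to 0.2081% (proposed model).
print(pct_reduction(1.109, 0.2081))  # ~81.2353
```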
For the HBEA dataset, similar trends are observed, further validating the superiority of the three-branch input structure. In the one-step prediction experiment, configuration B0 achieved the best performance across all four metrics—eRMSE, eMAE, eMAPE, and R2—registering values of 0.0884, 0.0664, 0.1389%, and 0.9968, respectively. The worst performance was observed in B1 and B2, differing from the results on the EUA dataset. This discrepancy may stem from the developmental stage of the Hubei carbon market, which is likely less mature and characterized by relatively simpler and more stable price dynamics. Furthermore, Hubei’s market data may be influenced by factors such as policies, quota allocations, and enterprise behavior. In this context, VMD and EWT decomposition techniques can effectively extract more critical features from the raw data, thereby improving prediction accuracy.
In summary, the proposed dual-modal decomposition tri-branch input model achieves higher forecasting accuracy and lower prediction error, effectively demonstrating the validity of the improved input strategy and providing a strong foundation for future forecasting research.
4.3. Ablation Experiment
To evaluate and understand the importance of each module within the deep learning model and its impact on overall performance, this study conducts ablation experiments. By observing how modifications to the model structure affect its performance and outputs, the goal is to identify the most optimized architecture, improve efficiency, and enhance the interpretability of the proposed TKMixer-BiGRU-SA model. The experimental configuration is as follows:
C1: TSMixer.
C2: TKMixer.
C3: BiGRU.
C4: TKMixer-BiGRU.
C0: TKMixer-BiGRU-SA (proposed model).
Configurations C1 and C3 represent baseline models using only the TSMixer or BiGRU modules, respectively, within the three-branch deep learning architecture. C2 embeds the KAN network into the temporal mapping layer of the TSMixer module. C4 combines the modules from C2 and C3 in a deep sequential structure within the same branch. C0, the complete model proposed in this study, extends C4 by integrating a Self-Attention (SA) module to validate the significance of attention mechanisms in optimizing model performance and improving prediction accuracy.
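A minimal PyTorch sketch of the C4-to-C0 step is given below, applying single-head scaled dot-product self-attention to the BiGRU outputs; the head structure, dimensions, and last-step pooling are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BiGRUWithSA(nn.Module):
    """BiGRU followed by scaled dot-product self-attention (sketch)."""

    def __init__(self, input_dim: int, hidden: int = 32):
        super().__init__()
        self.bigru = nn.GRU(input_dim, hidden, batch_first=True,
                            bidirectional=True)
        d = 2 * hidden                        # BiGRU output dimension
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.out = nn.Linear(d, 1)            # one-step price head

    def forward(self, x):                     # x: (batch, time, input_dim)
        h, _ = self.bigru(x)                  # (batch, time, 2*hidden)
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = q @ k.transpose(1, 2) / (k.size(-1) ** 0.5)
        attn = torch.softmax(scores, dim=-1)  # dynamic feature weighting
        ctx = attn @ v                        # (batch, time, 2*hidden)
        return self.out(ctx[:, -1])           # predict from last time step

# Example: model = BiGRUWithSA(input_dim=8)
#          y_hat = model(torch.randn(4, 30, 8))   # shape (4, 1)
```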
Table 7 presents the performance metrics of the model architectures under different ablation experiment configurations. As shown, the models in experiments C1 and C3, which adopt a single basic module, exhibit relatively low parameter counts and floating-point operations, resulting in faster training. This efficiency is primarily attributed to the simplicity of the module structures. However, these configurations demonstrate limited feature extraction capability, leading to lower prediction accuracy. In contrast, experiments C2 and C4 incorporate both the KAN and BiGRU modules, which significantly enhance the model’s ability to capture complex data features. This improvement, however, comes at the cost of increased model parameters and computational complexity, thus reducing training efficiency. Experiment C0 represents the full model proposed in this study, which integrates the strengths of multiple structural modules. Although this configuration leads to increased model complexity and longer training time, it achieves superior prediction accuracy compared to the other configurations. The increase in computational overhead remains within an acceptable range. Therefore, the moderate trade-off between accuracy and efficiency—achieved through a multi-module collaborative architecture—proves to be a rational and effective design choice.
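For reference, parameter counts of the kind reported in Table 7 can be read directly off a PyTorch model; a minimal sketch, where `model` stands in for any of the C0–C4 configurations:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total trainable parameters, as reported for each configuration."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```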
Taking the one-step prediction experiment on the EUA dataset as an example, the features extracted by each module from a single input are visualized using pseudo-color images, as shown in Figure 15. A comparative analysis of the visualized features from the three-branch structure reveals significant differences between the feature maps extracted from the carbon price subcomponent matrix, generated via the hybrid VMD + EWT decomposition, and those extracted from the external factor matrix. This demonstrates the effectiveness of the proposed scheme in capturing multi-scale features inherent in the carbon price data. Under optimal parameter settings, the features extracted by the different modules exhibit considerable variation, indicating that the sequential arrangement of submodules contributes to feature complementarity. This, in turn, provides valuable references for the Self-Attention (SA) module to effectively focus on key temporal features.
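Pseudo-color maps of this kind can be rendered with matplotlib; in this sketch the activation matrix is a random stand-in for features captured from a module output via a forward hook:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a (time, channel) activation matrix captured from one
# module's output via a forward hook; shapes are illustrative.
feat = np.random.rand(64, 16)

fig, ax = plt.subplots(figsize=(6, 3))
im = ax.imshow(feat.T, aspect="auto", cmap="viridis")  # pseudo-color map
ax.set_xlabel("Time step")
ax.set_ylabel("Feature channel")
fig.colorbar(im, ax=ax, label="Activation")
plt.show()
```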
All five configurations are trained using the tri-branch input framework. The prediction curves for the test sets of the EUA and HBEA datasets are shown in Figure 16 and Figure 17, respectively, and the corresponding prediction error metrics are summarized in Table 8.
EUA dataset: A comparative analysis between experiments C1 and C2 reveals that integrating the KAN module into the baseline TSMixer architecture significantly improves model performance. Compared to the original TSMixer model, the TKMixer configuration achieves substantial reductions in all error metrics. Specifically, R2 values increase to 0.9986, 0.9962, 0.9915, and 0.9906 across the 1–4 step prediction horizons, validating the effectiveness of the KAN module in extracting high-order features and enhancing feature representation.
Further comparisons between C4 and both C2 and C3 show that, in one-step forecasting, C4 reduces eRMSE by 34.7440% and 42.2007%, eMAE by 11.3200% and 47.6673%, and eMAPE by 31.4208% and 46.3130%, respectively. These results demonstrate that the TKMixer structure enables efficient feature mixing and transformation, strengthening temporal dependencies and expressiveness. When coupled with BiGRU’s bidirectional dependency modeling, the combined architecture accurately captures critical features in the input sequence, leading to significantly improved prediction performance.
C0, the full model incorporating the Self-Attention (SA) mechanism, further enhances prediction accuracy through dynamic weighting of feature importance. Under this configuration, eMAPE drops to 0.2081%, 0.5660%, 0.8293%, and 1.1063% for 1–4 step predictions, while R2 reaches 0.9997, 0.9978, 0.9957, and 0.9918, respectively.
As shown in Figure 16, the predicted values from the proposed model closely align with the ground truth, with minimal fluctuations, and lie near the center of the 95% confidence interval for all comparative predictions. The predicted mean curve closely follows that of the true values, while eRMSE, eMAE, and eMAPE all exhibit a clear inward contraction and R2 shows a pronounced outward expansion. These consistent trends confirm that the proposed model achieves the lowest prediction errors and the highest fit quality, demonstrating superior forecasting performance.
HBEA dataset: The experimental results on the HBEA dataset confirm the performance trends observed in the EUA dataset, further highlighting the robustness and superior predictive capabilities of the proposed model. As shown in Figure 17, during periods of sharp carbon price fluctuations, the proposed model closely fits the actual values. Compared to the TKMixer and BiGRU models, in the one-step forecast, the proposed model reduces eRMSE by 41.9567% and 39.9864%, eMAE by 51.7090% and 49.2355%, and eMAPE by 51.5352% and 48.8586%, while improving R2 by 0.6259% and 0.5650%, respectively.
In contrast to the EUA dataset, the HBEA dataset exhibits more pronounced volatility and uneven historical data distribution, which increases the difficulty of prediction. The TKMixer model tends to underestimate in low-price regions, resulting in larger errors, while the TKMixer-BiGRU model overestimates in high-price regions, also leading to increased errors. In comparison, the proposed model produces predictions that closely follow the actual curve, especially around the average value of the test set, with reduced fluctuations in the fitting line. Moreover, the inward contraction of error metrics such as eRMSE, eMAE, and eMAPE, alongside the outward increase in R2, reinforces that the proposed model yields the lowest prediction errors and the highest degree of fit. However, R2 values for the HBEA test set under 1–4 step forecasts—0.9968, 0.9926, 0.9755, and 0.9726—are slightly lower than those from the EUA dataset. This difference is attributed to the higher complexity and volatility of the HBEA data, as well as external influences such as China’s carbon reduction policies (e.g., mitigation actions, nationally determined contributions, and carbon neutrality goals), which were not explicitly modeled.
Figure 18 and Table 9 present the variance of prediction errors and the results of paired-sample t-tests across different model configurations on the EUA and HBEA datasets. Variance-based comparative analysis demonstrates that the proposed TKMixer-BiGRU-SA model consistently outperforms the others on both datasets, achieving the lowest and most stable error variances across one- to four-step forecasts, thus exhibiting a clear competitive advantage.
The paired-sample t-test results further confirm that integrating the KAN module into the TSMixer architecture (C1 vs. C2) leads to significant improvements in multi-step prediction performance, particularly enhancing stability on the HBEA dataset. This highlights the KAN’s effectiveness in capturing complex temporal dependencies. Moreover, introducing the BiGRU structure (C2 vs. C4) yields notable performance gains at all prediction horizons, validating the importance of bidirectional contextual modeling in sequence prediction.
The combination of TKMixer with BiGRU (C3 vs. C4) also consistently achieves statistically significant improvements, underscoring the synergistic effect between these components as a key driver of model performance enhancement. Building upon C4, the incorporation of the Self-Attention mechanism (C4 vs. C0)—as in the proposed final model—delivers additional performance gains at most time steps, with particularly pronounced improvements at t = 1 and t = 2. This demonstrates the model’s enhanced capacity to focus on critical temporal features. However, minor performance fluctuations observed at a few time steps suggest that the application of attention mechanisms should be carefully tailored to task-specific characteristics.
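The pairwise comparisons can be reproduced with a standard paired-sample t-test on per-sample absolute errors from the same test set; the error arrays below are synthetic stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-ins for per-sample absolute errors of two configurations
# evaluated on the same test set (here, C4 vs. C0).
errors_c4 = np.abs(rng.normal(0.10, 0.03, size=200))
errors_c0 = np.abs(rng.normal(0.07, 0.02, size=200))

# Paired-sample t-test: a significant positive statistic indicates
# that C0's errors are systematically smaller than C4's.
t_stat, p_value = stats.ttest_rel(errors_c4, errors_c0)
print(f"t = {t_stat:.3f}, p = {p_value:.4g}")
```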
Overall, these experimental findings validate the effectiveness and robustness of modular composition in improving prediction performance and highlight subtle yet statistically significant differences between competing model architectures.
4.4. Comparison Experiments with Existing Literature
To validate the superiority of the proposed forecasting scheme compared to various existing models and methods reported in current research, this study conducted a series of comparative experiments.
Using the EUA carbon price data from 1 January 2013 to 1 January 2021 as an example, the prediction error metrics of the method presented in [36] and those of the proposed model are compared. As shown in Figure 19, the proposed method demonstrates clear advantages in prediction accuracy across all evaluated metrics.
Figure 19 also shows that prediction difficulty increases with the forecasting horizon: eRMSE, eMAE, and eMAPE rise while R2 decreases, indicating a decline in predictive accuracy as the horizon lengthens. The ET-MVMD-LSTM hybrid forecasting model proposed in [36] integrates ET-based feature selection, MVMD decomposition, and an LSTM deep learning architecture. This approach effectively reduces the complexity of time series data while capturing inter-variable correlations, enabling it to model the dynamic behavior of carbon prices accurately. It performs well even in multi-step forecasting tasks, demonstrating more stable and reliable performance than single models such as ANN and RNN. Specifically, for one-step forecasting, it achieves an eRMSE of 0.376, an eMAE of 0.296, an eMAPE of 1.109%, and an R2 of 0.996.
In contrast, the TKMixer-BiGRU-SA model proposed in this study maintains high prediction fidelity across 1- to 4-step-ahead forecasts. Notably, in the one-step forecast, the model achieves an eMAPE of just 0.2081% and an R2 of 0.9997. Compared with the ET-MVMD-LSTM model from [36], the eMAPE of our model is reduced by 81.2353%, 52.1151%, 33.7090%, and 21.5390% for 1–4 step predictions, respectively. These results strongly support the effectiveness and robustness of our model in handling complex and challenging prediction tasks.
Regarding the HBEA dataset, Figure 20 presents a comparison of one-step forecasting error metrics between our proposed model and those from [37,38,39]. The TKMixer-BiGRU-SA model consistently achieves the lowest eRMSE, eMAE, and eMAPE, and the highest R2 among all methods. Specifically, compared to the traditional ARIMA model from [37], the proposed model reduces eRMSE, eMAE, and eMAPE by 82.4777%, 82.3591%, and 88.9057%, respectively, highlighting the limited predictive power of single models in capturing complex data features.
By integrating suitable decomposition strategies and leveraging the complementary strengths of hybrid deep learning models, our approach effectively utilizes the multidimensional features of carbon price data and its external factors, leading to a significant improvement in predictive performance. Compared with the VMD-AWLSSVR-PSOLS-WSM model [37], the HI-TVFEMD-transformer model [38], and the Informer-DABOHBTVFEMD-CL model [39], our method achieves reductions in eRMSE, eMAE, and eMAPE of 50.9433%, 79.4705%, and 58.7879%; 53.5989%, 56.9390%, and 51.3909%; and 70.3142%, 66.1220%, and 52.1034%, respectively. Additionally, the R2 reaches as high as 0.9968. The potential overfitting in previous models may stem from overlapping functionalities and excessive parameter complexity in their hybrid architectures, despite employing signal decomposition techniques. In contrast, our model effectively balances data decomposition and feature extraction, maximizing the performance of each module and validating the soundness of our methodological design.