Article

CQEformer: A Causal and Query-Enhanced Transformer Variant for Time-Series Forecasting

School of Mathematics, Physics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(23), 3750; https://doi.org/10.3390/math13233750
Submission received: 22 October 2025 / Revised: 9 November 2025 / Accepted: 20 November 2025 / Published: 22 November 2025

Abstract

Structural breaks and volatility clustering are fundamental challenges in time series analysis. We propose CQEformer, an encoder-only Transformer variant for time-series modeling that addresses these challenges via two complementary innovations. First, the causal residual embedding (CRE) ensures temporal causality and improves local adaptivity to abrupt structural changes. Second, the query-enhanced multi-head self-attention (QEAttention) incorporates multi-order moment statistics and entropy to guide attention dynamically toward high-volatility regions while preserving global dependence structures. For parameter optimization, we derive analytical gradients for all components and update them using the Adam stochastic optimization algorithm. Empirical evaluations on financial time series datasets and the public Traffic dataset show that CQEformer consistently outperforms established baselines, including LSTM, GRU, TCN, and the standard Transformer. Time-window sensitivity analyses demonstrate the robustness of the framework, while ablation studies further confirm that the proposed modules are complementary and contribute to improved forecasting performance across different volatility regimes.

1. Introduction

Time series arising in complex systems such as financial markets exhibit distinctive statistical properties, including volatility clustering, heteroskedasticity, periodicity, and nonstationarity [1]. These characteristics increase the difficulty of accurate modeling and forecasting, which is critical in practical scenarios such as high-frequency trading, quantitative stock selection, and asset allocation.
Traditional approaches, including Autoregressive Integrated Moving Average [2], Vector Autoregression [3], and Generalized Autoregressive Conditional Heteroskedasticity [4], offer interpretability and theoretical completeness and can model linear correlations and conditional heteroskedasticity to some extent. However, they fail to adequately model nonlinear and high-dimensional dependencies in financial markets and lack robustness due to sensitivity to specification, distributional assumptions, and regime shifts [5,6,7].
Deep learning methods have demonstrated strong capabilities in capturing nonlinear patterns and high-dimensional representations. Recurrent architectures such as Recurrent Neural Network (RNN), Long Short-Term Memory network (LSTM), Gated Recurrent Unit (GRU), and their variants have been applied to stock price forecasting. For instance, Nguyen et al. [8] proposed the SR-SV model, combining RNN with stochastic volatility models (SV), achieving strong out-of-sample predictive performance and interpretability. Sang and Li [9] enhanced LSTM structures with attention mechanisms in AMV-LSTM, improving stability and predictive accuracy, while Ming-Che Lee integrated GRU with attention to predict significant price fluctuations [10]. Nevertheless, these methods face limitations in modeling long-term dependencies and enabling efficient parallel training.
The introduction of the Transformer architecture has provided a promising alternative [11]. Its self-attention mechanism excels at capturing long-range dependencies and has inspired variants such as Informer [12], FEDformer [13], PatchTST [14], TimesNet [15], iTransformer [16], Pathformer [17], Twinsformer [18], and Timer-XL [19], demonstrating strong performance in general-purpose time series forecasting. Yet these models do not fully account for financial data’s unique properties, such as local anomalies, tail risks, and volatility clustering.
To address these domain-specific challenges, some works have customized Transformer architectures for financial applications. Ding et al. [20] proposed a Transformer enhanced with multi-scale Gaussian priors, orthogonal regularization, and a trading-gap segmenter, achieving strong results on NASDAQ and Chinese A-share markets. Zhang et al. [21] combined refined small-sample feature engineering with multiple attention mechanisms, attaining high predictive accuracy and improved realized returns. Nonetheless, domain-specific Transformer variants remain limited and lack systematic development, leaving complex statistical characteristics and risk features underexploited.
To confront the key challenges of financial time series prediction—dynamic volatility, structural breaks, and nonstationarity—we propose CQEformer, an Encoder-Only Transformer variant that integrates Causal Convolution Residual Embedding with a Statistical-Prior-Enhanced Query Mechanism. This architecture is specifically designed to capture both local temporal patterns and global structural dynamics in complex, high-noise financial data, with two core innovations detailed as follows:
  • Causal convolutional residual embedding in the input layer, emphasizing spatiotemporal locality and enforcing temporal causality. By focusing on residual features between consecutive time steps, the model improves sensitivity to sudden market shocks, structural breaks, and local anomalies, thereby enhancing short-term predictive accuracy and stability.
  • Query-Enhanced multi-head self-attention mechanism, dynamically adjusting the Query matrix using higher-order statistical moments and entropy in time and frequency domains. This guides attention to regions exhibiting volatility clustering and sudden structural shifts, improving modeling of tail risks and nonlinear market behaviors.
Unlike recent frequency-aware or decomposition-based Transformers, CQEformer explicitly combines CRE for local temporal patterns with QEAttention, which injects multi-order time– and frequency–domain priors into attention logits. This design enables more effective amplification of tail and high-volatility signals and facilitates capturing both local and global dynamics, complementing existing approaches.
In Section 4, we provide comprehensive evaluations, including multi-model comparison across three real-world datasets and further analyses—time-window sensitivity and ablation studies—conducted on the CSI 300 dataset. The results show that CQEformer consistently outperforms baseline models, effectively captures complex temporal dynamics, and maintains robustness across varying market conditions.

2. Background

2.1. Problem Setup

Consider the problem of time series forecasting. Given a multivariate time series $\mathbf{X} \in \mathbb{R}^{L \times D}$, where $L$ denotes the number of time steps and $D$ the feature dimension, and a target label $\mathbf{y} \in \mathbb{R}^{L}$, the objective is to learn a mapping $f: \mathbb{R}^{L \times D} \to \mathbb{R}^{L}$ that predicts the future values $\hat{\mathbf{y}} \in \mathbb{R}^{L}$ over the next $L$ steps:
$\hat{\mathbf{y}} = f(\mathbf{X}). \qquad (1)$

2.2. Encoder-Only Transformer

An encoder-only Transformer typically consists of four main components: an input embedding layer that maps raw features to a high-dimensional space, positional encoding to inject sequence order information, a stack of encoder layers—each containing multi-head self-attention (MSA) and feed-forward network (FFN) with residual connections (Res) and layer normalization (LayerNorm)—to extract contextual representations, and an output layer to produce the final predictions. The architecture is shown in Figure 1.
  • Input Embedding and Positional Encoding
For a standardized multivariate time series $\mathbf{X} \in \mathbb{R}^{L \times D}$, the input is first projected into a high-dimensional space $\mathbb{R}^{L \times H}$, where $H$ is the embedding dimension. The standard Transformer applies a fully connected layer to obtain $\mathbf{X}_e$:
$\mathbf{X}_e = \mathbf{X} \cdot \mathbf{W}_e + \mathbf{1}_L \cdot \mathbf{b}_e^{T}, \qquad (2)$
where $\mathbf{W}_e \in \mathbb{R}^{D \times H}$ and $\mathbf{b}_e \in \mathbb{R}^{H}$ are learnable, and $\mathbf{1}_L$ denotes an $L$-dimensional all-ones vector. Because the Transformer is permutation-equivariant and cannot capture order by itself [22], a positional encoding $\mathbf{PE} \in \mathbb{R}^{L \times H}$ is introduced, with elements
$\mathbf{PE}_{ij} = \begin{cases} \sin\left( (i-1) / 10000^{(j-1)/H} \right), & j = 2k-1, \\ \cos\left( (i-1) / 10000^{(j-1)/H} \right), & j = 2k, \end{cases} \qquad (3)$
where $i = 1, 2, \ldots, L$ and $k = 1, 2, \ldots, H/2$. $\mathbf{PE}$ is then added to $\mathbf{X}_e$ to obtain $\mathbf{X}_{in}$:
$\mathbf{X}_{in} = \mathbf{X}_e + \mathbf{PE}. \qquad (4)$
  • Multi-Head Self-Attention
$\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention, calculated as follows. With $N$ heads ($H$ divisible by $N$), let $S = H/N$ denote the per-head dimension. The layer input $\mathbf{X}_{in}$ is mapped into $N$ sets of query, key, and value matrices $(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)$, $i = 1, 2, \ldots, N$:
$\mathbf{Q}_i = \mathbf{X}_{in} \cdot \mathbf{W}_{Q_i} + \mathbf{1}_L \cdot \mathbf{b}_{Q_i}^{T}, \quad \mathbf{K}_i = \mathbf{X}_{in} \cdot \mathbf{W}_{K_i} + \mathbf{1}_L \cdot \mathbf{b}_{K_i}^{T}, \quad \mathbf{V}_i = \mathbf{X}_{in} \cdot \mathbf{W}_{V_i} + \mathbf{1}_L \cdot \mathbf{b}_{V_i}^{T}, \qquad (5)$
where $\mathbf{W}_{Q_i}, \mathbf{W}_{K_i}, \mathbf{W}_{V_i} \in \mathbb{R}^{H \times S}$ and $\mathbf{b}_{Q_i}, \mathbf{b}_{K_i}, \mathbf{b}_{V_i} \in \mathbb{R}^{S}$. The attention for one head is computed as
$\mathrm{Head}_i = \mathrm{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) := \mathrm{Softmax}\left( \frac{\mathbf{Q}_i \cdot \mathbf{K}_i^{T}}{\sqrt{S}} \right) \cdot \mathbf{V}_i, \qquad (6)$
where Softmax is applied row-wise, so that the elements of each row sum to 1. Let $\mathrm{Head} = \mathrm{Concat}(\mathrm{Head}_1, \mathrm{Head}_2, \ldots, \mathrm{Head}_N) \in \mathbb{R}^{L \times H}$; the multi-head self-attention output is
$\mathbf{X}_{msa} = \mathrm{Head} \cdot \mathbf{W}_o + \mathbf{1}_L \cdot \mathbf{b}_o^{T}, \qquad (7)$
where $\mathbf{W}_o \in \mathbb{R}^{H \times H}$ and $\mathbf{b}_o \in \mathbb{R}^{H}$.
  • First Res & LayerNorm
After multi-head self-attention, a residual connection and layer normalization are applied. Let $\boldsymbol{\mu}_{ln}$ and $\boldsymbol{\sigma}_{ln}$ denote the row-wise mean and standard deviation; the normalized output is $\mathbf{X}_{ln}$:
$\mathbf{X}_{re} = \mathrm{Res}(\mathbf{X}_{in}, \mathrm{MSA}) := \mathbf{X}_{in} + \mathrm{MSA}(\mathbf{X}_{in}), \qquad (8)$
$\mathbf{X}_{ln} = \mathrm{LayerNorm}(\mathbf{X}_{re}) := \left( \mathbf{1}_L \cdot \boldsymbol{\gamma}_{ln}^{T} \right) \circ \left[ \left( \mathbf{X}_{re} - \boldsymbol{\mu}_{ln} \cdot \mathbf{1}_H^{T} \right) \oslash \left( \boldsymbol{\sigma}_{ln} \cdot \mathbf{1}_H^{T} \right) \right] + \mathbf{1}_L \cdot \boldsymbol{\beta}_{ln}^{T}, \qquad (9)$
where $\mathrm{MSA}(\mathbf{X}_{in}) = \mathbf{X}_{msa}$, $\boldsymbol{\gamma}_{ln}, \boldsymbol{\beta}_{ln} \in \mathbb{R}^{H}$ are learnable, and $\circ$ and $\oslash$ denote element-wise multiplication and division.
  • Feed-Forward Network
Let $D_f$ denote the hidden dimension of this network; the computation is
$\mathbf{X}_f = \mathrm{FFN}(\mathbf{X}_{ln}) := \mathrm{ReLU}\left( \mathbf{X}_{ln} \cdot \mathbf{W}_{f1} + \mathbf{1}_L \cdot \mathbf{b}_{f1}^{T} \right) \cdot \mathbf{W}_{f2} + \mathbf{1}_L \cdot \mathbf{b}_{f2}^{T}, \qquad (10)$
where $\mathbf{W}_{f1} \in \mathbb{R}^{H \times D_f}$, $\mathbf{W}_{f2} \in \mathbb{R}^{D_f \times H}$, $\mathbf{b}_{f1} \in \mathbb{R}^{D_f}$, and $\mathbf{b}_{f2} \in \mathbb{R}^{H}$.
  • Second Res & LayerNorm
The residual connection and layer normalization are applied again to $\mathbf{X}_{ln}$ and $\mathbf{X}_f$, as in Equations (8) and (9), yielding $\mathbf{X}_{en}$:
$\mathbf{X}_{en} = \mathrm{LayerNorm}(\mathbf{X}_{ln} + \mathbf{X}_f), \qquad (11)$
where this LayerNorm contains two learnable parameters $\boldsymbol{\gamma}_{en}, \boldsymbol{\beta}_{en} \in \mathbb{R}^{H}$.
  • Output Layer
$\hat{\mathbf{y}} = \mathbf{W}_y \cdot \mathbf{x}_L + \mathbf{b}_y, \qquad (12)$
where $\mathbf{x}_L$ denotes the transpose of the $L$-th row vector of $\mathbf{X}_{en}$, $\mathbf{W}_y \in \mathbb{R}^{L \times H}$, and $\mathbf{b}_y \in \mathbb{R}^{L}$.
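For concreteness, the following is a minimal PyTorch sketch of the encoder-only baseline described above. It is an illustrative re-implementation rather than the authors' code: nn.MultiheadAttention and nn.LayerNorm stand in for Equations (5)-(9), and the hyperparameter values (H = 256, N = 8, D_f = 512) simply mirror the experimental setup reported in Section 4.2.

```python
# Minimal encoder-only Transformer sketch (one encoder layer), following Eqs. (2)-(12).
import torch
import torch.nn as nn

class EncoderOnlyTransformer(nn.Module):
    def __init__(self, L, D, H=256, N=8, D_f=512):
        super().__init__()
        self.embed = nn.Linear(D, H)                      # Eq. (2): X_e = X W_e + 1_L b_e^T
        pe = torch.zeros(L, H)                            # Eq. (3): sinusoidal positional encoding
        pos = torch.arange(L).unsqueeze(1).float()
        div = torch.pow(10000.0, torch.arange(0, H, 2).float() / H)
        pe[:, 0::2] = torch.sin(pos / div)
        pe[:, 1::2] = torch.cos(pos / div)
        self.register_buffer("pe", pe)
        self.msa = nn.MultiheadAttention(H, N, batch_first=True)   # Eqs. (5)-(7)
        self.ln1 = nn.LayerNorm(H)                        # Eq. (9)
        self.ffn = nn.Sequential(nn.Linear(H, D_f), nn.ReLU(), nn.Linear(D_f, H))  # Eq. (10)
        self.ln2 = nn.LayerNorm(H)                        # Eq. (11)
        self.out = nn.Linear(H, L)                        # Eq. (12): read out from the last time step

    def forward(self, x):                                 # x: (batch, L, D)
        x_in = self.embed(x) + self.pe                    # Eq. (4)
        attn, _ = self.msa(x_in, x_in, x_in)
        x_ln = self.ln1(x_in + attn)                      # Eqs. (8)-(9)
        x_en = self.ln2(x_ln + self.ffn(x_ln))            # Eq. (11)
        return self.out(x_en[:, -1, :])                   # y_hat from the last row x_L

# Example: a batch of 32 windows with L = 60 steps and D = 27 features
model = EncoderOnlyTransformer(L=60, D=27)
y_hat = model(torch.randn(32, 60, 27))                    # shape (32, 60)
```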

3. CQEformer (Causal Query-Enhanced Transformer)

Time series with rapidly changing volatility and structural shifts pose challenges for standard Transformers. To address this, we propose CQEformer, which integrates a causal residual embedding (CRE) module with a query-enhanced multi-head self-attention (QEAttention) mechanism, a novel attention variant that injects multi-order time- and frequency-domain priors into the attention logits to focus on high-volatility regimes. CQEformer's architecture is shown in Figure 2.

3.1. Causal Residual Embedding

We introduce a causal convolutional residual embedding in the input layer, which leverages residual connections with causal convolutions. This design preserves temporal causality and local spatio-temporal structure, enabling the model to capture local fluctuations and respond to sudden shocks. By combining convolutional feature extraction with residual pathways, the embedding remains robust to abrupt changes while retaining the original sequence information.
Causal convolution was introduced in WaveNet to ensure temporal causality in audio modeling [23]. For a standardized multivariate time series $\mathbf{X}$, we define $\mathbf{C} = \mathrm{CausalConv}(\mathbf{X})$, where $\mathrm{CausalConv}$ denotes a causal convolution whose elements $\mathbf{C}_{ij}$ are computed as
$\mathbf{C}_{ij} = \sum_{d=1}^{D} \sum_{k=1}^{K} \mathbb{I}(i-k+1 > 0) \cdot \mathbf{X}_{i',d} \cdot \mathbf{W}_{c}^{(k,d,j)} + \mathbf{b}_{c}^{(j)}, \qquad (13)$
where $K \in \mathbb{N}^{+}$ is the size of the causal convolution kernel, $\mathbb{I}$ is the indicator function, $i' = \max(0, i-k+1)$, $\mathbf{W}_c \in \mathbb{R}^{K \times D \times H}$, and $\mathbf{b}_c \in \mathbb{R}^{H}$. In our experiments, we adopt a small kernel size $K = 3$ to capture local abrupt changes while maintaining a certain temporal context; this yields the best performance among the tested values $K \in \{3, 5, 7, 9, 11\}$.
Residual connections, an efficient module first proposed in ResNet [24], have since been widely adopted. Based on these two structures, we propose the causal convolutional residual embedding, defined as follows:
$\mathbf{X}_e = \left( \mathbf{X} + \mathrm{CausalConv}(\mathbf{X}) \right) \cdot \mathbf{W}_e + \mathbf{1}_L \cdot \mathbf{b}_e^{T}, \qquad (14)$
where $\mathbf{W}_e \in \mathbb{R}^{D \times H}$ and $\mathbf{b}_e \in \mathbb{R}^{H}$.
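A minimal PyTorch sketch of CRE is given below. It assumes, purely for shape consistency, that the causal convolution preserves the feature dimension D so that the residual sum X + CausalConv(X) in Equation (14) is well defined; left-padding by K - 1 enforces the causality of Equation (13).

```python
# Causal residual embedding (CRE) sketch: causal convolution + residual + linear projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalResidualEmbedding(nn.Module):
    def __init__(self, D, H, kernel_size=3):               # K = 3 performed best among the tested sizes
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(D, D, kernel_size)            # causal convolution kernel W_c
        self.proj = nn.Linear(D, H)                         # W_e, b_e of Eq. (14)

    def forward(self, x):                                   # x: (batch, L, D)
        # Left-pad the time axis by K - 1 so step i only sees steps <= i (temporal causality).
        xt = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))
        c = self.conv(xt).transpose(1, 2)                   # CausalConv(X): (batch, L, D)
        return self.proj(x + c)                             # (X + CausalConv(X)) W_e + 1_L b_e^T

# Example: embed 60-step windows with 27 features into H = 256 dimensions
emb = CausalResidualEmbedding(D=27, H=256)
x_e = emb(torch.randn(32, 60, 27))                          # shape (32, 60, 256)
```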

3.2. Query-Enhanced Multi-Head Self-Attention Mechanism

Traditional Transformer models struggle to capture volatility that changes rapidly across the sequence. To address this, we propose QEAttention, a query-enhanced multi-head self-attention mechanism that projects multi-order statistical moments and entropy from the time and frequency domains into head space and uses them as scaling factors to dynamically reshape the attention logits, effectively amplifying and propagating volatility signals across the full attention map and improving the model's ability to capture risk and extreme events.
Given the input matrix $\mathbf{X}_{in}$, we first compute statistical prior information, including multi-order moments and entropy features in both the time and frequency domains. These features are then used by the QEAttention mechanism to obtain the attention scores.
The time-domain statistics comprise the first-order raw moment $\boldsymbol{\mu}_t$, the second-order central moment $\boldsymbol{\sigma}_t$, the normalized third-order central moment $\boldsymbol{\tau}_t$, and the entropy $\boldsymbol{\epsilon}_t$, which is computed from the probability matrix $\mathbf{P}_t$ obtained via a feature-wise Softmax:
$\mathbf{P}_t = \mathrm{Softmax}(\mathbf{X}_{in}). \qquad (15)$
These statistics are calculated as follows:
$\boldsymbol{\mu}_t = \frac{1}{H} \mathbf{X}_{in} \cdot \mathbf{1}_H, \qquad (16)$
$\boldsymbol{\sigma}_t = \left[ \frac{1}{H} \left( \mathbf{X}_{in} - \boldsymbol{\mu}_t \cdot \mathbf{1}_H^{T} \right)^{\circ 2} \cdot \mathbf{1}_H \right]^{\circ 0.5}, \qquad (17)$
$\boldsymbol{\tau}_t = \frac{1}{H} \left[ \left( \mathbf{X}_{in} - \boldsymbol{\mu}_t \cdot \mathbf{1}_H^{T} \right) \oslash \left( \boldsymbol{\sigma}_t \cdot \mathbf{1}_H^{T} \right) \right]^{\circ 3} \cdot \mathbf{1}_H, \qquad (18)$
$\boldsymbol{\epsilon}_t = - \left( \mathbf{P}_t \circ \log \mathbf{P}_t \right) \cdot \mathbf{1}_H, \qquad (19)$
where $(\cdot)^{\circ i}$ denotes the element-wise $i$-th power.
The frequency-domain statistics—namely the first-order raw moment $\boldsymbol{\mu}_s$, the second-order central moment $\boldsymbol{\sigma}_s$, the normalized third-order central moment $\boldsymbol{\tau}_s$, and the entropy $\boldsymbol{\epsilon}_s$—are computed from the power spectral density (PSD) of the input, which is obtained by applying the Fast Fourier Transform (FFT) along the temporal dimension and taking the squared magnitude of the resulting complex coefficients. Specifically, for $\mathbf{X}_{in} \in \mathbb{R}^{L \times H}$, we compute
$\tilde{\mathbf{X}}_{in} = \mathrm{FFT}(\mathbf{X}_{in}) := \mathbf{\Omega} \cdot \mathbf{X}_{in}, \quad \mathbf{\Omega}_{kj} = \exp\left( -\frac{2\pi \mathrm{i} \cdot (k-1)(j-1)}{L} \right), \qquad (20)$
where $k = 1, \ldots, L$ indexes the frequency components and $j = 1, \ldots, L$ indexes the time steps. The PSD is then computed as
$\mathbf{X}_{ps} = \left| \tilde{\mathbf{X}}_{in} \right|^{\circ 2}, \qquad (21)$
where $|\cdot|$ denotes the element-wise modulus.
From $\mathbf{X}_{ps}$, we derive the frequency-domain statistics $\boldsymbol{\mu}_s, \boldsymbol{\sigma}_s, \boldsymbol{\tau}_s, \boldsymbol{\epsilon}_s$. The entropy $\boldsymbol{\epsilon}_s$ is computed from the probability matrix $\mathbf{P}_s$, obtained via row-wise L1 normalization to preserve the energy distribution:
$\mathbf{P}_s = \mathbf{X}_{ps} \oslash \left( \| \mathbf{X}_{ps} \|_{1,\mathrm{row}} \cdot \mathbf{1}_H^{T} \right), \qquad (22)$
where $\| \mathbf{X}_{ps} \|_{1,\mathrm{row}}$ denotes the vector of row-wise L1 norms of $\mathbf{X}_{ps}$.
The frequency-domain statistics of $\mathbf{X}_{in}$ are then calculated as follows:
$\boldsymbol{\mu}_s = \frac{1}{H} \mathbf{X}_{ps} \cdot \mathbf{1}_H, \qquad (23)$
$\boldsymbol{\sigma}_s = \left[ \frac{1}{H} \left( \mathbf{X}_{ps} - \boldsymbol{\mu}_s \cdot \mathbf{1}_H^{T} \right)^{\circ 2} \cdot \mathbf{1}_H \right]^{\circ 0.5}, \qquad (24)$
$\boldsymbol{\tau}_s = \frac{1}{H} \left[ \left( \mathbf{X}_{ps} - \boldsymbol{\mu}_s \cdot \mathbf{1}_H^{T} \right) \oslash \left( \boldsymbol{\sigma}_s \cdot \mathbf{1}_H^{T} \right) \right]^{\circ 3} \cdot \mathbf{1}_H, \qquad (25)$
$\boldsymbol{\epsilon}_s = - \left( \mathbf{P}_s \circ \log \mathbf{P}_s \right) \cdot \mathbf{1}_H. \qquad (26)$
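The eight statistics can be computed directly from the encoder input; the sketch below shows one way to do so in PyTorch. The small eps terms are added only for numerical stability and are not part of Equations (15)-(26); the tensor layout (batch, L, H) is an assumption for illustration.

```python
# Statistical priors for QEAttention: per-time-step moments and entropy in the
# time domain (Eqs. 15-19) and frequency domain (Eqs. 20-26).
import torch

def statistical_priors(x_in, eps=1e-8):                    # x_in: (batch, L, H)
    mu_t = x_in.mean(dim=-1, keepdim=True)
    sigma_t = ((x_in - mu_t) ** 2).mean(dim=-1, keepdim=True).sqrt()
    tau_t = (((x_in - mu_t) / (sigma_t + eps)) ** 3).mean(dim=-1, keepdim=True)
    p_t = torch.softmax(x_in, dim=-1)
    ent_t = -(p_t * torch.log(p_t + eps)).sum(dim=-1, keepdim=True)

    x_ps = torch.fft.fft(x_in, dim=1).abs() ** 2            # FFT along time, then PSD
    mu_s = x_ps.mean(dim=-1, keepdim=True)
    sigma_s = ((x_ps - mu_s) ** 2).mean(dim=-1, keepdim=True).sqrt()
    tau_s = (((x_ps - mu_s) / (sigma_s + eps)) ** 3).mean(dim=-1, keepdim=True)
    p_s = x_ps / (x_ps.sum(dim=-1, keepdim=True) + eps)     # row-wise L1 normalization
    ent_s = -(p_s * torch.log(p_s + eps)).sum(dim=-1, keepdim=True)

    # Concatenate the eight statistics along the feature axis: (batch, L, 8), cf. Eq. (27)
    return torch.cat([mu_t, sigma_t, tau_t, ent_t, mu_s, sigma_s, tau_s, ent_s], dim=-1)

priors = statistical_priors(torch.randn(32, 60, 256))       # shape (32, 60, 8)
```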
Next, we concatenate the statistics along the feature dimension and apply LayerNorm to obtain the statistical prior matrix $\mathbf{\Pi}$:
$\mathbf{\Pi} = \mathrm{LayerNorm}\left( \mathrm{Concat}\left( \boldsymbol{\mu}_t, \boldsymbol{\sigma}_t, \boldsymbol{\tau}_t, \boldsymbol{\epsilon}_t, \boldsymbol{\mu}_s, \boldsymbol{\sigma}_s, \boldsymbol{\tau}_s, \boldsymbol{\epsilon}_s \right) \right), \qquad (27)$
where this LayerNorm contains two learnable parameters $\boldsymbol{\gamma}_{\Pi}, \boldsymbol{\beta}_{\Pi} \in \mathbb{R}^{8}$. Then, $\mathbf{\Pi}$ is projected into head space and passed through Tanh to obtain the statistical prior weight matrix $\mathbf{\Psi}$:
$\mathbf{\Psi} = \mathrm{Tanh}\left( \mathbf{\Pi} \cdot \mathbf{W}_{\Psi} + \mathbf{1}_L \cdot \mathbf{b}_{\Psi}^{T} \right), \qquad (28)$
where $\mathbf{W}_{\Psi} \in \mathbb{R}^{8 \times N}$ and $\mathbf{b}_{\Psi} \in \mathbb{R}^{N}$. $\mathbf{X}_{in}$ is projected into $N$ sets of query, key, and value matrices $(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i)$ as defined in Equation (5), with learnable $\mathbf{W}_{Q_i}, \mathbf{W}_{K_i}, \mathbf{W}_{V_i}, \mathbf{b}_{Q_i}, \mathbf{b}_{K_i}, \mathbf{b}_{V_i}$. The QEAttention for head $i$ produces $\mathrm{Head}_i \in \mathbb{R}^{L \times S}$ as follows:
$\mathrm{Head}_i = \mathrm{QEAttention}(\mathbf{\Psi}_i, \mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) := \mathrm{Softmax}\left( \frac{\left[ \left( \mathbf{\Psi}_i \cdot \mathbf{1}_S^{T} \right) \circ \mathbf{Q}_i \right] \cdot \mathbf{K}_i^{T}}{\sqrt{S}} \right) \cdot \mathbf{V}_i, \qquad (29)$
where $\mathbf{\Psi}_i$ denotes the $i$-th column vector of $\mathbf{\Psi}$.
Finally, let $\mathrm{Head} = \mathrm{Concat}(\mathrm{Head}_1, \mathrm{Head}_2, \ldots, \mathrm{Head}_N)$; the query-enhanced multi-head self-attention is completed through a linear transformation (with learnable $\mathbf{W}_o \in \mathbb{R}^{H \times H}$ and $\mathbf{b}_o \in \mathbb{R}^{H}$) in the same way as Equation (7), yielding $\mathbf{X}_{msa}$.
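The sketch below illustrates a single QEAttention head under the same assumptions as the previous snippet; Psi would come from Equations (27) and (28), and here a tensor of the right shape simply stands in for its i-th column.

```python
# One QEAttention head (Eq. (29)): the statistical prior column Psi_i rescales the
# rows of Q_i before the usual scaled dot-product attention.
import math
import torch
import torch.nn as nn

class QEAttentionHead(nn.Module):
    def __init__(self, H, S):
        super().__init__()
        self.S = S
        self.q = nn.Linear(H, S)                            # W_Qi, b_Qi of Eq. (5)
        self.k = nn.Linear(H, S)                            # W_Ki, b_Ki
        self.v = nn.Linear(H, S)                            # W_Vi, b_Vi

    def forward(self, x_in, psi_i):                         # x_in: (B, L, H), psi_i: (B, L)
        q, k, v = self.q(x_in), self.k(x_in), self.v(x_in)
        q = q * psi_i.unsqueeze(-1)                         # (Psi_i 1_S^T) o Q_i
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.S)
        return torch.softmax(scores, dim=-1) @ v            # (B, L, S)

head = QEAttentionHead(H=256, S=32)
psi_i = torch.tanh(torch.randn(32, 60))                     # placeholder for the i-th column of Psi
out = head(torch.randn(32, 60, 256), psi_i)                 # shape (32, 60, 32)
```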

3.3. Forward Propagation Algorithm

After introducing the two improvements, the complete forward propagation algorithm of CQEformer is summarized in Algorithm 1.

3.4. Parameter Optimization

This section details the parameter optimization process of CQEformer during training. We first derive the gradients for all CQEformer modules. Then, we present the parameter update scheme based on the Adam optimizer.

3.4.1. Gradient Derivation

In a traditional encoder-only Transformer for time series forecasting (with a single encoder), there are $14 + 6N$ learnable parameter groups ($N$ denotes the number of attention heads), each represented by a vector or matrix. These include the weight matrices $\{\mathbf{W}_e, \mathbf{W}_{Q_i}, \mathbf{W}_{K_i}, \mathbf{W}_{V_i}, \mathbf{W}_o, \mathbf{W}_{f1}, \mathbf{W}_{f2}, \mathbf{W}_y \mid i = 1, 2, \ldots, N\}$, the bias vectors $\{\mathbf{b}_e, \mathbf{b}_{Q_i}, \mathbf{b}_{K_i}, \mathbf{b}_{V_i}, \mathbf{b}_o, \mathbf{b}_{f1}, \mathbf{b}_{f2}, \mathbf{b}_y \mid i = 1, 2, \ldots, N\}$, and the four vectors $\boldsymbol{\gamma}_{ln}, \boldsymbol{\gamma}_{en}, \boldsymbol{\beta}_{ln}, \boldsymbol{\beta}_{en}$. For CQEformer, a total of $20 + 6N$ parameter groups must be updated, including the six additional groups $\mathbf{W}_c, \mathbf{W}_{\Psi}, \mathbf{b}_c, \mathbf{b}_{\Psi}, \boldsymbol{\gamma}_{\Pi}, \boldsymbol{\beta}_{\Pi}$ from the CRE and QEAttention modules.
Algorithm 1 CQEformer Forecasting Procedure
Require: time series $\mathbf{X} \in \mathbb{R}^{L \times D}$, forecast horizon $L$, model dimension $H$, number of encoders $E$, number of heads $N$
Ensure: forecast $\hat{\mathbf{y}} \in \mathbb{R}^{L}$
1: Step 1: Input layer
   $\mathbf{X}_{in} = \left( \mathbf{X} + \mathrm{CausalConv}(\mathbf{X}) \right) \cdot \mathbf{W}_e + \mathbf{1}_L \cdot \mathbf{b}_e^{T} + \mathbf{PE}$
2: Step 2: CQEformer encoder layers
3: for each encoder layer $e = 1, \ldots, E$ do
4:     compute the statistical prior matrix $\mathbf{\Pi} = \mathrm{LayerNorm}\left( \mathrm{Concat}\left( \boldsymbol{\mu}_t, \boldsymbol{\sigma}_t, \boldsymbol{\tau}_t, \boldsymbol{\epsilon}_t, \boldsymbol{\mu}_s, \boldsymbol{\sigma}_s, \boldsymbol{\tau}_s, \boldsymbol{\epsilon}_s \right) \right) \in \mathbb{R}^{L \times 8}$
5:     compute the statistical prior weight matrix $\mathbf{\Psi} = \mathrm{Tanh}\left( \mathbf{\Pi} \cdot \mathbf{W}_{\Psi} + \mathbf{1}_L \cdot \mathbf{b}_{\Psi}^{T} \right)$
6:     compute the QEAttention mechanism:
7:     for each head $i = 1, \ldots, N$ do
           $\mathbf{Q}_i = \mathbf{X}_{in} \mathbf{W}_{Q_i} + \mathbf{1}_L \mathbf{b}_{Q_i}^{T}$, $\mathbf{K}_i = \mathbf{X}_{in} \mathbf{W}_{K_i} + \mathbf{1}_L \mathbf{b}_{K_i}^{T}$, $\mathbf{V}_i = \mathbf{X}_{in} \mathbf{W}_{V_i} + \mathbf{1}_L \mathbf{b}_{V_i}^{T}$
           $\mathrm{Head}_i = \mathrm{Softmax}\left( \left[ \left( \mathbf{\Psi}_i \cdot \mathbf{1}_S^{T} \right) \circ \mathbf{Q}_i \right] \cdot \mathbf{K}_i^{T} / \sqrt{S} \right) \cdot \mathbf{V}_i$
8:     end for
       $\mathbf{X}_{msa} = \mathrm{Concat}\left( \mathrm{Head}_1, \mathrm{Head}_2, \ldots, \mathrm{Head}_N \right) \cdot \mathbf{W}_o + \mathbf{1}_L \cdot \mathbf{b}_o^{T}$
9:     apply the first residual connection and layer normalization: $\mathbf{X}_{ln} = \mathrm{LayerNorm}\left( \mathbf{X}_{in} + \mathbf{X}_{msa} \right)$
10:    apply the feed-forward network and the second residual connection and layer normalization: $\mathbf{X}_{en} = \mathrm{LayerNorm}\left( \mathbf{X}_{ln} + \mathrm{FFN}(\mathbf{X}_{ln}) \right)$
11: end for
12: Step 3: Output layer
    $\hat{\mathbf{y}} = \mathbf{W}_y \cdot \mathbf{x}_L + \mathbf{b}_y$
13: return $\hat{\mathbf{y}}$
The mean squared error (MSE) is used as the loss function, as in Equation (30). For any parameter $\mathbf{X}$, let $\delta \mathbf{X}$ denote the gradient of $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$ with respect to $\mathbf{X}$. Then $\delta \hat{\mathbf{y}}$ is given by Equation (31):
$\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{L} \left\| \mathbf{y} - \hat{\mathbf{y}} \right\|_2^2, \qquad (30)$
$\delta \hat{\mathbf{y}} = -\frac{2}{L} \left( \mathbf{y} - \hat{\mathbf{y}} \right). \qquad (31)$
In the output layer, $\delta \mathbf{W}_y$, $\delta \mathbf{b}_y$, and $\delta \mathbf{X}_{en}$ are computed as follows:
$\delta \mathbf{W}_y = \delta \hat{\mathbf{y}} \cdot \mathbf{x}_L^{T}, \quad \delta \mathbf{b}_y = \delta \hat{\mathbf{y}}, \quad \delta \mathbf{X}_{en} = \left[ \mathbf{0}; \mathbf{0}; \ldots; \mathbf{0}; \delta \hat{\mathbf{y}}^{T} \cdot \mathbf{W}_y \right], \qquad (32)$
so that only the last row of $\delta \mathbf{X}_{en}$ is nonzero.
There are 12 learnable parameter groups in a single encoder. First, we present the gradient derivation for a general LayerNorm. For $\mathbf{X} \in \mathbb{R}^{L \times H}$, let $\mathbf{Y} = \mathrm{LayerNorm}(\mathbf{X})$, define the centering operator $\mathbf{M} = \mathbf{I} - \frac{1}{H} \mathbf{1}_H \cdot \mathbf{1}_H^{T}$ ($\mathbf{I}$ is the identity matrix) and the normalized matrix $\tilde{\mathbf{X}} = (\mathbf{X} \cdot \mathbf{M}) \oslash (\boldsymbol{\sigma} \cdot \mathbf{1}_H^{T})$. The gradients in the LayerNorm are given by
$\delta \boldsymbol{\gamma} = \mathbf{1}_L^{T} \cdot \left( \delta \mathbf{Y} \circ \tilde{\mathbf{X}} \right), \quad \delta \boldsymbol{\beta} = \mathbf{1}_L^{T} \cdot \delta \mathbf{Y}, \quad \delta \mathbf{X} = \left[ \left( \delta \mathbf{Y} \circ \left( \mathbf{1}_L \cdot \boldsymbol{\gamma}^{T} \right) \right) \oslash \left( \boldsymbol{\sigma} \cdot \mathbf{1}_H^{T} \right) \right] \cdot \mathbf{M}. \qquad (33)$
Using the method outlined in Equation (33) for the backpropagation of the second Res & LayerNorm, we obtain $\delta \boldsymbol{\gamma}_{en}$, $\delta \boldsymbol{\beta}_{en}$, and $\delta \mathbf{Z}$, where $\mathbf{Z} = \mathbf{X}_{ln} + \mathrm{FFN}(\mathbf{X}_{ln})$. The additional gradients $\delta \mathbf{W}_{f1}, \delta \mathbf{W}_{f2}, \delta \mathbf{b}_{f1}, \delta \mathbf{b}_{f2}$, and $\delta \mathbf{X}_{ln}$ can be computed as follows:
$\delta \mathbf{W}_{f1} = \mathbf{X}_{ln}^{T} \cdot \left[ \left( \delta \mathbf{Z} \cdot \mathbf{W}_{f2}^{T} \right) \circ \mathbb{I}(\mathbf{T} > 0) \right], \quad \delta \mathbf{b}_{f1} = \left[ \left( \delta \mathbf{Z} \cdot \mathbf{W}_{f2}^{T} \right) \circ \mathbb{I}(\mathbf{T} > 0) \right]^{T} \cdot \mathbf{1}_L, \quad \delta \mathbf{W}_{f2} = \mathrm{ReLU}(\mathbf{T})^{T} \cdot \delta \mathbf{Z}, \quad \delta \mathbf{b}_{f2} = \delta \mathbf{Z}^{T} \cdot \mathbf{1}_L, \quad \delta \mathbf{X}_{ln} = \delta \mathbf{Z} + \left[ \left( \delta \mathbf{Z} \cdot \mathbf{W}_{f2}^{T} \right) \circ \mathbb{I}(\mathbf{T} > 0) \right] \cdot \mathbf{W}_{f1}^{T}, \qquad (34)$
where $\mathbf{T} = \mathbf{X}_{ln} \cdot \mathbf{W}_{f1} + \mathbf{1}_L \cdot \mathbf{b}_{f1}^{T}$ and $\mathbb{I}(\mathbf{T} > 0)$ denotes a matrix of the same size as $\mathbf{T}$ whose entries are 1 where $\mathbf{T}_{ij} > 0$ and 0 otherwise. Similarly, we obtain the gradients $\delta \boldsymbol{\gamma}_{ln}$, $\delta \boldsymbol{\beta}_{ln}$, $\delta \mathbf{X}_{msa}$, and $\delta \mathbf{X}_{in}^{(ln)}$ in the first Res & LayerNorm, where $\delta \mathbf{X}_{in}^{(ln)}$ denotes the part of the gradient of the loss with respect to $\mathbf{X}_{in}$ that originates from the LayerNorm.
Next, we provide the gradients in QEAttention. The gradients in its final linear transformation are computed as
$\delta \mathbf{W}_o = \mathrm{Head}^{T} \cdot \delta \mathbf{X}_{msa}, \quad \delta \mathbf{b}_o = \delta \mathbf{X}_{msa}^{T} \cdot \mathbf{1}_L, \quad \delta \mathrm{Head} = \delta \mathbf{X}_{msa} \cdot \mathbf{W}_o^{T}, \qquad (35)$
and $\delta \mathrm{Head}_i$ follows from $\delta \mathrm{Head} = \mathrm{Concat}\left( \delta \mathrm{Head}_1, \delta \mathrm{Head}_2, \ldots, \delta \mathrm{Head}_N \right)$.
For Equation (29), let $\mathbf{T}_i = \left[ \left( \mathbf{\Psi}_i \cdot \mathbf{1}_S^{T} \right) \circ \mathbf{Q}_i \right] \cdot \mathbf{K}_i^{T}$ and $\mathbf{A} = \mathrm{Softmax}\left( \mathbf{T}_i / \sqrt{S} \right)$. We can compute $\delta \mathbf{T}_i = \frac{1}{\sqrt{S}} \, \mathbf{A} \circ \left[ \delta \mathrm{Head}_i \cdot \mathbf{V}_i^{T} - \left( \left( \mathbf{A} \circ \left( \delta \mathrm{Head}_i \cdot \mathbf{V}_i^{T} \right) \right) \cdot \mathbf{1}_L \right) \cdot \mathbf{1}_L^{T} \right]$, and then
$\delta \mathbf{Q}_i = \left( \delta \mathbf{T}_i \cdot \mathbf{K}_i \right) \circ \left( \mathbf{\Psi}_i \cdot \mathbf{1}_S^{T} \right), \quad \delta \mathbf{K}_i = \delta \mathbf{T}_i^{T} \cdot \left[ \left( \mathbf{\Psi}_i \cdot \mathbf{1}_S^{T} \right) \circ \mathbf{Q}_i \right], \quad \delta \mathbf{V}_i = \mathbf{A}^{T} \cdot \delta \mathrm{Head}_i, \quad \delta \mathbf{\Psi}_i = \left[ \left( \delta \mathbf{T}_i \cdot \mathbf{K}_i \right) \circ \mathbf{Q}_i \right] \cdot \mathbf{1}_S, \qquad (36)$
and further $\delta \mathbf{\Psi} = \left( \delta \mathbf{\Psi}_1, \delta \mathbf{\Psi}_2, \ldots, \delta \mathbf{\Psi}_N \right)$. Thus, $\delta \mathbf{W}_{\Psi}$, $\delta \mathbf{b}_{\Psi}$, and $\delta \mathbf{\Pi}$ can be computed as follows:
$\delta \mathbf{W}_{\Psi} = \mathbf{\Pi}^{T} \cdot \left[ \left( \mathbf{1}_L \cdot \mathbf{1}_N^{T} - \mathbf{\Psi}^{\circ 2} \right) \circ \delta \mathbf{\Psi} \right], \quad \delta \mathbf{b}_{\Psi} = \left[ \left( \mathbf{1}_L \cdot \mathbf{1}_N^{T} - \mathbf{\Psi}^{\circ 2} \right) \circ \delta \mathbf{\Psi} \right]^{T} \cdot \mathbf{1}_L, \quad \delta \mathbf{\Pi} = \left[ \left( \mathbf{1}_L \cdot \mathbf{1}_N^{T} - \mathbf{\Psi}^{\circ 2} \right) \circ \delta \mathbf{\Psi} \right] \cdot \mathbf{W}_{\Psi}^{T}. \qquad (37)$
The gradients $\delta \boldsymbol{\gamma}_{\Pi}$, $\delta \boldsymbol{\beta}_{\Pi}$, and $\delta \mathbf{\Lambda}$ can then be computed from $\delta \mathbf{\Pi}$ using the method of Equation (33), with $\mathbf{\Lambda} = \mathrm{Concat}\left( \boldsymbol{\mu}_t, \boldsymbol{\sigma}_t, \boldsymbol{\tau}_t, \boldsymbol{\epsilon}_t, \boldsymbol{\mu}_s, \boldsymbol{\sigma}_s, \boldsymbol{\tau}_s, \boldsymbol{\epsilon}_s \right)$; splitting $\delta \mathbf{\Lambda}$ by columns yields $\delta \boldsymbol{\mu}_t, \delta \boldsymbol{\sigma}_t, \delta \boldsymbol{\tau}_t, \delta \boldsymbol{\epsilon}_t, \delta \boldsymbol{\mu}_s, \delta \boldsymbol{\sigma}_s, \delta \boldsymbol{\tau}_s, \delta \boldsymbol{\epsilon}_s$.
Let $\delta \mathbf{X}_{in}^{(qkv)}$ denote the partial gradient of $\mathbf{X}_{in}$ that originates from the generation of $\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i$. The gradients in this generation process take the following form:
$\delta \mathbf{W}_{Q_i} = \mathbf{X}_{in}^{T} \cdot \delta \mathbf{Q}_i, \quad \delta \mathbf{b}_{Q_i} = \delta \mathbf{Q}_i^{T} \cdot \mathbf{1}_L, \quad \delta \mathbf{W}_{K_i} = \mathbf{X}_{in}^{T} \cdot \delta \mathbf{K}_i, \quad \delta \mathbf{b}_{K_i} = \delta \mathbf{K}_i^{T} \cdot \mathbf{1}_L, \quad \delta \mathbf{W}_{V_i} = \mathbf{X}_{in}^{T} \cdot \delta \mathbf{V}_i, \quad \delta \mathbf{b}_{V_i} = \delta \mathbf{V}_i^{T} \cdot \mathbf{1}_L, \qquad (38)$
$\delta \mathbf{X}_{in}^{(qkv)} = \sum_{i=1}^{N} \left( \delta \mathbf{Q}_i \cdot \mathbf{W}_{Q_i}^{T} + \delta \mathbf{K}_i \cdot \mathbf{W}_{K_i}^{T} + \delta \mathbf{V}_i \cdot \mathbf{W}_{V_i}^{T} \right). \qquad (39)$
To obtain the gradients of the last four parameter groups $\mathbf{W}_c, \mathbf{W}_e, \mathbf{b}_c, \mathbf{b}_e$, whose gradients originate from the loss with respect to the encoder input $\mathbf{X}_{in}$, we first need to compute $\delta \mathbf{X}_{in}$, which integrates $\delta \mathbf{X}_{in}^{(ln)}$, $\delta \mathbf{X}_{in}^{(qkv)}$, and $\delta \mathbf{X}_{in}^{(\Lambda)}$; the latter refers to the partial gradient of $\mathbf{X}_{in}$ originating from the generation of $\mathbf{\Lambda}$. To do this, we first present the general method, applicable to both the time and frequency domains, for obtaining the gradient of the input given the precomputed gradients of the statistical moments and entropy.
For a matrix $\mathbf{X} \in \mathbb{R}^{L \times H}$ from the time or frequency domain, given the statistics' gradients $\delta \boldsymbol{\mu}, \delta \boldsymbol{\sigma}, \delta \boldsymbol{\tau}, \delta \boldsymbol{\epsilon}$, the partial gradients of $\mathbf{X}$ originating from the calculation of these statistics are
$\delta \mathbf{X}^{(\mu)} = \frac{1}{H} \delta \boldsymbol{\mu} \cdot \mathbf{1}_H^{T}, \quad \delta \mathbf{X}^{(\sigma)} = \frac{1}{H} \left[ \left( \delta \boldsymbol{\sigma} \oslash \boldsymbol{\sigma} \right) \cdot \mathbf{1}_H^{T} \right] \circ \left( \mathbf{X} \cdot \mathbf{M} \right), \quad \delta \mathbf{X}^{(\tau)} = \frac{3}{H} \left( \delta \boldsymbol{\tau} \cdot \mathbf{1}_H^{T} \right) \circ \left[ \left( \mathbf{S}^{\circ 2} \cdot \mathbf{M} \right) \oslash \left( \boldsymbol{\sigma} \cdot \mathbf{1}_H^{T} \right) - \frac{1}{H} \left( \left( \mathbf{S}^{\circ 3} \cdot \mathbf{1}_H \right) \cdot \mathbf{1}_H^{T} \right) \circ \mathbf{S} \oslash \left( \boldsymbol{\sigma} \cdot \mathbf{1}_H^{T} \right) \right], \quad \delta \mathbf{X}^{(\epsilon)} = - \frac{\partial \mathbf{P}}{\partial \mathbf{X}} \circ \left( \delta \boldsymbol{\epsilon} \cdot \mathbf{1}_H^{T} \right) \circ \left( \log \mathbf{P} + \mathbf{1}_L \cdot \mathbf{1}_H^{T} \right), \qquad (40)$
where $\mathbf{S} = \left( \mathbf{X} \cdot \mathbf{M} \right) \oslash \left( \boldsymbol{\sigma} \cdot \mathbf{1}_H^{T} \right)$, $\mathbf{P}$ is the probability matrix ($\mathbf{P}_t$ in the time domain and $\mathbf{P}_s$ in the frequency domain), $\frac{\partial \mathbf{P}_t}{\partial \mathbf{X}} = \mathbf{P}_t - \mathbf{P}_t \circ \mathbf{P}_t$ in the time domain, and $\frac{\partial \mathbf{P}_s}{\partial \mathbf{X}} = \left( \mathbf{1}_L \cdot \mathbf{1}_H^{T} - \mathbf{P}_s \right) \oslash \left( \left( \mathbf{X} \cdot \mathbf{1}_H \right) \cdot \mathbf{1}_H^{T} \right)$ in the frequency domain.
Based on Equation (40), we can compute $\delta \mathbf{X}_{in}^{(\mu_t)}, \delta \mathbf{X}_{in}^{(\sigma_t)}, \delta \mathbf{X}_{in}^{(\tau_t)}, \delta \mathbf{X}_{in}^{(\epsilon_t)}, \delta \mathbf{X}_{ps}^{(\mu_s)}, \delta \mathbf{X}_{ps}^{(\sigma_s)}, \delta \mathbf{X}_{ps}^{(\tau_s)}, \delta \mathbf{X}_{ps}^{(\epsilon_s)}$, and further obtain $\delta \mathbf{X}_{in}^{(t)}$ and $\delta \mathbf{X}_{in}^{(s)}$, the partial gradients of $\mathbf{X}_{in}$ originating from the time and frequency domains, as follows:
$\delta \mathbf{X}_{in}^{(t)} = \delta \mathbf{X}_{in}^{(\mu_t)} + \delta \mathbf{X}_{in}^{(\sigma_t)} + \delta \mathbf{X}_{in}^{(\tau_t)} + \delta \mathbf{X}_{in}^{(\epsilon_t)}, \quad \delta \mathbf{X}_{in}^{(s)} = 2 \, \mathrm{Re}\left[ \overline{\mathbf{\Omega}}^{T} \cdot \left( \left( \mathbf{\Omega} \cdot \mathbf{X}_{in} \right) \circ \left( \delta \mathbf{X}_{ps}^{(\mu_s)} + \delta \mathbf{X}_{ps}^{(\sigma_s)} + \delta \mathbf{X}_{ps}^{(\tau_s)} + \delta \mathbf{X}_{ps}^{(\epsilon_s)} \right) \right) \right]. \qquad (41)$
Thus, we obtain $\delta \mathbf{X}_{in}^{(\Lambda)} = \delta \mathbf{X}_{in}^{(t)} + \delta \mathbf{X}_{in}^{(s)}$ and then $\delta \mathbf{X}_{in} = \delta \mathbf{X}_{in}^{(ln)} + \delta \mathbf{X}_{in}^{(qkv)} + \delta \mathbf{X}_{in}^{(\Lambda)}$. The gradients of the final four parameter groups, $\delta \mathbf{W}_c, \delta \mathbf{W}_e, \delta \mathbf{b}_c, \delta \mathbf{b}_e$, are given below, where $\delta \mathbf{W}_c$ and $\delta \mathbf{b}_c$ are defined element-wise and $\delta \mathbf{C} = \delta \mathbf{X}_e \cdot \mathbf{W}_e^{T}$ denotes the gradient with respect to the causal convolution output:
$\delta \mathbf{W}_e = \left( \mathbf{X} + \mathrm{CausalConv}(\mathbf{X}) \right)^{T} \cdot \delta \mathbf{X}_e, \quad \delta \mathbf{b}_e = \delta \mathbf{X}_e^{T} \cdot \mathbf{1}_L, \quad \delta \mathbf{W}_c^{(k,d,j)} = \sum_{i=1}^{L} \mathbb{I}(i-k+1 > 0) \cdot \delta \mathbf{C}_{i,j} \cdot \mathbf{X}_{i',d}, \quad \delta \mathbf{b}_c^{(j)} = \sum_{i=1}^{L} \delta \mathbf{C}_{i,j}. \qquad (42)$
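As a quick cross-check on the derivations above, the short script below numerically verifies the analytical gradient of $\mathbf{W}_{\Psi}$ in Equation (37) against automatic differentiation on a toy loss; it is an illustration only and not part of the paper's training code.

```python
# Numerical sanity check of Eq. (37): analytical gradient of W_Psi vs. PyTorch autograd.
import torch

L, N = 6, 4
Pi = torch.randn(L, 8)                              # statistical prior matrix (Eq. (27))
W_Psi = torch.randn(8, N, requires_grad=True)
b_Psi = torch.randn(N, requires_grad=True)

Psi = torch.tanh(Pi @ W_Psi + b_Psi)                # Eq. (28)
loss = (Psi ** 2).sum()                             # arbitrary scalar loss for the check
loss.backward()

dPsi = 2 * Psi                                      # delta Psi for this toy loss
dW_analytic = Pi.T @ ((1 - Psi ** 2) * dPsi)        # Eq. (37): Pi^T [(1 - Psi^2) o delta Psi]
print(torch.allclose(dW_analytic, W_Psi.grad, atol=1e-5))   # expected: True
```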

3.4.2. Parameter Update

Given the gradients derived above, CQEformer's parameters are updated using the Adam optimizer [25]. For a parameter $\mathbf{\Theta}$, given the gradient $\delta \mathbf{\Theta}^{(t)}$ of the loss with respect to $\mathbf{\Theta}$ at step $t$, the update is computed as follows:
$\mathbf{M}^{(t)} = \beta_1 \mathbf{M}^{(t-1)} + (1 - \beta_1) \, \delta \mathbf{\Theta}^{(t)}, \qquad (43)$
$\mathbf{V}^{(t)} = \beta_2 \mathbf{V}^{(t-1)} + (1 - \beta_2) \left( \delta \mathbf{\Theta}^{(t)} \right)^{\circ 2}, \qquad (44)$
$\hat{\mathbf{M}}^{(t)} = \frac{\mathbf{M}^{(t)}}{1 - \beta_1^{t}}, \quad \hat{\mathbf{V}}^{(t)} = \frac{\mathbf{V}^{(t)}}{1 - \beta_2^{t}}, \qquad (45)$
$\mathbf{\Theta}^{(t+1)} = \mathbf{\Theta}^{(t)} - \alpha \, \hat{\mathbf{M}}^{(t)} \oslash \left( \sqrt{\hat{\mathbf{V}}^{(t)}} + \epsilon \right), \qquad (46)$
where $\beta_1, \beta_2$ are the exponential decay rates, $\alpha$ is the learning rate, and $\epsilon$ is a very small constant added to avoid division by zero. Using the update rule in Equation (46), all parameters in CQEformer are updated in the same manner. The ReduceLROnPlateau scheduler is employed to dynamically decrease the learning rate once the validation loss stops improving, ensuring smoother convergence and enhanced training stability.
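A minimal NumPy sketch of Equations (43)-(46) follows. The decay rates and epsilon use the common Adam defaults, which the paper does not state explicitly; only the learning rate matches Section 4.2.

```python
# Adam update for a single parameter group, following Eqs. (43)-(46).
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                      # Eq. (43): first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2                 # Eq. (44): second-moment estimate
    m_hat = m / (1 - beta1 ** t)                            # Eq. (45): bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (46): parameter update
    return theta, m, v

theta = np.zeros(4)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 6):                                       # a few illustrative steps on a toy objective
    grad = 2 * (theta - 1.0)                                # gradient of ||theta - 1||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
```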

4. Experiments

To evaluate CQEformer’s performance, we performed three experiments—multi-model comparison, time-window sensitivity analysis, and ablation studies—primarily on financial time series, which are representative of datasets exhibiting structural breaks and volatility clustering. In addition, we conducted supplementary experiments on the public Traffic dataset to assess cross-domain generalization under nonstationary conditions.

4.1. Data, Indicators and Standardization

The dataset consists of twenty years of historical transaction records for the CSI 300 Index from 13 June 2006 to 12 June 2025. Since the Bank of Communications was listed in 2007, we additionally use its nineteen years of historical stock transaction data from 13 June 2007 to 12 June 2025. Both datasets are retrieved from AKShare [26]. Ten raw indicators are considered (Open, Close, High, Low, Volume, Turnover, Amplitude, Change, ChangeAmount, TurnoverRate), and 17 technical indicators are further constructed (SMA(10), EMA(12), EMA(26), DEMA(10), TEMA(10), WMA(10), MACD, Signal, Histogram, RSI(14), ROC(12), MOM(10), CCI(20), WILLR(14), CMO(14), K(14), D). The definitions of these indicators are provided in Table 1. To further assess the generalization capability of CQEformer beyond the financial domain, we also conduct experiments on the public Traffic dataset (1 July 2017 to 30 June 2018) [27], which captures hourly road occupancy rates with evident structural shifts and volatility clustering patterns.
Prior to experimentation, all features are range-normalized. Specifically, for the $j$-th feature column $\mathbf{X}_j = [x_1, x_2, \ldots, x_n]^{T}$ of the input matrix (where $n$ is the number of samples), the normalized feature is denoted $\mathbf{X}'_j = [x'_1, x'_2, \ldots, x'_n]^{T}$ and calculated as
$x'_i = \frac{x_i - \min(\mathbf{X}_j)}{\max(\mathbf{X}_j) - \min(\mathbf{X}_j)}. \qquad (47)$
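A short NumPy sketch of the column-wise normalization in Equation (47) is shown below; in practice the stored minima and maxima would also be used to denormalize the predictions before computing the metrics in Section 4.2.

```python
# Column-wise min-max normalization (Eq. (47)) on a toy feature matrix.
import numpy as np

def min_max_normalize(X):
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min), (col_min, col_max)

X = np.array([[3800.0, 1.2],
              [3825.0, 0.8],
              [3790.0, 1.5]])
X_norm, (col_min, col_max) = min_max_normalize(X)
X_back = X_norm * (col_max - col_min) + col_min        # denormalization recovers X
```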

4.2. Experimental Setup

Both multivariate time series datasets are partitioned into samples via a sliding window approach (step size = 1, task-specific window size) and split into 80%, 10%, and 10% for the training, validation, and testing sets. The random seed is 2025. Training uses Adam (lr = $10^{-4}$, batch size = 32) with a ReduceLROnPlateau scheduler (patience = 5, factor = 0.3, mode = rel, threshold = $5 \times 10^{-5}$) for 50 epochs. Model hyperparameters include a hidden layer dimension of 256 (FFN = 512), dropout = 0.1, encoder depth = 1, and 8 attention heads.
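The following PyTorch snippet shows one way to wire up this configuration; the placeholder model and validation loss are illustrative, and the paper's "mode = rel" is interpreted here as the scheduler's threshold_mode argument.

```python
# Optimizer and learning-rate scheduler configuration matching Section 4.2.
import torch

torch.manual_seed(2025)
model = torch.nn.Linear(27, 1)                            # placeholder for a CQEformer instance
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.3, patience=5,
    threshold=5e-5, threshold_mode="rel")
criterion = torch.nn.MSELoss()

for epoch in range(50):
    # ... training loop over mini-batches of size 32 would go here ...
    val_loss = torch.tensor(1.0)                          # placeholder validation loss
    scheduler.step(val_loss)                              # decay lr when validation loss plateaus
```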
After denormalization, model performance is evaluated using four metrics: mean squared error (MSE), mean absolute percentage error (MAPE), coefficient of determination ($R^2$), and training time. Let $\bar{y}$ denote the mean of the elements of $\mathbf{y}$; MSE, MAPE, and $R^2$ are defined as
$\mathrm{MSE} = \frac{1}{n} \left\| \mathbf{y} - \hat{\mathbf{y}} \right\|_2^2, \quad \mathrm{MAPE} = \frac{1}{n} \left\| \left( \mathbf{y} - \hat{\mathbf{y}} \right) \oslash \mathbf{y} \right\|_1, \quad R^2 = 1 - \frac{\left\| \mathbf{y} - \hat{\mathbf{y}} \right\|_2^2}{\left\| \mathbf{y} - \bar{y} \mathbf{1}_n \right\|_2^2}. \qquad (48)$
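For reference, a compact NumPy implementation of Equation (48) is given below; it returns MAPE as a fraction, whereas the tables in this section appear to report it as a percentage.

```python
# Evaluation metrics of Eq. (48) on denormalized targets and predictions.
import numpy as np

def evaluate(y, y_hat):
    err = y - y_hat
    mse = np.mean(err ** 2)                                    # MSE
    mape = np.mean(np.abs(err / y))                            # MAPE (assumes y has no zeros)
    r2 = 1 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)    # R^2
    return mse, mape, r2

y = np.array([3800.0, 3825.0, 3790.0, 3810.0])                 # toy example
y_hat = np.array([3795.0, 3830.0, 3785.0, 3805.0])
print(evaluate(y, y_hat))
```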

4.3. Multi-Model Comparison Experiment

In this experiment, the time window size is fixed at 60 (approximately one quarter) for the two financial datasets. We choose representative baseline models in the field of time-series prediction, including LSTM, GRU, TCN, and Transformer. These models are evaluated alongside our proposed CQEformer on all datasets, aiming to validate the prediction accuracy of CQEformer and to highlight its advantages over existing approaches. The empirical results are shown in Figure 3 and Table 2.
In the multi-model comparison experiment, CQEformer achieves the best overall performance in trend fitting accuracy, local fluctuation capture, and adaptation to extreme market conditions. This superiority is visually illustrated in Figure 3 and quantitatively supported in Table 2. Across the two datasets, CQEformer reduces the MSE by an average of 38.90% compared to all baseline models. Benefiting from the causal residual embedding mechanism, the model quickly responds to sudden changes, effectively captures long-term fluctuation patterns, and demonstrates substantially higher robustness compared with the baseline models.
In contrast, the Transformer, although effective in trend fitting, exhibits a lagged response and low accuracy in capturing instantaneous fluctuations. LSTM and GRU, while computationally efficient and stable in trend tracking, lack the ability to capture fine-grained details. TCN, which excels in local feature extraction, is overly sensitive to sudden changes, leading to overreactions and reduced long-term prediction robustness.
Overall, CQEformer achieves a good balance between training time and predictive accuracy. It demonstrates clear advantages in high-precision forecasting scenarios and offers substantial potential for practical applications.
In addition to the financial datasets, we also evaluated CQEformer on the public Traffic dataset to examine its cross-domain adaptability. Although traffic flow data differ from financial series in semantics, they similarly exhibit structural shifts and volatility clustering caused by external disturbances. In this experiment, the time-window length was set to 48, corresponding to approximately two days of data, to capture short-term fluctuation patterns. CQEformer achieved the best overall performance ($R^2 = 0.9299$, MSE = $4.35 \times 10^{-5}$), reducing the average MSE of the baseline models by approximately 27.29%, confirming its robustness and transferability to non-financial time series.

4.4. Time Window Sensitivity Experiment

The time window length is a key hyperparameter in time-series prediction tasks, directly affecting the model's ability to capture historical information. To evaluate its impact on forecasting and to further verify the robustness of the proposed model, we conduct a time window sensitivity analysis by varying the window size from 10 to 100 in steps of 10 and compare the prediction performance of the traditional encoder-only Transformer and CQEformer.
Figure 4 summarizes the time-window sensitivity experiment. As the window size changes, CQEformer exhibits consistently low prediction error and strong stability, reducing MSE variance by 94.04% compared to the traditional Transformer, demonstrating robust adaptability to variations in time window size.

4.5. Ablation Experiment

For the ablation study, we use the CSI 300 dataset with a time window of 60 and compare four variants: Encoder-Only Transformer (baseline), CREformer (causal residual embedding), QEformer (QEAttention), and CQEformer. The aim is to assess each module’s contribution, their combined effects, and validate the model design. Figure 5 and Table 3 summarize each module’s effect on the CSI 300 dataset (window size 60).
CREformer reduces MSE by 38.82% vs. Transformer, increases $R^2$ by 7.84%, and decreases MAPE by 28.85%, confirming its role in capturing temporal dependencies. QEformer also outperforms the baseline, with MSE reduced by 32.96%, $R^2$ increased by 6.66%, and MAPE decreased by 21.30%, demonstrating the module's independent value. CQEformer achieves the best results: MSE down by 60.51%, $R^2$ up by 12.22%, and MAPE down by 46.70%; relative to CREformer and QEformer, MSE is further reduced by 35.44% and 41.09%, respectively. Although CQEformer requires the longest training time, the substantial improvement in prediction accuracy justifies the extra cost. Moreover, CQEformer converges quickly and stably, generalizes well, and outperforms all ablation variants, as shown in Figure 6.
To explore the fitting ability of ablation variants under different market risk regions, we design the experiment as follows. Using historical CSI 300 closing prices, we calculate the 20-day rolling volatility as a measure of market risk:
$\mathrm{Risk}_t = \sqrt{ \mathbb{E}\left[ \left( r_{t-i} - \bar{r}_t \right)^2 \right] } = \sqrt{ \frac{1}{20} \sum_{i=0}^{19} \left( r_{t-i} - \bar{r}_t \right)^2 }, \qquad (49)$
where $r_{t-i}$ is the return on day $t-i$ ($i = 0, \ldots, 19$) and $\bar{r}_t = \frac{1}{20} \sum_{i=0}^{19} r_{t-i}$.
The test data are divided into three risk regions (stable, mildly volatile, and highly volatile) based on the 33rd and 66th percentiles of the full-series volatility: Low ($\mathrm{Risk} \leq 0.0077$), Medium ($0.0077 < \mathrm{Risk} \leq 0.0097$), and High ($\mathrm{Risk} > 0.0097$) (Figure 7).
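The rolling-volatility measure and the percentile-based split can be reproduced with a few lines of pandas, as sketched below on placeholder prices; the thresholds 0.0077 and 0.0097 quoted above are the values obtained from the actual CSI 300 series.

```python
# 20-day rolling volatility (Eq. (49)) and percentile-based risk regions.
import numpy as np
import pandas as pd

close = pd.Series(np.random.lognormal(mean=0.0, sigma=0.01, size=500).cumprod())  # placeholder prices
returns = close.pct_change()
risk = returns.rolling(window=20).std(ddof=0)              # sqrt of the mean squared deviation over 20 days

low_th, high_th = risk.quantile([0.33, 0.66])              # 33rd / 66th percentiles of the full series
region = pd.cut(risk, bins=[-np.inf, low_th, high_th, np.inf],
                labels=["Low", "Medium", "High"])          # stable / mildly / highly volatile
```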
Using the risk region classification, we compare the MSE, MAPE, and $R^2$ of the ablation variants; Figure 8 and Table 4 show the prediction performance of the four models across the low, medium, and high volatility regions. As volatility increases, all models exhibit higher prediction errors and lower accuracy. For CQEformer, MSE rises by 168.88% from the low to the high volatility region, MAPE increases by 47.52%, and $R^2$ drops by 5.31%, indicating that higher market volatility substantially increases prediction difficulty.
Across all regions, CQEformer achieves the best performance. Compared with Transformer, its MSE decreases by 66.64%, 62.06%, and 56.70% in low, medium, and high volatility regions, respectively, demonstrating strong risk adaptability. While the improvement is slightly limited in high-volatility scenarios, CQEformer still effectively handles sudden changes, reflecting its structural robustness to high-noise and nonlinear price sequences. CREformer and QEformer perform similarly, both clearly outperforming Transformer, highlighting the effectiveness of the two enhanced modules in volatility modeling.

5. Conclusions

This study proposes CQEformer, an encoder-only model addressing the limitations of standard Transformers in nonstationary time series with structural breaks and volatility clustering, through two core innovations: causal residual embedding (CRE) and query-enhanced multi-head self-attention (QEAttention).
Key findings validate its effectiveness. In comparative experiments on the CSI 300 Index and Bank of Communications datasets, CQEformer outperforms baselines (LSTM, GRU, TCN, standard Transformer) across core metrics, achieving an average MSE reduction of 38.90%. Ablation studies confirm the modular synergy: CRE and QEAttention individually reduce MSE by 38.82% and 32.96%, respectively, while their combination yields a 60.51% reduction, highlighting the rationale of the dual-module design. CQEformer also demonstrates strong robustness to temporal window variations, outperforming the standard Transformer in adaptability. Evaluation on the public Traffic dataset further verifies its cross-domain generalization.
Theoretically, CQEformer advances nonstationary time-series prediction by integrating causal structure awareness with statistical prior guidance, effectively balancing local detail capture and global context preservation. Practically, its superior performance across financial and traffic domains underscores its potential for high-precision forecasting under dynamic, volatile conditions, while its interpretable framework aids understanding of complex temporal dynamics. Future work will focus on developing lightweight architectures, enhancing cross-domain generalization, improving long-term forecasting, and integrating multi-source information including external indicators.

Author Contributions

Conceptualization, Y.T.; Methodology, Y.T.; Software, Y.T.; Writing—Original Draft, Y.T.; Supervision, L.L.; Funding Acquisition, L.L.; Project Administration, L.L.; Writing—Review and Editing, Y.T. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62173222) and Shanghai University of Engineering Science Horizontal Research Project (SJ20230195).

Data Availability Statement

This study used publicly available financial and traffic time series datasets obtained from open repositories, as cited in the manuscript. No new data were generated. https://github.com/akfamily/akshare (accessed on 15 June 2025). https://pems.dot.ca.gov/ (accessed on 8 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Osborne, M.F.M. Periodic Structure in the Brownian Motion of Stock Prices. Oper. Res. 1962, 10, 345–379.
  2. Box, G.; Jenkins, G. Time Series Analysis: Forecasting and Control; Holden-Day Series in Time Series Analysis and Digital Processing; Holden-Day: Sydney, Australia, 1970.
  3. Sims, C.A. Macroeconomics and Reality. Econometrica 1980, 48, 1–48.
  4. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J. Econom. 1986, 31, 307–327.
  5. Petrica, A.C.; Stancu, S.; Tindeche, A. Limitation of ARIMA Models in Financial and Monetary Economics. Theor. Appl. Econ. 2016, XXIII, 19–42.
  6. Wang, D.; Zheng, Y.; Lian, H.; Li, G. High-Dimensional Vector Autoregressive Time Series Modeling via Tensor Decomposition. J. Am. Stat. Assoc. 2022, 117, 1338–1356.
  7. Andersen, T.G.; Bollerslev, T.; Diebold, F.X.; Labys, P. Modeling and Forecasting Realized Volatility. Econometrica 2003, 71, 579–625.
  8. Nguyen, T.N.; Tran, M.N.; Gunawan, D.; Kohn, R. A Statistical Recurrent Stochastic Volatility Model for Stock Markets. J. Bus. Econ. Stat. 2023, 41, 414–428.
  9. Sang, S.; Li, L. A Novel Variant of LSTM Stock Prediction Method Incorporating Attention Mechanism. Mathematics 2024, 12, 945.
  10. Lee, M.C. Research on the Feasibility of Applying GRU and Attention Mechanism Combined with Technical Indicators in Stock Trading Strategies. Appl. Sci. 2022, 12, 1007.
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
  12. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115.
  13. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., Eds.; PMLR: New York, NY, USA, 2022; Volume 162, pp. 27268–27286.
  14. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  15. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  16. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  17. Chen, P.; Zhang, Y.; Cheng, Y.; Shu, Y.; Wang, Y.; Wen, Q.; Yang, B.; Guo, C. Pathformer: Multi-scale Transformers with Adaptive Pathways for Time Series Forecasting. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  18. Zhou, Y.; Ye, Y.; Zhang, P.; Du, X.; Chen, M. TwinsFormer: Revisiting Inherent Dependencies via Two Interactive Components for Time Series Forecasting. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  19. Liu, Y.; Qin, G.; Huang, X.; Wang, J.; Long, M. Timer-XL: Long-Context Transformers for Unified Time Series Forecasting. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  20. Ding, Q.; Wu, S.; Sun, H.; Guo, J.; Guo, J. Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, Yokohama, Japan, 11–17 July 2020; Special Track on AI in FinTech; Bessiere, C., Ed.; International Joint Conferences on Artificial Intelligence Organization: New York, NY, USA, 2020; pp. 4640–4646.
  21. Zhang, Q.; Qin, C.; Zhang, Y.; Bao, F.; Zhang, C.; Liu, P. Transformer-based attention network for stock movement prediction. Expert Syst. Appl. 2022, 202, 117239.
  22. Xu, H.; Xiang, L.; Ye, H.; Yao, D.; Chu, P.; Li, B. Permutation Equivariance of Transformers and its Applications. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5987–5996.
  23. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499.
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  25. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
  26. King, A. AKShare. GitHub. 2022. Available online: https://github.com/akfamily/akshare (accessed on 15 June 2025).
  27. California Department of Transportation. Traffic Dataset. In California Performance Measurement System (PeMS); California Department of Transportation: Sacramento, CA, USA, 2021.
Figure 1. Network architecture of Encoder-Only Transformer.
Figure 2. Network architecture of CQEformer.
Figure 3. Multi-model prediction effect comparison on the test set.
Figure 4. Comparison of multi-indicator performance under different time windows.
Figure 5. Prediction performance comparison of ablation variants on the test set.
Figure 6. Loss curves of ablation variants on training and validation sets.
Figure 7. Volatility region distribution of the test set.
Figure 8. Histogram of fitting performance across volatility regions.
Table 1. Technical Indicator Definitions and Calculation Methods.
Indicator Name | Formula
Simple Moving Average | $\mathrm{SMA}_t(n) = \frac{1}{n} \sum_{i=0}^{n-1} P_{t-i}$
Exponential Moving Average | $\mathrm{EMA}_t(n) = \alpha \cdot P_t + (1 - \alpha) \cdot \mathrm{EMA}_{t-1}(n)$
Double Exponential MA | $\mathrm{DEMA}(n) = 2\,\mathrm{EMA}(n) - \mathrm{EMA}^{(2)}(n)$
Triple Exponential MA | $\mathrm{TEMA}(n) = 3\,\mathrm{EMA}(n) - 3\,\mathrm{EMA}^{(2)}(n) + \mathrm{EMA}^{(3)}(n)$
Weighted Moving Average | $\mathrm{WMA}(n) = \frac{\sum_{i=0}^{n-1} (i+1) P_{t-i}}{\sum_{i=0}^{n-1} (i+1)}$
MACD | $\mathrm{MACD}_t = \mathrm{EMA}_t(12) - \mathrm{EMA}_t(26)$
MACD Signal Line | $\mathrm{Signal}_t$ = 9-period EMA of $\mathrm{MACD}_t$
MACD Histogram | $\mathrm{Histogram}_t = \mathrm{MACD}_t - \mathrm{Signal}_t$
Relative Strength Index | $\mathrm{RSI}_t(n) = 100 - \frac{100 \cdot \mathrm{SumLoss}_t(n)}{\mathrm{SumGain}_t(n) + \mathrm{SumLoss}_t(n)}$
Rate of Change | $\mathrm{ROC}_t(n) = \frac{P_t - P_{t-n}}{P_{t-n}} \times 100\%$
Momentum | $\mathrm{MOM}_t(n) = P_t - P_{t-n}$
Commodity Channel Index | $\mathrm{CCI}_t(n) = \frac{TP_t - \mathrm{SMA}[TP]_t(n)}{0.015 \cdot \mathrm{MD}_t(n)}$
Williams %R | $\mathrm{WILLR}_t(n) = \frac{P_t - \max_{i=0}^{n-1} \mathrm{High}_{t-i}}{\max_{i=0}^{n-1} \mathrm{High}_{t-i} - \min_{i=0}^{n-1} \mathrm{Low}_{t-i}} \times 100$
Chande Momentum Oscillator | $\mathrm{CMO}_t(n) = \frac{\mathrm{SumGain}_t(n) - \mathrm{SumLoss}_t(n)}{\mathrm{SumGain}_t(n) + \mathrm{SumLoss}_t(n)} \times 100$
Stochastic K | $K_t(n) = \frac{P_t - \mathrm{LowestLow}_t(n)}{\mathrm{HighestHigh}_t(n) - \mathrm{LowestLow}_t(n)} \times 100$
Stochastic D | $D_t(m) = \mathrm{SMA}[K_t(14)](m)$
Symbol explanation: $P_t$: price at time $t$; $n$: window size; $\alpha = \frac{2}{n+1}$ (smoothing factor); $\mathrm{EMA}_0 = \mathrm{SMA}_0$; $\mathrm{EMA}^{(k)}$: $k$-times applied EMA; $\mathrm{SumGain}_t(n) = \sum_{i=0}^{n-1} \max(P_{t-i} - P_{t-i-1}, 0)$; $\mathrm{SumLoss}_t(n) = \sum_{i=0}^{n-1} \max(P_{t-i-1} - P_{t-i}, 0)$; $TP_t = \frac{\mathrm{High}_t + \mathrm{Low}_t + P_t}{3}$; $\mathrm{MD}_t(n) = \frac{1}{n} \sum_{i=0}^{n-1} \left| TP_{t-i} - \mathrm{SMA}[TP]_t(n) \right|$.
Table 2. Multi-indicator prediction effect comparison of multi-models on the test set.
Data | Model | MSE | MAPE | $R^2$ | Training Time
CSI 300 | LSTM | 5979.9843 | 1.7470 | 0.8967 | 38.50
CSI 300 | GRU | 6791.7403 | 1.9428 | 0.8827 | 37.71
CSI 300 | TCN | 5156.9241 | 1.3713 | 0.9109 | 42.26
CSI 300 | Transformer | 9720.6929 | 2.2986 | 0.8320 | 68.85
CSI 300 | CQEformer | 3839.0661 | 1.2252 | 0.9337 | 106.56
Bank of Communications | LSTM | 0.0219 | 1.6819 | 0.9582 | 40.46
Bank of Communications | GRU | 0.0290 | 2.0708 | 0.9446 | 36.56
Bank of Communications | TCN | 0.0192 | 1.5426 | 0.9633 | 39.11
Bank of Communications | Transformer | 0.0430 | 2.5738 | 0.9179 | 64.35
Bank of Communications | CQEformer | 0.0163 | 1.3648 | 0.9688 | 102.27
Table 3. Multi-indicator performance of ablation variants.
Model | MSE | MAPE | $R^2$ | Training Time
Transformer | 9720.6929 | 2.2986 | 0.8320 | 66.63
CREformer | 5946.8689 | 1.6355 | 0.8972 | 73.53
QEformer | 6516.6724 | 1.8091 | 0.8874 | 96.61
CQEformer | 3839.0661 | 1.2252 | 0.9337 | 106.12
Table 4. Fitting performance of ablation variants across volatility regions.
Volatility (Average) | Model | MSE | MAPE | $R^2$
Low (0.0064) | Transformer | 6680.7169 | 2.0489 | 0.8552
Low (0.0064) | CREformer | 3488.8279 | 1.3186 | 0.9244
Low (0.0064) | QEformer | 4586.1880 | 1.7105 | 0.9006
Low (0.0064) | CQEformer | 2228.9627 | 0.9931 | 0.9517
Medium (0.0088) | Transformer | 8523.8722 | 2.1798 | 0.8122
Medium (0.0088) | CREformer | 5165.9218 | 1.6306 | 0.8862
Medium (0.0088) | QEformer | 5061.9404 | 1.6978 | 0.8885
Medium (0.0088) | CQEformer | 3233.8762 | 1.2110 | 0.9288
High (0.0155) | Transformer | 13840.4316 | 2.6567 | 0.7718
High (0.0155) | CREformer | 9097.2318 | 1.9489 | 0.8500
High (0.0155) | QEformer | 9805.1429 | 2.0127 | 0.8383
High (0.0155) | CQEformer | 5993.2861 | 1.4650 | 0.9012