Article

Deep Reinforcement Learning for Financial Trading: Enhanced by Cluster Embedding and Zero-Shot Prediction

1 School of Information and Mathematics, Yangtze University, Jingzhou 434020, China
2 School of Information Engineering, Jingzhou University, Jingzhou 434020, China
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(1), 112; https://doi.org/10.3390/sym18010112
Submission received: 5 December 2025 / Revised: 30 December 2025 / Accepted: 5 January 2026 / Published: 7 January 2026
(This article belongs to the Special Issue Machine Learning and Data Analysis III)

Abstract

Deep reinforcement learning (DRL) plays a pivotal role in decision-making within financial markets. However, DRL models are highly reliant on raw market data and often overlook the impact of future trends on model performance. To address these challenges, we propose a novel framework named Cluster Embedding-Proximal Policy Optimization (CE-PPO) for trading decision-making in financial markets. Specifically, the framework groups feature channels with intrinsic similarities and enhances the original model by leveraging clustering information instead of features from individual channels. Meanwhile, zero-shot prediction for unseen samples is achieved by assigning them to appropriate clusters. Future Open, High, Low, Close, and Volume (OHLCV) data predicted from observed values are integrated with actually observed OHLCV data, forming the state space inherent to reinforcement learning. Experiments conducted on five real-world financial datasets demonstrate that the time series model integrated with Cluster Embedding (CE) achieves significant improvements in predictive performance: in short-term prediction, the Mean Absolute Error (MAE) is reduced by an average of 20.09% and the Mean Squared Error (MSE) by 30.12%; for zero-shot prediction, the MAE and MSE decrease by an average of 21.56% and 31.71%, respectively. Through data augmentation using real and predicted data, the framework substantially enhances trading performance, achieving a cumulative return rate of 137.94% on the S&P 500 Index. Beyond its empirical contributions, this study also highlights the conceptual relevance of symmetry in the domain of algorithmic trading. The constructed deep reinforcement learning framework is capable of capturing the inherent balanced relationships and nonlinear interaction characteristics embedded in financial market behaviors.

1. Introduction

Financial markets, as the beating heart of global economic activity, are defined by their dynamism and inherent high risk, challenges with which all participants grapple persistently. While recent strides in financial time series modeling have made headway in capturing market patterns, these models often stumble when faced with rapid, sudden market shifts, ultimately translating into lackluster trading results [1,2]. A key shortcoming of existing models lies in their heavy reliance on extensive historical data for training. However, in realistic financial contexts, data constraints are prominent, as certain market data remains inaccessible due to commercial confidentiality, while newly listed instruments typically suffer from a scarcity of training samples [3]. These gaps have catapulted zero-shot prediction technology into a pivotal role as a way to break free from traditional modeling frameworks [4].
Reinforcement learning (RL) has gained significant attention for its capacity to refine decision-making in unpredictable, dynamically evolving environments. Its applications span robotic control [5], game playing [6], autonomous driving [7,8], and fine-tuning of large language models [9,10], with particularly notable performance in algorithmic trading [11,12,13]. The dynamic nature of financial markets makes them an optimal setting for the application of RL: RL agents are capable of iteratively learning and optimizing trading strategies based on market feedback. However, despite these advancements in robust algorithms and frameworks for implementing deep reinforcement learning (DRL) methodologies within the financial domain, significant limitations persist [14,15]. A substantial body of existing research relies exclusively on raw market price data, including daily opening, high, low, and closing prices, as well as trading volume (OHLCV). This data dependency constrains the capacity of RL models to capture rapid market fluctuations, which are often the critical factor in capitalizing on profitable trading opportunities. While recent studies have sought to enhance data diversity by integrating technical indicators and social media sentiment, such metrics are plagued by lag and thus fail to capture realistic market dynamics [16]. Consequently, trading decisions predicated solely on historical data carry inherent risks, which may result in missed opportunities or significant financial losses.
Data-Centric Artificial Intelligence (DCAI) [17,18] is an emerging concept that shifts our focus from advancing model design to enhancing data-centric processes, marking a significant elevation in the recognition of data’s crucial role in the field of artificial intelligence [19]. Historically, artificial intelligence has been primarily conceptualized from a model-driven standpoint, with priority given to optimizing model architecture based on static datasets to improve AI performance [20]. However, this approach often overlooks underlying data flaws, such as missing values, incorrect labels, and outliers [21]. This gives rise to a critical question: Do the numerous improvements in model performance truly reflect the model’s inherent potential, or are they merely the result of overfitting to the dataset?
To address these challenges, we introduce a DRL framework that integrates data augmentation [22,23,24] to enhance financial decision-making. Within this framework, we integrate the Cluster Embedding (CE) as a plug-and-play component into mainstream time series models. By dynamically clustering channels characterized by intrinsic similarities and leveraging cluster identities instead of individual channel identities, CE retains the unique properties of individual channels while capturing shared patterns within clusters, thereby achieving an optimal fusion of the advantages inherent in both Channel Independent (CI) and Channel Dependent (CD) models. When applied to financial reinforcement learning scenarios, this approach enhances the accuracy of stock price forecasting and enables effective capture of dynamic market trends, as validated in the study.
In our framework, we integrate DCAI into the DRL architecture. Specifically, the time series prediction network integrated with CE is capable of forecasting stock price movements over the next N days, endowing the RL agent with the capability to anticipate future price fluctuations. To capture both daily volatilities and global trends simultaneously, we integrate predicted data with realistic data and perform data augmentation operations to achieve data complementarity. Collectively, these data form the state input for DRL. This combined dataset provides the RL agent with a more granular perspective on market trends across multiple time dimensions, facilitating its ability to make timely and data-driven trading decisions. The primary contributions of this paper are as follows:
  • We propose a DRL framework for stock indices that integrates a time series prediction network with the CE to forecast future financial price data, significantly enhancing prediction accuracy. Meanwhile, during the training phase, CE learns and distills generalized prototype embeddings from the training data when confronted with entirely new, unseen time series or environments, which eliminates the need for additional labeled data or retraining. Instead, by calculating the matching probability between new samples and known clusters based on pre-trained prototype embeddings, it enables knowledge transfer through a cluster-aware feed-forward mechanism, thereby rapidly generating accurate predictions.
  • Within the proposed DCAI framework, data augmentation is employed to generate novel data by leveraging observed prices and predicted prices. Distinct from traditional OHLCV data, this augmented data helps RL agents perceive more macroscale patterns in stock prices.
  • Three DRL algorithms, namely Double Deep Q-Network (DDQN), Advantage Actor–Critic (A2C), and Proximal Policy Optimization (PPO), were utilized to evaluate the proposed framework. Experiments conducted on five widely adopted datasets, including the Dow Jones Industrial Average (DJI), NASDAQ 100, S&P 500 (SP500), Hang Seng Index (HSI), and Nikkei 225 (N225), demonstrate that the framework consistently outperforms various traditional methods and DRL-based approaches. These results highlight the framework’s potential in optimizing stock trading strategies and improving the reliability of algorithmic trading through the effective integration of DRL with predictions of future stock price movements.
This paper is structured as follows: Subsequent to the introduction, Section 2 (Related Work) provides a comprehensive review of the literature pertaining to time series forecasting models and reinforcement learning. Section 3 (Preliminaries) presents the preliminary knowledge of CI and CD models. Section 4 (Methods) elaborates on the proposed prediction model, dataset, and methodological framework, including the relevant mathematical formulations. Section 5 (Experiments) describes the experimental configurations and datasets. Section 6 (Results and Analysis) evaluates and compares the predictive performance of the models on the index datasets through tabular and graphical illustrations. Finally, Section 7 (Conclusions) summarizes the core research findings and offers concluding remarks on this study.

2. Related Work

2.1. Time Series Forecasting

In the field of time series forecasting, deep learning models are primarily constructed based on architectures such as Multi-Layer Perceptrons (MLP) [25], Transformers [26], Convolutional Neural Networks (CNNs) [27], and recurrent neural networks (RNNs) [28]. To delve deeper into the complex structures of time series, the Transformer-based Informer [29] captures the dependencies between time and dimensions by introducing a sparse attention mechanism and a two-stage attention layer. The CNN-based TimesNet [30] employs TimesBlock as a general backbone. TimesBlock converts 1D time series into 2D tensors and models intraperiod and interperiod changes through 2D kernels, enabling more effective capture of temporal variations. The MLP-based TSMixer [31] shares the time-mixing MLP across all features and the feature-mixing MLP across all time steps. This design allows TSMixer to automatically adapt to the utilization of temporal and cross variable information, achieving better generalization with a limited number of parameters. Time-LLM [32] reprograms input time series data into more natural text prototype representations, enhances input context through declarative prompts to guide large language models (LLMs) in effective cross-domain reasoning, and achieves superior performance in few-shot and zero-shot scenarios. The Diffusion-TS [33] is a time series generation framework based on denoising diffusion probabilistic models that achieves interpretable multivariate time series generation through designs including an encoder–decoder Transformer, disentangled temporal representations, and Fourier loss terms. Time series analysis has demonstrated significant application potential in the financial sector. In the future, it will be necessary to explore its profound integration with other technologies and optimization strategies across a broader range of scenarios.

2.2. Deep Reinforcement Learning in Trading

DRL is recognized as an effective method in quantitative finance [34,35], and gaining practical experience with it appeals to newcomers to the field. In financial markets, applying deep learning in trading algorithms has notably improved the accuracy of stock return predictions [36]. Complex neural networks can capture market patterns and trends that are hard to identify via traditional methods, and can process large datasets to extract key features [37]. Compared with traditional approaches, deep learning shows higher predictive stability and accuracy. By combining deep learning’s feature extraction with DRL’s strategy optimization, DRL better adapts to market complexity and nonlinear relationships, thus enhancing trading stability and profitability.
Recent research has revealed significant advancements and integration trends of reinforcement learning in solving more generalized algorithmic trading problems [38]. Among them, critic-only DRL relies on the Q-function [39,40]. Based on the Deep Q-Network (DQN) algorithm [41], it uses a simplified reward function with a delayed feedback mechanism to facilitate the training of the DRL model. This simplifies the decision-making process and provides a more flexible and adaptive reward mechanism, which in turn drives improvements in revenue management and risk control. Critic-only DRL is applicable to discrete action spaces but is limited in continuous states (e.g., stock prices) and multi-asset scenarios, and is sensitive to the reward function. For instance, the Deep Recurrent Q-Network (DRQN), a DQN based on recurrent neural networks [40], achieved an annual return of 22–23% on the S&P 500 ETF, yet it was only tested on a single stock and did not consider risk. A multi-factor stock trading strategy based on DQN [42] integrates a Multi-layer Bidirectional Gated Recurrent Unit (Multi-BiGRU) with a multi-head ProbSparse self-attention mechanism; when verified in the Chinese and U.S. stock markets, this strategy achieves better trading returns than both temporal and non-temporal models. The DeepScaler framework [43] incorporates a dueling-branch Q-network to address complex trading actions, while integrating diverse types of market information and a volatility-prediction auxiliary task through an encoder–decoder architecture for risk mitigation.
Actor-only DRL [44] can handle continuous action spaces. Compared with critic-only methods that rely on discrete action spaces, it is more flexible, adapts to more complex trading scenarios, and directly constructs a mapping from states to actions. It does not need to derive this mapping indirectly through value functions, thus enabling more direct optimization of trading decision strategies. For example, one study adopts a novel temporal discretization scheme [45] and proposes two policy gradient-based algorithms (actor-based and actor–critic-based), with parameterized function approximators constructed using CNNs and LSTMs. In backtests on natural gas futures spanning 2017–2022, the Sharpe ratio is 83% higher than that of the buy-and-hold strategy; risk tolerance is adjustable, and the actor-based algorithm significantly outperforms its actor–critic-based counterpart. Researchers have also adopted various modifications to enhance performance. Among these, LSTM-PPO integrates Long Short-Term Memory (LSTM) networks into PPO [46] to enhance its state representation capability in high-frequency stock trading. A third type is actor–critic DRL [47], which simultaneously trains the actor responsible for decision-making and the critic evaluating the quality of actions. It combines the value-function-based evaluation of critic-only methods with the direct policy optimization of actor-only methods. The SACRL-AF framework [48] rectifies the transitions in the replay buffer via an action feedback mechanism and utilizes the actually executed positions as labels for supervised learning.
Experimental results demonstrate that the proposed algorithms achieve state-of-the-art performance in terms of profitability. The MetaTrader framework [49] leverages a suite of diverse expert strategies to train multiple trading approaches, subsequently selecting the most suitable one for portfolio management based on realistic market dynamics.
These studies demonstrate the promising prospects of DRL in algorithmic trading. Traditional trading strategies often overprioritize short term gains and neglect risks, while DRL achieves a dynamic balance between returns and risks by designing long term cumulative reward functions that integrate risk constraints into optimization objectives. With the continuous advancement of DRL, it is expected to further drive innovation and transformation in financial markets.

3. Preliminaries

Diverse Strategies for Multivariate Time Series Forecasting

The CI models [50] adopt a strategy where each dimensional feature of the multivariate time series is separately input into the backbone network for independent processing. The prediction results of each dimension are then concatenated along the dimensional axis for integration. This approach treats different dimensions as independent variables, while the embedding layers and weights are shared across all dimensions. In models adopting the CI strategy, potential cross-channel interactions are ignored by modeling each channel individually, which can be represented by the function f^(i): ℝ^T → ℝ^H (i = 1, …, C), where T denotes the historical sequence length, H represents the prediction sequence length, and C stands for the number of feature channels. For example, the CI design is adopted in PatchTST [51], and its flowchart is shown in Figure 1 below.
The CD models [52] treat all channels as a single entity through the function f: ℝ^(C×T) → ℝ^(C×H). This strategy is critical in scenarios where channels are not just parallel data streams but also exhibit interconnections, such as in financial markets or traffic flows. The structure diagram of the model with channel dependence is shown in Figure 2 below:
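The difference between the two strategies is essentially a difference in the shape of the learned map. A minimal sketch with plain linear layers illustrates this (hypothetical dimensions; this is not the PatchTST or any specific CD architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, C = 96, 7, 5  # history length, prediction horizon, channels (hypothetical)

# Channel-Independent (CI): one shared map f^(i): R^T -> R^H, reused per channel.
W_ci = rng.standard_normal((T, H))

def ci_forecast(x):              # x: (C, T)
    return x @ W_ci              # identical weights for every channel -> (C, H)

# Channel-Dependent (CD): one joint map f: R^(CxT) -> R^(CxH), mixing channels.
W_cd = rng.standard_normal((C * T, C * H))

def cd_forecast(x):              # x: (C, T)
    return (x.reshape(-1) @ W_cd).reshape(C, H)  # channels interact in one map
```

Under CI, cross-channel interactions are ignored but the parameter count stays small; under CD, the joint map can model the interconnections emphasized above (as in financial markets), at the cost of roughly C² times more parameters in this linear case.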

4. Method

In this study, we model the stock trading scenario as a Markov Decision Process (MDP) [53] to apply RL. The RL agent acquires decision-making capabilities by interacting with an environment represented as an MDP. The MDP consists of several components: a state space S, an action space A, transition probabilities P, a reward function R, and a discount factor γ. At each time step, the agent observes a state S_t, selects an action A_t based on a policy π, and transitions to a new state S_{t+1} according to the transition probability P(S_{t+1} | S_t, A_t). The agent then receives a reward R_t for this action. The action-value function Q^π assesses the quality of taking a specific action in a given state, defined as the expected cumulative reward: Q^π(s, a) = E[ Σ_{i=0}^∞ γ^i r_{t+i} | s_t = s, a_t = a ]. The objective of the RL agent is to learn the optimal policy π* that maximizes Q^π. In algorithmic trading, financial reinforcement learning agents strive to identify the optimal trading strategy by exploring and evaluating the consequences of different actions in a dynamic trading environment. However, to ensure the robustness and reliability of the research findings, several key assumptions must be established:
  • The Efficient Market Hypothesis (EMH) [54] posits that financial markets comprise a large cohort of investors, whose core objective is to generate returns by harnessing available information. This competitive dynamics results in the mutual offsetting of individual actions, and when coupled with the constraints imposed by liquidity restrictions and short-selling borrowing costs, the impact of any single investor’s buy or sell orders on market prices becomes negligible. Consequently, individual transactions are deemed to exert no material influence on price trends.
  • In the theory of financial market microstructure, the idealized benchmark state typically assumes that the market exhibits sufficient liquidity, is free from liquidity restrictions and short-selling borrowing costs, and involves no order execution slippage, with orders executable exactly at the prevailing market quotes at the time of submission.
  • Under the framework of the Adaptive Market Hypothesis (AMH) [55], investors with learning capabilities and the ability to dynamically adjust strategies can achieve periodic excess returns during market panic. They leverage continuous cognitive iteration, optimized decision-making patterns and contrarian strategies like value anchored reverse operations to do so.
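The discounted cumulative reward inside Q^π(s, a) can be illustrated with a short numerical sketch (toy reward sequence and default discount factor chosen for illustration only, not tied to any dataset in this paper):

```python
# Sketch of the discounted return G_t = sum_i gamma^i * r_{t+i},
# the quantity whose expectation defines Q^pi(s, a).
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: g <- r + gamma * g
        g = r + gamma * g
    return g

# Toy example: G = 1 + 0.99 * 0 + 0.99**2 * 2 = 2.9602
example = discounted_return([1.0, 0.0, 2.0])
```

The backward accumulation avoids recomputing powers of γ and is the standard way Monte Carlo returns are evaluated when training such agents.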

4.1. Overview of the Proposed Framework

In particular, Figure 3 illustrates the framework proposed in this study. This framework integrates RL with a time series prediction network that takes historical OHLCV data as input to predict future OHLCV values. These predicted values are fused with realistic data to collectively form the state input for the RL model. Here, x_t^r denotes the real-scale data at time t, while x_t^p represents the predicted data. We perform data augmentation by concatenating the predicted and real-world data to generate fused data (denoted as x_t^i), which is then used as the input state s for the RL algorithm. By fusing x_t^r and x_t^p, the model can capture both fine-grained real-time market dynamics and short-term trends for the next trading day, thereby providing more precise informational support for trading decision-making. To better illustrate the overall framework, the time series prediction network is elaborated in Section 4.2, the data augmentation approach is detailed in Section 4.4.1, and the workflow of zero-shot prediction is described in Section 4.2.3.

4.2. Generate Exclusive Weights for Clusters

Among mainstream time series models, there are typically three core components: an optional normalization layer [56], a time-dimension module, and a feed-forward network layer for predicting future values. In the context of financial time series analysis, the inherent characteristics of financial data, including prominent random abruptness, non-stationarity, and the presence of multiple technical indicators and features, render the combination of core components in existing time series models inadequate for the complex requirements of financial scenarios. Accordingly, starting from the channel correlation of financial time series data, the i-th channel is fed into a multi-layer perceptron (MLP), which outputs the hidden embedding h_i of this channel. To enhance such correlations, K Cluster Embedding vectors c_k ∈ ℝ^d are initialized via a standard normal distribution. Subsequently, for each channel i and each cluster k, the normalized inner product between the Cluster Embedding c_k and the channel embedding h_i is calculated to obtain the raw correlation degree of channel i belonging to cluster k. After normalizing this raw correlation degree, the probability p_{i,k} is derived, and the mathematical formulation of this process is given as follows:
p_{i,k} = Normalize( c_k^⊤ h_i / (‖c_k‖ ‖h_i‖) ) ∈ [0, 1]    (1)
‖c_k‖ and ‖h_i‖ denote the L2 norms of the Cluster Embedding c_k and the channel embedding h_i, respectively. Normalization via the L2 norm ensures that the inner product of the Cluster Embedding and channel embedding is independent of the absolute length of the vectors, thereby guaranteeing that the initial correlation degree calculated based on Equation (1) accurately characterizes the intrinsic matching relationship between channels and clusters. After the normalization operation, the correlation degrees of each channel X_i with respect to all clusters are converted into probability values, ultimately forming a clustering probability matrix P with dimensions ℝ^(C×K). To transform the continuous probability matrix P into clustering results with discrete attribution significance, to convert discrete sampling into a differentiable approximation in continuous space, and to determine the attribution relationship between channels and clusters, the reparameterization trick is utilized to generate a cluster membership matrix M ∈ ℝ^(C×K), where M_{i,k} ∼ Bernoulli(p_{i,k}) indicates whether channel i belongs to cluster k. The closer the value of M_{i,k} is to 1, the more deterministically the channel is assigned to the corresponding cluster; conversely, a value closer to 0 means the channel does not belong to the cluster. The structure diagram of this module is shown in Figure 4 below:
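The probability computation and membership sampling above can be sketched in a few lines of NumPy (hypothetical dimensions; a softmax is assumed as the normalization, and the differentiable reparameterization used during training is replaced here by plain sampling for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, d = 5, 3, 16                 # channels, clusters, embedding dim (hypothetical)
h = rng.standard_normal((C, d))    # channel embeddings h_i
c = rng.standard_normal((K, d))    # cluster embeddings c_k

# Normalized inner product (cosine similarity), as in Eq. (1).
sim = (h @ c.T) / (np.linalg.norm(h, axis=1, keepdims=True)
                   * np.linalg.norm(c, axis=1))

# Normalize each channel's similarities into probabilities p_{i,k} in [0, 1].
p = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

# Sample the membership matrix M ~ Bernoulli(p); training would use a
# differentiable (reparameterized) relaxation instead of this hard sample.
M = (rng.random((C, K)) < p).astype(float)
```

Each row of `p` sums to one, so a channel's mass is distributed across clusters rather than hard-assigned before sampling.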

4.2.1. Cross-Attention for Prototype Embedding Generation

In real financial markets, predicting future values is often challenging due to data privacy concerns or the lack of sufficient training samples for newly listed stocks to learn their price fluctuation patterns. To address these issues, the proposed method combines the transposed membership matrix M^⊤ ∈ ℝ^(K×C) with cross-attention, and the corresponding formula is given as follows:
Ê = Attention(Q, K, V) = Normalize( exp( (W_Q · C_init)(W_K · H)^⊤ / √d_k ) ⊙ M^⊤ ) · (W_V · H)    (2)
Herein, C_init is formed by stacking the c_k of the K clusters into a matrix, and H is formed by stacking the h_i of the C channels into a matrix. The cross-attention computes the updated prototype embedding Ê and the new Cluster Embedding matrix C_init, which are used for the CE probability calculation in the next iteration. The core design objective of the cross-attention mechanism is to achieve precise focusing of intra-cluster information and to optimize the quality of prototype learning. As the core foundation of zero-shot prediction in this study, the prototype embedding Ê possesses a representational capability that directly influences prediction performance and the reliability of subsequent reinforcement learning decisions. Through dynamic interactions between the query projection W_Q and the key projection W_K, the cross-attention mechanism adaptively aggregates common features across intra-cluster channels, effectively filters redundant information, and enables the generated prototype embedding to accurately capture the core patterns of intra-cluster time series. Among traditional feature aggregation methods, simple average pooling directly smooths out key interactive features between channels, resulting in over-generalized prototype representations, while the self-attention mechanism tends to focus excessively on global correlations, conflating inter-cluster difference information and compromising the independence of intra-cluster patterns. To make the clustering results more consistent with the true correlations among channels (i.e., to align the clustering results with the real feature correlations between channels), this paper incorporates the loss function from [57] as follows:
L_C = Tr( (I − M M^⊤) S ) − Tr( M^⊤ S M )    (3)
where Tr denotes the trace operator. Specifically, Tr(M^⊤ S M) is used to maximize the intra-cluster channel similarity, while Tr((I − M M^⊤) S) is designed to improve the separability between different clusters, thereby further alleviating the overlap and ambiguity issues in cluster assignments. L_C can capture meaningful time series prototypes without relying on external labels or annotations. The channel similarity matrix S has dimensions ℝ^(C×C) (where C denotes the number of channels), and its element S_{i,j} = Sim(X_i, X_j) represents the intrinsic similarity between the i-th and j-th channels. The identity matrix I has dimensions ℝ^(C×C), with 1s on the main diagonal and 0s elsewhere, and is used to compute the inter-cluster separation term.
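The clustering loss can be evaluated numerically as follows (a toy sketch: cosine similarity is assumed for Sim, the membership matrix is a random hard sample, and the sign convention follows the description above):

```python
import numpy as np

rng = np.random.default_rng(1)
C, K = 5, 3                                     # channels, clusters (hypothetical)
M = (rng.random((C, K)) < 0.5).astype(float)    # toy hard membership matrix
X = rng.standard_normal((C, 24))                # toy channel series

# Channel similarity matrix S with S_ij = Sim(X_i, X_j); cosine similarity here.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T

# L_C = Tr((I - M M^T) S) - Tr(M^T S M): the second term rewards intra-cluster
# similarity, the first penalizes similarity left outside the clusters.
I = np.eye(C)
L_C = np.trace((I - M @ M.T) @ S) - np.trace(M.T @ S @ M)
```

Minimizing L_C therefore pushes similar channels into the same cluster while keeping distinct clusters apart, matching the separation and compactness roles described above.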

4.2.2. Feed-Forward Layer

Financial time series contain a great deal of meaningless noise, and noise from different channels can interfere with the model’s learning process. We configure an exclusive feed-forward network for each cluster, where clusters are used in place of channels. Each exclusive feed-forward network is parameterized by a single linear layer. Let h_{θ_k}(·) denote the linear layer corresponding to the k-th cluster with weight θ_k, and z_i represent the output representation of the Transformer encoder for the i-th univariate sequence. The predicted value of the i-th channel is obtained by performing a weighted average over the outputs of the feed-forward networks of all clusters: Y_i = Σ_k p_{i,k} h_{θ_k}(z_i). By stacking the predicted results Y_i of all channels along the channel dimension, the final multi-channel prediction matrix Ŷ is obtained. Referring to [11], the model presets a future prediction horizon of H = 7. The prediction matrix Ŷ is aligned in dimension with the concurrent ground-truth matrix, and the two matrices are then concatenated. The integrated input data is used directly to train the market embedding module and action decision-making module of the RL agent. Algorithm 1 provides the pseudocode for the time series model.
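The cluster-aware projection Y_i = Σ_k p_{i,k} h_{θ_k}(z_i) can be sketched as follows (hypothetical dimensions; each cluster head is a bare weight matrix θ_k, and by linearity the weighted average of head outputs equals applying the probability-averaged weight θ_i = Σ_k p_{i,k} θ_k directly):

```python
import numpy as np

rng = np.random.default_rng(2)
C, K, d, H = 5, 3, 16, 7                 # channels, clusters, hidden dim, horizon
p = rng.random((C, K))
p /= p.sum(axis=1, keepdims=True)        # p_{i,k}: channel-to-cluster probabilities
Z = rng.standard_normal((C, d))          # encoder outputs z_i
theta = rng.standard_normal((K, d, H))   # one linear head theta_k per cluster

# Y_i = sum_k p_{i,k} * h_{theta_k}(z_i): weighted average of cluster heads.
Y = np.stack([sum(p[i, k] * (Z[i] @ theta[k]) for k in range(K))
              for i in range(C)])        # -> (C, H)
```

This equivalence is what allows the zero-shot phase described later to build a single exclusive weight per new channel instead of running every cluster head.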

4.2.3. Transfer Prediction in Financial Markets

During the inference phase of zero-shot transfer, the pre-trained prototype embeddings are frozen and the cross-attention mechanism is disabled. When dealing with newly listed stocks, the limited historical price sequence of the target stock is converted into a channel embedding h_new, the update mechanism of the prototype clusters is turned off, and the association probability p_{new,k} between h_new and the pre-trained prototype clusters Ê is directly calculated using Equation (2); the cluster weights {θ_1, …, θ_K} are then averaged with weights p_{new,k} to obtain the exclusive prediction weight θ_new = Σ_k p_{new,k} θ_k for the target stock, and finally the future price prediction is generated through the cluster-aware feed-forward network. This approach realizes knowledge transfer by leveraging pre-trained prototype clusters, which not only avoids the problems of being unable to train on private data and of insufficient data for newly listed stocks, but also accurately matches the fluctuation pattern of the target stock to complete the prediction efficiently. Figure 5 below illustrates the flowchart for zero-shot sample prediction, and Algorithm 2 presents the pseudocode for the inference phase.
Algorithm 1 The Process of Future Value Prediction
Input: Historical financial time series X ∈ ℝ^(T×C)
Output: Future time series Y ∈ ℝ^(H×C)
Initialize the weights of the linear layers and K cluster embeddings c_k ∈ ℝ^d for k = 1, …, K
X ← Normalize(X)
h_i ← MLP(X_i)
Compute clustering probability matrix: p_{i,k} ← Normalize(c_k^⊤ h_i / (‖c_k‖ ‖h_i‖)) ∈ [0, 1]
Sample clustering membership matrix: M ∼ Bernoulli(P)
Update cluster embeddings via cross-attention: Ê ← Normalize(exp((W_Q · C_init)(W_K · H)^⊤ / √d_k) ⊙ M^⊤) · (W_V · H)
for channel i in {1, 2, …, C} do
    Weighted averaging and projection: Y_i ← h_{θ_i}(Z_i), where θ_i = Σ_k p_{i,k} θ_k
end for
Algorithm 2 The Prediction Process of Unseen Samples via Pre-trained Models
Input: Historical financial time series X ∈ R^(T×C); pre-trained model F
Output: Future time series Y ∈ R^(H×C)
Load the K cluster embeddings c_k ∈ R^d and the weights of the K linear layers from F
X ← Normalize(X)
h_i ← MLP(X_i)
Compute clustering probability matrix: p_{i,k} ← Normalize(c_k^T h_i / (‖c_k‖ ‖h_i‖)) ∈ [0, 1]
Sample clustering membership matrix: M ← Bernoulli(P)
for channel i in {1, 2, …, C} do
    Weighted averaging and projection: Y_i ← h_{θ_i}(Z_i), where θ_i = ∑_k p_{i,k} θ_k
end for
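The zero-shot inference of Algorithm 2 can be illustrated with a minimal sketch: the prototypes stay frozen, the association probabilities p_{new,k} are computed from cosine similarity, and the prediction weight is the probability-weighted average of the pre-trained linear heads. The `encode` callable stands in for the pre-trained MLP encoder and is an assumption, as is the similarity-to-probability mapping.

```python
import numpy as np

def zero_shot_predict(x_new, prototypes, heads, encode):
    """Predict for an unseen channel from frozen prototypes (M, d) and the
    M pre-trained linear heads (each of shape (H, T)); no prototype update."""
    h = encode(x_new)                                    # channel embedding h_new
    sim = prototypes @ h / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(h) + 1e-8
    )
    p = (sim + 1.0) / 2.0
    p = p / p.sum()                                      # association probabilities p_new,k
    theta_new = np.tensordot(p, np.stack(heads), axes=1) # theta_new = sum_k p_new,k theta_k
    return theta_new @ x_new                             # future price prediction
```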

4.3. Data-Centric Artificial Intelligence

DCAI is an emerging concept that shifts the focus from improving model design to pursuing excellence in data quality, marking a significant elevation in the recognition of data’s critical importance in the field of artificial intelligence. In the past, artificial intelligence was primarily viewed from a model-driven perspective, where the core lay in optimizing model design based on fixed datasets to enhance AI performance. However, this approach often overlooks underlying data flaws, such as missing values, incorrect labels, and outliers. This raises a key question: Do many of the improvements in model performance truly reflect the model’s genuine potential, or are they merely the result of overfitting to the dataset?
This study adopts a data augmentation method based on real data and predicted data. This step is rooted in the core idea of DCAI, shifting the focus from model design to the optimization of data quality. By fusing daily-scale observed data with predicted data, it responds, on the one hand, to the concept of “systematic data engineering in AI development” emphasized by DCAI. Specifically, through targeted data augmentation, it improves data quality to better match the actual characteristics of the task, ensuring that subsequent model training is built on a more reliable and comprehensive data foundation. On the other hand, it addresses the key issue arising from the limitations of traditional methods: by improving data quality at the source, the enhanced model performance achieved with augmented data is more likely to reflect the model’s true generalization ability rather than overfitting to flawed original data. This lays a solid data foundation for the success of the entire AI task.

4.4. Reinforcement Learning Framework

4.4.1. State

The algorithmic trading process is modeled as a Markov Decision Process (MDP). The state is generated by fusing real data and predicted data, which complement each other. At the current time step t, the real data are denoted as s_t^r = {O_n^r, H_n^r, L_n^r, C_n^r, V_n^r}_t, and the predicted data as s_t^p = {O_n^p, H_n^p, L_n^p, C_n^p, V_n^p}_t. These two types of data are combined through data augmentation to form s_t^i = {s_t^r, s_t^p}, which constitutes the state input of the MDP.
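The state construction described above amounts to concatenating the observed and predicted OHLCV blocks; a minimal sketch (the function name and array layout are our assumptions):

```python
import numpy as np

def build_state(real_ohlcv, predicted_ohlcv):
    """Fuse observed OHLCV s_t^r (n x 5) with predicted OHLCV s_t^p (m x 5)
    into a flat state vector s_t^i for the RL agent."""
    assert real_ohlcv.shape[1] == 5 and predicted_ohlcv.shape[1] == 5
    return np.concatenate([real_ohlcv, predicted_ohlcv], axis=0).ravel()
```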

4.4.2. Action

In the MDP, the agent has three possible actions in algorithmic trading: buy, hold, and sell. The corresponding formula is as follows:
a_t = 1, if π(a_t | s_t^i) = buy; a_t = 0, if π(a_t | s_t^i) = hold; a_t = −1, if π(a_t | s_t^i) = sell.
Herein, π(a_t | s_t^i) represents the RL policy calculated by the RL agent based on the current state s_t^i at time step t. This policy is the core of the agent’s action selection. In the algorithmic trading framework, there may be discrepancies between the actual trading actions executed by the agent and the trading signals generated by the policy network. Such discrepancies stem from adherence to real-world trading rules. For instance, if the account is already in a long position and the policy network generates a buy signal, the actual action will be adjusted to hold; if the account is in a long position and receives a sell signal, the actual action will be “close the position”. The purpose of this adjustment is to ensure that trading behaviors comply with position-limit rules in real-world markets and to avoid invalid or non-compliant operations. Table 1 presents the actual trading operations in the algorithmic trading framework.
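The position-dependent adjustment of policy signals can be sketched as a small lookup. The long-position rules follow the text above (long + buy → hold, long + sell → close); extending them symmetrically to short positions is our assumption, since Table 1 is not reproduced here.

```python
def adjust_action(signal, position):
    """Map the policy signal (+1 buy, 0 hold, -1 sell) to an executable
    action under one-position-at-a-time trading rules."""
    if position == 1 and signal == 1:    # already long, buy signal -> hold
        return "hold"
    if position == 1 and signal == -1:   # long + sell signal -> close the position
        return "close"
    if position == -1 and signal == -1:  # already short, sell signal -> hold
        return "hold"
    if position == -1 and signal == 1:   # short + buy signal -> close the position
        return "close"
    return {1: "buy", 0: "hold", -1: "sell"}[signal]  # flat account: execute as-is
```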

4.4.3. Reward

In RL agents, the reward function is the core driver of the agent’s decision-making and strategy iteration. The agent continuously learns with the goal of maximizing cumulative rewards to adjust its decision-making patterns. Within the RL trading framework, the reward function reflects the profitability of the trading strategy. We adopt a profit-based reward function similar to the literature [58] that measures returns under different positions and incorporates an adjustable time window to capture trends, ultimately aiming to maximize profits. The final position state of the agent at time τ is denoted by H_τ, where H_τ = 1 indicates a long position, H_τ = −1 indicates a short position, and H_τ = 0 indicates a flat position. The immediate reward R_τ at time τ is then defined as follows:
R_τ = H_τ · R_max, if R_max > 0 or R_max + R_min > 0; R_τ = H_τ · R_min, if R_min < 0 or R_max + R_min < 0.
Let the single-step return rates over the next k time steps be denoted as Δp_{τ+1}, Δp_{τ+2}, …, Δp_{τ+k}, where Δp_{τ+i} = (P_{τ+i} − P_τ)/P_τ × 100 and P_{τ+i} represents the asset price at time τ + i. The positive return ratio R_max is defined as follows:
R_max = max{Δp_{τ+1}, Δp_{τ+2}, …, Δp_{τ+k}}, if Δp_{τ+i} > 0 for some i; R_max = 0, otherwise.
Similarly, the negative return ratio R_min is defined as follows:
R_min = min{Δp_{τ+1}, Δp_{τ+2}, …, Δp_{τ+k}}, if Δp_{τ+i} < 0 for some i; R_min = 0, otherwise.
In the formulas, R_max = 0 indicates that there are no positive returns over the next k time steps, with only flat or declining price movements observed. In contrast, R_min = 0 means there are no negative returns over the same horizon, with only flat or rising price trends. Under such circumstances, the strategy is to maintain the current position: since there are no clear trend signals, this avoids unnecessary transactions that would incur additional costs. Given that this study employs an offline RL framework, the reward is observed instantaneously upon the execution of an action at time t. The three-day-ahead price changes incorporated into the reward calculation are fixed empirical values derived from historical datasets. Functioning as an ex-post feedback signal for the action executed at time t during offline training, the reward is leveraged exclusively to optimize the trading strategy, thereby precluding any methodological flaws related to look-ahead bias. Both scenarios help the reward function focus on valid market signals related to the current position by filtering out ineffective trends, thereby enhancing the relevance of the agent’s decision-making.
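The reward computation can be sketched as follows; using the sign of R_max + R_min alone to pick the branch is a simplification of the piecewise conditions above and is our assumption.

```python
def reward(prices, tau, k, position):
    """Immediate reward from the k-step-ahead min/max return ratios.
    position H_tau: +1 long, -1 short, 0 flat."""
    # single-step return rates over the next k steps, in percent
    deltas = [(prices[tau + i] - prices[tau]) / prices[tau] * 100.0
              for i in range(1, k + 1)]
    pos = [d for d in deltas if d > 0]
    neg = [d for d in deltas if d < 0]
    r_max = max(pos) if pos else 0.0   # positive return ratio R_max
    r_min = min(neg) if neg else 0.0   # negative return ratio R_min
    if r_max + r_min > 0:
        return position * r_max
    return position * r_min
```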

4.5. Proximal Policy Optimization

The proposed framework can incorporate various RL algorithms, including value-based methods, policy-based methods, and actor–critic methods. This section uses a classic policy-based method, PPO, to explain the basic logic of this framework. PPO [59] is a policy optimization algorithm proposed by OpenAI in 2017, which overcomes the computational complexity of traditional policy gradient methods. Like other policy-based algorithms, PPO is essentially an improvement of the Policy Gradient method. Its core goal is to directly optimize the agent’s policy function (rather than indirectly optimizing it through a value function), and its complete objective function is as follows:
L_PPO(θ, ϕ) = L_CLIP(θ) − c_1 · L_V(ϕ) + c_2 · H(π_θ)
Herein, c_1 and c_2 are weight hyperparameters. H(π_θ) = −E[log π_θ(a|s)] represents the entropy of the policy; increasing entropy helps prevent the policy from converging prematurely to a local optimum. Meanwhile, L_V(ϕ) represents the value function loss, typically defined using the MSE as L_V(ϕ) = E[(V_ϕ(s_t) − Ĝ_t)²]. In this equation, V_ϕ(s_t) is the value estimate of state s_t generated by the Critic network with parameters ϕ, and Ĝ_t corresponds to the actual cumulative reward target for state s_t.
In the PPO framework, L CLIP ( θ ) corresponds to the clipped surrogate objective, which is designed to constrain the policy update magnitude. Its mathematical formulation is given by the following:
L_CLIP(θ) = E[min(r_t(θ) · A_t, clip(r_t(θ), 1 − ϵ, 1 + ϵ) · A_t)]
In this equation, r t ( θ ) = π θ ( a t | s t ) π θ old ( a t | s t ) is the ratio term, which quantifies the discrepancy between the new policy π θ and the old policy π θ old . A t represents the excess return of taking action a t relative to the average level. Meanwhile, ϵ functions to constrain the value range of the ratio term r t ( θ ) .
In summary, PPO ensures the stability of the policy through the combination of probability ratios and advantage functions, coupled with a clipping mechanism. This prevents the deterioration of trading performance caused by excessively aggressive policy adjustments, thereby improving the overall performance of the trading strategy.
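The combined objective can be sketched in NumPy; batch means replace the expectations, the entropy values are assumed precomputed, and the result is returned with flipped sign because optimizers minimize. The function name and defaults are illustrative.

```python
import numpy as np

def ppo_loss(ratio, advantage, value, value_target, entropy,
             eps=0.2, c1=0.5, c2=0.01):
    """Negative of the full PPO objective L_CLIP - c1*L_V + c2*H,
    averaged over a batch of samples."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)            # constrain the ratio term
    l_clip = np.mean(np.minimum(ratio * advantage, clipped * advantage))
    l_v = np.mean((value - value_target) ** 2)            # value function MSE loss
    return -(l_clip - c1 * l_v + c2 * np.mean(entropy))
```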

CE-PPO Training

The flow diagram of the CE-PPO training process is presented in Figure 3. Initially, a time series prediction network is trained on the dataset spanning 1 January 2007 to 31 December 2018. The predictions of this network depend exclusively on historical OHLCV data up to and including the current time step t. Via a sliding-window sampling mechanism, each prediction task within a sampling window leverages only the historical observations confined to that window. Once the time series prediction network completes training, it is transferred to the PPO algorithm.
At the initial stage, raw data s_t^r are input into the prediction network to generate the predicted values s_t^p. Data augmentation is then performed by integrating these predicted values with the observed data. Within the state space, all predicted data are derived from sliding-window predictions based on historical data, and concatenation with the raw data occurs exclusively within the same time window, yielding the complete state s_t^i. The PPO agent samples the action a_t by observing the state s_t^i. According to the Markov Decision Process (MDP), this action leads the environment to a new state s_{t+1}^r, from which a new complete state s_{t+1}^i is generated following the aforementioned process. The PPO agent leverages the transition tuple (s_t^i, a_t, r_t, s_{t+1}^i) and relies on the clipped objective function to stabilize parameter updates. The flowchart of this framework is illustrated in Figure 6.
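The training interaction described above can be sketched as a generic rollout loop; `env`, `predictor`, and `agent` are assumed interfaces for illustration, not the authors' code.

```python
def train_episode(env, predictor, agent):
    """One CE-PPO rollout: augment the observed state with predictions,
    act, and store transitions for the clipped-objective update."""
    transitions = []
    s_real = env.reset()
    done = False
    while not done:
        s_pred = predictor(s_real)          # sliding-window forecast s_t^p
        state = s_real + s_pred             # data augmentation -> complete state s_t^i
        action = agent.act(state)
        s_real_next, rew, done = env.step(action)
        next_state = s_real_next + predictor(s_real_next)
        transitions.append((state, action, rew, next_state))
        s_real = s_real_next
    agent.update(transitions)               # PPO update with the clipped objective
    return transitions
```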

5. Experiments

5.1. Datasets and Experimental Setup

To verify the model’s robustness, we selected index data from the world’s top three economies as research objects. For the Chinese market, we chose the HSI Index, which comprises the largest-cap and most liquid companies in the Hong Kong stock market. Covering multiple sectors including finance, real estate, public utilities, and industry and commerce, it provides the model with widely representative data support for the Chinese market. For the Japanese market, we selected the N225, known as the “barometer” of the Japanese stock market; this index intuitively reflects the overall operation of the Japanese economy. For the U.S. market, we included the SP500, DJI, and NASDAQ 100. These three major indices jointly form the core indicators of the U.S. stock market: the S&P 500 covers large-cap stocks across all industries, the Dow Jones focuses on traditional leading enterprises, and the NASDAQ centers on technology and growth-oriented companies. The experiments were conducted on a machine with an Intel Core i7-13700H CPU and an NVIDIA GeForce RTX 4060 GPU (8 GB of VRAM). The software stack included Python 3.9, PyTorch 1.13.0, and CUDA 11.8, running on Windows 10. The details regarding the dataset and experimental configurations are provided in Appendix A.

5.2. Evaluation Metrics

To verify the overall effectiveness of the framework in this study, it is necessary to design a differentiated evaluation system based on the functional characteristics of the core components. Given that the core goal of the time series prediction network is to improve prediction accuracy, while the core goal of the RL trading algorithm is to optimize trading returns and risk control capabilities, we have set exclusive evaluation metrics for these two types of modules, respectively. Among them, the evaluation metrics used to measure the prediction performance of the time series prediction network are as follows:
MSE = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)²,
MAE = (1/n) ∑_{i=1}^{n} |y_i − ŷ_i|,
RMSE = √((1/n) ∑_{i=1}^{n} (y_i − ŷ_i)²),
MAPE = (100%/n) ∑_{i=1}^{n} |(y_i − ŷ_i)/y_i|.
where y i denotes the true value, y ^ i represents the predicted value, and n stands for the number of samples. The evaluation metrics for the RL trading algorithm are as follows:
AR = (1 + R_total)^(T_y / t) − 1,
CR = ∏_{i=1}^{n} (1 + r_i) − 1,
MDD = max_k (Peak_k − Trough_k) / Peak_k,
SR = (μ − r_f)/σ · √T_y.
where R_total denotes the total return over the backtesting period, t represents the actual number of trading days in the backtest, and T_y stands for the annual trading-day base. r_i is the return on the i-th trading day; Peak_k refers to the peak net asset value (NAV) before the k-th trading day; Trough_k is the trough NAV on the k-th trading day; μ denotes the mean daily return; r_f represents the risk-free rate; and σ stands for the standard deviation of daily returns.
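Both groups of metrics follow directly from their definitions; a minimal sketch assuming daily returns, a 252-day annual base, and population standard deviation (those conventions are our assumptions).

```python
import numpy as np

def forecast_metrics(y, y_hat):
    """MSE, MAE, RMSE, and MAPE (in percent) between truth and prediction."""
    err = y - y_hat
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    mape = 100.0 * np.mean(np.abs(err / y))
    return mse, mae, rmse, mape

def trading_metrics(daily_returns, risk_free=0.0, trading_days=252):
    """CR, AR, MDD, and SR computed from a series of daily returns r_i."""
    r = np.asarray(daily_returns, dtype=float)
    nav = np.cumprod(1.0 + r)                    # net asset value path
    cr = nav[-1] - 1.0                           # cumulative return
    ar = (1.0 + cr) ** (trading_days / len(r)) - 1.0
    peak = np.maximum.accumulate(nav)
    mdd = np.max((peak - nav) / peak)            # maximum drawdown
    sr = (r.mean() - risk_free) / (r.std() + 1e-12) * np.sqrt(trading_days)
    return cr, ar, mdd, sr
```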

5.3. Baseline Methods

To better evaluate the effectiveness of the framework, we introduce several classic algorithmic trading methods, machine learning-based trading algorithms, and cutting-edge financial reinforcement learning algorithms as baseline methods. The classical strategies include Buy-and-Hold (B&H), Sell-and-Hold (S&H), and Mean Reversion strategy based on moving averages. The machine learning algorithms encompass Support Vector Machine (SVM) and Random Forest (RF). The state-of-the-art financial reinforcement learning algorithms include the Fire algorithm (DQN-HER) and TDQN.
The Buy-and-Hold (B&H) strategy refers to an approach where investors purchase an asset and hold it throughout the entire investment period without responding to price fluctuations. In contrast, the Sell-and-Hold (S&H) strategy requires investors to short-sell an asset at the initial stage and maintain this position until the end of the period. The Mean Reversion (MR) strategy is based on the assumption that asset prices will revert to their historical average. In this study, a 10-day Simple Moving Average (SMA) is used to identify potential reversion points. In the field of machine learning, the SVM model [60] conducts analytical predictions based on real-time data generated after the opening of the trading day, and then outputs clear trading signals, namely buy, sell, or hold recommendations, providing a quantitative basis for decision-making. The Random Forest model [61] likewise conducts multi-dimensional analytical predictions using real-time data generated after the daily market opening. Through an integrated voting mechanism of multiple decision trees, it generates comprehensive trading signals, thereby providing reliable quantitative references for decision-making. The TDQN algorithm [62] aims to optimize trading positions through a five-layer fully connected Deep Q-Network. The DQN-HER model [63] explores the application of multi-objective DRL in stock and cryptocurrency trading. This model incorporates dynamic reward functions and discount factors to enhance learning performance.
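The 10-day SMA mean-reversion baseline can be sketched as follows; the signal convention (+1 buy below the SMA, −1 sell above it) is our assumption.

```python
import numpy as np

def mean_reversion_signals(close, window=10):
    """Mean-reversion signals from a simple moving average: buy when the
    price dips below the SMA, sell when it rises above it."""
    close = np.asarray(close, dtype=float)
    signals = np.zeros(len(close), dtype=int)
    for t in range(window, len(close)):
        sma = close[t - window:t].mean()   # SMA over the previous `window` days
        if close[t] < sma:
            signals[t] = 1                 # expect reversion upward -> buy
        elif close[t] > sma:
            signals[t] = -1                # expect reversion downward -> sell
    return signals
```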

6. Results and Analysis

6.1. Time Series Forecasting Results

As noted earlier, the proposed method inputs stock price predictions generated by a time series forecasting network into the RL agent. This equips the agent with potential future market trends to reference when making trading decisions, thereby driving better trading results. Thus, to enable the RL agent to make more informed decisions, a rigorous evaluation of existing state-of-the-art time series forecasting networks is critical.
We apply CE to time series forecasting models to enhance their performance. In this study, four mainstream time series models from recent years are selected as base models, namely TSMixer, DLinear, PatchTST, and TimesNet. These models cover three mainstream paradigms: linear models, Transformer-based models, and convolutional models. To ensure the fairness of evaluation, the optimal experimental configurations provided in the official code are adopted. Among all experimental models, TimesNet and PatchTST combined with CE show excellent predictive performance. As a convolutional model with a channel-dependent (CD) strategy, TimesNet captures the multi-periodicity of time series, with CE optimizing the handling of complex channel correlations via dynamic clustering. As a channel-independent (CI) Transformer, PatchTST excels at capturing long-range dependencies through patching and self-attention; CE adds cross-channel correlation modeling without impairing these strengths, compensating for overlooked cross-dimensional correlations. The results show that on the N225 dataset, CE reduces the MSE of the four models by an average of 32.57% and the MAE by an average of 54.78%, verifying that CE aggregates cross-channel information, preserving local trends while capturing macro-level patterns for optimal performance. The financial time series prediction results are shown in Table 2; the best results are underlined, and the second-best results are bolded.
Specifically, Figure 7 explicitly illustrates the comparative trends of predicted values versus ground truth across time steps for the four CE-integrated models (DLinear, PatchTST, TSMixer, and TimesNet) in the forecasting task. The predictions generated by DLinear (+CE), PatchTST (+CE), and TimesNet (+CE) exhibit relatively high congruence with the ground truth, particularly in trend capture, as they effectively track the oscillatory movements of real-world data. Despite the incorporation of the CE module, TSMixer demonstrates suboptimal performance in financial dataset forecasting. As a linear model, TSMixer predominantly relies on multi-layer perceptrons (MLPs) for independent processing of channel and temporal dimensions. While this architectural design enhances computational efficiency, it has inherent limitations in capturing intricate nonlinear correlations within financial data, such as the intertemporal linkage of price fluctuations and the implicit influence of trading volume on price dynamics. Notably, financial market trends are typically shaped by the interplay of multifaceted factors, including macroeconomic indicators, policy shifts, and market sentiment, thereby manifesting highly non-linear characteristics. Consequently, TSMixer’s linear modeling paradigm struggles to accommodate such complex patterns. The prediction results for the other four stock indices are included in Appendix B.

6.2. Validation of Transfer Prediction in Financial Markets

Existing time series models often overfit specific datasets, resulting in poor generalization to unseen data. In contrast, CE captures cluster-specific knowledge using learned prototypes, enabling meaningful comparisons between unseen time series and pre-trained knowledge for accurate zero-shot prediction. We conducted zero-shot predictions on three U.S. stock market indices: SP500, DJI, and NASDAQ. Table 3 presents the results of MSE, MAE, MAPE and RMSE on the test datasets; the best results are underlined below, while the second-best results are bolded.
TimesNet achieves the best performance in most cross-market transfer tasks, as its MAE, MSE, RMSE, and MAPE are significantly lower than those of other models. For instance, in the “NASDAQ→DJI” task, the MAE of TimesNet is 0.045, which is 65.0% lower than that of PatchTST (0.128) and 84.7% lower than that of TSMixer (0.294). In the “SP500→NASDAQ” task, its MAPE is 0.018, accounting for only 1.4% of that of TSMixer (1.243). These results indicate that the temporal decomposition module of TimesNet can effectively extract the common temporal features of the U.S. stock market and maintain adaptability to financial market data even in zero-shot scenarios. In contrast, TSMixer shows significant performance fluctuations, with high errors in tasks such as “DJI→SP500” (MAE = 0.890). As a linear model, TSMixer primarily relies on multi-layer perceptrons (MLPs) for independent processing of channel and temporal dimensions. This design struggles to capture complex non-linear correlations in financial data, and the lack of fine-tuning with target market data in zero-shot scenarios amplifies the interference from noise. Transfer Prediction in Financial Markets is shown in Figure 8.

6.3. Ablation Studies

Figure 9 presents an ablation study on the clustering ratio, which is defined as the ratio of the number of clusters to the number of channels. A ratio of 0.0 indicates that all channels are assigned to a single cluster. Taking the N225 dataset as an example, when the clustering ratio increases from 0.0 to 0.25, the MAE of the TimesNet model decreases from 0.042 to 0.039 (7.1% reduction). When the ratio exceeds 0.75, the MAE rises rapidly; at a ratio of 1.0, it increases by 15.4% compared to that at a ratio of 0.25. This fluctuation is similarly significant in the SP500 dataset: within the clustering ratio range of 0.25–0.75, the average MAE of all models decreases by 9.3%, while beyond this range, it increases by an average of 12.7%.
It is observed that as the clustering ratio increases, the MAE loss first decreases slightly and then increases. When the clustering ratio is in the range of 0.25 to 0.75, the time series models integrated with the CE achieve optimal performance. This is because clustering groups similar channels to capture common correlations. Meanwhile, retaining sufficient clusters helps distinguish the unique patterns of different channels, thereby adapting to the complex similarity distribution among channels in the data. Notably, among these four base models, as a convolutional model, TimesNet can accurately extract periodic patterns at different time scales in financial data through its designed multi-period modeling module. As a Transformer-based model, PatchTST can efficiently capture long-range correlations in financial data (such as the sustained impact of a policy release on the market and the transmission effects across asset classes) via the patching mechanism and self-attention mechanism. TimesNet and PatchTST consistently benefit from CE and maintain the best predictive performance regardless of the number of clusters. This addresses the issue that traditional models struggle to handle the non-linear impact of historical events on current prices, such as the medium- and long-term trends of stock prices after the release of annual report data.

6.4. Implementation of Reinforcement Learning Agents in Trading

To more effectively evaluate the efficacy of the proposed method, this study conducts a comparative analysis between the test results of the agent and the baseline methods mentioned earlier. Specifically, the RL agent is trained over the period from 1 January 2007 to 31 December 2018, with the testing phase spanning from 1 January 2019 to 1 June 2021. The pre-trained time series prediction network CE not only possesses continuous learning capabilities but also demonstrates improved prediction accuracy. Building upon this, this paper verifies whether the performance of the RL agent is enhanced by utilizing states generated from both predicted prices and observed prices. Meanwhile, to validate whether the proposed method can achieve performance improvements across different RL algorithms, three classical algorithms, namely DDQN, A2C, and PPO, are integrated into the proposed framework, referred to, respectively, as CE-DDQN, CE-A2C, and CE-PPO. Detailed experimental results are presented in Table 4.
As shown in Table 4, the proposed method outperforms all baseline methods in terms of cumulative return (CR), Sharpe ratio (SR), annualized return (AR), and Maximum Drawdown (MDD). Across the five datasets involved in the experiments, the cumulative returns of CE-DDQN, CE-A2C, and CE-PPO all exceed 100%, whereas the highest cumulative return among the baseline methods is 155.60%. Among these proposed methods, CE-DDQN, which is integrated with DDQN, achieves the best performance, yielding a cumulative return of 1334.781% and an annualized return of 137.94% on the NASDAQ dataset. On the other hand, CE-PPO, combined with PPO, performs relatively poorly, with a cumulative return of 104.964% on the DJI dataset.
From the analysis of trading results, it can be observed that the core characteristics of trading decisions lie in discrete actions coupled with short-term reward dependence. As a value-based algorithm, DDQN is inherently suitable for discrete action spaces: it can directly calculate explicit Q-values for each action, clearly distinguishing the value differences among buying, selling, and holding under the current state, thereby enabling more decisive selection of optimal actions. In contrast, PPO, as a policy-based algorithm, although capable of handling discrete actions, focuses more on the overall optimization of action probability distributions. When the action space is small, its design of restricting policy mutations may lead to insufficient enhancement of the probability for high-value actions. Compared with baseline models, our proposed method can effectively enhance traditional RL algorithms, allowing the agent to interpret price data more clearly during both training and testing phases, thus generating more optimal trading strategies.
However, an examination of the experimental results reveals that the performance on the NASDAQ dataset is unexpectedly high. For the other datasets, the cumulative returns (CRs) mostly fluctuate between 100% and 300%, with only a few exceeding 300%. In contrast, the cumulative returns on the NASDAQ dataset all surpass 1000%. To rule out potential issues, it is necessary to analyze this dataset. Figure 10 presents the proportion of closing prices in the test sets of the five datasets that fluctuated by more than 1% within three days. As mentioned earlier, the reward function employed in this study is a min-max function, which takes the maximum price change rate over the next three days as the reward for buying or selling actions. When the agent chooses to hold, the reward is calculated as a threshold minus the maximum price change rate within these three days; in our experiments, this threshold is set to 2.5%. The threshold of 2.5% is determined based on the volatility distribution of the data: a lower threshold elevates the frequency of trading-signal triggers, while a higher threshold incentivizes extended position-holding. The core of the reward mechanism lies in aggregating extrema over the future window rather than in any single threshold. Even with threshold adjustments, the future-trend forecasting capability of the time series prediction network and the reinforcement learning agent together safeguard decision-making efficacy. This is substantiated by the framework’s significant outperformance of the baseline methods on the other datasets under the identical threshold, verifying that its performance is not contingent upon a single threshold value.
It can be clearly seen from the heatmap that the NASDAQ dataset has the highest proportion of closing price fluctuations exceeding 1%. Notably, 25.7% of the declines in the NASDAQ dataset exceed 1%, while 28.7% of the increases also exceed 1%. It should be noted that our experimental environment allows the agent to perform short-selling operations, and since the NASDAQ index has more downward trends, the exceptionally high experimental results on the NASDAQ dataset are understandable.

6.5. Zero-Shot Prediction Applied to Trading

In the experiments, we integrated the zero-shot prediction results described in Section 6.2 into the testing phase of the RL trading agent. During the data selection stage, the model with the optimal prediction performance was selected for each dataset, and actual trading simulations were conducted for the period from 1 January 2019, to 1 June 2021. The experimental data clearly demonstrate that even under extreme conditions where a large amount of critical raw data (including partial historical transaction records) is missing, our framework can still generate reliable prediction results based on pre-trained prototype embeddings. This enables it to achieve stable and substantial profit performance in the trading process, with the profit outcomes of trading using zero-shot prediction results presented in Table 5.
The DDQN algorithm achieves the optimal return rate in trading with index datasets. Index data aggregates market dynamics and exhibits lower idiosyncratic risk, thus requiring effective capture of temporal dependencies. DDQN excels at learning from sequential state transitions. When combined with the predicted data from the CE module, it can fully leverage both historical trends and forward-looking signals. This characteristic aligns well with the time series nature of index data, facilitating the capture of long-term trends for accumulated gains. Moreover, its ϵ -greedy strategy strikes a superior balance between exploration and exploitation: it avoids excessive exploration when index trends are clear and appropriately increases exploration during market turning points, enabling more flexible adaptation to the cyclical fluctuations of indices.
The stability of maintaining high profitability even in data-scarce scenarios not only validates the technical feasibility of integrating zero-shot prediction with RL but also directly addresses the core requirements of financial trading for realistic performance and adaptability. This provides a solution that balances security and efficiency for intelligent decision-making in realistic markets.

6.6. Analysis of Model Robustness

To further verify the reliability and robustness of the proposed method, it is necessary to conduct additional experiments to evaluate the model’s performance in response to unexpected events. The training dataset used in this study includes price data spanning from 1 January 2007, to 31 December 2018, while the testing dataset covers the period from 1 January 2019, to 30 June 2021. It is well-known that on 6 July 2018, the United States imposed a 25% tariff on approximately USD 34 billion worth of goods imported from China, marking the official onset of the China–U.S. trade war. In response, China promptly implemented retaliatory measures by levying a 25% tariff on 545 types of goods originating from the United States, with a total value of USD 34 billion. Evidently, the training dataset employed in this study encompasses partial data during the China–U.S. trade war.
To verify the reliability of the proposed method and its robustness against abrupt events, experiments should be conducted using training datasets that exclude data from the trade war period. In this case, the trade war period and the post-trade war period serve as the test datasets, enabling us to observe the model’s performance across multiple datasets. Accordingly, the method was tested on the five aforementioned datasets, with the training set spanning from 1 January 2007 to 31 December 2017, and the test set ranging from 1 January 2019 to 31 December 2021. The experimental results are documented in Table 6. It can be observed that when the training data lacks information from the trade war period, the model’s performance declines significantly. This clearly indicates that the trade war, as an abrupt event, exerts a negative impact on trading performance.

6.7. Analysis of Trading Strategy

Following the analysis of effectiveness across various experiments, this study further examines the actions taken by the agent within the datasets. Figure 11 presents the trading operations executed by the agent on three datasets. These operations are overlaid on the stock price trend charts, clearly and intuitively illustrating the alignment between the agent’s decisions and price movements, thereby facilitating a better understanding of the logic underlying its trading strategy. Specifically, the figure displays the trading operations of the agent employing the PPO algorithm on the N225, HSI, and SP500 datasets. In such scenarios, the agent adopts a high-frequency trading strategy: given the significant short-term fluctuations in stock prices, the agent tends to buy at the troughs and sell at the peaks of these fluctuations, which undoubtedly yields profits.
During the sideways consolidation phase (roughly days 100-200), the density of long-short trading signals decreases significantly and the signals alternate less frequently. This does not indicate strategy failure; rather, it reflects the agent's deliberate restraint based on its judgment of market conditions. In a sideways market with no clear directional price movement, frequent long-short operations mainly incur unnecessary transaction costs. By reducing ineffective trades and adopting a wait-and-see approach, the agent avoids the negative impact of sideways trends on strategy returns, demonstrating the adaptability and rationality of the strategy in complex market environments.
Short-selling operations were executed only when a significant downward trend was predicted to begin around day 280. These three panels therefore show that the agent effectively monitors the market environment and price movements. The PPO agent accurately identifies the directional trend of market prices and follows a clear trading logic of "going long in an uptrend and shorting in a downtrend", which is closely aligned with the core idea of trend-following trading. It follows that, during the testing phase, the trading agent driven by the PPO algorithm exhibits strong trend-judging capability and rational logic in executing long-short trades.
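The long-short decision logic described above can be sketched as a simple rule comparing the model's predicted prices with the last observed close. The function name and the 0.3% abstention band (mirroring the per-trade cost listed in Appendix A.2) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def trend_signal(observed_close: np.ndarray, predicted_close: np.ndarray) -> int:
    """Map an observed close history plus a predicted price path to a
    discrete signal: 1 = go long, -1 = go short, 0 = stay out (sideways)."""
    last = observed_close[-1]
    expected = predicted_close.mean()      # average predicted price level
    drift = (expected - last) / last       # relative expected move
    threshold = 0.003                      # ignore moves inside the 0.3% cost band
    if drift > threshold:
        return 1
    if drift < -threshold:
        return -1
    return 0
```

In a sideways market the expected drift stays inside the cost band, so the rule abstains, matching the reduced signal density observed in Figure 11.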

7. Conclusions

This study presents a novel DRL framework that significantly enhances the trading performance of DRL agents in financial markets. Within this framework, the CE module improves prediction accuracy by leveraging large volumes of unlabeled data to learn the intrinsic similarities and temporal dependencies among channels. When encountering unseen financial time series, the model achieves seamless knowledge transfer through a cluster-aware feed-forward mechanism: generalizable prototype embeddings are learned and distilled from the limited data accessible during training, allowing accurate predictions to be generated rapidly. By integrating CE into mainstream time series models to predict future prices, and by augmenting the raw price data with the predicted prices to form prediction-ground truth complementary data, the agent can both anticipate potential future price change patterns and observe the market from a more macroscopic perspective. Under these conditions, DRL models are able to extract superior latent features from the data. In extensive tests on multiple datasets, including the DJI, NASDAQ 100, HSI, SP500, and N225, the model achieved an annualized return of 137.94% on the NASDAQ dataset during the testing phase. Most experimental metrics outperformed those of the baseline models, verifying the effectiveness of this approach.
This study achieves synergistic optimization by integrating multiple technical components. Through a transfer mechanism based on pre-trained prototype embeddings and cluster matching probabilities, prototype embeddings learned from limited data can be transferred to new samples, enabling rapid prediction without retraining while capturing the inherently balanced and asymmetric dynamics of stock price behavior. The model design emphasizes adaptability to symmetry and asymmetry in financial markets: targeting the frequent asymmetric phenomena observed there, the model identifies hidden symmetric patterns in stock price fluctuations through correlation modeling of multi-channel financial data via Cluster Embeddings. This approach moves beyond the limits of purely data-driven learning in traditional models and provides a novel theoretical framework for addressing the sample generalization challenge in financial time series forecasting.
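The cluster-aware zero-shot transfer summarized above can be illustrated with a simplified sketch: an unseen channel is scored against the learned prototype embeddings, and the resulting cluster matching probabilities weight the prototypes. The function names and the dot-product similarity are assumptions for illustration; the paper's actual module is more elaborate:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

def zero_shot_embedding(channel_feat: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Score an unseen channel's features against K prototype embeddings
    (shape (K, d)) and blend the prototypes by the matching probabilities."""
    scores = prototypes @ channel_feat      # similarity to each prototype
    probs = softmax(scores)                 # cluster matching probabilities
    return probs @ prototypes               # probability-weighted embedding
```

Because the blend reuses only quantities fixed at training time, a prediction for a new channel requires no gradient update, which is the sense in which the transfer is "zero-shot".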
Despite its numerous advantages, the proposed framework still exhibits certain limitations that merit further exploration in future research. The model does not incorporate commonly used technical indicators or fundamental features of financial markets. Such information typically encapsulates critical signals, including market sentiment and capital flows, and incorporating it could substantially enhance the model's ability to explain price fluctuations, particularly under extreme market conditions. Additionally, integrating the CE module expands the model's parameter scale and raises training costs. When processing ultra-high-frequency data, the model may face inference latency challenges, making it difficult to satisfy the stringent real-time requirements of scenarios such as high-frequency trading.

Author Contributions

Formal analysis, H.Z.; Funding acquisition, X.L.; Methodology, X.L. and J.D.; Project administration, X.L.; Supervision, X.L.; Writing—original draft, H.Z.; Writing—review and editing, T.W., J.D., H.Z. and X.L. contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

The research was financially supported by Natural Science Foundation of Hubei Province (2022CFB023), Education Science Planning Project of Hubei Province (2022GA031), Yangtze University College Student Innovation and Entrepreneurship Project (Yz2024330), and Project of Research Team Jingzhou University (JYKYTD202408).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Experimental Details

Appendix A.1. Datasets

We selected index data from the world's top three economies for our study, with the data sourced from a public financial dataset on Kaggle. The dataset records, on a daily basis, the movements of the aforementioned indices from 1 January 2007 to 31 December 2023; details of the raw data can be found in Table A1. For the time series prediction task, the training set spans 1 January 2007 to 31 December 2018 and the test set covers 1 January 2019 to 31 December 2023. For the reinforcement learning agent task, the data are split into a training set (1 January 2007 to 31 December 2018) and a test set (1 January 2019 to 1 June 2021). This division is motivated by the COVID-19 pandemic that occurred between 2019 and 2021, as we aim to verify the model's robustness. A sliding window sampling method is adopted for both tasks, and the merged state spaces in reinforcement learning are arranged according to the same temporal window.
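The sliding window sampling mentioned above can be sketched as follows; the window lengths match the input/output lengths of 20 and 5 days listed in Table A3, and the function name is illustrative:

```python
import numpy as np

def sliding_windows(series: np.ndarray, input_len: int = 20, output_len: int = 5):
    """Split a (T, C) OHLCV array into overlapping (input, target) pairs,
    stepping one day at a time, for both forecasting and RL state building."""
    pairs = []
    for start in range(len(series) - input_len - output_len + 1):
        x = series[start : start + input_len]                           # observed window
        y = series[start + input_len : start + input_len + output_len]  # future window
        pairs.append((x, y))
    return pairs
```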
Table A1. Experiment configuration.
Financial Datasets | Start Date | End Date | Channel | Open Price | Frequency
N225 | 2007-1-1 | 2023-12-31 | 5 | 17,322.50 | 1 Day
SP500 | 2007-1-1 | 2023-12-31 | 5 | 1418.03 | 1 Day
DJI | 2007-1-1 | 2023-12-31 | 5 | 12,459.54 | 1 Day
NASDAQ | 2007-1-1 | 2023-12-31 | 5 | 1769.22 | 1 Day
HSI | 2007-1-1 | 2023-12-31 | 5 | 20,004.84 | 1 Day

Appendix A.2. The Settings of Hyperparameters

In the experimental workflow, data preprocessing is the first step: the study applies reversible instance normalization to standardize the time series data, adjusting each input window to zero mean and unit standard deviation. This processing effectively avoids potential distribution shift across different time intervals or channels, providing a stable data foundation for subsequent model training. The parameter configuration for the training phase is as follows: Adam is selected as the optimizer with its default hyperparameters, namely (β1, β2) = (0.9, 0.999). To mitigate the risk of overfitting, an early stopping strategy is incorporated: training is terminated if the validation loss fails to decrease for 10 consecutive epochs. The specific experimental parameters for the stock index datasets, including the number of MLP layers in the cluster assigner, the number of temporal module layers, the hidden dimension, the optimal number of clusters, and the regularization parameter, are compiled in Table A2 for reference.
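The normalization step can be sketched as follows. This is a simplified per-window version of reversible instance normalization (normalize before the model, restore the original scale afterwards), not the exact published implementation:

```python
import numpy as np

def instance_normalize(x: np.ndarray):
    """Normalize each channel of one input window to zero mean and unit
    standard deviation, keeping the statistics so that model outputs can
    be mapped back to the original scale (the 'reversible' step)."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True) + 1e-8   # avoid division by zero
    return (x - mean) / std, (mean, std)

def instance_denormalize(y: np.ndarray, stats) -> np.ndarray:
    """Invert the normalization using the stored per-window statistics."""
    mean, std = stats
    return y * std + mean
```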
Table A2. Experiment configuration for the time series prediction network.
Datasets | # Clusters | β | # Linear Layers in MLP | Hidden Dimension | # Layers (TSMixer) | # Layers (PatchTST) | # Layers (TimesNet)
N225 | 2 | 0.3 | 1 | 64 | 2 | 4 | 3
SP500 | 2 | 0.3 | 1 | 64 | 2 | 4 | 3
DJI | 2 | 0.3 | 1 | 64 | 2 | 4 | 3
NASDAQ | 2 | 0.3 | 1 | 64 | 2 | 4 | 3
HSI | 2 | 0.3 | 1 | 64 | 2 | 4 | 3
For the RL-related configurations, the agent network follows mature schemes from the existing literature; the initial experimental capital is set to USD 500,000, with a total transaction cost of 0.3% incurred per completed trade. To ensure experimental stability, the experiments are repeated independently with five random seeds (12,345, 23,451, 32,154, 54,321, and 43,215), which guarantees the reproducibility of the results. The hyperparameters of the RL agents are summarized in Table A3. The sliding window method is adopted in the data sampling phase to ensure consistent sampling and temporal continuity of the data.
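The repeated-seed protocol can be sketched as follows; `run_experiment` is a hypothetical callable standing in for one full training-and-evaluation run that returns a scalar result (e.g., cumulative return):

```python
import statistics

SEEDS = [12345, 23451, 32154, 54321, 43215]   # seeds listed in Appendix A.2

def evaluate_over_seeds(run_experiment, seeds=SEEDS):
    """Repeat an experiment once per seed and report the mean and sample
    standard deviation of its scalar result."""
    results = [run_experiment(seed) for seed in seeds]
    return statistics.mean(results), statistics.stdev(results)
```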
Table A3. Experiment configuration for reinforcement learning.
Hyperparameters | CE-DDQN | CE-PPO | CE-A2C
Input length | 20 | 20 | 20
Output length | 5 | 5 | 5
Discount factor | 0.9 | 0.9 | 0.9
Learning rate | 0.001 | 0.001 | 0.001
Batch size | 32 | 32 | 32
Episode | 100 | 100 | 100
Replay memory | 10,000 | 10,000 | 10,000
Greedy | 0.9 | – | –
Clip | – | 0.1 | –

Appendix B. Financial Time Series Forecasting Results

To further demonstrate our forecasting results, we present supplementary forecasting cases on four additional datasets, shown in Figure A1, Figure A2, Figure A3 and Figure A4. It can be observed from these figures that integrating the CE module with TimesNet and PatchTST achieves the best results, while the performance of TSMixer is less satisfactory.
Figure A1. Comparison between ground truth and predicted values on SP500.
Figure A2. Comparison between ground truth and predicted values on DJI.
Figure A3. Comparison between ground truth and predicted values on NASDAQ.
Figure A4. Comparison between ground truth and predicted values on HSI.

Figure 1. Channel independent model.
Figure 2. Channel dependent model.
Figure 3. The structure of the proposed method.
Figure 4. Cluster module framework.
Figure 5. The framework diagram of prototype learning.
Figure 6. Flowchart of the overall framework.
Figure 7. Comparison between ground truth and predicted values on HSI.
Figure 8. Comparison between ground truth and predicted values on HSI.
Figure 9. The result of ablation studies.
Figure 10. Heatmap distribution of the 3-day future price changes in the dataset.
Figure 11. Action display of the intelligent agent on the Nikkei 225, Hang Seng Index, and S&P 500 datasets.
Table 1. Trading operations based on signals and account positions.
Old Position | Signal (a_t*) | Actual Action (a_t) | New Position | Description
0 | 0 | 0 | 0 | Hold the cash.
0 | −1 | −1 | −1 | Open a short position.
0 | 1 | 1 | 1 | Open a long position.
1 | 1 | 0 | 1 | Hold the long position.
1 | 0 | 0 | 1 | Hold the long position.
−1 | 0 | 0 | −1 | Hold the short position.
−1 | −1 | 0 | −1 | Hold the short position.
1 | −1 | −1 | 0 | Close the long position.
−1 | 1 | 1 | 0 | Close the short position.
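Table 1 can be read as a small state machine mapping the raw signal a_t* and the current position to the executed action and the new position: open when flat, hold while the signal agrees or is neutral, and trade in the opposite direction to close. A minimal sketch (function name illustrative):

```python
def execute_signal(position: int, signal: int):
    """Resolve a raw signal a_t* in {-1, 0, 1} against the current position
    in {-1, 0, 1}, returning (actual action a_t, new position) per Table 1."""
    if position == 0:
        action = signal        # open a long/short position, or keep holding cash
    elif signal == -position:
        action = signal        # opposite signal: trade once to close the position
    else:
        action = 0             # same-direction or neutral signal: hold
    return action, position + action
```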
Table 2. The prediction performance of the five indices.
Datasets | Metrics | PatchTST | PatchTST+CE | TimesNet | TimesNet+CE | TSMixer | TSMixer+CE | DLinear | DLinear+CE | IMP (%)
N225 | MAE | 0.055 | 0.051 | 0.053 | 0.051 | 0.397 | 0.275 | 0.423 | 0.072 | 32.57%
N225 | MSE | 0.005 | 0.004 | 0.004 | 0.004 | 0.207 | 0.133 | 0.208 | 0.008 | 54.78%
N225 | RMSE | 0.069 | 0.067 | 0.067 | 0.065 | 0.446 | 0.351 | 0.441 | 0.091 | 30.88%
N225 | MAPE | 0.018 | 0.017 | 0.017 | 0.016 | 0.130 | 0.092 | 0.135 | 0.024 | 33.70%
SP500 | MAE | 0.058 | 0.057 | 0.057 | 0.056 | 0.512 | 0.312 | 0.112 | 0.088 | 19.64%
SP500 | MSE | 0.006 | 0.005 | 0.005 | 0.005 | 0.317 | 0.163 | 0.019 | 0.013 | 24.25%
SP500 | RMSE | 0.074 | 0.073 | 0.073 | 0.072 | 0.550 | 0.391 | 0.132 | 0.109 | 12.21%
SP500 | MAPE | 0.017 | 0.013 | 0.013 | 0.012 | 0.111 | 0.071 | 0.025 | 0.020 | 25.97%
DJI | MAE | 0.048 | 0.042 | 0.077 | 0.043 | 0.514 | 0.310 | 0.087 | 0.074 | 26.82%
DJI | MSE | 0.004 | 0.004 | 0.010 | 0.003 | 0.309 | 0.166 | 0.012 | 0.009 | 35.3%
DJI | RMSE | 0.062 | 0.052 | 0.096 | 0.055 | 0.547 | 0.396 | 0.105 | 0.092 | 20.9%
DJI | MAPE | 0.013 | 0.010 | 0.020 | 0.011 | 0.130 | 0.082 | 0.023 | 0.019 | 26.7%
NASDAQ | MAE | 0.079 | 0.080 | 0.082 | 0.081 | 1.083 | 0.327 | 0.155 | 0.132 | 23.21%
NASDAQ | MSE | 0.011 | 0.011 | 0.012 | 0.011 | 1.303 | 0.175 | 0.038 | 0.029 | 29.64%
NASDAQ | RMSE | 0.101 | 0.101 | 0.105 | 0.013 | 1.113 | 0.404 | 0.185 | 0.162 | 19.54%
NASDAQ | MAPE | 0.013 | 0.013 | 0.014 | 0.013 | 0.172 | 0.055 | 0.026 | 0.023 | 21.63%
HSI | MAE | 0.067 | 0.067 | 0.067 | 0.066 | 0.511 | 0.344 | 0.124 | 0.114 | 10.55%
HSI | MSE | 0.007 | 0.007 | 0.007 | 0.007 | 0.178 | 0.389 | 0.025 | 0.022 | 16.56%
HSI | RMSE | 0.085 | 0.084 | 0.084 | 0.083 | 0.569 | 0.401 | 0.151 | 0.140 | 9.72%
HSI | MAPE | 0.145 | 0.135 | 0.129 | 0.128 | 0.793 | 0.565 | 0.348 | 0.341 | 9.56%
Table 3. The performance of zero-shot prediction in the U.S. market.
Datasets | Metrics | PatchTST | PatchTST+CE | TimesNet | TimesNet+CE | TSMixer | TSMixer+CE | DLinear | DLinear+CE | IMP (%)
① NASDAQ→DJI | MAE | 0.057 | 0.048 | 0.048 | 0.045 | 0.495 | 0.294 | 0.134 | 0.079 | 17.18%
① NASDAQ→DJI | MSE | 0.005 | 0.004 | 0.004 | 0.003 | 0.299 | 0.147 | 0.027 | 0.010 | 39.70%
① NASDAQ→DJI | RMSE | 0.071 | 0.062 | 0.061 | 0.058 | 0.537 | 0.374 | 0.157 | 0.098 | 21.38%
① NASDAQ→DJI | MAPE | 0.015 | 0.012 | 0.013 | 0.012 | 0.175 | 0.125 | 0.037 | 0.021 | 17.73%
② NASDAQ→SP500 | MAE | 0.069 | 0.058 | 0.056 | 0.055 | 0.632 | 0.286 | 0.165 | 0.094 | 28.88%
② NASDAQ→SP500 | MSE | 0.007 | 0.006 | 0.005 | 0.005 | 0.469 | 0.139 | 0.039 | 0.015 | 36.55%
② NASDAQ→SP500 | RMSE | 0.085 | 0.074 | 0.072 | 0.071 | 0.670 | 0.361 | 0.192 | 0.116 | 25.01%
② NASDAQ→SP500 | MAPE | 0.015 | 0.013 | 0.012 | 0.010 | 0.137 | 0.065 | 0.037 | 0.021 | 31.45%
③ SP500→NASDAQ | MAE | 0.086 | 0.081 | 0.106 | 0.078 | 0.903 | 0.890 | 0.221 | 0.126 | 19.16%
③ SP500→NASDAQ | MSE | 0.012 | 0.011 | 0.020 | 0.010 | 0.948 | 0.894 | 0.071 | 0.026 | 31.85%
③ SP500→NASDAQ | RMSE | 0.108 | 0.103 | 0.136 | 0.099 | 0.953 | 0.921 | 0.260 | 0.155 | 23.85%
③ SP500→NASDAQ | MAPE | 0.014 | 0.013 | 0.017 | 0.013 | 0.144 | 0.141 | 0.036 | 0.021 | 18.61%
④ SP500→DJI | MAE | 0.048 | 0.047 | 0.047 | 0.042 | 0.556 | 0.393 | 0.099 | 0.096 | 22.62%
④ SP500→DJI | MSE | 0.004 | 0.004 | 0.003 | 0.003 | 0.369 | 0.194 | 0.015 | 0.015 | 31.96%
④ SP500→DJI | RMSE | 0.062 | 0.061 | 0.060 | 0.055 | 0.599 | 0.431 | 0.119 | 0.117 | 20.44%
④ SP500→DJI | MAPE | 0.012 | 0.012 | 0.012 | 0.011 | 0.141 | 0.099 | 0.026 | 0.025 | 22.99%
⑤ DJI→SP500 | MAE | 0.058 | 0.057 | 0.094 | 0.053 | 0.651 | 0.304 | 0.117 | 0.113 | 31.92%
⑤ DJI→SP500 | MSE | 0.005 | 0.005 | 0.014 | 0.005 | 0.484 | 0.150 | 0.024 | 0.021 | 45.08%
⑤ DJI→SP500 | RMSE | 0.062 | 0.061 | 0.115 | 0.058 | 0.682 | 0.387 | 0.142 | 0.137 | 27.30%
⑤ DJI→SP500 | MAPE | 0.012 | 0.012 | 0.021 | 0.011 | 0.141 | 0.069 | 0.026 | 0.025 | 29.99%
⑥ DJI→NASDAQ | MAE | 0.083 | 0.081 | 0.083 | 0.079 | 1.170 | 1.106 | 0.146 | 0.124 | 6.68%
⑥ DJI→NASDAQ | MSE | 0.011 | 0.011 | 0.012 | 0.010 | 1.522 | 1.335 | 0.034 | 0.026 | 13.13%
⑥ DJI→NASDAQ | RMSE | 0.105 | 0.103 | 0.105 | 0.100 | 1.210 | 1.129 | 0.176 | 0.154 | 6.46%
⑥ DJI→NASDAQ | MAPE | 0.013 | 0.013 | 0.014 | 0.013 | 0.187 | 0.176 | 0.024 | 0.021 | 6.38%
Table 4. The performance of the various methods on the five datasets.
Datasets | Metrics | B&H | S&H | MR | SVM | Random Forest | TDQN | DQN-HER | CE DDQN | CE A2C | CE PPO
N225 | CR | 3.76% | −0.87% | 89.44% | 47.71% | 73.62% | −13.90% | 41.62% | 353.853% | 306.201% | 232.626%
N225 | AR | 10.55% | −10.38% | 4.84% | 4.33% | 7.72% | −4.23% | 11.42% | 64.43% | 83.88% | 63.16%
N225 | SR | 0.54 | −0.40 | 1.79 | 0.02 | 0.47 | −0.21 | 0.72 | 4.201 | 3.885 | 3.185
N225 | MDD | 20.23% | 2.58% | 27.32% | 23.78% | 34.77% | 31.67% | 15.96% | 25.46% | 30.54% | 20.98%
SP500 | CR | 2.354% | −0.461% | 52.77% | 72.71% | 95.96% | −36.76% | 90.21% | 137.224% | 353.595% | 654.27%
SP500 | AR | 0.32% | 0.03% | 2.54% | 3.28% | 17.57% | −21.59% | 20.37% | 52.18% | 84.08% | 78.61%
SP500 | SR | 0.31 | 0.60 | 0.05 | 0.88 | 0.38 | −1.09 | 1.15 | 2.937 | 3.266 | 4.83
SP500 | MDD | 56.03% | 19.06% | 44.65% | 33.78% | 34.77% | 45.31% | 13.96% | 22.53% | 33.87% | 22.81%
DJI | CR | 4.12% | −0.58% | 34.12% | 36.22% | 16.73% | −45.79% | 74.83% | 445.528% | 152.272% | 177.683%
DJI | AR | 10.55% | −0.03% | 4.84% | 2.84% | 0.94% | −24.34% | 18.29% | 92.74% | 55.54% | 60.74%
DJI | SR | 0.54 | −8.24 | 0.27 | 1.09 | 0.28 | −0.91 | 1.47 | 4.489 | 2.020 | 2.088
DJI | MDD | 20.23% | 30.66% | 17.95% | 29.73% | 13.14% | 44.31% | 10.96% | 27.53% | 30.17% | 27.79%
NASDAQ | CR | 8.98% | 1.36% | 21.53% | 155.60% | 72.35% | 14.29% | 35.02% | 1334.781% | 1183.893% | 608.815%
NASDAQ | AR | 10.55% | 0.08% | 3.81% | 5.56% | 4.69% | 1.87% | 10.94% | 137.94% | 133.11% | 106.90%
NASDAQ | SR | 0.58 | −0.62 | 0.27 | 0.26 | 0.20 | 1.02 | 0.69 | 7.125 | 6.968 | 4.967
NASDAQ | MDD | 58.23% | 50.27% | 17.95% | 36.55% | 23.53% | 31.57% | 12.73% | 6.02% | 3.05% | 8.03%
HSI | CR | −10.75% | −1.29% | 33.13% | 37.23% | 37.51% | 35.67% | 44.27% | 305.523% | 302.819% | 104.964%
HSI | AR | 1.55% | −0.08% | 2.31% | 3.71% | 2.79% | 11.43% | 12.41% | 89.95% | 80.31% | 45.08%
HSI | SR | −0.10 | −1.69 | 0.23 | 0.37 | 0.75 | 0.73 | 0.68 | 4.201 | 3.974 | 2.016
HSI | MDD | 55.86% | 31.35% | 55.13% | 29.32% | 39.72% | 31.28% | 17.64% | 25.46% | 35.08% | 24.02%
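The trading metrics in Tables 4-6 (cumulative return CR, annualized return AR, Sharpe ratio SR, and maximum drawdown MDD) can be computed from a daily account-value curve. Exact conventions vary across papers; this sketch assumes 252 trading days per year and a zero risk-free rate:

```python
import numpy as np

def trading_metrics(equity: np.ndarray, days_per_year: int = 252) -> dict:
    """Compute CR, AR, SR, and MDD from a daily equity (account value) curve."""
    returns = np.diff(equity) / equity[:-1]                 # daily simple returns
    years = len(returns) / days_per_year
    cr = equity[-1] / equity[0] - 1.0                       # cumulative return
    ar = (1.0 + cr) ** (1.0 / years) - 1.0                  # annualized return
    sr = np.sqrt(days_per_year) * returns.mean() / returns.std()  # Sharpe ratio
    peak = np.maximum.accumulate(equity)                    # running high-water mark
    mdd = ((peak - equity) / peak).max()                    # maximum drawdown
    return {"CR": cr, "AR": ar, "SR": sr, "MDD": mdd}
```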
Table 5. Performance metrics of different trading algorithms.
Datasets | Metrics | A2C | DDQN | PPO
① NASDAQ→DJI | CR | 173.047% | 513.933% | 178.183%
① NASDAQ→DJI | AR | 59.58% | 98.06% | 60.83%
① NASDAQ→DJI | SR | 2.17 | 3.74 | 2.09
① NASDAQ→DJI | MDD | 26.99% | 21.72% | 28.69%
② NASDAQ→SP500 | CR | 244.48% | 239.642% | 178.19%
② NASDAQ→SP500 | AR | 70.98% | 70.32% | 60.29%
② NASDAQ→SP500 | SR | 2.68 | 2.24 | 2.27
② NASDAQ→SP500 | MDD | 31.54% | 36.54% | 34.83%
③ SP500→NASDAQ | CR | 1032.89% | 1284.37% | 581.43%
③ SP500→NASDAQ | AR | 129.61% | 117.38% | 91.83%
③ SP500→NASDAQ | SR | 6.12 | 6.61 | 4.53
③ SP500→NASDAQ | MDD | 9.81% | 7.36% | 8.53%
④ SP500→DJI | CR | 184.69% | 490.06% | 204.409%
④ SP500→DJI | AR | 61.59% | 96.25% | 65.24%
④ SP500→DJI | SR | 2.29 | 3.67 | 2.29
④ SP500→DJI | MDD | 23.88% | 24.03% | 24.67%
⑤ DJI→SP500 | CR | 264.11% | 258.53% | 190.94%
⑤ DJI→SP500 | AR | 73.65% | 72.90% | 62.47%
⑤ DJI→SP500 | SR | 2.80 | 2.78 | 2.40
⑤ DJI→SP500 | MDD | 30.65% | 30.79% | 31.41%
⑥ DJI→NASDAQ | CR | 1046.37% | 1127.51% | 579.34%
⑥ DJI→NASDAQ | AR | 124.39% | 118.91% | 94.65%
⑥ DJI→NASDAQ | SR | 6.14 | 6.75 | 4.31
⑥ DJI→NASDAQ | MDD | 4.71% | 6.89% | 8.74%
Table 6. Comparison of trading performance under training sets that include and exclude data during the China–U.S. trade war.

| Method | Dataset | CR | AR | SR | MDD |
|---|---|---|---|---|---|
| DDQN with political events | N225 | 353.853% | 64.43% | 4.201 | 25.46% |
| | SP500 | 137.224% | 52.18% | 2.937 | 22.53% |
| | DJI | 445.528% | 92.74% | 4.489 | 27.53% |
| | NASDAQ | 1334.78% | 137.94% | 7.12 | 6.02% |
| | HSI | 305.52% | 89.95% | 4.20 | 25.46% |
| DDQN without political events | N225 | 124.67% | 31.54% | 2.130 | 28.64% |
| | SP500 | 81.973% | 31.76% | 1.237 | 27.31% |
| | DJI | 173.371% | 51.34% | 2.132 | 31.26% |
| | NASDAQ | 486.314% | 42.93% | 3.36 | 14.26% |
| | HSI | 119.31% | 41.62% | 2.91 | 29.31% |
| A2C with political events | N225 | 306.201% | 83.88% | 3.88 | 30.54% |
| | SP500 | 353.595% | 84.08% | 3.266 | 33.87% |
| | DJI | 445.528% | 92.74% | 4.489 | 27.53% |
| | NASDAQ | 1334.78% | 137.94% | 7.12 | 6.02% |
| | HSI | 305.523% | 89.95% | 4.20 | 25.46% |
| A2C without political events | N225 | 196.351% | 43.13% | 2.74 | 35.08% |
| | SP500 | 129.372% | 36.98% | 2.43 | 35.92% |
| | DJI | 374.346% | 41.34% | 2.93 | 30.71% |
| | NASDAQ | 561.14% | 72.67% | 3.23 | 12.34% |
| | HSI | 234.31% | 57.32% | 2.91 | 29.61% |
| PPO with political events | N225 | 232.626% | 63.16% | 3.185 | 20.98% |
| | SP500 | 654.27% | 78.61% | 4.83 | 22.81% |
| | DJI | 177.683% | 60.74% | 2.088 | 27.79% |
| | NASDAQ | 608.815% | 106.90% | 4.967 | 8.03% |
| | HSI | 104.964% | 45.08% | 2.016 | 24.02% |
| PPO without political events | N225 | 297.317% | 37.81% | 3.241 | 27.31% |
| | SP500 | 93.618% | 37.81% | 1.97 | 26.71% |
| | DJI | 51.231% | 27.84% | 1.127 | 30.21% |
| | NASDAQ | 431.673% | 71.31% | 3.813 | 14.34% |
| | HSI | 73.127% | 29.87% | 1.476 | 28.64% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Zhang, H.; Li, X.; Wan, T.; Du, J. Deep Reinforcement Learning for Financial Trading: Enhanced by Cluster Embedding and Zero-Shot Prediction. Symmetry 2026, 18, 112. https://doi.org/10.3390/sym18010112

