A Novel Portfolio Selection Method via Deep Reinforcement Learning

Gao, Ni; Liu, Yan; He, Yiyue; Zhang, Juan; Zhang, Lefang

doi:10.3390/systems14030292


by Ni Gao ¹, Yan Liu ¹, Yiyue He ²,*, Juan Zhang ³ and Lefang Zhang ⁴

¹ School of Economics and Finance, Xi’an International Studies University, Xi’an 710128, China

² School of Economics and Management, Northwest University, Xi’an 710127, China

³ School of Journalism and Communication, Shaanxi Normal University, Xi’an 710019, China

⁴ School of Economics and Finance, Xi’an Jiaotong University, Xi’an 710049, China

* Author to whom correspondence should be addressed.

Systems 2026, 14(3), 292; https://doi.org/10.3390/systems14030292

Submission received: 27 January 2026 / Revised: 28 February 2026 / Accepted: 4 March 2026 / Published: 9 March 2026


Abstract

Portfolio selection is a fundamental task in quantitative finance that aims to allocate capital across assets to balance risk and return. While deep learning has shown great promise in this field, extracting reliable feature representations from non-stationary and noisy financial data remains a significant challenge. Existing models often fail to capture the temporal dynamics of price series and the complex inter-asset correlations simultaneously, which limits their trading performance. To address these issues, we propose Denoising-Sequence-Correlation Reinforcement Learning (DSCRL), a novel portfolio selection framework based on deep reinforcement learning. DSCRL employs a dual-stream feature extraction network, in which one stream learns temporal market dynamics and the other captures asset correlations, enabling more informative representations. A denoising module is further integrated to mitigate the impact of noise and to improve the stability and robustness of the learning process. Furthermore, a deterministic policy gradient (DPG)-based decision network is designed to directly optimize continuous portfolio weights and normalize them to satisfy the budget constraint while preserving the relative importance of the assets. Extensive experiments conducted on multiple benchmark datasets demonstrate that DSCRL consistently outperforms both traditional financial heuristics and advanced deep reinforcement learning approaches. The results highlight its ability to achieve higher cumulative returns with lower volatility. Overall, DSCRL provides an effective and robust solution that strikes a better trade-off between pursuing profits and managing risks in dynamic financial markets.

Keywords:

portfolio selection; asset correlations; deep reinforcement learning; feature extraction

1. Introduction

Portfolio selection is a fundamental research topic in finance and a core task in financial engineering. It can be regarded as the dynamic allocation of wealth among a group of assets to maximize long-term returns. The classic line of research is based on Mean–Variance Theory [1], which explores the trade-off between the expected return (mean) and risk (variance) of an investment portfolio over a single investment period. Growth-Optimal Portfolio Theory, which focuses on multi-period portfolio selection, aims to maximize the expected growth rate or logarithmic return of the portfolio. This multi-period setting is more closely aligned with the way investors actually operate. Early portfolio selection models grounded in mathematical statistics predominantly relied on empirical features, such as moving averages and technical indicators, which had a limited capacity to extract informative characteristics from financial data and consequently yielded suboptimal profitability. With the advancement of computational technologies, recent approaches have increasingly applied deep learning, reinforcement learning, and related algorithms to heterogeneous financial datasets for enhanced pattern recognition. Nevertheless, owing to several inherent limitations, existing methodologies remain considerably distant from practical implementation. First, existing methods struggle to extract price series patterns. Although RNN (Recurrent Neural Network)-based methods have achieved great success in modeling time series data [2,3], they still suffer from issues such as vanishing gradients and limited capacity [4], which undermine the expressive power of the trained model. Second, the correlations among financial assets are complex and change rapidly, and many of them are nonlinear. Traditional approaches have primarily focused on linear correlation measures, which fail to accurately capture the interdependencies among assets. Yet these nonlinear correlations reflect deeper associations between different assets, which are more critical for portfolio selection but considerably more difficult to extract [5]. Third, most existing portfolio selection models are built on deep learning algorithms that treat prediction as a supervised learning goal. The resulting prediction signals are difficult to apply directly in a trading strategy, so the strategy itself still needs to be designed manually [6]. Finally, financial time series data often exhibit noise, jumps, and volatility [7]. According to behavioral finance theory, factors such as investor psychological biases, principal-agent problems, and market manipulation give rise to substantial short-term speculation and noise trading in financial markets. Moreover, traditional hand-crafted feature representations are inherently limited, and most neural network approaches fail to incorporate multi-scale temporal information, thereby overlooking short-, medium- and long-term price signals during feature extraction.

To address these issues, we aim to devise a practical portfolio selection model that can learn representative patterns from complex feature interactions and can be readily applied to trading strategies. However, designing a scheme that produces accurate strategies is challenging because of the low signal-to-noise ratio (SNR) of financial data. Resolving these problems paves the way for practical portfolio selection models. To this end, this paper proposes the DSCRL portfolio selection model.

Our method is built upon the following key observations. First, to leverage nonlinear correlations among financial features, a dual-stream network is devised that extracts both price series patterns and asset correlation features. Second, rather than designing the model as a supervised deep learning predictor, we design DSCRL within the reinforcement learning paradigm so that its outputs can be applied directly to trading strategies. Third, considering that the most representative financial noise stems from uncertainties in various financial market conditions, a denoising module is designed within the DSCRL framework. With these observations incorporated into the overall design, empirical studies demonstrate that the proposed method achieves better performance on benchmark financial datasets.

Overall, the contributions of this paper include the following three aspects. First, a novel denoising method is proposed. To address the common problem of high noise in financial time series, this paper proposes a price series denoising approach based on Variational Mode Decomposition (VMD) and component reconstruction [8]. This method filters out redundant fluctuations while preserving valuable market information, thereby improving the quality of the input data and providing a more stable foundation for portfolio optimization. Second, a comprehensive framework is designed. By integrating correlation features, DSCRL extracts temporal patterns and inter-asset correlations simultaneously. The correlation feature extraction branch is mainly inspired by the Relation-Aware Transformer (RAT) [9], and the dual-stream features are integrated within a reinforcement learning structure to better adapt to the characteristics of financial decision-making tasks. Third, superior experimental performance is demonstrated. Extensive empirical experiments are conducted on datasets from both traditional stock markets and cryptocurrency markets. The experimental results demonstrate that the proposed DSCRL model significantly outperforms all baseline methods across multiple evaluation metrics, confirming its ability to enhance return performance while effectively controlling investment risk.

2. Methods

2.1. General Architecture

DSCRL is mainly composed of three submodules: financial data denoising, feature extraction, and decision-making (Figure 1). The model processes the closing price of each asset to generate time series data after noise reduction. It then extracts feature maps for each asset and uses them to make trading decisions. The model structure and specifics will be presented in Section 3.1. Neural network parameters are acquired through training with a deep reinforcement learning algorithm.

Financial Data Denoising Module: To address the noise issue and ensure that no future information is introduced during the training and testing phases, strict causal constraints are imposed on the denoising module. Within each sliding window, the key parameters of the VMD model are optimized using the Sparrow Search Algorithm (SSA) solely based on the historical price data contained in the sliding window, without involving any future information [10]. The optimized VMD is then applied to decompose the stock price time series into intrinsic mode functions (IMFs) synchronously, which represent components of the stock price series with different frequencies. Some high-frequency IMFs are removed due to their excessive noise. The remaining IMFs are reconstructed to generate a denoised price series composed of high-frequency, low-frequency, and trend components, using gray relational clustering. The window subsequently rolls forward in chronological order, and the parameter optimization and decomposition procedures are conducted independently within each window.

Therefore, at any time step $t$, the feature construction process relies exclusively on information available at or prior to $t$, which ensures that the entire denoising procedure satisfies strict temporal causality constraints.
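The rolling-window causality constraint can be illustrated with a minimal sketch. The function names `rolling_causal_features` and `moving_average_denoiser`, the window length of 60, and the trailing-mean denoiser are assumptions made for illustration only; the trailing mean is a stand-in for the SSA-VMD-GRA step, not the authors' procedure.

```python
import numpy as np

def rolling_causal_features(prices, window, denoise):
    """Apply `denoise` to each trailing window; output[t] uses prices[t-window+1 : t+1] only."""
    prices = np.asarray(prices, dtype=float)
    out = np.full(len(prices), np.nan)          # undefined until a full window exists
    for t in range(window - 1, len(prices)):
        hist = prices[t - window + 1 : t + 1]   # data available at or before t
        out[t] = denoise(hist)[-1]              # keep only the value aligned with t
    return out

def moving_average_denoiser(hist, span=5):
    """Hypothetical stand-in for the denoising step: a trailing mean."""
    kernel = np.ones(span) / span
    return np.convolve(hist, kernel, mode="valid")

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=200))
features = rolling_causal_features(prices, window=60, denoise=moving_average_denoiser)

# Causality check: perturbing prices after t = 120 must not change features up to t = 120.
perturbed = prices.copy()
perturbed[121:] += 10.0
features_perturbed = rolling_causal_features(perturbed, window=60, denoise=moving_average_denoiser)
```

Because each output is computed from a window that ends at $t$, any change to later prices leaves earlier features untouched, which is the property the paragraph above requires.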

Feature Extraction Module: In this module, we perform dual-stream feature extraction on the denoised sequence data, focusing on both the temporal and spatial dimensions. For temporal feature extraction, the Temporal Convolutional Network (TCN) is employed to overcome the limitations of RNN and Long Short-Term Memory (LSTM), such as fixed receptive fields and vanishing gradients, thereby enhancing the model’s ability to capture temporal information across different scales, including short-term, medium-term, and long-term patterns. For spatial relationship extraction, the Relation-Aware Transformer (RAT), a variant of the Transformer architecture, is utilized to address the shortcomings of treating stocks as independent entities and relying solely on correlation coefficients. This enables the model to capture inter-asset relational features more effectively.

Decision-Making Module: In this module, the extracted features together with the previous portfolio weights are fed into the portfolio policy network to make portfolio allocation decisions. The strategy network is trained via reinforcement learning with the objective of maximizing cumulative returns. By incorporating the previous asset weights as recursive inputs, the model becomes aware of transaction costs and can produce smoother decisions, thereby better aligning with realistic trading environments.

2.2. Financial Data Denoising

Short-term speculative behavior and noise trading in financial markets cause time series data to exhibit irregular fluctuations and a high noise level. However, in the long term, the price of financial assets tends to revert to its intrinsic value according to the law of value. In this paper, the noise reduction method leverages this characteristic of financial assets by decomposing the close price sequence according to its fluctuation frequencies and eliminating the components with higher noise content. Figure 2 displays the flowchart of the noise reduction method. VMD is applied to decompose the close price sequence within the historical window [8], and the SSA is used to optimize the penalty factor $\alpha$ and the number of modes $k$ [10]. After the decomposition yields $k$ IMFs, gray correlation analysis (GRA) is employed to categorize each IMF into trend, low-frequency, and high-frequency components [11]. After the high-frequency terms with high noise content are eliminated, the low-frequency terms and the remaining high-frequency terms are each summed and reconstructed [12]. The resulting trend terms, low-frequency terms, and noise-reduced high-frequency terms are then fed into the feature extraction network.

In the denoising module, VMD requires the specification of two critical parameters: the number of modes $k$ and the penalty factor $\alpha$. These parameters directly influence the decomposition quality, particularly with respect to over-decomposition and mode mixing. However, deriving analytically optimal parameter values is generally intractable, and manual selection inevitably introduces subjectivity and potential bias. To mitigate excessive human intervention and reduce the risk of convergence to local optima, we incorporate SSA to perform automatic parameter optimization [10], which enhances the stability and generalization capability of the decomposition process.

Compared with conventional filtering techniques, simple methods such as moving averages (MA) tend to smooth the series indiscriminately, which leads to the mixing of high- and low-frequency components and the potential loss of critical turning-point information. In contrast, the framework SSA-VMD-GRA proposed in this paper enables explicit frequency separation, effectively decomposing the original series into trend and fluctuation components while preserving informative mid-frequency signals. Although wavelet denoising can address multi-scale characteristics, it requires the selection of a mother wavelet, which introduces methodological subjectivity, and its performance is sensitive to threshold determination. By comparison, VMD does not rely on predefined basis functions and provides clearer and more adaptive frequency band partitioning.

Overall, the framework SSA-VMD-GRA has advantages in terms of adaptive parameter optimization and robust frequency separation, making it more suitable for modeling non-stationary and structurally complex financial time series.

VMD is an adaptive signal decomposition method. Its advantage lies in the ability to determine the number of mode decompositions, which effectively solves the problem of mode aliasing and can better decompose the fluctuations of different center frequencies in financial time series data [8]. The problem description is as shown in Equation (1):

$$
\begin{aligned}
\min_{\{\mathrm{imf}_k\},\{\omega_k\}} \quad & \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * \mathrm{imf}_k(t) \right] e^{-j \omega_k t} \right\|_2^2 \\
\text{s.t.} \quad & \sum_{k} \mathrm{imf}_k(t) = f(t)
\end{aligned}
\tag{1}
$$

Here, $\mathrm{imf} = \{\mathrm{imf}_1, \mathrm{imf}_2, \dots, \mathrm{imf}_k\}$ are the $k$ intrinsic mode functions obtained by decomposition, $\omega = \{\omega_1, \omega_2, \dots, \omega_k\}$ are the center frequencies of the IMFs, $\delta(t)$ is the unit impulse function, $j$ is the imaginary unit, $*$ denotes convolution, and $f(t)$ is the original time series signal.

Next, the penalty factor $\alpha$ and the Lagrange multiplier $\lambda$ are introduced to obtain the augmented Lagrangian, as shown in Equation (2):

$$
\begin{aligned}
\mathcal{L}(\{\mathrm{imf}_k\}, \{\omega_k\}, \lambda) = {} & \alpha \sum_{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * \mathrm{imf}_k(t) \right] e^{-j \omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k} \mathrm{imf}_k(t) \right\|_2^2 \\
& + \left\langle \lambda(t),\; f(t) - \sum_{k} \mathrm{imf}_k(t) \right\rangle
\end{aligned}
\tag{2}
$$

The Alternating Direction Method of Multipliers (ADMM) is employed to iteratively update each modal component, the center frequencies, and the Lagrange multiplier, with the update procedure given in Equations (3)–(5):

$$
\hat{\mathrm{imf}}_k^{n+1}(\omega) = \frac{\hat{f}(\omega) - \sum_{i \neq k} \hat{\mathrm{imf}}_i(\omega) + \frac{\hat{\lambda}(\omega)}{2}}{1 + 2\alpha (\omega - \omega_k)^2}
\tag{3}
$$

$$
\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \left| \hat{\mathrm{imf}}_k^{n+1}(\omega) \right|^2 d\omega}{\int_0^{\infty} \left| \hat{\mathrm{imf}}_k^{n+1}(\omega) \right|^2 d\omega}
\tag{4}
$$

$$
\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega) + \gamma \left[ \hat{f}(\omega) - \sum_{k} \hat{\mathrm{imf}}_k^{n+1}(\omega) \right]
\tag{5}
$$

Here, $\hat{\mathrm{imf}}_k^{n+1}(\omega)$, $\hat{\mathrm{imf}}_i(\omega)$, $\hat{f}(\omega)$, and $\hat{\lambda}^{n+1}(\omega)$ denote the Fourier transforms of the corresponding time-domain quantities, and $\gamma$ represents the noise tolerance.

During the iterative updating process, once the decomposition accuracy satisfies the required condition, the current output is regarded as the final decomposition result. The convergence criterion of the decomposition is given in Equation (6), where $\varepsilon$ is the maximum admissible tolerance:

$$
\sum_{k} \frac{\left\| \mathrm{imf}_k^{n+1} - \mathrm{imf}_k^{n} \right\|_2^2}{\left\| \mathrm{imf}_k^{n} \right\|_2^2} < \varepsilon
\tag{6}
$$
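The updates in Equations (3)–(6) can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it works on the one-sided spectrum from `numpy.fft.rfft`, omits the mirror extension used in common VMD implementations, and evaluates the stopping rule in the frequency domain. The function name `vmd_sketch`, the test signal, and the values `alpha=2000.0` and `gamma=0.0` are assumptions.

```python
import numpy as np

def vmd_sketch(signal, alpha, k, gamma=0.0, eps=1e-7, max_iter=500):
    """Simplified VMD: ADMM updates on the one-sided spectrum (no mirror extension)."""
    signal = np.asarray(signal, dtype=float)
    T = len(signal)
    freqs = np.fft.rfftfreq(T)                  # normalized frequencies in [0, 0.5]
    f_hat = np.fft.rfft(signal)
    u_hat = np.zeros((k, len(freqs)), dtype=complex)
    omega = 0.5 * np.arange(k) / k              # evenly spread initial center frequencies
    lam_hat = np.zeros(len(freqs), dtype=complex)
    tiny = 1e-12                                # guards against division by zero
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for i in range(k):
            others = u_hat.sum(axis=0) - u_hat[i]
            # Equation (3): mode update, a filter centered on omega[i]
            u_hat[i] = (f_hat - others + lam_hat / 2) / (1 + 2 * alpha * (freqs - omega[i]) ** 2)
            # Equation (4): center of gravity of the mode's power spectrum
            power = np.abs(u_hat[i]) ** 2
            omega[i] = np.sum(freqs * power) / (np.sum(power) + tiny)
        # Equation (5): multiplier update with step gamma
        lam_hat = lam_hat + gamma * (f_hat - u_hat.sum(axis=0))
        # Equation (6): summed relative change of the modes
        num = np.sum(np.abs(u_hat - u_prev) ** 2, axis=1)
        den = np.sum(np.abs(u_prev) ** 2, axis=1) + tiny
        if np.sum(num / den) < eps:
            break
    order = np.argsort(omega)                   # report modes from low to high frequency
    modes = np.fft.irfft(u_hat[order], n=T, axis=1)
    return modes, omega[order]

# Two-tone test signal with known normalized frequencies 0.02 and 0.2.
t = np.arange(500)
signal = np.cos(2 * np.pi * 0.02 * t) + 0.5 * np.cos(2 * np.pi * 0.2 * t)
modes, centers = vmd_sketch(signal, alpha=2000.0, k=2)
```

On this synthetic signal the two recovered center frequencies settle near 0.02 and 0.2, and the modes sum back to the input, which is the constraint in Equation (1).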

In the VMD process, the selection of the penalty factor $\alpha$ and the number of modes $k$ is particularly critical to the decomposition results. Therefore, this study employs the SSA to optimize these parameters by minimizing the envelope entropy, the computation of which is defined in Equations (7) and (8):

$$
S_i = -\sum_{j=1}^{N} p_{i,j} \log p_{i,j}
\tag{7}
$$

$$
p_{i,j} = \frac{a_i(j)}{\sum_{k=1}^{N} a_i(k)}
\tag{8}
$$

Here, $S_i$ denotes the envelope entropy of the $i$-th mode $\mathrm{imf}_i(j)$ derived from the original signal, and $a_i(j)$ represents its envelope signal obtained through Hilbert demodulation. Among the components obtained through VMD, those containing stronger periodic information correspond to smaller envelope entropy values, whereas those with weaker periodicity exhibit larger envelope entropy values. By adopting envelope entropy as the optimization criterion for SSA, the parameter optimization problem can be formulated as in Equation (9), and the optimal parameter pair $[\alpha, k]$ is determined using the SSA.

$$
\begin{aligned}
\min_{\alpha,\,k} \quad & S_i \\
\text{s.t.} \quad & k_{\min} \leq k \leq k_{\max} \\
& \alpha_{\min} \leq \alpha \leq \alpha_{\max}
\end{aligned}
\tag{9}
$$
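A minimal sketch of the envelope entropy in Equations (7) and (8) is given below. The envelope is taken as the magnitude of the analytic signal, built here with the usual FFT construction so that only NumPy is needed. The function names and the two test signals are assumptions for illustration.

```python
import numpy as np

def hilbert_envelope(x):
    """Magnitude of the analytic signal, constructed in the frequency domain."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                       # keep the DC term
    if n % 2 == 0:
        gain[n // 2] = 1.0              # keep the Nyquist term
        gain[1 : n // 2] = 2.0          # double the positive frequencies
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * gain))

def envelope_entropy(x):
    """Equations (7)-(8): Shannon entropy of the normalized envelope."""
    a = hilbert_envelope(x)
    p = a / a.sum()
    p = p[p > 0]                        # 0 * log(0) is treated as 0
    return float(-np.sum(p * np.log(p)))

n = 256
t = np.arange(n)
tone = np.cos(2 * np.pi * 8 * t / n)    # flat envelope
impulse = np.zeros(n)
impulse[100] = 1.0                      # envelope concentrated around one sample
h_tone = envelope_entropy(tone)
h_impulse = envelope_entropy(impulse)
```

The entropy is bounded above by $\log N$, which is reached when the envelope is flat; a concentrated envelope yields a smaller value. In a parameter search, this function would be evaluated on each mode returned by the decomposition for a candidate $[\alpha, k]$.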

The original time series signal $f(t)$ is decomposed via VMD into $k$ IMFs, denoted as $\mathrm{imf}_1(t)$, $\mathrm{imf}_2(t)$, …, $\mathrm{imf}_k(t)$, where $\mathrm{imf}_1(t)$ represents the trend component; the volatility frequencies of $\mathrm{imf}_2(t)$ through $\mathrm{imf}_k(t)$ increase progressively, and the complexity of the modes gradually increases. In financial markets, substantial speculative behavior and irrational trading introduce considerable noise into the high-frequency IMF sequences. To mitigate the negative impact of such noise on the effectiveness of portfolio models, the IMF sequences are categorized according to their frequencies into trend components, low-frequency components, and high-frequency components. Certain high-frequency IMFs are then removed, thereby enabling the extraction of medium- and long-term volatility features from asset price series.

The gray correlation analysis method is employed to categorize the IMF sequences into low-frequency and high-frequency groups, followed by the reconstruction of the low-frequency IMF sequences [11]. To eliminate the influence of differing scales, $\mathrm{imf}_1'(t)$ is defined as follows:

$$
\mathrm{imf}_1'(t) = \frac{\mathrm{imf}_1(t)}{\max_{t} \mathrm{imf}_1(t)}
\tag{10}
$$

Next, the relative correlation coefficients are calculated, as defined in Equation (11):

$$
\gamma_i^{1} = \frac{1}{T} \sum_{t=1}^{T} \varepsilon_i^{1}(t) = \frac{1}{T} \sum_{t=1}^{T} \frac{\min_i \min_t \left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right| + \rho \max_i \max_t \left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right|}{\left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right| + \rho \max_i \max_t \left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right|}
\tag{11}
$$

Here, $\rho$ denotes the distinguishing coefficient, which is typically set to 0.5. Subsequently, the absolute correlation coefficients are computed, as defined in Equation (12):

$$
\gamma_i^{2} = \frac{1}{T} \sum_{t=1}^{T} \varepsilon_i^{2}(t) = \frac{1}{T} \sum_{t=1}^{T} \frac{\min_i \min_t \left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right| + \rho \max_i \max_t \left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right|}{\left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right| + \rho \max_i \max_t \left| \mathrm{imf}_1'(t) - \mathrm{imf}_i'(t) \right|}
\tag{12}
$$

Subsequently, the comprehensive correlation coefficient, as shown in Equation (13), is obtained to represent the overall degree of association among the IMF components.

$$
\gamma_i = \beta \gamma_i^{1} + (1 - \beta) \gamma_i^{2}
\tag{13}
$$

Here, $\beta$ is the weight, which is set to 0.5 in this paper.
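The grade in Equation (11) and the weighted combination in Equation (13) can be sketched as below. The function names and the toy series are assumptions; the sketch computes one grade of the form shown in Equation (11) against a reference series and does not attempt to distinguish the relative and absolute variants.

```python
import numpy as np

def grey_relational_grades(reference, candidates, rho=0.5):
    """Mean grey relational coefficient of each candidate series against the reference."""
    reference = np.asarray(reference, dtype=float)
    candidates = np.atleast_2d(np.asarray(candidates, dtype=float))
    delta = np.abs(candidates - reference)       # |ref(t) - cand_i(t)| for every i and t
    d_min, d_max = delta.min(), delta.max()      # extremes taken over both i and t
    if d_max == 0.0:                             # all series identical to the reference
        return np.ones(len(candidates))
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return coeff.mean(axis=1)                    # average over t

def combined_grade(grade_1, grade_2, beta=0.5):
    """Equation (13): convex combination of two grades."""
    return beta * np.asarray(grade_1) + (1 - beta) * np.asarray(grade_2)

t = np.linspace(0.0, 1.0, 50)
ref = np.sin(2 * np.pi * t)
cands = np.stack([ref, ref + 0.4])               # an identical series and a shifted one
grades = grey_relational_grades(ref, cands)
mixed = combined_grade([1.0], [0.5])
```

With $\rho = 0.5$, a series identical to the reference receives a grade of 1, while the uniformly shifted series receives $0.5 / 1.5 = 1/3$, so higher grades indicate closer co-movement with the reference.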

The above procedure is repeated to compute the comprehensive correlation coefficients among all IMF sequences, after which the gray correlation analysis method is applied to classify the IMF sequences according to their volatility frequencies. IMF sequences exhibiting both high correlation and similar volatility frequencies are grouped together, sequentially defined as low-frequency and high-frequency components. After discarding certain high-frequency components containing excessive noise, the low-frequency components and the remaining high-frequency components are summed and reconstructed, respectively [12].

2.3. Feature Extraction

In this section, we propose a dual-stream feature extraction network, as shown in Figure 3, composed of two complementary branches. Specifically, the temporal sequence branch extracts dynamic evolution features from univariate asset price series to characterize temporal patterns and volatility dynamics, while the correlation branch focuses on uncovering the interactions and latent dependency structures among assets [9]. The network, adopting the parallel dual-stream model, can integrate temporal dynamics and cross-sectional correlation features within a unified framework, providing more diverse feature representations for subsequent forecast and decision-making.

In the temporal sequence branch, this study employs TCN to extract features from asset price series [13]. Compared with the traditional Convolutional Neural Network (CNN), TCN overcomes the limitations of restricted receptive fields in convolution kernels and the difficulty of capturing long-term dependencies in sequences (Figure 4a). At the same time, relative to RNN and their variants, TCN eliminates the constraint of sequential recursive computation, enabling highly parallelized training that significantly improves computational efficiency while preserving modeling capacity. The overall structure of TCN is illustrated in Figure 4. The core of TCN lies in its dilated causal convolutional architecture, as shown in Figure 4b. Causal convolutions effectively prevent information leakage from future time steps, ensuring the validity of prediction tasks, while dilated convolutions expand the receptive field through spaced sampling, thereby allowing efficient modeling of long-term dependencies in time series. In addition, residual connections are incorporated into the network design, which not only enhances model robustness but also helps alleviate the vanishing gradient problem in deep network training. Overall, TCN demonstrates distinct advantages in capturing long-range dependencies, improving computational efficiency, and maintaining training stability, making it particularly well-suited for feature extraction from financial asset price series.

As shown on the right of Figure 3, the temporal feature extraction network constructed in this paper consists of several components, including an input layer, dilated causal convolution layers, weight normalization, a ReLU activation function, a dropout layer, and residual connections. The input layer receives the sequential data with dimensions $[N, T, C]$, where $N$ denotes the number of assets, $T$ represents the number of time steps, and $C$ indicates the number of input features. Subsequently, dilated causal convolution is applied to ensure that the output at each time step depends only on the current and past information, thereby preserving temporal causality. At the same time, dilated convolution expands the receptive field without significantly increasing computational complexity. In this model, the dilation rates are set to 2 and 4. The ReLU activation function is then employed to enhance the network’s expressive capacity through sparse activation while alleviating the vanishing gradient problem. A dropout layer follows to prevent overfitting by randomly deactivating a proportion of neurons during training, where the dropout rate is set to $p = 0.2$. Next, residual connections are introduced by adding the input of the convolutional layer directly to its output, forming a residual structure that mitigates gradient vanishing and facilitates the training of deeper architectures. The above operations are repeated more than twice to extract temporal features from the input data more effectively, resulting in an output tensor of dimension $[N, T, M]$, where $M$ denotes the number of convolutional output channels. Finally, a one-dimensional convolution is applied to compress the three-dimensional output into a tensor of size $[N, 1, M]$ for subsequent computation. In the proposed model, $M$ is set to 8. In the temporal stock price feature extraction network, the computation of each layer can be expressed as follows:

$$
y_1 = \mathrm{Conv}\left( \mathrm{ReLU}\left( \mathrm{Dropout}\left( \mathrm{WeightNorm}\left( F(x') \right) \right) \right) + x \right)
\tag{14}
$$

$$
x' = \mathrm{ReLU}\left( \mathrm{Dropout}\left( \mathrm{WeightNorm}\left( F(x) \right) \right) \right)
\tag{15}
$$

where $x$ denotes the multi-scale representation of the input stock prices, $F$ is the dilated causal convolution operation, and $y_1$ denotes the output temporal features.
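The dilated causal convolution $F$ can be illustrated with a small single-channel sketch. Weight normalization and dropout are omitted, the kernel size of 3 and the all-ones kernel are assumptions, and the function names are hypothetical; only the dilation rates 2 and 4 come from the text.

```python
import numpy as np

def dilated_causal_conv1d(x, weights, dilation):
    """y[t] = sum_j weights[j] * x[t - j * dilation], with zeros before the series starts."""
    x = np.asarray(x, dtype=float)
    pad = (len(weights) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])   # left padding only, so no future leakage
    y = np.zeros_like(x)
    for j, w in enumerate(weights):
        start = pad - j * dilation
        y += w * padded[start : start + len(x)]
    return y

def residual_block(x, weights, dilation):
    """ReLU(F(x)) + x: a stripped-down residual unit."""
    return np.maximum(dilated_causal_conv1d(x, weights, dilation), 0.0) + np.asarray(x, dtype=float)

kernel = np.ones(3)

# Impulse response of two stacked layers with dilation rates 2 and 4.
impulse = np.zeros(32)
impulse[5] = 1.0
response = dilated_causal_conv1d(dilated_causal_conv1d(impulse, kernel, 2), kernel, 4)
touched = np.nonzero(response)[0]

# Causality check on the residual stack: editing inputs from t = 20 onward
# must leave outputs before t = 20 unchanged.
rng = np.random.default_rng(1)
series = rng.normal(size=32)
edited = series.copy()
edited[20:] += 5.0
out_a = residual_block(residual_block(series, kernel, 2), kernel, 4)
out_b = residual_block(residual_block(edited, kernel, 2), kernel, 4)
```

Under these assumptions the impulse at $t = 5$ influences outputs from $t = 5$ to $t = 17$ only, which corresponds to a receptive field of $1 + 2 \cdot 2 + 2 \cdot 4 = 13$ steps reaching strictly into the past.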

In the correlation branch, the focus is on capturing the interactions and dynamic dependencies among different assets. While the standard Transformer model, with its multi-head attention mechanism, excels at modeling nonlinear correlations in natural language processing and general sequence tasks, its direct application to financial portfolio problems faces notable limitations. Specifically, the self-attention mechanism in the standard Transformer relies primarily on point-to-point similarity measures, making it sensitive to local noise and less effective at capturing local contextual patterns in price sequences. As a result, Transformers alone are insufficient for extracting meaningful correlations between assets. To address this challenge, we incorporate the RAT [9] into the correlation branch. RAT extends the standard Transformer with two key enhancements. First, the Sequential Attention Layer strengthens the modeling of local dependencies through context-aware attention, effectively suppressing short-term noise while preserving the ability to capture long-term relationships in price sequences. Second, the Relation Attention Layer explicitly models dynamic correlations between assets, revealing systemic risks and market co-movement patterns. The overall structure of the RAT is illustrated in Figure 5, and the algorithm process is described as follows.

First, the feature maps output by the dilated causal convolutional module are projected through different linear transformations to obtain the query matrix $Q$, key matrix $K$, and value matrix $V$. By matrix splitting, $Q$, $K$, and $V$ are mapped into $h$ heads, generating multi-head representations that capture pairwise price correlation features among assets, denoted as $q$, $k$, and $v$. These linear transformations and multi-head operations are formulated in Equations (16)–(20):

$$
Q = \mathrm{Linear}^{q}(x)
\tag{16}
$$

$$
K = \mathrm{Linear}^{k}(x)
\tag{17}
$$

$$
V = \mathrm{Linear}^{v}(x)
\tag{18}
$$

$$
[q_1, q_2, \dots, q_h] = \mathrm{MultiHead}(Q)
\tag{19}
$$

$$
[(k_1, v_1), (k_2, v_2), \dots, (k_h, v_h)] = \mathrm{MultiHead}(K, V)
\tag{20}
$$

Next, the matrices $q$ and $k$ are multiplied within each head to obtain the attention distribution $E$, which represents the relative importance of each asset with respect to all other assets. To prevent excessively large inner products, each element $e$ in $E$ is scaled by dividing by $\sqrt{d}$, where $d$ denotes the dimensionality of each head.

Subsequently, the SoftMax function is applied to normalize the attention distribution $E$, ensuring that the attention weights of each asset over all assets sum to 1. Finally, the normalized attention distribution in each head is multiplied by the corresponding value matrix $v$. The above operations are formulated in Equations (21) and (22):

$$
E_i = \frac{q_i^{T} k_i}{\sqrt{d}}
\tag{21}
$$

$$
M_i = \mathrm{Softmax}(E_i)\, v_i
\tag{22}
$$
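Equations (16)–(22) can be sketched with NumPy as follows. The function names, the random projection matrices, and the sizes ($N = 5$ assets, model width 8, $h = 2$ heads) are assumptions for illustration. The sketch stores one asset per row, so the product `q @ k.T` plays the role of $q_i^{T} k_i$ in Equation (21); the final convolution over the concatenated heads is not included.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(m, heads):
    """Reshape (N, d_model) into (heads, N, d_model // heads)."""
    n, d_model = m.shape
    return m.reshape(n, heads, d_model // heads).transpose(1, 0, 2)

def multi_head_attention(x, w_q, w_k, w_v, heads):
    n, d_model = x.shape
    d = d_model // heads
    q = split_heads(x @ w_q, heads)              # Equations (16) and (19)
    k = split_heads(x @ w_k, heads)              # Equations (17) and (20)
    v = split_heads(x @ w_v, heads)              # Equations (18) and (20)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # Equation (21), shape (heads, N, N)
    attn = softmax(scores, axis=-1)              # each asset's weights over all assets
    per_head = attn @ v                          # Equation (22), shape (heads, N, d)
    merged = per_head.transpose(1, 0, 2).reshape(n, d_model)   # concatenate the heads
    return merged, attn

rng = np.random.default_rng(0)
n_assets, d_model, heads = 5, 8, 2
x = rng.normal(size=(n_assets, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
merged, attn = multi_head_attention(x, w_q, w_k, w_v, heads)
```

Each row of `attn[i]` is a probability distribution over assets for head `i`, which is the normalization property described for the SoftMax step.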

Finally, the outputs from all heads are concatenated and passed through a linear convolution layer to obtain the final attention representation, which reflects the attention value of each asset with respect to all other assets across all heads. This process is expressed in Equation (23):

$$
\mathrm{Attention} = \mathrm{Conv}\left( \mathrm{Concat}(M_1, M_2, \dots, M_h) \right)
\tag{23}
$$

The extracted inter-asset correlation features can therefore be expressed as shown in Equation (24):

$$
y_2 = \mathrm{ReLU}\left( \mathrm{Dropout}(\mathrm{Attention}) \right)
\tag{24}
$$

As shown in Figure 5, the RAT retains the fundamental components of the Transformer, including positional encoding, feed-forward layers, and layer normalization, ensuring that the model remains generalizable while being adapted to the financial context.

By integrating RAT, the correlation branch can simultaneously capture complex inter-asset dependencies and local price patterns within a unified framework [14]. A concatenation operation is employed to integrate these two types of features, and the operation can be expressed as follows:

$$
y^{t} = y_1^{t} \oplus y_2^{t} \oplus \omega^{t}
\tag{25}
$$

where $y_1^{t} \in \mathbb{R}^{N \times 8}$ denotes the temporal features extracted at time $t$, $y_2^{t} \in \mathbb{R}^{N \times 8}$ denotes the inter-asset correlation features extracted at time $t$, and $\omega^{t} \in \mathbb{R}^{N \times 1}$ denotes the portfolio weight vector at time $t$. The integrated feature obtained via concatenation is $y^{t} \in \mathbb{R}^{N \times 16}$.

2.4. Decision-Making Net

As a result of the dual-stream feature extraction, the extracted time series features and asset correlation features are input into the decision network based on reinforcement learning to generate the optimal portfolio allocation strategy.

This decision network is designed as a policy network that directly outputs continuous weights for each asset. In this study, the policy network is trained using the DPG algorithm (Figure 6), which is highly suitable for handling continuous action spaces and allows direct optimization of asset allocations. To further enhance the stability of the investment portfolio and reduce transaction costs, the strategy network introduces a recursive mechanism. When generating trading decisions for the current period, it takes the trading decisions of the previous period as input references. This helps maintain consistency across consecutive portfolio allocations and mitigates the risk of incurring large transaction costs [15]. When computing asset weights, the network first applies a convolutional operation to integrate feature information across all assets, ensuring that the decision for a single asset considers the characteristics of the entire portfolio. The resulting weights are then normalized using a SoftMax function, producing a final allocation vector that satisfies the budget constraint and preserves relative importance across assets [16]. The algorithmic principle is described as follows:

After the denoising module and the dual-stream feature extraction networks, the concatenated features

s_{t} = y^{t} \in R^{N \times 17}

are obtained to construct the reinforcement learning state. The state is further reshaped into a vector form and fed into a DPG framework.

The actor network parameterized by

θ^{μ}

outputs continuous portfolio weights:

w_{t} = μ (s_{t} | θ^{μ})

(26)

where

w_{t} \in R^{N}

is normalized through a Softmax operation to satisfy the budget constraint

\sum_{i = 1}^{N} w_{t, i} = 1

. During training, Gaussian exploration noise is added to enhance policy exploration:

\tilde{w}_{t} = μ (s_{t}) + ε_{t}, \quad ε_{t} \sim N (0, σ_{t}^{2})

(27)
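As an illustration of Equations (26) and (27), the following minimal sketch normalizes raw actor scores with a Softmax and then perturbs the result with Gaussian noise. The source does not state how the perturbed weights are mapped back onto the budget constraint; the clip-and-renormalize step, the function names, and the example scores are assumptions of this sketch.

```python
import math
import random


def softmax(scores):
    """Map raw per-asset scores to non-negative weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_weights(weights, sigma, rng):
    """Add N(0, sigma^2) noise to each weight, as in Equation (27).

    Assumption of this sketch: negative entries are clipped to zero and
    the vector is renormalized so the budget constraint still holds.
    """
    perturbed = [max(w + rng.gauss(0.0, sigma), 0.0) for w in weights]
    total = sum(perturbed)
    if total == 0.0:  # degenerate case: fall back to the unperturbed weights
        return list(weights)
    return [p / total for p in perturbed]


rng = random.Random(0)
w = softmax([0.3, -1.2, 0.8, 0.1])           # hypothetical actor scores
w_tilde = noisy_weights(w, sigma=0.1, rng=rng)
print(w, w_tilde)
```

Both vectors remain valid allocations; only the second one is used for exploration during training.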

The reward at time

t

is defined as the portfolio return:

r_{t} = w_{t}^{T} r_{t + 1}

(28)

The critic network parameterized by

θ^{Q}

estimates the action-value function, as expressed by Equation (29):

Q (s_{t}, w_{t} | θ^{Q})

(29)

The temporal-difference target

y_{t}

is defined as follows:

y_{t} = r_{t} + γ Q^{'} (s_{t + 1}, μ^{'} (s_{t + 1}))

(30)

where

Q^{'}

and

μ^{'}

denote target networks. The critic network is updated by minimizing the mean squared error, as expressed in Equation (31):

L (θ^{Q}) = E [(Q (s_{t}, w_{t}) - y_{t})^{2}]

(31)
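The temporal-difference target of Equation (30) and the critic loss of Equation (31) can be sketched on a small batch of scalars. The numerical values and the discount factor below are made up for illustration, since the source does not report the value of γ used in training.

```python
def td_target(reward, gamma, q_next):
    """y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1})), Equation (30)."""
    return reward + gamma * q_next


def critic_loss(q_values, targets):
    """Mean squared error between Q(s_t, w_t) and y_t, Equation (31)."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


gamma = 0.99                      # assumed discount factor
rewards = [0.01, -0.02, 0.005]    # hypothetical portfolio returns
q_next = [1.0, 0.8, 1.2]          # hypothetical target-network outputs
q_now = [1.0, 0.75, 1.1]          # hypothetical critic outputs

targets = [td_target(r, gamma, qn) for r, qn in zip(rewards, q_next)]
loss = critic_loss(q_now, targets)
print(targets, loss)
```

In an actual implementation the targets would be treated as constants, so that gradients flow only through the critic output and not through the target networks.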

The actor network is updated using the deterministic policy gradient:

\nabla_{θ^{μ}} J = E [\nabla_{a} Q (s, a ∣ θ^{Q}) ∣_{a = μ (s)} \nabla_{θ^{μ}} μ (s ∣ θ^{μ})]

(32)

To ensure stable training under noisy and non-stationary financial environments, three stabilization mechanisms are incorporated.

First, the target network is updated via soft updates as expressed in Equation (33):

θ^{'} \leftarrow τ θ + (1 - τ) θ^{'}

(33)

where

τ =

0.001.
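A minimal sketch of the soft update in Equation (33), with network parameters represented as flat lists of floats for simplicity (a real implementation would iterate over parameter tensors):

```python
def soft_update(target_params, online_params, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * p + (1.0 - tau) * tp
            for p, tp in zip(online_params, target_params)]


target = [0.0, 1.0]   # hypothetical target-network parameters
online = [1.0, 1.0]   # hypothetical online-network parameters
target = soft_update(target, online)
print(target)
```

With τ = 0.001 the target network moves only 0.1% of the way toward the online network per update, which keeps the temporal-difference targets slowly varying.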

Second, an experience replay buffer stores the transition tuples

(s_{t}, w_{t}, r_{t}, s_{t + 1})

collected during training. Mini-batches are sampled uniformly at random from the buffer; the batch size is set to 128, the buffer capacity is set to 50,000, and one gradient update is performed for each environment interaction step.
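The replay mechanism described above can be sketched as a fixed-capacity buffer with uniform sampling. The class name and the eviction policy (oldest transitions are dropped first) are assumptions of this sketch; the source only specifies the capacity and the batch size.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s_t, w_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity=50_000, seed=0):
        # A deque with maxlen silently evicts the oldest item when full.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size=128):
        # Uniform sampling without replacement from the stored transitions.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer()
for step in range(60_000):                 # placeholder transitions
    buffer.push(step, 0.0, 0.0, step + 1)
batch = buffer.sample()
print(len(buffer), len(batch))
```

Sampling at random breaks the temporal correlation between consecutive transitions, which is the usual motivation for experience replay.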

Third, an ε-greedy-style exploration strategy adapted to continuous trading actions is applied during training to prevent premature convergence. Specifically, Gaussian noise is added to the actor output to obtain

\tilde{w}_{t}

(Equation (27)). To balance exploration and exploitation, the noise intensity is gradually decayed over training, as expressed in Equation (34):

σ_{t} = \max (σ_{min}, σ_{0} e^{- k t})

(34)

where

σ_{0} = 0.1

is the initial exploration scale,

k > 0

is the decay rate, and

σ_{min} = 0.01

is the lower bound that prevents the policy from becoming completely deterministic during training.
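The decay schedule of Equation (34) is sketched below. The initial scale and the lower bound follow the text; the decay rate k is not reported in the source, so the value used here is purely illustrative.

```python
import math


def exploration_sigma(step, sigma0=0.1, sigma_min=0.01, k=1e-3):
    """sigma_t = max(sigma_min, sigma0 * exp(-k * t)); k is an assumed value."""
    return max(sigma_min, sigma0 * math.exp(-k * step))


schedule = [exploration_sigma(t) for t in (0, 1000, 2302, 2303, 10**6)]
print(schedule)
```

Because σ_0 / σ_min = 10, the floor is reached once e^{-kt} drops below 0.1, that is, after ln(10)/k steps (about 2303 steps for the illustrative k above).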

Overall, through this design, the dual-stream feature representation serves as a structured and informative state input, while the DPG framework dynamically optimizes long-term cumulative returns in a continuous portfolio allocation space with enhanced training stability.

Furthermore, the dynamic allocation procedure of investment portfolio weights is described as follows:

First, the portfolio weight vector at time step

t

can be expressed by Equation (35):

w_{t} = {[w_{t, 1}, w_{t, 2}, \dots, w_{t, N}]}^{T}, \sum_{i = 1}^{N} w_{t, i} = 1, w_{t, i} \geq 0

(35)

where

N

is the number of assets. The weights

w_{t + 1}

evolve over time according to the learned policy function

f_{θ}

:

w_{t + 1} = f_{θ} (w_{t}, r_{t}, s_{t})

(36)

Here,

r_{t}

represents the vector of asset returns at time

t

, and

s_{t}

denotes the market state. The weight

w_{t, i}

is updated iteratively and incrementally:

w_{t + 1, i} = w_{t, i} + Δ w_{t, i}, \sum_{i = 1}^{N} Δ w_{t, i} = 0, i = 1, \dots, N

(37)

where

Δ w_{t, i} = g_{θ} (w_{t}, r_{t}, s_{t})

is the adjustment term, which is determined by the DSCRL policy. The portfolio return at time

t

can be expressed as follows:

R_{t}^{\mathrm{portfolio}} = w_{t}^{T} r_{t}

(38)

DSCRL optimizes the weights

w_{t}

over time to maximize the expected Sharpe ratio:

\max_{θ} E [\frac{E [R_{t}^{\mathrm{portfolio}}]}{\mathrm{Std} [R_{t}^{\mathrm{portfolio}}]}]

(39)

The formula demonstrates how DSCRL dynamically adjusts the asset allocation based on market conditions.

2.5. Portfolio Selection via RL

2.5.1. Problem Formulation

The portfolio selection problem in this study involves m assets traded over n periods, with each asset described by five attributes: opening price, highest price, closing price, lowest price, and volume. The sequential portfolio selection decision is modeled as a Markov decision process

(S, A, T, R)

, where

S

is the state space,

A

is the action space,

T (s' | s, α)

is the state transition function and

R (s, α)

is the reward function. The portfolio selection problem is cast as this four-tuple Markov decision process as follows: at time

t

, input the asset features

{P_{t - T}, P_{t - T + 1}, \dots P_{t}}

as the observation state

s_{t}

. The agent generates the trading decision

α_{t} = {[α_{t, 1}, α_{t, 2}, \dots, α_{t, m}]}^{T}

based on

s_{t}

using the policy

π (α ∣ s)

. Here,

α_{t, i}

represents the proportion of asset i in total wealth at time

t

, and

\sum_{i = 1}^{m} α_{t, i} = 1

. The trading agent then receives the reward

r_{t} = R (s_{t}, α_{t})

and the environment moves to the next-period state

s_{t + 1} \sim T (\cdot ∣ s_{t}, α_{t})

as a consequence of the action taken in the current state.

In addition, two general assumptions are made in this task: (1) any trade can be executed in the market at any time; (2) no trade executed by the agent has any influence on the financial market [17].

2.5.2. Agent Learning

This research employs the DPG algorithm to train the policy network

π_{θ} (α ∣ s)

with the goal of maximizing the reward function [16]. The relative price vector

x_{t} = \frac{p_{t}^{c}}{p_{t - 1}^{c}}

represents the price change in each asset in period t, where

p_{t}^{c}

represents the closing price. Then, using the logarithmic rate of return, the reward function is designed based on the long-term cumulative return of the portfolio,

R = \frac{1}{n} \sum_{t = 1}^{n} \ln (α_{t}^{T} x_{t} (1 - c_{t}))

, where

c_{t} = γ \sum_{i = 1}^{m} | α_{t, i} - α_{t - 1, i} |

, and

γ

is the transaction fee rate. Finally, with the goal of maximizing this reward function, the gradient ascent update

θ \leftarrow θ + α \frac{\partial R}{\partial θ}

(where α here denotes the learning rate rather than the action) is used to update the policy network parameters. The pseudocode for the learning process is shown in Algorithm 1.

Algorithm 1: DPG-based Feature Extraction Process
	Input: Policy network $μ (s | θ)$, value network $Q (s, α | ω)$, learning rate $α_{actor}$, discount factor $γ$, and total training episodes M.
	Output: Optimized policy parameters $θ$ and value parameters $ω$.
	1. Initialize the policy network $μ (s | θ)$ and value network $Q (s, α | ω)$ with parameters $θ$ and $ω$.
	2. For each episode $t$ = 1 to M:
	3.     Initialize the environment and get initial state $s_{1}$.
	4.     Select action $α_{t} = μ (s_{t} | θ)$.
	5.     Execute $α_{t}$ in the environment.
	6.     Observe reward $r_{t}$ and next state $s_{t + 1}$.
	7.     Compute target value $y_{t} = r_{t} + γ Q (s_{t + 1}, μ (s_{t + 1} | θ) | ω)$
	8.     Update value network by minimizing the loss: $L = {(y_{t} - Q (s_{t}, α_{t} | ω))}^{2}$
	9.     Compute the deterministic policy gradient (minibatch of N samples): $\nabla_{θ} J = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{α} Q (s_{i}, α | ω) |_{α = μ (s_{i} | θ)} \nabla_{θ} μ (s_{i} | θ)$
	10.     Update policy parameters $θ \leftarrow θ + α_{actor} \nabla_{θ} J$
	11.     Set state $s_{t} \leftarrow s_{t + 1}$
	12. End for
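The reward of Section 2.5.2 can be sketched as follows, assuming the logarithmic form that the text describes ("using logarithmic rate of return") and a transaction cost proportional to the turnover between consecutive allocations. The function names, the initial allocation, and the example prices are hypothetical.

```python
import math


def transaction_cost(alpha_t, alpha_prev, fee_rate):
    """c_t = fee_rate * sum_i |alpha_{t,i} - alpha_{t-1,i}|."""
    return fee_rate * sum(abs(a - b) for a, b in zip(alpha_t, alpha_prev))


def average_log_reward(alphas, rel_prices, fee_rate, alpha_init):
    """R = (1/n) * sum_t ln(alpha_t^T x_t * (1 - c_t))."""
    total, prev = 0.0, alpha_init
    for alpha_t, x_t in zip(alphas, rel_prices):
        c_t = transaction_cost(alpha_t, prev, fee_rate)
        gross = sum(a * x for a, x in zip(alpha_t, x_t))  # alpha_t^T x_t
        total += math.log(gross * (1.0 - c_t))
        prev = alpha_t
    return total / len(alphas)


alphas = [[0.5, 0.5], [0.6, 0.4]]          # hypothetical allocations
rel_prices = [[1.02, 0.99], [1.00, 1.01]]  # hypothetical x_t = p_t / p_{t-1}
R = average_log_reward(alphas, rel_prices, fee_rate=0.0025,
                       alpha_init=[0.5, 0.5])
print(R)
```

In the second period the allocation shifts by 0.1 in each asset, so the cost is 0.0025 × 0.2 = 0.0005 and the period's growth factor of 1.004 is scaled by 0.9995 before the logarithm is taken.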

3. Experimental Results

To verify our proposed method, we evaluate DSCRL in terms of three main aspects: (1) profitability on real-world datasets; (2) validity of feature representation of each component; (3) sensitivity to transaction cost.

3.1. Experimental Datasets and Settings

Datasets: A subset of thirty stocks from the S&P 500 index and the eleven cryptocurrencies with the highest trading volume, excluding Bitcoin, are chosen for separate experiments. The datasets are divided as detailed in Table 1, with the validation set used to identify the number of high-frequency

imf

components to be omitted. The objective of investment in the stock market is to generate cash, whereas the objective in the cryptocurrency market is to accumulate Bitcoin [16,18]. Stock data for S&P 500 constituents is obtained from Tushare (https://tushare.pro/, accessed on 17 April 2025), while cryptocurrency data is derived from Poloniex (https://poloniex.com/, accessed on 18 April 2025).

Implementation details: Table 2 gives the experimental settings and detailed network architecture used in DSCRL, including model hyperparameters and tensor transformation configurations. DSCRL is implemented in PyTorch 2.6 and trained with the Adam optimizer; PyCharm 2024.1 (Professional Edition) was used as the development environment and Navicat Premium 17 for database management. All experiments are conducted on an NVIDIA RTX 3060 GPU. The number of training steps is 20,000 on the stock dataset and 80,000 on the cryptocurrency dataset; the batch size is 128, the learning rate is 0.0001, the transaction cost rate is 0.25%, and the time window is 31.

Metrics: We use three metrics to evaluate performance [19]: (1) Accumulated Portfolio Value:

APV = S_{n} = S_{0} \prod_{t = 1}^{n} (1 - c_{t}) p_{t}^{T} w_{t}

, where

S_{0} = 1

; (2) Sharpe Ratio:

SR = \frac{\mathrm{Average} (r_{t}^{c})}{\mathrm{std} (r_{t}^{c})},

where

r_{t}^{c}

denotes the logarithmic return of the portfolio in period

t

; (3) Calmar Ratio:

CR = \frac{S_{n}}{MDD}

, where MDD denotes the largest loss from a peak to a subsequent trough and is calculated via

MDD = \max_{t : τ > t} \frac{S_{t} - S_{τ}}{S_{t}}

. These metrics assess the profitability and risk management capabilities of the strategy, offering a comprehensive and efficient evaluation of the model’s performance.
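The three metrics can be computed from a sequence of per-period growth factors, as in the following sketch. Each growth factor stands for the term (1 - c_t) p_t^T w_t of the APV formula; the example values are made up, and the use of the population standard deviation in the Sharpe ratio is an assumption, since the source does not say which estimator is used.

```python
import math
import statistics


def portfolio_values(growth_factors, s0=1.0):
    """Wealth path S_0, S_1, ..., S_n; the last entry is the APV."""
    values = [s0]
    for g in growth_factors:
        values.append(values[-1] * g)
    return values


def sharpe_ratio(log_returns):
    """Mean of the log returns divided by their standard deviation."""
    return statistics.mean(log_returns) / statistics.pstdev(log_returns)


def max_drawdown(values):
    """Largest relative loss from a running peak to a later trough."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


growth = [1.10, 0.90, 1.20, 0.95]      # hypothetical growth factors
values = portfolio_values(growth)
apv = values[-1]
sr = sharpe_ratio([math.log(g) for g in growth])
mdd = max_drawdown(values)
cr = apv / mdd                          # Calmar ratio as defined in the text
print(apv, sr, mdd, cr)
```

For this path the wealth goes 1.0, 1.1, 0.99, 1.188, 1.1286, so the largest drawdown is the 10% fall from 1.1 to 0.99 and the Calmar ratio is 1.1286 / 0.1.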

3.2. Comparative Methods

To comprehensively evaluate the effectiveness of the proposed framework, we compare it with a diverse set of baseline methods. Specifically, the baseline methods are categorized into three groups based on their methodological paradigms: traditional non-reinforcement learning methods, deep reinforcement learning-based portfolio models, and general-purpose reinforcement learning algorithms.

Traditional non-reinforcement learning methods: These approaches do not employ reinforcement learning frameworks; instead, they allocate assets based on statistical rules or online updating mechanisms. This category includes the best strategy in hindsight (Best), UP [20], Anticor [21], OLMAR [17], and Uniform Buy-And-Hold (UBAH).

Deep reinforcement learning methods: This category formulates the portfolio optimization problem as an MDP and employs deep neural networks for policy function approximation. It includes EIIE [18], RAT [9], PPN [16], and Ensemble [22]. These models typically adopt policy gradient or actor-critic frameworks to learn continuous asset allocation strategies in dynamic market environments. By integrating reinforcement learning mechanisms, these methods aim to enhance decision-making performance and have become a new research direction in portfolio management.

Classical reinforcement learning algorithms: To evaluate the robustness and effectiveness of the proposed framework, we also conduct comparative experiments using three alternative reinforcement learning algorithms: DQN [23], DDPG [24], and PPO [25]. Specifically, we consider:

Deep Q-Network (DQN): A value-based method suitable for discrete action spaces, adapted here for simplified portfolio discretization.

Deep Deterministic Policy Gradient (DDPG): An actor-critic extension of DPG that incorporates deep neural networks to handle high-dimensional continuous action spaces more effectively.

Proximal Policy Optimization (PPO): A policy gradient method that uses clipped objective functions to stabilize training and improve sample efficiency.

By comparing the performance of DPG with these alternative algorithms, we aim to demonstrate the advantages of DPG in continuous portfolio allocation tasks, while assessing the relative merits of value-based, actor-critic, and policy gradient approaches in financial trading scenarios.

3.3. Experimental Results and Analysis

3.3.1. Overall Experiment Results

Table 3, Figure 7 and Figure 8 present the backtesting results obtained on the S&P 500 constituent-stock dataset and the cryptocurrency dataset. For the comparative non-reinforcement learning methods, deep reinforcement learning models, and classical reinforcement learning algorithms, most of the hyperparameters are set according to the configurations provided in the original studies. Where such configuration information is not explicitly provided, the remaining hyperparameters are determined by the procedures given in the corresponding studies.

As shown, the proposed DSCRL model achieves superior overall performance across three evaluation metrics, consistently outperforming all baseline methods. Specifically, DSCRL demonstrates substantial improvements in return-related measures while maintaining effective risk control, validating its robust decision-making capability in both traditional financial markets and highly volatile cryptocurrency environments. These findings underscore the model’s effectiveness in balancing return maximization and risk mitigation.

On the S&P 500 dataset, DSCRL’s APV exhibits a steady upward trend after about 200 steps (16 June 2023), followed by two pronounced increases. PPO shows a similar pattern and outperforms the other benchmark methods. These movements coincide with improving macroeconomic conditions, including lower-than-expected CPI data in mid-June, the Federal Reserve’s pause in rate hikes, and strong economic indicators that boosted stock market confidence. In July, amid sustained optimism about artificial intelligence, declining inflation expectations and the robust earnings of major technology companies further propelled the market’s upward trajectory. The upward momentum persisted after 27 July, when the anticipated interest rate hike was confirmed.

On the cryptocurrency dataset, only the reinforcement learning and deep reinforcement learning methods achieve relatively strong performance. Although EIIE, RAT, and PPN exhibit stable growth, their cumulative gains remain below those of DSCRL. DQN and DDPG show significant volatility, while PPO demonstrates a relatively stable performance but still underperforms DSCRL. From June to August 2023, the Bitcoin market experienced a period of high volatility. Regulatory actions against Binance and Coinbase triggered early declines, while BlackRock’s spot Bitcoin ETF application and subsequent institutional participation boosted market sentiment. However, the rising U.S. Treasury yields and a stronger dollar in August led to renewed downward pressure.

To further evaluate the robustness of the performance improvements, we conduct paired t-tests to compare DSCRL with PPO and Anticor across multiple independent runs with different random seeds. Specifically, the training set, validation set, test set and hyperparameters are kept fixed. We then vary only the random seed to train DSCRL, PPO, and Anticor, and then record their respective Sharpe ratios on the test set. Subsequently, we compute the Sharpe ratio differences between DSCRL and PPO, and between DSCRL and Anticor, denoted as

d_{1}

and

d_{2}

, respectively. The corresponding t-statistics are calculated:

t = \frac{\bar{d}}{s_{d} / \sqrt{n}}

(40)

where

\bar{d}

denotes the mean of those paired Sharpe ratio differences between DSCRL and the comparative method,

s_{d}

represents the standard deviation of the differences, and

n

is the number of random seeds. The results on the two market datasets indicate that the performance improvement of DSCRL is statistically significant at the 5% significance level (p < 0.05); the results on S&P 500 are more stable, whereas those on the cryptocurrency market exhibit higher volatility.
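The statistic of Equation (40) can be computed as in the sketch below. The Sharpe ratios are invented for illustration and are not the values obtained in the experiments; the sample standard deviation (divisor n - 1) is used for s_d, as is standard for a paired t-test.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t = mean(d) / (s_d / sqrt(n)) for the paired differences d = a - b."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))


# Hypothetical test-set Sharpe ratios over five random seeds.
dscrl = [3.70, 3.80, 3.60, 3.90, 3.75]
ppo = [2.50, 2.40, 2.60, 2.45, 2.55]

t_stat = paired_t_statistic(dscrl, ppo)
print(t_stat)
```

The statistic is compared with the critical value of a t-distribution with n - 1 degrees of freedom; for n = 5 seeds and a two-sided 5% test, that critical value is 2.776.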

In summary, DSCRL can effectively capture the impact of shifts in market sentiment and unexpected events in the stock market, while demonstrating strong resilience to high volatility in the cryptocurrency market. Although its performance is more pronounced in the traditional stock market than in the crypto market, DSCRL still achieves relatively better performance in cryptocurrencies with the balance between return and risk considered. Moreover, the experimental results demonstrate that the introduction of reinforcement learning makes significant contributions to enhance the performance of investment decision-making. The results also confirm that the denoising module does not excessively suppress informative signals during regime transitions, while remaining capable of preserving and capturing market dynamics.

3.3.2. Ablation Experiment Results

Ablation studies: To evaluate the feature representation of each component in DSCRL, we designed the ablation experiment as follows:

DSCRL-D: Compared to DSCRL, the data noise reduction component was removed to evaluate the efficacy of the data noise reduction method in enhancing portfolio efficiency.

DSCRL-S: Compared to DSCRL, an LSTM replaces the sequence feature extraction network, in order to evaluate that network’s representational ability.

DSCRL-C: Compared to DSCRL, Pearson correlation coefficients replace the correlation feature extraction network for extracting inter-asset correlation features, in order to assess the effectiveness of that network.

DSC-MV: Compared to DSCRL, we construct a mean–variance variant that preserves the entire feature extraction pipeline while replacing the reward-driven optimization with the Markowitz mean–variance objective to isolate the contribution of reinforcement learning. Specifically, at time step

t

, the denoising module and dual-stream feature extraction network of DSCRL are first employed to obtain the state representation

y^{t}

. A linear layer is then used to predict the next period return vector

μ_{t}

. Finally, a mean–variance optimization problem is formulated based on the predicted returns:

\max_{ω} \{ ω^{T} μ_{t} - λ ω^{T} Σ_{t} ω \}

(41)

where

λ

denotes the risk aversion coefficient,

Σ_{t}

is estimated from historical window returns, and the optimization is subject to the constraints of

\sum_{i} ω_{i} = 1, ω \geq 0

. The resulting optimal portfolio weights are denoted as

ω_{t}^{M V} .
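One simple way to solve the long-only, fully invested problem of Equation (41) is projected gradient ascent on the probability simplex, sketched below. The source does not state which solver is used for DSC-MV, so the solver, the step size, the iteration count, and the example inputs are all assumptions of this sketch.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : sum(w) = 1, w >= 0}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:       # holds for a prefix of the sorted values
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def mean_variance_weights(mu, cov, risk_aversion, steps=500, lr=1.0):
    """Maximize w^T mu - lambda * w^T cov w subject to the simplex constraint."""
    n = len(mu)
    w = [1.0 / n] * n
    for _ in range(steps):
        # Gradient of the objective: mu - 2 * lambda * cov @ w
        grad = [mu[i] - 2.0 * risk_aversion *
                sum(cov[i][j] * w[j] for j in range(n)) for i in range(n)]
        w = project_to_simplex([w[i] + lr * grad[i] for i in range(n)])
    return w


mu = [0.10, 0.05]                     # hypothetical predicted returns
cov = [[0.04, 0.0], [0.0, 0.01]]      # hypothetical covariance estimate
w_mv = mean_variance_weights(mu, cov, risk_aversion=1.0)
print(w_mv)
```

For this two-asset example the optimum can be checked by hand: substituting ω_2 = 1 - ω_1 gives the objective -0.05 ω_1^2 + 0.07 ω_1 + 0.04, which is maximized at ω_1 = 0.7.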

Theoretically, reinforcement learning may outperform mean–variance optimization for several reasons. Mean–variance is a single-period static framework, whereas reinforcement learning optimizes cumulative long-term returns in a dynamic setting. Moreover, mean–variance relies only on the first and second moments of returns and assumes quadratic risk preferences, which may not adequately capture nonlinear market dynamics. In contrast, reinforcement learning can model complex nonlinear relationships without restrictive distributional assumptions, and it can deal with more flexible risk control through reward design, incorporating transaction costs and drawdown considerations within a unified framework.

Ablation experiments were conducted using four degraded variants, and the corresponding results are reported in Table 4 and Figure 9. The findings indicate that financial noise reduction, temporal sequence feature extraction, inter-asset correlation modeling and RL each contribute positively to the overall performance of the DSCRL model, confirming the effectiveness and necessity of these four components.

More specifically, the denoising module appears to contribute substantially to the overall performance improvement. Relative to DSCRL-D, incorporating the denoising component raises APV by 30.5% on S&P 500 and by about 42% on the cryptocurrency dataset. DSCRL-S, DSCRL-C, and DSC-MV likewise underperform DSCRL on both markets; on S&P 500, their performance decreases by approximately 16.4%, 22.6%, and 7%, respectively. These results verify that data denoising plays a critical role in enhancing the robustness of financial time series modeling, and that reinforcement learning adds significant value to the proposed framework.

Furthermore, improvements in SR and CR on S&P 500 also reflect the positive contribution of the denoising module. However, the improvement brought by the denoising module is less consistent on the cryptocurrency dataset. In cryptocurrency markets, greater emphasis may be required on optimizing the temporal feature extraction network and the inter-asset correlation modeling component to enhance the risk resistance capability of investment portfolios. Overall, integrating all modules leads to substantial performance gains, demonstrating that their joint optimization is crucial for achieving robust portfolio management.

3.3.3. Sensitivity Analysis Results

Transaction cost is one of the most important factors affecting investment returns. In the above experiments, the transaction cost rate was set to 0.25%. To evaluate the sensitivity of DSCRL to the transaction cost rate, DSCRL is compared with DQN, DDPG, PPO, RAT, and PPN at different transaction cost levels (c = 0.01%, 0.1%, 0.25%, and 1%). The results on the different datasets are presented in Table 5 and Figure 10. They indicate that transaction costs significantly affect portfolio returns, and that DSCRL outperforms the competing models across the transaction cost scenarios, demonstrating its stability and superiority.

4. Conclusions

A novel portfolio selection model DSCRL is proposed to enhance portfolio optimization performance by efficient feature extraction and intelligent decision-making. The model captures both the sequential and correlation features of assets, effectively reduces financial data noise, leverages an MDP for portfolio optimization, and trains the strategy network via RL to enable precise decision-making in complex financial markets. Empirical results demonstrate that DSCRL exhibits significant advantages in feature extraction, excess return generation, and risk management when applied to both S&P 500 constituent stocks and cryptocurrencies, highlighting its effectiveness and practical value in solving the portfolio selection problem. Moreover, DSCRL’s dual benefits extend beyond return enhancement, also reflecting robustness and risk-aversion capabilities, indicating strong adaptability across different market conditions.

Although DSCRL achieves the best performance among the benchmark models, several limitations deserve further research. First, regarding data, our study relies on historical market data for training and evaluation. Given the inherent non-stationarity of financial markets and the impact of extreme market shocks, the robustness of the adopted denoising method and the generalization capability of DSCRL require further validation. The experimental performance of DSCRL has also not yet been thoroughly compared against standard preprocessing methods; these issues remain to be addressed in future research. Second, regarding the modeling methodology, the portfolio optimization task is formulated as an MDP in DSCRL, assuming that market states can be adequately characterized by a finite historical window. However, real-world markets may exhibit long-term dependencies or non-Markovian dynamics. The dual-stream architecture enhances feature representation, but it also increases model complexity and computational cost. Future research will further explore the relationship between financial news texts and asset price movements from the perspective of investor sentiment analysis, incorporating graph-theoretical methods to develop more efficient and intelligent portfolio models. Third, several modules rely on empirically tuned hyperparameters and lack rigorous theoretical optimality analysis. In the future, the limitations of the current framework will be further addressed and refined, to provide deeper theoretical insights and practical guidance for financial decision-making.

Author Contributions

N.G. mainly provides the overall idea of the article and determines the structure of the article; Y.L. is responsible for conducting the experiments and the main writing; Y.H. is responsible for designing specific models and writing the empirical part; J.Z. is responsible for collecting, cleaning and organizing the data; L.Z. is responsible for data analysis and writing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Education of Humanities and Social Science Project of China (No. 21YJCZH030), the Nature Science Foundation of Shaanxi Province (No. 2024JC-YBMS-601) and the Key Research and Development Program of Shaanxi Province (No. 2023-YBSF-28).

Data Availability Statement

Stock data for S&P 500 constituents is obtained from Tushare (https://tushare.pro/, accessed on 17 April 2025), while cryptocurrency data is derived from Poloniex (https://poloniex.com/, accessed on 18 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Markowitz, H. Portfolio Selection. J. Financ. 1952, 7, 77.
2. Fischer, T.; Krauss, C. Deep Learning with Long Short-Term Memory Networks for Financial Market Predictions. Eur. J. Oper. Res. 2018, 270, 654–669.
3. Duan, Y.; Wang, L.; Zhang, Q.; Li, J. FactorVAE: A Probabilistic Dynamic Factor Model Based on Variational Autoencoder for Predicting Cross-Sectional Stock Returns. AAAI 2022, 36, 4468–4476.
4. Wang, Z.; Huang, B.; Tu, S.; Zhang, K.; Xu, L. DeepTrader: A Deep Reinforcement Learning Approach for Risk-Return Balanced Portfolio Management with Market Conditions Embedding. AAAI 2021, 35, 643–650.
5. Zhao, T.; Ma, X.; Li, X.; Zhang, C. Asset Correlation Based Deep Reinforcement Learning for the Portfolio Selection. Expert Syst. Appl. 2023, 221, 119707.
6. Ma, Y.; Han, R.; Wang, W. Portfolio Optimization with Return Prediction Using Deep Learning and Machine Learning. Expert Syst. Appl. 2021, 165, 113973.
7. Willman, P.; Fenton-O’Creevy, M.; Nicholson, N.; Soane, E. Noise Trading and the Management of Operational Risk; Firms, Traders and Irrationality in Financial Markets. J. Manag. Stud. 2006, 43, 1357–1374.
8. Dragomiretskiy, K.; Zosso, D. IEEE Transactions on Signal Processing Publication Information. IEEE Trans. Signal Process. 2013, 61, C2.
9. Xu, K.; Zhang, Y.; Ye, D.; Zhao, P.; Tan, M. Relation-Aware Transformer for Portfolio Policy Learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence; International Joint Conferences on Artificial Intelligence Organization, Yokohama, Japan, 11 July 2020; pp. 4647–4653.
10. Xue, J.; Shen, B. A novel swarm intelligence optimization approach: Sparrow search algorithm. Syst. Sci. Control. Eng. 2020, 8, 22–34.
11. Wang, S.P.; Zhu, Y.Y. Forecasting of wheat price Based on multiscale analysis. Chin. J. Manag. Sci. 2016, 25, 85–91.
12. He, Y.Y.; Li, P.; Han, J.B. Research on prediction modeling of stock market index based on CEEMDAN-LSTM. Stat. Inf. Forum 2020, 35, 59–70.
13. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
15. Moody, J.; Saffell, M. Learning to Trade via Direct Reinforcement. IEEE Trans. Neural Netw. 2001, 12, 875–889.
16. Zhang, Y.; Zhao, P.; Li, B.; Wu, Q.; Huang, J.; Tan, M. Cost-Sensitive Portfolio Selection via Deep Reinforcement Learning. IEEE Trans. Knowl. Data Eng. 2020, 34, 236–248.
17. Li, B.; Hoi, S.C.H. On-Line Portfolio Selection with Moving Average Reversion. arXiv 2012, arXiv:1206.4626.
18. Jiang, Z.; Xu, D.; Liang, J. A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv 2017, arXiv:1706.10059.
19. Shen, W.; Wang, J. Portfolio Selection via Subset Resampling. AAAI 2017, 31, 1517–1523.
20. Cover, T.M. Universal Portfolios. Math. Financ. 1991, 1, 1–29.
21. Borodin, A.; El-Yaniv, R.; Gogan, V. Can We Learn to Beat the Best Stock. JAIR 2004, 21, 579–594.
22. Yang, H.; Liu, X.-Y.; Zhong, S.; Walid, A. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy. In Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; ACM: New York, NY, USA, 2020; pp. 1–8.
23. Gao, Z.; Gao, Y.; Hu, Y.; Jiang, Z.; Su, J. Application of Deep Q-Network in Portfolio Management. In Proceedings of the 2020 5th IEEE International Conference on Big Data Analytics (ICBDA), Xiamen, China, 8–11 May 2020; pp. 268–275.
24. Liang, Z.; Chen, H.; Zhu, J.; Jiang, K.; Li, Y. Adversarial Deep Reinforcement Learning in Portfolio Management. arXiv 2018, arXiv:1808.09940.
25. Karzanov, D.; Garzón, R.; Terekhov, M.; Gulcehre, C.; Raffinot, T.; Detyniecki, M. Regret-Optimized Portfolio Enhancement through Deep Reinforcement Learning and Future Looking Rewards. In Proceedings of the 6th ACM International Conference on AI in Finance, Singapore, 15–18 November 2025; ACM: New York, NY, USA, 2025; pp. 890–897.

Figure 1. Overall framework of the model DSCRL.

Figure 2. Financial data denoising method.

Figure 3. The scheme of the dual-stream feature extraction network, in which the dotted frames in the figure represent functional sub-modules.

Figure 4. Comparison of causal convolution and dilated causal convolution.

Figure 5. The scheme of Relation-Aware Transformer, in which the dotted frames represent different functional sub-modules.

Figure 6. Framework of the DPG algorithm.

Figure 7. Experimental results based on S&P 500.

Figure 8. Experimental results based on cryptocurrency.

Figure 9. Backtest results of the ablation experiment.

Figure 10. Experimental results based on different transaction cost levels.

Table 1. Statistics of datasets.

Datasets	Training Data		Validation Data		Testing Data
Datasets	Data Range	Num	Data Range	Num	Data Range	Num
S&P 500	Jan 2012–Feb 2022	2554	Mar 2022–Aug 2022	128	Sep 2022–Aug 2023	250
Crypto	Dec 2020–Mar 2023	20,424	Apr 2023–May 2023	1464	Jun 2023–Aug 2023	2373

Table 2. Detailed parameter settings of the network architecture of DSCRL. (m: number of assets; T: window size; α: penalty term; k: modal factorization; trend imf: VMD trend items; low imf: VMD low items; high imf: VMD high items; N: number of output channels; K: kernel size; S: stride size; P: padding size; DiR: dilation rate; DrR: dropout rate; In: input size; Out: output size; H: head number; Conv: convolutional).

Part	Input → Output Shape	Details
Financial Data Denoising
SSA-VMD	(m, T, 5) → (m, T, 5)	S&P 500: [$\alpha$: 1400, $k$: 13]; Crypto: [$\alpha$: 2500, $k$: 17]
GRA	(m, T, 5) → (m, T, 8)	S&P 500: trend $\mathrm{imf}_{1}$; low $\mathrm{imf}_{2}$–$\mathrm{imf}_{7}$; high $\mathrm{imf}_{8}$–$\mathrm{imf}_{10}$
GRA	(m, T, 5) → (m, T, 8)	Crypto: trend $\mathrm{imf}_{1}$; low $\mathrm{imf}_{2}$–$\mathrm{imf}_{9}$; high $\mathrm{imf}_{10}$–$\mathrm{imf}_{13}$
Sequential Information Net
TCN	(m, T, 8) → (m, T, 8)	DCONV-(N: 8, K: [1 $\times$ 3], S: 1, P: 2), DiR: 2, DrR: 0.2, ReLU
TCN	(m, T, 8) → (m, T, 8)	DCONV-(N: 8, K: [1 $\times$ 3], S: 1, P: 4), DiR: 4, DrR: 0.2, ReLU
Conv	(m, T, 8) → (m, 1, 8)	CONV-(N: 8, K: [T $\times$ 1], S: 1), ReLU
Correlation Information Net
Transformer	(m, T, 8) → (m, T, 8)	Linear {Q, K, V} - (In: T, Out: T), H: 2
Conv	(m, T, 8) → (m, 1, 8)	CONV-(N: 8, K: [T $\times$ 1], S: 1), ReLU
Decision-Making Net
Concatenation	(m, 8) ⨁ (m, 8) ⨁ (m, 1) → (m, 17)	Concatenation of the extracted features and the previous-period portfolio weights
Prediction	(m, 17) → (m, 1)	CONV-(N: 1, K: [1 $\times$ 1], S: 1), SoftMax
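
The Decision-Making Net rows of Table 2 can be read as a shape walkthrough. The following is a minimal pure-Python sketch, not the authors' implementation. It assumes that the [1 × 1] convolution with one output channel acts as a single 17-weight linear map shared by all assets, and that SoftMax is taken across the m assets. The function name `decision_head`, the `kernel` and `bias` arguments, and the toy inputs are illustrative and do not come from the paper.

```python
import math


def decision_head(seq_feat, corr_feat, prev_weights, kernel, bias=0.0):
    """Map per-asset features to portfolio weights that sum to 1.

    seq_feat, corr_feat: m rows of 8 floats (outputs of the two streams).
    prev_weights: m floats (previous-period portfolio weights).
    kernel: 17 floats, the shared 1x1 convolution weights (assumed layout).
    """
    scores = []
    for seq_row, corr_row, prev_w in zip(seq_feat, corr_feat, prev_weights):
        # (8) + (8) + (1) -> (17), matching the Concatenation row of Table 2.
        row = list(seq_row) + list(corr_row) + [prev_w]
        if len(row) != len(kernel):
            raise ValueError("feature row and kernel must have equal length")
        # A 1x1 convolution with one output channel is a dot product per asset.
        scores.append(sum(x * k for x, k in zip(row, kernel)) + bias)

    # Numerically stable softmax across assets enforces the budget constraint.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with m = 3 assets and an all-zero kernel: every asset gets the
# same score, so the resulting weights are uniform.
m = 3
zeros = [[0.0] * 8 for _ in range(m)]
weights = decision_head(zeros, zeros, [1.0, 0.0, 0.0], kernel=[0.0] * 17)
```

Because the softmax output is non-negative and sums to one, the weights satisfy the budget constraint by construction, while the ordering of the raw scores is preserved.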

Table 3. Comparative experimental results on different datasets.

Method	S&P 500 APV	S&P 500 SR (%)	S&P 500 CR	Crypto APV	Crypto SR (%)	Crypto CR
UBAH	1.09	0.26	0.86	0.89	0.63	0.77
OLMAR	12.53	0.63	3.05	51.01	1.21	395.41
UP	18.32	0.9	4.71	27.17	1.04	435.48
Anticor	25.74	0.9	5.54	35.92	1.11	524.25
Best	1.54	0.64	2.46	1.16	0.5	0.83
Ensemble	5.94	1.51	20.61	68.03	1.85	548.28
EIIE	19.39	2.25	120.86	629.30	3.41	14,616.24
DQN	95.52	1.09	717.93	2702.79	3.95	38,291.35
DDPG	85.23	1.08	1104.71	2642.10	3.5	45,710.61
PPO	117.94	2.49	1489.78	3129.11	4.67	52,543.32
RAT	75.99	3.52	1119.86	1691.42	7.5	35,056.74
PPN	82.08	3.41	1549.51	2387.08	5.94	50,799.94
DSCRL (ours)	120.62	3.76	2495.39	3797.11	8.5	64,029.67

Note: The best result in each column is highlighted in bold (DSCRL on every metric).
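
For readers who want to reproduce columns of this kind, the sketch below computes the three metrics from a series of per-period portfolio returns. It rests on assumptions that this section does not confirm: that APV denotes the accumulated portfolio value starting from 1, that SR is the Sharpe ratio (mean return divided by its standard deviation, with a zero risk-free rate, expressed in percent), and that CR is a Calmar-style ratio of final value to maximum drawdown. The function name `backtest_metrics` and the toy return series are hypothetical.

```python
import math


def backtest_metrics(period_returns):
    """Compute APV, SR (%) and CR from simple per-period returns.

    period_returns: list of floats, e.g. 0.01 for a +1% period.
    The metric definitions are assumptions, stated in the text above.
    """
    if not period_returns:
        raise ValueError("period_returns must not be empty")

    value, peak, max_drawdown = 1.0, 1.0, 0.0
    for r in period_returns:
        value *= 1.0 + r                      # compound the portfolio value
        peak = max(peak, value)               # running maximum
        max_drawdown = max(max_drawdown, (peak - value) / peak)

    n = len(period_returns)
    mean = sum(period_returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in period_returns) / n)

    sharpe_pct = 100.0 * mean / std if std > 0 else float("nan")
    calmar = value / max_drawdown if max_drawdown > 0 else float("inf")
    return {"APV": value, "SR (%)": sharpe_pct, "CR": calmar}


# Toy usage: +10%, then -50%, then +100%.
metrics = backtest_metrics([0.10, -0.50, 1.00])
```

Under these definitions a strategy with a high final value and a shallow worst drawdown obtains a very large CR, which would be consistent with the wide spread of CR values in the table.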

Table 4. Ablation experimental results on different datasets.

Method	S&P 500 APV	S&P 500 SR (%)	S&P 500 CR	Crypto APV	Crypto SR (%)	Crypto CR
DSCRL	120.62	3.76	2495.39	3797.11	8.5	64,029.67
DSCRL-D	83.83	3.30	1088.06	2201.46	6.22	46,098.34
DSCRL-S	100.86	3.30	1220.77	2755.94	6.67	25,660.13
DSCRL-C	93.34	3.52	1358.44	2493.82	4.57	46,140.18
DSC-MV	112.18	3.59	1728.61	3271.11	6.40	44,757.27

Note: The best result in each column is highlighted in bold (the full DSCRL model on every metric).

Table 5. Comparative experimental results on different datasets with different transaction costs.

Method	S&P 500 APV	S&P 500 SR (%)	S&P 500 CR	Crypto APV	Crypto SR (%)	Crypto CR
C = 0.01%
DQN	94.05	0.11	1860.24	4839.72	6.09	169,372.91
DDPG	113.85	0.17	2171.42	5897.48	6.42	157,323.56
PPO	133.65	1.59	2508.73	7436.37	7.46	181,606.79
RAT	165.30	3.92	2470.60	6807.34	7.95	138,437.01
PPN	168.15	3.90	2551.12	9576.29	7.53	192,688.28
DSCRL (ours)	180.22	3.98	2823.13	11,697.11	8.25	235,718.78
C = 0.1%
DQN	14.62	1.17	12.5	310.62	2.79	4396.71
DDPG	18.2	1.15	16.39	355.94	3.12	8465.31
PPO	21.15	1.37	38.75	412.76	4.28	10,234.81
RAT	23.78	2.45	241.47	406.67	4.77	10,367.89
PPN	20.89	2.85	258.71	281.59	4.56	6673.73
DSCRL (ours)	28.15	3.12	278.43	434.31	4.96	12,905.45
C = 0.25%
DQN	1.53	0.18	0.72	20.52	0.49	91.47
DDPG	2.06	0.33	2.89	24.40	1.21	129.24
PPO	2.39	0.39	2.5	25.04	1.79	154.88
RAT	2.25	0.76	5.46	25.24	1.93	140.91
PPN	2.55	0.83	4.63	30.15	2.33	149.65
DSCRL (ours)	3.49	1.52	25.98	34.55	2.57	202.95
C = 1%
DQN	0.98	0.01	0.03	1.26	0.31	1.31
DDPG	1.2	0.09	0.53	1.39	0.39	2.36
PPO	1.32	0.14	1.38	1.62	0.54	4.93
RAT	1.25	0.30	1.34	1.53	0.59	2.42
PPN	1.65	0.53	3.18	1.99	0.80	5.68
DSCRL (ours)	1.97	0.66	3.43	2.21	0.82	6.81

Note: The best result in each column at each transaction cost level is highlighted in bold (DSCRL on every metric).
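
The sharp drop in APV as C increases can be illustrated with a one-period value update. The sketch below uses a simplified proportional cost model, in which the cost charged equals the rate C times the total absolute change in weights. This model is an assumption made for illustration; the paper's exact cost formulation is not given in this section. The function name `step_value` and its arguments are hypothetical.

```python
def step_value(value, old_weights, new_weights, asset_returns, cost_rate):
    """Advance the portfolio value by one period under proportional costs.

    value: portfolio value before rebalancing.
    old_weights, new_weights: weights before and after rebalancing.
    asset_returns: simple returns of each asset over the period.
    cost_rate: transaction cost as a fraction, e.g. 0.0025 for C = 0.25%.
    """
    # Turnover: total absolute weight change caused by rebalancing.
    turnover = sum(abs(n - o) for n, o in zip(new_weights, old_weights))
    # Simplified cost model (assumption): pay cost_rate on the turnover.
    value_after_cost = value * (1.0 - cost_rate * turnover)
    # Growth of the rebalanced portfolio over the period.
    growth = sum(w * (1.0 + r) for w, r in zip(new_weights, asset_returns))
    return value_after_cost * growth


# Toy usage: switching fully from asset 0 to asset 1 at C = 1% with flat prices.
after = step_value(1.0, [1.0, 0.0], [0.0, 1.0], [0.0, 0.0], cost_rate=0.01)
```

Because this cost is paid at every rebalancing step, it compounds over the whole test horizon, so even a small per-period rate can remove a large share of the accumulated value of a strategy with high turnover.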

Share and Cite

MDPI and ACS Style

Gao, N.; Liu, Y.; He, Y.; Zhang, J.; Zhang, L. A Novel Portfolio Selection Method via Deep Reinforcement Learning. Systems 2026, 14, 292. https://doi.org/10.3390/systems14030292
