2.1. General Architecture
DSCRL is composed of three main submodules: financial data denoising, feature extraction, and decision-making (Figure 1). The model first denoises the closing price series of each asset. It then extracts feature maps for each asset and uses them to make trading decisions. The model structure and its specifics are presented in Section 3.1. The neural network parameters are learned with a deep reinforcement learning algorithm.
Financial Data Denoising Module: To address the noise issue without introducing future information during training or testing, strict causal constraints are imposed on the denoising module. Within each sliding window, the key parameters of the Variational Mode Decomposition (VMD) model are optimized using the Sparrow Search Algorithm (SSA), based solely on the historical price data contained in that window [10]. The optimized VMD then decomposes the stock price time series into intrinsic mode functions (IMFs), which represent components of the price series at different frequencies. Gray relational clustering groups the IMFs into trend, low-frequency, and high-frequency components, and the high-frequency IMFs with excessive noise are removed. The remaining IMFs are reconstructed into a denoised price series composed of trend, low-frequency, and high-frequency components. The window then rolls forward in chronological order, and the parameter optimization and decomposition are carried out independently within each window.
Therefore, at any time step $t$, the feature construction process relies exclusively on information available at or prior to $t$, which ensures that the entire denoising procedure satisfies strict temporal causality constraints.
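The causality constraint can be illustrated with a minimal sketch. The function `denoise_window` is a hypothetical stand-in (a plain window mean) for the per-window SSA-VMD-GRA step, and the window length of 4 is chosen only for illustration; neither is taken from the paper.

```python
import numpy as np

def denoise_window(window: np.ndarray) -> float:
    """Hypothetical stand-in for the per-window SSA-VMD-GRA step.

    It only returns the window mean; the actual module would optimize the
    VMD parameters and reconstruct the series inside this window.
    """
    return float(window.mean())

def causal_rolling_denoise(prices: np.ndarray, window: int) -> np.ndarray:
    """The value at step t uses only prices[t - window + 1 : t + 1]."""
    out = np.full(len(prices), np.nan)
    for t in range(window - 1, len(prices)):
        out[t] = denoise_window(prices[t - window + 1 : t + 1])
    return out

prices = np.array([10.0, 10.4, 10.1, 10.8, 11.0, 10.7, 11.3, 11.6])
denoised = causal_rolling_denoise(prices, window=4)

# Causality check: perturbing future prices must not alter earlier outputs.
perturbed = prices.copy()
perturbed[6:] += 5.0
denoised_perturbed = causal_rolling_denoise(perturbed, window=4)
```

Because each output is computed from a window that ends at the current step, the outputs before step 6 are identical in both runs, which is the property the module is required to satisfy.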
Feature Extraction Module: In this module, we perform dual-stream feature extraction on the denoised sequence data, covering both the temporal and the spatial (cross-asset) dimensions. For temporal feature extraction, the Temporal Convolutional Network (TCN) is employed to overcome the limitations of recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks, such as fixed receptive fields and vanishing gradients, thereby enhancing the model’s ability to capture temporal information at short-term, medium-term, and long-term scales. For spatial relationship extraction, the Relation-Aware Transformer (RAT), a variant of the Transformer architecture, is used to avoid the shortcomings of treating stocks as independent entities or relying solely on correlation coefficients. This enables the model to capture inter-asset relational features more effectively.
Decision-Making Module: In this module, the extracted features, together with the previous portfolio weights, are fed into the portfolio policy network to make portfolio allocation decisions. The policy network is trained via reinforcement learning with the objective of maximizing cumulative returns. By taking the previous asset weights as recursive inputs, the model becomes aware of transaction costs and produces smoother decisions, which aligns it better with realistic trading environments.
2.2. Financial Data Denoising
Short-term speculative behavior and noise trading in financial markets cause time series data to exhibit irregular fluctuations and a high noise level. However, in the long term, the price of financial assets tends to revert to its intrinsic value according to the law of value. In this paper, the noise reduction method leverages the unique characteristics of financial assets by decomposing the close price sequence based on various fluctuation frequencies and eliminating the segments with higher noise content to reduce financial data noise effectively.
Figure 2 displays the flowchart of the noise reduction method. VMD is applied to decompose the closing price sequence within the historical window [8], and the SSA is used to optimize the penalty factor $\alpha$ and the number of modes $k$ [10]. After decomposition into $k$ IMFs, gray correlation analysis (GRA) is employed to categorize each IMF as a trend, low-frequency, or high-frequency component [11]. After the high-frequency terms with high noise content are eliminated, the low-frequency terms and the remaining high-frequency terms are each summed and reconstructed [12]. The resulting trend terms, low-frequency terms, and noise-reduced high-frequency terms are then fed into the feature extraction network.
In the denoising module, VMD requires the specification of two critical parameters: the number of modes $k$ and the penalty factor $\alpha$. These parameters directly influence the decomposition quality, particularly with respect to over-decomposition and mode mixing. However, deriving analytically optimal parameter values is generally intractable, and manual selection inevitably introduces subjectivity and potential bias. To limit human intervention and reduce the risk of convergence to local optima, we use SSA to optimize the parameters automatically [10], which enhances the stability and generalization capability of the decomposition process.
Compared with conventional filtering techniques, simple methods such as moving averages (MA) tend to smooth the series indiscriminately, which leads to the mixing of high- and low-frequency components and the potential loss of critical turning-point information. In contrast, the framework SSA-VMD-GRA proposed in this paper enables explicit frequency separation, effectively decomposing the original series into trend and fluctuation components while preserving informative mid-frequency signals. Although wavelet denoising can address multi-scale characteristics, it requires the selection of a mother wavelet, which introduces methodological subjectivity, and its performance is sensitive to threshold determination. By comparison, VMD does not rely on predefined basis functions and provides clearer and more adaptive frequency band partitioning.
Overall, the framework SSA-VMD-GRA has advantages in terms of adaptive parameter optimization and robust frequency separation, making it more suitable for modeling non-stationary and structurally complex financial time series.
VMD is an adaptive signal decomposition method. Its advantage lies in the ability to specify the number of modes, which effectively mitigates mode aliasing and allows fluctuations with different center frequencies in financial time series to be separated more cleanly [8]. The problem description is as shown in Equation (1).
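For reference, and assuming Equation (1) takes the standard VMD form, the constrained variational problem can be written as below. The index $i$ over the $k$ modes and the symbols $u_i$, $\omega_i$, and $f$ are notational assumptions made here; $*$ denotes convolution.

```latex
\[
\min_{\{u_i\},\{\omega_i\}} \sum_{i=1}^{k}
\left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_i(t) \right] e^{-j\omega_i t} \right\|_2^2
\quad \text{s.t.} \quad \sum_{i=1}^{k} u_i(t) = f(t)
\]
```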
Here, $\{u_i\}$ denotes the $k$ intrinsic mode functions obtained by decomposition, $\{\omega_i\}$ denotes the center frequency of each IMF, $\delta(t)$ is the unit pulse function, $j$ is the imaginary unit, and $f(t)$ is the original time series signal.
Next, the penalty factor $\alpha$ and the Lagrange multiplier $\lambda$ are introduced to obtain the augmented Lagrangian function, as shown in Equation (2).
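Assuming Equation (2) follows the standard VMD derivation, the augmented Lagrangian combines the bandwidth term weighted by the penalty factor, a quadratic reconstruction penalty, and the multiplier term. The indexing by $i$ and the symbol names are assumptions made here for illustration.

```latex
\[
\mathcal{L}\left(\{u_i\},\{\omega_i\},\lambda\right) =
\alpha \sum_{i=1}^{k} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_i(t) \right] e^{-j\omega_i t} \right\|_2^2
+ \left\| f(t) - \sum_{i=1}^{k} u_i(t) \right\|_2^2
+ \left\langle \lambda(t),\; f(t) - \sum_{i=1}^{k} u_i(t) \right\rangle
\]
```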
The Alternating Direction Method of Multipliers (ADMM) is employed to iteratively update each modal component, the center frequencies, and the Lagrange multipliers, with the update procedure given in Equation (3):
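In the standard VMD algorithm, which the update procedure is assumed to follow here, the three ADMM updates are carried out in the frequency domain. The iteration index $n$ and the mode index $i$ are notational assumptions; the sum over $l \neq i$ uses the most recent iterate of each other mode.

```latex
\[
\hat{u}_i^{\,n+1}(\omega) =
\frac{\hat{f}(\omega) - \sum_{l \neq i} \hat{u}_l(\omega) + \hat{\lambda}^{n}(\omega)/2}
     {1 + 2\alpha \left( \omega - \omega_i^{\,n} \right)^2}
\]
\[
\omega_i^{\,n+1} =
\frac{\int_0^{\infty} \omega \left| \hat{u}_i^{\,n+1}(\omega) \right|^2 d\omega}
     {\int_0^{\infty} \left| \hat{u}_i^{\,n+1}(\omega) \right|^2 d\omega}
\]
\[
\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega)
+ \tau \left( \hat{f}(\omega) - \sum_{i=1}^{k} \hat{u}_i^{\,n+1}(\omega) \right)
\]
```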
Here, $\hat{u}_i(\omega)$ denotes the modes after Fourier transformation, and $\tau$ represents the noise tolerance.
During the iterative updating process, once the decomposition accuracy satisfies the required condition, the current output is regarded as the final decomposition result. The convergence criterion of the decomposition is given in Equation (6), where $\varepsilon$ is the maximum admissible tolerance.
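Assuming the standard VMD stopping rule, the criterion compares the relative change of every mode between consecutive iterations with the tolerance. The indices $i$ and $n$ are notational assumptions made here.

```latex
\[
\sum_{i=1}^{k}
\frac{\left\| \hat{u}_i^{\,n+1} - \hat{u}_i^{\,n} \right\|_2^2}
     {\left\| \hat{u}_i^{\,n} \right\|_2^2}
< \varepsilon
\]
```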
In the VMD process, the selection of the penalty factor $\alpha$ and the number of modes $k$ is particularly critical to the decomposition results. Therefore, this study employs the SSA to optimize these parameters by minimizing the envelope entropy, the computation of which is defined in Equation (7).
Here, the envelope entropy is computed for the $i$-th IMF derived from the original signal, using the envelope signal obtained through Hilbert demodulation. Among the components obtained through VMD, those containing stronger periodic information correspond to smaller envelope entropy values, whereas those with weaker periodicity exhibit larger envelope entropy values. By adopting envelope entropy as the optimization criterion for SSA, the parameter optimization problem can be formulated as in Equation (9), and the optimal parameter pair $(k, \alpha)$ is determined using the SSA.
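The envelope entropy computation can be sketched as follows, assuming the common definition in which the Hilbert envelope is normalized to a probability distribution and its Shannon entropy is taken. The `fitness` function assumes that the SSA objective is the smallest envelope entropy among the IMFs, which is a frequent choice but is not confirmed by the text; the test signals are illustrative only.

```python
import numpy as np

def analytic_signal(x: np.ndarray) -> np.ndarray:
    """Analytic signal via the FFT (discrete Hilbert transform)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1 : n // 2] = 2.0
    else:
        gain[1 : (n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)

def envelope_entropy(imf: np.ndarray) -> float:
    """Shannon entropy of the normalized Hilbert envelope."""
    envelope = np.abs(analytic_signal(imf))
    p = envelope / envelope.sum()
    p = p[p > 0]                      # skip exact zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())

def fitness(imfs) -> float:
    """Assumed SSA objective: smallest envelope entropy among the IMFs."""
    return min(envelope_entropy(imf) for imf in imfs)

n = 256
t = np.arange(n)
flat = np.cos(2 * np.pi * 8 * t / n)   # constant envelope
burst = np.exp(-0.5 * ((t - 128) / 6.0) ** 2) * np.cos(2 * np.pi * 32 * t / n)
```

A constant envelope attains the maximum value $\ln N$ for a signal of length $N$, while an envelope concentrated in a few samples yields a smaller value, so the measure rewards components whose energy is concentrated rather than diffuse.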
The original time series signal is decomposed via VMD into $k$ IMFs, denoted as $\mathrm{IMF}_1$, $\mathrm{IMF}_2$, …, $\mathrm{IMF}_k$, where $\mathrm{IMF}_1$ represents the trend component; the volatility frequencies of $\mathrm{IMF}_2$ through $\mathrm{IMF}_k$ increase progressively, and the complexity of the modes gradually increases. In financial markets, substantial speculative behavior and irrational trading introduce considerable noise into the high-frequency IMF sequences. To mitigate the negative impact of such noise on the effectiveness of portfolio models, the IMF sequences are categorized according to their frequencies into trend components, low-frequency components, and high-frequency components. Certain high-frequency IMFs are then removed, thereby enabling the extraction of medium- and long-term volatility features from asset price series.
The gray correlation analysis method is employed to categorize the IMF sequences into low-frequency and high-frequency groups, followed by the reconstruction of the low-frequency IMF sequences [11]. To eliminate the influence of differing scales, each sequence is first normalized.
Next, the relative correlation coefficients are calculated, as defined in Equation (11). The distinguishing coefficient in this calculation is typically set to 0.5. The absolute correlation coefficients are then computed, as defined in Equation (12).
Finally, the comprehensive correlation coefficient, as shown in Equation (13), is obtained to represent the overall degree of association among the IMF components. The weight that combines the two coefficients is set to 0.5 in this paper.
The above procedure is repeated to compute the comprehensive correlation coefficients among all IMF sequences, after which the gray correlation analysis method is applied to classify the IMF sequences according to their volatility frequencies. IMF sequences that exhibit both high correlation and similar volatility frequencies are grouped together and labeled, in order, as low-frequency and high-frequency components. After certain high-frequency components containing excessive noise are discarded, the low-frequency components and the remaining high-frequency components are each summed and reconstructed [12].
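The association measure behind this grouping can be sketched with Deng's gray relational grade, which uses a distinguishing coefficient of 0.5. This is only an assumed stand-in: the paper combines relative and absolute coefficients into a comprehensive coefficient whose exact equations are not reproduced here, and the max-abs scaling below is likewise an assumption.

```python
import numpy as np

def grey_relational_grades(reference, candidates, rho=0.5):
    """Deng's gray relational grade of each candidate against the reference.

    rho is the distinguishing coefficient (0.5, as in the text).
    Max-abs scaling is an assumed normalization, not taken from the paper.
    """
    ref = np.asarray(reference, dtype=float)
    cand = np.asarray(candidates, dtype=float)
    ref = ref / np.max(np.abs(ref))
    cand = cand / np.max(np.abs(cand), axis=1, keepdims=True)
    delta = np.abs(cand - ref)            # shape: (num_candidates, length)
    d_min, d_max = delta.min(), delta.max()
    if d_max == 0.0:                      # every candidate equals the reference
        return np.ones(len(cand))
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return coeff.mean(axis=1)

t = np.linspace(0.0, 1.0, 200)
slow_a = np.sin(2 * np.pi * 2 * t)
slow_b = np.sin(2 * np.pi * 2 * t + 0.2)  # similar frequency, small phase shift
fast = np.sin(2 * np.pi * 25 * t)         # much higher frequency
grades = grey_relational_grades(slow_a, [slow_a, slow_b, fast])
```

In this toy example the two slow components receive a higher mutual grade than the slow and fast pair, which is the behavior that allows IMFs with similar volatility frequencies to be grouped together.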
2.3. Feature Extraction
In this section, we propose a dual-stream feature extraction network, as shown in Figure 3, composed of two complementary branches. Specifically, the temporal sequence branch extracts dynamic evolution features from univariate asset price series to characterize temporal patterns and volatility dynamics, while the correlation branch focuses on uncovering the interactions and latent dependency structures among assets [9]. By adopting a parallel dual-stream design, the network integrates temporal dynamics and cross-sectional correlation features within a unified framework, providing more diverse feature representations for subsequent forecasting and decision-making.
In the temporal sequence branch, this study employs TCN to extract features from asset price series [13]. Compared with the traditional Convolutional Neural Network (CNN), TCN overcomes the limitations of restricted receptive fields in convolution kernels and the difficulty of capturing long-term dependencies in sequences (Figure 4a). At the same time, relative to RNNs and their variants, TCN eliminates the constraint of sequential recursive computation, enabling highly parallelized training that significantly improves computational efficiency while preserving modeling capacity. The overall structure of TCN is illustrated in Figure 4. The core of TCN lies in its dilated causal convolutional architecture, as shown in Figure 4b. Causal convolutions prevent information leakage from future time steps, ensuring the validity of prediction tasks, while dilated convolutions expand the receptive field through spaced sampling, thereby allowing efficient modeling of long-term dependencies in time series. In addition, residual connections are incorporated into the network design, which not only enhances model robustness but also helps alleviate the vanishing gradient problem in deep network training. Overall, TCN offers distinct advantages in capturing long-range dependencies, improving computational efficiency, and maintaining training stability, making it well suited for feature extraction from financial asset price series.
As shown on the right of Figure 3, the temporal feature extraction network constructed in this paper consists of an input layer, dilated causal convolution layers, weight normalization, a ReLU activation function, a dropout layer, and residual connections. The input layer receives sequential data with three dimensions: the number of assets, the number of time steps, and the number of input features. Dilated causal convolution is then applied so that the output at each time step depends only on current and past information, thereby preserving temporal causality. At the same time, dilation expands the receptive field without significantly increasing computational complexity. In this model, the dilation rates are set to 2 and 4. The ReLU activation function is then employed to enhance the network’s expressive capacity through sparse activation while alleviating the vanishing gradient problem. A dropout layer follows to prevent overfitting by randomly deactivating a proportion of neurons during training, with that proportion given by the dropout rate. Next, residual connections add the input of the convolutional layer directly to its output, forming a residual structure that mitigates gradient vanishing and facilitates the training of deeper architectures. These operations are repeated more than twice to extract temporal features more effectively, producing an output tensor with one dimension equal to the number of convolutional output channels. Finally, a one-dimensional convolution compresses this three-dimensional output into a lower-dimensional tensor for subsequent computation; in the proposed model, the associated size parameter is set to 8. In the temporal stock price feature extraction network, the input of each layer can be expressed as follows:
where the input is the multi-scale representation of the stock prices, the operator is the dilated causal convolution, and the output is the temporal feature representation.
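The dilated causal convolution and residual structure described above can be sketched for a single univariate series as follows. The kernel size of 3 and the unit weights are hypothetical choices for illustration, and weight normalization and dropout are omitted; only the dilation rates 2 and 4 come from the text.

```python
import numpy as np

def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_i weights[i] * x[t - i * dilation], zero-padded on the left."""
    x = np.asarray(x, dtype=float)
    taps = len(weights)
    pad = (taps - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for i, w in enumerate(weights):
        start = pad - i * dilation       # tap i looks i * dilation steps back
        y += w * padded[start : start + len(x)]
    return y

def residual_block(x, weights, dilation):
    """Dilated causal convolution, ReLU, then a residual (skip) connection."""
    hidden = np.maximum(causal_dilated_conv1d(x, weights, dilation), 0.0)
    return np.asarray(x, dtype=float) + hidden

def tcn_features(x, weights=(1.0, 1.0, 1.0), dilations=(2, 4)):
    out = np.asarray(x, dtype=float)
    for d in dilations:                  # dilation rates 2 and 4, as in the text
        out = residual_block(out, weights, d)
    return out

# Receptive field measured from the response to a unit impulse at t = 0.
impulse = np.zeros(32)
impulse[0] = 1.0
response = tcn_features(impulse)
receptive_field = int(np.nonzero(response)[0].max()) + 1

# Causality check: changing inputs from step 20 onward leaves earlier outputs intact.
rng = np.random.default_rng(0)
series = rng.normal(size=32)
changed = series.copy()
changed[20:] += 1.0
```

With a kernel of size 3, stacking dilations 2 and 4 gives a receptive field of $1 + (3 - 1)(2 + 4) = 13$ steps, which shows how dilation widens the temporal context without adding parameters.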
In the correlation branch, the focus is on capturing the interactions and dynamic dependencies among different assets. While the standard Transformer model, with its multi-head attention mechanism, excels at modeling nonlinear correlations in natural language processing and general sequence tasks, its direct application to financial portfolio problems faces notable limitations. Specifically, the self-attention mechanism in the standard Transformer relies primarily on point-to-point similarity measures, making it sensitive to local noise and less effective at capturing local contextual patterns in price sequences. As a result, a standard Transformer alone is insufficient for extracting meaningful correlations between assets. To address this challenge, we incorporate the RAT [9] into the correlation branch. RAT extends the standard Transformer with two key enhancements. First, the Sequential Attention Layer strengthens the modeling of local dependencies through context-aware attention, suppressing short-term noise while preserving the ability to capture long-term relationships in price sequences. Second, the Relation Attention Layer explicitly models dynamic correlations between assets, revealing systemic risks and market co-movement patterns. The overall structure of the RAT is illustrated in Figure 5, and the algorithm proceeds as follows.
First, the feature maps output by the dilated causal convolutional module are projected through different linear transformations to obtain the query matrix $Q$, key matrix $K$, and value matrix $V$. By matrix splitting, $Q$, $K$, and $V$ are mapped into multiple heads, generating multi-head representations that capture pairwise price correlation features among assets, denoted as $Q_h$, $K_h$, and $V_h$ for head $h$. These linear transformations and multi-head operations are formulated in Equations (16)–(20).
Next, the matrices $Q_h$ and $K_h$ are multiplied within each head to obtain the attention distribution $A_h$, which represents the relative importance of each asset with respect to all other assets. To prevent excessively large inner products, each element in $A_h$ is divided by $\sqrt{d_h}$, where $d_h$ denotes the dimensionality of each head.
Subsequently, the SoftMax function is applied to normalize the attention distribution $A_h$, ensuring that the attention weights of each asset over all assets sum to 1. Finally, the normalized attention distribution in each head is multiplied by the corresponding value matrix $V_h$. The above operations are formulated in Equations (21) and (22).
Finally, the outputs from all heads are concatenated and passed through a linear convolution layer to obtain the final attention representation, which reflects the attention value of each asset with respect to all other assets across all heads. This process can be expressed in Equation (23):
The extracted inter-asset correlation features can therefore be expressed as shown in Equation (24):
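The attention steps above can be sketched as standard multi-head scaled dot-product attention across assets. The dimensions, the random projection matrices, and the function name `relation_attention` are hypothetical, and a dense output projection is assumed in place of the linear convolution layer mentioned in the text.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relation_attention(features, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head scaled dot-product attention across assets.

    features: (num_assets, d_model). Returns the (num_assets, d_model) output
    and the per-head attention weights (num_heads, num_assets, num_assets).
    """
    num_assets, d_model = features.shape
    d_head = d_model // num_heads

    def split(matrix):                        # -> (num_heads, num_assets, d_head)
        return matrix.reshape(num_assets, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split(features @ w) for w in (w_q, w_k, w_v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attention = softmax(scores, axis=-1)      # each row sums to 1 over assets
    heads = attention @ v
    concat = heads.transpose(1, 0, 2).reshape(num_assets, d_model)
    return concat @ w_o, attention

rng = np.random.default_rng(0)
num_assets, d_model, num_heads = 5, 8, 2
features = rng.normal(size=(num_assets, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
output, attention = relation_attention(features, w_q, w_k, w_v, w_o, num_heads)
```

Each row of `attention[h]` describes how strongly one asset attends to every other asset in head `h`, which is the quantity interpreted above as relative importance.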
As shown in Figure 5, the RAT retains the fundamental components of the Transformer, including positional encoding, feed-forward layers, and layer normalization, ensuring that the model remains generalizable while being adapted to the financial context.
By integrating RAT, the correlation branch can simultaneously capture complex inter-asset dependencies and local price patterns within a unified framework [14]. A concatenation operation is then employed to integrate the outputs of the two branches. Its inputs are the temporal features extracted at time $t$, the inter-asset correlation features extracted at time $t$, and the portfolio weight vector at time $t-1$, and the integrated feature is their concatenation.
2.4. Decision-Making Net
After dual-stream feature extraction, the extracted time series features and asset correlation features are input into the reinforcement-learning-based decision network to generate the portfolio allocation strategy.
This decision network is designed as a policy network that directly outputs continuous weights for each asset. In this study, the policy network is trained using the deterministic policy gradient (DPG) algorithm (Figure 6), which is well suited to continuous action spaces and allows direct optimization of asset allocations. To further enhance the stability of the investment portfolio and reduce transaction costs, the policy network introduces a recursive mechanism: when generating trading decisions for the current period, it takes the trading decisions of the previous period as an input reference. This helps maintain consistency across consecutive portfolio allocations and mitigates the risk of incurring large transaction costs [15]. When computing asset weights, the network first applies a convolutional operation to integrate feature information across all assets, ensuring that the decision for a single asset considers the characteristics of the entire portfolio. The resulting scores are then normalized using a SoftMax function, producing a final allocation vector that satisfies the budget constraint and preserves relative importance across assets [16]. The algorithmic principle is described as follows:
After the denoising module and the dual-stream feature extraction networks, the concatenated features are obtained to construct the reinforcement learning state. The state is further reshaped into a vector form and fed into a DPG framework.
The actor network outputs continuous portfolio weights, which are normalized through a Softmax operation to satisfy the budget constraint that the weights sum to 1. During training, Gaussian exploration noise is added to the actor output to enhance policy exploration. The reward at time $t$ is defined as the portfolio return.
The critic network estimates the action-value function, as expressed by Equation (29). The temporal-difference target $y_t$ is computed from the reward and the discounted value predicted by target networks, which are separate copies of the actor and critic. The critic network is updated by minimizing the mean squared error between its estimate and this target, as expressed in Equation (31). The actor network is updated using the deterministic policy gradient.
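For reference, the standard deep DPG update rules, which the description above appears to follow, can be written as below. The symbols $\theta^{Q}$ and $\theta^{\mu}$ for the critic and actor parameters, the primes for the target networks, the discount factor $\gamma$, and the batch size $N$ are notational assumptions made here.

```latex
\[
y_t = r_t + \gamma \, Q'\!\left( s_{t+1}, \mu'(s_{t+1}) \right)
\]
\[
L(\theta^{Q}) = \frac{1}{N} \sum_{t} \left( y_t - Q(s_t, a_t \mid \theta^{Q}) \right)^2
\]
\[
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{t}
\nabla_{a} Q(s_t, a \mid \theta^{Q}) \big|_{a = \mu(s_t)} \,
\nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})
\]
```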
To ensure stable training under noisy and non-stationary financial environments, three stabilization mechanisms are incorporated.
First, the target networks are updated via soft updates, as expressed in Equation (33), where the soft-update coefficient $\tau$ is set to 0.001.
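A soft update of this kind is commonly implemented as Polyak averaging. The sketch below assumes that form, with parameters represented as hypothetical lists of NumPy arrays rather than the actual network weights.

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]

# Hypothetical parameter lists standing in for network weights.
online = [np.ones((2, 2)), np.full(3, 2.0)]
target = [np.zeros((2, 2)), np.zeros(3)]
target = soft_update(target, online)
```

With a coefficient of 0.001, each update moves the target parameters only 0.1% of the way toward the online parameters, so the temporal-difference targets change slowly and the critic is trained against a nearly stationary objective.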
Second, an experience replay buffer stores transition tuples $(s_t, a_t, r_t, s_{t+1})$ during training. Samples are drawn at random from the buffer with a batch size of 128 and a buffer capacity of 50,000, and one gradient update is performed per environment interaction step.
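A replay buffer with these settings can be sketched with a bounded deque. The class name, the seed, and the dummy transitions are hypothetical; only the capacity of 50,000 and the batch size of 128 come from the text.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=50_000, seed=0):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=128):
        """Uniform sampling without replacement from the stored transitions."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer()
for step in range(60_000):                     # dummy transitions for illustration
    buffer.store(step, 0.0, 0.0, step + 1)
batch = buffer.sample()
```

Uniform sampling breaks the temporal correlation between consecutive transitions, which matters in financial series where adjacent observations are strongly dependent.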
Third, an $\epsilon$-greedy-style exploration strategy for continuous trading actions is applied during training to prevent premature convergence. Specifically, Gaussian noise is added to the actor output. To balance exploration and exploitation, the noise intensity is gradually decayed over training, as expressed in Equation (34). The schedule is governed by an initial exploration scale and a lower bound that prevents the policy from becoming completely deterministic during training.
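The decayed exploration noise can be sketched as follows. The exponential form of the schedule and the values `sigma_0=0.2`, `sigma_min=0.01`, and `decay=0.999` are assumptions for illustration, since Equation (34) is not reproduced here; clipping and renormalizing the noisy weights is likewise an assumed way to keep a long-only allocation on the simplex.

```python
import numpy as np

def noise_scale(step, sigma_0=0.2, sigma_min=0.01, decay=0.999):
    """Assumed schedule: exponential decay floored at sigma_min."""
    return max(sigma_min, sigma_0 * decay ** step)

def explore(weights, step, rng):
    """Perturb actor weights with Gaussian noise, then return to the simplex."""
    noisy = weights + rng.normal(scale=noise_scale(step), size=weights.shape)
    noisy = np.clip(noisy, 1e-8, None)      # keep weights positive (long-only assumption)
    return noisy / noisy.sum()              # restore the budget constraint

rng = np.random.default_rng(0)
actor_output = np.array([0.4, 0.3, 0.2, 0.1])
action = explore(actor_output, step=100, rng=rng)
scales = [noise_scale(s) for s in range(0, 10_000, 500)]
```

The floor keeps a small amount of randomness in late training, so the policy never becomes fully deterministic while it is still being updated.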
Overall, through this design, the dual-stream feature representation serves as a structured and informative state input, while the DPG framework dynamically optimizes long-term cumulative returns in a continuous portfolio allocation space with enhanced training stability.
Furthermore, the dynamic allocation procedure of investment portfolio weights is described as follows:
First, the portfolio weight vector at time step $t$ can be expressed by Equation (35), with one entry per asset. The weights evolve over time according to the learned policy function, which takes as inputs the vector of asset returns and the market state. The weights are updated iteratively and incrementally through an adjustment term that is determined by the DSCRL policy. The portfolio return at time $t$ then follows from the weights and the asset returns, and DSCRL optimizes the weights over time to maximize the expected Sharpe ratio.
This formulation shows how DSCRL dynamically adjusts the asset allocation based on market conditions.
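The portfolio return and Sharpe ratio can be computed as in the following sketch. It assumes a zero risk-free rate, the population standard deviation, and no transaction costs, none of which is specified in the text; the weights and returns are made-up numbers for illustration.

```python
import numpy as np

def portfolio_returns(weights, asset_returns):
    """Per-period portfolio return: sum_i w[t, i] * r[t, i]."""
    return np.sum(np.asarray(weights) * np.asarray(asset_returns), axis=1)

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its (population) standard deviation."""
    excess = np.asarray(returns) - risk_free
    return excess.mean() / excess.std()

# Two periods, two assets; each row of weights sums to 1.
weights = np.array([[0.50, 0.50],
                    [0.25, 0.75]])
asset_returns = np.array([[0.02, 0.00],
                          [0.00, 0.04]])
returns = portfolio_returns(weights, asset_returns)
ratio = sharpe_ratio(returns)
```

Here the per-period returns are 0.01 and 0.03, giving a mean of 0.02, a standard deviation of 0.01, and a Sharpe ratio of 2.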