Mathematics
  • Article
  • Open Access

22 August 2023

DADE-DQN: Dual Action and Dual Environment Deep Q-Network for Enhancing Stock Trading Strategy

1 School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macao, China
2 Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology, Taipa, Macao, China
* Author to whom correspondence should be addressed.

Abstract

Deep reinforcement learning (DRL) has attracted strong interest since AlphaGo beat human professionals, and its applications in stock trading are widespread. In this paper, an enhanced stock trading strategy called Dual Action and Dual Environment Deep Q-Network (DADE-DQN) is proposed for increasing profit and reducing risk. Our approach incorporates several key highlights. First, to achieve a better balance between exploration and exploitation, a dual-action selection mechanism and a dual-environment mechanism are incorporated into our DQN framework. Second, our approach optimizes the utilization of stored transitions by using independent replay memories and performing dual mini-batch updates, leading to faster convergence and more efficient learning. Third, a novel deep network structure that incorporates Long Short-Term Memory (LSTM) and attention mechanisms is introduced, improving the network's ability to capture essential features and patterns. In addition, an innovative feature selection method is presented that efficiently enhances the input data by utilizing mutual information to identify and eliminate irrelevant features. Evaluation on six datasets shows that our DADE-DQN algorithm outperforms multiple DRL-based strategies (TDQN, DQN-Pattern, DQN-Vanilla) and traditional strategies (B&H, S&H, MR, TF). For example, on the KS11 dataset, the DADE-DQN strategy achieved an impressive cumulative return of 79.43% and a Sharpe ratio of 2.21, outperforming all other methods. These experimental results demonstrate the effectiveness of our approach in enhancing stock trading strategies.

1. Introduction

The primary objective of investing in financial assets is usually to optimize returns while effectively managing risk. This objective revolves around striking a balance between risk and returns to achieve sustainable long-term investment returns. However, traditional methods typically involve human traders who analyze market data and trends, evaluate price fluctuations, and execute buy and sell trades based on their decisions. In contrast, algorithmic trading uses computer programs to automate decision making and trade execution. These programs operate using pre-defined algorithms and rules, eliminating the need for human intervention. Algorithmic trading mainly includes rule-based and machine learning (ML)-based approaches. Rule-based algorithmic trading involves several steps. These include analyzing and modeling the market and developing a set of rules based on market trends, price fluctuations, and other factors. Afterward, decisions and trading are carried out based on those rules. On the other hand, machine learning-based algorithmic trading involves training machine learning models to discover market patterns and then making decisions and trades based on those patterns.
Traditionally, rule-based algorithmic trading methods have been widely used to solve financial decision problems by treating the time series of financial assets as a sequence of random variables governed by a stochastic process [1]. These methods often oversimplify real-world financial markets by assuming discrete time points, leading to suboptimal outcomes and potential financial losses. Overcoming this limitation and developing models that can capture the complex nature of financial markets without oversimplifying them is a substantial mathematical and computational challenge.
AlphaGo’s success in competing against human professionals has sparked interest in reinforcement learning (RL) as a promising approach for statistical modeling and data processing in finance [2]. Significant progress has been made in several areas by combining RL with deep learning techniques. Notable examples include AlphaGo [3], which used RL and Monte Carlo tree search to defeat top Go players, as well as AlphaStar and OpenAI Five, which demonstrated extraordinary performance in StarCraft II and Dota 2 through DRL-based methods [4,5]. DRL involves an agent interacting with an unknown environment in a sequential manner, utilizing deep learning techniques to make decisions based on acquired information with the objective of maximizing cumulative rewards. RL techniques have demonstrated significant potential in addressing complex sequential decision problems.
Advances in DRL and massive financial datasets have motivated researchers to apply DRL algorithms to financial markets. The researchers focus on modeling and analyzing financial markets by adapting DRL techniques. Using historical asset prices as input, they design neural networks to optimize trading actions to maximize returns or Sharpe ratios. DRL has been successful in several financial areas, including optimal execution, portfolio optimization, and market making.
In this paper, a Dual Action and Dual Environment Deep Q-Network, named DADE-DQN, is proposed as a trading strategy that extends and enhances the DQN algorithm. First, to achieve a better balance between exploration and exploitation, a dual-action selection and dual-environment mechanism is added to the DQN framework. This enables the agent to explore new actions while utilizing what it has learned. Second, for faster convergence and more efficient learning, our approach optimizes the utilization of stored transitions by exploiting independent replay memories and performing dual mini-batch updates. Third, a novel deep network structure combining LSTM and attention mechanisms is introduced, which enhances the network's ability to capture key features and patterns. In addition, a creative feature selection method is proposed to efficiently improve the input data by utilizing mutual information to identify and eliminate irrelevant features.
In summary, our study makes the following key contributions:
  • A novel DRL model, Dual Action and Dual Environment Deep Q-Network, named DADE-DQN, is proposed. The proposed model incorporates dual action selection and dual environment mechanisms into the DQN framework to effectively balance exploration and exploitation.
  • Optimized utilization of stored transitions, achieved by leveraging independent replay memories and performing dual mini-batch updates, leads to faster convergence and more efficient learning.
  • A novel deep network architecture combining LSTM and attention mechanisms is introduced to improve the network's ability to capture important features and patterns in stock market data.
  • An innovative feature selection method is proposed to efficiently enhance input data by utilizing mutual information in order to identify and eliminate irrelevant features.
  • Evaluations on six datasets show that the presented DADE-DQN algorithm demonstrates excellent performance compared to multiple DRL-based strategies such as TDQN, DQN-Pattern, DQN-Vanilla, and traditional strategies such as B&H, S&H, MR, and TF.
The rest of this paper is arranged as follows. Section 2 describes the related work. Section 3 generally formalizes the stock trading problem and describes the proposed approach in detail. The experimental results are shown and analyzed in Section 4. Section 5 presents the discussion. Section 6 presents the conclusions and future work.

3. Materials and Methods

Driven by the progress of DQN, the DADE-DQN method is proposed to improve stock trading strategies through a dual-action, dual-environment deep Q-network approach. The DQN algorithm has been widely utilized to optimize trading strategies using stock states as inputs.

3.1. Materials

RL is a framework that involves an agent, a set of states S, and a set of actions A for each state [56]. The goal of reinforcement learning is to improve decision making by allowing the agent to learn from the consequences of its actions. A Markov Decision Process (MDP) is a mathematical framework used to model decision-making problems in which an agent interacts with an environment to achieve certain goals. It is widely used in various fields, including finance and economics, to analyze and optimize decision-making processes. In the context of the financial market, both in a broad sense and specifically in the stock market, MDPs can be employed to understand and potentially improve decision-making strategies [57]. Figure 1 illustrates the agent's interaction with the Markov decision process environment.
Figure 1. The interaction diagram of an agent with the Markov decision process environment. Different colored circles represent different neurons.
When an agent performs an action $a \in A$, it transitions from one state to another. Each action taken in a particular state provides the agent with a reward, denoted as $r_t$. The agent's goal is to maximize its cumulative reward by considering the potential future rewards. This is achieved by adding the maximum achievable future reward to the reward obtained in the current state, thereby influencing the agent's current action based on the potential future rewards. The potential future reward is calculated as the weighted sum of the expected values of rewards for all future steps starting from the current state. Mathematically, it can be represented as Equations (1) and (2).
$\pi^* = \arg\max_{\pi} \mathbb{E}[R \mid \pi]$,   (1)
$R = \sum_{t=0}^{\infty} \gamma^{t} r_{t}$,   (2)
where $\pi^*$ represents the optimal policy, $\pi$ represents the current policy, $\gamma$ is the discount factor ($\gamma \in [0, 1]$), $r_t$ represents the immediate reward at time step $t$, and $R$ denotes the cumulative reward. The discount factor $\gamma$ determines the relative importance of future rewards in the overall cumulative reward calculation.
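As a concrete illustration of Equation (2), the short sketch below computes the discounted cumulative reward for a finite sequence of per-step rewards. The reward values and discount factor are illustrative placeholders, not values taken from the paper.

```python
# Minimal sketch: discounted cumulative reward R = sum_t gamma^t * r_t (Equation (2)).
# The reward sequence and discount factor below are placeholder values.

def cumulative_reward(rewards, gamma=0.99):
    """Return sum_{t=0}^{T-1} gamma^t * r_t for a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

if __name__ == "__main__":
    rewards = [1.0, -0.5, 2.0, 0.0, 1.5]          # hypothetical per-step rewards r_t
    print(cumulative_reward(rewards, gamma=0.9))  # weighted sum with discount factor gamma
```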

3.1.1. States

In the context of algorithmic trading, the RL environment can be considered as an abstraction that contains the trading mechanism and all relevant information that can influence the agent’s trading activity. At each time step t, the RL agent receives a set of information including RL status (current trading position, available cash, etc.), stock information (opening price, closing price, volume, etc.), technical indicators of the stock market (moving average convergence divergence, relative strength index, average direction index, etc.), macroeconomic information, news information, etc.
Therefore, it is important to select and properly process information from the environment. Many existing studies rely on the use of raw stock information as input to RL, which may not provide enough information for RL agents to make effective differentiation and informed decisions. In this paper, mutual information is used as a technique to select highly relevant features. The dimension of the state of the environment at time t is denoted as m × n , where m corresponds to the different technical indicators obtained from the price information and n represents the time window length. It can be expressed as Equations (3) and (4).
$s_t = [I_{t-n+1}, I_{t-n+2}, \ldots, I_t]$,   (3)
$I_t = [I_t^1, I_t^2, \ldots, I_t^m]$,   (4)
where $I_t^j$ denotes the $j$th normalized technical indicator on the $t$th day, and $n$ represents the number of past days in the window.
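To make the state definition of Equations (3) and (4) concrete, the sketch below assembles the $m \times n$ state matrix from a table of normalized indicators. The indicator matrix is random placeholder data; in practice it would hold the selected technical indicators.

```python
import numpy as np

# Sketch: build the m x n state s_t = [I_{t-n+1}, ..., I_t] from a matrix of
# normalized indicators (rows = trading days, columns = the m indicators).
# The indicator matrix below is random placeholder data, not real market data.

def build_state(indicators: np.ndarray, t: int, n: int) -> np.ndarray:
    """Return the window of the last n rows ending at day t, transposed to m x n."""
    window = indicators[t - n + 1 : t + 1]       # shape (n, m)
    return window.T                              # shape (m, n), as in Equations (3) and (4)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    indicators = rng.standard_normal((250, 24))  # 250 days, m = 24 normalized features
    s_t = build_state(indicators, t=100, n=10)   # n = 10 day look-back window
    print(s_t.shape)                             # (24, 10)
```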

3.1.2. Actions

At each time step $t$, the RL agent first observes the environment state $s_t$. According to the RL policy $\pi(a_t \mid s_t)$, the agent selects an action $a_t$. The state–action value is referred to as the Q-value and represents the expected cumulative reward, expressed as Equation (5).
$Q_{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a \right]$,   (5)
This Q-value represents the expected cumulative reward when taking action $a_t$ in state $s_t$ according to policy $\pi$. The optimal Q-value, denoted $Q^*(s, a)$, is determined by the optimal policy that maximizes the Q-value. The Bellman equation can be expressed as Equation (6):
$Q^*(s, a) = \max_{\pi} Q_{\pi}(s, a)$,   (6)
The optimal Q-value $Q^*(s, a)$ can be expressed recursively as Equation (7):
$Q^*(s, a) = r_t + \gamma \max_{a'} Q^*(s_{t+1}, a')$,   (7)
where $a'$ represents the action taken in the subsequent state $s_{t+1}$.
This paper assumes that the agent’s actions do not influence the financial market. Consequently, the action space can be expressed as Equation (8).
$a_t^* = \begin{cases} -1, & \text{if } Q^*(s, a) = Q_t^{\mathrm{Sell}}; \\ 0, & \text{if } Q^*(s, a) = Q_t^{\mathrm{Hold}}; \\ 1, & \text{if } Q^*(s, a) = Q_t^{\mathrm{Buy}}. \end{cases}$   (8)
where $Q_t^{\mathrm{Sell}}$, $Q_t^{\mathrm{Hold}}$, and $Q_t^{\mathrm{Buy}}$ are determined by the Q-network at each time step $t$.
To incorporate the position information $\mathrm{POS}_t$, the final action and the updated position information are computed as Equations (9) and (10).
$a_t = a_t^* \cdot (a_t^* \oplus \mathrm{POS}_{t-1})$,   (9)
where the symbol ⊕ indicates element-wise multiplication followed by addition: corresponding elements of the two operands are multiplied and the resulting products are summed. In this context, each element of $a_t^*$ is multiplied by the corresponding element of $\mathrm{POS}_{t-1}$, and the resulting products are aggregated to yield the final value of $a_t$.
$\mathrm{POS}_t = a_t + \mathrm{POS}_{t-1}$,   (10)
The position information $\mathrm{POS}_t$ takes values from the set $\{-1, 0, 1\}$, where $\mathrm{POS}_0 = 0$ indicates no initial position. A positive value ($\mathrm{POS}_t = 1$) indicates a long position, while a negative value ($\mathrm{POS}_t = -1$) indicates a short position. In the action space $A = \{-1, 0, 1\}$, when $a_t^* = 1$ and $\mathrm{POS}_{t-1} = 0$, the agent establishes a long position; when $a_t^* = -1$ and $\mathrm{POS}_{t-1} = 1$, it closes this long position. Similarly, when $a_t^* = -1$ and $\mathrm{POS}_{t-1} = 0$, the agent opens a short position, and it closes this short position when $a_t^* = 1$ and $\mathrm{POS}_{t-1} = -1$. These trading rules are intuitive and realistic, allowing the agent to behave more similarly to a human trader.
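The position-update rules described above can also be written as simple conditional logic. The sketch below is an interpretation of this paragraph, expressed with explicit conditionals rather than the ⊕ operator of Equation (9); it is not the authors' implementation.

```python
# Sketch of the position-update rules in this subsection, written as explicit
# conditionals rather than via the operator of Equation (9). This is an
# interpretation of the text, not the authors' implementation.

def apply_action(a_star: int, pos_prev: int) -> tuple[int, int]:
    """a_star in {-1, 0, 1} (sell/hold/buy); pos_prev in {-1, 0, 1}."""
    if a_star == 0 or a_star == pos_prev:
        a_t = 0                      # hold, or already in the requested position
    else:
        a_t = a_star                 # open a new position or close the opposite one
    pos_t = a_t + pos_prev           # Equation (10)
    return a_t, pos_t

if __name__ == "__main__":
    print(apply_action(1, 0))    # open long   -> (1, 1)
    print(apply_action(-1, 1))   # close long  -> (-1, 0)
    print(apply_action(-1, 0))   # open short  -> (-1, -1)
    print(apply_action(1, -1))   # close short -> (1, 0)
```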

3.1.3. Reward Function

The reward function is an important component of the RL algorithm that determines the reward assigned to each action taken by the agent. Its purpose is to maximize the cumulative reward over time. In this study, the short-term Sharpe Ratio (SSR) reward is used to address this challenge [37]. The SSR reward function incorporates information about the subsequent $k$ days as well as the agent's position. Thus, it is an appropriate indicator of the actual reward in the RL context. Mathematically, the SSR reward function is defined as Equations (11)–(13).
$\mathrm{SSR}_t = \mathrm{POS}_t \times R_t^*$,   (11)
$R_t^* = \dfrac{\mathrm{mean}(R_t^k)}{\mathrm{std}(R_t^k)}$,   (12)
$R_t^k = \left[ \dfrac{p_{t+1} - p_t}{p_t}, \dfrac{p_{t+2} - p_t}{p_t}, \ldots, \dfrac{p_{t+k} - p_t}{p_t} \right]$,   (13)
where $R_t^k$ represents the returns over the next $k$ days, calculated as the percentage change in price $p_t$ from time $t$ to $t + k$.
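The following sketch implements the SSR reward of Equations (11)–(13) directly. The price series is placeholder data; in practice it would be the asset's closing prices.

```python
import numpy as np

# Sketch of the short-term Sharpe Ratio (SSR) reward of Equations (11)-(13).
# The price series below is placeholder data; in practice it would be the
# asset's closing-price series.

def ssr_reward(prices: np.ndarray, t: int, k: int, pos_t: int) -> float:
    """SSR_t = POS_t * mean(R_t^k) / std(R_t^k), with R_t^k the next-k-day returns."""
    p_t = prices[t]
    returns = (prices[t + 1 : t + k + 1] - p_t) / p_t   # Equation (13)
    r_star = returns.mean() / returns.std()             # Equation (12)
    return pos_t * r_star                               # Equation (11)

if __name__ == "__main__":
    prices = np.array([100.0, 101.5, 102.0, 99.8, 103.2, 104.0, 102.5])
    print(ssr_reward(prices, t=0, k=5, pos_t=1))   # reward for holding a long position
```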

3.2. Methods

3.2.1. Dual Action and Dual Environment Deep Q-Network (DADE-DQN)

In the financial market, achieving higher profits while minimizing risks is a difficult task for traders. Inspired by the DQN algorithm, a novel approach called Dual Action and Dual Environment Deep Q-Network (DADE-DQN) is proposed to address this challenge. DADE-DQN is an extension of the DQN algorithm specifically designed for the trading problem. Typically, DQN-based RL agents require a large number of training episodes to learn the optimal strategy and maximize the cumulative reward. However, in stock trading, this can lead to overfitting and poor performance on the test set due to high data noise, randomness, and limited real stock data. Our goal is to overcome overfitting and explore the best policy efficiently with limited data. The main features of the DADE-DQN algorithm are as follows:
Optimal and Exploit Replay Memory: The DADE-DQN algorithm utilizes two separate replay memories: an optimal replay memory ($D$) and an exploit replay memory ($D'$). This separation allows for more effective learning by selectively replaying transitions based on their potential for exploration or exploitation.
Dual Action Selection: In DADE-DQN, the agent takes two different actions at the same time, obtaining two sets of transitions from the environment. This approach addresses the issue of limited data to a certain extent. During training, using these two sets of transitions, the maximum and minimum Q-values corresponding to the actions generated by the Q-network are optimized alternately. This allows the RL agent to avoid selecting the worst action. The selection of the optimal action is based on maximizing the Q-value, while the selection of the exploit action is based on minimizing the Q-value. This dual action selection strategy enhances the agent's ability to explore while using its knowledge to exploit the learned values.
Dual Minibatch Update: The DADE-DQN algorithm performs dual minibatch updates based on the selected actions. In one time interval, it samples transitions from the optimal replay memory ($D$) and uses argmax action selection to update the Q-value for the next state. In the other time interval, it samples transitions from the exploit replay memory ($D'$) and updates the Q-values using argmin action selection. This dual minibatch update process facilitates more efficient learning and better utilization of the replay memories.
Novel Q-network: The efficacy of DRL algorithms relies largely on the structure of the value/policy network employed by agents. Therefore, a network structure is designed to approximate the action-value function of DADE-DQN. The network topology consists of four main layers: an input layer, an encoder layer, an attention layer, and an action layer, as shown in Figure 2. In recent years, attention mechanisms have shown good potential in computer vision (CV) and natural language processing (NLP) fields. In this paper, a hierarchical attention mechanism proposed by Yang et al. is used to design the proposed network [58].
Figure 2. The Q-network structure of the proposed algorithm.
Input: As stated in Section 3.1.1, the input contains $m$ components, consisting of several technical indicators and the open, close, high, and low prices of the previous $n$ days, which form the state at time step $t$. During the training process, a mini-batch is sampled to train the network.
Encoder Layer: Initially, the input is passed through an LSTM network to obtain a hidden representation, denoted as $h_i$, which effectively captures the temporal information [59]. LSTM is a specialized type of recurrent neural network (RNN) that alleviates the vanishing gradient problem and facilitates the learning of long-term dependencies.
The following equations outline the fundamental operations, where ⊙ denotes the element-wise product (Hadamard product) as Equations (14)–(18).
$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$   (14)
$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$   (15)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$   (16)
$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$   (17)
$h_t = o_t \odot \tanh(c_t)$   (18)
The core of LSTM lies in the cell state, which can selectively add or remove information through specialized structures called gates. Gates, which consist of a sigmoid neural network layer and a point-wise product operation, regulate the flow of information. The sigmoid functions output values between 0 and 1, where 1 represents “keep this information completely” and 0 represents “discard this information entirely”. LSTM cells contain three gates: the forget gate, input gate, and output gate, which together protect and manage the cell state. The forget gate $f_t$ uses a sigmoid neural network layer that takes the previous cell output $h_{t-1}$ and the current cell input $x_t$ to determine which information should be discarded from the cell state. The input gate $i_t$ works alongside a tanh layer to control the addition of new information. The tanh layer generates a vector $\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$, which is then added to the cell state $c_t$. Subsequently, the sigmoid layer outputs a value between 0 and 1 for each element in $\tilde{c}_t$, determining the extent to which the new information is assimilated. The output gate $o_t$ governs the amount of information that is filtered out from the current cell state.
Attention Layer: The hidden states from different days have different levels of importance to the agent. To address this problem, an attention mechanism is introduced to extract the weights assigned to each day and hidden state, thereby aggregating the representation of these summarized states. The hierarchical attention mechanism used in this study operates as Equations (19)–(21).
$u_i = \tanh(W h_i + b)$,   (19)
$\alpha_i = \dfrac{\exp(u_i^{\top} u)}{\sum_i \exp(u_i^{\top} u)}$,   (20)
$v = \sum_i \alpha_i h_i$,   (21)
In our implementation, the hidden representation $h_i$ is first transformed by the matrix $W$, which is uniformly initialized, yielding $u_i$ as a weighted representation of $h_i$. The importance of the states is then measured based on the similarity between $u_i$ and a uniformly initialized context vector $u$. A softmax function is applied to obtain the normalized weights $\alpha_i$. Subsequently, the state $v$ is computed as a weighted sum of the hidden representations. This process generates an attention value vector $v$, which encapsulates all the information related to the different days and hidden states in the hidden representation $h_i$. The matrix $W$ and the context vector $u$ are learned jointly during the training process.
Action Layer: Finally, the attention value vector v is fed into an action network consisting of two fully connected layers to obtain the Q values of the three available actions.
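The sketch below is a compact PyTorch rendering of this four-layer topology (input, LSTM encoder, hierarchical attention, and a two-layer action head). The class name AttentionQNetwork, the hidden size, and the other dimensions are illustrative assumptions, not the paper's tuned hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a Q-network with the four-layer topology described above:
# input -> LSTM encoder -> hierarchical attention -> two fully connected action layers.
# Hidden sizes are illustrative assumptions, not the paper's tuned hyperparameters.

class AttentionQNetwork(nn.Module):
    def __init__(self, n_features: int = 24, hidden: int = 64, n_actions: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        # Hierarchical attention: u_i = tanh(W h_i + b), alpha_i = softmax(u_i^T u)
        self.attn_proj = nn.Linear(hidden, hidden)
        self.context = nn.Parameter(torch.empty(hidden).uniform_(-0.1, 0.1))
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(hidden, n_actions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_days, n_features), i.e. the state window in time-major order
        h, _ = self.lstm(x)                              # (batch, n_days, hidden)
        u = torch.tanh(self.attn_proj(h))                # Equation (19)
        scores = torch.matmul(u, self.context)           # similarity with context vector u
        alpha = F.softmax(scores, dim=1).unsqueeze(-1)   # Equation (20)
        v = (alpha * h).sum(dim=1)                       # Equation (21): weighted sum
        return self.fc2(F.relu(self.fc1(v)))             # Q-values for sell/hold/buy

if __name__ == "__main__":
    net = AttentionQNetwork()
    state = torch.randn(8, 10, 24)    # batch of 8 states: 10-day window, 24 features
    print(net(state).shape)           # torch.Size([8, 3])
```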

3.2.2. Feature Selection Based on Mutual Information

In this section, mutual information is used to measure the statistical dependence or the amount of information shared between two random variables. Developed within the field of information theory by Claude Shannon, mutual information has become an important concept in various fields. The goal is to select several technical indicators as explanatory variables, with the closing price of the day serving as the response variable.
The feature selection process begins by pre-selecting several features from the library as potential explanatory variables. Then, univariate analysis is conducted to identify features with low variance and a significant number of missing values, which are excluded from further analysis. Subsequently, mutual information is employed to measure the interdependence between all remaining explanatory variables and the response variable, providing insights into the correlation between each explanatory variable and the closing price. Finally, a higher threshold is set to filter out the final technical factors. This threshold is used to select the characteristics that exhibit a high correlation with the closing price. Figure 3 illustrates the flow of feature selection.
Figure 3. The flow of feature selection.
Mutual information quantifies the statistical dependence between two random variables. It is based on the concept of entropy, which measures the uncertainty or randomness of a random variable. The mutual information between two random variables, denoted as I ( X ; Y ) , is calculated using the probabilities of their joint distribution P ( X , Y ) and their individual marginal distributions P ( X ) and P ( Y ) . The formula for calculating mutual information can be expressed as Equation (22).
$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \cdot \log \dfrac{P(x, y)}{P(x) \cdot P(y)}$,   (22)
where $P(X, Y)$ represents the joint probability distribution of variables $X$ and $Y$, while $P(X)$ and $P(Y)$ represent their respective marginal probability distributions. The resulting mutual information value, $I(X; Y)$, is always non-negative, with a higher value indicating a stronger dependency between the variables. A value of zero indicates that the variables are independent.
In practice, a threshold can be set on the mutual information score and used as the criterion for selecting the final technical factor features. Factors with mutual information values above the threshold are considered to have a strong dependence on the closing price and are included in the final set of selected features.
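As an illustration of this filtering step, the sketch below scores candidate features against the closing price using scikit-learn's mutual information estimator (one possible estimator; the paper does not specify its implementation) and keeps those above a threshold. The DataFrame holds placeholder data, not real indicators.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Sketch of the mutual-information filter described above. The DataFrame below
# holds placeholder data; in practice the columns would be the TA-Lib indicators.

def select_features(features: pd.DataFrame, close: pd.Series, threshold: float = 0.5):
    """Keep columns whose mutual information with the closing price exceeds the threshold."""
    scores = mutual_info_regression(features.values, close.values, random_state=0)
    scores = pd.Series(scores, index=features.columns)
    return scores[scores > threshold].index.tolist(), scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    close = pd.Series(rng.standard_normal(500).cumsum() + 100.0)
    features = pd.DataFrame({
        "sma_10": close.rolling(10).mean().bfill(),    # closely tied to the closing price
        "noise": pd.Series(rng.standard_normal(500)),  # unrelated feature
    })
    selected, scores = select_features(features, close)
    print(scores.round(3).to_dict(), selected)
```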

3.2.3. DADE-DQN Training

To summarize, Algorithm 1 provides the pseudo-code of the entire DADE-DQN algorithm, which is the basis of our proposed trading strategy. The strategy is updated based on further learning from daily observations during the trading process. This iterative process ensures the continuous learning and refinement of the trading strategy.
Algorithm 1 DADE-DQN algorithm
1:  Initialize the optimal replay memory D to capacity N.
2:  Initialize the exploit replay memory D′ to capacity N.
3:  Initialize the policy network with random weights θ.
4:  Initialize the target network with weights θ′ = θ.
5:  for episode = 1, M do
6:      Acquire the initial state s_1 from the environment.
7:      for t = 1, T do
8:          Choose the optimal action a_t^+ = argmax_a Q(s_t, a; θ).
9:          With probability ε_t, select the exploit action a_t^− at random from the actions other than a_t^+;
10:         otherwise, select a_t^− = argmin_a Q(s_t, a; θ).
11:         Copy the environment: δ′ = δ.
12:         Execute action a_t^+ in environment δ; observe reward r_t^+ and the new state s_{t+1}.
13:         Execute action a_t^− in environment δ′; observe reward r_t^− and the new state s′_{t+1}.
14:         Store transition (s_t, a_t^+, r_t^+, s_{t+1}) in D.
15:         Store transition (s_t, a_t^−, r_t^−, s′_{t+1}) in D′.
16:         if t mod Interval = 0 and t mod (2·Interval) ≠ 0 then
17:             Sample a random minibatch of transitions (s_i, a_i, r_i, s_{i+1}) from D.
18:             Set y_i = r_i if s_{i+1} is terminal; otherwise y_i = r_i + γ Q(s_{i+1}, argmax_a Q(s_{i+1}, a; θ); θ′).
19:             Perform a gradient descent step on (y_i − Q(s_i, a_i; θ))^2 with respect to the parameters θ.
20:         end if
21:         if t mod (2·Interval) = 0 then
22:             Sample a random minibatch of transitions (s_i, a_i, r_i, s_{i+1}) from D′.
23:             Set y_i = r_i if s_{i+1} is terminal; otherwise y_i = r_i + γ Q(s_{i+1}, argmin_a Q(s_{i+1}, a; θ); θ′).
24:             Perform a gradient descent step on (y_i − Q(s_i, a_i; θ))^2 with respect to the parameters θ.
25:         end if
26:         Update the target network parameters θ′ = θ every T′ steps.
27:         Anneal the ε-greedy exploration parameter ε_t.
28:     end for
29: end for
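For readers who prefer code, the condensed PyTorch sketch below implements the alternating dual-minibatch update of lines 16–25 under simplifying assumptions: replay memories are plain Python lists of (state, action index, reward, next state) tuples, terminal-state handling and ε-annealing are omitted, and the toy networks in the demo stand in for the Q-network described above. It is a sketch of the mechanism, not the authors' implementation.

```python
import random

import torch
import torch.nn.functional as F

# Condensed sketch of the alternating dual-minibatch update (lines 16-25 of
# Algorithm 1). Replay memories are plain Python lists of
# (state, action_index, reward, next_state) tuples; terminal-state handling,
# epsilon annealing, and target-network synchronization are omitted.

def dual_minibatch_update(policy_net, target_net, optimizer, memory_opt, memory_exp,
                          t, interval, batch_size=32, gamma=0.99):
    on_optimal = t % interval == 0 and t % (2 * interval) != 0   # odd multiples of interval
    on_exploit = t % (2 * interval) == 0                         # even multiples of interval
    if not (on_optimal or on_exploit):
        return
    memory = memory_opt if on_optimal else memory_exp
    if len(memory) < batch_size:
        return
    states, actions, rewards, next_states = zip(*random.sample(memory, batch_size))
    states, next_states = torch.stack(states), torch.stack(next_states)
    actions = torch.tensor(actions, dtype=torch.long)
    rewards = torch.tensor(rewards, dtype=torch.float32)

    with torch.no_grad():
        next_q = policy_net(next_states)
        # argmax over D (line 18) versus argmin over D' (line 23)
        a_next = next_q.argmax(dim=1) if on_optimal else next_q.argmin(dim=1)
        target_q = target_net(next_states).gather(1, a_next.unsqueeze(1)).squeeze(1)
        y = rewards + gamma * target_q

    q = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)          # (y_i - Q(s_i, a_i; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

if __name__ == "__main__":
    # Toy stand-ins for the Q-network: flatten the 10 x 24 state and map to 3 Q-values.
    net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(10 * 24, 3))
    tgt = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(10 * 24, 3))
    tgt.load_state_dict(net.state_dict())
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    mem_opt = [(torch.randn(10, 24), random.randrange(3), 0.1, torch.randn(10, 24))
               for _ in range(64)]
    mem_exp = [(torch.randn(10, 24), random.randrange(3), -0.1, torch.randn(10, 24))
               for _ in range(64)]
    for step in range(1, 41):
        dual_minibatch_update(net, tgt, opt, mem_opt, mem_exp, t=step, interval=10)
```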

4. Experiments and Results

In this section, the process of selecting the dataset and associated features is first investigated. Subsequently, the details of the experiment settings are discussed. Finally, the experimental results are examined and discussed. The effectiveness of the proposed model is assessed by evaluating its performance on six different stock index datasets.

4.1. Datasets

In order to fully assess the performance and robustness of our model, a range of indices from different regions and economic backgrounds were used in the experiments, including the NASDAQ Composite (IXIC), the S&P 500 (SP500), and the Dow Jones Industrial Average (DJI) in the U.S., as well as the CAC 40 (FCHI) in France, the Korea Composite Stock Price Index (KOSPI, KS11) in Korea, and the Nikkei 225 (N225) in Japan. Each dataset spans 1 January 2007 to 31 December 2022 and was divided chronologically into a training set and a test set. Figure 4 shows the price movement of the six assets, with the training period drawn in green and the test period in blue.
Figure 4. The price movement of the six assets [from 1 January 2007 to 31 December 2022] [green parts: training set; blue parts: test set].

4.2. Feature Selection Results

To identify relevant explanatory variables, 57 features were selected from the TA-Lib software library, as shown in Table 2. Subsequently, a series of steps were performed to filter and refine these features. First, univariate analyses were performed to exclude features with low variance (greater than 90%) and those with a significant number of missing values. Following this, mutual information scores were calculated to assess the interdependence between the explanatory and response variables, as demonstrated in Table 3.
Table 2. Description of exploited technical indicators.
Table 3. Mutual information scores for selected technical indicators.
To establish a strong relationship with the closing price, 24 features were selected based on a threshold of 0.5. These selected features are detailed in Table 4. For each time step, the RL environment state was constructed by considering the closing prices and technical indicators from the previous n days. Consequently, the size of the environment state at time t could be represented as an m × n matrix, where m = 24. This matrix included the opening price, closing price, high price, low price, and 20 different technical indicators based on price and volume. Further information regarding these indicators can be found in Table 4. In this context, the window length was set to n = 10 days; more details are given in the experiment settings.
Table 4. Mutual information scores for selected technical indicators with a threshold of 0.5.

4.3. Experiment Settings

To assess the performance of the DADE-DQN model with different window lengths, experiments were conducted using the DJI dataset. Figure 5 depicts the cumulative rewards curves obtained from training DADE-DQN with window lengths of 5, 10, 15, and 20, respectively. As can be seen in the figure, the training curve with a window length of 10 was balanced between stability and cumulative return. Furthermore, Table 5 presents the results of various performance metrics, including the cumulative return, annualized return, Sharpe ratio, and maximum drawdown, for window lengths of 5, 10, 15, and 20, respectively. Upon analyzing the table, it became apparent that a window length of 10 yields superior results compared to other window lengths. Consequently, the window length chosen for this study was 10.
Figure 5. The cumulative rewards curve for DADE-DQN training with window lengths of 5, 10, 15, and 20 on the DJI dataset.
Table 5. Performance of DADE-DQN on DJI dataset with window lengths of 5, 10, 15 and 20.
The implementation of the models was carried out using the PyTorch library in Python. The baseline methods were adopted from the code provided by Theate et al. [31] (Note: https://github.com/ThibautTheate/An-Application-of-Deep-Reinforcement-Learning-to-Algorithmic-Trading (accessed on 1 August 2023)) and Taghian et al. [34] (Note: https://github.com/MehranTaghian/DQN-Trading (accessed on 1 August 2023)) to build the algorithm.
The experimental setup involved several key components as follows:
  • Preprocessing: Preprocessing was performed by normalizing each input variable to ensure consistent scaling and prevent gradient explosions. This normalization resulted in a mean of 0 and a standard deviation of 1 for the normalized data.
  • Initialization: Xavier initialization was employed to enhance the convergence of the algorithm by setting the initial weights in a manner that maintained a constant variance of gradients across the deep neural network layers. A short sketch of the normalization and initialization steps appears after this list.
  • Performance metrics: To accurately evaluate the performance of the trading strategies, four performance metrics were selected, namely cumulative return (CR), Sharpe ratio (SR), maximum drawdown (MDD), and annualized return (AR). These metrics provide a comprehensive and objective analysis of the strategy’s benefits and drawbacks in terms of profitability and risk management.
  • Baseline methods: To objectively evaluate the advantages and disadvantages of the DADE-DQN algorithm, a comparison was made with traditional methods like buy and hold (B&H), sell and hold (S&H), mean reversion with moving averages (MR), and trend following with moving averages (TF) [60,61,62], as well as DRL methods including TDQN [31], DQN-Vanilla, and DQN-Pattern [34].
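The following sketch illustrates the two setup steps mentioned above: z-score normalization of each input variable and Xavier initialization of the linear layers. The array contents and layer sizes are placeholders, not the paper's configuration.

```python
import numpy as np
import torch.nn as nn

# Sketch of the preprocessing and initialization bullets above: z-score
# normalization of each input variable and Xavier initialization of the linear
# layers. Data and layer sizes are placeholders, not the paper's configuration.

def zscore(x: np.ndarray) -> np.ndarray:
    """Normalize each column to mean 0 and standard deviation 1."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def init_xavier(module: nn.Module) -> None:
    """Apply Xavier (Glorot) uniform initialization to every linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

if __name__ == "__main__":
    raw = np.random.default_rng(0).standard_normal((100, 24)) * 5 + 3
    data = zscore(raw)
    print(data.mean(axis=0).round(6)[:3], data.std(axis=0).round(6)[:3])
    net = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 3))
    net.apply(init_xavier)   # recursively initializes both Linear layers
```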
It is worth mentioning that tuning the hyperparameters of DRL algorithms is a complex and time-consuming task. In this study, a trial-and-error method was used to determine the hyperparameter values. Table 6 presents the hyperparameter values used in the DADE-DQN model.
Table 6. The hyperparameter values of DADE-DQN.
Figure 6 illustrates the trend of the cumulative reward during the training process of the DADE-DQN model using SSR on six datasets. Figure 6 shows a gradual increase in the cumulative reward as the number of training episodes progresses, eventually reaching a state of convergence. The cumulative rewards served as a crucial performance metric to evaluate the effectiveness of the DADE-DQN algorithm, quantifying the total reward obtained by the model over multiple episodes of training and reflecting its ability to make profitable trading decisions. Initially, the cumulative reward exhibited fluctuations and variability as the model explored and learned from the environment. However, as training progressed, the DADE-DQN algorithm improved its policy and decision-making capabilities, resulting in a steady increase in cumulative rewards.
Figure 6. The trend of the cumulative reward during the training process of the DADE-DQN model using SSR on six datasets.
Therefore, the convergence of the cumulative rewards indicated that the DADE-DQN model learned an effective trading strategy and achieved a stable state. At this point, the model had acquired sufficient knowledge from the training data and consistently generated profitable trading decisions. The convergence of cumulative returns meant that the DADE-DQN algorithm successfully captured the underlying patterns and dynamics of the financial markets and used historical price data, technical indicators, and environmental conditions to make informed trading choices that maximized cumulative returns.

4.4. Experimental Results

This section presents a comparison between the proposed DADE-DQN framework and seven baseline methods to evaluate its performance. The experimental results for the six stock index datasets are provided in Table 7. The trading performance was assessed using four performance metrics: cumulative return (CR), Sharpe ratio (SR), annualized return (AR), and maximum drawdown (MDD).
Table 7. Performances of various trading methods on six assets.
Table 7 shows the performance of various trading methods on six assets. Among the listed metrics, the DADE-DQN strategy achieved the best scores, indicating its promising performance. The DADE-DQN strategy achieved significantly higher cumulative returns compared to other strategies. Additionally, it achieved the smallest maximum drawdown in almost all cases, highlighting its strong profitability and risk aversion on the six datasets. For instance, on the IXIC dataset, the DADE-DQN strategy achieved a cumulative return of 49.15%, surpassing all other methods. It also demonstrated a Sharpe ratio of 0.99, indicating favorable risk-adjusted returns. Furthermore, the maximum drawdown of the DADE-DQN strategy on the IXIC dataset was 16.35%, which was the lowest among all the methods. Similar observations could be made for the other assets listed in the table. The DADE-DQN consistently outperformed the baseline methods with respect to cumulative return, Sharpe ratio, and annualized return, demonstrating its superior performance. These results validated the effectiveness of the proposed DADE-DQN framework in generating profitable trading strategies with reduced risk.
Figure 7 illustrates the cumulative return curves obtained by applying various trading strategies, including B&H, S&H, MR, TF, TDQN, DQN-Vanilla, DQN-Pattern, and the proposed DADE-DQN strategy. The figure clearly shows the superior performance of the DADE-DQN strategy compared to the other strategies. It exhibited a remarkable ability to mitigate the risk of significant losses while achieving excess returns. Moreover, the total assets under the DADE-DQN strategy exhibited a smoother upward trend compared to the benchmark strategy.
Figure 7. Performance of different models on six datasets.
The observed performance in Figure 7 aligns with the conclusions derived from the performance metrics in Table 7. The DADE-DQN strategy consistently outperformed the baseline methods across different assets, confirming its effectiveness in generating profitable trading strategies. It not only avoided the substantial risk of loss but also generated excess returns, surpassing the benchmark strategy and other comparative approaches. This result further confirmed the effectiveness of the DADE-DQN framework in generating profitable trading strategies. It successfully balanced risk and reward, providing investors with a more stable and profitable investment approach compared to traditional strategies and DRL methods.
Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13 illustrate the trading signals generated by the DADE-DQN strategy for different assets. The figures showcase the strategy's ability to accurately detect market trends and effectively adapt its position to mitigate risks, particularly during periods of significant market shocks. The trading signals provided by the DADE-DQN strategy demonstrated its capability to identify favorable market conditions and adjust positions accordingly. By leveraging its DRL framework to analyze historical price and technical indicator data, it could capture important market patterns and make informed decisions. During periods of severe market shocks, such as sudden downturns or volatility spikes, the DADE-DQN strategy demonstrated its adaptability by swiftly changing the direction of positions. This allowed it to hedge against potential risks and minimize losses.
Figure 8. Trading signals of DJI based on DADE-DQN algorithm.
Figure 9. Trading signals of FCHI based on DADE-DQN algorithm.
Figure 10. Trading signals of IXIC based on DADE-DQN algorithm.
Figure 11. Trading signals of N225 based on DADE-DQN algorithm.
Figure 12. Trading signals of KS11 based on DADE-DQN algorithm.
Figure 13. Trading signals of SP500 based on DADE-DQN algorithm.

5. Discussion

This study introduces the DADE-DQN algorithm, an extension of the DQN algorithm, aimed at enhancing stock trading strategies. Table 7 presents a comprehensive overview of performance across various trading methods for each of the six assets. Of significance, the DADE-DQN strategy consistently exhibits exceptional performance, underscoring its potential. Notably, it consistently achieves the highest cumulative returns when compared to all other methods. Additionally, its strong risk-management capabilities are evident through consistently low maximum drawdowns, emphasizing its capacity for balanced profitability and risk reduction across diverse datasets. These empirical findings have important implications for policymakers, traders, and finance experts.
Comparative analysis involving multiple DRL-based strategies (TDQN, DQN-Pattern, DQN-Vanilla) and traditional strategies (B&H, S&H, MR, TF) under identical conditions clearly establishes the superiority of the DADE-DQN algorithm. Across various trading objectives, the DADE-DQN algorithm consistently outperforms benchmark methods, achieving remarkable cumulative returns and Sharpe ratios. Notably, on the KS11 dataset, the DADE-DQN strategy stands out with an impressive cumulative return of 79.43% and a Sharpe ratio of 2.21, showcasing its ability to generate profitable trading strategies while effectively managing risk.
The compelling results validate the effectiveness of the proposed DADE-DQN framework in generating profitable trading strategies while mitigating risk. This conclusion is further supported by Figure 7, which presents cumulative return curves for different trading strategies, including DADE-DQN, across various assets. The visual representation unequivocally demonstrates the outstanding performance of the DADE-DQN strategy, consistently achieving excess returns while effectively controlling the risk of significant losses. Importantly, the cumulative return curve of the DADE-DQN strategy exhibits a smoother upward trend in comparison to benchmark strategies, highlighting its enhanced stability and profitability.
Moreover, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13 provide detailed insights into the specific trading signals generated by the DADE-DQN strategy for different assets. These figures illustrate the strategy’s proficiency in identifying market trends and adeptly adjusting positions to manage risk, especially during periods of market turbulence. The trading signals capture favorable market conditions and dynamically adapt, showcasing the strategy’s versatility. Leveraging deep reinforcement learning, the strategy effectively analyzes historical price data and technical indicators, enabling it to identify crucial market patterns and make informed decisions. Importantly, in response to abrupt market changes, such as sudden declines or heightened volatility, the DADE-DQN strategy displays resilience by swiftly readjusting positions to mitigate risk. Hence, the robust performance of the DADE-DQN strategy, demonstrated in both backtesting and real-world application, highlights its potential to enhance stock trading strategies, offering investors stability and favorable returns.
In summary, the effectiveness of the DRL strategy in identifying market trends and dynamically adjusting positions to manage risk contributes to its overall profitability and establishes its role as a reliable trading strategy across diverse market scenarios. This underscores its potential to enhance stock trading strategies, providing investors with stability and favorable returns.

6. Conclusions and Future Work

Firstly, the primary value of this paper lies in the introduction of the DADE-DQN algorithm, an extension of the DQN algorithm, which presents a significant enhancement to stock trading strategies. The novel dual action selection and dual environment mechanism incorporated into the DADE-DQN algorithm effectively addresses the challenges posed by limited financial data and algorithmic stability. By striking a balance between exploration and exploitation, this mechanism elevates the performance of the algorithm, surpassing traditional methods. The incorporation of LSTM and attention mechanisms enriches the deep network architecture, enhancing its capacity to capture intricate patterns and important features within stock market data. Furthermore, the introduction of a feature selection method based on mutual information demonstrates innovation in enhancing model interpretability and efficiency.
Secondly, moving forward, several avenues for improvement exist. While the DADE-DQN algorithm showcased promising results, we acknowledge the need for further refinement. In particular, our current trading strategy, involving all-or-nothing position adjustments, might expose traders to higher risks and transaction fees during actual trading. We also recognize the absence of consideration for the impact of trading actions on the environment and the issue of slippage, both of which are pertinent aspects of real-world trading scenarios.
Finally, future research endeavors should focus on developing more sophisticated trading position mechanisms that enhance control over risk exposure while targeting increased returns. Additionally, exploring action distribution strategies to address uncertainties and risks is an exciting avenue for investigation. By accounting for the environmental impact of trading actions and tackling the slippage problem, we can attain a more accurate representation of actual trading conditions. These advancements will contribute to more reliable and effective trading strategies, offering practical implications for financial professionals and decision-makers alike.

Author Contributions

Conceptualization, methodology, data curation, writing—original draft preparation, Y.H. and Y.S.; software, validation, visualization, investigation, Y.H. and C.Z.; supervision, writing—review and editing, project administration, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the Faculty Research Grants, Macau University of Science and Technology (Project no. FRG-22-001-INT), and the Science and Technology Development Fund, Macau SAR (File no. 0096/2022/A).

Data Availability Statement

Our stock data are available for download at http://finance.yahoo.com (accessed on 1 February 2023).

Acknowledgments

The authors would like to express the appreciation to Zhen Guo, a student at the School of Computer Science and Engineering, Macau University of Science and Technology. His insightful inputs and suggestions have significantly enriched the content of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 2020. [Google Scholar]
  2. Hambly, B.; Xu, R.; Yang, H. Recent Advances in Reinforcement Learning in Finance. Math. Financ. 2021, 33, 435–975. [Google Scholar] [CrossRef]
  3. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Hassabis, D. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  4. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef] [PubMed]
  5. Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv 2019, arXiv:1912.06680. [Google Scholar]
  6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. Comput. Sci. 2013. [Google Scholar]
  7. Hasselt, H.V.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015. [Google Scholar]
  8. Lipton, Z.C.; Gao, J.; Li, L.; Li, X.; Ahmed, F.; Deng, L. Efficient exploration for dialog policy learning with deep BBQ networks & replay buffer spiking. arXiv 2016, arXiv:1608.05081. [Google Scholar]
  9. Mossalam, H.; Assael, Y.M.; Roijers, D.M.; Whiteson, S. Multi-objective deep reinforcement learning. arXiv 2016, arXiv:1610.02707. [Google Scholar]
  10. Mahajan, A.; Tulabandhula, T. Symmetry Learning for Function Approximation in Reinforcement Learning. arXiv 2017, arXiv:1706.02999. [Google Scholar]
  11. Taitler, A.; Shimkin, N. Learning control for air hockey striking using deep reinforcement learning. In Proceedings of the 2017 International Conference on Control, Artificial Intelligence, Robotics & Optimization (ICCAIRO), Prague, Czech Republic, 20–22 May 2017; pp. 22–27. [Google Scholar]
  12. Levine, N.; Zahavy, T.; Mankowitz, D.J.; Tamar, A.; Mannor, S. Shallow updates for deep reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  13. Leibfried, F.; Grau-Moya, J.; Bou-Ammar, H. An Information-Theoretic Optimality Principle for Deep Reinforcement Learning. arXiv 2017, arXiv:1708.01867. [Google Scholar]
  14. Anschel, O.; Baram, N.; Shimkin, N. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 176–185. [Google Scholar]
  15. Hester, T.; Vecerík, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Sendonaris, A.; Dulac-Arnold, G.; Osband, I.; Agapiou, J.P.; et al. Learning from Demonstrations for Real World Reinforcement Learning. arXiv 2017, arXiv:1704.03732. [Google Scholar]
  16. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952. [Google Scholar]
  17. Sorokin, I.; Seleznev, A.; Pavlov, M.; Fedorov, A.; Ignateva, A. Deep Attention Recurrent Q-Network. arXiv 2015, arXiv:1512.01693. [Google Scholar]
  18. Hausknecht, M.; Stone, P. Deep recurrent q-learning for partially observable mdps. In Proceedings of the 2015 AAAI Fall Symposium Series, Arlington, VA, USA, 12–14 November 2015. [Google Scholar]
  19. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.; Lanctot, M.; Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1995–2003. [Google Scholar]
  20. Mosavi, A.; Ghamisi, P.; Faghan, Y.; Duan, P.; Band, S. Comprehensive Review of Deep Reinforcement Learning Methods and Applications in Economics; Social Science Electronic Publishing: Rochester, NY, USA, 2020. [Google Scholar]
  21. Thakkar, A.; Chaudhari, K. A Comprehensive Survey on Deep Neural Networks for Stock Market: The Need, Challenges, and Future Directions. Expert Syst. Appl. 2021, 177, 114800. [Google Scholar] [CrossRef]
  22. Gao, X. Deep reinforcement learning for time series: Playing idealized trading games. arXiv 2018, arXiv:1803.03916. [Google Scholar]
  23. Huang, C.Y. Financial Trading as a Game: A Deep Reinforcement Learning Approach. arXiv 2018, arXiv:1807.02787. [Google Scholar]
  24. Chen, L.; Gao, Q. Application of Deep Reinforcement Learning on Automated Stock Trading. In Proceedings of the 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 18–20 October 2019; pp. 29–33. [Google Scholar]
  25. Jeong, G.H.; Kim, H.Y. Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Syst. Appl. 2019, 117, 125–138. [Google Scholar] [CrossRef]
  26. Li, Y.; Nee, M.; Chang, V. An Empirical Research on the Investment Strategy of Stock Market based on Deep Reinforcement Learning model. In Proceedings of the 4th International Conference on Complexity, Future Information Systems and Risk, Crete, Greece, 2–4 May 2019. [Google Scholar]
  27. Chakole, J.; Kurhekar, M. Trend following deep Q-Learning strategy for stock trading. Expert Syst. 2020, 37, e12514. [Google Scholar] [CrossRef]
  28. Dang, Q.V. Reinforcement learning in stock trading. In Advanced Computational Methods for Knowledge Engineering, Proceedings of the 6th International Conference on Computer Science, Applied Mathematics and Applications, ICCSAMA 2019, Hanoi, Vietnam, 19–20 December 2019; Springer International Publishing: Cham, Switzerland, 2019; pp. 311–322. [Google Scholar]
  29. Ma, C.; Zhang, J.; Liu, J.; Ji, L.; Gao, F. A Parallel Multi-module Deep Reinforcement Learning Algorithm for Stock Trading. Neurocomputing 2021, 449, 290–302. [Google Scholar] [CrossRef]
  30. Shi, Y.; Li, W.; Zhu, L.; Guo, K.; Cambria, E. Stock trading rule discovery with double deep Q-network. Appl. Soft Comput. 2021, 107, 107320. [Google Scholar] [CrossRef]
  31. Théate, T.; Ernst, D. An application of deep reinforcement learning to algorithmic trading. Expert Syst. Appl. 2021, 173, 114632. [Google Scholar] [CrossRef]
  32. Bajpai, S. Application of deep reinforcement learning for Indian stock trading automation. arXiv 2021, arXiv:2106.16088. [Google Scholar]
  33. Li, Y.; Liu, P.; Wang, Z. Stock Trading Strategies Based on Deep Reinforcement Learning. Sci. Program. 2022, 2022, 4698656. [Google Scholar] [CrossRef]
  34. Taghian, M.; Asadi, A.; Safabakhsh, R. Learning financial asset-specific trading rules via deep reinforcement learning. Expert Syst. Appl. 2022, 195, 116523. [Google Scholar] [CrossRef]
  35. Liu, P.; Zhang, Y.; Bao, F.; Yao, X.; Zhang, C. Multi-type data fusion framework based on deep reinforcement learning for algorithmic trading. Appl. Intell. 2023, 53, 1683–1706. [Google Scholar] [CrossRef]
  36. Tran, M.; Pham-Hi, D.; Bui, M. Optimizing Automated Trading Systems with Deep Reinforcement Learning. Algorithms 2023, 16, 23. [Google Scholar] [CrossRef]
  37. Huang, Y.; Cui, K.; Song, Y.; Chen, Z. A Multi-Scaling Reinforcement Learning Trading System Based on Multi-Scaling Convolutional Neural Networks. Mathematics 2023, 11, 2467. [Google Scholar] [CrossRef]
  38. Ye, Z.J.; Schuller, B.W. Human-Aligned Trading by Imitative Multi-Loss Reinforcement Learning. Expert Syst. Appl. 2023, 234, 120939. [Google Scholar] [CrossRef]
  39. Moody, J.; Saffell, M. Learning to trade via direct reinforcement. IEEE Trans. Neural Netw. 2001, 12, 875–889. [Google Scholar] [CrossRef]
  40. Lele, S.; Gangar, K.; Daftary, H.; Dharkar, D. Stock market trading agent using on-policy reinforcement learning algorithms. Soc. Sci. Electron. Publ. 2020. [Google Scholar] [CrossRef]
  41. Liu, F.; Li, Y.; Li, B.; Li, J.; Xie, H. Bitcoin transaction strategy construction based on deep reinforcement learning. Appl. Soft Comput. 2021, 113, 107952. [Google Scholar] [CrossRef]
  42. Wang, Z.; Lu, W.; Zhang, K.; Li, T.; Zhao, Z. A parallel-network continuous quantitative trading model with GARCH and PPO. arXiv 2021, arXiv:2105.03625. [Google Scholar]
  43. Mahayana, D.; Shan, E.; Fadhl’Abbas, M. Deep Reinforcement Learning to Automate Cryptocurrency Trading. In Proceedings of the 2022 12th International Conference on System Engineering and Technology (ICSET), Bandung, Indonesia, 3–4 October 2022; pp. 36–41. [Google Scholar]
  44. Xiao, X. Quantitative Investment Decision Model Based on PPO Algorithm. Highlights Sci. Eng. Technol. 2023, 34, 16–24. [Google Scholar] [CrossRef]
  45. Ponomarev, E.; Oseledets, I.V.; Cichocki, A. Using reinforcement learning in the algorithmic trading problem. J. Commun. Technol. Electron. 2019, 64, 1450–1457. [Google Scholar] [CrossRef]
  46. Liu, X.Y.; Yang, H.; Chen, Q.; Zhang, R.; Yang, L.; Xiao, B.; Wang, C.D. FinRL: A deep reinforcement learning library for automated stock trading in quantitative finance. arXiv 2020, arXiv:2011.09607. [Google Scholar]
  47. Liu, Y.; Liu, Q.; Zhao, H.; Pan, Z.; Liu, C. Adaptive quantitative trading: An imitative deep reinforcement learning approach. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 2128–2135. [Google Scholar]
  48. Lima Paiva, F.C.; Felizardo, L.K.; Bianchi, R.A.d.C.; Costa, A.H.R. Intelligent trading systems: A sentiment-aware reinforcement learning approach. In Proceedings of the Second ACM International Conference on AI in Finance, Virtual, 3–5 November 2021; pp. 1–9. [Google Scholar]
  49. Vishal, M.; Satija, Y.; Babu, B.S. Trading Agent for the Indian Stock Market Scenario Using Actor-Critic Based Reinforcement Learning. In Proceedings of the 2021 IEEE International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS), Bangalore, India, 16–18 December 2021; pp. 1–5. [Google Scholar]
  50. Ge, J.; Qin, Y.; Li, Y.; Huang, Y.; Hu, H. Single stock trading with deep reinforcement learning: A comparative study. In Proceedings of the 2022 14th International Conference on Machine Learning and Computing (ICMLC), Guangzhou, China, 18–21 February 2022; pp. 34–43. [Google Scholar]
  51. Nesselroade, K.P., Jr.; Grimm, L.G. Statistical Applications for the Behavioral and Social Sciences; John Wiley & Sons: Hoboken, NJ, USA, 2018. [Google Scholar]
  52. Cai, J.; Xu, K.; Zhu, Y.; Hu, F.; Li, L. Prediction and analysis of net ecosystem carbon exchange based on gradient boosting regression and random forest. Appl. Energy 2020, 262, 114566. [Google Scholar] [CrossRef]
  53. Li, G.; Zhang, A.; Zhang, Q.; Wu, D.; Zhan, C. Pearson Correlation Coefficient-Based Performance Enhancement of Broad Learning System for Stock Price Prediction. IEEE Trans. Circuits Syst. II Express Briefs 2022, 69, 2413–2417. [Google Scholar] [CrossRef]
  54. Guo, X.; Zhang, H.; Tian, T. Development of stock correlation networks using mutual information and financial big data. PLoS ONE 2018, 13, e195941. [Google Scholar] [CrossRef]
  55. Kong, A.; Azencott, R.; Zhu, H.; Li, X. Pattern Recognition in Microtrading Behaviors Preceding Stock Price Jumps: A Study Based on Mutual Information for Multivariate Time Series. Comput. Econ. 2023, 1–29. [Google Scholar] [CrossRef]
  56. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  57. Yue, H.; Liu, J.; Zhang, Q. Applications of Markov Decision Process Model and Deep Learning in Quantitative Portfolio Management during the COVID-19 Pandemic. Systems 2022, 10, 146. [Google Scholar] [CrossRef]
  58. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12 June–17 June 2016; pp. 1480–1489. [Google Scholar]
  59. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  60. Chan, E. Algorithmic Trading: Winning Strategies and Their Rationale; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 625. [Google Scholar]
  61. Narang, R.K. Inside the Black Box: A Simple Guide to Quantitative and High Frequency Trading; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 846. [Google Scholar]
  62. Chan, E.P. Quantitative Trading: How to Build Your Own Algorithmic Trading Business; John Wiley & Sons: Hoboken, NJ, USA, 2021. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
