Practical Application of Deep Reinforcement Learning to Optimal Trade Execution



Introduction
Electronic trading has gradually become the dominant form of trading in modern financial markets, comprising the majority of total trading volume. As a subfield of electronic trading, optimal trade execution has also become a crucial concept in financial markets. In the problem of optimal trade execution, the goal is to buy (or sell) V shares of a particular stock within a defined time period while minimizing the costs associated with trading. When executing a large order, however, submitting it as a single order would incur slippage because of its market impact, which forces the large order to be split into several smaller orders. The smaller orders are then submitted in the hope of being sequentially executed over time. Sequential decision making is a natural fit for reinforcement learning (RL), as it is typically modeled as a Markov decision process (MDP) in which the next state depends only on the current state and the action taken. For this reason, several studies have applied RL to optimal trade execution problems. Nevmyvaka et al. (2006) [1] took advantage of RL to decide the optimal price level at which to reposition the remaining inventory based on real-time market conditions. Hendricks and Wilcox (2014) [2] extended the standard model of Almgren and Chriss (2001) [3] by incorporating market microstructure elements, demonstrating another way to leverage RL in addressing the problem of optimal trade execution.
However, with the recent advances in neural networks and their ability to learn from data, researchers have turned to more advanced approaches that use deep reinforcement learning (DRL) (Mnih et al. (2013) [4]) for problems that involve a wide range of data inputs and market signals. The studies of Dabérius et al. (2019) [5] and Ye et al. (2020) [6] led to the successful integration of DRL for optimal trade execution, showing new potential routes for future work. The effectiveness of both value-based and policy gradient DRL algorithms has been demonstrated through the experiments conducted by Ning et al. (2021) [7] and Lin and Beling (2020) [8].
Nevertheless, previous studies that have applied DRL to optimal trade execution have not been successful in bridging the gap between their experimental results and practical utility, making their findings less relevant to real market scenarios.
In order to be applied in the real market, it is imperative for the model to perform well in diverse settings. These include coverage of a wide range of stocks, a flexible execution time horizon, and a dynamic target volume. Through a detailed setup that integrates DRL algorithms and a vast amount of granular market data, we have made effective advances toward the practical usage of DRL-based optimal trade execution algorithms in the real market.
Table 1 demonstrates how our study developed a generalized DRL model for optimal trade execution that can be utilized in diverse settings by comparing our model with the existing studies in the literature. We focused on three features: stock group size, execution time horizon, and dynamic target volume; the specifications in Table 1 denote the coverage of a single trained model. It should be noted that a dynamic target volume implies that there are no preset values for each scenario.
As shown in Table 1, all of the previous studies had either a fixed or short execution time horizon. Moreover, none of the previous studies employed a dynamic target volume. However, fixed or short execution time horizons and preset target volumes are impractical compared with actual execution specifications, because the execution time horizon and target volume vary considerably depending on the client or asset manager. Although Hendricks and Wilcox (2014) [2] had a stock group size of 166, their model only works for execution time horizons between 20 and 60 min with two fixed target volumes of 10,000 and 1,000,000 shares. Moreover, our study's execution time horizon of 165-380 min is the longest in the literature, and our study is the first to employ a dynamic target volume, which signifies the practicality of our model and its ability to be the first to achieve generalization across all three of the features mentioned above.
Additionally, simulating real markets with minimal discrepancy poses a significant challenge, as accurately capturing the market impact and incorporating feedback from all other market participants is nearly impossible (Kyle and Obizhaeva (2018) [11]). Although reasonable assumptions can be incorporated into the simulation logic, as in Lin and Beling (2020) [8], such assumptions are likely to lead to a discrepancy between the simulation environment and the real market. In this regard, we attempted to minimize such assumptions by utilizing level 3 data, one of the most granular forms of market data, and developed a simulation environment that is highly representative of the real market, as shown in Sections 3.2 and 3.3. Moreover, the availability of such data allows for the development of sophisticated Markov decision process (MDP) state and reward designs that lead to enhanced training of the DRL agent. The details of the state and reward designs can be found in Section 2.3.
Our main contributions in using DRL for optimal trade execution are as follows:
1. We set up an order execution simulation environment using level 3 market data to minimize the discrepancy between the environment and the real market during training, which cannot be achieved when using less granular market data.
2. We successfully trained a generalized DRL agent that can, on average, outperform the volume-weighted average price (VWAP) of the market for a stock group comprising 50 stocks over an execution time horizon ranging from 165 min to 380 min, which is the longest in the literature; this is accomplished while using a dynamic target volume.
3. We formulated a problem setting within the DRL framework in which the agent can choose to pass, submit a market/limit order, or modify an outstanding limit order, and can choose the volume of the order if the chosen action type is an order.
4. We successfully commercialized our algorithm in the South Korean stock market.

Related Work
Researchers from financial domains have shown great interest in the optimal trade execution problem, making it a prominent research topic in the field of algorithmic trading. Bertsimas and Lo (1998) [12] pioneered the problem, and their use of dynamic programming to find an explicit closed-form solution for minimizing trade execution costs was a remarkable achievement. Building on this research, Almgren and Chriss (2001) [3] introduced the influential Almgren-Chriss model, which concurrently took transaction costs, complex price impact functions, and risk aversion parameters into account. While closed-form solutions can be utilized for trading, they require strong assumptions about price movements. Consequently, more straightforward strategies such as the time-weighted average price (TWAP) and volume-weighted average price (VWAP) strategies still remain the default options. Nevertheless, due to their simplicity and inability to adapt, those strategies also struggle to outperform their respective benchmarks when applied in the real market. For this reason, given that reinforcement learning (RL) algorithms can dynamically adapt to observed market conditions while minimizing transaction costs, RL poses itself as an effective approach for tackling the optimal trade execution problem.
Nevmyvaka et al. (2006) [1] were the first to present an extensive empirical application of RL to the optimal trade execution problem using large-scale data. The authors proposed a model that outputs the relative price at which to reposition all of the remaining inventory. It can be thought of as consecutively modifying any outstanding limit orders with the goal of minimizing implementation shortfall (IS) within a short execution time frame. The authors used historical data from the electronic communication network INET for three stocks, which they split into training and test sets covering 12 months and 6 months, respectively. Their execution time horizon ranged from 2 min to 8 min with 5000 fixed shares to execute.
Hendricks and Wilcox (2014) [2] then extended the standard model of Almgren and Chriss (2001) [3] by utilizing elements from the market microstructure. Because the Almgren-Chriss framework mainly focuses on the arrival price benchmark, they optimized the number of market orders to submit based on the prevailing market conditions. The authors used 12 months of data from the Thomson Reuters Tick History (TRTH) database, comprising 166 stocks that make up the South African local benchmark index (ALSI). Their execution time horizon ranged from 20 min to 60 min with the target volume ranging from 100,000 to 1,000,000 shares.
While the two previous studies successfully employed Q-learning (Watkins and Dayan (1992) [13]), an RL algorithm, for the purposes of optimal trade execution, the complex dynamics of the financial market and high dimensionality remain challenges. With the recent advancements in the field of RL, deep reinforcement learning (DRL) in particular has gained popularity in the field of optimal trade execution, wherein the efficacy of neural networks is leveraged for function approximation. Both value-based and policy gradient methods, which are standard approaches in RL, are utilized.
Ning et al. (2021) [7] and Li et al. (2022) [10] employed extended versions of the deep Q-network (DQN) (Mnih et al. (2015) [14]; Van Hasselt et al. (2016) [15]) for optimal trade execution. By taking advantage of the ability of deep neural networks to learn from a large amount of high-dimensional data, they were able to address the challenges of high dimensionality faced by tabular Q-learning schemes. Li et al. (2022) [10] proposed a hierarchical reinforcement learning architecture with the goal of optimizing the VWAP strategy. Their model outputs a binary action: limit order or pass. The price of the limit order is the current best bid or best ask price in the limit order book (LOB), with the order size being fixed. The authors acknowledge the innate difficulty of correctly estimating the actual length of the queue using historical LOB data, which can potentially lead to a bigger gap between the simulation environment and the real market. To prevent the agent from being too optimistic about market liquidity, the authors introduced an empirical constant, which they set to three in their experiments, as the estimator of the actual queue length. They were also the first to employ a DRL-based optimal trade execution algorithm over a time horizon as long as 240 min.
Karpe et al. (2020) [16] set up an LOB simulation environment within an agent-based interactive discrete event simulation (ABIDES); they were also the first to present a formulation in which optimal execution (volume) and optimal placement (price) are combined in the action space. Using the double deep Q-learning algorithm, the authors trained a DRL agent and demonstrated its convergence to the TWAP strategy in some situations. They trained the model using 9 days of data from five stocks and conducted a test over the following 9 days.
In the context of policy gradient RL methods, Ye et al. (2020) [6] used the deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al. (2015) [17]), which has achieved great success in problem settings with continuous action spaces. Lin and Beling (2020) [8] utilized proximal policy optimization (PPO) (Schulman et al. (2017) [18]), a popular policy gradient approach that has demonstrated significant advantages in stability while remaining flexible among reinforcement learning algorithms. The authors employed level 2 market data of 14 US equities. They also designed their own sparse reward signal and demonstrated its advantage over the implementation shortfall (IS) and shaped reward structures. Furthermore, the authors used long short-term memory (LSTM) networks (Hochreiter and Schmidhuber (1997) [19]) with their PPO model to capture temporal patterns within the sequential LOB data. Their model optimizes the number of shares to trade based on the liquidity of the stock market with the aim of balancing liquidity risk and timing risk.
In this study, we experimented with both value-based and policy gradient-based approaches, mainly focusing on PPO applications. We employed a deliberately designed Markov decision process (MDP), which included a detailed reward structure and an elaborate simulation environment that used level 3 market data. Our findings in both the experimental settings and the real market demonstrate the effectiveness of our approach in terms of generalizability and practicability, illustrating significant advances over existing works in this field.

Optimal Trade Execution
The optimal trade execution problem seeks to minimize transaction costs while liquidating a fixed volume of a security over a defined execution time horizon. The process of trading is carried out through a limit order book mechanism wherein traders have the option of pre-determining the price and volume of shares that they desire to buy or sell. A limit order entails the specification of the volume and price, and it is executed when a matching order is found on the opposite side. In contrast, a market order simply requires the specification of the quantity to be traded, and the orders are matched with the best available offers on the opposite side. The two most commonly used trade execution strategies are the time-weighted average price (TWAP) and volume-weighted average price (VWAP) strategies (Berkowitz et al. (1988) [20]; Kakade et al. (2004) [21]). The TWAP strategy breaks a large order into sub-orders of equal size and executes each within equal time intervals. The VWAP strategy, on the other hand, estimates the average volume of trades during different time intervals using historical data and then divides orders based on these estimates. In this paper, we present a DRL-based optimal trade execution algorithm with the goal of outperforming the VWAP of the market. It should be noted that the VWAP of the market is simply a scalar price value representative of the average trading price over the specified period.
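As a concrete illustration of these two benchmarks, the following sketch computes the market VWAP from a list of trades and a TWAP-style schedule of equal-sized child orders; the trade prices, volumes, and slice count are hypothetical values of our own, not figures from the paper.

```python
def market_vwap(trades):
    """VWAP over an iterable of (price, volume) pairs."""
    total_value = sum(p * v for p, v in trades)
    total_volume = sum(v for _, v in trades)
    return total_value / total_volume

def twap_schedule(total_shares, n_intervals):
    """Split a parent order into near-equal child orders, one per interval."""
    base, rem = divmod(total_shares, n_intervals)
    return [base + (1 if i < rem else 0) for i in range(n_intervals)]

trades = [(100.0, 300), (100.5, 200), (99.5, 500)]
print(market_vwap(trades))       # 99.85
print(twap_schedule(1000, 6))    # [167, 167, 167, 167, 166, 166]
```

An execution beats the VWAP benchmark when its own volume-weighted fill price is above (for a seller) or below (for a buyer) this scalar.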

Deep RL Algorithms
In this section, we provide a brief explanation of the two primary deep reinforcement learning algorithms that were used in this study.

Deep Q-Network (DQN)
The deep Q-network (DQN) (Mnih et al. (2015) [14]) is a value-based reinforcement learning algorithm that uses deep neural networks to estimate the action-value function Q(s, a). DQN implements experience replay, which involves storing transitions (s, a, r, s′) in a replay buffer and sampling a minibatch of transitions from the buffer for training. Here, s denotes the state of the agent, a denotes the action taken by the agent in s, r denotes the reward received by taking action a in state s, and s′ denotes the next state of the agent after taking action a. To stabilize training and mitigate the potential overestimation of the Q-values, which can lead to the learning of suboptimal policies, a target network is used alongside the online network. Wang et al. (2016) [22] proposed a dueling architecture that uses two streams of neural networks to independently estimate the state value function and the advantage function and then combines them to obtain the Q-value. This allows the agent to better estimate the value of each action in each state and to differentiate between states that have similar values but different optimal actions. In this study, we utilized this variation of DQN for comparative purposes.
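The dueling aggregation described above can be summarized in a few lines. The following is a minimal sketch with toy stream outputs; subtracting the mean advantage is the identifiability convention proposed by Wang et al. (2016) [22].

```python
def dueling_q(state_value, advantages):
    """Combine the value stream and advantage stream into Q-values.

    Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'), so the value and
    advantage estimates are separately identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]

# Toy example: one state value, three action advantages.
print(dueling_q(2.0, [1.0, -1.0, 0.0]))  # [3.0, 1.0, 2.0]
```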

Proximal Policy Optimization (PPO)
Proximal policy optimization (PPO), introduced by Schulman et al. (2017) [18], is a policy gradient reinforcement learning algorithm that aims to improve upon traditional policy gradient methods (Sutton and Barto (2018) [23]) by enforcing a constraint on the policy update to prevent it from straying too far from the previous policy. This constraint is implemented using a clipping hyperparameter that limits the change from the old policy to the new policy. PPO utilizes the actor-critic architecture, where the actor network learns the policy and the critic network learns the value function. The value function estimates the expected cumulative return starting from a particular state s and following the current policy π(a|s) thereafter. PPO operates through an iterative process of collecting trajectories by executing the current policy in the environment. The advantage of each action is estimated by comparing the observed reward with the expected reward from the value function, and it is then used to compute the policy gradient. Although numerous approaches have since been developed, PPO remains one of the most popular algorithms due to its relative simplicity and stability. Furthermore, PPO can be utilized for problems with discrete, continuous, or even hybrid action spaces. In this paper, we propose two PPO-based models: one with discrete action spaces for both action type and action quantity, denoted as PPO-discrete; and another with a discrete action space for action type and a continuous action space for action quantity, denoted as PPO-hybrid.
We mainly focus on PPO-based models as the central algorithms and demonstrate their efficacy for the optimal trade execution problem.

Problem Formulation
We define the execution problem as a T-period problem over an execution time horizon H (min), where the time interval ∆T is 30 s and H ranges from 165 min to 380 min. Moreover, an execution job denotes a parent order that specifies the side (ask/bid), stock, number of shares (V), and execution time horizon (H). The goal of the agent is to sell (or buy) V shares over H by taking an action in each period T_n, where n ∈ {1, . . . , H/∆T}, such that the VWAP of the agent is larger (smaller) than that of the market. In our problem setting, we train buy and sell agents separately. The bid side denotes buying, while the ask side denotes selling; we utilize the terms bid side and ask side in the remaining sections of the paper. The DRL agent is trained to achieve the above-mentioned goal by interacting with the simulation environment, which is explained in detail in Section 3.2.

States
To successfully learn the underlying dynamics of market microstructure, it is crucial to consistently feed representative data to the agent. Thus, we divide the encoded state into two sub-states, the market state and the agent state, which are explained in detail below. Moreover, in order to incorporate the temporal signals of time series data, we divide the data into pre-defined time intervals, which are then encoded to capture the characteristics of the respective interval.

Market State:
We denote the price of the ith price level on the ask (or bid) side of the limit order book at time step t as askp^t_i (bidp^t_i) and the total volume at that price level as askv^t_i (bidv^t_i). We include the top five price levels on both sides to capture the most essential information of the order book.
• Pending Orders
-Pending feature: a list of the volumes of any pending orders at each of the top five price levels of the order book. For pending orders that do not have a matching price in the order book, we add the pending orders' quantities at the last index of the list so that all outstanding orders are still represented.

Actions
We focused on giving the agent as much freedom as possible. The agent chooses both the action type and the action quantity. The action type specifies the kind of action to take, while the action quantity specifies the volume of the order to be submitted. By learning the underlying dynamics of market microstructure, the agent can decide which action to take and with how much quantity, or even take no action at all.

• Action Type
The agent chooses one of the following action types:
-Market order: submit a market order;
-Limit order (1-5): submit a limit order at the price of the corresponding price level of the opposing side;
-Correction order (1-5): modify the price of a pending order to the price of the corresponding price level of the opposing side. The goal is to modify a pending order that has remained unexecuted for the longest period due to its unfavorable price. Thus, the pending order with the price that is least likely to be executed and the oldest submission time is selected for correction. The original price of the pending order is excluded to ensure that the modified price differs from the original price;
-Pass: skip the action at the current time step t.

• Action Quantity
The action quantity can be either discrete or continuous, as explained in Section 2.2.2. In both cases, the agent decides the quantity of the order as shown below.
-Discrete: action_Q is a scalar integer value that the agent chooses from the set {1, 2, 3, 4, 5}. The final quantity decided is Q_AGENT = Q_TWAP · action_Q, where Q_TWAP is the order quantity of the TWAP strategy.
-Continuous: action_Q is a scalar float value that the agent chooses between 0 and 1, which analogously scales Q_TWAP to obtain the clipped final quantity Q_AGENT. The clipping ensures a minimum order quantity of one and prevents buying/selling excess amounts.
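A minimal sketch of the quantity mapping described above. The discrete case follows Q_AGENT = Q_TWAP · action_Q directly; for the continuous case the exact scaling is not reproduced here, so the `scale` constant below is a hypothetical assumption of ours rather than the paper's actual formula.

```python
def discrete_quantity(q_twap, action_q):
    """action_q is an integer in {1, ..., 5}: a multiple of the TWAP slice."""
    return q_twap * action_q

def continuous_quantity(q_twap, action_q, remaining, scale=5.0):
    """action_q is a float in [0, 1]; `scale` is an assumed constant.

    The result is clipped to [1, remaining]: at least one share per order,
    and never more than the unexecuted portion of the target volume.
    """
    raw = round(q_twap * scale * action_q)
    return max(1, min(raw, remaining))

print(discrete_quantity(100, 3))            # 300
print(continuous_quantity(100, 0.0, 2000))  # clipped up to the minimum of 1
```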

Rewards
The reward is the feedback from the environment, which the agent uses to optimize its actions. Thus, it is paramount to design a reward that induces the desired behavior of the agent. As the volume-weighted average price (VWAP) is one of the most well-known benchmarks used to gauge the efficacy of execution algorithms, the goal of our algorithm is to outperform the VWAP benchmark. Consequently, we directly use the VWAP in calculating rewards. Although the VWAP can only be computed at the end of the execution time horizon in real time, it is known at the start of the trading horizon in our simulation. Accordingly, we can give the agent a reward signal based on its order placements compared with the market VWAP price of the trading horizon. At each time step t, we take the following three aspects into consideration when calculating the reward signal:
1. New orders submitted by the agent between time steps t − 1 and t, denoted as agent_order_{t−1,t};
2. Orders corrected by the agent between time steps t − 1 and t, denoted as agent_corr_{t−1,t};
3. Pending orders of the agent at time step t, denoted as agent_pending_t.
We then divide the reward signal into three sub-rewards: the order reward, the correction reward, and the maintain reward, each representing one of the aspects above, respectively. We also denote all remaining market execution transactions from time step t to the end of the trading horizon as market_exec_{t,end}. The order reward is calculated by going through agent_order_{t−1,t} and checking whether the price of each order is an "executable" price. More specifically, a price is "executable" if it is included in market_exec_{t,end}, or if it is lower (higher) than the lowest (highest) price in market_exec_{t,end} on the ask (bid) side. Then, for all orders with an "executable" price, the price of the order is compared with the market VWAP price of the trading horizon; a positive reward is given if the price is advantageous compared with the market VWAP, and a negative reward is given otherwise. A price is advantageous compared with the market VWAP if it is higher (lower) than the market VWAP on the ask (bid) side. The magnitude of the reward depends on how advantageous or disadvantageous the price is compared with the market VWAP. Furthermore, for all orders with a "not executable" price, a negative order reward is given.
The correction reward is calculated in two parts. In the first part, we search through agent_corr_{t−1,t} and check whether the price of each correction order is an "executable" price. For all correction orders with an "executable" price, the price of each correction order is compared with the market VWAP price of the trading horizon. The rewards are given in the same manner as for the order reward. In the second part, we search through agent_corr_{t−1,t} again and check whether the original price of each correction order is an "executable" price. For all correction orders with an "executable" original price, the original price of each correction order is compared with the market VWAP in the opposite manner. For example, if the original price of the correction order is advantageous compared with the market VWAP price, then a negative reward is given. The final correction reward is the sum of the rewards from the first and second parts.
The maintain reward is calculated by searching through agent_pending_t and checking whether the price of each pending order is an "executable" price. Then, for all pending orders with an "executable" price that is also advantageous compared with the market VWAP price, a positive maintain reward is given; a negative maintain reward is given otherwise.
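The three sub-rewards share the "executable" price check and the sign convention described above. The following sketch illustrates both for a single order; the function names are ours, and the reward magnitude is simplified to the raw gap from the market VWAP rather than the paper's exact scaling.

```python
def is_executable(price, market_exec_prices, side):
    """Executable if seen among remaining market executions, or priced
    beyond their range on the aggressive side (lower for ask, higher for bid)."""
    if price in market_exec_prices:
        return True
    if side == "ask":
        return price < min(market_exec_prices)
    return price > max(market_exec_prices)

def order_reward(price, market_vwap, market_exec_prices, side, penalty=-1.0):
    """Positive when the price beats the market VWAP, negative otherwise;
    a flat penalty for prices that are not executable at all."""
    if not is_executable(price, market_exec_prices, side):
        return penalty
    gap = price - market_vwap
    return gap if side == "ask" else -gap  # sign flips between sides

execs = {99.0, 100.0, 101.0}
print(order_reward(101.0, 100.0, execs, "ask"))  # 1.0 (executable, above VWAP)
print(order_reward(102.0, 100.0, execs, "ask"))  # -1.0 (not executable)
```

The correction reward reuses this logic twice (new price as-is, original price with the sign reversed), and the maintain reward applies it to each still-pending order.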

Model Architecture
In this study, we mainly utilize the PPO algorithm with a long short-term memory (LSTM) network (Hochreiter and Schmidhuber (1997) [19]), as shown in Figure 1. The architecture of the model is designed to incorporate the RL formulation described in Section 2.3. The input features from the simulation environment are provided to the encoders, in which the features are encoded into market states and agent states. The market states and agent states are then separately fed into their respective fully connected (FC) layers. Subsequently, the outputs from the market states' FC layers are concatenated and fed into an LSTM layer along with the hidden state from the previous step. Similarly, the outputs from the agent states' FC layers are concatenated and fed into an additional FC layer. The two outputs are then concatenated to form the final output, which is used as the input for both the actor and value networks. In the actor network, the final output is fed into two separate FC layers. In the PPO-discrete model, both outputs are subsequently fed into a softmax function, which is used to form a categorical distribution for both the action type and the action quantity. In the PPO-hybrid model, one output is similarly fed into a softmax function for the action type, and the other output is directly used to form a normal distribution for the continuous action quantity. In the value network, the final output is also fed into an FC layer, which outputs the value. The use of a multiprocessing technique for the actors reduces training time by executing processes in parallel. Table A2 provides information on the hyperparameters used for training the models.

Dataset
We utilized non-aggregated level 3 market data, also known as market-by-order (MBO) data, which is one of the most granular forms of market data (Pardo et al. (2022) [24]). As these are order-based data that contain information on individual queue positions, they allow for a more precise simulation compared with less granular market data. Aggregated LOB data consolidate all orders of the same price into a single price level, which does not allow for the precise tracking of the positions of synthetic orders in the queue relative to other orders at the same price level. The dataset (which is proprietary and therefore cannot be made publicly available) includes 50 stocks listed on the Korea Exchange (KRX) from 1 January 2018 to 1 January 2020.

Market Simulation
We built an order execution simulation environment utilizing historical level 3 data. With level 3 data, the orders that have been canceled or modified between each time step are known, as are exactly which orders have been executed. This allows us to track individual orders at each price level, as shown in Figure 2, which informs us of how many higher-priority orders are present ahead of the synthetic orders of the agent. In the limit order book mechanism, the priority of orders is determined based on price, followed by submission time. Without such information, it is necessary to make assumptions about the number of higher-priority orders, as in Li et al. (2022) [10]. As the location of the agent's synthetic order in the LOB is known, we are able to simulate the market more accurately.
At each time step t, the agent can place a market order or a limit order. If a market order is placed, it is executed immediately against the best available orders in the opposing book, successively consuming deeper levels of the opposing book for any remaining quantity. If a limit order is placed, it is first added to the pending orders of the agent. Then, we construct one heap for the market execution transactions between time steps t − 1 and t and another for the pending orders of the agent. Both heaps are sorted in descending order based on the priority of the orders. Subsequently, each pending order is compared with the market execution transactions for priority, and every pending order with a higher priority than any of the market execution transactions is executed. It should be noted that when the agent corrects a pending order in the simulation environment, the pending order is first canceled, and then a new order with the modified price is submitted.
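The priority-based fill step can be sketched as follows, under price-time priority. This is a simplified illustration of the idea rather than the paper's implementation: the order representation, the one-fill-per-execution liquidity accounting, and the tie-breaking are our own assumptions.

```python
def priority_key(order, side):
    """Price-time priority: ask side favors lower prices, bid side higher;
    earlier submission time breaks ties. Smaller key = higher priority."""
    price, ts = order["price"], order["time"]
    return (price if side == "ask" else -price, ts)

def fill_pending(pending, market_execs, side):
    """Fill pending agent orders that outrank at least one market execution."""
    exec_keys = sorted(priority_key(o, side) for o in market_execs)
    filled = []
    for order in sorted(pending, key=lambda o: priority_key(o, side)):
        # Compare against the lowest-priority remaining market execution.
        if exec_keys and priority_key(order, side) < exec_keys[-1]:
            filled.append(order)
            exec_keys.pop()  # that execution's liquidity is consumed
    return filled

pending = [{"price": 99.5, "time": 1}, {"price": 101.0, "time": 0}]
execs = [{"price": 100.0, "time": 5}]
# The 99.5 ask outranks the 100.0 market execution; the 101.0 ask does not.
print(fill_pending(pending, execs, "ask"))
```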

Validity of Simulation Environment
Even with the most granular market data, it is impossible to completely eliminate market impact from the simulation environment, as accurately simulating the behaviors of all market participants in response to the synthetic orders of the agent is infeasible. One of the most accurate ways to evaluate the magnitude of the discrepancy between the simulation environment and the real market is to execute the exact same jobs on the exact same day in both settings separately. One caveat of this approach is that it can only be done ex post. However, since our algorithm was successfully commercialized and used in the real market from 1 July 2020 to 29 December 2021, it is possible to replay those jobs in the simulation environment using historical data. Figure 4 demonstrates the strong positive correlation between the performance of the agent for 4332 jobs in the real market and in the simulation environment. Each data point represents the performance of the agent for the same job on the same day in the real market and in the simulation environment. The performance summary of the real market executions can also be found in Table A3.

Training Setup
The settings of the environment during training are crucial for the successful learning of the agent. The carefully designed training settings are as follows: in each episode, a random date is sampled from the date range, and a stock from the universe is selected through weighted random sampling, wherein the weights are determined by the trading volume of each stock for the day. Furthermore, a random execution time horizon and target volume are sampled from their respective ranges. With the sampled execution time horizon, the trading start and end times between 9:00 and 15:20 are also randomly determined. With these specifications, an execution job is created for each episode. The agent is then trained until convergence with the goal of beating the market volume-weighted average price (VWAP) over the same execution time horizon. At the end of the execution time horizon, any remaining shares are submitted as a market order to fulfill the target volume.
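The per-episode job sampling described above can be sketched as follows; the target-volume range and the toy two-stock universe are hypothetical placeholders of ours, not the paper's actual values.

```python
import random

def sample_job(dates, universe, day_volumes, rng=random):
    """Create one random execution job for a training episode."""
    date = rng.choice(dates)
    # Weighted random sampling: heavily traded stocks are drawn more often.
    stock = rng.choices(universe, weights=[day_volumes[s] for s in universe], k=1)[0]
    horizon = rng.randint(165, 380)              # minutes, per the paper's range
    target_volume = rng.randint(1_000, 50_000)   # hypothetical range
    # The trading day 9:00-15:20 spans 380 minutes; pick a feasible start.
    start_offset = rng.randint(0, 380 - horizon)
    return {"date": date, "stock": stock, "horizon_min": horizon,
            "target_volume": target_volume, "start_offset_min": start_offset}

job = sample_job(["2018-03-02"], ["A", "B"], {"A": 10_000, "B": 90_000})
print(job["stock"] in {"A", "B"})  # True
```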

Training Results
In this section, the plots of the training process are presented. In plots (a) and (c) ((b) and (d)) of Figure 5, each data point on the y-axis corresponds to the average performance (win rate) over 200 episodes against the market VWAP. The average win rate represents the ratio of episodes in which the agent outperformed the market VWAP. The x-axis represents the optimization steps during training. In every plot, PPO-discrete is displayed in red, and PPO-hybrid is displayed in blue. As training progresses, the average performance converges above the basis point value of 0 while the average win rate converges above 50% for both the ask and bid sides, with PPO-discrete converging at a slightly higher point than PPO-hybrid in every case. Convergence of the models above the basis point value of 0 and the 50% win rate is substantial, as it demonstrates that the models have, on average, learned to outperform the benchmark for 200 execution jobs generated with a random stock sampled from a 50-stock universe, a random execution time horizon between 165 and 380 min, and a random target volume.

Validation
The settings of the environment during validation are the same as those used during training, with one exception: the date range is set to 2 January 2020-30 June 2020 in order to evaluate the efficacy of the trained agent on out-of-sample data. For each trading day, 50 jobs are created for the agent to execute. Both PPO and DQN are used and compared in terms of performance and stability. Furthermore, a TWAP strategy with a fixed limit order price is included as the baseline. The summarized performance of each model can be found in Table 2. The total field represents the mean performance of the model across all execution jobs, while the ask and bid fields represent the mean performance across all ask-side and bid-side jobs, respectively. Note that all performance values are measured against the market VWAP. As shown in the table, PPO-discrete performed best, although by a modest margin over PPO-hybrid, which is in line with the training results. For both PPO models, it is notable that only two models (ask and bid) were trained, yet they outperformed the market VWAP on 12,300 execution jobs randomly generated across stocks, execution time horizons, and target volumes. They also achieved significantly higher performance than the DQN model and the baseline TWAP strategy.
Figures 6 and 7 present the daily performance of the PPO-discrete model over the validation period. Each data point in the bar graphs represents the mean performance of the day across 50 execution jobs, and the line plots represent the cumulative performance of the model over the validation period. The y-axis on the left gives the scale of the bar graphs, and the y-axis on the right gives the scale of the line plots. As shown in the figures, for both the ask and bid sides, the final cumulative performance is considerably higher than 0 basis points.
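The TWAP baseline mentioned above splits the target volume evenly over time. A minimal sketch follows; the 5-minute child-order interval, the function name, and the remainder-handling rule are assumptions (the excerpt only states that TWAP with a fixed limit order price is used).

```python
def twap_schedule(target_vol, trade_start_min, trade_end_min, interval_min=5):
    """Sketch of a TWAP baseline schedule (interval length is an assumption).

    Times are minutes from the session open. Returns a list of
    (submit_time, child_volume) pairs whose volumes sum to target_vol.
    """
    n_slices = max(1, (trade_end_min - trade_start_min) // interval_min)
    base, rem = divmod(target_vol, n_slices)
    schedule = []
    for i in range(n_slices):
        vol = base + (1 if i < rem else 0)   # spread the remainder over early slices
        schedule.append((trade_start_min + i * interval_min, vol))
    return schedule
```

Each child order would then be submitted at a fixed limit price; as with the RL agent, any shares still unfilled at `trade_end` would go out as a market order.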

Conclusions
In this study, we addressed two unresolved issues in applying deep reinforcement learning (DRL) to the optimal trade execution problem. First, we trained a generalized optimal trade execution algorithm by utilizing a widely used reinforcement learning (RL) algorithm, proximal policy optimization (PPO), combined with a long short-term memory (LSTM) network. Moreover, we set up a problem setting in which the DRL agent has the freedom to choose from all possible actions. We presented training results demonstrating the successful training of a generalized agent that can handle execution orders for 50 stocks, flexible execution time horizons ranging from 165 to 380 min, and dynamic target volumes. Additionally, the validation results for the 6-month period showed that the trained agent outperformed the market volume-weighted average price (VWAP) benchmark by 3.282 basis points on average over 12,300 execution jobs randomly generated across stocks, execution time horizons, and target volumes. This result is substantial, as it demonstrates the efficacy of the generalized model presented in this study. The generalized model is highly practical, as it reduces the time spent on training and validation compared with maintaining a separate model for each stock. Furthermore, Table 1 signifies the practicality of our model and demonstrates that our study is the first to achieve generalization across all three features: stock group size, execution time horizon, and dynamic target volume.
Secondly, we attempted to minimize the discrepancy between the simulation environment and the real market. We built an order execution simulation environment utilizing level 3 market data, which contains information on each order, thus innately allowing for

Figure 1. The overall architecture of the PPO model for optimal trade execution.

Figure 2. Depth chart description of the level 3 market simulation. Utilizing the data in Figure 3 enables the tracking of individual orders.

Figure 3. Sample date (28 May 2019) description of execution data (left) and order book data (right) for the level 3 market simulation environment.

Figure 5. PPO training plots. (a) Ask side performance, (b) ask side win rate, (c) bid side performance, (d) bid side win rate.
We denote the volume of the $k$-th execution transaction from any side between $t-1$ and $t$ as $execv^{k}_{t-1,t}$ and its price as $execp^{k}_{t-1,t}$, given that there were $n > 0$ execution transactions between $t-1$ and $t$. Among those $n$ execution transactions, we also denote the total volume of the execution transactions from the ask (or bid) side as $execv^{ask}_{t-1,t}$ (or $execv^{bid}_{t-1,t}$). All price features are normalized using the trading day's first execution price, while all volume features are normalized using $target_{vol}$. We denote the total volume of all of the agent's executed orders up until time step $t$ as $exec^{agent}_{t}$, and the total volume of all of the agent's pending orders at time step $t$ as $pending^{agent}_{t}$. The target volume of the job is denoted by $target_{vol}$, and the start and end times of the job are denoted by $trade_{start}$ and $trade_{end}$, respectively.
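The normalization scheme above is straightforward to sketch. The function names are assumptions; the two normalization rules (prices divided by the day's first execution price, volumes divided by $target_{vol}$) come directly from the text, while the remaining-volume ratio combining $exec^{agent}_{t}$ and $pending^{agent}_{t}$ is an illustrative feature not stated explicitly.

```python
def normalize_features(prices, volumes, first_exec_price, target_vol):
    """Normalize state features: price features by the trading day's first
    execution price, volume features by the job's target volume."""
    norm_prices = [p / first_exec_price for p in prices]
    norm_volumes = [v / target_vol for v in volumes]
    return norm_prices, norm_volumes

def remaining_ratio(exec_agent_t, pending_agent_t, target_vol):
    """Fraction of the target volume covered by neither executed nor pending
    orders at time step t (this combination is an assumption, shown only to
    illustrate how the denoted quantities might enter the state)."""
    return (target_vol - exec_agent_t - pending_agent_t) / target_vol
```

Dividing by per-job quantities keeps all features on a comparable scale across the 50-stock universe and across jobs with very different target volumes, which is what allows a single network to generalize.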

Table A3. Real-market performance summary.