Article

Smart Tangency Portfolio: Deep Reinforcement Learning for Dynamic Rebalancing and Risk–Return Trade-Off

Department of Systems Engineering and Operations Research, George Mason University, Fairfax, VA 22030, USA
*
Author to whom correspondence should be addressed.
Int. J. Financial Stud. 2025, 13(4), 227; https://doi.org/10.3390/ijfs13040227
Submission received: 1 October 2025 / Revised: 6 November 2025 / Accepted: 20 November 2025 / Published: 2 December 2025
(This article belongs to the Special Issue Financial Markets: Risk Forecasting, Dynamic Models and Data Analysis)

Abstract

This paper proposes a dynamic portfolio allocation framework that integrates deep reinforcement learning (DRL) with classical portfolio optimization to enhance rebalancing strategies and risk–return management. Within a unified reinforcement-learning environment for portfolio reallocation, we train actor–critic agents (Proximal Policy Optimization (PPO) and Advantage Actor–Critic (A2C)). These agents learn to select both the risk-aversion level—positioning the portfolio along the efficient frontier defined by expected return and a chosen risk measure (variance, Semivariance, or CVaR)—and the rebalancing horizon. An ensemble procedure, which selects the most effective agent–utility combination based on the Sharpe ratio, provides additional robustness. Unlike approaches that directly estimate portfolio weights, our framework retains the optimization structure while delegating the choice of risk level and rebalancing interval to the AI agent, thereby improving stability and incorporating a market-timing component. Empirical analysis on daily data for 12 U.S. sector ETFs (2003–2023) and 28 Dow Jones Industrial Average components (2005–2023) demonstrates that DRL-guided strategies consistently outperform static tangency portfolios and market benchmarks in annualized return, volatility, and Sharpe ratio. These findings underscore the potential of DRL-driven rebalancing for adaptive portfolio management.

1. Introduction

Efficient and smart rebalancing strategies play a pivotal role in portfolio management. This process involves periodically adjusting portfolio asset weights to maintain a desired risk–return allocation. Effective rebalancing helps manage risk, optimize returns, and keep the portfolio aligned with an investor’s financial goals and risk tolerance. Harry Markowitz’s seminal paper “Portfolio Selection” (H. Markowitz, 1952) laid the foundation for Modern Portfolio Theory (MPT). In this work, Markowitz proposed the mean–variance optimization framework, which constructs an optimal portfolio by finding the composition with maximum return for a given level of risk. The combinations of risk levels and corresponding maximum returns form the efficient frontier.
Investors utilize this approach to select optimal portfolios by first determining their risk preferences. They then match this risk profile with a corresponding portfolio on the efficient frontier to maximize their expected return. A drawback of this approach is the assumption that an investor’s profile remains static across different market regimes and volatilities. Consequently, the trade-off between risk and return remains constant until the investor decides to update it. At the same time, the Sharpe ratio, which quantifies the excess portfolio return over the risk-free rate per unit of risk, can be used to identify a single optimal portfolio—the tangency portfolio. However, this maximum-Sharpe-ratio portfolio has several drawbacks, including the following: (i) its reliance on static historical return data, which assumes that past relationships between assets will persist; this assumption is often invalidated by changing market regimes, rendering a historically optimized portfolio less effective; and (ii) its dependence on distributional assumptions that ignore higher-order moments (such as skewness and kurtosis), focusing only on the first two moments under a normality assumption.
Moreover, static portfolios are prone to misalignment with evolving market conditions, underscoring the critical need for dynamic rebalancing. Unlike a static approach, which risks suboptimal performance and increased exposure to drift, dynamic rebalancing enables continuous adjustment of the portfolio in response to market fluctuations. This ensures the portfolio remains consistently aligned with the investor’s long-term objectives, enhancing both responsiveness and resilience in volatile environments.
Recently, machine learning has evolved remarkably, transitioning from classical neural networks to sophisticated deep learning. Traditional algorithms such as decision trees and support vector machines (Breiman et al., 2017) laid the groundwork for predictive models in stock selection and return estimation. Subsequent advances in ensemble methods, such as Random Forests, Gradient Boosting Machines, and XGBoost (Friedman, 2001; Chen & Guestrin, 2016), further enhanced prediction accuracy by capturing complex, nonlinear patterns and interactions among predictors, making them highly applicable to financial price forecasting. Furthermore, neural networks, from classical architectures (Bishop, n.d.) to modern variants like Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2017) and Recurrent Neural Networks (RNNs) (Elman, 1990; Hochreiter & Schmidhuber, 1997), have gained significant traction for their ability to model complex nonlinear patterns across diverse data types, including spatial, image, and time-series data. These capabilities make them particularly suited for financial applications, such as generating trading signals and predicting asset returns. The integration of deep learning with reinforcement learning (Mnih et al., 2015), known as Deep Reinforcement Learning (DRL), has further expanded these possibilities by enabling agents to learn optimal decision-making strategies through environmental interactions. This approach supports increasingly automated and adaptive solutions for complex sequential decision tasks, with impactful applications in domains such as autonomous driving, game playing, and automated financial trading.
This paper introduces a novel framework that integrates DRL with traditional mean–risk optimization to enhance portfolio management. Our approach dynamically identifies the optimal risk–return trade-off and determines effective rebalancing strategies and frequencies. We employ actor–critic algorithms—especially Proximal Policy Optimization (PPO) (Liang et al., 2018; Schulman et al., 2017) and Advantage Actor–Critic (A2C) (Mnih et al., 2016; Zhang et al., 2019)—to train agents that explore the state space using various technical indicators. These agents learn to determine optimal actions for calibrating risk aversion and balancing risk against return, while also discovering the ideal rebalancing frequency within a given portfolio holding period. We evaluate and compare agent performance based on both overall and risk-adjusted returns to form an ensemble strategy. Additionally, we explore various utility functions to represent distinct risk metrics within each environment. Results are tested using sector ETFs and the constituents of the Dow Jones index in a rolling out-of-sample back-test.
The structure of this paper is as follows: Section 2 reviews the relevant literature. Section 3 presents our proposed portfolio rebalancing system and the DRL-based trading environment, detailing the actor–critic algorithms and our ensemble strategy. Section 4 describes the back-testing setup, including data description, preparation of training, validation, and testing datasets, and the incorporation of technical indicators along with other time series observations into the state space. This section also presents and discusses the results and statistical performance of the rolling out-of-sample tests. Finally, the paper concludes with key findings and implications in Section 5.

2. Literature Review

2.1. Portfolio Optimization and Dynamic Rebalancing

Ghahtarani et al. (2022) provided a comprehensive review of robust portfolio selection problems, which address the uncertainties inherent in financial markets. The classical starting point is the mean–variance framework of (H. Markowitz, 1952), which considers n risky assets with expected returns collected in a vector r and risk measured as the portfolio variance v implied by a covariance matrix. Critiques of this classical framework include the following: Hurley and Brimberg (2015) investigated the sensitivity of portfolio optimization outcomes to small changes in input parameters, finding that traditional models, like the Markowitz mean–variance framework, often yield highly variable and unreliable solutions. Harvey et al. (2010) critiqued the use of standard deviation as a risk measure in portfolio optimization, noting that it penalizes both upside and downside deviations, which misaligns with investors’ focus on downside risk. They advocated using semi-deviation and higher-order moments like skewness and kurtosis to capture asymmetry and tail risks, providing a more comprehensive and effective risk assessment. Lobo et al. (2007) highlighted the gap between theoretical portfolio optimization models and practical applications. They recommended incorporating transaction costs into the optimization process and devising models that strike a balance between the frequency of rebalancing and the associated costs to improve practical usability.
Fliege and Werner (2014) further extended this line of work by investigating robust multi-objective optimization (MOO) for mean–variance portfolio selection using ε-constraint and weighted-sum methods, finding that the efficient frontier changes in the robust case. They proposed resampling methods to enhance robustness. Fakhar et al. (2018) applied MOO to nonsmooth functions, demonstrating strong duality in convex cases and introducing the concept of a saddle point. Ceria and Stubbs (2006) proposed a robust portfolio selection problem (PSP) to address the sensitivity of the mean–variance approach to changes in expected returns. They introduced the true, estimated, and actual efficient frontiers and minimized the maximum distance between the estimated and actual frontiers, which leads to robust optimization. Garlappi et al. (2007) enhanced this approach by assuming normally distributed asset returns and accounting for investor uncertainty-aversion. Dai and Wang (2019) and Lee et al. (2020) proposed including regularization in the mean–variance framework to control asset weights and improve model stability.
Smith and Desormeau (2006) investigated the optimal portfolio rebalancing frequency for a set of stock–bond portfolios, comparing deviation-triggered rebalancing strategies (activated when weights drift from the target allocation) with fixed time-based approaches. Their optimal rebalancing policy depended on the Fed’s monetary policy regime. Longer rebalancing horizons often outperform shorter ones, a phenomenon potentially explained by positive short-term autocorrelation and negative long-term autocorrelation in asset returns. Several papers have also examined the rebalancing frequency in log-optimal portfolios, including (Kuhn & Luenberger, 2010), which maximized the portfolio’s overall wealth and set up the problem in a Black–Scholes economy. In the absence of transaction costs, continuous rebalancing only slightly outperforms discrete rebalancing when the rebalancing intervals are shorter than approximately one year. The theoretical foundation for frequency-based strategies is advanced by (Hsieh, 2021), who derived necessary and sufficient conditions for a frequency-based Kelly portfolio and proposed a corresponding trading algorithm. Expanding on this, Hsieh and Wong (2023) incorporated both rebalancing frequency and transaction costs into the log-optimal portfolio problem, formulating it as a concave optimization problem to derive solutions. In a different approach, Ekren et al. (2017) analyzed a simple frequency-based rebalancing strategy within a multi-dimensional diffusion framework, aiming to maximize portfolio return under specific risk constraints and small transaction costs. Building on the dynamic asset allocation framework of (Chang et al., 2017), which integrated Kelly’s growth criterion with mean–variance optimization to enhance rebalancing under changing market environments, this study extends that principle by allowing a DRL agent to learn and adjust the portfolio’s risk aversion and rebalancing horizon adaptively.

2.2. Deep Reinforcement Learning

Deep reinforcement learning is rooted in reinforcement learning (RL), in which agents learn optimal actions by interacting with an environment to maximize cumulative reward. Deep RL started with Mnih et al.’s (2013) Deep Q-Network (DQN), which integrated Q-learning with deep neural networks, enabling agents to learn directly from high-dimensional inputs like raw pixels in Atari games. Building upon this, Silver et al. (2014) introduced the Deterministic Policy Gradient (DPG), providing a more stable and efficient learning approach for continuous action spaces. This was further refined by Lillicrap et al.’s (2019) Deep Deterministic Policy Gradient (DDPG), which combined the actor–critic method with deterministic policy gradients. In parallel, the Asynchronous Advantage Actor–Critic (A3C) algorithm proposed by Mnih et al. (2016) improved learning efficiency through asynchronous updates across multiple workers. The Advantage Actor–Critic (A2C) algorithm retains the same actor–critic architecture but performs synchronous gradient updates, aggregating gradients from parallel environments before each policy update. This synchronization yields a more stable and reproducible implementation of A3C.
Schulman et al.’s (2015) Trust Region Policy Optimization (TRPO) enhanced policy optimization stability by maintaining a trust region. Schulman et al. (2017) later developed a popular algorithm, i.e., Proximal Policy Optimization (PPO), which uses a surrogate objective function that penalizes changes to the policy that are too large. It balances sample complexity and simplicity. Silver et al.’s (2017) AlphaZero showcased RL’s potential by achieving superhuman performance in strategic games without prior human knowledge. Fujimoto et al.’s (2018) Twin Delayed DDPG (TD3) addressed the overestimation bias in DDPG, improving stability, and Haarnoja et al.’s (2018) Soft Actor–Critic (SAC) promoted exploration with entropy regularization, enhancing learning efficiency.

2.3. Deep Reinforcement Learning for Stock Trading and Portfolio Management

Deep reinforcement learning (DRL) has transformed stock trading and portfolio management by learning within a large state space to incorporate general market dynamics for optimal allocation. This paper’s ensemble strategy is inspired by (Yang et al., 2020), who integrated Proximal Policy Optimization (PPO), Advantage Actor–Critic (A2C), and Deep Deterministic Policy Gradient (DDPG) to generate dynamic portfolio allocations. Their ensemble approach maximized investment returns by selecting among the algorithms based on the maximum Sharpe ratio, outperforming both individual agent strategies and relevant market benchmarks.
Other applications of DRL in stock trading include (Deng et al., 2017), which focused on financial signal representation and trading. They developed a sophisticated trading system that applies feature selection using fuzzy deep learning and generates trading decisions through direct reinforcement learning. J. Wang et al. (2019) proposed AlphaStock, a novel investment strategy that combines DRL with attention networks to implement a buying-winners-and-selling-losers approach with some interpretability.
In options trading, Du et al. (2020) applied DRL based on DQN, DQN with Pop-Art, and PPO algorithms to replicate and hedge options. They developed a hedging strategy by learning optimal trades through interactions with a simulated market environment. Bühler et al. (2018) utilized modern DRL methods to optimize hedging strategies, considering market frictions such as transaction costs, market impact, liquidity constraints, and risk limits. This innovative framework diverges from the traditional model-based approach by incorporating nonlinear convex risk measures into the reward function, combined with a set of trading constraints based on reinforcement learning techniques.
In the portfolio management area, Liang et al. (2018) enhanced portfolio management strategies by incorporating adversarial training environments that introduce random noise into price data, based on the PPO, DDPG, and Policy Gradient algorithms, which improved the robustness and efficiency of the DRL models. Ye et al. (2020) introduced augmented asset movement prediction to enhance state representation. By augmenting the state with diverse predictive signals, from neural-network-based price forecasts to financial sentiment derived from news data, their RL model improves the accuracy of asset price and movement predictions. Related work by Yu and Chang (2020) applied neural-network predictive modeling to portfolio management, showing that machine learning can improve allocation outcomes within a mean–risk optimization framework. This study advances that work by introducing a reinforcement learning approach that learns dynamic control parameters rather than predictive signals. Z. Wang et al. (2021) proposed DeepTrader, a DRL model for portfolio management that dynamically balances risk and return by embedding macro market conditions as indicators. This model proposes an asset scoring unit for ranking stocks and a market scoring unit that predicts market trends via the proportion between long and short funds. As an extension to the Environmental, Social, and Governance (ESG) theme, Acero et al. (2024) explored the application of DRL in responsible portfolio optimization. The study designed ESG-score-adjusted differential Sharpe and Sortino ratios as the reward function and trained the DRL agent accordingly. The results, compared to the traditional mean–variance optimization approach, indicate that DRL effectively balances financial returns and ESG responsibilities with promising risk-adjusted returns.
While this study focuses on on-policy actor–critic algorithms (PPO and A2C), other DRL frameworks such as Deep Deterministic Policy Gradient (DDPG) and Soft Actor–Critic (SAC) have also been applied to financial decision-making. PPO and A2C, as on-policy stochastic approaches, update directly within the current policy distribution and are generally more stable and interpretable—attributes particularly suitable for the discrete decision variables used here. In contrast, DDPG and SAC are off-policy continuous-control algorithms that can achieve faster convergence and higher sample efficiency but are often more sensitive to hyperparameter choices, market noise, and non-stationary rewards (Fujimoto et al., 2018; Haarnoja et al., 2018; Liang et al., 2018). Incorporating DDPG or SAC into our multi-utility mean–risk framework would require a continuous-action reformulation and extensive retraining. Future research should therefore undertake a systematic comparison between on-policy (PPO/A2C) and off-policy (DDPG/SAC) paradigms under the same portfolio objectives to assess trade-offs in statistical stability and risk-adjusted performance.

2.4. Synthesis and Research Gap

Classical mean–variance and robust portfolio models show that optimal allocations are highly sensitive to estimation errors, and static weights often underperform in changing markets. Semivariance and CVaR address investor preferences for downside protection and tail events better than variance alone, but selecting how much risk to take (risk aversion) and when to update that choice in real time remains unresolved. Rebalancing timing interacts with transaction costs and return autocorrelations. As shown by (Smith & Desormeau, 2006), optimal rebalancing intervals vary across market and policy regimes, making the timing dependent on economic conditions rather than a fixed rule. Hsieh and Wong (2023) and Kuhn and Luenberger (2010) further demonstrate that frequency should adjust dynamically to transaction costs and market conditions, highlighting the need for data-driven rebalancing strategies.
Most DRL applications learn portfolio weights directly through end-to-end mapping from states to allocations, sometimes using ensemble strategy, and often optimize for a single metric such as Sharpe or PnL. While powerful, these end-to-end policies can be difficult to constrain, are less interpretable, and require more training data to ensure agent stability.
To overcome these limitations, our framework does not require the DRL agent to predict portfolio weights directly. Instead, the agent learns two interpretable and economically meaningful parameters which serve as critical inputs for determining optimal allocations through a traditional optimizer: (1) a risk-aversion index that selects a point on the efficient frontier defined by a mean–risk framework; and (2) a rebalancing horizon that determines when the next optimization is triggered. This design leverages the stability and constraints of optimization, incorporates downside-risk frontiers, and endogenizes rebalancing timing—thereby operationalizing insights from both robust portfolio choice and frequency-selection research within a single DRL framework. This extends our earlier work (Yu & Chang, 2020) on NN-driven signals by endogenizing the portfolio’s risk–return trade-off and timing decisions through DRL while preserving optimizer structure. To our knowledge, no prior DRL studies in finance have jointly learned the risk–return trade-off parameter and the rebalancing schedule while delegating weight computation to a classical optimizer across multiple risk measures.

3. Problem Setup and Proposed Methodology

3.1. Portfolio Optimization Framework

In this section, we begin by reviewing the traditional Mean–Variance portfolio optimization framework introduced by (H. Markowitz, 1952), and then explore extensions that incorporate alternative risk metrics beyond variance. Since most investors are risk-averse and prioritize downside risk over upside gain, H. M. Markowitz (1959) proposed using Semivariance to measure downside risk. This concept is further enhanced by the introduction of value-at-risk (VaR) from (J. P. Morgan, 1996) and conditional VaR (CVaR) or expected shortfall from (Rockafellar & Uryasev, 2000, 2002).
Given a universe of $N$ investment assets, let $\mathbf{y}_t = (y_{1t}, y_{2t}, \ldots, y_{Nt})^T$ represent the vector of realized returns at time $t$. The expected return vector is denoted by $\boldsymbol{\mu} = E[\mathbf{y}_t]$, estimated from historical data. Let $\mathbf{w} = (w_1, w_2, \ldots, w_N)^T$ be the portfolio weight vector; ruling out short selling and leverage requires $\sum_{i=1}^{N} w_i = 1$ and $w_i \geq 0$. The target expected portfolio return is $\mu_0$, and $\Sigma$ denotes the sample covariance matrix of asset returns.
The mean–variance optimization seeks the weight vector that minimizes portfolio risk for a given expected return level. It can be written as
$$\min_{\mathbf{w}} \; \mathbf{w}^T \Sigma \mathbf{w} \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1, \quad w_i \geq 0, \quad \mathbf{w}^T \boldsymbol{\mu} \geq \mu_0$$
Alternatively, this problem can be expressed in an unconstrained penalized form
$$\max_{\mathbf{w}} \; \mathbf{w}^T \boldsymbol{\mu} - \lambda\, \mathbf{w}^T \Sigma \mathbf{w}$$
where $\lambda > 0$ represents the investor’s risk-aversion coefficient, determining the position along the mean–variance efficient frontier.
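To make this formulation concrete, the following is a minimal sketch of the constrained problem above using the cvxpy modeling library; the function name and solver choice are our illustration, not the implementation used in this paper.

```python
import cvxpy as cp
import numpy as np

def mean_variance_weights(mu: np.ndarray, Sigma: np.ndarray, mu0: float) -> np.ndarray:
    """Solve the constrained mean-variance problem for a target return mu0."""
    n = len(mu)
    w = cp.Variable(n)
    objective = cp.Minimize(cp.quad_form(w, Sigma))  # portfolio variance w' Sigma w
    constraints = [cp.sum(w) == 1,                   # fully invested
                   w >= 0,                           # no short selling or leverage
                   mu @ w >= mu0]                    # target expected return
    cp.Problem(objective, constraints).solve()
    return w.value
```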
In the mean–variance framework, risk is evaluated symmetrically around the mean, penalizing both gains and losses equally. To address this limitation, alternative downside-risk measures have been proposed. Semivariance focuses exclusively on returns below a benchmark, aligning more closely with the preferences of risk-averse investors who are primarily concerned with losses rather than gains. Following (H. M. Markowitz, 1959), the Semivariance $SV_<$ for asset $i$ with respect to a benchmark $B$ is defined as
$$SV_< = E\left[\min(y_{it} - B, 0)^2\right] = \frac{1}{T} \sum_{t=1}^{T} \min(y_{it} - B, 0)^2,$$
where $y_{it}$ is the realized return of asset $i$ at time $t$, and $B$ is typically chosen as the mean or a target return. Semivariance thus penalizes only the negative deviations from the benchmark. Further, Estrada (2002, 2007) defines the semi-covariance between assets $i$ and $j$ with respect to a benchmark $B$ as
$$SCOV_< = E\left[\min(y_{it} - B, 0)\, \min(y_{jt} - B, 0)\right] = \frac{1}{T} \sum_{t=1}^{T} \min(y_{it} - B, 0)\, \min(y_{jt} - B, 0).$$
This definition can be tailored to any desired benchmark $B$ and generates a symmetric and exogenous semi-covariance matrix $SCOV_<$. The corresponding mean–semivariance optimization problem is
$$\min_{\mathbf{w}} \; \mathbf{w}^T\, SCOV_<\, \mathbf{w} \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1, \quad w_i \geq 0, \quad \mathbf{w}^T \boldsymbol{\mu} \geq \mu_0$$
where $\boldsymbol{\mu} = E[\mathbf{y}_t]$ is the expected return vector. Analogously, an investor’s preference for downside risk can be expressed through
$$\max_{\mathbf{w}} \; \mathbf{w}^T \boldsymbol{\mu} - \lambda\, \mathbf{w}^T\, SCOV_<\, \mathbf{w}$$
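As a brief illustration of the semi-covariance matrix above, the following sketch computes Estrada’s exogenous semicovariance from a $T \times N$ return matrix; the function name and the default benchmark $B = 0$ are our own choices.

```python
import numpy as np

def semicovariance(R: np.ndarray, B: float = 0.0) -> np.ndarray:
    """Exogenous semicovariance: SCOV_ij = mean over t of
    min(y_it - B, 0) * min(y_jt - B, 0)."""
    D = np.minimum(R - B, 0.0)   # keep only downside deviations from benchmark B
    return D.T @ D / R.shape[0]
```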
Furthermore, to capture extreme losses in the tails of the return distribution, Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR, or expected shortfall) have been introduced by (J. P. Morgan, 1996) and (Rockafellar & Uryasev, 2000, 2002).
VaR represents the worst-case loss $l$ over a target horizon at a given confidence level $\beta$. Formally,
$$VaR_\beta(\mathbf{w}) = \min\left\{ l \in \mathbb{R} : \Psi(\mathbf{w}, l) \geq \beta \right\}, \qquad CVaR_\beta(\mathbf{w}) = \frac{1}{1-\beta} \int_{f(\mathbf{w}, \mathbf{y}) \geq VaR_\beta(\mathbf{w})} f(\mathbf{w}, \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}$$
where $\mathbf{y}$ represents returns with density $p(\mathbf{y})$ and $\beta$ is the confidence level.
Here we define the loss function as $f(\mathbf{w}, \mathbf{y}) = -\mathbf{w}^T \mathbf{y}$, so the probability that the loss does not exceed a level $l$ can be expressed as $\Psi(\mathbf{w}, l) = \int_{f(\mathbf{w}, \mathbf{y}) \leq l} p(\mathbf{y})\, d\mathbf{y}$. Thus, $VaR_\beta(\mathbf{w})$ is the VaR and $CVaR_\beta(\mathbf{w})$ is the expected loss of the portfolio at the $\beta$ confidence level; it is clear that $CVaR_\beta(\mathbf{w}) \geq VaR_\beta(\mathbf{w})$. Rockafellar and Uryasev (2000, 2002) showed that CVaR can be obtained directly through the following convex optimization problem:
$$CVaR_\beta(\mathbf{w}) = \min_{l \in \mathbb{R}} F_\beta(\mathbf{w}, l), \qquad F_\beta(\mathbf{w}, l) = l + \frac{1}{1-\beta}\, E\left[\left(f(\mathbf{w}, \mathbf{y}) - l\right)^+\right]$$
where $(f(\mathbf{w}, \mathbf{y}) - l)^+ = \max(f(\mathbf{w}, \mathbf{y}) - l, 0)$, and $F_\beta(\mathbf{w}, l)$ is convex and continuously differentiable in $l$. Furthermore, the expectation in $F_\beta(\mathbf{w}, l)$ can be approximated by discretizing the density $p(\mathbf{y})$ into a sample of $q$ scenarios,
$$F_\beta(\mathbf{w}, l) = l + \frac{1}{q(1-\beta)} \sum_{k=1}^{q} \left(f(\mathbf{w}, \mathbf{y}_k) - l\right)^+$$
Minimizing CVaR is thus equivalent to minimizing $F_\beta(\mathbf{w}, l)$ with respect to both $\mathbf{w}$ and $l$. Introducing auxiliary non-negative variables $S_k \geq (f(\mathbf{w}, \mathbf{y}_k) - l)^+$ leads to the linear programming form:
$$\min_{\mathbf{w}, l, S_k} \; l + \frac{1}{q(1-\beta)} \sum_{k=1}^{q} S_k \quad \text{s.t.} \quad S_k \geq f(\mathbf{w}, \mathbf{y}_k) - l = -\mathbf{w}^T \mathbf{y}_k - l, \quad S_k \geq 0, \quad \sum_{i=1}^{N} w_i = 1, \quad \mathbf{w}^T \boldsymbol{\mu} \geq \mu_0$$
This transforms the problem into a linear program that can be solved efficiently and does not depend on any distributional assumption for the return series $\mathbf{y}_k$. Similarly, when using CVaR, the parameter $\lambda$ reflects the trade-off between expected return and tail-risk minimization, defining a point on the mean–CVaR efficient frontier.
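The linear program above maps directly onto a solver call. The following cvxpy sketch is illustrative, assuming a $q \times N$ scenario matrix Y of historical returns and the loss convention $f(\mathbf{w}, \mathbf{y}) = -\mathbf{w}^T \mathbf{y}$ used above.

```python
import cvxpy as cp
import numpy as np

def mean_cvar_weights(Y: np.ndarray, mu: np.ndarray, mu0: float, beta: float = 0.95):
    """Solve the Rockafellar-Uryasev CVaR linear program in its discretized form."""
    q, n = Y.shape
    w = cp.Variable(n)
    l = cp.Variable()                    # VaR level at the optimum
    S = cp.Variable(q, nonneg=True)      # auxiliary excess-loss variables
    objective = cp.Minimize(l + cp.sum(S) / (q * (1 - beta)))
    constraints = [S >= -Y @ w - l,      # S_k >= f(w, y_k) - l with f = -w'y
                   cp.sum(w) == 1, w >= 0,
                   mu @ w >= mu0]
    cp.Problem(objective, constraints).solve()
    return w.value, l.value
```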

3.2. Reinforcement Learning

As a foundation of reinforcement learning, we review the Markov Decision Process (MDP), which is used to model sequential stochastic decision-making with either random or controlled outcomes. An MDP includes key elements such as an agent, which is trained to make optimal decisions within an environment with which it interacts. The state $S$ represents the current situation of the agent, while the actions $A$ govern the choices available to the agent at each time step. Each action taken by the agent generates a quantifiable reward $R$, which decays over time under a discount factor $\gamma$. The transition probability $P$ determines the likelihood of the agent moving from one state to another. The optimal policy $\pi$ defines the best action to take in a given state to maximize the cumulative reward.
A Markov decision process (MDP) is defined by the tuple $(S, A, P, R, \gamma)$ and obeys the Markov property: the future state depends only on the current state and action, not on the historical sequence of states and actions, i.e., $P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \ldots, s_t, a_t)$. The cumulative values of actions given the current state are specified by either the State-Value function $v_\pi(s)$ or the Action-Value function $q_\pi(s, a)$, based on the Bellman equation:
State-Value function:
$$v_\pi(s) = E_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right] = \sum_{a \in A(s)} \pi(a \mid s) \sum_{s' \in S} P(s' \mid s, a)\left[R(s, a, s') + \gamma\, v_\pi(s')\right]$$
Action-Value function:
$$q_\pi(s, a) = E_\pi\left[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right] = \sum_{s' \in S} P(s' \mid s, a)\left[R(s, a, s') + \gamma \sum_{a' \in A(s')} \pi(a' \mid s')\, q_\pi(s', a')\right]$$
where the Bellman equation decomposes the value of the current state into the immediate reward plus the value of the successor state discounted by $\gamma$.
There are several methods to solve this problem:
Value iteration: this involves iteratively updating the state-value function until convergence is reached, after which the greedy policy is extracted as
$$\pi^*(s) = \operatorname*{argmax}_{a \in A(s)} \sum_{s' \in S} P(s' \mid s, a)\left[R(s, a, s') + \gamma\, v_\pi(s')\right]$$
Policy iteration: this involves finding an optimal policy by maximizing the Action-Value function, i.e., picking the action that yields the highest Action-Value $q_\pi(s, a)$, which can be expressed as
$$\pi^*(s, a) = \begin{cases} 1, & \text{if } a = \operatorname*{argmax}_{a \in A(s)} q_{\pi^*}(s, a) \\ 0, & \text{otherwise} \end{cases}$$
Q-learning: the algorithm maintains a table of expected utility values for state–action pairs. Through iterative updates of these pairs, Q-learning performs a greedy policy search that learns the value of the optimal policy independently of the agent’s actions and converges to the optimal policy with the highest expected reward. The Q-value is defined and updated as follows:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[R(s, a, s') + \gamma\, \max_{a'} Q(s', a') - Q(s, a)\right]$$
SARSA (State–Action–Reward–State–Action): the SARSA algorithm makes decisions based on the rewards received from previous actions. The process begins by initializing the Q-values $Q(s, a)$ to arbitrary values. The initial state $s$ is set, and the initial action $a$ is selected randomly, with non-greedy actions having some probability of being chosen to balance exploitation and exploration. This is on-policy learning: Q-values are learned under the same epsilon-greedy policy used both as the behavior policy (to decide actions for a given state) and as the target policy (for desired actions):
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[R(s, a, s') + \gamma\, Q(s', a') - Q(s, a)\right]$$
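A toy sketch contrasting the two tabular updates above (the function names are ours): Q-learning bootstraps on the greedy next action, making it off-policy, while SARSA bootstraps on the action actually taken under the behavior policy.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # off-policy: target uses the greedy action max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # on-policy: target uses the action a' actually selected in s'
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
```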
Actor–Critic methods: this family of algorithms combines value-based and policy-based iterations to solve the reinforcement learning problem within the MDP framework. The policy-based approach learns the ideal state-to-action mapping by directly optimizing the parameters of the policy. The policy $\pi_\theta(a \mid s)$ is typically represented by a parameterized function with parameter $\theta$ and directly outputs the probability distribution over actions for any state. The objective is to maximize the expected reward $J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t R(s_t, a_t)\right]$, and the optimal policy is found via gradient ascent using $\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, q_\pi(s, a)\right] = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(s_t, a_t)\right]$. At the same time, the value-based approach focuses on finding optimal value functions, which can be used for better action updates to achieve higher reward utilities.
Thus, the Actor–Critic method, leveraging both policy-based and value-based iterations, has two networks: the actor and the critic. The actor network is based on a policy-gradient approach and determines which action should be taken, whereas the critic network estimates the state-value function $v_\pi(s)$ or action-value function $q_\pi(s, a)$ to evaluate the actions taken and provide corresponding adjustments.
In the learning procedure, the advantage function can be used as the critic instead of the action-value function, as in the Advantage Actor–Critic (A2C) algorithm (Mnih et al., 2016). The advantage function $A(s, a)$ measures the incremental benefit of taking an action over the average value of the current state:
$$A(s, a) = q(s, a) - v(s) = R + \gamma\, v(s') - v(s)$$
The actor then uses the advantage function $A(s, a)$ to update the policy parameter $\theta$:
$$\theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s, a)$$
Then we update the weights $w$ of the critic network to minimize the mean squared error:
$$w \leftarrow w - \beta\, \nabla_w \left(R + \gamma\, v(s') - v(s)\right)^2$$
Another state-of-the-art learning algorithm based on the actor–critic framework is Proximal Policy Optimization (PPO) (Schulman et al., 2017). PPO is an on-policy algorithm that learns from actions taken under the current policy rather than from separate data. It builds on Trust Region Policy Optimization (TRPO, Schulman et al., 2015), which introduces a constraint to ensure the new policy does not deviate too far from the old one. PPO inherits this concept by applying a clipping function that prevents large policy updates and stabilizes learning. As in the A2C algorithm, PPO uses the advantage function for efficient learning, focusing on good actions that could lead to good outcomes:
$$A(s, a) = q(s, a) - v(s) = R + \gamma\, v(s') - v(s)$$
To calculate the loss for the actor network, a clip function is used to bound the change in policy updates:
$$L^{CLIP}(\theta) = E_t\left[\min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)}\, A(s, a),\; \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\right) A(s, a)\right)\right]$$
where $\mathrm{clip}(\cdot)$ truncates the policy ratio $\pi_\theta(a_t \mid s_t) / \pi_{old}(a_t \mid s_t)$ within the range $[1 - \epsilon, 1 + \epsilon]$. $L^{CLIP}(\theta)$ can then be further simplified to
$$L^{CLIP}(\theta) = \begin{cases} \min\left(\dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)},\, 1 + \epsilon\right) A(s, a), & \text{if } A(s, a) \geq 0 \\[6pt] \max\left(\dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)},\, 1 - \epsilon\right) A(s, a), & \text{if } A(s, a) < 0 \end{cases}$$
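For illustration, the clipped surrogate objective can be computed in a few lines; this PyTorch sketch is our own and omits the value-function and entropy terms that full PPO implementations add to the total loss.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Negative clipped surrogate L^CLIP (negated because optimizers minimize)."""
    ratio = torch.exp(log_probs - old_log_probs)            # pi_theta / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))
```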
The pseudo-code in Algorithm 1, adapted from OpenAI’s documentation, outlines the procedure for updating and finding optimal parameters of the actor and critic networks under the PPO algorithm1:
Algorithm 1: PPO-Clip
(1) Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
(2) for $k = 0, 1, 2, \ldots$ do
(3)  Collect a set of trajectories $D_k = \{\tau_i\}$ by running policy $\pi_k = \pi_{\theta_k}$ in the environment.
(4)  Compute rewards-to-go $\hat{R}_t$.
(5)  Compute advantage estimates $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.
(6)  Update the policy by maximizing the PPO-Clip objective:
$$\theta_{k+1} = \operatorname*{argmax}_{\theta} \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}\, \hat{A}_t,\; \mathrm{clip}\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)},\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_t\right) = \operatorname*{argmax}_{\theta} \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \min\left(\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}\, \hat{A}_t,\; g(\epsilon, \hat{A}_t)\right)$$
  where $g(\epsilon, A) = \begin{cases} (1 + \epsilon) A, & \text{if } A \geq 0 \\ (1 - \epsilon) A, & \text{if } A < 0 \end{cases}$
(7)  Fit the value function by regression on mean-squared error:
$$\phi_{k+1} = \operatorname*{argmin}_{\phi} \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^{T} \left(V_\phi(s_t) - \hat{R}_t\right)^2,$$
  typically via some gradient descent algorithm.
(8) end for

3.3. Dynamic Portfolio Optimization Using Deep Reinforcement Learning

Dynamic portfolio optimization aims to continuously adjust portfolio weights to achieve higher returns or lower risk based on different market regimes in a timely manner. In this section, we propose a DRL framework for dynamically adjusting portfolio allocations over time. Compared to traditional Mean-Variance portfolio optimization, this DRL framework offers a more adaptive trade-off between risk and return under varying market conditions, allowing for more timely adjustments rather than adhering to a fixed rebalancing schedule.
We utilize a stock trading and portfolio reallocation environment similar to those developed by Liu et al. (2020) and Yang et al. (2020) to simulate investment asset trading and portfolio management activities. This environment is based on OpenAI Gym, introduced by (Brockman et al., 2016). In this paper, we adapt these environments to accommodate portfolio optimization using multi-utility and different learning algorithms.

3.3.1. Action Space

The objective here is to provide a dynamic portfolio reallocation system that balances the risk–return trade-off across different market regimes and re-optimizes the portfolio at adaptively chosen intervals rather than on a fixed schedule. The portfolio comprises a variety of investments, including individual stocks listed on exchanges and general sector ETFs that encompass broader investment themes. Portfolios are optimized based on the three utilities discussed in Section 3.1:
Mean–Variance:
$$\min_{\mathbf{w} \in W} \; \mathbf{w}^T \Sigma \mathbf{w} \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1, \quad \mathbf{w}^T \boldsymbol{\mu} \geq \mu_0$$
Mean–Semivariance:
$$\min_{\mathbf{w} \in W} \; \mathbf{w}^T\, SCOV_<\, \mathbf{w} \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1, \quad \mathbf{w}^T \boldsymbol{\mu} \geq \mu_0$$
Mean–CVaR:
$$\min_{\mathbf{w} \in W,\, l,\, S_k} \; F_\beta(\mathbf{w}, l) = l + \frac{1}{q(1-\beta)} \sum_{k=1}^{q} S_k \quad \text{s.t.} \quad S_k \geq -\mathbf{w}^T \mathbf{y}_k - l, \quad S_k \geq 0, \quad \sum_{i=1}^{N} w_i = 1, \quad \mathbf{w}^T \boldsymbol{\mu} \geq \mu_0$$
These optimizations can be generalized into a risk–return trade-off, with return taken as the historical mean of the asset returns and risk defined as the variance, Semivariance, or CVaR from the utilities above. The risk-aversion parameter $\lambda$ measures how much risk is accepted in pursuit of return. We can trace the efficient frontier for each utility function across different target-return levels, indexed by $\lambda$, between the minimum and maximum mean return among all investments:
$$\min_{\mathbf{w} \in W} \; \text{Risk} \quad \text{s.t.} \quad \mathbf{w}^T \boldsymbol{\mu} \geq \mu_\lambda, \quad \sum_{i=1}^{N} w_i = 1, \quad w_i \in [0, 1]$$
The other key factor to consider in the optimization is the number of days until the next re-optimization. Once the optimal allocation is derived based on the framework above, the portfolio weights will drift until the next re-optimization day. Here we define the number of days until the next re-optimization as $D_i \in [5, N]$.
For each of the three utility functions, the first parameter in the action space is defined as $\lambda \in \mathbb{Z}$ with $1 \leq \lambda \leq 100$, which identifies the $\lambda$-th percentile of target return between the minimum and maximum across all investments. The second parameter is $D_i \in [5, N]$, which denotes the number of trading days until the next rebalancing date, with $N$ set to 60, allowing the portfolio to be rebalanced every 5 to 60 days. The holding period is measured in trading days, excluding weekends and NYSE holidays. These action-space parameters are discrete multivariate variables, and we train agents to fine-tune them for optimal portfolios.
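As one plausible encoding of this two-dimensional discrete action space (the paper’s exact implementation is not shown here), a Gym-style definition could look as follows, with simple offsets mapping raw indices to the documented ranges.

```python
from gym import spaces

# (lambda index, horizon index): 100 risk-aversion levels, 56 horizon choices
action_space = spaces.MultiDiscrete([100, 56])

def decode_action(action):
    risk_aversion = int(action[0]) + 1   # lambda in {1, ..., 100}
    holding_days = int(action[1]) + 5    # h in {5, ..., 60} trading days
    return risk_aversion, holding_days
```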

3.3.2. Observation Space

We define a 3-dimensional observation space for the securities in each training or testing dataset, represented as $\text{Box}(f, n, t)$, where $n$ is the number of securities in the investment universe, $f$ is the number of features for each security, and $t$ is the time window for the time-series data. The features used in this project include technical indicators as well as price and return data for each security. Table 1 summarizes the technical indicators and price-based features that form the observation space for the DRL agents. These indicators, computed from historical adjusted-close prices and volumes, capture market trends, momentum, volatility, and volume dynamics. Each is standardized using a rolling z-score transformation within the lookback window to ensure scale consistency. Detailed definitions, parameter settings, and calculation formulas for all indicators are provided in Appendix B.
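A hedged sketch of how such an observation space and the rolling z-score standardization might be declared; the feature count follows Section 4.3, while the window length t and helper name are illustrative.

```python
import numpy as np
from gym import spaces

f, n, t = 34, 12, 60   # features, securities, lookback window (t chosen for illustration)
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(f, n, t), dtype=np.float32)

def rolling_zscore(x: np.ndarray, window: int) -> np.ndarray:
    """Standardize each feature within its lookback window for scale consistency."""
    tail = x[-window:]
    return (tail - tail.mean(axis=0)) / (tail.std(axis=0) + 1e-8)
```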

3.3.3. Trading Agent and Environment

In this paper, we design a trading system to simulate portfolio rebalancing over time. This simulation serves as the reinforcement-learning (RL) environment in which the trading agent interacts with market data and portfolio states. Portfolios are created based on the optimization frameworks defined in Section 3.3.1 (Formula (22)), which determine portfolio weights through return–risk trade-offs and are rebalanced periodically. The risk-aversion parameter $\lambda$ and rebalancing frequency $h$ (or holding period) are the two variables the DRL agent learns to optimize over time. Given a learned risk-aversion level for each utility function, the environment solves for the corresponding optimal weights and invests accordingly.
To train the risk-aversion parameter, we initialize the trading environment with an investment amount $D$ and an initial allocation $w_i^0$ for each asset $i$. The initial holding of asset $i$ is therefore $D \cdot w_i^0$. For any given risk-aversion index $\lambda_0$, the optimization framework is solved to obtain the optimal weights $w_i^*$. Consequently, the system executes trades to reach the new allocation $D \cdot w_i^*$. The number of shares traded for each asset is $(D \cdot w_i^* - D \cdot w_i^0) / p_i$, where $p_i$ is the price of asset $i$.
After rebalancing, the portfolio follows a buy-and-hold strategy for the holding period $h_1$. During this period, the market value of asset $i$ evolves with its realized return $R_{i|h_1}$, and the portfolio value at the end of the holding period becomes $PV = \sum_{i=1}^{N} D \cdot w_i^* \left(1 + R_{i|h_1}\right)$, where $N$ is the number of assets. The portfolio return over the first holding period is
$$R_{port \mid \lambda_0} = \ln\frac{PV}{PV_0} = \ln \sum_{i=1}^{N} w_i^{*(1)} \left(1 + R_{i|h_1}\right)$$
where $w_i^{*(1)}$ denotes the optimal portfolio weights obtained in the first period.
At the end of the holding period, the environment provides an updated market state, and the agent selects a new risk-aversion index $\lambda_1$. Solving the optimization again yields updated optimal weights $w_i^{*(2)}$ for the next period $h_2$. The portfolio is then rebalanced to these weights and held for $h_2$ days. The cumulative portfolio return over two consecutive holding periods is
$$R_{port \mid \lambda_0, \lambda_1} = \ln \sum_{i=1}^{N} w_i^{*(1)} \left(1 + R_{i|h_1}\right) + \ln \sum_{i=1}^{N} w_i^{*(2)} \left(1 + R_{i|h_2}\right)$$
This rebalancing–holding process repeats until the end of the evaluation horizon $T$. In the final period, the agent chooses the risk-aversion index $\lambda_T$ and obtains the corresponding optimal weights $w_i^{*(T)}$ for the holding period. With realized returns $R_{i|h_T}$ for each instrument, the final cumulative portfolio return is
$$R_{port \mid \lambda_0, \lambda_1, \ldots, \lambda_T} = \ln\frac{PV_T}{D} = \sum_{t=1}^{T} \ln \sum_{i=1}^{N} w_i^{*(t)} \left(1 + R_{i|h_t}\right)$$
Within this framework, the DRL agent observes the current market and portfolio state, selects the action $A_t = (\lambda_t, h_t)$, and receives a reward reflecting the realized portfolio performance over the corresponding holding period. The reward is defined as the logarithmic portfolio return $R_t = \ln\left(PV_t / PV_{t - h_t}\right)$ achieved during that period, where $PV_t$ and $PV_{t - h_t}$ are the portfolio values at the end and beginning of the holding period. This reward corresponds directly to the portfolio growth achieved over each holding interval and serves as the optimization signal for the DRL agent.
Through repeated interaction, the agent learns a policy that dynamically adjusts both the risk aversion and the rebalancing horizon to maximize the terminal portfolio value over time. The trading system thus forms a closed-loop learning environment linking the DRL agent with the classical portfolio optimization engine.
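The rebalance–hold–reward loop described above can be condensed into the following illustrative sketch, in which solve_weights and pick_action are hypothetical stand-ins for the mean–risk optimizer of Section 3.1 and the agent’s policy.

```python
import numpy as np

def simulate_episode(returns, solve_weights, pick_action, pv0=1000.0):
    """returns: (T x N) daily asset returns; solve_weights(lam) -> optimal weights;
    pick_action(t) -> (lambda, h) chosen by the agent at decision time t."""
    t, pv, rewards = 0, pv0, []
    while t < len(returns):
        lam, h = pick_action(t)
        w = solve_weights(lam)
        window = returns[t:t + h]
        growth = w @ np.prod(1.0 + window, axis=0)  # sum_i w_i (1 + R_i|h)
        rewards.append(np.log(growth))              # log portfolio return over h days
        pv *= growth
        t += h
    return pv, rewards
```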

3.4. Ensemble Model

In addition to the six base strategies trained from combinations of learning algorithms and portfolio optimization objectives, we introduce an ensemble model that integrates information across these strategies within the same methodological framework. The ensemble serves as an additional modeling approach that combines multiple DRL agents trained under different mean–risk formulations to capture a broader range of allocation behaviors.
After the PPO and A2C agents are trained for each of the three optimization objectives—Mean–Variance, Mean–Semivariance, and Mean–CVaR—each model is evaluated on a separate validation dataset that is distinct from the training and testing sets.
For each trained trading agent $m$, we obtain a distribution of Sharpe ratios from repeated validation runs, denoted by
$$S_m = \left\{ S_m^{(1)}, S_m^{(2)}, \ldots, S_m^{(K)} \right\}$$
To measure the model’s robustness, we compute the percentile Sharpe ratio following (Yang et al., 2020):
$$S_{p,m} = \mathrm{Quantile}_p\left(S_m\right)$$
where $S_{p,m}$ is the $p$-th percentile of the Sharpe-ratio distribution, and $p = 50$ corresponds to the median, representing the model’s typical risk-adjusted performance across different validation realizations.
Based on this criterion, the ensemble model is defined by selecting the configuration that achieves the highest median Sharpe ratio on the validation data:
$$m^* = \operatorname*{argmax}_{m} \; S_{50,m}$$
The resulting ensemble model synthesizes methodologies across DRL–utility combinations by selecting the most consistently performing policy during validation. This process is completed prior to testing to maintain a clear separation between model development and evaluation.
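A compact sketch of this selection rule, assuming a hypothetical validation_run callable that produces one validation return series per random seed; the Sharpe ratio is computed as mean return over volatility with a zero risk-free rate, as in Section 4.3.

```python
import numpy as np

def select_ensemble(agents, validation_run, K=100):
    """agents: dict of name -> trained policy; returns the (name, agent) pair
    with the highest median validation Sharpe ratio."""
    def median_sharpe(item):
        _, agent = item
        sharpes = [r.mean() / r.std()
                   for r in (validation_run(agent, seed=k) for k in range(K))]
        return np.median(sharpes)
    return max(agents.items(), key=median_sharpe)
```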
The chart in Figure 1 illustrates the entire DRL portfolio optimization system with agents, environment, observations, action space, and reward function:

4. Empirical Experiments

In this section, we apply the deep reinforcement learning (DRL) portfolio optimization framework described above to train trading agents and evaluate their out-of-sample performance. The experiments test the agents’ ability to learn adaptive rebalancing strategies, which jointly determine the risk-aversion level $\lambda_t$ and the rebalancing horizon $h_t$ under different market conditions.
Two DRL algorithms—Proximal Policy Optimization (PPO) and Advantage Actor–Critic (A2C)—are implemented for three portfolio optimization objectives: Mean–Variance, Mean–Semivariance, and Mean–CVaR. The trained trading agents are then compared across two investment universes to examine how different learning algorithms and risk metrics perform within distinct market settings.

4.1. Data Collection

To find the best trade-off between risk and return and the optimal rebalancing frequency, we train DRL trading agents using PPO and A2C and compare results across different utility functions and investment universes. We use 12 sector ETFs from iShares (Table 2) and 28 stocks from the Dow Jones Industrial Average (DJIA) Index (Table 3) as of February 2024 to form investment universes for portfolio reallocation.2
Historical daily adjusted-close prices were obtained from Yahoo! Finance. The dataset for the sector ETF universe spans January 2003 to January 2023, while that for the DJIA universe covers January 2005 to January 2023. All returns are computed as log differences of adjusted-close prices, which account for dividends and stock splits. Each portfolio is initialized with equal weights and a notional investment of $1000 at the start of the evaluation period. The data are divided into rolling training, validation, and out-of-sample testing periods, allowing each agent to learn, validate, and test policies over time. This rolling-window structure allows repeated evaluations of the trading agents and offers a consistent framework for assessing the adaptability of the DRL approach over time.
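A minimal sketch of this data-preparation step using the yfinance package; the tickers shown are an illustrative subset of the sector-ETF universe in Table 2.

```python
import numpy as np
import yfinance as yf

tickers = ["IYW", "IYF", "IYH"]   # illustrative subset; see Table 2 for the full list
prices = yf.download(tickers, start="2003-01-01", end="2023-01-31",
                     auto_adjust=True)["Close"]   # adjusted for dividends and splits
log_returns = np.log(prices / prices.shift(1)).dropna()
```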
The tables below list the ticker symbols, security names, and corresponding industries for each security in both investment universes. Historical price series for these universes are presented in Appendix A.

4.2. Benchmark Setup

To compare the optimal portfolios generated by the DRL trading agents, we establish several benchmarks for each investment universe—the Dow Jones Industrial Average (DJIA) components and the sector ETF universe.
Since securities in both universes are constituents of the Dow Jones and S&P 500 indices, we use the exchange-traded funds tracking these indices as our first set of benchmarks: the SPDR Dow Jones Industrial Average ETF Trust (DIA) and the SPDR S&P 500 ETF Trust (SPY). These serve as passive market references representing broad market exposure for their corresponding universes.
In addition, to align with the three optimization objectives used in training the DRL agents, we define a second set of benchmarks based on tangency portfolios derived from each utility function. Formally,
$$\max_{\mathbf{w},\, \lambda_{tangency}} \; \frac{\text{Return}}{\text{Risk}} \quad \text{s.t.} \quad \sum_{i=1}^{N} w_i = 1, \quad w_i \in [0, 1]$$
where “Risk” is defined by the variance, Semivariance, or CVaR depending on the utility specification, and $\lambda_{tangency}$ denotes the corresponding risk-aversion coefficient that maximizes the return-to-risk ratio.
To maintain consistency across all models and datasets, the tangency portfolios are estimated using a 252-day historical window (approximately one trading year) and re-optimized every 30 trading days, which corresponds roughly to a monthly-to-bimonthly rebalancing frequency commonly adopted by practitioners.
Therefore, based on each of the utility functions, we defined the following benchmark portfolios:
  • Benchmark ETFs: the out-of-sample performance of DIA and SPY during the testing period;
  • Mean–Variance Tangency Portfolio: portfolio maximizing return/variance using the latest 252-day return history and re-optimized every 30 trading days;
  • Mean–CVaR Tangency Portfolio: portfolio maximizing return/CVaR (at the 95% confidence level) based on the latest 252-day return history and re-optimized every 30 trading days;
  • Mean–Semivariance Tangency Portfolio: portfolio maximizing return/Semivariance using the latest 252-day return history and re-optimized every 30 trading days.
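One simple way to construct these tangency benchmarks is to scan target-return levels along the frontier and keep the weights with the highest return-to-risk ratio; the grid search and helper names below are our illustration, not necessarily the solver used in the paper.

```python
import numpy as np

def tangency_portfolio(mu, frontier_solver, risk_fn, n_grid=100):
    """frontier_solver(mu0) -> optimal weights for target return mu0 (e.g., the
    problems of Section 3.1); risk_fn(w) -> variance, Semivariance, or CVaR."""
    best_w, best_ratio = None, -np.inf
    for mu0 in np.linspace(mu.min(), mu.max(), n_grid):
        w = frontier_solver(mu0)
        ratio = float(w @ mu) / risk_fn(w)   # return-to-risk ratio with rf = 0
        if ratio > best_ratio:
            best_w, best_ratio = w, ratio
    return best_w
```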

4.3. Trading Agents Training and Rolling Out-of-Sample Tests

For each investment universe defined above, we implement training, validation, and out-of-sample testing procedures to evaluate the performance of the DRL-based trading methodologies. The experiments follow a rolling-window design, where each cycle consists of two years of training data, two years of validation data, and two years of testing data. This rolling structure allows repeated assessments of the agents’ adaptability while maintaining clear temporal separation between phases.3
During the training phase, each DRL trading agent interacts with the portfolio environment using two years of daily market data and the 34 observation features introduced in Section 3.2. At every decision step, the agent produces two actions from its action space:
(1) the risk-aversion index $\lambda_t$, which specifies the desired position on the efficient frontier for the selected optimization objective; and
(2) the rebalancing horizon $h_t$, indicating the number of trading days until the next portfolio optimization.
The risk-aversion index takes integer values from 1 to 100, where smaller values represent more conservative allocations and larger values represent more aggressive risk–return profiles. For each value of $\lambda_t$ predicted by the agent, six sets of optimal portfolio weights are generated based on the three optimization objectives—Mean–Variance, Mean–Semivariance, and Mean–CVaR (at the 95% confidence level)—under the two DRL algorithms, A2C and PPO.
Each portfolio is held for $h_t$ trading days, ranging from 5 to 60, after which the environment updates market information and the agent generates a new pair of actions. This process continues through the end of the training window, forming one training episode. Each agent is trained for 10,000 episodes to ensure policy convergence and stability.
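For reference, agents of this kind can be trained with the Stable-Baselines3 implementations of PPO and A2C; the environment class, data object, and timestep budget below are hypothetical placeholders rather than the settings used in this paper.

```python
from stable_baselines3 import A2C, PPO

# PortfolioRebalanceEnv is a hypothetical Gym environment implementing the
# action/observation spaces and log-return reward of Section 3.3.
train_env = PortfolioRebalanceEnv(train_data, objective="mean_variance")
ppo_agent = PPO("MlpPolicy", train_env, verbose=0).learn(total_timesteps=500_000)
a2c_agent = A2C("MlpPolicy", train_env, verbose=0).learn(total_timesteps=500_000)
```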
During the validation phase, the trained trading agents are applied to the subsequent two-year dataset to evaluate robustness. The neural-network parameters of each agent are fixed after training; the agents operate in inference mode, using their learned policy functions to predict the two control parameters $(\lambda_t, h_t)$ over the validation horizon. Given these predicted parameters, the environment solves the corresponding portfolio-optimization problem to obtain optimal weights. These weights are then applied to historical market data to compute the realized portfolio returns that would have been achieved by implementing the strategy. This process is repeated 100 times under different random seeds or starting conditions, producing 100 distinct portfolio return series for each trading agent. For each series, a Sharpe ratio (mean return divided by volatility, assuming a risk-free rate of zero) is calculated, yielding the set $S_m = \{S_m^{(1)}, S_m^{(2)}, \ldots, S_m^{(100)}\}$. Following the procedure defined in Section 3.4, we use the 50th-percentile Sharpe ratio $S_{50,m}$ as the evaluation metric, representing the model’s typical risk-adjusted performance across the validation runs.
The trading agent achieving the highest median Sharpe ratio is designated as the ensemble model for that rolling window. In this study, the ensemble model corresponds to the single trading agent that demonstrates the most consistent validation performance across all algorithm–objective combinations, as defined in Section 3.4.
In the out-of-sample testing phase, the six trained trading agents and the selected ensemble model are deployed on the next two-year testing dataset to simulate portfolio rebalancing under unseen market conditions. Each agent operates in inference mode, producing a sequence of actions $(\lambda_t, h_t)$ that determine portfolio updates over time. The corresponding optimal weights are implemented according to each agent’s predicted horizon, and daily portfolio values are recorded. These simulated portfolio paths provide the basis for evaluating the out-of-sample performance of the individual DRL strategies as well as the ensemble model.
Figure 2 summarizes the overall process, illustrating the flow of training, validation (including percentile-Sharpe evaluation and ensemble selection), and testing across rolling time windows.

4.4. Out-of-Sample Results

The deep-reinforcement-learning (DRL) trading agents were trained using rolling two-year windows of daily data, validated on the subsequent two-year periods, and then tested on the following two-year out-of-sample windows. For each cycle, daily returns from both investment universes—the sector ETFs and the Dow Jones Industrial Average (DJIA) components—were used to train, validate, and test the models sequentially. This process was repeated every two years, beginning in 2007 for the DJIA universe and 2003 for the sector ETFs, resulting in six rolling periods for the DJIA and eight periods for the sector ETFs. All results are reported in percentage terms based on an initial notional investment of $1000, consistent across DRL strategies and benchmark portfolios. Performance statistics are computed from daily returns and include annualized return, annualized volatility, Value-at-Risk (VaR) and Conditional VaR (CVaR) at the 95% confidence level, maximum drawdown, Sharpe ratio (assuming a risk-free rate $r_f = 0$), upside/downside capture ratios, beta, tracking error, and information ratio.

4.4.1. Results for the Sector ETF Universe

This section reports out-of-sample results for portfolios constructed from 12 U.S. sector ETFs. Each DRL trading agent is trained under one of the three utility functions—Mean–Variance, Mean–Semivariance, and Mean–CVaR—and one of the two DRL algorithms, PPO or A2C. The ensemble model, as defined in Section 3.4, selects the strategy exhibiting the highest median (50th-percentile) Sharpe ratio during the validation phase among the six agent–objective combinations. Benchmark portfolios are optimized using the same three mean–risk objectives with a 252-day historical window and 30-day re-optimization frequency, while the SPDR S&P 500 ETF Trust (SPY) serves as the general market benchmark.
The results in Table 4 and Figure 3 show that DRL-based strategies consistently outperform static benchmarks across both return and risk-adjusted measures. Among the single agents, the PPO Mean–Variance model achieved the highest annual return of 10.7%, with a volatility of 18.2%, surpassing all three benchmark tangency portfolios—Mean–Variance (5.9%), Mean–Semivariance (6.1%), and Mean–CVaR (2.8%)—as well as the SPY market benchmark (8.8%, 20.5% volatility). The complete out-of-sample dynamics of optimal weights, learned risk-aversion levels, and rebalancing horizons are presented in Appendix C (Figure A2). The ensemble model further improved performance, generating the highest overall return of 11.3% at a comparable volatility (19.1%). Most DRL agents also demonstrated lower Value-at-Risk and Conditional Value-at-Risk levels and smaller drawdowns (around 40–46%) relative to the tangency benchmarks (58–72%), indicating better downside control while maintaining higher returns within the sector ETF universe.
The ensemble model records the highest Sortino ratio among all strategies, comfortably exceeding both the single DRL agents and the static benchmarks. This result indicates that DRL-guided rebalancing enhances performance not only on a total volatility basis (Sharpe) but also relative to downside risk (Sortino). Among the single-agent models, both PPO and A2C trained under the mean–variance objective show consistently higher Sortino ratios and stronger upside capture, while limiting downside exposure compared to benchmark portfolios. These patterns translate into higher information ratios at similar tracking errors, suggesting that DRL-based approaches effectively adjust rebalancing horizons and risk aversion in response to changing market conditions, leading to more resilient and adaptive portfolio performance across the ETF universe.

4.4.2. Results for the Dow Jones Industrial Average Universe

For the DJIA components, out-of-sample tests were conducted using the same combination of three optimization objectives and two DRL algorithms. The ensemble model was again identified from the validation phase as the trading agent with the highest median Sharpe ratio among the six configurations. Benchmarks included tangency portfolios based on Mean–Variance, Mean–Semivariance, and Mean–CVaR (95% confidence) objectives, re-optimized every 30 trading days, together with the SPDR Dow Jones Industrial Average ETF Trust (DIA).
The results in Table 5 and Figure 4 show that DRL-based strategies generally outperform the static tangency benchmarks across the Dow Jones Industrial Average universe. Among the single agents, the PPO Mean–Variance model achieved the highest annual return of 15.7% with a volatility of 17.7%, followed closely by the A2C Mean–Variance model (15.5%, 18.1% volatility). Both models delivered stronger performance than the Mean–Variance (7.7%), Mean–Semivariance (5.8%), and Mean–CVaR (7.0%) tangency portfolios, as well as the DIA benchmark (11.6%, 17.3% volatility). The ensemble model produced a balanced profile, with an annual return of 11.4% and volatility of 17.8%, performing comparably to the DIA ETF and with similar tail-risk exposure. Detailed out-of-sample optimal portfolio weights and the agent’s predicted action variables are shown in Appendix C (Figure A3). Across most specifications, the DRL agents achieved higher Sharpe ratios and smaller drawdowns (most around 28–31%) relative to the benchmark portfolios (most around 31–38%), demonstrating their ability to sustain higher returns while maintaining effective downside-risk control.
The PPO Mean–Variance model also achieved the highest Sortino ratio among all strategies, surpassing both the DIA benchmark and the tangency portfolios. The A2C Mean–Variance model ranked a close second, confirming that DRL agents trained under variance-based objectives generate stronger asymmetric payoffs and more stable performance in the Dow Jones universe. Together, these results show that DRL-driven rebalancing frameworks adapt dynamically to changing market conditions, enhancing both total and downside-risk-adjusted returns without increasing volatility.

4.4.3. Statistical Inference and Model Robustness

In this study, we applied a rolling training–validation–testing schedule to evaluate the DRL policy and generate the out-of-sample (OOS) return series for each strategy. The learned policy is a fixed decision rule (trained with a fixed random seed), so the relevant source of uncertainty is the time-series variation in realized OOS returns across different market conditions.
To assess whether the DRL strategies delivered economically and statistically meaningful value added relative to the relevant benchmarks, we focused on two complementary and widely used tests: benchmark-adjusted alpha (a regression-based, risk-adjusted performance measure) and the Sharpe-ratio difference. Both procedures explicitly account for heteroskedasticity and serial dependence in daily returns: volatility clustering is a core "stylized fact" of financial returns (Cont, 2001), and serial correlation is present in returns, especially at higher frequencies and in dynamic or overlapping portfolio constructions (Lo & MacKinlay, 1988).
• Test on benchmark-adjusted α under HAC (Newey–West heteroskedasticity- and autocorrelation-consistent) errors: $H_0: \alpha = 0$ vs. $H_1: \alpha \neq 0$. α is estimated from the regression $r_{S,t} = \alpha + \beta r_{B,t} + \epsilon_t$ and annualized by multiplying by 252, where $r_{S,t}$ and $r_{B,t}$ are the daily OOS returns of the DRL strategy and the benchmark, respectively.
• Test on the Sharpe-ratio difference using a stationary bootstrap: $H_0: \Delta SR = SR_S - SR_B \le 0$ vs. $H_1: \Delta SR = SR_S - SR_B > 0$. We test whether the strategy improves risk efficiency relative to the benchmark by evaluating the Sharpe-ratio difference $\Delta SR$; its p-value is obtained with the Politis–Romano stationary bootstrap, which preserves serial dependence (expected block length = 20 trading days).
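A compact sketch of both procedures follows, assuming daily OOS return series `r_s` (strategy) and `r_b` (benchmark) as NumPy arrays, with statsmodels providing the HAC regression and the arch package providing the stationary bootstrap. The p-value construction (centering the bootstrap distribution to approximate the null) is one common convention; this is an illustration of the tests, not the authors' code:

```python
import numpy as np
import statsmodels.api as sm
from arch.bootstrap import StationaryBootstrap

def hac_alpha(r_s, r_b, lags=20):
    """Benchmark-adjusted alpha with Newey-West (HAC) standard errors."""
    res = sm.OLS(r_s, sm.add_constant(r_b)).fit(
        cov_type="HAC", cov_kwds={"maxlags": lags})
    return res.params[0] * 252, res.pvalues[0]   # annualized alpha, p-value

def sharpe_diff_test(r_s, r_b, block=20, reps=5000):
    """One-sided stationary-bootstrap test of H0: SR_S - SR_B <= 0."""
    def dsr(x, y):
        sr = lambda z: z.mean() / z.std(ddof=1) * np.sqrt(252)
        return sr(x) - sr(y)
    observed = dsr(r_s, r_b)
    bs = StationaryBootstrap(block, r_s, r_b)    # expected block length = 20
    draws = np.array([dsr(*pos) for pos, _ in bs.bootstrap(reps)])
    # Center the bootstrap distribution at zero to approximate the null,
    # then compute the one-sided p-value for the observed difference.
    p = np.mean(draws - observed >= observed)
    return observed, p
```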
In the sector ETF universe, we ran the above two tests for the Ensemble and PPO Mean–Variance DRL models. As reported in Table 6, both strategies delivered statistically significant positive alpha against all three static tangency benchmarks (6–9% per year; HAC $p < 0.05$), coupled with meaningful Sharpe improvements (ΔSR ≈ 0.27–0.42; bootstrap $p \le 0.06$). Against SPY, estimates were directionally positive but not statistically significant at the 0.05 or 0.10 levels.
In the DJIA component universe, we ran the same tests for the Ensemble and PPO Mean–Variance DRL models. As reported in Table 7, the PPO Mean–Variance agent exhibits strong risk-adjusted outperformance versus all static tangency benchmarks (alpha of 9–11% per year; $p < 0.001$) with significant Sharpe increases (ΔSR ≈ 0.44–0.55; $p \le 0.021$). The Ensemble also shows positive alpha and, against the Semivariance tangency portfolio, a significant Sharpe improvement. Versus the DIA benchmark, PPO Mean–Variance maintains a significant positive alpha (6.18%/yr; $p = 0.035$), although Sharpe differences are not significant—consistent with the difficulty of statistically dominating a price-weighted index such as the DJIA over this post-2011 sample.

4.4.4. Turnover Dynamics and Transaction Cost Impact

To verify the behavioral consistency of the proposed reinforcement-learning framework, we examine the realized turnover and implied transaction costs of the PPO Mean–Variance configuration. The agent learns both a dynamic risk-aversion index and a rebalancing horizon, leading to adaptive but structured trading patterns. Across the sector ETF universe (2007–2023), the strategy executed 130 rebalances, averaging 42.8% turnover per event and an annualized turnover of 327%. In the DJIA universe (2011–2022), turnover averaged 53.8% per rebalance and 431% annually, across 96 rebalancing trades. Detailed turnover and transaction-cost information for both universes is shown in Appendix C (Figure A4). These figures confirm that the agent rebalances on a consistent bi-monthly cycle rather than exhibiting unstable or high-frequency behavior. To assess implementation feasibility, we applied a linear transaction-cost model:
$\mathrm{Cost}_y = \tau_y \times c_{\mathrm{side}},$
where $\tau_y$ is the one-way annual turnover and $c_{\mathrm{side}}$ is the per-side transaction cost in basis points. Under a base scenario of 5 bps per side for ETFs and 10 bps for equities, the estimated annual drag equals 0.16% and 0.43%, respectively. Even under a high-cost assumption (10–20 bps per side), the drag remains below 1% per year—indicating that the learned trading frequency is economically sustainable.
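The cost computation itself is straightforward. A minimal sketch under stated assumptions (per-event turnover measured as the total absolute change in weights, costs applied via the linear model above):

```python
import numpy as np

def annual_cost_drag(weight_targets, rebalances_per_year, cost_bps_per_side):
    """weight_targets: (K, N) array of target weights at K rebalance events.
    Turnover per event is taken here as sum_i |w_new,i - w_old,i|; the annual
    drag then follows the linear model Cost_y = tau_y * c_side."""
    turnover_per_event = np.abs(np.diff(weight_targets, axis=0)).sum(axis=1)
    tau_y = turnover_per_event.mean() * rebalances_per_year  # annual turnover
    return tau_y * cost_bps_per_side / 1e4                   # drag as a fraction

# Base-case arithmetic from the text: 327% annual turnover at 5 bps per side
# implies roughly 3.27 * 0.0005 = 0.16% per year.
```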

5. Conclusions

This paper develops a dynamic portfolio optimization framework that integrates deep reinforcement learning (DRL) with classical mean–risk optimization. By combining Proximal Policy Optimization (PPO) and Advantage Actor–Critic (A2C) algorithms with Mean–Variance, Mean–Semivariance, and Mean–CVaR formulations, the framework enables portfolio rebalancing that adapts endogenously to evolving market conditions.
Methodologically, this study extends Yang et al. (2020) by employing DRL agents not to directly estimate portfolio weights but to learn the underlying risk-aversion parameters and rebalancing horizons that govern traditional optimization. This design preserves the structure of established portfolio theory while introducing an adaptive, data-driven market-timing mechanism. The agents operate on a three-dimensional state representation composed of historical prices, realized returns, and technical indicators derived from the same time series. Using rolling windows of two-year training, validation, and testing periods, the model is trained and evaluated on historical daily data for 12 U.S. sector ETFs and 28 Dow Jones Industrial Average (DJIA) constituents.
Empirically, the DRL-based strategies exhibit competitive and, in several cases, superior performance relative to static tangency portfolios and benchmark ETFs. In the sector ETF universe, the DRL agents—particularly those trained under the Mean–Variance objective—deliver higher average returns and Sharpe ratios, indicating improved responsiveness to shifts in market dynamics. In the DJIA universe, while the ensemble strategy performs comparably to the DIA benchmark, the Mean–Variance-based agents achieve the strongest individual results. Overall, the findings demonstrate that DRL-guided parameter selection can enhance portfolio adaptability and maintain robust performance across different market structures without relying on direct weight learning.
Future research could extend this framework in several directions. First, incorporating broader asset classes such as fixed income, commodities, cryptocurrencies, or derivatives would allow the DRL agents to learn allocation and hedging decisions within a more complete investment universe. Second, integrating alternative and macroeconomic data sources—including sentiment, liquidity, or policy indicators—may improve the model’s predictive and adaptive capacity. Finally, advances in DRL architectures, such as hierarchical or multi-agent systems, could further improve stability and interpretability by decomposing the investment problem into specialized sub-tasks for return generation and risk control.
In summary, this study establishes a framework where reinforcement learning enhances traditional portfolio theory by endogenizing key decision parameters—risk preference and rebalancing frequency—offering a flexible, data-driven approach to dynamic investment management.

Author Contributions

Conceptualization, J.Y. and K.-C.C.; methodology, J.Y. and K.-C.C.; validation, J.Y. and K.-C.C.; formal analysis, J.Y. and K.-C.C.; data curation, J.Y.; writing—original draft preparation, J.Y.; writing—review and editing, K.-C.C.; visualization, J.Y.; supervision, K.-C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT 4.0 to assist with wording and language refinement. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure A1. Historical prices for the sector ETFs and Dow Jones Industrial Average constituents (see Note 5).

Appendix B. Technical Indicator Feature Definitions

This appendix details the 34 technical indicators and derived features that constitute the observation space for the DRL trading agents (Table A1). All indicators are computed from daily adjusted-close prices and trading volumes. Calculation formulas, where available, adhere to the conventions of the TA-Lib Technical Analysis Library (n.d.) and the standard references (Achelis, 2000; Murphy, 1999). All variables are standardized using a rolling z-score normalization within the look-back window to ensure scale consistency.
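The rolling z-score step can be sketched as follows (a minimal illustration; the window length is assumed to match the 252-day feature look-back noted in Note 3, and pandas is used for convenience):

```python
import pandas as pd

def rolling_zscore(x: pd.Series, window: int = 252) -> pd.Series:
    """Standardize a feature within its trailing look-back window so that
    all indicators in Table A1 share a comparable scale."""
    mu = x.rolling(window).mean()
    sigma = x.rolling(window).std()
    return (x - mu) / sigma
```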
Table A1. Technical indicator definitions and parameters.

| Feature Set | Definition | Windows/Parameters |
|---|---|---|
| Aroon Oscillator | $\mathrm{AROONOSC} = \mathrm{AROON}_{UP} - \mathrm{AROON}_{DOWN}$, $\mathrm{AROON}_{UP} = 100 \times \frac{n - \text{days since highest high}}{n}$ | n = 25 |
| Awesome Oscillator | $AO = \mathrm{SMA}(\text{Median Price}, 5) - \mathrm{SMA}(\text{Median Price}, 34)$, $\text{Median Price} = \frac{High + Low}{2}$ | 5 and 34 |
| Bollinger Band Upper Bound | $BB_U = \mathrm{SMA}(P, n) + k \times \sigma_n(P)$, $k = 2$ | n = 20 |
| Bollinger Band Lower Bound | $BB_L = \mathrm{SMA}(P, n) - k \times \sigma_n(P)$, $k = 2$ | n = 20 |
| Chande Momentum Oscillator | $CMO = 100 \times \frac{\sum U - \sum D}{\sum U + \sum D}$, $U = \max(\Delta P, 0)$, $D = \max(-\Delta P, 0)$ | n = 14 |
| Commodity Channel Index—30 periods | $CCI = \frac{TP - \mathrm{SMA}(TP, n)}{0.015 \times \mathrm{MAD}_n(TP)}$, $TP = \frac{High + Low + Close}{3}$ | n = 30 |
| Correlation Trend Indicator | $\rho_n = \mathrm{Corr}(P_t, t)$ | n = 30 |
| Directional Index—30 periods | Wilder's ADX built from $+DI$ and $-DI$: $+DI_n = 100\,\frac{\mathrm{SMMA}_n(+DM)}{ATR_n}$, $-DI_n = 100\,\frac{\mathrm{SMMA}_n(-DM)}{ATR_n}$, $DX_n = 100\,\frac{\lvert +DI_n - (-DI_n) \rvert}{+DI_n + (-DI_n)}$, $ADX_n = \mathrm{SMMA}_n(DX_n)$. (SMMA is Wilder's moving average; the TA-Lib implementation is used.) | n = 30 |
| Elder-Ray Index—Bull Power | $\text{BullPower} = High - \mathrm{EMA}(Close, 13)$ | 13 |
| Elder-Ray Index—Bear Power | $\text{BearPower} = Low - \mathrm{EMA}(Close, 13)$ | 13 |
| Inertia Indicator | Trend-persistence measure based on the Relative Vigor Index applied to a detrended price oscillator. | n = 20 |
| Kaufman's Efficiency Ratio | $ER_t = \frac{\lvert P_t - P_{t-n} \rvert}{\sum_{i=1}^{n} \lvert P_{t-i+1} - P_{t-i} \rvert}$ | n = 10 |
| Know Sure Thing | $KST = \sum_{j=1}^{4} w_j \times \mathrm{SMA}(\mathrm{ROC}(P, r_j), s_j)$ | Pring defaults |
| Linear Regression of Close (window 10) | Slope $\beta$ from $P_\tau = a + \beta\tau$, $\tau = t-9, \ldots, t$ | n = 10 |
| Log Return | $r_t = \ln\frac{P_t}{P_{t-1}}$ | 1 |
| Moving Average Convergence Divergence | $MACD = \mathrm{EMA}_{12}(P) - \mathrm{EMA}_{26}(P)$, $Signal = \mathrm{EMA}_9(MACD)$ | 12, 26, 9 |
| Percentage Volume Oscillator | $PVO = 100 \times \frac{\mathrm{EMA}_{12}(Vol) - \mathrm{EMA}_{26}(Vol)}{\mathrm{EMA}_{26}(Vol)}$ | 12, 26 |
| Pretty Good Oscillator | $PGO = \frac{P - \mathrm{SMA}(P, n)}{ATR(n)}$ | n = 14 |
| Psychological Line | $PSY = 100 \times \frac{\#\text{ up closes over } n}{n}$ | n = 12 |
| Quantitative Qualitative Estimation (QQE) | RSI smoothed by EMA and bounded by ATR-based dynamic bands (standard definition). | RSI base 14–30 |
| Relative Strength Index—30 periods | $RSI = 100 - \frac{100}{1 + RS}$, $RS = \frac{\text{AvgGain}}{\text{AvgLoss}}$ | n = 30 |
| Relative Vigor Index | $RVI = \frac{\mathrm{SMA}_n(Close - Open)}{\mathrm{SMA}_n(High - Low)}$ | n = 10 |
| Simple Moving Average of Close—10 periods | $\mathrm{SMA}_{10}(P) = \frac{1}{10}\sum P$ | 10 |
| Simple Moving Average of Close—20 periods | $\mathrm{SMA}_{20}(P) = \frac{1}{20}\sum P$ | 20 |
| Simple Moving Average of Close—100 periods | $\mathrm{SMA}_{100}(P) = \frac{1}{100}\sum P$ | 100 |
| Stochastic RSI | $\mathrm{StochRSI} = \frac{RSI - \min(RSI_n)}{\max(RSI_n) - \min(RSI_n)}$ | n = 14 |
| Super Trend Upper Bound | $Upper = \frac{High + Low}{2} + m \times ATR(n)$, $m = 3$ | n = 10 |
| Super Trend Lower Bound | $Lower = \frac{High + Low}{2} - m \times ATR(n)$, $m = 3$ | n = 10 |
| Gaussian Fisher Transform Price Reversals indicator | $F_t = 0.5 \times \ln\frac{1 + x_t}{1 - x_t}$, $x_t = 2\,\frac{P_t - \min(P_n)}{\max(P_n) - \min(P_n)} - 1$ | n = 9 |
| Triple Exponential Moving Average | $TEMA = 3 \times \mathrm{EMA} - 3 \times \mathrm{EMA}(\mathrm{EMA}) + \mathrm{EMA}(\mathrm{EMA}(\mathrm{EMA}))$ | n = 30 |
| Volume Variation Index | $VVI = \frac{Vol_t - \mathrm{SMA}(Vol, n)}{\mathrm{SMA}(Vol, n)}$ | n = 20 |
| Z-Score of Close Price—75 periods | $Z = \frac{P_t - \mathrm{SMA}_{75}(P)}{\sigma_{75}(P)}$ | n = 75 |
| Adjusted Close Price | Adjusted for splits and dividends; base series. | — |
| Percentage Change in Close Price | $\frac{P_t - P_{t-1}}{P_{t-1}}$ | 1 |

Appendix C

Appendix C.1. Best Strategy Portfolio Out-of-Sample Metrics

The following charts illustrate the out-of-sample dynamics of the best-performing DRL-based strategies identified for each investment universe. Each figure visualizes how the trained agents adjust their portfolio allocations and decision variables over time in response to evolving market conditions.
In each chart, the upper panel displays the portfolio weights assigned to individual assets across successive re-optimization periods. These weights are generated by the DRL agent, which outputs an optimal risk-aversion parameter λ to solve the mean–risk optimization problem. The colored bands represent the proportion of capital allocated to each asset over time, where the shifting patterns across the panel highlight the dynamic nature of portfolio rebalancing. Periods of concentrated weights indicate higher conviction in specific sectors or assets, while more diversified allocations reflect a preference for risk spreading during heightened market uncertainty.
The lower panel shows the evolution of two internal state variables—the risk-aversion index and the rebalancing window—produced by the DRL agents during out-of-sample trading. The risk-aversion index ranges from 1 to 100 and represents the agent’s learned preference along the efficient frontier: lower values correspond to defensive, low-volatility portfolios, while higher values indicate more aggressive, return-seeking allocations. The rebalancing window, which varies from 5 to 60 trading days, represents the period the agent holds a portfolio before re-optimizing. A shorter window suggests higher market uncertainty or a need for more frequent tactical adjustment, while a longer one indicates more stable conditions.
Together, these visualizations provide empirical evidence of how the DRL agents jointly optimize what to hold (portfolio composition) and when to adjust (rebalancing frequency). The coordination between the two state variables allows the agents to navigate changing market regimes without relying on predefined labels. This adaptive behavior illustrates a key advantage of DRL in portfolio management—its capacity to learn from market feedback and continuously refine allocation and timing decisions to maintain robust performance across diverse economic environments.
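For concreteness, the sketch below illustrates how a normalized two-dimensional action could be decoded into these two variables and fed to a λ-parameterized mean–risk step. The normalization, the long-only constraints, and the use of cvxpy are illustrative assumptions, and the mapping from the 1–100 index to the trade-off coefficient is not the authors' exact specification:

```python
import cvxpy as cp
import numpy as np

def decode_action(a):
    """Map an action in [-1, 1]^2 to a risk-aversion index in [1, 100]
    and a rebalancing window in [5, 60] trading days (assumed scaling)."""
    risk_index = 1.0 + (a[0] + 1.0) / 2.0 * 99.0
    window = int(round(5 + (a[1] + 1.0) / 2.0 * 55))
    return risk_index, window

def mean_variance_weights(mu, Sigma, lam):
    """Long-only mean-variance weights for a given trade-off coefficient lam."""
    w = cp.Variable(len(mu))
    problem = cp.Problem(
        cp.Maximize(mu @ w - lam * cp.quad_form(w, Sigma)),
        [cp.sum(w) == 1, w >= 0],
    )
    problem.solve()
    return w.value
```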
Figure A2. Sector ETF universe: PPO Mean–Variance optimal weights and state-space variables.
Figure A3. DJIA components universe: PPO Mean–Variance optimal weights and state-space variables.

Appendix C.2. Turnover and Transaction Cost Analysis

Figure A4 illustrates the annual turnover and estimated transaction costs of the PPO Mean–Variance model for both the sector ETF and DJIA component universes. For the sector ETF portfolios, the average annual turnover was roughly 327%, with visible spikes in years such as 2013 and 2017, corresponding to periods of more frequent rebalancing and stronger market dispersion. Assuming a 5 bps transaction cost per side, the implied annual cost drag averaged around 0.16% per year, suggesting that trading costs have a limited but non-negligible impact on net performance.
For the DJIA component universe, mean annual turnover was higher, at approximately 431%, reflecting the larger number of individual securities and the greater position adjustments required to maintain optimal weights. Under a 10 bps per-side assumption, the average annual cost drag was about 0.43%, consistent with expectations for higher-frequency equity rebalancing.
Overall, the turnover and cost profiles confirm that the PPO mean–variance-based allocation framework remains operationally efficient: although turnover varies with market conditions, the resulting transaction-cost impact remains small relative to the annualized portfolio returns.
Figure A4. Turnover and transaction cost analysis for the PPO Mean–Variance model (sector ETFs vs. DJIA universe).

Notes

1.
2. Since the DJIA composition changes frequently—for example, Amazon replaced Walgreens Boots Alliance in 2024, and Salesforce, Amgen, and Honeywell joined in 2020—we assume a static universe to keep the analysis consistent across the sample period.
3. The DJIA panel begins in 2005 to ensure continuous histories for all 28 retained names. With a 252-day feature look-back and the rolling two-year train–validate–test design, the first ETF out-of-sample test begins in January 2007, and the first DJIA test starts in January 2011.
4. All portfolio metrics are derived from daily log returns for both the portfolio and its benchmark (SPY for the ETF universe and DIA for the DJIA universe) using functions from the PerformanceAnalytics package (version 2.0.4) in R (version 4.4.1). Annual return is calculated from the average daily return and annualized based on 252 trading days. Annual volatility is obtained from the standard deviation of daily returns, scaled by $\sqrt{252}$. Value-at-Risk (VaR, 95%) is computed through historical simulation as the fifth percentile of daily returns, representing the worst expected loss on a typical day at a 95% confidence level. Conditional Value-at-Risk (CVaR, 95%) is the mean of losses exceeding the VaR threshold, reflecting expected tail risk. Maximum drawdown measures the greatest cumulative decline from a portfolio's peak value to its subsequent trough. Sharpe ratio is the annualized mean excess return divided by annualized volatility, assuming a zero risk-free rate. Sortino ratio refines this by dividing the annualized mean return by downside deviation, using a minimum acceptable return of zero to isolate negative volatility. Upside and downside capture ratios compare the portfolio's returns to the benchmark's performance during positive and negative benchmark periods, respectively, indicating relative participation in gains and protection during losses. Beta is estimated from a regression of daily portfolio returns on benchmark returns, indicating systematic market exposure. Tracking error is the annualized standard deviation of active returns (portfolio minus benchmark). Information ratio captures risk-adjusted active performance by dividing annualized active return by tracking error.
5. The adjusted close price is used throughout the analysis because it accounts for corporate actions such as stock splits, dividends, and distributions, providing a more accurate measure of total return than the raw close price. The adjusted close price is retrieved from Yahoo! Finance.

References

1. Acero, F., Zehtabi, P., Marchesotti, N., Cashmore, M., Magazzeni, D., & Veloso, M. (2024). Deep reinforcement learning and mean-variance strategies for responsible portfolio optimization. arXiv, arXiv:2403.16667.
2. Achelis, S. B. (2000). Technical analysis from A to Z (2nd ed.). McGraw Hill Professional.
3. Bishop, C. M. (1995). Neural networks for pattern recognition. Clarendon Press.
4. Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (2017). Classification and regression trees. Chapman and Hall/CRC.
5. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI gym. arXiv, arXiv:1606.01540v1.
6. Bühler, H., Gonon, L., Teichmann, J., & Wood, B. (2018). Deep hedging. arXiv, arXiv:1802.03042.
7. Ceria, S., & Stubbs, R. A. (2006). Incorporating estimation errors into portfolio selection: Robust portfolio construction. Journal of Asset Management, 7(2), 109–127.
8. Chang, K. C., Tian, Z., & Yu, J. (2017, July 10–13). Dynamic asset allocation—Chasing a moving target. The 2017 20th International Conference on Information Fusion (Fusion) (pp. 1–8), Xi'an, China.
9. Chen, T., & Guestrin, C. (2016, August 13–17). XGBoost: A scalable tree boosting system. The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794), San Francisco, CA, USA.
10. Cont, R. (2001). Empirical properties of asset returns: Stylized facts and statistical issues. Quantitative Finance, 1(2), 223–236.
11. Dai, Z., & Wang, F. (2019). Sparse and robust mean–variance portfolio optimization problems. Physica A: Statistical Mechanics and Its Applications, 523, 1371–1378.
12. Deng, Y., Bao, F., Kong, Y., Ren, Z., & Dai, Q. (2017). Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3), 653–664.
13. Du, J., Jin, M., Kolm, P. N., Ritter, G., Wang, Y., & Zhang, B. (2020). Deep reinforcement learning for option replication and hedging. The Journal of Financial Data Science, 2(4), 44–57.
14. Ekren, I., Liu, R., & Muhle-Karbe, J. (2017). Optimal rebalancing frequencies for multidimensional portfolios. Mathematics and Financial Economics, 12(2), 165–191.
15. Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
16. Estrada, J. (2002). Systematic risk in emerging markets: The D-CAPM. Emerging Markets Review, 3(4), 365–379.
17. Estrada, J. (2007). Mean-semivariance behavior: Downside risk and capital asset pricing. International Review of Economics & Finance, 16(2), 169–185.
18. Fakhar, M., Mahyarinia, M. R., & Zafarani, J. (2018). On nonsmooth robust multiobjective optimization under generalized convexity with applications to portfolio optimization. European Journal of Operational Research, 265(1), 39–48.
19. Fliege, J., & Werner, R. (2014). Robust multiobjective optimization & applications in portfolio optimization. European Journal of Operational Research, 234(2), 422–433.
20. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
21. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv, arXiv:1802.09477.
22. Garlappi, L., Uppal, R., & Wang, T. (2007). Portfolio selection with parameter and model uncertainty: A multi-prior approach. The Review of Financial Studies, 20(1), 41–81.
23. Ghahtarani, A., Saif, A., & Ghasemi, A. (2022). Robust portfolio selection problems: A comprehensive review. Operational Research, 22(4), 3203–3264.
24. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv, arXiv:1801.01290.
25. Harvey, C. R., Liechty, J. C., Liechty, M. W., & Müller, P. (2010). Portfolio selection with higher moments. Quantitative Finance, 10(5), 469–485.
26. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
27. Hsieh, C.-H. (2021). Necessary and sufficient conditions for frequency-based Kelly optimal portfolio. IEEE Control Systems Letters, 5(1), 349–354.
28. Hsieh, C.-H., & Wong, Y.-S. (2023). On frequency-based optimal portfolio with transaction costs. arXiv, arXiv:2301.02754.
29. Hurley, W. J., & Brimberg, J. (2015). A note on the sensitivity of the strategic asset allocation problem. Operations Research Perspectives, 2, 133–136.
30. JPMorgan/Reuters. (1996). RiskMetrics—Technical document (4th ed.). J.P. Morgan/Reuters.
31. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
32. Kuhn, D., & Luenberger, D. G. (2010). Analysis of the rebalancing frequency in log-optimal portfolio selection. Quantitative Finance, 10, 221–234.
33. Lee, Y., Kim, M. J., Kim, J. H., Jang, J. R., & Kim, W. C. (2020). Sparse and robust portfolio selection via semi-definite relaxation. Journal of the Operational Research Society, 71(5), 687–699.
34. Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforcement learning in portfolio management. arXiv, arXiv:1808.09940.
35. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2019). Continuous control with deep reinforcement learning. arXiv, arXiv:1509.02971.
36. Liu, X.-Y., Yang, H., Chen, Q., Zhang, R., Yang, L., Xiao, B., & Wang, C. (2020). FinRL: A deep reinforcement learning library for automated stock trading in quantitative finance. arXiv, arXiv:2011.09607.
37. Lo, A. W., & MacKinlay, A. C. (1988). Stock market prices do not follow random walks: Evidence from a simple specification test. The Review of Financial Studies, 1(1), 41–66.
38. Lobo, M. S., Fazel, M., & Boyd, S. (2007). Portfolio optimization with linear and fixed transaction costs. Annals of Operations Research, 152(1), 341–365.
39. Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1), 77–91.
40. Markowitz, H. M. (1959). Portfolio selection: Efficient diversification of investments. Yale University Press.
41. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv, arXiv:1312.5602.
42. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
43. Mnih, V., Puigdomènech Badia, A., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. arXiv, arXiv:1602.01783.
44. Murphy, J. J. (1999). Technical analysis of the financial markets: A comprehensive guide to trading methods and applications. Penguin.
45. Rockafellar, R. T., & Uryasev, S. (2000). Optimization of conditional value-at-risk. The Journal of Risk, 2(3), 21–41.
46. Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7), 1443–1471.
47. Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, July 6–11 (Vol. 37, pp. 1889–1897). PMLR.
48. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv, arXiv:1707.06347.
49. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv, arXiv:1712.01815v1.
50. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. A. (2014, June 21–26). Deterministic policy gradient algorithms. The 31st International Conference on Machine Learning, Beijing, China.
51. Smith, D. M., & Desormeau, W. H. (2006). Optimal rebalancing frequency for bond/stock portfolios. Journal of Financial Planning, 19, 52–63.
52. TA-Lib—Technical Analysis Library. (n.d.). Available online: https://ta-lib.org/ (accessed on 25 September 2025).
53. Wang, J., Zhang, Y., Tang, K., Wu, J., & Xiong, Z. (2019, August 4–8). AlphaStock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1900–1908), Anchorage, AK, USA.
54. Wang, Z., Huang, B., Tu, S., Zhang, K., & Xu, L. (2021). DeepTrader: A deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. Proceedings of the AAAI Conference on Artificial Intelligence, 35, 643–650.
55. Yang, H., Liu, X.-Y., Zhong, S., & Walid, A. (2020, October 15–16). Deep reinforcement learning for automated stock trading: An ensemble strategy. The First ACM International Conference on AI in Finance (pp. 1–8), New York, NY, USA.
56. Ye, Y., Pei, H., Wang, B., Chen, P., Zhu, Y., Xiao, J., & Li, B. (2020). Reinforcement-learning based portfolio management with augmented asset movement prediction states. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01), 1112–1119.
57. Yu, J., & Chang, K.-C. (2020). Neural network predictive modeling on dynamic portfolio management—A simulation-based portfolio optimization approach. Journal of Risk and Financial Management, 13(11), 285.
58. Zhang, Z., Zohren, S., & Roberts, S. (2019). Deep reinforcement learning for trading. arXiv, arXiv:1911.10107v1.
Figure 1. The DRL system for portfolio optimization.
Figure 2. Training, validating, and testing DRL agents.
Figure 3. Sector ETF universe out-of-sample portfolio value for different strategies (with USD 1000 investment).
Figure 4. DJIA stocks universe out-of-sample portfolio value for different strategies (with USD 1000 investment).
Table 1. Feature set: technical indicators and derived features.

| Feature Set: Technical Indicators | |
|---|---|
| Adjusted Close Price | Percentage Change in Adjusted Close Price |
| Aroon Oscillator | Percentage Volume Oscillator |
| Awesome Oscillator | Pretty Good Oscillator |
| Bollinger Band Lower Bound | Psychological Line |
| Bollinger Band Upper Bound | Quantitative Qualitative Estimation |
| Chande Momentum Oscillator | Relative Strength Index—30 periods |
| Commodity Channel Index—30 periods | Relative Vigor Index |
| Correlation Trend Indicator | Simple Moving Average of the Close Price—10 periods |
| Directional Index—30 periods | Simple Moving Average of the Close Price—100 periods |
| Elder-Ray Index—Bear Power | Simple Moving Average of the Close Price—20 periods |
| Elder-Ray Index—Bull Power | Stochastic RSI |
| Inertia Indicator | Super Trend Lower Bound |
| Kaufman's Efficiency Ratio | Super Trend Upper Bound |
| Know Sure Thing | Gaussian Fisher Transform Price Reversals indicator |
| Linear Regression of Close Price with Window Size 10 | Triple Exponential Moving Average |
| Log Return | Volume Variation Index |
| Moving Average Convergence Divergence | Z-Score of Close Price—75 periods |
Table 2. Selected sector ETFs.

| ETF Ticker | Fund Name | Industry |
|---|---|---|
| IYW | iShares U.S. Technology ETF | Technology |
| IYF | iShares U.S. Financials ETF | Financials |
| IYZ | iShares U.S. Telecommunications ETF | Telecommunications |
| IYM | iShares U.S. Basic Materials ETF | Basic Materials |
| IYK | iShares U.S. Consumer Staples ETF | Consumer Staples |
| IYC | iShares U.S. Consumer Discretionary ETF | Consumer Discretionary |
| IYE | iShares U.S. Energy ETF | Energy |
| IYG | iShares U.S. Financial Services ETF | Financial Services |
| IYH | iShares U.S. Healthcare ETF | Healthcare |
| IYJ | iShares U.S. Industrials ETF | Industrials |
| IDU | iShares U.S. Utilities ETF | Utilities |
| IYR | iShares US Real Estate ETF | Real Estate |
Table 3. Selected individual stocks from the Dow Jones Industrial Average index.

| Stock Ticker | Company Name | Industry |
|---|---|---|
| BA | The Boeing Company | Aerospace and Defense |
| AMGN | Amgen Inc. | Biopharmaceutical |
| DIS | The Walt Disney Company | Broadcasting and entertainment |
| NKE | NIKE | Clothing industry |
| HON | Honeywell International Inc. | Conglomerate |
| MMM | 3M Company | Conglomerate |
| CAT | Caterpillar Inc. | Construction and mining |
| KO | The Coca-Cola Company | Drink industry |
| PG | The Procter & Gamble Company | Fast-moving consumer goods |
| AXP | American Express Company | Financial services |
| GS | The Goldman Sachs Group | Financial services |
| JPM | JPMorgan Chase & Co. | Financial services |
| MCD | McDonald's Corporation | Food industry |
| HD | The Home Depot | Home Improvement |
| AAPL | Apple Inc | Information technology |
| CRM | Salesforce | Information technology |
| CSCO | Cisco Systems | Information technology |
| IBM | International Business Machines Corporation | Information technology |
| MSFT | Microsoft Corporation | Information technology |
| TRV | The Travelers Companies | Insurance |
| UNH | UnitedHealth Group Incorporated | Managed healthcare |
| CVX | Chevron Corporation | Petroleum industry |
| JNJ | Johnson & Johnson | Pharmaceutical industry |
| MRK | Merck & Co. | Pharmaceutical industry |
| AMZN | Amazon.com Inc | Retailing |
| WMT | Walmart Inc. | Retailing |
| INTC | Intel Corporation | Semiconductor industry |
| VZ | Verizon Communications Inc. | Telecommunications industry |
Table 4. Sector ETF universe out-of-sample test results (see Note 4).

| Out-of-Sample Metrics | Annual Return | Annual Volatility | VaR | CVaR | Max Drawdown | Sharpe Ratio ($r_f$ = 0%) | Sortino Ratio | Upside Capture | Downside Capture | Beta | Tracking Error | Information Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PPO Mean–Variance | 10.7% | 18.2% | −1.7% | −2.8% | 40.0% | 59.0% | 5.7% | 78.7% | 75.1% | 77.4% | 10.1% | 19.7% |
| PPO Mean–CVaR | 9.1% | 20.9% | −2.1% | −3.3% | 48.9% | 43.7% | 4.5% | 86.5% | 84.7% | 84.3% | 12.2% | 3.0% |
| PPO Mean–Semivariance | 8.3% | 17.8% | −1.7% | −2.8% | 42.6% | 46.8% | 4.7% | 76.2% | 74.6% | 76.1% | 9.8% | −4.6% |
| A2C Mean–Variance | 10.2% | 17.6% | −1.6% | −2.7% | 42.2% | 58.0% | 5.6% | 76.4% | 73.0% | 75.4% | 9.9% | 15.0% |
| A2C Mean–CVaR | 8.6% | 20.9% | −2.0% | −3.3% | 49.4% | 41.2% | 4.3% | 88.2% | 86.9% | 85.8% | 11.7% | −1.0% |
| A2C Mean–Semivariance | 8.6% | 18.6% | −1.8% | −2.9% | 45.5% | 46.1% | 4.7% | 78.9% | 77.2% | 77.1% | 10.8% | −1.8% |
| Benchmark Mean–CVaR | 2.8% | 22.5% | −2.2% | −3.6% | 72.4% | 12.6% | 2.0% | 85.8% | 89.5% | 87.7% | 13.8% | −42.8% |
| Benchmark Mean–Variance | 5.9% | 20.7% | −2.1% | −3.3% | 58.1% | 28.6% | 3.3% | 84.7% | 85.6% | 83.5% | 12.1% | −23.4% |
| Benchmark Mean–Semivariance | 6.1% | 22.9% | −2.3% | −3.7% | 58.1% | 26.6% | 3.2% | 90.0% | 90.9% | 92.6% | 12.9% | −20.7% |
| Ensemble | 11.3% | 19.1% | −1.9% | −3.0% | 41.2% | 59.0% | 5.8% | 79.3% | 75.1% | 77.6% | 11.6% | 21.9% |
| SPY Benchmark | 8.8% | 20.5% | −2.0% | −3.2% | 55.2% | 42.7% | 4.5% | 100.0% | 100.0% | 100.0% | 0.0% | — |
Table 5. DJIA stocks universe out-of-sample test results (see Note 4).

| Out-of-Sample Metrics | Annual Return | Annual Volatility | VaR | CVaR | Max Drawdown | Sharpe Ratio ($r_f$ = 0%) | Sortino Ratio | Upside Capture | Downside Capture | Beta | Tracking Error | Information Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PPO Mean–Variance | 15.7% | 17.7% | −1.5% | −2.6% | 30.7% | 88.7% | 8.3% | 80.7% | 73.3% | 80.3% | 11.5% | 36.0% |
| PPO Mean–CVaR | 14.5% | 22.2% | −2.1% | −3.3% | 30.4% | 65.3% | 6.5% | 100.1% | 95.7% | 97.7% | 14.4% | 20.3% |
| PPO Mean–Semivariance | 10.0% | 18.0% | −1.6% | −2.7% | 33.2% | 55.7% | 5.6% | 75.8% | 73.6% | 74.8% | 13.3% | −11.5% |
| A2C Mean–Variance | 15.5% | 18.1% | −1.6% | −2.7% | 30.2% | 85.8% | 8.0% | 82.0% | 74.8% | 81.8% | 11.7% | 33.7% |
| A2C Mean–CVaR | 10.5% | 23.2% | −2.2% | −3.5% | 35.9% | 45.4% | 4.9% | 99.2% | 98.7% | 98.6% | 15.7% | −6.7% |
| A2C Mean–Semivariance | 12.0% | 18.8% | −1.7% | −2.8% | 28.1% | 63.8% | 6.3% | 80.4% | 76.6% | 78.3% | 13.6% | 3.3% |
| Benchmark Mean–CVaR | 7.0% | 21.0% | −2.0% | −3.2% | 35.8% | 33.2% | 3.7% | 86.5% | 88.7% | 85.5% | 15.2% | −30.2% |
| Benchmark Mean–Variance | 7.7% | 19.9% | −1.9% | −3.0% | 30.7% | 39.0% | 4.2% | 86.3% | 87.9% | 85.4% | 13.5% | −28.4% |
| Benchmark Mean–Semivariance | 5.8% | 22.1% | −2.1% | −3.4% | 37.6% | 26.3% | 3.2% | 88.2% | 91.8% | 89.3% | 15.9% | −36.1% |
| Ensemble | 11.4% | 17.8% | −1.8% | −2.7% | 35.9% | 64.3% | 6.2% | 80.5% | 77.5% | 76.5% | 12.6% | −1.0% |
| DIA Benchmark | 11.6% | 17.3% | −1.6% | −2.6% | 36.7% | 66.9% | 6.3% | 100.0% | 100.0% | 100.0% | 0.0% | — |
Table 6. Statistical tests for alpha and Sharpe-ratio differences in DRL strategies vs. benchmarks (sector ETF universe).

| Strategy | Benchmark | Number of Samples | Alpha (ann., %) | $p(\alpha = 0)$ | Confidence Interval—α (%) | $\Delta SR$ (units) | $p(\Delta SR > 0)$ | Confidence Interval—$\Delta SR$ (units) |
|---|---|---|---|---|---|---|---|---|
| Ensemble | SPY | 4032 | 4.40% | 0.095 * | [−0.77%, 9.56%] | 0.14 | 0.255 | [−0.26, 0.60] |
| Ensemble | Benchmark Mean–CVaR | 4032 | 8.69% | 0.002 *** | [3.28%, 14.10%] | 0.42 | 0.019 ** | [0.02, 0.90] |
| Ensemble | Benchmark Mean–Variance | 4032 | 6.10% | 0.008 *** | [1.56%, 10.65%] | 0.27 | 0.056 * | [−0.06, 0.65] |
| Ensemble | Benchmark Mean–Semivariance | 4032 | 6.51% | 0.008 *** | [1.71%, 11.31%] | 0.28 | 0.055 * | [−0.06, 0.69] |
| PPO Mean–Variance | SPY | 4032 | 3.75% | 0.086 * | [−0.53%, 8.03%] | 0.14 | 0.213 | [−0.21, 0.50] |
| PPO Mean–Variance | Benchmark Mean–CVaR | 4032 | 8.22% | 0.002 *** | [2.95%, 13.50%] | 0.41 | 0.023 ** | [0.00, 0.90] |
| PPO Mean–Variance | Benchmark Mean–Variance | 4032 | 5.70% | 0.009 *** | [1.45%, 9.94%] | 0.27 | 0.052 * | [−0.06, 0.65] |
| PPO Mean–Variance | Benchmark Mean–Semivariance | 4032 | 6.10% | 0.009 *** | [1.50%, 10.71%] | 0.28 | 0.057 * | [−0.07, 0.69] |

Significance is noted as * for $p < 0.10$, ** for $p < 0.05$, and *** for $p < 0.01$.
Table 7. Statistical tests for alpha and Sharpe-ratio differences in DRL strategies vs. benchmarks (DJIA stock universe).

| Strategy | Benchmark | Number of Samples | Alpha (ann., %) | $p(\alpha = 0)$ | Confidence Interval—α (%) | $\Delta SR$ (units) | $p(\Delta SR > 0)$ | Confidence Interval—$\Delta SR$ (units) |
|---|---|---|---|---|---|---|---|---|
| Ensemble | DIA | 3024 | 2.90% | 0.400 | [−3.86%, 9.67%] | −0.02 | 0.538 | [−0.63, 0.56] |
| Ensemble | Benchmark Mean–CVaR | 3024 | 6.57% | 0.043 ** | [0.20%, 12.94%] | 0.27 | 0.157 | [−0.27, 0.77] |
| Ensemble | Benchmark Mean–Variance | 3024 | 5.70% | 0.065 * | [−0.36%, 11.76%] | 0.22 | 0.190 | [−0.27, 0.70] |
| Ensemble | Benchmark Mean–Semivariance | 3024 | 7.46% | 0.024 ** | [1.00%, 13.91%] | 0.33 | 0.108 | [−0.21, 0.86] |
| PPO Mean–Variance | DIA | 3024 | 6.18% | 0.035 ** | [0.45%, 11.90%] | 0.19 | 0.206 | [−0.29, 0.66] |
| PPO Mean–Variance | Benchmark Mean–CVaR | 3024 | 10.12% | 0.000 *** | [4.56%, 15.68%] | 0.49 | 0.021 ** | [0.03, 0.98] |
| PPO Mean–Variance | Benchmark Mean–Variance | 3024 | 9.08% | 0.000 *** | [3.99%, 14.17%] | 0.44 | 0.015 ** | [0.03, 0.89] |
| PPO Mean–Variance | Benchmark Mean–Semivariance | 3024 | 10.96% | 0.000 *** | [5.33%, 16.58%] | 0.55 | 0.008 *** | [0.09, 1.05] |

Significance is noted as * for $p < 0.10$, ** for $p < 0.05$, and *** for $p < 0.01$.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
