The empirical evaluation is conducted using the proposed hierarchical signal-to-policy learning framework, which explicitly decouples return prediction (Stage 1) from portfolio optimization (Stage 2). This separation introduces a modular structure that enables independent analysis of the predictive signal layer and yields more stable learning dynamics in the reinforcement-learning stage. In short, Stage 1 produces predictive signals at a daily frequency, whereas Stage 2 aggregates these signals to inform portfolio allocation decisions made on a monthly basis under explicit risk controls.
4.2.1. Stage 1: Predictive Modeling and SHAP-Based Signal Construction
Stage 1 trains a set of asset-specific XGBoost models, one model per ETF or stock in the investment universe. Each model predicts the one-day forward simple return:
$$\hat{r}_{i,t+1} = f_i(\mathbf{x}_{i,t}),$$
where $\mathbf{x}_{i,t}$ is the feature vector for asset $i$ on trading day $t$, and $f_i$ is the corresponding gradient-boosted tree.
The feature set comprises two components: asset-specific technical indicators (moving-average measures, momentum signals, lagged returns, and volatility) and daily-lagged Fama–French factors shared across assets. To promote feature-selection stability, this list of features is first filtered by a Unified SHAP Feature Gate, which fits preliminary XGBoost models on a designated base window, computes TreeSHAP importances, and aggregates them across assets. Using data from the first period, we selected the top 10 feature groups exhibiting consistently strong predictive relevance at the universe level, and retained them for all subsequent rolling training and prediction.
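The gate's aggregation step can be sketched as follows. This is an illustrative, minimal implementation assuming per-asset mean-|SHAP| importances per feature group have already been computed on the base window; the data layout and the function name are hypothetical, not the paper's code.

```python
# Illustrative sketch of the Unified SHAP Feature Gate's aggregation step.
# Assumes per-asset TreeSHAP importances are already computed on the base
# window; the dict layout and `top_k` default are hypothetical choices.

def select_top_feature_groups(importances_by_asset, top_k=10):
    """Average mean-|SHAP| importances across assets; keep the top_k groups.

    importances_by_asset: {asset: {feature_group: mean_abs_shap}}
    Returns the top_k feature-group names ranked by universe-level importance.
    """
    totals, counts = {}, {}
    for per_group in importances_by_asset.values():
        for group, imp in per_group.items():
            totals[group] = totals.get(group, 0.0) + imp
            counts[group] = counts.get(group, 0) + 1
    universe_level = {g: totals[g] / counts[g] for g in totals}
    ranked = sorted(universe_level, key=universe_level.get, reverse=True)
    return ranked[:top_k]
```

Averaging across assets before ranking is what makes the selection a universe-level gate rather than a per-asset one.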
Models are trained on the top 10 features selected by the Unified SHAP Feature Gate, using rolling-forward splits to generate predicted forward-return estimates. Because the target uses information at $t+1$, the pipeline excludes observations whose forward labels are not yet fully realized.
Hyperparameters are selected to maximize the validation information coefficient (IC):
$$\mathrm{IC}_t = \mathrm{corr}\big(\hat{r}_{\cdot,t+1},\, r_{\cdot,t+1}\big),$$
where $r_{i,t+1}$ denotes the realized one-day forward return and $\mathrm{corr}(\cdot,\cdot)$ is the cross-sectional correlation operator. For each trading day, Stage 1 produces several outputs that are passed to Stage 2. In addition to the point forecast $\hat{r}_{i,t+1}$, the model generates a cross-sectional percentile score $q_{i,t} = \mathrm{rank}(\hat{r}_{i,t+1})/N$, where $N$ is the number of assets, as well as SHAP attributions aggregated at the feature-group level according to
$$\Phi_{i,g,t} = \sum_{j \in \mathcal{G}_g} \phi_{i,j,t},$$
where $\mathcal{G}_g$ is the set of features in group $g$ and $\phi_{i,j,t}$ is the TreeSHAP attribution of feature $j$ for asset $i$ on day $t$.
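As an aside, the cross-sectional IC used for model selection can be sketched in a few lines. The snippet below computes a single day's IC as the Pearson correlation between predicted and realized forward returns across assets; whether Pearson or rank correlation is used is an assumption here.

```python
# Minimal sketch of the cross-sectional information coefficient (IC) for one
# trading day: the correlation between predicted and realized forward returns
# taken across assets. Pearson correlation is an assumed choice.
from math import sqrt

def cross_sectional_ic(predicted, realized):
    n = len(predicted)
    mp = sum(predicted) / n
    mr = sum(realized) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(predicted, realized))
    sp = sqrt(sum((p - mp) ** 2 for p in predicted))
    sr = sqrt(sum((r - mr) ** 2 for r in realized))
    return cov / (sp * sr)
```

Averaging this quantity over validation days gives the criterion used to compare hyperparameter settings.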
Although the supervised learning target is one-day-ahead, portfolio decisions in Stage 2 are made at a 21-trading-day rebalancing frequency. To bridge this horizon mismatch and improve signal robustness, the daily Stage-1 predictions are further transformed into a medium-horizon signal by blending the one-day forecast with its rolling 21-day average, as detailed in
Section 3.4.2. This aggregation reduces high-frequency noise while preserving responsiveness to new information. The signals passed from Stage 1 to the policy-learning stage consist of the smoothed predictive signal, daily return forecasts, cross-sectional rankings, and SHAP-based group attributions.
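The horizon-bridging step can be sketched as below. The blend weight `alpha` is a hypothetical parameter introduced for illustration; Section 3.4.2 specifies the exact scheme.

```python
# Hedged sketch of the horizon-bridging transform: blend each day's one-day
# forecast with its trailing 21-day average. `alpha` is an assumed blend
# weight, not a value taken from the paper.

def blended_signal(daily_forecasts, window=21, alpha=0.5):
    """Return alpha * today's forecast + (1 - alpha) * trailing mean."""
    out = []
    for t, f in enumerate(daily_forecasts):
        lo = max(0, t - window + 1)
        rolling_mean = sum(daily_forecasts[lo:t + 1]) / (t + 1 - lo)
        out.append(alpha * f + (1 - alpha) * rolling_mean)
    return out
```

The rolling average damps day-to-day noise, while the one-day component keeps the signal responsive to new information.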
4.2.2. Stage 2: PPO-Based Portfolio Allocation
Stage 2 converts the daily predictive signals produced in Stage 1 into portfolio allocations using a Proximal Policy Optimization (PPO) agent. While the environment evolves at a daily frequency, portfolio rebalancing occurs every 21 trading days, corresponding to a monthly decision horizon. Rewards are issued only at rebalance dates and summarize portfolio performance over the preceding period.
At each rebalance time $t$, the agent observes a state vector constructed from the Stage-2 observation feature set described in Appendix B. The state combines asset-level predictive signals with cross-sectionally aggregated explanatory and market-context features, all computed using information available up to time $t$. The observation state can be written as
$$\mathbf{s}_t = \big[\,\hat{\mathbf{r}}_t,\ \boldsymbol{\sigma}_t,\ \mathbf{q}_t,\ \boldsymbol{\Phi}^{\mathrm{REL}}_t,\ \boldsymbol{\Phi}^{\mathrm{Std}}_t,\ \mathbf{c}_t,\ \mathbf{z}_t\,\big], \qquad \hat{\mathbf{r}}_t,\ \boldsymbol{\sigma}_t,\ \mathbf{q}_t \in \mathbb{R}^{N},$$
where $\mathbf{c}_t$ collects the cross-sectional context variables, $\mathbf{z}_t$ the reused Stage-1 input features, and $N$ denotes the number of assets in the investment universe.
The first block consists of the asset-level outputs of Stage 1: the one-day-ahead return forecasts $\hat{r}_{i,t}$, their historical volatilities $\sigma_{i,t}$, and their cross-sectional ranks $q_{i,t}$. These quantities preserve both the raw predictions and the relative ordering of expected returns across assets.
The second block contains cross-sectionally aggregated quantities already produced in Stage 1, including SHAP feature-group summaries and cross-sectional context variables. Specifically, $\boldsymbol{\Phi}^{\mathrm{REL}}_t$ denotes the vector of cross-sectionally normalized (relative) SHAP attributions across the feature groups selected by the Unified SHAP Feature Gate, corresponding to the variables *_REL in Appendix B. The quantity $\boldsymbol{\Phi}^{\mathrm{Std}}_t$ represents the dispersion of SHAP attributions across assets for each feature group, corresponding to the variables Std_SHAP_*. These quantities summarize the explanatory structure of the Stage-1 predictive models at the universe level by indicating which feature groups drive forecasts and how heterogeneous their influence is across assets. In addition, the observation includes cross-sectional context variables derived from Stage-1 outputs, namely the mean and standard deviation of predicted returns across assets, $\mu_t$ and $\sigma^{\mathrm{cs}}_t$, and the cross-sectional mean of asset-level volatility, $\bar{\sigma}_t$. Together, these aggregated variables describe the overall strength and dispersion of predictive signals and prevailing risk conditions across the universe at time $t$.
The final block reuses a subset of the Stage-1 input features themselves, i.e., the Fama–French factors and the asset-level technical and momentum indicators selected by the Unified SHAP Gate. These features provide the Stage-2 agent with additional context on market conditions and recent price dynamics.
The complete state vector is stacked over a fixed lookback window prior to each rebalance, allowing the policy to condition decisions on recent temporal dynamics at both the asset and market levels. Given this representation, the PPO policy learns to generate portfolio tilts relative to a model-based baseline allocation, which are subsequently mapped into feasible portfolio weights subject to long-only and diversification constraints.
Portfolio actions are defined as tilts around a static baseline allocation constructed from the predictive signals generated in Stage 1. The baseline provides a stable, model-driven reference portfolio, allowing the reinforcement learning agent to learn relative tilts rather than absolute allocations from scratch.
At each rebalance date $t$, as in Section 3.4.2, the daily predictions are first transformed into a medium-horizon signal $\tilde{r}_{i,t}$ by blending the one-day forecast with its rolling 21-day average. Let $\mu_t$ and $\sigma^{\mathrm{cs}}_t$ denote the cross-sectional average and standard deviation of the predicted returns at time $t$, and define the standardized scores
$$z_{i,t} = \frac{\tilde{r}_{i,t} - \mu_t}{\sigma^{\mathrm{cs}}_t}.$$
These standardized scores are then mapped to preliminary baseline weights through a linear tilt around the equal-weight portfolio with fixed intensity $\kappa$,
$$\tilde{w}^{\,b}_{i,t} = \frac{1}{N} + \kappa\, z_{i,t}.$$
The above raw weights are then mapped onto a long-only, fully invested feasible set, with individual asset bounds enforced and weights summing to 100%. This projection yields the baseline portfolio $\mathbf{w}^{b}_t$.
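The baseline construction can be sketched end to end as follows. The tilt intensity `kappa`, the per-asset bound `w_max`, and the clip-and-renormalize projection are illustrative assumptions; the paper's exact bounds and projection are not reproduced here.

```python
# Sketch of the baseline weights: z-score the blended signals cross-
# sectionally, tilt around equal weight with intensity kappa, then clip to
# [0, w_max] and renormalize. kappa, w_max, and the projection are assumed.
from math import sqrt

def baseline_weights(signals, kappa=0.02, w_max=0.25):
    n = len(signals)
    mu = sum(signals) / n
    sd = sqrt(sum((s - mu) ** 2 for s in signals) / n) or 1.0  # guard sd == 0
    raw = [1.0 / n + kappa * (s - mu) / sd for s in signals]
    clipped = [min(max(w, 0.0), w_max) for w in raw]
    total = sum(clipped)
    return [w / total for w in clipped]
```

Because the z-scores sum to zero, the raw tilted weights already sum to one; the projection only matters when a bound binds.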
At each rebalance date, the PPO policy outputs an action vector $\mathbf{a}_t \in \mathbb{R}^{N}$. To control the magnitude of portfolio adjustments and improve numerical stability, the raw action is transformed using a temperature-scaled hyperbolic tangent,
$$\tilde{\mathbf{a}}_t = \tanh\!\big(\mathbf{a}_t / \tau\big),$$
with temperature $\tau > 0$. The transformed action is interpreted as a vector of relative tilts around the baseline portfolio generated from the predictive signals in Stage 1, and the resulting tilted weights are projected back onto the feasible set $\mathcal{W}$ to enforce the long-only and budget constraints,
$$\mathbf{w}_t = \Pi_{\mathcal{W}}\!\big(\mathbf{w}^{b}_t + \delta\, \tilde{\mathbf{a}}_t\big),$$
where $\delta$ bounds the tilt size and $\Pi_{\mathcal{W}}$ denotes the projection onto $\mathcal{W}$.
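A minimal sketch of this action-to-weights mapping is shown below. The temperature `tau`, tilt scale `delta`, and the clip-and-renormalize projection are assumed illustrative choices.

```python
# Illustrative mapping from a raw PPO action to feasible weights: a
# temperature-scaled tanh squashes the action, the result tilts the baseline,
# and a clip-and-renormalize step (one simple projection) restores the
# long-only, fully invested constraints. tau and delta are assumptions.
from math import tanh

def action_to_weights(baseline, action, tau=1.0, delta=0.05):
    tilted = [w + delta * tanh(a / tau) for w, a in zip(baseline, action)]
    clipped = [max(w, 0.0) for w in tilted]
    total = sum(clipped)
    if total == 0.0:               # degenerate case: fall back to the baseline
        return list(baseline)
    return [w / total for w in clipped]
```

The tanh keeps each per-asset tilt within $\pm\delta$ regardless of how extreme the raw action is, which is the stability property the text describes.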
Between rebalance dates, the portfolio evolves passively according to standard self-financing dynamics,
$$w_{i,t+1} = \frac{w_{i,t}\,(1 + r_{i,t+1})}{\sum_{j=1}^{N} w_{j,t}\,(1 + r_{j,t+1})},$$
where $\mathbf{r}_{t+1}$ denotes the vector of realized asset returns. These are the drifted portfolio weights on non-rebalancing dates.
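The drift dynamics translate directly into code. This sketch assumes simple (not log) returns, consistent with the return definition in Stage 1.

```python
# Self-financing drift between rebalances: each weight grows with its asset's
# gross return and is renormalized by the portfolio's gross return.

def drift_weights(weights, returns):
    gross = [w * (1.0 + r) for w, r in zip(weights, returns)]
    total = sum(gross)
    return [g / total for g in gross]
```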
At each rebalance date, the agent receives a reward designed to balance return generation, downside risk control, trading costs, and diversification. The reward is specified as
$$R_t = r^{p}_{t} \;-\; \lambda_{\mathrm{CVaR}}\,\mathrm{CVaR}_{\alpha}(\mathcal{L}_t) \;-\; \lambda_{\mathrm{TO}}\,\big\|\mathbf{w}_t - \mathbf{w}_{t^-}\big\|_1 \;-\; \lambda_{\mathrm{HHI}}\sum_{i=1}^{N} w_{i,t}^2 \;-\; \mathrm{TC}_t,$$
where $r^{p}_{t}$ denotes the realized 21-day portfolio return and $\mathcal{L}_t$ represents the historical portfolio loss sequence used to compute Conditional Value-at-Risk (CVaR). CVaR is computed from the empirical loss distribution after a 12-month warm-up period. The penalty terms discourage excessive trading, portfolio concentration (measured by the Herfindahl–Hirschman Index), and proportional transaction costs.
Together, this reward structure encourages exploitation of predictive signals while explicitly controlling tail risk and implementation frictions, yielding a policy that is both risk-aware and practically implementable.
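A hedged sketch of this reward follows. The penalty coefficients and the exact functional forms (turnover as an L1 norm, HHI as the sum of squared weights, CVaR as the mean loss beyond the $\alpha$-quantile) are plausible choices consistent with the text, not the paper's verbatim specification.

```python
# Sketch of the rebalance-date reward under stated assumptions: all lambda
# coefficients, the cost rate, and the empirical CVaR estimator are
# illustrative, not the paper's calibrated values.

def cvar(losses, alpha=0.05):
    """Mean of the worst alpha-fraction of the empirical loss distribution."""
    tail = sorted(losses, reverse=True)
    k = max(1, int(len(tail) * alpha))
    return sum(tail[:k]) / k

def reward(port_return, losses, w_new, w_old,
           lam_cvar=0.5, lam_turnover=0.1, lam_hhi=0.1, cost_rate=0.001):
    turnover = sum(abs(a - b) for a, b in zip(w_new, w_old))
    hhi = sum(w * w for w in w_new)
    transaction_cost = cost_rate * turnover
    return (port_return
            - lam_cvar * cvar(losses)
            - lam_turnover * turnover
            - lam_hhi * hhi
            - transaction_cost)
```

Note that both the turnover penalty and the proportional transaction cost scale with the same L1 weight change, so the agent is doubly discouraged from large reallocations.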
4.2.3. Rolling-Window Training, Evaluation, and Prediction
We adopt a rolling forward training, evaluation, and prediction protocol to emulate realistic model deployment and to ensure a strictly out-of-sample process across both stages of the hierarchical framework. The full dataset spans January 1999 to November 2024. Stage-1 predictive models are trained using rolling windows consisting of a 12-year training period, followed by a 1-year validation period and a 1-year prediction period. Stage-2 policy learning is conducted using rolling windows consisting of 7 years of training data, followed by 1 year of validation and 1 year of out-of-sample testing. Portfolio performance reported in
Section 4.4,
Section 4.5 and
Section 4.6 corresponds exclusively to the aggregated out-of-sample test periods from these rolling evaluations. All cumulative return figures display only the out-of-sample performance intervals.
In Stage 1, asset-specific supervised learning models are used to generate one-day-ahead return forecasts. For each rolling window, the data are divided into a 12-year training period, a 1-year validation period, and a 1-year prediction period. Within each window, several candidate models are trained using only the training data under different hyperparameter settings. Early stopping is applied during training, and model selection is based on performance on the validation set.
After the best-performing model is identified, it is fixed and applied directly to the prediction period without further adjustment. This model produces next-day return forecasts together with the corresponding SHAP attributions. The above procedure is implemented sequentially over the entire sample, resulting in a single, consistent time series of predicted returns and SHAP-based signals. These outputs are then treated as fixed inputs for the downstream policy-learning stage.
In Stage 2, portfolio allocation is learned using a Proximal Policy Optimization (PPO) agent. To evaluate robustness and sensitivity to stochastic learning dynamics, the full rolling training–validation–testing procedure is repeated over 30 iterations spanning the entire data period. In each iteration, the environment is constructed using trading-day windows consisting of 7 years (1764 trading days) of training, followed by a 1-year (252 trading days) validation window and a 1-year (252 trading days) out-of-sample testing window.
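The Stage-2 split scheme can be sketched as a window generator. The 252-trading-days-per-year convention and the one-year step size are assumptions consistent with the 21-day month used throughout.

```python
# Sketch of the rolling train/validation/test splits, in trading days.
# 252 days per year and a one-year roll-forward step are assumed conventions.

def rolling_windows(n_days, train=7 * 252, val=252, test=252, step=252):
    """Yield (train, val, test) index ranges rolling forward through time."""
    start = 0
    while start + train + val + test <= n_days:
        yield (range(start, start + train),
               range(start + train, start + train + val),
               range(start + train + val, start + train + val + test))
        start += step
```

Each yielded triple is strictly chronological, which is what keeps the test windows out-of-sample.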
The agent observes a rolling lookback window of 21 trading days and rebalances the portfolio every 21 trading days, corresponding to a monthly decision frequency. The PPO agent thus observes daily state information while taking actions and receiving rewards at monthly intervals. This design reduces reward noise and aligns the learning problem with the portfolio rebalancing horizon.
PPO hyperparameters, including the learning rate, clipping range, CVaR risk-aversion coefficient, turnover penalty, and concentration (HHI) penalty, are selected via short PPO training runs fitted on the training window and evaluated on the validation window. Hyperparameter selection is performed only once using the first rolling period, and the chosen configuration is then held fixed throughout the entire reallocation process. To assess whether fixing PPO hyperparameters introduces regime dependence or temporal bias,
Section 4.5.3 conducts a structured sensitivity analysis over key parameters while holding the PPO architecture and optimization settings fixed. Candidate configurations are ranked by their average validation reward,
$$\bar{R}^{\mathrm{val}}(\theta) = \frac{1}{|\mathcal{T}_{\mathrm{val}}|} \sum_{t \in \mathcal{T}_{\mathrm{val}}} R_t(\theta),$$
and the configuration with the highest $\bar{R}^{\mathrm{val}}$ is retained.
Using the selected hyperparameters, the PPO agent is trained on the training window through incremental updates. During training, performance is evaluated repeatedly on the validation window, and early stopping is applied to avoid overfitting. After training, the PPO policy is fixed and applied to the out-of-sample test window without further parameter updates. The portfolio is rebalanced every 21 trading days and drifts between rebalancing dates. The corresponding wealth $W_t$ evolves according to
$$W_{t+1} = W_t\,\big(1 + r^{p}_{t+1}\big),$$
where $r^{p}_{t+1}$ denotes the realized portfolio return over the 21-day holding period.
To evaluate the performance of this model, we compute standard performance metrics, including compound annual growth rate (CAGR), annualized volatility, return over volatility ratio, Sortino ratio, maximum drawdown, and CVaR at the 5% level. This evaluation is repeated over 30 rolling PPO iterations, each corresponding to a distinct training–validation–testing window and an independent policy training run. We report both aggregate statistics (e.g., mean or median across runs) and run-level win ratios to quantify the stability of the learned allocation policy. The collection of out-of-sample results across random initializations provides evidence on robustness and parameter sensitivity under realistic stochastic training dynamics.
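Two of the reported metrics can be sketched from a series of 21-day portfolio returns. The 12-periods-per-year annualization follows the monthly rebalancing convention; the remaining metrics (Sortino, CVaR, etc.) follow analogously.

```python
# Minimal sketches of two reported metrics, computed from 21-day portfolio
# returns: CAGR from compounded wealth, and maximum drawdown from the
# running wealth peak. 12 periods per year matches monthly rebalancing.

def cagr(period_returns, periods_per_year=12):
    wealth = 1.0
    for r in period_returns:
        wealth *= 1.0 + r
    years = len(period_returns) / periods_per_year
    return wealth ** (1.0 / years) - 1.0

def max_drawdown(period_returns):
    wealth, peak, mdd = 1.0, 1.0, 0.0
    for r in period_returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        mdd = max(mdd, 1.0 - wealth / peak)
    return mdd
```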
Figure 2 provides a workflow overview of the rolling training, validation, and out-of-sample testing design, which is applied consistently across both stages.