Human–AI Collaboration in Risk- and Uncertainty-Aware Portfolio Reinforcement Learning: A Critical Review

Khemlichi, Firdaous; Idrissi Khamlichi, Youness; Elhaj Ben Ali, Safae

doi:10.3390/info17050476

Open Access Systematic Review


by Firdaous Khemlichi ^1,2,*, Youness Idrissi Khamlichi ^1 and Safae Elhaj Ben Ali ^1

^1 Sciences and Engineering Research Laboratory, Faculty of Sciences and Technology, Sidi Mohamed Ben Abdellah University, Imouzer Road, P.O. Box 2202, Fez 30000, Morocco

^2 Department of Mathematics and Computer Science, École Marocaine des Sciences de l’Ingénieur (EMSI), Fez 30050, Morocco

^* Author to whom correspondence should be addressed.

Information 2026, 17(5), 476; https://doi.org/10.3390/info17050476

Submission received: 8 April 2026 / Revised: 2 May 2026 / Accepted: 8 May 2026 / Published: 13 May 2026

(This article belongs to the Special Issue Decision Models for Economics and Business Management)


Abstract

Financial markets are characterized by non-stationarity, regime shifts, and complex cross-asset interactions, which challenge traditional portfolio optimization and motivate reinforcement learning (RL) for adaptive decision-making. However, many RL-based approaches remain predominantly return-centric, with risk, uncertainty, and human oversight only weakly integrated, limiting robustness and practical applicability. This review provides a critical synthesis of risk-aware and uncertainty-sensitive reinforcement learning for portfolio optimization from a human–AI collaboration perspective. We analyze major architectural paradigms (including single-agent, hierarchical, multi-agent, and modular systems) together with risk modeling strategies (e.g., reward shaping, constraint-based optimization, and downside risk measures such as CVaR) and probabilistic approaches to uncertainty estimation (e.g., Bayesian neural networks, Monte Carlo dropout, and ensembles). A structured analysis of 57 fully assessed studies, 55 of which could be coded for both risk and uncertainty, reveals that only 5 of these 55 (9%) explicitly couple uncertainty estimation with risk constraint mechanisms, while 38 (69%) treat risk and uncertainty as structurally independent components. We identify a central structural limitation: risk objectives are rarely conditioned on epistemic uncertainty, while uncertainty estimates seldom influence constraint mechanisms or capital allocation. This decoupling leads to fragmented frameworks that remain difficult to deploy in real financial environments. By integrating architectural design, risk modeling, uncertainty estimation, and evaluation practices, this review proposes a unified, deployment-oriented perspective for developing governance-aligned portfolio decision-support systems.

Keywords:

reinforcement learning; portfolio optimization; risk-aware decision-making; uncertainty quantification; human–AI collaboration; AI-augmented decision support; trustworthy AI

1. Introduction

Financial markets are characterized by structural non-stationarity, regime shifts, and evolving cross-asset dependencies [1] that challenge the assumptions underlying traditional portfolio optimization—namely stationarity, linearity, and stable risk estimation. Although classical frameworks such as mean-variance optimization and risk parity remain foundational, they are inherently static and highly sensitive to estimation error [2], particularly during periods of market stress.

Advances in data availability and computational power have accelerated the adoption of machine learning in finance. However, many predictive models remain disconnected from sequential decision-making. Portfolio allocation is not merely a forecasting task but a dynamic control problem in which actions shape future exposures, transaction costs, and risk dynamics. Reinforcement learning (RL) addresses this limitation [3,4] by formulating portfolio management as sequential optimization under uncertainty.

Despite significant progress, most RL-based portfolio systems still treat risk regulation, uncertainty estimation, and system design as only partially coordinated components. Risk constraints are typically specified independently of confidence in model estimates [4], uncertainty signals rarely influence allocation decisions, and architectural paradigms—such as hierarchical, multi-agent, or modular systems—often lack mechanisms for propagating reliability across decision layers. As a result, models that perform well in backtesting may exhibit fragile behavior under distributional shifts or institutional constraints [5].

In practice, portfolio management is inherently collaborative. Decision-making authority is distributed across portfolio managers, risk committees, and regulatory frameworks [6], positioning RL systems as decision-support tools rather than fully autonomous optimizers. Effective real-world use therefore requires calibrated communication of confidence, interpretable risk controls, and explicit escalation mechanisms.

This review adopts a unified, deployment-oriented perspective on reinforcement learning for portfolio optimization. Rather than cataloging algorithms, it examines how system design, risk modeling, uncertainty estimation, and evaluation practices jointly shape robustness, interpretability, and governance alignment.

Our contributions are fourfold. First, we introduce a unified taxonomy linking RL architectures, risk control mechanisms, uncertainty modeling strategies, and human–AI interaction modes. Second, we identify key design-level misalignments that limit practical applicability. Third, we analyze modular and multi-modal architectures as mechanisms for integrating heterogeneous signals while improving interpretability. Fourth, we propose evaluation principles that extend beyond backtesting to include regime robustness, cost realism, calibration quality, and governance compatibility.

Beyond synthesizing the existing literature, this review proposes a unified conceptual framework that explicitly integrates risk control, epistemic uncertainty, and governance mechanisms within portfolio reinforcement learning systems.

2. Human–AI Collaboration in Portfolio Reinforcement Learning

In institutional financial environments, reinforcement learning systems are rarely deployed as fully autonomous decision-makers. Instead, they function as AI-augmented decision-support tools embedded within governance structures involving portfolio managers, risk officers, compliance teams, and investment committees. The degree of autonomy depends on regulatory constraints, organizational risk tolerance, and the maturity of the AI infrastructure [4,6].

The literature reveals four recurring modes of human–AI collaboration in portfolio reinforcement learning. These modes provide the conceptual basis for the governance layer later formalized in the unified architecture proposed in Section 9.6.

Advisory Mode. The RL agent generates allocation recommendations or risk diagnostics subject to human approval prior to execution, acting primarily as an analytical assistant.
Constraint-Guided Mode. Human-defined risk limits, leverage caps, or regulatory rules shape the feasible action space or objective function. While governance boundaries are explicit, constraint enforcement typically remains static and rarely adapts to model confidence.
Uncertainty-Aware Escalation Mode. Systems communicate confidence estimates that trigger human intervention when predefined thresholds are exceeded, although such mechanisms often remain heuristic and imperfectly calibrated.
Shared-Control Mode. Strategic decisions—such as risk budgeting and allocation regimes—remain under human supervision, while tactical execution is delegated to RL agents. However, reliability signals are not consistently propagated across decision levels.

These four modes echo classifications proposed in the broader human–AI teaming literature [7].

Across these modes, a common limitation emerges: uncertainty is often communicated but not operationalized. Confidence estimates rarely inform risk constraints, allocation intensity, or escalation policies in a systematic manner, resulting in loosely coupled rather than integrated decision processes.

Effective human–AI collaboration in portfolio RL therefore requires architectures in which uncertainty directly informs risk control and supervisory intervention, enabling more coherent and governance-aligned decision-making. This structural requirement motivates the unified framework developed in Section 9.6, where each of the four collaboration modes is explicitly mapped to a governance layer component.
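As an illustration of what operationalizing uncertainty could mean for the Uncertainty-Aware Escalation Mode, the following minimal Python sketch maps an uncertainty score to one of three outcomes and to an exposure scaling factor. The class names, the two thresholds, and the linear taper are hypothetical choices made for this example; they are not taken from any reviewed study.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    EXECUTE = "execute"          # agent trades autonomously
    REDUCE_EXPOSURE = "reduce"   # trade, but with a scaled-down position
    ESCALATE = "escalate"        # hand the decision to a human reviewer


@dataclass(frozen=True)
class EscalationPolicy:
    # Hypothetical thresholds on an uncertainty score in [0, 1].
    reduce_above: float = 0.3
    escalate_above: float = 0.6

    def decide(self, uncertainty: float) -> Action:
        if uncertainty > self.escalate_above:
            return Action.ESCALATE
        if uncertainty > self.reduce_above:
            return Action.REDUCE_EXPOSURE
        return Action.EXECUTE

    def exposure_scale(self, uncertainty: float) -> float:
        """Fraction of the proposed position that is actually taken."""
        action = self.decide(uncertainty)
        if action is Action.ESCALATE:
            return 0.0  # nothing is executed until a human approves
        if action is Action.REDUCE_EXPOSURE:
            # Linear taper from 1.0 at reduce_above to 0.0 at escalate_above.
            span = self.escalate_above - self.reduce_above
            return 1.0 - (uncertainty - self.reduce_above) / span
        return 1.0


policy = EscalationPolicy()
```

In this sketch the same signal drives both the supervisory decision and the allocation intensity, which is the kind of coupling the section argues is usually missing. Whether fixed thresholds are appropriate depends on how well the uncertainty score is calibrated.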

3. Methodological Approach

3.1. Review Design and Research Questions

This study adopts a structured critical review methodology following PRISMA-inspired guidelines. The objective is to provide a conceptual integration across three interdependent dimensions—risk modeling, uncertainty estimation, and architectural design—within human–AI collaborative settings in portfolio reinforcement learning.

Three explicit research questions guide the review:

RQ1—Architectural paradigms: What architectural paradigms have been adopted in portfolio reinforcement learning, and how do they differ in their structural capacity to integrate risk control and uncertainty estimation?

RQ2—Risk–uncertainty coupling: To what extent do existing portfolio RL systems explicitly couple epistemic uncertainty estimation with risk constraint mechanisms, and what structural limitations result from their decoupling?

RQ3—Deployment and governance: What evaluation protocols and governance mechanisms are required for portfolio RL systems to be credibly deployable in real institutional environments?

RQ1 directly motivates the architectural taxonomy in Section 4. RQ2 structures the empirical analysis of Section 5 and Section 6, culminating in Table 1. RQ3 frames Section 8 and Section 9 and the unified architecture of Section 9.6.

Existing surveys on RL for portfolio optimization—including [4,6]—provide valuable overviews of algorithmic approaches but do not systematically examine the structural relationship between risk modeling and uncertainty quantification. The present review fills this gap by introducing a structured coding framework across six dimensions applied to 57 peer-reviewed studies, selected from an initial pool of 434 identified records through a rigorous PRISMA-inspired screening process. This corpus represents the methodologically relevant and fully assessable literature published between 2016 and 2025 across six major databases. By quantifying for the first time the degree of risk-uncertainty coupling across this corpus, this review produces an original empirical finding—that only 9% of studies achieve explicit coupling. This analytical contribution distinguishes the present work from descriptive surveys and positions it as a critical synthesis grounded in systematic empirical evidence.

3.2. Search Strategy and Database Coverage

A systematic literature search was conducted over January 2016 to March 2025 across six databases: Web of Science, Scopus, IEEE Xplore, ACM Digital Library, ScienceDirect, and arXiv. The search was supplemented by backward and forward citation tracking applied to high-impact contributions including [4,6].

The primary search string was:

(reinforcement learning OR deep reinforcement learning OR RL OR DRL) AND (portfolio optimization OR portfolio management OR asset allocation) AND (risk OR uncertainty OR CVaR OR drawdown OR Bayesian OR epistemic)

A secondary string targeted specific architectures: (hierarchical reinforcement learning OR multi-agent reinforcement learning OR MARL OR HRL) AND (finance OR portfolio OR trading). Searches were restricted to English-language peer-reviewed publications and methodologically rigorous preprints. IEEE, ACM, NeurIPS, ICML, and ICLR proceedings were included. The completed PRISMA 2020 Checklist is provided in the Supplementary Materials.

3.3. Inclusion and Exclusion Criteria

Table 2 presents the inclusion and exclusion criteria applied at each stage of the literature selection process.

3.4. Selection Process and Study Flow

The selection followed four sequential stages (Figure 1). The primary and secondary searches yielded 412 records after within-database deduplication. Citation tracking added 22 further records, for a total of 434 identified records. Removal of 120 cross-database duplicates left 314 unique records for screening. Title and abstract screening retained 146 studies for full-text assessment, excluding 168 records (prediction-only scope: n = 112; insufficient content: n = 56). Full-text assessment excluded a further 80 studies (prediction-only without decision layer: n = 34; insufficient methodological detail: n = 28; out of scope: n = 18). An additional 9 studies were excluded at the final quality check stage (non-indexed or predatory venues). The final corpus comprises 57 peer-reviewed and methodologically relevant studies.
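The study flow above can be checked with simple arithmetic. The short Python sketch below recomputes each stage from the counts stated in the text; the variable names are illustrative only.

```python
# Counts taken from the selection narrative above.
database_records = 412
citation_tracking = 22
identified = database_records + citation_tracking          # records identified

after_dedup = identified - 120                              # unique records screened
screening_excluded = 112 + 56                               # prediction-only + insufficient content
full_text_assessed = after_dedup - screening_excluded       # studies read in full
full_text_excluded = 34 + 28 + 18                           # three full-text exclusion reasons
after_full_text = full_text_assessed - full_text_excluded   # studies passing full-text review
final_corpus = after_full_text - 9                          # after the final quality check
```

Each intermediate value matches the figure reported in the text (434, 314, 146, and 57).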

3.5. Data Extraction and Coding Framework

Each of the 57 selected studies was coded along six dimensions using a structured extraction instrument. Of these 57 studies, 55 provided sufficient methodological detail to code both Dimension 2 (risk modeling strategy) and Dimension 3 (uncertainty estimation method) simultaneously; these 55 constitute the analytical basis for the coupling statistics reported throughout this review. The six coding dimensions are:

RL paradigm—single-agent, deep RL, hierarchical RL, multi-agent RL, or modular/hybrid, based on the architectural description in the original paper.
Risk modeling strategy—reward shaping, soft constraint, hard constraint (CMDP), CVaR/distributional, drawdown-based, or none; multiple codes allowed.
Uncertainty estimation method—none, implicit (ensemble disagreement without explicit propagation), Bayesian neural network, Monte Carlo dropout [22], deep ensemble [23], distributional (aleatoric only), or joint (aleatoric + epistemic).
Risk–uncertainty coupling level—the central variable for RQ2, coded as: Non-coupled (risk and uncertainty treated independently), Partial (uncertainty present but not operationalized in risk constraints), or Explicit (uncertainty directly conditions risk budgeting, constraint thresholds, or allocation intensity).
Signal integration—price-only, multi-modal (sentiment [24], macro, graph [25]), or modular fusion.
Evaluation rigor—four sub-dimensions: transaction cost inclusion, regime-based evaluation, multi-seed reporting, and out-of-sample validation (each coded yes/no).

The full extraction dataset for all 57 studies is available from the corresponding author upon reasonable request. Dimension 4 constitutes the primary analytical variable directly addressing RQ2. Its distribution across the 55 studies for which both risk and uncertainty information were extractable is reported in Table 1.
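To make the role of Dimension 4 concrete, the sketch below shows one way a coded record and the coupling shares could be represented in Python. The record type, field names, and placeholder identifiers are hypothetical and do not reproduce the authors' extraction instrument. The Explicit (5) and Non-coupled (38) counts come from the text; the Partial count of 12 is inferred here as the remainder of the 55 studies, on the assumption that the three levels are exhaustive.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Coupling(Enum):
    NON_COUPLED = "non-coupled"
    PARTIAL = "partial"
    EXPLICIT = "explicit"


@dataclass(frozen=True)
class CodedStudy:
    study_id: str        # hypothetical identifier
    coupling: Coupling   # Dimension 4: risk-uncertainty coupling level


def coupling_shares(studies):
    """Return {Coupling: share of studies in whole percent}."""
    counts = Counter(s.coupling for s in studies)
    total = len(studies)
    return {level: round(100 * counts[level] / total) for level in Coupling}


# Placeholder records reproducing the reported counts (not the real dataset).
studies = (
    [CodedStudy(f"E{i}", Coupling.EXPLICIT) for i in range(5)]
    + [CodedStudy(f"N{i}", Coupling.NON_COUPLED) for i in range(38)]
    + [CodedStudy(f"P{i}", Coupling.PARTIAL) for i in range(12)]  # inferred remainder
)
shares = coupling_shares(studies)
```

With 55 studies as the denominator, the shares round to 9% explicit and 69% non-coupled, consistent with the percentages quoted in this review.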

3.6. Positioning Relative to Existing Reviews

Hambly et al. [4] provide a rigorous mathematical treatment of RL-based financial decision-making but do not assess risk–uncertainty coupling or governance requirements. Fischer [6] surveys DRL approaches for trading focused on algorithmic performance, without addressing epistemic uncertainty integration. Neither review examines human–AI collaboration modes or deployment-oriented governance.

The present review fills this gap along three dimensions: (1) it introduces an explicit taxonomy of risk–uncertainty coupling and quantifies its prevalence empirically across 55 studies, yielding the finding that only 9% achieve explicit coupling; (2) it integrates human–AI collaboration as a structural analytical dimension; (3) it proposes a governance-aware unified architectural framework derived directly from the identified gaps. These contributions directly correspond to RQ1–RQ3.

4. Architectures of Reinforcement Learning for Portfolio Optimization

Reinforcement learning formulates portfolio allocation as a sequential decision process in which policies map market states to portfolio weights, enabling adaptive responses to evolving market conditions [26]. It has been widely applied to asset allocation and portfolio rebalancing [27], with extensions incorporating behavioral principles such as loss aversion and overconfidence [28]. At the same time, the suitability of standard RL formulations for financial backtesting has been questioned, particularly when state transitions are effectively deterministic [8].
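A minimal sketch of this formulation, assuming a long-only portfolio and proportional transaction costs, is given below in Python. The function names, the softmax mapping, and the default cost rate are illustrative assumptions rather than a description of any specific reviewed system.

```python
import math


def softmax_weights(scores):
    """Map unconstrained policy outputs to long-only weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def step_reward(prev_weights, new_weights, asset_returns, cost_rate=0.001):
    """One-period portfolio return net of proportional transaction costs."""
    gross = sum(w * r for w, r in zip(new_weights, asset_returns))
    turnover = sum(abs(n - p) for n, p in zip(new_weights, prev_weights))
    return gross - cost_rate * turnover
```

For example, `step_reward([0.5, 0.5], [1.0, 0.0], [0.02, -0.01])` charges the cost rate on a turnover of 1.0 and subtracts it from the 2% gross return. The action taken today changes tomorrow's starting weights, which is what makes the problem sequential rather than a pure forecasting task.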

Beyond algorithmic choices, the effectiveness of RL in portfolio optimization is strongly influenced by system organization. Existing approaches range from centralized to hierarchical, multi-agent, and modular systems [29], each influencing scalability, interpretability, information flow, and compatibility with human oversight. While many studies emphasize algorithmic comparisons (e.g., PPO vs. DDPG), structural design remains central to how risk exposure and uncertainty signals are handled [30].

To capture these distinctions, we introduce a unified analytical taxonomy linking RL architectures with risk modeling, uncertainty handling, and governance considerations, summarized in Table 3 and Table 4, and highlighting design-level misalignments that limit real-world applicability.

The suitability of each architectural paradigm varies significantly depending on prevailing market conditions. In high-volatility regimes—such as crisis periods or earnings announcements—hierarchical RL architectures are particularly appropriate, as their decomposed decision layers allow macro-level risk budgeting to override tactical allocation signals when uncertainty spikes. In low-liquidity environments—such as emerging markets or small-cap assets—modular architectures offer an advantage, as independent modules can incorporate liquidity-adjusted transaction cost penalties without disrupting the core policy. Multi-agent architectures are best suited to diversified cross-asset portfolios, where specialized agents can simultaneously monitor sector-specific dynamics. Finally, single-agent approaches remain competitive in stable, high-liquidity markets such as large-cap equity indices, where computational efficiency outweighs the need for architectural complexity.

4.1. Single-Agent Reinforcement Learning

Single-agent formulations represent the most centralized paradigm, where a single policy maps market states directly to portfolio allocations. While computationally efficient, this monolithic structure entangles signal extraction, risk management, and allocation within a unified representation.

This tight coupling reduces interpretability and limits explicit coordination between risk regulation and uncertainty handling. Adaptation occurs implicitly within the policy, often leading to fragile behavior under regime shifts. From a governance standpoint, such architectures provide limited transparency and restricted intervention capabilities.

4.2. Deep Reinforcement Learning (DRL)

Early RL approaches rely on value-based methods such as Q-learning [31,32]. Deep reinforcement learning (DRL) extends these formulations through function approximation, enabling high-dimensional representations and continuous control. Algorithms such as PPO [33], DDPG [34], and SAC [35] are widely adopted due to their stability and scalability.

DRL has been extensively applied to portfolio management, supporting adaptive allocation and dynamic rebalancing under evolving market conditions [36,37,38,39,40,41]. Recurrent architectures further capture temporal dependencies in financial time series [9,42,43], often within single-agent frameworks based on market features such as prices, volumes, and technical indicators [14,44].

Despite these advances, most DRL systems retain centralized, end-to-end structures in which signal processing, risk regulation, and allocation decisions remain implicitly coupled. This often favors performance optimization over robustness, while training instability, reward sensitivity, and limited confidence calibration persist.

4.3. Hierarchical Reinforcement Learning (HRL)

Hierarchical reinforcement learning decomposes decision-making into multiple levels, typically separating strategic planning from tactical execution [10,45,46]. This structure aligns with institutional settings where risk budgeting and execution operate at different temporal scales.

While improving scalability and interpretability, HRL does not inherently ensure consistent propagation of uncertainty signals. Lower-level confidence estimates are rarely integrated into higher-level decisions, limiting their effectiveness for coordinated risk management.
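One way the missing upward propagation could look is sketched below in Python: the strategic level shrinks its exposure budget as the uncertainty reported by the tactical level rises, and the tactical level allocates within that budget. Both functions, the inverse scaling rule, and the `sensitivity` parameter are hypothetical and serve only to illustrate the idea.

```python
def strategic_risk_budget(base_budget, tactical_uncertainty, sensitivity=1.0):
    """Upper level: shrink the exposure budget as lower-level uncertainty rises."""
    return base_budget / (1.0 + sensitivity * tactical_uncertainty)


def tactical_allocation(signal_scores, risk_budget):
    """Lower level: split the budget across assets in proportion to positive scores."""
    positive = [max(s, 0.0) for s in signal_scores]
    total = sum(positive)
    if total == 0.0:
        return [0.0 for _ in signal_scores]  # no conviction: stay in cash
    return [risk_budget * p / total for p in positive]


budget = strategic_risk_budget(base_budget=1.0, tactical_uncertainty=1.0)
allocation = tactical_allocation([1.0, 3.0, -2.0], budget)
```

Here an uncertainty score of 1.0 halves the budget, so total exposure falls to 0.5 regardless of how strong the tactical signals are.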

4.4. Multi-Agent Reinforcement Learning (MARL)

Multi-agent reinforcement learning distributes decision-making across specialized agents responsible for assets, sectors, or functions, and can model interactions between heterogeneous market participants [47,48].

In portfolio optimization, MARL systems combine agents for trading, risk assessment, or market observation [15,49], with extensions including centralized training with decentralized execution [50], temporal specialization [51], and role-based or asset-specific agent design [52,53,54,55,56]. Value decomposition and role-aware mechanisms further support coordination [57,58]. Knowledge distillation approaches have also been explored to improve multi-agent coordination efficiency in portfolio management [59].

While MARL improves scalability and functional decomposition, it introduces coordination complexity and endogenous non-stationarity. Aggregating confidence across agents remains challenging, often resulting in fragmented risk perception and unstable portfolio-level behavior.
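To illustrate the aggregation problem, the Python sketch below combines per-agent weight proposals by precision (inverse-variance) weighting. This is a generic statistical device used here as an example, not a method attributed to the cited MARL studies, and it assumes that agent errors are independent, which is unlikely to hold when agents observe the same market.

```python
def aggregate_proposals(proposals, variances, floor=1e-8):
    """Precision-weighted average of per-agent weight vectors.

    proposals: list of weight vectors, one per agent (equal length).
    variances: one uncertainty estimate per agent (hypothetical input).
    """
    precisions = [1.0 / max(v, floor) for v in variances]
    total_precision = sum(precisions)
    n_assets = len(proposals[0])
    combined = [
        sum(p * prop[i] for p, prop in zip(precisions, proposals)) / total_precision
        for i in range(n_assets)
    ]
    # Variance of the combined estimate if agent errors were independent.
    combined_variance = 1.0 / total_precision
    return combined, combined_variance


combined, combined_variance = aggregate_proposals(
    proposals=[[1.0, 0.0], [0.0, 1.0]],
    variances=[1.0, 3.0],
)
```

The more confident agent dominates the combined allocation. If the independence assumption fails, the combined variance is understated, which is one route to the fragmented risk perception described above.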

4.5. Modular and Hybrid Architectures

Modular architectures explicitly separate signal processing, risk estimation, and allocation control into distinct components. Recent approaches integrate specialized modules and diverse learning algorithms to improve robustness in heterogeneous environments [16,60]. Hybrid designs combine deep learning feature extractors with RL decision layers [61], while ensemble strategies aggregate multiple agents to enhance stability [62].

When combined with dynamic fusion mechanisms, such architectures improve interpretability, enable localized confidence estimation, and support extensibility. However, modularity alone does not guarantee coordination. Risk constraints and uncertainty estimates are often not formally linked, and adaptive signal fusion may remain disconnected from allocation decisions.

Across architectures, three recurring design-level gaps emerge: (i) risk constraints are rarely conditioned on epistemic uncertainty; (ii) signal integration is not consistently aligned with explicit risk budgeting; (iii) evaluation protocols insufficiently assess robustness, calibration, and real-world applicability.

These observations indicate that structural design—rather than algorithmic refinement alone—is central to developing deployment-ready reinforcement learning systems.

5. Risk-Aware Reinforcement Learning

Building on the architectural taxonomy, this section examines risk modeling strategies in portfolio reinforcement learning. Existing approaches range from reward shaping and constraint-based optimization to CVaR-based formulations and dynamic risk budgeting. However, these mechanisms remain weakly connected to epistemic uncertainty, resulting in only partially integrated risk management frameworks.

A central challenge lies in reconciling the expected-return objective of RL with the asymmetric nature of financial risk. Standard RL assumes stable return distributions and symmetric preferences—assumptions frequently violated in practice. Recent work addresses this limitation through downside risk measures, risk-sensitive objectives, and multi-objective formulations incorporating volatility, drawdown, and tail-risk considerations [11,63,64,65,66].

From a governance standpoint, risk-aware formulations also serve as control mechanisms by enabling interpretability and constraint enforcement. However, their limited interaction with uncertainty signals constrains their effectiveness in real-world decision settings.

5.1. Reward Shaping and Risk-Sensitive Objectives

Reward shaping incorporates risk by augmenting return-based objectives with penalties related to volatility, turnover, transaction costs, or drawdowns [13,67,68,69]. This approach is flexible and readily integrated into standard RL frameworks.
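A minimal Python sketch of such a composite reward is shown below. The function name and the penalty weights are hypothetical defaults chosen for illustration; in practice they would need calibration.

```python
def shaped_reward(portfolio_return, volatility, turnover, drawdown,
                  w_vol=0.5, w_turnover=0.001, w_drawdown=1.0):
    """Return-based reward minus weighted risk and cost penalties."""
    penalty = w_vol * volatility + w_turnover * turnover + w_drawdown * drawdown
    return portfolio_return - penalty
```

With these defaults, `shaped_reward(0.01, 0.02, 0.5, 0.0)` is slightly negative even though the period return is positive, while setting `w_vol=0.0` makes it positive. The sign of the learning signal therefore depends on the weights as much as on the market outcome.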

However, it introduces structural fragility. Composite reward functions combine heterogeneous objectives whose weights require careful calibration. Mis-specification can induce unintended behaviors, ranging from excessive conservatism to uncontrolled risk exposure.

Moreover, reward shaping relies on historical estimates and implicitly assumes stable conditions. Under regime shifts, these estimates become unreliable, leading to degraded out-of-sample performance. It also provides limited control over extreme outcomes and may obscure implicit risk preferences embedded in the reward design.

5.2. Hard and Soft Constraints in Portfolio RL

Constraint-based approaches offer a more principled alternative by formulating portfolio optimization as a constrained Markov Decision Process (CMDP), where expected return is maximized under explicit risk limits [12].

Soft constraints penalize violations, while hard constraints enforce strict feasibility. Although the latter aligns more closely with regulatory requirements, it increases optimization complexity and typically requires specialized solution methods.
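The difference between the two can be sketched in Python as follows. The soft version subtracts a penalty proportional to the violation, while the hard version rescales the action so that a gross leverage cap is never exceeded. Function names, the penalty multiplier, and the choice of a leverage cap as the constraint are assumptions made for this example.

```python
def soft_constraint_reward(reward, risk_cost, risk_limit, multiplier=10.0):
    """Soft constraint: subtract a penalty proportional to the violation."""
    violation = max(0.0, risk_cost - risk_limit)
    return reward - multiplier * violation


def enforce_leverage_cap(weights, max_gross):
    """Hard constraint: rescale weights so gross exposure never exceeds the cap."""
    gross = sum(abs(w) for w in weights)
    if gross <= max_gross:
        return list(weights)
    scale = max_gross / gross
    return [w * scale for w in weights]
```

The soft version still allows a violating action if the reward is large enough, whereas the hard version makes the violation impossible by construction. Neither version, as written, adapts its limit to model confidence.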

From a governance perspective, constraint-based RL enables direct incorporation of risk budgets. However, constraints are generally treated as static and rarely adjust to variations in model confidence. Hybrid approaches—soft constraints during training and hard constraints at execution—are common but may introduce inconsistencies between training conditions and practical use.

5.3. Downside Risk Measures: CVaR and Drawdown-Aware RL

Downside risk measures such as drawdown and CVaR explicitly target tail losses, addressing limitations of variance-based metrics. CVaR-based and distributional RL approaches enable optimization over adverse outcomes and improve sensitivity to extreme events [17,70].

These methods often yield smoother performance and improved resilience under stress. However, estimation remains challenging due to the scarcity of tail events. In practice, they rely on historical distributions and therefore provide limited insight into model uncertainty.

Consequently, tail-risk control remains largely retrospective, capturing observed variability rather than uncertainty in the reliability of the underlying estimates.
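For reference, a simple historical estimator of CVaR can be sketched in Python as below. It treats losses as negative returns and averages the worst (1 - alpha) share of observations. The function name and the rounding guard are choices made for this example, and other estimators (for instance interpolated or parametric ones) exist.

```python
import math


def empirical_cvar(returns, alpha=0.95):
    """Average loss over the worst (1 - alpha) share of observed outcomes.

    Losses are negative returns, so a larger value means a worse tail.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    losses = sorted((-r for r in returns), reverse=True)  # worst loss first
    # Round before ceil to avoid floating-point artifacts such as 5.000000000000004.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(losses), 9)))
    tail = losses[:tail_size]
    return sum(tail) / tail_size


sample = [-0.10, -0.05, 0.0, 0.01, 0.02, 0.03, 0.01, 0.02, 0.0, 0.04]
cvar_80 = empirical_cvar(sample, alpha=0.80)
cvar_95 = empirical_cvar(sample, alpha=0.95)
```

With only ten observations, the 95% estimate rests on a single data point, which illustrates the tail-scarcity problem noted above. The estimator also says nothing about how reliable the historical sample is as a guide to future conditions.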

5.4. Distributional Reinforcement Learning

Distributional RL [71] models the full return distribution rather than its expectation, enabling direct estimation of risk measures such as VaR and CVaR and supporting more informative decision-making.

However, it primarily captures aleatoric uncertainty and does not address epistemic uncertainty related to model confidence. Without complementary mechanisms such as Bayesian or ensemble approaches, it may create an appearance of calibrated risk control while leaving structural uncertainty unaddressed.

5.5. Structural Limitations of Current Risk-Aware RL Approaches

A key limitation of current approaches is that risk control mechanisms are typically designed under the implicit assumption of reliable return estimates. As a result, they do not adapt to variations in predictive confidence and may remain unchanged even when uncertainty increases.

This limitation is further amplified by simplified evaluation settings, which can overstate robustness by failing to capture regime shifts and structural instability. From a decision-support perspective, effective systems must therefore account not only for realized risk but also for the credibility of the underlying estimates.

5.6. Formal Perspective on Risk-Uncertainty Decoupling

Formally, the portfolio RL problem is defined as a Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $s_t \in \mathcal{S}$ represents market features, $a_t \in \mathcal{A}$ denotes portfolio weights, and the transition function $P(s_{t+1} \mid s_t, a_t)$ captures market dynamics.

Classical reinforcement learning maximizes expected return:

$$\max_{\pi} \; \mathbb{E}_{\pi}[R_t]. \tag{1}$$

Risk-aware formulations extend this objective using measures such as CVaR:

$$\max_{\pi} \; \mathbb{E}_{\pi}[R_t] - \lambda \, \mathrm{CVaR}_{\alpha}(R_t), \tag{2}$$

where λ controls risk aversion.

In most implementations, λ is fixed, implicitly assuming stable estimation of return distributions and ignoring epistemic uncertainty.

A more integrated formulation introduces uncertainty-dependent risk aversion:

$$\max_{\pi} \; \mathbb{E}_{\pi}[R_t] - \lambda(U_t) \, \mathrm{CVaR}_{\alpha}(R_t), \tag{3}$$

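where $U_t$ denotes the uncertainty estimate at time $t$ and $\lambda(\cdot)$ is increasing, so that higher uncertainty increases risk aversion. Similarly, the threshold $\kappa$ on the expected constraint cost $C_t$ can be adapted: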

$$\mathbb{E}_{\pi}[C_t] \leq \kappa(U_t). \tag{4}$$

Such formulations allow risk exposure to decrease as uncertainty increases, linking confidence signals directly to decision-making.

This perspective formalizes the need to explicitly couple risk regulation and uncertainty within the optimization process.
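A minimal Python sketch of Equations (3) and (4) is given below. The formulation above does not fix the functional forms of λ(U_t) and κ(U_t), so the linear increase and inverse tightening used here, together with all parameter names and default values, are assumptions made purely for illustration.

```python
def risk_aversion(uncertainty, base_lambda=1.0, sensitivity=2.0):
    """lambda(U_t): grows linearly with the uncertainty estimate (assumed form)."""
    return base_lambda * (1.0 + sensitivity * uncertainty)


def constraint_threshold(uncertainty, base_kappa=0.10, sensitivity=2.0):
    """kappa(U_t): tightens as uncertainty grows (assumed form)."""
    return base_kappa / (1.0 + sensitivity * uncertainty)


def coupled_objective(expected_return, cvar, uncertainty):
    """Equation (3) evaluated for given estimates of return, CVaR, and uncertainty."""
    return expected_return - risk_aversion(uncertainty) * cvar
```

At zero uncertainty this reduces to the fixed-λ objective of Equation (2). As the uncertainty estimate rises, the same expected return and CVaR produce a lower objective value and a tighter constraint, so the policy is pushed toward smaller exposures.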

6. Uncertainty Modeling in Financial Reinforcement Learning

While risk-aware reinforcement learning focuses on controlling adverse outcomes under assumed return distributions, uncertainty-aware reinforcement learning addresses the reliability of these assumptions.

In financial environments, this distinction is essential. Risk reflects variability within estimated distributions, whereas uncertainty concerns the credibility of these estimates and the stability of the learned policy. In non-stationary markets, this distinction becomes operational, as decisions must be made under incomplete and evolving information.

Recent approaches incorporate predictive uncertainty into reinforcement learning, enabling agents to communicate confidence levels and guide downstream decisions [18]. Uncertainty-aware mechanisms have also been explored in multi-agent settings beyond finance, demonstrating their broader applicability to sequential decision-making under uncertainty [72]. From a governance standpoint, uncertainty acts as a control signal, informing when to reduce exposure, maintain autonomy, or trigger human intervention. Without explicit modeling of uncertainty, RL systems may exhibit overconfidence under distributional ambiguity.

6.1. Aleatoric and Epistemic Uncertainty in Finance

Uncertainty in financial RL is commonly decomposed into aleatoric and epistemic components.

Aleatoric uncertainty reflects inherent market randomness, driven by stochastic dynamics such as macroeconomic shocks or liquidity fluctuations. Epistemic uncertainty captures model-related ambiguity arising from limited data, misspecification, or regime shifts.

Recent work attempts to model both forms jointly [19]. Failing to distinguish them can cause systematic overconfidence—models may interpret limited knowledge as inherent randomness. In practice, elevated epistemic uncertainty should trigger caution, retraining, or supervisory intervention, making its estimation central to reliable decision-support systems.
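One common way to make this decomposition concrete is the law of total variance. The notation below is an assumption introduced for illustration: $\theta$ indexes models drawn from a posterior or ensemble $p(\theta \mid \mathcal{D})$, and each model predicts a mean $\mu_\theta(x)$ and a variance $\sigma_\theta^2(x)$ for the target $y$.

```latex
\[
\underbrace{\operatorname{Var}[y \mid x, \mathcal{D}]}_{\text{total}}
= \underbrace{\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\sigma_\theta^2(x)\right]}_{\text{aleatoric}}
+ \underbrace{\operatorname{Var}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\mu_\theta(x)\right]}_{\text{epistemic}}
\]
```

Under this reading, additional data can in principle shrink the second term, whereas the first term reflects noise that no amount of data removes. A model that reports only the total cannot tell the two apart.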

6.2. Bayesian Approaches and Probabilistic Policies

Bayesian methods provide a principled framework for representing epistemic uncertainty by treating model parameters as random variables. In RL, this yields probabilistic policies and value functions that encode confidence in predictions.

Such approaches have been applied to portfolio decision-making, including Bayesian neural networks and multi-agent systems that propagate uncertainty through deep representations [20,21]. In practice, higher uncertainty levels tend to induce more conservative allocation strategies.

However, exact Bayesian inference is generally intractable in deep models. Approximate methods such as variational inference or Monte Carlo dropout introduce approximation errors and may result in poor calibration. As a result, Bayesian RL enhances uncertainty representation but does not guarantee well-calibrated confidence.
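The mechanics of Monte Carlo dropout can be sketched in plain Python with a toy one-hidden-layer network: dropout stays active at prediction time, and the spread of repeated stochastic passes is read as an uncertainty estimate. The network shape, weights, and parameter names are hypothetical, and a real implementation would use a deep learning framework.

```python
import random
import statistics


def dropout_forward(x, hidden_weights, output_weights, drop_prob, rng):
    """One stochastic pass through a one-hidden-layer ReLU network with dropout active."""
    keep = 1.0 - drop_prob
    output = 0.0
    for w_in, w_out in zip(hidden_weights, output_weights):
        activation = max(0.0, w_in * x)
        if rng.random() < keep:
            output += w_out * activation / keep  # inverted-dropout rescaling
    return output


def mc_dropout_predict(x, hidden_weights, output_weights,
                       drop_prob=0.2, n_samples=200, seed=0):
    """Mean and standard deviation over repeated stochastic forward passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, hidden_weights, output_weights, drop_prob, rng)
               for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.pstdev(samples)
```

The reported spread depends directly on `drop_prob`, which is a modeling choice rather than something learned from the data. This is one reason the resulting confidence may be poorly calibrated.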

6.3. Ensemble Methods and Approximate Epistemic Estimation

Ensemble methods approximate epistemic uncertainty through prediction dispersion across multiple models. In RL, they improve stability, exploration, and robustness, and have been applied to portfolio management through ensemble-based strategies [73].

However, disagreement across models does not necessarily reflect true uncertainty. When trained on similar data, ensembles may fail to capture deeper structural ambiguity. They also introduce additional computational cost and lack a unified probabilistic interpretation.

From a governance perspective, miscalibrated confidence, whether under- or overestimated, can directly distort risk exposure, which limits the reliability of ensemble disagreement as a decision signal.

6.4. Uncertainty-Driven Exploration and Robustness

Uncertainty modeling also informs exploration strategies. Standard methods such as ε-greedy do not account for epistemic uncertainty and are poorly suited to financial contexts.

Uncertainty-aware exploration prioritizes information gain while controlling exposure, potentially improving generalization and reducing drawdowns. However, in financial environments, exploration must balance learning efficiency with capital preservation.

In practice, coordination between exploration, risk regulation, and uncertainty signals remains limited. Although thresholds may trigger monitoring or escalation, their integration with allocation and constraint mechanisms is still incomplete.

6.5. Structural Limitations of Current Uncertainty-Aware RL

Despite recent progress, uncertainty-aware reinforcement learning remains structurally incomplete in financial applications. Many approaches do not clearly distinguish between aleatoric and epistemic uncertainty; uncertainty is often modeled locally rather than propagated across temporal horizons; and formal evaluation of calibration remains limited. More fundamentally, uncertainty is frequently treated as an auxiliary output rather than as a control variable that influences decisions. These structural gaps are directly addressed in the unified governance-aware architecture proposed in Section 9.6.

The predominant decoupling of risk management and uncertainty estimation in existing portfolio RL systems can be attributed to three main factors. First, training stability: jointly optimizing risk constraints and uncertainty estimates introduces conflicting gradient signals that can destabilize the learning process. Second, computational complexity: Bayesian uncertainty quantification methods such as BNNs or Monte Carlo dropout significantly increase inference time, making real-time coupled decision-making computationally expensive. Third, framework limitations: standard RL libraries and benchmark environments are not designed to natively support epistemic uncertainty as a control variable, which discourages coupled architectures.

7. Multi-Modal Signals and Modular Architectures for Portfolio Reinforcement Learning

Building on the previous analysis, this section examines multi-modal and modular designs as structural mechanisms for integrating heterogeneous signals, risk estimation, and uncertainty-aware decision-making.

Financial markets are influenced by diverse information sources beyond price data, including sentiment, volatility regimes, macroeconomic indicators, liquidity conditions, and inter-asset dependencies. Modeling such heterogeneity within monolithic RL architectures is often statistically fragile and difficult to interpret, motivating modular and multi-modal approaches.

From a human–AI perspective, modularity enhances transparency by decomposing decision processes into interpretable components aligned with institutional workflows. It also provides a structural basis for coordinating information, risk signals, and model confidence within unified decision pipelines.

7.1. Motivation for Multi-Modal Learning in Finance

Traditional portfolio optimization relies primarily on price-derived statistics, which provide a limited and often lagging representation of market conditions. During regime shifts, these signals typically react only after structural changes occur.

Alternative data sources—such as sentiment, volatility forecasts, and macroeconomic indicators—offer complementary information that can support earlier regime detection and adaptive exposure management. Reinforcement learning systems increasingly incorporate such signals through enriched state representations [74,75].

However, heterogeneous data differ in frequency, noise, and predictive horizon. Direct concatenation often increases overfitting and reduces interpretability. Modular representations address this limitation by processing signals independently before integration, improving robustness and transparency.

7.2. Sentiment and Behavioral Signals

Textual data captures expectations and behavioral dynamics not immediately reflected in prices. Transformer-based models [24] extract sentiment signals associated with short-term volatility and event-driven market reactions.

In RL systems, sentiment typically modulates exposure rather than directly determining allocations. Its relevance is regime-dependent, becoming more influential during crises. Static integration may therefore degrade performance if not adaptively controlled.

Modular architectures mitigate this risk by isolating sentiment processing, enabling monitoring when behavioral signals diverge from structural indicators.

7.3. Volatility Modeling and Risk State Estimation

Volatility is a central risk indicator in portfolio construction. While traditional models estimate conditional volatility, deep sequential models better capture nonlinear dynamics and regime changes.

In RL, volatility acts as a dynamic risk-state variable influencing leverage and allocation. Probabilistic models further provide richer representations of risk conditions.

However, volatility forecasts are often treated deterministically, and exposure rarely adapts to forecast reliability. This limits their effectiveness under regime shifts.

7.4. Structural Dependencies and Graph-Based Representations

Financial assets are embedded in dynamic dependency structures shaped by correlations, sector linkages, and systemic interactions. Modeling assets independently therefore ignores important relational information.

Graph-based approaches represent assets as nodes and dependencies as edges, enabling graph neural networks [25] to capture co-movement patterns and systemic risk propagation. In reinforcement learning, such representations improve diversification and enhance awareness of interconnected risks.

However, financial dependency structures are unstable and evolve over time. Static graph constructions may encode transient correlations as persistent structure, leading to overfitting. Adaptive graph learning and temporal updating are therefore necessary for reliable deployment.
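The difference between a static graph and a temporally updated one can be sketched as follows. This is a minimal illustration assuming a simple thresholded-correlation graph rebuilt on each rolling window; the threshold, window length, and asset names are hypothetical, and learned adaptive graphs would replace the correlation rule in practice.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_edges(returns, threshold=0.5):
    """Connect two assets when their absolute correlation reaches the threshold."""
    names = sorted(returns)
    edges = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(returns[a], returns[b])) >= threshold:
                edges.add((a, b))
    return edges


def rolling_graphs(returns, window, threshold=0.5):
    """Rebuild the edge set on every rolling window of past returns only."""
    n = len(next(iter(returns.values())))
    return [
        correlation_edges({k: v[t - window:t] for k, v in returns.items()}, threshold)
        for t in range(window, n + 1)
    ]


# Hypothetical returns: A and B co-move early, then decouple.
returns = {
    "A": [0.01, 0.02, 0.03, 0.04, 0.03, 0.02, 0.01, 0.00],
    "B": [0.01, 0.02, 0.03, 0.04, 0.01, -0.01, -0.01, 0.01],
}
graphs = rolling_graphs(returns, window=4)
```

In this toy series the edge between A and B exists in the first window and disappears in the last one, whereas a single graph fitted to the full history would hide that change.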

7.5. Static Versus Dynamic Signal Fusion

The primary challenge in multi-modal systems lies in signal integration.

Static fusion methods assume stable signal relevance, an assumption rarely satisfied in non-stationary markets. Dynamic fusion mechanisms—such as attention or gating—enable adaptive weighting based on market conditions.

While this improves flexibility and robustness, it introduces overfitting risks and requires regularization. Attention weights are adaptive statistical constructs—not causal explanations.
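A gating mechanism of this kind can be sketched in a few lines. The example below assumes that each signal arrives as a scalar together with an unnormalized relevance score; in a real system those scores would come from a learned, regularized gating network conditioned on market state, whereas here they are hypothetical inputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_signals(signals, relevance_scores):
    """Weight each scalar signal by a softmax gate over its relevance score.

    signals and relevance_scores are dicts keyed by signal name
    (hypothetical interface).
    """
    if set(signals) != set(relevance_scores):
        raise ValueError("signals and relevance_scores need identical keys")
    names = sorted(signals)
    weights = softmax([relevance_scores[n] for n in names])
    fused = sum(w * signals[n] for w, n in zip(weights, names))
    return fused, dict(zip(names, weights))


# Hypothetical crisis-like state: the gate favors sentiment over volatility.
fused, weights = fuse_signals(
    {"sentiment": 1.0, "volatility": -1.0},
    {"sentiment": 2.0, "volatility": 0.0},
)
```

Static fusion corresponds to holding the scores fixed over time; the returned weights describe how the gate allocated attention, not why the market moved.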

7.6. Modular Architectures as Structural Unification

Modular architectures decompose RL systems into functional components, including signal extraction, risk estimation, uncertainty modeling, and policy optimization [76].

This decomposition improves interpretability, scalability, and extensibility. More importantly, it enables coordinated decision-making, where multiple information sources and model outputs can be combined within a structured pipeline.

However, modularity introduces coordination challenges. Without explicit mechanisms, systems may behave as loosely connected components rather than unified decision frameworks.

7.7. Structural Implications for Portfolio Reinforcement Learning

The combination of multi-modal learning and modular architectures reflects a shift from monolithic policies toward structured decision systems that explicitly account for heterogeneous information.

Their effectiveness depends on coherent integration across components. When properly coordinated, such architectures can improve situational awareness, support calibrated intervention, and enhance accountability in human–AI decision-making.

A remaining challenge concerns empirical validation under realistic deployment constraints. The following section therefore examines evaluation protocols and their implications for system credibility.

8. Evaluation Protocols and Deployment Pitfalls

The rapid expansion of reinforcement learning (RL) in portfolio optimization has produced numerous empirical performance claims, often supported by open-source experimentation frameworks [77]. However, many results remain difficult to interpret, compare, or reproduce. Simplified backtesting environments, inconsistent treatment of transaction costs, and selective reporting contribute to a persistent gap between academic findings and real-world applicability.

From a governance standpoint, evaluation is not merely a validation step but a structural requirement for ensuring reliability and accountability. For RL systems intended for practical use, assessment protocols must explicitly incorporate regime variability, market frictions, and operational constraints.

To illustrate the practical applicability of the proposed evaluation framework (Section 8.7), we apply four evaluation criteria to three representative studies drawn from the reviewed corpus. First, regarding temporal integrity: Ye et al. [75] employ a walk-forward validation protocol with strict train-test separation, satisfying this criterion, whereas Song et al. [8] report results on a single evaluation period using cryptocurrency datasets, without multi-regime or walk-forward validation. Second, regarding regime robustness: Khemlichi et al. [16] explicitly evaluate performance across stable, crisis, recovery, and sideways market regimes, providing a comprehensive robustness assessment, while the majority of reviewed studies (38 out of 55) limit evaluation to a single market period. Third, regarding multi-seed reporting: only 12 of the 55 coded studies report results across multiple random seeds with confidence intervals, limiting the reproducibility of their findings. Fourth, regarding transaction cost inclusion: studies employing real-world datasets from the S&P 500 and DAX 30 more consistently incorporate transaction costs, whereas studies using synthetic environments tend to omit them. This comparative application demonstrates that the proposed evaluation framework provides actionable discriminative power across existing studies and can serve as a standardized audit tool for future work in portfolio RL.

8.1. Backtesting Bias and Data Leakage

Backtesting remains the dominant evaluation paradigm but is inherently vulnerable to bias. Look-ahead bias, survivorship bias, improper normalization, and data leakage can significantly inflate performance estimates [78]. In RL settings, these issues are further amplified by temporal dependencies in training. The deflated Sharpe ratio [79] and walk-forward validation protocols provide more robust evaluation baselines that partially address these concerns.

Leakage may arise from overlapping train–test splits, forward-looking preprocessing, or feature engineering pipelines that implicitly incorporate future information. As a result, agents may exploit artifacts rather than learn robust strategies. Reliable evaluation therefore requires strict temporal separation, walk-forward validation, and sensitivity analysis across data partitions.
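Strict temporal separation with walk-forward validation can be sketched as an index generator. The version below is a minimal illustration using a fixed-length rolling training window and an optional gap between training and test blocks; the function name and parameters are hypothetical, and any normalization statistics would still need to be fitted on the training indices only.

```python
def walk_forward_splits(n_samples, train_size, test_size, gap=0, step=None):
    """Yield (train, test) index ranges where training always precedes testing.

    gap leaves unused observations between the two blocks, which helps when
    features or labels span several periods. step defaults to test_size so
    that consecutive test blocks do not overlap.
    """
    step = test_size if step is None else step
    if min(train_size, test_size, step) <= 0 or gap < 0:
        raise ValueError("sizes and step must be positive, gap non-negative")
    start = 0
    while start + train_size + gap + test_size <= n_samples:
        train = range(start, start + train_size)
        test_start = start + train_size + gap
        test = range(test_start, test_start + test_size)
        yield train, test
        start += step


# Ten observations, four for training, two for testing per fold.
folds = [(list(tr), list(te)) for tr, te in walk_forward_splits(10, 4, 2)]
```

Sensitivity analysis across data partitions then amounts to repeating the evaluation for several `train_size`, `test_size`, and `gap` settings and reporting the spread of results.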

8.2. Regime-Based and Stress-Oriented Evaluation

Aggregate performance metrics often conceal regime-specific weaknesses. Strategies with strong average returns may fail under crisis conditions or degrade in low-volatility regimes.

Regime-aware evaluation addresses this limitation by segmenting data into economically meaningful phases and conducting stress tests under extreme but plausible scenarios. Metrics such as drawdown severity, recovery speed, and responsiveness to volatility shifts provide a more structural assessment of behavior.
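Two of the metrics named above, drawdown severity and recovery speed, can be computed directly from an equity curve. The following minimal sketch assumes a strictly positive equity series sampled at a fixed frequency; the sample values are invented for illustration.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def longest_recovery(equity):
    """Longest run of consecutive periods spent below a previous peak."""
    peak = equity[0]
    current = longest = 0
    for value in equity[1:]:
        if value >= peak:
            peak = value      # new high: the drawdown episode has ended
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


# Hypothetical equity curve covering a drawdown and a recovery.
equity = [100, 120, 90, 95, 130, 125]
severity = max_drawdown(equity)
periods_under_water = longest_recovery(equity)
```

For regime-aware evaluation, the same two functions would be applied separately to each contiguous regime segment rather than to the full history.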

8.3. Transaction Costs, Turnover, and Market Frictions

A common limitation in financial RL is the assumption of frictionless markets. Ignoring transaction costs, liquidity constraints, and execution delays systematically overstates performance [5].

RL agents are particularly sensitive to such assumptions. High-frequency rebalancing strategies may appear optimal in simulation but become unprofitable once costs are introduced. Transaction costs should therefore be incorporated directly into the training objective rather than applied post hoc.
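Incorporating costs into the training objective can be as simple as charging turnover inside the per-step reward. The sketch below assumes proportional costs at a hypothetical rate of 10 basis points and ignores weight drift between rebalancing dates, slippage, and market impact.

```python
def net_step_reward(prev_weights, new_weights, asset_returns, cost_rate=0.001):
    """Portfolio return for one step, net of proportional transaction costs.

    Turnover is the sum of absolute weight changes; cost_rate is the assumed
    cost per unit of turnover.
    """
    turnover = sum(abs(n - p) for n, p in zip(new_weights, prev_weights))
    gross = sum(w * r for w, r in zip(new_weights, asset_returns))
    return gross - cost_rate * turnover, turnover


# Hypothetical rebalance from 50/50 to 70/30 across two assets.
reward, turnover = net_step_reward([0.5, 0.5], [0.7, 0.3], [0.01, -0.02])
```

Because the penalty enters every reward the agent sees, frequent rebalancing is discouraged during learning instead of being discovered as unprofitable only after training.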

8.4. Overfitting, Sample Inefficiency, and Selection Bias

Deep RL models are data-intensive, whereas financial data is limited, noisy, and non-stationary. This mismatch increases the risk of overfitting to historical patterns that do not generalize.

Sample inefficiency further contributes to instability, while repeated experimentation and selective reporting introduce implicit selection bias. More reliable validation requires multi-seed reporting, hyperparameter sensitivity analysis, and distributional results rather than single-point estimates.
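Multi-seed, distributional reporting can be sketched as a small summary function. The example assumes one scalar metric (for instance an out-of-sample Sharpe ratio) per random seed and uses a normal-approximation confidence interval; with only a handful of seeds a Student-t interval would be wider and more appropriate, so the interval here should be read as optimistic. The seed results are hypothetical.

```python
from statistics import NormalDist, fmean, stdev


def summarize_seeds(metric_by_seed, confidence=0.95):
    """Summarize one metric across seeds instead of reporting a single run."""
    values = list(metric_by_seed.values())
    if len(values) < 2:
        raise ValueError("at least two seeds are needed to estimate dispersion")
    mean = fmean(values)
    sd = stdev(values)                                   # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)       # two-sided critical value
    half_width = z * sd / len(values) ** 0.5
    return {
        "n_seeds": len(values),
        "mean": mean,
        "std": sd,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical metric values from five training seeds.
report = summarize_seeds({0: 1.0, 1: 1.2, 2: 0.8, 3: 1.1, 4: 0.9})
```

Reporting the minimum and maximum alongside the interval makes it harder for a single favorable seed to stand in for the whole distribution.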

8.5. Reproducibility and Benchmark Fragmentation

The absence of standardized benchmarks remains a major limitation. Differences in datasets, preprocessing pipelines, cost assumptions, and evaluation horizons hinder meaningful comparison across studies.

Even minor implementation details can materially affect results, yet methodological transparency is often limited. Institutional adoption requires verifiable performance, traceable pipelines, and auditable experimental procedures.

A key open challenge is the lack of standardized benchmarks and evaluation protocols, which limits reproducibility and cross-study comparison in portfolio reinforcement learning.

8.6. Bridging Research and Deployment

Even with improved validation protocols, real-world implementation introduces additional constraints rarely addressed in academic studies, including latency, execution slippage, regulatory requirements, and explainability [80].

Models that perform well in backtests may degrade under real-time conditions or distributional shifts. Evaluation must therefore extend beyond retrospective analysis to include monitoring of stability, calibration, and resilience under stress.

Practical use also requires organizational mechanisms such as escalation policies, override capabilities, and clearly defined accountability structures.

8.7. Deployment-Oriented Validation Framework

A deployment-oriented framework should assess portfolio RL systems across six structural dimensions, as illustrated in Figure 2.

First, temporal integrity requires strict train–test separation and walk-forward validation. Second, regime robustness requires evaluation across market phases and stress scenarios. Third, cost realism requires integrating transaction costs during training and testing sensitivity to liquidity constraints. Fourth, uncertainty calibration requires assessing whether exposure adapts to confidence levels. Fifth, stability requires evaluation across multiple seeds and configurations with distributional reporting. Sixth, governance compatibility requires auditability, escalation mechanisms, and integration with human oversight. Under such a framework, historical returns alone are insufficient. A system is convincingly validated only if it demonstrates robustness, reliability, and compatibility with operational and governance constraints.

9. Open Challenges and Research Directions

Despite substantial progress, portfolio reinforcement learning still faces important challenges related to deployment, generalization, and system design. Closing the gap between experimental results and deployable systems requires progress beyond predictive performance, including improved architectural coordination, uncertainty calibration, evaluation rigor, and governance-compatible design.

To illustrate the practical operation of the proposed unified framework, consider the following step-by-step example. At time step t, the uncertainty estimation module (implemented as a Bayesian Neural Network) produces a predictive distribution over expected returns for a given asset. When the epistemic uncertainty estimate σ_e(t) exceeds a predefined threshold θ, the risk budgeting module automatically tightens the CVaR constraint from its baseline level α to a more conservative level α′ < α. This adjustment propagates to the policy optimization module (implemented as a PPO agent), which receives an updated feasible action space with reduced maximum position sizes. As a result, the agent allocates less capital to the uncertain asset and redistributes it toward lower-uncertainty assets or cash. If σ_e(t) subsequently decreases, indicating improved model confidence, the risk constraint is progressively relaxed back toward α, restoring the agent’s full allocation capacity. This dynamic coupling between uncertainty estimation and risk control constitutes the core operational mechanism of the proposed framework, ensuring that capital exposure is always commensurate with model confidence.
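The threshold-triggered tightening described in this example can be sketched in code. The source specifies only that the constraint moves from α to some α′ < α when σ_e(t) exceeds θ and relaxes as σ_e(t) falls; the particular shrinkage rule, the floor `min_fraction`, the historical CVaR estimator, and all numerical values below are assumptions made for illustration.

```python
import math


def tightened_limit(sigma_e, theta, alpha, min_fraction=0.5, sensitivity=1.0):
    """Return the CVaR limit: alpha at or below the threshold, alpha' < alpha above it."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    if sigma_e <= theta:
        return alpha
    excess = (sigma_e - theta) / theta            # relative breach of the threshold
    fraction = max(min_fraction, 1.0 / (1.0 + sensitivity * excess))
    return alpha * fraction


def historical_cvar(losses, tail=0.05):
    """Average of the worst `tail` share of losses (positive values are losses)."""
    ordered = sorted(losses, reverse=True)
    k = max(1, math.ceil(tail * len(ordered)))
    return sum(ordered[:k]) / k


def max_position(unit_losses, limit, tail=0.05, cap=1.0):
    """Largest weight in [0, cap] whose scaled CVaR stays within the limit.

    Relies on CVaR scaling linearly with position size.
    """
    cvar = historical_cvar(unit_losses, tail)
    if cvar <= 0:
        return cap
    return min(cap, limit / cvar)


# Hypothetical loss history for one asset and a baseline limit alpha = 0.05.
losses = [0.01, -0.02, 0.05, 0.03, -0.01, 0.10, 0.0, 0.02, -0.03, 0.04]
calm = max_position(losses, tightened_limit(0.01, 0.02, 0.05), tail=0.2)
stressed = max_position(losses, tightened_limit(0.04, 0.02, 0.05), tail=0.2)
```

Because the limit is a continuous function of σ_e(t) in this sketch, the position cap relaxes gradually toward its baseline as uncertainty falls back to θ, mirroring the progressive relaxation in the example; the resulting cap would then bound the action space handed to the policy.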

While the proposed unified framework remains at a conceptual level, several empirical settings appear particularly well-suited for initial piloting. From an asset class perspective, large-cap equity indices—such as the S&P 500 and DAX 30—represent the most accessible starting point, given their high liquidity, data availability, and well-documented regime dynamics. Multi-asset portfolios combining equities, bonds, and commodities would constitute a natural second step, as they introduce cross-asset dependency structures that stress-test the coordination mechanisms of multi-agent architectures. From an institutional perspective, quantitative hedge funds and robo-advisory platforms represent promising deployment contexts, as they already operate within algorithmic decision frameworks and maintain the infrastructure required for real-time uncertainty monitoring. Sovereign wealth funds and pension funds—which operate under strict regulatory risk constraints—would benefit most from the governance layer of the proposed framework, making them ideal long-term validation partners.

9.1. Continual and Lifelong Reinforcement Learning Under Risk Constraints

Most financial RL systems rely on offline training with periodic retraining, implicitly assuming stable market dynamics. Continual learning aims to enable adaptive policy updates while preserving prior knowledge. However, in financial settings, adaptation must remain aligned with risk management objectives. Key challenges include reliable regime-shift detection, selective update mechanisms, and integration of regulatory constraints into learning dynamics.

9.2. Generalization Across Markets, Assets, and Temporal Horizons

Limited generalization remains a major weakness. Policies trained on specific datasets often fail when applied to new markets or regimes, raising concerns about whether models capture structural relationships or exploit local patterns.

Improving generalization requires representations that encode broader financial principles, including diversification, liquidity constraints, and regime sensitivity. Transfer learning, meta-learning, and domain randomization offer promising directions.

9.3. Interpretability, Accountability, and Regulatory Integration

Interpretability is increasingly a regulatory requirement. Portfolio decisions must be auditable, traceable, and aligned with fiduciary and compliance constraints [81]. While modular and hierarchical architectures improve structural transparency, they do not guarantee meaningful explanations. Future systems must integrate input signals, risk constraints, uncertainty estimates, and allocation decisions within coherent explanatory frameworks.

9.4. Standardized Benchmarks and Deployment-Oriented Evaluation

Fragmented evaluation practices limit comparability and reproducibility. Differences in datasets, preprocessing, and cost assumptions hinder cumulative progress.

Future benchmarks should incorporate regime segmentation, realistic transaction costs, multi-metric evaluation, and cross-market validation. Open-source frameworks and reproducible pipelines are essential for methodological transparency.

9.5. Toward Unified Risk-Uncertainty-Modularity Frameworks

A central challenge is the design of coherent systems that align decision-making, risk control, and model reliability within a unified framework.

Future approaches should emphasize coordinated architectures in which decision processes, model confidence, and system constraints are jointly considered. Progress in this direction depends less on isolated algorithmic improvements than on system-level design principles.

9.6. Conceptual Governance-Aware Unified Architecture

To operationalize the integration of risk control, uncertainty modeling, and decision-making, we outline a conceptual governance-aware architecture for portfolio reinforcement learning, illustrated in Figure 3. This architecture directly addresses the structural gaps identified across Sections 4–8 and provides a concrete instantiation of the four human–AI collaboration modes described in Section 2.

This architecture is structured into five interdependent layers. First, multi-modal signal processing modules extract heterogeneous information, including market dynamics, volatility estimates, sentiment indicators, and structural dependencies, directly addressing the signal integration gaps identified in Section 7. Second, epistemic uncertainty is aggregated across modules to produce a portfolio-level confidence signal U(t), addressing the calibration limitations identified in Sections 6.1–6.3. Third, risk budgeting mechanisms adapt dynamically to this confidence signal, allowing constraints such as CVaR limits [82], drawdown thresholds, or leverage bounds to tighten under increased uncertainty, thereby implementing the coupling formalized in Section 5.6. Fourth, policy optimization determines allocation decisions under uncertainty-conditioned constraints, embedding the RL agent (PPO [33], SAC [35], or DDPG [34]) within an explicit governance structure. Finally, a governance layer introduces escalation mechanisms, auditability, and human oversight, supporting the Advisory, Constraint-Guided, Escalation, and Shared-Control modes described in Section 2.

Within this framework, uncertainty is no longer treated as a passive diagnostic output but as an active control variable that directly conditions risk exposure, allocation behavior, and supervisory intervention. This enables a coordinated and interpretable decision process aligned with deployment requirements. While the framework remains conceptual at this stage, its five-layer structure provides a concrete research agenda for future empirical validation.
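How an aggregated signal might drive the governance layer can be sketched as follows. The sketch treats U(t) as an uncertainty level in [0, 1] where higher values mean lower confidence, aggregates module-level estimates by a weighted average, and maps the result to one of three actions (maintain autonomy, reduce exposure, or trigger human intervention). The aggregation rule, thresholds, exposure floor, and action labels are all assumptions for illustration and are not the collaboration modes defined in Section 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernanceDecision:
    mode: str               # "autonomous", "reduce_exposure", or "escalate"
    exposure_scale: float   # multiplier applied to the policy's target exposure
    needs_human: bool


def aggregate_uncertainty(module_uncertainty, weights=None):
    """Weighted average of per-module epistemic uncertainty (equal weights by default)."""
    if not module_uncertainty:
        raise ValueError("at least one module estimate is required")
    if weights is None:
        weights = {name: 1.0 for name in module_uncertainty}
    total = sum(weights[name] for name in module_uncertainty)
    return sum(weights[n] * u for n, u in module_uncertainty.items()) / total


def govern(u_t, reduce_at=0.3, escalate_at=0.6, floor=0.25):
    """Map portfolio-level uncertainty to a governance action."""
    if not 0.0 <= reduce_at < escalate_at:
        raise ValueError("thresholds must satisfy 0 <= reduce_at < escalate_at")
    if u_t >= escalate_at:
        return GovernanceDecision("escalate", floor, True)
    if u_t >= reduce_at:
        progress = (u_t - reduce_at) / (escalate_at - reduce_at)
        return GovernanceDecision("reduce_exposure", 1.0 - progress * (1.0 - floor), False)
    return GovernanceDecision("autonomous", 1.0, False)


# Hypothetical module-level estimates at one time step.
u_t = aggregate_uncertainty({"sentiment": 0.5, "volatility": 0.4})
decision = govern(u_t)
```

Logging each `GovernanceDecision` together with the inputs that produced it is one simple way to obtain the audit trail that the governance layer calls for.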

9.7. Structural Failure Modes Across Integration Levels

Several recurrent failure modes explain why RL systems degrade in real environments:

Overconfident policies due to insufficient model awareness;
Static risk control unable to adapt to changing conditions;
Limited use of model confidence in allocation and constraint decisions;
Inconsistent signals across system components;
Weak propagation of uncertainty information across decision layers.

These patterns highlight the importance of coordinated system design for achieving robust performance.

9.8. Limitations of This Review

This review adopts a structured critical synthesis methodology rather than a formal meta-analysis. Although a structured comparative analysis of 55 fully assessed studies was conducted to quantify risk–uncertainty coupling, the proposed unified framework remains conceptual and has not yet been empirically validated under live deployment conditions. Evaluating the five-layer architecture under realistic market conditions, including transaction costs, regime shifts, and regulatory constraints, constitutes the primary direction for future work.

The proposed taxonomy involves interpretive judgment, and category boundaries may evolve as the field matures. The literature on uncertainty-aware and governance-aligned RL remains emerging, with limited large-scale empirical validation. The proposed architectural perspective should therefore be viewed as a structured research agenda rather than an established standard.

Institutional and regulatory factors—including fiduciary obligations, explainability requirements, and audit trails—are only partially addressed in this review, although they play a critical role in real-world adoption and warrant dedicated investigation in future work.

10. Conclusions

This review examined reinforcement learning (RL) for portfolio optimization through three interdependent dimensions: risk sensitivity, uncertainty modeling, and architectural design. While RL provides a powerful framework for sequential decision-making, its transition from experimental success to practical use remains constrained by system-level limitations.

A central insight of this work is that effective portfolio reinforcement learning requires the joint design of risk regulation, uncertainty modeling, and system architecture, rather than treating them as independent components. A structured analysis of 55 fully assessed studies confirms this structural decoupling empirically: only 5 studies (9%) explicitly couple uncertainty with risk constraints, while 38 (69%) treat these dimensions as independent.

From a governance standpoint, robust real-world use requires more than improved returns. Decision-support systems must incorporate calibrated communication of confidence, adaptive and enforceable risk constraints, traceable decision processes, and explicit escalation mechanisms.

From a practical perspective, while the proposed frameworks remain conceptual, they are intended to guide future empirical validation. Evaluating these architectures under realistic market conditions, including transaction costs and regime shifts, constitutes an important direction for future work.

Future research should therefore prioritize integrated system design over isolated methodological improvements. Ultimately, the viability of reinforcement learning in portfolio management will depend on the development of well-coordinated, interpretable, and governance-compatible decision infrastructures. Reinforcement learning is best viewed not as a fully autonomous optimizer, but as an adaptive component embedded within structured human-centered decision systems.

The following points summarize the key takeaways of this review:

For practitioners:

Risk-aware RL architectures—particularly hierarchical and modular systems—offer measurable advantages over traditional Mean-Variance and Risk Parity approaches in volatile market regimes.
Epistemic uncertainty estimation should be incorporated as an active control signal rather than a passive diagnostic tool, enabling dynamic adjustment of risk constraints in real time.
Governance mechanisms—including human oversight triggers and escalation protocols—are essential for institutional deployment and should be explicitly designed into the system architecture.
Evaluation of RL-based portfolio systems should systematically include multi-seed reporting, regime-based testing, and transaction cost modeling to ensure real-world validity.

For researchers:

The explicit coupling of risk management and uncertainty quantification remains an open problem—only 9% of reviewed studies address it directly.
Continual learning and regime-adaptive architectures represent a critical gap, as most systems assume stationary market dynamics.
Standardized benchmarking protocols for portfolio RL—analogous to those established in robotics and game-playing RL—are urgently needed.
The governance layer of human–AI collaborative systems in finance remains largely theoretical and requires empirical validation in institutional settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info17050476/s1, PRISMA 2020 Checklist [83].

Author Contributions

Conceptualization, F.K.; methodology, F.K.; formal analysis, F.K.; investigation, F.K.; data curation, F.K.; writing—original draft preparation, F.K.; writing—review and editing, F.K., Y.I.K. and S.E.B.A.; visualization, F.K.; supervision, Y.I.K. and S.E.B.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Firdaous Khemlichi.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ang, A.; Timmermann, A. Regime Changes and Financial Markets. Annu. Rev. Financ. Econ. 2012, 4, 313–337. [Google Scholar] [CrossRef]
DeMiguel, V.; Garlappi, L.; Uppal, R. Optimal Versus Naive Diversification: How Inefficient Is the 1/N Portfolio Strategy? Rev. Financ. Stud. 2009, 22, 1915–1953. [Google Scholar] [CrossRef]
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Hambly, B.; Xu, R.; Yang, H. Recent Advances in Reinforcement Learning in Finance. Math. Financ. 2023, 33, 437–503. [Google Scholar] [CrossRef]
Lopez de Prado, M. Advances in Financial Machine Learning; Wiley: Hoboken, NJ, USA, 2018. [Google Scholar]
Fischer, T.G. Reinforcement Learning in Financial Markets—A Survey; FAU Discussion Papers in Economics, No. 12/2018; Friedrich-Alexander University Erlangen-Nürnberg: Erlangen, Germany, 2018; Available online: https://econstor.eu/bitstream/10419/183139/1/1032172355.pdf (accessed on 7 April 2026).
Cummings, M.M. Man versus Machine or Man + Machine? IEEE Intell. Syst. 2014, 29, 62–69.
Song, G.; Zhao, T.; Ma, X.; Lin, P.; Cui, C. Reinforcement Learning-Based Portfolio Optimization with Deterministic State Transition. Inf. Sci. 2025, 690, 121538.
Aboussalah, A.M.; Lee, C.-G. Continuous Control with Stacked Deep Dynamic Recurrent Reinforcement Learning for Portfolio Optimization. Expert Syst. Appl. 2020, 140, 112891.
Millea, A.; Edalat, A. Using Deep Reinforcement Learning with Hierarchical Risk Parity for Portfolio Optimization. Int. J. Financ. Stud. 2022, 11, 10.
Jin, B. A Mean-VaR Based Deep Reinforcement Learning Framework for Practical Algorithmic Trading. IEEE Access 2023, 11, 28920–28933.
Winkel, D.; Strauß, N.; Schubert, M.; Seidl, T. Risk-Aware Reinforcement Learning for Multi-Period Portfolio Selection. In Machine Learning and Knowledge Discovery in Databases; Amini, M.-R., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2023; pp. 185–200.
Jiang, Y.; Olmo, J.; Atwi, M. Deep Reinforcement Learning for Portfolio Selection. Glob. Financ. J. 2024, 62, 101016.
Hao, Z.; Zhang, H.; Zhang, Y. Stock Portfolio Management by Using Fuzzy Ensemble Deep Reinforcement Learning Algorithm. J. Risk Financ. Manag. 2023, 16, 201.
Li, Z.; Tam, V.; Yeung, K.L. Developing a Multi-Agent and Self-Adaptive Framework with Deep Reinforcement Learning for Dynamic Portfolio Risk Management. arXiv 2024, arXiv:2402.00515.
Khemlichi, F.; Khamlichi, Y.I.; Ali, S.E.B. Modular Reinforcement Learning for Multi-Market Portfolio Optimization. Information 2025, 16, 961.
Hêche, F.; Nigro, B.; Barakat, O.; Robert-Nicoud, S. Risk-Averse Policies for Natural Gas Futures Trading Using Distributional Reinforcement Learning. arXiv 2025, arXiv:2501.04421.
Lina, J.; Banda, S.S.; Paib, H.-T.; Rawal, B.S. EUDRL: Explainable Uncertainty-Based Deep Reinforcement Learning for Portfolio Management. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 24–28 June 2024; pp. 1–7.
Hao, J.L.T.; Wang, L.R.; Liu, C.; Choi, C.; Liu, S.; Fan, X. Exploring Epistemic and Distributional Uncertainties in Algorithmic Trading Agents. In Proceedings of the 2024 IEEE International Conference on Agents (ICA), Wollongong, Australia, 4–6 December 2024; pp. 82–87.
Park, K.; Jung, H.-G.; Eom, T.-S.; Lee, S.-W. Uncertainty-Aware Portfolio Management with Risk-Sensitive Multiagent Network. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 362–375.
Khemlichi, F.; Khamlichi, Y.I.; Ali, S.E.B. Hierarchical Multi-Agent System with Bayesian Neural Networks for Portfolio Optimization. Math. Model. Eng. Probl. 2025, 12, 1257.
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. Proc. Mach. Learn. Res. 2016, 48, 1050–1059.
Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. arXiv 2016, arXiv:1612.01474.
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907.
Benhamou, E. Can Deep Reinforcement Learning Solve the Portfolio Allocation Problem? Ph.D. Thesis, Université Paris Sciences et Lettres, Paris, France, 2023.
Lim, Q.Y.E.; Cao, Q.; Quek, C. Dynamic Portfolio Rebalancing through Reinforcement Learning. Neural Comput. Appl. 2022, 34, 7125–7139.
Charkhestani, A.; Esfahanipour, A. Behaviorally Informed Deep Reinforcement Learning for Portfolio Optimization with Loss Aversion and Overconfidence. Sci. Rep. 2026, 16, 6443.
Mani Shankar, M.; Sweety, A.; Deepthi, D. Optimizing Algorithmic Trading Through DRL: A Comparative Analysis of Single-Agent and Multi-Agent Models. In Data Science and Applications; Nanda, S.J., Yadav, R.P., Prasad, M., Saraswat, M., Eds.; Springer Nature: Cham, Switzerland, 2026; pp. 1–15.
Espiga-Fernández, F.; García-Sánchez, Á.; Ordieres-Meré, J. A Systematic Approach to Portfolio Optimization: A Comparative Study of Reinforcement Learning Agents, Market Signals, and Investment Horizons. Algorithms 2024, 17, 570.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518, 529–533.
Jiang, M.; Xu, Z.; Lin, Z. Dynamic Risk Control and Asset Allocation Using Q-Learning in Financial Markets. Trans. Comput. Sci. Methods 2024, 4.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2019, arXiv:1509.02971.
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290.
Khemlichi, F.; Elfilali, H.E.; Chougrad, H.; Ben Ali, S.E.; Idrissi Khamlichi, Y. Actor-Critic Methods in Stock Trading: A Comparative Study. In Proceedings of the 2023 3rd International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), Online Part, 20–21 July 2023; pp. 1–5.
Sun, Q.; Wei, X.; Yang, X. GraphSAGE with Deep Reinforcement Learning for Financial Portfolio Optimization. Expert Syst. Appl. 2024, 238, 122027.
Li, X.; Li, Y.; Zhan, Y.; Liu, X.-Y. Optimistic Bull or Pessimistic Bear: Adaptive Deep Reinforcement Learning for Stock Portfolio Allocation. arXiv 2019, arXiv:1907.01503.
Khemlichi, F.; Chougrad, H.; Khamlichi, Y.I.; el Boushaki, A.; Ben Ali, S.E. Deep Deterministic Policy Gradient for Portfolio Management. In Proceedings of the 2020 6th IEEE Congress on Information Science and Technology (CiSt), Essaouira, Morocco, 5–12 June 2020; pp. 424–429.
Khemlichi, F.; Chougrad, H.; Ali, S.E.B.; Khamlichi, Y.I. Portfolio Optimization System (POS): A Deep Reinforcement Learning Approach for Market-Adaptive Investment Strategies. In Proceedings of the 2025 5th International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME), Zanzibar, Tanzania, 16–19 October 2025; pp. 1–6.
Khemlichi, F.; Chougrad, H.; Khamlichi, Y.I.; Elboushaki, A.; Ali, S.E.B. Deep Deterministic Policy Gradient Based Portfolio Management System. Int. J. Inf. Sci. Technol. 2022, 6, 29–39.
Bai, Z.-L.; Zhao, Y.-N.; Zhou, Z.-G.; Li, W.-Q.; Gao, Y.-Y.; Tang, Y.; Dai, L.-Z.; Dong, Y.-Y. Mercury: A Deep Reinforcement Learning-Based Investment Portfolio Strategy for Risk-Return Balance. IEEE Access 2023, 11, 78353–78362.
Khemlichi, F.; Chougrad, H.; Idrissi Khamlichi, Y.; El Boushaki, A.; El Haj Ben Ali, S. A Stock Trading Strategy Based on Deep Reinforcement Learning. In Advanced Intelligent Systems for Sustainable Development (AI2SD’2020); Kacprzyk, J., Balas, V.E., Ezziyyani, M., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 920–928.
Yang, H.; Liu, X.-Y.; Zhong, S.; Walid, A. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy. In Proceedings of the 1st ACM International Conference on AI in Finance (ICAIF’20), New York, NY, USA, 3–4 November 2020; Association for Computing Machinery: New York, NY, USA, 2021; pp. 1–8.
Millea, A. Hierarchical Model-Based Deep Reinforcement Learning for Single-Asset Trading. Analytics 2023, 2, 560–576.
Sun, R.; Xi, Y.; Stefanidis, A.; Jiang, Z.; Su, J. A Novel Multi-Agent Dynamic Portfolio Optimization Learning System Based on Hierarchical Deep Reinforcement Learning. Complex Intell. Syst. 2025, 11, 311.
Cheridito, P.; Dupret, J.-L.; Wu, Z. ABIDES-MARL: A Multi-Agent Reinforcement Learning Environment for Endogenous Price Formation and Execution in a Limit Order Book. arXiv 2025, arXiv:2511.02016.
Kumlungmak, K.; Vateekul, P. Multi-Agent Deep Reinforcement Learning With Progressive Negative Reward for Cryptocurrency Trading. IEEE Access 2023, 11, 66440–66455.
Cheng, C.; Chen, B.; Xiao, Z.; Lee, R.S.T. Quantum Finance and Fuzzy Reinforcement Learning-Based Multi-Agent Trading System. Int. J. Fuzzy Syst. 2024, 26, 2224–2245.
Ying, R.; Lyu, J.; Li, J. Dynamic Portfolio Optimization with Data-Aware Multi-Agent Reinforcement Learning and Adaptive Risk Control. In Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing (AIIP 2024), Wuhan, China, 22–24 November 2024; ACM: New York, NY, USA, 2025; pp. 912–918.
Shavandi, A.; Khedmati, M. A Multi-Agent Deep Reinforcement Learning Framework for Algorithmic Trading in Financial Markets. Expert Syst. Appl. 2022, 208, 118124.
Cheng, L.-C.; Sun, J.-S. Multiagent-Based Deep Reinforcement Learning Framework for Multi-Asset Adaptive Trading and Portfolio Management. Neurocomputing 2024, 594, 127800.
Ma, C.; Zhang, J.; Li, Z.; Xu, S. Multi-Agent Deep Reinforcement Learning Algorithm with Trend Consistency Regularization for Portfolio Management. Neural Comput. Appl. 2022, 35, 6589–6601.
Kim, S.-H.; Lee, K.-H. Multi-Asset Multi-Agent Reinforcement Learning for Portfolio Management. IEEE Access 2025, 13, 194456–194474.
Li, H.; Hai, M. Deep Reinforcement Learning Model for Stock Portfolio Management Based on Data Fusion. Neural Process. Lett. 2024, 56, 108.
Zhang, H.; Shi, Z.; Hu, Y.; Ding, W.; Kuruoglu, E.E.; Zhang, X.-P. Optimizing Trading Strategies in Quantitative Markets Using Multi-Agent Reinforcement Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 136–140.
Xu, Z.; Bao, Q.; Wang, Y.; Feng, H.; Du, J.; Sha, Q. Reinforcement Learning in Finance: QTRAN for Portfolio Optimization. J. Comput. Technol. Softw. 2025, 4.
Ram, K.S.R.; M, S.; I, N.; M, T.; M, K.; B, N. Enhanced Investment Decision Making with a Reinforcement Learning-Based Multi-Agent Portfolio Management System. In Proceedings of the 2024 International Conference on Data Science and Network Security (ICDSNS), Tiptur, India, 26–27 July 2024; pp. 1–6.
Chen, M.-Y.; Chen, C.-T.; Huang, S.-H. Knowledge Distillation for Portfolio Management Using Multi-Agent Reinforcement Learning. Adv. Eng. Inf. 2023, 57, 102096.
Khemlichi, F.; Khamlichi, Y.I.; Ali, S.E.B. MPLS: A Modular Portfolio Learning System for Adaptive Portfolio Optimization. Math. Model. Eng. Probl. 2025, 12, 1959–1970.
Carta, S.; Corriga, A.; Ferreira, A.; Podda, A.S.; Recupero, D.R. A Multi-Layer and Multi-Ensemble Stock Trader Using Deep Learning and Deep Reinforcement Learning. Appl. Intell. 2021, 51, 889–905.
Yu, X.; Wu, W.; Liao, X.; Han, Y. Dynamic Stock-Decision Ensemble Strategy Based on Deep Reinforcement Learning. Appl. Intell. 2023, 53, 2452–2470.
Yang, M.; Hu, Y.; Wang, J. Risk-Averse Trader: A Deep Reinforcement Learning-Based Portfolio Optimization Method for Risk-Averse Investors. In Proceedings of the 2024 International Conference on Identification, Information and Knowledge in the Internet of Things (IIKI), Kusatsu, Japan, 6–8 December 2024; pp. 160–165.
Shen, S.; Ma, C.; Li, C.; Liu, W.; Fu, Y.; Mei, S.; Liu, X.; Wang, C. RiskQ: Risk-Sensitive Multi-Agent Reinforcement Learning Value Factorization. Adv. Neural Inf. Process. Syst. 2023, 36, 34791–34825.
Garrido-Merchán, E.; Mora-Figueroa, S.; Coronado-Vaca, M. Multi-Objective Bayesian Optimization of Deep Reinforcement Learning for Environmental, Social, and Governance (ESG) Financial Portfolio Management. Intell. Syst. Account. Financ. Manag. 2025, 32, e70008.
Sun, S.; Xue, W.; Wang, R.; He, X.; Zhu, J.; Li, J.; An, B. DeepScalper: A Risk-Aware Reinforcement Learning Framework to Capture Fleeting Intraday Trading Opportunities. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1858–1867.
Wang, X.; Liu, L. Risk-Sensitive Deep Reinforcement Learning for Portfolio Optimization. J. Risk Financ. Manag. 2025, 18, 347.
Sattar, A.; Sarwar, A.; Gillani, S.; Bukhari, M.; Rho, S.; Faseeh, M. A Novel RMS-Driven Deep Reinforcement Learning for Optimized Portfolio Management in Stock Trading. IEEE Access 2025, 13, 42813–42835.
Yang, M.; Wang, J.; Hu, Y. RiskawareTrader: A Reinforcement Learning Based Portfolio Optimization for Risk Averter. Int. J. Comput. Intell. Syst. 2025, 19, 25.
Dong, S.C.; Finlay, J.R. Adaptive Insurance Reserving with CVaR-Constrained Reinforcement Learning under Macroeconomic Regimes. arXiv 2025, arXiv:2504.09396.
Bellemare, M.G.; Dabney, W.; Munos, R. A Distributional Perspective on Reinforcement Learning. Proc. Mach. Learn. Res. 2017, 70, 449–458.
Shah, M.I.A.; Barrett, E.; Mason, K. Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning. arXiv 2025, arXiv:2507.16796.
Carta, S.; Ferreira, A.; Podda, A.S.; Reforgiato Recupero, D.; Sanna, A. Multi-DQN: An Ensemble of Deep Q-Learning Agents for Stock Market Forecasting. Expert Syst. Appl. 2021, 164, 113820.
Ansari, Y.; Gillani, S.; Bukhari, M.; Lee, B.; Maqsood, M.; Rho, S. A Multifaceted Approach to Stock Market Trading Using Reinforcement Learning. IEEE Access 2024, 12, 90041–90060.
Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; Li, B. Reinforcement-Learning Based Portfolio Management with Augmented Asset Movement Prediction States. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, 7–12 February 2020; AAAI Press: Palo Alto, CA, USA, 2020; Volume 34, pp. 1112–1119.
Huang, Z.; Tanaka, F. MSPM: A Modularized and Scalable Multi-Agent Reinforcement Learning-Based System for Financial Portfolio Management. PLoS ONE 2022, 17, e0263689.
Liu, X.-Y.; Yang, H.; Chen, Q.; Zhang, R.; Yang, L.; Xiao, B.; Wang, C.D. FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance. arXiv 2020, arXiv:2011.09607.
Maeda, I.; deGraw, D.; Kitano, M.; Matsushima, H.; Sakaji, H.; Izumi, K.; Kato, A. Deep Reinforcement Learning in Agent Based Financial Market Simulation. J. Risk Financ. Manag. 2020, 13, 71.
Bailey, D.H.; Lopez de Prado, M. The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality. J. Portf. Manag. 2014, 40, 94–107.
Ndikum, P.; Ndikum, S. Advancing Investment Frontiers: Industry-Grade Deep Reinforcement Learning for Portfolio Optimization. arXiv 2024, arXiv:2403.07916.
Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Inf. Fusion 2020, 58, 82–115.
Rockafellar, R.T.; Uryasev, S. Optimization of Conditional Value-at-Risk. J. Risk 2000, 2, 21–41.
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.

Figure 1. PRISMA-inspired study selection flow for portfolio reinforcement learning (2016–2025).

Figure 2. Deployment-oriented validation framework for portfolio reinforcement learning systems. The framework extends traditional backtesting by incorporating temporal integrity, regime robustness, market frictions, uncertainty calibration, stability analysis, and governance compatibility.

Figure 3. Conceptual governance-aware unified architecture for portfolio reinforcement learning. The five-layer structure integrates multi-modal signal processing, epistemic uncertainty aggregation, uncertainty-conditioned risk budgeting, policy optimization, and a governance and escalation layer.

Table 1. Representative comparative taxonomy of reviewed RL studies for portfolio optimization—verified sample of 14 studies organized by risk-uncertainty coupling level.

Ref	Author(s), Year	RL Architecture	Risk Treatment	Uncertainty Modeled	Risk–Uncertainty Coupling	Key Performance Metrics & Regime Testing
		GROUP A—Non-Coupled: Risk and Uncertainty Structurally Decoupled
[8]	Song et al., 2025 (Inf. Sci.)	Deep RL (deterministic state transition)	Reward shaping (return-based)	None—deterministic transitions explicitly assumed	Non-Coupled	Sharpe, cum. return. Backtesting on a single continuous evaluation setting.
[9]	Aboussalah & Lee, 2020 (Expert Syst. Appl.)	Deep RL—Recurrent (Stacked DDRL)	Reward shaping (return + turnover penalty)	None—no explicit uncertainty modeling	Non-Coupled	Sharpe, Sortino, max drawdown (OOS). Multiple market periods tested.
[10]	Millea & Edalat, 2022 (Int. J. Financ. Stud.)	Hierarchical RL (DRL + HRP)	Hierarchical Risk Parity (structural risk budgeting)	None—deterministic HRP/HERC allocation	Non-Coupled	Sharpe, Calmar, max drawdown (OOS). Multiple market periods.
[11]	Jin, 2023 (IEEE Access)	Deep RL (Mean-VaR framework)	Mean-VaR explicit tail-risk objective	None—VaR used as risk measure only	Non-Coupled	Sharpe, VaR, annualized return. Some stress periods included.
[12]	Winkel et al., 2023 (ECML PKDD)	Risk-aware RL	Variance-based Pareto risk optimization	None—uncertainty not modeled	Non-Coupled	Std. deviation, Sharpe, risk-return Pareto front (Nasdaq-100 OOS). Risk sensitivity analysis.
[13]	Jiang, Olmo & Atwi, 2024 (Glob. Financ. J.)	Deep RL (TD3 + CNN—RTC)	Risk + transaction cost sensitive reward	None—variance-based risk modeling	Non-Coupled	Sharpe, Calmar, max drawdown (OOS) vs. MV & Sharpe benchmarks.
		GROUP B—Partial Coupling: Uncertainty Present but Not Operationalized in Risk Constraints
[14]	Hao et al., 2023 (J. Risk Financ. Manag.)	Modular/Hybrid (Fuzzy Ensemble DRL)	Reward shaping + fuzzy market-regime encoding	Implicit via ensemble disagreement	Partial	Sharpe, cum. return, max drawdown. S&P 100 market regimes tested.
[15]	Li, Tam & Yeung, 2024 (arXiv)	MARL (self-adaptive risk management)	Dynamic risk management via agent specialization	Partial—agent disagreement as uncertainty proxy	Partial	Sharpe, max drawdown (OOS). Multiple market regimes tested.
[16]	Khemlichi et al., 2025 (Information, MDPI)	Modular HRL-MARL (MPLS—multi-algorithm)	CVaR in reward + modular risk scaling	Partial—Bayesian VFM for volatility uncertainty	Partial	Sharpe +40–70% vs. MVP/RP; CVaR limited during COVID. S&P 500, DAX 30, FTSE100. 4 regimes: stable, crisis, recovery, sideways.
[17]	Hêche et al., 2025 (arXiv)	Distributional RL (natural gas futures)	CVaR-based distributional RL objectives	Distributional aleatoric uncertainty only	Partial	CVaR, return metrics. Commodity stress testing.
		GROUP C—Explicit Coupling: Uncertainty Directly Conditions Risk Decisions
[18]	Lina et al., 2024 (ICCCNT—IEEE)	Deep RL (EUDRL—explainable uncertainty-based)	Uncertainty-conditioned reward via local agents	Explicit—local agents provide uncertainty assessments integrated via SHAP explainability	Explicit	Portfolio performance + uncertainty metrics. Portfolio management scenarios.
[19]	Hao et al., 2024 (IEEE ICA)	Deep RL (epistemic + distributional uncertainty)	Uncertainty-informed risk decisions	Explicit—epistemic (MC Dropout) and aleatoric uncertainty modeled jointly	Explicit	Uncertainty metrics, trading performance. S&P 500 with recession periods.
[20]	Park et al., 2024 (IEEE TNNLS)	MARL (RSMAN—risk-sensitive multiagent)	Risk-sensitive decisions via uncertainty estimation (RSA + RAPG)	Explicit—market + parameter uncertainty directly condition risk-adaptive portfolio generation	Explicit	Sharpe, cum. return, risk-adjusted metrics. Multiple market periods tested.
[21]	Khemlichi et al., 2025 (Math. Model. Eng. Probl.)	Hierarchical MARL (BNN-based—uncertainty-aware)	Uncertainty-aware dynamic decisions (BNN + PPO)	Explicit—BNN uncertainty propagated to allocation decisions across agents	Explicit	Sharpe, return, uncertainty calibration metrics. Multi-sector portfolio.

Note: Among the 55 fully assessed studies coded on both risk and uncertainty dimensions, only 5 (9%) explicitly couple uncertainty estimation with risk constraint mechanisms, while 38 (69%) treat risk and uncertainty as structurally independent components. This empirically confirms the central decoupling argument of this review.
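
To make the distinction between the coupling levels in Table 1 concrete, the following is a minimal sketch of what "explicit coupling" could look like in code: an epistemic-uncertainty proxy (ensemble disagreement) tightens the CVaR budget before the budget is enforced on exposure. It is an illustration under stated assumptions, not the method of any reviewed study; the function names, the `sensitivity` parameter, the budget-scaling rule, and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def empirical_cvar(losses, alpha=0.95):
    """Average of the worst (1 - alpha) share of losses (positive = loss)."""
    ordered = sorted(losses, reverse=True)
    tail_size = max(1, round(len(ordered) * (1 - alpha)))
    return mean(ordered[:tail_size])


def ensemble_disagreement(predictions):
    """Population std of ensemble forecasts, used here as an epistemic proxy."""
    return pstdev(predictions)


def coupled_exposure(base_exposure, cvar, cvar_budget, uncertainty, sensitivity=5.0):
    """Explicit coupling: uncertainty tightens the CVaR budget before enforcement.

    The scaling rule budget / (1 + sensitivity * uncertainty) is an assumed,
    illustrative choice.
    """
    effective_budget = cvar_budget / (1.0 + sensitivity * uncertainty)
    if cvar <= effective_budget:
        return base_exposure
    # Scale exposure down proportionally so the tightened budget is respected.
    return base_exposure * effective_budget / cvar


# Illustrative numbers only (not taken from any reviewed study).
losses = [0.01] * 18 + [0.05, 0.07]
cvar_90 = empirical_cvar(losses, alpha=0.90)

# Ensemble members agree: the nominal budget applies and exposure is unchanged.
calm = coupled_exposure(1.0, cvar_90, 0.08, ensemble_disagreement([0.02, 0.02, 0.02]))
# Ensemble members disagree: the budget tightens and exposure is reduced.
unsure = coupled_exposure(1.0, cvar_90, 0.08, ensemble_disagreement([0.0, 0.2]))
```

A non-coupled design, in the sense of Group A, would correspond to calling `coupled_exposure` with `sensitivity=0.0`, so that the same CVaR budget is enforced regardless of how much the ensemble members disagree.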

Table 2. Inclusion and exclusion criteria applied at each stage of the literature selection process.

Stage	Inclusion Criteria	Exclusion Criteria
Title/abstract screening	Addresses RL or DRL applied to portfolio management, trading, or asset allocation; published 2016–2025; English language; peer-reviewed or rigorous preprint	Focuses solely on prediction without sequential decision-making; non-English; duplicate; outside 2016–2025
Full-text assessment	Addresses at least one of: RL architecture for portfolio optimization, risk modeling strategy (reward shaping, constraints, CVaR), uncertainty estimation (Bayesian, ensemble, MC Dropout); sufficient methodological transparency to extract data on ≥3 of the 6 coding dimensions	Prediction-only without RL decision layer; insufficient methodological detail; non-indexed or low-quality venue
Final quality check	Published in indexed journal or major peer-reviewed conference; clear empirical or methodological contribution; dataset and setup sufficiently described	Predatory venue; opinion-based without methodological substance; direct self-replication without new contribution

Table 3. Architectural taxonomy of RL paradigms for portfolio optimization across risk modeling, uncertainty integration, human–AI interaction, and structural limitations.

Architecture Type	Risk Modeling Strategy	Uncertainty Integration	Human–AI Interaction Mode	Deployment Robustness Level	Structural Limitations
Single-Agent RL	Reward shaping (volatility, turnover penalties)	Rare or implicit; typically deterministic	Advisory mode	Low	Conflation of signal processing and risk control; no explicit uncertainty conditioning
Deep RL (Centralized DRL)	Reward penalties or soft constraints	Occasional ensemble/dropout; rarely calibrated	Advisory/Constraint-guided	Low–Moderate	End-to-end opacity; weak interpretability; risk not conditioned on epistemic confidence
Hierarchical RL (HRL)	Risk budgeting at strategic layer; tactical reward shaping	Local uncertainty estimates; limited cross-layer propagation	Shared-control	Moderate	Poor uncertainty aggregation across layers; coordination complexity
Multi-Agent RL (MARL)	Agent-level constraints; partial global alignment	Implicit via agent disagreement; no formal aggregation	Shared-control/Constraint-guided	Moderate	Coordination instability; fragmented risk perception; endogenous non-stationarity
Modular & Hybrid Architectures	Dedicated risk modules; dynamic risk scaling	Module-specific uncertainty; partial fusion	Uncertainty-aware escalation/Shared-control	Moderate–High (theoretical)	Integration gap between uncertainty signals and capital allocation constraints
Unified Risk–Uncertainty–Modular Frameworks (Emerging)	Explicit constraint conditioning on epistemic confidence	Bayesian or calibrated probabilistic propagation across modules	Structured human-in-the-loop	High (conceptual)	Limited empirical validation; computational complexity; benchmark scarcity

Table 4. Critical Comparative Taxonomy of Reinforcement Learning Architectures for Portfolio Optimization.

Architecture Type	Typical Objective	Risk Treatment	Uncertainty Modeling	Signal Integration	Evaluation Rigor	Deployment Readiness	Structural Limitation
Single-Agent RL	Return/Sharpe maximization	Reward shaping (volatility penalties)	None or implicit	Monolithic (price-based)	Simple backtests	Prototype	Fragile under regime shifts; opaque decision logic
Deep RL	Return/risk-adjusted return	Reward shaping; soft constraints	Rare; mostly deterministic	End-to-end representation	Limited regime testing; few multi-seed reports	Low	Reward sensitivity; overfitting; weak uncertainty calibration
Hierarchical RL (HRL)	Strategic + tactical objectives	Sometimes constraint-based	Rarely propagated across layers	Structured but not modular	Partial regime analysis	Conceptual	Credit assignment complexity; weak uncertainty flow
Multi-Agent RL (MARL)	Global portfolio objective	Implicit risk sharing	Rare; mostly absent	Distributed asset-level policies	Heterogeneous; rarely standardized	Experimental	Coordination instability; non-stationarity between agents
Risk-Constrained RL (CMDP-based)	Return under explicit constraints (CVaR, drawdown)	Hard or soft constraints	Often absent	Monolithic	Improved robustness testing	Medium (cost-aware sometimes)	Risk handled separately from uncertainty
Distributional RL	Return distribution optimization	Quantile/CVaR-based	Distributional outcome modeling	Monolithic	Limited stress testing	Prototype	Distribution learning ≠ calibrated epistemic uncertainty
Bayesian/Ensemble RL	Confidence-aware optimization	Usually via reward penalties	Epistemic (approximate)	Monolithic	Rare calibration metrics	Low	Uncertainty rarely integrated into action constraints
Modular/Multi-Modal RL	Adaptive multi-signal optimization	Varies	Localized per module (sometimes)	Explicit modular fusion	Very heterogeneous	Exploratory	Lack of unified risk-uncertainty integration
Unified Risk–Uncertainty–Modular (Emerging)	Joint performance + risk + confidence	Integrated constraints	Propagated uncertainty	Dynamic fusion	Rare end-to-end evaluation	Not yet mature	Still largely conceptual; missing benchmark standardization

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Khemlichi, F.; Idrissi Khamlichi, Y.; Elhaj Ben Ali, S. Human–AI Collaboration in Risk- and Uncertainty-Aware Portfolio Reinforcement Learning: A Critical Review. Information 2026, 17, 476. https://doi.org/10.3390/info17050476

