1. Introduction
Deep reinforcement learning (DRL) has become a core part of modern artificial intelligence (AI), combining the representational power of deep learning with the sequential decision-making capabilities of reinforcement learning (RL) [
1,
2,
3]. Over the past decade, DRL has enabled significant advances in game playing [
4], robotics [
5], autonomous navigation [
6], and simulation-driven control [
7]. In these systems, agents learn through interaction with their environment, optimizing long-horizon behavior under uncertainty using reward-guided objectives. Classic RL methods, including Q-learning and policy gradients, have historically struggled with high-dimensional state spaces. DRL addresses this limitation by incorporating deep neural networks into value functions, policies, and world models, enabling effective representation learning from raw sensory inputs. Milestones include Deep Q-Networks (DQN) [
8], which achieved human-level Atari performance, and model-based agents such as Dreamer [
9], which learn latent dynamics for long-horizon planning.
Meanwhile, foundation models (FMs) have reshaped generative AI, with large language models (LLMs) such as GPT and Claude demonstrating robust performance, emergent reasoning capabilities, and multimodal understanding [
10,
11]. Foundation models, although developed primarily for supervised and self-supervised learning, incorporate reinforcement mechanisms during their refinement. The most prominent example is reinforcement learning from human feedback (RLHF), which played a crucial role in aligning models such as instruction-tuned LLMs [
12]. Reinforcement learning from AI feedback (RLAIF) extends this paradigm by using strong teacher models to generate preference judgments. Beyond alignment, recent work uses DRL to support tool use, interactive decision-making, and embodied action in FM-driven agents, illustrating a deepening connection between the two research areas.
Despite rapid progress, the literature at the intersection of DRL and foundation models remains fragmented: alignment-focused reviews tend to emphasize feedback pipelines and preference optimization, while DRL-focused reviews emphasize learning dynamics, stability, and generalization in control settings. As a result, it is often unclear how pretrained priors, feedback sources, and reinforcement signals jointly shape agent behavior when foundation models are placed inside closed-loop decision-making systems [
13,
14,
15,
16,
17,
18,
19].
This fragmentation leaves gaps in the current literature. To the best of our knowledge, no prior survey provides a consolidated synthesis of how DRL and foundation models interact across alignment, tool-augmented behavior, multimodal reasoning, embodied control, and scientific discovery—domains where reinforcement-driven adaptation is becoming central to model reliability, safety, and real-world applications. In order to fill these gaps, this survey offers a comprehensive and integrated examination of DRL in the era of foundation models. This review is timely given the accelerating use of reinforcement mechanisms in FM training and the increasing deployment of FM-driven systems in real-world settings. The key contributions are as follows:
A unified taxonomy of DRL–FM integration, distinguishing model-centric, RL-centric, and hybrid paradigms, and clarifying their implications for representation, optimization, and interaction.
A detailed synthesis of application domains including language and multimodal agents, autonomous and robotic control, scientific discovery, and societal and ethical alignment.
A unified evaluation and benchmarking contribution that consolidates metrics, protocols, and stress-tests for comparing DRL–FM systems across tasks, safety, robustness, and efficiency dimensions.
An in-depth analysis of the challenges that limit scalable and safe DRL–FM integration, alongside forward-looking research opportunities with emphasis on stability, interpretability, and trustworthy DRL-driven FM systems.
The rest of the paper is organized as follows:
Section 2 summarizes related surveys.
Section 3 provides background on DRL and FMs, and
Section 4 presents a taxonomy of DRL–FM integration.
Section 5 reviews technical foundations of DRL in FM contexts.
Section 6 examines key applications.
Section 7 discusses open challenges and future directions, and
Section 8 concludes the study.
Review Methodology
This survey follows a structured and systematic review protocol to ensure comprehensive coverage and reproducibility. The Scopus, IEEE Xplore, SpringerLink, arXiv, and ACM Digital Library databases were queried for studies published between January 2020 and March 2025, using keywords such as “deep reinforcement learning”, “foundation models”, “RLHF”, “alignment”, and “autonomous agents”. An initial set of 612 studies was retrieved. After duplicate removal and title–abstract screening, 184 papers were shortlisted for full-text review. Exclusion criteria included works without explicit DRL components, alignment discussions unrelated to foundation models, or redundant workshop versions.
After full-text screening, 28 were excluded, leaving 156. The final corpus comprised 126 primary studies, 18 review papers, and 12 benchmark proposals. Each article was annotated by research focus (model-centric, RL-centric, or hybrid), methodological contribution (training, optimization, or evaluation), and application domain (language/multimodal, control, scientific, or societal). This coding schema enabled cross-sectional comparison across integration paradigms and performance criteria.
Bibliometric analysis was performed to capture publication trends, author contributions, and venue distribution. Citation data were normalized by year to control for recency bias. Qualitative synthesis focused on identifying conceptual convergence between reinforcement learning and foundation model alignment, while quantitative evidence was summarized using citation frequencies and methodological clusters. The resulting taxonomy and discussion sections are grounded in these systematically collected insights.
2. Related Reviews
The research landscape on reinforcement learning and foundation models has expanded rapidly, producing several surveys that address related but distinct themes. For example, Kaufmann et al. [
13] conducted one of the most comprehensive reviews of RLHF, covering the full pipeline from preference data collection to reward modeling and policy optimization. Their work identified key limitations in human supervision scalability and emphasized the necessity of automated or hybrid feedback systems. In a related study, Wang et al. [
14] presented a detailed taxonomy of large model alignment techniques, including RLHF and RLAIF. They discussed how divergence constraints, reward modeling strategies, and reference policies influence alignment stability. Zhou et al. [
15] extended this discussion by focusing on alignment for large language model (LLM) agents, emphasizing preference-driven optimization, social alignment, and decision-making under uncertainty. These reviews clarify the mechanics of alignment but devote limited attention to the role of DRL as a theoretical and algorithmic foundation underlying feedback-based optimization.
Beyond alignment, Plaat et al. [
16] surveyed the emerging paradigm of agentic large language models that reason, act, and interact across open environments. They categorized research into reasoning, tool-augmented behavior, and multi-agent coordination, highlighting progress toward autonomous systems capable of planning and execution. While their analysis captures the capabilities of modern foundation models, it pays little attention to reinforcement-driven adaptation, sample efficiency, and policy optimization, which are core aspects of DRL that enable these agentic behaviors to scale reliably.
From the DRL perspective, several foundational reviews have provided valuable insights but primarily focus on algorithmic aspects independent of foundation models. Landers and Doryab [
19] surveyed verification methods for DRL, proposing taxonomies for robustness evaluation and formal safety guarantees. Their contribution complements alignment research by introducing verification standards but does not explore FM-driven policy learning or preference-based fine-tuning. Similarly, Moerland et al. [
17] offered an exhaustive overview of model-based reinforcement learning, detailing how learned dynamics and uncertainty-aware planning bridge learning and control. While this work establishes the foundation for modern world-model agents, it predates the tight coupling between DRL and FMs that now enables reasoning across multimodal representations.
Offline and data-efficient learning have also been the subject of active review. Prudencio et al. [
20] and Levine et al. [
18] surveyed offline reinforcement learning, emphasizing conservative policy estimation, distributional robustness, and dataset quality—concepts increasingly relevant when optimizing foundation models using static preference datasets or synthetic feedback. In robotics, Tang et al. [
21] summarized the application of DRL to real-world control, including sim-to-real transfer, hierarchical policies, and safety-aware exploration. Their findings demonstrate the practical viability of DRL systems and acknowledge the growing influence of pretrained representations; however, the review remains narrowly focused on robotics rather than general-purpose foundation model integration.
Complementary domains have examined interpretability and explainability. Puiutta and Veith [
22] provided a systematic survey of explainable reinforcement learning (XRL), classifying interpretation methods and evaluation protocols for analyzing learned policies. Their work informs how transparency can be embedded within reinforcement-driven systems but does not address the challenges of aligning foundation models using reinforcement objectives.
Overall, these surveys (summarized in
Table 1) form a broad yet disjointed picture: alignment-focused works often overlook DRL’s algorithmic role, while DRL-focused surveys rarely engage with foundation model alignment or multimodal reasoning. Motivated by this disconnect, this survey focuses on the coupling mechanisms that link pretrained foundations with reinforcement-driven adaptation in practice: how representations, planners/policies, world models, and feedback signals are composed inside the reinforcement loop. This lens is then used to organize the remainder of the paper, connecting technical foundations to application patterns and the recurring challenges that limit scalable and trustworthy DRL–FM systems.
4. A Taxonomy of DRL–Foundation Model Integration
This section proposes a taxonomy for how DRL and foundation models are coupled within the reinforcement loop. Rather than classifying systems by application domain, we distinguish integration patterns by where pretrained representations, planning, and feedback-driven adaptation enter the learning pipeline. This framing supports a consistent comparison of methods spanning language-conditioned control, multimodal robotics, and alignment-oriented fine-tuning.
4.1. Paradigms of Integration
The integration of DRL and FMs can be categorized into three primary paradigms based on how learning, perception, and reasoning are coupled within the reinforcement loop. These paradigms are: (i) FM-centric DRL architectures, where foundation models act as policy, planner, or world-model components; (ii) RL-centric foundation models, where DRL drives fine-tuning and alignment of large pretrained models; and (iii) hybrid or multimodal frameworks, where both paradigms are unified in interactive or embodied agents. Here, “FM-centric” denotes integration in which a pretrained foundation model is the primary policy/planner, in contrast with classical model-based RL, where the central object is an explicit environment dynamics model learned for planning. To reduce boundary ambiguity, we distinguish paradigms by (a) which module performs deployment-time action selection or planning, and (b) which parameters are primarily shaped by reward- or preference-driven objectives (i.e., whether the FM itself is the main object being optimized versus an attached controller). When the dominant “model” is a transition model trained primarily from scratch for planning, the setting aligns with classical model-based RL rather than FM-centric integration, even if pretrained encoders are used. The following subsections describe each paradigm in detail.
4.1.1. FM-Centric DRL Architectures
In this paradigm, the foundation model forms the central computational substrate for policy learning or planning, while reinforcement learning supplies task- and environment-specific adaptation around that pretrained backbone. Concretely, a pretrained foundation model is embedded directly into the reinforcement learning pipeline, often serving as the policy $\pi_\theta$, planner, or environment model. Rather than learning policy parameters $\theta$ from scratch, the FM provides pretrained representations $\phi_{\mathrm{FM}}(s)$ that encode semantic or multimodal information from the state space [45,46]. Because downstream control quality depends on the robustness of these learned representations under complex scenes and appearance variation, advances in robust feature learning for perception provide relevant grounding for FM-centric agent design [47]. The policy optimization objective then becomes
$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right], \qquad \pi_\theta(a_t \mid s_t) = f_\theta\big(\phi_{\mathrm{FM}}(s_t)\big).$$
Here, $\phi_{\mathrm{FM}}$ represents the feature extractor or encoder inherited from the foundation model, and $f_\theta$ maps these embeddings to actions. This formulation allows the policy to benefit from the pretrained model’s generalization while still being fine-tuned for downstream control or decision tasks. A notable example is the Decision Transformer, which formulates RL as sequence modeling. The model predicts an action sequence $a_{1:T}$, with each action conditioned on prior states $s_{\le t}$, rewards-to-go $\hat{R}_t$, and previous actions, using an autoregressive transformer:
$$a_t \sim \pi_\theta\big(a_t \mid \hat{R}_{\le t},\, s_{\le t},\, a_{<t}\big).$$
By leveraging pretrained transformer architectures, such methods transform the RL problem into a supervised learning task over trajectory datasets [
34]. Other systems such as Gato [
48], SayCan [
49], PaLM-E [
50], and RT-2 [
51] further extend this concept by integrating visual, textual, and proprioceptive modalities into unified policies. These approaches treat the FM as a differentiable policy backbone that grounds language and perception into control signals, resulting in generalist agents capable of zero-shot task adaptation.
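To make the FM-centric pattern concrete, the following minimal PyTorch sketch wires a frozen encoder standing in for $\phi_{\mathrm{FM}}$ to a small trainable action head $f_\theta$; the class names, dimensions, and random-projection encoder are illustrative placeholders rather than any specific published system.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained foundation-model encoder phi_FM(s).

    In practice this would wrap a vision-language or language backbone;
    a random linear projection is used here purely to illustrate the interface.
    """
    def __init__(self, obs_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(obs_dim, embed_dim)
        for p in self.parameters():          # freeze the pretrained weights
            p.requires_grad = False

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.proj(obs)

class FMCentricPolicy(nn.Module):
    """pi_theta(a|s) = f_theta(phi_FM(s)): only the head f_theta is trained."""
    def __init__(self, encoder: FrozenEncoder, n_actions: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        with torch.no_grad():                # representations stay fixed
            z = self.encoder(obs)
        return torch.distributions.Categorical(logits=self.head(z))

# usage: only the head parameters receive RL gradients (e.g., a policy-gradient loss)
policy = FMCentricPolicy(FrozenEncoder(obs_dim=32), n_actions=4)
dist = policy(torch.randn(8, 32))
actions = dist.sample()
loss = -(dist.log_prob(actions) * torch.randn(8)).mean()  # placeholder advantages
loss.backward()
```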
When foundation models are used as world-model components in this paradigm, they typically serve as pretrained generative priors or simulators that complement (or initialize) learned transitions, rather than mirroring the classical setting where the transition model is the primary object learned from scratch for planning. Accordingly, FM-centric approaches may incorporate pretrained models as differentiable world models $\hat{T}_\psi(s_{t+1} \mid s_t, a_t)$, which enable simulation-based policy optimization. The learning objective in such cases is to minimize model prediction error while maximizing long-term returns under imagined trajectories:
$$\mathcal{L}(\psi, \theta) = \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim \mathcal{D}}\!\left[\big\|s_{t+1} - \hat{T}_\psi(s_t, a_t)\big\|^2\right] - \lambda\, \mathbb{E}_{\pi_\theta,\, \hat{T}_\psi}\!\left[\sum_{t} \gamma^{t}\, r(s_t, a_t)\right].$$
This dual objective, used in systems like DreamerV3 [
9], enables efficient planning and transfer learning by combining learned dynamics with the representational priors of foundation models.
4.1.2. RL-Centric Foundation Models
In RL-centric paradigms, reinforcement learning provides the optimization mechanism for fine-tuning and aligning foundation models. The model (policy) $\pi_\theta$ generates an output $y$ given a prompt $x$, and a learned reward model $r_\phi(x, y)$ provides feedback based on human or synthetic preferences. The objective function is expressed as
$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\Big[r_\phi(x, y) - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big],$$
where $\pi_{\mathrm{ref}}$ is a reference model (often the pretrained base model) and $\beta$ regulates the degree of deviation. This general framework underlies methods such as RLHF, RLAIF, and DPO, which optimize large models for qualities like helpfulness, coherence, and safety [
13,
14]. Furthermore, in this paradigm, the foundation model itself is the primary object being updated by reinforcement objectives, rather than being embedded as a planner or controller within an external environment loop. While these methods originated in natural language processing, they extend naturally to multimodal systems, where feedback may include visual, auditory, or behavioral signals. In essence, DRL provides the mechanism through which pretrained models learn value alignment and behavioral consistency beyond pure likelihood optimization.
4.1.3. Hybrid and Multimodal Frameworks
Hybrid paradigms couple FM-centric decision-making with reinforcement-driven adaptation in interactive settings, unifying model-centric and RL-centric elements: pretrained reasoning, planning, and representation learning are combined with environment-facing control and feedback. These frameworks operate across multiple modalities—text, vision, and action—enabling agents that can perceive, reason, and act in real-world settings. Systems such as Voyager [52] exemplify this approach: foundation models provide high-level reasoning and planning, while DRL components handle continuous interaction, feedback, and environment adaptation. The composite objective typically combines a supervised pretraining loss $\mathcal{L}_{\mathrm{SL}}$ with reinforcement objectives $\mathcal{L}_{\mathrm{RL}}$:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{SL}} + \lambda\, \mathcal{L}_{\mathrm{RL}},$$
where $\lambda$ controls the trade-off between imitation learning and reinforcement-driven improvement. This paradigm supports the development of embodied agents capable of learning from both instruction data and real-time feedback, paving the way for scalable autonomous systems that integrate perception, cognition, and control.
The paradigms are distinguished by whether foundation models drive deployment-time decision-making, whether reinforcement objectives primarily reshape the foundation model itself, or whether both mechanisms coexist in an interactive loop with distinct planning and control roles. Collectively, these three paradigms illustrate complementary pathways for integrating DRL with foundation models: model-centric designs emphasize representational transfer and sample efficiency, RL-centric paradigms focus on alignment and optimization, and hybrid frameworks seek unification through multimodal reasoning and closed-loop interaction. Together, they provide the foundational structure for understanding how reinforcement learning and large-scale pretraining coalesce in the next generation of intelligent agents.
4.2. Architectural Patterns
Architectural innovations underpin the practical realization of DRL–FM integration. These architectures determine how perception, memory, reasoning, and control interact across model-centric, RL-centric, and hybrid paradigms. The key trend is the unification of sequence modeling, representation learning, and world modeling within a single policy structure. In addition to these macro-architectures, recent work revisits the choice of function approximators used inside policies, critics, and dynamics models, which can materially affect interpretability and parameter efficiency.
4.2.1. Transformer-Based Architectures
Transformer-based architectures dominate modern DRL–FM integration due to their ability to model temporal dependencies and multimodal context. By interpreting reinforcement learning trajectories as sequences of tokens—comprising states, actions, and rewards—transformer policies can learn long-term dependencies across time steps [
53,
54,
55]. A general autoregressive policy can be expressed as
$$\pi_\theta(a_t \mid s_{\le t}, a_{<t}, r_{<t}) = \mathrm{softmax}\big(W_a h_t\big), \qquad h_t = f_\theta(s_{\le t}, a_{<t}, r_{<t}),$$
where $h_t$ represents the hidden representation computed by a transformer encoder–decoder and $W_a$ projects the latent state to an action distribution. This sequence-based design enables direct adaptation of large transformer backbones, such as GPT-style or T5 architectures, for policy modeling and trajectory prediction.
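As a concrete illustration of this sequence-based design, the sketch below implements a small causal transformer over interleaved return-to-go, state, and action tokens; the tokenization scheme, dimensions, and module names are assumptions made for the example rather than a reference implementation.

```python
import torch
import torch.nn as nn

class TrajectoryTransformerPolicy(nn.Module):
    """Causal transformer over interleaved (return-to-go, state, action) tokens.

    h_t is the hidden state at each state token; action_head plays the role of
    W_a, here regressing continuous actions from h_t.
    """
    def __init__(self, state_dim, act_dim, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)                       # ..., R_t, s_t, a_t, ...
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.encoder(tokens, mask=mask)           # causal self-attention
        h_state = h[:, 1::3]                          # hidden state at each s_t token
        return self.action_head(h_state)              # predicted a_t for every step

policy = TrajectoryTransformerPolicy(state_dim=17, act_dim=6)
pred = policy(torch.randn(2, 10, 1), torch.randn(2, 10, 17), torch.randn(2, 10, 6))
```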
4.2.2. Hierarchical Architectures
Hierarchical architectures extend this formulation by separating decision-making into high-level planning and low-level control. A hierarchical policy can be formalized as a two-level structure:
$$\pi(a_t \mid s_t) = \sum_{g} \pi_{\mathrm{high}}(g \mid s_t)\, \pi_{\mathrm{low}}(a_t \mid s_t, g),$$
where $\pi_{\mathrm{high}}$ denotes the high-level policy that selects abstract goals or latent skills $g$, and $\pi_{\mathrm{low}}$ represents the low-level controller [50]. Foundation models contribute to this hierarchy by encoding task semantics or linguistic goals that condition $\pi_{\mathrm{high}}$, while DRL learns $\pi_{\mathrm{low}}$ through continuous interaction. This design underlies agents such as PaLM-E [50] and RT-2 [51], where multimodal encoders feed symbolic intentions into continuous controllers.
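A minimal sketch of this two-level structure is shown below, with a high-level network that re-selects a latent goal at a coarse timescale and a low-level controller conditioned on the current state and goal; the goal dimensionality, re-planning horizon, and network sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalPolicy(nn.Module):
    """pi_high picks a latent goal g every `horizon` steps; pi_low maps (s, g) to actions."""
    def __init__(self, state_dim, goal_dim, act_dim, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.pi_high = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                     nn.Linear(64, goal_dim))
        self.pi_low = nn.Sequential(nn.Linear(state_dim + goal_dim, 64), nn.Tanh(),
                                    nn.Linear(64, act_dim))
        self._goal = None

    def act(self, state: torch.Tensor, step: int) -> torch.Tensor:
        if step % self.horizon == 0 or self._goal is None:
            self._goal = self.pi_high(state)          # re-plan at a coarse timescale
        return self.pi_low(torch.cat([state, self._goal], dim=-1))

policy = HierarchicalPolicy(state_dim=12, goal_dim=4, act_dim=3)
action = policy.act(torch.randn(1, 12), step=0)
```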
4.2.3. World-Model Architectures
World-model architectures form another critical pattern. These systems jointly learn a latent dynamics model $p_\psi(z_{t+1} \mid z_t, a_t)$ and a policy that plans within the learned latent space. The combined objective typically includes both reconstruction and return-maximization terms [56,57]:
$$\mathcal{L}(\psi, \theta) = \mathbb{E}\!\left[\big\|o_t - \hat{o}_t\big\|^2\right] - \beta\, \mathbb{E}_{\pi_\theta,\, p_\psi}\!\left[\sum_{t} \gamma^{t} r_t\right],$$
where $\beta$ controls the trade-off between predictive accuracy and reward optimization. DreamerV3 [
9] and MuZero [
33] are canonical examples, with recent variants incorporating pretrained vision–language encoders or large language models (LLMs) as priors for richer latent representations.
4.2.4. Kolmogorov–Arnold Networks
A recent architectural development is the Kolmogorov–Arnold network (KAN), proposed as a potential replacement for multilayer perceptron (MLP) blocks in function approximation. KANs shift learnable nonlinearities from fixed activations at nodes to learnable univariate functions on edges, often parameterized using spline bases, while nodes primarily aggregate additive contributions. This design has been argued to support more interpretable component functions and, in some regimes, improved parameter efficiency relative to standard MLPs [29]. A canonical Kolmogorov–Arnold form can be written as
$$f(\mathbf{x}) = \sum_{q=1}^{2n+1} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right),$$
where $\mathbf{x} = (x_1, \dots, x_n)$ denotes the input vector with $n$ scalar components; $\phi_{q,p}$ are learnable univariate component functions; and $\Phi_q$ are learnable univariate outer functions that map the aggregated inner sum to a scalar contribution, and practical KAN implementations generalize this template to wider and deeper networks with spline-parameterized edge functions [
29]. In DRL–FM systems, KANs are relevant wherever MLPs act as default approximators: (i) actor and critic heads in actor–critic methods, (ii) reward and dynamics heads in world-model agents, and (iii) auxiliary prediction modules used for representation learning or uncertainty modeling. This creates a DRL-specific mapping in which KANs can serve as policy approximators, value approximators, or components of learned environment models, while leaving higher-level FM backbones unchanged. From a practical standpoint, KAN performance can be sensitive to basis choices, grid resolution, regularization, and training dynamics, and recent guidance consolidates these implementation trade-offs and positioning for practitioners [
30].
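The following sketch illustrates the KAN idea of learnable univariate edge functions in a form that could drop in for an MLP critic head; the Gaussian radial-basis parameterization and grid size are simplifying assumptions rather than the spline construction used in published KAN implementations.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Each edge (i -> j) applies a learnable univariate function phi_ji(x_i);
    output node j sums the edge contributions (no fixed nodewise activation).
    Here phi is a linear combination of Gaussian radial basis functions on a grid.
    """
    def __init__(self, in_dim: int, out_dim: int, n_basis: int = 8):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1.0, 1.0, n_basis),
                                    requires_grad=False)
        self.width = 2.0 / (n_basis - 1)
        # one coefficient vector per edge: (out_dim, in_dim, n_basis)
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim) -> basis features: (batch, in_dim, n_basis)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # sum over input dimension and basis index for every output node
        return torch.einsum("bik,oik->bo", basis, self.coef)

# e.g., a KAN-based critic head for value estimation in an actor-critic agent
critic = nn.Sequential(KANLayer(in_dim=16, out_dim=32), KANLayer(32, 1))
value = critic(torch.rand(4, 16) * 2 - 1)   # inputs roughly in the basis range
```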
4.2.5. Retrieval-Augmented and Memory-Based Architectures
Finally, retrieval-augmented and memory-based architectures enhance sample efficiency and reasoning depth by coupling DRL policies with external knowledge [
58,
59]. In such systems, a retriever module $R$ fetches relevant context from a memory bank $\mathcal{M}$ given a query $q$, producing augmented state representations $\tilde{s}_t = \big[s_t;\, R(q, \mathcal{M})\big]$. The policy then conditions on $\tilde{s}_t$ rather than raw observations, allowing foundation models to act as dynamic knowledge bases. This mechanism supports long-term credit assignment and continual learning—core requirements for scalable agentic intelligence [
60,
61].
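A minimal sketch of this retrieval pattern is given below, where a cosine-similarity retriever selects the top-k entries from a memory bank and the augmented state concatenates them with the observation embedding; the embedding sizes and memory contents are illustrative.

```python
import numpy as np

def retrieve(query: np.ndarray, memory: np.ndarray, k: int = 2) -> np.ndarray:
    """R(q, M): return the k memory vectors most similar to the query (cosine)."""
    sims = memory @ query / (np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-8)
    top = np.argsort(-sims)[:k]
    return memory[top]

def augment_state(obs_embedding: np.ndarray, memory: np.ndarray, k: int = 2) -> np.ndarray:
    """s_tilde = [s ; R(q, M)], with the observation embedding reused as the query."""
    retrieved = retrieve(obs_embedding, memory, k)
    return np.concatenate([obs_embedding, retrieved.reshape(-1)])

# usage: the policy would condition on s_tilde instead of the raw observation
memory_bank = np.random.randn(100, 32)      # e.g., encoded past trajectories or documents
s_tilde = augment_state(np.random.randn(32), memory_bank, k=2)
print(s_tilde.shape)                        # (32 + 2 * 32,) = (96,)
```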
4.3. Interaction and Feedback Loops
Interaction and feedback loops determine how DRL–FM systems adapt over time. Human-in-the-loop supervision provides preference labels, rankings, or evaluative signals that guide alignment and policy refinement [
62,
63]. These signals are typically converted into preference datasets that support reward modeling or direct optimization methods [
13,
14]. Furthermore, AI-in-the-loop feedback extends this paradigm by using teacher models to generate synthetic evaluations, improving scalability and consistency across tasks [
64,
65]. Self-improvement mechanisms further enable policies to critique their own outputs, adjust task difficulty, and iteratively generate new training samples [
66,
67]. These feedback structures support continual learning and autonomy in open-ended environments.
These feedback mechanisms—human-in-the-loop, AI-in-the-loop, and self-improving loops—form the backbone of adaptive intelligence in DRL–FM systems. They transform static pretrained models into interactive, evolving agents capable of learning from evaluation, reasoning over feedback, and refining their policies over time. These interaction patterns complement the architectural foundations discussed earlier, together defining the dynamic learning ecosystem that underlies scalable, aligned, and generalizable reinforcement-driven foundation models.
5. Training, Alignment, and Optimization Methods
Having outlined the major paradigms through which DRL and foundation models interact, the next step is to examine the underlying mechanisms that enable these systems to learn from preference signals, structured rewards, and interaction dynamics. The training and alignment methods discussed in this section operationalize the taxonomy presented earlier by detailing how policies are optimized, how reward signals are constructed, and how reinforcement objectives shape the behavior of foundation models in practice.
5.1. Reward Modeling and Preference Learning
Reward modeling forms the foundation of alignment in DRL and foundation model integration. It provides the evaluative signal that drives policy optimization, ensuring model behavior reflects human or task-specific values [
68,
69,
70]. Classical reinforcement learning assumes access to an explicit scalar reward $r(s, a)$ that captures task success. In contrast, foundation models operate in high-dimensional, ambiguous domains—such as language, vision, or multimodal reasoning—where reward functions are not predefined but must instead be learned from preference data or human feedback.
Reward modeling uses the same preference-learning formulation introduced in
Section 4.1.2, where a reward model is trained from pairwise preferences and subsequently used to guide policy optimization. In DRL–FM systems, this component becomes the primary mechanism for shaping high-level behaviors such as reasoning quality, factuality, and safety. The overall RLHF framework is illustrated in
Figure 2. A pretrained language model generates candidate outputs that are evaluated by a learned reward or preference model trained from human feedback [
62]. The trained policy model is then updated via PPO using the reward signal and a Kullback–Leibler (KL) regularization term, which constrains deviations from the base model to preserve linguistic fluency and prevent exploitation of the reward model [
71,
72]. The KL divergence term used in RLHF optimization is expressed as
$$\mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right],$$
where $\pi_\theta$ denotes the current policy and $\pi_{\mathrm{ref}}$ the frozen reference model. This regularization encourages the updated model to remain aligned with its pretrained distribution while integrating reinforcement signals effectively.
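The sketch below shows how this KL-regularized signal is typically assembled in practice, combining a scalar reward-model score with a per-token KL estimate between the current policy and the frozen reference model; the tensors are placeholders rather than outputs of any particular model.

```python
import torch

def kl_penalized_reward(reward_model_score: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Return r_phi(x, y) - beta * KL(pi_theta || pi_ref) for each sequence.

    logprobs_* hold per-token log-probabilities of the sampled response y,
    shape (batch, seq_len); the KL term is estimated from their difference.
    """
    kl_per_token = logprobs_policy - logprobs_ref          # log pi_theta - log pi_ref
    kl_estimate = kl_per_token.sum(dim=-1)                 # sum over response tokens
    return reward_model_score - beta * kl_estimate

# usage with placeholder tensors (a real pipeline would score y with r_phi and
# gather token log-probabilities from the policy and reference models)
scores = torch.tensor([1.2, -0.3])
lp_pi = torch.randn(2, 16) - 2.0
lp_ref = torch.randn(2, 16) - 2.0
shaped = kl_penalized_reward(scores, lp_pi, lp_ref, beta=0.1)
```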
Modern systems often adopt multi-objective reward modeling to balance competing values such as helpfulness, harmlessness, and honesty. The composite reward is represented as
$$r_{\mathrm{total}}(x, y) = \sum_{i} w_i\, r_i(x, y),$$
with $r_i$ representing individual reward components and $w_i$ their corresponding priorities. This flexible formulation enables dynamic trade-offs among multiple alignment criteria during training. Recent extensions include language-conditioned rewards, where evaluative criteria are expressed through natural language rather than explicit labels [73,74]. The reward is computed using an evaluator model $E$ queried with a meta-instruction $p_{\mathrm{meta}}$ describing the desired criterion:
$$r(x, y) = E\big(p_{\mathrm{meta}}, x, y\big),$$
thus converting open-ended qualitative judgments into quantitative reinforcement signals. Despite these advances, reward-model exploitation and bias amplification remain persistent challenges.
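As a simple illustration of these two reward constructions, the following sketch combines weighted component rewards into a composite signal and wraps an evaluator model behind a meta-instruction query; the component names and the evaluator interface are assumptions for the example.

```python
from typing import Callable, Dict

def composite_reward(components: Dict[str, float],
                     weights: Dict[str, float]) -> float:
    """r_total(x, y) = sum_i w_i * r_i(x, y) over named reward components."""
    return sum(weights[name] * value for name, value in components.items())

def language_conditioned_reward(evaluator: Callable[[str], float],
                                meta_instruction: str,
                                prompt: str,
                                response: str) -> float:
    """Query an evaluator model with a meta-instruction and return a scalar score."""
    query = f"{meta_instruction}\n\nPrompt: {prompt}\nResponse: {response}"
    return evaluator(query)

# usage with a stub evaluator standing in for an FM-based judge
stub_evaluator = lambda text: 0.8
r_helpful = language_conditioned_reward(
    stub_evaluator, "Rate helpfulness from 0 to 1.", "Explain KL divergence.", "...")
r_total = composite_reward(
    components={"helpfulness": r_helpful, "harmlessness": 0.9, "honesty": 0.7},
    weights={"helpfulness": 0.5, "harmlessness": 0.3, "honesty": 0.2},
)
```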
5.2. Policy Optimization in the FM Era
Policy optimization defines how learned rewards translate into improved model behavior. In the context of foundation models, reinforcement learning typically fine-tunes a pretrained policy $\pi_\theta$ using feedback from a reward model $r_\phi$ [
68,
75]. The optimization objective follows the KL-regularized policy update described previously, where reinforcement signals from the reward model are balanced with deviation constraints from the reference policy. The KL regularization term penalizes excessive divergence from the pretrained distribution, maintaining linguistic fluency and stability during optimization.
This structure forms the basis of several modern policy optimization methods summarized in
Table 3. Among these, PPO remains the most widely adopted approach for alignment, performing gradient updates with clipped probability ratios to ensure stable improvement. More recent methods, such as Direct Preference Optimization (DPO) and Implicit Preference Optimization (IPO) [
76], extend this foundation by reformulating preference-based optimization objectives. DPO removes explicit reward modeling by directly encoding preference differences, whereas IPO leverages implicit likelihood reparameterization for improved convergence properties.
Across alignment pipelines, PPO, DPO, and IPO differ systematically along three practical axes: stability, sample efficiency, and sensitivity to hyperparameters. PPO is typically stable when its clipping thresholds and KL regularization are well tuned, but its on-policy nature makes it sensitive to reward-model noise, learning-rate schedules, and KL coefficients, while also incurring high sample and compute costs due to repeated rollouts. In contrast, DPO avoids on-policy rollouts entirely and operates directly on preference pairs, yielding substantially higher sample efficiency and a simpler optimization pipeline, but its stability depends strongly on preference consistency, evaluator calibration, and dataset quality. IPO further modifies the preference-based objective to smooth optimization dynamics and reduce variance, often improving convergence behavior and reducing sensitivity relative to DPO, while still inheriting fundamental limitations tied to preference noise, objective scaling, and data coverage.
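For concreteness, the sketch below implements the standard DPO objective over preference pairs using sequence log-probabilities under the policy and the frozen reference model; the numeric inputs are placeholders, and a real pipeline would gather these log-probabilities from the respective models.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: -log sigmoid(beta * [(logpi(y_w) - logref(y_w)) - (logpi(y_l) - logref(y_l))])."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# usage: summed token log-probabilities of the chosen/rejected responses per pair;
# only the policy log-probabilities carry gradients in a real pipeline
logp_w = torch.tensor([-12.0, -9.5], requires_grad=True)
logp_l = torch.tensor([-14.0, -9.0], requires_grad=True)
loss = dpo_loss(logp_w, logp_l,
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -9.8]))
loss.backward()
```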
Offline or implicit reinforcement learning further adapts this paradigm by learning from static datasets rather than online interactions [
18,
77]. The objective for offline RL is formulated as
$$J_{\mathrm{offline}}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{static}}}\!\big[\,r_\phi(x, y)\, \log \pi_\theta(y \mid x)\,\big],$$
where $\mathcal{D}_{\mathrm{static}}$ denotes pre-collected human or synthetic preference data. This approach enhances safety and reproducibility by decoupling policy updates from real-time exploration, but limits adaptability under distribution shift. A further development involves hybrid optimization, where behavior cloning (BC), a supervised imitation-learning method, is often combined with reinforcement learning for stability [78]. The total objective is defined as
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{BC}} + \lambda\, \mathcal{L}_{\mathrm{RL}},$$
with $\lambda$ controlling the trade-off between imitation learning and reinforcement adaptation. In summary, these methods indicate that policy optimization in the FM era is governed less by novel reinforcement algorithms and more by managing trade-offs between stability, efficiency, preference quality, and optimization sensitivity at scale.
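A minimal sketch of the hybrid BC-plus-RL objective is shown below, pairing a cross-entropy imitation term on demonstration actions with an advantage-weighted policy-gradient term; the network, data, and advantage estimates are placeholders.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))  # logits over 4 actions

def hybrid_loss(states, demo_actions, advantages, lam: float = 0.5) -> torch.Tensor:
    """L_total = L_BC + lambda * L_RL (cross-entropy imitation + policy-gradient term)."""
    logits = policy(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    bc_loss = nn.functional.cross_entropy(logits, demo_actions)       # imitate dataset actions
    taken_logp = log_probs.gather(1, demo_actions.unsqueeze(1)).squeeze(1)
    rl_loss = -(advantages * taken_logp).mean()                       # reward-weighted improvement
    return bc_loss + lam * rl_loss

states = torch.randn(16, 8)
demo_actions = torch.randint(0, 4, (16,))
advantages = torch.randn(16)          # e.g., from a learned critic or reward model
loss = hybrid_loss(states, demo_actions, advantages)
loss.backward()
```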
5.3. Safety, Robustness, and Constraints
Safety and robustness remain critical challenges in DRL–FM integration. Constrained reinforcement learning imposes explicit limits on risk or undesired behavior through cost signals and thresholded expectations [
79,
80]. In reward-model–driven alignment, a central failure mode is a feedback-loop dynamic: systematic bias in the reward model can be exploited by the policy during optimization, leading to progressive bias amplification and, in extreme cases, reward hacking under distribution shift [
81]. These constraints help prevent policies from exploiting weaknesses in reward models or drifting toward unsafe actions.
Adversarial training and red-teaming evaluate robustness by exposing models to perturbed inputs, adversarial prompts, or challenging edge cases [
82]. Beyond one-off stress tests, periodic audits and targeted red-team cycles can detect reward-hacking behaviors that only surface after repeated policy improvement, when the policy learns to optimize evaluator blind spots rather than the intended objective [
81,
83]. Mitigations that reduce bias amplification include reward-model ensembles, uncertainty-aware reward signals, and adversarially curated preference data, which make overoptimization harder and increase the chance that failure modes are surfaced before deployment.
Interpretability methods, including policy probing and rationale extraction, further support safety auditing by revealing the internal decision features that drive policy behavior. In practice, these tools are most effective when paired with constraint mechanisms that bound optimization pressure, so that evaluator imperfections do not propagate unchecked into policy behavior [
84,
85].
5.4. Scaling Laws and Data Efficiency
Scaling behavior in DRL–FM systems follows empirical regularities analogous to those observed in supervised and self-supervised learning [
86,
87]. The expected performance $E$ can often be approximated as a power-law function of model size $N$, data volume $D$, and environment complexity $C$:
$$E(N, D, C) \approx k \cdot N^{\alpha} D^{\beta} C^{\gamma},$$
where $\alpha$, $\beta$, and $\gamma$ are scaling exponents and $k$ is a task-dependent constant [
86]. Empirical studies have shown that reward quality and sample efficiency improve sublinearly with scale until diminishing returns set in [
88,
89]. These relationships guide model design and resource allocation for large reinforcement-trained systems.
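As a simple illustration of how such scaling exponents can be estimated, the sketch below fits $\alpha$, $\beta$, and $\gamma$ by ordinary least squares in log space on synthetic data; the generated observations and exponent values are purely illustrative.

```python
import numpy as np

# synthetic observations of performance E for different (N, D, C) configurations
rng = np.random.default_rng(0)
N = rng.uniform(1e7, 1e10, size=50)     # model size
D = rng.uniform(1e5, 1e8, size=50)      # data volume
C = rng.uniform(1.0, 100.0, size=50)    # environment-complexity proxy
E = 0.3 * N**0.08 * D**0.12 * C**-0.05 * rng.lognormal(0, 0.05, size=50)

# log E = log k + alpha log N + beta log D + gamma log C  ->  ordinary least squares
X = np.column_stack([np.ones_like(N), np.log(N), np.log(D), np.log(C)])
coef, *_ = np.linalg.lstsq(X, np.log(E), rcond=None)
log_k, alpha, beta, gamma = coef
print(f"alpha={alpha:.3f}, beta={beta:.3f}, gamma={gamma:.3f}, k={np.exp(log_k):.3f}")
```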
Data efficiency remains an active research frontier. Techniques such as synthetic preference generation, experience replay, and active sampling have been developed to improve learning from limited feedback. Synthetic data generation uses teacher models or simulators to produce preference pairs $(y^{+}, y^{-})$, which augment scarce human annotations while preserving reward diversity. Experience replay buffers past trajectories for off-policy optimization, ensuring better sample reuse. Active learning strategies adaptively select queries for which model uncertainty is highest, thus maximizing information gain per feedback instance.
Lastly, scaling and efficiency insights reveal that the effectiveness of DRL–FM systems depends not only on architectural capacity but also on the quantity, diversity, and quality of feedback data. As models grow larger and environments more complex, balancing computational cost with reward fidelity and safety alignment becomes the central optimization challenge.
7. Challenges, Solutions, and Future Research Directions
As DRL becomes increasingly intertwined with foundation models, substantial technical, methodological, and governance-related challenges remain unresolved. The issues are not isolated; they interact across optimization, evaluation, human supervision, and societal deployment. This section consolidates the central open problems and outlines research pathways capable of advancing the integration of DRL and FMs toward more reliable, generalizable, and socially aligned systems.
7.1. Challenges and Open Problems
The integration of DRL with FMs introduces several unresolved challenges spanning optimization, supervision, interaction dynamics, and governance. These challenges reflect underlying tensions between the flexibility of reinforcement learning and the scale, opacity, and social impact of modern FMs.
7.1.1. Optimization Instability
A core difficulty lies in the instability of reinforcement-based optimization when applied to foundation-scale policies. Algorithms such as PPO and DPO remain sensitive to hyperparameter choices, reward shaping, and small errors in reward model predictions. Even minor inaccuracies can induce reward hacking or behavioral drift, especially in high-dimensional policy spaces where updates propagate unpredictably [
119]. As model size increases, controlling these instabilities becomes more challenging, and current techniques lack mechanisms for ensuring consistent improvement during iterative alignment.
7.1.2. Credit Assignment in Language-Conditioned Environments
Another structural challenge concerns credit assignment when rewards depend on long-horizon linguistic or multimodal criteria. Many DRL–FM tasks require assessing latent constructs—factual accuracy, ethical compliance, or multistep tool use—but the scalar rewards typically available provide insufficient granularity for identifying which intermediate actions, tool decisions, or reasoning steps contributed to success [
50,
52]. In practice, reward models and automated evaluators often score an entire response or episode, producing delayed, noisy, and sometimes contradictory signals; this causes “credit diffusion” where many distinct trajectories receive similar scores, while small evaluator errors can dominate the learning signal. When the environment is language-conditioned, the action space expands to token- or instruction-level decisions, and the state can be partially implicit (e.g., hidden assumptions, private intermediate reasoning, or external tool side effects), making standard temporal-difference style assignment unreliable under sparse or delayed feedback.
Emerging approaches attempt to restore structure by moving from step-local rewards to trajectory-aware and decomposition-aware supervision. Trajectory-level reward modeling assigns value to coherent segments (plans, subgoals, tool-use phases) rather than isolated steps, while hierarchical policies and subgoal-conditioned critics restrict the search space so that rewards map to fewer, more interpretable decisions. Complementary methods add process-level feedback via verifier signals, intermediate checkpoints, or structured preference comparisons over partial solutions, aiming to densify supervision without introducing brittle hand-crafted shaping. However, these designs remain vulnerable to evaluator gaming and distribution shift: as policies improve, they generate out-of-distribution reasoning patterns that can break learned evaluators, leading to unstable updates or misaligned shortcuts.
A second line of work explores causally informed credit assignment, treating tool calls and reasoning interventions as potential causes and estimating which decisions materially change outcomes via counterfactual rollouts, causal graphs, or influence-style attributions [
132]. In parallel, self-critique mechanisms embed an internal critic—often an FM-based verifier or revision loop—that tests, edits, or rejects intermediate steps before execution, converting sparse task rewards into richer self-generated supervision [
119]. These mechanisms can improve learnability in long-horizon settings, but they introduce new failure modes, including critic–policy collusion, over-optimization to the critique format, and increased compute and governance burden for auditing critic reliability at scale.
7.1.3. Scalability and Reliability of Human Feedback
High-quality human preference data remain expensive, inconsistent, and difficult to scale. RLAIF [
119] alleviates the annotation bottleneck but introduces new risks: synthetic feedback inherits the blind spots, cultural biases, and normative assumptions of the teacher models that generate it. As a result, feedback pipelines face a tension between scalability and representational fidelity. The lack of systematic tools for measuring these distortions complicates efforts to build durable and trustworthy alignment mechanisms.
7.1.4. Emergent Multi-Agent Dynamics
Large models increasingly interact with one another in multi-agent ecosystems—through communication, negotiation, shared tool use, or competitive tasks. These interactions can yield emergent behaviors such as collusion, deception, or manipulative strategies [
125]. Standard single-agent RL algorithms are not designed to detect or constrain such dynamics, and existing multi-agent RL techniques lack proven stability guarantees at FM scale. Without robust mechanisms for shaping cooperative norms and monitoring emergent strategies, FM-driven ecosystems risk producing unintended and difficult-to-audit behaviors.
7.1.5. Reproducibility, Transparency, and Governance Limitations
The field continues to face persistent reproducibility and governance gaps. Variation in pretraining corpora, reward model architectures, evaluator instructions, and computational budgets can significantly alter experimental outcomes [
43,
133]. These discrepancies hinder the comparability of results and limit the ability of auditors to evaluate alignment claims. Transparent reporting, standardized evaluation, and reproducible pipelines remain underdeveloped, creating barriers to responsible deployment in socially sensitive environments.
7.2. Unified DRL–FM Evaluation Framework
A recurring limitation in DRL–FM research is that evaluation practice does not match system complexity: agents are trained with multimodal perception, tool use, preference feedback, and long-horizon interaction, yet are often reported with narrow single-domain metrics. This subsection proposes a unified evaluation framework intended to be usable across classical control, embodied environments, and language-first agentic settings. The goal is not to replace existing benchmarks, but to standardize how results are reported so that claims become comparable across environments, feedback sources, and deployment contexts.
7.2.1. Design Principles
Cross-domain comparability: report a shared set of metric groups so results from control, web agents, and multimodal settings can be compared along common axes.
Closed-loop and offline evaluation: evaluate both interactive behavior under environment dynamics and offline behavior under fixed datasets or preference logs, because many systems mix both regimes.
Safety and robustness as first-class objectives: treat unsafe behavior, policy exploits, and adversarial fragility as primary outcomes rather than secondary notes.
Governance and reproducibility artifacts: publish minimum reporting details (seeds, compute, data, evaluator prompts, and reward-model settings) required to reproduce headline results.
7.2.2. Metric Groups and Minimum Outputs
We recommend reporting metrics in seven groups. Each group includes example measures that can be instantiated across environment types.
Capability and return: success rate, episodic return, task completion rate, constraint satisfaction rate, and trajectory-level goal completion.
Efficiency: environment samples to threshold, wall-clock time to threshold, training tokens, compute proxy measures, regret curves, and evaluation-time cost.
Generalization: performance under out-of-distribution splits, procedural generalization, tool or API shift, environment shift, and evaluator shift.
Alignment and helpfulness: preference win-rate, refusal correctness, harmlessness rate, truthfulness proxies, and instruction-following reliability under constraints.
Credit assignment quality: stability of step- or segment-level attributions under perturbations, consistency of identified causal steps across seeds, and sensitivity of outcomes to intermediate action edits [
19,
22].
Robustness and security: adversarial prompts, observation perturbations, reward-model attacks, jailbreaking resistance, and traceability checks such as watermark or provenance hooks when applicable.
Reproducibility: seed sensitivity, variance across runs, reporting of hyperparameters and reward-model details, and ablations that isolate the contribution of the FM and the reinforcement component.
7.2.3. Evaluation Protocol Template
To reduce ambiguity across papers, evaluation should explicitly specify:
Interaction mode: online, offline, or mixed, and the exact point at which preference or reward signals enter the loop.
Evaluator type: human, model-based, rule-based, or hybrid, including prompt templates and any safety policies used by evaluators.
Aggregation: mean and dispersion across seeds, and the number of independent runs used for reported metrics.
Failure analysis: a mandatory set of failure modes aligned with the benchmark blueprint in
Table 8.
7.2.4. Benchmark Blueprint
Table 8 provides a blueprint that links environment types to required metric groups, recommended protocols, and typical failure modes. It is intended as a minimum standard that authors can extend rather than a restrictive checklist.
7.2.5. Minimum Reporting Standard
To make results auditable and comparable, we recommend a minimum reporting standard that includes: (i) model and adapter details, including which components are frozen or fine-tuned; (ii) reward or preference pipeline details, including evaluator prompts and reward-model training data sources; (iii) training regime and compute proxies; (iv) number of seeds and variance measures; and (v) a short failure-mode report aligned with
Table 8. This standard is designed to be lightweight enough for survey-derived comparisons while still enabling reproducibility and governance review.
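One lightweight way to operationalize this standard is to ship a structured, machine-readable report alongside results, as in the sketch below; the field names mirror items (i)–(v) but are illustrative rather than a fixed schema.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Dict
import json

@dataclass
class DRLFMReport:
    """Machine-readable record of the minimum reporting standard (items i-v)."""
    model_components: Dict[str, str]          # e.g., {"backbone": "frozen", "adapter": "fine-tuned"}
    reward_pipeline: Dict[str, str]           # evaluator prompts, reward-model data sources
    training_regime: Dict[str, str]           # optimizer, steps, compute proxies
    seeds: List[int] = field(default_factory=list)
    variance: Dict[str, float] = field(default_factory=dict)   # metric -> std across seeds
    failure_modes: List[str] = field(default_factory=list)     # aligned with the benchmark blueprint

report = DRLFMReport(
    model_components={"backbone": "frozen", "policy_head": "fine-tuned"},
    reward_pipeline={"evaluator": "model-based", "preference_source": "synthetic"},
    training_regime={"algorithm": "DPO", "gpu_hours": "120"},
    seeds=[0, 1, 2],
    variance={"success_rate": 0.021},
    failure_modes=["reward-model overoptimization under prompt shift"],
)
print(json.dumps(asdict(report), indent=2))
```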
7.3. Future Research Directions
Solving these problems requires progress across modeling, optimization, evaluation, and governance. Foundation-scale world models offer one avenue: systems such as DreamerV3 [
9] illustrate how temporally coherent latent dynamics can support long-horizon credit assignment, reduce reliance on sparse scalar rewards, and stabilize optimization. Extending such techniques to FM architectures could provide shared state representations that integrate perception, reasoning, and action.
A second direction centers on self-reflective and self-improving agents. Reflection and critique mechanisms—such as iterative self-evaluation and revision pipelines [
95,
119]—may mitigate reward hacking, correct reasoning errors, and facilitate consistent long-horizon behavior. Embedding these loops within recurrent or memory-augmented architectures would provide structured pathways for internal error detection, enabling more robust credit assignment and reducing fragility in policy updates.
Further progress may arise from neurosymbolic reinforcement learning, which integrates logical constraints, causal structure, and symbolic rules with neural policy optimization. Such approaches offer pathways toward interpretability and verifiability, addressing governance and reproducibility concerns while improving robustness in high-stakes settings. Complementary work in federated and resource-efficient DRL aims to reduce computational cost and improve privacy-preserving deployment, which becomes increasingly important as FMs proliferate across distributed infrastructures.
Sustainability and energy efficiency represent additional priorities. Energy-aware training and evaluation—consistent with principles from Green AI [
134,
135]—should become integral to reinforcement learning objectives, with carbon- and compute-aware reward shaping enabling more responsible scaling trajectories.
Finally, the development of sovereign, culturally adaptive, and legally grounded alignment practices is essential for global deployment. Region-specific reward functions, inclusive preference datasets, and auditable oversight structures [
136] are critical for mitigating bias propagation and ensuring that reinforcement-trained models operate within diverse normative contexts. These directions suggest a research landscape in which DRL functions not only as an optimization tool but as part of a broader governance framework shaping the evolution of foundation models. Meanwhile,
Table 9 summarizes the challenges presented in this section and the corresponding future research directions.
8. Conclusions
The growing interplay between DRL and FMs marks a significant shift in how artificial agents perceive, reason, and act across complex environments. This survey examined how DRL complements the broad generalization, semantic priors, and multimodal capabilities of FMs, enabling systems that extend beyond passive prediction to become adaptive, interactive, and goal-directed. Across language agents, embodied control, scientific discovery, and societal alignment, the combined use of pretrained representations and reinforcement-based optimization has produced substantial gains in reliability, task efficiency, and long-horizon competence.
Furthermore, DRL–FM integration faces substantial technical and governance-related challenges. Optimization instability, limited credit assignment for language-conditioned tasks, and the difficulty of scaling high-quality human feedback continue to constrain system reliability. Addressing these shortcomings will require not only algorithmic advances but also transparent experimental reporting, inclusive evaluation practices, and stronger institutional mechanisms for oversight.
A practical takeaway of this survey is the unified DRL–FM evaluation framework introduced in
Section 7, which operationalizes cross-domain comparability through shared metric groups, protocol templates, and a benchmark blueprint that links environment types to expected evaluation outputs and failure modes. The framework supports more consistent comparison across model-centric, RL-centric, and hybrid systems, and enables stronger auditing of safety, robustness, and governance claims.
Looking ahead, progress will likely come from coupling foundation-scale world models and structured long-horizon reinforcement learning with more reliable feedback and evaluation pipelines, while treating safety, reproducibility, and compute constraints as design objectives rather than afterthoughts. Culturally grounded governance and reward design further suggest that alignment is context-dependent, so DRL–FM systems must be evaluated and audited under the norms and deployment conditions they will actually face.