1. Introduction
Reinforcement Learning (RL) has evolved into a general-purpose framework for sequential decision-making, enabling agents to achieve superhuman mastery in board games [1], match or exceed expert play in video game arenas [2], and perform dexterous in-hand robotic manipulation in the physical world [3]. These milestones demonstrate that, once a task objective is encoded as a reward signal, an RL agent can autonomously discover effective policies across symbolic, high-dimensional, and embodied domains [4].
Although conceptually straightforward, the practical success of RL hinges on how well the reward signal captures the designer’s intent. Translating an abstract goal into a dense, informative scalar is notoriously difficult, particularly in real-world settings with sparse rewards, such as long-horizon robotic manipulation [5,6]. Consequently, practitioners often resort to manual reward engineering, but this demands substantial domain expertise and extensive trial-and-error, rendering it a primary bottleneck to scaling RL applications.
To relieve the burden of hand-crafting rewards, a promising research direction has emerged that leverages the compositional power of Large Language Models (LLMs) to translate natural-language task descriptions directly into executable reward code [7,8,9]. This approach represents a significant leap forward from earlier paradigms such as Inverse Reinforcement Learning (IRL) or preference-based methods [10,11,12], as it automates the reward engineering process in a more direct and scalable manner.
However, while this prevailing LLM-based approach addresses the bottleneck of manual design, it exposes and inherits a deeper, more fundamental limitation: the reward generation remains an open-loop, a priori specification process. The LLM acts as an offline programmer, generating the reward code based solely on the initial prompt, without any grounding in the agent’s subsequent, real-world learning dynamics. This initial specification, no matter how sophisticated, is essentially a static hypothesis.
LLM-generated reward code can contain hallucinated facts, omitted constraints, or logical inconsistencies; when adopted as a one-shot, static specification, such errors persist and can steer learning off-target [13,14]. More broadly, reliance on a static reward—whether hand-crafted or LLM-generated—exposes two recurring failure modes. First, misspecification can enable specification gaming, where agents exploit proxy loopholes while neglecting the intended objective [15]. Second, incompleteness induces extended “reward deserts” with vanishing gradients, stalling exploration and trapping policies in suboptimal regions [16]. These issues reflect a structural mismatch: the policy adapts with experience, whereas the reward is treated as a fixed specification. As the visitation distribution shifts during learning, there is no built-in mechanism to detect and correct emerging misalignment, leaving the system vulnerable to exploitation or stagnation [17]. Crucially, prevailing pipelines lack an in-situ, closed-loop procedure to diagnose and revise the reward specification as misalignment arises from the agent’s own learning dynamics.
To overcome this fundamental limitation, we introduce the Metacognitive Introspective Reward Architecture (MIRA), a framework that explicitly places LLM-generated reward code inside a closed-loop optimization process. Rather than treating the reward specification as a one-shot artifact, MIRA defines it as a revisable program that is continuously updated in response to the agent’s own learning dynamics. Most existing work either uses LLMs only once before training to synthesize reward code, or employs inner–outer dual-loop optimization architectures in which an outer loop adapts the policy or its learning process based on training performance [18,19,20], while assuming a fixed reward specification throughout learning. In contrast, MIRA places the reward factors themselves inside the optimization loop: the inner loop learns a potential-based shaping signal over a given factorization, and the outer loop performs metacognitive diagnosis and semantic-level revisions of that factorization. This closes the feedback loop between downstream learning and upstream reward specification; in this work, we instantiate such a closed-loop, LLM-mediated reward revision architecture, operating under explicit potential-based shaping constraints, in the form of MIRA.
This closed-loop design allows the reward structure to co-evolve with the policy under empirical feedback. Our contributions are as follows:
We propose MIRA, a metacognitive reward architecture that places LLM-generated reward code inside a closed-loop optimization process. Instead of using the LLM as a one-shot reward generator, MIRA treats the reward specification as a typed, revisable program and uses the LLM as an in-the-loop semantic editor of this program.
We instantiate this architecture with two tightly coupled mechanisms: (a) an inner-loop reward shaping module that learns state-only, potential-based shaping rewards over a set of high-level reward factors, aligned to sparse extrinsic returns, and (b) an outer-loop metacognitive module that monitors trajectory-level diagnostics, detects persistent anomalies via online density estimation, and triggers structured, Potential-Based Reward Shaping (PBRS)-compatible edits to the reward factor space through the LLM.
We provide extensive empirical validation showing that MIRA substantially outperforms leading baselines, including static LLM-generated rewards and other adaptive methods, in final task performance, sample efficiency, and robustness to initial reward misspecification. Ablation experiments further demonstrate that these gains arise from the full MIRA design, with both the metacognitive outer loop and the structured LLM-based editing interface making essential contributions.
The remainder of this paper is organized as follows:
Section 2 reviews related work on reward engineering, algorithmic credit redistribution, and LLM-driven agents.
Section 3 establishes the theoretical foundations, analyzing the practical dilemmas of sparse rewards and the reward decomposability hypothesis.
Section 4 details the proposed MIRA framework, describing the dual-loop architecture, the inner-loop reward synthesis, and the outer-loop metacognitive diagnosis and reframing mechanisms.
Section 5 presents the experimental setup and results, including comparative benchmarks on MuJoCo tasks, mechanistic case studies of self-correction, and ablation analyses. Finally, Section 6 concludes the paper and outlines directions for future research.
3. Sequential Decision-Making and the Challenges of Reward Design
Reinforcement-learning systems are control algorithms that iteratively refine their behavior through interaction with a complex and uncertain world. This section establishes the mathematical and conceptual foundations required to motivate a dynamic reward architecture.
Section 3.1 formalizes the agent–environment loop as a Markov Decision Process (MDP), establishing notation and performance criteria.
Section 3.2 reviews three practical dilemmas—reward sparsity, shaping risk, and multi-objective non-stationarity—that undermine conventional, static objective design.
Section 3.3 introduces the Reward Decomposability Hypothesis, which reframes reward design as a hierarchical problem and thus motivates the adaptive framework presented in Section 4.
3.1. Formalizing Sequential Decision-Making
We model the sequential decision problem as a finite-horizon Markov Decision Process (MDP), formally defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma, \rho_0)$, where [36,37] the following apply:
$\mathcal{S}$ is the state space, representing all possible environmental configurations.
$\mathcal{A}$ is the action space, containing all actions available to the agent.
$P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the state transition kernel, where $\Delta(\mathcal{S})$ is the set of probability distributions over $\mathcal{S}$; $P(s' \mid s, a)$ denotes the probability of transitioning to state $s'$ after taking action $a$ in state $s$.
$R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, yielding an immediate scalar reward $r_t = R(s_t, a_t)$.
$\gamma \in [0, 1)$ is the discount factor, which balances the trade-off between immediate and future rewards.
$\rho_0$ is the initial-state distribution from which the starting state $s_0$ is sampled.
The agent’s behavior is described by a stochastic policy $\pi(a \mid s)$, where $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$. A deterministic policy is a special case in which each distribution places all of its mass on a single action. Performance over a finite horizon $T$ is measured by the expected discounted return,
$$J(\pi) = \mathbb{E}_{\tau \sim p_\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t} R(s_t, a_t)\right],$$
where $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$ is a trajectory sampled under the distribution $p_\pi$ induced by policy $\pi$. The reinforcement-learning objective is to compute an optimal policy $\pi^{\ast} \in \arg\max_{\pi} J(\pi)$ that maximizes this return.
This formulation decouples the environment’s dynamics, $P$, from the task’s objectives, $R$. All learning algorithms implicitly assume that $R$ faithfully encodes the designer’s intent; any mis-specification in $R$ therefore biases the optimization target. Section 3.2 and Section 3.3 examine why, in practice, specifying a correct and informative reward is non-trivial and motivate an architecture that can revise $R$ during learning.
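To make the finite-horizon objective concrete, the following minimal Python sketch (our own illustration, not code from the paper) estimates $J(\pi)$ by Monte-Carlo rollouts in a Gymnasium-style environment; the environment interface and the `policy` callable are assumptions.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one finite-horizon trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def estimate_objective(env, policy, gamma=0.99, episodes=10, horizon=1000):
    """Monte-Carlo estimate of J(pi) = E_tau[ sum_t gamma^t R(s_t, a_t) ]."""
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        rewards = []
        for _ in range(horizon):
            action = policy(obs)  # placeholder: any callable mapping observation -> valid action
            obs, r, terminated, truncated, _ = env.step(action)
            rewards.append(r)
            if terminated or truncated:
                break
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```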
3.2. Practical Dilemmas in Reward Design
Although the MDP formalism is theoretically complete, practical RL success hinges on the informational content and semantic fidelity of the reward signal. Three recurring dilemmas—reward sparsity, shaping risk, and multi-objective non-stationarity—regularly undermine the efficacy of static reward specifications.
3.2.1. The Challenge of Sparsity and Credit Assignment
In many real-world tasks, informative feedback is inherently sparse and delayed: a mobile robot earns a reward only upon successful docking, or a game-playing agent only at victory. Such sparsity inflates the variance of return estimates and yields weak, uninformative gradients for exploration. To combat this, return-redistribution methods like RUDDER propagate terminal rewards back through the trajectory to accelerate learning [17], while exploration-bonus techniques such as Random Network Distillation (RND) [38] or Go-Explore [16] supply auxiliary signals to mitigate blind search. Yet, these methods operate under the critical assumption that the terminal reward is semantically correct. If the underlying goal is mis-specified, these methods will only accelerate the optimization of a flawed objective. This highlights the need for a mechanism that can correct the semantic definition of the reward itself, rather than merely improving credit assignment for a fixed signal.
3.2.2. The Risk of Shaping and Specification Gaming
While potential-based reward shaping preserves policy optimality in theory [21], its practical application relies heavily on designer heuristics. This introduces the risk of agents exploiting subtle loopholes in the proxy reward, a pathology known as specification gaming [15]. Empirical analyses, particularly in complex, open-ended environments, show that even carefully engineered rewards can be reverse-engineered and subverted by the learning process [22]. These failures expose a deeper issue: a fundamental semantic misalignment between the proxy reward and the true task intent. Traditional offline design pipelines are unequipped to address this, as they lack any mechanism to detect or amend such misalignment once training begins.
3.2.3. The Rigidity of Multi-Objective Trade-Offs
Real-world objectives typically blend competing priorities, such as safety, efficiency, and task completion. Encoding these into a single scalar reward requires weight-tuning that is often stage-dependent: broad exploration is desirable early in training, whereas precise control is essential for final convergence. Fixed weights induce a structural rigidity in guidance, leading to policies that may first under-explore and later over-explore, slowing overall progress [39]. While intrinsic-motivation methods can alleviate early-stage sparsity, they leave the primary extrinsic signal untouched [40]. This can result in the agent receiving conflicting objectives as learning proceeds. A truly alignment-preserving architecture must therefore adapt not only the factor weights but, when necessary, the semantic structure of the reward function itself.
3.3. The Reward Decomposability Hypothesis: A Basis for Autonomous Design
Taken together, these dilemmas demonstrate that a single, monolithic reward function is fundamentally insufficient to bridge the vast semantic gap between a designer’s high-level intent and the low-level mechanics of learning. We therefore ground our framework in the reward decomposability hypothesis: any sparse, episodic return can be effectively approximated by the sum of latent semantic factors, where each factor evaluates a specific, interpretable aspect of task performance under discounted accumulation. Formally, let $R_{\text{ext}}(\tau)$ denote the extrinsic episodic reward of a trajectory $\tau$ and $\gamma$ the discount factor. We posit that an ideal dense per-step reward $r_{\text{dense}}$ satisfies
$$R_{\text{ext}}(\tau) \approx \sum_{t=0}^{T-1} \gamma^{t}\, r_{\text{dense}}(s_t),$$
where the dense reward admits a factorization
$$r_{\text{dense}}(s) = \sum_{i=1}^{K} w_i\, f_i(s).$$
Here, each $f_i$ is an interpretable reward factor (e.g., “forward velocity,” “gripper alignment”), and $w_i$ is its non-negative weight. This decomposition, which generalizes successor-feature frameworks [41] and aligns with work on temporal-logic specifications [42], turns reward design into a hierarchical problem with the following three nested sub-tasks:
Factor-Space Generation: Given a high-level task description, construct an initial, semantically rich set of candidate factors $\mathcal{F}_0 = \{f_i\}$. Prior art relies on handcrafted features or logic templates [42]; we later show that an LLM can automate this step.
Parametric Alignment: Optimize the weights $w_i$ (or a low-rank parameterization thereof) so that the composed reward aligns with discounted, extrinsic signals. This is a well-studied problem addressed by methods like IRL [43] and preference-based fitting [11].
Structural Refinement: Detect when poor performance stems from missing or pathological factors in $\mathcal{F}$, rather than from suboptimal weights, and revise the factor space accordingly. This critical step is largely absent from current pipelines. As misalignment studies show [24], failure to modify the factor space itself is a primary cause of persistent reward hacking.
A system that solves only the first two sub-tasks remains vulnerable to semantic blind spots. True autonomy requires the ability to introspect and edit its own factor space. This capability for online structural refinement is the core innovation of our approach: by integrating monitoring with semantic reframing, the outer loop is designed specifically to address this third, most challenging sub-problem, enabling continual structural updates to the reward architecture during learning.
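As a minimal illustration of this factorization (our own sketch; the factor names, state indices, and weights below are hypothetical and not taken from the paper), a dense reward can be assembled as a weighted sum of interpretable, state-only factors.

```python
import numpy as np

# Hypothetical factor library: each factor maps a flat state vector to a scalar.
FACTORS = {
    "forward_velocity": lambda s: s[8],                  # assumed index of the x-velocity component
    "upright_posture":  lambda s: np.cos(s[1]),          # assumed torso-pitch component
    "control_economy":  lambda s: -np.sum(s[-6:] ** 2),  # assumed actuator-related components
}

WEIGHTS = {"forward_velocity": 1.0, "upright_posture": 0.5, "control_economy": 0.1}

def dense_reward(state):
    """r_dense(s) = sum_i w_i * f_i(s): weighted sum over interpretable factors."""
    return sum(WEIGHTS[name] * f(state) for name, f in FACTORS.items())
```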
4. MIRA: A Framework for Metacognitive Reward Design
As established in the preceding sections, a significant gap between the theory and practice of RL persists in reward engineering.
Section 3.2 systematically deconstructed the persistent challenges inherent in reward design: from the credit assignment problem stemming from informational sparsity, to the intrinsic risk of reward hacking in reward shaping, and the structural rigidity of a static reward function across different learning stages. These dilemmas converge on a practical challenge: a static reward function, defined once at the outset of training, is often ill-equipped to handle the complex, dynamic issues that emerge during the learning process.
The advent of LLMs offers a promising approach to the primary challenge of informational sparsity. By translating high-level natural language instructions into executable, dense reward functions [24], LLMs significantly lower the barrier to reward design. However, using the LLM only as a one-shot, offline reward generator overlooks a deeper, dynamic limitation. Its output, though dense, is fundamentally a static proxy, rendering it vulnerable to two practical risks. First, semantic misalignment: a static reward represents a single interpretation of the designer’s intent, where any initial error can create exploitable loopholes, leading to reward hacking. Second, structural rigidity: the reward function remains fixed and cannot adapt its incentive structure to the agent’s evolving competence from novice to expert.
To address these limitations, this section introduces the Metacognitive Introspective Reward Architecture (MIRA). The core idea of MIRA is to reconceptualize the reward function, transforming it from a static, predefined artifact into a dynamic system that is jointly optimized with the policy and endowed with computational metacognition. The architecture is composed of the following three components:
Semantics-Guided Initialization: The framework first leverages an LLM’s understanding of a natural-language task description to automatically synthesize a high-quality, interpretable, and knowledge-rich candidate reward factor space. This constitutes the initial step of translating human intent into machine-understandable value primitives.
Dynamic Policy–Reward Alignment: Within this factor space, MIRA employs a joint optimization objective to optimize the agent’s policy network and the reward architecture’s parameters. Policy learning provides the data for reward alignment, while the updated reward architecture provides a more precise guidance signal for the policy, forming a rapid inner loop of bootstrapped optimization.
Metacognitive-Driven Reframing: We introduce a metacognitive monitoring module that continuously and non-intrusively tracks a set of diagnostic metrics during policy learning. When these metrics indicate a learning pathology—such as policy oscillation, a divergence between the value function and true returns, or a collapse in reward factor attention—the system triggers an outer-loop structural adaptation mechanism. This mechanism invokes the LLM, providing the current failure mode as context, and requests a revision, augmentation, or reframing of the reward factor set, thereby enabling structural adaptation of the reward architecture itself.
Together, these components form a two-tiered optimization structure. The inner loop drives rapid policy learning, while the outer loop acts as a metacognitive supervisor. It continuously monitors the learning dynamics generated by the inner loop, diagnoses pathologies, and proactively triggers a semantic-level reconstruction of its own value system (i.e., the underlying structure of the reward function). This section details MIRA’s core architecture, the mathematical formalization of its key modules, and the complete learning algorithm. We will demonstrate how, by endowing the reward function with the capacity for in situ structural refinement, MIRA facilitates a shift from automated reward generation to autonomous reward adaptation.
4.1. Overall Architecture
MIRA is structured as a hierarchical system with a dual-loop feedback mechanism, as depicted in Figure 1. This design aims to simulate a computational metacognitive process: an inner loop responsible for policy optimization and reward alignment within the current value system, and an outer loop that monitors the learning state of the inner loop and, when necessary, performs a semantic-level reconstruction of its foundational value system—the reward factors themselves.
The architecture comprises the following four core modules:
Linguistic-Semantic Factor Synthesis (LSFS): This is the entry point of the system. This stage translates a high-level task description into machine-understandable value primitives. Specifically, the system constructs a structured prompt that integrates a role assignment for the LLM (e.g., “You are a senior robotics reward engineer”), the explicit task objective, and key environmental information, such as a formal description of the state space $\mathcal{S}$ and action space $\mathcal{A}$. Based on this input, the LLM generates an initial, semantically rich, and executable set of reward factors, $\mathcal{F}_0$.
Iterative Reward Alignment and Refinement (IRAR): This module forms MIRA’s inner loop. For a given factor set $\mathcal{F}_k$ (where $k$ denotes the outer-loop iteration), it jointly optimizes the policy network and the HARS module. In HARS, the state potential $\Phi(s)$ is produced from the factor attention and value mapping components, while the temporal context encoder serves as an auxiliary module for temporal smoothing and regularization. The shaped reward signal is then computed from $\Phi(s)$ using Potential-Based Reward Shaping (PBRS). This design ensures that HARS remains robust to variations in the size and ordering of the factor set, and can readily adapt to structural changes in $\mathcal{F}_k$ introduced by the outer loop.
Metacognitive Monitoring (MCM): This is MIRA’s diagnostic hub. It adaptively models a “healthy” learning state via online density estimation. When diagnostic metrics consistently form low-density anomalies, the system triggers an outer-loop reframing. Critically, it not only determines when to intervene but also provides actionable diagnostic evidence for the reward reframing process by passing the anomalous vector and its gradient.
Semantic Reward Reframing (SRR): Together with MCM, this module forms the outer loop. When triggered by MCM, this module encodes the current factor set $\mathcal{F}_k$ and the diagnostic report $d$ into a remedial prompt. This prompt is then submitted to the LLM to generate a revised or augmented factor set, $\mathcal{F}_{k+1}$.
Formally, MIRA addresses a nested, bi-level optimization problem as follows:
Inner-loop optimization. For a given, fixed reward factor set $\mathcal{F}_k$, the inner loop’s objective is to find the optimal policy and HARS parameters by minimizing a joint loss,
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{policy}} + \lambda_{\text{align}}\, \mathcal{L}_{\text{align}} + \lambda_{\text{temp}}\, \mathcal{L}_{\text{temp}}.$$
Here, $\mathcal{L}_{\text{policy}}$ is the standard policy-learning loss computed using the shaped rewards from the state-only potential $\Phi(s)$; $\mathcal{L}_{\text{align}}$ aligns $\Phi$ to extrinsic value targets; and $\mathcal{L}_{\text{temp}}$ regularizes the temporal encoder for temporal consistency/smoothing. The hyperparameters $\lambda_{\text{align}}$ and $\lambda_{\text{temp}}$ balance the objectives.
Outer-loop optimization. The outer loop searches over a discrete, structured space—the space of valid reward factor sets $\mathbb{F}$—to maximize the discounted extrinsic return,
$$\mathcal{F}^{\ast} = \arg\max_{\mathcal{F} \in \mathbb{F}} \; J_{\text{ext}}\big(\pi^{\ast}_{\mathcal{F}}\big),$$
where $\pi^{\ast}_{\mathcal{F}}$ denotes the policy obtained by the inner loop under factor set $\mathcal{F}$. As $\mathbb{F}$ is vast and discrete, gradient-based optimization is infeasible. MIRA employs LLM-mediated structural evolution, guided by MCM, as a solver for this high-level optimization: MCM identifies bottlenecks and proposes directions; SRR leverages the LLM’s reasoning to propose the next iterate, $\mathcal{F}_{k+1}$.
This dual-loop architecture enables learning on two distinct timescales and levels of abstraction: rapid numerical parameter tuning in the inner loop, and deliberate semantic structural evolution in the outer loop. In the following sections, we elaborate on the design and mechanics of each core loop.
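The division of labor between the two loops can be summarized by the following schematic sketch (our own illustration of the control flow only; the method names `inner_update`, `diagnose`, `persistent_anomaly`, `reframe`, and `reinit_factor_dependent_params` are placeholders rather than the paper's API).

```python
def mira_training(factors, policy, hars, mcm, srr, outer_cycles=10, inner_steps=1000):
    """Schematic control flow of MIRA's bi-level optimization (names are illustrative)."""
    for _ in range(outer_cycles):
        # Inner loop: policy optimization and reward alignment under the fixed factor set F_k.
        for _ in range(inner_steps):
            hars.inner_update(policy, factors)              # HARS shaping + RL update (Section 4.2)

        # Outer loop: metacognitive monitoring and, when triggered, semantic reframing.
        diagnostics = mcm.diagnose(policy, hars, factors)   # diagnostic vector (Section 4.3.1)
        if mcm.persistent_anomaly(diagnostics):             # KDE-based trigger (Section 4.3.2)
            candidate = srr.reframe(factors, diagnostics)   # LLM-mediated edits (Section 4.4)
            if candidate is not None:                       # accepted only after offline validation
                factors = candidate
                hars.reinit_factor_dependent_params()       # keep policy and TCE, re-init FAB/VPM
    return policy
```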
4.2. Inner Loop: Semantic Initialization and Dynamic Alignment
The inner loop is the primary learning mechanism, responsible for policy optimization and reward-architecture refinement under a given set of reward factors. It comprises two phases: a one-time semantic initialization and an ongoing iterative alignment process.
4.2.1. Linguistic–Semantic Factor Synthesis (LSFS)
This module translates a high-level, potentially underspecified task description into a precise and computable representation. To synthesize a multi-component reward architecture, we adapt Structured Chain-of-Thought (SCoT) prompting [44] to reward synthesis as follows:
Role and context priming: Assign the LLM a domain-expert role and provide the task objective together with specifications, constraining outputs to the task context.
Hierarchical task decomposition: Decompose the high-level objective into orthogonal sub-goals/phases to guide factor discovery and coverage.
Reward-factor identification: For each sub-goal, enumerate computable metrics over the available state (and, where necessary, action) variables, specifying desirable/undesirable behaviors and how they are measured.
Executable code synthesis and validation: Convert the natural-language factor descriptions into executable snippets that follow a predefined syntax. Prior to deployment, perform static analysis, sandboxed unit tests, and PBRS-compatibility checks to ensure factors are Markovian and side-effect free (no future dependence or environment mutation). All factor code is version-controlled to allow safe rollback in case of runtime anomalies.
This structured workflow constrains the LLM’s reasoning within a rigorous engineering process, improving reliability, interpretability, and code quality.
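A condensed example of what such a structured prompt might look like is given below (our own illustrative template; the exact wording and output schema used by the authors are not reproduced here).

```python
LSFS_PROMPT = """\
[Role] You are a senior robotics reward engineer.
[Task] {task_description}
[Environment] State space: {state_spec}. Action space: {action_spec}.
[Step 1] Decompose the task into orthogonal sub-goals or phases.
[Step 2] For each sub-goal, list computable metrics over the state (and, if needed, action),
         stating desirable/undesirable behaviors and how each is measured.
[Step 3] Emit each factor as a pure Python function of the current observation only
         (no future information, no environment mutation), with documented units and range.
Return the factors as a JSON list of {{"name", "code", "units", "range", "type"}} entries.
"""

prompt = LSFS_PROMPT.format(
    task_description="Make the half-cheetah run forward efficiently.",
    state_spec="17-dimensional vector of joint angles and velocities",
    action_spec="6-dimensional torque vector in [-1, 1]",
)
```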
4.2.2. Iterative Reward Alignment and Refinement (IRAR)
IRAR is the core mechanism that turns a fixed semantic factorization into a stable, adaptive guidance signal. By separating semantic valuation over factors from temporal stabilization, and by combining this with PBRS, which preserves policy optimality in theory [21], IRAR enables rapid inner-loop adaptation under a fixed factor set.
The inner loop optimizes a policy within a fixed factor set $\mathcal{F}_k$ via HARS (Hierarchical Attention-based Reward Shaping), which learns a potential $\Phi(s)$ that captures the long-horizon value of the current state (Figure 2). Let $\mathcal{F}^{\Phi}_k \subseteq \mathcal{F}_k$ denote the subset of state-only factors used in potential construction; at time $t$, each factor in this subset is evaluated on the current state $s_t$. Action-dependent factors, if present, are routed to the policy branch for regularization or auxiliary control, and are excluded from $\mathcal{F}^{\Phi}_k$ to preserve the theoretical invariance guarantees of PBRS. Each factor $i$ carries an identifier (one-hot or learned embedding) and metadata (e.g., units/range, expected polarity, monotonicity hints, normalization statistics). The Factor Attention Block (FAB) has parameters $\phi$ and includes the factor embedder, state projection, and attention matrix. The Temporal Context Encoder (TCE) has parameters $\psi$ and serves as an auxiliary stabilizer for temporal smoothing/consistency. The embedding dimension is $d$. The following apply:
HARS: Hierarchical Attention-based Reward Shaping.
Factor Attention Block (FAB; $\phi$): For each factor $i \in \mathcal{F}^{\Phi}_k$, we compute a key–value embedding $(k_{i,t}, v_{i,t})$ from its identifier, metadata, and current value. Let $\mathcal{K}_t = \{(k_{i,t}, v_{i,t})\}_{i \in \mathcal{F}^{\Phi}_k}$. A state-conditioned query $q_t$ attends to the set $\mathcal{K}_t$ via permutation-invariant pooling,
$$z_t = \sum_{i \in \mathcal{F}^{\Phi}_k} \alpha_{i,t}\, v_{i,t}, \qquad \alpha_{i,t} = \operatorname{softmax}_{i}\!\big(q_t^{\top} k_{i,t} / \sqrt{d}\big),$$
with an entropy regularizer on $\alpha_t$ to mitigate premature attention collapse.
Temporal Context Encoder (TCE; $\psi$): A lightweight GRU sequence encoder processes the stream of aggregated representations $\{z_t\}$ to produce an auxiliary smoothed representation $\bar{z}_t$ used for temporal consistency regularization.
Value Potential Mapping (VPM; $\varphi$): An MLP maps the instantaneous aggregation $z_t$ (optionally concatenated with a state projection) to the scalar potential $\Phi(s_t)$.
Potential-based reward shaping. We adopt PBRS with absorbing terminals to preserve optimal policies,
$$\tilde{r}_t = r^{\text{ext}}_t + F(s_t, s_{t+1}), \qquad F(s_t, s_{t+1}) = \gamma\, \Phi(s_{t+1}) - \Phi(s_t), \qquad \Phi(s_{\text{terminal}}) \equiv 0.$$
Under these conditions, shaped and extrinsic returns differ by a trajectory-independent constant, so the set of optimal policies is invariant.
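The following sketch (our own simplified re-implementation, with reduced dimensionality and no metadata handling) illustrates the computational path described above: attention pooling over state-only factor values, an MLP potential head, and the PBRS shaping term applied with the absorbing-terminal convention.

```python
import torch
import torch.nn as nn

class SimpleHARS(nn.Module):
    """Simplified sketch of FAB + VPM: attention over factor embeddings -> scalar potential."""
    def __init__(self, num_factors, state_dim, d_model=32):
        super().__init__()
        self.factor_embed = nn.Embedding(num_factors, d_model)   # factor identifiers
        self.value_proj = nn.Linear(1, d_model)                  # per-factor scalar value
        self.query = nn.Linear(state_dim, d_model)               # state-conditioned query
        self.potential_head = nn.Sequential(                     # VPM
            nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, factor_values):
        # state: (B, state_dim); factor_values: (B, K) state-only factor outputs
        B, K = factor_values.shape
        ids = torch.arange(K, device=state.device).expand(B, K)
        keys = self.factor_embed(ids) + self.value_proj(factor_values.unsqueeze(-1))
        q = self.query(state).unsqueeze(1)                                  # (B, 1, d)
        attn = torch.softmax((q * keys).sum(-1) / keys.shape[-1] ** 0.5, dim=-1)  # (B, K)
        pooled = (attn.unsqueeze(-1) * keys).sum(dim=1)                     # permutation-invariant pooling
        return self.potential_head(pooled).squeeze(-1), attn                # Phi(s), attention weights

def pbrs_shaped_reward(r_ext, phi_s, phi_next, done, gamma=0.99):
    """PBRS with absorbing terminals: Phi(terminal) := 0 so shaping vanishes at episode end."""
    phi_next = torch.where(done, torch.zeros_like(phi_next), phi_next)
    return r_ext + gamma * phi_next - phi_s
```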
Joint optimization objective. The policy parameters $\theta$ and HARS parameters $(\phi, \psi, \varphi)$ are trained jointly,
$$\mathcal{L}_{\text{total}}(\theta, \phi, \psi, \varphi) = \mathcal{L}_{\text{policy}}(\theta) + \lambda_{\text{align}}\, \mathcal{L}_{\text{align}}(\phi, \varphi) + \lambda_{\text{temp}}\, \mathcal{L}_{\text{temp}}(\psi),$$
where $\mathcal{L}_{\text{policy}}$ is the standard policy-learning loss (e.g., actor–critic) computed with shaped rewards $\tilde{r}_t$; $\mathcal{L}_{\text{align}}$ is a TD-style loss aligning $\Phi$ to extrinsic value targets without leakage from the shaping term,
$$\mathcal{L}_{\text{align}} = \mathbb{E}\Big[\big(\Phi(s_t) - \big(r^{\text{ext}}_t + \gamma\, (1 - \mathbb{1}[s_{t+1} \text{ terminal}])\, \hat{V}^{\text{ext}}(s_{t+1})\big)\big)^{2}\Big],$$
where the terminal mask prevents unbounded drift at absorbing terminals; and $\mathcal{L}_{\text{temp}}$ enforces temporal smoothing/consistency via
$$\mathcal{L}_{\text{temp}} = \mathbb{E}\big[\lVert z_t - \bar{z}_t \rVert^{2}\big],$$
where $z_t$ is the instantaneous FAB aggregation and $\bar{z}_t$ is the smoothed representation from TCE.
By decoupling semantic valuation from temporal stabilization and leveraging PBRS, IRAR provides a stable shaping signal that does not alter the optimal policy set while remaining adaptive to the current factorization. By minimizing $\mathcal{L}_{\text{total}}$, the system refines both the policy and HARS’s ability to predict long-term success. Upon an outer-loop structural update to $\mathcal{F}_k$, we retain the policy parameters ($\theta$) and the temporal encoder ($\psi$), while re-initializing the factor-dependent attention ($\phi$) and value mapping ($\varphi$).
4.3. Outer Loop I: Metacognitive Monitoring and Pathological Learning Diagnosis
The outer loop embodies metacognitive self-reflection. In its first stage, MCM serves as a perceptual layer that summarizes noisy inner-loop signals into low-frequency, diagnostically meaningful evidence. Rather than reacting to instantaneous fluctuations, MCM focuses on persistent regime shifts that indicate semantic misalignment between the learned guidance signal and the extrinsic objective. We organize training into outer-loop cycles $k = 1, 2, \ldots$, each aggregating a fixed budget of environment steps (or gradient updates) from the inner loop under a fixed evaluation protocol that produces a held-out batch with a frozen policy snapshot to avoid off-policy bias; scalar time series are exponentially smoothed. Throughout this section we write $\Phi(s)$ for the learned state-only potential (Section 4.2), $r^{\text{ext}}$ for the extrinsic reward, and $\hat{V}^{\text{ext}}$ for its Monte-Carlo or critic-based value estimate under the same frozen policy.
4.3.1. Diagnostic Vector Construction
At the end of cycle k, MCM constructs a diagnostic vector by concatenating calibrated metrics from three orthogonal axes: (1) policy dynamics, (2) potential–return agreement, and (3) reward-architecture adaptability. Before density modeling, each component is standardized with robust statistics (median/median absolute deviation (MAD)) over a rolling window and then smoothed by exponential moving average (EMA) as follows:
Policy dynamics. These metrics identify premature exploration stagnation or oscillatory updates.
Policy-entropy gradient ($g_H$): the finite-difference slope of the EMA-smoothed mean policy entropy across consecutive cycles,
$$g_H^{(k)} = \bar{\mathcal{H}}^{(k)} - \bar{\mathcal{H}}^{(k-1)}, \qquad \bar{\mathcal{H}}^{(k)} = \mathbb{E}_{s}\big[\mathcal{H}\big(\pi_k(\cdot \mid s)\big)\big].$$
Persistently large negatives indicate premature entropy collapse and risk of suboptimal fixation.
Temporal policy divergence ($d_{\pi}$): the symmetric Kullback–Leibler (KL) divergence between consecutive snapshots,
$$d_{\pi}^{(k)} = \mathbb{E}_{s}\Big[\tfrac{1}{2} D_{\mathrm{KL}}\big(\pi_k(\cdot \mid s) \,\|\, \pi_{k-1}(\cdot \mid s)\big) + \tfrac{1}{2} D_{\mathrm{KL}}\big(\pi_{k-1}(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\big)\Big].$$
For strictly deterministic policies (e.g., DDPG), $d_{\pi}$ uses Gaussian surrogates whose means are the deterministic actions and whose covariances follow the exploration-noise schedule.
Potential–return agreement. We detect reward hacking when the guidance signal diverges from the extrinsic objective.
Potential–value correlation ($\rho_{\Phi V}$): the Pearson correlation on a fixed evaluation set to avoid distributional drift,
$$\rho_{\Phi V}^{(k)} = \frac{\operatorname{Cov}\big(\Phi(s), \hat{V}^{\text{ext}}(s)\big)}{\sqrt{\operatorname{Var}\big(\Phi(s)\big)\,\operatorname{Var}\big(\hat{V}^{\text{ext}}(s)\big)} + \epsilon},$$
with $\epsilon > 0$ for numerical stability. Sustained near-zero or negative values indicate decoupling between $\Phi$ and the extrinsic objective.
Reward-architecture adaptability. These metrics probe plasticity of the reward synthesizer under structural edits.
Factor attentional plasticity ($\delta_{\alpha}$): let $\alpha^{(k)}$ be the mean attention over factor IDs in cycle $k$. Because the outer loop may add/remove factors, compare only the intersection of IDs between two cycles, renormalizing the attention on that intersection. The change is quantified using the Jensen–Shannon Divergence (JSD),
$$\delta_{\alpha}^{(k)} = \mathrm{JSD}\big(\tilde{\alpha}^{(k)} \,\big\|\, \tilde{\alpha}^{(k-1)}\big).$$
We track both the level and EMA trend of $\delta_{\alpha}$ to distinguish healthy convergence from stagnation.
Normalized residual prediction error ($e_{\text{res}}$): a scale-free version of the alignment loss,
$$e_{\text{res}}^{(k)} = \frac{\mathcal{L}_{\text{align}}^{(k)}}{\operatorname{Var}\big(\hat{V}^{\text{ext}}\big) + \epsilon}.$$
A high plateau concurrent with poor returns indicates an informational bottleneck in $\mathcal{F}_k$ rather than a transient optimization issue.
The diagnostic vector is $d^{(k)} = \big(g_H^{(k)}, d_{\pi}^{(k)}, \rho_{\Phi V}^{(k)}, \delta_{\alpha}^{(k)}, e_{\text{res}}^{(k)}\big)$, each component robustly standardized (median/MAD over a rolling window) and smoothed by EMA, improving comparability across tasks/phases and yielding well-behaved inputs for density modeling.
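These diagnostic axes can be assembled from per-cycle evaluation data roughly as in the following sketch (our own illustration; the argument names and shapes are assumptions, and the entropy/KL inputs are assumed to be precomputed on the evaluation batch).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import pearsonr

def diagnostic_vector(entropy_curr, entropy_prev,   # mean policy entropy, this vs. previous cycle
                      kl_curr_prev, kl_prev_curr,   # forward/backward KL on the eval batch
                      phi, v_ext,                   # potential and extrinsic value estimates (arrays)
                      attn_curr, attn_prev,         # mean attention over the shared factor IDs
                      align_loss):                  # alignment loss on the eval batch
    g_h = entropy_curr - entropy_prev                       # policy-entropy gradient (finite difference)
    d_pi = 0.5 * (kl_curr_prev + kl_prev_curr)              # symmetric KL between policy snapshots
    rho, _ = pearsonr(phi, v_ext)                           # potential-value correlation
    delta_attn = jensenshannon(attn_curr, attn_prev) ** 2   # JSD over the intersected factor IDs
    e_res = align_loss / (np.var(v_ext) + 1e-8)             # scale-free residual prediction error
    return np.array([g_h, d_pi, rho, delta_attn, e_res])
```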
4.3.2. Triggering Mechanism: Anomaly Detection via Online Density Estimation
MCM converts the stream of diagnostics into an anomaly score and raises an intervention signal only when persistent, statistically significant deviations from healthy regimes are detected. We adopt a nonparametric online density model with a fixed-size healthy buffer to handle multi-modal “healthy” behaviors with bounded per-cycle cost [45,46].
Healthy Buffer and Kernel Density Model
Let $\mathcal{H}$ store representative diagnostic vectors collected during cycles deemed healthy. Define a Gaussian-kernel density with bandwidth $h$,
$$\hat{p}\big(d\big) \propto \frac{1}{|\mathcal{H}|} \sum_{d_j \in \mathcal{H}} \exp\!\left(-\frac{\lVert d - d_j \rVert^{2}}{2 h^{2}}\right).$$
The kernel’s normalization constant cancels out in percentile-based thresholding and relative comparisons. Initialize $h$ by the median heuristic on $\mathcal{H}$ and update it slowly via EMA to avoid jitter. The per-cycle evaluation cost is $O(|\mathcal{H}| \cdot D)$ for $D$-dimensional diagnostics.
Adaptive Buffer Maintenance (Performance-Gated Updates)
To keep the model aligned with the agent’s evolving competence, insert $d^{(k)}$ into $\mathcal{H}$ only if the smoothed extrinsic return over the current cycle exceeds a moving baseline (an EMA over recent cycles). If the buffer is full, evict by a first-in, first-out (FIFO) policy. A warm-up period of several cycles and a minimum fill level are required before triggering is enabled, preventing early false alarms.
Anomaly Score and Dynamic Thresholding
Given $d^{(k)}$, define the anomaly score as the negative log-likelihood under the healthy model,
$$a^{(k)} = -\log \hat{p}\big(d^{(k)}\big).$$
Maintain a history $\{a^{(j)}\}_{j \le k}$ and set a data-driven threshold as the $q$-th percentile,
$$\tau^{(k)} = \operatorname{percentile}_{q}\big(\{a^{(j)}\}_{j \le k}\big).$$
Require $a^{(k)} > \tau^{(k)}$ for a required number of consecutive cycles (the persistence threshold) before declaring a pathology.
Hysteresis and Cooldown
To avoid oscillatory interventions near the threshold, employ hysteresis and a cooldown. After a trigger, suppress further triggers for a fixed number of outer cycles, and use a slightly lower release threshold, offset by a small margin, to declare back-to-healthy transitions.
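A minimal version of the healthy-buffer KDE trigger could look as follows (our own sketch, assuming a fixed bandwidth passed in by the caller and simplified buffer maintenance; hysteresis and the performance-gated baseline are reduced to a boolean flag).

```python
import numpy as np
from collections import deque

class KDEAnomalyDetector:
    """Negative-log-likelihood anomaly score against a buffer of 'healthy' diagnostic vectors."""
    def __init__(self, buffer_size=64, percentile=95, persistence=3):
        self.buffer = deque(maxlen=buffer_size)   # FIFO healthy buffer
        self.history = []                         # past anomaly scores for the percentile threshold
        self.percentile = percentile
        self.persistence = persistence
        self.streak = 0

    def score(self, d, bandwidth):
        diffs = np.asarray(self.buffer) - d                          # (N, D)
        log_k = -np.sum(diffs ** 2, axis=1) / (2.0 * bandwidth ** 2)
        # Normalization constant omitted: it cancels in percentile-based thresholding.
        return -np.log(np.mean(np.exp(log_k)) + 1e-300)

    def step(self, d, bandwidth, healthy):
        d = np.asarray(d, dtype=float)
        triggered = False
        if len(self.buffer) >= max(2, self.buffer.maxlen // 2):      # warm-up before triggering
            a = self.score(d, bandwidth)
            self.history.append(a)
            threshold = np.percentile(self.history, self.percentile)
            self.streak = self.streak + 1 if a > threshold else 0
            triggered = self.streak >= self.persistence              # persistent anomalies only
        if healthy:                                                  # performance-gated buffer update
            self.buffer.append(d)
        return triggered
```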
4.4. Outer Loop II: Semantic Reward Reframing
When MCM flags a persistent anomaly, the SRR module converts diagnostic evidence into admissible structural edits of the reward factor set. SRR treats rewards as a typed, mutable program and uses an LLM in a constrained, evidence-driven loop to search for edits that (i) repair semantic misalignment and (ii) preserve the potential-based invariances of the inner loop.
Given the current factor set $\mathcal{F}_k$ and diagnostics $d^{(k)}$ with the KDE gradient $\nabla_{d} \log \hat{p}(d^{(k)})$, SRR outputs a revised set $\mathcal{F}_{k+1}$ and a rationale trace (specification deltas, not model internals). SRR is restricted to reward-level changes and does not modify the policy or the HARS architecture/hyperparameters.
To maintain the state-only potential $\Phi(s)$ assumed in Section 4.2, we partition factors into the following two disjoint types:
$\Phi$-factors: state-only terms that may feed the potential synthesis (and thus the PBRS pipeline).
Auxiliary factors: optional state–action terms used for policy-side regularization, safety checks, or feasibility/gating predicates; they never feed $\Phi$ nor the shaping term.
Edits that place action-dependent signals into the $\Phi$-factor set are disallowed. When gating is applied to $\Phi$-factors, the gate must be a state-only predicate $g(s)$.
4.4.1. Admissible Edit Space
We define a finite set of typed edit operators on typed factor programs, each producing another well-typed factor set that remains compatible with PBRS and the state interface as follows:
Parametric scaling: replace a factor $f_i$ with $c \cdot f_i$ for a scalar $c > 0$.
Saturation and clipping: clip $f_i$ to a bounded range or apply smooth saturations (e.g., tanh-gating) to limit extremes.
Contextual gating: multiplicatively gate $f_i$ by a predicate. For $\Phi$-factors, the predicate must be state-only; action-dependent gates are allowed only for auxiliary factors.
Refactorization: algebraically rewrite $f_i$ using an equivalent but loophole-averse expression (e.g., align velocity with heading via a projection onto the heading direction).
Augmentation: synthesize a new, interpretable factor with a typed signature and documented semantics (added to the $\Phi$-factor set if state-only, otherwise to the auxiliary set).
Pruning: remove redundant, ineffective, or detrimental factors.
Admissibility constraints. All factors must be pure (no side effects), use only current-step observables, have documented units/range, and satisfy a Lipschitz bound estimated on a sandbox batch via randomized finite differences (using a quantile of gradient norms). For PBRS compatibility, only state-only factors may feed the potential $\Phi$; action-dependent factors are confined to the auxiliary set and do not affect $\Phi$ or the shaping term. Dependence on future information and environment mutation is forbidden.
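The typed factor programs and edit operators can be represented as in the following sketch (our own data-model illustration; the field names and the subset of operators shown are simplifications of the catalog above).

```python
from dataclasses import dataclass, replace
from typing import Callable, Literal

@dataclass(frozen=True)
class RewardFactor:
    name: str
    fn: Callable                   # pure function of the current observation (and action for aux factors)
    kind: Literal["phi", "aux"]    # "phi": state-only, may feed the potential; "aux": policy-side only
    units: str
    weight: float = 1.0

def scale(factor: RewardFactor, c: float) -> RewardFactor:
    """Parametric scaling: replace f_i with c * f_i (via the weight)."""
    return replace(factor, weight=factor.weight * c)

def gate(factor: RewardFactor, predicate: Callable) -> RewardFactor:
    """Contextual gating: multiply by a predicate; for 'phi' factors the predicate must be state-only."""
    gated = lambda obs, f=factor.fn, g=predicate: g(obs) * f(obs)
    return replace(factor, fn=gated, name=f"{factor.name}_gated")

def prune(factors: list[RewardFactor], name: str) -> list[RewardFactor]:
    """Pruning: drop a redundant or detrimental factor by name."""
    return [f for f in factors if f.name != name]
```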
4.4.2. From Diagnostics to Edit Hypotheses
SRR projects diagnostic evidence onto factor-level hypotheses via rules and a linear surrogate,
$$\widehat{\Delta d}(o) = \hat{J}\, \Delta_{\mathcal{F}}(o), \qquad \operatorname{score}(o) = w^{\top} \widehat{\Delta d}(o) - \Omega(o),$$
where $\hat{J}$ is a finite-difference Jacobian estimated from recent edits (or a library prior), $\widehat{\Delta d}(o)$ is the predicted diagnostic change for operator $o$, $w$ prioritizes key metrics (e.g., $\rho_{\Phi V}$ and $e_{\text{res}}$), and $\Omega$ penalizes complexity (e.g., factor count and code length). The following evidence-to-operator rules are applied before LLM synthesis:
Low/negative $\rho_{\Phi V}$ (potential–return decoupling): prefer contextual gating to make high-payoff proxies conditional on posture/safety predicates; consider refactorization to embed physical consistency (e.g., align velocity with heading). For $\Phi$-factors, gating predicates must be state-only.
High $e_{\text{res}}$ plateau with poor return: prefer augmentation to inject missing semantics (contacts, feasibility, sparse-to-dense bridges); back off to saturation/clipping if variance explosion is detected.
Near-zero $\delta_{\alpha}$ (ossified attention): prefer gating that introduces state dependence (phase/terrain); optionally prune overly dominant factors.
Large $d_{\pi}$ oscillations without gains: prefer saturation/clipping on rapidly varying terms; discourage multiple simultaneous augmentations.
Edits that would route action-dependent signals into the potential synthesis are rejected at this stage to preserve the state-only assumption for $\Phi$.
4.4.3. LLM-Guided Candidate Generation and Prompting Strategy
SRR interacts with the LLM through a fixed, structured prompting scheme rather than ad-hoc natural-language queries. At each intervention, it compiles an adaptive directive with the following four components:
Task and environment context: the natural-language task description, state/action specifications ($\mathcal{S}$, $\mathcal{A}$), and any safety or feasibility constraints.
Current factor program: the current factor set $\mathcal{F}_k$ presented as typed code with documentation and a type tag for each factor indicating whether it belongs to the $\Phi$-factor set or the auxiliary set.
Diagnostic evidence: a summary of the anomaly detected by MCM, including the diagnostic vector $d^{(k)}$, anomaly score $a^{(k)}$, and the top components of the KDE gradient $\nabla_{d} \log \hat{p}(d^{(k)})$ translated into short natural-language “failure hypotheses”.
Admissible edit space: the catalog of allowed edit operators with typed signatures and explicit constraints (e.g., no future dependence, no environment mutation, no routing of action-dependent signals into $\Phi$).
The directive then instructs the LLM to propose at most $M$ minimally different candidate revisions and to output them in a strict JSON+code schema: each candidate is a list of edits with a short justification and updated metadata (units, ranges, monotonicity hints, and type tags for $\Phi$-factors vs. auxiliary factors). We use deterministic decoding and reject any output that does not conform to the schema before running the static and sandbox checks in Section 4.4.4. A concrete example of the full prompt template used by SRR is provided in Appendix A. The resulting reframing routine, executed whenever MCM raises an intervention signal, is summarized in Algorithm 2.
| Algorithm 2 SRR.Reframe: Executed when MCM triggers |
Input: current factors $\mathcal{F}_k$; task description; diagnostics $d^{(k)}$; KDE gradient $\nabla_{d} \log \hat{p}(d^{(k)})$; operator set; weights $w$; penalty $\Omega$; candidate count $M$. Output: revised factor set $\mathcal{F}_{k+1}$; rationale trace; backoff flag.
- 1: Build edit hypotheses from the diagnostics ▹ rules in Section 4.4.2
- 2: Query the LLM for at most $M$ candidate revisions ▹ deterministic decoding; typed JSON+code with $\Phi$-vs-aux tags
- 3: shortlist $\leftarrow \emptyset$
- 4: for each candidate do
- 5: if StaticIntegrityOK and SandboxOK then
- 6: compute the proxy potential and surrogate score; build fixed eval labels by value quantiles; score via ROC-AUC
- 7: if the AUC and permutation tests pass and the complexity penalty is acceptable then
- 8: add the candidate to shortlist
- 9: end if
- 10: end if
- 11: end for
- 12: if shortlist $\neq \emptyset$ then
- 13: $\mathcal{F}_{k+1} \leftarrow$ the best-scoring candidate
- 14: return ($\mathcal{F}_{k+1}$, rationale trace, backoff = false)
- 15: else
- 16: /* fail-safe backoff */
- 17: return ($\mathcal{F}_k$, $\emptyset$, backoff = true)
- 18: end if
|
4.4.4. Safety, Integrity, and Proxy-Efficacy Checks
Each candidate undergoes a three-stage validation before any environment interaction as follows:
Static integrity. Perform abstract syntax tree (AST) and type checks, unit/range conformity, and purity verification (no I/O or global state). Check PBRS compatibility and wiring: only state-only factors may feed the potential (no action dependence), and any action-dependent factors are confined to the auxiliary set.
Sandbox realizability. Execute factors on a held-out sandbox batch to verify numerical stability (no NaNs/Infs), bounded kurtosis, and a low saturation ratio (fraction of clipped outputs below a preset limit).
Proxy efficacy (offline). Rank candidates with a surrogate objective and require a significant discriminative test. We compute the area under the receiver operating characteristic curve (ROC-AUC) on a fixed evaluation set by labeling states via extrinsic value quantiles: states in the top quantile of $\hat{V}^{\text{ext}}$ are labeled positive, states in the bottom quantile negative, and middling values are discarded. The score is obtained by feeding the candidate’s state-only factors through the frozen FAB/VPM (parameters fixed) to produce a proxy potential $\tilde{\Phi}(s)$. We then report the ROC-AUC of $\tilde{\Phi}(s)$ against these labels together with a permutation $p$-value. Acceptance requires the AUC to exceed a preset threshold with a significant $p$-value and an acceptable complexity penalty. As a robustness alternative, we also report the Spearman rank correlation between $\tilde{\Phi}(s)$ and $\hat{V}^{\text{ext}}(s)$ with a permutation test.
Fail-Safe Backoff
If a candidate passes integrity and sandbox checks but fails proxy efficacy (AUC not significant or below threshold), SRR abstains and increases the intervention conservatism by (i) extending the cooldown by one outer cycle and (ii) raising the anomaly threshold percentile by a small increment, preventing unproductive edit thrashing in borderline regimes.
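The proxy-efficacy gate can be approximated as in the following sketch (our own illustration; the quantile cutoffs, AUC threshold, and permutation count are placeholders rather than the paper's exact settings).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def proxy_efficacy_gate(phi_proxy, v_ext, lo_q=0.3, hi_q=0.7,
                        auc_min=0.6, n_perm=1000, alpha=0.05, rng=None):
    """Label states by extrinsic-value quantiles, score the proxy potential by ROC-AUC,
    and require significance under a label-permutation test."""
    rng = rng or np.random.default_rng(0)
    lo, hi = np.quantile(v_ext, [lo_q, hi_q])
    keep = (v_ext <= lo) | (v_ext >= hi)          # discard middling values
    labels = (v_ext[keep] >= hi).astype(int)      # top quantile -> positive, bottom -> negative
    scores = phi_proxy[keep]
    auc = roc_auc_score(labels, scores)
    perm_aucs = np.array([roc_auc_score(rng.permutation(labels), scores)
                          for _ in range(n_perm)])
    p_value = float(np.mean(perm_aucs >= auc))    # permutation p-value
    return (auc >= auc_min) and (p_value <= alpha), auc, p_value
```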
4.4.5. Full-System Integration and Training Loop
Having validated and selected the revised factor set, we next integrate it into the full MIRA workflow. Upon acceptance of an SRR-proposed factor revision, the factor set is updated to $\mathcal{F}_{k+1}$. The factor-dependent parameters ($\phi$, $\varphi$) that contribute to the state-only potential $\Phi$ are re-initialized to ensure compatibility with the revised factorization, while the policy parameters $\theta$ and the temporal context encoder $\psi$ are retained to preserve previously acquired control competence and temporal smoothing capacity (the latter is used exclusively for auxiliary regularization and never feeds $\Phi$) [47]. To prevent oscillatory or excessive edits, SRR enforces a cooldown period of several outer cycles before further interventions are allowed; the triggering logic and threshold adaptation follow Section 4.3.2.
Integrating the inner-loop IRAR with the two outer-loop modules—MCM for metacognitive monitoring and SRR for semantic reward reframing—yields the complete MIRA training pipeline. This unified process abstracts over the choice of off-policy optimizer (e.g., SAC or DDPG) while keeping the interfaces to HARS and the outer loop explicit. It leverages the PBRS formulation (Equation (9)), attention pooling (Equation (8)), and the alignment loss (Equation (11)) as composed in Equation (10). The overall MIRA training procedure is summarized in Algorithm 3.
| Algorithm 3 MIRA: Overall Learning Loop |
Input: task description; large language model LLM; operator set; number of outer cycles; number of inner iterations; rollout length $H$; replay buffer; anomaly percentile $q$; persistence threshold; cooldown period; anomaly threshold increment. Output: trained policy $\pi_{\theta}$.
- 1: $\mathcal{F}_0 \leftarrow$ LSFS(task description) ▹ Section 4.2.1; static checks and unit tests before use; tag $\Phi$-vs-aux
- 2: Initialize policy $\pi_{\theta}$; HARS params ($\phi$, $\varphi$); initialize $\psi$ (TCE) for auxiliary smoothing; MCM state
- 3: for each outer cycle $k$ do
- 4: for each inner iteration do ▹ Inner loop: IRAR
- 5: Roll out $H$ env steps with $\pi_{\theta}$ to collect transitions
- 6: Compute factor values; form the state-only ($\Phi$-factor) and state–action (auxiliary) groups
- 7: Attention pooling on $\Phi$-factors (8); update TCE for auxiliary consistency (does not feed $\Phi$)
- 8: Obtain state-only potential $\Phi(s)$ from the instantaneous aggregation
- 9: Compute shaped rewards ▹ PBRS, (9)
- 10: Store transitions into the replay buffer
- 11: RL.Update ▹ e.g., SAC/DDPG actor/critic steps; aux factors may enter policy-side regularizers
- 12: Jointly update ($\phi$, $\varphi$) via (11) and (10); update $\psi$ via auxiliary smoothing losses
- 13: end for
- 14: Sample evaluation batch; compute diagnostic vector $d^{(k)}$
- 15: Update anomaly score and threshold (Section 4.3.2)
- 16: if the anomaly persists and the cooldown has expired then
- 17: ($\mathcal{F}_{k+1}$, rationale, backoff) $\leftarrow$ SRR.Reframe($\mathcal{F}_k$, $d^{(k)}$, ...)
- 18: if not backoff then
- 19: adopt $\mathcal{F}_{k+1}$; re-init ($\phi$, $\varphi$); retain $\theta$ and $\psi$
- 20: start cooldown ▹ stabilization window
- 21: else
- 22: keep $\mathcal{F}_k$ ▹ abstain
- 23: if abstentions repeat then
- 24: raise the anomaly percentile; extend the cooldown
- 25: end if
- 26: end if
- 27: else
- 28: $\mathcal{F}_{k+1} \leftarrow \mathcal{F}_k$; update the healthy buffer if the performance gate holds
- 29: end if
- 30: end for
- 31: return $\pi_{\theta}$
|
4.5. Computational Complexity and System-Level Overhead
Although MIRA introduces a dual-loop architecture, it is explicitly designed so that both the inner- and outer-loop procedures preserve the standard asymptotic time complexity of off-policy RL, incurring only a controlled constant-factor overhead. We analyze the computational cost of the main components below.
For a fixed reward factor set $\mathcal{F}_k$, the inner loop (IRAR) augments a standard off-policy RL agent with the HARS module and potential-based reward shaping. Per environment step, the computational overhead consists primarily of (i) evaluating the current reward factors, (ii) an attention-pooling pass over the active $\Phi$-factors (FAB), (iii) a GRU update in the Temporal Context Encoder (TCE), and (iv) an MLP forward pass in the Value Potential Mapping (VPM). Crucially, these operations are independent of the training horizon $T$; they scale only with fixed architectural widths and linearly with the number of active factors $K$ (e.g., $O(K)$ for attention pooling). Since $K$ is strictly bounded by the constrained edit space and the complexity penalty in SRR (Section 4.4.1), the per-step cost of IRAR represents a constant-time addition,
$$C_{\text{step}} = C_{\text{env}} + C_{\text{RL}} + C_{\text{HARS}},$$
where $C_{\text{env}}$ is the environment simulation cost, $C_{\text{RL}}$ denotes the optimization cost of the underlying RL algorithm (e.g., standard actor and critic gradient updates), and $C_{\text{HARS}}$ is comparable to the inference cost of a lightweight value network. Thus, for a training run of $T$ steps, the overall inner-loop time complexity remains $O(T)$.
The metacognitive monitoring module executes once per outer cycle after aggregating $N_{\text{cycle}}$ inner-loop steps. Constructing the diagnostic vector $d^{(k)}$ over a held-out evaluation batch of size $B$ incurs a cost of $O(B)$. Evaluating the kernel-density model over the healthy buffer scales as $O(|\mathcal{H}| \cdot D)$ per cycle. Since $B$, $D$, and $|\mathcal{H}|$ are fixed architectural constants (Table 1), the amortized monitoring cost per environment step is negligible, preserving the $O(T)$ total complexity.
The SRR module is invoked sparsely, triggered only when MCM detects a persistent anomaly and the cooldown counter has expired. Each SRR call operates on a frozen snapshot and performs LLM-guided generation and validation. Let $I$ denote the total number of accepted interventions. Since the number of candidate generations $M$ and the validation protocol are fixed, the total SRR cost is $O(I \cdot C_{\text{SRR}})$, where $C_{\text{SRR}}$ is independent of $T$. Furthermore, the triggering logic imposes a hard upper bound on $I$: with at most one intervention per outer cycle of $N_{\text{cycle}}$ steps (further limited by the cooldown), $I \le T / N_{\text{cycle}}$. This ensures that the expensive semantic editing steps cannot occur arbitrarily often, keeping the total overhead bounded linearly by $T$.
Combining these components, the total time complexity of MIRA over $T$ environment steps is
$$O\big(T \cdot (C_{\text{env}} + C_{\text{RL}} + C_{\text{HARS}})\big) \;+\; O\!\big(\tfrac{T}{N_{\text{cycle}}} \cdot |\mathcal{H}|\, D\big) \;+\; O\big(I \cdot C_{\text{SRR}}\big) \;=\; O(T).$$
The overall complexity remains linear in $T$, differing from the base RL agent only by constant factors derived from HARS, MCM, and the bounded SRR interventions. Memory overhead is similarly controlled, requiring only fixed-size buffers for the potential module and diagnostic vectors.
From an implementation perspective, MIRA serves as a lightweight extension over standard off-policy RL methods. In our experiments, we instantiate the underlying agent with an actor–critic architecture; the replay buffer and optimizer are reused, while the auxiliary modules (HARS, MCM, SRR) operate in a modular fashion without altering the fundamental asymptotic complexity.
5. Experiments
This section presents a rigorous empirical evaluation of the MIRA framework through a series of controlled experiments. The experiments are designed to systematically investigate MIRA’s performance, efficiency, and internal mechanisms in complex continuous control tasks. Specifically, we address the following three core research questions (RQs):
(RQ1) Performance Benchmarking: How does MIRA compare against state-of-the-art and representative reward design paradigms in terms of asymptotic performance and sample efficiency?
(RQ2) Mechanistic Investigation: How does MIRA’s closed-loop correction mechanism diagnose and resolve canonical learning pathologies, such as specification gaming and exploration stagnation?
(RQ3) Ablation Analysis: What are the individual contributions of MIRA’s primary architectural components—the MCM and SRR modules—to its overall efficacy?
5.1. Experimental Setup
5.1.1. Test Environments
We evaluate MIRA on two challenging, high-dimensional continuous control tasks from the MuJoCo physics simulation suite [48]: HalfCheetah and HumanoidStandup. These environments were selected because they exemplify two distinct and fundamental challenges in reward design, providing a targeted means to assess MIRA’s adaptive mechanisms. Both tasks are widely used benchmarks in the continuous control literature, ensuring comparability with prior work.
HalfCheetah: The objective is to command a planar, cheetah-like agent to run forward. This environment is a canonical benchmark for specification gaming, where simple proxy rewards (e.g., forward velocity) are misaligned with the high-level semantic goal of “running,” incentivizing unnatural and inefficient gaits. The task therefore provides a controlled setting to evaluate MIRA’s capacity to detect and correct an emergent objective misalignment.
HumanoidStandup: This task requires a high-dimensional, unstable humanoid model to transition from a crouched to a standing posture. The environment is characterized by a prominent local optimum, creating a deceptive reward landscape that frequently leads to exploration stagnation. The agent receives consistent positive rewards for maintaining a stable crouch, while the transient act of standing is penalized due to instability. This property makes it a canonical benchmark for evaluating an agent’s ability to escape local optima and overcome premature convergence.
Additional experiments on the Humanoid locomotion task, which poses a more complex control challenge, are reported in Appendix B.
5.1.2. Comparative Baselines
To comprehensively evaluate MIRA, we benchmark it against a curated set of baselines spanning distinct technical paradigms. Each is chosen to isolate and challenge a specific aspect of MIRA’s design, allowing for a nuanced analysis as follows:
Performance Bound Baselines. These establish the performance floor and a practical ceiling for each task. Sparse-Reward: The agent trains on the native, sparse terminal reward, calibrating inherent task difficulty. Dense-Reward (Oracle): An expert-engineered dense reward function represents the performance ceiling achievable with perfect, static domain knowledge.
Static LLM-based Reward Baseline. This represents the current one-shot reward synthesis paradigm. Latent Reward (LaRe) [35]: We employ LaRe, which represents the state of the art for LLM-based reward synthesis. In this method, an LLM generates a fixed reward function used throughout training. This comparison directly tests our central thesis: that even a sophisticated static reward is brittle to emergent misalignments, and that MIRA’s closed-loop reframing offers superior robustness and performance.
Trajectory-based Reward Inference Baselines. These methods infer rewards from trajectory returns without external semantics. Reward Disentanglement (RD) (based on [49]): Operating in the “RL with trajectory feedback” setting, RD models the per-step reward as a linear function of state-action features, estimated via least-squares regression over trajectories. Iterative Relative Credit Refinement (IRCR) (based on [50]): This method computes a guidance reward by normalizing each trajectory’s total return into a scalar credit and uniformly redistributing it to the trajectory’s constituent state-action pairs. This comparison investigates whether learning pathologies can be avoided purely through sophisticated signal processing, or if, as MIRA proposes, injecting new semantic knowledge is necessary.
Information-Theoretic Baseline. This baseline tests whether robust alignment can emerge from statistical correlation alone. Variational Information Bottleneck (VIB) [51]: VIB learns a reward function by training an encoder to find a “minimal sufficient” latent representation of state-action pairs that predicts task success. This tests whether robust alignment necessitates the explicit causal and semantic reasoning provided by MIRA.
Collectively, these baselines situate MIRA not only in terms of performance but also in relation to core debates in reward design: static vs. dynamic specification, data-driven vs. knowledge-driven correction, and correlation-based vs. causality-informed alignment.
5.1.3. Policy Optimization Algorithms
To validate the generality of the MIRA framework, we integrate it with two representative RL algorithms for continuous control that embody different core philosophies. By deploying MIRA and all baselines on top of both, we aim to demonstrate that MIRA’s contributions are orthogonal to the choice of the underlying policy optimizer as follows:
Deep Deterministic Policy Gradient (DDPG) [52]: A seminal off-policy actor-critic algorithm that learns a deterministic policy, relying on action-space noise for exploration.
Soft Actor-Critic (SAC) [53]: A state-of-the-art off-policy algorithm that learns a stochastic policy by incorporating an entropy maximization objective, promoting robust and principled exploration.
5.1.4. Implementation and Evaluation
All experiments were conducted on a single workstation equipped with an Intel Core i9-13900K CPU (24 cores, 32 threads), an NVIDIA RTX 4090 GPU (24 GB VRAM), and 64 GB of system memory. The system runs Ubuntu 20.04 with CUDA 11.8 and PyTorch 1.13.1. No distributed training or specialized accelerators were required for any of the reported results.
To ensure fair comparison and reproducibility, we utilized a unified software framework for all methods. For LLM-based methods (MIRA and LaRe), we adopted the Deepseek-R1 model with deterministic sampling to guarantee reproducible reward-code generation across runs.
All Reinforcement Learning baselines and MIRA share the same optimizer implementation (DDPG or SAC) and the same actor–critic network architecture. This design ensures that differences in performance can be attributed to the reward mechanisms rather than architectural variation. We follow a standard continuous-control configuration for the actor and critic and keep all base RL hyperparameters fixed across methods, tasks, and optimizers. The exact values of these base RL hyperparameters, together with all MIRA-specific hyperparameters (HARS, MCM, SRR), are summarized in Section 5.1.5 and Table 1.
For the inner-loop architecture of MIRA, the HARS module follows a lightweight and stable configuration that uses standard attention, recurrent, and MLP components. HARS is trained jointly with the policy using the same optimizer family as the baseline critic, under the inner-loop objective in Equation (10). The concrete architectural and loss-weight choices are included in Table 1.
The outer-loop metacognitive process (MCM + SRR) operates on top of the policy optimizer with a fixed outer-cycle length, measured in environment steps (Table 1). At the end of each outer cycle, MCM computes a diagnostic vector and an anomaly score based on a kernel-density model over a healthy buffer (Section 4.3). A reframing attempt is considered only when the anomaly score exceeds a percentile-based threshold, and any candidate reward update proposed by SRR is validated offline using static checks, sandbox execution, and a permutation-based ROC-AUC test on a fixed evaluation set (Section 4.4). Thresholds and cooldowns follow the single global configuration reported in Section 5.1.5 and Table 1.
Each experimental configuration is run with five random seeds. Reported learning curves display the mean extrinsic episodic return with one-standard-deviation bands across seeds. We evaluate all methods using the following two metrics:
Asymptotic performance: the average episodic return over the final 10% of training steps, quantifying the final proficiency of the converged policy.
Sample efficiency: the number of environment steps required to reach 50% of the Dense-Reward Oracle baseline’s asymptotic performance. Lower values indicate faster learning.
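For clarity, the two metrics can be computed from a logged learning curve as in the following sketch (our own illustration; the array-based curve format is an assumption).

```python
import numpy as np

def asymptotic_performance(steps, returns, frac=0.1):
    """Mean extrinsic episodic return over the final `frac` of training steps.
    `steps` and `returns` are aligned 1-D numpy arrays."""
    cutoff = steps[-1] - frac * (steps[-1] - steps[0])
    return float(np.mean(returns[steps >= cutoff]))

def sample_efficiency(steps, returns, oracle_asymptote, target_frac=0.5):
    """First environment step at which the return reaches `target_frac` of the oracle's
    asymptotic performance; returns None if the threshold is never reached."""
    target = target_frac * oracle_asymptote
    hits = np.nonzero(returns >= target)[0]
    return int(steps[hits[0]]) if hits.size else None
```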
Runtime and Overhead
Under the hardware setup described above, training a single seed of a fixed environment–optimizer configuration to the full evaluation horizon typically requires on the order of 20–40 h of wall-clock time for the baseline methods, with HalfCheetah near the lower end of this range and HumanoidStandup near the upper end. Relative to the static LLM-based reward baseline LaRe, MIRA incurs only a very small additional runtime cost: averaged across both environments and both optimizers, the per-seed wall-clock time of MIRA is only a few percent higher than that of LaRe. This mild overhead is consistent with the design of the dual-loop architecture: the inner-loop HARS computations add only a constant per-step cost, MCM is evaluated once per outer cycle and contributes negligible amortized overhead per environment step, and SRR is triggered infrequently due to the persistence and cooldown rules. Empirically, SRR interventions are rare (only a handful of triggers per seed at the median), and the time spent on semantic reframing and offline validation accounts for a small fraction of the overall training time. The dominant cost remains environment simulation and standard actor–critic updates, and MIRA’s dual-loop mechanism manifests as a modest constant-factor slowdown rather than a change in asymptotic scaling, in line with the analysis in Section 4.5.
5.1.5. Hyperparameter Governance
MIRA introduces additional hyperparameters beyond the underlying off-policy optimizer because it decomposes reward adaptation into modular inner–outer subsystems with explicit safety constraints. To keep these choices interpretable and reproducible, we organize them into three conceptual layers: (i) inner-loop optimization and shaping stability, (ii) outer-loop metacognitive conservatism, and (iii) SRR reward-editing strictness. This grouping yields a concise governance framework and avoids unconstrained global tuning.
- (1) Inner-Loop Hyperparameters
The underlying actor–critic optimizer (DDPG or SAC) uses standard configurations that are shared by all methods. In particular, we use a fixed learning rate, discount factor, and a common two-layer MLP architecture for both policy and critic networks. These base RL settings remain fixed for all tasks and baselines and are listed in
Table 1.
Within MIRA, the HARS module adopts a lightweight architecture with conventional components: single-head attention in the Factor Attention Block, a GRU-based Temporal Context Encoder, and a 2-layer Value Potential Mapping. The inner-loop loss weights in Equation (10) are chosen to prioritize stable potential learning under PBRS and are kept identical across environments and optimizers; we do not perform any task-specific retuning of these inner-loop hyperparameters.
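The sketch below illustrates one way the components named above could be wired together: per-factor embeddings, single-head self-attention (Factor Attention Block), a two-layer Value Potential Mapping producing the scalar potential, and a GRU Temporal Context Encoder reserved for an auxiliary head. All dimensions, the embedding scheme, and the auxiliary head itself are assumptions for illustration; this is not the exact HARS implementation.

```python
import torch
import torch.nn as nn

class HARSPotentialSketch(nn.Module):
    """Illustrative HARS layout: attention over factor tokens -> VPM -> Phi(s);
    the GRU context is kept separate so the potential stays state-only."""

    def __init__(self, num_factors, d_model=64, hidden=128):
        super().__init__()
        self.factor_embed = nn.Linear(1, d_model)                      # embed each scalar factor
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.vpm = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))                 # value potential mapping
        self.tce = nn.GRU(d_model, d_model, batch_first=True)          # temporal context encoder
        self.aux_head = nn.Linear(d_model, 1)                          # auxiliary smoothing target

    def forward(self, factor_seq):
        # factor_seq: (batch, T, F) recent values of the state-only reward factors
        B, T, F = factor_seq.shape
        tokens = self.factor_embed(factor_seq.reshape(B * T, F, 1))    # (B*T, F, d_model)
        pooled, _ = self.attn(tokens, tokens, tokens)                  # single-head self-attention
        per_step = pooled.mean(dim=1).reshape(B, T, -1)                # (B, T, d_model)
        phi = self.vpm(per_step[:, -1]).squeeze(-1)                    # Phi of the current state
        context, _ = self.tce(per_step)                                # used only for the aux head
        aux = self.aux_head(context[:, -1]).squeeze(-1)
        return phi, aux
```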
- (2) Outer-Loop Hyperparameters
MCM hyperparameters primarily regulate the conservatism and frequency of semantic interventions rather than the reward semantics themselves. We employ a percentile-based anomaly rule (threshold q) to obtain a scale-free trigger criterion, together with a persistence requirement and a fixed healthy-buffer size for the kernel-density estimator. After accepting an edit, we enforce a refractory period, expressed as a fixed number of outer cycles, to allow the policy and critic to adapt under the revised factorization. These defaults are selected once and reused across all reported experiments.
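To make this gating concrete, the minimal sketch below implements the persistence requirement and the post-edit refractory period; the default values shown are placeholders, not the paper's settings.

```python
class InterventionGate:
    """Persistence-and-cooldown gating for outer-loop interventions."""

    def __init__(self, persistence=3, cooldown=5):
        self.persistence = persistence   # consecutive anomalous cycles required
        self.cooldown = cooldown         # refractory cycles after an accepted edit
        self._streak = 0
        self._refractory = 0

    def step(self, anomalous):
        """Call once per outer cycle; returns True when SRR may be invoked."""
        if self._refractory > 0:
            self._refractory -= 1
            self._streak = 0
            return False
        self._streak = self._streak + 1 if anomalous else 0
        return self._streak >= self.persistence

    def notify_accepted_edit(self):
        """Start the refractory period so the policy and critic can readapt."""
        self._streak = 0
        self._refractory = self.cooldown
```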
- (3) SRR Reward-Editing Hyperparameters
SRR parameters control the breadth of candidate generation and the strictness of offline validation. We use deterministic decoding with a small number of candidate factor-set revisions per intervention and accept an edit only if it passes (i) static integrity checks (syntax, typing, PBRS compatibility), (ii) sandbox realizability checks (numerical stability and boundedness on held-out data), and (iii) a permutation-based ROC–AUC gate on a fixed evaluation set. This conservative design is intended to prevent edit thrashing and to reduce sensitivity to minor variations in SRR hyperparameters.
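To illustrate the final validation gate, the sketch below implements a permutation test on the ROC–AUC of a candidate factor's scores over a fixed, binary-labelled evaluation set. The labelling convention, significance level, and number of permutations are assumptions; only the permutation-test mechanism itself follows the description above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_auc_gate(scores, labels, n_perm=1000, alpha=0.05, seed=0):
    """Accept a candidate reward edit only if its scores separate positive
    from negative evaluation transitions better than label-permuted chance."""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(labels, scores)
    perm_aucs = np.array([roc_auc_score(rng.permutation(labels), scores)
                          for _ in range(n_perm)])
    # One-sided permutation p-value with the conventional +1 correction.
    p_value = (1 + np.sum(perm_aucs >= observed)) / (n_perm + 1)
    return p_value < alpha
```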
Global Configuration
Table 1 summarizes the main hyperparameters, their roles, and the global values used in all experiments. Importantly, we adopt a single global configuration for these hyperparameters and reuse it across both environments (HalfCheetah and HumanoidStandup) and both optimizers (DDPG and SAC); we do not perform any per-task or per-baseline retuning. This design choice reflects MIRA’s intended role as a robust closed-loop reward adaptation framework rather than a heavily tuned system.
5.2. Macro-Level Performance Comparison
To systematically address our first research question (RQ1) regarding the performance advantages of MIRA over existing paradigms, we conduct a quantitative evaluation against a suite of representative baselines (definitions in
Section 5.1.2). We first present and analyze the results using DDPG as the underlying optimizer. This allows us to isolate and evaluate the direct impact of the different reward design paradigms. Subsequently, we show parallel results with SAC to demonstrate the robustness of our core findings. All reported metrics are computed from the environment’s extrinsic episodic return at evaluation time; training-time shaping signals never enter evaluation.
The overall results demonstrate that MIRA exhibits a significant competitive advantage over the baseline methods in both asymptotic performance and sample efficiency. This advantage is consistently reflected in the learning dynamics across both test environments (
Figure 3) and in the quantitative metrics (
Table 2).
This analysis provides the following insights into the intrinsic capabilities and limitations of different reward design paradigms:
Superior performance and Oracle competitiveness. In both environments, MIRA not only surpasses the baseline reward design methods in asymptotic performance but also demonstrates performance competitive with, and in some aspects superior to, the human-engineered Oracle. This supports the thesis that dynamic reward evolution can close residual gaps left by static expert shaping.
Brittleness of static reward synthesis. LaRe, representing one-shot LLM-based reward design, highlights the fragility of static approaches: a fixed reward is a single hypothesis about a dynamic process. In HalfCheetah, the approach exhibits a vulnerability to specification gaming, while in HumanoidStandup, it shows a tendency to converge to a deceptive local optimum. While LLM priors are valuable, they require online revision—enabled by MIRA—to remain aligned with learning dynamics.
Limitations of retrospective reward inference. Retrospective reward inference methods such as RD, IRCR, and VIB adopt a bottom-up perspective: they recover per-step rewards from aggregated trajectory returns, seeking a statistical explanation of past agent behavior within the constraints of the current value system. In contrast, MIRA’s SRR module leverages a top-down approach, in which an LLM synthesizes reward factors directly from task semantics and structured knowledge, enabling targeted, causal interventions when misalignments are diagnosed. This top-down semantic reasoning allows MIRA not only to reinterpret existing data but also to reshape the reward landscape itself—capabilities that are inherently inaccessible to purely retrospective methods.
To validate the generalizability of our findings, we conducted a parallel set of experiments using the SAC algorithm. As illustrated in
Figure 4 and
Table 3, the results obtained with SAC are consistent with our primary findings. Although absolute performance values shift due to optimizer properties, the relative hierarchy remains intact, with MIRA maintaining a significant lead. This indicates that MIRA’s advantage is not an artifact of incidental synergy with a particular optimizer but stems from the intrinsic superiority of its dynamic reward architecture.
5.3. Mechanistic Analysis: Dissecting MIRA’s Introspective Self-Correction
While the aggregate performance comparisons in Section 5.2 validate MIRA's efficacy, a more fundamental question remains: how does its internal mechanism overcome the challenges that cause other paradigms to fail? This section addresses RQ2 by dissecting MIRA's core loop through two illustrative case studies, showing how it performs online diagnosis and structural self-correction to resolve alignment failures that are often intractable for static or purely data-driven methods. For notational brevity, we write $\Phi(s)$ for the state-only potential learned by HARS (Section 4.2); action-dependent factors are confined to policy-side regularization and never feed $\Phi(s)$ or the shaping term in Equation (9).
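For reference, the sketch below shows the standard potential-based shaping form that we assume Equation (9) instantiates, together with the usual convention of a zero potential at terminal states; both the function name and the terminal handling are illustrative.

```python
def pbrs_shaped_reward(r_ext, phi_s, phi_next, gamma, done):
    """Standard PBRS: shaped reward = r + gamma * Phi(s') - Phi(s),
    which leaves the optimal policy unchanged."""
    phi_next = 0.0 if done else phi_next
    return r_ext + gamma * phi_next - phi_s
```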
5.3.1. Case Study I: Rectifying Reward Hacking via Structural Reward Revision
Initial design and emergent pathology. In HalfCheetah, SRR emits an initial factor set defined over the forward velocity, the pitch angle, the pitch angular rate, and the action vector a. To preserve PBRS invariance, the factors are partitioned into a state-only subset and an action-dependent subset. HARS constructs $\Phi(s)$ from the state-only subset via attention pooling in Equation (8) followed by the VPM, and the policy is trained with the shaped rewards of Equation (9). Early in training, the policy uncovers a loophole: maximizing the velocity factor by adopting a flipped “somersaulting” motion yields large forward velocity, while the posture penalties are insufficient to counterbalance it, producing a numerically high return but a semantically incorrect gait.
Diagnosis via potential–value decoupling. Per Section 4.3.1, MCM computes the Potential–Value Correlation on a fixed evaluation set under a frozen policy snapshot, correlating the learned potential $\Phi(s)$ with the extrinsic value estimate. As the “somersaulting” behavior emerges, states with high potential no longer correspond to high extrinsic value, driving the correlation toward sustained negative values. The diagnostic vector therefore falls into a low-density region of the healthy model, and the anomaly score (Section 4.3.2) crosses the trigger threshold persistently, activating the outer loop and forwarding directional evidence to SRR.
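One simple way to realize this diagnostic, assuming a Pearson-style correlation (the exact estimator is not specified here), is sketched below.

```python
import numpy as np

def potential_value_correlation(phi_values, value_estimates):
    """Correlation between the learned potential Phi(s) and the extrinsic
    value estimate over a fixed evaluation set under a frozen policy;
    sustained negative values signal potential-value decoupling."""
    return float(np.corrcoef(phi_values, value_estimates)[0, 1])
```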
Corrective mechanism: from trade-off to precondition. Guided by the evidence of a sustained low, negative potential–value correlation, SRR applies an admissible Gate edit (Section 4.4.1) that converts posture from an additive trade-off into a precondition for rewarding velocity: the additive velocity factor is replaced by a posture-gated variant that rewards forward velocity only while the pitch remains within a tolerance band, and the revised factor set takes its place in the state-only partition. This edit is state-only, pure (no side effects), and bounded (via tanh), and therefore passes the static integrity and PBRS-compatibility checks. To maintain scale, we apply the same saturation policy used in SRR (Section 4.4.4) so that the gated factor remains commensurate with the other state-only factors.
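A hypothetical rendering of such a posture-gated, tanh-bounded velocity factor is sketched below; the tolerance, scale, and hard-gate shape are assumptions chosen only to illustrate the “precondition” semantics of the edit.

```python
import numpy as np

def gated_velocity_factor(v_x, pitch, pitch_tol=0.3, scale=1.0):
    """Reward forward velocity only while the pitch stays within tolerance;
    the tanh keeps the factor bounded and commensurate with other factors."""
    gate = 1.0 if abs(pitch) <= pitch_tol else 0.0   # posture as a precondition
    return float(np.tanh(scale * gate * v_x))        # bounded, pure, state-only
```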
Policy readaptation and recovery. Upon acceptance (Section 4.4.4), we re-initialize the affected HARS components while retaining the rest of the module and the always-on TCE (used only for auxiliary smoothing). Training continues with the revised factor set under the shaped rewards of Equation (9). The gated design suppresses the pathological high-velocity posture, reshaping the local landscape so that stable running becomes the profitable solution. Empirically, the potential–value correlation recovers to positive values on the evaluation set, the residual prediction error decreases, and no further anomalies are triggered during the cooldown window.
5.3.2. Case Study II: Escaping Local Optima via Factor Set Augmentation
Initial design and premature convergence. In HumanoidStandup, SRR emits an initial factor set defined over the torso height, the linear and angular velocities, and the action vector a. As before, the factors are partitioned into state-only and action-dependent subsets, and HARS builds $\Phi(s)$ from the state-only subset. This design inadvertently yields a deceptive fixed point: a crouched posture offers a moderate height reward and a continuous survival reward while minimizing instability penalties. Any upward motion incurs a large transient penalty, creating a reward valley that traps the policy.
Diagnosis via stagnation signatures. With the policy trapped, MCM detects two persistent changes on the fixed evaluation set: (i) the Policy Entropy Gradient collapses toward zero, signalling stalled exploration; and (ii) the Residual Prediction Error plateaus at a high value, indicating that the critic cannot improve in this static region. This stagnation vector lies in a low-likelihood region under the healthy kernel-density model, producing an anomaly score above the trigger threshold. MCM localizes the plateau to a narrow band of torso heights.
Corrective mechanism: surgical reward injection. SRR infers that the factor set lacks a targeted escape incentive. Using the localized band of torso heights as context, it synthesizes a bounded, state-only exploration bonus built from the logistic function of the vertical velocity and restricted to the stagnation band, with scale parameters chosen so that the bonus remains commensurate with the other state-only factors. The revised factor set augments the state-only subset with this bonus. This is a localized shaping term aimed solely at the stagnation band, and it passes the admissibility checks by being bounded, pure, and state-only.
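A hypothetical form of this localized bonus is sketched below: a logistic window over an assumed stagnation band of torso heights, multiplied by a bounded term in the upward vertical velocity. The band edges, sharpness, and scale are placeholders rather than the synthesized values.

```python
import numpy as np

def escape_bonus(torso_height, v_z, band=(0.4, 0.7), beta=10.0, scale=1.0):
    """Bounded, state-only exploration bonus active only in the stagnation
    band and only for upward motion (positive vertical velocity)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    in_band = sigmoid(beta * (torso_height - band[0])) * \
              sigmoid(beta * (band[1] - torso_height))
    return float(scale * np.tanh(max(v_z, 0.0)) * in_band)
```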
Policy readaptation and recovery. After the affected HARS components are re-initialized and the TCE is retained, HARS rapidly learns to assign positive potential to states where the bonus is active. This reshapes the potential landscape from a flat plateau into a sloped “escape ramp,” creating a consistent policy gradient out of the crouched posture. The entropy gradient recovers to healthy levels, the residual prediction error decreases, and the agent proceeds toward the globally optimal standing behavior without regressing into the local optimum.
5.4. Ablation Study: Attributing Contributions of Key Components
To quantitatively isolate and evaluate the contribution of each core component within the MIRA framework, thereby addressing RQ3, we conducted a series of ablation studies. This section aims to independently assess the impact of MCM’s adaptive diagnostic module and SRR’s semantic reconstruction module on overall performance, as well as to investigate their potential synergistic effects.
We designed the following ablations:
MIRA w/o MCM (Periodic, Unguided Intervention): This variant is designed to quantify the effectiveness of MCM’s adaptive diagnostic mechanism. It retains the SRR module but replaces MCM with a frequency-matched periodic intervention baseline, which invokes SRR at fixed intervals whose total count per run matches the median number of MCM triggers observed in full MIRA. No diagnostic vector is provided; SRR generates candidates using only the current reward code and recent returns/trajectories, under the same operator set and candidate count M as full MIRA. This variant tests the hypothesis that the microscopic, root-cause diagnostic information provided by MCM is critical for effective correction and superior to interventions based on macroscopic performance alone.
MIRA w/o SRR (Diagnosis-Only): This variant measures the practical contribution of SRR’s semantic reconstruction module. It retains the full diagnostic and triggering capabilities of MCM but removes the SRR correction module entirely. In this configuration, the reward function is initialized once at the beginning of training and remains fixed throughout, unaffected by any diagnostic signals from MCM.
All other components and hyperparameters are identical to full MIRA: the operator set, the candidate count M, the static and sandbox checks, the parameter re-initialization and retention scheme applied after accepted edits, and the outer-loop cooldown.
To understand the contribution of each component, we conducted an ablation study using the DDPG algorithm. We evaluated two ablated variants against the full MIRA framework on the HalfCheetah task. As illustrated in
Figure 5, the results clearly show that the complete MIRA framework substantially outperforms these ablated baselines in this task. This evaluation reveals the following key insights:
The Guiding Role of Diagnostic Information for Effective Intervention (MCM): The performance of MIRA w/o MCM is significantly inferior to that of the full MIRA framework. Its failure can be attributed to two factors. First, the intervention timing is suboptimal, as non-adaptive, periodic interventions fail to align with the precise onset of a problem. Second, and more fundamentally, the intervention is unguided. Lacking the specific diagnostic vector from MCM, SRR is forced to perform an unguided search in a vast reward design space, making its proposed modifications likely to be ineffective or even detrimental. In contrast, MCM in the full MIRA framework not only determines when to intervene but, by passing the diagnostic vector, also informs SRR of what the problem is. This constrains an open-ended design problem into a targeted repair task, dramatically increasing the probability of a successful correction.
Addressing the Limitations of Initial Reward Hypotheses (SRR): This ablation highlights a core concept of our framework: an LLM-generated reward, while a powerful starting point, should be viewed as an initial hypothesis about the task's value landscape, not as a perfect, final specification. As demonstrated in our case studies, the complex and often unpredictable dynamics of agent–environment interaction can reveal latent flaws in this initial hypothesis, such as semantic loopholes or unintended local optima. The performance of the MIRA w/o SRR variant is a direct testament to this challenge. By relying solely on the static initial reward, its learning process stagnates once such a flaw is encountered. This result provides a critical insight: diagnosis without intervention is insufficient. Even when MCM correctly identifies a problem, the system remains trapped by the flaws of its initial reward hypothesis without SRR to perform a structural correction.
In conclusion, the ablation studies provide clear evidence that MIRA’s superior performance stems from a deep synergy between its two core components. This synergy is not a simple modular addition but a tightly integrated feedback loop. The MCM acts as the perceptual system, answering when an intervention is needed and what the problem is. The SRR acts as the cognitive and executive system, using the information provided by MCM to answer how to fix it. It is the precise, diagnostic information passed from MCM that allows SRR’s powerful semantic reconstruction capabilities to be effectively targeted. This complete “Diagnose–Respond–Reshape–Verify” process forms the core of the MIRA framework’s adaptive intelligence.
6. Conclusions
The practical success of RL is critically dependent on the quality of the reward signal. However, reward design is non-trivial: external feedback is often sparse, leading to the classic temporal credit assignment problem; reward shaping relies on domain expertise and can trigger specification gaming when the agent treats proxy metrics as the ultimate goal; misspecified rewards also increase the likelihood of converging to suboptimal local optima; meanwhile, a reward function that remains static throughout training exhibits structural rigidity, failing to adapt to the agent’s evolving capabilities. These challenges collectively raise a core question: can we endow an agent with an online self-correction mechanism to diagnose and repair structural deficiencies in its own value system?
To answer this question, we propose the MIRA framework, which reframes the reward function from a static artifact into a dynamic component that co-evolves with the agent. Starting with initial reward factors generated by an LLM, MIRA operationalizes a form of computational metacognition through a dual-loop architecture: The inner loop learns a state potential function via the HARS module and applies optimality-preserving PBRS to provide dense guidance. The outer loop acts as a metacognitive supervisor, continuously monitoring learning dynamics through a set of diagnostic metrics. Upon detecting persistent pathological behaviors, it invokes the LLM to perform semantic reward reframing within a constrained and auditable edit space. In essence, MIRA transforms reward design into a “detect-diagnose-repair” closed-loop process. Our empirical results in complex continuous control tasks validate this approach, showing that it generally outperforms baseline methods in terms of asymptotic return and sample efficiency. Case studies illustrate how the mechanism translates diagnostic signals of learning pathologies into targeted reward edits, while ablation studies confirm that both the monitoring and reframing components are indispensable for achieving these gains.
While this work validates the foundational viability of MIRA, our evaluation is confined to simulated, fully observable locomotion tasks in order to focus on and isolate the core mechanism of self-correction. Looking ahead, we will explore the robustness and adaptability of MIRA in more realistic and demanding regimes, including high-dimensional perception, partial observability, and safety-critical or explicitly adversarial settings where reward channels, observations, or dynamics may be corrupted. First, we will extend the method to pixel-based control by integrating pretrained visual encoders and evaluate it on real robots. Second, we will investigate adversarial and high-security domains by combining MIRA with additional safety constraints and human oversight mechanisms, and by subjecting it to systematically perturbed reward and dynamics models. Third, we will leverage meta-learning to acquire reusable reward-editing heuristics across families of tasks to amortize the outer-loop search cost, and strengthen the theoretical foundations of “structured reward editing” by characterizing and proving properties such as stability and invariance, providing firmer guarantees for the method. Overall, we posit that reward evolution represents a viable path toward more autonomous and reliable agents, and we hope to inspire further research on self-correcting value systems.