Article

MIRA: An LLM-Driven Dual-Loop Architecture for Metacognitive Reward Design

Weiying Zhang, Yuhua Xu and Zhixin Sun *
1 Post Big Data Technology and Application Engineering Research Center of Jiangsu Province, Nanjing University of Posts and Telecommunications, 66 Xinmofan Road, Nanjing 210003, China
2 Post Industry Technology R&D Center of the State Posts Bureau (IoT Technology), Nanjing University of Posts and Telecommunications, 66 Xinmofan Road, Nanjing 210003, China
* Author to whom correspondence should be addressed.
Systems 2025, 13(12), 1124; https://doi.org/10.3390/systems13121124
Submission received: 9 November 2025 / Revised: 9 December 2025 / Accepted: 14 December 2025 / Published: 16 December 2025
(This article belongs to the Topic Agents and Multi-Agent Systems)

Abstract

A central obstacle to the practical deployment of Reinforcement Learning (RL) is the prevalence of sparse rewards, which often necessitates task-specific dense signals crafted through costly trial-and-error. Automated reward decomposition and return–redistribution methods can reduce this burden, but they are largely semantically agnostic and may fail to capture the multifaceted nature of task performance, leading to reward hacking or stalled exploration. Recent work uses Large Language Models (LLMs) to generate reward functions from high-level task descriptions, but these specifications are typically static and may encode biases or inaccuracies from the pretrained model, resulting in a priori reward misspecification. To address this, we propose the Metacognitive Introspective Reward Architecture (MIRA), a closed-loop architecture that treats LLM-generated reward code as a dynamic object refined through empirical feedback. An LLM first produces a set of computable reward factors. A dual-loop design then decouples policy learning from reward revision: an inner loop jointly trains the agent’s policy and a reward-synthesis network to align with sparse ground-truth outcomes, while an outer loop monitors learning dynamics via diagnostic metrics and, upon detecting pathological signatures, invokes the LLM to perform targeted structural edits. Experiments on MuJoCo benchmarks show that MIRA corrects flawed initial specifications and improves asymptotic performance and sample efficiency over strong reward-design baselines.
Keywords: Reinforcement Learning (RL); Large Language Models (LLMs); value alignment; reward shaping
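
To make the dual-loop design described in the abstract easier to picture, the Python fragment below is a minimal sketch under assumptions, not the authors' implementation: the names initial_reward_factors, RewardSynthesizer, diagnose, and llm_edit_factors are hypothetical, the trajectory data are a toy stand-in for a MuJoCo task, and the LLM call is replaced by a hard-coded structural edit. The inner loop aligns a small reward-synthesis model with the sparse ground-truth outcome; the outer loop watches a learning-dynamics diagnostic and revises the factor code when it flags a pathology.

import numpy as np

def initial_reward_factors(state):
    # Stand-in for the LLM-generated computable reward factors (illustrative).
    return np.array([state[0], -abs(state[1])])

class RewardSynthesizer:
    # Tiny linear reward-synthesis model: combines factors into a dense reward.
    def __init__(self, n_factors, lr=1e-2):
        self.w = np.zeros(n_factors)
        self.lr = lr

    def reward(self, factors):
        return float(self.w @ factors)

    def update(self, factors_batch, sparse_return):
        # Regress the summed synthesized reward toward the sparse ground-truth return.
        pred = sum(self.w @ f for f in factors_batch)
        err = sparse_return - pred
        self.w += self.lr * err * sum(factors_batch)

def diagnose(returns):
    # Outer-loop diagnostic: flag stalled learning (illustrative heuristic only).
    if len(returns) < 20:
        return None
    recent, earlier = np.mean(returns[-10:]), np.mean(returns[-20:-10])
    return "stalled" if recent <= earlier + 1e-3 else None

def llm_edit_factors(current_factors_fn, signature):
    # Placeholder for the LLM call that performs a targeted structural edit;
    # here the "edit" is hard-coded (append an energy-penalty factor).
    def edited(state):
        factors = current_factors_fn(state)
        return np.append(factors, -0.01 * float(np.sum(state ** 2)))
    return edited

rng = np.random.default_rng(0)
factors_fn = initial_reward_factors
synth = RewardSynthesizer(n_factors=2)
episode_returns = []

for episode in range(100):
    states = rng.normal(size=(50, 2))                 # stand-in trajectory
    factors_batch = [factors_fn(s) for s in states]
    sparse_return = float(states[:, 0].sum() > 0.0)   # sparse ground-truth outcome

    # Inner loop: the policy update would consume synth.reward(...) as the dense
    # signal; only the reward-synthesis alignment step is shown here.
    synth.update(factors_batch, sparse_return)
    episode_returns.append(sparse_return)

    # Outer loop: on a pathological signature, invoke the (stubbed) LLM edit and
    # rebuild the synthesizer to match the revised factor set.
    signature = diagnose(episode_returns)
    if signature is not None:
        factors_fn = llm_edit_factors(factors_fn, signature)
        synth = RewardSynthesizer(n_factors=len(factors_fn(states[0])))
        episode_returns.clear()

The sketch only shows control flow: in the paper's setting the inner loop also performs policy optimization with the synthesized dense reward, and the diagnostic metrics and LLM edits are richer than the single stall heuristic and fixed edit used here.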
