1. Introduction
Large Language Model (LLM)-based autonomous agents have progressed rapidly in recent years, demonstrating promising capabilities in task planning, knowledge-intensive reasoning, and multi-turn interaction [1]. However, relying solely on parametric knowledge encoded in model weights still entails substantial limitations, particularly a heightened risk of hallucinations when operating in open and dynamically evolving information environments [2]. To mitigate these issues, Retrieval-Augmented Generation (RAG) [3] has been proposed, which augments the generation process with external knowledge sources and enables the model to dynamically acquire evidence that is highly relevant to the current task.
A typical RAG system consists of two collaborative components: a vector-based retrieval module that identifies content most relevant to the input from large-scale knowledge corpora, and a generation module that conditions on both the retrieved evidence and the original query to produce more accurate, coherent, and factually grounded responses. By integrating external knowledge into the inference process, RAG reduces uncertainty arising from the model’s reliance on purely parametric memory and achieves greater reliability and robustness in knowledge-intensive tasks.
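For concreteness, the following Python sketch outlines this two-stage pipeline. It is a minimal illustration only: the embedding function, vector index, and LLM client are hypothetical stand-ins rather than components of any specific system.

    from typing import Callable, List

    def rag_answer(
        query: str,
        embed: Callable[[str], List[float]],   # hypothetical embedding model
        index,                                  # vector store with .search(vec, k) -> List[str]
        llm_generate: Callable[[str], str],     # hypothetical LLM client
        k: int = 5,
    ) -> str:
        # Retrieval: select the k passages most similar to the query embedding.
        passages = index.search(embed(query), k)
        # Generation: condition on both the retrieved evidence and the original query.
        prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {query}\nAnswer:"
        return llm_generate(prompt)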
Although RAG expands an agent’s access to knowledge within a single inference step, its overall mechanism remains essentially reactive and oriented toward immediate response, lacking the ability to persist and update information produced during interaction. Conventional RAG pipelines provide little support for revisiting generated outputs through systematic review, evaluation, or strategy adaptation, preventing the agent from internalizing experience and from building a memory structure that evolves with its environment. In addition, RAG typically relies on a static external knowledge base where the system can only passively consume pre-existing information, with no principled mechanism for updating, organizing, or restructuring the underlying knowledge. This limits its capacity to absorb new information and maintain consistency as the world changes. Consequently, while retrieval enriches the model’s external knowledge sources, the overall paradigm still operates in a fast, one-shot reasoning mode, falling short of the deeper reflective processes and self-evolution capabilities required by autonomous agents in long-horizon settings.
Building on these observations, we argue that to build truly self-evolving agents that can continuously learn and improve autonomously, we need to introduce a complementary process that can reflect on, integrate, and evolve the interaction process. Inspired by Kahneman’s dual-process theory [4], we view fast reasoning and deep reflection as two independent yet collaborative processes in the agent’s memory system: the former focuses on immediate decision-making and efficient generation, while the latter is responsible for abstracting experience, evaluating behavior, and updating internal representations and long-term knowledge based on that evaluation.
To this end, we propose a Dual-Process Agent (DPA) framework for continual context refinement (as shown in Figure 1b). Our design draws on both cognitive dual-process theory and established AI cognitive architectures. In cognitive science, System 1 denotes fast, automatic, associative processing, while System 2 denotes slow, deliberate, rule-based reasoning [5,6]. Similarly, in DPA, System 1 corresponds to feed-forward inference, where the agent retrieves relevant context via pattern matching and generates a response in a single pass, while System 2 corresponds to deliberate meta-cognition, where the agent revisits the interaction trace, evaluates outcomes, and decides whether to update the long-term memory store. This setup reflects dual-process views in which System 1 produces a default response that System 2 can refine through reflective monitoring. This decomposition also connects to AI cognitive architectures such as SOAR [7] and ACT-R [8], where production systems distinguish between automatic pattern-matched actions and deliberate goal-directed reasoning.
In contrast to conventional linear RAG pipelines that rely on static knowledge bases and lack any post-interaction update mechanism, our framework augments the fast reactive loop with an evolution-and-reflection phase that systematically compresses, abstracts, and integrates the reasoning trajectory, retrieved evidence, and final outputs after each interaction, thereby driving the dynamic evolution of long-term memory. Concretely, a user query is first passed to a retrieval module, which selects the most relevant memory snippets from a continually updated long-term memory store and feeds them, together with the current query, into the LLM to produce the final response. The full interaction trace is then distilled in the background into reusable memory units and written back into long-term memory, allowing the memory store to continually adapt and grow with each episode. This closed-loop mechanism enables the agent to accumulate experience over long-horizon episodes and maintain an evolving long-term memory structure without updating backbone parameters.
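The closed loop can be summarized in pseudocode form. The sketch below is illustrative rather than our exact implementation; `memory`, `llm`, and `reflect` are hypothetical interfaces.

    def dpa_episode(query, memory, llm, reflect):
        # System 1: fast feed-forward inference over retrieved memory snippets.
        snippets = memory.retrieve(query, k=5)
        response = llm.generate(query=query, context=snippets)
        # System 2: deliberate reflection over the full interaction trace,
        # run after the response is returned (e.g., in the background).
        trace = {"query": query, "context": snippets, "response": response}
        for unit in reflect(trace):      # distilled, reusable memory units
            memory.commit(unit)          # write back into long-term memory
        return response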
To validate the effectiveness of DPA, we conduct extensive experiments on six benchmarks spanning factual QA, multi-hop reasoning, instruction following, and mathematical reasoning (TruthfulQA, StrategyQA, IFEval, HotpotQA, FreshQA, and GSM-IC) using both GPT-5.1 and Llama-3.1-8B as backbones. Experimental results demonstrate that DPA achieves the strongest average performance across all benchmarks, with an average accuracy of 77.3% on GPT-5.1 and 67.0% on Llama-3.1-8B. We summarize our main contributions as follows:
We propose a cognitively-inspired dual-process autonomous agent framework that explicitly separates fast response and deep reflection into two complementary processes, enabling the agent to combine immediate reasoning with continuous self-improvement.
We design an evolvable long-term memory system that automatically distills, integrates, and reorganizes interaction experiences, enabling continual context refinement and self-improvement over extended interaction streams.
We instantiate DPA as a complete end-to-end pipeline with retrieval, reflection, curation, and memory maintenance modules, and conduct comprehensive experiments and ablation studies to validate its effectiveness.
4. Experiments
We evaluate DPA on six benchmarks spanning factual QA, multi-hop reasoning, instruction following, and mathematical reasoning. Our experiments address three questions: (i) whether DPA improves over static baselines and existing adaptive methods, (ii) which System 2 components contribute most to performance, and (iii) whether memory evolution remains stable over long streams.
4.1. Experimental Setup
Datasets. We select six benchmarks that collectively test different aspects of memory-augmented reasoning: knowledge accuracy, multi-step inference, instruction compliance, temporal sensitivity, and robustness to distraction.
TruthfulQA [39] evaluates whether models avoid generating false but plausible-sounding answers. We use the binary-choice format under the January 2025 knowledge cutoff. Distinguishing truth from plausible misconceptions requires nuanced world knowledge, so we evaluate this benchmark on GPT-5.1.
StrategyQA [40] requires implicit multi-step reasoning to answer yes/no questions (e.g., “Did Aristotle use a laptop?”). Memory entries capturing successful decomposition strategies can transfer across questions with similar implicit structure. We evaluate this dataset using GPT-5.1.
IFEval [41] tests instruction-following with verifiable constraints such as word-count limits, formatting requirements, and content restrictions. We report prompt-level accuracy (all constraints satisfied). We evaluate on both GPT-5.1 and Llama-3.1-8B for this dataset.
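Prompt-level accuracy admits a one-line scoring rule: a prompt counts as correct only if every verifiable constraint passes. The helper below is a minimal sketch with hypothetical constraint checkers.

    def prompt_level_correct(response: str, constraints) -> bool:
        # `constraints` is a list of callables, each verifying one constraint
        # (e.g., word-count limit, required format) and returning a bool.
        return all(check(response) for check in constraints)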
HotpotQA [42] is a multi-hop QA benchmark requiring reasoning over multiple supporting documents. We use the validation split and report token-level F1 on the final answer. Unlike the original open-domain setting where models can retrieve from Wikipedia or provided supporting documents, we adopt a closed-book setting where the model receives only the question and must answer based on its parametric knowledge and accumulated memory (i.e., we perform no external document/corpus retrieval). This setting is more challenging and is intended to test whether DPA can distill reusable multi-hop reasoning knowledge into memory to improve downstream performance, rather than to evaluate external document retrieval capability. We evaluate on GPT-5.1.
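For reference, token-level F1 on the final answer can be computed as below. This is the standard multiset-overlap formulation, omitting the usual answer normalization (beyond lowercasing) for brevity.

    from collections import Counter

    def token_f1(prediction: str, gold: str) -> float:
        pred_toks = prediction.lower().split()
        gold_toks = gold.lower().split()
        # Multiset intersection counts tokens shared between prediction and gold.
        overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_toks)
        recall = overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)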
FreshQA [43] contains time-sensitive factual questions whose answers may change over time (e.g., “Who is the current CEO of Twitter?”). We follow the FreshEval protocol with a deterministic LLM judge (temperature 0). We evaluate this benchmark using Llama-3.1-8B, and use GPT-5.1 as the judge model. This benchmark tests whether DPA can learn meta-strategies for handling temporal uncertainty.
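The judging step can be sketched as follows; the prompt and client interface here are simplified assumptions, not the actual FreshEval prompt.

    def judge_answer(judge_llm, question: str, prediction: str, gold: str) -> bool:
        # Deterministic grading: the judge model is queried at temperature 0.
        prompt = (
            f"Question: {question}\n"
            f"Reference answer: {gold}\n"
            f"Model answer: {prediction}\n"
            "Is the model answer correct? Reply YES or NO."
        )
        verdict = judge_llm.generate(prompt, temperature=0)
        return verdict.strip().upper().startswith("YES")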
GSM-IC [44] augments grade-school math problems with irrelevant context designed to distract the model. We report Exact Match (EM) on the final numerical answer. The underlying math is straightforward, so we evaluate on Llama-3.1-8B to test whether memory can help smaller models learn to ignore distracting information.
Baselines. We compare against five systems: Vanilla (direct prompting); Abstention Prompting [45,46,47] (instructing the model to reply “Unknown” when uncertain); Self-Refine [30] (iterative refinement without external memory); SwiftSage [48] (dual-process fast/slow reasoning); and Dynamic Cheatsheet [49] (memory-augmented hint accumulation; we use the Cumulative Memory configuration from the original paper). These baselines span prompting-only, self-correction, and memory-augmented approaches, allowing us to isolate the contribution of DPA’s specific design choices.
Implementation. We evaluate GPT-5.1 via API and Llama-3.1-8B locally; for Llama-3.1-8B, we use vLLM for efficient inference. For memory-enabled methods, we fix a deterministic stream order (seed 42) to ensure reproducibility. During each episode, the model receives only the current question (plus retrieved context for memory-based methods); no ground truth is available during answer generation. Feedback is revealed only after the episode and is then used for evaluation and, where applicable, memory updates. Memory-augmented methods receive the same post-episode feedback signal under each task, while non-memory baselines do not use cross-episode updates by design.
Table 1 summarizes key hyperparameters; we use consistent settings across benchmarks with minor task-specific adjustments noted in the footnotes.
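The evaluation protocol can be summarized by the following sketch, which fixes the stream order with seed 42 and reveals feedback only after each episode; `agent.run_episode` and `agent.update_memory` are hypothetical interfaces.

    import random

    def evaluate_stream(examples, agent, memory_enabled: bool) -> float:
        # Deterministic stream order for reproducibility (seed 42).
        order = list(range(len(examples)))
        random.Random(42).shuffle(order)
        correct = 0
        for i in order:
            question, answer = examples[i]
            prediction = agent.run_episode(question)  # no ground truth visible here
            feedback = (prediction == answer)          # revealed only post-episode
            correct += int(feedback)
            if memory_enabled:                         # cross-episode updates only
                agent.update_memory(question, prediction, feedback)
        return correct / len(examples)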
4.2. Main Results
Table 2 presents results on GPT-5.1 across four benchmarks. DPA achieves the highest overall performance (77.3% macro-average). It attains the best accuracy on TruthfulQA (92.3%) and the second-best F1 on HotpotQA (47.1%), indicating strong performance on both factual accuracy and multi-hop reasoning. While SwiftSage achieves slightly higher accuracy on IFEval (93.1% vs. 90.6%), DPA remains the strongest method on average across the evaluated GPT-5.1 settings.
Table 3 reports results on Llama-3.1-8B. DPA achieves the strongest overall performance across the three evaluated benchmarks: FreshQA (34.7%), GSM-IC (91.8%), and IFEval (74.7%). The improvement over vanilla is particularly notable on FreshQA (+13.2%) and IFEval (+5.4%), suggesting that accumulated memory entries capture reusable strategies for time-sensitive QA and constraint satisfaction.
4.3. Learning Dynamics
A key design goal of DPA is stable memory evolution over extended streams. Figure 3 shows cumulative accuracy (left) and memory size (right) on IFEval as functions of episode index. Both backbones exhibit stable learning dynamics, and the memory store grows sub-linearly, indicating that the curator gate successfully filters redundant or low-quality insights. This controlled growth validates the conservative commit strategy (Section 3.3), which prioritizes high-confidence updates over aggressive expansion.
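A minimal sketch of this conservative commit policy is shown below; the scoring and similarity functions, and the thresholds, are illustrative assumptions.

    def curator_commit(candidate, store, score, similarity,
                       gate_threshold: float = 0.8,
                       dedup_threshold: float = 0.9) -> bool:
        # Quality gate: reject candidate insights below the gate threshold.
        if score(candidate) < gate_threshold:
            return False
        # Redundancy filter: reject near-duplicates of existing entries.
        for entry in store:
            if similarity(candidate, entry) > dedup_threshold:
                return False
        store.append(candidate)  # only high-confidence, novel updates are committed
        return True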
4.4. System 2 Component Ablations
To isolate the contribution of each System 2 component, we disable one component at a time while keeping all other settings fixed. Specifically, we evaluate three ablation variants (summarized as configuration flags in the sketch after this list):
w/o Refl. disables the reflector, so memory entries are committed directly without structured outcome analysis or credit assignment;
w/o Gate disables the curator quality gate, allowing all candidate insights to pass through without filtering; and
w/o Prune disables the pruning mechanism, so persistently harmful entries (those with high harm scores under the credit assignment of Equation (7)) are never removed.
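The sketch below expresses the three variants as configuration flags over a full System 2; the flag names are ours, introduced only for illustration.

    from dataclasses import dataclass

    @dataclass
    class System2Config:
        use_reflector: bool = True  # structured outcome analysis + credit assignment
        use_gate: bool = True       # curator quality gate on candidate insights
        use_prune: bool = True      # removal of persistently harmful entries

    ABLATIONS = {
        "full":      System2Config(),
        "w/o Refl.": System2Config(use_reflector=False),
        "w/o Gate":  System2Config(use_gate=False),
        "w/o Prune": System2Config(use_prune=False),
    }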
Table 4 and Table 5 present results for GPT-5.1 and Llama-3.1-8B, respectively.
Disabling the reflector (w/o Refl.) causes notable degradation on most benchmarks. On Llama, FreshQA accuracy drops from 34.7% to 29.5% (a drop of 5.2 percentage points), confirming that explicit outcome analysis is essential for generating useful memory updates. Interestingly, on GPT-5.1, the w/o Refl. variant remains competitive on StrategyQA (81.8%), suggesting that stronger backbones may partially compensate for the lack of structured reflection.
Removing the curator gate (w/o Gate) degrades performance, particularly on Llama where the overall average drops from 67.0% to 64.0%. Without quality filtering, memory accumulates noisy or conflicting entries that mislead subsequent retrieval—underscoring the importance of conservative writes for long-horizon stability.
Disabling pruning (w/o Prune) yields the lowest overall performance on Llama (63.8%), with consistent degradation across benchmarks. The pruning mechanism removes persistently harmful entries based on credit assignment (Equation (7)); its absence allows detrimental memories to persist and negatively influence context selection.
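The pruning step can be sketched as follows; the entry attributes and thresholds are illustrative assumptions standing in for the credit-assignment statistics of Equation (7).

    def prune_memory(store, harm_threshold: float = 0.5, min_uses: int = 3) -> None:
        survivors = []
        for entry in store:
            uses = max(entry.retrieval_count, 1)
            harm_rate = entry.error_credit / uses  # fraction of retrievals tied to errors
            # Remove entries that are retrieved often enough to judge and are
            # persistently associated with incorrect outcomes.
            if entry.retrieval_count >= min_uses and harm_rate > harm_threshold:
                continue
            survivors.append(entry)
        store[:] = survivors  # in-place update of the memory store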
4.5. Qualitative Analysis
Figure 4 illustrates how accumulated memory entries influence model behavior through three representative cases where DPA succeeds but the vanilla baseline fails.
In Case 1 (FreshQA), the question contains a false premise—the 2022 Winter Olympics were held in Beijing, not Russia. The vanilla baseline catastrophically fails, entering a repetition loop. DPA retrieves meta-strategies about verifying premises and produces a stable, correct response.
In Case 2 (TruthfulQA), the question asks which U.S. state produces the most peaches. Although Georgia is nicknamed “The Peach State”, California actually leads in production. Vanilla selects the culturally prevalent but incorrect answer (Georgia), while DPA retrieves evaluative heuristics that encourage skepticism toward surface-level associations.
In Case 3 (IFEval), the instruction forbids using “can” and “ride” in a story about learning to bike. Vanilla violates the constraint; DPA successfully avoids both words by using alternatives such as “glided” and “pedaling”.
These cases illustrate three types of transferable knowledge: meta-strategies for premise verification, skepticism toward common misconceptions, and procedural awareness for constraint satisfaction. Crucially, none of these insights encode task-specific facts; they capture reasoning patterns that generalize across benchmarks.
4.6. Cost Analysis
Figure 5 presents the token cost breakdown (GPT-5.1 DPA runs; same four benchmarks as Table 2). Reflection and curation account for 13.5% (TruthfulQA), 15.6% (IFEval), 38.4% (StrategyQA), and 62.3% (HotpotQA) of total tokens, reflecting both task difficulty and how often System 2 is triggered. This overhead can be reduced further by triggering System 2 selectively (e.g., only on incorrect predictions or low-confidence outputs).
5. Discussion
Our experiments reveal a trade-off in memory-augmented inference with frozen backbones: persistent context can mitigate recurring failure modes, but only when it remains query-aligned, compact, and robust to noisy writes. DPA improves performance by storing a small set of reusable strategies and retrieving them selectively. In contrast, methods that inject a growing, query-agnostic state can introduce interference, with effects that are more pronounced for smaller backbones.
5.1. On the Interaction Between Memory Design and Model Scale
We observe notable performance differences between memory-augmented approaches when evaluated on models of varying scale. Specifically, on Llama-3.1-8B (Table 3), we find that while DPA maintains consistent improvements over the vanilla baseline, Dynamic Cheatsheet (DC) [49] exhibits performance below the baseline across all three benchmarks, with average accuracy dropping from 60.8% to 34.1%. This divergence is particularly interesting given that both methods maintain persistent state across episodes, suggesting that architectural choices in memory management may interact differently with backbone model scale.
We identify three potential factors that may contribute to this phenomenon, which offer insights into the design space of memory-augmented systems for resource-constrained settings:
(1) Trade-offs in context injection strategies. DC employs a design where the entire evolving cheatsheet is provided to every episode, independent of query content. While this ensures comprehensive coverage, it can result in contexts containing information spanning multiple domains. In our experiments, cheatsheets often grow to thousands of characters. For smaller models with more limited reasoning capabilities, distinguishing task-relevant information from a large heterogeneous context may present additional challenges, potentially leading to outputs that deviate from expected formats or violate task-specific constraints.
(2) Sensitivity to output format conventions. DC uses a structured response template together with a dedicated extractor. In our experiments on Llama, strict compliance with the template is not always reliable. On IFEval, the extractor occasionally fails to identify a valid final answer. This is especially important for instruction-following tasks, where small formatting deviations can materially change evaluation results. Extra headings, meta-commentary, or minor structural changes may be sufficient to break extraction. In these cases, the resulting errors tend to reflect surface-form constraints, including punctuation, casing, and length, rather than semantic content alone.
(3) Considerations for memory update policies. DC updates its cheatsheet after each episode without explicit quality filtering or credit assignment mechanisms. While this ensures comprehensive coverage of experiences, it may allow less informative or task-misaligned entries to persist. Over extended interaction streams, such entries could be repeatedly reintroduced, potentially creating feedback loops between generation patterns and memory state.
These observations suggest an important design consideration: as model scale decreases, the benefits of selective retrieval and quality-controlled memory updates may become more pronounced. More broadly, unfiltered cumulative memory appears substantially more fragile than selective retrieval with conservative curation in resource-constrained settings, particularly on tasks with strict output constraints. We emphasize that these findings reflect specific experimental conditions and implementation choices, and different design decisions or hyperparameter configurations might yield different outcomes.
5.2. When Is Memory Helpful—and When Can It Hurt?
The ablations on GPT-5.1 indicate that memory is not uniformly beneficial; its value depends on whether System 2 produces high-signal, reusable updates and whether the system prevents low-quality writes from accumulating (Table 4).
Memory helps when a small set of strategies generalizes. On some tasks, a handful of persistent rules can correct recurring blind spots. For example, on IFEval the full system admits very few memory entries over hundreds of episodes, yet achieves higher prompt-level accuracy than variants without reflection or without quality control. This supports the view that memory quality can dominate memory quantity: compact, reusable strategies can improve constraint satisfaction without large stores.
Memory hurts when updates are noisy or miscalibrated. Two failure modes are visible in the ablations. First, removing the curator gate causes the system to commit nearly all proposed updates, increasing memory size and reducing performance on three of the four GPT-5.1 benchmarks; on HotpotQA, F1 drops from 47.1 to 40.6. Second, disabling pruning degrades long-horizon performance (HotpotQA: 47.1 → 40.0), suggesting that some memories are net harmful and should be down-weighted or removed when they repeatedly correlate with errors.
System 2 is not always worth triggering. Removing the reflector improves StrategyQA (81.8 vs. 79.2), even though it hurts the other benchmarks. When the backbone is already strong and the feedback signal is coarse, reflection may generate low-signal heuristics that compete with the base model’s decision rule. Binary correctness on short answers provides limited information for meaningful reflection. In such regimes, a conservative policy may be preferable: writing less or triggering System 2 only under stronger evidence of uncertainty. In the current implementation, System 2 is triggered after every episode for simplicity and reproducibility. However, several adaptive triggering strategies could reduce the 13–62% System 2 token overhead reported in our cost analysis (Figure 5) while preserving the benefits where they matter most: (i) confidence-based triggering, where System 2 activates only when the backbone’s output probability or self-reported confidence falls below a threshold; (ii) consistency-based triggering, where multiple responses are sampled and reflection is triggered only when they disagree, signaling genuine uncertainty; and (iii) constraint-based triggering, where a lightweight verifier checks whether the output satisfies known structural constraints before committing to full reflection. Designing and evaluating such adaptive triggers is an important direction for future work.
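The three strategies could be combined into a single gating function. The sketch below is a hypothetical composition; `llm.confidence`, `llm.sample_responses`, and the constraint checker are assumed interfaces, not part of our implementation.

    def should_trigger_system2(llm, query, response, check_constraints,
                               conf_threshold: float = 0.7,
                               n_samples: int = 3) -> bool:
        # (i) Confidence-based: reflect only when the model reports low confidence.
        if llm.confidence(query, response) < conf_threshold:
            return True
        # (ii) Consistency-based: reflect when resampled answers disagree.
        samples = llm.sample_responses(query, n=n_samples)
        if len(set(samples)) > 1:
            return True
        # (iii) Constraint-based: reflect when a lightweight verifier fails.
        return not check_constraints(query, response)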
5.3. Limitations and Open Problems
Order dependence and calibration. Memory is learned sequentially and is therefore sensitive to stream order and the gate threshold. The ablations suggest that the optimal write policy can be task-dependent: for some datasets, aggressive reflection can inject noise (StrategyQA), while for others pruning and gating are essential (HotpotQA). Future work should quantify variance across multiple random seeds and explore adaptive gate calibration.
Scalability and scope. Our memory is a lightweight strategy bank, not a knowledge base: it stores procedural heuristics and error-avoidance rules, and may not help when improvements require new external facts or long-form, compositional skills. Moreover, on high-variance tasks the store can grow to hundreds of entries, raising questions about retrieval saturation, compression, and long-term stability.
Cross-task transfer and robustness. Our experiments maintain an independent memory store per benchmark, and we do not study cross-task transfer. Although some meta-level heuristics may generalize, others are likely task-specific and can induce negative transfer when retrieved out of context, especially under distribution shift. Systematically evaluating cross-task memory sharing and developing safeguards against such negative transfer remain important directions for future work.
Open-domain applicability. Our benchmarks involve well-defined, single-turn tasks with clear feedback signals. Deploying DPA in open-domain settings such as real-time dialogue or open-ended reasoning would introduce additional challenges, including sparse and noisy feedback, unbounded interaction streams, and stronger cross-domain retrieval interference. Addressing these issues would likely require more aggressive memory compression, better session-level memory management, and topic-aware organization of long-term memory. Despite these challenges, DPA’s core mechanisms (selective retrieval, conservative curation, credit-based pruning) are domain-agnostic by design and should transfer to open-domain settings with appropriate adaptations.
Security and privacy considerations. Memory-augmented LLM agents introduce security and privacy concerns that are distinct from stateless models. An adversary may influence the input stream or feedback signal to inject persistent harmful memories, creating an attack surface related to backdoor, prompt-injection, and data-poisoning risks [50,51]. In addition, stored reasoning artifacts may inadvertently expose sensitive information from prior interactions. Mitigating these risks requires stronger memory access control, storage-layer filtering, and privacy-preserving memory management.
6. Conclusions
We presented the Dual-Process Agent (DPA), a framework for continual context refinement that keeps the backbone LLM frozen while evolving an explicit, editable long-term memory. DPA couples a fast System 1 that answers each instance with retrieval-augmented context and a conservative System 2 that audits outcomes, assigns credit to retrieved memories, and commits curated, localized edits to memory.
Across six benchmarks, DPA improves performance over static prompting and competitive baselines by accumulating reusable, deployment-time memories that refine future context selection. These findings also suggest several concrete directions for future work.
Adaptive gate calibration. The current curator gate uses a fixed commit threshold, but the optimal write policy is task-dependent. Learning the gate threshold from memory state or task difficulty could make the system more aggressive when reflection is consistently useful and more conservative when it risks injecting noise.
Memory compression and cross-task transfer. Our experiments maintain independent memory stores per benchmark. Compressing task-specific entries into higher-level meta-strategies could support knowledge sharing across tasks while reducing redundancy and negative transfer.
Selective System 2 triggering. Triggering reflection after every episode incurs unnecessary overhead when the model is already confident and correct. Confidence-based or consistency-based triggers could reduce computational cost while preserving the benefits of reflection where they matter most.
Integration with parameter-efficient fine-tuning. DPA currently keeps the backbone frozen. Combining memory evolution with lightweight parameter updates such as LoRA [52] could allow persistent high-confidence patterns to be internalized into model weights while preserving the flexibility of external memory.
Together, these directions aim to make continual context refinement reliably beneficial across a wider range of backbones, tasks, and deployment regimes.