1. Introduction
Large Language Model (LLM)-based autonomous agents have progressed rapidly in recent years, demonstrating promising capabilities in task planning, knowledge-intensive reasoning, and multi-turn interaction [1]. However, relying solely on parametric knowledge encoded in model weights still entails substantial limitations, particularly a heightened risk of hallucinations when operating in open and dynamically evolving information environments [2]. To mitigate these issues, Retrieval-Augmented Generation (RAG) [3] has been proposed, which augments the generation process with external knowledge sources and enables the model to dynamically acquire evidence that is highly relevant to the current task.
A typical RAG system consists of two collaborative components: a vector-based retrieval module that identifies content most relevant to the input from large-scale knowledge corpora, and a generation module that conditions on both the retrieved evidence and the original query to produce more accurate, coherent, and factually grounded responses. By integrating external knowledge into the inference process, RAG reduces uncertainty arising from the model’s reliance on purely parametric memory and achieves greater reliability and robustness in knowledge-intensive tasks.
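For concreteness, the following Python sketch outlines this two-stage pipeline. It is a minimal illustration only: the embedding function, vector index, and LLM client are hypothetical stand-ins rather than components of any specific system.

    from typing import Callable, List

    def rag_answer(
        query: str,
        embed: Callable[[str], List[float]],   # hypothetical embedding model
        index,                                  # vector store with .search(vec, k) -> List[str]
        llm_generate: Callable[[str], str],     # hypothetical LLM client
        k: int = 5,
    ) -> str:
        # Retrieval: select the k passages most similar to the query embedding.
        passages = index.search(embed(query), k)
        # Generation: condition on both the retrieved evidence and the original query.
        prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {query}\nAnswer:"
        return llm_generate(prompt)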
Although RAG expands an agent’s access to knowledge within a single inference step, its overall mechanism remains essentially reactive and oriented toward immediate response, lacking the ability to persist and update information produced during interaction. Conventional RAG pipelines provide little support for revisiting generated outputs through systematic review, evaluation, or strategy adaptation, preventing the agent from internalizing experience and from building a memory structure that evolves with its environment. In addition, RAG typically relies on a static external knowledge base where the system can only passively consume pre-existing information, with no principled mechanism for updating, organizing, or restructuring the underlying knowledge. This limits its capacity to absorb new information and maintain consistency as the world changes. Consequently, while retrieval enriches the model’s external knowledge sources, the overall paradigm still operates in a fast, one-shot reasoning mode, falling short of the deeper reflective processes and self-evolution capabilities required by autonomous agents in long-horizon settings.
Building on these observations, we argue that to build truly self-evolving agents that can continuously learn and improve autonomously, we need to introduce a complementary process that can reflect on, integrate, and evolve the interaction process. Inspired by Kahneman’s dual-process theory [4], we view fast reasoning and deep reflection as two independent yet collaborative processes in the agent’s memory system: the former focuses on immediate decision-making and efficient generation, while the latter is responsible for abstracting experience, evaluating behavior, and updating internal representations and long-term knowledge based on that evaluation.
To this end, we propose a Dual-Process Agent (DPA) framework for continual context refinement (as shown in Figure 1b). Our design draws on both cognitive dual-process theory and established AI cognitive architectures. In cognitive science, System 1 denotes fast, automatic, associative processing, while System 2 denotes slow, deliberate, rule-based reasoning [5,6]. Similarly, in DPA, System 1 corresponds to feed-forward inference, where the agent retrieves relevant context via pattern matching and generates a response in a single pass, while System 2 corresponds to deliberate meta-cognition, where the agent revisits the interaction trace, evaluates outcomes, and decides whether to update the long-term memory store. This setup reflects dual-process views in which System 1 produces a default response that System 2 can refine through reflective monitoring. This decomposition also connects to AI cognitive architectures such as SOAR [7] and ACT-R [8], where production systems distinguish between automatic pattern-matched actions and deliberate goal-directed reasoning.
In contrast to conventional linear RAG pipelines that rely on static knowledge bases and lack any post-interaction update mechanism, our framework augments the fast reactive loop with an evolution-and-reflection phase that systematically compresses, abstracts, and integrates the reasoning trajectory, retrieved evidence, and final outputs after each interaction, thereby driving the dynamic evolution of long-term memory. Concretely, a user query is first passed to a retrieval module, which selects the most relevant memory snippets from a continually updated long-term memory store and feeds them, together with the current query, into the LLM to produce the final response. The full interaction trace is then distilled in the background into reusable memory units and written back into long-term memory, allowing the memory store to continually adapt and grow with each episode. This closed-loop mechanism enables the agent to accumulate experience over long-horizon episodes and maintain an evolving long-term memory structure without updating backbone parameters.
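The closed loop can be summarized in pseudocode form. The sketch below is illustrative rather than our exact implementation; `memory`, `llm`, and `reflect` are hypothetical interfaces.

    def dpa_episode(query, memory, llm, reflect):
        # System 1: fast feed-forward inference over retrieved memory snippets.
        snippets = memory.retrieve(query, k=5)
        response = llm.generate(query=query, context=snippets)
        # System 2: deliberate reflection over the full interaction trace,
        # run after the response is returned (e.g., in the background).
        trace = {"query": query, "context": snippets, "response": response}
        for unit in reflect(trace):      # distilled, reusable memory units
            memory.commit(unit)          # write back into long-term memory
        return response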
To validate the effectiveness of DPA, we conduct extensive experiments on six benchmarks spanning factual QA, multi-hop reasoning, instruction following, and mathematical reasoning (TruthfulQA, StrategyQA, IFEval, HotpotQA, FreshQA, and GSM-IC) using both GPT-5.1 and Llama-3.1-8B as backbones. Experimental results demonstrate that DPA achieves the strongest average performance across all benchmarks, with an average accuracy of 77.3% on GPT-5.1 and 67.0% on Llama-3.1-8B. We summarize our main contributions as follows:
We propose a cognitively-inspired dual-process autonomous agent framework that explicitly separates fast response and deep reflection into two complementary processes, enabling the agent to combine immediate reasoning with continuous self-improvement.
We design an evolvable long-term memory system that automatically distills, integrates, and reorganizes interaction experiences, enabling continual context refinement and self-improvement over extended interaction streams.
We instantiate DPA as a complete end-to-end pipeline with retrieval, reflection, curation, and memory maintenance modules, and conduct comprehensive experiments and ablation studies to validate its effectiveness.
4. Experiments
We evaluate DPA on six benchmarks spanning factual QA, multi-hop reasoning, instruction following, and mathematical reasoning. Our experiments address three questions: (i) whether DPA improves over static baselines and existing adaptive methods, (ii) which System 2 components contribute most to performance, and (iii) whether memory evolution remains stable over long streams.
4.1. Experimental Setup
Datasets. We select six benchmarks that collectively test different aspects of memory-augmented reasoning: knowledge accuracy, multi-step inference, instruction compliance, temporal sensitivity, and robustness to distraction.
TruthfulQA [39] evaluates whether models avoid generating false but plausible-sounding answers. We use the binary-choice format under the January 2025 knowledge cutoff. Distinguishing truth from plausible misconceptions requires nuanced world knowledge, so we evaluate this benchmark on GPT-5.1.
StrategyQA [40] requires implicit multi-step reasoning to answer yes/no questions (e.g., “Did Aristotle use a laptop?”). Memory entries capturing successful decomposition strategies can transfer across questions with similar implicit structure. We evaluate this dataset using GPT-5.1.
IFEval [41] tests instruction-following with verifiable constraints such as word-count limits, formatting requirements, and content restrictions. We report prompt-level accuracy (all constraints satisfied). We evaluate on both GPT-5.1 and Llama-3.1-8B for this dataset.
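Prompt-level accuracy admits a one-line scoring rule: a prompt counts as correct only if every verifiable constraint passes. The helper below is a minimal sketch with hypothetical constraint checkers.

    def prompt_level_correct(response: str, constraints) -> bool:
        # `constraints` is a list of callables, each verifying one constraint
        # (e.g., word-count limit, required format) and returning a bool.
        return all(check(response) for check in constraints)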
HotpotQA [42] is a multi-hop QA benchmark requiring reasoning over multiple supporting documents. We use the validation split and report token-level F1 on the final answer. Unlike the original open-domain setting where models can retrieve from Wikipedia or provided supporting documents, we adopt a closed-book setting where the model receives only the question and must answer based on its parametric knowledge and accumulated memory (i.e., we perform no external document/corpus retrieval). This setting is more challenging and is intended to test whether DPA can distill reusable multi-hop reasoning knowledge into memory to improve downstream performance, rather than to evaluate external document retrieval capability. We evaluate on GPT-5.1.
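For reference, token-level F1 on the final answer can be computed as below. This is the standard multiset-overlap formulation, omitting the usual answer normalization (beyond lowercasing) for brevity.

    from collections import Counter

    def token_f1(prediction: str, gold: str) -> float:
        pred_toks = prediction.lower().split()
        gold_toks = gold.lower().split()
        # Multiset intersection counts tokens shared between prediction and gold.
        overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_toks)
        recall = overlap / len(gold_toks)
        return 2 * precision * recall / (precision + recall)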
FreshQA [43] contains time-sensitive factual questions whose answers may change over time (e.g., “Who is the current CEO of Twitter?”). We follow the FreshEval protocol with a deterministic LLM judge (temperature 0). We evaluate this benchmark using Llama-3.1-8B, and use GPT-5.1 as the judge model. This benchmark tests whether DPA can learn meta-strategies for handling temporal uncertainty.
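The judging step can be sketched as follows; the prompt and client interface here are simplified assumptions, not the actual FreshEval prompt.

    def judge_answer(judge_llm, question: str, prediction: str, gold: str) -> bool:
        # Deterministic grading: the judge model is queried at temperature 0.
        prompt = (
            f"Question: {question}\n"
            f"Reference answer: {gold}\n"
            f"Model answer: {prediction}\n"
            "Is the model answer correct? Reply YES or NO."
        )
        verdict = judge_llm.generate(prompt, temperature=0)
        return verdict.strip().upper().startswith("YES")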
GSM-IC [44] augments grade-school math problems with irrelevant context designed to distract the model. We report Exact Match (EM) on the final numerical answer. The underlying math is straightforward, so we evaluate on Llama-3.1-8B to test whether memory can help smaller models learn to ignore distracting information.
Baselines. We compare against five systems: Vanilla (direct prompting); Abstention Prompting [45,46,47] (instructing the model to reply “Unknown” when uncertain); Self-Refine [30] (iterative refinement without external memory); SwiftSage [48] (dual-process fast/slow reasoning); and Dynamic Cheatsheet [49] (memory-augmented hint accumulation; we use the Cumulative Memory configuration from the original paper). These baselines span prompting-only, self-correction, and memory-augmented approaches, allowing us to isolate the contribution of DPA’s specific design choices.
Implementation. We evaluate GPT-5.1 via API and Llama-3.1-8B locally; for Llama-3.1-8B, we use vLLM for efficient inference. For memory-enabled methods, we fix a deterministic stream order (seed 42) to ensure reproducibility. During each episode, the model receives only the current question (plus retrieved context for memory-based methods); no ground truth is available during answer generation. Feedback is revealed only after the episode and is then used for evaluation and, where applicable, memory updates. Memory-augmented methods receive the same post-episode feedback signal under each task, while non-memory baselines do not use cross-episode updates by design.
Table 1 summarizes key hyperparameters; we use consistent settings across benchmarks with minor task-specific adjustments noted in the footnotes.
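The evaluation protocol can be summarized by the following sketch, which fixes the stream order with seed 42 and reveals feedback only after each episode; `agent.run_episode` and `agent.update_memory` are hypothetical interfaces.

    import random

    def evaluate_stream(examples, agent, memory_enabled: bool) -> float:
        # Deterministic stream order for reproducibility (seed 42).
        order = list(range(len(examples)))
        random.Random(42).shuffle(order)
        correct = 0
        for i in order:
            question, answer = examples[i]
            prediction = agent.run_episode(question)  # no ground truth visible here
            feedback = (prediction == answer)          # revealed only post-episode
            correct += int(feedback)
            if memory_enabled:                         # cross-episode updates only
                agent.update_memory(question, prediction, feedback)
        return correct / len(examples)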
4.2. Main Results
Table 2 presents results on GPT-5.1 across four benchmarks. DPA achieves the highest overall performance (77.3% macro-average). It attains the best accuracy on TruthfulQA (92.3%) and the second-best F1 on HotpotQA (47.1%), indicating strong performance on both factual accuracy and multi-hop reasoning. While SwiftSage achieves slightly higher accuracy on IFEval (93.1% vs. 90.6%), DPA remains the strongest method on average across the evaluated GPT-5.1 settings.
Table 3 reports results on Llama-3.1-8B. DPA achieves the strongest overall performance across the three evaluated benchmarks: FreshQA (34.7%), GSM-IC (91.8%), and IFEval (74.7%). The improvement over vanilla is particularly notable on FreshQA (+13.2%) and IFEval (+5.4%), suggesting that accumulated memory entries capture reusable strategies for time-sensitive QA and constraint satisfaction.
4.3. Learning Dynamics
A key design goal of DPA is stable memory evolution over extended streams. Figure 3 shows cumulative accuracy (left) and memory size (right) on IFEval as functions of episode index. Both backbones exhibit stable learning dynamics, and the memory store grows sub-linearly, indicating that the curator gate successfully filters redundant or low-quality insights. This controlled growth validates the conservative commit strategy (Section 3.3), which prioritizes high-confidence updates over aggressive expansion.
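A minimal sketch of this conservative commit policy is shown below; the scoring and similarity functions, and the thresholds, are illustrative assumptions.

    def curator_commit(candidate, store, score, similarity,
                       gate_threshold: float = 0.8,
                       dedup_threshold: float = 0.9) -> bool:
        # Quality gate: reject candidate insights below the gate threshold.
        if score(candidate) < gate_threshold:
            return False
        # Redundancy filter: reject near-duplicates of existing entries.
        for entry in store:
            if similarity(candidate, entry) > dedup_threshold:
                return False
        store.append(candidate)  # only high-confidence, novel updates are committed
        return True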
4.4. System 2 Component Ablations
To isolate the contribution of each System 2 component, we disable one component at a time while keeping all other settings fixed. Specifically, we evaluate three ablation variants (summarized as configuration flags in the sketch after this list):
w/o Refl. disables the reflector, so memory entries are committed directly without structured outcome analysis or credit assignment;
w/o Gate disables the curator quality gate, allowing all candidate insights to pass through without filtering; and
w/o Prune disables the pruning mechanism, so persistently harmful entries (those with high harm scores under the credit assignment of Equation (7)) are never removed.
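The sketch below expresses the three variants as configuration flags over a full System 2; the flag names are ours, introduced only for illustration.

    from dataclasses import dataclass

    @dataclass
    class System2Config:
        use_reflector: bool = True  # structured outcome analysis + credit assignment
        use_gate: bool = True       # curator quality gate on candidate insights
        use_prune: bool = True      # removal of persistently harmful entries

    ABLATIONS = {
        "full":      System2Config(),
        "w/o Refl.": System2Config(use_reflector=False),
        "w/o Gate":  System2Config(use_gate=False),
        "w/o Prune": System2Config(use_prune=False),
    }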
Table 4 and Table 5 present results for GPT-5.1 and Llama-3.1-8B, respectively.
Disabling the reflector (w/o Refl.) causes notable degradation on most benchmarks. On Llama, FreshQA accuracy drops from 34.7% to 29.5% (a drop of 5.2 percentage points), confirming that explicit outcome analysis is essential for generating useful memory updates. Interestingly, on GPT-5.1, the w/o Refl. variant remains competitive on StrategyQA (81.8%), suggesting that stronger backbones may partially compensate for the lack of structured reflection.
Removing the curator gate (w/o Gate) degrades performance, particularly on Llama where the overall average drops from 67.0% to 64.0%. Without quality filtering, memory accumulates noisy or conflicting entries that mislead subsequent retrieval—underscoring the importance of conservative writes for long-horizon stability.
Disabling pruning (w/o Prune) yields the lowest overall performance on Llama (63.8%), with consistent degradation across benchmarks. The pruning mechanism removes persistently harmful entries based on credit assignment (Equation (7)); its absence allows detrimental memories to persist and negatively influence context selection.
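The pruning step can be sketched as follows; the entry attributes and thresholds are illustrative assumptions standing in for the credit-assignment statistics of Equation (7).

    def prune_memory(store, harm_threshold: float = 0.5, min_uses: int = 3) -> None:
        survivors = []
        for entry in store:
            uses = max(entry.retrieval_count, 1)
            harm_rate = entry.error_credit / uses  # fraction of retrievals tied to errors
            # Remove entries that are retrieved often enough to judge and are
            # persistently associated with incorrect outcomes.
            if entry.retrieval_count >= min_uses and harm_rate > harm_threshold:
                continue
            survivors.append(entry)
        store[:] = survivors  # in-place update of the memory store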
4.5. Qualitative Analysis
Figure 4 illustrates how accumulated memory entries influence model behavior through three representative cases where DPA succeeds but the vanilla baseline fails.
In Case 1 (FreshQA), the question contains a false premise—the 2022 Winter Olympics were held in Beijing, not Russia. The vanilla baseline catastrophically fails, entering a repetition loop. DPA retrieves meta-strategies about verifying premises and produces a stable, correct response.
In Case 2 (TruthfulQA), the question asks which U.S. state produces the most peaches. Although Georgia is nicknamed “The Peach State”, California actually leads in production. Vanilla selects the culturally prevalent but incorrect answer (Georgia), while DPA retrieves evaluative heuristics that encourage skepticism toward surface-level associations.
In Case 3 (IFEval), the instruction forbids using “can” and “ride” in a story about learning to bike. Vanilla violates the constraint; DPA successfully avoids both words by using alternatives such as “glided” and “pedaling”.
These cases illustrate three types of transferable knowledge: meta-strategies for premise verification, skepticism toward common misconceptions, and procedural awareness for constraint satisfaction. Crucially, none of these insights encode task-specific facts; they capture reasoning patterns that generalize across benchmarks.
4.6. Cost Analysis
Figure 5 presents the token cost breakdown (GPT-5.1 DPA runs; same four benchmarks as Table 2). Reflection and curation account for 13.5% (TruthfulQA), 15.6% (IFEval), 38.4% (StrategyQA), and 62.3% (HotpotQA) of total tokens, reflecting both task difficulty and how often System 2 is triggered. This overhead can be reduced further by triggering System 2 selectively (e.g., only on incorrect predictions or low-confidence outputs).
5. Discussion
Our experiments reveal a trade-off in memory-augmented inference with frozen backbones: persistent context can mitigate recurring failure modes, but only when it remains query-aligned, compact, and robust to noisy writes. DPA improves performance by storing a small set of reusable strategies and retrieving them selectively. In contrast, methods that inject a growing, query-agnostic state can introduce interference, with effects that are more pronounced for smaller backbones.
5.1. On the Interaction Between Memory Design and Model Scale
We observe notable performance differences between memory-augmented approaches when evaluated on models of varying scale. Specifically, on Llama-3.1-8B (Table 3), we find that while DPA maintains consistent improvements over the vanilla baseline, Dynamic Cheatsheet (DC) [49] exhibits performance below the baseline across all three benchmarks, with average accuracy dropping from 60.8% to 34.1%. This divergence is particularly interesting given that both methods maintain persistent state across episodes, suggesting that architectural choices in memory management may interact differently with backbone model scale.
We identify three potential factors that may contribute to this phenomenon, which offer insights into the design space of memory-augmented systems for resource-constrained settings:
(1) Trade-offs in context injection strategies. DC employs a design where the entire evolving cheatsheet is provided to every episode, independent of query content. While this ensures comprehensive coverage, it can result in contexts containing information spanning multiple domains. In our experiments, cheatsheets often grow to thousands of characters. For smaller models with more limited reasoning capabilities, distinguishing task-relevant information from a large heterogeneous context may present additional challenges, potentially leading to outputs that deviate from expected formats or violate task-specific constraints.
(2) Sensitivity to output format conventions. DC uses a structured response template together with a dedicated extractor. In our experiments on Llama, strict compliance with the template is not always reliable. On IFEval, the extractor occasionally fails to identify a valid final answer. This is especially important for instruction-following tasks, where small formatting deviations can materially change evaluation results. Extra headings, meta-commentary, or minor structural changes may be sufficient to break extraction. In these cases, the resulting errors tend to reflect surface-form constraints, including punctuation, casing, and length, rather than semantic content alone.
(3) Considerations for memory update policies. DC updates its cheatsheet after each episode without explicit quality filtering or credit assignment mechanisms. While this ensures comprehensive coverage of experiences, it may allow less informative or task-misaligned entries to persist. Over extended interaction streams, such entries could be repeatedly reintroduced, potentially creating feedback loops between generation patterns and memory state.
These observations suggest an important design consideration: as model scale decreases, the benefits of selective retrieval and quality-controlled memory updates may become more pronounced. More broadly, unfiltered cumulative memory appears substantially more fragile than selective retrieval with conservative curation in resource-constrained settings, particularly on tasks with strict output constraints. We emphasize that these findings reflect specific experimental conditions and implementation choices, and different design decisions or hyperparameter configurations might yield different outcomes.
5.2. When Is Memory Helpful—and When Can It Hurt?
The ablations on GPT-5.1 indicate that memory is not uniformly beneficial; its value depends on whether System 2 produces high-signal, reusable updates and whether the system prevents low-quality writes from accumulating (Table 4).
Memory helps when a small set of strategies generalizes. On some tasks, a handful of persistent rules can correct recurring blind spots. For example, on IFEval the full system admits very few memory entries over hundreds of episodes, yet achieves higher prompt-level accuracy than variants without reflection or without quality control. This supports the view that memory quality can dominate memory quantity: compact, reusable strategies can improve constraint satisfaction without large stores.
Memory hurts when updates are noisy or miscalibrated. Two failure modes are visible in the ablations. First, removing the curator gate causes the system to commit nearly all proposed updates, increasing memory size and reducing performance on three of the four GPT-5.1 benchmarks; on HotpotQA, F1 drops from 47.1 to 40.6. Second, disabling pruning degrades long-horizon performance (HotpotQA: 47.1 → 40.0), suggesting that some memories are net harmful and should be down-weighted or removed when they repeatedly correlate with errors.
System 2 is not always worth triggering. Removing the reflector improves StrategyQA (81.8 vs. 79.2), even though it hurts the other benchmarks. When the backbone is already strong and the feedback signal is coarse, reflection may generate low-signal heuristics that compete with the base model’s decision rule. Binary correctness on short answers provides limited information for meaningful reflection. In such regimes, a conservative policy may be preferable: writing less or triggering System 2 only under stronger evidence of uncertainty. In the current implementation, System 2 is triggered after every episode for simplicity and reproducibility. However, several adaptive triggering strategies could reduce the 13–62% System 2 token overhead reported in our cost analysis (Figure 5) while preserving the benefits where they matter most: (i) confidence-based triggering, where System 2 activates only when the backbone’s output probability or self-reported confidence falls below a threshold; (ii) consistency-based triggering, where multiple responses are sampled and reflection is triggered only when they disagree, signaling genuine uncertainty; and (iii) constraint-based triggering, where a lightweight verifier checks whether the output satisfies known structural constraints before committing to full reflection. Designing and evaluating such adaptive triggers is an important direction for future work.
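The three strategies could be combined into a single gating function. The sketch below is a hypothetical composition; `llm.confidence`, `llm.sample_responses`, and the constraint checker are assumed interfaces, not part of our implementation.

    def should_trigger_system2(llm, query, response, check_constraints,
                               conf_threshold: float = 0.7,
                               n_samples: int = 3) -> bool:
        # (i) Confidence-based: reflect only when the model reports low confidence.
        if llm.confidence(query, response) < conf_threshold:
            return True
        # (ii) Consistency-based: reflect when resampled answers disagree.
        samples = llm.sample_responses(query, n=n_samples)
        if len(set(samples)) > 1:
            return True
        # (iii) Constraint-based: reflect when a lightweight verifier fails.
        return not check_constraints(query, response)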
5.3. Limitations and Open Problems
Order dependence and calibration. Memory is learned sequentially and is therefore sensitive to stream order and the gate threshold. The ablations suggest that the optimal write policy can be task-dependent: for some datasets, aggressive reflection can inject noise (StrategyQA), while for others pruning and gating are essential (HotpotQA). Future work should quantify variance across multiple random seeds and explore adaptive gate calibration.
Scalability and scope. Our memory is a lightweight strategy bank, not a knowledge base: it stores procedural heuristics and error-avoidance rules, and may not help when improvements require new external facts or long-form, compositional skills. Moreover, on high-variance tasks the store can grow to hundreds of entries, raising questions about retrieval saturation, compression, and long-term stability.
Cross-task transfer and robustness. Our experiments maintain an independent memory store per benchmark, and we do not study cross-task transfer. Although some meta-level heuristics may generalize, others are likely task-specific and can induce negative transfer when retrieved out of context, especially under distribution shift. Systematically evaluating cross-task memory sharing and developing safeguards against such negative transfer remain important directions for future work.
Open-domain applicability. Our benchmarks involve well-defined, single-turn tasks with clear feedback signals. Deploying DPA in open-domain settings such as real-time dialogue or open-ended reasoning would introduce additional challenges, including sparse and noisy feedback, unbounded interaction streams, and stronger cross-domain retrieval interference. Addressing these issues would likely require more aggressive memory compression, better session-level memory management, and topic-aware organization of long-term memory. Despite these challenges, DPA’s core mechanisms (selective retrieval, conservative curation, credit-based pruning) are domain-agnostic by design and should transfer to open-domain settings with appropriate adaptations.
Security and privacy considerations. Memory-augmented LLM agents introduce security and privacy concerns that are distinct from stateless models. An adversary may influence the input stream or feedback signal to inject persistent harmful memories, creating an attack surface related to backdoor, prompt-injection, and data-poisoning risks [50,51]. In addition, stored reasoning artifacts may inadvertently expose sensitive information from prior interactions. Mitigating these risks requires stronger memory access control, storage-layer filtering, and privacy-preserving memory management.
6. Conclusions
We presented the Dual-Process Agent (DPA), a framework for continual context refinement that keeps the backbone LLM frozen while evolving an explicit, editable long-term memory. DPA couples a fast System 1 that answers each instance with retrieval-augmented context and a conservative System 2 that audits outcomes, assigns credit to retrieved memories, and commits curated, localized edits to memory.
Across six benchmarks, DPA improves performance over static prompting and competitive baselines by accumulating reusable, deployment-time memories that refine future context selection. These findings also suggest several concrete directions for future work.
Adaptive gate calibration. The current curator gate uses a fixed commit threshold, but the optimal write policy is task-dependent. Learning the gate threshold from memory state or task difficulty could make the system more aggressive when reflection is consistently useful and more conservative when it risks injecting noise.
Memory compression and cross-task transfer. Our experiments maintain independent memory stores per benchmark. Compressing task-specific entries into higher-level meta-strategies could support knowledge sharing across tasks while reducing redundancy and negative transfer.
Selective System 2 triggering. Triggering reflection after every episode incurs unnecessary overhead when the model is already confident and correct. Confidence-based or consistency-based triggers could reduce computational cost while preserving the benefits of reflection where they matter most.
Integration with parameter-efficient fine-tuning. DPA currently keeps the backbone frozen. Combining memory evolution with lightweight parameter updates such as LoRA [52] could allow persistent high-confidence patterns to be internalized into model weights while preserving the flexibility of external memory.
Together, these directions aim to make continual context refinement reliably beneficial across a wider range of backbones, tasks, and deployment regimes.