Article

A Novel Dual-Loop Causality-Traceable Retrieval Framework for Long-Horizon Conversational Agents

1 Department of Computer Science and Information Engineering, National Chiayi University, Chiayi 600355, Taiwan
2 Department of Electrical Engineering, National Cheng Kung University, Tainan 701401, Taiwan
3 School of Software and Big Data, Changzhou College of Information Technology, No. 22, Mingxin Middle Road, Changzhou 213164, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(11), 2373; https://doi.org/10.3390/electronics15112373
Submission received: 30 March 2026 / Revised: 11 May 2026 / Accepted: 12 May 2026 / Published: 1 June 2026
(This article belongs to the Special Issue AI-Driven Frameworks for Human–Computer Interaction)

Abstract

In long-horizon multi-party conversations, human-centric AI agents face a persistent structural problem: similarity-based retrieval may fail to reconnect semantically dispersed fragments of the same evolving event. This problem severely weakens causal continuity and multi-hop context recovery. To improve attribution trust and reduce structural erasure, we propose MemLoom, a dual-loop causality-traceable retrieval framework that organizes conversational history as an event memory graph. MemLoom decouples latency-sensitive online interaction from off-peak structural curation through online event formation, sentence-level buffering, asynchronous neuro-symbolic graph synthesis, and bounded dual-stream retrieval. Evaluations across QMSum, LoCoMo, and the synthetic causal diagnostic suite (SCDS) support the structural utility of MemLoom. For LoCoMo, under our unified local evaluation setup, MemLoom shows favorable temporal and multi-hop reasoning results (J = 65.77 and 58.14) relative to contemporary agentic baselines, such as Mem0, Zep, and A-Mem. For SCDS, within a controlled diagnostic setting, it recovers demanded causal chains more reliably than GraphRAG (SCR = 0.72 vs. 0.35) and maintains stronger answer-level auditability (AA = 0.80 vs. 0.50). This is achieved with a bounded online P95 latency of 1.67 s. These results indicate that asynchronous dual-loop stewardship has practical value for causality-traceable, event-centric conversational memory in multi-party settings.

1. Introduction

The rapid evolution of embodied AI drives a shift toward persistent agents that operate over long horizons [1] in real-world multi-party settings [2]. However, this transition exposes a critical gap in conventional retrieval approaches regarding causal structure preservation. Because traditional systems are optimized for static knowledge snapshots [3], they struggle to capture the causal dependencies and temporal dynamics of continuous interactions [4]. To address highly sparse information, merely expanding the retrieval volume or context window is insufficient. Excessive retrieval introduces hard negatives and exacerbates the lost-in-the-middle positional bias [5,6], which necessitates causality-aware structured memory management rather than flat text accumulation. Beyond these contextual limitations, autonomous agents face the structural complexity of multi-party interaction (MPI) topologies [7]. Unlike static documents, social environments continuously generate temporally distributed actions, updates, and causally linked state transitions. When current systems treat these dynamic memories as a flat cache [3,8], they often lack explicit mechanisms for preserving the underlying causal structure and historical state evolution.
From the perspective of causality-traceable conversational memory, the structural limitations of current memory augmentations can be summarized into three design risks. DR1 is topological blindness, in which similarity-driven retrieval fails to reconnect semantically dispersed but causally linked evidence, thereby inducing causally misaligned attributions [9,10]. For instance, when a later utterance has stronger lexical overlap with a query than the true earlier cause, conventional unstructured retrieval may return the more semantically similar statement while missing the actual causal root. DR2 is structural erasure, in which overwrite-oriented updates and coarse summary pipelines collapse temporally distinct historical states into a single compressed representation, thereby weakening the auditability of intermediate transitions and historical state evolution. DR3 is the synchronous maintenance bottleneck, in which graph construction, clustering, versioning, and summarization remain coupled to the live serving path and thus scale poorly with the interaction history [8]. As will be elaborated in Section 2, these risks are especially pronounced in multi-party interactions and should therefore be treated as interaction-specific design risks rather than as universal flaws of all prior systems.
To address these interaction-specific limitations, we propose MemLoom, which models memory as a curated event memory graph. At a high level, MemLoom combines three coordinated ideas: a dual-loop publish–subscribe (DLPS) architecture that decouples latency-sensitive serving from heavyweight structural curation, contract-bound event entities that preserve versioned structural boundaries across long-horizon state evolution, and a bounded dual-stream retrieval (BDSR) mechanism that couples topology-aware traversal with sentence-level grounding. Through this design, MemLoom seeks to support interactive responsiveness while preserving causal and temporal traceability under multi-party conversational settings.
As shown in Figure 1, the two loops support each other through atomic publication, thereby resolving the tension between scalability and consistency. This paper presents MemLoom, a steerable, event-centric memory architecture with four main contributions, listed as follows:
  • A dual-loop publish–subscribe (DLPS) architecture is proposed to practically decouple latency-sensitive online serving from consistency-critical structural maintenance. By serving immutable snapshots on the online path while shifting heavyweight graph curation to off-peak stewardship cycles, MemLoom maintains a bounded answer-serving path without exposing user-facing latency directly to asynchronous maintenance cost.
  • A neuro-symbolic graph synthesis (NSGS) mechanism is introduced to construct a dual-track topology that transforms raw utterances into contract-validated event entities through a lifecycle finite-state machine. Within this design, the topic track preserves semantic and chronological neighborhoods, while the logic track supports bounded recovery of non-local dependencies such as causal continuation and state evolution.
  • A bounded dual-stream retrieval (BDSR) mechanism is proposed to combine topology-aware traversal with sentence-level grounding. By coupling structured event-level navigation with raw evidence pointers, BDSR is designed to mitigate topological blindness while preserving answer-level traceability to original dialogue turns.
  • The synthetic causal diagnostic suite (SCDS) is constructed as a controlled diagnostic instrument for isolating causal rupture under targeted ground truth for complex causal chains. Based on this suite, diagnostic metrics for causal faithfulness can be established, thereby providing deeper diagnostic insights into specific design risks than traditional recall metrics.
The remainder of this paper is organized to make the above design logic explicit. Section 2 revisits recent memory-augmented, agentic-memory, and graph-augmented retrieval studies through the lens of topology awareness, historical traceability, asynchronous curation, and event versioning. Section 3 then presents the dual-loop architecture and its event-graph stewardship mechanism. The subsequent sections describe the evaluation protocol, experimental results, ablation findings, and limitations.

2. Related Works

To address the limitations of unstructured RAG’s flat vector space in maintaining a traceable causal history in dynamic interactions, recent studies have explored several complementary directions, including long-context access, memory construction, graph-augmented retrieval, and agentic memory organization. For instance, benchmarks such as LoCoMo [4] show that very long-term conversational memory remains difficult even when longer contexts or retrieval mechanisms are available. However, long-context settings still suffer from the lost-in-the-middle effect [6], while simply increasing retrieved history introduces a practical trade-off between context coverage and computational efficiency. Consequently, external memory organization remains necessary for persistent agents. From the perspective of memory construction, SeCom [11] and RMM [12] show that long-term dialogue memory benefits from both suitable memory granularity and adaptive retrieval refinement across multiple abstraction levels. Mem0 [13] and THEANINE [14] further highlight the practical value of compact online updating and timeline-aware organization. In parallel, JERR [15] demonstrates that semantic-similarity ranking alone is often insufficient for long-horizon reasoning. Although these approaches improve retrieval quality, memory compression, timeline-aware organization, or query-time reasoning depth, their primary optimization targets remain local access, compact updating, or single-query reasoning. They therefore do not explicitly provide a unified architectural mechanism for preserving causally linked event versions, maintaining auditable historical state transitions, and bounding online serving latency under continuous multi-party interaction.
Beyond isolated RAG pipelines, Wang et al. [16] review networked agent systems, which emphasize multi-agent cooperation strategies and modular architectures that explicitly integrate planning, memory, action, and interaction. This broader agent-centric view is important because it shifts the design question from whether memory exists to how memory is structurally maintained under multi-party interaction. However, the agent-systems literature usually treats memory as a high-level module and offers less guidance on how to preserve causal continuity, event-versioned state evolution, and bounded maintenance cost within a concrete long-horizon conversational memory architecture.
The integration of knowledge graphs (KGs) with RAG presents an important direction for supporting causal-context and multi-hop reasoning. For instance, CausalRAG [9] mitigates semantic-similarity bias by tracing causal paths, while GraphRAG [8] was primarily introduced for corpus-level global sensemaking via community summaries. Complementary to these directions, recent graph-structured retrieval methods such as query-aware KG fusion [17] and Associa [18] strengthen evidence selection beyond flat semantic matching. Such KG-driven retrieval can help mitigate the limitation of similarity-only retrieval that tends to return isolated text segments, thereby aligning with the need for multi-hop and causally traceable retrieval in long-horizon queries. However, these graph-augmented methods are primarily designed for relatively stable corpora, where the graph serves as a retrieval scaffold over already consolidated knowledge. More critically, corpus-level graph construction and community summarization usually assume that the underlying knowledge units are relatively stable before graph construction. Long-horizon multi-party conversations violate this assumption because event boundaries, participant states, and causal dependencies continue to evolve during interaction. Consequently, graph-augmented retrieval alone does not guarantee event-versioned state preservation or latency-bounded online serving unless graph maintenance is explicitly decoupled from the live interaction path.
From a graph-theoretic perspective, this limitation can be further formulated through the distinction between community organization, temporal reachability, and heterophilic bridging. Classical modularity theory explains why relational evidence can be organized into locally dense semantic communities [19], while temporal-network analysis shows that edge activation order and historical timing affect reachability and cannot be fully preserved by a static aggregated graph [20]. In addition, recent heterophilic graph learning studies emphasize that important dependencies may connect dissimilar nodes across community boundaries rather than only similar nodes within the same neighborhood [21]. Based on this background formulation, MemLoom treats conversational memory as a versioned directed, labeled, and temporal event graph, rather than as a flat cache or a static corpus-level graph summary. This formulation motivates the separation between a homophilic topic graph and a logic-aware relation graph: the former preserves local semantic and chronological neighborhoods, whereas the latter retains non-local typed dependencies, such as causal continuation and state succession, that may be weakened by community-only summarization.
When memory is treated not only as a retrieval cache but also as an evolving structure, preserving temporally traceable memory evolution becomes a critical challenge. A-Mem [22] introduces a Zettelkasten-inspired dynamic linking mechanism, whereas CEO [23], hierarchical event schema induction [24], and synergetic event understanding [25] move toward event-oriented abstraction and consolidation. At the system level, MemoryOS [26] outlines a broader hierarchical memory-management blueprint, while Amory [27] moves memory formation closer to offline narrative consolidation. Nevertheless, even these structured-memory directions still leave open how to jointly preserve immutable historical versions, recover causally dispersed evidence, and keep user-facing serving latency bounded in long-horizon multi-party settings. In other words, these systems demonstrate the importance of memory organization, but they do not make the separation between provisional online evidence writing and off-peak structural commitment a primary architectural contract. To make this critical comparison explicit, Table 1 summarizes whether representative paradigms treat topology awareness, history tracking, asynchronous curation and serving, and event versioning as primary architectural objectives.
Taken together, prior work improves long-horizon memory along different axes, but few existing approaches explicitly and jointly address four requirements that become coupled in long-horizon multi-party conversations: topology-aware causal retrieval, preservation of temporally distinct historical states, asynchronous separation between structural curation and bounded online serving, and event-versioned memory governance. This gap motivates MemLoom. Rather than treating memory as a flat cache, a purely online profile store, or a corpus-level graph summary, MemLoom organizes interaction history as a versioned event memory graph and separates latency-sensitive serving from off-peak structural stewardship. The contribution of MemLoom therefore lies not simply in using graph-based retrieval, but in combining event-versioned memory governance, bounded topology-aware retrieval, sentence-level provenance grounding, and snapshot-based asynchronous curation within one deployable architecture.

3. Methodology

3.1. System Overview of MemLoom

For readability, Figure 2 should be read from left to right as three coordinated paths: online interaction, online event formation, and off-peak stewardship.
To construct a conversational AI system with architectural-level disentanglement, MemLoom is built on a dual-loop publish–subscribe architecture. MemLoom structurally decouples real-time responsiveness from long-term consistency by integrating three coordinated subsystems: the online interaction-loop subscriber (OILS), the online event formation system (OEFS), and the off-peak stewardship loop publisher (OSLP), as shown in Figure 2. Specifically, OILS serves as the subscription endpoint. To avoid blocking caused by high-frequency writes, this loop primarily reads from the immutable snapshots ($S_t$) published by the off-peak layer. Meanwhile, it uses sentence-level memory to compensate in real time for the newest interactions that have not yet been structured. The OEFS pipeline is unidirectional and lightweight, which keeps the write path short and clearly separated from the read path. The OEFS encapsulates raw dialogues into provisional events and appends them to a buffer, thereby deferring heavyweight computations to sustain smooth interactive responsiveness.
In parallel, OSLP absorbs most computational burdens by deferring congestion-inducing workloads to off-peak stewardship cycles. During asynchronous maintenance cycles, the off-peak steward processes accumulated provisional events to reconstruct high-order connectivity, including entity refinement and high-cost causal/temporal relations. To prevent online state inconsistencies, the newly curated graph is released via an atomic publish protocol, producing a snapshot ($S_{t+1}$). This guarantees monotonic consistency, allowing seamless integration of structured knowledge without interrupting interactions.
To systematically block structural failures, the MemLoom data lifecycle is strictly regulated by four core architectural invariants ($I_1$ to $I_4$). First, the semantic router enforces a bounded budget gate ($I_1$) to prevent resource exhaustion from invalid utterances, directly addressing the synchronous maintenance bottleneck, i.e., DR3. Next, the OEFS adheres to a provisional-first lifecycle contract ($I_2$). By restricting online writes to revisable drafts, it prevents transient information from mutating authoritative memory, effectively mitigating structural erasure, i.e., DR2. Furthermore, the OSLP executes batch global consistency ($I_3$). Deferring intensive relation construction to off-peak batches preserves logical integrity and helps mitigate DR1. Finally, the system employs auditable dual-stream grounding ($I_4$). By combining macro-level graph traversal with micro-level evidence retention, this retrieval mechanism mitigates the semantic blind spots associated with DR1 while preserving temporally traceable evidence through end-to-end provenance auditing to reduce DR2.
Operationally, the practical decoupling effect of DLPS is realized through three coordinated mechanisms: immutable snapshot serving for stable online reads, off-peak stewardship for heavyweight structural curation, and a bounded answer-serving path that isolates user-facing latency from asynchronous maintenance cost. In this design, the online loop remains read-light and latency-sensitive, whereas the stewardship loop absorbs topology repair, validation, and publication overhead outside the active serving window.

3.2. Semantic Router: Single-Pass Rewrite–Gate–Route

Rather than relying only on conventional text cleaning, MemLoom employs an LLM-powered context-aware semantic router with a single-pass rewrite–gate–route mechanism. The router maps implicit multi-party ambiguities into structured representations and transforms natural language into deterministic control signals for downstream event construction. To minimize interaction latency, the system adopts a single-pass inference strategy. Within this strategy, a sliding window captures short-horizon referential dependencies and pragmatic continuity. This routing protocol is realized as the mapping function in Equation (1):
$$f_R(u_t, H_{t-5:t-1}, K) \xrightarrow{\text{LLM}} \langle u'_t, I, T \rangle. \qquad (1)$$
where $u_t$ is the current utterance, $H_{t-5:t-1}$ is a length-5 historical window, and $K$ contains environmental metadata. The output comprises a rewritten text $u'_t$, a retrieval intent $I$, and topic labels $T$.
To establish a reliable substrate for long-horizon multi-party interactions, the contextual rewriting module for $u_t$ executes a three-step transformation to resolve conversational ambiguities. First, an explicit mapping anchors first-person pronouns to the current speaker ID, which mitigates identity drift across long interaction histories. Then, it resolves abstract references by replacing implicit pronouns with recent concrete noun phrases, thereby stabilizing the semantic embedding space. Finally, relative time expressions are normalized into absolute date markers, which provide fixed temporal coordinates for downstream causal chains.
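As a rough illustration of the first and third rewriting steps, the sketch below uses deterministic rules in place of the LLM-powered router described above. The function name `rewrite`, the `_RELATIVE_DAYS` table, and the ISO date format are assumptions made for illustration only. The second step (replacing implicit pronouns with noun phrases) is omitted because it needs dialogue context that simple rules cannot supply.

```python
import re
from datetime import date, timedelta

# Hypothetical table of relative time expressions and their day offsets.
_RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def rewrite(utterance: str, speaker_id: str, today: date) -> str:
    """Toy stand-in for the contextual rewriting of one utterance."""
    # Step 1: anchor the first-person pronoun "I" to the current speaker ID.
    text = re.sub(r"\bI\b", speaker_id, utterance)

    # Step 3: normalize relative time expressions into absolute ISO dates.
    def to_absolute(match: re.Match) -> str:
        offset = _RELATIVE_DAYS[match.group(0).lower()]
        return (today + timedelta(days=offset)).isoformat()

    pattern = r"\b(" + "|".join(_RELATIVE_DAYS) + r")\b"
    return re.sub(pattern, to_absolute, text, flags=re.IGNORECASE)
```

For example, `rewrite("I will ship it tomorrow", "alice", date(2026, 3, 30))` yields `"alice will ship it 2026-03-31"`.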

3.3. Online Incremental Event Formation System

As illustrated in Figure 3, the structured signals produced by the preprocessing layer are fed into the online incremental event formation system (OEFS), which transforms discrete conversational streams into temporal event nodes in real time. Conventional RAG architectures typically compress input streams into irreversible semantic abstractions, which compromises the integrity of the contextual window and makes provenance evaluation difficult. To overcome this bottleneck, the design of OEFS adheres to a precision-first, provenance-preserving “write-as-evidence” principle. This strategy aims to maximize semantic purity within individual events, mitigating the topic blackhole effect prevalent in long-horizon dialogues. Specifically, for each preprocessed input memory unit $m_t$, which comprises topic tags $T$ and a text embedding vector $v_m$ generated via OpenAI’s text-embedding-3-small model, the builder explicitly avoids overwriting the existing event representations. Instead, $m_t$ is appended as immutable evidence into the pending memory array of the target event $E_j$. At the same time, the aggregated event centroid vector ($c_j$) and internal source pointers are updated so that every event remains aligned with its original context. By providing a strict alignment basis between events and their source evidence, this write-as-evidence paradigm supports downstream operations, including multi-hop retrieval, deduplication, and shadowing, thereby reducing the system’s reliance on purely semantic heuristics.
To facilitate event attribution under low-latency constraints, MemLoom executes a hierarchically dispatched routing algorithm (HDRA), which integrates pragmatics with non-linear semantic metrics. First, it enforces a candidate pool constraint, bounding the search for candidate events within the set of online joinable states ($S_{\text{join}}$, formally defined in Section 3.4). As a result, the system structurally filters potential cross-day misalignments and significantly shrinks the search space. Concurrently, for each updated or created event $E_j$, the constructor maintains its internal statistical feature vector $v_{\text{stats}} = [n_j, p_j, V_j]$, representing the evidence count $n_j$, the normalized participant distribution $p_j$, and the token mass or informational volume $V_j$, respectively. These real-time statistics serve as foundational evidence to support subsequent lifecycle stewardship (detailed in Section 3.4).
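The append-only write path and the per-event statistics can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `Event`, the running-mean centroid update, and the whitespace token count are not taken from the paper, which only states that the centroid is an aggregate and that $n_j$, $p_j$, and $V_j$ are maintained.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    pending: list = field(default_factory=list)         # append-only evidence units
    centroid: list = field(default_factory=list)        # c_j
    speakers: Counter = field(default_factory=Counter)  # raw counts behind p_j
    token_mass: int = 0                                 # V_j

    def append_evidence(self, text: str, speaker: str, embedding: list) -> None:
        n = len(self.pending)
        # Evidence is appended, never overwritten.
        self.pending.append((speaker, text))
        if n == 0:
            self.centroid = list(embedding)
        else:
            # Assumed aggregation: incremental mean, c <- c + (v - c) / (n + 1).
            self.centroid = [c + (v - c) / (n + 1)
                             for c, v in zip(self.centroid, embedding)]
        self.speakers[speaker] += 1
        # Assumed proxy for token mass: whitespace-separated tokens.
        self.token_mass += len(text.split())

    def stats(self):
        """Return (n_j, p_j, V_j) for the current pending evidence."""
        n = len(self.pending)
        dist = {s: k / n for s, k in self.speakers.items()}
        return n, dist, self.token_mass
```

Keeping the raw speaker counts rather than the normalized distribution makes each append a constant-time update, and the distribution is derived only when the statistics are read.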
For candidate memory units that successfully pass this initial filter, HDRA evaluates their compatibility by calculating a final matching score $S_j$. Instead of performing redundant vectorization, MemLoom directly computes the semantic similarity between the preprocessed embedding $v_m$ and the maintained event centroid $c_j$. As defined in Algorithm 1, this routing step balances semantic similarity with temporal decay. Let $\Delta h$ denote the temporal distance between the current input and the last update of candidate $E_j$. To tolerate subtle semantic fluctuations and suppress background noise, HDRA integrates a fixed non-linear sigmoid function, $g(s) = 1 / (1 + \exp(-10(s - 0.5)))$. Additionally, it applies a short-window gain $B_{\text{sw}} \in \{0, 1\}$ controlled by the parameter $\gamma$, which explicitly rewards temporally proximate inputs to sustain the continuity of reasoning chains.
To configure the routing objective under bounded online latency, the HDRA coefficients were selected through a coarse-to-fine empirical search on a held-out validation subset. The search examined candidate settings for the semantic weight $\alpha$, temporal decay weight $\beta$, decay rate $\lambda$, short-window gain $\gamma$, and transition threshold $\tau_{\text{new}}$ under the joint criteria of semantic fidelity, temporal continuity, and latency stability. The final routing defaults were fixed at $\alpha = 0.9$, $\beta = 0.1$, $\lambda = 20$, $\gamma = 0.1$, and $\tau_{\text{new}} = 0.30$. This validation-stage tuning used an independent held-out SCDS validation split of 100 diagnostic queries that was reserved exclusively for parameter selection and never reused in the final reported evaluations. All reported SCDS benchmark results were produced on a disjoint final evaluation split of 353 queries, and the external QMSum and LoCoMo benchmarks remained entirely untouched during tuning. Once selected, the routing defaults were frozen and kept fixed throughout all final experiments. These values should therefore be interpreted as empirically selected operational defaults for the tested domains, not as universally optimal settings. By comparing the maximum candidate score against $\tau_{\text{new}}$, the system triggers either a NEW_EVENT operation (when the maximum score falls below the threshold) or a CONTINUE operation (otherwise), thereby preserving topic coherence while remaining adaptive to boundary shifts in dynamic conversations.
Algorithm 1 HDRA pseudocode
Require: Current memory unit $m_t$ (vector $v_m$), pool of active events $S_{\text{join}}$
Ensure: Routing decision and state update
  • Step 1: Candidate Filtering
      Filter candidates in $S_{\text{join}}$ based on temporal boundaries
  • Step 2: Non-linear Scoring
      for all candidate $E_j$ in $S_{\text{join}}$ do
          $s \leftarrow \cos(v_m, c_j)$
          $S_j \leftarrow \min\{1.0,\ \alpha \cdot g(s) + \beta \cdot \exp(-\lambda \Delta h) + \gamma \cdot B_{\text{sw}}\}$
      end for
  • Step 3: Routing Decision and State Update
      if $\max(S_j) < \tau_{\text{new}}$ then
          NEW_EVENT($m_t$)
      else
          Select optimal target $E_{j^*} \leftarrow \arg\max_{E_j \in S_{\text{join}}} S_j$
          Append $m_t$ to $E_{j^*}$.pending and update $c_{j^*}$ and $v_{\text{stats}}$
      end if
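The scoring and decision steps of Algorithm 1 can be sketched in Python as follows. The coefficient values are the defaults reported in the text. The function names, the candidate tuple layout, and the handling of an empty candidate pool are assumptions made for illustration, as is the negative sign in the decay exponent, which is inferred from the description of the term as a temporal decay. The unit of `delta_h` is whatever unit the framework uses for $\Delta h$; the sketch does not fix it. Candidates are assumed to have already passed the temporal filter of Step 1.

```python
import math

# Routing defaults reported in the text.
ALPHA, BETA, LAMBDA, GAMMA, TAU_NEW = 0.9, 0.1, 20.0, 0.1, 0.30


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gate(s):
    """Fixed sigmoid g(s) = 1 / (1 + exp(-10 (s - 0.5)))."""
    return 1.0 / (1.0 + math.exp(-10.0 * (s - 0.5)))


def score(s, delta_h, short_window):
    """Matching score S_j, clipped at 1.0."""
    bonus = 1.0 if short_window else 0.0  # B_sw in {0, 1}
    return min(1.0, ALPHA * gate(s) + BETA * math.exp(-LAMBDA * delta_h) + GAMMA * bonus)


def route(v_m, candidates):
    """candidates: iterable of (event_id, centroid, delta_h, short_window)."""
    best_id, best_score = None, -1.0
    for event_id, centroid, delta_h, short_window in candidates:
        s_j = score(cosine(v_m, centroid), delta_h, short_window)
        if s_j > best_score:
            best_id, best_score = event_id, s_j
    # Assumption: an empty pool is treated like a below-threshold maximum.
    if best_id is None or best_score < TAU_NEW:
        return ("NEW_EVENT", None)
    return ("CONTINUE", best_id)
```

With these defaults the semantic term dominates: a cosine similarity of 0 contributes only about 0.006, so a stale, dissimilar candidate falls well below $\tau_{\text{new}} = 0.30$ and a new event is opened.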

3.4. Event Lifecycle Realization

In the present framework, an event is treated as a bounded, versionable, and evidence-backed conversational memory unit rather than as a claim of a universally correct real-world event ontology. This definition is intentionally operational: an event captures a locally coherent interaction segment under bounded temporal and semantic consistency. To ensure auditable memory, MemLoom models each event as a finite-state machine (FSM) regulated by the state set $S_{\text{FSM}} = \{\text{provisional}, \text{active}, \text{closed}, \text{archived}\}$. As shown in Figure 4, the proposed FSM defines lifecycle transitions and storage states under a strict commitment contract.
As illustrated by Stage 1 in Figure 4, events in the provisional state are isolated from high-order relationships, such as causal and temporal edges, to filter premature structural noise. Promotion eligibility is subsequently determined by a maturity function $M(E_j)$, which evaluates the information density of these events.
$$M(E_j) = w_1 \cdot n_j + w_2 \cdot H(p_j) + w_3 \cdot \log(1 + V_j). \qquad (2)$$
An event is promoted to the active state when the condition $M(E_j) \geq \tau_{\text{maturity}}$ is satisfied, where $n_j$ denotes the number of pending evidence units, $H(p_j)$ represents the participant entropy computed from the normalized participant distribution $p_j$ within $E_j$, and $V_j$ indicates the token mass or informational volume of the pending evidence. As defined in Equation (2), the coefficients $w_1$, $w_2$, and $w_3$ act as governance parameters to balance three complementary maturity signals: evidence accumulation, participant diversity, and informational volume, respectively. This decoupling prevents event promotion from being dominated by a single proxy; for example, evidence count alone may over-promote repetitive fragments, token mass alone may favor verbose but structurally weak spans, and participant entropy alone may overvalue interaction breadth without sufficient evidence mass.
In the current implementation, these maturity coefficients are treated as fixed governance defaults, which were selected based on the held-out SCDS validation split and subsequently frozen for all final experiments. They are not intended to be universally optimal; instead, they regulate the lifecycle trade-off between the premature promotion of noisy fragments and the excessive delay in making emerging events available to the active graph. Therefore, $\tau_{\text{maturity}}$ should be interpreted as a lifecycle governance threshold rather than a theoretically optimal constant. If $\tau_{\text{maturity}}$ is set too low, provisional events are promoted prematurely, thereby increasing graph inflation and structural noise. Conversely, an excessively high $\tau_{\text{maturity}}$ delays event availability, causing freshness lag and deferred structural integration.
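The promotion rule can be illustrated with a short sketch. The weights and threshold below are placeholders, since this section does not report the frozen values, and the natural logarithm is assumed for both the entropy and the volume term.

```python
import math


def participant_entropy(dist):
    """Shannon entropy H(p_j) of a normalized participant distribution (natural log assumed)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def maturity(n, dist, token_mass, w=(0.5, 1.0, 0.5)):
    """M(E_j) = w1 * n_j + w2 * H(p_j) + w3 * log(1 + V_j); the weights are hypothetical."""
    w1, w2, w3 = w
    return w1 * n + w2 * participant_entropy(dist) + w3 * math.log(1 + token_mass)


def should_promote(n, dist, token_mass, tau=3.0, w=(0.5, 1.0, 0.5)):
    """Promote provisional -> active when M(E_j) >= tau_maturity (tau is hypothetical)."""
    return maturity(n, dist, token_mass, w) >= tau
```

A single-speaker fragment has zero participant entropy, so under this rule it can only be promoted through evidence count and token mass, which matches the intent that no single proxy should dominate.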
When multiple simultaneous or overlapped events are present, MemLoom adopts an intentionally asymmetric handling strategy. On the online path, OEFS applies a single primary event assignment under bounded latency. It assigns the incoming memory unit to the most compatible provisional or active event, ensuring that online routing remains lightweight and auditable. The higher-order repair of simultaneous, interleaved, or partially overlapping event structures is then deferred to off-peak stewardship. During this off-peak phase, timeline repair and relation synthesis can recover cross-event continuity through typed temporal, state-transition, and event-continuity links, which are formalized later in the NSGS subsection. This design embodies a deliberate trade-off between real-time responsiveness and structural recovery, rather than operating under the assumption that overlaps can always be perfectly resolved online.
As shown by Stage 2 in Figure 4, once an event is promoted to the active state via dual-track topological skeleton construction (DTKC), it becomes eligible for full topological reasoning. As shown by Stage 3 in Figure 4, to adapt to episodic rhythms, an event transitions to the closed state when a contextual boundary condition, e.g., task completion or session shift, is satisfied. Formally, the transition occurs when $\Phi_{\text{close}}(E_j, C) = \text{True}$, where $\Phi_{\text{close}}(\cdot)$ is a Boolean boundary predicate and $C$ denotes the contextual cues used for closure. Closed events become read-only nodes that preserve structural connectivity while prohibiting new insertions, effectively fixing the historical context.
Finally, transitioning to the archived state, as illustrated by Stage 4 in Figure 4, represents a strategic shift toward resource optimization. Regulated by a resource-aware retention policy rather than rigid chronological rules, events are moved to cold storage when exceeding their relevance horizon or when memory footprints necessitate migration. This mechanism isolates the active search space from the historical repository, helping mitigate latency inflation and stabilize long-term write efficiency during system maintenance windows.
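The four-state lifecycle can be summarized as a small transition table. The sketch assumes that only the forward transitions described above are legal (provisional to active, active to closed, closed to archived) and that only provisional and active events accept online writes. The names `State`, `ALLOWED`, and `JOINABLE` are hypothetical.

```python
from enum import Enum


class State(Enum):
    PROVISIONAL = "provisional"
    ACTIVE = "active"
    CLOSED = "closed"
    ARCHIVED = "archived"


# Assumed forward-only lifecycle; other transitions are rejected.
ALLOWED = {
    State.PROVISIONAL: {State.ACTIVE},
    State.ACTIVE: {State.CLOSED},
    State.CLOSED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}

# States whose events may still receive new evidence on the online path.
JOINABLE = {State.PROVISIONAL, State.ACTIVE}


def transition(current: State, target: State) -> State:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def accepts_writes(state: State) -> bool:
    return state in JOINABLE
```

Rejecting backward transitions is what makes closed and archived events safe to treat as fixed historical context.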

3.5. Off-Peak Stewardship Loop Publisher (OSLP)

The off-peak stewardship loop publisher (OSLP) operates as a background entity that asynchronously performs memory curation during the system’s idle cycles. Its role is analogous to that of a steward who systematically organizes a household while the owner rests. This design architecturally reconciles the inherent tension between low-latency online interactions and strict structural consistency in long-horizon conversational agents. Because the OEFS conforms to an append-first strategy, the off-peak steward assumes the critical responsibility of reconstructing the accumulated raw fragments into a causal-temporal graph snapshot. This snapshot is designed to preserve causal traceability and temporal continuity.
As depicted in Figure 5, the stewardship pipeline orchestrates this transformation through a sequence of rigorous phases under multi-version concurrency control (MVCC) isolation. As illustrated by Phase 1 in Figure 5, fragmented inputs first undergo physical defragmentation and normalization. During this phase, retired nodes are preserved in a lineage registry $\mathcal{L}$ to maintain directed acyclic graph (DAG) provenance. Specifically, $\mathcal{L}$ records node-level lineage links for provenance-preserving roll-forward and audit trails.
Subsequently, as shown by Phase 2 in Figure 5, the core neuro-symbolic graph synthesis (NSGS) implements topological decoupling. Within this mechanism, the first track constructs a homophilic topic graph via non-LLM approximate nearest neighbor (ANN) search, and the second track injects symbolic hooks to structurally bridge heterophilic reasoning paths. Following this, as depicted by Phases 3 and 4 in Figure 5, the system executes cross-cluster reasoning and contract validation. These operations are bounded by a deterministic inference budget $B_{\text{inference}}$, which enforces a per-maintenance-cycle upper bound on large language model (LLM) inference and validation costs. Ultimately, the OSLP releases the structural updates via an atomic publish protocol, which provides the online loop with a stable, curated event memory graph $G_{\text{curated}}$ for subsequent retrievals.

3.5.1. Snapshot Consistency Contract

Based on MVCC principles, the architecture introduces a snapshot consistency contract. To prevent evidence tearing (i.e., dirty reads where agents might infer from partially constructed graphs), the online layer enforces a single-version read principle. Under the guarantee of snapshot isolation, the system exclusively accesses the preceding snapshot $S_{t-1}$ during the active period (the online serving window between two stewardship cycles). The subsequent state transition to the newly curated $t$-th snapshot $S_t$ is executed via an atomic publish protocol, which switches the global pointer only after passing all integrity checks. This mechanism supports monotonic consistency and helps mitigate DR3.
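The single-version read principle and the atomic pointer switch can be illustrated with a small in-process sketch. This is only an illustration under assumptions: `SnapshotStore`, its method names, and the representation of a snapshot as a plain dictionary are hypothetical, and a real deployment would publish across processes or storage tiers rather than within one Python object.

```python
import threading


class SnapshotStore:
    """Single-version read path with an atomic pointer switch (illustrative)."""

    def __init__(self, initial_snapshot):
        self._current = initial_snapshot
        self._lock = threading.Lock()

    def read(self):
        # Readers only ever see a fully published snapshot object.
        return self._current

    def publish(self, candidate, integrity_checks):
        # Every check runs on the candidate before it can become visible.
        if not all(check(candidate) for check in integrity_checks):
            return False  # keep serving the previous snapshot
        with self._lock:  # serialize concurrent publishers
            self._current = candidate
        return True
```

A failed check leaves the previous snapshot in place, so the online loop never observes a partially constructed graph.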

3.5.2. Global Policy Definition

To ensure reproducible memory management, the hyperparameters used to regulate graph topology evolution are encapsulated into a global synthesis policy vector:
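$$\Pi_{\text{syn}} = \left\langle \theta_s,\ k_{\text{sem}},\ k_t,\ \Delta_d,\ \delta_t,\ B_u,\ B_{\text{bridge}},\ K_{\text{guard}},\ V_{\text{ver}} \right\rangle. \tag{3}$$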
This policy constrains the steward through explicit transformation rules and budgets, where $B_u$ bounds the per-cycle LLM inference budget, $B_{\text{bridge}}$ bounds the budget allocated to cross-cluster bridging/verification, $K_{\text{guard}}$ specifies the guard/validation budget used in contract checking, and $V_{\text{ver}}$ denotes the versioned verification rule-set applied by integrity checks. Semantically, $\theta_s$ acts as a cosine similarity margin to filter low-relevance edges, while $k_{\text{sem}}$ bounds the ANN search breadth to prevent excessive semantic hub expansion. Temporally, the temporal k-NN parameter ($k_t$), the backtracking window ($\Delta_d$), and the local temporal tolerance ($\delta_t$) restrict the merging search space to physically adjacent events, helping timeline repairs remain consistent with chronological order. The temporal components of Equation (3) are operationalized directly in Algorithm 2, where $\Pi_{\text{syn}}.\Delta_d$ bounds the merge backtracking scope and $\Pi_{\text{syn}}.\delta_t$ acts as the admissibility threshold for local temporal repair.
Algorithm 2 Entity Normalization and Timeline Repair (ENTR)
Require: Provisional events $\Delta E$, existing curated events $E_{\text{curated}}$, global policy $\Pi_{\text{syn}}$
$E_{\text{active}} \leftarrow \text{FilterRetired}(E_{\text{curated}} \cup \Delta E)$      ▹ Input guard using lineage $\mathcal{L}$
for all adjacent pairs $(e_i, e_{i+1})$ within backtracking window $\Pi_{\text{syn}}.\Delta_d$ do
      if $\text{TemporalDistance}(e_i, e_{i+1}) \leq \Pi_{\text{syn}}.\delta_t$ and $\text{LLM\_Merge}(e_i, e_{i+1})$ then
            $e_{\text{new}} \leftarrow e_i \oplus e_{i+1}$
            Replace $e_i$, $e_{i+1}$ with $e_{\text{new}}$ in $E_{\text{active}}$ and record DAG mapping in $\mathcal{L}$
      end if
end for
Unlike conventional RAG architectures, where structural thresholds are often scattered as hard-coded heuristics, explicitly encapsulating these guardrails disentangles deterministic stewardship logic from the probabilistic LLM engine. This formulation represents complex neuro-symbolic orchestration as a parameterized configuration space, enabling reproducible configurations across different hardware constraints and application scenarios.
To preserve reproducibility, the scalar routing coefficients used in HDRA ($\alpha$, $\beta$, $\lambda$, $\gamma$, and $\tau_{\text{new}}$) are treated as tuned online operational defaults, whereas the stewardship policies and lifecycle thresholds are treated as fixed governance controls under a versioned policy contract. In practice, each deployment profile materializes a concrete policy instance $\Pi_{\text{syn}}$ from this versioned stewardship contract, and this instance remains immutable throughout a serving window. Stewardship-side controls may be updated only at stewardship-cycle boundaries through offline re-validation and atomic snapshot publication. This design prevents mixed-policy reads across serving windows, preserves snapshot consistency, and makes policy evolution auditable across snapshot versions. Their roles, tuning rationales, and freeze rules are summarized in Appendix A.3 and Appendix A.4.
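One way to picture a policy instance that stays immutable within a serving window is a frozen record. The sketch below is an assumption-laden illustration: the class name `SynthesisPolicy`, the Python field names and types, and the representation of the rule-set as a version string are hypothetical, and no concrete values from the paper are implied.

```python
from dataclasses import dataclass, FrozenInstanceError, replace


@dataclass(frozen=True)
class SynthesisPolicy:
    theta_s: float   # cosine similarity margin
    k_sem: int       # ANN search breadth
    k_t: int         # temporal k-NN
    delta_d: int     # backtracking window
    delta_t: float   # local temporal tolerance
    b_u: int         # LLM inference budget
    b_bridge: int    # cross-cluster bridging/verification budget
    k_guard: int     # guard/validation budget
    v_ver: str       # version tag of the verification rule-set (assumed encoding)


def next_cycle_policy(current, **changes):
    # Updates produce a new instance for the next stewardship cycle;
    # the instance bound to the current serving window is never mutated.
    return replace(current, **changes)
```

Because the record is frozen, an attempt to assign a field mid-window raises `FrozenInstanceError`, while `next_cycle_policy` yields a separate instance that can be attached to the next published snapshot.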

3.5.3. Entity Normalization and Timeline Repair

Prior to synthesizing high-order relationships, the steward transforms unstructured provisional events into a curated set of event nodes. First, a semantic encapsulation process enforces an authority contract, guiding the LLM to generate citable core summaries ($s_{\text{core}}$) and stable vector anchors for newly accumulated events, where $s_{\text{core}}$ denotes a citable core summary attached to each curated event node. Second, to repair timeline fragmentation caused by the online layer’s conservative segmentation, a physical defragmentation process is executed, as depicted in Phase 1 of Figure 5. As detailed in the pseudocode of Entity Normalization and Timeline Repair (ENTR), the steward applies a bounded merge-repair procedure to temporally adjacent candidates within the backtracking window $\Delta_d$. This procedure combines a deterministic temporal gate with an LLM-assisted merge validator, so that physical merging is triggered only when both local chronological admissibility and narrative compatibility are satisfied. The temporal edges produced by this repair stage are collected as $E_{\text{time}}$, which denotes the set of physical adjacency edges used as the chronological backbone in Stage 2.
To make this repair procedure operationally explicit, ENTR applies two bounded decision gates before physical merging. First, $\text{TemporalDistance}(e_i, e_{i+1})$ acts as a deterministic temporal admissibility check that determines whether two adjacent event candidates remain sufficiently close in chronological position to be considered for merge under the current stewardship policy. In the current implementation, this check is bounded by the backtracking window $\Pi_{\text{syn}}.\Delta_d$ and the local temporal tolerance $\Pi_{\text{syn}}.\delta_t$. Second, $\text{LLM\_Merge}(e_i, e_{i+1})$ serves as an LLM-assisted but policy-bounded boolean validator. It returns True only when the candidate pair exhibits sufficient narrative continuity, semantic compatibility, and referential stability, while not violating temporal order or introducing an explicit state contradiction.
Crucially, to maintain graph atomicity and data provenance, this physical merging, $e_i \oplus e_{i+1} \rightarrow e_{\text{new}}$, where $e_i$ denotes an event node instance and $\oplus$ denotes the physical merge operator, does not simply discard old nodes. Instead, original nodes are retired and recorded in a lineage registry ($\mathcal{L}$), establishing a directed acyclic graph (DAG) lineage mapping. This registry not only acts as an input guard to prevent “zombie” fragments from being reactivated into the deep reasoning pipeline, but also serves as the structural basis for subsequent retrieval grounding. By preserving this traceable mapping, the system retains the capacity to link abstract summaries back to raw utterances, thereby helping to mitigate DR2.
Here, $\Delta E$ denotes the newly accumulated provisional events in the current stewardship cycle, $E_{\text{curated}}$ denotes the previously committed curated event set, and $E_{\text{active}}$ denotes the active candidate pool after retired nodes are filtered through the lineage registry $\mathcal{L}$. The operator $\oplus$ denotes the physical merge operator that replaces a validated adjacent pair $(e_i, e_{i+1})$ with a new merged node $e_{\text{new}}$, while preserving provenance in $\mathcal{L}$.
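The merge-repair procedure of Algorithm 2 can be sketched in a few lines. This is a minimal illustration under several assumptions that are not taken from the paper: events are dictionaries with hypothetical `id`, `start`, `end`, and `text` fields, timestamps are plain numbers, the backtracking window is read as "the last `delta_d` events are eligible", a merged node is re-checked against its next neighbor, and `llm_merge` is a caller-supplied stub standing in for the LLM-assisted validator.

```python
def entr(curated, provisional, lineage, delta_d, delta_t, llm_merge):
    """Bounded merge-repair sketch; `lineage` maps retired id -> merged id."""
    retired = set(lineage)
    # Input guard: retired fragments never re-enter the active pool.
    active = sorted(
        (e for e in curated + provisional if e["id"] not in retired),
        key=lambda e: e["start"],
    )
    # Assumption: only the trailing delta_d events are considered for merging.
    i = max(0, len(active) - delta_d)
    while i < len(active) - 1:
        a, b = active[i], active[i + 1]
        # Deterministic temporal gate first, then the LLM-assisted validator.
        if b["start"] - a["end"] <= delta_t and llm_merge(a, b):
            merged = {
                "id": f"{a['id']}+{b['id']}",
                "start": a["start"],
                "end": max(a["end"], b["end"]),
                "text": a["text"] + " " + b["text"],
            }
            active[i:i + 2] = [merged]
            # Record the DAG lineage mapping instead of discarding old nodes.
            lineage[a["id"]] = merged["id"]
            lineage[b["id"]] = merged["id"]
        else:
            i += 1
    return active
```

Because the temporal gate is evaluated before `llm_merge`, the stub is only called for pairs that are already chronologically admissible, which keeps the number of validator calls bounded by the window size.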

3.5.4. Neuro-Symbolic Graph Synthesis (NSGS)

NSGS functions as the central off-peak curation mechanism that combines probabilistic semantic understanding with deterministic structural constraints. Its role is to validate evidence-grounded candidate relations among temporally sparse and semantically disconnected dialogue fragments, thereby helping preserve causal continuity and state evolution under DR1 and DR2. Rather than claiming formal causal discovery, NSGS performs bounded causal-relation validation over curated event nodes and organizes accepted relations into a compact topology for downstream retrieval.
Conventional GraphRAG-style architectures are less suitable for this setting because continuous interaction streams require frequent updates, while community summarization may merge temporally distinct but semantically related states. This can obscure intermediate transitions and historical boundaries, which corresponds to the structural erasure risk described by DR2.
As illustrated by Algorithm 3 and Phase 2 in Figure 6, NSGS introduces a bounded-update approach via a dual-track skeleton. To keep the topology interpretable and validation-bounded, NSGS uses a compact four-type relation closure: CAUSAL_FOLLOW, TEMPORAL_NEXT, STATE_SUCCESSOR, and SAME_EVENT. CAUSAL_FOLLOW captures directional cause–effect dependencies, TEMPORAL_NEXT preserves chronological adjacency, STATE_SUCCESSOR models historically distinct state evolution within the same event lineage, and SAME_EVENT reconnects semantically dispersed but co-referential fragments. This closure is not intended as a universal ontology; it is selected to balance causal recovery, temporal continuity, state preservation, event reconnection, and bounded maintenance cost.
Track 1 constructs a homophilic topic graph using HNSW-based approximate nearest neighbor search bounded by $k_{\text{sem}}$, as defined in Equation (4):
$$E_{\text{sem}} = \left\{ (i, j) \mid j \in \text{ANN}(i, k_{\text{sem}}) \wedge \cos(v_i, v_j) \geq \theta_s \right\}, \tag{4}$$
where $i$ and $j$ index active event nodes, $v_i$ denotes the embedding anchor of node $i$, $\text{ANN}(i, k_{\text{sem}})$ returns the top-$k_{\text{sem}}$ nearest neighbors of $i$, and $\theta_s$ is the cosine similarity margin. The resulting topic graph $E_{\text{topic}} = E_{\text{time}} \cup E_{\text{sem}}$ provides a non-LLM local scaffold for semantic and chronological neighborhoods.
Track 2 then adds symbolic hooks $E_{\text{hook}}$ derived from explicit online signals, yielding $E_{\text{logic}} = E_{\text{topic}} \cup E_{\text{hook}}$. These hooks allow the steward to test candidate dependencies across semantic-cluster boundaries, supporting the later validation of relations such as CAUSAL_FOLLOW and STATE_SUCCESSOR.
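Algorithm 3 Dual-Track Topological Skeleton Construction (DTKC)
$E_{\text{sem}} \leftarrow \text{ComputeHomophilicEdges}(E_{\text{active}}, \Pi_{\text{syn}}.\theta_s, \Pi_{\text{syn}}.k_{\text{sem}})$
$E_{\text{topic}} \leftarrow E_{\text{time}} \cup E_{\text{sem}}$      ▹ Track 1: Homophily + chronological scaffold for clustering
$E_{\text{hook}} \leftarrow \text{ExtractSymbolicHooks}(E_{\text{active}})$
$E_{\text{logic}} \leftarrow E_{\text{topic}} \cup E_{\text{hook}}$      ▹ Track 2: Bridge structure for higher-order reasoning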
As Figure 6 illustrates, semantic clustering may separate an error event $E_1$ from its later fix event $E_3$ into different communities. The logic graph uses symbolic hooks to bypass these homophilic boundaries and recover cross-cluster dependencies, preserving chronological progression, state evolution, and same-event reconnection.
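The dual-track construction can be illustrated on a toy graph. The sketch below is an assumption-based example rather than the paper's implementation: it replaces the HNSW index with an exhaustive top-$k$ cosine search (adequate only for tiny inputs), and the function names, the two-dimensional toy vectors, and the hand-written hook edge are all hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_edges(vectors, k_sem, theta_s):
    """Edges (i, j) where j is among i's top-k_sem neighbors and cos >= theta_s."""
    edges = set()
    for i, vi in vectors.items():
        # Exhaustive search stands in for the ANN index in this sketch.
        scored = sorted(
            ((cosine(vi, vj), j) for j, vj in vectors.items() if j != i),
            reverse=True,
        )
        for score, j in scored[:k_sem]:
            if score >= theta_s:
                edges.add((i, j))
    return edges


def build_skeleton(vectors, time_edges, hook_edges, k_sem, theta_s):
    e_sem = semantic_edges(vectors, k_sem, theta_s)
    e_topic = set(time_edges) | e_sem      # Track 1: semantic + chronological scaffold
    e_logic = e_topic | set(hook_edges)    # Track 2: add symbolic hooks
    return e_topic, e_logic


# Toy case: E1 (error) and E3 (fix) are semantically distant.
vectors = {"E1": (1.0, 0.0), "E2": (0.95, 0.3), "E3": (0.0, 1.0)}
e_topic, e_logic = build_skeleton(
    vectors,
    time_edges=[("E1", "E2"), ("E2", "E3")],
    hook_edges=[("E1", "E3")],
    k_sem=1,
    theta_s=0.8,
)
```

In this toy case the direct `("E1", "E3")` dependency is absent from the topic graph and appears only in the logic graph, which is the role the text assigns to symbolic hooks.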
Following topological decoupling, BBR partitions the topic graph into semantic clusters $\mathcal{C} = \{C_1, \ldots, C_m\}$ and allocates bounded inference slots to cluster-local and cross-cluster reasoning as shown in Algorithm 4. Let $\tilde{E}_{\text{rel}}$ denote LLM-proposed candidate relation edges before contract validation. The total LLM inference cost is bounded by Equation (5):
$$C_{\text{LLM}} \leq \sum_{i=1}^{m} \min\left( |C_i|_{\text{new}},\ B_u \right) + B_{\text{bridge}} + M \leq m \cdot B_u + B_{\text{bridge}} + M, \tag{5}$$
where $|C_i|_{\text{new}}$ denotes incrementally updated nodes in cluster $i$, and $M$ represents fixed overhead. Because the reasoning budget is bounded by the number of active clusters rather than the entire historical repository, BBR keeps off-peak structural reasoning tractable under asynchronous stewardship.
Candidate relations are subsequently canonicalized into the four-type closure and passed to contract validation before snapshot publication. Answer-level auditability is supported separately through sentence-level reference pointers and pointer-grounded verification.
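Algorithm 4 Budgeted Bridge Reasoning (BBR)
$\mathcal{C} \leftarrow \text{CommunityDetection}(E_{\text{topic}})$      ▹ Partition into clusters $C_1 \ldots C_m$
$\tilde{E}_{\text{rel}} \leftarrow \emptyset$
for all clusters $C_i \in \mathcal{C}$ do
      $\tilde{E}_{\text{rel}} \leftarrow \tilde{E}_{\text{rel}} \cup \text{LLM\_Reasoning}(C_i, E_{\text{logic}}, \text{Budget } B_u)$
end for
$\tilde{E}_{\text{rel}} \leftarrow \tilde{E}_{\text{rel}} \cup \text{CrossClusterReasoning}(\mathcal{C}, E_{\text{hook}}, \text{Budget } B_{\text{bridge}})$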
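To make the bound in Equation (5) concrete, the following sketch evaluates both sides for a hypothetical maintenance cycle. The function name and the example numbers are illustrative assumptions and do not come from the paper's experiments.

```python
def llm_cost_bound(new_nodes_per_cluster, b_u, b_bridge, overhead):
    """Return (tight, loose) upper bounds on LLM cost for one maintenance cycle.

    tight: sum over clusters of min(|C_i|_new, B_u), plus B_bridge and M
    loose: m * B_u, plus B_bridge and M
    """
    tight = sum(min(n, b_u) for n in new_nodes_per_cluster) + b_bridge + overhead
    loose = len(new_nodes_per_cluster) * b_u + b_bridge + overhead
    return tight, loose


# Three clusters with 2, 10, and 0 newly updated nodes; B_u = 4, B_bridge = 3, M = 1.
tight, loose = llm_cost_bound([2, 10, 0], b_u=4, b_bridge=3, overhead=1)
# tight = (2 + 4 + 0) + 3 + 1 = 10, loose = 3 * 4 + 3 + 1 = 16
```

The second cluster contributes only 4 rather than 10 because the per-cluster cap truncates it, and neither bound depends on how many historical nodes the repository holds.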

3.5.5. Contract Validation and Snapshot Publication

To converge probabilistic LLM-generated candidate edges ($\tilde{E}_{\text{rel}}$) into a deterministic graph, the system executes a rigorous contract validation and snapshot publication (CVSP) process. A verification gate applies three hard constraints: $f_{\text{schema}}$ verifies JSON structural integrity, $f_{\text{evid}}$ mandates original-evidence citations for auditability, and $f_{\text{type}}$ restricts relationships to an allowed closure set (all $f(\cdot)$ are boolean validation predicates applied to candidate edge $y$, with type closure specified by verification rule-set $V_{\text{ver}}$ under $\Pi_{\text{syn}}$). Validated edges are then deduplicated and subjected to physical constraints, such as temporal ordering, via a canonicalization function $\text{Canon}(\cdot)$, yielding the committed relation edge set $E_{\text{rel}}$.
Ultimately, the system encapsulates authoritative entities ($E_{\text{active}}$), validated relations ($E_{\text{rel}}$), and auditable metadata, including cluster indexing $M_{\text{cluster}}$, physical lineage $\mathcal{L}$, and policy $\Pi_{\text{syn}}$, into an immutable snapshot $S_{t+1}$. Complying with the atomic publication rule in the snapshot consistency contract above, the global pointer switches to $S_{t+1}$ only after all integrity verifications are passed. Through the lens of MVCC, this step functions similarly to a transactional commit. By ensuring that the online loop only queries fully committed states, this publication pipeline supports monotonic consistency. Consequently, it helps mitigate DR1 and DR2 by improving reproducibility and traceability in long-term interactions.
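The verification gate and canonicalization step can be sketched as three predicates followed by a deduplicating filter. The sketch rests on assumptions that the paper does not specify: the JSON edge schema (`src`, `dst`, `type`, `evidence`), the use of utterance identifiers as citations, numeric event timestamps, and the choice to exempt SAME_EVENT from the temporal-ordering check are all hypothetical.

```python
import json

# The four-type relation closure named in the text.
ALLOWED_TYPES = {"CAUSAL_FOLLOW", "TEMPORAL_NEXT", "STATE_SUCCESSOR", "SAME_EVENT"}
REQUIRED_KEYS = {"src", "dst", "type", "evidence"}  # hypothetical edge schema


def f_schema(raw):
    """Return the parsed edge if the JSON is structurally valid, else None."""
    try:
        edge = json.loads(raw)
    except json.JSONDecodeError:
        return None
    ok = isinstance(edge, dict) and REQUIRED_KEYS.issubset(edge)
    return edge if ok else None


def f_evid(edge, known_utterances):
    # At least one citation, and every citation must point at a stored utterance.
    cited = edge["evidence"]
    return isinstance(cited, list) and bool(cited) and set(cited) <= known_utterances


def f_type(edge):
    return edge["type"] in ALLOWED_TYPES


def canon(edges, timestamps):
    """Deduplicate and drop directional edges that point backwards in time."""
    committed = []
    for e in edges:
        key = (e["src"], e["dst"], e["type"])
        directional = e["type"] != "SAME_EVENT"  # assumption of this sketch
        if directional and timestamps[e["src"]] > timestamps[e["dst"]]:
            continue
        if key not in committed:
            committed.append(key)
    return committed


def validate(raw_edges, known_utterances, timestamps):
    parsed = (f_schema(r) for r in raw_edges)
    passed = [e for e in parsed if e and f_evid(e, known_utterances) and f_type(e)]
    return canon(passed, timestamps)
```

Only edges that pass all three predicates reach `canon`, so a malformed or uncited LLM proposal is rejected before it can influence the committed edge set.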

3.6. Sentence-Level Memory

While curated events constitute the core macroscopic memory, sentence-level memory serves as an indispensable high-fidelity failsafe, providing a two-fold compensation. Temporally, it provides OILS with an intra-cycle real-time buffer, directly indexing unstructured raw conversational data to eliminate memory blind spots caused by asynchronous stewardship latency. Structurally, it preserves uncompressed raw text spans as a resolution substrate, compensating for the inherent detail loss during event summarization.
Concretely, each utterance is stored as an immutable record indexed by a unique identifier, enabling direct addressability without requiring re-clustering. Each curated event $E_j$ maintains a set of reference pointers that map its summary claims to supporting raw spans. At query time, OILS materializes grounding text by executing direct pointer lookups, ensuring that citations are verbatim spans rather than regenerated paraphrases. The real-time buffer covers the intra-cycle gap (i.e., the wall-clock duration between two consecutive published snapshots) by indexing the newest utterances before the next off-peak publication. This design supports fallback to raw evidence when event summaries are lossy, thereby improving auditability and error analysis.
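Pointer-based grounding can be illustrated with an append-only store and a lookup helper. This is a sketch under assumptions: the `SentenceMemory` class, the `(utterance_id, start, end)` pointer layout with character offsets, and the `pointers` field on an event are hypothetical choices made for the example.

```python
class SentenceMemory:
    """Append-only utterance store addressed by unique identifier."""

    def __init__(self):
        self._records = {}

    def append(self, utt_id, speaker, text):
        if utt_id in self._records:
            raise KeyError(f"utterance {utt_id} is already stored and immutable")
        self._records[utt_id] = (speaker, text)

    def span(self, utt_id, start, end):
        # Direct lookup: the returned text is a verbatim slice of the raw record.
        return self._records[utt_id][1][start:end]


def ground(event, memory):
    """Materialize verbatim spans for each summary claim via pointer lookups."""
    return {
        claim: [memory.span(u, s, e) for (u, s, e) in pointers]
        for claim, pointers in event["pointers"].items()
    }
```

Because `ground` only slices stored text, every citation it returns is guaranteed to be a substring of some raw utterance, never a regenerated paraphrase.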

3.7. Bounded Dual-Stream Retrieval

To retrieve information from immutable snapshots while addressing DR1 and DR2, MemLoom employs a bounded dual-stream retrieval (BDSR) mechanism executed through three atomic operations. First, semantic seeding leverages vector similarity to identify highly relevant active events as narrative anchors without altering graph topology. Second, structured traversal expands from these anchors through the validated relation closure. It bidirectionally traces CAUSAL_FOLLOW edges to recover root causes and consequences, thereby addressing DR1, and follows TEMPORAL_NEXT edges to preserve chronological continuity across adjacent event transitions. It also traverses STATE_SUCCESSOR edges to retain historically distinct state evolution without flattening it into a single compressed representation, thereby addressing DR2, and explores SAME_EVENT edges to reconnect semantically dispersed but co-referential fragments of the same underlying event. Together, these bounded traversals enable multi-hop context recovery while preserving causal traceability and structural distinctness. Finally, the grounding stage uses reference pointers to map abstract event summaries back to concrete utterance entries within the sentence-level substrate. This forms an end-to-end provenance chain from retrieval to answer generation, which suppresses hallucinations by requiring every critical claim to remain traceable to raw evidence and supports answer-level auditability.
To guarantee boundary strength during long-term interactions, this pipeline is strictly constrained by a global computational budget and a visibility filter. The global budget enforces a strictly bounded retrieval cost independent of historical data scale by applying fixed caps on (i) the number of seed events, (ii) traversal hops and branching factors, and (iii) the amount of grounded evidence materialized for generation. Concurrently, a pre-computed access control list (ACL) physically intercepts unauthorized accesses during both graph traversal and evidence materialization. This ensures that only nodes, edges, and sentence-level spans within the viewer’s visibility scope can be retrieved, thereby eliminating cross-user leakage and the risk of dirty reads.
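The combination of fixed caps and a visibility filter can be sketched as a bounded breadth-first expansion. The following is an illustrative sketch only: the function name, the adjacency representation (a caller-supplied map that already contains both directions for relations traced bidirectionally), and the representation of the ACL as a set of visible node identifiers are assumptions.

```python
from collections import deque


def bounded_traverse(seeds, edges, visible, max_seeds, max_hops, max_branch, max_evidence):
    """Breadth-first expansion under fixed caps and a visibility filter.

    edges maps node -> list of (relation, neighbor); visible is the viewer's ACL set.
    """
    start = [s for s in seeds[:max_seeds] if s in visible][:max_evidence]
    frontier = deque((s, 0) for s in start)
    reached = list(start)
    seen = set(start)
    while frontier and len(reached) < max_evidence:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # hop cap reached for this path
        # Nodes outside the viewer's scope are filtered before expansion.
        allowed = [n for _, n in edges.get(node, []) if n in visible and n not in seen]
        for neighbor in allowed[:max_branch]:  # branching cap
            if len(reached) >= max_evidence:
                break
            seen.add(neighbor)
            reached.append(neighbor)
            frontier.append((neighbor, hops + 1))
    return reached
```

The worst-case number of visited nodes depends only on the four caps, not on the size of `edges`, which is the sense in which retrieval cost stays independent of historical data scale.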
Illustrative example. Consider a household dialogue in which Alice first says, “I unplugged the coffee machine because it was leaking,” and many turns later asks, “Can you help me get something hot to drink?” A flat similarity-based retriever may over-focus on the lexical cue “hot” and retrieve general coffee-machine usage context, thereby missing the earlier safety-related state change. In contrast, MemLoom is designed to preserve the earlier unplugging event, its state consequence, and the later query as structurally connected memory units. Under this event-centric representation, the retrieval path can recover that the coffee machine is unavailable due to a prior state-changing event, leading to a safer and more contextually grounded response, such as recommending an alternative hot drink rather than suggesting reuse of the machine.
Ultimately, by centering on immutable snapshots, MemLoom addresses the tension between structural consistency and online latency. Within this design, online interactions remain responsive and write-light, while off-peak stewardship upgrades the global structure and preserves the stability of the live read path. The subsequent section empirically evaluates this architecture under claim-driven protocols across long-horizon recall, causal chain recovery, answer-level auditability, and end-to-end cost.

4. System Verification and Mechanism Diagnosis

An orthogonal verification strategy is adopted to examine the DLPS architecture in MemLoom. The mitigation effectiveness against DR1, DR2, and DR3 is systematically evaluated, as these design risks represent primary structural bottlenecks for LLM-based conversational agents. To guide this evaluation, our diagnosis is driven by four core research questions, i.e., RQ1 to RQ4:
  • RQ1 (Structuring capability): Can the off-peak stewardship establish robust event boundaries to prevent semantic fragmentation and over-merging, i.e., the foundation for DR1 and DR2?
  • RQ2 (Retrieval stability and reasoning consistency): Does the bounded dual-stream retrieval sustain signal survival and structural durability across long-horizon and multi-hop interactions?
  • RQ3 (Causal recovery and structural preservation): Can the curated event memory graph accurately reconstruct causal chains while preserving temporally traceable state transitions without collapsing historically distinct states into a single compressed representation?
  • RQ4 (Deployment viability): Does the asynchronous maintenance mechanism effectively mitigate DR3 while ensuring bounded latency?

4.1. Datasets and Evaluation Protocols

Herein, we select two external canonical benchmarks and a self-built controlled diagnostic suite (SCDS) to implement the orthogonal verification strategy. Moreover, a scale sweep simulation for infinite-horizon long-term accumulation is incorporated. These evaluations are designed to map the respective benchmarks and defensive contracts directly to the proposed RQs and DRs, confirming comprehensive coverage across structure, noise, logic, and scale dimensions.

4.1.1. Long-Meeting Structuring Benchmark (QMSum)

To address RQ1, the QMSum benchmark [28] is applied to verify the fundamental ability of MemLoom in establishing event-level granularity, affirming a structural prerequisite for addressing DR1 and DR2. Exhibiting significant long-range features (averaging ∼575 turns per meeting across diverse domains) and high topic variability (averaging 4.26 topic spans), QMSum is utilized beyond traditional ROUGE metrics. Specifically, it serves as the segmentation contract against human ground-truth (GT) annotations. Because not all compared systems are segmentation-native, all outputs are first normalized into a common turn-level segmentation sequence before evaluation. For segmentation-native systems, boundaries are read directly from the produced event sequence. For non-segmentation-native baselines, a fixed boundary projection protocol is applied: native memory units are first aligned back to their supporting dialogue turns, and projected boundaries are then induced whenever the dominant assigned unit changes along the chronological turn sequence. For Mem0, the projected unit is the consolidated memory entry attached to each turn span; for GraphRAG, it is the dominant community-aligned summary region assigned to each turn. Standard probabilistic segmentation error metrics, such as P k and WindowDiff (WD), are then employed to rigorously evaluate the difference between these normalized boundary projections and human annotations.
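The boundary projection protocol for non-segmentation-native baselines can be sketched as two steps: assign each turn its dominant memory unit, then mark a boundary wherever that unit changes. The sketch below is hypothetical in its details: the function names, the vote-counting rule for dominance, and the choice to let uncovered turns inherit the previous unit are assumptions, not the paper's exact protocol.

```python
from collections import Counter


def dominant_unit_per_turn(unit_support, num_turns):
    """unit_support maps a memory-unit id to the turn indices that support it."""
    votes = [Counter() for _ in range(num_turns)]
    for unit, turns in unit_support.items():
        for t in turns:
            votes[t][unit] += 1
    labels, last = [], None
    for counter in votes:
        if counter:
            last = counter.most_common(1)[0][0]
        labels.append(last)  # assumption: uncovered turns inherit the previous unit
    return labels


def project_boundaries(turn_units):
    """Return a 0/1 list; a 1 at position i marks a boundary after turn i."""
    return [
        1 if turn_units[i] != turn_units[i + 1] else 0
        for i in range(len(turn_units) - 1)
    ]


labels = dominant_unit_per_turn({"m1": [0, 1, 2], "m2": [2, 2, 3, 4], "m3": [5]}, 6)
boundaries = project_boundaries(labels)
```

Here turn 2 is supported by both `m1` and `m2`, but `m2` dominates, so the projected boundary falls after turn 1.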

4.1.2. Long-Horizon Multi-Party Memory Benchmark (LoCoMo)

To address RQ2, the LoCoMo benchmark [4] evaluates structural durability against the information fragmentation and catastrophic forgetting risks inherent in real-world interactions, featuring exceptionally long multi-session contexts (averaging ∼600 turns and 200 QA pairs per set). We employ three tasks across different cognitive levels: First, single-hop retrieval establishes the foundational retrieval capability for capturing discrete facts. Then, multi-hop reasoning verifies whether the curated event memory graph can mitigate DR1 by constructing effective bridging paths that reconnect semantically dispersed but contextually related event fragments without relying on shallow keyword co-occurrence. Finally, temporal reasoning checks if the immutable snapshot mechanism effectively preserves historical contexts. This specifically prevents historical states from being wrongly overwritten by new information—a common destructive update issue—thereby mitigating DR2.

4.1.3. Synthetic Causal Diagnostic Suite (SCDS)

To systematically address RQ3, we introduce the Synthetic Causal Diagnostic Suite (SCDS). Existing datasets inherently lack auditable ground-truth (GT) annotations for densely intertwined multi-party interactions, thereby limiting the precise diagnosis of DR1 and DR2. To bridge this gap while adhering to strict community standards for dataset transparency [29,30,31,32], SCDS avoids one-shot prompting in favor of a spec-driven multi-stage synthesis pipeline [33,34,35,36,37]. As detailed in Appendix C, this controlled protocol actively mitigates model collapse and data contamination risks via verification-centric quality gates and split isolation [38,39,40,41]. Furthermore, to circumvent the preference leakage and judgment bias inherent in LLM-as-a-judge methodologies [42,43], the suite explicitly grounds all evaluations in pointer-level gold-evidence annotations combined with deterministic scoring.
Built upon this rigorous foundational protocol, the full SCDS pool contains 453 causal diagnostic queries. Among them, 100 queries are reserved as an independent validation split for parameter tuning and sanity-checking, while the remaining 353 queries constitute the final reported diagnostic evaluation split. We deliberately calibrate this interaction length and density to effectively isolate structural reasoning failures from sheer context-window overflows. These diagnostic tasks evaluate the architecture’s capacity for topological recovery by testing whether sparse, semantically heterophilous signals can be reconstructed into an unbroken causal chain, thereby rigorously stressing DR1. Concurrently, the suite examines whether historically distributed evidence can be preserved and reconnected without being overwritten or structurally flattened during long-horizon memory maintenance, thereby stressing DR2. Together, these high-resolution stress tests provide controlled diagnostic evidence regarding the structural durability of the proposed dual-loop mechanism.
Importantly, SCDS is intended as a controlled diagnostic suite rather than as a universal benchmark for open-world conversational memory. Its evaluation ontology is intentionally aligned with the relation closure used in MemLoom because the suite is designed to isolate specific structural risks, especially causal rupture and state-flattening failures, under pointer-grounded verification. Accordingly, SCDS should be interpreted as high-resolution diagnostic evidence, whereas more natural datasets such as LoCoMo provide complementary evidence regarding broader retrieval durability and reasoning behavior.

4.2. Comparisons of Categorized Architectures

To precisely isolate the structural advantages of MemLoom, we compare it against four categories of baseline architectures representing different evolutionary stages. First, the Full-Context baseline establishes a theoretical upper bound for accurate model understanding but is rendered unsuitable for large-scale deployment due to substantial latency growth and strict input budgets. Second, the Chunk-Based Vector baseline, sweeping parameters from 128 to 8192 tokens, explores current industry limits; however, its inherent lack of semantic boundary awareness inevitably results in context fragmentation and topic drift. To resolve such fragmentation, GraphRAG provides macroscopic global understanding through community detection, yet it is primarily optimized for corpus-level sensemaking rather than event-versioned temporal preservation in dynamic dialogue. Finally, contemporary online agentic memory frameworks, such as Mem0 [13] and Zep [44], attempt to solve this latency via real-time LLM tool calls. Nevertheless, the tight coupling of read and write operations often introduces a synchronous maintenance bottleneck, identified as DR3, while their reliance on overwrite-to-latest state updates directly complicates temporal state preservation, denoted as DR2, during long-horizon updates or historical backtracking.
To establish a fair and practically viable cross-system comparison, all evaluations, except the theoretical full-context upper bound, were conducted under a strict bounded-budget contract. Rather than treating token limits as a capability ceiling of modern LLMs, we enforce a strict 8k-token contextual budget, denoted as $B_{\text{ctx}}$, and a bounded generation budget, denoted as $B_{\text{gen}}$, to simulate latency-sensitive deployment environments. Comprehensive details of this evaluation contract are provided in Appendix A.5. This contract further prohibits any system from relaxing retrieval limits to obtain marginal gains by enforcing strict boundaries across three dimensions: the context budget $B_{\text{ctx}}$, the retrieval budget $B_{\text{ret}}$, and the generation budget $B_{\text{gen}}$. Specifically, $B_{\text{ctx}}$ sets a hard upper bound on the total prompt tokens passed to the generation model, such as 8k tokens; $B_{\text{ret}}$ fixes the number of initial retrieval seeds (the Top-K selection) and the maximum evidence payload, ensuring consistent rules for de-duplication and truncation; and $B_{\text{gen}}$ fixes the maximum decoding length, for instance, 512 new tokens, to support the reproducibility of experimental results.
To ensure deployment parity, all evaluated systems, except for the theoretical upper bound, strictly adhere to a maximum context budget of $B_{\text{ctx}} = 8000$ tokens per query. Crucially, the Full-Context baseline reported in subsequent evaluations is exempt from this contract; it is included solely as an out-of-budget theoretical upper-bound reference to delineate the inherent reasoning limits of the LLM.

4.3. Evaluation Metrics and Diagnostic Signatures

Rather than reporting single-dimensional generation accuracy alone, a dual-reporting protocol of “Quality × Cost” is adopted to evaluate the architecture’s deployment viability in resource-constrained environments. For operational efficiency, end-to-end tail latency, denoted as $P_{95}$, is tracked to diagnose DR3, as interactivity severely degrades when synchronous graph updates cause tail latency to spike. For retrieval quality, the turn-grounded recall at $K$, denoted as $R@K$, measures the coverage of the retrieved results over the minimal sufficient source-turn evidence set, formally defined as $T_{\text{gold}}$ within our controlled suite. Crucially, rather than treating it as a universal operational requirement, we establish full coverage, specifically achieving an $R@K$ of 1.0, as a diagnostic sufficiency gate within our controlled suite, where the target is the complete minimal evidence set rather than semantically related content alone. Failing to capture the complete necessary evidence set forces the LLM into an information blindness state, where downstream responses risk degrading into ungrounded parametric guessing rather than explicit, traceable reasoning.
To quantify event-boundary reconstruction accuracy, specifically addressing RQ1, and to examine structural tendencies related to DR2, we use the probabilistic segmentation error, denoted as $P_k$, and the WindowDiff metric, denoted as $WD$, with respect to human annotations. These metrics are used here not merely as standard segmentation scores, but as diagnostic instruments for evaluating whether off-peak stewardship can reconstruct event boundaries without inducing either excessive fragmentation or structural flattening. The metrics are defined as follows:
$$P_k = \frac{1}{T-k} \sum_{i=1}^{T-k} \mathbb{I}\left[ \delta\left( y_{\text{ref}}[i],\ y_{\text{ref}}[i+k] \right) \neq \delta\left( y_{\text{sys}}[i],\ y_{\text{sys}}[i+k] \right) \right], \tag{6}$$
$$WD = \frac{1}{T-k} \sum_{i=1}^{T-k} \mathbb{I}\left[ B_{\text{ref}}(i, i+k) \neq B_{\text{sys}}(i, i+k) \right], \tag{7}$$
where $T$ denotes the total length of the normalized turn sequence, $k$ denotes the evaluation window size, $y_{\text{ref}}$ and $y_{\text{sys}}$ denote the reference and system-projected segmentation labels, respectively, and $\mathbb{I}[\cdot]$ denotes the indicator function. In Equation (6), $\delta(u, v) = 1$ if positions $u$ and $v$ belong to the same segment and 0 otherwise. In Equation (7), $B_{\text{ref}}(i, i+k)$ and $B_{\text{sys}}(i, i+k)$ denote the number of segment boundaries observed within the same window in the reference and system-projected segmentation, respectively.
The key insight captured by Equations (6) and (7) is that event-boundary quality in long-horizon conversational memory should not be judged only by whether boundaries exist, but by whether the system preserves the correct local relational structure of nearby turns within a finite diagnostic window. Under this interpretation, P_k measures whether the system agrees with the human reference about same-segment versus different-segment membership, whereas WindowDiff measures whether the local boundary density is preserved. We further define the boundary-count difference as ΔB = B_sys − B_ref to distinguish two common structural tendencies: ΔB > 0 indicates semantic fragmentation, namely excessive sensitivity to local semantic shifts, whereas ΔB < 0 indicates over-merging, where distinct narrative units are merged more often than in the human reference. In our diagnostic interpretation, persistent over-merging is treated as evidence consistent with DR2 because it reflects the collapse of historically distinct event boundaries into a flatter structure.
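The two segmentation metrics and the boundary-count difference can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the study's scoring script: each segmentation is given as one segment label per turn, indices are zero-based, and ΔB is computed over the whole sequence (the text does not say whether it is aggregated per window or globally).

```python
from typing import Sequence


def _check(ref: Sequence[int], sys: Sequence[int], k: int) -> int:
    """Validate inputs and return the number of sliding windows, T - k."""
    if len(ref) != len(sys):
        raise ValueError("reference and system sequences must have equal length")
    if not 0 < k < len(ref):
        raise ValueError("window size k must satisfy 0 < k < T")
    return len(ref) - k


def pk(ref: Sequence[int], sys: Sequence[int], k: int) -> float:
    """P_k: share of windows where ref and sys disagree on same-segment membership."""
    windows = _check(ref, sys, k)
    errors = sum(
        (ref[i] == ref[i + k]) != (sys[i] == sys[i + k]) for i in range(windows)
    )
    return errors / windows


def boundaries_in_window(labels: Sequence[int], i: int, k: int) -> int:
    """Number of segment boundaries between positions i and i + k."""
    return sum(labels[j] != labels[j + 1] for j in range(i, i + k))


def window_diff(ref: Sequence[int], sys: Sequence[int], k: int) -> float:
    """WindowDiff: share of windows whose boundary counts differ."""
    windows = _check(ref, sys, k)
    errors = sum(
        boundaries_in_window(ref, i, k) != boundaries_in_window(sys, i, k)
        for i in range(windows)
    )
    return errors / windows


def boundary_count_difference(ref: Sequence[int], sys: Sequence[int]) -> int:
    """Delta B = B_sys - B_ref over the whole sequence (assumed aggregation)."""
    def total(labels: Sequence[int]) -> int:
        return sum(a != b for a, b in zip(labels, labels[1:]))
    return total(sys) - total(ref)
```

For example, with a reference of `[0, 0, 0, 1, 1, 1, 2, 2]`, a system output of `[0, 0, 0, 0, 0, 0, 1, 1]`, and k = 2, the system misses one boundary: both P_k and WD equal 2/6, and ΔB = −1 signals over-merging.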
For complex high-level cognitive tasks, specifically addressing RQ2 and RQ3, traditional lexical metrics exhibit important blind spots. We therefore assess reasoning consistency using an LLM-as-a-Judge metric, denoted as J, which utilizes the evaluation prompt established in Mem0 [13] and is estimated with 10-fold Monte Carlo verification to reduce sampling variance and mitigate survivorship bias. For causal stewardship, strict chain recall, abbreviated as SCR, is defined as 1 when all turns in the gold causal chain are included in the top-K retrieved turns, and 0 otherwise. In addition, for the controlled causal diagnostics in SCDS, we retain attribution auditability, denoted as AA, as an answer-level, pointer-grounded auditability indicator. Unlike its former use in conflict-oriented settings, here AA measures whether the generated answer remains explicitly attributable to the gold supporting evidence and preserves causal traceability under long-range reasoning. Consequently, AA complements R@K and SCR by distinguishing retrieval sufficiency from answer-level evidential faithfulness. A near-zero SCR despite a relatively high R@K suggests that the system failed to recover a contiguous logical path across temporal gaps, indicating a causal rupture, formally identified as DR1. Likewise, when retrieval coverage remains adequate but chain continuity or answer-level auditability still degrades, such a pattern is interpreted as evidence that historically distributed states were not preserved or reconnected faithfully, aligning with the temporal state degradation identified as DR2.
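The contrast between partial coverage and strict chain recall can be made concrete with a short sketch. The function names and the toy cases are hypothetical, and for simplicity the gold causal chain is assumed to coincide with the gold evidence set used for coverage.

```python
from typing import Sequence, Tuple


def chain_coverage(retrieved: Sequence[str], gold_chain: Sequence[str], k: int) -> float:
    """Fraction of gold chain turns that appear in the top-k retrieved turns."""
    if not gold_chain:
        raise ValueError("gold chain must be non-empty")
    top_k = set(retrieved[:k])
    return sum(turn in top_k for turn in gold_chain) / len(gold_chain)


def strict_chain_recall(retrieved: Sequence[str], gold_chain: Sequence[str], k: int) -> int:
    """SCR for one question: 1 only if every chain turn is in the top-k, else 0."""
    if not gold_chain:
        raise ValueError("gold chain must be non-empty")
    return int(set(gold_chain) <= set(retrieved[:k]))


def summarize(cases: Sequence[Tuple[Sequence[str], Sequence[str]]], k: int) -> Tuple[float, float]:
    """Mean coverage and mean SCR over (retrieved, gold_chain) pairs."""
    n = len(cases)
    mean_cov = sum(chain_coverage(r, g, k) for r, g in cases) / n
    mean_scr = sum(strict_chain_recall(r, g, k) for r, g in cases) / n
    return mean_cov, mean_scr
```

If every question recovers two of three chain turns, mean coverage is about 0.67 while mean SCR is 0.0. That gap is the pattern the text reads as a causal rupture: most evidence is retrieved, but no chain is complete.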

5. Simulations and Experiments

5.1. Verifying Structuring Craft (QMSum)

To address RQ1, human annotations from the QMSum dataset serve as the ground-truth reference for evaluating whether the off-peak stewardship mechanism can reconstruct event boundaries. As detailed in Table 2, the rigid fixed-window slicing employed by the chunk-based baseline fails to preserve event granularity, resulting in a fragmentation rate of 34.2% and a high P_k error of 0.604. Conversely, GraphRAG relies heavily on topological density rather than temporal boundaries. This aggregation tendency leads to an over-merging rate of 28.4%, increasing the risk of structural flattening and historical boundary loss, thereby exacerbating DR2. By completely decoupling complex structural curation from the online interaction loop, MemLoom reduces the P_k error to 0.375. Furthermore, removing the off-peak stewardship module causes the P_k error to rise to 0.513, indicating that asynchronous stewardship is critical for maintaining structural integrity. Under the diagnostic interpretation defined by Equations (6) and (7), the lower P_k and WindowDiff indicate that MemLoom more faithfully reconstructs event boundaries, while the lower over-merging tendency relative to graph-summarization baselines suggests stronger resistance to the structural flattening associated with DR2.
Takeaway: These QMSum results indicate that off-peak stewardship is not merely a maintenance utility, but the structural foundation that allows MemLoom to reconstruct event boundaries with lower fragmentation and reduced over-merging.

5.2. Reasoning Consistency and Latency Trade-Offs for RQ2 on LoCoMo

In this section, the LoCoMo benchmark [4] is used to address RQ2. Specifically, we examine whether the BDSR architecture can mitigate the limitations of traditional RAG in temporal and multi-hop reasoning while maintaining bounded latency. The comparative results are presented in Table 3 and Table 4. To reduce fairness concerns, all major systems discussed in this section were re-evaluated under the same local setup, and the resulting comparison trends were found to be broadly consistent with previously reported patterns.

5.2.1. Single-Hop Performance: The Cost of Abstraction

For single-hop queries, the Mem0 baseline achieves the highest score (J = 68.37), marginally exceeding MemLoom (J = 67.43). This result reflects the inherent advantage of Mem0’s dense-vector retrieval, which preserves fine-grained lexical details directly from the raw text. In contrast, the performance of MemLoom reflects the lossy-compression trade-off inherent in an event-centric architecture, where the off-peak stewardship abstracts continuous dialogues into discrete semantic nodes, occasionally omitting granular unstructured details. However, MemLoom remains competitive because the online sentence buffer within the dual-stream mechanism acts as a compensatory layer, limiting the single-hop gap to 0.94 points of J in this comparison.

5.2.2. Complex Reasoning: Structural Consistency

Compared with the reference results in Table 3, MemLoom exhibits a design-consistent advantage on the temporal and multi-hop subsets while remaining competitive on single-hop questions. For temporal reasoning, MemLoom reaches J = 65.77, which is higher than the listed agentic baselines. We interpret this pattern as being consistent with MemLoom’s design emphasis on event-level structure and snapshot-based history preservation, because in-place updating in online memory systems may be less favorable when historically distinct states and intermediate transitions must remain explicitly traceable.
For multi-hop reasoning, MemLoom reaches J = 58.14, also showing a favorable comparison pattern relative to the reference results. We interpret this result as being consistent with the use of a curated event memory graph and bounded topology-aware traversal, which together provide a clearer structural substrate for long-range reasoning than purely flat or runtime-induced memory organization. Because these systems were re-evaluated under the same local setup, the consistency of the temporal and multi-hop trends strengthens the interpretation that MemLoom’s event-structured retrieval design is beneficial for long-horizon reasoning.

5.2.3. Efficiency Analysis and Bounded Latency

As shown in the latency analysis, Mem0 exhibits an extremely low total tail latency (P95 ≈ 1.47 s) due to its lightweight flat architecture. Under the bounded-budget answer-serving setting reported in Table 4, the total latency of MemLoom (P95 ≈ 1.67 s) is slightly higher than that of Mem0, yet remains far below the prohibitive delay of the full-context baseline (P95 ≈ 17.18 s). This moderate overhead reflects the additional cost of structure-aware retrieval and grounding, rather than synchronous graph maintenance on the live serving path. Because heavyweight topology construction, relation synthesis, and snapshot publication are shifted to the off-peak stewardship loop, the online path remains bounded even under long-horizon interaction pressure. This empirical result therefore provides direct support for the core DLPS claim: MemLoom can practically decouple latency-sensitive answer serving from consistency-critical structural maintenance under the tested deployment budget.
Furthermore, MemLoom deliberately incurs a higher storage footprint by maintaining both sentence-level evidence and event-level structure. We interpret this as a memory-for-durability trade-off: additional storage cost is exchanged for stronger temporal continuity, causal recoverability, and answer traceability in long-horizon multi-party settings. Taken together, these results suggest that MemLoom occupies a practical middle ground in the quality–latency trade-off. Although it is not optimal on every isolated metric, it preserves bounded serving latency while maintaining stronger temporal and multi-hop structural consistency than flat retrieval systems.
Takeaway: Under the shared bounded-budget setting, MemLoom preserves bounded serving latency while sustaining stronger temporal and multi-hop structural consistency than flatter retrieval architectures, providing direct empirical support for the DLPS design objective.
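The decoupling described in this subsection can be illustrated with a small sketch of immutable snapshot publication. This is a hypothetical simplification, not MemLoom's implementation: the class and function names, the adjacency-map representation of the event graph, and the in-process store are all assumptions made for illustration.

```python
import threading
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, List, Mapping, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Read-only view of the curated event graph at one published version."""
    version: int
    edges: Mapping[str, Tuple[str, ...]]


class SnapshotStore:
    """Holds the latest published snapshot; readers never wait on curation."""

    def __init__(self) -> None:
        self._current = Snapshot(0, MappingProxyType({}))
        self._publish_lock = threading.Lock()  # serializes publishers only

    def publish(self, edges: Dict[str, List[str]]) -> None:
        """Off-peak loop: build a frozen copy, then swap the reference."""
        frozen = MappingProxyType({k: tuple(v) for k, v in edges.items()})
        with self._publish_lock:
            self._current = Snapshot(self._current.version + 1, frozen)

    def current(self) -> Snapshot:
        """Online loop: a single reference read, taken without the lock."""
        return self._current


def online_context(store: SnapshotStore, sentence_buffer: Sequence[str], event_id: str) -> dict:
    """Combine the last published structure with not-yet-curated sentences."""
    snap = store.current()
    return {
        "snapshot_version": snap.version,
        "neighbors": snap.edges.get(event_id, ()),
        "fresh_sentences": list(sentence_buffer),
    }
```

In this sketch the serving path reads whichever snapshot was published last and supplements it with buffered sentences, so the cost of building a new graph is never paid inside a request.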

5.3. Causal Stewardship and Structural Preservation (SCDS)

In this section, the causal diagnostic questions within SCDS are utilized to verify whether the system can recover chain-complete causal evidence under turn-grounded sufficiency and preserve historically distributed states against structural flattening. These scenarios are characterized by semantic heterophily and long-range temporal dispersion. As shown in Table 5, the dual-loop design substantially affects whether a system can maintain causal continuity, answer-level auditability, and historical traceability under this controlled diagnostic setting.
A noticeable degradation in causal recoverability is observed in the Mem0 baseline. This pattern may reflect a mismatch between compact online memory consolidation and the need to preserve chain-complete, temporally grounded provenance. When explicit event versioning is unavailable, its inference-based update mechanism may be less suitable for retaining remote root-cause turns and intermediate historical states. Consequently, its R@K drops to 0.36, its SCR reaches only 0.12, and its AA remains limited at 0.30. This suggests that systems relying on in-place state-overwrite mechanisms face difficulty in recovering long-range causal chains and may also induce structural erasure of earlier states under long-horizon updates.
While GraphRAG demonstrates a relatively high retrieval rate due to dense graph indexing, its core community aggregation procedure introduces a distinct design trade-off in causal stewardship. GraphRAG tends to group semantically related updates into the same broad summary region based on topological density rather than precise temporal boundaries. Consequently, although its R@K remains relatively high at 0.78, its SCR reaches only 0.35 and its AA remains at 0.50. This indicates that relatively high node recall in isolation does not guarantee chain-complete causal recovery or answer-level evidential faithfulness. In this sense, GraphRAG may exhibit Topological Diffusion, a structural smoothing of specific directional causal paths under broad community summarization. Without a curated chronological backbone, such systems may be more prone to generating narratives that are superficially coherent but causally misaligned.
In contrast, MemLoom reconstructs long-range causality through structured traversal within the curated event memory graph. By navigating explicit logical paths rather than relying solely on semantic proximity, it shows a strong capacity to mitigate DR1. At the same time, its immutable snapshot mechanism and lineage-preserving event lifecycle help maintain historically distinct states without destructive overwrite, thereby reducing DR2. As a result, MemLoom attains strong scores, with R@K = 0.85, SCR = 0.72, and AA = 0.80. We interpret this result as being consistent with MemLoom’s topology-aware retrieval design. At the same time, these findings should be interpreted within the intended scope of SCDS as a controlled diagnostic suite rather than as a substitute for broader real-world evaluation. Because SCDS is deliberately constructed to stress the structural failure patterns targeted by the present relation closure, part of the observed advantage may reflect diagnostic ontology alignment. We therefore treat the SCDS results as mechanism-level evidence for structural recovery, while relying on LoCoMo and QMSum as complementary evidence for more natural long-horizon reasoning and segmentation behavior.
Takeaway: Within the intended scope of SCDS as a controlled diagnostic suite, MemLoom shows stronger mechanism-level evidence for causal-chain recovery and answer-level auditability than baselines that rely on overwrite-oriented or community-smoothed memory organization.

5.4. Deployability and Boundedness Verification for RQ4

This section addresses RQ4 by evaluating the deployment viability of MemLoom and, more specifically, by testing the central architectural claim of DLPS: whether latency-sensitive answer serving can remain bounded while consistency-critical structural maintenance is deferred to asynchronous off-peak stewardship. As illustrated in Figure 7, MemLoom demonstrates this deployability through two stress tests that jointly examine bounded tail latency and intra-cycle robustness during temporary structural lag.
First, the scale sweep test shows that MemLoom mitigates DR3 by offloading heavyweight structural curation to the off-peak stewardship layer under the dual-loop architecture. Compared with offline graph summarization approaches such as GraphRAG, which exhibit a prohibitive P95 latency of 15.0 s during continuous interaction, MemLoom maintains a bounded P95 operating range from 1.35 s in the mature state to 1.67 s during the intra-cycle buffering state. The upper endpoint of this range is consistent with the bounded answer-serving latency reported in the LoCoMo evaluation. This result is therefore not merely a latency observation; it directly supports the DLPS claim that user-facing serving can remain bounded even when structural curation is retained as an asynchronous background process.
The second evaluation introduces a freshness failsafe verification using the SCDS diagnostic suite to measure the system’s performance during the latency gap prior to an off-peak topology update. As shown in Figure 7, even when the curated event memory graph is not yet synchronized with the latest conversational turns, MemLoom, leveraging its online sentence-level buffer and dual-stream mechanism, maintains competitive causal robustness, achieving an AA of 0.78 and an SCR of 0.60. These figures substantially outperform the chunk-based vector baseline (AA = 0.25, SCR = 0.02). The chunk-based values are reused from the same S1 reference in Table 5, because this baseline has no asynchronous freshness state; therefore, its intra-cycle and steady-state behavior are identical under our protocol.
This controlled decline, from a mature peak of AA = 0.80 and SCR = 0.72 down to AA = 0.78 and SCR = 0.60, demonstrates graceful degradation under temporary structural lag. In the worst-case scenario where the off-peak stewardship cycle remains incomplete, the dual-stream online loop acts as a failsafe, providing foundational causal retrieval and answer-level auditability that still exceed those of traditional RAG systems relying solely on flat retrieval. This suggests that the dual-loop architecture effectively mitigates the risks associated with the “information freshness gap” while preserving bounded responsiveness.
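As a worked check, the size of this decline can be expressed as a relative change computed from the reported values. The relative-change framing and the subscripts are ours and are not metrics defined in the study.

```latex
\begin{align*}
\frac{AA_{\text{mature}} - AA_{\text{intra}}}{AA_{\text{mature}}}
  &= \frac{0.80 - 0.78}{0.80} = 0.025 \quad (2.5\%), \\
\frac{SCR_{\text{mature}} - SCR_{\text{intra}}}{SCR_{\text{mature}}}
  &= \frac{0.72 - 0.60}{0.72} \approx 0.167 \quad (16.7\%).
\end{align*}
```

Read this way, answer-level auditability is almost unchanged during the intra-cycle gap, while chain-complete recall absorbs most of the temporary structural lag.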
Takeaway: These deployment-oriented stress tests support the claim that MemLoom can maintain bounded answer serving while degrading gracefully during temporary structural lag, which is the central practical benefit of DLPS.

6. Ablation Study

Section 6 reports the final measured ablation results used to assess the architectural necessity of MemLoom’s major functional components. The analysis is based on observed performance changes after removing or weakening one module at a time under the corresponding full-model evaluation protocol. The resulting ablation values are summarized in Table 6, which serves as the primary evidence source for the module-level discussion below. To avoid redundancy, the QMSum-based steward ablation is discussed separately in Section 5.1 and is therefore not repeated in Table 6.
The removal of the event steward module causes a clear degradation in structural boundary quality on QMSum, driving the P_k error to 0.513. This regression moves the system toward a fragmented state, approaching the behavior of a standard chunk-based pipeline. The result identifies the steward as the structural foundation for downstream logical operations: without the event-boundary discipline established during ingestion and off-peak curation, higher-order reasoning lacks a stable basis for conceptual attachment. As discussed in Section 5.1, this QMSum-based steward ablation is reported separately from Table 6 because it targets boundary reconstruction rather than causal recovery or deployment latency.
Removing the topology module weakens answer-level traceability on SCDS, with AA decreasing from 0.80 to 0.68, indicating weaker recoverability of distant supporting evidence. These measured results show that even if a system retains local semantic relevance, the lack of a long-range contextual skeleton makes it more difficult to reconnect temporally distant evidence into an auditable answer path. Under this measured ablation, the observed decline should be interpreted as evidence that the topology module contributes to long-range evidence organization and answer-level traceability.
Removing the logic track causes SCDS SCR to decline from 0.72 to 0.39, while AA drops from 0.80 to 0.55. This larger degradation relative to the topology-only ablation indicates a distinction between the loss of contextual background and the loss of a structural reasoning mechanism. Whereas the topology module provides the broad contextual skeleton, the logic track is the primary module designed to bridge semantically separated but causally connected states and to prevent their collapse into structurally flattened interpretations. Once this mechanism is bypassed, the generation model frequently produces locally plausible yet globally misaligned answers. Under this measured ablation, the observed decline should be interpreted as evidence that the logic track is the primary mechanism for recovering semantically separated but causally linked states into a chain-complete reasoning path.
Removing the asynchronous off-peak stewardship mechanism forces the architecture to regress to a synchronous blocking mode, increasing client-side P95 latency from 1.67 s to 8.34 s, which corresponds to an approximately 399.4% relative increase. In architectural terms, this ablation isolates the necessity of DLPS rather than merely showing the usefulness of one optional module. Once off-peak stewardship is removed, the intended decoupling between answer serving and structural maintenance collapses, and the online path is forced to absorb the full cost of graph updating. This result therefore reinforces the claim that dual-loop asynchronous offloading is not a cosmetic optimization, but a necessary architectural condition for bounded structured-memory serving in real-time environments.
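The stated relative increase follows directly from the two reported latencies; the arithmetic is spelled out here for reference, with subscripts introduced only for this illustration.

```latex
\begin{align*}
\frac{P95_{\text{sync}} - P95_{\text{async}}}{P95_{\text{async}}}
  &= \frac{8.34 - 1.67}{1.67} = \frac{6.67}{1.67} \approx 3.994 \quad (399.4\%), \\
\frac{P95_{\text{sync}}}{P95_{\text{async}}}
  &= \frac{8.34}{1.67} \approx 4.99.
\end{align*}
```

Equivalently, the synchronous variant is roughly five times slower at the 95th percentile than the asynchronous configuration.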
Taken together, the ablation values in Table 6 show that the major components of MemLoom are functionally non-redundant. The topology module contributes to long-range evidence organization and answer-level traceability, the logic track supports chain-complete causal recovery and answer-level evidential faithfulness, and the off-peak stewardship mechanism is necessary for maintaining bounded serving latency under long-horizon interaction. Under this interpretation, the ablation study supports the view that MemLoom is not an ad hoc stack of loosely coupled modules, but a coordinated architecture in which each component addresses a distinguishable structural risk.

7. Discussion and Limitations

The principal strategy in MemLoom is grounded in latency economics, which reflects a strategic trade-off between real-time responsiveness and deep logical structuring. The evaluations suggest that the MemLoom architecture can practically balance causal traceability, structural preservation, and bounded latency in long-horizon interactions. By appropriately deferring the computationally intensive neuro-symbolic graph synthesis to the off-peak stewardship loop, DR3 is effectively mitigated, thereby bounding the online P95 tail latency to approximately 1.67 s under the bounded answer-serving setting. Notably, a substantial portion of the remaining delay is attributable to the cloud API communication overhead of the semantic router; deploying a local LLM or dedicated server in the future could further reduce this overhead. Although encapsulating raw utterances into abstract event nodes inevitably introduces some lossy compression, the sentence-level substrate implemented within bounded dual-stream retrieval can partially compensate for this effect.
Real-time conversational memory also operates under unavoidable uncertainty. In practice, ambiguity may arise from incomplete utterances, unstable references, delayed clarifications, or imperfect LLM-mediated relation inference. MemLoom does not assume that these uncertainties can be eliminated at the input level. Instead, the architecture is designed to contain uncertainty within bounded and auditable stages. First, the bounded policy vector constrains the scope of online routing and off-peak reasoning, preventing unbounded expansion of uncertain inferences. Second, contract validation acts as a rule-based validation filter over probabilistic candidate relations before publication. Third, sentence-level grounding preserves direct access to raw supporting spans, allowing error tracing even when event abstraction is lossy. Finally, immutable snapshot serving prevents partially updated or weakly validated structures from directly contaminating the live retrieval path. In this sense, the present framework prioritizes operational containment and traceability of uncertainty rather than full probabilistic uncertainty modeling.
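The idea of a rule-based contract filter over probabilistic candidate relations can be sketched as below. Everything specific here is an assumption for illustration: the relation labels are placeholders (the actual relation names are not given in this section), and the individual rules, the confidence threshold, and the data layout are hypothetical rather than MemLoom's actual contract.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Placeholder labels standing in for the four-relation closure; hypothetical.
ALLOWED_RELATIONS = frozenset({"rel_a", "rel_b", "rel_c", "rel_d"})


@dataclass(frozen=True)
class CandidateRelation:
    source_event: str
    target_event: str
    relation: str
    confidence: float                # score from the LLM-mediated inference step
    evidence_turns: Tuple[int, ...]  # pointers to supporting raw turns


def contract_violations(c: CandidateRelation, min_confidence: float = 0.6) -> List[str]:
    """Return every rule the candidate breaks; an empty list means it passes."""
    violations = []
    if c.relation not in ALLOWED_RELATIONS:
        violations.append("relation outside the allowed closure")
    if c.source_event == c.target_event:
        violations.append("self-loop")
    if not c.evidence_turns:
        violations.append("no sentence-level evidence pointer")
    if c.confidence < min_confidence:
        violations.append("confidence below threshold")
    return violations


def split_for_publication(
    candidates: Iterable[CandidateRelation], min_confidence: float = 0.6
) -> Tuple[List[CandidateRelation], List[Tuple[CandidateRelation, List[str]]]]:
    """Separate publishable relations from rejected ones, keeping the reasons."""
    accepted, rejected = [], []
    for c in candidates:
        problems = contract_violations(c, min_confidence)
        if problems:
            rejected.append((c, problems))
        else:
            accepted.append(c)
    return accepted, rejected
```

Keeping the rejection reasons alongside each rejected candidate is a deliberate choice in this sketch: it leaves an audit trail for uncertain inferences instead of silently dropping them.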
Furthermore, the necessity of these defensive boundaries is supported by the ablation study. Specifically, removing the off-peak stewardship mechanism increases exposure to DR3, while weakening the event-boundary, topology, and logic-track safeguards increases the risks of DR1 and DR2. Hence, MemLoom can be understood as a decoupled architectural framework rather than a monolithic system with a trial-and-error stack of empirical features. Nevertheless, several limitations remain for broader real-world deployment. First, part of the evaluation relies on SCDS, which is intentionally diagnostic and controlled. While this suite is valuable for isolating DR1- and DR2-related structural failures under pointer-grounded verification, it does not replace large-scale real-world evidence. Moreover, because the SCDS ontology is deliberately aligned with the compact relation closure adopted in MemLoom, some portion of the observed gain may reflect this diagnostic alignment rather than a fully general architectural advantage. We therefore do not present the four-relation closure as a universal ontology for open-world conversational causality; instead, it should be understood as a compact and validation-bounded structural substrate chosen to balance expressiveness, auditability, and maintenance tractability. Although the complementary LoCoMo and QMSum results suggest that this restricted closure remains useful beyond the synthetic suite, the precise effect of this ontology choice on more natural open-ended datasets has not yet been independently quantified and remains an important direction for future work. Second, although the major LoCoMo systems were re-evaluated under our unified local setup to improve fairness, cross-system comparison should still be interpreted with appropriate caution because absolute values may remain sensitive to implementation details, model versions, prompt design, and evaluation configuration. 
Third, control policy parameters, such as routing thresholds and token budgets, are currently managed by static heuristics without precise cost adaptation. Finally, the current MemLoom framework is exclusively tailored for text-based transcripts managed by a single centralized steward, leaving multimodal inputs and decentralized multi-agent architectures as important open directions.
MemLoom also explicitly incurs a higher storage overhead than naive window-based retrieval because it maintains event graphs, lineage registries, and sentence-level evidence simultaneously. We treat this as a deliberate memory-for-durability trade-off rather than as an incidental cost: additional storage is exchanged for more traceable long-horizon causal continuity, historically traceable state evolution, and answer-level auditability. Importantly, this overhead is not exposed uniformly on the live serving path; in addition, off-peak curation cost appears mainly as background compute demand, validation workload, snapshot-publication overhead, and storage growth during stewardship cycles rather than as per-turn tail-latency inflation. Through lifecycle-aware memory organization, active event structures remain on the online retrieval surface, whereas archived events are migrated to colder storage once they exceed their relevance horizon or memory pressure requires compaction. Likewise, lineage records serve primarily as provenance support rather than as the main high-frequency query plane, and sentence-level memory is accessed through pointer-based grounding rather than broad full-text traversal in every request. Even so, the precise long-horizon resource envelope of this design has not yet been fully profiled under long-running deployment horizons, and systematic large-scale stress testing remains an important direction for future work.

8. Future Work

The dual-loop defensive capabilities of MemLoom have been preliminarily examined. Consequently, to reduce the substantial marginal computational costs and human tuning efforts currently required, the dynamic optimization of stewardship mechanisms can be prioritized in future work. Herein, control policy vectors related to router thresholds, temporal decay parameters, and retrieval budgets can be formulated as optimizable variables within an automated optimization framework. Specifically, multi-fidelity resource allocation strategies, such as Hyperband [46] and BOHB [47], can be deployed during the offline batch evaluation phase to rapidly screen high-quality parameter configurations under limited computational budgets. Moreover, rather than serving as a direct hyperparameter optimization tool, MAML [48] can be explored to learn favorable model parameter initializations, enabling the system to warm-start and rapidly adapt to novel interaction environments. Simultaneously, for hyperparameters requiring temporal adaptation, population-based training (PBT) [49] can be evaluated to asynchronously discover dynamic scheduling policies within the offline background maintenance phase. As more comprehensive interaction datasets become available, the current restriction that these optimization procedures remain strictly outside the online inference path can be relaxed. In addition, open information extraction (Open IE) can be introduced into the off-peak stewardship, enabling the system to autonomously induce a broader and more open-ended set of event relations, thereby improving its awareness of the nuanced heterogeneity present in real-world temporal and social topologies.
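To make the multi-fidelity idea concrete, the sketch below runs successive halving, the elimination routine that Hyperband repeats across several brackets, over a small grid of policy vectors. It is not full Hyperband or BOHB, and the `Policy` fields, the toy objective, and its optimum are hypothetical stand-ins for a real offline batch evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Policy:
    """Hypothetical control policy vector."""
    router_threshold: float
    temporal_decay: float
    retrieval_budget: int


def toy_score(policy: Policy, budget: int) -> float:
    """Stand-in objective; a real evaluator would be noisier at low budgets."""
    return 1.0 - abs(policy.router_threshold - 0.5) - abs(policy.temporal_decay - 0.1)


def successive_halving(
    policies: Sequence[Policy],
    evaluate: Callable[[Policy, int], float],
    min_budget: int = 1,
    eta: int = 3,
) -> Policy:
    """Keep the top 1/eta of configurations each round while growing the budget."""
    if not policies:
        raise ValueError("at least one policy is required")
    survivors = list(policies)
    budget = min_budget
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda p: evaluate(p, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta  # survivors earn a larger evaluation budget
    return survivors[0]


GRID = [
    Policy(t, d, 8)
    for t in (0.3, 0.5, 0.7)
    for d in (0.05, 0.1, 0.2)
]
BEST = successive_halving(GRID, toy_score)
```

With nine candidates and eta = 3, the search evaluates all nine cheaply, keeps three, and then spends the larger budget only on those, which is the budget-saving behavior the paragraph attributes to multi-fidelity screening.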
Another important extension concerns event explainability in LLM-based memory systems. In the current version, MemLoom emphasizes evidence-grounded traceability and pointer-level auditability, but it does not yet provide an explicit explanation layer for why an event was formed, why a relation was validated, or how a retrieved answer depends on a particular sequence of event evolution. Future work may therefore extend the framework with edge-confidence estimation, uncertainty-aware explanation, and more interpretable visualization of event-state evolution so that event formation, relation validation, and answer grounding become not only traceable but also more directly explainable to users and developers.
In this work, we prioritize operational traceability and bounded structural governance over explicit probabilistic uncertainty modeling. Developing a fully quantified uncertainty framework for event confidence, edge reliability, and retrieval-time risk propagation remains an important direction for future work.
Prematurely introducing multimodal or multi-agent networks before sufficiently resolving textual topological ruptures and historical state fragmentation may unnecessarily increase system complexity. Hence, after establishing a more robust logical substrate in the pure-text setting, the current MemLoom framework can be extended in a more stable manner. In future work, multimodal signals and decentralized multi-agent knowledge sources can be incorporated, such that the integration of visual states and audio features can further evolve MemLoom toward multimodal event graphs for embodied AI.

9. Conclusions

For emerging embodied AI and autonomous agents, long-horizon multi-party interactions are considered critical scenarios for examining the memory boundaries of large language models (LLMs). Herein, traditional retrieval-augmented generation (RAG) systems are frequently susceptible to the design risks identified in this study, specifically DR1, DR2, and DR3. To simultaneously address these challenges, the MemLoom architecture is proposed. Through a dual-loop publish–subscribe architecture, latency-sensitive online interactions are structurally decoupled from consistency-critical off-peak stewardship. Rather than claiming to definitively resolve all structural tensions in long-horizon conversational memory, the proposed architecture is intended as a practical balance point between real-time responsiveness, structural stewardship, and traceable retrieval under bounded deployment conditions.
Through controlled evaluations and measured ablation analyses, the present study examines how MemLoom supports event-centric structuring, causal recovery, historical state traceability, and bounded online serving within the target multi-party setting. The resulting evidence suggests that curated event memory graphs and contract-bound lifecycle control can improve structural durability and answer traceability, while asynchronous stewardship helps prevent live serving latency from being dominated by graph maintenance cost. These findings should be interpreted in light of the bounded-budget and unified local evaluation protocol adopted in this study; they therefore provide architecture-consistent evidence rather than a definitive universal ranking across all memory systems. Under this scope, MemLoom may serve as a useful architectural reference for long-horizon conversational agents that require stronger temporal continuity, structural auditability, and bounded deployment behavior.

Author Contributions

Conceptualization, D.-Y.C. and S.-P.T.; methodology, D.-Y.C.; software, C.-Y.C.; validation, S.-P.T. and J.-F.W.; formal analysis, D.-Y.C.; investigation, C.-Y.C.; resources, S.-P.T.; data curation, C.-Y.C.; writing—original draft preparation, C.-Y.C.; writing—review and editing, S.-P.T.; visualization, C.-Y.C.; supervision, J.-F.W.; project administration, J.-F.W.; funding acquisition, J.-F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Correspondence and requests for materials should be addressed to tsengshihpang@czcit.edu.cn.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| OILS | Online Interaction-Loop Subscriber |
| OEFS | Online Event Formation System |
| OSLP | Off-Peak Stewardship Loop Publisher |
| DLPS | Dual-Loop Publish–Subscribe |
| NSGS | Neuro-Symbolic Graph Synthesis |
| BDSR | Bounded Dual-Stream Retrieval |

Appendix A. Experimental Environment and Model Configuration

This appendix details the fixed experimental environment and the module-level model configurations utilized throughout the study. Exposing these physical and parametric configurations is intended to establish a transparent baseline and ensure the practical reproducibility of the empirical results.

Appendix A.1. Experimental Environment

To reduce infrastructure-related variability during the latency and throughput evaluations, all local orchestration, database management, and graph traversals were conducted within a strictly controlled hardware environment. Given that the heavy reasoning workloads were offloaded to cloud-based Large Language Models (LLMs) via API calls, local GPU acceleration was not strictly required. The local configurations are detailed in Table A1. Furthermore, to ensure the fairness of the P95 latency measurements for RQ4, all API requests were executed under a stable, high-bandwidth network environment to minimize extraneous networking jitter.
Table A1. Hardware and software environment used for local system orchestration.
| Component | Specification |
|---|---|
| CPU | Intel Core i7-14900 |
| RAM | 64 GB DDR4 |
| Storage | 1 TB NVMe SSD |
| Operating System | Ubuntu 22.04 LTS |
| Python Environment | Python 3.11 |
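
As an illustration of how a P95 latency figure of the kind used for RQ4 can be derived from raw per-request timings, the following minimal sketch applies the nearest-rank percentile method. The choice of the nearest-rank estimator, the function name `percentile_nearest_rank`, and the sample values are assumptions made for illustration; the study does not state which percentile estimator was used.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-request latencies in seconds (illustrative values only).
latencies = [0.9, 1.1, 1.0, 1.3, 1.2, 1.6, 0.8, 1.4, 1.5, 2.4]
p95 = percentile_nearest_rank(latencies, 95)
print(f"P95 latency: {p95:.2f} s")
```

With only ten samples the nearest-rank P95 coincides with the maximum, which is why tail-latency estimates are normally reported over a much larger number of requests.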

Appendix A.2. Model Configuration by Module

In strict adherence to the dual-loop architecture proposed in Section 3, heterogeneous models were explicitly assigned to distinct modules based on their cognitive capacities and designated functional roles. To support reproducibility, the exact API snapshots and decoding temperatures were locked during the evaluation, as tabulated in Table A2.
Table A2. Model assignments and generation hyperparameters for core system modules.
| Module | Model Snapshot | Temp. |
|---|---|---|
| Semantic Router | gpt-4o-mini | 0.0 |
| Response Generator LLM | gpt-4o | 0.0 |
| Off-peak Steward | gpt-5.2 | 0.0 |
| Embedding Model | text-embedding-3-small | N/A |

Appendix A.3. Online Event Formation Parameters

The low-latency online interaction loop is governed by a compact set of routing coefficients and boundary constraints. Rather than being presented as universally optimal constants, these values are treated as online operational defaults selected through a coarse-to-fine empirical tuning procedure on a held-out validation subset. The objective of this tuning process was not the maximization of a single score, but the identification of a stable operating region under three concurrent criteria: semantic fidelity, temporal continuity, and bounded online latency. Accordingly, the scalar values reported in Table A3 should be interpreted as empirically selected defaults for the tested domains rather than as universally transferable optima. The held-out validation subset referenced here corresponds to an independent SCDS validation split constructed under the same spec-driven generation protocol as the final SCDS diagnostic set, but kept disjoint from the reported evaluation split. It was used only for selecting stable operational defaults for HDRA and related lifecycle-control parameters. Neither the final SCDS diagnostic split nor the external QMSum and LoCoMo benchmarks were used for parameter tuning. More specifically, the SCDS validation split contained 100 diagnostic queries, whereas the final reported SCDS evaluation split contained 353 queries.
The defaults in Table A3 were chosen to balance three competing demands in the online loop: preserving semantic attachment quality, avoiding excessive temporal drift, and maintaining stable low-latency routing behavior. In particular, the settings of $\alpha$ and $\beta$ regulate the relative emphasis on semantic similarity versus temporal continuity, whereas $\lambda$ and $\gamma$ shape the aggressiveness of temporal penalization and local continuity reward. The transition threshold $\tau_{new}$ was then selected to avoid both premature event proliferation and over-attachment to stale event centroids.
Table A3. Operational defaults for HDRA-based online event formation.
| Component | Symbol | Value | Role in Online Routing |
|---|---|---|---|
| Semantic weight | $\alpha$ | 0.9 | Prioritizes semantic alignment between incoming unit and event centroid. |
| Temporal decay weight | $\beta$ | 0.1 | Preserves short-horizon continuity under dynamic turn arrivals. |
| Temporal decay rate | $\lambda$ | 20 | Controls the decay steepness in the temporal penalty term. |
| Short-window gain | $\gamma$ | 0.1 | Rewards immediate-local continuity via binary gain $B_{sw}$. |
| Event transition threshold | $\tau_{new}$ | 0.30 | Triggers NEW_EVENT when $\max(S_j) < \tau_{new}$. |
| Local context horizon | $H_{t-5:t-1}$ | 5 turns | Bounds router context to preserve low latency and deterministic behavior. |
| Sigmoid non-linearity | $g(s)$ | $\frac{1}{1+\exp(-10(s-0.5))}$ | Suppresses noisy similarities while retaining mid-range sensitivity. |
Once selected on the validation split, all routing coefficients and related operational defaults were frozen and remained unchanged throughout the final reported experiments. In implementation, online writes remain strictly append-only at the evidence level, while boundary decisions are conservative by design. This separation keeps fast-path event attribution lightweight and auditable at the turn level, while deferring heavyweight graph-level reasoning to off-peak stewardship.
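
To make the interplay of the Table A3 defaults more concrete, the following minimal sketch combines them into a single attachment score and applies the NEW_EVENT rule. The additive form of the score, the exponential decay $\exp(-\Delta t/\lambda)$, and all function and argument names are assumptions made for illustration; the exact HDRA formulation is the one given in Section 3 and may differ from this sketch. Only the numeric defaults, the sigmoid $g(s)$, and the threshold rule are taken from Table A3.

```python
import math

# Operational defaults from Table A3.
ALPHA, BETA, LAMBDA, GAMMA, TAU_NEW = 0.9, 0.1, 20.0, 0.1, 0.30


def g(s):
    """Sigmoid non-linearity from Table A3, centered at s = 0.5 with slope 10."""
    return 1.0 / (1.0 + math.exp(-10.0 * (s - 0.5)))


def attachment_score(sim, turns_since_update, in_short_window):
    """Assumed additive score S_j for attaching an incoming unit to event j."""
    semantic = ALPHA * g(sim)                                 # semantic alignment
    temporal = BETA * math.exp(-turns_since_update / LAMBDA)  # assumed decay form
    gain = GAMMA * (1.0 if in_short_window else 0.0)          # binary gain B_sw
    return semantic + temporal + gain


def route(candidates):
    """candidates maps event_id -> (sim, turns_since_update, in_short_window)."""
    if not candidates:
        return "NEW_EVENT"
    scores = {eid: attachment_score(*feats) for eid, feats in candidates.items()}
    best = max(scores, key=scores.get)
    # NEW_EVENT is triggered when max(S_j) < tau_new, as stated in Table A3.
    return best if scores[best] >= TAU_NEW else "NEW_EVENT"


print(route({"e1": (0.9, 2, True), "e2": (0.4, 30, False)}))
print(route({"e1": (0.1, 50, False)}))
```

Under these assumptions, a unit that is semantically close to a recently updated event attaches to it, whereas a weakly similar unit arriving long after the last update opens a new event.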

Appendix A.4. Stewardship Policy and Lifecycle Thresholds

The off-peak stewardship loop is governed by a policy vector that controls topology growth, temporal repair scope, event promotion behavior, and verification effort under bounded asynchronous computation. Unlike the online routing coefficients in Appendix A.3, these elements are treated as policy-level governance controls rather than latency-tuned operational defaults. Their role is to constrain structural evolution into a reproducible and validation-bounded regime, not to maximize a single downstream benchmark score.
Table A4. NSGS stewardship policy and lifecycle control contract.
| Policy Element | Contracted Function |
|---|---|
| $\theta_s$ and $k_{\mathrm{sem}}$ | Semantic edge filtering margin and ANN breadth for homophilic topic construction. |
| $k_t$, $\Delta_d$, and $\delta_t$ | Temporal neighborhood and repair window bounds for physically adjacent merge candidates. |
| $B_u$ | Per-cluster LLM reasoning ceiling during incremental structural synthesis. |
| $B_{\mathrm{bridge}}$ | Cross-cluster bridge-reasoning budget to recover heterophilic causal/temporal links. |
| $K_{\mathrm{guard}}$ and $V_{\mathrm{ver}}$ | Verification budget and versioned rule set for schema, evidence, and relation-type validation. |
| $w_1$, $w_2$, and $w_3$ | Governance coefficients balancing evidence accumulation, participant diversity, and token mass within the maturity function $M(E_j)$. |
| $M(E_j) \geq \tau_{\mathrm{maturity}}$ | Promotion rule from provisional to active state under finite-state lifecycle control. |
| $\Phi_{\mathrm{close}}(E_j, C)$ and archive policy | Closure and archival transition predicates for stabilizing long-horizon memory footprint. |
In particular, the maturity rule $M(E_j) \geq \tau_{\mathrm{maturity}}$ is governed by a lifecycle threshold that determines when a provisional event is considered sufficiently stable for promotion into the active graph. The associated coefficients $w_1$, $w_2$, and $w_3$ regulate the relative influence of evidence accumulation, participant diversity, and token mass in that decision. These parameters are not intended to express universal optimality; rather, they function as governance coefficients that balance two opposing risks: premature promotion of noisy or weakly supported fragments, and excessive lag in making emerging events available to the structured memory graph.
Together, these controls provide a deterministic governance layer over probabilistic LLM outputs. They define when provisional events may be promoted, how far timeline repair may extend, how much cross-cluster reasoning may be invoked, and which candidate relations may be validated into committed graph structure. Under this contract, stewardship remains reproducible, bounded, and provenance-preserving across maintenance cycles.
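
The promotion rule can be illustrated with a minimal sketch of a maturity check under finite-state lifecycle control. The weighted-sum form of $M(E_j)$, the normalization caps, and every numeric value below are placeholder assumptions, since this appendix names the coefficients but does not report their values; the class and function names are likewise hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PROVISIONAL = "provisional"
    ACTIVE = "active"
    CLOSED = "closed"
    ARCHIVED = "archived"


@dataclass
class Event:
    evidence_count: int      # accumulated evidence units
    participant_count: int   # distinct participants
    token_mass: int          # total tokens attached to the event
    state: State = State.PROVISIONAL


# Placeholder governance coefficients and threshold (not values from the study).
W1, W2, W3 = 0.5, 0.3, 0.2
TAU_MATURITY = 0.6
# Placeholder caps that scale each raw feature into [0, 1].
CAPS = (10, 4, 2000)


def maturity(event):
    """Assumed weighted-sum form of M(E_j) over capped, normalized features."""
    feats = (event.evidence_count, event.participant_count, event.token_mass)
    e, p, t = (min(f / cap, 1.0) for f, cap in zip(feats, CAPS))
    return W1 * e + W2 * p + W3 * t


def maybe_promote(event):
    """Promote a provisional event to active once M(E_j) >= tau_maturity."""
    if event.state is State.PROVISIONAL and maturity(event) >= TAU_MATURITY:
        event.state = State.ACTIVE
    return event.state


print(maybe_promote(Event(evidence_count=8, participant_count=2, token_mass=1000)))
print(maybe_promote(Event(evidence_count=1, participant_count=1, token_mass=100)))
```

Because the check is a pure function of counted features, the same event state always yields the same promotion decision, which is the deterministic property the contract is meant to provide over probabilistic LLM outputs.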

Appendix A.5. Bounded-Budget Evaluation Contract

To establish a fairer deployment-oriented comparison and to reflect practical latency-sensitive serving constraints, all reported experiments adopt a bounded-budget contract. This contract is designed as a deployment simulation protocol rather than as an intrinsic capability ceiling of contemporary LLMs. Its purpose is to prevent systems from gaining artificial advantage merely by expanding context size, retrieval breadth, or answer length, thereby allowing architectural differences in structure stewardship and retrieval grounding to be interpreted under a more controlled serving assumption.
Table A5. Unified bounded-budget contract applied in evaluation.
| Budget Axis | Constraint | Evaluation Purpose |
|---|---|---|
| Context budget ($B_{ctx}$) | 8000 tokens | Enforces fixed prompt context under latency-sensitive deployment assumptions. |
| Retrieval budget ($B_{ret}$) | Fixed Top-K seeds and payload cap | Prevents systems from gaining unfair advantage via expanded candidate pools. |
| Generation budget ($B_{gen}$) | Max decoded tokens (512) | Normalizes final answer length and generation-time overhead. |
| Upper-bound exception | Full-Context exempt | Retained only as an out-of-budget theoretical reference ceiling. |
Under this protocol, all non-upper-bound systems share identical truncation, de-duplication, and answer-generation constraints. Therefore, observed gains are interpreted as architectural improvements in structure stewardship and retrieval grounding rather than budget-induced artifacts. Because the major systems in LoCoMo were re-evaluated under a shared local setup, the resulting comparisons provide a more consistent architecture-level reference under the same bounded-budget protocol. Nevertheless, the reported differences should still be interpreted as evidence of design-sensitive comparison patterns rather than as a claim of universal leaderboard superiority.
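
A minimal sketch of how such a shared contract might be enforced before answer generation is given below. The whitespace token count stands in for a real tokenizer, the value of Top-K is left as a parameter because it is not reported here, and the function names are hypothetical; only the 8000-token context budget and the 512-token generation cap come from Table A5.

```python
CONTEXT_BUDGET = 8000   # B_ctx from Table A5
MAX_NEW_TOKENS = 512    # B_gen from Table A5, passed to the generator unchanged


def count_tokens(text):
    """Whitespace proxy for token counting (assumption; a real tokenizer differs)."""
    return len(text.split())


def pack_context(ranked_passages, top_k, budget=CONTEXT_BUDGET):
    """Apply Top-K selection, de-duplication, and truncation under a token budget."""
    packed, seen, used = [], set(), 0
    for passage in ranked_passages[:top_k]:
        key = " ".join(passage.lower().split())  # normalized text for de-duplication
        if key in seen:
            continue
        seen.add(key)
        n = count_tokens(passage)
        if used + n > budget:
            remaining = budget - used
            if remaining > 0:
                # Truncate the last passage so the budget is met exactly.
                packed.append(" ".join(passage.split()[:remaining]))
                used = budget
            break
        packed.append(passage)
        used += n
    return packed, used


passages = ["a b c", "A  b c", "d e f g"]
print(pack_context(passages, top_k=3, budget=5))
```

Routing every non-upper-bound system through the same packing function is one way to keep truncation and de-duplication identical across systems, so that differences in the packed context reflect retrieval quality rather than budget handling.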

Appendix B. Prompt Design Contracts and Engineering Principles

For implementation transparency, this appendix outlines the core engineering principles and output contracts of the prompts utilized in the MemLoom architecture. MemLoom formulates prompts as bounded architectural contracts rather than open-ended instructions. Each prompt is associated with a specific functional role, a constrained contextual window, and a deterministic JSON-formatted structure. This design aims to mitigate hallucination risks, preserve structural provenance, and ensure auditable integration across online interaction, off-peak stewardship, and evaluation-time verification.

Appendix B.1. Key Design Principles

Prompt templates across the online interaction loop, the off-peak stewardship loop, and the evaluation pipeline adhere to three core architectural principles.
  • Contextual Bounding and Auditability. Prompts are restricted to reason over explicitly provided evidence windows, such as short-horizon contexts for online routing, bounded event-pair or cluster-local evidence packs for off-peak relation synthesis, and pointer-grounded evidence bundles for diagnostic verification. This restriction ensures that downstream reasoning remains traceable to explicit citations and reference structures.
  • Conservative Arbitration and Anti-Flattening. To mitigate DR1 and DR2, prompts are instructed to prefer conservative operations when semantic evidence is sparse. Historically distinct states and causally relevant evidence are explicitly preserved as traceable structural boundaries rather than compressed into a single flattened interpretation.
  • Deterministic Output Contracts. System-critical prompts enforce strict JSON-schema compliance. This deterministic parsing prevents cascading failures caused by unstructured text generation and enables seamless integration with the neuro-symbolic logic track and the verification pipeline.
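
The deterministic output contract can be illustrated by a strict parser that rejects any model response deviating from the expected structure. The field names in `ROUTER_SCHEMA` and the function name `parse_contract` are hypothetical, since the actual schemas are not reproduced in this appendix; the sketch uses only the Python standard library rather than a full JSON Schema validator.

```python
import json

# Hypothetical router output fields (the real schema is not listed in this appendix).
ROUTER_SCHEMA = {"intent": str, "speaker_id": str, "state": str}


def parse_contract(raw, schema):
    """Parse raw model output and enforce exact keys and value types."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("top-level value must be a JSON object")
    missing = schema.keys() - obj.keys()
    extra = obj.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(extra)}")
    for key, expected in schema.items():
        if not isinstance(obj[key], expected):
            raise ValueError(f"field {key!r} must be of type {expected.__name__}")
    return obj


reply = '{"intent": "schedule_update", "speaker_id": "u2", "state": "pending"}'
print(parse_contract(reply, ROUTER_SCHEMA))
```

Rejecting malformed output at this boundary, rather than attempting to repair it downstream, keeps a single bad generation from propagating into the graph or the verification pipeline.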

Appendix B.2. Prompt Inventory and Module Contracts

To maintain behavioral stability while keeping the appendix concise, only the three most system-critical prompt contracts are summarized below.
  • P1 (Semantic Router). This prompt serves as a single-pass rewrite–gate–route module for unstructured online utterances. It operates under strict latency bounds, is restricted to a local context window ($H_{t-5:t-1}$), and deterministically outputs normalized intents and speaker-grounded states in JSON format.
  • P2 (Off-Peak Relation Synthesis). This prompt serves as the relation synthesis layer within the broader off-peak stewardship loop. It is compute-heavy but scope-bounded, and it operates asynchronously over short-listed event pairs or cluster-local evidence packs to propose evidence-grounded candidate typed relations under a closed relation set. Final validation, canonicalization, and snapshot publication remain downstream non-prompt processes.
  • P3 (LLM-as-a-Judge). This prompt functions as a verification-bound, task-conditional diagnostic verifier rather than a preference-based evaluator. Grounded strictly in pointer-level gold evidence, causal annotations, retrieved evidence bundles, and sample metadata, it activates only the criterion checks required by the specific diagnostic task type. Its outputs are deterministic criterion-level verification fields, such as PASS, FAIL, or NA, for answer correctness, causal faithfulness, attribution auditability (AA), schema compliance, and refusal or privacy compliance.
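
The task-conditional behavior of P3 can be sketched as a small aggregation step over criterion-level verdicts. The criterion identifiers mirror the checks named above, but the task-type names and the mapping from task type to required criteria are hypothetical and serve only to show how PASS, FAIL, and NA fields might be combined.

```python
CRITERIA = (
    "answer_correctness",
    "causal_faithfulness",
    "attribution_auditability",
    "schema_compliance",
    "refusal_privacy_compliance",
)
VERDICTS = {"PASS", "FAIL", "NA"}

# Hypothetical mapping from diagnostic task type to the checks it activates.
REQUIRED_BY_TASK = {
    "single_hop": ("answer_correctness", "schema_compliance"),
    "causal_chain": (
        "answer_correctness",
        "causal_faithfulness",
        "attribution_auditability",
        "schema_compliance",
    ),
}


def sample_passes(task_type, verdicts):
    """Return True only if every criterion required by the task type is PASS."""
    unknown = set(verdicts) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    invalid = {v for v in verdicts.values() if v not in VERDICTS}
    if invalid:
        raise ValueError(f"invalid verdict values: {sorted(invalid)}")
    # A missing or NA verdict on a required criterion counts as not passed.
    return all(verdicts.get(c) == "PASS" for c in REQUIRED_BY_TASK[task_type])


judged = {
    "answer_correctness": "PASS",
    "causal_faithfulness": "NA",
    "schema_compliance": "PASS",
}
print(sample_passes("single_hop", judged))
print(sample_passes("causal_chain", judged))
```

In this sketch the same verdict record passes as a single-hop sample but fails as a causal-chain sample, because the causal faithfulness and auditability checks are only activated for the latter.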

Appendix C. SCDS Spec-Driven Generation Protocol

The Synthetic Causal Diagnostic Suite (SCDS) is a controlled evaluation tool designed to investigate RQ3 by isolating DR1 and DR2. To ensure reproducibility and mitigate unconstrained generation drift, SCDS bypasses traditional one-shot prompting in favor of a spec-driven, multi-stage synthesis pipeline [33,34].
The pipeline operates through five sequential stages:
  • Stage 1: Blueprint Specification. To prevent unconstrained semantic drift and hallucination, the process begins by defining a rigorous structured blueprint. This specification explicitly fixes the participant roster, latent event chains, target causal dependencies, intended difficulty levels, and the diagnostic question families.
  • Stage 2: Dialogue Realization. The blueprint is converted into a concrete multi-turn conversation. Each utterance is indexed as a stable turn unit and assigned a canonical pointer. This functions as an end-to-end audit path, explicitly connecting the generated output back to its source interaction within the model’s contextual window [34].
  • Stage 3: Diagnostic Question Instantiation. Questions are instantiated directly from the blueprints and organized as causal diagnostic tasks to assess DR1 and DR2.
    - Causal Diagnostic Tasks: These tasks evaluate single-hop premise retrieval, intermediate reasoning chain reconstruction, and attribution-grounded causal verification under sparse dependency conditions.
  • Stage 4: Pointer-Level Gold-Evidence & Manual Quality Assurance. Recent studies demonstrate that LLM-as-a-judge evaluations suffer from systematic vulnerabilities, including judgment bias [42] and preference leakage [43]. To mitigate these risks and ensure dataset reliability, every sample is subjected to pointer-level gold-evidence grounding, followed by a human-in-the-loop Quality Assurance (QA) step. Human reviewers manually inspect the generated samples to verify logical consistency and explicitly filter out factual or structural errors. This pragmatic manual validation ensures that the gold answers remain semantically plausible and causally well-grounded without relying solely on automated verification pipelines.
  • Stage 5: Verification-Centric Curation. Scaling synthetic data introduces significant risks of model collapse [38], and mitigating these risks requires rigorous curation mechanisms [39]. To build upon the manual QA foundation, SCDS enforces automated verification-centric quality gates, which include schema validation, multi-hop evidence resolution, and checks for causal faithfulness and attribution auditability [35,36]. Furthermore, because paraphrase variations frequently bypass standard heuristic deduplication [41], strict paraphrase-aware filtering, near-duplicate detection, and split isolation [40] are employed to prevent contamination and data leakage.
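
As a rough illustration of a split-isolation gate in Stage 5, the sketch below flags validation queries that are near-duplicates of evaluation queries. It uses Jaccard similarity over word trigrams, which is a simple lexical stand-in and not the paraphrase-aware filtering used for SCDS; the threshold of 0.8 and all function names are assumptions.

```python
def shingles(text, n=3):
    """Return the set of lowercase word n-grams of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leaked_pairs(validation, evaluation, threshold=0.8):
    """List (validation, evaluation) query pairs at or above the threshold."""
    eval_shingles = [(q, shingles(q)) for q in evaluation]
    leaks = []
    for v in validation:
        sv = shingles(v)
        for q, sq in eval_shingles:
            if jaccard(sv, sq) >= threshold:
                leaks.append((v, q))
    return leaks


validation = ["why did alice delay the launch"]
evaluation = ["Why did Alice delay the launch", "who approved the budget"]
print(leaked_pairs(validation, evaluation))
```

A lexical check of this kind catches case and whitespace variants but would miss true paraphrases, which is the gap that paraphrase-aware filtering is intended to close.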
For tuning transparency, SCDS was partitioned into an independent validation split and a final diagnostic evaluation split. The validation split contained 100 diagnostic queries and was used only for parameter selection and sanity-checking of operational defaults, whereas all reported SCDS results were produced on the disjoint final evaluation split of 353 queries. No validation queries were reused in the final reported SCDS results.
Finally, the SCDS suite operates as a rigorously managed diagnostic instrument. Aligning with widely adopted dataset documentation and transparency frameworks [29,32], it is provisioned with comprehensive, reproducibility-oriented metadata. This encompasses scenario specification contracts, annotation conventions, validation rules, and explicit version-precedence records. The sample schema is organized at four levels:
  • Scenario level: A stable scenario_id, an explicit participant set, and a chronologically ordered list of turns.
  • Turn level: A canonical pointer, formatted as scenario_id:tXX, alongside the speaker_id, raw text, and turn order.
  • Question level: sample_id, viewer_id, query_text, gold_answer, task_type, gold_evidence_keys, and metadata.
  • Causal annotation level: gold_causal_structure, specifying the causal dependency type, local identifier, attribution target, and supporting_evidence_keys.
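
The schema levels listed above can be mirrored in a few record types together with a pointer-resolution check. The field names follow the list where it names them; the type annotations, the reading of `tXX` as a zero-padded index of at least two digits, and the helper `unresolved_evidence` are assumptions, and the causal annotation level is omitted for brevity.

```python
import re
from dataclasses import dataclass, field

# Assumes pointers of the form scenario_id:tXX with XX a zero-padded turn index.
POINTER_RE = re.compile(r"^(?P<scenario>[^:]+):t(?P<idx>\d{2,})$")


@dataclass(frozen=True)
class Turn:
    pointer: str
    speaker_id: str
    text: str
    order: int


@dataclass
class Scenario:
    scenario_id: str
    participants: set
    turns: list  # chronologically ordered Turn records


@dataclass
class Question:
    sample_id: str
    viewer_id: str
    query_text: str
    gold_answer: str
    task_type: str
    gold_evidence_keys: list
    metadata: dict = field(default_factory=dict)


def unresolved_evidence(scenario, question):
    """Return gold evidence keys that do not resolve to a turn in the scenario."""
    known = set()
    for turn in scenario.turns:
        match = POINTER_RE.match(turn.pointer)
        if match is None or match.group("scenario") != scenario.scenario_id:
            raise ValueError(f"malformed pointer: {turn.pointer}")
        known.add(turn.pointer)
    return [key for key in question.gold_evidence_keys if key not in known]


scenario = Scenario(
    scenario_id="s1",
    participants={"alice", "bob"},
    turns=[
        Turn("s1:t01", "alice", "The vendor slipped by a week.", 1),
        Turn("s1:t02", "bob", "Then we move the launch.", 2),
    ],
)
question = Question("q1", "alice", "Why did the launch move?", "Vendor delay.",
                    "causal_chain", ["s1:t01", "s1:t09"])
print(unresolved_evidence(scenario, question))
```

A check of this kind corresponds to the multi-hop evidence resolution gate: a sample whose gold evidence keys do not all resolve to existing turns would be rejected before it reaches the evaluation split.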

References

  1. Song, X.; Chen, W.; Liu, Y.; Chen, W.; Li, G.; Lin, L. Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; IEEE: New York, NY, USA, 2025; pp. 12078–12088.
  2. Addlesee, A.; Cherakara, N.; Nelson, N.; Hernandez Garcia, D.; Gunson, N.; Sieińska, W.; Dondrup, C.; Lemon, O. Multi-party Multimodal Conversations Between Patients, Their Companions, and a Social Robot in a Hospital Memory Clinic. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, St. Julians, Malta, 17–22 March 2024; Aletras, N., De Clercq, O., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 62–70.
  3. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474.
  4. Maharana, A.; Lee, D.H.; Tulyakov, S.; Bansal, M.; Barbieri, F.; Fang, Y. Evaluating Very Long-Term Conversational Memory of LLM Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 13851–13870.
  5. Jin, B.; Yoon, J.; Han, J.; Arik, S.O. Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG. arXiv 2024, arXiv:2410.05983.
  6. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Trans. Assoc. Comput. Linguist. 2024, 12, 157–173.
  7. Penzo, N.; Sajedinia, M.; Lepri, B.; Tonelli, S.; Guerini, M. Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11210–11233.
  8. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv 2025, arXiv:2404.16130.
  9. Wang, N.; Han, X.; Singh, J.; Ma, J.; Chaudhary, V. CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 22680–22693.
  10. Liu, H.; Wang, Z.; Chen, X.; Li, Z.; Xiong, F.; Yu, Q.; Zhang, W. HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 1897–1913.
  11. Pan, Z.; Wu, Q.; Jiang, H.; Luo, X.; Cheng, H.; Li, D.; Yang, Y.; Lin, C.Y.; Zhao, H.V.; Qiu, L.; et al. SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  12. Tan, Z.; Yan, J.; Hsu, I.H.; Han, R.; Wang, Z.; Le, L.; Song, Y.; Chen, Y.; Palangi, H.; Lee, G.; et al. In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 8416–8439.
  13. Chhikara, P.; Khant, D.; Aryan, S.; Singh, T.; Yadav, D. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv 2025, arXiv:2504.19413.
  14. Ong, K.T.i.; Kim, N.; Gwak, M.; Chae, H.; Kwon, T.; Jo, Y.; Hwang, S.w.; Lee, D.; Yeo, J. Towards Lifelong Dialogue Agents via Timeline-based Memory Management. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Albuquerque, New Mexico, 29 April–4 May 2025; Chiruzzo, L., Ritter, A., Wang, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 8631–8661.
  15. Chen, Z.; Shen, W.; Huang, J.; Shao, L. Joint Enhancement of Relational Reasoning for Long-Context LLMs. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, 4–9 November 2025; Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 8706–8720.
  16. Wang, Y.; Pan, Y.; Su, Z.; Deng, Y.; Zhao, Q.; Du, L.; Luan, T.H.; Kang, J.; Niyato, D.T. Large Model-Based Agents: State-of-the-Art, Cooperation Paradigms, Security and Privacy, and Future Trends. IEEE Commun. Surv. Tutor. 2024, 28, 1906–1949.
  17. Wei, Q.; Ning, H.; Han, C.; Ding, J. A query-aware multi-path knowledge graph fusion approach for enhancing retrieval-augmented generation in large language models. Expert Syst. Appl. 2026, 316, 131932.
  18. Zhang, Y.; Yuan, W.; Jiang, Z. Bridging Intuitive Associations and Deliberate Recall: Empowering LLM Personal Assistant with Graph-Structured Long-term Memory. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 17533–17547.
  19. Newman, M.E.J. Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 2006, 103, 8577–8582.
  20. Holme, P.; Saramäki, J. Temporal networks. Phys. Rep. 2012, 519, 97–125.
  21. Luan, S.; Hua, C.; Lu, Q.; Ma, L.; Wu, L.; Wang, X.; Xu, M.; Chang, X.W.; Precup, D.; Ying, R.; et al. The Heterophilic Graph Learning Handbook: Benchmarks, Models, Theoretical Analysis, Applications and Challenges. arXiv 2024, arXiv:2407.09618.
  22. Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; Zhang, Y. A-MEM: Agentic Memory for LLM Agents. arXiv 2025, arXiv:2502.12110.
  23. Xu, N.; Zhang, H.; Chen, J. CEO: Corpus-based Open-Domain Event Ontology Induction. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, 17–22 March 2024; Graham, Y., Purver, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 946–964.
  24. Li, S.; Zhao, R.; Li, M.; Ji, H.; Callison-Burch, C.; Han, J. Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 5677–5697.
  25. Min, Q.; Guo, Q.; Hu, X.; Huang, S.; Zhang, Z.; Zhang, Y. Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2985–3002.
  26. Kang, J.; Ji, M.; Zhao, Z.; Bai, T. Memory OS of AI Agent. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 25961–25970.
  27. Zhou, Y.; Guo, X.; Bayar, B.; Sengamedu, S.H. Amory: Building Coherent Narrative-Driven Agent Memory through Agentic Reasoning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Rabat, Morocco, 24–29 March 2026; Demberg, V., Inui, K., Marquez, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2026; pp. 3926–3938.
  28. Zhong, M.; Yin, D.; Yu, T.; Zaidi, A.; Mutuma, M.; Jha, R.; Awadallah, A.H.; Celikyilmaz, A.; Liu, Y.; Qiu, X.; et al. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 5905–5921.
  29. Gebru, T.; Morgenstern, J.; Vecchione, B.; Vaughan, J.W.; Wallach, H.; Daumé, H., III; Crawford, K. Datasheets for datasets. Commun. ACM 2021, 64, 86–92.
  30. Bender, E.M.; Friedman, B. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. Trans. Assoc. Comput. Linguist. 2018, 6, 587–604.
  31. Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I.D.; Gebru, T. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 220–229.
  32. Pushkarna, M.; Zaldivar, A.; Kjartansson, O. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 1776–1826.
  33. Long, L.; Wang, R.; Xiao, R.; Zhao, J.; Ding, X.; Chen, G.; Wang, H. On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11065–11082.
  34. Patel, A.; Raffel, C.; Callison-Burch, C. DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 3781–3799.
  35. Huang, Y.; Wu, S.; Gao, C.; Chen, D.; Zhang, Q.; Wan, Y.; Zhou, T.; Xiao, C.; Gao, J.; Sun, L.; et al. DataGen: Unified Synthetic Dataset Generation via Large Language Models. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  36. Prabhakar, A.; Liu, Z.; Zhu, M.; Zhang, J.; Awalgaonkar, T.; Wang, S.; Liu, Z.; Chen, H.; Hoang, T.; Niebles, J.C.; et al. APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay. arXiv 2025, arXiv:2504.03601.
  37. Huang, X.; Shen, J.; Huang, S.; Cheng, S.; Wang, X.; Qu, Y. TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 2704–2726.
  38. Dohmatob, E.; Feng, Y.; Subramonian, A.; Kempe, J. Strong Model Collapse. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  39. Feng, Y.; Dohmatob, E.; Yang, P.; Charton, F.; Kempe, J. Beyond Model Collapse: Scaling up with Synthesized Data Requires Verification. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  40. Sainz, O.; García-Ferrero, I.; Jacovi, A.; Ander Campos, J.; Elazar, Y.; Agirre, E.; Goldberg, Y.; Chen, W.L.; Chim, J.; Choshen, L.; et al. Data Contamination Report from the 2024 CONDA Shared Task. In Proceedings of the 1stWorkshop on Data Contamination (CONDA), Bangkok, Thailand, 16 August 2024; Sainz, O., García Ferrero, I., Agirre, E., Ander Campos, J., Jacovi, A., Elazar, Y., Goldberg, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 41–56. [Google Scholar] [CrossRef]
  41. Matton, A.; Sherborne, T.; Aumiller, D.; Tommasone, E.; Alizadeh, M.; He, J.; Ma, R.; Voisin, M.; Gilsenan-McMahon, E.; Gallé, M. On Leakage of Code Generation Evaluation Datasets. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 13215–13223. [Google Scholar] [CrossRef]
  42. Chen, G.H.; Chen, S.; Liu, Z.; Jiang, F.; Wang, B. Humans or LLMs as the Judge? A Study on Judgement Bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 8301–8327. [Google Scholar] [CrossRef]
  43. Li, D.; Sun, R.; Huang, Y.; Zhong, M.; Jiang, B.; Han, J.; Zhang, X.; Wang, W.; Liu, H. Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv 2026, arXiv:2502.01534. [Google Scholar] [CrossRef]
  44. Zep AI. Zep Documentation: Building Memory for AI Assistants. 2026. Available online: https://help.getzep.com/ (accessed on 8 March 2026).
  45. LangChain. LangMem Documentation. 2026. Available online: https://langchain-ai.github.io/langmem/ (accessed on 8 March 2026).
  46. Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. J. Mach. Learn. Res. 2018, 18, 1–52. [Google Scholar]
  47. Falkner, S.; Klein, A.; Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July, 2018; PMLR: Cambridge, MA, USA, 2018; pp. 1437–1446. [Google Scholar]
  48. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11, August 2017; PMLR: Cambridge, MA, USA, 2017; pp. 1126–1135. [Google Scholar]
  49. Jaderberg, M.; Dalibard, V.; Osindero, S.; Czarnecki, W.M.; Donahue, J.; Razavi, A.; Vinyals, O.; Green, T.; Dunning, I.; Simonyan, K.; et al. Population Based Training of Neural Networks. arXiv 2017, arXiv:1711.09846. [Google Scholar] [CrossRef]
Figure 1. High-level dual-loop view of MemLoom. The upper loop represents the latency-sensitive online interaction path, while the lower loop represents the off-peak stewardship path for asynchronous structural curation. The curated event memory graph is exposed to the online loop through atomic publication.
Figure 2. System architecture of MemLoom. The figure highlights three main paths: the online interaction loop for bounded answer serving, the online event formation path for lightweight event writing, and the off-peak stewardship loop for structural curation and snapshot publication.
Figure 3. Schematic overview of the online incremental event formation system (OEFS).
Figure 4. Contract-bound event lifecycle of MemLoom.
Figure 5. Off-peak stewardship pipeline.
Figure 6. Conceptual diagram of the dual-track topological decoupling and bridge reasoning mechanism.
Figure 7. Intra-cycle Robustness on SCDS: Freshness Failsafe Verification.
Table 1. Methodological comparison of representative paradigms against the requirements of long-horizon multi-party conversational memory. The labels indicate whether each paradigm explicitly supports the requirement as a primary architectural objective, rather than whether the capability is completely absent.
| Method Category | Representative Studies | Topology Aware | History Tracking | Async Curation & Serving | Event Versioning |
|---|---|---|---|---|---|
| Long-context/flat accumulation | LoCoMo benchmark [4], long-context prompting | Limited | Limited | No | No |
| Granularity-aware/reflective memory | SeCom [11], RMM [12] | Limited | Limited | No | No |
| Compact online memory/timeline memory | Mem0 [13], THEANINE [14], A-Mem [22] | Partial | Partial | No | Limited |
| Graph-augmented retrieval/graph memory | GraphRAG [8], CausalRAG [9], Associa [18], and query-aware KG fusion [17] | Yes | Partial | Limited | Limited |
| Event/schema/system-level memory | CEO [23], hierarchical event schema induction [24], synergetic event understanding [25], MemoryOS [26], and Amory [27] | Partial | Partial | Partial | Partial |
| MemLoom (Ours) | Dual-loop stewardship with an event-versioned memory graph | Yes | Yes | Yes | Yes |
Table 2. Segmentation Quality and Boundary Alignment on QMSum.
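| System | $P_k$ (Mean) | WindowDiff | Frag% | Over-Merge% |
|---|---|---|---|---|
| Chunk-based | 0.604 | 0.950 | 34.2% | 5.1% |
| GraphRAG | 0.439 | 0.454 | 8.5% | 28.4% |
| Mem0 | 0.587 | 0.721 | 31.0% | 4.2% |
| MemLoom (Ours) | 0.375 | 0.395 | 12.1% | 8.2% |
| w/o Steward | 0.513 | 0.462 | 28.0% | 6.5% |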
Note: Lower scores are better for $P_k$, WindowDiff, Frag%, and Over-Merge%. For non-segmentation-native baselines, such as Mem0 and GraphRAG, $P_k$ and WindowDiff are computed after a fixed boundary projection protocol that maps native memory units back to a normalized turn-level segmentation sequence.
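For readers unfamiliar with the two window-based segmentation metrics reported in Table 2, the following minimal sketch shows how they are commonly computed over a turn-level sequence. It assumes that each turn carries a segment label and that the window size `k` is supplied by the caller (a common convention is half the mean reference segment length). The function names and the toy sequences are illustrative assumptions, and the boundary projection protocol used for Mem0 and GraphRAG is not reproduced here.

```python
from typing import Sequence


def _check(ref: Sequence[int], hyp: Sequence[int], k: int) -> None:
    """Validate that both label sequences align and the window fits."""
    if len(ref) != len(hyp):
        raise ValueError("reference and hypothesis must have equal length")
    if not 0 < k < len(ref):
        raise ValueError("window size k must satisfy 0 < k < number of turns")


def boundaries_between(labels: Sequence[int], i: int, j: int) -> int:
    """Count segment boundaries between turn i and turn j (i < j)."""
    return sum(labels[t] != labels[t + 1] for t in range(i, j))


def p_k(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """Share of windows where ref and hyp disagree on 'same segment or not'."""
    _check(ref, hyp, k)
    windows = range(len(ref) - k)
    errors = sum(
        (boundaries_between(ref, i, i + k) == 0)
        != (boundaries_between(hyp, i, i + k) == 0)
        for i in windows
    )
    return errors / len(windows)


def window_diff(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """Share of windows where ref and hyp contain different boundary counts."""
    _check(ref, hyp, k)
    windows = range(len(ref) - k)
    errors = sum(
        boundaries_between(ref, i, i + k) != boundaries_between(hyp, i, i + k)
        for i in windows
    )
    return errors / len(windows)


# Toy turn-level segment labels (hypothetical, not taken from QMSum).
ref = [0, 0, 0, 0, 1, 1, 1, 1]      # one boundary after the fourth turn
shifted = [0, 0, 1, 1, 1, 1, 1, 1]  # same boundary count, wrong position
extra = [0, 0, 0, 1, 2, 2, 2, 2]    # correct boundary plus a spurious one

scores = {
    name: (p_k(ref, hyp, 2), window_diff(ref, hyp, 2))
    for name, hyp in {"shifted": shifted, "extra": extra}.items()
}
```

On this toy data the spurious boundary in `extra` yields a $P_k$ of 1/6 but a WindowDiff of 2/6, which illustrates why the two metrics are usually reported together: WindowDiff counts boundaries inside each window and therefore penalizes over-segmentation that $P_k$ can partly miss.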
Table 3. Comparative Analysis of Reasoning Capabilities on LoCoMo Benchmark (J Score).
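| Method | Single-Hop (J) | Multi-Hop (J) | Temporal (J) |
|---|---|---|---|
| A-Mem [22] | 41.62 | 17.52 | 48.54 |
| LangMem [45] | 63.91 | 49.69 | 25.19 |
| Zep [44] | 61.34 | 43.16 | 47.83 |
| Mem0 [13] | 68.37 | 53.33 | 56.83 |
| Mem0g [13] | 66.18 | 48.12 | 60.27 |
| MemLoom (Ours) | 67.43 | 58.14 | 65.77 |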
Note: All LoCoMo results in this table were evaluated under our unified local setup to improve cross-system comparability. Although absolute values may differ from originally reported results due to implementation and environment differences, the overall comparison tendencies remain consistent with prior observations.
Table 4. Efficiency and Latency Analysis Under Bounded Budget.
| Method | Memory Tokens/Chunk Size | Search Latency (P50/P95) | Total Latency (P50/P95) | Overall J |
|---|---|---|---|---|
| RAG (k = 1) | 128 | 0.29 s/0.83 s | 0.79 s/1.84 s | 41.20 |
| | 256 | 0.23 s/0.73 s | 0.76 s/1.61 s | 43.87 |
| | 512 | 0.26 s/0.62 s | 0.74 s/1.74 s | 39.70 |
| | 1024 | 0.22 s/0.74 s | 0.84 s/1.93 s | 34.40 |
| | 2048 | 0.28 s/0.77 s | 1.03 s/2.14 s | 31.39 |
| | 4096 | 0.23 s/0.74 s | 1.12 s/2.68 s | 30.50 |
| | 8192 | 0.31 s/0.86 s | 1.44 s/4.39 s | 38.26 |
| RAG (k = 2) | 128 | 0.29 s/0.64 s | 0.79 s/1.86 s | 53.20 |
| | 256 | 0.24 s/0.68 s | 0.82 s/1.93 s | 54.70 |
| | 512 | 0.28 s/0.77 s | 0.86 s/1.76 s | 51.84 |
| | 1024 | 0.22 s/0.68 s | 0.89 s/1.88 s | 44.32 |
| | 2048 | 0.28 s/0.86 s | 1.14 s/2.82 s | 42.22 |
| | 4096 | 0.29 s/0.96 s | 1.48 s/4.78 s | 45.44 |
| | 8192 | 0.32 s/1.16 s | 2.36 s/9.87 s | 54.20 |
| Full-context | 26,082 | –/– | 9.94 s/17.18 s | 66.70 |
| A-Mem [22] | 2538 | 0.69 s/1.52 s | 1.46 s/4.41 s | 35.89 |
| LangMem [45] | 131 | 18.12 s/59.91 s | 18.64 s/60.48 s | 46.26 |
| Zep [44] | 3924 | 0.53 s/0.76 s | 1.32 s/2.96 s | 50.78 |
| Mem0 [13] | 1778 | 0.17 s/0.22 s | 0.74 s/1.47 s | 59.51 |
| Mem0g [13] | 3631 | 0.46 s/0.68 s | 1.12 s/2.62 s | 58.19 |
| MemLoom (Ours) | 3250 | –/0.85 s | –/1.67 s | 63.78 |
Note: All systems in this table were evaluated under the same local deployment-oriented setting to support a more consistent latency–quality comparison. The Overall J score is computed over the same evaluated LoCoMo query pool under the local deployment-oriented setting. For memory systems also reported in Table 3, the subset-level scores are shown separately in Table 3; RAG and Full-context configurations are included here as additional efficiency-oriented reference settings.
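As a reading aid for the P50/P95 columns in Table 4, the sketch below shows one common way to derive such percentiles from per-query latency samples. It assumes the nearest-rank rule; the percentile method actually used in the evaluation is not stated here, so both the helper name `nearest_rank_percentile` and the sample values are hypothetical.

```python
from typing import Sequence


def nearest_rank_percentile(samples: Sequence[float], p: int) -> float:
    """Return the p-th percentile (integer p in 1..100) by the nearest-rank rule."""
    if not samples:
        raise ValueError("need at least one latency sample")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer form of ceil(p * n / 100), which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-query total latencies in seconds (not data from the paper).
latencies = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.1, 1.2,
             1.2, 1.3, 1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 1.7, 4.2]

p50 = nearest_rank_percentile(latencies, 50)  # median-like typical latency
p95 = nearest_rank_percentile(latencies, 95)  # tail latency
```

With these 20 samples the single 4.2 s outlier leaves P50 and P95 unchanged at 1.2 s and 1.7 s, whereas a mean would be pulled upward. This is the usual reason for reporting tail percentiles rather than averages in latency-bounded serving.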
Table 5. Causal Stewardship and Chain-Level Evaluation on SCDS Causal-Diagnostic Questions.
| System | R@K (Turn-Grounded) | SCR | AA |
|---|---|---|---|
| S0 (Full Context) | 0.98 | 0.95 | 0.92 |
| S1 (Vector) | 0.20 | 0.02 | 0.25 |
| S2 (GraphRAG) | 0.78 | 0.35 | 0.50 |
| S3 (Mem0) | 0.36 | 0.12 | 0.30 |
| S4 (Ours) | 0.85 | 0.72 | 0.80 |
Note: S0 (Full Context) serves as the theoretical upper-bound reference. In SCDS, R@K is measured against turn-grounded causal source evidence rather than consolidated memory states, while AA measures whether the generated answer remains explicitly attributable to the gold supporting evidence under pointer-grounded verification. S3 shows reduced recoverability of remote root-cause turns and weaker chain continuity, consistent with memory consolidation effects. S2 exhibits characteristics of topological diffusion.
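To make the turn-grounded reading of R@K concrete, the sketch below scores a retrieved list against gold source turns rather than against consolidated memory entries. It assumes the standard recall-at-K definition and integer turn identifiers; the function name `recall_at_k` and the example identifiers are hypothetical, and the SCR and AA metrics are not reproduced here.

```python
from typing import Sequence


def recall_at_k(retrieved_turn_ids: Sequence[int],
                gold_turn_ids: Sequence[int],
                k: int) -> float:
    """Fraction of gold evidence turns found among the top-k retrieved turns."""
    gold = set(gold_turn_ids)
    if not gold:
        raise ValueError("at least one gold evidence turn is required")
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = set(retrieved_turn_ids[:k])
    return len(gold & top_k) / len(gold)


# Hypothetical example: the causal chain is grounded in turns 2, 7, and 11.
gold_chain = [2, 7, 11]
retrieved = [5, 2, 9, 7, 30]  # ranked turn ids returned by some retriever

r_at_3 = recall_at_k(retrieved, gold_chain, 3)  # only turn 2 is in the top 3
r_at_5 = recall_at_k(retrieved, gold_chain, 5)  # turns 2 and 7 are recovered
```

Under this reading, a system that stores only a merged summary of turns 2, 7, and 11 would score zero unless it can still point back to the original turns. That is consistent with the reduced recoverability noted for S3 above.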
Table 6. Final measured ablation results for major MemLoom components.
| Ablation Setting | Metric | Full Model | Ablated Model | Δ Absolute | Δ Relative | Architectural Implication |
|---|---|---|---|---|---|---|
| w/o Topology Module | AA | 0.80 | 0.68 | −0.12 | −15.0% | Removing topology weakens recoverability of distant supporting evidence and reduces answer-level traceability. |
| w/o Logic Track | SCR | 0.72 | 0.39 | −0.33 | −45.8% | The logic track is the primary bridge for recovering semantically separated but causally linked states into a chain-complete reasoning path. |
| w/o Logic Track | AA | 0.80 | 0.55 | −0.25 | −31.3% | Removing logic-based bridging weakens answer-level evidential faithfulness. |
| w/o Off-Peak Stewardship | P95 | 1.67 s | 8.34 s | +6.67 s | +399.4% | DLPS is necessary to keep structural maintenance off the live serving path and preserve bounded answer-serving latency. |
Note: All values in Table 6 are measured under the same evaluation protocol as the corresponding full-model runs. SCR and AA are reported under the controlled SCDS diagnostic setting, whereas P95 follows the deployment-oriented latency protocol. Relative changes are computed with respect to the full-model value. QMSum-specific steward ablation is discussed separately in Section 5.1 because it targets event-boundary reconstruction rather than causal recovery or deployment latency.
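The absolute and relative changes in Table 6 follow directly from the full-model and ablated values. The short sketch below recomputes them, taking the relative change as (ablated − full) / full, as stated in the note. The helper name `ablation_deltas` is an illustrative assumption; the numeric inputs are copied from Table 6.

```python
from typing import List, Tuple


def ablation_deltas(full: float, ablated: float) -> Tuple[float, float]:
    """Return (absolute change, relative change in percent) w.r.t. the full model."""
    if full == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    absolute = ablated - full
    return absolute, 100.0 * absolute / full


# (setting, metric, full-model value, ablated value) copied from Table 6.
rows = [
    ("w/o Topology Module", "AA", 0.80, 0.68),
    ("w/o Logic Track", "SCR", 0.72, 0.39),
    ("w/o Logic Track", "AA", 0.80, 0.55),
    ("w/o Off-Peak Stewardship", "P95 (s)", 1.67, 8.34),
]

# Each entry: (setting, metric, absolute delta, relative delta in percent).
recomputed: List[Tuple[str, str, float, float]] = [
    (setting, metric, *ablation_deltas(full, ablated))
    for setting, metric, full, ablated in rows
]
```

For example, the stewardship ablation gives 8.34 − 1.67 = 6.67 s and 6.67 / 1.67 ≈ 3.994, which matches the reported +399.4% to one decimal place.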
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chan, D.-Y.; Cheng, C.-Y.; Wang, J.-F.; Tseng, S.-P. A Novel Dual-Loop Causality-Traceable Retrieval Framework for Long-Horizon Conversational Agents. Electronics 2026, 15, 2373. https://doi.org/10.3390/electronics15112373
