1. Introduction
The rapid evolution of embodied AI drives a shift toward persistent agents that operate over long horizons [1] in real-world multi-party settings [2]. However, this transition exposes a critical gap in conventional retrieval approaches regarding causal structure preservation. Because traditional systems are optimized for static knowledge snapshots [3], they struggle to capture the causal dependencies and temporal dynamics of continuous interactions [4]. To address highly sparse information, merely expanding the retrieval volume or context window is insufficient. Excessive retrieval introduces hard negatives and exacerbates the lost-in-the-middle positional bias [5,6], which necessitates causality-aware structured memory management rather than flat text accumulation. Beyond these contextual limitations, autonomous agents face the structural complexity of multi-party interaction (MPI) topologies [7]. Unlike static documents, social environments continuously generate temporally distributed actions, updates, and causally linked state transitions. When current systems treat these dynamic memories as a flat cache [3,8], they often lack explicit mechanisms for preserving the underlying causal structure and historical state evolution.
From the perspective of causality-traceable conversational memory, the structural limitations of current memory augmentations can be summarized into three design risks. DR1 is topological blindness, in which similarity-driven retrieval fails to reconnect semantically dispersed but causally linked evidence, thereby inducing causally misaligned attributions [9,10]. For instance, when a later utterance has stronger lexical overlap with a query than the true earlier cause, conventional unstructured retrieval may return the more semantically similar statement while missing the actual causal root. DR2 is structural erasure, in which overwrite-oriented updates and coarse summary pipelines collapse temporally distinct historical states into a single compressed representation, thereby weakening the auditability of intermediate transitions and historical state evolution. DR3 is the synchronous maintenance bottleneck, in which graph construction, clustering, versioning, and summarization remain coupled to the live serving path and thus scale poorly with the interaction history [8]. As will be elaborated in Section 2, these risks are especially pronounced in multi-party interactions and should therefore be treated as interaction-specific design risks rather than as universal flaws of all prior systems.
To address these interaction-specific limitations, we propose MemLoom, which models memory as a curated event memory graph. At a high level, MemLoom combines three coordinated ideas: a dual-loop publish–subscribe (DLPS) architecture that decouples latency-sensitive serving from heavyweight structural curation, contract-bound event entities that preserve versioned structural boundaries across long-horizon state evolution, and a bounded dual-stream retrieval (BDSR) mechanism that couples topology-aware traversal with sentence-level grounding. Through this design, MemLoom seeks to support interactive responsiveness while preserving causal and temporal traceability under multi-party conversational settings.
As shown in Figure 1, the two loops construct a mutually supporting operation via atomic publication, thereby resolving the tension between scalability and consistency. This paper presents MemLoom, a steerable, event-centric memory architecture with four main contributions, listed as follows:
A dual-loop publish–subscribe (DLPS) architecture is proposed to practically decouple latency-sensitive online serving from consistency-critical structural maintenance. By serving immutable snapshots on the online path while shifting heavyweight graph curation to off-peak stewardship cycles, MemLoom maintains a bounded answer-serving path without exposing user-facing latency directly to asynchronous maintenance cost.
A neuro-symbolic graph synthesis (NSGS) mechanism is introduced to construct a dual-track topology that transforms raw utterances into contract-validated event entities through a lifecycle finite-state machine. Within this design, the topic track preserves semantic and chronological neighborhoods, while the logic track supports bounded recovery of non-local dependencies such as causal continuation and state evolution.
A bounded dual-stream retrieval (BDSR) mechanism is proposed to combine topology-aware traversal with sentence-level grounding. By coupling structured event-level navigation with raw evidence pointers, BDSR is designed to mitigate topological blindness while preserving answer-level traceability to original dialogue turns.
The synthetic causal diagnostic suite (SCDS) is constructed as a controlled diagnostic instrument for isolating causal rupture under targeted ground truth for complex causal chains. Based on this suite, diagnostic metrics for causal faithfulness can be established, thereby providing deeper diagnostic insights into specific design risks than traditional recall metrics.
The remainder of this paper is organized to make the above design logic explicit. Section 2 revisits recent memory-augmented, agentic-memory, and graph-augmented retrieval studies through the lens of topology awareness, historical traceability, asynchronous curation, and event versioning. Section 3 then presents the dual-loop architecture and its event-graph stewardship mechanism. The subsequent sections describe the evaluation protocol, experimental results, ablation findings, and limitations.
2. Related Works
To address the limitations of unstructured RAG’s flat vector space in maintaining a traceable causal history in dynamic interactions, recent studies have explored several complementary directions, including long-context access, memory construction, graph-augmented retrieval, and agentic memory organization. For instance, benchmarks such as LoCoMo [4] show that very long-term conversational memory remains difficult even when longer contexts or retrieval mechanisms are available. However, long-context settings still suffer from the lost-in-the-middle effect [6], while simply increasing retrieved history introduces a practical trade-off between context coverage and computational efficiency. Consequently, external memory organization remains necessary for persistent agents. From the perspective of memory construction, SeCom [11] and RMM [12] show that long-term dialogue memory benefits from both suitable memory granularity and adaptive retrieval refinement across multiple abstraction levels. Mem0 [13] and THEANINE [14] further highlight the practical value of compact online updating and timeline-aware organization. In parallel, JERR [15] demonstrates that semantic-similarity ranking alone is often insufficient for long-horizon reasoning. Although these approaches improve retrieval quality, memory compression, timeline-aware organization, or query-time reasoning depth, their primary optimization targets remain local access, compact updating, or single-query reasoning. They therefore do not explicitly provide a unified architectural mechanism for preserving causally linked event versions, maintaining auditable historical state transitions, and bounding online serving latency under continuous multi-party interaction.
Beyond isolated RAG pipelines, Wang et al. [16] review networked agent systems, which emphasize multi-agent cooperation strategies and modular architectures that explicitly integrate planning, memory, action, and interaction. This broader agent-centric view is important because it shifts the design question from whether memory exists to how memory is structurally maintained under multi-party interaction. However, the agent-systems literature usually treats memory as a high-level module and offers less guidance on how to preserve causal continuity, event-versioned state evolution, and bounded maintenance cost within a concrete long-horizon conversational memory architecture.
The integration of knowledge graphs (KGs) with RAG presents an important direction for supporting causal-context and multi-hop reasoning. For instance, CausalRAG [9] mitigates semantic-similarity bias by tracing causal paths, while GraphRAG [8] was primarily introduced for corpus-level global sensemaking via community summaries. Complementary to these directions, recent graph-structured retrieval methods such as query-aware KG fusion [17] and Associa [18] strengthen evidence selection beyond flat semantic matching. Such KG-driven retrieval can help mitigate the limitation of similarity-only retrieval that tends to return isolated text segments, thereby aligning with the need for multi-hop and causally traceable retrieval in long-horizon queries. However, these graph-augmented methods are primarily designed for relatively stable corpora, where the graph serves as a retrieval scaffold over already consolidated knowledge. More critically, corpus-level graph construction and community summarization usually assume that the underlying knowledge units are relatively stable before graph construction. Long-horizon multi-party conversations violate this assumption because event boundaries, participant states, and causal dependencies continue to evolve during interaction. Consequently, graph-augmented retrieval alone does not guarantee event-versioned state preservation or latency-bounded online serving unless graph maintenance is explicitly decoupled from the live interaction path.
From a graph-theoretic perspective, this limitation can be further formulated through the distinction between community organization, temporal reachability, and heterophilic bridging. Classical modularity theory explains why relational evidence can be organized into locally dense semantic communities [19], while temporal-network analysis shows that edge activation order and historical timing affect reachability and cannot be fully preserved by a static aggregated graph [20]. In addition, recent heterophilic graph learning studies emphasize that important dependencies may connect dissimilar nodes across community boundaries rather than only similar nodes within the same neighborhood [21]. Based on this background formulation, MemLoom treats conversational memory as a versioned directed, labeled, and temporal event graph, rather than as a flat cache or a static corpus-level graph summary. This formulation motivates the separation between a homophilic topic graph and a logic-aware relation graph: the former preserves local semantic and chronological neighborhoods, whereas the latter retains non-local typed dependencies, such as causal continuation and state succession, that may be weakened by community-only summarization.
When memory is treated not only as a retrieval cache but also as an evolving structure, preserving temporally traceable memory evolution becomes a critical challenge. A-Mem [22] introduces a Zettelkasten-inspired dynamic linking mechanism, whereas CEO [23], hierarchical event schema induction [24], and synergetic event understanding [25] move toward event-oriented abstraction and consolidation. At the system level, MemoryOS [26] outlines a broader hierarchical memory-management blueprint, while Amory [27] moves memory formation closer to offline narrative consolidation. Nevertheless, even these structured-memory directions still leave open how to jointly preserve immutable historical versions, recover causally dispersed evidence, and keep user-facing serving latency bounded in long-horizon multi-party settings. In other words, these systems demonstrate the importance of memory organization, but they do not make the separation between provisional online evidence writing and off-peak structural commitment a primary architectural contract. To make this critical comparison explicit, Table 1 summarizes whether representative paradigms treat topology awareness, history tracking, asynchronous curation and serving, and event versioning as primary architectural objectives.
Taken together, prior work improves long-horizon memory along different axes, but few existing approaches explicitly and jointly address four requirements that become coupled in long-horizon multi-party conversations: topology-aware causal retrieval, preservation of temporally distinct historical states, asynchronous separation between structural curation and bounded online serving, and event-versioned memory governance. This gap motivates MemLoom. Rather than treating memory as a flat cache, a purely online profile store, or a corpus-level graph summary, MemLoom organizes interaction history as a versioned event memory graph and separates latency-sensitive serving from off-peak structural stewardship. The contribution of MemLoom therefore lies not simply in using graph-based retrieval, but in combining event-versioned memory governance, bounded topology-aware retrieval, sentence-level provenance grounding, and snapshot-based asynchronous curation within one deployable architecture.
3. Methodology
3.1. System Overview of MemLoom
For readability, Figure 2 should be read from left to right as three coordinated paths: online interaction, online event formation, and off-peak stewardship.
To construct a conversational AI system with architectural-level disentanglement, MemLoom is built on a dual-loop publish–subscribe architecture. MemLoom structurally decouples real-time responsiveness from long-term consistency by integrating three coordinated subsystems: the online interaction-loop subscriber (OILS), the online event formation system (OEFS), and the off-peak stewardship loop publisher (OSLP), as shown in Figure 2. Specifically, OILS serves as the subscription endpoint. To avoid blocking caused by high-frequency writes, this loop primarily reads from the immutable snapshots published by the off-peak layer. Meanwhile, it utilizes sentence-level memory to provide real-time compensation for the newest unstructured interactions. The OEFS pipeline is unidirectional and lightweight, so the write path stays simple and clearly separated from the read path. The OEFS encapsulates raw dialogues into provisional events and appends them to a buffer, thereby deferring heavyweight computations to sustain smooth interactive responsiveness.
In parallel, OSLP absorbs most computational burdens by deferring congestion-inducing workloads to off-peak stewardship cycles. During asynchronous maintenance cycles, the off-peak steward processes accumulated provisional events to reconstruct high-order connectivity, including entity refinement and high-cost causal/temporal relations. To prevent online state inconsistencies, the newly curated graph is released via an atomic publish protocol, producing a new snapshot. This guarantees monotonic consistency, allowing seamless integration of structured knowledge without interrupting interactions.
To systematically block structural failures, the MemLoom data lifecycle is strictly regulated by four core architectural invariants. First, the semantic router enforces a bounded budget gate to prevent resource exhaustion from invalid utterances, directly addressing the synchronous maintenance bottleneck, i.e., DR3. Next, the OEFS adheres to a provisional-first lifecycle contract. By restricting online writes to revisable drafts, it prevents transient information from mutating authoritative memory, effectively mitigating structural erasure, i.e., DR2. Furthermore, the OSLP executes batch global consistency. Deferring intensive relation construction to off-peak batches preserves logical integrity and helps mitigate DR1. Finally, the system employs auditable dual-stream grounding. By combining macro-level graph traversal with micro-level evidence retention, this retrieval mechanism mitigates the semantic blind spots associated with DR1 while preserving temporally traceable evidence through end-to-end provenance auditing to reduce DR2.
Operationally, the practical decoupling effect of DLPS is realized through three coordinated mechanisms: immutable snapshot serving for stable online reads, off-peak stewardship for heavyweight structural curation, and a bounded answer-serving path that isolates user-facing latency from asynchronous maintenance cost. In this design, the online loop remains read-light and latency-sensitive, whereas the stewardship loop absorbs topology repair, validation, and publication overhead outside the active serving window.
3.2. Semantic Router: Single-Pass Rewrite–Gate–Route
Rather than relying only on conventional text cleaning, MemLoom employs an LLM-powered context-aware semantic router with a single-pass rewrite–gate–route mechanism. The router maps implicit multi-party ambiguities into structured representations and transforms natural language into deterministic control signals for downstream event construction. To minimize interaction latency, the system adopts a single-pass inference strategy. Within this strategy, a sliding window captures short-horizon referential dependencies and pragmatic continuity. This routing protocol is realized as the mapping function in Equation (1), whose inputs are the current utterance, a length-5 historical window, and environmental metadata. Its output comprises a rewritten text, a retrieval intent I, and topic labels.
To establish a reliable substrate for long-horizon multi-party interactions, the contextual rewriting module executes a three-step transformation to resolve conversational ambiguities. First, an explicit mapping anchors first-person pronouns to the current speaker ID, which mitigates identity drift across long interaction histories. Then, it resolves abstract references by replacing implicit pronouns with recent concrete noun phrases, thereby stabilizing the semantic embedding space. Finally, relative time expressions are normalized into absolute date markers, which provides physical coordinates for downstream causal chains.
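The three rewriting steps can be illustrated with a small rule-based sketch. MemLoom performs this rewriting with an LLM router, so the code below is only a deterministic stand-in for illustration: the function name `rewrite_utterance`, the regular expressions, the `RELATIVE_DAYS` table, and the explicit `recent_noun_phrase` argument are all assumptions and do not come from the paper.

```python
import re
from datetime import date, timedelta

# Hypothetical lookup of relative time expressions (illustrative, not exhaustive).
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def rewrite_utterance(text: str, speaker_id: str, recent_noun_phrase: str, today: date) -> str:
    """Toy stand-in for the three rewriting steps; the paper uses an LLM router."""
    # Step 1: anchor first-person pronouns to the current speaker ID.
    text = re.sub(r"\b(I|me)\b", speaker_id, text)
    # Step 2: replace an implicit pronoun with the most recent concrete noun phrase.
    text = re.sub(r"\b(it|that)\b", recent_noun_phrase, text, flags=re.IGNORECASE)
    # Step 3: normalize relative time expressions into absolute ISO dates.
    for word, offset in RELATIVE_DAYS.items():
        absolute = (today + timedelta(days=offset)).isoformat()
        text = re.sub(rf"\b{word}\b", absolute, text, flags=re.IGNORECASE)
    return text
```

For example, with speaker `alice`, noun phrase `the login bug`, and a reference date of 2024-05-01, the utterance "I will fix it tomorrow" becomes "alice will fix the login bug 2024-05-02". A real router would need coreference resolution rather than a single supplied noun phrase.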
3.3. Online Incremental Event Formation System
As illustrated in Figure 3, the structured signals resulting from the preprocessing layer are fed into the online incremental event formation system (OEFS), which transforms discrete conversational streams into temporal event nodes in real time. Conventional RAG architectures typically compress input streams into irreversible semantic abstractions, which compromises the integrity of the contextual window and makes provenance evaluation difficult. To overcome this bottleneck, the design of OEFS adheres to a precision-first, provenance-preserving “write-as-evidence” principle. This strategy aims to maximize semantic purity within individual events, mitigating the topic blackhole effect prevalent in long-horizon dialogues. Specifically, for each preprocessed input memory unit, which comprises topic tags and a text embedding vector generated via OpenAI’s text-embedding-3-small model, the builder explicitly avoids overwriting the existing event representations. Instead, the unit is appended as immutable evidence into the pending memory array of the target event. Synchronously, the aggregated event centroid vector and internal source pointers are updated to preserve the physical integrity of the original context. By providing a strict physical alignment basis, this write-as-evidence paradigm supports downstream operations, including multi-hop retrieval, deduplication, and shadowing, thereby reducing the system’s reliance on purely semantic heuristics.
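The write-as-evidence step described above can be sketched as an append-only record with an incrementally updated centroid. This is a minimal sketch under stated assumptions: the class name `Event`, its field names, and the use of an arithmetic mean for the centroid are hypothetical, since the text only says that the centroid is an aggregated vector.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical event record; field names are illustrative."""
    pending: list = field(default_factory=list)          # append-only evidence units
    source_pointers: list = field(default_factory=list)  # pointers back to raw dialogue turns
    centroid: list = field(default_factory=list)         # running mean of evidence embeddings

    def append_evidence(self, text: str, embedding: list, turn_id: int) -> None:
        n = len(self.pending)
        # Evidence is stored as a tuple so earlier entries are never rewritten.
        self.pending.append((text, tuple(embedding)))
        self.source_pointers.append(turn_id)
        if n == 0:
            self.centroid = list(embedding)
        else:
            # Incremental mean (assumed aggregation): c_new = c_old + (x - c_old) / (n + 1)
            self.centroid = [c + (x - c) / (n + 1) for c, x in zip(self.centroid, embedding)]
```

Because the centroid is updated incrementally, routing can compare a new embedding against one maintained vector per event instead of re-embedding or re-averaging the stored evidence.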
To facilitate event attribution under low-latency constraints, MemLoom executes a hierarchically dispatched routing algorithm (HDRA), which integrates pragmatics with non-linear semantic metrics. First, it enforces a candidate pool constraint, bounding the search for candidate events within the set of online joinable states (formally defined in Section 3.4). As a result, the system structurally filters potential cross-day misalignments and significantly shrinks the search space. Concurrently, for each updated or created event, the constructor maintains an internal statistical feature vector comprising the evidence count, the normalized participant distribution, and the token mass or informational volume. These real-time statistics serve as foundational evidence to support subsequent lifecycle stewardship (detailed in Section 3.4).
For candidate memory units that successfully pass this initial filter, HDRA evaluates their compatibility by calculating a final matching score. Instead of performing redundant vectorization, MemLoom directly computes the semantic similarity between the preprocessed embedding and the maintained event centroid. As defined in Algorithm 1, this routing step balances semantic similarity with temporal decay, where the temporal distance is measured between the current input and the last update of the candidate event. To tolerate subtle semantic fluctuations and suppress background noise, HDRA integrates a fixed non-linear sigmoid function. Additionally, it applies a short-window gain, controlled by a dedicated parameter, which explicitly rewards temporally proximate inputs to sustain the continuity of reasoning chains.
To configure the routing objective under bounded online latency, the HDRA coefficients were selected through a coarse-to-fine empirical search on a held-out validation subset rather than as universally optimal constants. The search examined candidate settings for the semantic weight, temporal decay weight, decay rate, short-window gain, and transition threshold under the joint criteria of semantic fidelity, temporal continuity, and latency stability, and the final routing defaults for these five coefficients were then fixed. This validation-stage tuning used an independent held-out SCDS validation split of 100 diagnostic queries that was reserved exclusively for parameter selection and never reused in the final reported evaluations. All reported SCDS benchmark results were produced on a disjoint final evaluation split of 353 queries, and the external QMSum and LoCoMo benchmarks remained entirely untouched during tuning. Once selected, the routing defaults were frozen and kept fixed throughout all final experiments. These values should therefore be interpreted as empirically selected operational defaults for the tested domains. By comparing the maximum candidate score against the transition threshold, the system triggers either a NEW_EVENT or a CONTINUE operation, thereby preserving topic coherence while remaining adaptive to boundary shifts in dynamic conversations.
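| Algorithm 1 HDRA pseudo code |
| Require: Current memory unit (with its embedding vector), pool of active events |
| Ensure: Routing decision and state update |
| 1: Filter candidates in the pool based on temporal boundaries |
| 2: for all candidates in the filtered pool do |
| 3:     Compute the semantic similarity between the unit embedding and the candidate centroid |
| 4:     Compute the final matching score from the similarity, the temporal decay, and the short-window gain |
| 5: end for |
| 6: if the maximum candidate score is below the transition threshold then |
| 7:     NEW_EVENT() |
| 8: else |
| 9:     Select optimal target |
| 10:     Append the unit to the target's pending evidence and update its centroid and statistics |
| 11: end if |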
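The routing decision in Algorithm 1 can be sketched in a few lines. The paper's exact scoring expression and tuned coefficients are not reproduced here, so everything numeric below is an assumption: the additive form of the score, the sigmoid steepness and midpoint, the exponential decay, the 60-second short window, and all constant values are placeholders chosen only to make the example concrete.

```python
import math

# Placeholder coefficients, NOT the paper's tuned defaults.
W_SEM, W_TIME, DECAY_RATE = 0.7, 0.3, 0.001
SHORT_GAIN, SHORT_WINDOW_S, THRESHOLD = 0.05, 60.0, 0.6


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_score(embedding, centroid, dt_seconds):
    # Sigmoid squashing of cosine similarity (steepness and midpoint are assumptions).
    sem = 1.0 / (1.0 + math.exp(-10.0 * (cosine(embedding, centroid) - 0.5)))
    decay = math.exp(-DECAY_RATE * dt_seconds)                     # temporal decay term
    gain = SHORT_GAIN if dt_seconds <= SHORT_WINDOW_S else 0.0     # short-window gain
    return W_SEM * sem + W_TIME * decay + gain


def route(embedding, now, candidates):
    """candidates: (event_id, centroid, last_update_time) tuples, already
    restricted to online joinable states."""
    scored = [(match_score(embedding, c, now - t), eid) for eid, c, t in candidates]
    if not scored or max(scored)[0] < THRESHOLD:
        return ("NEW_EVENT", None)
    return ("CONTINUE", max(scored)[1])
```

With these placeholder values, an input whose embedding matches a recently updated event continues that event, while an input that is orthogonal to every candidate centroid opens a new one.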
3.4. Event Lifecycle Realization
In the present framework, an event is treated as a bounded, versionable, and evidence-backed conversational memory unit rather than as a claim of a universally correct real-world event ontology. This definition is intentionally operational: an event captures a locally coherent interaction segment under bounded temporal and semantic consistency. To ensure auditable memory, MemLoom models each event as a finite-state machine (FSM) regulated by the state set {provisional, active, closed, archived}. As shown in Figure 4, the proposed FSM defines lifecycle transitions and storage states under a strict commitment contract.
As illustrated by Stage 1 in Figure 4, events in the provisional state are isolated from high-order relationships, such as causal and temporal edges, to filter premature structural noise. Promotion eligibility is subsequently determined by a maturity function, which evaluates the information density of these events.
An event is promoted to the active state when its maturity score reaches the promotion threshold. The score combines the number of pending evidence units, the participant entropy computed from the normalized participant distribution within the event, and the token mass or informational volume of the pending evidence. As defined in Equation (2), three coefficients act as governance parameters to balance these complementary maturity signals: evidence accumulation, participant diversity, and informational volume, respectively. This decoupling prevents event promotion from being dominated by a single proxy; for example, evidence count alone may over-promote repetitive fragments, token mass alone may favor verbose but structurally weak spans, and participant entropy alone may overvalue interaction breadth without sufficient evidence mass.
In the current implementation, these maturity coefficients are treated as fixed governance defaults, which were selected based on the held-out SCDS validation split and subsequently frozen for all final experiments. They are not intended to guarantee universal terminal optimality; instead, they regulate the lifecycle trade-off between the premature promotion of noisy fragments and the excessive delay in making emerging events available to the active graph. Therefore, the promotion threshold should be interpreted as a lifecycle governance threshold rather than a theoretically optimal constant. If the threshold is set too low, provisional events are promoted prematurely, thereby increasing graph inflation and structural noise. Conversely, an excessively high threshold delays event availability, causing freshness lag and deferred structural integration. The selected threshold is therefore intended to mitigate noisy premature promotion while avoiding excessive delays in event availability.
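To make the promotion trade-off concrete, the following sketch computes a maturity score from the three signals and compares it with a threshold. The linear combination, the names `ALPHA`, `BETA`, `GAMMA`, and `THETA`, and their values are assumptions for illustration only; the paper's Equation (2) and its frozen defaults may differ.

```python
import math

# Placeholder governance coefficients and threshold (hypothetical values).
ALPHA, BETA, GAMMA, THETA = 0.5, 1.0, 0.01, 3.0


def participant_entropy(distribution):
    """Shannon entropy (in nats) of a normalized participant distribution."""
    return -sum(p * math.log(p) for p in distribution if p > 0)


def maturity(evidence_count, distribution, token_mass):
    # Assumed linear combination of the three maturity signals.
    return (ALPHA * evidence_count
            + BETA * participant_entropy(distribution)
            + GAMMA * token_mass)


def should_promote(evidence_count, distribution, token_mass):
    return maturity(evidence_count, distribution, token_mass) >= THETA
```

Under these placeholder values, a single short utterance from one speaker stays provisional, whereas four evidence units split evenly between two speakers with 120 tokens of content cross the threshold. Lowering `THETA` would promote the first case as well, which is the graph-inflation risk discussed above.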
When multiple simultaneous or overlapped events are present, MemLoom adopts an intentionally asymmetric handling strategy. On the online path, OEFS applies a single primary event assignment under bounded latency. It assigns the incoming memory unit to the most compatible provisional or active event, ensuring that online routing remains lightweight and auditable. The higher-order repair of simultaneous, interleaved, or partially overlapping event structures is then deferred to off-peak stewardship. During this off-peak phase, timeline repair and relation synthesis can recover cross-event continuity through typed temporal, state-transition, and event-continuity links, which are formalized later in the NSGS subsection. This design embodies a deliberate trade-off between real-time responsiveness and structural recovery, rather than operating under the assumption that overlaps can always be perfectly resolved online.
As shown by Stage 2 in Figure 4, once an event is promoted to the active state via dual-track topological skeleton construction (DTKC), it becomes eligible for full topological reasoning. As shown by Stage 3 in Figure 4, to adapt to episodic rhythms, an event transitions to the closed state when a contextual boundary condition is satisfied, that is, when a boolean boundary predicate evaluated over contextual closure cues, such as task completion or session shift, holds. Closed events become read-only nodes that preserve structural connectivity while prohibiting new insertions, effectively fixing the historical context.
Finally, transitioning to the archived state, as illustrated by Stage 4 in Figure 4, represents a strategic shift toward resource optimization. Regulated by a resource-aware retention policy rather than rigid chronological rules, events are moved to cold storage when exceeding their relevance horizon or when memory footprints necessitate migration. This mechanism isolates the active search space from the historical repository, helping mitigate latency inflation and stabilize long-term write efficiency during system maintenance windows.
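The four lifecycle stages can be summarized as a small state machine. This is a sketch under stated assumptions: the class and member names are hypothetical, and the strictly forward-only transition table is an assumption, because the text describes the four stages in order but does not state whether other transitions (for example, discarding a provisional event) exist.

```python
from enum import Enum


class State(Enum):
    PROVISIONAL = "provisional"
    ACTIVE = "active"
    CLOSED = "closed"
    ARCHIVED = "archived"


# Forward-only transitions following Stages 1 to 4 (assumed to be exhaustive).
ALLOWED = {
    State.PROVISIONAL: {State.ACTIVE},
    State.ACTIVE: {State.CLOSED},
    State.CLOSED: {State.ARCHIVED},
    State.ARCHIVED: set(),
}


class EventLifecycle:
    def __init__(self):
        self.state = State.PROVISIONAL
        self.evidence = []

    def transition(self, target: State) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {target.value}")
        self.state = target

    def append(self, unit) -> None:
        # Closed and archived events are read-only.
        if self.state in (State.CLOSED, State.ARCHIVED):
            raise PermissionError("event is read-only")
        self.evidence.append(unit)
```

Keeping the transition table as data rather than scattered conditionals makes the commitment contract easy to audit: any attempt to write into a closed event or to move an event backwards fails loudly.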
3.5. Off-Peak Stewardship Loop Publisher (OSLP)
The off-peak stewardship loop publisher (OSLP) operates as a background entity that asynchronously performs memory curation during the system’s idle cycles, which functions analogously to a steward systematically organizing a household while the owner rests. This design architecturally reconciles the inherent tension between low-latency online interactions and strict structural consistency in long-horizon conversational agents. Because the OEFS conforms to an append-first strategy, the off-peak steward assumes the critical responsibility of reconstructing the accumulated raw fragments into a causal-temporal graph snapshot. This snapshot is designed to preserve causal traceability and temporal continuity.
As depicted in Figure 5, the stewardship pipeline orchestrates this transformation through a sequence of rigorous phases under multi-version concurrency control (MVCC) isolation. As illustrated by Phase 1 in Figure 5, fragmented inputs first undergo physical defragmentation and normalization. During this phase, retired nodes are preserved in a lineage registry to maintain directed acyclic graph (DAG) provenance. Specifically, the registry records node-level lineage links for provenance-preserving roll-forward and audit trails.
Subsequently, as shown by Phase 2 in Figure 5, the core neuro-symbolic graph synthesis (NSGS) implements topological decoupling. Within this mechanism, the first track constructs a homophilic topic graph via non-LLM approximate nearest neighbor (ANN) search, and the second track injects symbolic hooks to structurally bridge heterophilic reasoning paths. Following this, as depicted by Phases 3 and 4 in Figure 5, the system executes cross-cluster reasoning and contract validation. These operations are bounded by a deterministic inference budget, which enforces a per-maintenance-cycle upper bound on large language model (LLM) inference and validation costs. Ultimately, the OSLP releases the structural updates via an atomic publish protocol, which provides the online loop with a stable, curated event memory graph for subsequent retrievals.
3.5.1. Snapshot Consistency Contract
Based on MVCC principles, the architecture introduces a snapshot consistency contract. To prevent evidence tearing (i.e., dirty reads where agents might infer from partially constructed graphs), the online layer enforces a single-version read principle. Under the guarantee of snapshot isolation, the system exclusively accesses the preceding snapshot during the active period (the online serving window between two stewardship cycles). The subsequent state transition to the newly curated t-th snapshot is executed via an atomic publish protocol, which switches the global pointer only after passing all integrity checks. This mechanism supports monotonic consistency and helps mitigate DR3.
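The single-version read principle and the atomic pointer switch can be sketched as follows. This is an illustrative in-process model only: the class `SnapshotStore`, its methods, and the representation of a graph as a dictionary are assumptions, and `MappingProxyType` gives a shallow read-only view rather than deep immutability.

```python
import threading
from types import MappingProxyType


class SnapshotStore:
    """Hypothetical single-version read store with atomic pointer publication."""

    def __init__(self, initial_graph: dict):
        self._lock = threading.Lock()
        self._version = 0
        self._current = MappingProxyType(dict(initial_graph))  # read-only view

    def read(self):
        # Readers obtain the version and the snapshot together, never a partial graph.
        with self._lock:
            return self._version, self._current

    def publish(self, new_graph: dict, integrity_checks) -> bool:
        # The candidate is built and validated off the serving path.
        candidate = MappingProxyType(dict(new_graph))
        if not all(check(candidate) for check in integrity_checks):
            return False  # the previous snapshot stays live
        with self._lock:
            # The global pointer switches only after every check has passed.
            self._current = candidate
            self._version += 1
        return True
```

A reader that obtained the preceding snapshot before publication keeps a consistent view for the rest of its request, while new requests see the newly published version. A failed integrity check leaves the serving pointer unchanged.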
3.5.2. Global Policy Definition
To ensure reproducible memory management, the hyperparameters used to regulate graph topology evolution are encapsulated into a global synthesis policy vector, given as Equation (3). This policy constrains the steward through explicit transformation rules and budgets: a per-cycle LLM inference budget, a budget allocated to cross-cluster bridging/verification, a guard/validation budget used in contract checking, and a versioned verification rule-set applied by integrity checks. Semantically, a cosine similarity margin filters low-relevance edges, while a bound on the ANN search breadth prevents excessive semantic hub expansion. Temporally, the temporal k-NN parameter, the backtracking window, and the local temporal tolerance restrict the merging search space to physically adjacent events, helping timeline repairs remain consistent with chronological order. The temporal components of Equation (3) are operationalized directly in Algorithm 2, where the backtracking window bounds the merge backtracking scope and the local temporal tolerance acts as the admissibility threshold for local temporal repair.
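| Algorithm 2 Entity Normalization and Timeline Repair (ENTR) |
| Require: Provisional events, existing curated events, global policy |
- 1: Build the active candidate pool by filtering retired nodes through the lineage registry ▹ Input guard using lineage
- 2: for all adjacent pairs within backtracking window do
- 3: if TemporalDistance is within the local temporal tolerance and LLM_Merge returns True then
- 4: Merge the pair with ⊕ into a new merged node
- 5: Replace the pair with the merged node in the active pool and record DAG mapping in the lineage registry
- 6: end if
- 7: end for
|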
Unlike conventional RAG architectures, where structural thresholds are often scattered as hard-coded heuristics, explicitly encapsulating these guardrails disentangles deterministic stewardship logic from the probabilistic LLM engine. This formulation represents complex neuro-symbolic orchestration as a parameterized configuration space, enabling reproducible configurations across different hardware constraints and application scenarios.
To preserve reproducibility, the scalar routing coefficients used in HDRA (
,
,
,
, and
) are treated as tuned online operational defaults, whereas the stewardship policies and lifecycle thresholds are treated as fixed governance controls under a versioned policy contract. In practice, each deployment profile materializes a concrete policy instance
from this versioned stewardship contract, and this instance remains immutable throughout a serving window. Stewardship-side controls may be updated only at stewardship-cycle boundaries through offline re-validation and atomic snapshot publication. This design prevents mixed-policy reads across serving windows, preserves snapshot consistency, and makes policy evolution auditable across snapshot versions. Their roles, tuning rationales, and freeze rules are summarized in Appendix A.3 and Appendix A.4.
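As a rough illustration of an immutable, versioned policy instance, the following sketch uses a frozen dataclass. Every field name and value is a hypothetical placeholder chosen to mirror the roles described in the text; the paper's actual symbols and defaults are not reproduced here.

```python
from dataclasses import dataclass, replace, FrozenInstanceError


@dataclass(frozen=True)
class StewardPolicy:
    # All field names and values are hypothetical placeholders.
    version: str
    llm_budget: int            # per-cycle LLM inference budget
    bridge_budget: int         # cross-cluster bridging/verification budget
    guard_budget: int          # guard/validation budget for contract checks
    sim_margin: float          # cosine similarity margin for edges
    ann_breadth: int           # ANN search breadth
    temporal_knn: int          # temporal k-NN
    backtrack_window: int      # merge backtracking window
    temporal_tolerance: float  # local temporal tolerance
    ruleset_version: str       # versioned verification rule-set


policy = StewardPolicy("v1", 64, 16, 32, 0.35, 8, 2, 5, 120.0, "rules-1")


def try_mutate(p: StewardPolicy) -> bool:
    """Return True only if an in-place update succeeds (it should not)."""
    try:
        p.llm_budget = 10_000
    except FrozenInstanceError:
        return False
    return True


# At a stewardship-cycle boundary, a new instance is derived, not edited.
next_policy = replace(policy, version="v2", llm_budget=96)
```

Deriving `next_policy` with `replace` leaves the instance used by the current serving window untouched, which is the property the freeze rule relies on.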
3.5.3. Entity Normalization and Timeline Repair
Prior to synthesizing high-order relationships, the steward transforms unstructured provisional events into a curated set of event nodes. First, a semantic encapsulation process enforces an authority contract, guiding the LLM to generate citable core summaries (
) and stable vector anchors for newly accumulated events, where
denotes a citable core summary attached to each curated event node. Second, to repair timeline fragmentation caused by the online layer’s conservative segmentation, a physical defragmentation process is executed, as depicted in Phase 1 of
Figure 5. As detailed in the pseudocode of Entity Normalization and Timeline Repair (ENTR), the steward applies a bounded merge-repair procedure to temporally adjacent candidates within the backtracking window
. This procedure combines a deterministic temporal gate with an LLM-assisted merge validator, so that physical merging is triggered only when both local chronological admissibility and narrative compatibility are satisfied. The temporal edges produced by this repair stage are collected as
, which denotes the set of physical adjacency edges used as the chronological backbone in Stage 2.
To make this repair procedure operationally explicit, ENTR applies two bounded decision gates before physical merging. First, TemporalDistance(·, ·) acts as a deterministic temporal admissibility check that determines whether two adjacent event candidates remain sufficiently close in chronological position to be considered for merge under the current stewardship policy. In the current implementation, this check is bounded by the backtracking window and the local temporal tolerance. Second, LLM_Merge(·, ·) serves as an LLM-assisted but policy-bounded boolean validator. It returns True only when the candidate pair exhibits sufficient narrative continuity, semantic compatibility, and referential stability, while not violating temporal order or introducing an explicit state contradiction.
Crucially, to maintain graph atomicity and data provenance, this physical merging, performed by the merge operator ⊕ on event node instances, does not simply discard old nodes. Instead, original nodes are retired and recorded in a lineage registry, establishing a directed acyclic graph (DAG) lineage mapping. This registry not only acts as an input guard to prevent “zombie” fragments from being reactivated into the deep reasoning pipeline, but also serves as the structural basis for subsequent retrieval grounding. By preserving this traceable mapping, the system retains the capacity to link abstract summaries back to raw utterances, thereby helping mitigate DR2.
In Algorithm 2, the provisional events are those newly accumulated in the current stewardship cycle, the existing curated events are the previously committed curated event set, and the active candidate pool is what remains after retired nodes are filtered through the lineage registry. The operator ⊕ denotes the physical merge operator that replaces a validated adjacent pair with a new merged node, while preserving provenance in the lineage registry.
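The merge-repair loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Event` fields, the `entr` signature, the use of seconds for timestamps, and the reading of the backtracking window as "the most recent `window` events" are all assumptions, and `always_compatible` is a stand-in for the LLM-assisted validator.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    eid: str
    start: float  # timestamps in seconds (assumed unit)
    end: float
    text: str


def entr(events: List[Event], lineage: Dict[str, str], window: int,
         tolerance: float,
         merge_ok: Callable[[Event, Event], bool]) -> List[Event]:
    """Merge temporally adjacent events; retired ids go into `lineage`."""
    # Input guard: drop nodes already retired in the lineage registry.
    pool = sorted((e for e in events if e.eid not in lineage),
                  key=lambda e: e.start)
    i = max(0, len(pool) - window)  # only backtrack over recent events
    while i + 1 < len(pool):
        a, b = pool[i], pool[i + 1]
        close = (b.start - a.end) <= tolerance   # deterministic gate
        if close and merge_ok(a, b):             # validator gate
            merged = Event(f"{a.eid}+{b.eid}", a.start, b.end,
                           f"{a.text} {b.text}")
            lineage[a.eid] = merged.eid          # old id -> merged id
            lineage[b.eid] = merged.eid
            pool[i:i + 2] = [merged]             # retry from merged node
        else:
            i += 1
    return pool


def always_compatible(a: Event, b: Event) -> bool:
    return True  # stand-in for the LLM-assisted merge validator


lineage: Dict[str, str] = {}
events = [Event("e1", 0, 10, "machine leaks"),
          Event("e2", 12, 20, "Alice unplugs it"),
          Event("e3", 500, 510, "Bob asks about tea")]
curated = entr(events, lineage, window=5, tolerance=5.0,
               merge_ok=always_compatible)
```

Because the temporal gate is evaluated first, the validator is only invoked for pairs that are already chronologically admissible, which keeps the number of LLM calls bounded by the window.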
3.5.4. Neuro-Symbolic Graph Synthesis (NSGS)
NSGS functions as the central off-peak curation mechanism that combines probabilistic semantic understanding with deterministic structural constraints. Its role is to validate evidence-grounded candidate relations among temporally sparse and semantically disconnected dialogue fragments, thereby helping preserve causal continuity and state evolution under DR1 and DR2. Rather than claiming formal causal discovery, NSGS performs bounded causal-relation validation over curated event nodes and organizes accepted relations into a compact topology for downstream retrieval.
Conventional GraphRAG-style architectures are less suitable for this setting because continuous interaction streams require frequent updates, while community summarization may merge temporally distinct but semantically related states. This can obscure intermediate transitions and historical boundaries, which corresponds to the structural erasure risk described by DR2.
As illustrated by Algorithm 3 and Phase 2 in
Figure 6, NSGS introduces a bounded-update approach via a dual-track skeleton. To keep the topology interpretable and validation-bounded, NSGS uses a compact four-type relation closure: CAUSAL_FOLLOW, TEMPORAL_NEXT, STATE_SUCCESSOR, and SAME_EVENT. CAUSAL_FOLLOW captures directional cause–effect dependencies, TEMPORAL_NEXT preserves chronological adjacency, STATE_SUCCESSOR models historically distinct state evolution within the same event lineage, and SAME_EVENT reconnects semantically dispersed but co-referential fragments. This closure is not intended as a universal ontology; it is selected to balance causal recovery, temporal continuity, state preservation, event reconnection, and bounded maintenance cost.
Track 1 constructs a homophilic topic graph using HNSW-based approximate nearest neighbor search bounded by
, as defined in Equation (
4):
where
i and
j index active event nodes,
denotes the embedding anchor of node
i,
returns the top-
nearest neighbors of
i, and
is the cosine similarity margin. The resulting topic graph
provides a non-LLM local scaffold for semantic and chronological neighborhoods.
Track 2 then adds symbolic hooks
derived from explicit online signals, yielding
. These hooks allow the steward to test candidate dependencies across semantic-cluster boundaries, supporting the later validation of relations such as CAUSAL_FOLLOW and STATE_SUCCESSOR.
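A small sketch of the two tracks is given below. It uses exact top-k cosine search in place of HNSW, omits the chronological backbone edges, and treats symbolic hooks as a plain set of node pairs; the function names and the toy embeddings are assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topic_edges(emb: List[Sequence[float]], k: int,
                margin: float) -> Set[Edge]:
    """Track 1: keep top-k neighbours whose cosine meets the margin."""
    edges: Set[Edge] = set()
    for i in range(len(emb)):
        scored = sorted(((cosine(emb[i], emb[j]), j)
                         for j in range(len(emb)) if j != i),
                        reverse=True)[:k]
        for score, j in scored:
            if score >= margin:
                edges.add((i, j))
    return edges


def logic_edges(topic: Set[Edge], hooks: Set[Edge]) -> Set[Edge]:
    """Track 2: add symbolic hooks on top of the topic graph."""
    return topic | hooks


emb = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
topic = topic_edges(emb, k=1, margin=0.8)
logic = logic_edges(topic, hooks={(0, 2)})
```

In this toy case nodes 0 and 2 fall into different similarity neighbourhoods, so only the hook in the logic graph connects them.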
| Algorithm 3 Dual-Track Topological Skeleton Construction (DTKC) |
- 1:
- 2: ▹ Track 1: Homophily + chronological scaffold for clustering
- 3:
- 4: ▹ Track 2: Bridge structure for higher-order reasoning
|
As
Figure 6 illustrates, semantic clustering may separate an error event
from its later fix event
into different communities. The logic graph uses symbolic hooks to bypass these homophilic boundaries and recover cross-cluster dependencies, preserving chronological progression, state evolution, and same-event reconnection.
Following topological decoupling, BBR partitions the topic graph into semantic clusters
and allocates bounded inference slots to cluster-local and cross-cluster reasoning as shown in Algorithm 4. Let
denote LLM-proposed candidate relation edges before contract validation. The total LLM inference cost is bounded by Equation (
5):
where
denotes incrementally updated nodes in cluster
i, and
M represents fixed overhead. Because the reasoning budget is bounded by the number of active clusters rather than the entire historical repository, BBR keeps off-peak structural reasoning tractable under asynchronous stewardship.
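To make the budgeting idea concrete, the sketch below shows one possible slot-allocation rule. It is not Equation (5): the function name, the per-cluster cap, and the greedy ordering are assumptions, chosen only to show how inference slots can depend on clusters with updated nodes plus a fixed overhead rather than on total history size.

```python
from typing import Dict


def allocate_slots(updated: Dict[str, int], per_cluster_cap: int,
                   overhead: int, budget: int) -> Dict[str, int]:
    """Grant inference slots to clusters with new nodes, within budget."""
    remaining = budget - overhead  # the fixed overhead is reserved first
    slots: Dict[str, int] = {}
    # Serve clusters with the most updated nodes first.
    for cid, n_new in sorted(updated.items(), key=lambda kv: -kv[1]):
        if n_new == 0 or remaining <= 0:
            continue  # untouched clusters consume no budget
        grant = min(n_new, per_cluster_cap, remaining)
        slots[cid] = grant
        remaining -= grant
    return slots


slots = allocate_slots({"c1": 5, "c2": 0, "c3": 2},
                       per_cluster_cap=3, overhead=1, budget=5)
```

Here cluster `c2` has no incrementally updated nodes and therefore receives no slot, regardless of how much history it already holds.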
Candidate relations are subsequently canonicalized into the four-type closure and passed to contract validation before snapshot publication. Answer-level auditability is supported separately through sentence-level reference pointers and pointer-grounded verification.
| Algorithm 4 Budgeted Bridge Reasoning (BBR) |
- 1: ▹ Partition into clusters
- 2:
- 3: for all clusters do
- 4:
- 5: end for
- 6:
|
3.5.5. Contract Validation and Snapshot Publication
To converge probabilistic LLM-generated candidate edges into a deterministic graph, the system executes a contract validation and snapshot publication (CVSP) process. A verification gate applies three hard constraints, each a boolean validation predicate applied to a candidate edge y: the first verifies JSON structural integrity, the second mandates original-evidence citations for auditability, and the third restricts relationships to the allowed closure set specified by the verification rule-set of the global policy. Validated edges are then deduplicated and subjected to physical constraints, such as temporal ordering, via a canonicalization function, yielding the committed relation edge set.
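The verification gate and canonicalization step can be sketched as below. The edge schema (`src`, `dst`, `type`, `evidence`), the function names, and the choice to exempt SAME_EVENT from the temporal-ordering check are assumptions for illustration; only the four relation types come from the text.

```python
import json
from typing import Dict, List, Optional

ALLOWED_TYPES = {"CAUSAL_FOLLOW", "TEMPORAL_NEXT",
                 "STATE_SUCCESSOR", "SAME_EVENT"}
REQUIRED_KEYS = {"src", "dst", "type", "evidence"}  # hypothetical schema


def parse_edge(raw: str) -> Optional[dict]:
    """Predicate 1: the candidate must be well-formed JSON."""
    try:
        edge = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(edge, dict) or not REQUIRED_KEYS.issubset(edge):
        return None
    return edge


def passes_gate(edge: dict) -> bool:
    """Predicates 2 and 3: cited evidence and an allowed relation type."""
    return bool(edge["evidence"]) and edge["type"] in ALLOWED_TYPES


def canonicalize(edges: List[dict],
                 event_time: Dict[str, float]) -> List[dict]:
    """Deduplicate and drop directed edges that run backwards in time."""
    seen, committed = set(), []
    for e in edges:
        key = (e["src"], e["dst"], e["type"])
        if key in seen:
            continue
        # Assumed rule: SAME_EVENT is exempt from temporal ordering.
        if (e["type"] != "SAME_EVENT"
                and event_time[e["src"]] > event_time[e["dst"]]):
            continue
        seen.add(key)
        committed.append(e)
    return committed


raw_candidates = [
    '{"src": "e1", "dst": "e2", "type": "CAUSAL_FOLLOW", "evidence": ["u3"]}',
    '{"src": "e1", "dst": "e2", "type": "CAUSAL_FOLLOW", "evidence": ["u3"]}',
    '{not json',
    '{"src": "e1", "dst": "e2", "type": "TEMPORAL_NEXT", "evidence": []}',
    '{"src": "e1", "dst": "e2", "type": "RELATED_TO", "evidence": ["u4"]}',
    '{"src": "e2", "dst": "e1", "type": "CAUSAL_FOLLOW", "evidence": ["u5"]}',
]
parsed = [e for e in map(parse_edge, raw_candidates) if e is not None]
valid = [e for e in parsed if passes_gate(e)]
committed = canonicalize(valid, event_time={"e1": 1.0, "e2": 2.0})
```

Of the six candidates, only the first survives: the others are a duplicate, malformed JSON, an uncited edge, a type outside the closure, and an edge that violates temporal order.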
Ultimately, the system encapsulates authoritative entities, validated relations, and auditable metadata (cluster indexing, physical lineage, and policy) into an immutable snapshot. In compliance with the atomic publication rule of the snapshot consistency contract above, the global pointer switches to this snapshot only after all integrity verifications have passed. Through the lens of MVCC, this step functions similarly to a transactional commit. By ensuring that the online loop only queries fully committed states, this publication pipeline supports monotonic consistency. Consequently, it helps mitigate DR1 and DR2 by improving reproducibility and traceability in long-term interactions.
3.6. Sentence-Level Memory
While curated events constitute the core macroscopic memory, sentence-level memory serves as an indispensable high-fidelity failsafe, providing a two-fold compensation. Temporally, it provides OILS with an intra-cycle real-time buffer, directly indexing unstructured raw conversational data to eliminate memory blind spots caused by asynchronous stewardship latency. Structurally, it preserves uncompressed raw text spans as a resolution substrate, compensating for the inherent detail loss during event summarization.
Concretely, each utterance is stored as an immutable record indexed by a unique identifier, enabling direct addressability without requiring re-clustering. Each curated event maintains a set of reference pointers that map its summary claims to supporting raw spans. At query time, OILS materializes grounding text by executing direct pointer lookups, ensuring that citations are verbatim spans rather than regenerated paraphrases. The real-time buffer covers the intra-cycle gap (i.e., the wall-clock duration between two consecutive published snapshots) by indexing the newest utterances before the next off-peak publication. This design supports fallback to raw evidence when event summaries are lossy, thereby improving auditability and error analysis.
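The pointer-based grounding described above can be sketched as follows. The record layout (`Utterance`, `Pointer` with character offsets) and the `SentenceMemory` class are hypothetical; the text only specifies immutable records with unique identifiers and reference pointers to raw spans.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Utterance:
    uid: str
    speaker: str
    text: str


@dataclass(frozen=True)
class Pointer:
    uid: str    # which utterance the span lives in
    start: int  # character offsets (assumed span encoding)
    end: int


class SentenceMemory:
    """Append-only store of raw utterances, addressable by identifier."""

    def __init__(self) -> None:
        self._records: Dict[str, Utterance] = {}

    def append(self, utt: Utterance) -> None:
        if utt.uid in self._records:
            raise ValueError("utterance records are immutable")
        self._records[utt.uid] = utt

    def ground(self, pointers: List[Pointer]) -> List[str]:
        # Direct lookups return verbatim spans, never paraphrases.
        return [self._records[p.uid].text[p.start:p.end] for p in pointers]


memory = SentenceMemory()
memory.append(Utterance(
    "u1", "Alice", "I unplugged the coffee machine because it was leaking"))
claim_span = "I unplugged the coffee machine"
pointer = Pointer("u1", 0, len(claim_span))
```

Calling `memory.ground([pointer])` returns the exact substring of the stored utterance, so a citation built from it cannot drift from the raw text.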
3.7. Bounded Dual-Stream Retrieval
To retrieve information from immutable snapshots while addressing DR1 and DR2, MemLoom employs a BDSR mechanism executed through three atomic operations. First, semantic seeding leverages vector similarity to identify highly relevant active events as narrative anchors without altering graph topology. Second, structured traversal expands from these anchors through the validated relation closure: it traces CAUSAL_FOLLOW edges bidirectionally to recover root causes and consequences (addressing DR1); follows TEMPORAL_NEXT edges to preserve chronological continuity across adjacent event transitions; traverses STATE_SUCCESSOR edges to retain historically distinct state evolution without flattening it into a single compressed representation (addressing DR2); and explores SAME_EVENT edges to reconnect semantically dispersed but co-referential fragments of the same underlying event. Together, these bounded traversals enable multi-hop context recovery while preserving causal traceability and structural distinctness. Finally, the grounding stage uses reference pointers to map abstract event summaries back to concrete utterance entries within the sentence-level substrate. This forms an end-to-end provenance chain from retrieval to answer generation, which helps suppress hallucinations by keeping every critical claim traceable to raw evidence and supports answer-level auditability.
To keep retrieval cost bounded during long-term interactions, this pipeline is constrained by a global computational budget and a visibility filter. The global budget enforces a bounded retrieval cost independent of historical data scale by applying fixed caps on (i) the number of seed events, (ii) traversal hops and branching factors, and (iii) the amount of grounded evidence materialized for generation. Concurrently, a pre-computed access control list (ACL) blocks unauthorized accesses during both graph traversal and evidence materialization. This ensures that only nodes, edges, and sentence-level spans within the viewer’s visibility scope can be retrieved, thereby preventing cross-user leakage and the risk of dirty reads.
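A minimal sketch of budgeted, visibility-filtered traversal is shown below. The adjacency format, the parameter names, and the breadth-first strategy are assumptions; the sketch only illustrates how fixed caps on seeds, hops, and branching combine with an ACL check.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def bounded_traverse(seeds: List[str],
                     edges: Dict[str, List[Tuple[str, str]]],
                     visible: Set[str], max_seeds: int,
                     max_hops: int, max_branch: int) -> List[str]:
    """Breadth-first expansion with fixed caps and an ACL filter.

    `edges` maps a node to (relation, neighbour) pairs.
    """
    start = [s for s in seeds if s in visible][:max_seeds]
    reached = list(start)
    seen = set(start)
    frontier = deque((s, 0) for s in start)
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # hop cap reached for this path
        # ACL check happens before the branching cap is applied.
        allowed = [n for _, n in edges.get(node, []) if n in visible]
        for n in allowed[:max_branch]:
            if n not in seen:
                seen.add(n)
                reached.append(n)
                frontier.append((n, hops + 1))
    return reached


graph = {
    "e1": [("CAUSAL_FOLLOW", "e2"), ("TEMPORAL_NEXT", "e3")],
    "e2": [("STATE_SUCCESSOR", "e4")],
    "e4": [("SAME_EVENT", "e5")],
}
acl = {"e1", "e2", "e4", "e5"}  # e3 is outside the viewer's scope
result = bounded_traverse(["e1"], graph, acl,
                          max_seeds=1, max_hops=2, max_branch=2)
```

In this example `e3` is never returned because it is not visible, and `e5` is not reached because it lies beyond the hop cap, so the work done is fixed by the caps rather than by graph size.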
Illustrative example. Consider a household dialogue in which Alice first says, “I unplugged the coffee machine because it was leaking,” and many turns later asks, “Can you help me get something hot to drink?” A flat similarity-based retriever may over-focus on the lexical cue “hot” and retrieve general coffee-machine usage context, thereby missing the earlier safety-related state change. In contrast, MemLoom is designed to preserve the earlier unplugging event, its state consequence, and the later query as structurally connected memory units. Under this event-centric representation, the retrieval path can recover that the coffee machine is unavailable due to a prior state-changing event, leading to a safer and more contextually grounded response, such as recommending an alternative hot drink rather than suggesting reuse of the machine.
Ultimately, by centering on immutable snapshots, MemLoom addresses the tension between structural consistency and online latency. Within this design, online interactions remain responsive and write-light, while off-peak stewardship upgrades the global structure and preserves the stability of the live read path. The subsequent section empirically evaluates this architecture under claim-driven protocols across long-horizon recall, causal chain recovery, answer-level auditability, and end-to-end cost.
4. System Verification and Mechanism Diagnosis
An orthogonal verification strategy is adopted to examine the DLPS architecture in MemLoom. The mitigation effectiveness against DR1, DR2, and DR3 is systematically evaluated, as these design risks represent primary structural bottlenecks for LLM-based conversational agents. To guide this evaluation, our diagnosis is driven by four core research questions, i.e., RQ1 to RQ4:
RQ1 (Structuring capability): Can the off-peak stewardship establish robust event boundaries to prevent semantic fragmentation and over-merging, i.e., the foundation for DR1 and DR2?
RQ2 (Retrieval stability and reasoning consistency): Does the bounded dual-stream retrieval sustain signal survival and structural durability across long-horizon and multi-hop interactions?
RQ3 (Causal recovery and structural preservation): Can the curated event memory graph accurately reconstruct causal chains while preserving temporally traceable state transitions without collapsing historically distinct states into a single compressed representation?
RQ4 (Deployment viability): Does the asynchronous maintenance mechanism effectively mitigate DR3 while ensuring bounded latency?
4.1. Datasets and Evaluation Protocols
Herein, we select two external canonical benchmarks and a self-built controlled diagnostic suite (SCDS) to implement the orthogonal verification strategy. Moreover, a scale sweep simulation for infinite-horizon long-term accumulation is incorporated. These evaluations are designed to map the respective benchmarks and defensive contracts directly to the proposed RQs and DRs, confirming comprehensive coverage across structure, noise, logic, and scale dimensions.
4.1.1. Long-Meeting Structuring Benchmark (QMSum)
To address RQ1, the QMSum benchmark [
28] is applied to verify the fundamental ability of MemLoom in establishing event-level granularity, affirming a structural prerequisite for addressing DR1 and DR2. Exhibiting significant long-range features (averaging ∼575 turns per meeting across diverse domains) and high topic variability (averaging 4.26 topic spans), QMSum is utilized beyond traditional ROUGE metrics. Specifically, it serves as the segmentation contract against human ground-truth (GT) annotations. Because not all compared systems are segmentation-native, all outputs are first normalized into a common turn-level segmentation sequence before evaluation. For segmentation-native systems, boundaries are read directly from the produced event sequence. For non-segmentation-native baselines, a fixed boundary projection protocol is applied: native memory units are first aligned back to their supporting dialogue turns, and projected boundaries are then induced whenever the dominant assigned unit changes along the chronological turn sequence. For Mem0, the projected unit is the consolidated memory entry attached to each turn span; for GraphRAG, it is the dominant community-aligned summary region assigned to each turn. Standard probabilistic segmentation error metrics, such as
$P_k$ and WindowDiff (WD), are then employed to rigorously evaluate the difference between these normalized boundary projections and human annotations.
4.1.2. Long-Horizon Multi-Party Memory Benchmark (LoCoMo)
To address RQ2, the LoCoMo benchmark [
4] evaluates structural durability against the information fragmentation and catastrophic forgetting risks inherent in real-world interactions, featuring exceptionally long multi-session contexts (averaging ∼600 turns and 200 QA pairs per set). We employ three tasks across different cognitive levels: First, single-hop retrieval establishes the foundational retrieval capability for capturing discrete facts. Then, multi-hop reasoning verifies whether the curated event memory graph can mitigate DR1 by constructing effective bridging paths that reconnect semantically dispersed but contextually related event fragments without relying on shallow keyword co-occurrence. Finally, temporal reasoning checks if the immutable snapshot mechanism effectively preserves historical contexts. This specifically prevents historical states from being wrongly overwritten by new information—a common destructive update issue—thereby mitigating DR2.
4.1.3. Synthetic Causal Diagnostic Suite (SCDS)
To systematically address RQ3, we introduce the Synthetic Causal Diagnostic Suite (SCDS). Existing datasets inherently lack auditable ground-truth (GT) annotations for densely intertwined multi-party interactions, thereby limiting the precise diagnosis of DR1 and DR2. To bridge this gap while adhering to strict community standards for dataset transparency [
29,
30,
31,
32], SCDS avoids one-shot prompting in favor of a spec-driven multi-stage synthesis pipeline [
33,
34,
35,
36,
37]. As detailed in
Appendix C, this controlled protocol actively mitigates model collapse and data contamination risks via verification-centric quality gates and split isolation [
38,
39,
40,
41]. Furthermore, to circumvent the preference leakage and judgment bias inherent in LLM-as-a-judge methodologies [
42,
43], the suite explicitly grounds all evaluations in pointer-level gold-evidence annotations combined with deterministic scoring.
Built upon this rigorous foundational protocol, the full SCDS pool contains 453 causal diagnostic queries. Among them, 100 queries are reserved as an independent validation split for parameter tuning and sanity-checking, while the remaining 353 queries constitute the final reported diagnostic evaluation split. We deliberately calibrate this interaction length and density to effectively isolate structural reasoning failures from sheer context-window overflows. These diagnostic tasks evaluate the architecture’s capacity for topological recovery by testing whether sparse, semantically heterophilous signals can be reconstructed into an unbroken causal chain, thereby rigorously stressing DR1. Concurrently, the suite examines whether historically distributed evidence can be preserved and reconnected without being overwritten or structurally flattened during long-horizon memory maintenance, thereby stressing DR2. Together, these high-resolution stress tests provide controlled diagnostic evidence regarding the structural durability of the proposed dual-loop mechanism.
Importantly, SCDS is intended as a controlled diagnostic suite rather than as a universal benchmark for open-world conversational memory. Its evaluation ontology is intentionally aligned with the relation closure used in MemLoom because the suite is designed to isolate specific structural risks, especially causal rupture and state-flattening failures, under pointer-grounded verification. Accordingly, SCDS should be interpreted as high-resolution diagnostic evidence, whereas more natural datasets such as LoCoMo provide complementary evidence regarding broader retrieval durability and reasoning behavior.
4.2. Comparisons of Categorized Architectures
To precisely isolate the structural advantages of MemLoom, we compare it against four categories of baseline architectures representing different evolutionary stages. First, the Full-Context baseline establishes a theoretical upper bound for accurate model understanding but is rendered unsuitable for large-scale deployment due to substantial latency growth and strict input budgets. Second, the Chunk-Based Vector baseline, sweeping parameters from 128 to 8192 tokens, explores current industry limits; however, its inherent lack of semantic boundary awareness inevitably results in context fragmentation and topic drift. To resolve such fragmentation, GraphRAG provides macroscopic global understanding through community detection, yet it is primarily optimized for corpus-level sensemaking rather than event-versioned temporal preservation in dynamic dialogue. Finally, contemporary online agentic memory frameworks, such as Mem0 [
13] and Zep [
44], attempt to solve this latency via real-time LLM tool calls. Nevertheless, the tight coupling of read and write operations often introduces a synchronous maintenance bottleneck, identified as DR3, while their reliance on overwrite-to-latest state updates directly complicates temporal state preservation, denoted as DR2, during long-horizon updates or historical backtracking.
To establish a fair and practically viable cross-system comparison, all evaluations, except the theoretical full-context upper bound, were conducted under a strict bounded-budget contract. Rather than treating token limits as a capability ceiling of modern LLMs, we enforce a strict 8k-token contextual budget, denoted as
, and a bounded generation budget, denoted as
, to simulate latency-sensitive deployment environments. Comprehensive details of this evaluation contract are provided in
Appendix A.5. This contract further prohibits any system from relaxing retrieval limits to obtain marginal gains by enforcing strict boundaries across three dimensions: the context budget
, the retrieval budget
, and the generation budget
. Specifically,
sets a hard upper bound on the total prompt tokens inputted to the generation model, such as 8k tokens;
establishes a fixed limit on the number of initial retrieval seeds, specifically the Top-
K selection, and the maximum evidence payload, ensuring consistent rules for de-duplication and truncation; and
places a strict cap on the maximum decoding length, for instance, 512 new tokens, to support the reproducibility of experimental results.
To ensure deployment parity, all evaluated systems, except for the theoretical upper bound, strictly adhere to a maximum context budget of 8k tokens per query. Crucially, the Full-Context baseline reported in subsequent evaluations is exempt from this contract; it is included solely as an out-of-budget theoretical upper-bound reference to delineate the inherent reasoning limits of the LLM.
4.3. Evaluation Metrics and Diagnostic Signatures
Rather than reporting single-dimensional generation accuracy alone, a dual-reporting protocol of “Quality × Cost” is adopted to evaluate the architecture’s deployment viability in resource-constrained environments. For operational efficiency, end-to-end tail latency is tracked to diagnose DR3, as interactivity severely degrades when synchronous graph updates cause tail latency to spike. For retrieval quality, the turn-grounded recall at K measures the coverage of the retrieved results over the minimal sufficient source-turn evidence set, that is, the fraction of that evidence set contained in the top-K retrieved turns within our controlled suite. Crucially, rather than treating it as a universal operational requirement, we establish full coverage, specifically a recall of 1.0, as a diagnostic sufficiency gate within our controlled suite, where the target is the complete minimal evidence set rather than semantically related content alone. Failing to capture the complete necessary evidence set forces the LLM into an information blindness state, where downstream responses risk degrading into ungrounded parametric guessing rather than explicit, traceable reasoning.
To quantify event-boundary reconstruction accuracy, specifically addressing RQ1, and to examine structural tendencies related to DR2, we use the probabilistic segmentation error, denoted as $P_k$, and the WindowDiff metric, denoted as $\mathrm{WD}$, with respect to human annotations. These metrics are used here not merely as standard segmentation scores, but as diagnostic instruments for evaluating whether off-peak stewardship can reconstruct event boundaries without inducing either excessive fragmentation or structural flattening. The metrics are defined as follows:

$$P_k = \frac{1}{T-k}\sum_{i=1}^{T-k} \mathbb{1}\left[\delta_{\mathrm{ref}}(i, i+k) \neq \delta_{\mathrm{hyp}}(i, i+k)\right] \tag{6}$$

$$\mathrm{WD} = \frac{1}{T-k}\sum_{i=1}^{T-k} \mathbb{1}\left[\left|b_{\mathrm{ref}}(i, i+k) - b_{\mathrm{hyp}}(i, i+k)\right| > 0\right] \tag{7}$$

where $T$ denotes the total length of the normalized turn sequence, $k$ denotes the evaluation window size, the subscripts $\mathrm{ref}$ and $\mathrm{hyp}$ denote the reference and system-projected segmentation labels, respectively, and $\mathbb{1}[\cdot]$ denotes the indicator function. In Equation (6), $\delta(u, v) = 1$ if positions $u$ and $v$ belong to the same segment and 0 otherwise. In Equation (7), $b_{\mathrm{ref}}(i, i+k)$ and $b_{\mathrm{hyp}}(i, i+k)$ denote the number of segment boundaries observed within the same window in the reference and system-projected segmentation, respectively.
The key insight captured by Equations (6) and (7) is that event-boundary quality in long-horizon conversational memory should not be judged only by whether boundaries exist, but by whether the system preserves the correct local relational structure of nearby turns within a finite diagnostic window. Under this interpretation, $P_k$ measures whether the system agrees with the human reference about same-segment versus different-segment membership, whereas WindowDiff measures whether the local boundary density is preserved. We further define the boundary-count difference as $\Delta b = b_{\mathrm{hyp}} - b_{\mathrm{ref}}$, the system-projected boundary count minus the reference boundary count, to distinguish two common structural tendencies: $\Delta b > 0$ indicates semantic fragmentation, namely excessive sensitivity to local semantic shifts, whereas $\Delta b < 0$ indicates over-merging, where distinct narrative units are merged more often than in the human reference. In our diagnostic interpretation, persistent over-merging is treated as evidence consistent with DR2 because it reflects the collapse of historically distinct event boundaries into a flatter structure.
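For readers who want to reproduce the segmentation diagnostics, the sketch below implements the standard window-based forms of the probabilistic segmentation error and WindowDiff over per-turn segment labels, together with a boundary-count difference. The function names, the label-list input format, and the sign convention (system count minus reference count) are assumptions of this sketch rather than details confirmed by the paper.

```python
from typing import Sequence


def pk(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """Share of windows where ref and hyp disagree on same-segment status."""
    n = len(ref)
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n - k))
    return errors / (n - k)


def boundaries(labels: Sequence[int], i: int, j: int) -> int:
    """Number of segment boundaries between positions i and j."""
    return sum(labels[t] != labels[t + 1] for t in range(i, j))


def window_diff(ref: Sequence[int], hyp: Sequence[int], k: int) -> float:
    """Share of windows where the boundary counts differ."""
    n = len(ref)
    errors = sum(boundaries(ref, i, i + k) != boundaries(hyp, i, i + k)
                 for i in range(n - k))
    return errors / (n - k)


def boundary_count_difference(ref: Sequence[int],
                              hyp: Sequence[int]) -> int:
    """Positive: more system boundaries (fragmentation); negative: fewer."""
    return (boundaries(hyp, 0, len(hyp) - 1)
            - boundaries(ref, 0, len(ref) - 1))


ref_labels = [0, 0, 0, 1, 1, 1]  # one reference boundary
hyp_labels = [0, 0, 1, 1, 2, 2]  # two projected boundaries
pk_score = pk(ref_labels, hyp_labels, k=2)
wd_score = window_diff(ref_labels, hyp_labels, k=2)
delta = boundary_count_difference(ref_labels, hyp_labels)
```

With these toy labels the system places one extra boundary, so the count difference is positive, which corresponds to the fragmentation tendency rather than over-merging.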
For complex high-level cognitive tasks, specifically addressing RQ2 and RQ3, traditional lexical metrics exhibit important blind spots. We therefore assess reasoning consistency using an LLM-as-a-Judge metric, denoted as
J, which utilizes the evaluation prompt established in Mem0 [
13] and is estimated with 10-fold Monte Carlo verification to reduce sampling variance and mitigate survivorship bias. In addition, for the controlled causal diagnostics in SCDS, we retain attribution auditability, denoted as AA, as an answer-level, pointer-grounded auditability indicator. Unlike its former use in conflict-oriented settings, here AA measures whether the generated answer remains explicitly attributable to the gold supporting evidence and preserves causal traceability under long-range reasoning. Finally, for causal stewardship, strict chain recall, abbreviated as SCR, is defined as 1 when all turns in the gold causal chain are included in the top-K retrieved turns, and 0 otherwise. AA therefore complements the turn-grounded recall and SCR by distinguishing retrieval sufficiency from answer-level evidential faithfulness. A near-zero SCR despite a relatively high turn-grounded recall suggests that the system failed to recover a contiguous logical path across temporal gaps, indicating a causal rupture, formally identified as DR1. In addition, when retrieval coverage remains adequate but chain continuity or answer-level auditability still degrades, such a pattern is interpreted as evidence that historically distributed states were not preserved or reconnected faithfully, aligning with the temporal state degradation identified as DR2.
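The difference between coverage-style recall and strict chain recall can be seen in a short sketch. The function names and the list-of-turn-identifiers input format are assumptions; the definitions follow the verbal descriptions above (fraction of the gold evidence set in the top-K turns, and an all-or-nothing check on the gold causal chain).

```python
from typing import List, Sequence


def turn_recall_at_k(retrieved: Sequence[str], gold: Sequence[str],
                     k: int) -> float:
    """Fraction of the gold evidence set found in the top-k turns."""
    top = set(retrieved[:k])
    gold_set = set(gold)
    return len(top & gold_set) / len(gold_set)


def strict_chain_recall(retrieved: Sequence[str], chain: Sequence[str],
                        k: int) -> int:
    """1 only if every turn of the gold causal chain is in the top-k."""
    return int(set(chain) <= set(retrieved[:k]))


retrieved_turns: List[str] = ["t7", "t2", "t9", "t4"]
gold_chain: List[str] = ["t2", "t4", "t5"]
recall = turn_recall_at_k(retrieved_turns, gold_chain, k=4)
scr = strict_chain_recall(retrieved_turns, gold_chain, k=4)
```

Here two of the three chain turns are retrieved, so recall is moderately high while SCR is 0, which is the pattern the text associates with a causal rupture.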
5. Simulations and Experiments
5.1. Verifying Structuring Craft (QMSum)
To address RQ1, human annotations from the QMSum dataset serve as the ground-truth reference for evaluating whether the off-peak stewardship mechanism can reconstruct event boundaries. As detailed in
Table 2, the rigid fixed-window slicing employed by the chunk-based baseline fails to preserve event granularity, resulting in a fragmentation rate of 34.2% and a high
error of 0.604. Conversely, GraphRAG relies heavily on topological density rather than temporal boundaries. This aggregation tendency leads to an over-merging rate of 28.4%, increasing the risk of structural flattening and historical boundary loss, thereby exacerbating DR2. By completely decoupling complex structural curation from the online interaction loop, MemLoom reduces the
error to 0.375. Furthermore, removing the off-peak stewardship module causes the
error to rise to 0.513, indicating that asynchronous stewardship is critical for maintaining structural integrity. Under the diagnostic interpretation defined by Equations (
6) and (
7), the lower
$P_k$ and WindowDiff indicate that MemLoom more faithfully reconstructs event boundaries, while the lower over-merging tendency relative to graph-summarization baselines suggests stronger resistance to the structural flattening associated with DR2.
Takeaway: These QMSum results indicate that off-peak stewardship is not merely a maintenance utility, but the structural foundation that allows MemLoom to reconstruct event boundaries with lower fragmentation and reduced over-merging.
5.2. Reasoning Consistency and Latency Trade-Offs for RQ2 on LoCoMo
In this section, the LoCoMo benchmark [
4] is utilized to evaluate the mitigation effectiveness of MemLoom in addressing RQ2. Specifically, we verify whether the BDSR architecture can mitigate the logical limitations of temporal and multi-hop reasoning inherent in traditional RAG while maintaining bounded latency. The comparative results are presented in
Table 3 and
Table 4. To reduce fairness concerns, all major systems discussed in this section were re-evaluated under the same local setup, and the resulting comparison trends were found to be broadly consistent with previously reported patterns.
5.2.1. Single-Hop Performance: The Cost of Abstraction
For single-hop queries, the Mem0 baseline achieves the highest score in Table 3, marginally exceeding MemLoom. This result reflects the advantage of Mem0’s dense-vector retrieval, which preserves fine-grained lexical detail directly from the raw text. MemLoom’s lower score reflects the lossy compression inherent in an event-centric architecture: off-peak stewardship abstracts continuous dialogue into discrete semantic nodes and occasionally omits granular unstructured details. MemLoom nevertheless remains competitive because the online sentence buffer within the dual-stream mechanism acts as a compensatory layer that limits the loss of lexical detail.
5.2.2. Complex Reasoning: Structural Consistency
Compared with the reference results in Table 3, MemLoom exhibits a design-consistent advantage on the temporal and multi-hop subsets while remaining competitive on single-hop questions. For temporal reasoning, MemLoom scores higher than the listed agentic baselines. We interpret this pattern as consistent with MemLoom’s design emphasis on event-level structure and snapshot-based history preservation: in-place updating in online memory systems may be less favorable when historically distinct states and intermediate transitions must remain explicitly traceable.
For multi-hop reasoning, MemLoom also compares favorably with the reference results. We interpret this result as consistent with the use of a curated event memory graph and bounded topology-aware traversal, which together provide a clearer structural substrate for long-range reasoning than purely flat or runtime-induced memory organization. Because these systems were re-evaluated under the same local setup, the consistency of the temporal and multi-hop trends strengthens the interpretation that MemLoom’s event-structured retrieval design is beneficial for long-horizon reasoning.
5.2.3. Efficiency Analysis and Bounded Latency
As shown in the latency analysis, Mem0 exhibits an extremely low total tail latency owing to its lightweight flat architecture. Under the bounded-budget answer-serving setting reported in Table 4, the total latency of MemLoom is slightly higher than that of Mem0, yet remains far below the prohibitive delay of the full-context baseline. This moderate overhead reflects the additional cost of structure-aware retrieval and grounding, rather than synchronous graph maintenance on the live serving path. Because heavyweight topology construction, relation synthesis, and snapshot publication are shifted to the off-peak stewardship loop, the online path remains bounded even under long-horizon interaction pressure. This empirical result therefore supports the core DLPS claim: MemLoom can practically decouple latency-sensitive answer serving from consistency-critical structural maintenance under the tested deployment budget.
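The decoupling described above can be illustrated with a minimal sketch of immutable snapshot publication. This is a hypothetical illustration rather than MemLoom's implementation: `Snapshot`, `SnapshotStore`, and their fields are assumed names, and the sketch relies on reference assignment being atomic in CPython so that the reading side never takes a lock.

```python
import threading
from dataclasses import dataclass
from types import MappingProxyType
from typing import Dict, List, Mapping, Tuple


@dataclass(frozen=True)
class Snapshot:
    version: int
    # Hypothetical layout: event id -> ids of supporting sentences.
    events: Mapping[str, Tuple[str, ...]]


class SnapshotStore:
    """Online readers see one complete snapshot; the steward swaps in the next."""

    def __init__(self) -> None:
        self._current = Snapshot(0, MappingProxyType({}))
        self._publish_lock = threading.Lock()  # serializes publishers only

    def read(self) -> Snapshot:
        # Online path: a single reference read, never blocked by curation.
        return self._current

    def publish(self, events: Dict[str, List[str]]) -> Snapshot:
        # Off-peak path: build the full structure first, then swap the reference.
        with self._publish_lock:
            frozen = MappingProxyType({k: tuple(v) for k, v in events.items()})
            snapshot = Snapshot(self._current.version + 1, frozen)
            self._current = snapshot
            return snapshot


store = SnapshotStore()
before = store.read()                    # what an in-flight request holds
store.publish({"e1": ["s1", "s2"]})      # steward finishes a curation cycle
after = store.read()                     # what the next request sees
```

Because a request keeps the snapshot it started with, a partially built graph is never visible on the serving path; the cost of curation shows up only as the delay before a new version is published.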
Furthermore, MemLoom deliberately incurs a higher storage footprint by maintaining both sentence-level evidence and event-level structure. We interpret this as a memory-for-durability trade-off: additional storage cost is exchanged for stronger temporal continuity, causal recoverability, and answer traceability in long-horizon multi-party settings. Taken together, these results suggest that MemLoom occupies a practical middle ground in the quality–latency trade-off. Although it is not optimal on every isolated metric, it preserves bounded serving latency while maintaining stronger temporal and multi-hop structural consistency than flat retrieval systems.
Takeaway: Under the shared bounded-budget setting, MemLoom preserves bounded serving latency while sustaining stronger temporal and multi-hop structural consistency than flatter retrieval architectures, providing direct empirical support for the DLPS design objective.
5.3. Causal Stewardship and Structural Preservation (SCDS)
In this section, the causal diagnostic questions within SCDS are utilized to verify whether the system can recover chain-complete causal evidence under turn-grounded sufficiency and preserve historically distributed states against structural flattening. These scenarios are characterized by semantic heterophily and long-range temporal dispersion. As shown in
Table 5, the dual-loop design substantially affects whether a system can maintain causal continuity, answer-level auditability, and historical traceability under this controlled diagnostic setting.
A noticeable degradation in causal recoverability is observed in the Mem0 baseline. This pattern may reflect a mismatch between compact online memory consolidation and the need to preserve chain-complete, temporally grounded provenance. When explicit event versioning is unavailable, its inference-based update mechanism may be less suitable for retaining remote root-cause turns and intermediate historical states. Consequently, its recall score drops to 0.36, its SCR reaches only 0.12, and its AA remains limited at 0.30. This suggests that systems relying on in-place state-overwrite mechanisms have difficulty recovering long-range causal chains and may also induce structural erasure of earlier states under long-horizon updates.
Although GraphRAG achieves a relatively high retrieval rate through dense graph indexing, its core community aggregation procedure introduces a distinct design trade-off in causal stewardship. GraphRAG tends to group semantically related updates into the same broad summary region based on topological density rather than precise temporal boundaries. Consequently, although its recall remains relatively high at 0.78, its SCR reaches only 0.35 and its AA remains at 0.50. This indicates that relatively high node recall in isolation does not guarantee chain-complete causal recovery or answer-level evidential faithfulness. In this sense, GraphRAG may exhibit Topological Diffusion, a structural smoothing of specific directional causal paths under broad community summarization. Without a curated chronological backbone, such systems may be more prone to generating narratives that are superficially coherent but causally misaligned.
In contrast, MemLoom reconstructs long-range causality through structured traversal within the curated event memory graph. By navigating explicit logical paths rather than relying solely on semantic proximity, it shows a strong capacity to mitigate DR1. At the same time, its immutable snapshot mechanism and lineage-preserving event lifecycle help maintain historically distinct states without destructive overwrite, thereby reducing DR2. As a result, MemLoom attains strong scores on all three metrics in Table 5, including an SCR of 0.72 and an AA of 0.80. We interpret this result as consistent with MemLoom’s topology-aware retrieval design. These findings should nevertheless be interpreted within the intended scope of SCDS as a controlled diagnostic suite rather than as a substitute for broader real-world evaluation. Because SCDS is deliberately constructed to stress the structural failure patterns targeted by the present relation closure, part of the observed advantage may reflect diagnostic ontology alignment. We therefore treat the SCDS results as mechanism-level evidence for structural recovery, while relying on LoCoMo and QMSum as complementary evidence for more natural long-horizon reasoning and segmentation behavior.
Takeaway: Within the intended scope of SCDS as a controlled diagnostic suite, MemLoom shows stronger mechanism-level evidence for causal-chain recovery and answer-level auditability than baselines that rely on overwrite-oriented or community-smoothed memory organization.
5.4. Deployability and Boundedness Verification for RQ4
This section addresses RQ4 by evaluating the deployment viability of MemLoom and, more specifically, by testing the central architectural claim of DLPS: whether latency-sensitive answer serving can remain bounded while consistency-critical structural maintenance is deferred to asynchronous off-peak stewardship. As illustrated in
Figure 7, MemLoom demonstrates this deployability through two stress tests that jointly examine bounded tail latency and intra-cycle robustness during temporary structural lag.
First, the scale sweep test shows that MemLoom mitigates DR3 by offloading heavyweight structural curation to the off-peak stewardship layer under the dual-loop architecture. Compared with offline graph summarization approaches such as GraphRAG, which exhibit a prohibitive latency of 15.0 s during continuous interaction, MemLoom maintains a bounded operating range from 1.35 s in the mature state to 1.67 s during the intra-cycle buffering state. The upper endpoint of this range is consistent with the bounded answer-serving latency reported in the LoCoMo evaluation. This result is therefore not merely a latency observation; it directly supports the DLPS claim that user-facing serving can remain bounded even when structural curation is retained as an asynchronous background process.
The second evaluation is a freshness failsafe verification that uses the SCDS diagnostic suite to measure performance during the latency gap before an off-peak topology update. As shown in Figure 7, even when the curated event memory graph is not yet synchronized with the latest conversational turns, MemLoom maintains competitive causal robustness through its online sentence-level buffer and dual-stream mechanism, achieving an AA of 0.78 and an SCR of 0.60. These figures substantially outperform the chunk-based vector baseline. The chunk-based values are reused from the same S1 reference in Table 5 because this baseline has no asynchronous freshness state; its intra-cycle and steady-state behavior are therefore identical under our protocol.
This controlled decline, from a mature peak of AA 0.80 and SCR 0.72 to 0.78 and 0.60, respectively, demonstrates graceful degradation under temporary structural lag. In the worst-case scenario where the off-peak stewardship cycle remains incomplete, the dual-stream online loop acts as a failsafe, providing foundational causal retrieval and answer-level auditability that still exceed those of traditional RAG systems relying solely on flat retrieval. This suggests that the dual-loop architecture mitigates the risks associated with the “information freshness gap” while preserving bounded responsiveness.
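The failsafe behavior can be sketched as a retrieval routine that merges two streams: evidence reachable through the last published graph and raw sentences that arrived after it. The sketch below is a simplified illustration under stated assumptions; `Sentence`, `dual_stream_retrieve`, the keyword-match lookup, and the `budget` parameter are hypothetical stand-ins and do not reproduce MemLoom's router or traversal logic.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Sentence:
    turn: int  # position of the utterance in the conversation
    text: str


def dual_stream_retrieve(
    query_terms: Iterable[str],
    graph_index: Dict[str, List[Sentence]],
    graph_synced_turn: int,
    sentence_buffer: List[Sentence],
    budget: int,
) -> List[Sentence]:
    terms = [t.lower() for t in query_terms]
    # Stream 1: evidence reachable through the published (possibly stale) graph.
    curated = [s for t in terms for s in graph_index.get(t, [])]
    # Stream 2: raw sentences newer than the last published snapshot.
    fresh = [
        s for s in sentence_buffer
        if s.turn > graph_synced_turn and any(t in s.text.lower() for t in terms)
    ]
    # Deduplicate by turn, respect the retrieval budget, return in time order.
    selected: Dict[int, Sentence] = {}
    for s in curated + fresh:
        if len(selected) >= budget:
            break
        selected.setdefault(s.turn, s)
    return sorted(selected.values(), key=lambda s: s.turn)


graph = {"budget": [Sentence(3, "Ana proposed cutting the budget.")]}
buffer = [
    Sentence(3, "Ana proposed cutting the budget."),
    Sentence(9, "Ben rejected the budget cut."),
]
hits = dual_stream_retrieve(["budget"], graph, graph_synced_turn=5,
                            sentence_buffer=buffer, budget=4)
```

In this toy setting the graph was last synchronized at turn 5, so the rejection at turn 9 is recovered only through the sentence buffer. Curated evidence is placed first under the budget, which is one possible policy; a deployment could equally reserve part of the budget for fresh turns.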
Takeaway: These deployment-oriented stress tests support the claim that MemLoom can maintain bounded answer serving while degrading gracefully during temporary structural lag, which is the central practical benefit of DLPS.
6. Ablation Study
This section reports the measured ablation results used to assess the architectural necessity of MemLoom’s major functional components. The analysis is based on observed performance changes after removing or weakening one module at a time under the corresponding full-model evaluation protocol. The resulting ablation values are summarized in Table 6, which serves as the primary evidence source for the module-level discussion below. To avoid redundancy, the QMSum-based steward ablation is discussed separately in Section 5.1 and is therefore not repeated in Table 6.
Removing the event steward module causes a clear degradation in structural boundary quality on QMSum, driving the error to 0.513. This regression moves the system toward a fragmented state that approaches the behavior of a standard chunk-based pipeline. The result identifies the steward as the structural foundation for downstream logical operations: without the event-boundary discipline established during ingestion and off-peak curation, higher-order reasoning lacks stable event units to attach to. As discussed in Section 5.1, this QMSum-based steward ablation is reported separately from Table 6 because it targets boundary reconstruction rather than causal recovery or deployment latency.
Removing the topology module weakens answer-level traceability on SCDS, with AA decreasing from 0.80 to 0.68, indicating weaker recoverability of distant supporting evidence. These measured results show that even if a system retains local semantic relevance, the lack of a long-range contextual skeleton makes it more difficult to reconnect temporally distant evidence into an auditable answer path. Under this measured ablation, the observed decline should be interpreted as evidence that the topology module contributes to long-range evidence organization and answer-level traceability.
Removing the logic track causes SCDS SCR to decline from 0.72 to 0.39, while AA drops from 0.80 to 0.55. This larger degradation relative to the topology-only ablation indicates a distinction between the loss of contextual background and the loss of a structural reasoning mechanism. Whereas the topology module provides the broad contextual skeleton, the logic track is the primary module designed to bridge semantically separated but causally connected states and to prevent their collapse into structurally flattened interpretations. Once this mechanism is bypassed, the generation model frequently produces locally plausible yet globally misaligned answers. Under this measured ablation, the observed decline should be interpreted as evidence that the logic track is the primary mechanism for recovering semantically separated but causally linked states into a chain-complete reasoning path.
Removing the asynchronous off-peak stewardship mechanism forces the architecture to regress to a synchronous blocking mode, increasing client-side latency from 1.67 s to 8.34 s, which corresponds to an approximately 399.4% relative increase. In architectural terms, this ablation isolates the necessity of DLPS rather than merely showing the usefulness of one optional module. Once off-peak stewardship is removed, the intended decoupling between answer serving and structural maintenance collapses, and the online path is forced to absorb the full cost of graph updating. This result therefore reinforces the claim that dual-loop asynchronous offloading is not a cosmetic optimization, but a necessary architectural condition for bounded structured-memory serving in real-time environments.
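For reference, the relative increase quoted above follows directly from the two reported latencies:

```latex
\[
\frac{8.34\,\mathrm{s} - 1.67\,\mathrm{s}}{1.67\,\mathrm{s}} \times 100\%
= \frac{6.67}{1.67} \times 100\%
\approx 399.4\%
\]
```

Equivalently, the synchronous variant is roughly five times slower on the client side, since 8.34 / 1.67 is approximately 4.99.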
Taken together, the ablation values in
Table 6 show that the major components of MemLoom are functionally non-redundant. The topology module contributes to long-range evidence organization and answer-level traceability, the logic track supports chain-complete causal recovery and answer-level evidential faithfulness, and the off-peak stewardship mechanism is necessary for maintaining bounded serving latency under long-horizon interaction. Under this interpretation, the ablation study supports the view that MemLoom is not an ad hoc stack of loosely coupled modules, but a coordinated architecture in which each component addresses a distinguishable structural risk.
7. Discussion and Limitations
The principal strategy in MemLoom is grounded in latency economics, which reflects a strategic trade-off between real-time responsiveness and deep logical structuring. The evaluations suggest that the MemLoom architecture can practically balance causal traceability, structural preservation, and bounded latency in long-horizon interactions. By appropriately deferring the computationally intensive neuro-symbolic graph synthesis to the off-peak stewardship loop, DR3 is effectively mitigated, thereby bounding the online tail latency to approximately 1.67 s under the bounded answer-serving setting. Notably, a substantial portion of the remaining delay is attributable to the cloud API communication overhead of the semantic router; deploying a local LLM or dedicated server in the future could further reduce this overhead. Although encapsulating raw utterances into abstract event nodes inevitably introduces some lossy compression, the sentence-level substrate implemented within bounded dual-stream retrieval can partially compensate for this effect.
Real-time conversational memory also operates under unavoidable uncertainty. In practice, ambiguity may arise from incomplete utterances, unstable references, delayed clarifications, or imperfect LLM-mediated relation inference. MemLoom does not assume that these uncertainties can be eliminated at the input level. Instead, the architecture is designed to contain uncertainty within bounded and auditable stages. First, the bounded policy vector constrains the scope of online routing and off-peak reasoning, preventing unbounded expansion of uncertain inferences. Second, contract validation acts as a rule-based validation filter over probabilistic candidate relations before publication. Third, sentence-level grounding preserves direct access to raw supporting spans, allowing error tracing even when event abstraction is lossy. Finally, immutable snapshot serving prevents partially updated or weakly validated structures from directly contaminating the live retrieval path. In this sense, the present framework prioritizes operational containment and traceability of uncertainty rather than full probabilistic uncertainty modeling.
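The second containment stage, contract validation, can be illustrated as a rule-based filter applied to probabilistic candidate relations before publication. The sketch below is hypothetical: the relation names in `ALLOWED_RELATIONS` are placeholders rather than MemLoom's actual four-relation closure, and the specific rules and the `min_confidence` threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Placeholder relation names; the actual relation closure is not reproduced here.
ALLOWED_RELATIONS: FrozenSet[str] = frozenset(
    {"causes", "updates", "supports", "contradicts"}
)


@dataclass(frozen=True)
class CandidateRelation:
    source_turn: int
    target_turn: int
    relation: str
    confidence: float                  # score assigned by the LLM-based extractor
    evidence_pointers: Tuple[int, ...]  # turns of the supporting sentences


def validate(candidate: CandidateRelation, min_confidence: float = 0.7) -> List[str]:
    """Return the list of violated rules; an empty list means publishable."""
    violations: List[str] = []
    if candidate.relation not in ALLOWED_RELATIONS:
        violations.append("relation outside closure")
    if candidate.source_turn >= candidate.target_turn:
        violations.append("source does not precede target")
    if candidate.confidence < min_confidence:
        violations.append("confidence below threshold")
    if not candidate.evidence_pointers:
        violations.append("no sentence-level evidence")
    return violations


def publishable(
    candidates: List[CandidateRelation], min_confidence: float = 0.7
) -> List[CandidateRelation]:
    # Only candidates that satisfy every rule reach the published snapshot.
    return [c for c in candidates if not validate(c, min_confidence)]


good = CandidateRelation(2, 7, "causes", 0.9, (2, 7))
bad = CandidateRelation(7, 2, "causes", 0.9, ())
accepted = publishable([good, bad])
```

Returning the violated rules, rather than a bare boolean, keeps rejected candidates auditable, which matches the emphasis on traceable containment of uncertainty.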
Furthermore, the necessity of these defensive boundaries is supported by the ablation study. Specifically, removing the off-peak stewardship mechanism increases exposure to DR3, while weakening the event-boundary, topology, and logic-track safeguards increases the risks of DR1 and DR2. Hence, MemLoom can be understood as a decoupled architectural framework rather than a monolithic system with a trial-and-error stack of empirical features. Nevertheless, several limitations remain for broader real-world deployment. First, part of the evaluation relies on SCDS, which is intentionally diagnostic and controlled. While this suite is valuable for isolating DR1- and DR2-related structural failures under pointer-grounded verification, it does not replace large-scale real-world evidence. Moreover, because the SCDS ontology is deliberately aligned with the compact relation closure adopted in MemLoom, some portion of the observed gain may reflect this diagnostic alignment rather than a fully general architectural advantage. We therefore do not present the four-relation closure as a universal ontology for open-world conversational causality; instead, it should be understood as a compact and validation-bounded structural substrate chosen to balance expressiveness, auditability, and maintenance tractability. Although the complementary LoCoMo and QMSum results suggest that this restricted closure remains useful beyond the synthetic suite, the precise effect of this ontology choice on more natural open-ended datasets has not yet been independently quantified and remains an important direction for future work. Second, although the major LoCoMo systems were re-evaluated under our unified local setup to improve fairness, cross-system comparison should still be interpreted with appropriate caution because absolute values may remain sensitive to implementation details, model versions, prompt design, and evaluation configuration. 
Third, control policy parameters, such as routing thresholds and token budgets, are currently managed by static heuristics without precise cost adaptation. Finally, the current MemLoom framework is exclusively tailored for text-based transcripts managed by a single centralized steward, leaving multimodal inputs and decentralized multi-agent architectures as important open directions.
MemLoom also explicitly incurs a higher storage overhead than naive window-based retrieval because it maintains event graphs, lineage registries, and sentence-level evidence simultaneously. We treat this as a deliberate memory-for-durability trade-off rather than as an incidental cost: additional storage is exchanged for more traceable long-horizon causal continuity, historically traceable state evolution, and answer-level auditability. Importantly, this overhead is not exposed uniformly on the live serving path; in addition, off-peak curation cost appears mainly as background compute demand, validation workload, snapshot-publication overhead, and storage growth during stewardship cycles rather than as per-turn tail-latency inflation. Through lifecycle-aware memory organization, active event structures remain on the online retrieval surface, whereas archived events are migrated to colder storage once they exceed their relevance horizon or memory pressure requires compaction. Likewise, lineage records serve primarily as provenance support rather than as the main high-frequency query plane, and sentence-level memory is accessed through pointer-based grounding rather than broad full-text traversal in every request. Even so, the precise long-horizon resource envelope of this design has not yet been fully profiled under long-running deployment horizons, and systematic large-scale stress testing remains an important direction for future work.
8. Future Work
The dual-loop defensive capabilities of MemLoom have so far been examined only preliminarily. To reduce the marginal computational cost and manual tuning effort currently required, future work can prioritize the dynamic optimization of the stewardship mechanisms. Control policy vectors covering router thresholds, temporal decay parameters, and retrieval budgets can be formulated as optimizable variables within an automated optimization framework. Specifically, multi-fidelity resource allocation strategies such as Hyperband [46] and BOHB [47] can be deployed during the offline batch evaluation phase to rapidly screen high-quality parameter configurations under limited computational budgets. Moreover, rather than serving as a direct hyperparameter optimization tool, MAML [48] can be explored to learn favorable model parameter initializations, enabling the system to warm-start and rapidly adapt to novel interaction environments. For hyperparameters that require temporal adaptation, population-based training (PBT) [49] can be evaluated to asynchronously discover dynamic scheduling policies within the offline background maintenance phase. As more comprehensive interaction datasets become available, the current restriction that these optimization procedures remain strictly outside the online inference path can be relaxed. In addition, open information extraction (Open IE) can be introduced into the off-peak stewardship loop, enabling the system to induce a broader and more open-ended set of event relations and thereby improving its sensitivity to the heterogeneity of real-world temporal and social topologies.
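As a rough illustration of the multi-fidelity screening idea, the sketch below implements successive halving, the subroutine that Hyperband repeats over several brackets. It is neither full Hyperband nor BOHB, and it is not part of the reported system: the `Policy` layout, the `router_threshold` field, and `toy_score` are hypothetical stand-ins for an offline batch evaluation of a control policy vector.

```python
from typing import Callable, Dict, List

Policy = Dict[str, float]


def successive_halving(
    configs: List[Policy],
    evaluate: Callable[[Policy, int], float],
    min_budget: int = 1,
    eta: int = 3,
) -> Policy:
    """Keep the top 1/eta of configurations each round while raising the budget."""
    if not configs:
        raise ValueError("at least one configuration is required")
    if eta < 2:
        raise ValueError("eta must be at least 2")

    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Higher score is better; cheap evaluations prune weak candidates early.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta  # survivors earn a larger evaluation budget
    return survivors[0]


def toy_score(policy: Policy, budget: int) -> float:
    # Hypothetical objective; a real evaluator would use `budget` to decide
    # how many held-out dialogues to replay for this policy.
    return -abs(policy["router_threshold"] - 0.6)


grid = [{"router_threshold": t / 10} for t in range(1, 10)]
best = successive_halving(grid, toy_score)
```

Because each round discards most candidates before the budget grows, the total evaluation cost stays close to that of fully evaluating only a few configurations, which is the property that makes this family of methods suitable for the offline batch phase.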
Another important extension concerns event explainability in LLM-based memory systems. In the current version, MemLoom emphasizes evidence-grounded traceability and pointer-level auditability, but it does not yet provide an explicit explanation layer for why an event was formed, why a relation was validated, or how a retrieved answer depends on a particular sequence of event evolution. Future work may therefore extend the framework with edge-confidence estimation, uncertainty-aware explanation, and more interpretable visualization of event-state evolution so that event formation, relation validation, and answer grounding become not only traceable but also more directly explainable to users and developers.
In this work, we prioritize operational traceability and bounded structural governance over explicit probabilistic uncertainty modeling. Developing a fully quantified uncertainty framework for event confidence, edge reliability, and retrieval-time risk propagation remains an important direction for future work.
Prematurely introducing multimodal or multi-agent networks before sufficiently resolving textual topological ruptures and historical state fragmentation may unnecessarily increase system complexity. Hence, after establishing a more robust logical substrate in the pure-text setting, the current MemLoom framework can be extended in a more stable manner. In future work, multimodal signals and decentralized multi-agent knowledge sources can be incorporated, such that the integration of visual states and audio features can further evolve MemLoom toward multimodal event graphs for embodied AI.
9. Conclusions
For emerging embodied AI and autonomous agents, long-horizon multi-party interactions are considered critical scenarios for examining the memory boundaries of large language models (LLMs). Herein, traditional retrieval-augmented generation (RAG) systems are frequently susceptible to the design risks identified in this study, specifically DR1, DR2, and DR3. To simultaneously address these challenges, the MemLoom architecture is proposed. Through a dual-loop publish–subscribe architecture, latency-sensitive online interactions are structurally decoupled from consistency-critical off-peak stewardship. Rather than claiming to definitively resolve all structural tensions in long-horizon conversational memory, the proposed architecture is intended as a practical balance point between real-time responsiveness, structural stewardship, and traceable retrieval under bounded deployment conditions.
Through controlled evaluations and measured ablation analyses, the present study examines how MemLoom supports event-centric structuring, causal recovery, historical state traceability, and bounded online serving within the target multi-party setting. The resulting evidence suggests that curated event memory graphs and contract-bound lifecycle control can improve structural durability and answer traceability, while asynchronous stewardship helps prevent live serving latency from being dominated by graph maintenance cost. These findings should be interpreted in light of the bounded-budget and unified local evaluation protocol adopted in this study; they therefore provide architecture-consistent evidence rather than a definitive universal ranking across all memory systems. Under this scope, MemLoom may serve as a useful architectural reference for long-horizon conversational agents that require stronger temporal continuity, structural auditability, and bounded deployment behavior.