Article

ILSTMA: Enhancing Accuracy and Speed of Long-Term and Short-Term Memory Architecture

1 School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-Tech University, Hangzhou 310018, China
2 School of Cyber Science and Engineering, Ningbo University of Technology, Ningbo 315211, China
3 School of Computer and Data Engineering, Ningbo Tech University, Ningbo 315199, China
* Author to whom correspondence should be addressed.
Information 2025, 16(4), 251; https://doi.org/10.3390/info16040251
Submission received: 10 March 2025 / Revised: 18 March 2025 / Accepted: 19 March 2025 / Published: 21 March 2025

Abstract

In recent years, the rapid development of large language models (LLMs) has led to a growing consensus in the industry regarding the integration of long-term and short-term memory. However, the widespread application of long-term and short-term memory systems faces two significant challenges: increased execution time and decreased answer accuracy from LLMs. To tackle these challenges, we propose the ILSTMA. This architecture uniquely combines fundamental theories of human forgetting with classical operating system principles, providing an acceleration method that does not rely on traditional memory retrieval algorithms and is instead based on the systematic planning of available memory space. Furthermore, our proposed most relevant dialogue retrieval process substantially enhances the answer accuracy of LLMs while examining the potential of the two most commonly used memory retrieval algorithms. Experimental results demonstrate that our acceleration method improves the execution efficiency of the original architecture by 21.45%, and our most relevant dialogue retrieval process raises the answer accuracy to 88.4%, surpassing several benchmarks. These findings validate the high performance of the ILSTMA.


1. Introduction

The advent of large language models (LLMs) has radically transformed the domain of natural language processing. State-of-the-art models such as ChatGPT [1], LLaMA [2], ChatGLM [3], and GPT-4o [4] have drastically altered production patterns in a variety of social sectors. However, LLMs are constrained by their inability to access information beyond their contextual window boundaries [5,6], which significantly hinders their development. To overcome this limitation, the integration of external long-term memory (LTM) for storage has become the conventional approach to exceed the constraints of LLM storage capacity [7,8]. However, this approach introduces two hidden risks: a decrease in the accuracy of model responses and an increase in system execution time. This primarily stems from the incorporation of long-term memory, which adds a crucial and time-consuming process: the retrieval of the most relevant dialogues corresponding to the user’s specific query. Specifically, when a user poses a question, the system must invoke complex memory retrieval algorithms to locate the dialogues that are most relevant within the long-term memory. This retrieval process is inherently complex and involves extensive data comparison and filtering. Consequently, if the retrieved dialogues lack sufficient relevance to the user’s query, the accuracy of the model’s responses is likely to be adversely affected. Moreover, this memory retrieval process can lead to an increase in the overall response time of the system, as the model is required to spend additional time retrieving information before generating the final response. In summary, although the introduction of memory aims to enhance model performance through rich historical data, its complexity and time cost may undermine the effectiveness of the implementation, thereby impacting user experience.
Some studies, such as MemoryBank [9] and Generative Agents [10], have implemented strategies to periodically abstract low-dimensional information into high-dimensional information to enhance the model’s efficiency in utilizing information and to delay reaching the threshold of memory accumulation. TiM [11] excludes redundant and semantically conflicting information through semantic recognition to reduce the spatial complexity of retrieval targets.
These methods significantly improve the accuracy of memory retrieval. However, there are still some limitations. One key issue is the excessive reliance on the high-dimensional information summarization capabilities of LLMs, which increases the instability of the system. Even the most outstanding models cannot guarantee exceptional performance in every instance of this task. Moreover, as current long-term memory systems are still being gradually refined, the precision of memory retrieval algorithms is still improving. In addition, we observe that the vast majority of studies lack optimization for the average execution time in the process of locating the most relevant dialogues. This is mainly because the average execution time is largely determined by the choice of the memory retrieval algorithm. Virtually all existing algorithms require the text to undergo vector embedding followed by similarity computations. Optimizing algorithms at this level can be extremely challenging, especially given the enormous computational cost associated with large language models that can encompass hundreds of billions of parameters. However, the memory retrieval process is the most critical component of long-term memory systems and cannot be omitted. This creates a dilemma: the focus on optimizing execution efficiency lies within memory retrieval algorithms, yet improvements to these algorithms are notably difficult to achieve. Therefore, it is crucial to develop an architecture that can further enhance the accuracy of LLM responses while effectively reducing the average execution time. Based on these considerations, we developed the ILSTMA, a more accurate and faster long-term and short-term memory architecture.
This study presents, for the first time, a novel design for the spatial layout of both short-term and long-term memory within a systematic architecture. Long-term memory is abstracted into a global memory table, encapsulating historical dialogues along with the corresponding metadata as table entry objects and treating all long-term memory operations as table operations. Short-term memory is classified into five major segments based on the types of information, collectively providing reference information for the LLM’s answers. Building on this foundation, we explore the application potential of the two most widely used memory retrieval algorithms and propose an indexing-based memory retrieval modification algorithm to enhance retrieval accuracy. To alleviate execution time bottlenecks, the ILSTMA introduces a caching prefetch mechanism, which scores dialogues using caching prefetch indicators derived from human forgetting theory. Based on these scores, historical dialogues from long-term memory are cached in short-term memory. Through this technique, the system can potentially bypass the high-complexity memory retrieval process and directly answer user queries, thereby reducing the average execution time.
Drawing on the foundational principles of human forgetting theory is motivated by the fact that the ultimate users of long-term and short-term memory systems are humans. The essence of the caching prefetch mechanism is to enable the system to predict the information that will be used next from a human perspective. Therefore, designing a caching prefetch indicator informed by human forgetting theory helps align the system with human cognitive processes, thereby enhancing the hit rate of the caching prefetch mechanism. Specifically, this study starts with Ebbinghaus’s forgetting curve [12] and applies the forgetting theory, which posits that humans exhibit strong memory consolidation in the early stages of learning. This consolidation weakens with repeated recall, ultimately stabilizing [13,14,15,16,17]. By refining the original forgetting curve using this principle, we can derive the caching prefetch indicators that better reflect human cognitive patterns. This, in turn, increases the hit rate of the caching prefetch mechanism, effectively reducing the average execution time of the system.
To assess the efficacy of the ILSTMA, the study proposed partitioning principles suitable for long-term memory datasets and applied these principles to an open-source dataset. The dataset was carefully classified and denoised to ensure its quality. It comprised five topics, each containing several chat dialogues and a series of questions that require memory retrieval through the ILSTMA for accurate answers. We designed unique ablation studies and benchmark comparison experiments that comprehensively demonstrate the effectiveness of the ILSTMA: the answer accuracy of the LLM reached 88.4%; the caching prefetch mechanism improved execution efficiency by 21.45%; and the ILSTMA also exhibited high performance in real-world chat environments, which confirms the validity of this research. The contributions presented in this paper are as follows:
  • This study optimizes the spatial layout of short-term and long-term memory, and building on this, it enhances LLM answer accuracy through an optimized retrieval algorithm and high-dimensional information summarization.
  • This study integrates human forgetting theory with OS caching prefetch mechanisms, enhancing execution efficiency without modifying retrieval algorithms and providing a valuable reference for related studies.
  • This study outlines the principles of partitioning the dataset and comprehensively evaluates the ILSTMA from multiple perspectives, demonstrating its efficiency.
The structure of this paper is as follows. In Section 2, the most advanced research on this topic is presented, providing theoretical and technical background for the development of the ILSTMA. Section 3 introduces the relevant details of the ILSTMA. Section 4 presents the methodology for the construction of the dataset and provides a comprehensive evaluation of the ILSTMA through experimental results.

2. Related Work

2.1. Large Language Models

LLMs have evolved from pretrained language models (PLMs). Architecturally, LLMs are mainly categorized into three types: the encoder–decoder architecture, as exemplified by FLAN-T5 [18]; the causal decoder architecture, such as OPT [19], BLOOM [20], and Gopher [21]; and the prefix decoder architecture, such as GLM-130B [3] and U-PaLM [22]. There is no substantial difference in network architecture between an LLM and a traditional PLM; both are based on large parameter counts and extensive training data. However, once these parameters and data reach a certain scale, LLMs exhibit exceptional performance, particularly in context understanding. This robust emergent capability has rapidly expanded the application of LLMs in various domains. For example, PanGu-α [23], based on Huawei’s MindSpore architecture [24], has shown remarkable zero/few-shot capabilities in Chinese language tasks; CodeGen [25] excels in autoregressive code generation tasks; OPT-IML [26], mT0, and BLOOMZ [27] show significant advantages in multilingual adaptation; and StarCoder [28] focuses on optimizing code writing capabilities. This study introduces the ILSTMA, which effectively utilizes the powerful capabilities of gpt-3.5-turbo in dialogue processing. It is important to note that the architecture of LLMs itself does not possess the capability to store information; it depends on its context window or external storage media to preserve historical data, which is a limitation that has spurred numerous related research initiatives.

2.2. Short-Term Memory

Short-term memory, also known as unified memory, utilizes the context window within the LLM to directly embed memory information into the prompts [29]. SayPlan [30] integrates the LLM with 3D scene graphs for efficient robotic task planning. LLMs facilitate natural language processing, enabling robots to understand and follow complex instructions. The 3D scene graphs provide structured representations of spatial relationships, enhancing situational awareness. In addition, a short-term memory system stores relevant task context, allowing for error correction and real-time adjustments. This approach aligns with scalable AI principles, supporting a wide range of robotic applications, from basic tasks to more complex interactions. CALYPSO [31] leverages an LLM to enhance the gaming experience by dynamically updating game plots, player decisions, and NPC (non-player character) statuses. Using a short-term memory system, it captures real-time context, allowing for adaptive storytelling and intelligent NPC interactions. This integration promotes greater player agency and a more immersive gameplay environment. Fischer [32] introduced reflective language programming (RLP), which is a system specifically designed for social scenarios that is equipped with advanced character portrayal capabilities. It records users’ psychological states through short-term memory and updates them as the dialogue progresses, thereby supporting reflective thinking. The system leverages real-time data analysis, making character interactions deeper and more authentic, as well as enhancing user engagement and experience. DEPS [33] has developed an LLM-based Minecraft game assistant that utilizes a short-term memory system to improve task planning and execution. This assistant tracks user interactions and intentions, allowing it to generate context-sensitive advice and suggestions. By recording errors that occur during task execution, it can analyze challenges faced by players and adapt its guidance accordingly. This adaptive approach fosters a more efficient game experience, enabling players to overcome obstacles and enhance their overall enjoyment of the game. Although short-term memory operation is straightforward, encapsulating all the information into prompts placed within the context window, this method is constrained by the size of the window, which limits its ability to handle long information tasks.

2.3. Long-Term Memory

Long-term memory, as a strategy to address the limitations of unified memory, has garnered considerable attention in research. The Generative Agent [10] simulates everyday life scenarios within a virtual town, creating a dynamic and immersive environment. This system captures the daily thoughts, emotions, and actions of virtual agents, storing this information in long-term memory. By doing so, it enables the retrieval of past interactions and reflective summarization, allowing agents to build upon previous experiences. This memory-driven approach enhances the realism and continuity of interactions, facilitating more meaningful dialogues and behaviors as agents evolve over time, thereby enriching the virtual community experience. Reflexion [34] is a language-feedback reinforcement learning framework designed to enhance the learning process of language agents through their past interactions. This approach enables agents to reflect on their previous dialogues and experiences, allowing them to identify effective strategies and areas for improvement. By compressing and summarizing long-term memory regularly, the system maintains a concise and relevant repository of knowledge, facilitating quick retrieval and reducing cognitive overload. This continual feedback loop empowers agents to adapt their responses over time, leading to improved communication skills and more meaningful interactions with users. GITM [35] has created an automated resource gathering agent for Minecraft that efficiently collects in-game materials. This agent records all successful task experiences in long-term memory, effectively building a reliable knowledge base. By storing information about past resource gathering strategies, techniques, and outcomes, the agent can refer to these experiences when tackling new tasks. This memory-driven approach not only enhances the agent’s efficiency and effectiveness in resource gathering but also allows it to adapt its methods based on previously successful practices, ultimately improving the gameplay experience for users.
Additionally, Voyager [36] and ChatDev [37] utilize long-term memory to store descriptions of game skills and dialogue histories during software development, respectively, with the latter enhancing indexing processes through dialogue history encoding and incorporating self-reflective capabilities. AgentSims [38] introduces the concept of utilizing vector databases to store embedding vectors, which are numerical representations of information designed to capture semantic meaning. By employing vector similarity algorithms, the system can efficiently retrieve relevant information based on the similarity of these vectors. This approach allows for more nuanced searches, enabling the agent to understand and produce contextually relevant responses or actions. As a result, the use of vector databases enhances the agent’s ability to process and relate complex data, facilitating improved interactions and decision making in various applications. SCM [39] retrieves the top-k most relevant historical dialogues based on user queries for LLM reference. This architecture reasonably expands the number of related dialogues retrieved, enabling the model to handle complex queries. MemorySandbox [40] created a two-dimensional interactive interface for storing memory objects, allowing users to drag and share memory objects directly within the interface.
ChatDB [41] leverages a traditional database to serve as the medium for long-term memory, allowing agents to manage their memory effectively by performing operations such as addition and deletion through SQL statements. This approach opens up new avenues for structuring and accessing long-term memory, providing insights into how different media can be optimized for memory storage in artificial agents. However, the practical application of this system is constrained by the agent’s proficiency in composing SQL statements. If the agents lack the necessary skills to generate accurate and efficient queries, their ability to manipulate memory becomes limited, potentially hindering their overall functionality and adaptability in various tasks. To address this, DB-GPT [42] effectively fine-tuned the agent to facilitate interactions with the database through natural language, allowing users to perform operations in a more intuitive manner. This advancement enhances the naturalness of the interaction, enabling seamless communication between users and the database without the need for complex query languages. The agent’s ability to understand and execute natural language commands significantly improves usability, making it accessible to a wider range of users, including those without technical expertise in database management. Consequently, this approach not only streamlines operations but also fosters a more user-friendly experience in database interactions. Regarding the latest study, HippoRAG [43] merges concepts from long-term memory and neuroscience to create a model that mimics the efficient knowledge integration and retrieval processes of the human brain. By leveraging this biological inspiration, HippoRAG claims to enhance model performance by 20%, which signifies a notable improvement in its ability to process and utilize information effectively. Additionally, the model boasts a significant reduction in operational costs, achieving efficiency gains of up to 20 times. This dual advantage of enhanced performance coupled with reduced costs positions HippoRAG as a potentially transformative approach in the field of artificial intelligence, demonstrating the viability of incorporating insights from cognitive science into model development. RecurrentGPT [44] addresses the challenge of generating long texts inherent in models based on conventional Transformer architectures. The architecture stores long-term memory as embedded vectors on a disk and retrieves relevant information using prospective plans, integrating it with the ephemeral information from short-term memory to generate the next segment of text. The authors in [45] intended to address the challenges of contextual forgetting and inconsistent generation in long-dialogue scenarios. Their work introduces a memory management architecture, where long-term memory stores summarized key information. This memory is recursively updated throughout the summaries to ensure that the long-term memory remains accurate and up-to-date at all times. These studies not only demonstrate the diverse applications of long-term memory but also highlight the limitations and directions for the improvement of current technologies.

3. Methodology

In this section, we will provide a comprehensive introduction to the ILSTMA. We will approach this introduction from three main directions: the space layout of long-term and short-term memory; the most relevant dialogue retrieval module, which includes the indexing-based memory retrieval modification algorithm and the high-dimensional information summarization mechanism; and the caching prefetch mechanism. We will begin with an overview of the ILSTMA.

3.1. Architecture Overview

Figure 1 illustrates the architecture of the ILSTMA, with steps 1–9 corresponding to the nine processes shown in the figure:
  • Step-1: Collecting dialogues: The chat window collects the user’s questions.
  • Step-2: The hit judgment process: This step is the first part of the caching prefetch mechanism, wherein the ILSTMA checks whether the current short-term memory reference information (primarily cache information) and its own general knowledge are sufficient to answer the user’s question. This is a crucial assessment that directly determines the direction of the subsequent processes.
  • Step-3: Cache hit: Selective step. If the decision in Step 2 is affirmative, the ILSTMA will notify the long-term memory to update the data while retaining the reference information in the short-term memory, excluding the historical records, in anticipation of future cache hits.
  • Step-4: Providing information support from LTM: The most relevant dialogue retrieval module itself does not store data. The normal operation of both the indexing-based memory retrieval modification algorithm and the high-dimensional information summarization mechanism requires data retrieval from long-term memory.
  • Step-5: Injecting high-dimensional information: Selective step. The high-dimensional information summarization mechanism needs to summarize redundant dialogue content and extract the user’s personality based on the dialogue. This high-dimensional information is then injected into the short-term memory as reference information. If the decision in Step 2 is affirmative, this step is unnecessary.
  • Step-6: Injecting the most relevant dialogues: Selective step. This step requires the indexing-based memory retrieval modification algorithm to filter out the dialogues most relevant to the user’s question from a large set of dialogues, which will be injected into the short-term memory as reference information. Note that this step has a very high time complexity and is only triggered when the decision in Step 2 is not affirmative.
  • Step-7: Cache missed: Selective step. If the decision in Step 2 is not affirmative, the ILSTMA will retrigger the caching prefetch mechanism and update the cache information in the short-term memory, aiming for a successful cache hit next time.
  • Step-8: Updating LTM: The LTM needs to be updated in real-time. When a dialogue is recalled or high-dimensional summary information needs to be updated, this step is executed to overwrite the old information in the long-term memory.
  • Step-9: Answer user’s question: The final step. When this step is executed, it indicates that the ILSTMA has gathered all the necessary conditions to answer the user’s question, allowing it to guide the LLM in generating a response.

3.2. Long-Term Memory Spatial Layout

The long-term memory system is a necessary condition for handling a user’s recall questions, which are inquiries that cannot be answered solely on the basis of the general knowledge of the LLM and involve personal user information. The long-term memory system is, however, an abstract concept; it generally includes storage media, data structures, and associated operations. This study aims to make the long-term memory system more concrete. Specifically, we represent the entire long-term memory system with a global memory table. Each dialogue that needs to be stored is first encapsulated as a table entry object and assigned a globally unique index for storage in the global memory table. Each table entry consists of two parts: dialogue metadata and dialogue content. The metadata include details such as the creation time of the dialogue, the time interval since it was last recalled, and the recall times. Dialogue content comprises the text of the dialogue and its corresponding embedding vector. The general structure of the global memory table is illustrated in Table 1.
The global memory table serves as the unique identifier for the current topic dialogue. Visualizing long-term memory as a single table offers several advantages: (1) Ease of Storage: Storing a single table in a JSON text file format on a disk or a cloud server is clearer and more convenient compared to a sharded database approach. (2) Ease of Maintenance: Professionals involved in operations and maintenance understand the complexity associated with maintaining systems. Complex logical relationships often exist between different tables, raising the maintenance threshold. This aligns with one of the design intentions of the ILSTMA to simplify data structures as much as possible while providing complete functionality. (3) Ease of Operation: By eliminating the need for complicated multitable joins, all potential create, read, update, and delete (CRUD) operations can be performed on a single table, enhancing operational efficiency.
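To make the table layout concrete, the minimal Python sketch below shows one possible representation of a table entry and of the global memory table with the CRUD-style operations described above; all class, field, and method names are illustrative rather than the identifiers used in our implementation.

```python
from dataclasses import dataclass, field, asdict
import json
import time


@dataclass
class TableEntry:
    """One dialogue encapsulated as a global-memory-table entry (illustrative field names)."""
    index: int                   # globally unique index, reflects creation order
    creation_time: float         # metadata: when the dialogue was created
    last_recall_time: float      # metadata: used to derive the interval since the last recall
    recall_times: int            # metadata: how many times the dialogue has been recalled
    dialogue_text: str           # content: raw dialogue text
    embedding: list = field(default_factory=list)  # content: embedding vector of the text


class GlobalMemoryTable:
    """Single-table long-term memory supporting the CRUD operations described above."""

    def __init__(self) -> None:
        self.entries: dict = {}
        self._next_index = 0

    def add(self, text: str, embedding: list) -> int:
        now = time.time()
        idx = self._next_index
        self.entries[idx] = TableEntry(idx, now, now, 0, text, embedding)
        self._next_index += 1
        return idx

    def record_recall(self, idx: int) -> None:
        entry = self.entries[idx]
        entry.recall_times += 1
        entry.last_recall_time = time.time()

    def dump(self, path: str) -> None:
        # Ease of storage: the whole table serialises to one JSON file.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({i: asdict(e) for i, e in self.entries.items()}, f, ensure_ascii=False)
```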

3.3. Short-Term Memory Spatial Layout

Any command must pass through short-term memory to reach the LLM, but due to the limitations of the context window, the space in the short-term memory is highly valuable. Systematic space planning for the short-term memory can effectively improve space utilization. The ILSTMA divides the short-term memory space into five main segments, which are represented more visually in Figure 1 (a minimal code sketch of this layout follows the list below):
  • Cache Information: This part of the information is one of the main characteristics of this study. When a user’s question arrives, the ILSTMA scans a part of the global memory table while simulating human thinking to calculate the caching prefetch score for each entry. Then, it injects the top-three dialogues with the highest scores into the short-term memory. If these three dialogues can answer the user’s upcoming questions, the system can bypass the process of searching for the most relevant information, significantly improving execution efficiency.
  • Summary: Summary refers to the distilled insights obtained from historical dialogue records that are complex and highly redundant, and they are processed through the text summarization capabilities of the LLM. This type of information is typically characterized by its brevity and high level of abstraction, making it well suited to quickly conveying important content. In the ILSTMA, this high-dimensional summary information is regarded as auxiliary information because it allows the LLM to quickly grasp the general context of the user’s recall questions. This information not only provides relevant context but may also directly answer the user’s recall questions. Using high-dimensional summaries, the LLM can review previous conversational content more effectively, resulting in more precise and efficient responses.
  • User Personality: Different users may exhibit varying personality traits when engaging in conversations on different topics. By analyzing historical dialogues to portray the user’s personality, it is possible to prompt the LLM to reflect the traits exhibited by the user in the current chat scenario. This can guide the LLM to generate responses that align more closely with the user’s personality preferences.
  • Historical Records: This portion of space is essentially a queue, and the ILSTMA defaults to pushing the most recent seven turns of dialogue onto the queue. This information provides the LLM with recent context to inform its responses.
  • Relevant Dialogue: This segment stores the most relevant dialogues and is written only when a cache miss occurs. In this case, this reference information will become an important basis for answering the user’s recall questions.
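The minimal sketch below shows how the five segments could be held in code and concatenated into the reference block placed in the context window; the segment labels and their ordering are illustrative placeholders rather than the exact prompt wording used by the ILSTMA.

```python
from collections import deque


class ShortTermMemory:
    """Five-segment short-term memory space; segment names are illustrative."""

    def __init__(self, max_history_turns: int = 7) -> None:
        self.cache_information = []      # top-three prefetched dialogues
        self.summary = ""                # high-dimensional summary information
        self.user_personality = ""       # distilled personality traits
        self.historical_records = deque(maxlen=max_history_turns)  # most recent turns
        self.relevant_dialogue = []      # written only on a cache miss

    def as_prompt(self, question: str) -> str:
        """Concatenate the segments into the reference block placed in the context window."""
        return "\n\n".join([
            "[Cache information]\n" + "\n".join(self.cache_information),
            "[Summary]\n" + self.summary,
            "[User personality]\n" + self.user_personality,
            "[Historical records]\n" + "\n".join(self.historical_records),
            "[Relevant dialogue]\n" + "\n".join(self.relevant_dialogue),
            "[User question]\n" + question,
        ])
```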

3.4. The Most Relevant Dialogue Retrieval Module

This module is primarily responsible for improving the accuracy of the ILSTMA in answering user’s recall questions, and it consists of two main components: the indexing-based memory retrieval modification algorithm, which is responsible for memory retrieval and provides low-dimensional references for user’s recall questions, and the high-dimensional information summarization mechanism, which is responsible for summarizing and distilling a large volume of dialogues while extracting user personality; it provides high-dimensional references for the user’s recall questions.

The Indexing-Based Memory Retrieval Modification Algorithm

The two commonly used memory retrieval algorithms are FAISS [46], a memory retrieval engine for optimizing large-scale vector retrieval that has been integrated into the LangChain framework [47] (version 0.0.144; the version number is omitted in the following text), and the cosine similarity algorithm. However, these memory retrieval algorithms share a characteristic: they perform retrieval solely on the basis of semantic similarity. Dialogues that are retrieved only need to ensure maximum semantic similarity to be directly submitted to the LLM for reference. In contrast, the ILSTMA uses a global memory table that requires not only precise retrieval at the semantic similarity level but also precision in indexing. This is because the ILSTMA needs to determine the specific location of the recalled dialogue within the global memory table, allowing for timely updates to the metadata of the corresponding entries. The recalled dialogues refer to those selected by the memory retrieval algorithm. If the corresponding index does not exist or is incorrectly mapped to another dialogue, it can lead to recall confusion—where the recalled dialogue fails to receive timely updates, while the dialogues that are not recalled are incorrectly updated. This can cause the caching prefetch indicators to be computed using erroneous metadata, resulting in failures in the caching prefetch mechanism. Therefore, this study proposes the indexing-based memory retrieval modification algorithm, which allows dialogue indexing to participate fully in the memory retrieval process. This modification helps the ILSTMA determine the specific location of the recalled dialogue in the global memory table. We note that while LangChain incorporates the FAISS algorithm, it also integrates a logical judgment module. This integration enables memory retrieval processes to be completed through prompt engineering. Relevant prompts are illustrated in Figure 2.
However, a large number of results indicate that using LangChain alone does not achieve the desired effect. We found that this is primarily due to the high uncertainty (redundancy and disorder) present in real dialogue scenarios, which dilutes the key content within the semantic environment. This situation hinders FAISS’s ability to retrieve global strings and increases the likelihood of errors in the semantic logic recognition module.
To address the potential errors arising from global string retrieval and the logic recognition module, this study explored narrowing the retrieval scope and removing the logic recognition module to optimize the ILSTMA memory retrieval process. Specifically, we shifted to a dialogue-level encoding approach, calculating the cosine similarity between each dialogue’s content and the user’s recall question, with each dialogue treated as an independent retrieval unit.
This method first concatenates the dialogue index with the dialogue content and limits the retrieval domain to a single dialogue. This reduces contextual interference, allowing for precise matching of semantic relevance while ensuring a direct association between the dialogue index and the dialogue content. This approach minimizes the complexities of logical reasoning processes and the likelihood of recall confusion. The result of this method is a list of similarity scores sorted from highest to lowest. The ILSTMA selects the top-k (k = 2) dialogues with the highest cosine similarity from the list as candidate dialogues to match and refine the output of the LangChain framework. This process is illustrated in Figure 3.
As shown in Figure 3, the memory retrieval process for the user’s recall questions is divided into two concurrently executed processes: on the one hand utilizing the LangChain framework integrated with FAISS to retrieve a unique candidate dialogue and on the other hand employing a cosine similarity algorithm to extract the top-k set of dialogues. Subsequently, the candidate dialogue indices are matched with the indices of the top-k dialogues; if the match is successful, the candidate dialogue is considered the final recalled dialogue; otherwise, the top-k set of dialogues is collectively regarded as the recalled dialogues. This situation does not constitute recall confusion, as the top-k dialogues exhibit a high relevance to the recalled question.
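As a concrete illustration of this matching rule, the minimal sketch below assumes that the LangChain+FAISS branch has already returned the index of its unique candidate and that each entry’s embedding encodes the concatenation of the dialogue index and dialogue content; function and parameter names are illustrative.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def modified_retrieval(query_vec, entry_vecs, langchain_candidate_idx, k=2):
    """Index-level matching of the two retrieval branches (illustrative sketch).

    entry_vecs maps each global index to the embedding of "<index> <dialogue content>";
    langchain_candidate_idx is the index of the unique candidate produced by the
    LangChain+FAISS branch (obtained elsewhere).
    """
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx) for idx, vec in entry_vecs.items()),
        reverse=True,
    )
    top_k = [idx for _, idx in scored[:k]]
    # If the candidate index matches one of the top-k indices, that dialogue is the
    # final recalled dialogue; otherwise the whole top-k set is recalled.
    if langchain_candidate_idx in top_k:
        return [langchain_candidate_idx]
    return top_k
```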
Furthermore, the adoption of this correction method is based on the following considerations: we have observed that the output of the LangChain architecture can be unpredictable, yet it exhibits very high accuracy in a limited number of scenarios. Although the use of this algorithm alone produces unsatisfactory results in most situations, its potential remains undeniable. In contrast, while the output of the cosine similarity algorithm shows significant stability, the optimal dialogue often does not appear as the single highest-scoring result (which is why most studies employing cosine similarity typically select the top-k dialogues). Crucially, when designing prompts, the more refined the reference information, the higher the accuracy of the LLM’s responses. In short, even if the top-k dialogues provided include the best answer, the accuracy of the final answer can still be affected by secondary answers. Therefore, we aim to ensure that the most relevant dialogues are as unique as possible, and the simplest method to achieve this is to integrate the results of both algorithms, which is akin to a form of "double insurance". Experimental results have also shown that this approach significantly enhances the accuracy of the LLM’s responses.

3.5. High-Dimensional Information Summarization Mechanism

High-dimensional information summarization refers to the process of using LLMs to summarize language, effectively compressing and refining redundant user dialogue history to ultimately produce highly distilled information. The core of this mechanism is the design of a comprehensive prompt. In the prompt design process for this study, the LLM is required to emphasize retaining four key elements—time, place, characters, and significant events—while also identifying and preserving certain information that is repeatedly mentioned in the dialogue. In addition, considering the personalized needs of users, the ILSTMA incorporates a user personality feature during the high-dimensional information summarization process. Specifically, the LLM is tasked with utilizing sentiment analysis capabilities to distill specific user personality traits from historical dialogue records. These personality profiles will also be transferred to the LLM as part of the high-dimensional information for reference. Moreover, given the constraints of the context window, the ILSTMA will group user dialogues for summarization. The more dialogues are included in a group, the more severe the loss of information will be during the summarization process. Based on previous literature and extensive manual experiments, this study defaults to summarizing dialogues in groups of five—this approach limits information loss, prevents excessively frequent triggering of the summarization process, and ensures that the word count of five groups of dialogue remains within the context window limit. The prompt for the high-dimensional summarization mechanism is illustrated in Figure 4.
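As a minimal sketch of the grouping step described above, the code below partitions the dialogue history into groups of five and asks the model to distill each group; `llm` stands for any callable that sends a prompt to the language model and returns its completion, and the prompt text is a simplified stand-in for the full prompt in Figure 4.

```python
def summarize_in_groups(dialogues, llm, group_size=5):
    """Group dialogues (five per group by default) and distill each group (illustrative).

    `llm` is any callable mapping a prompt string to a completion string.
    """
    summaries = []
    for start in range(0, len(dialogues), group_size):
        group = dialogues[start:start + group_size]
        prompt = (
            "Summarize the following dialogues. Preserve the time, place, characters, "
            "and significant events, keep information that is repeatedly mentioned, "
            "and describe the user's personality traits.\n\n" + "\n".join(group)
        )
        summaries.append({"dialogue_range": (start, start + len(group) - 1),
                          "summary": llm(prompt)})
    return summaries
```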
As the dialogue progresses, the dialogue history is continuously updated. Updating the high-dimensional information is also necessary; otherwise, it can lead to information latency. The ILSTMA integrates high-dimensional information updates throughout the process, using an overlapping method for updates: it assesses whether there are semantically similar parts between the new and old summaries. For semantically similar parts, the corresponding sections of the new summary will overwrite those of the old summary, while the other parts will be combined. If there are no semantically similar sections, the content of the new and old summaries will remain unchanged. This process is illustrated in Figure 5.
Taking into account the judgment process involved in the summary update design, this study introduces chain-of-thought (CoT) prompting into the prompt design process. We embed the CoT into the design of the instances, allowing the LLM to reproduce the semantic similarity overlap process between the old and new summaries through the instances in the prompt. The prompt is illustrated in Figure 6.
After forming multiple sets of high-dimensional information, when addressing recall questions posed by users, in addition to finding the most relevant information, it is also necessary to apply the cosine similarity algorithm to match the recall questions with the high-dimensional information. The set of high-dimensional information with the highest similarity score will be selected as a reference. This is because the number of high-dimensional information sets will not be excessive, and with summary updates, the content has already been highly condensed. As supplementary information for the recalled dialogues, it is sufficient; hence, there is no need for a complex retrieval correction process. The relatively optimal cosine similarity algorithm can be directly used for matching.

3.6. Caching Prefetch Mechanism

This subsection introduces the caching prefetch mechanism, which is a unique approach of the ILSTMA aimed at improving the efficiency of execution in long-term and short-term memory systems. We will first discuss the design of the caching prefetch indicator, which is inspired by the fundamental theories of human forgetting and classic operating systems. Building upon this, we will detail the specifics of the caching prefetch mechanism.

3.6.1. The Caching Prefetch Indicator

The caching prefetch indicator is based on the original Ebbinghaus forgetting curve, which describes the relationship between the time interval since the last recall of knowledge, the learning strength, and the knowledge retention rate. This relationship is illustrated in Equation (1) as follows:
R = e^{−t/S}
where R represents the retention rate, and t ∈ [0, +∞) represents the time interval from the last recall to the present, which is measured in whole days in this paper. S denotes the memory strength, which increases with recall times.
However, there is currently a lack of relational models between memory strength and recall times. Previous studies have considered the two as having a linear relationship, but this does not align with fundamental theories of human forgetting. We reexamined the relationship between them and modeled it using a sigmoid-based function, a curve widely used as an activation function in deep learning that is strictly monotonic. The final model is illustrated in Equation (2) as follows:
S = −β + α / (1 + e^{−ξ(recallTimes − μ)})
where β, α, ξ, and μ are four constant parameters used to control the scaling and growth rates of the model. In this study, we assume that the range for recallTimes is [1, 5], and the range for S is [5, 25]. Within these intervals, β = 13.5, α = 37, ξ = 1.5, and μ = 1. Researchers can make adjustments to these four parameters based on their actual needs. The curve is illustrated in Figure 7 as follows:
We chose this curve for the following reasons: It exhibits a rapid growth rate in the early stage of memory, while the growth rate gradually decreases in the later stages, eventually approaching a flat curve. This aligns with the basic theories of human forgetting psychology, which emphasize that recall in the early stages has a strong consolidating effect on knowledge, and this consolidating effect decreases with the increase in recall times. Ultimately, in the later stages of memory, the strength of memory shows stability. Furthermore, the smoothness of the sigmoid function makes it highly suitable for simulating human thought processes. The final retention rate calculation formula is given as Equation (3):
R = e^{−t · [1 + e^{−ξ(recallTimes − μ)}] / (α − β · [1 + e^{−ξ(recallTimes − μ)}])}
However, there seems to be a logical gap between the calculation of the dialogue retention rate and the demand for prefetch caching, yet this is not the case. We refer to the prefetch mechanism in classic operating systems, which is an optimization strategy designed to improve the speed of data access. Its main function is to predict the data that a program might need in the future based on its access patterns and load them into memory in advance, thus reducing the waiting time during the actual execution of the program. This approach aims to compensate for the speed mismatch between high-speed CPUs and low-speed memory, which can lead to prolonged CPU waiting times. Traditional prefetch mechanisms operate at the hardware level and adopt the principle of locality: The temporal locality principle demonstrates that programs tend to repeatedly access data that were recently accessed; the spatial locality principle indicates that programs tend to access data located close to each other.
The calculation of the dialogue retention rate is influenced by the time interval between the last recall of the dialogue and the current time, as well as the total recall times for that dialogue. This aligns with the principle of temporal locality, suggesting that dialogues with higher retention rates have shorter average time intervals and more frequent recalls. This indicates that the dialogue retention rate has another dimension of application: a score that measures the temporal locality of dialogues. At the same time, the global memory table assigns a globally unique index to each dialogue at the moment of its creation, reflecting the absolute sequential order of dialogue creation. This concept aligns with the basic theories of human forgetting, which state that when recalling a memory fragment, individuals tend to involuntarily recall nearby scenes. For example, when someone remembers that they had a cold two weeks ago, they are likely to further recall that it was due to wearing insufficient clothing and that it took three days of medication to fully recover. The revival of these memories is linked by their sequential order, reflecting the principle of spatial locality. Therefore, the final score of this index is determined by the dialogue retention rate (simulating temporal locality) and the relative distance of the dialogue index (simulating spatial locality). Equation (4) presents the calculation formula for the caching prefetch indicator:
Score(timeInterval, recallTimes, dis) = R(timeInterval, recallTimes) + (2/π) · arctan(dis)
The calculation of R is given in Equation (3); the arctangent function is used to map the relative distance to the interval (0, 1). Therefore, the range of values for Score is (0, 2). Under the condition of not considering spatial locality (with dis held constant), we plotted the caching prefetch indicator (Score) against different values of recallTimes, as shown in Figure 8, to illustrate the rationality of the method from a practical application perspective.
As shown in the figure, in the early stage of recall (recallTimes = 1, 2), the decay of Score significantly decreases. This is because in real-world scenarios, if a user reviews a conversation for the first time, they are highly likely to be interested in that conversation (which aligns with the principles of recommendation systems). This also reflects the strong reinforcing effect of early recall on memory, as stated in basic forgetting theory. In the later stages of recall (recallTimes = 4, 5), knowledge is recalled multiple times, indicating that the user indeed finds this conversation very necessary. At this point, the reinforcing effect of further recall on Score has little to no impact; the Score for this conversation has already become quite high, and the rate of decrease over time is slow, emphasizing the stability of memory. This illustrates the changes in the caching prefetch indicator in practical application scenarios.
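For reference, the sketch below computes the indicator exactly as reconstructed in Equations (2)–(4), with the parameter values given above; the clamping of recallTimes to the assumed [1, 5] range and all identifier names are our own additions.

```python
import math

# Parameter values taken from the text; adjust to suit other settings.
BETA, ALPHA, XI, MU = 13.5, 37.0, 1.5, 1.0


def memory_strength(recall_times: int) -> float:
    """Sigmoid-based memory strength S, Equation (2)."""
    r = min(max(recall_times, 1), 5)  # the text assumes recallTimes lies in [1, 5]
    return -BETA + ALPHA / (1.0 + math.exp(-XI * (r - MU)))


def retention_rate(time_interval_days: float, recall_times: int) -> float:
    """Ebbinghaus-style retention rate R = e^(-t/S), Equations (1) and (3)."""
    return math.exp(-time_interval_days / memory_strength(recall_times))


def prefetch_score(time_interval_days: float, recall_times: int, dis: float) -> float:
    """Caching prefetch indicator, Equation (4): retention rate (temporal locality)
    plus the index distance mapped into (0, 1) by the arctangent (spatial locality)."""
    return retention_rate(time_interval_days, recall_times) + (2.0 / math.pi) * math.atan(dis)
```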

3.6.2. The Process of Caching Prefetch Mechanism

The operation of the caching prefetch mechanism is divided into three phases: hit judgment, cache hit, and cache miss. Their approximate positions within the ILSTMA have been briefly introduced in Figure 1. A more detailed description is provided here.
First, the hit judgment phase is the primary step of the caching prefetch mechanism. During this stage, it is essential to instruct the LLM to rely solely on short-term memory and its general knowledge of the world to respond to the user’s question. The ILSTMA is not allowed to trigger memory retrieval algorithms with high time complexity at this point (because the relevant dialogue segment in the short-term memory is left empty). The caching prefetch mechanism aims to shorten the average execution time by allowing as many decisions as possible to be made in this manner. In this case, it effectively reduces the time consumption generated by the long-term memory, allowing for extremely rapid responses by relying solely on the short-term memory. This hit judgment is achieved through carefully designed prompts, as illustrated in Figure 9.
The output of the prompt leads to two situations: if there is a hit, the ILSTMA directly answers the user’s question while keeping the content in short-term memory unchanged except for the historical records; if there is a miss, it clears the contents of the summary and cache information segments, triggers the most relevant dialogue retrieval module, answers the user’s question, and then triggers the caching prefetch mechanism again.
Second, if the hit judgment is affirmative, it indicates that a cache hit has occurred. This process requires maintaining the stability of all segments in the short-term memory, except for the historical records segment, as this signifies that the reference information within the short-term memory possesses a high value, which the ILSTMA aims to retain for further cache hits. However, an essential question arises: When should the information in the cache information segment be updated? The ILSTMA adopts a deferred write strategy: Writing occurs only when the cache information needs to be overwritten. Consequently, when a cache hit is registered, the ILSTMA updates the recallTimes and timeInterval of the retrieved dialogues and temporarily stores these data, which are then collectively written into the global memory table after overwriting the cache information. This approach is employed because only the final state of the dialogues is necessary for writing; immediate write strategies are deemed unnecessary and would further impair execution efficiency. By using a deferred write strategy, the execution efficiency can be optimized to the maximum extent possible.
Third, if the hit judgment is not affirmative, it indicates that a cache miss has occurred. When this process transpires, the ILSTMA must engage the most relevant dialogue retrieval module to sequentially perform two steps: first, the user’s recall question is used as input to determine the recalled dialogues via the indexing-based memory retrieval modification algorithm, which are then injected into the relevant dialogue segment of the short-term memory. Subsequently, the high-dimensional information summarization mechanism summarizes the information and injects both the summary and user personality into their respective segments in the short-term memory. At this point, the ILSTMA possesses all the reference information necessary to respond to the user’s recall question and can proceed with the response. Upon completing the response, it is imperative to utilize the caching prefetch mechanism to select dialogues anew based on the caching prefetch indicator. The calculation of the caching prefetch indicator will be performed during the process of encapsulating dialogues into global entries. Subsequently, recalculations are only required for the relevant entries of the recalled dialogues, avoiding the need to repeatedly compute the caching prefetch indicator for all dialogues. When a user’s recall question arrives, the ILSTMA will determine the position of that question in the global memory table. Then, centered around this position, it will select a specific interval of [−offset, offset] and choose the k dialogues with the highest scores within this interval (where k ≤ 2 × offset) to place in the short-term memory. This approach is adopted because sorting the entire global memory table is very time consuming. By selecting dialogues within a limited range, it ensures that spatial locality is adequately weighted while keeping the sorting process time-efficient, with an overall time complexity of O(1). Please note that when the ILSTMA is first started, it will inevitably result in a miss.
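The windowed selection and deferred write described above can be sketched as follows; the sketch reuses the GlobalMemoryTable example from Section 3.2 and the prefetch_score example from Section 3.6.1, and the default values of offset and k are illustrative assumptions rather than the values used in our experiments.

```python
import time

# Reuses the GlobalMemoryTable sketch (Section 3.2) and prefetch_score (Section 3.6.1).


def days_since(timestamp: float) -> float:
    """Whole-day interval used as t in the retention-rate formula."""
    return (time.time() - timestamp) // 86400


def prefetch_into_cache(global_table, question_position, offset=10, k=3):
    """Pick the k highest-scoring dialogues inside [position - offset, position + offset].

    k must satisfy k <= 2 * offset; because the window holds at most 2 * offset + 1
    entries, the selection cost is constant with respect to the table size, matching
    the O(1) claim above.
    """
    assert k <= 2 * offset
    lo, hi = question_position - offset, question_position + offset
    window = [e for idx, e in global_table.entries.items() if lo <= idx <= hi]
    window.sort(
        key=lambda e: prefetch_score(
            days_since(e.last_recall_time), e.recall_times, abs(e.index - question_position)
        ),
        reverse=True,
    )
    # Metadata of recalled dialogues are buffered and written back to the global
    # memory table only when this cache-information segment is overwritten
    # (the deferred write strategy described above).
    return window[:k]
```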
Finally, to further illustrate the optimization effect of the caching prefetch mechanism on the execution efficiency, we will analyze the time complexity for both hit-and-miss situations (assuming the LLM response time is negligible). In the case of a hit, as previously analyzed, the time complexity is O(1); in the case of a miss, let the embedding vector encoding time be E. The memory retrieval process requires comparing the source dialogue encoding with the target database, resulting in a time complexity of O(n²). Thus, the overall time complexity for this process is O(n²) + E. Therefore, when the volume of dialogues is sufficiently large, hits can significantly reduce the average execution time.
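To make this trade-off explicit, the expected execution time per query can be written as the following mixture of the two cases, where p denotes the cache hit rate; this is an illustrative model rather than a formula from the original analysis.

```latex
% Illustrative expected-cost model:
% p is the cache hit rate, n the number of stored dialogues, E the embedding-encoding time.
\mathbb{E}[T] \;=\; p \cdot O(1) \;+\; (1 - p)\,\bigl(O(n^{2}) + E\bigr)
```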

4. Experiments and Discussion

In this section, we discuss the experimental process and the corresponding discussions designed for the ILSTMA. We first introduce a self-organized dataset that covers five dialogue topics and discuss how the dataset is divided. Next, we present the evaluation metrics. In the experimental section, we evaluate the caching prefetch mechanism in terms of the hit rate and average execution time. Additionally, to demonstrate the effectiveness of the ILSTMA in retrieving the most relevant dialogues, we collected open-source long-term memory chat models from the industry for comparative experiments. The experimental results show that the ILSTMA achieved the best performance in both the retrieval accuracy and the final LLM answer accuracy, and it outperformed the vast majority of benchmarks in terms of execution efficiency.

4.1. Dataset

4.1.1. Dataset Partitioning Principles

This study proposes that test datasets suitable for long-term memory systems should be clearly divided into two parts: the construction dataset and the recall dataset. The former refers to a dataset composed purely of bilateral dialogues, which is essentially a general dialogue dataset. There are two key points in constructing this dataset: First, it is crucial to expand the scope of personalized information as much as possible. This is because only with sufficient personalized questions can the LLM perform memory retrieval when general knowledge is not applicable, which allows for a more thorough testing of the long-term memory system’s performance. Additionally, a small number of general dialogues should be retained to demonstrate that the LLM does not lose its ability to solve problems using general knowledge. The latter refers to a dataset composed of recall questions and their corresponding labels, where the labels generally indicate the answers to the recall questions derived from the construction dataset. Similarly, the recall dataset should also appropriately retain general questions to stimulate the LLM’s ability to solve general problems.
From a usage perspective, the construction dataset should be preloaded into the long-term memory. This step corresponds to the process of building a global memory table in this study, similar to a traditional training dataset. The recall dataset, on the other hand, serves as the testing dataset, which is used as input for the LLM, and its output results are compared with the labels.

4.1.2. Dataset Introduction

Based on the principles mentioned above, we organized the datasets from the MemoryBank and SCM projects. Ensuring that all data were sourced from open datasets, we performed manual denoising and dataset partitioning, resulting in a dataset containing dialogues across five major themes. The statistics are shown in Table 2:
Therein, Con_data represents the number of dialogues in the constructed dataset, Rec_data represents the number of recall questions and labels in the recall dataset, and Ave_tokens indicates the average number of tokens per dialogue group.

4.2. Relevant Metrics

The performance of the experiments was evaluated based on the following metrics: (1) Answer Accuracy: Measures the accuracy of the LLM outputs using a ternary scoring system of {0, 0.5, 1}, where 0 indicates incorrect, 0.5 indicates partially correct, and 1 indicates fully correct. (2) Retrieve Accuracy: Measures the accuracy of memory retrieval when the indexing-based memory retrieval modification algorithm is applied, employing the same ternary evaluation system {0, 0.5, 1}. (3) Recall Accuracy: This metric is a comparative metric used to evaluate memory retrieval accuracy before applying the retrieval modification algorithm. It still employs the ternary scoring system {0, 0.5, 1} based on the semantic proximity to the correct answer. (4) Contextual Coherence (Coherence): This is an indicator that measures the degree of alignment between the content of the LLM’s responses and the current chat context. A larger value of this indicator indicates that the LLM’s responses are more logically coherent. The range of this metric is [0, 1].
Using a ternary scoring system is a widely adopted method in the industry [9,11,39] and is used to provide an effective method of evaluation in the current absence of finer-grained measurement standards. In order to reduce human intervention and enhance the credibility of the results, we utilized the most powerful GPT-4 to score all ternary scoring metrics. All experimental results were obtained by conducting ten measurements and taking the average value.

4.3. Main Results and Discussion

This section will provide a detailed description of the experimental procedures and analyze the test results, with all experiments performed on the dataset constructed for this study.

4.3.1. Hit Rate Analysis

We first collected the average hit rate of the caching prefetch mechanism in various datasets, and the results are shown in Table 3:
The average hit rate shown in the table reached 35.35%. In classical operating system theory, a hardware-based prefetch mechanism typically has an ideal hit rate between 60% and 80%, and even in less favorable cases, it is above 50%. This is mainly because programs tend to conform more closely to the principle of locality. In reality, for recall datasets, user query behavior is unpredictable; even the theories outlined in human basic forgetting studies describe only a sort of human subconsciousness. Therefore, an average hit rate of 35.35% already exceeds one-third of the total volume of recall datasets, making it a relatively reliable result.

4.3.2. Average Execution Time Analysis

We first performed an ablation experiment to assess the difference in average execution time before and after the system was equipped with the caching prefetch mechanism, as shown in Figure 10. Execution time is the duration from when a user submits a query to when the LLM's answer to that query is returned. The figure shows that the caching prefetch mechanism reduced the average execution time by 21.45% compared with the configuration without it, demonstrating its effectiveness.
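The measurement itself can be sketched as follows, under the assumption of a system.respond entry point that covers the full query-to-answer path; the actual instrumentation may differ.

```python
import time

def timed_query(system, question):
    """Return the answer and the end-to-end latency for one user query."""
    start = time.perf_counter()
    answer = system.respond(question)  # hypothetical query-to-answer entry point
    return answer, time.perf_counter() - start

def average_execution_time(system, questions):
    """Mean latency over a recall dataset, as reported in Figure 10 and Table 4."""
    latencies = [timed_query(system, q)[1] for q in questions]
    return sum(latencies) / len(latencies)
```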
To further demonstrate the execution efficiency of the ILSTMA relative to industry open-source benchmarks, we reproduced four recent open-source systems; the results are shown in Table 4:
As shown in the table, the ILSTMA equipped with the caching prefetch mechanism is highly competitive with the benchmark models in terms of average execution time. MemoryBank performs memory retrieval through the LangChain framework, so part of its execution time is consumed by the logical reasoning module. SCM employs the cosine similarity algorithm, while MemoChat concatenates high-dimensional summary information with the recall question to drive memory retrieval. ChatRsum treats global memory as a single high-dimensional summary; it therefore only updates that summary in real time and involves no memory retrieval process at all, which gives it the lowest average execution time.
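For reference, cosine-similarity retrieval, one of the two memory retrieval algorithms examined in this study and the one employed by SCM, can be sketched as below. The memory_table layout loosely mirrors the Embedding Vector column of the global memory table (Table 1), and the helper names are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_by_cosine(query_embedding, memory_table, k=3):
    """Rank global-memory entries by cosine similarity to the query embedding."""
    scored = [(cosine_similarity(query_embedding, entry["embedding"]), idx)
              for idx, entry in enumerate(memory_table)]
    scored.sort(reverse=True)              # highest similarity first
    return [idx for _, idx in scored[:k]]  # indices of the top-k dialogues
```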

4.3.3. Comprehensive Comparative Experiment

To test the effectiveness of the ILSTMA in retrieving the most relevant dialogue, we performed a comprehensive comparative experiment between ILSTMA and open source benchmarks, as shown in Table 5.
The bolded metrics indicate the best performance. The table shows that the ILSTMA's most relevant dialogue retrieval module raised the LLM's answer accuracy to 88.4%, surpassing all currently available benchmark models. This improvement is attributed to the indexing-based memory retrieval modification algorithm, implemented on top of the spatial arrangement of the long-term and short-term memory, which lifted the unrepaired recall accuracy of 66.3% by 27.5 percentage points to a retrieve accuracy of 93.8%. Interestingly, although the ILSTMA's retrieve accuracy was nearly on par with the state-of-the-art SCM, its answer accuracy was 3.5 percentage points higher. This gain is attributed to the memory retrieval modification algorithm, which integrates the results of both retrieval algorithms to maximize the uniqueness of the retrieved dialogues; the streamlined reference information in the prompt allows the LLM to perform better. The contextual coherence metric remained high across all benchmarks because it mainly reflects the intrinsic capability of the LLM and the quality of the prompt engineering, and it directly reflects the LLM's language organization capability. To keep the experimental environment consistent, all benchmarks used gpt-3.5-turbo. The ILSTMA's result on this metric demonstrates the reasonableness of its prompt design.
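The exact indexing-based modification algorithm is given in Figure 3; the sketch below only illustrates the stated intent of integrating the two retrievers' results while keeping the final reference list small and unique, and should not be read as the algorithm itself.

```python
def merge_retrieval_results(indices_a, indices_b, limit=3):
    """Illustrative merge of two retrievers' candidate indices.

    Indices returned by both algorithms are favored; the rest fill the
    remaining slots, with duplicates removed so that the prompt receives a
    small, unique set of reference dialogues.
    """
    agreed = [i for i in indices_a if i in indices_b]
    remainder = [i for i in indices_a + indices_b if i not in agreed]
    merged, seen = [], set()
    for idx in agreed + remainder:
        if idx not in seen:
            seen.add(idx)
            merged.append(idx)
        if len(merged) == limit:
            break
    return merged
```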
Additionally, it should be noted that MemoChat's retrieval scheme concatenates the high-dimensional summary information with the recall question and uses the combined text for similarity matching against historical dialogue records. As the table shows, the answer accuracy of this approach was only 60.9%: the high-dimensional summary strongly distorts the embedding vector of the concatenated text, pulling it away from the recall question itself and yielding unsatisfactory retrieval results. ChatRsum maintains only a single global high-dimensional summary and therefore has no retrieval process at all, but this also means the summary lacks dialogue detail, leading to an answer accuracy of only 48.6%. These results indicate that using high-dimensional information for memory retrieval is not effective, and they confirm the reasonableness of the ILSTMA's design, which retrieves with the user question alone and uses high-dimensional information only as one of the reference components for its responses. This study outperformed the other benchmark models on all metrics, supporting the validity of the research.

5. Conclusions

This paper proposed the ILSTMA, a more accurate and faster long-term and short-term memory architecture for LLMs in chat scenarios. The study aimed to improve the accuracy of the LLM's answers while reducing the system's average execution time. We first abstracted the long-term memory system into a global memory table and partitioned the short-term memory into five spatial layouts. We then analyzed the two main memory retrieval algorithms and, on that basis, combined them into the indexing-based memory retrieval modification algorithm, which raised answer accuracy to a new level. To improve system efficiency, we introduced a caching prefetch mechanism that merges basic human forgetting theory with the classic operating system prefetching mechanism, allowing the system to preload dialogues in a way that mirrors human recall and thereby avoid costly memory retrieval.
On the experimental side, our analysis covered three parts: hit rate, average execution time, and a comprehensive comparative experiment. We first introduced the principles for partitioning datasets suitable for long-term memory systems and then showed that the ILSTMA, with an average hit rate of 35.35%, reduces the average execution time by 21.45%. The comprehensive comparative experiment further showed that the ILSTMA leads on several metrics, including answer accuracy, confirming the effectiveness of this research. We hope this study inspires new ideas for the design of long-term and short-term memory systems.

Author Contributions

Conceptualization, Z.M.; methodology, Z.M.; software, Z.M. and G.C.; validation, Z.M. and Z.W.; formal analysis, Z.M. and Z.W.; investigation, Z.M. and G.C.; resources, G.C.; data curation, Z.M. and Z.W.; writing—original draft preparation, Z.M.; writing—review and editing, G.C. and Z.W.; visualization, Z.M. and Z.W.; supervision, G.C.; project administration, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Ningbo Science and Technology Major Project (2024Z259) and the Key Technology R&D Program of Ningbo (2022Z149).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in MemoryBank-SiliconFriend at https://github.com/zhongwanjun/MemoryBank-SiliconFriend (accessed on 24 December 2024) and SCM4LLMs at https://github.com/wbbeyourself/SCM4LLMs (accessed on 29 December 2024). We only performed preprocessing on this dataset, as detailed in the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM   Large Language Model

References

  1. OpenAI. ChatGPT. 2022. Available online: https://chat.openai.com/chat (accessed on 11 August 2024).
  2. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  3. Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. Glm-130b: An open bilingual pre-trained model. arXiv 2022, arXiv:2210.02414. [Google Scholar]
  4. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  5. Ratner, N.; Levine, Y.; Belinkov, Y.; Ram, O.; Magar, I.; Abend, O.; Karpas, E.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. Parallel context windows for large language models. arXiv 2022, arXiv:2212.10947. [Google Scholar]
  6. Wang, X.; Salmani, M.; Omidi, P.; Ren, X.; Rezagholizadeh, M.; Eshaghi, A. Beyond the limits: A survey of techniques to extend the context length in large language models. arXiv 2024, arXiv:2402.02244. [Google Scholar]
  7. Ng, Y.; Miyashita, D.; Hoshi, Y.; Morioka, Y.; Torii, O.; Kodama, T.; Deguchi, J. Simplyretrieve: A private and lightweight retrieval-centric generative ai tool. arXiv 2023, arXiv:2308.03983. [Google Scholar]
  8. Zhao, A.; Huang, D.; Xu, Q.; Lin, M.; Liu, Y.J.; Huang, G. Expel: Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 19632–19642. [Google Scholar]
  9. Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; Wang, Y. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 19724–19731. [Google Scholar]
  10. Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual Acm Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22. [Google Scholar]
  11. Liu, L.; Yang, X.; Shen, Y.; Hu, B.; Zhang, Z.; Gu, J.; Zhang, G. Think-in-memory: Recalling and post-thinking enable llms with long-term memory. arXiv 2023, arXiv:2311.08719. [Google Scholar]
  12. Ebbinghaus, H. Memory: A Contribution to Experimental Psychology. 1964. Available online: https://pmc.ncbi.nlm.nih.gov/articles/PMC4117135/ (accessed on 17 July 2024).
  13. Brown, G.D.; Lewandowsky, S. Forgetting in memory models: Arguments against trace decay and consolidation failure. In Forgetting; Psychology Press: East Sussex, UK, 2010; pp. 63–90. [Google Scholar]
  14. Cepeda, N.J.; Pashler, H.; Vul, E.; Wixted, J.T.; Rohrer, D. Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychol. Bull. 2006, 132, 354. [Google Scholar] [CrossRef]
  15. Lindsey, J.; Litwin-Kumar, A. Theory of systems memory consolidation via recall-gated plasticity. eLife 2023, 12. [Google Scholar] [CrossRef]
  16. Nadel, L.; Hardt, O. Update on memory systems and processes. Neuropsychopharmacology 2011, 36, 251–273. [Google Scholar] [CrossRef]
  17. Smith, A.M. Examining the Role of Retrieval Practice in Improving Memory Accessibility Under Stress. Ph.D. Thesis, Tufts University, Medford, MA, USA, 2018. [Google Scholar]
  18. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 2024, 25, 1–53. [Google Scholar]
  19. Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. Opt: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
  20. Le Scao, T.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; Gallé, M.; et al. Bloom: A 176b-Parameter Open-Access Multilingual Language Model. 2023. Available online: https://inria.hal.science/hal-03850124/ (accessed on 20 November 2023).
  21. Rae, J.W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoffmann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Young, S.; et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv 2021, arXiv:2112.11446. [Google Scholar]
  22. Tay, Y.; Wei, J.; Chung, H.W.; Tran, V.Q.; So, D.R.; Shakeri, S.; Garcia, X.; Zheng, H.S.; Rao, J.; Chowdhery, A.; et al. Transcending scaling laws with 0.1% extra compute. arXiv 2022, arXiv:2210.11399. [Google Scholar]
  23. Zeng, W.; Ren, X.; Su, T.; Wang, H.; Liao, Y.; Wang, Z.; Jiang, X.; Yang, Z.; Wang, K.; Zhang, X.; et al. Pangu-alpha: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv 2021, arXiv:2104.12369. [Google Scholar]
  24. Huawei Technologies Co., Ltd. Huawei MindSpore AI development framework. In Artificial Intelligence Technology; Springer: Berlin/Heidelberg, Germany, 2022; pp. 137–162. [Google Scholar]
  25. Nijkamp, E.; Pang, B.; Hayashi, H.; Tu, L.; Wang, H.; Zhou, Y.; Savarese, S.; Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. arXiv 2022, arXiv:2203.13474. [Google Scholar]
  26. Iyer, S.; Lin, X.V.; Pasunuru, R.; Mihaylov, T.; Simig, D.; Yu, P.; Shuster, K.; Wang, T.; Liu, Q.; Koura, P.S.; et al. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv 2022, arXiv:2212.12017. [Google Scholar]
  27. Muennighoff, N.; Wang, T.; Sutawika, L.; Roberts, A.; Biderman, S.; Scao, T.L.; Bari, M.S.; Shen, S.; Yong, Z.X.; Schoelkopf, H.; et al. Crosslingual generalization through multitask finetuning. arXiv 2022, arXiv:2211.01786. [Google Scholar]
  28. Li, R.; Allal, L.B.; Zi, Y.; Muennighoff, N.; Kocetkov, D.; Mou, C.; Marone, M.; Akiki, C.; Li, J.; Chim, J.; et al. Starcoder: May the source be with you! arXiv 2023, arXiv:2305.06161. [Google Scholar]
  29. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Front. Comput. Sci. 2024, 18, 186345. [Google Scholar]
  30. Rana, K.; Haviland, J.; Garg, S.; Abou-Chakra, J.; Reid, I.; Suenderhauf, N. Sayplan: Grounding large language models using 3d scene graphs for scalable task planning. arXiv 2023, arXiv:2307.06135. [Google Scholar]
  31. Zhu, A.; Martin, L.; Head, A.; Callison-Burch, C. CALYPSO: LLMs as Dungeon Master’s Assistants. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, Salt Lake City, UT, USA, 8–12 October 2023; Volume 19, pp. 380–390. [Google Scholar]
  32. Fischer, K.A. Reflective linguistic programming (rlp): A stepping stone in socially-aware agi (socialagi). arXiv 2023, arXiv:2305.12647. [Google Scholar]
  33. Wang, Z.; Cai, S.; Chen, G.; Liu, A.; Ma, X.; Liang, Y. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv 2023, arXiv:2302.01560. [Google Scholar]
  34. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2024, 36, 8634–8652. [Google Scholar]
  35. Zhu, X.; Chen, Y.; Tian, H.; Tao, C.; Su, W.; Yang, C.; Huang, G.; Li, B.; Lu, L.; Wang, X.; et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv 2023, arXiv:2305.17144. [Google Scholar]
  36. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv 2023, arXiv:2305.16291. [Google Scholar]
  37. Qian, C.; Cong, X.; Yang, C.; Chen, W.; Su, Y.; Xu, J.; Liu, Z.; Sun, M. Communicative agents for software development. arXiv 2023, arXiv:2307.07924. [Google Scholar]
  38. Lin, J.; Zhao, H.; Zhang, A.; Wu, Y.; Ping, H.; Chen, Q. Agentsims: An open-source sandbox for large language model evaluation. arXiv 2023, arXiv:2308.04026. [Google Scholar]
  39. Wang, B.; Liang, X.; Yang, J.; Huang, H.; Wu, S.; Wu, P.; Lu, L.; Ma, Z.; Li, Z. Enhancing large language model with self-controlled memory framework. arXiv 2023, arXiv:2304.13343. [Google Scholar]
  40. Huang, Z.; Gutierrez, S.; Kamana, H.; MacNeil, S. Memory sandbox: Transparent and interactive memory management for conversational agents. In Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–3. [Google Scholar]
  41. Hu, C.; Fu, J.; Du, C.; Luo, S.; Zhao, J.; Zhao, H. Chatdb: Augmenting llms with databases as their symbolic memory. arXiv 2023, arXiv:2306.03901. [Google Scholar]
  42. Zhou, X.; Li, G.; Liu, Z. Llm as dba. arXiv 2023, arXiv:2308.05481. [Google Scholar]
  43. Gutiérrez, B.J.; Shu, Y.; Gu, Y.; Yasunaga, M.; Su, Y. HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv 2024, arXiv:2405.14831. [Google Scholar]
  44. Zhou, W.; Jiang, Y.E.; Cui, P.; Wang, T.; Xiao, Z.; Hou, Y.; Cotterell, R.; Sachan, M. Recurrentgpt: Interactive generation of (arbitrarily) long text. arXiv 2023, arXiv:2305.13304. [Google Scholar]
  45. Wang, Q.; Ding, L.; Cao, Y.; Tian, Z.; Wang, S.; Tao, D.; Guo, L. Recursively summarizing enables long-term dialogue memory in large language models. arXiv 2023, arXiv:2308.15022. [Google Scholar]
  46. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar]
  47. LangChain Inc. LangChain. 2022. Available online: https://docs.langchain.com/docs/ (accessed on 13 August 2024).
Figure 1. Overview of the ILSTMA (solid lines represent essential steps, while dashed lines denote non-essential steps): (1)—Collecting dialogues; (2), (3), (7)—Caching prefetch mechanism; (4)—Providing information support from LTM; (5)—Injecting high-dimensional information; (6)—Injecting the most relevant dialogues; (8)—Updating LTM; (9)—Answering user’s question.
Figure 2. Prompt for memory retrieval using LangChain (0.0.144).
Figure 3. The flowchart of the indexing-based memory retrieval modification algorithm.
Figure 4. Prompt for high-dimensional information summarization mechanism.
Figure 5. The flowchart of summary update.
Figure 6. Prompt for summary update.
Figure 7. The sigmoid-based curve model.
Figure 8. The caching prefetch indicator’s trend of change over time under different recallTimes.
Figure 9. Prompt for hit judgment.
Figure 10. Average execution time statistics figure (unit: s): CPM stands for caching prefetch mechanism.
Table 1. Fragment of the global memory table.
Index | Time Interval | Recall Times | Text | Embedding Vector
0 | 2 | 1 | xxx | xxx
1 | 0 | 1 | xxx | xxx
2 | 6 | 4 | xxx | xxx
3 | 9 | 2 | xxx | xxx
4 | 3 | 3 | xxx | xxx
xxx: omitting excessively long content.
Table 2. Dataset statistics table.
Topic | Con_data | Rec_data | Ave_tokens
Legal consultation | 472 | 57 | 679.84
Enhance love | 532 | 62 | 577.91
Travel plan | 398 | 40 | 620.44
Grandfather and poetry | 451 | 51 | 598.63
Selling jewelry | 503 | 49 | 654.90
Table 3. The average hit rate statistics table.
Topic | Average Hit Rate
Legal consultation | 37.38%
Enhance love | 34.29%
Travel plan | 42.81%
Grandfather and poetry | 31.57%
Selling jewelry | 30.71%
Table 4. The average execution time statistics table (unit: s).
Model | Average Execution Time
ILSTMA (Without CPM) | 5.17
ILSTMA (With CPM) | 4.06
MemoryBank | 4.87
SCM | 4.10
MemoChat | 4.31
ChatRsum | 3.10
Bold font indicates the best performance. The underlined indicators represent the second-best performance.
Table 5. Comprehensive comparative experiment.
Model | Answer Accuracy | Retrieve Accuracy | Recall Accuracy | Coherence
ILSTMA | 0.884 | 0.938 | 0.663 | 0.948
MemoryBank | 0.624 | 0.640 | × | 0.927
SCM | 0.849 | 0.934 | × | 0.943
MemoChat | 0.609 | 0.512 | × | 0.932
ChatRsum | 0.486 | × | × | 0.921
Bold font indicates the best performance.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
