Retrieving Memory Content from a Cognitive Architecture by Impressions from Language Models for Use in a Social Robot
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper presents an innovative approach that combines a cognitive architecture (ACT-R) with large language models (LLMs) and vision language models (VLMs) to endow social robots with human-like memory and recall capabilities, enhancing the contextual relevance and credibility of human-robot interaction. Through ACT-R's declarative memory, the robot stores real-time perceptual data (such as conversation keywords and visual features) and uses its procedural memory for association-based memory retrieval. The system can dynamically invoke personalized memory content to enrich LLM-generated responses, reducing "hallucination" issues. Experiments demonstrate two application scenarios: 1) text-based dialogue for train station information retrieval; and 2) memory triggering based on visual impressions, validating the framework's effectiveness in enhancing the LLM's contextual understanding and response accuracy. Generally speaking, this paper is well motivated and easy to follow, and I do not have any major concerns. Instead, I would like to offer some suggestions.
- Please reorganize the Methods section to clarify the workflow, for instance, how keywords are processed by ACT-R.
- Further discuss how the ACT-R integration improves over RAG.
- Use quantitative metrics such as precision, recall, and response time.
- Conduct an ablation study.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The article presents an innovative integration of ACT-R cognitive models with social robots using LLMs/VLMs, but it would benefit from deeper experimental validation and a more critical discussion of limitations.
- In the related work section, the authors review many sources but lack a critical synthesis that highlights the research gap.
- The methods are technically detailed; however, no quantitative evaluation metrics or performance benchmarks are presented.
- The discussion correctly identifies limitations, but the latency impact needs deeper quantitative assessment.
- Only 4 references from 2025 are used; total references are 52, which shows good breadth, but the inclusion of more recent empirical studies would strengthen the argument.
- In the Discussion section, “We testet this system …” should read “tested” (p. 12, l. 439)
- “we used it’s remote interface” misuses the apostrophe; it should be “its” (p. 6, l. 224)
- Reference 35 redundantly states “Proceedings of the Proceedings of the 17th ICAART”; delete the second “Proceedings”
- The authors are kindly encouraged to conduct thorough proofreading to correct minor typographical and grammatical inconsistencies throughout the manuscript.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
The work is complete and meaningful, concisely and consistently written, and cannot be criticised for any major shortcomings. The only thing that bothered me was the somewhat modest methodological part, from which it is difficult to draw a uniform conclusion and a direction for further research. I therefore suggest that the whole, especially the empirical part, be written up a little more systematically, since such an approach is very accessible to computer scientists (e.g., in the form of a thought pattern, diagram, or flow chart). For example, you mention two examples in various places, in the first case only briefly (l. 123, "employed remained the same in both cases") and later (l. 279) in more detail: are these the same or different experiments?
A few minor comments:
- Explain in a little more detail the phrasing "of each image in three keywords" (l. 126) and "human question or problem was also expressed in three keywords" (l. 127). Why three keywords? Is this related to processing time or processing power?
- It would be interesting to add whether the system works in real time or what the lag is.
- Please explain how you combined the different "programming languages": you mention LISP, C++, Python, ... and then the LLM and VLM?
- "401-402 In general, there are far more options available for a programmatic implementation of cognitive models with ACT-R or similar frameworks in combination with social robots" - Please clarify this claim a bit.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors should correct the duplicated DOI prefix in reference 14; it currently reads "https://doi.org/https://doi.org/..." and should be revised to a single valid DOI link.
Author Response
Comments 1: The authors should correct the duplicated DOI prefix in reference 14; it currently reads "https://doi.org/https://doi.org/..." and should be revised to a single valid DOI link.
Response 1: Thank you very much. We have corrected the duplicated DOI prefix; it should be all right now.