by Angeliki Antoniou, Anastasios Theodoropoulos, Artemis Chaleplioglou et al.

Reviewer 1: Shuai Yuan; Reviewer 2: Yanru Lyu

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This is an interesting and timely paper, exploring a highly relevant topic. The rapid advancements in generative AI present both opportunities and challenges for cultural heritage conservation and tourism, and the paper's focus on evaluating AI tools for narrative creation and envisioning their use in VR environments is a valuable contribution. I thought the idea of testing different models on specific cultural artifacts and then using them to find associations between diverse heritage sites was a creative and promising approach.

However, while the topic is strong, the manuscript in its current form requires significant revision. My primary concerns relate to a lack of methodological rigor and an informal writing style that often deviates from academic conventions, as well as sections that digress from the research (i.e., Sections 2 and 5).

AI in cultural heritage

The introduction to AI's use in cultural heritage (lines 94-115) mentions AI's ability to generate 3D objects and virtual environments, which "significantly reduces production effort." While true to an extent, this claim feels a bit overstated here. The statement doesn't fully acknowledge the significant challenges that still exist with AI-generated 3D assets. For example, issues with topology and poly-count optimization are critical for performance in VR environments.

Further, the core of your study doesn't actually use these 3D generation techniques, so this framing might give readers a misleading impression of the study's scope. It would be better to shorten the discussion on 3D generation and focus the literature review more tightly on text and narrative generation, which are the paper's actual focus.

Methods

The methodology section lacks critical details needed to assess the validity of the research. While the study is framed around questions (RQ1, RQ2, RQ3) and AI's connection to heritage, it also involves student evaluations. Key information about this process is missing. For example, when and where the study with students took place, how many students participated, and so on. It is also unclear who performed the tests for RQ2 and RQ3. Was it the same group of students from the RQ1 evaluation?

For RQ1, you assess different AI models. I think a note should be added to clarify that the stated limitations of these models were accurate at the time of the research (2024?). Features like live web access and hallucination reduction are constantly and rapidly improving, and the judgments you present may not be applicable by the time of publication. For example, GPT-5 has been reported to have significantly reduced hallucination and advanced internet search (in order to serve agentic tasks better). In your Section 5 or 6, you need to address these advancements, and I think especially agentic AI workflows and planner-based agents (e.g., Manaus, Genspark, Gemini Deep Research, etc.).

When you describe the student evaluation using metrics like "readability, validity, and usefulness," these terms aren't defined. It's important to explain how you defined and operationalized these concepts for the students, especially if your definition of a term like 'validity' differs from its common usage in social sciences.

Results

RQ1:

The paper acknowledges that the evaluation was conducted by students as a limitation, but the problems run deeper. There's no information on the number of students who provided ratings, how they were grouped, or how their scores were aggregated. The fact that all final scores in the table are clean integers raises questions about the summarization process. This lack of transparency makes it very difficult to trust the conclusion about which AI model is "better." Even without expert evaluators, the paper must detail the number of student participants, the evaluation procedure, and the method of data aggregation. At a minimum, mean scores and standard deviations should be reported to give a clearer picture of the results. Also, the preliminary nature of this evaluation should be more strongly emphasized.
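
For concreteness, the kind of per-criterion aggregation requested here takes only a few lines of analysis if individual ratings are recorded before the group discussion. The sketch below is purely illustrative: the ratings, the number of students, and the criterion labels are hypothetical, not data from the manuscript.

```python
# Illustrative sketch only: hypothetical per-student Likert ratings (1-5)
# for one AI model, aggregated into the mean and standard deviation that
# the review asks to see reported. All numbers are invented.
from statistics import mean, stdev

ratings = {
    "readability": [4, 5, 4, 4, 3],   # one rating per student (n = 5)
    "validity":    [3, 3, 4, 2, 3],
    "usefulness":  [4, 4, 5, 4, 4],
}

for criterion, scores in ratings.items():
    print(f"{criterion}: M = {mean(scores):.2f}, "
          f"SD = {stdev(scores):.2f}, n = {len(scores)}")
```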

RQ2:

The approach to having AI discover associations between heritage sites by feeding it metadata seems to limit the AI's potential. It pushes the models toward finding more superficial connections or barely using their "intuition." A more robust method might involve a chain-of-thought approach (e.g., using thinking models) or multi-step reasoning, where the AI is first prompted to generate detailed descriptions and then asked to find links between them; alternatively, the models could be asked to generate aspects that can connect those sites and then make the actual connections from those aspects. These distinctions are similar to Kahneman's dual-system theory of human thinking ("fast and slow"). You should at least acknowledge that the direct method used here may not fully tap into the reasoning capabilities of modern large language models, especially those in 2025.
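
To make the suggested two-step alternative concrete, a minimal sketch of a describe-then-link prompting chain is given below. It is illustrative only: it assumes the OpenAI Python client, a placeholder model name, and example site names and prompt wording, none of which come from the manuscript.

```python
# Minimal sketch of multi-step (describe-then-link) prompting, as opposed to
# a single direct "find associations" query. Assumes the OpenAI Python
# client; model name, sites, and prompts are placeholders for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"   # placeholder; any chat-capable model would do

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

sites = ["Acropolis of Athens", "Belfries of Belgium and France"]

# Step 1: elicit a detailed description of each site on its own.
descriptions = {s: ask(f"Describe the cultural significance, history and "
                       f"symbolism of {s} in detail.") for s in sites}

# Step 2: ask the model to reason over its own descriptions to find links.
links = ask(
    "Based only on the two descriptions below, identify deeper historical, "
    "functional or symbolic associations between the two sites, explaining "
    "your reasoning step by step.\n\n"
    + "\n\n".join(f"{s}:\n{d}" for s, d in descriptions.items())
)
print(links)
```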

The reporting of these connection-making results is also too general for a research paper. You do not mention how many times tests were repeated, what prompt variations were tried, or the frequency with which certain associations appeared. Since web versions of these models were likely used, their default "temperature" settings can produce varied outputs for the same prompt, and the way the data is presented does not address this potential for variability. The exploratory nature of this process needs to be described more cautiously.
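
One simple way to address the variability concern is to repeat an identical prompt several times at a fixed temperature and report how often each association recurs. The sketch below illustrates the idea only: it assumes the same hypothetical OpenAI client, a placeholder prompt, and a crude keyword-counting shortcut, not the procedure used in the paper.

```python
# Illustrative sketch: re-run one association prompt N times at a fixed
# temperature and count how often candidate themes recur in the outputs.
# Model name, prompt, and theme keywords are all placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()
PROMPT = ("Find associations between the Acropolis of Athens "
          "and the Belfries of Belgium and France.")
THEMES = ["civic", "defens", "landmark", "bell", "democracy"]  # crude probes

counts = Counter()
N_RUNS = 10
for _ in range(N_RUNS):
    response = client.chat.completions.create(
        model="gpt-4o",                 # placeholder model name
        temperature=0.7,                # fixed, instead of the web UI default
        messages=[{"role": "user", "content": PROMPT}],
    )
    text = response.choices[0].message.content.lower()
    counts.update(theme for theme in THEMES if theme in text)

for theme, freq in counts.most_common():
    print(f"'{theme}' appeared in {freq}/{N_RUNS} runs")
```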

RQ3:

The paper correctly identifies that AI models can produce factual errors (e.g., Gemini's misidentification of a painting, Claude's historical inaccuracies in the story). However, the discussion only briefly mentions "validity" as a challenge. A more valuable contribution would be to delve deeper into this. How should cultural heritage institutions design practical workflows to verify, correct, and ultimately leverage these AI outputs that are simultaneously creative and potentially flawed? This practical implication feels like a missed opportunity.

MobiCAVE and Engaging Virtual Worlds (Sections 5 & 6)

These last two sections are particularly problematic. The detailed description of a hypothetical future CAVE experience feels disconnected from the empirical results of the paper. It reads more like a separate research proposal or a general commentary, rather than a discussion grounded in your findings. The purpose is unclear. Is this a discussion of future research? A specific proposal? A reflection on teaching applications? The writing needs to clearly state its purpose and adopt a corresponding style. For instance, this could be condensed into a few sentences as a potential direction, or it could be fleshed out with more technical detail as a specific future study.

The tone in this section is also not appropriate for a research paper. A phrase like, "we can only imagine how such results could be presented and experienced..." is speculative and informal. Similarly, concepts like "AI procedural generated VR environments" and "story path" are used without proper definition or technical grounding. The assertion that "VR users could create their own worlds" is especially confusing. This suggests user-generated content platforms, which is a completely different paradigm from the AI-driven heritage narratives discussed earlier.

Crucially, the entire "Engaging virtual worlds" section, which functions as the conclusion, contains no citations to existing literature. The ideas presented here about immersive experiences, procedural generation, and ethical challenges are important topics in the field of digital heritage. This section must engage with relevant research to have any academic credibility.

Overall, the style in these sections relies on analogy and assertion rather than rigorous argumentation, and it needs a complete rewrite.

Minor Points

  • The term "BIP" is used in the introduction but should be written out in full on its first use.
  • When metadata is first mentioned in the context of RQ2, it would be helpful to provide a few brief examples (e.g., "metadata such as object type, creation date, materials...") to help readers unfamiliar with the term.
  • In the results section, the use of an arrow in paragraph text like "Acropolis and Belfries of Belgium → Both are human-made..." is very unconventional. This symbol is better suited to tables or diagrams.
  • The title for Table 2 appears to be a copy-paste error from Table 1. It reads "Strengths and Limitations of AI models used" but displays the student rating scores.
Comments on the Quality of English Language

The paper is undermined by significant issues in its writing in multiple sections, making a line-by-line correction not feasible in this round of review. The author needs to ensure the revised manuscript adheres to academic writing standards. This includes presenting arguments with clear logical steps, providing consistent supporting evidence, maintaining methodological transparency, and avoiding an assertive or speculative tone. I do agree that first-person phrasing can be used because research is done by people (the problem is not the use of "we"), but the overall presentation should be grounded in evidence.

Author Response

AI in cultural heritage: The introduction to AI's use in cultural heritage (lines 94-115) mentions AI's ability to generate 3D objects and virtual environments, which "significantly reduces production effort." While true to an extent, this claim feels a bit overstated here. The statement doesn't fully acknowledge the significant challenges that still exist with AI-generated 3D assets. For example, issues with topology and poly-count optimization are critical for performance in VR environments. Further, the core of your study doesn't actually use these 3D generation techniques, so this framing might give readers a misleading impression of the study's scope. It would be better to shorten the discussion on 3D generation and focus the literature review more tightly on text and narrative generation, which are the paper's actual focus.

The paragraph has been shortened, and parts were rewritten to reveal the AI challenges that still apply. In addition, as suggested by the reviewer, we focused on AI’s capacity for text and narrative generation, so the following was added:

In addition, AI has been successfully used in creating cultural narratives as game elements. The stories created were used in a scavenger hunt game to allow museum visitors to actively engage with the museum content [35]. AI also supports mobile applications that visitors can use to detect artworks and to interact with them through a chatbot that creates textual and auditory information. A previous study showed that when interacting with such a chatbot, visitors showed increased engagement with the artworks [36]. Furthermore, generative AI has been used to create a museum guide. Due to the accuracy challenges that emerge during the creation of the guide, human oversight is still necessary to guarantee scientific integrity. Despite these challenges, AI-generated museum guides could also provide personalized material based on unique visitor preferences [37]. Therefore, there seems to be an increased interest in the automatic creation of narratives, or at least in ways that assist heritage professionals with the creation of cultural narratives. AI seems to be a very promising tool, and for this reason there are studies that explore the potential of AI in story development. For example, [38] focused on the design and implementation of a platform that assisted curators in writing interactive stories. However, stories do not always come from official sources but could also derive from the unofficial stories of local people. AI can help local communities enhance such unofficial stories with images [39] and make these stories linkable and findable [40]. AI-assisted narrative enrichment through visualizations also seems to be a popular trend, since there are a few recent studies that exploit the potential of AI regarding narrative visualizations for cultural heritage [41]. Finally, GPT-4 has also been used to create interactive storytelling and allow personalized city explorations by highlighting points of interest and making the user the protagonist in the story created [42].

Methods: The methodology section lacks critical details needed to assess the validity of the research. While the study is framed around questions (RQ1, RQ2, RQ3) and AI's connection to heritage, it also involves student evaluations. Key information about this process is missing. For example, when and where the study with students took place, how many students participated, and so on. It is also unclear who performed the tests for RQ2 and RQ3. Was it the same group of students from the RQ1 evaluation?

Following the reviewer’s advice, we added the following to clarify the methodology: “A group of five students participated, who performed all the tests to answer R1, R2 and R3.”

For RQ1, you assess different AI models. I think a note should be added to clarify that the stated limitations of these models were accurate at the time of the research (2024?). Features like live web access and hallucination reduction are constantly and rapidly improving, and the judgments you present may not be applicable by the time of publication. For example, GPT-5 has been reported to have significantly reduced hallucination and advanced internet search (in order to serve agentic tasks better).

The following was added, since indeed the models improve very fast: “Since AI models are rapidly improving, the limitations observed by students and reported here are from April 2025”.

In your Section 5 or 6, you need to address these advancements, and I think especially agentic AI workflows and planner-based agents (e.g., Manaus, Genspark, Gemini Deep Research, etc.).

In section 5, the following was added to acknowledge the significant advancements in the field: “There are currently significant advancements in Agentic AI and planner-based agents. Agentic AI models work in a proactive fashion, understanding the context of the users’ prompts and addressing complex tasks. Advanced AI models can also form and follow plans, and they are known as planner-based agents. A planner-based AI can understand the overall goals of users, identify the right tool to achieve these goals, create detailed plans, and execute and monitor each step. Systems like Genspark, Gemini Deep Research, Manaus, etc. are good examples of AI that moves beyond answering single questions and is capable of dealing with more complex requests. Since such AI advancements in all fields, including cultural heritage, are rapid [59], the necessity to train future professionals is evident.”

When you describe the student evaluation using metrics like "readability, validity, and usefulness," these terms aren't defined. It's important to explain how you defined and operationalized these concepts for the students, especially if your definition of a term like 'validity' differs from its common usage in social sciences.

The following was added to clarify the scales used: “Students specified the three terms and defined the different levels to proceed with the evaluation of the AI-generated content.

More specifically, for the Readability of the generated content the following scale was used [53]: 1 – very difficult (text has problems with structure, grammar, syntax, spelling, and/or uses complex language and unknown terminology). 2 – difficult (problems with structure and fewer errors. Terminology might still be confusing, but most parts make sense). 3 – moderate (clear structure and flow, with minor language issues that do not affect understanding). 4 – easy (good structure, easy to follow, very few if any grammatical/syntactical errors). 5 – very easy (easy to follow, clear and engaging. The reader does not require any effort to read, and aids like headings and bullet points are also used).

Regarding Validity, accuracy and reliability of the produced content, the following scale was used [54,55]: 1 – unreliable (factual errors, use of non-credible sources, misinformation). 2 – questionable (does not refer to sources and sources are outdated. Inaccuracies might still exist although limited). 3 – credible (mostly accurate content but lacks a full range of references. Content seems correct but you might have to cross-check). 4 – reliable (content is accurate and supported by strong and verifiable sources). 5 – highly reliable (multiple high-quality sources are used to support all claims).

Regarding Usefulness, the following scale was used [56,57]: 1 – not useful (content is irrelevant). 2 – minimally useful (although the content touches on the topic, it gives very little new or practical knowledge). 3 – moderately useful (the content is somewhat useful, but it lacks details or clear applicability). 4 – useful (covers the user needs and provides applicable knowledge that allows the users to achieve their goals). 5 – highly useful (the content is the ultimate solution to the problem and is easily applicable).

The students discussed the AI outcomes as a group and decided on the evaluation of each dimension.”

Results

RQ1: The paper acknowledges that the evaluation was conducted by students as a limitation, but the problems run deeper. There's no information on the number of students who provided ratings, how they were grouped, or how their scores were aggregated. The fact that all final scores in the table are clean integers raises questions about the summarization process. This lack of transparency makes it very difficult to trust the conclusion about which AI model is "better." Even without expert evaluators, the paper must detail the number of student participants, the evaluation procedure, and the method of data aggregation. At a minimum, mean scores and standard deviations should be reported to give a clearer picture of the results. Also, the preliminary nature of this evaluation should be more strongly emphasized.

We have now answered all the above in the methodology and the results sections. The following was added in the methodology section: “A group of five students participated, who performed all the tests to answer R1, R2 and R3”.

 

In the results section we specify the way the evaluation was conducted. The following was added: “Students specified the three terms and defined the different levels to proceed with the evaluation of the AI-generated content.

More specifically, for the Readability of the generated content the following scale was used [53]: 1 – very difficult (text has problems with structure, grammar, syntax, spelling, and/or uses complex language and unknown terminology). 2 – difficult (problems with structure and fewer errors. Terminology might still be confusing, but most parts make sense). 3 – moderate (clear structure and flow, with minor language issues that do not affect understanding). 4 – easy (good structure, easy to follow, very few if any grammatical/syntactical errors). 5 – very easy (easy to follow, clear and engaging. The reader does not require any effort to read, and aids like headings and bullet points are also used).

Regarding Validity, accuracy and reliability of the produced content, the following scale was used [54,55]: 1 – unreliable (factual errors, use of non-credible sources, misinformation). 2 – questionable (does not refer to sources and sources are outdated. Inaccuracies might still exist although limited). 3 – credible (mostly accurate content but lacks a full range of references. Content seems correct but you might have to cross-check). 4 – reliable (content is accurate and supported by strong and verifiable sources). 5 – highly reliable (multiple high-quality sources are used to support all claims).

Regarding Usefulness, the following scale was used [56,57]: 1 – not useful (content is irrelevant). 2 – minimally useful (although the content touches on the topic, it gives very little new or practical knowledge). 3 – moderately useful (the content is somewhat useful, but it lacks details or clear applicability). 4 – useful (covers the user needs and provides applicable knowledge that allows the users to achieve their goals). 5 – highly useful (the content is the ultimate solution to the problem and is easily applicable).

The students discussed the AI outcomes as a group and decided on the evaluation of each dimension.”

 

RQ2: The approach to having AI discover associations between heritage sites by feeding it metadata seems to limit the AI's potential. It pushes the models toward finding more superficial connections or barely using their "intuition." A more robust method might involve a chain-of-thought approach (e.g., using thinking models) or multi-step reasoning, where the AI is first prompted to generate detailed descriptions and then asked to find links between them; alternatively, the models could be asked to generate aspects that can connect those sites and then make the actual connections from those aspects. These distinctions are similar to Kahneman's dual-system theory of human thinking ("fast and slow"). You should at least acknowledge that the direct method used here may not fully tap into the reasoning capabilities of modern large language models, especially those in 2025.

Following the reviewer’s advice, we now acknowledge the limitations of the approach followed, and relevant text was added in the methodology (RQ2) and the conclusions:

“The approach of feeding AI models with metadata was a conscious one, considering its benefits and limitations for the student training process. This grounded and context-specific analysis offers a greater level of control compared to a general research query, appropriate for student training. Avoiding general internet knowledge and relying primarily on metadata leads to fewer inaccuracies, allows the discovery of previously unknown links (since it does not rely on well-documented information found over the internet), avoids using common keywords and search engine optimization (since it focuses on raw data, enhancing unbiased analysis), and can rely on large heritage datasets often available from heritage institutions [52]. However, potential drawbacks must also be acknowledged here, such as the quality of available metadata (not a problem in the current work, since the metadata were curated by the researchers), the lack of broader content and information beyond the available metadata, and the time commitment needed to prepare the metadata.”

The reporting of these connection-making results is also too general for a research paper. You do not mention how many times tests were repeated, what prompt variations were tried, or the frequency with which certain associations appeared. Since web versions of these models were likely used, their default "temperature" settings can produce varied outputs for the same prompt, and the way the data is presented does not address this potential for variability. The exploratory nature of this process needs to be described more cautiously.

To clarify the methodology steps further we added the following text: “Each model was provided with the same information, namely painting title, date of creation and artist. Then the following questions were asked to all models, one time per model”  and in RQ2: “The prompt used in all models once was: “find associations between X,Y” (to associate 2 sites), “find associations between X,Y,Z” (to associate 3 sites), etc.”

RQ3: The paper correctly identifies that AI models can produce factual errors (e.g., Gemini's misidentification of a painting, Claude's historical inaccuracies in the story). However, the discussion only briefly mentions "validity" as a challenge. A more valuable contribution would be to delve deeper into this. How should cultural heritage institutions design practical workflows to verify, correct, and ultimately leverage these AI outputs that are simultaneously creative and potentially flawed? This practical implication feels like a missed opportunity.

The issue raised by the reviewer is very important and was only briefly explored by the current work. We acknowledge its importance by adding the following in the conclusion section: “Finally, the current work only briefly raised the issue of the validity of AI models. Future works should delve deeper into validity issues and examine ways that heritage institutions should design practical workflows to handle AI outputs that could be both creative and potentially flawed. The current work only touched on this issue, as it wished to train future heritage professionals in the use of AI models and allow them to realize their potential and limitations.”

MobiCAVE and Engaging Virtual Worlds (Sections 5 & 6): These last two sections are particularly problematic. The detailed description of a hypothetical future CAVE experience feels disconnected from the empirical results of the paper. It reads more like a separate research proposal or a general commentary, rather than a discussion grounded in your findings. The purpose is unclear. Is this a discussion of future research? A specific proposal? A reflection on teaching applications? The writing needs to clearly state its purpose and adopt a corresponding style. For instance, this could be condensed into a few sentences as a potential direction, or it could be fleshed out with more technical detail as a specific future study. The tone in this section is also not appropriate for a research paper. A phrase like, "we can only imagine how such results could be presented and experienced..." is speculative and informal. Similarly, concepts like "AI procedural generated VR environments" and "story path" are used without proper definition or technical grounding. The assertion that "VR users could create their own worlds" is especially confusing. This suggests user-generated content platforms, which is a completely different paradigm from the AI-driven heritage narratives discussed earlier.

Following the advice of Reviewer 2, we removed section 5 altogether and only kept the following text, which was added to the next section (previous section 6, now section 5) as a possible future development: “In a future work, a potential direction could also be explored, coupling AI’s ability to create cultural content (narratives, images, audio, etc.) with Virtual Reality environments. In particular, the MobiCAVE, a room-sized Cave Automatic Virtual Environment (Figure 4) [60], could be used to allow embodied interaction with virtual cultural content that has been created by AI models. The system's room-scale setup and gesture-based controls create the perfect way to turn regular stories into engaging experiences that feel real and are spread out in space. On the other hand, AI can create cultural stories that are tied to a place, based on themes, or rich in symbolism — but above all, it can co-create the story the user wants to experience, responding to emotional triggers and mnemonic audiovisual elements.”

Crucially, the entire "Engaging virtual worlds" section, which functions as the conclusion, contains no citations to existing literature. The ideas presented here about immersive experiences, procedural generation, and ethical challenges are important topics in the field of digital heritage. This section must engage with relevant research to have any academic credibility. Overall, the style in these sections relies on analogy and assertion rather than rigorous argumentation, and it needs a complete rewrite.

References were added in the entire section 5 to back all arguments. In addition, large parts of section 5 were rewritten, and the style was changed.

Minor Points

  • The term "BIP" is used in the introduction but should be written out in full on its first use.

The term is now specified.

  • When metadata is first mentioned in the context of RQ2, it would be helpful to provide a few brief examples (e.g., "metadata such as object type, creation date, materials...") to help readers unfamiliar with the term.

Done

  • In the results section, the use of an arrow in paragraph text like "Acropolis and Belfries of Belgium → Both are human-made..." is very unconventional. This symbol is better suited to tables or diagrams.

The arrow is removed.

  • The title for Table 2 appears to be a copy-paste error from Table 1. It reads "Strengths and Limitations of AI models used" but displays the student rating scores.

Changed to “Student evaluations for the AI models used”.

 

Comments on the Quality of English Language

The paper is undermined by significant issues in its writing in multiple sections, making a line-by-line correction not feasible in this round of review. The author needs to ensure the revised manuscript adheres to academic writing standards. This includes presenting arguments with clear logical steps, providing consistent supporting evidence, maintaining methodological transparency, and avoiding an assertive or speculative tone. I do agree that first-person phrasing can be used because research is done by people (the problem is not the use of "we"), but the overall presentation should be grounded in evidence.

We would like to thank the reviewer for the advice. We have now re-written many parts of the paper and have reduced the number of sections by one. All changes are indicated in the manuscript with blue fonts. The paper has also undergone proofreading.

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript explores how AI tools can be integrated into cultural heritage experiences, from generating narratives about historical sites to creating immersive VR environments. There are some questions as follows:

  1. The title makes it look like this is a complete study from narrative generation to VR implementation, but the majority of the manuscript focuses on comparing AI model performance and does not demonstrate the actual technical implementation process from narratives to VR.
  2. There is no clear central research question or hypothesis driving the entire study. As a result, the three research questions are relatively independent and lack inherent logical connections. Why is AI integration necessary, and what previously unsolvable problems can be addressed?
  3. The manuscript reads like a description of workflow processes, lacking a theoretical foundation, clear academic contributions, and innovative insights.

Author Response

The title makes it look like this is a complete study from narrative generation to VR implementation, but the majority of the manuscript focuses on comparing AI model performance and does not demonstrate the actual technical implementation process from narratives to VR.

The title has been changed to: “AI-enabled cultural experiences: A comparison of narrative creation across different AI models”.

There is no clear central research question or hypothesis driving the entire study. As a result, the three research questions are relatively independent and lack inherent logical connections. Why is AI integration necessary, and what previously unsolvable problems can be addressed? The manuscript reads like a description of workflow processes, lacking a theoretical foundation, clear academic contributions, and innovative insights.

We would like to thank the reviewer for the evaluation of our paper. Following the reviewer’s advice, we have made many changes. The entire paper was significantly restructured; major parts were removed while others were added to improve the structure and the logical connection between the research questions and the results. In addition, the literature review was enhanced, and the methodology was clarified further.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Overall, the authors have responded well to the previous comments, providing substantial additional information that enhances the transparency of the methods and frames the discussion of AI capabilities and future directions in a more appropriate manner. I have only the following minor revision comments.

1. Abstract, line 20. The phrase "appropriate artificial intelligence tools, we tested different tools to determine the most suitable for cultural content management" could be slightly reworded for precision. The term "AI tools" feels somewhat broad. Specifying "large language models" would better distinguish it from past deep learning approaches and future AI developments. This revision may also be applied to the first paragraph in the introduction.
2. Line 22. The sentence "We then proceeded to search for common elements among different spaces to arrive at the creation of specific narratives, and indeed emotional narratives" is somewhat unclear, potentially due to imprecise description. A minor rephrasing for clarity would help.
3. Table 2 (Student evaluations for the AI models used). I saw each cell included the metric name, but these are identical across rows, making the table cluttered. Consider moving names like "readability" to the left-hand column (e.g., column 2) to avoid repetition and improve readability.
4. Section on methodology, RQ1 (e.g., line 146): While the companies behind the AI models are now specified, the versions remain unstated. Which DeepSeek version (v2.5, v3, r1, or all) was used? For ChatGPT, was an o1 or o3 style reasoning model employed? As LLM performance is closely tied to model versions (often more so than to the company), these details are important. If multiple versions from the same company were used, listing them all would be helpful.
5. I may have missed it, but if the current method description lacks approximate timelines for the research phases, these should be added for completeness.

Author Response

We would like to thank Reviewer 1 for the comments that led to significant improvements of our work. You can see how we have made all the necessary changes to respond to all the suggestions.

  1. Abstract, line 20. The phrase "appropriate artificial intelligence tools, we tested different tools to determine the most suitable for cultural content management" could be slightly reworded for precision. The term "AI tools" feels somewhat broad. Specifying "large language models" would better distinguish it from past deep learning approaches and future AI developments. This revision may also be applied to the first paragraph in the introduction.

It has been revised according to the suggestion.

  2. Line 22. The sentence "We then proceeded to search for common elements among different spaces to arrive at the creation of specific narratives, and indeed emotional narratives" is somewhat unclear, potentially due to imprecise description. A minor rephrasing for clarity would help.

We have rephrased as follows: “They then connected common elements found across different cultural spaces to construct specific and emotional narratives.”

  3. Table 2 (Student evaluations for the AI models used). I saw each cell included the metric name, but these are identical across rows, making the table cluttered. Consider moving names like "readability" to the left-hand column (e.g., column 2) to avoid repetition and improve readability.

Table 2 has been changed. The terms have now been moved to the left column to improve readability.

  4. Section on methodology, RQ1 (e.g., line 146): While the companies behind the AI models are now specified, the versions remain unstated. Which DeepSeek version (v2.5, v3, r1, or all) was used? For ChatGPT, was an o1 or o3 style reasoning model employed? As LLM performance is closely tied to model versions (often more so than to the company), these details are important. If multiple versions from the same company were used, listing them all would be helpful.

Following the suggestion, we have added the version of the tools we used:  “The free versions of the AI models we examined were DeepSeek V2, ChatGPT 3.5 (OpenAI), Claude 3 Haiku (Anthropic), and Gemini 2.5 Flash (Google DeepMind).”

  5. I may have missed it, but if the current method description lacks approximate timelines for the research phases, these should be added for completeness.

At the end of the methodology section, the following timeline was added: “Being a part of student training and an activity in a BIP Erasmus program, the entire process was planned to last for one day. More specifically, from 9.30-11.30 students compared the large language models for their readability, validity, and usefulness, recording the strengths and weaknesses of each model, focusing on Research Question 1 (R1). From 12.00-13.00 and 14.00-15.00 the focus shifted to R2, where students researched the production of metadata and the discovery of associations between sites. Finally, from 15.30-17.00 the focus concluded with R3, during which the AI created stories for the different sites.”

Reviewer 2 Report

Comments and Suggestions for Authors

I think this revision has aligned with the publication standard.

Author Response

We would like to thank Reviewer 2 for evaluating our manuscript. The Reviewer did not require any further changes in this reviewing round.