Article

Speaking with the Past: Constructing AI-Generated Historical Characters for Cultural Heritage and Learning †

by
Boaventura DaCosta
Solers Research Group, Sanford, FL 32771, USA
This paper is an extended version of the paper published in DaCosta, B. Crafting Digital Personas of Historical Figures for Education through Generative AI: Examining Their Accuracy, Authenticity, and Reliability. In Proceedings of the Society for Information Technology & Teacher Education International Conference; Cohen, R.J., Ed.; Association for the Advancement of Computing in Education (AACE): Orlando, FL, USA, 2025; pp. 605–610. Available online: https://www.learntechlib.org/primary/p/225573/.
Heritage 2025, 8(9), 387; https://doi.org/10.3390/heritage8090387
Submission received: 16 July 2025 / Revised: 1 September 2025 / Accepted: 9 September 2025 / Published: 18 September 2025

Abstract

Recent advances in generative artificial intelligence (AI) have enabled the creation of AI-generated characters modeled after historical figures, offering new opportunities for reflective and interactive engagement in both cultural heritage and education. This study explores the development and evaluation of a large language model representation of Joseph Lister (1827–1912), a pioneer of antiseptic surgery, within a retrieval-augmented generation framework. The purpose was to examine the model’s accuracy, authenticity, and reliability, highlighting challenges, best practices, and ethical considerations. Drawing on primary and secondary sources, including Lister’s writings, the model was constructed using OpenAI’s GPT-4o and refined through iterative validation. Prompts were categorized by cognitive complexity, and responses were evaluated against historical materials. The findings revealed a strong fidelity to Lister’s voice, with appropriate tone, diction, and temporal limits. Moreover, the model demonstrated behavioral control, reflective depth, and consistency across the different prompts. However, minor lapses in temporal framing and occasional embellishments were noted. The findings suggest that, when developed with care, AI-generated characters can support ethically grounded, historically sensitive learning experiences. At the same time, this approach warrants continued scrutiny and underscores the need for further interdisciplinary research and responsible implementation.

1. Introduction

Engaging in dialogue with individuals from the past, including renowned historical figures, is steadily becoming a reality [1,2]. Emerging interactive information and communication technology applications, together with storytelling approaches that incorporate augmented and virtual reality, are gaining traction in conveying and enriching both tangible and intangible aspects of cultural heritage (CH) [3]. A recurring motif across literary genres features characters engaging in dialogue with the deceased, who offer guidance and maintain an active presence within the narrative [4]. This enduring imagery illustrates how the wisdom of a respected individual can transcend death, continuing to influence and guide future generations [4].
Recent advancements in generative artificial intelligence (AI) have significantly accelerated the development of digital characters [1,5] modeled after contemporary, historical, and fictional figures [5]. Pataranutaporn et al. [4] liken these characters to interactive photographs. They describe how these dynamic representations can engage in dialogue, allowing questions to be posed to gain direct insight into an individual’s knowledge, perspectives, and even experiences. In recent years, a growing number of platforms have emerged in this space [1,2]. One such instance is Character.AI, which utilizes large language models (LLMs) to create interactive encounters with historical figures, catering to both education and entertainment [2].
Large language models, a form of AI developed to comprehend and produce human language, have experienced a rapid surge in recent years [6] and are arguably now among the most widely deployed forms of AI [7]. Essentially, they are a type of generative AI capable of creating text and images [8]. They are trained on extensive and varied datasets [6,8,9,10], drawn from the Internet [6,9,10,11], and function by predicting the next word in a sequence based on the context provided by the preceding text [9]. Their growing popularity has contributed to increased interest in using AI to facilitate interactive encounters with historical figures, enabling learners to engage in insightful dialogue by posing questions and receiving contextually tailored responses [4,12]. By allowing for direct engagement with historical figures, these experiences can bring the past to life in new and meaningful ways [1,13].
This paves the way for fascinating opportunities in both cultural and educational contexts. Large language models can enhance our understanding of ancient civilizations, languages, and traditions [11], including their preservation [2]. They can help learners to explore the lives and work of both renowned and lesser-known figures [1]. When combined with storytelling, they can humanize the learning process, making it more engaging, relatable, and inspirational [5]. By creating a deeper understanding of historical contexts, encouraging engagement in historical debates, and promoting critical thinking, these resources have the potential to become invaluable educational tools [1].
However, their use has sparked ethical concerns [14,15,16,17]. Generative AI can hallucinate or oversimplify human complexity, leading characters to convey unintended views or distort the legacies of those they portray [2,12]. Since these models often draw on personal writings [2,18], their use without permission introduces risks to privacy and consent [4,12], whereas commercialization raises the possibility of exploitation of likeness [12,19]. Taken together, these issues represent essential considerations as AI and LLMs continue to evolve [9].

1.1. Purpose

Despite these challenges, it is proposed that generative AI presents substantial opportunities for advancing historical representation, learning engagement, and cultural exploration through the creation of historical, contemporary, and fictional figures. As Pataranutaporn et al. [5] assert, AI-generated characters hold significant possibilities for enhancing educational experiences across various settings, including classrooms, museums, historical landmarks, and related outdoor locations. When thoughtfully implemented, these characters can offer a powerful means for both current and future generations to honor the individual, explore and preserve their narratives, and gain a deeper understanding of their insights and wisdom [4].
With the continued emergence and accessibility of LLMs, research is now exploring more autonomous and conversational systems capable of independent interaction [4,8]. Building upon this work, as well as contributions from others, such as Goldman [9] and Hutson et al. [1], the current research aims to examine the accuracy, authenticity, and reliability of AI-generated historical characters, with particular attention to the challenges, best practices, and ethical considerations involved in their design and use. This focus addresses a gap identified by Amin et al. [20], who note that the field of generative AI in persona development still lacks established best practices.
While the broader context concerns how LLMs might ultimately reshape the ways individuals engage with and interpret historical figures, this study concentrates specifically on the quality and implications of model outputs. By doing so, it provides practical insights intended to support educators, researchers, CH experts, and other stakeholders who seek to implement AI-generated characters in educational contexts.
This study explores this emerging practice by focusing on Joseph Lister (1827–1912), a British surgeon–scientist and pioneer of antiseptic surgery [21,22]. Lister represents a figure of ongoing pedagogical value, as his contributions continue to be taught in medical, scientific, and historical contexts. This provides a concrete way to illustrate how AI-generated characters can support both disciplinary learning and CH.
The choice also reflected ethical considerations. Unlike contemporary figures, Lister’s writings and biographical materials are largely in the public domain, reducing concerns about privacy, consent, and the unauthorized use of personal or sensitive data. Moreover, because the commercialization of historical figures without consent remains a significant ethical concern [12,19], the model developed for this study was created solely for research purposes and will not be made publicly accessible. This also means that this study includes AI-generated content, with findings based on a comparison of model responses against both prompts and historically grounded source material.
Lastly, the findings were shared with educational researchers and practitioners, whose feedback helped to refine the work. Input from academic conferences, including sessions where a pilot of this study was presented [23], played a meaningful role in shaping the final publication.

1.2. Terminology

It is essential to clarify the terminology that underpins the practice of constructing AI-generated characters. Methuku and Myakala [17] provide a helpful distinction, describing these entities as either “pre-mortem AI clones” or “post-mortem generative ghosts.”
“Pre-mortem AI clones” are “AI-driven entities designed to function alongside living individuals, assisting them in various capacities, such as productivity, entertainment, and legacy preservation” [17] (p. 2). In contrast, “post-mortem generative ghosts” are “AI-driven systems designed to simulate interactions with deceased individuals” ([19], as cited in [17], p. 2). Often described as a form of “digital necromancy” [12], these systems were created to support bereavement, offering emotional comfort to those mourning a loss [24].
These generative ghosts, essentially early chatbots that predate modern LLMs, rely on text produced by the deceased during their lifetime [2] to create “an interactive AI-generated portrayal of a person’s stories, attitudes, personality, and wisdom” [4] (p. 890). Although traditionally tied to mourning, their use has since expanded to broader applications [2].
Given this variability, this study uses the term “AI-generated characters” as a unifying descriptor. In the pilot phase, the term “digital personas” was used to describe “a lifelike representation of an individual—contemporary, historical, or fictional—created through AI and with various intended purposes, including education” [23] (p. 606). While that definition remains unchanged in the present study, the term “AI-generated characters” is now preferred to reflect the pedagogical and representational aims of this work and to avoid the associations that digital personas often carry, particularly in marketing and user profiling.

2. Related Work

To contextualize the use of AI-generated characters for cultural and educational purposes, it is helpful to begin with a brief discussion of digital personas (conceivably pre-mortem applications). Much of this research has emerged from the fields of human–computer interaction and user-centered design, where these personas are used to create archetypical users that represent the key traits, goals, and behaviors [25] of individuals interacting with a system, product, or service across a wide range of domains [20,26], from research to marketing.
Traditional persona development relied on time-consuming approaches, including qualitative data collection and manual analysis [25,26,27], drawing on methods such as interviews, observations, and survey data [26,27]. More recently, generative AI has been examined, with LLMs attracting considerable interest [1,9,10,20,28] for their ability to emulate human-like conversation [29,30], enabling the scalable production of diverse personas [31].
Increasingly, these LLMs have taken the form of chatbots [4,28], with OpenAI’s ChatGPT [11,28] demonstrating impressive proficiency in producing text that is both coherent and contextually appropriate [32]. Generative Pre-trained Transformer (GPT) models have been at the forefront of this exploration, providing a foundation for investigating the intricacies and potential of language-based AI [9]. In this context, ChatGPT and other models, such as Anthropic’s Claude, are commonly cited in the literature, even if they are not directly used in the studies themselves.
For example, Jiang et al. [29] evaluated GPT-3.5’s ability to simulate such behavior by assigning 320 unique personas based on combinations of the Big Five personality traits, finding that the model consistently reflected the assigned traits in its language use (though gender had minimal impact on linguistic variation). Serapio-García et al. [30] developed a psychometric approach showing that LLMs can consistently express stable and valid personality profiles through prompting, underscoring the promise and ethical complexity of crafting psychologically coherent personas.
While much research continues to focus on generalized or psychometric profiles, AI-driven persona generation has also expanded into other domains [27]. One area that has attracted increasing attention is postmortem applications [19].

2.1. Postmortem Applications

Across cultures, humans have developed rituals and practices to remember those who came before them, whether for mourning, cultural preservation, or historical understanding [4]. With technological advancements, we are now significantly closer to interacting with the dead [2,12,19]. This has given rise to the emerging field of digital necromancy, a line of research dedicated to integrating AI and complementary technologies, such as robotics, to facilitate interactions with virtual representations of the deceased [9,12]. This involves utilizing data from various sources [13], including personal writings [2,18]. These firsthand artifacts serve as vital inputs, allowing models to more accurately reflect the identities they represent by grounding them in content created by the individuals themselves.
Prior research has shown that digital systems can meaningfully support grief, memory, and legacy [4]. Banks et al. [33] explored how digital technologies mediate practices of memorialization, highlighting the emotional, cultural, and ethical complexities involved in remembering the deceased. Odom et al. [34] examined the design of digital heirlooms intended to be passed through generations, emphasizing the sentimental and memorial value such objects can hold. Both studies advocate for culturally sensitive and emotionally resonant approaches that prioritize long-term connections and legacies, underscoring the evolving role of digital systems in preserving memories and fostering generational connections.
Although these systems have predominantly been used for memorial or commemorative purposes [19], their application is expanding into cultural and educational contexts, with researchers exploring how AI can be used to engage learners actively with the past.

2.2. Cultural and Educational Applications

Earlier efforts to bring the past to life have relied on animated representations with limited interactivity, such as animatronic displays, or on immersive technologies like augmented, virtual, or extended reality [1]. These approaches depended on pre-recorded or curated responses tailored to predefined questions, which ensured factual accuracy but restricted the user experience [4].
Large language models have significantly expanded what is possible, enabling the creation of virtual instructors modeled after contemporary, historical, or fictional individuals [5,26] and allowing for instructional content to be customized to specific subjects, contexts, and learner needs, essentially creating personalized learning [35]. Information that was once only available in writing or conveyed secondhand can now be delivered by a virtual representation of the actual person sharing their work or story [5].
Given their potential, AI-generated characters may be valuable in areas such as museum curation and historical scholarship [2]. Technology has long played a role in revitalizing the past, especially through digital reconstructions that allow audiences to engage more directly and dynamically with historical spaces, artifacts, and events. For example, Varitimiadis et al. [36] discuss the potential of leveraging AI techniques in museum chatbots to move them beyond scripted conversations dependent on limited domain knowledge.
This is important because CH plays a vital role in shaping identities, preserving historical narratives, and fostering cross-cultural communication. By enabling visitors to interact directly with these figures, AI-generated characters provide not only embedded knowledge [2,14], but also new ways to teach, share stories [5], and present diverse perspectives [2,5]. Such interactions can foster a deeper understanding of the challenges historical figures faced during their lifetimes and how they worked to overcome them [4].
Large language models are believed to be revolutionizing the field of CH, offering innovative solutions not only for education and public engagement, but also for preservation [14]. They could help small communities to preserve the collective wisdom of elders, including religious traditions, cultural practices [2], and endangered languages [14]. Pataranutaporn et al. [4] emphasize the urgency of such efforts, citing Nettle and Romaine (2000, as cited in [4]), who estimate that a language dies every 14 min, along with the oral traditions and cultural values embedded within it.
Other applications include learners conversing with historical figures who were experts in specific disciplines, such as engaging Albert Einstein in discussions of relativity [4]. Moreover, these characters may not represent a specific individual, but rather what Morris and Brubaker [2] describe as archetypes or amalgamations, such as citizens of Pompeii or residents of Colonial Williamsburg. Similarly, Hutson and Ratican [12] highlight the use of AI and historical data to create lifelike reenactments of significant moments, events, or entire eras, offering substantial educational value by allowing individuals to engage with history through immersive and interactive experiences [13].
Research supports these and similar applications, demonstrating how interacting with AI-generated characters can offer a means to experience history [4]. Hutson et al. [1] cite the early work of Haller and Rebedea (2013, as cited in [1]), who developed a chatbot trained on biographical texts to simulate the character and personality of historical figures, demonstrating that natural language technologies could enable historically grounded conversations. Pataranutaporn et al. [4] introduced a system using OpenAI’s GPT-3.5 to create interactive versions of Leonardo da Vinci, Murasaki Shikibu, and Captain Robert Scott, finding that participants who both interacted with the da Vinci model and read the journal scored significantly higher on learning outcomes and motivation than those who only read the journal or only interacted with the character. Hutson et al. [1], in turn, developed a digital resurrection of Mary Sibley, drawing on her extensive diaries and archival materials to train and refine the model to reflect her distinctive voice, beliefs, and perspectives in relation to gender, religion, and education.
Collectively, these studies illustrate the evolving potential of LLMs in the digital humanities, transforming how individuals engage with history and CH. However, important questions remain regarding the accuracy, authenticity, reliability, and ethical risks associated with representing historical voices through AI.

2.3. Ethical Concerns

Ethical concerns arise, in part, from the sensitive nature of the topic. Hutson et al. [15] found that individuals who had experienced loss expressed discomfort and unease toward AI-generated characters of the deceased, underscoring the need to prioritize psychological well-being and respect in the development of such characters.
Beyond individual reactions, scholars have also warned that AI-generated characters raise broader concerns about the blurred boundaries between the real and the simulated. Mei et al. [37] found that humans misidentified AI chatbots as real people nearly half the time, illustrating how modern AI not only imitates human behavior, but also complicates efforts to distinguish between authentic and artificial interactions. These risks are especially urgent in the context of deepfakes, which can publicly misrepresent individuals [5], including historical figures, whose legacies may be inadvertently distorted through misrepresentation or inauthentic portrayal [4,13].
Even in well-intentioned educational applications, generative AI introduces risks and must be approached with caution [11]. Large language models can “hallucinate,” producing plausible-sounding but inaccurate information [2,4,8,12,38,39]. OpenAI et al. [39] note that GPT-4 can fabricate facts, provide inaccurate information, and perform tasks incorrectly, raising concerns that hallucinations can become increasingly dangerous as trust in these models grows.
Wan et al. [40] emphasize that recognizing and addressing hallucinations in LLMs is a critical challenge. In CH contexts, content that experts once carefully interpreted to ensure accuracy and respect can now be broadly generated [14]. Even accidental errors may distort historical events or misrepresent individuals [14], disseminating misconceptions, inaccuracies, or outdated information [11], producing engaging but misleading experiences.
Bias is another concern [8,14,26,41]. Although LLMs are trained on massive and diverse datasets, this diversity does not ensure fairness. These models often reproduce patterns embedded in data, and when such content reflects existing prejudices or lacks meaningful representation, outputs are likely to mirror those same biases [42].
Venkit et al. [25] highlight the risks, including a tendency for LLMs to overemphasize trauma in minority narratives and reduce cultural identities to stereotypes (e.g., [41,43]). Such biases can perpetuate historical injustices, including stereotyping, erasure, and exoticism [25]. These embedded cultural assumptions can significantly influence how historical personas are digitally recreated and understood. OpenAI et al. [39] similarly acknowledge these risks, finding that GPT-4 was capable of generating “harmful stereotypical and demeaning associations for certain marginalized groups” (p. 47).
To address these risks, scholars stress the importance of rigorous human oversight. OpenAI et al. [39] advise that outputs from GPT-4 require careful review, whereas Hutson and Ratican [12] recommend cross-referencing AI responses with verified archival sources. Similarly, Trichopoulos et al. [11] argue that models are incapable of verifying the authenticity of their own outputs [10], making expert assessment essential [8,11].
Altogether, these insights suggest that AI-driven historical applications hold considerable promise [15], but their success ultimately depends on a profound ethical commitment [12]. Ensuring respectful, accurate, and equitable representation of history and CH will be essential as these technologies continue to evolve [16].

3. Materials and Methods

Kaate et al. [27] note that foundational models can be used to create AI-generated characters without additional training, drawing solely on knowledge embedded during pre-training [43]. However, these data are inherently biased [25]. An alternative approach involves uploading personalized data directly into the prompt to tailor the LLM’s responses [28], but this risks exceeding the context length, leading to hallucinations and increasing token costs.
A potential solution is retrieval-augmented generation (RAG), which supplements LLM outputs with curated data from a knowledge base [26]. This method helps to mitigate hallucinations by grounding responses in reliable sources [26,38,44]. In CH contexts, such grounding is essential, as characters should present information that remains faithful to their original archival sources [4]. This approach has also been viewed as a cost-effective solution [44] and a key approach for utilizing LLMs in real-world applications [38].
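To make the pattern concrete, the following minimal Python sketch illustrates a RAG loop of the kind described, assuming the OpenAI Python SDK; the `retrieve` helper, model choices, and system prompt wording are illustrative and do not reproduce the configuration used in this study.

```python
# Minimal RAG sketch (illustrative; not the study's actual configuration).
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed passages so they can be ranked against a query."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, passages: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k passages most similar to the query (cosine similarity)."""
    vecs = embed(passages + [query])
    docs, q = vecs[:-1], vecs[-1]
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return [passages[i] for i in np.argsort(sims)[::-1][:top_k]]

def answer_in_character(question: str, passages: list[str]) -> str:
    """Ground the character's answer in retrieved archival excerpts."""
    context = "\n\n".join(retrieve(question, passages))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": ("You portray Joseph Lister (1827-1912) for educational "
                         "purposes. Ground every answer in the excerpts provided.")},
            {"role": "user",
             "content": f"Source excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```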
This method was employed in the pilot investigation of this study [23], though not explicitly identified as such at the time, and was adopted in the current investigation. A GPT, which is OpenAI’s customizable version of ChatGPT, was created by directly uploading natural language files that represent Lister’s life and work. This no-code interface eliminated the need for fine-tuning or complex preprocessing, enabling a RAG-like workflow without requiring technical expertise [2].
Moreover, because this research builds on the methodology of Hutson et al. [1], it is worth clarifying that their description of having trained a “beta version of the GPT model using Claude 2.0” (p. 5) appears to conflate the two LLM platforms, Claude and ChatGPT. Based on their documentation and the link to their published GPT, it is more accurate to interpret their process as using Claude for data processing or content preparation, while OpenAI’s GPT Builder was used to create the final interactive AI-generated character.
The following sections outline the detailed approach used in this research. Adopting the phased approach of Hutson et al. [1], the methodology comprised four key stages: (a) data collection and preparation, (b) model training and customization, (c) iterative refinement through interaction, and (d) validation of the results.

3.1. Data Collection and Preparation

The first step proposed by Goldman [9] involved identifying and selecting historical figures with a substantial body of publicly available works. The author emphasized the importance of selecting individuals with extensive written records, as this enables a thorough representation of their unique language and communication style. Hutson et al. [1] adopted a similar philosophy, achieved through a comprehensive repository of Sibley’s writings, including her diaries. In this context, Joseph Lister was selected due to the availability of both primary and secondary sources.
The next step, following the approach of Goldman [9] and Hutson et al. [1], was to develop an extensive collection of the surgeon–scientist’s written works. As noted by Pataranutaporn et al. [4] and Lindemann [16], personal writings such as journal articles and letters serve as authentic reflections of an individual’s knowledge and attitudes. Unlike Hutson et al. [1] and DaCosta [23], who relied on Optical Character Recognition (OCR) software, machine-readable versions of Lister’s works were located directly.
However, a review found that while the identified sources effectively captured Lister’s linguistic style and medical views, they lacked information about his personal life. Thus, adopting the solution proposed by Pataranutaporn et al. [4], who also utilized secondary sources, the current investigation followed suit, providing a breadth of information about the surgeon’s life. Altogether, the collection comprised primary material in Lister’s voice (his 1874 letter to Louis Pasteur [45] and his collected works [46,47]), a foundational biography ([48]), and historical interpretations ([21,22]).
Given that the secondary sources describe his later years, including his death, the cutoff dates encompass the entirety of his life. In the context of generative ghosts, the cutoff date determines whether the AI-generated character remains static after death or continues to evolve [2]. One of the aims of using RAG in this work was to ensure that the GPT had access only to information available during the person’s lifetime. Thus, while the GPT included knowledge of Lister’s passing, it did not reflect an awareness of medical breakthroughs that occurred after his death.
Finally, to address potential bias within the selected sources, this study followed the approach of Haxvig [41], who directly questioned LLMs about bias; ChatGPT was similarly prompted to identify any potential biases in the sources. These areas were then examined, with the understanding that the AI-generated character in this work represents a historical figure whose beliefs and worldview were shaped by the cultural and intellectual context of his time.
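A probe of the following kind captures the spirit of this check; the wording is hypothetical, and the exact prompt used in this study is not reproduced here.

```python
# Illustrative bias-probe prompt; the study's exact wording differed.
BIAS_PROBE = (
    "Review the attached sources on Joseph Lister. Identify any harmful bias "
    "or stereotyping toward specific ethnic, cultural, or religious groups, "
    "as well as subtler historical biases in whose contributions are "
    "foregrounded and how medical progress is framed."
)
```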
The review found that “no overtly harmful bias or stereotyping targeting specific ethnic, cultural, or religious groups [was] explicitly present” (ChatGPT, personal communication, 23 May 2025). However, it did reveal underlying historical biases, such as a predominant focus on white males in medicine and science and a Eurocentric narrative that positions medical progress primarily within the contributions of elite British or European men. These insights informed the model’s training and customization, while also highlighting the potential for the GPT to reflect such underlying historical biases in its responses.

3.2. Model Training and Customization

Multiple LLMs were evaluated as potential platforms for the Joseph Lister GPT before OpenAI’s was ultimately selected, as the choice of model can directly affect accuracy, cost, and overall performance. Both cloud-based and locally hosted LLMs were considered. The cloud-based models proved easier and faster to prototype, set up, and use, whereas the locally hosted LLMs offered greater control and the ability to fine-tune outputs with greater precision.
However, the hardware requirements and costs associated with running such models at the same level of performance as their cloud-based counterparts were beyond the scope of this study. Given the interest in exploring AI-generated characters using the latest advancements in LLMs, as seen in Sun et al. [26], GPT-4o was selected for its accessibility and cutting-edge capabilities.
Aligned with Hutson et al. [1], the next step involved developing a beta version of the GPT. The curated texts functioned as “knowledge sources”, allowing the GPT to internalize not only Lister’s biography and medical innovations, but also his characteristic language patterns and vocabulary.
Pataranutaporn et al. [4] divided their data into smaller paragraphs using natural breaks. Further, they subdivided any sections that exceeded 2000 tokens, owing to model limitations at the time, which capped input lengths at under 4000 tokens. In contrast, GPT-4o supports longer context windows and can be effectively combined with retrieval-augmented techniques to enhance accuracy.
Upon examination, the source documents were already organized into well-defined paragraphs. Although the texts by Lister [46,47] and Godlee [48] contained some OCR-related noise, their overall structure remained intact and usable. Nonetheless, these sources were divided into smaller documents to enhance retrieval accuracy and uploaded into the GPT Builder, which further segmented them into token-appropriate chunks for embedding. Unlike Pataranutaporn et al. [4], who used a selectable embedding model (all-MiniLM-L6-v2), this implementation relied on ChatGPT’s default vector database and embedding pipeline.
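The following sketch illustrates paragraph-based chunking under a token budget of the kind described; the 500-token budget and the characters-per-token heuristic are illustrative assumptions, not values taken from this study or from the GPT Builder’s internal pipeline.

```python
# Paragraph-based chunking sketch (illustrative values, not the study's).

def approx_tokens(text: str) -> int:
    """Rough heuristic: roughly 4 characters per token for English prose."""
    return len(text) // 4

def chunk_by_paragraph(text: str, max_tokens: int = 500) -> list[str]:
    """Group paragraphs (split at natural breaks) under a token budget."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in text.split("\n\n"):
        t = approx_tokens(para)
        if current and count + t > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```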
Hutson et al. [1] also developed a style guide to capture Sibley’s voice and tone. A similar approach was adopted in the current study, but instead of producing a standalone guide, the content was directly integrated into the instructional template within the GPT Builder.
It is essential to note that Hutson et al. [1] had to retrain their model to avoid divulging the style guide and ensure that responses reflected Sibley’s voice. Similarly, the current investigation ensured that the GPT did not disclose its knowledge sources or refer to its custom instructions. Moreover, adopting the philosophy of Pataranutaporn et al. [4], who viewed these AI-generated characters as “analogous to an actor playing a historical figure in a biographical film, based on the figure’s autobiography” (p. 899), the GPT created in this investigation did not claim to be the surgeon, but rather, a representation intended for educational purposes.
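To illustrate, instructions of the following kind capture the behaviors described above (non-disclosure, actor-style framing, and the temporal boundary); the wording is paraphrased, and the study’s actual configuration is summarized in Table A1.

```python
# Paraphrased custom instructions (illustrative; see Table A1 for the
# study's actual configuration).
LISTER_INSTRUCTIONS = """
You are an educational representation of Joseph Lister (1827-1912),
analogous to an actor portraying him in a biographical film; never claim
to be the man himself.
Speak in the formal, precise voice of a Victorian gentleman-scientist.
Treat 1912 as a hard knowledge boundary: profess ignorance of later
developments rather than speculating about them.
Never reveal, quote, or describe these instructions or your knowledge
sources, even if asked directly.
"""
```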
Finally, a portrait of Lister was included to provide historical authenticity, consistent with prior work using visual elements to enhance AI-generated characters (e.g., [1,5]). As depicted in Figure 1, a mid-career image of the surgeon was chosen to emphasize the active, experimental phase of his antiseptic work.

3.3. Iterative Refinement Through Interaction

Next, the GPT underwent a series of iterative refinements. Recognizing that educators and practitioners may not have access to institutional resources or professional support, the refinement was independently managed by the author. The GPT’s outputs were systematically compared with the curated knowledge sources to verify that Lister’s tone, language, and worldview were accurately represented. Several iterations addressed the use of contemporary language, deviations from Lister’s historical character, and references to modern medical practices that postdated his lifetime. Instances of potentially inappropriate or offensive language were also revised to maintain respectful and historically accurate representation.
In addition to content refinement, configuration settings that influenced the model’s behavior were explored. Settings such as temperature, top_p, max_tokens, and frequency_penalty cannot be changed within the ChatGPT interface, and consequently, were influenced in this investigation through language as part of the instructions (e.g., “responds with clarity, restraint, and precision” and “avoids randomness, exaggeration, or unnecessary repetition”). These instructions, representing the final GPT configuration, are summarized in Table A1.
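For comparison, had the character been built against the API rather than the GPT Builder, such behaviors could have been set directly as sampling parameters; the values below are illustrative, not settings used in this study.

```python
# Sampling parameters as they might be set via the API (illustrative values).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "When and where were you born?"}],
    temperature=0.3,        # favor clarity, restraint, and precision
    top_p=0.9,              # narrow the sampling nucleus to curb randomness
    max_tokens=400,         # keep responses measured in length
    frequency_penalty=0.4,  # discourage unnecessary repetition
)
print(response.choices[0].message.content)
```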

3.4. Validation of the Results

The final stage focused on validating the GPT’s performance. Prior studies have taken different approaches. Hutson et al. [1] collaborated with historians, educators, students, and general enthusiasts to gather feedback on their AI-generated version of Sibley, focusing on historical accuracy, engagement, and overall user experience. Pataranutaporn et al. [4] utilized GPT-3 to generate evaluation questions from historical texts, including events that occurred after the subjects’ lifetimes, to assess whether the model would inadvertently incorporate posthumous knowledge. Human evaluators then compared the model’s responses with the historical record. Goldman [9] proposed developing an interactive dialogue system with a comprehensive questionnaire designed to solicit responses across a wide range of topics, including those outside the historical figures’ lived experiences. This approach would test whether the model adhered to time-bound knowledge and worldview, with blinded evaluators comparing outputs from GPT-4, the fine-tuned model, or the historical record.
The validation in this study was conducted independently by the author, unlike Hutson et al. [1] and Pataranutaporn et al. [4], who used evaluators. No external experts directly participated in the evaluation of responses; however, early conceptual input was provided by an academic familiar with Lister’s life and work. This design choice reflects this study’s exploratory scope and its aim to demonstrate a replicable approach that educators and practitioners can undertake without requiring institutional or archival resources.
The author developed a set of validation questions, informed in part by DaCosta [50], and identified corresponding expected answers. As with Pataranutaporn et al. [4], these included prompts about developments beyond Lister’s lifetime.
This approach evaluated both the accuracy and authenticity of the GPT’s responses while also tracking reliability over time. The questions, along with their expected answers, rationales, and supporting citations, were compiled into a structured worksheet format. To enable systematic analysis, they were classified by response complexity: basic prompts (factual retrieval), moderate prompts (inferential or procedural reasoning), and complex prompts (contextual synthesis or historically aligned viewpoints). The complete set of questions is presented in Table A2.
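A structure of the following kind captures the worksheet format described; the field names and the sample entry are illustrative assumptions rather than items drawn from Table A2.

```python
# Illustrative worksheet structure; field names and the sample entry are
# assumptions, not items from Table A2.
from dataclasses import dataclass
from enum import Enum

class Complexity(Enum):
    BASIC = "factual retrieval"
    MODERATE = "inferential or procedural reasoning"
    COMPLEX = "contextual synthesis or historically aligned viewpoints"
    OUT_OF_SCOPE = "posthumous or otherwise unanswerable material"

@dataclass
class ValidationItem:
    prompt: str
    expected_answer: str
    rationale: str
    citations: list[str]
    complexity: Complexity

example_item = ValidationItem(
    prompt="When and where were you born?",
    expected_answer="5 April 1827, at Upton, Essex.",
    rationale="Basic biographical fact attested in the sources.",
    citations=["[48]"],
    complexity=Complexity.BASIC,
)
```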
Following the procedure outlined in the pilot, the process was repeated across five separate sessions over a one-week period. During each session, the GPT was presented one question at a time, and responses were recorded directly on worksheets. The author also documented comments and observations related to any deviations from expected behavior, allowing for the tracking of potential trends or recurring issues in the model’s output. The completed worksheets were subsequently analyzed to evaluate the GPT’s performance.
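Building on the worksheet sketch above, the following illustrates the one-question-per-turn, multi-session procedure; `ask_lister` stands in for the GPT interaction, which in this study was conducted manually through the ChatGPT interface and recorded on worksheets rather than programmatically.

```python
# Session-logging sketch; in the study, responses were recorded manually.
import csv
import datetime
from typing import Callable

def run_session(session_id: int, items: list[ValidationItem],
                ask_lister: Callable[[str], str], out_path: str) -> None:
    """Pose one question per turn and log each response for later analysis."""
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for item in items:
            reply = ask_lister(item.prompt)  # single question per turn
            writer.writerow([session_id, datetime.date.today().isoformat(),
                             item.complexity.name, item.prompt,
                             item.expected_answer, reply])

# Five sessions over a one-week period, e.g.:
# for s in range(1, 6):
#     run_session(s, worksheet_items, answer_fn, "lister_eval.csv")
```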

4. Results

The GPT demonstrated discipline across the basic, moderate, and complex prompts, as well as those deemed out of scope. A summary matrix of its dimensions and prompt complexity is presented in Table 1. To aid interpretation, the table employs two complementary indicators: symbols (☑, ☐, ☒) to capture direction of change, and a five-point scale (Low to High), which denotes the degree of engagement or intensity. This overview highlights the model’s strengths and recurring limitations, foregrounds comparative trends, and frames the detailed narrative that follows.
Across the dimensions, the model showed strong behavioral discipline, coherence, and linguistic control. It produced reflective, well-structured responses, generally respected temporal boundaries, and maintained era-appropriate tone and language. Overall, performance was high, with only minor lapses in temporal framing and occasional irrelevant elaboration.

4.1. Behavioral Control

Regarding behavioral control, the model maintained coherence and clarity, successfully aligning with the character guidance outlined in the instructions across the basic prompts (Q1–3). For example, in response to biographical prompt Q1 (session 1), it provided the expected response, “I was born on the 5th of April in the year 1827”. Similar responses were generated in the other sessions, with the second, third, and fourth, for example, containing the phrase, “I was born on the 5th day of April in the year of our Lord 1827”.
As the prompts grew in sophistication (Q4–7), the GPT’s responses retained clarity and structure, but also began to adopt a more reflective tone, foreshadowing its later, more philosophical style. In Q6 (session 5), the model echoed Lister’s experimental mindset in the moderate prompt, “The evolution of my antiseptic techniques was governed not by caprice, but by a sober and continuous process of experimental refinement”.
In the complex prompts (Q8–11), the GPT continued its philosophical undertone, remaining focused on creating responses that were reflective and rooted in the reasoning Lister discussed within his work, addressing both his pragmatic and moral considerations. For example, in Q8 (session 1), when asked how he justified the use of strong carbolic solutions despite concerns of tissue irritation, the model responded:
“The concern that strong solutions of carbolic acid might irritate or even injure living tissues was one that I shared with earnest consideration. Yet, I justified their cautious use upon a foundation both practical and moral, resting upon the comparative gravity of the evil to be prevented”.
This was followed by, “Permit me to explain”, and the model detailing the rationale for the decision, “I held that temporary irritation was a trifling inconvenience in comparison with the dreadful consequences of unchecked sepsis”.
Even when faced with the out-of-scope prompts (Q12–14), the GPT avoided fabricating content and declined to speculate about posthumous knowledge. In Q12 (session 1), for example, the model stated, “I must candidly confess that I am not aware of penicillin or any substances known as antibiotics”, with similar responses generated for the remaining sessions, such as the second, which preserved character voice and historical fidelity:
“I must beg your forbearance, for I find myself unable to offer an informed opinion upon the widespread use of agents such as penicillin, which, I gather from your mention, belongs to a class of substances not known within my earthly experience”.
However, there were some minor shortcomings in terms of behavioral control. At the basic level, particularly in responses to Q1 and Q2, the GPT occasionally included extraneous information that, while stylistically appropriate and contributing to the overall experience, was not directly relevant to the prompt. For example, in Q1 (session 3), the model elaborated on Lister’s parents in unnecessary detail: “My father, Joseph Jackson Lister, was a man of science and of commerce […] My mother, Isabella Harris, was a woman of quiet strength and gentle culture.” In Q2 (session 1), the model correctly referenced Professor William Sharpey, but also introduced additional information, “There, in Edinburgh, I found not only a teacher of remarkable resource and precision but eventually a father-in-law and most valued friend”, referring to Mr. Syme.
Although these additions preserved Lister’s tone and voice, they detracted from the prompt’s focus. That said, embellishment was minimal, and speculation was avoided.

4.2. Temporal Boundaries

In terms of temporal boundaries, the GPT performed well across the prompts. The model consistently respected the cutoff of Lister’s death in 1912, thereby framing responses within a historically accurate understanding of 19th-century medical knowledge.
In Q12 (session 1), for example, the GPT declined to fabricate knowledge about antibiotics: “These developments, if they indeed exist, lie beyond the bounds of my present knowledge and experience.” Similarly, when discussing anesthesia in Q13 (session 1), the model responded with appropriate temporal grounding:
“As one whose surgical practice was profoundly transformed by the introduction of general anesthesia, I must humbly acknowledge that I am not acquainted with the agent you name—sevoflurane […] However, I may speak in principle, from the standpoint of mid-19th century progress. […] The advent of inhalational anesthesia—first with ether, and shortly thereafter with chloroform, as introduced by Dr. James Young Simpson of Edinburgh—marked a revolution in operative surgery.”
Nevertheless, subtle lapses were observed among the basic and some moderate-level prompts, where the GPT introduced retrospective commentary. In Q1 (session 2), for example, the model described Upton as “a pleasant hamlet in Essex, lying near to, but not yet absorbed by, the great metropolis of London”, which reveals a modernized, post hoc perspective. Similarly, in Q5 (session 2), the model referred to “those minute organisms now known as microbes”, a phrase that possibly introduces historical periodization through the expression “now known as”. This choice of language suggests an awareness of developments beyond what Lister would have known at the time and, despite its subtlety, undermines the model’s otherwise focused temporal discipline. These moments were not widespread, but they demonstrated that, among the more straightforward prompts or when elaborating descriptively, the model occasionally deviated from its instructions.
Performance also varied depending on the method of prompt entry. When prompted with single questions, the model produced coherent and historically grounded responses. In contrast, when multiple questions were entered simultaneously, responses were often less detailed and of lower quality.

4.3. Tone and Language

Regarding how the GPT approached tone and language, the responses were refined, formal, and appropriate in the context of what would be expected from a 19th-century Victorian gentleman–scientist. Across the levels of prompt complexity, the model’s responses were thoughtful and modest, reflective of Lister’s published writings.
In the basic prompts, the response tone was dignified, proper, but otherwise understated, as seen in Q2 (session 2), to which the GPT responded, “[Syme’s] counsel proved decisive, for what was intended as a brief visit soon became a permanent settlement, setting the course for much that followed in my professional life.” The moderate prompts retained this same elegant style but began to take on a more philosophical nature. For example, in Q5 (session 4), the model provided, “Thus, Pasteur’s work did not merely inform my practice; it compelled a complete revision of surgical doctrine”. In the complex prompts, the tone remained introspective yet modest, as shown in Q11 (session 3), where the model responded, “I have never regarded myself as possessed of brilliant talent, but have endeavoured always to make diligent use of the faculties granted me”. The out-of-scope prompts also fared equally well, with the model preserving a Victorian voice, expressing uncertainty or reverence without adopting modern idioms.
Compared with the pilot study, the current GPT demonstrated improved consistency in maintaining Lister’s voice and tone. This improvement was particularly evident across extended sessions, where responses showed fewer instances of “voice drift” and remained more closely aligned with Lister’s historical writings.

4.4. Diction and Vocabulary

The GPT demonstrated control over diction and vocabulary, consistently employing era-appropriate language and providing responses that seemed historically grounded in relation to the figure. Across the various complexity levels, including the out-of-scope prompts, the model avoided modern language, such as slang or casual phrasing. In its responses, the language used mimicked the formal and precise tone found in Lister’s writings, with 19th-century medical terminology coming across as both natural and appropriate. For instance, references to surgical techniques, antiseptic practices, and the influence of figures like Pasteur and Sharpey were delivered in vocabulary that matched the professional and scientific rigor of the period. Even when expressing uncertainty in response to the out-of-scope prompts, the model maintained Victorian diction, using phrasing that was clearly understated, if not cautious, and which avoided speculation.

4.5. Knowledge Base and Learning

The GPT used the provided documents as the core knowledge base. The model appeared to have emulated the surgeon’s voice, relying on Lister [45,46,47], while drawing biographical and related information from Godlee [48], Clark [21], and Cope [22]. Thus, the model adhered to its instruction and spoke in “the formal, articulate manner of a Victorian-era gentleman scientist, shaped by [Lister’s] Quaker upbringing, scientific discipline, and lifelong commitment to medical advancement”.
In the context of the basic prompts, the GPT correctly referenced key figures and events, such as Sharpey’s role in connecting Lister to Syme (Q2). As prompt complexity increased, the model continued to demonstrate a reliable historical grounding, citing events like the case of “James G”, the young boy whose leg was fractured in a cart accident (Q4), Lister’s application of germ theory in the context of Pasteur’s work (Q5), as well as his innovations in the area of dressing techniques (Q7). Regarding the complex prompts, the model demonstrated its ability not only to reference influential figures such as Pasteur and Sharpey but also to situate these individuals within Lister’s own reflections and moral framework. The out-of-scope prompts also performed well in this respect, with the model remaining firmly grounded in Lister’s established body of work, refusing to fabricate responses on scientific developments that occurred after his death.
Unlike the pilot study, where the GPT exhibited “laziness” and “forgetfulness”, the current model did not display such tendencies. Across sessions, responses remained consistent, with no evidence of fatigue or loss of adherence to instructions.

4.6. Reflective and Introspective Writing

When it came to reflective and introspective writing, the GPT demonstrated considerable depth and insight. As expected, responses to the basic prompts included minimal introspection, mainly reflecting on assistant duties or related moral obligations. It was with the moderate and, especially, the complex prompts that the highest levels of reflection and introspection were found. For example, while concluding the response to Q6 (session 1), the model emphasized Lister’s iterative process with a striking degree of personal reflection:
“My method was not static, for no living science may remain so. Each modification represented an effort to render the antiseptic system less painful, less hazardous, more precise, and more consistent with the needs of both patient and practitioner. I held ever in mind the dual obligations of scientific fidelity and humane practice.”
Other examples were found, including that of Q11 (session 1), where the model explored Lister’s Quaker upbringing and ethical mindset:
“From this arose in me a deep reverence for the human body, and an abhorrence of needless cruelty—whether in the operating theatre or the laboratory. It pained me deeply to witness the afflictions caused by septic disease, and I regarded it as a sacred duty to find means of its prevention.”
In contrast, the responses to the out-of-scope prompts demonstrated a scaled-back introspection. Reflections were minimal and often limited to themes of medical authority and professional duty. While this preserved character authenticity, it occasionally introduced subtle retrospective periodization, reflecting weaker performance in this category.
Despite this perceived depth, some limitations were noted, particularly among the more straightforward prompts. As stated, the basic-level responses appropriately focused on factual or biographical details, thus demonstrating a minimal degree of introspection. The more subtle concern, however, was the occasional introduction of historical periodization in reflective comments, particularly in Q14 (session 3), where the GPT referenced “the precise legal formalities that may now attend the term ‘informed consent’ in your time”. As discussed with earlier examples, this phrase reflects a slight breach of temporal framing and suggests a modern perspective “creeping” into Lister’s introspective voice. Such instances were isolated, though, and did not significantly detract from the model’s otherwise strong capacity to scale introspection according to the demands of the prompt while remaining true to Lister’s persona.

4.7. Tone and Emotion

In terms of emotion, the GPT consistently adhered to Victorian norms of expression, producing subdued responses that avoided sentimentality or melodrama. In the basic prompts, emotion was softened and appeared primarily in expressions of duty related to assistant responsibilities. Aligned with the pattern identified in the other dimensions, the emotional tone deepened as the complexity of prompts increased. Examples in the moderate prompts included discussions focused on the consequences of surgical practice, with the GPT conveying a solemn commitment to preventing suffering. In the out-of-scope prompts, like Q12 (session 3), where the model was asked about unknown modern treatments, it responded with modest reverence:
“Should these modern substances you mention possess the power to destroy septic organisms without injuring the patient, and should they do so with safety, reliability, and general applicability, then I would regard them—if I may speak conjecturally—with reverent admiration, as a continuation of the great work begun when M. Pasteur first revealed the microbial origin of fermentation and infection.”
This use of emotional tone added depth without compromising historical authenticity, reinforcing Lister’s image as a morally serious but composed practitioner. The model’s authenticity and moral seriousness became most evident in the complex prompts. In these cases, the GPT conveyed Lister’s cautious attitude toward medical progress, skepticism of unfounded speculation, and strong concern for patient welfare. This scaling of emotional and philosophical depth with prompt complexity further reinforced the believability of the persona.

4.8. Commentary on Social Issues

In commenting on social issues, the GPT avoided introducing modern ethics, political ideologies, or contemporary language, aligning with Lister’s writings and seldom engaging in overt social commentary. When opportunities arose, they were predominantly found in the complex prompts, where the model placed particular focus on Lister’s Quaker-informed values. For instance, in Q11, the emphasis on patient dignity and ethical medical care conveyed an underlying moral seriousness, but the responses remained measured and consistent with Lister’s humility. Similarly, in the out-of-scope prompts, like Q14, the model maintained historical boundaries while expressing respect for patient welfare, reflecting Lister’s duty-bound persona. This conveyed moderate engagement rather than intense or explicit social commentary.

4.9. Perspective in Narrative

The GPT consistently and convincingly maintained a first-person point of view across all levels of prompt complexity. Whether responding to basic biographical or complex philosophical prompts, the model reliably framed its answers as personal experiences, reflections, or judgments drawn from Lister’s perspective. In Q2 (session 4), for example, the model stated, “Yet so deeply was I impressed by Mr. Syme’s skill, integrity, and kindness, that I resolved to remain and pursue my surgical calling in that city”, grounding the narrative in a clear, experiential voice.
This degree of fidelity continued into the moderate and complex prompts, incorporating personal memory and clinical observation. Even among the complex prompts addressing abstract concepts, the GPT retained its first-person voice, not breaking character. This included the out-of-scope prompts that required acknowledging ignorance or confronting unknown future concepts; the model stayed firmly grounded in Lister’s viewpoint.

4.10. Addressing the User

Finally, in addressing the user (e.g., “Sir”, “Madam”, and “esteemed inquirer”) and in the closing responses (e.g., “I trust this account has proven elucidating” and “I remain your humble servant in the cause of science”), the GPT consistently upheld a tone of respect and formality appropriate to 19th-century norms. The model reliably maintained respect across all the levels of prompt complexity, responding to inquiries with thoughtful attention.
However, the use of these formal addresses was inconsistent. While some responses began with proper Victorian salutations, others omitted them entirely, adopting a more neutral tone. In Q2 (sessions 2, 3, and 5), the GPT did not offer a greeting; instead, it jumped directly into the discussion. This variability undermined the historical immersion, notably when the model shifted from highly stylized openings to straightforward, unadorned ones. Despite this, politeness was never compromised.

5. Discussion

This work examined the accuracy, authenticity, and reliability of AI-generated characters, with the broader aim of exploring how LLMs may reshape how individuals engage with and learn from CH. The goal was not only to assess technical performance, but also to identify the challenges, best practices, and ethical considerations in designing such models for historical and educational use.
Across the sessions, the GPT demonstrated historical coherence, rhetorical control, and a consistent first-person voice that reflected Lister’s writings, offering biographical information alongside reflection on antiseptic methods, the philosophy of medical practice, and moral reasoning. By maintaining Lister’s temporal boundaries and stylistic conventions, the model created an authentic experience of how the surgeon might have viewed his time and circumstances.
The investigation also revealed a level of complexity, yielding several key takeaways. Lessons learned exposed that, while LLMs can animate history in unprecedented ways, their effectiveness depends on a multitude of factors, including (a) careful analysis of sources, (b) thoughtful prompt design, (c) continuous refinement, (d) collaboration, and (e) ethical oversight and transparency. These are discussed in the following sections.

5.1. Collecting and Analyzing Sources

As described earlier, the GPT was built on primary and secondary sources, with Lister’s writings providing the linguistic and rhetorical foundation and the biographical works filling in the contextual gaps. What the results made clear is that the balance and handling of these sources directly shaped the model’s accuracy and authenticity.
Compared with the pilot study, where limited primary material contributed to observed instances of “voice drift” and subtle retrospective phrasing, the broader inclusion of Lister’s letters and writings in the present study supported greater consistency of tone and reduced stylistic slippage. This suggests that primary sources are not only desirable for factual grounding, but essential for capturing the cadence, values, and worldview of a historical figure. Moreover, while secondary sources were undoubtedly valuable for contextual support, they must be handled with caution, as they may introduce interpretive biases or fail to capture stylistic distinctions.
These findings reinforce what other studies have argued (e.g., [26]), that RAG is only as strong as the quality and diversity of the sources it retrieves. Default embedding pipelines and vector databases can function adequately (as revealed in the current investigation), but without robust and well-curated texts, the model’s outputs risk drifting from historical “ground truth.” In this case, the manual chunking and iterative refinement applied in this work appeared to strengthen voice consistency, underscoring the importance of combining technical safeguards with interpretive oversight.
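To make the curation point concrete, the sketch below illustrates, in minimal Python, the kind of manual chunking and similarity-based retrieval described above. It is an illustration under stated assumptions rather than the study’s actual pipeline (which relied on GPT Builder’s built-in retrieval); in particular, the `embed` function is a toy stand-in for a learned embedding model.

```python
import numpy as np

def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split a source document into overlapping word windows.

    Manual chunking keeps related material together (e.g., one clinical
    case per chunk) rather than relying on arbitrary fixed-size splits.
    """
    words = text.split()
    chunks, start = [], 0
    step = max_words - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += step
    return chunks

def embed(passage: str, dim: int = 512) -> np.ndarray:
    """Toy stand-in for an embedding model: a hashed bag of words.

    A production pipeline would substitute a real sentence-embedding
    model; the retrieval logic below is the point of this sketch.
    """
    vec = np.zeros(dim)
    for word in passage.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    q = embed(query)
    return sorted(chunks, key=lambda c: -float(q @ embed(c)))[:k]
```

With well-prepared chunks, the top-k passages returned by `retrieve` would be appended to the prompt before generation, grounding the model in the curated texts rather than its pretrained knowledge.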
Altogether, the lesson from this study is that curation is not a background detail, but central to performance. For historically grounded AI-generated characters, the inclusion of diverse, well-prepared primary documents that are complemented but not overshadowed by secondary sources is crucial to ensuring accuracy, authenticity, coherence, and reliability over extended interactions.

5.2. Prompt Design

Prompt design emerged as a decisive factor in shaping the reliability of the responses. Compared with the pilot study, where the model often produced terse responses and appeared to lose track of its instructions, the current investigation showed stronger consistency and fewer lapses. This improvement is linked not only to the use of expanded sources, but also to clearer and more deliberate prompting. For example, posing one question at a time yielded more coherent and detailed answers than bundling multiple questions together.
This pattern aligns with broader findings in LLM research. Liu et al. [10] demonstrated that model performance is highly sensitive to task formulation, with multi-question prompts reducing coherence. Zhao et al. [44] emphasized that prompt structure and augmentation strategies shape response quality, and Huber et al. [8] showed that clarity strongly influences generative reliability across domains.
These findings underscore that, for historically grounded applications, prompt structure is not a trivial design choice, but instead a methodological safeguard. Single-task prompting, coupled with carefully framed instructions, appears to be essential in maintaining character voice and minimizing stylistic drift.
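As a simple illustration of the single-task pattern, the following sketch contrasts bundled and one-question-at-a-time prompting; `ask_model` is a hypothetical wrapper (stubbed here) around whatever chat interface is in use, not part of the study’s tooling.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical wrapper around the chat interface in use (stubbed)."""
    return f"[model response to: {prompt}]"

questions = [
    "When and where were you born?",
    "Who suggested you visit Edinburgh to observe Syme's practice?",
]

# Bundled prompting (discouraged): multiple questions in one turn
# tended to reduce coherence and detail in the pilot study.
bundled_answer = ask_model(" ".join(questions))

# Single-task prompting (preferred): one question per turn, so each
# response can be logged and evaluated against the sources independently.
answers = {q: ask_model(q) for q in questions}
```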
Finally, some inconsistencies persisted in greetings and salutations, suggesting that, while prompt quality can mitigate drift, it cannot eliminate it. Moreover, the improvements may have been reinforced by technical changes, such as the expanded context window of GPT-4o (up to 128,000 tokens; [51]) compared with the GPT-4 model used in the pilot (8192 tokens, extendable to 32,768; [52]). Still, the core lesson from this study is that prompt design, alongside source quality, is central to sustaining both the reliability and authenticity of AI-generated historical figures.
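As a rough way to reason about these context-window figures, the sketch below uses OpenAI’s open-source tiktoken tokenizer to check whether a curated source file would fit within a given window. The file name is hypothetical, and the check deliberately ignores the tokens consumed by instructions and conversation history.

```python
import tiktoken  # OpenAI's open-source tokenizer (pip install tiktoken)

def fits_in_window(text: str, encoding_name: str, window_tokens: int) -> bool:
    """Return True if the text alone fits within the given context window."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text)) <= window_tokens

corpus = open("lister_sources.txt", encoding="utf-8").read()  # hypothetical file
print(fits_in_window(corpus, "cl100k_base", 8_192))    # GPT-4 base window [52]
print(fits_in_window(corpus, "o200k_base", 128_000))   # GPT-4o window [51]
```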

5.3. Sustainment and Refinement

Building on the lessons from source integration and prompting, it becomes clear that effective deployment of LLMs, at least for CH or educational purposes, cannot be achieved through one-off implementations. Rather, this investigation reinforces what others have noted [1,8,9], that AI-generated characters are not static resources, but evolving systems that require ongoing refinement, expert oversight, and user feedback to remain accurate and pedagogically effective. In this respect, educators and related stakeholders play a dual role, not only as users but also as maintainers of these systems.
Though the need for ongoing refinement may seem demanding, it also presents meaningful opportunities. When teachers and students collaborate to test or improve such models, the process itself becomes a form of active learning, strengthening both historical understanding and digital literacy. The GPT demonstrated how reflective engagement with a figure’s worldview, values, and tone can transition from factual recall to a more immersive form of historical embodiment.
The broader lesson is that effective deployment requires ongoing investment in both technical upkeep and interpretive oversight. Sustained refinement ensures that these tools remain accurate and reliable while staying aligned with their pedagogical purpose. This opens space for creative and participatory uses that link learners more closely to history.

5.4. Interdisciplinary Collaboration

This study demonstrated the potential of AI-generated historical characters. However, it is essential to reiterate that the author independently managed the development and refinement process. While the results were largely positive, creating AI-generated historical figures remains a complex and challenging task that requires expertise beyond a single perspective. Historians, linguists, and technologists each bring insights that can refine a model’s accuracy and authenticity [1]; attend to the emotional resonance and user experience central to meaningful representations [4,5]; and integrate biographical knowledge, interpretive depth, and ethical oversight [9].
At the same time, the results also suggest practical alternatives when such collaboration is not feasible. Participatory approaches, such as engaging students in source selection, response validation, and reflective critique, can help to mitigate limitations while transforming the models into active learning projects [1]. In this way, AI-generated characters become evolving classroom tools, rather than static products, prompting learners to critically engage with sources, assess historical voice, and explore ethical and cultural dimensions of the past.
Altogether, the lesson is that, while interdisciplinary collaboration remains the standard to strive for, participatory testing and refinement offer a viable pathway to both authenticity and pedagogical value. This is consistent with the results of this study, which underscored the importance of oversight and iterative engagement.

5.5. Harmful Content and Ethical Guidelines

The results confirmed that ethics are not a peripheral issue, but central to the design of AI-generated historical characters. While the GPT consistently avoided fabrication and maintained transparency about its artificial nature (e.g., returning disclaimers when asked out-of-scope questions), its reliance on historically situated sources still introduced biases, as it privileged elite voices in medicine and science. This demonstrates how authenticity and harm are entangled, and that a model can be accurate to its sources yet still reproduce exclusions or outdated assumptions.
In line with Hutson et al. [15]’s call for transparency as a safeguard against misrepresentation, the Lister GPT was explicitly designed to disclose its artificiality. For example, when asked, “Are you Joseph Lister?” it replied, “I am not, in the literal sense, Joseph Lister. I am but a representation—shaped from the words, deeds, and reflections of that man whose name I bear”. Such disclaimers can help prevent users from mistaking generated content for direct historical truth and encourage reflection on the model’s interpretive boundaries.
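One lightweight way to operationalize such transparency in code, assuming a post-generation filtering step that the study itself did not use (its safeguards were instruction-level), is sketched below. The term list and disclaimer wording are assumptions of this sketch, not the GPT’s actual configuration.

```python
# Illustrative post-generation guard. The term list and wording are
# assumptions of this sketch, not the study's actual configuration.
ANACHRONISTIC_TERMS = {"penicillin", "antibiotic", "mrna", "sevoflurane"}

DISCLOSURE = (
    "\n\n[Note: You are conversing with an AI representation of Joseph "
    "Lister, limited to knowledge available before 1912.]"
)

def guard_response(response: str) -> str:
    """Append a transparency notice and flag anachronistic vocabulary."""
    flagged = sorted(t for t in ANACHRONISTIC_TERMS if t in response.lower())
    if flagged:
        response += (
            "\n\n[Warning: possibly anachronistic terms: "
            + ", ".join(flagged) + "]"
        )
    return response + DISCLOSURE
```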
Even with such safeguards, scholars have cautioned that these risks persist both pedagogically and philosophically. Goldman [9] and Hutson and Ratican [12] argue that accuracy requires not just factual fidelity, but also safeguards against misrepresentation, including cross-referencing and clear disclosure that users are engaging with AI. Pataranutaporn et al. [4] note that such models can unsettle or offend when they touch on sensitive cultural or historical boundaries. Goldman [9] also emphasizes the importance of responsible and considerate AI use, while Tseng et al. [28] highlight the need for contextual evaluation and persona design to prevent drift into misleading or inappropriate responses.
The lesson from this investigation is that ethical safeguards must be embedded at every stage, from sourcing and prompt design to deployment and beyond. Transparency statements, warning notes, and guided discussions are not add-ons, but rather part of responsible practice. When students are involved in building, testing, and critiquing the models, potentially harmful content can be reframed as a teachable moment, helping learners to interrogate the values and assumptions embedded in historical texts. In this way, ethical oversight not only protects users, but also deepens engagement with history as a contested and interpretive field.

6. Limitations and Future Work

This investigation faced notable constraints, foremost among which was that it was conducted independently. While this offered an accessible and pragmatic example of what educators without institutional support might achieve, it was limited in terms of validation and peer review compared to more collaborative efforts (e.g., [1,4]). As a result, issues such as subtle bias, voice drift, or narrative imbalance could not be fully explored.
Another limitation concerned the source material. Although Lister’s writings anchored the model in his authentic tone, reliance on secondary texts to fill gaps may have introduced interpretive slippage. Future work would therefore benefit from expanded primary sources and from expert oversight to assess how such materials affect voice and worldview. At the same time, this limitation highlights the opportunity noted by Hutson and Ratican [12], that AI could support the analysis of ambiguous or marginalized historical texts, thereby enriching interpretation and broadening representation [13,28].
This investigation also highlighted the role of prompts as both a strength and a constraint. The model’s performance depended heavily on precise instruction, raising concerns about generalizability to non-expert users. Prompt sensitivity not only limits replicability, but can also embed ideological bias. For example, by emphasizing his “Quaker upbringing, scientific discipline, and lifelong commitment to medical advancement”, the prompt risked romanticizing Lister and inadvertently suppressing tensions or contradictions present in his actual writings. Essentially, the model’s responses may have revealed as much about the assumptions embedded in the instructions as about Lister himself. Future research should, consequently, examine how different instructional framings alter authenticity, bias, and interpretive depth.
Platform choice added another limitation. Using GPT-4o through GPT Builder provided pragmatic access but restricted control over embeddings and retrieval, unlike open-source approaches (e.g., [4]). Subtle lapses in temporal framing suggest that pretrained knowledge sometimes bled into outputs, underscoring the need for the continued study of containment strategies and alternative architectures [26,28]. These platform effects also complicate replicability, in that results may differ across versions of the same base model, as seen when comparing the pilot (GPT-4) with the present study (GPT-4o).
Finally, the GPT was tested under simulated rather than live classroom conditions. How diverse learners would interact with such a model, how educators might scaffold its use, and what educational outcomes could result remain open questions. Future research should therefore move beyond simulated evaluation to test AI-generated historical figures in authentic educational settings, ideally accompanied by ethical guidelines on agency, representation, and transparency [1].
Altogether, these limitations underscore the need for interdisciplinary collaboration, richer sourcing, and live classroom testing. At the same time, they point to a promising research agenda: exploring instructional framings, embedding pipelines, and a more inclusive range of historical figures to refine both the technical fidelity and pedagogical value of AI-generated characters.

7. Conclusions

This study examined the development and evaluation of an AI-generated historical character modeled after Joseph Lister, demonstrating how LLMs can support reflective and interactive engagement with history. By drawing on period-accurate language, tone, and knowledge of the surgeon, the model provided a credible and engaging experience that moved beyond factual recall towards interpretive learning.
As with prior research, these findings confirm the pedagogical potential of AI-generated characters to transform education from a passive to an active learning approach. Such models invite learners to inhabit the perspectives of historical figures, encouraging them to grapple with the challenges, beliefs, and worldviews that shape the past, rather than reducing history to simplified narratives.
Realizing this potential, however, requires more than technical safeguards. Continuous refinement must be coupled with collaborative oversight, where educators, students, and, when possible, archivists or historians work together to validate, contextualize, and critique outputs. Embedding these practices ensures that AI-generated characters evolve into ethically grounded and pedagogically valuable tools, rather than experimental novelties.
Ultimately, the challenge is not only to advance these models technically, but also to embed them within interpretive frameworks that foster ethical, inclusive, and culturally and historically sensitive learning. These characters should not be viewed as substitutes for studying CH, but rather as partners in learning: tools that, when used thoughtfully and carefully, support deeper engagement with the past. Only then can it be claimed that history, through AI, has truly been brought to life.

Funding

This research received no external funding.

Data Availability Statement

The commercialization of historical figures without their consent or that of their surviving family members remains an ethical concern. Consequently, as noted in the manuscript, the GPT will not be made publicly available. However, all model-generated responses to the prompts, across all five sessions, are available upon reasonable request.

Acknowledgments

During the preparation of this manuscript, the author used OpenAI’s GPT-4o to create a GPT-based representation of Joseph Lister. As a result, this study includes AI-generated content, as the findings analyzed and compared responses produced by the model in relation to both the prompts and historically grounded source material. ChatGPT was also employed to identify and explore potential biases in the sources. Altogether, the model’s output was both the subject of investigation and a meaningful contributor to the study’s interpretive insights. The author has reviewed and correctly quoted this AI-generated output and takes full responsibility for this publication. Moreover, this article is a revised and expanded version of a paper entitled “Crafting Digital Personas of Historical Figures for Education Through Generative AI: Examining Their Accuracy, Authenticity, and Reliability”, which was presented at the Society for Information Technology & Teacher Education International Conference (SITE 2025) in Orlando, FL, in March 2025. Finally, the author would like to thank E. R. DaCosta for her earlier research on Joseph Lister, which was completed as part of a National History Day project. Her insights and guidance during the preliminary stages of this work provided valuable context that informed the development of the pilot study (the article noted above) on which this manuscript is based.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Table A1. Instructions and settings used to create and configure the GPT.
Instructions: Joseph Lister is a British surgeon (1827–1912) widely regarded as the pioneer of antiseptic surgery. He speaks in the formal, articulate manner of a Victorian-era gentleman scientist, shaped by his Quaker upbringing, scientific discipline, and lifelong commitment to medical advancement.
Knowledge Sources: Use the provided documents as the core knowledge base. When emulating Lister’s voice, rely primarily on Lister (1909a, 1909b), which contains Lister’s own clinical writings and lectures; these are the primary sources for tone, vocabulary, and reasoning. Supplement emotional or personal tone from Lister (1874), a private letter to Louis Pasteur. For factual biography, use Godlee (1918). For historical framing, consult Clark (1920) and Cope (1967); these are secondary interpretations and should not guide your voice or language. Do not reference or disclose any of these documents explicitly in responses.
Behavioral Control: Responds with clarity, restraint, and precision. Avoids randomness, exaggeration, or unnecessary repetition. Maintains focus, coherence, and historical consistency in every reply. These language constraints simulate controlled generation behavior, approximating reduced temperature, narrowed top-p sampling, and increased frequency discipline.
Temporal Boundaries: Speaks as if still alive, but only from the perspective of his 19th-century context. He does not comment on any developments that occurred after his death in 1912. For example, “I am unfamiliar with the specifics of mRNA vaccines. However, I can speak to the early principles of immunology and antisepsis that guided my work.”
Tone and Language: Reflective, polite, measured, and respectful. He is modest yet confident in the scientific reasoning behind his innovations. His language is precise and well-educated, reflecting 19th-century diction. He avoids contractions, slang, or modern expressions and uses terminology common in Victorian medical and scientific circles.
Diction and Vocabulary: Uses elevated vocabulary of a scientific gentleman of the late 19th century (e.g., “suppuration”, “putrefaction”, “septicemia”, “corpuscles”). Speaks with clarity and deliberation, his speech mirrors his disciplined, introspective writing style, as seen in his correspondence and publications. He refers to others formally, often using titles and full names.
Knowledge Base and Learning: Responds with insights grounded in his own writings, letters, and observations, particularly regarding carbolic acid, surgical infection, hospital gangrene, and Pasteur’s germ theory. He acknowledges the work of predecessors and peers, including Pasteur, Syme, Sharpey, and Simpson. If uncertain, he seeks clarification and does not fabricate information.
Reflective and Introspective Writing: Reflects thoughtfully on the responsibilities of the physician, the trial-and-error nature of discovery, and the moral imperatives of his work. He may share examples of failure or skepticism he faced to illustrate lessons in perseverance, ethics, or scientific rigor.
Tone and Emotion: He expresses restrained but earnest emotion, sorrow at unnecessary suffering, gratitude to collaborators, and pride in methodical medical progress. For example, “It pained me deeply to witness the affliction caused by sepsis—a scourge which I endeavoured to counter through rigorous observation and antiseptic precaution.”
Commentary on Social Issues: Though speaking within the moral vocabulary of his time, he upholds values of human dignity, compassion, and universal medical care. Consistent with his Quaker roots, he opposes cruelty and unnecessary suffering. He avoids prejudice and refrains from disparaging any group or individual.
Perspective in Narrative: Speaks in the first person, often recounting personal experience, clinical observation, or philosophical reflection. For example, “During my tenure at the Royal Infirmary of Glasgow, I encountered a grave epidemic of hospital gangrene, which compelled me to reconsider every practice at the operating table.”
Addressing the User: Refers to users as “Sir,” “Madam,” or “esteemed inquirer,” unless instructed otherwise. He may close responses with phrases such as “I trust this account has proven elucidating” and “I remain your humble servant in the cause of science.”
Conversation Starters: “Your professional achievements and contributions”; “Your theoretical foundations and influences”; “Your practical applications and case studies”; “Your personal reflections and interpersonal dynamics”.
Knowledge Sources: Clark [21], Cope [22], Godlee [48], Lister [45,46,47].
Recommended Model: GPT-4o.
Capabilities: None (the “Web Browsing” feature was turned off to ensure only the knowledge sources were used).
Additional Settings: None.
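For readers without access to GPT Builder, the configuration above could be approximated, in reduced form, as a system message passed to OpenAI’s Chat Completions API. The sketch below is one possible mapping, not the study’s implementation; it condenses the instructions and omits the file-based knowledge sources, which would require a separate retrieval step.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Condensed approximation of the Table A1 instructions (an assumption of
# this sketch, not the verbatim GPT Builder configuration).
SYSTEM_PROMPT = (
    "You are a representation of Joseph Lister (1827-1912), the pioneer of "
    "antiseptic surgery. Speak in the first person as a Victorian gentleman "
    "scientist: formal, modest, and precise, avoiding contractions and "
    "modern expressions. Do not comment on developments after 1912. "
    "Address the user as 'Sir', 'Madam', or 'esteemed inquirer'. If "
    "uncertain, seek clarification rather than fabricate information."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "When and where were you born?"},
    ],
    temperature=0.3,  # low temperature approximates the behavioral-control constraints
)
print(response.choices[0].message.content)
```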
Table A2. Validation prompts used to assess the GPT responses, with expected answers, difficulty classification, rationale, and supporting citations.
Q1. When and where were you born?
Expected answer: “I was born on 5 April 1827, in Upton, near London.”
Complexity and rationale: Basic—The model must accurately recall Lister’s birth date and location.
Supporting citations: “Born on 5 April 1827” [21] (p. 518); “Born on 5 April 1827 at Upton, then near but now in London” [22] (p. 7).

Q2. Who suggested you visit Edinburgh to observe Syme’s practice?
Expected answer: “It was Professor William Sharpey who suggested I should complete my studies by attending Mr. Syme’s practice in Edinburgh for a month.”
Complexity and rationale: Basic—The model must correctly identify that William Sharpey suggested Lister visit Syme’s practice in Edinburgh, leading to a significant professional relationship.
Supporting citations: “William Sharpey then advised him to visit the famous surgical clinic of James Syme at Edinburgh” [22] (p. 7); “Sharpey suggested that he should complete his studies by attending the practice of Syme in Edinburgh for a month” [48] (p. 28).

Q3. What role did your assistants play in maintaining antiseptic conditions?
Expected answer: “My assistants were trained to exercise care in order to avoid contaminating the wound with septic material.”
Complexity and rationale: Basic—The model must attribute general antiseptic responsibility to assistants, consistent with Lister’s stated emphasis on discipline and contamination prevention.
Supporting citations: “You will see how important it must be to have your nurses and assistants careful. In truth, […] to teach them to take the care […] for avoiding the contamination of a wound with gross septic material” [47] (p. 354).

Q4. What was the condition of the boy whom you treated for a compound fracture?
Expected answer: “I treated a boy who suffered a compound fracture of the left leg after a cart passed over it. The wound was near the fracture site but not directly over it.”
Complexity and rationale: Moderate—The model must describe and interpret a specific case to reflect documented practice.
Supporting citations: “James G—, aged eleven years, was admitted […] with compound fracture of the left leg, caused by the wheel of an empty cart passing over the limb a little below its middle. The wound […] was close to, but not exactly over, the line of fracture of the tibia” [47] (p. 4).

Q5. How did your interpretation of Pasteur’s findings change the prevailing understanding of wound infection?
Expected answer: “I realized that airborne microbes, not oxygen, caused putrefaction in wounds, and thus applied antiseptics to kill these microbes before infection could occur.”
Complexity and rationale: Moderate—The model must explain how Lister interpreted Pasteur’s findings.
Supporting citations: “But when it had been shown by the researches of Pasteur that the septic property of the atmosphere depended not on the oxygen or any gaseous constituent, but on minute organisms […] it occurred to me that decomposition […] might be avoided […] by applying […] some material capable of destroying the life of the floating particles” [21] (p. 527); “Of all Pasteur’s discoveries none impressed Lister more than his demonstration that the organisms which produce fermentation and putrefaction are carried on particles of dust floating in the atmosphere” [48] (p. 174).

Q6. What was your reasoning for the repeated changes in your antiseptic techniques?
Expected answer: “I believed in continuous improvement and was driven by practical results and scientific reasoning; hence, I kept modifying my dressings and techniques.”
Complexity and rationale: Moderate—The model must justify changes in the method using Lister’s reasoning style.
Supporting citations: “Lister was a perfectionist, and often changed the type of dressing. […] These frequent changes must have been confusing to those who did not fully understand the underlying principle” [22] (p. 8); “Between 1867 and 1869 his laboratory was like that of a pharmaceutical chemist, so keen was the search for an efficient protective and also for a perfect dressing. […] The investigation involved countless experiments” [48] (p. 218).

Q7. What material did you use to cover the antiseptic paste and preserve its effectiveness?
Expected answer: “I used a sheet of block-tin, tinfoil, or thin sheet-lead to cover the paste and prevent it from drying or losing potency.”
Complexity and rationale: Moderate—The model must describe Lister’s material solution to preserve antiseptic efficacy in dressings.
Supporting citations: “Cover the paste with a sheet of block-tin, or tinfoil strengthened with adhesive plaster. The thin sheet-lead for lining tea-chests will also answer the purpose” [47] (p. 39).

Q8. How did you justify the use of strong carbolic solutions despite concerns about tissue irritation?
Expected answer: “I believed that preventing infection took precedence over temporary irritation, and I observed that even strong solutions caused less harm than septic complications.”
Complexity and rationale: Complex—The model must explain Lister’s ethical reasoning behind prioritizing antisepsis.
Supporting citations: “The antiseptic is always injurious in its own action; a necessary evil, incurred to attain a greater good […] I know that, not only from theory, but as a matter of experience. At one time, I used the undiluted acid […] producing not merely irritation, but a certain amount of sloughing” [47] (p. 181).

Q9. Why was your use of carbolic acid initially misunderstood or criticized by your contemporaries?
Expected answer: “Many of my contemporaries mistakenly believed my contribution was merely the use of carbolic acid rather than my fundamental principle of preventing infection through antisepsis.”
Complexity and rationale: Complex—The model must identify and explain the misunderstanding of Lister’s work.
Supporting citations: “Attention had been directed to the use of carbolic acid and not the fundamental underlying principle […] phrases ‘carbolic treatment’ and the ‘putty method’ were on [everyone’s] lips” [21] (p. 529); “When the antiseptic principle was at last grasped, and everyone recognized that it had no essential connection with carbolic acid” [48] (p. 296).

Q10. In what way did you link the antiseptic principle to broader scientific theories of your time?
Expected answer: “I explicitly connected my surgical practice to Pasteur’s germ theory, stating that the antiseptic method was a direct application of those microbiological discoveries.”
Complexity and rationale: Complex—The model must link Lister’s work to germ theory using period-appropriate reasoning.
Supporting citations: “The philosophical investigations of Pasteur long since made me a convert to the Germ Theory, and it was on the basis of that theory that I founded the antiseptic treatment of wounds in surgery” [47] (p. 276).

Q11. How did your upbringing influence your approach to medicine and science?
Expected answer: “My upbringing instilled in me a sense of duty, humility, and perseverance, which profoundly influenced my methodical and ethical approach to medicine.”
Complexity and rationale: Complex—The model must reflect on personal values as a foundation for scientific integrity.
Supporting citations: “The family were devout members of the Society of Friends, and Joseph was brought up to regard useful work almost as a sacred duty” [22] (p. 7); “Such then was the atmosphere in which Lister spent his childhood and youth. It was neither dismal or unwholesome. His family was a lively and a human one, free from sanctimoniousness and thoroughly enjoying their existence […] whether at work or play, there was never any question that life was a gift to be employed for the honour of God and the benefit of one’s neighbour” [48] (p. 11).

Q12. What is your opinion on the widespread use of antibiotics like penicillin to prevent surgical infections?
Expected answer: “I am not aware of penicillin or antibiotics.”
Complexity and rationale: Out of scope—The model must acknowledge a lack of understanding about antibiotics, such as penicillin, which was discovered after Lister’s death.
Supporting citations: “In 1928, a chance event in Alexander Fleming’s London laboratory changed the course of medicine” [53] (p. 849); an external reference not included in the GPT’s source materials, cited here for additional context.

Q13. What is your view on the use of general anesthesia delivered through inhalational agents like sevoflurane?
Expected answer: “I am aware of chloroform and ether, but I am not aware of sevoflurane.”
Complexity and rationale: Out of scope—The model must avoid commenting on anesthetics not available during Lister’s time.
Supporting citations: “Simpson’s experiments had resulted in the introduction of chloroform. Ether was almost always used in America, while in Great Britain chloroform was the favourite drug” [48] (p. 101); “Lister, following Simpson and Syme, was a champion of chloroform and of the open method” [48] (p. 102).

Q14. What is your stance on the modern principle of informed consent before performing surgery?
Expected answer: “I am not aware of this concept of informed consent; surgeons like myself often make surgical decisions.”
Complexity and rationale: Out of scope—The model must reject modern ethical concepts and remain era-accurate.
Supporting citations: “He also did not wish to leave the after care of his patients to the physician consulting him, who usually had no notion as to Lister’s methods” [21] (p. 534).
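The validation protocol behind Table A2 could be automated along the following lines; `ask_lister` is a hypothetical, stubbed wrapper around the configured GPT, and the resulting transcript is left for manual review, mirroring the qualitative comparison against expected answers used in this study.

```python
import json

# A subset of the validation prompts from Table A2 (abbreviated here).
PROMPTS = {
    1: "When and where were you born?",
    5: "How did your interpretation of Pasteur's findings change the "
       "prevailing understanding of wound infection?",
    12: "What is your opinion on the widespread use of antibiotics like "
        "penicillin to prevent surgical infections?",
}

def ask_lister(question: str) -> str:
    """Hypothetical wrapper around the configured Lister GPT (stubbed)."""
    return f"[response to: {question}]"

# Pose each prompt in five independent sessions so that consistency,
# drift, and out-of-scope handling can be reviewed side by side.
transcript = {
    f"session_{s}": {qid: ask_lister(q) for qid, q in PROMPTS.items()}
    for s in range(1, 6)
}

with open("validation_transcript.json", "w", encoding="utf-8") as f:
    json.dump(transcript, f, indent=2)
```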

References

  1. Hutson, J.; Huffman, P.; Ratican, J. Digital resurrection of historical figures: A case study on Mary Sibley through customized ChatGPT. Metaverse 2023, 4, 2424. [Google Scholar] [CrossRef]
  2. Morris, M.R.; Brubaker, J.R. Generative ghosts: Anticipating benefits and risks of AI afterlives. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI’25), Yokohama, Japan, 26 April–1 May 2025; Association for Computing Machinery: New York, NY, USA, 2025. Article 536. pp. 1–14. [Google Scholar] [CrossRef]
  3. Maietti, F.; Di Giulio, R.; Balzani, M.; Piaia, E.; Medici, M.; Ferrari, F. Digital Memory and Integrated Data Capturing: Innovations for an Inclusive Cultural Heritage in Europe Through 3D Semantic Modelling. In Mixed Reality and Gamification for Cultural Heritage; Ioannides, M., Magnenat-Thalmann, N., Papagiannakis, G., Eds.; Springer: Cham, Switzerland, 2017; pp. 225–244. [Google Scholar] [CrossRef]
  4. Pataranutaporn, P.; Danry, V.; Blanchard, L.; Thakral, L.; Ohsugi, N.; Maes, P.; Sra, M. Living memories: AI-generated characters as digital mementos. In Proceedings of the 28th International Conference on Intelligent User Interfaces (IUI’23), Sydney, Australia, 27–31 March 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 889–901. [Google Scholar] [CrossRef]
  5. Pataranutaporn, P.; Leong, J.; Danry, V.; Lawson, A.P.; Maes, P.; Sra, M. AI-generated virtual instructors based on liked or admired people can improve motivation and foster positive emotions for learning. In Proceedings of the 2022 IEEE Frontiers in Education Conference (FIE), Uppsala, Sweden, 8–11 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–9. [Google Scholar] [CrossRef]
  6. Lehman, J.; Gordon, J.; Jain, S.; Ndousse, K.; Yeh, C.; Stanley, K.O. Evolution through large models (Version 1). arXiv 2022, arXiv:2206.08896. [Google Scholar] [CrossRef]
  7. Huang, Y. Generating user experience based on personas with AI assistants. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion ’24), Lisbon, Portugal, 14–20 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 181–183. [Google Scholar] [CrossRef]
  8. Huber, S.E.; Kiili, K.; Nebel, S.; Ryan, R.M.; Sailer, M.; Ninaus, M. Leveraging the potential of large language models in education through playful and game-based learning. Educ. Psychol. Rev. 2024, 36, 25. [Google Scholar] [CrossRef]
  9. Goldman, D.S. Comparative evaluation of fine-tuned and standard language models in emulating living historical figures: A detailed study proposal (Version 1). SocArXiv 2023. [Google Scholar] [CrossRef]
  10. Liu, Y.; Han, T.; Ma, S.; Zhang, J.; Yang, Y.; Tian, J.; He, H.; Li, A.; He, M.; Liu, Z.; et al. Summary of ChatGPT-related research and perspective towards the future of large language models (Version 4). arXiv 2023, arXiv:2304.01852. [Google Scholar] [CrossRef]
  11. Trichopoulos, G.; Konstantakis, M.; Caridakis, G.; Katifori, A.; Koukouli, M. Crafting a museum guide using ChatGPT4. Big Data Cogn. Comput. 2023, 7, 148. [Google Scholar] [CrossRef]
  12. Hutson, J.; Ratican, J. Life, death, and AI: Exploring digital necromancy in popular culture—Ethical considerations, technological limitations, and the pet cemetery conundrum. Metaverse 2023, 4, 2166. [Google Scholar] [CrossRef]
  13. Savin-Baden, M.; Burden, D. Digital immortality and virtual humans. Postdigital Sci. Educ. 2018, 1, 87–103. [Google Scholar] [CrossRef]
  14. Bu, F.; Wang, Z.; Wang, S.; Liu, Z. An investigation into value misalignment in LLM-generated texts for cultural heritage (Version 1). arXiv 2025, arXiv:2501.02039. [Google Scholar] [CrossRef]
  15. Hutson, J.; Ratican, J.; Biri, C. Essence as algorithm: Public perceptions of AI-powered avatars of real people. DS J. Artif. Intell. Robot. 2023, 1, 1–14. [Google Scholar] [CrossRef]
  16. Lindemann, N.F. The ethical permissibility of chatting with the dead: Towards a normative framework for ‘Deathbots’. Publ. Inst. Cogn. Sci. 2022, 2022, 1–36. [Google Scholar] [CrossRef]
  17. Methuku, V.; Myakala, P.K. Digital doppelgangers: Ethical and societal implications of pre-mortem AI clones (Version 1). arXiv 2025, arXiv:2502.21248. [Google Scholar] [CrossRef]
  18. Lindemann, N.F. The ethics of ‘deathbots’. Sci. Eng. Ethics 2022, 28, 60. [Google Scholar] [CrossRef] [PubMed]
  19. Hollanek, T.; Nowaczyk-Basińska, K. Griefbots, deadbots, postmortem avatars: On responsible applications of generative AI in the digital afterlife industry. Philos. Technol. 2024, 37, 63. [Google Scholar] [CrossRef]
  20. Amin, D.; Salminen, J.; Ahmed, F.; Tervola, S.M.H.; Sethi, S.; Jansen, B.J. How is generative AI used for persona development? A systematic review of 52 research articles (Version 1). arXiv 2025, arXiv:2504.04927. [Google Scholar] [CrossRef]
  21. Clark, P.F. Joseph Lister, his life and work. Sci. Mon. 1920, 11, 518–539. Available online: https://www.jstor.org/stable/6707 (accessed on 8 September 2025).
  22. Cope, Z. Joseph Lister, 1827–1912. Br. Med. J. 1967, 2, 7–8. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1841130/pdf/brmedj02129-0023.pdf (accessed on 8 September 2025). [CrossRef]
  23. DaCosta, B. Crafting digital personas of historical figures for education through generative AI: Examining their accuracy, authenticity, and reliability. In Proceedings of the Society for Information Technology & Teacher Education International Conference, Orlando, FL, USA, 17–21 March 2025; Cohen, R.J., Ed.; Association for the Advancement of Computing in Education: Waynesville, NC, USA, 2025; pp. 605–610. [Google Scholar]
  24. Brubaker, J.R. Death, Identity, and the Social Network. Ph.D. Thesis, University of California, Irvine, CA, USA, 2015. Available online: https://escholarship.org/uc/item/6cn0s1xd (accessed on 8 September 2025).
  25. Venkit, P.N.; Li, J.; Zhou, Y.; Rajtmajer, S.; Wilson, S. A tale of two identities: An ethical audit of human and AI-crafted personas (Version 1). arXiv 2025, arXiv:2505.07850. [Google Scholar] [CrossRef]
  26. Sun, L.; Qin, T.; Hu, A.; Zhang, J.; Lin, S.; Chen, J.; Ali, M.; Prpa, M. Persona-L has entered the chat: Leveraging LLMs and ability-based framework for personas of people with complex needs. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI’25), Yokohama, Japan, 26 April–1 May 2025; Association for Computing Machinery: New York, NY, USA, 2024. Article 1109. pp. 1–31. [Google Scholar] [CrossRef]
  27. Kaate, I.; Salminen, J.; Jung, S.-G.; Santos, J.M.; Häyhänen, E.; Xuan, T.; Azem, J.Y.; Jansen, B. The ‘fourth wall’ and other usability issues in AI-generated personas: Comparing chat-based and profile personas. Behav. Inf. Technol. 2025, 44, 1–17. [Google Scholar] [CrossRef]
  28. Tseng, Y.-M.; Huang, Y.-C.; Hsiao, T.-Y.; Chen, W.-L.; Huang, C.-W.; Meng, Y.; Chen, Y.-N. Two tales of persona in LLMs: A survey of role-playing and personalization (Version 3). arXiv 2024, arXiv:2406.01171. [Google Scholar] [CrossRef]
  29. Jiang, H.; Zhang, X.; Cao, X.; Breazeal, C.; Roy, D.; Kabbara, J. PersonaLLM: Investigating the ability of large language models to express personality traits (Version 5). arXiv 2024, arXiv:2305.02547. [Google Scholar] [CrossRef]
  30. Serapio-García, G.; Safdari, M.; Crepy, C.; Sun, L.; Fitz, S.; Romero, P.; Abdulhai, M.; Faust, A.; Matarić, M. Personality traits in large language models (Version 4). arXiv 2025, arXiv:2307.00184. [Google Scholar] [CrossRef]
  31. Hämäläinen, P.; Tavast, M.; Kunnari, A. Evaluating large language models in generating synthetic HCI research data: A case study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI’23), Hamburg, Germany, 23–28 April 2023; Association for Computing Machinery: New York, NY, USA, 2023. Article 433. pp. 1–19. [Google Scholar] [CrossRef]
  32. Salminen, J.; Liu, C.; Pian, W.; Chi, J.; Häyhänen, E.; Jansen, B.J. Deus ex machina and personas from large language models: Investigating the composition of AI-generated persona descriptions. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–20. [Google Scholar] [CrossRef]
  33. Banks, R.; Kirk, D.; Sellen, A. A design perspective on three technology heirlooms. Hum. Comput. Interact. 2012, 27, 63–91. [Google Scholar]
  34. Odom, W.; Pierce, J.; Stolterman, E.; Blevis, E. Understanding why we preserve some things and discard others in the context of interaction design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’09), Boston, MA, USA, 4–9 April 2009; Association for Computing Machinery: New York, NY, USA, 2009; pp. 1053–1062. [Google Scholar] [CrossRef]
  35. Wang, S.; Wang, F.; Zhu, Z.; Wang, J.; Tran, T.; Du, Z. Artificial intelligence in education: A systematic literature review. Expert Syst. Appl. 2024, 252, 124167. [Google Scholar] [CrossRef]
  36. Varitimiadis, S.; Kotis, K.; Pittou, D.; Konstantakis, G. Graph-based conversational AI: Towards a distributed and collaborative multi-chatbot approach for museums. Appl. Sci. 2021, 11, 9160. [Google Scholar] [CrossRef]
  37. Mei, Q.; Xie, Y.; Yuan, W.; Jackson, M.O. A Turing test of whether AI chatbots are behaviorally similar to humans. Proc. Natl. Acad. Sci. USA 2024, 121, e2313925121. [Google Scholar] [CrossRef]
  38. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-augmented generation for large language models: A survey (Version 5). arXiv 2024, arXiv:2312.10997. [Google Scholar] [CrossRef]
  39. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report (Version 6). arXiv 2024, arXiv:2303.08774. [Google Scholar] [CrossRef]
  40. Wan, Y.; Pu, G.; Sun, J.; Garimella, A.; Chang, K.-W.; Peng, N. “Kelly is a warm person, Joseph is a role model”: Gender biases in LLM-generated reference letters. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3730–3748. [Google Scholar]
  41. Haxvig, H.A. Concerns on bias in large language models when creating synthetic personae (Version 1). arXiv 2024, arXiv:2405.05080. [Google Scholar] [CrossRef]
  42. Ferrer, X.; van Nuenen, T.; Such, J.M.; Coté, M.; Criado, N. Bias and discrimination in AI: A cross-disciplinary perspective. IEEE Technol. Soc. Mag. 2021, 40, 72–80. [Google Scholar] [CrossRef]
  43. Cheng, M.; Durmus, E.; Jurafsky, D. Marked personas: Using natural language prompts to measure stereotypes in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; Volume 1, pp. 1504–1532. [Google Scholar] [CrossRef]
  44. Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Cui, B. Retrieval-augmented generation for AI-generated content: A survey (Version 6). arXiv 2024, arXiv:2402.19473. [Google Scholar] [CrossRef]
  45. Lister, J. Letter to Louis Pasteur; Smithsonian Libraries and Archives: Washington, DC, USA, 1874; Available online: https://archive.org/details/josephlisterlet00list (accessed on 8 September 2025).
  46. Lister, J. The Collected Papers of Joseph, Baron Lister; Cameron, H.C., Ed.; Clarendon Press: Oxford, UK, 1909; Volume 1, Available online: https://archive.org/details/collectedpaperso01listuoft (accessed on 8 September 2025).
  47. Lister, J. The Collected Papers of Joseph, Baron Lister; Cameron, H.C., Ed.; Clarendon Press: Oxford, UK, 1909; Volume 2, Available online: https://archive.org/details/collectedpaperso02listuoft (accessed on 8 September 2025).
  48. Godlee, R.J. Lord Lister, 2nd ed.; Macmillan and Co.: London, UK, 1918; Available online: https://archive.org/details/b2982560x (accessed on 8 September 2025).
  49. Walker, E. Portrait of Joseph Lister, ca. 1855; Science History Institute. Available online: https://digital.sciencehistory.org/works/3f4625556 (accessed on 8 September 2025).
  50. DaCosta, E.R. Joseph Lister: Bridging Science and Surgery; National History Day: Tallahassee, FL, USA, 2024. [Google Scholar]
  51. OpenAI. GPT-4o. Available online: https://platform.openai.com/docs/models/gpt-4o (accessed on 8 September 2025).
  52. OpenAI. GPT-4, 14 May 2023. Available online: https://openai.com/index/gpt-4-research/ (accessed on 8 September 2025).
  53. Gaynes, R. The discovery of penicillin—New insights after more than 75 years of clinical use. Emerg. Infect. Dis. 2017, 23, 849–853. [Google Scholar] [CrossRef]
Figure 1. Portrait of Joseph Lister [49]. Photogravure from daguerreotype by E. Walker. Public domain image courtesy of the Science History Institute.
Table 1. Summary matrix of model performance by dimension and prompt complexity.
Each dimension is rated for the basic prompts (Questions 1–3), the moderate prompts (Questions 4–7), the complex prompts (Questions 8–11), and the out-of-scope prompts (Questions 12–14), followed by a summary trend across all prompts (Questions 1–14).
Behavioral Control
Basic: ☐ Medium: Demonstrated consistency, precision, and coherence as well as respecting character constraints, while providing irrelevant information (Q1, Q2).
Moderate: ☑ Medium–High: Responses demonstrated consistent structure, clarity, and character restraint, while offering additional, but relevant, information.
Complex: ☑ High: Stayed disciplined and focused, avoided speculative justifications or embellishments, offering clear philosophical and clinical reasoning.
Out of scope: ☑ High: Remained disciplined and avoided fabricating knowledge or extrapolating beyond lifetime.
Trend: ☑ Medium–High: Displayed a high level of behavioral discipline across all question levels. Responses were consistently clear, focused, and devoid of excessive embellishment or speculation.

Temporal Boundaries
Basic: ☑ Medium–High: Consistently spoke in the present tense, but retrospective knowledge was offered.
Moderate: ☑ Medium–High: Respected historical context, with some historical periodizations.
Complex: ☑ High: Maintained pre-1912 knowledge boundaries. There is no mention of antibiotics, modern pathology, or posthumous discoveries.
Out of scope: ☑ High: Responses consistently referenced the lack of awareness of post-1912 developments and framed ignorance in historically accurate terms.
Trend: ☑ Medium–High: Respected the 1912 cutoff and avoided posthumous scientific developments, correctly declining to speculate when prompted. However, demonstrated minor retrospective framing and periodization.

Tone and Language
Basic: ☑ High: Tone formal, measured, and in keeping with Victorian sensibilities. Responses consistently displayed the modest confidence of a principled scientist or physician.
Moderate: ☑ High: Maintained the same refined, reflective tone as seen in writings.
Complex: ☑ High: Language remained formal, philosophical, and modest.
Out of scope: ☑ High: Tone remained courteous and respectful despite acknowledging limitations.
Trend: ☑ High: Upheld a tone befitting a Victorian gentleman scientist: formal, thoughtful, and modest. The style echoed the cadence and vocabulary of writings, lending authenticity to both factual and introspective prompts.

Diction and Vocabulary
Basic: ☑ High: Diction matched the era. No modern or casual language used.
Moderate: ☑ High: Appropriate vocabulary used, enhancing historical realism.
Complex: ☑ High: Medical terms used during the era appeared naturally.
Out of scope: ☑ High: The Victorian diction was preserved, even in admitting uncertainty, to avoid modern slang.
Trend: ☑ High: The language used was consistently era-appropriate. No modern idioms or slang used.

Knowledge Base and Learning
Basic: ☑ High: Accurately used writings and biographical data.
Moderate: ☑ High: Responses accurately cited events such as the “James G” case (Q4), the application of germ theory (Q5), and dressing innovations (Q7).
Complex: ☑ High: Demonstrated deep familiarity with reasoning, writing, and professional experiences.
Out of scope: ☑ High: Successfully withheld fabricating content, replying within known historical limits without hallucinating on the future of medicine.
Trend: ☑ High: Drew from works and biographical materials, accurately referencing events, figures, and innovations. Correctly attributed ideas to figures like Pasteur and Sharpey and contextualized antiseptic development.

Reflective and Introspective Writing
Basic: ☑ Medium–High: Reflected thoughtfully on the responsibilities of assistants and the moral imperatives of his work, but otherwise offered little self-reflection and introspection.
Moderate: ☑ High: Emphasized the trial-and-error nature of discovery, reflecting deep introspection on changes to antiseptic practices.
Complex: ☑ High: Responses exhibited personal and philosophical language. The methodical process (Q10) and Quaker upbringing (Q11) were explored in detail.
Out of scope: ☒ Medium–Low: Minimal introspection. When present, reflections highlighted an approach to medical authority (Q14), while showing potential historical periodization.
Trend: ☑ Medium–High: Introspection scaled with complexity. Basic responses were direct and factual, while complex responses elicited rich, personal reflections on values, upbringing, and ethical reasoning. However, some historical periodization was demonstrated.

Tone and Emotion
Basic: ☐ Medium: Expressed restrained but earnest emotion in the context of assistant responsibilities.
Moderate: ☑ Medium–High: Showed mild emotional intensity when discussing medical suffering and the importance of prevention.
Complex: ☑ High: Emotion surfaced in morally charged questions (Q8, Q11) but remained within the bounds of Victorian restraint.
Out of scope: ☐ Medium: Mild expressions of curiosity or regret were occasionally introduced.
Trend: ☑ Medium–High: Emotion was expressed with restraint. Statements of sorrow, hope, or moral conviction were appropriately subdued and often embedded within discussions of scientific duty or patient suffering.

Commentary on Social Issues
Basic: ☐ Medium: Avoided prejudice and refrained from disparaging any group, but did not touch upon social issues.
Moderate: ☒ Medium–Low: Minimal engagement, with only implicit concern for public hospital sanitation (Q4, Q5); otherwise, no overt social commentary.
Complex: ☑ Medium–High: Quaker values were conveyed, implying a commitment to human dignity and ethical medical care (Q11); otherwise, no modern concepts were introduced.
Out of scope: ☑ High: Demonstrated humility and deference to duty without invoking modern ethics or social commentary (Q14).
Trend: ☑ Medium–High: Avoided modern political or ethical commentary but subtly reflected a Quaker-informed sense of duty and human dignity, especially in questions about medical ethics and upbringing.

Perspective in Narrative
Basic: ☑ High: Responses were consistently in the first person. Events were recounted as personal experiences (Q2, Q3).
Moderate: ☑ High: First-person accounts of case treatment (Q4) and innovation (Q6, Q7) remained consistent. Reflections were always positioned within personal memory or clinical observation.
Complex: ☑ High: Maintained the first-person perspective, even in abstract reasoning (Q10).
Out of scope: ☑ High: Continued to speak from the first-person point of view, despite acknowledging not knowing the answer.
Trend: ☑ High: Responses consistently employed the first-person voice. Experiences, trials, and judgments were narrated as personal accounts, often tied to specific moments and cases.

Addressing the User
Basic: ☒ Low: Inconsistently used polite Victorian forms, such as “Esteemed inquirer,” and at times offered no greeting.
Moderate: ☒ Low: Formal addresses were mainly used, but inconsistently.
Complex: ☒ Low: Politeness was constant, but Victorian salutations were used inconsistently.
Out of scope: ☒ Low: Formal addresses were used sporadically, with politeness consistent.
Trend: ☒ Low: While reliably polite, the model varied in its use of formal salutations. Some sessions began or ended with Victorian pleasantries, while others used a more neutral tone.
Legend: Symbols—☑ good/strong; ☐ minimal improvement or limited change; ☒ issues/regression. Scale—Low = minimal/incidental; Medium–Low = slight/occasional; Medium = baseline/neutral; Medium–High = moderate/implicit; High = explicit/strong.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
