Heritage
  • Article
  • Open Access

6 December 2025

Generative AI in Heritage Practice: Improving the Accessibility of Heritage Guidance

1 School of History, Classics & Archaeology, University of Edinburgh, Edinburgh EH8 9AG, UK
2 Historic England, London EC4R 2YA, UK
* Author to whom correspondence should be addressed.

Abstract

This paper discusses the potential for integrating Generative Artificial Intelligence (GenAI) into professional heritage practice with the aim of enhancing the accessibility of public-facing guidance documents. We developed HAZEL, a GenAI chatbot fine-tuned to assist with revising written guidance relating to heritage conservation and interpretation. Using quantitative assessments, we compare HAZEL’s performance to that of ChatGPT (GPT-4) in a series of tasks related to the guidance writing process. The results of this comparison indicate a slightly better performance of HAZEL over ChatGPT, suggesting that the GenAI chatbot is more effective once the underlying large language model (LLM) has been fine-tuned. However, we also note significant limitations, particularly in areas requiring cultural sensitivity and more advanced technical expertise. These findings suggest that, while GenAI cannot replace human heritage professionals in technical authoring tasks, its potential to automate and expedite certain aspects of guidance writing could offer valuable benefits to heritage organisations, especially in resource-constrained contexts.

1. Introduction

Generative artificial intelligence (GenAI) is increasingly integrated into cultural heritage practice, but the study of such uses is still in its infancy. This article investigates the extent to which fine-tuned GenAI models offer new possibilities for enhancing the accessibility and readability of heritage-related guidance, working specifically with documents published by Historic England (HE). Since the GenAI chatbot ChatGPT was released by the company OpenAI in 2022, ‘generative artificial intelligence’ is often referred to as ‘AI’ or ‘artificial intelligence’ in scientific literature [1,2] and colloquially [3,4,5]. Though ChatGPT and similar technologies are a type of AI, we refer specifically to generative AI systems and technologies as ‘GenAI’ and the broad field of artificial intelligence as ‘AI’.
Artificial intelligence (AI) has been deployed in cultural heritage practices related to preservation, dissemination, and conservation. For example, researchers have used AI for automating the classification of soil colours according to the Munsell system [6] and for recognising handwriting in medieval manuscripts [7]. In galleries, libraries, archives, and museums (GLAM), mass digitisation projects facilitated by the development of optical character recognition (OCR) technologies have benefitted from AI-augmented improvements in the scanning technologies themselves, along with more robust post-OCR corrective pipelines [8]. Additionally, AI-based handwritten text recognition (HTR) software such as Transkribus has been widely adopted for digitisation in archives and libraries [9]. The increasing availability of digitised materials has, in turn, made possible large-scale digital cultural heritage projects which have explored library and archive metadata, social media data, and other forms of cultural data [10,11]. This ‘big data’ heritage research, which employs text mining, natural language processing [12], and other computational quantitative approaches, highlights a significant methodological shift in heritage that departs from the deeper, more traditionally qualitative analysis [13]. The development of GenAI has been similarly reshaping heritage research and practice. Though examples of real-world implementations are still fairly limited due to the novelty of these technologies, GenAI has been used in information retrieval and dissemination tasks [14]. For instance, galleries and museums have integrated GenAI into the structure of their exhibits to tailor the interpretation of collections for various audiences, such as visitors with disabilities [15]. In these contexts, GenAI systems guide visitors through exhibits, tell stories, and even provide recommendations [16]. In libraries, AI-powered chatbots have been developed to assist with obtaining relevant information [17] and increasing the accessibility of collections [18]. GenAI has also been used for restoring and reconstructing heritage objects, such as Roman coins [19] and works of art [20]. However, despite increasing numbers of use cases for GenAI in heritage settings, researchers have yet to fully outline the potential impacts of this emerging technology on the sector. This is concerning, since high-profile data privacy [21,22] and AI scandals [23,24] have already demonstrated the importance of thoroughly attending to any potential risk factors before implementing new technologies. While our study does not claim to speak to the full spectrum of possible risks, we do aim to outline the major affordances and constraints of GenAI when integrated into the process of developing heritage guidance documents. To the best of our knowledge, our study is the first to explore GenAI specifically for this purpose, and is significant because heritage guidance informs the curation of objects, ideas and places from the past and, in turn, the construction of memory.
In the next sections, we discuss GenAI and the role it can play as a writing assistant. Thereafter we introduce HAZEL, a chatbot powered by a fine-tuned GPT 3.5-turbo model that was developed to assist HE authors in drafting and revising guidance documents. We then present the combination of quantitative methods used to assess HAZEL’s outputs in a series of heritage guidance writing and revision tasks. Results derived from four established readability formulas reveal modest improvements over the GPT 3.5 model behind ChatGPT at the time of development. However, manual assessments performed by a copyeditor highlight limitations in HAZEL’s cultural sensitivity and ability to address the technical complexities often inherent in heritage documentation. We conclude with recommendations for responsibly introducing GenAI in heritage contexts through, for instance, alignment with FAIR (Findable, Accessible, Interoperable, and Reusable) data principles.

2. Literature Review

2.1. Artificial Intelligence and Cognition

For decades, AI researchers have debated the potential for machine or computational intelligence. Much of this discourse has focused on the distinction between ‘narrow AI’—systems designed to perform specific tasks [25]—and ‘artificial general intelligence’ (AGI), which aspires to match or exceed human capabilities in a wide range of tasks across domains [26,27]. Though hypothetical, many prominent GenAI developers position AGI as central to their mission. OpenAI underscores the importance of ensuring that ‘artificial general intelligence benefits all of humanity’ in its research and development [28]. On his personal blog, OpenAI CEO Sam Altman claims not only that ‘we are now confident we know how to build AGI as we have traditionally understood it’, but also that AGI is a precursor to ‘superintelligence’ [29].
Yet this optimistic vision of AGI—and of AI’s broader trajectory—is far from universally accepted. Many experts argue for the ontological impossibility of AGI, contending that machines, at best, simulate rather than embody cognition. They lack key features of human intelligence, such as contextual and analytical awareness, emotional understanding, and subjective experience [30,31,32,33,34,35,36,37]. For example, the Large Language Models (LLMs) through which GenAI systems function have been criticised for lacking any lived experience or coherent mental model of the data on which they are trained. In other words, no matter how sophisticated, GenAI systems ultimately amount to a ‘data dump of ones and zeroes’ [38]. As a result, critics argue, these technologies demonstrate a lack of self-awareness, as evidenced by their frequent inability to recognise or correct errors in their generated outputs [39,40].
In contrast, other studies suggest that LLMs indeed demonstrate cognitive capabilities based on their performance on well-known benchmarks. For instance, recent research shows that LLMs can pass the Turing Test [41] and score highly on standard intelligence assessments such as the Wechsler Adult Intelligence Scale [42]. However, intelligence tests may not be sufficient for estimating cognition in computers because these evaluations primarily measure a narrow subset of abilities associated with pattern recognition, memory, and rule-based reasoning. These are tasks that lend themselves to convergent thinking, or the ability to arrive at a single correct answer using existing knowledge, rather than divergent thinking, which involves generating multiple novel or unconventional solutions to open-ended problems [43]. LLMs excel at convergent thinking tasks because they largely operate through probabilistic pattern-matching—what some researchers call ‘stochastic parroting’ [44] or ‘statistical autocompletion’ [45]. Humans, by contrast, use both convergent and divergent thinking in creative problem solving and other tasks [46].
GenAI technologies have already been informally integrated into professional practice across various sectors, including heritage [47,48]. Yet the use of GenAI for writing and editing heritage-focused texts—such as policy, guidance, and interpretation—remains under-explored, likely due to the novelty of these tools and the uneven integration of computational methods across heritage domains. To contextualise this gap, it is important to review the risks and ethical implications identified in the wider GenAI literature, as these considerations directly shape how such tools might be responsibly adopted in heritage writing.

2.2. Ethical Risks and Limitations of GenAI

LLMs reflect the biases embedded in their training data in which certain languages, actors and perspectives are dominant; this can lead to the systems amplifying certain power imbalances embedded in cultural and historical records [49]. If integrated at scale into the writing process for heritage outputs, GenAI systems could therefore reinforce existing dominant-culture biases in policy and guidance. Such risks are compounded not only by the very rapid pace of development in the so-called ‘AI boom’ [50,51] but also by what some former OpenAI staff see as developers’ failure to prioritise safety and safeguarding [52].
While the heritage sector is only beginning to examine these implications, a growing body of research has already identified a range of risks associated with GenAI. These include concerns about intellectual property rights [53,54], embedded biases [55,56,57,58], privacy breaches [59], and environmental impact [60,61]. Troublingly, biases have also been observed in GenAI systems’ decision-making when they make assumptions about users based on demographic information such as race [62]. For example, one recent study found that ChatGPT tailored its recommendations of programmes of study based on undergraduate students’ demographic profiles [63].
Ethical concerns are not limited to GenAI chatbots’ outputs, but can arise throughout the entire research and deployment pipeline [64]. Although some developers of LLMs have introduced safety protocols and guardrails, users can ‘jailbreak’ or circumvent them through structured prompts [65,66] that coax systems into generating misleading, inaccurate, unethical, or even illegal content [67]. Even when used in good faith, GenAI remains prone to generating misinformation. In scholarly writing tasks, for example, LLMs frequently fabricate bibliographic citations that appear credible but do not exist [68,69].
Another growing area of inquiry concerns the impacts of GenAI on writers and the writing process. In various studies, GenAI systems have performed competitively on convergent thinking tasks, or those with one correct answer, such as multiple-choice exams, proofreading, and structured problem-solving [70,71,72]. However, their performance declines significantly on divergent tasks, which require originality, creativity, or critical insight. For instance, essays ‘written’ by ChatGPT lack evidence of advanced critical thinking, nuanced argumentation, and depth of exploration into topics [73]. Researchers have also found that, when asked to write a series of stories, ChatGPT produces formulaic narrative arcs in its own predictable writing style [74,75]. Thus, a hypothetical use case of GenAI for fiction writing at scale could potentially erode the diversity of authorial voices and narratives in the cultural record, thereby contributing to the homogenisation of literary works.
In addition to concerns about the potential effects of GenAI on texts themselves, researchers have also considered how working with these novel technologies might impact authors. Research on this topic has examined GenAI’s impact on self-efficacy, or writers’ independence in overcoming challenges that occur throughout the writing process [76]. This work has concluded that although GenAI can augment self-efficacy by assisting with brainstorming [75], story creation [77], and basic editing [78], habitually relying on GenAI assistants can impede writers’ development of problem-solving skills [76]. This is especially a concern when the writing task at hand extends beyond the capabilities of GenAI, such as for divergent tasks requiring expert-level knowledge or creativity [79].
Taken together, current research highlights that while GenAI tools offer functional benefits in writing tasks, their limitations pose significant risk. In the heritage sector, where written outputs such as policy documents and interpretive guidance shape both professional practice and public understanding, these risks are compounded by longstanding concerns about representational equity and accessibility. To better understand how GenAI could impact heritage communication in practice, it is useful to examine how these texts are currently written and disseminated.

2.3. Writing Heritage Guidance: The Case of Historic England

In the UK, organisations that oversee heritage sites publish materials outlining best practices for conservation along with guidelines informing planning decisions. In England, Historic England (HE), which ‘helps people care for, enjoy and celebrate’ the country’s ‘historic environment’ [80], has published over 270 of these documents in its ‘Advice and Guidance’ collection [81]. These materials explain how readers can ‘care for and protect historic places’ as well as ‘understand how heritage fits into the wider planning agenda’ [80]. Guidance production at HE has shifted in focus from codifying sector expert knowledge towards sharing expertise with a wider audience, including the public [1]. This shift is exemplified by the recent revision of web pages as part of HE’s climate action strategy. For example, the ‘Your Home’ web pages present advice for owners of older or listed buildings [82]. Heritage guidance materials published by HE therefore have a variety of potential audiences including heritage professionals, homeowners, city planners, community members, government officials, local councils, architects, and anyone involved in decision-making related to conservation.
Depending on topic and scope, guidance documents can be lengthy, complex, and technical even for their intended target audiences. Another factor affecting ease of communication is that most of the guidance has been published in PDF format. Although PDFs allow for agile sharing and are secure, as they cannot be immediately altered, they are not generally machine-readable. Therefore, readers who use assistive technologies such as text-to-speech software may be unable to access the text in PDF files [83]. Accessibility issues in PDFs are especially heightened in cases in which tables, diagrams, or images are present, which is the case for much of the technical guidance published by heritage organisations such as Historic England [84].
While Historic England’s guidance plays a central role in shaping conservation practice across England, its effectiveness depends not only on the accuracy of its content but also on whether it can be understood by a wide and varied readership. The following section draws on scholarship from writing studies and education to explore how written communication functions across different audiences.

2.4. Theoretical Foundations of Accessibility and Readability in Communication

Researchers and educators have been grappling with the challenge of determining the extent to which written content can be understood by various audiences. In an increasingly interconnected world, experts in fields such as writing studies and education have embraced writing assessment practices that emphasise fairness over accuracy and exactness [85,86]. This approach, which considers factors such as whether a writer is using their native language, whether their background includes well-resourced academic institutions, and whether they have a disability, is, by nature, a subjective evaluation.
Additional attention has been directed to the subjective nature of obstacles that readers might experience. These include sensory, cognitive, motor, language, expertise, cultural, and media impediments and, sometimes, a combination [87,88]. Addressing communication barriers can help to ensure that writing is accessible; put another way, accessible writing aims to understand ‘the barriers that prevent access to the content’ and how they can be ‘removed’ [87].
One widely recommended strategy for eliminating these barriers is the use of plain language, which means avoiding slang, jargon, idioms, or other complex terms. Plain language can help to ensure that information communicated is more accessible to target audiences, especially if these are not sector specialists [87,89]. However, the revision process can be complex and challenging, possibly involving multiple steps requiring both the original author and an editor to review a text [90]. This process could therefore be difficult to implement in organisations with limited resources or skills.
In contrast to subjective approaches focused on fairness and accessibility, other researchers have developed objective frameworks to evaluate the readability of texts. For instance, experts have created pipelines that aim to quantify objectively—often computationally—the reading difficulty of a particular text through characteristics such as word count, sentence length, and the number of concrete versus abstract words [91,92]. These computational frameworks originate from a series of formulas for defining and assessing the readability of written work, or how easily general audiences can understand a particular text [93,94,95,96,97]. Using quantifiable metrics such as the average sentence length, the average number of syllables per word, and the difficulty of words used, these formulas score reading difficulty on scales that correspond to the predicted level of educational attainment required to easily understand a text [98,99,100]. Therefore, more concise texts are typically considered more readable than texts with longer sentences and words.
However, the idea that readability can be quantified on text-level features alone has received much criticism. First, quantifiable aspects of sentences and texts are not universally correlated with reading difficulty. For instance, even highly literate individuals may face challenges when reading texts outside of their area of expertise, accessing materials on a small screen such as a smartphone, or when feeling tired or stressed [101]. Second, revision processes focused on shortening sentences and words may alter meaning [102]. This is especially relevant in heritage communication, where close synonyms for technical or culturally-specific terminology may not exist.
While critiques of quantitative readability metrics point out the limitations of quantifying something which, to most readers, is intuitive [103], subjective [104], and, when calculated using formulas, only ever predictive [94], the Flesch Reading Ease and Flesch-Kincaid formulas have been widely applied to assess the readability of technical writing, healthcare communication, and web content design. In comparable studies, these metrics have been used to assess the readability of GenAI-produced content [105,106] and of descriptive annotations, or comments written in files of code [107]. Based on these formulas, GenAI has been shown to improve the readability of technical writing that disseminates public health information [108,109]. However, other studies following a similar research design found that ChatGPT-produced technical writing scored as ‘difficult’ to ‘very difficult’, or readable only for audiences with university-level education [110,111].
Although accessibility and readability both estimate how effective a particular text will be at addressing a general audience, we refer to accessibility as a qualitative summary of the subjective, person-centred barriers a reader could face in understanding a text. This perspective is rooted in the principles of universal design, which prioritise creating products usable by everyone rather than developing bespoke alternatives for individuals with disabilities [112,113]. Importantly, this approach does not overlook the relevance of individual disabilities in accessible design [114]; rather, it emphasises that enhancing accessibility for those with disabilities tends to improve usability for all. Accordingly, accessible heritage communication considers not only the text itself but also the media through which it is made available (and to whom), in addition to other factors such as compatibility with assistive technologies, colour scheme, layout, and so forth.
Readability, on the other hand, is an estimate of reading difficulty based on a quantitative assessment of its text-level features. Put another way, audiences find texts readable when they can easily understand the words and sentence structures they contain [115]. Because readability metrics focus on features such as sentence length, word complexity, and syntactic structure, they exclude many other elements that influence comprehension, including typography, layout, or multimodal design [116]. This limitation is especially relevant for heritage organisations, whose documents often combine text with diagrams, maps, and technical specifications. Nonetheless, quantitative readability measures can be applied at scale across large corpora, which makes them helpful for the purposes of this study. However, these metrics should not be interpreted as objective assessments of a text’s overall accessibility—particularly in terms of the cognitive, cultural, or experiential barriers that readers may encounter.
These distinctions between accessibility and readability are particularly important when applied to written outputs in the heritage sector, where guidance materials and interpretive texts must address diverse and often non-specialist audiences.

2.5. Readability & Accessibility in Heritage Communication

Though comprehensive automated solutions have yet to be proposed for revising heritage guidance for readability and accessibility, the capabilities of GenAI technologies offer a potential pathway worth assessing. In heritage specifically, a growing body of literature has examined how meaning is communicated in museum contexts, particularly through interpretation, exhibit labels, and digital content [117,118,119]. Although not concerned with heritage guidance specifically, this literature highlights the existence of several possible issues that may hinder the accessibility of heritage texts. For instance, written interpretation often contains jargon, complex syntax, and passive language, even when exhibiting characteristics of text deemed ‘readable’ by existing metrics, such as short sentences [120]. Other characteristics of interpretation labels themselves, such as writing style, legibility, and complexity, can also affect their readability and, in turn, visitors’ interest in reading them [121].
A central concern in this work is the accessibility of museum texts for audiences who rely on multimodal interpretation, such as audio descriptions and guides. Yet these forms of interpretation have also been shown to include features that pose communication barriers, such as technical language and complex sentence structures [122,123,124,125]. Readability formulas that rely primarily on surface-level text features—such as word or sentence length—may under-estimate the actual difficulty of a given passage, particularly if it contains dense jargon or culturally specific references. Therefore, reducing communication barriers requires attention to more than what can be quantified linguistically.
In other words, because exhibitions function as ‘interpretive communities’ [124] where a wide range of actors—artefacts, displays, technologies, curators, visitors, and histories—interact [117], museum interpretation should be understood as part of a broader assemblage [125] rather than as isolated text. More broadly, heritage writing that fails to account for multiple perspectives and entry points risks reinforcing an authorised heritage discourse (AHD) that ‘works to naturalise a range of assumptions about the nature and meaning of heritage’ [126].
To explore these concerns empirically, the following section outlines our methodological approach to fine-tuning GenAI for improving the accessibility and readability of heritage texts—specifically focusing on guidance produced by Historic England.

2.6. Conceptual and Empirical Gaps

Despite extensive work on AI in heritage and heritage communication, no existing research explores the use of fine-tuned GenAI to improve the readability and accessibility of heritage guidance documents. Most existing studies focus on AI’s role in heritage preservation, digitisation and data analysis. As such, critical questions about AI’s role in heritage communication remain largely unaddressed. In particular, there is a lack of systematic investigation into how GenAI might influence authorship and accessibility of heritage guidance and other forms of communication aimed at general audiences.
From a conceptual perspective, insufficient attention has been given to understanding how biases embedded in training data and model design can impact heritage communication in culturally sensitive contexts. While some studies examine ethical and societal risks of AI in general [44,55], few consider the specific implications for heritage communication.
Empirically, research on GenAI-assisted writing tends to focus on tasks with clear, objective outcomes such as proofreading, summarisation or multiple-choice problem solving [70,71]. The capabilities and limitations of GenAI for more open-ended or divergent tasks—such as drafting interpretive guidance, ensuring inclusivity, or adapting complex policy language for public audiences—have not been systematically evaluated. Consequently, there is little evidence to guide heritage professionals in integrating these tools responsibly.
Our study seeks to address these conceptual and empirical gaps by developing the pilot GenAI chatbot HAZEL, and exploring the use of this fine-tuned GPT 3.5-turbo model in drafting and revising heritage guidance.

3. Materials and Methods

This study aimed to explore how GenAI technologies could be adapted to support the production of accessible and readable heritage guidance. Our approach focused on assessing, fine-tuning, and evaluating a large language model (LLM) to assist Historic England in improving the clarity and inclusivity of its guidance publications. To do this, we developed HAZEL, a GenAI chatbot powered by an LLM that we fine-tuned to revise heritage guidance content for readability and accessibility standards in addition to Historic England’s editorial requirements.
This section outlines the methods used to design, build, and evaluate HAZEL. A schematic summary of the workflow is presented in Figure 1. We begin by identifying the accessibility and readability criteria that revised guidance should meet, developed in consultation with Historic England’s Content team. We then describe a series of experiments designed to assess the baseline capabilities of ChatGPT, followed by the process of preparing training data and fine-tuning the model. Finally, we explain the testing completed to evaluate HAZEL’s performance, which included both automated readability analysis and qualitative evaluation by a professional copyeditor familiar with Historic England’s house style.
Figure 1. Workflow for developing and evaluating HAZEL.

3.1. Measures of Accessibility to Capture

When designing our framework for fine-tuning and assessing the HE guidance corpus and evaluating the performance of HAZEL, we first outlined the global characteristics and values pertaining to accessibility that we would hope to see reflected in HE guidance. To do so, we adapted existing concepts and measures of accessibility discussed in the previous section along with HE’s accessibility guidelines. Table 1 (below) summarises the Historic England accessibility, stylistic, and brand guidelines that informed both the fine-tuning of HAZEL and our subsequent evaluation framework.
Table 1. Summary of Historic England (HE) brand, tone, and accessibility guidelines used in the fine-tuning and evaluation of HAZEL.
Through in-depth conversations with the HE Content team, whose main duties involve supporting the drafting, revision, and publication process for guidance and other web content, we agreed that revised guidance should be:
  • Readable, demonstrating a Flesch readability score of at least 50/100;
  • Concise, with sentences of 20 words or fewer;
  • Clear, written unambiguously in plain English with minimal use of technical terms;
  • Professional in tone, e.g., avoiding contractions;
  • Formatted and written for readers with various abilities; and
  • Consistent with Historic England’s house style guidelines and general brand values, such as inclusivity, diversity, and equality [127].
Using these criteria, we first assessed the existing capabilities of ChatGPT, in order to establish whether these would be sufficient, before fine-tuning the underlying model to create HAZEL. The quantitative elements of readability and conciseness were measured computationally, while the qualitative elements were assessed by the project team and a copyeditor with expertise in HE content and style.
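To illustrate how the quantitative criteria above can be checked automatically, the following sketch flags sentences that exceed the agreed 20-word limit. The sentence splitter and the example draft are illustrative assumptions rather than Historic England’s actual tooling.

```python
import re

# A minimal sketch of the automated conciseness check: flag sentences longer
# than 20 words. The naive sentence splitter is a simplifying assumption.
MAX_WORDS = 20

def long_sentences(text):
    """Return (sentence, word_count) pairs that exceed the 20-word limit."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(s, len(s.split())) for s in sentences if len(s.split()) > MAX_WORDS]

# Illustrative draft text only; not taken from published HE guidance.
draft = (
    "Gutters and drains need to be checked regularly. "
    "Where a building is listed, owners should seek advice from the local planning "
    "authority before carrying out any works that might affect its special "
    "architectural or historic interest, however minor those works may appear."
)

for sentence, count in long_sentences(draft):
    print(f"{count} words: {sentence}")
```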

3.2. Preliminary Assessment of ChatGPT

LLMs and GenAI systems have been described as black boxes [128,129] because many of these models were not developed transparently and characteristics such as their parameters and training data are unknown. Therefore, we aimed to outline the default behaviours of ChatGPT through a prompting experiment. At the time this research was conducted, ChatGPT was built on a fine-tuned GPT-3.5 model. Transcripts of interactions with ChatGPT can be found in the appendix to this paper. We posed questions to the chatbot intended to highlight both its capabilities and limitations in tasks related to the HE guidance writing process, which we summarise in the Results and Discussion sections of this paper. The findings of this assessment of ChatGPT informed our decision to proceed to fine-tuning the model, but also offer some more general insights into how GenAI systems function. We first aimed to establish whether we could reliably improve ChatGPT’s performance in writing tasks through prompt engineering, an iterative process of adjusting inputs to optimise chatbots’ outputs [130,131]. This stage of the research required us to evaluate ChatGPT’s performance across the full range of tasks that Historic England authors carry out when producing guidance. For clarity, we grouped these tasks into three categories—rewriting, generating, and extracting information—which are summarised in Table 2.
Table 2. Task Types Evaluated.
Here, we provided the criteria for revised HE guidance that we had agreed upon in consultation with the Content team as overall instructions for the chatbot to follow, and then proceeded to prompt ChatGPT to revise excerpts from published guidance according to these guidelines. We also asked ChatGPT a series of related questions designed to test its baseline performance in a variety of relevant tasks including its knowledge of British spelling and grammatical conventions, understanding of (and ability to produce) readable content, and writing and editing abilities. Additionally, we tested ChatGPT’s ability to estimate readability by asking it to calculate the Flesch-Kincaid and Flesch Reading Ease scores of two excerpts taken from guidance documents published by Historic England. We compared ChatGPT’s assigned scores to those of the popular readability software Readable.com and Microsoft Word’s built-in readability statistics. The Flesch-Kincaid and Flesch Reading Ease formulas quantitatively estimate the difficulty of texts based on factors such as the number of syllables per word and the number of words per sentence. The first excerpt was a passage of 245 words from the ‘3D Laser Scanning for Heritage’ guidance document, which consists of 119 pages, 15 of which include references, a glossary, and a list of resources [132]. The second excerpt was a 242-word passage from ‘Streets for All: Yorkshire’ guidance, which spans 7 pages and contains only 1529 words including front and back matter [133].

3.3. Data Preparation and Fine Tuning

Fine-tuning focused on improving the model’s performance in rewriting, generation and information-extraction tasks (as defined in Table 2), which constitute the majority of real-world guidance work. We also aimed to stabilise model behaviour across the tasks we had identified as inconsistent during the preliminary assessment. We began by creating the training and testing datasets and preparing for the fine-tuning process. When this phase of the research began, in late 2023, OpenAI’s GPT models–especially GPT-4–had demonstrated state-of-the-art performance in natural language processing [134]. However, at that time, we could only access the beta version of GPT-4, which was available to select stakeholders including the University of Edinburgh. We conducted a brief prompt engineering experiment to compare performance, ultimately deciding to use GPT 3.5-turbo. This decision was based on GPT-3.5-turbo’s greater stability and reduced hallucination tendencies compared to GPT-4, which, during beta testing, showed erratic behaviour such as unexpectedly switching languages (see the Results section of this paper).
We followed the process for fine-tuning delineated in OpenAI documentation [135] and resolved issues with support from the platform’s community forum. Here, developers working with OpenAI products share insights, raise questions, and exchange knowledge [136]. This resource was particularly helpful as we found limited information about fine-tuning OpenAI products elsewhere online, including on commonly consulted resources such as Stack Overflow, GitHub, or Reddit. The first step for fine-tuning was to create a dataset for training and testing. We first converted the HE guidance corpus (237 documents) from PDF to machine-readable text files using optical character recognition (OCR), completed with the Python package pytesseract (v0.3.13), a wrapper for Google’s Tesseract-OCR engine which extracts text from images [137]. The resulting text was minimally cleaned to remove whitespace and line breaks, then saved as 237 individual .txt files in the project directory.
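A rough sketch of this conversion step is shown below: each PDF page is rasterised, passed to pytesseract, and the output is minimally cleaned. The use of pdf2image for rasterisation and the directory names are assumptions made for illustration, not details drawn from the project’s own code.

```python
from pathlib import Path

import pytesseract                        # wrapper for the Tesseract-OCR engine
from pdf2image import convert_from_path   # assumed here for rasterising PDF pages

PDF_DIR = Path("guidance_pdfs")           # illustrative directory names
TXT_DIR = Path("corpus_txt")
TXT_DIR.mkdir(exist_ok=True)

for pdf in PDF_DIR.glob("*.pdf"):
    pages = convert_from_path(str(pdf), dpi=300)                     # one PIL image per page
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    cleaned = " ".join(text.split())                                 # collapse whitespace and line breaks
    (TXT_DIR / f"{pdf.stem}.txt").write_text(cleaned, encoding="utf-8")
```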
Though the HE corpus is openly available online, it is currently only accessible to the public as PDF images, meaning that its content is not scrapable or machine-readable. Therefore, to mitigate any risks related to the data privacy and security concerns that LLMs have introduced [138,139], we set up an API key under a zero-data retention (ZDR) agreement with OpenAI through EDINA, a centre for digital expertise and AI infrastructure within the University of Edinburgh’s Information Services. The ZDR agreement, which included a questionnaire about our use case, ensured that no input data would be stored, reused, or accessed by OpenAI during any stage of development or implementation.
We then randomly sampled the .txt corpus without replacement to extract 150 excerpts of approximately 250–300 words each. We saved these excerpts in a .csv file and, with assistance from volunteers at HE, manually revised each excerpt to reflect HAZEL’s desired response and correction. The resulting dataset of 150 excerpt-revision pairs was saved as a .csv file. Each of the samples in the training data contained only full sentences to allow for estimating readability using formulas, which rely on metrics such as the number of words per sentence.
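A minimal sketch of this sampling step is given below. The sentence splitting, retry logic, and column names are illustrative assumptions; in the study itself the sampled excerpts were subsequently revised by hand with assistance from HE volunteers.

```python
import csv
import random
import re
from pathlib import Path

random.seed(42)                     # illustrative seed for reproducibility
CORPUS_DIR = Path("corpus_txt")     # the OCR'd guidance documents as .txt files
N_EXCERPTS = 150
MIN_WORDS, MAX_WORDS = 250, 300

def sample_excerpt(text, attempts=20):
    """Pick a run of consecutive full sentences totalling roughly 250-300 words."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for _ in range(attempts):
        start = random.randrange(len(sentences))
        chunk, n_words = [], 0
        for sentence in sentences[start:]:
            chunk.append(sentence)
            n_words += len(sentence.split())
            if n_words >= MIN_WORDS:
                break
        if MIN_WORDS <= n_words <= MAX_WORDS:
            return " ".join(chunk)
    return None                     # skip documents where no suitable window is found

rows = []
files = list(CORPUS_DIR.glob("*.txt"))
for doc in random.sample(files, k=min(N_EXCERPTS, len(files))):   # documents drawn without replacement
    excerpt = sample_excerpt(doc.read_text(encoding="utf-8"))
    if excerpt:
        rows.append({"source": doc.name, "excerpt": excerpt, "revision": ""})

with open("excerpts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["source", "excerpt", "revision"])
    writer.writeheader()
    writer.writerows(rows)          # the 'revision' column is filled in manually afterwards
```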
Short excerpts of longer guidance documents were selected for several reasons. First, creating a dataset with standardised sample lengths allowed us to reliably measure readability using the four formulas. While readability formulas can be calculated on passages of any length, there is a precedent for drawing comparisons between texts of a similar length to ensure accuracy [95,140,141]. Second, randomly sampling guidance allowed us to create a training dataset representative of a full piece of guidance, which will often include introductory and concluding sections along with front and back matter. Third, we hoped to minimise costs associated with this research and were financially responsible for tokens input and output during all stages of this process.
The training and testing data were then converted from comma-separated values (CSV) to JSON Lines for fine-tuning. Each sample was structured as follows:
{"messages": [{"role": "system", "content": "message"}, {"role": "user", "content": "message"}, {"role": "assistant", "content": "message"}]}
Here, the ‘system’ message defines the model’s behaviour, often by assigning it a persona, an identity, or a name; the ‘user’ message represents the input prompt; and the ‘assistant’ message corresponds to the model’s output. In the training data, we constructed the ‘assistant’ message to represent the ideal response to the input prompt for each sample. The training datasets were created to reflect the types of language and changes HAZEL needed to make, which we had agreed upon with the project team and are listed in Section 3.1 of this paper. Our system message gave HAZEL its identity as ‘an AI assistant designed to support authors of heritage guidance with writing clear, accessible content for a general audience in the UK’. Each user message included a prompt requesting assistance with a writing task (e.g., revising to the active voice or simplifying language), while the assistant message reflected the ideal output of improved content.
The process for fine-tuning required initialising a new OpenAI session with our API key, importing the training data, and creating a new fine-tuning job by selecting a model and setting optional parameters. We completed these steps with v1.57.0 of the Python OpenAI API library [142]. Our training dataset contained 80% of the randomly sampled data, structured as messages in JSON Lines format. The parameters we adjusted included the number of training epochs, or the number of passes through the training data during the training session; the temperature, which sets the randomness of the model’s output; and the batch size, which sets the number of samples the model examines at once.
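The sketch below illustrates how excerpt-revision pairs in a CSV file might be converted into this chat-format JSON Lines structure and submitted as a fine-tuning job with the v1.x OpenAI Python library. The file names, user prompt template, and hyperparameter values are assumptions for illustration; note that temperature is applied when querying the resulting model rather than as a parameter of the fine-tuning job itself.

```python
import csv
import json

from openai import OpenAI   # v1.x Python library

SYSTEM_MSG = (
    "You are HAZEL, an AI assistant designed to support authors of heritage guidance "
    "with writing clear, accessible content for a general audience in the UK."
)

# Convert excerpt-revision pairs into chat-format JSON Lines for fine-tuning.
# Column names and file paths are illustrative assumptions.
with open("train.csv", newline="", encoding="utf-8") as src, \
        open("train.jsonl", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        sample = {
            "messages": [
                {"role": "system", "content": SYSTEM_MSG},
                {"role": "user",
                 "content": "Please revise this excerpt for readability and accessibility:\n"
                            + row["excerpt"]},
                {"role": "assistant", "content": row["revision"]},
            ]
        }
        dst.write(json.dumps(sample) + "\n")

client = OpenAI()   # reads OPENAI_API_KEY from the environment

# Upload the training file and create the fine-tuning job.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 3, "batch_size": 4},   # illustrative values only
)
print(job.id)
```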

3.4. Quantitative Evaluation

Upon completion of fine-tuning, we tested the LLM’s performance across the same three task types—rewriting, generating, and information-extraction—to determine whether fine-tuning improved consistency and alignment with Historic England’s writing standards. Our testing dataset contained 20% of the samples (30 in total) that had been reserved from the original dataset of 150 samples, following the training-testing split recommended in the OpenAI developer forum [136]. We then examined the results quantitatively, through automated and manual analysis. For automated quantitative testing, we used the Python packages Numpy v2.1.3 and Readability [143,144] to apply four readability formulas to each sample after HAZEL’s revision. These formulas included Flesch-Kincaid, Flesch Reading Ease, Automated Readability Index (ARI), and the Dale-Chall formula. We also calculated the mean, median, and standard deviation of these results aggregated by formula. To determine a baseline for contextualising the scores as a measure of HAZEL’s performance, we calculated readability and the mean, median, and standard deviation for 150 additional excerpts randomly sampled from the HE guidance corpus. The results of these calculations can be found in Table 3 and Table 4. Each readability formula examines quantifiable features of text to estimate reading difficulty. These metrics include the average sentence length (ASL), the average number of syllables per word (ASW), the average number of characters per word (ACW), the average number of words per sentence (AWS), and the percentage of words that are classified as ‘difficult’ according to an established lexicon (PDW). More specifically, these formulas are calculated as follows:
Flesch–Kincaid Grade Level
FKGL = (0.39 × ASL) + (11.8 × ASW) − 15.59
where the result corresponds to the reading level required to understand the text in terms of years of American schooling [98].
Flesch Reading Ease
FRE = 206.835 − (1.015 × ASL) − (84.6 × ASW)
where the result is a number from 0–100. The higher the number, the more readable the text; for instance, texts landing in the 50–60 range are considered ‘standard’ in difficulty, while those ranging from 0–30 are classed as ‘scientific’ (Ibid).
Automated Readability Index (ARI)
ARI = (4.71 × ACW) + (0.5 × AWS) − 21.43
where the result corresponds to the reading level required to understand the text in terms of years of American schooling [100].
Dale–Chall Readability Formula
Score = (0.1579 × PDW) + (0.0496 × ASL) + 3.6365 (if PDW > 5%)
where PDW stands for the ‘percentage of difficult words’, based on the presence of keywords in a lexicon of 3000 words deemed easily understandable for 80% of American fourth-grade students. In cases where the PDW is greater than 5%, the constant 3.6365 is added. The higher the total score, the more difficult the text is to read, with a score of 9.0–9.9 corresponding to a university reading level [99].
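For reference, the four formulas can be implemented directly from the definitions above, as in the following sketch. The study itself used the Numpy and Readability Python packages; the syllable counter here is a crude heuristic and the Dale-Chall easy-word lexicon is left empty, so the scores it produces are indicative only.

```python
import re

DALE_CHALL_EASY = set()   # in practice, load the 3000-word Dale-Chall lexicon here

def count_syllables(word):
    """Rough syllable count based on vowel groups."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_stats(text):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                             # average sentence length
    asw = sum(count_syllables(w) for w in words) / len(words)     # syllables per word
    acw = sum(len(w) for w in words) / len(words)                 # characters per word
    pdw = 100 * sum(w.lower() not in DALE_CHALL_EASY for w in words) / len(words)
    return asl, asw, acw, pdw

def readability_scores(text):
    asl, asw, acw, pdw = text_stats(text)
    return {
        "Flesch-Kincaid": 0.39 * asl + 11.8 * asw - 15.59,
        "Flesch Reading Ease": 206.835 - 1.015 * asl - 84.6 * asw,
        "ARI": 4.71 * acw + 0.5 * asl - 21.43,
        "Dale-Chall": 0.1579 * pdw + 0.0496 * asl + (3.6365 if pdw > 5 else 0),
    }

print(readability_scores("Gutters and drains need to be checked regularly."))
```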

3.5. Qualitative Evaluation

In addition to assessing HAZEL in an automated way, we sought the assistance of a copyeditor with expertise in HE’s guidance literature who was therefore able to manually assess the samples qualitatively. The copyeditor had no prior experience with GenAI technologies, meaning that their evaluation would not be influenced by pre-existing ideas. We informed the copyeditor that some of the texts were generated—in whole or in part—by a GenAI system, but did not specify which. All samples included in the copyediting evaluation were written or revised by either HAZEL or ChatGPT to allow for a comparison between the base model and the fine-tuned LLM.
The copyeditor was asked to score each sample according to a standardised rubric, which rated each sample on a scale of 1 to 5 against the criteria agreed with HE staff: readability, concision, clarity, professional tone, formatting and writing for accessibility, and consistency with HE house style guidelines and brand values (see Section 3.1). Scores were defined as follows:
1—Poor: The text fails to meet the criterion; major revisions are required.
2—Fair: The text partially meets the criterion but contains notable errors or inconsistencies.
3—Good: The text meets the criterion but could be improved in minor ways.
4—Very good: The text meets the criterion with only very minor issues.
5—Excellent: The text fully satisfies the criterion and requires no further improvement.
The samples were divided into categories, each reflecting a type of guidance authoring task for which an author might seek assistance from HAZEL, such as copyediting, completing a targeted revision, or comparing two texts. The tasks comprised evaluating text for overall suitability for HE guidance, comparing two revised versions of a text, assessing the quality of an unspecified revision of a text, and examining the quality of a targeted revision (e.g., ‘use plain English and avoid jargon’ or ‘use short sentences’). The standard deviation, median, and mean were calculated for the rubric scores, aggregated by chatbot. We also calculated the readability scores for each set of samples. The results of these calculations can be found in Table 3 and Table 4.
Table 3. Comparison of readability formulas applied to excerpts from Historic England guidance.
Table 4. Mean, median, and standard deviation for four readability formulas calculated for a random sample of excerpts from the Historic England guidance corpus, texts edited with ChatGPT, and texts edited with HAZEL.
Together, this quantitative-qualitative testing protocol was designed to assess how well HAZEL met the accessibility and readability guidelines co-developed with Historic England’s Content team. By comparing HAZEL’s outputs to those of the base ChatGPT model, we aimed to evaluate the extent to which fine-tuning improved the readability and overall suitability of the guidance texts. The following section presents the results of this evaluation. We then discuss their implications for future use of GenAI in heritage communication.

4. Results

4.1. ‘Interview’ with ChatGPT

To evaluate whether a general-purpose LLM like ChatGPT could meet Historic England’s writing standards without further adaptation, we conducted a qualitative assessment of its outputs in response to a series of inputs. Consistent with existing research, we found through qualitative review of the ‘transcripts’ of our interactions with ChatGPT that the chatbot generated false references and misleading content. For instance, when prompted to generate a bibliography on heritage preservation and wildlife, the chatbot provided false titles of articles, journals, and authors. The chatbot also displayed an awareness of works under copyright protection, such as academic monographs, which raises questions about whether its training data includes copyrighted material [145].
Furthermore, we identified several areas in which ChatGPT generated text that was inconsistent with the guidelines to which it was instructed to adhere. Surprisingly, in light of existing research suggesting LLM’s suitability for convergent thinking tasks with a discrete ‘right’ answer [72], ChatGPT did not perform accurately when prompted to make mechanical changes in text, such as revising to use the active voice or British spelling. Although ChatGPT did correct some well-known spelling and syntactical differences between American and British English (e.g., ‘color’ versus ‘colour’), it seemed to overlook others, as shown below:
Input: Can you rephrase this in the active voice using British English spelling and grammar? The lazy dog was jumped over by the quick brown fox who was carrying a grocery bag full of zucchini and eggplant. The fox was yelled at by the orange-colored sun, “Have you remembered to put your sunscreen on? It’s nearly noon and it’s time for me to burn bright!”
Output: The quick brown fox, carrying a grocery bag full of courgettes and aubergines, jumped over the lazy dog. The orange-coloured sun yelled at the fox, “Have you remembered to put on your sunscreen? It’s nearly noon, and I’m about to shine brightly!”
In this exchange, ChatGPT retains the Americanisms ‘grocery bag’, ‘noon’, and ‘sunscreen’ rather than suggesting ‘shopping bag’, ‘midday’, and ‘suncream’, the terms more commonly used in British English. Yet it does recognise the lexical borrowing differences, correcting ‘zucchini’ and ‘eggplant’ to ‘courgettes’ and ‘aubergines’.
On several occasions, we noticed a failure in prompt chaining leading to what we might call conversational amnesia–in other words, instances where ChatGPT ‘forgot’ the context of a conversation. For example, the following exchange took place after ChatGPT had completed a prompt requesting a revision of text according to a list of criteria:
Input: Gutters and drains need to be checked regularly.
Output: I agree! Regular maintenance of gutters and drains is important to prevent blockages and water damage.
When working with the base GPT-3.5 turbo model, we observed occasional inconsistencies and errors in its outputs. One notable issue was the model’s limited memory across conversational turns, or breakdowns in chain-of-thought between prompts. In such cases, the model failed to retain or respond appropriately to information provided earlier in the exchange. For example, in one chat session, we instructed the model to revise a series of sentences for improved clarity and conciseness. After completing several of these revisions, we tested its contextual awareness by entering a prompt containing only a sentence:
Input: Progress remains broadly insufficient to ensure that the buildings sector reaches zero emissions by 2050.
Output: Thank you!
Input: Sorry, could you revise the sentence?
Output: Progress remains insufficient overall to ensure that the buildings sector reaches zero emissions by 2050.
In addition, ChatGPT did not accurately apply common readability formulas, which is consistent with the findings of previous research [146]. As seen in Table 3, ChatGPT underestimated the difficulty of both texts. Although this sample size was small and there are clear scoring differences between Microsoft Word and the readability software, instructing ChatGPT to assess readability once again resulted in variable responses. This suggests that ChatGPT lacks consistent measurement capabilities for these tests. Initially, when asked to describe and then apply the Flesch-Kincaid and Flesch Reading Ease formulas to a given text, ChatGPT provided lower scores. However, when instructed to simply calculate these scores, the results were higher, indicating the text was less readable. Although the limited sample size prohibits drawing any concrete conclusions about the effects of prompt phrasing on the accuracy of LLM-performed calculations, this variability suggests that readability assessments conducted with GenAI or an LLM should be confirmed manually.
Based on ChatGPT’s inconsistency in executing prompts to correct or generate text and to calculate readability, we determined a need for fine-tuning.

4.2. Model Selection and Fine Tuning

The prompt engineering experiment raised concerns about so-called ‘hallucinations’—for example, the model switched inexplicably to German—and we therefore completed fine-tuning with the most stable recent release of a GPT model, which was GPT 3.5-turbo. However, we observed that the base model of GPT-3.5 turbo tended to use the passive voice. This remained the case even when the model was explicitly prompted to use the active voice:
Input: How would you revise this sentence? ‘Research on historic department stores was undertaken in 2023.’
GPT 3.5-turbo: In 2023, research was undertaken on historic department stores.
Input: Please rephrase this sentence in the active voice: ‘There were more than five thousand applications for listed building consent last year.’
GPT 3.5-turbo: More than five thousand applications for listed building consent were received last year.
Despite following notes on fine tuning available in the OpenAI documentation, we faced challenges during the fine-tuning process and found it necessary to complete several iterations, adjusting the training data and temperature parameter each time, before GPT-3.5 turbo began to behave as HAZEL. Our initial observations from working with HAZEL, such as how calling HAZEL by name appears to improve the quality and content of the outputs, are summarised in the project’s GitHub (commit 453eb45, 7 September 2025) repository [147]. We hope that this guide not only informs those working with HAZEL during beta testing, but also serves as a resource for other researchers and developers seeking to adapt OpenAI models to their own aims.
Figure 2 below shows a comparison in performance between base GPT 3.5-turbo and the fine-tuned HAZEL model.
Figure 2. Excerpts from Historic England guidance and revisions by HAZEL and base GPT.

4.3. Readability Scores

Table 4 summarises the readability scores calculated for three sets of samples: (1) a random sample of unaltered excerpts from the Historic England guidance corpus, (2) guidance edited or generated with ChatGPT, and (3) guidance edited or generated with HAZEL. For each sample, we applied four frequently used readability formulas: Flesch-Kincaid, Flesch Reading Ease, Automated Readability Index (ARI), and the Dale-Chall Readability Formula. We also calculated the mean, median, and standard deviation for each readability score, aggregated by sample set.
Both ChatGPT and HAZEL slightly improved the average readability scores across all four formulas. HAZEL’s revisions were scored, on average, as slightly more readable than ChatGPT’s, although still ‘difficult’, meaning more revisions would be required for a general audience. While the magnitude of change across all four formulas is modest—generally less than two grade levels—the reduction in standard deviation for HAZEL’s outputs suggests greater consistency in its revisions. This consistency is particularly valuable for public sector organisations like Historic England, where predictability in tone and clarity across documents supports user trust and accessibility.

4.4. Copyediting

The qualitative evaluation was based on a single copyeditor assessing a small (n = 25) sample. This small sample size facilitated a consistent application of the rubric and detailed insights into model performance; however, there are clear limitations regarding the statistical generalisability of these findings. Consequently, the results should be interpreted as indicative rather than conclusive, providing a preliminary assessment of HAZEL and ChatGPT’s comparative performance.
The copyeditor evaluated 25 text samples, each originally sourced from Historic England guidance and later revised by either ChatGPT or HAZEL. These samples were assessed against a rubric. The results, summarised in Table 5, show that ChatGPT and HAZEL perform similarly overall, with only slight differences.
Table 5. Mean, median, and standard deviation for the rubric scores assigned to each sample by the copyeditor.
On average, compared to ChatGPT’s revisions, the copyeditor scored the HAZEL-revised samples as slightly more readable and accessible. Although mean scores differed by only small margins (e.g., ±0.2), the lower standard deviation in HAZEL’s outputs again indicates more uniform performance. This suggests that, while ChatGPT occasionally produces strong revisions, HAZEL is more reliable in meeting the baseline expectations set by Historic England’s style and accessibility criteria.
In addition to the rubric scores, qualitative comparison of the revised samples provided insight into the models’ differing behaviours. Representative excerpts (see Figure 2 and Figure 3) illustrate that HAZEL generally adheres more closely to the targeted criteria for Historic England guidance, particularly in general revision tasks. However, when instructed to revise to improve readability, HAZEL removed references from the original text (see Figure 2). Additionally, HAZEL at times removed complex clauses, which resulted in the deletion of nuanced or detailed information (see Figure 3).
Figure 3. Excerpts from Historic England guidance and revisions by HAZEL, base GPT and copyeditor.

5. Discussion

Our evaluation of ChatGPT and the fine-tuned HAZEL model illustrates both the potential benefits and the limitations of applying generative AI to heritage guidance tasks. Given the small sample size and the use of a single expert reviewer, findings are exploratory and primarily intended to assess the feasibility of fine-tuning LLMs to support the development of readable and accessible heritage communication.
In line with prior research on LLMs, we observed that ChatGPT frequently generated ‘ghost bibliographic citations’, or references to papers, journals, and monographs that do not exist [148]. While our prompts focused on heritage topics, these observations are consistent with documented LLM behaviour across disciplines, including economics, medicine and geography [69,149,150].
This observation raises particular concerns for the writing and dissemination of heritage guidance texts. For example, HE guidance documents typically include comprehensive bibliographies to direct readers to additional authoritative sources on a topic. The ‘Adapting Traditional Farm Buildings’ guidance document [151], for instance, lists 45 references in its appendix in addition to in-text references throughout the piece. In this context, ghost citations could not only frustrate readers but also undermine the public’s trust in HE guidance and the reputation of the organisation overall.
Like the citation inaccuracies, ChatGPT’s inconsistent adherence to language preferences in our study is consistent with previously documented trends. Prior research attributes LLMs’ English-language bias to the predominance of English texts on the internet [152]; our observations further suggest a specific bias towards American English embedded within LLM training data. For our purposes, this presented a significant challenge, as maintaining British English standards is essential for accessibility and clarity for HE’s primarily England-based audience. Similar challenges emerged in the model’s mechanical text edits and readability formula calculations (see Section 4.1). These findings point to inherent limitations in LLM output reliability, likely arising from both the training data and the underlying text generation architecture.
HAZEL improved readability metrics modestly compared to ChatGPT, with lower variability in scores. However, most texts remained in the ‘difficult’ to ‘very difficult’ range, reflecting the inherent complexity of technical heritage guidance. Similarly, qualitative copyediting scores indicated that neither model consistently achieved high-quality outputs. These results suggest that fine-tuning can enhance some aspects of text accessibility, but cannot by itself guarantee consistently high-quality revisions.
However, the fine-tuning process was not without challenges, which align with known difficulties of adapting proprietary LLMs for specific purposes [153]. One such difficulty is that biases in the model’s training data can embed certain linguistic patterns [44]—for example, the use of the passive voice—that are difficult to correct through prompt engineering alone. Furthermore, fine-tuning GPT-3.5-turbo to behave as HAZEL required several attempts because, at the time of development, limited documentation on fine-tuning OpenAI models was available beyond the community forum [136]. Throughout this process, we adjusted the training data, system message, and settings such as temperature to guide the model toward outputs consistent with our criteria for accessibility and readability, in addition to HE’s tone of voice and style.
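To illustrate the general shape of this workflow, the sketch below uses the openai Python client (v1.x) [142] to upload training examples and start a fine-tuning job; the file name, training-data contents and model choice are illustrative assumptions rather than HAZEL’s exact configuration.

```python
# Minimal sketch of an OpenAI fine-tuning workflow with the openai Python client (v1.x).
# The file name, training-data contents and model choice are illustrative only,
# not HAZEL's exact configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Upload JSONL training data: one {"messages": [...]} conversation per line,
#    pairing original guidance excerpts with human-revised target versions.
training_file = client.files.create(
    file=open("hazel_training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Launch a fine-tuning job on the base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```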
During testing, we observed that prompt engineering was still an essential step in working with HAZEL. For example, we found that explicitly referring to the model by name—e.g., “HAZEL, can you help me with a piece of guidance I am working on?”—often improved the clarity, relevance, and quality of its outputs. As such, we conceptualise prompt engineering and fine-tuning as a symbiotic process, with the results of prompt engineering informing our approach to fine-tuning (and vice versa). Though more research would be needed to generalise this observation to other contexts and to other LLMs, researchers working to fine-tune GPT models for writing tasks might find that their results benefit from a similar iterative approach. This is perhaps especially the case when the desired outputs of the model conflict with its ‘default’ behaviour, such as when generated text is desired in an English-language dialect other than Standard American English.
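As an illustration of this kind of prompting, the sketch below sends a named request to a fine-tuned model via the openai client; the model identifier, system message and temperature value are placeholders rather than HAZEL’s actual settings.

```python
# Illustrative prompt to a fine-tuned model using the openai Python client (v1.x).
# The model identifier, system message and temperature are placeholders, not
# HAZEL's actual configuration.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo-0125:example-org::hazel",  # hypothetical fine-tuned model ID
    temperature=0.3,  # illustrative value; lower values favour more consistent output
    messages=[
        {
            "role": "system",
            "content": "You are HAZEL, an assistant that revises Historic England "
                       "guidance into plain, accessible British English.",
        },
        {
            "role": "user",
            "content": "HAZEL, can you help me with a piece of guidance I am working on?",
        },
    ],
)
print(response.choices[0].message.content)
```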
Our quantitative results (Table 2) demonstrate that fine-tuning LLMs can improve how closely generated text meets desired criteria, leading to modest improvements in readability for a general audience based on four commonly used readability formulas. According to these metrics, HE guidance materials revised by HAZEL were, on average, slightly more readable than those revised by ChatGPT. HAZEL-revised texts also showed a lower standard deviation in readability scores compared to ChatGPT revisions, suggesting that HAZEL produced more consistent readability. Revising written text into plain language is a commonly proposed strategy to enhance accessibility [89,154]. However, some guidance documents published by HE are highly technical, making it challenging to simplify them without losing meaning. Although both HAZEL and ChatGPT improved readability scores compared to the original corpus, most texts remained classified as ‘difficult’ to ‘very difficult’ to read. One limitation of our study is that the HE guidance corpus was sampled randomly, which may have resulted in a proportional over-representation of highly specialised, technical guidance pieces that are inherently challenging to simplify without omitting critical detail.
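For readers unfamiliar with formula-based readability scoring, the sketch below shows how a passage can be scored against four common formulas using the widely available textstat package; it is a generic illustration, not the exact evaluation pipeline used in this study.

```python
# Generic readability scoring with the textstat package (pip install textstat).
# This sketch illustrates formula-based scoring in principle; it is not the
# exact evaluation pipeline used in this study.
import textstat

passage = (
    "The significance of a heritage asset derives from its evidential, "
    "historical, aesthetic and communal value."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(passage),
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade(passage),
    "Dale-Chall": textstat.dale_chall_readability_score(passage),
    "Automated Readability Index": textstat.automated_readability_index(passage),
}

for name, value in scores.items():
    print(f"{name}: {value:.1f}")
```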
Potential solutions to this tension between technical precision and readability include reshaping glossary sections in PDF documents into clickable embedded definitions, enabling readers to access explanations without interrupting their reading flow. Another approach is providing plain language summaries that offer a concise overview, with expandable sections for deeper exploration—a feature already partially implemented on the HE website for some heritage guidance. However, interactive web content introduces new challenges [155,156], such as accessibility issues for users of assistive technologies or those reading on mobile or tablet devices.
To address these complexities, future development could explore alternative LLMs or approaches such as retrieval-augmented generation, which integrates external knowledge sources to enhance both readability and factual accuracy. A beta testing phase integrating qualitative user feedback into the design process would better ensure that HAZEL, and similar technologies, effectively support authors of heritage guidance throughout the writing process.
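A minimal sketch of the retrieval-augmented pattern mentioned above is given below: passages from a trusted corpus are retrieved and prepended to the prompt so that the model revises against sourced material. The example corpus, keyword-overlap retrieval and prompt wording are illustrative assumptions; a production system would retrieve from authoritative HE guidance, typically with embedding-based search.

```python
# Minimal sketch of retrieval-augmented generation (RAG) for guidance revision.
# The corpus, keyword-overlap retrieval and prompt wording are illustrative
# assumptions, not a production design.

corpus = [
    "Listed building consent is required before making changes that affect "
    "a building's special architectural or historic interest.",
    "Traditional lime mortars allow moisture to evaporate from solid walls.",
    "Article 4 directions can remove certain permitted development rights.",
]

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and return the top k."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(draft: str) -> str:
    """Prepend retrieved source passages so the model revises against them."""
    sources = "\n".join(f"- {passage}" for passage in retrieve(draft, corpus))
    return (
        "Revise the draft below into plain English, citing only the sources provided.\n\n"
        f"Sources:\n{sources}\n\nDraft:\n{draft}"
    )

print(build_prompt("Owners must obtain consent before altering a listed building."))
```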
Finally, our findings showed a trade-off between readability and clarity. Although ChatGPT performed less well on readability on average, it produced slightly clearer revisions than HAZEL without removing references, citations or other important information. HAZEL’s tendency to delete such material is an important concern to address in future research. Removing parentheticals and complex clauses may indeed increase readability scores, but we observed that this often came at the expense of necessary or nuanced information.
For the purposes of Historic England guidance, we observed that HAZEL’s revisions tended to be more conservative and consistent. In particular, both models scored lowest on ‘Overall Suitability’, with HAZEL’s ‘Diversity and Inclusion’ score matching this low rating while it outperformed ChatGPT on ‘Readability & Accessibility’, the primary target of our fine-tuning. This indicates that neither system consistently meets HE’s expectations for guidance literature: the ability to generate what we might call fluent, superficially ‘good’ writing is not equivalent to producing high-quality guidance. The recurring, similar flaws we observed also suggest a lack of variety in the style and word choice of the revised texts. Though this study is too small to support broader conclusions, the copyeditor’s observations are consistent with reports of ChatGPT’s stylistic homogeneity in creative writing tasks [74]. If future work confirms that ChatGPT exhibits a particular writing style, at least between model updates, then care will be needed where GenAI is deployed in heritage settings in order to retain the diversity of voices and perspectives in published materials.

Limitations

Our study was subject to several limitations. First, copyediting feedback was sourced from a single expert. This is because it was essential for the copyeditors to be experts in HE guidance with no prior experience using GenAI tools, which significantly limited the number of professionals we could recruit. Second, time and funding constraints limited the study’s scope. Because HAZEL was built on GPT-3.5-turbo—an OpenAI product requiring a paid subscription—significant testing or public release of HAZEL was not feasible. As a result, we were unable to conduct formal beta testing or observe its deployment in live professional settings. Evaluating HAZEL’s effectiveness in real-world workflows, such as outlining guidance documents or producing summaries, remains a key area for future research. Third, we did not directly examine potential affective dimensions of (Gen)AI, which are well established as a critical area of study in the field of human–computer interaction [157,158]. Dr Susan Hazan, a professional in the museum sector, articulates a profound unease about GenAI, recounting her experience when asking ChatGPT to emulate her writing style:
“I felt that something had got into my mind and was scraping my brain… this ‘thing’ was crawling around my published chapters and papers and other corners of the Internet and presuming to come out with the conclusions I would come to myself based on what it learned about me. My ideas had become a collage of presumptions that, when stringed together spewed out a convincing doppelganger of me.” [159]
This response is reminiscent of the uncanny valley [160], or the sense of unease when encountering machines that seem almost-but-not-quite human, and highlights negative, subjective experiences of GenAI that remain underexplored in heritage settings. Such fears may stem from a lack of familiarity with AI tools or prior negative encounters with digital technologies, but they coexist with tangible ethical concerns including bias and authorship. In the context of this study, we acknowledge that negative affect toward GenAI could act as a barrier to its adoption in the heritage sector, though additional research is needed to explore this further.
Fourth, the study did not evaluate the integration of GenAI tools into long-term collaborative heritage workflows. Future research should explore HAZEL’s impact on productivity, trust, and cognitive load when embedded into existing authoring processes. GenAI also intersects in important ways with FAIR (Findable, Accessible, Interoperable, and Reusable) data principles and the broader values of open science. While AI-generated outputs may enhance documentation and dissemination, their training data and structure remain a ‘black box’. As recent lawsuits have shown [161,162], at least some models are trained on massive web-scraped datasets without clear licensing or copyright clearance. There is also a risk that GenAI could perpetuate existing cultural and linguistic biases embedded in the training data, undermining efforts to democratise heritage texts.
Fifth, the use of proprietary or subscription-only models, such as GPT models, presents barriers to equitable participation. Significant ethical concerns aside, many cultural heritage organisations—particularly smaller and under-resourced institutions—lack the financial resources to sustain paid access to commercial AI platforms. The ethical use of GenAI in heritage must therefore prioritise open-access, freely available models. Professionals working with GenAI would also benefit from onboarding and training modules to help them use these tools responsibly while mitigating potential risks such as information fabrication.
Finally, we note that our quantitative and copyediting assessments are based on a small number of text samples and a single expert reviewer. As such, these results should be interpreted as exploratory rather than statistically generalisable. Future studies involving larger corpora and multiple independent reviewers would be needed to robustly evaluate model performance.

6. Conclusions

This study provides a timely and critical assessment of generative AI’s (GenAI) potential in the context of heritage writing, focusing on applications within institutions such as Historic England (HE). Our findings reveal both encouraging opportunities and significant risks involved in integrating GenAI tools—both general-purpose chatbots such as ChatGPT and fine-tuned chatbots such as HAZEL—into heritage writing processes. Fine-tuned models like HAZEL can modestly improve the readability and accessibility of technical heritage guidance, particularly in surface-level copyediting and standardised writing tasks, by simplifying language and supporting consistency of tone and structure. These capabilities hold promise for expanding participation in heritage conservation and aligning outputs with the goals of cultural democracy.
However, the results are not generalisable across all heritage texts or contexts. Neither ChatGPT nor HAZEL consistently achieved high-quality outputs in all dimensions of guidance writing, highlighting the continued importance of human expertise. Additionally, GenAI systems, regardless of training or tuning, lack the lived experience, contextual awareness, professional expertise, and cultural sensitivity required for creating many heritage guidance texts. Therefore, these systems might be best limited to convergent thinking tasks in the writing process, such as proofreading and other tasks with discrete options. In divergent tasks requiring creative problem-solving and subjective expertise, GenAI writing assistants introduce risks such as stylistic flattening, fabrication of information, generic phrasing, and at times, brand or value misalignment.
Our study additionally highlights broader theoretical and ethical considerations. For instance, using GenAI in heritage writing could reinforce exclusionary narratives in heritage or homogenise the style and tone of heritage communication. Similarly, biases embedded in LLMs’ training data could lead them to generate text that limits accessibility for certain audiences. It is also important to consider the affective risks of GenAI adoption in heritage. Users may experience unease, anxiety or a diminished sense of autonomy over their writing processes. Finally, many GenAI systems require significant financial and computational resources, which could exacerbate existing inequalities between institutions in the heritage sector.
Building on the study findings, several directions for future work emerge. First, HAZEL should be evaluated in the context of the heritage writing process, such as collaborative drafting or iterative editing of real guidance texts. Observing user experiences, barriers, and outcomes will offer a fuller picture of HAZEL’s performance and suitability. Second, the emotional and cognitive dimensions of GenAI use in heritage—such as authorship anxiety, trust, and perceptions of automation—could be examined in more depth. These affective factors may be as consequential as technical accuracy in shaping adoption and responsible use.
Third, fine-tuning methods should be expanded and other methods explored. Retrieval-augmented generation (RAG) and other hybrid techniques may improve adherence to HE style and tone of voice guidelines while reducing the risk of errors or value misalignment. Co-designing these models with HE professionals and accessibility specialists would further enhance their effectiveness and ethical fit. Finally, broader ethical, legal, and policy frameworks must be developed to guide GenAI integration in the heritage sector.
In conclusion, fine-tuned GenAI tools like HAZEL offer promise in supporting efforts to make heritage texts more readable and accessible. However, due to significant potential risks, any integration of GenAI in the heritage sector will require caution.

Author Contributions

C.B. directed the study as principal investigator. J.W. carried out the technical development and design of the qualitative and quantitative reviews and was lead author of this manuscript. L.B., E.L. and V.S. contributed expertise from Historic England, assisted with the design of the study, and liaised with the Content Team; they also presented project progress to the Digital Strategy Board. All authors contributed to the writing and revision of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by Historic England’s Digital Strategy Board Innovation Fund. The funder had no role in the design of the study, data collection, analysis or interpretation of the data, or in the writing of this manuscript.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the University of Edinburgh’s School of History, Classics & Archaeology (14 June 2024).

Data Availability Statement

The full transcript of the ChatGPT interview conducted during this study is deposited in the project’s GitHub repository [147]. The Historic England corpus referenced in this study is available open-access in PDF format on Historic England’s website [80]. Due to the exploratory nature of the project and limited funding, the HAZEL prototype is not available. However, the underlying code for this study, including model fine-tuning and development, is available in the project’s GitHub [147].

Acknowledgments

The authors would like to thank James Reid (EDINA, University of Edinburgh) for establishing the zero data retention agreement with OpenAI and for his support in enabling access to GPT models through the University’s digital research infrastructure. We gratefully acknowledge Daniel Pett (Historic England) for bringing the project team together and for his assistance with the project design and funding application. We also thank David Jones for his copyediting expertise during the qualitative portion of the study.

Conflicts of Interest

All authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HE: Historic England
GenAI: Generative AI
LLM: Large language model

References

  1. Lee, E.; Williams, J.; Campbell, G. English Heritage Science Strategy 2012–2015; English Heritage: Swindon, UK, 2013; Available online: https://historicengland.org.uk/images-books/publications/ehss/ehss/ (accessed on 1 September 2025).
  2. Dahmen, J.; Kayaalp, M.E.; Ollivier, M.; Pareek, A.; Hirschmann, M.T.; Karlsson, J.; Winkler, P.W. Artificial intelligence bot ChatGPT in medical research: The potential game changer as a double-edged sword. Knee Surg. Sport. Traumatol. Arthrosc. 2023, 31, 1187–1189. [Google Scholar] [CrossRef]
  3. Ciriello, R.F. An AI Companion Chatbot Is Inciting Self-Harm, Sexual Violence and Terror Attacks. The Conversation, 2025. Available online: https://theconversation.com/an-ai-companion-chatbot-is-inciting-self-harm-sexual-violence-and-terror-attacks-252625 (accessed on 5 April 2025).
  4. Forlini, E.; Circelli, R. The Best AI Chatbots for 2025; PCMag UK: London, UK, 2025; Available online: https://uk.pcmag.com/ai/148205/the-best-ai-chatbots-for-2023 (accessed on 10 June 2025).
  5. Ng, K.; Drenon, B.; Gerken, T.; Cieslak, M. Deepseek: The Chinese AI App That Has the World Talking. BBC News, 5 February 2025. Available online: https://www.bbc.co.uk/news/articles/c5yv5976z9po (accessed on 8 August 2025).
  6. Maria, L.; Furnari, G.; Quattrocchi, C.; Pasquale, S.; Allegra, D.; Gueli, A.M.; Stanco, F.; Tanasi, D. Challenges in Automatic Munsell Color Profiling for Cultural Heritage. Pattern Recognit. Lett. 2020, 131, 135–141. [Google Scholar] [CrossRef]
  7. Cilia, N.D.; De Stefano, C.; Fontanella, F.; Marrocco, C.; Molinara, M.; Scotto di Freca, A. An experimental comparison between deep learning and classical machine learning approaches for writer identification in medieval documents. J. Imaging 2020, 6, 89. [Google Scholar] [CrossRef] [PubMed]
  8. Hegghammer, T. OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment. J. Comput. Soc. Sci. 2022, 5, 861–882. [Google Scholar] [CrossRef]
  9. Nockels, J.; Gooding, P.; Ames, S.; Terras, M. Understanding the Application of Handwritten Text Recognition Technology in Heritage Contexts: A Systematic Review of Transkribus in Published Research. Arch. Sci. 2022, 22, 367–392. [Google Scholar] [CrossRef] [PubMed]
  10. Underwood, T.; Kimutis, P.; Witte, J. NovelTM Datasets for English-Language Fiction, 1700–2009. J. Cult. Anal. 2020, 5, 1–30. [Google Scholar] [CrossRef]
  11. Jones, H.; Faghihi, Y. Manuscript Catalogues as Data for Research: From Provenance to Data Decolonisation. Digit. Humanit. Q. 2024, 18, 1–22. Available online: https://dhq.digitalhumanities.org/vol/18/3/000751/000751.html (accessed on 2 December 2025).
  12. Bonacchi, C.; Witte, J.; Altaweel, M. Political uses of the ancient past on social media are predominantly negative and extreme. PLoS ONE 2024, 19, e0308919. [Google Scholar] [CrossRef]
  13. Bonacchi, C.; Krzyżanska, M. Digital heritage research re-theorised: Ontologies and epistemologies in a world of big data. Int. J. Herit. Stud. 2019, 25, 1235–1247. [Google Scholar] [CrossRef]
  14. Cobb, P.J. Large language models and generative AI, oh my! Archaeology in the time of ChatGPT, Midjourney, and beyond. Adv. Archaeol. Pract. 2023, 11, 363–369. [Google Scholar] [CrossRef]
  15. Shehade, M.; Stylianou-Lambert, T. Museums and Technologies of Presence; Taylor & Francis: Milton Park, UK, 2023. [Google Scholar]
  16. Trichopoulos, G. Large Language Models for Cultural Heritage. In Proceedings of the ACM Greek SIGCHI Chapter (CHIGREECE ’23), Athens, Greece, 27–28 September 2023; pp. 1–5. [Google Scholar] [CrossRef]
  17. Lappalainen, Y.; Narayanan, N. Aisha: A Custom AI Library Chatbot Using the ChatGPT API. J. Web Librariansh. 2023, 17, 37–58. [Google Scholar] [CrossRef]
  18. Verma, M. Novel Study on AI-Based Chatbot (ChatGPT) Impacts on the Traditional Library Management. Int. J. Trend Sci. Res. Dev. 2023, 7, 961–964. [Google Scholar]
  19. Altaweel, M.; Khelifi, A.; Shana’ah, M.M. Monitoring looting at cultural heritage sites: Applying deep learning on optical unmanned aerial vehicles data as a solution. Soc. Sci. Comput. Rev. 2023, 42, 480–495. [Google Scholar] [CrossRef]
  20. O’Brien, C.; Hutson, J.; Olsen, T.; Ratican, J. Limitations and Possibilities of Digital Restoration Techniques Using Generative AI Tools: Reconstituting Antoine François Callet’s Achilles Dragging Hector’s Body Past the Walls of Troy; Digital Commons@Lindenwood University: St. Charles, MO, USA, 2023; Available online: https://digitalcommons.lindenwood.edu/faculty-research-papers/522 (accessed on 25 September 2025).
  21. Wagner, P. Data Privacy—The Ethical, Sociological, and Philosophical Effects of Cambridge Analytica. SSRN Electron. J. 2021, 1–9. [Google Scholar] [CrossRef]
  22. Hollingshead, W.; Quan-Haase, A.; Chen, W. Ethics and Privacy in Computational Social Science: A Call for Pedagogy. In Handbook of Computational Social Science; Schroeder, R., Taylor, E., Eds.; Routledge: London, UK, 2021; Volume 1. [Google Scholar]
  23. Wolf, M.J.; Miller, K.; Grodzinsky, F.S. Why We Should Have Seen That Coming: Comments on Microsoft’s Tay “Experiment,” and Wider Implications. ACM SIGCAS Comput. Soc. 2017, 47, 54–64. [Google Scholar] [CrossRef]
  24. Dastin, J. Amazon Scraps Secret AI Recruiting Tool that Showed Bias Against Women. In Ethics of Data and Analytics; Mittelstadt, B., Floridi, L., Eds.; Auerbach Publications: Boca Raton, FL, USA, 2022; Volume 4, 4p. [Google Scholar] [CrossRef]
  25. Kurzweil, R. The Singularity Is Near: When Humans Transcend Biology; Duckworth-Viking: London, UK, 2005. [Google Scholar]
  26. Adams, S.; Arel, I.; Bach, J.; Coop, R.; Furlan, R.; Goertzel, B.; Hall, J.S.; Samsonovich, A.; Schuetz, M.; Schlesinger, M.; et al. Mapping the landscape of human-level artificial general intelligence. AI Mag. 2012, 33, 25–42. [Google Scholar] [CrossRef]
  27. Mitchell, M. Debates on the Nature of Artificial General Intelligence. Science 2024, 383, eado7069. [Google Scholar] [CrossRef] [PubMed]
  28. OpenAI. About 2025. Available online: https://openai.com/about/ (accessed on 1 June 2024).
  29. Altman, S. Reflections. Sam Altman, 2025. Available online: https://blog.samaltman.com/reflections (accessed on 2 February 2025).
  30. Sokolowski, R. Natural and artificial intelligence. Daedalus 1988, 117, 45–64. [Google Scholar]
  31. Dreyfus, H.L. Why Computers Must Have Bodies in Order to Be Intelligent. Rev. Metaphys. 1967, 21, 13–32. [Google Scholar]
  32. Dreyfus, H.L. What Computers Still Can’t Do: A Critique of Artificial Reason; MIT Press: Cambridge, MA, USA, 1992. [Google Scholar]
  33. Dreyfus, H.L. Standing up to Analytic Philosophy and Artificial Intelligence at MIT in the Sixties. Proc. Addresses Am. Philos. Assoc. 2013, 87, 78–92. [Google Scholar]
  34. Landgrebe, J.; Smith, B. Why Machines Will Never Rule the World: Artificial Intelligence Without Fear; Routledge: London, UK, 2022. [Google Scholar] [CrossRef]
  35. Searle, J.R. Minds, brains, and programs. Behav. Brain Sci. 1980, 3, 417–457. [Google Scholar] [CrossRef]
  36. Schuering, B.; Schmid, T. What Can Computers Do Now? Dreyfus Revisited for the Third Wave of Artificial Intelligence. Proc. AAAI Symp. Ser. 2024, 3, 248–252. [Google Scholar] [CrossRef]
  37. Fjelland, R. Why General Artificial Intelligence Will Not be Realized. Humanit. Soc. Sci. Commun. 2020, 7, 10. [Google Scholar] [CrossRef]
  38. Vallor, S. The AI Mirror: How to Reclaim Our Humanity in an Age of Machine Thinking; Oxford University Press: Oxford, UK, 2024. [Google Scholar] [CrossRef]
  39. Kamoi, R.; Zhang, Y.; Zhang, N.; Han, J.; Zhang, R. When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs. Trans. Assoc. Comput. Linguist. 2024, 12, 1417–1440. [Google Scholar] [CrossRef]
  40. Pushpanathan, K.; Lim, Z.W.; Er Yew, S.M.; Chen, D.Z.; Hui’En Lin, H.A.; Lin Goh, J.H.; Wong, W.M.; Wang, X.; Jin Tan, M.C.; Chang Koh, V.T.; et al. Popular Large Language Model Chatbots’ Accuracy, Comprehensiveness, and Self-Awareness in Answering Ocular Symptom Queries. iScience 2023, 26, 108163. [Google Scholar] [CrossRef]
  41. Jones, C.R.; Bergen, B.K. Large Language Models Pass the Turing Test. arXiv 2025, arXiv:2503.23674. [Google Scholar] [CrossRef]
  42. Galatzer-Levy, I.R.; Munday, D.; McGiffin, J.; Liu, X.; Karmon, D.; Labzovsky, I.; Moroshko, R.; Zait, A.; McDuff, D. The Cognitive Capabilities of Generative AI: A Comparative Analysis with Human Benchmarks. arXiv 2024, arXiv:2410.07391. [Google Scholar] [CrossRef]
  43. Guilford, J.P. The Nature of Human Intelligence; McGraw-Hill Inc.: New York, NY, USA, 1967. [Google Scholar]
  44. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Schmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021. [Google Scholar] [CrossRef]
  45. Fayyad, U.M. From Stochastic Parrots to Intelligent Assistants—The secrets of data and human interventions. IEEE Intell. Syst. 2023, 2023, 63–67. [Google Scholar] [CrossRef]
  46. Brophy, D.R. Comparing the attributes, activities, and performance of divergent, convergent, and combination thinkers. Creat. Res. J. 2001, 13, 439–455. [Google Scholar] [CrossRef]
  47. Al Naqbi, H.; Bahroun, Z.; Ahmed, V. Enhancing work productivity through generative artificial intelligence: A comprehensive literature review. Sustainability 2024, 16, 1166. [Google Scholar] [CrossRef]
  48. Spennemann, D.H.R. Generative Artificial Intelligence, Human Agency and the Future of Cultural Heritage. Heritage 2024, 7, 3597. [Google Scholar] [CrossRef]
  49. Ray, P.P. ChatGPT: A Comprehensive Review on Background, Applications, Key Challenges, Bias, Ethics, Limitations and Future Scope. Internet Things Cyber-Phys. Syst. 2023, 3, 121–154. [Google Scholar] [CrossRef]
  50. Tang, X.; Li, X.; Ding, Y.; Song, M.; Bu, Y. The pace of artificial intelligence innovations: Speed, talent and trial and error. arXiv 2020, arXiv:2009.01812. [Google Scholar] [CrossRef]
  51. Strickland, E. The Turbulent Past and Uncertain Future of Artificial Intelligence. IEEE Spectr. 2021. Available online: https://spectrum.ieee.org/history-of-ai (accessed on 2 December 2025).
  52. Samuel, S. “I Lost Trust”: Why the OpenAI Team in Charge of Safeguarding Humanity Imploded; Vox: New York, NY, USA, 2024; Available online: https://www.vox.com/future-perfect/2024/5/17/24158403/openai-resignations-ai-safety-ilya-sutskever-jan-leike-artificial-intelligence (accessed on 5 August 2025).
  53. Lucchi, N. ChatGPT: A Case Study on Copyright Challenges for Generative Artificial Intelligence Systems. Eur. J. Risk Regul. 2024, 15, 602–624. [Google Scholar] [CrossRef]
  54. Quang, J. Does Training AI Violate Copyright Law? Berkeley Technol. Law J. 2021, 36, 1407. [Google Scholar] [CrossRef]
  55. Hacker, P.; Mittelstadt, B.; Borgesius, F.Z.; Wachter, S. Generative Discrimination: What Happens when Generative AI Exhibits Bias, and What Can be Done About It. arXiv 2024, arXiv:2407.10329. [Google Scholar]
  56. Kaplan, D.M.; Palitsky, R.; Alvarez, S.J.A.; Pozzo, N.S.; Greenleaf, M.N.; Atkinson, C.A.; Lam, W.A. What’s in a Name? Experimental Evidence of Gender Bias in Recommendation Letters Generated by ChatGPT. J. Med. Internet Res. 2024, 26, e51837. [Google Scholar] [CrossRef]
  57. Suguri Motoki, F.Y.; Pinho Neto, V.; Rodrigues, V. More Human than Human: Measuring ChatGPT Political Bias. SSRN Electron. J. 2023, 1–18. [Google Scholar] [CrossRef]
  58. Warr, M.; Pivovarova, M.; Mishra, P.; Oster, N.J. Is ChatGPT Racially Biased? The Case of Evaluating Student Writing. Soc. Sci. Res. Netw. 2024, 1–16. [Google Scholar] [CrossRef]
  59. Etminani, K.D. Generative Privacy Doctrine: The Case for a New Legal Privacy Framework for GenAI. UC Law Const. Q. 2024, 52, 305. [Google Scholar]
  60. Dong, M.; Wang, G.; Han, X. Impacts of Artificial Intelligence on Carbon Emissions in China: In Terms of Artificial Intelligence type and Regional Differences. Sustain. Cities Soc. 2024, 113, 105682. [Google Scholar] [CrossRef]
  61. Ding, Z.; Wang, J.; Song, Y.; Zheng, X.; He, G.; Chen, X.; Zhang, T.; Lee, W.J.; Song, J. Tracking the Carbon Footprint of Global Generative Artificial Intelligence. Innovation 2025, 6, 100866. [Google Scholar] [CrossRef]
  62. Amin, K.S.; Forman, H.P.; Davis, M.A. Even with ChatGPT, Race Matters. Clin. Imaging 2024, 109, 110113. [Google Scholar] [CrossRef]
  63. Zheng, A. Dissecting Bias of ChatGPT in College Major Recommendations. Inf. Technol. Manag. 2024, 26, 625–636. [Google Scholar] [CrossRef]
  64. Wu, X.; Duan, R.; Ni, J. Unveiling Security, Privacy, and Ethical Concerns of ChatGPT. arXiv 2023, arXiv:2307.14192. [Google Scholar] [CrossRef]
  65. Zeng, Y.; Lin, H.; Zhang, J.; Yang, D.; Jia, R.; Shi, W. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Volume 1, pp. 14322–14350. [Google Scholar] [CrossRef]
  66. Shen, X.; Chen, Z.; Backes, M.; Shen, Y.; Zhang, Y. “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv 2024, arXiv:2308.03825. [Google Scholar] [CrossRef]
  67. Nah, F.; Cai, J.; Zheng, R.; Pang, N. An Activity System-based Perspective of Generative AI: Challenges and Research Directions. AIS Trans. Hum.-Comput. Interact. 2023, 15, 247–267. [Google Scholar] [CrossRef]
  68. Spennemann, D.H.R. ChatGPT and the Generation of Digitally Born “Knowledge”: How Does a Generative AI Language Model Interpret Cultural Heritage Values? Knowledge 2023, 3, 480–512. [Google Scholar] [CrossRef]
  69. Buchanan, J.; Hill, S.; Shapoval, O. ChatGPT hallucinates non-existent citations: Evidence from economics. Am. Econ. 2024, 69, 80–87. [Google Scholar] [CrossRef]
  70. Sallam, M.; Al-Salahat, K.; Eid, H.; Egger, J.; Puladi, B. Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5, and Humans in Clinical Chemistry Multiple-Choice Questions. medRxiv 2024. [Google Scholar] [CrossRef]
  71. Orrù, G.; Piarulli, A.; Conversano, C.; Gemignani, A. Human-like Problem-Solving Abilities in Large Language Models Using ChatGPT. Front. Artif. Intell. 2023, 6, 1199350. [Google Scholar] [CrossRef] [PubMed]
  72. Castillo-González, W.; Lepez, C.O.; Bonardi, M.C. ChatGPT: A promising tool for academic editing. Data Metadata 2022, 1, 23. [Google Scholar] [CrossRef]
  73. Pavlik, J.V. Collaborating with ChatGPT: Considering the implications of generative artificial intelligence for journalism and media education. J. Mass Commun. Educ. 2023, 78, 84–93. [Google Scholar] [CrossRef]
  74. Doshi, A.R.; Hauser, O. Generative artificial intelligence enhances creativity but reduces the diversity of novel content. Sci. Adv. 2023, 10, eadn5290. [Google Scholar] [CrossRef]
  75. Li, Z.; Liang, C.; Peng, J.; Yin, M. The Value, Benefits, and Concerns of Generative AI-Powered Assistance in Writing. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–25. [Google Scholar] [CrossRef]
  76. Washington, J. The Impact of Generative Artificial Intelligence on Writer’s Self-Efficacy: A Critical Literature Review. Soc. Sci. Res. Netw. 2023, 1–83. [Google Scholar] [CrossRef]
  77. Pellas, N. The Effects of Generative AI Platforms on Undergraduates’ Narrative Intelligence and Writing Self-Efficacy. Educ. Sci. 2023, 13, 1155. [Google Scholar] [CrossRef]
  78. Koltovskaia, S.; Rahmati, P.; Saeli, H. Graduate Students’ Use of ChatGPT for Academic Text Revision: Behavioral, Cognitive, and Affective Engagement. J. Second Lang. Writ. 2024, 65, 101130. [Google Scholar] [CrossRef]
  79. Gao, B.; Liu, R.; Chu, J. Creativity Catalyst: Exploring the Role and Potential Barriers of Artificial Intelligence in Promoting Student Creativity. In Artificial Intelligence in HCI, Proceedings of the 6th International Conference, AI-HCI 2025, Held as Part of the 27th HCI International Conference, HCII 2025, Gothenburg, Sweden, 22–27 June 2025; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; pp. 3–24. [Google Scholar] [CrossRef]
  80. Historic England. Advice & Grants. Available online: https://historicengland.org.uk/advice/ (accessed on 5 October 2024).
  81. Historic England. Current Guidance and Advice. Available online: https://historicengland.org.uk/advice/find/a-z-publications/ (accessed on 5 October 2024).
  82. Historic England. Your Home—Maintaining and Living in an Old Building. Available online: https://historicengland.org.uk/advice/your-home/ (accessed on 1 March 2025).
  83. Pradhan, D.; Rajput, T.; Rajkumar, A.J.; Lazar, J.; Jain, R.; Morariu, V.I.; Manjunatha, V. Development and Evaluation of a Tool for Assisting Content Creators in Making PDF Files More Accessible. ACM Trans. Access. Comput. 2022, 15, 1–52. [Google Scholar] [CrossRef]
  84. Sorge, V.; Bansal, A.; Jadhav, N.M.; Garg, H.; Verma, A.; Balakrishnan, M. Towards generating web-accessible STEM documents from PDF. In Proceedings of the 17th International Web for All Conference (W4A ’20), Taipei, Taiwan, 20–21 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–5. [Google Scholar] [CrossRef]
  85. Cushman, E. Decolonizing Validity. J. Writ. Assess. 2016, 9, 1–28. Available online: http://journalofwritingassessment.org/article.php?article=98 (accessed on 10 October 2025).
  86. Elliot, N. A Theory of Ethics for Writing Assessment. J. Writ. Assess. 2016, 9, 1–29. Available online: http://journalofwritingassessment.org/article.php?article=97 (accessed on 10 October 2025).
  87. Maaß, C. Easy Language—Plain Language—Easy Language Plus: Balancing Comprehensibility and Acceptability; Frank & Timme: Berlin, Germany, 2020; Volume 3. [Google Scholar] [CrossRef]
  88. Rink, I. Communication barriers. In Handbook of Accessible Communication; Frank & Timme: Berlin, Germany, 2024; pp. 33–68. [Google Scholar]
  89. Boldyreff, C.; Burd, E.; Donkin, J.; Marshall, S. The case for the use of plain English to increase web accessibility. In Proceedings of the 3rd International Workshop on Web Site Evolution (WSE 2001), Florence, Italy, 10 November 2001. [Google Scholar] [CrossRef]
  90. Kirkpatrick, E.; Gaisford, W.; Williams, E.; Brindley, E.; Tembo, D.; Wright, D. Understanding Plain English Summaries: A Comparison of Two Approaches to Improve the Quality of Plain English Summaries in Research Reports. Res. Involv. Engagem. 2017, 3, 17. [Google Scholar] [CrossRef]
  91. Zedelius, C.M.; Mills, C.; Schooler, J.W. Beyond Subjective Judgments: Predicting Evaluations of Creative Writing from Computational Linguistic Features. Behav. Res. Methods 2018, 51, 879–894. [Google Scholar] [CrossRef]
  92. Pires, C.; Cavaco, A.; Vigário, M. Towards the Definition of Linguistic Metrics for Evaluating Text Readability. J. Quant. Linguist. 2017, 24, 319–349. [Google Scholar] [CrossRef]
  93. Lorge, I. Predicting Readability. Teach. Coll. Rec. 1944, 45, 1944. [Google Scholar] [CrossRef]
  94. Klare, G.R. Assessing Readability. Read. Res. Q. 1974, 10, 62–102. [Google Scholar] [CrossRef]
  95. Fry, E. A readability formula that saves time. J. Read. 1968, 11, 513–516. [Google Scholar]
  96. DuBay, W.H. The Principles of Readability; ERIC Clearinghouse: New York, NY, USA, 2004. [Google Scholar]
  97. Bormuth, J.R. Development of Readability Analysis; U.S. Department of Health, Education, and Welfare, Bureau of Research: Washington, DC, USA, 1969. Available online: https://eric.ed.gov/?id=ED029166 (accessed on 5 March 2024).
  98. Flesch, R. A New Readability Yardstick. J. Appl. Psychol. 1948, 32, 221–233. [Google Scholar] [CrossRef] [PubMed]
  99. Dale, E.; Chall, J.S. The Concept of Readability. Elem. Engl. 1949, 26, 19–26. [Google Scholar]
  100. Smith, E.A.; Senter, R.J. Automated Readability Index; Aerospace Medical Research Laboratories, Aerospace Medical Division, Air Force Systems Command: Dayton, OH, USA, 1967; Volume 66, No. 220. [Google Scholar]
  101. Jarrett, C.; Redish, J.; Summers, K. Designing for People Who Do Not Read Easily. In Rhetorical Accessability: At the Intersection of Technical Communication and Disability Studies; Meloncon, L., Ed.; Routledge: New York, NY, USA, 2014. [Google Scholar] [CrossRef]
  102. Selzer, J. What Constitutes a “Readable” Technical Style? In New Essays in Technical and Scientific Communication; Anderson, P., Brockman, J., Miller, C., Eds.; Routledge: New York, NY, USA, 1983. [Google Scholar] [CrossRef]
  103. Irwin, J.W.; Davis, C.A. Assessing Readability: The Checklist Approach. J. Read. 1980, 24, 124–130. [Google Scholar]
  104. Coke, E.U. Reading rate, readability, and variations in task-induced processing. J. Educ. Psychol. 1973, 68, 167. [Google Scholar] [CrossRef]
  105. Nikolayeva, L. Exploring the efficacy of ChatGPT in adapting reading materials for undergraduate students. In Proceedings of the 8th International Conference on Higher Education Advances (HEAd’22), Valencia, Spain, 14–17 June 2024. [Google Scholar] [CrossRef]
  106. Rouhi, A.D.; Ghanem, Y.K.; Yolchieva, L.; Saleh, Z.; Joshi, H.; Moccia, M.C.; Suarez-Pierre, A.; Han, J.J. Can Artificial Intelligence Improve the Readability of Patient Education Materials on Aortic Stenosis? A Pilot Study. Cardiol. Ther. 2024, 13, 137–147. [Google Scholar] [CrossRef] [PubMed]
  107. Eleyan, D.; Othman, A.; Eleyan, A. Enhancing Software Comments Readability using Flesch Reading Ease Score. Information 2020, 11, 430. [Google Scholar] [CrossRef]
  108. Abreu, A.A.; Murimwa, G.Z.; Farah, E.; Stewart, J.W.; Zhang, L.; Rodriguez, J.; Sweetenham, J.; Zeh, H.J.; Wang, S.C.; Polanco, P.M. Enhancing readability of online patient-facing content: The role of AI chatbots in improving cancer information accessibility. J. Natl. Compr. Cancer Netw. 2024, 22, e237334. [Google Scholar] [CrossRef]
  109. Roster, K.; Kann, R.B.; Farabi, B.; Gronbeck, C.; Brownstone, N.; Lipner, S.R. Readability and Health Literacy Scores for ChatGPT-Generated Dermatology Public Education Materials: Cross-Sectional Analysis of Sunscreen and Melanoma Questions. JMIR Dermatol. 2024, 7, e50163. [Google Scholar] [CrossRef]
  110. Momenaei, B.; Wakabayashi, T.; Shahlaee, A.; Durrani, A.F.; Pandit, S.A.; Wang, K.; Mansour, H.A.; Abishek, R.M.; Xu, D.; Sridhar, J.; et al. Appropriateness and Readability of ChatGPT-4-Generated Responses for Surgical Treatment of Retinal Diseases. Ophthalmol. Retin. 2023, 7, 862–868. [Google Scholar] [CrossRef]
  111. Gencer, A. Readability Analysis of ChatGPT’s Responses on Lung Cancer. Sci. Rep. 2024, 14, 17234. [Google Scholar] [CrossRef]
  112. Mace, R.L. Universal Design in Housing. Assist. Technol. 1998, 10, 21–28. [Google Scholar] [CrossRef]
  113. Goldsmith, S. Universal Design; Routledge: London, UK, 2007. [Google Scholar]
  114. Henry, S.L.; Abou-Zahra, S.; Brewer, J. The Role of Accessibility in a Universal Web. In Proceedings of the 11th Web for All Conference on—W4A ’14, Seoul, Republic of Korea, 7 April 2014. [Google Scholar] [CrossRef]
  115. Bailin, A.; Grafstein, A. The Linguistic Assumptions Underlying Readability Formulae. Lang. Commun. 2001, 21, 285–301. [Google Scholar] [CrossRef]
  116. Redish, J.C. Understanding the Limitations of Readability Formulas. IEEE Trans. Prof. Commun. 1981, PC-24, 46–48. [Google Scholar] [CrossRef]
  117. Ravelli, L.J. Making Language Accessible: Successful Text Writing for Museum Visitors. Linguist. Educ. 1996, 8, 367–387. [Google Scholar] [CrossRef]
  118. Ravelli, L.J. Museum Texts; Routledge: Milton Park, UK, 2006. [Google Scholar] [CrossRef]
  119. Lazzeretti, C. The Language of Museum Communication; Palgrave Macmillan: London, UK, 2016. [Google Scholar]
  120. Kjeldsen, A.K.; Nisbeth Jensen, M. A Study of Accessibility in Museum Exhibition Texts: When Words of Wisdom are Not Wise. Nord. Museol. 2015, 1, 91–111. [Google Scholar] [CrossRef]
  121. Screven, C. Motivating visitors to read labels. ILVS Rev. 1992, 2, 183–221. [Google Scholar]
  122. Perego, E. Into the Language of Museum Audio Descriptions: A Corpus-Based Study. Perspectives 2018, 27, 333–349. [Google Scholar] [CrossRef]
  123. Randaccio, M. Museums, museum AD and Easy Language: Some critical insights. Riv. Internazionale Tec. Traduzione 2022, 24, 105–120. [Google Scholar] [CrossRef]
  124. Hooper-Greenhill, E. Museums and the Interpretation of Visual Culture; Routledge: London, UK, 2020. [Google Scholar]
  125. Deleuze, G.; Guattari, F. A Thousand Plateaus; University of Minnesota Press: Minneapolis, MN, USA, 1987. [Google Scholar]
  126. Smith, L. Uses of Heritage; Routledge: London, UK, 2006. [Google Scholar]
  127. Historic England. Historic England’s Role. Available online: https://historicengland.org.uk/about/what-we-do/historic-englands-role/ (accessed on 1 March 2025).
  128. Rohan, R.; Faruk, L.I.D.; Puapholthep, K.; Pal, D. Unlocking the Black Box: Exploring the Use of Generative AI (ChatGPT) in Information Systems Research. In Proceedings of the IAIT ’23: Proceedings of the 13th International Conference on Advances in Information Technology, Bangkok, Thailand, 6–9 December 2023; pp. 1–9. [Google Scholar] [CrossRef]
  129. Franzoni, V. From Black Box to Glass Box: Advancing Transparency in Artificial Intelligence Systems for Ethical and Trustworthy AI. Int. Conf. Comput. Sci. Appl. 2023, 14107, 118–130. [Google Scholar] [CrossRef]
  130. Giray, L. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Ann. Biomed. Eng. 2023, 51, 2629–2633. [Google Scholar] [CrossRef]
  131. Debnath, T.; Siddiky, M.N.A.; Rahman, M.E.; Das, P.; Guha, A.K. A Comprehensive Survey of Prompt Engineering Techniques in Large Language Models. TechRxiv 2025. [Google Scholar] [CrossRef]
  132. Historic England. 3D Laser Scanning for Heritage. Available online: https://historicengland.org.uk/images-books/publications/3d-laser-scanning-heritage/ (accessed on 10 October 2023).
  133. Historic England. Streets for All: Yorkshire. Available online: https://historicengland.org.uk/images-books/publications/streets-for-all-yorkshire/ (accessed on 10 October 2023).
  134. Makrygiannakis, M.A.; Giannakopoulos, K.; Kaklamanos, E.G. Evidence-Based Potential of Generative Artificial Intelligence Large Language Models in Orthodontics: A Comparative Study of ChatGPT, Google Bard, and Microsoft Bing. Eur. J. Orthod. 2024, cjae017. [Google Scholar] [CrossRef] [PubMed]
  135. OpenAI. Fine-Tuning Guide. Available online: https://platform.openai.com/docs/guides/fine-tuning (accessed on 1 June 2024).
  136. OpenAI. Developer Forum. Available online: https://community.openai.com (accessed on 1 June 2024).
  137. Python Package Index (PyPI). Pytesseract, Version 0.3.13. 2023. Available online: https://pypi.org/project/pytesseract/ (accessed on 24 September 2025).
  138. Sebastian, G. Privacy and data protection in ChatGPT and other AI chatbots: Strategies for securing user information. Int. J. Secur. Priv. Pervasive Comput. 2023, 15, 1–14. [Google Scholar] [CrossRef]
  139. Yao, Y.; Guo, X.; Asnani, V.; Gong, Y.; Liu, J.; Lin, X.; Liu, X.; Liu, S. Reverse Engineering of Deceptions on Machine- and Human-Centric Attacks. Found. Trends® Priv. Secur. 2024, 6, 53–152. [Google Scholar] [CrossRef]
  140. Goltz, C.R. A Table for the Quick Computation of Readability Scores Using the Dale-Chall Formula. J. Dev. Read. 1964, 7, 175–187. [Google Scholar]
  141. Lei, L.; Yan, S. Readability and Citations in Information Science: Evidence from Abstracts and Articles of Four Journals (2003–2012). Scientometrics 2016, 108, 1155–1169. [Google Scholar] [CrossRef]
  142. Python Package Index (PyPI). Openai, Version 1.57.0. 2024. Available online: https://pypi.org/project/openai/1.57.0/ (accessed on 24 September 2025).
  143. Python Package Index (PyPI). Numpy, Version 2.1.3. 2024. Available online: https://pypi.org/project/numpy/2.1.3/ (accessed on 24 September 2025).
  144. Python Package Index (PyPI). Readability, Version 0.3.2. 2025. Available online: https://pypi.org/project/readability/0.3.2/ (accessed on 24 September 2025).
  145. Kandeel, M.E.; Eldakak, A. Legal Dangers of Using ChatGPT as a Co-Author According to Academic Research Regulations. J. Govern. Regul. 2024, 13, 289–298. [Google Scholar] [CrossRef]
  146. Golan, R.; Ripps, S.J.; Reddy, R.; Loloi, J.; Bernstein, A.P.; Connelly, Z.M.; Golan, N.S.; Ramasamy, R. ChatGPT’s Ability to Assess Quality and Readability of Online Medical Information: Evidence From a Cross-Sectional Study. Cureus 2023, 15, e42214. [Google Scholar] [CrossRef]
  147. Witte, J. GitHub—jessicawitte92/HAZEL: Code and Supplementary Materials for the HAZEL Pilot Chatbot; GitHub: San Francisco, CA, USA, 2025; Available online: https://github.com/jessicawitte92/HAZEL (accessed on 24 September 2025).
  148. Orduña-Malea, E.; Cabezas-Clavijo, Á. ChatGPT and the potential growing of ghost bibliographic references. Scientometrics 2023, 128, 5351–5355. [Google Scholar] [CrossRef]
  149. Gravel, J.; D’Amours-Gravel, M.; Osmanlliu, E. Learning to Fake It: Limited Responses and Fabricated References Provided by ChatGPT for Medical Questions. Mayo Clin. Proc. Digit. Health 2023, 1, 226–234. [Google Scholar] [CrossRef]
  150. Day, T. A Preliminary Investigation of Fake Peer-Reviewed Citations and References Generated by ChatGPT. Prof. Geogr. 2023, 75, 1024–1027. [Google Scholar] [CrossRef]
  151. Historic England. Adapting Traditional Farm Buildings. 2017. Available online: https://historicengland.org.uk/images-books/publications/adapting-traditional-farm-buildings/ (accessed on 12 December 2023).
  152. Ferrara, E. Should ChatGPT Be Biased? Challenges and Risks of Bias in Large Language Models. First Monday 2023, 28, 11. [Google Scholar] [CrossRef]
  153. Wang, L.; Chen, S.; Jiang, L.; Pan, S.; Cai, R.; Yang, S.; Yang, F. Parameter-Efficient Fine-Tuning in Large Language Models: A Survey of Methodologies. Artif. Intell. Rev. 2025, 58, 227. [Google Scholar] [CrossRef]
  154. Sedgwick, C.; Belmonte, L.; Margolis, A.; Shafer, P.O.; Pitterle, J.; Gidal, B.E. Extending the Reach of Science—Talk in Plain Language. Epilepsy Behav. Rep. 2021, 16, 100493. [Google Scholar] [CrossRef] [PubMed]
  155. Rubano, V.; Vitali, F. Experiences from declarative markup to improve the accessibility of HTML. In Proceedings of the Balisage: The Markup Conference 2020, Washington, DC, USA, 27–31 July 2020; Volume 25. [Google Scholar] [CrossRef]
  156. Bhatia, P.N.; Malek, S. A historical review of web accessibility using WAVE. In Proceedings of the 5th ACM/IEEE Workshop on Gender Equality, Diversity, and Inclusion in Software Engineering, Lisbon, Portugal, 20 April 2024. [Google Scholar] [CrossRef]
  157. Wilson, E.A. Affect and Artificial Intelligence; University of Washington Press: Seattle, WA, USA, 2010. [Google Scholar]
  158. Montag, C.; Ali, R.; Davis, K.L. Affective neuroscience theory and attitudes towards artificial intelligence. AI Soc. 2024, 40, 167–174. [Google Scholar] [CrossRef]
  159. Hazan, S. An Accident Waiting to Happen—AI Besieges the Cultural Heritage Community. Musesphere: Transforming Digital Culture, 2020. Available online: https://www.musesphere.com/2020/images/EVA_FLORENCE_2023_ACCIDENT_WAITING_TO_HAPPEN_HAZAN%20.pdf (accessed on 1 September 2024).
  160. Mori, M.; MacDorman, K.; Kageki, N. The Uncanny Valley [from the Field]. IEEE Robot. Autom. Mag. 2012, 19, 98–100. [Google Scholar] [CrossRef]
  161. Veltman, C. In First-of-Its-Kind Lawsuit, Hollywood Giants Sue AI Firm for Copyright Infringement, 2025. WUNC. Available online: https://www.wunc.org/2025-06-12/in-first-of-its-kind-lawsuit-hollywood-giants-sue-ai-firm-for-copyright-infringement (accessed on 5 September 2025).
  162. Allyn, B. The Music Industry Is Coming for AI. Available online: https://www.npr.org/2024/07/14/nx-s1-5034324-e1/the-music-industry-is-coming-for-ai (accessed on 23 September 2025).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
