1. Introduction
The production of data in bioarchaeology continues to expand due to advances in scientific analysis, including ancient DNA sequencing, palaeopathology, isotopic profiling, and osteological assessment. However, this rapid increase in data volume is paralleled by the finite and often destructive nature of the samples from which it is derived. This paradox underscores the urgent need for data to be reused and repurposed, aligning with the principles of FAIR data management: making datasets Findable, Accessible, Interoperable, and Reusable. Despite this imperative, much of the valuable information in bioarchaeology remains embedded within grey literature, particularly in PDF-format reports published by commercial and academic archaeological units. While PDFs provide a stable and widely compatible medium for dissemination, they are ill-suited to structured data extraction and machine-readable processing. The result is that valuable datasets remain locked in static documents, limiting their utility for research and public engagement.
This paper investigates the feasibility of using Natural Language Processing (NLP) and Named Entity Recognition (NER) techniques to address these challenges. Specifically, it introduces the Osteoarchaeological and Palaeopathological Entity Search (OPES), a prototype system designed to extract domain-specific terms from grey literature PDFs archived by the Archaeology Data Service (ADS).
The project originated as an undergraduate dissertation at the University of York, conducted collaboratively across departments and in partnership with the ADS. While the results and their implications for the future of digital heritage are examined here, the OPES tool and its source code cannot currently be made openly available. However, the data used to train the system, including the XML-labelled dataset, as well as the results of the user survey, are available. Despite these access restrictions, the analysis presented in this paper contributes to ongoing discussions around the ethical and sustainable use of AI in heritage contexts.
In contrast to recent work that utilises high-powered transformer models or large language models (LLMs), OPES is built using lightweight, interpretable methods. This approach reflects a conscious decision to balance computational performance with ethical and environmental considerations, particularly in digital heritage research, where transparency and sustainability are often prioritised over raw technical power. Overall, the paper investigates three main questions: (1) Can a specialised NER system accurately extract bioarchaeological entities? (2) How does OPES perform compared to general-purpose LLMs? (3) Do domain stakeholders find OPES practical for their workflows? In doing so, the paper contributes a practical tool for osteoarchaeological data access and outlines a replicable model for the ethical development and assessment of NLP systems in the heritage sector.
The following subsections review the literature that frames the project.
1.1. Bioarchaeology and Data Complexity
Bioarchaeology brings together a range of analytical specialisms, including osteology, palaeopathology, ancient DNA (aDNA), stable isotope analysis, and proteomics. Each sub-discipline contributes diverse data types, such as skeletal pathologies, isotopic ratios, and genomic sequences. These datasets are often highly specific and temporally or spatially contextualised. With continued advancement in molecular and imaging techniques, the volume of bioarchaeological data is expanding exponentially [1,2].
This data growth presents both opportunity and risk. On one hand, the data has increasing potential to contribute to broader archaeological and anthropological narratives. On the other hand, the destructive nature of bioarchaeological sampling and the finite availability of remains mean the datasets generated must be as reusable and accessible as possible. Applying FAIR data principles offers one pathway towards addressing this tension by encouraging open, structured, and sustainable data practices.
1.2. Assessing FAIRness in Bioarchaeology
The FAIR principles advocate for data to be Findable, Accessible, Interoperable, and Reusable [3]. Achieving FAIRness requires consideration of several elements, including file formats, persistent identifiers, ontologies, and controlled vocabularies, as illustrated in Figure 1.
A Needs Analysis was conducted to assess the extent to which current bioarchaeological data practices are FAIR [4]. This study revealed that bioarchaeological data management is often inconsistent and lacks standardisation (see Figure 2). Data is processed and deposited in varied formats, stored in different locations, and governed by differing levels of access and copyright. Adoption of FAIR-supporting elements such as ORCiDs, structured metadata, and systematic documentation is uneven across sub-disciplines.
The study identified palaeopathology, zooarchaeology, and osteoarchaeology as areas needing improved data reusability strategies. Given the predominance of PDF-based written reports in these fields, applying NLP and NER offers a valuable means of enhancing data discoverability and reuse. Prior work on zooarchaeological datasets has already demonstrated the potential of these technologies [5].
1.3. CARE Principles and Ethical Data Stewardship
While FAIR principles provide a technical framework for data management, bioarchaeological research requires additional ethical considerations given the sensitivity of working with human remains and associated cultural heritage. The CARE Principles for Indigenous Data Governance—Collective Benefit, Authority to Control, Responsibility, and Ethics—offer essential guidance for responsible data stewardship in this context [6].
The CARE principles emphasise four key areas. Collective Benefit refers to designing data ecosystems to support the well-being and self-determination of communities connected to the data. In bioarchaeology, this means ensuring that improved data access benefits not only researchers but also descendant communities, local stakeholders, and the broader public. Authority to Control recognises that Indigenous peoples and other communities maintain rights and interests in their cultural data and heritage. While the grey literature addressed in this study primarily concerns historical archaeological contexts in the UK, the principle of respecting community authority over culturally sensitive information remains paramount. Responsibility indicates that those working with heritage data have a responsibility to ensure that data use respects the dignity of the deceased, honours cultural protocols, and minimises potential harm. This includes being transparent about how data is collected, processed, and shared. Ethics requires that data practices align with the rights and well-being of affected communities and should minimise harm while maximising benefits. For bioarchaeological data, this means careful consideration of how information about human remains is extracted, presented, and made accessible.
The development of OPES was guided by these principles in several ways. First, by improving accessibility to grey literature, the tool democratises access to bioarchaeological information, enabling diverse stakeholders, including community groups, independent researchers, and students, to engage with heritage data without requiring institutional resources. Second, the lightweight, transparent architecture ensures that data extraction methods are interpretable and auditable, allowing communities and stakeholders to understand how information is being processed. Third, by focusing on published grey literature already in the public domain through the Archaeology Data Service, OPES respects existing access frameworks and permissions established by data depositors.
However, implementing CARE principles in automated data extraction systems presents ongoing challenges. Future iterations of OPES must incorporate mechanisms for community consultation, culturally appropriate metadata, and sensitive handling of information that may relate to identifiable individuals or communities. The tool’s modular design allows for such enhancements, including potential integration of access controls, community review processes, and culturally informed classification systems that go beyond the biomedical framework of the U.S. National Library of Medicine’s Medical Subject Headings terminology (here onwards MeSH).
Ultimately, combining FAIR and CARE principles creates a more holistic approach to bioarchaeological data management, one that balances technical interoperability with ethical responsibility, ensuring that advances in data accessibility serve both scholarly research and the broader interests of human dignity and community rights.
1.4. Natural Language Processing and the Role of NER
Natural Language Processing (NLP) and Named Entity Recognition (NER) have increasingly been adopted in archaeology and heritage informatics to address unstructured data challenges. NLP enables computers to understand and process human language, while NER is a specific NLP technique that identifies and classifies key terms, such as anatomical structures or pathological conditions, within text. Early initiatives such as Archaeotools [7], STAR, and STELLAR [8] illustrated the potential of semantic search and ontology mapping. Later projects like SENESCHAL and ARIADNE [9,10] incorporated linked open data, controlled vocabularies, and rule-based text mining to enable more structured data interactions. More recent work has demonstrated the application of BERT-based models for archaeological text retrieval [11] and explored domain-specific NER approaches for historical documents [12].
Developing a Zooarchaeological Entity Search [5] demonstrated the viability of domain-specific NER systems in archaeology. This earlier work informed the development of OPES, which adopts similar approaches for the osteoarchaeological and palaeopathological domains.
Recent work has shown promising results with BERT-based models for archaeological text retrieval, with Brandsen et al. [11] reporting F1-scores of 0.91 for location entities and 0.87 for artifact entities in Dutch archaeological reports. However, their approach required substantial computational resources and showed limitations in handling domain-specific terminology. Similarly, historical document NER systems [12] have achieved high performance (F1-scores > 0.90) but typically focus on well-defined entity types like person names and dates, rather than the complex bioarchaeological terminology addressed by OPES.
Comparative analysis reveals that while OPES achieves lower raw performance metrics (F1 = 0.889) than transformer-based systems, it offers several advantages: (1) interpretable decision-making processes allowing domain expert refinement, (2) minimal computational requirements enabling deployment in resource-constrained environments, and (3) faster inference times suitable for real-time search applications. The trade-off between accuracy and sustainability represents a conscious design choice aligned with responsible AI principles [13].
In the current landscape, more powerful transformer-based Large Language Models (here onwards LLMs) such as BERT [14], BioBERT [15], SciBERT [16], and GPT-4 [17] have significantly advanced information extraction capabilities. These models have enabled improvements in zero-shot classification, contextual entity linking, and semantic search within large unstructured corpora [18]. However, despite these technical advancements, they raise serious ethical, environmental, and epistemological concerns. High computational costs [19], the opacity of model outputs [20], and the risk of perpetuating biases [21] make their uncritical deployment problematic, particularly in domains like heritage and archaeology that demand transparency, sustainability, and domain-sensitive accuracy.
Recent scholarship in digital humanities and archaeological informatics echoes this caution, advocating for more environmentally aware and ethically informed applications of AI [22,23]. Tools like OPES, which prioritise interpretability and methodological rigour even at the cost of raw performance, align with these emerging priorities. By drawing on transparent, efficient methods and grounding its design in human-centred evaluation, OPES offers a scalable and responsible approach to improving access to grey literature. At the same time, it creates a foundation for future integration with LLM-powered enhancements while maintaining methodological accountability.
The remainder of this paper is organised as follows. Section 2 details the methodology for developing and evaluating OPES. Section 3 presents evaluation results from students, experts, and public participants. Section 4 discusses the implications for heritage AI and positions OPES within the broader digital archaeology landscape. Section 5 concludes with key contributions and future directions.
2. Materials and Methods
This section outlines the methodological approach taken in designing, developing, and evaluating the OPES tool.
2.1. Document Selection
A representative corpus was selected from the Archaeology Data Service (ADS) archive to train and evaluate the OPES prototype. The selection process required careful consideration of several factors to ensure the training data would be both robust and representative of the broader archaeological grey literature landscape.
Reports from the Crossrail excavations were chosen as the foundation for this corpus due to their exceptional richness in osteological content and notably consistent formatting. The Crossrail project (2012–2018) generated an unprecedented volume of archaeological documentation, with over 100 archaeologists discovering hundreds of thousands of artifacts spanning 55 million years of history. This developer-funded rescue archaeology initiative produced extensive unpublished fieldwork reports that contained detailed osteoarchaeological analyses across multiple sites, periods, and contexts throughout the London area. The consistent documentation protocols employed across the Crossrail excavations ensured a degree of standardisation in terminology and report structure that would facilitate the NLP training process while still reflecting real-world archaeological reporting practices.
The selection methodology involved systematic pre-screening of available Crossrail reports to identify those containing relevant terminology associated with human remains. This pre-screening process specifically targeted documents with substantial osteoarchaeological content, including anatomical references (e.g., “humerus,” “femur,” “molar,” “cranium,” “vertebrae”) and disease indicators (e.g., “rickets,” “osteoarthritis,” “periostitis,” “cribra orbitalia”). The presence of diverse pathological conditions and skeletal elements across different demographic groups and time periods was considered essential to create a training dataset that would generalise effectively to other archaeological reports in the ADS archive.
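This pre-screening logic can be sketched as a simple keyword scan over report text. The sketch below is illustrative only: the term sets are the subset of screening terms named above, not the full vocabulary actually employed, and the qualification rule (`min_hits`) is an assumed, hypothetical threshold:

```python
import re

# Illustrative screening vocabularies (a subset of the terms named in the text).
ANATOMY_TERMS = {"humerus", "femur", "molar", "cranium", "vertebrae"}
DISEASE_TERMS = {"rickets", "osteoarthritis", "periostitis", "cribra orbitalia"}

def screen_report(text, min_hits=3):
    """Return (anatomy_hits, disease_hits, passes) for one report's text."""
    lowered = text.lower()
    # Whole-word matches only, so e.g. "molar" does not match "premolars".
    anatomy = {t for t in ANATOMY_TERMS
               if re.search(r"\b" + re.escape(t) + r"\b", lowered)}
    disease = {t for t in DISEASE_TERMS
               if re.search(r"\b" + re.escape(t) + r"\b", lowered)}
    # Assumed rule: a report qualifies if it mentions both anatomical and
    # disease vocabulary and reaches a minimum number of distinct hits.
    passes = bool(anatomy) and bool(disease) and len(anatomy | disease) >= min_hits
    return anatomy, disease, passes

sample = ("The left femur and right humerus were recovered; "
          "lesions consistent with periostitis were noted on the tibia.")
anatomy, disease, ok = screen_report(sample)
```

In practice such a scan would run over the extracted plain text of each PDF, with qualifying reports passed forward for manual review.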
Following this systematic review, five reports were selected to serve as the Gold Standard dataset for annotation and model training. This Gold Standard corpus was designed to provide comprehensive coverage of osteoarchaeological terminology while maintaining a manageable size for the detailed manual annotation process required for NER training. The selection balanced breadth of terminology coverage with depth of contextual usage, ensuring the model would learn to recognise these terms across varied sentence structures and reporting conventions typical of archaeological grey literature.
2.2. Annotation Process
The annotation process took place in three phases:
Initial Annotation: Using a word processor, the researcher manually annotated the selected documents. Each osteoarchaeological or palaeopathological term referring to the human body was tagged with a unique identifier derived from MeSH. MeSH was selected for its extensive vocabulary coverage, though limitations include inconsistencies with British English terminology and a lack of archaeological disease classifications.
Expert Annotation: A domain expert independently reviewed and re-annotated the same five documents to verify accuracy. Different colours were used to distinguish term types, with a consistent colour key maintained to ensure comparability. A “super-annotator” subsequently reviewed both annotation sets, resolving all discrepancies, 44 in total, to create a consistent, verified dataset. Overall, 2582 annotations covering 252 distinct terms were made, although 28 terms were only observed once, limiting their training utility. These low-frequency terms were excluded from the final model training.
Structured Annotation in GATE: The final, reconciled annotations were transferred into XML format and imported into GATE Developer (version 8.5.1, General Architecture for Text Engineering; University of Sheffield, 2018). GATE is an open-source framework widely used for natural language processing and text annotation, providing tools for corpus management and machine learning-based information extraction. The platform uses stand-off annotations, where annotations reference character offsets in the original text rather than modifying it directly, allowing multiple overlapping annotation layers to coexist while preserving document integrity.
Using MeSH identifiers established during reconciliation, osteoarchaeological and palaeopathological terms were systematically tagged within GATE. Each identified term was annotated with its corresponding MeSH unique identifier (e.g., D006801 for “Humans,” D005269 for “Femur”), creating a structured, machine-readable representation of domain terminology. The annotated documents were exported in GATE XML format, preserving both the original text and complete annotation structure. This Gold Standard training set, five fully annotated Crossrail reports with comprehensive osteoarchaeological markup, served as the foundation for training the Named Entity Recognition model.
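The stand-off principle can be illustrated with a minimal XML fragment: annotations reference character offsets into the text rather than wrapping it. This is a simplified sketch of the idea, not the exact GATE XML schema (D005269 is the MeSH identifier for "Femur", as noted above):

```python
import xml.etree.ElementTree as ET

# Simplified stand-off markup: the annotation points into the base text via
# character offsets instead of modifying the text itself.
doc = """
<document>
  <text>The right femur showed evidence of healed trauma.</text>
  <annotations>
    <annotation start="10" end="15" mesh="D005269" type="Bone"/>
  </annotations>
</document>
"""

root = ET.fromstring(doc)
text = root.findtext("text")
entities = []
for ann in root.find("annotations"):
    start, end = int(ann.get("start")), int(ann.get("end"))
    # Recover the annotated span from the base text using the offsets.
    entities.append((text[start:end], ann.get("mesh")))
```

Because the text is never modified, multiple annotation layers (e.g. the researcher's and the expert's) can coexist over the same document.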
This hybrid manual-expert-supervised approach ensured domain accuracy and data consistency, offering a high-quality foundation for model training.
2.3. Model Training and Rationale
Rather than using high-resource models such as BERT or GPT-style transformers, OPES was trained using a custom bidirectional LSTM-CRF (Long Short-Term Memory with Conditional Random Fields) neural architecture developed in Keras with Theano as the backend, following established approaches for archaeological metadata extraction.
Model Architecture and Training: The neural network employed a hierarchical architecture combining: (1) a character-level convolutional neural network (CNN) followed by max pooling to extract morphological features and handle out-of-vocabulary terms common in domain-specific archaeological terminology [
24], (2) a single bidirectional LSTM layer with 100 units to capture contextual information [
25], and (3) a linear-chain CRF output layer to enforce label consistency across token sequences [
26,
27]. Word embeddings were 300-dimensional dependency-based embeddings pre-trained on 2 billion words [
28]. Character-level features from each word were concatenated with word embeddings and capitalisation features to create rich input representations.
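The per-token input representation can be sketched in simplified form. The exact feature encoding used by OPES is not published, so the four-way capitalisation scheme and the tiny vector sizes below are illustrative assumptions rather than the original design:

```python
def cap_feature(token):
    """Assumed one-hot capitalisation feature:
    [all-lowercase, initial-capital, all-caps, other]."""
    if token.islower():
        return [1, 0, 0, 0]
    if token.istitle():
        return [0, 1, 0, 0]
    if token.isupper():
        return [0, 0, 1, 0]
    return [0, 0, 0, 1]

def token_representation(word_vec, char_vec, token):
    """Concatenate the word embedding, the char-CNN output vector, and the
    capitalisation flags into a single input vector for the BiLSTM."""
    return list(word_vec) + list(char_vec) + cap_feature(token)

# Toy dimensions for illustration; OPES used 300-dimensional word embeddings.
rep = token_representation([0.1, 0.2], [0.3], "Femur")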
Implementation Details: The model was implemented using Keras 2 with the Theano backend [29,30], utilising Gross's [31] Keras CRF implementation. Training hyperparameters included: Adam optimisation with Nesterov momentum, categorical cross-entropy loss function, 50% dropout regularisation (applied between layers), and a classification threshold of 0.5. Models were trained for up to 100 epochs with early stopping applied if validation performance did not improve for 20 consecutive epochs. Five-fold cross-validation was employed to ensure robust evaluation across all entity classes. Training was accelerated using an NVIDIA GTX 750 Ti graphics card, with an approximate training time of 20 min per fold.
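The five-fold cross-validation procedure can be sketched in a framework-agnostic way. The fold-splitting function below is a generic illustration of the evaluation scheme, not code from the original implementation:

```python
def k_fold_indices(n_items, k=5):
    """Partition item indices into k near-equal folds; each fold serves once
    as the validation set while the remaining folds form the training set."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i, val in enumerate(folds):
        # Training set = every index not held out in the current fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits

# e.g. 10 annotated document segments split into 5 train/validation pairs
splits = k_fold_indices(10, k=5)
```

Each of the five (train, validation) pairs would then drive one training run, with metrics averaged across folds.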
Reproducibility Considerations: It is important to note that whilst the original OPES implementation used Theano as the backend, Theano was officially discontinued in 2017 (final release 1.0.5 in 2020). The architecture described here, however, is framework-agnostic and readily reproducible in contemporary deep learning frameworks. Modern replication would require either: (1) using archived Docker containers with exact package versions (Keras 2.x, TensorFlow 1.x, Python 3.5–3.6), or (2) re-implementing the model architecture in TensorFlow 2.x or PyTorch, which would preserve the architectural approach whilst updating the underlying computational infrastructure. The model architecture, hyperparameters, and training procedures are fully specified to facilitate reproduction.
Performance Metrics: On the held-out test set, the model achieved the following:
Precision: 0.913;
Recall: 0.868;
F1-score: 0.889.
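As a sanity check, the F1-score is the harmonic mean of precision and recall; applying the standard formula to the reported values reproduces the reported F1 to within rounding:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Using the reported precision and recall; the published F1 (0.889) was
# presumably computed from raw token counts before rounding.
f1 = f1_score(0.913, 0.868)  # ~0.8899
```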
While these frameworks are considered legacy in the context of modern AI, they were selected due to their low computational overhead, flexibility in fine-tuning, and reproducibility without reliance on cloud infrastructure. These factors made them especially suitable for deployment in heritage contexts where access to powerful hardware or proprietary APIs may be limited. Moreover, avoiding LLMs aligns with emerging critiques regarding AI sustainability and opacity [19,21].
2.4. Evaluation Framework
The effectiveness of the OPES tool and the underlying NLP and NER algorithms was evaluated through a structured user survey designed to assess real-world perceptions of the system’s utility and usability. The evaluation adopted a pragmatic approach, balancing methodological rigour with accessibility for diverse participant groups.
2.4.1. Survey Design and Rationale
Following the approach outlined by Albert and Tullis [32], participants were asked to rate the system using a 7-point Likert scale across five success criteria: usefulness, time-saving ability, accessibility, reliability, and likelihood of reuse. Each criterion was assessed using a single-item question to minimise respondent burden and maximise completion rates across participant groups with varying levels of expertise and available time.
While validated multi-item scales such as the System Usability Scale (SUS) or the Usefulness, Satisfaction, and Ease of use (USE) questionnaire offer greater psychometric reliability and can capture multidimensional constructs more comprehensively, the decision to employ single-item measures was made for several practical reasons. First, the target participant pool included domain experts, students, and members of the general public, groups with differing levels of familiarity with both the subject matter and evaluation methodologies. A lengthy, complex questionnaire risked deterring participation, particularly among busy professionals and community members. Second, the study’s primary aim was to gather broad, actionable feedback on the tool’s perceived strengths and limitations rather than to conduct fine-grained psychometric analysis. Third, research has demonstrated that single-item measures can provide valid assessments of clearly defined, unambiguous constructs, particularly when the goal is comparative evaluation across user groups rather than absolute measurement.
Nevertheless, this methodological choice carries limitations. Single-item questions lack the internal consistency checks provided by multi-item scales and may be more susceptible to individual interpretation and response bias. They cannot capture the nuanced, multidimensional nature of constructs such as “usefulness” or “reliability,” which may encompass multiple sub-dimensions (e.g., usefulness for research versus teaching, or reliability in terms of accuracy versus consistency). Future iterations of this evaluation would benefit from incorporating validated instruments or developing domain-specific multi-item scales tailored to archaeological data access tools.
2.4.2. Participant Recruitment and Survey Administration
Participants were recruited through multiple channels to ensure representation across key stakeholder groups. Archaeology students were recruited through university mailing lists and course announcements at the University of York. Domain experts, including osteoarchaeologists, palaeopathologists, and commercial archaeological specialists, were recruited through professional networks, including the British Association for Biological Anthropology and Osteoarchaeology (BABAO) mailing list and direct contact with specialists known to work with grey literature. Members of the general public with an interest in archaeology were recruited through social media channels, online archaeology forums, and community archaeology groups.
The survey was administered online using Qualtrics. Before accessing the evaluation questions, participants were provided with a brief introduction to OPES, including a demonstration of the tool’s interface and functionality through screenshots and a short video walkthrough (approximately 2 min). Participants were then given the opportunity to interact with a live demo version of OPES applied to sample documents from the Crossrail archive before completing the evaluation questions.
The survey collected minimal demographic information to preserve anonymity while allowing for group-based analysis: participants were asked only to identify whether they considered themselves experts in osteoarchaeology/palaeopathology, archaeology students, or members of the general public with an interest in archaeology. No personally identifiable information was collected.
2.4.3. Success Criteria and Scoring Thresholds
For each of the five evaluation criteria, participants responded to a single statement using a 7-point Likert scale, where:
A score of 7 represents the highest level of agreement or satisfaction (Strongly Agree);
A score of 1 represents the lowest (Strongly Disagree);
A score of 4 represents a neutral position (Neither Agree nor Disagree).
The five criteria were operationalised through the following questions:
Do you think this tool is time saving?
Are the results reliable?
How accessible was this tool to use?
Would you like to use this tool again?
How useful is this tool for research for archaeologists?
The tool was considered to partially meet its success criteria if the mean and modal scores exceeded 3.5 (the midpoint of the scale) and to fully meet the criteria if scores exceeded 5.25 (representing three-quarters of the scale, or halfway between “Somewhat Agree” and “Strongly Agree”). Scores below 3.5 indicated a failure to meet expectations for that criterion. These thresholds were established a priori to provide clear benchmarks for interpreting user feedback.
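These a priori benchmarks can be restated as a small scoring rule. The function below is an illustrative paraphrase of the thresholds described above, not part of the original analysis code:

```python
def classify_criterion(mean_score, modal_score):
    """Map mean and modal Likert scores onto the pre-registered benchmarks:
    both > 5.25 -> fully met; both > 3.5 -> partially met; otherwise not met."""
    if mean_score > 5.25 and modal_score > 5.25:
        return "fully met"
    if mean_score > 3.5 and modal_score > 3.5:
        return "partially met"
    return "not met"

# Hypothetical worked examples on the 7-point scale:
full = classify_criterion(5.81, 7)      # high mean and mode
partial = classify_criterion(4.93, 5)   # above midpoint, below 5.25
```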
2.4.4. Statistical Analysis
Kruskal–Wallis tests were used to compare scores across the three participant groups (students, experts, public), as this non-parametric test is appropriate for ordinal Likert scale data with unequal group sizes. For criteria showing significant group differences (p < 0.05), post hoc pairwise comparisons were conducted using Mann–Whitney U tests with Bonferroni correction to control for multiple comparisons (α = 0.05/3 = 0.0167). All statistical analyses were performed using Python 3.x with the SciPy library.
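The omnibus test can be illustrated with a minimal pure-Python version of the Kruskal–Wallis H statistic (tie correction omitted for clarity); the published analysis used the SciPy implementation, and the three groups below are synthetic rather than survey data:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H for samples with no tied values (tie correction
    omitted): H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # ranks 1..N (no ties)
    h = 0.0
    for g in groups:
        rank_sum = sum(rank[v] for v in g)
        h += rank_sum ** 2 / len(g)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

# Three synthetic groups with clearly separated ratings.
h = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
bonferroni_alpha = 0.05 / 3  # corrected threshold for three pairwise tests
```

With scipy.stats.kruskal and scipy.stats.mannwhitneyu the same workflow also yields the p-values reported in Section 3.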
2.4.5. Methodological Limitations
Several limitations should be acknowledged regarding this evaluation approach. First, as noted above, the use of single-item measures limits the depth of construct measurement. Second, the sample was one of convenience rather than a systematically stratified random sample, which may introduce selection bias. Participants who volunteered to evaluate OPES may have pre-existing interest in digital tools or data accessibility, potentially inflating positive ratings. Third, the evaluation assessed perceived utility based on limited interaction with the tool rather than sustained real-world use, meaning long-term usability issues may not have been captured. Fourth, the study did not include formal task-based usability testing, which would have provided more objective performance metrics (e.g., time to complete specific searches, error rates). Finally, the relatively small number of general public participants (n = 15) limits the statistical power of between-group comparisons for this stakeholder group.
Despite these limitations, the evaluation provides valuable insight into how different user communities perceive OPES and offers actionable guidance for future development. The mixed-methods approach, combining quantitative ratings with qualitative feedback (where provided by participants), allows for both breadth of assessment and depth of understanding regarding the tool’s strengths and areas for improvement.
3. Results
A total of 83 participants evaluated the OPES tool against the success criteria outlined earlier (see Figure 3). The sample comprised mostly archaeology students, followed by domain experts and members of the general public.
3.1. Combined Results
When considering all participant groups collectively, the OPES tool was rated positively across most success metrics. The highest-scoring criterion was accessibility, with a mean score exceeding 5.25 and a modal score of 7 (Figure 4), indicating widespread agreement that the tool was easy to use and navigate. Usefulness and reliability also received favourable evaluations, with both criteria surpassing the 3.5 midpoint on the Likert scale, suggesting that most users found the tool informative and sufficiently accurate. Meanwhile, time-saving potential and willingness to use it again produced more variable responses. Although both were rated above the neutral threshold, they did not reach the levels of endorsement seen for accessibility or usefulness.
Table 1.
Statistical Comparison of Success Criteria Across Participant Groups.
| Criterion | Students Mean (SD) | Experts Mean (SD) | Public Mean (SD) | H | p-Value | Significance |
|---|---|---|---|---|---|---|
| Time-saving | 4.93 (1.45) | 4.12 (1.56) | 5.47 (0.92) | 7.51 | 0.0233 | * |
| Reliability | 4.93 (1.24) | 3.85 (1.43) | 5.73 (0.96) | 18.6 | 0.0001 | ** |
| Accessibility | 5.81 (1.53) | 5.12 (1.63) | 6.27 (1.53) | 9.21 | 0.0100 | * |
| Willingness to use again | 5.17 (1.58) | 3.46 (1.5) | 4.33 (2.19) | 14.92 | 0.0006 | ** |
| Usefulness | 5.33 (1.39) | 4.38 (1.42) | 6.33 (0.98) | 17.68 | 0.0001 | ** |
* p < 0.05; ** p < 0.001.
These aggregate results suggest that OPES successfully meets its goal of increasing access to grey literature and demonstrates general acceptability across user demographics. Nonetheless, further insights can be gleaned by examining the results of individual user groups.
Statistical testing was conducted to assess whether differences between participant groups were statistically significant. Given the ordinal nature of Likert scale data and unequal group sizes (Students n = 42, Experts n = 26, Public n = 15), non-parametric Kruskal–Wallis tests were employed rather than parametric Analysis of Variance (here onwards ANOVA). Results revealed significant differences between groups for all five success criteria (see Table 1).
Post hoc pairwise Mann–Whitney U tests with Bonferroni correction (α = 0.0167) were conducted to identify specific group differences. For reliability, both students (Mdn = 5.0, p = 0.001) and public participants (Mdn = 6.0, p < 0.001) rated the tool significantly higher than experts (Mdn = 4.0). For usefulness, public participants (M = 6.33, SD = 0.98) rated OPES significantly higher than both students (M = 5.33, SD = 1.39, p = 0.0001) and experts (M = 4.38, p < 0.001), with students also rating it higher than experts (p = 0.0001). For willingness to use again, students (M = 5.17) rated OPES significantly higher than experts (M = 3.46, p = 0.0006). For accessibility, public participants (M = 6.27) rated the tool significantly more accessible than experts (M = 5.12, p = 0.0001); student ratings (M = 5.81) fell between the two. Finally, for time-saving potential, public participants (M = 5.47, SD = 0.92) rated the tool significantly higher than experts (M = 4.12, SD = 1.56, p = 0.0233).
These results indicate that whilst OPES was well-received by students and the general public, domain experts were more critical across all evaluation criteria, particularly regarding reliability and their willingness to use the tool again.
3.2. Students
Among the participant groups, archaeology students were the most enthusiastic in assessing the OPES tool. They awarded high ratings for both accessibility and usefulness, with scores that fully met the success criteria thresholds (Figure 5). This suggests that students found the tool particularly supportive for learning and research. The remaining three criteria (reliability, time-saving, and willingness to use again) were also rated positively, each receiving average scores comfortably above the neutral midpoint. These results imply that OPES has strong potential to be integrated into academic settings, where students may benefit from its ability to surface relevant osteoarchaeological content without requiring exhaustive manual searches.
Statistical analysis confirmed that students evaluated OPES positively across all criteria, with mean scores ranging from 4.93 to 5.81 on the 7-point scale. Students rated the tool significantly higher than experts for reliability (M = 5.0 vs. 3.85, p < 0.001), usefulness (M = 5.33 vs. 4.38, p < 0.001), and willingness to use again (M = 5.17 vs. 3.46, p < 0.001).
3.3. Experts
In contrast, domain experts provided more reserved and critical evaluations of the OPES tool (Figure 6). While the criteria for usefulness, accessibility, and reliability were each partially met, they scored lower compared to the other two user groups. Experts were particularly sceptical regarding the willingness to use OPES again criterion, which received the lowest modal score (3) of all participant subsets, narrowly missing the threshold for partial success. This feedback may reflect higher expectations for precision, terminological nuance, and integration with established research methodologies. Though not dismissive of the tool’s potential, the expert evaluations suggest a need for refinement, especially in tailoring output to match the specific needs of professional archaeologists.
Domain experts provided consistently lower ratings than the other two groups across all five criteria (means ranging from 3.46 to 5.12). Statistical testing confirmed that experts’ ratings were significantly lower than the students’ for reliability (p = 0.001), usefulness (p = 0.0001), and willingness to use again (p = 0.0006). Experts also rated the tool significantly lower than public participants for all criteria except willingness to use again (reliability p = 0.001, usefulness p = 0.0001, accessibility p = 0.0001, time-saving p = 0.0233).
3.4. Public
The public participants, those without formal academic qualifications in archaeology, responded positively overall (Figure 7). Their scores for accessibility and usefulness were comparable to those provided by students, indicating that OPES was intuitive and informative even for non-specialist users. While their willingness to use again score was lower than that of students, it remained higher than that of experts, perhaps reflecting a more casual interest in archaeological content rather than a sustained research need. These responses highlight the potential for NLP and NER tools like OPES to be both accessible and useful, and thus to form platforms that foster broader engagement with heritage data and democratise access to archaeological information.
Public participants provided the most positive evaluations overall, with mean scores ranging from 4.33 to 6.33. Statistical analysis revealed that public participants rated OPES significantly higher than experts across four of five criteria: reliability (M = 5.73 vs. 3.85, p < 0.001), usefulness (M = 6.33 vs. 4.38, p < 0.001), accessibility (M = 5.73 vs. 3.85, p < 0.001), and time-saving (M = 5.47 vs. 4.12, p = 0.0233). Public participants also rated usefulness significantly higher than students (M = 6.33 vs. 5.33, p < 0.001), and their accessibility ratings differed significantly from students’ (M = 5.73 vs. 5.81, p = 0.0001).
These findings show that OPES was at least partially successful in meeting all five evaluation criteria across the three groups. Its strongest attributes lie in making grey literature more accessible and easily navigable, especially for students and the general public. These outcomes suggest significant promise for lightweight, ethically designed NLP tools in educational and outreach contexts while pointing toward improvement areas in expert-facing implementations.
4. Discussion
The evaluation of the OPES tool offers a window into both the promise and limitations of applying lightweight Natural Language Processing (NLP) and Named Entity Recognition (NER) methods within digital archaeology. With increasing reliance on data-intensive research methods, particularly those powered by LLMs, it is critical to examine how evaluation metrics and stakeholder responses should evolve to meet new technological benchmarks [33,34]. Recent developments in AI for cultural heritage have highlighted both opportunities and challenges in this rapidly evolving field [23,35].
4.1. Reliability
Reliability is a cornerstone of information extraction and semantic indexing. In this study, the OPES tool was only partially successful in this category. While students and the public found the tool adequately reliable, experts were more critical, likely reflecting their familiarity with nuanced terminology and expectations for domain specificity.
4.2. Usefulness
The usefulness of OPES was broadly affirmed, especially by students and members of the public. The tool’s ability to extract key osteoarchaeological and palaeopathological terms was valued for its potential in teaching and public engagement. These results suggest that even with older techniques, tailored domain models can significantly enhance access to heritage data.
Experts were more ambivalent, reflecting an emerging challenge: as LLMs redefine performance standards, professional expectations will likely continue to rise. Usefulness must now be evaluated not only in terms of functionality but also in relation to evolving standards of automation, semantic precision, and contextual sensitivity.
4.3. Time-Saving Potential
Participants rated OPES moderately on its time-saving potential. Again, students and public users saw more benefit than experts, perhaps due to differences in workflow integration. The variance in scores suggests that while OPES reduced the need for manual document review, the extracted data may not have always met expert standards for completeness or consistency.
Recent advances in few-shot and zero-shot learning with LLMs promise to address some of these issues by enabling more accurate and context-aware extraction from unstructured documents [13]. However, these benefits come at significant energy and infrastructure costs [36]. In contrast, tools like OPES offer a lean alternative for targeted applications, especially in educational settings or when dealing with large corpora of grey literature where high recall is more valuable than perfect precision.
4.4. Willingness to Use Again
This criterion emerged as the most challenging for OPES, particularly among experts. Although students and public users expressed moderate interest in future use, expert ratings suggest the tool did not yet meet expectations for repeat engagement. The lower scores here likely stem from concerns about reliability and integration rather than a rejection of the concept. These findings underscore the importance of iterative design: tools must adapt to meet the evolving needs of their users, and evaluation must consider not only immediate functionality but also longer-term engagement and trust.
4.5. Accessibility
Accessibility was OPES’s strongest performance area, with high scores across all user groups. The interface and functionality were well-received, confirming that lightweight NLP tools can serve as powerful enablers of access to archaeological information. As projects like ARIADNEplus and Heritage Connector have shown, enhancing access to cultural datasets is one of AI’s most impactful roles in the heritage sector [37,38].
That said, accessibility must also now contend with the expanding reach of conversational agents, chat-based search, and multi-modal LLMs that offer seemingly seamless user experiences. OPES’s success in this area affirms the value of minimalist design and focused functionality, particularly for academic and public sector deployments where simplicity and clarity are often more helpful than feature-saturated platforms.
The analysis suggests that while OPES may not match LLMs in computational sophistication, it successfully meets archaeological researchers’ needs at multiple levels. It also reinforces the ongoing importance of carefully defined evaluation metrics in determining “success” in the evolving landscape of AI for heritage.
4.6. Positioning OPES Within the Heritage AI Landscape
Recent work in heritage AI provides important context for evaluating OPES. Münster et al. [23] outlined a research agenda emphasising sustainable, interpretable AI solutions for cultural heritage that prioritise environmental considerations and human oversight. OPES aligns with these principles through its lightweight architecture and transparent methods, addressing concerns about the computational costs and opacity of transformer-based models. However, Münster et al. also highlight challenges that remain relevant: creating standardised datasets across diverse contexts, avoiding historical biases, and integrating community engagement. OPES currently addresses only English-language UK reports, limiting broader applicability.
Gattiglia [39] emphasises matching AI techniques to specific archaeological problems rather than adopting technology for its own sake. OPES embodies this “appropriate technology” principle by automating entity extraction to enable new synthetic analyses rather than simply digitising existing workflows. However, Gattiglia’s concerns about data quality apply directly: OPES depends on well-formatted PDFs and struggles with the limitations of MeSH vocabulary for archaeological disease classifications.
Moutsis et al. [40] employed deep learning for historical manuscript digitisation, facing similar challenges of domain-specific terminology and limited training data. Whilst their work focused on computer vision and technical performance metrics, OPES emphasises user-centred evaluation. This reflects a broader tension in heritage AI between optimising technical performance and ensuring real-world usability; both approaches have merit and could be productively combined in future work.
These comparisons reveal growing consensus that sustainability, interpretability, and ethical considerations should be prioritised alongside technical performance. OPES’s targeted focus on bioarchaeological grey literature exemplifies the field’s shift towards well-defined, domain-specific applications rather than general-purpose solutions. Future enhancements could include integration with linked open data infrastructures, multilingual capabilities, and community-engaged evaluation processes, building on the modular foundation established here.
4.7. Implications for Heritage AI
The evaluation of the OPES prototype highlights key tensions between technological capability, ethical responsibility, and practical usability in deploying AI systems for cultural heritage. While the landscape of natural language processing is rapidly shifting towards high-performance models such as GPT-4, BioBERT, and other transformer-based architectures [13,15], the findings of this study suggest that lightweight, interpretable models still have an important role to play, particularly when viewed through the lens of sustainability, reproducibility, and accessibility.
The relatively high scores OPES received from students and the public illustrate the tool’s value in opening up grey literature for non-specialist use. This is especially significant in archaeological contexts, where much data remains siloed in difficult-to-access PDF reports. While LLMs may offer more advanced functionality, their workflow integration often requires significant infrastructure, licensing arrangements, and technical support [20]. In contrast, OPES demonstrates that a leaner model can still yield meaningful improvements in data findability and reusability while remaining transparent, portable, and affordable.
The findings also underscore the continued importance of human-centred evaluation in heritage AI. As expectations rise in step with AI performance, evaluation frameworks must evolve accordingly, assessing not only accuracy but also user trust, accessibility, sustainability, and alignment with domain values [22,33]. The methodology used here, combining qualitative and quantitative feedback from diverse participant groups, offers a replicable model for this kind of robust and inclusive evaluation.
Finally, this study contributes to ongoing discussions around the environmental and ethical responsibilities of digital archaeologists and heritage scientists. As research communities increasingly recognise the carbon and social costs of large-scale AI [19,21], tools like OPES provide an important counterpoint: demonstrating that progress in digital archaeology does not always require cutting-edge infrastructure but can be driven by thoughtful design, user responsiveness, and methodological transparency. This aligns with current initiatives exploring sustainable AI approaches in archaeological research [23,37].
The evaluation of OPES must also be considered within the ethical framework established by the CARE principles. While the tool demonstrated strong accessibility scores, particularly among students and the public, future development must ensure that increased access does not come at the expense of ethical data stewardship. The current implementation respects existing access frameworks by working only with publicly accessible grey literature, but questions remain about how to incorporate community perspectives into automated data extraction systems. For instance, should certain types of pathological information be subject to additional access controls? How can descendant communities be involved in defining what information should be extractable? These questions highlight the tension between the technical goal of maximising data accessibility (FAIR) and the ethical imperative of respecting community authority and preventing harm (CARE).
One practical approach would be to develop tiered access levels within OPES, where basic anatomical information is freely searchable, but more sensitive pathological or contextual data requires additional authentication or community approval. This would align with emerging best practices in digital heritage [41] while maintaining the tool’s core accessibility benefits. Additionally, future versions could incorporate Indigenous data sovereignty protocols, such as Traditional Knowledge (TK) Labels developed by Local Contexts (localcontexts.org), to provide culturally appropriate attribution and use guidelines.
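To make the tiered-access idea concrete, the sketch below shows one possible shape such a filter could take. It is entirely hypothetical: the tier names, entity categories, and `visible_entities` function are illustrative inventions, not part of the OPES implementation.

```python
# Hypothetical sketch of tiered access to extracted entities; the tier
# names, category mapping, and function below are illustrative only.
from enum import IntEnum

class AccessTier(IntEnum):
    PUBLIC = 0          # basic anatomical terms, freely searchable
    AUTHENTICATED = 1   # sensitive pathological or contextual data
    COMMUNITY = 2       # requires community approval

# Example mapping of entity categories to the tier needed to view them;
# unknown categories default to the most restrictive tier.
CATEGORY_TIERS = {
    "anatomy": AccessTier.PUBLIC,
    "pathology": AccessTier.AUTHENTICATED,
    "funerary_context": AccessTier.COMMUNITY,
}

def visible_entities(entities, user_tier):
    """Return only the extracted entities the user's tier may see."""
    return [
        e for e in entities
        if CATEGORY_TIERS.get(e["category"], AccessTier.COMMUNITY) <= user_tier
    ]

results = [
    {"term": "femur", "category": "anatomy"},
    {"term": "osteomyelitis", "category": "pathology"},
]
# An unauthenticated user sees only the anatomical entity.
print(visible_entities(results, AccessTier.PUBLIC))
```

Defaulting unknown categories to the most restrictive tier errs on the side of caution, which matches the CARE emphasis on preventing harm over maximising access.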
5. Conclusions
This paper investigated whether lightweight, computationally modest Natural Language Processing (NLP) and Named Entity Recognition (NER) methods could enhance the reusability and accessibility of osteoarchaeological and palaeopathological data embedded within grey literature. In doing so, it introduced OPES (Osteoarchaeological and Palaeopathological Entity Search) as both a technical prototype and an ethical intervention, responding not only to the practical limitations of PDF-based heritage reports but also to the broader call for sustainable, responsible AI in archaeology.
The results indicate that OPES successfully meets its primary goals of improving findability and accessibility, especially among students and public users. Although its reliability and time-saving performance were more modest, particularly when assessed by domain experts, these findings reflect the broader challenges of designing tools that serve specialist and generalist communities. Significantly, OPES’s transparent, user-centred design was viewed positively, reaffirming the value of simplicity, explainability, and methodological rigour in heritage informatics.
Importantly, this study positions OPES within the broader discourse on ethical AI and environmental sustainability. As LLMs increasingly shape expectations around automation, accuracy, and functionality, there is a growing need to re-evaluate what constitutes successful digital innovation, particularly in fields like archaeology, where data preservation, interpretive nuance, and reproducibility are central concerns. In contrast to opaque and energy-intensive AI systems, OPES demonstrates how modest technical approaches can yield meaningful and scalable results when thoughtfully designed and rigorously evaluated.
The framework used to evaluate OPES also contributes to evaluation practice in heritage AI. By combining qualitative insights and quantitative metrics across diverse user groups, the study provides a replicable model for assessing AI tools in a way that respects the specific needs and values of the heritage sector. Such evaluation strategies will become increasingly important as tools are deployed in complex socio-technical environments where success cannot be reduced to accuracy scores alone.
Looking forward, OPES offers both a proof of concept and a platform for development that can inform future initiatives built on similar conceptual frameworks. Its modular design could be extended with selectively applied LLM components, enhanced with richer ontologies, or embedded into wider digital infrastructures such as the ADS or ARIADNE. However, any such advances must remain grounded in this project’s principles: transparency, inclusivity, and responsiveness to community needs. This work also highlights the importance of integrating ethical frameworks such as the CARE principles alongside technical standards like FAIR. As NLP and AI tools become more prevalent in heritage research, developers must proactively address questions of community rights, data sovereignty, and culturally appropriate access, ensuring that technological advancement serves both scholarly goals and ethical responsibilities.
In summary, OPES is a practical and principled response to the challenge of unlocking grey literature in bioarchaeology. It shows that when paired with ethical awareness and robust evaluation, small, targeted interventions can contribute significantly to the FAIR data agenda and help build a more equitable and sustainable digital archaeology.