Article

Automatic Metadata Extraction Leveraging Large Language Models in Digital Humanities

by Adriana Morejón 1,*, Borja Navarro-Colorado 2, Carmen García-Barceló 1, Alberto Berenguer 3, David Tomás 2 and Jose-Norberto Mazón 2

1 Institute for Computer Research (IUII), University of Alicante, Carr. de San Vicente del Raspeig, S/N, 03690 San Vicente del Raspeig, Spain
2 Department of Software and Computing Systems, University of Alicante, Carr. de San Vicente del Raspeig, S/N, 03690 San Vicente del Raspeig, Spain
3 INFERIA SOLUTIONS SL, University of Alicante Science Park, Carr. de San Vicente del Raspeig, S/N, 03690 San Vicente del Raspeig, Spain
* Author to whom correspondence should be addressed.

Electronics 2025, 14(24), 4962; https://doi.org/10.3390/electronics14244962
Submission received: 16 October 2025 / Revised: 20 November 2025 / Accepted: 3 December 2025 / Published: 18 December 2025

Abstract

DCAT-based data ecosystems, such as open data portals and data spaces, have shown their potential to foster data economy by supporting the FAIR (Findability, Accessibility, Interoperability, Reusability) principles. Nevertheless, there are domains where metadata are tailored to specific semantics of the domain, resulting in the absence of DCAT-based catalogs that adhere to FAIR principles. A particularly relevant case is that of the digital humanities, where texts encoded in TEI (Text Encoding Initiative) constitute a consolidated standard in the field of literature. However, TEI metadata are not always well aligned with the FAIR principles, nor easily integrated into interoperable catalogs that enable seamless combination with external datasets. To address this gap, our approach aims to (i) generate DCAT catalogs derived from TEI by identifying which metadata can be mapped and how, and (ii) explore the use of Large Language Models (LLMs) to assist in the generation and enrichment of metadata when transforming TEI to DCAT. Our approach contributes to catalog-level harmonization, enabling domain-specific standards such as TEI to be aligned with cross-domain standards like DCAT, thus facilitating adherence to FAIR principles.

1. Introduction

Catalogs of data ecosystems (such as open data portals and federated data spaces) have demonstrated considerable potential for advancing the data economy [1] by operationalizing the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles [2]. The DCAT vocabulary is specifically intended to facilitate interoperability between data catalogs published on the Web [3]. For the sake of operational clarity, FAIR principles could be considered as measurable properties realized on DCAT artifacts rather than as abstract aspirations. Specifically, findability is related to globally unique/persistent identifiers and indexed catalog records (e.g., dcterms:identifier, dcterms:title, dcat:keyword); accessibility is related to resolvable, standards-based access points and explicit permissions (e.g., dcat:accessURL/dcat:downloadURL, dcterms:license); interoperability corresponds to machine-actionable Web vocabularies and profile conformance that structure a layered catalog (Catalog–Dataset–Distribution) and declare adopted standards (e.g., dcterms:conformsTo); and reusability is considered via explicit rights, provenance, and domain semantics (e.g., dcterms:rights, prov:qualifiedAttribution).
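To make this correspondence concrete, the following minimal sketch (not part of the proposed pipeline; the dataset URIs, title, and license are illustrative placeholders) shows how such FAIR-related properties can be attached to a DCAT record using the rdflib Python library:

```python
# Minimal sketch, not the project pipeline: a DCAT dataset record carrying the
# FAIR-related properties listed above, built with rdflib.  All URIs, the
# title, and the license below are illustrative placeholders.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("https://example.org/dataset/novel-0001")        # hypothetical dataset
dist = URIRef("https://example.org/dataset/novel-0001/tei")  # hypothetical distribution

# Findability: identifier, title, keywords
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.identifier, Literal("novel-0001")))
g.add((ds, DCTERMS.title, Literal("Example novel (TEI edition)", lang="en")))
g.add((ds, DCAT.keyword, Literal("19th-century novel")))

# Accessibility: a distribution with an access point and an explicit license
g.add((ds, DCAT.distribution, dist))
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.accessURL, URIRef("https://example.org/tei/novel-0001.xml")))
g.add((ds, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by/4.0/")))

# Interoperability and reusability: declared standard and language
g.add((ds, DCTERMS.conformsTo, URIRef("https://tei-c.org/")))
g.add((ds, DCTERMS.language, Literal("es")))

print(g.serialize(format="turtle"))
```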
However, many scholarly domains remain underrepresented in the adoption of the DCAT standard for adherence to FAIR principles, primarily because their metadata practices are deeply rooted in domain-specific conventions. Such is the case with TEI (Text Encoding Initiative) (https://tei-c.org/, accessed on 10 September 2025) in the digital humanities, MIAME (https://www.fged.org/projects/miame/, accessed on 10 September 2025) in bioinformatics, or CIDOC-CRM (https://cidoc-crm.org/, accessed on 10 September 2025) in cultural heritage. This misalignment often results in valuable datasets being maintained in isolated silos, hindering their discoverability, accessibility, and reuse within broader data ecosystems [4,5]. As noted in recent FAIR implementation studies [6,7], the lack of standardized metadata exchange mechanisms across disciplines continues to be a major obstacle to realizing the full value of data and to fostering its integration into interoperable data ecosystems.
One exemplary domain is the digital humanities. It is particularly relevant because, with the implementation of digital technology, there is an increasing application of empirical and quantitative research methods, focused on data analysis and interpretation, combined with more traditional hermeneutic methods. This shift is evident, for instance, in the application of distant reading methodologies to literary studies [8,9,10]. Consequently, numerous diverse resources and applications are being developed (such as corpora and databases) for various fields, including history, archaeology, and literature. However, the fragmentation and heterogeneity characteristic of the humanities and social sciences are leading to significant challenges in resource reuse and interoperability. Therefore, the FAIR principles are currently being strongly promoted within the Humanities.
In Digital Humanities, a substantial body of textual and manuscript scholarship is encoded using the TEI standard. Indeed, TEI has been the dominant standard for text encoding in the humanities since its establishment in 1987. TEI offers deep expressiveness for representing textual, editorial, and even interpretive features in literary works. Yet, the richness of TEI metadata often hinders seamless alignment with cross-domain interoperability models, complicating integration with open data catalogs and obstructing linkage with external datasets [11].
Beyond the imperative of visibility and scholarly access, there is a technical and strategic rationale for incorporating literary corpora into DCAT-based catalogs: data spaces. In the European policy context (https://digital-strategy.ec.europa.eu/en/policies/data-spaces/, accessed on 12 September 2025), data spaces are posited as foundational components for an interconnected data economy, where data are shared securely under common rules without ceding sovereignty. Data spaces offer a federated, sovereign model for data sharing across institutions, allowing data providers (such as publishers, libraries, archives, or universities) to retain full control over their resources while participating in a larger interoperable ecosystem.
For instance, consider a consortium of national libraries and university archives that maintain TEI-encoded editions of 19th-century literary works. Each institution stores its own corpus, with metadata structures tailored to internal cataloging conventions. By aligning these TEI metadata with DCAT and publishing them within a federated data space, the institutions could expose their collections in a unified catalog without relinquishing control over the underlying resources. This integration would allow researchers to discover and query literary datasets across institutions (e.g., all TEI-encoded editions of European Romantic poetry) through a single metadata endpoint. At the same time, each library would preserve data sovereignty of their own editions, since the actual TEI files and digital objects remain hosted within their own infrastructure. Such an arrangement not only enhances visibility and reuse but also facilitates cross-collection analysis, digital preservation, and semantic enrichment by enabling interoperability with other cultural heritage data (e.g., Wikidata, Europeana, or CLARIN resources).
A more contemporary use case involves publishing companies and research institutions collaborating within a data space to share copyrighted literary corpora under controlled access conditions. For instance, publishers holding rights to 21st-century novels could expose structured metadata (such as author information, thematic classifications, or linguistic features) within a data space with a DCAT-based catalog, while restricting direct access to the full text. Through data sovereignty mechanisms (e.g., access control, usage policies, and smart contracts), these publishers could grant selective access to specific researchers or projects, such as computational linguistics groups studying narrative structures or sentiment patterns. In this way, DCAT serves as the interoperability layer that enables metadata discoverability across the ecosystem, while the data space infrastructure enforces usage policies and intellectual property rights, ensuring that sensitive or copyrighted materials are shared transparently, ethically, and in compliance with licensing terms. This model not only promotes responsible reuse of contemporary literary data, but also exemplifies how FAIR principles can be operationalized in domains where data openness must coexist with data sovereignty.
Therefore, DCAT is useful because it is the W3C Web standard for interoperable data catalogs and the vocabulary adopted by not only open data portals but also data space ecosystems. In practice, these data infrastructures expose and harvest DCAT to support cross-catalog discovery, federation, and validation, and dataspace protocols reuse DCAT catalogs as discoverable endpoints. Aligning TEI to DCAT thus provides immediate, standards-based benefits: machine-actionable records; consistent exposure of identifiers, licensing, and access points; cross-portal search via common properties (title, description, keywords/themes, spatial/temporal coverage); and smoother linkage to external datasets and services. In short, DCAT alignment operationalizes FAIR principles for TEI corpora and integrates them into the existing ecosystems of open data and data spaces.
In this landscape, the present work proposes an approach for transforming TEI metadata into the DCAT standard, thereby enabling the inclusion of literary corpora within open and federated catalogs. The study further explores the use of Large Language Models (LLMs) to assist in the automatic extraction, generation, and enrichment of metadata during the proposed TEI-to-DCAT mapping process. The goal of this work is twofold: to foster catalog-level harmonization between domain-specific and cross-domain standards, and to improve the discoverability, interoperability, and reuse of literary resources across the broader data ecosystem.
The core problem this research attempts to examine regarding TEI data in the Digital Humanities domain is the mismatch between TEI’s richly annotated, project-specific metadata and the cross-domain catalog standards (e.g., DCAT) that underpin FAIR, machine-actionable ecosystems. TEI excels at encoding textual, editorial, and interpretive features, but this flexibility yields heterogeneous practices that achieve self-compatibility rather than interoperability, leaving corpora in isolated silos that are difficult to discover, harvest, and link across portals. Consequently, TEI collections struggle to participate in infrastructures that require standardized catalogs, persistent identifiers, explicit licenses, and policy-aware access. This paper addresses that gap by mapping and enriching TEI metadata to the DCAT vocabulary so that literary datasets become interoperable, harvestable, and reusable across open data portals and sovereign data-sharing environments (i.e., data spaces).
The mapping methodology was applied to a bilingual corpus of 200 novels, comprising 100 in Spanish and 100 in English, extracted from the European Literary Text Collection (ELTeC) [12,13] (https://github.com/COST-ELTeC, accessed on 10 September 2025). Originally encoded in TEI format, each novel was transformed into DCAT format using the methodology developed in this work. Additional DCAT-related metadata fields not straightforwardly mapped from the original TEI files (e.g., theme, keywords, locations, characters) were extracted from the body of the novels using LLMs. The resulting metadata were evaluated by a literary expert to assess the quality of the LLM-based extraction process. All the code and datasets developed during this research are freely accessible to the scientific community (https://github.com/amorejon6/electronics.git, accessed on 15 October 2025).
This work addresses the interoperability gap between the rich TEI metadata prevalent in digital literature and the interoperable DCAT-based catalogs that enable FAIR adoption in open and federated data ecosystems. The central research question is how to perform this automatic transformation of TEI metadata to DCAT while maximizing alignment with FAIR. The main contribution of this study is to propose and experimentally validate a hybrid pipeline that combines expert mapping rules with automated generation of literary metadata using LLMs, facilitating the inclusion and discovery of literary works in interoperable infrastructures and openly releasing the resources developed so as to enable the replicability and evolution of the proposed approach.
The remainder of this paper is structured as follows: Section 2 reviews previous work on the publication of FAIR data and metadata extraction; Section 3 describes the methodology followed to map from TEI to DCAT and the metadata extraction procedure using LLMs; Section 4 presents the evaluation, results and discussion of the metadata extraction process; finally, Section 5 presents the conclusions and outlines future work.

2. Related Work

This section describes previous work in areas related to the study presented in the paper: open data publication, with a focus on the digital humanities, and metadata extraction using rule-based and machine learning techniques.
On the publication of data, several complementary approaches enable open, machine-actionable exposure of metadata at scale. The FAIR Data Point (FDP) [7] provides a reference architecture and services to publish semantically rich, machine-actionable metadata that can be harvested by portals and aggregators. In implementing open data portals, CKAN serializes and exposes DCAT catalogs, enabling bidirectional exchange between institutional, national, and thematic catalogs [14]. Application profiles extend this interoperability to domain needs: DCAT-AP [15] underpins cross-portal aggregation across the EU public sector, while GeoDCAT-AP [16] and StatDCAT-AP [17] tailor DCAT for geospatial and statistical resources without sacrificing cross-catalog compatibility. Furthermore, exposing datasets alongside DCAT is usual for web-scale discovery. For instance, Google Dataset Search relies on structured dataset markup published by providers [18].
At the standards level, the W3C Data Catalog Vocabulary (DCAT) 3 remains the foundation for interoperability between data catalogs on the Web, and the latest DCAT-AP 3.0 release aligns with DCAT 3 to support cross-border, cross-domain aggregation in Europe [3,15]. In order to support the FAIR principles [2], DCAT foregrounds persistent identifiers, explicit licensing, and machine-actionable metadata as prerequisites for findability and reuse across heterogeneous open data portals.
Beyond open portals, data spaces extend publication into federated, sovereignty-preserving ecosystems where organizations retain control while participating in interoperable exchange under shared rules and trust frameworks. The European vision frames sectoral data spaces as policy-governed environments for cross-border data sharing that build on web standards and catalog interoperability [19]. On the protocol layer, the emerging Eclipse Dataspace Protocol (DSP) specifies how participants (i) publish DCAT catalogs as discoverable endpoints, (ii) negotiate usage policies and agreements (often expressed with ODRL), and (iii) orchestrate controlled transfers across connectors, providing a standards-based bridge from catalog discovery to governed access; parallel initiative reports from IDSA document the standardization trajectory and governance of DSP [20,21].
In the field of digital humanities, several initiatives have explored how data can be represented and interconnected following FAIR principles. Notable efforts include the LODE framework [22], which facilitates linking local humanities collections to the Web of Data, mythLOD [23], which publishes cultural heritage datasets as Linked Data through processes of data cleaning, reconciliation, and visualization, and OntoPoetry, a Linked Open Data-based ontology for poetry (https://postdata.linhd.uned.es/results/ontopoetry-v2-0/, accessed on 12 September 2025).
Several open infrastructures and platforms have also emerged to support data publication and semantic interoperability in the humanities. DraCor (Drama Corpora) [24] provides a FAIR-compliant ecosystem for the computational study of European drama, offering TEI-encoded corpora accessible through APIs. The project DraCorOS (https://oscars-project.eu/projects/dracoros-fostering-open-science-digital-humanities-connecting-dracor-ecosystem-eosc, accessed on 12 September 2025) extends this infrastructure to connect with the European Open Science Cloud (EOSC), promoting open science practices in literary studies. Projects such as LODI4DH [25] and ISIDORE (https://isidore.science, accessed on 12 September 2025) represent significant infrastructures in the field of open data for digital humanities. These initiatives aim to provide interoperable access to cultural and scholarly data through Linked Data technologies, RDF serialization, and metadata enrichment services. LODI4DH focuses on building a national Linked Open Data infrastructure for the humanities, while ISIDORE aggregates and enriches metadata from thousands of sources in the humanities and social sciences, offering a SPARQL endpoint for querying RDF data.
However, these infrastructures do not explicitly adopt DCAT as their primary metadata schema for dataset cataloging. That limitation reduces interoperability with other open data ecosystems, particularly those in scientific or governmental domains where DCAT is the de facto standard for dataset discovery and exchange. The absence of DCAT-compliant descriptions makes it more difficult to integrate humanities datasets into broader data catalogs, automate dataset harvesting, or ensure consistent metadata alignment with repositories and portals such as the European Open Science Cloud (EOSC). A DCAT-based approach, such as the one followed in this paper, enables richer cross-domain interoperability, improved machine-readability, and a more seamless integration of digital humanities data within the wider open data landscape.
Metadata extraction, in turn, is a fundamental aspect of the management, discovery, and reuse of digital content in scientific and business environments. Over time, different methodological approaches have emerged to address this challenge, each with its own particular characteristics and scope.
Early approaches to automatic metadata extraction were based on explicit rules, such as regular expressions, syntactic patterns, and format-specific heuristics. These solutions work well in domains with well-defined structures. Early versions of traditional tools such as ParsCit [26] and GROBID [27] employed this kind of strategy to identify metadata in documents. Nevertheless, rule-based methods rely heavily on specialists to develop and continuously update a comprehensive set of rules. Moreover, their scalability is limited, as each new domain requires the creation, implementation, and ongoing maintenance of a fresh rule set [28].
Machine learning marks a turning point by enabling direct learning from large annotated corpora. Initially, classic supervised models such as Conditional Random Fields (CRF) and Support Vector Machines (SVM) were applied for sequence labeling and text fragment classification [29,30]. Extraction tools such as GROBID [27] and Science Parse [31] used CRF and, increasingly, deep learning models such as BiLSTM-CRF [32] or Transformer architectures trained specifically for scientific documents (e.g., SciBERT) [33]. The main advantage of these models is their ability to adapt to a variety of formats, citation styles, and languages. Multimodal models, which combine textual and visual analysis, have achieved substantial improvements in robust metadata extraction, even in scanned PDFs or documents with embedded tables and images.
Recently, LLMs have driven a new transition in the field. These models not only understand textual content but can also follow complex instructions and generate structured annotations. They are applied with great success to extract metadata in multiple languages, associate complex attributes (authors, affiliations, abstracts, DOIs, sections) and validate the consistency of the extracted metadata with the required schemas. Compared to rule-based approaches, which require domain experts to design and continuously maintain large sets of hand-crafted rules, LLM-based systems scale more easily across domains and formats. A single model can be prompted to extract different metadata schemas or adapt to new document genres and languages without redesigning the extraction pipeline. This property makes LLMs particularly suitable for heterogeneous and evolving corpora such as TEI-encoded literary collections, where layout conventions and annotation practices could vary across projects. In this sense, the main advantage of LLMs lies in their flexibility and reduced maintenance cost: instead of rewriting rules when formats change, providers can update prompts or examples while reusing the same underlying model.
For example, the MOLE tool [34] demonstrates how an LLM-based system can outperform traditional solutions in terms of coverage and accuracy, extracting more than thirty metadata fields and handling large volumes of multilingual documents. Furthermore, these models have the ability to learn new rules and adapt to evolving formats without requiring manual reprogramming or frequent retraining, positioning LLMs as the state-of-the-art in scientific and literature metadata extraction [35].
In this context, the use of LLMs becomes a fundamental tool for enriching metadata in the realm of digital humanities [36]. One of the goals of the present work is to leverage these models to enhance the DCAT representation with labels that were not originally present in the TEI metadata format. To this end, LLMs are applied to the full text of the novels, rather than to the TEI metadata, in order to extract additional information and generate the final set of DCAT labels. The following section describes the procedure in more detail.

3. Methodology

This section describes the methodology followed to map TEI metadata to DCAT format. The mapping is defined according to three alignment criteria between TEI and DCAT: (i) structural conformance (valid DCAT records with required/recommended properties); (ii) semantic fidelity (preservation of TEI roles, provenance, and edition/distribution distinctions, e.g., representing TEI distributors with PROV qualified attributions; the provenance or business context of a dataset can be described in DCAT by using elements from the W3C Provenance Ontology: https://www.w3.org/TR/prov-o/, accessed on 12 September 2025); and (iii) functional viability (successful harvesting, discovery, and link resolution in DCAT-based portals and, where applicable, publication via dataspace connectors with policy-governed access). The mapping criteria prioritize a subset of DCAT properties that are highly adopted across open data portals and European DCAT-AP harvesters, according to the Metadata Quality Assessment (MQA) methodology (https://data.europa.eu/mqa/methodology, accessed on 12 September 2025). Specifically, dcterms:title and dcterms:description are foundational for human-readable discovery and ranking; dcterms:creator and dcterms:publisher capture provenance and responsibility; dcterms:license and dcterms:identifier operationalize reuse and unambiguous access across catalogs; dcterms:language supports multilingual discovery and harmonization; dcat:keyword (free tags) and dcat:theme (controlled vocabulary concepts) together enable both free-text and controlled semantic description; and dcterms:spatial, dcterms:temporal, and dcterms:type provide the most common faceted filters used by aggregators and users to assess relevance at scale.
These elements are explicitly mandated or strongly recommended in DCAT-based open data portals (e.g., those using CKAN) that rely on minimal yet informative metadata. Consequently, mapping TEI to this subset of DCAT properties aims to achieve efficient adherence to the FAIR principles. Furthermore, the distributor property of TEI has also been mapped to DCAT, since it identifies the entity responsible for disseminating the digital resource, distinct from the creator or publisher. Preserving this role explicitly in DCAT avoids collapsing distribution responsibilities into generic provenance fields and enables machine-actionable statements about who provides access under what role. Using prov:qualifiedAttribution with prov:agent (e.g., https://zenodo.org/, accessed on 12 September 2025) and dcat:hadRole captures the distributor as an accountable agent and links that agent to a precise distribution role, consistent with the attribution pattern recommended by DCAT.
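For illustration, the following hedged sketch shows one way to express this qualified-attribution pattern with rdflib; the dataset URI is a placeholder, and the Library of Congress relator code "dst" (distributor) is used only as one possible role vocabulary, not necessarily the one adopted in the released catalogs:

```python
# Sketch of the qualified-attribution pattern for a TEI <distributor>; the
# dataset URI is a placeholder and the Library of Congress relator "dst"
# (distributor) is used only as one possible role vocabulary.
from rdflib import BNode, Graph, Namespace, URIRef
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
dataset = URIRef("https://example.org/dataset/novel-0001")  # hypothetical

attribution = BNode()
g.add((dataset, PROV.qualifiedAttribution, attribution))
g.add((attribution, RDF.type, PROV.Attribution))
g.add((attribution, PROV.agent, URIRef("https://zenodo.org/")))  # the distributing agent
g.add((attribution, DCAT.hadRole, URIRef("http://id.loc.gov/vocabulary/relators/dst")))

print(g.serialize(format="turtle"))
```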
Initially, the mapping required two experts: a computational linguist with expertise in corpus annotation and TEI (and prior involvement in ELTeC’s original annotation), and a DCAT-specialized computer engineer. The mapping process was discussed collaboratively to identify TEI header elements that could be directly mapped to the aforementioned DCAT properties and those that could not. Additionally, considering that the ELTeC corpus contains the full textual content of the novels, a new set of DCAT labels was proposed to enrich the linguistic and literary information of each item, including elements such as the characters in the literary work. These additional labels were extracted using an LLM, as described later in this section.
Table 1 presents the mapping of TEI elements to semantically equivalent properties in DCAT. Viable alternative mappings that would preserve meaning without misrepresentation were not identified, given the different design focus of the standards (TEI is a rich textual markup schema, whereas DCAT is a catalog/dataset/distribution metadata vocabulary supporting FAIR principles). Consequently, the adopted mapping targets only those concepts with clear semantic alignment (e.g., authorship, title, and publisher). Beyond this core mapping, additional TEI elements may admit multiple reasonable mappings or require qualified patterns (e.g., PROV and profiles), which is out of the scope of this paper.
The following list describes the meaning of each of these fields in TEI and their corresponding equivalents in DCAT:
  • <titleStmt><title> to dcterms:title: The <title> element, within the <titleStmt> (Title Statement), represents the main title of the text or work. It provides the name by which the document is identified and is part of the TEI header’s bibliographic description. dcterms:title specifies the name of the dataset or resource in DCAT (Dublin Core Terms). It serves the purpose of giving the resource a human-readable title.
  • <titleStmt><author> to dcterms:creator: The <author> element identifies the person or organization responsible for the intellectual content of the text. It can include attributes such as affiliation or authority control references. dcterms:creator denotes the primary entity (person or organization) responsible for creating the dataset or resource, which is conceptually equivalent to TEI’s <author>.
  • <publicationStmt><publisher> to dcterms:publisher: The <publisher> element, within <publicationStmt> (Publication Statement), identifies the entity responsible for making the text available, typically a publishing house, institution, or digital repository. dcterms:publisher represents the organization that publishes or distributes the dataset. It provides provenance information and supports resource citation and discovery.
  • <publicationStmt><distributor> to prov:qualifiedAttribution [...]: The element <distributor> designates the entity responsible for distributing the text, such as an archive, digital library, or platform hosting the resource. The DCAT model does not have a direct distributor field. Instead, the relationship is represented through a PROV-O qualified attribution (prov:qualifiedAttribution), which links the dataset to an agent (e.g., Zenodo) and specifies its role (dcat:hadRole) as “distributor”. This structure allows richer provenance representation, following the W3C Provenance Ontology (PROV-O, https://www.w3.org/TR/prov-o/, accessed on 12 September 2025).
  • <publicationStmt><availability> to dcterms:license: The <availability> element provides information about the access conditions, rights, or licensing terms of the text (e.g., open access, restricted use, or copyright statements). dcterms:license indicates the license under which the dataset is distributed. It typically contains a URI pointing to a standard license (e.g., Creative Commons).
  • <publicationStmt><ref type=“doi”> to dcterms:identifier: The <ref> element with type=“doi” represents a reference to the Digital Object Identifier (DOI) assigned to the text, providing a persistent identifier for citation and retrieval. dcterms:identifier is used to store a unique identifier for the dataset, such as a DOI, Handle, or other persistent identifier schemes.
  • <langUsage> to dcterms:language: The <langUsage> element describes the languages used in the text, often including subelements like <language> to specify the ISO language code and sometimes the role or proportion of each language. dcterms:language identifies the language(s) of the dataset or resource, generally using standard codes such as ISO 639-1 or ISO 639-3 [37,38].
A hybrid strategy has been developed for generating DCAT metadata from documents encoded in TEI. The process consists of two phases: direct mapping of structured properties from the TEI format, and automatic generation of complex metadata by applying LLMs to the entire textual content of the document, i.e., the content of the <text> element of the TEI-encoded document.
On the one hand, TEI metadata that have direct equivalents are mapped directly to DCAT. In these cases, the transformation involved only the normalization of values and the translation of labels between schemas, ensuring syntactic and semantic consistency. The mapping has been based on rules, starting from the equivalent properties described in Table 1.
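As an illustration of this rule-based phase, the sketch below (a simplification, not the released implementation; the XPath expressions and helper names are assumptions derived from Table 1 and the standard TEI header layout) extracts the directly mappable header values with Python’s ElementTree and relabels them with their DCAT equivalents:

```python
# Simplified sketch of the rule-based phase; the XPath expressions and helper
# names are assumptions derived from Table 1 and the standard TEI header layout,
# not the released implementation.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# TEI header path -> target DCAT / Dublin Core property (cf. Table 1)
RULES = {
    "tei:fileDesc/tei:titleStmt/tei:title": "dcterms:title",
    "tei:fileDesc/tei:titleStmt/tei:author": "dcterms:creator",
    "tei:fileDesc/tei:publicationStmt/tei:publisher": "dcterms:publisher",
    "tei:fileDesc/tei:publicationStmt/tei:availability": "dcterms:license",
    "tei:fileDesc/tei:publicationStmt/tei:ref[@type='doi']": "dcterms:identifier",
    "tei:profileDesc/tei:langUsage/tei:language": "dcterms:language",
}

def map_tei_header(tei_path: str) -> dict:
    """Extract directly mappable TEI header values and relabel them as DCAT properties."""
    header = ET.parse(tei_path).getroot().find("tei:teiHeader", TEI_NS)
    record = {}
    if header is None:
        return record
    for xpath, dcat_property in RULES.items():
        element = header.find(xpath, TEI_NS)
        if element is not None:
            # Normalization kept minimal here: collapse whitespace in the text content.
            record[dcat_property] = " ".join("".join(element.itertext()).split())
    return record

# Example (hypothetical file name): record = map_tei_header("ELTeC-spa/SPA0001.xml")
```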
On the other hand, LLMs are used for automatic metadata extraction of properties that present semantic ambiguity, lack structure, or require contextual inference from the full text. The method consists of applying a specialized prompt to the full text of each novel: this prompt (Figure 1) precisely defines the metadata extraction instructions. Domain-specific restrictions were explicitly integrated into the prompting process to constrain the generative behavior of the LLM. Each metadata field was accompanied by a precise definition and, when applicable, a closed list of possible values (for instance, a predefined set of literary subgenres for dct:type). The prompt also included explicit instructions to produce the metadata in the same language as the input novel and to output “Not identified” whenever a value was not explicitly stated or could not be confidently inferred.
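The following sketch paraphrases how such a constrained prompt could be assembled and sent to gpt-5-mini; the exact wording used in the experiments is the one shown in Figure 1, and the API call is only one possible way to request JSON output, not necessarily the authors’ exact client code:

```python
# Hedged paraphrase of the extraction prompt (the exact wording is given in
# Figure 1); the OpenAI call shown is one possible way to request structured
# JSON output from gpt-5-mini, not necessarily the authors' exact client code.
from openai import OpenAI

NOVEL_TYPES = [
    "sentimental", "gothic or horror", "mystery", "historic",
    "victorian realist or naturalist", "novel of manners", "adventure",
    "science fiction", "social", "erotic",
]

PROMPT_TEMPLATE = """You are given the full text of a novel.
Return a JSON object with the following fields, written in the same language as the novel:
- description: a summary of the novel, at most four or five sentences.
- keyword: up to five relevant topics describing central aspects of the text.
- theme: the main literary category or theme of the novel.
- spatial: the (possibly fictional) locations where the action takes place.
- temporal: the historical time or period in which the action takes place.
- type: exactly one value from this closed list: {types}.
- mainCharacters: up to four objects, each with "name" and a 2-3 sentence "profile".
If a value is not explicitly stated and cannot be confidently inferred, output "Not identified".

Novel text:
{text}"""

def extract_metadata(novel_text: str) -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(types=", ".join(NOVEL_TYPES), text=novel_text),
        }],
        response_format={"type": "json_object"},  # ask for structured JSON output
    )
    return response.choices[0].message.content
```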
To guarantee the reliability of the automatically generated metadata, different quality-control mechanisms can be applied. First, all outputs are produced in a structured JSON format, enabling automatic validation of field presence, type, and syntax. Second, the extracted metadata can be programmatically checked for completeness and conformance with the DCAT schema by using interoperability checking tools such as the Interoperability Test Bed from the European Commission. (https://interoperable-europe.ec.europa.eu/collection/interoperability-test-bed-repository/solution/interoperability-test-bed, accessed on 12 September 2025). Third, a human-in-the-loop evaluation is performed by a literary expert, who assesses the coherence and factual accuracy of the generated metadata. This validation strategy, combining automated structural checks with expert qualitative review, ensures both reproducibility and consistency across the corpus, while mitigating the typical risks of hallucination and inconsistency associated with LLM-based generation.
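As an example of the first, automatic check, the following sketch validates the presence and basic typing of each metadata field before the record is converted to DCAT; the field names and expected types are assumptions based on the properties listed in this section:

```python
# Illustrative first-level check of the LLM output: field presence and basic
# typing before DCAT conversion.  Field names and expected types are assumptions
# based on the properties described in this section.
import json

EXPECTED_FIELDS = {
    "description": str,
    "keyword": list,
    "theme": str,
    "spatial": list,
    "temporal": str,
    "type": str,
    "mainCharacters": list,
}

def validate_record(raw_json: str) -> list[str]:
    """Return a list of structural problems; an empty list means the record passes."""
    try:
        record = json.loads(raw_json)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"unexpected type for {field}: {type(record[field]).__name__}")
    return problems
```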
The specific DCAT properties that have been generated through the application of the LLM are the following:
  • dcterms:description: Provides descriptive information about a resource, which may include, but is not limited to, an abstract, a table of contents, a graphical representation, or a free-text account of the resource. The value of this field is typically a text, and it is often used in metadata records for digital resources to give contextual or summary information. As regards its use in this paper, the field consists of a summary or description of the novel, no more than four or five sentences long.
  • dcat:keyword: A keyword or tag that describes the resource. Its values are simple labels or tags, usually text, that help characterize and index resources for search and discovery purposes. This property is used to improve resource discoverability by attaching relevant terms associated with the data resource. Relating to the type of documents that have been worked on, this field indicates a list of relevant topics that describe central aspects of the text, with a maximum of five elements.
  • dcat:theme: The main category or theme of the resource. The purpose of this field is to assign a structured subject category or thematic domain to the dataset, which supports thematic navigation and improves discoverability in catalogs. In the context of digital documents, it is the main theme or literary categories that fit the content.
  • dcterms:spatial: The geographic area or spatial region covered by a resource. Its value is typically an instance of dcterms:Location, or a link to a resource describing a location, commonly referencing authoritative gazetteers like GeoNames (https://www.geonames.org/, accessed on 12 September 2025). Regarding the use case, it is the (fictional) geographic location where the action takes place. Based on the plain-text locations generated by the LLM, the GeoNames API is used to obtain the Internationalized Resource Identifier (IRI) for each location (a lookup sketch is given after this list). If a location cannot be found in GeoNames, it is assumed to be imaginary or too specific and therefore cannot be represented using the formats supported in this field. In these special cases, the place name is left in plain text.
  • dcterms:temporal: The temporal characteristics of the resource; its value is expected to be of the type dcterms:PeriodOfTime. This means it refers to a temporal period or interval associated with the resource, representing the time period about which the resource is relevant or descriptive. Applied to the type of document being worked on, it refers to the historical time or period in which the action takes place. In this case, if the time period cannot be expressed with this type because it is fictitious, the LLM’s plain-text inference is kept as is.
  • dcterms:type: The nature or genre of the resource. With regard to digital documents, it is the type of novel. These types of novels are selected from a closed list: sentimental, gothic or horror, mystery, historic, victorian realist or naturalist, novel of manners, adventure, science fiction, social, or erotic.
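The sketch below illustrates the GeoNames lookup mentioned above for dcterms:spatial; it assumes the public searchJSON web service and a registered username, and the exact client used in the pipeline may differ:

```python
# Sketch of the GeoNames lookup used for dcterms:spatial.  It assumes the public
# searchJSON web service and a registered username; the exact client used in the
# pipeline may differ.
import requests

GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"

def location_to_iri(place_name: str, username: str) -> str:
    """Return a GeoNames IRI for a place name, or the plain-text name if none is found."""
    response = requests.get(
        GEONAMES_SEARCH,
        params={"q": place_name, "maxRows": 1, "username": username},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json().get("geonames", [])
    if not results:
        # Imaginary or overly specific place: keep the plain-text name (see above).
        return place_name
    return f"https://sws.geonames.org/{results[0]['geonameId']}/"
```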
In addition, the LLM has also been instructed to extract the main characters from the novel, with a maximum of four. Two fields have been extracted from these main characters: the name and a brief profile describing traits, role and relevance, with a maximum of 2–3 sentences. This information is stored in the same JSON file as the rest of the generated metadata in the “mainCharacters” tag, even though it is not DCAT metadata, so that it can be easily evaluated later.

4. Evaluation

This section presents the evaluation conducted on the transformation from TEI to DCAT using the ELTeC corpus. The mapping described in Table 1 is straightforward, and it only requires defining a set of fixed rules for implementation. Thus, this section is focused on evaluating the extraction of metadata from the body of the novels using LLMs, analyzing both quantitatively and qualitatively the accuracy of the process performed. The first subsection describes in more detail the ELTeC dataset used in the experiments. The second subsection analyzes the results of the procedure and discusses the strengths and weaknesses identified in the process.

4.1. Experimental Setup

The Spanish and English novels from the ELTeC collection were used for the assessment. ELTeC is a multilingual collection of corpora representing European novels published between 1840 and 1920. It is currently the principal multilingual and comparable collection of literary texts available. The collection comprises 12 corpora in 12 European languages, each containing 100 novels: Czech, German, English, French, Swiss-German, Hungarian, Polish, Portuguese, Romanian, Slovenian, Spanish, and Serbian. This amounts to a total of 1200 novels (93,260,365 word tokens), all published in Europe between 1840 and 1920. There is also an extended version of the corpus that includes more novels in these languages and others, such as Italian (70 novels), Lithuanian (32), Norwegian (58), or Ukrainian (50), among others.
The corpus was established to foster comparative studies among the different European literary traditions. To this end, the selection of novels in each language followed the same rigorous criteria. This way, the novels are balanced in terms of:
  • Publication Date: Approximately 25 novels for each 20-year period
  • Size: A minimum of 20 short novels, 20 medium-sized novels, and 20 longer novels
  • Gender: At least 10 novels written by female authors, aiming for up to 50 where possible
  • Editorial Success: Around 30 novels that are currently well-known and approximately 30 that have only been published once
Furthermore, a varied sample is ensured by permitting a maximum of three novels per author, while striving for a single novel per author.
To facilitate usage and interoperability, all novels were annotated using the TEI standard. The TEI-Header includes essential metadata such as the author, title, publisher, and the date and place of publication for the digital version. Additionally, it contains bibliographic information with data regarding both the novel’s first edition and the specific edition used to create the ELTeC digital text. The header is completed with metadata fields detailing the novel’s size (token count), the language(s) utilized within the novel, the author’s gender, and an indication of whether the work is currently well-known or, conversely, considered a rare work. Finally, within the text body, the structural components of the novels (chapters and their titles, sections, paragraphs, notes) as well as code switches have been marked.
As mentioned in Section 3, from the set of 100 Spanish and 100 English novels (20,965,631 word tokens), an LLM was used to extract from the body of each text information about description, keyword, theme, spatial, temporal, type, and name and profile of main characters. These labels are incorporated into the DCAT encoding of each novel as additional metadata.
To evaluate the extraction of metadata from the 200 novels, 10% of the corpus in each language was manually analyzed by a literary scholar, including novels of different types, sizes, and periods. The expert assessed whether the metadata extracted directly from the novels was consistent with the novel’s content or if, conversely, the LLM generated erroneous data (hallucinations). Regarding the validation of results generated by LLMs, the expert has focused on the consistency, relevance, and adequacy of the results from a qualitative perspective, ensuring that the enriched metadata are useful and accurate. This type of validation is common in the literature when seeking to guarantee the quality and usefulness of the results generated by LLMs, especially in scenarios where human interpretation is key to detecting nuances and errors.
The evaluation framework is grounded in the data perspectivism approach (https://pdai.info/, accessed on 12 September 2025) [39]. Under this model, data cannot be classified as 100% correct or incorrect; rather, a certain margin of divergence is inherent, even among experts, due to differences in textual interpretation. Hence, two human annotators may extract divergent data points from the same text, yet both remain valid. The generated data are considered correct insofar as they could have been produced by a human expert. Cases that contradict the work itself, where no standard interpretation or expert validation would be possible, are consequently defined as errors. Only these evident errors were classified as such.
Regarding the LLM used to generate the metadata, gpt-5-mini (https://platform.openai.com/docs/models/gpt-5-mini, accessed on 20 September 2025) was chosen. This model combines efficiency and speed with a high level of accuracy in interpreting complex content. Furthermore, it has a large input capacity of 272,000 tokens (within a 400,000-token context window), making it well suited to processing long texts. Despite this, 11 and 21 novels from the Spanish and English corpora, respectively, exceeded this limit. To solve this, the texts of these novels were fragmented into chunks, metadata were generated for each fragment, and the LLM was then asked to combine the partial results into general metadata for the whole novel (a sketch of this chunk-and-merge step is given below). Model gpt-5-mini was selected for this task, rather than other models, because it offers strong instruction-following performance, reliable structured-output capabilities (such as JSON or schema-based responses), and efficient processing costs, making it suitable for large-scale metadata generation. While models like Claude 3 and Qwen 3 provide extended context windows (up to 1 million tokens in certain versions), gpt-5-mini’s context capacity is typically sufficient for analyzing full-length novels when combined with chunking or summarization strategies. Its balance of quality, speed, and cost makes it a practical choice for long-text semantic extraction in both English and Spanish, even if it does not offer the absolute largest context window on the market.
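A minimal sketch of this chunk-and-merge step follows; it assumes the extract_metadata() helper sketched in Section 3, and the character-based split and the merge prompt are illustrative simplifications of the actual procedure:

```python
# Minimal sketch of the chunk-and-merge step for novels exceeding the input
# limit.  It assumes the extract_metadata() helper sketched in Section 3; the
# character-based split and the merge prompt are illustrative simplifications.
from openai import OpenAI

def split_into_chunks(text: str, max_chars: int = 800_000) -> list[str]:
    """Naive character split; a token-aware splitter would be preferable in practice."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def extract_long_novel(novel_text: str) -> str:
    partial_records = [extract_metadata(chunk) for chunk in split_into_chunks(novel_text)]
    if len(partial_records) == 1:
        return partial_records[0]
    merge_prompt = (
        "The following JSON records were extracted from consecutive fragments of the "
        "same novel. Combine them into a single consistent JSON record with the same "
        "fields, keeping the language of the original records:\n\n"
        + "\n\n".join(partial_records)
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": merge_prompt}],
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content
```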

4.2. Results

In general, the generated metadata are accurate: apart from a few cases discussed below, the process introduces no errors or falsehoods. An expert human might generate a different summary or a different character profile, but they would not necessarily perform better than the output created by the LLM. Data extracted for settings, narrative time, keywords, theme, or novel type are mostly correct.
Table 2 shows an example of a novel in DCAT format, including both directly mapped metadata and metadata extracted by the LLM from the body of the novel.
From a quantitative perspective, the correctness of each tag in every novel was determined using a value between 0 and 1, where 0 signifies completely incorrect and 1 signifies completely correct. Interpretative issues of the text were not considered; instead, the assessment focused strictly on whether the information was factually correct or not.
Table 3 shows the correctness of the automatic annotation for each tag. As can be observed, the information category with the highest error rate is the novel type, with a correctness score of 0.76. The reason is explained below. The information about theme, spatial, and temporal tags was correct in all analyzed novels. The remaining categories exhibit a correctness score exceeding 0.9.
The accuracy of the metadata is particularly notable in novels such as Alarcón’s El Sombrero de Tres Picos (The Three-Cornered Hat) and Brontë’s Wuthering Heights. Similarly, in Pérez Galdós’s novel Fortunata y Jacinta, the LLM perfectly extracts the different locations of the protagonists’ honeymoon narrated within the text. Likewise, in Wells’s The Time Machine: An Invention, it correctly identifies the main character as “The Time Traveller” (as named) and “The Morlocks” (in general) as the antagonistic figures.

4.3. Limitations

Despite the good results shown in the previous section, there are three main issues where the LLM had difficulties in extracting the correct information.
Firstly, the LLM struggles to detect information that is implicit in the novel: information that a human reader can easily infer but that is not explicitly specified in the text. For example, in Pardo Bazán’s novel La Madre Naturaleza (Mother Nature), it erroneously indicates that the protagonists, who are in a romantic relationship, are friends, when in reality they are siblings. This is the main conflict of the novel: it is highly relevant information, implicit from the beginning (the reader knows it), and it should be included in the novel’s description and the characters’ profiles. Unfortunately, the LLM failed to detect this information, despite its high relevance for the narrative development.
Similarly, in Gómez de Avellaneda’s novel Sab, there is a veiled critique of the situation of women, who are compared to slaves due to their lack of freedom. This critique is relevant for the interpretation of the novel, but because it is implicit to a certain extent (an explicit allusion is made only at the end of the novel), the LLM fails to detect and extract it accurately in the novel’s description.
The opposite case is found in the novel Los Pazos de Ulloa (The House of Ulloa), also by the novelist Pardo Bazán. The LLM extracts the keyword “sexual scandal” as a defining feature of the novel; however, no such scandal exists: it is a simple case of a married man having a mistress. It is the narrator, who is a priest, who considers the situation to be a “sexual scandal”, but that was not the case for most contemporary readers, much less for modern readers. Because the novel is narrated from the subjective viewpoint of a specific character, this explicit subjectivity causes the LLM to extract information that, while not false, is not entirely correct. Specifically, it gives too much importance to an element that is not as central to the novel (although it is highly relevant to the protagonist-narrator). For the same reason, other relevant aspects of the novel, such as the theme of motherhood (which has dramatic consequences for the narrative), do not appear in the description or the metadata.
Secondly, the LLM had some problems in character extraction, specifically in determining which characters are relevant to the novels. For example, in novels such as Gil y Carrasco’s El Señor de Bembibre or Pérez Galdós’s Trafalgar, the LLM fails to extract the hero’s antagonist as a relevant character (the Count of Lemos in the first case and Admiral Nelson in the second). In these instances, the antagonist is relevant because they generate the conflict and sustain the narrative action. In another case, such as Valera’s Pepita Jiménez, the LLM simply extracts some characters as relevant when they are actually secondary.
This issue could be improved by feeding the LLM with character type schemas commonly found in fictional narratives, such as those established by V. Propp [40]: hero, antagonist, aggressor, donor or helper of the hero, the father, etc.
Thirdly, the highest number of errors was detected in the classification of novels into novelistic types or subgenres (dct:type). As previously mentioned, to improve classification precision, the LLM was constrained to classify the novels according to a closed list of novel types. However, some novels in the corpus do not fit any of these types. In these cases, because the LLM can only utilize the provided categories, the resulting classification is incorrect.
For example, the LLM classifies the novel Fortunata y Jacinta as a “Victorian novel”, a designation that is only applicable to English novels. The correct category in Spanish would be “novela realista” (realist novel). Similarly, M. de Unamuno’s Niebla (Mist), published in the 20th century, is categorized as a “Novela costumbrista” (Novel of Manners), a type typically associated only with the 19th century. The correct category for this novel is “philosophical novel” or “modernist novel”. A different classification strategy is needed to correctly categorize the novels, an approach that allows the LLM to classify any novel type.

5. Conclusions and Future Work

This work addressed the challenge of aligning a domain-specific metadata standard in the digital humanities, TEI, with DCAT, which is central to the FAIR data ecosystem. The proposed methodology bridges the interoperability gap by enabling the transformation of TEI-encoded literary corpora into DCAT-compliant catalogs.
Furthermore, the study explores the use of LLMs to automatically extract and enrich metadata fields from the body of the novels, such as description, keywords, themes, spatial and temporal dimensions, literary type, and character profiles. This integration contributes to making literary data more discoverable, interoperable, and reusable within federated and sovereign data spaces, while preserving the domain’s semantic richness.
The evaluation, conducted on a bilingual corpus of 200 novels from ELTeC, demonstrates that the proposed pipeline performs accurately in most cases. Expert assessment confirmed that the LLM-generated metadata were largely consistent with the content of the novels, successfully identifying narrative settings, periods, and character information. However, several limitations were also observed. The model struggled with implicit narrative information, such as underlying relationships, occasionally overemphasized subjective aspects of narration, and misclassified certain subgenres when the provided typology did not fit the novel. These results suggest that, while LLMs can effectively automate the generation of rich literary metadata, domain adaptation and contextual reasoning remain challenging tasks requiring further refinement. The evaluation supports the validity and usefulness of the proposed method, confirming its ability to improve the interoperability and discoverability of literary collections in open data environments. Consequently, the findings presented in the evaluation substantiate the contributions of this study and guide future lines of research.
Future work will focus on addressing limitations detected in the evaluation and extending the proposed approach. First, narratology studies have proposed various character type categorizations, such as the original proposal by Propp [40]. These abstract character types can be used to improve character extraction by compelling the LLM to determine who is the hero/heroine, the antagonist, the mentor, etc.
Secondly, this work has focused on novels, but many other corpora of poetry or drama annotated in TEI are available, such as the Golden Age Spanish Sonnets Corpus (https://github.com/bncolorado/CorpusSonetosSigloDeOro, accessed on 20 September 2025) [41] or the aforementioned DraCor corpus. The methodology described in this work can be applied to these and other TEI corpora, adapting the type of information to be extracted. In poetry, for example, it would be necessary to extract information regarding meter, as it is one of its most relevant features.
Finally, future work will advance from catalog-level harmonization toward practical metadata unification [42] by co-developing a governed TEI/DCAT application profile and extending the currently developed mappings with additional semantic vocabularies.

Author Contributions

Conceptualization, J.-N.M., B.N.-C. and D.T.; Methodology, A.M., J.-N.M. and D.T.; Software, A.M., C.G.-B. and A.B.; Validation, A.B., A.M. and B.N.-C.; Formal Analysis, A.M. and B.N.-C.; Investigation, A.M., J.-N.M. and D.T.; Resources, A.M. and B.N.-C.; Data Curation, B.N.-C. and A.M.; Writing—Original Draft Preparation, A.M., D.T., B.N.-C. and J.-N.M.; Writing—Review and Editing, J.-N.M. and D.T.; Visualization, A.M. and D.T.; Supervision, J.-N.M. and D.T.; Project Administration, J.-N.M.; Funding Acquisition, J.-N.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the CLIO project (TSI-100130-2024-69), funded by the Spanish Ministry of Digital Processing and by NextGeneration EU.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in a GitHub repository at https://github.com/amorejon6/electronics.git, accessed on 15 October 2025.

Conflicts of Interest

Author Alberto Berenguer was employed by the company INFERIA SOLUTIONS SL. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. European Commission. Common European Data Spaces—Shaping Europe’s Digital Future. Publications Office of the European Union. 2022. Available online: https://digital-strategy.ec.europa.eu/en/policies/data-spaces (accessed on 10 September 2025).
  2. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef] [PubMed]
  3. World Wide Web Consortium (W3C). Data Catalog Vocabulary (DCAT)—Version 3. W3C Recommendation. 2023. Available online: https://www.w3.org/TR/vocab-dcat-3/ (accessed on 10 September 2025).
  4. Mons, B.; Neylon, C.; Velterop, J.; Dumontier, M.; da Silva Santos, L.O.B.; Wilkinson, M.D. Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Inf. Serv. Use 2017, 37, 49–56. [Google Scholar] [CrossRef]
  5. Albertoni, R.; Browning, D.; Cox, S.; Gonzalez-Beltran, A.N.; Perego, A.; Winstanley, P. The W3C data catalog vocabulary, version 2: Rationale, design principles, and uptake. Data Intell. 2024, 6, 457–487. [Google Scholar] [CrossRef]
  6. Welter, D.; Juty, N.; Rocca-Serra, P.; Xu, F.; Henderson, D.; Gu, W.; Strubel, J.; Giessmann, R.T.; Emam, I.; Gadiya, Y.; et al. FAIR in action—A flexible framework to guide FAIRification. Sci. Data 2023, 10, 291. [Google Scholar] [CrossRef] [PubMed]
  7. da Silva Santos, L.O.B.; Burger, K.; Kaliyaperumal, R.; Wilkinson, M.D. FAIR-Data Point: A FAIR-Oriented Approach for Metadata. Data Intell. 2023, 5, 163–183. [Google Scholar] [CrossRef]
  8. Manovich, L. The Science of Culture? Social Computing, Digital Humanities and Cultural Analytics. J. Cult. Anal. 2016, 1, 1–15. [Google Scholar] [CrossRef] [PubMed]
  9. Flanders, J.; Jannidis, F. The Shape of Data in the Digital Humanities; Routledge: New York, NY, USA, 2019. [Google Scholar]
  10. Moretti, F. Distant Reading; Verso: New York, NY, USA, 2013. [Google Scholar]
  11. Giovannetti, F.; Tomasi, F. Linked data from TEI (LIFT): A Teaching Tool for TEI to Linked Data Transformation. Digit. Humanit. Q. 2022, 16, 1–14. [Google Scholar]
  12. Odebrecht, C.; Burnard, L.; Navarro-Colorado, B.; Eder, M.; Schöch, C. The European Literary Text Collection (ELTeC). In Proceedings of the Digital Humanities Conference (DH 2019), DH, Utrecht, The Netherlands, 9–12 July 2019. [Google Scholar]
  13. Burnard, L.; Navarro-Colorado, B.; Odebrecht, C.; Scholger, M. Collaborative creation of a multi-lingual literary corpus. Challenges and best practices for corpus design. In Proceedings of the COST Action Distant Reading Closing Conference, Krakov, Poland, 21–22 April 2022. [Google Scholar]
  14. CKAN Project. Ckanext-Dcat: DCAT Extension for CKAN. 2025. Available online: https://docs.ckan.org/projects/ckanext-dcat/ (accessed on 12 September 2025).
  15. Interoperable Europe/SEMIC. DCAT-AP 3.0.0. 2024. Available online: https://semiceu.github.io/DCAT-AP/releases/3.0.0/ (accessed on 12 September 2025).
  16. Perego, A.; Nüst, D.; Cetl, V.; Friis-Christensen, A.; Lutz, M. GeoDCAT-AP: Representing Geographic Metadata Using the DCAT Application Profile; Technical Report; Joint Research Centre (European Commission): Brussels, Belgium, 2017. [Google Scholar]
  17. European Commission and UNECE. StatDCAT-AP: Application Profile for Statistical Data. 2017. Available online: https://unece.org/statistics/documents/2017/06/meeting-document/statdcat-ap-application-profile-statistical-data (accessed on 12 September 2025).
  18. Noy, N.; Burgess, M.; Brickley, D. Google Dataset Search: Building a search engine for datasets in an open Web ecosystem. In Proceedings of the Web Conference (WWW ’19), San Francisco, CA, USA, 13–17 May 2019; pp. 1365–1375. [Google Scholar] [CrossRef]
  19. Otto, B. A federated infrastructure for European data spaces. Commun. ACM 2022, 65, 44–45. [Google Scholar] [CrossRef]
  20. Eclipse Foundation. Eclipse Dataspace Protocol (Project Proposal). 2024. Available online: https://projects.eclipse.org/proposals/eclipse-dataspace-protocol (accessed on 12 September 2025).
  21. International Data Spaces Association. Making the Dataspace Protocol an International Standard; Technical Report; IDSA: Arlington, VA, USA, 2024. [Google Scholar]
  22. Sztyler, T.; Huber, J.; Noessner, J.; Murdock, J.; Allen, C.; Niepert, M. LODE: Linking digital humanities content to the web of data. In Proceedings of the IEEE/ACM Joint Conference on Digital Libraries, London, UK, 8–14 September 2014; pp. 423–424. [Google Scholar] [CrossRef]
  23. Pasqual, V.; Tomasi, F. Linked open data per la valorizzazione di collezioni culturali: Il dataset mythLOD. AIB Studi 2022, 62, 149–168. [Google Scholar] [CrossRef]
  24. Fischer, F.; Börner, I.; Göbel, M.; Hechtl, A.; Kittel, C.; Milling, C.; Trilcke, P. Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama. In Proceedings of the DH2019: “Complexities”, Utrecht, The Netherlands, 9–12 July 2019; Utrecht University: Utrecht, The Netherlands, 2019. [Google Scholar] [CrossRef]
  25. Hyvönen, E. Linked Open Data Infrastructure for Digital Humanities in Finland. Digit. Humanit. Nord. Balt. Ctries. Publ. 2020, 3, 254–259. [Google Scholar] [CrossRef]
  26. ParsCit. wing.comp.nus.edu.sg/parsCit. 2009–2025. Available online: https://github.com/knmnyn/ParsCit (accessed on 12 September 2025).
  27. GROBID. 2008–2025. Available online: https://github.com/kermitt2/grobid (accessed on 12 September 2025).
  28. Day, M.Y.; Tsai, R.T.H.; Sung, C.L.; Hsieh, C.C.; Lee, C.W.; Wu, S.H.; Wu, K.P.; Ong, C.S.; Hsu, W.L. Reference metadata extraction using a hierarchical knowledge representation framework. Decis. Support Syst. 2007, 43, 152–167. [Google Scholar] [CrossRef]
  29. Han, H.; Giles, C.L.; Manavoglu, E.; Zha, H.; Zhang, Z.; Fox, E.A. Automatic document metadata extraction using support vector machines. In Proceedings of the 2003 Joint Conference on Digital Libraries, Houston, TX, USA, 27–31 May 2003; pp. 37–48. [Google Scholar]
  30. Peng, F.; McCallum, A. Information extraction from research papers using conditional random fields. Inf. Process. Manag. 2006, 42, 963–979. [Google Scholar] [CrossRef]
  31. Science Parse. Allen Institute for AI. 2017. Available online: https://github.com/allenai/science-parse (accessed on 12 September 2025).
  32. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar] [CrossRef]
  33. Joshi, B.; Symeonidou, A.; Danish, S.M.; Hermsen, F. An End-to-End Pipeline for Bibliography Extraction from Scientific Articles. In Proceedings of the Second Workshop on Information Extraction from Scientific Publications, Nusa Dua, Indonesia, 1 November 2023; pp. 101–106. [Google Scholar]
  34. Alyafeai, Z.; Al-Shaibani, M.S.; Ghanem, B. MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs. arXiv 2025, arXiv:2505.19800. [Google Scholar] [CrossRef]
  35. Boukhers, Z.; Yang, C. Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents. arXiv 2025, arXiv:2501.05082. [Google Scholar] [CrossRef]
  36. Hu, X. Application of Large Language Models for Digital Libraries. In Proceedings of the 24th ACM/IEEE Joint Conference on Digital Libraries, Hong Kong, China, 16–20 December 2025; Association for Computing Machinery: New York, NY, USA, 2025. [Google Scholar]
  37. ISO 639-1:2002; Codes for the Representation of Names of Languages—Part 1: Alpha-2 Code. International Organization for Standardization (ISO): Geneva, Switzerland, 2002.
  38. ISO 639-3:2007; Codes for the Representation of Names of Languages—Part 3: Alpha-3 Code for Comprehensive Coverage of Languages. International Organization for Standardization (ISO): Geneva, Switzerland, 2007.
  39. Cabitza, F.; Campagner, A.; Basile, V. Toward a Perspectivist Turn in Ground Truthing for Predictive Computing. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  40. Propp, V. Morphology of the Folktale; University of Texas Press: Austin, TX, USA, 1968. [Google Scholar]
  41. Navarro-Colorado, B.; Ribes-Lafoz, M.; Sánchez, N. Metrical annotation of a large corpus of Spanish sonnets: Representation, scansion and evaluation. In Proceedings of the 10th Edition of the Language Resources and Evaluation Conference, Portorož, Slovenia, 23–28 May 2016. [Google Scholar]
  42. Christodoulakis, C.; Gabel, M.; Brown, A.D. Metadata Unification in Open Data with Gnomon. In Proceedings of the 28th International Conference on Extending Database Technology, EDBT 2025, OpenProceedings.org, Barcelona, Spain, 25–28 March 2025; pp. 377–383. [Google Scholar] [CrossRef]
Figure 1. Prompt template to generate metadata from the novel’s content. Wikipedia URLs have been omitted due to lack of space. The # symbol has been used to separate each section of the prompt.
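For illustration, the sketch below shows how a sectioned prompt in the spirit of Figure 1 could be assembled in Python. The wording, the field list, the excerpt length and the JSON output convention are assumptions of this sketch, not the template used in the paper, and the LLM call itself is omitted since any chat-style completion API that returns text would serve.

```python
# Hypothetical prompt-assembly sketch; wording, field list and limits are
# illustrative assumptions, not the authors' template from Figure 1.
import json

FIELDS = [
    "dcterms:description", "dcat:keyword", "dcat:theme",
    "dcterms:spatial", "dcterms:temporal", "dcterms:type", "characters",
]

def build_prompt(novel_text: str, max_chars: int = 20000) -> str:
    """Build a sectioned prompt; sections are separated by the # symbol, as in Figure 1."""
    sections = [
        "# ROLE\nYou are a literary cataloguer producing DCAT-oriented metadata.",
        "# TASK\nRead the novel excerpt and return a JSON object with the keys: "
        + ", ".join(FIELDS) + ".",
        "# OUTPUT FORMAT\nReturn only valid JSON; prefer controlled URIs "
        "(e.g., GeoNames for dcterms:spatial) where possible.",
        "# TEXT\n" + novel_text[:max_chars],
    ]
    return "\n\n".join(sections)

def parse_llm_response(raw: str) -> dict:
    """The LLM call is omitted here; parse whatever JSON string the model returns."""
    return json.loads(raw)
```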
Table 1. Mapping between TEI and DCAT elements.

TEI Element | DCAT Element
<titleStmt><title> | dcterms:title
<titleStmt><author> | dcterms:creator
<publicationStmt><publisher> | dcterms:publisher
<publicationStmt><distributor> | prov:qualifiedAttribution [ a prov:Attribution; prov:agent <https://zenodo.org/>; dcat:hadRole <distributor> ]
<publicationStmt><availability> | dcterms:license
<publicationStmt><ref type="doi"> | dcterms:identifier
<langUsage> | dcterms:language
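As a minimal sketch of how the mapping in Table 1 could be applied in practice (not the authors' implementation), the following Python fragment reads a TEI header with lxml and emits the corresponding DCAT/Dublin Core triples with rdflib. The dataset URI, the assumption that the licence sits in an <availability>/<licence> child, and the simplified handling of <langUsage> are illustrative choices.

```python
# Minimal sketch of the Table 1 mapping (illustrative, not the paper's code).
from lxml import etree
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_header_to_dcat(tei_path: str, dataset_uri: str) -> Graph:
    tree = etree.parse(tei_path)
    g = Graph()
    g.bind("dcterms", DCTERMS)
    g.bind("dcat", DCAT)
    ds = URIRef(dataset_uri)
    g.add((ds, RDF.type, DCAT.Dataset))

    def first(xpath: str):
        hits = tree.xpath(xpath, namespaces=TEI_NS)
        return hits[0].text.strip() if hits and hits[0].text else None

    # <titleStmt><title> -> dcterms:title; <titleStmt><author> -> dcterms:creator
    if (title := first("//tei:titleStmt/tei:title")):
        g.add((ds, DCTERMS.title, Literal(title)))
    if (author := first("//tei:titleStmt/tei:author")):
        g.add((ds, DCTERMS.creator, Literal(author)))
    # <publicationStmt><publisher> -> dcterms:publisher
    if (publisher := first("//tei:publicationStmt/tei:publisher")):
        g.add((ds, DCTERMS.publisher, Literal(publisher)))
    # <availability> (assumed to contain a <licence> child) -> dcterms:license
    if (licence := first("//tei:publicationStmt/tei:availability//tei:licence")):
        g.add((ds, DCTERMS.license, Literal(licence)))
    # <ref type="doi"> -> dcterms:identifier
    if (doi := first("//tei:publicationStmt//tei:ref[@type='doi']")):
        g.add((ds, DCTERMS.identifier, Literal(doi)))
    # <langUsage> -> dcterms:language (taking the text of the <language> element)
    if (lang := first("//tei:langUsage/tei:language")):
        g.add((ds, DCTERMS.language, Literal(lang)))
    return g
```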
Table 2. Example of metadata obtained for one of the novels analyzed. Some metadata values have been truncated due to lack of space.

DCAT Property | Value
dcterms:title | Wuthering Heights
dcterms:creator | Brontë, Emily
dcterms:publisher | COST Action “Distant Reading for European Literary History”
prov:agent (distributor) | https://zenodo.org/
dcterms:license | https://creativecommons.org/licenses/by/4.0/, accessed on 2 December 2025
dcterms:identifier | https://doi.org/10.5281/zenodo.3462435, accessed on 2 December 2025
dcterms:language | http://id.loc.gov/vocabulary/iso639-1/en, accessed on 2 December 2025
dcterms:description | A dark, passionate tale of obsessive love and revenge...
dcat:keyword | revenge, obsessive love, inheritance and class...
dcat:theme | Passionate/tragic love, Revenge and social retribution...
dcterms:spatial | https://www.geonames.org/2641211/, accessed on 2 December 2025
dcterms:temporal | 1801–1802
dcterms:type | Gothic or horror novel
character 1 | Heathcliff
character’s profile 1 | A dark, brooding foundling adopted into the Earnshaw...
character 2 | Catherine (Earnshaw) Linton/Catherine Heathcliff
character’s profile 2 | Headstrong, passionate and capricious, Catherine is...
character 3 | Edgar Linton
character’s profile 3 | Refined, genteel and comparatively weak, Edgar is...
character 4 | Hareton Earnshaw
character’s profile 4 | The last of the Earnshaw line, reared roughly and...
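The fields in the lower half of Table 2 (description, keywords, themes, spatial and temporal coverage, type, and characters) are the ones generated by the LLM. A hedged continuation of the previous sketch shows how such values could be merged into the DCAT graph; the dictionary keys and the choice to keep free-text values as plain literals (rather than SKOS concepts) are assumptions of this sketch, not the authors' design.

```python
# Illustrative enrichment step (assumptions, not the authors' code): add the
# LLM-generated fields shown in Table 2 to the DCAT dataset description.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS

def enrich_dataset(g: Graph, dataset_uri: str, meta: dict) -> Graph:
    ds = URIRef(dataset_uri)
    if meta.get("dcterms:description"):
        g.add((ds, DCTERMS.description, Literal(meta["dcterms:description"])))
    for kw in meta.get("dcat:keyword", []):
        g.add((ds, DCAT.keyword, Literal(kw)))
    for theme in meta.get("dcat:theme", []):
        # Free-text themes kept as literals; a SKOS concept scheme could be used instead.
        g.add((ds, DCAT.theme, Literal(theme)))
    if meta.get("dcterms:spatial"):
        g.add((ds, DCTERMS.spatial, URIRef(meta["dcterms:spatial"])))  # e.g., a GeoNames URI
    if meta.get("dcterms:temporal"):
        g.add((ds, DCTERMS.temporal, Literal(meta["dcterms:temporal"])))
    if meta.get("dcterms:type"):
        g.add((ds, DCTERMS.type, Literal(meta["dcterms:type"])))
    return g

# Example usage:
# g = enrich_dataset(g, "https://example.org/dataset/wuthering-heights",
#                    {"dcat:keyword": ["revenge", "obsessive love"],
#                     "dcterms:spatial": "https://www.geonames.org/2641211/"})
# print(g.serialize(format="turtle"))
```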
Table 3. Accuracy when extracting metadata from the text of novels using an LLM.

Metadata | Accuracy
dcterms:description | 0.96
dcat:keyword | 0.97
dcat:theme | 1.00
dcterms:spatial | 1.00
dcterms:temporal | 1.00
dcterms:type | 0.76
characters | 0.94
character’s profile | 0.93
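The per-field accuracies in Table 3 correspond to the fraction of extracted values judged correct for each metadata field. A minimal sketch of such a computation, assuming one boolean correctness judgement per field and per novel, is given below; this is an assumed evaluation procedure, not the authors' script.

```python
# Assumed evaluation sketch: per-field accuracy over human correctness judgements.
from collections import defaultdict

def per_field_accuracy(judgements: list[dict[str, bool]]) -> dict[str, float]:
    """judgements: one dict per novel mapping field name -> judged correct (True/False)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for record in judgements:
        for field, ok in record.items():
            total[field] += 1
            correct[field] += int(ok)
    return {field: correct[field] / total[field] for field in total}

# Example with two novels:
# per_field_accuracy([{"dcat:theme": True, "dcterms:type": False},
#                     {"dcat:theme": True, "dcterms:type": True}])
# -> {"dcat:theme": 1.0, "dcterms:type": 0.5}
```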
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
