Research on Methods for Linking Geoscience Literature and Geoscientific Data Based on Large Language Models

Chen, Xinyu; Ma, Yin; Wu, Kai; Pang, Xing; Li, Guoqing; Ma, Ruikai; Yang, Linhan; Peng, Chuang; Zhi, Jiayu; Yuan, Jiabin

doi:10.3390/ijgi15060243

by Xinyu Chen ¹, Yin Ma ¹, Kai Wu ², Xing Pang ¹,*, Guoqing Li ³, Ruikai Ma ¹, Linhan Yang ¹, Chuang Peng ¹, Jiayu Zhi ¹ and Jiabin Yuan ¹

¹ School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
² Department of Science Education, Jiangxi Institute of Technology, Nanchang 330098, China
³ Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2026, 15(6), 243; https://doi.org/10.3390/ijgi15060243

Submission received: 4 May 2026 / Revised: 22 May 2026 / Accepted: 26 May 2026 / Published: 1 June 2026

Abstract

Automated linkage between geoscientific literature and datasets is essential for improving data reuse, reproducibility, and knowledge discovery, yet existing methods often struggle with implicit dataset references, heterogeneous spatial–temporal expressions, and inconsistent naming conventions. To address this problem, we propose a literature–data linkage framework that integrates candidate retrieval, large language model (LLM)-based structured extraction, normalization, and knowledge graph construction. The framework first identifies candidate fragments through BM25-based retrieval, regex filtering, and whitelist-assisted scoring, and then applies schema-constrained prompting to extract dataset names and key attributes, including temporal coverage, spatial scope, resolution, provider, and role. The extracted results are subsequently normalized to canonical forms and ingested into a Neo4j-based knowledge graph linking articles, datasets, institutions, and regions. Experiments on a cross-journal benchmark show that the proposed framework achieves 93.79% precision, 90.66% recall, and 92.20% F1-score. Comparative experiments across multiple LLM backbones further indicate that the framework remains effective across both proprietary and open-source models, while ablation results confirm that candidate retrieval and normalization are the two most influential components for balanced extraction performance. The resulting knowledge graph provides a structured representation of literature–data linkages and supports exploration of dataset reuse patterns, provenance relations, and cross-document connections. These results demonstrate that carefully constrained LLM extraction, combined with retrieval and normalization, provides a robust and interpretable pathway for transforming unstructured geoscientific literature into structured and reusable knowledge.

Keywords: geoscientific literature; dataset linking; large language models; knowledge graph; normalization; Neo4j

1. Introduction

Constructing reliable databases from scholarly articles increasingly relies on automated extraction rather than manual curation in the geosciences and related fields. With the rapid growth of geoscientific publications and open data resources, the ability to systematically link literature and datasets has become essential for data reuse, reproducibility, and large-scale knowledge discovery [1,2]. However, domain texts exhibit heterogeneous aliases, inconsistent temporal and spatial expressions, and mixed unit conventions, which hinder direct reuse and structured integration [3,4]. In particular, geoscientific articles frequently reference datasets implicitly—through spatial coverage, temporal intervals, or observational platforms—rather than through standardized identifiers such as DOIs, making automatic linkage especially challenging [5].

Recent advances in large language models (LLMs) enable accurate zero-shot and few-shot extraction when guided by instruction prompts and structured outputs [6], supported by scalable pretraining and transfer learning in foundation models [7] and by retrieval-augmented generation to ground responses in evidence [8]. Prompt engineering has proven effective at steering conversational LLMs toward faithful, domain-relevant outputs in scientific applications, including materials data extraction [9,10,11]. These developments suggest that LLM-based methods offer a promising direction for extracting structured information from complex scientific texts without heavy task-specific model retraining. Nevertheless, LLM outputs may contain unsupported attributes, incomplete records, or structurally inconsistent values, particularly when dataset references are implicit or source evidence is incomplete. Therefore, reliable geoscientific literature–data linkage requires not only LLM extraction, but also evidence-constrained prompting, candidate retrieval, normalization, and downstream validation.

Early rule-based or task-specific systems improved precision but required substantial engineering and annotated data, limiting portability and adoption across subdomains; representative approaches include supervised sequence models and transformer baselines that demand domain tuning, as well as purpose-built dataset-mention detectors and resources [12,13,14,15,16]. Although these approaches provide important foundations, they commonly focus on explicit entity detection or local mention classification rather than the complete process of identifying, normalizing, and linking heterogeneous dataset references to reusable data entities. Stand-alone LLM extraction can still suffer from hallucinations and non-standard outputs, motivating end-to-end designs that constrain generation and normalize results before integration into downstream data infrastructures [17].

Beyond this general limitation, most existing approaches focus on generic scientific entity extraction and rarely address the joint modeling of dataset names, spatial–temporal attributes, and institutional provenance in geoscience literature. More importantly, existing methods usually treat extraction, normalization, and linkage as separate problems. This separation is particularly problematic in the geoscientific literature, where dataset mentions are often implicit, aliased, and distributed across multiple sentences, and where a dataset is identified by its temporal coverage, spatial scope, and provider information as well as by its name. In such cases, two textual mentions can only be reliably judged as referring to the same underlying dataset when multiple attributes are considered jointly. Accordingly, the central research gap addressed in this study is not simply insufficient mention-detection accuracy, but the absence of a reliable workflow that transforms ambiguous textual evidence into normalized, verifiable, and linkable dataset entities for downstream knowledge organization.

In practice, this challenge is evident when the same dataset is referred to in different forms across articles, for example, by a full name in one paper, by an abbreviated product name in another, and by temporal or regional description only in a third. A mention such as “1 km land-use data for the Tibetan Plateau from 2000 to 2020” may indicate a specific land-use product, but it cannot be reliably linked without combining textual evidence with temporal, spatial, and provenance cues. This makes geoscientific literature–data linkage fundamentally different from conventional entity extraction tasks that rely mainly on explicit surface forms. Without normalization of such heterogeneous mentions into canonical forms, subsequent dataset linking and graph construction would remain unstable and fragmented.

To bridge this gap, we propose a retrieve → extract → normalize → link framework for linking geoscientific literature and datasets using LLM-assisted structured extraction. The framework first narrows the search space through BM25-based candidate retrieval combined with regex evidence and whitelist-assisted scoring. It then applies schema-constrained LLM prompting to extract dataset names and key attributes, including temporal coverage, spatial scope, resolution, provider institution, DOI/URL, and data role. To reduce unsupported generation, the extraction stage requires unknown or unsupported fields to remain blank and applies post-validation before graph ingestion. The extracted records are subsequently normalized through a hybrid similarity strategy so that aliases, units, institutional names, and heterogeneous spatial–temporal expressions can be mapped to canonical forms. Finally, the normalized records are ingested into a knowledge graph that explicitly links articles, datasets, institutions, and regions, enabling structured retrieval and downstream semantic analysis [18].

The main contributions of this work are as follows.

(1): We formulate geoscientific literature–data linkage as a multi-stage task that goes beyond dataset mention detection alone. The task integrates candidate-fragment retrieval, schema-constrained attribute extraction, normalization of aliases and spatial–temporal expressions, and graph-based representation of linked dataset entities, thereby clarifying how textual and contextual evidence supports canonical dataset identification.
(2): We develop a modular retrieve → extract → normalize → link framework that combines BM25-based retrieval, regex evidence, whitelist-assisted scoring, schema-constrained LLM extraction, hybrid-similarity normalization, and knowledge graph construction. The framework incorporates evidence-based blank-field retention, structured-output validation, and low-confidence handling to limit the propagation of unsupported or ambiguous extraction results into the knowledge graph.
(3): We conduct a comprehensive evaluation on an expanded, manually annotated cross-journal benchmark, including extraction reliability analysis, prompting-strategy and repeated-call efficiency experiments, transformer-based and multi-LLM comparisons, module-level ablation analysis, and a quantitative query-based usability evaluation of the constructed knowledge graph. These experiments assess both the extraction quality of the framework and its ability to support structured retrieval of literature–data relationships.

Collectively, this study provides a practical yet methodologically structured framework for automatic literature–data linkage in geoscience. Rather than positioning LLM extraction as an isolated end point, the proposed approach integrates constrained extraction with normalization and graph representation, thereby supporting the conversion of heterogeneous dataset mentions into queryable and reusable literature–data knowledge.

The remainder of this paper is organized as follows. Section 2 reviews related studies on linking geoscience literature and datasets. Section 3 presents the overall framework of the proposed LLM-based method, including system architecture, extraction strategy, normalization, and knowledge graph construction. Section 4 reports the experimental design and results, including benchmark construction, extraction evaluation, prompting-strategy and repeated-call analyses, transformer-based baseline and LLM backbone comparisons, ablation experiments, and quantitative knowledge graph usability evaluation. Section 5 discusses the findings, implications, and limitations, and Section 6 concludes the study with future research directions.

2. Literature Review

Automatic extraction of structured information from scientific literature has been a long-standing research topic in natural language processing (NLP) and information retrieval. Early studies mainly relied on rule-based and statistical methods to identify metadata and key textual elements, such as titles, authors, and keywords [19,20,21]. Nasar et al. [22] systematically reviewed these early approaches and highlighted the gradual transition from pattern-matching techniques to learning-based models, such as conditional random fields and support vector machines, for named entity recognition and relation extraction in scholarly documents. These methods established the technical foundation for large-scale scientific text analysis but were largely developed for general scientific domains rather than domain-specific literature.

Traditional machine learning methods were later adapted to scientific corpora with domain annotations, supported by benchmark datasets such as STEM-ECR [23] and TDMSci [24], which enabled the evaluation of multidisciplinary and dataset-oriented entity extraction tasks. However, these benchmarks primarily focus on generic scientific entities and do not explicitly model domain-specific semantics such as spatial–temporal attributes or observation contexts, which are essential in geoscience literature.

Despite progress in general scientific information extraction, directly applying general-purpose models to geoscience literature remains challenging due to the domain’s strong semantic specificity. Geoscientific texts frequently describe datasets, observations, and experiments using compound expressions and implicit references, often involving spatial locations, temporal coverage, measurement variables, and observational platforms [1]. Early efforts to apply NLP to geoscience texts therefore focused on domain-specific term extraction and semantic annotation rather than generic entity recognition. For example, Wang et al. [25] proposed workflows for extracting geological terms and constructing knowledge graphs from geoscience literature, demonstrating that domain dictionaries and CRF-based segmentation significantly improved geological term identification compared to generic models. Subsequent text mining studies further annotated geological parameters and semantic relations in scientific reports, confirming that incorporating domain knowledge is critical for accurate interpretation of geoscientific texts [26].

Several recent studies have extended scientific entity extraction beyond shallow metadata toward richer semantic representations. Zhang et al. [27] released SciER, a dataset for entity and relation extraction across full scientific publications, covering entities such as datasets, methods, and tasks. Their experiments showed that even advanced neural models struggle with complex entity interactions and long-range dependencies in scholarly text. Similarly, Duan et al. [28] introduced SciNLP, a full-text benchmark that revealed substantial performance variability across model architectures and document lengths. These studies are directly relevant because they indicate that scientific information extraction is increasingly a document-level reasoning task rather than a sentence-level labeling task. This is especially important for geoscientific literature, where dataset mentions are often incomplete, distributed across multiple sentences, or expressed indirectly through temporal and spatial descriptions.

Linking scientific literature with geoscientific data has therefore been increasingly recognized as a key objective for enabling data reuse, reproducibility, and deeper knowledge discovery. Studies of dataset citation practices have shown that datasets in earth and environmental sciences are frequently referenced informally, without standardized identifiers such as Digital Object Identifiers (DOIs), which limits the effectiveness of persistent identifier–based linking methods and hinders systematic discovery of data usage across publications [29]. Analyses of barriers to dataset citation further indicate that commonly used bibliographic and reference management tools provide insufficient support for dataset reference types, impeding the adoption of standardized data citation practices across earth, space, and environmental sciences [30]. These limitations imply that literature–data linking in geoscience cannot rely solely on explicit citations or metadata alignment, but instead requires methods capable of inferring dataset-related information directly from unstructured text [31].

In recent years, deep learning and pretrained language models have demonstrated strong performance in scientific text processing tasks. SciBERT [32], a transformer-based model pretrained on large-scale scientific corpora, significantly outperformed general-domain models on scientific named entity recognition and classification tasks. More recently, large language models have been explored for structured information extraction from complex scientific texts, showing that fine-tuned models such as GPT-3 and LLaMA-2 can flexibly extract entities and relations without rigid, manually engineered pipelines [33]. Applications of deep NLP models to systematic literature review tasks further confirm the effectiveness of transformer-based architectures in extracting structured elements from unstructured scientific text [34]. At the same time, a growing body of work has shown that temporal and structural context plays an important role when semantic units cannot be resolved from local text alone. For example, Continuous Spatiotemporal Transformer models were introduced to model continuous spatio-temporal dynamics, while recent studies on temporal and spatio-temporal graph neural networks have emphasized that evolving dependencies across time and space often require dedicated modeling strategies rather than static encoders alone [35,36]. Although these methods were not developed for geoscientific literature extraction, they are still informative for the present task because dataset mentions in scholarly texts are often disambiguated by temporal span, spatial scope, and surrounding contextual evidence.

Related studies have also highlighted the importance of incorporating temporal and structural context into text understanding. SoulMate, for example, combines textual and temporal signals through multi-aspect temporal-textual embedding and shows that such joint modeling can improve linking quality [37]. Although developed for a different task, this idea is relevant to geoscientific literature, where dataset mentions often need to be resolved jointly from textual description, temporal coverage, and contextual evidence rather than from name strings alone.

Related work on knowledge graphs provides an additional perspective for literature–data linkage. Attention-based graph models such as KGAT have shown that heterogeneous relational signals can be propagated more effectively through graph attention mechanisms [36]. In parallel, entity alignment studies have demonstrated that matching across graphs becomes more robust when temporal information is incorporated explicitly. TEA, for example, introduces time-aware entity alignment and shows that temporal context can improve alignment quality between evolving entities [38]. These studies are relevant because identifying whether different mentions refer to the same underlying dataset is, in essence, an alignment problem involving aliases, temporal coverage, institutions, and spatial scope rather than string matching alone.

To summarize the methodological evolution and the remaining limitations of prior work, Table 1 provides a concise comparison of representative approaches discussed above.

As shown in Table 1, prior studies have substantially improved scientific text understanding, but few of them jointly address implicit dataset mention extraction, spatial–temporal attribute normalization, and graph-based linkage in geoscientific literature.

Despite these advances, the application of large language models to the systematic extraction and linking of geoscience literature and geoscientific dataset information remains limited. Existing studies rarely address the joint extraction of dataset-related entities, spatial–temporal attributes, and contextual descriptions, nor do they explicitly consider how extracted information can be normalized and integrated with geoscientific data resources to support retrieval, reuse, and semantic querying. In particular, previous studies usually focus either on scientific entity extraction at the text level or on graph-based representation at the structured-data level, but seldom connect the two within a unified workflow for geoscientific literature. This gap directly motivates the present study, which investigates an LLM-based framework for extracting, normalizing, and linking dataset information from geoscientific literature, with the aim of improving semantic understanding and strengthening the connectivity between geoscience publications and geoscientific data.

3. Materials and Methods

3.1. Problem Formulation and Methodological Overview

The objective of this study is to automatically identify, extract, normalize, and semantically link dataset-related information from geoscientific literature and organize the results into a structured knowledge graph that explicitly captures relationships among datasets, articles, institutions, and spatial regions.

Formally, let a geoscientific document collection be denoted as

$$D = \{d_{1}, d_{2}, \dots, d_{n}\} \tag{1}$$

where each document $d_{i}$ contains unstructured text describing scientific context, data usage, temporal coverage, spatial scope, and source information. For each document $d_{i}$, the task is to detect a set of dataset-related mentions

$$M_{i} = \{m_{i1}, m_{i2}, \dots, m_{ik}\} \tag{2}$$

where each mention $m_{ij}$ may correspond to an explicit dataset name or to an implicit reference expressed through descriptive context. Each mention is then converted into a structured record

$$r_{ij} = (\mathit{name}, \mathit{time}, \mathit{location}, \mathit{provider}, \mathit{resolution}, \mathit{doi/url}, \mathit{role}), \tag{3}$$

where the attributes represent the key information required for literature–data linkage in geoscientific texts.
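As an illustration only, the record of Eq. (3) could be held in a small typed container. The sketch below is not the authors' implementation: the class name `DatasetRecord`, the helper `from_llm_json`, the `completeness` measure, and the sample payload are all hypothetical. It assumes that a blank field is encoded as an empty string, in line with the requirement that unsupported fields remain blank.

```python
from dataclasses import dataclass, asdict, fields


@dataclass
class DatasetRecord:
    """Structured record r_ij of Eq. (3); an empty string means 'no evidence in the text'."""
    name: str = ""
    time: str = ""
    location: str = ""
    provider: str = ""
    resolution: str = ""
    doi_url: str = ""
    role: str = ""

    @classmethod
    def from_llm_json(cls, payload: dict) -> "DatasetRecord":
        """Keep only schema fields with string values; everything else stays blank."""
        allowed = {f.name for f in fields(cls)}
        clean = {
            key: value.strip()
            for key, value in payload.items()
            if key in allowed and isinstance(value, str)
        }
        return cls(**clean)

    def completeness(self) -> float:
        """Share of the seven fields that carry a non-empty value."""
        values = list(asdict(self).values())
        return sum(1 for v in values if v) / len(values)


# Hypothetical model output for the land-use mention quoted in the Introduction.
record = DatasetRecord.from_llm_json({
    "name": "1 km land-use data",
    "time": "2000-2020",
    "location": "Tibetan Plateau",
    "resolution": "1 km",
    "provider": None,    # unsupported by the text, so it stays blank
    "license": "CC BY",  # not part of the schema, so it is dropped
})
print(record)
print(f"completeness = {record.completeness():.2f}")
```

Filtering on the schema at load time keeps extra keys and non-string values out of later stages, so a missing provider is stored as a blank rather than a guess.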

Unlike typical information extraction tasks in generic domains, geoscientific literature often refers to datasets indirectly through temporal, spatial, or methodological descriptions rather than through standardized identifiers [39]. To model the target knowledge structure more explicitly, we define a knowledge graph as

$$G = (V, E) \tag{4}$$

where $V$ is the set of nodes and $E$ is the set of typed links. In this study, the main node types include Article, Dataset, Institution, and Region. A node $v \in V$ represents a canonicalized entity instance. Each link in $E$ connects two nodes through a specific semantic relation, such as USE (Article $\to$ Dataset), PROVIDED_BY (Dataset $\to$ Institution), and LOCATED_IN (Dataset $\to$ Region). In this setting, literature–data linkage means establishing a reliable mapping between a dataset mention in text and a canonical Dataset node in $G$, together with its associated attributes and relations.
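The typed graph $G = (V, E)$ can be pictured with a minimal in-memory sketch. This is not the Neo4j implementation used in the study. The classes `Node` and `LinkGraph`, the method names, and the keys `article-A`, `article-B`, and `land-use-1km` are hypothetical. Only the three relation types and four node labels come from the text.

```python
from typing import NamedTuple


class Node(NamedTuple):
    label: str  # one of Article, Dataset, Institution, Region
    key: str    # canonical identifier after normalization


# Allowed (source label, relation, target label) combinations named in the text.
SCHEMA = {
    ("Article", "USE", "Dataset"),
    ("Dataset", "PROVIDED_BY", "Institution"),
    ("Dataset", "LOCATED_IN", "Region"),
}


class LinkGraph:
    """Set-based graph: adding the same canonical node twice creates no duplicate."""

    def __init__(self) -> None:
        self.nodes: set = set()
        self.edges: set = set()

    def add_edge(self, src: Node, relation: str, dst: Node) -> None:
        if (src.label, relation, dst.label) not in SCHEMA:
            raise ValueError(f"{src.label} -[{relation}]-> {dst.label} is not in the schema")
        self.nodes.update((src, dst))
        self.edges.add((src, relation, dst))

    def articles_using(self, dataset_key: str) -> list:
        target = Node("Dataset", dataset_key)
        return sorted(s.key for s, rel, d in self.edges if rel == "USE" and d == target)


graph = LinkGraph()
dataset = Node("Dataset", "land-use-1km")  # hypothetical canonical key
graph.add_edge(Node("Article", "article-A"), "USE", dataset)
graph.add_edge(Node("Article", "article-B"), "USE", dataset)
graph.add_edge(dataset, "LOCATED_IN", Node("Region", "Tibetan Plateau"))
print(graph.articles_using("land-use-1km"))
```

Because both articles point to the same canonical Dataset node, the reuse query is a simple edge filter. This is the aggregation behavior that normalization is meant to enable.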

A central difficulty is that different mentions may refer to the same underlying dataset while exhibiting substantial lexical and contextual variation. We describe this phenomenon through two related properties. First, alias ambiguity means that multiple surface forms, such as abbreviations, partial names, or descriptive references, may correspond to the same dataset entity. Second, spatial–temporal heterogeneity means that dataset mentions are often expressed through diverse time ranges, place names, scales, or resolution units, such as “2000–2020”, “from 2000 to 2020”, “Tibetan Plateau”, or “Qinghai–Tibet Plateau”. As a result, determining whether two mentions refer to the same dataset cannot rely on name matching alone, but must jointly consider textual semantics, temporal coverage, spatial scope, and provenance information.

Accordingly, the problem in this study can be formulated as learning a mapping

$$f : M \to Y \tag{5}$$

where $M = \bigcup_{i} M_{i}$ is the set of dataset-related mentions extracted from the document collection, and $Y$ is the set of normalized, linkable dataset records. Each output record $y \in Y$ is associated with a canonical Dataset node and its structured attributes, so that equivalent mentions from different documents can be aggregated into the same graph entity. Accordingly, the evaluation of extraction quality is defined with respect to whether a textual mention can be correctly transformed into a normalized and linkable dataset record. In this setting, true positives correspond to correctly linked mention-level records, whereas false positives and false negatives correspond to incorrect or missing links between textual evidence and canonical dataset entities. In this sense, the task goes beyond mention extraction and includes candidate identification, schema-constrained attribute extraction, normalization of aliases and spatial–temporal expressions, and graph-based entity linking.
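The mention-level metrics implied by these definitions follow the usual formulas. The short sketch below computes them from counts and, as a consistency check, recomputes the F1-score from the precision and recall reported in the Abstract. The function names and the example counts (90, 10, 30) are illustrative, not taken from the benchmark.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from mention-level link counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = harmonic_mean(precision, recall)
    return precision, recall, f1


def harmonic_mean(p: float, r: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Illustrative counts: 90 correct links, 10 incorrect links, 30 missed links.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")

# Consistency check on the figures reported in the Abstract (percentages).
reported_f1 = harmonic_mean(93.79, 90.66)
print(f"F1 from reported P and R: {reported_f1:.2f}")
```

The harmonic mean of 93.79 and 90.66 rounds to 92.20, which matches the reported F1-score.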

To address this problem, we propose a multi-stage framework integrating retrieval, LLM-based structured extraction, normalization, and graph construction [40]. The overall design follows a “retrieve → extract → normalize → link” paradigm, which separates high-recall candidate identification from high-precision canonicalization and linking [27,41]. In this workflow, post-processing does not refer to model fine-tuning; instead, it refers to the normalization and validation stage in which aliases, units, and heterogeneous spatial–temporal expressions are mapped to canonical forms before graph ingestion. Similarly, querying refers to structured graph retrieval over explicit entities and relations, such as identifying which datasets are used in a given research domain or which institutions provide datasets associated with a target region. Analysis refers to corpus-level semantic exploration based on the resulting graph, including aggregation of dataset reuse patterns, co-occurrence structures, and cross-document linkage relations. Therefore, knowledge graph construction operationalizes dataset–article linking by transforming extracted mentions into explicit nodes and typed relations that can be systematically retrieved, aggregated, and analyzed.

3.2. System Architecture

The overall architecture of the proposed system is illustrated in Figure 1. The pipeline transforms unstructured geoscientific articles (in PDF, Word, or Markdown format) into a Neo4j-based knowledge graph in which dataset entities and their semantic relations to articles, institutions, and regions are explicitly represented.

The system comprises four sequential stages:

(1): Document preprocessing and candidate retrieval, which converts source documents into plain text, segments them into paragraphs and sentences, and identifies text fragments likely to contain dataset-related information through BM25-based retrieval, regex matching, whitelist-based filtering, and composite candidate scoring.
(2): LLM-based structured extraction, which converts the retrieved candidate fragments into schema-constrained JSON records through prompt-guided extraction. This stage includes structured field prompting, fragment-level JSON generation, repeated-call aggregation for improved stability, and blank-field retention when supporting evidence is unavailable.
(3): Post-processing and normalization, which resolves lexical variation, synonyms, spatial–temporal heterogeneity, and alias ambiguity through hybrid similarity matching and controlled vocabularies. This stage combines Levenshtein similarity and embedding-based semantic similarity, supported by a gazetteer, an institution registry, and a dataset alias lexicon, to map extracted values to canonical forms.
(4): Knowledge graph construction, which materializes the normalized records into graph nodes and typed relations in Neo4j, thereby enabling structured storage, querying, and visualization of literature–data linkages.

Figure 1 explicitly shows the principal operations and intermediate outputs in each stage, including composite candidate scoring in retrieval, schema-constrained JSON extraction with aggregation, hybrid-similarity normalization with threshold-based acceptance, and graph construction in the form of nodes and edges. This staged design decouples recall-oriented retrieval from precision-oriented normalization, a strategy that improves robustness across diverse document styles and domains [41].
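The hybrid-similarity normalization with threshold-based acceptance in Stage 3 can be sketched roughly as follows. Several parts are assumptions rather than details from the text: the equal weighting (`weight=0.5`), the acceptance threshold (`tau=0.75`), the two-entry alias lexicon, and above all the character-trigram Jaccard function, which only stands in for the embedding-based semantic similarity so that the example runs without a model.

```python
from typing import Callable, Optional, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigram_jaccard(a: str, b: str) -> float:
    """Stand-in for embedding similarity (assumption, not the authors' model)."""
    grams_a = {a[i:i + 3] for i in range(len(a) - 2)}
    grams_b = {b[i:i + 3] for i in range(len(b) - 2)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def best_match(
    mention: str,
    canonical_names: Sequence[str],
    semantic: Callable[[str, str], float] = trigram_jaccard,
    weight: float = 0.5,  # hypothetical weight on the edit-based term
    tau: float = 0.75,    # hypothetical acceptance threshold
) -> Optional[Tuple[str, float]]:
    """Return (canonical name, score) for the best candidate, or None below tau."""
    query = mention.lower().strip()
    best_name, best_score = None, 0.0
    for name in canonical_names:
        target = name.lower()
        score = weight * edit_similarity(query, target) + (1 - weight) * semantic(query, target)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= tau else None


LEXICON = ["MODIS Land Cover", "Landsat 8 OLI"]  # hypothetical alias lexicon entries
print(best_match("MODIS landcover", LEXICON))
print(best_match("rainfall gauge records", LEXICON))
```

Returning `None` below the threshold leaves a low-confidence mention unlinked instead of forcing it onto the nearest canonical name. Passing `semantic` as a parameter lets an embedding-based scorer replace the stand-in without changing the rest of the function.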

3.3. Task Decomposition and Evaluation Alignment

To enable interpretable and quantifiable evaluation, we decompose the end-to-end pipeline into three operational sub-tasks, corresponding to Stages 1–3 of the pipeline, consistent with prior work on entity-centric information extraction evaluation [42]:

Task 1—Dataset Mention Detection

Input: text fragment x

Output: set of spans {(s, e, text)} representing dataset mention occurrences.

Task 2—Dataset Attribute Extraction

Input: (x, mention)

Output: structured record {name, temporal_coverage, spatial_scope, provider, resolution, doi_url, role}

Evaluation: field-level exact/partial match (Precision/Recall/F1) and record completeness, following standards used in entity and relation extraction benchmarks [43].

Task 3—Normalization and Alias Mapping

Input: raw extracted attributes

Output: canonicalized dataset record with similarity evidence

Evaluation: alignment accuracy at threshold τ, Precision/Recall/F1, and consistency of numerical fields.

This decomposition enables isolation of error sources (retrieval vs. extraction vs. normalization) and supports transparent analysis of dataset–article linking quality.

3.4. Document Preprocessing and Candidate Retrieval

The first stage prepares unstructured geoscientific articles for downstream information extraction. Source documents in PDF, Word, or Markdown formats are converted into plain text while preserving structural markers such as titles, headings, section identifiers, and page boundaries. Non-textual elements (e.g., figures and tables) are removed to reduce noise. The cleaned text is subsequently segmented into paragraphs and sentences using a rule-based tokenizer designed for mixed English–Chinese content, with specific handling for abbreviations, numerical expressions, and domain-specific patterns. All preprocessing scripts and configuration files are archived to ensure reproducibility.

To efficiently identify potential dataset mentions, a retrieval-based filtering strategy is employed. We use the BM25 ranking model to score text fragments according to their relevance to a domain-specific query set. The query set is constructed from a manually curated whitelist of commonly used geoscientific datasets and related keywords (e.g., land use/cover, climate station, remote sensing product), containing 267 terms. The BM25 scoring function is defined as:

$$\mathrm{Score}(q, D) = \sum_{i=1}^{n} \mathrm{IDF}(q_{i}) \cdot \frac{f(q_{i}, D)\,(k + 1)}{f(q_{i}, D) + k \left(1 - b + b\,\frac{|D|}{\mathrm{avg}(D)}\right)} \tag{6}$$

$$\mathrm{IDF}(q_{i}) = \ln \frac{N - n(q_{i}) + 0.5}{n(q_{i}) + 0.5} \tag{7}$$

where $f(q_{i}, D)$ denotes the frequency of query term $q_{i}$ in document fragment $D$, $|D|$ is the fragment length, $\mathrm{avg}(D)$ is the average fragment length in the corpus, $N$ is the total number of fragments, and $n(q_{i})$ is the number of fragments containing $q_{i}$. Following standard BM25 practice in information retrieval, the hyperparameters were initialized as $k = 1.5$ and $b = 0.75$. Previous studies have shown that $k$ is commonly selected within the range of 1.2–2.0, while $b$ is often set around 0.75 for document-length normalization. To verify the robustness of this setting in our corpus, we further conducted a grid search over $k$ values from 1.2 to 2.0 (step = 0.1) and $b$ values from 0.5 to 1.0 (step = 0.05) on a held-out development set. The combination $(k, b) = (1.5, 0.75)$ achieved the best retrieval F1-score and was therefore adopted in all subsequent experiments. To capture contextual signals beyond lexical similarity, we introduce a composite retrieval score:

$$S(f_i) = \alpha\,\mathrm{BM25}(f_i, q) + \beta\,\mathrm{Match}_{\mathrm{regex}}(f_i) + \gamma\,\mathrm{Hit}_{\mathrm{white}}(f_i), \quad \alpha + \beta + \gamma = 1 \tag{8}$$

where $\mathrm{Match}_{\mathrm{regex}}(f_i) \in \{0, 1\}$ indicates whether $f_i$ matches predefined regular-expression patterns (e.g., year ranges, DOIs), and $\mathrm{Hit}_{\mathrm{white}}(f_i)$ is the proportion of whitelist keywords in $f_i$, min–max normalized to [0, 1] per document. The weights $\alpha$, $\beta$, and $\gamma$ were optimized through grid search on the held-out development set, with the constraint $\alpha + \beta + \gamma = 1$ and a step size of 0.1. The combination $(\alpha, \beta, \gamma) = (0.6, 0.2, 0.2)$ produced the best F1-score, indicating that lexical relevance remains the primary signal, while rule-based and whitelist-based evidence provide complementary benefits.
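The scoring in Equations (6)–(8) can be illustrated with a minimal sketch. It is not the Anserini/Pyserini implementation used in the study: the tokenizer, the rescaling of BM25 to [0, 1], the reading of the whitelist term as the share of whitelist entries found in a fragment, and the example fragments, query terms, and whitelist are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

K, B = 1.5, 0.75                    # BM25 hyperparameters reported in the text
ALPHA, BETA, GAMMA = 0.6, 0.2, 0.2  # composite weights reported in the text

def tokenize(text):
    # Assumption: a plain lowercase word tokenizer stands in for the real analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(fragments, query_terms, k=K, b=B):
    """Score each fragment against the query terms with Eqs. (6) and (7)."""
    docs = [tokenize(f) for f in fragments]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # n(q_i): number of fragments that contain each query term
    doc_freq = {q: sum(1 for d in docs if q in d) for q in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for q in query_terms:
            idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
            denom = tf[q] + k * (1 - b + b * len(d) / avg_len)
            total += idf * tf[q] * (k + 1) / denom
        scores.append(total)
    return scores

def min_max(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def composite_scores(fragments, query_terms, whitelist, pattern):
    """Combine the three signals of Eq. (8) for every fragment of one document."""
    # Assumption: BM25 is rescaled to [0, 1] so the three terms are comparable.
    bm25 = min_max(bm25_scores(fragments, query_terms))
    regex = [1.0 if pattern.search(f) else 0.0 for f in fragments]
    # Assumption: Hit_white is the share of whitelist terms found in the fragment.
    hits = min_max([sum(w in f.lower() for w in whitelist) / len(whitelist)
                    for f in fragments])
    return [ALPHA * x + BETA * r + GAMMA * h for x, r, h in zip(bm25, regex, hits)]

# Hypothetical inputs, for illustration only.
fragments = [
    "We used the MODIS land cover product for 2000–2020 at 1 km resolution.",
    "The model was trained with stochastic gradient descent.",
    "Results are discussed in the next section.",
]
query_terms = ["modis", "land", "cover"]
whitelist = ["modis", "land cover", "srtm"]
year_range = re.compile(r"\d{4}\s*[–-]\s*\d{4}")

scores = composite_scores(fragments, query_terms, whitelist, year_range)
```

Note that the IDF in Equation (7) becomes negative for terms that occur in more than half of the fragments, which matters mostly for very small corpora such as this toy example.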

The regular-expression templates target common contextual indicators associated with dataset descriptions, including temporal ranges (e.g., “2000–2020”), spatial references (e.g., “Qinghai–Tibet Plateau”), spatial resolution expressions (e.g., “1 km × 1 km”), and DOI patterns. This hybrid design enables the retrieval of both explicit dataset names and implicit contextual mentions.

For each document, the composite scores are normalized using min–max scaling. The filtering threshold was set to $\theta_r = 0.35$, and fragments with $S(f_i) < \theta_r$ were discarded. This threshold was determined through grid search on the held-out development set, where candidate values from 0.20 to 0.60 with a step size of 0.05 were evaluated. The value $\theta_r = 0.35$ achieved the best downstream F1-score. Lower thresholds retained too many noisy fragments and reduced precision, whereas higher thresholds removed valid implicit dataset mentions and reduced recall. Figure 2a further illustrates the sensitivity of the downstream F1-score to the retrieval filtering threshold. The results show that performance first improves as low-confidence noise is removed, reaches the optimum at $\theta_r = 0.35$, and then declines when the threshold becomes overly strict and excludes valid implicit dataset mentions. This trend supports the use of $\theta_r = 0.35$ in all subsequent experiments.

In addition, the retrieval depth was controlled by retaining the top-$K_r$ fragments per document. To determine this depth, we conducted a grid search over $K_r$ values from 5 to 20 with a step size of 1 on the held-out development set. The results showed that $K_r = 8$ provided the best trade-off between candidate recall and computational efficiency. When $K_r$ was smaller, valid dataset mentions were more likely to be missed; when $K_r$ was larger, more low-confidence fragments were retained, increasing subsequent LLM inference cost. Accordingly, $K_r = 8$ was fixed for all subsequent experiments. Figure 2b shows the sensitivity of the downstream F1-score to the retrieval depth $K_r$: the best performance is obtained at $K_r = 8$, indicating that a moderate number of candidate fragments provides the best balance between coverage and noise control.
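The two filtering controls described above (the threshold of 0.35 on min–max normalized scores and the retrieval depth of 8) can be sketched as follows. The function name and the order of operations (threshold first, then top-K) are assumptions; the text does not state how the two controls are combined.

```python
def select_candidates(fragments, scores, theta_r=0.35, k_r=8):
    """Keep fragments whose normalized score reaches theta_r, capped at k_r."""
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # Per-document min-max scaling; if all scores are equal, nothing stands out
    # and every fragment is normalized to 0.0 (and therefore discarded).
    normed = [(s - lo) / span if span > 0 else 0.0 for s in scores]
    kept = [(n, i) for i, n in enumerate(normed) if n >= theta_r]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))   # best score first
    top = sorted(i for _, i in kept[:k_r])            # restore document order
    return [fragments[i] for i in top]

# Hypothetical scores for ten fragments of one document.
frags = [f"fragment {i}" for i in range(10)]
raw_scores = [i / 10 for i in range(10)]
selected = select_candidates(frags, raw_scores)
```

With these toy scores the normalized values are i/9, so fragments 4 to 9 pass the threshold and all six fit within the depth limit.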

The retrieval module is implemented using a Lucene-compatible engine via Anserini/Pyserini, executed with Java (OpenJDK) and Python 3.11. Stopwords follow the Snowball list with domain-specific extensions. The whitelist contains 267 dataset-related keywords, and regular expressions include patterns such as year ranges (\d{4}–\d{4}) and DOI identifiers (10.\d{4,9}/[A-Za-z0-9._-]+). The random seed is fixed at 42 for reproducibility.
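As an illustration of the regular-expression indicator, the sketch below compiles the two patterns quoted above with Python's `re` module. Two details are assumptions rather than the study's exact patterns: the dot after `10` is escaped so that it matches a literal period, and the year-range pattern accepts either an en dash or a hyphen.

```python
import re

# Year ranges such as "2000–2020" or "2000-2020".
YEAR_RANGE = re.compile(r"\b(\d{4})\s*[–-]\s*(\d{4})\b")
# DOI identifiers; the escaped dot is an assumption (the quoted pattern uses "10.").
DOI = re.compile(r"\b10\.\d{4,9}/[A-Za-z0-9._-]+")

def match_regex(fragment):
    """Binary indicator: 1 if any template matches the fragment, else 0."""
    return int(bool(YEAR_RANGE.search(fragment) or DOI.search(fragment)))

def find_dois(fragment):
    # Trailing sentence punctuation is stripped because "." is a legal DOI character.
    return [m.group(0).rstrip(".,;") for m in DOI.finditer(fragment)]

example = "Land-use data for 2000–2020 were used (doi: 10.3390/ijgi15060243)."
```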

The final output of this stage is a collection of high-confidence text spans likely to contain dataset information. These candidate fragments are then passed to the LLM-based module for structured extraction. As illustrated in Table 2, fragments containing dataset descriptions typically receive higher composite scores and are retained, whereas methodological or general narrative text is filtered out. The resulting candidate set substantially reduces noise while preserving relevant contextual information, forming a reliable input for the subsequent normalization and knowledge graph construction stages.

3.5. Prompt Engineering for Reliable Extraction

The second stage uses an LLM to transform candidate text spans into structured dataset records under a fixed schema. Rather than free-text generation, extraction is explicitly constrained by a JSON schema to promote consistency and reduce hallucination risks.

The theoretical motivation for schema-constrained prompting is that it reduces the LLM output space from open-ended natural-language generation to a set of predefined semantic fields and expected value types. In geoscientific literature–data linkage, the target information is not merely a free-text description but a structured record consisting of dataset name, temporal coverage, spatial scope, provider, resolution, DOI/URL, and data role. By explicitly defining these fields in the prompt, the model is guided to align its generation process with the task-specific schema rather than producing unconstrained narrative text.

This constraint improves extraction robustness in three ways. First, it reduces formatting variability by requiring machine-readable JSON output. Second, it reduces unsupported generation because unknown or unsupported attributes are required to remain blank rather than being inferred from prior knowledge. Third, it supports downstream validation and normalization because each extracted value is associated with a fixed semantic field. These properties are particularly important in geoscientific texts, where dataset references are often implicit, distributed across multiple sentences, and expressed through heterogeneous spatial–temporal descriptions. However, schema constraints do not by themselves guarantee factual correctness; they primarily restrict output structure and provide explicit locations at which unsupported or inconsistent values can be detected and reviewed.

The pipeline uses a two-stage prompting strategy with long-document chunking and controlled overlap, mirroring the four-stage system in Figure 1. This structured extraction stage can also be interpreted as a form of task-specific in-context learning, in which the model is guided by schema-constrained prompts, explicit output instructions, and structural constraints rather than parameter updating or task-specific fine-tuning. In this sense, the LLM is not used as a generic free-text generator, but as a controllable reasoning module for converting semi-structured textual evidence into standardized extraction records.

Before prompting the language model, non-informative tail sections (e.g., References, Acknowledgments, and Funding statements) are removed to reduce noise. The remaining text is segmented into overlapping chunks to preserve contextual continuity across chunk boundaries.

Each chunk was limited to a maximum length of $C = 3000$ characters, with an overlap of $O = 400$ characters between adjacent chunks. These values were selected through pilot experiments comparing chunk sizes $\{2000, 3000, 4000\}$ and overlap sizes $\{200, 400, 600\}$ on the held-out development set. The configuration $(C, O) = (3000, 400)$ achieved the highest extraction F1-score while maintaining efficient inference cost. This sliding-window strategy ensures that dataset descriptions spanning chunk boundaries remain accessible to the model [34]. Compared with substantially larger chunk settings, this configuration provided a better trade-off between contextual completeness, extraction stability, and practical model input budget.

Given a document of length $L$, the number of chunks is computed as:

$$N_{\mathrm{chunks}} = \left\lceil \frac{L - O}{C - O} \right\rceil \tag{9}$$

where $C$ denotes the chunk size (maximum chunk length) and $O$ denotes the overlap size. The effective stride between consecutive chunks is therefore $C - O$. The ceiling operator ensures full coverage of the document without information loss. This sliding-window strategy balances computational efficiency and contextual completeness for long-span dataset descriptions.
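A minimal sketch of the sliding-window chunking and of Equation (9) follows. The function names are hypothetical, and documents no longer than one chunk are treated as a single chunk, a case that Equation (9) does not cover when $L \le O$.

```python
import math

def n_chunks(length, c=3000, o=400):
    """Number of chunks from Eq. (9); short documents form one chunk."""
    if length <= c:
        return 1
    return math.ceil((length - o) / (c - o))

def chunk_text(text, c=3000, o=400):
    """Split text into chunks of at most c characters with o characters of overlap."""
    stride = c - o
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + c])
        if start + c >= len(text):   # the current chunk reaches the end of the text
            break
        start += stride
    return chunks

# Hypothetical 10,000-character document.
document = "".join(str(i % 10) for i in range(10000))
pieces = chunk_text(document)
```

For a 10,000-character document, Equation (9) gives ⌈9600/2600⌉ = 4 chunks, and the last 400 characters of each chunk reappear at the start of the next one.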

Stage A: Dataset name identification (recall-oriented): Given one Markdown fragment, the model returns only a de-duplicated list of names that are actual data resources (statistical yearbooks/censuses/bulletins; base mapping databases such as “1:250 k”; remote-sensing/land-use products; DEMs such as SRTM/ASTER; reanalysis/gridded products; station observations; and derived outputs produced in the paper via interpolation/correction/resampling/retrieval/estimation). Platforms/portals/basemaps are excluded unless cited as a concrete product. Outputs from all chunks are merged and de-duplicated. Output schema (Stage-A): {"datasets": ["name1", "name2", …]}.

Stage B: Schema-constrained structured extraction (Structured JSON): The final prompt receives (i) a concise core text (title, authors, abstract, methods) and (ii) the merged name list. The model then generates structured records under the following schema:

{
  "paper_meta": {
    "paper_id": "", "title": "", "authors": "",
    "journal": "", "year": "", "volume": "", "issue": ""
  },
  "data_resources": [
    {
      "name": "", "time_range": "", "dataset_summary": "",
      "location": "", "dataset_authors": "",
      "institution": "", "resolution": "",
      "doi": "", "source_url": "", "data_role": ""
    }
  ]
}

The prompt explicitly requires every populated field to be supported by textual evidence in the supplied fragment or core text. If a dataset name is identifiable but a specific attribute, such as provider, DOI/URL, resolution, or temporal coverage, is not explicitly supported, the corresponding field is retained as blank rather than completed from model prior knowledge.

Coverage verification is used to reduce omissions rather than to force unsupported generation. Specifically, if the source text contains explicit resource indicators, such as statistical yearbook, census, bulletin, 1:250 k base mapping, DEM/SRTM, or China 1:100 k land-use product expressions, but the initial output contains no corresponding dataset item, the relevant text span is re-examined. A record is added only when supporting evidence is present in the supplied text; otherwise, no dataset entity is created.

Time rules split discrete years (e.g., 1980, 1990, 2000, 2010, 2020) into multiple records, while continuous ranges (e.g., 1991–2020) remain single records (“1991–2020”); when only multi-period temporal indicators are present, one record is kept with a corresponding note in dataset_summary. Role inference rules assign data_role = “output” only when the text explicitly describes the dataset as being generated or transformed in the study through operations such as interpolation, correction, resampling, retrieval, or estimation. The role is recorded as “source” only when textual evidence indicates external data use, and it remains blank when the role cannot be reliably determined.

Reliability controls and post-validation:

(1): Call stability: each prompt is executed twice under identical settings to reduce stochastic variation in LLM outputs. We compared one, two, and three repeated calls in pilot experiments on the held-out development set, and two calls provided most of the stability improvement while avoiding excessive inference cost. API calls were retried up to four times only in cases of transient service failure; this retry mechanism did not alter model predictions or evaluation results.
(2): Strict JSON enforcement: only the first valid top-level JSON block is retained and coerced into the predefined schema.
(3): Evidence-based extraction constraint: to reduce hallucination risks, the prompt explicitly requires the model to extract only information supported by the provided text. Unknown or unsupported fields are required to remain blank rather than being inferred from prior knowledge.
(4): Field-level validation: extracted values are checked using rule-based validation, including temporal-format checking, DOI-pattern validation, and consistency checking between dataset names and normalized attributes. Records containing internally inconsistent attributes or unsupported mappings are flagged as low-confidence before knowledge graph ingestion. Format validation is treated as an error-detection mechanism rather than as evidence that a field value is factually correct.
(5): Automatic temporal splitting: multi-year fields containing separators are programmatically expanded into separate records only when the text provides sufficient evidence that the years correspond to distinct dataset records; otherwise, the temporal information is retained in the original record to avoid artificial record generation.
(6): Structural fallback: If no dataset is extracted, a placeholder record is generated to maintain structural consistency for downstream processing. Such placeholder records do not represent confirmed dataset entities, are excluded from evaluation, and are not treated as validated dataset links in the knowledge graph.
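Controls (2) and (4) above can be sketched as follows, assuming the raw LLM reply is available as a string. The helper names, the `low_confidence` key, and the specific checks are illustrative assumptions; only the field names come from the Stage B schema.

```python
import json
import re

RESOURCE_FIELDS = ["name", "time_range", "dataset_summary", "location",
                   "dataset_authors", "institution", "resolution",
                   "doi", "source_url", "data_role"]
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

def first_json_object(raw):
    """Return the first substring of raw that parses as a JSON object, or None."""
    decoder = json.JSONDecoder()
    pos = raw.find("{")
    while pos != -1:
        try:
            obj, _ = decoder.raw_decode(raw, pos)
            return obj
        except json.JSONDecodeError:
            # Simplification: if an outer object is malformed, the scan may
            # fall through to an object nested inside it.
            pos = raw.find("{", pos + 1)
    return None

def coerce_record(item):
    """Force one data_resources item into the fixed schema and flag format issues."""
    record = {f: str(item.get(f) or "").strip() for f in RESOURCE_FIELDS}
    flags = []
    if record["doi"] and not DOI_RE.match(record["doi"]):
        flags.append("doi_format")
    if record["data_role"] not in ("", "source", "output"):
        flags.append("data_role")
    # A clean format check only means no error was detected, not that the value is true.
    record["low_confidence"] = bool(flags)
    return record, flags

# Hypothetical model reply with surrounding chatter and an unsupported DOI value.
reply = 'Result: {"data_resources": [{"name": "SRTM DEM", "doi": "doi unknown", "resolution": null}]} Thanks.'
parsed = first_json_object(reply)
record, flags = coerce_record(parsed["data_resources"][0])
```

Unknown keys are dropped and missing or null fields become empty strings, which matches the requirement that unsupported attributes stay blank.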

These controls complement the retrieval stage (BM25 + regex + whitelist) and the normalization stage (hybrid similarity with a threshold of 0.70), forming an integrated framework for reliable dataset extraction. Importantly, the framework is designed to reduce, detect, and flag potential hallucination risks rather than to claim complete elimination of unsupported generation.

3.6. Post-Processing and Normalization

The outputs generated by the LLM are often heterogeneous in form, reflecting variations in how datasets are described across different articles. Therefore, a dedicated post-processing and normalization stage is required to ensure that all extracted information is consistent, comparable, and ready for integration into the knowledge graph. This stage transforms the raw model outputs into standardized, semantically interoperable records.

To formalize the normalization process, we define a mapping function $N(a_i)$ that standardizes each extracted attribute $a_i$ to its canonical form in a controlled schema $S$:

$$N(a_i) = \underset{s_j \in S}{\arg\max}\ \mathrm{sim}(a_i, s_j) \tag{10}$$

where $s_j$ represents a reference entry in the standardized schema, and $\mathrm{sim}(a_i, s_j)$ denotes the semantic similarity between the extracted attribute and the reference term.

To quantify the confidence of mapping, a normalized weight distribution is further defined:

$$w(a_i, s_j) = \frac{\mathrm{sim}(a_i, s_j)}{\sum_{s_k \in S} \mathrm{sim}(a_i, s_k)} \tag{11}$$

where $w(a_i, s_j)$ represents the confidence assigned to candidate schema entry $s_j$, and $\sum_{s_j \in S} w(a_i, s_j) = 1$. This probabilistic formulation provides a soft alignment between extracted attributes and canonical schema nodes, supporting weighted aggregation and uncertainty-aware normalization.

The similarity function $\mathrm{sim}(a_i, s_j)$ is implemented as a weighted hybrid metric combining Levenshtein edit similarity (weight = 0.4) and sentence-embedding cosine similarity (weight = 0.6) using the text-embedding-3-small model. The hybrid weights were optimized through grid search on the held-out development set with a step size of 0.1 under the constraint $w_1 + w_2 = 1$. The combination $w_1 = 0.4$ for Levenshtein similarity and $w_2 = 0.6$ for embedding cosine similarity achieved the highest F1-score on the held-out development set and was therefore adopted in subsequent experiments. This result suggests that semantic embedding similarity contributes slightly more than surface-form similarity in this task. Writing $\lambda = w_1$, the hybrid metric is:

$$\mathrm{sim}(a_i, s_j) = \lambda\,\mathrm{Lev}(a_i, s_j) + (1 - \lambda)\cos(e(a_i), e(s_j)), \quad \lambda = 0.4 \tag{12}$$

where $\mathrm{Lev}(\cdot)$ denotes normalized Levenshtein similarity and $\cos(\cdot)$ is cosine similarity between sentence embeddings. Sentence embeddings are generated using the text-embedding-3-small model and L2-normalized prior to similarity computation:

$$\cos(e(a_i), e(s_j)) = e(a_i)^{\top} e(s_j) \tag{13}$$

The normalized Levenshtein similarity is defined as:

$$\mathrm{Lev}(a_i, s_j) = 1 - \frac{d_{\mathrm{edit}}(a_i, s_j)}{\max(|a_i|, |s_j|)} \tag{14}$$

where $d_{\mathrm{edit}}$ is the edit distance and $|\cdot|$ denotes string length. This formulation normalizes edit similarity to the range $[0, 1]$, while cosine similarity is computed over L2-normalized vectors.
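The hybrid metric in Equations (12)–(14) can be sketched as below. The study embeds strings with text-embedding-3-small; to keep the example self-contained, a character-trigram count vector (`toy_embedding`, a hypothetical stand-in) replaces the real embedding, so the numeric values will differ from those of the actual system.

```python
import math
from collections import Counter

LAMBDA = 0.4  # weight of the Levenshtein term, as reported in the text

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lev_similarity(a, b):
    """Eq. (14): edit distance normalized by the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def toy_embedding(text):
    # Stand-in for a sentence embedding: L2-normalized character-trigram counts.
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {gram: v / norm for gram, v in counts.items()}

def cosine(u, v):
    """Eq. (13): dot product of two L2-normalized (sparse) vectors."""
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())

def hybrid_sim(a, b, lam=LAMBDA):
    """Eq. (12): weighted mix of surface-form and embedding similarity."""
    return lam * lev_similarity(a, b) + (1 - lam) * cosine(toy_embedding(a), toy_embedding(b))

close = hybrid_sim("MODIS land cover", "MODIS Land Cover product")
far = hybrid_sim("MODIS land cover", "WorldClim 2")
```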

A match is considered equivalent when $\mathrm{sim} \geq 0.70$. This threshold was determined through grid search on the held-out development set, with candidate values ranging from 0.60 to 0.90 (step = 0.05; Figure 3). Results showed that a threshold of 0.70 achieved the highest F1-score while maintaining high precision. Lower thresholds introduced false positives from semantically adjacent dataset names, whereas higher thresholds reduced recall. Therefore, a threshold of 0.70 was adopted for normalization tasks.
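Putting Equations (10) and (11) together with the 0.70 threshold, the mapping step can be sketched as follows. The similarity function is passed in as a parameter; the default `standin_sim` uses `difflib.SequenceMatcher` purely as a placeholder for the hybrid metric, and the returned dictionary layout is a hypothetical choice.

```python
from difflib import SequenceMatcher

def standin_sim(a, b):
    # Placeholder for the hybrid similarity of Eq. (12); an assumption for the demo.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def normalize(attr, schema, sim=standin_sim, tau=0.70):
    """Map attr to its best schema entry (Eq. 10) with confidence weights (Eq. 11)."""
    sims = {entry: sim(attr, entry) for entry in schema}
    best = max(sims, key=sims.get)
    total = sum(sims.values())
    weights = {entry: (value / total if total > 0 else 0.0)
               for entry, value in sims.items()}
    accepted = sims[best] >= tau
    return {
        "canonical": best if accepted else None,  # below tau: leave unmapped
        "similarity": sims[best],
        "weights": weights,
        "needs_review": not accepted,
    }

# Two registry entries taken from the institution examples in the text.
registry = ["Chinese Academy of Sciences", "China Meteorological Administration"]
hit = normalize("chinese academy of science", registry)
miss = normalize("xyz", registry)
```

Entries that fall below the threshold are left unmapped and flagged, which mirrors the low-confidence handling described for the normalization stage.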

To further support normalization, three domain-specific vocabularies are incorporated [39,40,41]:

(1): A geographical gazetteer (~3000 entries) covering provinces, basins, and major regions such as the Qinghai–Tibet Plateau.
(2): An institution registry (~700 entries) compiled from major data providers, including the Chinese Academy of Sciences (CAS) and the China Meteorological Administration (CMA).
(3): A dataset alias lexicon (~500 entries) derived from the annotated benchmark corpus.

Representative canonical datasets include WorldClim 2 [44], MODIS Collection 5 land cover [45], and FROM-GLC [46]. Additional references such as GlobeLand30 [47], national land-use change datasets [48], SRTM [49], and MODIS vegetation indices [50] are used when necessary to harmonize naming conventions, spatial resolution, and terminology.

Based on this framework, normalization is performed in three steps.

(1): Name normalization and synonym-conflict resolution. Dataset aliases are unified to canonical names. For example, “LUCC dataset”, “China land use dataset”, and “land cover product” are mapped to China Land Use/Cover Change Dataset (LUCC), ensuring unique dataset nodes. When one alias can be mapped to multiple candidate datasets, the system does not rely on name similarity alone. Instead, synonym conflicts are resolved by jointly comparing dataset name similarity, temporal coverage, spatial scope, provider institution, and spatial resolution. A candidate mapping is accepted only when its hybrid similarity exceeds the threshold $τ = 0.70$ and its auxiliary attributes do not contradict the candidate canonical record. If multiple candidates remain plausible or if key auxiliary attributes conflict, the case is marked as ambiguous and reserved for manual review. This strategy prevents semantically adjacent but non-equivalent datasets from being merged into the same canonical node.
(2): Temporal and spatial normalization. Temporal expressions such as “2000–2020”, “from 2000 to 2020”, or “20-year period” are converted into standardized interval formats. Spatial expressions such as “Qinghai–Tibet Plateau” and “Tibetan Plateau” are reconciled using the controlled geographic vocabulary. For ambiguous spatial entities, the system applies a hierarchical gazetteer-based disambiguation strategy. Candidate place names are compared according to province–city–county relations, regional aliases, administrative codes, and contextual cues such as neighboring administrative units, basin names, and study-area descriptions. When sufficient contextual evidence is available, the spatial expression is linked to a canonical geographic entity in the hierarchical gazetteer, while the original textual expression is preserved for traceability. If a unique geographic entity cannot be determined, the original expression is retained and the corresponding field is marked as low-confidence for manual review.
(3): Attribute harmonization and low-confidence handling. Resolution units such as “1000 m” and “1 km” are standardized, institution names are mapped to their authoritative forms (e.g., “CAS” to “Chinese Academy of Sciences”), and data roles are constrained to the controlled vocabulary “source” or “output”. Low-confidence cases are not forcibly normalized; instead, their original expressions are preserved and flagged for manual review, so that uncertain mappings do not propagate silently into the knowledge graph.
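Parts of steps (2) and (3) lend themselves to simple rules, sketched below. The function names, the interval format, and the choice of metres as the canonical unit are assumptions. Expressions such as “20-year period” carry no explicit years, so this sketch returns `None` for them and leaves resolution to context or manual review.

```python
import re

# Alias table built from the two institution examples named in the text.
INSTITUTION_ALIASES = {
    "CAS": "Chinese Academy of Sciences",
    "CMA": "China Meteorological Administration",
}

def normalize_time_range(text):
    """Return 'YYYY–YYYY' for explicit ranges, else None (left for review)."""
    m = re.search(r"(\d{4})\s*(?:–|-|to)\s*(\d{4})", text)
    if m:
        return f"{m.group(1)}–{m.group(2)}"
    return None

def resolution_in_metres(text):
    """Convert '1000 m' or '1 km' to a number of metres, else None."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(km|m)\s*", text, flags=re.IGNORECASE)
    if not m:
        return None
    value = float(m.group(1))
    return value * 1000 if m.group(2).lower() == "km" else value

def normalize_institution(name):
    # Unknown names are returned unchanged so the original expression is preserved.
    return INSTITUTION_ALIASES.get(name.strip(), name.strip())

def normalize_role(role):
    """Constrain data_role to the controlled vocabulary; anything else stays blank."""
    role = role.strip().lower()
    return role if role in ("source", "output") else ""
```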

The result is a structured representation in which each dataset is consistently linked to standardized attributes. As illustrated in Figure 4, a dataset node is connected to normalized attribute nodes, including temporal coverage, location, institution, resolution, DOI/URL, summary, and data role. This representation ensures cross-article interoperability and provides a reliable foundation for subsequent knowledge graph construction.

To support geo-interoperability, spatial entities are normalized to a hierarchical gazetteer (province–city–county) and associated with a consistent CRS specification. When available, spatial extents are standardized into GeoJSON or WKT formats while preserving original textual expressions for traceability. Ambiguous or multilingual place names are resolved using alias dictionaries and administrative codes.

3.7. Knowledge Graph Construction

After post-processing and normalization, the structured dataset information is integrated into a Neo4j-based knowledge graph (KG). This stage aims to transform the extracted and standardized metadata into a graph representation that supports structured querying, visualization, and knowledge discovery.

A domain-oriented graph schema was designed to organize the extracted entities and their relationships. Four primary node types were defined: Article, representing the source publication with attributes such as title, authors, journal, year, and DOI; Dataset, representing standardized datasets extracted from the literature; Institution, representing organizations responsible for providing or curating datasets; and Region, representing spatial entities associated with datasets. Each Dataset node was further connected to attribute information, including temporal coverage, summary, spatial resolution, authorship, institution, DOI/URL, and dataset role, as described in Section 3.5. From a representation perspective, the constructed KG provides an explicit graph-based organization of literature–data linkage results, in which normalized entities are represented as typed nodes and their semantic associations are represented as typed edges. This graph representation enables relation-level retrieval, aggregation, and cross-document analysis beyond flat record storage, thereby making the linkage results structurally interpretable and computationally queryable.

To capture semantic associations across entities, several relationship types were defined, including USE (Article → Dataset), PROVIDED_BY (Dataset → Institution), LOCATED_IN (Dataset → Region), and HAS_ATTRIBUTE (Dataset → attribute information such as time, summary, and resolution). This schema enables semantic integration across publications and supports multi-dimensional analysis of dataset usage patterns.

The construction of the KG followed a three-stage data processing workflow. First, the normalized extraction results were transformed into structured files, where each entity and relationship was explicitly encoded for graph import. Second, the data were imported into Neo4j using Cypher scripts with LOAD CSV and MERGE operations. The MERGE mechanism ensured that duplicate entities were automatically unified while preserving all associated relationships. Third, the resulting graph could be interactively explored using Neo4j Browser or external visualization tools such as Gephi 0.10.1. Example analytical queries include identifying frequently used datasets in specific research domains, such as Qinghai–Tibet Plateau studies. In addition, scientific-mention corpora (e.g., SciDMT) can be used to facilitate downstream evaluation and comparative analysis [51].
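The first two workflow steps can be illustrated with a short sketch that flattens normalized records into a CSV of Article–Dataset pairs and pairs it with a Cypher import statement. The file name `uses.csv`, the column names, and the property keys are hypothetical; only the node labels, the USE relationship, and the use of LOAD CSV with MERGE come from the text.

```python
import csv
import io

# Hypothetical import script: file name and property keys are illustrative.
CYPHER_IMPORT = """
LOAD CSV WITH HEADERS FROM 'file:///uses.csv' AS row
MERGE (a:Article {paper_id: row.paper_id})
MERGE (d:Dataset {name: row.dataset})
MERGE (a)-[:USE]->(d)
"""

def uses_csv(records):
    """Flatten normalized records into one Article-USE-Dataset row per pair."""
    rows = sorted({(rec["paper_meta"]["paper_id"], res["name"])
                   for rec in records
                   for res in rec["data_resources"]
                   if res["name"]})          # placeholder records have no name
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["paper_id", "dataset"])
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical normalized extraction results for two papers.
records = [
    {"paper_meta": {"paper_id": "P1"},
     "data_resources": [{"name": "SRTM"}, {"name": "WorldClim 2"}, {"name": "SRTM"}]},
    {"paper_meta": {"paper_id": "P2"},
     "data_resources": [{"name": "SRTM"}, {"name": ""}]},
]
csv_text = uses_csv(records)
```

Because MERGE matches existing nodes before creating new ones, both papers end up attached to a single SRTM node after import, which is the deduplication behavior described above.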

The constructed KG supports both fine-grained and global analytical tasks. At the attribute level, users can examine dataset characteristics such as spatial resolution and temporal coverage. At the corpus level, the graph enables cross-article aggregation and trend analysis of dataset usage. Illustrative visualizations of the global KG structure and node-level relationships are presented in Section 4.5 (Figures 7 and 8).

4. Experimental Results

4.1. Dataset and Benchmark

The evaluation was conducted on an expanded cross-journal geoscientific literature benchmark. To improve generalizability beyond the original single-journal setting, we collected papers from multiple geoscientific journals with non-overlapping training and test sets. The training set contains 1000 papers collected from ISPRS International Journal of Geo-Information, Journal of Geographical Sciences, Computers & Geosciences, and International Journal of Applied Earth Observation and Geoinformation. The test set contains 200 papers collected from ISPRS International Journal of Geo-Information, Acta Geographica Sinica, Journal of Geographical Sciences, Applied Geography, Computers & Geosciences, and International Journal of Applied Earth Observation and Geoinformation. The training and test sets do not overlap.

To reduce layout-related variation, all source PDFs were first converted into standardized Markdown text before retrieval and extraction. This preprocessing step partially normalizes document-format differences, while the revised benchmark still preserves substantial heterogeneity in journal source, article organization, and dataset-description style. The updated test benchmark therefore provides a broader and more realistic evaluation setting for literature–dataset linkage under cross-journal conditions. These papers cover diverse domains, including land-use/cover change (LUCC), ecosystem stability, climate observation, and urbanization.

Each paper in the test benchmark was converted to standardized Markdown text and annotated according to a unified labeling guideline. A total of 1349 gold-standard dataset mentions were identified in the test set. Each dataset mention includes eight attributes: dataset name, temporal coverage, spatial scope, dataset authors, providing institution, resolution, DOI/URL, and dataset role (source or output). For confusion-matrix-based evaluation, we additionally defined 97 negative non-mention instances, yielding 1446 labeled evaluation instances in total.

The annotations were completed independently by six postgraduate annotators with backgrounds in geoinformatics, remote sensing, geographic information science, or computer science. All annotators had experience in reading geoscientific literature or processing geospatial datasets. Before formal annotation, the annotators received a unified annotation guideline that defined dataset mentions, attribute fields, positive and negative instances, and typical cases of implicit dataset references. A pilot annotation round was first conducted to calibrate annotation criteria, after which disagreements were discussed and the guideline was refined. During formal annotation, each test document was independently annotated by at least two annotators. Disagreements were resolved through joint adjudication by one researcher and one associate researcher with geoscience expertise, producing the final ground-truth benchmark for evaluation. Table 3 summarizes the basic statistics of the test set. (This test set can be viewed in the Data Availability Statement.)

To assess annotation reliability, we calculated inter-annotator agreement (IAA) on a randomly selected reliability-assessment subset before adjudication. Agreement was evaluated at both the mention level and the attribute level. At the mention level, average pairwise F1-score was used to measure span-level consistency in identifying dataset mentions. At the attribute level, Cohen’s kappa was used for the categorical data_role field, while normalized agreement was used for the remaining structured attributes after canonicalization. The resulting agreement scores were 0.85 for mention-level pairwise F1, 0.81 for data_role Cohen’s kappa, and 0.87 for structured-attribute normalized agreement, indicating good annotation reliability. Table 4 summarizes the agreement results, and Appendix A provides the detailed protocol and calculation procedure.

4.2. Data Extraction

The extraction system was evaluated on the revised cross-journal test benchmark described in Section 4.1 to assess its accuracy and robustness in identifying dataset mentions and their associated attributes from geoscientific literature.

As shown in Figure 5, the proposed system produced TP = 1223, FP = 81, FN = 126, and TN = 16, corresponding to 93.79% precision, 90.66% recall, and 92.20% F1-score. Precision, recall, and F1-score were computed in the standard way from the confusion-matrix statistics.
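As a quick check, the reported metrics can be recomputed from the confusion-matrix counts given above; the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard mention-level metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported for the proposed system (TN is not needed for these metrics).
p, r, f1 = precision_recall_f1(tp=1223, fp=81, fn=126)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
# precision=93.79% recall=90.66% f1=92.20%
```

The counts are also consistent with the benchmark size: TP + FN = 1349 gold-standard mentions and FP + TN = 97 negative instances.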

The evaluation follows a mention-level matching scheme. Each extracted dataset mention was compared with the expert-annotated ground truth after normalization. A prediction was counted as correct when its dataset name matched a reference mention with semantic similarity ($\mathrm{sim} \geq 0.70$). Attribute-level metrics were calculated in the same manner: missing attributes were treated as false negatives (FN), while incorrect or mismatched values were counted as false positives (FP) [52]. When multiple consecutive sentences referred to the same dataset name ($\mathrm{sim} \geq 0.90$), they were merged into a single mention before scoring.

$$\mathrm{merge}(m_u, m_v) \iff \mathrm{sim}(m_u, m_v) \geq 0.90 \tag{15}$$

where $m_u, m_v$ are candidate mentions in consecutive sentences, and $\mathrm{sim}(\cdot)$ is the hybrid similarity defined in Section 3.6 (Equations (12)–(14)). This strategy avoids over-counting repeated references and ensures consistent evaluation of multi-sentence dataset descriptions. From the perspective of graph construction, a true positive corresponds to an extracted mention that can be correctly normalized and linked to the corresponding gold-standard dataset entity, whereas false positives and false negatives indicate incorrect or missing mention-level links before graph ingestion. Under this setting, the confusion-matrix statistics are not only extraction metrics, but also indicators of the reliability of the upstream literature–data linkage that supports the subsequent KG construction.
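The merge rule in Equation (15) can be sketched as a single pass over mentions in sentence order. `standin_sim` is a placeholder based on `difflib.SequenceMatcher`, not the hybrid similarity used in the study, and keeping the first mention of each run is an assumption.

```python
from difflib import SequenceMatcher

def standin_sim(a, b):
    # Placeholder for the hybrid similarity; an assumption for illustration.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def merge_consecutive(mentions, sim=standin_sim, threshold=0.90):
    """Collapse runs of consecutive mentions whose similarity reaches the threshold."""
    merged = []
    for mention in mentions:
        if merged and sim(merged[-1], mention) >= threshold:
            continue            # same dataset as the previous sentence: skip
        merged.append(mention)
    return merged

# Hypothetical mention names from four consecutive sentences.
mentions = ["WorldClim 2", "WorldClim 2", "SRTM", "WorldClim 2"]
merged = merge_consecutive(mentions)
```

Only adjacent repeats are merged, so the final “WorldClim 2” mention, which follows a different dataset, is kept as a separate mention.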

The evaluation includes both positive and negative instances in the test benchmark, allowing explicit reporting of true negatives and a more complete confusion-matrix analysis. Because dataset mention extraction remains a highly imbalanced task in which positive instances are much rarer than non-mention spans, precision, recall, and F1-score remain the most informative metrics for assessing extraction quality [53]. In the context of geoscientific dataset mention extraction, precision reflects the proportion of extracted mentions that are truly correct, indicating the system’s ability to avoid spurious or weakly supported predictions. Recall reflects the proportion of gold-standard dataset mentions that are successfully identified, measuring the system’s ability to recover implicit or variably expressed dataset references. The F1-score provides a balanced summary of these two aspects by capturing the trade-off between false positives and false negatives under the current mention-level evaluation setting.

The results indicate that the proposed framework maintains strong extraction performance under a more heterogeneous cross-journal test setting. The observed balance between precision and recall suggests that the system can identify relevant dataset mentions effectively while limiting redundant or spurious matches.

To further assess the statistical reliability of the extraction results, we estimated 95% confidence intervals for precision, recall, and F1-score based on the mention-level confusion-matrix statistics. Precision and recall were treated as binomial proportions conditioned on predicted-positive and gold-positive instances, respectively, and their confidence intervals were estimated using binomial proportion intervals. Because F1-score is a nonlinear combination of precision and recall rather than a simple binomial proportion, its confidence interval was estimated using non-parametric bootstrap resampling with 1000 resamples. In each bootstrap iteration, test instances were sampled with replacement, and precision, recall, and F1-score were recomputed. The 2.5th and 97.5th percentiles of the bootstrap F1-score distribution were used as the 95% confidence interval.

The proposed framework achieved 93.79% precision, 90.66% recall, and 92.20% F1-score, with corresponding 95% confidence intervals of [92.35%, 94.97%], [88.99%, 92.10%], and [91.09%, 93.13%], respectively. These intervals indicate limited sampling uncertainty under the current mention-level evaluation setting and provide additional statistical support for the reported extraction performance. Therefore, the results further support the suitability of the proposed retrieve → extract → normalize → link pipeline for literature–dataset linkage and subsequent knowledge graph construction.

4.3. Prompting Strategy, Efficiency, and LLM Backbone Comparison

To further examine the robustness and reproducibility of the structured extraction stage, we conducted three experiments: (1) a prompting-strategy comparison, (2) a performance–efficiency analysis of repeated LLM calls, and (3) a comparison with transformer-based baselines and different LLM backbones. All experiments were conducted on the same revised cross-journal test benchmark using the same preprocessing procedure, retrieval input, normalization rules, and mention-level evaluation protocol. Unless otherwise specified, GPT-5.2 was used as the default LLM backbone.

4.3.1. Prompting Strategy Comparison

We first evaluated the influence of prompt design on structured extraction performance. Three prompting strategies were compared under the same retrieval, normalization, and evaluation settings: zero-shot prompting, few-shot in-context prompting, and schema-constrained prompting. The zero-shot prompt directly requested dataset information extraction without examples or strict field-level constraints. The few-shot in-context prompt provided annotated extraction examples before inference. The schema-constrained prompt used the fixed JSON schema and field-level instructions described in Section 3.5.

As shown in Table 5, schema-constrained prompting achieved the best overall performance and the highest valid JSON rate. Zero-shot prompting produced more format errors and incomplete attributes, which reduced downstream normalization reliability. Few-shot in-context prompting improved extraction quality compared with zero-shot prompting, but it increased prompt length and inference cost. In contrast, schema-constrained prompting provided the best balance among extraction accuracy, output validity, and compatibility with post-processing and knowledge graph construction. This result supports the use of schema-constrained prompting as the default extraction strategy, because the fixed output schema reduces format uncertainty and makes the extracted records more suitable for subsequent normalization and graph ingestion.
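To illustrate what a valid JSON rate measures, the following sketch checks whether each raw model output parses as a JSON object containing every expected field. The field names are hypothetical stand-ins for the attributes named in the paper (dataset name, temporal coverage, spatial scope, resolution, provider, and role); the actual schema is the one defined in Section 3.5, and the validity criterion used here is an assumption.

```python
import json

# Hypothetical field names; the real schema is defined in Section 3.5.
REQUIRED_FIELDS = (
    "dataset_name", "temporal_coverage", "spatial_scope",
    "resolution", "provider", "role",
)


def is_valid_record(raw_output: str) -> bool:
    """True if the output parses as a JSON object with all fields as strings.

    Empty strings are accepted, so unsupported attributes can stay blank.
    """
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    return all(isinstance(record.get(field), str) for field in REQUIRED_FIELDS)


def valid_json_rate(raw_outputs: list[str]) -> float:
    """Fraction of model outputs that pass the schema check."""
    if not raw_outputs:
        return 0.0
    return sum(is_valid_record(o) for o in raw_outputs) / len(raw_outputs)


# Illustrative outputs: one valid, one malformed, one missing a field.
sample_outputs = [
    json.dumps({field: "" for field in REQUIRED_FIELDS} | {"dataset_name": "ERA5"}),
    "Sure! Here is the dataset: ERA5",
    json.dumps({"dataset_name": "ERA5"}),
]
print(f"valid JSON rate: {valid_json_rate(sample_outputs):.2f}")
```

Only records that pass such a check can be handed to normalization and graph ingestion without manual repair, which is why output validity is reported alongside extraction accuracy.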

4.3.2. Performance–Efficiency Trade-Off of Repeated LLM Calls

In addition to prompt design, we evaluated the performance–efficiency trade-off of repeated LLM calls. In the proposed framework, repeated calls are used to reduce stochastic omissions and improve the stability of structured extraction. However, increasing the number of calls also increases API usage and token consumption. Therefore, we compared three settings, n = 1, n = 2, and n = 3, where n denotes the number of repeated extraction calls under the same schema-constrained prompting setting. For each setting, the same aggregation, post-processing, and normalization rules were applied.

Here, the repeated-call setting n is a methodological parameter, whereas relative API-call cost and relative token cost are efficiency indicators. Relative API-call cost measures the increase in the number of model calls compared with the n = 1 setting, while relative token cost measures the approximate increase in total prompt and completion tokens.

As shown in Table 6, increasing repeated calls from n = 1 to n = 2 improved the F1-score from 89.99% to 92.20%, indicating that repeated extraction and aggregation can reduce stochastic omissions and improve structured extraction stability. However, increasing the number of calls from n = 2 to n = 3 only slightly improved the F1-score from 92.20% to 92.46%, while increasing relative API-call cost from 2.00× to 3.00× and relative token cost from 1.96× to 2.91×.
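The diminishing return can be quantified directly from the figures quoted above. The sketch below computes the F1 gain and the gain per additional unit of relative token cost for each step. The 1.00× baseline costs for n = 1 follow from the definition of relative cost, and the function name is an illustrative assumption.

```python
# (F1 in %, relative API-call cost, relative token cost) per repeated-call setting,
# taken from the Table 6 values quoted in the text.
settings = {
    1: (89.99, 1.00, 1.00),
    2: (92.20, 2.00, 1.96),
    3: (92.46, 3.00, 2.91),
}


def marginal_gain(n_from: int, n_to: int) -> tuple[float, float]:
    """Return (F1 gain in points, F1 gain per extra unit of relative token cost)."""
    f1_a, _, tokens_a = settings[n_from]
    f1_b, _, tokens_b = settings[n_to]
    gain = f1_b - f1_a
    return gain, gain / (tokens_b - tokens_a)


for step in ((1, 2), (2, 3)):
    gain, per_token_unit = marginal_gain(*step)
    print(f"n = {step[0]} -> {step[1]}: +{gain:.2f} F1 points, "
          f"{per_token_unit:.2f} points per unit of relative token cost")
```

The second call adds 2.21 F1 points, whereas the third adds only 0.26 points for a nearly identical increase in token cost.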

4.3.3. Baseline and LLM Backbone Comparison

To better contextualize the performance of the proposed framework, we compared it with transformer-based scientific information extraction baselines and multiple LLM backbones. Specifically, SciBERT was selected as a representative scientific-domain pretrained language model because it has been widely used in scientific named entity recognition and relation extraction tasks. For the SciBERT sequence labeling baseline, SciBERT was fine-tuned on the training set using BIO-style dataset mention labels and evaluated on the same revised cross-journal test benchmark. For the BM25 + SciBERT classifier baseline, the same BM25 retrieval module was first used to generate candidate fragments, and a SciBERT-based binary classifier was then applied to identify whether each candidate fragment contained a valid dataset mention. These two baselines were included to represent a pure scientific-transformer extraction setting and a retrieval-enhanced transformer setting, respectively.

The BM25 + SciBERT classifier baseline was included as a stronger retrieval-enhanced transformer baseline. It uses the same BM25 candidate retrieval stage as the proposed framework, but replaces schema-constrained LLM extraction and normalization with a SciBERT-based mention classifier. This setting helps determine whether the performance improvement comes only from candidate retrieval or from the full integration of retrieval, schema-constrained LLM extraction, hybrid-similarity normalization, and graph-based linkage. All baselines were evaluated on the same revised cross-journal test benchmark using the same mention-level evaluation protocol.

For LLM backbone comparison, four representative instruction-tuned LLMs were selected: GPT-5.2, Qwen2.5-72B-Instruct, Qwen2.5-32B-Instruct, and Llama-3.1-8B-Instruct. The model selection followed three criteria. First, the comparison should include both proprietary and open-source models to reflect different deployment scenarios, accessibility, and data-governance requirements. Second, the selected models should cover different parameter scales, so that the influence of model capacity on structured extraction performance can be examined. Third, the models should have strong instruction-following and multilingual text-processing capabilities, which are important for extracting dataset information from geoscientific literature containing mixed English–Chinese terminology, dataset names, and spatial expressions. Based on these criteria, GPT-5.2 was included as a representative proprietary model, while the Qwen and Llama models were selected as representative open-source alternatives with different model scales and deployment costs.

In this experiment, the document preprocessing procedure, retrieval input, prompt templates, JSON schema, repeated-call aggregation strategy, post-processing rules, and normalization pipeline were kept identical, and only the extraction backbone was changed when comparing LLMs. All parameters were fixed according to the development-set calibration described in Sections 3.4, 3.5, and 3.6.

Table 7 shows that the proposed framework outperformed the transformer-based baselines in terms of overall F1-score. The SciBERT sequence labeling baseline was effective in identifying explicit dataset mentions, but its recall was limited for implicit, cross-sentence, and context-dependent dataset references. The BM25 + SciBERT classifier baseline improved performance by introducing candidate retrieval before classification, but it still lacked schema-constrained attribute extraction and normalization mechanisms for resolving aliases, heterogeneous spatial–temporal expressions, and provider information.

The LLM-only configuration achieved 88.63% precision, 85.41% recall, and 86.99% F1-score, indicating that LLM reasoning alone improves over transformer-based baselines but remains insufficient for robust literature–data linkage. In comparison, the proposed full framework achieved 93.79% precision, 90.66% recall, and 92.20% F1-score by combining candidate retrieval, schema-constrained LLM extraction, hybrid-similarity normalization, and graph-based linkage.

The LLM backbone comparison further shows that the proposed pipeline remains effective across different model families. GPT-5.2 achieved the best overall performance. Among the open-source models, Qwen2.5-72B-Instruct obtained the best results, suggesting that larger parameter scales can contribute positively to structured extraction performance. Llama-3.1-8B-Instruct remained competitive, but its lower recall indicates that it is relatively more conservative in identifying complex or implicit dataset mentions in geoscientific texts.

Therefore, these results demonstrate that the proposed retrieve → extract → normalize → link framework is not dependent on a single model family. Transformer-based baselines provide useful contextual representations, but the integration of schema-constrained LLM extraction, candidate retrieval, normalization, and graph-based linkage yields stronger balanced performance under the revised cross-journal benchmark.

4.4. Ablation Study

To further evaluate the contribution of each individual module beyond the multi-LLM comparison, an ablation study was conducted on the cross-journal test benchmark. The complete system consists of four major components:

(1): BM25-based candidate retrieval;
(2): Regex-based filtering;
(3): Schema-constrained LLM extraction;
(4): Post-processing and normalization.

In the ablation experiments, each component was selectively removed while keeping all other settings identical. Performance was evaluated on the test set using overall precision, recall, and F1-score as the main metrics. The purpose of this analysis is to determine how each module contributes to the effectiveness of the overall retrieve → extract → normalize → link framework.

Table 8 presents the absolute performance values for each configuration. Removing BM25-based candidate retrieval causes the largest decline in recall, reducing it from 90.66% to 84.72%, and leads to a substantial drop in F1-score from 92.20% to 87.21%. This indicates that BM25 retrieval plays a crucial role in identifying high-relevance fragments and preserving candidate coverage. Without this retrieval stage, the system becomes more likely to miss valid dataset mentions embedded in long and heterogeneous geoscientific texts.

Removing regex-based filtering mainly affects precision. In this setting, precision decreases from 93.79% to 90.18%, while recall remains relatively stable at 89.47%. This suggests that regex rules primarily function as a noise-control mechanism, helping the system suppress irrelevant or weakly related textual patterns before they are passed to the extraction stage. Although regex filtering is not the dominant contributor to recall, it provides important support for reducing contextual noise and improving the reliability of candidate selection.

Removing whitelist-assisted scoring also reduces the overall performance, with precision, recall, and F1-score decreasing to 91.36%, 87.84%, and 89.57%, respectively. This indicates that the whitelist signal mainly contributes to domain-specific candidate coverage by increasing the ranking of fragments containing geoscientific dataset terms, product names, and aliases. In contrast to regex filtering, which primarily suppresses noisy fragments through explicit textual patterns such as year ranges, DOI strings, and resolution expressions, whitelist-assisted scoring improves the likelihood that domain-relevant but less explicit dataset mentions are retained for LLM-based extraction.

When both regex filtering and whitelist-assisted scoring are removed, the F1-score further decreases to 87.83%. This result confirms that the two retrieval-stage heuristic signals provide complementary evidence: regex filtering mainly improves precision by suppressing noisy fragments, whereas whitelist-assisted scoring mainly improves recall by enhancing the coverage of domain-relevant dataset mentions. The combined removal produces a larger performance degradation than removing either component alone, demonstrating that these two signals jointly improve candidate quality before structured extraction.
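The comparison between single and combined removal can be checked with simple arithmetic on the reported values. In the sketch below, the F1-score for the no-regex setting is not quoted in the text and is therefore derived from its reported precision and recall, so it should be read as an approximation. The variable names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs in percent)."""
    return 2 * precision * recall / (precision + recall)


FULL_F1 = 92.20                      # full framework (reported)
f1_no_regex = f1(90.18, 89.47)       # derived from reported P and R, not quoted directly
f1_no_whitelist = 89.57              # reported
f1_no_both = 87.83                   # reported

drop_regex = FULL_F1 - f1_no_regex
drop_whitelist = FULL_F1 - f1_no_whitelist
drop_both = FULL_F1 - f1_no_both

print(f"F1 drop without regex filtering:   {drop_regex:.2f}")
print(f"F1 drop without whitelist scoring: {drop_whitelist:.2f}")
print(f"F1 drop without both signals:      {drop_both:.2f}")
```

Under this derived value, the combined drop exceeds either single-component drop but is smaller than their sum, which suggests that the two signals partly overlap in the fragments they affect while each still contributes its own evidence.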

The removal of post-processing and normalization also leads to a clear degradation in overall performance, with F1-score decreasing from 92.20% to 89.80%. In particular, recall falls to 87.95%, indicating that normalization is essential for resolving alias variation, harmonizing heterogeneous attribute expressions, and preventing duplicated or semantically fragmented dataset records. Without this step, inconsistencies in names, units, temporal expressions, and institutional descriptions propagate into the final outputs and reduce the effectiveness of dataset-level matching.

The LLM-only configuration yields the weakest overall performance, with 88.63% precision, 85.41% recall, and 86.99% F1-score. In this setting, text fragments were passed directly to the schema-constrained extraction stage without BM25-based retrieval, regex-based filtering, whitelist-assisted scoring, or post-processing normalization. As shown in Figure 6, this result confirms that the language model alone is insufficient for robust dataset extraction from geoscientific literature. High-quality performance depends on the coordinated interaction between retrieval, filtering, semantic extraction, and normalization rather than on the LLM in isolation.

Overall, the ablation results demonstrate that BM25-based candidate retrieval and post-processing normalization are the two most influential components for balanced extraction performance, while regex filtering and whitelist-assisted scoring provide complementary support in the retrieval stage. Regex filtering mainly improves precision, whereas whitelist-assisted scoring mainly improves recall. These findings further validate the modular design of the proposed framework and show that its effectiveness arises from the interaction of multiple complementary components rather than from any single stage alone.

4.5. Knowledge Graph Construction and Usability Evaluation

To evaluate the practical usability of the proposed framework, all extracted and normalized dataset records were imported into a Neo4j-based geoscientific knowledge graph (KG). In contrast to a purely visual demonstration, this subsection further evaluates whether the constructed KG can support accurate and efficient graph-query-based access to literature–data linkage results. The evaluation focuses on representative downstream tasks, including dataset reuse analysis, provenance tracing, regional dataset discovery, and attribute-level inspection. The KG integrates structured information about articles, datasets, institutions, authors, regions, and dataset attributes, enabling macro-level analysis of dataset reuse patterns and micro-level inspection of individual dataset metadata. The overall structure of the constructed graph is shown in Figure 7.

Each Article node corresponds to a single publication and connects to one or more Dataset nodes through USE, indicating dataset citation or usage. Each Dataset node further links to Institution (via PROVIDED_BY) and Region (via LOCATED_IN) nodes, as well as to its standardized attribute nodes—Time, Location, Resolution, Authors, Institution, DOI/URL, Role, and Summary—via HAS_ATTRIBUTE. This graph structure provides an operational representation of the literature–data linkage results generated by the proposed pipeline. At the node level, a detailed dataset entry is shown in Figure 8.

This hierarchical schema captures not only the citation network but also the semantic relationships among data, sources, institutions, and spatial domains. The Neo4j environment provides interactive visualization and Cypher-based querying, allowing users to retrieve relation paths among articles, datasets, institutions, regions, and attributes rather than only isolated metadata records. In this study, graph reasoning refers to relation-path reasoning over explicitly constructed KG relations, such as Article–Dataset, Dataset–Institution, Dataset–Region, and Dataset–Attribute paths. Representative query tasks include:

(1): Which datasets are most frequently used in Tibetan Plateau studies between 2000 and 2020?
(2): Which institutions provide the datasets most frequently used in ecosystem or climate-related studies?
(3): Which articles use datasets associated with a specific region and temporal coverage?
(4): Which datasets have a specified spatial resolution, temporal coverage, or provider attribute?
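As an illustration, query task (1) could be expressed as a parameterized Cypher statement along the following lines. The node labels and relation types (Article, Dataset, Region, USE, LOCATED_IN) follow the schema described above, whereas the property names (name, year), the function name, and the reading of "between 2000 and 2020" as the article publication year are assumptions made for this sketch.

```python
def build_dataset_reuse_query(region: str, start_year: int, end_year: int, limit: int = 10):
    """Build a Cypher query and parameters for ranking datasets by reuse in a region.

    Property names `name` and `year` are hypothetical; adapt them to the actual graph.
    """
    query = (
        "MATCH (a:Article)-[:USE]->(d:Dataset)-[:LOCATED_IN]->(r:Region)\n"
        "WHERE r.name = $region\n"
        "  AND a.year >= $start_year AND a.year <= $end_year\n"
        "RETURN d.name AS dataset, count(DISTINCT a) AS n_articles\n"
        "ORDER BY n_articles DESC\n"
        "LIMIT $limit"
    )
    params = {
        "region": region,
        "start_year": start_year,
        "end_year": end_year,
        "limit": limit,
    }
    return query, params


query, params = build_dataset_reuse_query("Tibetan Plateau", 2000, 2020)
print(query)
print(params)
```

With the official Neo4j Python driver, the returned pair could be passed to `session.run(query, params)`. Keeping the values as parameters rather than formatting them into the query string avoids injection problems and lets the database reuse the query plan.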

For quantitative evaluation, we designed 40 representative Cypher query tasks covering four categories: dataset reuse analysis, provenance tracing, regional dataset discovery, and attribute-level inspection. For each query task, the returned graph entities and relation paths were manually checked against the normalized extraction records and the original annotated evidence. Three indicators were used for evaluation: query precision, manual verification accuracy, and average response time. Query precision measures the proportion of returned entities or paths that satisfy the query condition. Manual verification accuracy measures whether the returned Article–Dataset–Institution–Region or Dataset–Attribute links are consistent with the annotated evidence. Average response time measures the execution time of Cypher queries in Neo4j. Table 9 summarizes the quantitative usability evaluation results.

The results show that the constructed KG supports accurate and efficient structured retrieval over the literature–data linkage results. The overall query precision reaches 94.65%, and the average manual verification accuracy reaches 94.88%, indicating that most returned graph entities and relation paths are consistent with the annotated evidence. The average response time is 19.2 ms, suggesting that the Neo4j-based representation can support interactive exploration of dataset reuse, provenance relations, regional data connections, and attribute-level metadata.

Beyond simple keyword-based retrieval, the graph structure improves downstream usability by enabling relation-path reasoning among articles, datasets, institutions, regions, and attributes. For example, a user can start from a target region, identify related datasets, retrieve articles using these datasets, and further trace the institutions that provide them. Compared with flat table storage, the KG allows users to retrieve not only dataset records but also their contextual relationships, such as which datasets are repeatedly used in a specific region, which institutions provide datasets for a given research topic, and which publications share similar data sources.

Therefore, the usability evaluation demonstrates that the constructed KG is not merely a visualization artifact, but a queryable and interpretable semantic infrastructure for downstream geoscientific data discovery. It supports dataset reuse analysis, provenance tracing, regional dataset discovery, and cross-document linkage, thereby strengthening the practical value of the proposed literature–data linkage framework.

5. Discussion

5.1. Advantages of the Proposed Framework

The major strength of the proposed pipeline lies in its hybrid architecture that combines retrieval, rule-based filtering, and large language model reasoning. This integration enables the system to balance precision, recall, and semantic flexibility.

Beyond pure dataset name extraction, the framework is designed to support structured linking between geoscience literature and geoscientific data resources. In contrast to traditional natural language processing (NLP) or ontology-based methods that rely heavily on fixed vocabularies and handcrafted rules, our approach dynamically interprets contextual cues—such as implicit dataset names, abbreviated references, and multi-sentence mentions—with minimal human intervention.

The use of prompt-engineered LLMs allows zero-shot adaptability to new data sources without task-specific re-training. This capability is particularly valuable in geoscience, where dataset nomenclature, citation styles, and reporting conventions evolve continuously.

Moreover, the incorporation of post-processing and normalization not only improves extraction accuracy but also ensures semantic consistency across heterogeneous literature sources. This normalization step is essential for transforming textual mentions into standardized dataset entities that can be reliably integrated into a Neo4j-based knowledge graph, thereby enabling scalable cross-article linkage and relational analysis.

The ablation results further indicate that the framework’s effectiveness does not arise from the LLM alone, but from the coordinated interaction among candidate retrieval, filtering, structured extraction, and normalization. In this sense, the practical strength of the framework lies in its modular design, which allows different components to contribute complementary functions to literature–data linkage.

5.2. Interpretation of Experimental Results

The quantitative results on the revised cross-journal test benchmark confirm that the proposed framework maintains strong extraction performance under a more heterogeneous evaluation setting, achieving 93.79% precision, 90.66% recall, and 92.20% F1-score. High precision indicates that the pipeline effectively suppresses spurious or weakly supported dataset predictions, while strong recall shows that it can recover a large proportion of valid dataset mentions, including implicit and variably expressed references.

The confusion matrix and ablation results further show that each component contributes synergistically to overall performance. In particular, removing BM25-based retrieval leads to the largest decline in recall, indicating that high-relevance candidate identification is essential for preserving coverage in long and heterogeneous geoscientific texts. Removing regex-based filtering mainly reduces precision, which suggests that this component functions primarily as a noise-control mechanism. The degradation observed after removing normalization confirms that alias resolution and harmonization of heterogeneous attribute expressions are essential for stable dataset-level matching.

The multi-LLM comparison also provides an additional perspective on result interpretation. The fact that the framework remains effective across GPT-5.2, Qwen2.5-72B-Instruct, Qwen2.5-32B-Instruct, and Llama-3.1-8B-Instruct indicates that the overall pipeline is portable across different LLM backbones. At the same time, the performance differences among these models show that extraction quality still depends in part on model capability, especially in handling implicit dataset references, long-context reasoning, and schema-constrained attribute completion.

From a linguistic perspective, the high accuracy across temporal and spatial fields suggests that domain-sensitive prompt design effectively guides the LLM to interpret numerical expressions, spatial descriptions, and contextual dataset references commonly found in geoscientific writing.

Remaining errors are concentrated in more weakly standardized fields, such as institution names, dataset provenance, and descriptive attributes, which often vary across journals and reporting styles. This observation suggests that information linkage quality depends not only on model capability but also on the consistency of metadata reporting practices in the scientific community.

5.3. Comparison with Previous Approaches

Previous studies on scientific data extraction primarily relied on pattern-based NLP pipelines or fine-tuned transformer models requiring extensive annotated corpora. Conventional rule-based systems perform well on explicit mentions but often fail when dataset names appear in paraphrased or partially described forms. Fine-tuned models, while powerful, demand domain-specific training data that are costly to construct and maintain.

In contrast, the proposed framework reframes the task from pure named-entity recognition to literature–data linkage. By leveraging the generalized reasoning capability of conversational LLMs, enhanced through structured prompts and redundancy-based querying, the system captures implicit semantic relationships rather than relying solely on surface patterns.

Compared with earlier extraction-focused workflows, the present framework places stronger emphasis on the joint treatment of candidate identification, schema-constrained extraction, normalization, and graph-based linking. This design is particularly relevant for geoscientific literature, where dataset references are often expressed indirectly through temporal coverage, spatial scope, institutional provenance, or derived-product descriptions.

At the same time, the revised experiments suggest that the framework should not be interpreted as universally outperforming all supervised or rule-based alternatives in every setting. Rather, its main contribution lies in providing a modular and practically portable workflow that remains effective across heterogeneous journals and across multiple LLM backbones while requiring no task-specific model retraining. Furthermore, the integration with Neo4j extends the contribution beyond extraction by enabling relational querying, dataset co-usage analysis, and provenance tracking—functionalities that are rarely addressed in earlier extraction-focused studies.

5.4. Limitations and Future Work

Despite its promising performance, several limitations remain.

First, the framework remains dependent on the behavior of the underlying LLM, which introduces potential risks related to model bias and hallucination. In the present structured extraction task, hallucination may occur in several forms: the model may generate dataset attributes that are not explicitly supported by the source text, infer missing temporal coverage or provider information from prior knowledge, confuse a general data portal with a concrete dataset product, or assign unsupported spatial, resolution, or role information to an extracted dataset record. Even under schema-constrained prompting and repeated-call aggregation, the model may still produce partially unsupported attribute values, omit weakly expressed dataset information, or favor dominant expression patterns seen in training corpora. Such effects may be amplified when the input text contains ambiguous terminology, incomplete provenance statements, implicit dataset references, or domain-specific shorthand that is not explicitly grounded in the retrieved context.

To mitigate these risks, the current framework incorporates several control mechanisms. During extraction, evidence-based prompting instructs the model to output only information supported by the supplied text, while unknown or unsupported fields are required to remain blank. During post-validation, strict JSON enforcement, temporal-format checking, DOI-pattern validation, and normalization-based consistency checking are used to detect structurally invalid or internally inconsistent outputs before graph ingestion. These controls reduce format errors and unsupported completion, but they cannot independently guarantee the factual correctness of every extracted attribute, especially when the source literature itself is incomplete or ambiguous. Future work should therefore incorporate evidence-span grounding, citation-level verification, calibrated uncertainty scoring, and human-in-the-loop validation for low-confidence records.
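A minimal sketch of the post-validation checks is given below. It is illustrative only: the DOI pattern follows the form commonly recommended for matching modern DOIs, the temporal format (a single year or a YYYY-YYYY range) is an assumed convention, and the field and function names are hypothetical.

```python
import re

# Common pattern for modern DOIs: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)
# Assumed temporal format: "YYYY" or "YYYY-YYYY".
YEAR_RANGE_PATTERN = re.compile(r"^(\d{4})(?:-(\d{4}))?$")


def is_valid_doi(value: str) -> bool:
    return bool(DOI_PATTERN.match(value))


def is_valid_temporal(value: str) -> bool:
    """Accept a single year or a range whose start does not exceed its end."""
    match = YEAR_RANGE_PATTERN.match(value)
    if not match:
        return False
    start = int(match.group(1))
    end = int(match.group(2) or match.group(1))
    return start <= end


def validation_errors(record: dict) -> list[str]:
    """Return the names of non-blank fields that fail their format check.

    Blank fields are allowed, since unsupported attributes must remain empty.
    """
    errors = []
    doi = record.get("doi", "")
    if doi and not is_valid_doi(doi):
        errors.append("doi")
    temporal = record.get("temporal_coverage", "")
    if temporal and not is_valid_temporal(temporal):
        errors.append("temporal_coverage")
    return errors


print(validation_errors({"doi": "10.3390/ijgi15060243", "temporal_coverage": "2000-2020"}))
print(validation_errors({"doi": "doi:unknown", "temporal_coverage": "2020-2000"}))
```

Checks of this kind catch structurally invalid values but cannot confirm that a well-formed value is actually supported by the source text, which is the gap that evidence-span grounding is intended to close.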

Second, although the multi-model comparison shows that the framework is portable across both proprietary and open-source LLMs, the extraction results still vary with model family and parameter scale. This indicates a degree of model dependency in the current workflow. In practical deployment, the choice of LLM may therefore affect both extraction accuracy and stability, especially in cases involving implicit dataset mentions, long-range dependencies, or complex attribute completion. Moreover, differences in model access conditions, inference cost, and update cycles may influence reproducibility in future deployments.

Third, while normalization improves metadata consistency, certain domain-specific variations—such as regional naming conventions, evolving dataset versions, multilingual dataset references, synonym conflicts, and ambiguous spatial entities—still require careful handling. The current framework resolves such cases through hybrid similarity, auxiliary-attribute consistency checking, hierarchical gazetteer-based spatial disambiguation, and low-confidence flagging. However, ambiguous cases that lack sufficient contextual evidence may still require manual review, and continuous updates to the alias lexicon, institution registry, and geographic vocabulary remain necessary. Incorporating controlled vocabularies and domain ontologies, such as GCMD or INSPIRE themes, would further strengthen semantic alignment and interoperability.
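The combination of hybrid similarity and low-confidence flagging can be illustrated with a small sketch. The scoring below is an assumption made for illustration, not the formulation used in the framework: it averages a character-level similarity from Python's `difflib` with a token-level Jaccard overlap, and the weight, threshold, function names, and example dataset names are all hypothetical.

```python
from difflib import SequenceMatcher


def _normalize(name: str) -> str:
    """Lowercase and collapse whitespace before comparison."""
    return " ".join(name.lower().split())


def hybrid_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Weighted mix of character-level similarity and token Jaccard overlap."""
    a, b = _normalize(a), _normalize(b)
    char_sim = SequenceMatcher(None, a, b).ratio()
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    token_sim = len(tokens_a & tokens_b) / len(union) if union else 0.0
    return alpha * char_sim + (1 - alpha) * token_sim


def link_to_canonical(mention: str, canonical_names: list[str], threshold: float = 0.8):
    """Return (best canonical name, score, low_confidence flag) for a mention."""
    best = max(canonical_names, key=lambda name: hybrid_similarity(mention, name))
    score = hybrid_similarity(mention, best)
    return best, score, score < threshold


# Illustrative alias lexicon and mention.
best, score, low_confidence = link_to_canonical(
    "MODIS NDVI product", ["ERA5", "MODIS NDVI"]
)
print(best, round(score, 3), "needs review" if low_confidence else "accepted")
```

In this sketch a mention that matches a canonical name only partially is still linked to its closest candidate but is flagged for manual review, which corresponds to the handling of ambiguous cases described above.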

Fourth, the benchmark, although substantially expanded in the revised manuscript, is still limited to a specific set of geoscientific journals and a bounded set of annotation guidelines. As a result, domain dependency remains a practical limitation: the current framework is likely to perform best in literature that is relatively close to the dataset types, writing conventions, and metadata patterns represented in the present corpus. Extending the benchmark to broader geoscientific subdomains, additional repositories, and richer multilingual settings will be necessary to test robustness more comprehensively.

Fifth, this study provides an initial quantitative evaluation of KG usability through representative Cypher query tasks involving dataset reuse analysis, provenance tracing, regional dataset discovery, and attribute-level inspection. Although the reported query precision, manual verification accuracy, and response time provide evidence that the constructed KG supports structured access to literature–data relationships, this evaluation remains limited to predefined relation-path query scenarios. More advanced downstream capabilities, such as complex graph reasoning, entity-alignment evaluation, large-scale repository deployment, and user-centered retrieval effectiveness, have not yet been systematically examined. Future work should therefore extend the evaluation from query usability to broader graph-based analytical and reasoning tasks.

Future research could also incorporate graph neural networks (GNNs) for pattern discovery within the constructed knowledge graph, thereby transforming static metadata into predictive insights, such as dataset co-usage evolution and spatial–temporal diffusion patterns. Such extensions should be evaluated carefully to distinguish improvements in graph-based downstream analysis from improvements in upstream extraction quality.

5.5. Implications for Geoscientific Data Infrastructure

The success of this approach highlights the potential of AI-assisted literature–data linkage in Earth and environmental sciences. Automating dataset identification and relational integration can accelerate metadata standardization, data reuse assessment, and interdisciplinary collaboration.

By bridging unstructured scholarly text with structured graph-based representations, the proposed framework contributes to the development of scalable, interoperable, and AI-ready geoscientific data ecosystems. Such infrastructures support reproducibility, transparency, and evidence-based knowledge synthesis across domains.

More broadly, the revised cross-journal evaluation and multi-LLM comparison suggest that literature–data linkage can be implemented as a reusable infrastructure component rather than as a journal-specific or model-specific prototype. In this sense, the framework may serve as a practical intermediate layer between scientific publications, dataset repositories, and knowledge-driven geoscientific applications.

6. Conclusions

This study presents an integrated framework for linking the geoscience literature and geoscientific datasets through large language model-driven extraction and knowledge graph construction. By combining BM25-based retrieval, regex-enhanced filtering, prompt-engineered LLM extraction, and systematic post-processing, the proposed system achieves strong extraction performance on a revised cross-journal benchmark of the geoscientific literature.

The results demonstrate that conversational LLMs, when combined with domain-aware heuristics and normalization strategies, can effectively bridge the gap between unstructured scientific text and structured geoscientific data entities. Rather than relying on task-specific model retraining, the proposed pipeline provides a modular workflow that supports dataset mention detection, structured attribute extraction, normalization, and graph-based linkage within a unified framework.

Whereas earlier NLP systems often struggled with implicit dataset references or inconsistent naming conventions, our framework leverages redundancy-based prompt reasoning and normalization to capture nuanced dataset relationships with 93.79% precision, 90.66% recall, and 92.20% F1-score on the revised cross-journal test benchmark. Importantly, the constructed Neo4j-based knowledge graph transforms isolated extraction results into an integrated literature–data linkage network. Its query-based usability was further examined through representative Cypher query tasks, providing initial quantitative evidence for supporting dataset reuse analysis, provenance tracing, regional dataset discovery, and attribute-level inspection.

From a methodological perspective, this work contributes to the growing field of LLM-driven scientific information extraction by proposing a generalizable workflow that emphasizes retrieval–generation synergy and structured integration. The experiments further show that the framework remains effective across multiple LLM backbones, including both proprietary and open-source models, while prompting-strategy comparison, repeated-call efficiency analysis, transformer-based baseline comparison, and ablation experiments clarify the contributions and practical trade-offs of the principal components. Rather than treating extraction as an end goal, the framework positions it as a foundational step toward building intelligent geoscientific knowledge infrastructures.

Looking ahead, several directions remain promising. Expanding the framework to multilingual and cross-domain corpora will test its robustness at a global scale. Integrating standardized ontologies and controlled vocabularies will enhance semantic interoperability. Coupling the knowledge graph with graph analytics or GNN-based models may enable the discovery of latent patterns, such as dataset evolution and cross-regional reuse trends. At the same time, future work should further address remaining challenges related to LLM bias, hallucination risk, model dependency, and domain dependency, especially in broader geoscientific and multilingual settings. Additional work is also required to evaluate more complex graph-reasoning scenarios, larger-scale repository deployment, and user-oriented retrieval effectiveness. Embedding this approach into open repositories could further automate metadata validation and data lineage tracking.

In summary, this research demonstrates that LLM-based automated extraction combined with structured knowledge graph modeling provides a scalable pathway for connecting the scholarly literature and geoscientific data. By unifying textual evidence and structured metadata within a coherent graph framework, the proposed approach advances the development of reproducible, interoperable, and data-centric geoscientific research ecosystems.

Author Contributions

Conceptualization, Xinyu Chen, Yin Ma, Kai Wu, Xing Pang, Guoqing Li, Ruikai Ma, Linhan Yang, Chuang Peng, Jiayu Zhi and Jiabin Yuan; Data curation, Xinyu Chen; Formal analysis, Xinyu Chen; Funding acquisition, Xing Pang; Investigation, Xinyu Chen; Methodology, Xinyu Chen, Yin Ma, Kai Wu, Xing Pang, Guoqing Li, Ruikai Ma, Linhan Yang, Chuang Peng, Jiayu Zhi and Jiabin Yuan; Project administration, Xinyu Chen; Resources, Xinyu Chen; Software, Xinyu Chen; Supervision, Xinyu Chen; Validation, Xinyu Chen; Visualization, Xinyu Chen; Writing—original draft, Xinyu Chen; Writing—review & editing, Xinyu Chen. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [Grant No. 42201505]; the National Key Research and Development Program of China [Grant No. 2024YFB3908404]; and the Computer Network and Information Special Project of the Chinese Academy of Sciences [Grant No. 2025000010]. The authors are very grateful to the anonymous reviewers and editors, whose comments greatly improved the quality of the paper.

Data Availability Statement

The original data presented in the study are openly available at https://github.com/chenxinyu-lab/Geoscience-Data-Extraction (accessed on 1 May 2026).

Acknowledgments

We thank the anonymous reviewers and all of the editors who participated in the revision process.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Inter-Annotator Agreement Assessment

To further assess the reliability of the benchmark annotations, we conducted an inter-annotator agreement (IAA) analysis on a randomly selected reliability-assessment subset of the test benchmark. This analysis was performed before adjudication, so that the agreement scores reflect the consistency of the independent annotations rather than the final reconciled labels.

Appendix A.1. Reliability-Assessment Subset

A subset of the revised test benchmark was randomly sampled for annotation-reliability assessment. The sampled documents followed the same annotation protocol as the full test benchmark and were independently annotated by at least two annotators before expert adjudication. The annotators identified dataset mentions and assigned the corresponding structured attributes, including dataset name, temporal coverage, spatial scope, dataset authors, providing institution, resolution, DOI/URL, and dataset role (source or output).

All annotators had received the same annotation guideline and had participated in the pilot calibration round before the formal annotation stage. The reliability-assessment subset was used only to quantify annotation consistency and was not treated as a separate test set. After the agreement analysis was completed, disagreements in the sampled subset were resolved through the same adjudication procedure used for the full test benchmark.

Appendix A.2. Mention-Level Agreement

At the mention level, annotation agreement was measured using the average pairwise F1-score, which is appropriate for span-based extraction tasks in which annotators may differ not only in category assignment but also in the exact boundary of a mention. For each pair of annotators, one annotation set was treated as the reference and the other as the prediction, and mention-level precision, recall, and F1-score were computed based on exact or accepted span matching under the benchmark guideline. The reported mention-level agreement is the average of the pairwise F1-scores across annotator pairs.
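The pairwise procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: spans are represented as hypothetical `(doc_id, start, end)` tuples, only exact span matching is implemented (the "accepted" matches allowed by the benchmark guideline are not modeled), and the three annotation sets are invented for the example.

```python
from itertools import combinations


def span_f1(reference, prediction):
    """Exact-match F1 between two sets of (doc_id, start, end) spans."""
    ref, pred = set(reference), set(prediction)
    if not ref and not pred:
        return 1.0  # two empty annotation sets agree trivially
    tp = len(ref & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)


def average_pairwise_f1(annotations):
    """Mean F1 over all unordered annotator pairs (F1 is symmetric in its arguments)."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(span_f1(a, b) for a, b in pairs) / len(pairs)


# Hypothetical annotations from three annotators.
annotations = {
    "ann1": {("d1", 10, 25), ("d1", 40, 52), ("d2", 5, 18)},
    "ann2": {("d1", 10, 25), ("d2", 5, 18)},
    "ann3": {("d1", 10, 25), ("d1", 40, 52), ("d2", 5, 20)},
}
print(round(average_pairwise_f1(annotations), 3))
```

Because swapping reference and prediction only exchanges precision and recall, the F1-score of a pair does not depend on which annotator is treated as the reference.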

This metric was selected because dataset mention identification in the geoscientific literature is fundamentally a span-detection task rather than a simple categorical classification task. Compared with chance-corrected categorical coefficients alone, pairwise F1 provides a more direct measure of consistency in identifying dataset mentions and their textual boundaries.

Appendix A.3. Attribute-Level Agreement

Attribute-level agreement was evaluated separately for categorical and structured textual fields.

For the categorical field data_role (source/output), agreement was measured using Cohen’s kappa, which is appropriate for chance-corrected agreement on discrete labels. Because the reliability-assessment setting was based on pairwise independent annotations, Cohen’s kappa was used for this categorical attribute.
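For reference, Cohen's kappa compares the observed agreement with the agreement expected by chance from each annotator's label distribution. The sketch below computes it for two annotators; the label sequences are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical data_role labels for ten dataset mentions.
ann_a = ["source"] * 6 + ["output"] * 4
ann_b = ["source"] * 5 + ["output"] * 4 + ["source"]
print(round(cohens_kappa(ann_a, ann_b), 3))
```

In this toy case the raw agreement is 0.80, but the chance-corrected value is lower because both annotators assign `source` more often than `output`.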

For the remaining structured attributes, including dataset name, temporal coverage, spatial scope, providing institution, resolution, and DOI/URL, agreement was measured using normalized agreement after canonicalization. In this procedure, the independently annotated values were first converted to their canonical forms using the same normalization rules applied in the main pipeline. Agreement was then computed by comparing the normalized attribute values across annotators. This design was adopted because many attribute values in geoscientific literature can be expressed in multiple surface forms while still referring to the same underlying concept. For example, differences such as abbreviation versus full name, alternative place naming, or equivalent temporal expressions should not be counted as substantive annotation disagreement after normalization.

The reported overall structured-attribute agreement is the average normalized agreement across these fields.
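A minimal sketch of this computation is given below. The alias table and the canonicalization steps are assumptions made for illustration; they stand in for the normalization rules of the main pipeline, which are not reproduced here, and the annotated values are invented.

```python
import re

# Hypothetical alias table; not the pipeline's actual normalization rules.
ALIASES = {
    "srtm": "shuttle radar topography mission",
    "tibetan plateau": "qinghai-tibet plateau",
}


def canonicalize(value):
    """Lowercase, collapse whitespace, unify dashes, then resolve known aliases."""
    text = re.sub(r"\s+", " ", value.strip().lower())
    text = text.replace("\u2013", "-")  # en dash to hyphen
    return ALIASES.get(text, text)


def normalized_agreement(values_a, values_b):
    """Fraction of items whose canonical forms match across two annotators."""
    if len(values_a) != len(values_b) or not values_a:
        raise ValueError("Both annotators must provide the same non-empty item list.")
    matches = sum(canonicalize(a) == canonicalize(b) for a, b in zip(values_a, values_b))
    return matches / len(values_a)


def overall_agreement(fields_a, fields_b):
    """Unweighted mean of the per-field normalized agreement."""
    scores = [normalized_agreement(fields_a[f], fields_b[f]) for f in fields_a]
    return sum(scores) / len(scores)


# Hypothetical annotations for two mentions and three of the structured fields.
fields_a = {
    "name": ["SRTM", "WorldClim 2"],
    "time": ["2000\u20132020", "1990-2018"],
    "location": ["Tibetan Plateau", "China"],
}
fields_b = {
    "name": ["Shuttle Radar Topography Mission", "WorldClim"],
    "time": ["2000-2020", "1990-2018"],
    "location": ["Qinghai-Tibet Plateau", "China"],
}
print(round(overall_agreement(fields_a, fields_b), 3))
```

Here the abbreviation, the dash variant, and the alternative place name all count as agreement after canonicalization, whereas "WorldClim 2" versus "WorldClim" remains a substantive disagreement.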

Appendix A.4. Agreement Results

Table 4 reports the final inter-annotator agreement results on the reliability-assessment subset.

These agreement scores indicate good consistency in the benchmark annotations. In particular, the mention-level pairwise F1-score shows that annotators were broadly consistent in identifying dataset mentions, while the attribute-level scores indicate that the structured annotation of roles and normalized metadata fields is sufficiently reliable for benchmark-based extraction evaluation.

Appendix A.5. Role of Adjudication

After the independent annotations and agreement assessment were completed, disagreements were resolved through joint adjudication by one researcher and one associate researcher in geoscience. The adjudicated results were then used as the final ground-truth labels for the test benchmark. This procedure ensured that the benchmark combines both measurable annotation consistency and expert-reviewed label quality.

References

1. Sun, K.; Zhu, Y.; Pan, P.; Hou, Z.; Wang, D.; Li, W.; Song, J. Geospatial data ontology: The semantic foundation of geospatial data integration and sharing. Big Earth Data 2019, 3, 269–296.
2. Kostoff, R.N. Role of Technical Literature in Science and Technology Development and Exploitation. J. Inf. Sci. 2003, 29, 223–228.
3. Marsicek, J.; Goring, S.J.; Marcott, S.A.; Meyers, S.R.; Peters, S.; Ross, I.A.; Singer, B.; Williams, J. Automated Extraction of Spatiotemporal Geoscientific Data from the Literature Using GeoDeepDive. Past Glob. Changes Mag. 2018, 26, 70.
4. Feldhoff, K.; Wiemer, H.; Träger, P.; Kühne, R.; Zimmermann, M.; Ihlenfeldt, S. Automatic Information Extraction from Scientific Publications Based on the Use Case of Additive Manufacturing. Appl. Sci. 2025, 15, 9331.
5. Arias, A.; Dini, I.; Casini, M.; Fiordelisi, A.; Perticone, I.; Pisano, A. Geoscientific Feature Update of the Larderello-Travale Geothermal System (Italy) for a Regional Numerical Modeling. In Proceedings of the World Geothermal Congress 2010, Bali, Indonesia, 25–30 April 2010.
6. Winata, G.I.; Madotto, A.; Lin, Z.; Liu, R.; Yosinski, J.; Fung, P. Language Models are Few-shot Multilingual Learners. In Proceedings of the 1st Workshop on Multilingual Representation Learning, Punta Cana, Dominican Republic, 10 November 2021; pp. 1–15.
7. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2019, 21, 140:1–140:67.
8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 6–12 December 2020; pp. 9459–9474.
9. Polak, M.P.; Morgan, D. Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering. Nat. Commun. 2024, 15, 1569.
10. Kusano, G.; Akimoto, K.; Takeoka, K. Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation. In Proceedings of the Nineteenth ACM Conference on Recommender Systems, New York, NY, USA, 2–4 September 2025; pp. 832–841.
11. Roberts, J.; Green, F. The Future of Prompt Engineering: Trends, Challenges, and Opportunities. In Proceedings of the IEEE International Symposium on Artificial Intelligence and Human Interaction, Shenzhen, China, 14–16 June 2024; pp. 75–88.
12. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
13. Deng, C.; Zhang, T.; He, Z.; Chen, Q.; Shi, Y.; Xu, Y.; Fu, L.; Zhang, W.; Wang, X.; Zhou, C.; et al. K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. In Proceedings of the 2024 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 1–12.
14. Heddes, J.; Meerdink, P.; Pieters, M.; Marx, M. The Automatic Detection of Dataset Names in Scientific Articles. Data 2021, 6, 84.
15. Pan, H.; Zhang, Q.; Dragut, E.; Caragea, C.; Latecki, L.J. DMDD: A Large-Scale Dataset for Dataset Mentions Detection. Trans. Assoc. Comput. Linguist. 2023, 11, 1132–1146.
16. Zhou, B.; Li, K. Fusing Geoscience Large Language Models and Lightweight RAG for Enhanced Geological Question Answering. Geosciences 2025, 15, 382.
17. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 195.
18. Ma, X. Knowledge Graph Construction and Application in Geosciences: A Review. Comput. Geosci. 2022, 161, 105082.
19. Cao, Q.; Wang, S.; Chen, Z.; Li, G.; Li, J. The Method of Extracting Names of Geo-science Data based on Regular Expressions. J. Geo-Inf. Sci. 2023, 25, 1601–1610.
20. Fries, J.A.; Varma, P.; Chen, V.S.; Xiao, K.; Tejeda, H.; Saha, P.; Dunnmon, J.A.; Chubb, H.; Maskatia, S.A.; Fiterau, M.; et al. Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences. Nat. Commun. 2019, 10, 3111.
21. Cui, B.-G.; Chen, X. An Improved Hidden Markov Model for Literature Metadata Extraction. In Proceedings of the 6th International Conference on Advanced Intelligent Computing Theories and Applications: Intelligent Computing, Changsha, China, 18 August 2010; pp. 205–212.
22. Nasar, Z.; Jaffry, S.W.; Malik, M.K. Information extraction from scientific articles: A survey. Scientometrics 2018, 117, 1931–1990.
23. D’Souza, J.; Hoppe, A.; Brack, A.; Jaradeh, M.Y.; Auer, S.; Ewerth, R. The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 2192–2203.
24. Hou, Y.; Jochim, C.; Gleize, M.; Bonin, F.; Ganguly, D. TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 707–714.
25. Chengbin, W.; Ma, X.; Chen, J.; Chen, J. Information Extraction and Knowledge Graph Construction from Geoscience Literature. Comput. Geosci. 2018, 112, 112–120.
26. Qiu, Q.; Tian, M.; Tao, L.; Xie, Z.; Ma, K. Semantic Information Extraction and Search of Mineral Exploration Data Using Text Mining and Deep Learning Methods. Ore Geol. Rev. 2024, 165, 105863.
27. Zhang, Q.; Chen, Z.; Pan, H.; Caragea, C.; Latecki, L.J.; Dragut, E. SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 3–7 November 2024; pp. 13083–13100.
28. Duan, D.; Peng, J.; Zhang, Y.; Zhang, C. SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 24–28 November 2025; pp. 14473–14486.
29. Gerasimov, I.; KC, B.; Mehrabian, A.; Acker, J.; McGuire, M.P. Comparison of Datasets Citation Coverage in Google Scholar, Web of Science, Scopus, Crossref, and DataCite. Scientometrics 2024, 129, 3681–3704.
30. Vrouwenvelder, K.; Raia, N.H.; Thomer, A.K. Obstacles to Dataset Citation Using Bibliographic Management Software. Data Sci. J. 2025, 24, 017.
31. Lafia, S.; Fan, L.; Hemphill, L. A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature. In Proceedings of the Association for Information Science and Technology, Pittsburgh, PA, USA, 9 October–1 November 2022; Volume 59.
32. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3615–3620.
33. Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured Information Extraction from Scientific Text with Large Language Models. Nat. Commun. 2024, 15, 1418.
34. Du, J.; Wang, D.; Lin, B.; He, L.; Huang, L.-C.; Wang, J.; Manion, F.J.; Li, Y.; Cossrow, N.; Yao, L. Use of Deep Learning-Based NLP Models for Full-Text Data Elements Extraction for Systematic Literature Review Tasks. Sci. Rep. 2025, 15, 19379.
35. Kamran, S.; Hosseini, S.; Esmailzadeh, S.; Kangavari, M.R.; Hua, W. Cognition2Vocation: Meta-Learning via ConvNets and Continuous Transformers. Neural Comput. Appl. 2024, 36, 12935–12950.
36. Saaki, M.; Hosseini, S.; Rahmani, S.; Kangavari, M.R.; Hua, W.; Zhou, X. Value-Wise ConvNet for Transformer Models: An Infinite Time-Aware Recommender System. IEEE Trans. Knowl. Data Eng. 2023, 35, 9932–9945.
37. Najafipour, S.; Hosseini, S.; Hua, W.; Kangavari, M.R.; Zhou, X. SoulMate: Short-Text Author Linking Through Multi-Aspect Temporal-Textual Embedding. IEEE Trans. Knowl. Data Eng. 2022, 34, 448–461.
38. Liu, Y.; Hua, W.; Xin, K.; Hosseini, S.; Zhou, X. TEA: Time-Aware Entity Alignment in Knowledge Graphs. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2591–2599.
39. Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Online, 6–14 December 2021.
40. Wang, B.; Wu, L.; Xie, Z.; Qiu, Q.; Zhou, Y.; Ma, K.; Tao, L. Understanding Geological Reports Based on Knowledge Graphs Using a Deep Learning Approach. Comput. Geosci. 2022, 168, 105229.
41. Chen, Q.; Zhou, W.; Cheng, J.; Yang, J. An Enhanced Retrieval Scheme for a Large Language Model with a Joint Strategy of Probabilistic Relevance and Semantic Association in the Vertical Domain. Appl. Sci. 2024, 14, 11529.
42. Niu, S.; Yang, K.; Zhao, R.; Liu, Y.; Li, Z.; Wang, H.; Chen, W. Tree-KG: An Expandable Knowledge Graph Construction Framework for Knowledge-intensive Domains. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, 12–17 August 2025; pp. 18516–18529.
43. Goyal, N.; Singh, N. Named Entity Recognition and Relationship Extraction for Biomedical Text: A Comprehensive Survey, Recent Advancements, and Future Research Directions. Neurocomputing 2025, 618, 129171.
44. Fick, S.; Hijmans, R. WorldClim 2: New 1-km Spatial Resolution Climate Surfaces for Global Land Areas. Int. J. Climatol. 2017, 37, 4302–4315.
45. Friedl, M.A.; Sulla-Menashe, D.; Tan, B.; Schneider, A.; Ramankutty, N.; Sibley, A.; Huang, X. MODIS Collection 5 Global Land Cover: Algorithm Refinements and Characterization of New Datasets. Remote Sens. Environ. 2010, 114, 168–182.
46. Gong, P.; Wang, J.; Yu, L.; Zhao, Y.; Zhao, Y.; Liang, L.; Niu, Z.; Huang, X.; Fu, H.; Liu, S.; et al. Finer Resolution Observation and Monitoring of Global Land Cover: First Mapping Results with Landsat TM and ETM+ Data. Int. J. Remote Sens. 2013, 34, 2607–2654.
47. Chen, J.; Chen, J.; Liao, A.; Cao, X.; Chen, L.; Chen, X.; He, C.; Han, G.; Peng, S.; Lu, M.; et al. Global Land Cover Mapping at 30m Resolution: A POK-Based Operational Approach. ISPRS J. Photogramm. Remote Sens. 2015, 103, 7–27.
48. Liu, J.; Kuang, W.; Zhang, Z.; Xu, X.; Qin, Y.; Ning, J.; Zhou, W.; Zhang, S.; Li, R.; Yan, C.; et al. Spatiotemporal Characteristics, Patterns, and Causes of Land-Use Changes in China Since the Late 1980s. J. Geogr. Sci. 2014, 24, 195–210.
49. van Zyl, J.J. The Shuttle Radar Topography Mission (SRTM): A Breakthrough in Remote Sensing of Topography. Acta Astronaut. 2001, 48, 559–565.
50. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the Radiometric and Biophysical Performance of the MODIS Vegetation Indices. Remote Sens. Environ. 2002, 83, 195–213.
51. Pan, H.; Zhang, Q.; Caragea, C.; Dragut, E.; Latecki, L.J. SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 14407–14417.
52. Barlaug, N.; Gulla, J.A. Neural Networks for Entity Matching: A Survey. ACM Trans. Knowl. Discov. Data 2021, 15, 52.
53. Ateia, S.; Kruschwitz, U.; Scholz, M.; Koschmider, A.; Almohaishi, M. LLM-Based Information Extraction to Support Scientific Literature Research and Publication Workflows. In New Trends in Theory and Practice of Digital Libraries; Balke, W.-T., Golub, K., Manolopoulos, Y., Aizawa, A., Mayr, P., Tzanoudaki, M., Eds.; Springer: Cham, Switzerland, 2025; Volume 2694, pp. 14407–14417.

Figure 1. Overall architecture of the proposed framework.

Figure 2. Impact of retrieval parameters on downstream extraction F1-score. (a) Retrieval filtering threshold sensitivity, with the optimum at $\theta_{r} = 0.35$. (b) Retrieval depth sensitivity, with the optimum at $K_{r} = 8$.

Figure 3. Threshold calibration curve for similarity-based normalization.

Figure 4. Schema of standardized attributes after hybrid-similarity normalization.

Figure 5. Confusion matrix comparing extraction results with ground-truth annotations.

Figure 6. Ablation results on the revised test benchmark across different system configurations. Precision, recall, and F1-score are reported for each configuration.

Figure 7. Example of the constructed knowledge graph showing relationships among articles, datasets, institutions, regions, and dataset attributes.

Figure 8. Example of a dataset node in the knowledge graph, showing standardized attributes including name, time, resolution, location, institution, description, and contact.

Table 1. Comparative summary of representative methods for scientific information extraction and literature–data linkage.

Method Category	Representative Studies	Typical Data/Benchmark	Main Strengths	Main Limitations for Geoscientific Literature–Data Linkage
Rule-based methods	Early pattern matching and dictionary/rule-based systems [19,20,21]	Scholarly metadata, structured or semi-structured documents	High precision on explicit patterns; interpretable	Limited robustness to implicit dataset mentions, alias variation, and heterogeneous spatial–temporal expressions
Traditional machine learning methods	CRF/SVM-based scientific IE; domain-specific geological term extraction [22,25,26]	Annotated scientific corpora; domain dictionaries	Better generalization than pure rules; can leverage domain features	Depend on handcrafted features and domain adaptation; weak for long-context and cross-sentence reasoning
Deep learning/pretrained transformer methods	SciER [27], SciNLP [28], SciBERT [32]	Scientific corpora and full-text annotated benchmarks	Stronger semantic representation; improved performance on scientific NER and relation extraction	Still challenged by full-text complexity, implicit references, and document-level reasoning; usually focus on text extraction rather than dataset linking
Neural methods for structured extraction and linkage	GPT/LLaMA-style structured extraction [33,34], SoulMate [37], KGAT [36], TEA [38]	Scientific texts, short-text linking, knowledge graphs, time-aware alignment tasks	Flexible structured extraction; better use of contextual, temporal, and relational signals	Existing studies rarely provide a unified workflow for geoscientific dataset extraction, normalization, and graph-based linkage; integration of spatial–temporal attributes remains limited

Table 2. Examples of candidate text retrieval for dataset mentions.

Original Text Fragment	BM25 Score	Candidate Selected
Based on the 2000–2020 daily meteorological station observations on the Tibetan Plateau, we analyzed the evolution of ecosystem stability.	7.85	Yes
The data were obtained from the China Land Use/Cover Change Dataset (LUCC) with a spatial resolution of 1 km × 1 km.	6.93	Yes
In the experimental section, we conducted a sensitivity analysis to verify the robustness of the model.	1.12	No
We adopted statistical yearbook data from 1990–2018 provided by national authorities.	5.74	Yes
The research methods mainly include literature review and case study.	0.95	No
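The candidate selection illustrated in Table 2 can be sketched with a small pure-Python BM25 scorer followed by a regex cue check. This is a hedged illustration, not the pipeline's implementation: the query terms, the regex pattern, the parameters `k1 = 1.5` and `b = 0.75`, and the rule "positive BM25 score and at least one regex cue" are all assumptions, and the resulting scores are not expected to reproduce the BM25 values listed in Table 2.

```python
import math
import re
from collections import Counter

# Text fragments copied from Table 2.
FRAGMENTS = [
    "Based on the 2000–2020 daily meteorological station observations on the Tibetan Plateau, we analyzed the evolution of ecosystem stability.",
    "The data were obtained from the China Land Use/Cover Change Dataset (LUCC) with a spatial resolution of 1 km × 1 km.",
    "In the experimental section, we conducted a sensitivity analysis to verify the robustness of the model.",
    "We adopted statistical yearbook data from 1990–2018 provided by national authorities.",
    "The research methods mainly include literature review and case study.",
]

# Hypothetical dataset-related query terms and regex cues (years, key nouns).
QUERY = ["data", "dataset", "observations", "resolution", "obtained", "provided"]
CUE = re.compile(r"\b(?:19|20)\d{2}\b|\bdataset\b|\bresolution\b", re.IGNORECASE)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document, using a non-negative IDF variant."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def select_candidates(docs, query=QUERY):
    """Keep a fragment if it scores above zero and contains at least one regex cue."""
    scores = bm25_scores(query, docs)
    return [s > 0.0 and CUE.search(d) is not None for s, d in zip(scores, docs)]


for fragment, keep in zip(FRAGMENTS, select_candidates(FRAGMENTS)):
    print("Yes" if keep else "No ", fragment[:60])
```

With these toy settings the two methodological sentences are rejected because they contain none of the query terms, which mirrors the Yes/No pattern in Table 2 only by construction of the example.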

Table 3. Summary of the basic statistics of the test set.

Statistic	Value	Description
Number of test papers	200	Non-overlapping test papers collected from multiple geoscientific journals
Total dataset mentions	1349	Labeled instances of dataset references in the test set
Total labeled evaluation instances	1446	Positive and negative instances used for confusion-matrix-based evaluation
Number of source journals	6	Cross-journal benchmark composition
Attributes per mention	8	Name, time, location, authors, institution, resolution, DOI/URL, role
Annotators	6	Independent postgraduate annotators following a unified guideline
Adjudicators	2	One researcher and one associate researcher in geoscience

Table 4. Inter-annotator agreement on the reliability-assessment subset.

Agreement Item	Metric	Value	Description
Dataset mention identification	Average pairwise F1	0.85	Span-level agreement on dataset mention detection
Data role (source/output)	Cohen’s kappa	0.81	Agreement on categorical role annotation
Structured attributes (overall)	Normalized agreement	0.87	Average agreement across normalized name, time, location, institution, resolution, and DOI/URL fields

Table 5. Comparison of different prompting strategies.

Prompting Strategy	Precision (%)	Recall (%)	F1-Score (%)	Valid JSON Rate (%)
Zero-shot prompting	87.92	83.76	85.79	91.50
Few-shot in-context prompting	90.46	86.88	88.63	95.20
Schema-constrained prompting	93.79	90.66	92.20	99.10
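To illustrate what a valid JSON rate can measure, the sketch below checks model responses against a fixed field schema. The field keys are hypothetical names for the eight attributes listed in Table 3, the role values follow the source/output distinction used in Appendix A, and the validity criterion itself (parseable, exactly the expected keys, allowed role) is an assumption; the criterion actually used for Table 5 may differ, for example by requiring parseability only.

```python
import json

# Hypothetical keys for the eight attributes per mention.
REQUIRED_FIELDS = (
    "name", "time", "location", "authors",
    "institution", "resolution", "doi_url", "role",
)
ALLOWED_ROLES = {"source", "output"}


def is_valid_record(raw):
    """True if `raw` parses as a JSON object with exactly the expected keys and an allowed role."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict) or set(record) != set(REQUIRED_FIELDS):
        return False
    role = record["role"]
    return isinstance(role, str) and role in ALLOWED_ROLES


def valid_json_rate(responses):
    """Percentage of responses that pass the schema check."""
    return 100.0 * sum(is_valid_record(r) for r in responses) / len(responses)


# Illustrative record; attribute values are invented placeholders.
GOOD = {
    "name": "China Land Use/Cover Change Dataset (LUCC)",
    "time": None,
    "location": "China",
    "authors": None,
    "institution": None,
    "resolution": "1 km",
    "doi_url": None,
    "role": "source",
}
RESPONSES = [
    json.dumps(GOOD),                       # valid
    json.dumps({**GOOD, "role": "input"}),  # role outside the allowed set
    '{"name": "LUCC", "role": ',            # truncated, not parseable
    json.dumps(GOOD),                       # valid
]
print(valid_json_rate(RESPONSES))
```

Allowing `null` for unknown attributes while still requiring every key keeps missing information explicit, which is one plausible reason a schema constraint could raise the share of usable responses.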

Table 6. Performance–efficiency trade-off under different repeated-call settings.

Repeated Calls	Precision (%)	Recall (%)	F1-Score (%)	Relative API-Call Cost	Relative Token Cost
n = 1	91.62	88.41	89.99	1.00×	1.00×
n = 2	93.79	90.66	92.20	2.00×	1.96×
n = 3	94.05	90.92	92.46	3.00×	2.91×
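The values in Table 6 can be checked and summarized with a few lines of Python. The sketch recomputes each F1-score as the harmonic mean of precision and recall and then derives the F1 gain per additional unit of relative token cost; the "gain per cost" ratio is a summary statistic introduced here for illustration and is not a metric reported in the study.

```python
# (repeated calls, precision %, recall %, reported F1 %, relative token cost), from Table 6.
ROWS = [
    (1, 91.62, 88.41, 89.99, 1.00),
    (2, 93.79, 90.66, 92.20, 1.96),
    (3, 94.05, 90.92, 92.46, 2.91),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def marginal_gains(rows):
    """F1 gain and F1 gain per unit of added token cost for each extra call."""
    gains = []
    for prev, cur in zip(rows, rows[1:]):
        delta_f1 = cur[3] - prev[3]
        delta_cost = cur[4] - prev[4]
        gains.append((cur[0], round(delta_f1, 2), round(delta_f1 / delta_cost, 2)))
    return gains


for n, p, r, reported, _ in ROWS:
    print(n, round(f1_score(p, r), 2), reported)

for n, delta_f1, per_cost in marginal_gains(ROWS):
    print(f"n = {n}: +{delta_f1} F1 points, {per_cost} points per unit of added token cost")
```

The second call adds about 2.2 F1 points, while the third adds about 0.3 points at a similar additional cost.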

Table 7. Comparison with transformer-based baselines and LLM-based extraction configurations on the test benchmark.

Method	Precision (%)	Recall (%)	F1-Score (%)
SciBERT sequence labeling baseline	86.42	79.85	83.01
BM25 + SciBERT classifier baseline	88.76	82.94	85.75
LLM only	88.63	85.41	86.99
Qwen2.5-32B-Instruct	89.96	85.74	87.80
Llama-3.1-8B-Instruct	88.71	84.93	86.78
Qwen2.5-72B-Instruct	91.84	88.27	90.02
Proposed framework (GPT-5.2)	93.79	90.66	92.20

Table 8. Absolute precision, recall, and F1-score values for each system configuration on the test benchmark.

System Configuration	Precision (%)	Recall (%)	F1-Score (%)
Full System	93.79	90.66	92.20
– BM25 retrieval	89.94	84.72	87.21
– Regex filtering	90.18	89.47	89.82
– Whitelist-assisted scoring	91.36	87.84	89.57
– Regex filtering and whitelist scoring	88.95	86.73	87.83
– Normalization	91.72	87.95	89.80
LLM only	88.63	85.41	86.99

Table 9. Quantitative usability evaluation of the constructed knowledge graph.

Query Task	Number of Queries	Query Precision (%)	Manual Verification Accuracy (%)	Avg. Response Time (ms)
Dataset reuse analysis	10	95.00	95.00	18.6
Provenance tracing	10	93.33	94.00	21.4
Regional dataset discovery	10	94.12	94.50	19.8
Attribute-level inspection	10	96.15	96.00	16.9
Overall	40	94.65	94.88	19.2
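The Overall row of Table 9 appears consistent with an unweighted mean of the four task rows; this reading is inferred from the numbers and is not stated explicitly in the text. The short check below recomputes the three averages from the task-level values in the table.

```python
# (number of queries, query precision %, manual verification accuracy %, avg. response time ms)
TASKS = {
    "Dataset reuse analysis": (10, 95.00, 95.00, 18.6),
    "Provenance tracing": (10, 93.33, 94.00, 21.4),
    "Regional dataset discovery": (10, 94.12, 94.50, 19.8),
    "Attribute-level inspection": (10, 96.15, 96.00, 16.9),
}
REPORTED_OVERALL = (40, 94.65, 94.88, 19.2)


def column_mean(rows, index):
    """Unweighted mean of one column across the task rows."""
    values = [row[index] for row in rows]
    return sum(values) / len(values)


rows = list(TASKS.values())
total_queries = sum(row[0] for row in rows)
means = [column_mean(rows, i) for i in (1, 2, 3)]

print(total_queries, [round(m, 3) for m in means])
print(REPORTED_OVERALL)
```

Because every task contains the same number of queries, a query-weighted mean over tasks would give the same task-level averages.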

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Share and Cite

MDPI and ACS Style

Chen, X.; Ma, Y.; Wu, K.; Pang, X.; Li, G.; Ma, R.; Yang, L.; Peng, C.; Zhi, J.; Yuan, J. Research on Methods for Linking Geoscience Literature and Geoscientific Data Based on Large Language Models. ISPRS Int. J. Geo-Inf. 2026, 15, 243. https://doi.org/10.3390/ijgi15060243
