Research on Methods for Linking Geoscience Literature and Geoscientific Data Based on Large Language Models
Abstract
1. Introduction
- (1) We formulate geoscientific literature–data linkage as a multi-stage task that goes beyond dataset mention detection alone. The task integrates candidate-fragment retrieval, schema-constrained attribute extraction, normalization of aliases and spatial–temporal expressions, and graph-based representation of linked dataset entities, thereby clarifying how textual and contextual evidence supports canonical dataset identification.
- (2) We develop a modular retrieve → extract → normalize → link framework that combines BM25-based retrieval, regex evidence, whitelist-assisted scoring, schema-constrained LLM extraction, hybrid-similarity normalization, and knowledge graph construction. The framework incorporates evidence-based blank-field retention, structured-output validation, and low-confidence handling to limit the propagation of unsupported or ambiguous extraction results into the knowledge graph.
- (3) We conduct a comprehensive evaluation on an expanded, manually annotated cross-journal benchmark, including extraction reliability analysis, prompting-strategy and repeated-call efficiency experiments, transformer-based and multi-LLM comparisons, module-level ablation analysis, and a quantitative query-based usability evaluation of the constructed knowledge graph. These experiments assess both the extraction quality of the framework and its ability to support structured retrieval of literature–data relationships.
2. Literature Review
3. Materials and Methods
3.1. Problem Formulation and Methodological Overview
3.2. System Architecture
- (1) Document preprocessing and candidate retrieval, which converts source documents into plain text, segments them into paragraphs and sentences, and identifies text fragments likely to contain dataset-related information through BM25-based retrieval, regex matching, whitelist-based filtering, and composite candidate scoring.
- (2) LLM-based structured extraction, which converts the retrieved candidate fragments into schema-constrained JSON records through prompt-guided extraction. This stage includes structured field prompting, fragment-level JSON generation, repeated-call aggregation for improved stability, and blank-field retention when supporting evidence is unavailable.
- (3) Post-processing and normalization, which resolves lexical variation, synonyms, spatial–temporal heterogeneity, and alias ambiguity through hybrid similarity matching and controlled vocabularies. This stage combines Levenshtein similarity and embedding-based semantic similarity, supported by a gazetteer, an institution registry, and a dataset alias lexicon, to map extracted values to canonical forms.
- (4) Knowledge graph construction, which materializes the normalized records into graph nodes and typed relations in Neo4j, thereby enabling structured storage, querying, and visualization of literature–data linkages.
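
As a rough illustration of the first stage, the sketch below combines an Okapi BM25 score with a regex cue bonus and a whitelist bonus into one composite candidate score. The cue pattern, the whitelist entries, the query terms, the BM25 parameters, and the additive weights are all assumptions made for this example; the actual rules and weights of the framework are not reproduced here.

```python
import math
import re
from collections import Counter

# Hypothetical cue pattern, whitelist, and query terms (illustrative only).
DATA_CUE = re.compile(r"\b(dataset|data\s+were\s+obtained|observations|yearbook)\b", re.I)
WHITELIST = {"lucc", "modis", "srtm", "worldclim"}
QUERY = ["dataset", "data", "resolution", "observations", "obtained"]


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def composite_scores(fragments, w_regex=1.0, w_white=1.0):
    """BM25 score plus fixed bonuses for a regex cue and a whitelisted name."""
    docs = [tokenize(f) for f in fragments]
    base = bm25_scores(QUERY, docs)
    out = []
    for frag, toks, score in zip(fragments, docs, base):
        score += w_regex * bool(DATA_CUE.search(frag))
        score += w_white * bool(WHITELIST & set(toks))
        out.append(score)
    return out


fragments = [
    "The data were obtained from the China Land Use/Cover Change Dataset (LUCC) "
    "with a spatial resolution of 1 km x 1 km.",
    "In the experimental section, we conducted a sensitivity analysis to verify "
    "the robustness of the model.",
]
scores = composite_scores(fragments)
```

With these two fragments, the dataset sentence receives a positive BM25 score plus both bonuses, while the methodological sentence scores zero. A selection threshold would then be applied to the composite score; its value is left open here.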
3.3. Task Decomposition and Evaluation Alignment
3.4. Document Preprocessing and Candidate Retrieval
3.5. Prompt Engineering for Reliable Extraction
- (1) Call stability: each prompt is executed twice under identical settings to reduce stochastic variation in LLM outputs. We compared one, two, and three repeated calls in pilot experiments on the held-out development set, and two calls provided most of the stability improvement while avoiding excessive inference cost. API calls were retried up to four times only in cases of transient service failure; this retry mechanism did not alter model predictions or evaluation results.
- (2) Strict JSON enforcement: only the first valid top-level JSON block is retained and coerced into the predefined schema.
- (3) Evidence-based extraction constraint: to reduce hallucination risks, the prompt explicitly requires the model to extract only information supported by the provided text. Unknown or unsupported fields are required to remain blank rather than being inferred from prior knowledge.
- (4) Field-level validation: extracted values are checked using rule-based validation, including temporal-format checking, DOI-pattern validation, and consistency checking between dataset names and normalized attributes. Records containing internally inconsistent attributes or unsupported mappings are flagged as low-confidence before knowledge graph ingestion. Format validation is treated as an error-detection mechanism rather than as evidence that a field value is factually correct.
- (5) Automatic temporal splitting: multi-year fields containing separators are programmatically expanded into separate records only when the text provides sufficient evidence that the years correspond to distinct dataset records; otherwise, the temporal information is retained in the original record to avoid artificial record generation.
- (6) Structural fallback: if no dataset is extracted, a placeholder record is generated to maintain structural consistency for downstream processing. Such placeholder records do not represent confirmed dataset entities, are excluded from evaluation, and are not treated as validated dataset links in the knowledge graph.
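
The strict JSON enforcement, blank-field retention, and field-level validation steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the key names in `SCHEMA` mirror the eight annotated attributes but are hypothetical, and the DOI and temporal patterns are simplified stand-ins for the validation rules actually used.

```python
import json
import re

# Hypothetical schema keys and simplified validation patterns (assumptions).
SCHEMA = ("name", "time", "location", "authors", "institution",
          "resolution", "doi_url", "role")
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
TIME_RE = re.compile(r"^\d{4}([–-]\d{4})?$")


def first_json_object(text):
    """Return the first JSON object in the reply that parses, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


def coerce(obj):
    """Keep only schema keys; missing or null fields stay blank."""
    record = {}
    for key in SCHEMA:
        value = obj.get(key)
        if isinstance(value, list):
            value = "; ".join(str(v) for v in value)
        record[key] = str(value).strip() if value not in (None, "") else ""
    return record


def validate(record):
    """Rule-based checks; a failure flags the record but does not alter it."""
    issues = []
    if record["time"] and not TIME_RE.match(record["time"]):
        issues.append("time_format")
    if record["doi_url"].startswith("10.") and not DOI_RE.match(record["doi_url"]):
        issues.append("doi_pattern")
    if record["role"] and record["role"] not in ("source", "output"):
        issues.append("role_vocabulary")
    return {**record, "low_confidence": bool(issues), "issues": issues}


reply = ('Here is the result: {"name": "LUCC", "time": "2000-2020", '
         '"role": "source", "extra": 1} and {"name": "x"}')
record = validate(coerce(first_json_object(reply)))
```

In this example the second JSON object and the unexpected `extra` key are discarded, and fields the reply did not supply (for instance `location`) remain empty strings rather than being filled in. Passing validation only means the formats are plausible; it says nothing about whether the values are factually correct.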
3.6. Post-Processing and Normalization
- (1) A geographical gazetteer (~3000 entries) covering provinces, basins, and major regions such as the Qinghai–Tibet Plateau.
- (2) An institution registry (~700 entries) compiled from major data providers, including the Chinese Academy of Sciences (CAS) and the China Meteorological Administration (CMA).
- (3) A dataset alias lexicon (~500 entries) derived from the annotated benchmark corpus.

- (1) Name normalization and synonym-conflict resolution. Dataset aliases are unified to canonical names. For example, “LUCC dataset”, “China land use dataset”, and “land cover product” are mapped to China Land Use/Cover Change Dataset (LUCC), ensuring unique dataset nodes. When one alias can be mapped to multiple candidate datasets, the system does not rely on name similarity alone. Instead, synonym conflicts are resolved by jointly comparing dataset name similarity, temporal coverage, spatial scope, provider institution, and spatial resolution. A candidate mapping is accepted only when its hybrid similarity exceeds the threshold and its auxiliary attributes do not contradict the candidate canonical record. If multiple candidates remain plausible or if key auxiliary attributes conflict, the case is marked as ambiguous and reserved for manual review. This strategy prevents semantically adjacent but non-equivalent datasets from being merged into the same canonical node.
- (2) Temporal and spatial normalization. Temporal expressions such as “2000–2020”, “from 2000 to 2020”, or “20-year period” are converted into standardized interval formats. Spatial expressions such as “Qinghai–Tibet Plateau” and “Tibetan Plateau” are reconciled using the controlled geographic vocabulary. For ambiguous spatial entities, the system applies a hierarchical gazetteer-based disambiguation strategy. Candidate place names are compared according to province–city–county relations, regional aliases, administrative codes, and contextual cues such as neighboring administrative units, basin names, and study-area descriptions. When sufficient contextual evidence is available, the spatial expression is linked to a canonical geographic entity in the hierarchical gazetteer, while the original textual expression is preserved for traceability. If a unique geographic entity cannot be determined, the original expression is retained and the corresponding field is marked as low-confidence for manual review.
- (3) Attribute harmonization and low-confidence handling. Resolution units such as “1000 m” and “1 km” are standardized, institution names are mapped to their authoritative forms (for example, “CAS” to “Chinese Academy of Sciences”), and data roles are constrained to the controlled vocabulary “source” or “output”. Low-confidence cases are not forcibly normalized; instead, their original expressions are preserved and flagged for manual review, so that uncertain mappings do not propagate silently into the knowledge graph.
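
The name-normalization rule above (accept a mapping only when the hybrid similarity exceeds a threshold and no auxiliary attribute contradicts the canonical record) might be sketched as follows. Several parts are assumptions made to keep the example self-contained: a character-trigram cosine stands in for the embedding-based semantic similarity, the weight `alpha` and the threshold of 0.8 are illustrative values, and the second lexicon entry is a hypothetical dataset invented to show a synonym conflict.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance with unit costs (dynamic programming over two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lev_sim(a, b):
    """Levenshtein distance rescaled to a similarity in [0, 1]."""
    return 1 - levenshtein(a, b) / (max(len(a), len(b)) or 1)


def trigram_cosine(a, b):
    """Stand-in for embedding similarity: cosine over character trigrams."""
    def grams(s):
        s = f"  {s} "
        return Counter(s[i:i + 3] for i in range(len(s) - 2))
    ga, gb = grams(a), grams(b)
    dot = sum(ga[g] * gb[g] for g in ga)
    norm = math.sqrt(sum(v * v for v in ga.values())) * math.sqrt(sum(v * v for v in gb.values()))
    return dot / norm if norm else 0.0


def hybrid(a, b, alpha=0.5):
    """Weighted mix of lexical and (stand-in) semantic similarity."""
    a, b = a.lower().strip(), b.lower().strip()
    return alpha * lev_sim(a, b) + (1 - alpha) * trigram_cosine(a, b)


def link(mention, attrs, lexicon, threshold=0.8, keys=("provider", "resolution")):
    """Return (canonical_name, status): 'linked', 'ambiguous', or 'unmatched'."""
    plausible = [e for e in lexicon
                 if max(hybrid(mention, alias) for alias in e["aliases"]) >= threshold]
    consistent = [e for e in plausible
                  if not any(attrs.get(k) and e.get(k) and attrs[k] != e[k] for k in keys)]
    if len(consistent) == 1:
        return consistent[0]["canonical"], "linked"
    return (None, "ambiguous") if plausible else (None, "unmatched")


LEXICON = [
    {"canonical": "China Land Use/Cover Change Dataset (LUCC)",
     "aliases": ["LUCC dataset", "China land use dataset"], "resolution": "1 km"},
    # Hypothetical second entry sharing an alias, to illustrate a conflict.
    {"canonical": "Hypothetical 30 m LUCC product",
     "aliases": ["LUCC dataset"], "resolution": "30 m"},
]
```

Here `link("LUCC dataset", {"resolution": "1 km"}, LEXICON)` resolves to the canonical LUCC record because the 30 m entry contradicts the extracted resolution, whereas the same mention without any auxiliary attribute is returned as ambiguous and would be held for manual review.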
3.7. Knowledge Graph Construction
4. Experimental Results
4.1. Dataset and Benchmark
4.2. Data Extraction
4.3. Prompting Strategy, Efficiency, and LLM Backbone Comparison
4.3.1. Prompting Strategy Comparison
4.3.2. Performance–Efficiency Trade-Off of Repeated LLM Calls
4.3.3. Baseline and LLM Backbone Comparison
4.4. Ablation Study
- (1) BM25-based candidate retrieval;
- (2) Regex-based filtering;
- (3) Schema-constrained LLM extraction;
- (4) Post-processing and normalization.
4.5. Knowledge Graph Construction and Usability Evaluation
- (1) Which datasets are most frequently used in Tibetan Plateau studies between 2000 and 2020?
- (2) Which institutions provide the datasets most frequently used in ecosystem or climate-related studies?
- (3) Which articles use datasets associated with a specific region and temporal coverage?
- (4) Which datasets have a specified spatial resolution, temporal coverage, or provider attribute?
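
To make the first query type concrete, the sketch below answers it over a small in-memory list of normalized link records. It is only an illustration: the records, article identifiers, and field names are hypothetical, and in the framework itself such questions would be answered by graph queries against Neo4j rather than by a Python loop.

```python
from collections import Counter

# Hypothetical normalized link records (illustrative data only).
LINKS = [
    {"article": "A1", "dataset": "LUCC", "region": "Tibetan Plateau",
     "start": 2000, "end": 2020, "role": "source"},
    {"article": "A2", "dataset": "LUCC", "region": "Tibetan Plateau",
     "start": 2005, "end": 2015, "role": "source"},
    {"article": "A2", "dataset": "MODIS NDVI", "region": "Tibetan Plateau",
     "start": 2001, "end": 2018, "role": "source"},
    {"article": "A3", "dataset": "LUCC", "region": "Loess Plateau",
     "start": 2000, "end": 2020, "role": "source"},
    {"article": "A4", "dataset": "MODIS NDVI", "region": "Tibetan Plateau",
     "start": 1990, "end": 1999, "role": "source"},
]


def most_used(links, region, start, end, top=3):
    """Count distinct articles per source dataset in a region and time window."""
    seen = set()
    counts = Counter()
    for link in links:
        overlaps = link["start"] <= end and link["end"] >= start
        key = (link["article"], link["dataset"])
        if (link["region"] == region and link["role"] == "source"
                and overlaps and key not in seen):
            seen.add(key)
            counts[link["dataset"]] += 1
    return counts.most_common(top)


ranking = most_used(LINKS, "Tibetan Plateau", 2000, 2020)
```

The example relies on region names and temporal coverage having already been normalized to canonical values and year intervals, which is what makes the equality test and the interval-overlap test meaningful.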
5. Discussion
5.1. Advantages of the Proposed Framework
5.2. Interpretation of Experimental Results
5.3. Comparison with Previous Approaches
5.4. Limitations and Future Work
5.5. Implications for Geoscientific Data Infrastructure
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Inter-Annotator Agreement Assessment
Appendix A.1. Reliability-Assessment Subset
Appendix A.2. Mention-Level Agreement
Appendix A.3. Attribute-Level Agreement
Appendix A.4. Agreement Results
Appendix A.5. Role of Adjudication
References
1. Sun, K.; Zhu, Y.; Pan, P.; Hou, Z.; Wang, D.; Li, W.; Song, J. Geospatial data ontology: The semantic foundation of geospatial data integration and sharing. Big Earth Data 2019, 3, 269–296.
2. Kostoff, R.N. Role of Technical Literature in Science and Technology Development and Exploitation. J. Inf. Sci. 2003, 29, 223–228.
3. Marsicek, J.; Goring, S.J.; Marcott, S.A.; Meyers, S.R.; Peters, S.; Ross, I.A.; Singer, B.; Williams, J. Automated Extraction of Spatiotemporal Geoscientific Data from the Literature Using GeoDeepDive. Past Glob. Changes Mag. 2018, 26, 70.
4. Feldhoff, K.; Wiemer, H.; Träger, P.; Kühne, R.; Zimmermann, M.; Ihlenfeldt, S. Automatic Information Extraction from Scientific Publications Based on the Use Case of Additive Manufacturing. Appl. Sci. 2025, 15, 9331.
5. Arias, A.; Dini, I.; Casini, M.; Fiordelisi, A.; Perticone, I.; Pisano, A. Geoscientific Feature Update of the Larderello-Travale Geothermal System (Italy) for a Regional Numerical Modeling. In Proceedings of the World Geothermal Congress 2010, Bali, Indonesia, 25–30 April 2010.
6. Winata, G.I.; Madotto, A.; Lin, Z.; Liu, R.; Yosinski, J.; Fung, P. Language Models are Few-shot Multilingual Learners. In Proceedings of the 1st Workshop on Multilingual Representation Learning, Punta Cana, Dominican Republic, 10 November 2021; pp. 1–15.
7. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2019, 21, 140:1–140:67.
8. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 6–12 December 2020; pp. 9459–9474.
9. Polak, M.P.; Morgan, D. Extracting Accurate Materials Data from Research Papers with Conversational Language Models and Prompt Engineering. Nat. Commun. 2024, 15, 1569.
10. Kusano, G.; Akimoto, K.; Takeoka, K. Revisiting Prompt Engineering: A Comprehensive Evaluation for LLM-based Personalized Recommendation. In Proceedings of the Nineteenth ACM Conference on Recommender Systems, New York, NY, USA, 2–4 September 2025; pp. 832–841.
11. Roberts, J.; Green, F. The Future of Prompt Engineering: Trends, Challenges, and Opportunities. In Proceedings of the IEEE International Symposium on Artificial Intelligence and Human Interaction, Shenzhen, China, 14–16 June 2024; pp. 75–88.
12. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
13. Deng, C.; Zhang, T.; He, Z.; Chen, Q.; Shi, Y.; Xu, Y.; Fu, L.; Zhang, W.; Wang, X.; Zhou, C.; et al. K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization. In Proceedings of the 2024 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 1–12.
14. Heddes, J.; Meerdink, P.; Pieters, M.; Marx, M. The Automatic Detection of Dataset Names in Scientific Articles. Data 2021, 6, 84.
15. Pan, H.; Zhang, Q.; Dragut, E.; Caragea, C.; Latecki, L.J. DMDD: A Large-Scale Dataset for Dataset Mentions Detection. Trans. Assoc. Comput. Linguist. 2023, 11, 1132–1146.
16. Zhou, B.; Li, K. Fusing Geoscience Large Language Models and Lightweight RAG for Enhanced Geological Question Answering. Geosciences 2025, 15, 382.
17. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 195.
18. Ma, X. Knowledge Graph Construction and Application in Geosciences: A Review. Comput. Geosci. 2022, 161, 105082.
19. Cao, Q.; Wang, S.; Chen, Z.; Li, G.; Li, J. The Method of Extracting Names of Geo-science Data based on Regular Expressions. J. Geo-Inf. Sci. 2023, 25, 1601–1610.
20. Fries, J.A.; Varma, P.; Chen, V.S.; Xiao, K.; Tejeda, H.; Saha, P.; Dunnmon, J.A.; Chubb, H.; Maskatia, S.A.; Fiterau, M.; et al. Weakly supervised classification of aortic valve malformations using unlabeled cardiac MRI sequences. Nat. Commun. 2019, 10, 3111.
21. Cui, B.-G.; Chen, X. An Improved Hidden Markov Model for Literature Metadata Extraction. In Proceedings of the 6th International Conference on Advanced Intelligent Computing Theories and Applications: Intelligent Computing, Changsha, China, 18 August 2010; pp. 205–212.
22. Nasar, Z.; Jaffry, S.W.; Malik, M.K. Information extraction from scientific articles: A survey. Scientometrics 2018, 117, 1931–1990.
23. D’Souza, J.; Hoppe, A.; Brack, A.; Jaradeh, M.Y.; Auer, S.; Ewerth, R. The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 2192–2203.
24. Hou, Y.; Jochim, C.; Gleize, M.; Bonin, F.; Ganguly, D. TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 707–714.
25. Chengbin, W.; Ma, X.; Chen, J.; Chen, J. Information Extraction and Knowledge Graph Construction from Geoscience Literature. Comput. Geosci. 2018, 112, 112–120.
26. Qiu, Q.; Tian, M.; Tao, L.; Xie, Z.; Ma, K. Semantic Information Extraction and Search of Mineral Exploration Data Using Text Mining and Deep Learning Methods. Ore Geol. Rev. 2024, 165, 105863.
27. Zhang, Q.; Chen, Z.; Pan, H.; Caragea, C.; Latecki, L.J.; Dragut, E. SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 3–7 November 2024; pp. 13083–13100.
28. Duan, D.; Peng, J.; Zhang, Y.; Zhang, C. SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 24–28 November 2025; pp. 14473–14486.
29. Gerasimov, I.; KC, B.; Mehrabian, A.; Acker, J.; McGuire, M.P. Comparison of Datasets Citation Coverage in Google Scholar, Web of Science, Scopus, Crossref, and DataCite. Scientometrics 2024, 129, 3681–3704.
30. Vrouwenvelder, K.; Raia, N.H.; Thomer, A.K. Obstacles to Dataset Citation Using Bibliographic Management Software. Data Sci. J. 2025, 24, 017.
31. Lafia, S.; Fan, L.; Hemphill, L. A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature. In Proceedings of the Association for Information Science and Technology, Pittsburgh, PA, USA, 9 October–1 November 2022; Volume 59.
32. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3615–3620.
33. Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured Information Extraction from Scientific Text with Large Language Models. Nat. Commun. 2024, 15, 1418.
34. Du, J.; Wang, D.; Lin, B.; He, L.; Huang, L.-C.; Wang, J.; Manion, F.J.; Li, Y.; Cossrow, N.; Yao, L. Use of Deep Learning-Based NLP Models for Full-Text Data Elements Extraction for Systematic Literature Review Tasks. Sci. Rep. 2025, 15, 19379.
35. Kamran, S.; Hosseini, S.; Esmailzadeh, S.; Kangavari, M.R.; Hua, W. Cognition2Vocation: Meta-Learning via ConvNets and Continuous Transformers. Neural Comput. Appl. 2024, 36, 12935–12950.
36. Saaki, M.; Hosseini, S.; Rahmani, S.; Kangavari, M.R.; Hua, W.; Zhou, X. Value-Wise ConvNet for Transformer Models: An Infinite Time-Aware Recommender System. IEEE Trans. Knowl. Data Eng. 2023, 35, 9932–9945.
37. Najafipour, S.; Hosseini, S.; Hua, W.; Kangavari, M.R.; Zhou, X. SoulMate: Short-Text Author Linking Through Multi-Aspect Temporal-Textual Embedding. IEEE Trans. Knowl. Data Eng. 2022, 34, 448–461.
38. Liu, Y.; Hua, W.; Xin, K.; Hosseini, S.; Zhou, X. TEA: Time-Aware Entity Alignment in Knowledge Graphs. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 2591–2599.
39. Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Online, 6–14 December 2021.
40. Wang, B.; Wu, L.; Xie, Z.; Qiu, Q.; Zhou, Y.; Ma, K.; Tao, L. Understanding Geological Reports Based on Knowledge Graphs Using a Deep Learning Approach. Comput. Geosci. 2022, 168, 105229.
41. Chen, Q.; Zhou, W.; Cheng, J.; Yang, J. An Enhanced Retrieval Scheme for a Large Language Model with a Joint Strategy of Probabilistic Relevance and Semantic Association in the Vertical Domain. Appl. Sci. 2024, 14, 11529.
42. Niu, S.; Yang, K.; Zhao, R.; Liu, Y.; Li, Z.; Wang, H.; Chen, W. Tree-KG: An Expandable Knowledge Graph Construction Framework for Knowledge-intensive Domains. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, (Volume 1: Long Papers), Vienna, Austria, 12–17 August 2025; pp. 18516–18529.
43. Goyal, N.; Singh, N. Named Entity Recognition and Relationship Extraction for Biomedical Text: A Comprehensive Survey, Recent Advancements, and Future Research Directions. Neurocomputing 2025, 618, 129171.
44. Fick, S.; Hijmans, R. WorldClim 2: New 1-km Spatial Resolution Climate Surfaces for Global Land Areas. Int. J. Climatol. 2017, 37, 4302–4315.
45. Friedl, M.A.; Sulla-Menashe, D.; Tan, B.; Schneider, A.; Ramankutty, N.; Sibley, A.; Huang, X. MODIS Collection 5 Global Land Cover: Algorithm Refinements and Characterization of New Datasets. Remote Sens. Environ. 2010, 114, 168–182.
46. Gong, P.; Wang, J.; Yu, L.; Zhao, Y.; Zhao, Y.; Liang, L.; Niu, Z.; Huang, X.; Fu, H.; Liu, S.; et al. Finer Resolution Observation and Monitoring of Global Land Cover: First Mapping Results with Landsat TM and ETM+ Data. Int. J. Remote Sens. 2013, 34, 2607–2654.
47. Chen, J.; Chen, J.; Liao, A.; Cao, X.; Chen, L.; Chen, X.; He, C.; Han, G.; Peng, S.; Lu, M.; et al. Global Land Cover Mapping at 30m Resolution: A POK-Based Operational Approach. ISPRS J. Photogramm. Remote Sens. 2015, 103, 7–27.
48. Liu, J.; Kuang, W.; Zhang, Z.; Xu, X.; Qin, Y.; Ning, J.; Zhou, W.; Zhang, S.; Li, R.; Yan, C.; et al. Spatiotemporal Characteristics, Patterns, and Causes of Land-Use Changes in China Since the Late 1980s. J. Geogr. Sci. 2014, 24, 195–210.
49. van Zyl, J.J. The Shuttle Radar Topography Mission (SRTM): A Breakthrough in Remote Sensing of Topography. Acta Astronaut. 2001, 48, 559–565.
50. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the Radiometric and Biophysical Performance of the MODIS Vegetation Indices. Remote Sens. Environ. 2002, 83, 195–213.
51. Pan, H.; Zhang, Q.; Caragea, C.; Dragut, E.; Latecki, L.J. SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 14407–14417.
52. Barlaug, N.; Gulla, J.A. Neural Networks for Entity Matching: A Survey. ACM Trans. Knowl. Discov. Data 2021, 15, 52.
53. Ateia, S.; Kruschwitz, U.; Scholz, M.; Koschmider, A.; Almohaishi, M. LLM-Based Information Extraction to Support Scientific Literature Research and Publication Workflows. In New Trends in Theory and Practice of Digital Libraries; Balke, W.-T., Golub, K., Manolopoulos, Y., Aizawa, A., Mayr, P., Tzanoudaki, M., Eds.; Springer: Cham, Switzerland, 2025; Volume 2694, pp. 14407–14417.








| Method Category | Representative Studies | Typical Data/Benchmark | Main Strengths | Main Limitations for Geoscientific Literature–Data Linkage |
|---|---|---|---|---|
| Rule-based methods | Early pattern matching and dictionary/rule-based systems [19,20,21] | Scholarly metadata, structured or semi-structured documents | High precision on explicit patterns; interpretable | Limited robustness to implicit dataset mentions, alias variation, and heterogeneous spatial–temporal expressions |
| Traditional machine learning methods | CRF/SVM-based scientific IE; domain-specific geological term extraction [22,25,26] | Annotated scientific corpora; domain dictionaries | Better generalization than pure rules; can leverage domain features | Depend on handcrafted features and domain adaptation; weak for long-context and cross-sentence reasoning |
| Deep learning/pretrained transformer methods | SciER [27], SciNLP [28], SciBERT [32] | Scientific corpora and full-text annotated benchmarks | Stronger semantic representation; improved performance on scientific NER and relation extraction | Still challenged by full-text complexity, implicit references, and document-level reasoning; usually focus on text extraction rather than dataset linking |
| Neural methods for structured extraction and linkage | GPT/LLaMA-style structured extraction [33,34], SoulMate [37], KGAT [36], TEA [38] | Scientific texts, short-text linking, knowledge graphs, time-aware alignment tasks | Flexible structured extraction; better use of contextual, temporal, and relational signals | Existing studies rarely provide a unified workflow for geoscientific dataset extraction, normalization, and graph-based linkage; integration of spatial–temporal attributes remains limited |

| Original Text Fragment | BM25 Score | Candidate Selected |
|---|---|---|
| Based on the 2000–2020 daily meteorological station observations on the Tibetan Plateau, we analyzed the evolution of ecosystem stability. | 7.85 | Yes |
| The data were obtained from the China Land Use/Cover Change Dataset (LUCC) with a spatial resolution of 1 km × 1 km. | 6.93 | Yes |
| In the experimental section, we conducted a sensitivity analysis to verify the robustness of the model. | 1.12 | No |
| We adopted statistical yearbook data from 1990–2018 provided by national authorities. | 5.74 | Yes |
| The research methods mainly include literature review and case study. | 0.95 | No |

| Statistic | Value | Description |
|---|---|---|
| Number of test papers | 200 | Non-overlapping test papers collected from multiple geoscientific journals |
| Total dataset mentions | 1349 | Labeled instances of dataset references in the test set |
| Total labeled evaluation instances | 1446 | Positive and negative instances used for confusion-matrix-based evaluation |
| Number of source journals | 6 | Cross-journal benchmark composition |
| Attributes per mention | 8 | Name, time, location, authors, institution, resolution, DOI/URL, role |
| Annotators | 6 | Independent postgraduate annotators following a unified guideline |
| Adjudicators | 2 | One researcher and one associate researcher in geoscience |

| Agreement Item | Metric | Value | Description |
|---|---|---|---|
| Dataset mention identification | Average pairwise F1 | 0.85 | Span-level agreement on dataset mention detection |
| Data role (source/output) | Cohen’s kappa | 0.81 | Agreement on categorical role annotation |
| Structured attributes (overall) | Normalized agreement | 0.87 | Average agreement across normalized name, time, location, institution, resolution, and DOI/URL fields |

| Prompting Strategy | Precision (%) | Recall (%) | F1-Score (%) | Valid JSON Rate (%) |
|---|---|---|---|---|
| Zero-shot prompting | 87.92 | 83.76 | 85.79 | 91.50 |
| Few-shot in-context prompting | 90.46 | 86.88 | 88.63 | 95.20 |
| Schema-constrained prompting | 93.79 | 90.66 | 92.20 | 99.10 |
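
As a quick arithmetic check on the table above, the F1-scores can be recomputed as the harmonic mean of the reported precision and recall. The sketch below only reuses the tabulated percentages; the tolerance of 0.01 is an assumption chosen to absorb two-decimal rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) copied from the prompting-strategy table.
ROWS = {
    "zero-shot": (87.92, 83.76, 85.79),
    "few-shot": (90.46, 86.88, 88.63),
    "schema-constrained": (93.79, 90.66, 92.20),
}


def check(rows, tol=0.01):
    """True for each row whose recomputed F1 matches the reported value."""
    return {name: abs(f1_score(p, r) - reported) < tol
            for name, (p, r, reported) in rows.items()}


consistent = check(ROWS)
```

All three rows agree with the harmonic-mean definition within rounding, so the reported F1-scores follow directly from the precision and recall columns.
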

| Repeated Calls | Precision (%) | Recall (%) | F1-Score (%) | Relative API-Call Cost | Relative Token Cost |
|---|---|---|---|---|---|
| n = 1 | 91.62 | 88.41 | 89.99 | 1.00× | 1.00× |
| n = 2 | 93.79 | 90.66 | 92.20 | 2.00× | 1.96× |
| n = 3 | 94.05 | 90.92 | 92.46 | 3.00× | 2.91× |
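
The trade-off in the table above can be made explicit by dividing each F1 gain by the additional relative token cost it requires. This is a small worked calculation on the tabulated values only; treating "F1 points per unit of relative token cost" as the efficiency measure is an assumption of this sketch, not a metric defined in the evaluation.

```python
# (calls, F1 %, relative token cost) copied from the repeated-call table.
RUNS = [(1, 89.99, 1.00), (2, 92.20, 1.96), (3, 92.46, 2.91)]


def marginal_gains(runs):
    """For each extra call: (n, F1 gain, F1 gain per unit of extra token cost)."""
    out = []
    for (_, f_prev, c_prev), (n, f_cur, c_cur) in zip(runs, runs[1:]):
        gain = f_cur - f_prev
        out.append((n, round(gain, 2), round(gain / (c_cur - c_prev), 2)))
    return out


gains = marginal_gains(RUNS)
```

Moving from one call to two adds 2.21 F1 points, whereas the third call adds only 0.26 points for a comparable increase in token cost, which is consistent with two calls being the chosen setting.
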

| Method | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|
| SciBERT sequence labeling baseline | 86.42 | 79.85 | 83.01 |
| BM25 + SciBERT classifier baseline | 88.76 | 82.94 | 85.75 |
| LLM-only | 88.63 | 85.41 | 86.99 |
| Qwen2.5-32B-Instruct | 89.96 | 85.74 | 87.80 |
| Llama-3.1-8B-Instruct | 88.71 | 84.93 | 86.78 |
| Qwen2.5-72B-Instruct | 91.84 | 88.27 | 90.02 |
| Proposed framework (GPT-5.2) | 93.79 | 90.66 | 92.20 |

| System Configuration | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|
| Full System | 93.79 | 90.66 | 92.20 |
| – BM25 retrieval | 89.94 | 84.72 | 87.21 |
| – Regex filtering | 90.18 | 89.47 | 89.82 |
| – Whitelist-assisted scoring | 91.36 | 87.84 | 89.57 |
| – Regex filtering and whitelist scoring | 88.95 | 86.73 | 87.83 |
| – Normalization | 91.72 | 87.95 | 89.80 |
| LLM-only | 88.63 | 85.41 | 86.99 |

| Query Task | Number of Queries | Query Precision (%) | Manual Verification Accuracy (%) | Avg. Response Time (ms) |
|---|---|---|---|---|
| Dataset reuse analysis | 10 | 95.00 | 95.00 | 18.6 |
| Provenance tracing | 10 | 93.33 | 94.00 | 21.4 |
| Regional dataset discovery | 10 | 94.12 | 94.50 | 19.8 |
| Attribute-level inspection | 10 | 96.15 | 96.00 | 16.9 |
| Overall | 40 | 94.65 | 94.88 | 19.2 |
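
The "Overall" row above is consistent with a query-count-weighted average of the four task rows, which coincides with the simple mean here because each task has 10 queries. The sketch below recomputes it from the tabulated values; that the overall figures were obtained this way is an inference from the numbers, not a stated procedure.

```python
# task: (queries, query precision %, manual verification accuracy %, avg. response ms)
TASKS = {
    "Dataset reuse analysis": (10, 95.00, 95.00, 18.6),
    "Provenance tracing": (10, 93.33, 94.00, 21.4),
    "Regional dataset discovery": (10, 94.12, 94.50, 19.8),
    "Attribute-level inspection": (10, 96.15, 96.00, 16.9),
}


def overall(tasks):
    """Total query count and query-weighted means of the three metric columns."""
    total = sum(row[0] for row in tasks.values())

    def weighted(idx):
        return sum(row[0] * row[idx] for row in tasks.values()) / total

    return total, weighted(1), weighted(2), weighted(3)


total, precision, accuracy, latency = overall(TASKS)
```

The recomputed values are 40 queries, 94.65% query precision, 94.875% manual verification accuracy, and 19.175 ms, which round to the figures reported in the table.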
© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Chen, X.; Ma, Y.; Wu, K.; Pang, X.; Li, G.; Ma, R.; Yang, L.; Peng, C.; Zhi, J.; Yuan, J. Research on Methods for Linking Geoscience Literature and Geoscientific Data Based on Large Language Models. ISPRS Int. J. Geo-Inf. 2026, 15, 243. https://doi.org/10.3390/ijgi15060243

