1. Introduction
Hyperaccumulator plants are a unique group of species that can accumulate extraordinary concentrations of heavy metals, including cadmium, copper, and zinc. These remarkable plants are essential to phytoremediation strategies because they extract heavy metals from contaminated soils, thereby reducing environmental toxicity and restoring ecosystem function. Their use has become increasingly important for sustainable environmental management [1,2,3,4,5,6,7,8].
Over the past decade, research on heavy metal hyperaccumulation and phytoextraction has grown rapidly, reflecting the overall expansion of scientific publishing [9]. Researchers consequently face unprecedented challenges in efficiently selecting, structuring, and filtering relevant information from newly published papers. In hyperaccumulation research, this abundance of information creates a critical need for advanced analytical tools that can systematically process the literature and summarize parameters such as metal types, concentrations, experimental conditions, and biochemical effects across numerous publications. The expansion has produced a large, heterogeneous body of literature reporting on diverse plant species, metal types, concentration ranges, experimental setups, and biochemical responses. While these studies are informative individually, they are difficult to synthesize at a general level using traditional manual approaches.
The emergence of self-attention mechanisms and the subsequent development of large language models (LLMs) have revolutionized our ability to process and extract information from scientific literature [10,11,12,13,14]. Recent systems such as SciDaSynth [15] demonstrate the effective extraction of structured knowledge from the scientific literature, using large multimodal models to interactively process and integrate information from text, tables, and figures. Similarly, in the biomedical field, LLM-powered frameworks have achieved high accuracy in extracting complex data from electronic health records [16], demonstrating greater reliability and efficiency than conventional manual curation methods. However, general-purpose models still have significant limitations when applied directly to specialized scientific subfields. These limitations include generating fabricated content and hallucinations in specialized domains [17,18], difficulties in synthesizing conflicting evidence across multiple studies [19], and significant gaps in numerical reasoning and structured scientific data extraction [20,21].
Despite these advances, to our knowledge, no specialized LLM tool has been developed to extract information from papers on the hyperaccumulation of heavy metals in plants and phytoextraction. While researchers have made significant attempts to systematize existing knowledge in this field, these efforts have been limited in scope and methodology. For instance, a 2025 study published in Scientific Data [22] presented a manually curated dataset of phytoremediation parameters for four plant species (sunflower, hemp, castor bean, and bamboo) and extracted 6679 bioconcentration factor observations from 238 papers. However, this labor-intensive approach required manually reviewing 587 studies over seven months, which highlights the significant time and resource constraints of traditional data compilation methods. Similarly, machine learning analysis has been used to predict phytoextraction efficacy factors [23], incorporating only 173 data points to evaluate various predictive variables. While these efforts provide valuable, structured datasets, researchers currently lack accessible tools for generating custom, complex datasets with broader taxonomic coverage or specific parameters relevant to their research questions. General-purpose LLMs have shown promise in adjacent fields such as soil health science, where GPT-4 and specialized tools like GeoGalactica have been used in data processing pipelines [24]. Similarly, LLM pipelines for processing plant species data have achieved average precision and recall scores exceeding 90% for categorical trait extraction [25]. Nevertheless, these general applications face limitations, including model overconfidence with limited context, difficulty merging information from distinct sources, and challenges processing numerical data.
Based on our preliminary analysis of scientific data extraction systems, we hypothesize that standard token-level metrics (precision, recall, and F1-score) fail to capture contextual accuracy in scientific knowledge extraction. Specifically, we propose that models with high F1-scores can produce incorrect associations between experimental parameters. This gap in contextual accuracy necessitates a semantic validation layer that evaluates factual consistency rather than term extraction. To address this gap, we present a specialized LLM pipeline designed to transform unstructured scientific literature on phytoextraction into a structured knowledge base that complies with the FAIR principles (findability, accessibility, interoperability, and reusability). Our system automatically extracts key parameters, including plant species, metal types, concentrations, experimental setups, and associated biochemical responses, and enables complex querying and trend analysis. Our approach combines multimodal document understanding with rigorous validation, including an “LLM-as-a-Judge” layer, to ensure high recall and factual consistency. This tool enables researchers to generate custom, reproducible datasets at scale, accelerating evidence synthesis and supporting data-driven decisions in environmental science.
2. Materials and Methods
This section outlines our comprehensive methodology for converting unstructured scientific literature on heavy metal hyperaccumulation into structured knowledge. We describe each processing stage, emphasizing the specialized adaptations necessary for accurately handling phytoextraction data. Our pipeline integrates three large language models to balance performance, accessibility, and computational requirements while maintaining scientific integrity through rigorous validation protocols.
2.1. Article Upload and Preprocessing
A custom web server interface allowed users to upload scientific articles in Portable Document Format (PDF). Upon upload, each PDF was processed using the PyMuPDF library to extract individual pages, and each page was rendered as a high-resolution PNG image at 300 dots per inch via PyMuPDF's Pixmap objects. The Pillow library was then used to handle the rendered images and ensure a consistent image format.
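This rendering step can be reproduced in a few lines of Python. The sketch below is illustrative only; the file paths and exact parameters of our implementation may differ.

```python
import fitz                      # PyMuPDF
from PIL import Image

doc = fitz.open("uploaded_article.pdf")           # illustrative file name
for page_number, page in enumerate(doc, start=1):
    pix = page.get_pixmap(dpi=300)                # render the page at 300 dpi
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    img.save(f"page_{page_number:03d}.png")       # one PNG per page for the multimodal model
```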
2.2. Document Conversion to Structured Markdown
The page images were converted into a structured, text-based format to facilitate downstream processing. Google’s Gemini 2.5 Flash was used for this conversion via its public API due to its advanced multimodal capabilities. The model was prompted to generate a comprehensive Markdown representation that preserved the document’s logical structure (a minimal sketch of the conversion call is shown after the list below). The output included:
Hierarchical headings/footers and body text.
Tables converted into Pandas DataFrames (represented as strings and Python code).
Mathematical formulas and equations extracted in LaTeX format.
Descriptions of graphical elements (e.g., charts, diagrams); for complex graphs, a textual summary of the key trends and data points was also generated.
References.
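The conversion call can be sketched as follows. This example uses the google-genai Python SDK with the API key taken from the environment; the file names and the abridged prompt wording are assumptions rather than the pipeline's exact code.

```python
from PIL import Image
from google import genai        # google-genai SDK; API key read from the environment

client = genai.Client()
page_image = Image.open("page_001.png")

# Abridged instruction; the full conversion prompt is part of the pipeline's prompt files
prompt = (
    "Convert this journal page to Markdown. Preserve heading hierarchy, render tables as "
    "pandas DataFrames, keep equations in LaTeX, and describe figures, including key trends."
)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[prompt, page_image],
)
markdown_page = response.text    # one Markdown fragment per page, concatenated later
```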
2.3. Data Extraction via LLM
The individual Markdown files for each page were concatenated into a single document. This consolidated text was then processed by Qwen3-4B [26], GPT-OSS-120b [27], and Gemini 2.5 Pro (via API) to extract specific data entities and convert them into a structured JSON string. For Qwen3-4B and GPT-OSS-120b, inference was performed using llama.cpp with a Vulkan backend. A detailed prompt (available in prompt.md) defined the extraction logic by specifying the target parameters, their expected data types, and the JSON schema.
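A minimal sketch of the extraction call is shown below, assuming the local models are exposed through llama.cpp's OpenAI-compatible llama-server; the port, model name, and file names are illustrative.

```python
import json
import pathlib
import requests

prompt = pathlib.Path("prompt.md").read_text(encoding="utf-8")     # extraction rules + JSON schema
markdown = pathlib.Path("article.md").read_text(encoding="utf-8")  # concatenated per-page Markdown

# llama-server (llama.cpp) exposes an OpenAI-compatible chat completions endpoint
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-oss-120b",                 # or qwen3-4b, depending on the running server
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": markdown},
        ],
        "temperature": 0.0,                       # deterministic extraction
    },
    timeout=600,
)
# Assumes the prompt forces the model to return raw JSON without code fences
extracted = json.loads(resp.json()["choices"][0]["message"]["content"])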
All models were prompted to calculate the bioconcentration factor (BCF), which is a fundamental quantitative metric in phytoremediation research. The BCF characterizes a plant’s capacity to accumulate heavy metals from soil media relative to ambient concentrations and is formally defined as follows:

$$\mathrm{BCF} = \frac{C_{\mathrm{plant}}}{C_{\mathrm{soil}}}$$

where $C_{\mathrm{plant}}$ represents the concentration of a specific heavy metal in plant tissue, typically reported in mg·kg⁻¹ of dry weight, and $C_{\mathrm{soil}}$ denotes the concentration of the same metal species in the corresponding soil or growth substrate under identical experimental conditions. The extraction prompt was engineered to perform multi-step reasoning: first, it identifies paired measurements of plant tissue and soil metal concentrations from the same experimental setup; second, it verifies unit consistency across measurements; third, it applies the BCF formula to compute the ratio; finally, it preserves traceability to the original source values.
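Once paired, unit-consistent measurements have been identified, the calculation itself is straightforward. The sketch below illustrates the ratio and the traceability requirement with hypothetical values; it is not the pipeline's code.

```python
from dataclasses import dataclass

@dataclass
class BCFRecord:
    """One BCF observation with traceability to the source values."""
    metal: str
    c_plant_mg_per_kg: float   # metal concentration in plant tissue (mg/kg dry weight)
    c_soil_mg_per_kg: float    # metal concentration in soil/substrate (mg/kg), same unit as above
    source: str                # e.g., table and row in the article the values came from

    @property
    def bcf(self) -> float:
        # BCF = C_plant / C_soil; both values must already share the same unit
        if self.c_soil_mg_per_kg <= 0:
            raise ValueError("soil concentration must be positive")
        return self.c_plant_mg_per_kg / self.c_soil_mg_per_kg

# Hypothetical example: Cd at 12.4 mg/kg in shoots vs. 5.0 mg/kg in soil gives BCF = 2.48
record = BCFRecord("Cd", 12.4, 5.0, "hypothetical Table 2, shoots, 50 uM Cd")
print(round(record.bcf, 2))
```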
2.4. Validation of Extraction Accuracy
We assessed the correctness of the extracted JSON data using a two-tiered validation approach on a randomly selected sample of processed articles.
Validation against a dataset: An “AI judge” (GPT-OSS-120b) compared the LLM-generated JSON outputs against a dataset pre-labeled by the authors and calculated standard information retrieval metrics—precision, recall, and F1-score—for each target data field (a minimal metric computation sketch follows this list).
Qualitative validation: An independent LLM (GPT-OSS-120b) was used as an “AI judge” to perform a consistency check. This was performed by generating natural language statements based on the JSON content and checking these statements directly against the source text.
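For illustration, the per-field metrics can be computed as below. Exact set matching stands in here for the judge model's semantic matching, and the example values are hypothetical.

```python
def field_metrics(gold: set[str], predicted: set[str]) -> dict[str, float]:
    """Precision/recall/F1 for one data field; exact matching approximates
    the judge model's semantic matching in this illustration."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical "metals" field for one article: gold vs. extracted
print(field_metrics({"Cd", "Zn", "Pb"}, {"Cd", "Zn", "Cu"}))   # all three metrics ≈ 0.67
```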
2.5. Database Construction
We parsed and ingested validated JSON strings into a MongoDB database. A schema was designed that mapped the nested structure of the JSON data directly to MongoDB documents. This allowed for efficient storage and complex querying.
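A minimal ingestion and query sketch using pymongo is shown below; the database, collection, and nested field names are illustrative of, not identical to, our schema.

```python
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")        # local MongoDB instance (assumed)
collection = client["hyperaccumulation"]["articles"]      # illustrative database/collection names

with open("article_extraction.json", encoding="utf-8") as fh:
    document = json.load(fh)                               # validated, nested JSON from extraction

collection.insert_one(document)

# The nested schema then supports targeted queries, e.g. all Cd records for a given species
cursor = collection.find(
    {"groups.plants.species": "Arabidopsis halleri", "groups.plants.metals.element": "Cd"}
)
```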
2.6. Natural Language Query Interface
The system featured a natural language interface for database interrogation. A user’s free-text query was processed by an LLM, which generated a syntactically correct MongoDB query. The generated query was then executed against the database, and the results were sent back to the LLM to formulate a coherent natural language response for the user. All source code and prompts are available in the GitHub repository https://github.com/KMakrinsky/hyperaccum (accessed on 20 November 2025).
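The query loop can be sketched as follows, assuming the LLM is reachable through an OpenAI-compatible endpoint; the model name, prompts, and field names are illustrative rather than taken from the repository.

```python
import json
from pymongo import MongoClient
from openai import OpenAI    # any OpenAI-compatible endpoint, e.g. a local llama-server

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
collection = MongoClient("mongodb://localhost:27017")["hyperaccumulation"]["articles"]

def answer(question: str) -> str:
    # Step 1: translate the free-text question into a MongoDB find() filter (JSON)
    query = json.loads(
        llm.chat.completions.create(
            model="gpt-oss-120b",
            messages=[{"role": "system", "content": "Return only a MongoDB find() filter as JSON."},
                      {"role": "user", "content": question}],
        ).choices[0].message.content
    )
    # Step 2: execute the query and hand the matching records back to the LLM for a readable answer
    docs = list(collection.find(query).limit(20))
    return llm.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "system", "content": "Answer the question using only these records."},
                  {"role": "user", "content": f"{question}\n\n{json.dumps(docs, default=str)}"}],
    ).choices[0].message.content
```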
2.7. The Use of GenAI for Manuscript Preparation
Text preparation, including translation, styling, and grammar editing, was performed using GenAI models such as Gemini 2.5 Pro and Qwen3-Max. Gemini 2.5 Pro was also used for code reviews and testing.
3. Results and Discussion
In this section, we comprehensively validate our pipeline’s performance across quantitative metrics and qualitatively assess factual accuracy. We demonstrate how our domain-specific prompt engineering strategy enables the accurate extraction of complex experimental relationships. Our innovative “LLM-as-a-Judge” validation layer reveals the critical limitations of standard evaluation metrics. Beyond technical validation, we present practical applications, such as metal-plant co-occurrence network analysis and the replication of a manually curated dataset that required seven months of expert effort. Together, these results validate our hypothesis that contextual understanding is essential for trustworthy scientific knowledge extraction.
3.1. Overall Pipeline
Figure 1 shows our end-to-end process for converting unstructured scientific literature on heavy metal hyperaccumulation into structured knowledge. The system begins with the ingestion of PDF documents and their conversion into high-resolution page images, ensuring the preservation of all visual elements regardless of the complexity of the source formatting.
A key innovation is the multimodal conversion stage, where a multimodal model (Google Gemini 2.5 Flash, in this case) transforms page images into structured Markdown. This process maintains the document’s hierarchy, converts tables to a Pandas-compatible format, preserves mathematical expressions in LaTeX, and generates descriptive summaries of graphical elements. This intermediate representation enables the consistent processing of heterogeneous scientific content prior to extraction.
Knowledge extraction uses specialized LLM prompting with Qwen3-4B, GPT-OSS-120b, and Gemini 2.5 Pro to generate structured JSON outputs containing experimental parameters. The three language models selected for evaluation represent different approaches to scientific text processing. Gemini 2.5 Pro is a state-of-the-art commercial model with advanced multimodal capabilities. GPT-OSS-120b is a large open-source model that strikes a good balance between performance and accessibility. Qwen3-4B illustrates the capabilities of compact models that can be deployed locally with limited computational resources. This selection provides a realistic assessment of the available tools for researchers with different infrastructure constraints and data privacy requirements.
Crucially, our system is built on the foundation of semantic validation for data quality. Rather than relying solely on token-matching metrics, our system uses an “LLM-as-a-Judge” layer to reconstruct natural language statements from extracted JSON subtrees. Then, it verifies the factual correctness of these statements against the source documents. This contextual validation ensures that relationships between entities remain intact. For example, it ensures that specific metal concentrations are linked to the correct plant species under precise experimental conditions. This capability is essential for scientific integrity and cannot be captured by standard precision/recall metrics.
A domain-specific approach was applied through prompt engineering at every processing stage. Instead of using generic extraction templates, we carefully crafted each prompt with hyperaccumulation-specific context and instructions. During the conversion of PDFs to Markdown, the model receives explicit guidance on preserving elements relevant to phytoextraction, such as metal concentration tables and experimental setup descriptions. During extraction, the prompts incorporate specialized knowledge about valid concentration ranges for different metals, unit conversion protocols specific to soil–plant systems, and critical relationships between experimental variables. Even our validation layer incorporates domain expertise about biologically plausible metal accumulation patterns. This end-to-end domain adaptation transforms general-purpose LLMs into specialized scientific assistants that understand the nuanced relationships defining phytoremediation research.
3.2. Quantitative Validation
To evaluate our pipeline’s core, we benchmarked the performance of three large language models on the structured data extraction task: Gemini 2.5 Pro (API-based) and the locally hosted Qwen3-4B and GPT-OSS-120b.
For each processed article, we created 20 sets of paired validation inputs. The first input was a manually curated CSV record, prepared by domain experts according to our annotation schema, which served as the ground truth. The second input was the machine-generated CSV output from each evaluated LLM. We submitted these paired records to an independent validation model, GPT-OSS-120b, configured as an “AI Judge”, which performed a granular, field-by-field comparison using semantic matching rather than strict string equality. Specifically, the model was prompted to normalize terminology variants (e.g., recognizing “cadmium chloride” and “CdCl2” as equivalent to the target entity “Cd”), and to resolve unit conversions and numerical representations (e.g., recognizing “50 μM” as equivalent to “0.05 mM”).
The judge model automatically computed standard information retrieval metrics—precision, recall, and F1 score—for each target field across all validation samples. This semantic validation approach addresses a critical limitation of conventional token-matching evaluation by preventing the misclassification of technically correct, terminologically variant extractions as false negatives by naive string comparison methods.
Table 1 shows the aggregated performance metrics, revealing significant variations in model performance across parameter categories. This underscores the necessity of context-aware validation for scientific knowledge extraction.
Overall, Gemini 2.5 Pro demonstrated the most robust and balanced performance. It achieved an average F1-score of 0.896 across articles, the highest among all models. Qwen3-4B followed with an F1-score of 0.871, and GPT-OSS-120b with 0.854. These results suggest that while all models can perform the extraction task well, larger, proprietary models currently have an advantage in terms of overall accuracy and reliability.
A deeper analysis of specific data fields revealed a clear pattern: all models excel at extracting simple, well-defined entities. For instance, performance was exceptionally high in identifying plant species (plants F1) and heavy metals (metals F1). Gemini 2.5 Pro and Qwen3-4B achieved perfect F1 scores of 1.00 for plant identification, while Gemini achieved a nearly perfect score of 0.975 for metals. This success can be attributed to the fact that these entities are typically discrete nouns or standardized terms (e.g., “Arabidopsis thaliana”, “Cd”, and “Zinc”), which makes their extraction analogous to standard named entity recognition (NER) tasks, in which modern LLMs are known to perform well [28].
In contrast, extracting more complex, descriptive information posed the greatest challenge to all models. The methods and groups fields consistently yielded the lowest F1 scores. This discrepancy underscores a critical point: lower performance reflects not only model limitations but also the inherent complexity and lack of standardization in scientific writing. Unlike discrete entities, experimental methods are often described in a narrative style. The distribution of crucial details across multiple paragraphs, supplementary materials, and figure legends is a well-documented obstacle in automated information extraction and a known contributor to the “reproducibility crisis” in science [29].
Similarly, the concept of an “experimental group” is often ambiguous. Authors may describe treatments without explicitly labeling them as distinct groups. Control conditions are often implied, such as “plants grown under standard conditions”, rather than explicitly defined. This requires the model to make complex inferences and synthesize fragmented information across the text. This task is so challenging that even human annotators tasked with extracting such complex information often achieve only moderate inter-annotator agreement. Studies on creating scientific corpora have consistently shown that agreement scores are highest for simple entities and decrease sharply for complex, relational, or descriptive information, such as experimental protocols [30,31]. Therefore, the models’ struggles in these categories highlight their difficulty in understanding linguistic complexity and the conceptual ambiguity within scientific discourse. This remains a challenging task even for human experts.
Interestingly, the two locally run open-source models exhibited distinct behavioral profiles. GPT-OSS-120b demonstrated a high-precision, low-recall strategy, particularly for identifying experimental groups (group precision of 0.905 versus group recall of 0.860). This suggests a more “cautious” model that extracts data only when highly confident, potentially resulting in the omission of relevant information. Conversely, Qwen3-4B showed higher recall (group recall of 0.876) but lower precision (group precision of 0.773). This indicates a more “eager” approach that successfully captures more relevant data points, albeit with some incorrect ones included. The trade-off between precision and recall is critical when selecting a model for a specific application, depending on whether avoiding false positives or false negatives is more important.
3.3. Qualitative Validation
Although F1 scores confirm that a model can identify the correct words in a text, they do not guarantee that the model understands the critical relationships between those words. For a scientific database, this distinction is important. For instance, a paper may state that Arabidopsis halleri was treated with 50 µM of Cd and Thlaspi rotundifolium with 100 µM of Cd. A model with limited contextual understanding could correctly identify both species and concentrations, which would boost its F1 score. However, it could then create a JSON entry that incorrectly links A. halleri to the 100 µM concentration.
To detect such critical errors, we implemented the “AI judge” validation procedure, which aligns with the “LLM-as-a-Judge” paradigm [32]. A growing body of research comparing its performance against human judgments across diverse tasks supports the credibility and reliability of the LLM-as-a-Judge approach [33]. In information retrieval, the approach has been validated for scalable relevance judgment [34] and for the comprehensive evaluation of Retrieval-Augmented Generation (RAG) systems, assessing parameters such as relevance, accuracy, and coherence [35]. Together, these studies substantiate the LLM-as-a-Judge methodology as a robust, scalable alternative to traditional evaluation metrics. The schema of our validation approach is presented in Figure 2.
In this approach, the validation model (GPT-OSS-120b) was given not only the extracted JSON fragment, but also the full hierarchical context leading to each individual parameter. Specifically, for each extracted field (e.g., metal concentration linked to a plant species), the system reconstructed the complete JSON subtree—from the root document object through experimental groups, plant entries, and metal records—to preserve the semantic relationships encoded during extraction. The judge LLM then used this contextual subtree to generate a factual statement in natural language (e.g., “In the study, Arabidopsis halleri was exposed to 50 µM of cadmium under hydroponic conditions”). Then, the same LLM was prompted to independently verify this statement against the original, full-text Markdown representation of the source article without access to the extracted JSON. Only if the claim could be unambiguously confirmed from the source text was the extraction deemed factually correct. This two-step process—context-aware claim generation and source-grounded verification—ensures that entities are correctly identified and that their relational integrity is preserved. This mitigates hallucinated or misaligned data entries that standard token-level metrics fail to capture. The resulting scores are presented in Table 2.
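The two-step procedure can be sketched as follows; the prompts are abridged, and the judge model is assumed to be served through an OpenAI-compatible endpoint rather than reflecting the pipeline's exact code.

```python
import json
from openai import OpenAI   # local GPT-OSS-120b behind an OpenAI-compatible API (assumed)

judge = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
MODEL = "gpt-oss-120b"

def verify(subtree: dict, article_markdown: str) -> bool:
    """Two-step check: (1) turn the JSON subtree into a factual claim,
    (2) verify that claim against the source text without showing the JSON."""
    claim = judge.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system",
                   "content": "State the extracted record as one factual sentence."},
                  {"role": "user", "content": json.dumps(subtree)}],
    ).choices[0].message.content

    verdict = judge.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system",
                   "content": "Answer YES only if the claim is unambiguously supported by the article."},
                  {"role": "user", "content": f"Claim: {claim}\n\nArticle:\n{article_markdown}"}],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("YES")
```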
The dramatic decrease in Qwen3-4b’s performance—from an F1 score of 0.871 to a qualitative accuracy of 46.7%—underscores its vulnerability to contextual errors. Its high-recall strategy allows it to identify keywords relevant to phytoextraction studies, such as concentrations, chemical compounds, and analytical methods. However, it often fails to correctly assemble the full experimental picture. This failure mode is a form of “hallucination”, which, in this scientific context, manifests as generating non-existent experimental results. This type of error is problematic because it can mislead researchers who query the database, causing them to draw incorrect conclusions about a plant’s tolerance or accumulation capacity under specific conditions.
In contrast, the stronger performance of Gemini 2.5 Pro and GPT-OSS-120b suggests a greater ability for “compositional reasoning”, or the capacity to correctly link specific parameters, such as metal concentration, soil pH, and exposure duration, to the corresponding plant species or experimental group.
This dual-validation approach confirms that a qualitative, fact-checking validation layer is essential, not just beneficial, to build a trustworthy knowledge base in a specialized field such as heavy metal hyperaccumulation. This validation layer ensures that the curated data accurately represents scientific findings and is not just a collection of extracted terms.
3.4. Usage Strategies
Beyond validating the extraction process, our pipeline’s utility lies in its ability to transform unstructured documents into a searchable knowledge base. This enables researchers to synthesize information and identify large-scale trends in the field of hyperaccumulation, rather than analyzing individual papers in isolation. The natural language query interface provides access to this synthesized knowledge.
To demonstrate this capability, we analyzed a sample corpus of fifteen articles and generated a co-occurrence network graph of the metals and plants mentioned in the processed articles (Figure 3). In this graph, nodes represent either a plant species or a heavy metal. An edge between two nodes indicates that the corresponding entities were studied together in at least one paper within our sample corpus.
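Such a graph can be built directly from the database, for example with networkx; the field paths below follow the illustrative schema sketched earlier and are not the exact document layout.

```python
import itertools
import networkx as nx
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["hyperaccumulation"]["articles"]
G = nx.Graph()

for article in collection.find():
    # Illustrative field paths following the assumed groups -> plants -> metals nesting
    plants = {p["species"] for g in article.get("groups", []) for p in g.get("plants", [])}
    metals = {m["element"] for g in article.get("groups", [])
              for p in g.get("plants", []) for m in p.get("metals", [])}
    for plant, metal in itertools.product(plants, metals):
        G.add_edge(plant, metal)    # edge = studied together in at least one paper

# Hub metals emerge as high-degree nodes (e.g., Cd in our sample corpus)
print(sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:5])
```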
The resulting visualization reveals key trends within the sampled literature. Cadmium emerges as a central hub with the highest degree of connectivity, linking to a wide array of plant species, including Sedum alfredii, Arabidopsis halleri, and Oryza sativa. This indicates that Cd toxicity and accumulation are the most extensively studied topics in our dataset, reflecting Cd’s status as a widespread and hazardous environmental pollutant. Zinc is also a significant research focus, sharing connections with key model organisms, such as A. halleri and T. caerulescens, which are known co-accumulators of Cd and Zn.
Additionally, the graph identifies plant species with unique properties. For instance, Sesbania drummondii is connected to multiple metals (Pb, Ni, and Cu), suggesting that it could be the subject of multi-metal phytoremediation research. Conversely, the isolated pair of Tetradesmus obliquus (an alga) and cobalt suggests a niche area of research that is disconnected from the main body of terrestrial plant studies in our corpus. Such visualizations provide researchers entering the field with a map of the research landscape, highlighting dominant themes, key model systems, and potential knowledge gaps.
3.5. Replicating a Human-Curated Dataset via LLM Reasoning
To validate the accuracy of our pipeline against established scientific standards and demonstrate its flexibility, we chose to reproduce the manually curated dataset referenced above [22]. This dataset is a gold standard in phytoextraction research, but it took seven months of expert effort to compile. Thanks to our system’s schema-agnostic architecture, this replication required minimal effort: only the extraction prompts needed modification, and no changes were necessary to the underlying pipeline infrastructure.
Specifically, we created a specialized prompt for Gemini 2.5 Pro, available as prompt_dataset.md in our repository, that precisely mirrored the schema of the target dataset. This prompt instructed the model to extract parameters that matched the exact column structure of the published tables; calculate derived metrics, such as the bioconcentration factor (BCF), on the fly; and consolidate values across similar experimental conditions (e.g., plant components, fertilizer treatments) according to meta-analysis conventions in phytoextraction research. We changed the expected JSON output schema so that it would mirror the flat table format of the target dataset instead of our default hierarchical structure. We provided explicit definitions for each parameter, including its scientific meaning, acceptable value ranges, units of measurement, and data type specifications. This parameter-by-parameter guidance ensured that the extracted data was fully structurally compatible with the original dataset.
This approach allowed for a direct, one-to-one comparison between our automated outputs and the manually curated reference data. For sample articles from the dataset, we averaged BCF values for each plant-metal combination across comparable experimental conditions. This allowed us to quantitatively assess extraction accuracy while adhering to standard practices for consolidating data in environmental meta-analyses. To establish robust performance metrics, we randomly selected ten articles from the original dataset. For this sample, our pipeline achieved an average relative error of 8.21% for BCF values. The remarkable fidelity of this replication, achieved solely through prompt adaptation, demonstrates our pipeline’s capacity to rapidly generate custom datasets matching any required structure. This transforms a process that traditionally required months of manual work into one that can be completed in minutes per article.
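The reported error is the relative error computed per BCF value and averaged across the compared observations; a minimal sketch with hypothetical numbers is shown below.

```python
def mean_relative_error(reference: list[float], extracted: list[float]) -> float:
    """Average relative error (%) between curated and pipeline-extracted BCF values."""
    errors = [abs(e - r) / r for r, e in zip(reference, extracted) if r != 0]
    return 100 * sum(errors) / len(errors)

# Hypothetical curated vs. extracted BCF values for one plant-metal combination
print(round(mean_relative_error([1.20, 0.85, 2.40], [1.25, 0.80, 2.55]), 1))   # ≈ 5.4
```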
Table 3 provides a representative example of results for several selected articles from the original dataset.
When we applied our pipeline to this existing phytoextraction dataset, we found that the LLM replicated the human-curated data with high fidelity. For example, in study [36], our LLM pipeline captured element-specific accumulation patterns across five distinct hemp varieties (Zenit, Diana, Denise, Armanca, and Silvana), with only minor deviations (4–12%) in directly comparable BCF values. Similar deviations were observed in [37], which we attribute to the challenges of processing visually complex tables and figures. In these cases, the multimodal conversion stage occasionally misaligned data points or misinterpreted formatting conventions.
For articles with simpler data presentation formats and fewer experimental points, such as [38], our approach achieved near-perfect replication of BCF values, with negligible relative differences (often 0–1%). This variation in performance correlates with the visual complexity of the source data presentations rather than the underlying scientific content. These results demonstrate that, while our pipeline maintains high accuracy with complex layouts, it achieves optimal performance when source documents present data in clear, well-structured formats.
The most important aspect of this approach, beyond the fidelity of the results, is the noticeable acceleration of the entire data curation process. Manually curating a single data-rich scientific article, as was done for the original dataset, can require several hours of a domain expert’s focused effort, including reading, identifying relevant values, performing calculations, and entering data into a structured format. Our automated pipeline completes the same task (from PDF upload to generation of structured, calculated JSON) in a matter of minutes.
4. Conclusions
Our study confirms the hypothesis that standard extraction metrics do not guarantee factual reliability in scientific knowledge extraction. This finding reveals a gap in contextual accuracy, challenges current evaluation practices in scientific AI, and shows that semantic validation through natural language reasoning is necessary for creating reliable knowledge bases. We demonstrate that, although modern LLMs can achieve high extraction performance on discrete entities, their ability to preserve factual relationships—such as linking a metal concentration to the correct plant species—varies significantly. A dual-validation strategy that integrates standard metrics with a qualitative “LLM-as-a-Judge” layer reveals that high F1 scores do not guarantee contextual accuracy, particularly for narrative-rich scientific content, such as experimental protocols. This underscores the necessity of semantic verification in scientific knowledge extraction.
Crucially, our pipeline is prompt-driven and schema-agnostic. By modifying the instruction prompt, we successfully replicated the exact structure of a manually curated benchmark dataset. This allowed us to achieve comparable accuracy in minutes per article versus hours of expert effort. Since all prompts and code are publicly accessible, researchers can quickly adapt the pipeline to their own data schemas. This enables the on-demand generation of custom, FAIR-compliant datasets. The resulting knowledge base supports trend analysis, gap identification, and evidence synthesis—essential capabilities for data-driven environmental science. As an open, reproducible, and flexible framework, this tool offers a scalable solution for phytoremediation research and any other field requiring structured knowledge extraction from scientific literature.