Toward Auditable Urban Soil Management: A Knowledge Graph and LLM Approach Fusing Environmental and Geochemical Data

Qin, Xi; Tang, Yanlin; Deng, Yirong; Lu, Meiqu; He, Wenqiang; Song, Jinrui; Lin, Keyu; Han, Feng

doi:10.3390/app16083895


by Xi Qin ¹, Yanlin Tang ²,³, Yirong Deng ⁴, Meiqu Lu ²,³, Wenqiang He ²,³, Jinrui Song ²,³, Keyu Lin ² and Feng Han ²,³,⁵,*

¹ Library, Guangxi University for Nationalities, No. 188 Daxue East Road, Nanning 530006, China

² School of Artificial Intelligence, Guangxi University for Nationalities, No. 188 Daxue East Road, Nanning 530006, China

³ Guangxi Key Laboratory of Hybrid Computation and IC Design Analysis, No. 188 Daxue East Road, Nanning 530006, China

⁴ Guangdong Academy of Environmental Science, No. 84 Zhiming Road, Huangpu District, Guangzhou 510045, China

⁵ School of Earth Science and Engineering, Haiqin Building 4, Sun Yat Sen University, Tangjiawan Town, Zhuhai 519080, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 3895; https://doi.org/10.3390/app16083895

Submission received: 3 February 2026 / Revised: 31 March 2026 / Accepted: 12 April 2026 / Published: 17 April 2026

(This article belongs to the Topic Big Data and AI for Geoscience)


Abstract

Urban soil contamination poses persistent risks to redevelopment, public health, and ecological restoration, yet actionable evidence is scattered across site investigation reports, monitoring databases, and regulatory documents. Existing decision-support tools often depend on manual searches and provide limited structured reasoning. This study develops a domain knowledge graph (KG) and a KG-powered question-answering (KBQA) system for urban soil management to organize multi-source evidence and deliver precise, auditable answers to parcel- and pollutant-specific queries. The approach (1) defines an urban soil ontology covering parcels, land uses, pollutants, measurements, pathways, and regulatory thresholds; (2) extracts and links entities and relations from textual and tabular sources; (3) constructs a graph database with provenance; and (4) implements a KBQA pipeline that maps natural-language questions to constrained graph queries and verbalizes results with citations. The resulting system supports source identification, land-use-specific exceedance checks, affected-parcel listing, and remediation reference retrieval. Experiments on a curated QA set and a South China case study show higher answer accuracy and lower latency than text-only baselines, while consistently returning traceable evidence and reducing cross-document lookup effort. Compared to text-only RAG baselines, the KG-powered system achieved a 0.14 improvement in Exact Match scores (e.g., 0.81 vs. 0.58 for Threshold tasks) and maintained a competitive median latency of 0.75 s. The pipeline utilizes a 13B-parameter instruction-tuned LLM. The ontology, schema, benchmark QA sets, and sample queries are publicly released to support transfer to other regions.

Keywords:

knowledge graphs; large language models (LLMs); urban soil management; geochemical data; decision support systems; environmental data integration; auditable AI

1. Introduction

Urban soil contamination poses significant challenges to brownfield redevelopment, public health, and ecological restoration, rendering access to reliable and timely information essential for urban planners and environmental engineers [1]. Distinguished from rural or mining environments, urban soils are subject to sustained and diverse anthropogenic inputs—including industrial activities, construction, traffic emissions, and historical land use—resulting in complex mixtures of contaminants and pronounced spatial heterogeneity, as illustrated in Figure 1. Recent nationwide surveys in China further highlight the urgency of this issue, documenting widespread exceedances of key trace elements such as Cd, Pb, Zn, and Cu across multiple urban areas, with associated risks to human health and local ecosystems [2,3]. These findings underscore a critical practical gap: while decision-makers require precise, parcel-specific answers, for example, determining whether cadmium levels at Parcel A exceed the 2018 residential risk control thresholds (GB36600-2018) [4] based on its historical industrial use, the necessary evidence remains scattered across disparate reports, databases, and regulatory documents.

However, the scientific challenge extends beyond mere monitoring—it lies in effectively integrating multi-source, heterogeneous evidence and performing reasoning that is consistent with regulatory thresholds, measurement units, and data provenance [5,6,7]. While traditional physical and statistical models remain valuable, they are often costly to parameterize, slow to update, and lack transparency when practitioners need to trace conclusions back to their original sources. Consequently, there is increasing interest in digital approaches capable of structuring domain knowledge to support auditable, query-driven analysis. Similar spatial heterogeneities and historical land-use challenges have been extensively documented in urban soil management studies across Europe and the Americas, highlighting the global universality of this issue [8,9]. Achieving this integration requires adherence to semantic interoperability standards in Earth Sciences, such as those established by the Open Geospatial Consortium (OGC) [10].

Current digital tools—ranging from full-text search and GIS overlays to more recent retrieval-augmented generation (RAG) systems—improve document accessibility but generally lack embedded domain semantics. Specifically, these text-only RAG systems fail to perform essential deterministic unit conversions, lack threshold-based logical reasoning, and cannot guarantee source-level provenance, which are crucial for audit and compliance [11,12]. This gap motivates the adoption of knowledge graphs (KGs) and knowledge-based question answering (KBQA), which explicitly encode entities, relations, and constraints while allowing natural-language questions to be translated into structured queries [13,14]. Over the past decade, KBQA has evolved from simple fact-lookup systems to handling compositional questions involving multi-hop reasoning, aggregation, and filtering, demonstrating promise in domains such as clinical decision support and e-commerce [15,16,17,18]. Nevertheless, urban soil management introduces specific requirements—including parcel-level granularity, land-use-dependent regulatory standards, contextual measurement interpretation, and fully auditable provenance—that are not adequately addressed by generic KBQA frameworks.

This study poses three primary research questions: (i) Can a schema-aware KG improve factual retrieval accuracy over text-only RAG in urban soil scenarios? (ii) How does the integration of an ontology affect the system’s ability to perform deterministic threshold logic and unit checks? (iii) Does the KG + KBQA pipeline effectively provide source-level provenance for auditable environmental decision-making? Accordingly, we (i) define an urban soil ontology covering parcels, historical land uses, pollutants, exposure pathways, measurements, and land-use-specific standards; (ii) construct a knowledge graph by extracting and linking entities from heterogeneous textual and tabular sources; and (iii) implement a KBQA pipeline that translates natural-language questions into structured graph queries and returns answers with explicit source-level provenance. Finally, we evaluate the approach on a curated question set, comparing its performance against BM25 and KG-free RAG baselines, as detailed in the Methods section, and demonstrate its practical application in a South China case study.

2. Methods

2.1. KBQA Construction Process

According to previous studies [19,20,21], the key steps in constructing a KBQA system include data collection and preprocessing, entity recognition and linking, relation extraction, schema design, and triple generation.

2.1.1. Data Collection and Preprocessing

This study gathered data from multiple sources, including survey reports from construction sites, environmental impact reports from environmental bureaus, an encyclopedia of chemistry, and official documentation on environmental policies and regulations. These sources were compiled into a corpus for constructing the knowledge graph. The corpus includes 7245 environmental survey reports from construction sites (105,432 pages), together with an encyclopedia of chemistry, official documents on environmental policies and regulations, and statistical data on urban economic and social issues (Table 1).

All collected data underwent preprocessing, which involved cleaning, normalization, and standardization. This process included removing duplicates, resolving inconsistencies in entity names, and converting the data into a uniform format, as shown in Figure 2. For technical preprocessing, PaddleOCR (Baidu, Inc., Beijing, China) was employed to extract text and embedded tables from PDFs. For malformed tables, heuristic rules were applied to align rows by identifying consistent unit headers (e.g., mg/kg) and spatial coordinate patterns.

2.1.2. Entity Recognition and Linking

To extract relevant entities and relationships from unstructured text sources, we employed deep learning models fine-tuned for named entity recognition and relation extraction tasks. These models were chosen for their ability to improve accuracy, automate extraction, handle ambiguity, and enhance flexibility, scalability, and efficiency [22]. Specifically, we utilized the RoBERTa-CRF architecture for NER and a span-based transformer for relation extraction. Models were fine-tuned on a manually annotated corpus of 5000 sentences for 10 epochs with a learning rate of 2 × 10⁻⁵ and a batch size of 16. Following entity recognition, the identified entities were linked to their corresponding entries in external knowledge bases, primarily public datasets, using a combination of rule-based heuristics and neural entity-linking models. Ambiguities in entity names were resolved by incorporating contextual information from the surrounding text or metadata. For instance, synonyms such as “Lead” and “Pb” were deterministically mapped to the canonical CAS number 7439-92-1 via the chemistry encyclopedia, prioritizing official CAS codes to resolve nomenclature conflicts.
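The deterministic synonym-to-CAS mapping described above can be illustrated with a minimal sketch. The alias table, the function name `canonical_cas`, and the inclusion of cadmium are assumptions made for illustration; in the described system the mapping is drawn from the chemistry encyclopedia rather than hard-coded.

```python
import re
from typing import Optional

# Hypothetical alias table (illustration only); the paper derives the
# mapping from a chemistry encyclopedia rather than a literal dict.
CAS_BY_ALIAS = {
    "lead": "7439-92-1",
    "pb": "7439-92-1",
    "cadmium": "7440-43-9",
    "cd": "7440-43-9",
}

# CAS Registry Numbers have the shape <2-7 digits>-<2 digits>-<1 digit>.
CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")


def canonical_cas(mention: str) -> Optional[str]:
    """Map a pollutant mention to a CAS number, or None if unresolved."""
    key = mention.strip().lower()
    if CAS_PATTERN.match(key):
        # An official CAS code takes priority over any name lookup.
        return key
    return CAS_BY_ALIAS.get(key)


print(canonical_cas("Lead"), canonical_cas("Pb"))
```

Returning `None` for unknown mentions, instead of guessing, leaves room for the context-based disambiguation step to handle them.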

The entities extracted included the following:

Organizations (e.g., construction companies and environmental agencies);
Locations (e.g., specific construction sites and affected regions);
Chemicals/pollutants (heavy metals and hazardous materials);
Legislation and policies (e.g., the Clean Water Act and environmental standards).

The text data from each source was processed through the knowledge extraction model, which identified and labeled these entities based on their context within the documents. For example, in the sentence “ABC Corp was fined for exceeding Organized Pollutant emissions at Riverside construction site, violating the Clean Water Act,” the model identified the organization “ABC Corp,” the pollutant mention “Organized Pollutant,” the location “Riverside construction site,” and the law “Clean Water Act.” The entire process of knowledge graph construction and application is illustrated in Figure 3.

2.1.3. Ontology Creation

In knowledge graph construction, schema design is pivotal in ensuring that the knowledge graph is structured, coherent, and capable of providing accurate and meaningful insights. A schema defines an organizational framework that governs how entities, relationships, and their attributes are represented in a knowledge graph. This structured framework is typically defined through an ontology, which acts as a blueprint for classifying data and connecting various elements within the graph. A well-designed schema significantly improves a knowledge graph’s usability, scalability, and performance in tasks such as data retrieval, query answering, and knowledge reasoning.

To structure the knowledge graph, we developed a domain-specific ontology based on a hierarchy of entity types (e.g., person, organization, and event) and relationships (e.g., “is a,” “works for,” and “developed”). This schema was defined using the Web Ontology Language (OWL) to ensure semantic interoperability with existing knowledge bases. The classes and relationships were organized into a taxonomy to guide the classification of entities and relationships during the knowledge graph construction process, as illustrated in Figure 4. A compact excerpt of 10 core classes and relations is provided in the Supplementary Materials. The full schema is available via the repository link in Appendix A (https://github.com/Feng-David/ontology_soil.git, accessed on 10 April 2026).

2.1.4. Triple Generation

Triples for the knowledge graph are generated from the constructed ontology and the official departmental investigation reports on soil pollution.

Using these sources, Resource Description Framework (RDF) triples can be generated as follows:

(Construction Site A produces Chemical Waste X) [Survey report];
(Chemical Waste X is harmful to Aquatic Life) [Environmental reports/encyclopedia];
(Government Regulation Y regulates Waste Disposal) [Policy documentation].

This approach facilitates the integration of diverse data sources into a structured and interconnected format, enabling effective responses to queries regarding the environmental impact of construction. On a held-out manually annotated set of 200 document pages, the extraction pipeline achieved a precision of 88.5% and a recall of 84.2% for core relations.
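The provenance-tagged triples listed above can be represented with a small data structure. The sketch below is an illustration, not the authors' storage format: the `Triple` class and the helper names `objects_of` and `two_hop` are hypothetical, and an in-memory list stands in for the graph database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    source: str  # provenance: the document type the fact came from


# The three example triples from the text, each with its source.
TRIPLES = [
    Triple("Construction Site A", "produces", "Chemical Waste X", "Survey report"),
    Triple("Chemical Waste X", "is harmful to", "Aquatic Life",
           "Environmental reports/encyclopedia"),
    Triple("Government Regulation Y", "regulates", "Waste Disposal",
           "Policy documentation"),
]


def objects_of(subject, predicate, triples=TRIPLES):
    """Return (object, source) pairs for one subject/predicate pattern."""
    return [(t.obj, t.source) for t in triples
            if t.subject == subject and t.predicate == predicate]


def two_hop(start, first_predicate, second_predicate, triples=TRIPLES):
    """Follow two relations in sequence, keeping the source of every hop."""
    results = []
    for middle, source_1 in objects_of(start, first_predicate, triples):
        for end, source_2 in objects_of(middle, second_predicate, triples):
            results.append((end, [source_1, source_2]))
    return results


print(two_hop("Construction Site A", "produces", "is harmful to"))
```

Carrying the source on every triple means a multi-hop answer can cite each document it relied on.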

To ensure rigorous auditability, the system maintains strict data lineage from the physical environment to the digital graph. This traceability pipeline operates in five distinct stages: (1) Physical Sampling: on-site soil core extraction; (2) Laboratory Analysis: geochemical quantification (e.g., ICP-MS); (3) Documentation: generation of the formal PDF environmental survey report; (4) Digital Extraction: NLP-driven entity and relation extraction from the text/tables; and (5) Materialization: instantiation as a connected measurement node within the knowledge graph, permanently linked to its source document.

2.2. Framework of KBQA

This study implements a hybrid LLM–knowledge graph (KG) pipeline that converts natural-language questions into constrained, auditable operations over the domain ontology and graph (Figure 5). An instruction-tuned LLM identifies entities/slots (e.g., parcel, pollutant, and land use) and the intent type (lookup, threshold, list, or two-hop); schema linking then maps candidates to KG nodes and relations using a dual retriever (a lexical retriever over labels/aliases/CAS/standard codes and a dense retriever). The top-k schema hints are used to ground the LLM before planning.

The planner generates a machine-readable Plan JSON composed of calls to a fixed toolbox (e.g., get_latest_measurement, get_standard, compare_threshold, list_exceedances, and get_provenance). Grammar-constrained decoding (JSON Schema; low temperature) restricts outputs to valid ontology terms. An executor translates tool calls into parameterized graph queries (Cypher/SPARQL). The algorithm is outlined below:

Step 1 (Plan): LLM parses intent → selects get_latest_measurement and compare_threshold.
Step 2 (Execute): Executor runs Cypher queries against Neo4j → returns raw values and citations.
Step 3 (Verify): Unit service normalizes bases → Verifier recomputes math. If fail → trigger re-plan; if pass → send structured JSON to verbalizer.
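The three steps above can be sketched as a toy plan, execute, and verify loop. This is a simplified illustration under stated assumptions: in-memory dictionaries stand in for the Neo4j graph, the plan is written by hand rather than emitted by an LLM, and the parcel, values, and citations are made-up numbers, not taken from any report or standard. Only the tool names and the limit of three tool calls come from the text.

```python
ALLOWED_TOOLS = {"get_latest_measurement", "get_standard", "compare_threshold"}
MAX_TOOL_CALLS = 3  # the paper caps tool calls per query at 3

# Toy stand-ins for graph lookups (illustrative values only).
MEASUREMENTS = {("Parcel A", "Cd"): {"value": 25.0, "unit": "mg/kg",
                                     "source": "Report 1, Table 3, p. 12"}}
STANDARDS = {("Cd", "residential"): {"value": 20.0, "unit": "mg/kg",
                                     "source": "Standard document, Table 1"}}


def validate_plan(plan):
    """Reject plans that leave the fixed toolbox or exceed the call budget."""
    if len(plan) > MAX_TOOL_CALLS:
        raise ValueError("too many tool calls")
    for step in plan:
        if step["tool"] not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {step['tool']}")


def execute(plan):
    """Run each tool call in order and collect structured results."""
    validate_plan(plan)
    state = {}
    for step in plan:
        args = step.get("args", {})
        if step["tool"] == "get_latest_measurement":
            state["measurement"] = MEASUREMENTS[(args["parcel"], args["pollutant"])]
        elif step["tool"] == "get_standard":
            state["standard"] = STANDARDS[(args["pollutant"], args["land_use"])]
        elif step["tool"] == "compare_threshold":
            state["exceeds"] = state["measurement"]["value"] > state["standard"]["value"]
    return state


def verify(state):
    """Recompute the comparison independently and check unit agreement."""
    m, s = state["measurement"], state["standard"]
    return m["unit"] == s["unit"] and state["exceeds"] == (m["value"] > s["value"])


plan = [
    {"tool": "get_latest_measurement", "args": {"parcel": "Parcel A", "pollutant": "Cd"}},
    {"tool": "get_standard", "args": {"pollutant": "Cd", "land_use": "residential"}},
    {"tool": "compare_threshold"},
]
result = execute(plan)
print(result["exceeds"], verify(result))
```

In a fuller version, a failed `verify` would trigger re-planning, and the structured `result` (not raw text) would be handed to the verbalizer.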

The LLM receives only structured results and verbalizes a concise answer; every fact carries source-level citations (report/table/page; standard document). When the KG lacks a required fact, the system invokes a bounded text-retrieval fallback over the curated corpus, constrained to the relevant parcel/pollutant; any text-derived claim must include a citation and pass the same numeric/unit checks, and answers must disclose whether evidence is KG- or text-derived. For example, if the KG lacks the moisture content for Parcel X, the fallback retrieves the original PDF passage. The LLM extracts the value and appends an explicit text-derived citation (e.g., “Source: Report X, Page 12 [Text Retrieval]”), isolating it from KG-verified facts.

A lightweight verifier recomputes key numeric checks; failures trigger re-planning. Ambiguity is handled by generating alternative plans and re-ranking them by ontology plausibility, geospatial consistency, and provenance completeness. A calibrated confidence score aggregates planner entropy, verifier success, and evidence source.

LLM configuration: We deployed a 13B-parameter LLaMA-2-based model, instruction-tuned on 15,000 domain-specific QA pairs. The planning temperature was set to 0.2 and the verbalization temperature to 0.0–0.2; the top-k = 10 schema hints were supplied to the planner; a JSON-Schema grammar was used for plan emission; the maximum number of tool calls per query was 3; only short, structured contexts were used; and caching of schema hints and compiled queries was enabled for repeated questions.

2.3. Test and Validation of the KBQA

This study validated the knowledge-graph-powered QA (KG + KBQA) system on a held-out urban soil QA benchmark constructed from the test snapshot and organized into four task types that reflect operational decision needs: lookup, threshold, list, and two-hop/compositional reasoning. The benchmark comprises 400 total queries, uniformly split into 100 queries per task type. The test set was curated from an independent subset of reports strictly segregated from the LLM tuning data. Evaluation followed a fixed runtime profile and reported Exact Match (EM), token-level F1, MRR/Recall@k, median/p90 latency, and provenance completeness. All answers were normalized to canonical units and controlled vocabularies prior to scoring.

2.3.1. Test Task

Lookup: The lookup task requires returning a single canonical fact from the knowledge base (e.g., a pollutant’s land-use-specific threshold, a pollutant CAS number, or a parcel attribute). A natural-language query is mapped to the domain schema and executed as a parameterized graph query to retrieve the target value. Outputs are normalized and accompanied by source-level provenance (e.g., standard document and section), isolating faithful recovery of atomic facts from multi-step reasoning effects.

Threshold (exceedance determination): The threshold task assesses whether a parcel’s most recent valid measurement for a specified pollutant exceeds the applicable land-use-specific standard. The system resolves the land-use context and standard version/date, normalizes measurement units to canonical bases (e.g., mg/kg dry soil), and performs a deterministic comparison. The output includes a Boolean decision together with the measured value (unit, method, and date), the standard value (unit, land use, and version/date), and citations for both the measurement and the standard value. Cases lacking sufficient basis information (e.g., wet-basis values without moisture correction) are explicitly flagged.
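A minimal sketch of the unit normalization and basis check behind the threshold task follows. The conversion table, the `Measurement` class, and the flag name `insufficient_basis` are assumptions for illustration. The wet-to-dry correction assumes moisture is expressed as a mass fraction of wet soil, so that C_dry = C_wet / (1 − w); the paper does not state which moisture convention it uses.

```python
from dataclasses import dataclass
from typing import Optional

# Factors that convert a mass-per-mass unit to mg/kg (assumed unit set).
TO_MG_PER_KG = {"mg/kg": 1.0, "ug/kg": 1e-3, "g/kg": 1e3}


@dataclass
class Measurement:
    value: float
    unit: str
    basis: str  # "dry" or "wet"
    moisture_fraction: Optional[float] = None  # water mass / wet soil mass


def to_dry_mg_per_kg(m: Measurement) -> Optional[float]:
    """Normalize to mg/kg dry soil; return None when that is not possible."""
    if m.unit not in TO_MG_PER_KG:
        return None
    value = m.value * TO_MG_PER_KG[m.unit]
    if m.basis == "dry":
        return value
    if m.moisture_fraction is None or not 0.0 <= m.moisture_fraction < 1.0:
        return None  # wet-basis value without usable moisture data
    return value / (1.0 - m.moisture_fraction)


def check_exceedance(m: Measurement, standard_mg_per_kg: float) -> dict:
    """Deterministic comparison; flags cases with insufficient basis info."""
    dry_value = to_dry_mg_per_kg(m)
    if dry_value is None:
        return {"decision": None, "flag": "insufficient_basis"}
    return {"decision": dry_value > standard_mg_per_kg,
            "value_mg_per_kg": dry_value, "flag": None}


# 400 mg/kg is used purely as an illustrative threshold.
print(check_exceedance(Measurement(300.0, "mg/kg", "wet"), 400.0))
print(check_exceedance(Measurement(300.0, "mg/kg", "wet", 0.4), 400.0))
```

The first call is flagged rather than answered, while the second exceeds the threshold only after moisture correction, which is why the basis matters for the decision.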

List (set retrieval under constraints): The list task returns an unordered set of entities that satisfy structured filters (e.g., “parcels in City X where Pb exceeds 400 mg/kg under residential use within year Y”). The query is translated into schema-aware filters over parcels, pollutants, thresholds, land-use categories, geography, and time windows; the same numeric and unit policies as in the threshold task are applied, and aliases are deduplicated to produce a canonical set. Each returned item carries sufficient provenance to audit inclusion.

Two-hop/compositional reasoning: The two-hop task measures the ability to traverse and aggregate across multiple relations (e.g., identifying historical land uses most associated with benzene exceedances in a city). The system executes schema-valid paths (e.g., Parcel → Land Use; Parcel → Measurement → Pollutant → Standard), aggregates counts or ranks outcomes under land-use and temporal constraints, and reports the resulting categories or entities with representative citations. Only validated exceedances (per threshold rules) contribute to aggregations, ensuring unit-consistent, threshold-aware reasoning rather than surface co-occurrence.

Common policies across tasks include the following: (i) precedence of standards: the question-specified version/date is used when given; otherwise, the most recent applicable version is used; (ii) latest-measurement policy: the latest valid record per parcel–pollutant pair is selected when no time window is specified; (iii) limits of detection: values reported as “<LoD” are handled by statistical substitution (assigned 1/2 LoD) during exceedance evaluations to avoid bias, rather than being treated as missing; and (iv) ambiguity handling: if entity linking remains unresolved after ontology/geospatial constraints are applied, the system abstains or issues a clarification tag; such instances are excluded from EM but count toward Recall@k when a correct candidate appears.
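The first three policies lend themselves to a short sketch. The record layout (dictionaries with `date`, `value`, `censored`, `lod`, and `valid` keys), the version labels, and the function names are hypothetical and chosen only for illustration.

```python
from datetime import date


def pick_standard(versions, requested=None):
    """Policy (i): use the requested version if given, else the most recent.

    `versions` maps a version label to its effective date.
    """
    if requested is not None:
        return requested if requested in versions else None
    return max(versions, key=versions.get)


def latest_valid(records):
    """Policy (ii): latest valid record for one parcel-pollutant pair."""
    valid = [r for r in records if r.get("valid", True)]
    return max(valid, key=lambda r: r["date"]) if valid else None


def numeric_value(record):
    """Policy (iii): replace a censored '<LoD' result by half the LoD."""
    if record.get("censored", False):
        return 0.5 * record["lod"]
    return record["value"]


records = [
    {"date": date(2021, 5, 1), "value": 12.0},
    {"date": date(2023, 3, 9), "censored": True, "lod": 0.2},
    {"date": date(2024, 1, 2), "value": 99.0, "valid": False},
]
versions = {"old": date(1999, 1, 1), "new": date(2018, 8, 1)}

latest = latest_valid(records)
print(latest["date"], numeric_value(latest), pick_standard(versions))
```

Here the most recent record is invalid and is skipped, so the censored 2023 record is selected and evaluated as half its detection limit.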

2.3.2. Indicators for Validation

For the lookup, threshold, list, and two-hop tasks, effectiveness is reported using:

Exact Match (EM): The proportion of questions with an answer string (or Boolean threshold) exactly matching any reference after normalization.

Token-level F1 Scoring (F1): Harmonic mean of precision/recall on token sets for partially correct spans or sets (order-invariant for list). The calculation of the F1 is as follows:

$$\mathrm{Precision} = \frac{|A \cap G|}{|A|}, \quad \mathrm{Recall} = \frac{|A \cap G|}{|G|}, \quad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{1}$$

Here $A$ and $G$ are the normalized token (or element) sets of the predicted and gold answers.

Mean Reciprocal Rank (MRR): The average of $\frac{1}{\mathrm{rank}_{i}}$ for the first correct answer for all queries:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i \in Q} \frac{1}{\mathrm{rank}_{i}}. \tag{2}$$

Recall at k (R@k): The fraction of queries for which at least one correct answer appears in the top $k$ retrieved/candidate items:

$$\mathrm{R@}k = \frac{1}{|Q|} \sum_{i \in Q} \mathbf{1}\{\text{relevant in top } k\}. \tag{3}$$

Median latency (s): p50 end-to-end time per query (planning → graph execution → checks → verbalization).

p90 latency (s): 90th-percentile end-to-end time; reflects tail performance on harder/uncached queries.
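The indicators above can be computed with a few short functions. This is a sketch under stated assumptions rather than the evaluation script used in the study: queries with no correct answer are assumed to contribute 0 to MRR, and percentiles use the nearest-rank method, since the paper does not specify either convention.

```python
import math


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any reference, else 0.0."""
    return float(prediction in references)


def token_f1(predicted_tokens, gold_tokens):
    """Set-based F1 following Equation (1); order-invariant."""
    a, g = set(predicted_tokens), set(gold_tokens)
    overlap = len(a & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranks):
    """Equation (2). `ranks` holds the 1-based rank of the first correct
    answer per query, or None when no correct answer was returned."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def recall_at_k(ranks, k):
    """Equation (3): share of queries with a correct answer in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def latency_percentile(latencies, pct):
    """Nearest-rank percentile (pct=50 for median, pct=90 for p90)."""
    ordered = sorted(latencies)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]


ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(ranks), recall_at_k(ranks, 3))
print(latency_percentile([0.5, 0.7, 0.8, 1.0, 3.0], 90))
```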

2.3.3. Comparison with Baseline

Two commonly used methods for information retrieval and extraction in question-answering systems, BM25 and RAG-without-KG, were selected as baselines against which to compare the performance of KBQA.

BM25 is the canonical sparse lexical retrieval method in IR, long used as a strong, transparent baseline across TREC-style evaluations [13,23]. It ranks passages by term frequency, inverse document frequency, and length normalization with just two tunable parameters ($k_{1}$, $b$). We implemented BM25 using Elasticsearch with parameters $k_{1} = 1.2$ and $b = 0.75$, a Jieba tokenizer for Chinese text, and passage-level indexing. Using BM25 allowed us to benchmark our system against a well-understood, high-precision text-only approach that does not rely on schema, unit normalization, or graph structure, thereby isolating the value added by the KG. It remains competitive on factual lookups and is recommended in modern QA studies as a point of comparison to dense methods. RAG was introduced by [24] and typically relies on dense passage retrieval methods such as DPR to obtain semantically similar passages. The RAG baseline used a dense retriever (BGE-Large-zh, embedding dimension 1024) to retrieve the top 5 passages, which were passed to the same 13B LLM used by our KBQA for generation. While effective for paraphrased questions, RAG-without-KG lacks schema-level semantics and deterministic unit/threshold reasoning.
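To make the roles of $k_{1}$ and $b$ concrete, the following is a textbook-style BM25 scorer over pre-tokenized passages. It is a sketch for illustration, not the Elasticsearch implementation used for the baseline: the function name `bm25_scores` and the toy passages are assumptions, and the IDF uses a common non-negative variant, $\ln\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)$.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, passages, k1=1.2, b=0.75):
    """Score each tokenized passage against the query with Okapi BM25."""
    n_passages = len(passages)
    avg_len = sum(len(p) for p in passages) / n_passages
    # Document frequency: number of passages containing each term.
    doc_freq = Counter()
    for passage in passages:
        doc_freq.update(set(passage))

    scores = []
    for passage in passages:
        term_freq = Counter(passage)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_passages - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            # b controls length normalization; k1 controls tf saturation.
            length_norm = 1 - b + b * len(passage) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy passages, already tokenized (the baseline tokenizes Chinese with Jieba).
passages = [
    ["pb", "exceeds", "threshold"],
    ["cd", "measured", "parcel"],
    ["pb", "pb", "parcel"],
]
print(bm25_scores(["pb"], passages))
```

Because the scorer sees only token overlap, it has no notion of units, land use, or standard versions, which is the limitation the KG is meant to address.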

3. Results

A general view of the KBQA system and its interface is provided in the Supplementary Materials at the end of this article.

3.1. KBQA Performance

This study evaluated the KG + KBQA system against BM25 and RAG-without-KG on the urban soil QA benchmark (lookup, threshold, list, and two-hop tasks). As summarized in Table 2, KG + KBQA consistently outperformed both baselines in EM/F1 and MRR/R@k, with the largest margins on threshold and lookup tasks, where unit normalization and land-use-specific standards determine correctness. To confirm the robustness of these findings, a Wilcoxon signed-rank test was conducted across the 100 threshold queries. The Exact Match improvement of KG + KBQA over the RAG baseline was found to be statistically significant ($p < 0.01$), indicating that the performance gain is unlikely to be due to chance. End-to-end median latency was lower than that of RAG-without-KG due to bounded tool calls and compiled graph queries, while remaining competitive with BM25. Completeness of provenance was highest for KG + KBQA, which returned both measurement and standard citations where applicable. Ablations showed that removing grammar constraints, the numeric verifier, ontology-aware entity linking, or the text fallback each degraded accuracy and/or increased failures. Error analysis highlighted three residual issues: ambiguous parcel aliases, missing standard version/date in legacy texts, and wet-basis measurements lacking moisture metadata (flagged and excluded from exceedance materialization).


To isolate the contributions of our pipeline’s specific components, we conducted an ablation study on the Exact Match (EM) metric (Table 3). As expected, removing the numeric verifier severely impacted the performance on threshold tasks, while removing ontology-aware linking caused the steepest drops in compositional reasoning (two-hop and list tasks). Eliminating grammar constraints led to higher rates of invalid graph query generation, and removing the text fallback reduced our system’s ability to recover from knowledge graph coverage gaps, collectively demonstrating the necessity of each module.

Furthermore, a brief sensitivity analysis on input vocabulary (e.g., swapping chemical names for their CAS numbers or varying phrasing) revealed an EM variance of less than 2%, demonstrating robust semantic linking.

3.2. Knowledge Completion

For link prediction, we trained a lightweight model over relations such as hasHistoricalUse, likelyEmits, and governedBy. On held-out triples, MRR and Hits@{1, 3, 10} improved when multi-source features (co-occurrence across parcels and regulatory co-mentions) were included. Suggested links were surfaced as curation hints with confidence and evidence slices; they were not auto-asserted into the KG. In QA, completions were used only to prioritize candidates for two-hop questions, preserving provenance requirements.

3.3. Automatic KG Construction and Refresh

Incremental ingestion of newly released reports and standards produced an updated snapshot with higher throughput (documents/hour) and stable yield (triples/document) relative to cold start, aided by cached parcel aliases and compiled query templates. Quality gates enforced required fields for measurement triples (value, unit, analyte, parcel, and date); items failing checks were routed to quarantine with explicit failure codes (e.g., unit mismatch and unresolved alias). The append-only, time-stamped snapshotting ensured reproducibility and rollback for audit.
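The quality gate described above can be sketched as a validation function that returns failure codes. The required fields and the two example codes (unit mismatch and unresolved alias) come from the text; the code strings, the accepted unit set, and the function names `gate` and `route` are assumptions for illustration.

```python
REQUIRED_FIELDS = ("value", "unit", "analyte", "parcel", "date")
KNOWN_UNITS = {"mg/kg", "ug/kg"}  # assumed set of accepted units


def gate(record, known_parcels):
    """Return a list of failure codes; an empty list means the record passes."""
    codes = [f"missing_{field}" for field in REQUIRED_FIELDS
             if record.get(field) in (None, "")]
    if record.get("unit") and record["unit"] not in KNOWN_UNITS:
        codes.append("unit_mismatch")
    if record.get("parcel") and record["parcel"] not in known_parcels:
        codes.append("unresolved_alias")
    return codes


def route(records, known_parcels):
    """Split records into accepted items and quarantined (record, codes) pairs."""
    accepted, quarantine = [], []
    for record in records:
        codes = gate(record, known_parcels)
        if codes:
            quarantine.append((record, codes))
        else:
            accepted.append(record)
    return accepted, quarantine


records = [
    {"value": 35.0, "unit": "mg/kg", "analyte": "Pb",
     "parcel": "Parcel A", "date": "2024-05-01"},
    {"value": 1.2, "unit": "ppm", "analyte": "Cd",
     "parcel": "Plot 7", "date": "2024-05-01"},
    {"value": 4.0, "unit": "mg/kg", "analyte": "Cu", "parcel": "Parcel A"},
]
accepted, quarantine = route(records, known_parcels={"Parcel A"})
print(len(accepted), [codes for _, codes in quarantine])
```

Keeping the failure codes alongside each quarantined record gives curators an explicit reason for every rejection.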

3.4. Knowledge Reasoning

For two-hop/compositional queries (e.g., land uses associated with pollutant exceedances), KG + KBQA achieved higher EM/F1 and Recall@k than text-only systems by traversing ontology-valid paths (Parcel → Measurement → Pollutant → Standard; Parcel → Land Use) under unit and standard constraints. For numeric, threshold-aware reasoning, the unit service and Standard Resolver yielded deterministic exceedance decisions accompanied by values, thresholds, and source-level citations; the verifier recomputed comparisons from returned numerics and triggered re-planning on mismatches. In the South China case study, these mechanisms reduced manual cross-document lookups and exposed version discrepancies in cited standards, illustrating decision support with transparent provenance.

While traditional RAG systems obscure their reasoning, the KG + KBQA pipeline guarantees explainability through a traceable query path. For a threshold exceedance query, the step-by-step traversal is: (1) Intent Parsing: Mapping the user query to a specific parcel and pollutant; (2) Node Linking: Traversing the graph to find the latest valid measurement node (applying the 1/2 LoD statistical substitution if censored); (3) Standard Resolution: Querying the ontology for the applicable regulatory standard node based on the parcel’s specific land use; (4) Normative Comparison: Executing the deterministic threshold check; and (5) Output Generation: Returning the Boolean decision alongside the exact source citations for both the measurement and the standard.

4. Discussion

4.1. Contributions and Innovations

This study shows that knowledge graphs (KGs) plus schema-aware AI question answering deliver measurable value for environmental decision-making: they encode explicit entities/relations, constraints, and provenance, enabling threshold-aware, unit-consistent, auditable answers rather than snippets of documents. This aligns with established best practice on KG construction and use for structured, verifiable reasoning [23]. Methodologically, we benchmarked against BM25 [25]—the canonical sparse lexical retriever from the probabilistic relevance framework—and RAG-without-KG, a modern dense-retrieval-plus-generation approach [24], to isolate what the KG contributes beyond text retrieval and generation [26]. Constrained LLM orchestration (grammar-bounded tool calling) paired with graph queries, unit normalization, and versioned-standard resolution delivered auditable outputs—a response to well-documented limits of text-only generation on provenance and faithful, structured reasoning.

4.2. Limitations

Several constraints temper these findings. Source fidelity remains a bottleneck (e.g., legacy reports without standard version/date; wet-basis measurements without moisture data). Entity ambiguity (parcel aliases and heterogeneous naming) can require abstention or curation despite ontology-aware linking. Jurisdictional transferability may need adaptation where regulatory structures differ. We did not use a general-purpose LLM for end-to-end extraction of entities/relations because high-stakes settings demand boundary-accurate spans, strict unit/basis normalization, and citation-complete provenance, areas where unconstrained LLMs still exhibit instability and hallucination. We therefore favor transformer + CRF NER, span-based RE, and ontology-constrained linking to obtain measurable precision/recall and reproducible error modes [27,28]. While we employ an ontology and validate structure, broader conformance checks (e.g., SHACL shapes with OWL semantics) [29,30] should be expanded in future releases. Additionally, soil survey reports often contain sensitive commercial information, so data ownership must be strictly managed by anonymizing specific parcel owners before KG ingestion. Finally, although the system achieves high citation completeness (the frequency with which answers cite original sources rather than relying on generated text; reported as provenance completeness in the evaluation), we strongly emphasize the necessity of human-in-the-loop oversight; AI-generated outputs must support, rather than autonomously dictate, regulatory compliance decisions.

4.3. Prospects

Future work will amplify the value of KG-centric, AI-enabled analytics and explicitly extend beyond a single locality, building on previous studies [31,32]. First, we will conduct multi-site practitioner studies across cities and jurisdictions to assess transfer performance and usability under differing regulatory regimes and reporting templates. Second, we will generalize temporal and geospatial reasoning to accommodate region-specific standard revisions, multilingual documents, and heterogeneous cadastral systems, using explicit ontology modules (e.g., jurisdiction, land-use taxonomy, and unit systems) and versioned policy timelines. Third, we will scale human-in-the-loop curation with active learning and LLM-assisted triage while preserving ontology and provenance guardrails, enabling rapid adaptation to new locales with minimal expert effort. Fourth, we will maintain comparisons to evolving dense-retrieval methods and apply portable KG completion as curator-prioritized hints to accelerate coverage without sacrificing auditability. To support deployment elsewhere, we will provide a transfer toolkit (mapping guides for local standards and land-use codes, multilingual synonym dictionaries, and city-level alias gazetteers) and report domain-shift diagnostics (performance deltas by jurisdiction, language, and template family).

For global transferability, the ontology’s Standard Resolver module is designed to be highly adaptable. For example, mapping Mexico’s Official Mexican Standards (NOMs) for hydrocarbons to this framework requires only that the localized threshold parameters and land-use context tags be updated within the resolver; the foundational graph schema does not need to change.

Furthermore, to rigorously manage the inherent spatial heterogeneity and measurement uncertainty of urban soils, future iterations of the KG will integrate geostatistical methods (such as Kriging interpolation). This will allow the system to model spatial variance and estimate contamination probabilities for unsampled locations.

Collectively, these steps position the proposed KG + LLM framework as a generalizable, regulator-ready platform for environmental governance that strengthens risk assessment, compliance, and remediation planning across regions, while retaining transparent, citable, threshold-aware reasoning.
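One way to picture a jurisdiction-parameterized, version-aware resolver is sketched below. This is a hypothetical illustration and not the released Standard Resolver: the record layout, the function name `resolve_standard`, the jurisdiction codes, the standard identifiers, and every threshold value are placeholders. The point is that adding a jurisdiction means adding rows, while the lookup logic stays unchanged.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class StandardEntry:
    jurisdiction: str
    standard_id: str          # placeholder identifier, not a real standard number
    effective_from: date
    pollutant: str
    land_use: str
    threshold_mg_per_kg: float


def resolve_standard(
    entries: Iterable[StandardEntry],
    jurisdiction: str,
    pollutant: str,
    land_use: str,
    sampled_on: date,
) -> Optional[StandardEntry]:
    """Return the latest entry already in force on the sampling date, or None."""
    candidates = [
        e for e in entries
        if e.jurisdiction == jurisdiction
        and e.pollutant == pollutant
        and e.land_use == land_use
        and e.effective_from <= sampled_on
    ]
    return max(candidates, key=lambda e: e.effective_from, default=None)


# Placeholder rows: two versions for one jurisdiction, one row for another.
ENTRIES = [
    StandardEntry("J1", "STD-A-v1", date(2008, 1, 1), "Benzene", "residential", 2.0),
    StandardEntry("J1", "STD-A-v2", date(2018, 8, 1), "Benzene", "residential", 1.0),
    StandardEntry("J2", "STD-B-v1", date(2012, 1, 1), "Benzene", "residential", 6.0),
]

old = resolve_standard(ENTRIES, "J1", "Benzene", "residential", date(2015, 5, 1))
new = resolve_standard(ENTRIES, "J1", "Benzene", "residential", date(2020, 5, 1))
other = resolve_standard(ENTRIES, "J2", "Benzene", "residential", date(2020, 5, 1))
none = resolve_standard(ENTRIES, "J1", "Benzene", "residential", date(2000, 1, 1))
```

Returning `None` when no version was in force lets the caller abstain instead of applying a standard retroactively.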

5. Conclusions

This study demonstrates that combining a rigorously curated domain knowledge graph with schema-aware AI question answering delivers materially better support for urban soil environmental decisions than text-only retrieval. On a benchmark spanning lookup, threshold, list, and two-hop queries and in a South China case study, the KG + KBQA system produced more accurate answers, lower or competitive latency, and source-level provenance (measurement and standard citations), thereby achieving the stated aim. Crucially, the pipeline’s constrained LLM orchestration—tool-based planning, unit and threshold normalization, version-aware standard resolution, and numeric verification—enabled deterministic, auditable reasoning that black-box generation and traditional searches do not provide.

More broadly, the results affirm the value of knowledge graphs and AI for environmental management. KGs transform heterogeneous reports, measurements, standards, and land-use context into structured, interoperable, and queryable knowledge, while AI provides the natural-language interface and planning needed to operationalize knowledge at decision time. Together, they (i) increase reliability through explicit semantics, constraints, and provenance; (ii) improve efficiency by reducing manual cross-document effort; and (iii) enhance accountability by making every answer traceable to its sources. These capabilities are central to risk assessment, regulatory compliance, and remediation planning. This architecture is generalizable beyond urban soil to groundwater, sediments, and air quality, where threshold-aware, unit-consistent, and versioned-standard reasoning is equally critical.

While utility still depends on source fidelity and some linking ambiguities remain, the evidence here shows that KG-centered, AI-enabled systems constitute a substantive advance over document-centric workflows and unconstrained LLMs for environmental decision support. The released artifacts (ontology/schema, QA set, and example graph snapshot) are intended to catalyze adoption and independent evaluation. Future work will scale practitioner studies, extend temporal and geospatial reasoning under explicit constraints, and refine human-in-the-loop curation—advancing toward regulator-ready, transparent AI that strengthens environmental governance and sustainable urban development.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app16083895/s1. General view of the KBQA.

Author Contributions

Conceptualization, F.H.; methodology, F.H.; software, F.H.; validation, F.H., Y.D., J.S. and K.L.; formal analysis, F.H., Y.D., J.S. and K.L.; investigation, F.H.; resources, F.H.; data curation, Y.D., J.S. and K.L.; writing—original draft preparation, F.H. and M.L.; writing—review and editing, X.Q., W.H. and Y.T.; visualization, F.H.; supervision, F.H.; project administration, F.H.; funding acquisition, F.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China [Grant No. 2022YFF0800101] (Funder: MOST, China; Duration: January 2023–December 2027) and Guangxi University for Nationalities Start-up Program for Introduced Talents [Grant No. 2023KJQD34] (Funder: Guangxi University for Nationalities, China; Duration: July 2023–June 2026).

Informed Consent Statement

Informed consent was obtained from all the subjects involved in the study.

Data Availability Statement

The data used in this study are not publicly available due to data governance and usage restrictions but are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BM25	Best Matching 25
CAS	Chemical Abstracts Service
CFD	Computational Fluid Dynamics
CRF NER	Conditional Random Field-Based Named Entity Recognition
ID	Identifier
JSON	JavaScript Object Notation
KB	Knowledge Base
KBQA	Knowledge Base Question Answering
LLM	Large Language Model
MRR	Mean Reciprocal Rank
OWL	Web Ontology Language
QA	Question Answering
RAG	Retrieval-Augmented Generation
RE	Relation Extraction
SHACL	Shapes Constraint Language
SPARQL	SPARQL Protocol and RDF Query Language

Appendix A

The main models and code used in this study can be accessed at the links below:

City Soil Pollution Ontology: https://github.com/Feng-David/ontology_soil.git (accessed on 10 April 2026).
Knowledge Extraction Model: https://github.com/Feng-David/knowledge_extraction.git (accessed on 10 April 2026). Only the model code is provided because the trained model checkpoint is larger than 1.6 GB, which exceeds the repository storage limit.
Text Preprocessing Module: https://github.com/Feng-David/wordcut4soilReport.git (accessed on 10 April 2026). The original data cannot be made public owing to data provider requirements; only the model code is provided.
Triplet Conversion and Import Module: https://github.com/Feng-David/Tranform-the-information-of-sites-into-neo4j.git (accessed on 10 April 2026).
Knowledge Reasoning Model: https://github.com/Feng-David/Link_prediction_for_KG.git (accessed on 10 April 2026).
Knowledge Graph Visualization Platform: https://github.com/Feng-David/soilKG.git (accessed on 10 April 2026).

References

1. Deng, C.; Zeng, G.; Cai, Z.; Xiao, X. A survey of knowledge based question answering with deep learning. J. Artif. Intell. 2020, 2, 157.
2. Yang, H.; Huang, X.; Thompson, J.R.; Flower, R.J. Soil pollution: Urban brownfields. Science 2014, 344, 691–692.
3. Pan, L.; Wang, Y.; Ma, J.; Hu, Y.; Su, B.; Fang, G.; Wang, L.; Xiang, B. A review of heavy metal pollution levels and health risk assessment of urban soils in Chinese cities. Environ. Sci. Pollut. Res. 2018, 25, 1055–1069.
4. GB 36600-2018; Soil Environmental Quality–Risk Control Standard for Soil Contamination of Development Land. China National Standardization Administration: Beijing, China, 2018.
5. Huang, Y.; Zhang, X.; Li, Z. Analysis of nationwide soil pesticide pollution: Insights from China. Environ. Res. 2024, 252, 118988.
6. Konstantinova, E.; Minkina, T.; Nevidomskaya, D.; Lychagin, M.; Bezberdaya, L.; Burachevskaya, M.; Rajput, V.D.; Zamulina, I.; Bauer, T.; Mandzhieva, S. Potentially toxic elements in urban soils of the coastal city of the Sea of Azov: Levels, sources, pollution and risk assessment. Environ. Res. 2024, 252, 119080.
7. Zhou, Y.; Zhang, L.; Zhang, A.; Wang, J. Earth Science Big Data Mining and Machine Learning; Sun Yat-sen University Press: Guangzhou, China, 2018; p. 269. (In Chinese)
8. Balseiro-Romero, M.; Baveye, P.C. Book Review: Soil Pollution: A Hidden Danger Beneath our Feet. Front. Environ. Sci. 2018, 6, 130.
9. Okpara, U.T.; Fleskens, L.; Stringer, L.C.; Hessel, R.; Bachmann, F.; Daliakopoulos, I.; Berglund, K.; Blanco Velazquez, F.J.; Ferro, N.D.; Keizer, J.; et al. Helping stakeholders select and apply appraisal tools to mitigate soil threats: Researchers’ experiences from across Europe. J. Environ. Manag. 2020, 257, 110005.
10. van Rees, E. Open geospatial consortium (OGC). Geoinformatics 2013, 16, 28.
11. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997.
12. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv 2024, arXiv:2404.16130.
13. Lukovnikov, D.; Fischer, A.; Lehmann, J.; Auer, S. Neural network-based question answering over knowledge graphs on word and character level. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 1211–1220.
14. Lan, Y.; He, G.; Jiang, J.; Jiang, J.; Zhao, W.X.; Wen, J.-R. Complex Knowledge Base Question Answering: A Survey. IEEE Trans. Knowl. Data Eng. 2023, 35, 11196–11215.
15. Huang, J. Research and Applications Analysis of Knowledge Base Question Answering. Highlights Sci. Eng. Technol. 2022, 16, 16–22.
16. Jin, H.; Luo, Y.; Gao, C.; Tang, X.; Yuan, P. ComQA: Question Answering Over Knowledge Base via Semantic Matching. IEEE Access 2019, 7, 75235–75246.
17. Fu, B.; Qiu, Y.; Tang, C.; Li, Y.; Yu, H.; Sun, J. A Survey on Complex Question Answering over Knowledge Base: Recent Advances and Challenges. arXiv 2020, arXiv:2007.13069.
18. Wang, H.; Zhou, Y.; Xu, Y.; Wang, W.; Cao, W.; Liu, Y.; He, J.; Lu, K. IoT Monitoring and Visualization of Urban Soil Pollution Based on Microservice Architecture. Earth Sci. Front. 2024, 31, 165–174.
19. Cui, W.; Xiao, Y.; Wang, H.; Song, Y.; Hwang, S.-W.; Wang, W. KBQA: Learning Question Answering over QA Corpora and Knowledge Bases. arXiv 2019, arXiv:1903.02419.
20. Wang, Y.; Lipka, N.; Rossi, R.A.; Siu, A.; Zhang, R.; Derr, T. Knowledge graph prompting for multi-document question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 19206–19214.
21. Chen, G.; Xiahou, X.; Li, J.; Chen, L.; Zou, Y.; Zhou, S. Knowledge graph-driven question answering for prefabricated building quality management through natural language processing and transfer learning. Eng. Constr. Archit. Manag. 2025, 2, 92–98.
22. Li, J.; Sun, A.; Han, J.; Li, C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2022, 34, 50–70.
23. Hogan, A.; Blomqvist, E.; Cochez, M.; d’Amato, C.; Melo, G.D.; Gutierrez, C.; Kirrane, S.; Gayo, J.E.L.; Navigli, R.; Neumaier, S. Knowledge graphs. ACM Comput. Surv. 2021, 54, 1–37.
24. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Virtual/Online, 6–12 December 2020; pp. 9459–9474.
25. Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond; Now Publishers Inc.: Hanover, MA, USA, 2009; Volume 4.
26. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.-T. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6769–6781.
27. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2023, 43, 1–55.
28. Zhou, Y.; Zuo, R. (Eds.) Application of Big Data Mining, Machine Learning and Artificial Intelligence in Ore Deposits; MDPI: Basel, Switzerland, 2025; p. 222.
29. Ke, J.; Zacouris, Z.; Acosta, M. Efficient validation of SHACL shapes with reasoning. Proc. VLDB Endow. 2024, 17, 3589–3601.
30. Cortés, C.; Ehrlinger, L.; Etcheverry, L.; Naumann, F. Is SHACL Suitable for Data Quality Assessment? arXiv 2025, arXiv:2507.22305.
31. Zhang, Q.; Zhou, Y.; Yu, P.; Wang, H.; Han, F.; He, J. Ontology construction of multi-level ore deposit and its application in knowledge graph. Bull. Mineral. Petrol. Geochem. 2024, 43, 211–217.
32. Zhou, Y.; Xiao, F. Overview: A glimpse of the latest advances in artificial intelligence and big data geoscience research. Earth Sci. Front. 2024, 31, 1–6.

Figure 1. Schematic diagram of urban soil pollution hazards.

Figure 2. Process of corpus preprocessing.

Figure 3. Knowledge graph construction and application framework.

Figure 4. Urban soil pollution ontology: The schema links core classes, including parcel, pollutant, land use, and measurement. Example of triple generation: (Parcel_A, has_Measurement, Measurement_1) → (Measurement_1, of_Pollutant, Cadmium).
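
The triple pattern in the Figure 4 caption can be sketched in a few lines of Python. This is an illustrative in-memory version under stated assumptions, not the released import module: the relation names `has_Measurement` and `of_Pollutant` come from the caption, while the function names and the second measurement record are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def record_to_triples(parcel_id: str, measurement_id: str, pollutant: str) -> List[Triple]:
    """Emit the two triples that link a parcel to a pollutant via a measurement."""
    return [
        (parcel_id, "has_Measurement", measurement_id),
        (measurement_id, "of_Pollutant", pollutant),
    ]


def pollutants_for_parcel(triples: List[Triple], parcel_id: str) -> Set[str]:
    """Two-hop traversal: parcel -> measurement -> pollutant."""
    measurements = {
        o for s, p, o in triples if s == parcel_id and p == "has_Measurement"
    }
    return {
        o for s, p, o in triples if s in measurements and p == "of_Pollutant"
    }


# First record follows the caption; the second is a made-up extra row.
triples = record_to_triples("Parcel_A", "Measurement_1", "Cadmium")
triples += record_to_triples("Parcel_A", "Measurement_2", "Lead")
found = pollutants_for_parcel(triples, "Parcel_A")
```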

Figure 5. System architecture. Labels indicate data flow between the LLaMA-2 planner, Neo4j graph database, and the numeric verifier modules.

Table 1. Overview of datasets used in this study.

Dataset Family	Source/Provider	Raw Size	Modality	Key Fields	Primary Use
Construction-site environmental survey reports	Local environmental agencies; project owners (approved disclosures)	7000+ reports (PDF/DOCX); 100,000 pages total	Text + embedded tables	Site ID/name, coordinates, historical land use, sampling plan, analytes, measurement values, units, methods (e.g., ICP-MS), LoD, dates	Core source for parcel, measurement, pollutant, land-use entities and relations
Encyclopedia of chemistry	ChemSRC (https://www.chemsrc.com)	over 50,000 entries (pollutants/chemicals)	Structured pages/CSV	Synonyms, CAS, molar mass, volatility, persistence, toxicity notes	Pollutant properties; synonym expansion for entity linking
Policies and soil standards	National GB/DB standards; municipal guidelines	over 100 documents	PDFs/machine-readable tables	Land-use-specific thresholds (value, unit), analytical method, applicability notes	Standard entities; exceeds Standard logic
Urban socioeconomic statistics	Statistical yearbooks; open data portals	over 100 tables	CSV/XLSX	Population density, industrial composition, land supply	Covariates for exploratory analyses; not used in QA scoring

Table 2. KBQA effectiveness and latency by task type.

Task (n)	System	EM	F1	MRR	R@5	Median Latency (s)	p90 Latency (s)	Citation Completeness (%)
Lookup (n = 100)	KG + KBQA	0.84	0.90	0.92	0.98	0.65	1.20	99
	BM25	0.62	0.71	0.75	0.90	0.45	0.90	63
	RAG (no KG)	0.70	0.78	0.80	0.93	1.10	2.20	81
Threshold (n = 100)	KG + KBQA	0.81	0.88	0.90	0.96	0.75	1.40	100
	BM25	0.45	0.56	0.60	0.78	0.50	1.00	57
	RAG (no KG)	0.58	0.66	0.70	0.85	1.30	2.50	74
List (n = 100)	KG + KBQA	0.68	0.80	0.82	0.90	0.95	1.80	98
	BM25	0.38	0.55	0.60	0.74	0.55	1.20	52
	RAG (no KG)	0.52	0.66	0.71	0.86	1.50	2.90	77
Two-hop (n = 100)	KG + KBQA	0.64	0.76	0.79	0.88	1.05	2.00	97
	BM25	0.30	0.46	0.50	0.65	0.60	1.30	58
	RAG (no KG)	0.48	0.60	0.66	0.80	1.70	3.20	73
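
For readers unfamiliar with the column headers in Table 2, the sketch below shows common textbook definitions of EM, token-level F1, reciprocal rank (averaged over questions to give MRR), and recall at k. These definitions and the normalization rule are assumptions; the paper's exact scoring rules may differ.

```python
from collections import Counter
from typing import Sequence, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison (assumed rule)."""
    return " ".join(text.lower().split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall with multiset overlap."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first correct item; 0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """Fraction of gold items that appear in the top k results."""
    if not gold:
        return 0.0
    return len(set(ranked[:k]) & gold) / len(gold)
```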

Table 3. Ablation study on Exact Match (EM) performance.

Configuration	Lookup (EM)	Threshold (EM)	List (EM)	Two-Hop (EM)
Full System (KG + KBQA)	0.84	0.81	0.68	0.64
w/o Numeric Verifier	0.84	0.65	0.66	0.62
w/o Grammar Constraints	0.78	0.75	0.59	0.55
w/o Ontology-Aware Linking	0.67	0.68	0.51	0.46
w/o Text Fallback	0.79	0.76	0.65	0.61
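
The ablation is easier to read as drops relative to the full system. The short sketch below computes them from the Table 3 values, assuming that each row under the full system reports EM with the named component removed; the helper names are hypothetical.

```python
from typing import Dict

Scores = Dict[str, float]

# EM values copied from Table 3.
FULL: Scores = {"Lookup": 0.84, "Threshold": 0.81, "List": 0.68, "Two-Hop": 0.64}
ABLATED: Dict[str, Scores] = {
    "Numeric Verifier": {"Lookup": 0.84, "Threshold": 0.65, "List": 0.66, "Two-Hop": 0.62},
    "Grammar Constraints": {"Lookup": 0.78, "Threshold": 0.75, "List": 0.59, "Two-Hop": 0.55},
    "Ontology-Aware Linking": {"Lookup": 0.67, "Threshold": 0.68, "List": 0.51, "Two-Hop": 0.46},
    "Text Fallback": {"Lookup": 0.79, "Threshold": 0.76, "List": 0.65, "Two-Hop": 0.61},
}


def em_drops(full: Scores, ablated: Dict[str, Scores]) -> Dict[str, Scores]:
    """EM lost per task when each component is removed (rounded to 2 places)."""
    return {
        name: {task: round(full[task] - score, 2) for task, score in scores.items()}
        for name, scores in ablated.items()
    }


def largest_drop(drops: Dict[str, Scores], task: str) -> str:
    """Name of the component whose removal costs the most EM on a task."""
    return max(drops, key=lambda name: drops[name][task])


drops = em_drops(FULL, ABLATED)
```

Under that reading, removing the numeric verifier costs 0.16 EM on Threshold questions and nothing on Lookup, while removing ontology-aware linking is the most damaging change on Lookup, List, and Two-Hop questions.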

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
