1. Introduction
Urban soil contamination poses significant challenges to brownfield redevelopment, public health, and ecological restoration, rendering access to reliable and timely information essential for urban planners and environmental engineers [
1]. Distinguished from rural or mining environments, urban soils are subject to sustained and diverse anthropogenic inputs—including industrial activities, construction, traffic emissions, and historical land use—resulting in complex mixtures of contaminants and pronounced spatial heterogeneity, as illustrated in
Figure 1. Recent nationwide surveys in China further highlight the urgency of this issue, documenting widespread exceedances of key trace elements such as Cd, Pb, Zn, and Cu across multiple urban areas, with associated risks to human health and local ecosystems [
2,
3]. These findings underscore a critical practical gap: while decision-makers require precise, parcel-specific answers, for example, determining whether cadmium levels at Parcel A exceed the 2018 residential risk control thresholds (GB36600-2018) [
4] based on its historical industrial use, the necessary evidence remains scattered across disparate reports, databases, and regulatory documents.
However, the scientific challenge extends beyond mere monitoring—it lies in effectively integrating multi-source, heterogeneous evidence and performing reasoning that is consistent with regulatory thresholds, measurement units, and data provenance [
5,
6,
7]. While traditional physical and statistical models remain valuable, they are often costly to parameterize, slow to update, and lack transparency when practitioners need to trace conclusions back to their original sources. Consequently, there is increasing interest in digital approaches capable of structuring domain knowledge to support auditable, query-driven analysis. Similar spatial heterogeneities and historical land-use challenges have been extensively documented in urban soil management studies across Europe and the Americas, highlighting the global universality of this issue [
8,
9]. Achieving this integration requires adherence to semantic interoperability standards in Earth Sciences, such as those established by the Open Geospatial Consortium (OGC) [
10].
Current digital tools—ranging from full-text search and GIS overlays to more recent retrieval-augmented generation (RAG) systems—improve document accessibility but generally lack embedded domain semantics. Specifically, these text-only RAG systems fail to perform essential deterministic unit conversions, lack threshold-based logical reasoning, and cannot guarantee source-level provenance, which are crucial for audit and compliance [
11,
12]. This gap motivates the adoption of knowledge graphs (KGs) and knowledge-based question answering (KBQA), which explicitly encode entities, relations, and constraints while allowing natural-language questions to be translated into structured queries [
13,
14]. Over the past decade, KBQA has evolved from simple fact-lookup systems to handling compositional questions involving multi-hop reasoning, aggregation, and filtering, demonstrating promise in domains such as clinical decision support and e-commerce [
15,
16,
17,
18]. Nevertheless, urban soil management introduces specific requirements—including parcel-level granularity, land-use-dependent regulatory standards, contextual measurement interpretation, and fully auditable provenance—that are not adequately addressed by generic KBQA frameworks.
This study poses three primary research questions: (i) Can a schema-aware KG improve factual retrieval accuracy over text-only RAG in urban soil scenarios? (ii) How does the integration of an ontology affect the system’s ability to perform deterministic threshold logic and unit checks? (iii) Does the KG + KBQA pipeline effectively provide source-level provenance for auditable environmental decision-making? Accordingly, we (i) define an urban soil ontology covering parcels, historical land uses, pollutants, exposure pathways, measurements, and land-use-specific standards; (ii) construct a knowledge graph by extracting and linking entities from heterogeneous textual and tabular sources; and (iii) implement a KBQA pipeline that translates natural-language questions into structured graph queries and returns answers with explicit source-level provenance. Finally, we evaluate our approach using a curated question set, comparing its performance against BM25 and KG-free RAG baselines, as detailed in the Methods section, and demonstrate its practical application in a South China case study.
2. Methods
2.1. KBQA Construction Process
According to previous studies [
19,
20,
21], the key steps in constructing a KBQA system include data collection and preprocessing, entity recognition and linking, relation extraction, schema design, and triple generation.
2.1.1. Data Collection and Preprocessing
This study gathered data from multiple sources, including survey reports from construction sites, environmental impact reports from environmental bureaus, an encyclopedia of chemistry, and official documentation on environmental policies and regulations. The resulting corpus, used to construct the knowledge graph, comprises 7245 environmental survey reports (105,432 pages) from construction sites, together with an encyclopedia of chemistry, official documents on environmental policies and regulations, and statistical data on urban economic and social issues (
Table 1).
All collected data underwent preprocessing, which involved cleaning, normalization, and standardization. This process included removing duplicates, resolving inconsistencies in entity names, and converting the data into a uniform format, as shown in
Figure 2. For technical preprocessing, PaddleOCR (Baidu, Inc., Beijing, China) was employed to extract text and embedded tables from PDFs. For malformed tables, heuristic rules were applied to align rows by identifying consistent unit headers (e.g., mg/kg) and spatial coordinate patterns.
2.1.2. Entity Recognition and Linking
To extract relevant entities and relationships from unstructured text sources, we employed deep learning models fine-tuned for named entity recognition and relation extraction tasks. These models were chosen for their ability to improve accuracy, automate extraction, handle ambiguity, and enhance flexibility, scalability, and efficiency [
22]. Specifically, we used a RoBERTa-CRF architecture for NER and a span-based transformer for relation extraction. Models were fine-tuned on a manually annotated corpus of 5000 sentences for 10 epochs with a learning rate of $2 \times 10^{-5}$ and a batch size of 16. Following entity recognition, the identified entities were linked to their corresponding entries in external knowledge bases, primarily public datasets, using a combination of rule-based heuristics and neural entity-linking models. Ambiguities in entity names were resolved by incorporating contextual information from the surrounding text or metadata. For instance, synonyms such as “Lead” and “Pb” were deterministically mapped to the canonical CAS number 7439-92-1 via the chemistry encyclopedia, prioritizing official CAS codes to resolve nomenclature conflicts.
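The deterministic synonym-to-CAS mapping described above can be illustrated with a minimal sketch. The alias table, the function names, and the normalization rule are assumptions made for illustration; the actual pipeline derives its aliases from the chemistry encyclopedia and combines them with neural entity linking.

```python
import re
from typing import Optional

# Hypothetical alias table; in the described pipeline these entries would
# come from the chemistry encyclopedia rather than being hard-coded.
CAS_BY_ALIAS = {
    "lead": "7439-92-1",
    "pb": "7439-92-1",
    "cadmium": "7440-43-9",
    "cd": "7440-43-9",
}

# CAS Registry Numbers have the shape <2-7 digits>-<2 digits>-<1 digit>.
CAS_PATTERN = re.compile(r"^\d{2,7}-\d{2}-\d$")


def normalize_alias(mention: str) -> str:
    """Collapse whitespace and case so that 'Pb', ' pb ' and 'PB' coincide."""
    return re.sub(r"\s+", " ", mention).strip().casefold()


def link_to_cas(mention: str) -> Optional[str]:
    """Return the canonical CAS number for a mention, or None if unknown."""
    key = normalize_alias(mention)
    if CAS_PATTERN.match(key):
        # An explicit CAS code takes priority over any name lookup.
        return key
    return CAS_BY_ALIAS.get(key)
```

Returning `None` for unknown mentions, rather than guessing, leaves unresolved names available for the context-based disambiguation step.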
The entities extracted included the following:
Organizations (e.g., construction companies and environmental agencies);
Locations (e.g., specific construction sites and affected regions);
Chemicals/pollutants (heavy metals and hazardous materials);
Legislation and policies (e.g., the Clean Water Act and environmental standards).
The text data from each source was processed through the knowledge extraction model, which identified and labeled these entities based on their context within the documents. For example, in the sentence “ABC Corp was fined for exceeding heavy metal emission limits at Riverside construction site, violating the Clean Water Act,” the model identified the organization “ABC Corp,” the pollutant “heavy metals,” the location “Riverside construction site,” and the law “Clean Water Act.” The entire process of knowledge graph construction and application is illustrated in
Figure 3.
2.1.3. Ontology Creation
In knowledge graph construction, schema design is pivotal in ensuring that the knowledge graph is structured, coherent, and capable of providing accurate and meaningful insights. A schema defines an organizational framework that governs how entities, relationships, and their attributes are represented in a knowledge graph. This structured framework is typically defined through an ontology, which acts as a blueprint for classifying data and connecting various elements within the graph. A well-designed schema significantly improves a knowledge graph’s usability, scalability, and performance in tasks such as data retrieval, query answering, and knowledge reasoning.
To structure the knowledge graph, we developed a domain-specific ontology based on a hierarchy of entity types (e.g., person, organization, and event) and relationships (e.g., “is a,” “works for,” and “developed”). This schema was defined using the Web Ontology Language (OWL) to ensure semantic interoperability with existing knowledge bases. The classes and relationships were organized into a taxonomy to guide the classification of entities and relationships during the knowledge graph construction process, as illustrated in
Figure 4. A compact excerpt of 10 core classes and relations is provided in
Supplementary Material. The full schema is available via the repository link in
Appendix A (
https://github.com/Feng-David/ontology_soil.git, accessed on 10 April 2026).
2.1.4. Triple Generation
Triples for the knowledge graph were generated from the constructed ontology and the official departments’ investigation reports on soil pollution.
Using these sources, Resource Description Framework (RDF) triples can be generated as follows:
(Construction Site A produces Chemical Waste X) [Survey report];
(Chemical Waste X is harmful to Aquatic Life) [Environmental reports/encyclopedia];
(Government Regulation Y regulates Waste Disposal) [Policy documentation].
This approach facilitates the integration of diverse data sources into a structured and interconnected format, enabling effective responses to queries regarding the environmental impact of construction. On a held-out manually annotated set of 200 document pages, the extraction pipeline achieved a precision of 88.5% and a recall of 84.2% for core relations.
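As a minimal sketch of how such source-tagged triples can be held and chained, the example below stores the three illustrative triples with their provenance and follows a two-hop path. The `Triple` class, the predicate spellings, and the `two_hop` helper are hypothetical and are not taken from the released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    source: str  # provenance label carried alongside the fact


# The three example triples from the text, with their stated sources.
TRIPLES = [
    Triple("Construction Site A", "produces", "Chemical Waste X", "Survey report"),
    Triple("Chemical Waste X", "isHarmfulTo", "Aquatic Life",
           "Environmental reports/encyclopedia"),
    Triple("Government Regulation Y", "regulates", "Waste Disposal",
           "Policy documentation"),
]


def two_hop(start: str, first: str, second: str,
            triples: List[Triple]) -> List[Tuple[str, List[str]]]:
    """Follow start -first-> x -second-> y, returning y with both sources."""
    results = []
    for t1 in triples:
        if t1.subject == start and t1.predicate == first:
            for t2 in triples:
                if t2.subject == t1.obj and t2.predicate == second:
                    results.append((t2.obj, [t1.source, t2.source]))
    return results
```

Calling `two_hop("Construction Site A", "produces", "isHarmfulTo", TRIPLES)` links the site to the affected receptor and keeps the citation for each hop.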
To ensure rigorous auditability, the system maintains strict data lineage from the physical environment to the digital graph. This traceability pipeline operates in five distinct stages: (1) Physical Sampling: On-site soil core extraction; (2) Laboratory Analysis: Geochemical quantification (e.g., ICP-MS); (3) Documentation: Generation of the formal PDF environmental survey report; (4) Digital Extraction: NLP-driven entity and relation extraction from the text/tables; and (5) Materialization: Instantiation as a connected measurement node within the knowledge graph, permanently linked to its source document.
2.2. Framework of KBQA
This study implements a hybrid LLM–knowledge graph (KG) pipeline that converts natural-language questions into constrained, auditable operations over the domain ontology and graph (
Figure 5). An instruction-tuned LLM identifies entities/slots (e.g., parcel, pollutant, and land use) and the intent type (lookup, threshold, list, or two-hop); a schema linker then maps candidates to KG nodes and relations using a dual retriever (lexical matching over labels/aliases/CAS/standard codes, plus a dense retriever). The top-k schema hints are supplied to the LLM for grounding before planning.
The planner then generates a machine-readable Plan JSON composed of calls to a fixed toolbox (e.g., get_latest_measurement, get_standard, compare_threshold, list_exceedances, and get_provenance). Grammar-constrained decoding (JSON Schema; low temperature) restricts outputs to valid ontology terms. An executor translates tool calls into parameterized graph queries (Cypher/SPARQL). The outline of this algorithm is shown below:
Step 1 (Plan): LLM parses intent → selects get_latest_measurement and compare_threshold.
Step 2 (Execute): Executor runs Cypher queries against Neo4j → returns raw values and citations.
Step 3 (Verify): Unit service normalizes bases → Verifier recomputes math. If fail → trigger re-plan; if pass → send structured JSON to verbalizer.
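The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory dictionaries stand in for the Cypher queries against Neo4j, the plan layout (`id`, `tool`, `args`, `$`-references) is hypothetical, and the numeric values are invented for the example rather than taken from any report or standard. Only the tool names and the limit of three tool calls per query come from the text.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-ins for graph lookups; values are illustrative only.
MEASUREMENTS = {("Parcel A", "Cd"): {"value": 25.0, "unit": "mg/kg",
                                     "source": "Report 1, Table 3, p. 12"}}
STANDARDS = {("Cd", "residential"): {"value": 20.0, "unit": "mg/kg",
                                     "source": "hypothetical standard, Table 1"}}


def get_latest_measurement(parcel: str, pollutant: str) -> Dict[str, Any]:
    return MEASUREMENTS[(parcel, pollutant)]


def get_standard(pollutant: str, land_use: str) -> Dict[str, Any]:
    return STANDARDS[(pollutant, land_use)]


def compare_threshold(measurement: Dict[str, Any],
                      standard: Dict[str, Any]) -> Dict[str, Any]:
    if measurement["unit"] != standard["unit"]:
        raise ValueError("units must be normalized before comparison")
    return {"exceeds": measurement["value"] > standard["value"]}


TOOLBOX: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_latest_measurement": get_latest_measurement,
    "get_standard": get_standard,
    "compare_threshold": compare_threshold,
}
MAX_TOOL_CALLS = 3  # per-query limit stated in the LLM configuration

# Step 1 (Plan): the kind of plan the LLM would emit for a threshold question.
PLAN = [
    {"id": "m", "tool": "get_latest_measurement",
     "args": {"parcel": "Parcel A", "pollutant": "Cd"}},
    {"id": "s", "tool": "get_standard",
     "args": {"pollutant": "Cd", "land_use": "residential"}},
    {"id": "c", "tool": "compare_threshold",
     "args": {"measurement": "$m", "standard": "$s"}},
]


def execute(plan: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Step 2 (Execute): run each call; '$x' arguments refer to earlier results."""
    if len(plan) > MAX_TOOL_CALLS:
        raise ValueError("plan exceeds the tool-call budget")
    results: Dict[str, Any] = {}
    for step in plan:
        tool = TOOLBOX.get(step["tool"])
        if tool is None:
            raise ValueError(f"unknown tool: {step['tool']}")
        args = {k: results[v[1:]] if isinstance(v, str) and v.startswith("$") else v
                for k, v in step["args"].items()}
        results[step["id"]] = tool(**args)
    return results


def verify(results: Dict[str, Any]) -> bool:
    """Step 3 (Verify): recompute the comparison from the returned numerics."""
    recomputed = results["m"]["value"] > results["s"]["value"]
    return recomputed == results["c"]["exceeds"]
```

Because the executor dispatches only through `TOOLBOX`, a plan naming any other operation is rejected before touching the data, which is the property the fixed toolbox is meant to provide.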
The LLM receives only structured results and verbalizes a concise answer; every fact carries source-level citations (report/table/page; standard document). When the KG lacks a required fact, the system invokes a bounded text-retrieval fallback over the curated corpus, constrained to the relevant parcel/pollutant; any text-derived claim must include a citation and pass the same numeric/unit checks, and answers must disclose whether evidence is KG- or text-derived. For example, if the KG lacks the moisture content for Parcel X, the fallback retrieves the original PDF passage. The LLM extracts the value and appends an explicit text-derived citation (e.g., “Source: Report X, Page 12 [Text Retrieval]”), isolating it from KG-verified facts.
A lightweight verifier recomputes key numeric checks; failures trigger re-planning. Ambiguity is handled by generating alternative plans and re-ranking them by ontology plausibility, geospatial consistency, and provenance completeness. A calibrated confidence score aggregates planner entropy, verifier success, and evidence source.
LLM configuration: We deployed a 13B-parameter LLaMA-2-based model, instruction-tuned on 15,000 domain-specific QA pairs. The planning temperature was set to 0.2 and the verbalization temperature to 0.0–0.2; a JSON Schema grammar was used for plan emission; the number of top-k schema hints was set to 10; the maximum number of tool calls per query was 3; only short, structured contexts were used; and caching of schema hints and compiled queries was enabled for repeated questions.
2.3. Test and Validation of the KBQA
This study validated the knowledge-graph-powered QA (KG + KBQA) on a held-out urban soil QA benchmark constructed from the test snapshot and divided into four task types that reflect operational decision needs: lookup, threshold, list, and two-hop/compositional reasoning. The benchmark comprises 400 queries, split uniformly into 100 queries per task type. The test set was curated from an independent subset of reports strictly segregated from the LLM tuning data. Evaluation followed a fixed runtime profile and reported Exact Match (EM), token-level F1, MRR/Recall@k, median/p90 latency, and provenance completeness. All answers were normalized to canonical units and controlled vocabularies prior to scoring.
2.3.1. Test Task
Lookup: The lookup task requires returning a single canonical fact from the knowledge base (e.g., a pollutant’s land-use-specific threshold, a pollutant CAS number, or a parcel attribute). A natural-language query is mapped to the domain schema and executed as a parameterized graph query to retrieve the target value. Outputs are normalized and accompanied by source-level provenance (e.g., standard document and section), isolating faithful recovery of atomic facts from multi-step reasoning effects.
Threshold (exceedance determination): The threshold task assesses whether a parcel’s most recent valid measurement for a specified pollutant exceeds the applicable land-use-specific standard. The system resolves the land-use context and standard version/date, normalizes measurement units to canonical bases (e.g., mg/kg dry soil), and performs a deterministic comparison. The output includes a Boolean decision together with the measured value (unit, method, and date), the standard value (unit, land use, and version/date), and citations for both the measurement and the standard value. Cases lacking sufficient basis information (e.g., wet-basis values without moisture correction) are explicitly flagged.
List (set retrieval under constraints): The list task returns an unordered set of entities that satisfy structured filters (e.g., “parcels in City X where Pb exceeds 400 mg/kg under residential use within year Y”). The query is translated into schema-aware filters over parcels, pollutants, thresholds, land-use categories, geography, and time windows; the same numeric and unit policies as in the threshold task are applied, and aliases are deduplicated to produce a canonical set. Each returned item carries sufficient provenance to audit inclusion.
Two-hop/compositional reasoning: The two-hop task measures the ability to traverse and aggregate across multiple relations (e.g., identifying historical land uses most associated with benzene exceedances in a city). The system executes schema-valid paths (e.g., Parcel → Land Use; Parcel → Measurement → Pollutant → Standard), aggregates counts or ranks outcomes under land-use and temporal constraints, and reports the resulting categories or entities with representative citations. Only validated exceedances (per threshold rules) contribute to aggregations, ensuring unit-consistent, threshold-aware reasoning rather than surface co-occurrence.
Common policies across tasks include the following: (i) precedence of standards: the question-specified version/date is used when given; otherwise, the most recent applicable version is used; (ii) latest-measurement policy: the latest valid record per parcel–pollutant pair is selected when no time window is specified; (iii) limits of detection: values reported as “<LoD” are handled by statistical substitution (assigned 1/2 LoD) during exceedance evaluations to avoid bias, rather than being treated as missing; and (iv) ambiguity handling: if entity linking remains unresolved after ontology/geospatial constraints are applied, the system abstains or issues a clarification tag; such instances are excluded from EM but count toward Recall@k when a correct candidate appears.
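The latest-measurement, limit-of-detection, and basis-flagging policies can be combined into one exceedance check, sketched below. The `Measurement` fields, the conversion table, and the choice of a strict "greater than" comparison are assumptions made for illustration; the text does not specify how equality with the threshold is treated.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Conversion factors to the canonical base (mg/kg); a hypothetical subset.
TO_MG_PER_KG = {"mg/kg": 1.0, "ug/kg": 1e-3, "g/kg": 1e3}


@dataclass
class Measurement:
    value: float            # reported value, or the LoD itself if censored
    unit: str
    sampled_on: date
    censored: bool = False  # True when the report gives "<LoD"
    dry_basis: bool = True  # False for wet-basis values without correction


def canonical_value(m: Measurement) -> float:
    """Normalize to mg/kg and apply 1/2 LoD substitution to censored values."""
    value = m.value * TO_MG_PER_KG[m.unit]
    return value / 2 if m.censored else value


def latest_valid(records: List[Measurement]) -> Optional[Measurement]:
    """Latest-measurement policy: most recent record with a convertible unit."""
    valid = [m for m in records if m.unit in TO_MG_PER_KG]
    return max(valid, key=lambda m: m.sampled_on) if valid else None


def exceeds(records: List[Measurement],
            threshold_mg_per_kg: float) -> Optional[bool]:
    """Return True/False, or None when the case must be flagged instead."""
    m = latest_valid(records)
    if m is None or not m.dry_basis:
        return None  # insufficient basis information: flag, do not decide
    return canonical_value(m) > threshold_mg_per_kg
```

Returning `None` rather than `False` for wet-basis or unit-less records keeps flagged cases distinct from genuine non-exceedances.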
2.3.2. Indicators for Validation
For the lookup, threshold, list, and two-hop tasks, effectiveness is reported using:
Exact Match (EM): The proportion of questions with an answer string (or Boolean threshold) exactly matching any reference after normalization.
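Token-level F1 score (F1): Harmonic mean of precision and recall on token sets for partially correct spans or sets (order-invariant for list). F1 is calculated as follows:
$$\mathrm{Precision} = \frac{|P \cap G|}{|P|}, \qquad \mathrm{Recall} = \frac{|P \cap G|}{|G|}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Here $P$ and $G$ are the normalized token (or element) sets of the predicted and gold answers.
Mean Reciprocal Rank (MRR): The average of $1/\mathrm{rank}_q$, where $\mathrm{rank}_q$ is the rank of the first correct answer for query $q$, over all queries $q \in Q$:
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$
Recall at k (R@k): The fraction of queries for which at least one correct answer appears in the top $k$ retrieved/candidate items:
$$R@k = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\mathrm{rank}_q \le k\right]$$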
Median latency (s): p50 end-to-end time per query (planning → graph execution → checks → verbalization).
p90 latency (s): 90th-percentile end-to-end time; reflects tail performance on harder/uncached queries.
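The indicators above can be computed with a few short functions, sketched here with the Python standard library only. The function names and the convention of passing `None` for a query with no correct candidate are assumptions of this sketch; F1 is set-based, following the token-set definition in the text.

```python
from statistics import median, quantiles
from typing import Iterable, List, Optional, Tuple


def exact_match(prediction: str, references: List[str]) -> bool:
    """EM after normalization: the prediction equals any reference string."""
    return prediction in references


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Order-invariant F1 over normalized token (or element) sets."""
    p, g = set(predicted), set(gold)
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """ranks[i] is the 1-based rank of the first correct answer, or None."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


def recall_at_k(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of queries whose first correct answer is within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def latency_summary(latencies_s: List[float]) -> Tuple[float, float]:
    """Return (p50, p90); the last of the nine decile cut points is p90."""
    return median(latencies_s), quantiles(latencies_s, n=10)[-1]
```

A query with no correct candidate contributes zero to both MRR and R@k, which matches the usual convention for unanswered queries.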
2.3.3. Comparison with Baseline
BM25 and RAG-without-KG, two commonly used methods for information retrieval and extraction in question-answering systems, were selected as baselines against which to compare the performance of KBQA.
BM25 is the canonical sparse lexical retrieval method in IR, long used as a strong, transparent baseline across TREC-style evaluations [
13,
23]. It ranks passages by term frequency, inverse document frequency, and length normalization with just two tunable parameters ($k_1$, $b$). We implemented BM25 using Elasticsearch with parameters ($k_1$ = 1.2, $b$ = 0.75), a Jieba tokenizer for Chinese text, and passage-level indexing. Using BM25 allowed us to benchmark our system against a well-understood, high-precision text-only approach that does not rely on schema, unit normalization, or graph structure, thereby isolating the value added by the KG. It remains competitive on factual lookups and is recommended in modern QA studies as a point of comparison to dense methods. RAG was introduced in [24] and typically relies on dense passage retrieval methods such as DPR to obtain semantically similar passages. The RAG baseline used a dense retriever (BGE-Large-zh, embedding dimension 1024) to retrieve the top 5 passages, which were passed to the same 13B LLM used by our KBQA for generation. While effective for paraphrased questions, RAG-without-KG lacks schema-level semantics and deterministic unit/threshold reasoning.
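To make the roles of $k_1$ and $b$ concrete, the following is a minimal sketch of a textbook Okapi BM25 scorer over pre-tokenized passages. It is not the Elasticsearch implementation used in the study, whose internal constants may differ; the function name and the smoothed IDF form are choices made for this illustration, and the input is assumed to be a non-empty list of already tokenized passages.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized passage against the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalization; k1 controls tf saturation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With $b = 0$ passage length is ignored, and as $k_1$ grows the score approaches a linear function of term frequency; the values 1.2 and 0.75 sit between these extremes.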
3. Results
A general view of the KBQA system and its interface is provided in the
Supplementary Materials at the end of this article.
3.1. KBQA Performance
This study evaluated the KG + KBQA system against BM25 and RAG-without-KG on the urban soil QA benchmark (lookup, threshold, list, and two-hop tasks). As summarized in
Table 2, KG + KBQA consistently outperformed both baselines in EM/F1 and MRR/R@k, with the largest margins on threshold and lookup tasks where unit normalization and land-use-specific standards determine correctness. To confirm the robustness of these findings, a Wilcoxon signed-rank test was conducted across the 100 threshold queries. The Exact Match improvement of KG + KBQA over the RAG baseline was found to be statistically significant ($p < 0.01$), indicating that the performance gain is unlikely to be due to chance. End-to-end median latency was lower than RAG-without-KG due to bounded tool calls and compiled graph queries, while remaining competitive with BM25. Completeness of provenance was highest for KG + KBQA, which returned both measurement and standard citations where applicable. Ablations showed that removing grammar constraints, the numeric verifier, ontology-aware entity linking, or the text fallback each degraded accuracy and/or increased failures. Error analysis highlighted three residual issues: ambiguous parcel aliases, missing standard version/date in legacy texts, and wet-basis measurements lacking moisture metadata (flagged and excluded from exceedance materialization).
To isolate the contributions of our pipeline’s specific components, we conducted an ablation study on the Exact Match (EM) metric (
Table 3). As expected, removing the numeric verifier severely impacted the performance on threshold tasks, while removing ontology-aware linking caused the steepest drops in compositional reasoning (two-hop and list tasks). Eliminating grammar constraints led to higher rates of invalid graph query generation, and removing the text fallback reduced our system’s ability to recover from knowledge graph coverage gaps, collectively demonstrating the necessity of each module.
Furthermore, a brief sensitivity analysis on input vocabulary (e.g., swapping chemical names for their CAS numbers or varying phrasing) revealed an EM variance of less than 2%, demonstrating robust semantic linking.
3.2. Knowledge Completion
For link prediction, we trained a lightweight model over relations such as hasHistoricalUse, likelyEmits, and governedBy. On held-out triples, MRR and Hits@{1, 3, 10} improved when multi-source features (co-occurrence across parcels and regulatory co-mentions) were included. Suggested links were surfaced as curation hints with confidence and evidence slices; they were not auto-asserted into the KG. In QA, completions were used only to prioritize candidates for two-hop questions, preserving provenance requirements.
3.3. Automatic KG Construction and Refresh
Incremental ingestion of newly released reports and standards produced an updated snapshot with higher throughput (documents/hour) and stable yield (triples/document) relative to cold start, aided by cached parcel aliases and compiled query templates. Quality gates enforced required fields for measurement triples (value, unit, analyte, parcel, and date); items failing checks were routed to quarantine with explicit failure codes (e.g., unit mismatch and unresolved alias). The append-only, time-stamped snapshotting ensured reproducibility and rollback for audit.
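The quality gate described above can be sketched as a validation step that either accepts a measurement record or routes it to quarantine with failure codes. The required fields follow the text; the failure-code spellings, the set of accepted units, and the record layout are hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple

REQUIRED_FIELDS = ("value", "unit", "analyte", "parcel", "date")
KNOWN_UNITS = {"mg/kg", "ug/kg"}  # hypothetical set of convertible units

Record = Dict[str, Any]


def gate(record: Record, known_parcels: Set[str]) -> List[str]:
    """Return the failure codes for a record; an empty list means it passes."""
    codes = [f"missing_{name}" for name in REQUIRED_FIELDS
             if record.get(name) in (None, "")]
    if record.get("unit") and record["unit"] not in KNOWN_UNITS:
        codes.append("unit_mismatch")
    if record.get("parcel") and record["parcel"] not in known_parcels:
        codes.append("unresolved_alias")
    return codes


def route(records: List[Record], known_parcels: Set[str]
          ) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into accepted ones and quarantined (record, codes) pairs."""
    accepted, quarantine = [], []
    for record in records:
        codes = gate(record, known_parcels)
        if codes:
            quarantine.append((record, codes))
        else:
            accepted.append(record)
    return accepted, quarantine
```

Keeping the codes with each quarantined record lets a curator see why an item was held back without re-running the checks.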
3.4. Knowledge Reasoning
For two-hop/compositional queries (e.g., land uses associated with pollutant exceedances), KG + KBQA achieved higher EM/F1 and Recall@k than text-only systems by traversing ontology-valid paths (Parcel → Measurement → Pollutant → Standard; Parcel → Land Use) under unit and standard constraints. For numeric, threshold-aware reasoning, the unit service and Standard Resolver yielded deterministic exceedance decisions accompanied by values, thresholds, and source-level citations; the verifier recomputed comparisons from returned numerics and triggered re-planning on mismatches. In the South China case study, these mechanisms reduced manual cross-document lookups and exposed version discrepancies in cited standards, illustrating decision support with transparent provenance.
While traditional RAG systems obscure their reasoning, the KG + KBQA pipeline guarantees explainability through a traceable query path. For a threshold exceedance query, the step-by-step traversal is: (1) Intent Parsing: Mapping the user query to a specific parcel and pollutant; (2) Node Linking: Traversing the graph to find the latest valid measurement node (applying the 1/2 LoD statistical substitution if censored); (3) Standard Resolution: Querying the ontology for the applicable regulatory standard node based on the parcel’s specific land use; (4) Normative Comparison: Executing the deterministic threshold check; and (5) Output Generation: Returning the Boolean decision alongside the exact source citations for both the measurement and the standard.
5. Conclusions
This study demonstrates that combining a rigorously curated domain knowledge graph with schema-aware AI question answering delivers materially better support for urban soil environmental decisions than text-only retrieval. On a benchmark spanning lookup, threshold, list, and two-hop queries and in a South China case study, the KG + KBQA system produced more accurate answers, lower or competitive latency, and source-level provenance (measurement and standard citations), thereby achieving the stated aim. Crucially, the pipeline’s constrained LLM orchestration—tool-based planning, unit and threshold normalization, version-aware standard resolution, and numeric verification—enabled deterministic, auditable reasoning that black-box generation and traditional searches do not provide.
More broadly, the results affirm the value and significance of knowledge graphs and AI for environmental management. KGs transform heterogeneous reports, measurements, standards, and land-use context into structured, interoperable, and queryable knowledge, while AI provides the natural-language interface and planning needed to operationalize knowledge at decision time. Together, they (i) increase reliability through explicit semantics, constraints, and provenance; (ii) improve efficiency by reducing manual cross-document effort; and (iii) enhance accountability by making every answer traceable to its sources, capabilities that are central to risk assessment, regulatory compliance, and remediation planning. This architecture is generalizable beyond urban soil to groundwater, sediments, and air quality, where threshold-aware, unit-consistent, and version-aware standard reasoning is equally critical.
While utility still depends on source fidelity and some linking ambiguities remain, the evidence here shows that KG-centered, AI-enabled systems constitute a substantive advance over document-centric workflows and unconstrained LLMs for environmental decision support. The released artifacts (ontology/schema, QA set, and example graph snapshot) are intended to catalyze adoption and independent evaluation. Future work will scale practitioner studies, extend temporal and geospatial reasoning under explicit constraints, and refine human-in-the-loop curation—advancing toward regulator-ready, transparent AI that strengthens environmental governance and sustainable urban development.