Gold Deposit Ontology Guides Large Language Model to Transform Text into Knowledge Graphs for Gold Deposits

Zhu, Jinhao; Wang, Yueying; Tong, Wanying; Li, Shengmiao; Wang, Mingguo; Wang, Chengbin

doi:10.3390/min16010050

by Jinhao Zhu ¹, Yueying Wang ¹, Wanying Tong ¹, Shengmiao Li ², Mingguo Wang ¹,³ and Chengbin Wang ¹,*

¹ Ministry of Natural Resources Key Laboratory of Resource Quantitative Evaluation and Information Engineering, School of Earth Resources, China University of Geosciences, Wuhan 430074, China

² Geological Survey Institute of Hunan Province, Changsha 410114, China

³ Yunnan Geological Big Data Center, Geological Survey and Mapping Institute of Yunnan Province, Kunming 650218, China

* Author to whom correspondence should be addressed.

Minerals 2026, 16(1), 50; https://doi.org/10.3390/min16010050

Submission received: 6 December 2025 / Revised: 29 December 2025 / Accepted: 30 December 2025 / Published: 31 December 2025

(This article belongs to the Special Issue Digital Exploration and Assessment of Mineral Resources: Theories, Methods and Achievements, 2nd Edition)

Abstract

The rise of artificial intelligence has led to the emergence of geoscience knowledge graphs (GeoKG) as effective tools for organizing and representing complex knowledge. The growing complexity of geoscience data calls for innovative strategies for structuring and interpreting extensive information. Conventional knowledge extraction methods often rely on manual annotation and deep learning techniques, which can be costly and inefficient. Herein, we leverage a large language model (LLM) to address the challenges of knowledge extraction and fusion in creating a knowledge graph focused on gold deposits. First, we developed an ontology explicitly designed for gold deposits, drawing on insights from geological experts. Next, we formulated a prompt to guide the LLM to accurately extract geological entities and their semantic relationships in accordance with the knowledge graph schema. Subsequently, we conducted geological entity alignment and integration to construct the gold deposit knowledge graph, which encompasses 3738 entities and 3900 semantic relationships. Finally, we identified an optimal configuration balancing F1-score and computational cost through comparative experiments on locally deployed models with varying parameter counts. Our findings demonstrate that an LLM can effectively capture long-range contextual relationships to identify geological entities and their semantic connections, and that it performs strongly when handling diverse expressions.

Keywords: ontology; geological entities; gold deposits; knowledge graph; large language model; semantic relation

1. Introduction

The geoscience knowledge graph (GeoKG) has garnered considerable attention in recent years as a powerful tool for representing and organizing knowledge, especially in the fields of artificial intelligence (AI) and geoscience [1]. The concept of a knowledge graph was first introduced by Google in 2012, encompassing over 500 million entities and 700 million facts [2]. The fundamental idea behind a knowledge graph is to visually represent knowledge through a graphical format [3]. Knowledge graphs can capture and illustrate complex knowledge structures by forming semantic networks among entities, relationships, and their attributes. In the realm of AI, the knowledge graph has become an essential technology that supports AI models in understanding and reasoning, finding applications in intelligent search, question answering, and personalized recommendations [4]. In geoscience, knowledge graphs integrate data from various disciplines, enhancing semantic understanding and reasoning capabilities [5], thereby laying a foundation for big data analysis in geoscience.

Geoscience big data is characterized by multimodality, including structured geophysical measurements, semistructured sensor outputs, and unstructured geoscience literature [5]. The heterogeneity of geoscience data brings significant challenges in establishing unified knowledge frameworks for GeoKG representation. The construction of knowledge graphs involves two stages. In the first stage, cross-disciplinary terminology alignment is performed to resolve lexical inconsistencies across different geoscience subdisciplines and contexts. In the second stage, the triple structure of Entity1–Relation–Entity2 is utilized to organize heterogeneous geoscientific data and knowledge from various sources.

Recent advancements in AI have notably accelerated the construction of GeoKG, particularly through deep learning techniques for named entity recognition and semantic relation extraction [6,7,8,9,10]. Deep neural networks can capture more nuanced, context-rich features, thereby improving the accuracy of identifying geological entities, such as lithologies, structural features, and stratigraphic units [11,12,13,14]. Additionally, Transformer-based models (e.g., Bidirectional Encoder Representations from Transformers [BERT], Robustly Optimized BERT Pretraining Approach [RoBERTa]) have gained attention for their self-attention mechanisms, which effectively manage ambiguous or overlapping geological terms [15,16]. Hybrid methods that combine the recurrent neural network (RNN) or transformer-based encoders with graph neural networks enhance the identification of domain-specific relationships (e.g., geographic entities and geomorphological features) and support natural language processing (NLP) tasks for multimodal data extraction [17,18,19,20,21]. Furthermore, integrating domain ontologies into these processes helps reduce noise and aligns extracted data with standardized vocabularies, a practice particularly advantageous for heterogeneous geoscience datasets [22].

In recent years, LLMs have demonstrated significant advantages in NLP tasks, such as extracting geological entities and their semantic relationships for knowledge graph construction [23]. When fine-tuned on domain-specific corpora, these models achieve higher accuracy on tasks such as lithological or structural feature recognition, even with limited training data. The synergistic integration of LLMs, open data resources, and prompt engineering has greatly enhanced the analytical capabilities and processing efficiency of geoscientific data [24].

The extraction of geological entities and their semantic relationships has become more sophisticated. Nevertheless, the geoscientific field still faces challenges. Handling multimodal geological data using deep learning techniques can be complicated. Additionally, inconsistencies in geological terminology increase conceptual confusion, such as the use of different names for strata across various geological blocks. Moreover, data-driven approaches using neural network models rely on high-precision geoscience corpora, which can be labor-intensive to curate. Therefore, identifying an effective method that balances accuracy and efficiency to improve the extraction of entity relationships is a critical challenge at this stage.

In this study, we proposed an LLM-based approach guided by an ontology model to construct a gold-deposit knowledge graph, leveraging instruction-tuned LLMs for GeoKG via entity extraction and relation disambiguation. The remainder of this paper is structured as follows: Section 2 presents the dataset utilized in the knowledge graph and the primary methods employed. Section 3 describes the completed knowledge graph we developed and its applications. Section 4 discusses the feasibility of the LLM methods, compares the LLM approach with deep learning techniques, and outlines prospects for future research. Finally, Section 5 concludes the paper.

2. Materials and Methods

2.1. Dataset and Data Sources

To construct the dataset, we conducted a brief literature search and review based on the records from the USMIN mineral deposit database. Using the mineral deposit data in the database, we systematically collected published open-access literature on these important gold deposits. These deposits represent major contributors to global gold reserves and offer valuable references for investigating mineralization patterns. Notable examples include the Carlin-type deposits in Nevada and the Kalgoorlie Super Pit in Australia, which have become key research focuses due to their unique mineralization processes and geological characteristics. Ultimately, we compiled 178 academic papers, forming a corpus of approximately 1.5 million words. This corpus provides comprehensive and reliable data for this study.

2.2. Gold Deposits Ontology

In this study, we employed the Entity Relation Triple Model (ER Model) as a structured framework to organize the knowledge and data recognized by LLMs. The ER model is a conceptual design tool for databases that constructs a conceptual model using three fundamental elements: entities, attributes, and relationships [25]. An entity refers to an object in the real world that holds independent significance and can be distinctly identified. Attributes describe the characteristics or properties of these entities. Relationships illustrate the connections between entities, including one-to-one, one-to-many, and many-to-many. In geoscience, the ER model is also crucial for managing and integrating geological data.

Based on our understanding of gold deposits, we identified 24 types of entities (see Table 1) and 10 types of relationships (see Table 2) [11]. The 24 entity types include Location, Geological Time, Lithostratigraphy, Chronostratigraphy, Geological Background, Geological Events, Fracture, Folded Structure, Metallic Mineral, Nonmetallic Mineral, Element, Sedimentary Rock, Igneous Rock, Metamorphic Rock, Rock Mass, Ore Body, Deposit, Alteration Type, Mineralization, Exploration Engineering, Prospecting Sign, Mineralization Type, Geological Anomaly, and Fluid Inclusion Type. The first eight entities establish the time-space framework and geological context of the ore deposit. The subsequent nine entities represent the core characteristics of the deposit itself, while the final seven provide critical insights for ore exploration. Together, these entities create a knowledge graph framework that systematically represents the formation, distribution, and exploration of gold deposits.

The relationship types include “hasAlteration,” “isControlledBy,” “isLocatedIn,” “isFormedIn,” “hasMinerals,” “hasElement,” “isRelatedTo,” “isFoundIn,” “isAnalyzedBy,” and “isRevealedBy.” The first seven primarily depict semantic relationships among entities within the gold ore metallogenic system, whereas the last three focus on semantic relationships among entities within the gold ore exploration system. The definitions of these relationship types accurately convey knowledge and elucidate the internal connections among entities. Based on the defined entities and relationships, we constructed the ER model, as illustrated in Figure 1, which provides a solid knowledge foundation for the subsequent development of the knowledge graph and for the comparative analysis of ore-forming processes.
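The vocabulary listed above can be written down as a small schema check. The following is a minimal sketch, not the authors' implementation: the `Triple` class and the `schema_errors` helper are hypothetical names introduced here, and the check covers only whether labels and relations belong to the ontology, since the text does not spell out which entity types each relation may connect (that information is in Tables 1 and 2 and Figure 1).

```python
from dataclasses import dataclass

# The 24 entity types and 10 relation types named in Section 2.2.
ENTITY_TYPES = frozenset({
    "Location", "Geological Time", "Lithostratigraphy", "Chronostratigraphy",
    "Geological Background", "Geological Events", "Fracture", "Folded Structure",
    "Metallic Mineral", "Nonmetallic Mineral", "Element", "Sedimentary Rock",
    "Igneous Rock", "Metamorphic Rock", "Rock Mass", "Ore Body", "Deposit",
    "Alteration Type", "Mineralization", "Exploration Engineering",
    "Prospecting Sign", "Mineralization Type", "Geological Anomaly",
    "Fluid Inclusion Type",
})
RELATION_TYPES = frozenset({
    "hasAlteration", "isControlledBy", "isLocatedIn", "isFormedIn",
    "hasMinerals", "hasElement", "isRelatedTo", "isFoundIn",
    "isAnalyzedBy", "isRevealedBy",
})


@dataclass(frozen=True)
class Triple:
    """One extracted row: [Label1, Entity1, Relations, Label2, Entity2]."""
    label1: str
    entity1: str
    relation: str
    label2: str
    entity2: str


def schema_errors(triple: Triple) -> list[str]:
    """Return a list of vocabulary violations (empty when the triple conforms)."""
    errors = []
    if triple.label1 not in ENTITY_TYPES:
        errors.append(f"unknown entity label: {triple.label1}")
    if triple.label2 not in ENTITY_TYPES:
        errors.append(f"unknown entity label: {triple.label2}")
    if triple.relation not in RELATION_TYPES:
        errors.append(f"unknown relation: {triple.relation}")
    return errors
```

A triple whose labels and relation all come from the ontology yields an empty list; anything else can be flagged for review before it reaches the graph.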

2.3. Prompt Engineering

Prompt engineering involves designing and optimizing prompts to guide large language models (LLMs) to generate desired outputs [26]. Users can create prompts tailored to their specific requirements, guiding the model to generate the required content [27]. While the ontology model provides a structured framework for constructing knowledge graphs, it cannot be directly interpreted by an LLM. Therefore, it is essential to develop prompt engineering strategies that translate the knowledge model into commands comprehensible to an LLM [28]. Accurate mapping is crucial for this transformation.

The mapping process entails converting the entities and relationships in the ontology model into specific commands within the prompt engineering framework [29]. To ensure the accuracy of the mapping, we must thoroughly analyze the structure and semantics of the knowledge model and convert them into prompts. This process begins with identifying the entities and relationships present in the ontology model. For instance, the knowledge model includes entities such as “Mineralization Type” and relationships like “hasAlteration.” We need to translate these abstract concepts into clear instructions, such as “extract mineralization types and their associated alteration features” or “determine the spatial distribution of mineralization types in relation to structural features.” By carefully crafting the prompts, we ensure that the model adheres to these instructions accurately, facilitating efficient knowledge extraction and application [30].

Given these considerations, we designed the prompts as shown in Figure 2. First, we clarified the system’s role identity. In the task description (see Supplementary Material), the system role is defined as “You are a knowledge graph expert in the field of gold deposits,” instructing the LLM to approach the task from the perspective of a knowledge graph expert. This approach somewhat restricts the LLM’s creativity outside geological contexts. Additionally, we defined the task’s working environment based on the relevant literature on gold deposits that we provided. This ensures that the LLM does not generate fabricated data. The goal is clearly articulated: to extract entities and relationships and assign appropriate labels to them.

We detailed the relevant information on entities and relationships. Accurate extraction of these elements is the primary objective of prompt engineering. Based on the data in Table 1 and Table 2, we summarized and integrated 24 entity types and 10 relationship types into the prompt engineering framework. These entities encompass nearly all critical information within the gold mining sector. The relationships illustrate the structural connections and inherent logical links between entities. Precise definitions and descriptions of these entities and relationships ensure that the LLM can accurately and comprehensively extract valid information from the literature, facilitating successful task completion.

After classifying the entities and relationships, we specified the entity labels. The selection of labels was limited to the predefined 24 entities. These labels provide clear classifications for each entity, thereby enhancing extraction accuracy and optimizing subsequent analyses. By categorizing entities through labels, relevant entities can be efficiently linked with their corresponding relationships. This method improves the efficiency of knowledge graph construction and provides precise data support for reasoning analysis.

The definitions above primarily focus on delineating entities and relationships and on describing labels. To ensure that the LLM accurately understands the task, we incorporated numerous task descriptions and constraints into the prompt. The first four instructions require the LLM to carefully consult the definitions and explanations provided earlier. The latter part aims to prevent the model from making arbitrary fabrications. In the constraints section, it states, “if a certain type of relationship does not exist, it does not need to be provided” [31], thereby enhancing the accuracy of the model-generated content. Finally, the model is instructed to recognize steps sequentially, which can trigger the Chain of Thought in the LLM [32], improving the model’s performance over time. The output format is specified to present the triples as an Excel table: [Label1, Entity1, Relations, Label2, Entity2]. This standardized output format facilitates the subsequent utilization of the data.
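The prompt structure described in this section (role, working environment, goal, entity and relation vocabularies, constraints, output format) can be sketched as a simple template. This is an illustrative paraphrase under stated assumptions: the actual prompt is the one given in Figure 2 and the Supplementary Material, and `build_prompt` is a hypothetical helper. Only the role sentence, the relationship constraint, and the column order are taken from the text.

```python
# Role sentence and output columns as quoted in the text.
ROLE = "You are a knowledge graph expert in the field of gold deposits."
OUTPUT_COLUMNS = ["Label1", "Entity1", "Relations", "Label2", "Entity2"]


def build_prompt(entity_types: list[str], relation_types: list[str]) -> str:
    """Assemble an extraction prompt from the ontology vocabulary.

    The wording of every line except ROLE and the relationship constraint
    is an assumption made for illustration.
    """
    lines = [
        ROLE,
        "Work only from the literature provided; do not invent data.",
        "Goal: extract entities and relationships and label each entity.",
        "Entity labels (choose only from this list): " + "; ".join(entity_types),
        "Relationship types: " + "; ".join(relation_types),
        "If a certain type of relationship does not exist, "
        "it does not need to be provided.",
        "Work through the steps in order.",
        "Output one row per triple as: [" + ", ".join(OUTPUT_COLUMNS) + "]",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(["Deposit", "Alteration Type"], ["hasAlteration"]))
```

Generating the prompt from the ontology lists, rather than writing it by hand, keeps the prompt and the schema from drifting apart when an entity or relation type is added.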

3. Results

3.1. Geological Entities and Their Extracted Semantic Relations

Our research employs the locally deployed DeepSeek-R1 model (70B parameters, 4-bit-quantized, distilled version) to implement the designed prompt engineering. Initially, we feed the literature into the model to enable a comprehensive understanding of the document’s content. Subsequently, we input the prompts to guide the model in performing information extraction based on its prior understanding. To improve processing efficiency, we developed custom code to call the model and automatically extract geological entities and their semantic relationships in a stepwise manner. Figure 3 illustrates a comparison between the original literature text (top) and the triples extracted by the model (bottom). As shown, the LLM demonstrates a strong ability to accurately identify and extract structured triples from the texts. In total, we processed 178 documents and extracted 3900 triples, averaging approximately 22 triples per document.
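The stepwise workflow described above (document first, prompt second, then collecting the triples) can be outlined as follows. This is a sketch only: the paper does not publish its custom code or say how the local model is served, so `generate` is a hypothetical callable standing in for whatever client sends the document and prompt to the model, and the row parser assumes the reply contains bracketed five-column rows.

```python
import re
from typing import Callable

# Matches one bracketed row such as [Label1, Entity1, Relations, Label2, Entity2].
ROW = re.compile(r"\[([^\[\]]+)\]")


def parse_rows(reply: str) -> list[tuple[str, ...]]:
    """Pull five-column rows out of a model reply, ignoring everything else.

    Assumption: cells contain no commas; a real parser would need quoting.
    """
    rows = []
    for match in ROW.finditer(reply):
        cells = tuple(cell.strip() for cell in match.group(1).split(","))
        is_header = cells[0] == "Label1"
        if len(cells) == 5 and all(cells) and not is_header:
            rows.append(cells)
    return rows


def extract_corpus(
    documents: dict[str, str],
    prompt: str,
    generate: Callable[[str, str], str],
) -> dict[str, list[tuple[str, ...]]]:
    """Run extraction document by document and collect the parsed triples."""
    results = {}
    for name, text in documents.items():
        reply = generate(text, prompt)  # hypothetical model call
        results[name] = parse_rows(reply)
    return results
```

Passing the model client in as a function keeps the loop testable without a running model and makes it easy to swap between model sizes.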

To enhance data management and application efficiency, we systematically compiled all model-generated triples into a single table. This organization clarifies the extracted data, making it more intuitive for further analysis and utilization. A portion of the organized triple data is presented in Table 3. The collated triple data serve as foundational material for constructing the knowledge graph and require additional data processing and analysis to fully realize their potential value.

3.2. Knowledge Alignment and Integration

The original data obtained from the model via preset prompt engineering contain missing values, incorrect entries, and structural inconsistencies arising from variations in entity descriptions across documents and from biases in model interpretation. These problems lead to many attribute-lacking entities and isolated nodes when building the knowledge graph, which greatly impairs Neo4j’s effectiveness in knowledge visualization and linking. The fragmented representation of knowledge undermines the overall usability and logical coherence of the data. To address these challenges, systematic data preprocessing is essential for transforming raw data into actionable information. This process begins with an initial organization of the raw data to eliminate factors that obstruct knowledge visualization and linking [34].

After extracting the data, Python scripts can be utilized to convert triples into Cypher Query Language (CQL) statements, which can then be imported into an Excel spreadsheet for batch preprocessing. Because inconsistent data formats can cause the same entity to appear in Neo4j under several differently formatted names, the first critical step is to standardize the data. Specifically, Excel’s Replace functionality can be employed to unify all characters into English characters, remove spaces and the symbols <>, ‘’, and -, and enforce letter case consistency, thereby ensuring data uniformity prior to graph database integration.
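The standardization and conversion steps can also be done in a script rather than in a spreadsheet. The sketch below is one possible reading of the cleaning rules as literally stated, not the authors' script: `normalize_name`, `label_name`, and `to_cypher` are hypothetical helpers, lower-casing is an assumed choice for "letter case consistency", and the `name` property and camel-cased node labels are assumptions about the graph layout.

```python
import unicodedata

# Characters removed during cleaning: spaces, angle brackets, quotes, hyphens.
_REMOVE = str.maketrans("", "", " <>‘’'-")


def normalize_name(name: str) -> str:
    """Fold full-width characters to ASCII forms, strip symbols, lower-case."""
    folded = unicodedata.normalize("NFKC", name)
    return folded.translate(_REMOVE).lower()


def label_name(label: str) -> str:
    """Turn an ontology label such as 'Metallic Mineral' into 'MetallicMineral'."""
    return "".join(part.capitalize() for part in label.split())


def cypher_string(value: str) -> str:
    """Quote a value as a single-quoted Cypher string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_cypher(label1: str, entity1: str, relation: str,
              label2: str, entity2: str) -> str:
    """Build one MERGE statement for a triple.

    Labels and the relation type are interpolated directly, so they should be
    validated against the ontology vocabulary before this function is called.
    """
    a = cypher_string(normalize_name(entity1))
    b = cypher_string(normalize_name(entity2))
    return (
        f"MERGE (a:{label_name(label1)} {{name: {a}}}) "
        f"MERGE (b:{label_name(label2)} {{name: {b}}}) "
        f"MERGE (a)-[:{relation}]->(b);"
    )
```

`MERGE` is used instead of `CREATE` so that re-importing a triple whose normalized names already exist does not create duplicate nodes.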

The knowledge graph G = {E, R, T} is a directed graph comprising a set of entities E, a set of relations R, and a set of triples T ⊆ E × R × E. Given a source knowledge graph G1 = {E1, R1, T1} and a target knowledge graph G2 = {E2, R2, T2}, the goal of the entity alignment task is to identify the set of aligned entity pairs S = {(u, v) | u ∈ E1, v ∈ E2, u ≡ v}, where ≡ denotes equivalence (i.e., u and v refer to the same concept). As shown in Figure 4, the task thus identifies semantically identical entities across the two graphs.

To make the complex relational data easier to inspect, we visualized it as a graph. By using the third-party extension plugin “<Neo4jplugin>” in Gephi, we can connect directly to the Neo4j database and import nodes, relationships, and attribute data into Gephi for visualization. In the node table, nodes can be manually reviewed and merged by sorting their names alphabetically to identify similar entries. Once the merging process is complete, a standardized dictionary (see Table 4) can be created to unify entity names, thereby eliminating entities that denote the same meaning but differ in representation. This approach not only ensures the alignment and integration of knowledge, effectively addressing the fragmented representation of knowledge mentioned earlier, but also provides a standardized linguistic foundation for subsequent data processing. We conducted a statistical analysis of triple data for 30 gold deposits, focusing on definition issues and the occurrence of isolated nodes during the extraction process. On average, each of the 30 deposits presented 9.03 definition-related issues, such as inconsistent geological entity definitions and partial misinterpretations of certain entities, along with 2.06 isolated nodes.

To further enhance node clustering, we employed a relation-based modularity community clustering algorithm that uses the Levenshtein distance to assess node similarity. Because this algorithm relies primarily on node relationships for clustering, nodes with strong connections are automatically grouped into the same community. This method allows for the automatic aggregation of similar entities, thereby reducing the need for manual intervention. However, in practice, some communities may contain fewer than three nodes. When data are missing in small communities, it is essential to document the entities within these communities, locate them in Neo4j, and adopt appropriate patching strategies. Isolated nodes may arise when the relationship between an entity and the deposit or ore body is too implicit, or when the contextual span is excessively long, exceeding the model’s context window limit and leading to memory loss. For isolated nodes, the source documents should be reviewed to identify the relationships between the entity and the deposits. These relationships can then be completed using CQL statements. For entities lacking attributes in relationships, the standardized dictionary should be referenced to define the missing entities using CQL statements.
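The name-similarity part of this step can be illustrated with a plain Levenshtein distance. This is a sketch of that component only: it does not implement modularity-based community detection (performed in Gephi in this study), and the `similar_pairs` helper and its `max_distance` threshold are assumptions introduced for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `a` into `b` (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similar_pairs(names: list[str], max_distance: int = 2) -> list[tuple[str, str]]:
    """List pairs of distinct names within `max_distance` edits of each other,
    as candidates for manual merging into the standardized dictionary."""
    ordered = sorted(set(names))
    return [
        (x, y)
        for i, x in enumerate(ordered)
        for y in ordered[i + 1:]
        if levenshtein(x, y) <= max_distance
    ]
```

The pairwise comparison is quadratic in the number of names, which is acceptable for a few thousand nodes; the candidate pairs still need expert review, because short edit distance does not guarantee that two names denote the same geological entity.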

3.3. Knowledge Graph Visualization

The data processing described above eliminates ambiguities, duplicates, and other redundant nodes, fills gaps, and resolves multilingual conflicts, thereby significantly enhancing the accuracy and completeness of the data. After converting the processed data into CQL statements, we presented the data dynamically using the built-in Neo4j Browser. The resulting graph contains 3738 nodes of geological entities and 3900 semantic relationships.

3.4. Knowledge Service

The visualized knowledge graph organizes information about gold deposits in a diagrammatic format, creating an interactive knowledge database for human–machine interaction [11]. This knowledge graph not only facilitates rapid querying, reasoning, and discovery of information on gold deposits but also provides geologists with an intuitive and efficient platform for exploring complex relationships and patterns within the data. Its capability to synthesize and present intricate information makes the knowledge graph a powerful tool for understanding geological processes, optimizing exploration strategies, and supporting decision-making. Below are some specific applications of the gold deposit knowledge graph.

Cypher is the query language tailored for Neo4j, enabling users to query, modify, and update nodes and relationships in a graph through simplified pattern matching. It has been widely used for geoscience knowledge querying, retrieval, and services [11,29,35]. For instance, Cypher can be used to retrieve specific information about a mineral deposit, such as the Zopkhito gold deposit depicted in Figure 5. The query results clearly display details such as location, mineral composition, alteration processes, and other attributes.

Additionally, knowledge graphs can uncover potential insights. This study examines the Huangjindong deposit in Hunan Province and employs Jaccard similarity to compare its relationships with large and super-large gold deposits worldwide [36]. Jaccard similarity is defined as the ratio of the size of the intersection of sets A and B to the size of their union [37]. The nine gold deposits most similar to the Huangjindong deposit are listed in Table 5.
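The similarity measure defined above is short enough to state in code. The sketch below assumes each deposit is represented by the set of entity names linked to it in the graph; `rank_similar` is a hypothetical helper, and the deposit names used with it in any example are placeholders rather than values from Table 5.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by the size of the union.

    Two empty sets are treated as having similarity 0.0 (a convention
    chosen here to avoid division by zero).
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def rank_similar(
    target: str,
    deposits: dict[str, set[str]],
    top: int = 9,
) -> list[tuple[str, float]]:
    """Rank every other deposit by Jaccard similarity to `target`,
    highest first, breaking ties alphabetically."""
    scores = [
        (name, jaccard(deposits[target], entities))
        for name, entities in deposits.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top]
```

For example, a deposit with entities {pyrite, quartz, gold} and another with {pyrite, quartz} share two of three distinct entities, giving a similarity of 2/3.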

The nine gold deposits and their associated mineral entities queried from the knowledge graph are illustrated in Figure 6. The mineral associations between the Huangjindong deposit and other deposits are more readily observed. From Figure 6 and the similarity calculations, it is evident that the Huangjindong deposit is most similar to the Boroo Gold Deposit, Badran orogenic gold deposit, and Muruntau gold deposit. This suggests that these deposits share similar mineral types, indicating they may have formed through comparable processes or geological histories. This information provides valuable insights into the formation of gold deposits.

3.5. LLM Performance in the Extraction of Geological Entities and Their Semantic Relations

To demonstrate the effectiveness of LLMs in extracting entities and relationships within the geological domain, we conducted a case study involving four representative gold deposits. We collected 13 relevant research articles on these deposits, manually extracted the information they contain, and compared it with the results obtained from single-text extraction and multi-text deduplication using an LLM. The findings of this analysis are summarized in Table 6.

When extracting entities and relationships from a single text using an LLM, the model’s performance in terms of precision, recall, and F1-score is limited due to the restricted coverage of a single document. However, by leveraging the contextual learning capabilities of LLMs, integrating information from multiple texts, and applying deduplication techniques, extraction accuracy can be significantly improved. To further validate this hypothesis, we constructed knowledge graphs for the selected gold deposits using both manual and LLM-based extraction (Figure 7: upper panel, manually constructed; lower panel, LLM-generated).
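The comparison between single-text and multi-text extraction can be made concrete with set-based precision, recall, and F1. The sketch below is illustrative: it assumes triples are compared by exact match against the manually extracted set, which the text does not specify, and `merge_documents` is a hypothetical helper for the deduplication step.

```python
from typing import Hashable, Iterable


def merge_documents(per_document: Iterable[Iterable[Hashable]]) -> set:
    """Union the triples from several documents; duplicates collapse
    because identical triples are equal set members."""
    merged = set()
    for triples in per_document:
        merged.update(triples)
    return merged


def precision_recall_f1(predicted: set, gold: set) -> tuple[float, float, float]:
    """Score predicted triples against a manually extracted reference set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under this scoring, a single document can only recover the reference triples it actually mentions, so its recall is capped by its coverage; merging several documents raises that cap, while any errors they contribute are also pooled.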

The analysis results demonstrate that supplementing the LLM with multi-text data significantly improves the accuracy and completeness of the generated knowledge graph. While the precision of LLM-based extraction remains slightly lower than that of manual extraction, LLMs exhibit a notable advantage in terms of time efficiency. For example, manually constructing a knowledge graph for a single deposit takes approximately two hours, whereas an LLM can complete the same task in five minutes. This efficiency underscores the substantial potential of LLMs to rapidly generate large-scale knowledge graphs in the geological domain.

To further investigate the capabilities of LLM in extracting geological entities and their semantic relationships, we conducted experiments using DeepSeek models of varying parameter sizes (hash: 7b: sha256:96c415656d377afbff962f6cdb2394ab092ccbcbaab4b82525bc4ca800fe8a49, 14b: sha256:6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e, 32b: sha256:6150cb382311b69f09cc0f9a1b69fc029cbd742b66bb8ec531aa5ecf5c613e93, 70b: sha256:4cd576d9aa16961244012223abf01445567b061f1814b57dfef699e4cf8df339, 671b: sha256:fde1b97ae799921a2b3ad6aaa59da5057e1665256e7a2639ca3b40dd82b5bc9e). Based on the experimental results, we performed curve fitting for both the models’ F1 Scores and their computational power requirements (Figure 8). The fitted curves reveal that increasing model parameters significantly enhances extraction capability. However, this performance improvement diminishes asymptotically as model size grows larger. Meanwhile, computational demands increase substantially with model scale. Consequently, for the DeepSeek-R1 architecture, models within the 32B–70B parameter range exhibit notably higher cost-effectiveness. This makes them an optimal choice when computational resources are constrained.

4. Discussion

4.1. Key Insights and Progress Achieved

A knowledge graph serves as a vital tool for storing and representing structured data, facilitating the organization and retrieval of complex information. This study introduces a method for swiftly constructing large-scale GeoKGs using LLMs. Initially, we collected datasets and constructed the knowledge model. We then used LLMs to process unstructured and semi-structured geoscience data, guiding them to extract geological entities and their semantic relationships through a meticulously designed prompt engineering approach. During this process, we developed custom code to implement a stepwise automated extraction workflow, thereby significantly improving processing efficiency. After extracting the triples, we conducted knowledge alignment and integration in Gephi, removing redundant nodes and relationships and enhancing the consistency of the knowledge graph. Finally, the gold deposit knowledge graph was visualized in Neo4j, enabling structured and interactive access to geological information. The findings from this study indicate that applying LLMs to knowledge graph construction significantly improves efficiency and reduces manual annotation costs, establishing a practical methodology.

4.2. Comparison of Methods and Limitations of LLM

In tasks involving the extraction of geological entities and their semantic relationships, traditional deep learning methods (such as models based on RNN, bidirectional long short-term memory [BiLSTM], convolutional neural network, or BERT) typically demonstrate high accuracy and recall rates on specific datasets. For instance, on the widely used CoNLL-2003 dataset, the F1 score of the classical BiLSTM-CRF model is approximately 0.91, whereas the BERT-based model achieves 0.92–0.93 on the same dataset [47,48,49,50,51]. However, the number of entities extracted by these models ranges from 6 to 8 [52]. To more comprehensively compare LLMs and deep learning for entity and semantic relation extraction, we propose a new index, “extracting weighted F1” (EW-F1). This index multiplies the standard F1 score by the average number of extracted entities, thereby reflecting both the accuracy of the extraction and the model’s ability to capture a wide range of entities. For example, a supervised deep learning model may achieve an F1 score of 0.92 while extracting an average of 7 entities per document, resulting in an EW-F1 of 6.44. In contrast, an LLM might achieve an F1 score of 0.70 but extract an average of 22 entities, yielding an EW-F1 of 15.4. This suggests that while deep learning methods generally maintain high accuracy, LLMs may ultimately outperform them in terms of the total number of extracted entities, particularly in more diverse or sparser labeling contexts.
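Written as a formula, the index described above is the product of the standard F1 score and the average number of extracted entities per document. The symbol \bar{N} for that average is notation introduced here for convenience; the two worked values restate the examples already given in the text.

```latex
\mathrm{EW\text{-}F1} = F_1 \times \bar{N}
\qquad\text{e.g.}\qquad
0.92 \times 7 = 6.44,
\qquad
0.70 \times 22 = 15.4
```

Because the index scales linearly with \bar{N}, it rewards breadth of extraction as much as accuracy, so it is best read alongside the plain F1 score rather than in place of it.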

LLMs are highly efficient in extracting geological entities and their semantic relationships using a straightforward approach. However, we encountered several challenges during the study. First, LLMs have limitations in comprehending the information contained in geological literature. They struggle to analyze data, trends, or specific symbols in charts and graphs, resulting in incomplete information extraction and undermining the overall integrity of the knowledge retrieval process. Notably, the hierarchical organization of information in the source literature can also influence extraction accuracy. When explicit hierarchy or attribution markers are absent, LLMs may misattribute regional features to specific deposits, leading to inaccurate entity–relationship matching. This highlights that the model’s ability to understand and reason about complex domain knowledge remains insufficient [53]. Second, although LLMs can produce “pipeline” outputs through multiple rounds of prompting and integration with other tools, their ultimate performance still varies with the overall system design [22]. Compared to alternative approaches, an LLM used for extraction functions like a “black box”: its internal operations are opaque, which makes tracing or verifying the source of extracted triplets or quintuplets challenging. Third, the quality of the input text also limits the LLM’s comprehension. For historical papers printed on low-quality paper with poor transcription quality, the LLM struggles to capture the information they contain. Fourth, for some papers, only the abstract is written in English, while more detailed studies are presented in local languages. Regarding multilingual challenges, existing technologies and algorithms still underperform in various cross-lingual tasks, making it difficult to satisfy practical application standards [54]. Fifth, the LLM output may contain issues such as isolated nodes, misaligned entity names, and misinterpretations of certain entities, requiring varying degrees of post-processing depending on the context. Finally, the model’s knowledge base is limited in its timeliness and version updates, making it difficult for the model to reflect the latest advances in the field and to keep pace with ongoing research and cutting-edge technologies.

4.3. Future Development

The geoscience domain is characterized by vast amounts of data and a diverse array of data types. These data can be categorized into image, curve, and record data. Beyond geological literature, extracting data from diverse sources and formats—such as remote-sensing imagery, geophysical exploration data, and geochemical analysis data—enriches knowledge graph construction, a key area for future development.

While LLMs show versatile generalist skills, they are often inaccurate in geology, misunderstanding nuanced geological terms, hierarchical rock and mineral categories, and intricate relationships. Additionally, when extracting data from specific regions (e.g., a particular tectonic belt), existing models struggle to accurately learn and integrate the distinctive geological markers and genetic patterns characteristic of the area. Current models also struggle to process and correlate data from multiple sources and in different formats. For example, combining textual descriptions of a region with spatial data, such as geological maps and cross-sections, remains challenging. Furthermore, as new discoveries continue to emerge in geoscience, a significant challenge will be leveraging LLMs to efficiently and dynamically update knowledge graphs.

To address these limitations, future work must move beyond simple application and develop targeted strategies. We can implement domain-adapted strategies by pretraining on domain-specific geological corpora (e.g., gold deposit datasets from the Jiangnan Orogenic Belt in China) to enable task-specific optimization. However, more transformative solutions are needed. Future frameworks should explore hybrid architectures that tightly couple LLMs with expert geological databases, spatial reasoning engines, and models for interpreting non-textual data, thereby enabling true multimodal knowledge fusion. With advances in technology, future tools for knowledge graph construction, combined with LLMs, may automatically identify errors or outdated information within the graph and make corrections through self-optimization mechanisms.

5. Conclusions

In this study, we present a method for constructing knowledge graphs using an LLM. Based on our findings, we draw the following conclusions: (1) Utilizing an LLM for large-scale extraction of geological entities and semantic relationships in the geoscience field is both feasible and efficient. (2) The integration of LLMs with knowledge graph construction effectively facilitates cross-source data integration and semantic fusion. (3) Combining automated extraction with manual verification can significantly enhance construction efficiency and ensure the accuracy of the extraction process. (4) By experimenting with locally deployed LLMs under different parameter configurations, we identified the most cost-effective setup. This finding offers a clear reference for selecting models in the extraction of geological entities and their semantic relationships. (5) The global knowledge graph we developed for typical significant to super-large gold deposits can serve as a valuable reference and provide data support for research on gold deposits.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/min16010050/s1; Prompt Engineering for guiding LLM to extract the geological entities and their semantic relations of gold deposits.

Author Contributions

J.Z.: Investigation, Methodology, Writing—original draft preparation. Y.W.: Investigation, Methodology. W.T.: Writing—original draft preparation, Data Curation. C.W.: Conceptualization, Investigation, Data Curation, Writing-Review and Editing, Project administration, Funding acquisition. S.L.: Data Curation, Investigation. M.W.: Data Curation, Investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (2022YFF0801202, 2022YFF0801200, 2024ZD1001205), the National Natural Science Foundation of China (41902305), corporate projects of the Geological Survey and Mapping Institute of Yunnan Province and Yunnan Gold Mining Group Co., Ltd., the Knowledge Innovation Program of Wuhan-Shuguang (2023010201020332), and the Fundamental Research Funds (CUG-DMX2025-01) for the Central Universities, China University of Geosciences (Wuhan).

Data Availability Statement

The source codes are available on GitHub at https://github.com/wangcug/LLM4GoldKG (accessed on 31 August 2025).

Conflicts of Interest

Shengmiao Li is a researcher at the Geological Survey Institute of Hunan Province. Mingguo Wang is a researcher at the Yunnan Geological Big Data Center, Geological Survey and Mapping Institute of Yunnan Province. The paper reflects the views of the scientists and not of their institutes.

Abbreviations

The following abbreviations are used in this manuscript:

LLM	Large Language Model
GeoKG	Geoscience Knowledge Graph
BERT	Bidirectional Encoder Representations from Transformers
RoBERTa	Robustly Optimized BERT Pretraining Approach
RNN	Recurrent Neural Network
NLP	Natural Language Processing
ER Model	Entity Relation Triple Model
CQL	Cypher Query Language
BiLSTM	Bidirectional Long Short-term Memory
BiLSTM-CRF	Bidirectional Long Short-term Memory-Conditional Random Field

References

1. Sun, K.; Hu, Y.; Song, J.; Zhu, Y. Aligning geographic entities from historical maps for building knowledge graphs. Int. J. Geogr. Inf. Sci. 2021, 35, 2078–2107.
2. Wang, S.; Zhang, X.; Ye, P.; Du, M.; Lu, Y.; Xue, H. Geographic Knowledge Graph (GeoKG): A Formalized Geographic Knowledge Representation. ISPRS Int. J. Geo-Inf. 2019, 8, 184.
3. Gutierrez, C.; Sequeda, J.F. Knowledge graphs. Commun. ACM 2021, 64, 96–104.
4. Chen, T.; Wang, X.; Yue, T.; Bai, X.; Le, C.X.; Wang, W. Enhancing Abstractive Summarization with Extracted Knowledge Graphs and Multi-Source Transformers. Appl. Sci. 2023, 13, 7753.
5. Zhou, C.; Wang, H.; Wang, C.; Hou, Z.; Zheng, Z.; Shen, S.; Cheng, Q.; Feng, Z.; Wang, X.; Lv, H.; et al. Geoscience knowledge graph in the big data era. Sci. China Earth Sci. 2021, 64, 1105–1114.
6. Enkhsaikhan, M.; Holden, E.-J.; Duuring, P.; Liu, W. Understanding ore-forming conditions using machine reading of text. Ore Geol. Rev. 2021, 135, 104200.
7. Qiu, Q.; Xie, Z.; Wu, L.; Tao, L.; Li, W. BiLSTM-CRF for geological named entity recognition from the geoscience literature. Earth Sci. Inform. 2019, 12, 565–579.
8. Qun, N.; Yan, H.; Qiu, X.P.; Huang, X.J. Chinese Word Segmentation via BiLSTM+Semi-CRF with Relay Node. J. Comput. Sci. Technol. 2020, 35, 1115–1126.
9. Zhong, Y.; Liu, X.; Wang, J.; Chen, Y.; Zhang, T. Research of Extraction on Petroleum Unstructured Information Based on Named Entity Recognition. J. Southwest. Pet. Univ. 2020, 42, 165–173. (In Chinese with English abstract)
10. Chu, D.; Wan, B.; Li, H.; Fang, F.; Wang, R. Geological Entity Recognition Based on ELMO-CNN-BiLSTM-CRF Model. Diqiu Kexue—Zhongguo Dizhi Daxue Xuebao/Earth Sci.—J. China Univ. Geosci. 2021, 46, 3039–3048. (In Chinese with English abstract)
11. Wang, C.; Li, Y.; Chen, J.; Ma, X. Named entity annotation schema for geological literature mining in the domain of porphyry copper deposits. Ore Geol. Rev. 2023, 152, 105243.
12. Fan, R.; Wang, L.; Yan, J.; Song, W.; Zhu, Y.; Chen, X. Deep learning-based named entity recognition and knowledge graph construction for geological hazards. ISPRS Int. J. Geo-Inf. 2019, 9, 15.
13. Enkhsaikhan, M.; Liu, W.; Holden, E.J.; Duuring, P. Auto-labelling entities in low-resource text: A geological case study. Knowl. Inf. Syst. 2021, 63, 695–715.
14. Wang, B.; Wu, L.; Xie, Z.; Qiu, Q.; Zhou, Y.; Ma, K.; Tao, L. Understanding geological reports based on knowledge graphs using a deep learning approach. Comput. Geosci. 2022, 168, 105229.
15. Chen, Z.; Yuan, F.; Li, X.; Wang, X.; Li, H.; Wu, B.; Chen, Y. Knowledge Extraction and Quality Inspection of Chinese Petrographic Description Texts with Complex Entities and Relations Using Machine Reading and Knowledge Graph: A Preliminary Research Study. Minerals 2022, 12, 1080.
16. Yu, Y.; Wang, Y.; Mu, J.; Li, W.; Jiao, S.; Wang, Z.; Lv, P.; Zhu, Y. Chinese mineral named entity recognition based on BERT model. Expert Syst. Appl. 2022, 206, 117727.
17. Chen, R.; Lei, J.; Yao, H.; Li, T.; Li, S. Anchor-Enhanced Geographical Entity Representation Learning. IEEE Trans. Neural Netw. Learn. Syst. 2023, 36, 924–938.
18. Wang, W.; Ma, L.; Chen, M.; Du, Q. Joint Correlation Alignment-Based Graph Neural Network for Domain Adaptation of Multitemporal Hyperspectral Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 3170–3184.
19. Wang, C.; Ma, X.; Chen, J.; Chen, J. Information extraction and knowledge graph construction from geoscience literature. Comput. Geosci. 2018, 112, 112–120.
20. Li, S.; Chen, J.; Xiang, J. Prospecting Information Extraction by Text Mining Based on Convolutional Neural Networks-A Case Study of the Lala Copper Deposit, China. IEEE Access 2018, 6, 52286–52297.
21. Tian, M.; Ma, K.; Wu, Q.; Qiu, Q.; Tao, L.; Xie, Z. Joint extraction of entity relations from geological reports based on a novel relation graph convolutional network. Comput. Geosci. 2024, 187, 105571.
22. Feng, Q.; Zhao, T.; Liu, C. A Pipeline-Based Approach for Automated Construction of Geoscience Knowledge Graphs. Minerals 2024, 14, 1296.
23. Fu, Y.; Wang, M.; Wang, C.; Dong, S.; Chen, J.; Wang, J.; Yu, H.; Huang, J.; Chang, L.; Wang, B. GeoMinLM: A Large Language Model in Geology and Mineral Survey in Yunnan Province. Ore Geol. Rev. 2025, 182, 106638.
24. Zhang, J.; Clairmont, C.; Que, X.; Li, W.; Chen, W.; Li, C.; Ma, X. Streamlining geoscience data analysis with an LLM-driven workflow. Appl. Comput. Geosci. 2025, 25, 100218.
25. Chen, P.P.-S. The Entity-Relationship Model-Toward a Unified View of Data. In Readings in Artificial Intelligence and Databases; Morgan Kaufmann: Burlington, MA, USA, 1976.
26. Giray, L. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Ann. Biomed. Eng. 2023, 51, 2629–2633.
27. Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the EMNLP 2021—2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059.
28. Hitzler, P.; Krötzsch, M.; Parsia, B.; Patel-Schneider, P.F.; Rudolph, S. OWL 2 Web Ontology Language Primer. In Encyclopedia of Social Network Analysis and Mining, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2012.
29. Wang, C.; Tan, L.; Li, Y.; Wang, M.; Ma, X.; Chen, J. Ontology-driven relational data mapping for constructing a knowledge graph of porphyry copper deposits. Earth Sci. Inform. 2024, 17, 2649–2660.
30. Noy, N.F.; Mcguinness, D.L. Ontology Development 101: A Guide to Creating Your First Ontology; Stanford Medical Informatics: Stanford, CA, USA; Available online: http://protege.stanford.edu/publications/ontology_development/ontology101.html (accessed on 10 October 2025).
31. Bai, R.; Chen, Q.; Zhang, Y.; Yang, C. Generating Effectiveness Entities of Patent Technology Based on ChatGPT+Prompt. Data Anal. Knowl. Discov. 2024, 8, 14–25.
32. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2023, 35, 24824–24837.
33. Kekelia, S.A.; Kekelia, M.A.; Kuloshvili, S.I.; Sadradze, N.G.; Gagnidze, N.E.; Yaroshevich, V.Z.; Asatiani, G.A.; Doebrich, J.L.; Goldfarb, R.J.; Marsh, E.E. Gold deposits and occurrences of the Greater Caucasus, Georgia Republic: Their genesis and prospecting criteria. Ore Geol. Rev. 2008, 34, 369–386.
34. Wang, C.; Ma, X.; Chen, J. Data preprocessing technology is applied in geoscience big data. Acta Petrol. Sin. 2018, 34, 303–313. (In Chinese with English abstract)
35. Wu, R.; Huang, M.; Ma, H.; Huang, J.; Li, Z.; Mei, H.; Wang, C. A Multi-Temporal Knowledge Graph Framework for Landslide Monitoring and Hazard Assessment. GeoHazards 2025, 6, 39.
36. Jaccard, P. Étude comparative de la distribution florale dans une portion des Alpes et du Jura. Bull. Soc. Vaudoise Sci. Nat. 1901, 37, 547–579.
37. Jaccard, P. The Distribution of the Flora in the Alpine Zone. New Phytol. 1912, 11, 37–50.
38. Fridovsky, V.; Kryazhev, S.; Polufuntikova, L.; Kudrin, M.; Anisimova, G. Geology, Fluid Inclusions, Mineral and (S-O) Isotope Chemistry of the Badran Orogenic Au Deposit, Yana-Kolyma Belt, Eastern Siberia: Implications for Ore Genesis. Front. Earth Sci. 2024, 12, 1340112.
39. Kempe, U.; Graupner, T.; Seltmann, R.; De Boorder, H.; Dolgopolova, A.; Zeylmans Van Emmichoven, M. The Muruntau Gold Deposit (Uzbekistan)—A Unique Ancient Hydrothermal System in the Southern Tien Shan. Geosci. Front. 2016, 7, 495–528.
40. Galdos, R.; Vallance, J.; Baby, P.; Salvi, S.; Schirra, M.; Velasquez, G.; Viveen, W.; Soto, R.; Pokrovski, G.S. Origin and Evolution of Gold-Bearing Fluids in a Carbon-Rich Sedimentary Basin: A Case Study of the Algamarca Epithermal Gold-Silver-Copper Deposit, Northern Peru. Ore Geol. Rev. 2024, 166, 105857.
41. Khishgee, C.; Akasaka, M. Mineralogy of the Boroo Gold Deposit in the North Khentei Gold Belt, Central Northern Mongolia. Resour. Geol. 2015, 65, 311–327.
42. Sylla, S.; Gueye, M.; Ngom, P.M. New Approach of Structural Setting of Gold Deposits in Birimian Volcanic Belt in West African Craton: The Example of the Sabodala Gold Deposit, SE Senegal. Int. J. Geosci. 2016, 7, 440–458.
43. Qingdong, Z.; Jianming, L.; Hongtao, L. Geology and Geochemistry of the Bianbianshan Au-Ag-Cu-Pb-Zn Deposit, Southern Da Hinggan Mountains, Northeastern China. Acta Geol. Sin. (Engl. Ed.) 2012, 86, 630–639.
44. Zhang, Y.; Wu, Y.; Li, H.; Han, J.; Song, Q. Genesis of the Jinying Gold Deposit, Southern Jilin Province, NE China: Constraints from Geochronology and Isotope Geochemistry. Geol. Mag. 2023, 160, 1761–1774.
45. Wang, Q.; Deng, T.; Xu, D.; Lin, Y.; Liu, G.; Tang, H.; Zhou, L.; Zhang, J. Genetic Association between Carbonates and Gold Precipitation Mechanisms in the Jinshan Deposit, Eastern Jiangnan Orogen. Geol. Soc. Am. Bull. 2024, 136, 4195–4217.
46. Zhen, S.-M.; Zhu, X.-Y.; Li, Y.-S.; Du, Z.-Z.; Gong, F.-Y.; Gong, X.-D.; Qi, F.-Y.; Jia, D.-L.; Wang, L. Zircon U-Pb geochronology and Hf isotopic compositions of the monzonite, related to the Xianrenyan gold deposit in Hunan province and its geological significances. Jilin Daxue Xuebao (Diqiu Kexue Ban)/J. Jilin Univ. (Earth Sci. Ed.) 2012, 42, 1740–1756.
47. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Available online: https://github.com/tensorflow/tensor2tensor (accessed on 10 October 2025).
48. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. arXiv 2016, arXiv:1603.01360.
49. Dong, J.; Qiu, Q.; Xie, Z.; Ma, K.; Hu, A.; Wang, H. Understanding table content for mineral exploration reports using deep learning and natural language processing. Ore Geol. Rev. 2023, 156, 105383.
50. Huang, X.; Zhu, Y.; Fu, L.; Liu, Y.; Tang, K.; Li, J. Research on a geological entity relation extraction model for gold mine based on BERT. J. Geomech. 2021, 27, 391–399.
51. Ma, K.; Tian, M.; Tan, Y.; Xie, X.; Qiu, Q. What is this article about? Generative summarization with the BERT model in the geosciences domain. Earth Sci. Inform. 2022, 15, 21–36.
52. Ngomo, A.-C.N.; Röder, M.; Moussallem, D.; Usbeck, R.; Speck, R. BENGAL: An Automatic Benchmark Generator for Entity Recognition and Linking. arXiv 2017, arXiv:1710.08691.
53. Kumar, S. A Survey of Deep Learning Methods for Relation Extraction. arXiv 2017, arXiv:1705.03645.
54. Huang, K.; Mo, F.; Zhang, X.; Li, H.; Li, Y.; Zhang, Y.; Yi, W.; Mao, Y.; Liu, J.; Xu, Y.; et al. A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers. arXiv 2024, arXiv:2405.10936.

Figure 1. Knowledge model of ore deposits, representing the relationships between different geological entities.

Figure 2. Part of the prompt project that accurately guides the LLM to complete the specified task through statements.

Figure 3. The comparison between the original literature text (top) and the triples extracted by the model (bottom). Colors indicate different types of Entity2. The paragraph is excerpted from Kekelia et al. [33].

Figure 4. Comparison of geological entities before and after alignment: (a) shows the results before alignment, and (b) shows the results after alignment.

Figure 5. The upper panel shows the Zopkhito gold deposit and its associated entities. Different colors represent different entity types. The lower panel shows an example Cypher query.

Figure 6. This network diagram visualizes mineral deposits analogous to the Huangjindong deposit and their associated minerals. Nodes represent entities: green for deposit nodes and yellow for mineral nodes. Connecting lines represent the relationship “hasMinerals” between them.

Figure 7. This figure compares knowledge graphs made by two methods: manual extraction (top) and LLM extraction (bottom). Orange nodes represent mineral deposits. Dark blue nodes indicate what the LLM missed—they are present in the manual graph but absent from the LLM graph.

Figure 8. The figure presents fitted curves showing the entity relation extraction capability (F1-score, blue curve) and computational demand (red curve) across DeepSeek models of varying parameter sizes, with computational requirements normalized against the 671B-parameter model as the baseline (671B = 1). It illustrates the trade-off between performance and computational requirements.

Table 1. Definition of geological entities in this knowledge graph.

Entity Type	Definition	Entity Type	Definition
LOCATION	Geographical location of the ore deposit	OREBODY	Geological body with mining value
GEOLOGIC TIME	Different time periods in the geological evolution process	CHRONOSTRATIGRAPHY	Erathem, system, series, stage
GEOLOGICAL BACKGROUND	Geological conditions of the region during the formation of deposit	ROCK MASS	Geological body composed of rocks with a certain structure and fabric
GEOLOGICAL EVENT	Representative phenomena that occurred during geological history	LITHOSTRATIGRAPHY	Lithotope units, including group, formation, member, bed
METALLIC MINERAL	Minerals with metallic properties contained in the ore deposit or ore body	DEPOSIT	Geological body that contains mineral resources capable of being exploited and utilized
NONMETALLIC MINERAL	Minerals without metallic properties contained in the ore deposit or ore body	ELEMENT	Includes major element, trace element, and isotope
FRACTURE	Joints, cleavage, and faults	MINERALIZATION	Geological processes occurring during the metallogenic process
PLEATED STRUCTURE	Plastic deformation of rock formation	PROSPECTING SIGN	Indicate the possible presence of an ore deposit
ALTERATION TYPE	Phenomena resulting from the interaction between hydrothermal fluids and surrounding rocks	EXPLORATION ENGINEERING	Engineering layout during the mineral exploration and survey process
SEDIMENTARY ROCK	Rocks formed by sedimentary processes	MINERALIZATION TYPE	Enrichment process and form of mineral concentration
IGNEOUS ROCK	Includes effusive rock and intrusive rock	FLUID INCLUSION TYPE	Such as gas-phase inclusions and liquid-phase inclusions
METAMORPHIC ROCK	Rocks formed by metamorphic processes, such as marble and mylonite	GEOLOGICAL ANOMALY	Geological, geophysical, geochemical, and remote sensing anomalies

Table 2. Definition of semantic relations in this knowledge graph.

Relation Type	Definition
hasAlteration	Mineralization phenomena and alteration types occurring near ore bodies or rock masses, such as the presence of potassium feldspar alteration, silicification, etc., in a certain mining area
hasMinerals	Metallic and nonmetallic minerals formed within the ore deposit or ore body, such as chalcopyrite, bornite, actinolite, etc.
hasElement	Chemical element composition contained in rock or mineral samples, such as Cu, Pb, etc.
isControlledBy	Connection between ore bodies, rock masses, etc., and controlling factors, indicating the controlling role and influence in the metallogenic process
isRelatedTo	Certain geological units that appear in ore deposits and related geological bodies, such as the relation between the ore deposit and the three main rock types or rock masses
isRevealedBy	Rock masses or ore bodies are revealed by certain engineering operations, such as pit exploration, trench exploration, drilling exploration, geophysical exploration and other exploration methods
isFoundIn	Geological anomalies discovered in a certain part of the ore deposit, such as the detection of Cu, Zn, and other element anomalies within a specific fault
isFormedIn	Age of ore deposit formation, age of associated strata, and the geological events that occurred during the formation period
isLocatedIn	Location information about the mining area, including the administrative region and geological background information
isAnalyzedBy	Semantic relationship between exploration methods and research subjects, referring to the geological exploration methods used when conducting exploration in a specific study area, such as the induced polarization sounding method

Table 3. Partial display of geological entities and semantic relationships in the form of all triples organized manually.

Label	Entity1	Relation	Label	Entity2
Deposit	Seabee_Gold_Deposit	isLocatedIn	Location	Northern_Saskatchewan_Canada
Deposit	Seabee_Gold_Deposit	isFormedIn	Geological_Time	Paleoproterozoic
Deposit	Seabee_Gold_Deposit	hasAlteration	Mineralization	Orogenic_gold_mineralization
Deposit	Seabee_Gold_Deposit	hasAlteration	Mineralization	Remobilization
Metamorphic_Rock	Chlorite_hornblende_biotite_schist	isRelatedTo	Geological_Background	Greenschist_to_amphibolite_grade
Metallic_Minerals	Pyrite	hasElement	Element	Fe
Metallic_Minerals	Gold	hasElement	Element	Au
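As a rough illustration of how rows in this five-column layout could be loaded into a graph database queried with Cypher, the sketch below turns each row into a `MERGE` statement. The `Triple` type, the `to_cypher` helper, and the `name` node property are assumptions made for this example, not the schema or import code used in the study.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One row of Table 3: Label, Entity1, Relation, Label, Entity2."""
    label1: str
    entity1: str
    relation: str
    label2: str
    entity2: str


def to_cypher(t: Triple) -> str:
    """Build a Cypher statement that creates both nodes and the edge
    only if they do not exist yet. The `name` property is assumed."""
    return (
        f"MERGE (a:{t.label1} {{name: '{t.entity1}'}}) "
        f"MERGE (b:{t.label2} {{name: '{t.entity2}'}}) "
        f"MERGE (a)-[:{t.relation}]->(b)"
    )


rows = [
    Triple("Deposit", "Seabee_Gold_Deposit", "isLocatedIn",
           "Location", "Northern_Saskatchewan_Canada"),
    Triple("Metallic_Minerals", "Pyrite", "hasElement", "Element", "Fe"),
]

for row in rows:
    print(to_cypher(row))
```

Using `MERGE` rather than `CREATE` keeps repeated entities, such as the same deposit appearing in many rows, as a single node. In production code the entity names would normally be passed as query parameters instead of being formatted into the string.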

Table 4. Example of a small-scale geological entity alignment standardization dictionary.

Original Pattern	Standard Pattern
Abnormal_dispersion_of_gold	Abnormal_Au_element
Abnormal_gold	Abnormal_Au_element
Abnormal_gold_grade	Abnormal_Au_element
Xiaolong_mining_area	Xiaolong_gold_deposit
Xiaolong_gold_deposit	Xiaolong_gold_deposit
Xiaolong_gold_mine	Xiaolong_gold_deposit
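A standardization dictionary of this kind can be applied to extracted triples with a simple lookup. The sketch below is only a minimal illustration: the mapping is copied from Table 4, while the function names and the two example triples are hypothetical.

```python
# Mapping from Table 4: original pattern -> standard pattern.
ALIGNMENT = {
    "Abnormal_dispersion_of_gold": "Abnormal_Au_element",
    "Abnormal_gold": "Abnormal_Au_element",
    "Abnormal_gold_grade": "Abnormal_Au_element",
    "Xiaolong_mining_area": "Xiaolong_gold_deposit",
    "Xiaolong_gold_deposit": "Xiaolong_gold_deposit",
    "Xiaolong_gold_mine": "Xiaolong_gold_deposit",
}


def align(name: str) -> str:
    """Return the standard form of an entity name, or the name
    itself when the dictionary has no entry for it."""
    return ALIGNMENT.get(name, name)


def align_triples(triples):
    """Standardize head and tail entities, then drop the duplicates
    that appear once variant names collapse to one form."""
    return sorted({(align(h), r, align(t)) for h, r, t in triples})


# Hypothetical triples that name the same deposit in two ways.
example = [
    ("Xiaolong_gold_mine", "hasMinerals", "Pyrite"),
    ("Xiaolong_mining_area", "hasMinerals", "Pyrite"),
]
print(align_triples(example))
```

After alignment the two example triples collapse into one, which is the effect illustrated by the before and after panels of Figure 4.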

Table 5. The nine gold deposits similar to the Huangjindong deposit and their associated minerals.

Deposit	Metallic Minerals in the Huangjindong Deposit	Metallic Minerals in the Similar Deposits	Jaccard Index	Genetic Type	References
Badran_orogenic_gold_deposit	Pyrite, Arsenopyrite, Gold, Chalcopyrite, Sphalerite, Galena, Magnetite	Pyrite, Sphalerite, Chalcopyrite, Arsenopyrite, Galena	0.71	Orogenic gold deposit	[38]
Muruntau_gold_deposit		Pyrite, Sphalerite, Chalcopyrite, Arsenopyrite, Gold	0.71	Orogenic gold deposit	[39]
Algamarca_Au_Ag_Cu_deposit		Pyrite, Chalcopyrite, Arsenopyrite, Tetrahedrite, Native_Gold, Tennantite	0.44	Epithermal gold deposit	[40]
Boroo_Gold_Deposit		Pyrite, sphalerite, arsenopyrite, gold, tetrahedrite, galena, chalcopyrite	0.75	Orogenic gold deposit	[41]
Sabodala_deposit		Gold, pyrite, blende, galena, chalcopyrite, argentite	0.44	Mesothermal vein gold deposit	[42]
Banbianshan_gold_mine		Arsenic_oxide, Gold, Pyrite, Arsenopyrite, Limonite, Silver, Tetrahedrite, Galena, Bornite, Copper_orchid, Copper_blue	0.29	Hydrothermal deposit	[43]
Jinyinshan_gold_deposit		Arsenic_oxide, Pyrite, Limonite, Natural_gold, Galena, Bornite, Chalcopyrite, Magnetite, Apatite, Pyrrhotite, Chalcocite, Hematite	0.36	Magmatic hydrothermal deposit	[44]
Leigaowu_gold_mine		Natural_silver, Pyrite, Natural_gold, Tetrahedrite, Galena, Bornite, Chalcopyrite, Zircon, Apatite, Marcasite, Argentite, Anatase, Copper_blue	0.25	Hydrothermal deposit	[45]
Hunan_xianrenyan_gold_polymetallic_deposit		Pyrite, Limonite, Natural_gold, Chalcopyrite, Molybdenite, Magnetite, Hematite, Malachite, Copper_blue	0.33	Epithermal gold deposit	[46]
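The Jaccard index values in Table 5 follow from the mineral sets listed in each row: the size of the intersection divided by the size of the union [36,37]. The sketch below recomputes two of them from the table entries. The `jaccard` helper and the lowercase normalization of mineral names are assumptions made for this illustration.

```python
def jaccard(a, b) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| on case-insensitive names."""
    set_a = {name.lower() for name in a}
    set_b = {name.lower() for name in b}
    return len(set_a & set_b) / len(set_a | set_b)


# Mineral lists copied from Table 5.
huangjindong = ["Pyrite", "Arsenopyrite", "Gold", "Chalcopyrite",
                "Sphalerite", "Galena", "Magnetite"]
badran = ["Pyrite", "Sphalerite", "Chalcopyrite", "Arsenopyrite", "Galena"]
boroo = ["Pyrite", "sphalerite", "arsenopyrite", "gold",
         "tetrahedrite", "galena", "chalcopyrite"]

print(round(jaccard(huangjindong, badran), 2))  # 5 shared of 7 in the union
print(round(jaccard(huangjindong, boroo), 2))   # 6 shared of 8 in the union
```

Rows that use synonymous names, such as Native_Gold or Natural_gold for Gold, would need entity alignment before the comparison so that equivalent minerals are counted as shared.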

Table 6. Comparison of precision, recall, and F1 values between single-text extraction and multitext extraction.

Methods	Precision	Recall	F1
Single text	0.783	0.721	0.751
Multiple text	0.886	0.821	0.852
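The F1 column in Table 6 is the harmonic mean of the precision and recall columns. A minimal check, with a hypothetical helper name `f1_score`, reproduces both values from the reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 6.
single_text = f1_score(0.783, 0.721)
multiple_text = f1_score(0.886, 0.821)

print(round(single_text, 3), round(multiple_text, 3))
```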

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
