Research on Intelligent Extraction Method of Influencing Factors of Loess Landslide Geological Disasters Based on Soft-Lexicon and GloVe

Huang, Lutong; Zhu, Yueqin; Li, Yingfei; Yan, Tianxiao; Xiao, Yu; Wei, Dongqi; Xing, Ziyao; Li, Jian

doi:10.3390/app15168879


by Lutong Huang ¹, Yueqin Zhu ¹,*, Yingfei Li ¹, Tianxiao Yan ², Yu Xiao ¹, Dongqi Wei ³, Ziyao Xing ¹ and Jian Li ¹

¹ National Institute of Natural Hazards, Ministry of Emergency Management of China, Beijing 100085, China
² College of Geological and Surveying Engineering, Taiyuan University of Technology, Taiyuan 030024, China
³ Xi’an Center of Geological Survey, China Geological Survey, Xi’an 710119, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(16), 8879; https://doi.org/10.3390/app15168879

Submission received: 25 June 2025 / Revised: 27 July 2025 / Accepted: 7 August 2025 / Published: 12 August 2025

(This article belongs to the Special Issue Applications of Big Data and Artificial Intelligence in Geoscience)


Abstract

Loess landslide disasters are influenced by a multitude of factors, including slope conditions, triggering mechanisms, and spatial attributes. Extracting these factors from unstructured geological texts is challenging due to nested entities, semantic ambiguity, and rare domain-specific terms. This study proposes a joint extraction framework guided by a domain ontology that categorizes six types of loess landslide influencing factors, including spatial relationships. The ontology facilitates conceptual classification and semi-automatic nested entity annotation, enabling the construction of a high-quality corpus with eight tag types. The model integrates a Soft-Lexicon mechanism that enhances character-level GloVe embeddings with explicit lexical features, including domain terms, part-of-speech tags, and word boundary indicators derived from a domain-specific lexicon. The resulting hybrid character-level representations are then fed into a BiLSTM-CRF architecture to jointly extract entities, attributes, and multi-level spatial and causal relationships. Extracted results are structured using a content-knowledge model to build a spatially enriched knowledge graph, supporting semantic queries and intelligent reasoning. Experimental results demonstrate improved performance over baseline methods, showcasing the framework’s effectiveness in geohazard information extraction and disaster risk analysis.

Keywords: loess landslide influencing factors; Soft-Lexicon; named entity recognition; knowledge graph

1. Introduction

A landslide is defined as the downward movement of a mass of rock, soil, or debris under the influence of gravity, typically occurring on a slope due to natural or anthropogenic factors. Landslides are among the most common and destructive geological hazards worldwide, posing significant threats to human life, infrastructure, and the environment [1,2]. Among them, loess landslides are particularly prevalent and hazardous in the Loess Plateau region of China, where they are triggered by the complex interplay of factors such as intense rainfall, unique loess geology, steep topography, and human activities. Traditional empirical and physical models mainly analyze individual factors but cannot effectively model their complex coupling or utilize the rich information embedded in unstructured geological texts [3,4]. These unstructured texts, such as geological survey and disaster assessment reports, contain valuable knowledge on disaster-influencing factors, often described through heterogeneous, ambiguous, and nested domain-specific terminology. Extracting key entities, their attributes, and the relationships among them from such texts remains a significant challenge [5,6].

Among these influencing factors, spatial information is a vital component. It includes not only geographic place names and geological spatial objects but also disaster-related attributes such as severity, extent, and spatial relationships. This spatial knowledge is essential for comprehensively understanding landslide mechanisms and for building structured representations that support risk assessment and disaster management [7]. Therefore, this study focuses on the intelligent extraction of disaster-influencing factors, including entities, attributes, and relationships, from large volumes of unstructured geological texts, with special emphasis on spatial information as a core dimension of the extracted knowledge.

Advances in natural language processing (NLP) and knowledge graph technologies have opened new avenues for extracting structured representations of landslide causative factors. Recent studies have aimed to automatically identify entities, attributes, and relationships from unstructured data to support disaster intelligence and risk assessment. Nevertheless, challenges remain in recognizing low-frequency entities, handling nested structures, and extracting implicit causal and spatial relationships.

Early research predominantly utilized statistical and physical models to analyze landslide causes. For instance, Sun et al. [8] quantified the correlation between slope, rainfall, and landslide occurrence through logistic regression, elucidating the roles of individual factors. The emergence of machine learning methods like random forests and support vector machines further improved landslide susceptibility prediction by integrating multivariate parameters, including geotechnical properties and vegetation coverage [9]. While traditional statistical and physical models rely heavily on structured input data for landslide hazard assessment, they often require comprehensive and up-to-date datasets. In this context, our approach focuses on efficiently extracting and organizing rich, implicit knowledge from unstructured geological survey texts to complement and enhance these models.

To address these challenges, recent advances in knowledge graph technology and domain ontology construction have shown promise for structuring heterogeneous geological disaster data. Knowledge graphs enable the integration and representation of multi-source, multimodal information, while ontologies provide formal semantic frameworks defining core concepts and relationships within specific domains. Qiu et al. [10] proposed an automated construction method of geological disaster knowledge graphs combining computer vision with domain earth science ontologies, facilitating the identification of disaster chains from multimodal data. Additionally, Qiu et al. [11] extended knowledge graph construction to earthquake disaster prevention, improving reasoning capabilities through geological data integration. Wen et al. [12] developed a domain ontology for loess landslides, defining core concepts such as rock and soil properties and hydrological conditions, enabling the initial extraction of entity relationships. Guzzetti et al. [13] emphasized semantic alignment in integrating remote sensing images with geographic information systems (GISs) for a global landslide database framework. Despite these advances, the development of high-quality, specialized ontologies and annotated corpora for loess landslide texts remains limited, and automated methods for extracting complex nested entities and implicit causal and spatial relations are still insufficient. This gap constrains the dynamic updating and comprehensive utilization of geological knowledge in knowledge graphs.

Consequently, the effective semantic parsing of geological texts through natural language processing (NLP) techniques becomes essential. Geological texts pose unique challenges, including terminology heterogeneity, the recognition of low-frequency and nested entities, and the extraction of implicit causal relationships. Traditional NLP methods such as conditional random fields (CRFs) and bidirectional long short-term memory networks combined with CRFs (BiLSTM-CRF) perform well in general domains but show limited accuracy on specialized geological terminology and sparse entities [14]. Domain adaptation strategies have been proposed to address these issues. For example, Ma et al. [15] introduced Soft-Lexicon, which incorporates dictionary knowledge into character representations to enhance model efficiency and accuracy. Similarly, Liu et al. [16] employed GeoBERT, a BERT-based model augmented with geological terminology lists, achieving improved named entity recognition in geological corpora. Nevertheless, current methods exhibit insufficient robustness in extracting implicit causal relations and limited generalization across texts, constraining large-scale automated knowledge extraction.

Effective knowledge representation is critical to the usability of extracted information. Earlier studies represented landslide knowledge as simple triples (entity–relationship–entity) [12], which cannot capture attribute values or relationship modifications. Das et al. [17] proposed a method combining statistical and machine learning techniques to model the spatiotemporal evolution of landslide precursors from radar data with spatiotemporal labeling, greatly enhancing knowledge representation. Zheng et al. [18] constructed an ontology-based knowledge graph for hazardous chemical management that integrates entity attributes through pre-trained BERT-CRF models, extracting entities and attribute-value pairs from unstructured data. These approaches improve the efficiency and accuracy of knowledge graph construction and provide useful references for the intelligent extraction of loess landslide influencing factors. However, challenges remain in dynamic updating and multimodal fusion, limiting real-time disaster monitoring applications.

Among the extracted information, spatial entities—especially toponyms—are indispensable for understanding landslide processes. They describe hazard environments and encode spatial relationships within disaster chains. Wang et al. [19] developed a deep belief network for Chinese toponym recognition, improving spatial entity extraction. Zhou et al. [20] proposed TopoBERT, a BERT-enhanced model for disaster corpus toponym extraction. Zhang et al. [21] introduced ChineseCTRE, combining geographic entity recognition and error correction to improve spatial interpretation. However, existing methods still focus on lexical and semantic levels and lack domain-specific ontologies. This limits their ability to handle nested spatial entities, harmonize classification, and model complete entity–relation structures. Spatial information incorporation often relies on manual annotation or template matching, limiting scalability and generalization.

Despite certain advances in related research, significant challenges remain in extracting the influencing factors of loess geological disasters, particularly in integrating complex geological semantics with multi-level nested spatial entities and accurately recognizing low-frequency domain-specific terms. To address these challenges, this study first constructs a comprehensive domain ontology and a high-quality annotated corpus using a nested entity annotation scheme, which together clarify the hierarchical and logical relationships among influencing factors, including complex spatial entities. Building on these, a joint extraction model is proposed that simultaneously captures entities, their attributes, and spatial as well as causal relationships from unstructured geological texts. The extracted information is further organized into a spatially enriched knowledge graph via a content-knowledge model, enabling detailed semantic analysis and spatial reasoning to support intelligent disaster prevention and risk management.

2. Ontology Construction of Influencing Factors of Loess Landslide Geological Disaster

In the intelligent extraction of factors influencing loess landslide geological disasters, text data usually contains professional terminology and domain-specific expressions, which leads to semantic ambiguity: the same entity may have multiple expressions, and the same expression may correspond to different concepts. In addition, text data lacks hierarchical relationships, and primitive tags can neither reflect the classification level of entities nor support the semantic splitting of nested entities [22]. Constructing an ontology of the factors influencing loess landslide geological disasters addresses these problems by defining clear terminology and concepts to reduce semantic ambiguity, building a classification system that clarifies the classification level of entities, supporting the semantic splitting of nested entities, enhancing the correlation between entities, and supporting reasoning based on entity relationships.

The ontology enables the study of factors influencing slope stability in the field of loess landslide geological disasters, including concept classification, descriptive attributes of concepts, and relationships between concepts, thereby providing a conceptual framework for the knowledge graph in the geological disaster domain. Ontology construction defines not only entities but also entity relationships and attributes, which together form a complete knowledge graph. In the data layer of the knowledge graph, facts are stored as triples of “entity–relationship–entity” or “entity–attribute–attribute value” to form a graphical knowledge base [23]. An ontology is defined as a structured domain knowledge framework that realizes the semantic mapping of primitive entities to standardized concepts through predefined classes, attributes, and relationship networks. Ontology construction helps to formalize expert knowledge and transform it into a computer-understandable knowledge system, thereby improving the accuracy of intelligent extraction.
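As a minimal sketch of the two triple forms described above, the following Python snippet stores facts as plain tuples and retrieves them by head entity. The `Triple` type, the `facts_about` helper, and the two example facts are hypothetical and serve only as an illustration; they are not taken from the knowledge graph built in this study.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One fact in the data layer: (head, predicate, tail)."""
    head: str       # entity
    predicate: str  # relationship name or attribute name
    tail: str       # entity (relation triple) or attribute value (attribute triple)
    kind: str       # "relation" or "attribute"


# Illustrative facts only (assumed for this sketch).
FACTS = [
    Triple("sliding zone", "located in", "Q3 loess layer", "relation"),
    Triple("sliding zone", "cohesion", "15 kPa", "attribute"),
]


def facts_about(entity: str, facts: List[Triple]) -> List[Triple]:
    """Return every triple whose head is the given entity."""
    return [t for t in facts if t.head == entity]
```

Keeping the `kind` field separate makes it possible to distinguish edges between entity nodes from attribute values attached to a single node when the triples are loaded into a graph store.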

The factors that affect slope stability are diverse and complex, but they can be divided into internal and external factors. Internal factors include topography, geotechnical properties, geological structure, geotechnical structure, water action, ground stress, residual stress, etc.; external factors include engineering load conditions, vibration, slope morphology and weathering, air conditions, climate conditions, surface vegetation development, etc. According to domain knowledge, the stability of a slope should be evaluated based on a comprehensive assessment of the above factors. This paper aims to sort out the conceptual system of these influencing factors and the relationships among them from an overall perspective, and to conceptualize the empirical ontology of predecessors. Figure 1 summarizes the eight major influencing factors that affect slope stability and the interrelationships between them. These include causal relationships, such as weathering reducing the strength of rocks and increasing fissures, and correlation relationships, such as the relationship between the distribution of fracture zones and slopes.

The influencing factors summarized in Figure 1 are as follows [24,25,26]:

Geotechnical properties: Including the hardness, weathering resistance, softening resistance, shear strength, particle size and shape, and permeability of the rock and soil.
Rock layer structure and texture: Including the distribution pattern and development degree of bedding, joints, and fissures, the cementation of structural surfaces, the distribution of weak surfaces and fracture zones, and their relationship with the slope, the morphology of the rock and soil interface, and the spatial relationship with the slope direction and slope gradient.
Hydrogeological conditions: Including the six points of groundwater, burial conditions, erosion conditions, and dynamic changes.
Weathering: Weathering will reduce the strength of the rock and soil, increase the width and number of fissures, affect the morphology of the slope, and promote the infiltration of surface water.
Climate effect: Climate has a very close causal relationship with the thickness of the rock and soil weathering layer, the weathering rate, and the mechanical and chemical changes in the rock after weathering.
Earthquake action: In addition to increasing the sliding force due to earthquake acceleration, earthquake action will also increase the pore water pressure in the rock and soil and reduce the strength of the rock and soil, which is detrimental to the stability of the slope.
Topographic factors: The important factors influencing slope stability include the height, slope, and topographic factors of the slope.
Human factors: Slope instability can be caused by unreasonable slope engineering design, large-scale infiltration of external water and blasting, and other human factors.

This paper focuses on the ontology modeling of six major influencing factors in geography and geology and does not discuss the remaining earthquake action and human factors, as shown in Figure 2.

3. Intelligent Extraction Method of Influencing Factors of Loess Landslide Geological Disasters

Given the variety of factors influencing loess landslide geological disasters, designing a separate extraction model for each factor is impractical. Therefore, this paper proposes a domain-wide progressive joint extraction mode. It is based on a universal distributed representation–encoding–decoding structure and treats extraction as a sequence tagging problem in which entities, relations, and attributes are regarded as generalized entities. The process consists of four main steps: primitive annotation corpus construction, automatic primitive annotation, entity recognition, and entity conceptualization. The extraction process is shown in Figure 3.

3.1. Primitive Annotation and Corpus Construction

Unstructured geological disaster texts, including investigation reports, field records, and map annotations, are crucial carriers of information on disaster-causing factors. However, these texts are often loosely expressed with heterogeneous terminology and contain implicit field writing rules. They urgently require structured analysis to enable knowledge-based applications. Although existing studies have preliminarily organized these texts through data cleaning and topic aggregation, fine-grained knowledge extraction remains limited due to the lack of annotated corpora. Adaptive annotation resources have not yet been established in the field of geological disasters, resulting in the inadequate performance of existing models in terms of cross-text generalization and terminology consistency. To address this issue, this study proposes a progressive primitive annotation system, based on the investigation and analysis of related work on the construction of annotated corpora in other fields [27,28,29]. By defining field writing rules and designing joint annotation rules, this system achieves integrated knowledge expression of entities, relationships, and attributes, constructs an annotated corpus for disasters such as loess landslides, and breaks through the technical bottleneck of knowledge extraction.

“Primitives” are defined as the semantic unit combination rules implicit in geological disaster texts, expressed as conventional patterns used by geological workers, such as parameter quantification sentences (“cohesion is 15 kPa”) and spatial positioning structures (“the sliding zone is located in the Q3 loess layer”). Based on the text analysis of 3000 slope disaster cases (including landslides, collapses, and unstable slopes) in the loess region, this paper extracts eight types of primitive rules, covering elements such as main entities (SUB), attribute-value pairs (ATN/ATO/ATV), and predicate relations (PRD), and supporting integrated entity–relationship–attribute annotation. Evaluation on the manually annotated primitive entity corpus shows an accuracy of 98.1% and a recall of 96.2%, meeting the requirements for subsequent processing. The eight annotation categories are as follows:

Main Entity (SUB): Typically, it is the subject of a sentence, which can derive various entities and act as a node in the graph.
Beof Relationship (BOF): Forms an implicit “…of” relationship with the main entity, such as “The sliding body (BOF) front”.
Attribute Entity Type (ATN)/Attribute Value (ATO/ATV) Entity: The attribute and its value describe the characteristics of the main entity or BOF entity. They usually appear in pairs: the attribute name is generally a noun term, followed by a numeric or enumerated attribute value. Enumerated attribute values are taken from the subordinate words in the object Partition subclass of the ontology and marked as ATO; numeric attribute values are marked as ATV.
Nested Entity Type (NST): Nested entities appear as a whole in the primitive annotation and contain rich semantic relationships between entities. These can later be split and identified using nested entity recognition technology. For example, “Yintaishan Village, Qiaoshan Town, east side of Huangling County” represents the geographical location, and “Gray-green sandstone with thin layers of blue-gray mudstone” represents the lithology.
Predicate Relationship Type (PRD): The predicate connecting the left and right entities acts as the edge of the graph, typically a verb.
Prepositional Relationship Type (P...PRD): The relationship formed between a preposition expressing the location, purpose, reason, object, passive, and comparison of the main entity (BOF entity) and the following predicate. The entity placed in the middle serves as the right node of the prepositional relationship. For example, Slope—(bounded by _) inclined beam.
Modification (^MOD): Also called assertion, it manifests as the modification of the relationship between entities and can be understood as the attribute of the relationship.
Connector (_JNT): Indicates parallel (“and”), alternative (“or”), and interval relationships between attribute values of the same type.
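To illustrate how the annotation categories above can serve as sequence labels, the following Python sketch converts span annotations into one label per token. Several points are assumptions made for illustration: the `B-`/`I-` prefix scheme, the names `PRIMITIVE_TAGS` and `spans_to_bio`, and the particular annotation of the example sentence. English tokens are used for readability, whereas the corpus in this study is labeled at the character level.

```python
from typing import List, Tuple

# Tag names taken from the category list; ATN, ATO, and ATV are kept as separate labels.
PRIMITIVE_TAGS = ["SUB", "BOF", "ATN", "ATO", "ATV", "NST", "PRD", "P...PRD", "^MOD", "_JNT"]

Span = Tuple[int, int, str]  # (start index, end index exclusive, tag)


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn non-overlapping spans into one B-/I-/O label per token."""
    labels = ["O"] * n_tokens
    for start, end, tag in spans:
        if tag not in PRIMITIVE_TAGS:
            raise ValueError(f"unknown tag: {tag}")
        labels[start] = f"B-{tag}"          # first token of the span
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"          # remaining tokens of the span
    return labels


# Hypothetical annotation of the example "cohesion is 15 kPa".
tokens = ["cohesion", "is", "15", "kPa"]
spans = [(0, 1, "ATN"), (1, 2, "PRD"), (2, 4, "ATV")]
labels = spans_to_bio(len(tokens), spans)
```

Because every category is written as a span label, entities, relations, and attributes can all be handled by a single sequence tagger, which is the sense in which they are treated as generalized entities.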

3.2. Automatic Extraction of Primitives

To handle the complex nested entities and numerous low-frequency terms in geological disaster texts, this paper uses the domain-wide progressive joint extraction mode, which is based on a universal distributed representation–encoding–decoding structure and treats entities, relationships, and attributes as generalized entities in a sequence tagging problem. In addition, this paper proposes a deep neural network model based on lexicon enhancement (the Adaptive Embedding paradigm), which improves the recognition of nested entities and low-frequency terminology by integrating domain dictionaries and multi-granularity features (characters, parts of speech, and domain terminology). Specifically, the method introduces the domain knowledge of the “Earth Science Dictionary” through the Soft-Lexicon module, combines it with the GloVe character-level embedding model to generate character vectors, and designs a bidirectional BiLSTM-CRF decoding layer to achieve the integrated extraction of entities, relationships, and attributes in geological disaster texts.

3.2.1. Soft-Lexicon and GloVe Embedding Model

In named entity recognition (NER) networks, the primary distinction when employing an encoder–decoder architecture lies in the representation of text, particularly in constructing word embeddings. This paper adopts a hybrid vector representation method that integrates external knowledge from domain ontologies with character-level embeddings. There are two reasons for selecting character embeddings: Firstly, Chinese is fundamentally composed of characters, each with strong expressive power; secondly, improper word segmentation can lead to ambiguity due to unclear word boundaries, and the vast vocabulary size can easily result in out-of-vocabulary (OOV) issues. To address the problem of capturing only character-level information without accessing underlying word-level information, this paper proposes a method using GloVe for character embedding and employs an extended Soft-Lexicon hybrid distributed representation approach to incorporate explicit domain word order information (such as domain lexicons, part-of-speech tags, and word boundaries) into the character vectors, forming a hybrid vector representation for the input sequence. This method aims to enhance the domain expression capability of the embedding vectors by introducing domain knowledge on top of character-level vectors in a sensible manner.

This paper employs an enhanced Adaptive Embedding paradigm, incorporating lexical techniques to achieve the aforementioned objectives. Typically, this paradigm involves pre-training character vectors using word embedding technology before constructing adaptive hybrid vector representations based on lexical information. A significant advantage of this approach is its ability to integrate external information into the word representation layer, which operates independently of the subsequent encoding–decoding model. This design allows for seamless integration with any encoding–decoding model, such as BiLSTM-CRF or other general network architectures, facilitating sequence labeling tasks and thereby offering broad applicability. In this study, Soft-Lexicon is utilized as an extension module to incorporate domain-specific lexicons, thereby enriching the vocabulary.

Soft-Lexicon [15] offers a streamlined and efficacious strategy for infusing lexicon knowledge into character embeddings. This methodology encompasses distinct procedures, including lexicon alignment, positional categorization, word vector weighting regularization, and the synthesis of hybrid vectors.

For each character $c_i$ in the input sequence $\{c_1, c_2, \ldots, c_n\}$, four word sets labeled $\{B, M, E, S\}$ are constructed, which incorporates lexicon information into the character representations. Here $L$ is the lexicon and $w_{j,k}$ denotes the subsequence from $c_j$ to $c_k$:

$$B(c_i) = \{w_{i,k}, \forall w_{i,k} \in L, i < k \leq n\} \tag{1}$$

$$M(c_i) = \{w_{j,k}, \forall w_{j,k} \in L, 1 \leq j < i < k \leq n\} \tag{2}$$

$$E(c_i) = \{w_{j,i}, \forall w_{j,i} \in L, 1 \leq j < i \leq n\} \tag{3}$$

$$S(c_i) = \{c_i, \exists c_i \in L\} \tag{4}$$

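Then, weighted regularization is performed on the word sets corresponding to the $B$, $M$, $E$, and $S$ categories, respectively. For a word set $s$,

$$\nu^{s}(s) = \frac{4}{Z} \sum_{\omega \in s} z(\omega)\, \nu^{\omega}(\omega) \tag{5}$$

where $\nu^{\omega}(\omega)$ is the vector of word $\omega$, $z(\omega)$ is the frequency of $\omega$ in the statistical data as defined in [15], and $Z = \sum_{\omega \in B \cup M \cup E \cup S} z(\omega)$.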

Finally, the lexicon information is added to the character vectors by vector concatenation to form the hybrid representation vectors.

$$hy(c_i) = [\nu(c_i); \nu^{s}(B); \nu^{s}(M); \nu^{s}(E); \nu^{s}(S)] \tag{6}$$
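The Soft-Lexicon steps can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in this study: the function names `bmes_sets` and `hybrid_vector`, the toy lexicon, the word frequencies `z`, and the two-dimensional word vectors are all assumptions. Following Ma et al. [15], the pooled vectors of all four sets are concatenated after the character vector; the padding of empty sets with a special word used in [15] is replaced here by zero vectors.

```python
from typing import Dict, List, Set


def bmes_sets(chars: str, lexicon: Set[str]) -> List[Dict[str, Set[str]]]:
    """For each character, collect lexicon words that Begin, pass through the
    Middle of, End at, or are a Single-character match for that character."""
    n = len(chars)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]
    for j in range(n):
        for k in range(j, n):
            word = chars[j:k + 1]
            if word not in lexicon:
                continue
            if j == k:
                sets[j]["S"].add(word)
            else:
                sets[j]["B"].add(word)
                sets[k]["E"].add(word)
                for i in range(j + 1, k):
                    sets[i]["M"].add(word)
    return sets


def hybrid_vector(char_vec: List[float], sets_i: Dict[str, Set[str]],
                  z: Dict[str, float], vec: Dict[str, List[float]],
                  dim: int) -> List[float]:
    """Frequency-weighted pooling of each set, concatenated after the character vector."""
    total = sum(z[w] for tag in "BMES" for w in sets_i[tag])  # normalizer Z
    out = list(char_vec)
    for tag in "BMES":
        pooled = [0.0] * dim                                  # stays zero for an empty set
        for w in sets_i[tag]:
            for d in range(dim):
                pooled[d] += 4.0 * z[w] / total * vec[w][d]
        out.extend(pooled)
    return out


# Toy example (assumed): "黄土滑坡" = loess landslide.
lexicon = {"黄土", "滑坡", "黄土滑坡", "土"}
sets = bmes_sets("黄土滑坡", lexicon)
z = {"黄土": 3, "黄土滑坡": 1, "土": 4}                        # hypothetical frequencies
vec = {"黄土": [1.0, 0.0], "黄土滑坡": [0.0, 2.0], "土": [1.0, 1.0]}
hv = hybrid_vector([0.5], sets[1], z, vec, dim=2)              # character "土"
```

For the second character, the sketch places “黄土滑坡” in the middle set, “黄土” in the end set, and “土” in the single set, so the hybrid vector has one character dimension followed by four pooled blocks of two dimensions each.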

GloVe (Global Vectors for Word Representation) [30] is an unsupervised technique for learning word embeddings, integrating global statistical methods with local window-based approaches to model co-occurrence frequencies and capture semantic relationships between words.

The core principle of the GloVe model is to use a global word co-occurrence matrix to derive word vectors whose relationships reflect the semantic connections between words. Semantic relationships are captured through ratios of word co-occurrence probabilities. The co-occurrence matrix is obtained by counting co-occurrences in a large corpus, and GloVe learns word vector representations by minimizing the weighted squared difference between the dot products of word vectors and the logarithms of the co-occurrence counts. The vectors generated by GloVe serve as input to the Soft-Lexicon module, providing semantic information at the character level.

This paper adopts GloVe as the character embedding model. First, character-level embeddings for the entire unlabeled raw corpus $R$ are constructed following the method proposed by Pennington et al. [30]. This approach captures local contextual relationships between characters within the corpus. Concurrently, a co-occurrence matrix is constructed so that the semantic vectors also draw on global statistical information. The specific methodology is as follows. Let the co-occurrence matrix of characters be $X$, where the element $X_{ij}$ is the number of times characters $c_i$ and $c_j$ appear together within a specified window as it slides across the corpus. Let $X_i = \sum_k X_{ik}$ denote the number of times any character $c_k$ appears in the context of $c_i$, that is, the sum of the $i$-th row of the co-occurrence matrix. Let $P_{ij} = P(j \mid i) = X_{ij} / X_i$ be the probability of $c_j$ appearing in the context of $c_i$, and let $ratio_{i,j,k} = P_{ik} / P_{jk}$ be the ratio of two conditional probabilities. The core idea behind GloVe is to construct a feature function $F$ of the character vectors $\vec{c}_i$, $\vec{c}_j$, and $\tilde{c}_k$ such that

$$F(\vec{c}_i, \vec{c}_j, \tilde{c}_k) = \frac{P_{ik}}{P_{jk}} \tag{7}$$

Here, $\vec{c} \in \mathbb{R}^d$ is a character vector, and $\tilde{c} \in \mathbb{R}^d$ is the separate vector of a character appearing in the context, distinct from $\vec{c}$. Using the feature function and loss function given in [30], we train embedding vectors for all characters in the domain-specific corpus $R$, so that any character $c_i$ can be represented by its character vector $\nu(c_i)$.
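The quantities above can be made concrete with a short Python sketch. It counts character co-occurrences in a symmetric window, derives the conditional probabilities and their ratio, and evaluates the weighted least-squares term that GloVe minimizes for one character pair, using the weighting function and the default values `x_max = 100` and `alpha = 0.75` reported by Pennington et al. [30]. The function names and the toy string are assumptions, and plain counts are used without the distance weighting of the reference implementation; this is a sketch, not the training code of this study.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Counts = Dict[Tuple[str, str], float]


def cooccurrence(text: str, window: int = 2) -> Counts:
    """X[(ci, cj)]: how often cj occurs within `window` characters of ci."""
    X: Counts = defaultdict(float)
    for i, ci in enumerate(text):
        for j in range(max(0, i - window), min(len(text), i + window + 1)):
            if j != i:
                X[(ci, text[j])] += 1.0
    return X


def cond_prob(X: Counts, ci: str, ck: str) -> float:
    """P_ik = X_ik / X_i, where X_i is the row sum for ci."""
    row_sum = sum(v for (a, _), v in X.items() if a == ci)
    return X.get((ci, ck), 0.0) / row_sum if row_sum else 0.0


def prob_ratio(X: Counts, ci: str, cj: str, ck: str) -> float:
    """ratio_{i,j,k} = P_ik / P_jk (infinite when P_jk is zero)."""
    p_jk = cond_prob(X, cj, ck)
    return cond_prob(X, ci, ck) / p_jk if p_jk else math.inf


def weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """GloVe weighting function f(x) from [30]."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(c_i: List[float], c_k_tilde: List[float],
                    b_i: float, b_k: float, x_ik: float) -> float:
    """One term of the GloVe objective: f(X_ik) * (c_i . c~_k + b_i + b~_k - log X_ik)^2."""
    dot = sum(a * b for a, b in zip(c_i, c_k_tilde))
    return weight(x_ik) * (dot + b_i + b_k - math.log(x_ik)) ** 2


X = cooccurrence("abab", window=1)   # toy corpus (assumed)
```

Summing `glove_pair_loss` over all pairs with nonzero counts gives the full objective; the character vectors and context vectors are the parameters adjusted during training.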

On the basis of the GloVe character embedding model, domain lexicons are introduced. The specific steps are as follows:

First, the GloVe model is used to generate a character vector for each character, which captures semantic information at the character level. Subsequently, a sequence of length $n$ is defined:

$$Seq = \{c_i \mid 1 \leq i \leq n, \forall c_i \in V_c\} \tag{8}$$

where $V_c$ is the dictionary set of the corpus. Using the entries, domain ontology concepts, terminology, and general part-of-speech dictionaries of the Earth Science Dictionary, we construct a dictionary $V_w = \{\langle w_i, pos_i \rangle\}$, where $\langle w_i, pos_i \rangle$ represents the word $w_i$ and its corresponding part of speech $pos_i$. The part-of-speech tag set is $POS = \{N, V, A, U\}$, the word order information $Order(w_i)$ of a word is represented by the “BIO” annotation, and the word order set is $W_o = \{B, I, O\}$. Words and parts of speech are assumed to satisfy the independent uniqueness assumption; that is, each word in the dictionary has one and only one part of speech.

Next, a weight function for word $w$ is constructed to represent the probability of a given word order tag appearing under a given part of speech:

$$f(w) = P_w(wo \mid pos) = P_w(wo \mid N) + P_w(wo \mid V) + P_w(wo \mid A) + P_w(wo \mid U) \tag{9}$$

where $wo \in \{B, I, O\}$ and $pos \in \{N, V, A, U\}$. According to the uniqueness assumption, $P_w(wo \mid \neg pos) = 0$, so only the term for the word's own part of speech is nonzero.

For each part of speech $pos \in \{N, V, A, U\}$, calculate

$$\nu^{pos}(wo) = \sum_{w \in wo} f(w)\, \nu(w) \tag{10}$$

where $F = \sum_{w \in \cup wo} f(w)$ and $\nu(w)$ is the embedding vector of word $w$.

Then, the vectors $\nu^{pos}(wo)$ for $N$, $V$, $A$, and $U$ are concatenated to form a vector representation of the domain-explicit word order information (this paper uses the symbol $\oplus$ to denote vector concatenation):

$$\nu^{w}(pos) = \nu^{N}(wo) \oplus \nu^{V}(wo) \oplus \nu^{A}(wo) \oplus \nu^{U}(wo) \tag{11}$$

Finally, this vector is concatenated with the character vector $\nu(c)$ generated by GloVe to form the final hybrid vector representation:

$$hy(\nu) = \nu(c) \oplus \nu^{w}(pos) \tag{12}$$

The Soft-Lexicon model constructs a hybrid vector representation by integrating character-level dense vectors, as generated by GloVe, with word-level vectors derived from a domain-specific lexicon. This composite representation amalgamates semantic information from both characters and words, offering enriched features for sequence labeling tasks such as named entity recognition. Consequently, the model’s proficiency in identifying entities and structures within text is significantly augmented. In summary, the Soft-Lexicon model leverages GloVe’s character vectors alongside the lexicon’s word vectors to create a more exhaustive text representation, thereby enhancing the precision of sequence labeling.
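The part-of-speech pooling and concatenation in Equations (10) to (12) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `pos_hybrid_vector` is hypothetical, the weights $f(w)$ are passed in as given numbers rather than estimated from a dictionary, the matched words are assumed to be already filtered by word order tag, and a part of speech with no matched word contributes a zero vector.

```python
from typing import Dict, List, Tuple

POS_TAGS = ["N", "V", "A", "U"]  # part-of-speech tag set from the text

# One matched lexicon word: (word vector, part of speech, weight f(w)).
Match = Tuple[List[float], str, float]


def pos_hybrid_vector(char_vec: List[float], matches: List[Match], dim: int) -> List[float]:
    """Weighted sum of word vectors per part of speech, then concatenation
    with the character vector in the fixed order N, V, A, U."""
    pooled: Dict[str, List[float]] = {p: [0.0] * dim for p in POS_TAGS}
    for word_vec, pos, weight in matches:
        for d in range(dim):
            pooled[pos][d] += weight * word_vec[d]   # one term of the weighted sum
    out = list(char_vec)                             # character vector comes first
    for p in POS_TAGS:
        out.extend(pooled[p])                        # append each part-of-speech block
    return out


# Toy input (assumed): two nouns and one verb cover the character.
matches = [([1.0, 2.0], "N", 0.5), ([2.0, 0.0], "N", 0.5), ([1.0, 1.0], "V", 1.0)]
hy = pos_hybrid_vector([0.1, 0.2], matches, dim=2)
```

With a character vector of dimension $d_c$ and word vectors of dimension $d_w$, the resulting hybrid vector has $d_c + 4 d_w$ dimensions, independent of how many lexicon words match.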

3.2.2. BiLSTM-CRF Model

The BiLSTM-CRF [31] model, a sequence tagging framework integrating a bidirectional long short-term memory network (BiLSTM) with a conditional random field (CRF), is extensively employed in natural language processing tasks. Its core architecture comprises two principal components.

BiLSTM Layer: The forward and backward LSTM networks, respectively, capture past and future contextual information of the input sequence, generating a context-aware representation for each character. Specifically, the forward LSTM processes each character from the beginning to the end of the sequence, while the backward LSTM does so in reverse. The hidden states from both directions are then concatenated to form a bidirectional semantic representation.

CRF Layer: Taking the BiLSTM output as input, the CRF defines a label transition matrix that constrains the predicted label sequence to be valid. It uses the Viterbi algorithm to find the globally optimal label sequence rather than making greedy predictions for each character independently.
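The difference between Viterbi decoding and greedy per-character prediction can be shown with a small NumPy sketch. The emission scores stand in for BiLSTM outputs and the transition matrix for the CRF parameters. All numbers are invented, and start and end transitions are omitted for brevity.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions: array of shape (T, K), one score per position and label.
    transitions: array of shape (K, K); transitions[i, j] scores moving from label i to j.
    """
    n_steps, n_labels = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n_steps, n_labels), dtype=int)
    for t in range(1, n_steps):
        # candidate[i, j]: best path ending in label i at t-1, extended with label j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())


# Labels: 0 = B, 1 = I, 2 = O. The transition O -> I is heavily penalized.
transitions = np.zeros((3, 3))
transitions[2, 1] = -1e4
emissions = np.array([[0.5, 0.0, 1.0],
                      [0.0, 2.0, 0.0]])

greedy = emissions.argmax(axis=1).tolist()   # [2, 1], i.e. O then I, an invalid sequence
path, best = viterbi_decode(emissions, transitions)
print(greedy, path, best)                    # [2, 1] [0, 1] 2.5
```

The greedy choice picks the best label at each position and produces the invalid sequence O, I, whereas Viterbi trades a slightly lower first emission for a valid B, I path with the best total score.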

3.3. Nested Entity Recognition

A nested named entity is a named entity that contains one or more simpler named entities. The contained entities are called inner entities, and the outermost entity is called the outer entity [32]. For example, a nested entity representing a location:

[Yintaishan Village, Qiaoshan Town, East Side of Huangling County] (PLACE)

can be split into

[Yintaishan Village] (NS), [Qiaoshan Town] (NS), [East Side] (DCT), [Huangling County] (NS)

Nested named entity recognition (NER) within an open environment remains a formidable challenge. The predominant approaches can be categorized into three types: rule-based and dictionary-based methods, machine learning-based methods, and deep learning-based methods. Recent research indicates that constructing a multi-layer NER model to hierarchically identify nested entities yields promising results. However, the effectiveness of these methods hinges on the availability of a sufficiently large annotated corpus [33,34,35]. Consequently, addressing the issue of nested entity recognition in the field of geological disasters primarily involves developing an adequately annotated corpus.

In view of the characteristics of nested entities and geological disasters, this paper proposes using a domain ontology and semi-automatic methods to build a complete nested named entity recognition corpus for geological disaster text data. The method first uses the ontology to automatically perform pattern matching from the concept system to instances; the resulting nested named entity annotations are then manually verified to improve annotation quality. The ontology concept–instance matching set is updated after each round, and the corpus is completed through repeated iterations. Specifically, building a nested named entity corpus based on the ontology involves the following two main steps:

Named entity instance classification: Using the concept–instance classification system of the ontology library, the entity types of the annotated corpus are divided to form a mapping dictionary.

Semi-automatic annotation of nested entities: The mapping dictionary formed in the first step and the related information after the primitive annotation of the corpus are used to complete the pattern matching of entity annotation type to instance, and then the annotation quality is improved through manual verification. The final nested entity annotation corpus is produced through repeated iterations.
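The two steps above can be sketched as a dictionary lookup that proposes inner entities inside an already annotated outer entity, leaving the proposals for manual verification. This is an illustrative simplification and not the authors' tool: the mapping dictionary is a hypothetical excerpt built from the location example in this section, and matching is plain substring search.

```python
def match_inner_entities(text, mapping):
    """Find every mapping-dictionary instance inside text.

    Returns (start, end, entity_type, instance) tuples sorted by position,
    with longer matches first when two matches start at the same offset.
    """
    spans = []
    for instance, entity_type in mapping.items():
        start = text.find(instance)
        while start != -1:
            spans.append((start, start + len(instance), entity_type, instance))
            start = text.find(instance, start + 1)
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    return spans


# Hypothetical mapping dictionary (instance -> entity type) from step one.
mapping = {
    "Yintaishan Village": "NS",
    "Qiaoshan Town": "NS",
    "Huangling County": "NS",
    "East Side": "DCT",
}

outer_text = "Yintaishan Village, Qiaoshan Town, East Side of Huangling County"
annotation = {
    "outer": (0, len(outer_text), "PLACE"),
    "inner": match_inner_entities(outer_text, mapping),
    "verified": False,  # set to True after manual checking
}
for start, end, entity_type, instance in annotation["inner"]:
    print(f"[{instance}] ({entity_type})")
```

Instances confirmed or corrected during manual verification can be written back into the mapping dictionary, which is how the iterative update described above would enlarge coverage from one round to the next.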

3.4. Entity Conceptualization

The result of primitive annotation yields a string entity with a primitive type that does not align with the conceptual framework of the ontology, necessitating its conceptualization and categorization. Based on the analysis of factors influencing loess landslide geological disasters and the established ontology model, corpus annotation primarily involves six major conceptual categories of influencing factors. These primary conceptual categories can be further subdivided; for instance, the geotechnical properties category includes entities such as lithostratigraphy, geotechnical types, rock and soil hardness (density), resistance to weathering and softening, shear strength, particle size, shape, and permeability. Additionally, the degree measurement relationship encompasses subcategories like the development degree of joints and fissures, the integrity of the rock mass, slope stability, overall degree, and rock weathering intensity. For more detailed information, refer to Section 2.

The ontology-based conceptual entity type annotation system is constructed upon the foundation of completed primitive entity annotations. It represents a conceptualization of the primary entity types, specifically, the Beof entity and attribute entity types, within the primitive dataset. This system relies on the domain ontology’s knowledge structure to categorize entity types by subdividing them into conceptual categories that encompass various influencing factors. For instance, Table 1 illustrates the correspondence between entity annotations for three concepts: rock mass properties, rock layer structure, and positional relationships, along with their respective subcategories, as defined by the ontology. The process of entity annotation can also be viewed as the instantiation of ontology concepts. Consequently, with the aid of the ontology’s concept terminology classification system, a large-scale mapping between ontology concepts and their instances can be achieved. During the initial manual annotation phase, the guiding principle is to prioritize matching with subclasses; if no match is found at that level, the system proceeds to match with the main class; if still unsuccessful, it defaults to matching with the base type.
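The matching priority described above (subclass first, then main class, then base type) amounts to a simple fallback lookup. The sketch below assumes the ontology has already been flattened into two dictionaries; the entries and label strings are hypothetical examples, not the actual contents of the ontology.

```python
def conceptualize(instance, base_type, subclass_map, main_class_map):
    """Return (concept, level) for an annotated instance.

    Tries the subclass dictionary first, then the main-class dictionary,
    and falls back to the primitive base type if neither matches.
    """
    if instance in subclass_map:
        return subclass_map[instance], "subclass"
    if instance in main_class_map:
        return main_class_map[instance], "main class"
    return base_type, "base type"


# Hypothetical excerpts of the concept-instance mapping.
subclass_map = {"permeability": "geotechnical properties / permeability"}
main_class_map = {"red clay": "geotechnical properties"}

for term in ("permeability", "red clay", "dustpan shape"):
    print(term, "->", conceptualize(term, "attribute", subclass_map, main_class_map))
```

Recording the level at which the match succeeded makes it easy to list the instances that fell back to the base type, which are natural candidates for extending the ontology.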

3.5. Content-Knowledge Model

The content-knowledge model [36] offers a comprehensive definition and structural framework for knowledge elements, which serve as specific manifestations of this model. By delineating the structure of the content-knowledge model, knowledge elements systematically organize extracted entities, relationships, and attributes into a clear and intuitive knowledge graph (Figure 4). This representation not only enhances the comprehensibility of information but also facilitates complex reasoning and analysis by providing structured data support for subsequent intelligent extraction and analytical processes.

The content-knowledge model represents an advancement of the traditional content model, wherein knowledge is derived through a more granular extraction and segmentation of content entities. Specifically, knowledge pertains to the descriptive representation of fundamental domain-specific concepts that either cannot or do not necessitate further division.

Definition 1: Knowledge Element

A knowledge element K is defined as a tuple (E, RE, <P, V>, Ref), where

E denotes a knowledge entity, an abstraction of an objective individual, and the subject of the knowledge element. It is utilized here to represent various entities, concepts, situations, attributes, actions, states, etc., within the scope of slope geological hazards in loess areas. E may have one or more types, which are abstractions sharing similar characteristics, derived from the type subdomain of the ontology. The type domain is a collection of all possible types within a specific field, as defined in the ontology.
RE represents the relationship between knowledge entities.
<P, V> is a hash table of attribute-value pairs, which can also be interpreted as a binary relationship between entities and their attribute values. The edges in this table represent attribute names.
Ref is a reference to the context of the knowledge element within the content entity.

Definition 2: Relationship Between Knowledge Entities

The relationship between knowledge entities RE is defined as (T_r, E_src, E_target, <P, V>), where

T_r is the relationship type of RE. All relationship types belong to the relationship type domain and are well-defined within the domain ontology.
Each relationship has its own definition domain D(r) and value range f(r), which specify the permissible knowledge entity types of E_src and E_target, respectively.
Entity relationships RE may contain attributes akin to those of knowledge entities; hence, the definition of <P, V> remains consistent.
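Definitions 1 and 2 map naturally onto two small record types. The sketch below is one possible encoding using Python dataclasses, not the authors' data model: relations are stored separately from the knowledge element rather than inside the tuple, and the type names and the `RELATION_SCHEMA` entry are hypothetical stand-ins for D(r) and f(r) from the ontology.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeElement:
    entity: str                                       # E: the knowledge entity
    types: frozenset                                  # ontology types of E
    attributes: dict = field(default_factory=dict)    # <P, V> attribute-value pairs
    ref: str = ""                                     # Ref: source text context


@dataclass
class Relation:
    rel_type: str                                     # T_r
    source: KnowledgeElement                          # E_src
    target: KnowledgeElement                          # E_target
    attributes: dict = field(default_factory=dict)    # <P, V> on the relation


# Hypothetical schema: relation type -> (domain D(r), range f(r)).
RELATION_SCHEMA = {"located_at": ({"Landslide"}, {"Place"})}


def is_valid(relation, schema):
    """Check Type(E_src) in D(r) and Type(E_target) in f(r)."""
    domain, value_range = schema[relation.rel_type]
    return bool(relation.source.types & domain) and bool(relation.target.types & value_range)


landslide = KnowledgeElement("Yintaishan Landslide", frozenset({"Landslide"}),
                             {"Boundary": "clear"}, ref="context1")
village = KnowledgeElement("Yintaishan Village", frozenset({"Place"}))

ok = Relation("located_at", landslide, village, {"position": "the east side of"})
reversed_rel = Relation("located_at", village, landslide)
print(is_valid(ok, RELATION_SCHEMA), is_valid(reversed_rel, RELATION_SCHEMA))  # True False
```

Keeping the domain and range check in one function means a relation proposed by the extractor can be rejected before it is written to the graph.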

4. Experimental Results

4.1. Data

This study focuses on typical slope geological hazards in the Loess Plateau region. The data used in this work originate from various original geological survey materials related to loess landslides, which were initially collected through multiple channels and existed primarily in unstructured formats. These original sources include field investigation records, detailed reports on geological hazard investigations and risk assessments, various textual reports, and map annotations describing relevant influencing factors. To support subsequent information extraction tasks, these unstructured materials were pre-processed in earlier stages through content segmentation, storage, and standardized description, resulting in a unified dataset organized into consistent textual units. The dataset used in this study is based on this processed data and contains approximately 529,700 characters and 10,200 sentences. It should be noted that the collection and initial organization of the original materials are not the main focus of this paper and are therefore not described in detail here. Instead, this work centers on extracting knowledge and modeling relationships of the influencing factors based on the processed textual data.

4.2. Analysis of the Results of the Primitive Automatic Extraction Model

The evaluation dataset was annotated using the primitive annotation method described in Section 3.1, employing the “BIO” sequence tag set to generate the annotation sequences. The corpus was split into a training set and a test set in a 3:1 ratio. The performance of the proposed method was compared with existing methods from the literature [15,37,38] on the geological disaster corpus, using precision (P), recall (R), and F1 score (F1) as evaluation metrics. For the methods that use word-level vectors, word segmentation was performed with the perceptron method, and the baseline models were then trained with parameters consistent with their original papers. The proposed method used 200-dimensional word vectors, 50-dimensional POS tag embeddings, 128 BiLSTM hidden units, a learning rate of 0.0015, and dropout to prevent overfitting. As shown in Table 2, the proposed method outperformed the other methods in primitive annotation performance on the geological disaster corpus.
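For reference, precision, recall, and F1 over BIO-tagged sequences can be computed as in the sketch below. It assumes entity-level exact matching and tags of the form `B-TYPE` / `I-TYPE`. The paper does not state whether its scores are entity-level or token-level, so this is an illustration of the metrics rather than a reproduction of the evaluation script.

```python
def extract_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end, label) spans."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            if tag.startswith("B-"):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level P, R, and F1 with exact span and label matching."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-SUB", "I-SUB", "O", "B-ATTV", "I-ATTV"]
pred = ["B-SUB", "I-SUB", "O", "B-ATTV", "O"]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy case the predicted ATTV span is one character too short, so it counts as both a false positive and a false negative under exact matching.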

Additionally, this study conducted comparative experiments to evaluate different input sequence representation methods. Table 3 compares the effectiveness of different embedding techniques for primitive annotation using an identical encoder–decoder. The table shows that character-level embedding outperforms word-level embedding in overall performance. Furthermore, GloVe enhanced with lexicon augmentation performs better than the original GloVe character-level embedding. These results indicate that the proposed approach handles the primitive annotation task on the geological disaster corpus better than the other methods.

4.3. Extraction Results of Influencing Factors of Loess Landslide Geological Disasters

The process of populating knowledge elements involves mapping entities, relationships, and attributes extracted from unstructured natural language descriptions onto a graph. As outlined in Section 3.5, these entities are integrated into knowledge elements (refer to Definitions 1 and 2 in Section 3.5 for comprehensive details). The inter-entity relationships function as edges that link pairs of knowledge elements. Both entities and their connecting relationships may encompass attributes.

Specifically, the SUB (main entity) and BoF (Beof relationship) entities extracted from knowledge are represented as entity ‘E’ within the knowledge element ‘K,’ characterized by ontology type ‘T.’ Contextually relevant attention (ATTN) and attribute value (ATTV) tags are denoted as ‘P’ and ‘V’, respectively. Modification indicators (MOD) are shown as attributes of either entity edges or standalone attribute edges. The reference (Ref) designates the linguistic context—specifically, the text segment expressing the entity in natural language corresponding to a given knowledge element, demarcated by a period. This reference facilitates subsequent tasks such as entity alignment and linking in knowledge fusion processes.

Predicate relations (PRD) and preposition–object relations (P...PRD) are used to form relations RE between two entities, where the filling rule of relation type Tr can be expressed as a primitive relation type (ontology relation type). If two entities E1 and E2 have a relation re, they must satisfy Type (E1) ∈ D(r) and Type (E2) ∈ f(r). The attributes of a relation are usually taken from the values of the modifiers MOD in the primitive. In the attribute graph, relations are represented as edges.

This section uses specific examples to illustrate the process from natural language text knowledge extraction to knowledge element filling and then to knowledge element connection into an attribute graph. For instance, consider the text contexts context1 and context2.

context1 = “Yintaishan Landslide is located at ^{the east side of} Yintaishan Village, Qiaoshan Town, Huangling County. The boundary is clear. ^{On the flat surface}, it presents a dustpan shape. The sliding body is narrow at the top and wide at the bottom, and the morphology presents higher in the east and lower in the west.”

context2 = “The landslide has a length of about 200 m and a front edge width of about 300 m.”

Table 4 presents the results of joint knowledge extraction.

The result of knowledge element filling is as follows:

K1 = (“Yintaishan Landslide”, re₁(“bof”), [<Boundary: “clear”>, <Presents (on the flat surface): “a dustpan shape”>], ref(context1))

K2 = (“Sliding body”, [<null, “Narrow at the top and wide at the bottom”>, <Morphology presents: “Higher in the east and lower in the west”>], ref(context1))

K3 = (“Landslide”, [<Length: “200 m”>, <A Front Edge Width (about): “300 m”>], ref(context2))

K4 = (“Yintaishan Village”, …)

K5 = (“Qiaoshan Town”)

K6 = (“Huangling County”)

The BoF entity forms a beOf relationship with the nearest SUB entity mentioned above:

re_{1} = (BOF, K_{1}, K_{2})

(13)

The predicate relation PRD forms a predicate-type positional relation with the two entities in the nearest context:

re_{2} = (PRD(\text{is located at}), K_{1}, K_{4}, <mod(\text{position}), \text{“the east side of”}>)

(14)

The place name entities K₄, K₅, and K₆ form a SubClassOf relationship based on administrative divisions:

re_{3} = (SCO, K_{4}, K_{5})

(15)

re_{4} = (SCO, K_{5}, K_{6})

(16)

The outcomes of knowledge extraction for the case study are illustrated in Figure 5. Here, rounded rectangular boxes denote entities (nodes within the attribute graph), with distinct colors signifying various entity types. Hexagonal blue boxes illustrate attribute values. Solid directed arrows depict the relationships between these entities (edges of the attribute graph), whereas dotted undirected lines indicate the linkage between an entity and its attributes. The shaded area represents external knowledge, comprising entities and relationships pre-existing in the graph, such as the hierarchical administrative division of place names in this instance.
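The knowledge elements and relations of this example can be assembled into a small attribute graph and queried, as in the plain-Python sketch below. The node and edge contents follow K1 to K6 and Equations (13)–(16), with attributes abridged; the storage layout and the helper functions `administrative_chain` and `locate` are illustrative assumptions rather than the system's actual graph store or query interface.

```python
# Nodes: knowledge elements from the example (attributes abridged).
nodes = {
    "K1": {"entity": "Yintaishan Landslide", "attributes": {"Boundary": "clear"}},
    "K2": {"entity": "Sliding body", "attributes": {}},
    "K4": {"entity": "Yintaishan Village", "attributes": {}},
    "K5": {"entity": "Qiaoshan Town", "attributes": {}},
    "K6": {"entity": "Huangling County", "attributes": {}},
}

# Edges: (relation type, source, target, relation attributes), following Eqs. (13)-(16).
edges = [
    ("BOF", "K1", "K2", {}),
    ("PRD", "K1", "K4", {"predicate": "is located at",
                         "mod(position)": "the east side of"}),
    ("SCO", "K4", "K5", {}),
    ("SCO", "K5", "K6", {}),
]


def targets(source, rel_type):
    """Targets of all edges of the given type that leave the source node."""
    return [dst for rel, src, dst, _ in edges if src == source and rel == rel_type]


def administrative_chain(start):
    """Follow SCO edges upward from a place node and collect entity names."""
    chain, current, seen = [], start, set()
    while current is not None and current not in seen:
        seen.add(current)  # guards against cycles
        chain.append(nodes[current]["entity"])
        parents = targets(current, "SCO")
        current = parents[0] if parents else None
    return chain


def locate(entity_id):
    """Return (position modifier, administrative chain) for each location edge."""
    results = []
    for rel, src, dst, attrs in edges:
        if rel == "PRD" and src == entity_id and attrs.get("predicate") == "is located at":
            results.append((attrs.get("mod(position)"), administrative_chain(dst)))
    return results


print(locate("K1"))
# [('the east side of', ['Yintaishan Village', 'Qiaoshan Town', 'Huangling County'])]
```

The query combines an extracted positional relation with the pre-existing administrative hierarchy, which corresponds to the shaded external-knowledge region described for Figure 5.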

5. Discussion

The experimental results underscore the effectiveness of the proposed joint extraction framework for loess landslide geological disasters, particularly in accurately identifying and structuring key influencing factors such as slope characteristics, lithology, triggering conditions, and spatial relationships. Integrating character-level GloVe embeddings with Soft-Lexicon lexical enhancements within the BiLSTM-CRF architecture yielded higher precision, recall, and F1 score than the existing baselines. As shown in Table 2, our method achieved an F1 score of 87.03%, an improvement over the baseline models. This performance gain highlights the advantages of combining domain-specific lexical knowledge with fine-grained character-level semantic representations, especially in Chinese-language geological texts where word segmentation ambiguities and rare terminology are prevalent.

Further comparative experiments (Table 3) demonstrate the superiority of character-level representations over word-level embeddings. Traditional word embeddings, such as Word2Vec and GloVe, achieved moderate performance. In contrast, the character-level GloVe method, particularly when enhanced by Soft-Lexicon, provided a more robust representation of rare and morphologically rich terms, resulting in improved entity recognition accuracy. This confirms the advantage of encoding domain lexicons, part-of-speech tags, and boundary information into the character sequence, which offers richer semantic context and mitigates issues related to out-of-vocabulary terms.

Beyond performance metrics, the qualitative outcomes of knowledge graph construction highlight the practical value of the extracted structured information, especially concerning hierarchical and spatial relationships among influencing factors. Notably, the integration and accurate extraction of spatial relationships substantially enhance the framework’s applicability to loess geological hazards. Spatial information is fundamental in characterizing the distribution, extent, and interactions of landslide-prone areas in the Loess Plateau, supporting spatially informed risk assessment and disaster response. The constructed spatially enriched knowledge graph explicitly models nested spatial structures and multi-level administrative hierarchies, enabling advanced semantic queries and intelligent reasoning. These capabilities are essential for effective hazard trend detection, scenario-based disaster analysis, and spatial decision-making in geohazard management.

The conceptual ontology of influencing factors constructed in this study (Figure 1) and the refined core influencing factor framework (Figure 2) provide a theoretical basis for analyzing loess landslide geological hazards. The specific extraction results presented in Figure 5 primarily reflect topographic factors and some aspects related to rock layer structure and geomorphic features. For example, the case descriptions detail the shape, location, orientation, elevation differences, and dimensional parameters (length and width) of the landslide body, which directly correspond to the “topographic factors” and “rock formation structure” categories defined in the ontology. Although factors such as hydrogeological conditions and climate effects are not explicitly illustrated in this particular case, these categories are included in the overall dataset and the annotation system. Due to the fragmentary nature of individual texts, some factors may not appear in every example, but all six core influencing factor categories are represented across the entire corpus, ensuring comprehensive coverage of the ontology. Notably, during the extraction process, certain local geomorphological details described in the texts were more precise than those currently reflected in the ontology, such as specific slope shapes, micro-relief features, and orientation descriptors. This suggests that the “topographic factors” category could be further refined in future work to better capture fine-grained spatial characteristics.

This study further benefits from applying the framework to real-world field investigation records from a representative region of the Loess Plateau, encompassing over 500,000 characters and 10,000 sentences. The annotated corpus derived from these sources not only offers a valuable resource for model training and evaluation but also showcases the feasibility of semi-automated annotation methods, thereby accelerating corpus development in domains with limited data availability.

Nevertheless, some limitations remain. First, the model’s performance depends heavily on the quality and coverage of the domain lexicons used in the Soft-Lexicon enhancement. Missing or inconsistent lexical entries can cause misclassification, especially for rare geological terms. Second, despite regularization efforts, the relatively small and domain-specific corpus raises concerns about overfitting and limits generalizability to other regions or disaster types. Third, as the method is tailored for Chinese with character-level embeddings, its applicability to other languages or domains may require adaptation. Finally, complex nested entities and context-dependent ambiguities still challenge the model, warranting further research on enhanced contextual modeling and dynamic lexicon updates.

Extracting information from texts on geological disasters—particularly those detailing loess landslides—presents significant challenges for our model due to complex and rare domain-specific terminology, nested entity structures, and part-of-speech (POS) ambiguities. For instance, consider the following description: “the gneissic granite exhibits well-developed joints, with joint orientation presented as N160°∠45°. Part of the rock mass is affected by the intrusion of dioritic dikes, forming a complex metamorphic rock contact zone. Locally, the rock exhibits a crystal face offset structure, accompanied by distinct shear displacement.” This excerpt contains multilayered, nested entities that integrate geological formations with spatial attributes, alongside bilingual terms in parentheses. Such constructions impede sequence labeling models, which often either merge disparate entities or mis-segment them. Furthermore, lexical items like “crystal face offset” and “shear displacement” denote highly specialized structural phenomena that frequently masquerade as verbs or adjectives based on contextual cues, thereby complicating both POS tagging and boundary detection tasks. These difficulties stem primarily from the terms’ rarity in training corpora, their deep hierarchical nesting, and the hybrid linguistic nature of the text. Although incorporating domain ontologies and Soft-Lexicon enhancements boosts recognition for common entities, the precise extraction of these complex, rare, and deeply nested instances remains an unresolved problem. Future work will prioritize advancing nested entity recognition capabilities and expanding domain-specific lexicons to better accommodate these specialized terms.

Future work will aim to expand the ontology to encompass a wider range of geological disasters and integrate multi-source spatial data, such as remote sensing and sensor measurements, to further improve spatial reasoning and representation. Efforts will also focus on automating the annotation process further through active learning and weak supervision, reducing human workload. Additionally, enhancing the model’s adaptability for cross-lingual and cross-domain applications will be explored to broaden its applicability and robustness in diverse geological hazard analysis.

6. Conclusions

This study introduces an intelligent extraction framework to tackle the challenges posed by heterogeneous terminology, nested expressions, and scarce domain resources in loess landslide geological disaster texts. By integrating domain ontology, Soft-Lexicon enhancement, GloVe character embeddings, and a BiLSTM-CRF decoding layer, the framework automates the conversion of unstructured text into structured knowledge graphs. This research offers a promising method for extracting domain knowledge from unstructured geological texts, with significant implications for disaster comprehension, risk assessment, and spatial knowledge modeling. The following conclusions are drawn: (1) A comprehensive domain ontology was developed, covering six key categories of influencing factors with hierarchical relationships and spatial attributes. This ontology, combined with entity conceptualization and a semi-automatic nested entity annotation scheme, guided the creation of a high-quality annotated corpus, enhancing annotation accuracy and efficiency. (2) A joint extraction and annotation system was developed, which enhances character-level GloVe embeddings with domain-specific lexical features using a Soft-Lexicon mechanism, and feeds the resulting hybrid representations into a BiLSTM-CRF decoding layer. This system enables the accurate extraction of nested entities, attribute-value pairs, and spatial expressions, significantly improving the model’s ability to semantically parse complex geological disaster texts. (3) A content-knowledge model was developed to encode disaster entities, attributes, spatial metadata, and their interrelations into a structured knowledge graph framework suitable for semantic querying, facilitating reasoning and advanced disaster intelligence analysis.

Author Contributions

Conceptualization, L.H. and Y.Z.; methodology, Y.Z. and D.W.; data curation, J.L., Y.L., T.Y. and Y.X.; formal analysis, L.H., Y.L., T.Y. and Y.X.; writing—original draft, L.H.; writing—review and editing, L.H. and Y.Z.; visualization, Z.X.; supervision, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Project of the Ningxia Hui Autonomous Region, China (grant number: 2024BEG01005), and the National Institute of Natural Hazards, Ministry of Emergency Management of China (grant number: ZDJ2024-11).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

We sincerely thank Hongze Yang, Bianbian Sun, and Xiaodong Zhang for their valuable contributions to this work, including assistance with data collection, manuscript revision, and support in securing research funding. Their efforts greatly contributed to the success of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Cruden, D.M. A Simple Definition of a Landslide. Bull. Int. Assoc. Eng. Geol. 1991, 43, 27–29. [Google Scholar] [CrossRef]
Hungr, O.; Leroueil, S.; Picarelli, L. The Varnes Classification of Landslide Types, an Update. Landslides 2014, 11, 167–194. [Google Scholar] [CrossRef]
Xie, W.; Guo, Q.; Wu, J.Y.; Li, P.; Yang, H.; Zhang, M. Analysis of Loess Landslide Mechanism and Numerical Simulation Stabilization on the Loess Plateau in Central China. Nat. Hazards 2021, 106, 805–827. [Google Scholar] [CrossRef]
Turner, A.K. Challenges and Trends for Geological Modelling and Visualisation. Bull. Eng. Geol. Environ. 2006, 65, 109–127. [Google Scholar] [CrossRef]
Chen, L.; Ge, X.; Yang, L.; Li, W.; Peng, L. An Improved Multi-Source Data-Driven Landslide Prediction Method Based on Spatio-Temporal Knowledge Graph. Remote Sens. 2023, 15, 2126. [Google Scholar] [CrossRef]
Qiu, Q.; Xie, Z.; Wang, S.; Zhu, Y.; Lv, H.; Sun, K. ChineseTR: A Weakly Supervised Toponym Recognition Architecture Based on Automatic Training Data Generator and Deep Neural Network. Trans. GIS 2022, 26, 1256–1279. [Google Scholar] [CrossRef]
Fan, R.; Wang, L.; Yan, J.; Song, W.; Zhu, Y.; Chen, X. Deep Learning-Based Named Entity Recognition and Knowledge Graph Construction for Geological Hazards. ISPRS Int. J. Geo-Inf. 2019, 9, 15. [Google Scholar] [CrossRef]
Sun, X.; Chen, J.; Bao, Y.; Han, X.; Zhan, J.; Peng, W. Landslide Susceptibility Mapping Using Logistic Regression Analysis along the Jinsha River and Its Tributaries Close to Derong and Deqin County, Southwestern China. ISPRS Int. J. Geo-Inf. 2018, 7, 438. [Google Scholar] [CrossRef]
Sharma, A.; Prakash, C.; Manivasagam, V. Entropy-Based Hybrid Integration of Random Forest and Support Vector Machine for Landslide Susceptibility Analysis. Geomatics 2021, 1, 399–416. [Google Scholar] [CrossRef]
Qiu, Q.; Wu, L.; Ma, K.; Xie, Z.; Tao, L. A Knowledge Graph Construction Method for Geohazard Chain for Disaster Emergency Response. Earth Sci. 2023, 48, 1875–1891. (In Chinese) [Google Scholar]
Qiu, P.; Pang, L.; Luo, Y.; Liu, Y.; Xing, H.; Liu, K.; Zhuang, G. Earthquake Event Knowledge Graph Construction and Reasoning. Geomat. Nat. Hazards Risk 2024, 15, 2383768. [Google Scholar] [CrossRef]
Wen, M.; Qiu, Q.; Zheng, S.; Ma, K.; Zheng, S.; Xie, Z.; Tao, L. Construction and Application of a Multilevel Geohazard Domain Ontology: A Case Study of Landslide Geohazards. Appl. Comput. Geosci. 2023, 20, 100134. [Google Scholar] [CrossRef]
Guzzetti, F.; Mondini, A.C.; Cardinali, M.; Fiorucci, F.; Santangelo, M.; Chang, K.-T. Landslide Inventory Maps: New Tools for an Old Problem. Earth-Sci. Rev. 2012, 112, 42–66. [Google Scholar] [CrossRef]
Wang, H.; Niu, R.; Han, Y.; Deng, Q. Construction of a Geological Fault Corpus and Named Entity Recognition. Appl. Sci. 2025, 15, 2465. [Google Scholar] [CrossRef]
Ma, R.; Peng, M.; Zhang, Q.; Huang, X. Simplify the Usage of Lexicon in Chinese NER 2020. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5951–5960. [Google Scholar]
Liu, H.; Qiu, Q.; Wu, L.; Li, W.; Wang, B.; Zhou, Y. Few-Shot Learning for Name Entity Recognition in Geological Text Based on GeoBERT. Earth Sci. Inform. 2022, 15, 979–991. [Google Scholar] [CrossRef]
Singh, K.; Tordesillas, A. Spatiotemporal Evolution of a Landslide: A Transition to Explosive Percolation. Entropy 2020, 22, 67. [Google Scholar] [CrossRef]
Zheng, X.; Wang, B.; Zhao, Y.; Mao, S.; Tang, Y. A Knowledge Graph Method for Hazardous Chemical Management: Ontology Design and Entity Identification. Neurocomputing 2021, 430, 104–111. [Google Scholar] [CrossRef]
Wang, S.; Zhang, X.; Ye, P.; Du, M. Deep Belief Networks Based Toponym Recognition for Chinese Text. ISPRS Int. J. Geo-Inf. 2018, 7, 217. [Google Scholar] [CrossRef]
Zhou, B.; Zou, L.; Hu, Y.; Qiang, Y.; Goldberg, D. TopoBERT: A Plug and Play Toponym Recognition Module Harnessing Fine-Tuned BERT. Int. J. Digit. Earth 2023, 16, 3045–3064. [Google Scholar] [CrossRef]
Zhang, W.; Meng, J.; Wan, J.; Zhang, C.; Zhang, J.; Wang, Y.; Xu, L.; Li, F. ChineseCTRE: A Model for Geographical Named Entity Recognition and Correction Based on Deep Neural Networks and the BERT Model. ISPRS Int. J. Geo-Inf. 2023, 12, 394. [Google Scholar] [CrossRef]
Liu, X.; Shao, S.; Zhang, C.; Shao, S. Landslide Susceptibility Prediction in the Loess Tableland Considering Geomorphic Evolution. CATENA 2025, 249, 108668. [Google Scholar] [CrossRef]
Qiu, Q.; Xie, Z.; Zhang, D.; Ma, K.; Tao, L.; Tan, Y.; Zhang, Z.; Jiang, B. Knowledge Graph for Identifying Geological Disasters by Integrating Computer Vision with Ontology. J. Earth Sci. 2023, 34, 1418–1432. [Google Scholar] [CrossRef]
Datta, M. Factors Affecting Slope Stability Of Landfill Covers. In Advances in Environmental Geotechnics; Chen, Y., Zhan, L., Tang, X., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 620–624. ISBN 978-3-642-04459-5. [Google Scholar]
Zheng, Y.; Liu, J. The Influence of Material Factor on Slope Stability. Electron. J. Geotech. Eng. 2016, 21, 2379–2388.
Naghadehi, M.Z.; Jimenez, R.; KhaloKakaie, R.; Jalali, S.-M.E. A Probabilistic Systems Methodology to Analyze the Importance of Factors Affecting the Stability of Rock Slopes. Eng. Geol. 2011, 118, 82–92.
Yang, J.F.; Guan, Y.; He, B.; Qu, C.Y.; Yu, Q.B.; Liu, Y.X.; Zhao, Y.J. Corpus Construction for Named Entities and Entity Relations on Chinese Electronic Medical Records. J. Softw. 2016, 27, 2725–2746. (In Chinese)
Zan, H.; Liu, T.; Niu, C.; Zhao, Y.; Zhang, K.; Sui, Z. Construction and Application of Named Entity and Entity Relations Corpus for Pediatric Diseases. J. Chin. Inf. Process. 2020, 34, 19–26. (In Chinese)
Huang, S.; Wang, D. Review of Corpus Research in China. J. Inf. Resour. Manag. 2021, 11, 4–17. (In Chinese)
Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991.
Kim, H.; Kim, J.-E.; Kim, H. Exploring Nested Named Entity Recognition with Large Language Models: Methods, Challenges, and Insights. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 8653–8670.
Zhou, J.S.; Dai, X.Y.; Yin, C.Y.; Chen, J.-J. Automatic Recognition of Chinese Organization Name Based on Cascaded Conditional Random Fields. Acta Electron. Sin. 2006, 34, 804–809. (In Chinese)
Li, Y.; He, Y.; Qian, L.; Zhou, G. Chinese Nested Named Entity Recognition Corpus Construction. J. Chin. Inf. Process. 2018, 32, 19–26. (In Chinese)
Jin, Y.; Xie, J.; Wu, D. Chinese Nested Named Entity Recognition Based on Hierarchical Tagging. J. Shanghai Univ. (Nat. Sci. Ed.) 2022, 28, 270–280. (In Chinese)
Wei, D.; Jiang, B.; Zhang, J. Research on Content Storage Method of Unstructured Geological Data. Northwest. Geol. 2021, 54, 266–273. (In Chinese)
Baevski, A.; Edunov, S.; Liu, Y.; Zettlemoyer, L.; Auli, M. Cloze-Driven Pretraining of Self-Attention Networks. arXiv 2019, arXiv:1903.07785.
Jiang, Y.; Hu, C.; Xiao, T.; Zhang, C.; Zhu, J. Improved Differentiable Architecture Search for Language Modeling and Named Entity Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3583–3588.

Figure 1. System diagram of factors affecting slope stability.

Figure 2. Six major concepts of influencing factors.

Figure 3. Progressive entity/relation/attribute extraction process.

Figure 4. Content-knowledge model structure diagram.

Figure 5. Visualization of graphs filled with knowledge elements and connections.

Table 1. Example table of primitive concepts.

Primitive Types	Concept Category	Subclasses	Examples
NST	The nature of the soil	Chronostratigraphy	The lithology of the sliding bed is mainly sandstone intercalated with mudstone in the Hujiacun Formation of the Upper Triassic.
ATN	/	Geotechnical type	The lithology of the sliding bed is mainly sandstone.
ATO	/	The hardness of the soil	Q2eol loess is dense and hard.
ATO	/	Weathering resistance	The surface loess is strongly weathered.
ATV	Rock structure	Occurrence–tendency	The vertical joints have orientations of 156°/89° and 203°/84°, respectively, where the first number indicates the strike angle and the second the dip angle.
ATN	/	Joints and fissures	The collapse is located in a rock slope with well-developed joints, weathering, and unloading fissures.
ATO	/	Joint development degree	The loess on the slope has well-developed vertical joints.
Prepositional relationship	Spatial topological relationship	Include	The Liuping landslide is located within the Liuping Group, Zouqitou, Liuping, Wuqi Town.
Prepositional relationship	/	Overlap	The rear edge of the landslide is traversed by the Baoji Gorge Irrigation Canal.
Predicate relation	Spatial direction relationship	Relative direction	The unloading platform landslide is located on the right bank of the Luo River.
Predicate relation	/	Absolute direction	The landslide body is located on the north side of the loess ridge.

Because the experiments were conducted on Chinese texts, the original examples are in Chinese. Bold text corresponds to the “Primitive Types” column and indicates examples of the corpus annotation types, so the bold formatting has been retained.

Table 2. Performance comparison between the proposed method and other methods.

Model	P (%)	R (%)	F1 (%)
Word-based GloVe RNN + CRF [38]	83.91	80.60	82.22
charCNN LSTM + CRF [37]	83.43	82.89	83.16
Soft-Lexicon BERT LSTM + CRF [15]	85.11	83.27	84.18
Soft-Lexicon character-level GloVe BiLSTM-CRF	87.45	86.62	87.03

Table 3. Comparison of primitive named entity recognition performance across input sequence representation methods.

Input Representation	Encoder/Decoder	Precision (%)	Recall (%)	F1 (%)
Word-level word2vec	LSTM + CRF	82.74	80.57	81.64
Word-level GloVe	LSTM + CRF	81.91	80.60	81.25
Character-level GloVe	LSTM + CRF	83.59	81.93	82.75
Soft-Lexicon	LSTM + CRF	84.88	82.27	83.55
Soft-Lexicon character-level GloVe	LSTM + CRF	87.45	86.62	87.03
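
The F1 values in Table 3 can be cross-checked against the standard definition of F1 as the harmonic mean of precision and recall. The sketch below is a reader-side consistency check and not part of the authors' pipeline; it treats the first numeric column as precision (its last row matches the P value reported in Table 2), and the names `f1_score`, `TABLE_3`, and `gaps` are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (same units as inputs)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


# (precision %, recall %, reported F1 %) copied from Table 3.
TABLE_3 = {
    "Word-level word2vec": (82.74, 80.57, 81.64),
    "Word-level GloVe": (81.91, 80.60, 81.25),
    "Character-level GloVe": (83.59, 81.93, 82.75),
    "Soft-Lexicon": (84.88, 82.27, 83.55),
    "Soft-Lexicon character-level GloVe": (87.45, 86.62, 87.03),
}

# Recompute F1 for every row and record the gap to the reported value.
gaps = {
    name: abs(f1_score(p, r) - reported)
    for name, (p, r, reported) in TABLE_3.items()
}

for name, gap in gaps.items():
    print(f"{name}: |recomputed F1 - reported F1| = {gap:.4f}")
```

Under this reading, every recomputed F1 agrees with the reported value to within rounding, which supports interpreting the first numeric column as precision rather than token-level accuracy.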

Table 4. Text extraction results.

English Expression	Tag
Yintaishan Landslide	SUB
is located at	PRD
the east side of	MOD
Huangling County	NS
Qiaoshan Town	NS
Yintaishan Village	NS
[Huangling County, Qiaoshan Town, Yintaishan Village]	NST
The boundary	ATTN
is clear	ATTV
On the flat surface	MOD
it presents	ATTN
a dustpan shape	ATTV
The sliding body	BOF
is narrow at the top and wide at the bottom	ATTV
and the morphology presents	ATTN
higher in the east and lower in the west	ATTV
The landslide	SUB
has a length of	ATTN
about	MOD
200 m	ATTV
and a front edge width of	ATTN
about	MOD
300 m	ATTV

As the experiment is conducted on Chinese texts, the original examples are presented in Chinese. To facilitate comprehension, English translations are provided in parallel.
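
To illustrate how a tagged sequence such as the one in Table 4 might be turned into structured records, the following minimal sketch pairs attribute names with attribute values. It is not the authors' content-knowledge model. It assumes that SUB and BOF open a new subject, that ATTN and ATTV mark an attribute name and its value, and that a MOD between them qualifies the value; these tag readings and the function name `pair_attributes` are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, str]                          # (text, tag)
Triple = Tuple[Optional[str], Optional[str], str]  # (subject, attribute, value)


def pair_attributes(segments: List[Segment]) -> List[Triple]:
    """Pair each ATTV with the latest subject and pending ATTN (assumed tag semantics)."""
    triples: List[Triple] = []
    subject: Optional[str] = None
    name: Optional[str] = None
    modifier: Optional[str] = None
    for text, tag in segments:
        if tag in ("SUB", "BOF"):      # assumed: both tags open a new subject
            subject, name, modifier = text, None, None
        elif tag == "ATTN":            # attribute name; drop any earlier modifier
            name, modifier = text, None
        elif tag == "MOD":             # remember the most recent modifier
            modifier = text
        elif tag == "ATTV":            # attribute value closes the pending pair
            value = f"{modifier} {text}" if modifier and name else text
            triples.append((subject, name, value))
            name, modifier = None, None
    return triples


# Rows copied from Table 4; the NS/NST location rows are omitted for brevity.
segments = [
    ("Yintaishan Landslide", "SUB"), ("is located at", "PRD"),
    ("the east side of", "MOD"),
    ("The boundary", "ATTN"), ("is clear", "ATTV"),
    ("On the flat surface", "MOD"), ("it presents", "ATTN"),
    ("a dustpan shape", "ATTV"),
    ("The sliding body", "BOF"),
    ("is narrow at the top and wide at the bottom", "ATTV"),
    ("and the morphology presents", "ATTN"),
    ("higher in the east and lower in the west", "ATTV"),
    ("The landslide", "SUB"), ("has a length of", "ATTN"),
    ("about", "MOD"), ("200 m", "ATTV"),
    ("and a front edge width of", "ATTN"), ("about", "MOD"), ("300 m", "ATTV"),
]

triples = pair_attributes(segments)
for triple in triples:
    print(triple)
```

A value that appears without a preceding attribute name, as in the BOF row, is kept with an attribute of `None` so that no extracted value is silently dropped.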

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

