A Refined Span Classification Model for Recognizing Nested Named Entity in Marine Meteorological Disaster Texts

Ni, Weijian; Wang, Wenjing; Xie, Nengfu; Liu, Tong; Zeng, Qingtian; Liu, Cong

doi:10.3390/ijgi15060258

by Weijian Ni ¹,², Wenjing Wang ², Nengfu Xie ¹,*, Tong Liu ², Qingtian Zeng ² and Cong Liu ³

¹ Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China

² College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China

³ NOVA Information Management School, Universidade Nova de Lisboa, 1070-312 Lisbon, Portugal

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2026, 15(6), 258; https://doi.org/10.3390/ijgi15060258 (registering DOI)

Submission received: 17 April 2026 / Revised: 28 May 2026 / Accepted: 1 June 2026 / Published: 10 June 2026

Abstract

Named entity recognition (NER) in marine meteorological disaster texts is essential for automated information extraction and disaster management. However, disaster-chain descriptions often contain nested entities that are difficult for conventional flat NER models to represent. This paper proposes PRSpan, a position-role-aware span classification model for nested NER. PRSpan incorporates Rotary Position Embedding (RoPE)-enhanced attention for relative position-aware boundary modeling and uses Conditional Layer Normalization (CLN) to generate role-specific Head, Mid, and Tail token features. A Positional Role Pooling strategy further aggregates these features into span representations to preserve boundary cues and internal semantic coherence. To support evaluation, we construct MMD-NER, a domain-specific dataset containing 1899 sentences, 17,017 entities in 11 categories, and 2978 nested entity pairs through a four-step LLM-assisted pipeline. Experimental results show that PRSpan achieves Micro-F1 and Macro-F1 scores of 94.58% and 93.47%, outperforming the strongest baseline by 3.61 and 3.93 percentage points, respectively. Additional analyses verify the effectiveness of RoPE-enhanced attention, role-specific feature generation, and Positional Role Pooling. Cross-domain transfer and LLM prompting comparisons further demonstrate the practical value of PRSpan for nested entity extraction in low-resource Earth science domains.

Keywords: nested named entity recognition; marine meteorological disasters; Natural Language Processing; deep learning

1. Introduction

Marine meteorological disasters, including typhoons, storm surges, and severe cold waves, rank among the most devastating natural hazards in coastal regions, inflicting profound human, economic, and environmental tolls. These phenomena not only endanger the lives and property of coastal communities but also disrupt marine ecosystems, compromise maritime navigation safety, and inflict billions in losses on industries such as fisheries, shipping, and offshore energy production. Over the decades, a vast and growing body of textual data—spanning disaster situation reports, marine meteorological early warning bulletins, emergency response records, and the scientific literature—has been accumulated by national and regional maritime agencies. These documents encode rich, structured knowledge about disaster chains, from triggering mechanisms and meteorological conditions to cascading impacts and response measures. However, the sheer volume and complexity of such texts far exceed the capacity of manual analysis, creating an urgent need for automated information extraction methods that can systematically distill actionable intelligence for real-time early warning systems, informed emergency decision-making, and comprehensive risk assessment [1,2].

At the heart of this challenge lies named entity recognition (NER), a fundamental task in Natural Language Processing (NLP) that seeks to automatically identify and classify entities with specific semantic categories in text, delineating their boundaries and types. NER serves as the foundational step for constructing domain knowledge graphs and enabling downstream analytical applications, including relation extraction [3], question-answering systems [4], and information retrieval [5]. In the geoscience domain, NER has been increasingly applied to tasks such as geological hazard text analysis [6], mineral resource report mining [7], and ecological text processing [2], demonstrating the value of text mining as a means of knowledge discovery in Earth science domains. However, conventional NER approaches are designed for flat entity structures, where entities do not overlap or nest within one another [8,9,10,11]. This assumption breaks down in marine meteorological disaster texts, where entities frequently exhibit complex nested structures that reflect the inherent hierarchical logic of disaster events. For instance, in the expression “typhoon-induced storm surge in the South China Sea”, the entity “typhoon” (a primary marine disaster) is semantically embedded within the broader entity “typhoon-induced storm surge” (a secondary marine disaster), which is itself nested within a geographic scope “the South China Sea”. Such multi-layered nesting directly mirrors the disaster chain from cause to consequence, and capturing these inclusion relationships is essential for reconstructing the full causal and spatial structure of disaster events. This requirement has driven the emergence of nested named entity recognition (NNER) [12], which not only pinpoints entity boundaries and types but also elucidates inclusion relationships among nested entities (as illustrated in Figure 1). 
By unraveling these semantic hierarchies, NNER enables a more complete extraction of the disaster-chain information encoded in professional texts, supporting applications ranging from disaster event knowledge graph construction to coastal risk assessment.

Despite its importance, NNER research specifically targeting the marine meteorological disaster domain remains in its infancy. A critical bottleneck is the absence of publicly available, professionally annotated corpora that capture the domain’s characteristic nested structures. Unlike the texts in general-domain NER datasets, texts in this field are highly specialized and draw on complex terminology systems spanning oceanography and meteorology, so defining entity boundaries and classifying entity categories demands specialized domain knowledge. This scarcity of annotated resources directly limits both model training and systematic evaluation. On the methodological side, region-based span classification methods have gradually become the mainstream approach for NNER [13,14,15]. Their core advantage lies in directly modeling all possible spans in the text and recognizing entities by learning the association between the start and end positions of spans and their semantic features. However, when facing the complex nested scenarios characteristic of marine meteorological disaster texts, current span classification methods still exhibit two notable limitations. First, in terms of entity boundary characterization, existing methods often rely on simple distance encoding to handle the relative positional dependencies of tokens within text spans. For long spans composed of compound technical terms with multiple modifiers (common in disaster descriptions such as “coastal waterlogging triggered by Level-12 rotating winds from a severe typhoon”), they struggle to accurately capture boundary features, leading to deviations in determining the start and end positions of entity spans. Second, in intra-span semantic modeling, most methods treat spans as homogeneous wholes for feature aggregation, overlooking the functional differences of tokens at different positions within a span: the start position bears the semantic signal of entity initiation, the end position conveys the information of entity closure, and the intermediate positions are responsible for semantic cohesion. The lack of this positional role distinction may weaken the accuracy of entity type determination, particularly for densely nested entities where precise boundary discrimination is critical.

To address these challenges, this study proposes PRSpan, a refined position-role-aware span classification framework tailored for the complex nested structures prevalent in marine meteorological disaster texts. At its core, PRSpan incorporates RoPE-enhanced attention into the span representation process to strengthen relative positional modeling between span boundary tokens and internal tokens, thereby improving sensitivity to entity boundary features. It should be noted that this study does not modify the mathematical formulation of the original Rotary Position Embedding (RoPE); instead, RoPE is introduced into the PRSpan attention module to enhance relative position-aware boundary modeling. Furthermore, we introduce a Position-Role-Aware Span Representation Mechanism that utilizes Conditional Layer Normalization (CLN) to dynamically generate adaptive feature representations for each token according to its three positional roles, namely Head, Mid, and Tail. These role-differentiated features are aggregated through Positional Role Pooling to produce entity-level span encodings, enabling the model to simultaneously preserve boundary signals and capture internal semantic coherence.

The main contributions of this study are as follows:

(1) We construct and validate MMD-NER, the first specialized nested NER dataset for marine meteorological disaster texts. Built through a four-step LLM-assisted pipeline, MMD-NER contains 11 domain-specific entity types, 1899 sentences, 17,017 entities, and 2978 nested entity pairs. Its annotation quality is further examined through human-gold validation, error analysis, and LLM-backbone comparison.
(2) We propose PRSpan, a position-role-aware span classification framework for nested entity recognition. By combining RoPE-enhanced attention, CLN-generated Head/Mid/Tail role features, and Positional Role Pooling, PRSpan captures both boundary-sensitive positional dependencies and span-internal semantic coherence.
(3) We perform comprehensive evaluations on MMD-NER and related disaster-domain data. Beyond overall comparisons with representative baselines, we conduct fine-grained structural analysis, extended ablation studies, cross-domain transfer evaluation, LLM prompting comparison, and qualitative error analysis, demonstrating PRSpan’s effectiveness, robustness, and applicability.

The remainder of this paper is organized as follows. Section 2 reviews related work, Section 3 defines the task and entity schema, Section 4 presents PRSpan, Section 5 reports dataset validation and experimental analyses, Section 6 discusses robustness, deployment, ethical considerations, and limitations, and Section 7 concludes the paper.

2. Related Work

2.1. General Nested Named Entity Recognition Methods

Nested named entity recognition extends conventional flat named entity recognition by allowing one entity mention to be fully or partially embedded within another. This breaks the one-token-one-label assumption of sequence labeling and makes the task substantially more challenging. Recent studies have developed several representative methodological lines, including layer-based methods, span-based methods, graph-based methods, and parsing-based methods.

Layer-based methods leverage the hierarchical structure of nested entities by decomposing NNER into multiple recognition stages. Wang et al. [16] proposed Pyramid, which recursively applies stacked flat NER layers to identify entities of different lengths, while Rojas et al. [17] showed that a comparatively simple Multiple LSTM-CRF architecture can still achieve strong performance. Although these methods align naturally with entity hierarchy, they remain vulnerable to error propagation when outer-layer predictions rely heavily on inner-layer outputs.

Span-based methods formulate NNER as classification over candidate spans rather than token-level labeling. Yu et al. [18] proposed a biaffine span-scoring framework that scores candidate spans in a structured manner, while Su et al. [15] introduced GlobalPointer, which models entity boundaries with a two-dimensional scoring matrix and relative positional information. These methods are well suited to overlapping and nested entities, but their performance still depends on the effective modeling of boundary cues, relative positions, and intra-span semantics.

Graph-based methods encode nested structures through graph interactions among tokens or spans. Luo and Zhao (2020) [19] proposed the BiFlaG model, which represents nested entities through a bipartite flat-graph network to capture interactions between inner and outer entities. Wan et al. [20] further introduced span-level graphs to model relationships among candidate spans and improve recognition of long or low-frequency nested entities. These methods are effective in capturing structural dependencies beyond independent span classification, although their graph construction and interaction modeling can increase architectural complexity.

Parsing-based methods formulate nested NER as a structured prediction task closely related to constituency parsing. Yang and Tu (2022) [21] proposed a bottom–up pointer-network framework that generates nested entities through structured decoding, while Lou et al. [22] modeled the task as latent lexicalized constituency parsing and demonstrated its strong global modeling capacity. Although these methods are well suited to hierarchical structures, they are generally more complex than standard span-based approaches.

Overall, recent NNER research has increasingly moved toward more direct modeling of spans, boundaries, and hierarchical dependencies. Among existing paradigms, span-based methods remain particularly appealing because they naturally accommodate overlapping mentions and integrate effectively with pre-trained language models. For this reason, they form the methodological basis of the present study. Nevertheless, existing span-based methods still have limitations in modeling relative positional dependencies in long, nested technical expressions and in distinguishing the semantic roles of boundary and internal tokens within a span.

2.2. Named Entity Recognition Methods in the Marine Meteorological Disaster Domain

Compared with several better-studied domain applications of named entity recognition, research on marine meteorological disaster texts remains limited. Existing work in adjacent fields nevertheless shows that Earth science and hazard-related texts require dedicated resources and domain-adapted models. Qiu et al. [23] demonstrated that unstructured geoscience reports contain rich spatiotemporal and semantic information that can be extracted through text mining workflows. Lv et al. [24] further proposed a BERT-based model for named entity recognition in the geoscience domain, confirming that specialized geological texts benefit from domain-oriented representation learning. In a closely related hazard setting, Sun et al. [25] constructed a natural hazard annotated corpus and compared deep learning approaches for hazard-oriented named entity recognition. In the marine domain, Zhao et al. [2] built a specialized corpus for Chinese marine coral reef texts and proposed a BERT-BiGRU-Att-CRF framework, further indicating that marine scientific texts also require tailored corpora and models.

Despite these advances, an important gap remains for the task studied in this paper. Most existing work in adjacent Earth science and hazard-related domains focuses on flat named entity recognition or broader information extraction, rather than nested named entity recognition. However, marine meteorological disaster texts frequently contain hierarchical disaster-chain expressions in which meteorological conditions, disaster events, affected regions, and resulting impacts are embedded within one another. In addition, the domain is highly specialized and semantically dense, with many long compound expressions drawn from oceanography, meteorology, ecology, emergency response, and maritime operations. These properties make entity boundaries difficult to determine and often require substantial domain knowledge for accurate labeling.

Overall, prior studies have provided useful foundations for entity recognition in adjacent hazard-related and Earth science domains, but a clear gap remains at the intersection of marine meteorological disaster texts and nested entity recognition. To the best of our knowledge, there is still no dedicated benchmark and no span-based framework specifically tailored to the long technical expressions and disaster-chain-driven nested structures of this domain. This gap motivates the present study, which contributes both a domain-specific NNER dataset and a refined span classification model for marine meteorological disaster texts.

2.3. Prompting Large Language Models for NER and Synthetic NER Data Generation

Recent advances in large language models (LLMs) have opened a new research direction for named entity recognition, encompassing both prompt-based entity extraction and LLM-assisted data annotation or synthesis. Prompt-oriented methods reformulate NER as a prompting task rather than a conventional token classification problem. For example, Shen et al. [26] proposed PromptNER, which unifies entity detection and typing within a prompt-based framework. Such studies suggest that prompting provides a flexible alternative for NER, particularly in low-resource settings.

At the same time, LLMs have increasingly been used not only as predictors but also as annotators or data generators. Zhang et al. [27] proposed LLMaAA, which incorporates LLMs into an active annotation loop and shows that models trained on LLM-generated labels can achieve competitive performance at lower cost. Santoso et al. [28] further explored few-shot synthetic data generation for low-resource NER, showing that prompting open-source LLMs with a small number of labeled examples can improve baseline performance. Kamath and Vajjala (2025) [29] systematically examined the utility of LLM-generated synthetic NER data and found that such data are most beneficial when labeled resources are unavailable, although limited high-quality human annotation remains more reliable. In domain-specific settings, Zhao et al. [30] reported that ChatGPT-assisted augmentation improves generalization in few-shot biomedical NER. Collectively, these studies indicate that LLM-based annotation and synthesis offer a promising way to reduce the cost of corpus construction.

Beyond NER-oriented prompting and data synthesis, LLMs have also been explored for disaster information extraction and knowledge graph construction. For example, Huang et al. [31] proposed TyphoonKGent, an LLM-based agent that constructs typhoon knowledge graphs through Chain-of-Thought reasoning and hierarchical knowledge representation, demonstrating the potential of LLMs for organizing fragmented disaster information and supporting downstream reasoning. However, such methods mainly focus on high-level knowledge organization and graph construction, whereas this study focuses on strict span-level nested entity recognition. For NER model training, accurate entity boundaries and type assignments are essential, while direct LLM prompting may still suffer from boundary over-extension, boundary under-extension, and category misjudgment in overlapping or nested structures. Therefore, LLM-based knowledge graph construction and PRSpan are complementary: the former supports high-level reasoning and knowledge organization, while the latter provides fine-grained and reliable entity extraction for downstream relation extraction or knowledge graph construction.

Most existing LLM-assisted studies focus on flat NER or high-level information organization, with limited attention to the structural constraints required for nested entities. This limitation directly motivates the dataset construction strategy described in Section 5.1. Rather than relying on one-step prompting, our study adopts a progressive prompting pipeline for nested entity generation in marine meteorological disaster texts. The pipeline integrates domain attribute induction, hierarchical entity pool construction, contextualized sentence generation, and boundary-aware self-correction, thereby guiding the LLM to produce structurally valid nested entity instances aligned with disaster-chain semantics. In this way, the present work extends LLM-assisted NER data generation from general-purpose flat augmentation or high-level knowledge organization to domain-specific nested NER data construction with explicit structural control.

3. Problem Description of Named Entity Recognition in the Marine Meteorological Disaster Domain

3.1. Named Entity Category Definition

The design of a domain-specific entity type system is the prerequisite for constructing a high-quality annotated corpus and training an effective NNER model. Unlike general-domain NER tasks where entity types such as Person, Organization, and Location are broadly applicable, marine meteorological disaster texts demand a specialized entity taxonomy that reflects the unique knowledge structure of this domain. Texts in this field—including disaster situation reports, marine meteorological early warning bulletins, and disaster news reports—are characterized by high information density, extensive use of cross-disciplinary technical terminology spanning oceanography and meteorology (e.g., subtropical high pressure ridge, typhoon Doppler radar, and red tide toxins), and detailed quantitative descriptions of disaster conditions, ecological impacts, and response measures. These characteristics necessitate an entity type system that can comprehensively capture all key elements of disaster events while preserving their inherent logical relationships.

To establish such a system, we conducted an in-depth analysis of a large number of texts in the marine meteorological disaster domain and identified a consistent organizing principle: the relationships between entities are not randomly distributed but revolve around the core theme of “the entire disaster chain process”. Descriptions in these texts typically follow a coherent semantic logic that progresses from the triggering causes and prerequisite meteorological conditions of a disaster, through its occurrence and cascading development, to the resulting ecological and socioeconomic impacts and the corresponding response measures. This disaster-chain framework, which captures the full lifecycle of marine meteorological disaster events, serves as the theoretical foundation for our entity type definition.

Guided by this framework, we systematically defined 11 entity types organized along four functional dimensions of the disaster chain as shown in Table 1. These types are designed to cover all critical elements from the initiation of a disaster event to its aftermath, enabling comprehensive information extraction for downstream applications such as disaster monitoring, emergency decision-making, and loss assessment.

The first functional dimension addresses temporal and spatial positioning. Time and SeaArea are defined to identify the temporal nodes and geographic scopes of disaster occurrence and impacts, establishing the fundamental spatiotemporal reference frame for disaster event localization.

The second dimension captures the disaster occurrence and development chain, which constitutes the semantic core of this entity type system. Four entity types are defined to represent the complete causal progression of a marine meteorological disaster: OceanPhen represents marine natural phenomena that serve as the initial triggering cause (e.g., El Niño events and red tide outbreaks); MarMetCond denotes the specific meteorological elements that constitute the critical prerequisites for disaster occurrence, marking the transition from a latent hazard to an active event (e.g., Level-12 wind speed and abnormal sea surface temperature); PrimaryMarDis is the core carrier of the disaster chain, representing the principal disaster body in the development process (e.g., typhoon and storm surge); and SecondaryMarDis captures the cascading impacts derived from the primary disaster’s propagation and interaction with local conditions (e.g., coastal waterlogging triggered by typhoon). Together, these four types form a directed causal chain: OceanPhen → MarMetCond → PrimaryMarDis → SecondaryMarDis. This chain mirrors the physical progression of disaster events from genesis to cascading consequence.

The third dimension addresses the disaster response and monitoring system. Three entity types are defined to represent the organizational, personnel, and technological components involved in disaster response: MarMetAgency denotes the professional institutions responsible for monitoring, forecasting, and coordinating disaster response (e.g., National Marine Environmental Forecasting Center); DisasterResp represents the professionals who implement on-the-ground response operations (e.g., maritime search and rescue personnel); and MarDisEqpt identifies the technical equipment and infrastructure used for disaster monitoring and mitigation (e.g., Doppler radar and breakwater).

The fourth dimension captures the ecological and environmental impact. MarOrganism represents the biological populations affected by disaster events (e.g., coral reef community and mangrove ecosystem), while MarPollutant denotes the ecological pollutants generated or released as a consequence of disasters (e.g., red tide toxins and oil spill contaminants). These two types enable a systematic assessment of the ecological footprint of marine meteorological disasters.
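The four dimensions described above can be written down as a small lookup structure. The following is a minimal sketch rather than the authors' schema file: the entity type names come from the text, while the `Dimension` enum, the `ENTITY_TYPES` dictionary, and the `chain_successor` helper are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Dimension(Enum):
    """The four functional dimensions of the disaster chain (names are illustrative)."""
    SPATIOTEMPORAL = "temporal and spatial positioning"
    DISASTER_CHAIN = "disaster occurrence and development chain"
    RESPONSE = "disaster response and monitoring system"
    ECOLOGY = "ecological and environmental impact"


# Entity type names follow the text; the grouping dictionary is an assumption.
ENTITY_TYPES = {
    Dimension.SPATIOTEMPORAL: ["Time", "SeaArea"],
    Dimension.DISASTER_CHAIN: ["OceanPhen", "MarMetCond", "PrimaryMarDis", "SecondaryMarDis"],
    Dimension.RESPONSE: ["MarMetAgency", "DisasterResp", "MarDisEqpt"],
    Dimension.ECOLOGY: ["MarOrganism", "MarPollutant"],
}

# Flat list of all entity types, in dimension order.
ALL_TYPES = [t for types in ENTITY_TYPES.values() for t in types]

# Directed causal chain: OceanPhen -> MarMetCond -> PrimaryMarDis -> SecondaryMarDis.
CAUSAL_CHAIN = ENTITY_TYPES[Dimension.DISASTER_CHAIN]


def chain_successor(entity_type):
    """Return the next stage of the causal chain, or None if there is none."""
    if entity_type not in CAUSAL_CHAIN:
        return None
    i = CAUSAL_CHAIN.index(entity_type)
    return CAUSAL_CHAIN[i + 1] if i + 1 < len(CAUSAL_CHAIN) else None
```

Keeping the causal chain as an ordered list makes the cause-to-consequence direction explicit, which may be convenient when nested pairs such as a SecondaryMarDis mention containing a PrimaryMarDis mention are later inspected.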

These 11 entity types are interrelated yet mutually exclusive in scope, collectively forming a comprehensive taxonomy that covers the full lifecycle of marine meteorological disaster events without redundancy. The following example illustrates how these entity types manifest as complex nested structures in authentic domain texts, demonstrating the necessity of NNER for this domain.

Taking the sentence “In July 2024, the National Marine Environmental Forecasting Center detected via Typhoon Doppler Radar that Putian City, Fujian Province could suffer typhoon-triggered coastal waterlogging in the coastal areas of the South China Sea under Level-12 winds and a severe typhoon” as an example, the entity configuration reflects the disaster-chain logic defined above while exhibiting strict nested structures. The meteorological condition entity “Level-12 winds” (MarMetCond) and the primary disaster entity “severe typhoon” (PrimaryMarDis) jointly describe the triggering context of the event. A clear nested case appears in the secondary disaster mention “typhoon-triggered coastal waterlogging” (SecondaryMarDis), within which the source disaster term “typhoon” (PrimaryMarDis) is explicitly embedded. Another strict nested relation occurs in the spatial expression “coastal areas of the South China Sea” (SeaArea), which contains the inner geographic entity “South China Sea” (SeaArea). The temporal entity “July 2024” (Time) anchors the event chronologically, while “National Marine Environmental Forecasting Center” (MarMetAgency), “Typhoon Doppler Radar” (MarDisEqpt), and “Putian City, Fujian Province” (SeaArea) provide the observational and local impact context. As illustrated in Figure 1, these disaster-chain entities and their nested inclusion relations together constitute a structured information network centered on the occurrence, propagation, and impact of marine meteorological disasters.

Such hierarchical nesting is pervasive in marine meteorological disaster texts and reflects the inherent multi-scale, multi-causal nature of disaster events. Ignoring the nested structure and extracting only flat entities inevitably leads to the loss of critical causal and spatial relationships. Traditional flat NER methods, which assume non-overlapping and non-inclusive entity boundaries, fundamentally cannot handle such hierarchical structures. In contrast, NNER can simultaneously locate entity boundaries and types while parsing the nested associations between entities, ensuring that the full-chain information of disaster events—from cause to consequence—is comprehensively extracted to support downstream applications including disaster knowledge graph construction, intelligent early warning, and risk assessment.

3.2. Formal Task Definition

Building upon the entity type system defined in Section 3.1, we now formalize the NNER task in the marine meteorological disaster domain. The core objective is to fully extract all instances of the 11 predefined domain entity types from input text, which involves three sub-tasks: precisely locating the start and end boundaries of each entity instance in the text sequence, identifying its corresponding entity type, and parsing the hierarchical nesting relationships between entity instances.

Given an input sentence $X = [x_{1}, x_{2}, \dots, x_{N}]$, where $x_{i}$ denotes the i-th token in the sentence and $N$ is the sequence length, the goal of NNER is to identify all entity mentions in $X$ and assign each of them a label from the predefined entity type set $T = \{t_{1}, t_{2}, \dots, t_{K}\}$, where $K = 11$ in this study. Each identified entity is represented as a triple $(s_{j}, e_{j}, t_{j})$, where $s_{j}$ and $e_{j}$ denote the start and end positions of the entity span, respectively, with $1 \leq s_{j} \leq e_{j} \leq N$, and $t_{j} \in T$ denotes its entity type.

The defining characteristic of NNER, as opposed to flat NER, lies in its capacity to handle entities whose spans overlap through inclusion. Given two entity triples $(s_{a}, e_{a}, t_{a})$ and $(s_{b}, e_{b}, t_{b})$, a nesting relationship is established when $s_{a} \leq s_{b}$ and $e_{b} \leq e_{a}$, indicating that the entity of type $t_{a}$ fully contains the entity of type $t_{b}$. In the marine meteorological disaster domain, such relationships frequently encode the causal and compositional logic of the disaster chain; for instance, a SecondaryMarDis entity containing a PrimaryMarDis entity reflects the physical derivation of a cascading disaster from its source. This structural complexity can be further categorized based on boundary alignment: if $s_{a} = s_{b}$ and $e_{b} \neq e_{a}$, it constitutes a head-nested relationship where entities share the same starting token; if $s_{a} \neq s_{b}$ and $e_{b} = e_{a}$, it forms a tail-nested relationship with a shared ending token. Additionally, the condition $s_{a} < s_{b} < e_{a} < e_{b}$ indicates a crossing relationship, where two entities partially overlap without strict containment. These three sub-types of structural complexity pose distinct challenges for entity boundary detection, requiring the model to discriminate fine-grained positional differences within overlapping spans.
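The boundary conditions above translate directly into a few comparisons. The sketch below is illustrative only: the `Entity` class and `relation` function are hypothetical names, positions are 1-based and inclusive as in the task definition, entity `a` is treated as the candidate outer entity, and reporting identical spans separately is a choice of this sketch rather than something the text specifies. The token positions in the usage lines assume a simple word-level tokenization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # s_j, 1-based inclusive start position
    end: int    # e_j, 1-based inclusive end position
    etype: str  # t_j, one of the predefined entity types


def relation(a: Entity, b: Entity) -> str:
    """Classify how entity b is positioned relative to the candidate outer entity a."""
    if a.start == b.start and a.end == b.end:
        return "identical span"  # handled separately (an assumption of this sketch)
    if a.start <= b.start and b.end <= a.end:  # a fully contains b
        if a.start == b.start:
            return "head-nested"  # shared starting token
        if a.end == b.end:
            return "tail-nested"  # shared ending token
        return "nested"           # b lies strictly inside a
    if a.start < b.start < a.end < b.end:
        return "crossing"         # partial overlap without containment
    return "other"                # disjoint spans, or b is the outer entity


# "typhoon - triggered coastal waterlogging" (5 tokens) with inner "typhoon"
outer_disaster = Entity(1, 5, "SecondaryMarDis")
inner_disaster = Entity(1, 1, "PrimaryMarDis")

# "coastal areas of the South China Sea" (7 tokens) with inner "South China Sea"
outer_area = Entity(1, 7, "SeaArea")
inner_area = Entity(5, 7, "SeaArea")

print(relation(outer_disaster, inner_disaster))  # head-nested
print(relation(outer_area, inner_area))          # tail-nested
```

Because the function only compares boundaries, the same check applies regardless of entity type, so type-specific patterns such as SecondaryMarDis containing PrimaryMarDis can be counted afterwards from the returned labels.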

The final output of the task is a structured set of all entity triples extracted from the input text, which not only ensures the accuracy of each entity’s boundary and type but also explicitly encodes the nesting relationships through the positional associations among triples. This structured output serves as the foundation for constructing domain knowledge graphs and supporting intelligent disaster management applications.

4. Methods

Addressing the NNER task in marine meteorological disaster texts, as formalized in Section 3.2, requires a model architecture that can simultaneously tackle three core technical challenges. First, marine meteorological disaster texts contain entities of highly variable length—from short temporal expressions (e.g., “July 2024”) to longer disaster-related expressions with nested structures (e.g., “typhoon-triggered coastal waterlogging” and the spatial expression “coastal areas of the South China Sea”)—demanding that the encoder captures precise relative positional relationships between tokens without distorting their rich domain-specific semantics. Second, the pervasive nested structures in this domain mean that a single token may simultaneously serve as the beginning of one entity and the interior of an enclosing entity, requiring the model to generate role-differentiated feature representations for each token. Third, the transition from token-level features to entity-level span representations must preserve boundary signals and internal semantic coherence to support accurate classification across the 11 entity types defined in Table 1.

To address these challenges, PRSpan adopts a four-stage pipeline, as illustrated in Figure 2. The Encoder layer first generates context-aware token embeddings using a BERT pre-trained language model, then integrates RoPE into the self-attention mechanism to encode relative positional information without compromising semantic features (Section 4.1). The CLN-driven role feature generator subsequently applies Conditional Layer Normalization to produce three parallel feature representations—Head, Mid, and Tail—that capture the distinct boundary and interior roles of tokens within entity spans (Section 4.2). The Span encoder with Positional Role Pooling then aggregates these role-differentiated token features into entity-level span encodings through a structured pooling strategy (Section 4.3). Finally, a softmax global decoder maps each span encoding to a probability distribution over the 11 entity categories plus a non-entity class, yielding the complete set of nested entity triples as output.

4.1. Initial Encoding Integrating RoPE Attention

Given an input sentence $X = [x_{1}, x_{2}, \dots, x_{N}]$, where $x_{i}$ ($i = 1, 2, \dots, N$) represents the individual tokens, this sequence is fed into the BERT encoder to obtain context-dependent hidden states:

H_{bert} = BERT (X),

(1)

where $H_{bert} \in \mathbb{R}^{N \times D}$, N is the sequence length, and D is the hidden dimension. While BERT captures rich contextual semantics, its standard approach of fusing position vectors with word embeddings through addition may conflate semantic and positional information. This conflation is particularly problematic for nested entity recognition in marine meteorological disaster texts, where precise positional discrimination is essential for resolving overlapping entity boundaries. To address this, we introduce the Rotary Position Embedding mechanism [32], which encodes positional information into the attention computation via complex-valued rotations, preserving the independence of semantic features. The formalization of the RoPE mechanism is illustrated in Figure 3.

The hidden states $H_{bert}$ are projected into the Query and Key matrices via two independent linear transformations:

Q = H_{bert} \cdot W_{q},

(2)

K = H_{bert} \cdot W_{k},

(3)

where $W_{q}$ and $W_{k}$ are learnable weight matrices. Both $Q$ and $K$ lie in $\mathbb{R}^{N \times D}$, and the original input $H_{bert}$ serves as the Value matrix $V$.

The last dimension D of the $Q$ and $K$ matrices is decomposed into $D / 2$ complex pairs, where the modulus encodes semantic intensity and the phase encodes positional information. The k-th complex pair is formulated as

q (k) = Q [2 k] + i \cdot Q [2 k + 1],

(4)

k (k) = K [2 k] + i \cdot K [2 k + 1] .

(5)

For position m and dimension pair $k \in [0, D / 2 - 1]$, a rotation angle $θ_{m, k}$ is defined using a dimension-dependent frequency $ω_{k} = 10000^{- 2 k / D}$ that decreases as the dimension index increases. This ensures that rotations at different frequencies encode relative positional information across varying scales. Given the Query vector $q_{m} (k) = Q {[2 k]}_{m} + i \cdot Q {[2 k + 1]}_{m}$ at position m and the Key vector $k_{n} (k) = K {[2 k]}_{n} + i \cdot K {[2 k + 1]}_{n}$ at position n, RoPE performs a position-dependent rotation on their $D / 2$ complex components by multiplying with $e^{i θ_{m, k}}$ and $e^{i θ_{n, k}}$, respectively. Leveraging Euler’s formula $e^{i θ} = \cos θ + i \sin θ$, this geometric transformation modifies the phase angle while preserving the modulus, thereby injecting positional awareness without altering the original semantic content. The rotated Query vector $q_{{rot}_{m}} (k)$ is calculated by

q_{{rot}_{m}} (k) = q_{m} (k) \cdot e^{i θ_{m, k}} = (Q {[2 k]}_{m} + i \cdot Q {[2 k + 1]}_{m}) \cdot (cos (θ_{m, k}) + i sin (θ_{m, k})) .

(6)

Expanding yields the real and imaginary parts

Re [q_{{rot}_{m}} (k)] = Q {[2 k]}_{m} cos (θ_{m, k}) - Q {[2 k + 1]}_{m} sin (θ_{m, k}),

(7)

Im [q_{{rot}_{m}} (k)] = Q {[2 k]}_{m} sin (θ_{m, k}) + Q {[2 k + 1]}_{m} cos (θ_{m, k}) .

(8)
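As a quick numerical check of Equations (6) to (8), the sketch below rotates one Query pair both with the expanded real-valued formulas and with a complex multiplication, then compares the results. The helper name `rotate_pair` and the sample values are illustrative assumptions, not part of the model.

```python
import cmath
import math


def rotate_pair(x_even, x_odd, theta):
    """Apply the real-valued form of the rotation (Equations (7) and (8))."""
    re = x_even * math.cos(theta) - x_odd * math.sin(theta)
    im = x_even * math.sin(theta) + x_odd * math.cos(theta)
    return re, im


# Hypothetical values for one (Q[2k], Q[2k+1]) pair and its rotation angle.
q_even, q_odd, theta = 0.6, -0.8, 1.3

re, im = rotate_pair(q_even, q_odd, theta)
# Complex form of the same rotation (Equation (6)).
z = complex(q_even, q_odd) * cmath.exp(1j * theta)
modulus_before = math.hypot(q_even, q_odd)
modulus_after = math.hypot(re, im)
```

The two forms agree up to floating-point error, and the modulus is unchanged by the rotation, which is the property the text relies on when it says the semantic content is preserved.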

The same rotation is applied to $k_{n} (k)$ to obtain $k_{{rot}_{n}} (k) = k_{n} (k) \cdot e^{i θ_{n, k}}$. The attention score $α_{m, n}$ between the Query at position m and the Key at position n is computed as the real part of the complex conjugate dot product between the rotated vectors $q_{{rot}_{m}} (k)$ and $k_{{rot}_{n}} (k)$:

α_{m, n} = \sum_{k = 0}^{D / 2 - 1} Re [q_{{rot}_{m}} (k) \cdot \bar{k_{{rot}_{n}} (k)}],

(9)

where $\bar{k_{{rot}_{n}} (k)}$ denotes the complex conjugate of $k_{{rot}_{n}} (k)$. Substituting the rotated terms into the above equation and expanding gives

\begin{aligned} α_{m, n} & = \sum_{k = 0}^{D / 2 - 1} Re [(q_{m} (k) \cdot e^{i θ_{m, k}}) \cdot (\bar{k_{n} (k)} \cdot e^{- i θ_{n, k}})] \\ & = \sum_{k = 0}^{D / 2 - 1} Re [q_{m} (k) \cdot \bar{k_{n} (k)} \cdot e^{i (θ_{m, k} - θ_{n, k})}] . \end{aligned}

(10)

Since $θ_{m, k} = m \cdot ω_{k}$, the difference in angles reduces to

θ_{m, k} - θ_{n, k} = (m \cdot ω_{k}) - (n \cdot ω_{k}) = (m - n) \cdot ω_{k} .

(11)

This demonstrates the key property of RoPE: the attention score $α_{m, n}$ depends only on the token semantics $q (k)$ and $k (k)$ together with their relative position difference $Δ = m - n$, successfully converting absolute positional encoding into relative positional encoding. This property is particularly beneficial for nested entity recognition, as it allows the model to capture the relative positional relationships between an inner entity and its enclosing outer entity regardless of their absolute positions in the sentence.
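The relative-position property can be checked numerically with a minimal sketch of the score in Equation (9). The function name `rope_score`, the toy vectors, and the use of `j` for the dimension-pair index (to avoid clashing with the Key symbol) are assumptions made for this illustration.

```python
import cmath


def rope_score(q, k, m, n, base=10000.0):
    """Unscaled RoPE attention score between a Query at position m and a Key at n.

    q and k are real-valued lists of even length D, read as D/2 complex pairs.
    """
    D = len(q)
    score = 0.0
    for j in range(D // 2):
        omega = base ** (-2.0 * j / D)  # dimension-dependent frequency
        q_rot = complex(q[2 * j], q[2 * j + 1]) * cmath.exp(1j * m * omega)
        k_rot = complex(k[2 * j], k[2 * j + 1]) * cmath.exp(1j * n * omega)
        score += (q_rot * k_rot.conjugate()).real
    return score


# Toy Query and Key vectors with D = 4 (illustrative values only).
q = [0.5, -1.0, 0.3, 0.8]
k = [1.2, 0.4, -0.7, 0.1]

score_a = rope_score(q, k, 5, 2)    # positions (5, 2), offset 3
score_b = rope_score(q, k, 13, 10)  # positions (13, 10), same offset 3
score_same = rope_score(q, k, 3, 3) # offset 0 reduces to the plain dot product
```

Shifting both positions by the same amount leaves the score unchanged, and a zero offset recovers the ordinary dot product of the unrotated vectors.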

The attention score is then scaled and normalized to obtain the attention weights. Specifically, the scalar score between the Query at position m and the Key at position n is scaled as

{\tilde{α}}_{m, n} = \frac{α_{m, n}}{\sqrt{D}} .

(12)

Let $S = [{\tilde{α}}_{m, n}] \in \mathbb{R}^{N \times N}$ denote the scaled score matrix. The attention weight matrix $A \in \mathbb{R}^{N \times N}$ is obtained by applying the softmax operation along the Key dimension:

A_{m, n} = \frac{exp ({\tilde{α}}_{m, n})}{\sum_{j = 1}^{N} exp ({\tilde{α}}_{m, j})} .

(13)

This normalization ensures that $\sum_{n = 1}^{N} A_{m, n} = 1$ for each Query position m. The RoPE-enhanced contextual representation is then computed by weighting the Value matrix $V$:

H_{rope} = A \cdot V,

(14)

where $H_{rope} \in \mathbb{R}^{N \times D}$. Finally, the RoPE-enhanced representation is concatenated with the original BERT output along the feature dimension to form the fused feature matrix:

H_{concat} = Concat (H_{bert}, H_{rope}) .

(15)
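Equations (12) to (15) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the toy inputs, and the max-subtraction inside `softmax` (a standard numerical-stability step that does not change the result) are assumptions.

```python
import math


def softmax(row):
    """Row-wise softmax; subtracting the max is only for numerical stability."""
    mx = max(row)
    exps = [math.exp(v - mx) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend_and_concat(scores, H):
    """scores: N x N unscaled attention scores; H: N x D hidden states used as V."""
    N, D = len(H), len(H[0])
    # Equations (12) and (13): scale by sqrt(D), then normalize over Keys.
    A = [softmax([s / math.sqrt(D) for s in row]) for row in scores]
    # Equation (14): weight the Value matrix.
    H_rope = [[sum(A[m][n] * H[n][d] for n in range(N)) for d in range(D)]
              for m in range(N)]
    # Equation (15): concatenate along the feature dimension (N x 2D).
    H_concat = [H[m] + H_rope[m] for m in range(N)]
    return A, H_concat


# Toy input: three tokens, D = 2, and uniform scores.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
A, H_concat = attend_and_concat(scores, H)
```

With uniform scores every token attends equally to all tokens, so the second half of each concatenated row is simply the mean of the Value rows.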

4.2. Span Encoding with Positional Role Dependence

While the RoPE mechanism in Section 4.1 establishes position awareness at the attention level, it does not explicitly differentiate token roles within entity structures. In nested entity recognition, this differentiation is critical: a token at the start of an entity carries boundary-marking semantics distinct from tokens in the entity’s interior or at its end. Moreover, in the marine meteorological disaster domain, a single token may simultaneously occupy different roles across nested entities—for instance, the token marking the beginning of “storm surge” also falls within the interior of the larger span “coastal waterlogging triggered by storm surge”. Standard Layer Normalization applies uniform normalization parameters across all tokens, making it inadequate for capturing these role-dependent distinctions. To address this, we adopt CLN [33], which dynamically generates role-specific normalization parameters conditioned on positional role indicators—Head (start), Mid (intermediate), and Tail (end)—to produce differentiated feature representations for each role. The CLN module is illustrated in Figure 4.

For each positional role, CLN generates dedicated scaling and offset parameters through learnable linear transformations:

γ_{role} = W_{γ}^{role} \cdot H_{concat} + b_{γ}^{role},

(16)

β_{role} = W_{β}^{role} \cdot H_{concat} + b_{β}^{role},

(17)

where $role \in \{Head, Mid, Tail\}$, $W_{γ}^{role}$ and $W_{β}^{role}$ are learnable weight matrices, and $b_{γ}^{role}$ and $b_{β}^{role}$ are bias vectors. These conditional parameters are applied to perform an affine transformation on the normalized features:

H_{role} = γ_{role} ⊙ \frac{H_{concat} - μ}{\sqrt{σ^{2} + ϵ}} + β_{role},

(18)

where $μ$ and $σ^{2}$ are the mean and variance of $H_{concat}$, and $ϵ$ is a numerical stability constant.

Through this mechanism, the model generates three parallel role-specific feature sequences ($H_{Head}$, $H_{Mid}$, and $H_{Tail}$) from a shared input representation. Unlike standard Layer Normalization, which produces a single undifferentiated representation, CLN’s dynamic parameter generation enables each token to carry role-adapted features, directly supporting the subsequent span encoding stage in distinguishing entity boundaries from entity interiors.
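A minimal sketch of Equations (16) to (18) for a single token is given below. Several details are assumptions: the mean and variance are taken over the feature dimension of each token, the weights are applied as a matrix-vector product per token, and the role parameters are hand-set toy values rather than learned ones. The names `matvec`, `cln`, and `roles` are hypothetical.

```python
import math


def matvec(W, x, b):
    """Return W x + b for a matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def cln(h, W_gamma, b_gamma, W_beta, b_beta, eps=1e-12):
    """Conditional Layer Normalization of one token vector h for one role."""
    mu = sum(h) / len(h)
    var = sum((v - mu) ** 2 for v in h) / len(h)
    normed = [(v - mu) / math.sqrt(var + eps) for v in h]
    gamma = matvec(W_gamma, h, b_gamma)  # Equation (16)
    beta = matvec(W_beta, h, b_beta)     # Equation (17)
    return [g * x + b for g, x, b in zip(gamma, normed, beta)]  # Equation (18)


D = 4
zeros = [[0.0] * D for _ in range(D)]
# Toy role parameters: (W_gamma, b_gamma, W_beta, b_beta) per role.
roles = {
    "Head": (zeros, [1.0] * D, zeros, [0.0] * D),
    "Mid": (zeros, [0.5] * D, zeros, [0.0] * D),
    "Tail": (zeros, [1.0] * D, zeros, [0.5] * D),
}
h = [1.0, 2.0, 3.0, 4.0]
features = {role: cln(h, *params) for role, params in roles.items()}
```

The same input vector yields three different outputs, one per role, which is the behavior the span encoder in Section 4.3 depends on.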

4.3. Span Generation and Recognition Based on Positional Role Pooling

The Head, Mid, and Tail feature representations generated by CLN are defined as follows: $H_{Head} = [h_{1}^{Head}, h_{2}^{Head}, \dots, h_{N}^{Head}]$, $H_{Mid} = [h_{1}^{Mid}, h_{2}^{Mid}, \dots, h_{N}^{Mid}]$, and $H_{Tail} = [h_{1}^{Tail}, h_{2}^{Tail}, \dots, h_{N}^{Tail}]$. These representations provide token-level features differentiated by positional role. However, since entities exist as continuous spans in text, these discrete token features must be aggregated into entity-level representations. We design a Positional Role Pooling strategy that leverages the role differentiation established by CLN. For a candidate span $(m, n)$, where m and n denote the start and end positions, respectively, the span encoding is computed as

s_{m, n} = ϕ (U_{1} h_{m}^{Head} + U_{2} pooling (h_{m + 1}^{Mid}, h_{m + 2}^{Mid}, \dots, h_{n - 1}^{Mid}) + U_{3} h_{n}^{Tail}),

(19)

where $h_{m}^{Head}$ is the Head feature at the unique start position and $h_{n}^{Tail}$ is the Tail feature at the unique end position of the entity. The Mid features at the entity’s intermediate positions are aggregated with a pooling strategy (e.g., average pooling or an attention mechanism) to capture internal semantic coherence. If no intermediate token exists, the Mid representation is set to a zero vector or a learnable placeholder vector. The learnable weight matrices $U_{1}, U_{2}, U_{3}$ allow the model to adaptively weight boundary signals versus interior semantics, and the activation function $ϕ (\cdot)$ introduces nonlinearity. Compared with traditional span representations that apply uniform pooling across all positions, this strategy explicitly assigns each position within a span to its designated functional role, thereby strengthening boundary discrimination and reducing type confusion.
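The following sketch illustrates Equation (19) under simplifying assumptions: the matrices U_1, U_2, and U_3 are replaced by the identity, the Mid features are average-pooled, a zero vector is used when the span has no interior, and tanh stands in for the activation. Positions are 0-based here, and the function name and toy features are hypothetical.

```python
import math


def span_encoding(H_head, H_mid, H_tail, m, n):
    """Positional Role Pooling for the inclusive span (m, n), 0-based."""
    D = len(H_head[0])
    mids = H_mid[m + 1:n]  # tokens strictly between the start and the end
    if mids:
        mid = [sum(col) / len(mids) for col in zip(*mids)]  # average pooling
    else:
        mid = [0.0] * D  # no interior token: zero vector
    return [math.tanh(a + b + c) for a, b, c in zip(H_head[m], mid, H_tail[n])]


# Toy role features for four tokens with D = 2.
H_head = [[0.1, 0.0], [0.3, 0.3], [0.0, 0.0], [0.0, 0.0]]
H_mid = [[0.0, 0.0], [0.2, 0.4], [0.4, 0.0], [0.0, 0.0]]
H_tail = [[0.0, 0.0], [0.0, 0.0], [0.2, 0.1], [0.0, 0.5]]

s_long = span_encoding(H_head, H_mid, H_tail, 0, 3)   # two interior tokens
s_short = span_encoding(H_head, H_mid, H_tail, 1, 2)  # no interior token
```

Each position contributes only through the feature sequence that matches its role in the span, so the same token can act as a Head in one candidate span and as a Mid in an enclosing one.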

The span encodings are then passed through a fully connected layer with ReLU activation, followed by softmax to obtain the probability distribution over entity labels:

{\hat{y}}_{m, n} = softmax (ReLU (s_{m, n})),

(20)

l_{m, n} = \arg\max ({\hat{y}}_{m, n}) .

(21)

The model is trained by minimizing the Negative Log-Likelihood Loss over all candidate spans:

Loss = - \sum_{1 \leq m \leq n \leq N} y_{m, n} \log ({\hat{y}}_{m, n}),

(22)

where $y_{m, n}$ is the ground-truth label and ${\hat{y}}_{m, n}$ is the predicted probability. This span-level loss jointly optimizes boundary detection and type classification across all candidate spans, enabling the model to learn the complete set of nested entity structures in a single forward pass.
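To make the span-level objective concrete, the sketch below enumerates candidate spans and evaluates the negative log-likelihood for a toy prediction. It assumes 1-based token positions, a one-hot ground-truth label per span, and a non-entity label written as "O"; the function names and probabilities are illustrative only.

```python
import math


def enumerate_spans(N):
    """All candidate spans (m, n) with 1 <= m <= n <= N."""
    return [(m, n) for m in range(1, N + 1) for n in range(m, N + 1)]


def nll_loss(probs, gold, none_label="O"):
    """Sum of -log p(gold label) over every candidate span in probs.

    probs: {(m, n): {label: probability}}; gold: {(m, n): label}.
    Spans missing from gold are treated as non-entities.
    """
    return -sum(math.log(dist[gold.get(span, none_label)])
                for span, dist in probs.items())


spans = enumerate_spans(3)
# Toy predicted distributions for two of the candidate spans.
probs = {
    (1, 1): {"O": 0.5, "Time": 0.5},
    (1, 2): {"O": 0.25, "Time": 0.75},
}
gold = {(1, 2): "Time"}
loss = nll_loss(probs, gold)
```

A sentence of N tokens yields N(N + 1)/2 candidate spans, and non-entity spans contribute to the loss as well, which is how the model learns to reject invalid boundaries.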

5. Experiments

5.1. MMD-NER Dataset Construction and Validation

5.1.1. Four-Step LLM-Assisted Dataset Construction

The advancement of NER in the marine meteorological disaster domain is currently hindered by a scarcity of annotated datasets. Traditional data augmentation methods, which typically rely on domain-specific unlabeled corpora or extensive manual intervention, are often cost-prohibitive and lack flexibility. Furthermore, directly prompting Large Language Models (LLMs) to generate NER data frequently results in imprecise entity boundaries and category misjudgments, failing to meet the strict precision requirements for model training. This problem is compounded in our task by the prevalence of nested named entities: disaster descriptions often exhibit a hierarchical structure. For instance, a quantitative indicator like “Level-12 Wind” (MarMetCond) is semantically embedded within a disaster event like “Super Typhoon” (PrimaryMarDis). Traditional flat NER generation methods fail to capture this overlapping topology.

To address these challenges, we implemented a specialized pipeline that enforces hierarchical integrity throughout the data synthesis process. Rather than direct one-step generation, we adopted a progressive, step-by-step strategy. We initiated the process by harvesting authentic academic literature and disaster situation reports from Google Scholar via a web crawler. These texts served as demonstration samples, providing the foundational logic for disaster chains. Based on the 11 entity categories defined in Section 3, we expanded these samples using a four-step workflow designed to systematically construct and verify nested structures. The detailed procedure is outlined below and illustrated in Figure 5.

Step 1: Domain Attribute Generation via Self-Reflection. The model first analyzes the seeded data to identify key structural attributes of disaster reporting, such as spatiotemporal scope, disaster intensity, and impact chain. By mapping these to our entity categories (e.g., SeaArea and MarMetCond), the model establishes a multi-dimensional attribute space suitable for complex scenarios.

Step 2: Hierarchical Entity Pool Creation. Unlike standard NER generation, our approach requires the model to generate correlated entity pairs with potential nesting relationships. For example, under the scenario of “Typhoon Observation”, the model generates the core disaster entity “Severe Typhoon In-fa” (PrimaryMarDis) alongside its constituent meteorological attributes “Gusts of 35 m/s” (MarMetCond). This “Entity-First” strategy ensures that both the outer container (the disaster) and the inner content (the attribute) are established in the pool before sentence synthesis, facilitating coherent nesting.

Step 3: Contextualized Nested Sentence Synthesis. Acting as a domain expert, the LLM synthesizes sentences by embedding the inner entities into the descriptions of outer entities. The prompt explicitly instructs the model to follow a “Macro-to-Micro” encapsulation logic. For instance, the model constructs phrases such as the following: “The coastal areas of Fujian [SeaArea (Outer)], specifically Putian City [SeaArea (Inner)], were hit.” Similarly, it generates: “suffered from storm surge triggered by a typhoon [PrimaryMarDis (Outer)] containing Level-12 winds [MarMetCond (Inner)]”. This deliberately creates the overlapping boundaries required for NNER training.

Step 4: Self-Correction with Nested Boundary Verification. To resolve the boundary ambiguity often found in generated nested labels, we implemented a discriminative classification check. The model reviews each candidate span, especially overlapping ones, and classifies them into four categories to verify hierarchical integrity:

(A): Valid Entity: The span is correct with precise boundaries (e.g., correctly identifying the inner “Level-12 wind” as distinct from the outer “Typhoon”).
(B): Imprecise Boundary: The span contains an entity but fails to separate the nested structure (e.g., merging the inner attribute with the outer event into a single long span).
(C): Misclassification: The span is a valid entity but assigned an incorrect category.
(D): Non-Entity: The span is irrelevant text.

Only spans classified as (A), which successfully pass the nested boundary check, are retained.
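The retention rule in Step 4 can be sketched as a simple filter. The record layout, the function name `keep_valid`, and the example verdicts are hypothetical; in the pipeline the verdict is produced by the LLM's self-correction check, which is not reproduced here.

```python
# The four verdict categories used in the nested boundary verification step.
VERDICTS = {
    "A": "valid entity",
    "B": "imprecise boundary",
    "C": "misclassification",
    "D": "non-entity",
}


def keep_valid(candidates):
    """Retain only candidate spans whose verdict is (A)."""
    return [c for c in candidates if c["verdict"] == "A"]


# Illustrative candidates with made-up verdicts.
candidates = [
    {"text": "Level-12 winds", "label": "MarMetCond", "verdict": "A"},
    {"text": "typhoon containing Level-12 winds", "label": "PrimaryMarDis", "verdict": "B"},
    {"text": "Putian City", "label": "MarMetCond", "verdict": "C"},
    {"text": "were hit", "label": "SeaArea", "verdict": "D"},
]
kept = keep_valid(candidates)
```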

Through this nesting-aware generation pipeline, we established the MMD-NER dataset. The final corpus comprises 1899 sentences, encompassing 17,017 entities and 2978 nested entity pairs that capture the semantics of “disaster chains.” The dataset was partitioned into training, validation, and test sets with a 7:2:1 ratio. Table 2 provides a detailed statistical breakdown, and Table 3 illustrates the quantity distribution across the 11 entity categories.

A notable characteristic of the MMD-NER dataset is the severe class imbalance: PrimaryMarDis (4161) and SecondaryMarDis (3622) dominate the corpus, while Time (89) and MarOrganism (171) are extremely scarce. This long-tail distribution reflects the natural frequency of these entities in real disaster reports, where temporal markers and biological impact references appear far less frequently than disaster event descriptions. This imbalance poses a specific challenge for model evaluation—high micro-averaged performance may mask poor recognition of low-frequency categories—which motivates the use of both micro-averaging and macro-averaging metrics in Section 5.3.

5.1.2. Human Validation of Generated Annotations

To evaluate the quality of the LLM-assisted dataset construction pipeline, we conducted a human validation study on the GPT-generated MMD-NER subset. It should be noted that the proposed pipeline generates nested NER samples from scratch rather than annotating pre-existing sentences. Therefore, we validate the generated data by re-annotating the generated sentences and comparing the pipeline-generated annotations with expert human-gold annotations.

Specifically, we used stratified sampling to select 200 sentences from the GPT-generated MMD-NER subset, covering different entity categories, entity lengths, and nested structures. Two domain-trained annotators independently re-annotated these sentences without access to the pipeline-generated labels. Disagreements were resolved through adjudication to obtain the final human-gold validation subset.

As shown in Table 4, the validation subset contains 1842 pipeline-generated entities and 326 pipeline-generated nested pairs, while the human-gold annotations contain 1796 entities and 309 nested pairs. The inter-annotator agreement is high, with 94.86% boundary agreement F1, 92.73% type-aware agreement F1, and 90.41% nested-pair agreement F1, indicating that the validation process is reliable. Compared with the human-gold annotations, the pipeline-generated labels achieve 93.02% boundary-only F1, 90.01% type-aware strict F1, and 86.54% nested-pair F1. These results show that the GPT-based four-step pipeline produces annotations that are largely consistent with expert judgments, although nested-pair consistency remains more challenging than individual entity recognition.
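The difference between boundary-only and type-aware strict agreement can be illustrated with a short sketch. The exact matching rules used for Table 4 are not specified in this section, so the definitions below (exact span match, and exact span plus label match) are assumptions, as are the function names and the toy annotations.

```python
def f1(pred, gold):
    """F1 between two collections of hashable items."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def boundary_f1(pred, gold):
    """Compare (start, end) only, ignoring entity types."""
    return f1({(s, e) for s, e, _ in pred}, {(s, e) for s, e, _ in gold})


def strict_f1(pred, gold):
    """Compare (start, end, type) triples exactly."""
    return f1(pred, gold)


# Toy annotations: the third entity has matching boundaries but a different type.
pipeline = [(0, 2, "SeaArea"), (1, 2, "SeaArea"), (4, 6, "OceanPhen")]
human = [(0, 2, "SeaArea"), (1, 2, "SeaArea"), (4, 6, "MarMetCond")]

b_f1 = boundary_f1(pipeline, human)
s_f1 = strict_f1(pipeline, human)
```

A type confusion lowers the strict score but leaves the boundary-only score untouched, which is consistent with the type-aware F1 reported above being lower than the boundary-only F1. A nested-pair score could be computed the same way over (inner, outer) pairs.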

Table 5 further summarizes the main types of annotation discrepancies. Boundary over-extension and type confusion are the two most frequent errors, accounting for 24.03% and 22.48%, respectively. This suggests that the pipeline occasionally generates overly broad spans or confuses semantically close categories such as OceanPhen and MarMetCond. Boundary under-extension, missing inner or outer entities, and invalid nesting relations further indicate that complex nested structures remain difficult for automatic generation.

Representative cases are shown in Table 6. These examples demonstrate that the remaining errors mainly arise from boundary granularity, semantically similar entity categories, and incorrect inference of nesting relations. Overall, the human validation confirms the reliability of the GPT-assisted four-step dataset construction pipeline, while also revealing the need for accurate boundary modeling and nested-relation discrimination in the subsequent NNER model.

5.1.3. LLM-Backbone Analysis for Data Generation

To further examine the influence of different LLM backbones on the proposed data construction pipeline, we conducted a pilot comparison using GPT-4o mini, Qwen2.5-72B-Instruct, and DeepSeek-V3. Since the full MMD-NER dataset was constructed using GPT, this experiment is intended to evaluate whether the proposed four-step pipeline can be applied consistently across different LLM backbones and how generation quality varies with model capability.

For a fair comparison, each LLM was evaluated using 50 controlled scenario seeds. As shown in Table 7, these seeds do not contain predefined entity spans, entity labels, or nested entity pairs. Instead, each seed only specifies the disaster scenario, target domain, expected semantic dimensions, target nested complexity, and generation constraints. Under the same seed, each LLM still needs to complete the entire four-step pipeline, including domain attribute generation, hierarchical entity pool construction, contextualized nested sentence synthesis, and nested-boundary verification. Therefore, the controlled scenario seeds serve only as a unified starting point to make generation difficulty comparable; they do not replace the proposed four-step pipeline.

As shown in Table 8, GPT-4o mini achieves the best overall generation quality, with a valid sample rate of 90%, requirement satisfaction of 88%, nesting validity of 86%, boundary agreement F1 of 92.41%, type-aware agreement F1 of 89.76%, and nested-pair agreement F1 of 85.31%. DeepSeek-V3 obtains slightly lower but competitive results, while Qwen2.5-72B-Instruct performs relatively lower across all metrics. These results indicate that stronger instruction-following and structural reasoning abilities contribute to higher-quality nested NER sample generation.

Importantly, this comparison should not be interpreted as a “seed-to-sample” generation experiment. The controlled scenario seeds only define the scenario-level constraints, while the actual attributes, entity pools, nested sentences, entity spans, labels, and nested pairs are still generated by each LLM through the complete four-step pipeline. The observed differences among LLM backbones further confirm that LLM-generated NNER data require explicit validation in terms of valid sample rate, requirement satisfaction, nesting validity, boundary agreement, type-aware agreement, and nested-pair consistency before being used for model training.

5.2. Experimental Settings and Baselines

Experiments are conducted on a server equipped with an NVIDIA GeForce RTX 4090 GPU, running Ubuntu 22.04 with TensorFlow 2.10.0 and Python 3.9. Key hyperparameters are configured as follows: batch size of 16, maximum input sequence length of 256, dropout rate of 0.1, hidden layer dimension of 128, and AdaFactorEMA optimizer with an initial learning rate of $5 \times 10^{-4}$.
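For reference, the reported settings can be collected in a single configuration mapping. The values are those stated above; the dictionary name and key names are hypothetical, and the optimizer is recorded by name only.

```python
# Hyperparameters as reported in Section 5.2 (key names are illustrative).
CONFIG = {
    "batch_size": 16,
    "max_seq_length": 256,
    "dropout": 0.1,
    "hidden_dim": 128,
    "optimizer": "AdaFactorEMA",
    "learning_rate": 5e-4,
}
```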

To validate the effectiveness of the proposed method, we select eight baseline models covering four mainstream paradigms for nested named entity recognition:

Sequence Labeling Models: BERT-CRF [34], BiLSTM-CRF [35], and BERT-BiGRU-Att-CRF [2] treat NER as a sequence labeling task. BERT-BiGRU-Att-CRF extends standard CRF-based tagging by incorporating BiGRU and attention modules, but it still follows the token-level labeling paradigm. Therefore, these models are less suitable for nested NER, where a token may participate in multiple overlapping entity spans.
Hierarchical Structure Models: Pyramid [16] is designed specifically for nested entities, employing a bidirectional pyramid interaction structure that parses nested relationships through stacked hierarchical layers.
Span Classification Models: Global Pointer [15] and Biaffine [18] directly score all possible entity spans in the input sequence by modeling associations between start and end positions. Our proposed PRSpan belongs to this category, which represents one of the most effective paradigms for processing nested entities, allowing direct comparison to reveal the specific gains contributed by our RoPE-enhanced attention, CLN, and Positional Role Pooling components.
Graph Structure Models: BiFlaG [19] and PANNER [36] capture structural dependencies by treating entities as nodes and relationships as edges, parsing nested structures through message passing mechanisms within graph networks.

5.3. Overall and Category-Level Performance

5.3.1. Main Results Under Micro-Average and Macro-Average

To comprehensively evaluate the performance of the PRSpan model, we use Precision, Recall, and F1-score under both micro-averaging and macro-averaging. Micro-averaging treats all entity predictions as independent samples, reflecting overall recognition accuracy across the entire test set. Macro-averaging computes metrics per entity category before taking the arithmetic mean, thereby giving equal weight to each category regardless of its frequency—a critical perspective given the severe class imbalance documented in Table 3.
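The contrast between the two averaging schemes can be seen in a small sketch. The per-category counts are made-up numbers chosen to mimic a frequent and a rare category; they are not results from the paper. Macro-F1 is computed here as the mean of per-category F1 scores.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro_f1(counts):
    """counts: {category: (tp, fp, fn)} -> (micro_f1, macro_f1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]  # pool all predictions, then score
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)  # score, then average
    return micro, macro


# Hypothetical counts: one high-frequency and one low-frequency category.
counts = {
    "PrimaryMarDis": (90, 10, 10),
    "Time": (1, 1, 3),
}
micro, macro = micro_macro_f1(counts)
```

In this toy case the rare category barely moves the micro-averaged score but pulls the macro-averaged score down sharply, which is why both are reported for an imbalanced dataset.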

Table 9 and Table 10 reveal three tiers of performance that correspond to the models’ structural capacity for handling nested entities in the marine meteorological disaster domain.

The sequence labeling models occupy the lowest tier. BiLSTM-CRF achieves only 29.28% Micro-F1, mainly due to its low recall on nested entities. BERT-CRF improves with pre-trained contextual embeddings, but its single-label-per-token assumption still limits its ability to represent overlapping spans. BERT-BiGRU-Att-CRF further introduces BiGRU encoding and attention, increasing Micro-F1 to 82.17%; however, it remains constrained by linear CRF decoding and therefore still underperforms span-based and graph-based models in nested entity recognition.

The middle tier comprises Pyramid, BiFlaG, Global Pointer, Biaffine, and PANNER, all of which are architecturally capable of handling nested entities and achieve Micro-F1 scores between 86.12% and 90.97%. Among these, the span classification models (Global Pointer and Biaffine) outperform both the hierarchical model (Pyramid) and the graph-based models (BiFlaG and PANNER), confirming the effectiveness of directly modeling start-end position associations for this task. Notably, Global Pointer’s advantage over Biaffine (Micro-F1: 90.97% vs. 89.86%) can be attributed to its integration of relative position information, which partially addresses the positional awareness challenge identified in Section 4.

PRSpan achieves the highest performance across all metrics, with a Micro-F1 of 94.58% and a Macro-F1 of 93.47%. The 3.61 percentage-point improvement in Micro-F1 over the strongest baseline (Global Pointer) demonstrates that the three proposed components (RoPE-enhanced attention for relative position encoding, CLN for role-differentiated feature generation, and Positional Role Pooling for structured span aggregation) provide complementary gains beyond what relative position information alone (as in Global Pointer) can achieve. Equally important is the Macro-F1 gap: PRSpan’s Macro-F1 (93.47%) exceeds Global Pointer’s (89.54%) by 3.93 percentage points, a larger margin than the Micro-F1 gap (3.61 percentage points). This indicates that PRSpan’s improvements are not concentrated on high-frequency categories but extend to low-frequency entity types such as Time and MarOrganism, where the CLN mechanism’s role-specific feature generation provides particular benefits by sharpening boundary signals even when training samples are scarce.

5.3.2. Category-Level Analysis

To further investigate the models’ capabilities across specific entity types, Figure 6 presents the category-level performance comparison between PRSpan and three representative span-based and graph-based baselines (Global Pointer, Biaffine, and PANNER).

The category-level results show that PRSpan performs particularly well on entity types frequently involved in nested structures, such as PrimaryMarDis and SecondaryMarDis. These entities often contain overlapping boundaries between source disasters, meteorological conditions, and derived disaster-chain expressions. RoPE-enhanced relative position modeling and CLN-based role-specific representations help PRSpan distinguish span boundaries and internal semantics more effectively than Global Pointer and Biaffine.

For long-tail categories, the Time class remains relatively challenging. This may be partly attributed to the generated dataset distribution, as Time contains only 89 instances in MMD-NER and temporal expressions vary substantially in surface form. Baseline models without explicit role-aware span representation are therefore more likely to miss or inconsistently detect temporal entities. PRSpan alleviates this issue to some extent, although the scarcity of Time annotations remains a limitation of the current dataset.

The comparison also reflects the limitations of different baseline paradigms. Sequence-labeling models are constrained by single-label-per-token decoding; Global Pointer and Biaffine lack explicit role differentiation within spans; and graph-based models may be affected by sparse interactions for rare entities. These limitations explain why PRSpan achieves more balanced category-level performance.

5.4. Ablation Analysis

To further explain how the proposed components improve nested entity recognition, we conduct interpretability and ablation analyses from three perspectives. First, we visualize the attention distributions of the standard attention variant and the RoPE-enhanced attention variant to examine how RoPE affects positional focus in nested spans. Second, we conduct an extended ablation study on CLN-generated role-specific features and Positional Role Pooling. Third, we provide a practical case study to show how Head, Mid, and Tail features contribute differently to span recognition.

5.4.1. RoPE Attention Visualization

To analyze how RoPE-enhanced attention improves nested entity recognition, we visualize the attention distributions on a representative nested-entity sentence. The rows in Figure 7 denote selected query tokens, and the columns denote sentence tokens. The left panel shows the attention distribution of the w/o RoPE variant, where the rotary transformation applied to Query and Key vectors is removed and standard scaled dot-product attention is used. The right panel shows the attention distribution after introducing RoPE-enhanced attention.

As shown in Figure 7, without RoPE, the attention distribution is relatively diffuse and is sometimes assigned to tokens outside the target entity spans. For example, the query tokens “storm” and “typhoon” do not consistently focus on the complete outer SecondaryMarDis span or the corresponding inner PrimaryMarDis span. In contrast, with RoPE-enhanced attention, the model assigns stronger attention to boundary-relevant and span-internal tokens. The query token “storm” focuses more clearly on the outer entity “storm surge triggered by a typhoon”, while “typhoon”, as an inner PrimaryMarDis entity, forms a more concentrated attention region. Similarly, “winds” shows stronger attention to the MarMetCond entity “Level-12 winds”.

This visualization suggests that RoPE-enhanced attention helps the model capture relative positional dependencies between inner and outer spans. By encoding relative position information in the attention computation, RoPE makes the model more sensitive to entity boundaries and span-internal structures, which is particularly useful for recognizing nested entities with overlapping or shared boundary regions.

5.4.2. Extended Ablation on CLN and Positional Role Pooling

To evaluate the contribution of CLN-generated role-specific features and Positional Role Pooling, we conduct an extended ablation study as shown in Table 11. To avoid ambiguity in the role-specific ablation, all variants use the same span aggregation framework and differ only in the source of token representations or the pooling strategy.

In the w/o CLN variant, the start, middle, and end positions of a candidate span are all represented by the same base representation, without role-conditioned normalization. In the Mid-only CLN variant, all positions use Mid-conditioned features. In the Boundary-only CLN variant, the start and end positions use Head and Tail features, respectively, while the middle positions use the base representation. To independently evaluate Positional Role Pooling, we further add two variants, Full CLN + AvgPool and Full CLN + MaxPool. These two variants retain the CLN-generated Head, Mid, and Tail features but replace the proposed position-role-aware aggregation with conventional average pooling or max pooling over the span. The Full CLN setting corresponds to the complete design, where Head, Mid, and Tail features are explicitly assigned to the start, middle, and end positions of each candidate span.

As shown in Table 11, the w/o CLN variant obtains the lowest performance, indicating that a unified span representation without role differentiation is insufficient for nested NER. The Mid-only CLN variant slightly improves over w/o CLN, but the improvement is limited, suggesting that internal semantic information alone cannot accurately determine nested span boundaries. The Boundary-only CLN variant achieves a larger improvement, showing that Head and Tail features are particularly important for start–end boundary localization. However, this variant still underperforms the full model, indicating that Mid features provide complementary internal semantic coherence.

The comparison between Full CLN + AvgPool/MaxPool and Full CLN further confirms the independent contribution of Positional Role Pooling. Although the pooling variants retain the CLN-generated Head, Mid, and Tail features, they do not explicitly assign these features to the start, middle, and end positions. Their performance is therefore lower than that of the full model. This result shows that Positional Role Pooling is not merely a pooling operation; rather, it is an effective mechanism for preserving boundary roles and internal semantic information in nested span representation.

5.4.3. Practical Case Study of CLN Role Features

To further illustrate how different role-specific features affect actual predictions, we provide a case study in Table 12. The sentence fragment contains a nested structure involving the pollutant entity “red tide toxins” and the ocean phenomenon “red tide outbreak”. This example requires the model to distinguish the inner pollutant entity, the phenomenon entity, and the enclosing nested span.

As shown in Table 12, the w/o CLN variant only detects the more obvious phenomenon expression and misses the pollutant-related nested structure. The Head + Tail setting captures boundary cues more effectively, but it still lacks sufficient internal semantic modeling and may misclassify the span. In contrast, the Mid-only setting captures part of the internal semantics but weakens explicit boundary localization, leading to an imprecise span. The full CLN setting correctly identifies both the inner entities and the outer nested span. This case confirms that Head, Mid, and Tail features play complementary roles: Head and Tail features help locate span boundaries, while Mid features preserve the semantic coherence of the span interior.

5.5. Fine-Grained Analysis of Nested Structures

To further investigate the performance of PRSpan under different nested configurations, we conduct a fine-grained analysis on the MMD-NER test set according to nesting type, nesting depth, and entity length. This analysis directly examines whether PRSpan remains robust when handling structurally complex nested entities. Since the model output is a set of typed entity spans, nesting relations are derived from span positions. For the nesting-type analysis, a nested pair is regarded as correctly predicted only when both the inner and outer entities are correctly recognized with exact boundaries and entity types.
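The strict nested-pair criterion can be sketched as follows. This is an illustrative reading of the rule rather than the evaluation script used in the experiments: entities are assumed to be `(start, end, type)` tuples with inclusive token indices, the function names are hypothetical, and the token offsets in the example assume whitespace tokenization of the phrase "storm surge triggered by a typhoon".

```python
def nested_pairs(entities):
    """Derive (inner, outer) pairs purely from span positions."""
    pairs = set()
    for outer in entities:
        for inner in entities:
            s_i, e_i, _ = inner
            s_o, e_o, _ = outer
            # inner lies within outer and the two spans are not the same extent
            if s_o <= s_i and e_i <= e_o and (s_i, e_i) != (s_o, e_o):
                pairs.add((inner, outer))
    return pairs

def correct_pairs(gold, pred):
    """A gold pair counts only if both members are predicted with exact span and type."""
    predicted = set(pred)
    return {(i, o) for (i, o) in nested_pairs(gold) if i in predicted and o in predicted}

gold = [(0, 5, "SecondaryMarDis"), (5, 5, "PrimaryMarDis")]
pred_exact = [(0, 5, "SecondaryMarDis"), (5, 5, "PrimaryMarDis")]
pred_wrong_type = [(0, 5, "SecondaryMarDis"), (5, 5, "MarMetCond")]

print(len(correct_pairs(gold, pred_exact)))       # 1
print(len(correct_pairs(gold, pred_wrong_type)))  # 0
```

Under this criterion a single boundary or type error on either member removes the whole pair, which makes the pair-level score stricter than the entity-level score.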

We compare PRSpan with GlobalPointer, the strongest baseline in the overall experiment. As shown in Table 13, Table 14 and Table 15, PRSpan consistently outperforms GlobalPointer across all fine-grained groups. This indicates that the proposed position-role-aware span representation is effective not only in overall evaluation but also under more challenging structural conditions.

As shown in Table 13, PRSpan achieves the best performance on strict containment structures, where the inner entity is fully enclosed by the outer entity without sharing boundaries. The gains over GlobalPointer are also clear for head-nested and tail-nested structures, suggesting that PRSpan is more effective in distinguishing shared start and end boundaries. This improvement is consistent with the model design: RoPE-enhanced attention strengthens relative boundary modeling, while CLN generates role-specific Head and Tail representations for boundary-sensitive span encoding. The lowest performance is observed for crossing structures, where two spans partially overlap without strict containment. This indicates that crossing structures remain the most difficult configuration because their boundaries are not hierarchically aligned.
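The four structural groups can be told apart from span offsets alone. The sketch below is one plausible reading of the definitions in this paragraph, with head-nested taken to mean a shared start boundary and tail-nested a shared end boundary; the function name `nesting_type` and the inclusive `(start, end)` convention are assumptions.

```python
def nesting_type(a, b):
    """Classify the structural relation between two spans given as inclusive (start, end)."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2 or e2 < s1:
        return "disjoint"
    if (s1, e1) == (s2, e2):
        return "identical"
    one_contains_other = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    if one_contains_other:
        if s1 == s2:
            return "head-nested"        # shared start boundary
        if e1 == e2:
            return "tail-nested"        # shared end boundary
        return "strict containment"     # inner span shares no boundary with the outer one
    return "crossing"                   # partial overlap without containment

print(nesting_type((0, 5), (2, 3)))  # strict containment
print(nesting_type((0, 5), (0, 1)))  # head-nested
print(nesting_type((0, 5), (5, 5)))  # tail-nested
print(nesting_type((0, 3), (2, 5)))  # crossing
```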

Table 14 shows that both models experience performance degradation as nesting depth increases. PRSpan achieves 95.53% F1 on depth-1 entities, 93.88% on depth-2 entities, and 90.76% on entities with depth ≥ 3. This confirms that deeply nested structures are more difficult because each entity boundary must be identified in the presence of multiple enclosing or enclosed spans. Nevertheless, PRSpan maintains a larger advantage over GlobalPointer in the depth ≥ 3 group, indicating that the proposed position-role-aware span representation is particularly beneficial when boundary ambiguity becomes stronger.

The entity-length analysis further shows that longer entities are more difficult for both models. PRSpan achieves 96.18% F1 on 1–2 token entities, but the score decreases to 89.74% for entities longer than 10 tokens. This trend is expected because long marine meteorological disaster entities often contain multiple modifiers, causal triggers, or embedded disaster conditions. Nevertheless, PRSpan consistently outperforms GlobalPointer across all length groups, with a particularly clear advantage on long-span entities. This suggests that RoPE-enhanced relative position modeling helps preserve long-distance boundary dependencies, while CLN-based role-specific features help maintain internal semantic coherence within longer spans.

Overall, the fine-grained results show that PRSpan remains more robust than GlobalPointer across different nesting types, nesting depths, and entity lengths. Performance decreases on crossing structures, deeply nested entities, and very long entities, indicating that boundary overlap and span length remain important sources of difficulty. However, PRSpan exhibits smaller performance degradation in these challenging cases, demonstrating the effectiveness of its position-role-aware span representation for complex nested entity recognition.

5.6. Cross-Domain Transfer Evaluation

To evaluate whether PRSpan can be adapted to related disaster information extraction scenarios beyond the marine meteorological disaster domain, we conduct a pilot cross-domain transfer experiment on an external English disaster-specific NER dataset. Considering language consistency, domain relevance, and data availability, we select the disaster-specific NER dataset released by Hafsa et al. [37] as the target-domain dataset. This dataset consists of English online disaster news and is annotated with 14 crisis-specific entity types, including NaturalHazard, Location, Date, AffectedPopulation, InfrastructureDamage, and CollapsedStructure. Compared with general geoscience terminology datasets, this dataset is more closely related to disaster information extraction and is therefore suitable for evaluating the transferability of PRSpan to a related disaster domain.

As shown in Table 16, the target dataset differs substantially from MMD-NER in label schema, text source, and structural complexity. MMD-NER focuses on fine-grained nested entities in marine meteorological disaster texts, whereas the external disaster-specific NER dataset is mainly derived from online disaster news and follows a flat NER setting. Therefore, this experiment is positioned as a pilot cross-domain transfer evaluation, rather than a comprehensive benchmark for cross-domain nested NER generalization.

In the transfer experiment, we convert the released annotations into sentence-level samples and use 1000 annotated sentences, split into 700 training sentences, 100 validation sentences, and 200 test sentences. Since the target dataset does not contain annotated nested entity pairs, nested-pair evaluation is not applicable. We therefore report Precision, Recall, Micro-F1, and Macro-F1 under the target-domain entity schema.
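For reference, the two F1 variants can be computed from exact-match counts as in the sketch below. It is a generic illustration rather than the evaluation code used here: entities are assumed to be `(sentence_id, start, end, type)` tuples, the function names are hypothetical, and Macro-F1 is averaged over the types that occur in the gold or predicted sets, which is one common convention.

```python
from collections import Counter

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts, with zero-division guarded."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_macro_f1(gold, pred):
    """Exact-match Micro-F1 and Macro-F1 over typed entity tuples."""
    gold, pred = set(gold), set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for ent in pred:
        if ent in gold:
            tp[ent[-1]] += 1
        else:
            fp[ent[-1]] += 1
    for ent in gold - pred:
        fn[ent[-1]] += 1
    types = set(tp) | set(fp) | set(fn)
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    macro = sum(prf(tp[t], fp[t], fn[t])[2] for t in types) / len(types) if types else 0.0
    return micro, macro

gold = [(0, 0, 1, "NaturalHazard"), (0, 3, 3, "Location"), (1, 0, 0, "Location")]
pred = [(0, 0, 1, "NaturalHazard"), (0, 3, 3, "Location"), (1, 2, 2, "Location")]
print(micro_macro_f1(gold, pred))
```

Micro-F1 pools counts over all entity types and is therefore dominated by frequent types, whereas Macro-F1 weights each type equally and is more sensitive to rare categories.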

During transfer, the encoder, RoPE-enhanced attention module, CLN role feature generator, and span representation layers trained on MMD-NER are used for initialization, while the final classification layer is re-initialized according to the 14 entity types in the target dataset. We compare PRSpan trained only on the target-domain data with PRSpan initialized from MMD-NER and then fine-tuned on the target-domain data. BERT-CRF and GlobalPointer are also included as representative sequence-labeling and span-based baselines.
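The initialization scheme can be pictured as a filtered parameter copy. The sketch below uses plain dictionaries in place of real model weights, and the parameter names and the `classifier.` prefix are hypothetical; in a PyTorch setting the same pattern is typically realized by filtering a state dict before loading it non-strictly.

```python
def init_for_transfer(source_params, target_params, reinit_prefix="classifier."):
    """Copy source weights into the target model, except layers under `reinit_prefix`."""
    merged = dict(target_params)
    for name, value in source_params.items():
        if name.startswith(reinit_prefix):
            continue  # keep the freshly initialized target-domain classifier
        if name in merged:
            merged[name] = value
    return merged

# Toy stand-ins for parameter tensors (hypothetical names and shapes).
source = {"encoder.w": [1, 2], "cln.head": [3], "classifier.w": [0] * 11}  # 11 source types
target = {"encoder.w": [0, 0], "cln.head": [0], "classifier.w": [9] * 14}  # 14 target types

merged = init_for_transfer(source, target)
print(merged["encoder.w"], len(merged["classifier.w"]))
```

Only the output layer depends on the label schema, so every other component can be carried over unchanged and then fine-tuned on the target data.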

As shown in Table 17, PRSpan trained only on the target disaster-specific NER dataset outperforms BERT-CRF and GlobalPointer, indicating that the span-based and role-aware design remains effective in a related disaster news domain. Moreover, PRSpan initialized from MMD-NER achieves the best performance, improving Micro-F1 from 87.29% to 88.35% and Macro-F1 from 84.61% to 85.52%. This suggests that the span representations and role-aware features learned from MMD-NER provide useful initialization for related disaster NER tasks.

However, the improvement is moderate rather than dramatic. This is expected because the target dataset differs from MMD-NER in several aspects. First, the target dataset is based on online disaster news, whereas MMD-NER focuses on marine meteorological disaster texts. Second, the target dataset uses 14 crisis-specific entity types, while MMD-NER uses 11 fine-grained marine disaster-chain entity types. Third, the target dataset mainly follows a flat NER setting and does not annotate nested entity pairs. These differences indicate that the transferability of PRSpan depends on the similarity between source and target data distributions. When the target-domain data do not share similar nested structures or entity distributions with MMD-NER, the transfer gain may be limited.

Overall, this experiment provides preliminary evidence that PRSpan can be adapted to related disaster information extraction tasks, especially when the target domain shares disaster-related semantics with MMD-NER. Future work will further evaluate PRSpan on more Earth science domains, such as hydrological events, ecological disasters, and geological hazard reports.

5.7. Comparison with LLM Prompting Baselines

To examine whether direct LLM prompting can serve as a practical alternative for MMD-NER, we conduct a sampled evaluation on 50 stratified test sentences. The subset is selected to cover major entity categories and representative nested structures. We compare GPT-4o mini and Qwen3 under the same MMD-NER entity schema, annotation guideline, and few-shot prompt. The LLMs are required to output entity spans, entity types, and nested pairs in a structured JSON format. Their outputs are then matched against the gold annotations using exact span-level matching. This sampled evaluation is intended to analyze the trade-off between LLM prompting and supervised task-specific models, rather than to provide a comprehensive benchmark of all available LLMs.
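A minimal sketch of the matching step is given below. The JSON layout (`entities` with `start`, `end`, and `type` fields), the function names, and the partial schema are assumptions made for illustration; the actual prompt and output format are not reproduced here. Off-schema types and malformed items are discarded before scoring.

```python
import json

def parse_llm_entities(raw, schema):
    """Parse a JSON reply into a set of (start, end, type), dropping invalid items."""
    try:
        items = json.loads(raw).get("entities", [])
    except (json.JSONDecodeError, AttributeError):
        return set()  # not valid JSON, or not a JSON object
    if not isinstance(items, list):
        return set()
    entities = set()
    for item in items:
        if not isinstance(item, dict):
            continue
        start, end, etype = item.get("start"), item.get("end"), item.get("type")
        if isinstance(start, int) and isinstance(end, int) and etype in schema:
            entities.add((start, end, etype))
    return entities

def exact_match_prf(gold, pred):
    """Precision, recall, and F1 under exact span-and-type matching."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

schema = {"SecondaryMarDis", "PrimaryMarDis", "MarMetCond", "OceanPhen"}  # partial, for illustration
raw = ('{"entities": [{"start": 0, "end": 5, "type": "SecondaryMarDis"},'
       ' {"start": 4, "end": 5, "type": "PrimaryMarDis"},'
       ' {"start": 7, "end": 8, "type": "Weather"}]}')
gold = {(0, 5, "SecondaryMarDis"), (5, 5, "PrimaryMarDis")}

pred = parse_llm_entities(raw, schema)
print(exact_match_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy reply the inner entity is over-extended by one token and one type falls outside the schema, so only the outer span counts as correct.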

As shown in Table 18, direct LLM prompting achieves reasonable semantic extraction ability but is clearly weaker than PRSpan under strict span-level evaluation. GPT-4o mini performs better than Qwen3, but both LLMs show a clear gap compared with the supervised PRSpan model. The main gap comes from exact boundary matching and nested-pair consistency. LLMs can often identify semantically relevant expressions, but they may over-extend or under-extend entity boundaries, omit inner entities, or generate entity types that do not exactly follow the predefined MMD-NER schema.

In contrast, PRSpan is trained under the MMD-NER annotation schema and produces more stable span-level predictions. In terms of practical deployment, LLM prompting is flexible and does not require task-specific training, but it incurs higher per-sample inference cost and lower inference efficiency. PRSpan requires supervised training, but after deployment it is more efficient and cost-effective for repeated large-scale extraction. These results suggest that direct LLM prompting is useful for exploratory extraction and data construction, whereas task-specific supervised models remain preferable when strict boundary consistency, nested-pair accuracy, and stable large-scale inference are required.

5.8. Qualitative Error Analysis

To better understand the remaining limitations of PRSpan, we conduct a qualitative error analysis on representative incorrect predictions from the test set. Since PRSpan has already achieved the best overall Micro-F1 and Macro-F1 scores, this analysis does not suggest that boundary recognition is generally weak. Instead, it focuses on the remaining difficult cases that are still challenging for the model, especially low-frequency entity types, semantically close categories, long outer entities, and complex overlapping structures.

The category-level results show that Time is one of the most difficult entity types for the baseline models. This is partly due to data sparsity, as Time has only 89 instances in MMD-NER; similar long-tail effects are observed for MarOrganism. Although PRSpan is more robust than the baselines on these low-frequency categories, temporal expressions remain relatively challenging because they appear in diverse surface forms and are less frequently involved in stable span patterns. The relatively weak performance on Time therefore appears to stem partly from the distribution of the generated dataset rather than from the models alone.

As shown in Table 19, the remaining errors of PRSpan mainly occur in three situations. First, low-frequency entities such as Time are more likely to be missed because the generated dataset contains fewer temporal expressions and their surface forms are diverse. Second, semantically close categories such as MarMetCond and OceanPhen may be confused when both describe abnormal marine states. Third, long outer entities and coordinated overlapping structures remain challenging because they require the model to distinguish core entities from causal modifiers or adjacent semantically related spans. These observations are consistent with the category-level and fine-grained analyses: PRSpan substantially improves the recognition of low-frequency and nested entities, but rare categories and complex overlapping structures still require further improvement.

6. Discussion

6.1. Robustness-Related Findings

The additional analyses in Section 5 provide a more comprehensive view of PRSpan’s robustness beyond overall F1 scores. The fine-grained evaluation by nesting type, nesting depth, and entity length shows that PRSpan consistently outperforms GlobalPointer across different structural conditions, indicating that the proposed position-role-aware span representation remains effective for strict containment, head-nested, tail-nested, and long-span entities. At the same time, the performance decrease on crossing structures, deeply nested entities, and very long entities suggests that boundary overlap and span complexity remain challenging.

The cross-domain transfer experiment on the external disaster-specific NER dataset further provides preliminary evidence of domain adaptability. PRSpan initialized from MMD-NER performs better than the target-only setting, suggesting that span-based and role-aware representations learned from marine meteorological disaster texts can provide useful initialization for related disaster NER tasks. The improvement is moderate rather than dramatic, which is expected because the target dataset differs from MMD-NER in label schema, text source, entity distribution, and structural complexity.

The extended ablation analysis also supports the robustness of the key architectural design choices. The comparison among w/o CLN, Mid-only CLN, Boundary-only CLN, Full CLN + AvgPool/MaxPool, and Full CLN shows that the performance gain is not caused by a single component alone. Instead, RoPE-enhanced relative position modeling, CLN-generated role-specific features, and Positional Role Pooling jointly contribute to the final performance. The qualitative error analysis complements these findings by identifying the remaining difficult cases, including low-frequency temporal expressions, semantically close entity types, long outer entities, and coordinated overlapping structures. Together, these results clarify both the strengths and limitations of PRSpan under different structural and domain conditions.

6.2. Practical Deployment Considerations

Although PRSpan achieves strong recognition performance on the MMD-NER dataset, its practical deployment in real-time marine meteorological disaster monitoring systems requires further consideration of computational cost, inference efficiency, and memory usage. PRSpan follows a span-based nested NER paradigm, in which candidate spans are constructed over an input sequence of length N. As a result, the number of candidate spans grows quadratically with the sequence length, leading to an $O(N^{2})$ span enumeration and classification cost. In addition, the RoPE-enhanced attention module retains the standard self-attention complexity of $O(N^{2}D)$, where D denotes the hidden dimension. The CLN-based role feature generation introduces additional transformations for Head, Mid, and Tail representations, but this overhead is relatively small compared with the quadratic span classification space and attention computation. Therefore, the main computational bottleneck of PRSpan lies in the two-dimensional span classification space rather than in the CLN module itself.

From a deployment perspective, this bottleneck reflects a common trade-off in span-based NNER methods: directly modeling all candidate spans improves the ability to recognize overlapping and nested entities, but it also increases computation and memory consumption. In practice, marine meteorological disaster texts are usually processed after sentence segmentation, and sentence-level inputs are often of moderate length. Therefore, PRSpan is more suitable for sentence-level or paragraph-level information extraction pipelines than for unrestricted processing of very long documents.

Several optimization strategies can be considered to improve deployment efficiency. First, candidate span pruning can be applied by limiting the maximum span length or filtering unlikely start–end combinations before classification. Second, the BERT encoder can be replaced with a lighter domain-adapted encoder, or the full PRSpan model can be distilled into a smaller student model through knowledge distillation. Third, parameter pruning and low-bit quantization can be used to reduce memory usage and accelerate inference. Fourth, a two-stage extraction framework can be adopted, where a lightweight model first identifies disaster-related text segments and PRSpan is then applied only to high-confidence segments for fine-grained nested entity recognition. This study mainly focuses on recognition accuracy and structural modeling for marine meteorological disaster NNER. Systematic experiments on inference latency, memory consumption, and compression strategies under different deployment environments will be explored in future work.
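The effect of the first strategy, limiting the maximum span length, can be illustrated by counting candidate spans. The sketch is a generic enumeration rather than the PRSpan code; the function name `candidate_spans`, the sequence length of 50, and the length cap of 8 are assumptions chosen for the example.

```python
def candidate_spans(n, max_len=None):
    """Enumerate inclusive (start, end) spans over n tokens, optionally capped in length."""
    limit = n if max_len is None else min(max_len, n)
    return [(s, e) for s in range(n) for e in range(s, min(s + limit, n))]

n = 50
print(len(candidate_spans(n)))             # 1275 = n * (n + 1) / 2
print(len(candidate_spans(n, max_len=8)))  # 372, roughly n * max_len
```

Capping the span length turns the quadratic candidate space into one that grows approximately linearly with sentence length, at the cost of missing any entity longer than the cap, so the cap would need to be chosen from the entity-length distribution of the corpus.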

6.3. Ethical Considerations and Risk Mitigation

Although PRSpan is designed to support information extraction from marine meteorological disaster texts, its outputs should not be directly used as the sole basis for emergency decision-making. In disaster management scenarios, incorrect recognition of critical entities, such as disaster type, affected sea area, meteorological condition, or response agency, may lead to delayed warnings, incomplete situational awareness, or inappropriate resource allocation. Therefore, PRSpan should be deployed as a decision-support tool rather than an autonomous decision-making system.

To reduce potential risks, uncertainty-aware deployment strategies should be considered. First, the confidence scores produced by the softmax decoder can be used to identify low-confidence entity predictions, which should be flagged for human review. Second, confidence calibration methods, such as temperature scaling or validation-based threshold tuning, can be applied to make prediction probabilities better reflect actual reliability. Third, critical entity categories related to disaster events, affected regions, and meteorological conditions should be assigned stricter confidence thresholds. Finally, in real-world disaster management workflows, PRSpan should be integrated with expert verification, multi-source evidence checking, and audit logs to ensure traceability and reduce the risk of automated extraction errors.
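The first three measures can be combined in a small routing function, sketched below. It is an illustration under stated assumptions rather than a deployed component: the function names, the temperature value, the default threshold of 0.5, and the per-category thresholds are all hypothetical, and in practice the temperature would be fitted on a validation set.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [x / total for x in exps]

def needs_review(logits, labels, thresholds, temperature=1.0, default=0.5):
    """Return (label, confidence, flag); flag is True when a human should check the span."""
    probs = softmax(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    label, conf = labels[best], probs[best]
    return label, conf, conf < thresholds.get(label, default)

labels = ["PrimaryMarDis", "Time"]
strict = {"PrimaryMarDis": 0.9}  # stricter threshold for a critical category (hypothetical value)

print(needs_review([2.0, 0.0], labels, strict))
print(needs_review([2.0, 0.0], labels, {"PrimaryMarDis": 0.8}))
```

With the stricter threshold the same prediction is routed to human review, which reflects the intent of assigning tighter thresholds to disaster-critical categories.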

These considerations indicate that while PRSpan can improve the efficiency of disaster text processing, human oversight and risk-control mechanisms remain necessary for high-stakes emergency applications.

6.4. Limitations and Future Work

This study still has several limitations. Although the GPT-assisted four-step pipeline and human validation results support the reliability of MMD-NER, the dataset is still generated with LLM assistance and may inherit certain distributional biases. This issue is particularly relevant to low-frequency categories such as Time and MarOrganism, where limited examples may reduce the diversity of surface forms and span patterns. Future work will incorporate more expert-annotated samples and continuously refine the generated corpus.

The cross-domain transfer experiment also remains preliminary. The external disaster-specific NER dataset used in this study follows a flat NER setting and differs from MMD-NER in label schema, text source, entity distribution, and structural complexity. Therefore, the current experiment cannot fully demonstrate cross-domain nested NER generalization. Future work will evaluate PRSpan on more Earth science domains, including hydrological events, ecological disasters, and geological hazard reports.

From the modeling perspective, crossing structures, deeply nested entities, and very long entities remain challenging even though PRSpan improves the recognition of nested and long-span entities overall. More explicit global structural constraints, relation-aware decoding, or joint entity-relation modeling may help better capture complex span interactions. In addition, practical deployment requires more systematic evaluation of inference latency, memory usage, model compression, and confidence calibration, especially in real-time disaster monitoring environments.

7. Conclusions

This study addresses the challenge of nested named entity recognition in marine meteorological disaster texts, where disaster-chain semantics frequently produce overlapping and hierarchical entity structures. To support this task, we constructed MMD-NER, a domain-specific nested NER dataset generated through a GPT-assisted four-step pipeline. The dataset contains fine-grained marine disaster-chain entity categories and nested entity pairs that reflect the causal, spatial, meteorological, and impact-related structure of disaster descriptions. Human validation further confirms that the generated annotations are largely consistent with expert judgments in terms of entity boundaries, type assignments, and nested-pair relations, demonstrating the reliability of the proposed dataset construction process.

To recognize complex nested entities, we proposed PRSpan, a position-role-aware span classification model. PRSpan incorporates RoPE-enhanced attention to strengthen relative positional modeling and uses CLN to generate Head, Mid, and Tail role-specific features for candidate spans. These features are further aggregated through Positional Role Pooling to preserve both boundary signals and internal semantic coherence. Experimental results show that PRSpan outperforms sequence-labeling, hierarchical, span-based, and graph-based baselines on MMD-NER. Additional attention visualization, extended ablation studies, CLN case analysis, and fine-grained evaluation by nesting type, nesting depth, and entity length further demonstrate the effectiveness of RoPE-enhanced attention, CLN-based role representation, and Positional Role Pooling for nested entity recognition.

The extended analyses also clarify the model’s applicability and limitations. Cross-domain transfer results provide preliminary evidence that PRSpan can be adapted to related disaster NER tasks, while the comparison with LLM prompting baselines shows that direct LLM prompting remains less stable than supervised span-based modeling under strict boundary and nested-pair evaluation. Nevertheless, rare categories such as Time, very long entities, deeply nested structures, and coordinated overlapping expressions remain challenging. Future work will incorporate more expert-annotated samples, expand evaluation to additional Earth science domains such as hydrological events and ecological disasters, and explore deployment-oriented optimization strategies including span pruning, model compression, confidence calibration, and human-in-the-loop verification for high-stakes disaster management applications.

Author Contributions

Conceptualization, Weijian Ni, Wenjing Wang and Nengfu Xie; methodology, Weijian Ni, Wenjing Wang and Qingtian Zeng; software, Weijian Ni and Wenjing Wang; validation, Cong Liu; formal analysis, Weijian Ni and Wenjing Wang; investigation, Weijian Ni and Nengfu Xie; resources, Weijian Ni, Nengfu Xie and Tong Liu; data curation, Wenjing Wang and Tong Liu; writing—original draft preparation, Weijian Ni and Wenjing Wang; writing—review and editing, Nengfu Xie, Tong Liu, Qingtian Zeng and Cong Liu; visualization, Wenjing Wang; supervision, Qingtian Zeng and Nengfu Xie; project administration, Weijian Ni and Nengfu Xie; funding acquisition, Weijian Ni, Qingtian Zeng and Cong Liu. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Open Fund Project of the Key Laboratory of Blockchain Agricultural Applications, Ministry of Agriculture and Rural Affairs [No.2023KLABA01], the National Science and Technology Major Project [No.2022ZD0119501], the Natural Science Foundation of Shandong Province, China [Nos.ZR2022MF319 and ZR2025ZD17], the Taishan Scholar Program of Shandong Province, China [No.TSTP20250506], national funds through FCT (Fundação para a Ciência e a Tecnologia) [UID/04152/2025], and Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS [UID/PRR/04152/2025].

Data Availability Statement

The data and materials used and analyzed during the current study are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Holloway, S.W. Named Entity Recognition in the Climate Change Domain: An Examination of NER Systems for Climatological Knowledge Discovery. Master’s Thesis, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, 2015.
Zhao, D.F.; Chen, X.L.; Chen, Y. Named Entity Recognition for Chinese Texts on Marine Coral Reef Ecosystems Based on the BERT-BiGRU-Att-CRF Model. Appl. Sci. 2024, 14, 5743.
Miwa, M.; Bansal, M. End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 1105–1116.
Hanifah, A.F.; Kusumaningrum, R. Non-Factoid Answer Selection in Indonesian Science Question Answering System using Long Short-Term Memory (LSTM). Procedia Comput. Sci. 2021, 179, 736–746.
Bai, T.; Ge, Y.; Guo, S.; Zhang, Z.; Gong, L. Enhanced Natural Language Interface for Web-Based Information Retrieval. IEEE Access 2021, 9, 4233–4241.
Li, Y.; Luo, L.; Zeng, X.; Han, Z. Fine-tuned BERT-BiLSTM-CRF approach for named entity recognition in geological disaster texts. Earth Sci. Inform. 2025, 18, 368.
Qiu, Q.; Tian, M.; Tao, L.; Xie, Z.; Ma, K. Semantic information extraction and search of mineral exploration data using text mining and deep learning methods. Ore Geol. Rev. 2024, 165, 105863.
Li, J.; Sun, A.; Han, J.; Li, C. A Survey on Deep Learning for Named Entity Recognition. IEEE Trans. Knowl. Data Eng. 2022, 34, 50–70.
Guo, X.; Luo, P.; Wang, T.; Wang, W. Chinese Named Entity Recognition Based on Transformer Encoder and BiLSTM. Des. Eng. 2020, 68–80. Available online: https://api.semanticscholar.org/CorpusID:231778912 (accessed on 28 May 2026).
Yan, R.; Jiang, X.; Dang, D. Named Entity Recognition by Using XLNet-BiLSTM-CRF. Neural Process. Lett. 2021, 53, 3339–3356.
Chang, J.; Han, X. Multi-level context features extraction for named entity recognition. Comput. Speech Lang. 2023, 77, 101412.
Wang, Y.; Tong, H.H.; Zhu, Z.Y.; Li, Y. Nested Named Entity Recognition: A Survey. ACM Trans. Knowl. Discov. Data 2022, 16, 108.
Eberts, M.; Ulges, A. Span-based joint entity and relation extraction with transformer pre-training. In Proceedings of the European Conference on Artificial Intelligence (ECAI 2020); International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2020.
Shen, Y.L.; Ma, X.Y.; Tan, Z.; Zhang, S.; Wang, W.; Lu, W.M. Locate and Label: A Two-stage Identifier for Nested Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Volume 1: Long Papers, Online, 1–6 August 2021; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2782–2794.
Su, J.L.; Murtadha, A.; Pan, S.F.; Hou, J.; Sun, J.; Huang, W.W.; Wen, B.; Liu, Y.F. Global Pointer: Novel Efficient Span-based Approach for Named Entity Recognition. arXiv 2022, arXiv:2208.03054.
Wang, J.; Shou, L.D.; Chen, K.; Chen, G. Pyramid: A Layered Model for Nested Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5918–5928.
Rojas, M.; Robaldo, L.; Soto, X.; Muñoz, R. Simple yet powerful: An overlooked architecture for nested named entity recognition. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 5248–5260. Available online: https://aclanthology.org/2022.coling-1.184/ (accessed on 28 May 2026).
Yu, J.T.; Bohnet, B.; Poesio, M. Named Entity Recognition as Dependency Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6470–6476.
Luo, Y.; Zhao, H. Bipartite Flat-Graph Network for Nested Named Entity Recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6408–6418.
Wan, J.; Ru, D.; Zhang, W.; Yu, Y. Nested named entity recognition with span-level graphs. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Dublin, Ireland, 22–27 May 2022; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 892–903.
Yang, S.; Tu, K. Bottom-up constituency parsing and nested named entity recognition with pointer networks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Dublin, Ireland, 22–27 May 2022; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1763–1774.
Lou, C.; Yang, S.; Tu, K. Nested named entity recognition as latent lexicalized constituency parsing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Dublin, Ireland, 22–27 May 2022; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 6183–6198.
Qiu, Q.; Xie, Z.; Wu, L.; Tao, L. Automatic spatiotemporal and semantic information extraction from unstructured geoscience reports using text mining techniques. Earth Sci. Inform. 2020, 13, 1393–1410.
Lv, X.; Xie, Z.; Wu, L.; Tao, L.; Qiu, Q. Chinese named entity recognition in the geoscience domain based on BERT. Earth Space Sci. 2022, 9, e2021EA002166.
Sun, J.; Liu, Y.; Cui, J.; He, H. Deep learning-based methods for natural hazard named entity recognition. Sci. Rep. 2022, 12, 4598.
Shen, Y.; Tan, Z.; Wu, S.; Zhang, W.; Zhang, R.; Xi, Y.; Lu, W.; Zhuang, Y. PromptNER: Prompt locating and typing for named entity recognition. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, Toronto, ON, Canada, 9–14 July 2023; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 12492–12507. Available online: https://aclanthology.org/2023.acl-long.698/ (accessed on 28 May 2026).
Zhang, R.; Li, Y.; Ma, Y.; Zhou, M.; Zou, L. LLMaAA: Making large language models as active annotators. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13088–13103.
Santoso, J.; Sutanto, P.; Cahyadi, B.; Setiawan, E. Pushing the limits of low-resource NER using LLM artificial data generation. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 9652–9667. [Google Scholar] [CrossRef]
Kamath, G.; Vajjala, S. Does synthetic data help named entity recognition for low-resource languages? In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Mumbai, India, 20–24 December 2025; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 159–167. Available online: https://aclanthology.org/2025.ijcnlp-short.15/ (accessed on 28 May 2026).
Zhao, D.; Mu, W.; Jia, X.; Liu, S.; Chu, Y.; Meng, J.; Lin, H. Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction. BioData Min. 2025, 18, 28. [Google Scholar] [CrossRef] [PubMed]
Huang, Y.; Xia, Y.; Tao, R.; Jiao, D.; Min, X.; Zheng, J.; Jiang, Y.; Wu, W.; Du, P. A LLM-based agent for the construction of typhoon knowledge graphs. Environ. Model. Softw. 2026, 197, 106856. [Google Scholar] [CrossRef]
Su, J.L.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. RoFormer: Enhanced transformer with Rotary Position Embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
Zhang, Z.H.; Li, X.Q.; Li, Y.X.; Dong, Y.J.; Wang, D.; Xiong, S.W. Neural Noise Embedding for End-To-End Speech Enhancement with Conditional Layer Normalization. In Proceedings of ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE: New York, NY, USA, 2021; pp. 7113–7117. [Google Scholar] [CrossRef]
Hu, S.; Zhang, H.; Hu, X.; Du, J. Chinese Named Entity Recognition based on BERT-CRF Model. In Proceedings of the 2022 IEEE/ACIS 22nd International Conference on Computer and Information Science (ICIS), Zhuhai, China, 26–28 June 2022; IEEE: New York, NY, USA, 2022; pp. 105–108. [Google Scholar] [CrossRef]
Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 260–270. [Google Scholar] [CrossRef]
Zhou, L.; Li, J.M.; Gu, Z.Q.; Qiu, J.; Gupta, B.B.; Tian, Z.H. PANNER: POS-Aware Nested Named Entity Recognition Through Heterogeneous Graph Neural Network. IEEE Trans. Comput. Soc. Syst. 2024, 11, 4718–4726. [Google Scholar] [CrossRef]
Hafsa, N.E.; Alzoubi, H.M.; Almutlq, A.S. Accurate disaster entity recognition based on contextual embeddings in self-attentive BiLSTM-CRF. PLoS ONE 2025, 20, e0318262. [Google Scholar] [CrossRef]

Figure 1. Example of nested named entity recognition in marine meteorological disaster domain.

Figure 2. Framework of PRSpan. Purple boxes denote input tokens, and yellow stacked blocks denote BERT contextual embeddings. In the attention module, blue, orange, and yellow matrices denote Query (Q), Key (K), and Value (V), respectively, while the pink block denotes RoPE-enhanced attention features. In the CLN module, blue, orange, and green indicate Head, Mid, and Tail role-specific features, respectively. In the span classification matrix, 0 denotes a non-entity span, and non-zero values denote predicted entity category IDs. In this example, the entity category IDs 1, 2, and 3 correspond to PrimaryMarDis, MarMetCond, and SecondaryMarDis, respectively.
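
The span classification matrix described in the Figure 2 caption can be read back into entity tuples with a few lines of Python. The sketch below is illustrative only: it assumes that `matrix[i][j]` holds the category ID of the span running from token `i` to token `j` (inclusive) and that only cells with `j >= i` are meaningful. The function name `decode_span_matrix` and the toy matrix are hypothetical; the ID-to-category mapping is the one given in the caption.

```python
# Category IDs as listed in the Figure 2 caption.
ID2LABEL = {1: "PrimaryMarDis", 2: "MarMetCond", 3: "SecondaryMarDis"}

def decode_span_matrix(matrix, id2label=ID2LABEL):
    """Turn a span classification matrix into (start, end, label) tuples.

    Assumption: matrix[i][j] is the predicted category ID for the span
    covering tokens i..j inclusive; 0 means "not an entity".
    """
    entities = []
    for start, row in enumerate(matrix):
        # Only the upper triangle (end >= start) describes valid spans.
        for end in range(start, len(row)):
            category_id = row[end]
            if category_id != 0:
                entities.append((start, end, id2label[category_id]))
    return entities

# Toy 4-token example (made up): an outer span 0..3 containing an inner span 2..3.
toy_matrix = [
    [0, 0, 0, 3],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
decoded = decode_span_matrix(toy_matrix)
```

Because every cell is classified independently, an outer span and a span nested inside it can both receive non-zero IDs, which is what lets a span matrix represent nested entities that a flat tagging scheme cannot.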

Figure 3. RoPE-enhanced attention mechanism. The orange and blue branches represent the Query vector $Q_{m}$ and Key vector $K_{n}$ at positions m and n, respectively. The stacked blocks show their decomposition into $D/2$ two-dimensional component pairs, and the rotation diagrams illustrate the RoPE transformation applied to the k-th pair. The red blocks denote the resulting attention scores, where $\alpha_{m,n}$ is the attention score between positions m and n.
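
The pairwise rotation described in the Figure 3 caption can be sketched in plain Python. This is a minimal illustration rather than the PRSpan implementation: the frequency schedule `base ** (-2k / D)` follows the RoFormer formulation and is assumed here, the score omits scaling and softmax, and the names `rope_rotate` and `rope_score` are hypothetical.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive 2-D pair of `vec` by an angle proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2.0 * k / d)  # assumed RoFormer-style frequency
        x, y = vec[2 * k], vec[2 * k + 1]
        # Standard 2-D rotation of the k-th component pair.
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def rope_score(q, k, m, n):
    """Unnormalised score between a query at position m and a key at position n."""
    q_rot = rope_rotate(q, m)
    k_rot = rope_rotate(k, n)
    return sum(a * b for a, b in zip(q_rot, k_rot))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
s1 = rope_score(q, k, 2, 5)    # offset n - m = 3
s2 = rope_score(q, k, 10, 13)  # same offset, different absolute positions
```

The two scores `s1` and `s2` agree up to floating-point error, because the product of the two rotations depends only on the offset `n - m`. This is the relative-position property that makes the rotated scores useful for boundary modeling.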

Figure 4. Role-conditioned feature generation via CLN. The green blocks denote $h$ and $\hat{h}$; the orange blocks denote Avg and Std; the yellow blocks $\gamma$ and $\beta$ denote role-conditioned parameters; and the purple labels indicate different positional roles.
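
The computation outlined in the Figure 4 caption can be sketched as follows, assuming the usual layer-normalization form $\hat{h} = \gamma_{r} \odot (h - \mathrm{Avg}(h)) / \mathrm{Std}(h) + \beta_{r}$ for a role $r$. The per-role lookup table `ROLE_PARAMS` and its values are hypothetical stand-ins; in a trained model the scale and shift would be learned and generated from a role condition.

```python
import math

def conditional_layer_norm(h, gamma, beta, eps=1e-12):
    """Normalise `h` by its mean and standard deviation, then apply
    role-conditioned scale (gamma) and shift (beta)."""
    avg = sum(h) / len(h)
    var = sum((x - avg) ** 2 for x in h) / len(h)
    std = math.sqrt(var + eps)
    return [g * (x - avg) / std + b for x, g, b in zip(h, gamma, beta)]

# Hypothetical role-conditioned parameters (made-up values for illustration).
ROLE_PARAMS = {
    "head": ([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]),
    "mid":  ([0.5, 0.5, 0.5, 0.5], [0.1, 0.1, 0.1, 0.1]),
    "tail": ([2.0, 2.0, 2.0, 2.0], [-0.1, -0.1, -0.1, -0.1]),
}

def role_features(h):
    """Produce one feature vector per positional role from the same token vector."""
    return {role: conditional_layer_norm(h, g, b)
            for role, (g, b) in ROLE_PARAMS.items()}

h = [0.2, -0.4, 1.0, 0.6]
feats = role_features(h)
```

All three role features share the same normalized token vector and differ only in scale and shift, so a token can be represented differently depending on whether it is treated as the start, the interior, or the end of a candidate span.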

Figure 5. NNER data generation pipeline.

Figure 6. Comparison of class-level performance of named entity recognition models.

Figure 7. Attention visualization of standard attention and RoPE-enhanced attention on a nested-entity sentence.

Table 1. Entity category meaning and examples.

Entity Category	Category Meaning	Domain Examples
Time	Related time	10 July 2024
SeaArea	Coverage area	Northern waters of the South China Sea; waters of the Yangtze River Estuary
OceanPhen	Marine natural phenomena	El Niño phenomenon; red tide outbreak
MarMetCond	Meteorological element	Level-12 wind speed; sea fog visibility
MarPollutant	Ecological pollutants	Red tide toxins; oil spill contaminants
MarMetAgency	Professional institutions	National Marine Environmental Forecasting Center (NMEFC)
MarOrganism	Biological population	Coral reef community
DisasterResp	Professionals	Maritime search and rescue personnel
MarDisEqpt	Technical equipment	Doppler radar; breakwater
PrimaryMarDis	Source disaster	Typhoon; South China Sea storm surge
SecondaryMarDis	Derived chain disasters	Coastal waterlogging triggered by typhoon

Table 2. Dataset statistical information.

Statistic	Train	Dev	Test
Number of sentences	1323	385	191
Number of words	28,091	8913	4331
Number of entities	11,872	3389	1756

Table 3. Dataset entity category statistics.

Entity Category	Quantity of Entities
MarOrganism	171
PrimaryMarDis	4161
OceanPhen	1778
SeaArea	2835
MarDisEqpt	823
Time	89
DisasterResp	371
SecondaryMarDis	3622
MarMetAgency	393
MarMetCond	1712
MarPollutant	1062

Table 4. Human validation of generated MMD-NER annotations.

Validation Aspect	Metric	Value
Validation subset	Sampled sentences	200
Validation subset	Pipeline-generated entities	1842
Validation subset	Human-gold entities	1796
Validation subset	Pipeline-generated nested pairs	326
Validation subset	Human-gold nested pairs	309
Inter-annotator agreement	Boundary agreement F1	94.86%
Inter-annotator agreement	Type-aware agreement F1	92.73%
Inter-annotator agreement	Nested-pair agreement F1	90.41%
Pipeline vs. human gold	Boundary-only F1	93.02%
Pipeline vs. human gold	Type-aware strict F1	90.01%
Pipeline vs. human gold	Nested-pair F1	86.54%
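
The three agreement measures in Table 4 differ in what counts as a match. The sketch below shows one common way to define them over sets of `(start, end, type)` tuples; the exact matching rules used for Table 4 are not spelled out here, so these definitions are assumptions, and the example spans are toy values with a hypothetical prediction, not dataset entries.

```python
def f1(pred, gold):
    """F1 between two sets of hashable items."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def boundary_f1(pred, gold):
    """Match on (start, end) only, ignoring the entity type."""
    return f1({(s, e) for s, e, _ in pred}, {(s, e) for s, e, _ in gold})

def strict_f1(pred, gold):
    """Match on (start, end, type): boundary and type must both agree."""
    return f1(pred, gold)

def nested_pairs(ents):
    """(outer, inner) pairs where the inner span lies inside a different outer span."""
    return {
        (outer, inner)
        for outer in ents
        for inner in ents
        if (outer[0], outer[1]) != (inner[0], inner[1])
        and outer[0] <= inner[0] and inner[1] <= outer[1]
    }

def nested_pair_f1(pred, gold):
    """Match on whole (outer, inner) pairs, including both types."""
    return f1(nested_pairs(pred), nested_pairs(gold))

# Toy spans with inclusive token offsets: an outer disaster-chain span and an inner source disaster.
gold = {(0, 6, "SecondaryMarDis"), (4, 6, "PrimaryMarDis")}
pred = {(0, 6, "SecondaryMarDis"), (4, 6, "OceanPhen")}  # hypothetical type error

scores = (boundary_f1(pred, gold), strict_f1(pred, gold), nested_pair_f1(pred, gold))
```

In this toy case a single type error leaves the boundary score at 1.0, halves the strict score, and drops the nested-pair score to 0.0. Under these definitions the nested-pair measure is the most demanding of the three, which is consistent with it being the lowest value in each block of Table 4.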

Table 5. Error analysis of pipeline-generated annotations against human gold.

Error Type	Description	Percentage
Boundary over-extension	The generated span is longer than the expert boundary	24.03%
Type confusion	The entity boundary is correct or partially correct, but the type is wrong	22.48%
Boundary under-extension	The generated span misses part of the expert boundary	18.60%
Missing inner entity	The inner entity in a nested structure is omitted	11.63%
Missing outer entity	The enclosing outer entity is omitted	9.30%
Spurious entity	Non-entity text is incorrectly labeled as an entity	8.53%
Invalid nesting relation	The generated inner–outer relation is inconsistent with expert judgment	5.43%

Table 6. Representative discrepancy cases between pipeline-generated annotations and human-gold annotations.

Error Type	Sentence Fragment	Pipeline-Generated Annotation	Human-Gold Annotation	Explanation
Boundary over-extension	persistent sea fog reduced visibility to less than 200 m near the Zhoushan fishing grounds	“persistent sea fog reduced visibility to less than 200 m” → MarMetCond	“persistent sea fog” → OceanPhen; “visibility to less than 200 m” → MarMetCond; “Zhoushan fishing grounds” → SeaArea	The pipeline incorrectly merged an ocean phenomenon and its meteorological condition into one over-extended condition span.
Boundary under-extension	coastal waterlogging triggered by severe typhoon In-fa	“coastal waterlogging” → SecondaryMarDis	“coastal waterlogging triggered by severe typhoon In-fa” → SecondaryMarDis; “severe typhoon In-fa” → PrimaryMarDis	The outer disaster-chain expression was truncated and the causal trigger was not included.
Type confusion	abnormal sea surface temperature in the South China Sea	“abnormal sea surface temperature” → OceanPhen	“abnormal sea surface temperature” → MarMetCond	The pipeline confused a meteorological/oceanographic condition with a marine phenomenon.
Missing inner entity	storm surge caused by Typhoon Doksuri	“storm surge caused by Typhoon Doksuri” → SecondaryMarDis	“storm surge caused by Typhoon Doksuri” → SecondaryMarDis; “Typhoon Doksuri” → PrimaryMarDis	The outer entity was identified, but the inner source-disaster entity was omitted.
Missing outer entity	red tide toxins released during a red tide outbreak	“red tide toxins” → MarPollutant; “red tide outbreak” → OceanPhen	“red tide toxins released during a red tide outbreak” → OceanPhen; “red tide toxins” → MarPollutant	The pipeline identified inner entities but failed to annotate the enclosing phenomenon-level span.
Spurious entity	emergency coordination was strengthened along the coast	“emergency coordination” → DisasterResp	None	The phrase describes an action rather than a disaster-response personnel entity.
Invalid nesting relation	floating oil contaminants drifted eastward near the Beibu Gulf after the cold wave	“Beibu Gulf” → SeaArea nested inside “floating oil contaminants” → MarPollutant	“floating oil contaminants” → MarPollutant; “Beibu Gulf” → SeaArea; “cold wave” → PrimaryMarDis	The pipeline incorrectly constructed a nesting relation between pollutant and sea area, although the two mentions are separate entities without span containment.

Table 7. Examples of controlled scenario seeds for LLM-backbone comparison.

Seed	Field	Content
Seed 1	Scenario seed	Typhoon-induced coastal hazard
	Target domain	Marine meteorological disaster
	Expected semantic dimensions	Disaster event; meteorological condition; sea area; impact chain
	Target complexity	At least one nested entity pair
	Suggested nesting difficulty	Medium
	Constraint	Do not predefine entity spans or entity labels; the LLM must generate attributes, entity pool, nested sentence, and verified annotations through the four-step pipeline.
Seed 2	Scenario seed	Red tide ecological impact
	Target domain	Marine ecological disaster
	Expected semantic dimensions	Ocean phenomenon; pollutant; marine organism; affected sea area
	Target complexity	At least one nested entity pair
	Suggested nesting difficulty	Medium
	Constraint	Do not predefine entity spans or entity labels; the LLM must generate attributes, entity pool, nested sentence, and verified annotations through the four-step pipeline.

Table 8. Pilot comparison of generated data quality across LLM backbones.

LLM	Valid Sample Rate	Requirement Satisfaction	Nesting Validity	Boundary Agreement F1	Type-Aware Agreement F1	Nested-Pair Agreement F1
GPT-4o mini	90.00%	88.00%	86.00%	92.41%	89.76%	85.31%
Qwen2.5-72B-Instruct	86.00%	84.00%	82.00%	90.47%	87.32%	82.68%
DeepSeek-V3	88.00%	86.00%	84.00%	91.16%	88.05%	83.74%

Table 9. Nested entity recognition performance comparison (micro-average). The best results in each column are shown in bold.

Model	Precision (%)	Recall (%)	F1-Score (%)
BERT-CRF	55.65	52.63	54.06
BiLSTM-CRF	57.44	19.64	29.28
BERT-BiGRU-Att-CRF	83.41	80.96	82.17
Pyramid	84.70	87.70	86.20
Global Pointer	90.52	91.42	90.97
Biaffine	89.57	89.94	89.86
BiFlaG	87.27	85.00	86.12
PANNER	89.17	90.46	89.87
PRSpan	93.92	95.00	94.58

Table 10. Nested entity recognition performance comparison (macro-average). The best results in each column are shown in bold.

Model	Precision (%)	Recall (%)	F1-Score (%)
BERT-CRF	56.48	48.83	48.46
BiLSTM-CRF	32.81	11.36	16.17
BERT-BiGRU-Att-CRF	80.73	78.94	79.82
Pyramid	86.19	88.99	85.86
Global Pointer	89.13	89.95	89.54
Biaffine	88.52	89.23	88.57
BiFlaG	86.41	86.77	86.18
PANNER	88.05	89.73	88.16
PRSpan	94.37	92.49	93.47
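
Tables 9 and 10 report the same predictions under two averaging schemes. As a hedged illustration of the difference, the sketch below computes micro-F1 from pooled counts and macro-F1 as the unweighted mean of per-category F1 scores, which is a common convention and is assumed here. The per-category counts are invented for the example.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, with zero-division guards."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def micro_macro_f1(counts):
    """`counts` maps category -> (tp, fp, fn)."""
    # Micro: pool the counts over all categories, then compute F1 once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    return micro, macro

# Made-up counts: one frequent category handled well, one rare category handled poorly.
counts = {"PrimaryMarDis": (90, 10, 10), "Time": (2, 2, 6)}
micro, macro = micro_macro_f1(counts)
```

With these counts the micro score stays near 0.87 while the macro score falls to about 0.62, because the rare category weighs as much as the frequent one in the macro average. Macro-F1 is therefore the more sensitive indicator for low-frequency categories such as Time, which has only 89 instances in Table 3.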

Table 11. Extended ablation study on CLN configurations and Positional Role Pooling. The best results in each column are shown in bold.

Variant	Span Representation	Precision (%)	Recall (%)	Micro-F1 (%)	Macro-F1 (%)
w/o CLN	Base (start) + Base (mid) + Base (end)	93.00	92.10	92.54	91.20
Mid-only CLN	Mid (start) + Mid (mid) + Mid (end)	92.84	92.61	92.72	91.64
Boundary-only CLN	Head (start) + Base (mid) + Tail (end)	93.46	93.58	93.52	92.36
Full CLN + AvgPool	AvgPool (Head, Mid, Tail) over span	93.55	93.41	93.48	92.34
Full CLN + MaxPool	MaxPool (Head, Mid, Tail) over span	93.42	93.25	93.33	92.18
Full CLN	Head (start) + Mid (mid) + Tail (end)	93.92	95.00	94.58	93.47
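
The "Head (start) + Mid (mid) + Tail (end)" span representation in the last row of Table 11 can be sketched as below. Several details are assumptions made for illustration: "+" is read as concatenation, the interior Mid features are mean-pooled, and a zero vector is used when a span has no interior tokens. The function name `span_representation` is hypothetical.

```python
def span_representation(head, mid, tail, start, end):
    """Build a span vector from role-specific token features.

    head, mid, tail: lists of per-token feature vectors (one list per role).
    start, end: inclusive token offsets of the candidate span.
    """
    dim = len(mid[0])
    interior = mid[start + 1:end]  # tokens strictly between the boundaries
    if interior:
        mid_vec = [sum(col) / len(interior) for col in zip(*interior)]
    else:
        mid_vec = [0.0] * dim  # assumed fallback for spans of one or two tokens
    # Assumed reading of "+": concatenate Head(start), pooled Mid, Tail(end).
    return head[start] + mid_vec + tail[end]

# Toy role features for a 4-token sentence (made-up numbers).
head = [[1, 0], [2, 0], [3, 0], [4, 0]]
mid = [[0, 1], [0, 2], [0, 3], [0, 4]]
tail = [[5, 5], [6, 6], [7, 7], [8, 8]]

long_span = span_representation(head, mid, tail, 0, 3)
single_token = span_representation(head, mid, tail, 1, 1)
```

Compared with the AvgPool and MaxPool rows of Table 11, which pool all three roles over the whole span, this layout keeps the start and end tokens in fixed slots, which matches the table's finding that role-aligned composition outperforms uniform pooling.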

Table 12. Practical case study of CLN-generated role-specific features.

Sentence Fragment	Gold Annotation	Variant	Prediction	Interpretation
red tide toxins released during a red tide outbreak	“red tide toxins” → MarPollutant; “red tide outbreak” → OceanPhen; outer nested span → OceanPhen	w/o CLN	Only identifies “red tide outbreak”	The model lacks role-specific span representation and misses the pollutant-related nested structure.
		Head + Tail	Identifies the outer span but misclassifies it as MarPollutant	Boundary cues are captured, but internal semantic composition is insufficient.
		Mid only	Identifies “toxins released during a red tide” with imprecise boundary	Internal semantics are partially captured, but explicit start/end boundary roles are weakened.
		Full CLN	Correctly identifies both inner entities and the outer nested span	Head and Tail features support boundary localization, while Mid features preserve internal semantic coherence.

Table 13. Performance comparison by nested structure type on the MMD-NER test set.

Nested Structure Type	Definition	Pair Count	Global Pointer F1 (%)	PRSpan F1 (%)
Strict containment	$s_{a} < s_{b} \leq e_{b} < e_{a}$	121	91.28	95.21
Head-nested	$s_{a} = s_{b}, e_{b} < e_{a}$	76	89.86	94.37
Tail-nested	$s_{a} < s_{b}, e_{b} = e_{a}$	67	88.97	93.82
Crossing	$s_{a} < s_{b} < e_{a} < e_{b}$	34	85.62	90.94
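
The four definitions in Table 13 translate directly into comparisons on span offsets. The sketch below is an illustrative helper, not part of the paper's code: `nesting_type` is a hypothetical name, spans are `(start, end)` tuples with inclusive offsets, and the two spans are first ordered so that span a is the one that starts earlier (or the longer one when the starts coincide).

```python
def nesting_type(span_a, span_b):
    """Classify the structural relation between two spans using the Table 13 definitions.

    Returns None when the spans are identical, disjoint, or otherwise outside the four types.
    """
    # Order the spans so that a starts first; on equal starts, a is the longer one.
    a, b = sorted([span_a, span_b], key=lambda s: (s[0], -s[1]))
    (sa, ea), (sb, eb) = a, b
    if sa < sb <= eb < ea:
        return "strict containment"
    if sa == sb and eb < ea:
        return "head-nested"
    if sa < sb and eb == ea:
        return "tail-nested"
    if sa < sb < ea < eb:
        return "crossing"
    return None

# "coastal waterlogging triggered by severe typhoon In-fa" (tokens 0..6, whitespace tokenization assumed):
# outer span 0..6 and inner span "severe typhoon In-fa" at 4..6 share the same end.
example = nesting_type((0, 6), (4, 6))
```

For this fragment from Table 6 the helper returns "tail-nested", since the inner source-disaster mention ends on the same token as the enclosing disaster-chain expression.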

Table 14. Performance comparison by nesting depth on the MMD-NER test set.

Nesting Depth	Description	Entity Count	Global Pointer F1 (%)	PRSpan F1 (%)
Depth 1	Non-nested or single-level entity	1264	91.67	95.53
Depth 2	Entity involved in two-level nesting	402	88.95	93.88
Depth ≥ 3	Entity involved in three or more nested levels	90	84.31	90.76

Table 15. Performance comparison by entity length on the MMD-NER test set.

Entity Length	Entity Count	Global Pointer F1 (%)	PRSpan F1 (%)
1–2 tokens	684	92.14	96.18
3–5 tokens	712	90.85	95.02
6–10 tokens	276	88.12	93.11
>10 tokens	84	83.97	89.74

Table 16. Statistics of the target disaster NER dataset used for cross-domain transfer evaluation.

Dataset	Domain	Language	Source	Entity Schema	Nested Annotations	Used Samples
MMD-NER	Marine meteorological disasters	English	This study	11 fine-grained marine disaster-chain entity types	Yes	1899 sentences
Disaster-specific NER	Disaster-related news	English	Hafsa et al. (2025) [37]	14 crisis-specific entity types	No	1000 sentences

Table 17. Cross-domain transfer results on the external disaster-specific NER dataset. The best results in each column are shown in bold.

Model/Setting	Precision (%)	Recall (%)	Micro-F1 (%)	Macro-F1 (%)
BERT-CRF Target-only	85.96	84.73	85.34	82.71
Global Pointer Target-only	87.18	85.94	86.55	83.96
PRSpan Target-only	87.86	86.72	87.29	84.61
PRSpan MMD-NER → Disaster-NER	88.74	87.96	88.35	85.52

Table 18. Sampled comparison between PRSpan and LLM prompting baselines on MMD-NER.

Method	Setting	Sample Size	Micro-F1 (%)	Boundary Agreement F1 (%)	Nested-Pair Agreement F1 (%)	Inference Efficiency	Cost-Effectiveness
GPT-4o mini	Few-shot prompting	50	78.42	82.35	70.18	Low	Medium
Qwen3	Few-shot prompting	50	75.86	79.64	66.72	Low	Medium/Low
PRSpan	Supervised model	50	94.31	96.08	91.46	High	High after training

Table 19. Qualitative analysis of representative PRSpan failure modes.

Failure Mode	Sentence Fragment	Gold Annotation	PRSpan Prediction	Analysis
Low-frequency temporal expression	from late July to early August, offshore warnings were issued repeatedly	“late July to early August” → Time	Missed Time	Temporal expressions are sparse in MMD-NER and appear in diverse surface forms, making them harder to learn than high-frequency disaster entities.
Semantic confusion between related entity types	abnormally high sea surface temperature intensified the algal bloom	“abnormally high sea surface temperature” → MarMetCond; “algal bloom” → OceanPhen	“abnormally high sea surface temperature” → OceanPhen; “algal bloom” → OceanPhen	Both categories describe abnormal marine states, so the model may confuse background conditions with observable ocean phenomena.
Incomplete recognition of long outer entity	coastal flooding caused by prolonged storm surge along low-lying islands	“coastal flooding caused by prolonged storm surge” → SecondaryMarDis; “storm surge” → OceanPhen	“coastal flooding” → SecondaryMarDis; “storm surge” → OceanPhen	The inner entity is correctly recognized, but the long outer disaster-chain expression is shortened because the causal modifier increases span complexity.
Error in coordinated overlapping structure	strong waves and coastal erosion affected the harbor entrance during the gale	“strong waves” → OceanPhen; “coastal erosion” → SecondaryMarDis; “gale” → MarMetCond	“strong waves and coastal erosion” → OceanPhen; “gale” → MarMetCond	Coordinated expressions are challenging because adjacent entities are semantically related but should not always be merged into one span.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
