Article

Breaking the Speed–Accuracy Trade-Off: A Novel Embedding-Based Framework with Coarse Screening-Refined Verification for Zero-Shot Named Entity Recognition

1
School of Artificial Intelligence, China University of Mining and Technology, Beijing 100083, China
2
Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Computers 2026, 15(1), 36; https://doi.org/10.3390/computers15010036
Submission received: 16 December 2025 / Revised: 2 January 2026 / Accepted: 5 January 2026 / Published: 7 January 2026

Abstract

Although fine-tuning pretrained language models has brought remarkable progress to zero-shot named entity recognition (NER), current generative approaches still suffer from inherent limitations. Their autoregressive decoding mechanism requires token-by-token generation, resulting in low inference efficiency, while the massive parameter scale leads to high computational and deployment costs. In contrast, span-based methods avoid autoregressive decoding but often face large candidate spaces and severe noise redundancy, which hinder efficient entity localization in long-text scenarios. To overcome these challenges, we propose an efficient Embedding-based NER framework that achieves an optimal balance between performance and efficiency. Specifically, the framework first introduces a lightweight dynamic feature matching module for coarse-grained entity localization, enabling rapid filtering of potential entity regions. Then, a hierarchical progressive entity filtering mechanism is applied for fine-grained recognition and noise suppression. Experimental results demonstrate that the proposed model, which is trained on a single RTX 5090 GPU for only 24 h, attains approximately 90% of the performance of the SOTA GNER-T5 11B model while using only one-seventh of its parameters. Moreover, by eliminating the redundancy of autoregressive decoding, the proposed framework achieves a 17× faster inference speed compared to GNER-T5 11B and significantly surpasses traditional span-based approaches in efficiency.

1. Introduction

Named Entity Recognition (NER), as a fundamental task in Natural Language Processing (NLP), plays an indispensable role in advanced applications such as text information extraction, question-answering systems, sentiment analysis, and machine translation. Its core objective is to automatically identify and extract entities with specific semantic features from text, such as person names, locations, organization names, and domain-specific entities [1]. Traditional NER models face significant limitations in recognizing entities from new domains or out-of-distribution (OOD) data, as expanding to new domains or identifying unseen entity types requires retraining with large-scale high-quality annotated datasets. The construction of such datasets typically requires substantial human and time resources, particularly in highly specialized domains such as medicine and law [2,3]. Additionally, traditional methods are strictly constrained by predefined entity category frameworks, with their classification systems being rigid and lacking dynamic adjustment capabilities, making it difficult to flexibly adapt to the emergence of new entity types or changes in the semantic boundaries of existing categories in real-world scenarios [4]. To address the dual challenges of scarce annotated data and dynamic entity category expansion, Zero-Shot named entity recognition (Zero-Shot NER) technology has emerged. Its core objective is to directly recognize open-domain or previously unseen entity categories through semantic reasoning and knowledge transfer, without relying on annotated data in the target domain.
In recent years, the development of large language models (LLMs), such as GPT and LLaMa [5,6], has brought transformative breakthroughs to the field of natural language processing. Through techniques like prompt engineering [7] or fine-tuning [8], researchers have achieved remarkable results in Zero-Shot learning scenarios for NER tasks [9]. These approaches effectively address the challenges of limited annotation resources and difficulties in entity category expansion. However, they still face two critical challenges in practical applications: (1) The training process requires exorbitant computational costs, relying on large-scale GPU clusters for computational support [10], while the massive parameter scales substantially increase storage and computational resource consumption. Furthermore, the extreme parameter size elevates deployment barriers, hindering model adaptation to resource-constrained environments such as edge devices and embedded systems [11]. (2) The inherent autoregressive generation mechanism necessitates iterative token-by-token processing. In long-sequence generation scenarios, this sequential processing causes latency to accumulate with every generated token, quickly becoming prohibitive [12,13]. Achieving the optimal balance between model performance and resource efficiency while improving inference speed remains a critical technical bottleneck to overcome.
Meanwhile, span-based NER methods have also drawn wide attention. Such methods enumerate all possible contiguous spans in text and classify them, thereby performing entity detection and recognition, and can naturally handle nested entities and open-category entities [14,15]. However, these methods remain inefficient in real-world scenarios, as the candidate span space grows rapidly with text length, leading to prohibitively high computational and memory costs when handling long documents or large span windows [16,17]. In addition, extreme class imbalance between positive and negative spans causes training to be susceptible to noise, and insufficient boundary precision may lead to entity truncation or redundancy, especially in domain-specific texts [15,18]. Finally, processing long texts and multi-layer nested entities further increases the complexity of training and inference, reducing efficiency and generalization [17,19]. Consequently, neither autoregressive LLMs nor enumeration-based span approaches are capable of simultaneously balancing performance, computational efficiency, and generalization ability in zero-shot NER settings [20].
In this paper, we propose CSRVNER, an Embedding-based universal named entity recognition framework based on a non-autoregressive architecture. Inspired by cognitive process modeling, this framework simulates the semantic understanding mechanism of human annotators by decomposing the NER task into three stages: “Reading → Annotation → Review”. The main contributions of this work are summarized as follows:
  • We propose an encoding-based NER framework that simulates human cognitive processes through a Hierarchical Progressive Entity Filtering (HPEF) mechanism for open-type entity recognition. Experimental results show that the framework significantly improves recognition accuracy while maintaining highly efficient inference performance.
  • To address the challenge of effective entity localization, we design a Dynamic Feature-to-Entity Mapping (DFEM) module. DFEM integrates entity description semantics with contextual text features through cross-attention and adaptive feature fusion, enabling dynamic modeling of contextual dependencies and improving semantic alignment. The resulting representations offer high-quality contextual semantics for the subsequent entity filtering stage.
  • Building upon DFEM, we further introduce a Dynamic Span Feature Encoder (DSFE) as a refinement stage to enhance candidate entity representations. DSFE utilizes a multi-layer cross-attention framework for fine-grained semantic verification, enabling it to suppress low-confidence predictions and strengthen entity boundary coherence.

2. Related Work

2.1. Traditional NER Approaches

2.1.1. Statistical Machine Learning Methods

Sequence labeling models centered on Conditional Random Fields (CRF) [21,22] and Support Vector Machines (SVM) [23], constructed through manual feature engineering (e.g., part-of-speech tags, morphological features), achieved notable progress on benchmark datasets like CoNLL-2003 [24,25,26,27,28]. These approaches faced dual constraints: heavy reliance on large-scale annotated data and difficulties in transferring handcrafted features across domains.

2.1.2. Deep Learning Methods

Neural architectures based on BiLSTM-CRF [29,30] and Transformer [31,32] automatically learned contextual semantic features through distributed representations. Pre-trained language models like BERT [33,34,35,36] further achieved performance breakthroughs on complex corpora such as OntoNotes 5.0 [37]. A fundamental limitation persists: dependency on predefined entity type schemas impedes dynamic adaptation to emerging entity types in open-domain scenarios. Both paradigms share the core characteristic of modeling constrained entity type spaces through either manual feature engineering or neural architectures, remaining bound by closed entity-type systems.

2.2. Span-Based NER Approaches

Unlike traditional token-level sequence labeling frameworks, span-based methods regard Named Entity Recognition as a span classification problem, where all possible contiguous token spans are enumerated and classified into entity categories or non-entity types. This paradigm effectively handles nested and overlapping entities that are challenging for sequence labeling models. Early span-based models [38,39,40] introduced exhaustive span enumeration followed by feed-forward classification, yet suffered from high computational cost due to the quadratic number of candidate spans. To mitigate this, several studies proposed span pruning and boundary refinement mechanisms [41,42,43], achieving significant improvements in efficiency and precision.
Recent advances further integrated span-based modeling with pre-trained encoders such as BERT and RoBERTa [44,45,46], leveraging contextualized embeddings to enhance span boundary sensitivity. Moreover, adaptive span selection techniques [47,48] and multi-stage filtering strategies have been developed to reduce negative span imbalance and improve entity boundary detection. Despite these improvements, span-based NER approaches still face limitations: the combinatorial explosion of candidate spans in long texts, imbalance between positive and negative samples, and challenges in adapting to unseen entity categories in open-domain scenarios. These constraints motivate the exploration of more efficient and flexible zero-shot span frameworks capable of dynamically aligning entity semantics without relying on predefined schemas.

2.3. Zero-Shot NER

Large language models (LLMs) introduced paradigm-shifting advancements in NER [49,50]. Recent studies demonstrate that LLMs exhibit remarkable Zero-Shot transfer capabilities through instruction tuning and knowledge distillation. InstructUIE [51] validated model generalizability across 14 information extraction datasets via a multi-task instruction framework, while UniversalNER [8] enhanced cross-domain adaptability through conversational training paradigms. GoLLIE [9] innovatively integrated code-style instructions to improve structured output capabilities. Furthermore, GNER [52] introduced negative samples during training and proposed an efficient Longest Common Subsequence (LCS) matching algorithm to further optimize Zero-Shot performance. Despite these advancements, LLMs face three persistent challenges in NER applications: (1) high computational costs during fine-tuning, (2) inference latency impacting real-time deployment, and (3) model compression difficulties in low-resource scenarios.

3. Method

This chapter systematically presents the proposed CSRVNER framework and its core modules, together with the training strategy and loss functions. As illustrated in Figure 1, the CSRVNER framework is designed as an efficient embedding-based system that operates in three stages: Read → Annotation → Review. The model takes the target text and entity description as inputs. In the Read stage, we employ Qwen2.5-1.5B as the backbone for feature extraction.
Specifically, we utilize the model’s inherent causal masking mechanism to extract high-dimensional hidden states via a single forward pass, without performing autoregressive token generation.
Subsequently, in the Annotation stage, the Dynamic Feature-to-Entity Mapping (DFEM) module performs cross-attention interactions between the text and the entity description to produce candidate entities and dynamic feature representations. Finally, in the Review stage, the Dynamic Span Feature Encoder (DSFE) refines and verifies candidate entities through fine-grained semantic validation. Together, DFEM and DSFE form the Hierarchical Progressive Entity Filtering (HPEF) mechanism, enabling progressive entity recognition from coarse detection to fine verification.

3.1. Hierarchical Progressive Entity Filtering

To improve both efficiency and accuracy in entity recognition, we propose a hierarchical progressive entity filtering mechanism. This mechanism consists of two layers: the first layer (DFEM) locates potential entity spans, and the second layer (DSFE) evaluates and filters these candidates to retain only high-quality entities.
Formally, given an input text sequence T, the mechanism can be expressed as a two-step process:
$$C^{(1)} = \mathrm{DFEM}(T)$$
$$\hat{C} = \mathrm{DSFE}\left(C^{(1)}\right)$$
where $C^{(1)}$ denotes the candidate entity set produced by DFEM through entity span localization, and $\hat{C}$ represents the final refined candidate entity set obtained by DSFE through semantic-based filtering.
Overall, the proposed mechanism operates under a coarse-to-fine, hierarchical filtering framework. Specifically, DFEM performs coarse-grained span detection to generate potential entity regions, while DSFE conducts fine-grained filtering to remove noisy or redundant spans, leading to significant gains in both precision and computational efficiency.
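As an illustration only, the coarse-to-fine control flow can be sketched in a few lines of Python. Here `dfem_locate` and `dsfe_verify` are hypothetical stand-ins (a capitalization heuristic and a length filter), not the learned DFEM/DSFE modules:

```python
def dfem_locate(text):
    """Stage 1 stand-in: coarse span localization.

    Placeholder heuristic: treat capitalized words as candidate entity spans,
    returned as (start, end) character offsets.
    """
    spans, start = [], 0
    for token in text.split():
        end = start + len(token)
        if token[0].isupper():
            spans.append((start, end))
        start = end + 1  # account for the separating space
    return spans

def dsfe_verify(text, candidates, min_len=3):
    """Stage 2 stand-in: fine-grained filtering (here: drop very short spans)."""
    return [(s, e) for (s, e) in candidates if e - s >= min_len]

def hpef(text):
    """Coarse-to-fine pipeline: C^(1) = DFEM(T), then C-hat = DSFE(C^(1))."""
    c1 = dfem_locate(text)          # coarse candidate set C^(1)
    c_hat = dsfe_verify(text, c1)   # refined candidate set C-hat
    return c_hat
```

The point of the structure is that the expensive fine-grained check runs only on the small candidate set produced by the cheap first pass, rather than on every possible span.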

3.2. Dynamic Feature-to-Entity Mapping

The Dynamic Feature-to-Entity Mapping module dynamically aligns text representations with entity descriptions through an iterative cross-attention refinement mechanism. Given the input text embedding matrix $A \in \mathbb{R}^{n \times d_{\text{hid}}}$ and the entity description matrix $B \in \mathbb{R}^{m \times d_{\text{hid}}}$, the module operates as follows:

3.2.1. Iterative Attention Refinement

The representation is refined through T layers of multi-head attention:
$$A^{(t)} = \mathrm{LayerNorm}\left(A^{(t-1)} + \mathrm{MHA}(A^{(t-1)}, B)\right)$$
where $A^{(0)}$ is the initial text embedding and $B$ contains fixed entity patterns.

3.2.2. Attention Computation

Each attention head computes token–description alignment:
$$\mathrm{Attn}_h = \mathrm{softmax}\!\left(\frac{A W_Q^h \left(B W_K^h\right)^{\top}}{\sqrt{d_h}}\right) B W_V^h$$
with projection matrices $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{d_{\text{hid}} \times d_h}$, where $d_h = d_{\text{hid}} / n_{\text{head}}$. The multi-head output concatenates all heads:
$$\mathrm{MHA}(A, B) = \left[\mathrm{Attn}_1 \,\|\, \cdots \,\|\, \mathrm{Attn}_{n_{\text{head}}}\right] W_O$$

3.2.3. Classification

After $T$ refinement iterations, the final representation $A^{(T)}$ (DynaEmbed) is fed into a token-wise classifier:
$$P(y_i) = \mathrm{softmax}\left(\mathrm{MLP}\left(a_i^{(T)}\right)\right)$$
where $a_i^{(T)}$ denotes the refined embedding of the $i$-th token in $A^{(T)}$, and the MLP projects the embedding to entity-type logits. The model predicts entity labels by matching token representations with descriptions in $B$, enabling cross-domain generalization through description conditioning.

3.3. Dynamic Span Feature Encoder

To address the issue of false positives generated in the first stage, we introduce the Dynamic Span Feature Encoder (DSFE) as a second-stage Review Module. This module operates at the span level rather than the token level. It takes the candidate spans generated by the Annotation module and performs a fine-grained classification to determine the final entity type.

3.3.1. Candidate Span Generation via Greedy Strategy

The input to the DSFE relies on the output of the first-stage Annotation module. We adopt a greedy strategy to form candidate spans based on the discrete classification results: contiguous sequences of tokens classified as entities are concatenated to form the candidate span set $S = \{S_1, S_2, \ldots, S_K\}$, where each span $S_k$ corresponds to a start–end interval $[s_k, e_k]$. This strategy ensures that the DSFE focuses solely on plausible entity segments suggested during the first stage.
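The greedy grouping step can be written directly. A minimal sketch, assuming binary per-token decisions from the Annotation stage (1 = entity, 0 = non-entity):

```python
def greedy_spans(token_labels):
    """Concatenate runs of entity-tagged tokens into candidate spans.

    token_labels: per-token binary decisions from the Annotation stage.
    Returns inclusive (s_k, e_k) token-index pairs, one per run.
    """
    spans, start = [], None
    for i, lab in enumerate(token_labels):
        if lab == 1 and start is None:
            start = i                      # a new entity run begins
        elif lab == 0 and start is not None:
            spans.append((start, i - 1))   # close the current run
            start = None
    if start is not None:                  # run extends to the last token
        spans.append((start, len(token_labels) - 1))
    return spans
```

For example, `greedy_spans([0, 1, 1, 0, 1])` yields `[(1, 2), (4, 4)]`: two candidate spans for the DSFE to review.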

3.3.2. Span Representation and Fusion

For each candidate span $S_k$, we extract its dynamic feature sequence $H[s_k : e_k]$ from the shared encoder. To capture the interaction between the span content and specific entity types, we employ an $n$-layer cross-attention network:
$$Z_k^{(l)} = \mathrm{CrossAttn}\left(Q = Z_k^{(l-1)},\; K = E,\; V = E\right), \quad l = 1, \ldots, n,$$
where $Z_k^{(0)} = H[s_k : e_k]$, and $E$ represents the learnable embeddings of entity types. This mechanism aligns span-specific signals with semantic category features.

3.3.3. Span-Level Classification

Unlike the first stage, which predicts per-token labels, the DSFE aggregates the fused features of the entire span into a single vector representation via average pooling:
$$v_k = \mathrm{AvgPool}\left(Z_k^{(n)}\right), \qquad p_k = \mathrm{softmax}(W_c v_k + b_c)$$
where $p_k \in \mathbb{R}^{C+1}$ represents the probability distribution of span $S_k$ over entity categories (including a “None” class for invalid spans). The final decision depends entirely on this review result.
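A minimal NumPy sketch of this pooling-and-classification step; the weight shapes stand in for the learned $W_c$ and $b_c$:

```python
import numpy as np

def classify_span(Z_k, W_c, b_c):
    """Average-pool fused span features, then project to C+1 class logits.

    Z_k: (span_len, d) fused features for one span; W_c: (C+1, d); b_c: (C+1,).
    Returns p_k, a probability distribution including the "None" class.
    """
    v_k = Z_k.mean(axis=0)            # AvgPool over the span's tokens
    logits = W_c @ v_k + b_c
    e = np.exp(logits - logits.max()) # stable softmax
    return e / e.sum()
```

With all-zero weights the logits are uniform, so each of the $C+1$ classes receives equal probability, which is a convenient sanity check on the shapes.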

3.4. Training Objectives

This paper employs a two-phase training strategy to sequentially optimize token-level recall and span-level precision; during this process, Qwen remains frozen as the feature extraction model, and only the parameters of the DFEM and DSFE modules as well as the classifier are optimized.

3.4.1. Phase 1: Annotation Module (Token-Level Learning)

In the first phase, we train the Annotation module to identify potential entity boundaries. This is a token-level task utilizing a weighted cross-entropy loss:
$$\mathcal{L}_{\text{token}}^{(1)} = -\frac{1}{N} \sum_{i=1}^{N} w_i \cdot y_i \log(\hat{y}_i)$$
where $N$ is the total number of tokens in the sequence, $y_i$ is the ground-truth label for token $i$, and the weights $w_i$ are set higher (e.g., 25) for positive tokens to improve recall.

3.4.2. Phase 2: Review Module (Span-Level Learning)

In the second phase, we freeze the parameters of the Annotation module and train the DSFE Review module. Crucially, the optimization target shifts from tokens to spans. Based on the candidate spans S generated by the greedy strategy in Phase 1, we minimize the span classification loss as follows:
$$\mathcal{L}_{\text{span}}^{(2)} = -\frac{1}{K} \sum_{k=1}^{K} w_k \cdot Y_k \log(P_k)$$
where
  • $K$ is the number of candidate spans generated by the greedy strategy.
  • $Y_k$ is the ground-truth label for the $k$-th span. We assign a positive entity label to $S_k$ if its Intersection over Union (IoU) with a ground-truth entity exceeds 0.5; otherwise, it is labeled as “None” (invalid).
  • $P_k$ is the model’s predicted probability that span $S_k$ belongs to category $Y_k$.
  • $w_k$ is a weight coefficient assigned to positive entity spans to address the imbalance between valid entities and false-positive spans.
This formulation explicitly treats the review process as a span classification problem, ensuring high-quality entity recognition.
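The IoU-based labeling rule above can be sketched as follows, using inclusive token indices; the `gold_entities` structure pairing each gold span with its type is our naming, for illustration:

```python
def span_iou(a, b):
    """IoU between two inclusive token spans given as (start, end) pairs."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

def assign_label(candidate, gold_entities, threshold=0.5):
    """Label a candidate span with the best-overlapping gold type.

    gold_entities: list of ((start, end), entity_type) pairs.
    The label is kept only if the best IoU strictly exceeds the threshold;
    otherwise the span is marked "None" (invalid), as in Phase 2 training.
    """
    best_type, best_iou = "None", 0.0
    for span, etype in gold_entities:
        iou = span_iou(candidate, span)
        if iou > best_iou:
            best_type, best_iou = etype, iou
    return best_type if best_iou > threshold else "None"
```

Note that a one-token boundary shift on a three-token entity gives an IoU of exactly 0.5, which under the strict "exceeds 0.5" rule is labeled "None"; this is what pushes the DSFE toward precise boundaries.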

4. Experiment Settings

4.1. Training Data

We employ the Pile-NER dataset released by [8], which comprises 44,889 high-quality text entries containing 240,000 annotated entity instances and covering 13,000 distinct entity types, ensuring the training data’s advantages in both domain coverage and entity diversity. To facilitate the zero-shot implementation of our framework, we utilized GPT-4o to generate structured descriptions for the diverse entity types. The process involved designing tailored prompts to capture core characteristics (e.g., specific attributes for the event entity). The generated descriptions were subsequently calibrated through a post-processing step that strictly focused on ensuring linguistic fluency and definition generality. Further details regarding the prompt designs are provided in Appendix A.

4.2. Evaluation

4.2.1. Datasets

We primarily evaluate our model under a Zero-Shot setting following established protocols from prior studies [8,51]. Formally, we define this setting as Zero-Shot Cross-Dataset Transfer: while the model benefits from broad semantic supervision during training, the target datasets are strictly unseen, and no fine-tuning or few-shot examples are provided during inference. This setting evaluates the model’s capability to handle domain shifts and generalize to new data distributions. Our evaluation framework comprises two benchmarks: the Cross-Domain NER Benchmark (Table 1), which integrates 5 multi-domain datasets from CrossNER [4], specifically designed to assess out-of-domain generalization capabilities of NER models; and the Multi-Domain NER Benchmark (Table 2), covering 15 classical datasets across diverse domains.

4.2.2. Baselines

To validate the effectiveness of our proposed framework, we compare CSRVNER with recent leading methods in open-domain NER, establishing a comprehensive benchmark for performance evaluation. We first evaluate prompting-based chat models, including ChatGPT and Vicuna [53], which adopt the prompting strategy proposed by [54]; their performance metrics are presented as reported by [8]. Additionally, we compare with the following large language models (LLMs) specifically fine-tuned for NER: InstructUIE [51], built upon the FlanT5-11B architecture and fine-tuned across multiple NER datasets; UniNER [8], which leverages a LLaMa model fine-tuned on ChatGPT-generated synthetic data; and GoLLIE [9], based on CodeLLama and enhanced through guideline-aware fine-tuning to improve generalization on unseen information extraction tasks. Finally, we include comparisons with USM [55] and GLiNER [10], both employing compact architectures with reduced parameter sizes but differing in structural design.

4.2.3. Metrics

Evaluation follows the standard exact-match protocol for NER, in which F1-scores are computed by requiring complete agreement between predicted and annotated entities with respect to both span boundaries and entity categories.
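A compact sketch of this protocol, representing each entity as a (start, end, type) triple; this computes micro-F1 over one example, whereas the full evaluation aggregates counts over the corpus:

```python
def exact_match_f1(pred, gold):
    """F1 under the exact-match protocol.

    A prediction counts as a true positive only when both its span
    boundaries and its entity type exactly match an annotation.
    pred, gold: iterables of (start, end, type) triples.
    """
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)       # exact span + type agreement
    if tp == 0:
        return 0.0
    p = tp / len(pred_set)              # precision
    r = tp / len(gold_set)              # recall
    return 2 * p * r / (p + r)
```

Under this metric a prediction with the right type but a one-token boundary error scores zero, which is why boundary errors dominate the error analysis in Section 5.4.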

5. Results and Analysis

5.1. Zero-Shot Performance

5.1.1. OOD NER Benchmark

We first evaluate our model on the out-of-domain (OOD) benchmark, as summarized in Table 1. The comparative results against various baseline models demonstrate the superior performance of our approach. Specifically, CSRVNER surpasses general-purpose language models such as ChatGPT and Vicuna, and further outperforms the 11B InstructUIE model, which is instruction-tuned specifically for NER tasks.

5.1.2. 15 NER Benchmark

Table 2 presents the comparative results on 15 diverse NER datasets, evaluated against ChatGPT, UniNER, and GLiNER. Consistent with the OOD benchmark results, ChatGPT performs substantially worse than fine-tuned models, lagging behind UniNER. CSRVNER achieves state-of-the-art performance on 8 datasets, outperforming GLiNER by an average margin of 2 percentage points, which underscores its robust cross-domain generalization and adaptability.

5.2. Inference Speed

Figure 2 shows that CSRVNER (1.5B) achieves an outstanding balance between performance and efficiency in zero-shot NER tasks. Although its parameter size is only one-seventh that of GNER-T5-xxl (11B), CSRVNER attains approximately 90% of its performance while achieving a 17× faster inference speed. This clearly demonstrates that CSRVNER greatly enhances computational and deployment efficiency while maintaining high accuracy.
To further evaluate the model’s efficiency in long-text scenarios, we conducted a comparative experiment against the span-based baseline, W2NER. Note that GLiNER was excluded from this comparison as it does not support the extraction of excessively long texts. We measured Latency, Throughput, and GPU Memory usage (VRAM) across input sequence lengths of 1000, 2000, and 3000 tokens. As shown in Table 3, while W2NER exhibits lower latency on shorter sequences (1000 tokens), its computational cost increases drastically as the sequence length grows. Specifically, at a length of 3000 tokens, W2NER’s VRAM consumption surges to 27.94 GB with a significant drop in throughput to 567.40 tok/s, likely due to the quadratic complexity inherent in its grid-tagging architecture. In contrast, our CSRVNER demonstrates superior scalability. Even at 3000 tokens, CSRVNER maintains a stable throughput of over 3000 tok/s and consumes only 9.15 GB of VRAM. These results confirm that CSRVNER is significantly more efficient and suitable for processing long documents compared to traditional span-based approaches.

5.3. Ablation Studies

This study systematically validates the effectiveness of core model components through ablation experiments, with experimental designs detailed as follows.

5.3.1. Effectiveness Evaluation of DFEM

To further demonstrate the contribution of DFEM, we design a controlled ablation study comparing two configurations of the model: (1) direct concatenation of entity descriptions with input texts followed by traditional self-attention for entity localization (SA-Concat); and (2) the dual-stream interactive architecture adopted in DFEM, which incorporates a cross-attention mechanism (CA-Cross). All experiments are conducted under strictly controlled conditions, with identical training parameters and a consistent phased training strategy. Table 4 demonstrates that CA-Cross achieves significantly superior F1-scores across all seven benchmark datasets compared to SA-Concat (average improvement Δ = 30.7), confirming the critical role of cross-attention in facilitating feature interaction with entity descriptions.

5.3.2. Impact of Model Capacity Analysis

To address the concern that the performance improvements of our proposed method might stem solely from an increase in trainable parameters, we conducted a rigorous ablation study focusing on model capacity. We designed a variant named DFEM-Deep, which extends the standard DFEM module by increasing its layer depth. This ensures that DFEM-Deep possesses a parameter count comparable to our complete HPEF model (specifically aligning with the parameter scale of the DSFE module). As presented in Table 5, we compared the standard DFEM against the capacity-enhanced DFEM-Deep across five datasets. The results indicate that simply scaling up the model size does not lead to performance gains; in fact, DFEM-Deep exhibited a slight performance degradation in most domains (Avg. 48.7) compared to the standard DFEM (Avg. 49.3). This observation suggests that the entity recognition task in these domains is not bottlenecked by model capacity but rather requires effective feature decoupling and refinement mechanisms. Consequently, this validates that the superiority of our proposed framework is attributed to its hierarchical progressive architecture rather than a mere increase in parameters.

5.3.3. Contribution Analysis of HPEF

We conducted ablation studies from the temporal perspective of the decision-making process to evaluate the proposed HPEF mechanism, including two settings: (1) single-phase decision using only the first-stage output (D1-Only), and (2) the complete Hierarchical Progressive Entity Filtering (HPEF) mechanism. As shown in Table 6, HPEF consistently outperforms D1-Only across all datasets (average improvement Δ = 16.5), demonstrating that the hierarchical progressive filtering mechanism effectively decouples boundary detection and type classification subtasks, and significantly enhances complex entity recognition through iterative, layer-wise refinement.

5.4. Qualitative Analysis

To investigate the limitations of the proposed framework, we conducted a qualitative analysis on the prediction mismatches. As detailed in Table 7, Span Boundary Errors constitute the primary bottleneck (40.9%). This phenomenon indicates that while the model successfully locates the core entity, it tends to capture longer semantic dependencies, frequently including professional titles (e.g., “Prime Minister”) or modifiers within the entity span. False Positives (36.4%) rank second, which can be attributed to the model’s sensitivity to adjectival demonyms (e.g., “British”) that are often excluded in specific annotation schemas, as well as occasional inconsistencies in the ground truth labels themselves. False Negatives (18.2%) and minor tokenization artifacts account for the remainder, primarily occurring in non-standard sentence structures such as capitalized datelines.
The analysis demonstrates that CSRVNER possesses robust semantic understanding and feature matching capabilities based on entity descriptions, as the majority of errors stem from precision issues in boundary delineation rather than detection failures. The model effectively identifies the semantic focus, further validating the efficacy of our methodology; however, it remains challenged in strictly separating entities from their immediate syntactic modifiers. Future improvements could focus on integrating boundary-aware constraints or employing data augmentation techniques specifically targeting complex noun phrases and non-standard text formats to refine the granularity of entity extraction.

6. Conclusions

This study proposes a non-autoregressive zero-shot NER framework that leverages dynamic entity description features to effectively alleviate the high resource consumption and token-by-token autoregressive latency encountered by generative models in zero-shot NER. To address the limitations of span-based methods—large candidate space, severe class imbalance, and insufficient boundary accuracy—we design a Hierarchical Progressive Entity Filtering (HPEF) mechanism, which determines candidate regions in a single pass, significantly reducing computational and memory overhead. Trained on a single 32 GB GPU for 24 h, the framework achieves approximately 90% of SOTA performance across multiple domain datasets, with an inference speed 17 times faster than that of SOTA methods. The framework maintains high recognition accuracy while greatly improving computational efficiency, providing a practical solution for the efficient deployment of zero-shot NER.

Author Contributions

Conceptualization, M.Y. and S.W.; methodology, M.Y. and S.W.; software, S.W.; investigation, S.W. and H.Y.; data curation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, M.Y.; supervision, N.C.; project administration, N.C.; funding acquisition, N.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities, grant number 2024ZKPYZN01, and Beijing Longruan Spatiotemporal Intelligence Technology Co., Ltd.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available at https://huggingface.co/datasets/milistu/Pile-NER-type-conll (accessed on 4 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Hyperparameters

The detailed hyperparameter configurations are summarized in Table A1. We employ the AdamW optimizer with a learning rate of $1 \times 10^{-6}$ for both the pretrained encoder (lr_encoder) and other parameters (lr_others). Training is conducted for 3 epochs using a batch size of 8, with evaluation performed every 5000 steps. Notably, no warmup phase is explicitly applied (indicated by the unspecified warmup_ratio).
For model architecture, we utilize the Qwen2.5-1.5B [56,57] pretrained model without fine-tuning (fine_tune = false). The hidden dimension of projection layers is set to 1536, and a dropout rate of 0.3 is adopted to prevent overfitting.
Table A1. Hyperparameter configuration.

| Hyperparameter | Value |
| --- | --- |
| *Optimizer* | |
| Optimizer | AdamW |
| lr_encoder | 1 × 10⁻⁶ |
| lr_others | 1 × 10⁻⁶ |
| *Training Parameters* | |
| epoch | 3 |
| warmup_ratio | – |
| train_batch_size | 8 |
| eval_every | 5000 |
| *Model Configuration* | |
| model_name | Qwen2.5-1.5B |
| fine_tune | false |
| hidden_size | 1536 |
| dropout | 0.3 |
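Under the assumption of a standard PyTorch training loop, the two learning-rate entries in Table A1 correspond to separate AdamW parameter groups. The sketch below uses placeholder modules standing in for the encoder and the projection/filtering heads; the names are ours, not the paper's code.

```python
import torch
from torch import nn

# Placeholder modules: a stand-in "encoder" and projection/filtering "heads".
# Dimensions mirror hidden_size = 1536 and dropout = 0.3 from Table A1.
encoder = nn.Linear(1536, 1536)
heads = nn.Sequential(nn.Dropout(0.3), nn.Linear(1536, 2))

# Two parameter groups mirror the lr_encoder / lr_others entries.
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-6},  # lr_encoder
    {"params": heads.parameters(), "lr": 1e-6},    # lr_others
])
```

Keeping the groups separate makes it trivial to later assign the encoder a different (e.g., smaller) learning rate without restructuring the training loop.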

Appendix A.2. Entity Description Generation

Entity Description Generation Details

In this study, we utilized ChatGPT (GPT-4o version) to generate structured descriptions for diverse entity types. To ensure the integrity of the zero-shot setting, the input prompts were strictly designed to contain only the entity label name (e.g., event). We explicitly clarify that the prompts contained absolutely no sample instances, descriptions, or contextual snippets from either the training or testing datasets.
The system instructions were formulated to guide the model (e.g., “Provide a concise English definition of the event entity, including core characteristics and representative examples while avoiding subjective language”). To ensure high quality, the candidate texts generated by the model underwent a manual post-processing phase. It is worth noting that this post-processing was restricted solely to enhancing linguistic fluency and ensuring the generality of the definitions, avoiding any semantic alteration based on specific dataset samples (see Table A2).
For instance, the final description for the event entity was standardized as follows: “An occurrence or activity with specific participants, time, and location, often reflecting social, cultural, or historical significance. Examples: political elections, natural disasters, artistic performances.”
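Because the prompts contain only the label name, prompt construction reduces to simple string templating. The helper below is a hypothetical sketch (the function name is ours, and the call to the LLM itself is omitted):

```python
def build_description_prompt(label: str) -> str:
    """Construct a zero-shot-safe prompt asking an LLM for an entity-type
    description. Only the label name is inserted; no dataset samples,
    descriptions, or contextual snippets ever enter the prompt."""
    return (
        f"Provide a concise English definition of the {label} entity, "
        "including core characteristics and representative examples "
        "while avoiding subjective language."
    )

# Example: the prompt for the "event" label used in the paper's illustration.
prompt = build_description_prompt("event")
```

Keeping the template label-only is what preserves the integrity of the zero-shot setting: the generated description depends on the label name and the LLM's general knowledge, never on the evaluation data.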
Table A2. Dataset Statistics.

| Dataset | Train | Dev | Test | Types | Avg. Tokens | Avg. Entities |
| --- | --- | --- | --- | --- | --- | --- |
| AnatEM [58] | 5861 | 2118 | 3830 | 1 | 37 | 0.7 |
| bc2gm [59] | 12,500 | 2500 | 5000 | 1 | 36 | 0.4 |
| bc4chemd [60] | 30,682 | 30,639 | 26,364 | 1 | 45 | 0.9 |
| bc5cdr [61] | 4560 | 4581 | 4797 | 2 | 41 | 2.2 |
| CoNLL03 [24] | 14,041 | 3250 | 3453 | 3 | 25 | 1.9 |
| GENIA [62] | 15,023 | 1669 | 1854 | 5 | 43 | 3.5 |
| HarveyNER [63] | 3967 | 1301 | 1303 | 4 | 48 | 0.4 |
| MultiNERD [64] | 134,144 | 10,000 | 10,000 | 16 | 28 | 1.6 |
| ncbi [65] | 5432 | 923 | 940 | 1 | 39 | 1.0 |
| OntoNotes [37] | 59,924 | 8528 | 8262 | 18 | 18 | 0.9 |
| PolyglotNER [66] | 393,982 | 10,000 | 10,000 | 3 | 34 | 1.0 |
| TweetNER7 [67] | 7111 | 886 | 576 | 7 | 52 | 3.1 |
| WikiANN en [68] | 20,000 | 10,000 | 10,000 | 3 | 15 | 1.4 |
| FindVehicle [69] | 21,565 | 20,777 | 20,777 | 21 | 33 | 5.5 |
| CrossNER AI [4] | 100 | 350 | 431 | 13 | 52 | 5.3 |
| CrossNER Literature [4] | 100 | 400 | 416 | 11 | 54 | 5.4 |
| CrossNER Music [4] | 100 | 380 | 465 | 12 | 57 | 6.5 |
| CrossNER Politics [4] | 199 | 540 | 650 | 8 | 61 | 6.5 |
| CrossNER Science [4] | 200 | 450 | 543 | 16 | 54 | 5.4 |

References

  1. Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
  2. Chiu, J.P.; Nichols, E. Named entity recognition with bidirectional lstm-cnns. Trans. Assoc. Comput. Linguist. 2016, 4, 357–370. [Google Scholar] [CrossRef]
  3. Sun, C.; Yang, Z.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. Biomedical named entity recognition using bert in the machine reading comprehension framework. J. Biomed. Inform. 2021, 118, 103799. [Google Scholar] [CrossRef]
  4. Liu, Z.; Xu, Y.; Yu, T.; Dai, W.; Ji, Z.; Cahyawijaya, S.; Madotto, A.; Fung, P. CrossNER: Evaluating Cross-Domain Named Entity Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 13452–13460. [Google Scholar]
  5. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  6. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  7. Xie, T.; Li, Q.; Zhang, J.; Zhang, Y.; Liu, Z.; Wang, H. Empirical study of zero-shot ner with chatgpt. arXiv 2023, arXiv:2310.10035. [Google Scholar] [CrossRef]
  8. Zhou, W.; Zhang, S.; Gu, Y.; Chen, M.; Poon, H. Universalner: Targeted distillation from large language models for open named entity recognition. arXiv 2023, arXiv:2308.03279. [Google Scholar]
  9. Sainz, O.; García-Ferrero, I.; Agerri, R.; de Lacalle, O.L.; Rigau, G.; Agirre, E. Gollie: Annotation guidelines improve zero-shot information-extraction. arXiv 2023, arXiv:2310.03668. [Google Scholar]
  10. Zaratiana, U.; Tomeh, N.; Holat, P.; Charnois, T. GLiNER: Generalist model for named entity recognition using bidirectional transformer. arXiv 2023, arXiv:2311.08526. [Google Scholar] [CrossRef]
  11. Warden, P.; Situnayake, D. TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers; O’Reilly Media: Sebastopol, CA, USA, 2019. [Google Scholar]
  12. Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Gal, Y.; Papernot, N.; Anderson, R. The curse of recursion: Training on generated data makes models forget. arXiv 2023, arXiv:2305.17493. [Google Scholar]
  13. Pope, R.; Douglas, S.; Chowdhery, A.; Devlin, J.; Bradbury, J.; Heek, J.; Xiao, K.; Agrawal, S.; Dean, J. Efficiently scaling transformer inference. Proc. Mach. Learn. Syst. 2023, 5, 606–624. [Google Scholar]
  14. Wang, Y.; Tong, H.; Zhu, Z.; Li, Y. Nested named entity recognition: A survey. ACM Trans. Knowl. Discov. Data 2022, 16, 1–29. [Google Scholar] [CrossRef]
  15. Huang, P.; Zhao, X.; Hu, M.; Fang, Y.; Li, X.; Xiao, W. Extract-select: A span selection framework for nested named entity recognition with generative adversarial training. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 85–96. [Google Scholar]
  16. Zhu, E.; Liu, Y.; Jin, M.; Li, J. Recognizing nested entities from flat supervision: A new ner subtask, feasibility and challenges. arXiv 2022, arXiv:2211.00301. [Google Scholar]
  17. Wan, J.; Ru, D.; Zhang, W.; Yu, Y. Nested named entity recognition with span-level graphs. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 892–903. [Google Scholar]
  18. Si, S.; Cai, Z.; Zeng, S.; Feng, G.; Lin, J.; Chang, B. SANTA: Separate strategies for inaccurate and incomplete annotation noise in distantly-supervised named entity recognition. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 3883–3896. [Google Scholar]
  19. Mao, H.; Mao, X.L.; Tang, H.; Shang, Y.M.; Gao, X.; Ma, A.J.; Huang, H. Span-based unified named entity recognition framework via contrastive learning. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), Jeju, Republic of Korea, 3–9 August 2024; International Joint Conferences on Artificial Intelligence: Montreal, QC, Canada, 2024; pp. 6406–6414. [Google Scholar]
  20. Zhou, W.; Zhang, S.; Gu, Y.; Chen, M.; Poon, H. Recent advances in named entity recognition: A survey. arXiv 2024, arXiv:2404.14294. [Google Scholar]
  21. Sutton, C.; Rohanimanesh, K.; McCallum, A. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; ACM: New York, NY, USA, 2004; p. 99. [Google Scholar]
  22. Sun, X.; Sun, S.; Yin, M.; Yang, H. Hybrid neural conditional random fields for multi-view sequence labeling. Knowl.-Based Syst. 2020, 189, 105151. [Google Scholar] [CrossRef]
  23. Ekbal, A.; Bandyopadhyay, S. Named Entity Recognition Using Support Vector Machine: A Language Independent Approach. Int. J. Electr. Comput. Syst. Eng. 2010, 4, 155–170. [Google Scholar]
  24. Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. arXiv 2003, arXiv:cs/0306050. [Google Scholar]
  25. Finkel, J.R.; Grenager, T.; Manning, C.D. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 363–370. [Google Scholar]
  26. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
  27. Finkel, J.R.; Manning, C.D. Joint Parsing and Named Entity Recognition. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; pp. 326–334. [Google Scholar]
  28. Johnson, R.; Zhang, T. A High-Performance Semi-Supervised Learning Method for Text Chunking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 1–9. [Google Scholar]
  29. Luo, L.; Yang, Z.; Yang, P.; Zhang, Y.; Wang, L.; Lin, H.; Wang, J. An attention-based bilstm-crf approach to document-level chemical named entity recognition. Bioinformatics 2018, 34, 1381–1388. [Google Scholar] [CrossRef]
  30. Li, D.; Yan, L.; Yang, J.; Ma, Z. Dependency syntax guided bert-bilstm-gam-crf for chinese ner. Expert Syst. Appl. 2022, 196, 116682. [Google Scholar] [CrossRef]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  32. Ushio, A.; Camacho-Collados, J. T-NER: An All-Round Python Library for Transformer-Based Named Entity Recognition. arXiv 2022, arXiv:2209.12616. [Google Scholar]
  33. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  34. Hakala, K.; Pyysalo, S. Biomedical Named Entity Recognition with Multilingual BERT. In Proceedings of the 5th Workshop on BioNLP Open Shared Tasks, Hong Kong, China, 4 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 56–61. [Google Scholar]
  35. Souza, F.; Nogueira, R.; Lotufo, R. Portuguese Named Entity Recognition Using BERT-CRF. arXiv 2019, arXiv:1909.10649. [Google Scholar]
  36. Li, X.; Zhang, H.; Zhou, X.H. Chinese Clinical Named Entity Recognition with Variant Neural Structures Based on BERT Methods. J. Biomed. Inform. 2020, 107, 103422. [Google Scholar] [CrossRef]
  37. Weischedel, R.; Palmer, M.; Marcus, M.; Hovy, E.; Pradhan, S.; Ramshaw, L.; Xue, N.; Taylor, A.; Kaufman, J.; Franchini, M.; et al. OntoNotes Release 5.0 LDC2013T19; Linguistic Data Consortium: Philadelphia, PA, USA, 2013. [Google Scholar]
  38. Sohrab, M.G.; Miwa, M. Deep exhaustive model for nested named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, 31 October–4 November 2018; ACL: San Jose, CA, USA, 2018; pp. 2843–2849. [Google Scholar]
  39. Fu, J.; Kong, X.; Zhang, Y.; Xu, W.; Zhou, G. SPAN: Structured Prediction Adaptation Network for Generalized Zero-Shot Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; pp. 12875–12883. [Google Scholar]
  40. Zheng, C.; Cai, Y.; Xu, J.; Leung, H.F.; Xu, G. A Boundary-aware Neural Model for Nested Named Entity Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019; ACL: San Jose, CA, USA, 2019; pp. 357–366. [Google Scholar]
  41. Xu, G.; Zhang, Y.; Liu, M. SEE-Few: Seed, Expand and Entail for Few-shot Named Entity Recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland, 22–27 May 2022; ACL: San Jose, CA, USA, 2022; pp. 2637–2651. [Google Scholar]
  42. Li, X.; Chen, X.; Shen, Y.; Zhang, S.; Chen, Q.; Huang, X.; Qiu, X. Unified Named Entity Recognition as Word-Word Relation Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland, 22–27 May 2022; ACL: San Jose, CA, USA, 2022; pp. 4730–4742. [Google Scholar]
  43. Zhang, S.; Chen, X.; Qiu, X.; Huang, X. DeepKE: A Deep Learning-Based Knowledge Extraction Toolkit for Knowledge Graph Construction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Dublin, Ireland, 22–27 May 2022; ACL: San Jose, CA, USA, 2022; pp. 289–296. [Google Scholar]
  44. Yu, J.; Bohnet, B.; Poesio, M. Named Entity Recognition as Dependency Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020; ACL: San Jose, CA, USA, 2020; pp. 6470–6476. [Google Scholar]
  45. Wang, X.; Lu, Y.; Wang, H.; Yu, D.; Zhang, X.; Liu, K. SPANNER: Named Entity Recognition as Span Boundary Detection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Virtual Event/Punta Cana, Dominican Republic, 7–11 November 2021; ACL: San Jose, CA, USA, 2021; pp. 256–267. [Google Scholar]
  46. Shen, Y.; Wang, X.; Wang, H.; Zhang, X.; Liu, K.; Huang, S. Locate and Label: A Two-stage Identifier for Nested Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 1–6 August 2021; ACL: San Jose, CA, USA, 2021; pp. 2782–2794. [Google Scholar]
  47. Yan, H.; Li, T.; Ji, J.; Qiu, X.; Huang, X. A Unified Generative Framework for Various NER Subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 1–6 August 2021; ACL: San Jose, CA, USA, 2021; pp. 5808–5822. [Google Scholar]
  48. Li, X.; Zhao, X.; Lu, Y.; Zhang, X.; Liu, K. Dice Loss for Data-imbalanced NLP Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland, 22–27 May 2022; ACL: San Jose, CA, USA, 2022; pp. 7661–7670. [Google Scholar]
  49. Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models are Zero-Shot Learners. arXiv 2021, arXiv:2109.01652. Available online: https://arxiv.org/abs/2109.01652 (accessed on 4 January 2026).
  50. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  51. Wang, X.; Zhou, W.; Zu, C.; Xia, H.; Chen, T.; Zhang, Y.; Zheng, R.; Ye, J.; Zhang, Q.; Gui, T.; et al. InstructUIE: Multi-task Instruction Tuning for Unified Information Extraction. arXiv 2023, arXiv:2304.08085. [Google Scholar]
  52. Ding, Y.; Li, J.; Wang, P.; Tang, Z.; Yan, B.; Zhang, M. Rethinking negative instances for generative named entity recognition. arXiv 2024, arXiv:2402.16602. [Google Scholar] [CrossRef]
  53. Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. LMSYS Org. 2023. Available online: https://lmsys.org/blog/2023-03-30-vicuna/ (accessed on 4 January 2026).
  54. Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; et al. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models. arXiv 2023, arXiv:2303.10420. [Google Scholar] [CrossRef]
  55. Lou, J.; Lu, Y.; Dai, D.; Jia, W.; Lin, H.; Han, X.; Sun, L.; Wu, H. Universal Information Extraction as Unified Semantic Matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
  56. Qwen Team. Qwen2.5: A Party of Foundation Models. 2024. Available online: https://qwenlm.github.io/blog/qwen2.5/ (accessed on 18 September 2024).
  57. Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. [Google Scholar]
  58. Pyysalo, S.; Ananiadou, S. Anatomical entity mention recognition at literature scale. Bioinformatics 2014, 30, 868–875. [Google Scholar] [CrossRef] [PubMed]
  59. Smith, L.; Tanabe, L.K.; Kuo, C.J.; Chung, I.; Hsu, C.N.; Lin, Y.S.; Klinger, R.; Friedrich, C.M.; Ganchev, K.; Torii, M.; et al. Overview of BioCreative II gene mention recognition. Genome Biol. 2008, 9, S2. [Google Scholar] [CrossRef]
  60. Krallinger, M.; Rabal, O.; Leitner, F.; Vazquez, M.; Salgado, D.; Lu, Z.; Leaman, R.; Lu, Y.; Ji, D.; Lowe, D.M.; et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J. Cheminform. 2015, 7, S2. [Google Scholar] [CrossRef]
  61. Li, J.; Sun, Y.; Johnson, R.J.; Sciaky, D.; Wei, C.H.; Leaman, R.; Davis, A.P.; Mattingly, C.J.; Wiegers, T.C.; Lu, Z. BioCreative V CDR task corpus: A resource for chemical disease relation extraction. Database 2016, 2016, baw068. [Google Scholar] [CrossRef] [PubMed]
  62. Kim, J.D.; Ohta, T.; Tateisi, Y.; Tsujii, J. GENIA Corpus—A Semantically Annotated Corpus for Bio-Textmining. Bioinformatics 2003, 19, i180–i182. [Google Scholar] [CrossRef]
  63. Chen, P.; Xu, H.; Zhang, C.; Huang, R. Crossroads, buildings and neighborhoods: A dataset for fine-grained location recognition. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 3329–3339. [Google Scholar]
  64. Tedeschi, S.; Navigli, R. MultiNERD: A multilingual, multi-genre and fine-grained dataset for named entity recognition (and disambiguation). In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 801–812. [Google Scholar]
  65. Doğan, R.I.; Leaman, R.; Lu, Z. NCBI disease corpus: A resource for disease name recognition and concept normalization. J. Biomed. Inform. 2014, 47, 1–10. [Google Scholar] [CrossRef]
  66. Al-Rfou, R.; Kulkarni, V.; Perozzi, B.; Skiena, S. Polyglot-NER: Massive multilingual named entity recognition. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, 30 April–2 May 2015; SIAM: Bangkok, Thailand, 2015; pp. 586–594. [Google Scholar]
  67. Ushio, A.; Neves, L.; Silva, V.; Barbieri, F.; Camacho-Collados, J. Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Taipei, Taiwan, 20–23 November 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022. [Google Scholar]
  68. Pan, X.; Zhang, B.; May, J.; Nothman, J.; Knight, K.; Ji, H. Cross-lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1946–1958. [Google Scholar]
  69. Guan, R.; Man, K.L.; Chen, F.; Yao, S.; Hu, R.; Zhu, X.; Smith, J.; Lim, E.G.; Yue, Y. FindVehicle and VehicleFinder: A NER dataset for natural language-based vehicle retrieval and a keyword-based cross-modal vehicle retrieval system. arXiv 2023, arXiv:2304.10893. [Google Scholar] [CrossRef]
Figure 1. CSRVNER adopts a three-stage pipeline structure—Read → Annotation → Review—where DFEM first generates candidate entities, which are then refined and verified by DSFE, forming a continuous, progressive entity recognition process. Different colors denote different embeddings; for labels, color variations only serve to distinguish individual labels rather than representing label types.
Figure 2. Model performance and inference speed in zero-shot settings. The upward and rightward arrows indicate better performance and faster inference speed, respectively. To ensure a fair comparison, inference speed is evaluated on a single A100 node, aligning with the hardware setup reported in [52]. Results of InstructUIE, UniNER, and GNER are from [52].
Table 1. Zero-Shot Scores on the Out-of-Domain NER Benchmark. We compared our model with various Open NER models. Results for Vicuna, ChatGPT, and UniNER are from Zhou et al. [8]; USM and InstructUIE are from Wang et al. [51]; GoLLIE is from Sainz et al. [9]; GLiNER is from [10]; and GNER-FlanT5 is from [52].

| Model | Params | AI | Literature | Music | Politics | Science | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vicuna-7B | 7B | 12.8 | 16.1 | 17.0 | 20.5 | 13.0 | 18.5 |
| Vicuna-13B | 13B | 22.7 | 22.7 | 26.6 | 27.0 | 22.0 | 24.2 |
| USM | 0.3B | 28.2 | 56.0 | 44.9 | 36.1 | 44.0 | 41.8 |
| ChatGPT-3.5 | – | 52.4 | 39.8 | 66.6 | 68.5 | 67.0 | 58.9 |
| InstructUIE | 11B | 49.0 | 47.2 | 53.2 | 48.1 | 49.2 | 49.3 |
| UniNER-7B | 7B | 53.6 | 59.3 | 67.0 | 60.9 | 61.1 | 60.4 |
| UniNER-13B | 13B | 54.2 | 60.9 | 64.5 | 61.4 | 63.5 | 60.9 |
| GoLLIE | 7B | 63.0 | 62.7 | 67.8 | 57.2 | 55.5 | 61.2 |
| GLiNER-L | 0.3B | 57.2 | 64.4 | 69.6 | 72.6 | 62.6 | 65.3 |
| CSRVNER | 1.5B | 57.6 | 64.2 | 71.5 | 70.2 | 65.6 | 65.8 |
Table 2. Zero-shot performance on 15 NER datasets. Results of ChatGPT and UniNER are reported from [8] and GLiNER is from [10]. The best results are highlighted in bold.

| Dataset | ChatGPT | UniNER | GLiNER | CSRVNER |
| --- | --- | --- | --- | --- |
| Params | – | 7B | 0.3B | 1.5B |
| AnatEM | 30.7 | 25.1 | **33.3** | 32.1 |
| bc2gm | 40.2 | 46.2 | **47.9** | 35.2 |
| bc4chemd | 35.5 | **47.9** | 43.1 | 44.5 |
| bc5cdr | 52.4 | 68.0 | 66.4 | **68.6** |
| CoNLL03 | 52.5 | **72.2** | 64.6 | 71.5 |
| FindVehicle | 10.5 | 22.2 | **41.9** | 21.2 |
| GENIA | 41.6 | 54.1 | 55.5 | **57.8** |
| HarveyNER | 11.6 | 18.2 | 22.7 | **34.6** |
| MultiNERD | 58.1 | 59.3 | 59.7 | **71.6** |
| ncbi | 42.1 | 60.4 | **61.9** | 60.4 |
| OntoNotes | 29.7 | 27.8 | 32.2 | **42.5** |
| PolyglotNER | 33.6 | 41.8 | 42.9 | **51.7** |
| TweetNER7 | 40.1 | **42.7** | 41.4 | 41.5 |
| WikiANN | 52.0 | 55.4 | 58.9 | **65.6** |
| WikiNeural | 57.7 | 69.2 | 71.8 | **75.6** |
| Average | 39.2 | 47.3 | 49.6 | **51.6** |
Table 3. Comparison of inference speed and memory usage between W2NER and our proposed CSRVNER across different input lengths. Thr. denotes Throughput. The best results are highlighted in bold. ↑ indicates higher is better; ↓ indicates lower is better.

| Length | W2NER Latency (ms) ↓ | W2NER Thr. (token/s) ↑ | W2NER VRAM (GB) ↓ | CSRVNER Latency (ms) ↓ | CSRVNER Thr. (token/s) ↑ | CSRVNER VRAM (GB) ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 | **85.61** | **11,680.96** | **5.75** | 273.98 | 3649.91 | 8.11 |
| 2000 | 2384.00 | 838.93 | 14.10 | **520.34** | **3843.65** | **8.53** |
| 3000 | 5287.26 | 567.40 | 27.94 | **971.28** | **3088.72** | **9.15** |
Table 4. Cross-attention (CA-Cross) vs. concatenation-based self-attention (SA-Concat) for entity localization, showing an average F1 improvement of Δ = 30.7 across five benchmarks. The best results are highlighted in bold.

| Dataset | SA-Concat | CA-Cross |
| --- | --- | --- |
| AI | 31.1 | **57.6** |
| Literature | 28.9 | **64.2** |
| Music | 34.0 | **71.5** |
| Politics | 39.5 | **70.2** |
| Science | 41.9 | **65.6** |
| Avg. | 35.1 | **65.8** |
Table 5. Ablation study on model capacity. DFEM denotes the standard module, while DFEM-Deep represents the variant with increased layers to match the parameter count of the full HPEF model. The best results are highlighted in bold.

| Dataset | DFEM | DFEM-Deep |
| --- | --- | --- |
| AI | 43.6 | **44.3** |
| Literature | **45.6** | 44.0 |
| Music | **51.5** | 49.8 |
| Politics | **55.7** | 55.6 |
| Science | **50.3** | 49.9 |
| Avg. | **49.3** | 48.7 |
Table 6. Ablation of the HPEF framework: HPEF achieves Δ = 16.5 F1 gains over single-phase baselines. The best results are highlighted in bold.

| Dataset | D1-Only | HPEF |
| --- | --- | --- |
| AI | 43.6 | **57.6** |
| Literature | 45.6 | **64.2** |
| Music | 51.5 | **71.5** |
| Politics | 55.7 | **70.2** |
| Science | 50.3 | **65.6** |
| Avg. | 49.3 | **65.8** |
Table 7. Distribution of error types based on the qualitative analysis of sampled instances. The analysis focuses on the 22 mismatch cases found in the sample set.
Table 7. Distribution of error types based on the qualitative analysis of sampled instances. The analysis focuses on the 22 mismatch cases found in the sample set.
| Error Type | Count | Percentage | Primary Cause |
| --- | --- | --- | --- |
| Span Boundary Error | 9 | 40.9% | Inclusion of titles/modifiers (e.g., Job Titles) |
| False Positives | 8 | 36.4% | Adjectival entities & ground-truth omissions |
| False Negatives | 4 | 18.2% | Missed capitalized datelines & ambiguity |
| Tokenization Artifacts | 1 | 4.5% | Possessive suffix handling (’s) |

Share and Cite

MDPI and ACS Style

Yang, M.; Wang, S.; Yang, H.; Chen, N. Breaking the Speed–Accuracy Trade-Off: A Novel Embedding-Based Framework with Coarse Screening-Refined Verification for Zero-Shot Named Entity Recognition. Computers 2026, 15, 36. https://doi.org/10.3390/computers15010036


