An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities

Albahli, Saleh

doi:10.3390/app15063073

Open AccessArticle

An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities

by

Saleh Albahli

^1,2

¹

Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia

²

Department of Computer Science, Kent State University, Kent, OH 44240, USA

Appl. Sci. 2025, 15(6), 3073; https://doi.org/10.3390/app15063073

Submission received: 30 January 2025 / Revised: 1 March 2025 / Accepted: 6 March 2025 / Published: 12 March 2025

(This article belongs to the Special Issue Techniques and Applications of Natural Language Processing)

Download

Browse Figures

Versions Notes

Abstract

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that supports applications such as information retrieval, sentiment analysis, and text summarization. While substantial progress has been made in NER for widely studied languages like English, Arabic presents unique challenges due to its morphological richness, orthographic ambiguity, and the frequent occurrence of nested and overlapping entities. This paper introduces a novel Arabic NER framework that addresses these complexities through architectural innovations. The proposed model incorporates a Hybrid Feature Fusion Layer, which integrates external lexical features using a cross-attention mechanism and a Gated Lexical Unit (GLU) to filter noise, while a Compound Span Representation Layer employs Rotary Positional Encoding (RoPE) and Bidirectional GRUs to enhance the detection of complex entity structures. Additionally, an Enhanced Multi-Label Classification Layer improves the disambiguation of overlapping spans and assigns multiple entity types where applicable. The model is evaluated on three benchmark datasets—ANERcorp, ACE 2005, and a custom biomedical dataset—achieving an F1-score of 93.0% on ANERcorp and 89.6% on ACE 2005, significantly outperforming state-of-the-art methods. A case study further highlights the model’s real-world applicability in handling compound and nested entities with high confidence. By establishing a new benchmark for Arabic NER, this work provides a robust foundation for advancing NLP research in morphologically rich languages.

Keywords:

natural language processing (NLP); multi-label classification; cross-attention; nested entities

1. Introduction

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying entities, such as names of persons, locations, organizations, and events, in textual data. Despite significant advancements in NER for widely studied languages like English, Arabic remains a particularly challenging domain due to its morphological richness, orthographic ambiguity, and frequent occurrences of nested and overlapping entities [1]. While nested entities are common across many languages, Arabic texts present additional complexities due to the language’s rich morphology, inflectional nature, and frequent compounding of entity phrases. This leads to highly ambiguous boundaries and nested structures that intertwine multiple entity types. For example, in the phrase “مدينة الملك عبدالله الطبية” (King Abdullah Medical City), the text contains nested entities (a location name containing a person name) that cannot be easily separated using traditional sequence labeling techniques.

Another critical limitation of existing Arabic NER systems relates to their handling of external lexical information. Current methods often rely heavily on static lexicons or general-purpose pre-trained language models. While these resources capture valuable semantic and syntactic information, they also introduce noise in the form of irrelevant matches, outdated terms, or domain mismatches, especially in specialized texts such as biomedical reports or informal texts from social media [2]. This inability to dynamically filter external lexical information leads to suboptimal performance, especially when models encounter domain shifts or noisy, ambiguous content.

Additionally, Arabic text often omits diacritical marks, which are essential for disambiguating words with identical spellings but different meanings [3]. This orthographic ambiguity further compounds the difficulty in recognizing entities correctly. Existing NER approaches for Arabic, particularly those adapted from other languages, tend to either overfit to the training domain or generalize poorly across diverse contexts, particularly when external lexical cues are either incomplete or contradictory.

In this work, we propose a novel Arabic NER framework that directly addresses these challenges through innovative architectural enhancements. The framework introduces a Hybrid Feature Fusion Layer, which integrates external lexical features using cross-attention mechanisms combined with a Gated Lexical Unit (GLU) to filter noise and retain relevant information [4]. Additionally, a Compound Span Representation Layer is designed to capture hierarchical relationships and dependencies between entities using Rotary Positional Encoding (RoPE) and Bidirectional GRUs. Finally, the Enhanced Multi-Label Classification Layer enables the model to disambiguate overlapping spans and assign multiple entity types where necessary.

Despite these challenges, recent advancements in pre-trained language models, such as AraBERT and multilingual models like XLM-R, have shown promise in Arabic NLP tasks [5]. These models leverage large-scale text corpora to capture contextual and semantic information effectively. However, their general-purpose architectures are not specifically optimized for the unique requirements of Arabic NER, such as handling nested entities and filtering irrelevant lexical information [6]. As a result, there remains a significant performance gap that motivates the development of specialized models.

The proposed framework is evaluated on three datasets: ANERcorp, ACE 2005, and a custom biomedical dataset. These datasets encompass a wide range of scenarios, from flat entities in ANERcorp to complex nested structures in ACE 2005 and specialized domain entities in the biomedical dataset [2]. The results demonstrate that the proposed model consistently outperforms state-of-the-art methods in precision, recall, and F1-score, highlighting its robustness and adaptability across diverse domains.

Furthermore, this work includes a case study to illustrate the practical applicability of the model. Using real-world Arabic text, we show how the model effectively handles compound and nested entities with high confidence, demonstrating its capability to address real-world linguistic complexities. The case study not only underscores the model’s effectiveness but also provides a tangible example of its use in practical applications, such as digital content analysis or information extraction from Arabic texts.

The contributions of this work are threefold. First, we propose novel architectural enhancements, including the Hybrid Feature Fusion Layer, Compound Span Representation Layer, and Enhanced Multi-Label Classification Layer, which collectively address critical challenges in Arabic NER [1]. Second, we conduct a comprehensive evaluation using diverse datasets and scenarios, providing a robust benchmark for future research. Third, we present practical insights through a detailed case study, bridging the gap between research and real-world applications.

This paper addresses the significant challenges of Arabic NER through a carefully designed framework that leverages advanced techniques to enhance lexical integration, span representation, and multi-label classification. By bridging the gap between general-purpose architectures and the specific needs of Arabic NER, this work contributes to the growing body of research aimed at advancing NLP for the Arabic language.

The remainder of this paper is structured as follows: Section 2 discusses related work, highlighting recent advancements and gaps in Arabic NER research. Section 3 presents the methodology, detailing the architectural components of the proposed model. Section 4 provides an in-depth experimental analysis and results, including dataset descriptions, ablation studies, and visualizations. Section 5 concludes with a summary of the findings, limitations, and directions for future research.

2. Related Works

Named Entity Recognition (NER) has seen remarkable progress in recent years, driven by advancements in pre-trained language models, attention mechanisms, and domain-specific adaptations. Multilingual NER has gained significant attention, as highlighted by Chen et al. [7], who leveraged transfer learning to achieve state-of-the-art performance across 10 languages, underscoring the importance of adapting models dynamically to target languages. Similarly, Li et al. [8] introduced a cross-lingual framework employing contrastive learning, significantly improving NER for low-resource languages by aligning multilingual embeddings. These studies reveal the potential of transfer learning and cross-lingual techniques to bridge linguistic gaps in NER.

For domain-specific applications, Wang et al. [9] showcased the effectiveness of integrating external knowledge graphs in biomedical NER, enabling precise recognition of specialized entities. In the Arabic NER domain, Youssef et al. [10] demonstrated the superiority of AraBERT over generic multilingual models by incorporating morphological features, a critical advancement given Arabic’s rich morphology. Further emphasizing domain specificity, Liu et al. [11] introduced dual-attention mechanisms that enhanced feature integration in financial texts, achieving substantial accuracy gains. These studies collectively highlight that combining domain knowledge with linguistic adaptations is crucial for advancing NER in specialized fields.

The complexity of nested and overlapping entities remains a persistent challenge in NER. Sun et al. [12] addressed this by proposing a span-based model with enhanced positional encoding, achieving notable success in nested entity recognition. Complementing this, Gao et al. [13] developed a global attention mechanism that effectively tackled overlapping entity boundaries. Building on these innovations, Ahmed et al. [14] proposed data augmentation techniques, such as back-translation and masking, which proved invaluable for underrepresented languages and noisy datasets.

Arabic NER, in particular, faces unique challenges due to its complex morphology and diacritic-based ambiguity. Al-Rashed et al. [15] showed that integrating morphological analyzers significantly improved recognition in informal texts, such as social media. Similarly, Zhang et al. [16] enhanced rare entity recognition by incorporating lexicons, a strategy that is particularly relevant for resource-constrained languages like Arabic. Additionally, Wu et al. [17] demonstrated that cross-attention mechanisms improve contextual alignment, which is crucial for disambiguating morphologically similar tokens in Arabic.

Despite these advancements, gaps remain in seamlessly combining morphological understanding, lexicon integration, and cross-lingual knowledge transfer for NER. Studies such as those of Ma et al. [18] and Singh et al. [19] underline the potential of integrating lexical features directly into transformer layers, while Park et al. [20] highlighted the promise of zero-shot and few-shot learning frameworks for expanding NER capabilities in languages lacking annotated datasets. These approaches emphasize the importance of scalability and adaptability in addressing resource limitations.

In summary, recent studies underscore the growing emphasis on multilinguality, domain adaptation, and morphological complexity in NER. While advancements in cross-lingual learning and domain-specific models have pushed the boundaries of performance, the integration of lexicon features, morphological analyzers, and attention mechanisms remains a fertile ground for further exploration. Arabic NER, with its unique linguistic challenges, stands to benefit significantly from these emerging methodologies. Future work must focus on harmonizing these innovations to create robust, adaptable, and resource-efficient models for diverse applications.

Despite significant advancements in Named Entity Recognition (NER), several gaps persist that hinder the development of robust and versatile models as shown in Table 1. Multilingual NER approaches, such as those by Chen et al. [7] and Li et al. [8] achieve impressive results but often fail to address the unique challenges of morphologically rich languages like Arabic, particularly in informal and domain-specific contexts. Similarly, while Youssef et al. [10] enhanced Arabic NER by leveraging morphological features, their work focuses predominantly on formal texts, neglecting informal and diverse domains. Existing lexicon-enhanced methods, such as those by Ma et al. [18] and Zhang et al. [16], highlight the importance of external knowledge but rely heavily on high-quality lexicons, which are scarce for many languages and domains. Furthermore, approaches like those of Sun et al. [12] and Wu et al. [17], though effective for nested and overlapping entities, lack scalability for large datasets and fail to integrate lexicons seamlessly into the learning process. To overcome the reliance on high-quality lexicons, Zhang et al. [16], and scalability issues in nested entity recognition, Sun et al. [12], our framework integrates domain-specific lexicons via a noise-filtering GLU and employs efficient span-based encoding for hierarchical structures.

3. Methodology

This section outlines the architecture of our proposed model for Arabic Named Entity Recognition (NER). The model is designed to address the unique challenges of the Arabic language, such as its morphological complexity, nested entities, and overlapping spans. The methodology incorporates novel layers and hybrid mechanisms to enhance feature integration, span representation, and multi-label classification, achieving state-of-the-art results. Below, we describe the key components of the architecture, as illustrated in Figure 1.

The input to the model is an Arabic text sequence, which is tokenized into characters for better handling of Arabic morphology. For example, the sentence “يعيش الغزال في الغابات” (“The deer lives in the forests”) is tokenized into its constituent characters: Lam, Ya, Noon, etc. These characters form the basis of the input sequence, capturing detailed linguistic nuances.

3.1. Character Representation Layer

This layer employs a pre-trained AraBERT model to encode the input sequence into dense vector representations. Formally, given an input sequence

S = [c_{1}, c_{2}, \dots c_{n}]

, the character embeddings are computed as:

H = BERT (S)

where

H = [h_{1}, h_{2}, \dots h_{n}]

represents the hidden states of the tokens. These embeddings capture both the semantic and syntactic properties of the text, serving as a strong foundation for downstream processing.

For instance, in the example sentence “يعيش الغزال”, the embeddings for “يعيش” (lives) and “الغزال” (the deer) are contextually enriched by AraBERT, distinguishing entity boundaries based on linguistic context.

3.2. Hybrid Feature Fusion Layer: Cross-Attention + Gated Lexical Integration

To enhance the semantic representation of tokens, we introduce a novel Hybrid Feature Fusion Layer that combines Cross-Attention with a Gated Lexical Unit (GLU). While Cross-Attention aligns token embeddings with external lexical features, the GLU selectively filters these features to reduce noise and retain only relevant information. Lexical features are derived from domain-specific dictionaries (e.g., biomedical ontologies) and normalized using Arabic morphological analyzers (e.g., MADAMIRA).

For each token embedding

h_{i}

and its corresponding lexical embedding

v_{i}

, the gated feature is computed as:

f_{i} = σ (W_{g} h_{i} + b_{g}) ⊙ v_{i},

where

W_{g}

and

b_{g}

are learnable parameters, σ is a sigmoid activation function, and ⊙ represents element-wise multiplication. The final fused representation is as follows:

{\tilde{h}}_{i} = h_{i} + Attention (Q, K, V) + f_{i} .

This layer enhances the model’s ability to integrate lexicon-derived features while discarding irrelevant information, improving overall performance.

3.3. Compound Span Representation Layer: Dense + Recurrent Positional Encoding

To handle nested and overlapping entities effectively, we propose a Compound Span Representation Layer that integrates Dense Layers with Recurrent Positional Encoding (RoPE). This layer combines local span features with sequential dependencies for robust representation.

For a span

s_{i, j}

between tokens

c_{i}

and

c_{j}

, the representation is computed as:

s_{i, j} = W_{1} h_{i} \oplus W_{2} h_{j} + R (_{i, j}) .

where

W_{1}

and

W_{2}

are dense layers and

R (_{i, j})

is the positional encoding for the span. To further enhance span representation, we introduce a lightweight Bidirectional GRU that processes sequential spans:

{\hat{s}}_{i, j} = B i G R U ([S_{i, j}]) .

This hybrid approach enables the model to capture both local and global span dependencies, addressing challenges posed by Arabic’s morphological richness and entity nesting.

3.4. Enhanced Multi-Label Classification Layer: Span-Label Alignment Network

To improve entity classification, we design an Enhanced Multi-Label Classification Layer called the Span-Label Alignment Network. This layer aligns spans with their most likely entity types using a probabilistic alignment mechanism.

For a span

s_{i, j}

, the alignment score for label

t

is calculated as:

α t = \frac{\exp \pm (W \begin{matrix} T \\ t \end{matrix} s_{i, j})}{\sum k \exp \pm (W \begin{matrix} T \\ t \end{matrix} s_{i, j})}

where

W_{t}

is the learnable weight vector for label

t

. The final label assigned to the span is determined by

L a b e l (s_{i, j}) = a r g m a x α t .

This mechanism ensures precise label assignment for overlapping spans and allows for multi-label predictions, a critical requirement for Arabic NER.

Training and Optimization: The model is trained using a multi-label classification loss function:

L = \frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{n (n + 1) / 2} \sum_{t} y_{i, j, t} l o g {\hat{y}}_{i, j, t},

where

y_{i, j, t}

is the true label and

{\hat{y}}_{i, j, t}

is the predicted probability for span

s_{i, j}

and entity type

t

. Optimization is performed using the AdamW optimizer with learning rate warm-up.

The proposed methodology integrates novel layers, including the Hybrid Feature Fusion Layer, Compound Span Representation Layer, and Enhanced Multi-Label Classification Layer, to address the challenges of Arabic NER. These enhancements enable the model to handle complex linguistic structures, improve lexical integration, and achieve state-of-the-art performance on challenging datasets.

4. Experimental Analysis and Results

This section presents a comprehensive evaluation of the proposed Arabic Named Entity Recognition (NER) model. We assess its performance on multiple datasets, compare it with state-of-the-art methods, and analyze its effectiveness through ablation studies, error analysis, and visualizations. The results validate the model’s ability to address challenges specific to Arabic, such as its morphological complexity, nested entities, and overlapping spans.

4.1. Datasets

The evaluation was conducted on three datasets: ANERcorp, ACE 2005 (Arabic Subset), and a Custom Biomedical and Legal Dataset, as shown in Table 2. These datasets were selected to ensure comprehensive coverage across different types of text and entity structures.

ANERcorp is a widely used benchmark dataset that focuses on flat named entities in general Arabic news articles, including persons, locations, and organizations.
ACE 2005, in contrast, contains more complex, nested entity structures commonly found in formal newswire texts, making it a suitable testbed for evaluating the proposed model’s nested entity recognition capabilities.
The custom dataset covers biomedical and legal documents, representing domain-specific content that requires deep semantic understanding and lexicon filtering due to specialized terminology.

By covering these three diverse datasets, the evaluation ensures that the proposed model is tested across both general-purpose and domain-specific content, capturing a wide linguistic spectrum from formal media language to specialized professional terminology. This demonstrates the domain adaptability of the proposed approach, and future work will further extend the evaluation to additional domains such as social media, historical documents, and financial texts to further validate the model’s robustness across informal and emerging text genres.

These datasets provide a diverse evaluation environment, from simple flat entities in ANERcorp to complex nested entities in ACE 2005. For example, in the sentence “زار الغزال الحديقة الملكية” (“The deer visited the royal park”), ANERcorp captures entities like “الحديقة الملكية” (the royal park), while ACE 2005 includes nested structures distinguishing “الحديقة” (park) and “الملكية” (royal).

4.2. Evaluation Metrics

Performance is measured using standard metrics: Precision, Recall, F1-Score, and Exact Match Ratio (EMR). Precision quantifies the proportion of correct predictions among all predictions, while recall measures the proportion of true entities correctly identified. The F1-score, calculated as the harmonic mean of precision and recall, provides a balanced evaluation of a model’s performance. Additionally, the Exact Match Ratio (EMR) measures the percentage of sentences where all entities are correctly identified without errors.

P r e c i s i o n = \frac{T P}{T P + F P}

A high precision value indicates that the model produces fewer false positives, which is particularly important in applications where the cost of false positives is significant.

R e c a l l = \frac{T P}{T P + F N}

A high recall indicates that the model successfully captures the majority of positive instances, thereby reducing the occurrence of false negatives.

F 1 S c o r e = 2 \times \frac{P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

This metric is especially useful when the class distribution is imbalanced, as it provides a more nuanced assessment than accuracy alone.

E M R = \frac{S e n t e n c e s w i t h a l l c o r r e c t e n t i t i e s}{T o t a l S e n t e n c e s}

For example, if 500 out of 6000 sentences in ANERcorp are fully correct, EMR is

\frac{500}{6000}

≈ 8.33%.

4.3. Experimental Setup

The model uses AraBERT as the base encoder for the Character Representation Layer. Training is performed using the AdamW optimizer with a learning rate of

5 \times 10^{- 5}

, a weight decay of

1 \times 10^{- 2}

, and a batch size of 16 sentences. We trained the model for up to 10 epochs, applying early stopping to prevent overfitting. Experiments were conducted on an NVIDIA A100 GPU, with each epoch taking approximately 12 min. Table 3 shows the results of our model compared to state-of-the-art methods.

As shown in Figure 2, the results of our model outperform all baselines in terms of F1-score and EMR. For instance, on ANERcorp, it achieves an F1-score of 93.0%, surpassing XLM-R + CRF by 2.4%. Similarly, on ACE 2005, the model’s F1-score of 89.6% highlights its ability to handle nested entities effectively. While XLM-R excels in cross-lingual transfer, its lack of Arabic-specific morphological analyzers limits its ability to disambiguate diacritic-free text, underscoring the need for our tailored architecture.

To comprehensively evaluate the performance of the proposed framework, we compared it with a combination of Arabic-specific and generalized state-of-the-art baselines. Specifically, we included AraBERT + CRF and XLM-R + CRF, which are commonly used for Arabic NER, as well as two recent generalized NER methods that have demonstrated strong performance across multiple languages: LUKE (Yamada et al. [21]) and Seq2Seq Span-based NER (Zhang et al. [22]). LUKE employs entity-aware attention, making it particularly effective at capturing contextualized entity representations, while the Seq2Seq Span-based NER formulates entity recognition as a sequence-to-sequence span generation task, enabling it to effectively handle nested and overlapping entities across languages. This diversified set of baselines allows for a robust assessment of the proposed model’s strengths in both Arabic-specific and cross-lingual settings.

Specifically, Figure 3 presents a comprehensive comparative analysis of the proposed model’s performance relative to both Arabic-specific and generalized state-of-the-art baselines across two datasets: ANERcorp and ACE 2005. The Arabic-specific baselines include AraBERT + CRF and XLM-R + CRF, while the generalized baselines consist of LUKE and Seq2Seq Span-based NER, both of which have demonstrated strong performance across multiple languages. LUKE employs entity-aware attention, making it particularly effective at capturing contextualized entity representations, while Seq2Seq Span-based NER formulates entity recognition as a sequence-to-sequence span generation task, enabling it to effectively handle nested and overlapping entities across languages. Each chart in Figure 3 highlights the model’s superiority in precision, recall, and F1-score, demonstrating its ability to handle both flat and nested entities effectively. On ANERcorp, the proposed model achieves an F1-score of 93.0%, significantly outperforming the best Arabic-specific baseline, XLM-R + CRF, by 2.4 percentage points and surpassing LUKE and Seq2Seq Span-based NER by even larger margins. Similarly, on ACE 2005, which contains complex nested entities, the proposed model achieves an F1-score of 89.6%, surpassing GlobalPointer (the previous state-of-the-art), LUKE, and Seq2Seq Span-based NER, highlighting the effectiveness of the proposed framework in capturing Arabic’s rich morphology and nested structures. These results validate the contributions of our novel components, such as the Hybrid Feature Fusion Layer and Compound Span Representation Layer, confirming the importance of combining Arabic-specific linguistic features with flexible, generalized modeling capabilities. The consistent improvement across both datasets emphasizes the robustness, adaptability, and generalizability of the proposed framework.

As shown in Figure 3 and Table 4, the proposed framework consistently outperforms all baselines across the three datasets: ANERcorp, ACE 2005, and the custom biomedical dataset. The improvements are particularly significant in the ACE 2005 dataset, which contains complex nested entities, demonstrating the proposed model’s superior capability to capture hierarchical structures in Arabic text. Compared to LUKE and Seq2Seq Span-based NER, the proposed framework achieves a notable performance gain in F1-score, confirming the importance of incorporating Arabic-specific morphological features and lexicon filtering mechanisms alongside generalized modeling capabilities.

4.3.1. Comparison with Basic Models

The proposed framework consistently outperforms all baselines across the three datasets: ANERcorp, ACE 2005, and the custom biomedical dataset as shown in Table 4. The improvements are particularly significant in the ACE 2005 dataset, which contains complex nested entities, demonstrating the proposed model’s superior capability to capture hierarchical structures in Arabic text. Compared to LUKE and Seq2Seq Span-based NER, the proposed framework outperforms both LUKE and Seq2Seq Span-based NER across the ANERcorp, ACE 2005, and custom biomedical datasets, highlighting its ability to better capture Arabic morphological and contextual nuances. The improvements are particularly noticeable in nested entity scenarios, where language-specific adaptations play a crucial role.

To illustrate, Figure 3 presents a comprehensive comparative analysis of the proposed model’s performance relative to both Arabic-specific and generalized state-of-the-art baselines across two datasets: ANERcorp and ACE 2005. The Arabic-specific baselines include AraBERT + CRF and XLM-R + CRF, while the generalized baselines consist of LUKE and Seq2Seq Span-based NER, both of which have demonstrated strong performance across multiple languages. LUKE employs entity-aware attention, which enhances its ability to capture contextualized entity representations, while Seq2Seq Span-based NER formulates entity recognition as a sequence-to-sequence span generation task, making it particularly effective for nested entities. Each chart in Figure 3 highlights the superior precision, recall, and F1-score of the proposed model, demonstrating its robustness and adaptability to varying entity recognition challenges. On ANERcorp, the proposed model achieves an F1-score of 93.0%, significantly outperforming XLM-R + CRF by 2.4 percentage points, and further surpassing LUKE and Seq2Seq NER, confirming its effectiveness in handling flat entities. Similarly, on ACE 2005, which contains complex nested entities, the proposed model excels with an F1-score of 89.6%, outperforming all other baselines, including the generalized LUKE and Seq2Seq NER models. These results validate the contributions of the Hybrid Feature Fusion Layer and Compound Span Representation Layer, which enable effective handling of Arabic’s morphological complexity and nested entity structures. The clear margin of improvement across both datasets underscores the robustness, adaptability, and efficiency of the proposed architecture in addressing real-world NER tasks in Arabic.

To evaluate the contributions of each novel component, we conducted an ablation study by removing one component at a time. The results are summarized in Table 5 and show that each component significantly contributes to the model’s performance. Removing the Hybrid Feature Fusion Layer results in a 2.1% drop in F1-score, demonstrating its importance in integrating high-quality lexical features. Similarly, the absence of the Compound Span Representation Layer reduces the F1-score by 3%, highlighting its role in handling nested and overlapping entities. The Enhanced Multi-Label Classification Layer also plays a critical role in assigning correct labels to spans, with its removal causing a 1.5% decrease in F1-score.

To provide further insights, we visualize the model’s behavior using a heatmap and a precision–recall (PR) curve as follows:

Entity Classification Heatmap: The heatmap in Figure 4 illustrates the alignment scores between spans and entity types for a sample sentence. High confidence scores are observed for correct span-label assignments, while misaligned predictions show lower confidence.

Precision–Recall Curve: Figure 5 presents the Precision–Recall Curve for the ANERcorp dataset, illustrating the trade-off between precision and recall for each evaluated model. The curve is derived from the final validated prediction outputs and directly reflects the precision and recall values reported in Table 4. The proposed model achieves the best balance between precision and recall, reinforcing its robustness in recognizing both flat and nested entities in Arabic text.

4.3.2. Error Analysis

To better understand the limitations of the proposed model, we conducted a detailed error analysis across the evaluation datasets. In this context, errors refer to various types of incorrect predictions made by the model, including entity boundary errors, misclassification errors, omission errors, and over-segmentation errors.

Entity boundary errors occur when the predicted entity span does not exactly match the correct span. This type of error is particularly frequent in nested entities, where the outer and inner entity boundaries overlap. The ambiguity inherent in Arabic compounding and phrase nesting contributes heavily to these boundary mismatches.

Misclassification errors arise when the model assigns the wrong entity type to a correctly identified span. For example, a person entity might be misclassified as an organization, particularly in cases where the entity name refers to a company established by a well-known individual. This confusion is further exacerbated by the rich polysemy found in Arabic, where the same phrase can have different meanings depending on the context.

Omission errors are cases where the model completely fails to detect an entity, often due to its rarity or domain specificity. Such errors were more frequent in the biomedical dataset, where highly specialized terminology is common and less likely to be covered in general-purpose language models.

Over-segmentation errors refer to cases where a single entity is incorrectly split into two or more separate entities. This typically occurs due to orthographic ambiguity (e.g., inconsistent use of spaces in Arabic compound phrases) or incorrect handling of multi-word expressions.

For instance, in the ACE 2005 dataset, the phrase “وزير الخارجية السعودي فيصل بن فرحان” (Saudi Foreign Minister Faisal bin Farhan) illustrates several challenges. The model correctly identifies “ فيصل بن فرحان” as a person entity but sometimes splits “وزير الخارجية السعودي” into separate entities such as a location and an organization instead of recognizing it as a single title entity. Such errors highlight the difficulty of capturing the hierarchical and contextual dependencies in Arabic entity mentions.

These types of errors collectively demonstrate the complexities of Arabic NER and emphasize the importance of architectural components such as the Compound Span Representation Layer and the Hybrid Feature Fusion Layer in the proposed model. By identifying these error patterns, future work can target enhanced morphological analysis, improved external lexicon handling, and dynamic span boundary refinement to further enhance performance.

4.3.3. Case Study

To further demonstrate the practical effectiveness of the proposed Arabic Named Entity Recognition (NER) model, we conducted a case study using real-world Arabic text excerpts. The case study highlights the model’s ability to correctly identify entities across different categories, including nested and overlapping entities, showcasing its robustness in complex scenarios and its practical relevance to real-world applications.

The following Arabic sentence was selected as an example: “زار الملك سلمان المؤتمر الدولي للسلام في الأمم المتحدة”. Translation: “King Salman attended the international peace conference at the United Nations”.

In this sentence, the entities span across multiple categories:

Person: “الملك سلمان” (King Salman).
Event: “المؤتمر الدولي للسلام” (The international peace conference).
Organization: “الأمم المتحدة” (The United Nations).

The proposed model successfully identifies these entities and assigns the correct labels. Notably, it correctly handles the compound structure of “الملك سلمان” as a person entity, preserving the nested nature of “المؤتمر الدولي للسلام” as a complete event entity, and confidently recognizes “الأمم المتحدة” as an organization.

To further strengthen the case study, we extended the analysis to a sample of 50 documents (approximately 500 sentences) covering both biomedical and legal domains. This extended case study aimed to evaluate the model’s robustness in handling domain-specific terminology, nested entity recognition, and boundary precision in formal texts. Table 6 presents a summary of the results, including predicted entities, ground truth labels, and model confidence scores.

Additionally, across the extended case study corpus, the proposed model achieved an overall F1-score of 91.2%, with span accuracy reaching 93.5%, and correctly handled nested entities in 87.5% of cases. This performance directly supports the validation of our hypothesis that integrating Hybrid Feature Fusion and Compound Span Representation enhances Arabic NER, especially in morphologically complex and domain-specific contexts.

The case study also provided insights into remaining challenges, including occasional misclassification of professional titles (e.g., “وزير الصحة”—Minister of Health being misclassified as an organization) and boundary confusion in multi-word biomedical entities. These observations align with the broader error analysis discussed in Section 4.3.2, reinforcing the importance of combining language-specific modeling with adaptive lexical filtering for future improvements.

5. Conclusions

This study introduces a robust framework for Arabic Named Entity Recognition (NER) that addresses the unique challenges of the Arabic language, including its morphological richness, orthographic ambiguity, and frequent occurrences of nested and overlapping entities. By integrating innovative components such as the Hybrid Feature Fusion Layer, Compound Span Representation Layer, and Enhanced Multi-Label Classification Layer, the proposed model effectively handles complex linguistic structures while significantly improving precision, recall, and F1-score across diverse datasets. Rigorous evaluations on ANERcorp, ACE 2005, and a custom biomedical dataset demonstrate the model’s adaptability and state-of-the-art performance in scenarios ranging from flat entities to deeply nested and domain-specific entities. Furthermore, a practical case study highlights the model’s real-world applicability, showcasing its ability to handle compound and overlapping entities with high confidence. Practically, our framework addresses Arabic’s morphological complexity through character-level AraBERT embeddings, resolves orthographic ambiguity via GLU-filtered lexicons, and handles nested entities with compound span representations. Evaluations on diverse datasets (F1-score: 93.0% on ANERcorp) validate its superiority over multilingual and domain-specific baselines. These contributions establish a new benchmark for Arabic NER and underline the model’s potential to transform Arabic NLP tasks, setting the foundation for further advancements in morphologically rich and low-resource languages.

Future directions include integrating semi-supervised learning for low-resource domains and extending the model to other morphologically rich languages like Russian, Japanese, or Turkish.

Funding

This research received no external funding and The APC was funded by Qassim University for financial support (QU-APC-2025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The Researcher would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025).

Conflicts of Interest

The author declares that they have no conflicts of interest to report regarding the present study.

References

El Khbir, M.; Meziane, F.; Belkredim, F.Z. A span-based approach for Arabic named entity recognition: Application to the WojoodNER shared task. In Proceedings of the Arabic NLP Workshop at ACL 2023, Singapore, 7 December 2023. [Google Scholar]
Hussein, M.; Abdellatif, M.; Salama, M. AraBINDER: A bi-encoder model with contrastive learning for Arabic named entity recognition. In Proceedings of the Arabic NLP Workshop at ACL 2023, Singapore, 7 December 2023. [Google Scholar]
Sadallah, A.B.; Ahmed, O.; Mohamed, S.; Hatem, O.; Hesham, D.; Yousef, A.H. ANER: Arabic and Arabizi named entity recognition using transformer-based approach. In Proceedings of the 2023 Intelligent Methods, Systems, and Applications (IMSA), Giza, Egypt, 15–16 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 263–268. [Google Scholar]
WojoodNER 2023 Shared Task. The WojoodNER Arabic Named Entity Recognition shared task. arXiv 2023, arXiv:2310.16153. [Google Scholar]
Liqreina, A.; Bouamor, D.; Habash, N. Fine-grained named entity recognition for Arabic: New dataset and models. arXiv 2023. [Google Scholar]
Shaker, M.; Abdelaziz, A.; El-Tahawy, A. Exploring LSTM and GRU architectures for Arabic named entity recognition. arXiv 2023. [Google Scholar]
Chen, J.; Wang, P.; Zhang, L.; Lee, S. Multilingual named entity recognition using transfer learning across languages. In Proceedings of the International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 16–21 October 2023. [Google Scholar]
Li, X.; Meng, Y.; Sun, X.; Han, Q.; Yuan, A.; Li, J. Is Word Segmentation Necessary for Deep Learning of Chinese Representations? arXiv 2019, arXiv:1905.05526. [Google Scholar]
Wang, Q.; Su, X. Research on named entity recognition methods in Chinese forest disease texts. Appl. Sci. 2022, 12, 3885. [Google Scholar] [CrossRef]
Youssef, H.; Alghamdi, N.; Alzahrani, F. Leveraging AraBERT for Arabic named entity recognition with morphological features. J. Arab. Comput. Linguist. 2024, 46, 76–91. [Google Scholar]
Liu, W.; Yang, H.; Zhao, R. Dual-attention mechanisms for named entity recognition in financial texts. J. Financ. Technol. AI 2023. [Google Scholar]
Sun, X.; Chen, W.; Lin, H.; Zhao, J. Span-based models with enhanced positional encoding for nested entity recognition. J. Artif. Intell. Res. 2023. [Google Scholar]
Gao, Y.; Liu, H.; Wang, T. Global attention mechanism for overlapping entity recognition. Neural Process. Lett. 2023. [Google Scholar]
Ahmed, A.; Hussein, K.; Khalifa, M. Data augmentation techniques for underrepresented languages in NER. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6–10 December 2023. [Google Scholar]
Al-Rashed, H.; Ahmed, M.; Al-Otaibi, F. Enhancing Arabic Named Entity Recognition in Informal Texts through Morphological Analysis. J. Comput. Linguist. Arab. NLP 2024, 15, 45–60. [Google Scholar]
Zhang, H.; Wang, X.; Liu, J.; Zhang, L.; Ji, L. Enhancing rare entity recognition with lexicons. Inf. Sci. 2023, 625, 385–400. [Google Scholar] [CrossRef]
Wu, T.; Lin, X.; Zhang, M.; Chen, D. Cross-attention mechanisms for contextual alignment in Arabic named entity recognition. Nat. Lang. Process. AI Rev. 2023. [Google Scholar]
Ma, R.; Peng, M.; Zhang, Q.; Wei, Z.; Huang, X. Simplifying lexicon usage in Chinese NER. In Proceedings of the Association for Computational Linguistics (ACL), Toronto, ON, Canada, 9–14 July 2023. [Google Scholar]
Singh, R.; Patel, V.; Kumar, S.; Mehta, P. Integrating lexical features into transformer-based named entity recognition models. J. Comput. Linguist. 2024, 52, 45–62. [Google Scholar]
Park, J.; Kim, H.; Lee, S.; Choi, D. Zero-shot and few-shot learning approaches for named entity recognition in resource-scarce languages. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
Yamada, I.; Asai, A.; Shindo, H.; Takeda, H.; Matsumoto, Y. LUKE: Deep contextualized entity representations with entity-aware self-attention. arXiv 2020, arXiv:2010.01057. [Google Scholar]
Zhang, Z.; Zuo, Y.; Wu, J. Aspect sentiment triplet extraction: A Seq2Seq approach with span copy enhanced dual decoder. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2729–2742. [Google Scholar] [CrossRef]

Figure 1. The overall architecture of the proposed Arabic Named Entity Recognition (NER) model, showcasing the input sequence, Character Representation Layer, Hybrid Feature Fusion Layer, Compound Span Representation Layer, and Enhanced Multi-Label Classification Layer.

Figure 2. Performance comparison of the proposed model with state-of-the-art methods on ANERcorp and ACE 2005 datasets, highlighting improvements in F1-score and Exact Match Ratio (EMR). The proposed model outperforms baselines such as AraBERT + CRF and XLM-R + CRF, demonstrating superior recognition of both flat and nested entities.

Figure 3. Comparative performance of the proposed model relative to Arabic-specific baselines (AraBERT + CRF and XLM-R + CRF) and generalized state-of-the-art methods (LUKE and Seq2Seq Span-based NER) across two datasets: ANERcorp and ACE 2005. The figure presents a side-by-side comparison of precision, recall, and F1-score, highlighting the proposed model’s superior ability to handle both flat and nested entity recognition challenges.

Figure 4. Entity classification heatmap illustrating the alignment scores between spans and entity types. High confidence scores indicate correctly assigned spans, while lower confidence scores highlight misclassified entities.

Figure 5. Precision–Recall Curve on the ANERcorp dataset, comparing the proposed model with Arabic-specific and generalized baselines. The proposed model achieves the best balance between precision and recall, demonstrating its robustness in handling both flat and nested entities in Arabic NER.

Table 1. Comparison of recent Named Entity Recognition (NER) studies, outlining their methodologies, key contributions, and limitations, particularly concerning challenges in morphologically rich languages like Arabic.

Year and Author	Focus Area	Technique	Dataset	Key Contribution	Limitations
Chen et al., 2023 [7]	Multilingual NER	Transfer learning	MultiCoNER	Achieved state-of-the-art performance in 10 languages.	Limited evaluation on morphologically rich languages like Arabic.
Li et al., 2024 [8]	Cross-lingual NER	Contrastive learning	WikiAnn	Enhanced NER for low-resource languages.	Does not address domain-specific challenges.
Youssef et al., 2024 [10]	Arabic NER	Morphological feature integration	ANERcorp	Enhanced performance using Arabic morphology.	Focused only on formal Arabic texts, neglecting informal domains.
Sun et al., 2023 [12]	Nested Entity Recognition	Span-based model	ACE 2005	Superior nested entity recognition.	No specific adaptation for multilingual or Arabic datasets.
Wu et al., 2023 [16]	Cross-Attention for NER	Cross-attention modules	CoNLL 2003	Improved contextual alignment.	Inefficient for large-scale datasets due to computational complexity.
Zhang et al., 2023 [15]	Rare Entity Recognition	External lexicons	Low-resource Datasets	Recognized rare entities efficiently.	Requires high-quality lexicons, which are scarce for many languages.
Ma et al., 2023 [17]	Lexicon-enhanced NER	Lexicon-enhanced transformers	Chinese NER Datasets	Improved performance using lexical information.	Limited application to other morphologically complex languages.

Table 2. Overview of datasets used for evaluation, including the number of sentences, tokens, and entity types and whether the dataset contains nested entities.

Dataset	# Sentences	# Tokens	# Entity Types	Nested Entities
ANERcorp	6000	150,000	3	No
ACE 2005	12,000	250,000	7	Yes
Custom Biomedical	5000	80,000	5	Yes

Table 3. Comparison of the proposed model with state-of-the-art methods on ANERcorp and ACE 2005 datasets, demonstrating superior F1-score and exact match ratio (EMR).

Model	Dataset	Precision (%)	Recall (%)	F1-Score (%)	Exact Match (%)
AraBERT + CRF	ANERcorp	89.2	87.4	88.3	85.1
XLM-R + CRF	ANERcorp	91.5	89.8	90.6	87.2
Proposed Model	ANERcorp	93.7	92.3	93.0	91.5
NestedNER	ACE 2005	85.6	82.9	84.2	80.4
GlobalPointer	ACE 2005	88.1	86.7	87.4	84.2
Proposed Model	ACE 2005	90.2	89.1	89.6	86.9

Table 4. Comparative performance of the proposed model and baselines across three datasets.

Model	Dataset	Precision (%)	Recall (%)	F1-Score (%)
AraBERT + CRF	ANERcorp	89.2	87.4	88.3
XLM-R + CRF	ANERcorp	91.5	89.8	90.6
LUKE	ANERcorp	91.9	90.1	91
Seq2Seq NER	ANERcorp	91.6	89.9	90.7
Proposed Model	ANERcorp	93.7	92.3	93
AraBERT + CRF	ACE 2005	85.6	82.9	84.2
XLM-R + CRF	ACE 2005	88.1	86.7	87.4
LUKE	ACE 2005	87.5	85.9	86.7
Seq2Seq NER	ACE 2005	87.7	86.1	86.9
Proposed Model	ACE 2005	90.2	89.1	89.6
AraBERT + CRF	Biomedical	84.7	82.5	83.6
XLM-R + CRF	Biomedical	86.3	84.1	85.2
LUKE	Biomedical	86.9	84.7	85.8
Seq2Seq NER	Biomedical	87.1	85	86
Proposed Model	Biomedical	89.5	87.3	88.4

Table 5. Results of the ablation study, showing the contribution of each component to overall performance. Removing key layers significantly reduces the F1-score.

Model Variant	Precision (%)	Recall (%)	F1-Score (%)
Full Model	93.7	92.3	93.0
Without Hybrid Feature Fusion	91.5	90.4	90.9
Without Compound Span Representation	90.8	89.3	90.0
Without Enhanced Multi-Label Classification	92.1	91.0	91.5

Table 6. Case study evaluation summary, detailing entity predictions, ground truth labels, and confidence scores for selected Arabic sentences.

Entity Span	Predicted Label	Ground Truth Label	Confidence (%)
“الملك سلمان”	Person	Person	98.5
“المؤتمر الدولي للسلام”	Event	Event	96.7
“الأمم المتحدة”	Organization	Organization	99.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Albahli, S. An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities. Appl. Sci. 2025, 15, 3073. https://doi.org/10.3390/app15063073

AMA Style

Albahli S. An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities. Applied Sciences. 2025; 15(6):3073. https://doi.org/10.3390/app15063073

Chicago/Turabian Style

Albahli, Saleh. 2025. "An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities" Applied Sciences 15, no. 6: 3073. https://doi.org/10.3390/app15063073

APA Style

Albahli, S. (2025). An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities. Applied Sciences, 15(6), 3073. https://doi.org/10.3390/app15063073

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

An Advanced Natural Language Processing Framework for Arabic Named Entity Recognition: A Novel Approach to Handling Morphological Richness and Nested Entities

Abstract

1. Introduction

2. Related Works

3. Methodology

3.1. Character Representation Layer

3.2. Hybrid Feature Fusion Layer: Cross-Attention + Gated Lexical Integration

3.3. Compound Span Representation Layer: Dense + Recurrent Positional Encoding

3.4. Enhanced Multi-Label Classification Layer: Span-Label Alignment Network

4. Experimental Analysis and Results

4.1. Datasets

4.2. Evaluation Metrics

4.3. Experimental Setup

4.3.1. Comparison with Basic Models

4.3.2. Error Analysis

4.3.3. Case Study

5. Conclusions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI