Search Results (463)

Search Parameters:
Keywords = named entity recognition

13 pages, 718 KB  
Article
Construction of Mineral Resources Knowledge Graph: A Case Study of Linyi City, Shandong Province, China
by Xiaocai Liu, Yong Zhang, Ming Liu, Yonglin Yao, Kun Liu, Yongqing Tong and Xinqi Zheng
Appl. Sci. 2026, 16(6), 2749; https://doi.org/10.3390/app16062749 - 13 Mar 2026
Abstract
The efficient exploration and development of mineral resources rely on deep mining and correlation analysis of massive, multi-source, and unstructured geological data. Knowledge graph technology provides a structured solution for integrating fragmented knowledge in the field of mineral resources. This study takes the iron ore resources of Linyi City, Shandong Province as a typical case and proposes a methodological framework for automatically constructing a regional mineral resource knowledge graph from unstructured text. First, seven types of mineral entities (location, ore body, scale, type, attitude, alteration, development degree) and five semantic relationships (type, scale, location, inclusion, development) were defined, and a high-quality Chinese annotated corpus containing 10,434 entities and 6660 relationships was constructed through domain ontology design. Second, BiLSTM-CRF, BiGRU-CRF, and several BERT-based models were compared on the named entity recognition task; the optimized BERT-CRF model achieved the best performance (F1 score: 82.8%). On the relation extraction task, the BERT-based model significantly outperformed the traditional PCNN and BiGRU models, achieving an F1 score of 98.14%. Finally, based on the extracted triples, a visual knowledge graph of iron ore resources in Linyi City was constructed with the Neo4j graph database, enabling knowledge association queries and visual navigation.
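The final step of this pipeline, loading extracted triples into Neo4j, can be sketched as plain Cypher generation. A minimal illustration assuming a generic (head, relation, tail) triple format; the entity names below are hypothetical, not data from the paper:

```python
# Sketch: turning extracted (head, relation, tail) triples into Cypher
# statements for a Neo4j import. Triple values are hypothetical examples.

def triples_to_cypher(triples):
    """Emit one MERGE-based Cypher statement per (head, rel, tail) triple."""
    statements = []
    for head, rel, tail in triples:
        statements.append(
            f"MERGE (h:Entity {{name: '{head}'}}) "
            f"MERGE (t:Entity {{name: '{tail}'}}) "
            f"MERGE (h)-[:`{rel}`]->(t)"
        )
    return statements

stmts = triples_to_cypher([("OreBodyA", "LOCATED_IN", "Linyi")])
print(stmts[0])
```

Using MERGE rather than CREATE keeps repeated entity mentions from producing duplicate nodes.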

31 pages, 1936 KB  
Article
A Multi-Scale Heterogeneous Graph Attention Network for Nested Named Entity Recognition with Syntactic and Dependency Tree Structures
by Yifan Zhao, Lin Zhang and Yangshuyi Xu
Electronics 2026, 15(6), 1183; https://doi.org/10.3390/electronics15061183 - 12 Mar 2026
Abstract
Nested Named Entity Recognition (nested NER) frequently encounters challenges such as boundary conflicts, difficulty in modeling long-distance dependencies, and inadequate representation of deep nested semantics arising from overlapping spans and hierarchical inclusion relationships among entities. This research presents a multi-scale heterogeneous graph attention network (MHGAT) to enable end-to-end recognition of nested entities through the collaborative modeling of structure and semantics. The model first introduces a structural integration mechanism that consolidates the hierarchical constraints of the syntactic tree and the inter-word relationships of the dependency tree within a single heterogeneous graph space. It then generates 1/2/3-hop multi-scale subgraphs and employs multi-scale subgraph attention to adaptively integrate information from different structural receptive fields, balancing the local cues of shallow entities with the global dependencies of deep entities. Experimental results on the ACE2004, ACE2005, and GENIA benchmark datasets indicate that the proposed method surpasses several strong baselines in overall performance and nested entity recognition, with notable advantages in identifying long and low-frequency entities. We further evaluate MHGAT on KBP2017 and GermEval2014 to validate generalization across datasets and languages.
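The 1/2/3-hop subgraph idea can be illustrated with powers of an adjacency matrix, which give the nodes reachable within k hops. A minimal sketch on a toy path graph, not the paper's heterogeneous-graph implementation:

```python
import numpy as np

def k_hop_reachability(adj, k):
    """True at (i, j) when j is reachable from i in at most k hops."""
    n = len(adj)
    reach = np.eye(n, dtype=int)
    power = np.eye(n, dtype=int)
    for _ in range(k):
        power = (power @ adj > 0).astype(int)   # extend paths by one hop
        reach = ((reach + power) > 0).astype(int)
    return reach.astype(bool)

# Path graph 0-1-2-3: node 3 is three hops away from node 0.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
print(k_hop_reachability(adj, 2)[0])  # node 0 reaches 0, 1, 2 but not 3
```

A multi-scale model would build one masked attention pattern per hop count from matrices like these.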

51 pages, 1067 KB  
Article
Language Models Are Polyglots: Language Similarity Predicts Cross-Lingual Transfer Learning Performance
by Juuso Eronen, Michal Ptaszynski, Tomasz Wicherkiewicz, Robert Borges, Katarzyna Janic, Zhenzhen Liu, Tanjim Mahmud and Fumito Masui
Mach. Learn. Knowl. Extr. 2026, 8(3), 65; https://doi.org/10.3390/make8030065 - 7 Mar 2026
Abstract
Selecting a source language for zero-shot cross-lingual transfer is typically done by intuition or by defaulting to English, despite large performance differences across language pairs. We study whether linguistic similarity can predict transfer performance and support principled source-language selection. We introduce quantified WALS (qWALS), a typology-based similarity metric derived from features in the World Atlas of Language Structures, and evaluate it against existing similarity baselines. Validation uses three complementary signals: computational similarity scores, zero-shot transfer performance of multilingual transformers (mBERT and XLM-R) on four NLP tasks (dependency parsing, named entity recognition, sentiment analysis, and abusive language identification) across eight languages, and an expert-linguist similarity survey. Across tasks and models, higher linguistic similarity is associated with better transfer, and the survey provides independent support for the computational metrics.
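A typology-based similarity of this kind can be sketched as agreement over shared WALS-style features. The feature values below are invented toy entries for illustration, not the qWALS metric itself:

```python
# Sketch: score two languages by the fraction of shared typological
# features on which they agree. Feature dicts are made-up toy values.

def typological_similarity(feats_a, feats_b):
    shared = set(feats_a) & set(feats_b)
    if not shared:
        return 0.0
    agree = sum(feats_a[f] == feats_b[f] for f in shared)
    return agree / len(shared)

finnish = {"word_order": "SVO", "cases": "many", "tone": "none"}
estonian = {"word_order": "SVO", "cases": "many", "gender": "none"}
print(typological_similarity(finnish, estonian))  # 1.0 on the 2 shared features
```

Only features with values for both languages count, which matters in WALS where coverage is sparse.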

17 pages, 1296 KB  
Article
SSKEM: A Global Pointer Network Model for Joint Entity and Relation Extraction in Storm Surge Texts
by Yebin Chen, Mingjie Xie, Yongli Chen, Zhenduo Dou and Weihong Li
ISPRS Int. J. Geo-Inf. 2026, 15(3), 105; https://doi.org/10.3390/ijgi15030105 - 3 Mar 2026
Abstract
Storm surges are catastrophic marine disasters that pose severe threats to coastal populations, making the rapid extraction of key information from multi-source texts critical for effective emergency response. However, existing extraction methods often struggle with complex linguistic challenges, such as identifying nested entities (e.g., overlapping geographic names), capturing relationships across long texts, and handling the disparity between formal official reports and unstructured social media data. To address these limitations, this study proposes a Storm Surge Knowledge Extraction Model (SSKEM) based on Global Pointer Networks. Drawing on a domain-specific dataset of 4000 records from government bulletins, news reports, and social media, the model uses a unified matrix decoding mechanism to treat entity and relation extraction as a single holistic task. Experimental results demonstrate that the model achieves an F1-score of 88.4%, outperforming strong baseline models by 5.5%. Notably, it improves the recognition accuracy of complex nested entities by 13.7% and the recall of cross-sentence relations by 18.2%. Furthermore, the model exhibits high computational efficiency, with processing speed suitable for real-time applications, and effectively bridges the performance gap between standardized and fragmented data sources. This research provides a robust technical solution for transforming heterogeneous disaster big data into actionable knowledge for decision-support systems.
(This article belongs to the Special Issue Spatial Data Science and Knowledge Discovery)
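The unified matrix decoding used by pointer-network models of this kind can be sketched as thresholding an upper-triangular span-score matrix. The scores below are toy values, not model output:

```python
import numpy as np

# Sketch: each cell (i, j) of a score matrix rates the span from token i
# to token j; spans scoring above a threshold are kept as entities.

def decode_spans(scores, threshold=0.0):
    """Return (start, end) pairs for upper-triangular cells above threshold."""
    spans = []
    n = scores.shape[0]
    for i in range(n):
        for j in range(i, n):
            if scores[i, j] > threshold:
                spans.append((i, j))
    return spans

scores = np.full((4, 4), -1.0)
scores[0, 1] = 2.3   # span covering tokens 0..1
scores[2, 2] = 1.1   # single-token span
print(decode_spans(scores))  # [(0, 1), (2, 2)]
```

Because nested spans occupy different cells, overlapping entities decode without conflict, which is the appeal of the matrix formulation.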

22 pages, 1675 KB  
Article
HybridNER: A Multi-Model Ensemble Framework for Robust Named Entity Recognition—From General Domains to Adversarial GNSS Scenarios
by Yixuan Liu, Jing Zhang, Ruipeng Luan and Xuewen Yu
Sensors 2026, 26(5), 1553; https://doi.org/10.3390/s26051553 - 2 Mar 2026
Abstract
Named entity recognition (NER), a core task in natural language processing (NLP), remains constrained by heavy reliance on annotated data, limited cross-domain generalization, and difficulty in recognizing out-of-vocabulary entities. In specialized domains such as the analysis of Global Navigation Satellite System (GNSS) countermeasures, including anti-jamming and anti-spoofing, where datasets are small and domain knowledge is scarce, existing models exhibit marked performance degradation. To address these challenges, we propose HybridNER, a framework that integrates locally trained span-based models with large language models (LLMs). The approach employs a span prediction metasystem that first fuses outputs from multiple base learners by computing span-to-label compatibility scores and assigns an uncertainty estimate to each candidate entity. Entities with uncertainty above a preset threshold are then routed to an LLM for second-stage classification, and the final decision integrates both sources to realize their complementary strengths. Experiments on multiple general-purpose and domain-specific datasets show that HybridNER achieves higher precision, recall, and F1 than traditional ensemble methods such as majority voting and weighted voting, with especially pronounced gains in specialized domains, thereby improving the robustness and generalization of NER.
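The uncertainty-based routing step can be sketched with Shannon entropy over a candidate span's label distribution. The threshold and probabilities are illustrative assumptions, not the paper's settings:

```python
import math

# Sketch: compute the entropy of a span's label distribution and route it
# to a second-stage classifier (the LLM) when entropy exceeds a preset
# threshold. Threshold and probabilities are illustrative.

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def route(label_probs, threshold=0.5):
    return "llm" if entropy(label_probs) > threshold else "accept"

print(route([0.97, 0.02, 0.01]))  # confident distribution -> "accept"
print(route([0.4, 0.35, 0.25]))   # flat distribution -> "llm"
```

A flat distribution has high entropy and triggers the LLM fallback; a peaked one is accepted locally, keeping LLM calls rare.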

46 pages, 7510 KB  
Article
Semantic Modeling of Ship Collision Reports: Ontology Design, Knowledge Extraction, and Severity Classification
by Hongchu Yu, Xiaohan Xu, Zheng Guo, Tianming Wei and Lei Xu
J. Mar. Sci. Eng. 2026, 14(5), 448; https://doi.org/10.3390/jmse14050448 - 27 Feb 2026
Abstract
With the expansion of water transportation networks and increasing traffic intensity, maritime accidents have become frequent, posing significant threats to safety and property. This study presents a knowledge graph-driven framework for maritime accident analysis, addressing the limitations of traditional risk analysis methods in extracting and organizing unstructured accident data. First, a standardized ontology for ship collision accidents is developed, defining core concepts such as event, spatiotemporal behavior, causation, consequence, responsibility, and decision-making. Advanced natural language processing models, including a lexicon-enhanced LEBERT-BiLSTM-CRF and a K-BERT-BiLSTM-CRF incorporating ship collision knowledge triplets, are proposed for named entity recognition and relation extraction, with F1-score improvements of 6.7% and 1.2%, respectively. The constructed accident knowledge graph integrates heterogeneous data, enabling semantic organization and efficient retrieval. Leveraging graph topological features, an accident severity classification model is established, where a graph-feature-driven LSTM-RNN demonstrates robust performance, especially with imbalanced data. Comparative experiments show the superiority of this approach over conventional models such as XGBoost and random forest. Overall, this research demonstrates that knowledge graph-driven methods can significantly enhance maritime accident knowledge extraction and severity classification, providing strong information support and methodological advances for intelligent accident management and prevention.
(This article belongs to the Section Ocean Engineering)

26 pages, 3199 KB  
Article
EDAER: Entropy-Driven Approach for Entity and Relation Extraction in Chinese Cyber Threat Intelligence
by Yong Li, Xiuping Li, Yangbai Zhang, Zhiqiang Liu, Xiaowei Li, Qi Xu and Xiaolin Chang
Entropy 2026, 28(3), 261; https://doi.org/10.3390/e28030261 - 27 Feb 2026
Abstract
Cyber threat intelligence (CTI) strengthens system security by taking raw threat data from various sources and transforming it into actionable insights that enable organizations to predict, detect, and respond to cyber threats. Named entity recognition (NER) and relation extraction (RE) are the key tasks of CTI data mining. However, current CTI NER and RE research focuses mainly on English CTI, which is not directly transferable to Chinese CTI due to fundamental linguistic and terminological differences. Moreover, the limited existing studies on Chinese CTI do not effectively address prediction uncertainty in low-resource scenarios where entities and relations are sparse. This work aims to improve NER and RE performance in low-resource Chinese CTI scenarios and makes two major contributions. First, we construct a Chinese CTI dataset that includes 16 types of entities and 9 types of relations, more than the existing open-source dataset on Chinese CTI. Second, we propose an entropy-driven approach for entity and relation extraction (EDAER). EDAER is the first to combine RoBERTa_wwm, Mamba, RDCNN, and CRF for NER tasks; the first to apply entropy to quantify the uncertainty of model predictions in NER and RE tasks in Chinese CTI scenarios; and the first to apply contrastive learning in Chinese CTI scenarios, learning meaningful features by maximizing the similarity between positive samples and minimizing the similarity between negative samples. Extensive experimental results on public datasets and our own demonstrate that the proposed approach performs best. The results show that (1) RoBERTa_wwm significantly outperforms BERT on both NER and RE tasks; (2) Mamba outperforms BiLSTM on the NER task; (3) the entropy-based dynamic gating mechanism contributes to performance improvements in both NER and RE tasks; and (4) the uncertainty-guided contrastive learning mechanism improves performance on the NER task.
(This article belongs to the Special Issue Entropy in Machine Learning Applications, 2nd Edition)

22 pages, 1247 KB  
Article
An Integrated Text Mining Approach for Discovering Pharmacological Effects, Drug Combinations, and Repurposing Opportunities of ACE Inhibitors
by Nadezhda Yu. Biziukova, Polina I. Savosina, Dmitry S. Druzhilovskiy, Olga A. Tarasova and Vladimir V. Poroikov
Int. J. Mol. Sci. 2026, 27(4), 2044; https://doi.org/10.3390/ijms27042044 - 22 Feb 2026
Abstract
The rapidly expanding body of biomedical literature encompasses a wealth of information concerning the pharmacological effects, mechanisms of action, adverse reactions, and repurposing potential of small-molecule therapeutics. Nevertheless, the systematic extraction and integration of this knowledge continue to pose substantial challenges. In this study, we propose an integrated text-mining framework for the automated extraction and structured representation of information on the biological activities of low-molecular-weight compounds, exemplified by angiotensin-converting enzyme (ACE) inhibitors as a representative pharmacological class. A corpus comprising over 20,000 PubMed titles and abstracts reporting in vitro, in vivo, and clinical investigations of ACE inhibitors was assembled. Chemical compounds, proteins/genes, and diseases were recognized using a previously developed named entity recognition model based on conditional random fields. Entity-level associations were extracted at the sentence level through a rule-based approach employing manually curated pattern phrases, followed by normalization via automated queries to PubChem, UniProt, and the Human Disease Ontology. The proposed methodology facilitated the extraction of approximately 22,000 unique and normalized associations encompassing drug-target, drug-disease, and drug-drug relationships. In addition to confirming well-established therapeutic effects and clinically recognized drug combinations, the analysis identified underexplored pharmacological activities of ACE inhibitors, including antineoplastic, antifibrotic, and neuropsychiatric properties, along with mechanistic associations involving matrix metalloproteinases and neurotrophic signaling pathways. Collectively, these findings underscore the potential of automated literature mining to advance systematic knowledge integration and data-driven hypothesis generation in the contexts of drug repurposing and safety evaluation.
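The sentence-level, rule-based association step can be sketched with a single pattern phrase. The regex and entities below are simplified toy examples, not the authors' curated pattern set:

```python
import re

# Sketch: a manually curated pattern phrase links a recognized drug to a
# disease within one sentence. Pattern and entities are toy examples.

PATTERN = re.compile(r"(\w+)\s+(?:inhibits|reduces|treats)\s+(\w+)", re.I)

def extract_associations(sentence):
    """Return (subject, object) pairs matched by the pattern phrase."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(sentence)]

print(extract_associations("Captopril reduces hypertension in rats."))
# [('Captopril', 'hypertension')]
```

In a real pipeline the matched strings would then be normalized against resources such as PubChem or the Human Disease Ontology.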

25 pages, 2689 KB  
Article
Construction of Bridge Maintenance Knowledge Graph Based on Deep Learning
by Yiming Zhang and Hongshuai Gao
Appl. Sci. 2026, 16(4), 1985; https://doi.org/10.3390/app16041985 - 17 Feb 2026
Abstract
Bridge maintenance decision-making is challenged by the "data-rich but knowledge-poor" nature of unstructured inspection and maintenance reports. A bridge maintenance knowledge graph (BMKG) construction framework is proposed, developed from a corpus of 275 inspection reports, to enable structured representation of engineering knowledge and decision support. A standards-aligned domain ontology provides semantic constraints for downstream information extraction and organization. Building on this ontology, a RoBERTa-BiGRU-CRF named entity recognition (NER) model is developed, achieving a precision of 90.8%, a recall of 93.8%, and a micro-averaged F1-score (micro-F1) of 92.3%. Inter-annotator agreement for the NER annotations was quantified using Cohen's kappa, yielding κ = 0.86. To avoid the cost of large-scale relation annotation, relations are constructed using interpretable, rule-based constraints. Through a manual verification audit of randomly sampled relationship instances under a strict exact-match criterion (i.e., requiring exact matches for entity boundaries, entity types, and relationship types), an overall verification rate of 93.67% was obtained. Unlike existing KG methods that rely heavily on annotated data, the BMKG framework integrates ontological constraints with a rule-driven approach, prioritizing interpretability and reducing dependency on large-scale relation labeling. Consequently, the resulting knowledge graph supports semantic retrieval and visual exploration, enabling efficient disease-to-recommendation queries for refined bridge maintenance management.

20 pages, 2695 KB  
Article
Entity Recognition for Coal Mine Hydraulic Support Installation Process Driven by LLM LoRA Fine-Tuning
by Yunrui Wang, Xi He and Xintong Sui
Appl. Sci. 2026, 16(4), 1943; https://doi.org/10.3390/app16041943 - 15 Feb 2026
Abstract
Hydraulic supports are the pivotal equipment in coal mining face operations, but the complexity of their installation procedure knowledge impedes efficient knowledge extraction and utilization, hindering scientifically grounded installation guidance and ultimately affecting installation efficiency. This study develops a domain-specific large language model (LLM) for named entity recognition (NER)-based knowledge extraction to enhance the efficiency of hydraulic support installation. First, a few-shot data augmentation method is introduced to enrich hydraulic support assembly process data, providing a robust dataset for fine-tuning the LLM. Then, Low-Rank Adaptation (LoRA) fine-tuning is leveraged to adapt the model efficiently. Comparative analysis of post-fine-tuning performance across multiple evaluation metrics showed that the fine-tuned Deepseek-R1-7b-Distill model performed best, and it was therefore selected as the domain-specific LLM for NER in hydraulic support installation processes. The experimental results demonstrate that the entity recognition F1 score across all entity types reached 0.8887, validating the efficacy of the methodology and providing technical support for enhancing the installation efficiency of hydraulic supports.
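The Low-Rank Adaptation idea can be sketched in a few lines: the frozen weight W is adapted by a scaled low-rank product B @ A, so only r * (d_in + d_out) parameters are trained. Shapes and initialization follow the standard LoRA recipe; the values are illustrative toys, not the paper's model:

```python
import numpy as np

# Sketch of LoRA: W stays frozen; only the small factors A and B train.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero init

W_adapted = W + (alpha / r) * (B @ A)     # effective weight at inference

# With B initialized to zero, adaptation starts as an exact no-op.
print(np.allclose(W_adapted, W))  # True
```

The zero initialization of B is the standard trick that makes fine-tuning start from the pretrained behavior exactly.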

24 pages, 2988 KB  
Article
Multimodal Named-Entity Recognition Based on Symmetric Fusion with Contrastive Learning
by Yubo Wu and Junqiang Liu
Symmetry 2026, 18(2), 353; https://doi.org/10.3390/sym18020353 - 14 Feb 2026
Abstract
Multimodal named-entity recognition (MNER) aims to identify entity information by leveraging multimodal features. As recent research shifts to multi-image scenarios, existing methods overlook modality noise and lack effective cross-modal interaction, leading to prominent semantic gaps. This study integrates symmetric multimodal fusion with contrastive learning, proposing a novel model with a symmetric-encoder collaborative architecture. To mitigate modality noise, a modality refinement encoder maps each modality to an exclusive space, while an aligned encoder bridges gaps via contrastive learning in a shared space, going beyond the superficial cross-modal mapping of existing models. Building on these encoders, a symmetric fusion module achieves deep bidirectional fusion, breaking the limitations of traditional one-way or concatenation-based approaches. Experiments on two datasets show the model outperforms state-of-the-art methods, with ablation experiments validating the symmetric encoder's contribution to consistent multimodal learning.
(This article belongs to the Section Computer)
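The contrastive-learning component can be sketched as an InfoNCE-style loss over cosine similarities: the anchor is pulled toward its matching embedding and pushed from the rest. Embeddings and temperature below are toy assumptions, not the paper's encoders:

```python
import numpy as np

# Sketch: contrastive loss that rewards high cosine similarity with the
# positive and low similarity with negatives. Vectors are toy values.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(anchor, positive, negatives, temp=0.1):
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = np.array(sims) / temp
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive should dominate

anchor = np.array([1.0, 0.0])
loss_good = contrastive_loss(anchor, np.array([0.9, 0.1]), [np.array([0.0, 1.0])])
loss_bad = contrastive_loss(anchor, np.array([0.0, 1.0]), [np.array([0.9, 0.1])])
print(loss_good < loss_bad)  # True: an aligned positive gives lower loss
```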

22 pages, 494 KB  
Article
LinguoNER: A Language-Agnostic Framework for Named Entity Recognition in Low-Resource Languages with a Focus on Yambeta
by Philippe Tamla, Stephane Donna, Tobias Bigala, Dilan Nde, Maxime Yves Julien Manifi Abouh and Florian Freund
Informatics 2026, 13(2), 31; https://doi.org/10.3390/informatics13020031 - 11 Feb 2026
Abstract
This paper presents LinguoNER, a practical and extensible framework for bootstrapping Named Entity Recognition (NER) in extremely low-resource languages, demonstrated on Yambeta, a Bantu language spoken by a minority community in Cameroon. Due to scarce digital resources and the absence of annotated corpora, Yambeta has remained largely underrepresented in Natural Language Processing (NLP). LinguoNER addresses this gap by providing a methodologically transparent end-to-end workflow that integrates corpus acquisition, gazetteer-driven automatic annotation, tokenizer training, transformer fine-tuning, and multi-level evaluation in settings where large-scale manual annotation is infeasible. Using a Bible-derived corpus as a linguistically stable starting point, we release the first publicly available Yambeta NER dataset (≈25,000 tokens) annotated with the CoNLL BIO scheme and a restricted entity schema (PER/LOC/ORG). Because labels are generated via dictionary-based annotation, the corpus is best characterized as silver-standard; credibility is strengthened through recorded dictionaries, transparency logs, expert-in-the-loop validation on sampled subsets, and complementary qualitative error analysis. We additionally train a dedicated Yambeta WordPiece tokenizer that preserves tone markers and diacritics, and fine-tune a bert-base-cased transformer for token classification. On a held-out test split, LinguoNER achieves strong token-level performance (Precision = 0.989, Recall = 0.981, F1 = 0.985), substantially outperforming a dictionary-only gazetteer baseline (ΔF1 ≈ 0.36). Per-entity-type evaluation further indicates improvements beyond surface-form matching, while remaining errors are linguistically motivated and primarily involve multi-word entity boundaries, agglutinative constructions, and tone-/diacritic-sensitive tokenization. We emphasize that results are restricted to a Bible domain and a limited label space, and should be interpreted as proof-of-concept evidence rather than claims of broad out-of-domain generalization. Overall, LinguoNER provides a reproducible blueprint for bootstrapping NER resources in underrepresented languages and supports future work on broader corpus sources (e.g., news, OPUS, JW300), additional African languages (e.g., Yoruba, Igbo, Bassa), and the iterative creation of expert-refined datasets and gold-standard subsets.
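The gazetteer-driven annotation step can be sketched as longest-match dictionary lookup emitting BIO tags. The gazetteer entries below are invented examples, not the released Yambeta resource:

```python
# Sketch: dictionary-based silver annotation. Longest gazetteer entries
# are tried first so multi-word names win over single-token matches.

GAZETTEER = {("Yaoundé",): "LOC", ("Jean", "Mbarga"): "PER"}

def bio_tag(tokens):
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for entry, label in sorted(GAZETTEER.items(), key=lambda kv: -len(kv[0])):
            if tuple(tokens[i:i + len(entry)]) == entry:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(entry)):
                    tags[j] = f"I-{label}"
                i += len(entry) - 1
                break
        i += 1
    return tags

print(bio_tag(["Jean", "Mbarga", "visited", "Yaoundé"]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```

Labels produced this way are silver-standard by construction, which is why the paper pairs them with expert validation on sampled subsets.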

28 pages, 2122 KB  
Article
AraCoNER: Arabic Complex NER with Gold and Silver Labels
by Wesam Alruwaili, Najwa Altwaijry and Isra Al-Turaiki
Electronics 2026, 15(4), 750; https://doi.org/10.3390/electronics15040750 - 10 Feb 2026
Abstract
Named entity recognition (NER) is a fundamental task in natural language processing. Recently, non-traditional nouns (known as complex NER) have increasingly emerged, including long noun phrases and ambiguous names, for example, Birds of Prey (and the Fantabulous Emancipation of One Harley Quinn), Among Us, and Chicago, which may refer to a city or a novel. Such rapidly growing entity names pose significant challenges for NER. Arabic NER research is usually limited to flat and nested entities, overlooking complex entities due to limited resources, the language's rich morphology, and context ambiguity. Such tasks require high-quality annotated data, yet most existing approaches rely heavily on supervised learning, which depends on large amounts of labeled data, and acquiring large annotated datasets is costly and labor-intensive. We construct our corpus by leveraging the superior performance of large language models (LLMs), which have driven recent advances in dataset generation. We propose an Arabic complex NER (AraCoNER) dataset with semantically ambiguous and complex named entities, using both gold and silver labels. We investigate several agent-based annotation frameworks in addition to the plain LLM to determine the most efficient annotator for our task. Then, we introduce LLMAAA+, an LLM-agent-based framework that integrates an LLM-powered agent as an annotator into an active learning loop to efficiently select what should be labeled. Instead of solely synthesizing the training data from LLMs, we enhance both the annotation and training phases, generating pseudo-labels and using k-NN sampling for in-context examples. This approach ensures both efficiency and quality, with minimal, cost-effective human involvement. Our results show that combining an LLM (GPT-4) with a structured agent framework (Google ADK) yields the highest annotation accuracy, even with a limited number of annotated examples, supporting the proposed LLM-agent-based active learning framework.

27 pages, 2824 KB  
Article
De-Identification of Electronic Health Records Using Deep Learning and Transformers
by Fatih Dilmaç and Adil Alpkocak
Appl. Sci. 2026, 16(4), 1692; https://doi.org/10.3390/app16041692 - 8 Feb 2026
Abstract
The adoption of electronic health records (EHRs) has significantly advanced healthcare by enabling extensive data storage and analysis for clinical decisions and research. However, sensitive personally identifiable information (PII) within EHRs presents major challenges concerning patient privacy, data security, and regulatory compliance, making effective automated de-identification techniques for detecting and removing protected health information (PHI) essential. This study presents one of the first focused studies on de-identification of Turkish EHRs, comparing traditional sequence-based neural architectures with transformer-based large language models (LLMs) for PHI detection. We introduce and publicly release a manually annotated benchmark dataset of Turkish EHRs covering diverse PHI types, supporting further research on Turkish clinical text. Two methodologies were evaluated: bidirectional long short-term memory (BiLSTM) models (with and without Conditional Random Fields (CRFs)) and six fine-tuned pre-trained LLMs. Experiments demonstrated the superior performance of transformer-based LLMs, which achieved a macro F1 score of 92.20%, significantly outperforming the traditional methods. Among the sequence-based models, BiLSTM + CRF attained an 83.00% F1 score, exceeding the baseline BiLSTM's 78.40%. The results highlight the potential of transformer-based models for privacy-preserving processing of Turkish clinical text and underscore the importance of annotated benchmark datasets.
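The macro F1 score used here weights every PHI class equally, so rare classes count as much as frequent ones; it can be sketched directly. The labels below are toy values, not the paper's data:

```python
# Sketch: macro F1 = unweighted mean of per-class F1 scores.

def macro_f1(true, pred):
    labels = set(true) | set(pred)
    f1s = []
    for lab in labels:
        tp = sum(t == p == lab for t, p in zip(true, pred))
        fp = sum(p == lab and t != lab for t, p in zip(true, pred))
        fn = sum(t == lab and p != lab for t, p in zip(true, pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1(["NAME", "DATE", "O", "O"], ["NAME", "O", "O", "O"]))  # 0.6
```

Missing the single DATE token drags the macro average down sharply, which is exactly the sensitivity to rare classes that motivates the metric.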

21 pages, 4705 KB  
Article
Computational and Graph-Theoretic Analysis of Legislative Networks: New Zealand’s Mental Health Act as a Case Study
by Iman Ardekani, Maryam Ildoromi, Neda Sakhaee, Sewmini Gunawardhana and Parmida Raeis
Information 2026, 17(2), 161; https://doi.org/10.3390/info17020161 - 5 Feb 2026
Abstract
This paper presents a computational framework for constructing and analysing a focal legislative citation network. A depth-limited expansion strategy generates subgraphs that capture the local structural environment of a seed Act while avoiding the global hub dominance present in whole-corpus analyses. Centrality measures and community detection show how the seed Act's perceived influence changes with network radius. To incorporate semantic information, we develop and apply a Large Language Model (LLM)-assisted topic modelling method in which representative keywords and LLM-generated summaries form a compact text representation that is converted into a Term Frequency-Inverse Document Frequency (TF-IDF) document-term matrix. Although demonstrated on New Zealand's mental health legislation, the framework generalises to any legislative corpus or jurisdiction. Integrating graph-theoretic structure with LLM-assisted semantic modelling provides a scalable approach for analysing legislative systems, identifying domain-specific clusters, and supporting computational studies of legal evolution and policy impact.
(This article belongs to the Section Information Theory and Methodology)
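The TF-IDF document-term matrix step can be sketched from scratch. The documents are toy strings; a real pipeline would add normalization and term filtering:

```python
import math

# Sketch: build a TF-IDF matrix from short text representations (e.g.,
# keywords plus summaries). Documents below are toy examples.

def tfidf_matrix(docs):
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    matrix = []
    for doc in tokenized:
        row = []
        for t in vocab:
            tf = doc.count(t) / len(doc)
            idf = math.log(n / df[t]) + 1.0  # smoothed so shared terms keep weight
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

vocab, m = tfidf_matrix(["mental health act", "health records act"])
print(vocab)  # ['act', 'health', 'mental', 'records']
```

Terms unique to one document (here "mental") score higher than terms shared by all documents (here "act"), which is what lets the matrix separate topical clusters.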
