Search Results (232)

Search Parameters:
Keywords = named entity recognition (NER)

20 pages, 2313 KB  
Article
A Cybersecurity NER Method Based on Hard and Easy Labeled Training Data Discrimination
by Lin Ye, Yue Wu, Hongli Zhang and Mengmeng Ge
Sensors 2025, 25(24), 7627; https://doi.org/10.3390/s25247627 - 16 Dec 2025
Viewed by 276
Abstract
Although general-domain Named Entity Recognition (NER) has achieved substantial progress in the past decade, its application to cybersecurity NER is hindered by the lack of publicly available annotated datasets, primarily because of the sensitive and privacy-related nature of security data. Prior research has largely sought to improve performance by expanding annotation volumes, while overlooking the intrinsic characteristics of training data. In this study, we propose a cybersecurity Named Entity Recognition (NER) method based on hard and easy labeled training data discrimination. Firstly, a hybrid strategy that integrates a deep learning (DL)-based discriminator and a rule-based discriminator is employed to partition the original dataset into hard and easy samples. Secondly, the proportion of hard and easy data in the training set is adjusted to determine the optimal balance. Finally, a data augmentation algorithm is applied to the partitioned dataset to further improve model performance. The results demonstrate that, under a fixed total training data size, the ratio of hard to easy samples has a significant impact on NER performance, with the optimal strategy achieved at a 1:1 proportion. Moreover, the proposed method further improves the overall performance of cybersecurity NER. Full article
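The ratio experiment described in this abstract can be pictured with a small sketch. It assumes the dataset has already been split into hard and easy pools (the discriminators themselves are not reproduced here); `mix_hard_easy`, `hard_fraction`, and the toy pools are illustrative names, not taken from the paper.

```python
import random


def mix_hard_easy(hard, easy, total, hard_fraction, seed=0):
    """Draw `total` training samples with a fixed share of hard samples.

    `hard` and `easy` are lists of already-partitioned samples (assumption).
    """
    n_hard = round(total * hard_fraction)
    n_easy = total - n_hard
    if n_hard > len(hard) or n_easy > len(easy):
        raise ValueError("not enough samples in one of the pools")
    rng = random.Random(seed)
    mixed = rng.sample(hard, n_hard) + rng.sample(easy, n_easy)
    rng.shuffle(mixed)  # avoid feeding all hard samples first
    return mixed


# Hypothetical pools; a 1:1 ratio corresponds to hard_fraction=0.5.
hard_pool = [f"hard-{i}" for i in range(50)]
easy_pool = [f"easy-{i}" for i in range(50)]
train = mix_hard_easy(hard_pool, easy_pool, total=40, hard_fraction=0.5)
```

Keeping `total` fixed while varying `hard_fraction` isolates the effect of the ratio from the effect of simply having more data.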

22 pages, 5105 KB  
Article
From News to Knowledge: Leveraging AI and Knowledge Graphs for Real-Time ESG Insights
by Omar Mohmmed Hassan Nassar, Fahimeh Jafari and Chanchal Jain
Sustainability 2025, 17(24), 11128; https://doi.org/10.3390/su172411128 - 12 Dec 2025
Viewed by 648
Abstract
Traditional Environmental, Social, and Governance (ESG) assessments rely heavily on corporate disclosures and third-party ratings, which are often delayed, inconsistent, and prone to bias. These limitations leave stakeholders without timely visibility into rapidly evolving ESG events. These assessment frameworks also fail to capture the dynamic nature of ESG issues reflected in public news media. This research addresses these limitations by proposing and implementing an automated framework utilising Artificial Intelligence (AI), specifically Natural Language Processing (NLP) and Knowledge Graphs (KG), to analyse ESG news data for companies listed on major stock indices. The methodology involves several stages: collecting a registry of target companies; retrieving relevant news articles; applying Named Entity Recognition (NER), sentiment analysis, and ESG domain classification; and constructing a linked property knowledge graph to structure the extracted information semantically. The framework culminates in an interactive dashboard for visualising and querying the resulting graph database. The resulting knowledge graph supports comparative inferential analytics across indices and sectors, uncovering divergent ESG sentiment profiles and thematic priorities that traditional reports overlook. The analysis also reveals comparative insights into sentiment trends and ESG focus areas across different exchanges and sectors, offering perspectives often missing from traditional methods. Findings indicate differing ESG sentiment profiles and thematic focuses between the UK (FTSE) and Australian (ASX) indices within the analysed dataset. This study confirms AI/KG’s potential for a modular, dynamic, and semantically rich ESG intelligence approach, transforming unstructured news into interconnected insights. Limitations and areas for future work, including model refinement and integration of financial data, are also discussed. 
This proposed framework augments traditional ESG evaluations with automated, scalable, and context-rich analysis.

18 pages, 1606 KB  
Article
CLFF-NER: A Cross-Lingual Feature Fusion Model for Named Entity Recognition in the Traditional Chinese Festival Culture Domain
by Shenghe Yang, Kun He, Wei Li and Yingying He
Informatics 2025, 12(4), 136; https://doi.org/10.3390/informatics12040136 - 5 Dec 2025
Viewed by 369
Abstract
With the rapid development of information technology, there is an increasing demand for the digital preservation of traditional festival culture and the extraction of relevant knowledge. However, existing research on Named Entity Recognition (NER) for Chinese traditional festival culture lacks support from high-quality corpora and dedicated model methods. To address this gap, this study proposes a Named Entity Recognition model, CLFF-NER, which integrates multi-source heterogeneous information. The model operates as follows: first, Multilingual BERT is employed to obtain the contextual semantic representations of Chinese and English sentences. Subsequently, a Multiconvolutional Kernel Network (MKN) is used to extract the local structural features of entities. Then, a Transformer module is introduced to achieve cross-lingual, cross-attention fusion of Chinese and English semantics. Furthermore, a Graph Neural Network (GNN) is utilized to selectively supplement useful English information, thereby alleviating the interference caused by redundant information. Finally, a gating mechanism and Conditional Random Field (CRF) are combined to jointly optimize the recognition results. Experiments were conducted on the public Chinese Festival Culture Dataset (CTFCDataSet), and the model achieved 89.45%, 90.01%, and 89.73% in precision, recall, and F1 score, respectively—significantly outperforming a range of mainstream baseline models. Meanwhile, the model also demonstrated competitive performance on two other public datasets, Resume and Weibo, which verifies its strong cross-domain generalization ability. Full article

45 pages, 5846 KB  
Article
A Machine Learning Framework for Harvesting and Harmonizing Cultural and Touristic Data
by Kimon Deligiannis, Christos Tryfonopoulos, Paraskevi Raftopoulou, Costas Vassilakis, Vassilis Kaffes and Spiros Skiadopoulos
Information 2025, 16(12), 1038; https://doi.org/10.3390/info16121038 - 28 Nov 2025
Viewed by 1091
Abstract
Cultural and touristic information is increasingly available through a multitude of heterogeneous sources, including official repositories, community platforms, and open data initiatives. While prominent landmarks are typically covered across sources, less-known attractions are also documented with varying degrees of detail, resulting in fragmented, overlapping, or complementary content. To enable integrated access to this wealth of information, harvesting and consolidation mechanisms are required to collect, reconcile, and unify distributed content referring to the same entities. This paper presents a machine learning-driven framework for harvesting, homogenizing, and augmenting cultural and touristic data across multilingual sources. Our approach addresses entity resolution, duplication detection, and content harmonization, laying the foundation for enriched, unified representations of attractions and points of interest. The framework is designed to support scalable integration pipelines and can be deployed in applications aimed at tourism promotion, digital heritage, and smart travel services. Full article
(This article belongs to the Special Issue Editorial Board Members’ Collection Series: "Information Systems")

22 pages, 2058 KB  
Article
CSIER-FM: A Chinese Sensitive Information Entity Recognition Framework with Few-Shot Learning and Multi-Instance Integration
by Yage Jin, Rui Ma, Hongming Chen, Yanhua Wu and Qingxin Li
Electronics 2025, 14(22), 4451; https://doi.org/10.3390/electronics14224451 - 14 Nov 2025
Viewed by 360
Abstract
The rapid growth of information technology has posed new challenges for recognizing sensitive information in Chinese text. Traditional rule-based, statistical, and machine learning methods face limitations in domain adaptability and Chinese semantic comprehension. Recently, large language models have offered new opportunities to address these challenges with their strong semantic understanding and transfer capabilities. Building on this, we propose the CSIER-FM, which is a Chinese sensitive information entity recognition framework that integrates prompt design, few-shot learning, parameter-efficient fine-tuning, and multi-instance integration. We design multiple prompt templates and incorporate a k-nearest neighbor (k-NN) sample selection strategy to optimize prompts and enhance the effectiveness of few-shot learning. In addition, we apply Low-Rank Adaptation (LoRA) to efficiently fine-tune the locally deployed Qwen2.5-7B model. Finally, a multi-instance integration mechanism is employed to allow different models to focus on specific entity categories, thereby reducing category confusion and enhancing overall F1-score. We evaluate the CSIER-FM on the Chinese ResumeNER dataset. The results demonstrate that fine-tuning the Qwen2.5-7B model raises its F1-score from 0.7106 to above 0.9518. With the addition of multi-instance integration, the F1-score further increased to 0.9553. The findings indicate that the CSIER-FM effectively integrates named entity recognition with Chinese sensitive information detection, enabling efficient recognition of multiple sensitive entity types in Chinese text. Full article
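The k-NN sample selection step mentioned in this abstract might look roughly like the following sketch. It assumes each candidate example already has an embedding vector and uses cosine similarity as the distance; the function names, the toy two-dimensional vectors, and the choice of cosine similarity are all assumptions for illustration, not details confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_demonstrations(query_vec, pool, k):
    """Return the texts of the k pool items most similar to the query.

    `pool` is a list of (text, embedding) pairs (hypothetical structure).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy embeddings; a real system would use a sentence encoder.
pool = [
    ("example A", [1.0, 0.0]),
    ("example B", [0.0, 1.0]),
    ("example C", [0.9, 0.1]),
]
demos = select_demonstrations([1.0, 0.0], pool, k=2)
```

The selected texts would then be placed into the prompt template as few-shot demonstrations ahead of the sentence to be labeled.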

13 pages, 624 KB  
Article
Contrastive Learning with Gaussian Embeddings and Self-Attention for Few-Shot Named Entity Recognition
by Yihao Zhang, Wei Chen and Lei Ma
Appl. Sci. 2025, 15(21), 11819; https://doi.org/10.3390/app152111819 - 6 Nov 2025
Viewed by 701
Abstract
Named entity recognition (NER) in few-shot scenarios plays a critical role in entity annotation for low-resource domains. However, existing methods are often limited to learning semantic features and intermediate representations specific to the source domain, which restricts their generalization capability when applied to unseen target domains and leads to prominent performance degradation. To address this issue, we propose a novel few-shot NER model based on contrastive learning. Specifically, the model enhances token representations through Gaussian distribution embedding and a self-attention mechanism, while adaptively optimizing the weighting parameters of the contrastive loss to achieve performance improvement. This design effectively mitigates overfitting and enhances the model’s generalization ability. Experiments on multiple datasets (including CoNLL2003, GUM, and Few-NERD) demonstrate that our approach achieves performance gains of 2.05% to 15.89% compared to state-of-the-art methods. These results confirm the effectiveness of our model in few-shot NER tasks and suggest its potential for broader application in low-resource information extraction scenarios. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

22 pages, 1208 KB  
Article
Geo-MRC: Dynamic Boundary Inference in Machine Reading Comprehension for Nested Geographic Named Entity Recognition
by Yuting Zhang, Jingzhong Li, Pengpeng Li, Tao Liu, Ping Du and Xuan Hao
ISPRS Int. J. Geo-Inf. 2025, 14(11), 431; https://doi.org/10.3390/ijgi14110431 - 2 Nov 2025
Viewed by 656
Abstract
Geographic Named Entity Recognition (Geo-NER) is a crucial task for extracting geography-related entities from unstructured text, and it plays an essential role in geographic information extraction and spatial semantic understanding. Traditional approaches typically treat Geo-NER as a sequence labeling problem, where each token is assigned a single label. However, this formulation struggles to handle nested entities effectively. To overcome this limitation, we propose Geo-MRC, an improved model based on a Machine Reading Comprehension (MRC) framework that reformulates Geo-NER as a question-answering task. The model identifies entities by predicting their start positions, end positions, and lengths, enabling precise detection of overlapping and nested entities. Specifically, it constructs a unified input sequence by concatenating a type-specific question (e.g., “What are the location names in the text?”) with the context. This sequence is encoded using BERT, followed by feature extraction and fusion through Gated Recurrent Units (GRU) and multi-scale 1D convolutions, which improve the model’s sensitivity to both multi-level semantics and local contextual information. Finally, a feed-forward neural network (FFN) predicts whether each token corresponds to the start or end of an entity and estimates the span length, allowing for dynamic inference of entity boundaries. Experimental results on multiple public datasets demonstrate that Geo-MRC consistently outperforms strong baselines, with particularly significant gains on datasets containing nested entities. Full article
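As a rough illustration of the question-plus-context input and the start/end/length decoding described in this abstract, consider the sketch below. The whitespace tokenization, the `[CLS]`/`[SEP]` layout, the helper names, and the hand-written flags are assumptions for demonstration; the paper's actual decoding rule may differ.

```python
def build_mrc_input(question, context_tokens):
    """Concatenate a type-specific question with the context, BERT-style."""
    return ["[CLS]"] + question.split() + ["[SEP]"] + context_tokens + ["[SEP]"]


def decode_spans(start_flags, end_flags, lengths):
    """Turn per-token predictions into (start, end) spans, inclusive.

    A span is kept only when the predicted length, counted from a start
    token, lands on a token that is also flagged as an end.
    """
    spans = []
    for i, is_start in enumerate(start_flags):
        if not is_start or lengths[i] < 1:
            continue
        j = i + lengths[i] - 1
        if j < len(end_flags) and end_flags[j]:
            spans.append((i, j))
    return spans


tokens = ["University", "of", "Wuhan", "campus"]
model_input = build_mrc_input("What are the location names in the text?", tokens)

# Hypothetical model outputs: "University of Wuhan" contains the nested "Wuhan".
spans = decode_spans(
    start_flags=[1, 0, 1, 0],
    end_flags=[0, 0, 1, 0],
    lengths=[3, 0, 1, 0],
)
```

Because each start token carries its own length, two entities ending at the same token can both be recovered, which a single-label-per-token tagger cannot express. This simplified version still allows only one span per start token.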

18 pages, 864 KB  
Article
Enhanced Semantic BERT for Named Entity Recognition in Education
by Ping Huang, Huijuan Zhu, Ying Wang, Lili Dai and Lei Zheng
Electronics 2025, 14(19), 3951; https://doi.org/10.3390/electronics14193951 - 7 Oct 2025
Viewed by 652
Abstract
To address the technical challenges in the educational domain named entity recognition (NER), such as ambiguous entity boundaries and difficulties with nested entity identification, this study proposes an enhanced semantic BERT model (ES-BERT). The model innovatively adopts an education domain, vocabulary-assisted semantic enhancement strategy that (1) applies the term frequency–inverse document frequency (TF-IDF) algorithm to weight domain-specific terms, and (2) fuses the weighted lexical information with character-level features, enabling BERT to generate enriched, domain-aware, character–word hybrid representations. A complete bidirectional long short-term memory-conditional random field (BiLSTM-CRF) recognition framework was established, and a novel focal loss-based joint training method was introduced to optimize the process. The experimental design employed a three-phase validation protocol, as follows: (1) In a comparative evaluation using 5-fold cross-validation on our proprietary computer-education dataset, the proposed ES-BERT model yielded a precision of 90.38%, which is higher than that of the baseline models; (2) Ablation studies confirmed the contribution of domain-vocabulary enhancement to performance improvement; (3) Cross-domain experiments on the 2016 knowledge base question answering datasets and resume benchmark datasets demonstrated outstanding precision of 98.41% and 96.75%, respectively, verifying the model’s transfer-learning capability. These comprehensive experimental results substantiate that ES-BERT not only effectively resolves domain-specific NER challenges in education but also exhibits remarkable cross-domain adaptability. Full article
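The TF-IDF weighting of domain terms mentioned in this abstract can be sketched in a few lines. This uses the textbook form (relative term frequency times the log of inverse document frequency); the exact variant, smoothing, and vocabulary used in ES-BERT are not stated in the abstract, and the toy documents are invented.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute plain TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Invented computer-education snippets, already tokenized.
docs = [
    ["stack", "queue", "the"],
    ["the", "compiler"],
    ["the", "queue", "queue"],
]
weights = tfidf(docs)
```

A term that appears in every document, such as "the" here, receives weight zero, while rarer domain terms such as "stack" or "compiler" are emphasized.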

37 pages, 5285 KB  
Article
Assessing Student Engagement: A Machine Learning Approach to Qualitative Analysis of Institutional Effectiveness
by Abbirah Ahmed, Martin J. Hayes and Arash Joorabchi
Future Internet 2025, 17(10), 453; https://doi.org/10.3390/fi17100453 - 1 Oct 2025
Viewed by 971
Abstract
In higher education, institutional quality is traditionally assessed through metrics such as academic programs, research output, educational resources, and community services. However, it is important that their activities align with student expectations, particularly in relation to interactive learning environments, learning management system interaction, curricular and co-curricular activities, accessibility, support services and other learning resources that ensure academic success and, jointly, career readiness. The growing popularity of student engagement metrics as one of the key measures to evaluate institutional efficacy is now a feature across higher education. By monitoring student engagement, institutions assess the impact of existing resources and make necessary improvements or interventions to ensure student success. This study presents a comprehensive analysis of student feedback from the StudentSurvey.ie dataset (2016–2022), which consists of approximately 275,000 student responses, focusing on student self-perception of engagement in the learning process. By using classical topic modelling techniques such as Latent Dirichlet Allocation (LDA) and Bi-term Topic Modelling (BTM), along with the advanced transformer-based BERTopic model, we identify key themes in student responses that can impact institutional strength performance metrics. BTM proved more effective than LDA for short text analysis, whereas BERTopic offered greater semantic coherence and uncovered hidden themes using deep learning embeddings. Moreover, a custom Named Entity Recognition (NER) model successfully extracted entities such as university personnel, digital tools, and educational resources, with improved performance as the training data size increased. To enable students to offer actionable feedback, suggesting areas of improvement, an n-gram and bigram network analysis was used to focus on common modifiers such as “more” and “better” and trends across student groups. 
This study introduces a fully automated, scalable pipeline that integrates topic modelling, NER, and n-gram analysis to interpret student feedback, offering reportable insights and supporting structured enhancements to the student learning experience.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing)

21 pages, 3434 KB  
Article
Deep Learning-Based Compliance Assessment for Chinese Rail Transit Dispatch Speech
by Qiuzhan Zhao, Jinbai Zou and Lingxiao Chen
Appl. Sci. 2025, 15(19), 10498; https://doi.org/10.3390/app151910498 - 28 Sep 2025
Viewed by 456
Abstract
Rail transit dispatch speech plays a critical role in ensuring the safety of urban rail operations. To enable automated and accurate compliance assessment of dispatch speech, this study proposes an improved deep learning model to address the limitations of conventional approaches in terms of accuracy and robustness. Building upon the baseline Whisper model, two key enhancements are introduced: (1) low-rank adaptation (LoRA) fine-tuning to better adapt the model to the specific acoustic and linguistic characteristics of rail transit dispatch speech, and (2) a novel entity-aware attention mechanism that incorporates named entity recognition (NER) embeddings into the decoder. This mechanism enables attention computation between words belonging to the same entity category across different commands and recitations, which helps highlight keywords critical for compliance assessment and achieve precise inter-sentence element alignment. Experimental results on real-world test sets demonstrate that the proposed model improves recognition accuracy by 30.5% compared to the baseline model. In terms of robustness, we evaluate the relative performance retention under severe noise conditions. While Zero-shot, Full Fine-tuning, and LoRA-only models achieve robustness scores of 72.2%, 72.4%, and 72.1%, respectively, and the NER-only variant reaches 88.1%, our proposed approach further improves to 89.6%. These results validate the model’s significant robustness and its potential to provide efficient and reliable technical support for ensuring the normative use of dispatch speech in urban rail transit operations. Full article

22 pages, 1250 KB  
Article
Entity Span Suffix Classification for Nested Chinese Named Entity Recognition
by Jianfeng Deng, Ruitong Zhao, Wei Ye and Suhong Zheng
Information 2025, 16(10), 822; https://doi.org/10.3390/info16100822 - 23 Sep 2025
Viewed by 528
Abstract
Named entity recognition (NER) is one of the fundamental tasks in building knowledge graphs. In some domain-specific corpora, text descriptions are poorly standardized and some entities are nested. Existing entity recognition methods suffer from word-matching noise and have difficulty distinguishing different entity labels for the same character in sequence label prediction. This paper proposes a span-based feature reuse stacked bidirectional long short-term memory (BiLSTM) nested named entity recognition (SFRSN) model, which recasts sequence-prediction entity recognition as entity span suffix category classification. First, character feature embeddings are generated with Bidirectional Encoder Representations from Transformers (BERT). Second, a feature reuse stacked BiLSTM is proposed to obtain deep context features while alleviating deep network degradation. Third, span features are obtained through a dilated convolutional neural network (DCNN), and a single-tail selection function is introduced to obtain the classification features of the entity span suffix, with the aim of reducing the number of training parameters. Fourth, a global feature gated attention mechanism is proposed that integrates span features and span suffix classification features to perform span suffix classification. Experimental results on four Chinese domain-specific datasets demonstrate the effectiveness of the approach: SFRSN achieves micro-F1 scores of 83.34% on OntoNotes, 73.27% on Weibo, 96.90% on Resume, and 86.77% on the supply chain management dataset, improvements of up to 1.55%, 4.94%, 2.48%, and 3.47%, respectively, over state-of-the-art baselines. These results indicate that the model handles nested entities and entity label ambiguity effectively.
(This article belongs to the Section Artificial Intelligence)

19 pages, 1820 KB  
Article
PROMPT-BART: A Named Entity Recognition Model Applied to Cyber Threat Intelligence
by Xinzhu Feng, Songheng He, Xinxin Wei, Runshi Liu, Huanzhou Yue and Xuren Wang
Appl. Sci. 2025, 15(18), 10276; https://doi.org/10.3390/app151810276 - 22 Sep 2025
Viewed by 1124
Abstract
The growing sophistication of cyberattacks underscores the need for the automated extraction of machine-readable intelligence from unstructured Cyber Threat Intelligence (CTI), commonly achieved through Named Entity Recognition (NER). However, existing CTI-oriented NER research faces two major limitations: the scarcity of standardized datasets and the lack of advanced models tailored to domain-specific entities. To address the dataset challenge, we present CTINER, the first STIX 2.1-aligned dataset, comprising 42,549 annotated entities across 13 cybersecurity-specific types. CTINER surpasses existing resources in both scale (+51.82% more annotated entities) and vocabulary coverage (+40.39%), while ensuring label consistency and rationality. To tackle the modeling challenge, we propose PROMPT-BART, a novel NER model built upon the BART generative framework and enhanced through three types of prompt designs. Experimental results show that PROMPT-BART improves F1 scores by 4.26–8.3% over conventional deep learning baselines and outperforms prompt-based baselines by 1.31%. Full article

30 pages, 696 KB  
Article
SPADR: A Context-Aware Pipeline for Privacy Risk Detection in Text Data
by Sultan Asiri, Randa Alshehri, Fatima Kamran, Hend Laznam, Yang Xiao and Saleh Alzahrani
Electronics 2025, 14(18), 3725; https://doi.org/10.3390/electronics14183725 - 19 Sep 2025
Viewed by 1447
Abstract
Large language models (LLMs) are powerful, but they can unintentionally memorize and leak sensitive information found in their training or input data. To address this issue, we propose SPADR, a semantic privacy anomaly detection and remediation pipeline designed to detect and remove privacy risks from text. SPADR addresses limitations in existing redaction methods by identifying deeper forms of sensitive content, including implied relationships, contextual clues, and non-standard identifiers that traditional NER systems often overlook. SPADR combines semantic anomaly scoring using a denoising autoencoder with named entity recognition and graph-based analysis to detect both direct and hidden privacy risks. It is flexible enough to work on both training data (to prevent memorization) and user input (to prevent leakage at inference time). We evaluate SPADR on the Enron Email Dataset, where it significantly reduces document-level privacy leakage while maintaining strong semantic utility. The enhanced version, SPADR (S2), reduces the PII leak rate from 100% to 16.06% and achieves a BERTScore F1 of 88.03%. Compared to standard NER-based redaction systems, SPADR offers more accurate and context-aware privacy protection. This work highlights the importance of semantic and structural understanding in building safer, privacy-respecting AI systems. Full article

19 pages, 1599 KB  
Article
Enhancing Clinical Named Entity Recognition via Fine-Tuned BERT and Dictionary-Infused Retrieval-Augmented Generation
by Soumya Challaru Sreenivas, Saqib Chowdhury and Mohammad Masum
Electronics 2025, 14(18), 3676; https://doi.org/10.3390/electronics14183676 - 17 Sep 2025
Viewed by 2572
Abstract
Clinical notes often contain unstructured text filled with abbreviations, non-standard terminology, and inconsistent phrasing, which pose significant challenges for automated medical information extraction. Named Entity Recognition (NER) plays a crucial role in structuring this data by identifying and categorizing key clinical entities such as symptoms, medications, and diagnoses. However, traditional and even transformer-based NER models often struggle with ambiguity and fail to produce clinically interpretable outputs. In this study, we present a hybrid two-stage framework that enhances medical NER by integrating a fine-tuned BERT model for initial entity extraction with a Dictionary-Infused Retrieval-Augmented Generation (DiRAG) module for terminology normalization. Our approach addresses two critical limitations in current clinical NER systems: lack of contextual clarity and inconsistent standardization of medical terms. The DiRAG module combines semantic retrieval from a UMLS-based vector database with lexical matching and prompt-based generation using a large language model, ensuring precise and explainable normalization of ambiguous entities. The fine-tuned BERT model achieved an F1 score of 0.708 on the MACCROBAT dataset, outperforming several domain-specific baselines, including BioBERT and ClinicalBERT. The integration of the DiRAG module further improved the interpretability and clinical relevance of the extracted entities. Through qualitative case studies, we demonstrate that our framework not only enhances clarity but also mitigates common issues such as abbreviation ambiguity and terminology inconsistency. Full article
(This article belongs to the Special Issue Advances in Text Mining and Analytics)

18 pages, 568 KB  
Article
Beyond Cross-Entropy: Discounted Least Information Theory of Entropy (DLITE) Loss and the Impact of Loss Functions on AI-Driven Named Entity Recognition
by Sonia Pascua, Michael Pan and Weimao Ke
Information 2025, 16(9), 760; https://doi.org/10.3390/info16090760 - 2 Sep 2025
Viewed by 885
Abstract
Loss functions play a significant role in shaping model behavior in machine learning, yet their design implications remain underexplored in natural language processing tasks such as Named Entity Recognition (NER). This study investigates the performance and optimization behavior of five loss functions—L1, L2, Cross-Entropy (CE), KL Divergence (KL), and the proposed DLITE (Discounted Least Information Theory of Entropy) Loss—within transformer-based NER models. DLITE introduces a bounded, entropy-discounting approach to penalization, prioritizing recall and training stability, especially under noisy or imbalanced data conditions. We conducted empirical evaluations across three benchmark NER datasets: Basic NER, CoNLL-2003, and the Broad Twitter Corpus. While CE and KL achieved the highest weighted F1-scores in clean datasets, DLITE Loss demonstrated distinct advantages in macro recall, precision–recall balance, and convergence stability—particularly in noisy environments. Our findings suggest that the choice of loss function should align with application-specific priorities, such as minimizing false negatives or managing uncertainty. DLITE adds a new dimension to model design by enabling more measured predictions, making it a valuable alternative in high-stakes or real-world NLP deployments. Full article
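For readers comparing the four standard losses named in this abstract, the sketch below evaluates each one on a single token's predicted label distribution against a one-hot target. The standard textbook definitions are used; DLITE is deliberately omitted because its formula is not given in the abstract. The function names and the toy distribution are illustrative.

```python
import math


def l1_loss(pred, target):
    """Sum of absolute differences between the two distributions."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def l2_loss(pred, target):
    """Sum of squared differences between the two distributions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def cross_entropy(pred, target):
    """-sum t * log(p); terms with t == 0 contribute nothing."""
    return -sum(t * math.log(p) for p, t in zip(pred, target) if t > 0)


def kl_divergence(pred, target):
    """KL(target || pred), with the convention 0 * log(0 / p) = 0."""
    return sum(t * math.log(t / p) for p, t in zip(pred, target) if t > 0)


# Predicted probabilities over three hypothetical labels; the true label is the first.
pred = [0.7, 0.2, 0.1]
target = [1.0, 0.0, 0.0]

losses = {
    "L1": l1_loss(pred, target),
    "L2": l2_loss(pred, target),
    "CE": cross_entropy(pred, target),
    "KL": kl_divergence(pred, target),
}
```

With a one-hot target the entropy of the target is zero, so cross-entropy and KL divergence coincide, which is consistent with the two behaving similarly in the reported results. L1 and L2 are bounded on probability vectors, whereas CE and KL grow without bound as the probability of the true label approaches zero.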