Search Results (723)

Search Parameters:
Keywords = DeepSeek

27 pages, 1655 KB  
Article
Multi-Model Ensemble Evaluation of Student Design Projects in Higher Education: A Comparative Analysis of AI and Human Expert Grading
by Filip Cvitić, Tajana Koren Ivančević and Nikolina Stanić Loknar
Technologies 2026, 14(7), 382; https://doi.org/10.3390/technologies14070382 (registering DOI) - 23 Jun 2026
Abstract
This study investigates the potential, limitations, and pedagogical implications of applying a parallel multi-model AI evaluation workflow, using ChatGPT, DeepSeek, and Uizard, to assess student design projects in higher education. Because design assessment involves both formal criteria and subjective creative interpretation, the study first established a human expert baseline based on three independent university professors. The human inter-rater reliability was low to moderate, with a mean pairwise Spearman’s ρ of 0.36 and Cronbach’s α of 0.60 for packaging design, and ρ of 0.43 and α of 0.69 for web design. This finding is central to the study, as it shows that the human benchmark in creative design assessment is itself variable and interpretive. Against this baseline, AI–human alignment remained limited and task-dependent. For packaging design, the AI ensemble showed only a weak positive association with the human expert baseline (Spearman’s ρ = 0.30, p = 0.031), which should be interpreted cautiously given the Bonferroni-adjusted significance threshold used in the study. For web design, no significant AI–human association was observed. Qualitative analysis of AI-generated rationales identified recurring limitations, including hallucination, aesthetic shield effects, and missed context, where visually polished work was rewarded despite deeper conceptual or structural weaknesses. The findings suggest that current AI systems can provide useful formative feedback on visible formal features, but they are not reliable as autonomous grading tools for complex creative work. AI-assisted assessment is therefore best understood as a supervised formative support mechanism, while final evaluation should remain grounded in human pedagogical judgment. Full article
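The reliability figures quoted in this abstract (Cronbach's α across three raters, and a p-value judged against a Bonferroni-adjusted threshold) can be illustrated with a minimal sketch. The function names and the toy scores below are hypothetical and are not taken from the study; the family-wise level of 0.05 is an assumption, and the abstract does not say how many comparisons were corrected for.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha, treating each rater as an 'item'.

    ratings: list of per-rater score lists, one score per project,
    all of equal length (hypothetical input layout).
    """
    k = len(ratings)
    if k < 2:
        raise ValueError("need at least two raters")
    n = len(ratings[0])
    if any(len(r) != n for r in ratings):
        raise ValueError("all raters must score the same projects")
    # Sum of each rater's sample variance across projects.
    item_var = sum(variance(r) for r in ratings)
    # Sample variance of the per-project total score.
    totals = [sum(col) for col in zip(*ratings)]
    total_var = variance(totals)
    return k / (k - 1) * (1 - item_var / total_var)


def bonferroni_threshold(alpha, m):
    """Per-test significance threshold for m comparisons."""
    return alpha / m


# Toy data only: three raters in perfect agreement on four projects.
perfect = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
print(cronbach_alpha(perfect))

# Assuming a family-wise level of 0.05, p = 0.031 clears the threshold
# only when a single comparison is made.
print(0.031 < bonferroni_threshold(0.05, 1), 0.031 < bonferroni_threshold(0.05, 2))
```

Under the 0.05 assumption, any correction for two or more comparisons puts the threshold at or below 0.025, which is why a reported p of 0.031 calls for cautious interpretation.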

22 pages, 662 KB  
Article
Is AI Catching Up to Human Expression? Exploring Emotion, Personality, Authorship, and Linguistic Style in English and Arabic with Six Large Language Models
by Nasser A. Alsadhan
Appl. Sci. 2026, 16(12), 6247; https://doi.org/10.3390/app16126247 (registering DOI) - 22 Jun 2026
Abstract
The advancing fluency of large language models (LLMs) raises important questions about their ability to emulate complex human traits, including emotional expression and personality, across diverse linguistic and cultural contexts. This study investigates whether state-of-the-art LLMs can convincingly mimic emotional nuance in English and personality markers in Arabic, a critical under-resourced language with unique linguistic and cultural characteristics. We conduct two tasks across six models: Jais, Mistral, LLaMA, GPT-4o, Gemini, and DeepSeek. First, we evaluate whether machine classifiers can reliably distinguish between human-authored and AI-generated texts. Second, we assess the extent to which LLM-generated texts exhibit emotional or personality traits comparable to those of humans. Our results demonstrate that AI-generated texts are distinguishable from human-authored ones (F1 > 0.95), though classification performance deteriorates on paraphrased samples, indicating reliance on superficial stylistic cues. Emotion and personality classification experiments reveal significant generalization gaps: classifiers trained on human data perform poorly on AI-generated texts and vice versa, suggesting LLMs encode affective signals differently from humans. Importantly, augmenting training with AI-generated data enhances performance in the Arabic personality classification task, highlighting the potential of synthetic data to address challenges in under-resourced languages. Model-specific analyses show that GPT-4o and Gemini exhibit superior affective coherence, while LLaMA performs worse. Linguistic and psycholinguistic analyses reveal measurable divergences in tone, authenticity, and textual complexity between human and AI texts. 
These findings have significant implications for affective computing, authorship attribution, and responsible AI deployment, particularly within under-resourced language contexts where generative AI detection and alignment pose unique challenges. Full article

20 pages, 2301 KB  
Article
LLM-Assisted Semantic Pruning for Genetic Programming-Based Alpha Factor Discovery
by Hang Chen and Rui Qi
Appl. Sci. 2026, 16(12), 6231; https://doi.org/10.3390/app16126231 (registering DOI) - 21 Jun 2026
Viewed by 76
Abstract
Genetic programming (GP) has been widely used in quantitative finance for discovering formulaic alpha factors that can predict asset returns. However, GP often produces overgrown expressions that are difficult to interpret and expensive to evaluate. This paper proposes a large language model (LLM)-assisted pruning framework that reviews expression trees generated by GP, with the LLM acting as a semantic reviewer that flags redundant or financially implausible branches based on structural complexity and contextual reasoning. The proposed method is formalized as a closed-loop Trigger–Evaluate–Decide–Execute (TEDE) process. We present mathematical formulations, algorithmic design, and examples showing how redundant nested functions can be simplified while monitoring predictive performance. Experiments with high-frequency cryptocurrency market data, using DeepSeek-V4-Flash as the semantic engine, show lower expression complexity and higher rubric-based interpretability scores for the pruned symbolic factors. Under the reported test setup, the LLM-pruned configuration has higher Information Ratio (IR) values than the listed baselines and more compact expression trees than the GP baselines. Full article
(This article belongs to the Special Issue AI-Based Combinatorial Optimization and Multi-Objective Optimization)

16 pages, 775 KB  
Systematic Review
A Systematic Review of Generative AI in Cardiac Surgery and Surgical Education: A Laurillard-Based Learning-Activity Map
by Hakan Öntaş and Harun Çiğdem
Encyclopedia 2026, 6(6), 137; https://doi.org/10.3390/encyclopedia6060137 - 17 Jun 2026
Viewed by 186
Abstract
Generative Artificial Intelligence (GenAI) in cardiac surgery refers to the integration of advanced computational models, such as Large Language Models (LLMs), to automate and enhance clinical decision-making, preoperative risk assessment, and surgical education. In the context of surgical training, it functions as a personalized pedagogical tool that supports various learning activities, ranging from information acquisition and clinical inquiry to procedural practice, while requiring rigorous human oversight to ensure patient safety and clinical accuracy. (1) Background: Generative Artificial Intelligence (GenAI) is increasingly integrated into health professions education, offering new opportunities for learning; however, its specific application and pedagogical mapping in high-stakes fields such as cardiac surgery remain underexplored. This systematic review investigates how GenAI is utilized in cardiac surgery and surgical education, aligning these uses with Laurillard’s six learning types. (2) Methods: Following the PRISMA 2020 guidelines, we searched the Web of Science Core Collection for studies on GenAI in cardiac surgery, resulting in 42 studies that met the inclusion criteria. Study quality was appraised using the Medical Education Research Study Quality Instrument (MERSQI). (3) Results: GenAI applications most frequently supported clinical inquiry (93.8%) and practice (68.8%), demonstrating expanding efficiency across commercial and open-source models (including ChatGPT-4o, Gemini AI, and emerging reasoning architectures such as DeepSeek) for knowledge acquisition and medical production. While it significantly improves individualized learning and preoperative assessment workflows, its practical role in Discussion and Collaboration remains heavily underutilized, highlighting a distinct shift toward individualized solo professional workflows. 
(4) Conclusions: GenAI provides a transformative and scalable approach to cardiac surgical training by offering personalized and accessible knowledge retrieval. However, clinical educators and governance bodies must deliberately balance these immediate productivity benefits with long-term concerns regarding structural “hallucinations,” data verifiability, and the preservation of collaborative competencies within modern multidisciplinary Heart Teams. Full article
(This article belongs to the Section Medicine & Pharmacology)

26 pages, 2738 KB  
Article
Temporal Robustness of Large Language Models for Thematic Classification of UN General Assembly Debates
by Fatima Mumtaz, Sadaf Abdul Rauf, Saadia Ishtiaq Nauman, Muhammad Ghulam Abbas Malik and Muhammad Imran
Information 2026, 17(6), 589; https://doi.org/10.3390/info17060589 (registering DOI) - 12 Jun 2026
Viewed by 181
Abstract
Thematic analysis of large-scale political discourse remains a challenge due to semantic complexity, overlapping policy areas, and changing diplomatic vocabulary. Although large language models (LLMs) offer promise for scalable thematic classification, their reliability in politically sensitive contexts requires systematic validation against expert human annotations. We evaluate LLM-based thematic classification of United Nations General Assembly (UNGA) speeches across a decade (2014–2023), using 7680 human-annotated themes mapped into 12 policy domains. Our results show that DeepSeek R1 achieves the highest accuracy, 77% (F1 = 0.73), followed by ChatGPT, Gemini, and LLaMA, with strong performance in lexically stable domains but substantial degradation in semantically overlapping categories such as governance and international cooperation. A distinctive dimension of our work is timeline analysis, which shows that LLM performance varies strongly across years and that precision decreases during periods of rhetorical transformation, including pandemic-related discussions and the cooperation discourse shaped by the Russia–Ukraine conflict. By linking domain-level ambiguity and geopolitical shifts to temporal instability, this study introduces a dynamic robustness perspective for evaluating LLMs in computational political discourse analysis. Full article
(This article belongs to the Section Artificial Intelligence)

23 pages, 718 KB  
Article
Evaluating Symmetry and Asymmetry in Large Language Models’ Focus-Span Identification: Evidence from Chinese shi…de Cleft Constructions
by Danyang Zheng and Jinzhuo Zheng
Symmetry 2026, 18(6), 996; https://doi.org/10.3390/sym18060996 - 10 Jun 2026
Viewed by 197
Abstract
Large language models (LLMs) have achieved strong performance in many linguistic tasks, but their ability to process discourse-level information structure remains insufficiently understood. In particular, current models may identify locally coherent spans while failing to determine the minimal constituent that carries informational prominence in context. Chinese “shi…de” cleft constructions provide a theoretically important testing ground for this problem because they combine a stable formal pattern with context-dependent focus interpretation, exhaustivity, and discourse-sensitive boundary variation. This study investigates whether current LLMs can identify the minimal focus domain in Chinese “shi…de” clefts and whether their performance goes beyond simple surface-form heuristics. Based on 105 human-validated gold-standard samples, we compared three API-accessible models, ChatGPT 5.4, Claude Opus 4.6, and DeepSeek-V4-Pro, with two rule-based baselines. Baseline 1, which extracted the full span between “shi” and “de”, achieved only 2.86% accuracy, while Baseline 2, a stronger minimal-cue heuristic, reached 46.67%. Under the main prompt condition, DeepSeek-V4-Pro achieved the highest accuracy (65.71%), followed by Claude Opus 4.6 (60.00%), whereas ChatGPT 5.4 (41.90%) did not outperform Baseline 2. A prompt-level QUD ablation showed no stable or statistically significant improvement, indicating that explicit discourse-question guidance alone is insufficient for minimal focus-boundary identification. Performance across focus types further showed that topical focus was relatively easier than informational and contrastive focus, suggesting the importance of topic continuity. Overall, the findings reveal both symmetry and asymmetry in LLM focus processing: models share certain task-level constraints, but differ in cue weighting and boundary-compression strategies. 
The study argues that Chinese “shi…de” focus identification is better modeled as a multi-cue focus-span ranking problem rather than as direct QUD-answer matching. Future research should extend the dataset and further test whether explicit multi-cue ranking methods can improve focus-boundary identification across models and languages. Full article
(This article belongs to the Section Computer)

13 pages, 882 KB  
Article
Automated PROMISE V2 Scoring from PSMA PET/CT Reports Using Large Language Models: A Comparative Evaluation of Prompt Design and Model Performance
by Tilman Speicher, Isa Ethem Demirkol, Arne Blickle, Moritz B. Bastian, Stephan Maus, Andrea Schaefer-Schuler, Mark Bartholomä, Caroline Burgard, Samer Ezziddin and Florian Rosar
Curr. Oncol. 2026, 33(6), 349; https://doi.org/10.3390/curroncol33060349 - 9 Jun 2026
Viewed by 260
Abstract
Large language models (LLMs) are increasingly explored for clinical use. However, the extent to which such models can reliably support physicians in reporting, staging, and the assessment of classification remains an active area of research. This study aimed to evaluate and compare multiple LLMs for automated PROMISE V2 classification for prostate cancer. A total of 126 unambiguous German-language PSMA PET/CT text reports were retrospectively analyzed, with reference standards established by expert consensus based on image interpretation and the original report text. Five LLMs (GPT-5.4, DeepSeek-V3.2, Claude Sonnet 4.6, Gemini 3 Flash and Grok 4) were assessed using two English-language prompting strategies of varying complexity. Agreement with the reference standard served as the primary endpoint. Performance varied in the short-prompt setting (36.5–79.4%) but improved consistently with the long prompt (74.6–86.5%), with Gemini 3 Flash achieving the highest agreement. Across PROMISE V2 subcategories, agreement rates were high (miT: 81.0–92.1%, miN: 92.9–96.0%, miM: 92.9–95.2%), despite inter-model differences. In conclusion, contemporary LLMs demonstrate promising performance in deriving PROMISE V2 scores from unambiguous original report texts, particularly when guided by detailed prompts. Full article

32 pages, 908 KB  
Article
MetricDraft: A Metric-Driven Framework for Academic Paper Draft Generation and Iterative Optimization
by Ruifeng Guo, Zhijun Chang and Lijun Fu
Appl. Sci. 2026, 16(12), 5780; https://doi.org/10.3390/app16125780 - 8 Jun 2026
Viewed by 149
Abstract
Large language models (LLMs) are advancing intelligent writing systems from local text continuation and language polishing toward long-form structured text generation. However, directly generating full-length academic paper drafts remains challenging due to unclear research objectives, unstable discourse structures, insufficient long-text coherence, and the lack of explicit quality control mechanisms. To address this long-form structured generation task, we propose MetricDraft, a metric-driven framework for academic paper draft generation. The framework organizes the drafting process as a closed-loop pipeline comprising research ideation clarification, structural anchoring, section-by-section generation, quality assessment, and feedback-driven revision. Its key components include adversarial research ideation clarification, staged structural anchoring, the PRISM structured metric system, progressive context injection with section-type-aware guided generation (PCI+STAGG), and a metric-feedback-driven generation–evaluation co-optimization mechanism. Experimental results demonstrate that MetricDraft achieves higher composite quality scores than one-shot generation, summary-based context passing, and context-accumulation-only baselines, improving MQS over Base1, Base2, and Base3 by +5.5, +7.9, and +7.0 points, respectively, with paired tests reaching statistical significance. To examine whether this advantage is tied to a single LLM backend, we further conduct a cross-model validation on all 15 tasks using Qwen3.7-Max in addition to the original DeepSeek-V4-Pro setting. MetricDraft remains the best-performing strategy under both models. To address citation reliability, an additional citation verification-and-retrieval-based replacement (CVRR) experiment reduces the fabricated citation rate of DeepSeek MetricDraft drafts from 56.0% to 15.0%. 
Furthermore, PRISM exhibits moderate-to-high positive correlations with expert ratings, providing preliminary evidence that it can serve as an auxiliary evaluation reference for draft quality diagnosis and iterative revision. This work reformulates academic writing as an adjustable, assessable, and iteratively optimizable long-form structured text generation problem, offering methodological insights for human–AI collaborative writing and intelligent text generation system design. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

25 pages, 3879 KB  
Article
LLM-Enabled Reconstruction of Farmer Fertilizer-Reduction Responses Under Policy Scenarios: Evidence from Sparse Stated-Preference Data
by Shuaiwen Liu, Yichuan Zhang, Zhentao Sun, Xiao Huang and Chaoqing Yu
Agriculture 2026, 16(12), 1266; https://doi.org/10.3390/agriculture16121266 - 8 Jun 2026
Viewed by 327
Abstract
Agricultural fertilizer reduction depends on farmers’ responses to policy incentives, but such responses are often observed only at a few subsidy levels and under hypothetical conditions. Using survey-based stated-preference data from 15 counties in China, this study examines whether large language model (LLM)-based methods can reconstruct fertilizer-reduction response intervals under alternative subsidy scenarios. Three LLM-based inference strategies were designed and compared with 14 conventional methods within an exploratory evaluation framework covering interval recovery, extrapolation behavior, and curve-shape plausibility. LLM-based methods were competitive in this sparse-anchor reconstruction task. The incremental inference strategy, which reconstructs target intervals through local changes between subsidy anchors, produced the most stable results. DeepSeek V3.2 Increment obtained the highest IO (0.528) and a high EIO (0.602), while Qwen3-8B Increment achieved the lowest MAME (1.291) and the highest EIO (0.636). SHAP analysis showed that reconstruction difficulty was mainly associated with fertilizer bags per mu (0.2414), annual fertilizer cost (0.1808), and fertilization training (0.1473). Overall, this study explores the potential of LLM-based inference as a flexible approach for fertilizer-reduction policy-response analysis from limited stated-preference data. Full article

18 pages, 917 KB  
Article
How Reliably Do Large Language Models Reproduce Vital Pulp Therapy Guidelines? A Mixed-Effects Evaluation of Guideline-Concordance and Error Directionality
by Sine Güngör Us, Arzu Şahin Mantı and Arzu Kaya Mumcu
Healthcare 2026, 14(12), 1605; https://doi.org/10.3390/healthcare14121605 - 7 Jun 2026
Viewed by 214
Abstract
Background: Large language models (LLMs) are increasingly consulted for clinical guidance, yet their reliability in protocol-sensitive domains remains insufficiently characterized. This study evaluated the ability of widely accessible LLMs to reproduce guideline-defined decision thresholds in vital pulp therapy (VPT), with emphasis on guideline-concordance accuracy, professional-role prompting, short-term response stability, and decision-level error directionality. Methods: Twenty-six binary yes/no questions were derived from an internationally recognized evidence-based guideline for VPT. Four LLMs—GPT-5, GPT-4o, DeepSeek-V3, and Gemini 2.5 Flash—were queried under non-prompted and professional-role-prompted conditions by two independent operators across three daily sessions over three consecutive days. Descriptive analyses were complemented by mixed-effects logistic regression in R to account for repeated responses clustered within guideline-derived questions. Results: Overall guideline-concordance accuracy was high across models. Gemini showed the highest observed accuracy under non-prompted conditions; DeepSeek showed the highest under prompted conditions. In the mixed-effects model, Gemini demonstrated significantly higher odds of guideline-concordant responses than GPT-5 under non-prompted conditions, whereas DeepSeek outperformed GPT-5 and GPT-4o under prompted conditions. The model × prompt interaction showed a trend toward significance but did not reach the conventional threshold. Day and within-day time point were not significantly associated with accuracy, supporting short-term response stability. Error-direction analysis revealed model-specific patterns: Gemini showed consistently low false-positive rates but increased false-negative responses under prompted conditions; DeepSeek showed reduced false-positive and no false-negative responses under prompted conditions. 
Conclusions: Average accuracy alone is insufficient to characterize the reliability of LLM-generated clinical guidance. Evaluation in protocol-sensitive domains should incorporate guideline-concordance, prompt responsiveness, short-term stability, and decision-level error directionality. Full article
(This article belongs to the Special Issue The Role of AI in Predictive and Prescriptive Healthcare)

31 pages, 1322 KB  
Article
Towards Responsible AI for IoT Network Security Auditing Using Knowledge Graph and RAGAS
by Obrina Briliyant, Amir Javed and Yulia Cherdantseva
J. Cybersecur. Priv. 2026, 6(3), 98; https://doi.org/10.3390/jcp6030098 - 6 Jun 2026
Viewed by 237
Abstract
The trustworthiness of AI-powered network security auditing depends not only on detection accuracy but on the faithfulness of the explanations that support compliance verdicts. In IoT network security, Large Language Models (LLMs) are increasingly utilized to produce natural-language security assessments from raw network traffic, yet the extent to which these explanations are grounded in retrieved evidence is rarely measured. This paper presents the Retrieval-Augmented Generation Assessment Suite (RAGAS) as an evaluation framework that compares three retrieval paradigms—rule-based heuristic scoring, dense vector retrieval, and knowledge graph traversal—on the task of explaining network compliance against ETSI EN 303 645 IoT cybersecurity provisions. Using 30 human expert-validated compliance scenarios derived from the CIC-IoT2023 dataset and three LLMs (DeepSeek-R1, Qwen-2.5, Llama-3.2), we find that graph-based retrieval achieves the highest faithfulness (0.570), outperforming rule-based (0.524) and vector retrieval (0.509). All methods, however, exhibit low context recall (≤22.4%), and we highlight that high detection F1 scores do not guarantee faithful explanations; over 40% of statements in compliance answers are unsupported by retrieved evidence. A proof-of-concept prototype, Security Audit Compliance Agent (SACA), demonstrates how knowledge graph traversal can be integrated with interactive visualization to support human auditor oversight. We argue that, in adherence to responsible AI principles, faithfulness measurement should become a standard complement to accuracy reporting for an AI-driven network audit or forensic analysis. Full article

31 pages, 2671 KB  
Article
Named Entity Recognition Method for Natural Disaster Emergencies Based on Instruction Tuning and Graph Retrieval-Augmented Generation
by Kehong Zhang, Xinyu Lin, Min Wang, Haisheng Yu and Lanjian Chen
Big Data Cogn. Comput. 2026, 10(6), 185; https://doi.org/10.3390/bdcc10060185 - 5 Jun 2026
Viewed by 193
Abstract
Named entity recognition in natural disaster emergencies is a critical foundational task for emergency management. However, existing methods face challenges including complex entity types, frequent emergence of new terminology, model knowledge obsolescence, and poor adaptability to dynamic knowledge updates, resulting in limited accuracy and generalization in real-world disaster scenarios. To address these issues, this paper proposes a named entity recognition method for natural disaster emergencies based on instruction tuning and knowledge graph retrieval-augmented generation. We first construct a dedicated instruction-tuning dataset, EM-InstructNER, and a domain-specific knowledge graph, EmergencyKG, tailored to natural disasters. Then, LoRA is employed for parameter-efficient fine-tuning of the Qwen2-7B-Instruct base model, while KG-based RAG dynamically retrieves subgraphs from the knowledge graph to generate semantically enriched augmented prompts, providing external structured knowledge support for generative NER. Experimental results demonstrate that the proposed method achieves a macro F1 score of 0.9205 on the EM-InstructNER test set, representing a 36.6% relative improvement over the best-performing zero-shot baseline (DeepSeek-R1:14B), while remaining competitive with strong supervised sequence labeling approaches (e.g., BERT + CRF). The framework provides knowledge graph update flexibility and significantly reduces training computational cost and GPU memory consumption through LoRA-based parameter-efficient fine-tuning. Cross-domain evaluation on the public CLUENER2020 benchmark further demonstrates its generalization capability. Full article
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))
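As a reading aid for the "36.6% relative improvement" figure in the abstract above, the following sketch back-computes the baseline score that figure would imply. It assumes that "relative improvement" means (new - old) / old; the abstract does not report the baseline's macro F1 directly, so the result is an inference under that assumption, not a reported value. The function name is hypothetical.

```python
def implied_baseline(score, relative_gain):
    """Baseline value implied by a score and a relative gain over that baseline.

    Assumes relative_gain = (score - baseline) / baseline.
    """
    return score / (1.0 + relative_gain)


# Macro F1 of 0.9205 with a 36.6% relative gain over the zero-shot baseline.
baseline = implied_baseline(0.9205, 0.366)
print(round(baseline, 3))
```

If the authors instead meant an absolute gain in percentage points, the implied baseline would differ, so the number is best treated as a rough consistency check.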

23 pages, 1638 KB  
Article
Deep Dyspareunia One Year After Nerve-Sparing Endometriosis Surgery: An Observational Study Highlighting Undesirable Outcomes
by Nilton de Nadai Filho, Claudio Peixoto Crispi, Bruna Rafaela Santos de Oliveira, Claudio Peixoto Crispi and Marlon de Freitas Fonseca
J. Pers. Med. 2026, 16(6), 307; https://doi.org/10.3390/jpm16060307 - 5 Jun 2026
Viewed by 291
Abstract
Background/Objectives: This study evaluates the 1-year follow-up outcomes after minimally invasive nerve-sparing surgery for the complete excision of deep endometriosis (DE), with a specific focus on deep dyspareunia. Cases with undesirable outcomes were explored in detail to better understand the evolution of this cornerstone endometriosis-related symptom. This approach supports personalized medicine initiatives by seeking to stratify patients into likely surgical responders and non-responders. Methods: This is an interdisciplinary retrospective observational study assessing 195 consecutive cases. Inclusion criteria comprised women with an established diagnosis of DE who had been sexually active in the 6 months prior to surgery. Because pregnancy and postpartum can interfere with the longitudinal assessment of deep dyspareunia, women in these phases during follow-up were excluded. Additionally, individuals who had not been sexually active in the preceding 6 months for reasons unrelated to deep dyspareunia were excluded. Deep dyspareunia was measured using an 11-point (0–10) self-reported Numerical Rating Scale (NRS). Hierarchical clusters were established based on preoperative scores: NONE (NRS = 0), MILD (1 ≤ NRS ≤ 3), MODERATE (4 ≤ NRS ≤ 6), and SEVERE (NRS ≥ 7). Results: In the SEVERE cluster, 82.2% (95% CI: 72.4–92.0) of women improved by ≥3 points. In the NONE cluster, 70.1% (95% CI: 60.3–79.2) remained asymptomatic. Although improvements in deep dyspareunia were statistically significant across the total sample, individual trajectories were not uniform; the response was considered undesirable in 34 cases (17.4%; 95% CI: 12.1–22.8). The frequency of preoperatively asymptomatic women (NRS = 0) developing de novo deep dyspareunia (NRS ≥ 3) at the 1-year follow-up was estimated at 14.9% (95% CI: 8.0–22.7). These results highlight the marked phenotypic and clinical heterogeneity in patient trajectories and the inherent unpredictability of adverse responses.
Conclusions: Postoperative pain outcomes likely result from a complex interplay among surgical, myofascial, neurological, psychological, inflammatory, and hormonal factors. While surgery remains an effective and safe approach for treating pain, our findings underscore that even preoperatively asymptomatic patients should receive targeted counseling regarding the unexpected risk of developing postoperative deep dyspareunia. Full article
(This article belongs to the Special Issue Obstetrics and Gynecology and Women's Health—2nd Edition)
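
The preoperative NRS cut-offs and the interval reported for the 34 undesirable cases can be sketched as follows. The cluster boundaries come from the abstract; the normal-approximation (Wald) interval is an assumption, since the abstract does not say which method the authors used, and the function names are illustrative.

```python
import math


def nrs_cluster(score: int) -> str:
    """Map a preoperative 0-10 NRS score to the cluster named in the abstract."""
    if not 0 <= score <= 10:
        raise ValueError("NRS score must be between 0 and 10")
    if score == 0:
        return "NONE"
    if score <= 3:
        return "MILD"
    if score <= 6:
        return "MODERATE"
    return "SEVERE"


def wald_ci(events: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Proportion with a normal-approximation 95% interval (assumed method)."""
    p = events / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# 34 undesirable responses among 195 cases, as reported in the abstract.
p, low, high = wald_ci(34, 195)
```

For 34 of 195 cases this gives roughly 17.4% with bounds near 12.1% and 22.8%, which agrees with the reported interval, although that agreement alone does not confirm the authors' method.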

31 pages, 422 KB  
Review
Knowledge Hiding in Transactional Databases: A Focused Survey of Methods and Open Challenges
by Sotiris Kotsiantis and Vassilios S. Verykios
Appl. Sci. 2026, 16(11), 5656; https://doi.org/10.3390/app16115656 - 4 Jun 2026
Viewed by 178
Abstract
Privacy-preserving data mining (PPDM) seeks to extract useful patterns from shared data without revealing sensitive information. Within PPDM, knowledge hiding—encompassing both association rule hiding (ARH) and frequent itemset hiding (FIH)—forms a coherent family of techniques that sanitize transactional databases before release. This focused survey synthesizes the main algorithmic paradigms for knowledge hiding (1999–2026), covering heuristic sanitization, border-based and exact optimization via integer linear programming, constraint-based and graph-based formulations, emerging learning-guided support mechanisms, and extensions to utility mining and non-relational structures. We use a PRISMA-style search and selection protocol to make the evidence base transparent and to mitigate selection bias. We trace the evolution from early disclosure-limitation heuristics to graph-guided and knowledge-graph approaches, and we treat deep-learning, GNN, and federated graph-learning work as adjacent tools that may support candidate selection, representation learning, or distributed deployment rather than as replacements for classical hiding validation. We identify persistent challenges around scalability, infeasibility in LP formulations, and evaluation standardization, and outline directions for future research. Unlike broader PPDM overviews, this review centers exclusively on transactional knowledge hiding. Beyond cataloging algorithms, it compares method families through their intervention mechanisms, side-effect profiles, scalability assumptions, and benchmark regimes, and it distills reporting recommendations for more reproducible empirical evaluation. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
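
As a toy illustration of the heuristic sanitization family mentioned in this abstract, the sketch below lowers the support of one sensitive itemset by deleting a single "victim" item from supporting transactions. It is not any specific algorithm from the survey; the function names, the victim-selection rule, and the example database are assumptions made for illustration.

```python
def support(db: list[set[str]], itemset: set[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db)


def hide_itemset(db: list[set[str]], sensitive: set[str], min_support: int) -> list[set[str]]:
    """Return a sanitized copy in which `sensitive` has support below `min_support`.

    Greedy toy heuristic: remove one item of the sensitive itemset from
    supporting transactions, in database order, until the itemset is infrequent.
    """
    sanitized = [set(t) for t in db]  # copy so the input is left untouched
    victim = min(sensitive)           # arbitrary but deterministic choice
    for t in sanitized:
        if support(sanitized, sensitive) < min_support:
            break
        if sensitive <= t:
            t.discard(victim)
    return sanitized


# Hypothetical transactional database.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
released = hide_itemset(db, {"a", "b"}, min_support=2)
```

Deleting items also changes the support of non-sensitive itemsets (here, anything containing "a" in the modified transaction), which is the kind of side effect the survey's comparison of method families is concerned with.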

16 pages, 1678 KB  
Article
Artificial Intelligence and Synthetic Data: A Natural Language Processing Protocol for Synthetic Data Augmentation with Human Validation in Sensitive Domains
by Rafael Sosa-Ramírez, Eloy López-Meneses, Mariana-Daniela González-Zamar and María Belén Morales Cevallos
Educ. Sci. 2026, 16(6), 885; https://doi.org/10.3390/educsci16060885 - 4 Jun 2026
Viewed by 279
Abstract
Research on sensitive human narratives is increasingly constrained by ethical and privacy regulations that limit access to primary data, creating a structural small-data challenge that limits deep computational analysis. To address this limitation, this study validates a Natural Language Processing protocol that scales 946 real breakup narratives from r/breakups to 6000 human-validated high-fidelity synthetic records across five BERTopic clusters. The architecture employs MPNet, UMAP, and HDBSCAN to map latent space and thematically cluster texts, extracts seed documents using the Kneedle algorithm, and orchestrates DeepSeek V3.2 with stochastic sampling and small batches (k = 5). Automated validation via cosine similarity with a P10 threshold attained a mean semantic similarity of 0.7204 (range 0.6413–0.7855) and a fidelity rate of 99.08%. In the expert human review, two members of the research team evaluated 1732 posts on topic adherence and emotional authenticity using Gwet’s AC2. Five of six clusters achieved AC2 ≥ 0.70 on both dimensions; Topic 3 showed marginal adherence (AC2 = 0.660) while maintaining acceptable authenticity (AC2 = 0.817), and the 1200 synthetic posts for Topic 5 failed human validation (AC2 < 0.50) due to documented LLM safety-filter limitations and are excluded from the final corpus. These results demonstrate that the proposed protocol enables the research community to generate validated, privacy-preserving synthetic data ecosystems while establishing empirical boundary conditions for sensitive topic analysis. Full article
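
The automated validation step can be sketched as a percentile-thresholded cosine-similarity check. The abstract does not define how the P10 threshold is computed, so this sketch assumes it is the 10th percentile of real-text similarities to their cluster centroid, and that a synthetic text passes when its similarity to that centroid meets the threshold. All names and the tiny vectors are hypothetical stand-ins for sentence embeddings.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def percentile(values: list[float], q: float) -> float:
    """q-th percentile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def fidelity(real: list[list[float]], synthetic: list[list[float]], q: float = 10):
    """Return (threshold, mean synthetic similarity, share of synthetic texts passing)."""
    c = centroid(real)
    threshold = percentile([cosine(r, c) for r in real], q)
    sims = [cosine(s, c) for s in synthetic]
    passed = sum(s >= threshold for s in sims)
    return threshold, sum(sims) / len(sims), passed / len(sims)


# Toy 2-D "embeddings": one synthetic vector is on-topic, one is orthogonal.
real_vecs = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [0.9, 0.05]]
synth_vecs = [[1.0, 0.0], [0.0, 1.0]]
threshold, mean_sim, rate = fidelity(real_vecs, synth_vecs)
```

In this toy case only the on-topic synthetic vector clears the threshold, so half of the synthetic set passes; the reported 99.08% fidelity rate would be the analogous share under whatever threshold definition the authors actually used.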