Search Results (383)

Search Parameters:
Keywords = llama

13 pages, 675 KB  
Article
Domain-Specific vs. General-Purpose Large Language Models in Orthodontics: A Blinded Comparison of AlimGPT, GPT-4o, Gemini, and Llama
by Aksakalli Sertac, Giray Bilgin and Temel Cagri
Dent. J. 2026, 14(4), 219; https://doi.org/10.3390/dj14040219 - 8 Apr 2026
Viewed by 94
Abstract
Objective: The application of artificial intelligence (AI) in orthodontics has evolved rapidly in recent years, encompassing areas such as diagnosis, treatment planning, and patient management. AlimGPT is an AI-based tool that provides treatment options based on data and algorithms; this study aimed to compare it with GPT-4o, Gemini, and Llama using standardized instruments to evaluate the quality of information provided: a Likert scale, the modified DISCERN (mDISCERN), and the modified Global Quality Score (mGQS). Methods: Fourteen different orthodontic questions were posed to each model, and the answers were analyzed. Results: Significant differences were detected for reliability (χ² = 15.267, p = 0.0016) and usefulness (χ² = 20.557, p = 0.0001). Post hoc tests showed AlimGPT > Gemini and Llama for reliability and AlimGPT > GPT-4o, Gemini, and Llama for usefulness. mDISCERN was significant overall (χ² = 11.047, p = 0.0115), but no pairwise contrast met adjusted significance; mGQS showed no significant differences (χ² = 7.071, p = 0.0697). Inter-rater agreement was moderate-to-good for reliability (ICC = 0.710, 95% CI 0.60–0.80) and usefulness (ICC = 0.729, 95% CI 0.63–0.82), moderate for mGQS (ICC = 0.596, 95% CI 0.47–0.71), and poor-to-moderate for mDISCERN (ICC = 0.435, 95% CI 0.30–0.58). Conclusions: In this blinded, within-subjects experiment, the domain-specific model (AlimGPT) received higher clinician ratings for usefulness and, for reliability, exceeded two general baselines. Differences in mGQS were not detected. Expanding the number of raters, increasing item diversity, or integrating updated baselines would be beneficial. Full article
(This article belongs to the Special Issue Orthodontics and New Technologies: 2nd Edition)
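
The abstract reports intraclass correlation coefficients (ICCs) for inter-rater agreement; as a minimal sketch, here is how such ICCs can be computed in Python with the pingouin library. The data, column names, and choice of ICC form are assumptions for illustration, not the authors' setup.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format ratings: every rater scores every model answer.
ratings = pd.DataFrame({
    "response": list(range(8)) * 2,           # 8 rated answers
    "rater":    ["A"] * 8 + ["B"] * 8,
    "score":    [4, 5, 3, 4, 5, 2, 4, 3,      # rater A (Likert scores)
                 5, 5, 3, 4, 4, 2, 5, 3],     # rater B
})

# The abstract does not state which ICC form was used; ICC2 (two-way
# random effects, absolute agreement) is a common choice for this design.
icc = pg.intraclass_corr(data=ratings, targets="response",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```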

22 pages, 1170 KB  
Article
Adverse Drug Reaction Detection on Social Media Based on Large Language Models
by Hao Li and Hongfei Lin
Information 2026, 17(4), 352; https://doi.org/10.3390/info17040352 - 7 Apr 2026
Viewed by 204
Abstract
Adverse drug reaction (ADR) detection is essential for ensuring drug safety and effective pharmacovigilance. The rapid growth of users’ medication reviews posted on social media has introduced a valuable new data source for ADR detection. However, the large scale and high noise inherent in social media text pose substantial challenges to existing detection methods. Although large language models (LLMs) exhibit strong robustness to noisy and interfering information, they are often limited by issues such as stochastic outputs and hallucinations. To address these challenges, this paper proposes two generative detection frameworks based on Chain of Thought (CoT), namely LLaMA-DetectionADR for Supervised Fine-Tuning (SFT) and DetectionADRGPT for low-resource in-context learning. LLaMA-DetectionADR automatically generates CoT reasoning sequences to construct an instruction tuning dataset, which is then used to fine-tune the LLaMA3-8B model via Quantized Low-Rank Adaptation (QLoRA). In contrast, DetectionADRGPT leverages clustering algorithms to select representative unlabeled samples and enhances in-context learning by incorporating CoT reasoning paths together with their corresponding labels. Experimental results on the Twitter and CADEC social media datasets show that LLaMA-DetectionADR achieves excellent performance, with F1 scores of 92.67% and 86.13%, respectively. Meanwhile, DetectionADRGPT obtains competitive F1 scores of 87.29% and 82.80% with only a few labeled examples, approaching the performance of fully supervised advanced models. The overall results demonstrate the effectiveness and practical value of the proposed CoT-based generative frameworks for ADR detection from social media. Full article
(This article belongs to the Topic Generative AI and Interdisciplinary Applications)
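
The paper fine-tunes LLaMA3-8B via QLoRA on a CoT instruction dataset; a minimal configuration sketch with Hugging Face transformers and peft follows. The checkpoint ID, target modules, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model, as in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",   # assumed base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Trainable low-rank adapters on the attention projections; the rank,
# alpha, and module list are guesses for illustration.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The instruction-tuning examples built from generated CoT reasoning sequences would then be fed to a standard training loop over this adapter-wrapped model.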

21 pages, 2519 KB  
Article
PyAO: PyTorch-Based Memory-Efficient LLM Training on Ethernet-Interconnected Clusters
by Daemin Kim, Hyorim Kim, Juncheol Ahn and Sejin Park
Sensors 2026, 26(7), 2269; https://doi.org/10.3390/s26072269 - 7 Apr 2026
Viewed by 240
Abstract
As large language models (LLMs) pursue higher accuracy, their model sizes have surged, substantially increasing GPU memory consumption. Prior work mitigates this issue by distributing the memory burden across multiple GPUs. However, on clusters interconnected via Ethernet, the resulting computational intensity is insufficient to hide the significant network latency. Achieving a favorable compute-to-communication ratio is further constrained by the memory required to cache the massive activations generated during the forward pass. PyAO, proposed in this paper, effectively offloads activations, selects offloading strategies based on their offloading efficiency, and minimizes data-movement bottlenecks, thereby enabling larger micro-batch sizes. In Ethernet-interconnected cluster environments, experiments on popular models—including OPT-1.3B, GPT-0.8B, and Llama-1.2B—demonstrate that PyAO reduces peak GPU memory by up to 1.94× at the same micro-batch size, enables up to 2.5× larger batch sizes, and accelerates training by up to 3.63× relative to the baseline. Full article
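
The core idea, trading host-device transfers for lower peak GPU memory, can be illustrated with PyTorch's built-in saved-tensor hooks. This is a generic sketch of CPU activation offloading, not PyAO's efficiency-based strategy selection; the layer sizes are arbitrary and a CUDA device is assumed.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(),
                      nn.Linear(4096, 4096)).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# save_on_cpu moves tensors saved for the backward pass to pinned host
# memory, freeing GPU memory for larger micro-batches: the same
# trade-off PyAO optimizes per activation.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x)
    loss = y.sum()
loss.backward()   # offloaded activations are copied back as needed
```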

26 pages, 774 KB  
Article
A Survey on Large Language Models in Software Security: Opportunities and Threats
by Md Bajlur Rashid, Mohammad Shafayet Jamil Hossain, Mohammad Ishtiaque Khan, Sharaban Tahora, Aiasha Siddika, Mahmudul Islam Prakash, Sharmin Yeasmin and Hossain Shahriar
Computers 2026, 15(4), 226; https://doi.org/10.3390/computers15040226 - 3 Apr 2026
Viewed by 567
Abstract
The rise of large language models (LLMs) such as GPT-4, Codex, Code Llama, Claude 3, CodeGemma, and DeepSeek is changing the way software development is approached. These models provide strong support for tasks like writing code, analyzing bugs, and automation. At the same time, their use in software development creates both opportunities and new risks. This survey reviews how LLMs are being used to improve security practices in software development, including vulnerability detection, secure code generation, threat analysis, and patch development. It also discusses how attackers may exploit LLMs for malicious purposes, such as writing malware, carrying out phishing campaigns, or bypassing defenses. We draw on case studies showing that LLMs can help uncover zero-day vulnerabilities and speed up secure coding, but also highlight cases where they have been misused to generate harmful code, sometimes unintentionally. The paper examines technical challenges like bias in training data, the difficulty of interpreting model outputs, and the risks of adversarial attacks. It also considers ethical and regulatory issues related to accountability, compliance, and responsible use. By bringing together findings from recent research and industry practice, the survey outlines future directions for building safer models, developing stronger defensive frameworks, and shaping policies that balance innovation with security. Overall, the paper argues for a careful approach in which LLMs are used to strengthen software security while the risks they introduce are addressed through collaboration, oversight, and ongoing improvements. Full article
(This article belongs to the Special Issue Using New Technologies in Cyber Security Solutions (3rd Edition))

16 pages, 1185 KB  
Article
Leveraging Large Language Models for Automated Extraction of Abdominal Aortic Aneurysm Features from Radiology Reports
by Praneel Mukherjee, Ryan C. Lee, Roham Hadidchi, Sonya Henry, Michael Coard, Matthew Davis, Yossef Rubinov, Ha Nguyen-Luong, Leah Katz and Tim Q. Duong
Diagnostics 2026, 16(7), 1083; https://doi.org/10.3390/diagnostics16071083 - 3 Apr 2026
Viewed by 240
Abstract
Background/Objectives. Abdominal computed tomography (CT) radiology reports contain critical information for abdominal aortic aneurysm (AAA) management, including aneurysm presence, size, rupture status, and prior repair. However, this information is often embedded within lengthy, heterogeneous reports, making manual extraction inefficient. We evaluated the performance of multiple large language models (LLMs) for automated extraction of AAA-related findings from radiology reports. Methods. We retrospectively analyzed 500 abdominal CT reports mentioning AAA from an urban academic health system (2020–2024). Ground truth labels were established by manual review. Four open-source LLMs (Qwen2.5-7B-Instruct, Llama3-Med42-8B, GPT-OSS-20B, and MedGemma-27B-text-it) were evaluated for extraction of aneurysm presence, size, morphology, rupture status, impending rupture, and prior aortic repair. Model outputs were compared with ground truth using exact-match accuracy, and inter-model agreement was assessed using Fleiss’ kappa. Reasoning traces were examined to characterize correct and incorrect model behavior. Results. Accuracy for identifying AAA presence ranged from 0.90 to 0.95 (κ = 0.851), and prior aortic repair from 0.90 to 0.97 (κ = 0.793). Accuracy for aneurysm size ranged from 0.67 to 0.88 (κ = 0.340), with low κ’s due to class imbalance or dimension misselection. Rupture and impending rupture were identified with accuracies exceeding 0.90 across models, though agreement was lower (κ = 0.485 and 0.589), reflecting low event prevalence. Larger models (GPT-OSS-20B, MedGemma-27B) generally outperformed smaller models. Reasoning analysis revealed strengths in measurement prioritization but recurrent errors, including dimension misselection, over-inference of prior repair, and conservative classification of rupture-related findings. Conclusions. LLMs can accurately extract clinically relevant AAA information from radiology reports with interpretable reasoning, with larger and medically trained models outperforming smaller or general-purpose models. Performance varies by task and model, underscoring the need for careful validation and human-in-the-loop deployment in clinical settings. Full article
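
Inter-model agreement is reported as Fleiss' kappa; a minimal sketch of that computation with statsmodels follows. The report labels below are invented for illustration.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = radiology reports, columns = the four LLM "raters"; each cell
# is the extracted label for that report (hypothetical data).
labels = np.array([
    ["AAA",  "AAA",  "AAA",  "AAA"],
    ["AAA",  "none", "AAA",  "AAA"],
    ["none", "none", "none", "none"],
    ["AAA",  "AAA",  "none", "AAA"],
])

# aggregate_raters converts subject-by-rater labels into per-subject
# category counts, the input format fleiss_kappa expects.
table, _categories = aggregate_raters(labels)
print(fleiss_kappa(table))
```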

21 pages, 5627 KB  
Article
Comparative Performance of Large Language Models on European Gastroenterology Board-Style Questions: Analysis of Reasoning Versus Non-Reasoning Architectures
by Cem Simsek, Petr Vanek, Hakan Aydinli, Jan Krivinka, Manuel Lehner, Sara Schiavone, Cesare Hassan and Henriette H. Heinrich
J. Clin. Med. 2026, 15(7), 2692; https://doi.org/10.3390/jcm15072692 - 2 Apr 2026
Viewed by 254
Abstract
Background: While large language models (LLMs) have demonstrated proficiency in medical examinations, their comparative performance on European gastroenterology assessments remains underexplored, particularly regarding architectural differences between reasoning and non-reasoning models. This study benchmarks five state-of-the-art LLMs—DeepSeek-R1, ChatGPT-o1, ChatGPT-4o, Gemini-1.5-Pro, and Llama-3.1-405B (all versions January 2025)—using 203 board-style questions from validated ESEGH preparation materials. Methods: Questions from two commercial ESEGH preparation banks were administered five times per model using standardized prompts. Accuracy, consistency, and domain-specific performance across clinical, diagnostic, and therapeutic questions were analyzed. Four practicing gastroenterologists validated human performance under uniform conditions. Results: ChatGPT-o1 achieved the highest overall accuracy at 84.0% (95% CI: 81.8–86.3), followed closely by ChatGPT-4o (81.7%), DeepSeek-R1 (79.0%), and Llama-3.1-405B (77.2%), while Gemini-1.5-Pro significantly underperformed with 68.5% accuracy (difference vs. ChatGPT-o1: 15.5 percentage points, 95% CI: 11.9 to 19.1, p < 0.01). Although all models exhibited high internal consistency (≥98.4% average agreement across repeated attempts, with 94.6–98.0% of questions answered identically in all five attempts), greater consistency did not necessarily correspond to higher accuracy. Domain-specific analysis revealed that diagnostic questions were answered most accurately, whereas clinical examination questions posed considerable challenges. Topic analysis demonstrated that questions on small intestine disorders were answered with the highest accuracy, in contrast to the lower performance observed in bariatric and pancreatic disorders. Notably, reasoning models, which employed explicit chain-of-thought strategies, outperformed non-reasoning counterparts (81.5% vs. 75.8%, difference: 5.7 percentage points, 95% CI: 3.4 to 8.0, p < 0.001), particularly on therapy questions and complex bait-and-switch formats. Practicing gastroenterologists achieved substantially lower accuracy (mean: 50.9%, range: 37.9–69.0%) compared to all LLMs. All models exceeded the current ESEGH passing threshold of 61.5%, with the top four models surpassing this benchmark by 15.7–22.5 percentage points. Conclusions: This benchmarking study demonstrates that current LLMs, particularly those with reasoning architectures, achieve high accuracy on European gastroenterology board-style questions. However, significant performance gaps in specific domains highlight limitations that must be addressed before clinical application. These findings provide a baseline for evaluating LLM capabilities in European medical contexts. Full article
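
The two reported consistency measures (questions answered identically in all five attempts, and average pairwise agreement across attempts) are straightforward to compute; a toy sketch with invented answers:

```python
from itertools import combinations

# Hypothetical answers: five repeated attempts per question.
attempts = {
    "q1": ["B", "B", "B", "B", "B"],
    "q2": ["A", "A", "C", "A", "A"],
}

# Fraction of questions answered identically in all five attempts.
identical = sum(len(set(a)) == 1 for a in attempts.values()) / len(attempts)

# Average pairwise agreement across the five attempts per question.
pairwise = []
for a in attempts.values():
    pairs = list(combinations(a, 2))
    pairwise.append(sum(x == y for x, y in pairs) / len(pairs))
avg_agreement = sum(pairwise) / len(pairwise)

print(f"identical in all 5 attempts: {identical:.1%}")
print(f"average pairwise agreement:  {avg_agreement:.1%}")
```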

37 pages, 856 KB  
Article
Unbiasing Greek: In-Context Learning Strategies for Gender Bias Identification and Mitigation for Legal Documents and Job Ads
by Dimitrios Doumanas, Andreas Soularidis, Nikolaos Zafeiropoulos, Stamatis Chatzistamatis, George E. Tsekouras, Andreas El Saer, Chrisaphis Nathanailidis and Konstantinos Kotis
Information 2026, 17(4), 342; https://doi.org/10.3390/info17040342 - 2 Apr 2026
Viewed by 468
Abstract
Gender bias embedded in legal and professional texts perpetuates systemic inequality, yet research on bias identification and mitigation remains largely confined to English. Morphologically rich languages such as Greek, where grammatical gender pervades nouns, adjectives, pronouns, and participles, present unique challenges that existing approaches fail to address. This paper elaborates on a systematic methodology primarily focusing on identifying and mitigating gender bias in Greek-language job advertisements and legal documents. To accomplish that task, we define a taxonomy of nine gender bias rules tailored to the linguistic properties of Greek and construct domain-specific annotated datasets comprising 90 expert-curated few-shot examples across both textual domains. Using these resources, we employ XML-structured prompt engineering with in-context learning (ICL) and systematically compare three classes of models: (i) commercial large language models (LLMs), namely Claude Sonnet 4.5 and GPT-5.2, (ii) two open-weight small language models (SLMs), Mistral Small (24B) and Ministral (14B), and (iii) Llama Krikri (8B), a Greek-native language model built on Llama 3.1 and fine-tuned on high-quality Greek corpora. For each input text, the system identifies biased expressions, maps them to specific bias rules, provides explanations, and generates a fully corrected inclusive version. Our experiments reveal substantial performance disparities across model scales and linguistic specialization, with LLMs demonstrating superior contextual reasoning and SLMs exhibiting systematic over-correction and grammatical errors in Greek morphology. We further introduce a critical meta-rule addressing gender agreement with named entities to prevent spurious corrections in legal texts referencing identified individuals. The findings highlight the importance of model scale, language-specific adaptation, and carefully designed prompting strategies for bias mitigation in underrepresented languages. Full article
(This article belongs to the Special Issue Modeling in the Era of Generative AI)
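
The method pairs XML-structured prompts with few-shot ICL examples; a hedged sketch of what such a prompt template might look like follows. The tags, fields, example text, and correction are illustrative inventions, not the authors' schema or annotated data.

```python
# One invented few-shot example: a masculine-only Greek job-ad phrase
# with a hypothetical rule label and an inclusive rewrite.
FEW_SHOT = [
    {"text": "Ζητείται έμπειρος προγραμματιστής",
     "rule": "gendered-noun",
     "fix":  "Ζητείται έμπειρος/η προγραμματιστής/στρια"},
]

def build_prompt(input_text: str) -> str:
    examples = "\n".join(
        f"<example>\n  <input>{ex['text']}</input>\n"
        f"  <rule>{ex['rule']}</rule>\n"
        f"  <output>{ex['fix']}</output>\n</example>"
        for ex in FEW_SHOT
    )
    return (
        "<task>Identify gender-biased expressions in the Greek text, "
        "map each to a bias rule, explain it, and return a fully "
        "corrected inclusive version.</task>\n"
        f"<examples>\n{examples}\n</examples>\n"
        f"<input>{input_text}</input>"
    )

print(build_prompt("Ο υποψήφιος πρέπει να είναι δυναμικός."))
```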

36 pages, 1163 KB  
Article
A Multicriteria Framework for Evaluation and Selection of Conversational AI Assistants in Mental Health
by Constanta Zoie Radulescu, Marius Radulescu and Alexandra Ioana Mihailescu
Future Internet 2026, 18(4), 191; https://doi.org/10.3390/fi18040191 - 1 Apr 2026
Viewed by 381
Abstract
The rapid proliferation of Conversational Artificial Intelligence Assistants (CAIs) has transformed access to mental health information through freely accessible web interfaces, mobile applications, and public APIs (Application Programming Interfaces), yet systematic methodologies for their evaluation remain limited. This paper introduces SELCAI-MH, a multicriteria framework for CAI evaluation and selection. This framework integrates four complementary multicriteria methods: Technique for Order Preference by Similarity to an Ideal Solution (TOPSIS), VIseKriterijumska Optimizacija I Kompromisno Resenje (VIKOR), Complex Proportional Assessment Method (COPRAS), and Combinative Distance-based Assessment (CODAS), capturing distance-based, compromise-based, proportional, and negative-ideal logics, and proposes SOLAG, an aggregation method that produces a consensus ranking across methods. SELCAI-MH employs a dual evaluation mechanism combining psychiatric expert assessment with AI-based scoring, expert-derived criterion weights, and domain-relevant conversational datasets. The framework is applied to nine internet-accessible CAIs: proprietary platforms (ChatGPT 5.2, Claude Sonnet 4.5, Gemini 1.5 Flash, Perplexity Sonar, Bing AI/Copilot) and open-source Llama variants deployed via cloud inference endpoints. Using a set of anxiety-related questions and CAI responses, evaluated across seven criteria, Claude Sonnet 4.5 emerged optimal, followed by ChatGPT 5.2 and Gemini 1.5 Flash. SOLAG produced highly consistent rankings across the four multicriteria decision-making (MCDM) methods (Spearman ρ ≥ 0.98). Overall, SELCAI-MH provides a structured and reproducible decision-support framework for selecting accessible CAIs in sensitive mental health contexts. Full article
(This article belongs to the Special Issue Artificial Intelligence-Enabled Smart Healthcare)
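
TOPSIS is one of the four MCDM methods the framework integrates; a compact numpy sketch of the standard algorithm follows. The decision matrix and weights are invented, and all criteria are treated as benefit-type for simplicity.

```python
import numpy as np

# Rows = CAIs, columns = criteria (hypothetical expert/AI scores).
X = np.array([[7.0, 8.0, 6.5],
              [8.5, 7.5, 7.0],
              [6.0, 9.0, 8.0]])
w = np.array([0.5, 0.3, 0.2])          # expert-derived criterion weights

V = w * X / np.linalg.norm(X, axis=0)  # weighted, vector-normalized matrix
ideal, anti = V.max(axis=0), V.min(axis=0)

d_pos = np.linalg.norm(V - ideal, axis=1)   # distance to ideal solution
d_neg = np.linalg.norm(V - anti, axis=1)    # distance to negative-ideal
closeness = d_neg / (d_pos + d_neg)         # higher = better alternative

print(np.argsort(-closeness))               # ranking of the CAIs
```

VIKOR, COPRAS, and CODAS would each produce their own ranking from the same matrix, and SOLAG would then aggregate the four rankings into the consensus order.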

36 pages, 9313 KB  
Article
Development of Bispecific Antibody Targeting Human IL-17A and IL-6
by Beata Pamuła, Martyna Banach, Marta Mikońska, Karolina Korytkowska, Krzysztof Lacek, Oliwia Śniadała, Małgorzata Marczak, Krzysztof Flis, Aleksandra Sowińska, Damian Kołakowski, Jerzy Pieczykolan, Beata Zygmunt, Maciej Wieczorek and Olga Abramczyk
Antibodies 2026, 15(2), 29; https://doi.org/10.3390/antib15020029 - 30 Mar 2026
Viewed by 437
Abstract
Background/Objectives: Antibodies are a rapidly expanding field in drug discovery, but their monospecificity limits therapeutic applications, particularly in complex inflammatory diseases. Multispecific therapeutics, which combine variable regions targeting two or more antigens, offer potential advantages such as enhanced efficacy, broader target modulation, and reduced side effects. This study aimed to identify and characterize bispecific, VHH-based antibodies simultaneously targeting IL-6 and IL-17A—two key cytokines involved in autoimmune and chronic inflammatory conditions. Methods: A phage display screening was conducted using llama-derived VHH libraries to select binders against human IL-6 and IL-17A. Binding affinities of individual VHHs and assembled bispecific constructs were assessed using Bio-Layer Interferometry (BLI). Functional activity was evaluated using reporter cell lines responsive to IL-6 and IL-17A signaling. Biophysical and quality assessments of selected VHHs and bispecific antibodies were performed using the Uncle screening platform and LabChip capillary electrophoresis. Results: Several high-affinity VHH binders were identified for both IL-6 and IL-17A, and incorporated into bispecific antibody formats. The bispecific candidates exhibited simultaneous inhibition of both cytokine pathways in functional reporter assays. Biophysical characterization confirmed good stability and purity profiles for selected molecules. Conclusions: This study demonstrates the feasibility of generating stable, functional bispecific VHH-based antibodies targeting IL-6 and IL-17A. These constructs show potential as therapeutic agents for treating autoimmune and chronic inflammatory diseases by modulating multiple signaling pathways simultaneously. Full article
(This article belongs to the Section Antibody Discovery and Engineering)

32 pages, 1203 KB  
Article
An Experimental Study on Harassment Moderation in Llama and Alpaca
by Henrique Tostes de Sousa and Leo Natan Paschoal
Big Data Cogn. Comput. 2026, 10(4), 100; https://doi.org/10.3390/bdcc10040100 - 24 Mar 2026
Viewed by 459
Abstract
The growing integration of chatbots and large language models (LLMs) into society raises important concerns about their potential to reproduce toxic human behaviors. As a result, it is essential to investigate these models to mitigate or eliminate such risks. This paper presents an experimental study evaluating the responses of the Llama and Alpaca models to scenarios involving verbal harassment. The methodology involved using harassment dialogues generated by an LLM as prompts to elicit responses from both models. The responses were then analyzed for levels of toxicity, sexually explicit content, and flirtatiousness. The results indicate that although both models reduce explicit offensive terms, they exhibit limitations in identifying and intercepting abusive behavior from users. Statistical analysis reveals that general-purpose instruction tuning in Alpaca does not provide a robust safety barrier compared to the Llama base model for most variables investigated in the experiment. However, a significant difference was observed concerning flirting, where Llama proved more prone to validation and encouragement than Alpaca. Furthermore, the study identifies critical vulnerabilities, such as a “self-deprecation” bias in Llama and “mirroring” behavior in Alpaca. We also report a complementary triangulation with GPT-family models as a secondary point of reference. This paper contains and discusses content that may be offensive or upsetting. Full article
(This article belongs to the Special Issue Artificial Intelligence in Digital Humanities)
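
The abstract does not name its scoring tool, but Google's Perspective API exposes attributes matching the three reported dimensions (toxicity, sexually explicit content, flirtatiousness), so here is a hedged sketch of scoring a model response with it. The API key is a placeholder and the attribute choice is an assumption.

```python
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"   # placeholder credential

client = discovery.build(
    "commentanalyzer", "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/"
                        "$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def score(text: str) -> dict:
    """Return summary scores for the three attributes of interest."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {
            "TOXICITY": {}, "SEXUALLY_EXPLICIT": {}, "FLIRTATION": {},
        },
    }
    resp = client.comments().analyze(body=body).execute()
    return {attr: v["summaryScore"]["value"]
            for attr, v in resp["attributeScores"].items()}

print(score("model response to a harassment prompt goes here"))
```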

31 pages, 4949 KB  
Article
Attention Distribution-Aware Softmax for NPU-Accelerated On-Device Inference of LLMs: An Edge-Oriented Approximation Design
by Sanoop Sadheerthan, Min-Jie Hsu, Chih-Hsiang Huang and Yin-Tien Wang
Electronics 2026, 15(6), 1312; https://doi.org/10.3390/electronics15061312 - 20 Mar 2026
Viewed by 438
Abstract
Low-power NPUs enable on-device LLM inference through efficient integer and fixed-point algebra, yet their lack of native exponential support makes Transformer softmax a critical performance bottleneck. Existing NPU kernels approximate e^x using uniform piecewise polynomials to enable O(1) SIMD indexing, but this wastes computation by applying high-degree arithmetic indiscriminately in every segment. Conversely, fully adaptive approaches maximize statistical fidelity but introduce pipeline stalls due to comparator-based boundary search. To bridge this gap, we propose an attention distribution-aware softmax that uses Particle Swarm Optimization (PSO) to define non-uniform segments and variable polynomial degrees, prioritizing finer granularity and lower arithmetic complexity in attention-dense regions. To ensure efficiency, we snap boundaries onto a 128-bin LUT, enabling O(1) retrieval of segment parameters without branching. Inference measurements show that this favors low-degree execution, minimizing exp-kernel overhead. Using TinyLlama-1.1B-Chat as a testbed, the proposed weighted design reduces cycles per call (CPC) of the exp kernel by 18.5% versus an equidistant uniform Degree-4 baseline and 13.1% versus uniform Degree-3, while preserving ranking fidelity. These results show that grid-snapped, variable-degree approximation can improve softmax efficiency while largely preserving attention ranking fidelity, enabling accurate edge LLM inference. Full article
(This article belongs to the Special Issue Emerging Applications of FPGAs and Reconfigurable Computing System)
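
A rough numpy sketch of the grid-snapped, branch-free lookup idea follows. The per-bin coefficients here come from a plain polyfit rather than the paper's PSO-optimized non-uniform segments and variable degrees, and all constants (domain, bin count, degree) are illustrative.

```python
import numpy as np

LO, HI, BINS, DEG = -8.0, 0.0, 128, 2   # softmax inputs <= 0 after max-subtraction
edges = np.linspace(LO, HI, BINS + 1)

# Offline: fit one low-degree polynomial per bin (a stand-in for the
# PSO-chosen segments and degrees in the paper).
coeffs = np.empty((BINS, DEG + 1))
for i in range(BINS):
    xs = np.linspace(edges[i], edges[i + 1], 16)
    coeffs[i] = np.polyfit(xs, np.exp(xs), DEG)   # highest degree first

def exp_approx(x):
    # O(1) grid-snapped bin index: pure arithmetic, no boundary search.
    x = np.clip(x, LO, HI)
    idx = np.minimum(((x - LO) * (BINS / (HI - LO))).astype(int), BINS - 1)
    c = coeffs[idx]                     # gather per-element coefficients
    return c[..., 0] * x * x + c[..., 1] * x + c[..., 2]   # degree-2 eval

x = np.array([-7.3, -2.0, -0.1])
print(exp_approx(x), np.exp(x))         # approximation vs. reference
```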

46 pages, 2822 KB  
Review
Generative AI and the Foundation Model Era: A Comprehensive Review
by Abdussalam Elhanashi, Siham Essahraui, Pierpaolo Dini, Davide Paolini, Qinghe Zheng and Sergio Saponara
Big Data Cogn. Comput. 2026, 10(3), 94; https://doi.org/10.3390/bdcc10030094 - 20 Mar 2026
Viewed by 1225
Abstract
Generative artificial intelligence and foundation models have changed machine learning by allowing systems to produce readable text, realistic images, and other multimodal content with little direct input from a user. Foundation models are large neural networks trained on very large and varied datasets, and they form the core of many current generative AI (GenAI) systems. Their rapid development has led to major advances in areas like natural language processing, computer vision, multimodal learning, and robotics. Examples include GPT, LLaMA, and diffusion-based architectures, such as models often used for image generation. Systems such as Stable Diffusion show this shift by illustrating how AI can interpret information, draw basic inferences, and produce new outputs using more than one type of data. This review surveys common foundation model architectures and examines what they can do in generative tasks. It reviews Transformer, diffusion, and multimodal architectures, focusing on methods that support scaling and transfer across domains. The paper also reviews key approaches to pretraining and fine-tuning, including self-supervised learning, instruction tuning, and parameter-efficient adaptation, which support these systems’ ability to generalize across tasks. In addition to the technical details, this review discusses how GenAI is being used for text generation, image synthesis, robotics, and biomedical research. The study also notes continuing challenges, such as the high computing and energy demands of large models, ethical concerns about data bias and misinformation, and worries about privacy, reliability, and responsible use of AI in real settings. This review brings together ideas about model design, training methods, and social implications to point future research toward GenAI systems that are efficient, easy to interpret, and reliable, while supporting scientific progress and ethical responsibility. Full article
(This article belongs to the Special Issue Multimodal Deep Learning and Its Applications)

19 pages, 599 KB  
Article
Reducing Hallucinations in Medical AI Through Citation Enforced Prompting in RAG Systems
by Lukasz Pawlik and Stanislaw Deniziak
Appl. Sci. 2026, 16(6), 3013; https://doi.org/10.3390/app16063013 - 20 Mar 2026
Viewed by 567
Abstract
The safe integration of Large Language Models in clinical environments requires strict adherence to verified medical evidence. As part of the PARROT AI project, this study provides a systematic evaluation of how prompting strategies affect the reliability of Retrieval-Augmented Generation (RAG) pipelines using the MedQA USMLE benchmark (N = 500). Four prompting strategies were examined: Baseline (zero-shot), Neutral, Expert Chain-of-Thought (Expert-CoT) with structured clinical reasoning, and StrictCitations with mandatory evidence grounding. The experiments covered six modern model architectures: Command R (35B), Gemma 2 (9B and 27B), Llama 3.1 (8B), Mistral Nemo (12B), and Qwen 2.5 (14B). Evaluation was conducted using the Deterministic RAG Evaluator, providing an objective assessment of grounding through the Unsupported Sentence Ratio (USR) based on TF-IDF and cosine similarity. The results indicate that structured reasoning in the Expert-CoT strategy significantly increases USR values (reaching 95–100%), as models prioritize internal diagnostic logic over verbatim context. In contrast, the StrictCitations strategy, while maintaining high USR due to the conservative evaluation threshold, achieves the highest level of verifiable grounding and source adherence. The analysis identifies a statistically significant Verbosity Signal (r = 0.81, p < 0.001), where increased response length serves as a proxy for model uncertainty and parametric leakage, a pattern particularly prominent in Llama 3.1 and Gemma 2. Overall, the findings demonstrate that prompting strategy selection is as critical for clinical reliability as model architecture. This work delivers a reproducible framework for the development of trustworthy medical AI assistants and highlights citation-enforced prompting as a vital mechanism for improving clinical safety. Full article
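
Grounding is scored via an Unsupported Sentence Ratio over TF-IDF cosine similarity; a minimal scikit-learn sketch of that metric follows. The 0.5 threshold and the toy sentences are assumptions, not the paper's evaluator settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def usr(answer_sentences, context_chunks, threshold=0.5):
    """Fraction of answer sentences with no sufficiently similar
    retrieved chunk: a proxy for ungrounded (parametric) content."""
    vec = TfidfVectorizer().fit(answer_sentences + context_chunks)
    sims = cosine_similarity(vec.transform(answer_sentences),
                             vec.transform(context_chunks))
    unsupported = (sims.max(axis=1) < threshold).sum()
    return unsupported / len(answer_sentences)

context = ["Metformin is first-line therapy for type 2 diabetes."]
answer = ["Metformin is the first-line therapy for type 2 diabetes.",
          "It also cures hypertension."]       # ungrounded sentence
print(usr(answer, context))                    # 0.5 with these toy inputs
```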

23 pages, 973 KB  
Article
Evaluation of Linguistic Consistency of LLM-Generated Text Personalization Using Natural Language Processing
by Linh Huynh and Danielle S. McNamara
Electronics 2026, 15(6), 1262; https://doi.org/10.3390/electronics15061262 - 18 Mar 2026
Viewed by 414
Abstract
This study proposes a Natural Language Processing (NLP)-based evaluation framework to examine the linguistic consistency of large language model (LLM)-generated personalized texts over time. NLP metrics were used to quantify and compare linguistic patterns across repeated generations produced using identical prompts. In Experiment 1, internal reliability was examined across 10 repeated generations from four LLMs (Claude, Llama, Gemini, and ChatGPT), applied to 10 scientific texts tailored for a specific reader profile. Linear mixed-effects models showed no effect of repeated generation on linguistic features (e.g., cohesion, syntactic complexity, lexical sophistication), suggesting short-term consistency across repeatedly generated outputs. Experiment 2 examined linguistic variation across model updates of GPT-4o (October 2024 vs. June 2025) and GPT-4.1 (June 2025). Significant variations were observed across outputs from different model versions. GPT-4o (June 2025) generated more concise but cohesive texts, whereas GPT-4.1 (June 2025) generated outputs that are more academic, lexically sophisticated, and complex in syntax. Given the rapid evolution of LLMs and the lack of standardized methods for tracking output consistency, the current work demonstrates one of the applications of NLP-based evaluation approaches for monitoring meaningful linguistic shifts across model updates over time. Full article
(This article belongs to the Special Issue AI-Powered Natural Language Processing Applications)
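
Linear mixed-effects models test whether a linguistic feature drifts across repeated generations; a sketch with statsmodels follows. The data frame, column names, and cohesion scores are hypothetical stand-ins for the study's NLP metrics.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one cohesion score per (text, generation).
df = pd.DataFrame({
    "text_id":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "generation": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
    "cohesion":   [0.52, 0.50, 0.53, 0.61, 0.60, 0.62,
                   0.47, 0.49, 0.48, 0.55, 0.56, 0.54],
})

# Fixed effect of generation index, random intercept per source text;
# a flat 'generation' slope indicates short-term output consistency.
model = smf.mixedlm("cohesion ~ generation", data=df, groups=df["text_id"])
print(model.fit().summary())
```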

26 pages, 977 KB  
Article
KE-MLLM: A Knowledge-Enhanced Multi-Sensor Learning Framework for Explainable Fake Review Detection
by Jiaying Chen, Jingyi Liu, Yiwen Liang and Mengjie Zhou
Appl. Sci. 2026, 16(6), 2909; https://doi.org/10.3390/app16062909 - 18 Mar 2026
Viewed by 250
Abstract
The proliferation of fake reviews on e-commerce and social platforms has severely undermined consumer trust and market integrity, necessitating robust and interpretable real-time detection mechanisms with multi-sensor data fusion capabilities. While traditional machine learning approaches have shown promise in identifying fraudulent reviews, they often lack transparency and fail to leverage the rich contextual knowledge embedded in large-scale datasets. In this paper, we propose KE-MLLM (Knowledge-Enhanced Multimodal Large Language Model), a unified framework that integrates knowledge-enhanced prompting with parameter-efficient fine-tuning for explainable fake review detection. Our approach employs LoRA (Low-Rank Adaptation) to fine-tune lightweight large language models (LLaMA-3-8B) on review text, while incorporating multimodal behavioral sensor signals including temporal patterns, user metadata, and social network characteristics for comprehensive anomaly sensing. To address the critical need for interpretability in fraud detection systems, we implement a Chain-of-Thought (CoT) reasoning module that generates human-understandable explanations for classification decisions, highlighting linguistic anomalies, sentiment inconsistencies, and behavioral red flags. We enhance the model’s discriminative capability through a knowledge distillation strategy that transfers domain-specific expertise from larger teacher models while maintaining computational efficiency suitable for edge sensing devices. Extensive experiments on two benchmark datasets—YelpChi and Amazon Reviews from the DGL Fraud Dataset—show that KE-MLLM achieves strong performance, reaching an F1-score of 94.3% and an AUC-ROC of 96.7% on YelpChi and outperforming the strongest baseline in our comparison by 5.8 and 4.2 percentage points, respectively. Furthermore, human evaluation indicates that the generated explanations achieve 89.5% consistency with expert annotations, suggesting that the framework can improve the interpretability and practical usefulness of automated fraud detection systems. The proposed framework provides a useful step toward more accurate and interpretable fake review detection and offers a practical reference for building more transparent and accountable AI systems in high-stakes applications. Full article
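
The framework distills expertise from a larger teacher model; a generic PyTorch sketch of the standard KL-divergence distillation loss follows. The temperature, mixing weight, and two-class setup are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    # Soft-target term: KL between temperature-smoothed distributions,
    # scaled by T^2 to keep gradient magnitudes comparable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(4, 2)                 # student logits (fake vs. real review)
t = torch.randn(4, 2)                 # teacher logits
y = torch.tensor([0, 1, 1, 0])        # ground-truth labels
print(distillation_loss(s, t, y))
```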
