Search Results (395)

Search Parameters:
Keywords = Llama

21 pages, 751 KB  
Article
NGS-Based Genomic Characterization of ESBL/AmpC-Producing Extraintestinal Pathogenic Escherichia coli from Captive Wildlife in Tunisia
by Zaineb Hamzaoui, Hajer Kilani, Sana Ferjani, Elaa Maamar, Ahmed Fakhfakh, Lamia Kanzari and Ilhem Boutiba-Ben Boubaker
Antibiotics 2026, 15(5), 449; https://doi.org/10.3390/antibiotics15050449 - 29 Apr 2026
Abstract
Background/Objectives: Multidrug-resistant (MDR) Escherichia coli resistant to third-generation cephalosporins are a growing One Health concern, but data on extraintestinal pathogenic E. coli (ExPEC) from wildlife in North Africa remain scarce. We aimed to characterize ESBL/AmpC-producing ExPEC from captive wild mammals in Tunisia and to situate these isolates in a global genomic context. Methods: In 2018, 30 fecal samples from 14 captive wild mammals in a private farm were screened on cefotaxime agar. Four cefotaxime-resistant E. coli isolates were recovered from a llama, lion, hyena, and tiger. Antimicrobial susceptibility testing and Illumina whole-genome sequencing were combined with in silico typing, resistome and virulome profiling, plasmid and mobile element analysis, human pathogenicity prediction and core-genome MLST-based minimum-spanning trees. Results: All isolates were MDR but remained susceptible to carbapenems, colistin and tigecycline. Two ST162/B1 isolates from the llama and tiger carried blaCMY-2, whereas two ST69/D isolates from the lion and hyena harbored blaCTX-M-15 and qnrS1. Genomes encoded 61–68 antimicrobial resistance genes and 114–131 virulence-associated genes, together with IncF-, IncI1- and IncY-type plasmids and IS26-rich insertion sequence profiles. Mating-out assays yielded cefotaxime-resistant transconjugants, supporting plasmid transferability of blaCMY-2 or blaCTX-M-15. PathogenFinder predicted a ≥0.93 probability of human pathogenicity for all isolates. cgMLST-based trees showed that Tunisian ST69 and ST162 clustered within internationally disseminated lineages containing human, animal and food isolates, rather than forming wildlife-restricted branches. Conclusions: Captive wild mammals in Tunisia can harbor high-risk ExPEC lineages combining ESBL/AmpC production, multidrug resistance and extensive virulence and mobility gene repertoires. 
These findings highlight captive wildlife as potential reservoirs and sentinels of clinically relevant E. coli and underscore the need for integrated WGS-based One Health surveillance at the human–animal–environment interface in North Africa. Full article

25 pages, 1105 KB  
Article
Few-Shot Portfolio Optimization: Can Large Language Models Outperform Quantitative Portfolio Optimization? A Comparative Study of LLMs and Optimized Portfolio Allocators
by Lamukanyani Alson Mantshimuli and John Weirstrass Muteba Mwamba
J. Risk Financial Manag. 2026, 19(5), 320; https://doi.org/10.3390/jrfm19050320 - 28 Apr 2026
Abstract
Recent advances in large language models (LLMs) have raised questions about their potential role in portfolio allocation beyond traditional sentiment analyses. This study investigated whether LLMs, when prompted directly, can autonomously generate portfolio weights that compete with classical optimization and AI-enhanced strategies. We evaluated seven medium-sized open-source LLMs—Gemma-7B, Mistral-7B, Jansen Adapt-Finance-Llama2-7B, DeepSeek-R1-8B, QuantFactory Llama-3-8B-Instruct-Finance, Qwen-7B, and Llama2-7B—using systematic prompt engineering and temperature tuning. Portfolios were constructed from financial news headlines for S&P 500 equities and benchmarked against mean–variance optimization (MVO), the Black–Litterman model, AI-driven optimizers, and naive diversification strategies. The results show that, while LLM-generated portfolios outperformed naive diversification (Sharpe ratio up to 0.741), they lagged behind AI-optimized benchmarks (Sharpe ratio up to 1.361). A transaction cost analysis revealed that low-turnover LLM strategies retain their competitiveness post-costs, surpassing cap-weighted benchmarks. Statistical tests confirmed significant performance differences (p < 0.01). These findings highlight the ability of LLMs to extract signals from unstructured text, but also their limitations without explicit optimization. Future research should explore hybrid frameworks that combine LLM reasoning with quantitative optimization for cost-sensitive environments. Full article
(This article belongs to the Section Financial Technology and Innovation)
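The study above compares strategies by their Sharpe ratios (0.741 for the best LLM portfolio versus 1.361 for AI-optimized benchmarks). A minimal sketch of how such a figure is typically computed from per-period returns — the function name, 252-trading-day annualization, and zero risk-free rate are assumptions for illustration, not the paper's code:

```python
import math

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a list of per-period simple returns.

    Hypothetical helper: excess return mean over its sample standard
    deviation, scaled by the square root of periods per year.
    """
    excess = [r - risk_free / periods_per_year for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return (mean / math.sqrt(var)) * math.sqrt(periods_per_year)
```

A higher ratio means more excess return per unit of volatility, which is why it is the common yardstick across the optimization baselines in the abstract.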

35 pages, 5864 KB  
Review
The State of Practice in Application of Natural Language Processing in Transportation Safety Analysis
by Mohammadjavad Bazdar, Hyun Kim, Branislav Dimitrijevic and Joyoung Lee
Appl. Sci. 2026, 16(9), 4223; https://doi.org/10.3390/app16094223 - 25 Apr 2026
Abstract
This paper provides a systematic review of recent applications of NLP methods for analyzing traffic crash reports, with a focus on estimating crash severity, crash duration, and crash causation. The review covers prior research using probabilistic topic modeling methods such as LDA, STM, and hierarchical Dirichlet processes in addition to research using transformer-based language models, which include encoder-based models like BERT and PubMedBERT as well as decoder-based models like GPT, GPT2, ChatGPT, GPT-3, and LLaMA. The review starts with a systematic literature selection process with predefined inclusion criteria. We categorize the reviewed studies into the following application areas: crash severity prediction, risk factor identification in crashes, and road safety analysis. The results show several complementary advantages of using different NLP techniques to achieve different analytical goals. Topic models allow for interpretable and exploratory pattern discovery, while encoder models are well-suited for structured prediction problems. Decoder models have the additional flexibility to perform zero-shot and few-shot reasoning, which makes them useful for reasoning about under-sampled or under-reported data. Across the literature, hybrid methods that combine text and structured data outperform individual methods in terms of prediction accuracy and broad applicability. Challenges across the literature include class imbalance, lack of standardization in preprocessing and evaluation methods, and the tradeoff between prediction accuracy and interpretability of prediction models. These findings highlight the importance of aligning model selection with data availability and operational constraints, pointing toward future research directions in hybrid modeling frameworks, standardized evaluation protocols, and real-world deployment of NLP-driven traffic safety systems. Full article
(This article belongs to the Special Issue Traffic Safety Measures and Assessment: 2nd Edition)

24 pages, 750 KB  
Article
Adversarial Evaluation of Large Language Models for Building Robust Offensive Language Detection in Moroccan Arabic
by Soufiyan Ouali, Kanza Raisi, Asmaa Mourhir, El Habib Nfaoui and Said El Garouani
Big Data Cogn. Comput. 2026, 10(5), 132; https://doi.org/10.3390/bdcc10050132 - 24 Apr 2026
Abstract
Offensive language detection is crucial for ensuring safe and inclusive digital environments. Identifying harmful content protects users and supports healthier online interactions. Despite advances in transformer-based models, particularly Large Language Models (LLMs), their application to this task remains underexplored for low-resource languages such as Moroccan Arabic, especially compared with high-resource languages. This study evaluates the performance of various open- and closed-source LLMs for offensive language detection in Moroccan Darija. The evaluated models include general-purpose LLMs such as LLaMA, Mistral, and Gemma, as well as Arabic-focused models such as ArabianGPT, Falcon Arabic, and Atlas-Chat. We also experiment with reasoning models such as DeepSeek and GPT-4. Beyond traditional evaluation metrics, we investigate the robustness of these LLMs and examine the impact of adversarial training on their performance. Moreover, we contribute to the field by creating a large, high-quality dataset. Our evaluation revealed that GPT-4o Mini achieved the best overall performance, reaching an F1-score of 88%. However, robustness testing under black-box and white-box adversarial attacks exposed notable vulnerabilities, with attack success rates reaching 30%, thereby highlighting the need for enhancement. Despite the complex morphology and linguistic variability of Moroccan Darija, adversarial training resulted in a notable improvement in both overall model performance and robustness against adversarial attacks, yielding an average increase of 20.89% in resistance to attacks. Furthermore, this approach enabled GPT-4o Mini to achieve an F1-score of 91%, surpassing the current state-of-the-art performance by 6%. These results highlight the importance of incorporating adversarial approaches in low-resource dialectal settings to effectively address linguistic variability and data scarcity. Full article
(This article belongs to the Special Issue Natural Language Processing Applications in Big Data)

10 pages, 418 KB  
Article
Empirical Analysis of Internal Hallucination Detection in Quantized LLMs: Layer Dynamics and White-Box Benchmarks
by Haohua Liu and Jinli Xu
Electronics 2026, 15(9), 1802; https://doi.org/10.3390/electronics15091802 - 23 Apr 2026
Abstract
As large language models (LLMs) move onto resource-constrained devices, maintaining factual reliability without adding another expensive decoding pass becomes a practical inference problem. Instead of introducing another complex hallucination detector, this paper presents an empirical study of which low-cost white-box features remain useful under a controlled single-pass benchmark. Across repeated candidate-answer reruns on Qwen2.5-1.5B-Instruct and Llama-3.2-1B-Instruct, truthful and incorrect internal states are most separable in the middle-to-late layers, with the peak consistently falling at 50–70% of total network depth across both model families. The depth-relative pattern is more stable than any single detector ranking: simple residual-space baselines, including Mahalanobis scoring, remain competitive with more elaborate residual-plus-spectral fusion features under the same protocol, although detector ranking still changes by task. A separate preliminary two-seed Qwen2.5-7B-Instruct BF16 probe under that same white-box benchmark reproduces the same middle-to-late peak, and auxiliary Int8 checks on Qwen2.5-1.5B and Qwen2.5-7B remain consistent with that same localization under moderate quantization. Taken together, the results point away from detector complexity and toward a more reproducible question of where hallucination cues emerge, which internal statistics remain reliable, and how cautiously such conclusions should be transferred to deployment settings. Full article

23 pages, 4572 KB  
Article
LLaMA-XR: A Novel Framework for Radiology Report Generation Using LLaMA and QLoRA Fine Tuning
by Md. Zihad Bin Jahangir, Muhammad Ashad Kabir, Sumaiya Akter, Israt Jahan and Minh Chau
Bioengineering 2026, 13(5), 493; https://doi.org/10.3390/bioengineering13050493 - 23 Apr 2026
Abstract
Background: The goal of automated radiology report generation is to help radiologists in their task of creating descriptive reports from chest radiographs. However, the process of creating coherent and contextually accurate reports has been challenging, mainly due to the intricacies of medical language and the need to correlate visual data with textual descriptions. Methods: This study presents LLaMA-XR, a novel framework that integrates Meta LLaMA 3.1 Large Language Model with DenseNet-121-based image embeddings and Quantized Low-Rank Adaptation (QLoRA) fine-tuning. Results: The experiment conducted on the IU X-ray dataset demonstrates that LLaMA-XR outperforms a range of state-of-the-art methods. It achieves an ROUGE-L score of 0.433 and a METEOR score of 0.336, establishing new performance benchmarks in the domain. Conclusions: These results underscore LLaMA-XR’s potential as an effective artificial intelligence system for automated radiology reporting, offering enhanced performance. Full article
(This article belongs to the Special Issue AI-Driven Imaging and Analysis for Biomedical Applications)
30 pages, 1495 KB  
Article
Echocardiography Report Translation and Inference Based on Parameter-Efficient Fine-Tuning of LLaMA Models
by Hsin-Ta Chiao, Wei-Wen Lin, Shang-Yang Tseng, Yu-Cheng Hsieh and Chao-Tung Yang
Diagnostics 2026, 16(8), 1223; https://doi.org/10.3390/diagnostics16081223 - 20 Apr 2026
Abstract
Background/Objectives: Echocardiography reports are essential diagnostic tools, but their complexity and specialized English terminology frequently hinder comprehension for non-specialists and patients. This study addresses these accessibility gaps by developing a resource-efficient large language model (LLM) system designed to translate and summarize English echocardiography results into Traditional Chinese. Methods: To overcome significant hardware constraints, we utilized Quantized Low-Rank Adapter (QLoRA) techniques and the Unsloth acceleration framework to fine-tune LLaMA-3.2-1B and LLaMA-3.2-3B-Instruct models on a single mid-tier GPU. The system employs a dual-stage inference architecture: the first stage provides technical medical translation for clinicians, while the second stage generates simplified, patient-centric educational summaries to enhance health literacy. Results: Evaluation across multiple metrics, including BLEU, ROUGE, METEOR, and Perplexity, demonstrated that the LLaMA-3.2-3B-Instruct model with the AdamW 8-bit optimizer achieved the most stable validation performance, excelling in semantic coherence and structural consistency. A preliminary qualitative error analysis conducted in the Discussion section further identified clinical nuances, such as terminology simplification and minor hallucinations, underscoring the critical necessity of a Human-in-the-Loop verification procedure. Conclusions: These findings validate the feasibility of deploying cutting-edge medical AI in resource-limited clinical environments. While the results reflect validation-only performance on a specialized dataset, the platform offers a scalable foundation for enhancing clinical decision support and health literacy through accessible, automated medical text processing. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

23 pages, 2302 KB  
Article
TabEng-QLoRA: Criticality-Aware Tabular-to-Text Adaptation of Large Language Models via Saliency-Guided Quantized Low-Rank Fine-Tuning
by Seda Bayat Toksoz and Gultekin Isik
Electronics 2026, 15(8), 1728; https://doi.org/10.3390/electronics15081728 - 19 Apr 2026
Abstract
Applying large language models (LLMs) to industrial fault classification is hindered by the mismatch between tabular sensor data and text-based inputs and by the high memory cost of fine-tuning billion-parameter models on edge hardware. This paper presents TabEng-QLoRA, a framework with three contributions: (1) a criticality-aware serialization module that converts tabular sensor records into structured prompts, placing fault-critical features in semantically prominent positions; (2) a saliency-guided rank allocation mechanism that profiles layer-wise activation norms on a 500-sample calibration set and assigns adapter ranks in three tiers (r ∈ {8, 16, 32}); and (3) a feed-forward domain router for automatic adapter selection (98.1% accuracy, 0.6 ms latency). Experiments on three public benchmarks (the AI4I Predictive Maintenance Dataset) using three foundation models (LLaMA-3-8B, Mistral-7B, and Qwen2-7B) show that TabEng-QLoRA achieves a mean macro F1 of 0.908, a 10.6% gain over standard QLoRA, within 4.6–5.2 GB peak GPU memory. The framework closes 82% of the gap to full fine-tuning, while offering advantages in cross-equipment transfer learning (zero-shot macro F1: 0.743 vs. 0.341 for XGBoost retrained on 20% of target-domain data, as XGBoost cannot perform zero-shot transfer). Ablation results confirm statistically significant contributions from all three components (p < 0.001). Full article
(This article belongs to the Section Artificial Intelligence)
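The headline metric in the entry above is macro F1 (0.908 for TabEng-QLoRA). For readers comparing across the fault-classification results, a generic sketch of the macro-averaged F1 computation — illustrative only, not the paper's evaluation code:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with
    equal weight per class (so rare fault classes count as much as
    common ones)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

The equal per-class weighting is what makes macro F1 a natural choice for imbalanced fault datasets, unlike plain accuracy.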

11 pages, 525 KB  
Article
Assessment of Stage Two Hypertension Treatment Plans Written by Generative AI
by Tai Metzger, Zaheen Hossain, Kody Park, Stephen Vu, Simon Dixon and Tracey A. H. Taylor
J. Clin. Med. 2026, 15(8), 3103; https://doi.org/10.3390/jcm15083103 - 18 Apr 2026
Abstract
Background/Objectives: As use of large language models (LLMs) in clinical practice, in medical education, and by patients increases, it is essential to ensure that information provided is accurate and safe. Our objective was to compare stage two hypertension treatment plans generated by popular LLMs. Methods: ChatGPT (GPT-4o), Claude (Claude 4 Sonnet), ClinicalKey AI, Microsoft Copilot (Wave 2), DeepSeek-V3-0324, Dyna AI, Google Gemini (2.5 Flash), Grok (version 3), Meta AI assistant (Llama 4 Maverick), OpenEvidence (version 2.0), Perplexity (Sonar backend model), and Pi (Inflection-2.5) were prompted to generate a treatment plan for stage two hypertension. Six blinded reviewers scored each response in three domains: adherence to clinical guidelines, detail/clarity, and reliability/safety. Results: Perplexity received the highest composite score (8.17 out of 9), followed by OpenEvidence (7.92 out of 9). Dyna AI had the lowest overall score (3.75 out of 9). Perplexity (3.00 out of 3), Grok (2.83 out of 3), and OpenEvidence (2.75 out of 3) had the highest scores for detail/clarity, while Dyna AI had the lowest for both detail/clarity (1.00 out of 3) and reliability/safety (1.00 out of 3). ChatGPT had the highest score for adherence to guidelines (2.75 out of 3) while Pi had the lowest (1.58 out of 3). Kruskal–Wallis test showed p < 0.05 across sub-score domains and composite scores. Conclusions: LLMs tended to adhere to clinical guidelines and provide detailed responses but often did not provide sources or instruct users to see a healthcare professional. There was notable variability in quality, and medicine-specific LLMs were not superior to popular LLMs. Full article

37 pages, 3613 KB  
Article
Evaluating the Efficacy of Large Language Models in Stock Market Decision-Making: A Decision-Focused, Price-Only, Multi-Country Analysis Using Historical Price Data
by Maria C. Mariani, Sourav Malakar, Amrita Bagchi, Subhrajyoti Basu, Saptarsi Goswami, Osei Kofi Tweneboah, Sarbadeep Biswas, Ankit Dey and Ankit Sinha
Mach. Learn. Knowl. Extr. 2026, 8(4), 104; https://doi.org/10.3390/make8040104 - 17 Apr 2026
Abstract
This study provides a comparative evaluation of three state-of-the-art large language models (LLMs), namely OpenAI’s (San Francisco, CA, USA) GPT-4.0, Google’s (Google LLC, Mountain View, CA, USA) Gemini 2.0 Flash, and Meta’s (Meta Platforms, Menlo Park, CA, USA) LLaMA-4-Scout-17B-16E, in a decision-oriented framework in which the models generate structured outputs based only on historical closing-price data. The evaluation covers 150 stocks sampled from three countries (India, the United States, and South Africa) across ten economic sectors, including Information Technology, Banking, and Pharmaceuticals. Unlike many prior studies that combine numerical and textual inputs, this study relies solely on three years of numerical time series data and examines model responses in terms of decision labels such as buy, sell, or hold. The LLMs were provided with historical closing-price sequences and prompted with three types of finance-related questions: (a) whether to buy a stock, (b) whether to sell or hold a stock, and (c) in a pairwise comparison, which stock to buy or hold. These prompts were evaluated across two investment horizons: 1 month and 3 months. Model outputs were compared against realized market outcomes during the corresponding test periods. Performance was assessed across four key dimensions: country, sector, annualized volatility, and question type. The models were not given any supplementary financial information or instructions on specific analytical methods. The results indicate that GPT-4.0 achieves the highest average accuracy (56%), followed by LLaMA-4-Scout-17B-16E (48%) and Gemini 2.0 Flash (39%). Overall performance remains moderate and varies across market conditions, with relatively higher accuracy observed in high-volatility regimes (51%). 
This work evaluates how LLMs behave when presented with structured numerical price sequences in a controlled decision-labeling setting and contributes to the broader discussion on the potential and limitations of LLMs for numerical decision tasks in finance. Full article

23 pages, 5230 KB  
Review
Mapping the LLM Landscape: A Cross-Family Survey of Architectures, Alignment Methods, and Benchmark Performance
by Deepshikha Bhati, Fnu Neha, Devi Sri Bandaru, Matthew Weber and Ishan Dilipbhai Gajera
AI 2026, 7(4), 142; https://doi.org/10.3390/ai7040142 - 16 Apr 2026
Abstract
Large Language Models (LLMs) have become foundational to modern Artificial Intelligence (AI), enabling advanced reasoning, multimodal understanding, and scalable human-AI interaction across diverse domains. This survey provides a comprehensive review of major proprietary and open-source LLM families, including GPT, LLaMA 2, Gemini, Claude, DeepSeek, Falcon, and Qwen. It systematically examines architectural advancements such as transformer refinements, mixture-of-experts paradigms, attention optimization, long-context modeling, and multimodal integration. The paper further analyzes alignment and safety mechanisms, encompassing instruction tuning, reinforcement learning from human feedback, and constitutional frameworks, and discusses their implications for controllability, reliability, and responsible deployment. Comparative analysis of training strategies, data curation practices, efficiency optimizations, and application settings highlights key trade-offs among scalability, performance, interpretability, and ethical considerations. Beyond synthesis, the survey introduces a structured taxonomy and a feature-driven comparative study of over 50 reconstructed LLM architectures, complemented by an interactive visualization interface and an open-source implementation to support transparency and reproducibility. Finally, it outlines open challenges and future research directions related to transparency, computational cost, data governance, and societal impact, offering a unified reference for researchers and practitioners developing large-scale AI systems. Full article

33 pages, 3307 KB  
Article
Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening
by Irina Radeva, Teodora Noncheva, Lyubka Doukovska and Ivan Popchev
Electronics 2026, 15(8), 1661; https://doi.org/10.3390/electronics15081661 - 15 Apr 2026
Abstract
Title-abstract screening remains labour-intensive, especially in interdisciplinary domains where shared terminology increases misclassification risk. This study compared five LLM coordination strategies—single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage filtering—using four 4-bit quantised open-source models (Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, Qwen 2.5 7B) in zero-shot and few-shot configurations. The evaluation was conducted on a Gold Standard of 200 papers from a corpus of 2036 records on blockchain-based e-voting. The best-performing configuration—a single-agent strategy with Qwen 2.5 7B in few-shot mode—achieved recall of 100%, precision of 70.4%, F1 of 82.6%, and a 43.4% reduction in manual screening effort, outperforming all multi-agent alternatives. Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7–8B parameter models did not add discriminative value. All screening decisions were logged on a private blockchain with timestamped anchoring for reproducibility. These results suggest that, for domain-specific screening tasks, careful model selection outweighs multi-agent coordination overhead, and that few-shot prompting with a well-matched model can achieve human-level recall with substantially reduced manual effort. Full article
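The screening study above reports recall 100%, precision 70.4%, and F1 82.6% for its best configuration; the F1 figure follows directly from the other two as their harmonic mean. A quick consistency check (helper name is our own, for illustration):

```python
def f1_from_pr(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 0.704 with perfect recall, as reported for the
# single-agent Qwen 2.5 7B few-shot configuration:
print(round(f1_from_pr(0.704, 1.0), 3))  # 0.826
```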

13 pages, 677 KB  
Article
Domain-Specific vs. General-Purpose Large Language Models in Orthodontics: A Blinded Comparison of AlimGPT, GPT-4o, Gemini, and Llama
by Sertaç Aksakallı, Bilgin Giray and Çağrı Temel
Dent. J. 2026, 14(4), 219; https://doi.org/10.3390/dj14040219 - 8 Apr 2026
Abstract
Objective: The application of artificial intelligence (AI) in orthodontics has evolved rapidly in recent years, encompassing areas such as diagnosis, treatment planning, and patient management, and AlimGPT is an AI-based tool that provides treatment options based on data and algorithms. Methods: Fourteen different orthodontic questions were asked to each model, and answers were analyzed. This study aimed to compare AlimGPT with GPT-4o, Gemini, and Llama using standardized tests to evaluate the quality of information provided, including the Likert scale, modified DISCERN (mDISCERN), and modified Global Quality Score (mGQS). Results: Significant differences were detected for reliability (χ2 = 15.267, p = 0.0016) and usefulness (χ2 = 20.557, p = 0.0001). Post hoc tests showed AlimGPT > Gemini and Llama for reliability and AlimGPT > GPT-4o, Gemini, and Llama for usefulness. mDISCERN was significant overall (χ2 = 11.047, p = 0.0115), but no pairwise contrast met adjusted significance; mGQS showed no significant differences (χ2 = 7.071, p = 0.0697). Inter-rater agreement was moderate-to-good for reliability (ICC = 0.710, 95% CI 0.60–0.80) and usefulness (ICC = 0.729, 95% CI 0.63–0.82), moderate for mGQS (ICC = 0.596, 95% CI 0.47–0.71), and poor-to-moderate for mDISCERN (ICC = 0.435, 95% CI 0.30–0.58). Conclusions: In this blinded, within-subjects experiment, the domain-specific model (AlimGPT) received higher clinician ratings for usefulness and, for reliability, exceeded two general baselines. Differences in mGQS were not detected. Expanding the number of raters, increasing item diversity or integrating updated baselines would be beneficial. Full article
(This article belongs to the Special Issue Orthodontics and New Technologies: 2nd Edition)
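The χ2 values reported for reliability and usefulness are consistent with a rank-based omnibus test across the four models' ratings (the Kruskal–Wallis H statistic follows a χ2 distribution under the null). As a purely illustrative aside — the abstract does not name the exact test — here is a minimal pure-Python sketch of the Kruskal–Wallis H with tie correction; the helper names `average_ranks` and `kruskal_h` are hypothetical.

```python
from collections import Counter

def average_ranks(values):
    """Rank pooled values 1..N, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_h(groups):
    """Kruskal-Wallis H statistic (chi-square distributed under the null),
    corrected for ties in the pooled sample."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        r_sum = sum(ranks[start:start + len(g)])
        h += r_sum * r_sum / len(g)
        start += len(g)
    h = 12.0 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / float(n_total ** 3 - n_total)
    return h / correction if correction else h
```

With four groups of rater scores (one per chatbot), comparing the returned H against the χ2 critical value with 3 degrees of freedom yields an omnibus test in the style reported above, after which pairwise post hoc contrasts with adjusted significance would follow.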

22 pages, 1170 KB  
Article
Adverse Drug Reaction Detection on Social Media Based on Large Language Models
by Hao Li and Hongfei Lin
Information 2026, 17(4), 352; https://doi.org/10.3390/info17040352 - 7 Apr 2026
Viewed by 457
Abstract
Adverse drug reaction (ADR) detection is essential for ensuring drug safety and effective pharmacovigilance. The rapid growth of users’ medication reviews posted on social media has introduced a valuable new data source for ADR detection. However, the large scale and high noise inherent in social media text pose substantial challenges to existing detection methods. Although large language models (LLMs) exhibit strong robustness to noisy and interfering information, they are often limited by issues such as stochastic outputs and hallucinations. To address these challenges, this paper proposes two generative detection frameworks based on Chain of Thought (CoT), namely LLaMA-DetectionADR for Supervised Fine-Tuning (SFT) and DetectionADRGPT for low-resource in-context learning. LLaMA-DetectionADR automatically generates CoT reasoning sequences to construct an instruction tuning dataset, which is then used to fine-tune the LLaMA3-8B model via Quantized Low-Rank Adaptation (QLoRA). In contrast, DetectionADRGPT leverages clustering algorithms to select representative unlabeled samples and enhances in-context learning by incorporating CoT reasoning paths together with their corresponding labels. Experimental results on the Twitter and CADEC social media datasets show that LLaMA-DetectionADR achieves excellent performance, with F1 scores of 92.67% and 86.13%, respectively. Meanwhile, DetectionADRGPT obtains competitive F1 scores of 87.29% and 82.80% with only a few labeled examples, approaching the performance of fully supervised advanced models. The overall results demonstrate the effectiveness and practical value of the proposed CoT-based generative frameworks for ADR detection from social media. Full article
(This article belongs to the Topic Generative AI and Interdisciplinary Applications)
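The two pipelines described above share a common data shape: for SFT, an instruction-tuning record whose output places the CoT rationale before the label; for low-resource in-context learning, a few-shot prompt that interleaves reviews, reasoning paths, and labels ahead of the unlabeled query. The sketch below illustrates that shape only — the instruction wording and field names are assumptions for illustration, not the paper's actual schema.

```python
INSTRUCTION = (
    "Decide whether the following medication review reports an "
    "adverse drug reaction (ADR). Think step by step, then answer yes or no."
)

def build_sft_record(review, rationale, label):
    """One instruction-tuning record: the CoT rationale precedes the label,
    so the fine-tuned model learns to reason before deciding."""
    return {
        "instruction": INSTRUCTION,
        "input": review,
        "output": f"Reasoning: {rationale}\nAnswer: {label}",
    }

def build_fewshot_prompt(examples, query):
    """In-context-learning prompt: each selected example contributes its
    review, CoT reasoning path, and label before the unlabeled query."""
    parts = [INSTRUCTION, ""]
    for ex in examples:
        parts.append(f"Review: {ex['input']}")
        parts.append(ex["output"])
        parts.append("")
    parts.append(f"Review: {query}")
    parts.append("Reasoning:")
    return "\n".join(parts)
```

In this picture, LLaMA-DetectionADR would fine-tune on many such records (via QLoRA), while DetectionADRGPT would fill the `examples` list with cluster-representative samples and let the model complete the trailing "Reasoning:" line.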

21 pages, 2519 KB  
Article
PyAO: PyTorch-Based Memory-Efficient LLM Training on Ethernet-Interconnected Clusters
by Daemin Kim, Hyorim Kim, Juncheol Ahn and Sejin Park
Sensors 2026, 26(7), 2269; https://doi.org/10.3390/s26072269 - 7 Apr 2026
Viewed by 551
Abstract
As large language models (LLMs) pursue higher accuracy, their model sizes have surged, substantially increasing GPU memory consumption. Prior work mitigates this issue by distributing the memory burden across multiple GPUs. However, on clusters interconnected via Ethernet, the resulting computational intensity is insufficient to hide the significant network latency. Achieving a favorable compute-to-communication ratio is further constrained by the memory required to cache the massive activations generated during the forward pass. PyAO, proposed in this paper, offloads activations effectively, selects among offloading strategies based on their efficiency, and minimizes data-movement bottlenecks, thereby enabling larger micro-batch sizes. In Ethernet-interconnected cluster environments, experiments on popular models (including OPT-1.3B, GPT-0.8B, and Llama-1.2B) demonstrate that PyAO reduces peak GPU memory by up to 1.94× at the same micro-batch size, enables up to 2.5× larger batch sizes, and accelerates training by up to 3.63× relative to the baseline. Full article
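The idea of ranking offload candidates by their offloading efficiency can be pictured with a toy greedy planner: rank activations by GB of GPU memory freed per second of exposed (non-overlapped) host-transfer time, then offload until the projected activation residency fits the memory budget. The tuple schema, the overlap/latency cost model, and all numbers below are hypothetical illustrations, not PyAO's actual algorithm.

```python
def plan_offload(activations, bandwidth_gbs, budget_gb, latency_s=0.005):
    """Greedy activation-offloading planner (toy model).

    activations: list of (name, size_gb, overlap) tuples, where overlap in
    [0, 1) is the fraction of the host transfer hidden behind computation.
    Candidates are offloaded in order of efficiency -- memory freed per
    second of exposed (non-overlapped) transfer time -- until the projected
    activation residency fits the budget.
    """
    def efficiency(act):
        _, size, overlap = act
        exposed = (latency_s + size / bandwidth_gbs) * (1.0 - overlap)
        return size / exposed if exposed > 0 else float("inf")

    resident = sum(size for _, size, _ in activations)
    plan = []
    for name, size, _ in sorted(activations, key=efficiency, reverse=True):
        if resident <= budget_gb:
            break
        plan.append(name)
        resident -= size
    return plan, resident
```

Under such a model, activations whose transfers overlap well with computation are cheap to evict, which is what lets a planner of this kind trade exposed transfer time for the larger micro-batch sizes the abstract reports.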
