MDPI - Publisher of Open Access Journals

14 pages, 366 KB

Open Access Article

Between Accessibility and Reliability: High Confidence, Low Control in General-Purpose Multimodal Models for Hip Fracture Radiograph Interpretation

by Hadar Gan-Or, Shaked Ankol, Guy Ben Arie, Itay Ashkenazi and Yaniv Warschawski

J. Clin. Med. 2026, 15(13), 4919; https://doi.org/10.3390/jcm15134919 (registering DOI) - 24 Jun 2026

Abstract

Background: Dedicated artificial intelligence (AI) systems for fracture detection already exist, yet general-purpose multimodal models are increasingly accessible to clinicians despite not being developed or formally validated as medical devices. Their behavior in focused orthopedic imaging tasks remains insufficiently characterized. Purpose: To characterize how two accessible general-purpose multimodal models interpret AP pelvis radiographs with hip fractures, focusing on context dependence, overconfidence, and complementary error patterns within a surgically confirmed positive-only cohort. This was a behavioral characterization study of a fracture-positive cohort, not a diagnostic accuracy evaluation. Methods: In April 2026, we retrospectively studied 214 surgically confirmed hip fractures on AP pelvis radiographs using two general-purpose multimodal models under six prompting conditions. In runs A–D, the models were explicitly told that a hip fracture was present and were asked to classify it; in runs E–F, they were not told whether a hip fracture was present. Each image was rerun de novo in a separate chat session through vendor APIs using a fixed base prompt and no image preprocessing. We recorded hip-fracture detection, correct laterality, coarse fracture pattern, intracapsular displacement, AO/OTA grading, subtrochanteric identification, and self-reported confidence. Because the cohort contained hip fractures only, we report fracture-detection rates and classification performance within a positive-only cohort rather than full diagnostic-accuracy metrics. Results: Using the more conservative endpoint of hip-fracture detection with correct laterality, GPT-5.4 was correct in 79.0% and 86.4% of cases in runs E and F, whereas Gemini was correct in 80.4% and 93.5%, respectively. When outputs from both models were combined, this endpoint reached 89.7% in run E and 96.7% in run F, indicating complementary rather than redundant error patterns. 
Incorrect laterality cues markedly degraded performance, from 90.7% to 66.4% in GPT-5.4 and from 97.7% to 57.0% in Gemini. Performance remained limited for treatment-relevant subtyping, particularly AO/OTA grading and subtrochanteric identification. Both models frequently remained highly confident when wrong, and self-reported confidence did not reliably distinguish correct from incorrect outputs. Conclusions: Accessible general-purpose multimodal models showed partial capability for coarse hip-fracture interpretation, but they remained context-sensitive, unreliable for treatment-relevant subtyping, and highly confident even when incorrect. Their complementary error patterns are hypothesis-generating rather than evidence of clinical readiness. On the basis of these findings, we do not support unvalidated or uncontrolled clinical use of such models. As access to these tools expands, explicit usage boundaries, minimum performance expectations, repeated local revalidation, and sustained human oversight become increasingly necessary.

(This article belongs to the Special Issue Acute Trauma and Trauma Care in Orthopedics: 2nd Edition)

18 pages, 4314 KB

Open Access Article

Optimizing a Multimodal Large Language Model for Ultrasound-Based Thyroid Nodule Malignancy Classification: A Comparative Study of Few-Shot Learning, Prompt Engineering, and Fine-Tuning

by Yu-Hsuan Li, Yu-Cheng Cheng, Chih-Yun Chang and I-Te Lee

Diagnostics 2026, 16(12), 1931; https://doi.org/10.3390/diagnostics16121931 (registering DOI) - 22 Jun 2026

Viewed by 122

Abstract

Objectives: Multimodal large language models (MLLMs) have shown potential for medical image classification. We evaluated four optimization strategies in two MLLMs—GPT-4o (gpt-4o-2024-08-06) and Gemini 2.5 Flash-Lite—for ultrasound-based thyroid nodule malignancy classification using two public datasets and a clinical cohort of nodules with atypia of undetermined significance (AUS) cytology. Methods: Text prompting, few-shot learning, fine-tuning, and a hybrid strategy combining fine-tuning with few-shot learning were evaluated for each model. Performance was assessed using the Digital Database of Thyroid Images (DDTI; n = 80), a 1000-image test subset of TN5000, and an institutional AUS cohort with surgical pathology (n = 84). In the AUS cohort, the best-performing strategy was compared with the consensus classification of three endocrinologists and the American Thyroid Association (ATA) ultrasound risk stratification. Results: For GPT-4o, the hybrid strategy achieved the highest area under the receiver operating characteristic curve (AUC) in DDTI (0.866), TN5000 (0.689), and the AUS cohort (0.836). In the AUS cohort, its specificity was higher than that of endocrinologist consensus and ATA risk stratification when only high-suspicion nodules were classified as malignant (95.1% vs. 70.7% and 70.7%; p = 0.002 and p = 0.001, respectively), while sensitivity did not differ significantly (72.1% vs. 74.4% and 79.1%, respectively; both p > 0.05). However, the hybrid model misclassified 12 of 43 malignant nodules, corresponding to a false-negative rate of 27.9%. When high- and intermediate-suspicion ATA categories were classified as malignant, ATA sensitivity increased to 83.7% and specificity decreased to 56.1%; the hybrid model had a higher AUC than ATA risk stratification (0.836 vs. 0.749; p = 0.017). For Gemini 2.5 Flash-Lite, few-shot learning, fine-tuning, and the hybrid strategy did not improve AUC relative to text prompting in any dataset. 
Conclusions: The hybrid strategy produced the most consistent performance gains for GPT-4o across the three datasets but did not improve Gemini 2.5 Flash-Lite. The optimized GPT-4o model achieved high specificity in the diagnostically challenging AUS cohort, although its false-negative rate limits its use as a stand-alone diagnostic tool. Further validation in larger, prospective multicenter cohorts is required before clinical use.
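
The sensitivity and false-negative figures in this abstract follow directly from the stated counts (12 of 43 malignant nodules missed). The sketch below reproduces that arithmetic. The benign count of 41 assumes that all 84 AUS nodules entered the analysis, and the figure of 39 correctly classified benign nodules is a hypothetical back-calculation that happens to match the reported 95.1% specificity; neither number is stated in the abstract.

```python
def rate(numerator: int, denominator: int) -> float:
    """Return a proportion as a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


MALIGNANT = 43            # stated in the abstract
MISSED_BY_HYBRID = 12     # stated in the abstract
BENIGN = 84 - MALIGNANT   # assumption: all 84 AUS nodules were analysed

# False-negative rate and sensitivity are complements over the malignant cases.
false_negative_rate = rate(MISSED_BY_HYBRID, MALIGNANT)
sensitivity = rate(MALIGNANT - MISSED_BY_HYBRID, MALIGNANT)

# Hypothetical count: 39 of 41 benign nodules correctly classified is inferred
# from the reported percentage, not reported by the authors.
specificity = rate(39, BENIGN)

print(false_negative_rate, sensitivity, specificity)
```

Because sensitivity and the false-negative rate are computed over the same 43 malignant cases, they sum to 100%, which is why a specificity gain alone does not address the stand-alone-use concern raised in the conclusions.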

(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

43 pages, 4497 KB

Open Access Article

OATS-RS: Ontology-Aware Adaptive and Selective Zero-Shot Scene Classification for Remote Sensing

by János Horváth

Remote Sens. 2026, 18(12), 2038; https://doi.org/10.3390/rs18122038 - 18 Jun 2026

Viewed by 333

Abstract

Zero-shot remote sensing is attractive for scene classification because new regions, sensors, and label taxonomies often appear before sufficient annotated data are available for supervised adaptation. We present OATS-RS, an inference-centric framework that keeps a remote sensing vision–language model (VLM) backbone frozen and improves zero-shot decisions through ontology-aware prompt construction, hierarchical and contrastive scoring, adaptive multi-view aggregation, unlabeled transductive refinement, ambiguity-aware local re-ranking, and selective prediction. The method targets the common remote sensing regime in which neighboring classes such as annual crop, permanent crop, forest, pasture, herbaceous vegetation, river, and sea or lake overlap strongly in red–green–blue (RGB) appearance, meaning that they require more than a single class-name prompt. On the supplied final EuroSAT RGB evaluation with a GeoRSCLIP Contrastive Language–Image Pre-training (CLIP)-family Vision Transformer Base with 32 × 32-pixel patches (ViT-B-32) backbone, the complete pipeline obtains top-1 accuracy of 0.522, balanced accuracy of 0.522, macro-averaged F1 score (macro-F1) of 0.535, and top-3 accuracy of 0.887. The strongest classes are industrial area, residential area, river, highway, and pasture, whereas the weakest classes remain herbaceous vegetation and several fine-grained vegetation categories. Selective prediction increases accepted-example accuracy to 0.538 at 0.934 coverage, but the expected calibration error (ECE) remains high at 0.384. These results support a qualified conclusion: ontology-guided zero-shot inference can already recover useful semantic shortlists for structured remote-sensing scenes, but fine-grained natural-class disambiguation, calibrated confidence, multi-dataset transfer, component-level ablations, and measured runtime remain essential before dependable deployment claims can be made.
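
For readers unfamiliar with the two confidence metrics quoted above, the following is a minimal sketch of how ECE and selective-prediction coverage and accuracy are commonly computed. It uses the textbook equal-width binning definition of ECE and a simple confidence threshold for abstention. The function names, the bin count, and the threshold rule are assumptions for illustration and are not taken from the OATS-RS paper.

```python
from typing import List, Tuple


def expected_calibration_error(confidences: List[float],
                               correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between accuracy and mean confidence per bin."""
    total = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


def selective_accuracy(confidences: List[float],
                       correct: List[bool],
                       threshold: float) -> Tuple[float, float]:
    """Return (coverage, accuracy on accepted examples) for a threshold."""
    accepted = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(accepted) / len(confidences)
    accuracy = sum(accepted) / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Toy data, invented purely to exercise the functions.
toy_conf = [0.9, 0.9, 0.6, 0.3]
toy_ok = [True, False, True, False]
print(expected_calibration_error(toy_conf, toy_ok))
print(selective_accuracy(toy_conf, toy_ok, threshold=0.5))
```

Read this way, the reported numbers say that abstaining on about 6.6% of examples lifts accuracy only from 0.522 to 0.538, which is consistent with a high ECE: confidence is a weak signal of correctness, so thresholding on it rejects few of the errors.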

20 pages, 4288 KB

Open Access Article

A Prompt-Driven Vision-Language Framework for Deictic Interpretation in Human-Robot Handover

by Jimin Byeon, Song Min Ryu and Kyu Min Park

Actuators 2026, 15(6), 345; https://doi.org/10.3390/act15060345 - 18 Jun 2026

Viewed by 167

Abstract

Recent advancements in Vision-Language Models (VLMs) have enabled robotic systems to leverage model-based understanding and reasoning over visual and linguistic inputs, offering a promising approach for interpreting user intent in human–robot interaction (HRI). In particular, deictic expressions commonly used in object handovers, such as “take this” and “give me that”, cannot be fully interpreted through language alone and require a comprehensive understanding of the speaker’s perspective and the environment. This study proposes a prompt-driven vision-language framework for deictic interpretation in human–robot handover. The system integrates a pre-trained VLM with a hierarchical prompt that decomposes reasoning into intent classification, spatio-temporal grounding, and output self-validation, enabling accurate identification of target objects and goal locations without model fine-tuning. Experimental results demonstrate 100% command interpretation accuracy across multiple interaction scenarios, including pick-and-place tasks, robot-to-human and human-to-robot handovers, and temporal deictic commands. Notably, the system operates under a prompt–command language mismatch, accurately interpreting Korean commands while being guided by English-based prompts. Analysis across progressive system configurations further demonstrates that structured prompting plays a critical role in reasoning performance. These results highlight the effectiveness of a prompt-driven approach for deictic interpretation and spatio-temporal grounding, providing a practical training-free framework for HRI.

(This article belongs to the Special Issue Human-Centered Actuation: Algorithms, Design, and Robotic Applications)

20 pages, 6003 KB

Open Access Review

Incidental Findings in [¹⁸F]-PSMA PET/CT for Prostate Cancer: Structured Reporting Across PET and Low-Dose CT, Clinical Relevance, and Cascade-Aware Management

by Katarzyna Sklinda, Marek Kasprowicz, Michał Małek, Bartlomiej Olczak, Tadeusz Budlewski, Malgorzata Kobylecka, Jerzy Walecki and Martyna Rajca

Uro 2026, 6(2), 17; https://doi.org/10.3390/uro6020017 - 17 Jun 2026

Viewed by 137

Abstract

[¹⁸F]-PSMA PET/CT is a high-impact modality for the staging and restaging of prostate cancer, but its wide anatomic coverage and tracer biology generate frequent incidental findings on both PET and the accompanying low-dose CT (LDCT). This narrative review is restricted in scope to fluorine-18 PSMA tracers because tracer-specific biodistribution and pitfall profiles shape what is perceived as incidentaloma: how confidently lesions can be categorized, and how often borderline findings trigger downstream testing, particularly for skeletal foci with [¹⁸F]-PSMA-1007. Specifically, [¹⁸F]-PSMA-1007 shows substantially higher rates of focal unspecific bone uptake than [⁶⁸Ga]-PSMA-11—reported in multicenter studies as affecting up to 40–50% of patients—which directly inflates the pool of potential incidentalomas and creates a tracer-specific false-positive problem with no parallel in gallium-68 practice. Additionally, [¹⁸F]-DCFPyL has different urinary clearance kinetics that affect bladder and ureteral uptake patterns, altering what qualifies as physiologic versus incidental in the pelvis. These differences mean that the threshold for Category B versus C classification—and the appropriate cascade-resistant language—must be tuned to the specific tracer in use. A framework built on [⁶⁸Ga]-PSMA-11 data would systematically underestimate bone pitfall frequency in [¹⁸F]-PSMA-1007 practice and could therefore paradoxically increase rather than reduce cascades if applied uncritically across tracers. These biodistribution differences have direct and concrete consequences for reporting behaviour and downstream management. 
In [¹⁸F]-PSMA-1007 practice, a focal bone uptake without a CT correlate in a mechanically plausible location—such as an anterior rib or vertebral endplate—should trigger Category B language in the report conclusion: the finding is documented in the body with explicit safety netting (“most consistent with unspecific uptake; no routine workup unless interval growth, new pain, or aggressive CT morphology”), and no referral to bone scintigraphy or MRI is generated. Without tracer-specific awareness, the same finding would typically prompt a reflex bone scan or whole-body MRI referral, delaying definitive prostate cancer management by weeks and adding imaging costs without diagnostic gain. By contrast, in [⁶⁸Ga]-PSMA-11 practice, an equivalent focal bone uptake without a CT correlate carries a higher prior probability of true metastatic disease given the lower background rate of unspecific uptake and should more often be reported at Category B with a lower threshold for escalation or more cautious language. For [¹⁸F]-DCFPyL, the higher urinary activity in the pelvis means that ureteral segments can mimic lymph node disease; recognizing this as a physiologic variant (Category C) rather than an equivocal nodal finding (Category B) avoids unnecessary pelvic MRI referrals that would otherwise be triggered by an uncontextualized report. In practical terms, the tracer-specific calibration of the overlay therefore changes not only the category assigned but also the specific safety-netting language and the escalation trigger, which directly modifies the downstream management pathway for each affected finding type. The scanned population—predominantly older men with a high prevalence of degenerative, inflammatory, and vascular abnormalities—creates substantial background noise that can drive low-value diagnostic cascades if incidental findings are communicated without actionability context. 
We integrate society-endorsed frameworks (EANM/SNMMI procedure guideline 2.0; E-PSMA; PSMA-RADS; and PROMISE/miTNM with miPSMA score) and propose a cascade-aware overlay for incidental findings that can be appended to existing PSMA reporting standards rather than replacing them. The A/B/C actionability overlay is a structured expert-consensus framework informed by existing evidence-based guidelines for specific finding types and by tracer-specific cohort data; it has not yet been prospectively validated as a standalone tool, and its current level of evidence is therefore analogous to a structured expert recommendation rather than an evidence-based clinical guideline. We operationalize a three-tier actionability scheme across PET- and CT-dominant findings, provide cascade-resistant language for conclusions, and clarify why SUVmax-only “probability scales” for lymph nodes are not recommended in routine reports. Three practical tables summarize PET incidental findings, lymph node reporting frameworks, and LDCT incidental findings, and two structured report templates are provided (concise and extended), with the extended version explicitly labelling actionability tiers and escalation triggers. Finally, we outline concrete AI use cases for standardization and triage while emphasizing governance to avoid the amplification of false positives and paradoxical growth of cascades.

17 pages, 399 KB

Open Access Review

Application of Artificial Intelligence in Breast Ultrasound Diagnosis

by Jian Zhang, André Pfob, Eva Reisig and Lie Cai

Diagnostics 2026, 16(12), 1839; https://doi.org/10.3390/diagnostics16121839 - 14 Jun 2026

Viewed by 290

Abstract

Artificial intelligence (AI) is reshaping ultrasound diagnosis by converting operator-dependent grayscale, Doppler, elastography, contrast-enhanced, automated-volume, and video data into reproducible decision support. In breast ultrasound, the most mature evidence involves benign–malignant lesion classification, BI-RADS risk stratification, reduction in unnecessary biopsy in selected low-risk lesions, assistance for less experienced readers, automated breast volume scanning, video-based assessment, axillary staging, and prediction of biologic markers such as molecular subtype, HER2 status, Ki-67 expression, lymphovascular invasion, and nodal metastasis. AI does not replace sonographers, radiologists, pathologists, or clinical judgment; rather, it can standardize feature extraction, prompt second-reader review, quantify uncertainty, and integrate imaging with clinical context. This review summarizes current clinical applications of AI in ultrasound diagnosis, which has a strong recent multicenter evidence base. It also discusses implementation requirements, including standardized acquisition, external validation, calibration, imaging–pathology concordance, workflow integration, data security, and equity across scanners and patient populations.

(This article belongs to the Special Issue Application of Ultrasound Imaging in Clinical Diagnosis)

30 pages, 1019 KB

Open Access Review

Critical Literature Review on Clinical Presentation of Oncocytic Thyroid Carcinoma with Immunoendocrine Complications and Unpredictable Outcome: Myths, Facts, and Their Overinterpretation

by Przemyslaw Zdziarski

Biomedicines 2026, 14(6), 1335; https://doi.org/10.3390/biomedicines14061335 - 12 Jun 2026

Viewed by 396

Abstract

Objectives: Endocrine neoplasms, as a general rule, show systemic, neuro-inflammatory and metabolic consequences, known as paraneoplastic syndrome. The comorbidity of thyroid tumors with neurological and autoimmune diseases prompts a search for common neuro-immuno-endocrine mechanisms of these disorders. While most TCs are well described, there is a gap in the literature after the isolation of oncocytic/Hürthle cell carcinoma (HCC), as a unique type due to immunoendocrine and metabolic features (low TSH-receptor expression and radioiodine avidity). The aim of this study was to collect clearly defined reports of HCC (as a separate entity) and to attempt to determine common clinical symptoms and the usefulness of various diagnostic techniques (comprehensive critical review). This may be an introduction to modern treatment (patient-centered care) since the main cause of mortality is not local progression or metastases. Results: Until now, due to misnomenclature and data misinterpretation, HCC has been treated according to general standards (with overuse of TSH-ST and RIA). High thyroglobulin level, decreased total thyroxine (with normal FT3 and spontaneous decrease in TSH), hypercalcemia, as well as the “reverse flip-flop” phenomenon, as common symptoms, indicate the neuroendocrine origin of HCC. Sparse, well-documented lymph node metastases are another feature, although reported in only a few studies. Most studies omit the N stage. Whole-body ¹³¹iodine and ¹⁸F-fluorodeoxyglucose scintigraphy may be useful before FNAB. Fine-needle aspiration biopsy (FNAB), as a “gold standard” in early diagnosis of thyroid nodules, delays HCC diagnosis because of the inability to determine a benign/malignant nature. Conclusions: Final HCC outcome may be affected by various overlapping immunoendocrine factors (paraneoplastic effects). 
Because very few thyroid function tests are performed in HCC, we propose a set of basic laboratory analyses, core biopsy in HCC differentiation, and a diagnostic chain for standardization. According to the review, adaptation and treatment of HCC based on existing standards for other thyroid cancers seem to be insufficient, and the risks outweigh the benefits. The key recommendations resulting from the 5th edition of the WHO Classification of Endocrine Neoplasms are only the beginning of refuting many myths and biases.

(This article belongs to the Special Issue New Advances in the Pathology, Diagnosis and Treatment of Thyroid Tumors)

21 pages, 1073 KB

Open Access Article

A Unified AI Framework for Turkish E-Commerce Review Analysis: Sentiment Classification, LLM-Based Summarization, and Fuzzy Evaluation

by Erdal Özbay, Feyza Altunbey Özbay and Ahmet Bedri Özer

Appl. Sci. 2026, 16(12), 5849; https://doi.org/10.3390/app16125849 - 10 Jun 2026

Viewed by 186

Abstract

The rapid growth of user-generated reviews on e-commerce platforms has created a significant decision-making challenge for both consumers and sellers, particularly in morphologically rich low-resource languages such as Turkish. This study proposes a unified artificial intelligence framework for Turkish e-commerce review intelligence by integrating transformer-based sentiment classification, instruction-tuned large language model summarization, and explainable fuzzy logic-based product evaluation within a single end-to-end architecture. A balanced dataset containing 183,333 Turkish reviews was constructed from Trendyol, Amazon Turkey, and Hepsiburada using LLM-assisted annotation and stratified downsampling. Experimental evaluations demonstrated that the fine-tuned BERTurk 128k model achieved a macro F1-score of 0.9243 on the held-out test set. To overcome the limitations of multilingual news-oriented summarization models on informal review text, the framework employed the Turkish instruction-tuned Kumru-2B model together with structured prompt engineering to generate sentiment-aware abstractive summaries. In addition, a Mamdani-type fuzzy inference system was designed to combine sentiment distribution, seller reliability, star ratings, and review volume into an interpretable product-level score. The complete pipeline was integrated into a FastAPI and React-based web platform capable of processing approximately 850 reviews in under 60 s. The findings demonstrate that domain-specific Turkish language models combined with explainable reasoning mechanisms can provide accurate, scalable, and human-interpretable decision support for large-scale e-commerce environments.

(This article belongs to the Section Computing and Artificial Intelligence)

13 pages, 882 KB

Open Access Article

Automated PROMISE V2 Scoring from PSMA PET/CT Reports Using Large Language Models: A Comparative Evaluation of Prompt Design and Model Performance

by Tilman Speicher, Isa Ethem Demirkol, Arne Blickle, Moritz B. Bastian, Stephan Maus, Andrea Schaefer-Schuler, Mark Bartholomä, Caroline Burgard, Samer Ezziddin and Florian Rosar

Curr. Oncol. 2026, 33(6), 349; https://doi.org/10.3390/curroncol33060349 - 9 Jun 2026

Viewed by 269

Abstract

Large language models (LLMs) are increasingly explored for clinical use. However, the extent to which such models can reliably support physicians in reporting, staging, and the assessment of classification remains an active area of research. This study aimed to evaluate and compare multiple LLMs for automated PROMISE V2 classification for prostate cancer. A total of 126 unambiguous German-language PSMA PET/CT text reports were retrospectively analyzed, with reference standards established by expert consensus based on image interpretation and the original report text. Five LLMs (GPT-5.4, DeepSeek-V3.2, Claude Sonnet 4.6, Gemini 3 Flash and Grok 4) were assessed using two English-language prompting strategies of varying complexity. Agreement with the reference standard served as the primary endpoint. Performance varied in the short-prompt setting (36.5–79.4%) but improved consistently with the long prompt (74.6–86.5%), with Gemini 3 Flash achieving the highest agreement. Across PROMISE V2 subcategories, agreement rates were high (miT: 81.0–92.1%, miN: 92.9–96.0%, miM: 92.9–95.2%), despite inter-model differences. In conclusion, contemporary LLMs demonstrate promising performance in deriving PROMISE V2 scores from unambiguous original report texts, particularly when guided by detailed prompts.

(This article belongs to the Special Issue AI-Powered Oncologic Nuclear Medicine in Clinical Translation: Advanced Assessment of Tumor Load and Microenvironment)

32 pages, 25206 KB

Open Access Article

TransNet–SAM2: A Transformer–Foundation Model Framework for Prompt-Free Segmentation of White Blood Cells in Microscopic Blood Smear Images

by Julius Bamwenda, Mehmet Siraç Özerdem, Orhan Ayyildiz, Veysi Akpolat and İrem Akpolat

Diagnostics 2026, 16(11), 1737; https://doi.org/10.3390/diagnostics16111737 - 4 Jun 2026

Viewed by 375

Abstract

Background: Accurate segmentation of white blood cells (WBCs) in peripheral blood smear images is a fundamental step in computational hematology, enabling downstream tasks such as classification, morphological assessment, and quantitative analysis. However, reliable segmentation remains challenging due to staining variability, complex cellular morphology, overlapping structures, and limited availability of high-quality annotations. Aim and Methods: The aim of this study is to develop a robust and fully automated segmentation framework for white blood cells (WBCs) in microscopic blood smear images, providing a reliable foundation for subsequent computational analysis. We propose TransNet–SAM2, a hybrid deep learning architecture that integrates hierarchical transformer-based feature extraction with a foundation-model-based decoder for prompt-free segmentation. Specifically, a Swin Transformer backbone is employed to capture multi-scale contextual representations, which are subsequently aligned and fused through a feature adaptation module. The fused features are directly injected into the SAM2 mask decoder, replacing conventional prompt-based conditioning and enabling fully automatic segmentation. In addition, a weakly supervised self-training strategy is incorporated to utilize partially annotated data, improving model generalization while reducing annotation requirements. The proposed framework is evaluated using a clinically curated dataset from Dicle University, the publicly available Raabin-WBC dataset, and an additional external leukemic blast validation dataset (ALL-IDB) to assess robustness under both routine and atypical hematological conditions. Results: TransNet-SAM2 achieved a Dice coefficient of 0.95 ± 0.01 and IoU of 0.90 on internal testing, significantly outperforming U-Net (0.89), Mask R-CNN (0.90), and SAM2 (0.92) (p < 0.05). 
In cross-dataset evaluation (Dicle training, Raabin-WBC testing), the framework maintained strong performance (Dice: 0.91, IoU: 0.84), demonstrating robustness to domain shifts. Ablation studies confirmed each component’s contribution, with the full model improving Dice by 6% over a CNN baseline. Qualitative analysis showed accurate boundary delineation even with cell overlap and background clutter. Conclusions: These findings indicate that the proposed method provides a promising and scalable framework for WBC segmentation. While the current study focuses on segmentation, future work will investigate integration with classification and radiomics pipelines, as well as validation on more diverse clinical datasets, including bone marrow and leukemia samples.
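
The Dice and IoU scores reported above are two views of the same overlap between a predicted mask and a ground-truth mask. The sketch below computes both on flat binary masks and shows the identity Dice = 2·IoU / (1 + IoU), which holds for a single mask pair. Dataset-level averages need not satisfy it exactly, so the check against the reported pairs is only a plausibility check. The function names and the toy masks are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Compute Dice and IoU for two flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks empty: treat as perfect agreement by convention.
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


def dice_from_iou(iou: float) -> float:
    """Per-mask identity linking the two overlap measures."""
    return 2 * iou / (1 + iou)


# Toy masks, invented for illustration.
dice, iou = dice_and_iou([1, 1, 1, 0], [0, 1, 1, 1])
print(dice, iou, dice_from_iou(iou))

# Plausibility check against the reported (Dice, IoU) pairs.
print(round(dice_from_iou(0.90), 2), round(dice_from_iou(0.84), 2))
```

Under this identity an IoU of 0.90 corresponds to a Dice of about 0.95 and an IoU of 0.84 to about 0.91, so the two reported pairs are mutually consistent.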

(This article belongs to the Special Issue AI and Digital Health for Disease Diagnosis and Monitoring, 2nd Edition)

23 pages, 10423 KB

Open Access Article

Cloud-Aware Dual-Path Prompt Learning with CLIP for Few-Shot Fine-Grained Ship Classification in Mixed-Sky Remote Sensing Imagery

by Yiping Song, Liang Huang, He Yang and Shuo Li

Remote Sens. 2026, 18(11), 1815; https://doi.org/10.3390/rs18111815 - 2 Jun 2026

Viewed by 175

Abstract

Few-shot remote-sensing fine-grained ship classification (RS-FGSC) faces two coupled challenges: limited annotated samples and mixed-visibility imaging conditions caused by cloud occlusion. Although CLIP-based prompt learning provides useful transfer priors, conventional single-branch adaptation can encounter an over-correction dilemma: robust compensation applied globally may degrade clear samples, whereas clear-image optimization may fail on occluded samples. We propose CADP (Cloud-Aware Dual-Path Prompt Learning), which decouples clear and occluded processing through learnable routing. CADP contains three components: (1) a cloud detector (CloudDetector) trained with auxiliary cloud-state labels for instance-level routing, (2) a fine-grained adapter (FineGrainedAdapter) that preserves discriminative details for clear samples, and (3) a robust compensation path using occlusion-aware prompting (AOPD) from CARP for occluded samples. To evaluate mixed-visibility scenarios, we construct Mixed-Sky benchmarks by combining clear ship images with SeaCloud-Ship cloud-occluded samples introduced by CARP, using controlled cloud-mixed ratios (25%, 50%, and 75%) and a non-overlapping sampling strategy. Experiments from 1-shot to 16-shot show consistent gains over CoCoOp, CLIP-Adapter, and prior robust prompting methods. CADP achieves 35.49% average accuracy, improving the best-performing baseline in our protocol by +4.81 points (+15.7% relative improvement). Component ablations, routing controls, and attention visualizations indicate that explicit routing reduces negative transfer between clear and occluded samples.
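The absolute and relative gains quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the two reported figures (35.49% and +4.81 points). The helper name `relative_gain` is hypothetical, and the baseline accuracy it derives is implied by those figures rather than stated in the abstract.

```python
def relative_gain(new_accuracy, gain_points):
    """Return (implied baseline, relative improvement in percent)."""
    baseline = new_accuracy - gain_points
    return baseline, 100.0 * gain_points / baseline


# Figures taken from the abstract: 35.49% average accuracy, +4.81 points.
baseline, rel = relative_gain(35.49, 4.81)
print(f"implied baseline: {baseline:.2f}%  relative gain: {rel:.1f}%")
```

The implied baseline of about 30.68% gives a relative gain that rounds to the reported +15.7%.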

(This article belongs to the Special Issue Few-Sample Intelligence for Hyperspectral Remote Sensing Image Classification)


24 pages, 327 KB

Open Access Article

AI-Driven Dental Procedure Coding: A Multi-Model Framework for CDT Extraction from Clinical Text

by Pranav Annareddy, Ali Noori, Deepthi Kollipara and Prashanti Manda

Dent. J. 2026, 14(6), 339; https://doi.org/10.3390/dj14060339 - 2 Jun 2026

Viewed by 255

Abstract

Background and Objectives: Dental procedure coding is essential for accurate billing, reimbursement, and clinical documentation, yet it remains largely manual, time-consuming, and error-prone. While natural language processing (NLP) has enabled significant advances in automated medical coding, limited work has focused on the dental domain, particularly the assignment of Code on Dental Procedures and Nomenclature (CDT) codes from free-text clinical notes. This study aims to develop and evaluate an artificial intelligence framework that integrates large language models (LLMs) and traditional deep learning methods to automate CDT code extraction from narrative dental documentation. Methods: We evaluated three LLM-based strategies—zero-shot prompting, QLoRA fine-tuning, and parameter-efficient fine-tuning (PEFT) using LoRA—alongside a supervised Bidirectional GRU (Bi-GRU) classifier. Experiments were conducted using a synthetic dataset designed to emulate real-world dental encounters. Structured JSON output schemas, few-shot prompting, and scalable batch inference pipelines were employed to ensure consistent and interpretable predictions. Model performance was assessed using micro- and macro-averaged F1 scores, precision, recall, exact-match accuracy, and Hamming loss. Results: The zero-shot LLM achieved the highest micro-F1 score (0.9614) and perfect recall for frequent CDT codes, demonstrating strong baseline reasoning without task-specific training; however, performance declined for rare procedures and diagnostic code hallucinations were common. Fine-tuning improved domain alignment, with the non-quantized PEFT LoRA model outperforming QLoRA across all metrics, though both fine-tuned LLMs showed tendencies to over-generate plausible but incorrect codes. The Bi-GRU model achieved balanced performance (micro-F1 = 0.9362, macro-F1 = 0.9377) with minimal hallucinations but occasionally missed context-dependent procedures. 
Conclusions: These findings highlight complementary strengths between LLM-based and supervised approaches. LLMs provide strong contextual understanding and rapid deployment, while traditional models offer stable and precise multi-label classification. This work supports the development of hybrid, schema-constrained systems for scalable dental procedure coding.
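The abstract reports micro- and macro-averaged F1, exact-match accuracy, and Hamming loss for multi-label code assignment. The sketch below shows one common way these are computed and why a model can score well on micro-F1 while failing on rare codes. It is not the authors' pipeline; the function name `multilabel_metrics` and the placeholder labels `CODE_A` to `CODE_C` are illustrative assumptions, not real CDT codes.

```python
def multilabel_metrics(y_true, y_pred, labels):
    """y_true and y_pred are lists of label sets, one set per clinical note."""
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        for c in labels:
            if c in pred and c in truth:
                tp[c] += 1
            elif c in pred:
                fp[c] += 1
            elif c in truth:
                fn[c] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro: pool counts over all labels, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-label F1, so every label counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Hamming loss: fraction of (note, label) decisions that are wrong.
    hamming = (sum(fp.values()) + sum(fn.values())) / (len(y_true) * len(labels))
    # Exact match: fraction of notes whose full label set is correct.
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return micro, macro, hamming, exact


labels = ["CODE_A", "CODE_B", "CODE_C"]  # placeholder labels
y_true = [{"CODE_A", "CODE_B"}, {"CODE_A"}, {"CODE_C"}]
y_pred = [{"CODE_A", "CODE_B"}, {"CODE_A", "CODE_C"}, set()]
micro, macro, hamming, exact = multilabel_metrics(y_true, y_pred, labels)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f} "
      f"Hamming={hamming:.3f} exact={exact:.3f}")
```

In this toy case the rare label `CODE_C` is hallucinated once and missed once, which pulls macro-F1 well below micro-F1.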


32 pages, 6483 KB

Open Access Article

Continual Learning for Histopathology Image Classification in Class-Incremental Learning

by Yuanyuan Wu, Yu Zhao and Anca Ralescu

Diagnostics 2026, 16(11), 1711; https://doi.org/10.3390/diagnostics16111711 - 2 Jun 2026

Viewed by 315

Abstract

Background: Continual learning (CL) is increasingly important for developing adaptive clinical AI models; however, its application to histopathology remains challenging due to privacy constraints, expanding diagnostic categories, and staining variability. We investigate CL for histopathology image classification under a class-incremental learning (CIL) scenario, where new diagnostic categories are introduced sequentially. Methods: We benchmark representative regularization-, replay-, architecture-, and prompt-based CL methods on the NCT-CRC-HE-100K dataset, with additional validation on CRC-HE-7K. We compare four normalization strategies and analyze the effects of replay buffer size and training epochs. In addition to average accuracy and forgetting, we conduct clinical relevance and error analysis using confusion matrices, ROC curves, and misclassification cases, then assess training dynamics and computational efficiency. Results: Dataset-level normalization consistently achieves the best performance among the evaluated normalization strategies. Among replay-based methods, DER++ achieves strong performance when previous-task images can be stored and replayed, reaching an average accuracy of 94.77 ± 1.82 and forgetting of 3.66 ± 1.73 with a buffer size of 500 and 50 training epochs. However, it requires higher memory usage, longer training time, and storage of previous samples. Among prompt-based methods, DualPrompt performs best with 5 epochs, reaching an average accuracy of 88.97 ± 0.60 and forgetting of 7.70 ± 1.21 while showing smoother training behavior and lower computational cost. Conclusions: Replay-based methods achieve higher accuracy and lower forgetting when exemplar storage and sufficient computational resources are available, but introduce higher computational and privacy costs. Prompt-based methods provide a competitive exemplar-free alternative under privacy- and resource-constrained settings. Dataset-level normalization is also important for stable CL performance in histopathology CIL.
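The abstract reports average accuracy and forgetting without defining them. The sketch below uses the definitions that are common in the continual-learning literature, which may differ in detail from the ones used in the paper. The function names and the toy accuracy table are assumptions for illustration only.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task.

    acc[i][j] is the accuracy on task j measured after training on task i,
    so row i has i + 1 entries (tasks 0..i).
    """
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc):
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    final = acc[-1]
    n_tasks = len(final)
    drops = []
    for j in range(n_tasks - 1):  # the last task cannot have been forgotten yet
        best_earlier = max(acc[i][j] for i in range(j, n_tasks - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops) if drops else 0.0


# Toy accuracy table for three sequential tasks (illustrative values only).
acc = [
    [0.98],
    [0.95, 0.97],
    [0.90, 0.93, 0.96],
]
avg_acc = average_accuracy(acc)
avg_forgetting = average_forgetting(acc)
print(f"average accuracy={avg_acc:.2f} forgetting={avg_forgetting:.2f}")
```

Under these definitions a method can trade a small loss in final accuracy for much lower forgetting, which is the kind of trade-off the abstract describes between replay-based and prompt-based methods.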

(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)


18 pages, 2578 KB

Open Access Article

AI in Dermato-Oncology: Diagnostic Performance and Prompt-Injection Vulnerability of Vision–Language Models in Dermoscopic Skin Cancer Assessment

by Ibrahim Güler, Armin Kraus, Gerrit Grieb, Tevfik Satir, Pascal Eberz and Henrik Stelling

Cancers 2026, 18(11), 1750; https://doi.org/10.3390/cancers18111750 - 27 May 2026

Viewed by 390

Abstract

Background/Objectives: Accurate differentiation of benign melanocytic nevi from invasive melanoma in dermato-oncology directly informs biopsy decisions and oncological management. Vision–language models (VLMs) are increasingly explored for image-based skin cancer assessment, but their diagnostic reliability and robustness to adversarial input manipulation remain insufficiently characterized. We evaluated three contemporary VLMs for diagnostic performance and susceptibility to single-word adversarial input manipulation (prompt injection) on dermoscopic images of histopathologically confirmed lesions. Methods: Fifty-two dermoscopic images (26 benign melanocytic nevi, 26 invasive melanomas) were analyzed using Claude Opus 4.7, Gemini 3.1 Pro, and GPT-5.4 under four conditions: an unmodified baseline and three adversarial conditions with a single opposite-of-ground-truth label embedded as a visual overlay, in image metadata, or both. Three independent rounds per image × model × condition yielded 1872 classifications across 52 lesions (independent diagnostic units) and 16,848 structured-output observations in total. Results: Baseline diagnostic accuracy ranged from 58.3% to 62.2%, with asymmetric sensitivity and specificity, including a pronounced benign-labeling bias in one model that missed 22 of 26 invasive melanomas. All adversarial conditions reduced accuracy to near-zero levels (0.0–1.9%; all p < 10⁻⁷ after Bonferroni correction). Repeated queries produced identical incorrect outputs in 98–100% of cases (Fleiss’ κ 0.97–1.00). Non-diagnostic outputs remained largely unchanged, and self-reported confidence did not decrease. Conclusions: Contemporary VLMs show limited baseline performance and marked vulnerability to minimal adversarial input in dermoscopic skin cancer assessment. 
The failure selectively alters the malignancy decision while preserving surrounding outputs and confidence. Within the conditions evaluated here, these systems therefore do not currently appear suitable for unsupervised clinical use in dermato-oncology without input-integrity safeguards and qualified human oversight.

(This article belongs to the Special Issue AI-Driven Oncology: Advancing Cancer Detection, Diagnosis, and Personalized Treatment)


19 pages, 792 KB

Open Access Article

EvoShield: Selective Test-Time Adaptation for Prompt Injection Detection via Active LLM Querying

by Zanhong Zheng, Jieming Liang, Mengqin Hu, Yijuan Pei, Guobao Xu and Zhenlu Wu

Mathematics 2026, 14(10), 1719; https://doi.org/10.3390/math14101719 - 16 May 2026

Viewed by 279

Abstract

Prompt injection detection is commonly studied as a static offline classification problem, yet deployed LLM systems face evolving attacks and distribution shift after deployment. Static detectors are therefore poorly matched to the threat model, while routing every input to a stronger external LLM is costly and defeats the purpose of a local detector. We formulate prompt injection detection as a selective test-time adaptation problem. Our framework combines a prompt-based local detector built on masked language modeling and a learnable soft verbalizer with an entropy-based active querying mechanism that escalates only high-uncertainty inputs to an external LLM. Queried hard samples are then stored in a review window and replayed for subsequent detector updates. Empirical evaluations across multiple benchmarks show that EvoShield matches or exceeds pure LLM baselines while cutting API query costs by more than 85%.
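The entropy-based escalation described in this abstract can be illustrated with a short sketch. This is a simplified stand-in rather than the EvoShield implementation. The threshold value, the window size, the function names `binary_entropy` and `route`, and the example probabilities are all assumptions, and the local detector is represented only by the probability it would output.

```python
import math
from collections import deque


def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def route(p_injection, threshold=0.8):
    """Escalate to the external LLM only when the local detector is uncertain."""
    return "escalate" if binary_entropy(p_injection) > threshold else "local"


# Bounded review window: the oldest hard samples are dropped first (size assumed).
review_window = deque(maxlen=3)

# (input id, local detector's probability of injection), illustrative values.
stream = [("a", 0.02), ("b", 0.55), ("c", 0.97), ("d", 0.40)]
decisions = {}
for sample_id, p in stream:
    decisions[sample_id] = route(p)
    if decisions[sample_id] == "escalate":
        # In the described framework, the external LLM's verdict would be
        # stored here and replayed later to update the local detector.
        review_window.append(sample_id)

print(decisions)
print(list(review_window))
```

Confident predictions near 0 or 1 stay local, so only inputs near the decision boundary incur an external query. That is the mechanism by which a selective scheme of this kind can reduce API costs.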

(This article belongs to the Special Issue Big Data Mining and Knowledge Graph with Application)


Search Results (321)
