Search Results (56)

Search Parameters:
Keywords = zero-shot object detection

23 pages, 15691 KB  
Article
ProM-Pose: Language-Guided Zero-Shot 9-DoF Object Pose Estimation from RGB-D with Generative 3D Priors
by Yuchen Li, Kai Qin, Haitao Wu and Xiangjun Qu
Electronics 2026, 15(5), 1111; https://doi.org/10.3390/electronics15051111 - 7 Mar 2026
Abstract
Object pose estimation is fundamental for robotic manipulation, autonomous driving, and augmented reality, yet recovering the full 9-DoF state (rotation, translation, and anisotropic 3D scale) from RGB-D observations remains challenging for previously unseen objects. Existing methods either rely on instance-specific CAD models or predefined category boundaries, or suffer from scale ambiguity under sparse observations. We propose ProM-Pose, a unified cross-modal temporal perception framework for zero-shot 9-DoF object pose estimation. By integrating language-conditioned generative 3D shape priors as canonical geometric references, an asymmetric cross-modal attention mechanism for spatially aware fusion, and a decoupled pose decoding strategy with temporal refinement, ProM-Pose constructs metrically consistent and semantically grounded representations without relying on category-specific pose priors or instance-level CAD supervision. Extensive experiments on the CAMERA25 and REAL275 benchmarks demonstrate that ProM-Pose achieves competitive or superior performance compared to category-level methods, with mAP of 75.0% at 5°,2cm and 90.5% at 10°,5cm on CAMERA25, and 42.2% at 5°,2cm and 76.0% at 10°,5cm on REAL275 under zero-shot cross-domain evaluation. Qualitative results on real-world logistics scenarios further validate temporal stability and robustness under occlusion and lighting variations. ProM-Pose effectively bridges semantic grounding and metric geometric reasoning within a unified formulation, enabling stable and scale-aware 9-DoF pose estimation for previously unseen objects under open-vocabulary conditions. Full article
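The 5°,2cm criterion above counts a pose as correct only when both the rotation and translation errors clear their thresholds. A minimal sketch of that check (illustrative only, not the authors' code; `pose_correct` and its thresholds are hypothetical names):

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    # Geodesic distance on SO(3): angle of the relative rotation R_pred R_gt^T.
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_correct(R_pred, t_pred, R_gt, t_gt, deg_thresh=5.0, cm_thresh=2.0):
    # A prediction counts as correct only if BOTH errors are under threshold.
    rot_err = rotation_error_deg(R_pred, R_gt)
    trans_err_cm = 100.0 * np.linalg.norm(t_pred - t_gt)  # translations in meters
    return bool(rot_err <= deg_thresh and trans_err_cm <= cm_thresh)

# Identity vs. a 3-degree rotation about z, with a 1 cm translation offset.
theta = np.radians(3.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
ok = pose_correct(Rz, np.array([0.0, 0.01, 0.0]), np.eye(3), np.zeros(3))
```

Averaging this indicator over detections and thresholds is what produces the mAP figures quoted in the abstract.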

29 pages, 2340 KB  
Article
Target-Aware Bilingual Stance Detection in Social Media Using Transformer Architecture
by Abdul Rahaman Wahab Sait and Yazeed Alkhurayyif
Electronics 2026, 15(4), 830; https://doi.org/10.3390/electronics15040830 - 14 Feb 2026
Abstract
Stance detection has emerged as an essential tool in natural language processing for understanding how individuals express agreement, disagreement, or neutrality toward specific targets in social and online discourse. It plays a crucial role in bilingual and multilingual environments, including English-Arabic social media ecosystems, where differences in language structure, discourse style, and data availability pose significant challenges for reliable stance modelling. Existing approaches often struggle with target awareness, cross-lingual generalization, robustness to noisy user-generated text, and the interpretability of model decisions. This study aims to build a reliable, explainable target-aware bilingual stance-detection framework that generalizes across heterogeneous stance formats and languages without retraining on a dataset specific to the target language. Thus, a unified dual-encoder architecture based on mDeBERTa-v3 is proposed. Cross-language contrastive learning offers an auxiliary training objective to align English and Arabic stance representations in a common semantic space. Robustness-oriented regularization is used to mitigate the effects of informal language, vocabulary variation, and adversarial noise. To promote transparency and trustworthiness, the framework incorporates token-level rationale extraction, enables fine-grained interpretability, and supports analysis of hallucination. The proposed model is tested on a combined bilingual test set and two structurally distinct zero-shot benchmarks: MT-CSD and AraStance. Experimental results show consistent performance, with accuracies of 85.0% and 86.8% and F1-scores of 84.7% and 86.8% on the zero-shot benchmarks, confirming stable performance and realistic generalization. Ultimately, these findings reveal that effective bilingual stance detection can be achieved via explicit target conditioning, cross-lingual alignment, and explainability-driven design. Full article
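Cross-language contrastive learning of the kind described pulls paired English/Arabic embeddings together while pushing mismatched pairs in the batch apart. A toy InfoNCE-style sketch (not the paper's implementation; the embeddings here are random stand-ins):

```python
import numpy as np

def infonce_loss(en, ar, temperature=0.07):
    # Contrastive alignment objective: row i of `en` is the English embedding
    # paired with row i of `ar` (its Arabic counterpart); matched pairs sit on
    # the diagonal of the similarity matrix and are pulled together, all other
    # pairings in the batch are pushed apart.
    en = en / np.linalg.norm(en, axis=1, keepdims=True)
    ar = ar / np.linalg.norm(ar, axis=1, keepdims=True)
    logits = en @ ar.T / temperature                     # pairwise cosine sims
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
aligned = infonce_loss(x, x)        # identical paired embeddings: low loss
shuffled = infonce_loss(x, x[::-1]) # mismatched pairs: much higher loss
```

Minimizing this auxiliary loss during training is what places both languages in a common semantic space.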

22 pages, 2153 KB  
Article
Benchmark of Genomic Language Models on Human and Rice Genomic Tasks
by Xiaosheng Gao, Shunyao Wu and Weihua Pan
Appl. Sci. 2026, 16(4), 1745; https://doi.org/10.3390/app16041745 - 10 Feb 2026
Abstract
Genomic Language Models (GLMs), leveraging their vast parameter scales and the similarities between DNA sequences and natural languages, demonstrate immense potential in processing large-scale genomic data and elucidating gene regulation and evolutionary relationships. However, the cross-species generalization capability of large genomic models has not yet been systematically evaluated. This study addresses this critical gap by benchmarking five GLMs (DNABERT-2, GROVER, HyenaDNA, NT-V2, and AgroNT) and a CNN baseline model using human (Homo sapiens) and rice (Oryza sativa) genomes across four downstream tasks: promoter detection, transcription start site (TSS) scanning, species classification, and gene region identification, through both zero-shot testing and fine-tuning. During testing, factors such as hyperparameters, early stopping protocols, and computational resources were fixed to ensure fairness, enabling us to systematically evaluate their performance and cross-species generalization capabilities. The results were further analyzed from multiple mathematical and representational perspectives to provide a more rigorous and objective assessment of each model’s performance. The results show that AgroNT consistently leads on rice tasks, while NT-V2 and DNABERT-2 achieved the best overall performance in fine-tuning and zero-shot experiments, respectively. Although their pretraining data did not include plants, they demonstrate excellent performance on rice-related tasks thanks to cross-species pretraining that enhances their generalization ability across human–rice domains. This benchmark study offers guidance on selecting appropriate genomic language models based on task characteristics and provides insights for future development in this field. Full article
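Several GLMs represent DNA as overlapping k-mer tokens before modeling (tokenizers differ across the benchmarked models; DNABERT-2, for instance, moved to BPE). A minimal illustration of the overlapping-k-mer scheme only:

```python
def kmer_tokenize(seq, k=6):
    # Slide a window of width k one base at a time, yielding overlapping
    # tokens; each adjacent token shares k-1 bases with the previous one.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

tokens = kmer_tokenize("ATGCGTA", k=3)
```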

20 pages, 8793 KB  
Article
Small Object Detection with Efficient Multi-Scale Collaborative Attention and Depth Feature Fusion Based on Detection Transformer
by Boran Song, Xizhen Zhu, Guiyuan Yuan, Haixin Wang and Cong Liu
Appl. Sci. 2026, 16(4), 1673; https://doi.org/10.3390/app16041673 - 7 Feb 2026
Abstract
Existing DEtection TRansformer-based (DETR) object detection methods have been widely applied to standard object detection tasks, but still face numerous challenges in detecting small objects. These methods frequently miss the fine details of small objects and fail to preserve global context, particularly under scale variation or occlusion. The resulting feature maps lack sufficient spatial and structural information. Moreover, some DETR-based models specifically designed for small object detection often have poor generalization capabilities and are difficult to adapt to datasets with diverse object scales and complex backgrounds. To address these issues, this paper proposes a novel object detection model—small object detection with efficient multi-scale collaborative attention and depth feature fusion based on DETR (ED-DETR)—which consists of three core modules: an efficient multi-scale collaborative attention mechanism (EMCA), DepthPro, a zero-shot metric monocular depth estimation model, and an adaptive feature fusion module for depth maps and feature maps. Specifically, EMCA extends the single-space attention mechanism in efficient multi-scale attention (EMA) to a composite structure of parallel spatial and channel attention, enhancing ED-DETR’s ability to express features collaboratively in both spatial and channel dimensions. DepthPro generates depth maps to extract depth information. The adaptive feature fusion module integrates depth information with RGB visual features, improving ED-DETR’s ability to perceive object position, scale, and occlusion. The experimental results show that ED-DETR achieves the current best 33.6% mAP on the AI-TOD-V2 dataset, which predominantly contains tiny objects, outperforming previous CNN-based and DETR-based methods, and shows excellent generalization performance on the VisDrone and COCO datasets. Full article
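The adaptive fusion of depth maps with RGB features can be thought of as a learned gate blending the two streams. A deliberately simplified scalar-gate sketch (the paper's module is more elaborate; `gate_logit` stands in for a trained parameter):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(rgb_feat, depth_feat, gate_logit):
    # A learned gate in (0, 1) decides how much depth evidence to blend into
    # the RGB feature map; in training, gate_logit would be optimized.
    alpha = sigmoid(gate_logit)
    return alpha * rgb_feat + (1.0 - alpha) * depth_feat

rgb = np.ones((2, 2))
depth = np.zeros((2, 2))
fused = gated_fusion(rgb, depth, gate_logit=0.0)  # alpha = 0.5: even blend
```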
(This article belongs to the Section Computing and Artificial Intelligence)

31 pages, 8257 KB  
Article
Analytical Assessment of Pre-Trained Prompt-Based Multimodal Deep Learning Models for UAV-Based Object Detection Supporting Environmental Crimes Monitoring
by Andrea Demartis, Fabio Giulio Tonolo, Francesco Barchi, Samuel Zanella and Andrea Acquaviva
Geomatics 2026, 6(1), 14; https://doi.org/10.3390/geomatics6010014 - 3 Feb 2026
Abstract
Illegal dumping poses serious risks to ecosystems and human health, requiring effective and timely monitoring strategies. Advances in uncrewed aerial vehicles (UAVs), photogrammetry, and deep learning (DL) have created new opportunities for detecting and characterizing waste objects over large areas. Within the framework of the EMERITUS Project, an EU Horizon Europe initiative supporting the fight against environmental crimes, this study evaluates the performance of pre-trained prompt-based multimodal (PBM) DL models integrated into ArcGIS Pro for object detection and segmentation. To test such models, UAV surveys were specially conducted at a semi-controlled test site in northern Italy, producing very high-resolution orthoimages and video frames populated with simulated waste objects such as tyres, barrels, and sand piles. Three PBM models (CLIPSeg, GroundingDINO, and TextSAM) were tested under varying hyperparameters and input conditions, including orthophotos at multiple resolutions and frames extracted from UAV-acquired videos. Results show that model performance is highly dependent on object type and imagery resolution. In contrast, within the limited ranges tested, hyperparameter tuning rarely produced significant improvements. The models were evaluated at a low IoU threshold to generalize across different types of detection models and to focus on the ability to detect objects. When evaluating the models with orthoimagery, CLIPSeg achieved the highest accuracy with F1 scores up to 0.88 for tyres, whereas barrels and ambiguous classes consistently underperformed. Video-derived (oblique) frames generally outperformed orthophotos, reflecting a closer match to model training perspectives. Despite the current performance limitations highlighted by the tests, PBM models demonstrate strong potential for democratizing GeoAI (Geospatial Artificial Intelligence).
These tools effectively enable non-expert users to employ zero-shot classification in UAV-based monitoring workflows targeting environmental crime. Full article
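Evaluating at a low IoU threshold means a detection counts as a hit even with loose localization. A minimal box-IoU computation showing why the threshold choice matters (illustrative only):

```python
def box_iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = intersection area / union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two boxes overlapping by half in x: IoU = 1/3, a "hit" at a low threshold
# such as 0.1 but a miss at the conventional 0.5.
iou = box_iou((0, 0, 2, 2), (1, 0, 3, 2))
```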

21 pages, 1604 KB  
Communication
Assessing the Diagnostic Accuracy of BiomedCLIP for Detecting Contrast Use and Esophageal Strictures in Pediatric Radiography
by Artur Fabijan, Michał Kolejwa, Agnieszka Zawadzka-Fabijan, Robert Fabijan, Róża Kosińska, Emilia Nowosławska, Anna Socha-Banasiak, Natalia Lwow, Marcin Tkaczyk, Krzysztof Zakrzewski, Elżbieta Czkwianianc and Bartosz Polis
J. Clin. Med. 2026, 15(3), 1150; https://doi.org/10.3390/jcm15031150 - 2 Feb 2026
Abstract
Background/Objectives: Vision–language models such as BiomedCLIP are increasingly investigated for their diagnostic potential in medical imaging. Although these foundation models show promise in general radiographic interpretation, their application in pediatric domains—particularly for subtle, postoperative findings like esophageal strictures—remains underexplored. This study aimed to evaluate the diagnostic performance of BiomedCLIP in classifying pediatric esophageal radiographs into three clinically relevant categories: presence of contrast agent, full esophageal visibility, and presence of esophageal stricture. Methods: We retrospectively analyzed 143 pediatric esophageal X-rays collected between 2021 and 2025. Each image was annotated by two pediatric radiology experts and categorized according to esophageal visibility, contrast presence, and stricture occurrence. BiomedCLIP was used in a zero-shot classification setup without fine-tuning. Model predictions were converted into binary outcomes and assessed against the ground truth using a comprehensive suite of 27 performance metrics, including accuracy, sensitivity, specificity, F1-score, AUC, and calibration analyses. Results: BiomedCLIP achieved high precision (88.7%) and a favorable AUC (85.4%) in detecting contrast agent presence, though specificity remained low (20%), leading to a high false-positive rate. The model correctly identified all cases of non-visible esophagus, but was untestable in predicting full visibility due to the absence of positive cases. Critically, its performance in detecting esophageal strictures was poor, with accuracy at 24%, sensitivity at 44%, specificity at 18%, and AUC of 0.26. Statistical overlap between contrast and stricture predictions indicated a lack of semantic differentiation within the model’s latent space. Conclusions: BiomedCLIP shows potential in detecting high-salience features such as contrast but fails to reliably identify esophageal strictures. 
Limitations include class imbalance, absence of fine-tuning, and architectural constraints in recognizing subtle morphologic abnormalities. These findings emphasize the need for domain-specific adaptation of foundation models before clinical implementation in pediatric radiology. Full article
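Zero-shot classification with a CLIP-style model such as BiomedCLIP reduces to comparing an image embedding against one text-prompt embedding per class. A toy sketch with made-up two-dimensional embeddings (real embeddings are high-dimensional outputs of the model's image and text encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    # Normalize, score the image against each class prompt by cosine
    # similarity, and return the best-matching label.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb
    return labels[int(np.argmax(sims))]

labels = ["radiograph with contrast agent", "radiograph without contrast agent"]
prompts = np.array([[1.0, 0.0], [0.0, 1.0]])   # invented prompt embeddings
pred = zero_shot_classify(np.array([0.9, 0.1]), prompts, labels)
```

No fine-tuning is involved, which is exactly the setup the study evaluates.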

15 pages, 1220 KB  
Article
Diagnostic Accuracy and Stability of Multimodal Large Language Models for Hand Fracture Detection: A Multi-Run Evaluation on Plain Radiographs
by Ibrahim Güler, Gerrit Grieb, Armin Kraus, Martin Lautenbach and Henrik Stelling
Diagnostics 2026, 16(3), 424; https://doi.org/10.3390/diagnostics16030424 - 1 Feb 2026
Abstract
Background/Objectives: Multimodal large language models (MLLMs) offer potential for automated fracture detection, yet their diagnostic stability under repeated inference remains underexplored. This study evaluates the diagnostic accuracy, stability, and intra-model consistency of four MLLMs in detecting hand fractures on plain radiographs. Methods: In total, images of hand radiographs of 65 adult patients with confirmed hand fractures (30 phalangeal, 30 metacarpal, 5 scaphoid) were evaluated by four models: GPT-5 Pro, Gemini 2.5 Pro, Claude Sonnet 4.5, and Mistral Medium 3.1. Each image was independently analyzed five times per model using identical zero-shot prompts (1300 total inferences). Diagnostic accuracy, inter-run reliability (Fleiss’ κ), case-level agreement profiles, subgroup performance, and exploratory demographic inference (age, sex) were assessed. Results: GPT-5 Pro achieved the highest accuracy (64.3%) and consistency (κ = 0.71), followed by Gemini 2.5 Pro (56.9%, κ = 0.57). Mistral Medium 3.1 exhibited high agreement (κ = 0.88) despite low accuracy (38.5%), indicating systematic error (“confident hallucination”). Claude Sonnet 4.5 showed low accuracy (33.8%) and consistency (κ = 0.33), reflecting instability. While phalangeal fractures were reliably detected by top models, scaphoid fractures remained challenging. Demographic analysis revealed poor capabilities, with age estimation errors exceeding 12 years and sex prediction accuracy near random chance. Conclusions: Diagnostic accuracy and consistency are distinct performance dimensions; high intra-model agreement does not imply correctness. While GPT-5 Pro demonstrated the most favorable balance of accuracy and stability, other models exhibited critical failure modes ranging from systematic bias to random instability. At present, MLLMs should be regarded as experimental diagnostic reasoning systems rather than reliable standalone tools for clinical fracture detection. Full article
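Fleiss' κ, used here for inter-run reliability, compares observed per-case agreement against agreement expected by chance. A compact reference implementation of the standard formula (not the study's code):

```python
def fleiss_kappa(counts):
    # counts[i][j] = number of runs assigning category j to case i;
    # every row must sum to the same number of ratings n.
    N = len(counts)
    n = sum(counts[0])
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]   # category prevalence
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]  # per-case agreement
    P_bar = sum(P_i) / N
    P_e = sum(p * p for p in p_j)                                       # chance agreement
    return (P_bar - P_e) / (1.0 - P_e)

# Five runs per case, two categories (fracture / no fracture): perfect
# agreement on every case gives kappa = 1 regardless of correctness.
kappa = fleiss_kappa([[5, 0], [0, 5], [5, 0]])
```

Note that κ measures consistency only, which is why the abstract can report high κ alongside low accuracy ("confident hallucination").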
(This article belongs to the Section Medical Imaging and Theranostics)

35 pages, 4355 KB  
Article
The Comparison of Human and Machine Performance in Object Recognition
by Gokcek Kul and Andy J. Wills
Behav. Sci. 2026, 16(1), 109; https://doi.org/10.3390/bs16010109 - 13 Jan 2026
Abstract
Deep learning models have advanced rapidly, leading to claims that they now match or exceed human performance. However, such claims are often based on closed-set conditions with fixed labels and extensive supervised training, and do not consider differences between the two systems. Recent findings also indicate that some models align more closely with human categorisation behaviour, whereas other studies argue that even highly accurate models diverge from human behaviour. Following principles from comparative psychology and imposing similar constraints on both systems, this study investigates whether these models can achieve human-level accuracy and human-like categorisation through three experiments using subsets of the ObjectNet dataset. Experiment 1 examined performance under varying presentation times and task complexities, showing that while recent models can match or exceed humans under conditions optimised for machines, they struggle to generalise to certain real-world categories without fine-tuning or task-specific zero-shot classification. Experiment 2 tested whether human performance remains stable when shifting from N-way categorisation to a free-naming task, while machine performance declines without fine-tuning; the results supported this prediction. Additional analyses separated detection from classification, showing that object isolation improved performance for both humans and machines. Experiment 3 investigated individual differences in human performance and whether models capture the qualitative ordinal relationships characterising human categorisation behaviour; only the multimodal CoCa model achieved this. These findings clarify the extent to which current models approximate human categorisation behaviour beyond mere accuracy and highlight the importance of incorporating principles from comparative psychology while considering individual differences. Full article
(This article belongs to the Special Issue Advanced Studies in Human-Centred AI)

15 pages, 1363 KB  
Article
Hierarchical Knowledge Distillation for Efficient Model Compression and Transfer: A Multi-Level Aggregation Approach
by Titinunt Kitrungrotsakul and Preeyanuch Srichola
Information 2026, 17(1), 70; https://doi.org/10.3390/info17010070 - 12 Jan 2026
Abstract
The success of large-scale deep learning models in remote sensing tasks has been transformative, enabling significant advances in image classification, object detection, and image–text retrieval. However, their computational and memory demands pose challenges for deployment in resource-constrained environments. Knowledge distillation (KD) alleviates these issues by transferring knowledge from a strong teacher to a student model, which can be compact for efficient deployment or architecturally matched to improve accuracy under the same inference budget. In this paper, we introduce Hierarchical Multi-Segment Knowledge Distillation (HIMS_KD), a multi-stage framework that sequentially distills knowledge from a teacher into multiple assistant models specialized in low-, mid-, and high-level representations, and then aggregates their knowledge into the final student. We integrate feature-level alignment, auxiliary similarity-logit alignment, and supervised loss during distillation. Experiments on benchmark remote sensing datasets (RSITMD and RSICD) show that HIMS_KD improves retrieval performance and enhances zero-shot classification; and when a compact student is used, it reduces deployment cost while retaining strong accuracy. Full article
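The soft-label distillation term that frameworks like HIMS_KD build on is the classic temperature-softened KL divergence between teacher and student logits (Hinton et al.). A minimal sketch of that component only (the paper adds feature-level and similarity-logit alignment on top):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Soft-label distillation: KL(teacher || student) on temperature-softened
    # distributions, scaled by T^2 to keep gradient magnitudes comparable.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))

matched = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])   # identical logits
diverged = kd_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0])  # reversed preference
```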
(This article belongs to the Special Issue AI-Based Image Processing and Computer Vision)

21 pages, 3379 KB  
Article
KORIE: A Multi-Task Benchmark for Detection, OCR, and Information Extraction on Korean Retail Receipts
by Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Mahmoud Abdalla and Hyun Soo Kang
Mathematics 2026, 14(1), 187; https://doi.org/10.3390/math14010187 - 4 Jan 2026
Abstract
We introduce KORIE, a curated benchmark of 748 Korean retail receipts designed to evaluate scene text detection, Optical Character Recognition (OCR), and Information Extraction (IE) under challenging digitization conditions. Unlike existing large-scale repositories, KORIE consists exclusively of receipts digitized via flatbed scanning (HP LaserJet MFP), specifically selected to preserve complex thermal printing artifacts such as ink fading, banding, and mechanical creases. We establish rigorous baselines across three tasks: (1) Detection, comparing Weakly Supervised Object Localization (WSOL) against state-of-the-art fully supervised models (YOLOv9, YOLOv10, YOLOv11, and DINO-DETR); (2) OCR, benchmarking Tesseract, EasyOCR, PaddleOCR, and a custom Attention-based BiGRU; and (3) Information Extraction, evaluating the zero-shot capabilities of Large Language Models (Llama-3, Qwen-2.5) on structured field parsing. Our results identify YOLOv11 as the optimal detector for dense receipt layouts and demonstrate that while PaddleOCR achieves the lowest Character Error Rate (15.84%), standard LLMs struggle in zero-shot settings due to domain mismatch with noisy Korean receipt text, particularly for price-related fields (F1 scores ≈ 25%). We release the dataset, splits, and evaluation code to facilitate reproducible research on degraded Hangul document understanding. Full article
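Character Error Rate, the OCR metric reported above, is the Levenshtein edit distance between hypothesis and reference divided by the reference length. A self-contained implementation of the standard definition (the receipt strings below are invented examples):

```python
def cer(reference, hypothesis):
    # Character Error Rate: minimum number of substitutions, insertions,
    # and deletions turning `reference` into `hypothesis`, normalized by
    # the reference length. Computed with classic dynamic programming.
    m, n = len(reference), len(hypothesis)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n] / m

rate = cer("우유 1000원", "우유 1800원")   # one substituted digit out of 8 chars
```

A misread digit in a price field costs the same as any other character here, which is why the abstract's price-field F1 can collapse even at a modest overall CER.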

23 pages, 725 KB  
Article
From Sound to Risk: Streaming Audio Flags for Real-World Hazard Inference Based on AI
by Ilyas Potamitis
J. Sens. Actuator Netw. 2026, 15(1), 6; https://doi.org/10.3390/jsan15010006 - 1 Jan 2026
Abstract
Seconds count differently for people in danger. We present a real-time streaming pipeline for audio-based detection of hazardous events affecting life and property. The system operates online rather than as a retrospective analysis tool. Its objective is to reduce the latency between the occurrence of a crime, conflict, or accident and the corresponding response by authorities. The key idea is to map reality as perceived by audio into a written story and question the text via a large language model. The method integrates streaming, zero-shot algorithms in an online decoding mode that convert sound into short, interpretable tokens, which are processed by a lightweight language model. CLAP text–audio prompting identifies agitation, panic, and distress cues, combined with conversational dynamics derived from speaker diarization. Lexical information is obtained through streaming automatic speech recognition, while general audio events are detected by a streaming version of the Audio Spectrogram Transformer tagger. Prosodic features are incorporated using pitch- and energy-based rules derived from robust F0 tracking and periodicity measures. The system uses a large language model configured for online decoding and outputs binary (YES/NO) life-threatening risk decisions every two seconds, along with a brief justification and a final session-level verdict. The system emphasizes interpretability and accountability. We evaluate it on a subset of the X-Violence dataset, comprising only real-world videos. We release code, prompts, decision policies, evaluation splits, and example logs to enable the community to replicate, critique, and extend our blueprint. Full article
(This article belongs to the Topic Trends and Prospects in Security, Encryption and Encoding)

15 pages, 5477 KB  
Article
Few-Shot Transfer Learning for Diabetes Risk Prediction Across Global Populations
by Shrinit Babel, Sunit Babel, John Hodgson and Enrico Camporesi
Medicina 2026, 62(1), 7; https://doi.org/10.3390/medicina62010007 - 19 Dec 2025
Abstract
Background and Objectives: Type 2 diabetes mellitus (T2DM) affects over 537 million adults worldwide and disproportionately burdens low- and middle-income countries, where diagnostic resources are limited. Predictive models trained in one population often fail to generalize across regions due to shifts in feature distributions and measurement practices, hindering scalable screening efforts. Materials and Methods: We evaluated a few-shot domain adaptation framework using a simple multilayer perceptron with four shared clinical features (age, body mass index, mean arterial pressure, and plasma glucose) across three adult cohorts: Bangladesh (n = 5288), Iraq (n = 662), and the Pima Indian dataset (n = 768). For each of the six source-target pairs, we pre-trained on the source cohort and then fine-tuned on 1, 5, 10, and 20% of the labeled target examples, reserving the remainder for testing; a final 20% few-shot version was compared with threshold tuning. Discrimination and calibration performance metrics were used before and after adaptation. SHAP explainability analyses quantified shifts in feature importance and decision thresholds. Results: Several source → target transfers produced zero true positives under the strict source-only baseline at a fixed 0.5 decision threshold (e.g., Bangladesh → Pima F1 = 0.00, 0/268 diabetics detected). Few-shot fine-tuning restored non-zero recall in all such cases, with F1 improvements up to +0.63 and precision–recall gains in every zero-baseline transfer. In directions with moderate baseline performance (e.g., Bangladesh → Iraq, Iraq → Pima, Pima → Iraq), 20% few-shot adaptation with threshold tuning improved AUROC by +0.01 to +0.14 and accuracy by +4 to +17 percentage points while reducing Brier scores by up to 0.14 and ECE by approximately 30–80% (suggesting improved calibration). All but one transfer (Iraq → Bangladesh) demonstrated statistically significant improvement by McNemar’s test (p < 0.001). 
SHAP analyses revealed population-specific threshold shifts: glucose inflection points ranged from ~120 mg/dL in Pima to ~150 mg/dL in Iraq, and the importance of BMI rose in Pima-targeted adaptations. Conclusions: Leveraging as few as 5–20% of local labels, few-shot domain adaptation enhances cross-population T2DM risk prediction using only routinely available features. This scalable, interpretable approach can democratize preventive screening in diverse, resource-constrained settings. Full article
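The threshold tuning mentioned above can be as simple as sweeping candidate cut-offs on the small labeled target split and keeping the F1-maximizing one; under distribution shift, a fixed 0.5 cut-off can yield zero recall. A toy sketch with invented scores (not the study's code):

```python
def f1_at_threshold(probs, labels, thresh):
    # Binarize scores at `thresh` and compute F1 against the labels.
    tp = sum(p >= thresh and y == 1 for p, y in zip(probs, labels))
    fp = sum(p >= thresh and y == 0 for p, y in zip(probs, labels))
    fn = sum(p < thresh and y == 1 for p, y in zip(probs, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(probs, labels, grid=None):
    # Sweep a grid of candidate cut-offs on the few-shot target split and
    # keep the one that maximizes F1.
    grid = grid or [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f1_at_threshold(probs, labels, t))

# Shifted score distribution: every diabetic scores 0.30-0.45, so the
# fixed 0.5 cut-off detects none of them.
probs = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0]
best = tune_threshold(probs, labels)
```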

15 pages, 3989 KB  
Article
YOLO-SAM AgriScan: A Unified Framework for Ripe Strawberry Detection and Segmentation with Few-Shot and Zero-Shot Learning
by Partho Ghose, Al Bashir, Yibin Wang, Cristian Bua and Azlan Zahid
Sensors 2025, 25(24), 7678; https://doi.org/10.3390/s25247678 - 18 Dec 2025
Cited by 1
Abstract
Traditional segmentation methods are slow and rely on manual annotations, which are labor-intensive. To address these limitations, we propose YOLO-SAM AgriScan, a unified framework that combines the fast object detection capabilities of YOLOv11 with the zero-shot segmentation power of the Segment Anything Model 2 (SAM2). Our approach adopts a hybrid paradigm for on-plant ripe strawberry segmentation, wherein YOLOv11 is fine-tuned using a few-shot learning strategy with minimal annotated samples, and SAM2 performs mask generation without additional supervision. This architecture eliminates the bottleneck of pixel-wise manual annotation and enables the scalable and efficient segmentation of strawberries in both controlled and natural farm environments. Experimental evaluations on two datasets, a custom-collected dataset and a publicly available benchmark, demonstrate strong detection and segmentation performance in both full-data and data-constrained scenarios. The proposed framework achieved a mean Dice score of 0.95 and an IoU of 0.93 on our collected dataset and maintained competitive performance on public data (Dice: 0.95, IoU: 0.92), demonstrating its robustness, generalizability, and practical relevance in real-world agricultural settings. Our results highlight the potential of combining few-shot detection and zero-shot segmentation to accelerate the development of annotation-light, intelligent phenotyping systems. Full article
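The Dice and IoU figures quoted above are standard overlap metrics between predicted and ground-truth binary masks. A minimal NumPy sketch follows; the toy 8×8 masks stand in for SAM2 output and are purely illustrative.

```python
# Per-mask Dice and IoU, as used to score segmentation quality above.
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Dice and IoU for two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    dice = 2.0 * inter / total if total else 1.0   # empty masks count as perfect
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

# Toy example: an 8x8 ground-truth mask vs. a prediction shifted by one row.
gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), dtype=bool); pred[3:7, 2:6] = True
d, i = dice_iou(pred, gt)
print(f"Dice={d:.3f}, IoU={i:.3f}")   # -> Dice=0.750, IoU=0.600
```

Note that Dice is always at least as large as IoU for the same pair of masks (Dice = 2·IoU/(1+IoU)), which is consistent with the reported 0.95 vs. 0.93 pairing.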
19 pages, 444 KB  
Article
Enhancing Cascade Object Detection Accuracy Using Correctors Based on High-Dimensional Feature Separation
by Andrey V. Kovalchuk, Andrey A. Lebedev, Olga V. Shemagina, Irina V. Nuidel, Vladimir G. Yakhno and Sergey V. Stasenko
Technologies 2025, 13(12), 593; https://doi.org/10.3390/technologies13120593 - 16 Dec 2025
Cited by 2 | Viewed by 469
Abstract
This study addresses the problem of correcting systematic errors in classical cascade object detectors under severe data scarcity and distribution shift. We focus on the widely used Viola–Jones framework enhanced with a modified Census transform and propose a modular “corrector” architecture that can be attached to an existing detector without retraining it. The key idea is to exploit the blessing of dimensionality: high-dimensional feature vectors constructed from multiple cascade stages are transformed by PCA and whitening into a space where simple linear Fisher discriminants can reliably separate rare error patterns from normal operation using only a few labeled examples. The approach involves image partitioning with a sliding window of fixed aspect ratio and a modified Census transform in which each pixel’s intensity is compared to the mean value within a rectangular neighborhood. Training samples for the false-negative and false-positive correctors are selected using dual Intersection-over-Union (IoU) thresholds and probabilistic sampling of true-positive and true-negative fragments. Corrector models are trained on features derived from the detector’s cascade stages, following the principles of high-dimensional separability within a one- and few-shot learning paradigm. Decision boundaries are optimized using Fisher’s rule, with adaptive thresholding to guarantee zero false acceptance. Experimental results indicate that the proposed correction scheme enhances object detection accuracy by effectively compensating for classifier errors, particularly under conditions of scarce training data. 
On two railway image datasets with only about one thousand images each, the proposed correctors increase Precision from 0.36 to 0.65 on identifier detection while maintaining high Recall (0.98 → 0.94), and improve digit detection Recall from 0.94 to 0.98 with negligible loss in Precision (0.92 → 0.91). These results demonstrate that even under scarce training data, high-dimensional feature separation enables effective one-/few-shot error correction for cascade detectors with minimal computational overhead. Full article
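The corrector recipe above can be sketched compactly: after PCA with whitening, the pooled covariance is roughly the identity, so the Fisher direction reduces to the difference of class means, and the zero-false-acceptance threshold is simply set just above the highest score of any normal training fragment. The data, dimensions, and sample counts below are synthetic placeholders, not the authors' railway features.

```python
# Corrector sketch: whiten high-dimensional cascade-stage features, fit a
# linear Fisher discriminant on a handful of labeled error examples, and
# pick a threshold with zero false acceptance on the training normals.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
d = 200                                       # high-dimensional stage features
X_normal = rng.normal(size=(500, d))          # "normal operation" fragments
X_error = rng.normal(loc=0.6, size=(5, d))    # few-shot labeled error patterns

# PCA + whitening fitted on normal data only.
pca = PCA(n_components=50, whiten=True).fit(X_normal)
Zn, Ze = pca.transform(X_normal), pca.transform(X_error)

# Fisher direction: w ~ Sigma^-1 (mu_error - mu_normal); in whitened
# coordinates Sigma is ~identity, so the mean difference suffices.
w = Ze.mean(axis=0) - Zn.mean(axis=0)
w /= np.linalg.norm(w)

scores_n, scores_e = Zn @ w, Ze @ w
# Adaptive threshold: just above the largest normal-sample score, which
# guarantees zero false acceptance on the training normals.
theta = scores_n.max() + 1e-6
print("errors flagged:", int((scores_e > theta).sum()), "of", len(scores_e))
print("normals flagged:", int((scores_n > theta).sum()))
```

The blessing-of-dimensionality argument is visible here: with only 5 labeled error examples in a 50-dimensional whitened space, a single linear direction is typically enough to isolate them from the bulk of normal samples.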
(This article belongs to the Special Issue Image Analysis and Processing)
19 pages, 2418 KB  
Article
D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection
by Bintao He, Caixia Yan, Yan Kou, Yinghao Wang, Xin Lv, Haipeng Du and Yugui Xie
Appl. Sci. 2025, 15(23), 12723; https://doi.org/10.3390/app152312723 - 1 Dec 2025
Viewed by 518
Abstract
Continual learning for open-vocabulary object detection aims to enable pretrained vision–language detectors to adapt to diverse specialized domains while preserving their zero-shot generalization capabilities. However, existing methods primarily focus on mitigating catastrophic forgetting, often neglecting the substantial domain shifts commonly encountered in real-world applications. To address this critical oversight, we pioneer Open-Domain Continual Object Detection (OD-COD), a new paradigm that requires detectors to continually adapt across domains with significant stylistic gaps. We propose Disentangled Domain Knowledge-Aided Learning (D-Know) to tackle this challenge. This framework explicitly disentangles domain-general priors from category-specific adaptation, managing them dynamically in a scalable domain knowledge base. Specifically, D-Know first learns domain priors in a self-supervised manner and then leverages these priors to facilitate category-specific adaptation within each domain. To rigorously evaluate this task, we construct OD-CODB, the first dedicated benchmark spanning six domains with substantial visual variations. Extensive experiments demonstrate that D-Know achieves superior performance, surpassing current state-of-the-art methods by an average of 4.2% mAP under open-domain continual settings while maintaining strong zero-shot generalization. Furthermore, experiments under the few-shot setting confirm D-Know’s superior data efficiency. Full article
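The "scalable domain knowledge base" idea above — domain priors stored separately from category-specific adaptation and looked up per input — can be illustrated with a toy routing structure. D-Know's actual components are not reproduced here; the `DomainKnowledgeBase` class, the style embeddings, and the adapter states below are all hypothetical stand-ins.

```python
# Illustrative sketch: store one style-prior embedding per learned domain,
# then route an incoming image to the closest domain's adapter state.
# Everything here is a placeholder, not D-Know's implementation.
import numpy as np

class DomainKnowledgeBase:
    def __init__(self):
        self.prototypes = {}   # domain name -> normalized style-prior embedding
        self.adapters = {}     # domain name -> category-specific adapter state

    def add_domain(self, name, style_embedding, adapter_state):
        """Register a newly learned domain without touching earlier ones."""
        self.prototypes[name] = style_embedding / np.linalg.norm(style_embedding)
        self.adapters[name] = adapter_state

    def route(self, style_embedding):
        """Pick the stored domain whose prior best matches the input style."""
        q = style_embedding / np.linalg.norm(style_embedding)
        return max(self.prototypes, key=lambda d: float(self.prototypes[d] @ q))

kb = DomainKnowledgeBase()
rng = np.random.default_rng(0)
for name in ["natural", "infrared", "cartoon"]:
    kb.add_domain(name, rng.normal(size=16), adapter_state={"domain": name})

# A probe whose style is close to the stored "infrared" prior routes there.
probe = kb.prototypes["infrared"] * 3.0 + rng.normal(scale=0.05, size=16)
print(kb.route(probe))
```

Keeping per-domain state in a growable dictionary like this is one simple way a knowledge base avoids catastrophic forgetting: adapting to a new domain appends an entry rather than overwriting shared weights.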