Search Results (1,259)

Search Parameters:
Keywords = multimodal features learning

30 pages, 12347 KB  
Article
BactoRamanBioNet: A Multimodal Neural Network for Bacterial Species Identification Using Raman Spectroscopy and Biological Knowledge
by Yaoxue Xu, Junzhuo Song, Zhen Zhang, Lin Feng, Yalan Yang, Yunsen Liang and Yan Guo
Sensors 2026, 26(6), 1828; https://doi.org/10.3390/s26061828 - 13 Mar 2026
Abstract
Accurate and rapid identification of bacterial species is essential for public health, clinical diagnostics, and environmental monitoring. Although Raman spectroscopy offers a powerful, non-invasive alternative, reliance solely on spectral data often fails to distinguish species with highly similar signatures, particularly when the discriminating features are subtle. This difficulty is frequently compounded by a lack of integrated biological prior knowledge, which can hinder model performance. To address these challenges, we introduce BactoRamanBioNet, a novel multimodal neural network architecture. Our model employs a synergistic approach that utilizes a ResNet-Transformer architecture to capture complex spectral patterns and a CLIP text encoder to incorporate descriptive biological information, thereby enabling highly accurate multimodal classification of bacterial species. Empirical results demonstrate that BactoRamanBioNet achieves a classification accuracy of 98.2% and an F1-score of 98.0%. This performance surpasses the current state-of-the-art deep learning model, ResNet-1D, by 2.4% in accuracy and 2.0% in F1-score. Moreover, our model outperforms traditional classifiers, such as Support Vector Machine (SVM) and Random Forest (RF), by 9.8% and 7.9% in accuracy, respectively, while also exhibiting significant improvements in precision and recall. By establishing a new benchmark in performance and robustness, BactoRamanBioNet offers a powerful and reliable framework for automated microbiological analysis, paving the way for next-generation diagnostic systems. Full article
(This article belongs to the Section Sensing and Imaging)
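
As a rough illustration of the spectrum-plus-text fusion the abstract describes, the sketch below pairs a small 1-D convolutional/Transformer branch for Raman spectra with a projected text embedding; all module names, dimensions, and the toy data are assumptions, not the authors' released architecture.

```python
# Hypothetical sketch of a spectrum+text fusion classifier in the spirit of
# BactoRamanBioNet (spectral ResNet/Transformer branch + text-derived priors).
import torch
import torch.nn as nn

class SpectrumTextClassifier(nn.Module):
    def __init__(self, n_classes, text_dim=512, d_model=128):
        super().__init__()
        # Convolutional stem over the 1-D Raman spectrum (batch, 1, length)
        self.stem = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm1d(d_model), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm1d(d_model), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.text_proj = nn.Linear(text_dim, d_model)  # projects a frozen text-encoder output
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, spectrum, text_emb):
        x = self.stem(spectrum)                               # (B, d_model, L')
        x = self.transformer(x.transpose(1, 2)).mean(dim=1)   # pooled spectral token
        t = self.text_proj(text_emb)                          # biological-knowledge embedding
        return self.head(torch.cat([x, t], dim=-1))

# Toy usage: 8 spectra of 1000 points, 512-d text embeddings, 5 species.
model = SpectrumTextClassifier(n_classes=5)
logits = model(torch.randn(8, 1, 1000), torch.randn(8, 512))
print(logits.shape)  # torch.Size([8, 5])
```
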
31 pages, 2057 KB  
Review
Clinical AI in Radiology: Foundations, Trends, Applications, and Emerging Directions
by Iryna Hartsock, Nikolas Koutsoubis, Sabeen Ahmed, Nathan Parker, Matthew B. Schabath, Cyrillo Araujo, Aliya Qayyum, Cesar Lam, Robert A. Gatenby and Ghulam Rasool
Cancers 2026, 18(6), 942; https://doi.org/10.3390/cancers18060942 - 13 Mar 2026
Abstract
Artificial intelligence (AI) is at the vanguard of transforming radiology in several ways, including augmenting diagnoses, improving workflows, and increasing operational efficiency. Several integration challenges, including concerns over privacy, clinical usability, and workflow compatibility, still remain. This review discusses the foundations and current trends of clinical AI in radiology to provide essential context for ongoing developments. To illustrate translational potential, we describe representative applications, including: (1) local deployment of large language models (LLMs) for restructuring and streamlining radiology reports, improving clarity and consistency without relying on external resources; (2) multimodal AI frameworks combining CT images, clinical data, laboratory biomarkers, and LLM-extracted features from clinical notes for early detection of cachexia in pancreatic cancer; (3) privacy-preserving federated learning (FL) infrastructure enabling collaborative AI model development across institutions without sharing raw patient data; and (4) an uncertainty-aware de-identification pipeline for removing Protected Health Information (PHI) from radiology images and clinical reports to support secure data analysis and sharing. We further discuss emerging opportunities for tumor board decision support, clinical trial matching, radiology report quality assurance, and the development of an imaging complexity index. Collectively, these applications highlight the importance of local deployment, multimodal reasoning, privacy preservation, and human-in-the-loop oversight in translating AI models from research to oncology radiology practice. Full article
(This article belongs to the Special Issue Advances in Medical Imaging for Cancer Detection and Diagnosis)
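
One technique the review highlights, privacy-preserving federated learning, can be summarized as weight averaging across institutions without sharing raw data; the minimal FedAvg-style sketch below uses a toy model, and the client sample counts are illustrative assumptions.

```python
# A minimal FedAvg-style weight averaging sketch; not any specific FL framework.
import torch
import torch.nn as nn

def federated_average(state_dicts, sample_counts):
    """Weighted average of client model parameters, without exchanging patient data."""
    total = sum(sample_counts)
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(sd[key].float() * (n / total)
                       for sd, n in zip(state_dicts, sample_counts))
    return avg

# Toy example: three hospitals train the same small model locally.
clients = [nn.Linear(10, 2) for _ in range(3)]
global_state = federated_average([c.state_dict() for c in clients],
                                 sample_counts=[120, 300, 80])
global_model = nn.Linear(10, 2)
global_model.load_state_dict(global_state)
```
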
27 pages, 2784 KB  
Article
A Cloud-Aware Scalable Architecture for Distributed Edge-Enabled BCI Biosensor System
by Sayantan Ghosh, Raghavan Bhuvanakantham, Padmanabhan Sindhujaa, Purushothaman Bhuvana Harishita, Anand Mohan, Balázs Gulyás, Domokos Máthé and Parasuraman Padmanabhan
Biosensors 2026, 16(3), 157; https://doi.org/10.3390/bios16030157 - 13 Mar 2026
Abstract
BCI biosensors enable continuous monitoring of neural activity, but existing systems face challenges in scalability, latency, and reliable integration with cloud infrastructure. This work presents a cloud-aware, real-time cognitive grid architecture for multimodal BCI biosensors, validated at the system level through a full physical prototype. The system integrates the BioAmp EXG Pill for signal acquisition with an RP2040 microcontroller for local preprocessing, uses edge-resident TinyML deployment for on-device feature extraction and inference, augments signal context with environmental sensors for downstream analytics, and communicates with the external world via Wi-Fi/4G connectivity. A tiered data pipeline was implemented: SD card buffering for raw signals, Redis for near-real-time streaming, PostgreSQL for structured analytics, and AWS S3 with Glacier for long-term archival. End-to-end validation demonstrated consistent edge-level inference with bounded latency, while cloud-assisted telemetry and analytics exhibited variable transmission and processing delays consistent with cellular connectivity and serverless execution characteristics; packet loss remained below 5%. Visualization was achieved through Python 3.10 using Matplotlib GUI, Grafana 10.2.3 dashboards, and on-device LCD displays. Hybrid deployment strategies (local development, simulated cloud testing, and limited cloud usage for benchmark capture) enabled cost-efficient validation while preserving architectural fidelity and latency observability. The results establish a scalable, modular, and energy-efficient biosensor framework, providing a foundation for advanced analytics and translational BCI applications to be explored in subsequent work, with explicit consideration of both edge-resident TinyML inference and cloud-based machine learning workflows. Full article
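
To make the edge-plus-tiered-pipeline idea concrete, here is a hedged sketch of one edge-side step: a band-power feature computed from a simulated ExG window, followed by an optional push to a Redis stream (the near-real-time tier mentioned in the abstract). The stream and field names, the threshold "inference", and the synthetic signal are assumptions, not the authors' firmware.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_power(window, fs=250.0, lo=8.0, hi=13.0):
    """Band-pass the window (e.g., alpha band) and return its mean power."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, window)
    return float(np.mean(filtered ** 2))

window = np.random.randn(500)          # 2 s of simulated ExG at 250 Hz
feature = band_power(window)
label = int(feature > 0.5)             # stand-in for on-device TinyML inference

try:
    import redis                        # near-real-time tier; requires a reachable server
    redis.Redis(host="localhost", port=6379).xadd(
        "exg_stream", {"band_power": feature, "label": label})
except Exception as exc:
    print(f"Redis unavailable, buffering locally instead: {exc}")
```
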
20 pages, 6854 KB  
Article
TARTS: Training-Free Adaptive Reference-Guided Traversability Segmentation with Automated Footprint Supervision and Experimental Verification
by Shuhong Shi and Lingchuan Zeng
Electronics 2026, 15(6), 1194; https://doi.org/10.3390/electronics15061194 - 13 Mar 2026
Abstract
Autonomous mobile robots require robust traversability perception to navigate safely in diverse outdoor environments. However, traditional deep learning approaches are data-hungry, requiring large-scale manual annotations, and struggle to adapt quickly to unseen environments. This paper introduces TARTS (Training-free Adaptive Reference-guided Traversability Segmentation), a novel framework combining one-shot prototype initialization with trajectory-guided online adaptation for terrain segmentation. Using a single reference image of desired traversable terrain, TARTS establishes an initial prototype from pre-trained DINO Vision Transformer (ViT) features. The system performs segmentation through superpixel-based feature aggregation and valley-emphasis Otsu thresholding while continuously refining the prototype via Exponential Moving Average (EMA) updates driven by automated footprint supervision from the robot’s traversed trajectory. Extensive experiments on our introduced Reference-guided Traversability Segmentation Dataset (RTSD) and the challenging Off-Road Freespace Detection (ORFD) benchmark demonstrate strong performance, achieving 94.5% IoU on RTSD and 94.1% IoU on ORFD, outperforming state-of-the-art supervised methods that require multi-modal inputs and dedicated training. The framework maintains efficient performance (17–24 FPS) on embedded platforms, enabling practical deployment with only a reference image as initialization. Full article
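
The training-free mechanism the abstract outlines (prototype matching, Otsu thresholding, and EMA refinement from footprint supervision) can be sketched in a few lines of numpy; the feature shapes, EMA momentum, and random stand-in features below are assumptions, not the paper's implementation.

```python
import numpy as np
from skimage.filters import threshold_otsu

def traversability_mask(features, prototype):
    """features: (H, W, D) dense descriptors; prototype: (D,) reference vector."""
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    sim = f @ p                                    # cosine similarity map (H, W)
    return sim > threshold_otsu(sim), sim

def ema_update(prototype, footprint_features, momentum=0.95):
    """Refine the prototype with features sampled under the robot's footprint."""
    return momentum * prototype + (1 - momentum) * footprint_features.mean(axis=0)

feats = np.random.rand(32, 32, 64)                 # stand-in for DINO ViT features
proto = np.random.rand(64)                         # from the single reference image
mask, sim = traversability_mask(feats, proto)
proto = ema_update(proto, feats[mask])             # footprint supervision stand-in
```
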
33 pages, 4366 KB  
Article
Structured and Factorized Multi-Modal Representation Learning for Physiological Affective State and Music Preference Inference
by Wenli Qu and Mu-Jiang-Shan Wang
Symmetry 2026, 18(3), 488; https://doi.org/10.3390/sym18030488 - 12 Mar 2026
Abstract
Emotions and affective responses are core intervention targets in music therapy. Through acoustic elements, music can evoke emotional responses at physiological and neurological levels, influencing cognition and behavior while providing an important dimension for evaluating therapeutic efficacy. However, emotions are inherently abstract and difficult to represent directly. Artificial intelligence models therefore provide a promising tool for modeling and quantifying such abstract affective states from physiological signals. In this paper, we propose a structured and explicitly factorized multi-modal representation learning framework for joint affective state and preference inference. Instead of entangling heterogeneous dynamics within monolithic encoders, the framework decomposes representation learning into cross-channel interaction modeling and intra-channel temporal–spectral organization modeling. The framework integrates electroencephalography (EEG), peripheral physiological signals (GSR, BVP, EMG, respiration, and temperature), and eye-movement data (EOG) within a unified temporal modeling paradigm. At its core, a Dynamic Token Feature Extractor (DTFE) transforms raw time series into compact token representations and explicitly factorizes representation learning into (i) explicit channel-wise cross-series interaction modeling and (ii) temporal–spectral refinement via learnable frequency-domain gating. These complementary structural modules are implemented through Cross-Series Intersection (CSI) and Intra-Series Intersection (ISI), which perform low-rank channel dependency learning and adaptive spectral modulation, respectively. A hierarchical cross-modal fusion strategy integrates modality-level tokens in a representation-consistent and interaction-aware manner, enabling coordinated modeling of neural, autonomic, and attentional responses. The entire framework is optimized under a unified multi-task objective for valence, arousal, and liking prediction. Experiments on the DEAP dataset demonstrate consistent improvements over state-of-the-art methods. The model achieves 98.32% and 98.45% accuracy for valence and arousal prediction, 97.96% for quadrant classification in single-task evaluation, and 92.8%, 91.8%, and 93.6% accuracy for valence, arousal, and liking in joint multi-task settings. Overall, this work establishes a structure-aware and factorized multi-modal representation learning framework for robust affective decoding and intelligent music therapy systems. Full article
(This article belongs to the Section Computer)
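
The two factorized operations described, cross-channel low-rank mixing and learnable frequency-domain gating, are illustrated by the toy PyTorch block below; the rank, dimensions, and gating form are assumptions rather than the paper's exact DTFE/CSI/ISI modules.

```python
import torch
import torch.nn as nn

class FactorizedBlock(nn.Module):
    def __init__(self, n_channels, seq_len, rank=4):
        super().__init__()
        # Low-rank channel dependency (mixes across channels, not time)
        self.down = nn.Linear(n_channels, rank, bias=False)
        self.up = nn.Linear(rank, n_channels, bias=False)
        # Learnable gate over rFFT bins (one gate per frequency bin)
        self.freq_gate = nn.Parameter(torch.ones(seq_len // 2 + 1))

    def forward(self, x):                       # x: (batch, channels, time)
        mixed = self.up(self.down(x.transpose(1, 2))).transpose(1, 2) + x
        spec = torch.fft.rfft(mixed, dim=-1)    # per-channel spectrum
        spec = spec * self.freq_gate            # adaptive spectral modulation
        return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)

block = FactorizedBlock(n_channels=40, seq_len=128)   # e.g., DEAP channel count
out = block(torch.randn(8, 40, 128))
print(out.shape)  # torch.Size([8, 40, 128])
```
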
43 pages, 2166 KB  
Article
Research on Root Cause Analysis Method for Certain Civil Aircraft Based on Ensemble Learning and Large Language Model Reasoning
by Wenyou Du, Jingtao Du, Haoran Zhang and Dongsheng Yang
Machines 2026, 14(3), 322; https://doi.org/10.3390/machines14030322 - 12 Mar 2026
Abstract
To address the challenges commonly encountered in civil aircraft operating under multi-mode, strongly coupled closed-loop control—namely scarce fault samples, pronounced distribution shift, and root-cause explanations that are easily confounded by covariates—this paper proposes a root-cause analysis method that integrates ensemble learning with constraint-guided reasoning by large language models (LLMs). First, for Full Authority Digital Engine Control (FADEC) monitoring sequences, a feature system comprising environment-normalized ratios, mechanism-informed mixing indices, and multi-scale temporal statistics is constructed, thereby improving cross-mode comparability and enhancing engineering-semantic expressiveness. Second, in the anomaly detection stage, a cost-sensitive LightGBM model is adopted and a validation-set-based adaptive thresholding strategy is introduced to achieve robust identification under highly imbalanced fault conditions. Furthermore, for Root Cause Analysis (RCA), a “computation–reasoning decoupling” framework is developed: Shapley Additive exPlanations (SHAP) are used to generate segment-level contribution evidence, while causal chains, engineering prohibitions, and structured output templates are injected into prompts to constrain the LLM, enabling it to infer root-cause candidates and produce structured explanations under mechanism-consistency constraints. Experiments on real flight data demonstrate that our method yields an anomaly detection F1-score of 0.9577 and improves overall RCA accuracy to 97.1% (versus 62.3% for a pure SHAP baseline). Practically, by translating complex high-dimensional data into actionable natural language diagnostic reports, the proposed method provides reliable and interpretable decision support for rapid RCA. Full article
(This article belongs to the Section Automation and Control Systems)
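
The detection stage described (a cost-sensitive LightGBM classifier plus a validation-set adaptive threshold) maps onto standard tooling; the sketch below uses synthetic data, and choosing the F1-maximizing cut-off is an illustrative assumption about how such a threshold might be selected.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X = np.random.randn(2000, 20)                           # placeholder FADEC-derived features
y = (np.random.rand(2000) < 0.05).astype(int)           # highly imbalanced fault labels
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# Cost-sensitive boosting: up-weight the rare positive (fault) class.
pos_weight = (y_tr == 0).sum() / max((y_tr == 1).sum(), 1)
clf = LGBMClassifier(n_estimators=200, scale_pos_weight=pos_weight)
clf.fit(X_tr, y_tr)

# Adaptive threshold: pick the probability cut-off with the best validation F1.
prec, rec, thr = precision_recall_curve(y_val, clf.predict_proba(X_val)[:, 1])
f1 = 2 * prec * rec / np.clip(prec + rec, 1e-8, None)
best_threshold = thr[np.argmax(f1[:-1])]                 # thr has len(prec) - 1 entries
print(f"selected threshold: {best_threshold:.3f}")
```
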
28 pages, 5658 KB  
Article
A Multimodule Collaborative Framework for Unsupervised Visible–Infrared Person Re-Identification with Channel Enhancement Modality
by Baoshan Sun, Yi Du and Liqing Gao
Sensors 2026, 26(6), 1770; https://doi.org/10.3390/s26061770 - 11 Mar 2026
Abstract
Unsupervised visible–infrared person re-identification (USL-VI-ReID) plays a pivotal role in cross-modal computer vision applications for intelligent surveillance and public safety. However, the task remains hampered by large modality gaps and limited granularity in feature representations. In particular, channel augmentation (CA) is typically used only for data augmentation, and its potential as an independent input modality remains unexplored. To address these shortcomings, we present a multimodule collaborative USL-VI-ReID framework that explicitly treats CA as a separate input modality. The framework combines four complementary modules. The Person-ReID Adaptive Convolutional Block Attention Module (PA-CBAM) extracts discriminative features using a two-level attention mechanism that refines salient spatial and channel cues. The Varied Regional Alignment (VRA) module performs cross-modal regional alignment and leverages Multimodal Assisted Adversarial Learning (MAAL) to reinforce region-level correspondence. The Varied Regional Neighbor Learning (VRNL) module implements reliable neighborhood learning via multi-region association to stabilize pseudo-labels and capture local structure. Finally, the Uniform Merging (UM) module merges split clusters through alternating contrastive learning to improve cluster consistency. We evaluate the proposed method on SYSU-MM01 and RegDB. On RegDB’s visible-to-infrared setting, the approach achieves Rank-1 = 93.34%, mean Average Precision (mAP) = 87.55%, and mean Inverse Negative Penalty (mINP) = 76.08%. These results indicate that our method effectively reduces modal discrepancies and increases feature discriminability. It outperforms most existing unsupervised baselines and several supervised approaches, thereby advancing the practical applicability of USL-VI-ReID. Full article
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)
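
Channel augmentation, which the paper promotes from a data-augmentation trick to a separate input modality, typically means replacing an RGB crop with one randomly chosen channel replicated across all three channels; the small numpy sketch below shows that operation under an assumed (H, W, 3) layout.

```python
import numpy as np

def channel_augment(image, rng=None):
    """image: (H, W, 3) uint8 RGB -> (H, W, 3) single channel replicated three times."""
    rng = rng or np.random.default_rng()
    c = rng.integers(0, 3)                         # pick R, G, or B at random
    return np.repeat(image[:, :, c:c + 1], 3, axis=2)

rgb = (np.random.rand(256, 128, 3) * 255).astype(np.uint8)   # a person crop
ca_view = channel_augment(rgb)   # grayscale-like view usable as an extra input modality
```
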
18 pages, 3239 KB  
Article
LPA-Tuning CLIP: An Improved CLIP-Based Classification Model for Intestinal Polyps
by Zumin Wang, Jun Gao, Wenhao Ping, Jing Qin and Changqing Ji
Sensors 2026, 26(6), 1764; https://doi.org/10.3390/s26061764 - 11 Mar 2026
Abstract
Background and Objective: Accurate classification of intestinal polyps is crucial for preventing colorectal cancer but is hindered by visual similarity among subtypes and endoscopic variability. While deep learning aids in diagnosis, single-modal models face efficiency–accuracy trade-offs and ignore pathological semantics. We propose a multimodal framework that integrates endoscopic images with structured pathological descriptions to bridge this gap. Methods: We propose LPA-Tuning CLIP, which incorporates three key innovations: replacing CLIP’s instance-level contrastive loss with cross-modal projection matching (CMPM) with ID loss to explicitly optimize intraclass compactness and interclass separation through label-aware image-text similarity matrices; introducing structured clinical semantic templates that encode WHO diagnostic criteria into hierarchical text prompts for consistent pathology annotations; and developing medical-aware augmentation that preserves lesion features while reducing domain shifts. Results: The experimental results demonstrate that our proposed method achieves an accuracy of 85.8% and an F1 score of 0.862 on the internal test set, establishing a new state-of-the-art performance for intestinal polyp classification. Conclusions: This study proposes a multimodal polyp classification paradigm that achieves 85.8% accuracy on three-subtype classification via endoscopic image-pathology text joint representation learning, outperforming unimodal baselines by 8.7% and a multimodal baseline by 4.3%. Full article
(This article belongs to the Special Issue AI and Intelligent Sensors for Medical Imaging)
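
A CMPM-style objective of the kind the abstract references can be sketched as a KL divergence between image-to-text projection probabilities and a label-aware target distribution in which all texts sharing the image's class count as positives; the epsilon, random features, and batch size below are assumptions, and the paper additionally combines this with an ID (cross-entropy) loss.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(image_feats, text_feats, labels, eps=1e-8):
    """image_feats, text_feats: (B, D); labels: (B,) class ids."""
    text_norm = F.normalize(text_feats, dim=-1)
    proj = image_feats @ text_norm.t()                    # projection logits (B, B)
    pred = F.softmax(proj, dim=1)
    match = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()
    target = match / match.sum(dim=1, keepdim=True)       # label-aware matching targets
    return (pred * (torch.log(pred + eps) - torch.log(target + eps))).sum(dim=1).mean()

img = torch.randn(16, 256)
txt = torch.randn(16, 256)
labels = torch.randint(0, 3, (16,))                       # e.g., three polyp subtypes
loss = cmpm_loss(img, txt, labels)                        # add the ID loss in practice
```
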
25 pages, 11205 KB  
Article
Remote Sensing Image Captioning via Self-Supervised DINOv3 and Transformer Fusion
by Maryam Mehmood, Ahsan Shahzad, Farhan Hussain, Lismer Andres Caceres-Najarro and Muhammad Usman
Remote Sens. 2026, 18(6), 846; https://doi.org/10.3390/rs18060846 - 10 Mar 2026
Abstract
Effective interpretation of coherent and usable information from aerial images (e.g., satellite imagery or high-altitude drone photography) can greatly reduce human effort in many situations, both natural (e.g., earthquakes, forest fires, tsunamis) and man-made (e.g., highway pile-ups, traffic congestion), particularly in disaster management. This research proposes a novel encoder–decoder framework for captioning of remote sensing images that integrates self-supervised DINOv3 visual features with a hybrid Transformer–LSTM decoder. Unlike existing approaches that rely on supervised CNN-based encoders (e.g., ResNet, VGG), the proposed method leverages DINOv3’s self-supervised learning capabilities to extract dense, semantically rich features from aerial images without requiring domain-specific labeled pretraining. The proposed hybrid decoder combines Transformer layers for global context modeling with LSTM layers for sequential caption generation, producing coherent and context-aware descriptions. Feature extraction is performed using the DINOv3 model, which employs the gram-anchoring technique to stabilize dense feature maps. Captions are generated through a hybrid of Transformer with Long Short-Term Memory (LSTM) layers, which adds contextual meaning to captions through sequential hidden layer modeling with gated memory. The model is first evaluated on two traditional remote sensing image captioning datasets: RSICD and UCM-Captions. Multiple evaluation metrics like Bilingual Evaluation Understudy (BLEU), Consensus-based Image Description Evaluation (CIDEr), Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L), and Metric for Evaluation of Translation with Explicit Ordering (METEOR), are used to quantify the performance and robustness of the proposed DINOv3 hybrid model. The proposed model outperforms conventional Convolutional Neural Network (CNN) and Vision Transformers (ViT)-based models by approximately 9–12% across most evaluation metrics. Attention heatmaps are also employed to qualitatively validate the proposed model when identifying and describing key spatial elements. In addition, the proposed model is evaluated on advanced remote sensing datasets, including RSITMD, DisasterM3, and GeoChat. The results demonstrate that self-supervised vision transformers are robust encoders for multi-modal understanding in remote sensing image analysis and captioning. Full article
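
The hybrid decoding idea, Transformer layers for global context feeding an LSTM for sequential caption generation, is sketched below with random tensors standing in for DINOv3 patch features; the vocabulary size, dimensions, and pooling are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HybridCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # global context
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)     # sequential decoding
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, tokens):
        ctx = self.encoder(patch_feats).mean(dim=1)       # pooled image context
        h0 = ctx.unsqueeze(0)                             # initialize LSTM state from image
        c0 = torch.zeros_like(h0)
        dec, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(dec)                              # (B, T, vocab) caption logits

model = HybridCaptioner()
logits = model(torch.randn(2, 196, 256), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```
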
19 pages, 1345 KB  
Article
A Novel Dual-Modality Dual-View Hybrid Deep Learning–Machine Learning Framework for the Prediction of Carotid Plaque Vulnerability via Late Fusion
by Wenxuan Zhang, Chao Hou, Xinyi Wang, Hongyu Kang, Shuai Li, Yu Sun, Yongping Zheng, Wei Zhang and Sai-Kit Lam
Diagnostics 2026, 16(5), 807; https://doi.org/10.3390/diagnostics16050807 - 9 Mar 2026
Abstract
Background: Ultrasound imaging is an ideal tool for regular carotid plaque screening to identify individuals at high risk of stroke for clinical intervention. However, no existing study leverages multi-modal multi-view ultrasound imaging for AI-enabled auto-classification of carotid plaque vulnerability. This study aims to develop and validate an effective AI model for carotid plaque vulnerability classification through the applications of dual-modal (B-Mode and contrast-enhanced mode) dual-view (longitudinal and cross-sectional) settings to maximize the utility and potential of ultrasound imaging. Methods: Hybrid deep-learning (DL) and machine-learning (ML) methods were employed to balance between model discriminability and interpretability. B-Mode ultrasound (BMUS) and contrast-enhanced ultrasound (CEUS) images from 241 patients were retrospectively analyzed using the proposed hybrid-DL-ML variants. Results: Our findings suggest the hybrid VGG-RF model developed from a dual-modal dual-view setting outperforms those developed from other settings for identifying vulnerable carotid plaques. The VGG-RF model emerged as the best-performing model, achieving an optimal performance with an AUC of 0.908, precision of 0.765, recall of 0.929, specificity of 0.886, and F1 score of 0.839. The inherent interpretability of the VGG-RF model divulged that long-axis views of BMUS and CEUS images were the major contributing features for discriminating vulnerable carotid plaques against their counterparts. Conclusions: The present study underscored the effectiveness of AI models developed from dual-modal dual-view settings of ultrasound images. Notably, the hybrid VGG-RF model was benchmarked as the best-performing model among other studied hybrid DL-ML variants. Further studies on a larger cohort in a prospective setting are warranted to validate the findings of the current study. Full article
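
The late-fusion hybrid DL-ML recipe, deep features per modality and view concatenated into a Random Forest, is sketched below with random arrays standing in for VGG features of the four view/modality combinations; the per-view importance summary mirrors, but does not reproduce, the interpretability analysis mentioned in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

n_patients, feat_dim = 241, 128
views = {name: np.random.randn(n_patients, feat_dim)
         for name in ["bmus_long", "bmus_cross", "ceus_long", "ceus_cross"]}
X = np.concatenate(list(views.values()), axis=1)        # late fusion by concatenation
y = np.random.randint(0, 2, n_patients)                 # vulnerable vs. stable plaque

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
# Sum feature importances per view block to see which view contributes most.
per_view = np.add.reduceat(rf.feature_importances_, np.arange(0, X.shape[1], feat_dim))
print(dict(zip(views, per_view.round(3))))
```
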
17 pages, 1701 KB  
Article
CLIP-ArASL: A Lightweight Multimodal Model for Arabic Sign Language Recognition
by Naif Alasmari
Appl. Sci. 2026, 16(5), 2573; https://doi.org/10.3390/app16052573 - 7 Mar 2026
Abstract
Arabic sign language (ArASL) is the primary communication medium for Deaf and hard-of-hearing people across Arabic-speaking communities. Most current ArASL recognition systems are based solely on visual features and do not incorporate linguistic or semantic information that could improve generalization and semantic grounding. This paper introduces CLIP-ArASL, a lightweight CLIP-style multimodal approach for static ArASL letter recognition that aligns visual hand gestures with bilingual textual descriptions. The approach integrates an EfficientNet-B0 image encoder with a MiniLM text encoder to learn a shared embedding space using a hybrid objective that combines contrastive and cross-entropy losses. This design supports supervised classification on seen classes and zero-shot prediction on unseen classes using textual class representations. The proposed approach is evaluated on two public datasets, ArASL2018 and ArASL21L. Under supervised evaluation, recognition accuracies of 99.25±0.14% and 91.51±1.29% are achieved, respectively. Zero-shot performance is assessed by withholding 20% of gesture classes during training and predicting them using only their textual descriptions. In this setting, accuracies of 55.2±12.15% on ArASL2018 and 37.6±9.07% on ArASL21L are obtained. These results show that multimodal vision–language alignment supports semantic transfer and enables recognition of unseen classes. Full article
(This article belongs to the Special Issue Machine Learning in Computer Vision and Image Processing)
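
A hybrid objective of the kind described, CLIP-style symmetric contrastive alignment plus cross-entropy classification, can be written compactly; the temperature, loss weighting, and random embeddings below are assumptions rather than the paper's tuned values.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(img_emb, txt_emb, logits, labels, temperature=0.07, alpha=0.5):
    """Contrastive image-text alignment combined with supervised classification."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature
    targets = torch.arange(sim.size(0))                     # matched pairs on the diagonal
    contrastive = 0.5 * (F.cross_entropy(sim, targets) +
                         F.cross_entropy(sim.t(), targets))
    return alpha * contrastive + (1 - alpha) * F.cross_entropy(logits, labels)

B, D, n_classes = 32, 384, 28                # e.g., MiniLM embedding size, Arabic letters
loss = hybrid_loss(torch.randn(B, D), torch.randn(B, D),
                   torch.randn(B, n_classes), torch.randint(0, n_classes, (B,)))
```
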
30 pages, 6906 KB  
Article
A Method for Predicting Alfalfa Biomass Based on Multimodal Data and Ensemble Learning Model
by Yuehua Zhang, Zhaoming Wang, Zhendong Tian, Haotian Deng, Jungang Gao, Chen Chen, Wei Zhao, Xiaoping Ma, Xueqin Ding, Haoran Yan, Liu Yang, Hui Xie, Qing Li and Fengling Shi
Plants 2026, 15(5), 815; https://doi.org/10.3390/plants15050815 - 6 Mar 2026
Abstract
Accurate alfalfa biomass prediction is crucial for pasture management and sustainable livestock production. However, traditional methods often perform poorly under complex field conditions. To address the limited prediction accuracy of traditional methods under complex planting environments, this study proposes an alfalfa biomass prediction method combining multispectral and LiDAR data with an ensemble learning model. First, based on multispectral images acquired by an unmanned aerial vehicle (UAV) and airborne LiDAR data, spectral features, three-dimensional structural features, and their interaction features are systematically extracted at the quadrat scale, and a high-quality modeling dataset is constructed by feature selection. Second, an ensemble model for alfalfa biomass prediction was constructed, composed of random forest, extra trees, and histogram gradient boosting. After model training, the coefficient of determination (R2) of the integrated model on the test set reached 0.813, and the root mean square error (RMSE) and mean absolute error (MAE) were 0.178 kg m−2 and 0.146 kg m−2, significantly better than those of comparable single models. Among feature combinations, the fusion model outperformed models built on spectral indices only (R2 = 0.773) or LiDAR traits only (R2 = 0.576), and it achieved the highest accuracy from bud emergence to early flowering (R2 = 0.917). The overall prediction error was approximately normally distributed, and the absolute error was less than 0.2 for more than 65% of the samples, although a trend of underestimation remained in the high-biomass interval. This research shows that the multimodal data fusion and ensemble learning method can achieve high-precision prediction of alfalfa biomass, providing reliable technical support for pasture resource monitoring and precision agriculture management. Full article
(This article belongs to the Section Plant Modeling)
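
The described ensemble of random forest, extra trees, and histogram gradient boosting corresponds closely to scikit-learn's stock regressors; the sketch below averages them with a VotingRegressor over placeholder features, which is one reasonable way to realize the ensemble, not necessarily the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              HistGradientBoostingRegressor, VotingRegressor)

X = np.random.rand(300, 25)      # placeholder spectral indices, structural traits, interactions
y = np.random.rand(300)          # placeholder biomass per quadrat (kg m^-2)

ensemble = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("et", ExtraTreesRegressor(n_estimators=200, random_state=0)),
    ("hgb", HistGradientBoostingRegressor(random_state=0)),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))   # averaged per-quadrat biomass estimates
```
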
24 pages, 320 KB  
Review
Application of Eye Movement Analysis in Medicine: A Review Across Neurodevelopmental, Neurological, and Neurodegenerative Disorders
by Amnaduny Akhara Nurhasan and Paweł Kasprowski
Appl. Sci. 2026, 16(5), 2548; https://doi.org/10.3390/app16052548 - 6 Mar 2026
Abstract
Eye tracking has emerged as a valuable, non-invasive tool for identifying cognitive and motor abnormalities across a wide range of brain-related disorders. Recent studies have explored its utility in neurodevelopmental, neurological, and neurodegenerative conditions. This review synthesizes the findings of studies that apply eye movement analysis including fixation patterns, saccades, scanpaths, and pupil dynamics combined with machine learning (ML) and deep learning (DL) approaches for disease detection and classification. Particular attention is given to the design of eye-tracking tasks, feature extraction strategies, and algorithmic frameworks. Across clinical categories, models such as Support Vector Machines (SVM), random forests (RF), and Convolutional Neural Networks (CNN) have demonstrated promising diagnostic potential, with several studies reporting classification accuracies exceeding 80%, although performance varies depending on the task design, dataset characteristics, and validation methodology. These findings support the potential of eye movement-based biomarkers for early detection and clinical monitoring. Despite encouraging results, current research faces important limitations, including small sample sizes, a lack of standardization, and limited generalizability across populations. To advance clinical translation, future work should emphasize data augmentation, multimodal integration, external validation, and the use of explainable AI (XAI). Overall, eye movement analysis offers a scalable and objective pathway toward improving diagnostic precision in brain-related disorders. Full article
(This article belongs to the Special Issue Eye Tracking Technology and Its Applications)
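
Most of the reviewed pipelines start from basic oculomotor features; the sketch below computes a velocity-threshold (I-VT style) saccade/fixation summary on a synthetic gaze trace as an example of the kind of input such SVM/RF/CNN classifiers consume. The sampling rate, threshold, and simulated trace are assumptions.

```python
import numpy as np

def saccade_features(x, y, fs=250.0, velocity_threshold=30.0):
    """x, y: gaze angles in degrees; returns simple saccade/fixation statistics."""
    vel = np.hypot(np.diff(x), np.diff(y)) * fs           # point-to-point speed (deg/s)
    is_saccade = vel > velocity_threshold
    return {
        "saccade_ratio": float(is_saccade.mean()),
        "peak_velocity": float(vel.max()),
        "mean_fixation_velocity": float(vel[~is_saccade].mean()),
    }

gaze_x = np.cumsum(np.random.randn(1000)) * 0.05          # simulated gaze trace
gaze_y = np.cumsum(np.random.randn(1000)) * 0.05
features = saccade_features(gaze_x, gaze_y)               # feature vector for a classifier
print(features)
```
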
29 pages, 1672 KB  
Article
A Deep Multimodal Fusion Framework for Noncontact Temperature Detection in Ceramic Roller Kilns
by Kuiyang Cai, Shanchuan Tu and Shujuan Wang
Appl. Sci. 2026, 16(5), 2530; https://doi.org/10.3390/app16052530 - 6 Mar 2026
Abstract
Accurate temperature control in ceramic roller kilns is critical for ensuring product quality; however, it remains challenging due to nonlinear thermal dynamics and the spatial lag inherent in traditional contact-based sensors. To address the limitations of sparse wall-mounted thermocouples and optical interference in kiln images, this paper presents a multimodal spatiotemporal fusion network (MST-FusionNet) for noncontact temperature detection of ceramic bodies on roller tracks. The proposed network integrates in-furnace combustion image sequences with distributed thermocouple measurements. First, a physics-informed pseudo-heatmap generation strategy based on Gaussian distributions is introduced to align discrete thermocouple readings with visual features, enabling effective early-stage multimodal fusion. Second, a residual compensation mechanism uses thermocouple data as a stable reference to learn local temperature deviations from visual and temporal features. In addition, an attention-enhanced LSTM module is employed to model combustion dynamics and suppress unreliable frames caused by smoke and flame fluctuations. Experimental results on a real industrial dataset show that the proposed method achieves a mean absolute error of 0.9164 °C and a root mean squared error of 1.2422 °C, demonstrating better performance than single-modal methods and simple fusion baselines. The proposed framework exhibits stable spatial characteristics across different roller positions and helps bridge the spatial discrepancy between boundary measurements and the actual thermal state of ceramic products, providing an effective solution for temperature detection in roller kilns. Full article
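
The physics-informed pseudo-heatmap idea, spreading discrete thermocouple readings over the image plane as Gaussian bumps so they can be fused with visual feature maps at an early stage, is sketched below; the positions, sigma, and grid size are illustrative assumptions.

```python
import numpy as np

def pseudo_heatmap(readings, positions, shape=(64, 64), sigma=6.0):
    """readings: (N,) temperatures; positions: (N, 2) pixel coordinates (row, col)."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    heatmap = np.zeros(shape)
    for temp, (r, c) in zip(readings, positions):
        heatmap += temp * np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2 * sigma ** 2))
    return heatmap / heatmap.max()                         # normalized for fusion

readings = np.array([1180.0, 1210.0, 1195.0])              # wall thermocouples (deg C)
positions = np.array([[10, 12], [32, 40], [55, 20]])
heatmap = pseudo_heatmap(readings, positions)              # aligned with image features
```
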
17 pages, 306 KB  
Article
Multimodal AI Screening of Developmental Language Disorder in Tunisian Arabic Children: Clinical Markers and Computational Detection
by Faten Bouhajeb, Redha Touati and Selçuk Güven
Behav. Sci. 2026, 16(3), 375; https://doi.org/10.3390/bs16030375 - 6 Mar 2026
Abstract
Developmental Language Disorder (DLD) is a common neurodevelopmental condition that affects language acquisition in children. However, standardized diagnostic tools for Tunisian Arabic, a widely spoken yet underrepresented dialect, are still lacking. This study presents a multimodal biomedical informatics framework that integrates clinical assessments, speech recordings, and artificial intelligence (AI) for early DLD detection. Three linguistic tasks (the CLT Task, the Arabic Verb Evaluation Task, and the Nonword Repetition Task) were adapted for Tunisian Arabic, and spontaneous speech samples were collected from children with typical development and those with DLD. Statistical analyses revealed significant deficits in verb production, past-tense morphology, and phonological memory in the DLD group. For automated screening, we developed two systems: a Random Forest classifier based on structured clinical and linguistic features and a multimodal deep learning model using Wav2Vec2 acoustic embeddings. The best model achieved an F1 score of 0.85, demonstrating the feasibility of AI-assisted DLD screening. This work introduces the first standardized dataset and computational baseline for DLD in Tunisian Arabic, providing clinically relevant tools for early identification and supporting research on underrepresented Arabic dialects. It also highlights future implications, including potential applications in early screening, the integration of acoustic markers, and the development of culturally adapted assessment tools for underrepresented languages. Full article
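
The acoustic branch described, Wav2Vec2 embeddings feeding a classical classifier, can be sketched with the Hugging Face transformers API; the checkpoint name, mean pooling, and synthetic audio below are placeholders, and the paper's multimodal model is more involved than this minimal screener.

```python
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.ensemble import RandomForestClassifier

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform, sr=16000):
    """Mean-pooled Wav2Vec2 embedding for one utterance."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state       # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Toy data: 10 one-second recordings with binary DLD / typical-development labels.
X = np.stack([embed(np.random.randn(16000).astype(np.float32)) for _ in range(10)])
y = np.random.randint(0, 2, 10)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```
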