Search Results (619)

Search Parameters:
Keywords = vision language model

20 pages, 2671 KB  
Article
Semantic-Aligned Multimodal Vision–Language Framework for Autonomous Driving Decision-Making
by Feng Peng, Shangju She and Zejian Deng
Machines 2026, 14(1), 125; https://doi.org/10.3390/machines14010125 (registering DOI) - 21 Jan 2026
Abstract
Recent advances in Large Vision–Language Models (LVLMs) have demonstrated strong cross-modal reasoning capabilities, offering new opportunities for decision-making in autonomous driving. However, existing end-to-end approaches still suffer from limited semantic consistency, weak task controllability, and insufficient interpretability. To address these challenges, we propose SemAlign-E2E (Semantic-Aligned End-to-End), a semantic-aligned multimodal LVLM framework that unifies visual, LiDAR, and task-oriented textual inputs through cross-modal attention. This design enables end-to-end reasoning from scene understanding to high-level driving command generation. Beyond producing structured control instructions, the framework also provides natural-language explanations to enhance interpretability. We conduct extensive evaluations on the nuScenes dataset and CARLA simulation platform. Experimental results show that SemAlign-E2E achieves substantial improvements in driving stability, safety, multi-task generalization, and semantic comprehension, consistently outperforming state-of-the-art baselines. Notably, the framework exhibits superior behavioral consistency and risk-aware decision-making in complex traffic scenarios. These findings highlight the potential of LVLM-driven semantic reasoning for autonomous driving and provide a scalable pathway toward future semantic-enhanced end-to-end driving systems. Full article
(This article belongs to the Special Issue Control and Path Planning for Autonomous Vehicles)
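
The fusion step described above (visual, LiDAR, and task-oriented textual inputs unified through cross-modal attention) can be pictured with a short PyTorch sketch. This is an illustrative assumption of one such fusion layer, not the SemAlign-E2E architecture; all module names and dimensions are made up for the example.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal sketch: text (task-prompt) queries attend over concatenated camera + LiDAR tokens.

    Dimensions and module names are illustrative assumptions, not the paper's design.
    """
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cam_proj = nn.Linear(512, d_model)    # camera feature tokens -> shared width
        self.lidar_proj = nn.Linear(128, d_model)  # LiDAR tokens -> shared width
        self.txt_proj = nn.Linear(768, d_model)    # task-prompt text embeddings -> shared width
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, cam_tokens, lidar_tokens, text_tokens):
        scene = torch.cat([self.cam_proj(cam_tokens), self.lidar_proj(lidar_tokens)], dim=1)
        query = self.txt_proj(text_tokens)
        fused, _ = self.cross_attn(query, scene, scene)   # text-conditioned scene summary
        return self.norm(query + fused)                   # residual + norm, fed to the LVLM decoder

# toy shapes: batch of 2, 196 camera tokens, 400 LiDAR tokens, 16 prompt tokens
fused = CrossModalFusion()(torch.randn(2, 196, 512), torch.randn(2, 400, 128), torch.randn(2, 16, 768))
print(fused.shape)  # torch.Size([2, 16, 256])
```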

32 pages, 4599 KB  
Article
Adaptive Assistive Technologies for Learning Mexican Sign Language: Design of a Mobile Application with Computer Vision and Personalized Educational Interaction
by Carlos Hurtado-Sánchez, Ricardo Rosales Cisneros, José Ricardo Cárdenas-Valdez, Andrés Calvillo-Téllez and Everardo Inzunza-Gonzalez
Future Internet 2026, 18(1), 61; https://doi.org/10.3390/fi18010061 - 21 Jan 2026
Abstract
Integrating people with hearing disabilities into schools is one of the biggest problems that Latin American societies face. Mexican Sign Language (MSL) is the primary language of the deaf community in Mexico and a core part of its culture. However, its use in formal education is still limited by structural inequalities, a lack of qualified interpreters, and a lack of technology that can support personalized instruction. This study outlines the conceptualization and development of a mobile application designed as an adaptive assistive technology for learning MSL, combining computer vision techniques, deep learning algorithms, and personalized pedagogical interaction. The proposed system uses convolutional neural networks (CNNs) and pose-estimation models to recognize hand gestures in real time with 95.7% accuracy and gives the learner instant feedback. A dynamic learning engine automatically adjusts the difficulty level based on learner performance, supporting progressive acquisition of signs and phrases. Development followed the Scrum agile methodology, with educators, linguists, and members of the deaf community collaborating on the design. Early tests show favorable sign recognition accuracy and appropriate levels of user engagement and motivation. Beyond its technical contributions, this proposal aims to enhance inclusive digital ecosystems and foster linguistic equity in Mexican education through scalable, mobile, and culturally relevant technologies. Full article
(This article belongs to the Special Issue Machine Learning Techniques for Computer Vision—2nd Edition)
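
The dynamic learning engine described above adjusts difficulty from learner performance; the paper's actual rule is not given in the abstract. The sketch below shows one plausible controller, with the window size and thresholds as assumed values rather than the authors' design.

```python
from collections import deque

class AdaptiveDifficulty:
    """Illustrative difficulty controller: promote/demote on rolling recognition accuracy.

    Window size and thresholds are assumptions for illustration, not values from the paper.
    """
    def __init__(self, levels=("basic signs", "words", "phrases"), window=10):
        self.levels = levels
        self.idx = 0
        self.history = deque(maxlen=window)

    def record(self, sign_recognized: bool) -> str:
        self.history.append(sign_recognized)
        if len(self.history) == self.history.maxlen:
            acc = sum(self.history) / len(self.history)
            if acc >= 0.9 and self.idx < len(self.levels) - 1:
                self.idx += 1          # learner is doing well: harder content
                self.history.clear()
            elif acc <= 0.5 and self.idx > 0:
                self.idx -= 1          # learner is struggling: easier content
                self.history.clear()
        return self.levels[self.idx]

engine = AdaptiveDifficulty()
for attempt in [True] * 10:
    level = engine.record(attempt)
print(level)  # "words" after a streak of correct recognitions
```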

18 pages, 3705 KB  
Article
Cross-Platform Multi-Modal Transfer Learning Framework for Cyberbullying Detection
by Weiqi Zhang, Chengzu Dong, Aiting Yao, Asef Nazari and Anuroop Gaddam
Electronics 2026, 15(2), 442; https://doi.org/10.3390/electronics15020442 - 20 Jan 2026
Abstract
Cyberbullying and hate speech increasingly appear in multi-modal social media posts, where images and text are combined in diverse and fast-changing ways across platforms. These posts differ in style, vocabulary, and layout, and labeled data are sparse and noisy, which makes it difficult to train detectors that are both reliable and deployable under tight computational budgets. Many high-performing systems rely on large vision-language backbones, full-parameter fine-tuning, online retrieval, or model ensembles, which raises training and inference costs. We present a parameter-efficient cross-platform multi-modal transfer learning framework for cyberbullying and hateful content detection. Our framework has three components. First, we perform domain-adaptive pretraining of a compact ViLT backbone on in-domain image-text corpora. Second, we apply parameter-efficient fine-tuning that updates only bias terms, a small subset of LayerNorm parameters, and the classification head, leaving the inference computation graph unchanged. Third, we use noise-aware knowledge distillation from a stronger teacher built from pretrained text and CLIP-based image-text encoders, where only high-confidence, temperature-scaled predictions are used as soft labels during training, and teacher models and any retrieval components are used only offline. We evaluate primarily on Hateful Memes and use IMDB as an auxiliary text-only benchmark to show that the deployment-aware PEFT + offline-KD recipe can still be applied when other modalities are unavailable. On Hateful Memes, our student updates only 0.11% of the parameters and retains about 96% of the AUROC of full fine-tuning. Full article
(This article belongs to the Special Issue Data Privacy and Protection in IoT Systems)
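
The second component above (updating only bias terms, a subset of LayerNorm parameters, and the classification head) corresponds to a BitFit-style freezing pattern. A minimal PyTorch sketch follows, assuming a generic backbone with a `classifier` head; the parameter-name filters are assumptions, and for simplicity all LayerNorm parameters are left trainable rather than the paper's smaller subset.

```python
import torch.nn as nn

def apply_bitfit_style_peft(model: nn.Module, head_prefix: str = "classifier",
                            layernorm_keys=("layernorm", "layer_norm", ".norm")):
    """Freeze everything except biases, LayerNorm parameters, and the classification head.

    A sketch of the freezing pattern described in the abstract; `head_prefix` and the
    LayerNorm name filters are assumptions that depend on the concrete backbone, and the
    paper further restricts LayerNorm updates to a small subset.
    """
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        total += param.numel()
        is_bias = name.endswith(".bias")
        is_layernorm = any(key in name.lower() for key in layernorm_keys)
        is_head = name.startswith(head_prefix)
        param.requires_grad = is_bias or is_layernorm or is_head
        if param.requires_grad:
            trainable += param.numel()
    print(f"trainable params: {trainable}/{total} ({100 * trainable / total:.2f}%)")
    return model

# usage with any image-text classifier, e.g. a ViLT model loaded via Hugging Face transformers:
# model = apply_bitfit_style_peft(ViltForImagesAndTextClassification.from_pretrained(...))
```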

25 pages, 19621 KB  
Article
Scrap-SAM-CLIP: Assembling Foundation Models for Typical Shape Recognition in Scrap Classification and Rating
by Guangda Bao, Wenzhi Xia, Haichuan Wang, Zhiyou Liao, Ting Wu and Yun Zhou
Sensors 2026, 26(2), 656; https://doi.org/10.3390/s26020656 - 18 Jan 2026
Abstract
To address the limitation of 2D methods in inferring absolute scrap dimensions from images, we propose Scrap-SAM-CLIP (SSC), a vision-language model integrating the segment anything model (SAM) and contrastive language-image pre-training in Chinese (CN-CLIP). The model enables identification of canonical scrap shapes, establishing a foundational framework for subsequent 3D reconstruction and dimensional extraction within the 3D recognition pipeline. Individual modules of SSC are fine-tuned on the self-constructed scrap dataset. For segmentation, the combined box-and-point prompt yields optimal performance among various prompting strategies. MobileSAM and SAM-HQ-Tiny serve as effective lightweight alternatives for edge deployment. Fine-tuning the SAM decoder significantly enhances robustness under noisy prompts, improving accuracy by at least 5.55% with a five-positive-points prompt and up to 15.00% with a five-positive-points-and-five-negative-points prompt. In classification, SSC achieves 95.3% accuracy, outperforming Swin Transformer V2_base by 2.9%, with t-SNE visualizations confirming superior feature learning capability. The performance advantages of SSC stem from its modular assembly strategy, enabling component-specific optimization through subtask decoupling and enhancing system interpretability. This work refines the scrap 3D identification pipeline and demonstrates the efficacy of adapted foundation models in industrial vision systems. Full article
(This article belongs to the Section Intelligent Sensors)
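
The best-performing prompting strategy above combines a box with points. A minimal sketch using the public segment-anything API is shown below; the checkpoint path, image, and prompt coordinates are placeholders, and the downstream CN-CLIP shape-classification step is omitted.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint path, image, and prompt coordinates below are placeholders.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)    # stand-in for a scrap-pile RGB image
predictor.set_image(image)

box = np.array([100, 150, 600, 700])                 # xyxy box around one scrap piece
points = np.array([[350, 420]])                      # one positive point inside the box
labels = np.array([1])                               # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=labels,
    box=box,
    multimask_output=False,                          # single mask for the prompted object
)
print(masks.shape, scores)                           # (1, 1024, 1024) mask passed on to CLIP-based shape classification
```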

23 pages, 2725 KB  
Article
Text- and Face-Conditioned Multi-Anchor Conditional Embedding for Robust Periocular Recognition
by Po-Ling Fong, Tiong-Sik Ng and Andrew Beng Jin Teoh
Appl. Sci. 2026, 16(2), 942; https://doi.org/10.3390/app16020942 - 16 Jan 2026
Abstract
Periocular recognition is essential when full-face images cannot be used because of occlusion, privacy constraints, or sensor limitations, yet in many deployments, only periocular images are available at run time, while richer evidence, such as archival face photos and textual metadata, exists offline. This mismatch makes it hard to deploy conventional multimodal fusion. This motivates the notion of conditional biometrics, where auxiliary modalities are used only during training to learn stronger periocular representations while keeping deployment strictly periocular-only. In this paper, we propose Multi-Anchor Conditional Periocular Embedding (MACPE), which maps periocular, facial, and textual features into a shared anchor-conditioned space via a learnable anchor bank that preserves periocular micro-textures while aligning higher-level semantics. Training combines identity classification losses on periocular and face branches with a symmetric InfoNCE loss over anchors and a pulling regularizer that jointly aligns periocular, facial, and textual embeddings without collapsing into face-dominated solutions; captions generated by a vision language model provide complementary semantic supervision. At deployment, only the periocular encoder is used. Experiments across five periocular datasets show that MACPE consistently improves Rank-1 identification and reduces EER at a fixed FAR compared with periocular-only baselines and alternative conditioning methods. Ablation studies verify the contributions of anchor-conditioned embeddings, textual supervision, and the proposed loss design. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
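
The symmetric InfoNCE term over anchors can be sketched as a CLIP-style bidirectional contrastive loss computed on anchor-similarity representations. The anchor-bank size, temperature, and projection below are assumptions for illustration, not the MACPE specification.

```python
import torch
import torch.nn.functional as F

def anchor_conditioned(feats: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """Represent each sample by its cosine similarities to a learnable anchor bank."""
    return F.normalize(feats, dim=-1) @ F.normalize(anchors, dim=-1).t()

def symmetric_infonce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE: matched pairs (same identity index) are positives."""
    logits = F.normalize(z_a, dim=-1) @ F.normalize(z_b, dim=-1).t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# toy usage: batch of 8 periocular and face embeddings, 32 learnable anchors of width 512
anchors = torch.nn.Parameter(torch.randn(32, 512))
peri, face = torch.randn(8, 512), torch.randn(8, 512)
loss = symmetric_infonce(anchor_conditioned(peri, anchors), anchor_conditioned(face, anchors))
print(loss.item())
```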

21 pages, 16271 KB  
Article
Soybean Leaf Disease Recognition Methods Based on Hyperparameter Transfer and Progressive Fine-Tuning of Large Models
by Xiaoming Li, Wenxue Bian, Boyu Yang, Yongguang Li, Shiqi Wang, Ning Qin, Shanglong Ye, Zunyang Bao and Hongmin Sun
Agronomy 2026, 16(2), 218; https://doi.org/10.3390/agronomy16020218 - 16 Jan 2026
Abstract
Early recognition of crop diseases is essential for ensuring agricultural security and improving yield. However, traditional CNN-based methods often suffer from limited generalization when training data are scarce or when applied to transfer scenarios. To address these challenges, this study adopts the multimodal large model Qwen2.5-VL as the core and targets three major soybean leaf diseases along with healthy samples. We propose a parameter-efficient adaptation framework that integrates cross-architecture hyperparameter transfer and progressive fine-tuning. The framework utilizes a Vision Transformer (ViT) as an auxiliary model, where Bayesian optimization is applied to obtain optimal hyperparameters that are subsequently transferred to Qwen2.5-VL. Combined with existing low-rank adaptation (LoRA) and a multi-stage training strategy, the framework achieves efficient convergence and robust generalization with limited data. To systematically evaluate the model’s multi-scale visual adaptability, experiments were conducted using low-resolution, medium-resolution, and high-resolution inputs. The results demonstrate that Qwen2.5-VL achieves an average zero-shot accuracy of 71.72%. With the proposed cross-architecture hyperparameter transfer and parameter-efficient tuning strategy, accuracy improves to 88.72%, and further increases to 93.82% when progressive fine-tuning is applied. The model also maintains an accuracy of 91.0% under cross-resolution evaluation. Overall, the proposed method exhibits strong performance in recognition accuracy, feature discriminability, and multi-scale robustness, providing an effective reference for adapting multimodal large language models to plant disease identification tasks. Full article
(This article belongs to the Special Issue Digital Twins in Precision Agriculture)
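
The cross-architecture hyperparameter transfer can be sketched as a Bayesian (TPE) search on the cheap auxiliary ViT whose best values are then reused in a LoRA config for the large model. The search space, surrogate objective, and target modules below are assumptions, with Optuna and PEFT used as stand-in tooling rather than the paper's exact setup.

```python
import optuna
from peft import LoraConfig

def train_and_eval_vit(lr: float, lora_rank: int, lora_alpha: int) -> float:
    """Placeholder objective: in the paper's setting this would LoRA-fine-tune the
    auxiliary ViT on the soybean-leaf data and return validation accuracy."""
    return 1.0 - 1000.0 * abs(lr - 2e-4) + 0.001 * lora_rank   # synthetic surrogate for illustration

def objective(trial: optuna.Trial) -> float:
    # Search space is an assumption, not the paper's exact grid.
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    rank = trial.suggest_categorical("lora_rank", [4, 8, 16, 32])
    alpha = trial.suggest_categorical("lora_alpha", [8, 16, 32, 64])
    return train_and_eval_vit(lr, rank, alpha)

# Bayesian (TPE) search on the cheap auxiliary model ...
study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=30)

# ... then transfer the best values into a LoRA config for the large model (Qwen2.5-VL).
# target_modules below are illustrative, not taken from the paper.
best = study.best_params
qwen_lora = LoraConfig(r=best["lora_rank"], lora_alpha=best["lora_alpha"],
                       lora_dropout=0.05, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
print(best, qwen_lora.r)
```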

27 pages, 24824 KB  
Article
UGFF-VLM: Uncertainty-Guided and Frequency-Fused Vision-Language Model for Remote Sensing Farmland Segmentation
by Kai Tan, Yanlan Wu, Hui Yang and Xiaoshuang Ma
Remote Sens. 2026, 18(2), 282; https://doi.org/10.3390/rs18020282 - 15 Jan 2026
Abstract
Vision-language models can leverage natural language descriptions to encode stable farmland characteristics, providing a new paradigm for farmland extraction, yet existing methods face challenges in ambiguous text-visual alignment and loss of high-frequency boundary details during fusion. To address this, this article utilizes the semantic prior knowledge provided by textual descriptions in vision–language models to enhance the model’s ability to recognize polymorphic features, and proposes an Uncertainty-Guided and Frequency-Fused Vision-Language Model (UGFF-VLM) for remote sensing farmland extraction. The UGFF-VLM combines the semantic representation ability of vision-language models, further integrates an Uncertainty-Guided Adaptive Alignment (UGAA) module to dynamically adjust cross-modal fusion based on alignment confidence, and a Frequency-Enhanced Cross-Modal Fusion (FECF) mechanism to preserve high-frequency boundary details in the frequency domain. Experimental results on the FarmSeg-VL dataset demonstrate that the proposed method delivers excellent and stable performance, achieving the highest mIoU across diverse geographical environments while showing significant improvements in boundary precision and robustness against false positives. Therefore, the proposed UGFF-VLM not only mitigates the issues of recognition confusion and poor generalization in purely vision-based models caused by farmland feature polymorphism but also effectively enhances boundary segmentation accuracy, providing a reliable method for the precise delineation of agricultural parcels in diverse landscapes. Full article
(This article belongs to the Special Issue Advanced AI Technology for Remote Sensing Analysis)
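
The idea behind the Frequency-Enhanced Cross-Modal Fusion (preserving high-frequency boundary detail) can be illustrated with a simple FFT high-pass branch. The cutoff, the 0.5 weighting, and the way the residual is merged are assumptions for illustration, not the paper's FECF design.

```python
import torch

def highpass_residual(feat: torch.Tensor, cutoff_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the high-frequency content of a (B, C, H, W) feature map.

    A centered FFT mask zeroes the low-frequency core; cutoff_ratio is an assumed
    hyperparameter, not a value from the paper."""
    _, _, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat, norm="ortho"), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    mask = (dist > cutoff_ratio * min(H, W) / 2).float().to(feat.device)   # 1 = keep high frequencies
    high = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)), norm="ortho").real
    return high

# usage sketch: re-inject the high-frequency residual of the visual features after
# cross-modal fusion so boundary detail is not washed out.
visual = torch.randn(2, 64, 128, 128)          # stand-in for visual encoder features
fused = torch.randn(2, 64, 128, 128)           # stand-in for text-visual fused features
boundary_aware = fused + 0.5 * highpass_residual(visual)
print(boundary_aware.shape)                    # torch.Size([2, 64, 128, 128])
```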

14 pages, 1872 KB  
Article
An AI-Driven Trainee Performance Evaluation in XR-Based CPR Training System for Enhancing Personalized Proficiency
by Junhyung Kwon and Won-Tae Kim
Electronics 2026, 15(2), 376; https://doi.org/10.3390/electronics15020376 - 15 Jan 2026
Abstract
Cardiac arrest is a life-threatening emergency requiring immediate intervention, with bystander-initiated cardiopulmonary resuscitation (CPR) being critical for survival, especially in out-of-hospital situations where medical help is often delayed. Given that over 70% of out-of-hospital cases occur in private residences, there is a growing imperative to provide widespread CPR training to the public. However, conventional instructor-led CPR training faces inherent limitations regarding spatiotemporal constraints and the lack of personalized feedback. To address these issues, this paper proposes an AI-integrated XR-based CPR training system designed as an advanced auxiliary tool for skill acquisition. The system integrates vision-based pose estimation with multimodal sensor data to assess the trainee’s posture and compression metrics in accordance with Korean regional CPR guidelines. Moreover, it utilizes a Large Language Model to evaluate verbal protocols, such as requesting an emergency call, in line with the guidelines. Experimental validation of the proof-of-concept reveals a verbal evaluation accuracy of 88% and a speech recognition accuracy of approximately 95%. Furthermore, the optimized concurrent architecture provides a real-time response latency under 0.5 s, and the automated marker-based tracking ensures precise spatial registration without manual calibration. These results confirm the technical feasibility of the system as a complementary solution for basic life support education. Full article
(This article belongs to the Special Issue Virtual Reality Applications in Enhancing Human Lives)

20 pages, 467 KB  
Systematic Review
Vision-Language Models in Teaching and Learning: A Systematic Literature Review
by Jing Tian
Educ. Sci. 2026, 16(1), 123; https://doi.org/10.3390/educsci16010123 - 14 Jan 2026
Abstract
Vision-language models (VLMs) integrate visual and textual information and are increasingly being used as innovative tools in educational applications. However, there is a lack of evidence regarding current practices for integrating VLMs into teaching and learning. To address this research gap and identify the opportunities and challenges associated with the integration of VLMs in education, this paper presents a systematic review of VLM use in formal educational contexts. Peer-reviewed articles published between 2020 and 2025 were retrieved from five major databases: ACM Digital Library, Scopus, Web of Science, Engineering Village, and IEEE Xplore. Following the PRISMA-guided framework, 42 articles were selected for inclusion. Data were extracted and analyzed against six research questions: (1) where VLMs are applied across academic disciplines and educational levels; (2) what types of VLM solutions are deployed and which image–text modalities they infer and generate; (3) the pedagogical roles of VLMs within teaching workflows; (4) reported outcomes and benefits for learners and instructors; (5) challenges and risks identified in practice, together with corresponding mitigation strategies; and (6) reported evaluation methods. The included studies span K-12 through higher education and cover diverse disciplines, with deployments dominated by pre-trained models and a smaller number of domain-adapted approaches. VLM-supported pedagogical functions cluster into five roles: analyst, assessor, content curator, simulator, and tutor. This review concludes by discussing implications for VLM adoption in educational settings and offering recommendations for future research. Full article

33 pages, 4885 KB  
Article
Two-Stage Fine-Tuning of Large Vision-Language Models with Hierarchical Prompting for Few-Shot Object Detection in Remote Sensing Images
by Yongqi Shi, Ruopeng Yang, Changsheng Yin, Yiwei Lu, Bo Huang, Yu Tao and Yihao Zhong
Remote Sens. 2026, 18(2), 266; https://doi.org/10.3390/rs18020266 - 14 Jan 2026
Abstract
Few-shot object detection (FSOD) in high-resolution remote sensing (RS) imagery remains challenging due to scarce annotations, large intra-class variability, and high visual similarity between categories, which together limit the generalization ability of convolutional neural network (CNN)-based detectors. To address this issue, we explore leveraging large vision-language models (LVLMs) for FSOD in RS. We propose a two-stage, parameter-efficient fine-tuning framework with hierarchical prompting that adapts Qwen3-VL for object detection. In the first stage, low-rank adaptation (LoRA) modules are inserted into the vision and text encoders and trained jointly with a Detection Transformer (DETR)-style detection head on fully annotated base classes under three-level hierarchical prompts. In the second stage, the vision LoRA parameters are frozen, the text encoder is updated using K-shot novel-class samples, and the detection head is partially frozen, with selected components refined using the same three-level hierarchical prompting scheme. To preserve base-class performance and reduce class confusion, we further introduce knowledge distillation and semantic consistency losses. Experiments on the DIOR and NWPU VHR-10.v2 datasets show that the proposed method consistently improves novel-class performance while maintaining competitive base-class accuracy and surpasses existing baselines, demonstrating the effectiveness of integrating hierarchical semantic reasoning into LVLM-based FSOD for RS imagery. Full article
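
The second-stage freezing described above (vision LoRA frozen, text encoder updated, detection head partially frozen) follows a simple requires_grad pattern. A minimal sketch is given below; all parameter-name prefixes are assumptions that depend on the concrete LVLM and detector, not names from the paper.

```python
import torch.nn as nn

def configure_stage_two(model: nn.Module,
                        text_lora_prefix: str = "language_model",   # assumed text-encoder module name
                        head_prefix: str = "detection_head",        # assumed DETR-style head module name
                        head_trainable_keys=("class_embed",)):      # assumed refined head components
    """Stage-two freezing pattern from the abstract: vision LoRA frozen, text-side LoRA
    trainable, and only selected detection-head components left unfrozen.
    All name prefixes are assumptions that depend on the concrete model."""
    for name, param in model.named_parameters():
        if "lora_" in name:                                          # PEFT names LoRA weights lora_A / lora_B
            param.requires_grad = name.startswith(text_lora_prefix)  # keep only text-encoder LoRA trainable
        elif name.startswith(head_prefix):
            param.requires_grad = any(key in name for key in head_trainable_keys)
        else:
            param.requires_grad = False                              # frozen backbone / vision weights
    return model

# usage: configure_stage_two(peft_wrapped_lvlm_with_detr_head), then fine-tune on K-shot novel classes
```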

26 pages, 1167 KB  
Review
A Review of Multimodal Sentiment Analysis in Online Public Opinion Monitoring
by Shuxian Liu and Tianyi Li
Informatics 2026, 13(1), 10; https://doi.org/10.3390/informatics13010010 - 14 Jan 2026
Abstract
With the rapid development of the Internet, online public opinion monitoring has emerged as a crucial task in the information era. Multimodal sentiment analysis, through the integration of multiple modalities such as text, images, and audio, combined with technologies including natural language processing and computer vision, offers novel technical means for online public opinion monitoring. Nevertheless, current research still faces many challenges, such as the scarcity of high-quality datasets, limited model generalization ability, and difficulties with cross-modal feature fusion. This paper reviews the current research progress of multimodal sentiment analysis in online public opinion monitoring, including its development history, key technologies, and application scenarios. Existing problems are analyzed and future research directions are discussed. In particular, we emphasize a fusion-architecture-centric comparison under online public opinion monitoring, and discuss cross-lingual differences that affect multimodal alignment and evaluation. Full article

31 pages, 3792 KB  
Article
EdgeV-SE: Self-Reflective Fine-Tuning Framework for Edge-Deployable Vision-Language Models
by Yoonmo Jeon, Seunghun Lee and Woongsup Kim
Appl. Sci. 2026, 16(2), 818; https://doi.org/10.3390/app16020818 - 13 Jan 2026
Abstract
The deployment of Vision-Language Models (VLMs) in Satellite IoT scenarios is critical for real-time disaster assessment but is often hindered by the substantial memory and compute requirements of state-of-the-art models. While parameter-efficient fine-tuning (PEFT) enables adaptation with minimal computational overhead, standard supervised methods often fail to ensure robustness and reliability on resource-constrained edge devices. To address this, we propose EdgeV-SE, a self-reflective fine-tuning framework that significantly enhances the performance of VLMs without introducing any inference-time overhead. Our framework incorporates an uncertainty-aware self-reflection mechanism with asymmetric dual pathways: a generative linguistic pathway and an auxiliary discriminative visual pathway. By estimating uncertainty from the linguistic pathway using a log-likelihood margin between class verbalizers, EdgeV-SE identifies ambiguous samples and refines its decision boundaries via consistency regularization and cross-pathway mutual learning. Experimental results on hurricane damage assessment demonstrate that our approach improves image classification accuracy, enhances image–text semantic alignment, and achieves superior caption quality. Notably, our work achieves these gains while maintaining practical deployment on a commercial off-the-shelf edge device such as the NVIDIA Jetson Orin Nano, preserving the inference latency and memory footprint. Overall, our work contributes a unified self-reflective fine-tuning framework that improves robustness, calibration, and deployability of VLMs on edge devices. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
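
The uncertainty estimate above is a log-likelihood margin between class verbalizers. A minimal sketch follows, assuming per-class sequence log-likelihoods scored by the generative pathway; the margin threshold and the toy scores are assumptions.

```python
import torch
import torch.nn.functional as F

def sequence_loglikelihood(logits: torch.Tensor, verbalizer_ids: torch.Tensor) -> torch.Tensor:
    """Sum of token log-probs of a class verbalizer under the generative pathway.

    logits: (T, V) next-token logits aligned with the verbalizer positions;
    verbalizer_ids: (T,) token ids of the class phrase (e.g. "severe damage")."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, verbalizer_ids.unsqueeze(-1)).sum()

def loglikelihood_margin(class_scores: torch.Tensor, threshold: float = 1.0):
    """Margin between the best and second-best class log-likelihoods per sample.

    Samples whose margin falls below `threshold` (an assumed value) are treated as
    ambiguous and would be routed to the consistency / mutual-learning regularizers."""
    top2 = class_scores.topk(2, dim=-1).values           # (B, 2)
    margin = top2[:, 0] - top2[:, 1]
    return margin, margin < threshold

# toy usage: batch of 4 samples scored against 3 verbalized classes
scores = torch.tensor([[-3.1, -7.4, -8.0],
                       [-4.0, -4.2, -9.1],
                       [-2.5, -6.0, -6.1],
                       [-5.0, -5.1, -5.2]])
margin, ambiguous = loglikelihood_margin(scores)
print(margin, ambiguous)   # small margins (rows 2 and 4) flag ambiguous samples
```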

34 pages, 12645 KB  
Article
Multimodal Intelligent Perception at an Intersection: Pedestrian and Vehicle Flow Dynamics Using a Pipeline-Based Traffic Analysis System
by Bao Rong Chang, Hsiu-Fen Tsai and Chen-Chia Chen
Electronics 2026, 15(2), 353; https://doi.org/10.3390/electronics15020353 - 13 Jan 2026
Abstract
Traditional automated monitoring systems adopted for Intersection Traffic Control still face challenges, including high costs, maintenance difficulties, insufficient coverage, poor multimodal data integration, and limited traffic information analysis. To address these issues, this study proposes a sovereign AI-driven Smart Transportation governance approach, developing a mobile AI solution equipped with multimodal perception, task decomposition, memory, reasoning, and multi-agent collaboration capabilities. The proposed system integrates computer vision, multi-object tracking, natural language processing, Retrieval-Augmented Generation (RAG), and Large Language Models (LLMs) to construct a Pipeline-based Traffic Analysis System (PTAS). The PTAS can produce real-time statistics on pedestrian and vehicle flows at intersections, incorporating potential risk factors such as traffic accidents, construction activities, and weather conditions for multimodal data fusion analysis, thereby providing forward-looking traffic insights. Experimental results demonstrate that the enhanced DuCRG-YOLOv11n pre-trained model, equipped with the proposed βsilu activation function, can accurately identify various vehicle types in object detection, achieving a frame rate of 68.25 FPS and a precision of 91.4%. Combined with ByteTrack, it can track over 90% of vehicles in medium- to low-density traffic scenarios, achieving a MOTA of 0.719 and a MOTP of 0.08735. In traffic flow analysis, the Vertex AI RAG combined with the Claude Sonnet 4 LLM provides a more comprehensive view, precisely interpreting the causes of peak-hour congestion and effectively compensating for missing data through contextual explanations. The proposed method can enhance the efficiency of urban traffic regulation and optimize decision support in intelligent transportation systems. Full article
(This article belongs to the Special Issue Interactive Design for Autonomous Driving Vehicles)
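
The abstract names a new activation function, βsilu, but does not define it. Purely as an illustrative stand-in, a beta-parameterized SiLU with a learnable β is sketched below; this parameterization is an assumption, not the authors' formulation.

```python
import torch
import torch.nn as nn

class BetaSiLU(nn.Module):
    """Illustrative beta-scaled SiLU: x * sigmoid(beta * x) with a learnable beta.

    The abstract names the activation but does not define it; this form is an
    assumption for illustration, not the paper's βsilu."""
    def __init__(self, beta: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.beta * x)   # reduces to the standard SiLU when beta == 1

print(BetaSiLU()(torch.linspace(-3, 3, 5)))
```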

27 pages, 80350 KB  
Article
Pose-Based Static Sign Language Recognition with Deep Learning for Turkish, Arabic, and American Sign Languages
by Rıdvan Yayla, Hakan Üçgün and Mahmud Abbas
Sensors 2026, 26(2), 524; https://doi.org/10.3390/s26020524 - 13 Jan 2026
Abstract
Advancements in artificial intelligence have significantly enhanced communication for individuals with hearing impairments. This study presents a robust cross-lingual Sign Language Recognition (SLR) framework for Turkish, American English, and Arabic sign languages. The system utilizes the lightweight MediaPipe library for efficient hand landmark extraction, ensuring stable and consistent feature representation across diverse linguistic contexts. Datasets were meticulously constructed from nine public-domain sources (four Arabic, three American, and two Turkish). The final training data comprises curated image datasets, with frames for each language carefully selected from varying angles and distances to ensure high diversity. A comprehensive comparative evaluation was conducted across three state-of-the-art deep learning architectures—ConvNeXt (CNN-based), Swin Transformer (ViT-based), and Vision Mamba (SSM-based)—all applied to identical feature sets. The evaluation demonstrates the superior performance of contemporary vision Transformers and state space models in capturing subtle spatial cues across diverse sign languages. Our approach provides a comparative analysis of model generalization capabilities across three distinct sign languages, offering valuable insights for model selection in pose-based SLR systems. Full article
(This article belongs to the Special Issue Sensor Systems for Gesture Recognition (3rd Edition))
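
The MediaPipe landmark-extraction step can be sketched as follows; the static-image settings and the flattened 63-dimensional representation are assumptions about how the pose features are packaged before the ConvNeXt / Swin Transformer / Vision Mamba comparisons, not the paper's exact pipeline.

```python
from typing import Optional

import cv2
import mediapipe as mp
import numpy as np

def hand_landmark_features(image_bgr: np.ndarray) -> Optional[np.ndarray]:
    """Extract 21 (x, y, z) hand landmarks with MediaPipe and flatten them into a
    63-dim feature vector; returns None if no hand is detected.
    Static-image settings and the flattening choice are assumptions."""
    with mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    landmarks = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in landmarks], dtype=np.float32).flatten()

# usage: features = hand_landmark_features(cv2.imread("sign_frame.jpg"))
```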

27 pages, 16442 KB  
Article
Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning
by Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen and Xue Yang
Remote Sens. 2026, 18(2), 222; https://doi.org/10.3390/rs18020222 - 9 Jan 2026
Abstract
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision-language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we establish the data curation procedure, covering data acquisition, offline processing and integration, and online loading and weighting. This data procedure effectively addresses complex RS data environments and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. These strategies are flexible and effectively mitigate the computational burden. Additionally, we significantly enhance the model’s object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models. Full article
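
The unified dynamic-resolution strategy can be pictured as choosing an image size whose visual-token count stays within a budget. The sketch below is an assumption-level illustration (patch size 14, 1024-token budget); it does not reproduce RSCoVLM's actual policy or the Zoom-in Chain mechanism.

```python
import math

def dynamic_resize(height: int, width: int, patch: int = 14, max_tokens: int = 1024):
    """Choose an output size whose (H/patch) * (W/patch) visual-token count fits a budget,
    preserving aspect ratio and snapping down to patch multiples.
    Patch size and token budget are assumed values, not the paper's."""
    tokens = (height / patch) * (width / patch)
    scale = min(1.0, math.sqrt(max_tokens / tokens))      # only downscale oversized inputs
    new_h = max(patch, int(height * scale) // patch * patch)
    new_w = max(patch, int(width * scale) // patch * patch)
    return new_h, new_w

print(dynamic_resize(640, 480))      # moderate image -> (504, 378), 972 tokens
print(dynamic_resize(8192, 8192))    # UHR image -> (448, 448), 1024 tokens, within budget
```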