Search Results (58)

Search Parameters:
Keywords = vision language models (VLMs)

20 pages, 7466 KB  
Article
Feasibility Study of CLIP-Based Key Slice Selection in CT Images and Performance Enhancement via Lesion- and Organ-Aware Fine-Tuning
by Kohei Yamamoto and Tomohiro Kikuchi
Bioengineering 2025, 12(10), 1093; https://doi.org/10.3390/bioengineering12101093 - 10 Oct 2025
Abstract
Large-scale medical visual question answering (MedVQA) datasets are critical for training and deploying vision–language models (VLMs) in radiology. Ideally, such datasets should be automatically constructed from routine radiology reports and their corresponding images. However, no existing method directly links free-text findings to the most relevant 2D slices in volumetric computed tomography (CT) scans. To address this gap, a contrastive language–image pre-training (CLIP)-based key slice selection framework is proposed, which matches each sentence to its most informative CT slice via text–image similarity. This experiment demonstrates that models pre-trained in the medical domain already achieve competitive slice retrieval accuracy and that fine-tuning them on a small dual-supervised dataset that imparts both lesion- and organ-level awareness yields further gains. In particular, the best-performing model (fine-tuned BiomedCLIP) achieved a Top-1 accuracy of 51.7% for lesion-aware slice retrieval, representing a 20-point improvement over baseline CLIP, and was accepted by radiologists in 56.3% of cases. By automating the report-to-slice alignment, the proposed method facilitates scalable, clinically realistic construction of MedVQA resources. Full article
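The retrieval step described in this abstract reduces to scoring every CT slice against a report sentence in a shared text–image embedding space and taking the best match. Below is a minimal Python sketch of that loop, assuming a generic Hugging Face CLIP checkpoint rather than the fine-tuned BiomedCLIP used in the paper; the function name and preprocessing are illustrative.

```python
# Minimal sketch of text-to-slice retrieval with a CLIP-style model.
# Checkpoint name and preprocessing are illustrative, not the authors' exact setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def top1_slice(finding: str, slice_images: list[Image.Image]) -> int:
    """Return the index of the CT slice most similar to one report sentence."""
    inputs = processor(text=[finding], images=slice_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text: (1, num_slices) similarity scores between the sentence and each slice
    return int(out.logits_per_text.argmax(dim=-1).item())
```

The paper's lesion- and organ-aware fine-tuning would change the encoder weights but not this retrieval loop.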

29 pages, 7711 KB  
Article
Fundamentals of Controlled Demolition in Structures: Real-Life Applications, Discrete Element Methods, Monitoring, and Artificial Intelligence-Based Research Directions
by Julide Yuzbasi
Buildings 2025, 15(19), 3501; https://doi.org/10.3390/buildings15193501 - 28 Sep 2025
Viewed by 424
Abstract
Controlled demolition is a critical engineering practice that enables the safe and efficient dismantling of structures while minimizing risks to the surrounding environment. This study presents, for the first time, a detailed, structured framework for understanding the fundamental principles of controlled demolition by outlining key procedures, methodologies, and directions for future research. Through original, carefully designed charts and full-scale numerical simulations, including two 23-story building scenarios with different delay and blasting sequences, this paper provides real-life insights into the effects of floor-to-floor versus axis-by-axis delays on structural collapse behavior, debris spread, and toppling control. Beyond traditional techniques, this study explores how emerging technologies, such as real-time structural monitoring via object tracking, LiDAR scanning, and Unmanned Aerial Vehicle (UAV)-based inspections, can be further advanced through the integration of artificial intelligence (AI). The potential Deep learning (DL) and Machine learning (ML)-based applications of tools like Convolutional Neural Network (CNN)-based digital twins, YOLO object detection, and XGBoost classifiers are highlighted as promising avenues for future research. These technologies could support real-time decision-making, automation, and risk assessment in demolition scenarios. Furthermore, vision-language models such as SAM and Grounding DINO are discussed as enabling technologies for real-time risk assessment, anomaly detection, and adaptive control. By sharing insights from full-scale observations and proposing a forward-looking analytical framework, this work lays a foundation for intelligent and resilient demolition practices. Full article
(This article belongs to the Section Building Structures)

22 pages, 1269 KB  
Article
LightFakeDetect: A Lightweight Model for Deepfake Detection in Videos That Focuses on Facial Regions
by Sarab AlMuhaideb, Hessa Alshaya, Layan Almutairi, Danah Alomran and Sarah Turki Alhamed
Mathematics 2025, 13(19), 3088; https://doi.org/10.3390/math13193088 - 25 Sep 2025
Viewed by 700
Abstract
In recent years, the proliferation of forged videos, known as deepfakes, has escalated significantly, primarily due to advancements in technologies such as Generative Adversarial Networks (GANs), diffusion models, and Vision Language Models (VLMs). These deepfakes present substantial risks, threatening political stability, facilitating celebrity impersonation, and enabling tampering with evidence. As the sophistication of deepfake technology increases, detecting these manipulated videos becomes increasingly challenging. Most of the existing deepfake detection methods use Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or Vision Transformers (ViTs), achieving strong accuracy but exhibiting high computational demands. This highlights the need for a lightweight yet effective pipeline for real-time and resource-limited scenarios. This study introduces a lightweight deep learning model for deepfake detection in order to address this emerging threat. The model incorporates three integral components: MobileNet for feature extraction, a Convolutional Block Attention Module (CBAM) for feature enhancement, and a Gated Recurrent Unit (GRU) for temporal analysis. Additionally, a pre-trained Multi-Task Cascaded Convolutional Network (MTCNN) is utilized for face detection and cropping. The model is evaluated using the Deepfake Detection Challenge (DFDC) and Celeb-DF v2 datasets, demonstrating impressive performance, with 98.2% accuracy and a 99.0% F1-score on Celeb-DF v2 and 95.0% accuracy and a 97.2% F1-score on DFDC, achieving a commendable balance between simplicity and effectiveness. Full article
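For a concrete picture of the three-component pipeline described above (MobileNet features, CBAM attention, GRU temporal analysis), here is a compact PyTorch sketch with assumed layer sizes and a simplified CBAM; it is not the authors' exact configuration, and the MTCNN face-cropping step is assumed to have been applied to the input clips already.

```python
# Minimal sketch of a MobileNet + CBAM + GRU video classifier (PyTorch).
# Layer sizes and the compact CBAM are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

class ChannelAttention(nn.Module):              # channel half of CBAM
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
    def forward(self, x):                       # x: (B, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return x * torch.sigmoid(avg + mx)[:, :, None, None]

class SpatialAttention(nn.Module):              # spatial half of CBAM
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class LightweightDeepfakeSketch(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.backbone = mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).features
        self.cbam = nn.Sequential(ChannelAttention(1280), SpatialAttention())
        self.gru = nn.GRU(1280, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)        # single real/fake logit
    def forward(self, clips):                   # clips: (B, T, 3, 224, 224) cropped faces
        b, t = clips.shape[:2]
        f = self.backbone(clips.flatten(0, 1))  # (B*T, 1280, h, w)
        f = self.cbam(f).mean(dim=(2, 3)).view(b, t, -1)   # per-frame feature vectors
        _, h = self.gru(f)                      # temporal aggregation over the clip
        return self.head(h[-1])                 # one logit per video
```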

7 pages, 1431 KB  
Proceeding Paper
Application of Vision Language Models in the Shoe Industry
by Hsin-Ming Tseng and Hsueh-Ting Chu
Eng. Proc. 2025, 108(1), 50; https://doi.org/10.3390/engproc2025108050 - 24 Sep 2025
Viewed by 297
Abstract
The confluence of computer vision and natural language processing has yielded powerful vision language models (VLMs) capable of multimodal understanding. We applied state-of-the-art VLMs to quality monitoring in the shoe assembly industry. By leveraging the ability of VLMs to jointly process visual and textual data, we developed a system for automated defect detection and contextualized feedback generation to enhance the efficiency and consistency of quality assurance processes. We empirically evaluated the effectiveness of the developed VLM system in identifying standard assembly procedures, using video data from a shoe assembly line. The experimental results validated the potential of the VLM system for assessing footwear assembly quality, highlighting the feasibility of future practical deployment in industrial quality control scenarios. Full article

26 pages, 1080 KB  
Systematic Review
Digital Twin and Computer Vision Combination for Manufacturing and Operations: A Systematic Literature Review
by Haji Ahmed Faqeer and Siavash H. Khajavi
Appl. Sci. 2025, 15(18), 10157; https://doi.org/10.3390/app151810157 - 17 Sep 2025
Cited by 1 | Viewed by 680
Abstract
This paper examines the transformative role of the Digital Twin–Computer Vision combination (DT-CV combo) in industrial operations, focusing on its applications, challenges, and future directions. It aims to synthesize the existing literature and explore practical use cases in operations management (OM). A comprehensive systematic literature review is conducted using PRISMA guidelines to analyze the DT-CV combo across the classification of industrial OM. However, given the breadth and importance of manufacturing and the OM field, the study excludes the literature on the DT-CV combo applied to other domains such as healthcare, smart buildings and cities, and transportation. We found that the DT-CV combo in OM is a relatively young but growing field of research. To date, only 29 articles have examined DT-CV combo solutions from various OM perspectives. Case studies are rare, with most studies relying on experimentation and laboratory testing to investigate DT-CV applications in the OM context. According to the cases and methods reviewed in the literature, the DT-CV combo has applications in different OM areas such as design, prototyping, simulation, real-time production monitoring, defect detection, process optimization, hazard detection and mitigation, safety training, emergency response simulation, optimal resource allocation, condition monitoring, inventory management, and maintenance scheduling. We also identified several benefits of DT-CV combo solutions in OM, including reducing human error, ensuring compliance with quality standards, lowering maintenance costs, mitigating production downtime, eliminating operational bottlenecks, and decreasing workplace accidents, while simultaneously improving the effectiveness of training. In this paper, we classify current applications of the DT-CV combo in OM, highlight gaps in the existing literature, and propose research questions to guide future studies in this domain. By considering the rapid pace of AI technology development together with the current state-of-the-art applications of the DT-CV combo in OM, we suggest novel concepts and future directions. One such direction is the digital twin–vision language model (DT-VLM) combo, emphasized for its potential to bridge physical–digital interfaces in industrial workflows. Full article
(This article belongs to the Special Issue Digital Twins in the Industry 4.0)

16 pages, 881 KB  
Article
Text-Guided Spatio-Temporal 2D and 3D Data Fusion for Multi-Object Tracking with RegionCLIP
by Youlin Liu, Zainal Rasyid Mahayuddin and Mohammad Faidzul Nasrudin
Appl. Sci. 2025, 15(18), 10112; https://doi.org/10.3390/app151810112 - 16 Sep 2025
Viewed by 614
Abstract
3D Multi-Object Tracking (3D MOT) is a critical task in autonomous systems, where accurate and robust tracking of multiple objects in dynamic environments is essential. Traditional approaches primarily rely on visual or geometric features, often neglecting the rich semantic information available in textual modalities. In this paper, we propose Text-Guided 3D Multi-Object Tracking (TG3MOT), a novel framework that incorporates Vision-Language Models (VLMs) into the YONTD architecture to improve 3D MOT performance. Our framework leverages RegionCLIP, a multimodal open-vocabulary detector, to achieve fine-grained alignment between image regions and textual concepts, enabling the incorporation of semantic information into the tracking process. To address challenges such as occlusion, blurring, and ambiguous object appearances, we introduce the Target Semantic Matching Module (TSM), which quantifies the uncertainty of semantic alignment and filters out unreliable regions. Additionally, we propose the 3D Feature Exponential Moving Average Module (3D F-EMA) to incorporate temporal information, improving robustness in noisy or occluded scenarios. Furthermore, the Gaussian Confidence Fusion Module (GCF) is introduced to weight historical trajectory confidences based on temporal proximity, enhancing the accuracy of trajectory management. We evaluate our framework on the KITTI dataset and compare it with the YONTD baseline. Extensive experiments demonstrate that although the overall HOTA gain of TG3MOT is modest (+0.64%), our method achieves substantial improvements in association accuracy (+0.83%) and significantly reduces ID switches (−16.7%). These improvements are particularly valuable in real-world autonomous driving scenarios, where maintaining consistent trajectories under occlusion and ambiguous appearances is crucial for downstream tasks such as trajectory prediction and motion planning. The code will be made publicly available. Full article
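Two of the proposed modules boil down to simple temporal smoothing rules: an exponential moving average over a track's 3D features (3D F-EMA) and a Gaussian weighting of past confidences by how recent they are (GCF). A short sketch with assumed decay and bandwidth parameters:

```python
# Minimal sketch of the temporal-smoothing ideas described above: an exponential moving
# average over per-track 3D features and a Gaussian time-weighted fusion of historical
# confidences. The decay factor and sigma are illustrative assumptions.
import numpy as np

def feature_ema(prev_ema: np.ndarray, new_feat: np.ndarray, alpha: float = 0.9) -> np.ndarray:
    """3D F-EMA-style update: keep mostly the history, blend in the new detection."""
    return alpha * prev_ema + (1.0 - alpha) * new_feat

def gaussian_fused_confidence(confidences, timestamps, t_now, sigma: float = 3.0) -> float:
    """Weight historical trajectory confidences by temporal proximity to the current frame."""
    c = np.asarray(confidences, dtype=float)
    dt = t_now - np.asarray(timestamps, dtype=float)
    w = np.exp(-(dt ** 2) / (2.0 * sigma ** 2))   # recent frames count more
    return float((w * c).sum() / w.sum())
```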
(This article belongs to the Section Computing and Artificial Intelligence)

22 pages, 4937 KB  
Article
Multimodal AI for UAV: Vision–Language Models in Human–Machine Collaboration
by Maroš Krupáš, Ľubomír Urblík and Iveta Zolotová
Electronics 2025, 14(17), 3548; https://doi.org/10.3390/electronics14173548 - 6 Sep 2025
Viewed by 1195
Abstract
Recent advances in multimodal large language models (MLLMs)—particularly vision–language models (VLMs)—introduce new possibilities for integrating visual perception with natural-language understanding in human–machine collaboration (HMC). Unmanned aerial vehicles (UAVs) are increasingly deployed in dynamic environments, where adaptive autonomy and intuitive interaction are essential. Traditional UAV autonomy has relied mainly on visual perception or preprogrammed planning, offering limited adaptability and explainability. This study introduces a novel reference architecture, the multimodal AI–HMC system, based on which a dedicated UAV use case architecture was instantiated and experimentally validated in a controlled laboratory environment. The architecture integrates VLM-powered reasoning, real-time depth estimation, and natural-language interfaces, enabling UAVs to perform context-aware actions while providing transparent explanations. Unlike prior approaches, the system generates navigation commands while also communicating the underlying rationale and associated confidence levels, thereby enhancing situational awareness and fostering user trust. The architecture was implemented in a real-time UAV navigation platform and evaluated through laboratory trials. Quantitative results showed a 70% task success rate in single-obstacle navigation and 50% in a cluttered scenario, with safe obstacle avoidance at flight speeds of up to 0.6 m/s. Users approved 90% of the generated instructions and rated explanations as significantly clearer and more informative when confidence visualization was included. These findings demonstrate the novelty and feasibility of embedding VLMs into UAV systems, advancing explainable, human-centric autonomy and establishing a foundation for future multimodal AI applications in HMC, including robotics. Full article
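The command-plus-rationale-plus-confidence interface described here can be pictured as a structured VLM response that the flight controller parses and sanity-checks. The sketch below is illustrative only: vlm_call is a hypothetical wrapper around whatever VLM endpoint is used, and the confidence threshold is an assumption.

```python
# Minimal sketch of a "command + rationale + confidence" decision step.
# vlm_call() is a hypothetical helper, not a real API; thresholds are assumptions.
import json

PROMPT = (
    "You are the navigation assistant of a small indoor UAV. "
    "Given the camera view and the estimated distance to the nearest obstacle, "
    "reply with JSON: {\"command\": one of [forward, stop, turn_left, turn_right], "
    "\"rationale\": short explanation, \"confidence\": 0..1}."
)

def decide(frame_bytes: bytes, nearest_obstacle_m: float, vlm_call) -> dict:
    """Ask the VLM for an action, its explanation, and a confidence score."""
    raw = vlm_call(prompt=f"{PROMPT}\nNearest obstacle: {nearest_obstacle_m:.2f} m",
                   image=frame_bytes)
    decision = json.loads(raw)
    # Conservative guard: ignore low-confidence commands and hover instead.
    if decision.get("confidence", 0.0) < 0.5:
        decision["command"] = "stop"
    return decision
```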

23 pages, 1928 KB  
Systematic Review
Eye Tracking-Enhanced Deep Learning for Medical Image Analysis: A Systematic Review on Data Efficiency, Interpretability, and Multimodal Integration
by Jiangxia Duan, Meiwei Zhang, Minghui Song, Xiaopan Xu and Hongbing Lu
Bioengineering 2025, 12(9), 954; https://doi.org/10.3390/bioengineering12090954 - 5 Sep 2025
Viewed by 1002
Abstract
Deep learning (DL) has revolutionized medical image analysis (MIA), enabling early anomaly detection, precise lesion segmentation, and automated disease classification. However, its clinical integration faces two major challenges: reliance on limited, narrowly annotated datasets that inadequately capture real-world patient diversity, and the inherent “black-box” nature of DL decision-making, which complicates physician scrutiny and accountability. Eye tracking (ET) technology offers a transformative solution by capturing radiologists’ gaze patterns to generate supervisory signals. These signals enhance DL models through two key mechanisms: providing weak supervision to improve feature recognition and diagnostic accuracy, particularly when labeled data are scarce, and enabling direct comparison between machine and human attention to bridge interpretability gaps and build clinician trust. This approach also extends effectively to multimodal learning models (MLMs) and vision–language models (VLMs), supporting the alignment of machine reasoning with clinical expertise by grounding visual observations in diagnostic context, refining attention mechanisms, and validating complex decision pathways. Conducted in accordance with the PRISMA statement and registered in PROSPERO (ID: CRD42024569630), this review synthesizes state-of-the-art strategies for ET-DL integration. We further propose a unified framework in which ET innovatively serves as a data efficiency optimizer, a model interpretability validator, and a multimodal alignment supervisor. This framework paves the way for clinician-centered AI systems that prioritize verifiable reasoning, seamless workflow integration, and intelligible performance, thereby addressing key implementation barriers and outlining a path for future clinical deployment. Full article
(This article belongs to the Section Biosignal Processing)

29 pages, 5574 KB  
Article
Comprehensive Fish Feeding Management in Pond Aquaculture Based on Fish Feeding Behavior Analysis Using a Vision Language Model
by Divas Karimanzira
Aquac. J. 2025, 5(3), 15; https://doi.org/10.3390/aquacj5030015 - 3 Sep 2025
Viewed by 745
Abstract
For aquaculture systems, maximizing feed efficiency is a major challenge since it directly affects growth rates and economic sustainability. Feed is one of the largest costs in aquaculture, and feed waste is a significant environmental issue that requires effective management strategies. This paper suggests a novel approach for optimal fish feeding in pond aquaculture systems that integrates vision language models (VLMs), optical flow, and advanced image processing techniques to enhance feed management strategies. The system allows for the precise assessment of fish needs in connection to their feeding habits by integrating real-time data on biomass estimates and water quality conditions. By combining these data sources, the system makes informed decisions about when to activate automated feeders, optimizing feed distribution and cutting waste. A case study was conducted at a profit-driven tilapia farm where the system had been operational for over half a year. The results indicate significant improvements in feed conversion ratios (FCR) and a 28% reduction in feed waste. Our study found that, under controlled conditions, an average of 135 kg of feed was saved daily, resulting in a cost savings of approximately $1800 over the course of the study. The VLM-based fish feeding behavior recognition system proved effective in recognizing a range of feeding behaviors within a complex dataset in a series of tests conducted in a controlled pond aquaculture setting, with an F1-score of 0.95, accuracy of 92%, precision of 0.90, and recall of 0.85. Because it offers a scalable framework for enhancing aquaculture resource use and promoting sustainable practices, this study has significant implications. Our study demonstrates how combining language models and image processing could transform feeding practices, ultimately improving aquaculture’s environmental stewardship and profitability. Full article
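The feeder-activation logic amounts to gating the automated feeder on the recognized appetite level and the current water-quality readings. A deliberately simple sketch, with illustrative thresholds that are not taken from the study:

```python
# Minimal sketch of a feeder-activation rule combining the VLM-recognized feeding
# behavior with water-quality conditions. All thresholds are illustrative assumptions.
def should_feed(behavior: str, dissolved_oxygen_mg_l: float, temp_c: float) -> bool:
    """Activate the automated feeder only when fish show appetite and conditions allow."""
    water_ok = dissolved_oxygen_mg_l >= 5.0 and 24.0 <= temp_c <= 32.0
    return behavior in {"strong", "medium"} and water_ok
```

In the described system this decision would additionally be scaled by the biomass estimate to set the ration size, which is omitted here.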

26 pages, 13544 KB  
Article
GeoJapan Fusion Framework: A Large Multimodal Model for Regional Remote Sensing Recognition
by Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa and Miki Haseyama
Remote Sens. 2025, 17(17), 3044; https://doi.org/10.3390/rs17173044 - 1 Sep 2025
Viewed by 1040
Abstract
Recent advances in large multimodal models (LMMs) have opened new opportunities for multitask recognition from remote sensing images. However, existing approaches still face challenges in effectively recognizing the complex geospatial characteristics of regions such as Japan, where its location along the seismic belt leads to highly diverse urban environments and cityscapes that differ from those in other regions. To overcome these challenges, we propose the GeoJapan Fusion Framework (GFF), a multimodal architecture that integrates a large language model (LLM) and a vision–language model (VLM) and strengthens multimodal alignment ability through an in-context learning mechanism to support multitask recognition for Japanese remote sensing images. The GFF also incorporates a cross-modal feature fusion mechanism with low-rank adaptation (LoRA) to enhance representation alignment and enable efficient model adaptation. To facilitate the construction of the GFF, we construct the GeoJapan dataset, which comprises a substantial collection of high-quality Japanese remote sensing images, designed to facilitate multitask recognition using LMMs. We conducted extensive experiments and compared our method with state-of-the-art LMMs. The experimental results demonstrate that GFF outperforms previous approaches across multiple tasks, demonstrating its promising ability for multimodal multitask remote sensing recognition. Full article
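The LoRA component mentioned here follows the standard low-rank adaptation recipe: freeze a pretrained projection and learn a small additive update. A generic PyTorch sketch, with rank and scaling chosen for illustration rather than taken from the GFF:

```python
# Minimal sketch of low-rank adaptation (LoRA): a frozen base linear layer plus a
# trainable low-rank update. Rank and alpha are illustrative assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # the pretrained weight stays frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = W x + (alpha / r) * B A x  -- only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```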
(This article belongs to the Special Issue Remote Sensing Image Classification: Theory and Application)

16 pages, 2127 KB  
Article
VIPS: Learning-View-Invariant Feature for Person Search
by Hexu Wang, Wenlong Luo, Wei Wu, Fei Xie, Jindong Liu, Jing Li and Shizhou Zhang
Sensors 2025, 25(17), 5362; https://doi.org/10.3390/s25175362 - 29 Aug 2025
Viewed by 464
Abstract
Unmanned aerial vehicles (UAVs) have become indispensable tools for surveillance, enabled by their ability to capture multi-perspective imagery in dynamic environments. Among critical UAV-based tasks, cross-platform person search—detecting and identifying individuals across distributed camera networks—presents unique challenges. Severe viewpoint variations, occlusions, and cluttered backgrounds in UAV-captured data degrade the performance of conventional discriminative models, which struggle to maintain robustness under such geometric and semantic disparities. To address this, we propose view-invariant person search (VIPS), a novel two-stage framework combining Faster R-CNN with a view-invariant re-Identification (VIReID) module. Unlike conventional discriminative models, VIPS leverages the semantic flexibility of large vision–language models (VLMs) and adopts a two-stage training strategy to decouple and align text-based ID descriptors and visual features, enabling robust cross-view matching through shared semantic embeddings. To mitigate noise from occlusions and cluttered UAV-captured backgrounds, we introduce a learnable mask generator for feature purification. Furthermore, drawing from vision–language models, we design view prompts to explicitly encode perspective shifts into feature representations, enhancing adaptability to UAV-induced viewpoint changes. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, with ablation studies validating the efficacy of each component. Beyond technical advancements, this work highlights the potential of VLM-derived semantic alignment for UAV applications, offering insights for future research in real-time UAV-based surveillance systems. Full article
(This article belongs to the Section Remote Sensors)

15 pages, 6764 KB  
Article
V-PRUNE: Semantic-Aware Patch Pruning Before Tokenization in Vision–Language Model Inference
by Hyein Seo and Yong Suk Choi
Appl. Sci. 2025, 15(17), 9463; https://doi.org/10.3390/app15179463 - 28 Aug 2025
Viewed by 851
Abstract
Recent vision–language models (VLMs) achieve strong performance across multimodal benchmarks but suffer from high inference costs due to the large number of visual tokens. Prior studies have shown that many image tokens receive consistently low attention scores during inference, indicating that a substantial portion of visual content contributes little to final predictions. These observations raise questions about the efficiency of conventional token pruning strategies, which are typically applied after all attention operations and depend on late-emerging attention scores. To address this, we propose V-PRUNE, a semantic-aware patch-level pruning framework for vision–language models that removes redundant content before tokenization. By evaluating local similarity via color and histogram statistics, our method enables lightweight and interpretable pruning without architectural changes. Applied to CLIP-based models, our approach reduces FLOPs and inference time across vision–language understanding tasks, while maintaining or improving accuracy. Qualitative results further confirm that essential regions are preserved and the pruning behavior is human-aligned, making our method a practical solution for efficient VLM inference. Full article
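The pruning criterion is deliberately lightweight: compare each patch's color histogram with that of a neighbouring patch and drop patches that add nothing new, before any vision tokens are produced. A sketch under assumed patch size, bin count, and threshold:

```python
# Minimal sketch of pre-tokenization patch pruning based on color-histogram similarity
# between neighbouring patches. Patch size, bins, and threshold are illustrative assumptions.
import numpy as np

def patch_histogram(patch: np.ndarray, bins: int = 8) -> np.ndarray:
    """Per-channel color histogram of an RGB patch, normalized to sum to 1."""
    h = [np.histogram(patch[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(h).astype(float)
    return h / h.sum()

def prune_patches(image: np.ndarray, patch: int = 16, threshold: float = 0.9):
    """Keep a patch unless it is nearly identical (histogram intersection) to its left neighbour."""
    H, W, _ = image.shape
    keep = []
    for y in range(0, H - patch + 1, patch):
        prev_hist = None
        for x in range(0, W - patch + 1, patch):
            hist = patch_histogram(image[y:y + patch, x:x + patch])
            similar = prev_hist is not None and np.minimum(hist, prev_hist).sum() > threshold
            if not similar:
                keep.append((y, x))
            prev_hist = hist
    return keep  # coordinates of the patches passed on to the vision tokenizer
```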

23 pages, 16525 KB  
Article
Real-Time Vision–Language Analysis for Autonomous Underwater Drones: A Cloud–Edge Framework Using Qwen2.5-VL
by Wannian Li and Fan Zhang
Drones 2025, 9(9), 605; https://doi.org/10.3390/drones9090605 - 27 Aug 2025
Viewed by 1316
Abstract
Autonomous Underwater Vehicles (AUVs) equipped with vision systems face unique challenges in real-time environmental perception due to harsh underwater conditions and computational constraints. This paper presents a novel cloud–edge framework for real-time vision–language analysis in underwater drones using the Qwen2.5-VL model. Our system employs a uniform frame sampling mechanism that balances temporal resolution with processing capabilities, achieving near real-time analysis at 1 fps from 23 fps input streams. We construct a comprehensive data flow model encompassing image enhancement, communication latency, cloud-side inference, and semantic result return, which is supported by a theoretical latency framework and sustainable processing rate analysis. Simulation-based experimental results across three challenging underwater scenarios—pipeline inspection, coral reef monitoring, and wreck investigation—demonstrate consistent scene comprehension with end-to-end latencies near 1 s. The Qwen2.5-VL model successfully generates natural language summaries capturing spatial structure, biological content, and habitat conditions, even under turbidity and occlusion. Our results show that vision–language models (VLMs) can provide rich semantic understanding of underwater scenes despite challenging conditions, enabling AUVs to perform complex monitoring tasks with natural language scene descriptions. This work contributes to advancing AI-powered perception systems for the growing autonomous underwater drone market, supporting applications in environmental monitoring, offshore infrastructure inspection, and marine ecosystem assessment. Full article
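The sampling and latency budget described here are easy to make concrete: keep roughly one frame per second out of the ~23 fps stream, and require that the summed enhancement, uplink, inference, and downlink times stay within the sampling interval. A small sketch (the generator interface is an assumption):

```python
# Minimal sketch of uniform frame sampling down to the rate the cloud-side VLM can
# sustain, plus the per-frame latency budget. Rates follow the abstract; the
# frame-iterator interface is an illustrative assumption.
def sample_uniform(frames, input_fps: float = 23.0, target_fps: float = 1.0):
    """Yield roughly one frame per second from a ~23 fps stream by index stepping."""
    step = max(1, round(input_fps / target_fps))   # ~23 -> keep every 23rd frame
    for i, frame in enumerate(frames):
        if i % step == 0:
            yield frame

def end_to_end_latency(t_enhance: float, t_uplink: float,
                       t_infer: float, t_downlink: float) -> float:
    """Per-frame latency; sustained operation needs this to stay near 1 / target_fps."""
    return t_enhance + t_uplink + t_infer + t_downlink
```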
(This article belongs to the Special Issue Advances in Autonomous Underwater Drones: 2nd Edition)

18 pages, 3066 KB  
Article
A Tree-Based Search Algorithm with Global Pheromone and Local Signal Guidance for Scientific Chart Reasoning
by Min Zhou, Zhiheng Qi, Tianlin Zhu, Jan Vijg and Xiaoshui Huang
Mathematics 2025, 13(17), 2739; https://doi.org/10.3390/math13172739 - 26 Aug 2025
Viewed by 598
Abstract
Chart reasoning, a critical task for automating data interpretation in domains such as aiding scientific data analysis and medical diagnostics, leverages large-scale vision language models (VLMs) to interpret chart images and answer natural language questions, enabling semantic understanding that enhances knowledge accessibility and supports data-driven decision making across diverse domains. In this work, we formalize chart reasoning as a sequential decision-making problem governed by a Markov Decision Process (MDP), thereby providing a mathematically grounded framework for analyzing visual question answering tasks. While recent advances such as multi-step reasoning with Monte Carlo tree search (MCTS) offer interpretable and stochastic planning capabilities, these methods often suffer from redundant path exploration and inefficient reward propagation. To address these challenges, we propose a novel algorithmic framework that integrates a pheromone-guided search strategy inspired by Ant Colony Optimization (ACO). In our approach, chart reasoning is cast as a combinatorial optimization problem over a dynamically evolving search tree, where path desirability is governed by pheromone concentration functions that capture global phenomena across search episodes and are reinforced through trajectory-level rewards. Transition probabilities are further modulated by local signals, which are evaluations derived from the immediate linguistic feedback of large language models. This enables fine grained decision making at each step while preserving long-term planning efficacy. Extensive experiments across four benchmark datasets, ChartQA, MathVista, GRAB, and ChartX, demonstrate the effectiveness of our approach, with multi-agent reasoning and pheromone guidance yielding success rate improvements of +18.4% and +7.6%, respectively. Full article
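The pheromone-guided search can be summarized by two operations: sample the next reasoning step with a probability that combines the global pheromone level and the local language-model score, and, after each episode, evaporate and then deposit the trajectory reward along the chosen path. A sketch in the spirit of Ant Colony Optimization, with illustrative exponents and evaporation rate:

```python
# Minimal ACO-style sketch of pheromone-guided node selection and reinforcement on a
# reasoning tree. Exponents, evaporation rate, and data layout are illustrative assumptions.
import random

def select_child(children, pheromone, local_score, alpha=1.0, beta=1.0):
    """Sample the next reasoning step proportionally to pheromone^alpha * local^beta."""
    weights = [(pheromone[c] ** alpha) * (local_score[c] ** beta) for c in children]
    total = sum(weights)
    return random.choices(children, weights=[w / total for w in weights], k=1)[0]

def reinforce(path, pheromone, reward, evaporation=0.1):
    """Evaporate all pheromone, then deposit the trajectory-level reward along the chosen path."""
    for node in pheromone:
        pheromone[node] *= (1.0 - evaporation)
    for node in path:
        pheromone[node] += reward
```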
(This article belongs to the Special Issue Multimodal Deep Learning and Its Application in Healthcare)

28 pages, 17913 KB  
Article
Towards Robust Industrial Control Interpretation Through Comparative Analysis of Vision–Language Models
by Juan Izquierdo-Domenech, Jordi Linares-Pellicer, Carlos Aliaga-Torro and Isabel Ferri-Molla
Machines 2025, 13(9), 759; https://doi.org/10.3390/machines13090759 - 25 Aug 2025
Viewed by 632
Abstract
Industrial environments frequently rely on analog control instruments due to their reliability and robustness; however, automating the interpretation of these controls remains challenging due to variability in design, lighting conditions, and scale precision requirements. This research investigates the effectiveness of Vision–Language Models (VLMs) for automated interpretation of industrial controls through analysis of three distinct approaches: general-purpose VLMs, fine-tuned specialized models, and lightweight models optimized for edge computing. Each approach was evaluated using two prompting strategies, Holistic-Thought Protocol (HTP) and sequential Chain-of-Thought (CoT), across a representative dataset of continuous and discrete industrial controls. The results demonstrate that the fine-tuned Generative Pre-trained Transformer 4 omni (GPT-4o) significantly outperformed other approaches, achieving low Mean Absolute Error (MAE) for continuous controls and the highest accuracy and Matthews Correlation Coefficient (MCC) for discrete controls. Fine-tuned models demonstrated less sensitivity to prompt variations, enhancing their reliability. In contrast, although general-purpose VLMs showed acceptable zero-shot performance, edge-optimized models exhibited severe limitations. This work highlights the capability of fine-tuned VLMs for practical deployment in industrial scenarios, balancing precision, computational efficiency, and data annotation requirements. Full article
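The two prompting strategies compared here differ mainly in whether the model is asked for the reading in one shot or walked through the scale step by step. The templates below are illustrative paraphrases, not the authors' prompts, and build_request is a hypothetical helper for a generic VLM endpoint:

```python
# Minimal sketch contrasting a holistic prompt (HTP) with a sequential chain-of-thought
# prompt (CoT) for reading an analog gauge. Wording and the request format are assumptions.
HOLISTIC_PROMPT = (
    "Look at this pressure gauge and report the indicated value in bar, "
    "as a single number."
)

COT_PROMPT = (
    "Read this pressure gauge step by step: "
    "1) identify the minimum and maximum of the scale, "
    "2) identify the major tick spacing, "
    "3) locate the needle between the two nearest ticks, "
    "4) interpolate and report the final value in bar."
)

def build_request(image_b64: str, strategy: str = "cot") -> dict:
    """Assemble a generic multimodal request body for whichever VLM endpoint is used."""
    prompt = COT_PROMPT if strategy == "cot" else HOLISTIC_PROMPT
    return {"prompt": prompt, "image": image_b64}
```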
(This article belongs to the Section Automation and Control Systems)
