
603 Results Found

  • Article
  • Open Access
1,080 Views
29 Pages

13 November 2025

Open-vocabulary semantic segmentation (OVSS) is of critical importance for unmanned aerial vehicle (UAV) imagery, as UAV scenes are highly dynamic and characterized by diverse, unpredictable object categories. Current OVSS approaches mainly rely on t...

  • Article
  • Open Access
8 Citations
5,086 Views
16 Pages

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

  • Han Ma,
  • Baoyu Fan,
  • Benjamin K. Ng and
  • Chan-Tong Lam

16 January 2024

Multimodal learning is a promising area in artificial intelligence (AI) that can make the model understand different kinds of data. Existing works are trying to re-train a new model based on pre-trained models that requires much data, computation pow...

  • Article
  • Open Access
199 Views
14 Pages

Comparative Evaluation of Vision–Language Models for Detecting and Localizing Dental Lesions from Intraoral Images

  • Maria Jahan,
  • Al Ibne Siam,
  • Lamim Zakir Pronay,
  • Saif Ahmed,
  • Nabeel Mohammed,
  • James Dudley and
  • Taseef Hasan Farook

To assess the efficiency of vision–language models in detecting and classifying carious and non-carious lesions from intraoral photo imaging. A dataset of 172 annotated images was classified for microcavitation, cavitated lesions, staining, ca...

  • Article
  • Open Access
2,432 Views
24 Pages

In this work, the utility of multimodal vision–language models (VLMs) for visual product understanding in e-commerce is investigated, focusing on two complementary models: ColQwen2 (vidore/colqwen2-v1.0) and ColPali (vidore/colpali-v1.2-hf). Th...

  • Article
  • Open Access
15 Citations
5,319 Views
22 Pages

Few-Shot Image Classification of Crop Diseases Based on Vision–Language Models

  • Yueyue Zhou,
  • Hongping Yan,
  • Kun Ding,
  • Tingting Cai and
  • Yan Zhang

21 September 2024

Accurate crop disease classification is crucial for ensuring food security and enhancing agricultural productivity. However, the existing crop disease classification algorithms primarily focus on a single image modality and typically require a large...

  • Article
  • Open Access
1 Citation
3,189 Views
18 Pages

CoCM: Conditional Cross-Modal Learning for Vision-Language Models

  • Juncheng Yang,
  • Shuai Xie,
  • Shuxia Li,
  • Zengyu Cai,
  • Yijia Li and
  • Weiping Zhu

Parameter tuning based adapter methods have achieved notable success in vision-language models (VLMs). However, they face challenges in scenarios with insufficient training samples or limited resources. While leveraging image modality caching and ret...

  • Review
  • Open Access
19 Citations
14,973 Views
28 Pages

Advancements in Vision–Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques

  • Lijie Tao,
  • Haokui Zhang,
  • Haizhao Jing,
  • Yu Liu,
  • Dawei Yan,
  • Guoting Wei and
  • Xizhe Xue

6 January 2025

Recently, the remarkable success of ChatGPT has sparked a renewed wave of interest in artificial intelligence (AI), and the advancements in Vision–Language Models (VLMs) have pushed this enthusiasm to new heights. Differing from previous AI app...

  • Article
  • Open Access
4 Citations
2,411 Views
18 Pages

25 January 2025

Weakly supervised crack segmentation aims to create pixel-level crack masks with minimal human annotation, which often only differentiate between crack and normal no-crack patches. This task is crucial for assessing structural integrity and safety in...

  • Review
  • Open Access
5 Citations
7,218 Views
22 Pages

31 October 2022

Because the pretraining model is not limited by the scale of data annotation and can learn general semantic information, it performs well in tasks related to natural language processing and computer vision. In recent years, more and more attention ha...

  • Article
  • Open Access
727 Views
15 Pages

Preliminary Study on Image-Finding Generation and Classification of Lung Nodules in Chest CT Images Using Vision–Language Models

  • Maiko Nagao,
  • Atsushi Teramoto,
  • Kaito Urata,
  • Kazuyoshi Imaizumi,
  • Masashi Kondo and
  • Hiroshi Fujita

9 November 2025

In the diagnosis of lung cancer, imaging findings of lung nodules are essential for benign and malignant classifications. Although numerous studies have investigated the classification of lung nodules, no method has been proposed for obtaining detail...

  • Article
  • Open Access
1,678 Views
20 Pages

26 October 2025

Building on advances in promptable segmentation models, this work introduces a framework that integrates Large Vision-Language Model (LVLM) bounding box priors with prototype-based region of interest (ROI) selection to improve zero-shot medical image...

  • Article
  • Open Access
1,224 Views
28 Pages

Towards Robust Industrial Control Interpretation Through Comparative Analysis of Vision–Language Models

  • Juan Izquierdo-Domenech,
  • Jordi Linares-Pellicer,
  • Carlos Aliaga-Torro and
  • Isabel Ferri-Molla

25 August 2025

Industrial environments frequently rely on analog control instruments due to their reliability and robustness; however, automating the interpretation of these controls remains challenging due to variability in design, lighting conditions, and scale p...

  • Article
  • Open Access
471 Views
15 Pages

Application of Vision-Language Models in the Automatic Recognition of Bone Tumors on Radiographs: A Retrospective Study

  • Robert Kaczmarczyk,
  • Philipp Pieroh,
  • Sebastian Koob,
  • Frank Sebastian Fröschen,
  • Sebastian Scheidt,
  • Kristian Welle,
  • Ron Martin and
  • Jonas Roos

16 December 2025

Background: Vision-language models show promise in medical image interpretation, but their performance in musculoskeletal tumor diagnostics remains underexplored. Objective: To evaluate the diagnostic accuracy of six large language models on orthoped...

  • Article
  • Open Access
990 Views
29 Pages

An Empirical Evaluation of Low-Rank Adapted Vision–Language Models for Radiology Image Captioning

  • Mahmudul Hoque,
  • Raisa Nusrat Chowdhury,
  • Md Rakibul Hasan,
  • Ojonugwa Oluwafemi Ejiga Peter,
  • Fahmi Khalifa and
  • Md Mahmudur Rahman

Rapidly growing medical imaging volumes have increased radiologist workloads, creating demand for automated tools that support interpretation and reduce reporting delays. Vision-language models (VLMs) can generate clinically relevant captions to acce...

  • Article
  • Open Access
1,310 Views
41 Pages

13 August 2025

Despite advances in complex reasoning, Vision-Language Models (VLMs) remain inadequately benchmarked for safety-critical applications like childcare. To address this gap, we conduct a multilingual (English, French, Polish, Japanese) comparison of VLM...

  • Article
  • Open Access
4 Citations
4,524 Views
26 Pages

Estimating Age and Sex from Dental Panoramic Radiographs Using Neural Networks and Vision–Language Models

  • Salem Shamsul Alam,
  • Nabila Rashid,
  • Tasfia Azrin Faiza,
  • Saif Ahmed,
  • Rifat Ahmed Hassan,
  • James Dudley and
  • Taseef Hasan Farook

8 January 2025

Purpose: The purpose of this study was to compare multiple deep learning models for estimating age and sex using dental panoramic radiographs and identify the most successful deep learning models for the specified tasks. Methods: The dataset of 437 p...

  • Article
  • Open Access
1,558 Views
17 Pages

31 October 2025

Marine low clouds have a strong impact on Earth’s system but remain a major source of uncertainty in anthropogenic radiative forcing simulated by general circulation models. This uncertainty arises from incomplete understanding of the many proc...

  • Article
  • Open Access
1,525 Views
23 Pages

Recently, prompt learning has emerged as a viable technique for fine-tuning pre-trained vision–language models (VLMs). The use of prompts allows pre-trained VLMs to be quickly adapted to specific downstream tasks, bypassing the necessity to upd...

  • Article
  • Open Access
5,377 Views
35 Pages

21 July 2025

Medical Visual Question Answering (MedVQA) lies at the intersection of computer vision, natural language processing, and clinical decision-making, aiming to generate accurate responses from medical images paired with complex inquiries. Despite recent...

  • Article
  • Open Access
1 Citation
3,463 Views
22 Pages

Multimodal AI for UAV: Vision–Language Models in Human–Machine Collaboration

  • Maroš Krupáš,
  • Ľubomír Urblík and
  • Iveta Zolotová

6 September 2025

Recent advances in multimodal large language models (MLLMs), particularly vision–language models (VLMs), introduce new possibilities for integrating visual perception with natural-language understanding in human–machine collabo...

  • Article
  • Open Access
2 Citations
4,653 Views
14 Pages

19 November 2024

This study focuses on Scene Text Recognition (STR), which plays a crucial role in various applications of artificial intelligence such as image retrieval, office automation, and intelligent transportation systems. Currently, pre-trained vision-langua...

  • Article
  • Open Access
2,024 Views
15 Pages

28 August 2025

Recent vision–language models (VLMs) achieve strong performance across multimodal benchmarks but suffer from high inference costs due to the large number of visual tokens. Prior studies have shown that many image tokens receive consistently low...

  • Article
  • Open Access
5 Citations
2,829 Views
14 Pages

Several attacks have been proposed against autonomous vehicles and their subsystems that are powered by machine learning (ML). Road sign recognition models are especially heavily tested under various adversarial ML attack settings, and they have prov...

  • Article
  • Open Access
1 Citation
1,116 Views
36 Pages

29 October 2025

Urban recreational spaces (URSs) are pivotal for enhancing resident well-being, making the accurate assessment of public perceptions crucial for quality optimization. Compared to traditional surveys, social media data provide a scalable means for mul...

  • Article
  • Open Access
2 Citations
2,890 Views
15 Pages

Auto-Rad: End-to-End Report Generation from Lumber Spine MRI Using Vision–Language Model

  • Mohammed Yeasin,
  • Kazi Ashraf Moinuddin,
  • Felix Havugimana,
  • Lijia Wang and
  • Paul Park

23 November 2024

Background: Lumbar spinal stenosis (LSS) is a major cause of chronic lower back and leg pain, and is traditionally diagnosed through labor-intensive analysis of magnetic resonance imaging (MRI) scans by radiologists. This study aims to streamline the...

  • Article
  • Open Access
380 Views
16 Pages

MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

  • Zhendong Xiao,
  • Shan Yang,
  • Shujie Ji,
  • Jun Yin,
  • Ziling Wen and
  • Wu Wei

28 November 2025

Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera’s position and orientation from images and is essential for applications in augmented reality, mixed reality, autonomous driving, delivery...

  • Article
  • Open Access
791 Views
17 Pages

SADAMB: Advancing Spatially-Aware Vision-Language Modeling Through Datasets, Metrics, and Benchmarks

  • Giorgos Papadopoulos,
  • Petros Drakoulis,
  • Athanasios Ntovas,
  • Alexandros Doumanoglou and
  • Dimitris Zarpalas

29 September 2025

Understanding spatial relationships between objects in images is crucial for robotic navigation, augmented reality systems, and autonomous driving applications, among others. However, existing vision-language benchmarks often overlook explicit spatia...

  • Article
  • Open Access
6 Citations
3,419 Views
26 Pages

6 September 2024

As a fundamental element of the transportation system, traffic signs are widely used to guide traffic behaviors. In recent years, drones have emerged as an important tool for monitoring the conditions of traffic signs. However, the existing image pro...

  • Feature Paper
  • Article
  • Open Access
2 Citations
3,486 Views
27 Pages

Mitigating Context Bias in Vision–Language Models via Multimodal Emotion Recognition

  • Constantin-Bogdan Popescu,
  • Laura Florea and
  • Corneliu Florea

20 August 2025

Vision–Language Models (VLMs) have become key contributors to the state of the art in contextual emotion recognition, demonstrating a superior ability to understand the relationship between context, facial expressions, and interactions in image...

  • Article
  • Open Access
2 Citations
1,869 Views
14 Pages

30 June 2025

Crop diseases pose a significant threat to agricultural productivity and global food security. Timely and accurate disease identification is crucial for improving crop yield and quality. While most existing deep learning-based methods focus primarily...

  • Article
  • Open Access
2 Citations
2,846 Views
17 Pages

RelVid: Relational Learning with Vision-Language Models for Weakly Video Anomaly Detection

  • Jingxin Wang,
  • Guohan Li,
  • Jiaqi Liu,
  • Zhengyi Xu,
  • Xinrong Chen and
  • Jianming Wei

25 March 2025

Weakly supervised video anomaly detection aims to identify abnormal events in video sequences without requiring frame-level supervision, which is a challenging task in computer vision. Traditional methods typically rely on low-level visual features w...

  • Article
  • Open Access
2,322 Views
27 Pages

10 May 2025

The Controller Area Network (CAN) facilitates efficient communication among vehicle components. While it ensures fast and reliable data transmission, its lightweight design makes it susceptible to data manipulation in the absence of security layers....

  • Proceeding Paper
  • Open Access
633 Views
7 Pages

24 September 2025

The confluence of computer vision and natural language processing has yielded powerful vision language models (VLMs) capable of multimodal understanding. We applied state-of-the-art VLMs for quality monitoring of the shoe assembly industry. By levera...

  • Article
  • Open Access
61 Citations
17,335 Views
18 Pages

RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

  • Yakoub Bazi,
  • Laila Bashmal,
  • Mohamad Mahmoud Al Rahhal,
  • Riccardo Ricci and
  • Farid Melgani

23 April 2024

In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking poten...

  • Article
  • Open Access
4 Citations
1,498 Views
15 Pages

11 March 2025

This paper presents the experimental evaluation and analyzes the results of the first edition of the pedestrian attribute recognition (PAR) contest, the international competition which focused on smart visual sensors based on multi-task computer visi...

  • Review
  • Open Access
1 Citation
3,194 Views
51 Pages

16 October 2025

With the rapid advancement of artificial intelligence and robotics, the integration of Large Language Models (LLMs) with 3D vision is emerging as a transformative approach to enhancing robotic sensing technologies. This convergence enables machines t...

  • Article
  • Open Access
17 Citations
6,164 Views
16 Pages

Vision-Language Models for Zero-Shot Classification of Remote Sensing Images

  • Mohamad Mahmoud Al Rahhal,
  • Yakoub Bazi,
  • Hebah Elgibreen and
  • Mansour Zuair

17 November 2023

Zero-shot classification presents a challenge since it necessitates a model to categorize images belonging to classes it has not encountered during its training phase. Previous research in the field of remote sensing (RS) has explored this task by tr...

  • Article
  • Open Access
3 Citations
2,017 Views
18 Pages

6 April 2025

Thermal comfort in urban commercial spaces significantly impacts both business performance and public well-being. Traditional evaluation methods relying on field surveys and expert assessments are often time-consuming and labor-intensive. This study...

  • Article
  • Open Access
529 Views
31 Pages

Bridge health diagnosis plays a vital role in ensuring structural safety and extending service life while reducing maintenance costs. Traditional structural health monitoring approaches rely on sensor-based measurements, which are costly, labor-inten...

  • Article
  • Open Access
663 Views
17 Pages

CracksGPT: Exploring the Potential and Limitations of Multimodal AI for Building Crack Analysis

  • Biyanka Ekanayake,
  • Vishal Thengane,
  • Johnny Kwok-Wai Wong,
  • Sara Wilkinson and
  • Sai Ho Ling

28 November 2025

Building cracks are among the critical building defects, as they can compromise structural integrity, occupant safety and building sustainability. Traditional laborious building inspection methods are cumbersome and erroneous. Computer vision-based c...

  • Article
  • Open Access
714 Views
26 Pages

Think-to-Detect: Rationale-Driven Vision–Language Anomaly Detection

  • Mahmoud Abdalla,
  • Mahmoud SalahEldin Kasem,
  • Mohamed Mahmoud,
  • Mostafa Farouk Senussi,
  • Abdelrahman Abdallah and
  • Hyun-Soo Kang

8 December 2025

Large vision–language models (VLMs) can describe images fluently, yet their anomaly decisions often rely on opaque heuristics and manual thresholds. We present ThinkAnomaly, a rationale-first vision–language framework for industrial anoma...

  • Review
  • Open Access
29 Citations
23,134 Views
39 Pages

A Survey of Robot Intelligence with Large Language Models

  • Hyeongyo Jeong,
  • Haechan Lee,
  • Changwon Kim and
  • Sungtae Shin

2 October 2024

Since the emergence of ChatGPT, research on large language models (LLMs) has actively progressed across various fields. LLMs, pre-trained on vast text datasets, have exhibited exceptional abilities in understanding natural language and planning tasks...

  • Article
  • Open Access
2,595 Views
23 Pages

27 August 2025

Autonomous Underwater Vehicles (AUVs) equipped with vision systems face unique challenges in real-time environmental perception due to harsh underwater conditions and computational constraints. This paper presents a novel cloud–edge framework f...

  • Article
  • Open Access
1,385 Views
24 Pages

29 August 2025

Precise classification of unsound wheat grains is essential for crop yields and food security, yet most existing approaches rely on vision-only models that demand large labeled datasets, which is often impractical in real-world, data-scarce settings....

  • Article
  • Open Access
3,672 Views
18 Pages

28 November 2024

This article presents CapFlow, an integrated approach to detailed image captioning and hashtag generation. Based on a thorough performance evaluation, the image captioning model utilizes a fine-tuned vision-language model with Low-Rank Adaptation (Lo...

  • Article
  • Open Access
3,161 Views
16 Pages

Parameter-Efficient Adaptation of Large Vision—Language Models for Video Memorability Prediction

  • Iván Martín-Fernández,
  • Sergio Esteban-Romero,
  • Fernando Fernández-Martínez and
  • Manuel Gil-Martín

7 March 2025

The accurate modelling of video memorability, or the intrinsic properties that render a piece of audiovisual content more likely to be remembered, will facilitate the development of automatic systems that are more efficient in retrieving, classifying...

  • Article
  • Open Access
1 Citation
1,224 Views
27 Pages

An Exploratory Study on Workover Scenario Understanding Using Prompt-Enhanced Vision-Language Models

  • Xingyu Liu,
  • Liming Zhang,
  • Zewen Song,
  • Ruijia Zhang,
  • Jialin Wang,
  • Chenyang Wang and
  • Wenhao Liang

15 May 2025

As oil and gas exploration has deepened, the complexity and risk of well repair operations has increased, and the traditional description methods based on text and charts have limitations in accuracy and efficiency. Therefore, this study proposes a w...

  • Article
  • Open Access
1 Citation
1,837 Views
17 Pages

21 August 2025

Current vision–language models (VLMs) are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision–...

  • Article
  • Open Access
4 Citations
2,404 Views
29 Pages

19 March 2025

The integration of vision–language models (VLMs) with robotic systems represents a transformative advancement in autonomous task planning and execution. However, traditional robotic arms relying on pre-programmed instructions exhibit limited ad...

  • Article
  • Open Access
1 Citation
4,192 Views
23 Pages

26 February 2025

Abnormal phenomena on urban roads, including uneven surfaces, garbage, traffic congestion, floods, fallen trees, fires, and traffic accidents, present significant risks to public safety and infrastructure, necessitating real-time monitoring and early...
