Search Results (86)

Search Parameters:
Keywords = vision-language pre-training

28 pages, 2518 KiB  
Article
Enhancing Keyword Spotting via NLP-Based Re-Ranking: Leveraging Semantic Relevance Feedback in the Handwritten Domain
by Stergios Papazis, Angelos P. Giotis and Christophoros Nikou
Electronics 2025, 14(14), 2900; https://doi.org/10.3390/electronics14142900 - 20 Jul 2025
Viewed by 297
Abstract
Handwritten Keyword Spotting (KWS) remains a challenging task, particularly in segmentation-free scenarios where word images must be retrieved and ranked based on their similarity to a query without relying on prior page-level segmentation. Traditional KWS methods primarily focus on visual similarity, often overlooking the underlying semantic relationships between words. In this work, we propose a novel NLP-driven re-ranking approach that refines the initial ranked lists produced by state-of-the-art KWS models. By leveraging semantic embeddings from pre-trained BERT-like Large Language Models (LLMs, e.g., RoBERTa, MPNet, and MiniLM), we introduce a relevance feedback mechanism that improves both verbatim and semantic keyword spotting. Our framework operates in two stages: (1) projecting retrieved word image transcriptions into a semantic space via LLMs and (2) re-ranking the retrieval list using a weighted combination of semantic and exact relevance scores based on pairwise similarities with the query. We evaluate our approach on the widely used George Washington (GW) and IAM collections using two cutting-edge segmentation-free KWS models, which are further integrated into our proposed pipeline. Our results show consistent gains in Mean Average Precision (mAP), with improvements of up to 2.3% (from 94.3% to 96.6%) on GW and 3% (from 79.15% to 82.12%) on IAM. Even when mAP gains are smaller, qualitative improvements emerge: semantically relevant but inexact matches are retrieved more frequently without compromising exact match recall. We further examine the effect of fine-tuning transformer-based OCR (TrOCR) models on historical GW data to align textual and visual features more effectively. Overall, our findings suggest that semantic feedback can enhance retrieval effectiveness in KWS pipelines, paving the way for lightweight hybrid vision-language approaches in handwritten document analysis. Full article
(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)
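A minimal sketch of the weighted re-ranking step described in this abstract, assuming sentence-transformers embeddings (a MiniLM-style checkpoint), a fixed weight alpha, and placeholder KWS scores; it is not the authors' implementation.

```python
# Hypothetical sketch: combine the KWS model's visual (exact) score with a
# semantic score from an LLM embedding of each retrieved transcription,
# then re-sort the retrieval list. The weight alpha is an assumption.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # MiniLM-style encoder (assumed)

def rerank(query, retrieved, alpha=0.7):
    """retrieved: list of (transcription, exact_score) pairs from the KWS model."""
    texts = [t for t, _ in retrieved]
    emb_q = encoder.encode(query, convert_to_tensor=True)
    emb_t = encoder.encode(texts, convert_to_tensor=True)
    semantic = util.cos_sim(emb_q, emb_t)[0]           # pairwise similarity to the query
    combined = [
        (t, alpha * exact + (1 - alpha) * float(sem))  # weighted relevance score
        for (t, exact), sem in zip(retrieved, semantic)
    ]
    return sorted(combined, key=lambda x: x[1], reverse=True)

print(rerank("orders", [("orders", 0.92), ("order", 0.88), ("borders", 0.75)]))
```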

25 pages, 1669 KiB  
Article
Zero-Shot Infrared Domain Adaptation for Pedestrian Re-Identification via Deep Learning
by Xu Zhang, Yinghui Liu, Liangchen Guo and Huadong Sun
Electronics 2025, 14(14), 2784; https://doi.org/10.3390/electronics14142784 - 10 Jul 2025
Viewed by 243
Abstract
In computer vision, the performance of detectors trained under optimal lighting conditions is significantly impaired when applied to infrared domains due to the scarcity of labeled infrared target domain data and the inherent degradation in infrared image quality. Progress in cross-domain pedestrian re-identification is hindered by the lack of labeled infrared image data. To address the degradation of pedestrian recognition in infrared environments, we propose a framework for zero-shot infrared domain adaptation. This integrated approach is designed to mitigate the challenges of pedestrian recognition in infrared domains while enabling zero-shot domain adaptation. Specifically, an advanced reflectance representation learning module and an exchange–re-decomposition–coherence process are employed to learn illumination invariance and to enhance the model’s effectiveness, respectively. Additionally, the CLIP (Contrastive Language–Image Pretraining) image encoder and DINO (Distillation with No Labels) are fused for feature extraction, improving model performance under infrared conditions and enhancing its generalization capability. To further improve model performance, we introduce the Non-Local Attention (NLA) module, the Instance-based Weighted Part Attention (IWPA) module, and the Multi-head Self-Attention module. The NLA module captures global feature dependencies, particularly long-range feature relationships, effectively mitigating issues such as blurred or missing image information in feature degradation scenarios. The IWPA module focuses on localized regions to enhance model accuracy in complex backgrounds and unevenly lit scenes. Meanwhile, the Multi-head Self-Attention module captures long-range dependencies between cross-modal features, further strengthening environmental understanding and scene modeling. The key innovation of this work lies in the skillful combination and application of existing technologies to new domains, overcoming the challenges posed by vision in infrared environments. Experimental results on the SYSU-MM01 dataset show that, under the single-shot setting, Rank-1 Accuracy (Rank-1) and mean Average Precision (mAP) values of 37.97% and 37.25%, respectively, were achieved, while in the multi-shot setting, values of 34.96% and 34.14% were attained. Full article
(This article belongs to the Special Issue Deep Learning in Image Processing and Computer Vision)
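The CLIP-DINO feature fusion can be pictured with a small stand-in module; the concatenation-plus-projection design and all dimensions are assumptions rather than the paper's architecture.

```python
# Minimal sketch of fusing CLIP and DINO image features for re-identification.
import torch
import torch.nn as nn

class FusedFeatureExtractor(nn.Module):
    def __init__(self, clip_dim=512, dino_dim=768, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(clip_dim + dino_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, clip_feat, dino_feat):
        # Concatenate the two global descriptors and project to a joint space.
        fused = torch.cat([clip_feat, dino_feat], dim=-1)
        return self.norm(self.proj(fused))

extractor = FusedFeatureExtractor()
clip_feat = torch.randn(4, 512)   # stand-ins for CLIP image-encoder outputs
dino_feat = torch.randn(4, 768)   # stand-ins for DINO features
print(extractor(clip_feat, dino_feat).shape)  # torch.Size([4, 512])
```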

17 pages, 5079 KiB  
Article
M3AE-Distill: An Efficient Distilled Model for Medical Vision–Language Downstream Tasks
by Xudong Liang, Jiang Xie, Mengfei Zhang and Zhuo Bi
Bioengineering 2025, 12(7), 738; https://doi.org/10.3390/bioengineering12070738 - 6 Jul 2025
Viewed by 384
Abstract
The multi-modal masked autoencoder (M3AE) is a widely studied medical vision–language (VL) model that can be applied to various clinical tasks. However, its large parameter size poses challenges for deployment in real-world settings. Knowledge distillation (KD) has proven effective for compressing task-specific uni-modal models, yet its application to medical VL backbone models during pre-training remains underexplored. To address this, M3AE-Distill, a lightweight medical VL model, is proposed to maintain high performance while improving efficiency. During pre-training, two key strategies are developed: (1) both hidden state and attention map distillation are employed to guide the student model, and (2) an attention-guided masking strategy is designed to enhance fine-grained image–text alignment. Extensive experiments on five medical VL datasets across three tasks validate the effectiveness of M3AE-Distill. Two student variants, M3AE-Distill-Small and M3AE-Distill-Base, are provided to support a flexible trade-off between efficiency and accuracy. M3AE-Distill-Base consistently outperforms existing models and achieves performance comparable to the teacher model, while delivering 2.11× and 2.61× speedups during inference and fine-tuning, respectively. Full article
(This article belongs to the Section Biosignal Processing)
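A toy sketch of the two distillation signals named in the abstract (hidden-state and attention-map distillation); the layer mapping, width projection, and head matching are assumptions.

```python
# Sketch of the two distillation losses: MSE between (projected) student and
# teacher hidden states, and KL divergence between their attention maps.
import torch
import torch.nn.functional as F

def hidden_state_loss(student_h, teacher_h, proj):
    # proj maps the smaller student width to the teacher width.
    return F.mse_loss(proj(student_h), teacher_h)

def attention_map_loss(student_attn, teacher_attn, eps=1e-8):
    # Both tensors: (batch, heads, seq, seq), rows already softmax-normalized.
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")

student_h = torch.randn(2, 16, 384)
teacher_h = torch.randn(2, 16, 768)
proj = torch.nn.Linear(384, 768)
s_attn = torch.softmax(torch.randn(2, 6, 16, 16), dim=-1)
t_attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)[:, :6]  # matched heads (assumed)
loss = hidden_state_loss(student_h, teacher_h, proj) + attention_map_loss(s_attn, t_attn)
print(loss.item())
```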

16 pages, 3114 KiB  
Article
TDA-L: Reducing Latency and Memory Consumption of Test-Time Adaptation for Real-Time Intelligent Sensing
by Rahim Hossain, Md Tawheedul Islam Bhuian and Kyoung-Don Kang
Sensors 2025, 25(12), 3574; https://doi.org/10.3390/s25123574 - 6 Jun 2025
Viewed by 609
Abstract
Vision–language models learn visual concepts from natural language supervision, which can significantly enhance the generalizability of real-time intelligent sensing, such as analyzing camera-captured real-time images for visually impaired users. However, adapting vision–language models to distribution shifts at test time, caused by several factors such as lighting or weather changes, remains challenging. In particular, most existing test-time adaptation methods rely on gradient-based fine-tuning and backpropagation, making them computationally expensive and unsuitable for real-time applications. To address this challenge, the Training-Free Dynamic Adapter (TDA) has recently been introduced as a lightweight alternative that uses a dynamic key–value cache and pseudo-label refinement for test-time adaptation without backpropagation. Building on this, we propose TDA-L, a new framework that integrates Low-Rank Adaptation (LoRA) to reduce the size of feature representations and related computational overhead at test time using pre-learned low-rank matrices. TDA-L applies LoRA transformations to both query and cached features during inference, cost-efficiently improving robustness to distribution shifts while maintaining the training-free nature of TDA. Experimental results on seven benchmarks show that TDA-L maintains accuracy while achieving lower latency, less memory consumption, and higher throughput, making it well-suited for AI-based real-time sensing. Full article
(This article belongs to the Special Issue Edge AI for Wearables and IoT)
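The pre-learned low-rank transformation applied to query and cached features might look roughly like the following; the rank, scaling, and cache layout are assumptions, not TDA-L's exact formulation.

```python
# Rough sketch: apply a low-rank (LoRA-style) update to both query and cached
# features at test time, then score the query against the cache.
import torch

dim, rank = 512, 8
A = torch.randn(dim, rank) * 0.01   # pre-learned low-rank factors (placeholders)
B = torch.randn(rank, dim) * 0.01

def lora_transform(feat, scale=1.0):
    # Low-rank residual update: x + scale * (x A) B, much cheaper than a full d x d map.
    return feat + scale * (feat @ A) @ B

query_feat = torch.randn(1, dim)
cache_feats = torch.randn(32, dim)          # key-value cache built during streaming inference
q, cache = lora_transform(query_feat), lora_transform(cache_feats)
logits = q @ cache.t()                      # similarity of the query to cached entries
print(logits.shape)
```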

25 pages, 6786 KiB  
Article
Data Quality Monitoring for the Hadron Calorimeters Using Transfer Learning for Anomaly Detection
by Mulugeta Weldezgina Asres, Christian Walter Omlin, Long Wang, David Yu, Pavel Parygin, Jay Dittmann and the CMS-HCAL Collaboration
Sensors 2025, 25(11), 3475; https://doi.org/10.3390/s25113475 - 31 May 2025
Viewed by 445
Abstract
The proliferation of sensors brings an immense volume of spatio-temporal (ST) data in many domains, including monitoring, diagnostics, and prognostics applications. Data curation is a time-consuming process for a large volume of data, making it challenging and expensive to deploy data analytics platforms in new environments. Transfer learning (TL) mechanisms promise to mitigate data sparsity and model complexity by utilizing pre-trained models for a new task. Despite the triumph of TL in fields like computer vision and natural language processing, efforts on complex ST models for anomaly detection (AD) applications are limited. In this study, we present the potential of TL within the context of high-dimensional ST AD with a hybrid autoencoder architecture, incorporating convolutional, graph, and recurrent neural networks. Motivated by the need for improved model accuracy and robustness, particularly in scenarios with limited training data on systems with thousands of sensors, this research investigates the transferability of models trained on different sections of the Hadron Calorimeter of the Compact Muon Solenoid experiment at CERN. The key contributions of the study include exploring TL’s potential and limitations within the context of encoder and decoder networks, revealing insights into model initialization and training configurations that enhance performance while substantially reducing trainable parameters and mitigating data contamination effects. Full article
(This article belongs to the Special Issue AI-Assisted Condition Monitoring and Fault Diagnosis)
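A generic sketch of the encoder/decoder transfer setup discussed in the abstract, assuming a simple autoencoder, a hypothetical pretrained checkpoint, and a freeze-the-encoder policy; the paper's hybrid convolutional-graph-recurrent architecture is not reproduced here.

```python
# Reuse a pretrained encoder from one detector section and fine-tune only the
# decoder on data from a new section; reconstruction error serves as anomaly score.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
decoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 128))

# encoder.load_state_dict(torch.load("pretrained_encoder.pt"))  # hypothetical checkpoint
for p in encoder.parameters():          # keep the transferred encoder frozen
    p.requires_grad = False

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)
x = torch.randn(16, 128)                # stand-in for flattened sensor windows
recon = decoder(encoder(x))
loss = nn.functional.mse_loss(recon, x)
loss.backward()
optimizer.step()
print(float(loss))
```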

24 pages, 1736 KiB  
Article
ProFusion: Multimodal Prototypical Networks for Few-Shot Learning with Feature Fusion
by Jia Zhao, Ziyang Cao, Huiling Wang, Xu Wang and Yingzhou Chen
Symmetry 2025, 17(5), 796; https://doi.org/10.3390/sym17050796 - 20 May 2025
Viewed by 791
Abstract
Existing few-shot learning models leverage vision-language pre-trained models to alleviate the data scarcity problem. However, such models usually process visual and text information separately, which leaves inherent disparities between cross-modal features. Therefore, we propose the ProFusion model, which leverages multimodal pre-trained models and prototypical networks to construct multiple prototypes. Specifically, ProFusion generates image and text prototypes symmetrically using the visual encoder and text encoder, while integrating visual and text information through the fusion module to create more expressive multimodal feature fusion prototypes. Additionally, we introduce the alignment module to ensure consistency between image and text prototypes. During inference, ProFusion calculates the similarity of test images to the three types of prototypes separately and applies a weighted sum to generate the final prediction. Experiments demonstrate that ProFusion achieves outstanding classification performance on 15 benchmark datasets. Full article
(This article belongs to the Special Issue Symmetry and Asymmetry in Computer Vision and Graphics)
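The inference step (a weighted sum of similarities to image, text, and fused prototypes) can be sketched as follows; the cosine similarity and the weights are illustrative assumptions.

```python
# Compare a test image embedding to three prototype sets and combine the
# similarities with a weighted sum to pick a class.
import torch
import torch.nn.functional as F

def predict(img_emb, proto_img, proto_txt, proto_fused, w=(0.4, 0.3, 0.3)):
    # Each prototype tensor: (num_classes, dim); img_emb: (batch, dim).
    sims = [F.normalize(img_emb, dim=-1) @ F.normalize(p, dim=-1).t()
            for p in (proto_img, proto_txt, proto_fused)]
    logits = w[0] * sims[0] + w[1] * sims[1] + w[2] * sims[2]
    return logits.argmax(dim=-1)

img_emb = torch.randn(5, 512)
protos = [torch.randn(10, 512) for _ in range(3)]   # 10 classes, placeholder prototypes
print(predict(img_emb, *protos))
```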

38 pages, 2033 KiB  
Article
DCAT: A Novel Transformer-Based Approach for Dynamic Context-Aware Image Captioning in the Tamil Language
by Jothi Prakash Venugopal, Arul Antran Vijay Subramanian, Manikandan Murugan, Gopikrishnan Sundaram, Marco Rivera and Patrick Wheeler
Appl. Sci. 2025, 15(9), 4909; https://doi.org/10.3390/app15094909 - 28 Apr 2025
Viewed by 587
Abstract
The task of image captioning in low-resource languages like Tamil is fraught with challenges due to limited linguistic resources and complex semantic structures. This paper addresses the problem of generating contextually and linguistically coherent captions in Tamil. We introduce the Dynamic Context-Aware Transformer (DCAT), a novel approach that synergizes the Vision Transformer (ViT) with the Generative Pre-trained Transformer (GPT-3), reinforced by a unique Context Embedding Layer. The DCAT model, tailored for Tamil, innovatively employs dynamic attention mechanisms during its Initialization, Training, and Inference phases to focus on pertinent visual and textual elements. Our method distinctively leverages the nuances of Tamil syntax and semantics, a novelty in the realm of low-resource language image captioning. Comparative evaluations against established models on datasets like Flickr8k, Flickr30k, and MSCOCO reveal DCAT’s superiority, with a notable 12% increase in BLEU score (0.7425) and a 15% enhancement in METEOR score (0.4391) over leading models. Despite its computational demands, DCAT sets a new benchmark for image captioning in Tamil, demonstrating potential applicability to other similar languages. Full article

19 pages, 17459 KiB  
Article
Enhancing Adversarial Defense via Brain Activity Integration Without Adversarial Examples
by Tasuku Nakajima, Keisuke Maeda, Ren Togo, Takahiro Ogawa and Miki Haseyama
Sensors 2025, 25(9), 2736; https://doi.org/10.3390/s25092736 - 25 Apr 2025
Viewed by 505
Abstract
Adversarial attacks on large-scale vision–language foundation models, such as the contrastive language–image pretraining (CLIP) model, can significantly degrade performance across various tasks by generating adversarial examples that are indistinguishable from the original images to human perception. Although adversarial training methods, which train models with adversarial examples, have been proposed to defend against such attacks, they typically require prior knowledge of the attack. These methods also lead to a trade-off between robustness to adversarial examples and accuracy for clean images. To address these challenges, we propose an adversarial defense method based on human brain activity data by hypothesizing that such adversarial examples are not misrecognized by humans. The proposed method employs an encoder that integrates the features of brain activity and augmented images from the original images. Then, by maximizing the similarity between features predicted by the encoder and the original visual features, we obtain features with the visual invariance of the human brain and the diversity of data augmentation. Consequently, we construct a model that is robust against adversarial attacks and maintains accuracy for clean images. Unlike existing methods, the proposed method is not trained on any specific adversarial attack information; thus, it is robust against unknown attacks. Extensive experiments demonstrate that the proposed method significantly enhances robustness to adversarial attacks on the CLIP model without degrading accuracy for clean images. The primary contribution of this study is that the performance trade-off can be overcome using brain activity data. Full article
(This article belongs to the Section Biomedical Sensors)
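A toy sketch of the similarity-maximization objective described here, with placeholder feature dimensions and a linear encoder standing in for the paper's model.

```python
# Maximize cosine similarity between features predicted from brain activity plus
# augmented images and the original visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(1024 + 512, 512)     # consumes concatenated brain + augmented-image features

brain_feat = torch.randn(8, 1024)        # stand-in for recorded brain-activity features
aug_img_feat = torch.randn(8, 512)       # features of augmented views of the original image
orig_visual_feat = torch.randn(8, 512)   # features of the clean original image

pred = encoder(torch.cat([brain_feat, aug_img_feat], dim=-1))
loss = 1 - F.cosine_similarity(pred, orig_visual_feat, dim=-1).mean()  # maximize similarity
loss.backward()
print(float(loss))
```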

31 pages, 3985 KiB  
Article
Receipt Recognition Technology Driven by Multimodal Alignment and Lightweight Sequence Modeling
by Jin-Ming Yu, Hui-Jun Ma and Jian-Lei Kong
Electronics 2025, 14(9), 1717; https://doi.org/10.3390/electronics14091717 - 23 Apr 2025
Viewed by 660
Abstract
With the rapid advancement of global digital transformation, enterprises and financial institutions face increasing challenges in managing and processing receipt-like financial documents. Traditional manual document processing methods can no longer meet the demands of modern office operations and business expansion. To address these issues, automated document recognition systems based on computer vision and deep learning technologies have emerged. This paper proposes a receipt recognition technology based on multimodal alignment and lightweight sequence modeling, integrating the CLIP (Contrastive Language-Image Pretraining) and Bidirectional Gated Recurrent Unit (BiGRU) framework. The framework aims to achieve synergistic optimization of image and text information through semantic correction. By leveraging dynamic threshold classification, geometric regression loss, and multimodal feature alignment, the framework significantly improves text detection and recognition accuracy in complex layouts and low-quality images. Experimental results show that the model achieves a detection F1 score of 93.1% and a Character Error Rate (CER) of 5.1% on the CORD dataset. Through a three-stage compression strategy of quantization, pruning, and distillation, the model size is reduced to 18 MB, achieving real-time inference speeds of 25 FPS on the Jetson AGX Orin edge device, with power consumption stabilized below 12 W. This framework provides an efficient, accurate, and edge-computing-friendly solution for automated receipt processing. Practical implications include its potential to enhance the efficiency of financial audits, improve tax compliance, and streamline the operational management of financial institutions, making it a valuable tool for real-world applications in receipt automation. Full article
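As a rough illustration of the lightweight sequence-modeling half, the sketch below pairs a bidirectional GRU with a CTC head over per-slice visual features; the CTC head and all sizes are assumptions, not the paper's exact recognizer.

```python
# BiGRU over per-slice visual features with a CTC head for text recognition.
import torch
import torch.nn as nn

class BiGRURecognizer(nn.Module):
    def __init__(self, feat_dim=256, hidden=128, num_chars=96):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, num_chars + 1)  # +1 for the CTC blank

    def forward(self, feats):                # feats: (batch, time, feat_dim)
        out, _ = self.gru(feats)
        return self.head(out).log_softmax(-1)

model = BiGRURecognizer()
feats = torch.randn(2, 40, 256)              # stand-in for per-slice visual features
log_probs = model(feats).permute(1, 0, 2)    # CTC expects (time, batch, classes)
targets = torch.randint(1, 96, (2, 12))
loss = nn.CTCLoss(blank=96)(log_probs, targets,
                            torch.full((2,), 40, dtype=torch.long),
                            torch.full((2,), 12, dtype=torch.long))
print(float(loss))
```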

17 pages, 2758 KiB  
Article
History-Aware Multimodal Instruction-Oriented Policies for Navigation Tasks
by Renas Mukhametzianov and Hidetaka Nambo
AI 2025, 6(4), 75; https://doi.org/10.3390/ai6040075 - 11 Apr 2025
Viewed by 741
Abstract
The rise of large-scale language models and multimodal transformers has enabled instruction-based policies, such as vision-and-language navigation. To leverage their general world knowledge, we propose multimodal annotations for action options and support selection from a dynamic, describable action space. Our framework employs a multimodal transformer that processes front-facing camera images, light detection and ranging (LIDAR) sensor’s point clouds, and tasks as textual instructions to produce a history-aware decision policy for mobile robot navigation. Our approach leverages a pretrained vision–language encoder and integrates it with a custom causal generative pretrained transformer (GPT) decoder to predict action sequences within a state–action history. We propose a trainable attention score mechanism to efficiently select the most suitable action from a variable set of possible options. Action options are text–image pairs and encoded using the same multimodal encoder employed for environment states. This approach of annotating and dynamically selecting actions is applicable to broader multidomain decision-making tasks. We compared two baseline models, ViLT (vision-and-language transformer) and FLAVA (foundational language and vision alignment), and found that FLAVA achieves superior performance within the constraints of 8 GB video memory usage in the training phase. Experiments were conducted in both simulated and real-world environments using our custom datasets for instructed task completion episodes, demonstrating strong prediction accuracy. These results highlight the potential of multimodal, dynamic action spaces for instruction-based robot navigation and beyond. Full article
(This article belongs to the Section AI in Autonomous Systems)
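The trainable attention-score selection over a variable set of action options might be sketched as below; the dot-product scoring and dimensions are assumptions.

```python
# Project the fused state embedding to a query and each encoded action option to
# a key; a scaled dot product scores a variable-size set of options.
import torch
import torch.nn as nn

class ActionSelector(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, state_emb, option_embs):
        # state_emb: (dim,), option_embs: (num_options, dim) -- num_options may vary.
        scores = self.k(option_embs) @ self.q(state_emb) / option_embs.shape[-1] ** 0.5
        return scores.softmax(dim=0)             # distribution over available options

selector = ActionSelector()
state = torch.randn(512)                          # fused image + LIDAR + instruction embedding
options = torch.randn(5, 512)                     # text-image pair embeddings of 5 candidate actions
probs = selector(state, options)
print(int(probs.argmax()), float(probs.sum()))
```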

19 pages, 1054 KiB  
Article
Enhanced BLIP-2 Optimization Using LoRA for Generating Dashcam Captions
by Minjun Cho, Sungwoo Kim, Dooho Choi and Yunsick Sung
Appl. Sci. 2025, 15(7), 3712; https://doi.org/10.3390/app15073712 - 28 Mar 2025
Viewed by 2106
Abstract
Autonomous driving technology has advanced significantly. However, it is challenging to accurately generate captions for driving environment scenes, which involve dynamic elements such as vehicles, traffic signals, road conditions, weather, and the time of day. Capturing these elements accurately is important for improving situational awareness in autonomous systems. Driving environment scene captioning is an important part of generating driving scenarios and enhancing the interpretability of autonomous systems. However, traditional vision–language models struggle with domain adaptation since autonomous driving datasets with detailed captions of dashcam-recorded scenes are limited and they cannot effectively capture diverse driving environment factors. In this paper, we propose an enhanced method based on bootstrapping language-image pre-training with frozen vision encoders and large language models (BLIP-2) to optimize domain adaptation by improving scene captioning in autonomous driving environments. It comprises two steps: (1) transforming structured dataset labels into descriptive captions in natural language using a large language model (LLM), and (2) optimizing the Q-Former in a BLIP-2 module with low-rank adaptation (LoRA) to achieve efficient domain adaptation. The structured dataset labels are originally stored in JSON format, where driving environment scene factors are encoded as key-value pairs that represent attributes such as the object type, position, and state. Using the Large-Scale Diverse Driving Video Database (BDD-100K), our method significantly improves performance, achieving BLEU-4, CIDEr, and SPICE scores that were each approximately 1.5 times those of the baseline BLIP-2. Higher scores show the effectiveness of LoRA-based optimization and, hence, its suitability for autonomous driving applications. Our method effectively enhances accuracy, contextual relevance, and interpretability, contributing to improved scene understanding in autonomous driving systems. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
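Step (2), attaching LoRA adapters to the Q-Former, could look roughly like this with the peft library; the checkpoint name, target modules, and rank/alpha values are assumptions rather than the paper's configuration.

```python
# Hedged sketch: wrap the Q-Former's attention projections with LoRA adapters.
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query", "value"],   # Q-Former attention projections (BERT-style names)
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()       # only the low-rank adapters are trainable
```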

27 pages, 10045 KiB  
Article
Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding
by Mohammed Elhenawy, Huthaifa I. Ashqar, Andry Rakotonirainy, Taqwa I. Alhadidi, Ahmed Jaber and Mohammad Abu Tami
Electronics 2025, 14(7), 1282; https://doi.org/10.3390/electronics14071282 - 24 Mar 2025
Cited by 2 | Viewed by 3013
Abstract
Scene understanding is essential for enhancing driver safety, generating human-centric explanations for Automated Vehicle (AV) decisions, and leveraging Artificial Intelligence (AI) for retrospective driving video analysis. This study developed a dynamic scene retrieval system using Contrastive Language–Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices. The proposed system outperforms state-of-the-art in-context learning methods, including the zero-shot capabilities of GPT-4o, particularly in complex scenarios. By conducting frame-level analyses on the Honda Scenes Dataset, which contains a collection of about 80 h of annotated driving videos capturing diverse real-world road and weather conditions, our study highlights the robustness of CLIP models in learning visual concepts from natural language supervision. The results also showed that fine-tuning the CLIP models, such as ViT-L/14 (Vision Transformer) and ViT-B/32, significantly improved scene classification, achieving a top F1-score of 91.1%. These results demonstrate the ability of the system to deliver rapid and precise scene recognition, which can be used to meet the critical requirements of advanced driver assistance systems (ADASs). This study shows the potential of CLIP models to provide scalable and efficient frameworks for dynamic scene understanding and classification. Furthermore, this work lays the groundwork for advanced autonomous vehicle technologies by fostering a deeper understanding of driver behavior, road conditions, and safety-critical scenarios, marking a significant step toward smarter, safer, and more context-aware autonomous driving systems. Full article
(This article belongs to the Special Issue Intelligent Transportation Systems and Sustainable Smart Cities)
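A small sketch of CLIP-based scene classification in the spirit of this study, using a Hugging Face CLIP checkpoint and illustrative scene prompts; the prompts, checkpoint, and zero-shot setup are not the study's exact configuration.

```python
# Encode candidate scene descriptions and a frame, then pick the most similar text.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

scenes = ["a highway in clear weather", "a city street at night in the rain",
          "a snowy rural road", "a construction zone"]
frame = Image.new("RGB", (224, 224))             # placeholder for a dashcam frame

inputs = processor(text=scenes, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image    # image-text similarity scores
print(scenes[int(logits.argmax())])
```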

20 pages, 20407 KiB  
Article
VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection
by Andrea Appiani and Cigdem Beyan
Information 2025, 16(3), 233; https://doi.org/10.3390/info16030233 - 16 Mar 2025
Viewed by 1577
Abstract
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a Generative Large Multimodal Model, i.e., the Large Language and Vision Assistant (LLaVA). Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets. Full article
(This article belongs to the Special Issue Application of Machine Learning in Human Activity Recognition)
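The fusion of the CLIP visual embedding with the embedding of the LLaVA-generated description could be sketched as a small MLP head; the dimensions and head design are assumptions.

```python
# Concatenate a visual embedding of the upper-body crop and a text embedding of
# the generated description, then predict a speaking / not-speaking probability.
import torch
import torch.nn as nn

class VADFusionHead(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, visual_emb, text_emb):
        return torch.sigmoid(self.mlp(torch.cat([visual_emb, text_emb], dim=-1)))

head = VADFusionHead()
visual_emb = torch.randn(8, 512)   # CLIP visual-encoder embeddings of video segments
text_emb = torch.randn(8, 512)     # CLIP text embeddings of LLaVA descriptions
print(head(visual_emb, text_emb).squeeze(-1))   # per-segment speaking probability
```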

29 pages, 549 KiB  
Review
Generative Models in Medical Visual Question Answering: A Survey
by Wenjie Dong, Shuhao Shen, Yuqiang Han, Tao Tan, Jian Wu and Hongxia Xu
Appl. Sci. 2025, 15(6), 2983; https://doi.org/10.3390/app15062983 - 10 Mar 2025
Cited by 1 | Viewed by 3892
Abstract
Medical Visual Question Answering (MedVQA) is a crucial intersection of artificial intelligence and healthcare. It enables systems to interpret medical images—such as X-rays, MRIs, and pathology slides—and respond to clinical queries. Early approaches primarily relied on discriminative models, which select answers from predefined candidates. However, these methods struggle to effectively address open-ended, domain-specific, or complex queries. Recent advancements have shifted the focus toward generative models, leveraging autoregressive decoders, large language models (LLMs), and multimodal large language models (MLLMs) to generate more nuanced and free-form answers. This review comprehensively examines the paradigm shift from discriminative to generative systems, analyzing generative MedVQA works in terms of their model architectures and training processes, summarizing evaluation benchmarks and metrics, and highlighting key advances and techniques that propel the development of generative MedVQA, such as concept alignment, instruction tuning, and parameter-efficient fine-tuning (PEFT), alongside strategies for data augmentation and automated dataset creation. Finally, we propose future directions to enhance clinical reasoning and interpretability, build robust evaluation benchmarks and metrics, and employ scalable training strategies and deployment solutions. By analyzing the strengths and limitations of existing generative MedVQA approaches, we aim to provide valuable insights for researchers and practitioners working in this domain. Full article
(This article belongs to the Special Issue Feature Review Papers in "Computing and Artificial Intelligence")

16 pages, 3396 KiB  
Article
Parameter-Efficient Adaptation of Large Vision–Language Models for Video Memorability Prediction
by Iván Martín-Fernández, Sergio Esteban-Romero, Fernando Fernández-Martínez and Manuel Gil-Martín
Sensors 2025, 25(6), 1661; https://doi.org/10.3390/s25061661 - 7 Mar 2025
Viewed by 1295
Abstract
The accurate modelling of video memorability, or the intrinsic properties that render a piece of audiovisual content more likely to be remembered, will facilitate the development of automatic systems that are more efficient in retrieving, classifying and generating impactful media. Recent studies have indicated a strong correlation between the visual semantics of video and its memorability. This underscores the importance of developing advanced visual comprehension abilities to enhance model performance. Large Vision–Language Models (LVLMs) have been shown to exhibit exceptional proficiency in generalist, high-level semantic comprehension of images and video, due to their extensive multimodal pre-training on a vast scale. This work makes use of the vast generalist knowledge of LVLMs and explores efficient adaptation techniques with a view to utilising them as memorability predictors. In particular, the Quantized Low-Rank Adaptation (QLoRA) technique is employed to fine-tune the Qwen-VL model with memorability-related data extracted from the Memento10k dataset. In light of existing research, we propose a particular methodology that transforms Qwen-VL from a language model to a memorability score regressor. Furthermore, we consider the influence of selecting appropriate LoRA hyperparameters, a design aspect that has been insufficiently studied. We validate the LoRA rank and alpha hyperparameters using 5-Fold Cross-Validation and evaluate our best configuration on the official testing portion of the Memento10k dataset, obtaining a state-of-the-art Spearman Rank Correlation Coefficient (SRCC) of 0.744. Consequently, this work represents a significant advancement in modelling video memorability through high-level semantic understanding. Full article
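A toy sketch of turning a language-model backbone into a score regressor by pooling hidden states and adding a regression head; the backbone here is a placeholder rather than Qwen-VL, and QLoRA fine-tuning is omitted.

```python
# Pool the backbone's final hidden states and map them to a single memorability value.
import torch
import torch.nn as nn

class MemorabilityRegressor(nn.Module):
    def __init__(self, backbone_dim=1024):
        super().__init__()
        self.backbone = nn.TransformerEncoder(            # stand-in for the LVLM
            nn.TransformerEncoderLayer(backbone_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(backbone_dim, 1)

    def forward(self, tokens):                # tokens: (batch, seq, dim) multimodal embeddings
        hidden = self.backbone(tokens)
        pooled = hidden.mean(dim=1)           # mean-pool over the sequence
        return self.head(pooled).squeeze(-1)  # scalar memorability score per video

model = MemorabilityRegressor()
print(model(torch.randn(2, 32, 1024)))
```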