Search Results (28)

Search Parameters:
Keywords = multimodal large language models (MLLMs)

18 pages, 2335 KiB  
Article
MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models
by Angus Fung, Aaron Hao Tan, Haitong Wang, Bensiyon Benhabib and Goldie Nejat
Robotics 2025, 14(8), 102; https://doi.org/10.3390/robotics14080102 - 28 Jul 2025
Viewed by 230
Abstract
Robotic search for people in human-centered environments, including healthcare settings, is challenging because autonomous robots must locate people without complete, or any, prior knowledge of their schedules, plans, or locations. Furthermore, robots need to adapt to real-time events that can influence a person’s plan in an environment. In this paper, we present MLLM-Search, a novel zero-shot person search architecture that leverages multimodal large language models (MLLMs) to address the mobile robot problem of searching for a person under event-driven scenarios with varying user schedules. Our approach introduces a novel visual prompting method that provides robots with spatial understanding of the environment by generating a spatially grounded waypoint map, representing navigable waypoints with a topological graph and regions with semantic labels. This map is incorporated into an MLLM with a region planner, which selects the next search region based on its semantic relevance to the search scenario, and a waypoint planner, which generates a search path by considering semantically relevant objects and the local spatial context through our unique spatial chain-of-thought prompting approach. Extensive 3D photorealistic experiments were conducted to validate the performance of MLLM-Search in searching for a person with a changing schedule in different environments. An ablation study was also conducted to validate the main design choices of MLLM-Search. Furthermore, a comparison study with state-of-the-art search methods demonstrated that MLLM-Search outperforms existing methods with respect to search efficiency. Real-world experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search generalizes to new and unseen environments.
(This article belongs to the Section Intelligent Robots and Mechatronics)
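A minimal Python sketch of the two-stage planning loop the MLLM-Search abstract describes: a spatially grounded waypoint map kept as a topological graph with semantic region labels, a region planner that asks the MLLM for the most relevant region, and a waypoint planner that prompts for a spatial chain of thought over local waypoints. The data structures and the query_mllm helper are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: the waypoint map, planners, and query_mllm helper are
# hypothetical stand-ins for the components described in the MLLM-Search abstract.
from dataclasses import dataclass, field

@dataclass
class Waypoint:
    wp_id: str
    region: str                                    # semantic region label, e.g. "nurse station"
    neighbors: list = field(default_factory=list)  # adjacent waypoint ids in the topological graph

def query_mllm(prompt: str) -> str:
    """Placeholder for a call to a multimodal large language model."""
    raise NotImplementedError

def select_next_region(regions: list, scenario: str) -> str:
    # Region planner: pick the region most semantically relevant to the search scenario.
    prompt = (
        f"Search scenario: {scenario}\n"
        f"Candidate regions: {', '.join(regions)}\n"
        "Name the single region where the person is most likely to be right now."
    )
    return query_mllm(prompt).strip()

def plan_waypoints(graph: dict, region: str, scenario: str) -> list:
    # Waypoint planner: spatial chain-of-thought over the waypoints inside the chosen region.
    local = {w.wp_id: w.neighbors for w in graph.values() if w.region == region}
    prompt = (
        f"Scenario: {scenario}\nRegion: {region}\nWaypoints and neighbors: {local}\n"
        "Reason step by step about nearby objects and local layout, then output the "
        "waypoints to visit, in order, as a comma-separated list."
    )
    return [w.strip() for w in query_mllm(prompt).split(",")]
```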

17 pages, 1603 KiB  
Perspective
A Perspective on Quality Evaluation for AI-Generated Videos
by Zhichao Zhang, Wei Sun and Guangtao Zhai
Sensors 2025, 25(15), 4668; https://doi.org/10.3390/s25154668 - 28 Jul 2025
Viewed by 148
Abstract
Recent breakthroughs in AI-generated content (AIGC) have transformed video creation, empowering systems to translate text, images, or audio into visually compelling stories. Yet reliable evaluation of these machine-crafted videos remains elusive, because quality is governed not only by spatial fidelity within individual frames but also by temporal coherence across frames and precise semantic alignment with the intended message. The foundational role of sensor technologies is critical, as they determine the physical plausibility of AIGC outputs. In this perspective, we argue that multimodal large language models (MLLMs) are poised to become the cornerstone of next-generation video quality assessment (VQA). By jointly encoding cues from multiple modalities such as vision, language, sound, and even depth, MLLMs can leverage their powerful language understanding capabilities to assess the quality of scene composition, motion dynamics, and narrative consistency, overcoming the fragmentation of hand-engineered metrics and the poor generalization ability of CNN-based methods. Furthermore, we provide a comprehensive analysis of current methodologies for assessing AIGC video quality, including the evolution of generation models, dataset design, quality dimensions, and evaluation frameworks. We argue that advances in sensor fusion enable MLLMs to combine low-level physical constraints with high-level semantic interpretations, further enhancing the accuracy of visual quality assessment.
(This article belongs to the Special Issue Perspectives in Intelligent Sensors and Sensing Systems)
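As a toy illustration of the three quality axes this perspective emphasizes (per-frame spatial fidelity, cross-frame temporal coherence, and text-video semantic alignment), sub-scores could be combined into one video quality score as below; the weights and scoring functions are purely illustrative assumptions, not a metric from the paper.

```python
# Toy aggregation of the three quality dimensions discussed in the perspective.
# The sub-scorers and weights are illustrative assumptions, not the authors' metric.
def video_quality(frame_scores, coherence_score, alignment_score,
                  w_spatial=0.4, w_temporal=0.3, w_semantic=0.3):
    spatial = sum(frame_scores) / len(frame_scores)   # mean per-frame spatial fidelity in [0, 1]
    return (w_spatial * spatial
            + w_temporal * coherence_score            # temporal coherence across frames in [0, 1]
            + w_semantic * alignment_score)           # semantic alignment with the prompt in [0, 1]

print(video_quality([0.8, 0.7, 0.9], coherence_score=0.6, alignment_score=0.75))
```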

25 pages, 732 KiB  
Article
Accuracy-Aware MLLM Task Offloading and Resource Allocation in UAV-Assisted Satellite Edge Computing
by Huabing Yan, Hualong Huang, Zijia Zhao, Zhi Wang and Zitian Zhao
Drones 2025, 9(7), 500; https://doi.org/10.3390/drones9070500 - 16 Jul 2025
Viewed by 323
Abstract
This paper presents a novel framework for optimizing multimodal large language model (MLLM) inference through task offloading and resource allocation in UAV-assisted satellite edge computing (SEC) networks. MLLMs leverage transformer architectures to integrate heterogeneous data modalities for IoT applications, particularly real-time monitoring in remote areas. However, dependency on cloud computing introduces latency, bandwidth, and privacy challenges, while the limitations of IoT devices (IoTDs) call for efficient distributed computing solutions. SEC, which uses low-earth-orbit (LEO) satellites and unmanned aerial vehicles (UAVs), extends mobile edge computing to provide ubiquitous computational resources for remote IoTDs. We formulate the joint optimization of MLLM task offloading and resource allocation as a mixed-integer nonlinear programming (MINLP) problem, minimizing latency and energy consumption while optimizing offloading decisions, power allocation, and UAV trajectories. To handle the dynamic SEC environment characterized by satellite mobility, we propose an action-decoupled soft actor–critic (AD-SAC) algorithm with a discrete–continuous hybrid action space. Simulation results demonstrate that our approach significantly outperforms conventional deep reinforcement learning baselines in both convergence and system cost reduction.
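A schematic sketch of the kind of hybrid action and system cost the abstract refers to: a discrete offloading decision paired with continuous power and trajectory variables, and a weighted sum of latency and energy as the objective. The field names, weights, and units are illustrative assumptions, not the paper's exact MINLP formulation.

```python
# Schematic only: a hybrid (discrete + continuous) action and a weighted latency/energy cost,
# in the spirit of the MINLP objective described in the abstract.
from dataclasses import dataclass

@dataclass
class HybridAction:
    offload_target: int          # discrete: 0 = local IoTD, 1 = UAV edge, 2 = LEO satellite
    tx_power_watts: float        # continuous: transmit power allocation
    uav_waypoint: tuple          # continuous: next UAV position (x, y, z)

def system_cost(latency_s: float, energy_j: float,
                w_latency: float = 0.5, w_energy: float = 0.5) -> float:
    # Joint objective: minimize a weighted sum of task latency and energy consumption.
    return w_latency * latency_s + w_energy * energy_j

action = HybridAction(offload_target=1, tx_power_watts=0.2, uav_waypoint=(120.0, 45.0, 80.0))
print(system_cost(latency_s=0.8, energy_j=3.5))
```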

25 pages, 4948 KiB  
Review
A Review of Visual Grounding on Remote Sensing Images
by Ziyan Wang, Lei Liu, Gang Wan, Wei Zhang, Binjian Zhong, Haiyang Chang, Xinyi Li, Xiaoxuan Liu and Guangde Sun
Electronics 2025, 14(14), 2815; https://doi.org/10.3390/electronics14142815 - 13 Jul 2025
Viewed by 425
Abstract
Remote sensing visual grounding, a pivotal technology bridging natural language and high-resolution remote sensing images, holds significant application value in disaster monitoring, urban planning, and related fields. However, it faces critical challenges due to the inherent scale heterogeneity, semantic complexity, and annotation scarcity of remote sensing data. This paper first reviews the development history of remote sensing visual grounding and provides an overview of the basic background knowledge, including fundamental concepts, datasets, and evaluation metrics. It then categorizes methods by whether they employ large language models as a backbone, and provides in-depth analyses of the innovations and limitations of Transformer-based and multimodal large language model-based methods. Furthermore, focusing on the characteristics of remote sensing imagery, it discusses cutting-edge techniques such as cross-modal feature fusion, language-guided visual optimization, multi-scale and hierarchical feature processing, open-set expansion, and efficient fine-tuning. Finally, it outlines current bottlenecks and proposes valuable directions for future research. As the first comprehensive review dedicated to remote sensing visual grounding, this work serves as a reference resource for researchers to grasp domain-specific concepts and track the latest developments.

24 pages, 2351 KiB  
Review
Advancing Object Detection in Transportation with Multimodal Large Language Models (MLLMs): A Comprehensive Review and Empirical Testing
by Huthaifa I. Ashqar, Ahmed Jaber, Taqwa I. Alhadidi and Mohammed Elhenawy
Computation 2025, 13(6), 133; https://doi.org/10.3390/computation13060133 - 3 Jun 2025
Cited by 1 | Viewed by 1492
Abstract
This study comprehensively reviews and empirically evaluates the application of multimodal large language models (MLLMs) and large vision models (VLMs) in object detection for transportation systems. First, we provide background on the potential benefits of MLLMs in transportation applications and conduct a comprehensive review of current MLLM technologies in previous studies, highlighting their effectiveness and limitations in object detection across various transportation scenarios. Second, we provide an overview of a taxonomy of end-to-end object detection in transportation applications and future directions. Building on this, we propose an empirical analysis that tests MLLMs on three real-world transportation problems involving object detection, namely road safety attribute extraction, safety-critical event detection, and visual reasoning over thermal images. Our findings provide a detailed assessment of MLLM performance, uncovering both strengths and areas for improvement. Finally, we discuss practical limitations and challenges of MLLMs in enhancing object detection in transportation, thereby offering a roadmap for future research and development in this critical area.
(This article belongs to the Special Issue Object Detection Models for Transportation Systems)
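A hedged sketch of the kind of empirical test this review describes: sending a roadway image and a structured prompt to an MLLM and parsing attribute labels from its reply. The attribute list, prompt wording, and call_mllm helper are assumptions for illustration, not the study's actual protocol.

```python
# Illustrative only: the attribute list, prompt, and call_mllm stub are assumptions,
# not the study's actual evaluation protocol.
import base64
import json

def call_mllm(prompt: str, image_b64: str) -> str:
    """Placeholder for an API call to a multimodal model that accepts an image."""
    raise NotImplementedError

def extract_road_safety_attributes(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "From this roadway image, return JSON with the keys "
        "'lane_markings', 'lighting', 'signage_visible', and 'pedestrians_present'."
    )
    return json.loads(call_mllm(prompt, image_b64))
```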

19 pages, 1840 KiB  
Article
Facial Analysis for Plastic Surgery in the Era of Artificial Intelligence: A Comparative Evaluation of Multimodal Large Language Models
by Syed Ali Haider, Srinivasagam Prabha, Cesar A. Gomez-Cabello, Sahar Borna, Ariana Genovese, Maissa Trabilsy, Adekunle Elegbede, Jenny Fei Yang, Andrea Galvao, Cui Tao and Antonio Jorge Forte
J. Clin. Med. 2025, 14(10), 3484; https://doi.org/10.3390/jcm14103484 - 16 May 2025
Viewed by 872
Abstract
Background/Objectives: Facial analysis is critical for preoperative planning in facial plastic surgery, but traditional methods can be time-consuming and subjective. This study investigated the potential of Artificial Intelligence (AI) for objective and efficient facial analysis in plastic surgery, with a specific focus on Multimodal Large Language Models (MLLMs). We evaluated their ability to analyze facial skin quality, volume, symmetry, and adherence to aesthetic standards such as the neoclassical facial canons and the golden ratio. Methods: We evaluated four MLLMs—ChatGPT-4o, ChatGPT-4, Gemini 1.5 Pro, and Claude 3.5 Sonnet—using two evaluation forms and 15 diverse facial images generated by a Generative Adversarial Network (GAN). The general analysis form evaluated qualitative skin features (texture, type, thickness, wrinkling, photoaging, and overall symmetry). The facial ratios form assessed quantitative structural proportions, including division into equal fifths, adherence to the rule of thirds, and compatibility with the golden ratio. MLLM assessments were compared with evaluations from a plastic surgeon and with manual measurements of facial ratios. Results: The MLLMs showed promise in analyzing qualitative features but struggled with precise quantitative measurements of facial ratios. Mean accuracies for the general analysis were ChatGPT-4o (0.61 ± 0.49), Gemini 1.5 Pro (0.60 ± 0.49), ChatGPT-4 (0.57 ± 0.50), and Claude 3.5 Sonnet (0.52 ± 0.50). In the facial ratio assessments, scores were lower, with Gemini 1.5 Pro achieving the highest mean accuracy (0.39 ± 0.49). Inter-rater reliability, based on Cohen’s Kappa values, ranged from poor to high for qualitative assessments (κ > 0.7 for some questions) but was generally poor (near or below zero) for quantitative assessments. Conclusions: Current general-purpose MLLMs are not yet ready to replace manual clinical assessments but may assist in general facial feature analysis. This limitation may stem from challenges with spatial reasoning and fine-grained detail extraction, which are inherent to current MLLMs. These findings are based on testing models not specifically trained for facial analysis and serve to raise awareness among clinicians regarding the current capabilities and limitations of readily available MLLMs in this specialized domain. Future research should focus on enhancing the numerical accuracy and reliability of MLLMs for broader application in plastic surgery, potentially through improved training methods and integration with other AI technologies, such as specialized computer vision algorithms for precise landmark detection and measurement.
(This article belongs to the Special Issue Innovation in Hand Surgery)
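For intuition about the quantitative checks the facial ratios form covers, here is a small sketch that compares measured proportions against the equal-fifths, equal-thirds, and golden-ratio targets. The landmark distances, tolerance, and report layout are hypothetical, not the study's measurement protocol.

```python
# Hypothetical helper: checks measured facial proportions against classical targets
# (equal fifths, equal thirds, golden ratio). Inputs are illustrative distances in mm.
GOLDEN_RATIO = (1 + 5 ** 0.5) / 2   # ≈ 1.618

def proportion_report(fifth_widths, third_heights, face_length, face_width, tol=0.05):
    mean_fifth = sum(fifth_widths) / len(fifth_widths)
    mean_third = sum(third_heights) / len(third_heights)
    return {
        "equal_fifths": all(abs(w - mean_fifth) / mean_fifth <= tol for w in fifth_widths),
        "equal_thirds": all(abs(h - mean_third) / mean_third <= tol for h in third_heights),
        "golden_ratio": abs(face_length / face_width - GOLDEN_RATIO) / GOLDEN_RATIO <= tol,
    }

print(proportion_report([31, 30, 32, 30, 31], [61, 63, 62], face_length=186, face_width=118))
```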

29 pages, 8544 KiB  
Review
Innovative Approaches to Traffic Anomaly Detection and Classification Using AI
by Borja Pérez, Mario Resino, Teresa Seco, Fernando García and Abdulla Al-Kaff
Appl. Sci. 2025, 15(10), 5520; https://doi.org/10.3390/app15105520 - 15 May 2025
Viewed by 1501
Abstract
Video anomaly detection plays a crucial role in intelligent transportation systems by enhancing urban mobility and safety. This review provides a comprehensive analysis of recent advancements in artificial intelligence methods applied to traffic anomaly detection, including convolutional and recurrent neural networks (CNNs and RNNs), autoencoders, Transformers, generative adversarial networks (GANs), and multimodal large language models (MLLMs). We compare their performance across real-world applications, highlighting patterns such as the superiority of Transformer-based models in temporal context understanding and the growing use of multimodal inputs for robust detection. Key challenges identified include dependence on large labeled datasets, high computational costs, and limited model interpretability. The review outlines how recent research is addressing these issues through semi-supervised learning, model compression techniques, and explainable AI. We conclude with future directions focusing on scalable, real-time, and interpretable solutions for practical deployment.

15 pages, 3014 KiB  
Article
Leveraging Bird Eye View Video and Multimodal Large Language Models for Real-Time Intersection Control and Reasoning
by Sari Masri, Huthaifa I. Ashqar and Mohammed Elhenawy
Safety 2025, 11(2), 40; https://doi.org/10.3390/safety11020040 - 7 May 2025
Viewed by 1189
Abstract
Managing traffic flow through urban intersections is challenging. Conflicts among a mix of different vehicle types, combined with blind spots, make crashes relatively likely. This paper presents a new framework based on a fine-tuned Multimodal Large Language Model (MLLM), GPT-4o, that can control intersections in real time using bird’s-eye-view videos captured by drones. The fine-tuned GPT-4o model is used to reason logically and visually about traffic conflicts and to provide instructions to drivers, which helps create safer and more efficient traffic flow. To fine-tune and evaluate the model, we labeled a dataset comprising three months of drone videos, and their corresponding trajectories, recorded at a four-way intersection in Dresden, Germany. Preliminary results showed that the fine-tuned GPT-4o achieved an accuracy of about 77%, outperforming zero-shot baselines. When continuous video-frame sequences were used, performance increased to about 89% on a time-serialized dataset and about 90% on an unbalanced real-world dataset, demonstrating the model’s robustness under different conditions. Furthermore, manual evaluation by experts scored the usefulness of the explanations and recommendations predicted by the model: the model achieved average ratings of 8.99 out of 10 for explanations and 9.23 out of 10 for recommendations. The results demonstrate the advantages of combining MLLMs with structured prompts and temporal information for conflict detection, and they offer a flexible and robust prototype framework for improving the safety and effectiveness of uncontrolled intersections. The code and labeled dataset used in this study are publicly available (see Data Availability Statement).
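One way to picture the labeled training data this abstract mentions: a single fine-tuning record pairing a short frame sequence from the bird's-eye-view video with a conflict label, an explanation, and a driver recommendation. The JSON schema, field names, and message layout below are illustrative assumptions, not the released dataset format.

```python
# Illustrative fine-tuning record: the field names and message layout are assumptions,
# not the study's released dataset schema.
import json

record = {
    "messages": [
        {"role": "system", "content": "You monitor a 4-way intersection from drone video."},
        {"role": "user", "content": {
            "frames": ["frame_0120.jpg", "frame_0121.jpg", "frame_0122.jpg"],  # temporal context
            "question": "Is there a traffic conflict, and what should the drivers do?"
        }},
        {"role": "assistant", "content": {
            "conflict": True,
            "explanation": "The left-turning car and the oncoming cyclist reach the conflict "
                           "point within the same 2-second window.",
            "recommendation": "Left-turning vehicle should yield until the cyclist clears the lane."
        }},
    ]
}

print(json.dumps(record, indent=2))
```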

15 pages, 3852 KiB  
Article
Subjective Assessment of a Built Environment by ChatGPT, Gemini and Grok: Comparison with Architecture, Engineering and Construction Expert Perception
by Rachid Belaroussi
Big Data Cogn. Comput. 2025, 9(4), 100; https://doi.org/10.3390/bdcc9040100 - 14 Apr 2025
Cited by 2 | Viewed by 1197
Abstract
The emergence of Multimodal Large Language Models (MLLMs) has made artificial intelligence methods accessible to the general public in a conversational way. They offer urban planning professionals tools for the automated visual assessment of the quality of a built environment without requiring specific technical knowledge of computing. We investigated the capability of MLLMs to perceive urban environments based on images and textual prompts. We compared the outputs of several popular models—ChatGPT, Gemini and Grok—with the visual assessments of experts in Architecture, Engineering and Construction (AEC) in the context of a real estate construction project. Our analysis was based on subjective attributes proposed to characterize various aspects of a built environment. Four urban identities served as case studies, set in a virtual environment designed using professional 3D models. We found that human and AI evaluations can align on some aspects, such as space and scale and architectural style, with more general agreement in environments with vegetation. However, there were noticeable differences in response patterns between the AIs and the AEC experts, particularly concerning subjective aspects such as the general emotional resonance of specific urban identities. This raises questions regarding the hallucinations of generative AI, where the AI invents information and behaves creatively but its outputs are not accurate.
(This article belongs to the Special Issue Machine Learning and AI Technology for Sustainable Development)

30 pages, 452 KiB  
Article
Advancing Multimodal Large Language Models: Optimizing Prompt Engineering Strategies for Enhanced Performance
by Minjun Son and Sungjin Lee
Appl. Sci. 2025, 15(7), 3992; https://doi.org/10.3390/app15073992 - 4 Apr 2025
Cited by 2 | Viewed by 3042
Abstract
This study investigates prompt engineering (PE) strategies to mitigate hallucination, a key limitation of multimodal large language models (MLLMs). To address this issue, we explore five prominent multimodal PE techniques: in-context learning (ICL), chain of thought (CoT), step-by-step reasoning (SSR), tree of thought (ToT), and retrieval-augmented generation (RAG). These techniques are systematically applied across multiple datasets with distinct domains and characteristics. Based on the empirical findings, we propose the greedy prompt engineering strategy (Greedy PES), a methodology for optimizing PE application across different datasets and MLLMs. To evaluate user satisfaction with MLLM-generated responses, we adopt a comprehensive set of evaluation metrics, including BLEU, ROUGE, METEOR, S-BERT, MoverScore, and CIDEr, and introduce a weighted aggregate evaluation score to provide a holistic assessment of model performance under varying conditions. Experimental results demonstrate that the optimal prompt engineering strategy varies significantly with both dataset properties and the MLLM used. Specifically, datasets categorized as general benefit the most from ICL, ToT, and RAG, whereas mathematical datasets perform optimally with ICL, SSR, and ToT. In scientific reasoning tasks, RAG and SSR emerge as the most effective strategies. Applying Greedy PES leads to a substantial improvement in performance across different multimodal tasks, achieving average evaluation score improvements of 184.3% for general image captioning, 90.3% for mathematical visual question answering (VQA), and 49.1% for science VQA compared to conventional approaches. These findings highlight the effectiveness of structured PE strategies in optimizing MLLM performance and provide a robust framework for PE-driven model enhancement across diverse multimodal applications.
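A compact sketch of the two ingredients this abstract names: a weighted aggregate over several text metrics, and a greedy pass that keeps adding whichever prompt-engineering technique most improves that aggregate. The metric weights, the evaluate stub, and the stopping rule are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch of a greedy prompt-engineering search driven by a weighted aggregate score.
# The weights and the evaluate() stub are assumptions, not the paper's exact procedure.
TECHNIQUES = ["ICL", "CoT", "SSR", "ToT", "RAG"]
WEIGHTS = {"BLEU": 0.2, "ROUGE": 0.2, "METEOR": 0.2, "S-BERT": 0.2, "CIDEr": 0.2}

def aggregate(metric_scores: dict) -> float:
    # Weighted aggregate evaluation score over the individual metrics.
    return sum(WEIGHTS[m] * s for m, s in metric_scores.items())

def evaluate(dataset, techniques: list) -> dict:
    """Placeholder: run the MLLM on `dataset` with the given PE techniques, return metric scores."""
    raise NotImplementedError

def greedy_pes(dataset):
    chosen = []
    best = aggregate(evaluate(dataset, chosen))
    while True:
        candidates = [t for t in TECHNIQUES if t not in chosen]
        if not candidates:
            break
        scored = [(aggregate(evaluate(dataset, chosen + [t])), t) for t in candidates]
        top_score, top_t = max(scored)
        if top_score <= best:        # stop once no technique improves the aggregate score
            break
        best, chosen = top_score, chosen + [top_t]
    return chosen, best
```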

18 pages, 13300 KiB  
Article
IAACLIP: Image Aesthetics Assessment via CLIP
by Zhuo Li, Xingao Yan, Xuebin Wei and Feng Shao
Electronics 2025, 14(7), 1425; https://doi.org/10.3390/electronics14071425 - 1 Apr 2025
Viewed by 1504
Abstract
Aesthetics primarily focuses on the study of art, encompassing the aesthetic categories of beauty and ugliness as well as human aesthetic activities. Image Aesthetics Assessment (IAA) seeks to automatically evaluate the aesthetic quality of images by mimicking human perceptual mechanisms. Recently, researchers have increasingly explored using user comments to assist IAA tasks. However, human aesthetics are subjective, and individuals may have varying preferences for the same image, leading to diverse comments that can influence model decisions; moreover, in practical scenarios, user comments are often unavailable. Thus, this paper proposes a CLIP-based method for IAA (IAACLIP) using generated descriptions and prompts. First, leveraging multimodal large language models (MLLMs), we generate objective and consistent aesthetic descriptions (GADs) for images. Second, based on aesthetic images, labels, and GADs, we introduce a unified contrastive pre-training approach to transition the network from the general domain to the aesthetic domain. Lastly, we employ prompt templates for perceptual training to address the lack of real-world comments. Experimental validation on three mainstream IAA datasets demonstrates the effectiveness of the proposed method.
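To make the contrastive pre-training step concrete, here is a minimal CLIP-style symmetric contrastive loss between image embeddings and generated-description (GAD) embeddings; the encoders, batch construction, and temperature value are assumptions, not the IAACLIP implementation.

```python
# Sketch of a CLIP-style symmetric contrastive objective between image embeddings and
# generated aesthetic description (GAD) embeddings; encoder details are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize, then score every image against every description in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Matching image/description pairs sit on the diagonal of the logits matrix.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```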

15 pages, 5001 KiB  
Article
Reasoning-Driven Food Energy Estimation via Multimodal Large Language Models
by Hikaru Tanabe and Keiji Yanai
Nutrients 2025, 17(7), 1128; https://doi.org/10.3390/nu17071128 - 24 Mar 2025
Cited by 1 | Viewed by 1077
Abstract
Background/Objectives: Image-based food energy estimation is essential for user-friendly food tracking applications, enabling individuals to monitor their dietary intake through smartphones or AR devices. However, existing deep learning approaches struggle to recognize a wide variety of food items due to the labor-intensive nature of data annotation. Multimodal Large Language Models (MLLMs) possess extensive knowledge and human-like reasoning abilities, making them a promising approach for image-based food energy estimation. Nevertheless, their ability to accurately estimate food energy is hindered by limitations in recognizing food size, a critical factor in energy content assessment. Methods: To address this challenge, we propose two approaches: fine-tuning, and volume-aware reasoning with fine-grained estimation prompting. Results: Experimental results on the Nutrition5k dataset demonstrated the effectiveness of these approaches in improving estimation accuracy. We also validated the effectiveness of applying LoRA to enhance food energy estimation performance. Conclusions: These findings highlight the potential of MLLMs for image-based dietary assessment and emphasize the importance of integrating volume awareness into food energy estimation models.
(This article belongs to the Special Issue Human Nutrition Research in the Data Era)
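A rough sketch of what volume-aware reasoning with fine-grained estimation prompting could look like: the model is first asked for per-item portion volumes, then for per-item energy, and the pieces are summed. The prompt text and the query_mllm stub are assumptions, not the authors' exact prompts.

```python
# Illustrative volume-aware prompting sketch; prompts and the query_mllm stub are assumptions.
import json

def query_mllm(prompt: str, image_path: str) -> str:
    """Placeholder for a multimodal model call that sees the food image."""
    raise NotImplementedError

def estimate_energy(image_path: str) -> float:
    prompt = (
        "Step 1: list each food item in the image with its estimated volume in millilitres.\n"
        "Step 2: for each item, estimate its energy in kilocalories from that volume.\n"
        "Return JSON: [{\"item\": ..., \"volume_ml\": ..., \"kcal\": ...}, ...]"
    )
    items = json.loads(query_mllm(prompt, image_path))
    return sum(item["kcal"] for item in items)   # total energy of the plate
```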

20 pages, 4510 KiB  
Article
AutoPaperBench: An MLLM-Based Framework for Automatic Generation of Paper Understanding Evaluation Benchmarks
by Min-Woo Kim, Hyo-Bin Park, Hee-Jin Ahn, Woo-Ram Park, Jae-Wan Jeon, Kyong-Ha Lee, Ryong Lee and Dong-Geol Choi
Electronics 2025, 14(6), 1175; https://doi.org/10.3390/electronics14061175 - 17 Mar 2025
Viewed by 1071
Abstract
AutoPaperBench is a benchmark generation system that automatically evaluates how well a Multimodal Large Language Model (MLLM) understands research papers. The proposed system efficiently structures the content of a paper through semantic parsing and automatically generates text-based QA pairs and vision-based VQA pairs. To ensure the quality of the generated QA, we introduce a reviewer system that evaluates six criteria, such as logic and appropriateness. In our experiments on 60 research papers from the medical, natural science, and engineering fields, the generated benchmarks yield performance rankings comparable to those of previous benchmarks, and the performance improvements achieved through semantic parsing are validated. The system can run in a single-GPU environment and provides a framework for efficiently evaluating MLLM paper comprehension.
(This article belongs to the Section Artificial Intelligence)
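A minimal sketch of the generate-then-review flow this abstract outlines: QA pairs are produced for each parsed section, and a reviewer model scores them on six criteria, keeping only pairs above a threshold. Only "logic" and "appropriateness" are named in the abstract; the remaining criterion names, the threshold, and the model stubs are assumptions.

```python
# Illustrative generate-then-review pipeline; most criterion names, the threshold, and the
# generator/reviewer stubs are assumptions (the abstract names only logic and appropriateness).
CRITERIA = ["logic", "appropriateness", "relevance", "clarity", "difficulty", "answerability"]

def generate_qa(section_text: str) -> list:
    """Placeholder: ask an MLLM to produce (question, answer) pairs for one parsed section."""
    raise NotImplementedError

def review_qa(question: str, answer: str) -> dict:
    """Placeholder: ask a reviewer model to score the pair on each criterion in [0, 1]."""
    raise NotImplementedError

def build_benchmark(parsed_sections: list, threshold: float = 0.7) -> list:
    benchmark = []
    for section in parsed_sections:
        for question, answer in generate_qa(section):
            scores = review_qa(question, answer)
            # Keep the pair only if every criterion clears the threshold.
            if all(scores[c] >= threshold for c in CRITERIA):
                benchmark.append({"question": question, "answer": answer, "scores": scores})
    return benchmark
```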

29 pages, 549 KiB  
Review
Generative Models in Medical Visual Question Answering: A Survey
by Wenjie Dong, Shuhao Shen, Yuqiang Han, Tao Tan, Jian Wu and Hongxia Xu
Appl. Sci. 2025, 15(6), 2983; https://doi.org/10.3390/app15062983 - 10 Mar 2025
Cited by 1 | Viewed by 3892
Abstract
Medical Visual Question Answering (MedVQA) is a crucial intersection of artificial intelligence and healthcare. It enables systems to interpret medical images—such as X-rays, MRIs, and pathology slides—and respond to clinical queries. Early approaches primarily relied on discriminative models, which select answers from predefined candidates. However, these methods struggle to effectively address open-ended, domain-specific, or complex queries. Recent advancements have shifted the focus toward generative models, leveraging autoregressive decoders, large language models (LLMs), and multimodal large language models (MLLMs) to generate more nuanced and free-form answers. This review comprehensively examines the paradigm shift from discriminative to generative systems. It surveys generative MedVQA works with respect to their model architectures and training processes, summarizes evaluation benchmarks and metrics, and highlights key advances and techniques that propel the development of generative MedVQA, such as concept alignment, instruction tuning, and parameter-efficient fine-tuning (PEFT), alongside strategies for data augmentation and automated dataset creation. Finally, we propose future directions to enhance clinical reasoning and interpretability, build robust evaluation benchmarks and metrics, and employ scalable training strategies and deployment solutions. By analyzing the strengths and limitations of existing generative MedVQA approaches, we aim to provide valuable insights for researchers and practitioners working in this domain.
(This article belongs to the Special Issue Feature Review Papers in "Computing and Artificial Intelligence")

26 pages, 4614 KiB  
Article
A Multimodal Framework Embedding Retrieval-Augmented Generation with MLLMs for Eurobarometer Data
by George Papageorgiou, Vangelis Sarlis, Manolis Maragoudakis and Christos Tjortjis
AI 2025, 6(3), 50; https://doi.org/10.3390/ai6030050 - 3 Mar 2025
Cited by 1 | Viewed by 5353
Abstract
This study introduces a multimodal framework that integrates retrieval-augmented generation (RAG) with multimodal large language models (MLLMs) to enhance the accessibility, interpretability, and analysis of Eurobarometer survey data. Traditional approaches often struggle with the diverse formats and large-scale nature of these datasets, which include both textual and visual elements. The proposed framework leverages multimodal indexing and targeted retrieval to enable focused queries, trend analysis, and visualization across multiple survey editions, and the integration of MLLMs facilitates advanced synthesis of insights, providing a more comprehensive understanding of public opinion trends. The framework offers prospective benefits for different types of stakeholders, including policymakers, journalists, nongovernmental organizations (NGOs), researchers, and citizens, while highlighting the need for performance assessment to evaluate its effectiveness against specific business requirements and practical applications. Its modular design supports applications such as survey studies, comparative analyses, and domain-specific investigations, while its scalability and reproducibility make it suitable for e-governance and public sector deployment. The results indicate potential enhancements in data interpretation and analysis: stakeholders can not only use raw text data for knowledge extraction but also conduct image analysis based on indexed content, paving the way for informed policymaking and advanced research in the social sciences. At the same time, performance assessment is needed to validate the framework’s output and functionality based on the selected architectural components. Future research will explore expanded functionalities and real-time applications, ensuring the framework remains adaptable to evolving needs in public opinion analysis and multimodal data integration.
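A high-level sketch of the retrieval-augmented flow described here: survey text passages and chart images are embedded into a shared index, the top matches for a query are retrieved, and an MLLM synthesizes an answer over them. The embed/ask_mllm helpers and the index structure are illustrative assumptions, not the framework's actual components.

```python
# Illustrative multimodal RAG sketch; the embedding, index, and ask_mllm helpers are assumptions.
def embed(content, kind: str) -> list:
    """Placeholder: return a vector for a text passage (kind='text') or chart image (kind='image')."""
    raise NotImplementedError

def ask_mllm(question: str, context: list) -> str:
    """Placeholder: have an MLLM answer the question grounded in the retrieved items."""
    raise NotImplementedError

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

def answer(question: str, index: list, k: int = 5) -> str:
    # Each index entry: {"vector": [...], "payload": passage_or_image, "kind": "text" | "image"}
    q_vec = embed(question, kind="text")
    ranked = sorted(index, key=lambda e: cosine(q_vec, e["vector"]), reverse=True)
    return ask_mllm(question, [e["payload"] for e in ranked[:k]])
```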
