Search Results (90)

Search Parameters:
Keywords = VQA

31 pages, 15756 KB  
Article
PMA-VQA: Progressive Multi-Scale Feature Fusion with Spatially Adaptive Attention for Remote Sensing Visual Question Answering
by Yifei He, Chen Qiu and Jinguang Gu
Sensors 2026, 26(8), 2351; https://doi.org/10.3390/s26082351 - 10 Apr 2026
Abstract
Remote sensing visual question answering (RS-VQA) is essential to intelligent Earth observation, as it supports interactive querying of high-resolution aerial images. Many existing methods struggle with fine-detail geospatial reasoning in remote sensing (RS) scenes, owing to their intrinsic multi-scale object variance and pronounced spatial heterogeneity; such models tend to rely on linguistic priors rather than reasoning from visual evidence. In this paper, we present PMA-VQA, a progressive multi-scale feature fusion framework with spatially adaptive attention that grounds the RS-VQA task in spatially aware hierarchical feature integration. For hierarchical, multi-level, language-informed integration, we propose a spatial attention aggregation module (SAAM) and a progressive feature fusion and classification module (PFCM). The SAAM employs spatially adaptive gating to align cross-modal features with semantic context, while the PFCM integrates multi-scale representations spanning high-level semantic abstraction and low-level spatial detail. Experimental results on the RS-VQA LR and HR benchmarks show that PMA-VQA outperforms all competing methods in accuracy and robustness, and evaluation on HRVQA further confirms the effectiveness of the SAAM and PFCM across diverse RS scenes.
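
As a rough illustration of the kind of spatially adaptive gating SAAM describes, here is a minimal PyTorch sketch; the module structure, names, and shapes are assumptions for illustration, not the authors' implementation:

```python
# Hypothetical sketch of a spatially adaptive gating block in the spirit of SAAM.
# All names and shapes are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.gate = nn.Conv2d(vis_dim * 2, 1, kernel_size=1)  # per-location gate

    def forward(self, vis_feat: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, H, W) visual features; q_emb: (B, txt_dim) question embedding
        B, C, H, W = vis_feat.shape
        q = self.txt_proj(q_emb).view(B, C, 1, 1).expand(-1, -1, H, W)
        g = torch.sigmoid(self.gate(torch.cat([vis_feat, q], dim=1)))  # (B, 1, H, W)
        return vis_feat * g  # suppress question-irrelevant locations
```
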
(This article belongs to the Section Remote Sensors)
19 pages, 1466 KB  
Article
D2MNet: Difference-Aware Decoupling and Multi-Prompt Learning for Medical Difference Visual Question Answering
by Lingge Lai, Weihua Ou, Jianping Gou and Zhonghua Liu
J. Imaging 2026, 12(4), 162; https://doi.org/10.3390/jimaging12040162 - 9 Apr 2026
Viewed by 146
Abstract
Difference visual question answering (Diff-VQA) aims to answer questions by identifying and reasoning about differences between medical images. Existing methods often rely on simple feature subtraction or fusion to model image differences, overlooking the asymmetric descriptive requirements of changed and unchanged cases and providing limited task-specific guidance to pretrained language decoders. To address these limitations, we propose D2MNet (Difference-aware Decoupling and Multi-prompt Network), a framework for medical Diff-VQA that combines change-aware reasoning with prompt-guided answer generation. Specifically, a Change Analysis Module (CAM) predicts whether a change is present and produces a binary change-aware prompt; a Difference-Aware Module (DAM) uses dual attention to capture fine-grained difference features; and a multi-prompt learning mechanism (MLM) injects question-aware, change-aware, and learnable prompts into the language decoder to improve contextual alignment and response generation. Experiments on the MIMIC-DiffVQA benchmark show that D2MNet achieves a CIDEr score of 2.907 ± 0.040, outperforming the strongest baseline, ReAl (2.409), under the same evaluation setting. These results demonstrate the effectiveness of the proposed design on the medical Diff-VQA benchmark and suggest its potential for assisting difference-aware medical answer generation.
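
A minimal sketch of what a change-aware prompt along the lines of the described CAM could look like (shapes, names, and the prompt mechanism are assumptions):

```python
# Illustrative sketch of a binary change-aware prompt, loosely following the
# described CAM. Names and shapes are assumptions, not the authors' code.
import torch
import torch.nn as nn

class ChangeAnalysis(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)   # changed vs. unchanged
        self.prompts = nn.Embedding(2, dim)   # one learnable prompt per outcome

    def forward(self, main_feat: torch.Tensor, ref_feat: torch.Tensor):
        # main_feat, ref_feat: (B, N, dim) token features of the two images
        diff = (main_feat - ref_feat).mean(dim=1)   # (B, dim) pooled difference
        logits = self.classifier(diff)              # (B, 2) change prediction
        label = logits.argmax(dim=-1)               # binary change decision
        return logits, self.prompts(label)          # prompt fed to the decoder
```
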
(This article belongs to the Section Medical Imaging)

24 pages, 39455 KB  
Article
Information Bottleneck Scores for Identifying Causally Informative Attention Heads in Vision–Language Models
by Yiyou Zhang and Liyan Ma
Algorithms 2026, 19(3), 238; https://doi.org/10.3390/a19030238 - 23 Mar 2026
Viewed by 292
Abstract
Vision–language models (VLMs) have demonstrated remarkable performance on a wide range of multimodal reasoning tasks, yet their visual grounding mechanisms remain poorly understood and are often unreliable for fine-grained visual concepts. Existing approaches typically rely on raw attention maps or gradient-based saliency, which provide heuristic explanations but lack a causal interpretation of how visual evidence contributes to model predictions. In this paper, we propose an Information Bottleneck Score (IBS) framework that explicitly quantifies the causal importance of visual patches through interventional analysis. By masking candidate image patches and measuring the induced change in the model prediction, the IBS captures patch-level causal contributions rather than correlation-based signals. We further lift patch-level importance to the attention-head level by aggregating the IBS with text-to-image attention, enabling the identification of a small subset of information-transmitting attention heads responsible for visual grounding. Building on the selected heads, we construct refined importance maps that guide visual cropping in a fully training-free manner. Extensive experiments on multiple detail-sensitive benchmarks, including TextVQA, V*, POPE, and DocVQA, demonstrate consistent improvements in fine-grained visual understanding, while evaluations on general-purpose datasets such as GQA, AOKVQA, and VQAv2 confirm that overall reasoning performance is preserved. Additional ablation studies further validate the effectiveness of each component in the proposed framework. Overall, our work provides a causal perspective on visual grounding in VLMs and offers a model-agnostic, training-free approach for both interpreting and enhancing multimodal reasoning.
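
The interventional patch scoring is straightforward to sketch: mask one patch at a time and measure how much the answer distribution shifts. The model interface and divergence choice below are illustrative assumptions:

```python
# A minimal sketch of interventional patch scoring in the spirit of IBS:
# intervene on one patch at a time and measure the shift in the prediction.
# The model interface and the use of KL divergence are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def patch_scores(model, image, question, grid: int = 16) -> torch.Tensor:
    """Return a (grid, grid) tensor of per-patch importance scores."""
    base = F.log_softmax(model(image, question), dim=-1)      # reference prediction
    B, C, H, W = image.shape
    ph, pw = H // grid, W // grid
    scores = torch.zeros(grid, grid)
    for i in range(grid):
        for j in range(grid):
            masked = image.clone()
            masked[:, :, i*ph:(i+1)*ph, j*pw:(j+1)*pw] = 0.0  # mask one patch
            pred = F.softmax(model(masked, question), dim=-1)
            # larger divergence => the patch carries more causal weight
            scores[i, j] = F.kl_div(base, pred, reduction="batchmean")
    return scores
```
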
(This article belongs to the Section Evolutionary Algorithms and Machine Learning)

46 pages, 33541 KB  
Article
AIFloodSense: A Global Aerial Imagery Dataset for Semantic Segmentation and Understanding of Flooded Environments
by Georgios Simantiris, Konstantinos Bacharidis, Apostolos Papanikolaou, Petros Giannakakis and Costas Panagiotakis
Remote Sens. 2026, 18(6), 938; https://doi.org/10.3390/rs18060938 - 19 Mar 2026
Viewed by 322
Abstract
Accurate flood detection is critical for disaster response, yet the scarcity of diverse annotated datasets hinders robust model development. Existing resources typically suffer from limited geographic scope and insufficient annotation granularity, restricting the generalization capabilities of computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive evaluation benchmark designed to advance domain-generalized Artificial Intelligence for climate resilience. The dataset comprises 470 high-resolution aerial images capturing 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures exceptional global diversity and temporal relevance (2022–2024), supporting three complementary tasks: (i) Image Classification, featuring novel sub-tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation, providing precise pixel-level masks for flood, sky, buildings, and background; and (iii) Visual Question Answering (VQA), enabling natural language reasoning for disaster assessment. We provide baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset's complexity and its utility in fostering robust AI tools for environmental monitoring. Crucially, we show that despite its compact size, AIFloodSense enables better generalization on external test sets than much larger alternatives, validating the premise that rigorous diversity is more effective than scale for training robust flood detection models. The dataset is made publicly available to accelerate further research in the field.
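
For the segmentation task, a per-class IoU evaluation over the four described classes might look like this (class indices and mask format are assumptions):

```python
# Minimal per-class IoU evaluation for the four described segmentation classes.
# Class indices and the integer label-map format are assumptions.
import numpy as np

CLASSES = {0: "background", 1: "flood", 2: "sky", 3: "buildings"}

def per_class_iou(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: integer label maps of identical shape."""
    ious = {}
    for idx, name in CLASSES.items():
        inter = np.logical_and(pred == idx, gt == idx).sum()
        union = np.logical_or(pred == idx, gt == idx).sum()
        ious[name] = inter / union if union else float("nan")  # class absent
    return ious
```
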

14 pages, 4655 KB  
Article
Fine-Tuning a Small Vision Language Model Using Synthetic Data for Explaining Bacterial Skin Disease Images
by Shiwan Zhang, Abdurrahim Yilmaz, Gulsum Gencoglan and Burak Temelkuran
Diagnostics 2026, 16(4), 603; https://doi.org/10.3390/diagnostics16040603 - 18 Feb 2026
Viewed by 688
Abstract
Background/Objectives: Vision language models (VLMs) show strong potential for medical image understanding, but their large scale often limits practical deployment. This study investigates whether a compact VLM can be effectively adapted for dermatology, with a focus on explaining bacterial skin disease images. Methods: We curate a dataset derived from PMC-OA using the BIOMEDICA dataset and construct PMC-derma-VQA-bacteria by pairing images with inherited figure captions and synthetically generated question–answer (QA) supervision produced by Google's Gemini model. SmolVLM is fine-tuned under three supervision settings: QA-only, caption-only, and a combined QA+caption strategy. The models are evaluated on a held-out test set for both text-generation quality and diagnostic classification performance. Results: QA-only supervision yields the best report-generation performance, while the combined QA+caption setting achieves the highest classification accuracy (70.20%). Conclusions: Synthetic QA supervision can meaningfully enhance compact VLMs for medical image understanding and diagnostic support in dermatology.
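
A minimal sketch of how the three supervision settings could be assembled into training examples (field names and the caption prompt are assumptions, not the released dataset schema):

```python
# Illustrative assembly of the three supervision settings described above.
# Field names and the chat layout are assumptions, not the dataset's schema.
def build_examples(record: dict, setting: str = "qa+caption") -> list[dict]:
    """record: dict with 'image', 'caption', and 'qa_pairs' as [(q, a), ...]."""
    examples = []
    if setting in ("qa", "qa+caption"):
        for q, a in record["qa_pairs"]:
            examples.append({"image": record["image"], "prompt": q, "target": a})
    if setting in ("caption", "qa+caption"):
        examples.append({"image": record["image"],
                         "prompt": "Describe the skin lesion in this figure.",
                         "target": record["caption"]})
    return examples
```
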
(This article belongs to the Special Issue Artificial Intelligence in Skin Disorders 2025)

31 pages, 2850 KB  
Article
Context-Aware Multi-Agent Architecture for Wildfire Insights
by Ashen Sandeep, Sithum Jayarathna, Sunera Sandaruwan, Venura Samarappuli, Dulani Meedeniya and Charith Perera
Sensors 2026, 26(3), 1070; https://doi.org/10.3390/s26031070 - 6 Feb 2026
Viewed by 917
Abstract
Wildfires are environmental hazards with severe ecological, social, and economic impacts, devastating ecosystems, communities, and economies worldwide with rising frequency and intensity driven by climate change, human activity, and environmental shifts. Analyzing wildfire insights such as detection, predictive patterns, and risk assessment enables proactive response and long-term prevention. However, most existing approaches process data in isolation, making it challenging to orchestrate cross-modal reasoning and maintain transparency. This study proposes a novel orchestrator-based multi-agent system (MAS) that transforms multimodal environmental data into actionable intelligence for decision making. Our framework uses Large Multimodal Models (LMMs), augmented by structured prompt engineering and specialized Retrieval-Augmented Generation (RAG) pipelines, to enable transparent, context-aware reasoning, providing a Visual Question Answering (VQA) system that ingests diverse inputs such as satellite imagery, sensor readings, weather data, and ground footage and answers user queries. Validated on several public datasets, the system achieved a precision of 0.797 and an F1-score of 0.736. Powered by agentic AI, this human-centric solution for wildfire management empowers firefighters, governments, and researchers to mitigate threats effectively.
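
A schematic of the orchestrator pattern, with keyword routing standing in for the LMM-based dispatch (agent names and the routing heuristic are illustrative assumptions):

```python
# A schematic orchestrator loop that routes queries to specialist agents.
# Agent names, stub outputs, and the routing rule are illustrative assumptions;
# the paper's system uses LMM-based dispatch rather than keyword matching.
from typing import Callable, Dict

AGENTS: Dict[str, Callable[[str, dict], str]] = {
    "detection": lambda q, ctx: f"Fire detected in {ctx.get('region', '?')}",
    "risk":      lambda q, ctx: "Risk level: high (low humidity, strong wind)",
    "retrieval": lambda q, ctx: "Top passages from the wildfire knowledge base",
}

def orchestrate(query: str, context: dict) -> str:
    """Pick an agent for the query, then answer with its tool output."""
    q = query.lower()
    if "risk" in q:
        agent = AGENTS["risk"]
    elif any(w in q for w in ("fire", "smoke", "detect")):
        agent = AGENTS["detection"]
    else:
        agent = AGENTS["retrieval"]  # fall back to RAG over documents
    return agent(query, context)
```
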
(This article belongs to the Section Internet of Things)

14 pages, 8775 KB  
Article
Improving Transferability of Adversarial Attacks via Maximization and Targeting from Image to Video Quality Assessment
by Georgii Gotin, Ekaterina Shumitskaya, Dmitriy Vatolin and Anastasia Antsiferova
Big Data Cogn. Comput. 2026, 10(2), 50; https://doi.org/10.3390/bdcc10020050 - 5 Feb 2026
Viewed by 484
Abstract
This paper proposes a novel method for transferable adversarial attacks from Image Quality Assessment (IQA) to Video Quality Assessment (VQA) models. Attacking modern VQA models is challenging due to their high complexity and the temporal nature of video content. Since IQA and VQA models share similar low- and mid-level feature representations, and IQA models are substantially cheaper and faster to run, we leverage them as surrogates to generate transferable adversarial perturbations. Our method, MaxT-I2VQA, jointly Maximizes IQA scores and Targets IQA feature activations to improve transferability from IQA to VQA models. We first analyze the correlation between IQA and VQA internal features and use these insights to design a feature-targeting loss. We evaluate MaxT-I2VQA by transferring attacks from four state-of-the-art IQA models to four recent VQA models and compare against three competitive baselines. Compared to prior methods, MaxT-I2VQA increases the attack success rate under transfer by 7.9% and reduces per-example attack runtime by a factor of 8. Our experiments confirm that IQA and VQA feature spaces are sufficiently aligned to enable effective cross-task transfer.
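
The joint objective lends itself to a short PGD-style sketch: raise the surrogate IQA score while pulling intermediate features toward a target pattern. The model interface, step sizes, and loss weighting below are assumptions:

```python
# A sketch of the described joint objective: maximize the surrogate IQA score
# while matching intermediate feature activations to a target pattern.
# The model interface (returning score and features), budget, and weighting
# are illustrative assumptions, not the paper's exact algorithm.
import torch

def maxt_attack(iqa_model, frame, feat_target, steps=10, eps=4/255, lam=1.0):
    delta = torch.zeros_like(frame, requires_grad=True)
    for _ in range(steps):
        score, feats = iqa_model(frame + delta)            # assumed interface
        loss = -score + lam * (feats - feat_target).pow(2).mean()
        loss.backward()
        with torch.no_grad():
            delta -= (eps / steps) * delta.grad.sign()     # raise score, match feats
            delta.clamp_(-eps, eps)                        # L-inf budget
        delta.grad.zero_()
    return (frame + delta).detach()
```
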

33 pages, 1460 KB  
Article
Systematic Analysis of Vision–Language Models for Medical Visual Question Answering
by Muhammad Haseeb Shah and Heriberto Cuayáhuitl
Multimodal Technol. Interact. 2026, 10(2), 16; https://doi.org/10.3390/mti10020016 - 3 Feb 2026
Viewed by 1178
Abstract
General-purpose vision–language models (VLMs) are increasingly applied to imaging tasks, yet their reliability on medical visual question answering (Med-VQA) remains unclear. We investigate how three state-of-the-art VLMs—ViLT, BLIP, and MiniCPM-V-2—perform on radiology-focused Med-VQA when evaluated in a modality-aware manner. Using SLAKE and OmniMedVQA-Mini, we construct harmonised subsets for computed tomography (CT), magnetic resonance imaging (MRI), and X-ray, standardising schema and answer processing. We first benchmark all models in a strict zero-shot setting, then perform supervised fine-tuning on modality-specific data splits, and finally add a post-hoc semantic option-selection layer that maps free-text predictions to multiple-choice answers. Zero-shot performance is modest (exact match ≈20% for ViLT/BLIP and 0% for MiniCPM-V-2), confirming that off-the-shelf deployment is inadequate. Fine-tuning substantially improves all models, with ViLT reaching ≈80% exact match and BLIP ≈50%, while MiniCPM-V-2 lags behind. When coupled with option selection, ViLT and BLIP achieve 90–93% exact match and F1 across all modalities, corresponding to 95–97% BERTScore-F1. Our results show that (i) modality-specific supervision is essential for Med-VQA, and (ii) post-hoc option selection can transform strong but imperfect generative predictions into highly reliable discrete decisions on harmonised radiology benchmarks. The latter finding is useful for medical VLMs that combine generative responses with option or sentence selection.
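
The post-hoc option-selection layer reduces to nearest-neighbour matching in an embedding space; a minimal sketch, with the encoder choice as an assumption:

```python
# A minimal sketch of post-hoc semantic option selection: map a free-text
# prediction to the closest multiple-choice answer by embedding similarity.
# The sentence-transformers checkpoint is an assumption for illustration.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_option(prediction: str, options: list[str]) -> str:
    pred_emb = encoder.encode(prediction, convert_to_tensor=True)
    opt_embs = encoder.encode(options, convert_to_tensor=True)
    scores = util.cos_sim(pred_emb, opt_embs)[0]   # similarity to each option
    return options[int(scores.argmax())]

# e.g. select_option("this is a chest radiograph", ["X-ray", "CT", "MRI"]) -> "X-ray"
```
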

27 pages, 16442 KB  
Article
Co-Training Vision-Language Models for Remote Sensing Multi-Task Learning
by Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen and Xue Yang
Remote Sens. 2026, 18(2), 222; https://doi.org/10.3390/rs18020222 - 9 Jan 2026
Cited by 1 | Viewed by 997
Abstract
With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision-language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. First, we design a data curation procedure covering data acquisition, offline processing and integration, and online loading and weighting. This procedure effectively handles complex RS data environments and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. These strategies are flexible and effectively mitigate the computational burden. Additionally, we significantly enhance the model's object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All training and evaluation tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect this baseline to promote further progress toward general-purpose RS models.
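
A minimal sketch of a zoom-in step for UHR imagery under this two-pass reading (the tile grid and flow are assumptions, not the paper's Zoom-in Chain implementation):

```python
# A schematic zoom-in step for ultra-high-resolution imagery: answer coarse
# questions on a downsampled view, then crop the referenced tile at native
# resolution for a second pass. Grid size and the two-pass flow are assumptions.
from PIL import Image

def zoom_in(image_path: str, tile: tuple[int, int], grid: int = 4) -> Image.Image:
    """Crop tile (row, col) of a grid x grid partition at native resolution."""
    img = Image.open(image_path)
    w, h = img.size
    tw, th = w // grid, h // grid
    r, c = tile
    return img.crop((c * tw, r * th, (c + 1) * tw, (r + 1) * th))

# Pass 1: the model sees img.resize((1024, 1024)) and names a tile of interest.
# Pass 2: the model sees zoom_in(path, tile) to answer at full detail.
```
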

19 pages, 650 KB  
Article
HQD-EM: Robust VQA Through Hierarchical Question Decomposition Bias Module and Ensemble Adaptive Angular Margin Loss
by SeongHyeon Noh and Jae Won Cho
Mathematics 2025, 13(22), 3656; https://doi.org/10.3390/math13223656 - 14 Nov 2025
Viewed by 746
Abstract
Recent studies in Visual Question Answering (VQA) have revealed that models often rely heavily on language priors rather than vision–language understanding, leading to poor generalization under distribution shifts. To address this challenge, we propose HQD-EM, a unified debiasing framework that combines the Hierarchical Question Decomposition (HQD) module with an Ensemble adaptive angular Margin (EM) loss. HQD systematically decomposes questions into multi-granular representations to capture layered language biases, while EM leverages bias confidence to modulate per-sample decision margins dynamically. Our method integrates an ensemble-based method with adaptive margin learning in an end-to-end trainable architecture. Experiments on VQA benchmarks demonstrate that HQD-EM significantly outperforms prior works on VQA-CP2 and VQA-CP1.
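
An adaptive angular margin in the ArcFace style, with the margin scaled by per-sample bias confidence as the abstract describes, might be sketched as follows (scale values and the exact modulation are assumptions):

```python
# A sketch of an adaptive angular margin loss: the margin grows with the
# per-sample bias confidence. Scale, base margin, and modulation are
# illustrative assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def adaptive_margin_loss(logits_cos, labels, bias_conf, s=30.0, m_base=0.3):
    # logits_cos: (B, C) cosine similarities; bias_conf: (B,) in [0, 1]
    m = m_base * bias_conf                       # larger margin for biased samples
    theta = torch.acos(logits_cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, logits_cos.size(1)).bool()
    theta = torch.where(target, theta + m.unsqueeze(1), theta)  # penalize target
    return F.cross_entropy(s * torch.cos(theta), labels)
```
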
(This article belongs to the Section E1: Mathematics and Computer Science)

25 pages, 2968 KB  
Article
ECSA: Mitigating Catastrophic Forgetting and Few-Shot Generalization in Medical Visual Question Answering
by Qinhao Jia, Shuxian Liu, Mingliang Chen, Tianyi Li and Jing Yang
Tomography 2025, 11(10), 115; https://doi.org/10.3390/tomography11100115 - 20 Oct 2025
Cited by 1 | Viewed by 1007
Abstract
Objective: Medical Visual Question Answering (Med-VQA), a key technology that integrates computer vision and natural language processing to assist in clinical diagnosis, possesses significant potential for enhancing diagnostic efficiency and accuracy. However, its development is constrained by two major bottlenecks: weak few-shot generalization capability stemming from the scarcity of high-quality annotated data and the problem of catastrophic forgetting when continually learning new knowledge. Existing research has largely addressed these two challenges in isolation, lacking a unified framework. Methods: To bridge this gap, this paper proposes a novel Evolvable Clinical-Semantic Alignment (ECSA) framework, designed to synergistically solve these two challenges within a single architecture. ECSA is built upon powerful pre-trained vision (BiomedCLIP) and language (Flan-T5) models, with two innovative modules at its core. First, we design a Clinical-Semantic Disambiguation Module (CSDM), which employs a novel debiased hard negative mining strategy for contrastive learning. This enables the precise discrimination of "hard negatives" that are visually similar but clinically distinct, thereby significantly enhancing the model's representation ability in few-shot and long-tail scenarios. Second, we introduce a Prompt-based Knowledge Consolidation Module (PKC), which acts as a rehearsal-free non-parametric knowledge store. It consolidates historical knowledge by dynamically accumulating and retrieving task-specific "soft prompts," thus effectively circumventing catastrophic forgetting without relying on past data. Results: Extensive experimental results on four public benchmark datasets, VQA-RAD, SLAKE, PathVQA, and VQA-Med-2019, demonstrate ECSA's state-of-the-art or highly competitive performance. Specifically, ECSA achieves excellent overall accuracies of 80.15% on VQA-RAD and 85.10% on SLAKE, while also showing strong generalization with 64.57% on PathVQA and 82.23% on VQA-Med-2019. More critically, in continual learning scenarios, the framework achieves a low forgetting rate of just 13.50%, showcasing its significant advantages in knowledge retention. Conclusions: These findings validate the framework's substantial potential for building robust and evolvable clinical decision support systems.
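
A schematic InfoNCE loss with hard-negative boosting, in the spirit of the debiased hard negative mining described for CSDM (the weighting scheme and temperature are assumptions):

```python
# A schematic image-text InfoNCE loss that up-weights high-similarity ("hard")
# negatives. The additive boost and temperature are illustrative assumptions,
# not the paper's debiasing strategy in detail.
import torch
import torch.nn.functional as F

def hard_negative_nce(img_emb, txt_emb, tau=0.07, beta=1.0):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / tau                      # (B, B) similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Additively boost hard negatives in the softmax denominator; positives
    # (diagonal) are left untouched.
    boost = torch.where(mask, torch.zeros_like(sim), beta * sim.detach())
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim + boost, labels)
```
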

24 pages, 3721 KB  
Article
Interactive Environment-Aware Planning System and Dialogue for Social Robots in Early Childhood Education
by Jiyoun Moon and Seung Min Song
Appl. Sci. 2025, 15(20), 11107; https://doi.org/10.3390/app152011107 - 16 Oct 2025
Viewed by 798
Abstract
In this study, we propose an interactive environment-aware dialogue and planning system for social robots in early childhood education, aimed at supporting the learning and social interaction of young children. The proposed architecture consists of three core modules. First, semantic simultaneous localization and mapping (SLAM) accurately perceives the environment by constructing a semantic scene representation that includes object attributes such as position, size, color, purpose, and material, as well as positional relationships between objects. Second, the automated planning system enables stable task execution even in changing environments through planning domain definition language (PDDL)-based planning and replanning capabilities. Third, the visual question answering (VQA) module leverages scene graphs and SPARQL conversion of natural language queries to answer children's questions and engage in context-based conversations. An experiment conducted in a real kindergarten classroom with children aged 6 to 7 years validated the accuracy of object recognition and attribute extraction for semantic SLAM, the task success rate of the automated planning system, and the natural language question answering performance of the VQA module. The experimental results confirmed the proposed system's potential to support natural social interaction with children and its applicability as an educational tool.
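
As an illustration of the scene-graph question answering path, here is a toy mapping from a child's question to a SPARQL query (the ontology prefix, predicate names, and template are assumptions):

```python
# An illustrative natural-language-to-SPARQL mapping over a scene graph, in the
# spirit of the described VQA module. Prefix, predicates, and the toy intent
# parser are assumptions for illustration.
QUERY_TEMPLATE = """
PREFIX sg: <http://example.org/scenegraph#>
SELECT ?obj WHERE {{
  ?obj a sg:Object ;
       sg:color "{color}" .
}}
"""

def question_to_sparql(question: str):
    # Toy intent parser: "where is the red ball?" -> query objects by color.
    for color in ("red", "blue", "green", "yellow"):
        if color in question.lower():
            return QUERY_TEMPLATE.format(color=color)
    return None  # fall back to open-ended dialogue
```
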
(This article belongs to the Special Issue Robotics and Intelligent Systems: Technologies and Applications)

20 pages, 7466 KB  
Article
Feasibility Study of CLIP-Based Key Slice Selection in CT Images and Performance Enhancement via Lesion- and Organ-Aware Fine-Tuning
by Kohei Yamamoto and Tomohiro Kikuchi
Bioengineering 2025, 12(10), 1093; https://doi.org/10.3390/bioengineering12101093 - 10 Oct 2025
Cited by 2 | Viewed by 1475
Abstract
Large-scale medical visual question answering (MedVQA) datasets are critical for training and deploying vision–language models (VLMs) in radiology. Ideally, such datasets should be automatically constructed from routine radiology reports and their corresponding images. However, no existing method directly links free-text findings to the most relevant 2D slices in volumetric computed tomography (CT) scans. To address this gap, we propose a contrastive language–image pre-training (CLIP)-based key slice selection framework that matches each sentence to its most informative CT slice via text–image similarity. Our experiments demonstrate that models pre-trained in the medical domain already achieve competitive slice retrieval accuracy, and that fine-tuning them on a small dual-supervised dataset that imparts both lesion- and organ-level awareness yields further gains. In particular, the best-performing model (fine-tuned BiomedCLIP) achieved a Top-1 accuracy of 51.7% for lesion-aware slice retrieval, representing a 20-point improvement over the baseline CLIP, and its selections were accepted by radiologists in 56.3% of cases. By automating report-to-slice alignment, the proposed method facilitates scalable, clinically realistic construction of MedVQA resources.
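
The core retrieval step is simple to sketch with an off-the-shelf CLIP implementation: embed each slice and the report sentence, then take the most similar slice (the open_clip checkpoint below is an assumption, not BiomedCLIP's exact interface):

```python
# A minimal sketch of sentence-to-slice retrieval: embed each CT slice and the
# report sentence with a CLIP-style model, then pick the most similar slice.
# The open_clip checkpoint and preprocessing are assumptions for illustration.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def best_slice(slices, sentence: str) -> int:
    """slices: list of PIL images (one per axial CT slice)."""
    imgs = torch.stack([preprocess(s) for s in slices])
    img_f = model.encode_image(imgs)
    txt_f = model.encode_text(tokenizer([sentence]))
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return int((img_f @ txt_f.t()).squeeze(1).argmax())   # index of the key slice
```
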

29 pages, 13142 KB  
Article
Automatic Complexity Analysis of UML Class Diagrams Using Visual Question Answering (VQA) Techniques
by Nimra Shehzadi, Javed Ferzund, Rubia Fatima and Adnan Riaz
Software 2025, 4(4), 22; https://doi.org/10.3390/software4040022 - 23 Sep 2025
Cited by 1 | Viewed by 3052
Abstract
Context: Modern software systems have become increasingly complex, making it difficult to interpret raw requirements and effectively utilize traditional tools for software design and analysis. Unified Modeling Language (UML) class diagrams are widely used to visualize and understand system architecture, but analyzing them manually, especially for large-scale systems, poses significant challenges. Objectives: This study aims to automate the analysis of UML class diagrams by assessing their complexity using a machine learning approach. The goal is to support software developers in identifying potential design issues early in the development process and to improve overall software quality. Methodology: To achieve this, this research introduces a Visual Question Answering (VQA)-based framework that integrates both computer vision and natural language processing. Vision Transformers (ViTs) are employed to extract global visual features from UML class diagrams, while the BERT language model processes natural language queries. By combining these two models, the system can accurately respond to questions related to software complexity, such as class coupling and inheritance depth. Results: The proposed method demonstrated strong performance in experimental trials. The ViT model achieved an accuracy of 0.8800, with both the F1 score and recall reaching 0.8985. These metrics highlight the effectiveness of the approach in automatically evaluating UML class diagrams. Conclusions: The findings confirm that advanced machine learning techniques can be successfully applied to automate software design analysis. This approach can help developers detect design flaws early and enhance software maintainability. Future work will explore advanced fusion strategies, novel data augmentation techniques, and lightweight model adaptations suitable for environments with limited computational resources.
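
A schematic ViT + BERT fusion head matching the described pipeline at a high level (checkpoints, concatenation fusion, and the answer-vocabulary size are assumptions):

```python
# A schematic ViT + BERT fusion head for diagram VQA. Checkpoints, fusion by
# concatenation, and the answer vocabulary are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel

class DiagramVQA(nn.Module):
    def __init__(self, num_answers: int = 100):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(768 + 768, num_answers)

    def forward(self, pixel_values, input_ids, attention_mask):
        v = self.vit(pixel_values=pixel_values).pooler_output       # (B, 768)
        t = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output  # (B, 768)
        return self.head(torch.cat([v, t], dim=-1))                 # answer logits
```
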
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)

28 pages, 20825 KB  
Article
Towards Robust Chain-of-Thought Prompting with Self-Consistency for Remote Sensing VQA: An Empirical Study Across Large Multimodal Models
by Fatema Tuj Johora Faria, Laith H. Baniata, Ahyoung Choi and Sangwoo Kang
Mathematics 2025, 13(18), 3046; https://doi.org/10.3390/math13183046 - 22 Sep 2025
Cited by 1 | Viewed by 3172
Abstract
Remote sensing visual question answering (RSVQA) involves interpreting complex geospatial information captured by satellite imagery to answer natural language questions, making it a vital tool for observing and analyzing Earth's surface without direct contact. Although numerous studies have addressed RSVQA, most have focused primarily on answer accuracy, often overlooking the underlying reasoning capabilities required to interpret spatial and contextual cues in satellite imagery. To address this gap, this study presents a comprehensive evaluation of four large multimodal models (LMMs): GPT-4o, Grok 3, Gemini 2.5 Pro, and Claude 3.7 Sonnet. We used a curated subset of the EarthVQA dataset consisting of 100 rural images with 29 question–answer pairs each and 100 urban images with 42 pairs each. We developed three task-specific frameworks: (1) Zero-GeoVision, which employs zero-shot prompting with problem-specific prompts that elicit direct answers from the pretrained knowledge base without fine-tuning; (2) CoT-GeoReason, which enhances the knowledge base with chain-of-thought prompting, guiding it through explicit steps of feature detection, spatial analysis, and answer synthesis; and (3) Self-GeoSense, which extends this approach by stochastically decoding five independent reasoning chains for each remote sensing question. Rather than merging these chains, it counts the final answers, selects the majority choice, and returns a single complete reasoning chain whose conclusion aligns with that majority. Additionally, we designed the Geo-Judge framework, which employs a two-stage evaluation process. In Stage 1, a GPT-4o-mini-based LMM judge assesses reasoning coherence and answer correctness using the input image, task type, reasoning steps, generated model answer, and ground truth. In Stage 2, blinded human experts independently review the LMM's reasoning and answer, providing unbiased validation through careful reassessment. Self-GeoSense with Grok 3 achieves the best performance, with 94.69% accuracy in Basic Judging, 93.18% in Basic Counting, 89.42% in Reasoning-Based Judging, 83.29% in Reasoning-Based Counting, 77.64% in Object Situation Analysis, and 65.29% in Comprehensive Analysis, alongside RMSE values of 0.9102 in Basic Counting and 1.0551 in Reasoning-Based Counting.
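
The self-consistency step of Self-GeoSense reduces to majority voting over sampled chains; a minimal sketch, with the model call and answer parsing as assumptions:

```python
# A minimal sketch of the described self-consistency step: sample several
# reasoning chains, then keep one chain whose final answer matches the
# majority vote. The ask_model interface is an assumption for illustration.
from collections import Counter

def self_consistent_answer(ask_model, image, question, n: int = 5):
    """ask_model(image, question) -> (reasoning_chain, final_answer)."""
    chains = [ask_model(image, question) for _ in range(n)]   # stochastic decoding
    majority, _ = Counter(ans for _, ans in chains).most_common(1)[0]
    # Return one complete chain whose conclusion agrees with the majority answer.
    reasoning = next(chain for chain, ans in chains if ans == majority)
    return majority, reasoning
```
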
(This article belongs to the Special Issue Big Data Mining and Knowledge Graph with Application)