MDPI - Publisher of Open Access Journals

19 pages, 704 KB

Open Access Article

Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion

by Junkai Zhang, Bin Li and Shoujun Zhou

Appl. Sci. 2025, 15(9), 4712; https://doi.org/10.3390/app15094712 - 24 Apr 2025

Cited by 4 | Viewed by 4471

Medical Visual Question Answering (Med-VQA) aims to answer a clinical question accurately by analyzing the medical image it refers to. Med-VQA systems are important for assisting clinical diagnosis and improving diagnostic accuracy. Hierarchical Med-VQA extends Med-VQA by organizing medical questions into a hierarchical structure and making level-specific predictions to handle fine-grained distinctions. Recently, many studies have proposed hierarchical Med-VQA tasks and established datasets. However, two issues remain: (1) imperfect hierarchical modeling leads to poor differentiation between question levels, resulting in semantic fragmentation across hierarchies; and (2) Transformer-based cross-modal self-attention fusion methods rely excessively on implicit learning, which can obscure crucial local semantic correlations in medical scenarios. To address these issues, this study proposes Hierarchical Modeling for Medical Visual Question Answering with Cross-Attention Fusion (HiCA-VQA). The hierarchical modeling comprises two modules: Hierarchical Prompting for fine-grained medical questions and Hierarchical Answer Decoders. The hierarchical prompting module pre-aligns hierarchical text prompts with image features to guide the model toward specific image regions according to question type, while the hierarchical decoders make separate predictions for questions at different levels to improve accuracy across granularities. The framework also incorporates a cross-attention fusion module in which images serve as queries and text as key-value pairs. This design avoids the irrelevant signals introduced by global interactions and has lower computational complexity than global self-attention fusion. Experiments on the Rad-Restruct benchmark show that HiCA-VQA outperforms existing state-of-the-art methods in answering hierarchical fine-grained questions, notably achieving an 18% improvement in F1 score. This study provides an effective pathway for hierarchical visual question answering systems, advancing medical image understanding.
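The fusion step described in this abstract, with image features as queries and text features as keys and values, can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the feature values and the `cross_attention` helper are hypothetical, and the learned projection matrices a real model would use are omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_feats):
    """Single-head cross-attention: image patches query the text tokens.

    Assumption: learned projections (W_q, W_k, W_v) are left out, so the
    raw features act directly as queries, keys, and values.
    """
    d = len(image_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused, weights = [], []
    for q in image_feats:
        # One score per (image patch, text token) pair.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_feats]
        attn = softmax(scores)
        weights.append(attn)
        # Each fused patch is an attention-weighted sum of the text values.
        fused.append([sum(a * v[j] for a, v in zip(attn, text_feats)) for j in range(d)])
    return fused, weights


# Hypothetical toy features: 3 image patches and 2 text tokens, dimension 2.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_attention(image_feats, text_feats)
```

In this toy setting the score matrix has 3 × 2 = 6 entries, whereas self-attention over the concatenated sequence of 5 tokens would compute 5 × 5 = 25, which is the kind of saving the abstract alludes to.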

(This article belongs to the Special Issue New Trends in Natural Language Processing)


17 pages, 5437 KB

Open Access Article

ChartLine: Automatic Detection and Tracing of Curves in Scientific Line Charts Using Spatial-Sequence Feature Pyramid Network

by Wenjin Yang, Jie He and Qian Li

Sensors 2024, 24(21), 7015; https://doi.org/10.3390/s24217015 - 31 Oct 2024

Cited by 1 | Viewed by 3879

Abstract

Line charts are prevalent in scientific documents and commercial data visualization, serving as essential tools for conveying data trends. Automatic detection and tracing of line paths in these charts is crucial for downstream tasks such as data extraction, chart quality assessment, plagiarism detection, and visual question answering. However, line graphs present unique challenges due to their complex backgrounds and diverse curve styles, including solid, dashed, and dotted lines. Existing curve detection algorithms struggle to address these challenges effectively. In this paper, we propose ChartLine, a novel network designed for detecting and tracing curves in line graphs. Our approach integrates a Spatial-Sequence Attention Feature Pyramid Network (SSA-FPN) in both the encoder and decoder to capture rich hierarchical representations of curve structures and boundary features. The model incorporates a Spatial-Sequence Fusion (SSF) module and a Channel Multi-Head Attention (CMA) module to enhance intra-class consistency and inter-class distinction. We evaluate ChartLine on four line chart datasets and compare its performance against state-of-the-art curve detection, edge detection, and semantic segmentation methods. Extensive experiments demonstrate that our method significantly outperforms existing algorithms, achieving an F-measure of 94% on a synthetic dataset.

(This article belongs to the Section Sensor Networks)


22 pages, 958 KB

Open Access Article

Automatic Detection of Inconsistencies and Hierarchical Topic Classification for Open-Domain Chatbots

by Mario Rodríguez-Cantelar, Marcos Estecha-Garitagoitia, Luis Fernando D’Haro, Fernando Matía and Ricardo Córdoba

Appl. Sci. 2023, 13(16), 9055; https://doi.org/10.3390/app13169055 - 8 Aug 2023

Cited by 11 | Viewed by 3718

Abstract

Current State-of-the-Art (SotA) chatbots can produce high-quality sentences, handle different conversation topics, and sustain longer interactions. Unfortunately, the generated responses depend greatly on the training data, the specific dialogue history and current turn used to guide the response, the internal decoding mechanisms, and the ranking strategies, among other factors. As a result, a chatbot may give different answers to semantically similar user questions, which can be considered a form of hallucination or can cause confusion in long-term interactions. In this research paper, we propose a novel methodology consisting of two main phases: (a) hierarchical automatic detection of topics and subtopics in dialogue interactions using a zero-shot learning approach, and (b) detection of inconsistent answers using k-means and the Silhouette coefficient. To evaluate the efficacy of topic and subtopic detection, we use a subset of the DailyDialog dataset and real dialogue interactions gathered during the Alexa Socialbot Grand Challenge 5 (SGC5). The proposed approach enables the detection of up to 18 different topics and 102 subtopics. To detect inconsistencies, we manually generate multiple paraphrased questions and employ several pre-trained SotA chatbot models to generate responses. Our experimental results show a weighted F-1 value of 0.34 for topic detection and 0.78 for subtopic detection in DailyDialog, and 81% and 62% accuracy for topic and subtopic classification in SGC5, respectively. Finally, when predicting the number of different responses, we obtained a mean squared error (MSE) of 3.4 for smaller generative models and 4.9 for recent large language models.
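One plausible reading of phase (b) is to cluster response embeddings with k-means for several values of k and keep the k with the highest mean Silhouette coefficient as the estimated number of distinct responses. The sketch below follows that reading under stated assumptions: the 2-D "embeddings" are made-up toy vectors, the farthest-point initialisation and the "maximise the silhouette" selection rule are illustrative choices, and all function names are hypothetical rather than taken from the paper.

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean Silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def estimate_num_responses(embeddings):
    """Return the k in 2..n-1 with the highest mean silhouette, and that score."""
    best_k, best_score = 1, -1.0
    for k in range(2, len(embeddings)):
        labels = kmeans(embeddings, k)
        if len(set(labels)) < 2:
            continue
        score = silhouette(embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Hypothetical embeddings of six chatbot answers forming three tight groups.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]]
best_k, best_score = estimate_num_responses(embeddings)
```

On this toy data the answers fall into three well-separated groups, so the selected k is 3. Real sentence embeddings are higher-dimensional and noisier, so the estimate would be correspondingly less clear-cut.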

(This article belongs to the Special Issue IberSPEECH 2022: Speech and Language Technologies for Iberian Languages)


Search Results (3)
