Article
Peer-Review Record

Multimodal Retrieval Method for Images and Diagnostic Reports Using Cross-Attention

by Ikumi Sata 1, Motoki Amagasaki 2,* and Masato Kiyama 2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 26 December 2024 / Revised: 11 February 2025 / Accepted: 17 February 2025 / Published: 18 February 2025
(This article belongs to the Section Medical & Healthcare AI)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a multimodal retrieval method for medical data. The method adds cross-attention at the end of BioMedCLIP to learn joint embeddings for images and text. The results on the MIMIC-5x200 data are better than those of BioMedCLIP.

Weaknesses:

The paper has limited novelty. Adding related work on multimodal retrieval and comparing the proposed method with the state of the art on the MIMIC-5x200 data would strengthen the paper. The "additional training" details for BioMedCLIP should be discussed in depth. It would be interesting to see the results if the BioMedCLIP part of the proposed method were also trained from scratch. Further, qualitative results with some retrieval examples for BioMedCLIP and the proposed method should be added.

Line 214: For how many samples are the findings and impression sections missing? And what is in the last paragraph?

Figure 1 has a red underline under BioMedCLIP that should be removed. In Table 2, what is Sum(values)?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This article proposes a new image retrieval method that combines medical images with text embeddings through a cross-attention mechanism, improving retrieval performance. The theme of the article is clear, and the experimental results demonstrate the potential of the method. However, some details and arguments need further improvement. The specific comments are as follows:

1. The images inserted in the text should be vector graphics to ensure clarity; however, Figure 1 has low resolution and even shows red wavy underlines, which affects the overall reading experience of the paper. Please check the resolution of all figures and update them to a vector format (such as PDF or SVG).

2. The authors state that positional embedding was discarded in the proposed method but do not provide sufficient experimental results to support this choice. I suggest adding comparative experiments that show the difference in retrieval performance between retaining and discarding positional embeddings. In addition, please analyze why discarding positional embedding can improve performance, and provide a reasonable explanation from a theoretical or intuitive perspective.

3. To improve readers' understanding of the dataset, please provide several concrete examples, including medical images and their corresponding text prompts. Such examples would help readers intuitively grasp the characteristics of the dataset and the input format of the model.

4. The article does not specify whether fine-tuning BioMedCLIP would further improve retrieval performance. I suggest adding experiments that explore the impact of fine-tuning on model performance, to improve the completeness of the paper.

5. The description in the Contributions section is vague and does not clearly convey the innovation of this article. I suggest reorganizing the Contributions along the following lines: the innovation of the proposed method compared with existing methods, and the uniqueness of the proposed model or mechanism in solving medical image retrieval problems.

Based on the points above, I recommend a major revision of this manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

In this manuscript, the authors propose a novel approach to medical image retrieval that uses cross-attention mechanisms to integrate image and text data. The work makes several valuable contributions to the field, though some areas would benefit from further development. The methodology appears innovative in its use of cross-attention to combine image and text modalities, showing significant performance improvements over baseline methods.

The following are my suggestions that would improve the manuscript:

1.     Lines 247-269 describe the metrics used to characterize and evaluate performance. I suggest moving this part to the Methods section instead of the Results.

2.     I notice a wavy red line under BioMedCLIP in Fig. 1. Does it mean anything special? If not, please remove it.

3.     The authors present their results in Tables 1-2 but lack sufficient interpretation in the manuscript. It would be helpful to provide more detail on how the results should be interpreted.

4.     The evaluation focuses on only 5 medical conditions from MIMIC-CXR. The authors should address how well the method might generalize to a broader range of conditions.

5.     While mentioned in the limitations section, more discussion is needed about potential dataset biases and their impact on real-world applications.

6.     The mathematical notation in the equations could be formatted more consistently.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

This paper presents a method for integrating medical image and text embeddings using a cross-attention mechanism. The authors claim that this work is novel. However, similar studies have been conducted. For instance, Ou et al. (Multimedia Systems 31, 58 (2025)) used a report entity graph and dual attention mechanisms to align fine-grained semantic representations between images and text. Simon et al. (Diagn Interv Radiol 2024; DOI: 10.4274/dir.2024.242631) integrated imaging and clinical metadata, often employing advanced architectures like transformers and graph neural networks. Moreover, Jeong et al. (Medical Imaging with Deep Learning 2024 Jan 23 (pp. 978-990). PMLR) reported a generation module that uses an image-text matching score to measure the similarity between chest X-ray images and radiology reports. These works are not mentioned in the paper. As the Journal aims to publish novel and innovative work, it is difficult to recommend this paper for publication.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have addressed my concerns, and I have increased my review ratings.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have thoroughly addressed the concerns raised in the previous review. They have enhanced the methodological clarity and incorporated stronger baseline comparisons. Given these substantial improvements, I now find the manuscript to be well-structured, methodologically sound, and a valuable contribution to the field. Therefore, I recommend its acceptance for publication.

Reviewer 4 Report

Comments and Suggestions for Authors

The authors have addressed my concerns.
