Article
Peer-Review Record

MDFormer: Transformer-Based Multimodal Fusion for Robust Chest Disease Diagnosis

Electronics 2025, 14(10), 1926; https://doi.org/10.3390/electronics14101926
by Xinlong Liu 1, Fei Pan 2, Hainan Song 2, Siyi Cao 2, Chunping Li 1,* and Tanshi Li 2,*
Reviewer 1:
Reviewer 2:
Submission received: 14 April 2025 / Revised: 3 May 2025 / Accepted: 7 May 2025 / Published: 9 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. The process for deriving the 14 “structured labels” is not clearly described. It appears you remove the “Impression” section from reports to avoid label leakage, but then use those reports to generate diagnosis labels. Please clarify exactly how you extract each label (e.g., via keyword search, CheXpert annotations, manual curation) and provide statistics on label prevalence and inter-label co-occurrence.
  2. MFTrans uses six stacked fusion layers, group convolutions with 12 groups, and specific projection dimensions. There is no justification for these design choices or evidence that they are optimal. Please report hyperparameter tuning experiments, or at least ablate the number of fusion layers and group sizes, to show how these affect performance and parameter/compute trade-offs.
  3. The Sigmoid-based dynamic weighting only adjusts w_clsm and w_mmc for the first 5 epochs, then holds weights constant. Why 5 epochs? How sensitive are results to this schedule or to the choice of the momentum coefficient α? Please include an ablation on the length/shape of this scheduling and its impact on masked vs. full-modality performance.
  4. All metrics are reported as single point estimates. For clinical relevance, please provide 95% confidence intervals (e.g., via bootstrapping) for your key metrics (F1, AUC) and perform statistical tests comparing MDFormer to the strongest baseline (e.g., ALBEF) on the same splits, to confirm performance gains are significant.
  5. Some baselines (MedCLIP, ALBEF) were originally trained with different objectives or data augmentations. Please clarify that all models used identical preprocessing, train/validation/test splits, and augmentation pipelines. If you re-implemented baselines, detail any deviations from their original setups.
  6. The text refers to “MIMIC-CXR-JPG,” “MIMIC-MCC,” and “MIMIC-IV-ED.” This is confusing. Please standardize names (e.g., MIMIC-CXR and MIMIC-IV-ED) and explain the role of “MIMIC-MCC.”
  7. “ConcatnateFusion” in Table 1 should be “ConcatenateFusion”. In several equations, spacing around subscripts and parentheses is inconsistent (e.g., in Eq. (8)). Ensure all figures (especially Fig. 4 and Fig. 6) are high-resolution and that axis labels are legible.

  8. More quantitative metrics (e.g., FLOPs, inference time) are needed to substantiate efficiency claims.
  9. Literature review can be improved. Consider including (DOI:10.1109/ACCESS.2024.3503413) as this also proved potential of transformer while performing diagnosis.

Author Response

Thank you very much for your time and effort in the review process. We have carefully considered the questions and suggestions raised by the reviewer and revised our paper accordingly. The revised parts are highlighted in red text in the paper. Below, we explain how each of the reviewer's points has been addressed.

Comment 1: "The process for deriving the 14 “structured labels” is not clearly described. It appears you remove the “Impression” section from reports to avoid label leakage, but then use those reports to generate diagnosis labels. Please clarify exactly how you extract each label (e.g., via keyword search, CheXpert annotations, manual curation) and provide statistics on label prevalence and inter-label co-occurrence."

Response 1: Thank you for your valuable comment and suggestion. In this study, we use the MIMIC-CXR-JPG dataset, which contains chest X-ray images, corresponding radiology reports, and associated structured labels. These labels were not manually extracted by us but are provided directly by the dataset. According to the official description of MIMIC-CXR-JPG, the labels were automatically extracted from the reports using a rule-based tool (the CheXpert labeler). Therefore, the detailed label extraction process is not the main focus of this paper. It is also important to note that the radiology reports are semi-structured texts consisting mainly of two parts: "Findings" and "Impression." The "Findings" section primarily contains descriptive statements about the images, while the "Impression" section provides concise diagnostic information about the patient's condition. To avoid potential information leakage, we remove the "Impression" section during text preprocessing, retaining only the "Findings" and other relevant parts as input text. Moreover, since we have removed all conclusive statements and all models in this study use the same dataset, we believe it is unnecessary to additionally include the label co-occurrence matrix statistics in the paper.

We have revised and supplemented the relevant description in Section 3.2 of the paper, clarifying the source and basic information of the labels as follows:

"In addition, in the MIMIC-CXR-JPG dataset, radiology reports mainly consist of two sections: 'Findings' and 'Impression.' The 'Findings' section mainly includes descriptive statements about the patient's images, while the 'Impression' section mainly contains brief diagnostic information. Considering that the labels in the MIMIC-CXR-JPG dataset are automatically extracted from the reports using the CheXpert method, we remove the 'Impression' part of the radiology reports during preprocessing (82.4% of reports in the original dataset contain the 'Impression' section). Directly using the 'Impression' information may lead to label leakage."

 

Comment 2: "MFTrans uses six stacked fusion layers, group convolutions with 12 groups, and specific projection dimensions. There is no justification for these design choices or evidence that they are optimal. Please report hyperparameter tuning experiments, or at least ablate the number of fusion layers and group sizes, to show how these affect performance and parameter/compute trade-offs."

Response 2: Thanks for the comment and suggestion. To maintain the conciseness of the paper, some experimental results were previously omitted. We have now added the hyperparameter tuning experiments for MFTrans in Section 4.2, specifically analyzing the impact of the number of fusion layers and the number of groups in the grouped convolution on the model's F1 score, and we also discuss how the number of fusion layers affects the model's parameter count. In addition, the projection dimension of the fusion module is set to 768, the default hidden size of the Transformer: we simply apply a linear mapping to project the imaging embeddings and time-series embeddings to 768 dimensions, facilitating the concatenation of features from different modalities.

The added content is as follows:

"Finally, we present the hyperparameter tuning experiments for the fusion module MFTrans, focusing on evaluating the impact of different numbers of fusion layers and grouped convolution groups on model performance. The experimental results are shown in the Figure. As illustrated in the left panel, the model's F1 score steadily improves as the number of fusion layers increases from 2 to 6, with a particularly notable gain between 2 and 4 layers. This suggests that increasing the number of fusion layers helps to more effectively model multimodal information. However, when the number of layers increases to 8, the performance gain saturates and even slightly declines, possibly due to redundant information or overfitting introduced by the overly deep structure. In the right panel, the number of grouped convolutions has a limited effect on performance when it exceeds 2. Specifically, from 4 to 12 groups, the F1 score remains around 0.796, but performance slightly drops when the group number exceeds 16. These results indicate that moderately increasing the number of convolution groups can reduce the number of parameters, but an excessive number may harm model performance. Considering both performance and computational cost, the optimal configuration for the fusion module is 6 fusion layers and 12 convolution groups."

 

Comment 3: "The Sigmoid-based dynamic weighting only adjusts w_clsm and w_mmc for the first 5 epochs, then holds weights constant. Why 5 epochs? How sensitive are results to this schedule or to the choice of the momentum coefficient α? Please include an ablation on the length/shape of this scheduling and its impact on masked vs. full-modality performance."

Response 3: Thanks for the comment and suggestion. Our second-stage training typically terminates at the 11th epoch through an early stopping mechanism. Therefore, we choose the first half (i.e., 5 epochs) to dynamically adjust the loss weights in order to better balance the masked classification loss and contrastive loss in the early stages of the model. The remaining epochs maintain fixed loss weights to stabilize the training process and prevent over-adjustment of weights from interfering with the training. Our experiments adopted the same momentum coefficient of 0.995 as in ALBEF, which has yielded good results in our setup. Moreover, in our experiments, the impact of scheduling length and momentum coefficient on the final results is minimal, so we believe there is no need for additional ablation experiments.
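
For clarity, a minimal sketch of such a Sigmoid-shaped warm-up schedule is shown below; the lower/upper bounds and steepness are illustrative assumptions rather than the exact values used in our implementation, and dynamic_loss_weights is a hypothetical helper name.

```python
import math

def dynamic_loss_weights(epoch, warmup_epochs=5, w_min=0.1, w_max=1.0):
    """Sigmoid-shaped warm-up for the masked-classification and contrastive
    loss weights (w_clsm, w_mmc); weights are frozen after warmup_epochs.
    Bounds and steepness here are illustrative assumptions."""
    if epoch >= warmup_epochs:
        return w_max
    # map the epoch index to roughly [-6, 6] so the Sigmoid spans its range
    x = 12.0 * epoch / warmup_epochs - 6.0
    return w_min + (w_max - w_min) / (1.0 + math.exp(-x))

# inspect the schedule over the ~11 epochs of stage-two training
for epoch in range(11):
    print(f"epoch {epoch:2d}: w_clsm = w_mmc = {dynamic_loss_weights(epoch):.3f}")
```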

Comment 4: "All metrics are reported as single point estimates. For clinical relevance, please provide 95% confidence intervals (e.g., via bootstrapping) for your key metrics (F1, AUC) and perform statistical tests comparing MDFormer to the strongest baseline (e.g., ALBEF) on the same splits, to confirm performance gains are significant."

Response 4: Thanks for the comment and suggestion. To further evaluate the clinical applicability of the models, we performed 100 bootstrap resampling trials on the test set. MDFormer achieved an average F1 score of 0.7920 (95% confidence interval: 0.7796-0.8044) and an AUC of 0.9011 (95% confidence interval: 0.8948-0.9074). In comparison, ALBEF attained an average F1 score of 0.7720 (95% confidence interval: 0.7596-0.7844) and an AUC of 0.8860 (95% confidence interval: 0.8798-0.8922). Additionally, under the same data splits, we conducted a paired Wilcoxon signed-rank test to statistically compare the two methods in terms of F1 score and AUC. The results indicated that the performance improvement was statistically significant (F1: p = 0.0047; AUC: p = 0.0031), further demonstrating the stability and reliability of our method. We have added these results and discussions to Section 4.2 of the revised manuscript.
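
For reference, the evaluation protocol described above can be sketched as follows; the placeholder arrays, the macro-averaged AUC metric, and the choice of pairing the signed-rank test on per-resample scores are illustrative assumptions, not the exact scripts used to produce the reported numbers.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import roc_auc_score

def bootstrap_metric(y_true, y_prob, metric_fn, n_boot=100, seed=0):
    """Percentile bootstrap over test samples (resampling with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.array([
        metric_fn(y_true[idx], y_prob[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ])
    return scores.mean(), np.percentile(scores, 2.5), np.percentile(scores, 97.5), scores

# placeholder data: replace with the real test labels and model probabilities
y_true = np.random.randint(0, 2, size=(500, 14))
prob_mdformer = np.random.rand(500, 14)
prob_albef = np.random.rand(500, 14)

macro_auc = lambda y, p: roc_auc_score(y, p, average="macro")
m_a, lo_a, hi_a, s_a = bootstrap_metric(y_true, prob_mdformer, macro_auc, seed=0)
m_b, lo_b, hi_b, s_b = bootstrap_metric(y_true, prob_albef, macro_auc, seed=0)
print(f"MDFormer AUC {m_a:.4f} (95% CI {lo_a:.4f}-{hi_a:.4f})")
print(f"ALBEF    AUC {m_b:.4f} (95% CI {lo_b:.4f}-{hi_b:.4f})")

# identical seeds give identical resample indices, so the scores are paired
stat, p_value = wilcoxon(s_a, s_b)
print(f"paired Wilcoxon signed-rank p-value: {p_value:.4f}")
```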

 

Comment 5: "Some baselines (MedCLIP, ALBEF) were originally trained with different objectives or data augmentations. Please clarify that all models used identical preprocessing, train/validation/test splits, and augmentation pipelines. If you re-implemented baselines, detail any deviations from their original setups."

Response 5: Thanks for the comment and suggestion. In this study, all baseline models were reproduced based on their original papers or official released code, and we strictly followed the same train/validation/test splits, data preprocessing procedures, and data augmentation strategies to ensure fair and consistent comparisons. We adopted a unified classification head across all models for the multi-label classification of chest diseases, while keeping the rest of each model’s architecture unchanged.

 

Comment 6: "The text refers to “MIMIC-CXR-JPG,” “MIMIC-MCC,” and “MIMIC-IV-ED.” This is confusing. Please standardize names (e.g., MIMIC-CXR and MIMIC-IV-ED) and explain the role of “MIMIC-MCC.”"

Response 6: Thanks for the comment and suggestion. We have clarified all relevant dataset names in Section 3.1. Specifically, MIMIC-CXR-JPG and MIMIC-CXR are not exactly the same dataset: the dataset we used is MIMIC-CXR-JPG, which contains chest X-ray images (in JPEG format), corresponding radiology reports, and disease labels, while MIMIC-IV-ED provides time-series vital-sign data for emergency department patients. Regarding "MIMIC-MCC", it was a labeling error in the dataset construction flowchart; this dataset does not actually exist. We have corrected the figure and updated the corresponding content in the revised manuscript to avoid confusion.

 

Comment 7: " "ConcatnateFusion" in Table 1 should be "ConcatenateFusion". In several equations, spacing around subscripts and parentheses is inconsistent (e.g., in Eq. (8)). Ensure all figures (especially Fig. 4 and Fig. 6) are high-resolution and that axis labels are legible."

Response 7: Thanks for the comment and suggestion. We have carefully addressed the issues you pointed out:

  1. The spelling of “ConcatnateFusion” in Table 1 has been corrected to “ConcatenateFusion”.
  2. We have reviewed and standardized the spacing around subscripts and parentheses in all equations, including Equation (8), to ensure consistency throughout the manuscript.
  3. All figures we used are high-resolution.

 

Comment 8: "More quantitative metrics (e.g., FLOPs, inference time) are needed to substantiate efficiency claims."

Response 8: Thanks for the comment and suggestion. We have added a new efficiency metric in Table 1, specifically the average inference time per sample on the test set, to more intuitively demonstrate the computational performance of the model in practical applications. In addition, we have included further analysis and discussion of the relevant experimental results in Section 3.2.

The added content is as follows:

"To further assess the practicality of each model in clinical settings, we compare their average inference time on the test set. As shown in Table 1, although MDFormer achieves the best overall performance across all evaluation metrics, its inference time (26.63ms) remains within a reasonable range. In fact, it is comparable to or even faster than some less accurate multimodal models, such as MedFuse (31.71ms) and MedCLIP (29.70ms). This demonstrates that MDFormer not only provides high accuracy but also maintains efficient inference speed, making it suitable for real-time or time-sensitive medical applications. "

 

Comment 9: "Literature review can be improved. Consider including (DOI:10.1109/ACCESS.2024.3503413) as this also proved potential of transformer while performing diagnosis."

Response 9: Thanks for the comment and suggestion. We have carefully reviewed the paper recommended by the reviewer, which indeed demonstrates the effectiveness of the Transformer architecture in medical diagnosis tasks. We have added a discussion of this work in Section 2.1.

The added content is as follows:

"Hayat et al. proposed a hybrid deep learning model that combines EfficientNetV2 and Vision Transformer (ViT) for breast cancer histopathological image classification. Experiments conducted on the BreakHis dataset showed that the model achieved an accuracy of 99.83% in binary classification and 98.10% in multi-class classification, demonstrating its superior performance in breast cancer detection. "

Reviewer 2 Report

Comments and Suggestions for Authors

The paper proposes MDFormer, a Transformer-based model for robust chest disease diagnosis, integrating medical imaging, clinical text, and vital signs through a novel multimodal fusion module (MFTrans) and a two-stage Mask-Enhanced Classification and Contrastive Learning (MECCL) framework. Its innovations include the lightweight MFAttention mechanism for efficient cross-modal fusion, the MECCL framework for handling modality missing, and dynamic loss weighting to balance training objectives, achieving superior performance on the MIMIC dataset. The paper is well-structured with comprehensive experimental design. The following minor revisions are suggested to enhance its quality:

1. The study relies on the MIMIC dataset but does not adequately discuss its limitations. In the "Data and Methods" or "Discussion" section, explicitly address the limitations of the MIMIC dataset, such as label distribution imbalance, data noise, or regional representation biases.

2. While the paper highlights the low parameter count of MFTrans, it lacks details on the actual computational costs of model training and inference (e.g., training time, GPU memory usage, inference latency). Provide computational efficiency metrics, such as training time, inference speed, and GPU memory consumption. Discuss whether the model can be further optimized through techniques like quantization or pruning to suit edge devices or mobile healthcare scenarios.

3. Although the paper emphasizes the practical significance of modality missing, it lacks discussion on MDFormer's application in real-world clinical settings (e.g., emergency room diagnostics, telemedicine). In the "Conclusions" or "Discussion" section, include a detailed analysis of MDFormer's potential deployment in clinical scenarios, such as emergency rooms or low-resource hospitals. Discuss how the model can be integrated with existing clinical workflows (e.g., electronic medical record systems) and address challenges in handling real-time data inputs.

The paper is recommended for acceptance with minor revisions.

Author Response

Thank you so much for your useful comments that help us to improve our paper. Our responses to your comments are described below.

Comment 1: "The study relies on the MIMIC dataset but does not adequately discuss its limitations. In the "Data and Methods" or "Discussion" section, explicitly address the limitations of the MIMIC dataset, such as label distribution imbalance, data noise, or regional representation biases."

Response 1: Thanks for the comment and suggestion. We have added a discussion on the limitations of the MIMIC dataset in Section 3.1. First, the MIMIC-CXR-JPG dataset does exhibit class imbalance in its label distribution. To mitigate this issue, we incorporated class weights during training to reduce the impact of imbalance on model performance. Second, the labels in MIMIC-CXR-JPG were automatically generated using the rule-based tool CheXpert, which may introduce some noise. However, this tool has been shown to provide reasonably accurate and reliable annotations in multiple studies. Additionally, as the dataset is primarily derived from a specific region in the United States, there may be potential biases in regional representation. To address this, we plan to incorporate multi-center datasets from different geographic locations in future work to further validate the robustness and generalizability of the proposed model.

The added content is as follows:

" The final dataset exhibits class imbalance in its label distribution. To alleviate this issue, we incorporated class weights during training to reduce the impact of label imbalance on model performance. "

 

Comment 2: "While the paper highlights the low parameter count of MFTrans, it lacks details on the actual computational costs of model training and inference (e.g., training time, GPU memory usage, inference latency). Provide computational efficiency metrics, such as training time, inference speed, and GPU memory consumption. Discuss whether the model can be further optimized through techniques like quantization or pruning to suit edge devices or mobile healthcare scenarios."

Response 2: Thanks for the comment and suggestion. First, regarding training time, all models were trained under the same hardware environment, and the training duration for each did not exceed 2.5 hours, indicating a relatively low overall training cost; we therefore believe it is unnecessary to list the training time for each model individually in the paper. Second, concerning GPU memory usage, memory consumption is mainly determined by the scale of model parameters, and we have listed the number of parameters for each model in Table 1, which can serve as an indirect reference for memory overhead. In addition, we have added a new efficiency metric in Table 1, namely the average inference time per sample on the test set, to more intuitively demonstrate the computational performance of the models in practical applications, and we have included further analysis and discussion of the corresponding experimental results in Section 3.2. Finally, we also recognize the potential of deploying the models on edge devices or in mobile healthcare scenarios, and we have introduced our follow-up research plans in Section 5, including incorporating lightweight techniques such as model quantization and pruning to further improve deployment efficiency and resource utilization.

The additional content in Section 3.2 is as follows:

"To further assess the practicality of each model in clinical settings, we compare their average inference time on the test set. As shown in Table 1, although MDFormer achieves the best overall performance across all evaluation metrics, its inference time (26.63ms) remains within a reasonable range. In fact, it is comparable to or even faster than some less accurate multimodal models, such as MedFuse (31.71ms) and MedCLIP (29.70ms). This demonstrates that MDFormer not only provides high accuracy but also maintains efficient inference speed, making it suitable for real-time or time-sensitive medical applications."

The additional content in Section 5 is as follows:

"Explore lightweight deployment techniques such as model quantization, pruning, and knowledge distillation to enable real-time, resource-efficient inference on edge devices, which is essential for mobile healthcare and remote diagnostic scenarios."

 

Comment 3: "Although the paper emphasizes the practical significance of modality missing, it lacks discussion on MDFormer's application in real-world clinical settings (e.g., emergency room diagnostics, telemedicine). In the "Conclusions" or "Discussion" section, include a detailed analysis of MDFormer's potential deployment in clinical scenarios, such as emergency rooms or low-resource hospitals. Discuss how the model can be integrated with existing clinical workflows (e.g., electronic medical record systems) and address challenges in handling real-time data inputs."

Response 3: Thanks for the comment and suggestion. We have added a discussion of the practical deployment potential of MDFormer in real-world clinical scenarios in Section 5. These additions aim to better position MDFormer within the context of practical, real-world medical applications.

The added content is as follows:

"Beyond technical advancements, we also deeply recognize the significant importance of applying MDFormer in real clinical environments. In emergency departments or resource-limited medical settings, the ability to make accurate decisions quickly based on incomplete multimodal data is critical for clinical decision-making. Our experimental results show that MDFormer demonstrates strong robustness even under conditions of modality missingness, making it an ideal choice for such application scenarios. Especially in the context of telemedicine or remote regions with scarce medical resources, MDFormer shows great potential for practical deployment. To further enhance MDFormer’s adaptability to real-world clinical settings, we plan to expand our experiments to more comprehensive multimodal datasets (such as CheXpert) and incorporate textual perturbation analysis involving low-quality or non-standard radiology reports. This will allow us to systematically assess the model's stability and practicality when faced with cross-institutional and heterogeneous data. Future research will also focus on the seamless integration of MDFormer into existing clinical workflows, such as Electronic Health Record (EHR) systems, targeting key tasks like early disease detection and critical case triage. This includes addressing the challenges posed by asynchronous or real-time multimodal data input. Specifically, MDFormer is expected to serve as an intelligent decision support tool within EHR systems, enabling real-time analysis of patient data from various modalities to assist clinicians in identifying early disease signals more accurately, providing diagnostic recommendations, or automatically assessing the urgency of clinical cases. Furthermore, we will explore the potential of MDFormer in disease progression prediction and personalized treatment planning. To meet the practical needs of telemedicine under edge-computing resource constraints, we also plan to conduct research on model lightweighting and deployment optimization. In summary, these efforts will lay a solid foundation for advancing MDFormer from research into clinical application. "

Reviewer 3 Report

Comments and Suggestions for Authors

This manuscript proposes MDFormer, a novel Transformer-based framework for multimodal fusion of chest X-ray images, radiology reports, and vital signs for chest disease diagnosis. The architecture includes an innovative MFTrans fusion module and a two-stage MECCL training strategy to enhance robustness, especially in the presence of missing modalities. The work is comprehensive and methodologically solid, offering a significant contribution to multimodal learning in medical diagnostics. The manuscript is well-organized and technically detailed. However, some points need clarification, and the biological and clinical implications could be better emphasized.

 

  1. While the technical innovation is clear, the manuscript could better emphasize clinical relevance:
    1. How would MDFormer be integrated into a real-world clinical workflow?
    2. What is the impact on patient outcomes or radiologist decision-making?
    3. Could the model support early diagnosis or triaging of critical cases?
  2. The proposed system is tested only on MIMIC datasets. Consider discussing:
    1. Generalization to other datasets (e.g., CheXpert, NIH ChestX-ray14).
    2. Variability in image/report quality across institutions.
  3. The model is designed to be interpretable, but explainability tools (e.g., attention maps, SHAP, Grad-CAM) are not presented. Including these would help clinicians understand which modality or region contributed most to each decision.
  4. MFTrans is compared to classical fusion strategies, but more modern multimodal transformer baselines (e.g., Flamingo, Gato, or Med-Flamingo) could be mentioned—even if only in discussion—to contextualize novelty. It's also unclear how modality contributions are balanced in the fusion. Could the model learn to "ignore" low-quality modalities?
  5. The training strategy is effective, but it's somewhat overly complex and may be hard to reproduce. A simplified diagram or step-by-step breakdown would improve accessibility. It would help to include metrics like training time increase or convergence behavior in the second stage.

Minor Comments

  1. Add error bars or statistical significance to performance comparisons.
  2. Add more recent references (2023–2024) in multimodal medical transformers to strengthen the related work section.
  3. Consider stating whether code/models will be publicly released to support reproducibility.

Author Response

Thank you so much for your useful comments that help us to improve our paper. Our responses to your comments are described below.

Comment 1: " While the technical innovation is clear, the manuscript could better emphasize clinical relevance:

  1. How would MDFormer be integrated into a real-world clinical workflow?
  2. What is the impact on patient outcomes or radiologist decision-making?
  3. Could the model support early diagnosis or triaging of critical cases?"

Response 1: Thanks for the comment and suggestion. We agree with your perspective on emphasizing clinical relevance. The design goal of MDFormer is to provide technical support for multimodal chest disease diagnosis. In response to your specific concerns, we have made the following additions and revisions in Section 5 of the paper:

The added content is as follows:

"Beyond technical advancements, we also deeply recognize the significant importance of applying MDFormer in real clinical environments. In emergency departments or resource-limited medical settings, the ability to make accurate decisions quickly based on incomplete multimodal data is critical for clinical decision-making. Our experimental results show that MDFormer demonstrates strong robustness even under conditions of modality missingness, making it an ideal choice for such application scenarios. Especially in the context of telemedicine or remote regions with scarce medical resources, MDFormer shows great potential for practical deployment. To further enhance MDFormer’s adaptability to real-world clinical settings, we plan to expand our experiments to more comprehensive multimodal datasets (such as CheXpert) and incorporate textual perturbation analysis involving low-quality or non-standard radiology reports. This will allow us to systematically assess the model's stability and practicality when faced with cross-institutional and heterogeneous data. Future research will also focus on the seamless integration of MDFormer into existing clinical workflows, such as Electronic Health Record (EHR) systems, targeting key tasks like early disease detection and critical case triage. This includes addressing the challenges posed by asynchronous or real-time multimodal data input. Specifically, MDFormer is expected to serve as an intelligent decision support tool within EHR systems, enabling real-time analysis of patient data from various modalities to assist clinicians in identifying early disease signals more accurately, providing diagnostic recommendations, or automatically assessing the urgency of clinical cases. Furthermore, we will explore the potential of MDFormer in disease progression prediction and personalized treatment planning. To meet the practical needs of telemedicine under edge-computing resource constraints, we also plan to conduct research on model lightweighting and deployment optimization. In summary, these efforts will lay a solid foundation for advancing MDFormer from research into clinical application. "

 

Comment 2: "The proposed system is tested only on MIMIC datasets. Consider discussing:

  1. Generalization to other datasets (e.g., CheXpert, NIH ChestX-ray14).
  2. Variability in image/report quality across institutions."

Response 2: Thanks for the comment and suggestion. First, publicly available labeled multimodal medical datasets remain relatively scarce. The CheXpert dataset provides a large number of chest X-ray images along with radiology reports and limited label information. We plan to expand our experiments in the future to evaluate the generalization ability of our model on this dataset. As for the NIH ChestX-ray14 dataset, although it includes labels, it only contains imaging data and thus is not suitable for our multimodal learning task. Second, we agree that a key challenge lies in the significant differences in image and report quality across institutions, which may affect the model's transferability and robustness in real-world clinical scenarios. To address this issue, we propose the following initial strategies:

  1. Image quality variation: To enhance the model’s robustness to different imaging qualities, we incorporated a variety of image augmentation techniques during training, including random noise injection, contrast adjustment, and blurring (a minimal example pipeline is sketched after this list). These augmentations simulate the quality variation observed in data from different institutions, thereby improving the model’s generalization capability.

  2. Text modality variation: Radiology reports from different institutions may vary in terminology, writing style, and structural consistency. To address this, we will introduce perturbation-based analysis using low-quality or non-standard reports in future work, to evaluate the model’s stability under diverse textual inputs.

A discussion of these aspects has been added to Section 5 of the paper.
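
The augmentation pipeline mentioned in point 1 above could look roughly like the following torchvision sketch; the parameter values (noise level, jitter range, blur kernel) are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Inject random Gaussian noise to simulate lower-quality acquisitions."""
    def __init__(self, std=0.02):
        self.std = std
    def __call__(self, img):
        return img + torch.randn_like(img) * self.std

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # contrast adjustment
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 1.5)),  # blurring
    transforms.ToTensor(),
    AddGaussianNoise(std=0.02),                                 # random noise injection
])
```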

 

Comment 3: " The model is designed to be interpretable, but explainability tools (e.g., attention maps, SHAP, Grad-CAM) are not presented. Including these would help clinicians understand which modality or region contributed most to each decision."

Response 3: Thanks for the comment and suggestion. We fully agree with the reviewer’s comment that “model interpretability is crucial for clinical applications.” In fact, our method incorporates a cross-modal attention mechanism based on the Transformer architecture, which offers a certain degree of inherent interpretability. By visualizing the cross-modal attention maps, we can preliminarily observe the relative contributions of different modality-specific tokens to the model’s decision-making. However, such visualizations cannot be precisely mapped back to the pixel space of the original images or the specific time points of sequential data. They only provide a global and intuitive reference for understanding the model’s decision basis. In future work, we’ll further explore and refine attention map interpretation to enhance both the interpretability and clinical usability of our model.
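
As an illustration of this kind of inspection, the sketch below extracts and plots head-averaged cross-modal attention weights from a generic multi-head attention layer; it is a stand-in example with random embeddings, not the actual MFTrans code.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

embed_dim, n_img_tokens, n_ts_tokens = 768, 49, 24
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)

img_tokens = torch.randn(1, n_img_tokens, embed_dim)   # image-patch embeddings (query)
ts_tokens = torch.randn(1, n_ts_tokens, embed_dim)     # vital-sign embeddings (key/value)

with torch.no_grad():
    _, attn_weights = cross_attn(img_tokens, ts_tokens, ts_tokens,
                                 need_weights=True, average_attn_weights=True)

# attn_weights has shape (1, n_img_tokens, n_ts_tokens): how strongly each image
# token attends to each time-series token, averaged over the 12 heads
plt.imshow(attn_weights[0].numpy(), aspect="auto", cmap="viridis")
plt.xlabel("time-series tokens")
plt.ylabel("image tokens")
plt.title("Cross-modal attention map (illustrative)")
plt.colorbar()
plt.show()
```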

Regarding the SHAP method, since our model is a custom multimodal Transformer built on the Hugging Face framework, integrating SHAP currently presents several engineering challenges, including:

  1. Constructing a wrapper function that encapsulates the tokenizer and inference process;
  2. Ensuring that the model’s input and output formats are compatible with SHAP (e.g., NumPy arrays or Tensors);
  3. Properly handling padding and masking mechanisms during inference to ensure valid explanations.

As for Grad-CAM, it is primarily designed for visualizing convolutional neural networks (CNNs). Since our model architecture includes not only CNN modules but also Transformer components, Grad-CAM is not applicable in our case.

 

Comment 4: "MFTrans is compared to classical fusion strategies, but more modern multimodal transformer baselines (e.g., Flamingo, Gato, or Med-Flamingo) could be mentioned—even if only in discussion—to contextualize novelty. It's also unclear how modality contributions are balanced in the fusion. Could the model learn to "ignore" low-quality modalities?"

Response 4: Thanks for the comment and suggestion. Our task focuses on multimodal multi-label chest disease classification, which places high demands on both efficiency and model size. Therefore, we primarily compare our method with encoder-only architectures in the experiments. While models such as Flamingo, Gato, and Med-Flamingo—which adopt decoder or encoder-decoder architectures—demonstrate stronger generative capabilities and cross-task adaptability, their large parameter sizes and high inference costs make them less suitable for our target application. In response to the reviewer’s suggestion, we have added a brief introduction to Flamingo and Med-Flamingo in Section 2 (Related Work) to better contextualize the novelty of our work, which lies in proposing a lightweight and extensible fusion module designed to accommodate varying modality combinations in diagnostic tasks.

Regarding the question of balancing the contribution of each modality, the attention mechanism in MFTrans dynamically adjusts the contribution weights during fusion based on the modality-specific feature representations. Through backpropagation during training, the model automatically learns the effectiveness of each modality in the diagnostic task and can implicitly downweight or ignore lower-quality modalities to a certain extent. However, we acknowledge that significant discrepancies in modality quality between training and testing data may still impact performance—this is a common challenge in multimodal learning and one of the directions we aim to further address in future work.

 

Comment 5: "The training strategy is effective, but it's somewhat overly complex and may be hard to reproduce. A simplified diagram or step-by-step breakdown would improve accessibility. It would help to include metrics like training time increase or convergence behavior in the second stage."

Response 5: Thanks for the comment and suggestion. The two-stage training strategy we propose is very intuitive in practice: the first stage directly optimizes the disease classification loss based on labels; the second stage, with the encoder parameters frozen, jointly optimizes four loss functions, i.e., the full-modality classification loss, the masked-modality classification loss, the contrastive learning loss, and the KL divergence loss. The training process is illustrated in detail in Figure 5, and we have provided further explanation of the loss weight adjustments in Section 3.4.4.

Regarding the time cost of the second stage of training, we have observed from multiple experiments that the training time for this stage is approximately 2.5 hours, which falls within a reasonable range. Therefore, we believe that no additional description of convergence metrics is necessary, and we have briefly noted the training time for this stage in Section 3 of the paper.

 

Comment 6: "Add error bars or statistical significance to performance comparisons."

Response 6: Thanks for the comment and suggestion. To further evaluate the clinical applicability of the models, we performed 100 bootstrap resampling trials on the test set. MDFormer achieved an average F1 score of 0.7920 (95% confidence interval: 0.7796-0.8044) and an AUC of 0.9011 (95% confidence interval: 0.8948-0.9074). In comparison, ALBEF attained an average F1 score of 0.7720 (95% confidence interval: 0.7596-0.7844) and an AUC of 0.8860 (95% confidence interval: 0.8798-0.8922). Additionally, under the same data splits, we conducted a paired Wilcoxon signed-rank test to statistically compare the two methods in terms of F1 score and AUC. The results indicated that the performance improvement was statistically significant (F1: p = 0.0047; AUC: p = 0.0031), further demonstrating the stability and reliability of our method. We have added these results and discussions to Section 4.2 of the revised manuscript.

 

Comment 7: "Add more recent references (2023–2024) in multimodal medical transformers to strengthen the related work section."

Response 7: Thanks for the comment and suggestion. We have updated the Related Work section to include additional recent references from 2023 and 2024 in the field of multimodal medical Transformers. Specifically, we have added the following:

  1. Med-Flamingo: A Multimodal Medical Few-Shot Learner (2023), which adapts Flamingo to the medical domain and demonstrates few-shot capabilities in medical vision-language tasks.
  2. Hybrid Deep Learning EfficientNetV2 and Vision Transformer (EffNetV2-ViT) Model for Breast Cancer Histopathological Image Classification (2024), which explores the integration of convolutional and Transformer-based architectures for histopathological image analysis.

These additions complement the existing 16 references (7 from 2023 and 9 from 2024) and further strengthen the contextual foundation for our work by highlighting recent advances in both fusion strategies and medical Transformer applications.

 

Comment 8: "Consider stating whether code/models will be publicly released to support reproducibility."

Response 8: Thanks for the comment and suggestion. We highly value the reproducibility of our research. We have organized and publicly released the code and model in a GitHub repository at https://github.com/thulxl/MFTrans to support replication and further development of this work. The relevant information has been added to Section 5.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

No more comments 

Reviewer 3 Report

Comments and Suggestions for Authors

The authors' revisions have greatly improved the manuscript.
