MedMAE: A Self-Supervised Backbone for Medical Imaging Tasks
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- In the Introduction section, the authors mention that, unlike traditional autoencoders, the loss function of their proposed method focuses solely on predicting the pixel values of masked blocks rather than reconstructing the entire input. The authors should provide a more detailed explanation of the motivation behind this design choice.
- The authors compare MedMAE to several methods for evaluation. However, these methods are relatively outdated. To strengthen their analysis, the authors should include comparisons with the most recent state-of-the-art methods.
- In the experimental results, the authors report only accuracy as a performance metric. To offer a more comprehensive evaluation, they should include additional model performance indicators, such as sensitivity, AUC (Area Under the Curve), F1 score and others.
- In the experimental setup, the authors should implement 5-fold cross-validation and report the standard deviation of the results to assess the robustness of MedMAE.
Author Response
Comment 1: In the Introduction section, the authors mention that, unlike traditional autoencoders, the loss function of their proposed method focuses solely on predicting the pixel values of masked blocks rather than reconstructing the entire input. The authors should provide a more detailed explanation of the motivation behind this design choice.
Response 1: Thank you for your insightful feedback. The primary motivation for focusing the loss function on predicting masked pixel values rather than reconstructing the entire input stems from the need for efficient representation learning in medical imaging. Unlike traditional autoencoders, which aim to reconstruct the full input and may encode redundant information, our approach encourages the model to capture meaningful semantic features from the available context.
Medical images contain complex anatomical structures with high redundancy, meaning that a standard reconstruction-based loss might lead to an overemphasis on low-level pixel similarities rather than learning task-relevant representations. By masking a significant portion of the image (75%) and training the model to infer only the missing regions, we ensure that the model learns robust, high-level features rather than simply memorizing the training data.
Additionally, this design choice aligns with the self-supervised learning paradigm, where the model is forced to develop an understanding of the underlying medical structures and patterns rather than relying on direct pixel-level correlations. This results in more transferable representations that generalize well to downstream tasks such as classification and segmentation, even with limited labeled data.
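For illustration, the masked-patch objective can be summarized by the minimal PyTorch-style sketch below. This is not our exact implementation; the tensor shapes and names are assumptions, but it shows that the reconstruction error is averaged only over the patches removed from the input.

```python
import torch

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed only on masked patches.

    pred, target: (batch, num_patches, patch_pixels) predicted and true pixel values per patch
    mask: (batch, num_patches) with 1 for masked (removed) patches and 0 for visible ones
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # MSE per patch
    # average the error over masked patches only; visible patches contribute nothing
    return (per_patch * mask).sum() / mask.sum()
```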
We have revised the Introduction to elaborate further on this motivation, adding a paragraph about this design choice to ensure clarity for readers.
Comment 2: The authors compare MedMAE to several methods for evaluation. However, these methods are relatively outdated. To strengthen their analysis, the authors should include comparisons with the most recent state-of-the-art methods.
Response 2: Thank you for your valuable feedback. We acknowledge the importance of comparing our proposed MedMAE with more recent state-of-the-art methods to provide a stronger and more comprehensive evaluation. While our original comparisons included widely used models such as ResNet, ViT, Swin Transformer, and MAE, we have updated our experiments to also include two more recent models, ConvNextV2 and DINOv2.
Comment 3: In the experimental results, the authors report only accuracy as a performance metric. To offer a more comprehensive evaluation, they should include additional model performance indicators, such as sensitivity, AUC (Area Under the Curve), F1 score, and others.
Response 3: Thank you for your insightful suggestion. We completely agree that including additional evaluation metrics such as sensitivity and AUC would provide a more comprehensive assessment of MedMAE’s performance. We have added sensitivity and AUC results for all experiments.
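For reference, the two added metrics can be computed as in the short scikit-learn sketch below (illustrative only; the labels and probabilities are placeholder values, not results from our experiments).

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# placeholder ground-truth labels and predicted positive-class probabilities
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.3])

sensitivity = recall_score(y_true, (y_prob >= 0.5).astype(int))  # true positive rate
auc = roc_auc_score(y_true, y_prob)                              # area under the ROC curve
print(f"sensitivity={sensitivity:.3f}, AUC={auc:.3f}")
```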
Comment 4: In the experimental setup, the authors should implement 5-fold cross-validation and report the standard deviation of the results to assess the robustness of MedMAE.
Response 4: Thank you for your valuable suggestion. We agree that implementing 5-fold cross-validation and reporting the standard deviation would provide a more thorough assessment of MedMAE’s robustness.
However, performing 5-fold cross-validation requires re-running all experiments, which is computationally intensive and cannot be completed within the 10-day review period. Given the scale of our experiments and the time required for training and evaluation, it is not feasible to incorporate this within the current review cycle.
That said, we recognize the importance of this analysis and plan to include it in a future version of the paper or a follow-up study to further strengthen our evaluation. We sincerely appreciate your feedback and your understanding.
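For completeness, the planned analysis would follow the standard pattern sketched below (a hypothetical outline assuming a generic scikit-learn-compatible model; it is not part of the current experiments).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def five_fold_scores(make_model, X, y, seed=0):
    """Train a fresh model on each fold and report mean and standard deviation of accuracy."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = make_model()                      # new model instance per fold
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores)), float(np.std(scores))
```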
Reviewer 2 Report
Comments and Suggestions for Authors
In this article, the authors propose a pre-trained backbone using the collected medical imaging dataset with a self-supervised learning technique called a masked autoencoder. This backbone can be used as a pre-trained model for any medical imaging task, as it is trained to learn a visual representation of different types of medical images. The authors' work is not bad, but some issues must still be addressed, such as:
- The manuscript lacks a clear description of how the collected dataset differs from existing public datasets. The authors should include more detailed information on the dataset's novelty, such as the criteria used for selecting the medical images and any potential biases that might arise from the dataset's sources.
- While the concept of using self-supervised learning in medical imaging is promising, the manuscript fails to adequately address potential limitations, such as the challenges of domain shift between medical and natural images.
- The literature review is weak. The authors should cite more recent papers for comparison. The authors can also refer to "visionary vigilance: optimized YOLOV8 for fallen person detection with large-scale benchmark dataset," "a hybrid convolution transformer for hyperspectral image classification," "depthwise channel attention network (DWCAN): an efficient and lightweight model for single image super‐resolution and metaverse gaming," "multiscale feature-learning with a unified model for hyperspectral image classification," "a novel dimensionality reduction algorithm for Cholangiocarcinoma hyperspectral images," and "high-precision skin disease diagnosis through deep learning on dermoscopic images."
- The methodology section is overly focused on technical details about MedMAE's architecture. It should be balanced with more explanation of the rationale behind choosing specific techniques, such as why ViT was selected or how the masked autoencoder performs better than other SSL methods for medical imaging.
- Evaluating only one pre-trained model (ViT) compared with other models like ResNet and EfficientNet may not be enough to demonstrate the proposed method's generalizability. The authors should test their model with a wider variety of pre-trained models or provide more information on why the chosen model is ideal for this type of task.
- The authors claim that their proposed method, MedMAE, outperforms existing models, but the comparisons with other models are not sufficiently rigorous. A more comprehensive statistical analysis, including confidence intervals or significance testing, should be provided to substantiate the claims of superior performance.
- A more detailed comparison with other state-of-the-art approaches should be added. I recommend adding a separate table to compare with other existing approaches. Without a proper comparison, how can novel researchers differentiate that this work is novel or efficient compared to the existing works?
- The quality of all the figures could be improved. Also, the text in the figures must be in the same style as the text in the paper body.
- In the experiments, the authors state that they used "linear probing" to evaluate downstream tasks. However, they provide no explanation of why this approach was chosen over other methods, like fine-tuning the entire model. Clarification of the advantages and disadvantages of linear probing in this context would improve the transparency of the methodology.
- The authors mention a conscious decision to avoid data augmentation techniques in the training process, but it is unclear why this approach was taken. The authors should explain why they opted not to use augmentation and discuss the potential consequences of this choice on model performance.
- The results section would benefit from a more detailed performance analysis. For instance, the authors should include additional metrics such as precision, recall, F1-score, or confusion matrices for the tasks involving binary classification and segmentation, as accuracy alone is not always the most reliable indicator of performance in medical imaging.
- The manuscript does not adequately discuss the computational cost and the scalability of the proposed model. The authors should include a more thorough analysis of the training time and resource requirements for MedMAE, particularly in comparison to other existing models, to provide readers with a clearer understanding of its feasibility for real-world applications.
Please revise.
Author Response
Comment 1: The manuscript lacks a clear description of how the collected dataset differs from existing public datasets. The authors should include more detailed information on the dataset's novelty, such as the criteria used for selecting the medical images and any potential biases that might arise from the dataset's sources.
Response 1: We thank the reviewer for this insightful comment. In Section 3.1 of the revised manuscript, we have expanded the description of the LUMID dataset to better highlight its novelty compared to existing public datasets. In particular, we now clarify that:
Selection Criteria and Image Inclusion: The images included in LUMID were selected using a set of rigorous criteria aimed at maximizing both diversity and clinical relevance. Specifically, we ensured representation from multiple imaging modalities and anatomical regions while applying strict quality control measures. Each image was standardized through a pre-processing pipeline that involved converting files to a consistent format, resizing images to a common resolution, and excluding any corrupted or substandard files. This systematic selection and preprocessing process sets LUMID apart from existing datasets that often focus on narrower clinical scenarios or imaging techniques. An illustrative sketch of this standardization step is provided at the end of this response.
Addressing Potential Biases: While aggregating data from multiple public sources may introduce biases due to varying imaging protocols, demographic differences, and scanner-specific characteristics, these factors also play a pivotal role in enhancing our model’s generalizability. The natural diversity in imaging conditions forces the model to learn robust, transferable features that are not overly tuned to any single domain. In other words, by being exposed to a wide spectrum of acquisition settings and patient populations, the pre-trained model is less likely to overfit to a narrow, domain-specific dataset. Nonetheless, we acknowledge these sources of variation and have implemented a standardized pre-processing pipeline to minimize extreme discrepancies. Additionally, we performed thorough checks to identify and remove corrupted or substandard images. This step is crucial for maintaining the overall quality of the dataset and preventing potential issues during model training.
Comparison with Existing Datasets: While many available public datasets are limited either in scale or in the range of modalities and anatomical regions they cover, LUMID addresses these gaps by providing a comprehensive resource that supports self-supervised learning. This wide-ranging dataset enables the development of robust pre-trained models that demonstrate improved generalizability across various downstream medical imaging tasks.
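As noted above, the following is a minimal, hypothetical sketch of the kind of standardization applied to LUMID; the output format, target resolution, and grayscale conversion shown here are illustrative assumptions rather than the exact pipeline parameters.

```python
from pathlib import Path
from PIL import Image

def standardize_images(src_dir, dst_dir, size=(224, 224)):
    """Convert images to a common format and resolution; skip files that cannot be read."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).rglob("*"):
        if not path.is_file():
            continue
        try:
            with Image.open(path) as img:
                img = img.convert("L").resize(size)   # grayscale, fixed resolution (assumed)
                img.save(dst / f"{path.stem}.png")    # consistent output format (assumed)
        except OSError:
            continue  # corrupted or non-image file: excluded from the dataset
```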
Comment 2: While the concept of using self-supervised learning in medical imaging is promising, the manuscript fails to adequately address potential limitations, such as the challenges of domain shift between medical and natural images.
Response 2: Thank you for your comment. In the revised manuscript, we now include a dedicated paragraph in the conclusion about our strategy to address the challenges of domain shift between medical and natural images. “Our approach leverages a large-scale medical imaging dataset for self-supervised pre-training, thereby significantly reducing the domain gap with natural images. We acknowledge that differences in image texture, contrast, and noise between medical and natural images remain a challenge. Models pre-trained on natural images often encounter difficulties when applied directly to medical imaging tasks due to these inherent disparities. Our strategy of exclusively using medical data helps mitigate this issue, but further research into advanced domain adaptation techniques could enhance model generalizability even more. Addressing this limitation will be a key focus of our future work”.
Comment 3: The literature review is weak. The authors should cite more recent papers for comparison. The authors can also refer to "visionary vigilance: optimized YOLOV8 for fallen person detection with large-scale benchmark dataset," "a hybrid convolution transformer for hyperspectral image classification," "depthwise channel attention network (DWCAN): an efficient and lightweight model for single image super‐resolution and metaverse gaming," "multiscale feature-learning with a unified model for hyperspectral image classification," "a novel dimensionality reduction algorithm for Cholangiocarcinoma hyperspectral images," and "high-precision skin disease diagnosis through deep learning on dermoscopic images."
Response 3: We appreciate the reviewer’s feedback. In the revised manuscript, we have expanded the related work section to include all the recommended papers. These additions enrich the literature review by providing a broader context of the latest advancements in deep learning across diverse applications, which helps to further position our work within the state-of-the-art.
Comment 4: The methodology section is overly focused on technical details about MedMAE's architecture. It should be balanced with more explanation of the rationale behind choosing specific techniques, such as why ViT was selected or how the masked autoencoder performs better than other SSL methods for medical imaging.
Response 4: Thank you for the comment. In response, we have expanded the methodology section to provide additional context and rationale behind our technical choices. For example, we now explain that the choice of the Vision Transformer (ViT) architecture was motivated by its ability to capture long-range dependencies and global context within medical images, which is a critical aspect given the subtle and diffuse features often present in such data. Moreover, we discuss how the masked autoencoder approach encourages the model to learn robust, context-aware representations by reconstructing missing patches, which has shown superior performance in transferring learned features to various downstream medical imaging tasks compared to other self-supervised learning methods. These revisions balance the technical details with the underlying rationale, clarifying why these techniques are particularly well-suited for medical imaging.
Comment 5: Evaluating only one pre-trained model (ViT) compared with other models like ResNet and EfficientNet may not be enough to demonstrate the proposed method's generalizability. The authors should test their model with a wider variety of pre-trained models or provide more information on why the chosen model is ideal for this type of task.
Response 5: We appreciate the reviewer’s suggestion. In the revised manuscript, we have expanded our experiments to include two recent models, ConvNextV2 and DINOv2, to further demonstrate the generalizability of our approach. Additionally, we have provided more discussion in the methodology section on why the Vision Transformer (ViT) architecture was chosen. ViT's ability to capture long-range dependencies and global context is particularly advantageous in medical imaging, where subtle and diffuse features are common. Our new comparisons with ConvNextV2 and DINOv2 not only reinforce the robustness of our method but also illustrate that our proposed pre-training strategy is effective across a range of modern architectures. We believe these additions adequately address the concern regarding the variety of pre-trained models evaluated.
Comment 6: The authors claim that their proposed method, MedMAE, outperforms existing models, but the comparisons with other models are not sufficiently rigorous. A more comprehensive statistical analysis, including confidence intervals or significance testing, should be provided to substantiate the claims of superior performance.
Response 6: We appreciate the reviewer’s feedback. In our revised manuscript, we have expanded our evaluation by adding two important metrics: sensitivity and area under the ROC curve (AUC) for all experiments. These metrics are particularly relevant in medical imaging, where understanding the true positive rate and the overall diagnostic ability of a model is critical.
Comment 7: A more detailed comparison with other state-of-the-art approaches should be added. I recommend adding a separate table to compare with other existing approaches. Without a proper comparison, how can novel researchers differentiate that this work is novel or efficient compared to the existing works?
Response 7: Thank you for your comment. Table 6 provides a comparison of our model’s performance against other state-of-the-art approaches. This table clearly illustrates how MedMAE outperforms existing methods, helping to delineate the novelty and efficiency of our approach for researchers in the field. Unlike the other tables, in which we reproduce results ourselves, Table 6 reports the results as published by the original authors.
Comment 8: The quality of all the figures could be improved. Also, the text in the figures must be in the same style as the text in the paper body.
Response 8: We have improved the quality of figures and made sure the text is readable.
Comment 9: In the experiments, the authors state that they used "linear probing" to evaluate downstream tasks. However, they provide no explanation of why this approach was chosen over other methods, like fine-tuning the entire model. Clarification of the advantages and disadvantages of linear probing in this context would improve the transparency of the methodology.
Response 9: We appreciate this observation. We chose linear probing because it offers a clear and unbiased measure of the quality of the learned representations by keeping the pre-trained backbone frozen and training only a simple linear classifier. This approach highlights the transferability of features to downstream tasks without the confounding effects of full model fine-tuning and is commonly used with self-supervised learning models. However, we acknowledge that while fine-tuning the entire model may yield higher performance, it also increases computational complexity and the risk of overfitting to task-specific data, and it can reduce model generalization by overwriting the features learned during the pretext phase, a problem known as catastrophic forgetting. We have now included a clarification of these trade-offs in Section 4.2 of the revised manuscript.
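As an illustration of this setup, the sketch below shows linear probing in simplified form (not our exact training code; it assumes the frozen backbone maps an image batch to a flat feature vector):

```python
import torch
import torch.nn as nn

def build_linear_probe(backbone, feature_dim, num_classes, lr=0.1):
    """Freeze the pre-trained backbone and train only a linear classifier on its features."""
    for param in backbone.parameters():
        param.requires_grad = False          # backbone weights stay fixed
    backbone.eval()

    head = nn.Linear(feature_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=lr)

    def train_step(images, labels):
        with torch.no_grad():
            features = backbone(images)      # frozen feature extraction
        loss = nn.functional.cross_entropy(head(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    return head, train_step
```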
Comment 10: The authors mention a conscious decision to avoid data augmentation techniques in the training process, but it is unclear why this approach was taken. The authors should explain why they opted not to use augmentation and discuss the potential consequences of this choice on model performance.
Response 10: We appreciate the reviewer’s comment. In our revised manuscript Section 3.1, we have clarified our decision to avoid data augmentation. Specifically, we opted not to use augmentation techniques because our dataset is already large and inherently diverse, which reduces the need for artificial data expansion. Additionally, we were concerned that certain augmentation strategies might introduce unnatural distortions or artifacts that could obscure critical clinical features, potentially compromising the model's ability to learn robust and clinically relevant representations. While data augmentation can be beneficial in scenarios with limited data, in our context, it may risk reducing the fidelity of the imaging data and ultimately impact performance negatively.
Comment 11: The results section would benefit from a more detailed performance analysis. For instance, the authors should include additional metrics such as precision, recall, F1-score, or confusion matrices for the tasks involving binary classification and segmentation, as accuracy alone is not always the most reliable indicator of performance in medical imaging.
Response 11: We appreciate the reviewer’s suggestion. In response, we have expanded our evaluation metrics across all experiments by adding sensitivity and AUC. These metrics are especially important in medical imaging, as they provide a better assessment of a model's diagnostic performance beyond what accuracy alone can capture.
Comment 12: The manuscript does not adequately discuss the computational cost and the scalability of the proposed model. The authors should include a more thorough analysis of the training time and resource requirements for MedMAE, particularly in comparison to other existing models, to provide readers with a clearer understanding of its feasibility for real-world applications.
Response 12: We appreciate the reviewer’s feedback regarding computational cost and scalability. In the revised manuscript, we now include a detailed analysis of training time in Section 4.1. In particular, our downstream training employs linear probing, where only the final classification layer is updated. This approach results in very short training times that are comparable to those of other state-of-the-art models using similar strategies. The minimal computational cost associated with this phase makes our method highly feasible for real-world applications, despite the intensive pre-training phase.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Authors have answered all questions.
Reviewer 2 Report
Comments and Suggestions for Authors
NA