Article

MedMAE: A Self-Supervised Backbone for Medical Imaging Tasks

Anubhav Gupta, Islam Osman, Mohamed S. Shehata, W. John Braun and Rebecca E. Feldman
Department of Computer Science, University of British Columbia, 3333 University Way, Kelowna, BC V1V 1V7, Canada
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Computation 2025, 13(4), 88; https://doi.org/10.3390/computation13040088
Submission received: 8 February 2025 / Revised: 15 March 2025 / Accepted: 18 March 2025 / Published: 1 April 2025
(This article belongs to the Special Issue Computational Medical Image Analysis—2nd Edition)

Abstract

Medical imaging tasks are very challenging due to the lack of publicly available labeled datasets. Hence, it is difficult to achieve high performance with existing deep learning models, as they require massive labeled datasets to be trained effectively. An alternative solution is to use pre-trained models and fine-tune them on a medical imaging dataset. However, existing models are pre-trained using natural images, which represent a different domain from that of medical imaging; this leads to poor performance due to domain shift. To overcome these problems, we propose a backbone pre-trained on a collected medical imaging dataset with a self-supervised learning technique called the masked autoencoder. This backbone can be used as a pre-trained model for any medical imaging task, as it is trained to learn a visual representation of different types of medical images. To evaluate the proposed backbone, we use four different medical imaging tasks and compare the results with those of existing pre-trained models. These experiments show the superiority of our proposed backbone in medical imaging tasks.

1. Introduction

Medical imaging tasks are very challenging due to the lack of publicly available labeled datasets. Hence, it is difficult to achieve high performance with existing deep learning models as they require a massive labeled dataset to be trained effectively. An alternative solution is to use pre-trained models and fine-tune them using medical imaging datasets. However, most existing models are pre-trained on natural images, which differ significantly from medical images in terms of structure, texture, and clinical relevance. This domain shift often leads to suboptimal performance in medical tasks.
To address these challenges, we propose a Vision Transformer (ViT)-based backbone [1] trained using self-supervised learning on a large-scale unlabeled dataset of medical images. Our model leverages a masked autoencoder (MAE) [2] framework, where a significant portion of the input image is masked and the model learns to reconstruct only the missing regions. This approach differs from traditional autoencoders, which aim to reconstruct the entire input. Instead, focusing on predicting masked pixel values encourages the model to extract high-level, meaningful features that are more transferable to downstream medical imaging tasks.
The key motivation for this design choice is that medical images contain complex anatomical structures with high redundancy. A conventional reconstruction-based loss might lead the model to learn low-level pixel dependencies rather than meaningful semantic representations. By masking 75% of the image and requiring the model to infer the missing regions, we enforce a learning process that captures essential anatomical patterns and contextual information, making the learned representations more robust and generalizable. Additionally, this targeted reconstruction approach aligns with recent advances in self-supervised learning, which have shown superior performance in learning transferable features from unlabeled data. The contributions of this work are summarized as follows:
  • Collecting a large-scale unlabeled medical imaging dataset from various sources that can be used for self-supervised and unsupervised learning techniques;
  • Proposing a medical masked autoencoder (MedMAE), a pre-trained backbone that can be used for any medical imaging task;
  • Extensive evaluation on multiple medical imaging tasks, demonstrating the superiority of our proposed backbone over existing pre-trained models.

2. Related Work

Self-supervised learning (SSL) has emerged as a powerful paradigm in medical image analysis, particularly when annotated data are scarce and there is an abundance of unlabeled data [3]. Several researchers have demonstrated the effectiveness of the SSL approach across various medical image analysis tasks, such as detection and classification [4,5,6], detection and localization [7,8,9], and segmentation [10,11,12]. Huang et al. [3] offer a comprehensive review of deep learning techniques utilizing SSL for medical image classification. In contrast, our approach focuses primarily on self-prediction strategies, which align with the development of an SSL-based model specifically designed for medical imaging tasks. The self-prediction technique involves masking parts of an image and attempting to reconstruct the missing areas using the remaining visible parts. Various studies have highlighted the application of self-prediction methods in medical image restoration. For example, Jung et al. proposed a masked autoencoder approach to restore functional connectivity matrices in resting-state fMRI data, where different rows and columns of the matrix were masked and then reconstructed to represent the relationships between brain regions [13]. Similarly, Jana et al. developed an encoder–decoder architecture to restore CT scans corrupted by patch swaps within a single slice [14]. Liu et al. introduced a U-net model for the restoration of ultrasound images altered by local pixel shuffling, where clinical data such as age, gender, and tumor size are integrated into the encoder’s output for further predictions [15]. Currently, no SSL-based backbone has been specifically designed for medical image interpretation tasks. Given the challenges of limited labeled data, SSL presents a valuable solution, as it can create flexible models that can later be fine-tuned for diverse applications, even in the absence of vast labeled datasets, unlike supervised learning methods that heavily rely on large quantities of annotated data.
To provide further context regarding the evolving landscape of deep learning in visual tasks, we highlight several recent contributions from adjacent domains. Khan et al. [16] presented an optimized YOLOv8 framework specifically tailored for fallen person detection. Their work introduces a large-scale benchmark dataset and demonstrates how architectural and training optimizations can lead to improvements in detection speed and accuracy in safety-critical scenarios. Arshad et al. [17] proposed a hybrid model that integrates convolutional layers with transformer blocks to effectively capture both local and global features in hyperspectral images. The method shows state-of-the-art performance on benchmark datasets by leveraging the complementary strengths of both architectures. The authors in [18] introduced DWCAN, a lightweight model that employs depthwise channel attention to achieve efficient super-resolution performance. Its design caters to real-time applications such as metaverse gaming, balancing high performance with a low computational cost. The authors in [19] presented a unified framework that learns features at multiple scales for enhanced hyperspectral image classification. By integrating multiscale information, the model achieves robust classification performance across varying conditions and datasets. Li et al. [20] focused on the unique challenges posed by hyperspectral images in a medical context; their paper introduces a new dimensionality reduction technique. The algorithm effectively preserves essential diagnostic features while reducing redundancy, thereby facilitating the more accurate classification of cholangiocarcinoma. Malik et al. [21] explored deep learning methodologies for the automated diagnosis of skin diseases using dermoscopic images. The authors report high-precision classification outcomes, highlighting the potential for deep learning to enhance the diagnostic accuracy in dermatology.

3. Methodology

We propose a pre-trained backbone using the collected medical imaging dataset and a self-supervised learning technique called the masked autoencoder. This backbone can be used as a pre-trained model for any medical imaging task, as it is trained to learn a visual representation of different types of medical images.

3.1. LUMID: Large-Scale Unlabeled Medical Imaging Dataset for Unsupervised and Self-Supervised Learning

Our proposed dataset, LUMID, is designed to overcome the limitations found in many existing public medical imaging datasets. In contrast to datasets that focus narrowly on a single modality or a specific set of anatomical regions, LUMID comprises over 2 million images collected from diverse public repositories. The details of these public repositories are shown in Table 1. This extensive collection spans multiple imaging modalities, including CT scans, MRI, and X-rays, and covers a wide range of anatomical regions, such as the chest, abdomen, brain, and pelvis. This breadth ensures that models pre-trained on LUMID can learn rich and transferable representations applicable to a wide variety of medical imaging tasks.
Selection Criteria and Image Inclusion: The images included in LUMID were selected using a set of rigorous criteria aimed at maximizing both diversity and clinical relevance. Specifically, we ensured representation from multiple imaging modalities and anatomical regions while applying strict quality control measures. Each image was standardized through a pre-processing pipeline that involved converting files to a consistent format, resizing images to a common resolution, and excluding any corrupted or substandard files. This systematic selection and pre-processing process sets LUMID apart from existing datasets that often focus on narrower clinical scenarios or imaging techniques. It is important to mention that we avoided applying any data augmentation because our dataset was already large and inherently diverse, which reduced the need for artificial data expansion. Additionally, we were concerned that certain augmentation strategies might introduce unnatural distortions or artifacts that could obscure critical clinical features, potentially compromising the model’s ability to learn robust and clinically relevant representations. While data augmentation can be beneficial in scenarios with limited data, in our context, it may risk reducing the fidelity of the imaging data and ultimately impact the performance negatively.
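To make the standardization step concrete, the following minimal sketch illustrates a pre-processing loop of the kind described above: files are converted to a consistent format, resized to a common resolution, and corrupted files are excluded. The target resolution, grayscale conversion, output format, and directory names are illustrative assumptions rather than the exact LUMID settings.

```python
# Minimal sketch of a LUMID-style standardization pipeline (assumed settings).
from pathlib import Path
from PIL import Image, UnidentifiedImageError

TARGET_SIZE = (224, 224)                  # assumed common resolution
SRC_DIR, DST_DIR = Path("raw"), Path("lumid")

def standardize(src: Path, dst_dir: Path) -> bool:
    """Convert one file to a consistent format and size; skip corrupted files."""
    try:
        with Image.open(src) as img:
            img = img.convert("L")                        # single-channel grayscale (assumption)
            img = img.resize(TARGET_SIZE, Image.BILINEAR)
            dst_dir.mkdir(parents=True, exist_ok=True)
            img.save(dst_dir / (src.stem + ".png"))       # consistent output format
        return True
    except (UnidentifiedImageError, OSError):
        return False                                      # exclude corrupted or unreadable files

kept = sum(standardize(p, DST_DIR) for p in SRC_DIR.glob("**/*") if p.is_file())
print(f"standardized {kept} images")
```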
Addressing Potential Biases: While aggregating data from multiple public sources may introduce biases due to varying imaging protocols, demographic differences, and scanner-specific characteristics, these factors also play a pivotal role in enhancing our model’s generalizability. The natural diversity in imaging conditions forces the model to learn robust, transferable features that are not overly tuned to any single domain. In other words, by being exposed to a wide spectrum of acquisition settings and patient populations, the pre-trained model is less likely to overfit to a narrow, domain-specific dataset. Nonetheless, we acknowledge these sources of variation and have implemented a standardized pre-processing pipeline to minimize extreme discrepancies. Additionally, we performed thorough checks to identify and remove corrupted or substandard images. This step is crucial in maintaining the overall quality of the dataset and preventing potential issues during model training.
Comparison with Existing Datasets: While many available public datasets are limited either in scale or in the range of modalities and anatomical regions that they cover, LUMID addresses these gaps by providing a comprehensive resource that supports self-supervised learning. This wide-ranging dataset enables the development of robust pre-trained models that demonstrate improved generalizability across various downstream medical imaging tasks.

3.2. Rationale for Key Architectural Choices

Medical imaging poses unique challenges, including subtle features spread over large regions and significant variability in image quality. To address these challenges, we selected the Vision Transformer (ViT) architecture as the backbone for MedMAE. Unlike traditional convolutional neural networks, the ViT is capable of capturing long-range dependencies and global contexts, which are critical in detecting subtle and diffuse patterns in medical images. This global receptive field allows the model to integrate contextual information effectively across the entire image, a crucial requirement in many clinical scenarios.
In parallel, we adopted a masked autoencoder (MAE) approach for self-supervised pre-training. The MAE framework trains the model by reconstructing missing parts of the image, forcing it to learn robust, context-aware representations. This reconstruction-based task contrasts with other SSL methods, such as contrastive learning, by focusing on understanding the intrinsic structure of the image rather than simply discriminating between instances. For medical imaging, where annotated data are scarce and details are critical, this approach ensures that the model develops a deeper understanding of anatomical variations and pathological features, thereby enhancing its transferability to downstream tasks.

3.3. MedMAE Architecture in Pre-Training

This section delves into the core building blocks of the medical masked autoencoder (MedMAE): the encoder, decoder, and loss function. Details of the MedMAE architecture are shown in Figure 1.
Encoder: The MedMAE encoder leverages a Vision Transformer (ViT) architecture. The input image/volume is first divided into non-overlapping patches. These patches are then randomly assigned to two groups: visible and masked patches. The encoder operates solely on the visible patches, aiming to learn a meaningful representation based on this partial information. To compensate for the lack of complete spatial information, each patch is augmented with a corresponding positional embedding before being fed into the ViT. This positional encoding helps the encoder to maintain an understanding of the relative location of each patch within the larger image/volume. Crucially, since the encoder’s output is used to reconstruct the masked regions, it is incentivized to extract a comprehensive representation from these incomplete observations.
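The sketch below illustrates the encoder-side patch embedding, positional encoding, and random 75% masking described above, written in PyTorch. The patch size, single input channel, and embedding width are assumptions for illustration; only the visible tokens returned here would be passed through the ViT blocks.

```python
# Sketch of patch embedding + positional encoding + random 75% masking (assumed sizes).
import torch
import torch.nn as nn

class PatchMasker(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768, mask_ratio=0.75):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)   # assumes single-channel input
        num_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))        # positional embeddings
        self.mask_ratio = mask_ratio

    def forward(self, x):                                  # x: (B, 1, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = tokens + self.pos
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))              # 25% of patches stay visible
        order = torch.rand(B, N, device=x.device).argsort(dim=1)         # random permutation
        visible_idx = order[:, :keep]
        visible = torch.gather(tokens, 1, visible_idx.unsqueeze(-1).expand(-1, -1, D))
        return visible, visible_idx                        # `visible` goes through the ViT blocks
```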
Decoder: The goal of the decoder is to fill the gaps (i.e., predict the masked patches). The MedMAE decoder receives two sets of tokens as input: (1) patch-wise representations, which are the outputs generated by the encoder for the visible patches; (2) mask tokens, which are learnable tokens representing the masked regions. They are inserted into the decoder at the corresponding positions where patches are masked in the original input. Similarly to the encoder, positional embeddings are added to all input tokens in the decoder. This enables the decoder to reconstruct the missing information at each masked position. It is important to note that the decoder serves as an auxiliary module used only during the pre-training stage. It is not employed in downstream tasks where the pre-trained encoder plays a central role.
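The following sketch shows one way the decoder input can be assembled from the encoder outputs and learnable mask tokens, following the description above. For simplicity, the encoder and decoder widths are assumed equal here (the original MAE inserts a linear projection between them), and the transformer decoder blocks themselves are omitted.

```python
# Sketch of decoder token assembly with mask tokens and positional embeddings (assumed widths).
import torch
import torch.nn as nn

class DecoderInput(nn.Module):
    def __init__(self, num_patches=196, dim=768):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))      # shared learnable mask token
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))   # decoder positional embeddings

    def forward(self, enc_out, visible_idx):
        # enc_out: (B, N_visible, dim); visible_idx: (B, N_visible) positions of the visible patches.
        B, D = enc_out.size(0), enc_out.size(-1)
        N = self.pos.size(1)
        tokens = self.mask_token.expand(B, N, D).clone()             # mask tokens everywhere ...
        tokens.scatter_(1, visible_idx.unsqueeze(-1).expand(-1, -1, D), enc_out)  # ... except visible slots
        return tokens + self.pos                                     # transformer decoder blocks follow
```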
Loss function: MedMAE leverages a reconstruction loss function, specifically the mean squared error (MSE). However, unlike traditional autoencoders that aim to reconstruct the entire input, MedMAE only focuses on predicting the pixel values of the masked patches. This approach has been shown to yield superior results. In practice, for better training stability, the normalized pixel values within each masked patch are used as reconstruction targets, instead of the raw values.
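A minimal sketch of this objective is given below: the reconstruction targets are the per-patch normalized pixel values, and the mean squared error is averaged over the masked patches only. The tensor layout (patches flattened to pixel vectors) is an assumption for illustration.

```python
# Sketch of the masked, per-patch-normalized MSE reconstruction loss.
import torch

def masked_mse(pred, target_patches, mask):
    """
    pred, target_patches: (B, N, patch_pixels); mask: (B, N), 1 for masked patches, 0 for visible.
    """
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    target = (target_patches - mean) / (var + 1e-6).sqrt()   # normalize pixel values within each patch
    loss = ((pred - target) ** 2).mean(dim=-1)               # per-patch MSE
    return (loss * mask).sum() / mask.sum()                  # average over masked patches only
```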

3.4. MedMAE Architecture in Downstream Tasks

After MedMAE is pre-trained, for each downstream task, we replace the MedMAE decoder with a task-specific head. For classification tasks, the head is simply a fully connected layer with a number of neurons equal to the number of classes, followed by a softmax activation function. For the segmentation task, four transposed convolutional layers are added to upscale the embeddings of the MedMAE encoder, followed by a convolutional layer with a single filter and a sigmoid activation function to produce the output mask. During the training of the downstream tasks, only the parameters of the added head are updated; this procedure is known as linear probing.
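As a concrete illustration of this setup, the sketch below freezes a pre-trained encoder and attaches a linear classification head followed by a softmax, so that only the head receives gradient updates during linear probing. The encoder interface (a module returning a flat feature vector) and the feature dimension are assumptions for illustration.

```python
# Sketch of attaching a classification head for linear probing (assumed encoder interface).
import torch.nn as nn

def build_linear_probe(encoder: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    for p in encoder.parameters():
        p.requires_grad = False                  # keep the pre-trained MedMAE encoder frozen
    head = nn.Sequential(
        nn.Linear(feat_dim, num_classes),        # fully connected layer, one neuron per class
        nn.Softmax(dim=-1),                      # softmax activation, as described above
    )
    return nn.Sequential(encoder, head)          # only `head` is updated during downstream training
```

In practice, the softmax can be folded into the loss by training on the raw logits with a cross-entropy criterion; it is kept explicit here to mirror the description above.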

4. Experiments and Results

4.1. Implementation Details

In this section, we describe the training setup and outcomes of our MedMAE-B model, which underwent an extensive pre-training process spanning over 1000 epochs. Each epoch took approximately 1 h and 20 min to train; therefore, pre-training in the pretext phase took approximately 1200 h in total. However, our downstream training employs linear probing, where only the final classification layer is updated. This approach results in very short training times that are comparable to those of other state-of-the-art models using similar strategies. The minimal computational cost associated with this phase makes our method highly feasible for real-world applications, despite the intensive pre-training phase. As previously stated, our evaluation was exclusively based on the MAE ViT-B model. We configured our model with a batch size of 64, the input images were resized to 224 × 224 pixels, and the learning rate was set to 10⁻³.
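The sketch below reproduces this linear-probing configuration (batch size 64, 224 × 224 inputs, learning rate 10⁻³) on stand-in tensors. The placeholder backbone, the two-class head, and the AdamW optimizer are assumptions for illustration; the paper does not restate the optimizer choice, and the real setup uses the frozen MedMAE encoder instead of the dummy module.

```python
# Sketch of the stated linear-probing hyperparameters on stand-in data (assumed optimizer).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

backbone = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 768))   # placeholder for the frozen encoder
for p in backbone.parameters():
    p.requires_grad = False
head = nn.Linear(768, 2)                                            # trainable task head (logits)
model = nn.Sequential(backbone, head)

data = TensorDataset(torch.randn(256, 1, 224, 224), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=64, shuffle=True)              # batch size 64, 224x224 inputs
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)           # learning rate 1e-3
criterion = nn.CrossEntropyLoss()

for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                                                 # gradients flow only into the head
    optimizer.step()
```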

4.2. MedMAE Evaluation

To evaluate MedMAE, we conducted four experiments, each representing a different medical imaging task. The first two experiments were conducted on private datasets, and the other two on publicly available datasets. The first experiment was the task of automating quality control for CT and MRI scanners. To maintain the reliability of radiology scanners, regular quality assurance (QA) protocols are implemented, typically on a daily or weekly schedule [49]. These procedures, however, are time-consuming and require the scanners to be temporarily taken offline, which can disrupt operations and patient care. The goal is to automate this process using deep learning to detect whether an image was captured by a well-calibrated or a miscalibrated scanner. In this experiment, we used a dataset of CT and MRI scans collected from two hospitals. The images were labeled with either ‘pass’ or ‘fail’, where ‘pass’ means that the image was captured by a well-calibrated scanner and ‘fail’ means that it was captured by a miscalibrated scanner. The second experiment was the task of breast cancer detection, in which we used images of the breast and the goal was to determine whether disease was present. The third experiment was the task of pneumonia detection using the publicly available ChestX-ray14 dataset. Finally, the last experiment was a medical image segmentation task on the publicly available CVC-ClinicDB dataset. We compared the results of our proposed model with those of existing models pre-trained on the ImageNet (IN1k) dataset [50]. We chose seven different models with almost the same number of parameters: ResNet [51], EfficientNetv2-S [52], ConvNext-B [53], ConvNextV2-B [54], DINOv2 [55], ViT-B [1], and Swin-B [56]. All of these models except DINOv2 were pre-trained with supervised learning on ImageNet; DINOv2 was pre-trained with self-supervised learning. Additionally, we compared our results with those of the original MAE [2]. Evaluation was performed by linearly probing all models for 300 epochs on the training set, and the performance on the testing set is reported for each experiment. Linear probing is the process of fine-tuning only the last fully connected layer of the model while keeping the rest of the model parameters (i.e., the pre-trained parameters) unchanged. It is important to mention that the datasets in all experiments except ChestX-ray14 were not included in the pre-training dataset (i.e., they were unseen during pre-training).
Evaluation via linear probing: We chose linear probing because it offers a clear and unbiased measure of the quality of the learned representations: the pre-trained backbone is kept frozen and only a simple linear classifier is trained. This approach highlights the transferability of features to downstream tasks without the confounding effects of full model fine-tuning and is commonly used with self-supervised learning models. We acknowledge that, while fine-tuning the entire model may yield higher performance, it also increases the computational cost and the risk of overfitting to task-specific data, reducing the model’s capacity for generalization by forgetting the features learned during the pretext phase; this problem is known as catastrophic forgetting.

4.2.1. Task 1: Automating Quality Control for CT and MRI Scanners

The accuracy and timeliness of patient diagnoses using radiology scanners are directly dependent on the quality of the images produced during scanning. If the image quality is compromised, it can have serious consequences. For instance, scans from a miscalibrated scanner may require costly and time-consuming repeat procedures. In more critical cases, undetected quality issues can lead to missed diagnoses, putting patients’ health at significant risk and, in extreme cases, resulting in fatal outcomes [57,58,59,60]. Implementing the real-time quality control (QC) of scanner calibration offers multiple benefits, including (1) improved patient throughput by minimizing downtime, which reduces wait times, and (2) enhanced diagnostic accuracy through the quick detection and resolution of image quality issues.
CT scan dataset: Specialized datasets were built to meet the unique requirements of this experiment. We procured patient studies, which were retrieved in the form of DICOM files. A meticulous de-identification process was applied to each scan, ensuring the absence of any identifiable patient data. Each DICOM file was then converted to an image format. The images represented different body parts captured by a CT scanner, and each image was labeled with either ‘pass’ or ‘fail’; hence, the task to be performed was binary image classification. The collected data comprised CT scan images of different body parts from more than 100 unique patients, with more than 30,000 images in total.
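A hedged sketch of the DICOM-to-image conversion step is shown below using pydicom. The min-max rescaling to 8-bit and the PNG output are illustrative assumptions rather than the authors' exact pipeline, and de-identification is assumed to have already been applied upstream.

```python
# Sketch of converting a de-identified DICOM file to an 8-bit image (assumed normalization).
from pathlib import Path
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(dcm_path: Path, out_dir: Path) -> Path:
    ds = pydicom.dcmread(dcm_path)
    pixels = ds.pixel_array.astype(np.float32)
    # Rescale to 0-255 for an 8-bit image (window/level handling omitted in this sketch).
    pixels = (pixels - pixels.min()) / max(pixels.max() - pixels.min(), 1e-6) * 255.0
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (dcm_path.stem + ".png")
    Image.fromarray(pixels.astype(np.uint8)).save(out_path)
    return out_path
```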
Table 2 shows the results of our proposed model in comparison with existing deep learning models. As shown in Table 2, the performance of our proposed model is superior to that of the other models; even in comparison with the same architecture pre-trained on ImageNet (i.e., MAE), there is a large performance gap of approximately 12%. It is important to mention that we could not pre-train the other models using our dataset because these models were trained using supervised learning, and our dataset has no labels.
MRI scan dataset: This dataset was collected in the same way as the CT scan dataset, but the images were captured from an MRI scanner instead of via CT. The dataset consisted of images of three body parts: the abdomen, head, and shoulder. The collected number of cases (patients) was 426. Each case had around 20 images. Each image was labeled with either ‘pass’ or ‘fail’, as with the CT dataset.
Table 3 shows the results of our proposed model against those of other models. Our model, MedMAE, shows superior performance, with a gap of 11.3% in comparison with the MAE pre-trained on ImageNet. It is important to mention that neither the MRI nor the CT scan dataset was seen during the pre-training phase, although the pre-training dataset LUMID does contain MRI and CT images collected from other sources.

4.2.2. Task 2: Breast Cancer Prediction from CT Images

This dataset contained data from 29,686 patients. Each patient had multiple CT studies collected from 1988 to 2018. The dataset contained more than 5 million CT images of the chest area. We split the dataset into training, validation, and testing sets with ratios of 70%, 15%, and 15%, respectively. The results are shown in Table 4. MedMAE has a performance gain of 8.9% over the MAE pre-trained on ImageNet.

4.2.3. Task 3: Pneumonia Detection from Chest X-Ray Images

We used the ChestX-ray14 [22] dataset in this experiment. This dataset comprises 112,120 frontal-view X-ray images of more than 30,000 unique patients. For this dataset, we report the results of our model, MedMAE, against those of the same models as in the previous experiments, as well as against state-of-the-art models on this specific dataset. Table 5 shows the results of the linear probing of our model and the other selected models on the ChestX-ray14 dataset. As shown in the table, MedMAE has 1% higher accuracy than the second-best model. For the comparison with state-of-the-art models in Table 6, we fine-tuned all parameters of MedMAE instead of performing linear probing, because these models are trained on this dataset with supervised learning (i.e., all of their parameters are updated). The results of our model are compared with those of existing supervised learning and self-supervised learning models. As shown in the table, MedMAE shows an improvement in performance over the other models, with a performance gain of 3.6%.

4.2.4. Task 4: Polyp Segmentation in Colonoscopy Sequences

In this experiment, we used the CVC-ClinicDB dataset [69], which consists of frames extracted from colonoscopy videos. The dataset contains several examples of polyp frames and the corresponding ground truth; the number of labeled images is 612. The dataset was randomly split into training, validation, and testing sets with ratios of 70%, 15%, and 15%, respectively. The task performed on this dataset differs from those of the other experiments: the goal is to segment the region of interest and produce a binary mask, where white pixels represent the region of interest and black pixels represent the remainder. In the other experiments, we added a fully connected layer for linear probing. In this experiment, however, we added a sequence of transposed convolutional layers to upscale the embeddings produced by the encoder to a resolution equivalent to that of the input image. This sequence consists of four layers, each followed by a GELU [70] activation function, with 256, 128, 64, and 32 filters, respectively. All layers have a kernel size of 2 × 2 and a stride of 2 (i.e., each layer doubles the spatial dimensions of its input). After this sequence, a final convolutional layer with a single 1 × 1 filter is added, followed by a sigmoid activation function to produce the output mask; a sketch of this head, together with the F1-score computation, follows the formula below. As shown in Table 7, our proposed model, MedMAE, outperforms the other models by a noticeable gap of 4.1% in the F1-score. The F1-score is calculated as follows:
F = 2tp / (2tp + fp + fn)
where F is the F1-score, tp is the number of true positives, fp is the number of false positives, and fn is the number of false negatives. A sample segmentation result is shown in Figure 2.
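The sketch below assembles the segmentation head as described: four 2 × 2, stride-2 transposed convolutions with 256, 128, 64, and 32 filters, each followed by a GELU activation, and a final 1 × 1 convolution with a sigmoid; a small helper computes the F1-score from the counts defined above. The encoder embedding width (768) and the 14 × 14 token grid for 224 × 224 inputs are assumptions for a ViT-B-style backbone.

```python
# Sketch of the segmentation head and F1-score computation (assumed embedding width and grid size).
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    def __init__(self, in_dim=768):
        super().__init__()
        dims = [in_dim, 256, 128, 64, 32]
        layers = []
        for i in range(4):
            layers += [nn.ConvTranspose2d(dims[i], dims[i + 1], kernel_size=2, stride=2), nn.GELU()]
        layers += [nn.Conv2d(32, 1, kernel_size=1), nn.Sigmoid()]   # single-channel binary mask
        self.head = nn.Sequential(*layers)

    def forward(self, tokens):                      # tokens: (B, N, in_dim) from the frozen encoder
        B, N, D = tokens.shape
        side = int(N ** 0.5)                        # e.g., a 14x14 patch grid for 224x224 inputs
        grid = tokens.transpose(1, 2).reshape(B, D, side, side)
        return self.head(grid)                      # (B, 1, 16*side, 16*side), i.e., input resolution

def f1_score(pred_mask, gt_mask, thr=0.5):
    """Binary F1-score from predicted probabilities and a ground-truth mask."""
    pred = (pred_mask > thr).float()
    tp = (pred * gt_mask).sum()
    fp = (pred * (1 - gt_mask)).sum()
    fn = ((1 - pred) * gt_mask).sum()
    return (2 * tp / (2 * tp + fp + fn + 1e-6)).item()
```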

4.3. Visual Results

To provide a clearer understanding of how the pre-training process influences the model’s performance, we present examples of image reconstructions before and after the extensive training phase. Figure 3 and Figure 4 illustrate the reconstruction results for a sample image from a dataset not included in the training process: Figure 3 shows the reconstruction produced by an MAE pre-trained on natural images, while Figure 4 shows the reconstruction produced by our MedMAE-B model. The improvement in the reconstructed image highlights the model’s progress and demonstrates the effectiveness of our training process in improving the model’s ability to analyze medical images.

5. Conclusions and Future Work

In this paper, we introduce LUMID, a large-scale unlabeled medical imaging dataset containing extensive and diverse medical images. Additionally, we propose a ViT-based backbone that is pre-trained on this dataset, and we show that this backbone can be used for various medical imaging tasks. We evaluated the proposed model on four tasks, and the results of all tasks showed the superiority of our proposed model in comparison with other models. The average performance gain across all tasks between our MedMAE and the original MAE was approximately 8%. From the conducted experiments, we can see that models pre-trained with self-supervised learning generalize better to medical tasks than models pre-trained with supervised learning. Additionally, if the model is pre-trained using medical images, its performance on other medical tasks is much better than that of a model pre-trained using natural images.
Our approach leverages a large-scale medical imaging dataset for self-supervised pre-training, thereby significantly reducing the domain gap with natural images. We acknowledge that differences in image texture, contrast, and noise between medical and natural images remain a challenge. Models pre-trained on natural images often encounter difficulties when applied directly to medical imaging tasks due to these inherent disparities. Our strategy of exclusively using medical data helps to mitigate this issue, but further research into advanced domain adaptation techniques could enhance the model’s generalizability even more. Addressing this limitation will be a key focus of our future work.

Author Contributions

Conceptualization, M.S.S., R.E.F. and W.J.B.; methodology, A.G. and I.O.; validation, M.S.S., R.E.F. and W.J.B.; formal analysis, W.J.B.; investigation, I.O.; resources, M.S.S.; data curation, A.G.; writing—original draft preparation, A.G.; writing—review and editing, I.O.; visualization, A.G.; supervision, M.S.S., R.E.F. and W.J.B.; project administration, M.S.S., R.E.F. and W.J.B.; funding acquisition, M.S.S. and W.J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by The Natural Sciences and Engineering Research Council of Canada (NSERC).

Data Availability Statement

LUMID is a large-scale, unlabeled collection of over two million medical images spanning multiple imaging modalities, including CT scans, X-rays, MRI, and more. This dataset has been meticulously curated from publicly available medical imaging repositories, addressing the critical challenge of limited scale in existing public datasets and the inaccessibility of high-quality private datasets. The primary motivation behind creating this dataset was to empower the medical imaging community with a resource suitable for the development and training of advanced deep learning models. By enabling the use of unsupervised and self-supervised learning approaches, this dataset facilitates the learning of rich, transferable representations that can significantly enhance the performance across various medical imaging tasks, including classification, segmentation, and anomaly detection. The dataset is available at https://doi.org/10.20383/103.01017 (accessed on 17 March 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  2. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16000–16009. [Google Scholar]
  3. Huang, S.C.; Pareek, A.; Jensen, M.; Lungren, M.P.; Yeung, S.; Chaudhari, A.S. Self-supervised learning for medical image classification: A systematic review and implementation guidelines. NPJ Digit. Med. 2023, 6, 74. [Google Scholar] [CrossRef] [PubMed]
  4. Sriram, A.; Muckley, M.; Sinha, K.; Shamout, F.; Pineau, J.; Geras, K.J.; Azour, L.; Aphinyanaphongs, Y.; Yakubova, N.; Moore, W. COVID-19 prognosis via self-supervised representation learning and multi-image prediction. arXiv 2021, arXiv:2101.04909. [Google Scholar]
  5. Lu, M.Y.; Chen, R.J.; Mahmood, F. Semi-supervised breast cancer histology classification using deep multiple instance learning and contrast predictive coding. In Proceedings of the Medical Imaging 2020: Digital Pathology, SPIE, Houston, TX, USA, 15–20 February 2020; Volume 11320, p. 113200J. [Google Scholar]
  6. Li, X.; Hu, X.; Qi, X.; Yu, L.; Zhao, W.; Heng, P.A.; Xing, L. Rotation-oriented collaborative self-supervised learning for retinal disease diagnosis. IEEE Trans. Med. Imaging 2021, 40, 2284–2294. [Google Scholar] [CrossRef]
  7. Chen, L.; Bentley, P.; Mori, K.; Misawa, K.; Fujiwara, M.; Rueckert, D. Self-supervised learning for medical image analysis using image context restoration. Med. Image Anal. 2019, 58, 101539. [Google Scholar] [CrossRef]
  8. Nguyen, X.B.; Lee, G.S.; Kim, S.H.; Yang, H.J. Self-supervised learning based on spatial awareness for medical image analysis. IEEE Access 2020, 8, 162973–162981. [Google Scholar] [CrossRef]
  9. Sowrirajan, H.; Yang, J.; Ng, A.Y.; Rajpurkar, P. Moco pretraining improves representation and transferability of chest X-ray models. Proc. Med. Imaging Deep Learn. PMLR 2021, 143, 728–744. [Google Scholar]
  10. Karani, K.; Konukoglu, E. Contrastive learning of global and local features for medical image segmentation with limited annotations. Adv. Neural Inf. Process. Syst. 2020, 33, 12546–12558. [Google Scholar]
  11. Taleb, A.; Loetzsch, W.; Danz, N.; Severin, J.; Gaertner, T.; Bergner, B.; Lippert, C. 3D self-supervised methods for medical imaging. Adv. Neural Inf. Process. Syst. 2020, 33, 18158–18172. [Google Scholar]
  12. Xie, Y.; Zhang, J.; Liao, Z.; Xia, Y.; Shen, C. PGL: Prior-guided local self-supervised learning for 3D medical image segmentation. arXiv 2020, arXiv:2011.12640. [Google Scholar]
  13. Jung, W.; Heo, D.W.; Jeon, E.; Lee, J.; Suk, H.I. Inter-regional high-level relation learning from functional connectivity via self-supervision. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part II 24. Springer: Cham, Switzerland, 2021; pp. 284–293. [Google Scholar]
  14. Jana, A.; Qu, H.; Minacapelli, C.D.; Catalano, C.; Rustgi, V.; Metaxas, D. Liver fibrosis and nas scoring from ct images using self-supervised learning and texture encoding. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 1553–1557. [Google Scholar]
  15. Liu, C.; Qiao, M.; Jiang, F.; Guo, Y.; Jin, Z.; Wang, Y. TN-USMA Net: Triple normalization-based gastrointestinal stromal tumors classification on multicenter EUS images with ultrasound-specific pretraining and meta attention. Med. Phys. 2021, 48, 7199–7214. [Google Scholar] [CrossRef] [PubMed]
  16. Khan, H.; Ullah, I.; Shabaz, M.; Omer, M.F.; Usman, M.T.; Guellil, M.S.; Koo, J. Visionary vigilance: Optimized YOLOV8 for fallen person detection with large-scale benchmark dataset. Image Vis. Comput. 2024, 149, 105195. [Google Scholar] [CrossRef]
  17. Arshad, T.; Zhang, J.; Ullah, I. A hybrid convolution transformer for hyperspectral image classification. Eur. J. Remote Sens. 2024, 57, 2330979. [Google Scholar] [CrossRef]
  18. Yasir, M.; Ullah, I.; Choi, C. Depthwise channel attention network (DWCAN): An efficient and lightweight model for single image super-resolution and metaverse gaming. Expert Syst. 2024, 41, e13516. [Google Scholar] [CrossRef]
  19. Arshad, T.; Zhang, J.; Ullah, I.; Ghadi, Y.Y.; Alfarraj, O.; Gafar, A. Multiscale feature-learning with a unified model for hyperspectral image classification. Sensors 2023, 23, 7628. [Google Scholar] [CrossRef]
  20. Li, C.; Wang, M.; Sun, X.; Zhu, M.; Gao, H.; Cao, X.; Ullah, I.; Liu, Q.; Xu, P. A novel dimensionality reduction algorithm for Cholangiocarcinoma hyperspectral images. Opt. Laser Technol. 2023, 167, 109689. [Google Scholar]
  21. Malik, S.G.; Jamil, S.S.; Aziz, A.; Ullah, S.; Ullah, I.; Abohashrh, M. High-precision skin disease diagnosis through deep learning on dermoscopic images. Bioengineering 2024, 11, 867. [Google Scholar] [CrossRef]
  22. Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. Chestx-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2097–2106. [Google Scholar]
  23. Rutherford, M.; Mun, S.K.; Levine, B.; Bennett, W.; Smith, K.; Farmer, P.; Jarosz, Q.; Wagner, U.; Freyman, J.; Blake, G.; et al. A DICOM dataset for evaluation of medical image de-identification. Sci. Data 2021, 8, 183. [Google Scholar] [CrossRef]
  24. Saltz, J.; Saltz, M.; Prasanna, P.; Moffitt, R.; Hajagos, J.; Bremer, E.; Balsamo, J.; Kurc, T. Stony Brook University COVID-19 Positive Cases. 2021. Available online: https://www.cancerimagingarchive.net/collection/covid-19-ny-sbu (accessed on 17 March 2025).
  25. Chen, L.; Wang, W.; Jin, K.; Yuan, B.; Tan, H.; Sun, J.; Guo, Y.; Luo, Y.; Feng, S.T.; Yu, X.; et al. Special issue “The advance of solid tumor research in China”: Prediction of Sunitinib efficacy using computed tomography in patients with pancreatic neuroendocrine tumors. Int. J. Cancer 2023, 152, 90–99. [Google Scholar]
  26. Li, M.; Gong, J.; Bao, Y.; Huang, D.; Peng, J.; Tong, T. Special issue “The advance of solid tumor research in China”: Prognosis prediction for stage II colorectal cancer by fusing computed tomography radiomics and deep-learning features of primary lesions and peripheral lymph nodes. Int. J. Cancer 2023, 152, 31–41. [Google Scholar]
  27. An, P.; Xu, S.; Harmon, S.A.; Turkbey, E.B.; Sanford, T.H.; Amalou, A.; Kassin, M.; Varble, N.; Blain, M.; Anderson, V.; et al. Ct Images in COVID-19 [Data Set]. 2020. Available online: https://www.cancerimagingarchive.net/collection/ct-images-in-covid-19 (accessed on 17 March 2025).
  28. Tsai, E.; Simpson, S.; Lungren, M.P.; Hershman, M.; Roshkovan, L.; Colak, E.; Erickson, B.J.; Shih, G.; Stein, A.; Kalpathy-Cramer, J.; et al. Data from Medical Imaging Data Resource Center (MIDRC)-RSNA International COVID Radiology Database (RICORD) Release 1C-Chest X-ray, Covid+(MIDRC-RICORD-1C). 2021. Available online: https://www.cancerimagingarchive.net/collection/midrc-ricord-1c (accessed on 17 March 2025).
  29. Rister, B.; Yi, D.; Shivakumar, K.; Nobashi, T.; Rubin, D.L. CT-ORG, a new dataset for multiple organ segmentation in computed tomography. Sci. Data 2020, 7, 381. [Google Scholar] [CrossRef] [PubMed]
  30. Yorke, A.A.; McDonald, G.C.; Solis, D.; Guerrero, T. Pelvic Reference Data. 2019. Available online: https://www.cancerimagingarchive.net/collection/pelvic-reference-data (accessed on 17 March 2025).
  31. Kalpathy-Cramer, J.; Napel, S.; Goldgof, D.; Zhao, B. QIN Multi-Site Collection of Lung CT Data with Nodule Segmentations. 2015. Available online: https://www.cancerimagingarchive.net/collection/qin-lung-ct (accessed on 17 March 2025).
  32. Roth, H.R.; Farag, A.; Turkbey, E.; Lu, L.; Liu, J.; Summers, R.M. Data from pancreas-ct. 2016. Available online: https://www.cancerimagingarchive.net/collection/pancreas-ct (accessed on 17 March 2025).
  33. Grove, O.; Berglund, A.E.; Schabath, M.B.; Aerts, H.J.; Dekker, A.; Wang, H.; Velazquez, E.R.; Lambin, P.; Gu, Y.; Balagurunathan, Y.; et al. Quantitative computed tomographic descriptors associate tumor shape complexity and intratumor heterogeneity with prognosis in lung adenocarcinoma. PLoS ONE 2015, 10, e0118261. [Google Scholar] [CrossRef] [PubMed]
  34. Aerts, H.; Velazquez, E.R.; Leijenaar, R.; Parmar, C.; Grossmann, P.; Cavalho, S.; Bussink, J.; Monshouwer, R.; Haibe-Kains, B.; Rietveld, D.; et al. Data from NSCLC-Radiomics. Available online: https://www.cancerimagingarchive.net/collection/nsclc-radiomics (accessed on 17 March 2025).
  35. Teng, X. Improving Radiomic Model Reliability and Generalizability Using Perturbations in Head and Neck Carcinoma. Ph.D. Thesis, Hong Kong Polytechnic University, Hong Kong, 2023. [Google Scholar]
  36. Smith, K.; Clark, K.; Bennett, W.; Nolan, T.; Kirby, J.; Wolfsberger, M.; Moulton, J.; Vendt, B.; Freymann, J. Data from CT_COLONOGRAPHY. 2015. Available online: https://www.cancerimagingarchive.net/collection/ct-colonography (accessed on 17 March 2025).
  37. Armato, S.G., III; McLennan, G.; Bidaut, L.; McNitt-Gray, M.F.; Meyer, C.R.; Reeves, A.P.; Zhao, B.; Aberle, D.R.; Henschke, C.I.; Hoffman, E.A.; et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans. Med. Phys. 2011, 38, 915–931. [Google Scholar] [CrossRef] [PubMed]
  38. Machado, M.A.; Moraes, T.F.; Anjos, B.H.; Alencar, N.R.; Chang, T.M.C.; Santana, B.C.; Menezes, V.O.; Vieira, L.O.; Brandão, S.C.; Salvino, M.A.; et al. Association between increased Subcutaneous Adipose Tissue Radiodensity and cancer mortality: Automated computation, comparison of cancer types, gender, and scanner bias. Appl. Radiat. Isot. 2024, 205, 111181. [Google Scholar] [CrossRef]
  39. Albertina, B.; Watson, M.; Holback, C.; Jarosz, R.; Kirk, S.; Lee, Y.; Rieger-Christ, K.; Lemmerman, J. The Cancer Genome Atlas Lung Adenocarcinoma Collection (tcga-luad) (Version 4) [Data Set]. Available online: https://www.cancerimagingarchive.net/collection/tcga-luad (accessed on 17 March 2025).
  40. Desai, S.; Baghal, A.; Wongsurawat, T.; Al-Shukri, S.; Gates, K.; Farmer, P.; Rutherford, M.; Blake, G.; Nolan, T.; Powell, T.; et al. Data from Chest Imaging with Clinical and Genomic Correlates Representing a Rural COVID-19 Positive Population [Data Set]. Available online: https://www.cancerimagingarchive.net/collection/covid-19-ar (accessed on 17 March 2025).
  41. Biobank, C. Cancer Moonshot Biobank-Lung Cancer Collection (CMB-LCA) (Version 3) [Dataset]. Available online: https://www.cancerimagingarchive.net/collection/cmb-lca (accessed on 17 March 2025).
  42. Biobank, C. Cancer Moonshot Biobank-Lung Cancer Collection (CMB-CRC) (Version 5) [Dataset]. Available online: https://www.cancerimagingarchive.net/collection/cmb-crc (accessed on 17 March 2025).
  43. Akin, O.; Elnajjar, P.; Heller, M.; Jarosz, R.; Erickson, B.; Kirk, S.; Lee, Y.; Linehan, M.; Gautam, R.; Vikram, R.; et al. The Cancer Genome Atlas Kidney Renal Clear Cell Carcinoma Collection (TCGA-KIRC) (Version 3). Available online: https://www.cancerimagingarchive.net/collection/tcga-kirc (accessed on 17 March 2025).
  44. Biobank, C. Cancer Moonshot Biobank-Lung Cancer Collection (CMB-PCA) (Version 5) [Dataset]. Available online: https://www.cancerimagingarchive.net/collection/cmb-pca (accessed on 17 March 2025).
  45. Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. J. Digit. Imaging 2013, 26, 1045–1057. [Google Scholar]
  46. Roche, C.; Bonaccio, E.; Filippini, J. The Cancer Genome Atlas Sarcoma Collection (TCGA-SARC) (Version 3) [Data Set]. Available online: https://www.cancerimagingarchive.net/collection/tcga-sarc (accessed on 17 March 2025).
  47. Albertina, B.; Watson, M.; Holback, C.; Jarosz, R.; Kirk, S.; Lee, Y.; Lemmerman, J. Radiology Data from the Cancer Genome Atlas Lung Adenocarcinoma [tcga-luad] Collection. 2016. Available online: https://www.cancerimagingarchive.net/collection/tcga-luad (accessed on 17 March 2025).
  48. Holback, C.; Jarosz, R.; Prior, F.; Mutch, D.G.; Bhosale, P.; Garcia, K.; Lee, Y.; Kirk, S.; Sadow, C.A.; Levine, S.; et al. The Cancer Genome Atlas Ovarian CANCER collection (tcga-ov) (Version 4) [Data Set]. Available online: https://www.cancerimagingarchive.net/collection/tcga-ov (accessed on 17 March 2025).
  49. McRobbie, D.W.; Semple, S.; Barnes, A.P. Quality Control and Artefacts in Magnetic Resonance Imaging (Update of IPEM Report 80); Institute of Physics and Engineering in Medicine: York, UK, 2017. [Google Scholar]
  50. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  52. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. Proc. Int. Conf. Mach. Learn. PMLR 2021, 139, 10096–10106. [Google Scholar]
  53. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  54. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  55. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  56. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  57. Dehkordi, A.N.; Koohestani, S. The Influence of Signal to Noise Ratio on the Pharmacokinetic Analysis in DCE-MRI Studies. Front. Biomed. Technol. 2019, 4, 187–196. [Google Scholar]
  58. Parrish, T.B.; Gitelman, D.R.; LaBar, K.S.; Mesulam, M.M. Impact of signal-to-noise on functional MRI. Magn. Reson. Med. Off. J. Int. Soc. Magn. Reson. Med. 2000, 44, 925–932. [Google Scholar]
  59. Song, B.; Tan, W.; Xu, Y.; Yu, T.; Li, W.; Chen, Z.; Yang, R.; Hou, J.; Zhou, Y. 3D-MRI combined with signal-to-noise ratio measurement can improve the diagnostic accuracy and sensitivity in evaluating meniscal healing status after meniscal repair. Knee Surg. Sports Traumatol. Arthrosc. 2019, 27, 177–188. [Google Scholar] [CrossRef]
  60. Rubenstein, J.D.; Li, J.; Majumdar, S.; Henkelman, R. Image resolution and signal-to-noise ratio requirements for MR imaging of degenerative cartilage. AJR Am. J. Roentgenol. 1997, 169, 1089–1096. [Google Scholar] [PubMed]
  61. Yao, L.; Poblenz, E.; Dagunts, D.; Covington, B.; Bernard, D.; Lyman, K. Learning to diagnose from scratch by exploiting dependencies among labels. arXiv 2017, arXiv:1710.10501. [Google Scholar]
  62. Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.; Shpanskaya, K.; et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv 2017, arXiv:1711.05225. [Google Scholar]
  63. Blog, G. AutoML for Large Scale Image Classification and Object Detection. Google Research. 2017. Available online: https://research.google/blog/automl-for-large-scale-image-classification-and-object-detection (accessed on 17 March 2025).
  64. Lu, Z.; Whalen, I.; Dhebar, Y.; Deb, K.; Goodman, E.D.; Banzhaf, W.; Boddeti, V.N. Multiobjective evolutionary design of deep convolutional neural networks for image classification. IEEE Trans. Evol. Comput. 2020, 25, 277–291. [Google Scholar]
  65. Lu, Z.; Deb, K.; Boddeti, V.N. MUXConv: Information multiplexing in convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12044–12053. [Google Scholar]
  66. Ranjan, E.; Paul, S.; Kapoor, S.; Kar, A.; Sethuraman, R.; Sheet, D. Jointly learning convolutional representations to compress radiological images and classify thoracic diseases in the compressed domain. In Proceedings of the 11th Indian Conference on Computer Vision, Graphics and Image Processing, Hyderabad, India, 18–22 December 2018; pp. 1–8. [Google Scholar]
  67. Liang, J.; Meyerson, E.; Hodjat, B.; Fink, D.; Mutch, K.; Miikkulainen, R. Evolutionary neural automl for deep learning. In Proceedings of the Genetic and Evolutionary Computation Conference, Prague, Czech Republic, 13–17 July 2019; pp. 401–409. [Google Scholar]
  68. Zhou, H.Y.; Yu, S.; Bian, C.; Hu, Y.; Ma, K.; Zheng, Y. Comparing to learn: Surpassing imagenet pretraining on radiographs by comparing image representations. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part I 23. Springer: Cham, Switzerland, 2020; pp. 398–407. [Google Scholar]
  69. Bernal, J.; Sánchez, F.J.; Fernández-Esparrach, G.; Gil, D.; Rodríguez, C.; Vilariño, F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med. Imaging Graph. 2015, 43, 99–111. [Google Scholar]
  70. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
Figure 1. MedMAE architecture: The process begins by randomly masking 75% of the input image, with the remaining 25% of visible patches fed into the encoder. The encoder then extracts latent representations and encodes these patches. The decoder’s goal is to reconstruct the full image by utilizing the encoded visible patches and the masked areas. The reconstruction loss function is used to progressively refine the reconstruction in each iteration.
Figure 2. Sample segmentation result from the CVC-ClinicDB dataset.
Figure 3. Image reconstruction using the MAE pre-trained on natural images.
Figure 4. Image reconstruction using our proposed MedMAE.
Table 1. A detailed overview of the various datasets collected to form LUMID, the medical imaging dataset.
Collection | Location | Subjects | Data Types
NIH Chest X-Ray [22] | Frontal Chest | 30,805 | X-Ray
Pseudo-PHI-DICOM-Data [23] | Various | 21 | CR, CT, DX, MG, MR, PT
COVID-19-NY-SBU [24] | Lung | 1384 | CR, CT, DX, MR, PT, NM, OT, SR
CTPred-Sunitinib-panNet [25] | Pancreas | 38 | CT
StageII-Colorectal-CT [26] | Abdomen, Pelvis | 230 | CT
CT Images in COVID-19 [27] | Lung | 661 | CT
MIDRC-RICORD-1a&1b [28] | Lung | 227 | CT
CT-ORG [29] | Bladder, Brain, Kidney, Liver | 140 | CT
Pelvic-Reference-Data [30] | Pelvis, Prostate, Anus | 58 | CT
QIN Lung CT [31] | Lung | 47 | CT
Pancreas CT [32] | Pancreas | 82 | CT
LungCT-Diagnosis [33] | Lung | 61 | CT
NSCLC-Radiomics-Genomics [34] | Lung | 89 | CT
RIDER Lung CT [35] | Chest | 32 | CT, CR, DX
CT Colon ACRIN 6664 [36] | Colon | 825 | CT
LIDC-IDRI [37] | Chest | 1010 | CT, CR, DX
TCGA-BLCA [38] | Bladder | 120 | CT, CR, MR, PT, DX, Pathology
TCGA-UCEC [39] | Uterus | 65 | CT, CR, MR, PT, Pathology
COVID-19-AR [40] | Lung | 105 | CT, DX, CR
CMB-LCA [41] | Lung | 10 | CT, DX, MR, NM, US
CMB-CRC [42] | Colon | 12 | CT, DX, MR, PT, US
TCGA-KIRC [43] | Kidney | 267 | CT, MR, CR, Pathology
CMB-PCA [44] | Prostate | 3 | CT, MR, NM
CPTAC-CCRCC [45] | Kidney | 222 | CT, MR, Pathology
TCGA-SARC [46] | Chest, Abdomen, Pelvis, Leg | 5 | CT, MR, Pathology
TCGA-READ [47] | Rectum | 3 | CT, MR, Pathology
TCGA-OV [48] | Ovary | 143 | CT, MR, Pathology
Table 2. Results of our proposed model, MedMAE, against other models on the CT dataset. The reported accuracy is the accuracy of pre-trained models subjected to linear probing.
Pre-Training | Dataset | Backbone | Accuracy (%) | Sensitivity (%) | AUC
Supervised | IN1k | ResNet | 75.6 | 72.1 | 0.80
Supervised | IN1k | EfficientNet-S | 71.3 | 68.3 | 0.77
Supervised | IN1k | ConvNext-B | 76.8 | 74.1 | 0.82
Supervised | IN1k | ConvNextV2-B | 78.1 | 75.9 | 0.83
Supervised | IN1k | ViT-B | 78.2 | 76.0 | 0.84
Supervised | IN1k | Swin-B | 77.5 | 75.2 | 0.83
Self-supervised | IN1k | MAE | 78.5 | 77.2 | 0.85
Self-supervised | IN1k | DINOv2 | 81.2 | 79.4 | 0.88
Self-supervised | LUMID | MedMAE (Ours) | 90.2 | 88.0 | 0.95
Table 3. Results of our proposed model, MedMAE, against other models on the MRI dataset. The reported accuracy is the accuracy of pre-trained models subjected to linear probing.
Pre-Training | Dataset | Backbone | Accuracy (%) | Sensitivity (%) | AUC
Supervised | IN1k | ResNet | 71.6 | 69.0 | 0.79
Supervised | IN1k | EfficientNet-S | 66.1 | 63.2 | 0.74
Supervised | IN1k | ConvNext-B | 71.9 | 70.4 | 0.80
Supervised | IN1k | ConvNextV2-B | 74.2 | 72.8 | 0.82
Supervised | IN1k | ViT-B | 72.7 | 71.3 | 0.81
Supervised | IN1k | Swin-B | 72.1 | 70.6 | 0.80
Self-supervised | IN1k | MAE | 74.3 | 73.6 | 0.83
Self-supervised | IN1k | DINOv2 | 75.9 | 73.9 | 0.84
Self-supervised | LUMID | MedMAE (Ours) | 85.6 | 84.7 | 0.93
Table 4. Results of our proposed model, MedMAE, against other models on the breast cancer dataset. The reported accuracy is the accuracy of pre-trained models subjected to linear probing.
Pre-Training | Dataset | Backbone | Accuracy (%) | Sensitivity (%) | AUC
Supervised | IN1k | ResNet | 79.9 | 78.8 | 0.86
Supervised | IN1k | EfficientNet-S | 75.1 | 73.3 | 0.82
Supervised | IN1k | ConvNext-B | 83.8 | 82.2 | 0.88
Supervised | IN1k | ConvNextV2-B | 84.5 | 83.8 | 0.91
Supervised | IN1k | ViT-B | 84.0 | 83.4 | 0.89
Supervised | IN1k | Swin-B | 81.3 | 79.1 | 0.87
Self-supervised | IN1k | MAE | 84.3 | 83.5 | 0.90
Self-supervised | IN1k | DINOv2 | 88.1 | 87.8 | 0.93
Self-supervised | LUMID | MedMAE (Ours) | 93.2 | 92.2 | 0.97
Table 5. Results of our proposed model, MedMAE, against other models on the ChestX-ray14 dataset. The reported accuracy is the accuracy of pre-trained models subjected to linear probing.
Pre-Training | Dataset | Backbone | Accuracy (%) | Sensitivity (%) | AUC
Supervised | IN1k | ResNet | 66.4 | 64.8 | 0.74
Supervised | IN1k | EfficientNet-S | 62.9 | 60.0 | 0.71
Supervised | IN1k | ConvNext-B | 67.5 | 65.6 | 0.76
Supervised | IN1k | ConvNextV2-B | 68.0 | 66.1 | 0.78
Supervised | IN1k | ViT-B | 67.8 | 65.9 | 0.77
Supervised | IN1k | Swin-B | 67.1 | 65.4 | 0.75
Self-supervised | IN1k | MAE | 67.9 | 66.7 | 0.78
Self-supervised | IN1k | DINOv2 | 69.1 | 68.4 | 0.80
Self-supervised | LUMID | MedMAE (Ours) | 70.1 | 69.1 | 0.81
Table 6. Results of our proposed model, MedMAE, against state-of-the-art models on the ChestX-ray14 dataset. ‘SSL-Supervised’ means that the model was pre-trained using self-supervised learning; we then fine-tuned the entire model using supervised learning.
Training | Method | Accuracy (%)
Supervised | Wang et al. [22] | 73.8
Supervised | Yao et al. [61] | 79.8
Supervised | CheXNet [62] | 84.4
Supervised | Google AutoML [63] | 79.7
Supervised | NSGANetV1 [64] | 84.7
Supervised | MUXNet-m [65] | 84.1
Supervised | AE-CNN [66] | 82.4
Supervised | LEAF [67] | 84.3
SSL-Supervised | ResNet+MoCo [68] | 81.4
SSL-Supervised | DenseNet+C2L [68] | 84.4
SSL-Supervised | DINOv2 [55] | 83.8
SSL-Supervised | MedMAE (Ours) | 88.0
Table 7. Results of our proposed model, MedMAE, against other models on the CVC-ClinicDB dataset. The reported numbers are the F1-scores of pre-trained models subjected to linear probing.
Pre-Training | Dataset | Backbone | F1-Score (%)
Supervised | IN1k | ResNet | 57.9
Supervised | IN1k | EfficientNet-S | 53.1
Supervised | IN1k | ConvNext-B | 61.8
Supervised | IN1k | ConvNextV2-B | 64.2
Supervised | IN1k | ViT-B | 63.5
Supervised | IN1k | Swin-B | 60.2
Self-supervised | IN1k | MAE | 64.6
Self-supervised | IN1k | DINOv2 | 67.3
Self-supervised | LUMID | MedMAE (Ours) | 71.4
