Article

Predicting Conversion from Mild Cognitive Impairment to Alzheimer’s Disease Using a Vision Transformer and Hippocampal MRI Slices

by René Seiger * and Peter Fierlinger, on behalf of the Alzheimer’s Disease Neuroimaging Initiative
Physics Department, TUM School of Natural Sciences, Technical University of Munich, 85748 Garching, Germany
* Author to whom correspondence should be addressed.
Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: https://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf (accessed on 22 December 2025).
Bioengineering 2026, 13(2), 163; https://doi.org/10.3390/bioengineering13020163
Submission received: 23 December 2025 / Revised: 13 January 2026 / Accepted: 27 January 2026 / Published: 29 January 2026
(This article belongs to the Special Issue AI and Data Analysis in Neurological Disease Management)

Abstract

Convolutional neural networks (CNNs) have long been the standard for computer vision tasks, including applications in Alzheimer’s disease (AD). Recently introduced Vision Transformers (ViTs) have emerged as a strong alternative to CNNs. A common precursor stage of AD is a syndrome called mild cognitive impairment (MCI); however, not all individuals diagnosed with MCI progress to AD. In this exploratory investigation, we aimed to assess whether a ViT can reliably classify converters versus non-converters. A transfer learning approach was used for model training, applying a pretrained ViT model fine-tuned on the ADNI dataset. The cohort comprised 575 individuals (299 stable MCI; 276 progressive MCI who converted within 36 months), from whom axial T1-weighted MRI slices covering the hippocampal region were used as model inputs. Results showed an average area under the receiver operating characteristic curve (AUC-ROC) on the test set of 0.74 ± 0.02 (mean ± SD), an accuracy of 0.69 ± 0.03, a sensitivity of 0.65 ± 0.07, a specificity of 0.72 ± 0.06, and an F1-score for the progressive MCI class of 0.67 ± 0.04. These findings demonstrate that a ViT approach achieves reasonable accuracy for classifying AD converters vs. non-converters, though its generalizability and clinical utility require further validation.

1. Introduction

Convolutional neural networks (CNNs) [1,2] have long served as the main architecture for visual processing tasks and are widely used for classification and prediction based on magnetic resonance imaging (MRI) data. However, transformer models, introduced by Vaswani et al. [3] and built on the attention mechanism [4], have recently revolutionized the field of natural language processing and have also proven valuable for vision tasks involving image data [5]. The Vision Transformer (ViT) architecture offers several advantages over classical CNN approaches, making it a compelling architecture for the medical domain, where MRI and similar imaging techniques are used to assess disease status and prognosis. While CNNs focus primarily on local or adjacent regions of an image, ViTs can model contextualized long-range dependencies on a global scale via the self-attention mechanism. Images are divided into fixed-size patches, which are treated as input tokens, similar to how words are processed in classical transformers for natural language processing. When trained on sufficiently large datasets, ViTs learn more generalized and flexible representations and, unlike CNNs, do not rely on inductive biases such as locality or translation equivariance [5,6].
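To make the tokenization step concrete, the following minimal PyTorch sketch (illustrative only, not taken from the model implementation) cuts a 224 × 224 RGB image into the 196 non-overlapping 16 × 16 patches consumed by a ViT-base/16 model; each flattened patch happens to contain 16 × 16 × 3 = 768 values, which the patch-embedding layer then projects into the 768-dimensional token space.

```python
import torch

img = torch.randn(3, 224, 224)                     # one RGB image (C, H, W)
patches = img.unfold(1, 16, 16).unfold(2, 16, 16)  # -> (3, 14, 14, 16, 16)
tokens = patches.permute(1, 2, 0, 3, 4).reshape(14 * 14, 3 * 16 * 16)
print(tokens.shape)  # torch.Size([196, 768]): 196 patch tokens per image
```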
Reliable and accurate deep learning models are therefore needed, as they could serve as valuable tools to support physicians in the classification and prediction of various diseases, particularly Alzheimer’s disease (AD), where early diagnosis is key. AD is one of the most prevalent conditions among elderly individuals, with estimates indicating that over 30 million people worldwide are affected and numbers are steadily increasing [7]. To date, no cure exists, and its underlying causes remain poorly understood, although research indicates that brain atrophy patterns, which can be detected with MRI, are closely linked to, and likely result from, the pathological accumulation and spread of amyloid-beta and tau proteins [8]. The precursor stage of AD is referred to as mild cognitive impairment (MCI). During this phase, signs of cognitive decline, particularly related to memory, are already present, although activities of daily living remain largely unimpaired [9]. Estimates vary, but approximately 15% of individuals with MCI progress to Alzheimer’s dementia within two years, and about 33% do so within five years; conversely, around 26% of individuals with MCI revert to normal cognition [10]. It has been shown in people with a rare genetic form of AD that the neuropathological changes underlying dementia are present in the brain several years before any clinical symptoms are evident [11]. This relatively long preclinical phase makes early detection using modern deep learning algorithms particularly valuable and promising. Developing accurate classification models to project the cognitive trajectory of individuals with MCI therefore presents a critical opportunity to reduce patient suffering and long-term healthcare costs.
While ViTs have already demonstrated excellent performance in classifying AD patients versus subjects without dementia based on MRI data [12,13] and can be considered a viable alternative to CNNs for this task, studies evaluating their predictive power regarding AD remain scarce. To our knowledge, only one study, by Hoang et al. [14], has explored this question using structural MRI. However, judging from the methodological procedure reported in their investigation, data leakage likely occurred, as previously noted by Valizadeh et al. [15]. The results presented in their work therefore probably represent overly optimistic accuracy metrics rather than an unbiased assessment of a ViT for classifying conversion from MCI to AD. Further research is thus needed to determine whether this ViT approach provides state-of-the-art results comparable to those reported for CNNs; this was the objective of our present study.
To this end, three consecutive axial brain slices covering the hippocampal region were selected as inputs to our model. We focused on this area located in the temporal lobe, as it is one of the earliest brain structures affected by the disease [16] and is strongly associated with the processing of memory-related content and memory function [17]. Thus, by focusing on slices encompassing this region, we aim to capture early, subtle morphological changes that may be indicative of MCI-to-AD conversion. One drawback of the ViT architecture, however, is its need for massive training data [5], which poses a challenge in medical imaging, where available datasets typically include MRI scans from thousands of subjects at best. If not properly addressed, these models are prone to overfitting, memorizing the training data without generalizing to unseen cases. This issue can be mitigated by using pretrained models, which are already trained on large-scale image datasets and can be fine-tuned through a transfer learning approach using a specific dataset relevant to the target task. To account for this, we applied a ViT architecture that was pretrained on millions of images and fine-tuned on a dataset of MRI slices for our specific task of MCI-to-AD classification.
While ViTs and hybrid CNN–transformer models have already been used for investigating MCI-to-AD conversion, our work specifically focuses on strict subject-level leakage control, a slice-based ViT implementation using the hippocampal region, and robust evaluation procedures. Furthermore, we directly compared the output of our ViT model to a 3D ResNet, enabling a transparent assessment of the ViT’s performance.

2. Materials and Methods

2.1. Magnetic Resonance Imaging Data

Data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) were used for this investigation. As stated by ADNI, the initiative was launched in 2003 as a public–private partnership, led by Principal Investigator Michael W. Weiner, MD. The original goal of ADNI was to test whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. Further goals of ADNI include validating biomarkers for clinical trials, improving the generalizability of ADNI data by increasing diversity in the participant cohort, and providing data concerning the diagnosis and progression of AD to the scientific community. For up-to-date information, see adni.loni.usc.edu.

2.2. Demographics and MCI Status

The baseline MRI scan from participants with a diagnosis of MCI in ADNI1-3 and ADNIGO was included in the analysis. Subjects were divided into a stable (sMCI) and a progressive (pMCI) group based on their follow-up assessments: individuals were categorized as stable if they retained MCI status for a minimum of three years and maintained it throughout all available follow-up assessments provided by ADNI, whereas the pMCI group included participants who progressed from MCI at baseline to Alzheimer’s disease (AD) within 36 months. Individuals who transitioned from MCI to AD and subsequently reverted to MCI were excluded, as were participants who converted from MCI to a cognitively normal (CN) status or who reverted from CN back to MCI. In total, data from 575 participants were included, with 299 in the sMCI group and 276 in the pMCI group. Details regarding demographics and clinical assessment scores are provided in Table 1.
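The labeling rules above can be summarized in a short sketch. This is illustrative only: the table layout and the column and label names (RID, M, DX, and the values “MCI”, “Dementia”, “CN”) are assumptions in the style of the ADNIMERGE summary table, not the actual study code.

```python
import pandas as pd

def label_subject(visits: pd.DataFrame) -> str | None:
    """Return 'sMCI', 'pMCI', or None (excluded) for one subject's visits."""
    visits = visits.sort_values("M")           # M: months from baseline
    dx = visits["DX"].tolist()                 # DX: diagnosis at each visit
    months = visits["M"].tolist()
    if dx[0] != "MCI":
        return None                            # MCI at baseline required
    if "CN" in dx:
        return None                            # reverted to normal cognition
    if any(a == "Dementia" and b == "MCI" for a, b in zip(dx, dx[1:])):
        return None                            # converted to AD, then reverted
    for m, d in zip(months, dx):
        if d == "Dementia":                    # first conversion to AD
            return "pMCI" if m <= 36 else None
    if months[-1] >= 36:                       # stayed MCI beyond 36 months
        return "sMCI"
    return None

# labels = {rid: label_subject(g) for rid, g in adnimerge.groupby("RID")}
```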

2.3. MRI Data Preprocessing and Preparation

Structural 3D T1-weighted MRI data provided by ADNI were available either preprocessed (for details see: https://adni.loni.usc.edu/data-samples/adni-data/neuroimaging/mri/mri-pre-processing/ (accessed on 22 December 2025)) or in raw DICOM format. Available DICOM scans were converted to NIfTI format, and all data subsequently underwent the same preprocessing pipeline. The T1-weighted scans were processed using MATLAB (The MathWorks Inc., Natick, MA, USA, version R2023a) and the SPM12 toolbox (FIL, UCL, London, UK, https://www.fil.ion.ucl.ac.uk/spm/software/spm12/, version 7771 (accessed on 22 December 2025)). Structural images underwent bias field correction to account for intensity inhomogeneities. Gray and white matter probability maps, thresholded at 0.5, were then used as masks to extract brain tissue from the bias-corrected volumes. Normalization to MNI standard space was performed via the forward deformation fields estimated during segmentation, resulting in one unified volume with dimensions 157 × 189 × 156 and a voxel size of 1 × 1 × 1 mm³. Three consecutive axial slices of each subject’s 3D volume, centered on the hippocampal area (MNI space from z = −17 to z = −19), were extracted and assigned to the respective channels of a 2D RGB image for input to the model.
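The slice-extraction step might look as follows in Python; a minimal sketch assuming a preprocessed MNI-space NIfTI volume, with an illustrative file name.

```python
import nibabel as nib
import numpy as np

img = nib.load("sub-001_T1w_mni.nii.gz")    # hypothetical preprocessed volume
vol = img.get_fdata()                       # shape: (157, 189, 156), 1 mm iso

# Map the MNI z-coordinates (-17, -18, -19 mm) to voxel indices through the
# inverse affine, then stack the three axial slices as RGB-style channels.
inv_affine = np.linalg.inv(img.affine)
slices = []
for z_mm in (-17, -18, -19):
    _, _, k, _ = inv_affine @ np.array([0.0, 0.0, z_mm, 1.0])
    slices.append(vol[:, :, int(round(k))])
x = np.stack(slices, axis=-1)               # (157, 189, 3): one slice per channel
```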

2.4. Model Architecture—Vision Transformer (ViT)

A ViT architecture, ‘vit_base_patch16_224’ from the TIMM (PyTorch Image Models) library (PyTorch version 2.9.0), was used as the backbone of our experiments. The model was pretrained on ImageNet and natively processes 224 × 224 RGB images by dividing them into 16 × 16 non-overlapping patches. The original model included a classification head consisting of a single linear layer that maps the 768-dimensional transformer output to 1000 output classes, corresponding to the ImageNet dataset. To adapt this model for our task, we modified the classification head by inserting a dropout layer (p = 0.3) before the final linear layer to reduce overfitting, and adjusted the final output to produce a single value for binary classification. The original model architecture is described in [5], while the ViT model with our custom adjustments is depicted in Figure 1.
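In code, this head modification amounts to a few lines; the following sketch uses the stated timm model name and dropout rate, but is an illustration rather than the exact implementation.

```python
import timm
import torch.nn as nn

model = timm.create_model("vit_base_patch16_224", pretrained=True)
# Swap the original 1000-class ImageNet head (Linear(768, 1000)) for a
# dropout layer followed by a single-logit linear layer.
model.head = nn.Sequential(
    nn.Dropout(p=0.3),                       # regularization against overfitting
    nn.Linear(model.head.in_features, 1),    # one logit for sMCI vs. pMCI
)
```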
The three axial MRI slices of each subject were resized to 224 × 224 pixels to match the ViT’s input dimension. To account for intensity heterogeneity across the MRI scans, each subject’s three-slice volume was first normalized to the [0, 1] range using min-max scaling. Subsequently, the data were linearly transformed to the [−1, 1] range by subtracting 0.5 from each value and then dividing by 0.5, a common practice for fine-tuning.
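A possible realization of this input preparation, as a minimal PyTorch sketch under the assumptions above:

```python
import torch
import torch.nn.functional as F

def prepare_input(x: torch.Tensor) -> torch.Tensor:
    """x: float tensor of shape (3, H, W) holding the three axial slices."""
    x = F.interpolate(x.unsqueeze(0), size=(224, 224), mode="bilinear",
                      align_corners=False).squeeze(0)   # resize to 224 x 224
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)      # min-max scale to [0, 1]
    return (x - 0.5) / 0.5                              # linear map to [-1, 1]
```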

2.5. Model Training and Fine-Tuning

The model was trained using PyTorch in a Google Colab environment with an NVIDIA A100 GPU (40 GB memory) and 83.5 GB of system RAM. The data were split into training (70%), validation (15%), and test (15%) sets, while maintaining a constant sMCI/pMCI subject ratio across all splits. As data leakage is a serious issue frequently encountered in deep learning studies [18], each subject’s data were assigned to exactly one of the three splits to prevent overly optimistic, biased metrics. Data augmentation was applied during training, including random rotation (±10°) and horizontal flipping. Training was conducted using binary cross-entropy loss with logits, optimized with the AdamW optimizer (learning rate = 1 × 10⁻⁵, weight decay = 0.05). The sigmoid function was applied to model outputs at inference time to obtain probability scores for binary classification. The model was trained for 30 epochs with a batch size of 64, and the best model was selected based on the epoch with the lowest validation loss. This process was repeated with various hyperparameter configurations as part of model tuning, with the best results achieved using the settings described above. The final performance of the model was evaluated on the independent test set. A classification report was generated using the scikit-learn library, reporting the area under the receiver operating characteristic curve (AUC-ROC), accuracy, sensitivity, specificity, and F1-score.
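A condensed sketch of this training and selection loop is shown below. It assumes `model`, `train_loader`, `val_loader`, and `test_loader` (with strict subject-level splits) are already defined and follows the stated settings; it is illustrative rather than the exact study code.

```python
import copy
import torch
from sklearn.metrics import roc_auc_score

criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)

best_val_loss, best_state = float("inf"), None
for epoch in range(30):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(1), y.float())
        loss.backward()
        optimizer.step()

    model.eval()                     # track validation loss for model selection
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_loader:
            val_loss += criterion(model(x).squeeze(1), y.float()).item()
    if val_loss < best_val_loss:     # keep the epoch with the lowest val loss
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)   # evaluate the selected model on the test set
model.eval()
probs, labels = [], []
with torch.no_grad():
    for x, y in test_loader:
        probs += torch.sigmoid(model(x).squeeze(1)).tolist()
        labels += y.tolist()
print("test AUC-ROC:", roc_auc_score(labels, probs))
```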

2.6. 3D-ResNet Model for Comparison

For a comparative analysis, a 3D residual network (3D ResNet-18) [19,20] was also implemented. The MRI preprocessing was adjusted to output volumes with an in-plane dimension of 128 × 128 and 64 axial slices. To utilize a transfer learning strategy, the 3D ResNet-18 was initialized with weights pretrained on the Kinetics-400 video dataset [21] and subsequently adapted for our binary classification task. The model’s input layer was modified to accept single-channel grayscale MRI volumes. This was achieved by averaging the weights of the original three RGB input channels to create a new single-channel convolutional layer, thereby preserving the learned spatial feature detectors of the pretrained model. Furthermore, the final classifier head was replaced; the new head consists of a dropout layer with a probability of 0.5 for regularization, followed by a single linear layer for binary classification. The model was trained using the same hyperparameters as the ViT to ensure a fair comparison. Data augmentation was performed on the training set using the MONAI library [22], applying random horizontal flips (p = 0.5) and random rotations within a range of ±10 degrees (p = 0.5) to improve model generalization and mitigate overfitting.
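These adaptations might be sketched as follows, assuming the Kinetics-400-pretrained `r3d_18` from torchvision and MONAI’s flip/rotate transforms; the exact axes and transform arguments are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights
from monai.transforms import Compose, RandFlip, RandRotate

model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)

# Collapse the three-channel stem convolution to one grayscale channel by
# averaging the pretrained RGB filter weights, preserving spatial features.
old_conv = model.stem[0]
new_conv = nn.Conv3d(1, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)
with torch.no_grad():
    new_conv.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))
model.stem[0] = new_conv

# Replace the classifier head: dropout (p = 0.5) plus a single-logit layer.
model.fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(model.fc.in_features, 1))

# Training-set augmentation: random flips and rotations within +/- 10 degrees.
train_aug = Compose([
    RandFlip(prob=0.5, spatial_axis=-1),            # flip along the width axis
    RandRotate(range_x=np.deg2rad(10), prob=0.5),   # rotation range in radians
])
```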

3. Results

The comparison between the two groups regarding demographics showed no significant difference in gender distribution (chi-square test, p = 0.58). Although the pMCI group was on average only 1.5 years older than the sMCI group, this age difference reached statistical significance (t-test, p = 0.013). Furthermore, both Mini-Mental State Examination (MMSE) and Clinical Dementia Rating Scale—Sum of Boxes (CDR-SOB) scores differed significantly (p < 0.001) between groups, with the pMCI cohort demonstrating poorer performance, reflected by lower MMSE scores and higher CDR-SOB scores. Detailed results can be found in Table 1.
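For reference, the gender comparison can be reproduced from the counts in Table 1 with SciPy (a sketch; the t-tests on age, MMSE, and CDR-SOB require the per-subject values, which are not reproduced here):

```python
from scipy.stats import chi2_contingency

# Rows: sMCI, pMCI; columns: female, male (counts from Table 1).
chi2, p, dof, expected = chi2_contingency([[120, 179], [118, 158]])
print(f"chi-square p = {p:.4f}")  # ~0.58 with Yates continuity correction
```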
The distribution of conversion times of the pMCI group is depicted in Figure 2.
To obtain robust performance estimates and account for stochastic variability introduced by random weight initialization, data shuffling, and dropout, the model was trained and evaluated across 10 independent runs. On average, the model achieved an AUC-ROC of 0.74 with a standard deviation of 0.02 and a range of 0.72 to 0.78, indicating acceptable discriminative ability between the sMCI and pMCI groups, consistent with commonly used interpretations of AUC values in classification tasks. The mean accuracy was 0.69 (SD = 0.03; range = 0.63–0.74), sensitivity was 0.65 (SD = 0.07; range = 0.55–0.74), specificity was 0.72 (SD = 0.06; range = 0.58–0.78), and the F1-score for the pMCI class was 0.67 (SD = 0.04; range = 0.59–0.72).
Figure 3 shows the distribution of these performance metrics across all runs. The mean ROC and individual ROC curves and corresponding aggregated confusion matrix are displayed in Figure 4.
In comparison, the 3D ResNet-18 model showed an AUC-ROC of 0.75 (SD = 0.01; range = 0.73–0.78). The model’s mean accuracy was 0.69 (SD = 0.02; range = 0.66–0.71), sensitivity was 0.61 (SD = 0.04; range = 0.55–0.64), specificity was 0.76 (SD = 0.02; range = 0.73–0.78), and the F1-score for the pMCI class was 0.65 (SD = 0.03; range = 0.61–0.68).

4. Discussion

In this investigation, we utilized axial T1-weighted MRI brain slices centered on the hippocampal region as inputs to a pretrained ViT model to classify AD converters vs. nonconverters. As the hippocampus is one of the first brain regions affected by the disease, we focused our analysis on this area to capture potential early structural changes. Specifically, we mapped three consecutive slices per subject to the three input channels of the 2D ViT model. However, it should be noted that restricting the input to three hippocampal slices limits the spatial information available compared to full volumetric approaches and may exclude predictive information from other brain regions. In addition, fixed slice selection may introduce sensitivity to slice placement, particularly given that hippocampal subfields are differentially affected across disease stages [23]. Future work could address these limitations by incorporating volumetric representations or adaptive slice-selection strategies. Currently, no pretrained ViT models for 3D medical imaging trained on large-scale datasets comparable to the scale and diversity of ImageNet exist, which would provide more comprehensive representations of volumetric data and potentially deeper insights into disease progression. However, developing such models poses significant computational and data-related challenges. Hence, establishing large-scale 3D pretrained models that can be fine-tuned for specific clinical tasks remains a promising direction. ViTs represent a relatively recent advancement in computer vision and can complement traditional CNN approaches or even serve as a viable alternative, delivering competitive results on tasks related to pathology detection [24]. Despite their advantages, such as modeling global context and long-range dependencies, ViTs are computationally expensive due to their quadratic complexity with respect to input size and large datasets are needed to reach best performance [5].
Here, we focused on the implementation of a standard ViT based solely on MRI input data. Only one prior study has investigated MCI-to-AD conversion using a pre-trained ViT model with mid-sagittal slices [14]; it reported an accuracy of 83% in distinguishing MCI converters from non-converters, higher than our mean accuracy of 69%. However, the processing steps described in their investigation suggest that data from the same subjects were present in the training, validation, and test sets, indicating data leakage. Their reported accuracy metrics should therefore be interpreted with caution, as they are likely overly optimistic and may not generalize well. Several studies employing architectures other than ViTs have investigated the prediction of MCI progression to AD based on structural MRI. CNNs have been utilized in numerous studies [18,25,26,27,28,29,30,31], and models combining CNNs with attention mechanisms have also been explored [32,33,34,35]. Additionally, hybrid models integrating CNNs with transformer-based architectures have been proposed, including Hu et al. [36] with a Swin Transformer, Cao et al. [37] with MobileViT, and Khatri et al. [38]. Some of these studies reported high accuracies, with Ren et al. [33] achieving 87% and Khatri et al. [38] even exceeding 90%. Notably, although Khatri et al. reported high accuracies using ViT-based and hybrid architectures, all models were trained from scratch on a small dataset with limited methodological transparency, rendering these results likely optimistic and not directly comparable to studies employing stricter validation protocols. Given the diversity of methodological approaches, techniques, and architectures across studies, a direct comparison of accuracy values does not reliably reflect actual model performance. More importantly, an algorithm must generalize well to new, unseen data and must rely on sound validation procedures that prevent any form of data leakage. This was systematically tested by Wen et al. [18], who applied various CNN architectures and obtained balanced accuracy values ranging from 0.69 to 0.74 for the sMCI vs. pMCI classification task, comparable to our results. Future research should emphasize transparent validation protocols and explore hybrid architectures that integrate the strengths of both CNNs and transformers. Compared with other studies in the field, the ViT model proposed here falls within the range of reported metrics, although it does not perform as well as the best models discussed. As already mentioned, data leakage, particularly in studies involving slice-level data, is a well-documented issue in deep learning: it occurs when data from the same subject appear in both training and evaluation sets, allowing the model to memorize rather than generalize and thereby inflating performance estimates. While awareness of this problem has grown in recent years, it can still occur, particularly in studies with limited methodological transparency. In our study, strict subject-level separation was therefore maintained during data splitting.
In our investigation, we used only baseline MRI scans to train and evaluate the model. This decision reflects realistic clinical scenarios, where prognosis must be based on a single timepoint rather than longitudinal information. Including additional scans acquired closer to, or even after, conversion in progressive MCI individuals would have risked biasing the model toward easier cases and inflating performance, thereby limiting generalizability. Furthermore, as in almost all related studies, we specified a 36-month window for classifying individuals with MCI as progressive if they converted to AD within this period. Individuals with stable MCI status were only included if their follow-up extended beyond that period and they remained stable in all available assessments. This ensured that, whenever assessment data were available, we did not label individuals as stable who converted shortly after the 36-month window. A strength of this study is the well-balanced dataset, which supports the reliability of reported metrics such as accuracy and F1-score as meaningful indicators of real-world model performance. However, the observed group differences in demographics and clinical scores must be acknowledged as potential sources of bias. Specifically, the pMCI group was slightly older and showed worse baseline cognitive performance. Although the age difference was small, age-related structural changes in the brain may nonetheless contribute to the features learned by the model. Similarly, poorer cognitive performance in the pMCI group might be associated with more pronounced structural alterations already present at baseline, which the model could exploit for classification. These factors should therefore be taken into consideration when interpreting the model’s classification performance. Furthermore, as the ADNI study spans multiple sites and began more than two decades ago, the dataset includes images acquired with different scanners, field strengths, and acquisition protocols. Although this heterogeneity introduces variability, it also reflects real-world clinical diversity and may, in turn, enhance the generalizability of the model to broader populations. Nevertheless, ADNI represents a highly curated research dataset, and the generalizability of the proposed approach to routine clinical settings, different scanner configurations, and more heterogeneous patient populations remains to be demonstrated. In addition, while MCI and AD statuses in ADNI were clinically defined without requiring biomarker confirmation, it should be noted that recent research is adopting a more biologically driven approach, defining Alzheimer’s disease as a continuum based on biomarker evidence [39].
Recent advances in model architectures, such as the combination of CNN and transformer architectures, show promising outcomes and represent a viable future direction, as indicated above. New developments are already emerging, including revived state space models, in particular Mamba [40], which have been already applied to imaging tasks with Vision Mamba [41]. These models are not yet widely utilized in the medical domain but show promising potential also in the area of AD research.
Our comparison between the ViT and the 3D ResNet-18 revealed similar overall performance, with both models achieving nearly identical mean accuracies of 0.69. The 3D ResNet-18 demonstrated slightly higher discriminative ability, with an AUC-ROC of 0.75 compared to the ViT’s 0.74. The most notable distinction, however, was a trade-off between sensitivity and specificity: the ViT achieved higher sensitivity, showing a better ability to correctly identify pMCI cases, whereas the 3D ResNet-18 achieved higher specificity, performing better at correctly classifying sMCI participants.
In this work, we focused solely on MRI features, a choice that provides useful information while remaining more accessible than other modalities, such as PET, where radioactive tracers are required. Combining MRI with other input features, such as genetic and clinical data, can also be seen as a viable approach [42]. Finally, we argue for the establishment of a global, standardized, and diverse test set, which could serve as a benchmark for future studies, allowing fair comparisons and a realistic, unbiased estimation of each model’s performance. This would prevent biased results and limit the reporting of overly optimistic accuracy metrics.

5. Conclusions

Early prediction of whether a person with MCI will convert to AD or remain stable is still one of the greatest challenges in neuroscientific research. Given recent developments in the deep learning field, with rapidly evolving model architectures, detecting early signs of progression, reflected in altered brain patterns on structural MRI, is expected to become increasingly precise. Here, a standard ViT model, pretrained on a large dataset containing millions of images and thought to capture complex and early patterns of brain atrophy, was used. We used three axial MRI slices covering the hippocampal area as input for the classification task. With our pretrained and fine-tuned model, we were able to distinguish sMCI from pMCI individuals with moderate accuracy. Our results were not as strong as those reported in some other studies investigating MCI-to-AD conversion with neural networks. However, our direct comparison with the 3D ResNet-18 revealed nearly identical accuracy and suggests that ViTs are a promising direction for future investigations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/bioengineering13020163/s1.

Author Contributions

Conceptualization, R.S. and P.F.; methodology, R.S.; software, R.S.; resources, P.F.; data curation, R.S.; writing—original draft preparation, R.S.; writing—review and editing, R.S.; visualization, R.S.; funding acquisition, R.S. All authors have read and agreed to the published version of the manuscript.

Funding

R. Seiger gratefully acknowledges the financial support from the Society for Research Promotion Lower Austria through an Excellence Scholarship for Research (ExzF-0005), which was instrumental in conducting this study. Data collection and sharing for the Alzheimer’s Disease Neuroimaging Initiative (ADNI) is funded by the National Institute on Aging (National Institutes of Health Grant U19AG024904). The grantee organization is the Northern California Institute for Research and Education. In the past, ADNI has also received funding from the National Institute of Biomedical Imaging and Bioengineering, the Canadian Institutes of Health Research, and private sector contributions through the Foundation for the National Institutes of Health (FNIH) including generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd. and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics.

Institutional Review Board Statement

All data used in this study were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The original data collection procedures and measurements of the ADNI study were conducted in accordance with the Declaration of Helsinki and approved by the institutional review boards of all participating institutions (for the complete list see Supplementary Materials). The present study involved secondary analysis of fully de-identified, publicly available data and therefore did not require additional institutional review board approval.

Informed Consent Statement

Written informed consent was obtained from all participants as part of their enrollment in the ADNI study.

Data Availability Statement

Data available upon reasonable request.

Acknowledgments

Permission has been granted by the ADNI consortium to access the database. We acknowledge the support of large language models (ChatGPT 5.2) for proofreading, stylistic refinements and code assistance.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AD: Alzheimer’s Disease
AUC-ROC: Area Under the Receiver Operating Characteristic Curve
CDR-SOB: Clinical Dementia Rating—Sum of Boxes
CN: Cognitively Normal
CNN: Convolutional Neural Network
DICOM: Digital Imaging and Communications in Medicine
MCI: Mild Cognitive Impairment
MLP: Multilayer Perceptron
MMSE: Mini-Mental State Examination
MNI: Montreal Neurological Institute
MRI: Magnetic Resonance Imaging
NIfTI: Neuroimaging Informatics Technology Initiative
PET: Positron Emission Tomography
pMCI: Progressive Mild Cognitive Impairment
sMCI: Stable Mild Cognitive Impairment
SPM: Statistical Parametric Mapping
TIMM: PyTorch Image Models
ViT: Vision Transformer

References

1. Fukushima, K. Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biol. Cybern. 1980, 36, 193–202.
2. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017.
4. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
6. Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
7. Gustavsson, A.; Norton, N.; Fast, T.; Frölich, L.; Georges, J.; Holzapfel, D.; Kirabali, T.; Krolak-Salmon, P.; Rossini, P.M.; Ferretti, M.T.; et al. Global Estimates on the Number of Persons across the Alzheimer’s Disease Continuum. Alzheimer’s Dement. 2023, 19, 658–670.
8. Bloom, G.S. Amyloid-β and Tau: The Trigger and Bullet in Alzheimer Disease Pathogenesis. JAMA Neurol. 2014, 71, 505–508.
9. Gauthier, S.; Reisberg, B.; Zaudig, M.; Petersen, R.C.; Ritchie, K.; Broich, K.; Belleville, S.; Brodaty, H.; Bennett, D.; Chertkow, H.; et al. Mild Cognitive Impairment. Lancet 2006, 367, 1262–1270.
10. Alzheimer’s Association. 2024 Alzheimer’s Disease Facts and Figures; Alzheimer’s Association: Chicago, IL, USA, 2024; pp. 3708–3821.
11. Gordon, B.A.; Blazey, T.M.; Su, Y.; Hari-Raj, A.; Dincer, A.; Flores, S.; Christensen, J.; McDade, E.; Wang, G.; Xiong, C.; et al. Spatial Patterns of Neuroimaging Biomarker Change in Individuals from Families with Autosomal Dominant Alzheimer’s Disease: A Longitudinal Study. Lancet Neurol. 2018, 17, 241–250.
12. Bravo-Ortiz, M.A.; Holguin-Garcia, S.A.; Quiñones-Arredondo, S.; Mora-Rubio, A.; Guevara-Navarro, E.; Arteaga-Arteaga, H.B.; Ruz, G.A.; Tabares-Soto, R. A Systematic Review of Vision Transformers and Convolutional Neural Networks for Alzheimer’s Disease Classification Using 3D MRI Images. Neural Comput. Appl. 2024, 36, 21985–22012.
13. Mubonanyikuzo, V.; Yan, H.; Komolafe, T.E.; Zhou, L.; Wu, T.; Wang, N. Detection of Alzheimer Disease in Neuroimages Using Vision Transformers: Systematic Review and Meta-Analysis. J. Med. Internet Res. 2025, 27, e62647.
14. Hoang, G.M.; Kim, U.-H.; Kim, J.G. Vision Transformers for the Prediction of Mild Cognitive Impairment to Alzheimer’s Disease Progression Using Mid-Sagittal sMRI. Front. Aging Neurosci. 2023, 15, 1102869.
15. Valizadeh, G.; Elahi, R.; Hasankhani, Z.; Rad, H.S.; Shalbaf, A. Deep Learning Approaches for Early Prediction of Conversion from MCI to AD Using MRI and Clinical Data: A Systematic Review. Arch. Comput. Methods Eng. 2025, 32, 1229–1298.
16. Braak, H.; Braak, E. Neuropathological Stageing of Alzheimer-Related Changes. Acta Neuropathol. 1991, 82, 239–259.
17. Scoville, W.B.; Milner, B. Loss of Recent Memory After Bilateral Hippocampal Lesions. J. Neurol. Neurosurg. Psychiatry 1957, 20, 11–21.
18. Wen, J.; Thibeau-Sutre, E.; Diaz-Melo, M.; Samper-González, J.; Routier, A.; Bottani, S.; Dormont, D.; Durrleman, S.; Burgos, N.; Colliot, O. Convolutional Neural Networks for Classification of Alzheimer’s Disease: Overview and Reproducible Evaluation. Med. Image Anal. 2020, 63, 101694.
19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
20. Hara, K.; Kataoka, H.; Satoh, Y. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
21. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The Kinetics Human Action Video Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
22. Cardoso, M.J.; Li, W.; Brown, R.; Ma, N.; Kerfoot, E.; Wang, Y.; Murrey, B.; Myronenko, A.; Zhao, C.; Yang, D.; et al. MONAI: An Open-Source Framework for Deep Learning in Healthcare. arXiv 2022, arXiv:2211.02701.
23. De Flores, R.; La Joie, R.; Chételat, G. Structural Imaging of Hippocampal Subfields in Healthy Aging and Alzheimer’s Disease. Neuroscience 2015, 309, 29–50.
24. Deininger, L.; Stimpel, B.; Yuce, A.; Abbasi-Sureshjani, S.; Schönenberger, S.; Ocampo, P.; Korski, K.; Gaire, F. A Comparative Study between Vision Transformers and CNNs in Digital Pathology. arXiv 2022, arXiv:2206.00389.
25. Bae, J.; Stocks, J.; Heywood, A.; Jung, Y.; Jenkins, L.; Hill, V.; Katsaggelos, A.; Popuri, K.; Rosen, H.; Beg, M.F.; et al. Transfer Learning for Predicting Conversion from Mild Cognitive Impairment to Dementia of Alzheimer’s Type Based on a Three-Dimensional Convolutional Neural Network. Neurobiol. Aging 2021, 99, 53–64.
26. Oh, K.; Chung, Y.-C.; Kim, K.W.; Kim, W.-S.; Oh, I.-S. Classification and Visualization of Alzheimer’s Disease Using Volumetric Convolutional Neural Network and Transfer Learning. Sci. Rep. 2019, 9, 18150.
27. Lu, P.; Hu, L.; Zhang, N.; Liang, H.; Tian, T.; Lu, L. A Two-Stage Model for Predicting Mild Cognitive Impairment to Alzheimer’s Disease Conversion. Front. Aging Neurosci. 2022, 14, 826622.
28. Ashtari-Majlan, M.; Seifi, A.; Dehshibi, M.M. A Multi-Stream Convolutional Neural Network for Classification of Progressive MCI in Alzheimer’s Disease Using Structural MRI Images. IEEE J. Biomed. Health Inform. 2022, 26, 3918–3926.
29. Bron, E.E.; Klein, S.; Papma, J.M.; Jiskoot, L.C.; Venkatraghavan, V.; Linders, J.; Aalten, P.; De Deyn, P.P.; Biessels, G.J.; Claassen, J.A.H.R.; et al. Cross-Cohort Generalizability of Deep and Conventional Machine Learning for MRI-Based Diagnosis and Prediction of Alzheimer’s Disease. NeuroImage Clin. 2021, 31, 102712.
30. Basaia, S.; Agosta, F.; Wagner, L.; Canu, E.; Magnani, G.; Santangelo, R.; Filippi, M. Automated Classification of Alzheimer’s Disease and Mild Cognitive Impairment Using a Single MRI and Deep Neural Networks. NeuroImage Clin. 2019, 21, 101645.
31. Abrol, A.; Bhattarai, M.; Fedorov, A.; Du, Y.; Plis, S.; Calhoun, V. Deep Residual Learning for Neuroimaging: An Application to Predict Progression to Alzheimer’s Disease. J. Neurosci. Methods 2020, 339, 108701.
32. Zheng, B.; Gao, A.; Huang, X.; Li, Y.; Liang, D.; Long, X. A Modified 3D EfficientNet for the Classification of Alzheimer’s Disease Using Structural Magnetic Resonance Images. IET Image Process. 2023, 17, 77–87.
33. Ren, F.; Yang, C.; Nanehkaran, Y.A. MRI-Based Model for MCI Conversion Using Deep Zero-Shot Transfer Learning. J. Supercomput. 2023, 79, 1182–1200.
34. Zhang, J.; Zheng, B.; Gao, A.; Feng, X.; Liang, D.; Long, X. A 3D Densely Connected Convolution Neural Network with Connection-Wise Attention Mechanism for Alzheimer’s Disease Classification. Magn. Reson. Imaging 2021, 78, 119–126.
35. Zhu, W.; Sun, L.; Huang, J.; Han, L.; Zhang, D. Dual Attention Multi-Instance Deep Learning for Alzheimer’s Disease Diagnosis with Structural MRI. IEEE Trans. Med. Imaging 2021, 40, 2354–2366.
36. Hu, Z.; Wang, Z.; Jin, Y.; Hou, W. VGG-TSwinformer: Transformer-Based Deep Learning Model for Early Alzheimer’s Disease Prediction. Comput. Methods Programs Biomed. 2023, 229, 107291.
37. Cao, G.; Zhang, M.; Wang, Y.; Zhang, J.; Han, Y.; Xu, X.; Huang, J.; Kang, G. End-to-End Automatic Pathology Localization for Alzheimer’s Disease Diagnosis Using Structural MRI. Comput. Biol. Med. 2023, 163, 107110.
38. Khatri, U.; Shin, S.; Kwon, G.-R. Convolution Driven Vision Transformer for the Prediction of Mild Cognitive Impairment to Alzheimer’s Disease Progression. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–7.
39. Jack, C.R.; Bennett, D.A.; Blennow, K.; Carrillo, M.C.; Dunn, B.; Haeberlein, S.B.; Holtzman, D.M.; Jagust, W.; Jessen, F.; Karlawish, J.; et al. NIA-AA Research Framework: Toward a Biological Definition of Alzheimer’s Disease. Alzheimer’s Dement. 2018, 14, 535–562.
40. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024.
41. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417.
42. Spasov, S.; Passamonti, L.; Duggento, A.; Liò, P.; Toschi, N. A Parameter-Efficient Deep Learning Approach to Predict Conversion from Mild Cognitive Impairment to Alzheimer’s Disease. NeuroImage 2019, 189, 276–287.
Figure 1. Model architecture featuring a customized MLP (multilayer perceptron) head with a dropout layer, adapted for binary classification. The backbone consists of 12 transformer encoder blocks and 12 multi-head attention modules, as in the pretrained ViT-base model. For illustration purposes, only 9 image patches are shown; the actual model processes 196 input patches. Architecture adapted from [5].
Figure 2. Overview of the distribution of conversion times in the pMCI group. Assessment frequency and intervals varied slightly between participants and across the different ADNI studies, and not all assessment periods were available in every study.
Figure 3. Distribution of model performance metrics across 10 runs on the test set (area under the curve (AUC), accuracy, sensitivity, specificity, and F1-score).
Figure 4. (a) Mean Receiver Operating Characteristic (ROC) curve (orange), including all 10 individual runs on the test dataset. Colored curves represent individual ROC curves from each run, illustrating performance variability. (b) Confusion matrix showing aggregated results from 10 independent model runs on the test dataset. Total counts of true and predicted labels for the sMCI and pMCI classes are summarized.
Table 1. Demographic information and clinical assessment scale scores. MMSE: Mini-Mental State Examination; CDR-SOB: Clinical Dementia Rating—Sum of Boxes.

Characteristic                Stable MCI              Progressive MCI         p-Value
n                             299                     276
Female (n, %)                 120 (40.1%)             118 (42.8%)             0.5806
Male (n, %)                   179 (59.9%)             158 (57.2%)
Age (Mean ± SD)               72.59 ± 7.48            74.09 ± 7.05            0.0132
MMSE (Mean ± SD)              27.99 ± 1.69            26.75 ± 1.80            <0.001
CDR-SOB (Mean ± SD, Range)    1.20 ± 0.66 (0.5–3.5)   1.99 ± 0.96 (0.5–5.0)   <0.001