Article

Vision Transformer-Based Identification for Early Alzheimer’s Disease and Mild Cognitive Impairment

1 School of Mechanical Engineering, Guiyang University, Guiyang 550005, China
2 Guizhou Provincial Key Laboratory for Digital Protection, Development and Utilization of Cultural Heritage, Guiyang University, Guiyang 550002, China
3 Key Laboratory of Advanced Manufacturing Technology of the Ministry of Education, Guizhou University, Guiyang 550025, China
4 College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
* Authors to whom correspondence should be addressed.
Information 2026, 17(2), 129; https://doi.org/10.3390/info17020129
Submission received: 29 November 2025 / Revised: 18 January 2026 / Accepted: 26 January 2026 / Published: 30 January 2026
(This article belongs to the Special Issue Advances in Human–Robot Interactions and Assistive Applications)

Abstract

Distinguishing Alzheimer’s Disease (AD) from Mild Cognitive Impairment (MCI) is challenging due to their subtle morphological similarities in MRI, yet distinct therapeutic strategies are required. To assist junior clinicians with limited diagnostic experience, this paper proposes Vi-ADiM, a Vision Transformer framework designed for the early differentiation of AD and MCI. Leveraging cross-domain feature adaptation and task-specific data augmentation, the model ensures rapid convergence and robust generalization even in data-limited regimes. By optimizing a two-stage encoding module, Vi-ADiM efficiently extracts both global and local MRI features. Furthermore, by integrating SHAP and Grad-CAM++, the framework offers multi-granular interpretability of pathological regions, providing intuitive visual evidence for clinical decision-making. Experimental results demonstrate that Vi-ADiM outperforms the standard ViT-Base/16, improving accuracy, precision, recall, and F1 score by 0.444%, 0.486%, 0.476%, and 0.482%, respectively, while reducing standard deviations by approximately 0.06–0.29%. Notably, the model achieves these gains with a 48.96% reduction in parameters and a 49.65% decrease in computational cost (FLOPs), offering a reliable, efficient, and interpretable solution for computer-aided diagnosis.

1. Introduction

Distinguishing Mild Cognitive Impairment (MCI) from early-stage Alzheimer’s Disease (AD) is a critical yet challenging task in neurology. While MCI serves as a prodromal phase bridging normal aging and dementia, its neuroimaging patterns exhibit substantial morphological overlap with those of early AD [1,2]. Crucially, this prodromal phase represents the optimal "early prediction" window for therapeutic intervention, as it precedes irreversible neurodegeneration. Epidemiological evidence indicates that approximately 30% to 50% of individuals with MCI will progress to dementia within 5 to 10 years. Consequently, establishing precise predictive capabilities at this specific temporal juncture is paramount for shifting the clinical paradigm from reactive management to preemptive intervention, thereby securing a vital lead time to slow disease progression. However, current clinical protocols rely heavily on manual interpretation of MRI scans, which is prone to inter-observer variability, particularly among less experienced practitioners. Consequently, there is an urgent need for automated decision-support systems that can precisely discern subtle structural variations, such as fine-grained hippocampal atrophy, that differentiate MCI from AD.
Although deep learning has revolutionized medical image analysis [3,4,5], Convolutional Neural Networks (CNNs) often struggle to meet these diagnostic demands. While CNNs dominate medical imaging tasks owing to their robust local feature extraction [6,7,8,9,10], they inherently lack the capacity to model long-range dependencies. This deficiency is particularly pronounced when processing low-resolution inputs, complicating the detection of subtle biomarkers critical for diagnosing early AD and MCI. Alternatively, Vision Transformers (ViT) have emerged as a powerful paradigm for global context modeling. Nevertheless, deploying them on small-sample medical datasets remains non-trivial due to inherent overfitting risks and high computational demands. These bottlenecks currently constrain the reliability and clinical utility of diagnostic frameworks for early AD and MCI.
To address these limitations, this work presents Vi-ADiM, a computer-aided diagnosis framework that assists junior clinicians in detecting early AD and MCI. Specifically, we devise a cross-domain feature adaptation strategy that leverages knowledge from ImageNet-pretrained networks to enhance representation learning on data-scarce MRI sets, thereby overcoming the scarcity of labeled medical data. Concurrently, a task-driven data augmentation mechanism is developed to synthesize high-fidelity samples embedded with pathological priors, effectively curbing overfitting. Crucially, to synergize global context modeling and local feature extraction, we introduce a two-stage encoder optimization strategy that enhances diagnostic precision while significantly lowering computational overhead. To ensure clinical transparency and trust, the framework incorporates Shapley Additive Explanations (SHAP) [11] and Grad-CAM++ to provide global and local visual explanations, highlighting salient pathological regions to facilitate reliable clinical decision-making.
The main contributions of this paper are as follows:
(1)
We present Vi-ADiM, a framework that mitigates data scarcity by integrating cross-domain feature adaptation and task-driven augmentation. This approach ensures robust generalization on small-scale MRI datasets and defines a data-efficient paradigm for automated diagnosis.
(2)
We devise a Two-Stage Encoding Optimization Strategy to reconfigure the Transformer architecture, effectively mitigating parameter redundancy. By reducing encoder depth and implementing a dual-optimizer schedule, transitioning from SGDM to AdamW, our method balances model complexity against limited data availability to prevent overfitting.
(3)
We introduce a Global-Local interpretability mechanism that combines Grad-CAM++ and SHAP, which validates that our model’s decision logic aligns with distinct pathological biomarkers, thereby fostering clinical trust.
(4)
Our structural optimizations strike an optimal balance between efficiency and performance. Compared to the baseline, the proposed method reduces the number of parameters by 49.0% and FLOPs by 49.7% while maintaining superior diagnostic precision, making Vi-ADiM highly suitable for resource-constrained clinical deployment.
The rest of this paper is structured as follows: Section 2 reviews related work. Section 3 details the proposed Vi-ADiM framework for early AD and MCI diagnosis. Section 4 reports extensive experimental results. Section 5 analyzes the findings and limitations, followed by the Conclusion in Section 6.

2. Related Work

This section scrutinizes the prevailing methodologies for early diagnosis of AD and MCI. We categorize the related literature into four distinct streams: neuroimaging-based computer-aided diagnosis (CAD), strategies for mitigating data scarcity in medical model training, Transformer-driven architectures for identifying cognitive decline, and the integration of explainable AI (XAI) into clinical decision-making [12].

2.1. Medical Imaging-Based Auxiliary Diagnostic Methods

In medical image analysis, AI-driven diagnostic tools [13,14] provide critical support to clinicians, particularly novice practitioners. Deploying such intelligent detection systems substantially enhances diagnostic efficiency and precision. For instance, Liu et al. [15] devised a ResNet34-based cascade framework for classifying early small pulmonary nodules. Departing from traditional binary or quaternary schemes, their method distinguishes six nodule categories. Concurrently, He et al. [16] proposed HCTNet, a hybrid CNN–Transformer architecture, to precisely segment breast lesions in ultrasound scans, thereby bolstering diagnostic accuracy for radiologists. Additionally, Al-Fahdawi et al. [17] developed Fundus-DeepNet, an automated multi-label system that identifies diverse ocular pathologies by fusing feature representations from binocular fundus images. By leveraging automated feature extraction, these approaches effectively mitigate diagnostic errors stemming from varying levels of clinical expertise.
Moreover, deep learning-based diagnostic frameworks targeting neurodegenerative disorders, particularly those leveraging CNNs and Transformer paradigms [18,19,20,21,22], excel at extracting salient pathological features. Such systems offer objective, data-driven decision support, proving especially valuable for assisting junior clinicians. Nevertheless, substantial impediments persist, notably a heavy reliance on annotated training data and inherent opacity in model reasoning, which constrain their deployment in real-world clinical scenarios.

2.2. Challenges in Training Models with Small Sample Medical Data

Patient privacy concerns and high acquisition costs often hinder access to clinical data. Furthermore, the stringent requirement for high-quality expert annotation exacerbates the scarcity of available samples. While deep learning typically requires extensive datasets to ensure generalization, the scarcity of medical data often leads to overfitting, compromising model robustness. Recent advancements have focused on optimizing the efficacy of deep learning in small-sample regimes. For example, He et al. [23] employed a data augmentation method based on statistical deformation models, generating realistic variations that surpass the accuracy limitations of traditional techniques. Xu et al. [24] proposed an attention-guided cross-domain tumor image generation model incorporating an information enhancement strategy (CDA-GAN); this approach synthesizes diverse samples to augment the dataset volume and improve diagnostic utility. Transfer learning has also gained prominence, improving performance on limited medical imaging benchmarks by leveraging feature representations from large-scale datasets such as ImageNet [25,26]. For example, Alzubaidi et al. [27] introduced a strategy in which deep learning models are pre-trained on a large, unlabeled medical image dataset before transferring knowledge to smaller, labeled cohorts. Lai et al. [28] integrated transfer learning with the Vision Mamba architecture for brain tumor classification, attaining superior accuracy despite limited annotations.

2.3. Transformer-Based Diagnosis of Alzheimer’s Disease and Mild Cognitive Impairment

While deep learning efficacy hinges on high-quality data, the suitability of the model architecture is equally decisive. In particular, CNNs have been widely used for neurodegenerative disease research. For instance, Ieracitano et al. [29] developed a specialized EEG-CNN to categorize AD, MCI, and HC using fixed-duration EEG epochs. Similarly, Luo et al. [30] designed a dual-attention network fusing MRI and neurocognitive metadata to discriminate between progressive (pMCI) and stable MCI (sMCI). However, differentiating the subtle structural variations between MCI and early AD in MRI remains non-trivial. Furthermore, successive downsampling in deep CNNs reduces feature map resolution, impeding the capture of fine-grained textural patterns essential for early diagnosis.
Capitalizing on the Transformer’s aptitude for capturing long-range dependencies, recent studies have increasingly deployed this architecture in neurodegenerative disease diagnosis. Liu et al. [31] proposed TriFormer, a multimodal framework that leverages three distinct Transformers to distinguish between stable and progressive MCI. Concurrently, Hu et al. [32] engineered VGG-TSwinformer, a hybrid CNN–Transformer architecture optimized for early AD detection via short-term longitudinal analysis. To mitigate computational costs, Khatri and Kwon [33] embedded convolutional attention within a Transformer classifier, incorporating lightweight multi-head self-attention (LMHSA), inverted residual units (IRU), and local feed-forward networks (LFFN). Furthermore, Chen et al. [34] established a multimodal fusion paradigm (MMDF) that incorporates a multi-scale attention-based 1D-CNN (MA-1DCNN) to process clinical records for AD profiling. Despite the performance gains achieved by Transformers in classifying pathologies such as AD and MCI, the architecture’s inherent complexity often complicates training regimes, particularly within the constraints of small-scale medical datasets.

2.4. Application of Interpretability Research in Medical Diagnosis

Nevertheless, in computer-aided diagnosis, while deep learning yields superior performance, its inherent “black box” nature remains a primary concern for clinicians. This opacity hinders the understanding of the decision-making logic behind predictions. Given the high-stakes nature of healthcare, models require robust interpretability to foster clinical trust. Consequently, interpretability research in medical imaging has advanced significantly. Approaches such as Grad-CAM++, SHAP, and their variants are widely used to elucidate predictive mechanisms, providing intuitive visualizations of feature contributions [35,36,37,38,39].
In AD and MCI diagnosis, these techniques facilitate the identification of key pathological regions in MRI scans, offering visualization support from both global and local perspectives. For instance, Gelir et al. [40] leveraged SHAP to pinpoint critical features associated with AD progression, evaluating eight classification methods on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort. Similarly, Yi et al. [41] integrated SHAP with XGBoost to develop XGBoost-SHAP, an interpretable framework that mitigates class imbalance across progression states. Furthermore, Zhu et al. [42] proposed a personalized diagnostic approach integrating interpretable feature learning and dynamic graph learning within a Graph Convolutional Network (GCN). Li et al. [43] introduced FSNet, a dual-interpretable GCN performing simultaneous feature and sample selection to enhance both diagnostic accuracy and model interpretability. Zhu et al. [44] developed the Deep Multimodal Discriminative and Interpretability Network (DMDIN), which aligns samples within a discriminative common space to identify key Regions of Interest (ROIs). They utilized knowledge distillation to reconstruct coordinated representations, capturing regional impacts on AD classification. Collectively, these studies demonstrate that incorporating interpretability analysis not only bolsters clinical confidence but also significantly lowers misdiagnosis rates.
Addressing these challenges, we propose a ViT-based framework for early AD and MCI detection. We optimize performance on data-scarce datasets via cross-domain feature adaptation and task-driven data augmentation. By strategically configuring the two-stage encoding module, we capture complex global-local feature interactions with minimal computational cost. To ensure clinical trustworthiness, we integrate SHAP and Grad-CAM++ for comprehensive global and local interpretability. Finally, we explore Human-AI collaboration, augmenting junior clinicians’ expertise with our model to maximize diagnostic precision and efficiency.

3. Proposed Method

Discriminating between AD and MCI at an early stage remains clinically formidable due to subtle and diffuse structural anomalies within the brain MRI. To navigate this in data-scarce regimes, we present Vi-ADiM, a lightweight, interpretable auxiliary diagnostic framework that leverages ViT [45]. As depicted in Figure 1, Vi-ADiM incorporates four pivotal modules: (1) cross-domain feature adaptation to alleviate data scarcity; (2) task-driven medical image augmentation to bolster robustness; (3) a specialized shallow ViT architecture optimized for early AD/MCI discrimination; and (4) an explainability module integrating Grad-CAM++ and SHAP to foster clinical confidence. The framework is fundamentally designed to harmonize global contextual modeling with local pathological sensitivity, a prerequisite for identifying incipient neurodegenerative patterns.

3.1. Cross-Domain Feature Adaptation and Task-Driven Data Augmentation Strategy

3.1.1. Cross-Domain Feature Adaptation

Constrained by the scarcity of annotated medical samples, training deep models from scratch often leads to overfitting. To circumvent this, we leverage domain adaptation theory to transfer intrinsic feature correlations, such as texture and structural symmetry, from natural images (Source Domain) to MRI scans (Target Domain). This process aims to minimize the distribution shift between domains, as shown in Figure 2 and Equation (1).
$$d_{\mathcal{H}\Delta\mathcal{H}}(p_S, p_T) = 2 \sup_{h, h' \in \mathcal{H}} \left| \Pr_S[h \neq h'] - \Pr_T[h \neq h'] \right|$$
Here, $\Pr_S[h \neq h']$ and $\Pr_T[h \neq h']$ denote the probabilities, under the source and target distributions respectively, that two hypotheses $h, h' \in \mathcal{H}$ disagree. By mathematically aligning the source and target distributions, the model efficiently transfers robust feature extractors trained on 14 million natural images (ImageNet) to the MRI domain, significantly boosting diagnostic accuracy despite limited medical data.
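As an illustration of this cross-domain initialization, the following minimal PyTorch sketch loads an ImageNet-pretrained ViT-B/16 backbone and replaces its classification head for the three-class MRI task. It assumes a recent torchvision release and is not the authors' released code.

```python
# Minimal transfer-learning sketch (assumption: recent torchvision, not the
# authors' released code): initialize a ViT-B/16 backbone with ImageNet-
# pretrained weights (source domain) and attach a 3-class head for the MRI
# target domain (AD / MCI / NC).
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

NUM_CLASSES = 3  # AD, MCI, NC

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet head with a single linear classification layer.
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)

# Keep all backbone weights trainable so that source-domain features
# (texture, structural symmetry) are adapted, not frozen, on MRI data.
for p in model.parameters():
    p.requires_grad = True
```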

3.1.2. Task-Driven Data Augmentation Strategy

Distinguishing early AD from MCI relies on detecting imperceptible structural variations. While ViTs excel at global representation, their lack of inductive bias necessitates extensive training data to prevent overfitting. However, medical datasets are inherently restricted by privacy protocols and high acquisition costs. To bridge this gap, we introduce a task-driven data augmentation strategy, as shown in Table 1. Unlike aggressive augmentations that risk distorting pathological semantics, this protocol simulates realistic clinical artifacts, such as patient head movement (translation/rotation) and scanner variability (noise/scaling), to synthesize high-fidelity MRI samples. This approach ensures that the augmented data remains biologically plausible, thereby significantly enhancing model robustness against heterogeneous acquisition protocols.
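The sketch below illustrates how such a protocol could be assembled with torchvision transforms; the transform set mirrors the operations described above (rotation, translation, scaling, cropping, noise, color jittering), but the parameter ranges are illustrative placeholders rather than the values in Table 1.

```python
# Illustrative task-driven augmentation pipeline: small affine perturbations
# emulate patient head movement, Gaussian noise and mild scaling emulate
# scanner variability. Parameter ranges are placeholders, not Table 1 values.
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Additive Gaussian noise emulating scanner variability."""
    def __init__(self, std=0.01):
        self.std = std

    def __call__(self, tensor):
        return tensor + torch.randn_like(tensor) * self.std

train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomAffine(degrees=10,              # small rotation
                            translate=(0.05, 0.05),  # slight translation
                            scale=(0.95, 1.05)),     # mild scaling
    transforms.RandomCrop(224, padding=8),           # random cropping
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    AddGaussianNoise(std=0.01),                      # noise injection
])
```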

3.2. Design of the AD and MCI Auxiliary Diagnosis Model

Building upon the aforementioned cross-domain adaptation and task-driven augmentation strategies, we design the Vi-ADiM framework to specifically capture fine-grained structural features critical for early AD and MCI diagnosis. Based on the ViT architecture, which excels at modeling long-range dependencies and global contextual relationships, the framework’s workflow is illustrated in Figure 3.
The processing pipeline begins by partitioning the 2D MRI input $x \in \mathbb{R}^{H \times W \times C}$ into a sequence of fixed-size patches. Specifically, the image is divided into $N = HW/P^2$ patches, each with a resolution of $P \times P$. Each patch is then flattened and mapped into a D-dimensional latent feature space via a trainable linear projection $E \in \mathbb{R}^{(P^2 C) \times D}$. To prevent the loss of spatial structural information, which is vital for locating anatomical regions, a standard 1D positional encoding $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is added to these patch embeddings, as defined in Equation (2).
$$z_0 = [\,x_{\mathrm{class}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E\,] + E_{pos}, \qquad E \in \mathbb{R}^{(P^2 C) \times D}, \quad E_{pos} \in \mathbb{R}^{(N+1) \times D}$$
Here, $(H, W)$ denotes the input image resolution, $C$ represents the number of channels, $(P, P)$ is the resolution of each patch, $N = HW/P^2$ is the total number of patches, and $x_{\mathrm{class}}$ represents the [class] token.
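A minimal PyTorch sketch of this patch-embedding step is given below; it follows Equation (2) with ViT-Base dimensions (P = 16, D = 768) and is an illustrative reconstruction rather than the authors' implementation.

```python
# Sketch of the patch-embedding step in Equation (2): split a 224x224 image
# into 16x16 patches, project each to a D-dimensional token, prepend a [class]
# token and add learnable 1D positional encodings. Illustrative only.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # N = HW / P^2
        # A strided convolution is equivalent to flattening patches + linear E.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))       # E_pos

    def forward(self, x):                       # x: (B, C, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D) patch tokens x_p E
        cls = self.cls_token.expand(x.size(0), -1, -1)
        z0 = torch.cat([cls, x], dim=1) + self.pos_embed   # Equation (2)
        return z0
```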
Subsequently, the resulting sequence $z_0$ is forwarded into a Transformer Encoder composed of a stack of 6 blocks. This depth is optimized to balance feature abstraction with computational efficiency. Structurally, each encoder block consists of alternating Multi-Head Self-Attention (MSA) and Multi-Layer Perceptron (MLP) layers. To ensure training stability, Layer Normalization (LN) is applied before each layer, and residual connections are employed after each layer, as formulated in Equation (3).
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \qquad l = 1, \dots, L$$
Finally, the high-level features $z_L^0$ extracted by the encoder are passed through a linear MLP head to output the final diagnostic prediction (AD, MCI, or NC). This streamlined end-to-end architecture ensures precise diagnosis while maintaining structural simplicity.
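The sketch below shows one such pre-LN encoder block (Equation (3)) and a 6-block classifier with a single linear head over the [class] token; it is a hedged illustration using PyTorch's built-in multi-head attention, not the released model.

```python
# Sketch of one pre-LN Transformer encoder block (Equation (3)) and a 6-block
# classifier; dimensions follow ViT-Base (D=768, 12 heads). Illustrative only.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0, drop=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=drop, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # MSA + residual
        z = z + self.mlp(self.norm2(z))                    # MLP + residual
        return z

class ViADiMClassifier(nn.Module):
    """Six encoder blocks followed by a linear head over the [class] token."""
    def __init__(self, dim=768, depth=6, num_classes=3):
        super().__init__()
        self.blocks = nn.ModuleList([EncoderBlock(dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, z0):                      # z0 from PatchEmbedding
        for blk in self.blocks:
            z0 = blk(z0)
        return self.head(self.norm(z0)[:, 0])   # classify on z_L^0
```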
Given the limited training samples, the Vi-ADiM architecture is engineered to exploit ViT’s feature-extraction capabilities to discern subtle pathological markers in early AD and MCI. By integrating a two-stage encoding mechanism, our approach efficiently models the intricate interplay between global and local features, significantly reducing computational redundancy and thereby optimizing training efficiency on small-scale datasets.

3.2.1. Preliminary Adaptation of the One-Stage Encoding Module

While ViT excels at modeling global dependencies, its significant depth makes it susceptible to overfitting on small-scale medical datasets, potentially prioritizing global abstractions over the fine-grained local biomarkers essential for early AD/MCI diagnosis. To address this complexity-data mismatch, we tailored the ViT-Base/16 backbone.
The core adaptation lies in the MSA mechanism. While MSA inherently favors global receptive fields, unconstrained attention on limited data may introduce noise. The mechanism is formally defined in Equation (4).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad \mathrm{MSA}(z) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}, \quad \text{where } \mathrm{head}_i = \mathrm{Attention}(zW_i^{Q}, zW_i^{K}, zW_i^{V})$$
Here, $W_i^Q$, $W_i^K$, and $W_i^V$ are learnable projection matrices, and $W^O$ is the output projection matrix. Clinically, this mechanism functions as a "feature filter": the dot product computes a relevance map, enabling the model to suppress background artifacts and focus strictly on pathologically significant regions.
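For concreteness, Equation (4) can be written out directly as follows. This is a minimal educational sketch (full-width projection matrices split into h heads, no dropout or masking), not the model's internal implementation.

```python
# Equation (4) written out directly: scaled dot-product attention with h heads.
# Full-width projections Wq/Wk/Wv are split into heads, which is equivalent to
# stacking the per-head matrices W_i^Q, W_i^K, W_i^V. Educational sketch only.
import torch

def multi_head_self_attention(z, Wq, Wk, Wv, Wo, num_heads):
    B, N, D = z.shape
    d_k = D // num_heads

    def split_heads(x):                    # (B, N, D) -> (B, h, N, d_k)
        return x.view(B, N, num_heads, d_k).transpose(1, 2)

    Q, K, V = split_heads(z @ Wq), split_heads(z @ Wk), split_heads(z @ Wv)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5        # QK^T / sqrt(d_k)
    attn = torch.softmax(scores, dim=-1)                 # relevance map
    heads = (attn @ V).transpose(1, 2).reshape(B, N, D)  # Concat(head_1..head_h)
    return heads @ Wo                                    # output projection W^O

# Toy usage with random weights (D = 768, 12 heads, N = 197 tokens):
z = torch.randn(1, 197, 768)
Wq, Wk, Wv, Wo = (torch.randn(768, 768) * 0.02 for _ in range(4))
out = multi_head_self_attention(z, Wq, Wk, Wv, Wo, num_heads=12)  # (1, 197, 768)
```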
In this study, the encoding module is initially optimized to 11 layers. The output of the l-th layer $z_l$ is formulated in Equation (5).
$$z'_l = \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \qquad z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l$$
Here, $z'_l$ and $z_l$ denote the intermediate and output features of the l-th layer, and LN represents Layer Normalization.
Reducing the encoding depth constrains the parameter space, preventing excessive “global smoothing” and ensuring the model retains critical local texture details associated with early-stage atrophy.
Furthermore, to mitigate the instability caused by noise in MRI scans, the SGDM optimizer [45] is employed. SGDM uses a momentum term to accumulate historical gradients, thereby optimizing the update process, as shown in Equation (6).
$$m_{k+1} = \beta\, m_k + (1 - \beta)\,\nabla f(x_k), \qquad x_{k+1} = x_k - \alpha\, m_{k+1}$$
Here, $\alpha$ is the learning rate, $m_k$ denotes the momentum term, $\beta$ represents the momentum weight, and $\nabla f(x_k)$ indicates the stochastic gradient of $f(x)$ at $x_k$. The momentum weight $\beta$ acts as a stabilizer, smoothing the optimization trajectory to prevent oscillations caused by outlier scans and to help the model escape local minima.
Concurrently, a cosine annealing learning rate schedule is applied to prevent premature convergence, as defined in Equation (7).
$$\eta_t = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\!\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$$
Here, $\eta_t$ represents the current learning rate; $\eta_{min}$ denotes the minimum learning rate; $\eta_{max}$ signifies the maximum learning rate, which is the initial learning rate; $T_{cur}$ indicates the current epoch number; and $T_{max}$ represents the maximum epoch number, i.e., the cycle length.
This learning rate strategy exhibits a fluctuating curve similar to a cosine function, gradually decreasing from its maximum to its minimum within each cycle and then rebounding at the end of the cycle. Periodically adjusting the learning rate can effectively escape local optima and explore better parameter values. This method facilitates fine-tuning of network parameters, which is particularly crucial for diagnosing AD and MCI, as premature convergence can cause the model to overlook subtle pathological changes.
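A minimal sketch of this stage-one optimization setup in PyTorch is shown below, using the hyperparameters reported in Section 4.1 (momentum 0.9, learning rate 0.01, weight decay 5 × 10−5, 50 epochs); the placeholder network and loop body are illustrative only.

```python
# Stage-one optimization sketch: SGD with momentum (Equation (6)) plus cosine
# annealing (Equation (7)), with the Section 4.1 hyperparameters. The network
# and loop body are placeholders for illustration.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224 * 3, 3))  # placeholder

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    # ... forward pass, loss computation, and loss.backward() would go here ...
    optimizer.step()        # momentum update of Equation (6)
    scheduler.step()        # cosine decay of the learning rate, Equation (7)
```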

3.2.2. Two-Stage Fine-Tuning and Optimization

Following the preliminary adaptation, the model enters a refinement phase to enhance diagnostic precision. In this stage, we transition the optimizer from SGDM to AdamW [46] and implement a rigorous structural pruning strategy. The update rule for AdamW, which introduces decoupled weight decay, is defined in Equation (8).
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, \qquad \theta_t = \theta_{t-1} - \eta_t\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\,\theta_{t-1}\right)$$
Here, $m_t$ and $v_t$ represent the first and second moment estimates of the gradient, with $\hat{m}_t$ and $\hat{v}_t$ their bias-corrected counterparts; $\theta_t$ denotes the parameters at iteration t; $\epsilon = 10^{-8}$ prevents division by zero; and $\lambda$ is the weight decay coefficient. Unlike SGDM's uniform learning rate schedule, AdamW dynamically adapts the learning rate for each parameter. In the context of AD diagnosis, this is critical: it ensures that sparse, infrequently activated features (e.g., subtle entorhinal cortex anomalies) receive larger relative updates compared to dominant background features, thereby preventing pathological signals from being overshadowed.
Simultaneously, we reduce the encoder depth from 11 to 6 layers and the learning rate from 0.01 to 0.0001. The final update rule for the Vi-ADiM framework is generalized in Equation (9).
$$\theta_t = \theta_{t-1} - \eta_t\,\nabla L(\theta_{t-1})$$
Here, $\theta_t$ represents the current parameters of the model, $\nabla L(\theta_{t-1})$ denotes the gradient of the loss function with respect to the parameters, and $\eta_t$ signifies the learning rate.
This structural reduction serves a dual purpose: (1) Feature Preservation: Deep networks tend to produce highly abstract semantic features. By limiting the depth to 6 layers, we force the model to retain low-level texture and edge information, which are often the first indicators of structural atrophy in MCI. (2) Stable Micro-Adjustment: Reducing the learning rate to 0.0001 acts as a "fine-tuning" mechanism. A large step size in a shallow network might skip over critical feature regions; the reduced rate ensures precise convergence around the optimal solution for these sensitive pathological markers.
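The following sketch illustrates how this stage-two configuration could be realized on a torchvision ViT-B/16: only the first six pretrained encoder blocks are retained, and fine-tuning switches to AdamW at the reduced learning rate. It is an assumed reconstruction, not the authors' code.

```python
# Stage-two sketch (assumed reconstruction): retain only the first 6 pretrained
# encoder blocks of a torchvision ViT-B/16 and fine-tune with AdamW
# (Equation (8)) at the reduced learning rate of 1e-4.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 3)  # AD / MCI / NC

# torchvision stores the encoder blocks in model.encoder.layers; keep 6 of 12.
model.encoder.layers = nn.Sequential(*list(model.encoder.layers.children())[:6])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)
```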
Through this collaborative optimization, combining AdamW’s adaptive gradients with a streamlined architecture, the model achieves a statistical equilibrium, capturing key disease features with high stability.

3.3. Integrated Interpretation

Following the construction of Vi-ADiM, a diagnostic framework for early AD and MCI, we substantiated the model’s trustworthiness and transparency through comprehensive interpretability analyses using Grad-CAM++ and SHAP. We first deployed Grad-CAM++ to generate activation heatmaps, thereby visualizing the salient regions that guide the model’s attention. Subsequently, to rigorously validate classification reliability across AD, MCI, and NC cohorts, we leveraged the game-theoretic SHAP framework. By quantifying the marginal contribution of each feature to the prediction output, SHAP elucidates the underlying decision-making logic for individual samples. Let $X_i$ denote the input of the i-th MRI image sample and $X_{ij}$ its j-th feature; let $y_i$ be the model’s predicted value for this sample and $y_{base}$ the baseline of the entire model (usually the mean of the target variable over all samples). The SHAP values then conform to Equation (10) as follows:
$$y_i = y_{base} + f(X_{i1}) + f(X_{i2}) + \dots + f(X_{ik})$$
Here, $f(X_{ij})$ is the SHAP value of $X_{ij}$. When $f(X_{ij}) > 0$, it indicates that the feature has increased the predicted value; conversely, it suggests that the feature has decreased the predicted value, exerting a reverse effect.
In this study, to generate local perturbations, we applied fuzzy masks and leveraged the SHAP framework to evaluate pixel-wise significance within the input MRI images, concurrently computing SHAP values for distinct features. SHAP quantifies feature contributions across class predictions and explicitly highlights key brain regions correlated with AD and MCI through visual mapping. The Grad-CAM++ visuals and SHAP attributions ensure the Vi-ADiM framework exhibits robust interpretability, thereby providing transparent and reliable support for clinical decision-making.
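As an illustration of how these two explanation tools can be wired to a ViT classifier, two hedged sketches follow. The first assumes the open-source pytorch-grad-cam package and a torchvision-style model; the second assumes the shap package with a blur ("fuzzy") image masker. The trained `model`, target-layer choice, inputs, and evaluation budgets are assumptions for these examples, not the authors' exact configuration.

```python
# Hedged Grad-CAM++ sketch using the open-source pytorch-grad-cam package with
# a torchvision-style ViT. Target layer and inputs are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

def reshape_transform(tensor, height=14, width=14):
    """Drop the [class] token and reshape patch tokens into a 2D feature map."""
    result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
    return result.permute(0, 3, 1, 2)

model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, 3)  # AD / MCI / NC
model.eval()

target_layers = [model.encoder.layers[-1].ln_1]     # last block's first LayerNorm
cam = GradCAMPlusPlus(model=model, target_layers=target_layers,
                      reshape_transform=reshape_transform)

input_tensor = torch.rand(1, 3, 224, 224)           # stand-in MRI slice
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(0)])[0]   # class 0 = AD
rgb_img = input_tensor[0].permute(1, 2, 0).numpy()
heatmap = show_cam_on_image(rgb_img, grayscale_cam, use_rgb=True)
```

```python
# Hedged sketch of the blur-mask ("fuzzy") SHAP analysis with the shap package.
# The classifier and test images are stand-ins for illustration only.
import numpy as np
import shap
import torch

class_names = ["AD", "MCI", "NC"]
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(224 * 224 * 3, 3))  # placeholder
X_test = np.random.rand(5, 224, 224, 3).astype(np.float32)      # stand-in slices

def predict(images_np):
    """Wrap the PyTorch model as a numpy-in / numpy-out prediction function."""
    x = torch.tensor(images_np, dtype=torch.float32).permute(0, 3, 1, 2)
    with torch.no_grad():
        return torch.softmax(model(x), dim=1).numpy()

# The blur masker perturbs local image regions, matching the local-perturbation
# interpretation described above.
masker = shap.maskers.Image("blur(128,128)", shape=(224, 224, 3))
explainer = shap.Explainer(predict, masker, output_names=class_names)

shap_values = explainer(X_test[:5], max_evals=500, batch_size=32)
shap.image_plot(shap_values)   # red = positive, blue = negative contribution
```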

4. Analysis of Experimental Results

4.1. Experimental Platform and Parameter Configuration

All experiments were conducted on a workstation running Ubuntu 20.04, equipped with 32 GB of RAM, a 12th Gen Intel Core i7-12700KF processor, and an NVIDIA GeForce RTX 4080 SUPER GPU (16 GB VRAM). The deep learning models were implemented in Python 3.8 using PyTorch 1.10.0, accelerated by CUDA Toolkit 11.3 and cuDNN v8.2.0.
We adopted ViT-Base/16 as the backbone, configured with an input resolution of 224 × 224. This study employed a two-stage transfer learning strategy. First, to adapt features, the model was fine-tuned with an 11-layer encoding module using the SGDM optimizer (momentum: 0.9, weight decay: 5 × 10−5, learning rate: 0.01). Subsequently, we performed a refined optimization by switching to AdamW, increasing the weight decay to 0.01, setting the learning rate to 1 × 10−4, and reducing the number of layers in the encoding module to 6. In both stages, the MLP Head consisted of a single linear classification layer. Training was performed with a batch size of 64 using a cosine annealing schedule for 50 epochs, resulting in rapid convergence.

4.2. Dataset Establishment

Data for this study were sourced from the ADNI repository (https://adni.loni.usc.edu/, accessed on 25 January 2026). We processed the 3D NIfTI volumes by extracting axial slices to generate 2D sequences. The curated dataset consists of 5154 PNG images derived from 199 subjects, categorized into 1124 Alzheimer’s Disease (AD), 2590 Mild Cognitive Impairment (MCI), and 1440 Normal Control (NC) samples. To improve model generalization and mitigate overfitting, we applied standard data augmentation techniques to the training partition using the PyTorch framework.
To enhance robustness specific to medical imaging, we implement a suite of data augmentation strategies, including rotation, translation, random cropping, noise injection, random scaling, and color jittering. To rigorously assess model efficacy, we adopt two protocols: five-fold cross-validation and standard hold-out validation. In the cross-validation scheme, the dataset is stratified into five distinct subsets. Iterative training preserves 80% of samples for learning while validating on the remaining 20%, ensuring comprehensive coverage of the data distribution. Consequently, we retain the checkpoint exhibiting optimal performance. For the hold-out strategy, the data is split into training, validation, and test sets at an 8:1:1 ratio, enabling generalization analysis across the isolated validation and test partitions.
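A sketch of this slice-level preprocessing and the five-fold protocol is shown below, assuming nibabel for NIfTI loading and scikit-learn for the stratified split; file paths, the slice range, and the label array are hypothetical placeholders.

```python
# Preprocessing sketch (hypothetical paths and slice range): extract axial
# slices from a 3D NIfTI volume with nibabel, normalize to 8-bit PNG, and
# build a slice-level stratified 5-fold split with scikit-learn.
import nibabel as nib
import numpy as np
from PIL import Image
from sklearn.model_selection import StratifiedKFold

def volume_to_axial_pngs(nifti_path, out_prefix, slice_range=(60, 120)):
    vol = nib.load(nifti_path).get_fdata()                 # (X, Y, Z) volume
    for z in range(*slice_range):
        sl = vol[:, :, z]
        sl = (255 * (sl - sl.min()) / (np.ptp(sl) + 1e-8)).astype(np.uint8)
        Image.fromarray(sl).resize((224, 224)).save(f"{out_prefix}_z{z:03d}.png")

# Five-fold cross-validation over the slice-level dataset (placeholder labels):
image_paths = [f"slice_{i}.png" for i in range(5154)]
labels = np.random.randint(0, 3, size=5154)                # 0=AD, 1=MCI, 2=NC
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(image_paths, labels)):
    print(f"Fold {fold + 1}: {len(train_idx)} train / {len(val_idx)} val slices")
```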

4.3. Evaluation Metrics

To comprehensively evaluate Vi-ADiM’s performance in identifying AD, MCI, and NC, accuracy, precision, recall, and F1 score are selected as evaluation metrics. Accuracy is the most intuitive indicator of classification performance, representing the proportion of samples correctly classified by the model, as shown in Formula (11):
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Here, TP, TN, FP, and FN represent the numbers of samples for true positives, true negatives, false positives, and false negatives, respectively.
Accuracy measures the model’s overall classification performance. Precision measures the proportion of actual positive samples among those the model predicts as positive. Its calculation formula is as follows (12):
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall measures the proportion of all actual positive samples that the model successfully identifies. The formula for calculating recall is (13):
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall, balancing the two metrics. Its calculation formula is (14):
$$\mathrm{F1\ Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
We evaluated Vi-ADiM on AD, MCI, and NC classification using accuracy, precision, recall, and F1 score. These metrics collectively characterize class-wise discrimination and robustness against data imbalance. Furthermore, computational efficiency was assessed via parameter count (Params) and floating-point operations (FLOPs). Results from five-fold cross-validation are reported as "mean ± standard deviation" to quantify both average performance and variability.
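For reference, the sketch below computes these four metrics with scikit-learn, assuming macro averaging across the three classes and toy label arrays; the paper's exact averaging scheme is not specified here.

```python
# Metric computation sketch with scikit-learn, assuming macro averaging across
# the three classes; y_true / y_pred are toy label arrays for illustration.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 1, 0, 2]   # 0 = AD, 1 = MCI, 2 = NC
y_pred = [0, 1, 2, 1, 2, 2]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"Acc {acc:.3f}  Prec {prec:.3f}  Rec {rec:.3f}  F1 {f1:.3f}")
```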

4.4. Experimental Results

4.4.1. Task-Driven Data Augmentation Comparative Experiment

To assess the efficacy of data augmentation on model training, we initialized the ViT-Base/16 backbone with ImageNet pre-trained weights and implemented cross-domain feature adaptation. The ViT encoder was configured with 12 transformer layers, and performance was evaluated using accuracy, precision, recall, and F1-score metrics. As shown in Table 2, 5-fold cross-validation results demonstrate that data augmentation yields consistent, if modest, performance gains. Specifically, accuracy improved from 99.196% to 99.224%, recall from 99.122% to 99.196%, and F1 score from 99.128% to 99.160%, while precision declined marginally from 99.144% to 99.130%. Furthermore, the standard deviations for accuracy, precision, recall, and F1 score diminished by 0.022%, 0.122%, 0.059%, and 0.071%, respectively, suggesting that task-driven data augmentation not only enhances performance but also substantially improves model stability.
Overall, these incremental gains suggest that augmenting the dataset with tailored medical image strategies effectively bolsters the model’s robustness to MRI scans across heterogeneous acquisition protocols and hardware.

4.4.2. Comparative Experiment on Fine-Tuning and Optimization of the Encoding Module

(1)
Preliminary adaptation comparison test of the phase one encoding module
Building upon the proposed medical image augmentation strategy, we refined the structure of the encoding module. Training utilized the SGDM optimizer paired with a Cosine Annealing scheduler, initialized at a learning rate of 0.01. To analyze the trade-off between model complexity and performance, we varied the depth of the encoding module across 8 to 12 layers.
Table 3 summarizes the quantitative evaluation. We employed 5-fold cross-validation and reported the mean and standard deviation for each metric. Performance metrics improved consistently as the encoder depth increased from 8 to 12. The configuration with 11 layers yielded the optimal balance, achieving 99.36 ± 0.31% accuracy, 99.33 ± 0.39% precision, 99.30 ± 0.30% recall, and an F1 score of 99.31 ± 0.34%. Furthermore, this setup demonstrated the lowest variance across all core metrics, indicating superior stability compared to other depths.
Increasing the encoder depth to 12 results in a trade-off: while primary metrics dip marginally, variance is significantly suppressed. Specifically, standard deviations for accuracy, precision, recall, and F1 score narrowed to 0.299%, 0.325%, 0.194%, and 0.254%, respectively. This reduction signifies enhanced predictive robustness across AD, MCI, and NC categories. Despite satisfactory performance with the SGDM optimizer, the model exhibits suboptimal stability with 11 encoder layers. To address this, we transitioned to the AdamW optimizer (learning rate: 0.0001, weight decay: 0.01) for subsequent retraining.
We then re-evaluated the impact of encoding depth. Reducing the encoder depth to six layers yielded superior results compared to the 11-layer configuration. The optimized model achieved an accuracy of 99.640 ± 0.163%, precision of 99.630 ± 0.160%, recall of 99.598 ± 0.188%, and F1 score of 99.610 ± 0.161%. As detailed in Table 4, this configuration improved the metrics by roughly 0.27–0.30% while reducing the standard deviation by 0.11–0.23%.
Relative to the ViT-Base/16 baseline, the proposed Vi-ADiM yields consistent performance gains, boosting accuracy, precision, recall, and F1 score by 0.444%, 0.486%, 0.476%, and 0.482%, respectively. Concurrently, the model exhibits enhanced stability, as evidenced by reductions in standard deviation of 0.158%, 0.287%, 0.065%, and 0.164%. Crucially, this performance is accompanied by a substantial drop in computational cost: parameter count and FLOPs are reduced by 48.964% and 49.653%, respectively. These metrics collectively underscore Vi-ADiM’s superior robustness and efficiency, rendering it highly amenable to practical deployment.
Figure 4 details the fold-wise confusion matrices. Fold1 exhibits minimal error, with a single AD case misidentified as MCI and two NC subjects mislabeled as AD and MCI. Fold2 records 10 misclassifications, notably including one AD sample labeled as NC, and 6 NC cases incorrectly predicted as AD (4) or MCI (2). In Fold3, errors include 2 AD and 5 NC samples, as well as 3 MCI cases misclassified as NC. Fold4 shows singular misclassifications for AD and NC inputs, while Fold5 reveals only two errors: one NC sample misidentified as AD, and one MCI sample as NC.
Consistent with these error counts, the second and third folds yielded marginally lower metrics than the remaining folds. Examination of the confusion matrices reveals negligible misclassifications within the AD and MCI categories. Conversely, the NC category showed a slightly elevated error rate, primarily due to NC samples erroneously predicted as AD or MCI. While the model demonstrated robust discriminative power for AD and MCI, confusion involving NC was more pronounced. Quantitative metrics detailed in Table 5 further substantiate the classification efficacy of the proposed Vi-ADiM.
Meanwhile, as shown in Table 6, the dataset is split into training, validation, and test sets at an 8:1:1 ratio, and Vi-ADiM is trained using conventional training methods. Experiments are first conducted on the validation and test sets separately, where Vi-ADiM demonstrates excellent recognition performance. The validation and test sets are then combined for testing, and the model’s performance is nearly identical to that on the test set alone, with accuracy, precision, recall, and F1 scores of 99.224%, 99.191%, 99.150%, and 99.164%, respectively. This indicates that Vi-ADiM performs consistently across different data partitions, showcasing strong generalization capabilities. The minor performance changes after including the validation set further demonstrate the model’s stable performance and the absence of overfitting.

4.4.3. Interpretability Analysis

(1) Grad-CAM++ feature visualization. In the clinical diagnosis of AD, MRI typically focuses on pathological features such as the hippocampus, entorhinal cortex, and temporal lobe, as well as ventricular enlargement. Previous studies have identified these regions as key biomarkers, including hippocampal and cerebral atrophy, cortical changes induced by amyloid plaques, and pathological features in the amygdala, entorhinal cortex, and hippocampal formation [47,48,49,50,51,52]. Based on this medical consensus, this study randomly selected AD, MCI, and NC samples from the test set for Grad-CAM++ visualization analysis. As shown in Figure 5, the activation regions of AD and MCI in the Vi-ADiM model overlap closely with the aforementioned clinical-pathological regions. The model exhibits significant weight distribution, particularly in regions associated with memory functions such as the hippocampus and temporal lobe. This consistency demonstrates that Vi-ADiM can accurately capture clinically relevant features. In terms of category discriminability, the AD heatmap shows significant differences from the NC heatmap, while the MCI activation regions lie between the two, consistent with the medical definition of MCI as a precursor stage to AD. Additionally, the heatmap distribution in the NC group is relatively scattered, suggesting that the model tends to classify the normal category based on overall brain morphology rather than local lesions. Through local perturbation interpretation and structural correlation validation, the feasibility and reliability of Vi-ADiM in assisting diagnostic decision-making have been demonstrated.
(2) SHAP Visual Feature Contribution. Following Grad-CAM++ feature visualization, SHAP is employed for global interpretation to quantify feature contributions further, as shown in Figure 6a–c. The areas of positive contribution are indicated in red, while those of negative contribution are marked in blue. Green and orange boxes highlight these regions, respectively. Additionally, the Shapley values corresponding to each image’s category are calculated to facilitate a more intuitive evaluation. This paper analyzes the feature contributions allocated from a global perspective for AD, MCI, and NC.
(1) Visualization Analysis of AD Characteristics. As shown in Figure 6a, the Shapley values for the AD category are 3.4918, 5.0077, 5.0242, 5.0309, and 3.4831. It can be observed that the regions corresponding to the green box for MCI and NC are predominantly within the orange box, indicating a clear exclusion relationship between these regions and the prediction of the NC and MCI categories. Below is a detailed analysis of each image:
(a) AD1: The left temporal lobe and hippocampal regions are marked with green boxes, and the corresponding MCI category areas display a distinct blue color, sharing a significant negative correlation.
(b) AD2: The temporal lobe and periventricular regions are marked with green boxes with significant Shapley values. When predicted as the NC category, the Shapley value is −2.7283, indicating that this region has an insignificant impact on the NC prediction.
(c) AD3: The hippocampal and parietal cortex regions are marked with green boxes, and when predicted as the MCI category, these regions exhibit a high-intensity blue color with a Shapley value of −1.2707, indicating that this region is an important signal for the transition from MCI to AD.
(d) AD4: The regions surrounding the temporal lobe are marked with green boxes, indicating a close association with the pathological features of AD.
(e) AD5: Large areas of the brain show positive contributions to the prediction of the AD category while also showing positive contributions to the prediction of the NC category, with a Shapley value of 0.7792 for the prediction of the NC category. Additionally, there are two small areas marked in green boxes that contribute positively to predicting the MCI category. This phenomenon may be because MCI itself is a transitional state between AD and NC. However, overall, the model still tends to predict it as the AD category. It is noteworthy that although the red areas are large, they do not appear deep red, a characteristic of the early stages of AD.
It is noteworthy that the AD categories of AD1 and AD5 have lower Shapley values. In the AD1 brain regions, certain features contribute positively to predictions for both AD and MCI, as highlighted in the green boxes, suggesting overlap in features between late-stage MCI and early-stage AD. Across the five images from AD1 to AD5, the temporal lobe, hippocampus, and areas surrounding the lateral ventricles are frequently highlighted in red boxes, indicating that these regions are significant contributors to AD classification.
(2) Visualization Analysis of MCI Characteristics. In MCI, the red area, marked by the green box in Figure 6b, appears relatively lighter than in AD, and the feature map predicts that the MCI category contains more blue regions. The detailed analysis is as follows:
(a) In MCI1, the Shapley values for predicting the categories MCI, NC, and AD are 1.7044, −0.3917, and −1.6140, respectively, with fewer red regions concentrated in the hippocampus and parietal lobes. Additionally, two green boxes appear in the results, predicted as AD, indicating that some features are similar to those of AD. This result aligns with the characteristic of MCI as a transitional stage between AD and NC.
(b) In MCI2: Large light red regions are mainly concentrated in the parietal lobe, and this area appears blue when predicted as the NC category, indicating that its features differ significantly from NC. When predicted as the AD category, its characteristic regions overlap with the red regions predicted as the MCI category (marked with a green box), suggesting a potential trend toward conversion to AD. However, the Shapley value for the prediction (i.e., the MCI category) is 5.2744, indicating that this feature remains highly associated with MCI.
(c) MCI3: The Shapley value for the MCI prediction is 5.2199, with the red regions primarily distributed in areas with larger gray matter volumes.
(d) MCI4: The red regions predicted as the MCI category partially resemble those predicted as the NC category and also partly resemble those predicted as the AD category, suggesting that MCI is a transitional stage between AD and NC.
(e) MCI5: The Shapley value for the MCI prediction is 3.8750. The region marked by the green box, corresponding to this value, overlaps with the region marked by the orange box, which is predicted for the NC category, indicating that the brain structure may have already undergone pathological changes at this stage but has not yet reached the severity of AD. Overall, across MCI1–MCI5, the red and blue features are alternately distributed, indicating that MCI characteristics represent a transitional state between NC and AD.
(3) Visualization Analysis of NC Characteristics. In Figure 6c (NC1–NC5), the predicted Shapley values for the NC category show a relatively uniform distribution. The red regions show no significant contributions, and the blue regions are sparse, indicating that the features in these areas are not important for AD and MCI classification, consistent with the stability of normal brain structures. Additionally, the red regions are evenly distributed, further reflecting the integrity and wholeness of the brain structures in the NC category.
In summary, in AD, the red and blue points (mainly concentrated in regions related to pathology, such as the periphery of enlarged lateral ventricles) further indicate that the model has captured the typical features of the disease. In MCI, feature points are densely distributed, indicating that the model incorporates more local features in MCI classification, consistent with the complexity of early MCI lesions. In NC, the feature points are sparsely and uniformly distributed, suggesting that the model primarily relies on the consistency of the overall brain structure to determine normal cognition, which is highly consistent with the lesion areas marked in the related references in Figure 5.
By combining Grad-CAM++ and SHAP, the experiment demonstrates that the model’s regions of interest align with medical common sense, particularly when classifying AD and MCI. The regions focused on by the model are highly correlated with neurodegeneration, thereby enhancing the credibility of the results. The classification of AD and MCI relies on pathological changes in brain structures, whereas the identification of NC is based on the integrity and consistency of brain structures. By integrating these two interpretability tools, this study not only validates the model’s high accuracy but also further reveals the rationale behind classification decisions, with significant medical reference value.

4.4.4. Comparative Experiments of Different Models

Subsequently, this study selected 18 models for a comprehensive comparative analysis with the proposed Vi-ADiM. These include 13 mainstream deep learning models (VGG [53], GoogleNet [54], ResNet [55], MobileNetV2 [56], MobileNetV3 [57], ShuffleNetV2 [58], DenseNet [59], EfficientNet [60], EfficientNetV2 [61], RegNet [62], MobileViT [63], Swin Transformer [64], and Vision Transformer [65]) and 5 recent methods designed explicitly for AD diagnosis (pre-trained SqueezeNet [66], SCCAN [67], HDFE+FEA+MFF [68], lightweight CNN-LSTM [69], and BSGAN-ADD [70]). To ensure fairness in experimental comparisons, this study set the learning rate, optimizer, weight decay parameters, and learning rate adjustment strategy to be identical across all baseline models, thereby allowing for a direct comparison of their recognition performance.
As shown in Table 7, Vi-ADiM achieved accuracy, precision, recall, and F1 scores of 99.640 ± 0.163%, 99.630 ± 0.160%, 99.598 ± 0.188%, and 99.610 ± 0.161%, respectively. In terms of architecture, Vi-ADiM significantly outperforms CNN-based models such as VGG19 (95.896 ± 1.982%) and ShuffleNetV2 (89.272 ± 1.337%). While the deep ResNet50 achieves comparable accuracy (99.672 ± 0.132%), Vi-ADiM leverages its Transformer architecture to capture long-range global dependencies, a capability lacking in CNNs. Furthermore, compared to other Vision Transformers such as Swin Transformer and MobileViT, Vi-ADiM demonstrates superior stability, evidenced by the lowest standard deviations across all metrics.
Crucially, to benchmark against the latest advancements in the field, we compared Vi-ADiM with recent specialized methods (Table 7). Vi-ADiM demonstrates superior or highly competitive performance against these recent models. Specifically, it outperforms BSGAN-ADD [70], lightweight CNN-LSTM [69], and HDFE+FEA+MFF [68] in accuracy by significant margins of 1.04%, 7.34%, and 3.50%, respectively. While SCCAN [67] reports a marginally higher F1 score (99.66%), Vi-ADiM achieves a higher accuracy (99.64% vs. 99.58%) and precision (99.63% vs. 99.58%). Moreover, unlike several recent works that do not report variance, Vi-ADiM exhibits exceptional stability with low standard deviations, reinforcing its reliability for clinical deployment.
In terms of computational efficiency, Vi-ADiM maintains a moderate parameter count (43.712 M) and FLOPs (8.490 G). Despite being more complex than lightweight models like MobileNetV3, it yields substantial performance gains. Conversely, it is significantly more efficient than heavy models such as VGG19 (139M Params), offering an optimal balance between diagnostic precision and resource consumption.

5. Discussion

5.1. Synergizing Global-Local Features for High-Precision Diagnosis

This study introduces Vi-ADiM, a domain-specific framework designed to overcome bottlenecks arising from data scarcity and feature sparsity in early AD and MCI diagnosis. Unlike traditional CNNs that rely predominantly on local texture, our strategically fine-tuned Vision Transformer captures long-range dependencies, effectively paralleling the global diagnostic reasoning of expert radiologists. Empirical results substantiate the superiority of this approach, with Vi-ADiM achieving an accuracy of 99.640 ± 0.163% under rigorous five-fold cross-validation. While ResNet50 showed a marginal advantage in precision, Vi-ADiM outperformed it in stability and performance balance, as evidenced by an F1-score of 99.610%, which confirms that the Transformer’s global attention mechanism offers a robust alternative to CNNs for identifying diffuse pathological changes. Crucially, we emphasize that these high-performance metrics primarily reflect the model’s feature extraction efficacy within the current dataset distribution. As detailed in the limitations section, we acknowledge that the slice-level data partitioning may contribute to these favorable results; consequently, they should be interpreted as a validation of architectural utility rather than a definitive measure of clinical generalizability across unseen populations.

5.2. Bridging the Trust Gap via Dual-Perspective Interpretability

Beyond quantitative metrics, clinical deployment demands interpretability. By integrating Grad-CAM++ and SHAP, the framework establishes a Global-Local interpretability mechanism that aligns algorithmic attention with biological priors. Locally, Grad-CAM++ visualizations corroborate that Vi-ADiM autonomously attends to the hippocampus and temporal lobe—regions confirmed to undergo atrophy in early AD. Globally, SHAP quantification delineates the precise contribution of these features to the decision boundary. To address concerns regarding the qualitative nature of these explanations, we emphasize the "cross-verification" inherent in our approach: local pixel-level attention (Grad-CAM++) consistently aligns with global feature attribution (SHAP) across diverse samples. Furthermore, these algorithmic explanations exhibit high concordance with established medical literature. This alignment between independent interpretability methods and ground-truth pathological knowledge serves as a qualitative proxy for robustness, suggesting the model learns valid disease markers rather than confounding artifacts or noise.

5.3. Proposed Clinical Integration Workflow

This study focuses on algorithmic construction; we propose Vi-ADiM as a "second-opinion" screening tool within real-world clinical workflows to assist less experienced clinicians. The proposed integration comprises three steps: (1) Primary Screening: Clinicians upload patient MRI sequences to the system; (2) ROI Localization: The model generates Grad-CAM++ heatmaps to delineate suspicious regions (e.g., hippocampal atrophy) alongside its prediction; (3) Diagnostic Alert: If the model predicts MCI/AD with high confidence, it flags the case for senior radiologist review. This "Human-in-the-Loop" paradigm leverages the model’s high sensitivity to minimize false negatives, while relying on human expertise for final confirmation, thereby mitigating the risk of algorithmic false positives.

5.4. Limitations and Critical Analysis

To maintain a rigorous scientific perspective, it is imperative to delineate the inherent constraints of the proposed Vi-ADiM framework explicitly:
(1)
Data Partitioning and Leakage Risks: Strictly speaking, the current evaluation employed a slice-level random split strategy due to the finite sample size. We recognize that this approach introduces an intrinsic risk of data leakage stemming from intra-subject anatomical correlation, as adjacent slices from the same subject may reside in opposing data splits. Consequently, the reported metrics (e.g., accuracy > 99%) should be interpreted as the model’s upper bound for characterizing morphological features within this specific distribution, rather than a definitive measure of generalization to unseen subjects. Future work will strictly implement subject-level separation to assess generalization rigorously.
(2)
Dataset Diversity and External Validation: The model’s validation is currently confined to the ADNI cohort. The absence of external multi-center validation leaves the model’s resilience to heterogeneous acquisition protocols (e.g., variations in magnetic field strength and vendor-specific artifacts) unverified, potentially leading to overfitting to the specific domain characteristics of the ADNI dataset.
(3)
Clinical Validation: The proposed "Human-in-the-Loop" workflow remains a theoretical construct. As comparative studies with human clinicians (e.g., inter-rater variability analysis) have not yet been conducted, the practical utility of the model in a prospective decision-support scenario remains a hypothesis that necessitates empirical verification in future clinical trials.
(4)
Two-Dimensional Slice-based vs. Three-Dimensional Volumetric Analysis: This study relies on 2D slice-based analysis. While this paradigm is computationally efficient for extracting fine-grained texture anomalies, it inherently sacrifices volumetric spatial coherence along the z-axis. This limitation may impede the detection of pathological patterns that rely on inter-slice continuity, a gap that future 3D Vision Transformers could potentially address.
(5)
Parameter Efficiency: Although the Two-Stage Encoding Optimization achieved a 49% reduction in parameters compared to the ViT-Base baseline, the model’s complexity remains higher than that of lightweight CNNs. Further investigation into model quantization and pruning is essential to facilitate deployment on resource-constrained edge devices.

5.5. Future Perspectives

To address these limitations and advance the Vi-ADiM framework, future research will focus on three key dimensions: (1) Scale: Expanding to large-scale multi-center datasets to enhance generalization and mitigate site-specific bias; (2) Efficiency: Investigating advanced compression techniques (e.g., quantization and pruning) to enable real-time inference on portable medical devices; and (3) Integration: Embedding this diagnostic engine into medical robotics for automated screening. These efforts aim to evolve Vi-ADiM from a high-performance algorithm into a reliable, clinically integrated precision medicine tool.

6. Conclusions

Clinicians with limited experience often struggle with subjective interpretation of MRI scans, potentially compromising diagnostic accuracy for AD. To address this, we present Vi-ADiM, an interpretable framework tailored for early AD and MCI diagnosis via strategic Vision Transformer fine-tuning. Initially, cross-domain feature alignment and task-aware augmentation facilitate rapid convergence and robust representation learning. Subsequently, a refined two-stage optimization protocol propels the model to achieve state-of-the-art accuracy, precision, recall, and F1 scores of 99.640 ± 0.163%, 99.630 ± 0.160%, 99.598 ± 0.188%, and 99.610 ± 0.161%, respectively. Furthermore, the integration of explainability mechanisms significantly bolsters clinical trust, underscoring the Transformer’s versatility in data-limited medical regimes. This study presents a robust paradigm for intelligent neurodegenerative screening, advancing the trajectory of precision medicine.

Author Contributions

Y.L.: Conceptualization, Methodology, Investigation, Formal analysis, Writing—original draft. B.X.: Validation, Resources, Data curation, Visualization. Q.B.: Investigation, Software. Z.L.: Resources, Writing—review and editing. J.Z.: Conceptualization, Writing—review and editing, Supervision, Project administration, Funding acquisition. Q.C.: Conceptualization, Writing—review and editing, Supervision, Project administration, Funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guiyang University Doctoral Research Launch Project (No. GYU-KY-[2026]); the Guizhou Provincial Basic Research Program (Natural Science) (Grant No. Qiankehejichu-ZK [2023] General 014); the Guizhou Science and Technology Cooperation (Basic Research) Project (Grant No. QN [2025] 369); the Youth Science and Technology Talent Growth Project of the Guizhou Provincial Department of Education (Grant No. Guizhou Education Technology [2024] 191); the Guizhou Provincial Science and Technology Department Platform (No. ZSYS [2025]012); the Guizhou Province High-level Innovative Talent Project (No. GCC [2023]010); and the Guizhou Graduate Education Innovation Plan Project (No. 2025YJSKYJJ065).

Institutional Review Board Statement

Ethical review and approval were waived for this study because the research utilized exclusively pre-existing, publicly available, and de-identified data (ADNI database). The analysis of such anonymized data does not constitute research on human subjects as defined by our institution’s ethical review board.

Informed Consent Statement

Informed consent was waived for this study because the research utilized exclusively pre-existing, publicly available, and de-identified data (ADNI database). The analysis of such anonymized data does not constitute research on human subjects as defined by our institution’s ethical review board.

Data Availability Statement

Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in the analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf (accessed on 25 January 2026).

Acknowledgments

The authors acknowledge the Public Big Data Computing Center of Guizhou University for providing the platform and technical support. The authors acknowledge the use of ChatGPT-4.1 (OpenAI) for minor grammatical checks and language polishing during the preparation of this manuscript. The authors have thoroughly reviewed and edited the output and take full responsibility for the publication’s content.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jack, C.R.; Andrews, S.J.; Beach, T.G.; Buracchio, T.; Dunn, B.; Graf, A.; Hansson, O.; Ho, C.; Jagust, W.; McDade, E.; et al. Revised criteria for the diagnosis and staging of Alzheimer’s disease. Nat. Med. 2024, 30, 2121–2124. [Google Scholar] [CrossRef]
  2. Wilcock, D.M.; Lamb, B.T. The importance of continuing development of novel animal models of Alzheimer’s disease and Alzheimer’s disease and related dementias. Alzheimers Dement. 2024, 20, 5078–5079. [Google Scholar] [CrossRef] [PubMed]
  3. Han, K.; Sheng, V.S.; Song, Y.Q.; Liu, Y.; Qiu, C.J.; Ma, S.Q.; Liu, Z. Deep semi-supervised learning for medical image segmentation: A review. Expert Syst. Appl. 2024, 245, 123052. [Google Scholar] [CrossRef]
  4. Li, Z.H.; Li, Y.X.; Li, Q.D.; Wang, P.Y.; Guo, D.Z.; Lu, L.; Jin, D.K.; Zhang, Y.; Hong, Q.Q. LViT: Language Meets Vision Transformer in Medical Image Segmentation. IEEE Trans. Med. Imaging 2024, 43, 96–107. [Google Scholar] [CrossRef] [PubMed]
  5. Xu, B.; Liu, X.; Gu, W.; Liu, J.; Wang, H. A fine segmentation model of flue-cured tobacco’s main veins based on multi-level-scale features of hybrid fusion. Soft Comput. 2024, 28, 10537–10555. [Google Scholar] [CrossRef]
  6. Behera, T.K.; Khan, M.A.; Bakshi, S. Brain MR Image Classification Using Superpixel-Based Deep Transfer Learning. IEEE J. Biomed. Health Inform. 2024, 28, 1218–1227. [Google Scholar] [CrossRef]
  7. Alsahafi, Y.S.; Kassem, M.A.; Hosny, K.M. Skin-Net: A novel deep residual network for skin lesions classification using multilevel feature extraction and cross-channel correlation with detection of outlier. J. Big Data 2023, 10, 105. [Google Scholar] [CrossRef]
  8. Haq, A.U.; Li, J.P.; Khan, I.; Agbley, B.L.Y.; Ahmad, S.; Uddin, M.I.; Zhou, W.; Khan, S.; Alam, I. DEBCM: Deep Learning-Based Enhanced Breast Invasive Ductal Carcinoma Classification Model in IoMT Healthcare Systems. IEEE J. Biomed. Health Inform. 2024, 28, 1207–1217. [Google Scholar] [CrossRef]
  9. Remigio, A.S. IncARMAG: A convolutional neural network with multi-level autoregressive moving average graph convolutional processing framework for medical image classification. Neurocomputing 2025, 617, 129038. [Google Scholar] [CrossRef]
  10. Wu, J.; Ma, J.Q.; Xi, H.R.; Li, J.B.; Zhu, J.H. Multi-scale graph harmonies: Unleashing U-Net’s potential for medical image segmentation through contrastive learning. Neural Netw. 2025, 182, 106914. [Google Scholar] [CrossRef]
  11. Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar] [CrossRef]
  12. Xu, B.; Yang, G. Interpretability research of deep learning: A literature survey. Inf. Fusion 2025, 115, 102721. [Google Scholar] [CrossRef]
  13. Hsieh, P.J. Determinants of physicians’ intention to use AI-assisted diagnosis: An integrated readiness perspective. Comput. Hum. Behav. 2023, 147, 107868. [Google Scholar] [CrossRef]
  14. Kara, O.C.; Xue, J.Q.; Venkatayogi, N.; Mohanraj, T.G.; Hirata, Y.; Ikoma, N.; Atashzar, S.F.; Alambeigi, F. A Smart Handheld Edge Device for On-Site Diagnosis and Classification of Texture and Stiffness of Excised Colorectal Cancer Polyps. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots And Systems, IROS, Detroit, MN, USA, 1–5 October 2023. [Google Scholar] [CrossRef]
  15. Liu, Z.; Yuan, Y.; Zhang, C.; Zhu, Q.; Xu, X.F.; Yuan, M.; Tan, W.J. Hierarchical classification of early microscopic lung nodule based on cascade network. Health Inf. Sci. Syst. 2024, 12, 13. [Google Scholar] [CrossRef] [PubMed]
  16. He, Q.Q.; Yang, Q.J.; Xie, M.H. HCTNet: A hybrid CNN-transformer network for breast ultrasound image segmentation. Comput. Biol. Med. 2023, 155, 106629. [Google Scholar] [CrossRef] [PubMed]
  17. Al-Fahdawi, S.; Al-Waisy, A.S.; Zeebaree, D.Q.; Qahwaji, R.; Natiq, H.; Mohammed, M.A.; Nedoma, J.; Martinek, R.; Deveci, M. Fundus-DeepNet: Multi-label deep learning classification system for enhanced detection of multiple ocular diseases through data fusion of fundus images. Inf. Fusion 2024, 102, 102059. [Google Scholar] [CrossRef]
  18. Odimayo, S.; Olisah, C.C.; Mohammed, K. Structure focused neurodegeneration convolutional neural network for modelling and classification of Alzheimer’s disease. Sci. Rep. 2024, 14, 15270. [Google Scholar] [CrossRef]
  19. Erdas, Ç.; Sümer, E.; Kibaroglu, S. Neurodegenerative disease detection and severity prediction using deep learning approaches. Biomed. Signal Process. Control 2021, 70, 103069. [Google Scholar] [CrossRef]
  20. Cheriet, M.; Dentamaro, V.; Hamdan, M.; Impedovo, D.; Pirlo, G. Multi-speed transformer network for neurodegenerative disease assessment and activity recognition. Comput. Methods Programs Biomed. 2023, 230, 107344. [Google Scholar] [CrossRef]
  21. Özdemir, E.Y.; Özyurt, F. Elasticnet-Based Vision Transformers for early detection of Parkinson’s disease. Biomed. Signal Process. Control 2025, 101, 107198. [Google Scholar] [CrossRef]
  22. Rashid, A.H.; Gupta, A.; Gupta, J.; Tanveer, M. Biceph-Net: A Robust and Lightweight Framework for the Diagnosis of Alzheimer’s Disease Using 2D-MRI Scans and Deep Similarity Learning. IEEE J. Biomed. Health Inform. 2023, 27, 1205–1213. [Google Scholar] [CrossRef]
  23. He, W.; Zhang, C.; Dai, J.; Liu, L.; Wang, T.; Liu, X.; Jiang, Y.; Li, N.; Xiong, J.; Wang, L.; et al. A statistical deformation model-based data augmentation method for volumetric medical image segmentation. Med. Image Anal. 2024, 91, 102984. [Google Scholar] [CrossRef] [PubMed]
  24. Xu, Z.; Tang, J.; Qi, C.; Yao, D.; Liu, C.; Zhan, Y.; Lukasiewicz, T. Cross-domain attention-guided generative data augmentation for medical image analysis with limited data. Comput. Biol. Med. 2024, 168, 107744. [Google Scholar] [CrossRef] [PubMed]
  25. Kim, H.E.; Cosa-Linan, A.; Santhanam, N.; Jannesari, M.; Maros, M.E.; Ganslandt, T. Transfer learning for medical image classification: A literature review. BMC Med. Imaging 2022, 22, 69. [Google Scholar] [CrossRef] [PubMed]
  26. Kora, P.; Ooi, C.P.; Faust, O.; Raghavendra, U.; Gudigar, A.; Chan, W.Y.; Meenakshi, K.; Swaraja, K.; Plawiak, P.; Rajendra Acharya, U. Transfer learning techniques for medical image analysis: A review. Biocybern. Biomed. Eng. 2022, 42, 79–107. [Google Scholar] [CrossRef]
  27. Alzubaidi, L.; Al-Amidie, M.; Al-Asadi, A.; Humaidi, A.J.; Al-Shamma, O.; Fadhel, M.A.; Zhang, J.; Santamaría, J.; Duan, Y. Novel Transfer Learning Approach for Medical Imaging with Limited Labeled Data. Cancers 2021, 13, 1590. [Google Scholar] [CrossRef]
  28. Lai, Y.; Cao, A.; Gao, Y.; Shang, J.; Li, Z.; Guo, J. Advancing Efficient Brain Tumor Multi-Class Classification—New Insights from the Vision Mamba Model in Transfer Learning. arXiv 2024, arXiv:2410.21872. [Google Scholar] [CrossRef]
  29. Ieracitano, C.; Mammone, N.; Hussain, A.; Morabito, F.C. A Convolutional Neural Network based self-learning approach for classifying neurodegenerative states from EEG signals in dementia. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020. [Google Scholar] [CrossRef]
  30. Luo, M.; He, Z.; Cui, H.; Ward, P.; Chen, Y.-P.P. Dual attention based fusion network for MCI Conversion Prediction. Comput. Biol. Med. 2024, 182, 109039. [Google Scholar] [CrossRef]
  31. Liu, L.F.; Lyu, J.Y.; Liu, S.Y.; Tang, X.Y.; Chandra, S.S.; Nasrallah, F.A. TriFormer: A Multimodal Transformer Framework For Mild Cognitive Impairment Conversion Prediction. In Proceedings of the 2023 IEEE 20th International Symposium on Biomedical Imaging, ISBI, Cartagena, Colombia, 18–21 April 2023. [Google Scholar] [CrossRef]
  32. Hu, Z.T.; Wang, Z.; Jin, Y.; Hou, W. VGG-TSwinformer: Transformer-based deep learning model for early Alzheimer’s disease prediction. Comput. Methods Programs Biomed. 2023, 229, 107291. [Google Scholar] [CrossRef]
  33. Khatri, U.; Kwon, G.R. Diagnosis of Alzheimer’s disease via optimized lightweight convolution-attention and structural MRI. Comput. Biol. Med. 2024, 171, 108116. [Google Scholar] [CrossRef]
  34. Chen, J.D.; Wang, Y.; Zeb, A.; Suzauddola, M.D.; Wen, Y.X. Multimodal mixing convolutional neural network and Transformer for Alzheimer’s disease recognition. Expert Syst. Appl. 2025, 259, 125321. [Google Scholar] [CrossRef]
  35. Kun, Y.; Chunqing, G.; Yuehui, G. An Optimized LIME Scheme for Medical Low Light Level Image Enhancement. Comput. Intell. Neurosci. 2022, 2022, 9613936. [Google Scholar] [CrossRef]
  36. Kamal, M.S.; Dey, N.; Chowdhury, L.; Hasan, S.I.; Santosh, K.C. Explainable AI for Glaucoma Prediction Analysis to Understand Risk Factors in Treatment Planning. IEEE Trans. Instrum. Meas. 2022, 71, 2509209. [Google Scholar] [CrossRef]
  37. Deshmukh, S.; Behera, B.K.; Mulay, P.; Ahmed, E.A.; Al-Kuwari, S.; Tiwari, P.; Farouk, A. Explainable quantum clustering method to model medical data. Knowl.-Based Syst. 2023, 267, 110413. [Google Scholar] [CrossRef]
  38. Teneggi, J.; Luster, A.; Sulam, J. Fast Hierarchical Games for Image Explanations. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4494–4503. [Google Scholar] [CrossRef] [PubMed]
  39. Tanone, R.; Li, L.H.; Saifullah, S. ViT-CB: Integrating hybrid Vision Transformer and CatBoost to enhanced brain tumor detection with SHAP. Biomed. Signal Process. Control. 2025, 100, 107027. [Google Scholar] [CrossRef]
  40. Gelir, F.; Akan, T.; Alp, S.; Gecili, E.; Bhuiyan, M.S.; Disbrow, E.A.; Conrad, S.A.; Vanchiere, J.A.; Kevil, C.G.; Bhuiyan, M.A.N.; et al. Machine Learning Approaches for Predicting Progression to Alzheimer’s Disease in Patients with Mild Cognitive Impairment. J. Med. Biol. Eng. 2024, 45, 63–83. [Google Scholar] [CrossRef]
  41. Yi, F.L.; Yang, H.; Chen, D.R.; Qin, Y.; Han, H.J.; Cui, J.; Bai, W.L.; Ma, Y.F.; Zhang, R.; Yu, H.M. XGBoost-SHAP-based interpretable diagnostic framework for alzheimer’s disease. BMC Med. Inform. Decis. Mak. 2023, 23, 137. [Google Scholar] [CrossRef]
  42. Zhu, Y.H.; Ma, J.B.; Yuan, C.A.; Zhu, X.F. Interpretable learning based Dynamic Graph Convolutional Networks for Alzheimer’s Disease analysis. Inf. Fusion 2022, 77, 53–61. [Google Scholar] [CrossRef]
  43. Li, H.X.; Shi, X.S.; Zhu, X.F.; Wang, S.H.; Zhang, Z. FSNet: Dual Interpretable Graph Convolutional Network for Alzheimer’s Disease Analysis. IEEE Trans. Emerg. Top. Comput. Intell. 2023, 7, 15–25. [Google Scholar] [CrossRef]
  44. Zhu, Q.; Xu, B.L.; Huang, J.S.; Wang, H.Y.; Xu, R.T.; Shao, W.; Zhang, D.Q. Deep Multimodal Discriminative and Interpretability Network for Alzheimer’s Disease Diagnosis. IEEE Trans. Med. Imaging 2023, 42, 1472–1483. [Google Scholar] [CrossRef]
  45. Liu, Y.; Gao, Y.; Yin, W. An Improved Analysis of Stochastic Gradient Descent with Momentum. arXiv 2020, arXiv:2007.07989. [Google Scholar] [CrossRef]
  46. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  47. Patro, S.; NishaV, M. Early Detection of Alzheimer’s Disease using Image Processing. Int. J. Eng. Res. Technol. 2019, 8, 468–471. [Google Scholar]
  48. Suk, H.-I.; Lee, S.-W.; Shen, D. Hierarchical feature representation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage 2014, 101, 569–582. [Google Scholar] [CrossRef]
  49. Saraiva, C.; Praça, C.; Ferreira, R.; Santos, T.; Ferreira, L.; Bernardino, L. Nanoparticle-mediated brain drug delivery: Overcoming blood–brain barrier to treat neurodegenerative diseases. J. Control. Release 2016, 235, 34–47. [Google Scholar] [CrossRef]
  50. Helaly, H.A.; Badawy, M.; Haikal, A.Y. Toward deep MRI segmentation for Alzheimer’s disease detection. Neural Comput. Appl. 2022, 34, 1047–1063. [Google Scholar] [CrossRef]
  51. Rathore, S.; Habes, M.; Iftikhar, M.A.; Shacklett, A.; Davatzikos, C. A review on neuroimaging-based classification studies and associated feature extraction methods for Alzheimer’s disease and its prodromal stages. NeuroImage 2017, 155, 530–548. [Google Scholar] [CrossRef]
  52. Moradi, E.; Pepe, A.; Gaser, C.; Huttunen, H.; Tohka, J. Machine learning framework for early MRI-based Alzheimer’s conversion prediction in MCI subjects. NeuroImage 2015, 104, 398–412. [Google Scholar] [CrossRef]
  53. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  54. Szegedy, C.; Wei, L.; Yangqing, J.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef]
  56. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar] [CrossRef]
  57. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [Google Scholar] [CrossRef]
  58. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018. [Google Scholar] [CrossRef]
  59. Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  60. Tan, M.X.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar] [CrossRef]
  61. Tan, M.X.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar] [CrossRef]
  62. Xu, J.; Pan, Y.; Pan, X.; Hoi, S.; Yi, Z.; Xu, Z. RegNet: Self-Regulated Network for Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 9562–9567. [Google Scholar] [CrossRef]
  63. Mehta, S.; Rastegari, M. MobileViT: Lightweight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  64. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021. [Google Scholar] [CrossRef]
  65. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  66. Emily Esther Rani, K.; Baulkani, S. Alzheimer disease classification using optimal clustering based pre-trained SqueezeNet model. Biomed. Signal Process. Control 2025, 100, 107032. [Google Scholar] [CrossRef]
  67. Hassan, N.; Miah, A.S.M.; Suzuki, K.; Okuyama, Y.; Shin, J. Stacked CNN-based multichannel attention networks for Alzheimer disease detection. Sci. Rep. 2025, 15, 5815. [Google Scholar] [CrossRef]
  68. Huang, H.; Pedrycz, W.; Hirota, K.; Yan, F. A multiview-slice feature fusion network for early diagnosis of Alzheimer’s disease with structural MRI images. Inf. Fusion 2025, 119, 103010. [Google Scholar] [CrossRef]
  69. Ul Haq, E.; Yong, Q.; Yuan, Z.; Xu, H.R.; Ul Haq, R. Multimodal fusion diagnosis of the Alzheimer’s disease via lightweight CNN-LSTM model using magnetic resonance imaging (MRI). Biomed. Signal Process. Control. 2025, 104, 107545. [Google Scholar] [CrossRef]
  70. Bai, T.; Du, M.; Zhang, L.; Ren, L.; Ruan, L.; Yang, Y.; Qian, G.; Meng, Z.; Zhao, L.; Deen, M.J. A novel Alzheimer’s disease detection approach using GAN-based brain slice image enhancement. Neurocomputing 2022, 492, 353–369. [Google Scholar] [CrossRef]
Figure 1. Early intelligent assisted diagnosis process for Alzheimer’s Disease. “*” indicates the extra learnable [class] embedding.
Figure 2. Schematic diagram of cross-domain feature adaptation.
Figure 3. Vi-ADiM model architecture.
Figure 4. Confusion matrix of the five-fold cross-validation model.
Figure 5. Grad-CAM++ feature visualization.
Figure 6. Comparison of Shapley values and feature visualization analysis.
Table 1. Task-Driven Medical Image Enhancement Strategies.
Perspective | Method | Parameter Settings | Clinical Rationale
Image characteristics | Rotation | ±5° | Preserves anatomical orientation; prevents structural distortion of key brain regions.
Image characteristics | Translation | Shift ≤ 0.05 × size | Simulates minor patient head movement during scanning to enhance positional invariance.
Image characteristics | Random Cropping | Pad 4 px, then crop 224 × 224 | Maintains consistent input dimensions while preserving edge semantic integrity.
Image characteristics | Gaussian Noise | Intensity σ = 0.05 | Simulates sensor thermal noise without masking subtle gray/white matter atrophy.
Imaging equipment | Random Scaling | Scale ∈ [0.9, 1.1] | Adapts the model to varying voxel resolutions across different scanner vendors.
Imaging equipment | Color Jittering | Brightness/Contrast/Saturation = 0.1 | Mimics signal-intensity variations caused by magnetic-field inhomogeneities.
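For reference, the settings in Table 1 map onto a standard torchvision pipeline roughly as sketched below; the composition order and the use of torchvision are assumptions for illustration, not the authors’ exact implementation.

```python
# Illustrative torchvision pipeline matching the Table 1 settings; the exact
# order and framework used by the authors are assumed, not reported.
import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=5),                          # rotation ±5°
    transforms.RandomAffine(degrees=0,
                            translate=(0.05, 0.05),                # shift ≤ 0.05 × size
                            scale=(0.9, 1.1)),                     # random scaling [0.9, 1.1]
    transforms.ColorJitter(brightness=0.1, contrast=0.1,
                           saturation=0.1),                        # intensity jitter
    transforms.RandomCrop(224, padding=4),                         # pad 4 px, crop 224 × 224
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.05 * torch.randn_like(x)),   # Gaussian noise, σ = 0.05
])
```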
Table 2. Comparison of evaluation results before and after task-driven proprietary medical data augmentation.
Validation Assessment | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%)
Unenhanced | 99.196 ± 0.321 | 99.144 ± 0.447 | 99.122 ± 0.253 | 99.128 ± 0.325
Enhanced | 99.224 ± 0.299 | 99.130 ± 0.325 | 99.196 ± 0.194 | 99.160 ± 0.254
Table 3. Preliminary adaptation test results of the encoding module.
Number | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) | Parameters (M) | FLOPs (G)
8 | 98.890 ± 0.639 | 98.812 ± 0.778 | 98.780 ± 0.531 | 98.790 ± 0.650 | 57.888 | 11.281
9 | 99.086 ± 0.442 | 99.022 ± 0.505 | 99.010 ± 0.336 | 99.010 ± 0.417 | 64.976 | 12.677
10 | 99.162 ± 0.524 | 99.246 ± 0.584 | 99.334 ± 0.312 | 99.288 ± 0.444 | 72.064 | 14.072
11 | 99.364 ± 0.311 | 99.332 ± 0.389 | 99.300 ± 0.302 | 99.314 ± 0.344 | 79.152 | 15.468
12 | 99.224 ± 0.299 | 99.130 ± 0.325 | 99.196 ± 0.194 | 99.160 ± 0.254 | 85.649 | 16.863
Table 4. Experimental results of fine-tuning and optimization of the encoding module.
Number | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) | Params (M) | FLOPs (G)
4 | 98.118 ± 0.543 | 98.138 ± 0.590 | 97.840 ± 0.763 | 97.978 ± 0.653 | 29.537 | 5.699
5 | 99.310 ± 0.177 | 99.300 ± 0.214 | 99.258 ± 0.128 | 99.278 ± 0.151 | 36.624 | 7.095
6 | 99.640 ± 0.163 | 99.630 ± 0.160 | 99.598 ± 0.188 | 99.610 ± 0.161 | 43.712 | 8.490
Table 5. Evaluation results of precision, recall, and specificity for each fold.
Fold Number | Class | Precision (%) | Recall (%) | Specificity (%)
Fold 1 | AD | 99.703 | 99.703 | 99.917
Fold 1 | NC | 99.537 | 99.537 | 99.820
Fold 1 | MCI | 99.743 | 99.743 | 99.740
Fold 2 | AD | 98.824 | 99.703 | 99.669
Fold 2 | NC | 99.07 | 98.611 | 99.641
Fold 2 | MCI | 99.742 | 99.614 | 99.740
Fold 3 | AD | 100.0 | 99.407 | 100.0
Fold 3 | NC | 98.843 | 98.843 | 99.551
Fold 3 | MCI | 99.358 | 99.614 | 99.35
Fold 4 | AD | 99.703 | 99.703 | 99.917
Fold 4 | NC | 99.768 | 99.537 | 99.910
Fold 4 | MCI | 99.743 | 99.871 | 99.740
Fold 5 | AD | 99.704 | 100.0 | 99.917
Fold 5 | NC | 99.769 | 99.769 | 99.910
Fold 5 | MCI | 100.0 | 99.871 | 100.0
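The per-class values in Table 5 follow the usual one-vs-rest definitions. A minimal sketch of how precision, recall, and specificity can be recovered from a multi-class confusion matrix is given below; the helper name and class ordering are illustrative.

```python
# Minimal sketch of one-vs-rest precision/recall/specificity as tabulated in
# Table 5; function name and class order are illustrative.
import numpy as np
from sklearn.metrics import confusion_matrix


def per_class_metrics(y_true, y_pred, classes=("AD", "NC", "MCI")):
    cm = confusion_matrix(y_true, y_pred, labels=list(classes))
    results = {}
    for i, cls in enumerate(classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp       # other classes predicted as cls
        fn = cm[i, :].sum() - tp       # cls predicted as something else
        tn = cm.sum() - tp - fp - fn   # everything unrelated to cls
        results[cls] = {
            "precision":   100.0 * tp / (tp + fp),
            "recall":      100.0 * tp / (tp + fn),
            "specificity": 100.0 * tn / (tn + fp),
        }
    return results
```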
Table 6. Performance evaluation results for various metrics on the test set, the validation set, and the combined test and validation sets.
Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%)
Validation set | 99.223 | 99.287 | 98.978 | 99.125
Test set | 99.225 | 99.099 | 99.319 | 99.201
Validation and test set | 99.224 | 99.191 | 99.150 | 99.164
Table 7. Experimental comparison results of different network models.
Model | Version | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) | Params (M) | FLOPs (G)
VGG | 16 | 94.262 ± 1.435 | 94.206 ± 1.114 | 93.234 ± 1.779 | 93.648 ± 1.472 | 134.273 | 15.466
VGG | 19 | 95.896 ± 1.982 | 95.674 ± 2.391 | 95.224 ± 2.071 | 95.418 ± 2.225 | 139.583 | 19.628
GoogleNet | - | 91.296 ± 1.276 | 90.504 ± 1.219 | 90.804 ± 1.691 | 90.560 ± 1.217 | 5.977 | 1.582
ResNet | 34 | 99.444 ± 0.289 | 99.422 ± 0.293 | 99.416 ± 0.356 | 99.418 ± 0.287 | 21.286 | 3.678
ResNet | 50 | 99.612 ± 0.159 | 99.672 ± 0.132 | 99.522 ± 0.206 | 99.596 ± 0.158 | 23.514 | 4.132
MobileNetV2 | - | 99.030 ± 0.518 | 98.938 ± 0.607 | 98.972 ± 0.602 | 98.952 ± 0.595 | 2.228 | 0.326
MobileNetV3 | Small | 95.816 ± 0.444 | 95.814 ± 0.742 | 94.798 ± 0.630 | 95.264 ± 0.557 | 1.521 | 0.061
MobileNetV3 | Large | 99.364 ± 0.311 | 99.374 ± 0.254 | 99.234 ± 0.427 | 99.302 ± 0.339 | 4.206 | 0.233
ShuffleNetV2 | 1.0 | 89.272 ± 1.337 | 88.372 ± 1.556 | 87.106 ± 1.285 | 87.536 ± 1.325 | 1.257 | 0.152
ShuffleNetV2 | 2.0 | 98.778 ± 0.206 | 98.706 ± 0.186 | 98.612 ± 0.345 | 98.652 ± 0.187 | 5.351 | 0.596
DenseNet | 121 | 99.308 ± 0.195 | 99.178 ± 0.243 | 99.284 ± 0.134 | 99.232 ± 0.188 | 6.957 | 2.896
EfficientNet | B0 | 98.698 ± 0.255 | 98.440 ± 0.435 | 98.740 ± 0.150 | 98.778 ± 0.557 | 4.011 | 0.412
EfficientNetV2 | Small | 99.306 ± 0.261 | 99.278 ± 0.308 | 99.246 ± 0.331 | 99.260 ± 0.309 | 20.181 | 2.897
RegNet | - | 98.836 ± 0.408 | 98.704 ± 0.483 | 98.734 ± 0.393 | 98.716 ± 0.423 | 2.317 | 0.207
ConvNext | Tiny | 98.918 ± 0.432 | 98.848 ± 0.601 | 98.806 ± 0.409 | 98.826 ± 0.489 | 27.801 | 4.455
ConvNext | Base | 98.946 ± 0.361 | 99.066 ± 0.204 | 98.750 ± 0.569 | 98.920 ± 0.362 | 87.513 | 15.354
MobileViT | Small | 98.450 ± 0.687 | 98.342 ± 0.904 | 98.166 ± 0.727 | 98.236 ± 0.820 | 4.940 | 1.464
Swin Transformer | Tiny | 99.168 ± 0.501 | 99.208 ± 0.472 | 99.012 ± 0.639 | 99.104 ± 0.529 | 27.498 | 4.371
Swin Transformer | Small | 98.928 ± 0.778 | 98.984 ± 0.778 | 98.796 ± 0.895 | 98.902 ± 0.785 | 48.792 | 8.544
ViT | Base | 99.196 ± 0.321 | 99.144 ± 0.447 | 99.122 ± 0.253 | 99.128 ± 0.325 | 85.649 | 16.863
Pre-trained SqueezeNet [66] | - | 98.3 ± 1.05 | 98.9 ± 1.28 | 98.1 ± 1.45 | 98.6 ± 1.3 | - | -
SCCAN [67] | - | 99.58 | 99.58 | 99.58 | 99.66 | - | -
HDFE+FEA+MFF [68] | - | 96.14 ± 0.31 | 96.33 ± 0.13 | 95.39 ± 0.42 | 95.74 ± 0.27 | - | -
Lightweight CNN-LSTM [69] | - | 92.30 | 92.10 | 92.20 | 92.25 | - | -
BSGAN-ADD [70] | - | 98.60 | 98.20 | 99.70 | 99.10 | - | -
Vi-ADiM | - | 99.640 ± 0.163 | 99.630 ± 0.160 | 99.598 ± 0.188 | 99.610 ± 0.161 | 43.712 | 8.490
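For reproducibility of the complexity columns in Table 7, parameter counts can be read directly from a model and FLOPs estimated with a profiling utility such as fvcore; the choice of fvcore, the ViT-Base/16 stand-in, and the 3 × 224 × 224 input are assumptions, since the paper does not state which tool was used.

```python
# Illustrative complexity measurement for the Params/FLOPs columns; the use of
# fvcore and the torchvision ViT-Base/16 stand-in are assumptions, not the
# authors' tooling.
import torch
from torchvision.models import vit_b_16
from fvcore.nn import FlopCountAnalysis

model = vit_b_16()                      # stand-in backbone for illustration
dummy = torch.randn(1, 3, 224, 224)     # assumed input resolution

params_m = sum(p.numel() for p in model.parameters()) / 1e6
flops_g = FlopCountAnalysis(model, dummy).total() / 1e9
print(f"Params: {params_m:.3f} M, FLOPs: {flops_g:.3f} G")
```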