1. Introduction
Neurodegenerative disorders represent a growing global healthcare challenge as populations age. Among these conditions, Alzheimer’s disease (AD) is the most common cause of dementia, which affects more than 55 million people worldwide [
1]. AD is characterized by progressive neuronal degeneration, brain atrophy, and the accumulation of pathological proteins such as amyloid-β and tau, which lead to severe cognitive decline and functional impairment. Neurodegeneration particularly affects brain regions associated with memory and cognition, including the entorhinal cortex, hippocampus, fornix, and the frontal, temporal, and parietal lobes [
2,
3,
4]. Because neuropathological changes can begin years before clinical symptoms appear, early and accurate diagnosis is crucial for enabling timely interventions and improving disease management [
5,
6].
AD can be characterized using multiple biomarkers, including neuroimaging, cerebrospinal fluid (CSF), genetic, and blood-based markers. Among these, neuroimaging biomarkers play a crucial role in detecting structural and functional brain alterations associated with disease progression. Based on the biophysical characteristics of AD pathology, neuroimaging techniques are commonly categorized into structural, functional, and molecular imaging modalities [
7]. Structural imaging methods such as magnetic resonance imaging (MRI) detect anatomical alterations, including hippocampal atrophy, ventricular enlargement, and overall brain volume loss. Functional imaging techniques, such as functional MRI (fMRI), evaluate neuronal activity through hemodynamic responses, while molecular imaging approaches such as single-photon emission computed tomography (SPECT) reveal biochemical changes related to neurodegeneration.
Among these techniques, MRI is one of the most widely used non-invasive neuroimaging modalities for evaluating AD. MRI-derived biomarkers—including hippocampal and medial temporal lobe atrophy, ventricular enlargement, and cortical thinning—are among the most validated imaging indicators of AD progression. These structural changes enable the detection of neurodegeneration and support the differentiation of AD from other forms of dementia [
8]. In clinical practice, AD diagnosis typically involves a combination of neurological examinations, cognitive assessments such as the Mini-Mental State Examination (MMSE), and neuroimaging techniques, including MRI and computed tomography (CT) [
3,
9]. To support research on AD diagnosis, large-scale datasets such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [
10], Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) [
11], and Open Access Series of Imaging Studies (OASIS) [
12] provide multimodal information, including imaging, demographic, biological, and clinical data.
Traditional approaches for detecting brain abnormalities in AD can be broadly categorized into several groups, including voxel-based morphometry (VBM), statistical atlas-based methods, functional and connectivity-based analyses, and Gaussian Hidden Markov Model (GHMM)-based approaches. VBM techniques analyze structural MRI data to identify regional gray matter loss, particularly in areas such as the hippocampus and surrounding cortical regions. Atlas-based approaches compare patient scans with normative templates to identify abnormal patterns, while connectivity-based methods analyze disruptions in brain networks associated with AD. GHMM-based methods further model spatial dependencies in brain images to detect localized pathological changes [
13]. Although these approaches have demonstrated promising results, they often rely on handcrafted features or predefined statistical models that may limit their ability to capture complex patterns in high-dimensional neuroimaging data.
Recent advances in deep learning have significantly improved automated AD classification using neuroimaging data.
Deep learning-based computer vision methods can detect structural brain changes from MRI using different input representations, including 2D slices, region-of-interest (ROI)-based inputs, 3D patches, and full 3D subject-level volumes [
14]. While ROI-based approaches can reduce the amount of non-informative anatomical content provided to the model and potentially mitigate overfitting, they may also overlook disease-related patterns distributed across multiple brain regions.
However, most existing approaches either rely on 2D-slice-based representations, which fail to preserve volumetric context, or focus on isolated ROIs, limiting their ability to capture distributed neurodegenerative patterns.
Transformer architectures, originally introduced by Vaswani et al. [
15], have recently shown strong performance in computer vision tasks. In particular, the Vision Transformer (ViT) architecture [
16] leverages self-attention mechanisms to capture long-range dependencies within image data. Compared with 2D transformer models that process slices independently, 3D transformer architectures can better preserve spatial context by jointly modeling volumetric information across consecutive slices [
17,
18].
AD datasets often contain heterogeneous information, including demographic variables (e.g., age and sex), clinical scores such as MMSE, and MRI scans. Effectively integrating these multimodal data sources remains a major challenge for automated AD classification, as many existing models rely on single modalities or 2D representations of 3D volumes, potentially missing important spatial and contextual information [
19].
Unlike existing approaches that rely primarily on either full-volume analysis or isolated single-ROI representations, the proposed framework integrates a multi-atlas ROI-based instance selection strategy with a multimodal 3D transformer architecture. This design enables anatomically informed learning by emphasizing clinically relevant brain regions while integrating complementary volumetric and clinical information from multimodal data sources.
To address these limitations, this study proposes a multiple-input multimodal framework based on a 3D Vision Transformer that integrates categorical, numerical, and volumetric MRI data together with 3D region-of-interest (ROI) images from key brain regions, including the entorhinal cortex, fornix, hippocampus, and major cortical lobes. By leveraging transformer-based self-attention and volumetric imaging information, the proposed approach aims to improve robustness and diagnostic accuracy for AD classification.
The main contributions of this study are centered on the design of an anatomically informed multimodal transformer framework for Alzheimer’s disease classification and can be summarized as follows:
Unified multimodal transformer-based representation learning. We propose a multiple-input transformer framework that jointly models heterogeneous data modalities, including multiple 3D MRI-derived regions of interest (ROIs), clinical metadata, and volumetric biomarkers within a unified self-attention architecture. Unlike conventional late-fusion approaches, the proposed framework enables token-level cross-modal interaction through transformer self-attention mechanisms.
Anatomically guided multi-ROI volumetric transformer encoding. The proposed framework integrates multi-atlas ROI-based instance selection with 3D tubelet embedding, enabling anatomically constrained volumetric representation learning focused on disease-relevant brain structures associated with Alzheimer’s disease pathology.
Feature-wise tokenization of structured biomarkers. Clinical and volumetric variables are transformed into learnable token representations, allowing structured non-imaging information to participate directly in the multimodal transformer attention process rather than being incorporated through conventional late-fusion strategies.
Hemisphere-aware ROI representation selection. A hemisphere-specific ROI evaluation strategy is introduced to identify the most discriminative anatomical representation for each ROI, enabling optimized multi-ROI construction based on cross-validation performance.
Statistically grounded attention-based interpretability analysis. The proposed framework incorporates quantitative interpretability analysis through repeated-validation attention aggregation, ROI attention stability evaluation, coefficient-of-variation analysis, Friedman statistical testing, and anatomically grounded attention consistency assessment.
Multi-cohort multimodal evaluation framework. The proposed methodology is evaluated using a heterogeneous merged multi-cohort dataset combining ADNI, AIBL, and OASIS subjects under stratified 7-fold cross-validation, improving robustness against dataset-specific bias and enabling evaluation under heterogeneous acquisition conditions.
The structure of the paper is outlined as follows:
Section 2 reviews related work. The materials and methods used for preprocessing and for building the transformer-based AD classification models are described in
Section 3.
Section 4 provides a detailed description of the experiments conducted in this work and the parameter settings used. The results of the experiments are discussed in
Section 5.
Section 6 addresses the limitations of the proposed methods. Finally, the concluding remarks for this work are summarized in
Section 8.
2. Related Work
Neuroimaging preprocessing plays a pivotal role in Alzheimer’s disease (AD) classification, as raw MRI and PET scans frequently contain noise, intensity inhomogeneities, and anatomical variability. To address these issues, previous studies have implemented standardized pipelines that include skull stripping, spatial normalization, bias field correction, and intensity scaling to maintain consistent image quality across subjects [
20,
21,
22]. These pipelines facilitate accurate tissue segmentation of gray matter, white matter, and cerebrospinal fluid. Advanced methods further incorporate denoising, contrast enhancement, and multimodal alignment to enhance feature extraction from critical regions such as the hippocampus and cortical structures [
23,
24,
25]. Surface-based frameworks, such as FreeSurfer, are also commonly used to extract cortical morphometric features—including volume, thickness, curvature, folding, and surface area—from anatomically defined regions (e.g., DKT atlas), which have demonstrated effectiveness for machine learning- and deep learning-based AD classification [
26].
Recent research demonstrates that combining volumetric alterations of key neuroanatomical structures, particularly the hippocampus, amygdala, and ventricular system, yields highly discriminative biomarkers for Alzheimer’s disease (AD), achieving performance comparable to deep learning methods [
27]. Hippocampal volumetry, in particular, is recognized as one of the most robust biomarkers and is frequently computed bilaterally to enhance stability and discriminative capacity [
27]. In practical applications, hippocampal and amygdalar volumes, typically normalized by estimated total intracranial volume (eTIV), are widely utilized, while ventricular enlargement serves as an additional marker of global atrophy [
28,
29]. These findings collectively support the adoption of compact, biologically informed volumetric features as effective representations for AD classification.
Initial efforts in automated AD diagnosis frequently utilized region-of-interest (ROI)-based models that target specific brain regions affected by the disease. For instance, ref. [
30] extracts hippocampal blocks from MRI scans, whereas ref. [
31] examines medial temporal lobe structures using coronal slices. Other methodologies develop ensemble classifiers by extracting patches from multiple regions, including the hippocampus, amygdala, and insulae [
32,
33,
34]. Further studies implement ROI-based frameworks that incorporate anatomical landmarks [
35], employ explainable 3D convolutional neural networks for patient-specific ROI detection [
36], or utilize statistical techniques to identify informative ROI content [
37]. Although ROI-based approaches reduce computational complexity and emphasize disease-relevant structures, they may fail to capture global structural patterns distributed throughout the brain.
Deep learning techniques have substantially advanced AD classification using neuroimaging data. Convolutional neural networks (CNNs) are widely adopted for their capacity to automatically learn hierarchical representations from MRI scans. However, CNNs predominantly capture local spatial features and often struggle to model long-range relationships in volumetric brain images. To overcome this limitation, recent research has increasingly investigated transformer-based architectures that model global dependencies via self-attention mechanisms.
The Vision Transformer (ViT) has garnered significant attention in AD classification due to its capacity to capture long-range interactions between image patches. Multiple studies utilize ViT models with transfer learning to address the scarcity of labeled neuroimaging data and to enhance disease staging and progression prediction [
38,
39]. Hybrid CNN–Transformer architectures have also been introduced to integrate local feature extraction with global contextual modeling for AD diagnosis [
34,
40,
41,
42]. Furthermore, transformer architectures adapted for volumetric MRI data improve the analysis of three-dimensional brain structures and facilitate the capture of spatial dependencies across slices [
43].
The Swin Transformer represents another notable architecture, introducing hierarchical feature representations and shifted-window self-attention to efficiently process high-resolution images. Recent studies combine Swin Transformers with CNN backbones to enhance feature extraction and facilitate early AD detection [
5,
20,
22,
44]. Additional advancements incorporate frequency-domain features and specialized attention mechanisms to improve classification accuracy and lesion localization [
45,
46,
47]. Collectively, these transformer-based approaches demonstrate enhanced capability to capture both local and global contextual information in neuroimaging data.
Recent research has explored multimodal deep learning approaches to address the multifactorial nature of AD by integrating diverse data sources, such as neuroimaging, clinical information, and genetic data. For instance, ref. [
48] analyzed omics, imaging, and clinical features from the ANMerge dataset [
49], demonstrating improved performance when combining imaging and omics data. Similarly, ref. [
50] integrated structural MRI, SNP-based genetic profiles, and electronic health records using stacked denoising autoencoders and 3D CNNs, achieving higher diagnostic accuracy than single-modality approaches. Additional studies address the issue of incomplete modality availability. For example, ref. [
25] proposed a multi-input 3D CNN to manage missing MRI or PET data, while ref. [
51] generated missing modalities using a generative adversarial network prior to applying a multimodal transformer for classification.
Attention-based multimodal fusion strategies have been investigated to model interactions across different modalities. The MADDi framework [
52] utilizes cross-modal attention to jointly analyze imaging, genetic, and clinical data, whereas ref. [
53] combines MRI and PET hippocampal features using dual-branch CNN architectures. Early-fusion strategies have also been introduced, such as the modified ResNet architecture in [
54], which integrates MRI and PET inputs and incorporates explainable artificial intelligence techniques to enhance interpretability.
Despite substantial advancements, several limitations persist in current methodologies. Many early approaches rely on handcrafted features or predefined statistical models, limiting their ability to capture complex, high-dimensional patterns in neuroimaging data. Although convolutional neural networks (CNNs) have improved representation learning, they primarily focus on local spatial features and often fail to model long-range dependencies in volumetric brain images.
Recent transformer-based models mitigate this limitation by employing self-attention mechanisms to capture global contextual relationships. Nevertheless, many of these approaches continue to operate on 2D-slice-based representations, which do not fully preserve the three-dimensional anatomical structure of the brain. Although several studies have extended transformer architectures to volumetric 3D neuroimaging data, many of these approaches rely on full-volume inputs, which typically require substantial computational resources and include redundant or non-informative anatomical regions that are not directly associated with disease-related patterns.
ROI-based methods provide a more efficient alternative by concentrating on anatomically relevant regions. However, many current ROI-based approaches analyze isolated structures independently, which limits their ability to capture distributed neurodegenerative patterns across multiple brain regions. Additionally, several studies utilize single-modality inputs, thereby restricting their capacity to model the multifactorial nature of Alzheimer’s disease.
Moreover, multimodal approaches have demonstrated enhanced performance by integrating imaging, clinical, and genetic data. However, these methods frequently lack effective strategies for selecting informative regions or for fully leveraging volumetric spatial relationships within MRI data.
While ROI-based CNN frameworks and transformer architectures have been previously investigated for Alzheimer’s disease classification, several limitations remain in current methodologies. Many ROI-based approaches process anatomical regions independently and rely on conventional feature concatenation or late-fusion strategies, limiting the ability to model interactions between distributed neuroanatomical structures and heterogeneous clinical information. Similarly, several transformer-based neuroimaging approaches focus primarily on full-volume representations without incorporating anatomically guided ROI decomposition or multimodal biomarker integration.
Although previous studies have demonstrated the benefits of transformer architectures, multimodal fusion, and ROI-based analysis for Alzheimer’s disease diagnosis, these approaches typically investigate these components independently. To address these limitations, we propose a unified multimodal transformer framework that combines anatomically guided multi-ROI MRI representations, heterogeneous clinical and volumetric biomarkers, and attention-based interpretability.
Recent advances in medical image analysis have demonstrated the effectiveness of transformer-based architectures, including TransMed [
55], UNETR [
56], Medical Transformer (MedT) [
57], and Swin-Unet3D [
58], for learning contextual representations from volumetric medical data. Similarly, attention-based multimodal fusion methods [
59] and graph neural network-based approaches [
60] have shown promising results for integrating heterogeneous biomarkers in Alzheimer’s disease diagnosis. In parallel, multi-ROI studies have demonstrated the importance of focusing on anatomically relevant brain regions rather than relying solely on whole-brain representations [
61]. However, existing approaches typically investigate these components independently, focusing on segmentation tasks, whole-brain analysis, population-level graph modeling, or CNN-based ROI feature extraction. In contrast, the proposed framework integrates atlas-guided multi-ROI MRI representations; multimodal fusion of imaging, clinical, and volumetric biomarkers; and attention-based interpretability within a unified 3D Vision Transformer architecture. Furthermore, ROI- and feature-level attention analyses are evaluated across cross-validation folds to quantify the consistency and stability of the learned representations. This combination enables anatomically grounded, clinically interpretable, and robust subject-level Alzheimer’s disease classification.
Specifically, the proposed framework employs a unified token-based multimodal transformer architecture that jointly models multiple anatomically constrained 3D ROIs together with clinical and volumetric biomarkers within a shared self-attention space. The methodology additionally incorporates feature-wise tokenization of structured biomarkers, hemisphere-aware ROI selection, and statistically grounded attention stability analysis across cross-validation folds. These components collectively enable anatomically informed multimodal representation learning together with quantitative interpretability analysis for Alzheimer’s disease classification.
To address these limitations, the present study introduces a multimodal 3D Vision Transformer framework that integrates multi-atlas ROI-based instance selection with heterogeneous clinical and imaging data. By combining anatomically informed ROI extraction with transformer-based global context modeling, the proposed framework enables comprehensive learning of distributed neurodegenerative patterns across multiple brain regions. This multimodal representation strategy enhances anatomical interpretability, robustness, and overall classification performance by integrating complementary spatial and clinical information associated with Alzheimer’s disease.
These gaps underscore the necessity for approaches that concurrently address volumetric representation, efficient ROI selection, and multimodal data integration within a unified learning framework.
3. Materials and Methods
This section presents the datasets and the proposed methodology for building AD classification models. It covers data preparation, instance selection, ROI extraction, 3D ROI batch generation, and the transformer classification model used in this work.
3.1. Datasets
This study utilizes three publicly available and widely adopted neuroimaging datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) [
10], the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) [
11], and the Open Access Series of Imaging Studies (OASIS) [
12]. These datasets provide complementary clinical and neuroimaging information and are widely used benchmarks for Alzheimer’s disease classification.
3.1.1. Multimodal Data Representation
Each subject is represented using multiple data modalities aligned with the proposed multimodal transformer framework:
3D MRI (ROI-based): Structural T1-weighted brain volumes are processed to extract anatomically relevant regions of interest (ROIs), including the hippocampus, entorhinal cortex, fornix, and major cortical lobes. These regions are strongly associated with AD-related neurodegeneration.
Clinical and Demographic Data: Subject-level attributes such as age and sex, together with cognitive scores (e.g., MMSE), are included to capture inter-subject variability.
Volumetric Biomarkers: Quantitative volumetric measures derived from neuroanatomical structures (e.g., hippocampus, amygdala, ventricles) are incorporated as structured features.
3.1.2. Diagnostic Labels
Subjects are labeled using the Clinical Dementia Rating (CDR) scale [
62]. In this study, a binary classification setting is adopted, where subjects with CDR = 0 are considered cognitively normal (CN), and subjects with CDR ≥ 1 are classified as Alzheimer’s disease (AD). Subjects with CDR = 0.5 (mild cognitive impairment) are excluded to ensure clear class separation.
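The labeling rule above can be written as a small mapping. The following is a minimal sketch, assuming the global CDR score is available as a number per subject; the function name and the toy subject list are hypothetical and are not taken from the study's code.

```python
from typing import Optional


def cdr_to_label(cdr: float) -> Optional[str]:
    """Map a global CDR score to the binary label described in the text."""
    if cdr == 0:
        return "CN"   # cognitively normal
    if cdr >= 1:
        return "AD"   # Alzheimer's disease
    return None       # CDR = 0.5 is excluded from the cohort


# Toy subjects (hypothetical IDs and scores).
subjects = [("s01", 0.0), ("s02", 0.5), ("s03", 1.0), ("s04", 2.0)]
labeled = {
    sid: cdr_to_label(cdr)
    for sid, cdr in subjects
    if cdr_to_label(cdr) is not None
}
```

Returning `None` for excluded subjects keeps the exclusion explicit, so that filtering happens in one visible place rather than implicitly during data loading.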
3.1.3. Cohort Construction
To ensure balanced class distribution and reduce cohort-specific bias, a stratified sampling strategy was applied across datasets. The final cohort consisted of 420 subjects (one MRI scan per subject), with 70 subjects per class (CN and AD) selected from each dataset (3 datasets × 2 classes × 70 subjects). This balanced design mitigates class imbalance and facilitates robust model evaluation across heterogeneous populations while avoiding repeated-subject bias and data leakage associated with longitudinal acquisitions.
Although the resulting dataset size is smaller than those employed in some large-scale deep learning studies, strict subject-level separation, cross-validation folds, and multimodal integration were incorporated to improve robustness and reduce overfitting risk. Furthermore, the merged multi-cohort design introduces variability in acquisition protocols, scanner characteristics, and demographic distributions, providing a more heterogeneous evaluation setting than single-cohort studies.
Demographic characteristics of the resulting cohort are summarized in
Table 1.
3.1.4. Data Harmonization
To reduce inter-dataset variability, all MRI volumes are spatially normalized to a common template space and intensity-normalized. Additionally, one scan per subject is retained to avoid data leakage and ensure independence between samples.
3.2. Proposed Methodology
This methodology aims to develop a multimodal transformer-based classification framework that integrates image information from multiple regions of interest (ROIs) with heterogeneous clinical and volumetric data to improve classification performance, robustness, and reliability.
As illustrated in
Figure 1, the proposed methodology begins with a data preparation stage, in which the dataset is partitioned into seven stratified subject-level folds for cross-validation. In each iteration, one fold is reserved for testing, while the remaining subjects are further divided into training and validation subsets for model development and hyperparameter optimization. The MRI volumes are then preprocessed to remove non-informative regions and ensure spatial alignment across subjects. Subsequently, the instance selection procedure identifies slices containing the predefined ROIs and estimates the corresponding centroid coordinates using the statistical mode. Finally, the proposed framework is trained and evaluated using transformer-based classification models that integrate ROI-based image representations with mixed data modalities within a unified multimodal representation space.
3.2.1. Dataset Preparation
This phase involves the selection of a single representative MRI volume per subject, followed by the partitioning of the dataset into training, validation, and test subsets. The selected volumes are subsequently subjected to preprocessing steps, including skull stripping, tissue segmentation, spatial normalization (registration), and volumetric feature extraction, to ensure anatomical consistency and facilitate reliable quantitative analysis across subjects.
Volume Dataset Building
For each subject, MRI volumes were chronologically ordered according to the acquisition date, and only the most recent scan was retained to ensure a single representative volume per subject. To address class imbalance, an undersampling strategy was subsequently applied by selecting an equal number of subjects per class (k), where k corresponded to the size of the minority class or a predefined lower threshold. The balanced dataset was then partitioned into seven stratified subject-level folds for cross-validation. In each iteration, one fold was reserved for testing, while the remaining subjects were used for training and validation. This protocol preserved class distributions across folds, reduced partition-related bias, and prevented data leakage through strict subject-level separation throughout the evaluation process.
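The undersampling and fold construction described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: k is taken as the minority-class size, subjects are dealt round-robin into folds after shuffling, and the function name, seed, and toy cohort sizes are hypothetical.

```python
import random
from collections import defaultdict


def balanced_stratified_folds(labels, n_folds=7, seed=0):
    """labels: dict mapping subject id -> class label.

    Undersamples every class to the minority-class size k, then deals the
    subjects of each class round-robin into n_folds subject-level folds.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sid, y in sorted(labels.items()):
        by_class[y].append(sid)
    k = min(len(sids) for sids in by_class.values())  # minority-class size
    folds = [[] for _ in range(n_folds)]
    for sids in by_class.values():
        rng.shuffle(sids)
        for i, sid in enumerate(sids[:k]):      # keep k subjects per class
            folds[i % n_folds].append(sid)      # preserves the class ratio
    return folds


# Toy cohort: one entry per subject (hypothetical sizes before balancing).
labels = {f"cn{i:03d}": "CN" for i in range(230)}
labels.update({f"ad{i:03d}": "AD" for i in range(210)})
folds = balanced_stratified_folds(labels)

# One cross-validation iteration per fold: fold i is the test set.
splits = []
for i, test_fold in enumerate(folds):
    train_val = [s for j, f in enumerate(folds) if j != i for s in f]
    splits.append((train_val, test_fold))
```

Because each subject contributes exactly one scan and appears in exactly one fold, no subject can occur on both sides of a split.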
Data Preprocessing
To ensure data consistency and enhance model performance, a series of preprocessing steps were applied across numerical, categorical, and imaging modalities. These steps were designed to standardize feature representations, reduce inter-subject variability, and preserve anatomical fidelity.
Numerical Data: Continuous variables were normalized using min–max scaling to the range , ensuring comparable feature magnitudes and stable optimization during training.
Categorical Data: Categorical variables were encoded using one-hot encoding, producing binary vectors with entries in {0, 1} and avoiding the introduction of ordinal relationships.
Image Intensity Scaling: MRI voxel intensities were normalized to the range to improve numerical stability and convergence of deep learning models.
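The scaling and encoding steps listed above can be illustrated with a short sketch. The target range [0, 1] used for min–max scaling here is an assumption of this example, and the helper names and toy values are hypothetical.

```python
def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1] (assumed target range).

    lo and hi default to the observed minimum and maximum; passing the
    training-set bounds explicitly lets the same mapping be reused on
    validation and test data.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant feature: map to 0
    return [(v - lo) / span for v in values]


def one_hot(value, categories):
    """Encode a categorical value as a binary indicator vector."""
    return [1 if value == c else 0 for c in categories]


ages = [60, 70, 80, 90]                 # toy ages in years
scaled_ages = min_max_scale(ages)
sex_vec = one_hot("F", ["F", "M"])
```

In a cross-validation setting, one reasonable choice is to compute `lo` and `hi` on the training subjects of each fold only, so that no information from the test fold enters the scaling.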
MRI Image Preprocessing
T1-weighted structural MRI scans were subjected to a standardized preprocessing pipeline to ensure anatomical consistency across subjects and datasets (ADNI, AIBL, and OASIS). The main steps are described as follows:
Skull Stripping: Raw MRI volumes were processed to remove non-brain tissues—including skin, fat, muscle, neck, and ocular structures—thereby isolating the intracranial region of interest.
Tissue Segmentation and Surface Reconstruction: The brain was segmented into major tissue classes, including gray matter (GM), white matter (WM), cerebrospinal fluid (CSF), and background. Subsequently, the white matter and pial surfaces were reconstructed, enabling accurate modeling of cortical boundaries [
26].
Spatial Normalization (Registration): Skull-stripped volumes were nonlinearly registered to the MNI152 T1-weighted template, ensuring uniformity in anatomical orientation, shape, and alignment. The resulting volumes were resampled to a standardized resolution of and dimensions of voxels.
Volumetric Feature Extraction: Region-of-interest (ROI) volumetric measures were computed, focusing on structures strongly associated with Alzheimer’s disease. These include the left and right hippocampus, amygdala, and lateral ventricles (including inferior lateral ventricles), which are often combined into bilateral measures to improve robustness and discriminative power [
27,
63].
All preprocessing steps, including spatial normalization and ROI extraction, were performed using predefined atlas-based procedures independent of class labels and model training. Subject-level separation was subsequently enforced during cross-validation to ensure an unbiased evaluation protocol.
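As an illustration of the bilateral volumetric measures mentioned above, the following sketch sums left and right structure volumes and divides by the estimated total intracranial volume (eTIV). The eTIV normalization is an assumption borrowed from the common practice summarized in Section 2, not a confirmed detail of this pipeline; the dictionary keys and the numerical values are hypothetical.

```python
def bilateral_features(volumes_mm3, etiv_mm3):
    """Combine left/right volumes into bilateral, eTIV-normalized features."""
    structures = ["hippocampus", "amygdala", "lateral_ventricle"]
    return {
        s: (volumes_mm3[f"left_{s}"] + volumes_mm3[f"right_{s}"]) / etiv_mm3
        for s in structures
    }


# Toy volumes in cubic millimetres (hypothetical values).
subject_volumes = {
    "left_hippocampus": 3500.0, "right_hippocampus": 3700.0,
    "left_amygdala": 1500.0, "right_amygdala": 1600.0,
    "left_lateral_ventricle": 12000.0, "right_lateral_ventricle": 11000.0,
}
features = bilateral_features(subject_volumes, etiv_mm3=1_500_000.0)
```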
3.2.2. Instance Dataset Building
Instance selection techniques play a crucial role in optimizing predictive models by prioritizing informative data, improving performance, reducing computational cost, and limiting dataset size. In this study, we adopt the instance selection framework proposed in ref. [
37], which addresses a key limitation in neuroimaging pipelines: redundant or non-informative slices may degrade model performance and increase computational cost. The framework comprises two complementary components. First, a multi-atlas ROI-based instance selection strategy integrates annotations from multiple atlases to retain the most informative and representative slices. Second, an ROI content extraction method employs the statistical mode to refine the centroid location, enabling precise extraction of relevant anatomical content for accurate 2D slice cropping. The complete procedure is summarized in Algorithm 1.
Algorithm 1. ROI-Based Instance Selection with Centroid Refinement.
Require: MRI volume; atlas set of M atlases; ROI label r; threshold
Ensure: Selected slices and refined centroid
1: for each atlas index from 1 to M do
2:     Register the atlas to the MRI space of the volume
3:     Extract the binary ROI mask for label r from the atlas
4: end for
5: Aggregate the ROI masks using voxel-wise majority voting
6: Identify informative slices
7: Compute slice-wise ROI centroids
8: Refine the centroid using the statistical mode
return selected slices and refined centroid
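The steps of Algorithm 1 can be sketched in NumPy as shown below. Several details are assumptions of this sketch rather than statements from the source: registration is taken as already done (the masks are given in subject space), the threshold is interpreted as a minimum number of ROI voxels per slice, slices are taken along the last axis, and the statistical mode is applied to each rounded centroid coordinate separately. The function name is hypothetical.

```python
from collections import Counter

import numpy as np


def select_instances(roi_masks, threshold, axis=2):
    """roi_masks: list of M binary 3D arrays, one per registered atlas.

    Returns the indices of informative slices and the refined in-plane
    centroid (row, col), or None as centroid if no slice is selected.
    """
    stack = np.stack([np.asarray(m, dtype=bool) for m in roi_masks])
    votes = stack.sum(axis=0)
    fused = votes * 2 > len(roi_masks)          # strict voxel-wise majority
    fused = np.moveaxis(fused, axis, 0)         # slices first: (S, H, W)
    counts = fused.reshape(fused.shape[0], -1).sum(axis=1)
    # Assumed rule: keep slices with at least `threshold` ROI voxels.
    selected = [int(s) for s in np.flatnonzero(counts >= max(threshold, 1))]
    centroids = []
    for s in selected:
        rows, cols = np.nonzero(fused[s])
        centroids.append((int(round(rows.mean())), int(round(cols.mean()))))
    if not centroids:
        return selected, None
    # Coordinate-wise mode over the slice-wise centroids.
    row_mode = Counter(c[0] for c in centroids).most_common(1)[0][0]
    col_mode = Counter(c[1] for c in centroids).most_common(1)[0][0]
    return selected, (row_mode, col_mode)


def box(r0, r1, shape=(8, 8, 6)):
    """Toy ROI mask: a box spanning rows r0:r1, cols 3:6, slices 1:5."""
    m = np.zeros(shape, dtype=bool)
    m[r0:r1, 3:6, 1:5] = True
    return m


# Three toy "atlases"; the third disagrees slightly on the row extent.
masks = [box(2, 5), box(2, 5), box(3, 6)]
selected, centroid = select_instances(masks, threshold=5)
```

Using the mode rather than the mean makes the refined centroid insensitive to a few slices in which the ROI cross-section is small or off-centre.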
3.2.3. Data Generation
Data generation is based on instance-level metadata, including age, MMSE score, sex, volume filename, slice index, ROI centroid coordinates, and class labels. To address memory constraints, data are processed in batches during training.
All preprocessing steps were applied as described in
Section 3.2.1. Finally, 3D regions of interest were extracted by cropping MRI volumes around the ROI centroid coordinates, ensuring consistent spatial localization of the input data.
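The centroid-based 3D cropping described above can be sketched as follows. This is a minimal example assuming a full 3D centre coordinate is available and that regions extending beyond the volume are zero-padded; the padding rule, the function name, and the toy sizes are assumptions of this sketch.

```python
import numpy as np


def crop_roi(volume, center, size):
    """Crop a box of `size` voxels centred on `center`, zero-padded at borders."""
    out = np.zeros(size, dtype=volume.dtype)
    src, dst = [], []
    for c, s, n in zip(center, size, volume.shape):
        start = c - s // 2
        lo = max(start, 0)
        hi = max(lo, min(start + s, n))   # empty range if fully outside
        src.append(slice(lo, hi))
        dst.append(slice(lo - start, hi - start))
    out[tuple(dst)] = volume[tuple(src)]
    return out


# Toy volume with distinct voxel values so crops are easy to check.
volume = np.arange(10 * 10 * 10).reshape(10, 10, 10)
inner = crop_roi(volume, center=(5, 5, 5), size=(4, 4, 4))
border = crop_roi(volume, center=(0, 5, 5), size=(4, 4, 4))
```

Every crop has the same shape regardless of where the centroid lies, which keeps the batches passed to the model uniform.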
3.2.4. Proposed Multimodal Transformer Architecture
The proposed model adopts a unified token-based transformer architecture to integrate heterogeneous modalities, including 3D MRI-derived regions of interest (ROIs), clinical variables, and volumetric biomarkers. The architecture is designed to enable direct interaction among imaging and non-imaging information within a shared attention-based representation space.
3D Tubelet Embedding and Patch Configuration
Prior to tokenization, each ROI volume is resized to a standardized input size to ensure consistent anatomical representation across subjects and ROIs. Each ROI is subsequently partitioned into non-overlapping 3D tubelets of a fixed patch size. Tubelet embedding is implemented through a 3D convolutional projection layer with kernel size and stride equal to the patch dimensions. The resulting tubelets are projected into a shared embedding space with embedding dimension $d$.
Because all ROI volumes share the same input size and patch configuration, each ROI yields the same number of tubelet tokens. Given the six selected anatomical regions, a total of 2304 MRI-derived tokens (384 per ROI) are generated before multimodal fusion.
The transformer encoder consists of four transformer encoder blocks, each employing four self-attention heads, embedding dimension $d$, feed-forward dimension $d_{\mathrm{ff}}$, residual connections, layer normalization, and dropout regularization.
Multi-ROI MRI Encoding
Each ROI is processed independently through a dedicated tubelet embedding stage, enabling the model to learn anatomically localized representations while preserving volumetric spatial information. The selected ROIs include the entorhinal cortex, fornix, frontal lobe, hippocampus, parietal lobe, and temporal lobe, which are strongly associated with Alzheimer’s disease pathology.
Tabular Feature Encoding
Clinical and demographic variables, including age, sex, and MMSE score, are transformed into feature-wise token representations using learnable embeddings. Similarly, volumetric biomarkers derived from FreeSurfer analyses, including bilateral hippocampal, amygdalar, ventricular, and entorhinal cortex volumes, together with whole-brain, gray matter, white matter, and cerebrospinal fluid (CSF) volumes, are encoded as individual learnable tokens. This feature-wise tokenization strategy enables structured non-imaging information to participate directly in the transformer attention mechanism, facilitating joint modeling of clinical, volumetric, and neuroanatomical biomarkers within a unified multimodal representation space.
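A minimal sketch of feature-wise tokenization is given below. It assumes a linear per-feature embedding, in which each scalar feature scales its own learnable vector and adds a learnable bias; the source only states that learnable embeddings are used, so this specific form and the names `tokenize_features`, `W`, and `b` are illustrative assumptions.

```python
import numpy as np

def tokenize_features(x, W, b):
    """Map F scalar features to F tokens of dimension d.

    x: (F,) feature values (e.g., age, sex, MMSE, volumetric measures).
    W: (F, d) per-feature weight vectors (learnable in a real model).
    b: (F, d) per-feature bias vectors (learnable in a real model).
    Returns an (F, d) array whose j-th row is x[j] * W[j] + b[j].
    """
    return x[:, None] * W + b
```

Each feature thus receives its own token, so the attention mechanism can weight, for example, MMSE and hippocampal volume independently.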
Multimodal Fusion Mechanism
The multimodal fusion strategy is based on token-level integration. ROI-derived MRI tokens, clinical feature tokens, and volumetric biomarker tokens are concatenated into a unified multimodal sequence. Unlike conventional fusion approaches based on feature concatenation or late integration, the proposed architecture enables direct token-level interaction among imaging, clinical, and volumetric biomarkers through shared self-attention operations. This design facilitates joint modeling of intra-modal and cross-modal relationships within a common representation space.
Positional and Modality Embeddings
After multimodal token fusion, a learnable classification token (CLS) is prepended to the sequence. Learnable positional embeddings are subsequently added to preserve token ordering and spatial context, while modality embeddings are incorporated to explicitly distinguish MRI, clinical, and volumetric tokens. These embeddings enable the transformer to jointly model heterogeneous information sources while preserving modality-specific characteristics.
Hybrid Representation Learning
The multimodal sequence is processed through four transformer encoder blocks, allowing information exchange among all token types through multi-head self-attention. The final representation is obtained through a hybrid aggregation strategy that combines the CLS-token representation with attention-based pooling across all tokens. This dual aggregation mechanism captures both global contextual information and distributed feature relevance.
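The hybrid aggregation can be sketched as follows. Two details are assumptions, since the source does not specify them: attention-pooling scores are computed with a single scoring vector `w`, and the CLS representation is combined with the pooled vector by concatenation.

```python
import numpy as np

def hybrid_pool(tokens, w):
    """Combine the CLS token with an attention-pooled summary of all tokens.

    tokens: (N + 1, d) encoder output, with the CLS token at index 0.
    w:      (d,) scoring vector (hypothetical; learnable in a real model).
    Returns the combined representation of size 2 * d and the pooling weights.
    """
    scores = tokens @ w
    scores = scores - scores.max()                    # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()     # normalized pooling weights
    pooled = alpha @ tokens                           # weighted sum over tokens
    return np.concatenate([tokens[0], pooled]), alpha
```

The returned weights `alpha` are the quantities that an attention-based importance analysis would inspect, since they sum to one over all tokens.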
Classification Head
The aggregated multimodal representation is passed through two fully connected layers with ReLU activation and dropout regularization. A final dense layer with softmax activation produces the binary classification output corresponding to cognitively normal (CN) and Alzheimer’s disease (AD) subjects.
Figure 2 illustrates the overall architecture of the proposed multimodal transformer framework. MRI-derived ROI tokens, clinical feature tokens, and volumetric biomarker tokens are integrated into a unified token sequence augmented with modality, positional, and CLS embeddings. The resulting representation is processed by stacked transformer encoder blocks and aggregated through hybrid CLS-token and attention-pooling mechanisms for binary Alzheimer’s disease classification. Detailed architectural specifications, tokenization procedures, and transformer configurations are provided in Section 3.2.4.
3.2.5. Mathematical Formulation of the Multimodal Transformer
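Let $X_k$ denote the $k$-th 3D region of interest (ROI), where $k = 1, \dots, K$ and $K = 6$ is the number of selected ROIs.
ROI Tokenization
Each ROI is partitioned into non-overlapping 3D tubelets of size $p_d \times p_h \times p_w$ and projected into an embedding space:
$$Z_k = \mathrm{TubeletEmbed}(X_k) \in \mathbb{R}^{N_k \times d},$$
where $N_k$ is the number of tokens and $d$ is the embedding dimension.
Tabular Tokenization
Let $x = (x_1, \dots, x_F)$ denote tabular features (clinical and volumetric). Feature-wise tokenization is defined as
$$t_j = x_j \, w_j + b_j, \quad j = 1, \dots, F,$$
where $w_j, b_j \in \mathbb{R}^{d}$ are learnable parameters.
Multimodal Token Fusion
All tokens are concatenated:
$$Z = [Z_1; \dots; Z_K; T], \quad T = [t_1; \dots; t_F].$$
A learnable classification token $z_{\mathrm{cls}} \in \mathbb{R}^{d}$ is prepended:
$$\tilde{Z} = [z_{\mathrm{cls}}; Z].$$
Feature Aggregation
The final representation combines the CLS token and attention pooling over the output of the last encoder block $L$:
$$h = \mathrm{Combine}\left(z_{\mathrm{cls}}^{(L)}, \; \sum_{i} \alpha_i z_i^{(L)}\right), \quad \alpha_i \ge 0, \; \sum_{i} \alpha_i = 1,$$
where $\alpha_i$ are the normalized attention-pooling weights.
Classification
The class probabilities are obtained from the classification head:
$$\hat{y} = \mathrm{softmax}\left(\mathrm{MLP}(h)\right).$$
Let $N_k$ denote the number of tubelet tokens extracted from the $k$-th ROI volume after 3D patch partitioning. The multimodal token sequence is defined as
$$Z = [Z_1; \dots; Z_K; T_{\mathrm{clin}}; T_{\mathrm{vol}}],$$
where $Z_k \in \mathbb{R}^{N_k \times d}$ represents the MRI tubelet tokens extracted from the $k$-th ROI, $T_{\mathrm{clin}}$ denotes the feature-wise tokenized clinical metadata, and $T_{\mathrm{vol}}$ denotes the feature-wise tokenized volumetric biomarkers, so that $T = [T_{\mathrm{clin}}; T_{\mathrm{vol}}]$.
After multimodal token fusion, a learnable CLS token and positional embeddings are incorporated:
$$Z^{(0)} = [z_{\mathrm{cls}}; Z] + E_{\mathrm{pos}} + E_{\mathrm{mod}},$$
where $E_{\mathrm{pos}}$ and $E_{\mathrm{mod}}$ correspond to learnable positional and modality embeddings, respectively. The resulting token sequence is processed through $L = 4$ transformer encoder layers employing four attention heads, embedding dimension $d$, and feed-forward dimension $d_{\mathrm{ff}}$.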
Training and Optimization
The model is trained end-to-end using a cross-entropy loss function. Hyperparameters, including the number of transformer layers, embedding dimension, number of attention heads, and learning rate, are systematically optimized to enhance model performance. In particular, this study employs the Hyperband algorithm [64], an efficient hyperparameter optimization strategy that extends random search by dynamically allocating computational resources and applying early stopping to poorly performing configurations. This approach enables effective exploration of the hyperparameter space while reducing computational cost. Additionally, regularization techniques such as dropout and early stopping are incorporated during training to mitigate overfitting and improve generalization.
4. Experimental Setup
This section presents the experimental framework for evaluating the proposed multimodal 3D Vision Transformer model for Alzheimer’s disease (AD) classification. It describes the datasets and MRI-derived regions of interest (ROIs) used in the experiments, along with the preprocessing procedures applied to ensure spatial consistency and reproducibility. The section also outlines the experimental scenarios designed to analyze the impact of mixed data modalities, the hyperparameter optimization strategy, and the statistical methods used to assess model performance. Finally, implementation details and computational settings are provided to facilitate reproducibility and enable fair comparisons with state-of-the-art approaches.
4.1. Experimental Dataset
The experiments were conducted on the multimodal dataset described in Section 3, which integrates data from the ADNI, AIBL, and OASIS cohorts. The dataset includes structural MRI, clinical metadata, and volumetric biomarkers, enabling comprehensive multimodal analysis.
All experiments follow the cohort construction and preprocessing procedures detailed in Section 3. In particular, a balanced dataset of 420 subjects is used, with equal representation of cognitively normal (CN) and Alzheimer’s disease (AD) cases across the three datasets.
To ensure fair evaluation and prevent data leakage, all experimental splits are performed at the subject level, with each subject contributing a single MRI scan. This guarantees independence between training and evaluation data.
The use of a merged multi-cohort dataset introduces variability in acquisition protocols and population characteristics, providing a more challenging and realistic evaluation setting for assessing model generalization.
No additional class balancing techniques were required during training, as the dataset was explicitly constructed to maintain equal class distribution.
4.2. Experimental Design
The proposed model was trained and evaluated on multiple 3D ROI datasets derived from anatomically relevant brain regions, including the entorhinal cortex, fornix, frontal lobe, hippocampus, parietal lobe, and temporal lobe. All datasets were generated from a common set of MRI volumes to ensure experimental consistency and to prevent data leakage across regions.
Prior to ROI extraction, the MRI volumes were processed using a standardized preprocessing pipeline that includes skull stripping, automated tissue segmentation, and spatial normalization to the MNI152 template. Tissue segmentation and anatomical delineation were performed using established neuroimaging tools, enabling consistent identification of brain structures across subjects. These steps ensure inter-subject alignment and reduce non-brain variability, thereby supporting reliable ROI extraction and robust feature learning.
4.2.1. ROI-Specific Hemispheric Analysis for Multi-ROI Representation
This experiment is designed to evaluate the contribution of anatomical regions of interest (ROIs) across cerebral hemispheres for Alzheimer’s disease (AD) classification, with the aim of defining an optimized multi-ROI representation. The analysis focuses on determining whether hemispheric lateralization influences the discriminative capacity of individual ROIs and how this information can be leveraged to improve model design.
The analyzed feature set includes:
Clinical metadata: Demographic and cognitive variables such as age, sex, and MMSE.
Volumetric biomarkers: Structural measurements derived from MRI, including gray matter, white matter, cerebrospinal fluid (CSF), hippocampus, amygdala, ventricles, entorhinal cortex, and whole-brain volume.
Each ROI was independently analyzed using MRI data extracted from both left and right hemispheres. The evaluated regions include the entorhinal cortex, fornix, frontal lobe, hippocampus, parietal lobe, and temporal lobe. For each ROI, two configurations were considered: (i) extraction from the left hemisphere and (ii) extraction from the right hemisphere. Each configuration was used to train and evaluate a classification model under identical conditions, ensuring a controlled hemispheric comparison.
Model evaluation was conducted using stratified 7-fold cross-validation, providing robust performance estimates and reducing variability associated with data partitioning. Performance metrics were computed independently for each ROI–hemisphere configuration.
Based on this experimental setup, a selection mechanism was defined to identify the most informative hemisphere for each ROI.
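Formally, the optimal hemisphere $h_r^{*}$ for each ROI $r$ is selected as
$$h_r^{*} = \arg\max_{h \in \{\mathrm{left}, \mathrm{right}\}} \mathrm{Acc}(r, h),$$
where $\mathrm{Acc}(r, h)$ denotes the cross-validated classification accuracy obtained with ROI $r$ extracted from hemisphere $h$.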
The selected ROI–hemisphere pairs are subsequently combined to construct the multi-ROI representation used as input to the proposed multimodal framework.
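The selection of one hemisphere per ROI can be sketched in a few lines. The sketch assumes that mean cross-validated accuracy is the selection criterion, as the results section indicates; the function name and the accuracy values in the usage example are placeholders, not reported results.

```python
def select_hemispheres(acc):
    """Pick, for each ROI, the hemisphere with the highest accuracy.

    acc maps (roi, hemisphere) -> mean cross-validated accuracy.
    Ties keep the hemisphere encountered first.
    """
    best = {}
    for (roi, hemi), value in acc.items():
        if roi not in best or value > best[roi][1]:
            best[roi] = (hemi, value)
    return {roi: hemi for roi, (hemi, _) in best.items()}


# Usage with placeholder accuracies (illustrative only).
example = {
    ("hippocampus", "left"): 0.61, ("hippocampus", "right"): 0.58,
    ("fornix", "left"): 0.55, ("fornix", "right"): 0.57,
}
selected = select_hemispheres(example)
```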
This design enables the identification of spatially localized and hemisphere-specific patterns relevant to AD classification while reducing the inclusion of redundant or non-informative regions. By focusing on anatomically meaningful inputs, the approach is expected to enhance both predictive performance and interpretability.
The outcomes of this analysis provide the methodological foundation for the ROI-based decomposition strategy evaluated in the subsequent ablation study.
4.2.2. Ablation Study Design: Multimodal Integration and ROI-Based Representation
This experiment investigates the contribution of different data modalities and assesses the effectiveness of the proposed ROI-based decomposition strategy within the multimodal framework for Alzheimer’s disease (AD) classification.
The experimental design considers four model configurations, each representing a different level of information integration:
MRI Only (ROI-based): Uses only MRI inputs extracted from anatomically defined regions of interest.
Tabular Only: Uses only non-imaging features, including clinical metadata (e.g., age, sex, MMSE) and volumetric biomarkers derived from structural MRI.
Whole-brain (w/o ROI): Uses full MRI volumes without ROI decomposition, providing a baseline to evaluate the impact of anatomically constrained representations.
Multi-ROI + Tabular (Proposed): Integrates ROI-based MRI inputs with clinical and volumetric features within a multimodal transformer architecture.
All configurations are trained and evaluated under identical conditions to ensure a fair comparison. The same preprocessing pipeline, data splits, and training protocol are applied across all models.
Performance is assessed using standard classification metrics, including accuracy and area under the ROC curve (AUC), computed independently for each fold.
To quantify the significance of performance differences between configurations, paired statistical tests are applied across folds. Specifically, paired Student’s t-tests are used to compare models, and effect sizes are measured using Cohen’s d, providing a standardized estimate of the magnitude of observed differences.
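The fold-wise comparison can be sketched with the standard library. The effect size below is the paired-samples variant of Cohen’s d (mean difference divided by the standard deviation of the differences); the source does not state which variant was used, so this choice is an assumption. The p-value, which follows from a t distribution with n − 1 degrees of freedom, is not computed here; a routine such as `scipy.stats.ttest_rel` provides it.

```python
import math
import statistics

def paired_t_and_d(a, b):
    """Paired t statistic and paired-samples Cohen's d for fold-wise metrics.

    a, b: per-fold scores of two models evaluated on the same folds.
    Returns (t statistic, Cohen's d, degrees of freedom).
    Raises ZeroDivisionError if all fold-wise differences are identical.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)            # sample standard deviation
    t = mean_d / (sd_d / math.sqrt(n))
    d = mean_d / sd_d
    return t, d, n - 1
```

With seven folds the test has only six degrees of freedom, so a large t statistic is needed for significance.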
This experimental design enables a systematic analysis of: (i) the individual contribution of imaging and non-imaging modalities, (ii) the added value of multimodal integration, and (iii) the impact of ROI-based decomposition compared to whole-brain representations.
The outcomes of this experiment are presented in the following section, where the comparative performance of each configuration is analyzed in detail.
4.2.3. Attention-Based Feature Importance: Clinical and Volumetric Contributions
This experiment evaluates the contribution of non-imaging features within the proposed multimodal framework by leveraging the attention-pooling mechanism. The objective is to obtain an interpretable approximation of feature importance for clinical metadata and volumetric biomarkers integrated into the model.
The analysis focuses exclusively on non-imaging tokens. To this end, MRI patch tokens corresponding to spatial representations are excluded, and only the attention weights associated with clinical and volumetric inputs are considered. This separation enables an isolated evaluation of structured features within the multimodal attention space.
Attention weights are extracted from the attention-pooling layer of the trained model for each fold in the 7-fold cross-validation. For each feature, attention scores are aggregated across all samples within a fold and subsequently averaged across folds to obtain a robust estimate of feature contribution. The variability in these estimates is quantified using the standard deviation across folds.
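The aggregation and ranking procedure can be sketched as follows, assuming that per-fold mean attention scores are already available for each feature. The function name, the input layout, and the use of the sample standard deviation are illustrative assumptions.

```python
import statistics

def rank_features(fold_scores):
    """Summarize per-fold attention scores and rank features by mean score.

    fold_scores: list with one dict per fold, mapping feature name to the
                 attention score averaged over the samples of that fold.
    Returns a list of (feature, (mean, std)) sorted by decreasing mean.
    """
    features = list(fold_scores[0].keys())
    summary = {}
    for f in features:
        values = [scores[f] for scores in fold_scores]
        summary[f] = (statistics.mean(values), statistics.stdev(values))
    return sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True)
```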
Since attention weights are normalized and not directly interpretable in absolute terms, features are ranked based on their relative importance. This ranking provides a comparative assessment of the contribution of each feature to the model’s decision-making process.
This experimental design enables the identification of the most influential non-imaging features contributing to classification decisions while ensuring robustness through aggregation across stratified 7-fold cross-validation. Additionally, it provides an interpretable connection between model predictions and clinically relevant biomarkers associated with Alzheimer’s disease.
The outcomes of this analysis are presented in the subsequent section, where the ranked feature importance scores are reported and discussed.
4.2.4. ROI Attention Analysis and Interpretability
This experiment was designed to evaluate the interpretability and stability of the proposed multimodal transformer by quantifying the contribution of each anatomical region of interest (ROI) to the final classification decision. Specifically, the analysis focused on determining whether the model assigns consistent attention to clinically relevant brain regions across the 7-fold cross-validation protocol.
To improve the statistical rigor of the interpretability analysis, attention-based features and ROI importance were quantitatively evaluated across the seven folds of the subject-level stratified cross-validation procedure. Attention weights were extracted from the attention-pooling layer of the trained multimodal transformer model and aggregated at both the feature and ROI levels. For each fold, attention scores were first averaged across all samples and subsequently summarized across the cross-validation procedure using mean ± standard deviation to quantify the magnitude, stability, and consistency of the learned attention patterns across different data partitions.
Additionally, a Friedman test was performed on the ROI attention distributions to evaluate whether any anatomical region received statistically dominant attention across the stratified 7-fold cross-validation. ROI stability was further quantified using the coefficient of variation and stability scores derived from the iteration-wise attention distributions. These analyses provide a statistically grounded evaluation of the consistency of the learned attention mechanisms rather than relying solely on qualitative visualization.
After training the proposed model in each fold, ROI-level attention importance was extracted from the CLS-token attention weights. Attention values were aggregated across transformer heads and tokens to obtain a single importance score for each ROI. The evaluated ROIs included the entorhinal cortex, hippocampus, fornix, frontal lobe, parietal lobe, and temporal lobe.
For each cross-validation fold, the attention scores were averaged at the ROI level. These fold-level values were then summarized across the seven folds using the mean and standard deviation. This procedure allowed the assessment of both the average contribution of each ROI and the variability in its attention response across different data splits.
To evaluate whether any ROI received statistically dominant attention, a Friedman test was applied across ROI attention scores obtained from the seven folds. When pairwise comparisons were required, Wilcoxon signed-rank tests were performed between ROI pairs, and Bonferroni correction was applied to control for multiple comparisons.
Finally, attention stability was quantified using the coefficient of variation (CV) and a normalized stability score. Lower CV values and higher stability scores were interpreted as evidence of more consistent regional contribution patterns across folds. This analysis was used to assess whether the model learned reproducible and anatomically meaningful attention distributions rather than relying on unstable or fold-specific patterns.
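The statistical quantities used in this analysis can be sketched with the standard library. The sketch computes the Friedman chi-square statistic without tie correction, the Bonferroni adjustment, and the coefficient of variation; the normalized stability score is omitted because its definition is not given in the text. Function names are hypothetical, and the p-value of the Friedman statistic (chi-square approximation with k − 1 degrees of freedom) is left to a statistics library such as `scipy.stats.friedmanchisquare`.

```python
import statistics

def average_ranks(row):
    """Rank values in ascending order, giving tied values their average rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def friedman_statistic(table):
    """table: n folds (rows) by k ROIs (columns). Returns (statistic, df)."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return stat, k - 1

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)
```

One consequence of this design is worth keeping in mind when reading the post hoc results: with seven folds, the smallest attainable exact two-sided Wilcoxon signed-rank p-value is 2/2^7 ≈ 0.0156, so after Bonferroni correction over the 15 ROI pairs (0.0156 × 15 ≈ 0.234) no pairwise comparison can reach the 0.05 level.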
4.2.5. Multimodal 3D Vision Transformer vs. State-of-the-Art Methods
This experiment is conducted to benchmark the proposed Multimodal 3D Vision Transformer against recent state-of-the-art methods for Alzheimer’s disease (AD) classification. The objective is to evaluate the effectiveness of the proposed approach in relation to existing transformer-based and hybrid deep learning models reported in the literature.
A comparative analysis is conducted using previously published studies that employ transformer-based architectures or related deep learning approaches on widely used neuroimaging datasets, including ADNI, AIBL, and OASIS. These methods encompass a range of modeling strategies, such as Vision Transformers, Swin Transformers, and hybrid CNN–Transformer architectures.
For comparison, the best-performing configuration of the proposed method is selected based on the experimental protocol defined in the previous sections. In addition, representative single-ROI configurations are included to assess the discriminative capacity of anatomically localized representations.
The evaluation is based on the classification accuracy reported for the AD vs. cognitively normal (CN) task, as this metric is consistently available across the selected studies; AUC is included when reported. When multiple datasets were used in prior work, their reported results are included as presented in the original publications.
It is important to note that differences in experimental protocols—such as dataset composition, preprocessing pipelines, and validation strategies—may affect direct comparability. In particular, many prior studies report single-split evaluations or lack fold-wise performance statistics. In contrast, the proposed method is evaluated using a 7-fold cross-validation protocol, reporting mean and standard deviation to provide a more robust estimate of performance.
The proposed model is trained and evaluated on a merged multi-cohort dataset combining ADNI, AIBL, and OASIS subjects, introducing additional variability and increasing the complexity of the classification task. This setting provides a more realistic and challenging benchmark compared to single-cohort evaluations.
This experimental design enables a contextualized comparison between the proposed multimodal framework and existing approaches, highlighting differences in model architecture, input modalities, and evaluation protocols.
4.3. Hyperparameter Optimization
Hyperparameter optimization was conducted using the Hyperband algorithm [64], which efficiently explores large and complex search spaces through adaptive resource allocation and early-stopping strategies.
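The early-stopping idea behind Hyperband can be illustrated with successive halving, the subroutine that Hyperband runs in several brackets with different trade-offs between the number of configurations and the initial budget. The sketch below shows a single bracket only; the function name, the reduction factor `eta`, and the toy objective in the usage example are illustrative assumptions, not the settings used in this study.

```python
def successive_halving(configs, evaluate, min_resource=1, eta=3):
    """Keep the best 1/eta of the configurations each round, growing the budget.

    configs:  candidate hyperparameter settings.
    evaluate: function (config, resource) -> validation score, higher is better.
    Returns the surviving configuration and the (num_configs, resource) history.
    """
    survivors = list(configs)
    resource = min_resource
    history = []
    while len(survivors) > 1:
        history.append((len(survivors), resource))
        scored = sorted(survivors, key=lambda c: evaluate(c, resource), reverse=True)
        survivors = scored[:max(1, len(scored) // eta)]   # discard poor performers
        resource *= eta                                    # survivors train longer
    return survivors[0], history


# Usage with a toy objective that prefers learning rates close to 0.01.
rates = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
best, history = successive_halving(rates, lambda lr, epochs: -abs(lr - 0.01))
```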
The search space encompasses optimization-related parameters (optimizer type and learning rate), regularization mechanisms (dropout rate), and training configuration (batch size and number of epochs), as well as key architectural components of the transformer model, including the number of transformer layers and embedding dimensionality. The optimization objective was defined in terms of validation accuracy, enabling the selection of configurations that maximize generalization performance.
Table 2 summarizes both the explored hyperparameter search space and the final selected configuration.
Following the Hyperband-based search, a targeted fine-tuning phase was performed to further improve training stability and convergence behavior. This additional refinement step allows the model to operate under a more optimized configuration beyond the discretized search space.
4.4. Statistical Analysis
Statistical analyses were performed to assess the robustness and variability of the proposed multimodal transformer framework across cross-validation folds. For each experiment, classification accuracy and ROC-AUC were computed independently across the stratified 7-fold cross-validation and summarized using mean ± standard deviation. This cross-validation strategy provides a more robust estimate of model generalization performance under varying data partitions.
In addition to performance averages, the variability observed across cross-validation folds was used to characterize the stability of the proposed framework under limited-data conditions. Since transformer-based architectures are known to be sensitive to dataset size and training variability, stratified 7-fold cross-validation was adopted to mitigate partition-specific bias and reduce the risk of overestimating performance due to favorable train–test splits.
For the ablation study, paired Student’s t-tests were applied across folds to compare the proposed Multi-ROI + Tabular model against the MRI-only, tabular-only, and whole-brain configurations. A significance level of $\alpha = 0.05$ was adopted. In addition to p-values, Cohen’s d was computed to estimate the magnitude of the observed differences, enabling both statistical and practical interpretation of model improvements.
For attention-based interpretability analyses, feature and ROI attention scores were first averaged within each fold and then summarized across the seven folds using mean and standard deviation. Since ROI attention scores are repeated measurements obtained from the same cross-validation folds, a Friedman test was used to evaluate whether statistically significant differences existed among ROI attention distributions. When post hoc pairwise comparisons were required, Wilcoxon signed-rank tests were applied with Bonferroni correction to control for multiple comparisons.
Finally, attention stability across folds was quantified using the coefficient of variation (CV) and a normalized stability score. Lower CV values and higher stability scores were interpreted as evidence of more consistent attention allocation across folds, supporting the reproducibility of the model’s learned anatomical importance patterns.
4.5. Implementation Details
A stratified 7-fold cross-validation strategy with subject-level separation was employed to prevent data leakage across subsets. In each iteration, one fold was reserved as an independent test set, while the remaining subjects were used for training and validation. The validation subset was used exclusively for hyperparameter optimization, model selection, and early stopping. Evaluation was performed across seven stratified folds to improve robustness and reduce partition-related bias. Importantly, the independent test subset remained completely unseen during training and validation and was used solely for final performance assessment.
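A stratified subject-level split can be sketched with the standard library. The sketch assumes one scan per subject, as stated in Section 4.1, so that splitting subjects is equivalent to splitting samples; the function name, the seed, and the round-robin assignment are illustrative choices. With 420 balanced subjects and seven folds, each fold contains 60 subjects, 30 per class.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=7, seed=0):
    """Split subjects into k folds with (near-)equal class proportions.

    labels: dict mapping subject id -> class label.
    Returns a list of k lists of subject ids; each subject appears exactly once.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for subject, label in sorted(labels.items()):   # sorted for reproducibility
        by_class[label].append(subject)
    folds = [[] for _ in range(k)]
    for subjects in by_class.values():
        rng.shuffle(subjects)
        for i, subject in enumerate(subjects):
            folds[i % k].append(subject)            # round-robin within each class
    return folds
```

In each iteration one fold serves as the test set, and a validation subset can be carved out of the remaining six folds in the same stratified manner.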
The multiple-input model was implemented in Python 3.12.7 (Python Software Foundation, Wilmington, DE, USA) using TensorFlow 2.15.1 and Keras 2.15.0 (Google LLC, Mountain View, CA, USA), with the random seed set for NumPy, TensorFlow, Random, and OS libraries to ensure reproducible results. Images were preprocessed using NiBabel 5.4.2, TorchIO 1.0.2, Pillow (PIL) 12.2.0, and NumPy 1.26.4. Skull stripping and MRI registration with the MNI152 template were performed using FreeSurfer 8.2.0 (Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Charlestown, MA, USA). All evaluated models were trained sequentially on ten workstations equipped with Intel Core i9-9900K processors, 32 GB RAM, and 11 GB NVIDIA RTX 2080 Ti GPUs.
5. Results
The proposed Multimodal 3D Vision Transformer was comprehensively evaluated through a series of experiments designed to assess its predictive performance, interpretability, and robustness for Alzheimer’s disease classification. The evaluation includes (i) analysis of ROI-specific hemispheric contributions to identify the most discriminative anatomical regions, (ii) an ablation study to quantify the impact of multimodal integration and ROI-based representation, (iii) attention-based feature importance analysis to interpret the contribution of clinical and volumetric variables, and (iv) a comparison with state-of-the-art transformer-based methods. Performance was assessed using accuracy and area under the receiver operating characteristic curve (ROC-AUC) within a subject-level stratified seven-fold cross-validation framework to provide robust and reliable estimates of model performance and generalization capability.
5.1. Analysis of ROI-Specific Hemispheric Contributions for Multi-ROI Alzheimer’s Disease Classification
The performance of individual ROIs across hemispheres is summarized in Table 3. For each ROI, the hemisphere achieving the highest classification accuracy is identified and highlighted. These best-performing ROI–hemisphere pairs are subsequently selected to construct the proposed multi-ROI representation. This selection strategy ensures that only the most discriminative anatomical regions contribute to the final model, improving both predictive performance and interpretability. Notably, the results reveal hemispheric asymmetries, indicating that certain brain regions provide more informative features depending on lateralization. This observation is consistent with known patterns of neurodegeneration in Alzheimer’s disease and provides insight into the spatial distribution of disease-relevant biomarkers, supporting anatomically grounded model decisions.
These findings motivate the use of ROI-based decomposition evaluated in the following ablation study.
5.2. Ablation Study: Contribution of Multimodal Integration and ROI-Based Representation
To quantitatively assess the contribution of each modality and the effectiveness of the proposed multi-ROI decomposition strategy, an ablation study was conducted comparing different model configurations, including MRI-only (ROI-based), tabular-only, whole-brain input without ROI decomposition, and the full multimodal framework. As shown in Table 4, the proposed Multi-ROI + Tabular model achieves the highest overall performance (AUC 95% CI: [0.9885, 0.9995]; accuracy 95% CI: [96.50, 98.74]), substantially outperforming the MRI-only configuration (AUC 95% CI: [0.6023, 0.7057]; accuracy 95% CI: [59.56, 67.10]).
The tabular-only model exhibits strong baseline performance (AUC 95% CI: [0.9789, 0.9953]; accuracy 95% CI: [92.55, 93.63]), confirming that clinical and volumetric features capture highly discriminative disease-related patterns. However, the integration of ROI-based MRI information yields consistent improvements in both AUC and accuracy, demonstrating that localized anatomical features provide complementary information beyond structured clinical data.
In comparison with the whole-brain configuration (AUC 95% CI: [0.9796, 0.9998]; accuracy 95% CI: [94.82, 97.08]), the proposed multi-ROI model achieves comparable AUC but a noticeable improvement in classification accuracy. This suggests that ROI-based decomposition mainly improves the thresholded classification decisions rather than global ranking performance. Moreover, the whole-brain model tends to rely on diffuse spatial representations, limiting interpretability, whereas the proposed approach enables anatomically grounded attention focused on disease-relevant regions such as the hippocampus and entorhinal cortex.
Overall, these results demonstrate that (i) MRI data alone is insufficient for robust classification, (ii) tabular clinical features provide a strong predictive foundation, and (iii) the integration of structured data with anatomically constrained MRI representations yields superior performance while improving interpretability. This confirms the effectiveness of the proposed multimodal transformer framework for Alzheimer’s disease classification.
Statistical significance was assessed using paired t-tests across the 7-fold cross-validation for both AUC and accuracy metrics (Table 5). The proposed multimodal model significantly outperforms the MRI-only configuration (p < 0.001), with extremely large effect sizes as measured by Cohen’s d, indicating a substantial performance gap. It also achieves statistically significant improvements over the tabular-only model (p < 0.01 for accuracy and p < 0.05 for AUC), with large effect sizes, confirming the strong contribution of multimodal integration.
In contrast, the difference between the proposed model and the whole-brain configuration is smaller and not consistently statistically significant (p > 0.05 for AUC), although a significant improvement is observed in accuracy (p < 0.05) with a moderate-to-large effect size. This suggests that while both models capture highly discriminative imaging features, the multimodal approach provides additional refinement through complementary clinical and volumetric information.
Importantly, beyond predictive performance, the proposed multi-ROI framework offers enhanced interpretability by explicitly focusing on anatomically relevant brain regions. This enables spatially localized and clinically meaningful explanations, in contrast to the more diffuse representations produced by whole-brain models.
Overall, these results demonstrate that multimodal integration yields statistically significant and practically meaningful improvements, while ROI-based decomposition enhances interpretability, supporting the suitability of the proposed framework for clinically explainable AI in Alzheimer’s disease diagnosis.
To further elucidate the contribution of non-imaging features within the multimodal framework, we analyze their influence through the attention-pooling mechanism.
5.3. Attention-Based Feature Importance: Clinical and Volumetric Contributions
To provide deeper insight into the multimodal decision-making process, we examine the relative contribution of non-imaging features using the attention-pooling mechanism. Feature importance scores were obtained by aggregating attention weights across the 7-fold cross-validation (mean ± standard deviation), ensuring robustness and consistency of the identified biomarkers. Since the absolute magnitude of attention scores is not directly interpretable, features are ranked based on their relative importance to assess their contribution to the model.
After excluding MRI patch tokens, the attention weights associated with clinical metadata and volumetric biomarkers were extracted and ranked to identify the most influential variables. As reported in Table 6, the resulting top-k features offer an interpretable approximation of feature importance within the multimodal framework. The small numerical scale reflects normalization of attention weights rather than weak feature contribution.
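The ranking procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each fold is assumed to provide one attention-pooling weight per token, the token names and weights are hypothetical, and the weights are not renormalized after the MRI patch tokens are dropped.

```python
import statistics


def rank_tabular_features(fold_attention, token_names, is_mri_token, k=5):
    """Rank non-imaging tokens by their mean attention weight across folds.

    fold_attention: one list of per-token attention weights per fold.
    Returns up to k tuples of (name, mean, standard deviation).
    """
    keep = [i for i, is_mri in enumerate(is_mri_token) if not is_mri]
    summary = []
    for i in keep:
        per_fold = [fold[i] for fold in fold_attention]
        summary.append(
            (token_names[i], statistics.mean(per_fold), statistics.stdev(per_fold))
        )
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary[:k]


# Hypothetical tokens and weights for three folds (illustrative only).
names = ["patch_0", "patch_1", "MMSE", "age", "gray_matter_volume"]
mri_mask = [True, True, False, False, False]
folds = [
    [0.40, 0.35, 0.08, 0.05, 0.12],
    [0.42, 0.33, 0.07, 0.06, 0.12],
    [0.38, 0.36, 0.09, 0.04, 0.13],
]

top = rank_tabular_features(folds, names, mri_mask, k=2)
for name, mean, sd in top:
    print(f"{name}: {mean:.4f} ± {sd:.4f}")
```

Because only the ordering is interpreted, ranking on the raw or on renormalized weights gives the same top-k list; the choice only affects the numerical scale.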
The results reveal a clear predominance of volumetric biomarkers, particularly gray matter volume, ventricular structures, and entorhinal cortex regions, which consistently rank among the most influential variables. These findings are well aligned with established neurobiological evidence, as structural atrophy in medial temporal regions and ventricular enlargement are key hallmarks of Alzheimer’s disease progression. In addition, clinical variables such as MMSE, age, and sex also exhibit meaningful contributions, highlighting the complementary role of demographic and cognitive information.
Notably, the relatively low variability (standard deviation) across folds indicates that the identified features are stable and consistently leveraged by the model. Overall, this analysis demonstrates that the attention-pooling mechanism not only supports high predictive performance but also enables clinically coherent interpretation by prioritizing biologically relevant biomarkers.
These results reinforce the clinical validity of the proposed model by demonstrating that the learned representations are consistent with established neurodegenerative biomarkers.
Furthermore, these interpretable insights complement the quantitative performance analysis presented in the subsequent comparison with state-of-the-art methods.
5.4. ROI Attention Analysis and Interpretability
ROI attention importance was computed using CLS-token attention aggregated across transformer heads and tokens. To ensure statistical rigor, attention scores were first averaged per fold and subsequently summarized across the 7-fold cross-validation protocol using mean ± standard deviation.
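One possible reading of this aggregation is sketched below. The text does not specify how heads and tokens are combined, so the choices here (average over heads, then sum over the tokens belonging to each ROI) are assumptions, as are the function name and the toy numbers.

```python
def roi_attention_from_cls(cls_attention, token_roi):
    """Aggregate CLS-query attention into one score per ROI.

    cls_attention[h][k]: attention from the CLS token to token k in head h.
    token_roi[k]: ROI name of token k, or None for CLS and non-imaging tokens.
    """
    n_heads = len(cls_attention)
    scores = {}
    for k, roi in enumerate(token_roi):
        if roi is None:
            continue
        head_mean = sum(cls_attention[h][k] for h in range(n_heads)) / n_heads
        scores[roi] = scores.get(roi, 0.0) + head_mean
    return scores


# Toy example: 2 heads, 1 CLS token followed by 4 imaging tokens (illustrative).
token_roi = [None, "hippocampus", "hippocampus", "entorhinal", "entorhinal"]
cls_attention = [
    [0.2, 0.1, 0.2, 0.3, 0.2],
    [0.4, 0.1, 0.1, 0.2, 0.2],
]

roi_scores = roi_attention_from_cls(cls_attention, token_roi)
print(roi_scores)
```

Scores obtained this way for each subject would then be averaged within a fold and summarized across folds as mean ± standard deviation.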
As illustrated in Figure 3, the proposed multimodal transformer distributes attention relatively consistently across anatomically relevant brain regions associated with Alzheimer’s disease. In terms of mean attention values, the entorhinal cortex achieved the highest average attention weight, followed closely by the parietal lobe and the fornix. These findings are anatomically meaningful, as the entorhinal cortex and fornix are strongly linked to memory impairment and early neurodegenerative progression in Alzheimer’s disease.
The hippocampus, despite being one of the most clinically recognized biomarkers of Alzheimer’s disease, exhibited the lowest average attention weight. However, its low standard deviation indicates that the model attends to this region consistently across folds, suggesting a stable, if smaller, contribution relative to the other ROIs.
As shown in Figure 3, the relatively small differences between attention means across ROIs suggest that the proposed architecture does not rely excessively on a single anatomical region. Instead, the model appears to learn a distributed multimodal representation in which multiple ROIs contribute complementary information to the classification process. This behavior is desirable in neuroimaging applications because Alzheimer’s disease affects multiple interconnected brain structures rather than a single isolated region.
To assess whether the observed differences in attention allocation among ROIs were statistically meaningful, non-parametric analyses were performed across cross-validation folds. A Friedman test across ROIs did not reach statistical significance, indicating that no ROI received statistically dominant attention across folds and suggesting that the transformer distributes attention in a relatively balanced manner among the selected anatomical regions.
Pairwise Wilcoxon signed-rank tests were subsequently conducted to explore potential differences between specific ROI pairs. Although the comparison between the entorhinal cortex and hippocampus produced the lowest uncorrected p-value, this difference did not remain statistically significant after Bonferroni correction. All other pairwise comparisons similarly showed non-significant adjusted p-values.
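To make the two tests concrete, the sketch below computes the Friedman chi-square statistic from a folds-by-ROIs table and works through the Bonferroni arithmetic for the pairwise comparisons. It is an illustration under assumptions: the six ROI names are taken from those mentioned in this section, the toy table is hypothetical, no tie correction is applied to the Friedman statistic, and the bound assumes an exact two-sided Wilcoxon signed-rank test with no zero differences.

```python
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square for a table with one row per fold and one column per ROI."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical attention table: 3 folds x 3 ROIs (illustrative only).
toy_table = [[0.15, 0.17, 0.18], [0.16, 0.17, 0.19], [0.15, 0.16, 0.18]]
toy_statistic = friedman_statistic(toy_table)

# Bonferroni arithmetic for the pairwise Wilcoxon tests.
rois = ["entorhinal", "hippocampus", "fornix", "frontal", "temporal", "parietal"]
n_folds = 7
n_pairs = len(list(combinations(rois, 2)))
min_exact_p = 2 / 2 ** n_folds  # smallest two-sided p reachable with 7 paired values
best_adjusted_p = min(1.0, min_exact_p * n_pairs)
print(n_pairs, min_exact_p, best_adjusted_p)
```

Under these assumptions, the smallest attainable uncorrected p-value with seven folds is 0.015625, and multiplying by the 15 ROI pairs gives 0.234375. No pairwise comparison could therefore reach the 5% level after Bonferroni correction, so the non-significant adjusted p-values reflect the limited number of folds as much as the similarity of the ROIs.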
Overall, these findings indicate that the proposed multimodal transformer learns a stable and anatomically distributed attention strategy, where multiple ROIs jointly contribute to Alzheimer’s disease classification without a statistically dominant single region. This behavior supports the robustness, interpretability, and biological plausibility of the proposed multi-ROI representation framework.
The results presented in Table 7 demonstrate that the proposed multimodal transformer produces robust and stable attention distributions across cross-validation folds for all evaluated ROIs. Stability was quantified using the coefficient of variation (CV) and a normalized stability score, where lower CV values and higher stability scores indicate more consistent attention allocation.
Among all regions, the frontal lobe achieved the highest stability score (0.9399) and the lowest coefficient of variation (0.0640), indicating that the model consistently assigns similar attention weights to this region across different training folds. This suggests that frontal lobe representations contribute robust and reproducible discriminative information for Alzheimer’s disease classification.
The entorhinal cortex and hippocampus also exhibited high stability scores above 0.93, which is anatomically meaningful because these regions are strongly associated with early neurodegenerative changes in Alzheimer’s disease. Their consistent attention allocation supports the biological plausibility of the proposed attention mechanism.
The parietal lobe and fornix showed slightly higher variability but still maintained strong stability scores above 0.92, indicating reliable participation in the decision-making process.
In contrast, the temporal lobe exhibited the highest coefficient of variation (0.1175) and the lowest stability score (0.8949), indicating comparatively greater variability in attention allocation across the stratified 7-fold cross-validation procedure. Although still relatively stable, this behavior may indicate higher sensitivity to inter-subject anatomical variability or dataset heterogeneity in temporal lobe patterns.
Overall, these findings indicate that the proposed multimodal transformer learns reproducible and anatomically coherent attention patterns, supporting the robustness and interpretability of the ROI-based representation strategy.
The consistency of attention allocation across folds further suggests that the learned ROI representations contribute reliably to the overall classification performance of the proposed multimodal framework.
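The stability metrics used in this subsection can be sketched as follows. The formula for the normalized stability score is not stated in this section; the form 1 / (1 + CV) used below is an assumption, chosen because it reproduces the reported pairs up to rounding (CV 0.0640 with score 0.9399 for the frontal lobe, CV 0.1175 with score 0.8949 for the temporal lobe). The per-fold attention values are hypothetical, and the sample standard deviation is likewise an assumption.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def stability_score(cv):
    """Assumed normalization: 1 / (1 + CV), so that CV = 0 maps to a score of 1."""
    return 1.0 / (1.0 + cv)


# Hypothetical per-fold mean attention for one ROI (7 folds, illustrative only).
fold_attention = [0.162, 0.171, 0.158, 0.166, 0.175, 0.160, 0.169]
cv = coefficient_of_variation(fold_attention)
print(f"CV = {cv:.4f}, stability = {stability_score(cv):.4f}")

# Consistency check against the CV values quoted in the text.
print(round(stability_score(0.1175), 4))  # temporal lobe
print(round(stability_score(0.0640), 4))  # frontal lobe
```

Under this assumed mapping the score is a monotone transform of CV, so the two columns of Table 7 carry the same ranking information.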
5.5. Multimodal 3D Vision Transformer vs. State-of-the-Art Methods
Due to the relatively recent adoption of 3D Vision Transformers in medical imaging, only a limited number of studies have explored multimodal settings that combine multiple inputs, heterogeneous data sources, and merged datasets for Alzheimer’s disease (AD) classification. To provide a comprehensive benchmark against state-of-the-art approaches, Table 8 summarizes the performance of recent transformer-based methods evaluated on commonly used datasets. For direct comparison, the best-performing configuration of the proposed method from Table 3 is included.
As shown in Table 8, the proposed Multimodal 3D Vision Transformer achieves the highest reported accuracy (97.62%) among the compared methods. In particular, it surpasses the best-reported transformer-based approach [38], which used a Vision Transformer model on ADNI.
Furthermore, the single-ROI configurations also demonstrate competitive performance, with several regions, such as the fornix, parietal lobe, and temporal lobe, reaching high accuracies on their own. This highlights the strong discriminative capacity of anatomically focused representations, while the multimodal multi-ROI configuration provides a more robust and generalizable solution.
The superior performance of the proposed model can be attributed to three key factors. First, the use of a 3D Vision Transformer enables the joint modeling of spatial and contextual information across volumetric MRI data through self-attention mechanisms. Second, the multi-ROI strategy focuses the model on disease-relevant anatomical regions, reducing noise from non-informative areas. Finally, the integration of heterogeneous data—including clinical metadata, cognitive assessments, and volumetric biomarkers—enhances the model’s ability to capture complementary patterns associated with disease progression. Together, these components result in a highly accurate and interpretable framework for Alzheimer’s disease classification.
The proposed model demonstrates competitive performance within the range of results reported by selected published studies. However, direct quantitative comparison is inherently limited by differences in datasets, preprocessing pipelines, subject selection criteria, and validation protocols. Notably, our evaluation reports mean and standard deviation across 7-fold cross-validation, which gives a more reliable performance estimate than a single train/test split.
Table 8 presents a contextual comparison between the proposed framework and selected state-of-the-art studies reported in the literature. Because these studies employ different datasets, preprocessing pipelines, ROI definitions, multimodal configurations, and validation strategies, the comparison should not be interpreted as a controlled statistical superiority analysis. Instead, the table is intended to provide a qualitative reference for situating the reported performance of the proposed method within the current literature.
It is important to note that the proposed method is evaluated on a merged multi-cohort dataset, which introduces additional variability and increases the difficulty of the classification task.
The ROC curves presented in Figure 4 illustrate the discriminative performance of the proposed Multimodal 3D Vision Transformer across the 7-fold cross-validation. The consistently high true positive rates observed across folds, together with a mean AUC of 0.9940 ± 0.0059, indicate excellent classification capability and robust generalization. The narrow variability further suggests stable model behavior across different data partitions. Overall, these results demonstrate the model’s strong ability to accurately distinguish between cognitively normal (CN) and Alzheimer’s disease (AD) subjects.
Together, these results demonstrate that the proposed framework effectively balances predictive performance, robustness, and interpretability, addressing key challenges in clinically applicable Alzheimer’s disease classification.
6. Limitations
Despite its strong performance, the proposed Multimodal 3D Vision Transformer presents several limitations. The use of public datasets (ADNI, AIBL, OASIS) introduces variability in acquisition protocols, scanner characteristics, and demographics, potentially leading to residual biases despite preprocessing. Additionally, diagnostic labels are based on clinical assessments rather than pathological confirmation, which may introduce label noise.
The reliance on predefined ROIs enhances interpretability but may overlook other informative regions and limit the capture of diffuse or atypical neurodegeneration patterns. Moreover, ROI-based sampling may result in partial information loss. Future work should explore adaptive or data-driven ROI selection strategies.
From a computational perspective, Transformer-based multimodal models still require substantial computational resources, particularly when processing volumetric neuroimaging data and multiple heterogeneous input streams, which may limit scalability and practical clinical deployment. Although the proposed framework demonstrated consistent performance across the seven stratified folds, external validation on independent cohorts remains necessary to further assess generalization across unseen populations and acquisition settings. Additionally, the binary classification setting considered in this study does not fully capture the continuous and heterogeneous progression of Alzheimer’s disease. Furthermore, computational efficiency was not explicitly evaluated through quantitative complexity analyses such as FLOPs, inference time, or memory consumption. Consequently, the proposed ROI-based decomposition should be interpreted primarily as an anatomically motivated strategy for emphasizing disease-relevant brain regions rather than as a computationally optimized framework.
Although the proposed framework incorporates attention-based interpretability mechanisms together with quantitative stability analysis, several limitations remain regarding the clinical validation of the learned attention distributions. The current study did not perform direct correlation analyses between attention scores and continuous clinical measurements such as MMSE or CDR values, nor did it include direct quantitative comparison against external clinical ROI importance rankings derived from independent neuroimaging studies. Furthermore, although the framework integrates FreeSurfer-derived volumetric biomarkers associated with Alzheimer’s disease pathology, additional validation studies are required to establish stronger correspondence between learned attention patterns and clinically validated neurodegenerative biomarkers. Future work should therefore investigate longitudinal clinical correlation analysis, cohort-specific interpretability evaluation, and multimodal explainability techniques to further strengthen the clinical reliability and biological interpretability of transformer-based neuroimaging models.
Two further limitations concern the experimental design. First, the proposed framework was evaluated exclusively under a binary classification setting (CN vs. AD), while intermediate clinical stages such as subjective memory complaints (SMCs), early mild cognitive impairment (EMCI), and late mild cognitive impairment (LMCI) were not included in the experimental design. Although this simplified setting enabled clearer class separation and facilitated controlled evaluation of the proposed multimodal transformer architecture, it does not fully represent the complexity and heterogeneity of Alzheimer’s disease progression encountered in clinical practice. Future work should therefore extend the proposed framework toward multi-class and longitudinal progression modeling scenarios focused on early-stage disease detection and MCI-to-AD conversion prediction.
Second, although the study employed a merged multi-cohort dataset combining ADNI, AIBL, and OASIS to increase data heterogeneity and improve robustness, cohort-specific analyses—including sensitivity and specificity evaluation per dataset—were not explicitly performed. Consequently, the influence of site-specific acquisition variability, scanner differences, and demographic heterogeneity on model performance remains insufficiently characterized. Additional external validation studies and cross-site evaluation protocols are necessary to further assess the generalization capability and clinical applicability of the proposed framework across independent populations and acquisition settings.
The integration of clinical and demographic data introduces challenges such as missing values, temporal inconsistencies, and potential imputation bias. While attention mechanisms improve interpretability, they provide only partial explanations of model decisions, and enhancing transparency remains essential for clinical adoption.
Finally, dependence on atlas-based ROI annotations and limited dataset sizes may affect robustness and generalization. Addressing these limitations through improved data diversity, external validation, adaptive feature selection, and more interpretable models is crucial for advancing real-world applicability.
7. Discussion
The results demonstrate that the proposed multimodal transformer architecture effectively captures both anatomical and clinical patterns associated with Alzheimer’s disease, as reflected by its high predictive performance (AUC = 0.9940 ± 0.0059). By integrating multi-ROI volumetric MRI data with structured clinical features, the model leverages complementary sources of information that are typically analyzed independently in conventional approaches.
The use of ROI-based decomposition enables the model to focus on anatomically relevant brain regions, such as the hippocampus and entorhinal cortex, which are known to be critically affected by neurodegeneration. This observation is consistent with established clinical findings identifying medial temporal lobe atrophy as a hallmark of Alzheimer’s disease. Compared with whole-brain approaches, the proposed strategy reduces the amount of non-informative anatomical content presented to the model while preserving disease-relevant brain regions.
A key strength of the proposed framework lies in its token-based multimodal representation. By embedding imaging and tabular information into a shared feature space, the model facilitates direct cross-modal interactions through self-attention mechanisms. This allows the architecture to capture complex dependencies between anatomical patterns and clinical biomarkers that are often overlooked by conventional fusion approaches based on late concatenation.
The incorporation of modality embeddings further improves the model’s ability to differentiate and relate heterogeneous data sources, while attention-based pooling complements the global representation learned by the CLS token. This hybrid aggregation strategy enhances robustness by jointly modeling global contextual information and localized feature relevance.
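The token-based fusion and hybrid aggregation discussed above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for exposition: the 3-dimensional toy tokens, the fixed modality embeddings, the dot-product scoring with a fixed query vector, and the choice to concatenate the CLS vector with the pooled vector. The transformer encoder itself is omitted, so this shows only how the sequence is assembled and how the two summaries could be combined.

```python
import math


def add_modality(tokens, modality_embedding):
    """Add one modality embedding vector to every token of that modality."""
    return [[x + m for x, m in zip(tok, modality_embedding)] for tok in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, query):
    """Weighted average of tokens; weights are a softmax of dot products with query."""
    scores = [sum(q * x for q, x in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights


# Toy tokens (illustrative numbers only).
cls_token = [0.0, 0.0, 0.0]
roi_tokens = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]  # e.g. two ROI patch embeddings
tab_tokens = [[0.5, 0.0, 0.2]]                   # e.g. one clinical feature embedding
MRI_EMB = [1.0, 0.0, 0.0]
TAB_EMB = [0.0, 1.0, 0.0]

sequence = [cls_token] + add_modality(roi_tokens, MRI_EMB) + add_modality(tab_tokens, TAB_EMB)
# A transformer encoder would update `sequence` here; it is omitted in this sketch.
pooled, weights = attention_pool(sequence[1:], query=[0.1, 0.2, 0.3])
fused = sequence[0] + pooled  # list concatenation: CLS vector followed by pooled vector
print(len(sequence), len(fused), [round(w, 3) for w in weights])
```

Because imaging and tabular tokens sit in one sequence, every self-attention layer can relate an ROI token to a clinical token directly, which is the property that distinguishes this design from late concatenation of separately encoded modalities.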
The ablation study quantitatively supports these architectural design choices, demonstrating that both ROI decomposition and multimodal integration contribute substantially to classification performance. In particular, the multimodal framework consistently outperforms the tabular-only configuration, indicating that ROI-based imaging provides complementary information beyond clinical and volumetric biomarkers. These findings suggest that the proposed architecture does not rely solely on metadata, but instead effectively integrates heterogeneous modalities to improve disease characterization.
While the individual components of the proposed framework, including transformer architectures, ROI-based neuroimaging analysis, and multimodal learning, have been previously investigated, the contribution of this work lies in their unified integration within a token-based multimodal representation framework specifically designed for Alzheimer’s disease classification. To the best of our knowledge, existing ROI-based approaches typically process anatomical regions independently or rely on feature concatenation and late-fusion mechanisms, whereas the proposed methodology enables direct interaction between ROI-derived imaging tokens and structured clinical and volumetric biomarkers through shared self-attention operations. Furthermore, the incorporation of hemisphere-aware ROI selection and statistically grounded attention stability analysis extends current interpretability practices beyond qualitative visualization, providing a more reproducible assessment of biomarker relevance across validation folds.
The proposed multimodal transformer framework demonstrated competitive performance relative to recent transformer-based Alzheimer’s disease classification methods while additionally emphasizing anatomical interpretability and multimodal representation learning. Unlike conventional full-volume approaches, the proposed methodology integrates anatomically constrained multi-ROI decomposition with clinical and volumetric biomarkers, enabling focused representation learning on disease-relevant neuroanatomical structures associated with Alzheimer’s disease pathology.
Importantly, the primary contribution of the proposed framework extends beyond purely quantitative classification performance. The proposed approach additionally incorporates hemisphere-aware ROI selection, multimodal token-based fusion, and statistically grounded attention stability analysis across cross-validation folds. These characteristics improve the transparency and interpretability of the learned multimodal representations while maintaining strong performance under a heterogeneous multi-cohort evaluation setting.
Nevertheless, additional benchmark studies directly comparing the proposed framework against recent large-scale transformer-based medical imaging architectures and advanced multimodal fusion strategies under identical preprocessing and validation protocols remain important future research directions.
Although transformer-based architectures generally benefit from large-scale training datasets, the proposed framework demonstrated robust performance under a comparatively limited but carefully controlled multi-cohort setting. Several design choices were incorporated to mitigate data scarcity challenges, including anatomically constrained ROI decomposition, multimodal integration of clinical and volumetric biomarkers, dropout regularization, Hyperband-based hyperparameter optimization, and stratified 7-fold cross-validation.
Additionally, the use of ROI-based decomposition reduces the dimensionality of the input space compared with full-volume transformer approaches, enabling more focused representation learning on anatomically relevant brain regions associated with Alzheimer’s disease pathology. The merged multi-cohort design combining ADNI, AIBL, and OASIS further introduces variability in acquisition protocols and demographic distributions, partially improving robustness against dataset-specific bias.
Nevertheless, additional large-scale validation studies remain necessary to further assess generalization capability across independent cohorts and acquisition settings. Future work should investigate leave-one-cohort-out validation protocols, self-supervised pre-training strategies, foundation-model initialization, and advanced medical image augmentation techniques to further improve robustness and clinical applicability under limited-data conditions.
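A leave-one-cohort-out protocol of the kind suggested above could be organized as in the following sketch, which holds out each cohort in turn and trains on the remaining ones. The helper name and the toy cohort assignment are hypothetical.

```python
def leave_one_cohort_out(cohorts):
    """Yield (held_out_cohort, train_indices, test_indices) for each cohort."""
    for held_out in sorted(set(cohorts)):
        train = [i for i, c in enumerate(cohorts) if c != held_out]
        test = [i for i, c in enumerate(cohorts) if c == held_out]
        yield held_out, train, test


# Hypothetical cohort label for each subject (illustrative only).
subject_cohorts = ["ADNI", "ADNI", "AIBL", "OASIS", "AIBL", "OASIS", "ADNI"]

splits = list(leave_one_cohort_out(subject_cohorts))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Unlike stratified folds drawn from the merged dataset, each test set here comes from a site the model has never seen, so the resulting scores speak more directly to cross-site generalization.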
The attention-based interpretability analysis demonstrated that the proposed framework consistently emphasized anatomically relevant regions strongly associated with Alzheimer’s disease pathology, including the hippocampus, entorhinal cortex, and ventricular-related structures. These findings are consistent with established neuroimaging studies reporting medial temporal lobe atrophy and ventricular enlargement as characteristic biomarkers of Alzheimer’s disease progression. Additionally, several volumetric biomarkers derived from FreeSurfer-based anatomical analysis received consistently high attention importance scores across cross-validation folds, supporting the clinical relevance of the learned multimodal representations.
Statistical analyses further indicated that the learned attention patterns remained relatively stable across cross-validation folds, suggesting that the model did not rely on unstable partition-specific attention distributions. Nevertheless, the proposed attention mechanisms should not be interpreted as direct causal explanations of disease pathology, but rather as supportive indicators of learned feature relevance within the multimodal transformer framework. Although direct correlation analyses between attention scores and continuous clinical measures such as MMSE and CDR values were not included in the present study, these analyses represent important directions for future work aimed at strengthening the clinical interpretability of transformer-based neuroimaging models.
Although the proposed framework demonstrated strong performance for binary Alzheimer’s disease classification, the current experimental setting does not fully capture the clinical continuum of neurodegenerative progression observed in real-world practice. In clinical settings, diagnostic evaluation often involves distinguishing between cognitively normal (CN) subjects, subjective memory complaints (SMCs), early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and Alzheimer’s disease (AD). Among these stages, MCI-related conditions are particularly important because they may represent transitional phases preceding Alzheimer’s disease progression. In the present study, subjects with CDR = 0.5 were excluded to ensure clearer class separation and reduce diagnostic ambiguity during the initial evaluation of the proposed multimodal transformer framework. Consequently, the reported results should be interpreted within the context of a controlled binary classification scenario rather than as a comprehensive clinical diagnostic framework.
Additionally, although the use of merged multi-cohort datasets (ADNI, AIBL, and OASIS) increases population heterogeneity and partially improves robustness to dataset-specific bias, cohort-specific sensitivity and specificity analyses were not explicitly performed in this study. Differences in scanner characteristics, acquisition protocols, demographic distributions, and site-specific variability may influence model generalization and should be further investigated through dedicated cross-site evaluation protocols and external validation studies.
Despite these promising results, several limitations remain. The use of undersampling to balance class distributions may reduce training diversity and limit the ability of the model to capture rare disease patterns. Furthermore, although attention mechanisms improve interpretability, additional clinical validation is required to confirm the reliability and consistency of the identified attention patterns across heterogeneous populations and imaging protocols.
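For reference, random undersampling of the majority class can be sketched as below. This is a generic illustration rather than the authors' procedure: the function name, the seed, and the toy label list are assumptions.

```python
import random
from collections import defaultdict


def undersample(labels, seed=0):
    """Return sorted indices with every class reduced to the minority-class size."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    target = min(len(ids) for ids in by_class.values())
    rng = random.Random(seed)
    kept = []
    for ids in by_class.values():
        kept.extend(rng.sample(ids, target))
    return sorted(kept)


# Hypothetical imbalanced label list: 10 CN subjects and 4 AD subjects.
labels = ["CN"] * 10 + ["AD"] * 4
kept = undersample(labels)
kept_labels = [labels[i] for i in kept]
print(len(kept), kept_labels.count("CN"), kept_labels.count("AD"))
```

The discarded majority-class subjects are exactly the training diversity that this limitation refers to; applying the sampling inside each training fold only, and leaving validation folds untouched, keeps the evaluation distribution unchanged.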
8. Conclusions
This study presented a multimodal 3D Vision Transformer framework for Alzheimer’s disease classification that integrates ROI-based MRI representations with clinical and volumetric biomarkers within a unified transformer architecture. The proposed methodology combines anatomically guided ROI decomposition, multimodal token fusion, modality embeddings, and attention-based learning to jointly model heterogeneous data sources.
Experimental results demonstrate that the proposed framework achieves high classification performance, obtaining an AUC of 0.9940 ± 0.0059 under stratified 7-fold cross-validation. The results further show that multimodal integration improves predictive performance compared to unimodal configurations, while ROI-based decomposition enhances anatomical interpretability by focusing on clinically relevant brain structures associated with neurodegeneration.
The proposed architecture also provides a clinically meaningful interpretation of the classification process through attention-based mechanisms that highlight relevant anatomical regions and multimodal feature interactions. In contrast to conventional whole-brain approaches, the proposed ROI-centered strategy improves representation quality while preserving interpretability.
Overall, the findings indicate that combining anatomically informed representations with multimodal transformer-based fusion constitutes a robust and clinically relevant approach for Alzheimer’s disease classification. Future work will focus on validating the proposed framework on independent external datasets, incorporating additional modalities such as PET imaging and genetic biomarkers, and extending the model to longitudinal studies for early prediction of disease progression.