This section explains the proposed ARE-PaLED framework, detailing its architecture, the extraction of discriminant patches using an evolutionary approach, and the integration of explainability through AR projection, together with the proposed algorithms.
4.2. Adaptive Multi-Scale Patch Extraction
4.2.1. Voxel-Based Morphometry (VBM)
The FCN designed for VBM on 3D brain MRI scans consists of six convolutional layers. The input to the network is a 3D brain MRI scan with a shape of (D, H, W, 1), where D, H, and W denote the scan’s depth, height, and width, and 1 represents the single-channel grayscale intensity values. The network begins with four 3D convolutional layers, each with a kernel of size 3 × 3 × 3 and padding applied to preserve the input dimensions. Every convolutional layer is followed by batch normalisation and a rectified linear unit (ReLU) activation function to introduce non-linearity.
After the convolutional layers, max-pooling is applied using a filter size of 2 × 2 × 2 in between specific layers to reduce the spatial dimensions, allowing the network to learn more abstract feature representations. This is followed by two additional 3D convolutional layers with up-sampling to restore the original 3D spatial structure of the input scan, ensuring voxel-wise outputs that align with the input.
Finally, the network concludes with a 1 × 1 × 1 convolutional layer, which produces the voxel-wise output corresponding to p-values. A sigmoid activation is applied to this final layer to guarantee that the output values lie within the range 0 to 1.
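For concreteness, a minimal PyTorch sketch of this VBM FCN is given below. The filter counts (32 and 64) and the number of pooling/up-sampling stages are assumptions, as the text does not specify them; only the kernel sizes, layer ordering, and sigmoid output follow the description above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3x3 convolution with padding to preserve dimensions, plus BN and ReLU
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

class VBMFCN(nn.Module):
    def __init__(self):
        super().__init__()
        # Four initial 3D convolutional layers with interleaved max-pooling
        self.enc = nn.Sequential(
            conv_block(1, 32), conv_block(32, 32), nn.MaxPool3d(2),
            conv_block(32, 64), conv_block(64, 64), nn.MaxPool3d(2),
        )
        # Two further convolutional layers with up-sampling to restore resolution
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
            conv_block(64, 32),
            nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
            conv_block(32, 32),
        )
        # 1x1x1 convolution + sigmoid yields voxel-wise p-values in [0, 1]
        self.head = nn.Conv3d(32, 1, kernel_size=1)

    def forward(self, x):  # x: (N, 1, D, H, W)
        return torch.sigmoid(self.head(self.dec(self.enc(x))))

p_map = VBMFCN()(torch.randn(1, 1, 64, 64, 64))  # voxel-wise p-value map
```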
4.2.2. Gradient Estimation (GE)
The FCN designed for GE on 3D brain MRI scans is structured to estimate voxel-wise gradient magnitudes, which can be used to identify discriminative locations across symmetric brain regions. The network starts with two 3D convolutional layers, each with a 3 × 3 × 3 kernel to capture low-level features, followed by batch normalisation and ReLU activation for non-linearity. After each convolution, a 2 × 2 × 2 max-pooling layer is applied to down-sample the spatial resolution, allowing the network to focus on more abstract and complex features.
Following the down-sampling layers, two more 3D convolutional layers refine the learned features further, and up-sampling layers restore the spatial resolution to match the input. These up-sampling layers ensure that the input image corresponds to the final output voxel-wise. The concluding layer of the network uses a 1 × 1 × 1 convolutional layer with three filters to estimate the gradient in the three directions (x, y, and z). The outcome is a three-channel map, where each channel corresponds to the gradient in one of these directions.
The network computes the gradient magnitude for each voxel as the Euclidean norm of the gradient vectors. This can be computed as part of the network or as a post-processing step. The network output is a single-channel 3D volume in which each voxel contains the gradient magnitude, representing the sharpness of structural changes at that location.
The designed FCN is trained with a mean squared error loss function, and efficient weight updates are performed using the Adam optimiser. This structure, with its convolutional, pooling, and up-sampling layers, allows the network to estimate gradient magnitudes across the 3D MRI scan.
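A hedged PyTorch sketch of the GE FCN follows; the filter counts are again assumed, and the Euclidean norm over the three directional channels is computed inside the network (the text allows it as a post-processing step instead).

```python
import torch
import torch.nn as nn

class GEFCN(nn.Module):
    def __init__(self):
        super().__init__()
        def block(i, o):
            return nn.Sequential(nn.Conv3d(i, o, 3, padding=1),
                                 nn.BatchNorm3d(o), nn.ReLU(inplace=True))
        # Two convolutions, each followed by 2x2x2 max-pooling
        self.down = nn.Sequential(block(1, 32), nn.MaxPool3d(2),
                                  block(32, 64), nn.MaxPool3d(2))
        # Two refining convolutions with up-sampling back to input resolution
        self.up = nn.Sequential(
            block(64, 64),
            nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
            block(64, 32),
            nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False))
        # 1x1x1 convolution with three filters: gradients along x, y, z
        self.head = nn.Conv3d(32, 3, kernel_size=1)

    def forward(self, x):
        g = self.head(self.up(self.down(x)))        # (N, 3, D, H, W)
        # Euclidean norm over the channel axis gives the magnitude map
        return torch.linalg.vector_norm(g, dim=1, keepdim=True)

# Training would pair nn.MSELoss() with torch.optim.Adam, as stated above.
```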
4.2.3. Fuse-Net of VBM and GE
Inspired by the need to fuse both statistical and structural information from brain MRI scans, a Fuse-Net structure is designed to fuse the outputs of two fully convolutional networks (FCNs), one for VBM and the other for GE, to produce normalised z-scores, which highlight discriminant regions in 3D brain MRI. The network begins by taking two inputs, the voxel-wise p-value map from the VBM FCN and the gradient magnitude map from the GE FCN; both are single-channel 3D volumes with the shape (D, H, W, 1). These inputs are concatenated along the channel axis, forming a two-channel input, which contains both statistical significance and structural information.
After concatenation, a series of 3D convolutional layers is applied to learn the joint features between the two modalities. Each convolutional layer uses a kernel size of 3 × 3 × 3 and applies the ReLU activation function. Batch normalisation follows each convolution to stabilise training, and pooling layers with 2 × 2 × 2 max-pooling filters are intermittently applied to down-sample the feature maps, allowing the network to capture abstract and larger-scale patterns.
Up-sampling layers restore the spatial dimensions to match the input and maintain the original spatial resolution. The up-sampling is applied after the down-sampling layers to ensure that the final output corresponds voxel-wise to the original 3D MRI scan. These up-sampled features are passed through a final 3D convolutional layer with a filter of size 1 × 1 × 1, reducing the feature map to a single-channel output.
The output of this final convolutional layer is a 3D volume containing raw voxel-wise values, which are then normalised into z-scores. This z-score normalisation ensures that the network output is standardised, providing a voxel-wise representation of how much each region deviates from the mean, thus identifying discriminant areas in the brain. The final output is a 3D map of normalised z-scores, which highlights the brain areas with significant structural or statistical deviations.
By combining the p-value maps and gradient magnitudes, the Fuse-Net provides a robust method for identifying discriminant locations in 3D brain MRI, offering a joint statistical and structural analysis. The network efficiently processes both inputs to extract meaningful patterns and provide a normalised voxel-wise measure of discriminant significance, which is crucial for analysing brain abnormalities.
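The sketch below illustrates one plausible Fuse-Net realisation in PyTorch; the channel widths are assumptions, and the z-score normalisation is implemented per volume over the spatial dimensions.

```python
import torch
import torch.nn as nn

class FuseNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(i, o):
            return nn.Sequential(nn.Conv3d(i, o, 3, padding=1),
                                 nn.BatchNorm3d(o), nn.ReLU(inplace=True))
        self.body = nn.Sequential(
            block(2, 32), nn.MaxPool3d(2),       # joint features, down-sampled
            block(32, 64),
            nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
            block(64, 32),                       # restored to input resolution
        )
        self.head = nn.Conv3d(32, 1, kernel_size=1)

    def forward(self, p_map, grad_map):
        x = torch.cat([p_map, grad_map], dim=1)  # two-channel input
        raw = self.head(self.body(x))
        # Normalise raw voxel-wise values into z-scores per volume
        mu = raw.mean(dim=(2, 3, 4), keepdim=True)
        sd = raw.std(dim=(2, 3, 4), keepdim=True)
        return (raw - mu) / (sd + 1e-8)

z_map = FuseNet()(torch.randn(1, 1, 64, 64, 64), torch.randn(1, 1, 64, 64, 64))
```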
4.2.4. Dynamic 3D Patch Extraction
The CNN structure is designed to extract non-overlapping 3D patches from a brain MRI based on normalised z-scores. It dynamically adjusts the patch size according to the significance of the z-score. A normalised z-score map, representing voxel-wise z-scores for the entire 3D brain MRI, is provided as input to the designed CNN. For voxels where the z-score is greater than 1, the network extracts a larger patch of size 32 × 32 × 32, as these regions indicate higher discriminative significance. In contrast, for voxels where the z-score is less than or equal to 1, a smaller patch of size 16 × 16 × 16 is extracted, as these regions are considered less discriminative.
A custom layer is implemented within the CNN to achieve this dynamic patch extraction. This layer scans the z-score map and determines the appropriate patch size for each voxel. If Z(V) > 1, a 32 × 32 × 32 patch centred at voxel V is extracted, and if Z(V) ≤ 1, a 16 × 16 × 16 patch is extracted. The patch extraction process is non-overlapping, meaning no two patches share the same voxels. This is enforced by setting the extraction stride equal to the patch size (32 or 16), so that patches are spaced apart according to their respective sizes.
The network begins by using a convolutional layer to detect regions where z-scores are higher than 1, acting as a filter to highlight more discriminative regions. Based on this filter, the custom extraction layer selects the patch size accordingly. Larger patches are extracted from areas of greater significance, while smaller patches are extracted from regions of lesser interest. After extraction, the patches can be passed through additional convolutional layers to extract higher-level features for tasks such as segmentation or classification.
This adaptive method is shown in Algorithm 1. It allows efficient 3D brain MRI data processing, focusing computational resources on regions with higher z-scores (potentially indicating abnormalities) while minimising attention to less significant regions. Importantly, by extracting patches from both hemispheres, the method enables implicit comparison across anatomically corresponding areas, allowing downstream tasks to detect asymmetrical structural changes. The network’s final output is a set of non-overlapping 3D patches, varying in size based on the z-score, which can then be used for further analysis in medical image processing, including the identification of symmetry-related atrophy patterns.
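The following NumPy sketch captures the z-score-driven, non-overlapping extraction logic; the tile-scanning strategy (stepping in 16-voxel increments and promoting a tile to 32³ when the centre z-score exceeds 1) is one plausible reading of Algorithm 1, not its verbatim implementation.

```python
import numpy as np

def extract_patches(volume, zmap, small=16, large=32, threshold=1.0):
    # Scan the volume in small-size steps; an occupancy mask enforces
    # the non-overlap constraint between patches of different sizes.
    D, H, W = volume.shape
    occupied = np.zeros(volume.shape, dtype=bool)
    patches = []
    for d in range(0, D - small + 1, small):
        for h in range(0, H - small + 1, small):
            for w in range(0, W - small + 1, small):
                # Choose the patch size from the z-score at the tile centre
                z = zmap[d + small // 2, h + small // 2, w + small // 2]
                size = large if z > threshold else small
                if d + size > D or h + size > H or w + size > W:
                    size = small                 # fall back near the border
                region = (slice(d, d + size), slice(h, h + size),
                          slice(w, w + size))
                if occupied[region].any():
                    continue                     # voxels already covered
                occupied[region] = True
                patches.append((volume[region], (d, h, w)))
    return patches
```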
Thus, the proposed AMPEN performs dynamic, non-overlapping, multi-scale patch extraction by jointly leveraging voxel-wise statistical significance (derived from voxel-based morphometry) and gradient-based feature sensitivity. Unlike traditional methods that rely on fixed patch sizes or uniform sampling, AMPEN adaptively determines both patch size and location based on the degree of region-specific brain atrophy, allowing the model to concentrate on structurally and functionally salient brain areas.
Importantly, AMPEN integrates voxel-level and patch-level information, enabling the system to retain fine-grained spatial detail while also capturing broader regional context. This dual-resolution strategy enhances both classification performance and model interpretability by grounding predictions in anatomically meaningful features.
This integrated, adaptive approach to multi-scale patch extraction, guided by both statistical and gradient-based cues, represents a key innovation and a novel contribution to explainable deep learning in neuroimaging-based AD diagnosis.
4.3. Informative Patch Selection
The IPSA (given in Algorithm 2) aims to identify the most informative 3D patches from brain MRI data using a combination of z-scores, SHapley Additive exPlanations (SHAP) coefficients, and an evolutionary approach. In this methodology, let S denote a subsample of patches selected for network training, while X represents the aggregated vector of features from all possible patches. The fitness of each subgroup, F(S), measures its usefulness for classification tasks. This fitness score is formulated as a weighted sum of three components, as shown in Equation (1):

F(S) = w_1 \sum_{p \in S} z(p) + w_2 \sum_{p \in S} \phi(p) - w_3 R(S)    (1)

Here, the term \sum_{p \in S} z(p) quantifies the statistical significance of the patches. This ensures that patches selected from regions with higher z-scores are prioritised, as they likely correspond to more clinically relevant areas. The second component, \sum_{p \in S} \phi(p), utilises SHAP values to gauge an individual patch’s influence on the model’s classification. SHAP values provide insight into how much each patch influences the overall classification decision, making them vital for interpretability. The final term, R(S), penalises redundancy among selected patches, discouraging overlap and promoting diversity. This is particularly important in imaging contexts, where multiple patches may capture similar anatomical features.
Algorithm 2 operates as follows: it begins by creating an initial population of chromosomes, each representing a subset of patches. In each generation, the algorithm calculates the fitness of every individual chromosome using the formulation in Equation (1). The best-performing chromosomes are then selected as parents for the next generation.
Crossover operations are performed on these parent chromosomes to create offspring, combining features of selected patches from both parents. Mutation is also applied to introduce variability, allowing for the exploration of new combinations of patches. The new generation of chromosomes is formed by selecting the best individuals from both the parents and offspring, a strategy known as elitism.
This iterative process continues for several generations to maximise the fitness function. The result is the best chromosome, C_opt, which identifies the optimal subset of informative patches that balance statistical significance and interpretability while minimising redundancy.
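A compact Python sketch of this evolutionary loop is shown below. The weights w1–w3, population size, mutation rate, and the one-point crossover are illustrative assumptions; z and shap are per-patch score arrays, and overlap(i, j) stands in for whatever redundancy measure the framework uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, z, shap, overlap, w=(1.0, 1.0, 0.5)):
    # Weighted sum from Equation (1): z-score term + SHAP term - redundancy
    idx = np.flatnonzero(mask)
    redundancy = sum(overlap(i, j) for i in idx for j in idx if i < j)
    return w[0] * z[idx].sum() + w[1] * shap[idx].sum() - w[2] * redundancy

def ipsa(z, shap, overlap, pop=40, gens=50, k=50, mut=0.02):
    n = len(z)
    # Initial population: boolean chromosomes selecting k patches each
    population = [rng.permutation(n) < k for _ in range(pop)]
    for _ in range(gens):
        scores = np.array([fitness(c, z, shap, overlap) for c in population])
        parents = [population[i] for i in np.argsort(scores)[-pop // 2:]]
        children = []
        while len(children) < pop // 2:
            a, b = rng.choice(len(parents), size=2, replace=False)
            cut = rng.integers(1, n)                   # one-point crossover
            child = np.concatenate([parents[a][:cut], parents[b][cut:]])
            child ^= rng.random(n) < mut               # bit-flip mutation
            children.append(child)
        population = parents + children                # elitism: keep parents
    best = max(population, key=lambda c: fitness(c, z, shap, overlap))
    return np.flatnonzero(best)                        # indices of C_opt
```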
4.4. Patch Feature Profiling
The CNN designed for local patch feature analysis of non-overlapping 3D multi-scale patches from 3D brain sMRI scans begins by accepting two different patch sizes: 16 × 16 × 16 and 32 × 32 × 32. Each patch is independently processed, with the input shape being 16 × 16 × 16 or 32 × 32 × 32, representing grayscale intensity values. The CNN consists of three 3D convolutional layers with a 3 × 3 × 3 filter size, followed by ReLU activations to capture local spatial features. Padding is applied to preserve the input dimensions during convolutions. As the network progresses, the number of filters increases (e.g., from 32 to 64 and 128), allowing the network to capture more abstract and complex features.
To reduce spatial dimensions and extract essential features, 3D max-pooling layers are applied with a kernel size of 2 × 2 × 2, halving the spatial resolution while retaining critical information. After the final convolutional and pooling stages, a global pooling layer (global max pooling) converts the 3D feature volumes from each patch into single-dimensional feature vectors. Each vector summarises the local features extracted by the CNN from its patch and is ready for further analysis.
This architecture is optimised to extract informative features from both fine-grained (16 × 16 × 16) and coarse-scale (32 × 32 × 32) patches, providing a compact representation of local brain structures within the 3D sMRI data.
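Because global max pooling collapses the spatial dimensions, a single network can embed both patch sizes into fixed-length vectors, as the minimal PyTorch sketch below shows; it follows the stated 32/64/128 filter progression, while the remaining details are assumptions.

```python
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        def block(i, o):
            # 3x3x3 convolution (padded) + ReLU + 2x2x2 max-pooling
            return nn.Sequential(nn.Conv3d(i, o, 3, padding=1),
                                 nn.ReLU(inplace=True), nn.MaxPool3d(2))
        self.features = nn.Sequential(block(1, 32), block(32, 64),
                                      block(64, 128))
        self.pool = nn.AdaptiveMaxPool3d(1)    # global max pooling

    def forward(self, x):
        # Works for (N, 1, 16, 16, 16) and (N, 1, 32, 32, 32) alike
        return self.pool(self.features(x)).flatten(1)   # (N, 128)

emb16 = PatchCNN()(torch.randn(4, 1, 16, 16, 16))  # (4, 128) feature vectors
emb32 = PatchCNN()(torch.randn(4, 1, 32, 32, 32))  # same embedding length
```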
4.5. Global Spatial Mapping
The global spatial mapping is performed using the ViT. It processes the CNN-derived patch embeddings, which already include positional information. Instead of explicitly adding positional encodings in the transformer, the tokens fed into the ViT carry both patch and positional representations from the CNN output.
The input token to the transformer, corresponding to the ith patch, is defined as t_i = e_i, where e_i is the CNN-extracted embedding, which already encodes positional and feature information through learned convolutional features. The entire input sequence to the ViT consists of all position-encoded tokens {t_1, t_2, ..., t_N}, one for each patch.
The ViT comprises multiple transformer encoder layers to model the global relationships between patches. The core of each encoder layer is the multi-head self-attention mechanism, which computes how each patch is related to every other patch. For a given token, self-attention is computed using the formulation in Equation (2):

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V    (2)

where Q, K, and V are the queries, keys, and values obtained from the input tokens through learned projection matrices, d_k is the dimension of the keys, and the softmax operation normalises the attention scores. In multi-head attention, multiple sets of attention heads are computed in parallel. After self-attention, the output is fed into a feed-forward network (FFN) that applies a non-linear transformation, as shown in Equation (3):

\text{FFN}(x) = W_2 \max(0, W_1 x + b_1) + b_2    (3)

where W_1 and W_2 are weight matrices and b_1 and b_2 are biases. This FFN is applied independently to each token. Additionally, each transformer encoder layer uses residual connections and layer normalisation to improve the learning process and stabilise training. The ViT’s output, after multiple transformer layers, provides a globally refined representation of the 3D brain sMRI, capturing both local details and long-range spatial interactions across the brain.
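PyTorch’s built-in transformer encoder already implements this combination of multi-head self-attention, FFN, residual connections, and layer normalisation; the sketch below uses assumed values for the depth, width, and head count, which the text does not specify.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 128, 8, 4     # illustrative hyperparameters
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(
        d_model=d_model, nhead=n_heads, dim_feedforward=512,
        batch_first=True),                 # attention + FFN + residual + norm
    num_layers=n_layers,
)

tokens = torch.randn(1, 50, d_model)       # one token per CNN-embedded patch
refined = encoder(tokens)                  # (1, 50, 128), globally refined
```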
4.6. Classifier
The fully connected layer designed for multi-class classification (distinguishing among AD, MCI, and NC) receives its input from the ViT, which has processed the global spatial features of the brain sMRI. The ViT output, the refined sequence of patch tokens, is first flattened into a feature vector. This vector is passed to the classification head, which consists of one or more dense layers.
The first step is to project the ViT’s output to a lower-dimensional space using a dense layer, as shown in Equation (4):

h = \text{ReLU}(W z + b)    (4)

where z is the input feature vector from the ViT, W is a learned weight matrix, b is the bias term, and the ReLU activation function introduces non-linearity. This transformation reduces the dimensionality of the input while capturing the most discriminative features.
The result of this layer is passed to another dense layer, typically without an activation function, to project the features onto the target class labels (three in this case: AD, MCI, and NC). The final output logits are given by Equation (5):

O = W_o h + b_o    (5)

where W_o and b_o are the weight matrix and bias for the output layer, and O contains each class’s raw prediction scores. A softmax activation function is applied to convert these logits into probabilities for each class. This function ensures that the sum of the probabilities across all three classes is 1, making it suitable for multi-class classification.
The final predicted class is assigned based on the maximum predicted probability. Through its series of transformations, this fully connected layer ultimately distinguishes among AD, MCI, and NC using the global spatial features extracted by the ViT.
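A minimal PyTorch sketch of this head follows, assuming 50 patch tokens of width 128 and a hidden width of 64 (none of which are given in the text).

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.Flatten(),                # flatten the ViT token sequence into a vector
    nn.Linear(50 * 128, 64),     # Eq. (4): h = ReLU(Wz + b)
    nn.ReLU(inplace=True),
    nn.Linear(64, 3),            # Eq. (5): O = W_o h + b_o (AD, MCI, NC)
)

logits = head(torch.randn(1, 50, 128))
probs = torch.softmax(logits, dim=-1)   # probabilities summing to 1
pred = probs.argmax(dim=-1)             # class with maximum probability
```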
4.7. Patch-Level Explainability with AR
In the proposed ARE-PaLED network, the selected informative patches are projected using AR to provide an interactive and semi-immersive visualisation of critical brain regions associated with AD. AR integration is achieved through the Unity platform combined with the AR Foundation package.
The AR system for XAI receives patch-level information from two core components of the pipeline. First, the voxel-wise z-scores from the IPSA identify anatomically significant patches that exhibit high statistical deviation across the brain volume. Second, a classification label is assigned by the classifier based on the selected patches and learned spatial patterns.
To enable intuitive, marker-based AR explainability, the WebAR pipeline for the proposed system is extended using frameworks such as AR.js, A-Frame, or MindAR. Upon detecting a printed fiducial marker, the system queries a backend API to retrieve patch metadata, including MNI-normalised 3D coordinates, z-scores, and predicted class labels. Each patch is then projected into 2D screen space using Three.js’s camera projection matrix and visualised as a lightweight rectangular outline anchored to the physical marker. Box centres are computed via a linear transformation from MNI coordinates to marker space. To ensure high frame rates on web and mobile devices, each overlay is rendered without lighting or shadows. An A-Frame raycaster module listens for user interactions. When a user taps on a patch, a dashed bounding box is overlaid, and a contextual label slides into view, displaying the patch’s z-score and class confidence. Patch elements are loaded efficiently in small batches using requestIdleCallback to avoid blocking the main rendering thread.
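To illustrate the placement maths only, the NumPy sketch below pairs a linear MNI-to-marker transform with a simple pinhole projection standing in for Three.js’s camera projection matrix; the scale, offset, and camera parameters are hypothetical placeholders, not the system’s actual calibration.

```python
import numpy as np

def mni_to_marker(mni_xyz, scale=0.001, offset=(0.0, 0.05, 0.0)):
    # Hypothetical linear map: millimetre MNI coordinates -> marker space
    return np.asarray(mni_xyz, dtype=float) * scale + np.asarray(offset)

def project_to_screen(marker_xyz, focal=800.0, cx=640.0, cy=360.0, cam_z=0.5):
    # Pinhole projection of a marker-space point onto 2D screen pixels
    x, y, z = marker_xyz
    depth = cam_z - z              # distance from a camera on the +z axis
    return (cx + focal * x / depth, cy - focal * y / depth)

u, v = project_to_screen(mni_to_marker([-38.0, -22.0, 54.0]))  # box centre
```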
The AR system for XAI supports:
2D interactive viewing: Users move their device over a marker to inspect the patch layout projected onto the brain model within a controlled 2D view.
Patch highlighting: Patches identified by IPSA that contribute to the class prediction are highlighted.
Metadata popups: Tapping a patch reveals its IPSA-derived z-score and ViT-predicted class label, offering transparency into both statistical significance and model reasoning.
By combining statistical evidence with deep model predictions and projecting them through an AR module, the proposed system offers intuitive, spatially grounded insights that support clinical decision-making by linking model outputs to visually interpretable brain regions.
4.10. Implementation
The proposed ARE-PaLED is developed in Python 3.8 using the PyTorch 2.3.0 library, ensuring a robust and efficient environment for model training and evaluation. To prevent overfitting, batch normalisation is applied after each convolutional layer, enhancing model stability and convergence. The AMPEN extracts image patches from distinct brain regions, capturing various anatomical variations and increasing the training data’s diversity. The training process uses five-fold cross-validation on 80% of the dataset, reserving 20% for testing and ensuring a rigorous assessment of model performance. Hyperparameters such as batch size, learning rate, and patch count (e.g., 50 patches) are tuned during validation for best results.
The model is trained over 100 epochs using the Adam optimiser with a learning rate of 0.001, achieving a balance between computational efficiency and learning performance. The system configuration for model execution includes an NVIDIA QUADRO P5000 GPU, 64 GB of RAM, and an Intel(R) Xeon(R) W-2133 CPU at 3.60 GHz, providing the computational power needed for deep learning tasks. Unity3D 6.0 and the AR Foundation package are utilised for interpretability to project the discriminative patches in AR.
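The training configuration described above can be summarised in the following sketch; the model, features, and labels are placeholders for the full ARE-PaLED pipeline, and only the optimiser, learning rate, epoch count, and fold count come from the text.

```python
import torch
from sklearn.model_selection import KFold

features = torch.randn(200, 128)          # placeholder training features (80%)
labels = torch.randint(0, 3, (200,))      # placeholder AD/MCI/NC labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr, va) in enumerate(kf.split(range(len(features)))):
    model = torch.nn.Linear(128, 3)       # stand-in for the full network
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(100):              # 100 epochs per fold
        optimizer.zero_grad()
        loss = criterion(model(features[tr]), labels[tr])
        loss.backward()
        optimizer.step()
    val_acc = (model(features[va]).argmax(1) == labels[va]).float().mean()
```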