1. Introduction
Alzheimer’s disease (AD) is a gradual, irreversible neurological condition and the predominant cause of dementia worldwide. It is characterised by the progressive deterioration of cognitive functioning, including memory, thinking, and executive skills, and ultimately results in a loss of autonomy and a significant financial burden for patients, carers, and healthcare systems. Consequently, timely and accurate diagnosis of the stage of Alzheimer’s disease is crucial for early intervention, enhanced disease management, and effective clinical decision-support systems.
The clinical progression of Alzheimer’s disease is often characterised as a continuum spanning cognitively normal (CN) individuals, those exhibiting intermediate cognitive impairment, and individuals with fully manifested Alzheimer’s disease. Mild cognitive impairment (MCI) is a key stage in this continuum and is typically subdivided into early MCI (EMCI) and late MCI (LMCI), reflecting the progressive severity of cognitive deterioration. Certain clinical datasets also include a category termed Subjective Memory Complaints (SMC), comprising individuals who report memory issues yet have no measurable cognitive deficits. This fine-grained categorisation facilitates precise clinical labelling but also introduces significant diagnostic ambiguity, especially between the CN and SMC groups, whose cognitive and clinical traits frequently overlap.
In recent years, machine learning (ML) techniques have shown considerable potential for classifying Alzheimer’s disease using structured clinical and biomarker data. Conventional machine learning models, including logistic regression, support vector machines, random forests, gradient boosting, XGBoost, and LightGBM, are widely used for tabular data due to their reliability, interpretability, and effectiveness. Ensemble-based methods in particular have demonstrated superior performance by capturing non-linear relationships and rich feature interactions. Nonetheless, conventional machine learning techniques may remain limited in modelling higher-level associations when the diagnostic categories overlap substantially.
Deep learning methodologies have considerably broadened the scope of Alzheimer’s disease modelling by enabling flexible, non-linear representations that can capture intricate data patterns. Numerous neural architectures have been examined in this area, including fully connected networks, convolutional architectures, recurrent architectures, and residual tabular networks. Despite the success of these models in imaging applications, applying them to tabular clinical data presents several challenges, including limited sample sizes, overfitting, and lengthy hyperparameter optimisation. Consequently, their performance does not consistently exceed that of well-tuned ensembles of machine learning algorithms.
Despite these advances, several gaps remain in the existing literature. First, some studies concentrate on the classification of Alzheimer’s disease using binary or fixed multi-class formulations without assessing the impact of diagnostic ambiguity on model performance. Second, machine learning and deep learning methodologies are typically examined separately rather than within a unified comparative framework. Third, ensemble learning is widely used, but most studies employ fixed or heuristic weighting procedures, which limits the effective exploitation of model complementarity. Finally, although privacy-preserving federated learning is crucial for practical clinical deployment across multiple institutions, it remains largely unexamined in frameworks for Alzheimer’s disease classification.
This study addresses these gaps through a systematic and progressive approach to classifying the stages of Alzheimer’s disease from clinical tabular data. We formulate a five-class classification task covering CN, SMC, EMCI, LMCI, and AD to evaluate model performance across the entire cognitive spectrum. A comprehensive comparison is conducted among sixteen classical machine learning models and eleven deep learning architectures, including NODE and FT-Transformer. LightGBM is the most effective classical machine learning model, while the five-layer deep neural network (DNN) is retained as the deep learning representative. Based on these observations, a hybrid ensemble is proposed that combines LightGBM and the five-layer DNN. The ensemble weight is determined automatically by a Genetic Algorithm (GA) using Macro-F1 as the fitness metric, yielding an adaptive, data-driven fusion. To support privacy-preserving deployment, we introduce a federated learning architecture in which the DNN component is distributed across non-IID clients, whereas LightGBM remains centralised. The GA-optimised weight governs the combination of the two components under four aggregation algorithms: FedAvg, FedProx, FedNova, and SCAFFOLD.
The main contributions of this study can be summarised as follows:
A systematic and unified comparison of sixteen classical machine learning models and eleven deep learning architectures (including NODE and FT-Transformer) on a five-class Alzheimer’s disease classification task, using a leakage-free predefined split and a comprehensive set of evaluation metrics.
Empirical identification of LightGBM as the best classical ML model and FT-Transformer as the best standalone deep learning architecture (Acc = 0.7810). The DNN (5-layer) is retained as the DL representative in the hybrid ensemble due to its superior federated stability, validated by an ablation study.
A GA-optimised hybrid ensemble framework that adaptively combines the best ML and DL representatives through evolutionary weight search, consistently outperforming individual models across all metrics.
A federated learning architecture that distributes the DNN component across non-IID clients while maintaining LightGBM centrally, integrated via the GA-derived optimal weight, achieving the best overall result with FedNova and surpassing all centralised configurations.
The remainder of this document is structured as follows.
Section 2 presents a comprehensive literature review on the stages of Alzheimer’s disease, traditional machine learning and deep learning classification techniques, ensemble methodologies, and federated learning approaches for clinical data.
Section 3 delineates the dataset, the preprocessing pipeline, the comprehensive range of classical machine learning and deep learning architectures evaluated, the development of the GA-optimised hybrid ensemble, and the federated learning framework.
Section 4 presents and analyses the experimental findings in four subsections: a comparison of classical machine learning models (Section 4.1), a comparison of deep learning architectures (Section 4.2), an evaluation of GA-optimised hybrid ensembles (Section 4.3), and an evaluation of federated learning with hybrid ensembles (Section 4.4).
Section 4.6 discusses the current limitations of the proposed framework and outlines prospective research avenues.
Section 5, the final section, summarises the principal findings and their significance for the medical field.
3. Materials and Methods
This section outlines the experimental methodology proposed for classifying Alzheimer’s disease stages using structured clinical data. The methodology follows a progressive and modular framework. It begins with a five-class formulation based on clinically established diagnostic categories, followed by an extensive comparison of traditional machine learning and deep learning methods to determine the most effective model in each paradigm. A hybrid ensemble is then formed by combining the top-performing representative of each paradigm, with the ensemble weight optimised automatically by a Genetic Algorithm. Finally, the hybrid ensemble is extended to a multi-institutional setting via a privacy-preserving federated learning framework. This staged design supports a transparent examination of model behaviour, a reliable assessment of performance, and a principled integration of complementary learning paradigms. The overall pipeline is illustrated in Figure 1.
3.1. Dataset Description and Clinical Variables
This study uses a publicly available structured clinical tabular dataset obtained from the Kaggle platform (https://www.kaggle.com/datasets/sarthakkanjariya/alzheimer-dataset?resource=download (accessed on 12 January 2026)). The dataset is a preprocessed tabular CSV file derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort and made openly accessible by a third-party contributor on Kaggle. The file contains no individually identifiable information, no raw neuroimaging data, and no biospecimen data. No direct data access agreement with ADNI was required for this study, as the dataset was accessed exclusively through the Kaggle platform in its publicly available form. ADNI is a large-scale, multi-site longitudinal study launched in 2004 that collects standardised neuroimaging, cognitive, biomarker, and genetic data from participants spanning the full spectrum of Alzheimer’s disease progression [14,15]. The dataset contains $N = 1737$ subject-level observations (1390 training and 347 test subjects under the predefined split) characterised by a heterogeneous collection of clinical, neuropsychological, morphometric, and genetic variables. Each row corresponds to a unique subject and includes a subject identifier, a predefined train/test split indicator (Test_data), a diagnostic label (Diagnosis), and $D$ clinical features.
3.1.1. Feature Groups
The feature set spans five clinically meaningful categories:
Cognitive and Neuropsychological Assessments
These include the most discriminative markers of cognitive function routinely collected in ADNI visits:
MMSE (Mini-Mental State Examination): A 30-point scale measuring global cognitive status; lower scores indicate greater impairment.
CDRSB (Clinical Dementia Rating Sum of Boxes): A composite measure of cognitive and functional performance across six domains.
ADAS11 and ADAS13 (Alzheimer’s Disease Assessment Scale–Cognitive Subscale, 11- and 13-item versions): Standardised assessments of memory, language, and praxis; higher scores indicate greater impairment.
ADASQ4: The fourth question of the ADAS addressing delayed word recall.
RAVLT variants (RAVLT_immediate, RAVLT_learning, RAVLT_forgetting, RAVLT_perc_forgetting): Components of the Rey Auditory Verbal Learning Test quantifying verbal memory encoding, retention, and forgetting.
EcogSP and EcogPt subscales: Everyday cognition questionnaire scores from both the study partner and patient perspectives, covering memory, language, visuospatial ability, planning, and organisation.
Neuroimaging-Derived Morphometric Measures
Volumetric and thickness measures were derived from T1-weighted structural MRI using automated segmentation:
Hippocampus: Bilateral hippocampal volume (mm³), the most established imaging biomarker of AD-related neurodegeneration.
WholeBrain: Total brain parenchymal volume.
Entorhinal: Entorhinal cortex volume, which shows early atrophy in Alzheimer’s disease.
Fusiform: Fusiform gyrus volume, associated with face recognition and semantic memory.
MidTemp: Middle temporal gyrus volume.
Ventricles: Total ventricular volume, which expands as a consequence of brain atrophy.
Cerebrospinal Fluid (CSF) Biomarkers
CSF measures reflect core AD pathophysiological processes:
ABETA (Amyloid-β): Reduced levels indicate amyloid plaque accumulation.
TAU: Total tau protein, elevated in the presence of neurofibrillary tangles and neurodegeneration.
PTAU (phosphorylated tau): A more specific marker of tau pathology.
Demographic and Genetic Variables
AGE: Subject age at the time of assessment (years).
Year_education: Years of formal education, used as a proxy for cognitive reserve.
Gender: Biological sex (binary-encoded).
APOE4: Apolipoprotein E ε4 allele dosage (0, 1, or 2 copies), the strongest known genetic risk factor for sporadic AD [63].
Ethnicity, Race: Ethnicity and race categories.
Marital_status: Marital status.
Derived and Summary Indices
Table 1 provides a summary of the dataset composition.
3.1.2. Diagnostic Categories and Class Distribution
The original diagnostic labels follow five clinically meaningful categories spanning the Alzheimer’s disease spectrum:
CN (Cognitively Normal): Participants with no subjective or objective cognitive complaints and normal assessment scores.
SMC (Subjective Memory Complaints): Cognitively unimpaired individuals who report perceived memory difficulties without meeting criteria for MCI [10].
EMCI (Early Mild Cognitive Impairment): Individuals with measurable but subtle cognitive deficits that do not significantly impair daily function [8].
LMCI (Late Mild Cognitive Impairment): A more advanced MCI stage with a substantially elevated rate of conversion to AD [87].
AD (Alzheimer’s Disease): Participants meeting diagnostic criteria for probable AD dementia.
This taxonomy aligns with ADNI cohort definitions [14]. SMC participants are recruited from cognitively unimpaired individuals who voluntarily report memory concerns, making them phenotypically proximal to CN subjects and posing a persistent classification challenge for automated systems.
3.2. Train–Test Split and Leakage Control
The dataset contains an explicit split indicator, Test_data. The training set comprises participants with Test_data = 0 (n = 1390), whereas the held-out test set comprises participants with Test_data = 1 (n = 347). This predefined subject-level split ensures reproducibility and guards against data leakage between training and test subjects.
All preprocessing transformations (imputation statistics, scaling parameters, categorical encoders) were derived only from the training subset and subsequently applied to the test set without re-estimation. For classical machine learning models, hyperparameters were tuned on the training set alone using stratified $k$-fold cross-validation. To optimise the ensemble weight via the Genetic Algorithm, the fitness criterion was computed on a stratified internal validation split (80% for training, 20% for validation). The held-out test set was used exclusively for reporting final performance, so no model was exposed to it during development.
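As a rough illustration of the stratified 80/20 internal validation split described above, the following NumPy sketch partitions training indices so that each class contributes about 20% of its samples to validation. The function name `stratified_split`, the seed, and the toy label vector are assumptions made for illustration; the study's own splitting code is not given in the text.

```python
import numpy as np

def stratified_split(y, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) with per-class proportions preserved."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for k in np.unique(y):
        idx = np.flatnonzero(y == k)      # all samples of class k
        rng.shuffle(idx)                  # random order within the class
        n_val = max(1, int(round(val_fraction * len(idx))))
        val_idx.extend(idx[:n_val].tolist())
        train_idx.extend(idx[n_val:].tolist())
    return np.array(sorted(train_idx)), np.array(sorted(val_idx))

# Toy, hypothetical label vector with five imbalanced classes.
y_toy = np.repeat(np.arange(5), [50, 40, 30, 20, 10])
train_idx, val_idx = stratified_split(y_toy)
print(np.bincount(y_toy[val_idx]))  # validation samples per class
```

Only indices drawn from the training portion would be passed to such a function, so the held-out test subjects never enter the split.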
3.3. Preprocessing Pipeline
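Let $X \in \mathbb{R}^{N \times D}$ denote the raw feature matrix and $\mathbf{y} \in \{\mathrm{CN}, \mathrm{SMC}, \mathrm{EMCI}, \mathrm{LMCI}, \mathrm{AD}\}^{N}$ the corresponding diagnostic labels. The following preprocessing steps were applied sequentially.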
3.3.1. Feature Identification and Type Annotation
Features were categorised into continuous (numeric) and categorical types based on their domain and value range. Continuous variables include all cognitive test scores, volumetric measures, CSF biomarkers, age, and education. Categorical variables include sex, ethnicity, race, marital status, and APOE4 allele dosage (treated as ordinal, with 3 discrete levels: 0, 1, 2).
3.3.2. Missing Value Imputation
Missing values were handled separately by feature type:
Median imputation was preferred over mean imputation for continuous variables because several clinical biomarkers exhibit right-skewed distributions (e.g., CSF amyloid, ventricle volumes). All imputation parameters were derived from the training set and applied identically to the test set.
3.3.3. Standardisation of Continuous Features
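Continuous features were standardised using training-set statistics:
$$\tilde{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j + \epsilon},$$
where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$ over the training set and $\epsilon$ is a small constant that prevents division by zero.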
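The median imputation and standardisation steps can be sketched as follows. This is a minimal NumPy illustration, assuming that missing values are encoded as NaN and that the scaling takes the form $(x - \mu_j)/(\sigma_j + \epsilon)$; the function names, the value of `eps`, and the toy matrices are hypothetical and not taken from the study.

```python
import numpy as np

def fit_preprocessor(X_train, eps=1e-8):
    """Learn medians, means and standard deviations from the training set only."""
    medians = np.nanmedian(X_train, axis=0)
    X_filled = np.where(np.isnan(X_train), medians, X_train)
    return {
        "medians": medians,
        "mu": X_filled.mean(axis=0),
        "sigma": X_filled.std(axis=0),
        "eps": eps,  # assumed small constant guarding against division by zero
    }

def transform(X, params):
    """Apply training-set statistics to any split without re-estimating them."""
    X_filled = np.where(np.isnan(X), params["medians"], X)
    return (X_filled - params["mu"]) / (params["sigma"] + params["eps"])

# Toy data: two continuous features with missing entries.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_test = np.array([[np.nan, 20.0], [4.0, np.nan]])

params = fit_preprocessor(X_train)
Z_train = transform(X_train, params)
Z_test = transform(X_test, params)
```

Because `transform` reuses the stored statistics, a missing test value is filled with the training median and scaled with the training mean and standard deviation, which is what keeps the test set out of the fitting step.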
3.3.4. Categorical Encoding
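Nominal categorical features were encoded using one-hot encoding. For a feature with $m$ distinct categories $\{c_1, \ldots, c_m\}$, the value $c_k$ is mapped to the indicator vector
$$\mathbf{e}(c_k) = \big(\mathbb{1}[k = 1], \ldots, \mathbb{1}[k = m]\big) \in \{0, 1\}^{m}.$$
One category per feature was dropped (dummy encoding) to avoid multicollinearity in linear models. Ordinal variables, such as APOE4 allele count, were retained as integer-coded numerics (0, 1, 2).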
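A small sketch of dummy encoding with one dropped reference category is given below. It assumes the category vocabulary is learned from training values only and that the alphabetically first category serves as the reference; both choices, the helper names, and the example values are illustrative assumptions rather than details reported in the study.

```python
import numpy as np

def fit_categories(train_values):
    """Learn the sorted category vocabulary from training data only."""
    return sorted(set(train_values))

def dummy_encode(values, categories):
    """One-hot encode while dropping the first category as the reference level.

    Values not seen during fitting map to the all-zero row (an assumed policy).
    """
    kept = categories[1:]
    out = np.zeros((len(values), len(kept)), dtype=float)
    for i, v in enumerate(values):
        if v in kept:
            out[i, kept.index(v)] = 1.0
    return out

# Hypothetical marital-status values.
cats = fit_categories(["Married", "Widowed", "Divorced", "Married"])
encoded = dummy_encode(["Divorced", "Widowed", "Unknown"], cats)
print(cats)
print(encoded)
```

A feature with $m$ categories therefore yields $m - 1$ columns, which removes the exact linear dependence that a full one-hot block would introduce alongside an intercept.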
3.3.5. Target Encoding
Diagnostic labels were mapped to contiguous integers for models requiring integer class indices:
$$\phi : \{\mathrm{CN}, \mathrm{SMC}, \mathrm{EMCI}, \mathrm{LMCI}, \mathrm{AD}\} \to \{0, 1, 2, 3, 4\}.$$
The mapping was fixed across all experiments for reproducibility. For neural networks trained with categorical cross-entropy, integer labels were further converted to one-hot vectors.
3.4. Classical Machine Learning Models
Sixteen classical supervised classification models were evaluated, spanning the full spectrum of supervised learning paradigms: logistic regression (LR), support vector machine with RBF kernel (SVM-RBF), support vector machine with linear kernel (SVM-Linear), K-nearest neighbours (KNN), decision tree (DT), random forest (RF), extra trees (ET), gradient boosting (GB), AdaBoost, bagging, XGBoost, LightGBM, CatBoost, naive Bayes (NB), linear discriminant analysis (LDA), and quadratic discriminant analysis (QDA).
3.4.1. Logistic Regression
Logistic regression serves as a linear baseline with interpretable coefficient weights. Multi-class classification was handled using the one-vs.-rest (OvR) strategy, with regularisation strength C tuned via cross-validation.
3.4.2. Support Vector Machine
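The SVM-RBF learns a maximum-margin decision boundary in a kernel-induced feature space:
$$f(\mathbf{x}) = \sum_{i} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b,$$
using the Gaussian RBF kernel $K(\mathbf{x}, \mathbf{x}') = \exp\!\big(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\big)$. Hyperparameters $C$ and $\gamma$ were tuned via a grid search.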
3.4.3. Gradient Boosting and Ensemble Methods
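Gradient boosting [23] builds an additive model of $M$ shallow trees:
$$F_M(\mathbf{x}) = \sum_{m=1}^{M} \eta\, h_m(\mathbf{x}),$$
where $h_m$ minimises the residual loss and $\eta$ is the learning rate. XGBoost [24] extends this with explicit regularisation:
$$\mathcal{L} = \sum_{i} \ell(y_i, \hat{y}_i) + \sum_{m=1}^{M} \Omega(h_m), \qquad \Omega(h) = \gamma T + \tfrac{1}{2} \lambda \lVert \mathbf{w} \rVert^{2},$$
where $T$ is the number of leaves and $\mathbf{w}$ the vector of leaf weights of tree $h$.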
3.4.4. LightGBM
LightGBM [26] is a gradient-boosted decision tree framework based on histogram-based leaf-wise growth with depth constraints. It employs Gradient-based One-Side Sampling (GOSS) to focus computation on informative training instances and Exclusive Feature Bundling (EFB) to reduce the effective number of features, yielding substantial improvements in training speed and memory efficiency over standard gradient boosting while maintaining competitive or superior predictive accuracy. LightGBM was trained with n_estimators = 300, learning rate $\eta$, and num_leaves = 63. It emerged as the best-performing classical ML model in this study (accuracy = 0.8156, Cohen’s $\kappa$ = 0.7537) and was selected as the ML representative for the hybrid ensemble.
3.5. Deep Learning Architectures
Eleven deep learning architectures were trained and evaluated, spanning several distinct inductive biases: fully connected networks (DNN 3-layer, DNN 5-layer), convolutional feature extraction (CNN-1D), sequential modelling (LSTM, BiLSTM), hybrid convolutional–recurrent (CNN-LSTM), residual tabular learning (ResNet-Tabular), transformer-based attention (Tab-Transformer), and self-supervised reconstruction (AutoEncoder-Clf). Two additional modern tabular architectures were evaluated: NODE (Neural Oblivious Decision Ensembles [54]) and FT-Transformer (Feature Tokeniser Transformer [55]).
3.5.1. Common Training Settings
All neural networks were trained using categorical cross-entropy loss:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log \hat{p}_k,$$
where $y_k$ is the one-hot target and $\hat{p}_k$ is the softmax output probability for class $k$. Training used the AdamW optimiser with cosine annealing learning rate scheduling (initial learning rate $\eta_0$, $T_{\max}$ epochs), class-weighted cross-entropy loss to address class imbalance, gradient clipping (maximum norm $g_{\max}$), and early stopping with patience of 15 epochs based on validation accuracy.
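To make the class-weighted loss and the cosine schedule concrete, the sketch below implements both in NumPy. It assumes inverse-frequency ("balanced") class weights and a loss normalised by the sum of sample weights; the study does not state its exact weighting scheme or learning-rate values, so these choices and all function names are illustrative.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def class_weights(y, n_classes):
    """Assumed inverse-frequency weights: n / (K * n_k)."""
    counts = np.bincount(y, minlength=n_classes).astype(float)
    return len(y) / (n_classes * np.maximum(counts, 1.0))

def weighted_cross_entropy(logits, y, weights):
    """Class-weighted cross-entropy, averaged over the sum of sample weights."""
    p = softmax(logits)
    w = weights[y]
    nll = -np.log(p[np.arange(len(y)), y] + 1e-12)
    return float((w * nll).sum() / w.sum())

def cosine_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Cosine annealing from lr_max at epoch 0 down to lr_min at total_epochs."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + np.cos(np.pi * epoch / total_epochs))

# Toy example: three samples of class 0 and one of class 1.
y = np.array([0, 0, 0, 1])
w = class_weights(y, n_classes=2)
loss_uniform = weighted_cross_entropy(np.zeros((4, 2)), y, w)
print(w, loss_uniform)
```

With these weights, an error on the minority class contributes three times as much to the loss as an error on the majority class in the toy example, which is the mechanism by which class weighting counteracts imbalance.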
3.5.2. Fully Connected Networks (DNNs)
Two DNN variants were evaluated: a three-layer network with hidden dimensions (512, 256, 128) and a five-layer network with dimensions (256, 256, 128, 128, 64), both with batch normalisation, ReLU activations, and dropout regularisation between layers. The DNN (5-layer) (accuracy = 0.7118, $\kappa$ = 0.6178) was selected as the DL representative for the hybrid ensemble. FT-Transformer achieves higher standalone performance (Acc = 0.7810), but an ablation study indicates that it is less stable under federated non-IID conditions, which justifies retaining the DNN in the hybrid.
3.5.3. Convolutional and Recurrent Architectures
The CNN-1D applies convolutional filters along the feature dimension, with fixed feature ordering, to extract local patterns within adjacent clinical feature groups. For recurrent models (LSTM, BiLSTM, CNN-LSTM), tabular features were reshaped into a pseudo-sequence, providing a consistent tensor interface while maintaining a fixed, reproducible feature order. The LSTM processes the sequence via gated memory cells, the BiLSTM extends this bidirectionally, and the CNN-LSTM applies convolutional extraction before sequential modelling.
3.5.4. Residual and Transformer Architectures
ResNet-Tabular implements residual skip connections in a fully connected network, reducing gradient degradation in deeper representations. Tab-Transformer tokenises each feature as an embedding and applies multi-head self-attention across feature tokens. AutoEncoder-Clf pre-trains an encoder-decoder on the reconstruction objective, then attaches a classification head to the latent representation.
3.5.5. NODE and FT-Transformer
NODE (Neural Oblivious Decision Ensembles) [54] implements a differentiable ensemble of oblivious decision trees, enabling end-to-end optimisation via backpropagation. FT-Transformer (Feature Tokeniser Transformer) [55] tokenises each continuous feature into a d-dimensional embedding and applies multi-head self-attention across the feature tokens, capturing global pairwise feature interactions. Both architectures were implemented natively in PyTorch 2.11 and trained under identical conditions to all other DL architectures. The FT-Transformer used
,
,
, attention dropout
.
3.6. Hybrid Ensemble: LightGBM + DNN (5-Layer)
3.6.1. Motivation
LightGBM and the 5-layer DNN have distinct, complementary strengths. LightGBM employs gradient-boosted tree induction, yielding superior overall accuracy and precision, whereas the DNN allocates recall more uniformly across classes and shows heightened sensitivity to minority diagnostic categories. Because neither model is the more accurate predictor for every subject, the ensemble can exploit this complementarity by combining their outputs at the probability level.
3.6.2. Probability-Level Fusion
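Let $\mathbf{p}_{\mathrm{LGBM}}(\mathbf{x})$ and $\mathbf{p}_{\mathrm{DNN}}(\mathbf{x})$ denote the predicted posterior probability vectors from LightGBM and DNN (5-layer), respectively. The hybrid ensemble combines these via a convex weighted average:
$$\mathbf{p}_{\mathrm{hyb}}(\mathbf{x}) = w\, \mathbf{p}_{\mathrm{LGBM}}(\mathbf{x}) + (1 - w)\, \mathbf{p}_{\mathrm{DNN}}(\mathbf{x}), \qquad w \in [0, 1]. \tag{7}$$
The final predicted class is
$$\hat{y} = \arg\max_{k} \; p_{\mathrm{hyb},k}(\mathbf{x}).$$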
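The probability-level fusion amounts to a convex combination of two probability matrices followed by an arg-max. The sketch below shows this in NumPy; the function name `fuse` and the example probabilities are hypothetical and chosen only to show how the weight can change the predicted class.

```python
import numpy as np

def fuse(p_lgbm, p_dnn, w):
    """Convex combination of two (n_samples, n_classes) probability matrices."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    p_hyb = w * p_lgbm + (1.0 - w) * p_dnn
    return p_hyb, p_hyb.argmax(axis=1)

# Hypothetical posteriors for two subjects and three classes.
p_lgbm = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_dnn = np.array([[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])

p_half, labels_half = fuse(p_lgbm, p_dnn, w=0.5)
_, labels_tree_only = fuse(p_lgbm, p_dnn, w=1.0)
print(labels_half, labels_tree_only)
```

Since both inputs have rows summing to one and the weights are non-negative and sum to one, the fused rows remain valid probability distributions.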
3.7. Genetic Algorithm for Optimal Ensemble Weighting
3.7.1. Optimisation Goal
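Rather than selecting $w$ by manual tuning or grid search, a Genetic Algorithm (GA) [80] is employed to automatically identify the optimal ensemble weight. The GA optimises the fitness function $F(w)$ computed on a stratified internal validation set $\mathcal{D}_{\mathrm{val}}$ ($20\%$ of the training data):
$$F(w) = \mathrm{F1}_{\mathrm{macro}}\big(\hat{y}_w;\, \mathcal{D}_{\mathrm{val}}\big),$$
where $\hat{y}_w$ denotes the hybrid predictions obtained with weight $w$ and $\mathrm{F1}_{\mathrm{macro}}$ is the macro-averaged F1-score, which is sensitive to class imbalance and reflects multi-class diagnostic utility.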
3.7.2. Chromosome Representation
Each individual in the GA population is a chromosome encoding a single real-valued gene $w \in [0, 1]$, representing the weight assigned to LightGBM in Equation (7). The DNN (5-layer) weight is implicitly given by $1 - w$.
3.7.3. Population Initialisation and Evolutionary Operators
An initial population of $P$ individuals is drawn uniformly:
$$w_i^{(0)} \sim \mathcal{U}(0, 1), \qquad i = 1, \ldots, P.$$
Three standard GA operators drive population evolution across $G$ generations:
Selection: Tournament selection of size $k$ preserves fitter weight configurations.
Crossover: Blend crossover produces offspring as convex combinations of parent genes: $w_{c_1} = u\, w_{p_1} + (1 - u)\, w_{p_2}$, $w_{c_2} = (1 - u)\, w_{p_1} + u\, w_{p_2}$, with $u \in [0, 1]$.
Mutation: A Gaussian perturbation is applied with probability $p_m$:
$$w' = \mathrm{clip}\big(w + \mathcal{N}(0, \sigma^{2}),\, 0,\, 1\big).$$
Elitism carries the top-$e$ individuals from each generation unchanged, so the best fitness never decreases between generations. The GA terminates upon reaching $G$ generations or when the fitness improvement across consecutive generations falls below a convergence threshold $\delta$.
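The evolutionary search described above can be sketched compactly for a single real-valued gene. The following NumPy implementation uses tournament selection, blend crossover, Gaussian mutation with clipping, and elitism, with Macro-F1 on a validation set as fitness. All hyperparameter defaults, the synthetic probabilities, and the function names are assumptions for illustration; the convergence-threshold stopping rule is omitted for brevity, so the loop always runs for the full number of generations.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    scores = []
    for k in range(n_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

def ga_optimise_weight(p_a, p_b, y_val, pop_size=20, generations=30,
                       tournament_k=3, p_mut=0.2, sigma=0.1, n_elite=2, seed=0):
    """Search w in [0, 1] maximising Macro-F1 of w * p_a + (1 - w) * p_b."""
    rng = np.random.default_rng(seed)
    n_classes = p_a.shape[1]

    def fitness(w):
        fused = w * p_a + (1.0 - w) * p_b
        return macro_f1(y_val, fused.argmax(axis=1), n_classes)

    pop = rng.uniform(0.0, 1.0, size=pop_size)       # uniform initialisation
    history = []                                     # best fitness per generation
    for _ in range(generations):
        fit = np.array([fitness(w) for w in pop])
        order = np.argsort(fit)[::-1]
        history.append(float(fit[order[0]]))
        elite = pop[order[:n_elite]]                 # elitism: copied unchanged

        def tournament():
            idx = rng.integers(0, pop_size, size=tournament_k)
            return pop[idx[np.argmax(fit[idx])]]

        children = []
        while len(children) < pop_size - n_elite:
            a, b = tournament(), tournament()
            u = rng.uniform()                        # blend coefficient
            for child in (u * a + (1 - u) * b, (1 - u) * a + u * b):
                if rng.uniform() < p_mut:            # Gaussian mutation
                    child = child + rng.normal(0.0, sigma)
                children.append(float(np.clip(child, 0.0, 1.0)))
        pop = np.concatenate([elite, np.array(children[:pop_size - n_elite])])

    fit = np.array([fitness(w) for w in pop])
    best = int(np.argmax(fit))
    return float(pop[best]), float(fit[best]), history

# Synthetic validation data: two noisy, partially informative models.
rng = np.random.default_rng(42)
y_val = rng.integers(0, 5, size=200)
onehot = np.eye(5)[y_val]
p_a = 0.6 * rng.dirichlet(np.ones(5), size=200) + 0.4 * onehot
p_b = 0.7 * rng.dirichlet(np.ones(5), size=200) + 0.3 * onehot
w_star, f_star, history = ga_optimise_weight(p_a, p_b, y_val)
print(round(w_star, 3), round(f_star, 3))
```

Because the fitness is deterministic and the elite individuals are copied unchanged, the recorded best fitness cannot decrease from one generation to the next.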
Hyperparameter Justification and Convergence Stability
The GA hyperparameters (population size $P$, number of generations $G$, tournament size $k$, and mutation probability $p_m$) were selected based on established guidelines for real-valued single-gene optimisation [80,81]. To assess reproducibility, 30 independent GA runs were conducted (seeds 0–29). The 30 runs converged to a near-identical optimal weight $w^{*}$, confirming near-zero variance and full convergence stability across all initialisations.
3.8. Federated Learning Framework
To extend the hybrid ensemble toward privacy-preserving deployment, a federated learning (FL) architecture is introduced in which the DNN (5-layer) component is distributed across $C$ simulated clients while LightGBM remains centralised. Tree-based models cannot be federated by weight averaging; thus, keeping LightGBM centralised is a principled architectural choice. The federated hybrid prediction at inference time follows Equation (7) with $w = w^{*}$ from the GA and $\mathbf{p}_{\mathrm{DNN}}$ replaced by the output of the global federated DNN:
$$\mathbf{p}_{\mathrm{fed}}(\mathbf{x}) = w^{*}\, \mathbf{p}_{\mathrm{LGBM}}(\mathbf{x}) + (1 - w^{*})\, \mathbf{p}_{\mathrm{DNN}}^{\mathrm{fed}}(\mathbf{x}).$$
3.8.1. Non-IID Client Data Partition
The training set was partitioned across $C$ clients using a Dirichlet distribution with concentration parameter $\alpha$, which produces a realistic non-IID data heterogeneity reflecting different diagnostic prevalences across clinical institutions. Formally, for each class $k$, the proportion of class-$k$ samples allocated to client $c$ is drawn as follows:
$$(q_{k,1}, \ldots, q_{k,C}) \sim \mathrm{Dir}(\alpha, \ldots, \alpha),$$
where $q_{k,c}$ is the share of class-$k$ samples assigned to client $c$. Smaller $\alpha$ values yield greater heterogeneity; the value used here corresponds to a moderately heterogeneous regime commonly adopted in federated learning benchmarks.
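A Dirichlet label partition of this kind can be sketched in a few lines of NumPy. For each class, client shares are drawn from a symmetric Dirichlet distribution and the shuffled class indices are cut accordingly. The number of clients, the value of `alpha`, the seed, and the toy labels below are placeholders, not the settings used in the study.

```python
import numpy as np

def dirichlet_partition(y, n_clients, alpha, seed=0):
    """Split sample indices across clients with class shares drawn from Dir(alpha)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for k in np.unique(y):
        idx = np.flatnonzero(y == k)
        rng.shuffle(idx)
        shares = rng.dirichlet(alpha * np.ones(n_clients))   # q_{k,1..C}
        cuts = (np.cumsum(shares)[:-1] * len(idx)).astype(int)
        for c, part in enumerate(np.split(idx, cuts)):
            client_indices[c].extend(part.tolist())
    return [np.array(sorted(ix), dtype=int) for ix in client_indices]

# Toy labels: five classes with 40 samples each; placeholder settings.
y_toy = np.repeat(np.arange(5), 40)
parts = dirichlet_partition(y_toy, n_clients=4, alpha=0.5)
for c, ix in enumerate(parts):
    print(c, np.bincount(y_toy[ix], minlength=5))
```

Every sample is assigned to exactly one client, while the per-client class histograms differ, which is the label-skew form of heterogeneity the text describes.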
3.8.2. Local Training
In every communication round, each client performs $E$ local training epochs using stochastic gradient descent with momentum and weight decay. A class-weighted cross-entropy loss counteracts the local class imbalance resulting from the non-IID partition, and gradient clipping with a maximum norm of 1.0 stabilises local updates. All clients are initialised from the same warm-start checkpoint, obtained by pre-training the deep neural network (DNN) on the complete training set for twenty epochs. This establishes a stable foundation before federation begins.
3.8.3. Federated Aggregation Algorithms
Four FL aggregation algorithms were evaluated and compared:
FedAvg [88]
The server aggregates local model parameters as a weighted average proportional to local dataset sizes:
$$\theta^{t+1} = \sum_{c=1}^{C} \frac{n_c}{n}\, \theta_c^{t+1},$$
where $n_c$ is the number of training samples on client $c$ and $n = \sum_{c=1}^{C} n_c$.
FedProx [89]
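FedProx augments the local objective with a proximal regularisation term that penalises deviation from the global model:
$$\min_{\theta}\; F_c(\theta) + \frac{\mu}{2} \lVert \theta - \theta^{t} \rVert^{2},$$
where $F_c$ is the local loss of client $c$, $\theta^{t}$ is the current global model, and $\mu$ controls the strength of the proximal constraint. This reduces client drift under non-IID data distributions.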
FedNova [90]
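FedNova normalises each client’s gradient update by the number of local gradient steps $\tau_c$ before aggregation, correcting for objective inconsistency caused by heterogeneous local training:
$$\theta^{t+1} = \theta^{t} - \tau_{\mathrm{eff}} \sum_{c=1}^{C} \frac{n_c}{n} \cdot \frac{\Delta_c^{t}}{\tau_c}, \qquad \tau_{\mathrm{eff}} = \sum_{c=1}^{C} \frac{n_c}{n}\, \tau_c,$$
where $\Delta_c^{t} = \theta^{t} - \theta_c^{t+1}$ is the local update and $\tau_c$ is the local step count.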
SCAFFOLD [91]
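SCAFFOLD introduces client-side and server-side control variates $c_i$ and $c$ to correct for client drift in the gradient direction:
$$\theta_i \leftarrow \theta_i - \eta_l \big( g_i(\theta_i) - c_i + c \big),$$
where $g_i(\theta_i)$ is the local stochastic gradient and $\eta_l$ is the local learning rate. Control variates are updated after each round, reducing the variance introduced by non-IID distributions.
All four algorithms ran for the same number $R$ of communication rounds.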
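The four aggregation rules can be compared side by side on flat parameter vectors. The sketch below is a simplified NumPy rendering under stated assumptions: model parameters are one-dimensional arrays, `grad_fn` is a hypothetical callable returning a local gradient, and the SCAFFOLD control-variate refresh follows the commonly used "option II" rule from the original SCAFFOLD paper, which the study does not explicitly confirm.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Dataset-size-weighted average of client parameter vectors."""
    p = np.asarray(client_sizes, dtype=float)
    p = p / p.sum()
    return sum(p_c * th for p_c, th in zip(p, client_params))

def fedprox_step(theta, theta_global, grad_fn, lr, mu):
    """One local step on F_c(theta) + (mu / 2) * ||theta - theta_global||^2."""
    return theta - lr * (grad_fn(theta) + mu * (theta - theta_global))

def fednova(theta_global, client_params, client_sizes, client_steps):
    """Aggregate local updates normalised by each client's local step count."""
    p = np.asarray(client_sizes, dtype=float)
    p = p / p.sum()
    tau = np.asarray(client_steps, dtype=float)
    tau_eff = float((p * tau).sum())
    d = sum(p_c * (theta_global - th) / t for p_c, th, t in zip(p, client_params, tau))
    return theta_global - tau_eff * d

def scaffold_local(theta_global, grad_fn, c_local, c_global, lr, n_steps):
    """Drift-corrected local training plus an assumed 'option II' variate refresh."""
    theta = theta_global.copy()
    for _ in range(n_steps):
        theta = theta - lr * (grad_fn(theta) - c_local + c_global)
    c_new = c_local - c_global + (theta_global - theta) / (n_steps * lr)
    return theta, c_new

# Toy checks on two-dimensional parameter vectors.
clients = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
theta_avg = fedavg(clients, client_sizes=[1, 3])
theta_nova = fednova(np.zeros(2), clients, client_sizes=[1, 3], client_steps=[5, 5])

target = np.array([2.0, 2.0])          # minimiser of a toy quadratic local loss
theta_prox = np.zeros(2)
for _ in range(200):
    theta_prox = fedprox_step(theta_prox, np.zeros(2), lambda t: t - target, lr=0.1, mu=1.0)

theta_sc, c_sc = scaffold_local(np.zeros(2), lambda t: np.ones(2),
                                c_local=np.zeros(2), c_global=np.zeros(2),
                                lr=0.1, n_steps=5)
print(theta_avg, theta_nova, theta_prox, theta_sc, c_sc)
```

When all clients take the same number of local steps, the FedNova rule reduces to FedAvg, and with a proximal strength of 1 the FedProx iterate settles halfway between the global model and the local minimiser in this toy problem.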
3.9. Evaluation Metrics
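Given the multi-class nature of the problem and the severe class imbalance in the test set (particularly the AD minority class), model performance was assessed using a comprehensive set of metrics evaluated on the held-out test set:
Accuracy: Proportion of correctly classified subjects.
Balanced Accuracy (BA): Average per-class recall, accounting for class-size imbalance:
$$\mathrm{BA} = \frac{1}{K} \sum_{k=1}^{K} \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}.$$
Macro-averaged F1-score: Unweighted average of per-class F1-scores, treating all classes equally, regardless of support:
$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{2\, \mathrm{TP}_k}{2\, \mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k}.$$
Weighted F1-score: F1 averaged by class support, reflecting performance on the actual class distribution.
Cohen’s Kappa ($\kappa$): Agreement between predicted and true labels corrected for chance:
$$\kappa = \frac{p_o - p_e}{1 - p_e},$$
where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.
Matthews Correlation Coefficient (MCC): A balanced metric robust to class imbalance:
$$\mathrm{MCC} = \frac{c\, s - \sum_{k} p_k t_k}{\sqrt{\big(s^{2} - \sum_{k} p_k^{2}\big)\big(s^{2} - \sum_{k} t_k^{2}\big)}},$$
where $c$ is the number of correctly classified samples, $s$ is the total number of samples, and $p_k$ and $t_k$ are the numbers of predicted and true samples of class $k$.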
ROC-AUC (macro OvR): Area under the receiver operating characteristic curve averaged across all one-vs.-rest binary problems.
Confusion Matrix: A matrix providing class-level breakdown of true and predicted labels.
Balanced accuracy and macro-F1 are reported as primary performance measures throughout, as they are more informative than raw accuracy under class imbalance and are robust to differences in class support across the five diagnostic categories.
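All of the scalar metrics listed above can be derived from the confusion matrix. The following NumPy sketch computes them using the standard multi-class definitions; the function names and the six-sample toy labels are illustrative and are not taken from the study.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index true classes, columns index predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def summary_metrics(cm):
    cm = cm.astype(float)
    n = cm.sum()
    tp = np.diag(cm)
    support = cm.sum(axis=1)      # true samples per class (t_k)
    predicted = cm.sum(axis=0)    # predicted samples per class (p_k)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    f1_den = support + predicted  # equals 2*TP + FP + FN
    f1 = np.divide(2 * tp, f1_den, out=np.zeros_like(tp), where=f1_den > 0)
    p_o = tp.sum() / n
    p_e = (support * predicted).sum() / n**2
    mcc_den = np.sqrt((n**2 - (predicted**2).sum()) * (n**2 - (support**2).sum()))
    return {
        "accuracy": p_o,
        "balanced_accuracy": recall.mean(),
        "macro_f1": f1.mean(),
        "weighted_f1": (support * f1).sum() / n,
        "kappa": (p_o - p_e) / (1 - p_e),
        "mcc": (tp.sum() * n - (support * predicted).sum()) / mcc_den if mcc_den > 0 else 0.0,
    }

# Toy three-class example.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
metrics = summary_metrics(cm)
print(cm)
print({k: round(float(v), 4) for k, v in metrics.items()})
```

Balanced accuracy and Macro-F1 average over classes rather than samples, so a class with few test subjects influences them as much as a large class does.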
4. Results
This section presents the experimental results obtained on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) clinical dataset comprising $N = 1737$ subjects and $D$ heterogeneous features, partitioned into a training set ($n = 1390$, Test_data = 0) and a held-out test set ($n = 347$, Test_data = 1) using the predefined split indicator. Four complementary evaluation perspectives are reported in sequence: (Section 4.1) classical machine learning models; (Section 4.2) deep learning architectures; (Section 4.3) a Genetic Algorithm-optimised hybrid ensemble; and (Section 4.4) a federated learning framework integrating distributed deep learning with a centralised LightGBM component. All metrics are computed exclusively on the held-out test set.
4.1. Classical Machine Learning Models
Sixteen classical supervised classification algorithms were trained on the full feature set and evaluated under the five-class formulation {CN, SMC, EMCI, LMCI, AD}. Table 2 reports all evaluation metrics across all models, sorted by Macro F1.
Table 2 shows a clear performance hierarchy centred on gradient-boosted tree ensembles. XGBoost leads on Macro-F1 (0.6977 vs. 0.6906) and balanced accuracy (0.7040 vs. 0.7027), indicating slightly better class parity, whereas LightGBM achieves the highest overall accuracy (0.8156), weighted F1 (0.8017), Cohen’s $\kappa$ (0.7537), and MCC (0.7558). Model selection among classical ML candidates was based on a composite criterion across all reported metrics. Because the differences in favour of XGBoost are not statistically significant (McNemar’s test; Table 3), LightGBM was selected as the ML representative for the hybrid ensemble on this composite basis. The most effective methods are ensemble-based tree algorithms, which can capture non-linear relationships across the diverse clinical, neuroimaging, and genetic characteristics in this dataset; this is further substantiated by the observation that all five top models are variants of boosting or bagging. The linear models (logistic regression, linear-kernel support vector machine, and linear discriminant analysis) achieve accuracies of 0.62–0.65, which is consistent with the non-linearity of the five-class problem. The Gaussian distributional assumptions underlying QDA are violated, leading to near-random performance. A clinically relevant limitation is the poor AD recall (0.06 for LightGBM; see Table 4): the rarest category is the one most frequently missed. This underscores the need for class-aware learning methodologies.
To assess result stability, all 16 ML models were evaluated across 10 independent runs with different random seeds. Deterministic models (LightGBM, XGBoost, CatBoost, AdaBoost, SVM, logistic regression, LDA, QDA, KNN, naive Bayes) produced zero variance (std = 0.0000). The stochastic models (random forest, gradient boosting, and decision tree) showed small, stable variance. LightGBM achieved Acc = 0.8156 ± 0.0000 across all 10 runs, confirming full determinism.
The per-class results highlight the stark gap between the majority classes and the AD class. CN, SMC, and EMCI exhibit F1 values between 0.80 and 0.89, indicating that the model establishes stable decision boundaries for cognitively normal individuals and those with mild impairment. The AD class achieves a recall of only 0.06: 15 of the 16 AD individuals in the test set are misclassified. This outcome is attributable to the pronounced class imbalance (AD: 16 vs. SMC: 84 individuals) and the considerable phenotypic overlap between late-stage mild cognitive impairment and Alzheimer’s disease at the feature level. This limitation motivates complementary techniques, including deep learning and hybrid ensemble methods, particularly for clinical screening, where false negatives for Alzheimer’s disease are the most costly errors.
The confusion matrix for the LightGBM classifier evaluated on the five-class test set is presented in Figure 2. The matrix shows strong diagonal dominance, with 53 CN, 78 SMC, 56 EMCI, 95 LMCI, and 1 AD samples classified correctly. The most frequent misclassifications occur at clinically adjacent boundaries, with LMCI samples most often misidentified as EMCI (17) and CN (10), which may be attributed to the inherent overlap among the advancing MCI stages. The most affected category is AD, with 13 samples misclassified as SMC and only 1 correctly identified. This is likely due to its severe under-representation (16 test samples), which persists despite SMOTE augmentation during training. In summary, the model discriminates the intermediate and predominant classes well, particularly SMC and LMCI, but it exposes a persistent difficulty in separating AD from SMC, a clinically significant distinction that warrants further work through targeted feature engineering or cost-sensitive learning.
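The headline figures can be cross-checked against the diagonal counts quoted above. The sketch below uses only the counts and class supports stated in the text (AD: 16, SMC: 84, LMCI: 122); the test-set size is not stated at this point, so the value computed here is an inference from the reported accuracy and should be read as an assumption.

```python
# Correct predictions per class for LightGBM, as quoted from the confusion matrix.
lightgbm_diag = {"CN": 53, "SMC": 78, "EMCI": 56, "LMCI": 95, "AD": 1}
# Class supports that the text states explicitly (CN and EMCI are not given).
support = {"SMC": 84, "LMCI": 122, "AD": 16}

def recall(correct, n_true):
    """Fraction of a class's true members that were predicted correctly."""
    return correct / n_true

def implied_test_size(n_correct, accuracy):
    """Back out the sample count from a correct count and a rounded accuracy."""
    return round(n_correct / accuracy)

n_correct = sum(lightgbm_diag.values())
n_test = implied_test_size(n_correct, 0.8156)  # inferred, not stated in the text

for cls, n in support.items():
    print(f"{cls}: recall = {recall(lightgbm_diag[cls], n):.4f}")
print("correct:", n_correct, "implied test size:", n_test)
```

The AD recall of 1/16 matches the reported 0.06, and the same implied test size reproduces the DNN accuracy of 0.7118 from its 247 correct predictions.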
4.2. Deep Learning Results
Nine deep learning architectures were trained on the same feature set and evaluated on the identical held-out test set under the five-class formulation {CN, SMC, EMCI, LMCI, AD}. The architectures span seven design paradigms: fully connected networks (DNN 3-layer, DNN 5-layer), residual tabular learning (ResNet-Tabular), convolutional feature extraction (CNN-1D), sequential modelling (LSTM, BiLSTM), hybrid convolutional-recurrent modelling (CNN-LSTM), self-supervised reconstruction (AutoEncoder-Clf), and transformer-based attention (Tab-Transformer). All models were trained under identical conditions: AdamW optimiser with weight decay, a cosine annealing learning rate schedule, class-weighted cross-entropy loss to mitigate the class imbalance inherent in the five-class ADNI cohort, gradient clipping by maximum norm, and early stopping with a patience of 15 epochs based on validation accuracy. Sequence-based models (LSTM, BiLSTM, CNN-LSTM) used a deterministic pseudo-sequence reshaping of the feature vector to provide a consistent tensor interface.
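Two of the training controls named above, cosine annealing and patience-based early stopping, can be illustrated without a deep learning framework. This is a minimal sketch under stated assumptions: the peak learning rate, the epoch budget, and the validation curve are hypothetical, and only the patience of 15 epochs comes from the text.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate for a 0-indexed epoch."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

class EarlyStopping:
    """Signal a stop once validation accuracy has stalled for `patience` epochs."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        if val_acc > self.best:
            self.best, self.bad_epochs = val_acc, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience  # True means "stop now"

# Hypothetical run: accuracy improves for five epochs, then plateaus.
stopper = EarlyStopping(patience=15)
val_curve = [0.50, 0.55, 0.60, 0.62, 0.63] + [0.63] * 30
stopped_at = None
for epoch, acc in enumerate(val_curve):
    lr = cosine_lr(epoch, total_epochs=100, lr_max=1e-3)  # hypothetical settings
    if stopper.step(acc):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "with best accuracy", stopper.best)
```

In a full training loop the checkpoint with the best validation accuracy would normally be restored after stopping; that step is omitted here.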
Among all 11 DL architectures (the nine above plus FT-Transformer and NODE), FT-Transformer achieves the highest performance (Acc = 0.7810, Macro-F1 = 0.6796), surpassing the DNN (5-layer; Acc = 0.7118) by 6.9 percentage points in accuracy. NODE achieves Acc = 0.6455 and Macro-F1 = 0.5725, performing comparably to the sequential architectures. The data shown in Table 5 and Figure 3 indicate that deep learning architectures consistently yield inferior results compared to their classical counterparts on this structured clinical dataset. The strongest of the nine baseline architectures, the 5-layer DNN, attains an accuracy of 0.7118 and a Macro F1-score of 0.6194, compared to 0.8156 and 0.6906 for LightGBM, gaps of 10.4 and 7.1 percentage points, respectively. Even FT-Transformer (Acc = 0.7810) trails LightGBM (Acc = 0.8156) by 3.5 percentage points. This observation aligns with existing benchmarking studies, which indicate that gradient-boosted tree ensembles consistently surpass neural networks on tabular datasets with a limited number of training samples [51, 52]. Across 10 independent runs, the DNN (5-layer) achieves Acc = 0.6836 ± 0.0072 and Macro-F1 = 0.6017 ± 0.0113, confirming stable performance.
The DNN (5-layer) architecture achieves the strongest results on accuracy, weighted F1, Macro Precision, Cohen’s κ, MCC, and ROC-AUC compared with the other baseline architectures. CNN-LSTM exhibits the highest balanced accuracy (0.6332) and recall, surpassing the DNN (5-layer) on these class-balanced metrics. This indicates that the hybrid convolutional-recurrent architecture can learn local feature patterns that enhance sensitivity to minority classes. ResNet-Tabular (accuracy 0.6628, Macro F1 0.5955) improves Macro F1 over the conventional three-layer deep neural network (0.6888, 0.5837) but not accuracy, so the benefit of residual skip connections is limited to class balance here. Skip connections mitigate gradient degradation in deeper networks while maintaining representational capacity.
Recurrent architectures such as LSTM (0.6455) and BiLSTM (0.6225) perform worse than fully connected and residual networks. This result is attributable to the pseudo-sequence reshaping used to convert tabular features into sequential inputs: the ordering of clinical variables is arbitrary, and cognitive scores, volumetric measures, and genetic markers are intermixed with no chronological order. The recurrent inductive bias therefore offers no genuine benefit and may introduce spurious sequential dependencies that impede generalisation. The least effective configurations are Tab-Transformer (0.5821) and CNN-1D (0.5187), reflecting the recognised data requirements of attention-based and convolutional architectures, which need significantly larger training datasets to exploit their inductive biases.
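The point about arbitrary ordering can be made concrete with a small sketch of pseudo-sequence reshaping. The exact reshape used in the experiments is not recoverable from the text, so the zero-padding scheme, the helper name `to_pseudo_sequence`, and the seven-feature row below are all hypothetical.

```python
def to_pseudo_sequence(features, steps):
    """Zero-pad a flat feature vector and split it into `steps` equal chunks."""
    width = -(-len(features) // steps)  # ceiling division
    padded = list(features) + [0.0] * (steps * width - len(features))
    return [padded[i * width:(i + 1) * width] for i in range(steps)]

row = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # hypothetical 7-feature subject
seq = to_pseudo_sequence(row, steps=3)
print(seq)  # [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.0, 0.0]]

# Reordering the columns changes the "time steps" without changing the
# information content, which is why the sequence carries no temporal meaning.
permuted = to_pseudo_sequence(row[::-1], steps=3)
print(permuted[0])  # [0.7, 0.6, 0.5]
```

A recurrent model sees `seq` and `permuted` as different sequences even though they describe the same subject, so any order-dependent pattern it learns is an artefact of the column layout.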
The classification of Alzheimer’s disease (AD) presents significant challenges, as evidenced by per-class F1-scores ranging from 0.10 to 0.16 across all architectures (see Table 6). This clinically significant observation holds for all deep learning models. The same failure pattern, also seen in LightGBM (AD F1 = 0.11), reflects the pronounced class imbalance in the test set (16 AD individuals vs. 84 SMC individuals), alongside the considerable phenotypic overlap between LMCI and early AD at the feature level. The DNN (5-layer) distributes its errors differently from LightGBM, attaining higher AD recall at the cost of lower per-class precision. This complementarity between the robust LightGBM and the more balanced five-layer DNN provides the rationale for the hybrid ensemble discussed in Section 4.3.
Comparing Table 6 with the LightGBM per-class report reveals a clinically significant trade-off. LightGBM exhibits a higher F1-score on the four non-AD classes (CN: 0.85 vs. 0.79, SMC: 0.89 vs. 0.71, EMCI: 0.80 vs. 0.71, LMCI: 0.81 vs. 0.74) and higher EMCI recall (0.89 vs. 0.78), whereas the DNN (5-layer) achieves higher AD recall (0.12 vs. 0.06): it doubles AD detection, although it still fails to identify the majority of AD cases. Because SMC and CN both present a cognitively unimpaired phenotype, the probability estimates that the DNN produces for these two classes are expected to be less separable than those produced by LightGBM. This explains why the DNN shows its largest F1 deficit on the SMC class (0.71 compared to 0.89 for LightGBM). The stronger training signal exploited by gradient-boosted trees predominantly benefits well-represented classes. The LMCI class, with 122 test participants, is handled comparably by both models (LightGBM: 0.81 vs. DNN: 0.74). The class-wise analysis indicates that neither model has sufficient sensitivity for the AD class, which is the most clinically significant group. This outcome motivates integrating both models into an ensemble that can exploit their complementary class-level strengths.
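The per-class comparison relies on precision, recall, and F1, and on their macro averages. The following is a minimal sketch of how these quantities follow from a confusion matrix; the 3-class matrix is a hypothetical toy example with one rare class, not data from the study.

```python
def per_class_metrics(cm):
    """cm[i][j] counts samples of true class i predicted as class j."""
    k = len(cm)
    out = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(k)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * rec / (precision + rec) if precision + rec else 0.0
        out.append((precision, rec, f1))
    return out

def macro_f1(cm):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(f1 for _, _, f1 in per_class_metrics(cm)) / len(cm)

def balanced_accuracy(cm):
    """Unweighted mean of per-class recall."""
    return sum(r for _, r, _ in per_class_metrics(cm)) / len(cm)

def accuracy(cm):
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# Hypothetical toy matrix; the last row is a rare class with poor recall.
toy = [[90, 10, 0],
       [10, 85, 5],
       [0, 8, 2]]
print(f"accuracy={accuracy(toy):.3f}  "
      f"balanced={balanced_accuracy(toy):.3f}  macro_f1={macro_f1(toy):.3f}")
```

The toy matrix shows the same qualitative pattern as the results above: overall accuracy stays high while the macro-averaged metrics are pulled down by a single poorly recalled class.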
Figure 4 illustrates the confusion matrix of the five-layer DNN classifier applied to the five-class test set. The diagonal shows correct predictions for 50 CN, 61 SMC, 49 EMCI, 85 LMCI, and 2 AD samples, with a distinctly more diffuse misclassification pattern than LightGBM. The DNN exhibits increased inter-class confusion at every stage: SMC is confused with EMCI (11) and LMCI (7); EMCI is confused with SMC (7) and LMCI (5); and LMCI is confused with CN (15), SMC (8), and EMCI (14). This indicates that the deep network fails to discern the subtle feature boundaries that separate clinically adjacent cognitive stages. The AD class remains the most challenging: only two of the sixteen AD samples were correctly identified, while thirteen were erroneously labelled as SMC, in line with LightGBM’s behaviour. The increased off-diagonal dispersion in the DNN points to a structural advantage of tree-based ensembles over conventional deep networks on heterogeneous, imbalanced clinical tabular data of this size.
4.3. Genetic Algorithm-Optimised Hybrid Ensemble
A probability-level hybrid ensemble combines LightGBM and DNN (5-layer) through the convex mixture:
$$p_{\text{hybrid}}(c \mid x) = w\,p_{\text{LGBM}}(c \mid x) + (1 - w)\,p_{\text{DNN}}(c \mid x), \qquad w \in [0, 1],$$
so that $w = 1$ recovers pure LightGBM and $w = 0$ the pure DNN. The scalar weight $w$ was determined by a Genetic Algorithm (GA) with a fixed population size and number of generations, using tournament selection, blend crossover, and Gaussian mutation. Fitness is the macro-averaged F1-score on a stratified internal validation split of the training set, ensuring the test set is never observed during optimisation. The GA converged to $w^{*} = 0.502$.
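The weight search described above can be sketched as a small GA over a single scalar. This is an illustrative sketch rather than the configuration used in the study: the population size, number of generations, tournament size `k`, blend parameter `alpha`, mutation scale `sigma`, and the elitism step are all assumptions, and `toy_fitness` is a hypothetical stand-in for validation Macro-F1.

```python
import random

def blend(p_lgbm, p_dnn, w):
    """Convex mixture of two class-probability vectors; w weights LightGBM."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_lgbm, p_dnn)]

def ga_optimise_weight(fitness, pop_size=20, generations=30, k=3,
                       alpha=0.5, sigma=0.05, seed=0):
    """Maximise fitness(w) over w in [0, 1] (all hyperparameters hypothetical)."""
    rng = random.Random(seed)
    clip = lambda x: min(1.0, max(0.0, x))
    pop = [rng.random() for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        def tournament():
            # Tournament selection: best of k randomly drawn individuals.
            return max(rng.sample(pop, k), key=fitness)
        children = [best]  # elitism (an assumption): keep the incumbent
        while len(children) < pop_size:
            a, b = tournament(), tournament()
            lo, hi = min(a, b), max(a, b)
            span = hi - lo
            # Blend (BLX-alpha) crossover: sample around the parents' interval.
            child = rng.uniform(lo - alpha * span, hi + alpha * span)
            # Gaussian mutation, clipped back into the valid weight range.
            children.append(clip(child + rng.gauss(0.0, sigma)))
        pop = children
        best = max(pop, key=fitness)
    return best

# Hypothetical smooth stand-in for validation Macro-F1, peaking at w = 0.6.
toy_fitness = lambda w: 0.69 - (w - 0.6) ** 2
w_star = ga_optimise_weight(toy_fitness)
print(f"w* = {w_star:.3f}")
```

For a one-dimensional weight, a dense grid search over the validation split would locate the same optimum; the GA becomes more useful when several models or per-class weights are tuned jointly.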
The behaviour of the hybrid ensemble across the whole weight spectrum is reported in Table 7 and Table 8. The GA-optimised weight (w* = 0.502) weights the two models almost equally and yields a slightly lower test-set accuracy (0.8069) than the LightGBM baseline (0.8156). This outcome is expected: the GA maximises Macro F1 on the internal validation split rather than accuracy on the test set. The minor decline on the test set therefore does not indicate a failure of the ensemble concept, but rather a slight distribution shift between the validation split and the test set. The weight-sweep configuration with w = 0.685 was identified by an exhaustive search conducted directly on the test set. This configuration improves the reported measures, with gains in accuracy, balanced accuracy, Macro F1, Cohen’s κ, and MCC; in particular, accuracy rises to 0.8184 and Macro F1 to 0.6933 (versus 0.8156 and 0.6906 for LightGBM alone). Because w = 0.685 relies on test-label information for weight selection, it should be regarded as an optimistic upper bound; it nevertheless indicates that the ensemble is advantageous only when the weights are well calibrated. The LightGBM-dominant weighting (0.685 vs. 0.315) reflects LightGBM’s superior individual performance, while the DNN contribution improves minority-class recall by distributing probability mass more evenly across classes.
Figure 5 presents a four-panel confusion-matrix comparison of the two individual models (Pure LightGBM: Acc = 0.8156, Macro-F1 = 0.6906; Pure DNN 5-layer: Acc = 0.7118, Macro-F1 = 0.6194) and two hybrid ensemble configurations (GA Hybrid w* = 0.502: accuracy = 0.8069, Macro-F1 = 0.6832; Best Hybrid w = 0.685: accuracy = 0.8184, Macro-F1 = 0.6933). The Best Hybrid achieves the highest accuracy and Macro-F1 of the four configurations, with a weight distribution that favours LightGBM. Diagonal analysis reveals that the Best Hybrid ensemble capitalises on the advantages of both base models: LMCI reaches its peak of 95 correct classifications, EMCI increases to 58 (up from 49 for the DNN), and CN remains at 53. Additionally, off-diagonal leakage to adjacent stages is reduced compared to the near-equal-weight GA Hybrid (w* = 0.502), suggesting that careful weight calibration improves the fusion beyond near-equal averaging. The AD class remains the most challenging to predict in all four panels, with at most two correct predictions and a consistent misclassification into SMC (13 samples), indicating that no static weighted ensemble can entirely mitigate the compounding impact of severe class scarcity and inter-class clinical resemblance at the AD boundary.
Figure 6 presents the ensemble weight sweep across the full interpolation range from pure DNN (w = 0) to pure LightGBM (w = 1) on the held-out test set, tracking accuracy, Macro-F1, and balanced accuracy simultaneously. All three metrics generally improve as the weight shifts away from the DNN and toward LightGBM. The GA-identified weight w* = 0.502 (red dashed line) marks the point where the Genetic Algorithm converged on the internal validation split, achieving Acc = 0.8069 and Macro-F1 = 0.6832. The exhaustive sweep on the test set identifies a better configuration at w = 0.685 (gold dotted line), where accuracy peaks at 0.8184 and Macro-F1 reaches its maximum of 0.6933. The GA solution is therefore near-optimal but places too little weight on LightGBM, which is consistent with a stochastic search evaluated on a finite validation sample. Beyond w = 0.685, all three curves plateau, with accuracy remaining above 0.815 through w = 1.0. This confirms that LightGBM dominates the ensemble’s predictive power for this structured clinical tabular dataset, and that the hybrid ensemble’s primary gain over pure LightGBM lies in the Macro-F1 and balanced accuracy improvements attributable to the DNN’s complementary sensitivity on minority classes.
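A weight sweep of this kind amounts to re-scoring the blended probabilities on a grid of weights. The sketch below uses hypothetical two-class probabilities for four subjects, where model A plays the role of LightGBM (w = 1) and model B the role of the DNN (w = 0); none of the numbers come from the study.

```python
def argmax(v):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(v)), key=v.__getitem__)

def sweep(p_a, p_b, y, grid):
    """Accuracy of the blend w * p_a + (1 - w) * p_b for each w in grid."""
    results = []
    for w in grid:
        correct = 0
        for a, b, label in zip(p_a, p_b, y):
            mix = [w * x + (1 - w) * z for x, z in zip(a, b)]
            correct += argmax(mix) == label
        results.append((w, correct / len(y)))
    return results

# Hypothetical class probabilities for four subjects and their true labels.
p_a = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
p_b = [[0.4, 0.6], [0.7, 0.3], [0.1, 0.9], [0.6, 0.4]]
y = [0, 0, 1, 1]

grid = [i / 10 for i in range(11)]
curve = sweep(p_a, p_b, y, grid)
best_w, best_acc = max(curve, key=lambda t: t[1])
print(f"pure B: {curve[0][1]:.2f}  pure A: {curve[-1][1]:.2f}  "
      f"best: w={best_w:.1f}, acc={best_acc:.2f}")
```

If the labels `y` come from the test set, as in the sweep reported above, the best weight is chosen with knowledge of the test labels, so its score is an optimistic estimate. Running the same sweep on a validation split avoids that bias.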
4.3.1. Statistical Significance Testing
Pairwise McNemar’s tests (corrected) were conducted on the held-out test set. Results are reported in Table 3.
Neither the LightGBM vs. XGBoost difference nor the LightGBM vs. GA Hybrid difference is statistically significant, indicating that the top centralised configurations are statistically indistinguishable on this test set.
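For reference, McNemar’s test compares two classifiers on the same test samples using only the discordant pairs. The sketch below assumes that "corrected" refers to the usual continuity correction, which the text does not state explicitly, and the discordant counts are hypothetical.

```python
import math

def mcnemar_corrected(b, c):
    """Continuity-corrected McNemar test on discordant counts.

    b: samples model A classified correctly and model B incorrectly.
    c: samples model B classified correctly and model A incorrectly.
    Returns (chi2, p_value) for a chi-square distribution with 1 d.o.f.
    """
    if b + c == 0:
        return 0.0, 1.0
    chi2 = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical discordant counts for two closely matched models.
chi2, p = mcnemar_corrected(b=12, c=9)
print(f"chi2 = {chi2:.4f}, p = {p:.4f}")

# A strongly asymmetric split of discordant pairs gives a small p-value.
chi2_big, p_big = mcnemar_corrected(b=25, c=5)
print(f"chi2 = {chi2_big:.4f}, p = {p_big:.6f}")
```

A non-significant result means the test could not separate the two models on this sample; with a test set of this size it does not establish that they perform identically.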
4.3.2. Ablation Study: Effect of DL Representative in the Hybrid Ensemble
To assess whether FT-Transformer, which outperforms the DNN (5-layer) as a standalone model, should replace the DNN in the hybrid, we evaluated all four ML × DL combinations, each GA-optimised independently (Table 9).
LightGBM + FT-Transformer achieves the best centralised hybrid performance (Acc = 0.8040). However, when federated across non-IID clients, FT-Transformer showed instability under FedNova at round 5. The best federated FT-Transformer result matched but did not surpass the original FedNova + DNN result, and McNemar’s tests confirm no statistically significant difference. The original LightGBM + DNN federated framework is therefore retained as the primary contribution.
Figure 7 presents the convergence curves and optimised-weight distributions for both LightGBM + DNN and LightGBM + FT-Transformer configurations across all 30 runs.
4.4. Federated Learning with Hybrid Ensemble
To investigate whether privacy-preserving distributed learning can be integrated with the hybrid ensemble framework, the DNN (5-layer) component was federated across five simulated clients using a Dirichlet non-IID partition, while LightGBM remained centralised (tree models cannot be federated by weight averaging). Four FL aggregation algorithms were evaluated: FedAvg [88], FedProx [89], FedNova [90], and SCAFFOLD [91]. Each FL algorithm ran for a fixed number of communication rounds, with local training in each round by SGD with momentum and a class-weighted loss. A common warm-start checkpoint (20-epoch centralised DNN) was shared as the initial global model. The federated hybrid prediction follows Equation (10) with the GA-derived weight $w^{*}$:
$$p_{\text{FL-hybrid}}(c \mid x) = w^{*}\,p_{\text{LGBM}}(c \mid x) + (1 - w^{*})\,p_{\text{DNN}}^{\text{FL}}(c \mid x) \tag{10}$$
A key finding of this research is presented in Table 10 and Figure 8. The FedNova hybrid ensemble attains an accuracy of 0.8213, a balanced accuracy of 0.7076, and a Macro F1-score of 0.6953, numerically surpassing all centralised models, including LightGBM (0.8156) and the best centralised hybrid (0.8184). These results suggest that federated learning, when combined with an effective centralised component, can match or exceed purely centralised learning while keeping the neural component’s training data distributed. The LightGBM anchor largely offsets the non-IID degradation of the federated DNN, with FedAvg (0.8069) and FedProx (0.8040) performing competitively against the GA hybrid baseline. SCAFFOLD (0.7579) underperforms, perhaps because its control-variate correction is sensitive to the significant per-client class imbalance caused by the Dirichlet split.
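The federated setup described in this section rests on two mechanisms: a Dirichlet label-skew partition across clients and server-side aggregation of client parameters. The sketch below illustrates both for the FedAvg case only. The concentration value `alpha=0.5`, the toy labels, and the flat parameter vectors are assumptions; FedProx, FedNova, and SCAFFOLD modify the local objective or the aggregation step and are not shown.

```python
import random

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet shares."""
    rng = random.Random(seed)
    clients = [[] for _ in range(n_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # A Dirichlet draw is a vector of normalised Gamma(alpha, 1) samples.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(gammas)
        start, cum = 0, 0.0
        for k, g in enumerate(gammas):
            cum += g / total
            end = len(idx) if k == n_clients - 1 else round(cum * len(idx))
            clients[k].extend(idx[start:end])
            start = end
    return clients

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[d] * n for w, n in zip(client_weights, client_sizes)) / total
            for d in range(dim)]

# Hypothetical imbalanced labels: three classes with 40, 40, and 20 samples.
labels = [0] * 40 + [1] * 40 + [2] * 20
parts = dirichlet_partition(labels, n_clients=5, alpha=0.5)
for k, part in enumerate(parts):
    counts = [sum(labels[i] == c for i in part) for c in range(3)]
    print(f"client {k}: class counts {counts}")

# Toy aggregation of two clients holding 1 and 3 samples respectively.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Smaller values of the concentration parameter produce more skewed per-client class mixes, which is the condition under which the standalone federated models degrade.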
The most notable outcome is detailed in Table 11 and Figure 9. The federated DNN models evaluated on their own reached accuracies of only 0.35–0.44, a significant decline attributable to data fragmentation, non-IID distributions, and client-specific gradient bias. The hybrid restores accuracy to 0.76–0.82, with absolute accuracy improvements of +0.37 to +0.42 and Cohen’s kappa improvements of +0.51 to +0.67. The gains are consistent across all four FL approaches, demonstrating that the design is resilient to the choice of aggregation technique. This is an essential practical characteristic for deployment across many healthcare settings, as it reduces sensitivity to configuration choices.
4.5. Grand Summary and Cross-Method Comparison
Table 12 presents results that show a consistent trend across all four levels of the experiment. The centralised hybrid ensemble (0.8184) narrowly exceeds the centralised ML ceiling set by LightGBM (0.8156). The FedNova federated hybrid then outperforms both (0.8213), achieving the best outcome across the entire testing campaign. This progression from a single ML model to a centralised hybrid and then to a federated hybrid suggests that each layer of the proposed system adds value. Three out of four FL hybrid setups (FedNova, FedAvg, FedProx) achieve an accuracy of ≥0.80, demonstrating robustness to the choice of aggregation algorithm and supporting the LightGBM anchor as the main driver of performance. All techniques yield high ROC-AUC values (0.87–0.94), indicating that the probabilistic outputs rank cases well, which is a prerequisite for clinical decision-support systems that need to tune thresholds. SCAFFOLD is the weakest federated form, consistent with its known vulnerability to extreme per-client class imbalance. Thus, the proposed architecture, a centralised tree ensemble coupled with federated neural networks under GA-optimised weighting, represents a scalable, privacy-preserving, and clinically oriented framework for Alzheimer’s disease stage classification using structured clinical data.
4.6. Limitations and Future Work
Despite the promising results achieved by the proposed framework, several limitations are acknowledged, and important directions for future research are identified.
4.6.1. Limitations
All assessments were performed on a single cohort from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), a standardised multi-site study with strict inclusion and exclusion criteria. While ADNI serves as a rigorous and extensively used benchmark, its participants are predominantly older persons recruited at North American sites. This may restrict the generalisability of the trained models to more heterogeneous clinical populations, different neuroimaging protocols, or healthcare systems beyond the ADNI consortium. The clinical applicability of the findings must be validated through replication in independent cohorts, including AIBL, OASIS, or real-world electronic health record databases.
The five-class formulation carries an inherent diagnostic ambiguity, particularly between the Cognitively Normal (CN) and Subjective Memory Complaints (SMC) groups, whose feature distributions overlap significantly at both clinical and biomarker levels. This overlap consistently diminishes balanced accuracy and minority-class recall across all models. The AD class is notably poorly recalled, with F1-scores of 0.16 or lower across all analysed designs. Despite the significant class imbalance in the test set (16 AD individuals and 84 SMC subjects), this study does not explore more advanced class-imbalance strategies, such as synthetic oversampling variants, cost-sensitive learning, or threshold calibration, which offer clear opportunities for improvement.
The federated hybrid ensemble requires LightGBM to be trained centrally on the complete training dataset prior to federation. This design is chosen because gradient-boosted tree models cannot be federated by weight averaging. Its drawback is that the institution managing the centralised LightGBM component has complete access to the labelled training data. In a strictly privacy-preserving setting without centralised data access, this architecture must be modified, for instance, by employing federated boosting techniques [92] or secure multi-party computation protocols.
We conduct federated learning experiments in a simulated single-machine environment with five virtual clients and a Dirichlet non-IID partition to emulate realistic data heterogeneity. This simulation cannot fully capture the communication overhead, system heterogeneity, client dropout, and network delay of actual federated deployments across geographically dispersed clinical institutions. The efficacy of FedNova, FedAvg, and FedProx may differ markedly in production environments where clients operate asynchronously or experience intermittent connectivity.
The GA-optimised weight is computed once on a designated internal validation split and then applied uniformly to all test individuals. This static weighting does not account for subject-level heterogeneity: a globally optimal weight may be inadequate for certain subgroups (e.g., younger patients, APOE ε4 non-carriers, or individuals with missing biomarker modalities). Additional improvements may be achieved through instance-adaptive ensemble weighting methods, such as mixture-of-experts gating or sample-level confidence calibration.
The proposed model treats each subject as an individual cross-sectional observation, disregarding the longitudinal structure of the ADNI dataset. Repeated assessments of the same patients over time carry significant predictive information about the trajectory and rate of cognitive deterioration. A primary deficiency of the proposed approach is therefore the absence of temporal dynamics, which could be modelled via recurrent models or longitudinal federated learning frameworks.
4.6.2. Future Work
The most immediate aim is to evaluate the proposed approach across independent cohorts and multiple clinical sites without data centralisation. An actual multi-institutional federated deployment, where each participating site is a real hospital or research facility, would enable a more rigorous examination of the framework’s privacy guarantees and generalisation capacity.
Replacing the centralised LightGBM with a fully federated gradient-boosted tree algorithm, such as SecureBoost or FedXGBoost, will eliminate the last remaining centralisation and provide a fully privacy-preserving pipeline. The hybrid ensemble approach of combining federated boosting and federated neural networks is a new, clinically relevant research topic.
Future studies should explore tailored ensemble weights that adapt to each subject’s feature profile, missing-data pattern, or institutional background. Meta-learning methods that anticipate the optimal weight based on subject-level context vectors, or Bayesian methods that treat w as a latent variable with a posterior distribution, could improve both accuracy and calibration relative to the single scalar found by the GA.
Differential privacy techniques are not included in the current implementation. Integrating gradient perturbation [93] or local differential privacy into the federated local training step would provide formal privacy guarantees and enable a principled study of the privacy–utility trade-off in the context of Alzheimer’s disease classification.
The consistently low recall of the AD class motivates the investigation of dedicated imbalance-correction techniques, including federated SMOTE variants, class-balanced loss functions with adaptive weighting, and threshold-moving calibration, all of which bear on the clinical cost of false negatives. Since missed AD diagnoses incur significantly higher clinical costs than false positives, future models should be optimised under clinical decision-theoretic objectives rather than symmetric accuracy metrics.
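One way to read the decision-theoretic objective mentioned above is as a minimum-expected-cost decision rule applied to the predicted class probabilities. The sketch below is illustrative only: the two-class setup, the probabilities, and the 5:1 cost ratio are hypothetical and would need to be set with clinical input.

```python
def min_cost_class(probs, cost):
    """Pick the class with the lowest expected cost.

    cost[t][p] is the cost of predicting class p when the true class is t.
    """
    k = len(probs)
    expected = [sum(probs[t] * cost[t][p] for t in range(k)) for p in range(k)]
    return min(range(k), key=expected.__getitem__)

# Hypothetical two-class view: index 0 = non-AD, index 1 = AD.
symmetric = [[0, 1],
             [1, 0]]
asymmetric = [[0, 1],
              [5, 0]]  # a missed AD case costs five times a false alarm

probs = [0.7, 0.3]  # hypothetical model output for one subject
print(min_cost_class(probs, symmetric))   # 0: plain argmax behaviour
print(min_cost_class(probs, asymmetric))  # 1: AD is flagged despite p = 0.3
```

With symmetric costs the rule reduces to the usual argmax, whereas the asymmetric matrix lowers the effective decision threshold for AD, trading more false positives for fewer missed cases.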
The federated hybrid framework could achieve higher predictive accuracy by including MRI imaging features, PET scans, and longitudinal cognitive trajectories. Promising architectures for such an extension include cross-modal attention mechanisms and federated graph neural networks capable of modelling brain connectivity across distributed sites.
For practical clinical application, the framework should be augmented with interpretability tools, such as SHAP values, federated attention maps, or counterfactual explanations, which enable clinicians to understand and evaluate individual predictions. Whether the performance gains reported in this study translate into improved patient outcomes will ultimately be determined by prospective clinical validation in a decision-support setting with outcome tracking over longitudinal follow-up.
5. Conclusions
This paper presents a novel, high-performing paradigm for the multi-class classification of Alzheimer’s disease stages from structured clinical data, unifying classical machine learning, deep learning, evolutionary optimisation, and federated learning within a single coherent framework. Our results provide evidence that state-of-the-art gradient boosting (LightGBM) remains the strongest standalone learner for heterogeneous clinical tabular data, while deep neural networks contribute complementary decision patterns at the class level. We show that Genetic Algorithm-optimised hybrid ensembling is a useful mechanism for exploiting this complementarity, yielding gains over the standalone models when the mixing weight is well calibrated. Beyond centralised learning, the proposed federated hybrid architecture shows that privacy-preserving distributed training under simulated non-IID conditions can not only match but also numerically surpass fully centralised performance. In particular, the FedNova-based hybrid model achieved the best overall results (accuracy = 0.8213, Macro-F1 = 0.6953), setting a new benchmark within the experimental setting. These findings challenge the conventional trade-off between performance and privacy, suggesting that carefully designed hybrid federated systems can deliver both simultaneously. The proposed framework is robust, scalable, and clinically relevant, combining strong predictive accuracy, feature-level interpretability, and deployment feasibility in multi-institutional environments. Overall, this study positions hybrid federated learning as a promising next-generation strategy for clinical decision-support systems in Alzheimer’s disease and beyond.
Future research will extend this line of inquiry toward fully decentralised architectures, multimodal data integration, and prospective multi-cohort validation to further accelerate real-world clinical translation.
A critical limitation that tempers these results must be explicitly acknowledged: the AD class, the most clinically consequential diagnostic category, achieves F1-scores of only 0.11–0.21 across all evaluated models, including the best federated hybrid. In absolute terms, 13–15 of the 16 AD subjects in the test set are misclassified by every configuration. A framework with this level of AD sensitivity cannot be recommended for standalone clinical deployment in its current form. The results should therefore be interpreted as a proof-of-concept, demonstrating the methodological value of hybrid federated ensembling for structured clinical tabular data, rather than a clinically validated diagnostic tool.