Ensemble Model for Diagnostic Classification of Alzheimer’s Disease Based on Brain Anatomical Magnetic Resonance Imaging

Alzheimer’s disease, which leads to brain atrophy, is among the fastest-growing diseases worldwide. Neuroimaging reveals extensive information about the brain’s anatomy and enables the identification of diagnostic features. Artificial intelligence (AI) in neuroimaging has the potential to significantly enhance the treatment process for Alzheimer’s disease (AD). The objective of this study is two-fold: (1) to compare existing Machine Learning (ML) algorithms for the classification of AD, and (2) to propose an effective ensemble-based model for the same and to perform its comparative analysis. In this study, data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), an online repository, are used for experimentation, consisting of 2125 neuroimages of Alzheimer’s disease (n = 975), mild cognitive impairment (n = 538) and cognitively normal (n = 612) subjects. For classification, the framework incorporates a Decision Tree (DT), Random Forest (RF), Naïve Bayes (NB), and K-Nearest Neighbor (K-NN), followed by variations of the Support Vector Machine (SVM), such as SVM (RBF kernel), SVM (Polynomial kernel), and SVM (Sigmoid kernel), as well as Gradient Boost (GB), Extreme Gradient Boosting (XGB) and a Multi-layer Perceptron Neural Network (MLP-NN). Afterwards, an ensemble-based generic kernel is presented, in which a master-slave architecture is employed to attain better performance. The proposed model is an ensemble of Extreme Gradient Boosting, Decision Tree and the SVM polynomial kernel (XGB + DT + SVM). Finally, the proposed method is evaluated with cross-validation and statistical techniques, alongside the other ML models. The presented ensemble model (XGB + DT + SVM) outperformed existing state-of-the-art algorithms with an accuracy of 89.77%. The efficiency of all the models was optimized using grid-based tuning, and the results obtained after this process showed significant improvement.
XGB + DT + SVM with optimized parameters outperformed all other models with an accuracy of 95.75%. The proposed ensemble-based learning approach clearly yields the best results compared to the other ML models. This experimental comparative analysis improves the understanding of the above-defined methods and enhances their scope and significance in the early detection of Alzheimer’s disease.


Introduction
Alzheimer's disease (AD) can be treated with the use of healthcare informatics. The Alzheimer's enigma has troubled both researchers and clinicians. Over a century ago, the German doctor Alois Alzheimer spotted anomalies in brain sections of a patient with dementia. Researchers have since studied these strange anomalies, known as plaques and tangles, in the brain. The brain with dementia is characterized by wiped memories and deteriorating intellect. The most important and familiar tool for understanding and studying neurodegeneration and neurodegenerative diseases is neuroimaging [16], which includes multiple imaging techniques such as computed tomography (CT), dopamine transporter scan (DaTscan), magnetic resonance imaging (MRI), positron emission tomography (PET), and single-photon emission computed tomography (SPECT) [17]. Neuroimaging gives a comprehensive understanding of the potential mechanisms underlying neurodegenerative diseases such as AD, Parkinson's disease (PD) [18], Attention-Deficit/Hyperactivity Disorder (ADHD) [19], and Autism Spectrum Disorder (ASD) [20], so that better clinical trials can be developed and deployed to improve diagnosis and healthcare. Large-scale neuroimaging databases may be used to train and evaluate diagnostic Artificial Intelligence and Machine Learning methods for neurodegenerative diseases (e.g., Alzheimer's disease). Predicting Alzheimer's disease with machine learning has been the subject of several research studies.
However, scrutinizing previous studies, some gaps are identified in the diagnosis of AD, which encouraged us to work on overcoming those drawbacks. Taking this into consideration, this paper presents an effective ensemble-based technique and provides an in-depth comparison of machine learning methods for predicting AD, as well as future research possibilities. Ensemble learning is the process of combining numerous weak models into a single robust model. This can be achieved using a variety of ensemble strategies, including bagging, boosting, and stacking. Ensemble learning integrates the outputs of its constituent models and enhances prediction accuracy, typically through voting schemes. The fundamental goal of voting is to increase generalization by balancing individual model errors, particularly when the models perform well on a predictive modelling task. Moreover, in the ensemble approach, if the same model performs well with different hyperparameters, combining multiple fits of the same machine-learning algorithm is beneficial. Compared to a single model, an ensemble model generally provides more accurate results and has greater applicability.
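The voting idea described above can be illustrated with a minimal Python sketch (this is an illustrative example using scikit-learn on synthetic data, not the paper's implementation or dataset):

```python
# Hard-voting ensemble: several weak models vote, the majority label wins.
# Synthetic data stands in for real features; models are arbitrary examples.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ensemble = VotingClassifier(
    estimators=[("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("knn", KNeighborsClassifier())],
    voting="hard")                      # majority vote over base predictions
ensemble.fit(X_tr, y_tr)
print(round(ensemble.score(X_te, y_te), 3))
```

Because the base models make different kinds of errors, the majority vote tends to generalize better than any single member.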
This paper provides a comprehensive evaluation of learning approaches for Alzheimer's prediction, as well as potential future developments.
The main contributions of this article are as follows:
1. This study evaluated the performance of various machine learning techniques, i.e., 11 models, a comparison that is lacking in the state-of-the-art literature.
2. To enhance the diagnostic performance of AD, this article presents an ensemble technique that combines an SVM kernel as the master model with an Extreme Gradient Boosting sequential model to capture the difficult hidden features in the dataset. The proposed architecture (XGB + DT + SVM) uses a Decision Tree to reduce variability in model performance. Overall, the proposed model not only decreases the inherited variance but also enhances the prediction of tough samples using boosting.
3. This article performed a comparative analysis of algorithms before and after hyperparameter tuning accomplished through grid search, and the outcomes are validated using k-fold cross-validation.
4. To ensure the generalization of the considered models, this article evaluated their efficiency using statistical tests, i.e., Friedman's rank test and the t-test.
5. Various performance metrics such as the receiver operating characteristic (ROC) curve, sensitivity, specificity and log loss are used to compare the proposed model with other ML models.
6. This article also compares the proposed model with existing state-of-the-art studies to demonstrate its efficiency in detecting AD.
7. The article also discusses future directions that should be addressed for reliable AD diagnosis.
The remaining part of the paper is organized as follows: Section 2 discusses related work. Section 3 discusses the existing approaches and the dataset. Section 4 presents a description of the experimental analysis. Section 5 discusses the comparative performance evaluation of each method of the framework, and Section 6 concludes the paper, followed by future work in Section 7.

Related Work
As with other neurodegenerative illnesses, Alzheimer's disease causes persistent and irreversible cognitive deterioration, culminating in brain shrinkage. It is an expensive medical condition that places a substantial financial strain on families and caregivers. As neurodegeneration is an irreversible process, it cannot be cured. For this reason, researchers are focusing on the early identification and prediction of neurodegeneration to give optimal treatment to those suffering from neurodegenerative disorders. Substantial research has been performed, and continues to be conducted, in this field to improve early identification and build better diagnostic assistance systems. Casanova et al. [21] used a large-scale regularization approach based on logistic regression to classify MRI scans to determine a person's level of thinking and reasoning. The authors applied the regularization approach to a dataset of cognitively Normal Controls (NC) and disease patients (AD). Their results showed 85.7% sensitivity, 82.9% specificity, and 90% accuracy. In a study by Costafreda et al. [22], MRI data were used to identify people with mild cognitive impairment who were at risk of developing Alzheimer's disease. The authors used SVM to classify mild cognitive impairment as Alzheimer's disease, with sensitivities of 80% and 77%, respectively. Battineni et al. [23] tested four machine learning (ML) models, including NB, ANN, K-NN, and SVM, and the receiver operating characteristic (ROC) curve metric was used to verify model performance. Each model was evaluated in three separate experiments. First, the models were trained on manually selected features, and ANN produced the greatest ROC accuracy (0.812). In the second experiment, automated feature selection was performed via wrapper approaches, with NB achieving the greatest ROC of 0.942; finally, the ensemble model showed an ROC of 0.991.
Padilla et al. [24] examined neuroimaging data from MRI, PET, and SPECT scans of healthy and sick subjects. As a result, they used the Fisher discriminant ratio (FDR) and nonnegative matrix factorization (NMF) to identify and categorize the most significant features. Their method yielded a 90% accuracy rate. Xin Liu et al. [25] investigated structural-MRI by using locally linear embedding (LLE) to transform multivariate MRI data into linear space, focusing on regional brain volume and cortical thickness. In this study, LLE was found to be useful in identifying Alzheimer's disease. Gray et al. [26] performed a multi-modal classification of MRI data, focusing on MRI volumes and voxel-based PET signal intensities. Heung-II Suk et al. [27] proposed a deep learning feature-based strategy. The authors concentrated on grey matter tissue volumes, which are low-level MRI characteristics. They also used PET signal intensities in their method. The dataset was obtained from ADNI, and the experimental results had a 95.9% accuracy rate. Juergen Dukart et al. [28] proposed the use of SVM on information derived from the combination of MRI and FDG PET for detecting frontotemporal lobar degeneration. Using the ADNI dataset to train the classifier, they achieved 91% classification accuracy. Combining MRI and FED-PET findings may better determine Alzheimer's disease (AD).
Liu et al. [29] introduced a novel hierarchical technique combining multi-level classifiers into an ensemble classification, progressively aggregating features from various brain regions. Using imaging parameters such as spatially normalized grey matter (GM) from each MRI scan, they achieved an accuracy of 92.0%. The authors also presented MK-Boost to classify ADNI data using a Whole-Brain Hierarchical Network (WBHN) based on the Automated Anatomical Labelling (AAL) atlas. The study in [30] proposed a multimodality-based classification of Alzheimer's disease and mild cognitive impairment (MCI). The authors used a linear regression model with group sparsity regularization on weights from the ADNI dataset to improve classification performance and identify syndrome-associated biomarkers useful for illness analysis. Zu et al. [31] classified Alzheimer's disease and mild cognitive impairment using a new multimodal technique, adding label-aligned multi-task feature extraction and multimodal categorization. They combined the selected multimodality characteristics using a multi-kernel SVM and achieved 95.05% classification accuracy on the ADNI MRI and FDG-PET datasets.
According to Sorensen et al. [32], variations in volumetric alterations contribute to early cognitive decline; T1-weighted MRI images from the ADNI database were analyzed. Liu et al. [33] used a blend of various kernels to link edge characteristics with node architecture while analyzing MRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Their investigation found that 3D texture was more effective for categorization, with an accuracy of 91%. Wang et al. [34] used two data sources, the OASIS dataset and data from the Washington University School of Medicine. An ethical committee was formed, and signed consent was obtained from each subject before entering the study, which involved four local hospitals. Table 1 depicts a summary of existing work on MR-imaging-based Alzheimer's disease categorization using multiple techniques. The following section describes the methodology we used for the classification of AD and MCI using different categories of algorithms.

Materials and Methods
This section comprises the methodology for comparative analysis of classification algorithms for predicting individuals with Alzheimer's disease and mild cognitive impairment. The given procedure consists of different phases: (i) collection of MRI data (ii) classification, and (iii) comparison of performance among these algorithms.
In Section 3.1, the framework of this study is addressed. In Section 3.2, a description of the ADNI dataset, data pre-processing and feature extraction is given. In Section 3.3, we discussed the different learning techniques to identify individuals with AD and MCI. Figure 2 presents the framework of the study. With the advent of ML throughout the last couple of years, a variety of methods have been introduced to enhance AD classification using neuroimaging as a biomarker. In this study, we used the axial MR Neuroimaging modality, and the dataset is divided into Train-set (70%) and Test-set (30%). Classification algorithms are trained for three classes of data by providing a training-set as input.

Dataset Description
Neuroimaging data were obtained from the ADNI website (http://adni.loni.usc.edu/ (accessed on 7 September 2022)). The ADNI initiative was launched with Dr. Michael W. Weiner as its primary investigator, with the aim of understanding the evolution from usual ageing through mild memory problems to Alzheimer's disease. Along with the neuroimaging initiative, genetic and clinical information is also processed to analyze and measure various genes and determine the extent of the genetic risk contributing to the Alzheimer's disease enigma. In this study, we have taken 2127 MRI (T1- and T2-weighted) scans from ADNI. The downloaded dataset is categorized into three classes: 612 samples for the AD class, 538 samples for MCI, and 975 samples for the CN class, as shown in Table 2.


Pre-Processing Steps
The data taken from the ADNI database require some basic processing before being presented to the algorithms:
(a) Data filtering is performed based on the acquisition plane and ADNI phase. We selected the axial acquisition plane of the MR (T1- and T2-weighted) images and ADNI phase 3 for the three classes AD, CN, and MCI, as shown in Figure 3.
(b) The images are reshaped to 512 × 512 pixels to preserve image quality.
(c) The class labels of the images are extracted and stored in a vector. The classes undergo label encoding, i.e., 0 for AD, 1 for MCI and 2 for CN.
(d) The loaded images are standardized and normalized using the standard scaler and min-max scaler as normalization functions.
(e) Once preprocessing is completed, the data are split into training and testing sets to perform validation on unseen data.
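Steps (c)-(e) can be sketched in a few lines of Python; this is an illustrative fragment with synthetic stand-in features, not the authors' code:

```python
# Sketch of steps (c)-(e): label encoding, scaling, and a 70/30 split.
# X holds random placeholder values in place of flattened MRI features.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

labels = np.array(["AD", "MCI", "CN", "AD", "CN", "MCI"])
X = np.random.rand(6, 4)                  # placeholder image features

# (c) label encoding as stated above: 0 for AD, 1 for MCI, 2 for CN
mapping = {"AD": 0, "MCI": 1, "CN": 2}
y = np.array([mapping[label] for label in labels])   # [0, 1, 2, 0, 2, 1]

# (d) standardize, then normalize to [0, 1]
X_scaled = MinMaxScaler().fit_transform(StandardScaler().fit_transform(X))

# (e) 70/30 train-test split, as used in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, test_size=0.3,
                                          random_state=0)
print(X_tr.shape, X_te.shape)             # (4, 4) (2, 4)
```

An explicit mapping dictionary is used here (rather than an automatic encoder) so the class codes match the 0/1/2 assignment stated in step (c).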

The MRI axial scans are fed into the model as input and converted into NumPy arrays. A NumPy array is an n-dimensional array represented as a grid of values indexed by tuples of non-negative integers. The rank of the array gives the number of dimensions, and the shape tuple gives the array size. Figure 4 presents the workflow of this study, as discussed below.
1. Data: brain axial T1- and T2-weighted MRI data taken from the ADNI database are labelled into three classes, AD, CN, and MCI, along with the preprocessing and data-loading pipelines.
2. Data training: VGG-16 [43] is used as a feature extractor, primarily with weights pre-trained on ImageNet, applied to the ADNI dataset. After feature extraction using VGG-16, the extracted blob is processed and sent to the conventional machine learning models and the proposed ensemble architecture.
3. Parameter optimization: post-training hyperparameter tuning is achieved using grid search.
4. Data evaluation: after rigorous training, the models are tested and compared for overfitting. If overfitting is found, retraining is done with reduced hyperplanes and data sampling; otherwise, a comparative analysis of the results is performed.



Feature Extraction
Feature extraction is an integral aspect of image-based machine learning, as it makes a feature-wise similarity graph or correlation table available to deep neural architectures such as the VGG-16 utilized in this paper. In this study, a frozen VGG-16 bottleneck is implemented, with the initial convolutional-layer extraction transformed into dense arrays of 1000 vectors for use in the machine learning pipeline. In comparison to local filtering or Gaussian-noise extraction methods, this method accelerates convergence and improves accuracy by 31.27% in aggregate. VGG-16 applies its internal feature extractor to images of size 100 × 100 × 1, which are expanded to 100 × 100 × 3 using three filtered convolutions with same-value padding. Further, the kernel association blob predicted by the VGG-16 architecture is 2 × 2 × 512; when linearized using flatten and dense layers, it is converted into a 4096-dimensional vector. This blob is fed into the traditional machine learning models and the suggested ensemble architecture, from which the AD, CN, and MCI predictions are derived. The proposed VGG-16 model uses label-encoded vectors for the dependent variables and cross-channel normalization. In VGG-16 modelling, the batch size is 32, and the loss function is cross-entropy loss. The feature stimulator extracts image information using 1 × 1, 3 × 3, and 5 × 5 kernels, as well as the Glorot weight initializer and pre-trained ImageNet data.
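The shapes described above can be checked numerically; the following NumPy sketch flattens a 2 × 2 × 512 blob and projects it to a 4096-dimensional vector (the random weight matrix is a stand-in for VGG-16's trained dense layer, not the actual network):

```python
# Numeric sketch of the shapes above: flatten a 2 x 2 x 512 convolutional
# blob (2048 values) and project it to a 4096-dimensional feature vector.
import numpy as np

rng = np.random.default_rng(0)
blob = rng.random((2, 2, 512))        # final VGG-16 feature map (assumed)
flat = blob.reshape(-1)               # flatten: 2 * 2 * 512 = 2048 values
W = rng.random((flat.size, 4096))     # stand-in dense-layer weights
features = flat @ W                   # 4096-vector fed to the ML models
print(features.shape)                 # (4096,)
```

This makes explicit that the flatten step alone yields 2048 values, and it is the subsequent dense layer that produces the 4096-dimensional vector mentioned in the text.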

Naive Bayes (NB)
NB is a machine learning algorithm based on the Bayes theorem, named after Thomas Bayes, which rests on conditional probability [44,45]. It is used as a classifier in face recognition, weather prediction, news classification, and medical diagnosis. The posterior probability in this algorithm is calculated using Equation (1):

P(C|X) = P(X|C) P(C) / P(X)    (1)

Equation (1) above depicts the posterior probability P(C|X), where C is a class variable and X is a dependent feature vector. The posterior probability is denoted by P(C|X), the likelihood by P(X|C), the class prior probability by P(C) and the predictor prior probability by P(X).
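A quick worked example of Equation (1), with illustrative probabilities (not values from the ADNI data):

```python
# Worked example of Bayes' rule: posterior = likelihood * prior / evidence.
p_c = 0.3            # P(C): prior probability of class C
p_x_given_c = 0.8    # P(X|C): likelihood of features X under class C
p_x = 0.5            # P(X): predictor prior probability
posterior = p_x_given_c * p_c / p_x   # P(C|X)
print(round(posterior, 2))            # 0.48
```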

Decision Tree (DT)
The DT classifier has the inherent capability of eliminating useless elements from the input data, thereby increasing the performance of the algorithm [46]. The classification technique is based on non-parametric supervised learning, which predicts the class of given data by learning features from the data in the form of a tree structure. This classification technique is used for predicting the class to which the data belong. Thus, it supports diagnostic decisions by dividing the dataset into smaller sets to specify a sequence of decisions and consequences. Numerous metrics can be used to determine the optimal method for splitting the records. These metrics are described in terms of the class distributions of the records before and after splitting. p(i|t) represents the fraction of records from class i located at node t. The degree of impurity in the child nodes influences the measures used to determine the ideal split; the degree of impurity is proportional to how skewed the class distribution is. The impurity measures are given in Equations (2)-(4):

Entropy(t) = −Σ p(i|t) log2 p(i|t)    (2)
Gini(t) = 1 − Σ p(i|t)²    (3)
Classification error(t) = 1 − max p(i|t)    (4)

The input features for the DT models can be selected using the entropy, Entropy(t), stated in Equation (2), the Gini index, Gini(t), in Equation (3) and the classification error, Error(t), in Equation (4). In the above equations, c is the number of classes and the sums and maximum run over i = 1, …, c.
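The three impurity measures in Equations (2)-(4) can be sketched directly (an illustrative NumPy fragment, not the paper's code); a pure node scores 0 on all three, while a maximally impure binary node scores the values shown:

```python
# Impurity measures for a node t with class proportions p(i|t).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # skip zero proportions (log2(0))
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(np.square(p))

def classification_error(p):
    return 1.0 - np.max(p)

p = [0.5, 0.5]                         # maximally impure binary node
print(entropy(p), gini(p), classification_error(p))  # 1.0 0.5 0.5
```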

Random Forest (RF)
Random Forest (RF) [47] is a classifier composed of a group of decorrelated decision trees that jointly perform classification. RF is well suited to medical diagnosis, remote sensing, and object detection in images. RF finalizes the decision by averaging the votes from the different trees. Matrices containing the sample features of the training dataset are given as input to the algorithm. For each sample, a DT makes a classification decision, and the significance of every feature in each tree is computed. The prediction in RF is taken as the average over the trees, as given in Equation (5), where F(x) is the prediction function, j is the number of trees in the RF model and k is the total number of features.
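A minimal Random Forest sketch in scikit-learn (synthetic data, not the ADNI features; the tree count here is an arbitrary illustration of j):

```python
# Random Forest: j decorrelated trees whose votes are averaged, plus
# a per-feature significance measure (feature_importances_).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0)  # j = 50 trees
rf.fit(X, y)
print(rf.feature_importances_.shape)   # one significance value per feature: (8,)
```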

K-Nearest Neighbor (K-NN)
The K-NN algorithm [48] works on the principle of a similarity measure. The number of nearest neighbours that contribute to voting is represented by K, and the model classifies new data depending on similarity with the available cases. The algorithm is mostly used in search applications where the task is to find similar data objects. The probabilistic interpretation of K-NN is discussed in Equations (6)-(8).

In Equations (6) and (7), the number of nearest neighbours from class 1 is denoted by k1, and N1 signifies the total number of instances in class 1.
In Equation (8), R_k^D(x) is the distance between the estimation point and the k-th closest neighbour, and c_D represents the volume of the unit sphere in D dimensions.
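The K-NN voting principle can be sketched as follows (an illustrative scikit-learn fragment on synthetic data; K = 5 is an arbitrary example value):

```python
# K-NN: the K most similar training cases vote on the class of a new sample.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=6, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)   # K = 5 voting neighbours
knn.fit(X, y)
print(knn.predict(X[:3]))                   # majority-vote labels for 3 samples
```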
Support Vector Machine (RBF Kernel)
Support vector machine methods analyze data and sort them into categories based on supervised learning. SVM (RBF kernel) uses a labelled training dataset as input to learn the classifier and produce input-output mapping functions [49]. Support vector machines are versatile because diverse kernel functions can be supplied for the decision functions; common kernels as well as custom kernels can be defined. RBF stands for the radial basis function kernel, which works in infinite dimensions. The radial kernel SVM was used to deal with overlapping data. The radial kernel is mathematically represented in Equation (9):

K(a, b) = exp(−γ ‖a − b‖²)    (9)

The kernel K determines the influence of each observation in the training dataset, as stated in Equation (9). γ (gamma) is determined by cross-validation and can be perceived as the inverse of the radius of influence.
Support Vector Machine (Polynomial Kernel)
Equations (10) and (11) define the polynomial kernel, where a and b denote different observations in the dataset, r defines the polynomial coefficient, and d sets the degree of the polynomial. The polynomial kernel sacrifices linearity and uses higher-degree polynomial equations for higher-dimensional data.

Support Vector Machine (Sigmoid Kernel)
SVM (Sigmoid Kernel) [52] is also known as the hyperbolic tangent kernel and as the multilayer perceptron kernel. The sigmoid kernel is mathematically represented as

K(x_i, x_j) = tanh(α x_iᵀ x_j + c)    (12)

In Equation (12), c is the intercept constant and α is the slope; x_i is the i-th instance and x_j is the j-th instance. The sigmoid kernel is interesting because of its origin in neural network theory and has been found to perform well.
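The three SVM kernels discussed above can be compared side by side; in this illustrative scikit-learn sketch (synthetic data, example hyperparameters), `gamma`, `degree` and `coef0` map to the symbols γ, d and r/c in Equations (9)-(12):

```python
# Fit SVMs with the RBF, polynomial and sigmoid kernels on the same data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
for kernel, params in [("rbf", {"gamma": "scale"}),       # gamma in Eq. (9)
                       ("poly", {"degree": 3, "coef0": 1.0}),  # d, r
                       ("sigmoid", {"coef0": 0.0})]:       # c in Eq. (12)
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(kernel, round(clf.score(X, y), 3))
```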

Gradient Boost (GB)
GB [53] is also founded on the theory of sequential ensembles, where the base learners are created progressively in such a way that each new base learner is more competent than the previous one. Predictions are combined by weighted average or vote to provide an overall prediction. The goal of boosting algorithms is to minimize the empirical loss functional given in Equation (13), where L(f) is the empirical loss function, viewed through functional gradient descent rather than as a vector-valued function to minimize the loss L, and x_i is a training sample from a set of size m.

Extreme Gradient Boosting (XGB)
The XGB classifier [54] is commonly used for complex and more convoluted data-driven real-world problems. Boosting comes under the category of ensemble learning techniques in which weak learners are combined into strong learners so that the accuracy of the model is increased. Thus, boosting is an efficient method for increasing classification performance. An ensemble can be executed in one of two ways: using a sequential ensemble model, also known as boosting, where the weak learners are created sequentially during the training phase; or using a parallel ensemble model, also known as bagging, where the weak learners are produced in parallel during the training phase. Mathematically, boosting can be represented with the help of Equation (14):

f₁(x) = f₀(x) + h₁(x)    (14)

where f₀ is the initial model used to predict the target variable, and h₁ is fitted to the residuals from the preceding step. The performance can be further improved by building a new model on the residuals of f₁, as shown in Equation (15):

f₂(x) = f₁(x) + h₂(x)    (15)

The same step can be repeated for m iterations until the loss is minimized.
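The residual-fitting recursion of Equations (14) and (15) can be demonstrated numerically; this illustrative sketch uses shallow regression trees as the h_m corrections on synthetic data (it is not the XGB library's implementation):

```python
# Boosting sketch: each stage fits a correction h_m to the residuals of
# the previous combined model, so the loss shrinks iteration by iteration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = np.sin(4 * X[:, 0])              # synthetic target

F = np.full_like(y, y.mean())        # f0: initial constant model
for _ in range(3):                   # m boosting iterations
    residual = y - F                 # errors of the current model
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F = F + h.predict(X)             # f_m = f_{m-1} + h_m

print(np.mean((y - F) ** 2) < np.var(y))  # loss below the initial loss: True
```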

Multi-Layer Perceptron Neural Network (MLP-NN)
In this study, we present a multilayer perceptron neural network (MLP-NN), a feedforward artificial neural network [55] that creates outputs from a collection of inputs. A neural network is a computational model inspired by the biological brain. The artificial neuron, called a perceptron, is shown in Figure 5; it takes inputs combined with weights. X1, X2, …, Xn are the inputs to the perceptron, combined with the weights W1, W2, …, Wn. The activation function takes the calculated weighted sum as input and passes it through a threshold. Thus, an artificial neural network comprises n such perceptrons. The architectural framework of the proposed MLP-NN is shown in Figure 6. The multi-layer perceptron (MLP) with backpropagation learning has an input layer with rectified linear units (ReLU), hidden layers with multiple dense units and the ReLU activation function, and an output dense layer with a softmax activation function. The input layer has 2000 × 1024 units with a bias of 1024. Furthermore, four hidden layers (1024 × 1024, 1024 × 512, 512 × 512, and 512 × 256 units) are included. Finally, the diagnostic output is generated using the output layer with 253 × 3 units and bias 3.
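The single perceptron of Figure 5 can be sketched in a few lines (illustrative numbers, not the trained network's weights):

```python
# One perceptron: inputs X1..Xn combined with weights W1..Wn plus a bias,
# passed through an activation function (ReLU here).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.0, 2.0])        # inputs X1..Xn
w = np.array([0.4, 0.3, 0.1])         # weights W1..Wn
b = 0.1                               # bias
out = relu(np.dot(w, x) + b)          # weighted sum through the activation
print(round(float(out), 6))           # 0.2
```

Stacking layers of such units, with ReLU in the hidden layers and softmax at the output, gives the MLP-NN architecture described above.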

Proposed Ensemble Learning Approach: XGB + DT + SVM
Based on their performance, the three top-performing learning models, including Extreme Gradient Boosting, Decision Tree and SVM-Poly kernel are chosen for the ensemble learning model in the proposed work. In machine learning, voting ensemble techniques often provide more accurate results than a unified framework due to challenges such as the representation problem, the computing difficulty, and the statistical problem, which can be overcome by employing ensemble approaches. The basic example of stacking is an ensemble learning strategy in which additional innovations are created by combining the projections of many important functional, also known as classifiers, to teach a meta-classifier.
In this study, we implemented an ensemble of the SVM (poly kernel), the Extreme Gradient Boosting ensemble architecture, and a Decision Tree. This approach increases performance by combining the advantages of several machine learning architectures, bringing together bagging, boosting and kernel-based models: the Decision Tree, Extreme Gradient Boosting and the SVM poly kernel. The XGB + DT + SVM concept incorporates the variance minimization of the Decision Tree, the sampling accuracy enhancement of Extreme Gradient Boosting, and the kernel-space optimization of the polynomial-kernel Support Vector Machine to achieve a more accurate averaged prediction than any single kernel [56]. Any predictor of our selection can serve as the meta-classifier. Splitting the training set into multiple splits for distribution among the weak classifiers reduces the time complexity of training, allowing the task to be accomplished easily. Stage-one filters are trained with the first split of the training examples; the learned stage-one filters then generate predictions for the remaining training splits. After that, the meta-classifier is trained using the output of the weak learners. The confidence score (CS), a regressive value, measures the classification strength using the same feature vectors. The aggregating criterion is a vote over each estimator's concatenated output; voting is calculated on the predicted probability of the output class. The architecture of the proposed XGB + DT + SVM is illustrated in Figure 7.
By aggregating the predicted class or predicted probability based on voting, the proposed technique significantly improved the model's performance. When many base models are provided to the ensemble voting classifier, errors made by any one model can be compensated for by the others. The classification challenge involves summing the votes for crisp class labels from the different models and predicting the class with the most votes. We used ensemble voting to overcome the error that occurs with an independent classifier [57,58]. A voting classifier estimator trains the three kernels (base estimators) and predicts by aggregating the findings of each.
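The soft-voting ensemble can be sketched with scikit-learn's `VotingClassifier`. This is an illustrative sketch on synthetic data, not the study's pipeline; in particular, `GradientBoostingClassifier` stands in here for xgboost's `XGBClassifier` so the sketch stays scikit-learn-only.

```python
# Sketch of the XGB + DT + SVM soft-voting ensemble. GradientBoostingClassifier
# is a stand-in for xgboost's XGBClassifier; the data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)
ensemble = VotingClassifier(
    estimators=[
        ("xgb", GradientBoostingClassifier(random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(kernel="poly", probability=True, random_state=0)),
    ],
    voting="soft",  # vote on the predicted class probabilities, as in the text
)
ensemble.fit(X, y)
preds = ensemble.predict(X)
```

With `voting="soft"`, the per-class probabilities of the three base estimators are averaged and the class with the highest averaged probability wins, which is why the SVM must be built with `probability=True`.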

Grid Search-Based Optimization on ML Methods and Ensemble Technique
The Grid Search tuning approach aims to find optimal hyperparameter values. In this study, an exhaustive search was conducted over specific parameter values for these models. This is performed by choosing suitable hyperparameters and tuning them to control the behaviour of the algorithm. The efficiency of our models is optimized by tuning without causing excessive variance or overfitting. Determining the ideal collection of hyperparameters can be computationally challenging. Hyperparameter values are set before learning begins; examples include the regularization parameter in SVMs, k in k-Nearest Neighbors, and the hidden-layer count in multi-layer perceptron neural networks. A parameter, on the other hand, is an internal feature of the model and can be estimated from the data.

Experimental Analysis
In this study, a total of 2125 neuroimaging (MRI) samples are used, divided into three classes: AD, CN, and MCI. Further, three different families of classifiers (classical machine learning, neural-inspired, and ensemble learning algorithms) are designed for AD classification.

Implementation Setup
We trained our models on the Google Colab framework, which provides computational tools and the flexibility to adjust utilization limits and hardware availability, using Python 3.7, TensorFlow, and the Scikit-learn library. Additionally, we used several Python libraries, including NumPy, Pandas, Matplotlib, and Keras, as well as pydicom, a Python library for working with DICOM files. TensorFlow and TensorFlow-GPU version 2.0.0 were also used.

Performance Evaluation Metrics
The objective of this study is to identify the effectiveness of different algorithms for accurate classification with minimum log loss. Different models are investigated, and the accuracy, sensitivity, specificity, and log loss values are measured for each classifier on the test set. In statistical classification, a confusion matrix (also known as an error matrix or contingency table) is used to visualize an algorithm's performance on a known set of test data. The confusion matrix is listed in Table 3.
(a) Accuracy: Classification models can be evaluated based on their accuracy. The fraction of correct predictions made by the model divided by all predictions is referred to as its accuracy as shown in Equation (16).
(b) Sensitivity: The value is represented numerically by dividing the number of correct positive predictions by the total number of positive values, as given in Equation (17).
(c) Specificity: The value is mathematically represented by dividing the number of successful negative predictions by the total number of negative values. Sensitivity and specificity are statistical indicators of test result performance. The specificity is computed using the formula given in Equation (18).
(d) Logarithmic loss (log loss): Classification models that yield a probability value between zero and one may be evaluated for efficiency using the log loss. The log loss increases as the predicted probability diverges from the actual label. The log loss thus takes the confidence of the prediction into account when penalizing incorrect classifications [59]. The mathematical model used for computing logarithmic loss is shown in Equation (19).
where N is the number of instances, y_i indicates the label of the ith instance, and p_i represents the predicted probability of class 1 (so 1 − p_i is the probability of class 0). Over the total number of cases, log loss accumulates the likelihood of each sample assuming states 0 and 1. The next section presents comprehensive results obtained after applying the different classification approaches.
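The four metrics of Equations (16)-(19) can be computed from a confusion matrix as follows; the labels and probabilities below are a small made-up binary example, not results from the study.

```python
# Accuracy, sensitivity, specificity (Eqs. 16-18) from a confusion matrix,
# and log loss (Eq. 19), on a small made-up binary example.
from sklearn.metrics import confusion_matrix, log_loss

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # ground-truth labels
p_pred = [0.9, 0.2, 0.8, 0.6, 0.3, 0.1, 0.4, 0.2]  # predicted P(class 1)
y_pred = [int(p >= 0.5) for p in p_pred]           # thresholded labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)         # Eq. (16)
sensitivity = tp / (tp + fn)                       # Eq. (17)
specificity = tn / (tn + fp)                       # Eq. (18)
loss = log_loss(y_true, p_pred)                    # Eq. (19)
```

For the multi-class AD/CN/MCI setting, the same quantities are obtained per class in a one-vs-rest fashion from the 3 × 3 confusion matrix of Table 3.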

Results
The current section compares classic ML models with the proposed architecture on the ADNI data. The details of the respective results obtained by the different classifiers are presented below.

Parameter Optimization Results
Grid search determined the ideal model hyperparameters that lead to the most accurate predictions. Table 4 lists the grid parameters and optimized parameters of the classification models implemented in this study.
Table 4. Hyperparameter tuning to obtain optimized parameters using GridSearch.

Before Hyperparameter Tuning
ML models and the proposed XGB + DT + SVM were investigated for AD and MCI classification based on neuroimaging analysis. The performance analysis based on accuracy, sensitivity, specificity, and log loss for the three classes (AD, CN, and MCI) is given in Table 5. The NB predictions obtained the lowest accuracies of 72.22%, 77.98% and 77.58% for AD, CN and MCI, respectively. The SVM polynomial-kernel model achieved accuracies of 87.84%, 86.21% and 81.48%, respectively. The XGB and GB models obtained accuracies for AD of 81.56% and 72.25%, respectively. The classification accuracy of the MLP-NN model for AD, CN and MCI is 89.20%, 81.56% and 84.30%, respectively. XGB + DT + SVM achieved better accuracies of 87.00%, 90.51% and 90.51% for AD, CN and MCI, respectively, compared to the other algorithms; overall, it achieved an average accuracy of 89.77%. The sensitivity values for each class show the test's ability to correctly identify that class in the test data. Among the algorithms, SVM Polynomial and the proposed model have the lowest log loss values of 0.29615 and 0.27795, respectively, whereas Naïve Bayes has the highest log loss value of 13.69922.

After Hyperparameter Tuning
Additionally, Table 6 compares the overall scores obtained with optimized parameters for the ML classifiers and the proposed ensemble (XGB + DT + SVM) model. GridSearch enhanced the classification performance of these models by choosing the optimized grid of parameters. MLP-NN showed the best classification accuracy among the ML models, at 89.71%. The proposed ensemble model XGB + DT + SVM with hyperparameter tuning obtained the highest average classification accuracy of 95.75%, outperforming all the models implemented in this study; it achieved accuracies of 96.12%, 95% and 96.15% for AD, CN and MCI, respectively. After hyperparameter tuning, the results show reduced variance in the per-class sensitivity and specificity and a reasonable increase for all the models. Using the ensemble method, we were able to lower the standard deviation while maintaining high precision. Figure 8 further illustrates the comparative plot, which offers a visual benchmark contrasting the ML models and the proposed XGB + DT + SVM based on classification accuracy (before and after hyperparameter tuning).
In Figure 8, the authors compared the performance improvement among the models before and after GridSearch. During hyperparameter optimization with GridSearch, the SVM polynomial kernel was chosen among the RBF, polynomial, and sigmoid kernels. Figure 8b represents the performance of models after using GridSearch which includes the SVM polynomial kernel along with other models, whereas, in Figure 8a, all three SVM kernels are included along with other implemented models. Overall, it is demonstrated that model tuning enhances the performance and classification accuracy of all the models.

Statistical Evaluation of Outcomes
In this experiment, k-fold CV, Friedman's rank test, and t-test were used to statistically validate the performance attained by various models for classifying AD, CN, and MCI.

Ten-Fold Cross-Validation
We performed k-fold cross-validation with k = 10, where each fold was determined to have a mean response value within a certain tolerance of the overall value.
Ten-fold cross-validation tunes the model on various splits and finds the optimal kernel-space boundary. The dataset is divided into 10 folds of roughly the same size, and cross-validation is applied to evaluate the performance of each model after hyperparameter tuning. Table 7 lists the 10-fold cross-validation results for each model reported in this study after parameter tuning. The entire dataset was randomly divided into 10 folds (k = 10). A larger k results in a less biased model, whereas a smaller k is comparable to the train-test split strategy. Next, k − 1 folds were used to fit the model, and the kth fold was used for validation; this procedure was repeated until each fold had functioned as a test set. The average of the reported scores is then calculated to validate our models. This procedure demonstrates that the highest mean value is achieved by XGB + DT + SVM (98.84%), whereas Naïve Bayes achieved the lowest mean value (67.04%) in the 10-fold validation. Finally, the results demonstrate that hyperparameter tuning yields enhanced accuracies in comparison to the values obtained without optimized parameters.
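The 10-fold procedure described above can be sketched as follows; the decision tree and synthetic data are placeholders for the tuned models and the ADNI features.

```python
# Sketch of 10-fold cross-validation: each fold serves once as the test set,
# and the per-fold accuracies are averaged. Synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
mean_acc = scores.mean()   # the value reported per model in Table 7
```

`cross_val_score` handles the k − 1 train / 1 test rotation internally, returning one accuracy per fold.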

Friedman's Rank Test
In this work, the significance of the differences between these classifiers is measured using the Friedman rank test [60]. The Friedman test is a non-parametric test that compares methods based on their rankings. In our study, we chose a significance threshold of α = 0.05.
Models are ranked using Friedman's test, with the best-performing models placed at the top of the list.
The test determines each model's average rank from M_rj, where M_rj is the rank of the ith model on the jth fold of the dataset. According to the null hypothesis, each model displays identical behaviour and hence has equivalent precision. Friedman's rank test statistic is distributed as χ²_FM with r − 1 degrees of freedom, as defined in Equation (20) below.
If s > 10 and/or r > 5, the chi-square distribution can be used to approximate the critical value of Friedman's test statistic; otherwise, the table of critical values is consulted.
The p-value obtained from Friedman's test at α = 0.05 demonstrates that the null hypothesis cannot be accepted and substantiates the claim that there is a meaningful distinction between the various models. The estimated average rank and p-value of the models are shown in Table 8, which demonstrates that the proposed ensemble method XGB + DT + SVM (highest rank value = 8) and MLP-NN (second-highest rank value = 6.65) have better accuracy performance than the other models used in this study.
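A Friedman test over per-fold accuracies can be sketched with SciPy; the three accuracy series below are hypothetical, not the values behind Table 8.

```python
# Sketch of Friedman's rank test on per-fold accuracies of three models.
# The accuracy values are hypothetical placeholders.
from scipy.stats import friedmanchisquare

model_a = [0.88, 0.90, 0.87, 0.91, 0.89, 0.92, 0.88, 0.90, 0.91, 0.89]
model_b = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.78, 0.81, 0.82, 0.80]
model_c = [0.70, 0.72, 0.69, 0.71, 0.70, 0.73, 0.68, 0.71, 0.72, 0.70]

stat, p_value = friedmanchisquare(model_a, model_b, model_c)
reject_null = p_value < 0.05   # models differ significantly if True
```

Because model A outranks B, which outranks C, on every fold, the test rejects the null hypothesis of identical behaviour.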

T-Test
Further, the mean difference in accuracy is evaluated using a t-test across two experimental settings: (i) before hyperparameter tuning, and (ii) after hyperparameter tuning. A t-test is a type of inferential statistic used to determine whether there is a significant mean difference in performance accuracy [61]. The mean accuracies obtained before and after hyperparameter tuning, 76.39% and 80.448%, demonstrate the enhanced results of setting (ii), with a statistically significant t-value of −4.058 at p < 0.05, as shown in Table 9. Note: (**) indicates significance at the 5% level.
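This paired comparison can be sketched with SciPy's `ttest_rel`; the per-model accuracies below are hypothetical placeholders, not the values in Table 9.

```python
# Sketch of a paired t-test on per-model accuracies before and after tuning.
# The accuracy values are hypothetical placeholders.
from scipy.stats import ttest_rel

before = [0.72, 0.78, 0.81, 0.75, 0.85, 0.79, 0.83, 0.70]
after  = [0.76, 0.82, 0.84, 0.80, 0.89, 0.83, 0.88, 0.75]

t_stat, p_value = ttest_rel(before, after)
significant = p_value < 0.05   # tuning improved accuracy if True and t_stat < 0
```

A paired (rather than independent) test is appropriate here because each model contributes one accuracy to each setting.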

ROC Demonstration of Models Classification Abilities
In this subsection, we have shown the graphical illustration for the models in three categories. Receiver operating characteristic (ROC) curve plots for all classifiers are presented to indicate positive and negative results. The ROC curve is plotted to visualize how the predicted probabilities are compared to the true status [62].
False positive rates are plotted on the X-axis, while true positive rates are plotted on the Y-axis. Anticipated values are shown as TN, TP, FN, and FP on the ROC curve. There are three lines (blue, orange, and green) on the AUC graphs. In each ROC curve, the orange line represents Alzheimer's disease (AD), the green line represents Control Normal (CN), and the blue line represents Mild Cognitive Impairment (MCI) as shown in Figure 9. The trade-off between sensitivity (TPR) and specificity (1-FPR) is demonstrated in the receiver operating characteristic (ROC) curve. Better results are achieved where the curve is closer to the top-left corner.
XGB + DT + SVM showed more accurate predictions than the other models, and MLP-NN showed impressive predictions compared to the other ML models. Since the ROC curve is independent of the class distribution, it is advantageous for evaluating classifiers that predict rare events such as disease.
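A per-class ROC curve of the kind plotted in Figure 9 can be sketched as follows; the labels and scores are a small made-up one-vs-rest example (here, a hypothetical "AD vs. rest" split), not values from the study.

```python
# Sketch of a one-vs-rest ROC curve and its AUC for a single class.
# Labels and scores are made-up placeholders.
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])          # 1 = AD, 0 = rest
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.7, 0.2, 0.6, 0.5])

fpr, tpr, thresholds = roc_curve(y_true, scores)     # FPR on X, TPR on Y
roc_auc = auc(fpr, tpr)                              # area under the curve
```

Repeating this for each of the three classes produces the three curves (AD, CN, MCI) shown per panel in Figure 9; a curve hugging the top-left corner, i.e., AUC near 1, indicates better discrimination.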

Comparison with Existing Studies
In this subsection, the efficacy of the offered models is demonstrated by comparing their performance to that of earlier research. A comprehensive analysis of machine learning algorithms and a proposed approach has been performed for the classification of Alzheimer's disease.
To ensure the generalization of these models, the efficiency of the models presented in this study is compared using the accuracy scores obtained in previous studies on the chosen benchmark dataset [63][64][65][66].
Among all the methods analyzed, XGB and MLP-NN showed better performance. The outcomes of the comparison of XGB and XGB + DT + SVM with previous studies are listed in Table 10, and a graphical comparison of the results obtained by the different models in this work with previous studies is shown in Figure 10. Among the classical ML models used in this framework, XGB showed better performance for the classification of AD, CN and MCI, with an accuracy of 85.42%. In addition, the proposed XGB + DT + SVM and MLP-NN models outperformed existing approaches with accuracies of 88.64% and 90.75%, respectively.

Conclusions
Computer-aided detection (CAD) systems are being developed to help radiologists in interpreting medical imaging findings and minimizing inter-reader variability more precisely. Artificial intelligence and machine learning play an important role in the development of CAD systems as it is extensively used to uncover useful image features from complicated datasets and assist in more reliable diagnosis.
This study proposes a master model whose main principle is that combining two or more learning procedures improves the performance of the final model. Several state-of-the-art learning algorithms and the proposed master model were compared to evaluate their performance in classifying Alzheimer's disease from ADNI neuroimaging data. This study aims to improve the understanding of these algorithms and enhance their scope and significance in Alzheimer's disease classification.
Comparative result analysis showed that after applying GridSearch for parameter optimization, the classification performance improved significantly. Of the various considered algorithms with optimized parameters, MLP-NN outperforms other algorithms with the best classification accuracy of 89.64%. Overall, the proposed XGB + DT + SVM with hyperparameter tuning achieved the highest accuracy of 95.75%, outperforming all the other classifiers on high-resolution brain MRI data.
In this study, statistical analysis of the results was carried out to validate the performance of the proposed models. The analysis was performed using k-fold cross-validation, Friedman's rank test and the t-test. These statistical validations ensure the accuracy, effectiveness and efficiency of the models.
Further, the results of the proposed model (XGB + DT + SVM) and other state-of-the-art models considered in this study, such as the neural-inspired model (MLP-NN) and XGB, were compared with the results of existing studies. The comparison shows that the models proposed in this study outperform those of the existing studies.

Future Scope
From the state-of-the-art approaches, we conclude that certain gaps need to be addressed in further research, and several challenges are identified for future work. Firstly, the experimentation can be performed using a larger number of samples to yield improved performance. Secondly, the work can be extended using different neuroimaging modalities to determine the most effective one. Thirdly, deep learning algorithms such as 3D Convolutional Neural Networks (3D-CNN) and Recurrent Neural Networks (RNN), as well as pre-trained transfer learning models such as VGG-16, VGG-19 and ResNet, can be utilized to address the existing prediction and classification problems for better diagnosis. Fourthly, the AD data can be experimented with by including more viewing planes, such as sagittal and coronal, instead of considering just a single plane. Fifthly, nature-inspired algorithms can be experimented with for optimization and for extracting relevant features from the data. Moreover, the work can be extended by considering the severity levels of AD, i.e., early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), static mild cognitive impairment (SMCI) and progressive mild cognitive impairment (PMCI).

Funding: The authors received funding from Vellore Institute of Technology, Vellore for this research work.
Informed Consent Statement: Not applicable for this study.

Data Availability Statement:
The dataset available online can be downloaded from the links given as follows. ADNI DataSet at the URL: https://ida.loni.usc.edu/login.jsp?project=ADNI (accessed on 10 July 2022).