Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance

Heart disease, one of the main causes of the high mortality rate around the world, requires a sophisticated and expensive diagnosis process. In the recent past, much literature has demonstrated machine learning approaches as an opportunity to efficiently diagnose heart disease patients. However, challenges associated with datasets, such as missing data, inconsistent data, and mixed data (containing inconsistent missing values in both numerical and categorical fields), are often obstacles in medical diagnosis. These inconsistencies lead to a higher probability of misprediction and misleading results. Data preprocessing steps such as feature reduction, data conversion, and data scaling are employed to form a standard dataset; such measures play a crucial role in reducing inaccuracy in the final prediction. This paper aims to evaluate eleven machine learning (ML) algorithms—Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), and Extra Tree Classifier (ET)—and six different data scaling methods—Normalization (NR), Standard Scaler (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT)—on a dataset comprising information on patients with heart disease. The results show that CART, along with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and a 100% F1 score. The study outcomes demonstrate that model performance varies depending on the data scaling method.


Introduction
Patients with heart disease symptoms often require electrocardiography and blood tests in order to evaluate the disease appropriately [1,2]. Every year, almost 12 million people die due to heart diseases [3]. Thus, diagnosing this disease at an early stage is vital. While medical diagnosis is an important and complicated task, the recent development of artificial intelligence (AI) provides fast alternative options, which can benefit particular areas, such as rural regions, where doctors and expensive diagnostic equipment are very limited. Therefore, an automated diagnosis system that could be operated by nonmedical personnel as well would be beneficial. Over the years, it has been observed that diagnosing heart disease at an early stage with additional patient information and medical history can save time, money, and health [4]. Several studies have shown the possibility of developing a decision support system using that information with the help of machine learning approaches [2,5–9].
Artificial-intelligence-based algorithms (e.g., heuristics, metaheuristics) are a rapidly growing area of computer science that has shown promise in various applications, including online learning [10], scheduling [11], multiobjective optimization [12], and vehicle routing [13]. Recent research has demonstrated significant potential for deep-learning-based approaches in medical diagnosis [14–16]. By leveraging deep learning capabilities such as image segmentation, diseases such as diabetes, cancer, and SARS-CoV-2 have been detected and diagnosed more efficiently and effectively [15]. For example, when the global SARS-CoV-2 pandemic began, numerous studies proposed using chest radiographs (X-ray) and computed tomography (CT) scan images to detect patients with COVID-19 symptoms. For instance, Ahsan et al. (2020) proposed MLP-CNN-based approaches to identify COVID-19 patients using patient attributes such as age, gender, and temperature in conjunction with X-ray images. The experiment was carried out with an office-grade laptop and a small amount of data [15].
Combining data classification techniques with nature-inspired algorithms such as genetic programming [17] and the swarm algorithm [18] enables the differentiation of different bacterial from viral meningitis [19]. As a result, artificial intelligence has gained popularity in recent years as a beneficial tool for optimization and decision support systems. However, deep-learning- and neural-network-based approaches are computationally expensive when dealing with larger datasets [20]. Therefore, unless necessary, traditional machine learning approaches are frequently preferred over deep learning approaches due to their lower computational cost and memory consumption.
However, developing a data-analysis-based decision support system requires standard data, which often entails many preprocessing steps. Some of the important preprocessing steps include data cleaning, pruning, feature selection, and scaling. While most studies considered different ML algorithms along with feature selection [2,5–8], few considered the effect of the data scaling process on overall model performance [9,21]. Thus, the primary purpose of this study is to evaluate the effect of different data scaling methods on different ML algorithms while developing a prediction model for patients with heart disease symptoms. The experimental results will offer insights to researchers and practitioners developing robust, data-driven decision support systems.
In the present study, eleven machine learning algorithms and six data scaling methods are used together to find the best match for heart disease prediction. Within the scope of this work, the machine learning algorithms Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naïve Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), and Extra Tree Classifier (ET), and the scaling methods Normalization (NR), Standard Scaler (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT) are used. The effect of the different data scaling techniques is observed using the UCI heart disease dataset.
The rest of the paper is organized as follows. In Section 2, a summary of previous studies is addressed. Sections 3–5 present the methodology, results, and discussion, respectively. Finally, in Section 6, a conclusion is drawn based on the overall experiment and the possibility of future work is discussed.

Literature Review
There is very limited work in the fields directly related to this paper. Most of the referenced literature represented their study output in terms of the accuracy of the machine learning (ML) algorithm. However, the performance of ML algorithms differs across studies due to the use of different ML approaches [9].
Even though many of the referenced studies show promising results [24–26] using DT algorithms, computational accuracy surprisingly varies by almost 7–8%, even though they used the same dataset, the UCI heart disease dataset. None of the studies mentioned whether they applied data scaling methods; thus, it is difficult to evaluate the reason behind the variation of DT accuracy on the same dataset in different studies. However, the potential reasons could be the use of a different number of features or variation in the training set/test set ratio. Additionally, accuracy alone is sometimes not enough to represent overall performance. Therefore, using a classification matrix and representing overall performance with accuracy, precision, recall, and F1 scores is more reliable and is suggested by many studies [9].
Since most of the studies used feature selection and somewhat ignored the effect of data preprocessing when developing prediction models, this study focuses on and investigates data scaling methods more closely instead of feature selection. However, there is no denying that feature selection is also an important procedure in data analysis. For example, Amin et al. (2019) showed that the accuracy of different models varied by up to 4–5% considering different combinations of ML algorithms with different numbers of features [21]. Their results also revealed that, due to a limited number of features, accuracy often drops by up to 14%, which is significantly high in the case of medical diagnosis.
Studies conducted by [9,21] used LR, KNN, and SVM to predict heart disease. On the other hand, studies conducted by [5,21] considered Decision Trees. However, predicting heart disease using other robust techniques such as XGB, AB, and ET is missing from those previous studies. Note that, over the years, algorithms such as XGB and AB have shown promising results with highly imbalanced data [30]. Therefore, we can infer that the performance of XGB, AB, and ET may differ from that of LR, KNN, and SVM in disease prediction as well.
While several data scaling techniques are available, one of the main challenges associated with ML is choosing the appropriate scaling method. Many studies support the effect of data scaling techniques on different ML algorithms [31,32]. Shahriyari et al. (2019) showed that normalization has a significant effect on different ML approaches [32]. Their study used twelve different ML algorithms, including some of the algorithms most commonly used in heart disease prediction. The study applied different normalization methods and showed that the performance of ML algorithms and the selection of normalization methods are interconnected. Among the supervised algorithms, SVM had the maximum accuracy, at 78%. However, their study also shows that Naïve Bayes had the best performance in terms of balancing accuracy and the lowest fitting times [32].
Another study, conducted by Ambarwari et al. (2020), showed that data scaling techniques such as MinMax normalization and standardization also have significant effects on data analysis [31]. The study was carried out using ML algorithms such as KNN, Naïve Bayes, ANN, and SVM with RBF. Their study demonstrated that NB has the most stable performance without the use of data scaling techniques, while KNN showed more stable performance compared with SVM and ANN. However, their computational results revealed that MinMax scaling with SVM outperformed the other algorithms, which contradicts the study conducted by [32]. Even though their studies do not agree with each other, it can still be inferred that data scaling does have some effect on overall performance.
Another study, conducted by Balabaeva et al. (2020), addressed the effect of different scaling methods on heart failure patient datasets [33]. Their study used more robust ML algorithms such as XGB, LR, DT, and RF with scaling methods such as Standard Scaler, MinMax Scaler, MaxAbs Scaler, Robust Scaler, and Quantile Transformer. In their study, RF showed higher performance with the Standard Scaler and Robust Scaler. However, the performance of DT remained unchanged with scaling. Table 1 summarizes the referenced literature that considered UCI heart disease data along with some of the common machine learning algorithms. To understand the effect of different data scaling methods, a thorough investigation is required. A summary of our technical contributions is presented below:

1. Applied eleven different ML algorithms with data scaling methods on the UCI heart disease dataset;
2. Investigated the algorithms' performance without data scaling methods;
3. Identified the best algorithm and scaling method by analyzing the study outcomes.

Table 2 details the overall assignment of data used in this study. The dataset contains 303 instances and 75 attributes. However, only 13 attributes are widely studied in the referenced literature. The overall information about the dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Heart+Disease (UCI dataset). Table 3 summarizes the detailed attributes of the selected features of the heart disease dataset. Figure 1 gives some insight into the data sparsity of one of the data attributes, "age". Figure 2 shows the pairplot of three attributes of the heart disease dataset: age, sex, and thalach. In addition, a seaborn heatmap was used to understand the importance of each feature, as shown in Figure 3. As all 13 attributes are highly correlated, this study uses all of those features.

Experimental Setup
To conduct this experiment, eleven machine learning algorithms were chosen: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naïve Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost Classifier (AB), and Extra Tree Classifier (ET). Note that the main reason for choosing these machine learning techniques is to compare the overall performance with that found in previous studies in terms of different preprocessing techniques. All eleven ML algorithms were implemented using Anaconda modules with Python 3.7 and were run on an office-grade laptop with common specifications (Windows 10, Intel Core i7-7500U, and 16 GB of RAM). Instead of developing different preprocessing steps, this study uses the built-in preprocessing libraries provided by Scikit-learn: Normalization, Standardization, MinMax Scaler, MaxAbs Scaler, Robust Scaler, and Quantile Transformer. Figure 4 illustrates the overall experimental approach using a flowchart. The performance was evaluated based on accuracy, precision, recall, and F1 score, as shown in Table 4. Table 4. The parameters used to compute the confusion matrix [16,34].
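The Scikit-learn scalers listed above each apply a simple per-feature formula. As an illustrative sketch (assuming the standard definitions used by Scikit-learn, not code from the paper), the transforms can be written out directly in NumPy; the sample feature column is a made-up vector with one outlier:

```python
import numpy as np

def standard_scale(x):
    # Standard Scaler: zero mean, unit variance (population std, as in sklearn)
    return (x - x.mean()) / x.std()

def min_max_scale(x):
    # MinMax Scaler: rescale the feature to the [0, 1] range
    return (x - x.min()) / (x.max() - x.min())

def max_abs_scale(x):
    # MaxAbs Scaler: divide by the largest absolute value, range [-1, 1]
    return x / np.abs(x).max()

def robust_scale(x):
    # Robust Scaler: center on the median, scale by the interquartile range,
    # making the transform less sensitive to outliers
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

# Hypothetical feature column with one outlier (200)
x = np.array([120.0, 130.0, 125.0, 140.0, 200.0])
print(min_max_scale(x))   # outlier compresses the other values toward 0
print(robust_scale(x))    # outlier has far less influence on the spread
```

This contrast between the MinMax and Robust outputs is one plausible reason scaling choice affects outlier-sensitive algorithms differently.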

Test Result	Truth: Heart Disease	Truth: Non-Heart-Disease
Positive	True Positive (t_p)	False Positive (f_p)
Negative	False Negative (f_n)	True Negative (t_n)

Performance measures:
Accuracy = (t_p + t_n) / (t_p + t_n + f_p + f_n)
Precision = t_p / (t_p + f_p)
Recall = t_p / (t_p + f_n)
F1 score = 2 × (Precision × Recall) / (Precision + Recall)

The matrix outcomes are as follows: True Positive (t_p) = heart patient classified as patient; False Positive (f_p) = healthy person classified as patient; True Negative (t_n) = healthy person classified as healthy; False Negative (f_n) = heart patient classified as healthy.
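The four metrics above follow directly from the confusion-matrix counts. A small self-contained sketch (the counts are hypothetical, chosen only for illustration):

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of all predictions that are correct
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Of everyone flagged as a heart patient, the fraction who truly are
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all true heart patients, the fraction the model catches
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts: 50 patients correctly flagged, 5 healthy people
# wrongly flagged, 40 healthy correctly cleared, 5 patients missed.
tp, fp, tn, fn = 50, 5, 40, 5
print(accuracy(tp, tn, fp, fn))  # 0.9
print(f1_score(tp, fp, fn))
```

When precision and recall are equal, as here, the F1 score equals both of them.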
The experiment was carried out by splitting the dataset into 80% and 20% for the training and testing sets, respectively. The performance of each model was evaluated using 10-fold cross-validation, and the performance is presented by averaging the outcomes of all 10 folds. Table 5 presents the overall accuracy for all eleven ML algorithms with the different scaling techniques. The SVM and CART algorithms showed the highest accuracy (99%) compared with the other nine algorithms when applied without any scaling technique. On the other hand, KNN showed the lowest performance, with 75% accuracy. However, its overall performance improved by up to 12% with the use of the MaxAbs scaling method. The study results also revealed that the overall performance is similar with or without data scaling techniques for all algorithms except LR, KNN, and SVM.
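The evaluation protocol just described can be sketched with scikit-learn. The snippet below is an assumption-laden illustration, not the authors' code: it uses a synthetic stand-in for the UCI data (matching its 303 × 13 shape), pairs one scaler with one classifier in a pipeline so the scaler is fit only on each training fold, and reports the mean 10-fold cross-validation accuracy. CART corresponds to scikit-learn's `DecisionTreeClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in mirroring the dataset's shape: 303 instances, 13 features
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# 80/20 train/test split, as in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaler + classifier in one pipeline: the scaler is re-fit on each training
# fold, so no information from the validation fold leaks into the transform
model = make_pipeline(RobustScaler(), DecisionTreeClassifier(random_state=42))

# 10-fold cross-validation on the training portion; average the fold scores
scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print(round(scores.mean(), 3))
```

Swapping `RobustScaler()` for another preprocessor (or omitting it) reproduces the grid of scaler/algorithm combinations the study iterates over.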

Accuracy
CART achieved an accuracy of 100% when implemented with the Robust Scaler and Quantile Transformer methods. On the other hand, the performance of SVM dropped drastically, from 99% to around 63%, when using Normalization. However, among all the data scaling methods, SVM showed its highest accuracy with the Standard Scaler, at around 92%. To sum up, with data scaling methods, CART outperformed all other ML algorithms on the heart disease dataset.
While a previous study conducted by [22] showed that Naive Bayes outperformed all other methods, this study found that NB had the lowest accuracy when used with the different scaling approaches. Table 6 summarizes the overall precision results for all eleven ML algorithms under the different data scaling methods.
Without scaling, CART has the highest precision (100%) and KNN has the lowest precision (78%). The performance of the LR, LDA, CART, SVM, and AB algorithms degrades once Normalization is applied to the dataset. From Table 6, it is clear that CART has the highest precision rate compared with any other algorithm, with or without scaling. KNN, SVM, and LDA have the lowest precision rates. Apart from CART, RF and ET showed more stability compared with the other algorithms.
Table 7 shows the overall recall results of all eleven ML algorithms along with the different data scaling methods. Here, CART showed the highest recall rate, at around 1, while KNN (0.72–0.84) and LR (0.74–0.89) showed the lowest recall rates. Table 8 shows the overall F1 score performance. Among all of the ML algorithms, CART achieved the highest F1 score, up to 100%, while KNN, SVM, LR, and NB had the lowest scores.
As a means of comparing our results with those available in the literature, Table 9 compares the performance of the previous and current studies. Results show that the model developed by Amin et al. (2019) achieved the best performance among the referenced literature, with an accuracy of 85.86% [21]. A comparison between previous studies and the current study that use KNN is demonstrated in Table 10. Results show that our KNN model, along with different scaling approaches, achieved a maximum accuracy of 87%, while the lowest accuracy was measured for the KNN models without scaling. A comparison of previous studies and the current study that use NB models for heart disease prediction is displayed in Table 11. Without scaling, most of the previous studies outperformed the current study, as shown in Table 12. The performance of RF with different scaling approaches on heart disease prediction is comparatively better than in the previous studies, as shown in Table 13; additionally, no significant change was observed across the different scaling approaches. Table 14 shows the comparison between the current study and a recent study that uses the same ML algorithm and presents the computational results in terms of the F1 score.

Discussion
In this study, the overall performance of eleven different ML algorithms was analyzed with different data scaling approaches. Without scaling, CART and SVM showed the highest accuracy. However, once those algorithms were tested with different scaling methods, only CART showed stable performance (Table 5). The study results also showed that, using scaling methods, it is possible to achieve 100% accuracy with the CART algorithm (Table 5).
However, this study found that the experimental results for different data scaling methods may not hold in all cases. For example, while most of the previous studies achieved higher results with SVM, this study found that SVM's performance was significantly degraded when using scaling methods such as MinMax, Normalization, and the Standard Scaler. Since no specific technique is available to decide the best scaling method for a given dataset, researchers need to find the best one by experimenting with ML algorithms over multiple trials. Additionally, each particular experiment involves a different dataset; therefore, a better way to develop the best model for a specific dataset is to experiment with different ML algorithms combined with different scaling approaches.
Among eleven ML algorithms, CART outperformed all others in terms of accuracy, precision, recall, and F1 score. However, there was no single scaling method that outperformed other methods when using different algorithms. From the overall experiment, the study outcomes for different scaling methods can be expressed as follows: QT outperformed all other methods when used with LDA, LR, KNN, and NB. On the other hand, NR performed better when used with boosting methods such as XGB, GB, and AB.
We also discovered throughout our research that standard machine learning algorithms can produce better outcomes, particularly when diagnosing heart disease patients. For example, Masih et al. (2020) and Jalali et al. (2019) used multilayer-perceptron-based deep neural networks to detect coronary heart disease early [35,36]. They achieved accuracies of approximately 96.5% and 92.39%, respectively, while we achieved nearly 100% accuracy using CART. While these study results may shed some light on the effect of different scaling methods in data analysis, our research still has some limitations. The technical limitations can be summarized as follows:
• Since the experiment was conducted using only one dataset, it would be difficult to conclude that the algorithm performance will remain the same when experimenting with a different heart disease dataset.
• During this study, we did not emphasize the feature selection process. Instead, we chose a similar set of features to those used in the previous literature for direct comparison. However, experimenting with different features, ML algorithms, and scaling approaches may produce different results.
• Some of the recent trending ML approaches, such as deep learning, CNNs, and RNNs, were not considered, as the dataset was quite straightforward and easy to handle with standard ML algorithms.

Conclusions
This study evaluated eleven machine learning (ML) algorithms along with six distinct data scaling methods to detect patients with heart disease using the UCI heart disease dataset. Our findings suggest that data scaling approaches have some effect on ML predictions. The CART algorithm achieved almost 100% accuracy and outperformed all other methods proposed in the previous literature for heart disease prediction. This study also evaluated the performance variation across the different data scaling approaches. Results show that algorithm performance fluctuates with different scaling methods. However, no single scaling approach was observed that can be ranked as the best among all of the scaling methods. We believe the study results will give direction to researchers and practitioners who want to develop a medical diagnosis model with a dataset containing outliers and unequal class ratios. The limitations addressed in the discussion will be the primary concern for further study, including but not limited to experimenting with other ML algorithms on real-time heart disease data, changing the parameters of the different data scaling techniques, and experimenting with deep learning on big data.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations