Machine Learning Methods for Hypercholesterolemia Long-Term Risk Prediction

Cholesterol is a waxy substance found in blood lipids. At healthy levels, it supports the production of new cells. When cholesterol exceeds the permissible limits, it has the opposite effect, causing serious heart health problems. When a person has high cholesterol (hypercholesterolemia), the blood vessels are blocked by fats, and thus, circulation through the arteries becomes difficult. The heart does not receive the oxygen it needs, and the risk of heart attack increases. Nowadays, machine learning (ML) has gained special interest from physicians, medical centers and healthcare providers due to its key capabilities in health-related issues, such as risk prediction, prognosis, treatment and management of various conditions. In this article, a supervised ML methodology is outlined whose main objective is to create highly efficient risk prediction tools for hypercholesterolemia occurrence. Specifically, a data understanding analysis is conducted to explore the association of the features with hypercholesterolemia and their importance. These features are utilized to train and test several ML models to find the most efficient one for our purpose. For the evaluation of the ML models, precision, recall, accuracy, F-measure, and AUC metrics have been taken into consideration. The derived results highlighted Soft Voting with Rotation Forest and Random Forest as base models, which achieved better performance than the other models, with an AUC of 94.5%, precision of 92%, recall of 91.8%, F-measure of 91.7% and an accuracy equal to 91.75%.


Introduction
Cholesterol is a form of fat and a key component of cells. It plays a very important role in health as it participates in the synthesis of hormones, in the production of vitamin D and in the digestion and assimilation of fats. The molecules that result from the binding of cholesterol to proteins are called lipoproteins and are categorized into "bad" LDL cholesterol and "good" HDL cholesterol. LDL cholesterol is responsible for transporting cholesterol molecules from the liver to tissues and organs, while HDL cholesterol transports cholesterol molecules from tissues back to the liver [1,2].
Cholesterol is measured in milligrams (mg) of cholesterol per deciliter (dL) of blood. A level below 200 mg/dL (5.2 mmol/L) is ideal. A level between 200 and 239 mg/dL (5.2-6.2 mmol/L) is borderline high. A value above 240 mg/dL (6.3 mmol/L) is the high-risk limit. Total cholesterol results from the sum of HDL and LDL values [3].
For HDL, a value below 40 mg/dL (1 mmol/L) for men and 50 mg/dL (1.3 mmol/L) for women is low. This increases the risk of cardiovascular problems. A normal HDL cholesterol level is between 40 and 49 mg/dL (1-1.3 mmol/L) for men. For women, it is between 50 and 59 mg/dL (1.3-1.5 mmol/L). When this level is higher than 60 mg/dL (1.6 mmol/L), there is increased protection against coronary heart disease [4].
Familial hypercholesterolemia (FH) is a dominant genetic condition with an increased risk of coronary artery disease (in untreated cases) [25]. Machine learning-based strategies can lead to the effective identification of high-risk patients to enhance FH management.
First, the authors in [26] selected three machine learning algorithms, namely a classification tree, a gradient boosting machine and a neural network, to predict the presence of FH in two different cohorts. The evaluation was based on the area under the ROC curve. The findings showed the superiority of the machine learning models over the clinical Dutch Lipid Score in predicting carriers of FH-causative mutations.
In [27], a Random Forest classifier was developed to identify potential FH patients using electronic health record (EHR) data. The model was trained on 197 known patients and 6590 without FH, achieving a positive predictive value (PPV) of 0.88 and a sensitivity of 0.75 on a hold-out test set. The accuracy of the classifier's predictions was further evaluated by a chart review of patients at risk of FH not included in the original dataset. The classifier correctly categorized 84% of patients at the highest probability threshold. Finally, the same classifier was validated on an external dataset from the Geisinger Healthcare System and achieved a PPV of 0.85.
Moreover, in [28], the authors used machine learning techniques to develop a model for predicting hypercholesterolemia from a comprehensive set of body fat mass variables, and they also studied the correlation between body fat mass and hypercholesterolemia. Using correlation-based feature subset selection and the Naive Bayes algorithm, the model obtained an area under the receiver operating characteristic curve of 0.739 and a Matthews correlation coefficient of 0.36.
A machine learning approach for the prediction of cholesterol levels via regression using non-invasive and easy-to-collect data (clinical and anthropometric) is presented in [29]. In addition, clustering analysis is carried out to identify different groups of patients sharing some characteristics and give valuable information to clinical experts for diagnosis or prognosis.
Moreover, the authors in [30] aimed to compare the performance of various machine learning models to predict the prevalence of hypercholesterolemia associated with exposure to lead, mercury and cadmium. Five machine learning models, namely Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forest, and Support Vector Machines, were constructed, and their predictive performance was compared. Finally, the Support Vector Machine model was the most accurate, and the logistic regression model had the highest area under the ROC curve of 0.718 (95% CI: 0.688-0.748).
In [31], the authors developed a high accuracy (97.45%) convolutional neural network-based Android application that determines cholesterol levels in a person's body by capturing the image of the iris. A user with high cholesterol levels has a white-greyish circle on the outer circle of the iris.
Finally, the authors in [32] constructed a dataset based on the ELSA database, aiming at the prognosis of high cholesterol (hypercholesterolemia), targeting elderly office workers. Naive Bayes, Support Vector Machines, Artificial Neural Network using two hidden layers, 5-Nearest Neighbors, Rotation Forest, Decision Trees, Logistic Model Trees and Random Forest were applied to the constructed dataset using a 10-fold cross-validation experimentation setup. The best overall performance was obtained with the Logistic Model Trees model, which performed best in terms of both accuracy and recall.

Materials and Methods
This section describes the dataset under consideration and the methodology adopted for determining the risk of being diagnosed with hypercholesterolemia.

Dataset Description
Our experimental results were based on the dataset of research work [32], which is derived from the English Longitudinal Study of Aging (ELSA) [33]. The initial feature set included 106 variables, with 61 being nominal and 45 numerical attributes, and all the participants were over 50 years old. From the features list, we excluded socioeconomic data, including the type of employment, education, income, residence type, marital status, the number of children and insurance type. Moreover, after data cleaning and feature selection, the final list of features was reduced to 13 and the number of participants to 350. All the attributes (13 as input to ML models and 1 for target class) are described as follows:
• Age (years) [34]: This feature refers to the age of a person who is over 50

Hypercholesterolemia Risk Prediction
Supervised machine learning models have become an important asset for clinicians and healthcare providers as they allow them to evaluate the long-term risk of a condition occurrence based on several risk factors. More specifically, here, our purpose is to formulate a binary classification problem with target class c = "HyperChol" (hypercholesterolemia occurrence) or c = "Non-HyperChol" (non-occurrence of the hypercholesterolemia) and design models which will achieve high recall and area under curve (AUC) to ensure that instances with hypercholesterolemia can be accurately classified.
Given an instance with an unknown class label (HyperChol, Non-HyperChol), the trained ML models will predict its class based on the features' values and thus the risk of developing hypercholesterolemia in the long term.
The proposed methodology includes some specific steps, namely, data preprocessing, feature ranking, classification models training and performance evaluation.

Data Preprocessing
Data quality is a prerequisite for the development of efficient models suitable for the correct identification of healthy instances and instances with hypercholesterolemia. Hence, to ensure data validity, several preprocessing steps are usually applied. Data quality is ensured via the application of the data cleaning methods most appropriate for the data under consideration, such as excluding unnecessary or duplicate values, correcting typos, handling missing values, data imputation, etc. [42]. As for the current dataset, we chose to remove instances whose feature values are missing or not valid (namely, out of the normal limits).
Moreover, a skewed class distribution is a factor that can lead to ML models of poor performance. For this purpose, the imbalanced distribution of participants among the HyperChol and Non-HyperChol classes was tackled by employing SMOTE [43]. SMOTE uses the 5 nearest neighbors (5-NN) of minority class instances to create synthetic data for the minority class, i.e., HyperChol, which is oversampled such that the instances in the two classes are equally distributed (i.e., 50%-50%). In Table 1, we present the minimum, maximum, mean and standard deviation of the numerical features in the balanced data.
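The oversampling step described above can be illustrated with a minimal sketch. It is not the WEKA SMOTE filter used in the study: the function names, the toy data and the restriction to numerical features are assumptions made for illustration only. Each synthetic point is placed on the segment between a minority instance and one of its k nearest minority neighbors.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two numerical feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolation (sketch only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical toy minority class with two numerical features
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.1], [1.6, 2.4]]
majority_size = 10  # hypothetical size of the majority class
new_points = smote(minority, majority_size - len(minority))
balanced_minority = minority + new_points
```

After this step, the toy minority class has as many instances as the majority class, which mirrors the 50%-50% balance described in the text.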

Features Ranking
We employed four ranking methods to understand the importance of a feature for the target class. First, we applied the Pearson correlation coefficient [44] to evaluate the strength of association between all features and especially the worth of a feature in predicting the hypercholesterolemia class. Figure 1 demonstrates the outcomes of the correlation analysis. In the correlation matrix, we observe that the highest linear correlation of 0.92 is captured between total cholesterol levels and HDL, and between total cholesterol levels and LDL. The next highest linear relationship, of 0.80, is noted between the waist and BMI features. A high association of 0.75 is shown between hypertension and systolic blood pressure, while hypertension has a low association of 0.18 with the diastolic blood pressure. A moderate positive association of 0.3 is shown between age and systolic blood pressure. However, diastolic blood pressure has a moderate negative relationship with age. Finally, the target class records a moderate association of 0.68 and 0.62 with the total cholesterol and HDL, respectively, while a low relationship seems to exist with the other features.
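As an illustration of the correlation analysis, the Pearson coefficient between two numerical features can be computed as sketched below. The waist and BMI values are hypothetical toy numbers, not taken from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values for two features
waist = [80, 92, 101, 110, 95]
bmi = [22.0, 26.5, 29.0, 33.5, 27.0]
r = pearson(waist, bmi)
print(f"Pearson r = {r:.2f}")
```

Applying the same function to every pair of columns would yield a correlation matrix of the kind summarized in Figure 1.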
Then, we applied the Information Gain method (InfoGain) [45], which evaluates the worth of a feature by measuring the information gain with respect to the class, according to the formula
$$\mathrm{InfoGain}(c, y_{ij}) = H(c) - H(c \mid y_{ij}), \quad (1)$$
where $H(c)$ and $H(c \mid y_{ij})$ are the entropy of the hypercholesterolemia class and the conditional entropy of the class given the feature $j$ value $y_{ij}$ of an instance $i$. From (1), we see that this measure captures the difference in entropy before and after the split of a feature set based on a specific $y_{ij}$. In other words, it shows the uncertainty reduction after splitting the set on a feature. The best feature for splitting is the one with the highest information gain.
Furthermore, we employed the Gain Ratio (GR) method [46], which is calculated as
$$\mathrm{GR}(y_{ij}) = \frac{H(c) - H(c \mid y_{ij})}{H(y_{ij})},$$
where $H(y_{ij}) = -\sum p_{y_{ij}} \log_2(p_{y_{ij}})$ is the entropy of feature $y_{ij}$ (with $p_{y_{ij}}$ denoting the probability of a value of feature $y_{ij}$), $H(c) = -\sum p_c \log_2(p_c)$ is the entropy of class $c$ (with $p_c$ being the probability of selecting an instance in class $c$) and $H(c \mid y_{ij})$ is the conditional entropy of class $c$ given feature $y_{ij}$. Gain ratio indicates the relevance of a feature and selects the ones that maximize the gain ratio based on the probability of each feature value.
Finally, the Random Forest classifier was selected to measure the importance of the features. Random Forest creates a forest of trees and, per tree, measures the discrimination ability of a potential feature to create the optimal split, namely the one that best separates the instances of the two classes, using the Gini impurity.
In Table 2, we summarize the features' importance in the balanced dataset concerning the hypercholesterolemia class. All considered methods show that the TotChol and HDL features are of the highest importance for the prediction of hypercholesterolemia. In addition, we observe that DiasBP is the next most important for three out of the four methods. Moreover, InfoGain and Gain ratio, due to their relationship, rank the features in the same order except for physical activity and hypertension. Since all features are among the risk factors that are utilized by the clinicians for the diagnosis and management of hypercholesterolemia, the models' training and validation will be based on all of them. Finally, given that ML models can be retrained with more data, the features' importance will be re-investigated.
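The two entropy-based rankings can be sketched for a nominal feature as follows. This is only an illustration under stated assumptions: the feature is already nominal (a numerical feature would first need discretization), and the toy feature values and labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of a list of nominal values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def conditional_entropy(feature, labels):
    """Entropy of the class labels given the value of the feature."""
    total = len(labels)
    h = 0.0
    for v in set(feature):
        subset = [c for f, c in zip(feature, labels) if f == v]
        h += (len(subset) / total) * entropy(subset)
    return h


def info_gain(feature, labels):
    """Reduction in class entropy after splitting on the feature."""
    return entropy(labels) - conditional_entropy(feature, labels)


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: one nominal feature and the class labels
hypertension = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
labels = ["HyperChol", "HyperChol", "Non-HyperChol", "Non-HyperChol",
          "HyperChol", "Non-HyperChol", "Non-HyperChol", "HyperChol"]
ig = info_gain(hypertension, labels)
gr = gain_ratio(hypertension, labels)
```

Ranking the features then amounts to sorting them by `info_gain` or `gain_ratio` in descending order.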

Data Exploration
In this subsection, after data preprocessing and features ranking, we will analyze the association between features and the HyperChol class.
Initially, in Figure 2, we present the participants' distribution according to their age group and gender. Most of the participants who have HyperChol belong to the age group 60-64, where the percentages of men and women are approximately equal. In addition, from this figure, we observe that HyperChol prevails in women who are between 50 and 59 and in men who are older than 65 years old.
In Figure 3, we present the participants' distribution in terms of the BMI and waist categories. To define the BMI categories, we used the following rules: [...] Class III: BMI ≥ 40 (severe obesity). Cut-off points for the waist size were considered 88 cm for women and 102 cm for men [47]. Based on these points, the labels F88 and M102 capture women and men with waist circumference higher than 88 and 102 cm, respectively. The label "normal" refers to women and men with waist sizes lower than 88 and 102 cm, respectively. From this figure, we see that most of the participants with HyperChol fall in the healthy, overweight and obese I categories of BMI. Moreover, when HyperChol and overweight coexist, women with a waist above the cut-off point far outnumber men. In addition, some instances have HyperChol and are overweight, but they have a normal waist size.
Furthermore, in Figure 4, we capture the coexistence of hypertension and diabetes in relation to HyperChol. We see that 25% of the participants have HyperChol, are hypertensive and have not been diagnosed with diabetes. A small percentage of 6% fulfills all criteria; 16% of the participants have HyperChol but do not suffer from hypertension or diabetes.
Figures 5 and 6 show the association of HyperChol with participants' habits in terms of alcohol consumption and physical activity. The HyperChol participants are roughly evenly distributed between the two possible values of the alcohol consumption feature. Finally, as for Figure 6, it should be noted that most of the HyperChol participants have a very low or low level of physical activity.

Machine Learning Models
Several models have been selected to evaluate their prediction performance. More specifically, Naive Bayes, Support Vector Machine, Artificial Neural Network, K-NN, Rotation Forests, Decision Trees, Logistic Model Trees, Random Forest, and Ensemble Learning (Stacking and Soft Voting) classification methods will be presented. We assume that each instance $i$ in the dataset is represented by a features vector $y_i = [y_{i1}, y_{i2}, y_{i3}, \ldots, y_{in}]^T$, where $n$ is the number of the features.

Naive Bayes
Naive Bayes (NB) classifies an instance $y_i$ into the class $c$ for which $P(c \mid y_{i1}, \ldots, y_{in})$ is maximized (under the assumption that the features are conditionally independent given the class [48]). The conditional probability is defined as
$$P(c \mid y_{i1}, \ldots, y_{in}) = \frac{P(y_{i1}, \ldots, y_{in} \mid c)\, P(c)}{P(y_{i1}, \ldots, y_{in})},$$
where $P(y_{i1}, \ldots, y_{in} \mid c) = \prod_{j=1}^{n} P(y_{ij} \mid c)$ is the features probability given the class and $P(y_{i1}, \ldots, y_{in})$, $P(c)$ are the prior probability of features and class, respectively. The estimated class is derived by maximizing $P(c) \prod_{j=1}^{n} P(y_{ij} \mid c)$, where $c \in \{\text{HyperChol}, \text{Non-HyperChol}\}$.
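The maximization of the class prior times the per-feature likelihoods can be sketched for nominal features as follows. The add-one (Laplace) smoothing, the use of log-probabilities, the function names and the toy data are assumptions made for illustration and are not stated in the text.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(class, feature index)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(c, j)][v] += 1
    # number of distinct values per feature, needed for the smoothing term
    n_values = [len({row[j] for row in rows}) for j in range(len(rows[0]))]
    return class_counts, value_counts, n_values


def predict_nb(model, row):
    """Return the class maximizing log P(c) + sum_j log P(y_j | c)."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = log(n_c / total)
        for j, v in enumerate(row):
            # add-one smoothing avoids zero probabilities (assumption)
            score += log((value_counts[(c, j)][v] + 1) / (n_c + n_values[j]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: (hypertension, physical activity)
rows = [("yes", "low"), ("yes", "low"), ("yes", "high"),
        ("no", "high"), ("no", "high"), ("no", "low")]
labels = ["HyperChol", "HyperChol", "HyperChol",
          "Non-HyperChol", "Non-HyperChol", "Non-HyperChol"]
model = train_nb(rows, labels)
prediction = predict_nb(model, ("yes", "low"))
```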

K-Nearest Neighbors
K-Nearest Neighbors (K-NN) is a non-parametric lazy learning classifier which measures the distance (e.g., the Euclidean or Manhattan distance) between the test instance and every instance in the training dataset [49]. It then determines the K training instances that are closest to the test instance, which is finally assigned to the class that most of its K neighbors belong to.
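A minimal sketch of this procedure with the Euclidean distance is given below. The function name and the toy points are hypothetical, and K = 3 is chosen only as an example value.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Assign the query to the majority class among its k nearest neighbors."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set with two numerical features
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["HyperChol", "HyperChol", "HyperChol",
           "Non-HyperChol", "Non-HyperChol", "Non-HyperChol"]
label = knn_predict(train_X, train_y, (1.5, 1.5), k=3)
```

Because the classifier is lazy, no model is fitted in advance; all distance computations happen at prediction time.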

Logistic Regression
Logistic Regression (LR) [50] is a supervised classifier for binary and multinomial tasks. It uses a logistic (sigmoid) function to model the dependent output variable. Here, the model output is dichotomous, i.e., with two possible classes, in which $p$ captures the probability of an instance belonging to the HyperChol class; thus, $1 - p$ is the probability of an instance belonging to the Non-HyperChol class. The relationship of the log-odds with base $b$ and model parameters $\beta_j$ is written as:
$$\log_b \frac{p}{1 - p} = \beta_0 + \sum_{j=1}^{n} \beta_j y_{ij}.$$
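The link between the log-odds and the class probability can be illustrated with a short sketch. The natural logarithm is assumed as the base, and the coefficient and feature values are hypothetical.

```python
from math import exp, log


def predict_proba(beta0, betas, features):
    """Probability of the HyperChol class from a linear log-odds model."""
    # log-odds (natural base): beta0 + sum_j beta_j * y_j
    z = beta0 + sum(b * y for b, y in zip(betas, features))
    return 1.0 / (1.0 + exp(-z))  # sigmoid maps log-odds to a probability


# Hypothetical coefficients and feature values
p = predict_proba(-1.0, [0.5, 2.0], [2.0, 0.5])
log_odds = log(p / (1.0 - p))  # recovers the linear predictor
```

An instance would be assigned to HyperChol when `p` exceeds a chosen threshold, commonly 0.5.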

Rotation Forest
The Rotation Forest (RotF) [51] applies a rotation transformation matrix to the training data before the training of each decision tree to increase the diversity of the individual decision trees. A feature reduction technique creates a new feature set for every classifier in the ensemble. It randomly splits the feature set into subsets and applies principal component analysis (PCA) to every created subset separately. Then, a new feature set is acquired by combining the principal components of each subset. In this study, the base classifier for the RotF is a J48 [52] decision tree.

Artificial Neural Network
Multilayer Perceptron (MLP) is the simplest fully connected feed-forward Neural Network. It consists of input and output layers and at least one hidden layer. Its neurons are trained by employing back-propagation learning which allows for classification into multiple labels. The MLP is able to learn non-linear models and execute online learning. Finally, it can use any arbitrary activation function [53].

Support Vector Machine
Support Vector Machine (SVM) [54] finds the hyperplane that can optimally separate instances into two classes. The most characteristic kernel functions are linear, polynomial, radial basis and quadratic. An unlabeled instance $y'$ can be optimally classified based on the function
$$f(y') = \mathrm{sign}\left(\sum_{i=1}^{M} \alpha_i c_i K(y_i, y') + b\right),$$
where $M$ is the number of training instances, $y_i$, $c_i$ are the training instance feature vector and its class label, respectively, $\alpha_i$ are the learned weights (Lagrange multipliers) of the training instances, $b$ is a bias, $c_i \in \{1, -1\}$, $K(y_i, y')$ is the kernel function which maps the input vectors into an expanded feature space and $f(y') \in \{-1, +1\}$ is the kernelized binary classifier's predicted class for the unlabeled instance $y'$.
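A kernelized decision function of this kind is commonly evaluated as a weighted sum of kernel values over the training instances plus the bias. The sketch below assumes a radial basis kernel; the per-instance weights `alphas`, the support vectors and all numerical values are hypothetical and would normally be produced by training.

```python
from math import exp


def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel between two feature vectors."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_predict(support_vectors, labels, alphas, bias, query, kernel=rbf_kernel):
    """Sign of the weighted kernel sum; labels are +1 or -1."""
    score = sum(
        a * c * kernel(sv, query)
        for sv, c, a in zip(support_vectors, labels, alphas)
    ) + bias
    return 1 if score >= 0 else -1


# Hypothetical trained quantities (assumed, not learned here)
support_vectors = [(1.0, 1.0), (5.0, 5.0)]
labels = [1, -1]      # +1 = HyperChol, -1 = Non-HyperChol (assumed encoding)
alphas = [1.0, 1.0]   # assumed per-instance weights
bias = 0.0
predicted = svm_predict(support_vectors, labels, alphas, bias, (1.2, 0.9))
```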

Decision Tree
From the available Decision Trees, we considered the Reduced Error Pruning Tree (RepTree) [55]. It is a simple and fast decision tree learner which builds a decision/regression tree using information gain as the splitting criterion and prunes it using reduced-error pruning. RepTree is even more accurate when dealing with a large volume of data.

Logistic Model Tree
A Logistic Model Tree (LMT) [56] consists of a standard decision tree structure with logistic regression functions $f(y_i) = \beta_0 + \sum_{j=1}^{n} \beta_j y_{ij}$ at the leaves. LMT grows the tree using the LogitBoost algorithm, and the tree pruning is performed using Classification And Regression Tree (CART).

Random Forest
Random Forest (RF) is an ensemble technique that trains multiple decision trees on bootstrap samples of the data and achieves high-performance outcomes in both regression and classification tasks. It considers the Information Gain or Gini index to select the splitting features, and then classifies an instance by applying majority voting on the results of the multiple Decision Trees [57].

Ensemble Learning
Ensemble Learning is a machine learning method that combines the outcomes of several single classifiers called base models. Voting and Stacking are two common approaches which are utilized to acquire more accurate predictions than the single models'. Concerning Voting, there are two types, Soft and Hard. Soft Voting, which is exploited in this study, averages the corresponding probabilities of the considered base classifiers and assigns a test instance to the class with the highest probability [58]. On the other hand, Stacking uses the predicted class labels of the base models as input features to train a meta-classifier which undertakes to find the class label [59]. Figure 7 illustrates the two schemes as they will be evaluated in the experiments.
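The Soft Voting rule can be sketched as follows. The probability values of the two base models are hypothetical and serve only to show how averaging probabilities can resolve a disagreement that label-based (Hard) voting would leave as a tie.

```python
def soft_vote(probabilities):
    """Average class probabilities over base models and pick the top class.

    probabilities: one dict {class: probability} per base model.
    """
    classes = probabilities[0].keys()
    averaged = {
        c: sum(p[c] for p in probabilities) / len(probabilities)
        for c in classes
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of two base models for one test instance
rotf = {"HyperChol": 0.9, "Non-HyperChol": 0.1}
rf = {"HyperChol": 0.4, "Non-HyperChol": 0.6}
label, averaged = soft_vote([rotf, rf])
```

Here the two base models predict different labels, yet the averaged probability still favors HyperChol because the first model is much more confident.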

Evaluation Metrics
To assess the ML models' performance, accuracy, precision, recall, F-measure, and AUC metrics were considered [60].
Precision (or positive predictive value) shows the ratio of true positive instances to all instances predicted as positive (true and false positives). We also considered recall, which captures the true positive rate or a model's sensitivity, namely the ratio of participants who actually had HyperChol and were correctly considered as positive to all positive participants. Precision is a measure of quality, while recall is a measure of quantity. F-measure is the harmonic mean of precision and recall and allows the evaluation of a model using a single score. Moreover, we computed accuracy, which shows a model's ability to correctly identify both HyperChol (positive) and Non-HyperChol (negative) instances. Taking into account the confusion matrix, TP, TN, FP and FN denote the number of true positive, true negative, false positive and false negative instances. Based on these quantities, the performance metrics are written as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \text{F-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
Another important metric which will be taken into consideration for the models' evaluation is the area under curve (AUC), which takes values in the range [0, 1]. AUC is a measure of separability: it captures the ML models' performance in distinguishing HyperChol from Non-HyperChol instances. An AUC equal to one means that the model perfectly discriminates between the distributions of the two classes.
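The metrics can be computed from the confusion matrix counts as sketched below. The counts and scores are hypothetical, and the AUC is computed here through its rank interpretation (the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one), which is one of several equivalent ways to obtain it.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


def auc(scores_pos, scores_neg):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical confusion matrix counts and predicted scores
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
area = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```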

Experiments Setup
In this section, the ML models' performance is evaluated in the WEKA 3.8.6 environment [61]. WEKA is a free Java-based data mining tool created and distributed under the GNU General Public License. It provides a library of various models for data preprocessing, classification, clustering, forecasting, visualization, etc. The computing system in which the experiments were conducted has the following characteristics: Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz (2.59 GHz), 16 GB memory, Windows 10 Home, 64-bit operating system, x64-based processor. For our experiments, 10-fold cross-validation was applied to measure the models' efficiency on the balanced dataset of 388 instances. In Table 3, the settings of the considered ML models are shown.
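The 10-fold cross-validation protocol can be illustrated with a plain index split. This sketch is not the WEKA implementation: it shuffles with an arbitrary seed and omits any stratification by class, and the function name is hypothetical. Only the number of folds and the number of instances (388) are taken from the text.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # every k-th index goes to the same fold, so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(388, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each model would be trained on the `train` indices and evaluated on the `test` indices of every split, with the reported metrics aggregated over the ten folds.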

Evaluation
To fully evaluate the effectiveness of models, we should examine both precision and recall since the improvement of precision typically reduces recall and vice versa. However, if the classes' distribution is uniform, these metrics may achieve the same outcomes.
In Table 4, we show the performance of the various models, obtained after the application of data cleaning and class balancing. Selecting the two best-performing single classifiers, we combined them under two schemes, Soft Voting and Stacking. All models' accuracy is higher than 86% except for the 3-NN model, which still achieved an acceptable accuracy of 70.62%. In addition, precision, AUC, recall and F-measure demonstrate the same promising outcomes as the accuracy. Soft Voting (SoV) is the model with consistently high efficiency in all metrics. Focusing on AUC, which aggregates the classification performance of a model, the SoV model is able to distinguish between the HyperChol class and the Non-HyperChol class with a probability of 94.5%. In addition, we see that combining Rotation Forest with Random Forest improved on the performance of the individual models, with a larger gain over RF than over RotF. Stacking is the second model with excellent separation performance, although its AUC is 8%, 6% and 2% lower than the ones of SoV, RF and RotF, respectively. SoV's performance superiority lies in the fact that the base models have been configured to predict probabilities instead of class labels. In the case of Stacking, class labels may add higher uncertainty to the predictions which are provided to the meta-model to learn how to find the best combination of them.
In Table 5, we capture recall and accuracy metrics, comparing only single classifiers before [32] and after data cleaning and a further reduction of the number of features. A significant performance improvement is observed, revealing the role of data quality, class balancing and dimensionality in the classifiers' performance.
In addition, we observe that our proposed models are superior to the work [32] in terms of recall and accuracy by at least 10%. The only case where we can see similar performance is in the recall of the 5-NN. It should be mentioned that the 3-NN (Table 4) showed better accuracy and recall than the 5-NN with a percentage gap of about 3%.
In conclusion, Stacking outperforms most of the individual models and comes close to the best single model, RotF. Soft Voting, which is a probabilistic scheme, achieves higher performance than Stacking since, in the meta-level of the latter, the predicted classes are combined using a logistic regression classifier to achieve an output that is a simple linear combination of the predictions of the sub-models. In either case, ensemble schemes indicated promising efficiency compared to the rest of the classifiers.

Conclusions
In this study, we exploited supervised learning to develop models for the identification of individuals at risk for hypercholesterolemia manifestation based on several risk factors. Healthcare professionals and clinical experts can benefit from such models to prevent the severe consequences of hypercholesterolemia, such as cardiovascular disease. Data exploration through risk factors analysis can help identify associations among the features and HyperChol. A critical aspect of ML models is that they allow medical experts to regularly reassess the associated risk and give proper guidelines and interventions for its management and treatment or prevent its occurrence.
Performance analysis revealed that data preprocessing is an important step for the design of efficient and accurate models for hypercholesterolemia occurrence. The experimental results showed that Soft Voting, having Random Forest and Rotation Forest as base classifiers, prevailed with an AUC of 94.5%, precision of 92%, recall of 91.8%, F-measure of 91.7% and an accuracy equal to 91.75%. Hence, it constitutes a candidate HyperChol risk prediction model.
In future work, we aim to extend the ML framework with deep learning methods by applying the Long Short-Term Memory (LSTM) algorithm and Convolutional Neural Networks (CNN) to the same data and comparing the results in terms of accuracy.
Author Contributions: E.D. and M.T. conceived of the idea, designed and performed the experiments, analyzed the results, drafted the initial manuscript and revised the final manuscript. All authors have read and agreed to the published version of the manuscript.