A Novel Approach Utilizing Bagging, Histogram Gradient Boosting, and Advanced Feature Selection for Predicting the Onset of Cardiovascular Diseases

Fitriyani, Norma Latif; Syafrudin, Muhammad; Chamidah, Nur; Rifada, Marisa; Susilo, Hendri; Aydin, Dursun; Qolbiyani, Syifa Latif; Lee, Seung Won

doi:10.3390/math13132194

by Norma Latif Fitriyani ¹,†, Muhammad Syafrudin ¹,†, Nur Chamidah ²,³,*, Marisa Rifada ²,³, Hendri Susilo ⁴, Dursun Aydin ⁵,⁶, Syifa Latif Qolbiyani ⁷ and Seung Won Lee ⁸,⁹,¹⁰,¹¹,*

¹ Department of Artificial Intelligence and Data Science, Sejong University, Seoul 05006, Republic of Korea
² Department of Mathematics, Faculty of Science and Technology, Airlangga University, Surabaya 60115, Indonesia
³ Research Group of Statistical Modeling in Life Science, Faculty of Science and Technology, Airlangga University, Surabaya 60115, Indonesia
⁴ Department of Cardiology and Vascular Medicine, Faculty of Medicine, Airlangga University, Surabaya 60286, Indonesia
⁵ Department of Statistics, Faculty of Science, Muğla Sıtkı Koçman University, Muğla 48000, Turkey
⁶ Department of Mathematics, University of Wisconsin, Oshkosh Algoma Blvd, Oshkosh, WI 54901, USA
⁷ Department of Community Development, Universitas Sebelas Maret, Surakarta 57126, Indonesia
⁸ Department of Precision Medicine, Sungkyunkwan University School of Medicine, Suwon 16419, Republic of Korea
⁹ Department of Metabiohealth, Sungkyunkwan University, Suwon 16419, Republic of Korea
¹⁰ Personalized Cancer Immunotherapy Research Center, Sungkyunkwan University School of Medicine, Suwon 16419, Republic of Korea
¹¹ Department of Artificial Intelligence, Sungkyunkwan University, Suwon 16419, Republic of Korea

* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.

Mathematics 2025, 13(13), 2194; https://doi.org/10.3390/math13132194

Submission received: 7 May 2025 / Revised: 31 May 2025 / Accepted: 2 July 2025 / Published: 4 July 2025

(This article belongs to the Special Issue Application of Artificial Intelligence in Decision Making)

Abstract

Cardiovascular diseases (CVDs) rank among the leading global causes of mortality, underscoring the necessity for early detection and effective management. This research presents a novel prediction model for CVDs utilizing a bagging algorithm that incorporates histogram gradient boosting as the estimator. This study leverages three preprocessed cardiovascular datasets, employing the Local Outlier Factor technique for outlier removal and the information gain method for feature selection. Through rigorous experimentation, the proposed model demonstrates superior performance compared to conventional machine learning approaches, such as Logistic Regression, Support Vector Classification, Gaussian Naïve Bayes, Multi-Layer Perceptron, k-nearest neighbors, Random Forest, AdaBoost, gradient boosting, and histogram gradient boosting. Evaluation metrics, including precision, recall, F1 score, accuracy, and AUC, yielded impressive results: 93.90%, 98.83%, 96.30%, 96.25%, and 0.9916 for dataset I; 94.17%, 99.05%, 96.54%, 96.48%, and 0.9931 for dataset II; and 89.81%, 82.40%, 85.91%, 86.66%, and 0.9274 for dataset III. The findings indicate that the proposed prediction model has the potential to facilitate early CVD detection, thereby enhancing preventive strategies and improving patient outcomes.

Keywords: machine learning; bagging algorithm; histogram gradient boosting; local outlier factor; information gain

MSC: 62-07; 62H30; 62P10; 92C50

1. Introduction

Cardiovascular diseases (CVDs), which encompass a range of illnesses that affect the heart and blood vessels, represent a significant global health challenge. As highlighted by the World Health Organization (WHO) and the American Heart Association, CVDs are the leading cause of death worldwide, with approximately 17.9 million fatalities reported in 2019 and an increase to 20.5 million in 2021 [1,2]. Furthermore, the WHO reported in 2024 that CVDs are responsible for over 42.5% of all deaths in the European region annually, making them a primary contributor to both disability and premature mortality, with an alarming average of around 10,000 fatalities each day [3]. The global impact of CVDs extends to over half a billion individuals.

Despite their prevalence, most CVDs can be prevented through appropriate measures [4]. Early detection of CVDs is paramount, especially in initiating medication and counseling-based therapies [4]. While traditional medical interventions often focus on addressing environmental and behavioral risk factors [4], there is a compelling need for innovative approaches to enhance prevention efforts. Among these, the application of machine learning (ML) algorithms has emerged as a promising solution for predicting CVDs, particularly through predictive models tailored to individual patient data [5,6,7,8,9,10].

Prior studies have demonstrated the effectiveness of machine learning models in clinical decision-making, with gradient boosting techniques—including original gradient boosting (GB), Extreme GB, and Histogram GB (HGB)—identified as robust methods for CVD prediction [8,11,12,13,14,15]. In addition to these techniques, the bagging algorithm has also shown promise as a precise tool for disease prediction [16,17,18]. However, it is noteworthy that applying the bagging algorithm, particularly with HGB as an estimator, remains relatively unexplored in the context of CVD prediction. Our work seeks to address this gap by proposing a CVD prediction model utilizing the bagging algorithm in conjunction with HGB as the estimator.

To create a reliable bagging and HGB-based CVD prediction model, employing several advanced techniques that tackle common challenges encountered in machine learning is essential. These challenges include the existence of outliers, the presence of unnecessary features, the multicollinearity problem, the risk of overfitting, and non-optimal hyperparameters. To effectively address these issues, this study incorporates a combination of the Local Outlier Factor (LOF) for outlier detection, information gain (IG) for feature selection, stratified k-fold cross validation (CV) for model validation, GridSearchCV for hyperparameter tuning, and Pearson’s correlation method for multicollinearity analysis. Previous research has underscored the significant improvements in prediction model performance achieved through the implementation of LOF for outlier detection [19,20,21] and IG for feature selection (FS) [22,23,24,25,26]. Moreover, the advantages of hyperparameter tuning with GridSearchCV [27,28,29,30,31] and validation using stratified k-fold CV [32,33,34,35] have further enhanced model reliability.

Therefore, this study aims to propose a novel predictive model for CVDs, employing a synergistic approach that integrates the bagging algorithm, HGB, LOF, IG, stratified k-fold CV, and GridSearchCV. To summarize, our work provides the following threefold contributions:

Development of an Innovative Prediction Model. We propose a CVD prediction model using the bagging algorithm with histogram gradient boosting as the estimator.
Implementation of Advanced Techniques. This study utilizes LOF for outlier detection and IG for FS, combined with GridSearchCV and stratified k-fold CV for optimal model performance.
Filling Existing Research Gaps. This research addresses the underutilization of the bagging algorithm, particularly with HGB, in the context of CVD prediction, thereby providing a novel approach to enhancing early detection and prevention efforts.

The proposed study details the methodologies and data sources used in developing the prediction model in Section 2, where the algorithms and their implementations are thoroughly outlined. Section 3 comprehensively discusses the study’s results, highlighting the proposed model’s efficacy and potential impact. Finally, Section 4 presents the implications of the findings and recommendations for future research.

2. Materials and Methods

2.1. Data Sources

In this research, we employed three publicly available clinical datasets to predict CVDs. Below is a comprehensive overview of each dataset utilized in our study:

Dataset I: The first data source is the CVD dataset sourced from Kaggle [36]. This dataset encompasses 70,000 samples, consisting of 34,979 positive and 35,021 negative instances. It includes 11 features: “age, height, weight, gender, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking habits, alcohol consumption, and physical activity”. For our analysis, we converted the age from days to years. Notably, the dataset contains no missing values, and all features are considered significant risk factors for CVDs.
Dataset II: The second dataset, an updated version of dataset I, presents two primary distinctions. First, it includes the body mass index (BMI) feature calculated from weight and height attributes. Consequently, the original weight and height features were excluded after deriving the BMI. Therefore, dataset II comprises 10 features: “age, BMI, gender, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking habits, alcohol consumption, and physical activity”.
Dataset III: The third dataset is the Cardiovascular Study Dataset, also procured from Kaggle [37]. This dataset contains 3390 samples, with 511 classified as positive cases and 2879 as negative. Due to the significant imbalance in sample sizes, we first eliminated missing values and balanced the dataset by randomly selecting additional positive samples from dataset I. This balancing process resulted in a final distribution of 2600 positive and 2600 negative samples. We ensured uniformity in feature values by converting similar units. Ultimately, dataset III comprises eight features, “age, sex, smoking status, total cholesterol, systolic blood pressure, diastolic blood pressure, BMI, and glucose”, with no missing values present.

Table 1 (for datasets I and II) and Table 2 (for dataset III) summarize the characteristics of each dataset.

2.2. Proposed Method

To achieve the objective of our work, we implemented a systematic approach that employed various techniques, resulting in a high-performance prediction model for CVDs based on risk factor data. As displayed in Figure 1, the initial step involved collecting publicly available secondary datasets. Next, we eliminated missing values and redundant data. Once the new dataset was prepared, we identified outliers using the LOF technique and removed them. Concurrently, we performed hyperparameter tuning using GridSearchCV to determine the optimal parameters for subsequent processes.

Following this, we executed FS based on the gain score via the IG technique, ensuring we selected the same number of features across all datasets to minimize training times and enhance model performance. In the fifth step, we employed stratified k-fold CV, with k = 10, to mitigate issues related to imbalanced datasets, which can lead to biased model predictions. Through this process, the training data was utilized with various ML models, including Gaussian Naïve Bayes (GNB), Logistic Regression (LR), GB, HGB, Random Forest (RF), AdaBoost, Support Vector Classifier (SVC), k-nearest neighbors (KNNs), Multi-layer Perceptron (MLP), and Extra Trees. We selected the best classifier as the estimator for the bagging method and retrained the model with the training data.

After completing the training phase, we performed additional hyperparameter tuning on both the HGB and the bagging classifier. Utilizing the most accurate hyperparameters identified, we retrained the model and evaluated its performance using several metrics, such as accuracy, recall, precision, F1 score, and ROC-AUC metrics. We also compared these results with existing ML models. Finally, since the datasets used in our work are publicly available and have been employed by previous researchers, we compared performance with prior study results. The detailed development steps of the proposed CVD model are illustrated in Figure 1.

2.3. Redundancy Data Elimination Technique

In machine learning, removing redundant data is an essential step for cutting computational cost and improving model performance. By eliminating superfluous records, models learn more efficiently from the pertinent data, which improves their predictive power and generalization.

In this study, we utilized the drop_duplicates function provided by the Pandas library [38] to remove redundant data from the datasets. Duplicate data are identified and removed as follows:

Check all columns for duplicates: the function iterates through each row of the dataset, comparing its values to every other row, considering all columns simultaneously.
Define duplicates: if a row’s values are identical to another row’s values in all columns, that row is considered a duplicate.
Remove duplicates: only the first occurrence of a unique row is kept, and all subsequent duplicates are removed from the dataset.
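
The three steps above can be sketched with Pandas as follows. This is a minimal illustration, not the study's actual script: the column names and values are made up for the example, and only the `drop_duplicates` call itself is taken from the text.

```python
import pandas as pd

# Toy records; column names and values are illustrative assumptions,
# not rows from the study's datasets.
df = pd.DataFrame({
    "age": [50, 61, 50, 47],
    "ap_hi": [120, 140, 120, 130],
    "cardio": [0, 1, 0, 1],
})

# A row counts as a duplicate only if it matches an earlier row in every column.
n_redundant = int(df.duplicated().sum())

# keep="first" (the default) retains the first occurrence and drops later copies.
deduplicated = df.drop_duplicates(keep="first").reset_index(drop=True)

print(n_redundant, len(deduplicated))
```

Counting with `duplicated()` before dropping gives the number of redundant samples, which is the kind of figure a per-dataset summary table would report.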

2.4. Local Outlier Factor-Based Outlier Removal

Outlier elimination is a crucial step in the model development process. The choice of outlier detection method can significantly impact the model’s performance. To identify outliers within the datasets, we employed the robust LOF technique in the context of our proposed CVD prediction model. LOF is an unsupervised ML algorithm that detects outliers in relation to local neighborhoods rather than relying on the entire data distribution [39]. This density-based method utilizes nearest neighbor searches to recognize anomalous points.

One benefit of the LOF method is its capacity to pinpoint outlier points in relation to nearby clusters of data. In applying LOF, the neighbors of specific data points are determined, and their density is compared to that of surrounding points. In implementing an LOF model, the following steps are recommended:

Measure the distance from point P to all specified points utilizing a distance metric, either Euclidean or Manhattan.
Determine the k-distance of P, i.e., the distance to its k-th nearest neighbor; for k = 3, this is the distance to the third nearest neighbor.
Identify the k nearest points.
Compute the local reachability density using the following formula:

$\mathrm{lrd}_{k}(O) = \frac{\|N_{k}(O)\|}{\sum_{O' \in N_{k}(O)} \mathrm{reachdist}_{k}(O' \leftarrow O)}$

(1)

where $\mathrm{reachdist}$ (“reachable distance”) is defined as follows:

$\mathrm{reachdist}_{k}(O' \leftarrow O) = \max\{\mathrm{dist}_{k}(O'), \mathrm{dist}(O, O')\}$

(2)

where $N_{k}(O)$ is the set of the k nearest neighbors of $O$, $\|N_{k}(O)\|$ indicates the quantity of neighbors, and $\mathrm{dist}_{k}(O')$ is the distance from $O'$ to its k-th nearest neighbor.
Determine the LOF as follows:

$\mathrm{LOF}_{k}(O) = \frac{\sum_{O' \in N_{k}(O)} \frac{\mathrm{lrd}_{k}(O')}{\mathrm{lrd}_{k}(O)}}{\|N_{k}(O)\|}$

(3)

LOF excels at detecting local outliers because it identifies points whose density is significantly lower than that of their neighbors. This makes it robust to noise and suitable for large datasets with complex structures, such as the data used in this study. Its ability to handle high-dimensional data and its computational efficiency further motivated its selection over other techniques.
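
As a rough illustration of LOF-based outlier removal, the sketch below uses scikit-learn's `LocalOutlierFactor` on synthetic two-dimensional data. The data are an assumption made for the example. The settings `n_neighbors=34` and `contamination="auto"` are the values reported later in this article for LOF, and the rest of the pipeline is not reproduced here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-in for clinical features: one dense cluster plus three
# isolated points (illustrative data, not the study's datasets).
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([inliers, far_points])

lof = LocalOutlierFactor(n_neighbors=34, contamination="auto")
labels = lof.fit_predict(X)              # -1 marks an outlier, 1 an inlier
scores = -lof.negative_outlier_factor_   # LOF score; values near 1 are typical

# Keep only the rows that LOF labels as inliers.
X_clean = X[labels == 1]
print(len(X), len(X_clean))
```

Points whose local density is far lower than that of their neighbors receive a score well above 1 and are dropped, which mirrors the ratio in Equation (3).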

2.5. Information Gain-Based Feature Selection

In addition to eliminating outliers, FS is a crucial step for improving prediction models’ performance while reducing the time required for the training process, thus enhancing efficiency [23]. In this study, we applied IG to select the most relevant features by calculating the gain score for each one. IG is an FS method that serves as a causal indicator, measuring the difference in entropy between the activities of two neurons [40]. It is calculated by subtracting the conditional entropy of the target neuron’s activity given the source neuron’s activity from the entropy of the target neuron’s activity. The process for FS using IG is outlined as follows:

Compute the entropy value of the dataset.

$\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_{i} \log_{2} p_{i}$

(4)

where $S$ represents the dataset, $n$ is the number of classes, and $p_{i}$ is the probability of an instance belonging to class i.
Compute the information gain value of each feature.

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_{v}|}{|S|} \mathrm{Entropy}(S_{v})$

(5)

where A represents the feature or attribute, v is a specific value of the attribute A, $S_{v}$ is the subset of the dataset S where the attribute A has the specific value v, and $|\cdot|$ denotes the number of instances in a set.

The IG formulation can handle both continuous and discrete data, and it tends to produce more homogeneous groups as the tree deepens. These strengths can lead to a more stable and robust decision tree.
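
Equations (4) and (5) can be checked on a small example. The sketch below is a plain NumPy illustration under the assumption that the feature is already discrete (a continuous feature would first need to be binned). The arrays `y`, `smoker`, and `noise` are invented toy data, and the helper names are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a label array, as in Equation (4)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    weighted = 0.0
    for v in np.unique(feature):
        mask = feature == v
        # mask.mean() is |S_v| / |S|
        weighted += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted

# Toy data (illustrative only): 4 positive and 4 negative cases.
y      = np.array([1, 1, 1, 0, 0, 0, 1, 0])
smoker = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # partly aligned with y
noise  = np.array([0, 1, 0, 1, 1, 0, 1, 0])  # unrelated to y

gain_smoker = information_gain(smoker, y)
gain_noise = information_gain(noise, y)
print(round(gain_smoker, 4), round(gain_noise, 4))
```

Ranking features by this gain score and keeping the top-scoring ones is the selection rule described above.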

2.6. Histogram Gradient Boosting

The HGB classification tree, also known as the HGB algorithm, is an ML technique that utilizes histograms to train GB models efficiently. The term “GB” refers to a subset of ML algorithms that utilize decision trees as base learners, using classification trees for pattern categorization tasks and regression trees for function approximation. This model performs effectively on large and complex datasets, offering advantages over traditional GB methods, such as faster training times and enhanced scalability [41]. As illustrated in Figure 2, each successive tree in the ensemble relies on its predecessors, as the latter tree aims to reduce the misclassification cases made by the former. This approach consistently minimizes the loss function, reflecting the ensemble’s overall misclassification score during the training step (see Figure 3). GB is recognized for its ability to converge globally, which is accomplished by adhering to the path of the negative gradient [42,43,44]. After training, a robust ensemble can be formed from a relatively weak base classifier [45]. Ultimately, the predictions for individual data samples are derived from the aggregate of the results produced by each decision tree (DT).

Let $\{x_{i}, t_{i}\}_{i=0}^{N_{D}-1}$ be the dataset that was gathered. For binary classification, we used binary cross-entropy $L(\cdot)$ as the loss function. Here, $x_{i}$ represents the i-th row of data, $t_{i}$ is the true label (“ground-truth”) linked to $x_{i}$, and $h(x)$ represents a base learner. The model’s initial predicted result is given by [44,46]:

$F_{0}(x) = \underset{\beta}{\arg\min} \sum_{i=0}^{N_{D}-1} L(t_{i}, \beta)$

(6)

where $N_{D}$ represents the number of instances in the dataset, and $\beta$ is the initial weight of the model.

The binary cross-entropy loss function is expressed in the following way:

$L = -\sum_{k=0}^{CN-1} t_{k} \log y_{k}$

(7)

where CN represents the number of classes, $t_{k}$ is the true label (“ground-truth”) for class k, and $y_{k}$ is the model’s predicted output for class k.

The direction of the gradient related to the classification error at the m-th iteration (with m ranging from 1 to M, where M represents the total number of iterations) can be determined as follows:

$y_{i}^{m} = -\left[\frac{\partial L(t_{i}, F(x_{i}))}{\partial F(x_{i})}\right]_{F(x) = F_{m-1}(x)}$

(8)

A DT is then fitted to these values on the training set. The parameters of the base model are obtained using the least squares approach:

$a_{m} = \underset{a, \beta}{\arg\min} \sum_{i=0}^{N_{D}-1} \left[y_{i}^{m} - \beta\, h(x_{i}, a)\right]^{2}$

(9)

Then, the weight is updated as follows:

$\beta_{m} = \underset{\beta}{\arg\min} \sum_{i=0}^{N_{D}-1} L\left(t_{i}, F_{m-1}(x_{i}) + \beta\, h(x_{i}, a_{m})\right)$

(10)

The entire committee is updated as follows:

$F_{m}(x) = F_{m-1}(x) + \beta_{m}\, h(x, a_{m})$

(11)
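
The update loop in Equations (6) and (8)–(11) can be sketched in a few lines. This is a simplified illustration rather than the HGB implementation used in the study: it assumes synthetic data, ordinary (non-histogram) regression trees from scikit-learn, and a fixed `learning_rate` in place of the line search for $\beta_{m}$ in Equation (10).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(t, F):
    """Mean binary cross-entropy of labels t given raw scores F (log-odds)."""
    p = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    return float(-np.mean(t * np.log(p) + (1 - t) * np.log(1 - p)))

# Synthetic binary classification data (assumption for the example).
X, t = make_classification(n_samples=500, n_features=8, random_state=0)

# F_0: the constant that minimizes the loss, i.e., the log-odds of the positive class.
p0 = t.mean()
F = np.full(len(t), np.log(p0 / (1 - p0)))
losses = [log_loss(t, F)]

learning_rate = 0.1  # fixed shrinkage; stands in for the line search on beta_m
trees = []
for m in range(1, 51):
    residuals = t - sigmoid(F)               # negative gradient of the loss w.r.t. F
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                   # least-squares fit of the base learner
    F = F + learning_rate * tree.predict(X)  # F_m = F_{m-1} + step * h(x, a_m)
    trees.append(tree)
    losses.append(log_loss(t, F))

train_accuracy = float(np.mean((sigmoid(F) > 0.5) == t))
print(round(losses[0], 4), round(losses[-1], 4), round(train_accuracy, 3))
```

Each round fits a tree to the current pseudo-residuals and adds a damped copy of it to the ensemble, so the training loss falls from one round to the next.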

For training models based on GB, the histogram-based algorithm (refer to Figure 4) serves as an effective method. This technique discretizes the continuous feature ranges of DTs into small bins. A maximum of 255 bins can be set [47] and subsequently employed to create histograms that represent the distribution of feature values. Fundamental statistics, including the cumulative count of gradients within each bin and the total number of data instances, can be calculated. Analyzing these statistics aids in determining the most effective split points for training the base learners. Since the training step of a DT does not require a comprehensive scan of the entire feature range for evaluating split points, the histogram-based approach significantly reduces computational costs [48]. Furthermore, due to its resilience to noise, this method can enhance generalization performance [46]. Given these advantages in learning efficacy and computational efficiency, the histogram-based technique was implemented in this work to create the classification tree structure. All these advantages lead HGB to be a well-suited model for algorithms that require a base model for training (i.e., ensemble learning algorithms), such as the proposed model in this study.
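
The binning idea can be illustrated with NumPy alone. In this sketch the bin edges are taken at quantiles of the feature, which is an assumption made for the example; the feature values and gradients are random stand-ins, and only the limit of 255 bins comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)    # one continuous feature (synthetic)
gradients = rng.normal(size=1000)  # stand-in per-sample gradients

max_bins = 255
# 254 interior thresholds split the feature range into 255 bins.
edges = np.quantile(feature, np.linspace(0, 1, max_bins + 1)[1:-1])
bin_index = np.searchsorted(edges, feature, side="right")  # integers in 0..254

# Per-bin statistics: sum of gradients and number of samples.
grad_hist = np.bincount(bin_index, weights=gradients, minlength=max_bins)
count_hist = np.bincount(bin_index, minlength=max_bins)

# Candidate splits are evaluated at bin boundaries using cumulative sums,
# so at most 254 thresholds are scanned instead of every distinct value.
left_grad = np.cumsum(grad_hist)[:-1]
left_count = np.cumsum(count_hist)[:-1]
print(len(edges), len(left_grad), int(count_hist.sum()))
```

Because the histograms are built once per feature, the split search cost depends on the number of bins rather than on the number of distinct feature values.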

2.7. Bootstrap Aggregating Algorithm

Bootstrap aggregating, commonly known as ‘bagging’ [49], is an ensemble technique in ML that involves training multiple models on various random subsets of the training data. The outputs generated by these models are subsequently aggregated using either an averaging technique or a majority vote approach. In the bagging algorithm, majority voting is employed to aggregate the outputs of distinct inductive models, which are built using bootstrap samples drawn from the same training dataset [50].

A bootstrap sample is generated to match the size of the original training dataset, achieved through uniform sampling with replacement. This implies that once an instance is drawn from the dataset, it remains accessible for subsequent selections, allowing multiple instances to be repeated within a single bootstrap sample. For sufficiently large training datasets, a bootstrap sample generally consists of approximately 63.2% unique instances, while the remaining portions are duplicates from the original data. The bagging process utilized in ensemble learning (EL) is described in Algorithm 1 [51].

Algorithm 1 The Bagging EL
Input: training-input set: TS = $\{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\}$ (n is the number of samples); label-output set: Y = $\{y_{i}\}$ ($1 \leq i \leq L$, L is the number of categories in Y); h: base classifier; T: number of iterations.
Output: H: final classifier.
1. For t = 1 to T, step 1 do:
1.1 Select n training samples randomly, with replacement, from TS to create a subset of samples $d_{t}$.
1.2 Apply h to $d_{t}$ to train $h_{t}$ (the classifier at iteration t).
2. end for
3. For a given sample x that requires classification, the corresponding label for x is:
$H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I(h_{t}(x) = y)$	(12)
where $I(h_{t}(x) = y)$ is calculated with:
$I(h_{t}(x) = y) = \begin{cases} 1, & \text{if } h_{t}(x) = y \\ 0, & \text{otherwise} \end{cases}$	(13)

By training multiple models on different subsets of the data created by the bootstrap technique, bagging reduces variance, prevents overfitting, and combines the strengths of individual models, resulting in a more robust and accurate predictive model. These benefits led us to choose the bagging algorithm as the model for predicting CVDs in this study.

2.8. Stratified K-Fold CV

Stratified k-fold CV is an advanced technique that is particularly beneficial for datasets with uneven class distributions. The stratified k-fold CV method guarantees that every fold within the dataset retains a comparable ratio of samples from each class as found in the complete dataset. This approach improves the fairness and dependability of the training and validation phases. The method entails partitioning the dataset into k subsets, ensuring that each subset reflects a similar distribution of “samples from both the majority and minority classes” as observed in the complete dataset [34]. In Algorithm 2, the remaining folds are utilized for training, while one fold is reserved for testing. The subsequent steps of this process mirror those of standard CV. A summary of the stratified k-fold CV algorithm [52] is provided in Algorithm 2 as follows:

Algorithm 2 Fold generation of Stratified K-Fold CV

Require: number of folds, k; classes, c.
Ensure: generated folds, $F_{1}, F_{2}, \dots, F_{k}$.
$F_{1} \leftarrow \emptyset, F_{2} \leftarrow \emptyset, \dots, F_{k} \leftarrow \emptyset$
for i := 1 to c do
    $n \leftarrow \lfloor \mathrm{count}(C_{i}) / k \rfloor$
    if $i \leq (\mathrm{count}(C_{i}) \bmod k)$ then
        $n \leftarrow n + 1$
    end if
    for j := 1 to k do
        $S \leftarrow$ pick n samples randomly from $C_{i}$
        $F_{j} \leftarrow F_{j} \cup S$
        $C_{i} \leftarrow C_{i} \setminus S$
    end for
end for
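
The effect of stratification can be seen with scikit-learn's `StratifiedKFold`. The sketch below uses an artificial 80/20 class split (an assumption for the example) and k = 10, matching the number of folds used in this study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 80 negatives and 20 positives (illustrative only).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
fold_positive_counts = []
for train_idx, test_idx in skf.split(X, y):
    # One fold is held out for testing; the other nine are used for training.
    fold_sizes.append(len(test_idx))
    fold_positive_counts.append(int(y[test_idx].sum()))

print(fold_sizes, fold_positive_counts)
```

Every held-out fold keeps the 80/20 ratio of the full dataset, which is the property that makes the validation scores comparable across folds.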

2.9. GridSearchCV Hyperparameter Tuning Technique

A common feature of ML models is the presence of hyperparameters, which are predefined settings that are not derived from the data. To determine the optimal values for these hyperparameters, a technique known as GridSearchCV is employed. GridSearchCV is a hyperparameter tuning method that uses CV to systematically explore a specified parameter grid and identify the most effective set of hyperparameters for a given model [53]. The algorithmic approach for conducting a grid search can be summarized as follows (see Algorithm 3).

Algorithm 3 The GridSearchCV Algorithm

Step 1: Set the parameters
Step 2: Create a “parameter grid”
Step 3: Develop a “base model”
Step 4: Configure the parameters for the grid search model
Step 5: Execute the grid search using training features and labels
Step 6: Identify the optimal grid.
Step 7: Select the optimal parameters

2.10. Performance Evaluation Metrics

To assess the effectiveness of the predictive models, a “2 × 2 confusion matrix” is employed. The disease status is categorized into two groups: “CVD” and “Non-CVD.” In this context, “CVD” is considered a positive event, while “Non-CVD” is regarded as a negative event. By applying these two categorical classes to the confusion matrix, we can generate values for true negatives, true positives, false negatives, and false positives.

To further assess the performance of the models using clinical variables recommended by the WHO and the American Heart Association, we examine several metrics for all models, such as “precision, recall, F1 score, accuracy, and the area under the curve (AUC) associated with the receiver operating characteristic (ROC) curve” [54]. An ROC graph helps visualize, organize, and select classifiers based on their performance. The AUC indicates the likelihood that a classifier will give a higher score to a randomly selected positive instance compared to a randomly selected negative one. The following are the detailed formulas for the performance evaluation metrics utilized in this research:

$\text{Precision or PPV} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$

(14)

$\text{Recall or Sensitivity} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$

(15)

$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

(16)

$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{False Negative} + \text{True Negative}}$

(17)
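
Equations (14)–(17) and the AUC can be verified on a small hand-made example. The labels and scores below are invented for illustration, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy labels (1 = CVD, 0 = Non-CVD) and predicted scores; illustrative only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1, 0.85, 0.35])
y_pred = (y_score >= 0.5).astype(int)  # assumed threshold of 0.5

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# AUC is computed from the scores, not from the thresholded predictions.
auc = roc_auc_score(y_true, y_score)
print(precision, recall, f1, accuracy, auc)
```

Here 24 of the 25 positive-negative pairs are ranked correctly, so the AUC is 0.96 even though the thresholded accuracy is 0.8.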

3. Results and Discussion

3.1. Experimental Settings

The experimental environments utilized in this study include the Scikit-learn library version 1.1.2, Python 3.9.7, SPSS version 25.0, and Microsoft Excel version 2501 (64-bit). These environments were employed for performance measurements, the development of the proposed ML-based prediction models, and data analysis.

3.2. Redundancy Data Elimination

Data is crucial for developing ML models, which are widely employed in healthcare and biomedical industries. However, the large volumes of data generated, particularly by big data sensors, can lead to redundancy. If this redundancy goes unrecognized, it may carry over to subsequent datasets, ultimately diminishing the efficacy of data collection [55]. By analyzing and eliminating duplications within existing datasets, we can enhance the effectiveness of ML model training. A substantial amount of data poses significant challenges for ML model development, especially given the lengthy training periods and increasing processing demands. Interestingly, it has been observed that heavily filtering training data may lead to slight improvements in classification task performance [56,57,58,59,60]. In this study, we employed three datasets consisting of 70,000, 70,000, and 5200 samples. These datasets are classified as large and contain numerous redundant samples. To optimize training time efficiency, we removed the redundant data present in the datasets used for this study. The number of redundant samples in datasets I, II, and III is detailed in Table 3.

After data redundancy elimination, datasets I, II, and III contain 68,996, 68,996, and 5172 samples, respectively. These samples are then utilized for further processes, such as outlier detection and elimination. The performance of the model using datasets I, II, and III after data redundancy elimination is displayed in Table 4, Table 5 and Table 6.

Utilizing the three aforementioned datasets (I, II, and III) and applying stratified 10-fold CV to all techniques with a train–test split of 90% and 10%, the proposed CVD prediction model demonstrated positive results. It outperformed existing ML models across all performance evaluation metrics. As shown in Table 4, Table 5 and Table 6, the proposed model achieved the highest scores: precision (0.9375), recall (0.9877), F1 score (0.9619), accuracy (0.9615), AUC (0.9911) on dataset I; precision (0.9371), recall (0.9890), F1 score (0.9624), accuracy (0.9619), and AUC (0.9914) on dataset II; and precision (0.8954), recall (0.8283), F1 score (0.8601), accuracy (0.8607), and AUC (0.9280) on dataset III.

The second-best model, HGB, exhibited strong performance across all datasets. Because HGB performed best as a single algorithm, we selected it as the estimator (base model) for the bagging algorithm in the proposed model. HGB outperforms the other ML models because it is a gradient boosting-based ensemble learning algorithm that combines weak learners (decision trees) into a strong learner (see Equation (11)). At each iteration, it adds a new weak learner that focuses on correcting the errors of the previous learners, so the prediction error decreases step by step and predictive accuracy improves. In addition, HGB uses binning, i.e., it converts continuous features into discrete intervals, with a maximum of 255 intervals (bins). Applied to tree construction, binning simplifies complex data, significantly accelerating training and enhancing model performance [46,47,48]. Binning also smooths out noise and reduces the impact of extreme values. Given these strengths, it is reasonable that HGB performs better than the other single ML models utilized in this study (see Table 4, Table 5 and Table 6).

Furthermore, because HGB performed best among the single ML models, we incorporated it into the bagging algorithm. Bagging uses bootstrap resampling to generate multiple new training datasets, and one model is trained on each. The outputs of these models are then aggregated using either an averaging technique or a majority vote approach. This reduces the variance of the models, resulting in more stable and reliable predictions. In this experiment, we used 10 estimators (n_estimators = 10), meaning that bootstrap resampling generated 10 new training datasets and therefore 10 models. The predictions of the 10 models are aggregated, and the most frequent result is selected as the final result. Since the base model selected is HGB, the bagging algorithm is less sensitive to outliers; thus, the final model is more robust and accurate. On our three large-scale datasets, bagging with HGB (the proposed model) outperformed all existing ML models (see Table 4, Table 5 and Table 6).

3.3. Outlier Detection and Elimination

The training datasets utilized in the proposed study include outliers attributed to data entry and measurement errors. To ensure that the data for each feature, particularly those with continuous values, falls within the appropriate range, outlier detection techniques were employed in this study. Five robust outlier detection techniques were applied: LOF, Elliptic Envelope (EE), Isolation Forest (iForest), One-Class SVM, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). The performance of these five outlier detection techniques was then compared to identify the one with the best performance for the model.

As detailed in Table 7, the number of outliers detected by the five techniques ranges from 1465 to 6566 in dataset I, from 3106 to 6210 in dataset II, and from 118 to 550 in dataset III. The number of outliers detected by each technique was obtained after hyperparameter tuning and is therefore considered optimal for that technique. With the LOF technique, the training sets consisted of 60,632 samples for dataset I, 58,181 samples for dataset II, and 4528 samples for dataset III, and the numbers of outliers removed were 1465, 3916, and 118, respectively. The counts of detected outliers and the model’s performance after outlier elimination are presented in Table 7.

After removing the outliers that existed in the datasets, we evaluated the model’s performance for selecting the optimal outlier elimination technique. This evaluation was conducted by summing up the number of outperformed metrics in each ML model utilizing datasets I, II, and III. The optimal outlier elimination technique is selected based on which one has the highest number of outperformed metrics in total. As shown in Table 8 and the detailed information in the Supplementary File, the LOF technique predominantly outperforms the other aforementioned outlier elimination techniques. The LOF technique performs well when combined with KNNs, AdaBoost, SVC, MLP, LR, HGB, and the proposed model utilizing datasets I and II. When we total the number of outperformed metrics, LOF leads with a total number of 53, followed by OneClassSVM with a total of 48, DBSCAN with a total of 31, iForest with a total of 21, and EE with a total of 13 (see Supplementary File). These results were obtained by tuning the key hyperparameters in each outlier elimination technique to best suit the data used. For example, for LOF, we tuned the number of neighbors (n_neighbors) and contamination hyperparameters, setting n_neighbors to 34 and contamination to ‘auto’ (allowing the algorithm to determine the outlier proportion). As LOF is an algorithm that detects anomalies or outliers by assessing the relative density of an observation within its neighborhood, the crucial hyperparameters of LOF include the number of neighbors and contamination, which indicates the anticipated proportion of outliers [39]. 
After we tuned the hyperparameters, the proposed model achieved the following scores: for dataset I, precision is 0.9375, recall is 0.9866, the F1 score is 0.9614, accuracy is 0.9609, and AUC is 0.9901; for dataset II, precision is 0.9407, recall is 0.9898, the F1 score is 0.9646, accuracy is 0.9639, and AUC is 0.9922; and for dataset III, precision is 0.8950, recall is 0.8275, the F1 score is 0.8548, accuracy is 0.8666, and AUC is 0.9269, outperforming all existing ML models (see Supplementary File) consistent with the previous results. These results reveal that LOF is suitable for large-scale datasets with a variety of impacting elements and dense information, such as the three datasets used in this study.
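
The LOF-based elimination step can be sketched as follows, using the hyperparameter values reported above (n_neighbors = 34 and contamination = ‘auto’). The synthetic training matrix and the ten injected extreme rows are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Synthetic training matrix: 1000 samples, 5 continuous features.
X_train = rng.normal(size=(1000, 5))
# Replace the first 10 rows with extreme values (stand-ins for entry errors).
X_train[:10] = rng.uniform(15, 25, size=(10, 5)) * rng.choice([-1, 1], size=(10, 5))

lof = LocalOutlierFactor(n_neighbors=34, contamination="auto")
labels = lof.fit_predict(X_train)  # 1 = inlier, -1 = outlier

# Keep only the rows labeled as inliers for subsequent model training.
X_clean = X_train[labels == 1]
print("removed:", int((labels == -1).sum()), "kept:", len(X_clean))
```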

Furthermore, after eliminating the outliers from the datasets, we conducted a correlation analysis to examine the relationships between the features. Correlation analysis can detect multicollinearity in the training sets, i.e., the situation where two or more features are highly correlated, which makes it difficult to determine the precise impact of each feature on the target variable and leads to unstable and unreliable coefficient estimates. Pearson’s correlation coefficient was used to measure these relationships. The coefficient ranges from −1 to +1, where a value around 0 indicates a low correlation between the features, and a value close to +1 or −1 indicates a strong positive or strong negative correlation, respectively. Heatmap correlations between features for datasets I, II, and III are shown in Figure 5a–c, respectively. In Figure 5a,b, the green color indicates that the coefficient is close to 0, meaning that there is little or no linear correlation between the features, whereas in Figure 5c, coefficients close to 0 are displayed in light green. The red color in all figures indicates that the correlation between features is close to +1.

According to Figure 5a, in the training set of dataset I, the correlation coefficients ranged from −0.091 to +0.51. This indicates that no feature has a strong correlation with another, and no multicollinearity between the features exists (correlation coefficient < 0.8) [61]. In Figure 5b, the minimum correlation coefficient is −0.047, and the maximum is +0.47. These coefficients are lower than 0.8; thus, no pair of features is strongly correlated. Finally, the correlation heatmap in Figure 5c shows a minimum coefficient of −0.29 and a maximum of +0.76. These values are also lower than 0.8, revealing that no features have a strong correlation with one another, and no multicollinearity exists in the training set.
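
The multicollinearity check described above can be sketched with pandas as follows. The data frame is synthetic; the column names are borrowed from the datasets for readability, and a bmi column derived from weight and height is included on purpose so that one pair exceeds the 0.8 threshold. All numeric values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
weight = rng.normal(74, 14, n)   # kg, synthetic
height = rng.normal(165, 8, n)   # cm, synthetic
frame = pd.DataFrame({
    "weight": weight,
    "height": height,
    "bmi": weight / (height / 100) ** 2,  # derived, so correlated with weight
    "ap_hi": rng.normal(127, 17, n),
})

corr = frame.corr(method="pearson")
cols = list(corr.columns)

# Report each feature pair once and flag |r| >= 0.8 as potential multicollinearity.
flagged = [
    (cols[i], cols[j], float(corr.iloc[i, j]))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) >= 0.8
]
print(flagged)
```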

3.4. Information Gain and Chi-Square Scores for Feature Selection

After removing outliers from the datasets, this study evaluated the significance of the features by calculating their information gain (IG) and chi-square scores. Figure 6 illustrates these scores for the features across datasets I, II, and III. To enhance the prediction model’s performance, particularly by reducing the training time, we selected the same number of features (k) for all datasets involved in this research. Since dataset III has the fewest features (eight), we applied k = 8 to datasets I and II as well, matching the third dataset.

According to the results displayed in Figure 6a, when k = 8, the features selected for dataset I using IG are “age_in_years, ap_hi, ap_lo, cholesterol, weight, gluc, active, and gender.” Conversely, when applying chi square, the selected features are ap_lo, ap_hi, weight, age_in_years, cholesterol, gluc, active, and smoke (see Figure 6b). For dataset II, as displayed in Figure 6c, the features selected by IG include “age_in_years, ap_hi, ap_lo, cholesterol, bmi, gluc, gender, and active,” while the chi-squared method identifies ap_lo, ap_hi, age_in_years, bmi, cholesterol, gluc, active, and gender (see Figure 6d). Lastly, as shown in Figure 6e,f, all features are chosen in dataset III when k = 8. A detailed visual analysis of the feature gain and chi-square scores is presented in Figure 6.
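
A minimal sketch of selecting k = 8 features with both scoring methods is shown below. It assumes that information gain is estimated with scikit-learn's mutual_info_classif and that features are rescaled to be non-negative before applying chi2 (which requires non-negative inputs); this section does not state which implementation or preprocessing was used, and the synthetic data and feature names are hypothetical.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(
    n_samples=1000, n_features=11, n_informative=5, n_redundant=2, random_state=0
)
names = np.array([f"f{i}" for i in range(X.shape[1])])  # hypothetical names

# Information gain, estimated here via mutual information (assumption).
ig_score = partial(mutual_info_classif, random_state=0)
ig = SelectKBest(score_func=ig_score, k=8).fit(X, y)

# chi2 needs non-negative features, so rescale to [0, 1] first (assumption).
X_pos = MinMaxScaler().fit_transform(X)
chi = SelectKBest(score_func=chi2, k=8).fit(X_pos, y)

ig_features = names[ig.get_support()]
chi_features = names[chi.get_support()]
print("IG:  ", ig_features)
print("chi2:", chi_features)
```

As in the results above, the two methods may agree on most features while differing on a few, since they score relevance differently.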

In our analysis using eight features (k = 8) across datasets I, II, and III, we observed that the performance of the proposed model exhibited only minor variations, remaining comparable to the performance of models that did not apply FS methods (performance assessed after outlier elimination). While FS showed limited impact on model performance in this context, it has the potential to reduce training times due to the decreased number of features [23,39].

Our evaluation results indicate that model performance slightly improved when the IG method was used, with no significant degradation in performance observed. Therefore, we decided to proceed with IG as our FS method, retaining eight features. The performance metrics with IG (k = 8) were as follows: for dataset I, precision was 0.9376, recall was 0.9867, the F1 score was 0.9615, accuracy was 0.9610, and AUC was 0.9902. For dataset II, precision was 0.9405, recall was 0.9898, the F1 score was 0.9645, accuracy was 0.9638, and AUC was 0.9923. Lastly, for dataset III, the metrics were precision (0.8950), recall (0.8275), F1 score (0.8548), accuracy (0.8666), and AUC (0.9269) (see Figure 7). These findings show that the features selected by IG provided better prediction results than those selected by the chi-squared technique. As IG is suitable for handling both continuous and discrete data and is able to identify more predictive features for tree-based models, it makes sense that IG outperformed the chi-squared technique in this study.

3.5. The Performance of the Proposed Method Before and After Hyperparameter Tuning

The proposed model is based on the ensemble bagging algorithm, which utilizes a base model or estimator to make predictions. In our work, we enhanced the performance of the proposed CVD prediction model by conducting hyperparameter tuning on the estimator and then on the bagging model itself. We assessed the impact of hyperparameter tuning on the estimator’s performance using GridSearchCV, focusing on accuracy. The hyperparameters that yielded optimal accuracy were selected as the estimator parameters for the bagging classifier.

As the estimator of the proposed model is HGB, we conducted hyperparameter tuning on an HGB classifier. HGB is an ensemble learning decision tree algorithm that iteratively builds models by combining weak learners into a strong learner (better performance model). The process of producing a strong learner is conducted by correcting the errors of the previous learners; thus, the prediction errors in the model become minimal, making the model stronger (see Equations (9)–(11)). Based on these processes, the important hyperparameters of HGB involved the maximum number of iterations (max_iter), the maximum depth of the trees (max_depth), a penalty constant added to the loss function to mitigate overfitting using L2 or Ridge regularization (l2_regularization), the pseudo-random number generator that regulates the train/validation data split if early stopping is enabled (random_state), and the shrinkage that is used as a multiplier for the values of the leaves (learning_rate). These evaluated hyperparameters are also referenced in [46,47,62]. Table 9 displays the values of the hyperparameters for the HGB model obtained after GridSearchCV tuning.

After tuning the hyperparameters of the HGB estimator, we proceeded to tune the hyperparameters for the bagging classifier employed in the proposed model. As bagging is an ensemble ML algorithm that involves training multiple models on various random subsets of the training data using a specific algorithm (known as the base model), the primary hyperparameters involved in the bagging classifier encompass the base model utilized for fitting randomly selected subsets of the dataset (denoted as estimator); the total count of base estimators within the ensemble (“n_estimators”); the upper limit of samples (X) needed for training each base estimator (“max_samples”); the maximum number of features (X) required for training (“max_features”); and the random number generator (“random_state”), which governs the resampling of the original dataset. The optimal hyperparameter values for the bagging classifier, along with their effects on the performance of the proposed method, are presented in Table 10 and Figure 8.

According to the results following hyperparameter tuning (see Figure 8), the proposed model shows some improvements across all datasets. Specifically, an enhancement is evident in all metrics for datasets I and II. For dataset I, precision, recall, the F1 score, accuracy, and AUC increased by 0.0014 (0.14%), 0.0016 (0.16%), 0.0015 (0.15%), 0.0015 (0.15%), and 0.0014 (0.14%), respectively. The average improvement recorded is 0.00148 or 0.148%. In dataset II, improvements reached 0.0011 (0.11%) for precision, 0.0007 (0.07%) for recall, 0.0009 (0.09%) for the F1 score, 0.0010 (0.10%) for accuracy, and 0.0008 (0.08%) for AUC, with an average improvement of 0.0009 or 0.09%.

In contrast to datasets I and II, degradations were observed in dataset III regarding recall and F1 scores. These results stem from a significant increase in the precision score. In ML, a “trade-off typically exists between precision and recall; as precision increases, recall often declines”, and vice versa [63,64]. This phenomenon indicates that improvements in one metric frequently occur at the expense of another. The increase in precision on dataset III, 0.0031 (0.31%), is the highest improvement when compared to datasets I and II. Accuracy remains unchanged at 0.8666 or 86.66%. Additionally, the AUC score on dataset III improved by 0.0005 (0.05%). This finding highlights the model’s enhanced capability to distinguish negative from positive cases, despite decreases in recall and F1 scores.

To determine whether the performance of the model improved after hyperparameter tuning, we conducted a paired t-test on the performance of the proposed model before vs. after hyperparameter tuning. The null hypothesis (h = 0) states that there is no significant difference between the performance of the proposed model before and after hyperparameter tuning. For each dataset, we gathered the h, p-value, and t (calculated) values, with the significance level set at 0.05 and t (tabulated) = 2.776. If the paired t-test yields h = 0, the null hypothesis is not rejected; if h = 1, the null hypothesis is rejected, which indicates a significant difference between the performance before and after hyperparameter tuning. The null hypothesis is rejected when the absolute value of the calculated t exceeds the tabulated t and the p-value is less than the significance level (0.05). The calculated t can be either positive or negative, with the sign determined by the relative positions of the two means: it is negative when the mean performance before hyperparameter tuning is smaller than the mean performance after hyperparameter tuning, and positive in the opposite case. Table 11 displays the results of the paired t-test for the three datasets used in this study. For datasets I and II, h = 1, p-value < significance level, and |t (calculated)| > t (tabulated), indicating that the performance of the proposed model after hyperparameter tuning differs significantly from its performance before hyperparameter tuning. However, for dataset III, the paired t-test results showed no significant difference (p-value > significance level and |t (calculated)| < t (tabulated)). This nonsignificant result is attributable to the degradations in recall and F1 scores.
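
The decision rule above can be sketched with SciPy for dataset I. The sketch assumes that the pre-tuning scores are the IG (k = 8) results quoted in Section 3.4 and that the post-tuning scores are those quoted in the Conclusions; because these are rounded values, the resulting statistic need not match Table 11 exactly.

```python
from scipy import stats

# Dataset I: precision, recall, F1 score, accuracy, AUC (rounded values from the text).
before = [0.9376, 0.9867, 0.9615, 0.9610, 0.9902]
after = [0.9390, 0.9883, 0.9630, 0.9625, 0.9916]

alpha = 0.05
t_stat, p_value = stats.ttest_rel(before, after)

# Two-sided critical value with n - 1 = 4 degrees of freedom.
t_crit = stats.t.ppf(1 - alpha / 2, df=len(before) - 1)

# h = 1 means the null hypothesis of no difference is rejected.
h = int(abs(t_stat) > t_crit and p_value < alpha)
print(f"t = {t_stat:.3f}, p = {p_value:.2e}, t_crit = {t_crit:.3f}, h = {h}")
```

With five paired metrics, the critical value reproduces the tabulated t of 2.776, and the negative sign of the statistic reflects that the mean before tuning is smaller than the mean after tuning.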

As illustrated in Figure 8, the performance scores of the proposed CVD prediction model were high across all evaluation metrics and datasets utilized. The minimum recorded score was 0.8240, while the maximum reached 0.9931. Based on the ML performance metric score classification [65], the proposed model demonstrated excellent performance on datasets I and II and good to excellent performance on dataset III. Regarding the ROC curves, the model showed an excellent ability to distinguish diseased from nondiseased samples across all datasets, with AUC scores of 0.9916 for dataset I (see Figure 9a), 0.9931 for dataset II (see Figure 9b), and 0.9274 for dataset III (see Figure 9c). Notably, the model on dataset II outperformed those on datasets I and III, which is noteworthy given that dataset II stemmed from a related dataset [36]. The conversion of weight and height features into BMI led to improvements in the performance scores, aligning with findings from prior studies [66,67].

Several findings merit discussion after establishing the performance of the proposed model. The first concerns the base model, HGB. HGB is an iterative algorithm that utilizes the learning rate in calculating the value of the model (see Equations (9)–(11)) and updates its model over multiple iterations. Thus, the number of iterations is crucial for convergence and model performance. In our experiment, we found that at least 100 iterations were needed to obtain the best-performing model (see Table 9), especially when the dataset is large. The learning rate is also essential, as it controls the step size at each iteration of the optimization process and must be tuned carefully to find the optimal balance between convergence speed and accuracy. The optimal learning rates in this study were 0.1 and 0.01, suggesting that the learning rate should not exceed 0.1 in HGB. The third important hyperparameter in HGB is max_depth, because the algorithm is decision tree-based. Second, in bagging classifier hyperparameter tuning, we found that choosing the base model is the most crucial step, because it directly impacts the model’s performance and adaptability to the specific dataset used. As bagging is an ensemble ML algorithm that involves training multiple models on various random subsets of the training data, the number of models to be built (n_estimators) is the second most crucial factor. In our experiment, we found that the optimal number of estimators for the best-performing model was 20 in dataset I, 19 in dataset II, and 16 in dataset III (see Table 10). These numbers are higher than the default (n_estimators = 10) provided by the Scikit-learn library, probably because the datasets used are large. When the dataset is large, more models are needed to capture the complexity of the data.

3.6. Comparison with Previous Works

In the final evaluation, we compared the proposed model’s performance against that of models from previous studies. Approximately seven prior studies utilized the same dataset (dataset I) to develop CVD prediction models. Table 12 presents the performance comparison between our proposed study and previous studies.

As shown in Table 12, the ML algorithms utilized in prior studies vary, leading to differences in performance outcomes. Among the reviewed studies, the CVD prediction models proposed by Uddin and Halder [69] and Bhatt et al. [73] were the top performers in terms of accuracy and AUC, with scores of 0.9416 and 0.8727 for accuracy and 0.9400 and 0.9500 for AUC, respectively. In contrast, our proposed model achieved superior results, recording an accuracy of 0.9625 and an AUC of 0.9916. This indicates that our model significantly outperforms those of previous studies. Furthermore, our model excels not only in accuracy and AUC but also in other performance evaluation metrics, including precision, recall, and F1 score. The results of this comparison demonstrate that the incorporation of LOF and IG, along with the selection of estimators for the bagging classifier and hyperparameter tuning, play essential roles in enhancing the performance of the model. Consequently, these strategies represent effective methods for improving the performance of CVD prediction models.

4. Conclusions

We proposed a novel method for predicting CVDs. We demonstrated that the bagging algorithm, utilizing HGB as an estimator, significantly enhances the model’s performance when paired with the LOF-based outlier detection method, IG-based FS method, and GridSearchCV-based hyperparameter tuning. The algorithm was implemented and evaluated using three large publicly available datasets. Through the implementation of stratified k-fold CV with ten folds, we established that our proposed model outperforms both earlier research models and ten examined traditional ML models, including LR, SVC, GNB, MLP, KNN, RF, AdaBoost, GB, HGB, and Extra Trees. For dataset I, the model achieved “precision, recall, F1 score, accuracy, and ROC-AUC” scores of 93.90%, 98.83%, 96.30%, 96.25%, and 0.9916, respectively; for dataset II, the scores were 94.17%, 99.05%, 96.54%, 96.48%, and 0.9931; and for dataset III, they were 89.81%, 82.40%, 85.91%, 86.66%, and 0.9274. The integration of LOF and IG, combined with methodical estimator selection for the bagging classifier and hyperparameter tuning, proves to be effective in enhancing the CVD prediction model’s performance.

In our work, the proposed model predicted cardiovascular diseases with high performance on large datasets for a binary classification problem. However, it has yet to be evaluated on smaller datasets or for other classification problems, such as multi-class and multi-label classifications. Therefore, future research will focus on analyzing our model on smaller datasets utilizing the same hyperparameters and techniques. Additionally, the proposed model will be assessed for multi-class and multi-label classification challenges. Once the model consistently identifies CVDs accurately under these conditions, it can be considered a reliable tool for predicting CVDs.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math13132194/s1.

Author Contributions

Conceptualization, N.L.F., M.S., N.C., H.S., D.A., M.R., S.L.Q. and S.W.L.; methodology, N.L.F. and M.S.; validation, M.R. and S.L.Q.; formal analysis, N.L.F., M.S., N.C., H.S., D.A. and N.L.F.; investigation, M.R. and S.L.Q.; data curation, N.L.F., M.S., N.C., H.S., D.A. and M.R.; writing—original draft preparation, N.L.F., M.S. and S.L.Q.; writing—review and editing, N.C., H.S., D.A., M.R. and S.W.L.; visualization, N.L.F. and M.S.; software, N.L.F. and M.S.; supervision, N.C., M.S. and S.W.L.; funding acquisition, N.C. and S.W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the International Research Network Universitas Airlangga under contract number 3097/B/UN3.LPPM/PT.01.09/2024. This research was supported by the Airlangga Post-Doctoral Fellowship Program, Ref. No: 214/B/UN3.AGE/HK.07.01/2025. This work was supported by National Research Foundation (NRF) grants funded by the Ministry of Science and ICT (MSIT) and Ministry of Education (MOE), Republic of Korea (NRF[2021-R1-I1A2(059735)]; RS[2024-0040(5650)]; RS[2024-0044(0881)]; RS[2019-II19(0421)]).

Data Availability Statement

The datasets used in this study are publicly available [36,37], and the source code is accessible on Zenodo at https://doi.org/10.5281/zenodo.15564489 (accessed on 31 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

World Heart Report 2023: Confronting The World’s Number One Killer. Available online: https://world-heart-federation.org/wp-content/uploads/World-Heart-Report-2023.pdf (accessed on 24 March 2025).
Cardiovascular Diseases. Available online: https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1 (accessed on 24 March 2025).
Cardiovascular Diseases Kill 10,000 People in the WHO European Region Every Day, with Men Dying more Frequently than Women. Available online: https://www.who.int/europe/news/item/15-05-2024-cardiovascular-diseases-kill-10-000-people-in-the-who-european-region-every-day--with-men-dying-more-frequently-than-women (accessed on 24 March 2025).
Cardiovascular Diseases. Available online: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) (accessed on 24 March 2025).
Khan, A.; Qureshi, M.; Daniyal, M.; Tawiah, K. A Novel Study on Machine Learning Algorithm-Based Cardiovascular Disease Prediction. Health Soc. Care Community 2023, 2023, 1406060.
Mandava, M.; Reddy Vinta, S. MDensNet201-IDRSRNet: Efficient Cardiovascular Disease Prediction System Using Hybrid Deep Learning. Biomed. Signal Process. Control 2024, 93, 106147.
El-Sofany, H.; Bouallegue, B.; El-Latif, Y.M.A. A Proposed Technique for Predicting Heart Disease Using Machine Learning Algorithms and an Explainable AI Method. Sci. Rep. 2024, 14, 23277.
Peng, M.; Hou, F.; Cheng, Z.; Shen, T.; Liu, K.; Zhao, C.; Zheng, W. Prediction of Cardiovascular Disease Risk Based on Major Contributing Features. Sci. Rep. 2023, 13, 4778.
Krive, J.; Chertok, D. Advancing Cardiovascular Disease Prediction Machine Learning Models With Psychological Factors. JACC Adv. 2024, 3, 101185.
Dorraki, M.; Liao, Z.; Abbott, D.; Psaltis, P.J.; Baker, E.; Bidargaddi, N.; Wardill, H.R.; Van Den Hengel, A.; Narula, J.; Verjans, J.W. Improving Cardiovascular Disease Prediction With Machine Learning Using Mental Health Data. JACC Adv. 2024, 3, 101180.
Theerthagiri, P. Predictive Analysis of Cardiovascular Disease Using Gradient Boosting Based Learning and Recursive Feature Elimination Technique. Intell. Syst. Appl. 2022, 16, 200121.
Budholiya, K.; Shrivastava, S.K.; Sharma, V. An Optimized XGBoost Based Diagnostic System for Effective Prediction of Heart Disease. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 4514–4523.
Lee, J.; Choi, Y.; Ko, T.; Lee, K.; Shin, J.; Kim, H.-S. Prediction of Cardiovascular Complication in Patients with Newly Diagnosed Type 2 Diabetes Using an XGBoost/GRU-ODE-Bayes-Based Machine-Learning Algorithm. Endocrinol. Metab. 2024, 39, 176–185.
Feng, M.; Wang, X.; Zhao, Z.; Jiang, C.; Xiong, J.; Zhang, N. Enhanced Heart Attack Prediction Using eXtreme Gradient Boosting. J. Theory Pract. Eng. Sci. 2024, 4, 9–16.
Jamimi, H.A.A. Early Prediction of Heart Disease Risk Using Extreme Gradient Boosting: A Data-Driven Analysis. Int. J. Biomed. Eng. Technol. 2024, 45, 296–313.
Nematollahi, M.A.; Jahangiri, S.; Asadollahi, A.; Salimi, M.; Dehghan, A.; Mashayekh, M.; Roshanzamir, M.; Gholamabbas, G.; Alizadehsani, R.; Bazrafshan, M.; et al. Body Composition Predicts Hypertension Using Machine Learning Methods: A Cohort Study. Sci. Rep. 2023, 13, 6885.
Zhao, H.; Zhang, X.; Xu, Y.; Gao, L.; Ma, Z.; Sun, Y.; Wang, W. Predicting the Risk of Hypertension Based on Several Easy-to-Collect Risk Factors: A Machine Learning Method. Front. Public Health 2021, 9, 619429.
Chandramouli, A.; Hyma, V.R.; Tanmayi, P.S.; Santoshi, T.G.; Priyanka, B. Diabetes Prediction Using Hybrid Bagging Classifier. Entertain. Comput. 2023, 47, 100593.
Xu, H.; Zhang, L.; Li, P.; Zhu, F. Outlier Detection Algorithm Based on K-Nearest Neighbors-Local Outlier Factor. J. Algorithms Comput. Technol. 2022, 16, 17483026221078111.
Adesh, A.; Shobha, G.; Shetty, J.; Xu, L. Local Outlier Factor for Anomaly Detection in HPCC Systems. J. Parallel Distrib. Comput. 2024, 192, 104923.
Alghushairy, O.; Alsini, R.; Soule, T.; Ma, X. A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Data Cogn. Comput. 2020, 5, 1.
Syafrudin, M.; Fitriyani, N.L.; Alfian, G.; Rhee, J. An Affordable Fast Early Warning System for Edge Computing in Assembly Line. Appl. Sci. 2018, 9, 84.
Qu, K.; Xu, J.; Hou, Q.; Qu, K.; Sun, Y. Feature Selection Using Information Gain and Decision Information in Neighborhood Decision System. Appl. Soft Comput. 2023, 136, 110100.
Fitriyani, N.L.; Syafrudin, M.; Alfian, G.; Rhee, J. Development of Disease Prediction Model Based on Ensemble Learning Approach for Diabetes and Hypertension. IEEE Access 2019, 7, 144777–144789.
Syafrudin, M.; Alfian, G.; Fitriyani, N.L.; Rhee, J. Performance Analysis of IoT-Based Sensor, Big Data Processing, and Machine Learning Model for Real-Time Monitoring System in Automotive Manufacturing. Sensors 2018, 18, 2946.
Ijaz, M.; Alfian, G.; Syafrudin, M.; Rhee, J. Hybrid Prediction Model for Type 2 Diabetes and Hypertension Using DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest. Appl. Sci. 2018, 8, 1325.
Elgeldawi, E.; Sayed, A.; Galal, A.R.; Zaki, A.M. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79.
Muzayanah, R.; Pertiwi, D.A.A.; Ali, M.; Muslim, M.A. Comparison of Gridsearchcv and Bayesian Hyperparameter Optimization in Random Forest Algorithm for Diabetes Prediction. J. Soft Comput. Explor. 2024, 5, 86–91.
Ahamad, G.N.; Shafiullah; Fatima, H.; Imdadullah; Zakariya, S.M.; Abbas, M.; Alqahtani, M.S.; Usman, M. Influence of Optimal Hyperparameters on the Performance of Machine Learning Algorithms for Predicting Heart Disease. Processes 2023, 11, 734.
Zhao, Y.; Zhang, W.; Liu, X. Grid Search with a Weighted Error Function: Hyper-Parameter Optimization for Financial Time Series Forecasting. Appl. Soft Comput. 2024, 154, 111362.
Ogunsanya, M.; Isichei, J.; Desai, S. Grid Search Hyperparameter Tuning in Additive Manufacturing Processes. Manuf. Lett. 2023, 35, 1031–1042.
Balamurali, A.; Kumar, K.V. Early Detection and Classification of Type-2 Diabetes Using Stratified k-Fold Validation. In Proceedings of the 2024 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 12–13 December 2024; IEEE: Chennai, India, 2024; pp. 1–6.
Tougui, I.; Jilbab, A.; Mhamdi, J.E. Impact of the Choice of Cross-Validation Techniques on the Results of Machine Learning-Based Diagnostic Applications. Healthc. Inform. Res. 2021, 27, 189–199.
Szeghalmy, S.; Fazekas, A. A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. Sensors 2023, 23, 2333.
Mahesh, T.R.; Vinoth Kumar, V.; Dhilip Kumar, V.; Geman, O.; Margala, M.; Guduri, M. The Stratified K-Folds Cross-Validation and Class-Balancing Methods with High-Performance Ensemble Classifiers for Breast Cancer Classification. Healthc. Anal. 2023, 4, 100247.
Cardiovascular Disease Dataset. Available online: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset (accessed on 20 August 2024).
Cardiovascular Study Dataset. Available online: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea (accessed on 20 August 2024).
pandas.DataFrame.drop_duplicates. Available online: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html (accessed on 20 May 2025).
Breunig, M.M.; Kriegel, H.-P.; Ng, R.T.; Sander, J. LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Rec. 2000, 29, 93–104.
Agrawal, P.V.; Kshirsagar, D.D. Information Gain-Based Feature Selection Method in Malware Detection for MalDroid2020. In Proceedings of the 2022 International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN), Villupuram, India, 25–26 March 2022; IEEE: Villupuram, India, 2022; pp. 1–5.
Tamim Kashifi, M.; Ahmad, I. Efficient Histogram-Based Gradient Boosting Approach for Accident Severity Prediction With Multisource Data. Transp. Res. Rec. J. Transp. Res. Board 2022, 2676, 236–258.
Tuv, E.; Borisov, A.; Runger, G.; Torkkola, K. Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination. J. Mach. Learn. Res. 2009, 10, 1341–1366.
Rao, H.; Shi, X.; Rodrigue, A.K.; Feng, J.; Xia, Y.; Elhoseny, M.; Yuan, X.; Gu, L. Feature Selection Based on Artificial Bee Colony and Gradient Boosting Decision Tree. Appl. Soft Comput. 2019, 74, 634–642.
Devos, L.; Meert, W.; Davis, J. Fast Gradient Boosting Decision Trees with Bit-Level Data Structures. In Machine Learning and Knowledge Discovery in Databases; Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 11906, pp. 590–606. ISBN 978-3-030-46149-2.
Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2009; ISBN 978-0-387-84857-0.
Nhat-Duc, H.; Van-Duc, T. Comparison of Histogram-Based Gradient Boosting Classification Machine, Random Forest, and Deep Convolutional Neural Network for Pavement Raveling Severity Classification. Autom. Constr. 2023, 148, 104767.
Hossain, S.M.M.; Deb, K. Plant Leaf Disease Recognition Using Histogram Based Gradient Boosting Classifier. In Intelligent Computing and Optimization; Vasant, P., Zelinka, I., Weber, G.-W., Eds.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2021; Volume 1324, pp. 530–545. ISBN 978-3-030-68153-1.
Features in Histogram Gradient Boosting Trees. Available online: https://scikit-learn.qubitpi.org/auto_examples/ensemble/plot_hgbt_regression.html (accessed on 24 March 2025).
Fan, W.; Zhang, K. Bagging. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer US: Boston, MA, USA, 2009; pp. 206–210. ISBN 978-0-387-35544-3.
Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
Wang, Y.; Liu, J.; Feng, L. Text Length Considered Adaptive Bagging Ensemble Learning Algorithm for Text Classification. Multimed. Tools Appl. 2023, 82, 27681–27706.
Moreno-Torres, J.G.; Saez, J.A.; Herrera, F. Study on the Impact of Partition-Induced Dataset Shift on k-Fold Cross-Validation. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1304–1312.
Saranya, G.; Pravin, A. Grid Search Based Optimum Feature Selection by Tuning Hyperparameters for Heart Disease Diagnosis in Machine Learning. Open Biomed. Eng. J. 2023, 17, e187412072304061.
Trevethan, R. Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Front. Public Health 2017, 5, 307.
Li, K.; Persaud, D.; Choudhary, K.; DeCost, B.; Greenwood, M.; Hattrick-Simpers, J. Exploiting Redundancy in Large Materials Datasets for Efficient Machine Learning with Less Data. Nat. Commun. 2023, 14, 7283.
Geiping, J.; Goldstein, T. Cramming: Training a Language Model on a Single GPU in One Day. arXiv 2022, arXiv:2212.14034.
Sorscher, B.; Geirhos, R.; Shekhar, S.; Ganguli, S.; Morcos, A.S. Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022.
Yang, S.; Xie, Z.; Peng, H.; Xu, M.; Sun, M.; Li, P. Dataset Pruning: Reducing Training Data by Examining Generalization Influence. arXiv 2022, arXiv:2205.09329.
Choudhary, K.; DeCost, B.; Major, L.; Butler, K.; Thiyagalingam, J.; Tavazza, F. Unified Graph Neural Network Force-Field for the Periodic Table: Solid State Applications. Digit. Discov. 2023, 2, 346–355.
Ou, D.; Ji, Y.; Zhang, L.; Liu, H. An Online Classification Method for Fault Diagnosis of Railway Turnouts. Sensors 2020, 20, 4627.
Vatcheva, K.P.; Lee, M.; McCormick, J.B.; Rahbar, M.H. Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies. Epidemiol 2016, 6, 227.
Sipper, M.; Moore, J.H. AddGBoost: A Gradient Boosting-Style Algorithm Based on Strong Learners. Mach. Learn. Appl. 2022, 7, 100243.
Fitriyani, N.L.; Syafrudin, M.; Ulyah, S.M.; Alfian, G.; Qolbiyani, S.L.; Anshari, M. A Comprehensive Analysis of Chinese, Japanese, Korean, US-PIMA Indian, and Trinidadian Screening Scores for Diabetes Risk Assessment and Prediction. Mathematics 2022, 10, 4027.
Egghe, L. The Measures Precision, Recall, Fallout and Miss as a Function of the Number of Retrieved Documents and Their Mutual Interrelations. Inf. Process. Manag. 2008, 44, 856–876.
Çorbacıoğlu, Ş.K.; Aksel, G. Receiver Operating Characteristic Curve Analysis in Diagnostic Accuracy Studies: A Guide to Interpreting the Area under the Curve Value. Turk. J. Emerg. Med. 2023, 23, 195–198.
Peregrin-Alvarez, J. Reinventing the Body Mass Index: A Machine Learning Approach. medRxiv 2024. medRxiv: 26.24306457.
Gutiérrez-Gallego, A.; Zamorano-León, J.J.; Parra-Rodríguez, D.; Zekri-Nechar, K.; Velasco, J.M.; Garnica, Ó.; Jiménez-García, R.; López-de-Andrés, A.; Cuadrado-Corrales, N.; Carabantes-Alarcón, D.; et al. Combination of Machine Learning Techniques to Predict Overweight/Obesity in Adults. J. Pers. Med. 2024, 14, 816.
Maiga, J.; Hungilo, G.G.; Pranowo. Comparison of Machine Learning Models in Prediction of Cardiovascular Disease Using Health Record Data. In Proceedings of the 2019 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), Jakarta, Indonesia, 24–25 October 2019; IEEE: Jakarta, Indonesia, 2019; pp. 45–48.
Uddin, M.N.; Halder, R.K. An Ensemble Method Based Multilayer Dynamic System to Predict Cardiovascular Disease Using Machine Learning Approach. Inform. Med. Unlocked 2021, 24, 100584.
Ouf, S.; ElSeddawy, A.I.B. A Proposed Paradigm for Intelligent Heart Disease Prediction System Using Data Mining Techniques. J. Southwest Jiaotong Univ. 2021, 56, 220–240.
Shorewala, V. Early Detection of Coronary Heart Disease Using Ensemble Techniques. Inform. Med. Unlocked 2021, 26, 100655.
Punugoti, R.; Dutt, V.; Kumar, A.; Bhati, N. Boosting the Accuracy of Cardiovascular Disease Prediction Through SMOTE. In Proceedings of the 2023 International Conference on IoT, Communication and Automation Technology (ICICAT), Gorakhpur, India, 23–24 June 2023; IEEE: Gorakhpur, India, 2023; pp. 1–6.
Bhatt, C.M.; Patel, P.; Ghetia, T.; Mazzeo, P.L. Effective Heart Disease Prediction Using Machine Learning Techniques. Algorithms 2023, 16, 88.

Figure 1. A flowchart of our proposed CVD prediction method.

Figure 2. Gradient boosting model structure.

Figure 3. Gradient boosting training phase.

Figure 4. Histogram-based gradient boosting.

Figure 5. Heatmap correlations and Pearson’s correlation coefficients between features in the training sets of dataset I (a), dataset II (b), and dataset III (c).

Figure 6. Information gain (IG) and chi-squared scores of features across datasets I, II, and III. (a) IG of features in dataset I; (b) chi-squared scores of features in dataset I; (c) IG of features in dataset II; (d) chi-squared scores of features in dataset II; (e) IG of features in dataset III; (f) chi-squared scores of features in dataset III.

Figure 7. The performance of the proposed model on datasets I, II, and III after feature selection using IG and chi-squared methods.

Figure 8. The performance of the proposed model before and after hyperparameter tuning utilizing datasets I, II, and III.

Figure 9. The ROC curves of the proposed model (bagging with HGB estimator) utilizing datasets (a) I, (b) II, and (c) III.

Table 1. Characteristics of datasets I and II.

No.	Feature	Type	Mean ± STD	Mean ± STD of Positive Samples	Mean ± STD of Negative Samples
1	age_in_years	integer	53.33 ± 6.76	54.95 ± 6.35	51.73 ± 6.78
2	gender	categorical (1: women; 2: men)	-	-	-
3	height	integer	164.36 ± 8.21	164.27 ± 8.27	164.45 ± 8.15
4	weight	float	74.21 ± 14.40	76.82 ± 14.96	71.59 ± 13.31
5	ap_hi (SBP)	integer	128.82 ± 154.01	137.21 ± 191.29	120.43 ± 103.55
6	ap_lo (DBP)	integer	96.63 ± 188.47	109.02 ± 217.81	84.25 ± 152.69
7	cholesterol	categorical (1: normal; 2: above normal; 3: well above normal)	-	-	-
8	gluc (FBG)	categorical (1: normal; 2: above normal; 3: well above normal)	-	-	-
9	smoke	binary (0: not smokers; 1: smokers)	-	-	-
10	alco (alcohol intake)	binary (0: non-alcohol drinker; 1: alcohol drinker)	-	-	-
11	active (physical activity)	binary (0: inactive; 1: active)	-	-	-

Table 2. Characteristics of dataset III.

No.	Feature	Type	Mean ± STD	Mean ± STD of Positive Samples	Mean ± STD of Negative Samples
1	age	continuous	51.77 ± 8.20	54.80 ± 6.74	48.74 ± 8.42
2	sex	binary (0: women; 1: men)	-	-	-
3	is_smoking	binary (0: not smokers; 1: smokers)	-	-	-
4	totChol (total cholesterol)	categorical (1: normal; 2: above normal; 3: well above normal)	-	-	-
5	sysBP	continuous	132.95 ± 20.33	135.21 ± 19.70	130.68 ± 20.71
6	diaBP	continuous	83.60 ± 11.30	84.94 ± 10.93	82.25 ± 11.50
7	BMI	continuous	26.95 ± 5.00	28.22 ± 5.55	25.69 ± 4.00
8	glucose	categorical (1: normal; 2: above normal; 3: well above normal)	-	-	-

Table 3. The number of redundant samples in datasets I, II, and III.

Dataset	Number of Samples	Number of Redundant Samples	Number of Samples After Data Redundancy Elimination
Dataset I	70,000	1004	68,996 (Train: 62,097, Test: 6899)
Dataset II	70,000	1004	68,996 (Train: 62,097, Test: 6899)
Dataset III	5200	38	5162 (Train: 4646, Test: 516)
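
The counts in Table 3 can be reproduced in spirit with a short sketch. It assumes that a redundant sample means a fully identical row and that `pandas.DataFrame.drop_duplicates` (cited in the references) is the removal step; the toy column names and values below are illustrative and are not rows from datasets I, II, or III.

```python
import pandas as pd

# Toy frame standing in for a CVD dataset. Column names and values are
# illustrative assumptions, not data from the study.
df = pd.DataFrame(
    {
        "age": [50, 50, 61, 61, 47],
        "ap_hi": [120, 120, 140, 140, 110],
        "cardio": [0, 0, 1, 1, 0],
    }
)

n_before = len(df)
# keep="first" retains one copy of every fully identical row.
deduplicated = df.drop_duplicates(keep="first").reset_index(drop=True)
n_redundant = n_before - len(deduplicated)

print(n_before, n_redundant, len(deduplicated))
```

The three printed values correspond to the first three columns of Table 3: samples before elimination, redundant samples, and samples remaining.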

Table 4. Performance comparison of the proposed model and existing ML models on dataset I after data redundancy elimination.

Metric	LR	SVC	GNB	MLP	KNN	RF	AdaBoost	GB	HGB	Extra Trees	Proposed Model
Precision	0.7273	0.7257	0.7224	0.7287	0.7513	0.8472	0.8125	0.9371	0.9273	0.8246	0.9375
Recall	0.6627	0.6230	0.2980	0.6929	0.6565	0.8757	0.7681	0.9326	0.9822	0.8617	0.9877
F1 Score	0.6935	0.6911	0.4216	0.7005	0.7007	0.8612	0.7896	0.9348	0.9539	0.8427	0.9619
Accuracy	0.7114	0.7257	0.5976	0.7118	0.7236	0.8609	0.7984	0.9359	0.9532	0.8416	0.9615
AUC	0.7709	0.6911	0.7001	0.7883	0.7840	0.9396	0.8897	0.9877	0.9852	0.9214	0.9911
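
As a quick consistency check on the reported metrics, the F1 score can be recomputed from precision and recall using the standard harmonic-mean definition. The sketch below applies it to the proposed model's precision and recall from Table 4; the helper name is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the proposed model as reported in Table 4.
proposed_f1 = f1_from_precision_recall(0.9375, 0.9877)
print(round(proposed_f1, 4))
```

The recomputed value agrees with the F1 score of 0.9619 listed for the proposed model, and the same helper can be used to spot-check other columns.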

Table 5. Performance comparison of the proposed model and existing ML models on dataset II after data redundancy elimination.

Metric	LR	SVC	GNB	MLP	KNN	RF	AdaBoost	GB	HGB	Extra Trees	Proposed Model
Precision	0.7075	0.7807	0.7225	0.7728	0.7515	0.8613	0.8126	0.9354	0.9239	0.8594	0.9371
Recall	0.6451	0.6168	0.2810	0.6459	0.6723	0.9026	0.7647	0.9299	0.9822	0.9098	0.9890
F1 Score	0.6748	0.6891	0.4042	0.6977	0.7096	0.8814	0.7879	0.9299	0.9521	0.8838	0.9624
Accuracy	0.6939	0.7258	0.5923	0.7287	0.7290	0.8804	0.7972	0.9338	0.9513	0.8822	0.9619
AUC	0.7590	0.7882	0.6982	0.7964	0.6708	0.9540	0.8892	0.9871	0.9846	0.9549	0.9914

Table 6. Performance comparison of the proposed model and existing ML models on dataset III after data redundancy elimination.

Metric	LR	SVC	GNB	MLP	KNN	RF	AdaBoost	GB	HGB	Extra Trees	Proposed Model
Precision	0.7645	0.6790	0.7693	0.7471	0.7741	0.8668	0.8484	0.8862	0.8951	0.8306	0.8954
Recall	0.7646	0.7593	0.7073	0.7770	0.8040	0.8062	0.8305	0.8283	0.8257	0.7776	0.8283
F1 Score	0.7643	0.7167	0.7369	0.7332	0.7884	0.8348	0.8388	0.8559	0.8585	0.8028	0.8601
Accuracy	0.7660	0.7021	0.7495	0.7471	0.7858	0.8422	0.8418	0.8618	0.8655	0.8108	0.8667
AUC	0.8359	0.7670	0.8159	0.8275	0.8507	0.9135	0.9097	0.9244	0.9278	0.8908	0.9280

Table 7. Number of outliers and training samples after elimination (Note: # = Total number of).

Technique	# Outliers (Dataset I)	# Training Samples After Outlier Elimination (Dataset I)	# Outliers (Dataset II)	# Training Samples After Outlier Elimination (Dataset II)	# Outliers (Dataset III)	# Training Samples After Outlier Elimination (Dataset III)
LOF	1465	60,632	3916	58,181	118	4528
EE	3105	58,992	3106	58,991	233	4413
iForest	3105	58,992	3106	58,991	233	4413
OneClassSVM	6210	55,887	6210	55,886	464	4182
DBSCAN	6566	55,531	1244	60,853	550	4098
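
A minimal sketch of LOF-based outlier elimination on a training set, of the kind summarized in the LOF row of Table 7, is given below. It uses scikit-learn's `LocalOutlierFactor`; the synthetic data and the `n_neighbors` and `contamination` values are assumptions for illustration, not the settings used in the study.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic training features: one dense cluster plus three distant points.
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X_train = np.vstack([cluster, far_points])

# n_neighbors and contamination are illustrative assumptions.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X_train)  # -1 marks outliers, 1 marks inliers

# Keep only the samples that LOF labels as inliers.
X_clean = X_train[labels == 1]
n_outliers = int((labels == -1).sum())
print(n_outliers, X_clean.shape[0])
```

The two printed numbers play the role of the "# Outliers" and "# Training Samples After Outlier Elimination" columns for a single technique and dataset.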

Table 8. The number of metrics on which each outlier elimination technique achieved the best score for each ML model, used to select the optimal outlier elimination technique (see the detailed information in the Supplementary File).

Model	Technique	Dataset I	Dataset II	Dataset III
LR	LOF	2 (Pre, F1)	3 (Rec, F1, AUC)	1 (Pre)
	EE	3 (Rec, Acc, AUC)	0	0
	iForest	0	0	0
	OneClassSVM	0	2 (Pre, Acc)	1 (Rec)
	DBSCAN	0	0	3 (F1, Acc, AUC)
SVC	LOF	5 (Pre, Rec, F1, Acc, AUC)	2 (Rec, Acc)	0
	EE	0	2 (F1, AUC)	0
	iForest	0	0	0
	OneClassSVM	0	0	2 (Rec, F1)
	DBSCAN	0	1 (Pre)	3 (Pre, Acc, AUC)
GNB	LOF	1 (Pre)	0	0
	EE	3 (F1, Acc, AUC)	2 (Acc, AUC)	0
	iForest	0	0	0
	OneClassSVM	1 (Rec)	2 (Rec, F1)	3 (Pre, Rec, F1)
	DBSCAN	0	1 (Pre)	2 (Acc, AUC)
MLP	LOF	3 (Pre, Acc, AUC)	3 (Pre, Acc, AUC)	1 (Pre)
	EE	0	2 (Rec, F1)	0
	iForest	2 (Rec, F1)	0	0
	OneClassSVM	0	0	2 (F1, Acc)
	DBSCAN	0	0	2 (Rec, AUC)
KNN	LOF	5 (Pre, Rec, F1, Acc, AUC)	5 (Pre, Rec, F1, Acc, AUC)	0
	EE	0	0	0
	iForest	0	0	0
	OneClassSVM	0	0	0
	DBSCAN	0	0	5 (Pre, Rec, F1, Acc, AUC)
RF	LOF	1 (Rec)	2 (F1, Acc)	0
	EE	1 (Pre)	0	0
	iForest	3 (F1, Acc, AUC)	2 (Acc, AUC)	0
	OneClassSVM	0	0	4 (Pre, Rec, F1, AUC)
	DBSCAN	0	1 (Pre)	1 (Acc)
AdaBoost	LOF	4 (Pre, Rec, F1, AUC)	4 (Pre, Rec, F1, AUC)	0
	EE	0	0	0
	iForest	0	0	0
	OneClassSVM	0	0	5 (Pre, Rec, F1, Acc, AUC)
	DBSCAN	1 (Acc)	1 (Acc)	0
GB	LOF	1 (Rec)	0	0
	EE	0	0	0
	iForest	0	2 (F1, Acc)	0
	OneClassSVM	1 (AUC)	1 (Pre)	5 (Pre, Rec, F1, Acc, AUC)
	DBSCAN	3 (Pre, F1, Acc)	2 (Rec, AUC)	0
HGB	LOF	4 (Pre, Rec, F1, Acc)	0	0
	EE	0	0	0
	iForest	1 (AUC)	0	0
	OneClassSVM	0	5 (Pre, Rec, F1, Acc, AUC)	5 (Pre, Rec, F1, Acc, AUC)
	DBSCAN	0	0	0
ExtraTrees	LOF	1 (AUC)	0	0
	EE	0	0	0
	iForest	4 (Pre, Rec, F1, Acc)	5 (Pre, Rec, F1, Acc, AUC)	0
	OneClassSVM	0	0	4 (Pre, Rec, F1, AUC)
	DBSCAN	0	0	1 (Acc)
Proposed Model	LOF	0	5 (Pre, Rec, F1, Acc, AUC)	0
	EE	0	0	0
	iForest	2 (Pre, F1)	0	0
	OneClassSVM	0	0	5 (Pre, Rec, F1, Acc, AUC)
	DBSCAN	3 (Rec, Acc, AUC)	0	0

Pre, precision; Rec, recall; F1, F1 score; Acc, accuracy; AUC, area under the ROC curve.

Table 9. The hyperparameter values of the HGB estimator.

Hyperparameter	Dataset I	Dataset II	Dataset III
max_iter	105	100	700
max_depth	None	None	5
l2_regularization	0	1	0.5
random_state	0	42	0
learning_rate	0.1	0.1	0.01

Table 10. The hyperparameter values of the bagging classifier.

Hyperparameter	Dataset I	Dataset II	Dataset III
Estimator	HistGradientBoostingClassifier	HistGradientBoostingClassifier	HistGradientBoostingClassifier
n_estimators	20	19	16
max_samples	1.0	1.0	1.0
max_features	1.0	1.0	1.0
random_state	0	42	0

Table 11. Results of paired t-tests comparing the proposed model's performance before and after hyperparameter tuning for datasets I, II, and III.

Paired t-Test Metric	Dataset I	Dataset II	Dataset III
h	1	1	1
p-value	0.000	0.000	0.957
t (calculated)	−39.555	−12.728	0.057
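
The quantities in Table 11 can be obtained with a paired t-test such as `scipy.stats.ttest_rel`. The sketch below assumes that h follows the common convention of 1 for rejecting the null hypothesis of equal means at significance level α and 0 otherwise; the scores and α = 0.05 are hypothetical and are not taken from the study.

```python
from scipy import stats

# Hypothetical paired scores before and after tuning (not the paper's data).
before = [0.951, 0.948, 0.953, 0.950, 0.949]
after = [0.961, 0.959, 0.962, 0.960, 0.958]

alpha = 0.05  # assumed significance level
t_stat, p_value = stats.ttest_rel(before, after)
# h = 1 rejects the null hypothesis of equal means; h = 0 fails to reject it.
h = int(p_value < alpha)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, h = {h}")
```

With the arguments in this order, a negative t statistic indicates that the scores after tuning are higher on average, and under this convention a p-value above α gives h = 0.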

Table 12. Our proposed CVD model compared with previous studies.

Reference	Method	Precision	Recall	F1 Score	Accuracy	AUC
[68]	RF	-	0.8000	-	0.7300	-
[69]	Ensemble Method (RF + NB + GB)	-	-	-	0.9416	0.9400
[70]	Neural Network	-	-	-	0.7182	-
[71]	Stacking (KNN + RF + SVC + LR)	0.7601	0.6680	-	0.7510	-
[72]	SMOTE + RF	0.8300	0.8500	0.8400	0.8600	-
[8]	XGBH	-	-	-	-	0.8030
[73]	MLP	-	-	-	0.8727	0.9500
Proposed Model (Utilizing Dataset I)	Bagging with HGB + LOF + IG	0.9390	0.9883	0.9630	0.9625	0.9916

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Fitriyani, N.L.; Syafrudin, M.; Chamidah, N.; Rifada, M.; Susilo, H.; Aydin, D.; Qolbiyani, S.L.; Lee, S.W. A Novel Approach Utilizing Bagging, Histogram Gradient Boosting, and Advanced Feature Selection for Predicting the Onset of Cardiovascular Diseases. Mathematics 2025, 13, 2194. https://doi.org/10.3390/math13132194

