Proceeding Paper

Optimizing Machine Learning for Healthcare Applications: A Case Study on Cardiovascular Disease Prediction Through Feature Selection, Regularization, and Overfitting Reduction †

by Lamiae Eloutouate 1, Hicham Gibet Tani 2, Lotfi Elaachak 1, Fatiha Elouaai 1 and Mohammed Bouhorma 1

1 FSTT, Abdelmalek Essaadi University, Tetouan 93000, Morocco
2 FPL, Abdelmalek Essaadi University, Tetouan 93000, Morocco
Presented at the International Conference on Sustainable Computing and Green Technologies (SCGT’2025), Larache, Morocco, 14–15 May 2025.
Comput. Sci. Math. Forum 2025, 10(1), 13; https://doi.org/10.3390/cmsf2025010013
Published: 7 July 2025

Abstract

The application of machine learning (ML) to medical datasets offers significant potential for improving disease prediction and patient outcomes. However, challenges such as feature redundancy, overfitting, and suboptimal model performance limit the practical effectiveness of ML algorithms. This study focuses on optimizing ML techniques for cardiovascular disease prediction using the Kaggle Cardiovascular Disease dataset. We systematically apply feature selection methods, including correlation analysis and regularization techniques (L1/L2), to identify the most relevant attributes and address multicollinearity. Advanced ensemble models such as Random Forest, XGBoost, and LightGBM are employed to mitigate overfitting and enhance predictive performance. Through hyperparameter tuning and stratified k-fold cross-validation, we ensure model robustness and generalizability. The results demonstrate that ensemble methods, particularly gradient boosting algorithms, outperform traditional models, achieving superior predictive accuracy and stability. This study highlights the importance of algorithm optimization in ML applications for healthcare, offering a replicable framework for medical datasets and paving the way for more effective diagnostic tools in cardiovascular health.

1. Introduction

Cardiovascular diseases (CVDs) remain the leading cause of mortality worldwide, accounting for nearly 20.5 million deaths annually [1]. Early and accurate prediction of CVDs can significantly enhance patient outcomes by enabling timely interventions and personalized treatment plans [2]. In recent years, machine learning (ML) has emerged as a powerful tool in healthcare, offering the ability to process complex datasets and uncover patterns that traditional statistical methods may overlook. By leveraging ML, healthcare professionals can develop predictive models to assess a patient’s risk of developing cardiovascular conditions based on diverse clinical and demographic data [3].
The Kaggle Cardiovascular Disease dataset, shared by data scientist Svetlana Ulianova, provides a rich source of structured medical data, including features such as age, gender, cholesterol levels, and smoking status. This dataset is widely used in research to develop and evaluate predictive ML models for cardiovascular risk assessment. Its comprehensive nature makes it an excellent candidate for exploring advanced ML techniques and their impact on prediction accuracy [4].
Despite significant progress in applying ML to medical datasets, prior studies often focus on comparing algorithmic performance without fully optimizing their implementations. Key challenges such as feature redundancy, multicollinearity, and overfitting are frequently overlooked, leading to suboptimal and less generalizable models [5]. Additionally, many studies do not incorporate systematic feature selection, regularization techniques, or robust validation frameworks, which are crucial for improving model interpretability and reliability in clinical settings.
This study aims to address these gaps by systematically optimizing ML algorithms for cardiovascular disease prediction. Specifically, we achieve the following:
  • Perform feature selection using correlation analysis, Recursive Feature Elimination (RFE), and mutual information classifiers.
  • Apply L1/L2 regularization to address multicollinearity and enhance model interpretability.
  • Mitigate overfitting through the use of Random Forest and Gradient Boosting models.
  • Implement cross-validation and data augmentation to ensure robust and generalizable predictions.
Through these methods, we seek to develop a replicable framework for optimizing ML models on medical datasets, ultimately contributing to the advancement of ML applications in healthcare.

2. Related Work

The application of machine learning (ML) in healthcare has grown rapidly, driven by the availability of large medical datasets and the need for precise diagnostic tools. Studies have explored ML techniques for predicting cardiovascular diseases (CVDs), demonstrating their potential in enhancing clinical decision making [6]. Traditional algorithms like Logistic Regression, SVM, and Decision Trees have been used for CVD prediction, but they often struggle with complex, high-dimensional datasets [6]. In contrast, ensemble methods like Random Forest and Gradient Boosting have shown improved accuracy by leveraging multiple weak learners, making them effective for noisy and imbalanced medical data [7]. Neural networks and deep learning models, while capable of capturing non-linear relationships, are limited by the need for large labeled datasets [8].
Feature selection is critical in healthcare ML applications. Techniques like Recursive Feature Elimination (RFE) and correlation analysis have been shown to improve model performance and interpretability, though their application is not always systematic [9]. Overfitting remains a challenge, particularly with limited datasets, and regularization techniques like L1/L2 penalties are underutilized [10]. Robust validation frameworks, such as k-fold cross-validation, are also frequently overlooked, leading to inflated performance metrics [10]. Additionally, data augmentation for tabular medical data, such as generating synthetic samples, remains largely unexplored [11].
In previous work [12], ML algorithms were applied to a cardiovascular dataset in a smart home environment for medical surveillance. The study evaluated algorithms like Linear Regression, Logistic Regression, and Multi-Layer Perceptron (MLP), with MLP achieving the highest accuracy (99.80%). However, the study lacked advanced optimization techniques like feature selection and regularization, potentially leading to overfitting and suboptimal generalization.
This study builds on these findings by integrating advanced optimization strategies, including feature selection, regularization, and robust validation, to address prior limitations. Using the Kaggle Cardiovascular Disease dataset, we aim to develop a replicable methodology that improves predictive accuracy and model reliability for CVD prediction.

3. Methodology

3.1. Dataset Description

The study uses the Kaggle Cardiovascular Disease dataset (CVDd) [4], comprising 70,000 records and 11 features, including age, height, weight, gender, blood pressure, and cholesterol levels, all essential for predicting cardiovascular disease. Preprocessing steps included the following:
  • Handling Missing Values and Categorical Variables: No missing values or categorical variables were present, avoiding the need for imputation or encoding.
  • Normalization/Standardization: Numerical features were scaled using Robust Scaler [13], which centers data around the median and scales it based on the interquartile range (IQR), making it less sensitive to outliers (Figure 1) compared to Z-score normalization.
  • Class Imbalance Verification: After verifying the class distribution, we found that the dataset was balanced, meaning no additional techniques (e.g., SMOTE) were required to address class imbalance (Figure 2).
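As a sketch of the scaling step, the following uses scikit-learn's RobustScaler on synthetic stand-in data (the real CVDd columns are not reproduced here, so the feature distributions below are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for three numerical CVDd features
# (roughly age, height, weight; values are illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(loc=[50, 165, 75], scale=[7, 10, 15], size=(1000, 3))

# RobustScaler centers each column on its median and scales by its
# interquartile range (IQR), so extreme outliers influence the transform
# far less than with z-score normalization.
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column's median is ~0 and its IQR is ~1.
medians = np.median(X_scaled, axis=0)
iqrs = np.percentile(X_scaled, 75, axis=0) - np.percentile(X_scaled, 25, axis=0)
print(medians, iqrs)
```

The same `scaler` fitted on the training split would then be applied, unchanged, to the validation and test splits to avoid leakage.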

3.2. Feature Selection

To identify relevant features and reduce redundancy, we used correlation analysis (Figure 3).
  • Correlation Analysis: Pearson correlation coefficients [14,15,16] were calculated to measure the linear relationship between each feature and the target variable.
    Features with correlation coefficients below a threshold of 0.2 were considered less relevant and removed. This threshold was chosen to strike a balance between retaining features with meaningful relationships to the target variable and eliminating those with negligible contributions, thereby reducing dimensionality while preserving predictive power. The Pearson correlation coefficient for two variables X and Y is given by the following expression:
    r_XY = cov(X, Y) / (σ_X σ_Y)
    where cov(X, Y) is the covariance between X and Y, and σ_X and σ_Y are the standard deviations of X and Y, respectively.
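The thresholded correlation filter described above can be sketched as follows. The data is synthetic and the helper name `select_by_correlation` is ours, not from the paper; it keeps only columns whose absolute Pearson correlation with the target meets the 0.2 threshold:

```python
import numpy as np

def select_by_correlation(X, y, threshold=0.2):
    """Return indices of columns whose |Pearson r| with y meets threshold."""
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            keep.append(j)
    return keep

# Synthetic example: one informative feature, one pure-noise feature.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500).astype(float)
informative = y + rng.normal(scale=0.5, size=500)  # correlated with target
noise = rng.normal(size=500)                       # uncorrelated
X = np.column_stack([informative, noise])

print(select_by_correlation(X, y))
```

With this seed, the informative column survives the filter and the noise column is dropped, mirroring the dimensionality reduction applied to CVDd.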

3.3. Regularization

To address multicollinearity and prevent overfitting, we applied L1 (Lasso) and L2 (Ridge) regularization techniques [17]:
  • L1 Regularization (Lasso): Lasso adds a penalty equal to the absolute value of the magnitude of coefficients to the loss function;
  • L2 Regularization (Ridge): Ridge adds a penalty equal to the square of the magnitude of coefficients; this technique reduces the impact of multicollinearity by discouraging large coefficients, leading to more stable and generalizable models.
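A minimal illustration of the two penalties, using scikit-learn's LogisticRegression on synthetic collinear data. The estimator and the penalty strength are assumptions for the sketch (the study does not fix C = 0.1); the point is the qualitative contrast between the penalties:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Build a dataset with two nearly collinear features plus pure noise.
rng = np.random.default_rng(2)
n = 600
base = rng.normal(size=n)
X = np.column_stack([
    base,
    base + rng.normal(scale=0.05, size=n),  # nearly duplicates column 0
    rng.normal(size=n),                     # uninformative noise
])
y = (base + rng.normal(scale=0.5, size=n) > 0).astype(int)

# L1 (Lasso-style) can drive redundant/noise coefficients exactly to zero;
# L2 (Ridge-style) shrinks correlated coefficients toward each other instead.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
ridge = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)
print("L1 coefficients:", lasso.coef_.round(3))
print("L2 coefficients:", ridge.coef_.round(3))
```

Inspecting the two coefficient vectors shows why L1 doubles as a feature selector while L2 mainly stabilizes correlated coefficients.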

3.4. Model Optimization

To enhance performance and reduce overfitting:
  • Hyperparameter Tuning: Grid Search optimized hyperparameters for each algorithm [18].
  • Validation Strategy: Stratified k-Fold cross-validation (k = 10 or k = 5) preserved class distribution, reducing bias [19].
  • Dataset Splitting: The dataset was split into 70% training, 20% testing, and 10% validation to ensure generalizability [20].
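The three steps above can be sketched together as follows. The data is synthetic, and the Random Forest grid values and the 5-fold choice are illustrative assumptions rather than the study's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the CVDd feature matrix.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# 70% train / 20% test / 10% validation, stratified to preserve class ratios.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=300, stratify=y, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=100, stratify=y_rest, random_state=0)

# Stratified k-fold CV inside Grid Search, scored by AUC-ROC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```

The held-out validation split would be used for interim comparisons (as in Tables 1 and 3), with the test split reserved for the final figures.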

3.5. Machine Learning Algorithms Tested

The following classification and regression algorithms were implemented and tested:
  • Classification: Logistic Regression, Random Forest, SVM, Gradient Boosting (XGBoost, LightGBM, CatBoost), k-NN.
  • Regression: Linear Regression, Random Forest Regressor, SVR, Gradient Boosting Regressor (XGBoost, LightGBM, CatBoost), k-NN.

3.6. Performance Evaluation

Model performance was assessed using the following metrics:
  • Classification Metrics [21]: Accuracy, Precision, Recall, F1-Score, AUC-ROC.
  • Regression Metrics [22]: Mean Squared Error (MSE), Mean Absolute Error (MAE), R2 (Coefficient of Determination).
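All of these metrics are available in scikit-learn. The toy labels and predictions below are illustrative only, not drawn from the study's experiments:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score,
                             mean_squared_error, mean_absolute_error, r2_score)

# Classification: hard predictions for accuracy/precision/recall/F1,
# probability scores for AUC-ROC.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_score))

# Regression: continuous targets vs. continuous predictions.
y_cont = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
print("MSE:", mean_squared_error(y_cont, y_hat))
print("MAE:", mean_absolute_error(y_cont, y_hat))
print("R2:", r2_score(y_cont, y_hat))
```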

4. Results and Discussion

This section presents the results of the experiments conducted on the Kaggle Cardiovascular Disease dataset using various classification and regression algorithms. The performance of each model is evaluated on both the validation and test sets, and the results are summarized in tables for clarity. A detailed analysis, general discussion, and recommendations are provided.

4.1. Classification Results

The performance of the classification algorithms is summarized in Table 1 and Table 2. Key metrics include Accuracy, Precision, Recall, F1-Score, and AUC-ROC.
  • Best Performing Models: XGBoost and LightGBM achieved the highest AUC-ROC scores (0.800 and 0.798, respectively), excelling in capturing complex data relationships.
  • Moderate Performers: Random Forest and CatBoost delivered competitive AUC-ROC scores (~0.797), showing robustness and reduced overfitting.
  • Lower Performers: Logistic Regression and SVM performed moderately, with AUC-ROC scores of 0.782 and 0.784, respectively, but were outperformed by ensemble methods.
  • Weakest Performer: k-NN underperformed (AUC-ROC: 0.759), likely due to sensitivity to data scaling and inability to capture complex patterns.

4.2. Regression Results

The performance of the regression algorithms is summarized in Table 3 and Table 4. Key metrics include Mean Squared Error (MSE), Mean Absolute Error (MAE), and R2.
  • Best Performing Models: CatBoost and LightGBM achieved the highest R2 values (0.282 and 0.280, respectively), excelling in modeling complex relationships and minimizing errors.
  • Moderate Performers: XGBoost and Random Forest Regressor performed well, with R2 values around 0.279 and 0.280, respectively, proving robust for regression tasks.
  • Lower Performers: SVR and k-NN Regressor delivered moderate results (R2: 0.233 and 0.267, respectively), limited by their inability to handle non-linear relationships effectively.
  • Weakest Performer: Linear Regression performed poorly (R2: 0.120), as expected due to its simplicity and inability to capture non-linear patterns.

4.3. General Discussion

The experimental results highlight the following key insights:
  • Superiority of Ensemble Methods: Ensemble methods (XGBoost, LightGBM, CatBoost, Random Forest) outperformed traditional algorithms (Logistic Regression, Linear Regression, k-NN) due to their ability to capture complex, non-linear relationships.
  • Importance of Hyperparameter Tuning: Grid Search significantly improved performance, as seen with XGBoost, LightGBM, and CatBoost.
  • Robustness of Gradient Boosting: XGBoost and LightGBM excelled in handling high-dimensional data and minimizing overfitting.
  • Limitations of Traditional Algorithms: Linear Regression and k-NN underperformed due to simplicity and sensitivity to data scaling, respectively.
  • Balanced Dataset: The dataset was balanced, ensuring stable and reliable results.

4.4. Recommendations

Based on the findings, the following recommendations are proposed:
  • Adopt Ensemble Methods: Prioritize XGBoost, LightGBM, and CatBoost for CVD prediction due to their superior performance.
  • Invest in Hyperparameter Optimization: Use Grid Search or Bayesian Optimization to maximize model performance.
  • Explore Advanced Techniques: Investigate deep learning or hybrid models for further accuracy improvements.
  • Focus on Interpretability: Use SHAP to explain predictions and build trust among healthcare professionals.
  • Validate on External Datasets: Ensure generalizability by testing on external datasets.

5. Conclusions and Future Work

This study highlights the superiority of ensemble methods, particularly XGBoost and LightGBM, for CVD prediction. Future work should focus on external validation, integrating unstructured data, and developing explainable AI techniques to enhance clinical applicability and improve patient outcomes.

Author Contributions

All authors contributed equally to the design, implementation, analysis, and writing of this study. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available upon reasonable request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Di Cesare, M.; Perel, P.; Taylor, S.; Kabudula, C.; Bixby, H.; Gaziano, T.A.; McGhie, D.V.; Mwangi, J.; Pervan, B.; Narula, J.; et al. The Heart of the World. Glob. Heart 2024, 19, 11. [Google Scholar] [CrossRef] [PubMed]
  2. Soham, B.; Ananya, S.; Monalisa, S.; Debasis, S. Novel framework of significant risk factor identification and cardiovascular disease prediction. Expert Syst. Appl. 2025, 263, 125678. [Google Scholar] [CrossRef]
  3. Singh, M.; Kumar, A.; Khanna, N.N.; Laird, J.R.; Nicolaides, A.; Faa, G.; Johri, A.M.; Mantella, L.E.; Fernandes, J.F.E.; Teji, J.S.; et al. Artificial intelligence for cardiovascular disease risk assessment in personalised framework: A scoping review. EClinicalMedicine 2024, 73, 102660. [Google Scholar] [CrossRef] [PubMed]
  4. Svetlana, U. Cardiovascular Disease Dataset. 2019. Available online: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset (accessed on 19 January 2023).
  5. Badawy, M.; Ramadan, N.; Hefny, H.A. Healthcare predictive analytics using machine learning and deep learning techniques: A survey. J. Electr. Syst. Inf. Technol. 2023, 10, 40. [Google Scholar] [CrossRef]
  6. Hossain, S.; Hasan, M.K.; Faruk, M.O.; Aktar, N.; Hossain, R.; Hossain, K. Machine learning approach for predicting cardiovascular disease in Bangladesh: Evidence from a cross-sectional study in 2023. BMC Cardiovasc. Disord. 2024, 24, 214. [Google Scholar] [CrossRef] [PubMed]
  7. Ansyari, M.R.; Mazdadi, M.I.; Indriani, F.; Kartini, D.; Saragih, T.H. Implementation of Random Forest and Extreme Gradient Boosting in the Classification of Heart Disease using Particle Swarm Optimization Feature Selection. J. Electron. Electromed. Eng. Med. Inform. 2023, 5, 250–260. [Google Scholar] [CrossRef]
  8. Dhafer, G.H.; Laszlo, S. A one-dimensional convolutional neural network-based deep learning approach for predicting cardiovascular diseases. Inform. Med. Unlocked 2024, 49, 101535. [Google Scholar] [CrossRef]
  9. Arif, M.P.; Triyanna, W. A systematic literature review: Recursive feature elimination algorithms. J. Ilmu Pengetah. Dan Teknol. Komputer 2024, 9, 2. [Google Scholar] [CrossRef]
  10. Demir-Kavuk, O.; Kamada, M.; Akutsu, T.; Knapp, E.W. Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features. BMC Bioinform. 2011, 12, 412. [Google Scholar] [CrossRef] [PubMed]
  11. Gracia Moisés, A.; Vitoria Pascual, I.; Imas González, J.J.; Ruiz Zamarreño, C. Data Augmentation Techniques for Machine Learning Applied to Optical Spectroscopy Datasets in Agrifood Applications: A Comprehensive Review. Sensors 2023, 23, 8562. [Google Scholar] [CrossRef] [PubMed]
  12. Lamiae, E.; Fatiha, E.; Hicham, G.T.; Mohammed, B. Smart home and machine learning for medical surveillance: Classification algorithms survey. J. Theor. Appl. Inf. Technol. 2021, 99, 12. [Google Scholar]
  13. Ahsan, M.M.; Mahmud, M.A.P.; Saha, P.K.; Gupta, K.D.; Siddique, Z. Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance. Technologies 2021, 9, 52. [Google Scholar] [CrossRef]
  14. Kirch, W. (Ed.) Pearson’s Correlation Coefficient. In Encyclopedia of Public Health; Springer: Dordrecht, The Netherlands, 2008. [Google Scholar] [CrossRef]
  15. Awad, M.; Fraihat, S. Recursive Feature Elimination with Cross-Validation with Decision Tree: Feature Selection Method for Machine Learning-Based Intrusion Detection Systems. J. Sens. Actuator Netw. 2023, 12, 67. [Google Scholar] [CrossRef]
  16. Dhindsa, A.; Bhatia, S.; Agrawal, S.; Sohi, B.S. An Improvised Machine Learning Model Based on Mutual Information Feature Selection Approach for Microbes Classification. Entropy 2021, 23, 257. [Google Scholar] [CrossRef]
  17. Mei, Y.; Ming, K.L.; Yingchi, Q.; Xingzhi, L.; Du, N. Deep neural networks with L1 and L2 regularization for high dimensional corporate credit risk prediction. Expert Syst. Appl. 2023, 213, 118873. [Google Scholar] [CrossRef]
  18. Radzi, S.F.M.; Karim, M.K.A.; Saripan, M.I.; Rahman, M.A.A.; Isa, I.N.C.; Ibahim, M.J. Hyperparameter Tuning and Pipeline Optimization via Grid Search Method and Tree-Based AutoML in Breast Cancer Prediction. J. Pers. Med. 2021, 11, 978. [Google Scholar] [CrossRef] [PubMed]
  19. Omar, C.; José, A.; Denisse, B.; Anthony, G.; Olga, M.; Manuel, Q.; Grégorio, T.; Raul, S. K-Fold Cross-Validation through Identification of the Opinion Classification Algorithm for the Satisfaction of University Students. Int. J. Online Biomed. Eng. (iJOE) 2023, 19, 11. [Google Scholar] [CrossRef]
  20. Xu, Y.; Goodacre, R. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. J. Anal. Test. 2018, 2, 249–262. [Google Scholar] [CrossRef] [PubMed]
  21. Rainio, O.; Teuho, J.; Klén, R. Evaluation metrics and statistical tests for machine learning. Sci. Rep. 2024, 14, 6086. [Google Scholar] [CrossRef] [PubMed]
  22. Alexei, B. Performance Metrics (Error Measures) in Machine Learning Regression, Forecasting and Prognostics: Properties and Typology. arXiv 2018, arXiv:1809.03006. [Google Scholar] [CrossRef]
Figure 1. Boxplot of features to detect outliers on CVDd.
Figure 2. Output class distribution.
Figure 3. Cardiovascular dataset heatmap of correlations for feature selection.
Table 1. Performance of classification algorithms on the validation set.

Algorithm             Accuracy   Precision   Recall   F1-Score   AUC-ROC
Logistic Regression   0.712      0.714       0.712    0.712      0.775
Random Forest         0.729      0.731       0.729    0.728      0.792
XGBoost               0.733      0.734       0.733    0.732      0.795
LightGBM              0.731      0.733       0.731    0.730      0.793
CatBoost              0.729      0.731       0.729    0.729      0.793
k-NN                  0.706      0.706       0.706    0.706      0.753
SVM                   0.720      0.722       0.720    0.719      0.777
Table 2. Performance of classification algorithms on the testing set.

Algorithm             Accuracy   Precision   Recall   F1-Score   AUC-ROC
Logistic Regression   0.716      0.718       0.716    0.715      0.782
Random Forest         0.731      0.733       0.731    0.730      0.797
XGBoost               0.735      0.737       0.735    0.735      0.800
LightGBM              0.732      0.733       0.732    0.731      0.798
CatBoost              0.730      0.731       0.730    0.730      0.797
k-NN                  0.703      0.703       0.703    0.703      0.759
SVM                   0.721      0.723       0.721    0.720      0.784
Table 3. Performance of regression algorithms on the validation set.

Algorithm                 MSE     MAE     R2
Linear Regression         0.219   0.441   0.123
Random Forest Regressor   0.182   0.366   0.274
SVR                       0.193   0.398   0.229
XGBoost Regressor         0.181   0.368   0.274
LightGBM Regressor        0.181   0.366   0.275
CatBoost Regressor        0.181   0.364   0.276
k-NN Regressor            0.185   0.364   0.258
Table 4. Performance of regression algorithms on the testing set.

Algorithm                 MSE     MAE     R2
Linear Regression         0.220   0.442   0.120
Random Forest Regressor   0.180   0.365   0.280
SVR                       0.192   0.396   0.233
XGBoost Regressor         0.180   0.368   0.279
LightGBM Regressor        0.180   0.365   0.280
CatBoost Regressor        0.180   0.363   0.282
k-NN Regressor            0.183   0.362   0.267
