Predictive Analysis of Chronic Kidney Disease in Machine Learning

Haider, Husnain Ali; Hussain, Manzoor; Kharisma, Ivana Lucia

doi:10.3390/engproc2025107118

Open Access Proceeding Paper

Predictive Analysis of Chronic Kidney Disease in Machine Learning^†

by Husnain Ali Haider ^1,*, Manzoor Hussain ^2 and Ivana Lucia Kharisma ^3

^1 Department of Software Engineering, University of Sialkot, Sialkot 51040, Pakistan
^2 Department of Computer Science, Indus University, Karachi 75300, Pakistan
^3 Department of Informatics Engineering, Nusa Putra University, Sukabumi 43152, West Java, Indonesia
^* Author to whom correspondence should be addressed.
^† Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.

Eng. Proc. 2025, 107(1), 118; https://doi.org/10.3390/engproc2025107118

Published: 29 September 2025

(This article belongs to the Proceedings of The 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society)


Abstract

Chronic kidney disease (CKD) is a multifactorial, slowly progressing systemic disease and a growing global health problem that burdens healthcare systems. Patients who are diagnosed before reaching stage 5 CKD, or end-stage renal failure, have a better chance of a favorable outcome. This work uses 1659 patient records whose variables cover demographics, lifestyle, and the clinical biochemistry of CKD. Supervised machine learning techniques, namely Random Forest, K-Nearest Neighbors (KNN), Logistic Regression, and Naïve Bayes, were applied and evaluated using standard performance metrics such as accuracy, precision, and recall. Of these, [best algorithm] achieved [accuracy value]% predictive accuracy for CKD and can therefore support the diagnosis of CKD at an early stage. This work offers a framework and results that support the development of data-integrated approaches in healthcare and improve disease control and management.

Keywords:

chronic kidney disease (CKD); early detection; machine learning; predictive modeling; supervised classification; random forest; logistic regression; K-Nearest Neighbors (KNN); naïve bayes classifier; clinical decision support

1. Introduction

CKD is a slowly progressing and potentially fatal disease that often goes unnoticed in the world's population. It is termed chronic because kidney function declines gradually over months or years, and without treatment the patient progresses to End-Stage Renal Disease (ESRD). CKD, alone or together with complications such as cardiovascular disease, hypertension, or diabetes, is a major contributor to morbidity and mortality [1]. The World Health Organization (WHO) reports that the rate at which CKD develops is rising steadily worldwide and stresses the need for screening and for management strategies that help healthcare systems and patients achieve better outcomes [2]. The diagnostic tools used in the routine management of CKD include clinical evaluation and biochemical investigations such as blood urea nitrogen (BUN), serum creatinine, and estimated glomerular filtration rate (eGFR). However, these techniques often detect CKD only at a late stage, whereas many interventions are effective only when started in the early phase [3]. Other factors, such as regional differences in clinical care, delayed diagnosis, and a general lack of appropriate care for the disease, compound the problem. This calls for new approaches that predict CKD in its early stages. In recent years, several applications of Artificial Intelligence (AI) and Machine Learning (ML) have emerged, particularly in healthcare for risk analysis and monitoring [4]. Machine learning, a branch of AI, is well suited to large datasets, which places it in a good position to screen for CKD at an early stage [5,6]. This research uses hospital data from 1659 patients, including age, sex, Body Mass Index (BMI), blood pressure, and lifestyle habits.

The aim is to build and test CKD prediction models using machine learning and to compare several classifiers: Random Forest, KNN, Logistic Regression, and Naïve Bayes. Based on accuracy, precision, recall, and F1 score, this research ranks the models to identify the most effective one for the early diagnosis of CKD. The study also examines preprocessing techniques that can improve the stability, accuracy, and efficiency of the resulting models [7]. Its contribution lies in supporting the early identification of CKD. Such advances could help clinicians intervene in time, reduce the burden of CKD-related complications, improve patient prognosis, and raise patients' quality of life. This paper is structured as follows: Section 2 presents the methods used; Section 3 presents the findings; and Section 4 gives the conclusions and pointers for further research.

The approach in [8] shows a marked improvement over earlier studies in specificity and precision. Another study compared several algorithms for CKD prediction, namely logistic regression, random forest, decision trees, k-nearest neighbors, and support vector machines with different kernels [9].

Reported accuracy ranged from 86% with naïve Bayes to 91% with the decision tree [10]. Another study examined the effects of 37 lifestyle characteristics on CKD risk using a Light Gradient Boosting Machine (LightGBM) and a Cox proportional hazards model, with participants drawn from the UK Biobank cohort. A higher unhealthy-lifestyle score was associated with greater risk. For lifestyle interventions to prevent CKD, the model had a C-statistic of 0.71 [11]. A comparison of nine predictive models based on blood-derived tests for predicting CKD severity found that logistic regression achieved the highest AUC, 0.873. However, ref. [12] emphasized that despite the growing enthusiasm for machine learning in CKD, current applications remain at an early stage and face challenges in clinical translation. Ref. [13] provided a systematic rethink of CKD management, applying ML in the context of multidisciplinary cooperation in precision medicine. Ref. [14] addresses CKD in the global health context using machine learning algorithms that improve the identification of early stages of the disease; it incorporates multi-class stage prediction with variants of Random Forest, Support Vector Machine (SVM), and Decision Tree, along with improved feature selection approaches. Using Ethiopian data, the models achieved high accuracy (99.8% for binary and 79% for multi-class classification) [15]. Ref. [16] examines how the transition from CKD to end-stage kidney disease (ESKD) can be modelled with ML; in that study, random forest achieved an AUC of 0.81, close to the performance of the Kidney Failure Risk Equation (KFRE), with higher sensitivity for patient screening [17,18].

2. Proposed Methodology

The methodological approach of this research combines several ML classifiers in a single workflow to improve the diagnosis of CKD. The dataset used in this investigation was obtained from [source; for instance, UCI Repository or Kaggle] and consists of clinical and laboratory factors of CKD. Data cleaning and other preprocessing tasks, which are critical for almost any learning algorithm, were performed with RapidMiner, a widely used environment for prototyping, testing, and operating different kinds of ML models. The first step in preparing the dataset was to handle missing values and convert categorical attributes into binary form [18]. The next step was to test multiple machine learning models, namely KNN, Decision Tree (DT), and Logistic Regression. In addition, ensemble learning methods, which aggregate the predictions of multiple models, were studied to improve the accuracy of the system. Throughout this work, ethical issues such as data privacy and fairness were handled responsibly. The main goal is to develop a dependable and reportable CKD diagnostic infrastructure that supports care professionals in making timely and accurate decisions. The RapidMiner features used included its machine learning algorithms and data preprocessing workbenches. The algorithms explored in this study are detailed below:

2.1. Algorithms

In this study, a range of machine learning algorithms was applied to the pre-processed dataset to evaluate their performance in predicting CKD. Each algorithm was selected based on its suitability for classification problems, interpretability, and prior success in healthcare-related studies. The following subsections provide a brief overview of the algorithms explored, along with their working principles and relevance to our task.

2.1.1. K-Nearest Neighbors

KNN is a simple, non-parametric algorithm used for both classification and regression. It assigns a data point to the class that is most common among its nearest neighbors in feature space.
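The majority-vote rule described above can be illustrated with a minimal sketch in plain Python. This is not the RapidMiner operator used in the study; the function name `knn_predict`, the Euclidean distance, the value `k=3`, and the toy two-feature points are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Count the labels of the k nearest points and keep the most frequent one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled feature vectors: 0 = no CKD, 1 = CKD.
train_X = [(1.0, 1.1), (1.2, 0.9), (0.9, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.0, 1.0)))
print(knn_predict(train_X, train_y, (3.0, 3.0)))
```

Because the rule depends on distances, features on very different scales would normally be normalized first; the toy points here are assumed to be on a common scale.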

2.1.2. Decision Tree

Decision Trees are tree structures used for classification and regression. Internal nodes represent a test on an attribute, branches represent the outcomes of that test, and leaf nodes represent classes. To split nodes, we compared Entropy and Information Gain. Decision trees are highly interpretable: drawing the tree makes the decision-making structure easy to see.
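To make the splitting criterion concrete, the sketch below computes Shannon entropy and the information gain of a candidate split. It is an illustrative plain-Python version under our own assumptions, not the study's RapidMiner configuration; the helper names `entropy` and `information_gain` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(labels, groups):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = len(labels)
    weighted = sum(len(group) / total * entropy(group) for group in groups)
    return entropy(labels) - weighted

parent = [1, 1, 0, 0]
# A split that separates the classes perfectly recovers all of the parent's entropy.
print(information_gain(parent, [[1, 1], [0, 0]]))
# A split that leaves both children as mixed as the parent gains nothing.
print(information_gain(parent, [[1, 0], [1, 0]]))
```

A tree learner evaluates this quantity for each candidate attribute test and chooses the test with the highest gain.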

2.1.3. Logistic Regression

Logistic regression is well suited to the detection of CKD because it estimates the probability of a binary event. It models the probability that an observation belongs to the positive class as a function of the independent variables. It uses the sigmoid function to map a linear combination of the inputs to a probability, which is then used to separate the data into two discrete outcomes.
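The mapping from a linear score to a class label can be sketched as follows. The weights, bias, and the 0.5 threshold are hypothetical values chosen for illustration; in practice they would be fitted to the training data, and nothing here reflects the coefficients of the model in this study.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Binary decision: 1 if the estimated probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for two standardized features.
weights, bias = [1.5, -0.8], 0.2
print(predict_proba([1.0, 0.5], weights, bias))
print(predict([1.0, 0.5], weights, bias))
```

Lowering the threshold below 0.5 trades precision for recall, which can be relevant in a screening setting where missed cases are costly.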

2.1.4. Vote Ensemble

Vote Ensemble improves on individual classifiers by combining their outputs into a final classification. Each classifier predicts a class, and the class that receives the most votes is selected as the final prediction. In this study, we used hard voting, in which the votes from all models are weighted equally. A Vote Ensemble can reduce model bias and variance and improve generalization.
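Hard voting with equal weights reduces to a per-sample majority count, as in the minimal sketch below. The function name `hard_vote` and the three prediction lists are hypothetical and stand in for the outputs of the base classifiers; this is not the RapidMiner Vote operator itself.

```python
from collections import Counter

def hard_vote(predictions):
    """Combine equal-length per-model prediction lists by majority vote per sample."""
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest count for this sample.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three base models on four patients.
knn_pred = [1, 0, 1, 1]
tree_pred = [1, 1, 0, 1]
logreg_pred = [0, 0, 1, 1]

print(hard_vote([knn_pred, tree_pred, logreg_pred]))  # [1, 0, 1, 1]
```

With an odd number of models and two classes, ties cannot occur; with an even number of models, a tie-breaking rule would have to be chosen explicitly.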

2.1.5. Dataset Description

This study builds on a detailed CKD dataset containing a broad array of medical and clinical measures. It contains both continuous and categorical data on patient health characteristics related to CKD diagnosis. This allows fine-grained data analysis and the construction of accurate early-stage diagnostic models. The primary attributes in the dataset, representing critical medical measurements, are listed below:

• Age: The patient's age in completed years; patients can be categorized as adult or pediatric.
• Blood Pressure (BP): A numeric variable containing diastolic blood pressure.
• Specific Gravity (SG): An index of the concentration of the urine.
• Albumin (ALB): A marker of the protein level in the urine, important for diagnosing kidney problems.
• Sugar (SU): A nominal variable indicating the presence or absence of sugar in the urine.
• Red Blood Cells (RBC): A dichotomous measure of red blood cell presence, categorized according to seven intensity levels.
• Pus Cell (PC)
• Serum Creatinine (SC): A blood chemical that indicates whether the kidneys are functioning effectively.
• Sodium (SOD) and Potassium (POT): Electrolytes that are vital for understanding kidney function.
• Hemoglobin (HB): An index that gives information about red blood cell concentration and is used to detect anemia.
• Diabetes Mellitus (DM) and Hypertension (HTN): Dummy variables equal to 1 if the patient has diabetes or high blood pressure, respectively, and 0 otherwise.
• Chronic Kidney Disease (CKD): The target variable, equal to 1 if the patient has CKD and 0 otherwise.

Further features, including red cell count, white blood cell (WBC) count, and other laboratory results, enrich the data further. Together, these attributes form a rich set of factors for building CKD diagnostic models, which in turn benefits advanced healthcare data analysis and decision making.
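Since the attributes mix numeric and two-valued categorical fields, the preprocessing described in Section 2 (handling missing values and converting categorical attributes to binary form) can be sketched in plain Python. The study performed these steps in RapidMiner; the mean and mode imputation used here, the function name `impute_and_encode`, the lowercase column names, and the sample records are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean

def impute_and_encode(rows, numeric_cols, binary_cols):
    """Fill missing values (None) and map two-valued categories to 0/1.

    binary_cols maps a column name to the category that is encoded as 1.
    """
    cleaned = [dict(row) for row in rows]  # copy so the input is not mutated
    for col in numeric_cols:
        # Assumed strategy: replace missing numeric values with the column mean.
        fill = mean(r[col] for r in cleaned if r[col] is not None)
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
    for col, positive in binary_cols.items():
        # Assumed strategy: replace missing categories with the most frequent one.
        observed = [r[col] for r in cleaned if r[col] is not None]
        fill = Counter(observed).most_common(1)[0][0]
        for r in cleaned:
            value = fill if r[col] is None else r[col]
            r[col] = 1 if value == positive else 0
    return cleaned

# Hypothetical records using three of the attributes listed above.
records = [
    {"age": 48, "sc": 1.2, "htn": "yes"},
    {"age": None, "sc": 0.8, "htn": "no"},
    {"age": 62, "sc": None, "htn": None},
    {"age": 55, "sc": 1.0, "htn": "no"},
]
cleaned = impute_and_encode(records, ["age", "sc"], {"htn": "yes"})
for row in cleaned:
    print(row)
```

In a real pipeline the fill values would be computed on the training split only and then reused on the test split, so that no information leaks from the evaluation data.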

3. Results

The model in our research achieved an accuracy of 92.37%, demonstrating its effectiveness in classifying the data. The confusion matrix (Figure 1) shows that the model correctly identified 452 positive cases and 8 negative cases, while misclassifying 36 positives as negatives and 2 negatives as positives. Recall was 18.16% for Class 1 (predicted false) and 96.56% for Class 2 (predicted true), indicating strong performance on the positive class. Precision was 60.00% for Class 1 (predicted false) and 92.92% for Class 2 (predicted true), suggesting that the model was more reliable when predicting positive cases. Figure 1 summarizes both the confusion matrix and these key performance metrics.
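As a worked check of how such metrics follow from confusion-matrix counts, the sketch below applies the standard definitions to the four counts quoted in the text. The function name `classification_metrics` and the mapping of the counts to true/false positives and negatives are our reading of the prose; the per-class precision and recall obtained this way depend on how the rows and columns of Figure 1 are oriented and need not coincide with the percentages reported above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision_pos": tp / (tp + fp),  # of predicted positives, share correct
        "recall_pos": tp / (tp + fn),     # of actual positives, share found
        "precision_neg": tn / (tn + fn),  # of predicted negatives, share correct
        "recall_neg": tn / (tn + fp),     # of actual negatives, share found
    }

# Counts as stated in the text: 452 and 8 correct, 36 positives predicted
# negative (false negatives), 2 negatives predicted positive (false positives).
metrics = classification_metrics(tp=452, tn=8, fp=2, fn=36)
print(f"accuracy = {metrics['accuracy']:.2%}")  # accuracy = 92.37%
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The accuracy follows directly as (452 + 8) / 498, which matches the reported 92.37%.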

4. Conclusions

Screening for CKD and identifying it at an early stage are very important for improving patient outcomes and reducing healthcare costs. As the results of this research show, basic clinical characteristics, including blood pressure, specific gravity, albumin concentration, and serum creatinine level, affect CKD prediction. Applied through a structured, systematic workflow in RapidMiner, machine learning techniques proved to be a valuable tool for the classification of CKD. The paper provides insight into data preprocessing and feature selection and presents a method of model assessment aimed at high predictive accuracy. Decision Tree, Logistic Regression, and k-Nearest Neighbors each deliver different performance and make their own contribution to the classification process. Ensemble methods, which combine the models, can further increase the reliability of predictions.

The outcome of this research underlines the importance of data-analytic approaches in the provision of healthcare services. By gaining insight into health-related biomarkers and applying machine learning to clinical data, clinicians and researchers will be able to better understand the progression of CKD and develop individualized treatments that may help avoid adverse outcomes. Future work can use larger datasets and more refined models to make the diagnostic process even more accurate. Finally, this paper contributes to a body of research that seeks to apply machine learning in clinical workflows in a way that supports a more anticipatory and efficient system of healthcare delivery.

Author Contributions

H.A.H. was responsible for the conceptualization and methodology of the study, data curation, and writing the original draft of the manuscript. He also supervised the overall research project. M.H. contributed to the software development, validation, formal analysis, and investigation of the research, ensuring the accuracy and reliability of the results. I.L.K. played a key role in providing resources, reviewing and editing the manuscript, and creating visualizations to support the findings of the study. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were not required for this study, as the dataset used was fully anonymized and obtained from publicly available sources.

Informed Consent Statement

Patient consent was waived due to the retrospective and anonymized nature of the dataset.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Dritsas, E.; Trigka, M. Machine learning techniques for chronic kidney disease risk prediction. Big Data Cogn. Comput. 2022, 6, 98.
2. Ghosh, P.; Shamrat, F.J.M.; Shultana, S.; Afrin, S.; Anjum, A.A.; Khan, A.A. Optimization of prediction method of chronic kidney disease using machine learning algorithm. In Proceedings of the 2020 15th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Bangkok, Thailand, 18–20 November 2020; IEEE: New York, NY, USA, 2020; pp. 1–6.
3. Nishat, M.M.; Faisal, F.; Dip, R.R.; Nasrullah, S.M.; Ahsan, R.; Shikder, F.; Asif, M.A.; Hoque, M.A. A Comprehensive Analysis on Detecting Chronic Kidney Disease Using Machine Learning. EAI Endorsed Trans. Pervasive Health Technol. 2021, 7, 15.
4. Islam, M.A.; Majumder, M.Z.H.; Hussein, M.A. Chronic Kidney Disease Prediction Based on Machine Learning Algorithms. J. Pathol. Inform. 2023, 8, 20–23.
5. Gupta, R.; Koli, N.; Mahor, N.; Tejashri, N. Performance analysis of machine learning classifier for predicting chronic kidney disease. In Proceedings of the 2020 International Conference for Emerging Technology (INCET), Belgaum, India, 5–7 June 2020; pp. 1–4.
6. Emon, M.U.; Islam, R.; Keya, M.S.; Zannat, R. Performance analysis of chronic kidney disease through machine learning approaches. In Proceedings of the 2021 6th International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 20–22 January 2021; pp. 713–719.
7. Arif, M.S.; Mukheimer, A.; Asif, D. Enhancing the early detection of chronic kidney disease: A robust machine learning model. Big Data Cogn. Comput. 2023, 7, 144.
8. Mondol, C.; Shamrat, F.J.M.; Hasan, M.R.; Alam, S.; Ghosh, P.; Tasnim, Z.; Ahmed, K.; Bui, M.F.; Ibrahim, S.M. Early prediction of chronic kidney disease: A comprehensive performance analysis of deep learning models. Algorithms 2022, 15, 308.
9. Swain, K.P.; Nayak, R.K.; Swain, A.; Nayak, S.R. The Impact of Machine Learning on Chronic Kidney Disease: Analysis and Insights. In Healthcare Industry Assessment: Analyzing Risks, Security, and Reliability; Springer Nature: Berlin, Germany, 2024; pp. 121–148.
10. Vupputuri, S.; Sandler, D.P. Lifestyle risk factors and chronic kidney disease. Ann. Epidemiol. 2003, 13, 712–720.
11. Khalid, F.; Alsadoun, L.; Khilji, F.; Mushtaq, M.; Eze-Odurukwe, A.; Mushtaq, M.M.; Ali, H.; Farman, O.R.; Ali, M.S.; Fatima, R.; et al. Predicting the progression of chronic kidney disease: A systematic review of artificial intelligence and machine learning approaches. Cureus 2024, 16, e60145.
12. Delrue, C.; De Bruyne, S.; Speeckaert, M.M. Application of machine learning in chronic kidney disease: Current status and future prospects. Biomedicines 2024, 12, 568.
13. Dutta, S.; Sikder, R.; Islam, M.R.; Al Mukaddim, A.; Hider, M.A.; Nasiruddin, M. Comparing the Effectiveness of Machine Learning Algorithms in Early Chronic Kidney Disease Detection. J. Comput. Sci. Technol. Stud. 2024, 6, 77–91.
14. Arora, A.; Sehgal, C.; Agarwal, N. An analysis of machine learning algorithms for chronic kidney disease prediction. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; pp. 581–586.
15. Rahat, M.A.R.; Islam, M.T.; Cao, D.M.; Tayaba, M.; Ghosh, B.P.; Ayon, E.H.; Nob, N.; Akter, A.; Rahman, M.; Bhuiyan, M.S. Comparing machine learning techniques for detecting chronic kidney disease in early stage. J. Comput. Sci. Technol. Stud. 2024, 6, 20–32.
16. Diwaker, C.; Tomar, P.; Solanki, A.; Nayyar, A.; Jhanjhi, N.Z.; Abdullah, A.; Supramaniam, M. A New Model for Predicting Component-Based Software Reliability Using Soft Computing. IEEE Access 2019, 7, 147191–147203.
17. Kok, S.H.; Abdullah, A.; Jhanjhi, N.Z.; Supramaniam, M. A review of intrusion detection system using machine learning approach. Int. J. Eng. Res. Technol. 2019, 12, 8–15.
18. Faisal, A.; Jhanjhi, N.Z.; Ashraf, H.; Ray, S.K.; Ashfaq, F. A Comprehensive Review of Machine Learning Models: Principles, Applications, and Optimal Model Selection. Authorea Prepr. 2025. Available online: https://www.techrxiv.org/users/662346/articles/1279092-a-comprehensive-review-of-machine-learning-models-principles-applications-and-optimal-model-selection?commit=49746fcb1a7a00742ebbbf9daf367aba3c9cf1b3 (accessed on 2 February 2025).

Figure 1. Performance analysis.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

