Enhancing the Early Detection of Chronic Kidney Disease: A Robust Machine Learning Model

Abstract: Clinical decision-making in chronic disorder prognosis is often hampered by high variance, leading to uncertainty and negative outcomes, especially in cases such as chronic kidney disease (CKD). Machine learning (ML) techniques have emerged as valuable tools for reducing randomness and enhancing clinical decision-making. However, conventional methods for CKD detection often lack accuracy due to their reliance on limited sets of biological attributes. This research proposes a novel ML model for predicting CKD, incorporating various preprocessing steps, feature selection, a hyperparameter optimization technique


Introduction
CKD presents a significant global health challenge, affecting approximately 850 million people worldwide [1]. The kidneys, vital organs situated on both sides of the spine just below the ribcage, play a crucial role in maintaining the body's internal environment by filtering the blood and removing waste products, excess fluids, and toxins through urine. Additionally, they regulate electrolyte levels, blood pressure, and the acid-base balance, while producing hormones that control calcium metabolism and stimulate red blood cell production [2,3].
CKD is characterized by a progressive and long-term decline in kidney function, leading to an inability to effectively filter waste and maintain fluid and electrolyte balance, resulting in the accumulation of waste products and fluid retention. The burden of CKD is immense, contributing to complications like electrolyte imbalances, bone disorders, anemia, and cardiovascular diseases [4][5][6]. If left untreated, CKD can progress to end-stage renal disease, necessitating dialysis or kidney transplantation [7,8]. Early detection and proper management of CKD are pivotal to preserving kidney function, slowing down the disease progression, and improving patient outcomes [9].
Despite its global prevalence and impact on public health, detecting CKD early and ensuring access to quality kidney care pose significant challenges, particularly in low- and middle-income countries with limited resources [10][11][12]. Traditional methods for CKD detection, such as blood tests and urinalysis, may have limitations in identifying the early stages of kidney damage and might not capture fluctuations in kidney health over time. Invasive procedures like kidney biopsy are unsuitable for routine screening, and imaging tests can be both expensive and time-consuming [13][14][15].
ML methods offer promising solutions to these challenges. ML algorithms can analyze large and complex datasets, improving the accuracy in CKD detection by identifying subtle patterns and trends that may go unnoticed with traditional methods. These models can incorporate various variables, enabling personalized risk assessments and tailored treatment plans. The efficiency of ML algorithms allows for quick processing of new patient data, facilitating timely diagnosis and intervention. Moreover, ML can predict CKD development in high-risk individuals, enabling early preventive measures [16][17][18].
In this paper, we investigate the feasibility and potential benefits of using ML for early CKD diagnosis. Our objective is to develop an ML model that incorporates data imputation, data scaling methods, split ratio, and optimal parameters, while evaluating classifiers based on their classification accuracy. The goal is to effectively detect CKD using ML algorithms such as the k-nearest neighbor and naive Bayes. Missing values are handled using iterative imputation, and a novel sequential data scaling method is introduced by combining robust scaling, z-standardization, and min-max scaling. Boruta feature selection is applied to identify important features, and the hyperparameters are tuned using grid-search CV. The testing accuracy of our proposed work is evaluated by comparing it to the results of various other studies.
The remaining sections of this paper are structured as follows: In Section 2, we conduct a comprehensive review of the existing literature and highlight the novelty of our work. Section 3 outlines the methodologies employed and presents the proposed system model. The experimental results are analyzed in Section 4. In Section 5, we engage in a discussion and compare our proposed model with other studies. Finally, the paper concludes in Section 6 by exploring potential avenues for future research.

Literature Review
In recent times, there has been a notable advancement in applying ML techniques to the field of healthcare, with a specific focus on early diagnosis and preventive measures [19][20][21]. This progress has also extended to the field of CKD, where numerous noteworthy studies have contributed to advancements in CKD research [17,22]. In this literature review, we provide a comprehensive overview of the current state of CKD research by thoroughly discussing the relevant studies. Our analysis includes a detailed examination of the methodologies employed, the findings obtained, and the limitations identified in each study. By doing so, we aim to present a comprehensive and unbiased understanding of the progress and challenges in CKD research.
A study by Debabrata et al. (2023) aimed to develop an ML model for early CKD detection using the UCI CKD dataset. The researchers employed imputation techniques, a sampling technique for data balancing, and data normalization. They selected nine features based on the chi-square test and used support vector machines for classification. However, the study had limitations, such as the exclusion of advanced imputation algorithms and the potential information loss from reducing the feature set [23].
In a study by Z. Ullah and M. Jamjoom (2023), the researchers aimed to predict CKD progression using a DT-based missing value imputation method. They performed feature selection using the filter method and employed the k-nearest neighbor algorithm for classification. However, the study did not utilize data scaling methods or hyperparameter optimization techniques [24].
A study conducted by A. Farjana et al. (2023) focused on CKD prediction using ML algorithms on the UCI CKD dataset. The researchers filled the missing data with mean values and employed hold-out validation. Light GBM demonstrated superior performance, but the study lacked advanced imputation techniques, outlier handling, data scaling, feature selection, and model optimization [25].
In a study by M. A. Islam et al. (2023), the researchers predicted CKD using ML algorithms. They used mean and mode techniques for missing data imputation and employed recursive feature elimination and principal component analysis for feature selection. However, the study did not utilize scaling methods or hyperparameter optimization techniques [26].
A study by M. M. Hassan (2023) focused on CKD prediction using ML on patients' clinical records. The researchers used predictive mean matching for missing data imputation and performed data clustering using K-means. They employed the XGBoost approach with SHAP value analysis for feature selection. However, the study did not incorporate scaling methods or hyperparameter optimization [27].
In a study conducted by C. Kaur et al. (2023), the researchers utilized machine learning for CKD prediction. They employed Little's MCAR test for missing data analysis and the Ant Colony Optimization algorithm for feature selection. They used ensemble methods and found that bagging produced the best results. However, the study did not employ scaling methods, cross validation, or hyperparameter optimization techniques [28].
Through the review of these studies, it is evident that several research gaps and limitations need to be addressed to further improve the field of CKD prediction. This study aims to specifically target these limitations and contribute novel approaches to the existing body of research. The key novelties of our work are as follows:
1. An advanced imputation method is employed to iteratively estimate missing values in the dataset. By implementing this technique, the completeness and quality of the dataset can be improved, leading to enhanced accuracy in the CKD prediction models.
2. A sequential approach to scaling the variables in the dataset is proposed. Robust scaling is initially used to adjust for outliers, ensuring that their influence is minimized. Subsequently, z-standardization is applied to further normalize the variables. Finally, min-max scaling is utilized to bring all features within a similar range.
3. To ensure the inclusion of only relevant and informative features, a robust feature selection algorithm called Boruta is utilized.
4. Various ML models are explored and evaluated using grid-search CV to identify the most suitable algorithm for accurately classifying CKD.
5. The performance of the proposed model is rigorously validated using a range of evaluation metrics, including accuracy, precision, recall, F1-score, and curve analysis.
By addressing these limitations and incorporating these novel approaches, we aim to contribute to the advancement of CKD prediction models and provide more accurate and reliable predictions for the early detection and prevention of CKD.

Methodology
This work presents a precise system for the detection of CKD through the utilization of a robust model. The proposed approach leverages ML techniques to construct a prediction model that is both effective and accurate. To visually depict the various stages of the proposed system, Figure 1 provides a schematic representation.

Data Collection
In order to validate our proposed ML model, we obtained the CKD dataset from the UCI ML Repository. The dataset contains a total of 400 samples, which we used for evaluating and validating our ML model in this study [29]. Each sample comprises 24 predictive variables, including 11 numerical variables and 13 categorical (nominal) variables. The dataset also includes a categorical response variable called 'class', which indicates the presence or absence of CKD. The 'class' variable has two distinct values: 'ckd' for samples diagnosed with CKD and 'notckd' for samples without CKD. To provide additional insights, a descriptive summary of the attributes involved in our comprehensive analysis is presented in Table 1.

Preprocessing
Medical datasets are prone to various issues that can have a negative impact on the performance of ML models. Therefore, it is crucial to address these challenges to improve the quality of the data. The preprocessing stage plays a vital role in enhancing data quality by tackling key issues such as data encoding, missing values, and outliers [30].

Data Encoding
To handle the combination of categorical and numeric features in the dataset, the LabelEncoder class from the scikit-learn library was used. It transformed the categorical features into integer codes, giving the machine learning models the numeric input they require.
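As a minimal sketch of this encoding step, the snippet below applies scikit-learn's `LabelEncoder` to a single toy column. The variable name `rbc` and its values are illustrative assumptions, not rows taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column; the values are chosen for illustration only.
rbc = ["normal", "abnormal", "normal", "normal", "abnormal"]

encoder = LabelEncoder()
rbc_encoded = encoder.fit_transform(rbc)

# Classes are stored in sorted order, so "abnormal" -> 0 and "normal" -> 1.
print(encoder.classes_.tolist())  # ['abnormal', 'normal']
print(rbc_encoded.tolist())       # [1, 0, 1, 1, 0]
```

The integer codes follow the sorted order of the category names, so they carry no clinical ordering by themselves.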

Data Imputation
Handling missing data requires choosing appropriate statistical methods based on the extent of missing data and the significance of the missing feature. Traditional techniques like mean, maximum, and mode work well with a low proportion of missing values [31]. In our study, we encountered a substantial amount of missing data, as illustrated in Figure 2. To tackle this issue, we utilized iterative imputation, a statistical approach that iteratively estimates the missing values based on the observed data while considering the relationships between variables. This iterative process progressively refines the imputed values over multiple iterations, leading to a comprehensive and accurate estimation [32]. Algorithm 1 outlines the steps involved in constructing the iterative imputation process.
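One way to realize this kind of iterative imputation is scikit-learn's `IterativeImputer`, sketched below. This is an assumption about tooling rather than the authors' confirmed implementation: the toy matrix, `max_iter`, `tol`, and `random_state` are illustrative, and `LinearRegression` is chosen only because Algorithm 1 names a linear regression model per feature.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy numeric matrix with missing entries marked as np.nan (illustrative values).
X = np.array([
    [1.0, 2.0, 3.1],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, np.nan],
    [4.0, 8.1, 12.0],
    [np.nan, 10.0, 15.2],
])

# Each feature with missing values is regressed on the other features, and the
# predictions are refined over several rounds until the change falls below tol.
imputer = IterativeImputer(
    estimator=LinearRegression(),  # assumed estimator, per the model named in Algorithm 1
    max_iter=10,                   # assumed iteration cap
    tol=1e-3,                      # assumed convergence tolerance
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
```

Observed entries are left unchanged; only the `np.nan` cells are replaced by model-based estimates.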

Data Scaling
To address outliers and achieve data normalization, a sequential approach of scaling techniques was employed, as outlined in Algorithm 2. The process began with robust scaling, which reduces the impact of extreme values and enhances robustness. It involved subtracting the median ($Q_2$) and dividing by the interquartile range ($Q_3 - Q_1$). This can be represented by the following equation:

$$x_{\text{robust}} = \frac{x - Q_2}{Q_3 - Q_1}$$

Next, z-score standardization was applied, resulting in a standardized distribution by subtracting the mean ($\mu$) and dividing by the standard deviation ($\sigma$). This can be represented by the following equation:

$$z = \frac{x - \mu}{\sigma}$$

Finally, to bring the features within a specific range (typically 0 to 1), min-max scaling was performed by subtracting the minimum value ($x_{\min}$) and dividing by the range ($x_{\max} - x_{\min}$). This can be represented by the following equation:

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Algorithm 1 (iterative imputation)
7: for each feature f in F_missing do
8:   Initialize missing mask M_f for feature f
9:   Initialize model M_f (Linear Regression) for feature f
10:   Initialize convergence ← False
11:   Initialize iterations ← 0
12:   while not convergence and iterations < η do
13:     Fit model M_f on X_imputed
14:     Predict missing values using M_f
15:     Update X_imputed with predicted values
16:     Check for convergence using mean absolute change
17:     if CheckConvergence(X_imputed, f, …) then

Algorithm 2 (sequential data scaling)
3: Apply Robust Scaling to X and store the result in X_scaled
4: Apply Z-score Standardization to X_scaled and update X_scaled
5: Apply Min-Max Scaling to X_scaled and update X_scaled
6: return X_scaled
7: end procedure
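The three scaling steps can be sketched directly with NumPy, applying robust scaling, z-score standardization, and min-max scaling in that order, column by column. The function names and the toy matrix are hypothetical, and the sketch assumes every feature has a non-zero interquartile range, standard deviation, and range.

```python
import numpy as np

def robust_scale(X):
    """Subtract the median (Q2) and divide by the interquartile range (Q3 - Q1)."""
    q1, q2, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - q2) / (q3 - q1)  # assumes Q3 != Q1 for every column

def z_standardize(X):
    """Subtract the mean and divide by the (population) standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    """Map each column linearly onto the interval [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def sequential_scale(X):
    """Apply the three steps in the order described in the text."""
    return min_max_scale(z_standardize(robust_scale(X)))

# Toy data: the second column contains one extreme value (15.0).
X = np.array([[50.0, 1.2], [60.0, 0.9], [45.0, 1.1], [70.0, 15.0], [55.0, 1.0]])
X_scaled = sequential_scale(X)
```

After the final step every column lies in [0, 1], with the column minimum mapped to 0 and the maximum to 1.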

Feature Selection
Feature selection is a crucial step in ML, as it helps extract a subset of important features from the dataset. This process offers several benefits, including improved prediction accuracy, reduced model complexity, and enhanced interpretability.
In this study, we utilized the Boruta feature selection technique, which leverages random shadow features and an ML model. Boruta compares the importance of each feature to that of the shadow features iteratively, categorizing features as confirmed, tentative, or rejected based on their significance. Ultimately, Boruta provides a subset of the most significant features from the dataset. We implemented the technique using a random forest classifier as the base model to evaluate the feature importance. This classifier was trained on the dataset, including both original and shadow features, using measures such as the mean decrease in accuracy. The combination of the Boruta algorithm and the random forest classifier enabled us to identify the most relevant features for our analysis [33,34].
Algorithm 3 provides a concise overview of the Boruta feature selection algorithm, outlining the steps of initialization, iteration, feature evaluation, and the selection of confirmed features.

Algorithm 3 (Boruta feature selection)
9:   Fit the random forest classifier on X_scaled using features from T
10:   Perform a permutation test for each feature in T to evaluate its importance
11:   for each feature f in T do
12:     if the feature importance of f is significantly higher than random, then
13:       Move f from T to C
14:     else
15:       Move f from T to R
16:     end if
17:   end for
18:   if T is empty then
19:     break
20:   end if
21: end for
22: selected_features ← C
23: return selected_features

The Boruta feature selection technique was applied to the UCI CKD dataset, resulting in the selection of 19 features, while 5 features were rejected. The features that were rejected include pus cell clumps, bacteria, potassium, coronary artery disease, and anemia. The selected 19 features were considered important for the classification task and were used for further analysis and model building. These selected variables are also clinically relevant to CKD, as supported by the previous literature [23,24,26,27]. The incorporation of these relevant features enhances the model's ability to accurately identify and predict cases of CKD, making it a valuable tool for early detection and effective management of the condition.
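The shadow-feature idea behind Boruta can be illustrated with a deliberately simplified sketch. It is not the full Boruta procedure: there is no statistical test and no tentative category, features are kept by a simple majority of rounds, and scikit-learn's impurity-based `feature_importances_` stands in for the mean decrease in accuracy. The function name, the number of rounds, and the synthetic data are all assumptions made for illustration; a fuller implementation is available in the third-party `boruta` package (`BorutaPy`).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_feature_selection(X, y, n_rounds=10, random_state=0):
    """Keep features that beat the best shadow feature in most rounds."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # Shadow features: each column shuffled independently, which breaks
        # its link to y while preserving its marginal distribution.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestClassifier(
            n_estimators=100, random_state=int(rng.integers(1_000_000))
        )
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        best_shadow = importances[n_features:].max()
        # A real feature scores a hit when it outranks every shadow feature.
        hits += importances[:n_features] > best_shadow
    return np.flatnonzero(hits > n_rounds / 2)

# Synthetic check: column 0 tracks the label, columns 1 to 3 are pure noise.
data_rng = np.random.default_rng(42)
y = data_rng.integers(0, 2, size=200)
X = np.column_stack([y + data_rng.normal(0, 0.1, size=200),
                     data_rng.normal(size=(200, 3))])
selected = shadow_feature_selection(X, y)
```

On this synthetic data the informative column 0 is retained, which is the behavior the shadow comparison is designed to produce.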

Data Splitting
Data splitting is a crucial step in machine learning for reliable model evaluation and generalization [35]. It involves dividing the dataset into training and testing subsets: In this study, we used an 80:20 split ratio, where 80% of the dataset was allocated for training and the remaining 20% for testing. This ensures that the model learns from a significant portion of the data and is then evaluated on unseen data to assess its generalization performance.
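A minimal sketch of the 80:20 split with scikit-learn's `train_test_split` follows. The placeholder arrays only mimic the dataset's shape (400 samples, 19 selected features); stratifying on the class label and the fixed `random_state` are assumptions, since the text does not say whether either was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the dataset's shape: 400 samples, 19 selected features.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 19))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,   # 80% training, 20% testing
    stratify=y,       # assumption: keep the class ratio equal in both subsets
    random_state=42,  # assumption: fixed seed for reproducibility
)
print(X_train.shape, X_test.shape)  # (320, 19) (80, 19)
```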

Model Training
During the model training phase, we employed two highly efficient ML classifiers: naïve Bayes and k-nearest neighbor. To optimize their performance, we utilized the hyperparameter optimization technique to tune the parameters of both algorithms.

Hyperparameter Optimization
Hyperparameter optimization is a critical step in ML to optimize the model performance by selecting the best combination of hyperparameters. In our study, we employed the widely used technique of grid-search CV. This approach systematically explores predefined grids of the hyperparameter values, evaluating the model's performance for each combination using CV. By exhaustively searching through the hyperparameter space, it allows for a comprehensive exploration and selection of the optimal hyperparameter configuration [36,37]. The workflow of grid search CV for the selection of the hyperparameters is illustrated in Figure 3.
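The grid-search procedure can be sketched with scikit-learn's `GridSearchCV`. The grid values, the 5-fold setting, and the synthetic data below are illustrative assumptions, not the search space or data used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data so the example is self-contained.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: 3 x 2 x 2 = 12 candidate configurations.
param_grid = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

# Every configuration is scored by 5-fold cross-validated accuracy,
# and the best-scoring one is refit on all of the supplied data.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

`best_params` holds the winning combination and `best_score` its mean cross-validated accuracy; the same pattern applies to `GaussianNB` with a grid over its own parameters.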

Naïve Bayes
Naïve Bayes is a supervised algorithm that assumes feature independence during classification. It is particularly useful for datasets with a high number of input features. The algorithm considers all features, including those with weak effects on the prediction. The probabilistic model is represented by the equation:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In this equation, A and B represent events, and $P(A \mid B)$ is the probability of event A occurring given that event B has occurred. By applying this model, naïve Bayes can make predictions based on the class with the highest probability.
In our study, we utilized the Gaussian naïve Bayes (NB) algorithm for classification. This variant assumes a Gaussian distribution for the features. It estimates the likelihood of observing specific feature values given a class label using the Gaussian probability density function.
The step-by-step procedure and essential hyperparameter choices for constructing the Gaussian NB model in this research are outlined in the pseudocode provided in Algorithm 4. The hyperparameters include the training data, smoothing parameter, and priors, which are utilized to build the model. The algorithm begins by calculating the prior probability for each class and then estimates the mean and variance of features for each class. Using Bayes' theorem, it computes the posterior probability for each class given a new data point. Finally, the algorithm assigns the class with the highest posterior probability as the predicted class for the new data point [38].

Algorithm 4 (Gaussian naïve Bayes)
Input:
  Training dataset: X_train = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
  New data point: x
2: Calculate the prior probability for each class y_i: P(y_i) = count(y_i) / n using priors
3: for each feature x_j do
4:   if feature x_j is discrete, then
12: Calculate the posterior probability for each class y_i using Bayes' theorem:
13:   P(y_i | x) = P(x | y_i) · P(y_i) / P(x)
14: Assign the class label with the highest posterior probability as the predicted class:
15:   y = arg max_{y_i} P(y_i | x)
16: return y
17: end procedure
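The steps described for the Gaussian NB classifier (class priors, per-class feature means and variances, then the class with the highest posterior) can be sketched in plain Python. This is a simplified illustration rather than the study's implementation: the function names and toy data are hypothetical, `var_smoothing` is added as a plain constant, and P(x) is dropped because it is the same for every class.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the prior, feature means, and feature variances of each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_smoothing
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for one sample x."""
    def log_posterior(prior, means, variances):
        # Sum of Gaussian log-densities, one per feature (independence assumption).
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_likelihood
    return max(model, key=lambda label: log_posterior(*model[label]))

# Toy two-feature data; the values are illustrative only.
X_train = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_train = ["notckd", "notckd", "notckd", "ckd", "ckd", "ckd"]
model = fit_gaussian_nb(X_train, y_train)
```

Working in log space avoids numerical underflow when many small feature likelihoods are multiplied together.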

K-Nearest Neighbor
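K-nearest neighbor is a simple and widely used supervised ML algorithm. It predicts the class of an observation by considering the classes of its k nearest neighbors, determined using a distance metric such as the Euclidean, Minkowski, or Manhattan distance. The equations for these distance metrics are as follows:

$$d_{\text{Euclidean}}(x_i, x_j) = \sqrt{\sum_{k=1}^{d} \left(x_{ik} - x_{jk}\right)^2}$$

$$d_{\text{Minkowski}}(x_i, x_j) = \left(\sum_{k=1}^{d} \left|x_{ik} - x_{jk}\right|^{p}\right)^{1/p}$$

$$d_{\text{Manhattan}}(x_i, x_j) = \sum_{k=1}^{d} \left|x_{ik} - x_{jk}\right|$$

In these equations, $x_{ik}$ and $x_{jk}$ represent the kth features of $x_i$ and $x_j$ in a d-dimensional space, respectively, and $p \geq 1$ is the order of the Minkowski distance ($p = 2$ gives the Euclidean distance and $p = 1$ the Manhattan distance).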
Using these distance metrics, it identifies the k nearest neighbors of a data point and determines its class based on the majority class among those neighbors. It is a straightforward and intuitive algorithm, making it applicable to various classification tasks.
The step-by-step procedure and essential hyperparameter choices for constructing the k-nearest neighbor model in this research are outlined in the pseudocode provided in Algorithm 5. The hyperparameters, such as the training data, leaf size, parameter, weight function, algorithm, number of neighbors, and distance metric, are utilized to build the model. The algorithm predicts the class label of a test instance by considering the majority class among its k nearest neighbors. It accomplishes this by calculating the distances between the test instance and training instances, selecting the k nearest neighbors and determining the predicted class label through a majority voting process [39,40].
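The distance computation and majority vote can be sketched in a few lines of plain Python. This is an illustrative brute-force version with uniform neighbor weights; the function names, the toy data, and the default `k` are assumptions, and it omits the leaf size, search algorithm, and weight function options mentioned above.

```python
from collections import Counter

def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

def knn_predict(X_train, y_train, x, k=3, p=2):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    # Distance from x to every training instance, sorted in ascending order.
    distances = sorted(
        (minkowski_distance(row, x, p), label)
        for row, label in zip(X_train, y_train)
    )
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy two-feature data; the values are illustrative only.
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]]
y_train = ["notckd", "notckd", "notckd", "ckd", "ckd", "ckd"]

print(minkowski_distance([0, 0], [3, 4], p=2))  # 5.0
print(minkowski_distance([0, 0], [3, 4], p=1))  # 7.0
```

Because distances are compared across features, the features should share a common scale before a model of this kind is applied.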

Performance Metrics
The effectiveness and accuracy of the developed ML models in this research were evaluated using various performance metrics. These metrics, including the accuracy, recall, precision, and F1-score, provided valuable insights into different aspects of the classifiers' performance. The evaluation relied on a confusion matrix, which is presented in Table 2. The confusion matrix allowed for a comprehensive examination of the classification results. True positives (TP) represented instances correctly predicted as the positive class, while true negatives (TN) represented instances correctly predicted as the negative class. False positives (FP) were instances incorrectly predicted as the positive class, and false negatives (FN) were instances incorrectly predicted as the negative class. This evaluation approach facilitated a thorough assessment of the accuracy and effectiveness of the model in the early detection of CKD.
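The four metrics follow directly from the confusion matrix counts, as the short sketch below shows. The function name and the example counts are hypothetical and are not results from this study; the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all correct predictions
    precision = tp / (tp + fp)                  # predicted positives that are correct
    recall = tp / (tp + fn)                     # actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for an 80-sample test set (not results from this study).
metrics = classification_metrics(tp=48, tn=30, fp=1, fn=1)
```

With these counts the accuracy is 78/80 = 0.975, while precision, recall, and F1-score all equal 48/49.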

Results
An experimental study was conducted on the UCI CKD dataset, where the categorical features were encoded. The missing values were addressed using iterative imputation. A novel sequential approach was implemented for data scaling, involving robust scaling, z-standardization, and min-max scaling in that order. To perform feature selection, we utilized the Boruta algorithm. The dataset was divided into training and testing sets using an 80:20 ratio. For constructing the models, we employed ML techniques such as k-nearest neighbor and Gaussian NB. To optimize the model parameters, grid-search CV was utilized. All preprocessing, visualization, and analysis tasks were carried out in Python.
In Figure 4, the confusion matrices are presented, depicting the performance of the models. Table 3 provides the optimal hyperparameters obtained through the grid-search CV, along with the performance metrics including the accuracy, precision, recall, and F1-score. It shows that the k-nearest neighbor model achieved 100% accuracy, precision, recall, and F1-score, indicating excellent performance. Figure 5 displays the evaluation of the model through the area under the ROC curve and the precision-recall curve. The k-nearest neighbor model achieved the highest performance, indicating its superiority as the best model for the early detection of CKD.
To assess the generalization capability of the trained models, a rigorous 15-fold CV technique was employed. The results, as depicted in Figure 6, illustrate the accuracy of both models on each fold, providing valuable insights into their performance. The k-nearest neighbor algorithm demonstrated remarkable consistency across diverse folds, achieving an exceptional accuracy of 99.37%. This high score highlights the model's impressive performance and robustness, indicating its ability to generalize well to unseen data. In contrast, the Gaussian NB achieved a slightly lower CV accuracy of 97.05%.
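A 15-fold evaluation of this kind can be sketched with scikit-learn's `cross_val_score`. The synthetic data, the default model settings, and the use of stratified, shuffled folds are assumptions made for illustration, so the scores it produces are unrelated to the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the dataset's shape (400 samples, 19 features).
X, y = make_classification(n_samples=400, n_features=19, random_state=0)

# Assumption: stratified, shuffled folds with a fixed seed.
cv = StratifiedKFold(n_splits=15, shuffle=True, random_state=0)

models = {
    "k-nearest neighbor": KNeighborsClassifier(),
    "Gaussian NB": GaussianNB(),
}

# One accuracy value per fold and per model, plus the mean across folds.
fold_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
mean_scores = {name: scores.mean() for name, scores in fold_scores.items()}
```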

Discussion
CKD is a critical condition, and accurate diagnosis plays a pivotal role in improving patient outcomes. To address this, our paper focuses on proposing a comprehensive ML model for CKD prediction. However, in implementing ML techniques for medical diagnosis, we must be mindful of the potential risks and ethical considerations. Complex ML models may lack interpretability, raising concerns about trust and accountability in the healthcare domain. Additionally, biases in training data can lead to discriminatory outcomes, exacerbating healthcare disparities, and handling sensitive patient information raises privacy and data security issues. Despite these challenges, ML models offer benefits like accurate and personalized diagnoses, identifying rare conditions, and adapting to changing scenarios. Therefore, striking a balance between the risks and benefits is essential to harness ML's potential for improved medical diagnosis while upholding ethical standards and patient wellbeing.
As we embark on improving CKD prediction, it is crucial to address the existing challenges in the field of ML-based medical diagnosis. Commonly used sampling techniques in existing studies to balance datasets and improve accuracy may introduce artificial data, limiting real-world applicability. Handling missing data is another significant challenge in medical datasets, with mean or mode imputation methods potentially introducing biases. Some studies focus on using a reduced set of features to improve accuracy, but this approach may not generalize well in real-world scenarios. Additionally, data scaling, often overlooked, is a critical preprocessing step that can significantly impact model performance. In our approach, we systematically address these challenges to enhance CKD prediction accuracy. By utilizing iterative imputation for missing data, introducing a novel sequential data scaling method, and employing the Boruta algorithm for feature selection, we aim to create a robust and reliable model for CKD prediction. Through grid-search CV, we optimize the k-nearest neighbor and Gaussian NB algorithms, further refining the model's performance.
To evaluate the efficacy of our proposed model, we conducted extensive validation on the UCI CKD dataset. Our approach achieved an accuracy, precision, recall, and F1-score of 100%. Additionally, we compared our model with existing ML models that were developed on the same dataset. The comparison presented in Table 4 shows that the accuracy of our proposed model matches or exceeds that of the previous studies.

Table 4.
Study                          Model                     Accuracy
Z. Ullah and M. Jamjoom [24]   K-nearest Neighbors       99.5%
A. Farjana et al. [25]         Light GBM                 99%
M. A. Islam et al. [26]        Gradient Boosting         99%
M. M. Hassan [27]              Neural Network            100%
C. Kaur et al. [28]            Random Forest             96%
M. M. Nishat et al. [41]       Support Vector Machine    99.36%
Our proposed model             K-nearest Neighbors       100%

Table 5 focuses on comparing our k-nearest neighbor and naïve Bayes models with other studies that also employed k-nearest neighbor and naïve Bayes algorithms. We evaluated the models' performance using the same dataset to validate the effectiveness of our preprocessing steps. The results show that our models consistently outperformed the previous studies, highlighting the impact of our preprocessing techniques in enhancing prediction accuracy.

Table 5. Comparison of the k-nearest neighbor and naïve Bayes models with other studies on the UCI CKD dataset.

Conclusions
This study successfully developed a robust ML model for the early detection of CKD. The model's exceptional performance, achieving 100% accuracy, precision, recall, and F1-score on the UCI CKD dataset, validates its reliability and potential for clinical application.
By incorporating various preprocessing steps and the Boruta algorithm for feature selection, our proposed model demonstrates its robustness in accurately identifying CKD cases. The results obtained through multiple performance metrics further strengthen the confidence in its accuracy. The implementation of this model as a reliable and accurate tool for early CKD detection holds great promise for improving clinical decision making and ultimately enhancing patient outcomes. The potential impact of this research in advancing early diagnosis and management of CKD highlights its significance in addressing a critical global health challenge.

Limitations
The main limitation of this study was the reliance on a single dataset, the UCI CKD dataset, which contains a substantial amount of missing values. While we employed iterative imputation to estimate the missing data, it is crucial to acknowledge the uncertainty introduced by imputation methods, which may influence the model's predictive capability. Additionally, the generalizability of our findings to other populations and real-world scenarios needs further investigation. The model's adaptability to handle diverse data sources and missing data patterns should be carefully examined in future research. Furthermore, the retrospective nature of the performance evaluation raises questions about the model's ability to predict CKD in real-time or prospective settings. Addressing these limitations will strengthen the model's reliability and applicability for early CKD detection, making it a more effective tool for clinical use.