A Machine Learning Method with Filter-Based Feature Selection for Improved Prediction of Chronic Kidney Disease

The high prevalence of chronic kidney disease (CKD) is a significant public health concern globally. The condition has a high mortality rate, especially in developing countries. CKD often goes undetected since there are no obvious early-stage symptoms. Meanwhile, early detection and timely clinical intervention are necessary to slow the progression of the disease. Machine learning (ML) models can provide an efficient and cost-effective computer-aided diagnosis to assist clinicians in achieving early CKD detection. This research proposes an approach to effectively detect CKD by combining an information-gain-based feature selection technique and a cost-sensitive adaptive boosting (AdaBoost) classifier. Such an approach could save CKD screening time and cost since only a few clinical test attributes would be needed for the diagnosis. The proposed approach was benchmarked against recently proposed CKD prediction methods and well-known classifiers. Among these classifiers, the proposed cost-sensitive AdaBoost trained with the reduced feature set achieved the best classification performance, with an accuracy, sensitivity, and specificity of 99.8%, 100%, and 99.8%, respectively. Additionally, the experimental results show that the feature selection positively impacted the performance of the various classifiers. The proposed approach produced an effective predictive model for CKD diagnosis and could be applied to other imbalanced medical datasets for effective disease detection.


Introduction
Chronic kidney disease is among the leading causes of death globally. A recent medical report states that approximately 324 million people suffer from CKD globally [1]. The glomerular filtration rate (GFR) is a widely used CKD screening test [2]. Though CKD affects people worldwide, it is more prevalent in developing countries [3]. Meanwhile, early detection is vital in reducing the progression of CKD. However, people from developing countries have not benefitted from early-stage CKD screening due to the cost of diagnosing the disease and limited healthcare infrastructure. While the global prevalence of CKD is reported to be 13.4% [4], it is said to have a 13.9% prevalence in Sub-Saharan Africa [5,6]. Another study reported a 16% pooled prevalence of CKD in West Africa [7], the highest in Africa. Numerous research works have specified that CKD is more prevalent in developing countries [8]. Notably, it is reported that 1 out of every 10 persons suffers from CKD in South Asia, including Pakistan, India, Bhutan, Bangladesh, and Nepal [3].
Therefore, several researchers have proposed machine learning (ML)-based methods for the early detection of CKD. These ML methods could provide effective, convenient, and low-cost computer-aided CKD diagnosis systems to enable early detection and intervention, especially in developing countries. Researchers have proposed different methods to detect CKD effectively using the CKD dataset [9] available at the University of California, Irvine (UCI) machine learning repository. In the present study, the information gain (IG) of each attribute is computed; the attributes with the least IG values are removed, and those whose IG values are above a particular threshold are used to train the classifiers.
Meanwhile, a significant challenge in applying machine learning algorithms to medical diagnosis is the imbalanced class problem [20,21]. Most ML classifiers underperform when trained with imbalanced datasets. Class imbalance implies an uneven distribution of samples across the classes: the class with the most samples is the majority class, while the class with fewer samples is the minority class. Approaches to learning from imbalanced data can be divided into data-level and algorithm-level methods [22]. Data-level methods are based on resampling techniques. Several studies have employed resampling techniques such as undersampling and oversampling to address the class imbalance problem [23][24][25]. To create a balanced dataset, undersampling methods remove samples from the majority class, while oversampling techniques artificially create and add more data to the minority class, as illustrated in the sketch below. However, these resampling techniques have limitations; for example, the samples discarded from the majority class could be vital for efficiently training the classifiers [20]. Therefore, several studies have resorted to algorithm-level methods such as ensemble learning and cost-sensitive learning to handle imbalanced data effectively instead of data-level techniques [26][27][28][29].
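As an illustration of the data-level approach (it is not part of the proposed method, which is algorithm-level), a minimal random oversampling sketch in Python is given below; the helper name and its arguments are hypothetical, not taken from the paper.

import numpy as np
from sklearn.utils import resample

def random_oversample(X, y, minority_label=1, random_state=0):
    # hypothetical helper: split the data into minority and majority classes
    mask = (y == minority_label)
    X_min, y_min = X[mask], y[mask]
    X_maj, y_maj = X[~mask], y[~mask]
    # draw minority samples with replacement until both classes are equal in size
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=len(y_maj), random_state=random_state)
    return np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up])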
Ensemble learning is a breakthrough in ML research and application that obtains a highly accurate classifier by combining two or more classifiers. Boosting [30] and bagging [31] are widely used ensemble learning techniques. Adaptive boosting [30] is a boosting technique that creates many classifiers by assigning weights to the training data and adjusting these weights after every boosting cycle. The wrongly classified training instances are given higher weights in the next iteration, whereas the weights of correctly predicted examples are decreased. However, the AdaBoost algorithm does not treat the minority class and majority class weight updates differently when faced with imbalanced data. Therefore, in this study, we develop an AdaBoost classifier that gives higher weight to examples in the minority class, thereby enhancing the prediction of the minority class samples and the overall classification performance. A cost-sensitive classifier is obtained by biasing the weighting technique to focus more on the minority class. Recent findings have demonstrated that cost-sensitive learning is an efficient technique suitable for imbalanced classification problems [32][33][34]. The contribution of this study is twofold: to obtain the most important CKD attributes needed to improve the performance of CKD detection, and to develop a cost-sensitive AdaBoost classifier that gives more attention to samples in the minority class.
The rest of this paper is structured as follows: Section 2 presents the materials and methods, including an overview of the CKD dataset, the information gain technique, the traditional AdaBoost method, the proposed cost-sensitive AdaBoost, and the performance evaluation metrics. Section 3 presents the experimental results and discussion, while Section 4 concludes the paper.

Materials and Methods
This section provides an overview of the CKD dataset and the various methods used in the research. In particular, a detailed overview of the traditional AdaBoost algorithm and the proposed cost-sensitive AdaBoost is presented, thereby showing the difference between both methods.

Dataset
This study utilizes the CKD dataset prepared in 2015 by Apollo Hospitals, Tamil Nadu, India. The dataset is publicly available at the University of California, Irvine (UCI) machine learning repository [35]. It contains medical test results and records from 400 patients; 250 correspond to patients with CKD, and 150 correspond to patients without CKD, so the dataset is imbalanced. There are 24 independent variables (11 numerical and 13 nominal) and a class variable (ckd or notckd). The attributes and their corresponding descriptions are shown in Table 1. Some of the features in Table 1 are briefly described as follows: Specific gravity estimates the concentration of particles in the urine and the urine's density relative to the density of water. It indicates the hydration status of a patient together with the functional ability of the patient's kidney. Albumin is a protein found in the blood [36]. When the kidney is damaged, it allows albumin into the urine. Higher albumin levels in the urine could indicate the presence of CKD. Meanwhile, blood urea provides vital information about the functionality of the kidney. A blood urea nitrogen test measures the quantity of urea nitrogen in a patient's blood, and a high amount implies the kidneys are not functioning normally. A random blood glucose test measures the amount of sugar circulating in a patient's blood; a level of 200 mg/dL or above implies the patient has diabetes. Serum creatinine is a waste product produced by a person's muscles [37]. A creatinine test measures the creatinine levels in the blood or urine, and high levels of the substance imply the kidney is not functioning well enough to filter the waste from the blood.
Furthermore, sodium is an electrolyte in the blood that helps the muscles and nerves work effectively. A sodium blood test measures the amount of sodium in the patient's blood, and a very high or low amount may indicate a kidney problem, dehydration, or another medical condition. Potassium is another electrolyte in the blood, and a very high or low amount could signal the presence of an underlying condition. White blood cells (WBC) protect the human body from invading pathogens. They are part of the body's immune system, protecting it from infections [38]. The normal range is between 4000 and 11,000 per microliter of blood. An elevated WBC count is a common indicator of the progression of CKD. Red blood cells (RBC) in humans deliver oxygen to the body tissues. The average RBC count is 4.7 to 6.1 million cells per microliter for men and 4.2 to 5.4 million cells per microliter for women. A low RBC count, known as anemia, is a common complication of CKD.
Meanwhile, the data needs to be preprocessed to make it suitable for machine learning. Therefore, all the nominal or categorical data were coded. Specifically, the attributes whose scales are 'normal' and 'abnormal' were transformed to 1 and 0, respectively. The attributes whose scales are 'present' and 'not present' were transformed to 1 and 0, respectively. Additionally, the 'yes' and 'no' scales were coded to 1 and 0, respectively. Lastly, the attribute with 'good' and 'poor' scales was transformed to 1 and 0, respectively. Furthermore, the dataset contains a few missing values. It is vital to appropriately deal with missing values before building ML models because ignoring or deleting the missing data could degrade or bias the performance of the models [39,40]. Imputation is a method used to estimate and fill missing values in a dataset. Since the number of missing values in our dataset is not large, the mean imputation technique is used to handle the missing values. The mean imputation technique computes the mean of the observed values for each variable, and the missing values are filled with the corresponding computed mean value [41]. Meanwhile, except for the 'age' and binary attributes, the remaining attributes were scaled to have values between 0 and 1 using the Min-Max Scaling technique [42].
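A minimal preprocessing sketch following the steps above is shown below; the file path and column names are assumptions about the UCI CKD data, not taken from the paper.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("ckd.csv")  # hypothetical path to the dataset

# code the nominal scales as 1/0, as described in the text
df = df.replace({"normal": 1, "abnormal": 0, "present": 1, "notpresent": 0,
                 "yes": 1, "no": 0, "good": 1, "poor": 0})

# mean imputation: fill gaps in each numeric column with its observed mean
df = df.fillna(df.mean(numeric_only=True))

# min-max scale the non-binary attributes, excluding 'age' (assumed column names)
scale_cols = [c for c in ["bp", "bgr", "bu", "sc", "sod", "pot",
                          "hemo", "pcv", "wc", "rc"] if c in df.columns]
df[scale_cols] = MinMaxScaler().fit_transform(df[scale_cols])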
Additionally, the clinicians at Apollo Hospitals, Tamil Nadu, India, categorized attributes as normal or abnormal, present or not present, yes or no. The clinicians selected the 24 attributes representing the patient's medical tests and records associated with chronic kidney disease. However, the attributes do not carry equal weights in diagnosing a patient with CKD, and some attributes are more indicative of the presence of CKD than others. Additionally, certain attributes might be redundant and could increase the complexity of the ML model [43]. Hence, this research employs the information gain technique to rank the attributes according to their relevance in detecting the disease, and only the most relevant attributes are used to build the ML model.

Information Gain
Effective feature selection could remove attributes that are less useful in obtaining an excellent predictive model. Additionally, it is necessary to remove attributes unrelated to the target variable because these attributes could increase the computational cost and prevent the model from obtaining optimal performance [44]. This study utilizes the information gain (IG) technique to extract the optimal features. IG is a type of filter-based feature selection that calculates the predictor variable's ability to classify the dependent variable [45]. The IG method has its roots in information theory, and it calculates the statistical dependence between two variables. Mathematically, the IG between two variables X and Y is formulated as:

IG(X|Y) = H(X) − H(X|Y)

where H(X) is the entropy of variable X and H(X|Y) represents the conditional entropy of X given Y. Computing the IG value for an attribute involves calculating the entropy of the target variable for the whole dataset and subtracting the conditional entropies for every potential value of that attribute [46]. Furthermore, the entropy H(X) and conditional entropy H(X|Y) are computed as:

H(X) = −Σ_x P(x) log2 P(x)

H(X|Y) = −Σ_y P(y) Σ_x P(x|y) log2 P(x|y)

Hence, given two variables X and Z, a variable Y is said to have a more significant correlation to X than to Z if IG(X|Y) > IG(Z|Y). Furthermore, IG considers every attribute in isolation, calculates its information gain, and thereby computes its relevance to the target variable.
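For concreteness, a small sketch of the IG computation defined above (base-2 entropy) is given below; it is a generic illustration, not the paper's code.

import numpy as np

def entropy(labels):
    # H(X) = -sum_x P(x) log2 P(x)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, target):
    # IG(X|Y) = H(X) - H(X|Y), with X the target and Y the attribute
    h_x = entropy(target)
    h_x_given_y = 0.0
    for v in np.unique(feature):
        subset = target[feature == v]
        h_x_given_y += (len(subset) / len(target)) * entropy(subset)
    return h_x - h_x_given_y

# a perfectly informative binary attribute recovers the full target entropy
y = np.array([1, 1, 0, 0])
x = np.array([1, 1, 0, 0])
print(information_gain(x, y))  # 1.0 bit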

AdaBoost Algorithm
The AdaBoost algorithm is an ML technique derived from the concept of boosting. The boosting technique primarily entails converting weak learners into strong learners [47]. Freund and Schapire [48] proposed the AdaBoost algorithm to iteratively train multiple learning classifiers using the same training dataset. After training the weak learners, they are combined to obtain a strong classifier. The AdaBoost procedure involves selecting an appropriate weak learner and employing the same training dataset to train the weak learner iteratively to enhance its performance, as shown in Algorithm 1. Two weights are utilized in implementing the AdaBoost algorithm: the first is the sample weight, and the second is the weight of each weak learner [49]. The algorithm adjusts the sample weights depending on the weak classifier's result, thereby giving more attention to wrongly classified samples. Subsequent base learners are trained with the adjusted samples [50]. The final strong classifier is obtained by combining the output of the weak learners using a weighted sum [51]. AdaBoost is adaptive because subsequent weak classifiers are trained to pay more attention to the samples that were wrongly classified by preceding classifiers.
Algorithm 1: The AdaBoost algorithm.
Input: training set S = {(x_1, y_1), ..., (x_n, y_n)} with y_i ∈ {−1, +1}; number of boosting iterations T.
1. for i = 1 : 1 : n
2.   compute the weight of the sample x_i: D_1(i) = 1/n
3. end for
4. for t = 1 : 1 : T
5.   select a training data subset X from S, fit h using X to get a weak classifier h_t
6.   compute the classification error ε_t: ε_t = Σ_{i=1}^n D_t(i) I(h_t(x_i) ≠ y_i), where h_t(x_i) denotes the predicted label of x_i using the weak classifier h_t, and y_i denotes the actual label of x_i
7.   compute the weight of h_t: α_t = (1/2) ln((1 − ε_t)/ε_t)
8.   update the weight of all the instances in S: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor and is calculated as Z_t = Σ_{i=1}^n D_t(i) exp(−α_t y_i h_t(x_i))
9. end for
10. the final classifier H is obtained as H(x) = sign(Σ_{t=1}^T α_t h_t(x)), where H(x) is the class label for an instance x
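For reference, a hedged sketch of this conventional AdaBoost baseline with a decision-tree base learner (the configuration used later in the experiments) is shown below; the hyperparameter values are assumptions.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# the base estimator is passed positionally to stay compatible across
# scikit-learn versions (the keyword changed from base_estimator to estimator)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, learning_rate=1.0)
# usage: ada.fit(X_train, y_train); ada.predict(X_test)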

Proposed Cost-Sensitive AdaBoost
At every iteration, the AdaBoost algorithm increases the weights of the misclassified training samples and decreases that of the samples that were predicted correctly. This weighting method distinguishes instances as correctly or wrongly classified, and the examples from both classes are treated equally. For example, the weights of incorrectly classified instances from both classes are increased by a similar ratio. The weights of the samples that were predicted correctly from both classes are decreased by a similar ratio [52]. However, in imbalance classification, the goal is to improve the classifier's prediction performance on the minority class. Hence, a suitable weighting approach will identify the different types of instances and give more weight to the samples with greater detection importance, i.e., the minority class.
From the AdaBoost learning method (Algorithm 1), cost items (β_i) are added to the weight update equation to bias the weighting technique. The rates of correct and incorrect predictions of the various classes are included as part of β_i. The cost of a false positive is denoted as c_10 and the cost of a false negative as c_01, while the costs of a true positive and a true negative are denoted as c_11 and c_00, respectively [53]. The weight update in the traditional AdaBoost takes into account the overall error rate only [54]. However, in the cost-sensitive AdaBoost, the error rate of each class is considered. In this new weighting strategy, the weights of the minority class examples are higher compared to those of the majority class examples. The new cost-sensitive AdaBoost is presented in Algorithm 2.
Algorithm 2: The proposed cost-sensitive AdaBoost.
Input: training set S = {(x_1, y_1), ..., (x_n, y_n)}; number of boosting iterations T; cost items c_00, c_01, c_10, c_11.
1. for i = 1 : 1 : n
2.   compute the weight of the sample x_i: D_1(i) = 1/n
3. end for
4. for t = 1 : 1 : T
5.   select a training data subset X from S, fit h using X to get a weak classifier h_t
6.   let n+ and n− indicate the positive and negative classes, respectively; compute the error rate ε_t of the base learner for both the positive class (ε_t^p) and the negative class (ε_t^n)
7.   compute the weight α_t of h_t from the class-specific error rates
8.   update the weight of all the instances in S: for i = 1 : 1 : n, D_{t+1}(i) = β_i D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor chosen so that D_{t+1} sums to one, and β_i is calculated from the cost items (c_11 and c_01 for positive samples, c_00 and c_10 for negative samples) together with the class-specific error rates, so that minority class samples receive larger weights
9. end for
10. the final classifier is obtained as H(x) = sign(Σ_{t=1}^T α_t h_t(x))

By giving higher weights to samples in the minority class, the weak classifiers tend to focus more on the misclassification of examples in that class, thereby accurately classifying more instances at each iteration. Therefore, the final strong classifier will obtain more correct predictions. The architecture of the proposed approach is shown in Figure 1.
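To make the weighting strategy concrete, a minimal from-scratch sketch of a cost-sensitive AdaBoost is given below. It assumes β_i is a simple per-class cost multiplier (a higher cost for the positive, minority class); the paper's exact β_i formula, which also involves the class-specific error rates, is not reproduced here.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cs_adaboost_fit(X, y, T=50, cost_pos=2.0, cost_neg=1.0):
    # y must be in {-1, +1}, with +1 the minority (ckd) class
    n = len(y)
    D = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    beta = np.where(y == 1, cost_pos, cost_neg)  # assumed per-class cost items
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        err = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of the weak learner
        # cost-sensitive update: beta_i biases the weights toward the minority class
        D = beta * D * np.exp(-alpha * y * pred)
        D /= D.sum()                             # normalization factor Z_t
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def cs_adaboost_predict(learners, alphas, X):
    agg = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.where(agg >= 0, 1, -1)             # H(x) = sign(sum_t alpha_t h_t(x))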

Performance Evaluation Metrics
The dataset used in this research comprises two classes, ckd and notckd. The ckd-labeled data are the positive patients, while the notckd-labeled data are the negative patients. Meanwhile, accuracy (ACC), sensitivity (SEN), and specificity (SPE) are used to assess the performance of the classifiers. Accuracy is the total number of correct predictions divided by the total number of predictions made by the classifier. Sensitivity, or true positive rate, is the number of correct positive predictions divided by the number of positive cases in the dataset; it measures the ability of the classifier to correctly detect those with the disease [55]. Specificity measures the classifier's ability to correctly identify people without the disease, i.e., negative instances. These metrics are computed as follows:

ACC = (TP + TN) / (TP + TN + FP + FN)

SEN = TP / (TP + FN)

SPE = TN / (TN + FP)

where true positive (TP) represents the ckd instances that were correctly classified, and false negative (FN) represents the ckd instances that were wrongly classified. True negative (TN) denotes the notckd samples that were correctly classified, and false positive (FP) indicates the notckd instances that were wrongly classified [56]. Additionally, this research utilizes the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) to further evaluate the classifiers' performance. The ROC curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) at different threshold values. It shows the ability of the classifier to distinguish between the ckd and notckd classes. Meanwhile, the AUC is mainly utilized to summarize the ROC curve, and it has a value between 0 and 1 that shows the classifiers' ability to differentiate between both classes [57]. The higher the AUC value, the better the classifiers can distinguish between the ckd and notckd classes.
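A short sketch of these metrics using scikit-learn's confusion matrix follows; the label and score arrays are placeholders.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0])              # placeholder labels (1 = ckd)
y_pred = np.array([1, 0, 1, 0, 0])              # placeholder predictions
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.1])   # placeholder classifier scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
sen = tp / (tp + fn)                   # sensitivity (true positive rate)
spe = tn / (tn + fp)                   # specificity
auc = roc_auc_score(y_true, y_score)   # area under the ROC curve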

Results
In this section, the experimental results are presented and discussed. All experiments used the preprocessed data, as discussed in Section 2.1, and the ML models were developed using scikit-learn [58], a machine learning library for Python programming. The experiments were conducted using a computer with the following specifications: Intel(R) Core(TM) i7-113H @ 3.30 GHz, 4 Core(s), and 16 GB RAM. Furthermore, to have a baseline for comparing the proposed cost-sensitive AdaBoost (CS AdaBoost), this research implements the traditional AdaBoost [30] presented in Algorithm 1 and other well-known classifiers, including logistic regression [59], decision tree [60], XGBoost [61], random forest [62], and SVM [63]. The classifiers are trained with the complete feature set and the reduced feature set to demonstrate the impact of the feature selection. Meanwhile, the 10-fold cross-validation method is employed to evaluate the performance of the various models. The decision tree algorithm is the base learner for the AdaBoost and CS AdaBoost implementations.
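A hedged sketch of this 10-fold cross-validation protocol is given below; synthetic data stands in for the preprocessed CKD data, and the metric wiring (specificity computed as recall on the negative class) is an assumption about the setup rather than the paper's code.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import cross_validate

# stand-in for the preprocessed data (400 samples, ckd about 62.5% as in Section 2.1)
X, y = make_classification(n_samples=400, n_features=24,
                           weights=[0.375, 0.625], random_state=0)

scoring = {
    "acc": "accuracy",
    "sen": "recall",                                # sensitivity (TPR)
    "spe": make_scorer(recall_score, pos_label=0),  # specificity = recall on negatives
    "auc": "roc_auc",
}
cv = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10, scoring=scoring)
print({k: v.mean() for k, v in cv.items() if k.startswith("test_")})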

Performance of the Classifiers without Feature Selection
This subsection presents the experimental results obtained when the complete feature set was used to train the various classifiers. These results are tabulated in Table 2. Additionally, Figure 2 shows the AUC values and the ROC curves of the different classifiers. The results in Table 2 and Figure 2 show that the proposed cost-sensitive AdaBoost obtained excellent performance, outperforming the traditional AdaBoost and the other classifiers with an AUC, accuracy, sensitivity, and specificity of 0.980, 0.967, 0.975, and 0.960, respectively.


Performance of the Classifiers after Feature Selection
The information-gain-based feature selection ranked the chronic kidney disease attributes. This step aims to select the features with the highest information gain with respect to the target variable. The ranked features and their IG values are shown in Table 3. After obtaining the IG values of the various features, the standard deviation [19] of the values is computed, which serves as the threshold value for the feature selection (see the sketch below). The standard deviation measure has been used in recent research to obtain a reasonable threshold for feature selection [19,64,65]. The threshold value obtained is 0.156. Therefore, the attributes with IG values equal to or greater than 0.156 are selected as the informative features and used for building the models, while the attributes with IG values lower than the threshold are discarded.
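The following minimal sketch illustrates the thresholding step; scikit-learn's mutual information estimator stands in for IG (the two coincide for discrete variables), and synthetic data is used, so the scores will differ from Table 3.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=400, n_features=24, random_state=0)  # stand-in data

ig_scores = mutual_info_classif(X, y, random_state=0)  # one IG-like score per attribute
threshold = ig_scores.std()                            # std of the scores, as in the text
selected = np.where(ig_scores >= threshold)[0]         # keep attributes at/above threshold
X_reduced = X[:, selected]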
Hence, from Table 3, the top 18 features are selected as the optimal feature set since their IG values (rounded to three decimal places) are equal to or greater than 0.156, and the following features are discarded: f22, f5, f24, f8, f9, and f21. To demonstrate the effectiveness of the feature selection, the reduced feature set is used to train the proposed CS AdaBoost and the other classifiers. The experimental results are shown in Table 4. Additionally, the ROC curve and the various AUC values are shown in Figure 3. The experimental results in Table 4 and Figure 3 show that the proposed CS AdaBoost obtained an AUC, accuracy, sensitivity, and specificity of 0.990, 0.998, 1.000, and 0.998, respectively, which outperformed the logistic regression, decision tree, XGBoost, random forest, SVM, and conventional AdaBoost. Secondly, it is observed that the performance of the various classifiers in Table 4 is better than their corresponding performance in Table 2. This improvement demonstrates the effectiveness of the feature selection step. Therefore, the combination of feature selection and cost-sensitive AdaBoost is an effective method for predicting CKD.

Comparison with Other CKD Prediction Studies
Even though the proposed approach showed superior performance to the other algorithms, it is not enough to conclude its robustness. It is, however, necessary to compare it with other state-of-the-art methods in the literature. Hence, the proposed approach is compared with the following methods: a probabilistic neural network (PNN) [66], an enhanced sparse autoencoder (SAE) neural network [10], a naïve Bayes (NB) classifier with feature selection [67], a feature selection method based on cost-sensitive ensemble and random forest [3], a linear support vector machine (LSVM) and synthetic minority oversampling technique (SMOTE) [11], a cost-sensitive random forest [68], a feature selection method based on recursive feature elimination (RFE) and artificial neural network (ANN) [69], a correlation-based feature selection (CFS) and ANN [69]. The other methods include optimal subset regression (OSR) and random forest [9], an approach to identify the essential CKD features using improved linear discriminant analysis (LDA) [13], a deep belief network (DBN) with Softmax classifier [70], a random forest (RF) classifier with feature selection (FS) [71], a model based on decision tree and the SMOTE technique [12], a logistic regression (LR) classifier with recursive feature elimination (RFE) technique [14], and an XGBoost model with a feature selection approach combining the extra tree classifier (ETC), univariate selection (US), and RFE [15].
The proposed approach based on cost-sensitive AdaBoost and feature selection achieved excellent performance compared to several state-of-the-art methods in the literature, as shown in Table 5.

Discussion
This study aimed to solve two problems: the first was to select the most informative features to enable the effective detection of CKD, and the second was to develop an effective cost-sensitive AdaBoost classifier that accurately classifies samples in the minority class. Using more features than necessary can degrade ML classifiers' performance and increases the computational cost of training the models. Hence, this research employed the IG-based feature selection method to obtain the optimal feature set. Furthermore, seven classifiers were used in this study, trained using the complete and the reduced feature sets. From the experimental results, the proposed framework showed improved classification performance with the reduced feature set, i.e., 18 out of the 24 input variables. Additionally, the models trained with the reduced feature set performed better than those trained with the complete feature set. Remarkably, the proposed method obtained higher performance than the other classifiers. Furthermore, the features selected by the IG technique were consistent with current medical practice. For example, the IG technique ranked albumin, hemoglobin, packed cell volume, red blood cell count, and serum creatinine as the most informative features, and numerous studies have identified a strong correlation between these variables and chronic kidney disease [71][72][73][74].
Meanwhile, the class imbalance problem is common in most real-world classification tasks. Another objective of this study was to develop a robust classifier to prevent the misclassification of the minority class that occurs when classifiers are trained using imbalanced data. Hence, this study developed a cost-sensitive AdaBoost classifier, giving more attention to the minority class. The experimental results indicate that the proposed method achieved a higher classification performance than the baseline classifiers and techniques in recent literature. Secondly, the results demonstrate that the combination of the information-gain-based feature selection and the cost-sensitive AdaBoost classifier significantly improved the detection of chronic kidney disease.

Conclusions
This paper proposed an approach that combines information-gain-based feature selection and a cost-sensitive AdaBoost classifier to improve the detection of chronic kidney disease. Six other machine learning classifiers were implemented as the baseline for performance comparison. The classifiers include logistic regression, decision tree, random forest, SVM, XGBoost, and the traditional AdaBoost. Firstly, the IG technique was used to compute and rank the importance of the various attributes. Secondly, the classifiers were trained with the reduced and complete feature sets. The experimental results show that selected features enhanced the performance of the classifiers.
Furthermore, the proposed cost-sensitive AdaBoost achieved superior performance to the other classifiers and to methods in the recent literature. Therefore, combining the IG-based feature selection technique and the cost-sensitive AdaBoost is an effective approach for CKD detection and could potentially be applied for early detection of CKD through computer-aided diagnosis. Future research will focus on collecting large amounts of data to train ML models, including datasets that allow for the prediction of disease severity, duration of the disease, and the age of onset.