Article

Predicting the Risk of Chronic Kidney Disease (CKD) Using Machine Learning Algorithm

by Weilun Wang 1,*, Goutam Chakraborty 1,2 and Basabi Chakraborty 1
1 Faculty of Software Information Science, Iwate Prefectural University, Iwate 020-0693, Japan
2 Sendai Foundation for Applied Information Sciences, Sendai 980-0012, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2021, 11(1), 202; https://doi.org/10.3390/app11010202
Submission received: 11 November 2020 / Revised: 7 December 2020 / Accepted: 23 December 2020 / Published: 28 December 2020

Abstract: Background: Creatinine is a metabolite in blood that is strongly correlated with the glomerular filtration rate (GFR). As measuring GFR directly is difficult, the creatinine value is used to estimate GFR and thereby the stage of chronic kidney disease (CKD). Adding a creatinine test to routine health examinations could therefore detect CKD early. However, as more items in a comprehensive examination mean higher cost, creatinine testing is not included in routine health examinations in many countries. An algorithm that evaluates the risk of CKD from common test results, without a creatinine test, would increase the chance of early detection and treatment. Methods: In this study, we used open-source data containing 1 million samples. These data contain 23 health-related features, including common diagnostic test results, provided by the National Health Insurance Sharing Service (NHISS). A low GFR indicates possible CKD. As is commonly accepted in the medical community, a GFR of 60 mL/min is used as the threshold, below which a subject is considered to have CKD. The first step of this study is to build a regression model that predicts the value of creatinine from the 23 features; the predicted creatinine is then combined with the original 23 features to evaluate the risk of CKD. We show by simulation that the proposed method achieves better prediction results than direct prediction from the 23 features. The data are extremely unbalanced with respect to the target variable, creatinine. We used an undersampling method and propose a new cost-sensitive mean-squared error (MSE) loss function to deal with this problem. Regarding model selection, this work used three machine learning models: a bagging tree model named Random Forest, a boosting tree model named XGBoost, and a neural-network-based model named ResNet. To improve the creatinine predictor, we averaged the results of eight predictors, a method known as ensemble learning. Finally, the predicted creatinine and the original 23 features are used to predict the risk of CKD. Results: We optimized the R-squared (R2) value to select the appropriate undersampling strategy and the regression model for the creatinine prediction stage. The ensembled model achieved the best performance, with an R2 of 0.5590. The six factors, out of the 23, that most strongly affect the creatinine value are sex, age, hemoglobin, the level of urine protein, waist circumference, and smoking habit. Using the predicted value of creatinine, an area under the Receiver Operating Characteristic curve (AUC) of 0.76 is achieved when classifying samples for CKD. Conclusions: Using commonly available health parameters, the proposed system can assess the risk of CKD for public health. High-risk subjects can be screened and advised to take a creatinine test for further confirmation. In this way, we can reduce the impact of CKD on public health and facilitate early detection for many, where a blanket creatinine test is not available for all.

1. Introduction

Chronic kidney disease (CKD) is a type of kidney disease in which there is a gradual loss of glomerular filtration rate (GFR) over a period of more than 3 months [1]. It is a silent killer as there are no physical symptoms in the early stage. CKD affected 753 million people globally in 2016, 417 million females and 336 million males [2]. Over 1 million people in 112 poor countries die from renal failure every year, as they cannot afford the huge financial burden of regular dialysis or kidney replacement surgery [3]. Thus, early detection and effective intervention are important to reduce the impact of CKD on public health. Due to different economic conditions in different countries, the schedule for routine health examinations is different. Even in the same country, different groups get different levels of health examinations. A comprehensive routine health examination even for detection of common fatal diseases, like cancer and heart disease, is rare in most countries. Tests related to CKD are initiated only when there is a symptomatic problem, and then it is too late.
For screening of kidney function, a urine test and a blood test are needed [4]. Creatinine is a metabolite in blood that reflects the value of the glomerular filtration rate (GFR) indirectly. Direct measurement of GFR is difficult, so GFR is estimated by a simple function whose parameters are the creatinine value, sex, age, and race. Disease control agencies of some countries recommend that the whole population over a certain age be screened for creatinine. Meanwhile, in many countries, people with diabetes or hypertension (high blood pressure) are being screened with regular renal checks [5]. Prediction of CKD through ultrasound imaging is also considered desirable in clinical practice [6].
Recently, several studies on the prediction of CKD using machine learning methods have been reported [7,8,9,10,11,12,13]. All of these works used a dataset from the University of California, Irvine (UCI) [14], which contains 400 samples with 24 features (age, blood pressure, creatinine, etc.) for assessing CKD, and they achieved good classification results with over 97% accuracy. Although the results look good, they cannot be applied in practice. The first problem is that there is a bias in the UCI dataset. There are 250 CKD samples and 150 non-CKD samples in the dataset: the ratio of CKD to non-CKD is different from reality. In addition, among the 250 CKD samples, nearly 140 samples have creatinine values exceeding 10 mg/dL, which are of little value for learning to classify CKD. On this account, a model trained on these data will fail to classify, and the classification result will not be acceptable, when we consider actual data. How the composition of CKD and non-CKD samples affects the classification results is explained in Section 4.6. The second problem is that the ground truth of CKD is determined by the value of GFR, and the value of GFR is calculated by Equation (1). In Equation (1), the value of creatinine is the main contributing parameter, together with three other features: age, race, and sex. In other words, if the value of creatinine is already known, the value of GFR can be calculated directly using Equation (1), which appears in Section 2.1, and we know the status of CKD from the value of GFR. Therefore, using machine learning algorithms to predict CKD when the value of creatinine is already known, and the ground truth is calculated using Equation (1), is meaningless. Thus, the premise of the previously published works [7,8,9,10,11,12,13] is flawed.
In addition, we found that there is another way to assess CKD, as described in [5]. The features of age, gender, and the existence of diabetes, hypertension, anemia, and cardiovascular disease are used to compute a CKD risk score using a simple grid search method. From this work, we took the cue that the status of other common diseases could possibly be used to assess the risk of CKD without using creatinine as a parameter. The existing method treats the existence of related diseases, like diabetes or hypertension, as binary variables (0/1) to predict CKD risk. Detailed diagnostic test results, such as blood sugar, blood pressure, and hemoglobin, are usually included in a regular health check. Other common physical measures (waist circumference, BMI, vision, etc.) may also be related to the risk of CKD. Therefore, we decided to investigate whether we can predict the risk of CKD using these possibly related and commonly available data.
In this work, we propose a two-stage method to evaluate the risk of CKD under the condition that creatinine data are not available. In the first stage, a machine learning regression model is used to predict the value of creatinine from supervised data. As the values of creatinine in the data, which is the target variable, are extremely unbalanced, we used an undersampling method and propose a cost-sensitive mean squared error (MSE) loss function to deal with the problem. With respect to model selection, in this work we used three machine learning models: a bagging tree model known as Random Forest [15], a boosting tree model called XGBoost [16], and a neural-network-based model known as ResNet [17]. To improve the result of creatinine prediction, we averaged the results from eight predictors as ensemble learning. Finally, the predicted creatinine and the original 23 features are used to predict the risk of CKD as a binary classification.
The paper is organized as follows. In Section 2, we describe the dataset and the methods used in the experiments. Section 3 describes the proposed method. The experimental details and results are discussed in Section 4. The paper is concluded in Section 5 with some ideas about future directions of the work.

2. Materials and Methods

2.1. Dataset

In this work, we used the open-source data provided by the National Health Insurance Sharing Service (NHISS), available from the site https://nhiss.nhis.or.kr/. The dataset contains the regular health check information of 1 million subjects between 25 and 90 years of age, collected in 2017. There are 24 collected attributes; we added another three attributes, namely GFR, Stage, and CKD, and their descriptions are listed in Table 1.
In the first stage of the experiment, the 24th entry in Table 1 is the target variable, while attributes 1 to 23 are the input variables to the regression model. Attributes 25–27 are not contained in the original collected data; they are derived as follows. The 25th attribute, the glomerular filtration rate (GFR), is calculated using the formula shown below.
$\mathrm{GFR} = 186 \times [\mathrm{Creatinine}]^{-1.154} \times [\mathrm{Age}]^{-0.203} \times [1.212\ \text{if Black}] \times [0.742\ \text{if Female}]$ (1)
GFR is divided into six stages, from normal level to renal failure, according to the flow-rate ranges shown in Table 2; this is the 26th attribute in Table 1. Stages 1 and 2 are regarded as normal, non-CKD (class 0), and any lower flow rate is regarded as CKD (class 1). This is the 27th entry in Table 1 and is our final classification target (Target_2).
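To make the derivation of the added attributes concrete, the following minimal Python sketch computes GFR with Equation (1) and maps it to the stage and CKD labels of Table 2; the function and argument names are illustrative assumptions, not code from this study.

```python
def gfr_mdrd(creatinine, age, is_black=False, is_female=False):
    """Estimate GFR (mL/min) from serum creatinine, age, race, and sex using Equation (1)."""
    gfr = 186.0 * creatinine ** -1.154 * age ** -0.203
    if is_black:
        gfr *= 1.212
    if is_female:
        gfr *= 0.742
    return gfr

def gfr_to_stage(gfr):
    """Map a GFR value to the stage defined in Table 2."""
    for threshold, stage in [(90, "Stage 1"), (60, "Stage 2"), (45, "Stage 3a"),
                             (30, "Stage 3b"), (15, "Stage 4")]:
        if gfr >= threshold:
            return stage
    return "Stage 5"

def ckd_label(gfr):
    """Target_2: class 1 (CKD) if GFR is below the Stage 2 lower bound of 60 mL/min."""
    return int(gfr < 60)

# Example: a 50-year-old non-Black female with creatinine 1.3 mg/dL
gfr = gfr_mdrd(1.3, 50, is_female=True)
print(round(gfr, 1), gfr_to_stage(gfr), ckd_label(gfr))
```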

2.2. Regression Methods

2.2.1. Random Forest

Random forest (RF) is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the mean prediction of the individual trees [15]. A basic structure of Random Forest is shown in Figure 1.
Each sub-tree model is trained on a random sample, drawn with replacement, from the training data, and the results from all sub-models are finally averaged. Every sub-model runs in parallel without any dependency. In addition to constructing each tree using a different subset of the data, random forests differ from standard decision trees in how the trees are constructed. In a standard decision tree, each node is branched using the optimum decision for division among all variables, so as to minimize the entropy resulting from splitting the data set represented by the parent node. In a random forest, the split at each node is chosen as the best split among a randomly chosen subset of predictors. Random forest thus avoids the overfitting that is common with a single deep decision tree.
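As an illustration of this idea, the short sketch below fits a random forest regressor with scikit-learn on toy data standing in for the 23 health features; the hyperparameters are assumptions for demonstration, not the settings tuned in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for the 23 health features and the creatinine target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 23))
y = rng.normal(loc=0.9, scale=0.2, size=1000)

rf = RandomForestRegressor(
    n_estimators=200,     # number of bootstrapped sub-trees
    max_features="sqrt",  # each split considers only a random subset of predictors
    n_jobs=-1,            # sub-trees are independent, so they can be fit in parallel
    random_state=0,
)
rf.fit(X, y)
y_pred = rf.predict(X)    # the final prediction is the mean over the individual trees
```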

2.2.2. XGBoost

XGBoost is an optimized, distributed gradient boosting library [16]. It implements machine learning algorithms under the Gradient Boosting Decision Tree (GBDT) framework [18]. A basic structure of XGBoost is shown in Figure 2. It is to be noted that the residual from tree-1 is fed to tree-2 so as to reduce the residual, and this continues.
Unlike Random Forest, each tree model in XGBoost minimizes the residual left by its previous tree model. Traditional GBDT uses only the first derivative of the error information. XGBoost performs a second-order Taylor expansion of the cost function and uses both the first and second derivatives. In addition, the XGBoost tool supports customized cost functions.
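A corresponding sketch with the XGBoost library is shown below; again, the toy data and hyperparameters are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 23))        # toy stand-in for the 23 features
y = rng.normal(loc=0.9, scale=0.2, size=1000)

model = xgb.XGBRegressor(
    n_estimators=300,                  # number of sequential boosting rounds
    learning_rate=0.05,                # shrinkage applied to each tree's contribution
    max_depth=6,
    objective="reg:squarederror",      # standard MSE objective; custom objectives are also supported
)
model.fit(X, y)                        # each new tree fits the residual left by the previous trees
y_pred = model.predict(X)
```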

2.2.3. ResNet

ResNet is a deep residual neural network used to learn the regression of nonlinear functions [17]. It replaces convolutional layers and pooling layers with fully connected layers so that deep residual learning can be applied to nonlinear regression. The basic structure of ResNet is shown in Figure 3. The model begins with an input layer and is followed by dense blocks and identity blocks. There are three hidden dense layers in both the dense block and the identity block. In the dense block, the input is also connected to the output via another dense layer, whereas in the identity block it is connected directly. The output layer is the final layer. In this work, every dense block is followed by two identity blocks, and this set of three blocks is repeated 3 times. The last identity block is followed by the output layer.
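The block layout described above could be expressed in Keras roughly as follows; the layer widths and other details are assumptions for illustration and may differ from the architecture of [17].

```python
from tensorflow.keras import Model, layers

def identity_block(x, units):
    """Three dense layers; the block input is added back directly (identity shortcut)."""
    shortcut = x
    for _ in range(3):
        x = layers.Dense(units, activation="relu")(x)
    return layers.Add()([x, shortcut])

def dense_block(x, units):
    """Three dense layers; the input is projected by an extra dense layer before being added."""
    shortcut = layers.Dense(units)(x)
    for _ in range(3):
        x = layers.Dense(units, activation="relu")(x)
    return layers.Add()([x, shortcut])

inputs = layers.Input(shape=(23,))          # the 23 health features
x = inputs
for _ in range(3):                          # dense block + two identity blocks, repeated 3 times
    x = dense_block(x, 64)
    x = identity_block(x, 64)
    x = identity_block(x, 64)
outputs = layers.Dense(1)(x)                # single regression output: creatinine
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```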

3. Proposed Methodology

Figure 4 shows two methods of predicting CKD risk. In Figure 4a, we predict CKD risk directly, and in Figure 4b, we predict creatinine first and then combine its value with the 23 features to predict the risk of CKD. The complicated nonlinear prediction of creatinine in model (b) makes the input to the CKD classifier richer in information. Compared to model (a), model (b) can therefore achieve better classification. In this work, we used model (b) for CKD risk prediction; a sketch of this two-stage pipeline is given below.
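In the sketch, synthetic data stand in for the NHISS features, and all names are illustrative assumptions; in practice the two stages are trained and evaluated on proper training/test splits.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 23))                    # the 23 common health features (toy data)
creatinine = rng.normal(0.9, 0.2, size=2000)       # Target_1, known only for training data
ckd = (rng.random(2000) < 0.05).astype(int)        # Target_2: CKD / non-CKD labels

# Stage 1: a regression model predicts creatinine from the 23 features.
reg = xgb.XGBRegressor(objective="reg:squarederror").fit(X, creatinine)
creatinine_hat = reg.predict(X)

# Stage 2: the predicted creatinine is appended as an extra feature for the CKD classifier.
X_aug = np.column_stack([X, creatinine_hat])
clf = xgb.XGBClassifier(objective="binary:logistic").fit(X_aug, ckd)
ckd_risk = clf.predict_proba(X_aug)[:, 1]          # risk score in [0, 1], thresholded later
```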
In Section 3.1, the preprocessing method of the data is described. To overcome the problem of imbalance in data, we used undersampling, which is described in Section 3.2. We introduce a new cost-sensitive loss function in Section 3.3. Finally, the model ensemble strategy is explained in Section 3.4.

3.1. Preprocessing Method

In the preprocessing part, we did data cleaning as described in items (i) and (ii) below. We also modified the coding of some attributes as described in item (iii).
(i)
Some samples have missing values for some attributes. As we already have a large amount of data, the 8654 samples with one or more missing attributes are removed.
(ii)
Some samples have very large values of creatinine. Those samples are from patients at a late stage of renal failure. These subjects are not targets of this work, and therefore such data are treated as outliers as far as training our target model is concerned. We removed 1234 samples with a creatinine value higher than 2.5 mg/dL.
(iii)
In the original data, the attributes Sex and SMK_STATE are not suitable for numerical coding. We changed them into one-hot coding format: the original variables are removed and replaced with new binary variables, where 0 is the value when the category is false and 1 when it is true. For example, Sex is replaced by two attributes, Male and Female. For a male subject, the Male attribute is assigned the value "1" and the Female attribute the value "0".
After preprocessing, 990,112 samples remained. We split them into a training set with 900,000 samples and a test set with 90,112 samples.
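A compact pandas sketch of these preprocessing steps is given below; the column names are assumed to follow Table 1, and the helper name is an assumption for illustration.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of cleaning steps (i)-(iii); column names are assumed to follow Table 1."""
    df = df.dropna()                                         # (i) drop samples with missing attributes
    df = df[df["CREATININE"] <= 2.5]                         # (ii) drop late-stage outliers
    return pd.get_dummies(df, columns=["Sex", "SMK_STATE"])  # (iii) one-hot encode categoricals

# The 990,112 remaining samples would then be split, e.g.:
# from sklearn.model_selection import train_test_split
# train_df, test_df = train_test_split(preprocess(raw_df), test_size=90112, random_state=0)
```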

3.2. Undersampling for Data Balancing

3.2.1. Extremely Unbalanced Data

The distribution of the target variable, creatinine, is shown in Figure 5. Most of the samples are concentrated near the median, and the number of samples away from the median is very small. A machine learning model will fit better in the region with more samples and perform worse in the region where data are scarce. In other words, the confidence interval of the prediction accuracy will be wider where there are fewer samples.
For general regression tasks, the problem of unbalanced data can be ignored: if an interval contains few samples, the samples in that range can be treated as outliers. For this task, however, it is the opposite. Our target is to correctly predict those people who have a high creatinine value, even though we have few data in that range. From Figure 5, we observe that most samples lie between 0.5 and 1.4. If we use the whole dataset to train a machine learning model, then, since the training algorithm tries to minimize the total error over the whole dataset, the samples with high creatinine values, where data are scarce, will be largely ignored.
In order to show the impact of the imbalance problem on regression more explicitly, we performed a simple experiment using XGBoost with the mean squared error (MSE) loss function. The result is shown in Figure 6, where the y-axis represents the ground truth and the x-axis the predicted value. We observed that the model failed to predict well in the intervals with fewer training samples.
Generally speaking, the problem occurs because the number of samples with a high creatinine value is not enough for the machine learning model to learn from. The data imbalance problem can usually be solved by data-level methods and model-level methods [19]. Oversampling is the most common data-level method; it balances the data distribution by resampling or generating new data from the small number of available samples. However, these data are clearly not suitable for oversampling: high-creatinine data are extremely scarce, and oversampling would amplify the noise in these rare data many times over. Instead, we used an undersampling method and propose a cost-sensitive mean-squared error (MSE) loss function to deal with this problem.

3.2.2. Details about Undersampling

The target of the prediction, the creatinine value, is highly imbalanced in the data set. To alleviate the imbalance problem, undersampling is applied to the ranges of values where a large amount of data is available. Table 3 shows the details of the undersampling. Six undersampling strategies with different levels of undersampling are considered, as shown in Table 3. The training data become more balanced as we go from sampling-1 to sampling-6. The total training data, shown in the last row of Table 3, are 191,312, 107,439, 61,757, 34,220, 18,896, and 11,049 samples, respectively. As we opt for a more balanced set, the number of training samples decreases. We performed experiments with all six sample sets and compared the results.
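The per-bin capping used by these strategies could be sketched as follows; the helper name and the caps dictionary are assumptions, with the cap values taken from the corresponding column of Table 3.

```python
import pandas as pd

def undersample(df: pd.DataFrame, caps: dict, seed: int = 0) -> pd.DataFrame:
    """Cap the number of training samples per 0.1-wide creatinine bin, as in Table 3.
    `caps` maps a bin value (e.g. 0.8) to its maximum count (e.g. 2500 for sampling-4)."""
    bins = df["CREATININE"].round(1)                 # assign each sample to its 0.1 bin
    parts = []
    for value, group in df.groupby(bins):
        cap = caps.get(round(value, 1), len(group))  # bins without a cap keep all samples
        if len(group) > cap:
            group = group.sample(n=cap, random_state=seed)
        parts.append(group)
    return pd.concat(parts)

# sampling_4 = {round(0.1 * k, 1): 2500 for k in range(4, 15)}   # bins 0.4-1.4 capped at 2500
# train_sub = undersample(train_df, sampling_4, seed=0)
```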

3.3. Cost-Sensitive MSE Loss Function

Mean absolute error (MAE) and mean squared error (MSE) are the two basic evaluation methods for regression tasks. Their formulas are shown below,
$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_t^i - y_p^i\right|$ (2)
$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_t^i - y_p^i\right)^2$ (3)
where $y_t^i$ is the target value of the $i$-th labeled sample and $y_p^i$ is the corresponding value predicted by the regression algorithm. Because errors in MSE are squared, samples with small errors become less important and there is a stronger incentive to fit the data with larger errors, which typically occur for attribute values where the available data are rare. However, this does not mean that a cost function with an even higher order on the errors will achieve better results, especially for noisy data, where the noise would be amplified as well. To strike this balance, MSE is considered a suitable loss function for this task.
Although MSE makes less frequent data more important, the imbalance of the data still causes the model to be trained mainly to reduce the error on the more abundant data, so the error on the rare data remains larger than on the common data. To alleviate this problem, we split the data into k subsets, calculate the mean error of each subset, denoted average_error_σ_k, and their overall average, denoted average_error_σ. The ratio between average_error_σ_k and average_error_σ is used as a weight for the error of each sample from the corresponding subset. As the weight of the error is sensitive to the cost of each subset, we call this loss function cost-sensitive MSE; its formula is shown below.
$\mathrm{Cost\_Sensitive\_MSE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\mathit{average\_error\_\sigma_k}}{\mathit{average\_error\_\sigma}}\times\left(y_t^i - y_p^i\right)^2$ (4)
Compared to the original MSE cost, we add a weight to the squared error of each sample. The weight is the ratio of average_error_σ_k to average_error_σ. This new loss function is implemented as follows.
(a)
Calculate the range of the target variable (Range) and the minimum value of the target variable (Min).
(b)
Split the data into 10 subsets depending on Range. For example, samples with target variable larger than (0.2 × Range + Min) and smaller than (0.3 × Range + Min) belong to subset_3.
(c)
After the first training epoch, calculate the mean error of each subset, which is named average_error_σ_k.
(d)
Calculate the mean value of average_error_σ_k over the 10 subsets, which is average_error_σ.
For the subsets with fewer samples, average_error_σ_k will be larger than average_error_σ and the weight coefficient will be larger than 1. For the subsets with more samples, average_error_σ_k will be smaller than average_error_σ and the weight coefficient will be smaller than 1. Unlike applying a higher-order exponent to the errors, which affects all samples, the proposed method is tuned for samples in specific intervals. The advantage of this method is that it not only increases the importance of rare data but also avoids the negative effects of applying high-order exponential errors to all samples in the training data set. A sketch of this weighting is given below.
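The sketch below computes this weighted loss with numpy. Note that in the procedure above the per-subset errors would be computed after the first training epoch and then reused as weights, whereas here they are computed directly from the supplied predictions for brevity; the function name is an assumption.

```python
import numpy as np

def cost_sensitive_mse(y_true, y_pred, n_subsets=10):
    """Sketch of Equation (4): weight each squared error by the ratio of its subset's
    average error to the overall average of the subset errors."""
    lo, span = y_true.min(), y_true.max() - y_true.min()
    # assign each sample to one of 10 equal-width subsets over the target range
    idx = np.clip(((y_true - lo) / span * n_subsets).astype(int), 0, n_subsets - 1)
    abs_err = np.abs(y_true - y_pred)
    subset_err = np.array([abs_err[idx == k].mean() if np.any(idx == k) else 0.0
                           for k in range(n_subsets)])
    weights = subset_err[idx] / subset_err.mean()   # > 1 for poorly fitted (rare) subsets
    return float(np.mean(weights * (y_true - y_pred) ** 2))
```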

3.4. Model Ensemble Strategy

As an undersampling method is used for data balancing, only a small portion of the samples is used for training. In order to better utilize the training samples, the model ensemble strategy shown in Figure 7 is used. It is a bagging method that draws different sets of undersampled data and trains multiple predictors. Finally, the results from the multiple predictors are averaged to improve the generalization performance of the predictor.
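A sketch of this bagging-style ensemble is shown below; it reuses the undersample() helper sketched in Section 3.2.2, and the function names are assumptions for illustration.

```python
import numpy as np
import xgboost as xgb

def train_ensemble(train_df, feature_cols, caps, n_models=8):
    """Train one XGBoost predictor per independently undersampled dataset (P1..P8),
    reusing the undersample() helper sketched in Section 3.2.2."""
    models = []
    for seed in range(n_models):
        sub = undersample(train_df, caps, seed=seed)     # a different random draw each time
        reg = xgb.XGBRegressor(objective="reg:squarederror")
        reg.fit(sub[feature_cols], sub["CREATININE"])
        models.append(reg)
    return models

def ensemble_predict(models, X):
    """Bagging: average the outputs of the individual predictors."""
    return np.mean([m.predict(X) for m in models], axis=0)
```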

4. Experiments and Results

4.1. Evaluation Method

As with the training set, we need to attend to the imbalance problem for the test set. If the whole test set, which contains 90,112 samples, is used for evaluation, the mean error of the whole test set can be very low even though the predictions for less frequent samples are poor. In order to evaluate the results effectively, undersampling is used on the test set as well. From the test set, we randomly take only 100 samples for those ranges of target values that contain over 100 samples, and all remaining (not used during training) samples for the less frequent segments of the range. We finally take 1592 samples for testing. To improve the reliability of the evaluation results, the test experiment is repeated 10 times and the averaged results are used for comparison.
R-square score (R2) calculated using Equation (5) is used to evaluate the prediction results. The greater the value of R2, the better the regression result.
$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_t^i - y_p^i\right)^2}{\sum_{i=1}^{N}\left(y_t^i - y_m\right)^2}$ (5)
where $y_m$ is the mean of the target values over the test samples.
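For reference, Equation (5) can be computed as in the short sketch below (scikit-learn's r2_score gives the same value); the function name is an assumption.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Equation (5): 1 - SS_res / SS_tot, where y_m is the mean of the target values."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```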

4.2. Experiments with Different Undersampling Strategies

Experimental results with the different undersampling subsets are presented in Table 4. The different sampling strategies are listed in Table 3. Test-1 to Test-10 are results with different test samples, and the entries of Table 4 are the R2 scores. In this experiment, XGBoost is used because it is the fastest of the three algorithms; RF and ResNet, the other two, are much slower. The sampling-5 strategy achieved the best R2 score of 0.5591 with 18,896 samples, and the sampling-4 strategy achieved the second-best R2 score of 0.5523 with 34,220 samples. Our goal is to achieve a good result while keeping as many samples as possible. Even though the sampling-5 strategy achieved a slightly better R2 score than sampling-4, the sampling-4 strategy uses almost twice the number of training samples. On this account, the sampling-4 strategy, which achieved an average R2 score of 0.5523 using 34,220 training samples, is selected as the best undersampling strategy for this problem and is used in all of the following experiments.

4.3. Experiments with Different Regression Algorithms

The next step is to test the efficacy of the different regression algorithms. Experimental results using the three regression algorithms with sampling-4 are presented in Table 5. It shows that XGBoost achieved the best R2 score of 0.5523 (also shown in Table 4). Therefore, XGBoost is selected as the standard predictor in the following experiments.
Additionally, we used random forest to evaluate the importance of each variable by calculating the decrease in the Gini index when that variable is used for a decision at a higher-level node. The top six most important variables are shown in Table 6.
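A sketch of this ranking with scikit-learn is shown below; note that for regression trees scikit-learn reports impurity-based importances computed from variance reduction, which plays the role of the Gini-based ranking described here, and the function name is an assumption.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def top_features(X: pd.DataFrame, y, k: int = 6) -> pd.Series:
    """Rank the predictors by random forest impurity-based importance (cf. Table 6)."""
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(k)
```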

4.4. Experiments with Cost-Sensitive Loss Function

Before using the cost-sensitive loss, the average R2 score was 0.5523, as shown in Table 4. After using the cost-sensitive loss, the R2 score improves slightly to 0.5546. From the simulations, we observed that the R2 score did not improve much by using the cost-sensitive loss function. However, the cost-sensitive loss increases the R2 score on the subsets of data in the ranges where the number of samples is small. From Figure 8 and Table 7, we can observe that in the ranges with fewer data the error decreases, while in the ranges with more data the error increases. As accurate prediction in the region of high creatinine values is more important due to the higher risk of CKD, the cost-sensitive loss function achieves its intended goal.
By using the sampling-4 undersampling strategy for data balancing together with the cost-sensitive loss function, the impact of the data imbalance problem is partially mitigated. The result is shown in Figure 9, where the y-axis represents the ground truth value and the x-axis the predicted value. The left part is the result before dealing with the imbalance problem; the right part is the result after using undersampling and the cost-sensitive loss function. We can observe that the model is able to predict with better accuracy over the whole range of data values.

4.5. Experiments with Model Ensemble

As the undersampling method selects only 34,220 samples to train a regression model, most of the data are not used for model training. In order to make fuller use of the data, undersampling was performed eight times to extract eight training datasets. Eight different models, P1 to P8, were trained on these eight datasets using XGBoost. Table 8 shows the results of the eight predictors; every single predictor achieved almost the same averaged R2 score.
Table 9 shows the results of model ensembling. 1P means the result from P1; 2Ps means the averaged result from P1 and P2; 3Ps means the averaged result from P1, P2, and P3; and so on. We achieve a better R2 score of 0.5590 by using the model ensembling strategy. More importantly, the generalization performance is improved by using the averaged result from multiple predictors.

4.6. Predicting the Risk of CKD

The predicted creatinine value and the original 23 features are combined to predict the risk of CKD. We trained an XGBoost classifier with a logistic regression objective, using the sampling-4 strategy for training.
First, a test set of 1592 samples, which contains 555 CKD samples and 1037 non-CKD samples, is used for testing. The Receiver Operating Characteristic (ROC) curve obtained by varying the threshold is drawn in Figure 10. The area under the ROC curve (AUC) was 0.90. In order to display the results more precisely, Table 10 shows the true positives (TP), false positives (FP), true negatives (TN), false negatives (FN), true positive rate (TPR), and true negative rate (TNR) at varying thresholds. If the threshold is set at 0.3, we could achieve a TPR of 84% and a TNR of 83% under the condition that creatinine data are not available.
As the ratio of CKD to non-CKD in the 1592 samples is different from reality, the whole test set of 90,112 samples, which contains 3248 CKD samples and 86,864 non-CKD samples, is used next for testing. The ROC curve obtained by varying the threshold is shown in Figure 11. The AUC was 0.76. Table 11 shows the detailed results at varying thresholds. Obviously, when the whole test set of 90,112 samples is used for testing, the overall result is worse, because the test data are very unbalanced. If we use the same threshold of 0.3, the TPR decreases from 84% to 56% and the TNR decreases from 83% to 80%, because there are many samples whose class is difficult to define clearly. Even so, if the threshold is set at 0.4, 13,540 (1598 + 11,942) samples will be classified as positive. In that case, 49.2% (1598/3248) of the positive samples are detected, and only 15.6% (13,540/86,864) of the population need to be checked. In this way, we can reduce the impact of CKD through a compromise of testing creatinine for those identified in the CKD risk class.
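The AUC and the threshold-dependent counts reported in Tables 10 and 11 can be reproduced from the predicted risk scores with a sketch like the following; the function name is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def threshold_report(y_true, risk, threshold=0.3):
    """AUC plus the confusion counts and rates reported in Tables 10 and 11."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(risk) >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    return {"AUC": roc_auc_score(y_true, risk),
            "TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "TPR": tp / (tp + fn),    # sensitivity: fraction of CKD cases detected
            "TNR": tn / (tn + fp)}    # specificity: fraction of non-CKD correctly passed
```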

5. Conclusions and Future Work

This study aimed to build a regression model to predict the value of creatinine and then combine the predicted value of creatinine with the 23 common health factors to evaluate the risk of CKD. As the creatinine value, which is the target variable, is extremely unbalanced, we used an undersampling method and proposed a cost-sensitive mean squared error (MSE) loss function to deal with the problem. Regarding model selection, we used three machine learning models: a bagging tree model named Random Forest, a boosting tree model named XGBoost, and a neural-network-based model named ResNet. To improve the result of the creatinine predictor, we averaged and ensembled the results from eight predictors. The ensembled model showed the best performance, with an R2 of 0.5590. The top six factors that influence creatinine are sex, age, hemoglobin, the level of urine protein, waist circumference, and smoking habit. With the predicted value of creatinine, an area under the Receiver Operating Characteristic curve (AUC) of 0.76 is achieved when classifying samples for CKD.

Author Contributions

Formal analysis, W.W. and G.C.; Software, W.W.; Supervision, G.C. and B.C.; Writing—original draft, W.W.; Writing—review and editing, G.C. and B.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data are available in a publicly accessible repository that does not issue DOIs and can be found here: https://nhiss.nhis.or.kr/.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Stevens, P.E.; Levin, A. Evaluation and management of chronic kidney disease: Synopsis of the kidney disease: Improving global outcomes 2012 clinical practice guideline. Ann. Intern. Med. 2013, 158, 825–830.
2. Bikbov, B.; Perico, N.; Remuzzi, G. Disparities in chronic kidney disease prevalence among males and females in 195 countries: Analysis of the Global Burden of Disease 2016 Study. Nephron 2018, 139, 313–318.
3. Couser, W.G.; Remuzzi, G.; Mendis, S.; Tonelli, M. The contribution of chronic kidney disease to the global burden of major noncommunicable diseases. Kidney Int. 2011, 80, 1258–1270.
4. National Institute of Diabetes and Digestive and Kidney Diseases. Chronic Kidney Disease Tests and Diagnosis; National Institute of Diabetes and Digestive and Kidney Diseases: Bethesda, MD, USA, 2016.
5. Yarnoff, B.O.; Hoerger, T.J.; Simpson, S.K.; Leib, A.; Burrows, N.R.; Shrestha, S.S.; Pavkov, M.E. The cost-effectiveness of using chronic kidney disease risk scores to screen for early-stage chronic kidney disease. BMC Nephrol. 2017, 18, 85.
6. Kuo, C.C.; Chang, C.M.; Liu, K.T.; Lin, W.K.; Chiang, H.Y.; Chung, C.W.; Ho, M.R.; Sun, P.R.; Yang, R.L.; Chen, K.T. Automation of the kidney function prediction and classification through ultrasound-based kidney imaging using deep learning. NPJ Digit. Med. 2019, 2, 1–9.
7. Basak, S.; Alam, M.M.; Rakshit, A.; Al Marouf, A.; Majumder, A. Predicting and Staging Chronic Kidney Disease of Diabetes (Type-2) Patient Using Machine Learning Algorithms. Int. J. Innov. Technol. Explor. Eng. 2019, 8.
8. Kumar, M. Prediction of chronic kidney disease using random forest machine learning algorithm. Int. J. Comput. Sci. Mob. Comput. 2016, 5, 24–33.
9. Rady, E.H.A.; Anwar, A.S. Prediction of kidney disease stages using data mining algorithms. Inform. Med. Unlocked 2019, 15, 100178.
10. Chimwayi, K.B.; Haris, N.; Caytiles, R.D.; Iyengar, N.C.S. Risk Level Prediction of Chronic Kidney Disease Using Neuro-Fuzzy and Hierarchical Clustering Algorithm(s). Int. J. Multimedia Ubiq. Eng. 2017, 12, 23–36.
11. Almansour, N.A.; Syed, H.F.; Khayat, N.R.; Altheeb, R.K.; Juri, R.E.; Alhiyafi, J.; Alrashed, S.; Olatunji, S.O. Neural network and support vector machine for the prediction of chronic kidney disease: A comparative study. Comput. Biol. Med. 2019, 109, 101–111.
12. Salekin, A.; Stankovic, J. Detection of chronic kidney disease and selecting important predictive attributes. In Proceedings of the 2016 IEEE International Conference on Healthcare Informatics (ICHI), Chicago, IL, USA, 4–7 October 2016.
13. Almasoud, M.; Ward, T.E. Detection of chronic kidney disease using machine learning algorithms with least number of predictors. Int. J. Soft Comput. Its Appl. 2019, 10.
14. Rubini, L.J. Chronic Kidney Disease Data Set; Department of Computer Science and Engineering, Alagappa University: Tamil Nadu, India, 2015. Available online: http://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease (accessed on 8 June 2020).
15. Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2, 18–22.
16. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
17. Chen, D.; Hu, F.; Nian, G.; Yang, T. Deep Residual Learning for Nonlinear Regression. Entropy 2020, 22, 193.
18. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
19. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2017, 106, 249–259.
Figure 1. Simplified structure of random forest.
Figure 2. Simplified structure of XGBoost.
Figure 3. Simplified structure of ResNet.
Figure 4. Two methods of predicting Chronic Kidney Disease (CKD) risk. (a) directly training a CKD classifier to predict CKD risk; (b) predict creatinine first and then combine its value with 23 features to predict the risk of CKD.
Figure 5. Distribution of Creatinine.
Figure 6. Impact of imbalance problem.
Figure 7. Process of model ensemble.
Figure 8. Effect of cost-sensitive loss function.
Figure 9. Effect of undersampling and cost-sensitive loss.
Figure 10. The ROC curve of 1592 testing samples.
Figure 11. The ROC curve of 90,112 testing samples.
Table 1. Description of variables used in the analysis.
Ser | Variable | Type | Class | Description
1 | Sex | Categorical | Predictor |
2 | Age | Numerical | Predictor |
3 | Waist | Numerical | Predictor |
4 | Listen_left | Categorical | Predictor | hearing impairment or not
5 | Listen_right | Categorical | Predictor | hearing impairment or not
6 | Vision_left | Numerical | Predictor |
7 | Vision_right | Numerical | Predictor |
8 | BP_HIGH | Numerical | Predictor | systolic pressure
9 | BP_LWST | Numerical | Predictor | diastolic pressure
10 | BLDS | Numerical | Predictor | fasting blood sugar
11 | TOT_CHOLE | Numerical | Predictor | total cholesterol
12 | TRIGLYCERIDE | Numerical | Predictor | triglycerides
13 | HDL_CHOLE | Numerical | Predictor | high-density lipoprotein cholesterol
14 | LDL_CHOLE | Numerical | Predictor | low-density lipoprotein cholesterol
15 | HMG | Numerical | Predictor | hemoglobin
16 | OLIG_PROTE_CD | Categorical | Predictor | the level of urine protein
17 | SGOT_AST | Numerical | Predictor | aspartate amino-transferase
18 | SGPT_ALT | Numerical | Predictor | alanine amino-transferase
19 | GAMMA_GTP | Numerical | Predictor | gamma-glutamyl transpeptidase
20 | SMK_STATE | Categorical | Predictor | smoking status
21 | DRINK_OR_NOT | Categorical | Predictor | drinking habit
22 | MOUTH_CHECK | Categorical | Predictor | decayed teeth or not
23 | BMI | Numerical | Predictor | calculated using height and weight
24 | CREATININE | Numerical | Target_1 | serum creatinine
25 | GFR | Numerical | | calculated using Equation (1)
26 | Stage | Categorical | | determined from GFR
27 | CKD | Binary | Target_2 | GFR below Stage 2 (class 1) or not (class 0, normal)
Table 2. Stage of glomerular filtration rate (GFR) [1].
Stage | GFR (mL/min) | Description
Stage 1 | 90 or higher | normal
Stage 2 | 89 to 60 | mild loss
Stage 3a | 59 to 45 | mild to moderate
Stage 3b | 44 to 30 | moderate to severe
Stage 4 | 29 to 15 | severe
Stage 5 | less than 15 | failure
Table 3. Undersampling.
Creatinine | Original | Sampling-1 | Sampling-2 | Sampling-3 | Sampling-4 | Sampling-5 | Sampling-6
0.1 | 395 | 395 | 395 | 395 | 395 | 395 | 395
0.2 | 93 | 93 | 93 | 93 | 93 | 93 | 93
0.3 | 549 | 549 | 549 | 549 | 549 | 549 | 549
0.4 | 5503 | 5503 | 5503 | 5000 | 2500 | 1200 | 600
0.5 | 35,395 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
0.6 | 99,185 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
0.7 | 149,238 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
0.8 | 177,224 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
0.9 | 164,153 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
1.0 | 127,881 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
1.1 | 78,432 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
1.2 | 37,180 | 20,000 | 10,000 | 5000 | 2500 | 1200 | 600
1.3 | 13,873 | 13,873 | 10,000 | 5000 | 2500 | 1200 | 600
1.4 | 5216 | 5216 | 5216 | 5000 | 2500 | 1200 | 600
1.5 | 2224 | 2224 | 2224 | 2224 | 2224 | 1200 | 600
1.6 | 1160 | 1160 | 1160 | 1160 | 1160 | 1160 | 600
1.7 | 687 | 687 | 687 | 687 | 687 | 687 | 600
1.8 | 523 | 523 | 523 | 523 | 523 | 523 | 523
1.9 | 309 | 309 | 309 | 309 | 309 | 309 | 309
2.0 | 235 | 235 | 235 | 235 | 235 | 235 | 235
2.1 | 144 | 144 | 144 | 144 | 144 | 144 | 144
2.2 | 137 | 137 | 137 | 137 | 137 | 137 | 137
2.3 | 92 | 92 | 92 | 92 | 92 | 92 | 92
2.4 | 96 | 96 | 96 | 96 | 96 | 96 | 96
2.5 | 76 | 76 | 76 | 76 | 76 | 76 | 76
Sum | 900,000 | 191,312 | 107,439 | 61,757 | 34,220 | 18,896 | 11,049
Table 4. Experiments with different undersampling strategies using XGBoost.
Test Set | Original | Sampling-1 | Sampling-2 | Sampling-3 | Sampling-4 | Sampling-5 | Sampling-6
1 | 0.2507 | 0.4586 | 0.5065 | 0.5389 | 0.5581 | 0.5667 | 0.5532
2 | 0.2488 | 0.4540 | 0.5005 | 0.5389 | 0.5562 | 0.5622 | 0.5473
3 | 0.2507 | 0.4521 | 0.4967 | 0.5309 | 0.5455 | 0.5543 | 0.5384
4 | 0.2484 | 0.4538 | 0.5017 | 0.5341 | 0.5574 | 0.5633 | 0.5482
5 | 0.2498 | 0.4537 | 0.5026 | 0.5348 | 0.5526 | 0.5604 | 0.5426
6 | 0.2468 | 0.4507 | 0.4977 | 0.5313 | 0.5453 | 0.5585 | 0.5407
7 | 0.2447 | 0.4484 | 0.4997 | 0.5358 | 0.5545 | 0.5644 | 0.5492
8 | 0.2461 | 0.4499 | 0.4948 | 0.5328 | 0.5471 | 0.5544 | 0.5359
9 | 0.2449 | 0.4494 | 0.4997 | 0.5361 | 0.5561 | 0.5649 | 0.5523
10 | 0.2423 | 0.4472 | 0.4973 | 0.5326 | 0.5499 | 0.5618 | 0.5499
Average | 0.2473 | 0.4518 | 0.4997 | 0.5346 | 0.5523 | 0.5591 | 0.5458
Table 5. Experiments with different regression algorithms.
Test Set | Random Forest | XGBoost | ResNet
1 | 0.5392 | 0.5581 | 0.5322
2 | 0.5369 | 0.5562 | 0.5266
3 | 0.5289 | 0.5455 | 0.5265
4 | 0.5362 | 0.5574 | 0.5338
5 | 0.5337 | 0.5526 | 0.5357
6 | 0.5289 | 0.5453 | 0.5254
7 | 0.5387 | 0.5545 | 0.5353
8 | 0.5267 | 0.5417 | 0.5238
9 | 0.5389 | 0.5561 | 0.5288
10 | 0.5353 | 0.5499 | 0.5289
Average | 0.5343 | 0.5523 | 0.5297
Table 6. Factors important in predicting Creatinine value.
Variable | Decrease in Gini Index
Sex | 0.49
Age | 0.11
HMG | 0.10
OLIG_PROTE_CD | 0.08
Waist | 0.04
SMOKE_STATE | 0.04
Table 7. Experiments with different loss functions. XGBoost is used for the regression model.
Data Subset | Number of Data in the Subset | Error Without Cost-Sensitive Loss | Comparison | Error With Cost-Sensitive Loss
1 | 840 | 0.6125 | > | 0.6036
2 | 2000 | 0.2065 | > | 0.1974
3 | 3000 | 0.1693 | < | 0.1799
4 | 2000 | 0.1952 | < | 0.1972
5 | 3000 | 0.1336 | < | 0.1544
6 | 2000 | 0.2309 | < | 0.2422
7 | 1710 | 0.2896 | > | 0.2831
8 | 870 | 0.4746 | > | 0.4081
9 | 220 | 0.5332 | > | 0.4560
10 | 280 | 0.8235 | > | 0.6965
Table 8. Experiments with model ensemble.
Test Set | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | Average
1 | 0.5562 | 0.5574 | 0.5576 | 0.5569 | 0.5591 | 0.5568 | 0.5624 | 0.5599 | 0.5583
2 | 0.5537 | 0.5483 | 0.5562 | 0.5551 | 0.5559 | 0.5519 | 0.5542 | 0.5538 | 0.5536
3 | 0.5449 | 0.5421 | 0.5516 | 0.5432 | 0.5476 | 0.5433 | 0.5512 | 0.5483 | 0.5465
4 | 0.5617 | 0.5605 | 0.5656 | 0.5625 | 0.5589 | 0.5509 | 0.5625 | 0.5614 | 0.5605
5 | 0.5498 | 0.5505 | 0.5530 | 0.5502 | 0.5496 | 0.5507 | 0.5526 | 0.5522 | 0.5511
6 | 0.5498 | 0.5474 | 0.5503 | 0.5499 | 0.5508 | 0.5483 | 0.5531 | 0.5503 | 0.5500
7 | 0.5635 | 0.5650 | 0.5690 | 0.5664 | 0.5707 | 0.5676 | 0.5708 | 0.5680 | 0.5676
8 | 0.5520 | 0.5511 | 0.5527 | 0.5503 | 0.5502 | 0.5528 | 0.5560 | 0.5539 | 0.5524
9 | 0.5592 | 0.5512 | 0.5585 | 0.5571 | 0.5595 | 0.5533 | 0.5585 | 0.5573 | 0.5568
10 | 0.5497 | 0.5458 | 0.5491 | 0.5495 | 0.5483 | 0.5473 | 0.5499 | 0.5459 | 0.5482
Average | 0.5541 | 0.5519 | 0.5564 | 0.5541 | 0.5551 | 0.5533 | 0.5571 | 0.5551 | 0.5546
Table 9. Experiments with model ensemble.
Test Set | 1P | 2Ps | 3Ps | 4Ps | 5Ps | 6Ps | 7Ps | 8Ps
1 | 0.5562 | 0.5596 | 0.5605 | 0.5608 | 0.5615 | 0.5615 | 0.5623 | 0.5626
2 | 0.5537 | 0.5538 | 0.5562 | 0.5571 | 0.5579 | 0.5577 | 0.5580 | 0.5581
3 | 0.5449 | 0.5462 | 0.5497 | 0.5492 | 0.5499 | 0.5496 | 0.5506 | 0.5510
4 | 0.5617 | 0.5642 | 0.5663 | 0.5665 | 0.5661 | 0.5661 | 0.5663 | 0.5664
5 | 0.5498 | 0.5528 | 0.5545 | 0.5547 | 0.5546 | 0.5548 | 0.5552 | 0.5555
6 | 0.5498 | 0.5514 | 0.5526 | 0.5532 | 0.5536 | 0.5536 | 0.5542 | 0.5543
7 | 0.5635 | 0.5669 | 0.5692 | 0.5696 | 0.5708 | 0.5711 | 0.5718 | 0.5720
8 | 0.5520 | 0.5542 | 0.5553 | 0.5553 | 0.5552 | 0.5556 | 0.5564 | 0.5567
9 | 0.5592 | 0.5578 | 0.5596 | 0.5602 | 0.5610 | 0.5605 | 0.5610 | 0.5611
10 | 0.5497 | 0.5503 | 0.5516 | 0.5522 | 0.5525 | 0.5524 | 0.5527 | 0.5525
Average | 0.5541 | 0.5557 | 0.5575 | 0.5579 | 0.5583 | 0.5583 | 0.5588 | 0.5590
Table 10. Performance with varying threshold value on classification of 1592 samples.
Threshold | TP | FN | TN | FP | TPR | TNR
0.0 | 555 | 0 | 0 | 1037 | 100% | 0%
0.1 | 523 | 32 | 570 | 467 | 94% | 55%
0.2 | 488 | 67 | 761 | 276 | 88% | 73%
0.3 | 464 | 91 | 861 | 176 | 84% | 83%
0.4 | 435 | 120 | 909 | 128 | 78% | 88%
0.5 | 401 | 154 | 944 | 93 | 72% | 91%
0.6 | 348 | 207 | 975 | 62 | 63% | 94%
0.7 | 290 | 265 | 996 | 41 | 52% | 96%
0.8 | 218 | 337 | 1016 | 21 | 39% | 98%
0.9 | 127 | 428 | 1027 | 10 | 23% | 99%
1.0 | 0 | 555 | 1037 | 0 | 0% | 100%
Table 11. Performance with varying threshold value on classification of 90,112 testing samples.
Threshold | TP | FN | TN | FP | TPR | TNR
0.0 | 3248 | 0 | 0 | 86,864 | 100% | 0%
0.1 | 2648 | 600 | 42,096 | 44,768 | 82% | 48%
0.2 | 2171 | 1077 | 60,526 | 26,338 | 67% | 70%
0.3 | 1805 | 1443 | 69,360 | 17,504 | 56% | 80%
0.4 | 1598 | 1650 | 74,922 | 11,942 | 49% | 86%
0.5 | 1245 | 2003 | 79,564 | 7300 | 38% | 92%
0.6 | 971 | 2277 | 82,275 | 4589 | 30% | 95%
0.7 | 720 | 2528 | 84,422 | 2422 | 22% | 97%
0.8 | 476 | 2772 | 85,770 | 1094 | 15% | 99%
0.9 | 138 | 3110 | 86,730 | 134 | 4% | 99%
1.0 | 0 | 3248 | 86,864 | 0 | 0% | 100%
