Predicting the Risk of Chronic Kidney Disease (CKD) Using Machine Learning Algorithm

Background: Creatinine is a blood metabolite that is strongly correlated with glomerular filtration rate (GFR).


Introduction
Chronic kidney disease (CKD) is a kidney disease in which there is a gradual loss of glomerular filtration rate (GFR) over a period of more than 3 months [1]. It is a silent killer, as there are no physical symptoms in the early stage. CKD affected 753 million people globally in 2016: 417 million females and 336 million males [2]. Over 1 million people in 112 poor countries die from renal failure every year because they cannot afford the huge financial burden of regular dialysis or kidney replacement surgery [3]. Thus, early detection and effective intervention are important to reduce the impact of CKD on public health. Because economic conditions differ between countries, so do the schedules for routine health examinations; even within the same country, different groups receive different levels of examination. A comprehensive routine health examination, even for the detection of common fatal diseases like cancer and heart disease, is rare in most countries. Tests related to CKD are initiated only when there is a symptomatic problem, and by then it is too late.
For screening of kidney function, a urine test and a blood test are needed [4]. Creatinine is a metabolite in blood that indirectly reflects the glomerular filtration rate (GFR). Direct measurement of GFR is difficult, so GFR is estimated by a simple function whose parameters are the creatinine value, sex, age, and race. Disease control agencies of some countries recommend that the whole population over a certain age be screened for creatinine. Meanwhile, in many countries people with diabetes or hypertension (high blood pressure) are regularly screened for renal function [5]. Prediction of CKD through ultrasound imaging is also considered desirable in clinical practice [6].
Recently, studies on the prediction of CKD using machine learning methods have been reported [7][8][9][10][11][12][13]. All of these works used a dataset from the University of California, Irvine (UCI) [14], which contains 400 samples with 24 features (age, blood pressure, creatinine, etc.) for measuring CKD, and they achieved good classification results with over 97% accuracy. Although the results look good, they cannot be applied in practice. The first problem is that there is a bias in the UCI dataset. There are 250 CKD samples and 150 non-CKD samples in the dataset: the ratio of CKD to non-CKD is different from reality. In addition, among the 250 CKD samples, there are nearly 140 samples with creatinine values exceeding 10 mg/dL, which are meaningless for classifying CKD. On this account, the proposed models will fail to classify, and the classification results will not be acceptable, when actual data are considered. How the composition of CKD samples and non-CKD samples affects the classification results will be explained in Section 4.6. The second problem is that the ground truth of CKD is determined by the value of GFR, and the value of GFR is calculated by Equation (1), which appears in Section 2.1. In Equation (1), the value of creatinine is the main contributing parameter, together with three other features: age, race, and sex. In other words, if the value of creatinine is already known, the value of GFR can be calculated directly using Equation (1), and the status of CKD follows from the value of GFR. Therefore, using machine learning algorithms to predict CKD when the value of creatinine is already known and the ground truth is calculated using Equation (1) is meaningless. Thus, the premise of the previously published works [7][8][9][10][11][12][13] is flawed.
In addition, we found that there is another way to assess CKD, as described in [5]. The features of age, gender, and the existence of diabetes, hypertension, anemia, and cardiovascular disease are used to measure the risk score of CKD using a simple grid search method. From this work, we took the cue that the state of other common diseases could possibly be used to measure the risk of CKD without creatinine among the parameters. The existing method treats the existence of related diseases, like diabetes or hypertension, as binary variables (0/1) to predict CKD risk. Diagnostic measurements such as blood sugar, blood pressure, and hemoglobin are usually included among the items of a regular health check, and other common physical measures (waist, BMI, vision, etc.) may also be related to the risk of CKD. This motivated us to investigate whether the risk of CKD can be predicted from these possibly related and commonly available data.
In this work, we propose a two-stage method to evaluate the risk of CKD under the condition that creatinine data are not available. In the first stage, a machine learning regression model is trained on supervised data to predict the value of creatinine. As the values of creatinine in the data, which is the target variable, are extremely unbalanced, we used an undersampling method and proposed a cost-sensitive mean squared error (MSE) loss function to deal with the problem. With respect to model selection, we used three machine learning models: a bagging tree model known as Random Forest [15], a boosting tree model called XGBoost [16], and a neural-network-based model known as ResNet [17]. To improve the creatinine prediction, we averaged the results from eight predictors as an ensemble learning step. Finally, the predicted creatinine and the original 23 features are used to predict the risk of CKD as a binary classification.
The paper is organized as follows. In Section 2, we describe the dataset and elaborate highlights of the experiment. Section 3 describes the proposed method. The experimental details and results are discussed in Section 4. The paper is concluded in Section 5 with some ideas about the future direction of the work.

Dataset
In this work, we used the open-source data provided by the National Health Insurance Sharing Service (NHISS), available from the site https://nhiss.nhis.or.kr/. The dataset contains the regular health check information of 1 million subjects between 25 and 90 years of age, collected in 2017. There are 24 collected attributes; we added another three attributes, namely GFR, Stage, and CKD, and their descriptions are listed in Table 1. In the first stage of the experiment, the 24th entry in Table 1 is the target variable, while attributes 1 to 23 are the input variables of the regression model. Attributes 25-27 are not contained in the original collected data and are derived as follows. The 25th attribute, glomerular filtration rate (GFR), is calculated using the formula shown below.
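Equation (1) itself is not reproduced in this excerpt. As an illustration only, a widely used GFR estimator that takes exactly these inputs (creatinine, age, sex, race) is the 4-variable MDRD study equation; the sketch below assumes that form, which may differ from the paper's actual Equation (1).

```python
def estimate_gfr_mdrd(creatinine_mg_dl, age, female, black):
    """Estimate GFR (mL/min/1.73 m^2) with the 4-variable MDRD equation.

    Shown only as an illustration of a GFR estimator taking creatinine,
    age, sex, and race as inputs; the paper's Equation (1) may differ
    (e.g. it could be the CKD-EPI equation instead).
    """
    gfr = 175.0 * creatinine_mg_dl ** -1.154 * age ** -0.203
    if female:
        gfr *= 0.742
    if black:
        gfr *= 1.212
    return gfr
```

Note that GFR falls as creatinine rises, which is why creatinine is the dominant parameter of the estimate.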
GFR is divided into 6 stages from normal level to renal failure, according to the flow rates shown in Table 2. This is the 26th attribute in Table 1. Stages 1 and 2 are regarded as normal non-CKD (class 0), and flow rates below that are regarded as CKD (class 1). This is the 27th entry in Table 1 and is our final classification target (Target_2).

Random Forest
Random forest (RF) is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the mean prediction of the individual trees [15]. A basic structure of Random Forest is shown in Figure 1.
Each sub-tree model performs random sampling with replacement from the training data, and the results from all sub-models are finally averaged. Every sub-model runs in parallel without any dependency. In addition to constructing each tree from a different subset of the data, random forests also differ in how the trees are constructed. In standard decision trees, each node is branched using the optimal division among all variables, so as to minimize the entropy resulting from splitting the data set represented by the parent node. In a random forest, each node is split using the best split point among a randomly chosen subset of predictors. Random forest thus avoids overfitting, which is common with a single deep decision tree.
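A minimal sketch of this bagging scheme with scikit-learn, on synthetic data (the hyperparameters here are illustrative, not the paper's configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

# Each tree is grown on a bootstrap sample of the rows, and each split
# considers only a random subset of the features ("sqrt" of them here);
# the forest's prediction is the average over all trees.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print(rf.predict(X[:3]))
```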

XGBoost
XGBoost is an optimized distributed gradient boosting library [16]. It implements machine learning algorithms under the Gradient Boosting Decision Tree (GBDT) framework [18]. A basic structure of XGBoost is shown in Figure 2. Note that the residual from tree-1 is fed to tree-2 so as to reduce the residual, and this continues. Different from Random Forest, each tree model in XGBoost minimizes the residual from its previous tree model. Traditional GBDT uses only the first derivative of the error information, whereas XGBoost performs a second-order Taylor expansion of the cost function and uses both the first and second derivatives. In addition, the XGBoost tool supports customized cost functions.
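The residual-passing idea can be sketched with plain decision trees. This illustrates only the first-order GBDT recursion, not XGBoost's second-order optimization; the data and hyperparameters are invented for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

# Minimal boosting loop: each shallow tree fits the residual left by
# the ensemble so far (squared-error loss, learning rate 0.3).
pred = np.zeros_like(y)
trees = []
for _ in range(50):
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += 0.3 * tree.predict(X)
    trees.append(tree)

print(np.mean((y - pred) ** 2))  # training MSE shrinks as trees are added
```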

ResNet
ResNet is a deep residual neural network used to learn the regression of nonlinear functions [17]. It replaces convolutional layers and pooling layers with fully connected layers so that deep residual learning can be applied to nonlinear regression. The basic structure of ResNet is shown in Figure 3. The model begins with an input layer, followed by dense blocks and identity blocks. There are three hidden dense layers in both the dense block and the identity block. In the dense block, the input is also connected to the output via another dense layer, whereas in the identity block it is connected directly. In this work, every dense block is followed by two identity blocks, and this set of three blocks is repeated 3 times. The last identity block is followed by the output layer.  Figure 4 shows two methods of predicting CKD risk. In Figure 4a, we predict CKD risk directly; in Figure 4b, we predict creatinine first and then combine its value with the 23 features to predict the risk of CKD. The complicated nonlinear prediction of creatinine in model (b) makes the input to the CKD classifier richer in information, so compared to model (a), model (b) can achieve better classification. In this work, we used model (b) for CKD risk prediction. In Section 3.1, the preprocessing method of the data is described. To overcome the problem of imbalance in the data, we used undersampling, which is described in Section 3.2.
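The identity block described above can be sketched as a plain NumPy forward pass; the layer width, activation (ReLU), and initialization here are our assumptions, as the excerpt does not specify them.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_block(x, weights):
    """Three dense (fully connected) layers plus a skip connection that
    adds the block input directly to the block output (the identity path)."""
    h = x
    for W, b in weights:
        h = relu(h @ W + b)
    return relu(h + x)  # skip connection: input added to the output

rng = np.random.default_rng(0)
dim = 16  # assumed layer width, for illustration only
weights = [(rng.normal(scale=0.1, size=(dim, dim)), np.zeros(dim))
           for _ in range(3)]
x = rng.normal(size=(4, dim))
out = identity_block(x, weights)
print(out.shape)
```

In the dense block, the skip path would pass through one more dense layer instead of being added directly.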

Proposed Methodology
We introduce a new cost-sensitive loss function in Section 3.3. Finally, the model ensemble strategy is explained in Section 3.4.

Preprocessing Method
In the preprocessing part, we did data cleaning as described in item (i) and item (ii) below. We also modified coding of some attributes as described in item (iii).
(i) Some samples have missing values for some attributes. As we already have a large dataset, the 8654 samples with one or more missing attributes were removed.
(ii) There are some data with very large values of creatinine. Those samples are from patients at a late stage of renal failure. These subjects are not targets of this work, and therefore such data are treated as outliers as far as training our target model is concerned. We removed 1234 samples with a creatinine value higher than 2.5 mg/dL.
(iii) In the original data, the attributes Sex and SMK_STATE are not suitable for numerical coding. We changed them into one-hot coding format: the original variables are removed and replaced with new binary variables whose value is 0 when the category is false and 1 when it is true. For example, Sex is replaced by two attributes, Male and Female. For a male subject, the Male attribute is assigned the value "1" and Female "0".
After preprocessing, 990,112 samples remained. We split it into a training set with 900,000 samples and a test set with 90,112 samples.

Extremely Unbalanced Data
The distribution of the target variable, creatinine, is shown in Figure 5. Most of the samples are concentrated near the median, and the number of samples away from the median is much smaller. A machine learning model will fit better in the region with more samples and perform worse in the region where data are scarce. In other words, the confidence interval of the prediction accuracy will be wider where there are fewer samples.
For general regression tasks, the problem of unbalanced data can be ignored: if an interval contains few samples, the samples in that range can be treated as outliers. For this task, however, it is the opposite. Our target is to correctly predict for those people who have a high creatinine value, even though we have little data in that range. From Figure 5, we observe that most samples lie between 0.5 and 1.4. If we use the whole dataset to train a machine learning model, then, as the training algorithm minimizes the total error over the whole dataset, samples with high creatinine values, where data are scarce, will be ignored.
In order to show the impact of the imbalance problem on regression more explicitly, we performed a simple experiment using XGBoost with the mean squared error (MSE) loss function. The result is shown in Figure 6, where the y-axis represents the ground truth and the x-axis the predicted value. We observed that the model failed to predict in the intervals with fewer training samples.
Generally speaking, the problem occurs because the number of samples with a high creatinine value is not sufficient for the machine learning model to learn from. The data imbalance problem can usually be solved by data-level methods and model-level methods [19]. Oversampling is the most common data-level method; it balances the data distribution by resampling or generating new data from the small number of available samples. However, this dataset is not suitable for oversampling: data with high creatinine values are extremely rare, so oversampling would greatly amplify the noise in the rare data. Instead, we used an undersampling method and proposed a cost-sensitive mean squared error (MSE) loss function to deal with this problem.

Details about Undersampling
The target of the prediction, the creatinine value, is highly imbalanced in the data set. To alleviate the imbalance problem, undersampling is applied to the ranges of values where a large amount of data is available. Six undersampling strategies with different levels of undersampling are considered, as detailed in Table 3. The training data become more balanced as we go from sampling-1 to sampling-6. The total training data, shown in the last row of Table 3, were 191,312, 107,439, 61,757, 34,220, 18,896, and 11,069 samples, respectively. As we opt for a more balanced set, the number of training samples decreases. We performed experiments with all 6 sample sets and compared the results.
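A sketch of this kind of per-bin undersampling (the bin edges and the per-bin cap below are invented; Table 3's actual strategies are not reproduced here):

```python
import numpy as np

def undersample_by_bin(y, bin_edges, cap, rng):
    """Return indices keeping at most `cap` samples per target-value bin.

    Bins with fewer than `cap` samples are kept in full, so only the
    over-represented creatinine ranges are thinned out.
    """
    keep = []
    bins = np.digitize(y, bin_edges)
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        if len(idx) > cap:
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(idx)
    return np.sort(np.array(keep))

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=0.3, size=10000)  # skewed, like creatinine
kept = undersample_by_bin(y, bin_edges=np.arange(0.2, 2.6, 0.1), cap=200, rng=rng)
print(len(kept))
```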

Cost-Sensitive MSE Loss Function
Mean absolute error (MAE) and mean squared error (MSE) are the two basic evaluation methods for regression tasks:

MAE = (1/n) Σ_i |y_i^t − y_i^p|,    MSE = (1/n) Σ_i (y_i^t − y_i^p)^2,

where y_i^t is the target value of the ith labeled sample and y_i^p is the corresponding predicted value from the regression algorithm. Because the errors in MSE are squared, samples with small errors become less important, creating a stronger incentive to fit the samples with larger errors; these tend to occur at attribute values where the available data are rare. However, this does not mean that a cost function with an even higher order on the errors will achieve better results, especially for noisy data, where the noise would also be amplified. To strike a balance, MSE is considered a suitable loss function for this task. Although MSE makes less frequent data more important, due to the imbalance of the data the model is still trained mainly to reduce the error on the more abundant data, which causes the error on the rare data to be larger than on the common data. To alleviate this problem, we split the data into k subsets. Then the mean error of each subset, denoted average_error_σ_k, and the average of these means, denoted average_error_σ, are calculated. The ratio between average_error_σ_k and average_error_σ is used as a weight on the error of each sample from the corresponding subset.
As the weight of the error is sensitive to the cost of each subset, we call this loss function cost-sensitive MSE:

L = (1/n) Σ_i (average_error_σ_k / average_error_σ) · (y_i^t − y_i^p)^2,

where k is the subset containing sample i. Compared to the original MSE cost, a weight is added to the squared error of each sample, namely the ratio of average_error_σ_k to average_error_σ. This new loss function works as follows. For the subsets with fewer samples, average_error_σ_k is larger than average_error_σ, and the weight coefficient is larger than 1. For the subsets with more samples, average_error_σ_k is smaller than average_error_σ, and the weight coefficient is smaller than 1. Unlike a higher-order exponent on the errors, which is applied to all samples, the proposed method is tuned for samples in specific intervals. Its advantage is that it not only increases the importance of the rare data, but also avoids the negative effects of applying higher-order exponential errors to all samples in the training data set.
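A NumPy sketch of this weighting scheme (the subset assignment is illustrative; in training, the weighting would be applied inside the loss, e.g. via XGBoost's support for customized cost functions):

```python
import numpy as np

def cost_sensitive_mse(y_true, y_pred, subset_ids):
    """Weighted MSE: each sample's squared error is scaled by the ratio of
    its subset's mean error to the average of all subsets' mean errors."""
    err = (y_true - y_pred) ** 2
    ids = np.unique(subset_ids)
    subset_means = np.array([err[subset_ids == k].mean() for k in ids])
    overall_mean = subset_means.mean()
    weights = np.empty_like(err)
    for k, m in zip(ids, subset_means):
        weights[subset_ids == k] = m / overall_mean  # >1 for high-error subsets
    return np.mean(weights * err)
```

With two equally sized subsets, the subset with the larger mean error receives a weight above 1, so its samples contribute more to the loss than under plain MSE.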

Model Ensemble Strategy
As the undersampling method is used for data balancing, only a small portion of the samples is used for training. In order to better utilize the training samples, the model ensemble strategy shown in Figure 7 is used. It is a bagging method that trains multiple predictors on different sets of undersampled data. Finally, the results from the multiple predictors are averaged to improve the generalization performance of the predictor.
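A minimal sketch of this bagging-style ensemble, with random subsets and shallow trees standing in for the eight undersampled datasets and XGBoost predictors:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=2000)

# Train one predictor per (under)sampled subset, then average predictions.
predictors = []
for seed in range(8):
    idx = np.random.default_rng(seed).choice(len(X), size=400, replace=False)
    predictors.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

X_test = np.linspace(-3, 3, 50).reshape(-1, 1)
ensemble_pred = np.mean([p.predict(X_test) for p in predictors], axis=0)
```

By convexity of the squared error, the averaged prediction is never worse on average than the mean error of the individual predictors, which is the variance-reduction effect the ensemble relies on.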

Evaluation Method
Like the training set, the test set also requires attention to the imbalance problem. If the whole test set, which contains 90,112 samples, were used for evaluation, the prediction could be poor for the less frequent samples even if the mean error over the whole test set were very low. In order to evaluate the results effectively, undersampling is applied to the test set as well. From the test sample set, we randomly take only 100 samples for the ranges of target values that contain over 100 samples, and all remaining (not used during training) samples for the less frequent segments of the range. We finally took 1592 samples for testing. To improve the reliability of the evaluation results, the test experiment is repeated 10 times and the averaged results are used for comparison.
The R-square score (R2), calculated using Equation (5), is used to evaluate the prediction results. The greater the value of R2, the better the regression result.
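For reference, R2 compares the residual sum of squares against the variance of the targets; a minimal implementation:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; 1.0 is a perfect fit, and 0.0 matches
    a model that always predicts the mean of the targets."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score(y, y))                     # 1.0
print(r2_score(y, np.full(4, y.mean())))  # 0.0
```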

Experiments with Different Undersampling Strategies
Experimental results with the different undersampling subsets are presented in Table 4; the corresponding sampling strategies are listed in Table 3. Test-1 to Test-10 are the results with different test samples, and the entries of Table 4 are the R2 scores. In this experiment, XGBoost is used because it is the fastest of the three algorithms; RF and ResNet are much slower. The sampling-5 strategy achieved the best R2 score of 0.5591 with 18,896 samples, and the sampling-4 strategy achieved the second-best R2 score of 0.5523 with 34,220 samples. Our goal is to achieve a good result while keeping as many samples as possible. Although the sampling-5 strategy achieved a slightly better R2 score, the sampling-4 strategy used almost twice the number of training samples. On this account, the sampling-4 strategy, which achieved an average R2 score of 0.5523 using 34,220 training samples, is selected as the best undersampling strategy for this problem and is used in all following experiments.

Experiments with Different Regression Algorithms
The next step is to test the efficacy of the different regression algorithms. Experimental results using the three regression algorithms with sampling-4 are presented in Table 5. XGBoost achieved the best R2 score of 0.5523 (also shown in Table 4), and is therefore selected as the standard predictor in the following experiments. Additionally, we used random forest to evaluate the importance of each variable by calculating the decrease in gini_index when that variable is used for a decision at a higher-level node. The top 6 most important variables are shown in Table 6.
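Impurity-based importances can be read off a fitted forest as below. Note that scikit-learn's regression trees measure impurity as variance reduction rather than the gini index, and the data here are synthetic, so this is a sketch of the procedure rather than the paper's exact computation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# feature_importances_ holds the mean decrease in node impurity per feature,
# normalized to sum to 1; sorting it gives the variable ranking.
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:2])  # feature 0 should dominate
```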

Experiments with Cost-Sensitive Loss Function
Before using the cost-sensitive loss, the average R2 score was 0.5523, as shown in Table 4. After using the cost-sensitive loss, the R2 score improved slightly to 0.5546. From the simulations, we observed that the overall R2 score did not improve much with the cost-sensitive loss function. However, the cost-sensitive loss increases the R2 score on the subsets of data in the ranges where the number of samples is small. From Figure 8 and Table 7, we can observe that in the ranges with fewer data the error decreases, while in the ranges with more data the error increases. Since accurate prediction in the high-creatinine region is more important due to the higher risk of CKD, the cost-sensitive loss function achieved its goal.
By using the sampling-4 undersampling strategy for data balancing together with the cost-sensitive loss function, the impact of the data imbalance problem is partially mitigated. The result is shown in Figure 9, where the y-axis represents the ground truth value and the x-axis the predicted value. The left part is the result before dealing with the imbalance problem; the right part is the result after using undersampling and the cost-sensitive loss function. We can observe that the model is able to predict with better accuracy over the whole range of data values.

Experiments with Model Ensemble
As the undersampling method selects only 34,220 samples to train a regression model, most of the data is not used for model training. In order to make full use of the data, undersampling was performed 8 times to extract 8 training datasets, and eight different models, P1 to P8, were trained on these datasets using XGBoost. Table 8 shows the results of the 8 predictors; every single predictor achieved almost the same averaged R2 score. Table 9 shows the results of model ensembling: 1P means the result from P1; 2Ps means the averaged result from P1 and P2; 3Ps means the averaged result from P1, P2, and P3; and so on. We achieved a better R2 score of 0.5590 by using the model ensembling strategy. More importantly, the generalization performance is improved by averaging the results from multiple predictors.

Predicting the Risk of CKD
The predicted creatinine value and the original 23 features are combined to predict the risk of CKD. We trained a logistic regression model using XGBoost, with the sampling-4 strategy for the training data.
First, a test set of 1592 samples, containing 555 CKD samples and 1037 non-CKD samples, is used for testing. The Receiver Operating Characteristic (ROC) curve, obtained by varying the threshold, is drawn in Figure 10; the area under the ROC curve (AUC) was 0.90. In order to display the results more precisely, Table 10 shows the true positives (TP), false positives (FP), true negatives (TN), false negatives (FN), true positive rate (TPR), and true negative rate (TNR) at varying thresholds. If the threshold is set at 0.3, we can achieve a TPR of 84% and a TNR of 83% under the condition that creatinine data is not available. As the ratio of CKD to non-CKD in the 1592 samples is different from reality, the whole test set of 90,112 samples, containing 3248 CKD samples and 86,864 non-CKD samples, is used for testing next. The corresponding ROC curve is shown in Figure 11; the AUC was 0.76. Table 11 shows the detailed results at varying thresholds. Obviously, when the whole test set of 90,112 samples is used for testing, the overall result is worse, because the test data are very unbalanced and there are many samples whose class is difficult to define clearly. With the same threshold of 0.3, the TPR decreases from 84% to 56% and the TNR from 83% to 80%. Even so, if the threshold is set at 0.4, 13,540 (1598 + 11,942) samples are classified as positive: 49.1% (1598/3248) of the positive samples are detected, while only 15.6% (13,540/86,864) of the population needs to be checked. In this way, we can reduce the impact of CKD through a compromise of testing creatinine only for those identified in the CKD risk class.
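The TPR/TNR entries of Tables 10 and 11 correspond to fixing one threshold on the predicted risk score; a sketch of that computation on toy labels and scores:

```python
import numpy as np

def confusion_at_threshold(y_true, scores, threshold):
    """TPR and TNR of the risk classifier at one operating point."""
    pred = (scores >= threshold).astype(int)
    tp = np.sum((pred == 1) & (y_true == 1))
    fn = np.sum((pred == 0) & (y_true == 1))
    tn = np.sum((pred == 0) & (y_true == 0))
    fp = np.sum((pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

labels = np.array([1, 1, 0, 0])          # 1 = CKD, 0 = non-CKD (toy data)
scores = np.array([0.9, 0.2, 0.8, 0.1])  # hypothetical predicted risks
tpr, tnr = confusion_at_threshold(labels, scores, 0.5)
print(tpr, tnr)
```

Sweeping the threshold from 0 to 1 and plotting TPR against 1 − TNR traces out the ROC curves of Figures 10 and 11.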

Conclusions and Future Work
This study aims to build a regression model to predict the value of creatinine, and then combine the predicted value of creatinine with the 23 common health factors to evaluate the risk of CKD. As the creatinine value, which is the target variable, is extremely unbalanced, we used an undersampling method and proposed a cost-sensitive mean squared error (MSE) loss function to deal with the problem. Regarding model selection, we used three machine learning models: a bagging tree model named Random Forest, a boosting tree model named XGBoost, and a neural-network-based model named ResNet. To improve the creatinine prediction, we averaged and ensembled the results from eight predictors. The ensembled model showed the best performance, with an R2 of 0.5590. The top six factors that influence creatinine are sex, age, hemoglobin, the level of urine protein, waist, and smoking habit. With the predicted value of creatinine, an area under the Receiver Operating Characteristic curve (AUC) of 0.76 is achieved when classifying samples for CKD.
Author Contributions: Formal analysis, W.W. and G.C.; Software, W.W.; Supervision, G.C. and B.C.; Writing-original draft, W.W.; Writing-review and editing, G.C. and B.C. All authors have read and agreed to the published version of the manuscript.