IgA Nephropathy Prediction in Children with Machine Learning Algorithms

Abstract: Immunoglobulin A nephropathy (IgAN) is the most common primary glomerular disease worldwide and a major cause of renal failure. The prediction of IgAN progression in children with machine learning algorithms has rarely been studied. We retrospectively analyzed electronic medical records from the Nanjing Eastern War Zone Hospital and used the chi-square test to select the 16 most relevant features as model inputs. We then applied eXtreme Gradient Boosting (XGBoost), random forest (RF), CatBoost, support vector machine (SVM), k-nearest neighbor (KNN), and extreme learning machine (ELM) models to predict the probability that a patient would or would not reach end-stage renal disease (ESRD) within five years, and designed a decision-making system (DMS) for IgAN prediction in children based on XGBoost and the Django framework. The receiver operating characteristic (ROC) curve was used to evaluate the models, and XGBoost performed best in the comparison. The AUC value, accuracy, precision, recall, and f1-score of XGBoost were 85.11%, 78.60%, 75.96%, 76.70%, and 76.33%, respectively. The XGBoost model is useful for providing physicians and pediatric patients with predictions regarding IgAN. In addition, a DMS built on the XGBoost model can assist physicians in treating IgAN in children effectively to prevent deterioration.


Introduction
Immunoglobulin A nephropathy (IgAN) is the most common primary glomerular disease worldwide and a major cause of end-stage renal disease (ESRD). IgAN in adults and in children shows both similarities and differences. Clinical studies showed that the survival rate of IgAN patients over 18 years old was 70-80% within 10 years, and that 20-30% of these patients progress to ESRD and need renal replacement therapy [1]. Wyatt et al. [2] found that the five-year survival rate of children with IgAN was 94-98%, and the 20-year survival rate was 70-89%. Although IgAN in children has a relatively lower incidence and progresses less during childhood than in adults, there is a continuous hazard of progression as children grow older [3]. The prognosis of children with IgAN varies from remission to progression to ESRD. Early detection and effective intervention are important in improving the outcome of IgAN, and renal biopsy is still the cornerstone of a correct diagnosis. However, most patients have already entered ESRD at the time of diagnosis, since it is difficult for patients to accept an invasive renal biopsy. It is also difficult to determine a fixed treatment plan for IgAN in children, because the course of the disease is harder to follow in children than in adults. Therefore, it is necessary to develop new predictive models for the progression of IgAN in children in order to guide the selection of cases to be treated [4]. Machine learning has been successfully applied in many fields and offers a route to the non-invasive diagnosis of IgAN in asymptomatic cases.
Future Internet 2020, 12, 230
Traditional medical treatment relies entirely on the doctor's diagnosis and treatment of the patient. In this setting, it is difficult to distinguish between diseases with similar symptoms and to discover hidden diseases, which leads to misdiagnosis and may delay the patient's treatment or endanger the patient's life. With the explosive growth of electronic medical records, machine learning in medicine has attracted the attention of many scholars [5][6][7]. Ali et al. proposed smart healthcare monitoring systems based on machine learning approaches to extract useful features from the collected healthcare data of patients, reduce the dimensionality of the data, and improve the classification precision [8,9]. ESRD is characterized by high disability, high mortality, and high medical expenses. Considerable research has been devoted to the early detection and prediction of kidney disease with machine learning algorithms in order to avoid the occurrence of ESRD, including the prediction of kidney disease [10,11], the formulation of the best course of treatment [12], and the evaluation of risk stratification [13].
The goal of treating IgAN is still to delay its progression, since an exact treatment plan has not yet been established. Researchers have mainly focused on the prognosis [14][15][16], risk stratification [17][18][19][20], and ESRD prediction of IgAN [21,22] in patients over 18 years old. Research on IgAN in children has focused on pathological characteristics and statistics [23][24][25], and few studies address the prediction of IgAN in children based on machine learning. Machine learning algorithms are an effective method for handling the high dimensionality of medical data, and they play an important role in smart medical care.
In this paper, a novel method is proposed for predicting the probability that pediatric patients with IgAN reach ESRD within five years. The main contributions of this paper are:

• A dataset on IgAN in children was created, and the chi-square test was used to extract the most useful features from the dataset.

• EXtreme Gradient Boosting (XGBoost) was adopted in order to predict whether IgAN in pediatric patients would reach ESRD or not within five years using a new dataset instead of the traditional clinical pathology. A decision-making system that was based on the XGBoost algorithm was designed with the Django framework.

Dataset
The initial data comprise the electronic medical records of 1167 patients aged 0-18 years from 2003 to 2019, obtained from the Nanjing Eastern War Zone Hospital. After the initial data cleaning and the deletion of records with missing values, the dataset contained 1146 records. Each record contained 37 attributes, where the first 36 attributes are independent variables that correspond to the patient information, while the remaining attribute is the dependent variable of clinical interest. The 36 independent variables were collected according to five aspects: epidemiology, blood test, urine test, renal pathology, and treatment options. In detail, age, sex, and hypertension are classified as epidemiology. Serum creatinine (Scr), cholesterol (CHOL), triglycerides (TG), albumin (ALB), complement C3, and glomerular filtration rate (eGFR) are blood test indicators. Urine tests include urine C3 (Ur_C3), α2-m, urine NAG enzyme (Ur_NAG), urine RBP (Ur_RBP), and uric acid (UA). ACEI_ARB, immunosuppression therapy, lipid lowering, and tonsillectomy are the treatment options. The remaining 19 attributes, M, E, S, T, C, IgA, IgG, IgM, C3, C4, C1q, loop necrosis, focal segmental glomerular sclerosis (FSGS), glomerulosclerosis, arterial hyaline degeneration, crescent ratio, medullary interstitial fibrosis, thickening and stratification of the elastic layer of the interlobular artery, and vacuolar degeneration of arteriole smooth muscle cells, belong to renal pathology. The dependent variable, denoted ESRD, was whether the patient would reach end-stage renal disease within five years.
The independent variables are divided into continuous and categorical variables according to whether they take continuous values or not, as shown in Tables 1 and 2. Specifically, Table 1 shows the range, mean, and standard deviation of the continuous independent variables, and Table 2 displays the possible values, numeric values, and number of records for the categorical independent variables. Table 3 displays the possible values, numeric values, and number of records for the categorical dependent variable. The ESRD variable takes the value yes in 567 records and no in 578 records, indicating that the patient did or did not reach the final stage of IgAN within five years, respectively.

Feature Selection
We evaluated the importance and relevance of each predictor to ESRD with the chi-square test, in order to identify significant predictors of ESRD to be used as inputs for the data mining methods. The feature analysis in Table 4 displays the importance and relevance of all the predictors to ESRD. The p-value is the probability, under the hypothesis that the predictor and ESRD are independent, of obtaining a chi-square statistic at least as large as the observed one. A p-value of less than 0.05 means that there is a significant difference. The score stands for the chi-square statistic, which can be calculated according to Equation (1):

$$\chi^2 = \sum \frac{(A - T)^2}{T} \tag{1}$$

where A represents the actual value and T represents the theoretical value. The higher the score, the more important the attribute. Sixteen features, shown in the first 16 rows of Table 4, were selected for data dimension reduction in this paper, based on a p-value equal to 0 and a score greater than 10.
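As a minimal sketch of Equation (1), the statistic can be computed for one categorical predictor from a contingency table of predictor value against ESRD class. The function name `chi_square_score` and the counts below are illustrative assumptions, not values from Table 4, and the sketch does not claim to reproduce the exact library routine used in the study.

```python
def chi_square_score(observed):
    """Chi-square statistic for a contingency table.

    observed[i][j] is the actual count A for predictor value i and class j.
    The theoretical count T assumes predictor and class are independent.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    score = 0.0
    for i, row in enumerate(observed):
        for j, actual in enumerate(row):
            theoretical = row_totals[i] * col_totals[j] / total
            score += (actual - theoretical) ** 2 / theoretical
    return score


# Hypothetical counts: rows are predictor values, columns are ESRD yes/no.
associated = [[30, 10], [20, 40]]
independent = [[10, 10], [20, 20]]

score_associated = chi_square_score(associated)
score_independent = chi_square_score(independent)
print(score_associated, score_independent)
```

A predictor whose distribution differs between the two classes receives a large score, while a predictor distributed identically in both classes receives a score of zero.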

Model
This section introduces XGBoost, the best performing algorithm in this study. For tasks such as processing high-dimensional data, dimensionality reduction, and feature extraction, it achieves a higher accuracy than traditional algorithms. XGBoost is an improved gradient boosting algorithm [26]. Its innovation lies in optimizing the objective function with a second-order Taylor expansion. It merges multiple weak classifiers into a strong classifier, and the base classifier is a classification and regression tree (CART).
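The objective function of XGBoost consists of a loss function and a regularization term, which are defined as follows:

$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$

where $f_k$ is the function expression of the k-th tree model, and $y_i$ and $\hat{y}_i$ are the true label and predicted value of the i-th sample $x_i$, respectively. XGBoost is an additive model, so the predicted value is the sum of the predicted values of each tree, i.e.,

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

The sum of the complexity of the K trees is used as a regularization term to prevent the model from over-fitting. Following [26], the complexity of a tree with $T$ leaves and leaf weights $w_j$ is $\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$. Assuming that the tree model that is trained on the t-th iteration is $f_t$, then:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{3}$$

Substituting Equation (3) into Equation (2) gives Equation (4):

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \mathrm{constant} \tag{4}$$

Expanding the loss function to second order around $\hat{y}_i^{(t-1)}$ gives

$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + \mathrm{constant}$$

where $g_i$ and $h_i$ are the first-order and second-order partial derivatives of the loss function $l$ with respect to $\hat{y}_i^{(t-1)}$:

$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}$$

Define the sample set of leaf node $j$ as $I_j = \{\, i \mid q(x_i) = j \,\}$, where $q$ maps a sample to its leaf, and let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. After dropping the constant terms and setting each leaf weight to its optimal value $w_j^* = -G_j / (H_j + \lambda)$, the objective function is finally reduced to

$$\mathrm{Obj}^{(t)} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

During the training process of the XGBoost model, when the t-th tree is established, a greedy strategy is adopted to split the tree nodes. Every time a tree node splits into left and right leaf nodes, the objective function is reduced by a gain, which is defined as follows:

$$\mathrm{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

where $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of $g_i$ and $h_i$ over the samples in the left and right nodes, respectively. If Gain > 0, then the result of this split is added to the model construction.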
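The split criterion can be illustrated with a small sketch that evaluates the standard XGBoost gain formula from [26] for a candidate split. It assumes the logistic loss, for which the first and second derivatives are p − y and p(1 − p); the helper names, the toy labels, and the values of λ and γ are assumptions made for illustration only.

```python
def logistic_grad_hess(y_true, p_pred):
    """First and second derivatives of the logistic loss per sample."""
    g = [p - y for y, p in zip(y_true, p_pred)]
    h = [p * (1.0 - p) for p in p_pred]
    return g, h


def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of one leaf."""
    return G * G / (H + lam)


def split_gain(g, h, left_idx, right_idx, lam=1.0, gamma=0.0):
    """Gain of splitting a node into the given left and right sample sets."""
    GL = sum(g[i] for i in left_idx)
    HL = sum(h[i] for i in left_idx)
    GR = sum(g[i] for i in right_idx)
    HR = sum(h[i] for i in right_idx)
    parent = leaf_score(GL + GR, HL + HR, lam)
    return 0.5 * (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam) - parent) - gamma


# Toy node: four samples, all currently predicted with probability 0.5.
labels = [1, 1, 0, 0]
g, h = logistic_grad_hess(labels, [0.5] * 4)

gain_good = split_gain(g, h, left_idx=[0, 1], right_idx=[2, 3])  # separates classes
gain_bad = split_gain(g, h, left_idx=[0, 2], right_idx=[1, 3])   # mixes classes
print(gain_good, gain_bad)
```

In this toy case only the split that separates the two classes has a positive gain, so only that split would be kept under the Gain > 0 rule.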
XGBoost provides three calculation methods for feature importance. The first way is gain, which refers to the average gain of the feature when it is used in trees. The second way is weight, which is the number of times that a feature is used to split the data across all trees. The last way is cover, which relates to the average coverage of the feature when it is used in trees. In this study, the gain method was mainly used for calculating feature importance.

Performance Evaluation
The ROC curve and the area under the curve (AUC) were used to evaluate the performance of the binary classifiers in our paper. The abscissa of the curve is the false positive rate (FPR) and the ordinate is the true positive rate (TPR):

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

where TP stands for True Positive, TN represents True Negative, FP symbolizes False Positive, and FN means False Negative. Different (FPR, TPR) points can be obtained by adjusting the threshold applied to the value predicted by the model, and these points can be connected into a curve, which is the ROC curve. The drawn curve supports a qualitative analysis of the model; a quantitative analysis requires calculating the AUC.
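The construction of the ROC curve by threshold sweeping, and the area under it, can be sketched as follows. The labels and scores are made-up values, the function names are illustrative, and the area is computed with the trapezoidal rule.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal integration of TPR over FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = reaches ESRD) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc_value = area_under_curve(points)
print(points)
print(auc_value)
```

The resulting area equals the fraction of positive-negative pairs in which the positive sample receives the higher score, which is 8 of 9 pairs in this example.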
AUC refers to the area under the ROC curve [27]. It is obtained by integrating TPR over FPR along the horizontal axis of the ROC plot. In practice, the ROC curve generally lies above the line y = x, so the value of AUC is generally between 0.5 and 1. The larger the value of AUC, the better the performance of the model. In addition, the recall and f1-score of all the classifiers were compared, where the f1-score combines precision and recall:

$$\mathrm{precision} = \frac{TP}{TP + FP}$$

$$\mathrm{recall} = \mathrm{TPR} = \frac{TP}{TP + FN} \tag{10}$$

$$f_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

During model training, the training set and test set were randomly selected with a ratio of 3 to 1. Training and testing applied 10-fold cross-validation, which separates the dataset into 10 partitions (folds), and the average accuracy over all folds was calculated. In addition, the above-mentioned measures, namely ROC, AUC, recall, and f1-score, were used to evaluate all techniques.
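The threshold-based metrics follow directly from the four confusion-matrix counts, as the sketch below shows. The counts are invented for illustration; the final check only uses the precision and recall reported for XGBoost in this paper (75.96% and 76.70%) to confirm that they are consistent with the reported f1-score.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and f1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Hypothetical confusion-matrix counts, not taken from the study.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)

# Consistency check on the values reported for XGBoost.
reported_f1 = f1_score(0.7596, 0.7670)
print(round(reported_f1, 4))
```

Rounded to four decimal places, the recomputed f1-score matches the reported 76.33%.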

System Implementation
XGBoost, which showed the best classification performance, was used to implement an online decision-making system. The core of the system is the Django framework, which was used to apply the machine learning models and build the web tool. The system was written in Python (version 3.7.0) and HTML (version HTML 5). Django is an open source Python web framework with which programmers can easily and quickly create high-quality, easy-to-maintain, database-driven applications. In addition, many powerful third-party plug-ins are available for Django, which makes it highly extensible.
In the current implementation of the system, an HTML page communicates with the Python service and formats the information that is shown to the user. The trained model of the system can be used for a single prediction. Figure 1 shows a screenshot of the initial web page. Users fill in the fields of this page according to the feature description in Section 2.1. The system backend then predicts whether the patient will reach ESRD or not within five years based on the submitted data. Moreover, the web-based decision-making assistance system reports the probabilities of the patient reaching and not reaching ESRD within five years, which may help to alert doctors to borderline patients. Figure 2 shows the prediction outcome of the decision-making system (DMS) that is based on XGBoost.
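The request flow described above might be sketched, independently of any web framework, as a function that validates the submitted form and returns both probabilities. Everything named here is hypothetical: the field names, the `predict_response` helper, and the `stub_model` placeholder standing in for the trained XGBoost model are assumptions, since the paper does not list the actual view code.

```python
# Hypothetical field names for the 16 selected features.
FEATURES = [
    "Scr", "FSGS", "crescent_ratio", "ALB", "UA", "glomerulosclerosis",
    "CHOL", "Ur_NAG", "eGFR", "TG", "E", "T", "C", "M", "IgM", "C3",
]


def parse_form(form):
    """Convert submitted form values into a feature row, in a fixed order."""
    missing = [name for name in FEATURES if name not in form]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return [float(form[name]) for name in FEATURES]


def predict_response(form, predict_esrd_probability):
    """Build the payload shown to the physician for a single patient."""
    row = parse_form(form)
    p_esrd = float(predict_esrd_probability(row))
    return {
        "reaches_esrd": p_esrd >= 0.5,
        "p_esrd": p_esrd,
        "p_no_esrd": 1.0 - p_esrd,
    }


def stub_model(row):
    """Placeholder for the trained classifier; always returns 0.25."""
    return 0.25


example_form = {name: "1.0" for name in FEATURES}
result = predict_response(example_form, stub_model)
print(result)
```

In a Django deployment, a view function would pass the submitted POST data to such a helper and render the returned dictionary in the result template. Reporting both probabilities rather than only the class label is what allows borderline patients to be noticed.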

Results
Six machine learning algorithms, XGBoost, RF, CatBoost, KNN, SVM, and ELM, were compared, and the best performing model was selected to predict whether children suffering from IgAN would reach end-stage renal disease within five years. Table 5 shows the accuracy, precision, recall, f1-score, and AUC values of the XGBoost, RF, CatBoost, KNN, SVM, and ELM models. It can be concluded from the table that XGBoost is the best on all of the performance indicators: its AUC, accuracy, precision, recall, and f1-score are 85.11%, 78.60%, 75.96%, 76.70%, and 76.33%, respectively. The AUC values of RF and XGBoost are almost equal, but the accuracy, precision, recall, and f1-score of the RF model are all smaller than those of the XGBoost model. Table 6 shows the XGBoost importance scores of the 16 variables that were selected from Table 4. Figure 3 depicts the ROC curves of the six machine learning models. From the figure, it can be seen that the ROC curves of the XGBoost and RF models lie close together at the top and the ROC curve of the KNN model lies at the bottom, which means that, according to the ROC curve, the XGBoost and RF models have the best performance and the KNN model has the worst. The ROC curves of the CatBoost, SVM, and ELM models lie between those of the XGBoost and KNN models.

Discussion
This paper studied the use of machine learning algorithms for predicting IgAN in children. First, 16 features (serum creatinine, focal segmental glomerular sclerosis, crescent ratio, albumin, uric acid, glomerulosclerosis, cholesterol, urine NAG enzyme, eGFR, triglycerides, E, T, C, M, IgM, and C3) were chosen by the chi-square test as the input of the classifiers for dimension reduction. Subsequently, the XGBoost, RF, CatBoost, KNN, SVM, and ELM models were applied to predict IgAN in children. Finally, a decision-making system was built based on the best performing model and the Django framework. The results shown in Table 5 and Figure 3 indicate that the XGBoost model provides better performance than the other models for the medical application considered in this study. The AUC value, accuracy, precision, recall, and f1-score of XGBoost were 85.11%, 78.60%, 75.96%, 76.70%, and 76.33%, respectively. The accuracy, precision, recall, f1-score, and AUC value of the RF (76.42%, 74.26%, 72.82%, 73.53%, 85.07%), CatBoost (76.42%, 73.79%, 73.79%, 73.79%, 84.54%), KNN (75.55%, 73.27%, 71.84%, 72.55%, 80.90%), SVM (76.42%, 73.33%, 74.76%, 74.04%, 82.72%), and ELM (75.98%, 72.64%, 74.76%, 73.68%, 81.74%) models are all lower than those of XGBoost. This highlights an advantage of XGBoost: interpretable models are needed not only to assist clinical decision-making, but also to help clinicians discover hidden factors that affect the disease. The XGBoost algorithm has a regularization term to prevent overfitting. Moreover, XGBoost can specify the default direction of a branch for missing values or specified values, which can greatly improve the efficiency of the algorithm. More importantly, the XGBoost model has high generalization performance and can clearly output the importance score of each attribute; that is, it is interpretable, which is required by clinical medicine.
Despite the potential of this research, there are several limitations. First, the collected dataset is relatively small and cannot fully cover kidney cases in children, which leads to inaccurate predictions for special cases of kidney disease in children. Therefore, the dataset needs to be enlarged and the feature processing needs to be refined by applying data mining techniques for predicting IgAN in children. Second, the prediction system is suitable for children with IgAN aged 0-18 years, but not for adults with IgAN.
In the future, we will work on improving the accuracy of the model, refining the system, adding a database to the system, expanding the training dataset, and complementing the system with an error correction function.