The Ensembles of Machine Learning Methods for Survival Predicting after Kidney Transplantation

: Machine learning is used to develop predictive models to diagnose different diseases, particularly kidney transplant survival prediction. The paper used the collected dataset of patients’ individual parameters to predict the critical risk factors associated with early graft rejection. Our study shows the high pairwise correlation between a massive subset of the parameters listed in the dataset. Hence the proper feature selection is needed to increase the quality of a prediction model. Several methods are used for feature selection, and results are summarized using hard voting. Modeling the onset of critical events for the elements of a particular set is made based on the Kapplan-Meier method. Four novel ensembles of machine learning models are built on selected features for the classiﬁcation task. Proposed stacking allows obtaining an accuracy, sensitivity, and speciﬁty of more than 0.9. Further research will include the development of a two-stage predictor.


Introduction
Organ transplantation is one of the most outstanding achievements of modern medicine. The use of organ transplants today allows the treatment of many patients who until recently could hope, at best, to continue their painful and limited existence.
The urgency of this problem lies in the fact that in Ukraine, legal issues are not yet fully resolved, and there is a lack of donor bodies [1]. Although many organ transplant centers now operate in Ukraine, the needs of Ukrainians for transplants are only about 2% met. The coronavirus pandemic has made adjustments in the number of transplants. In addition to the lack of organs for transplantation, one of the essential problems of Ukrainian transplantology is selecting a donor-recipient pair, which avoids episodes of acute and chronic rejection of the donor organ by the recipient's immune system [2].
Organ transplantation can significantly improve patients' quality of life, often offering a single solution for their survival [3]. However, this area does not generate large data sets of patients due to the severity of comorbidities and the complexity of clinical operations, thus preventing the widespread use of mathematical modeling techniques to predict transplant outcomes [4]. For a successful transplant, the recipient and donor should select tissue proteins called human leukocyte antigen (HLA).
The mismatch of HLA between the transplant recipient and their donor can lead to the production of antibodies against HLA, leading to failed transplantation and possibly unsuccessful re-transplantation if it has an HLA-type that is reactive with antibodies [5].
Data-driven approach is used in medicine. Paper [6] presents ensemble of statistical methods usage for kidney transplant survival prediction. Authors in [7] used prediction model to estimate time on a kidney transplant waiting list. The [8] proposes the overview of existing machine learning approaches, particularly, decision tree, artificial neural networks (ANN) with different configuration, Bayesian belief network, logistic regression. The feature selection and several widely-used machine learning models comparison is proposed in [9]. The combination of two predictive models is used for frailty conditions prediction in [10]. That is why the ensemble of machine learning models can be used for early kidney transplant survival prediction.
The main objectives of this study are to use the Kapplan-Meier method and machine learning ensembles to independently confirm the key risk factors associated with early graft rejection (within the first 30 days after transplantation) by changes in the main laboratory parameters of general blood, biochemical analysis of blood and urine conditions of a limited number of immunological laboratory tests.
The paper contribution is forming as the following: • Four ensembles are built on selected features for the classification task; this allows to obtain the accuracy of more than 0.9; • Kapplan-Maier method is used for modeling the processes of the onset of critical events for the elements of a particular set.
The remaining of this paper is organized as follows. Section 2 presents the description of the novel ensemble for feature selection. In Section 3, we discuss the proposed approach and compare the results. Finally, Sections 4 and 5 draw the final conclusions.

Dataset Description
Outpatient case histories of 164 patients who received HLA-compatible renal allografts between 1992 and 2020 were retrospectively analyzed: 152 patients underwent transplantation for the first time-64 (42.1%) women and 88 (57.9%) male patients with a mean age of 32.6 ± 8.7 years (range = 18-60 years) at the time of transplantation, and 12 patients (5 (41.7%) women and 7 (58.3%) men), who received a second transplant kidney, who were on an outpatient basis in the Department Hospital Nephrology and Dialysis of the Lviv Regional Clinical Hospital (LRCH).
The study used a group of patients in whom transplantation was performed for the first time. Transplantation in Ukraine (Lviv, Kyiv, Zaporizhia, Kharkiv, Odessa) was performed in 79.6% of patients and other countries (Belarus, Poland, Italy, Turkey, Austria, China, Pakistan) 20.4% of patients. 86.2% of patients from Ukraine received a kidney from a family donor, 13.8% of patients, all from abroad, received a tube kidney. Acute rejection crisis occurred in 13.1% of patients, with cessation of graft function during the perioperative period.
Dataset consists of 80 features and one target attribute (feature characteristics are given in Appendix A). 38 columns consist of more than 60% of missing data. Often, common methods such as mean, median, mode, frequent data, and constant do not provide correct data for missing values. That is why Multiple Imputation by Chained Equations (MICE) [11] was used to impute missing data for these attributes. However, the standard deviation is changed more than 15%. That is why 38 columns were eliminated. Forty-two features were left. The preprocessed dataset is available https://doi.org/10.6084/m9 .figshare.14906241.v1, accessed on 4 July 2021.

Clinical Studies
Studies of HLA genes were performed in the immunological laboratory of Lviv Regional Clinical Hospital following the instructions for using a set of sera of antileukocyte fluids HLA-A, B, C, DR (HLA-serum). Typing of HLA class I and II alleles using lymphocytotoxicity test. The strength of the action is determined on an 8-point scale. In this retrospective analysis, the history of renal dysfunction and other participants were obtained from personal history. The study used a cross-match according to a method that complements complement-dependent cytotoxicity [12,13]. Viral infections were determined by enzyme-linked immunosorbent assay and standard polymerase chain reaction methods.

Data Preprocessing
To perform statistical analysis of the obtained data, we developed analytical tables in the program Statistica 6.1 and Excel (Microsoft Office 2016), in which the collected primary data were entered. RStudio v.1.1.442 software (Slashdot Media, La Jolla, CA, USA) was used for statistical analysis of the obtained data.
First of all, the normality of the distribution in the obtained sample populations was determined using the Shapiro-Francia criterion. The results were presented as: • mean values and their standard deviations (M ± SD)-in the case of Gaussian distribution, • medians, 25th, and 75th percentiles: Me [25%; 75%]-in the case of non-Gaussian distribution, • shares (%).
When assessing the probability of the difference between the results obtained in the compared groups used: • odd t-criterion-for two groups with Gaussian distribution; • Mann Whitney U-test-for two groups with non-Gaussian distribution; • criterion 2 (xi-square)-when comparing particles.
The difference between the groups was considered significant at p < 0.05. Next, the feature selection is provided. This procedure is organized as follows: 1. Imputation: Missing data imputation using Multiple imputations by chained equations algorithm; 2. Feature selection: Correlation; Boruta feature selection; Recursive Feature Elimination; Hard voting of the mentioned algorithms.
Multiple imputations by chained equations (MICE) is a method for multiple imputations, accounts for the statistical uncertainty in the imputations. In addition, the chained equations approach is very flexible and can handle variables of varying types (e.g., continuous or binary) as well as complexities such as bounds [14]. Based on statistical analysis, more than 30% of values in Ca attribute is missed (Figure 1). Axis x presents the attributes and axis y presents the proportion of missed values. Therefore, this attribute should be eliminated. The rest of the attributes were supplemented. The imputation summary is given below.  Na  69  1  1  1  1  1  1  1  29  1  1  1  1  1  1  1  28  1  1  1  1  1  1  1  20  1  1  1  1  1  1  1  3 1 In given above matrix each row corresponds to a missing data pattern (1 = observed, 0 = missing). Rows and columns are sorted in increasing amounts of missing information. The last column and row contain row and column counts, respectively.
Feature selection is built of the ensemble consists of information gain, Boruta and Recursive Feature Elimination. Hard voting is used for obtained results. Only those features which survived the majority of selection algorithms are used.
The correlation more than 0.9 between HB and donor type as well as between EP and HB is founded.
The next, Boruta algorithm was used for features selection. Boruta is a ranking and selection algorithm based on a random forest algorithm [15]. The advantage of Boruta is that it clearly decides whether a variable is important or not, and helps to select statistically significant variables.
Boruta extends the information system by adding copies of all variables and runs a random forest classifier on the extended information system and gather the Z scores computed. This procedure is repeated until the importance is assigned for all the attributes, or the algorithm has reached the previously set limit of the random forest runs. Obviously the larger the feature space and the more variables is redundant to, the larger the number of trees that must be in the forest for a tree where given attribute is not redundant to occur. However, cross-validated is optimal, if forest with a parametrization is used. This effect is why Boruta needs to run dozens of iterations to be effective. One hundred of iterations was provided in our case for depth = 5.
List of most influential variables is given in Table 1. The next feature selection using Recursive Feature Elimination [16] is provided. Recursive Feature Elimination (RFE) is a feature selection approach. It works by recursively removing attributes and building a model based on the attributes that remain. It uses the model's fidelity to determine which attributes (and combinations of attributes) contribute most to predicting the target attribute.
The twenty-five most important variables are shown below ( Figure 2). The next feature selector is built on information gain. The attributes that contribute more information will have a higher information gain value and can be selected (Table 2). Hard voting [17] is the simplest case of a majority vote. In this case, the important features are selected by the majority of the information gain, Boruta and RFE results.
The final dataset consists of 20 selected features given in Table 3.

The Modeling of the Key Risk Factors Associated with Early Graft Rejection
In the study of cumulative graft survival among patients, the censored Kaplan-Meier method [18] was used. The significance of the difference in the difference in survival levels in individual groups was determined using a logarithmic rank coefficient.
In clinical practice today, the most common survival analysis method for describing censored observations is a Kaplan-Meier method. This method allows you to estimate the proportion of patients who did not have a terminal event and estimate the probability of the absence of an event (to stay alive) by a specific point in time from the beginning of the observation. This probability is called survival, the function of the dependence of survival on time-The function survival rate.
The Kaplan-Meier method is often called the multiplier by the Kaplan-Meier estimate since the value of the survival function at some point in time is the product of the probabilities of survival for all previous points in time. When the analyzed event occurs, the proportion of observed patients is calculated for whom this event has not yet been occurred. In this calculated proportion of patients, there is a probability of surviving the moment of the onset of event i for observed patients. According to the rule of multiplication of probabilities, we obtain the cumulative probability of surviving until the event i occurs for the entire study group of patients [19].
A typical application might involve grouping patients into categories in medical statistics, for instance, those with Gene A profile and those with Gene B profile. In the graph, patients with Gene B die much quicker than those with Gene A. After two years, about 80% of the Gene A patients survive, but less than half of the patients with Gene B.
To generate a Kaplan-Meier estimator, at least two pieces of data are required for each patient (or each subject): the status at last observation (event occurrence or right-censored), and the time to event (or time to censoring). If the survival functions between two or more groups are to be compared, then the third piece of data is required: the group assignment of each subject [20].

The Modeling of the Key Risk Factors Using Machine Learning Algorithms
The weak classifiers given below are used for classification. After that, the ensembles will be built.
The k Nearest Neighbor (k-NN) method is a metric algorithm for automatically organizing objects. This classifier is used for organ transplantation tasks in [21]. The main principle of the nearest neighbor method is that the object is assigned to the class that is most common among the neighbors of a given element. Mathematically, the classification using k-NN is reduced to the calculation: where CSV is categorization status value of object d j , Tr k (d j ) is the set k of objects d z , for which we achieve the maximum of RSV d j , d z , RSV d j , d z -(retrieval status value) similarity measure between training dataset d j and object d z , ca iZ -Value of the target attribute, k-Threshold (number of objects) indicated how many similar objects we have to be considered to calculate CSV i d j . Any similarity function, either a probabilistic or a vector measure, can be used for these purposes. Logistic regression is used for classification [22]. Input values (x) are combined linearly, using weights or coefficient values (Beta) to predict the output value (y). The key difference from linear regression is that the initial simulated value is binary (0 or 1), not a numeric value.
The following is an example of a logistic regression equation: where y is the predicted output, b 0 is the offset or interception, b 1 is the coefficient for a single input value (x). In each input data column, there is a related b coefficient (constant real value), which is obtained from the training sample. The next method of data classification is the method of Support Vector Machine, SVM [23]. The mathematical formulation of the classification problem is as follows: let X be the space of objects (for example, R n ), Y be our classes (for example, Y = {−1,1}). Specified training sample: You need to construct a function F:X→Y (classifier) that maps the class y of the object x.
The classification function F takes the form F(x) = sign((w, φ(x)) + b). Positive certainty is necessary in order for the corresponding Lagrange function in the optimization problem to be limited from below, i.e., the optimization problem would be correctly defined. The accuracy of the classifier depends, in particular, on the choice of the kernel.
A classifier based on a decision tree is a tree whose inner vertices are denoted by terms, and the edges are marked with test scales [24]. The weight of the term in the vector representation of the test example is compared with these test scales. The leaves of such a tree are marked with category values. The classifier sequentially passes along the vertices of the tree, comparing the weight of the term of the current vertex in the presentation of the test example and determining the further direction of the bypass until it reaches the sheet. The value of the category of a given sheet is assigned to the test example.
The advantage of the decision tree is the ability to visualize the results and interpret the data obtained when obtaining a forecast.
A random forest is a classifier consists of a set of trees {h(x, θ k )k = 1, . . .}, where θ k is independent equally distributed random vectors; each tree contributes one vote in determining class X [24]. According to this definition, a random forest is a classifier consisting of an ensemble of decision trees, each of which is built using bagging. In this case, θ k is an independent equally distributed l-dimensional random vector, the coordinates of which are independent discrete random variables that take values from the subset {1,2,3, . . . l} with equal probabilities. The implementation of a random vector θ k determines the precedent numbers of the training sample, which form the bootstrap (i.e., the return sample, more details in the following sections) of the sample used in constructing the k-th decision tree.
The averaging of observational results can give a more stable and reliable estimate, as the effect of random fluctuations in a single dimension is weakened. The development of algorithms for combining models was based on a similar idea. The construction of their ensembles turned out to be one of the most powerful machine learning methods, which often surpasses other methods in the quality of forecasts.
One solution that provides the necessary variety of models is their re-learning on samples randomly selected from the general population or other subsets of data constructed from existing ones. To obtain a stable forecast, the predictions of these models are combined in one way or another, for example, by simple averaging or voting (possibly weighted).
Statistical bootstrap (bootstrapping) is a practical computer method for determining statistics of probability distributions based on multiple generations of samples by the Monte Carlo method based on the available sample. Allows you to easily and quickly evaluate various statistics (confidence intervals, variance, correlation, and so on) [25].
Just as the averaging of several observations reduces the estimate of data variance, a reasonable way to reduce the forecast variance is to obtain a large number of data from the general population, build a predictive model for each training sample and average the forecasts obtained. If instead of separate training samples (which we, as a rule, always do not have enough) to execute a bootstrap and based on the generated pseudo-samples to construct n trees of regression, the average collective forecastf bag will have a lower variance. This procedure is called bootstrap. Bagging can be performed with respect to regression trees and other models: reference vectors, linear discriminants, Bayesian probabilities, and others.
Another method of improving forecasts is boosting, which is an iterative process of sequential construction of private models [26]. Each new model is learned using the information about the mistakes made in the previous step. The resulting function is a linear combination of the whole ensemble of models while minimizing the penalty function. Like bagging, boosting is a general approach that can be applied to many statistical methods of regression and classification. Here we will limit ourselves to discussing gradient boosting in the context of regression trees. The AdaBoost is model where for any subsequent model the check and elimination of observations-emissions, erroneous conclusions of the previous model are carried out [27].
Stacking (Stacked Generalization) is a combination of heterogeneous classifiers [26]. To do this, models of any type are combined to create a composite model. It is similar to bagging, but is not limited to the decision tree. The initial data for basic training create features that can recombine with the data. A majority vote or weighing can combine basic training inputs. Additional data for retention is required if meta-learning parameters are used. It also increases the complexity of the model.
In our study, Kaplan-Mayer survival curves in the peroration period (0-30 days) were as follows. In the group of patients (38.8%) after transplantation without hemodialysis before transplantation, the survival results were significantly better (p = 0.001) than in the group of patients (61.2%) with previous hemodialysis (Figure 3). No significant difference was found between age and sex (p < 0.05). According to the indicators of the general blood analysis, a probable difference (p = 0.049) was found between the groups of patients with normal leukocytes in the blood (59.2%) and the group with leukocytosis in the blood (38.8%), due to better survival results in the first group (Figure 4), on other indicators of the general analysis of blood the probable difference (p < 0.05) was not established.
The biochemical analysis of blood revealed a probable fundamental difference in the levels of creatinine (p = 0.00001) and urea (p = 0.01), between normal values and elevated levels. Creatinine levels were higher than normal in 26.9% of patients, compared with 72.1% (Figure 5), and urea levels were higher than normal in 56.7% of patients, compared with 43.3% of patients with urea was normal (Figure 6), among other indicators of biochemical analysis of blood, no significant difference was found (p < 0.05).  From the indicators of the general analysis of urine, the indicators of proteinuria were probable (p = 0.0005): the percentage of patients with increased proteinuria was 64.5% of patients, compared with 35.5% of patients with normal protein levels (Figure 7). The level of erythrocytes in the urine was probably (p = 0.00001) higher in 34.8% of patients, compared with 65.2% of patients with erythrocytes within normal limits (Figure 8). No significant difference was found between other indicators of the general analysis of urine (p < 0.05).  No significant difference in transplant survival was found between the immunological indicators of the cross-match test 0-10% and the indicators of 10-20%. HLA-typing between matches 0-3 and 4-6 matches also did not show a significant difference in transplant survival.

ML Models and Developed Ensembles Results
The research methodology is organized as follows:

1.
We divide the available sample into training and testing. For training, we will select 70% of the original sample.

2.
We train nine models for selected features, for selected and tentative features, and the whole dataset, respectively: a. Decision forest (RTree using CART algorithm, tuning parameters is used too), b.
K Nearest neighbor (We will estimate three hyperparamters, namely n-neighbors, weights and metric. Next, the Exhaustive Grid Search technique for hyperparameter optimization is used. Since we have provided the class validation score as 3 (cv = 3), Grid Search will evaluate the model 6 × 2 × 3 × 3 = 108 times with different hyperparameters. The best values are the following: n_neighbors = 5, weights = 'uniform', algorithm = 'brute', metric = 'minkowski'), f.

3.
We built four ensembles: a. Boosted random forest, b.
Bagged decision tree based on CARD algorithm, d. Stacking.

4.
We carry out resampling of the obtained models, perform a statistical analysis of two generated model quality metrics (Accuracy and J. Cohen's Kappa index) on the training set, rank the models, and estimate the confidence intervals of the criteria used.

5.
The analysis is organized for the whole dataset and selected features, respectively. That is why the research methodology is used separately for the whole dataset and dataset based on selected features. 6.
We check is the model overfitted using k-fold.
At the first stage, the separated machine learning models are analyzed. The worst accuracy is for MLP trained on the whole dataset and selected features ( Table 3). The best weak classifier is ADABOOST, with an accuracy equal to 0.8533677. That is why MLP classifier is excluded.
The difference in accuracy for the whole dataset and selected features for each model is given in Table 3 in the last column. The accuracy of LG is low for the entire dataset but 0.13848 higher for the selected features.
At the next stage, two Boosting modes based on regression tree and logistic regression are built. Bosted LG shows higher accuracy than all weak classifiers built on the whole dataset. However, the result of SVMRadial built on selected features is better than for boosted ensembles.
Bagged CART without tuned parameters is trained at the next stage. However, the accuracy of this ensemble is worse than for boosted LG both for all features and for selected features.
The next step is the combination of multiple predictors using stacking. KNN Accuracy was used to select the optimal model using the largest value. The final value used for the model was mtry = 2. Here mtry is the number of variables available for splitting at each tree node.
The Table 4 summarizes the results of models'comparison. The numbers in the bold present the highest accuracy in the column for weak predictors, boosting models and stacking respectively. The numbers in the italic present the results of developed ensembles of machine learning models. The difference in accuracy column given the difference between model accuracy built on whole dataset and model accuracy built on selected features. The models comparison is given in Figure 9. The accuracy, specificity and sensitivity are taken into account. Therefore, the proposed ensembles show the good predictive accuracy for both classes too.  The pairwise correlation coefficients for the mentioned sub-models are given in Table 5. The pairwise correlation is calculated for ensemble quality estimation. From this table, we can see a generally low correlation for all pairs of predictions. The two methods with the highest correlation between their predictions are Logistic Regression and kNN, with a correlation coefficient of 0.517, which is not considered high (>0.75).

Discussion
Given the data obtained, it can be argued that six indicators affect the survival of kidney transplantation in the perioperative period, primarily the lack of previous hemodialysis, high leukocytosis, urea and creatinine in the blood, proteinuria, and erythrocyturia in the urine. Thus, based on these data, the first signs of possible rejection can be predicted based on five general screening laboratory tests.
The data preprocessing allows us to increase the quality of analysis. Boruta, RFE, and information gain are used for feature selection. The results of the selection are formed based on hard voting of all features selectors.
A stacking classifier for early graft rejection prediction algorithm based on combination of 6 weak classifiers was developed in this paper. Random forest algorithm is used to combine the predictions. Using this ensemble allows us to increase the defect prediction accuracy up to 92%. Therefore, the proposed algorithm shows higher accuracy comparing to weak classifiers and boosted and bagged ensembles.

Conclusions
This paper presents feature selection and kidney transplant survival for a month after transplantation prediction using Kapplan-Meier method and ensemble classifier. The set of individual parameters from the own dataset was used to predict the key risk factors associated with early graft rejection (within the first 30 days after transplantation) by changes in the main laboratory parameters of general blood, biochemical analysis of blood and urine. Our study shows a high pairwise correlation between a huge subset of the parameters listed in the dataset. Hence the proper feature selection is needed to increase the quality of a prediction model. Further research will include the development of a two-stage classifier. The first step will be a division of instances by clusters, and the second is analysis of each particular cluster.